CN117033423A - SQL generation method injecting optimal schema items and historical interaction information - Google Patents

SQL generation method injecting optimal schema items and historical interaction information

Info

Publication number
CN117033423A
CN117033423A
Authority
CN
China
Prior art keywords
natural language
sql
database
node
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311004567.1A
Other languages
Chinese (zh)
Inventor
雷思情
桑楠
李贞昊
唐思佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202311004567.1A
Publication of CN117033423A
Legal status: Pending

Classifications

    • G06F16/24522 Translation of natural language queries to structured queries
    • G06F16/2433 Query languages
    • G06F16/24526 Internal representations for queries
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an SQL generation method that injects optimal schema items and historical interaction information. It belongs to the semantic parsing field of natural language processing and addresses the difficulty the prior art has in generating the structured language SQL from complex, cross-domain Chinese natural language in real business-intelligence applications. The invention injects the schema items most strongly correlated with the Chinese natural language query into the model through a pre-screening cross encoder; it then uses the user's historical queries to simplify the user's complex query step by step; finally, it provides a complete SQL structure prediction model that generates a node sequence from the root node in depth-first order, fusing the features of the previous step, the parent node features, the context features, and the node type features, and maximizing the probability of the correct SQL statement for the natural language question. An SQL abstract syntax tree is generated and then traversed depth-first to produce the final SQL statement.

Description

SQL generation method injecting optimal schema items and historical interaction information
Technical Field
The invention belongs to the field of semantic parsing in natural language processing, and in particular relates to a technique for converting natural language into structured SQL statements.
Background
Semantic parsing is the task of mapping natural language queries into corresponding machine-executable logical forms. Converting complex natural language into SQL (Structured Query Language) statements is one branch of semantic parsing; it aims to break down the barrier between non-technical users and databases and to shorten the distance between users and data. The release of the large cross-domain datasets Spider and WikiSQL greatly advanced research on converting natural language into SQL, but these datasets are English-based. The Chinese dataset TableQA released through competition, together with DuSQL and the translated English dataset CSpider, has strongly promoted the development and deployment of SQL generation technology in Chinese scenarios.
However, current research results still cannot put the technology to real use in deployed scenarios, for the following reasons: (1) complex, spoken-form queries in Chinese natural language make it difficult for a machine to understand the user's actual query intent; (2) the databases designed for real deployed applications can be so large that their full schema information cannot be fed into the model; (3) changes in database structure degrade performance.
Early research mostly assumed a comprehensive grasp of the information contained in the database and fed all of it into the model; furthermore, in practical application scenarios users find it difficult to express their full query intent in a single sentence, which makes complex SQL statements hard to generate.
Disclosure of Invention
The invention aims to solve the problem of generating the structured language SQL from complex, cross-domain Chinese natural language in real business-intelligence applications. It provides an SQL generation method that injects optimal schema items and historical interaction information: a pre-screening cross encoder is designed to extract the optimal schema items, the user's historical interaction information is injected to simplify the generation of complex queries, and the final correct SQL statement is generated from an SQL syntax tree in depth-first order.
The technical solution adopted by the invention is an SQL generation method injecting optimal schema items and historical interaction information, comprising the following steps:
S1, a pre-screening cross encoder computes the schema items most relevant to the natural language query from the input natural language query and the corresponding database schema items;
S2, an SQL encoding module integrating historical interaction information constructs an interaction graph over the natural language and the historical queries from the optimal schema items output by the pre-screening cross encoder and the user's historical interaction information, and encodes the resulting interaction graph;
S3, the encoding result of step S2 is decoded based on the SQL structure prediction model.
The beneficial effects of the invention are as follows. The SQL generation method injecting optimal schema items and historical interaction information has these advantages:
1. When converting Chinese natural language queries into SQL in deployed real-world applications, the database may be too large to feed into the model in full. The pre-screening cross encoder first computes the database schema items (tables and columns) most relevant to the natural language query, and only the optimal schema items, the top k1 database tables and top k2 data columns, are injected into the model, ensuring a high recall rate while preventing the injection of excessive noise;
2. The SQL statements corresponding to complex natural language sentences are themselves very complex to generate, so the invention proposes that, before entering the final Chinese natural language query, the user first enters several simple, interdependent query sentences, on the basis of whose historical interaction information the final complex query is generated;
3. Relations between tokens of the input sequence are added to the Transformer-based computation, biasing the model toward these predefined relations during processing. They comprise: the relations R_S among database schema items, i.e., between database tables and columns; the linking relations R_QS between tokens of the Chinese natural language query and the data tables and data columns of the database schema; the dependency relations R_Q among the tokens of the natural language query; and the relations R_QH between tokens of the current natural language query and tokens of historical queries;
4. The method builds an SQL statement generation model for Chinese natural language that decodes a syntax tree in depth-first order. Decoding actions fall into two classes, generating grammar keywords and generating database schema items; the possible candidates are determined by the current node type and the parent node type, an LSTM computes the probability distribution, and the candidate with the highest probability is selected. This decoding scheme matches the structural characteristics of SQL statements and of structured data, avoids grammatical errors, supports generating SQL statements with complex structures including table joins and nested queries, and gives the model higher accuracy.
Drawings
FIG. 1 shows the technical steps of an implementation of the present invention;
FIG. 2 is a schematic flow diagram of the cross-lingual encoding module of the pre-screening cross encoder;
FIG. 3 is an exemplary diagram of the steps by which the pre-screening cross encoder computes database schema item probabilities;
FIG. 4 is an example of the values of k1 and k2;
FIG. 5 is a schematic illustration of the linking between a natural language query and the database schema;
FIG. 6 is a schematic diagram of co-reference relations between natural language query rounds;
FIG. 7 is a schematic flow chart of encoding the optimal schema items to obtain relation-aware encoding vectors;
FIG. 8 is a schematic diagram of the SQL decoding process based on the SQL syntax tree.
Detailed Description
The present invention is further explained below with reference to the drawings, to help those skilled in the art understand its technical content.
As shown in FIG. 1, the method of the invention comprises the following specific steps:
S1, the pre-screening cross encoder model screens out the top k1 database tables and top k2 database columns most strongly correlated with the natural language. k1 and k2 are two important hyperparameters that must be set manually: too large a setting admits many unnecessary schema items, while too small a setting may exclude important ones. To avoid the problem that a large-scale database in a real application scenario cannot be fed into the model in full, a cross encoder is designed that takes the natural language query and the database schema information as one input sequence. The pre-screening cross encoder model comprises a data preprocessing module, an encoding module, a column-enhanced table embedding module, and a loss calculation module.
The method specifically comprises the following sub-steps:
S11, the data preprocessing module preprocesses both the natural language and the database schema. User-entered Chinese natural language may contain spoken and non-normalized expressions: 2022 may be expressed as "last year", "year 22", or "twenty twenty-two", and the amount 100000 as "one hundred thousand", whereas the corresponding value is stored in the database as a normalized exact value; data of value types such as time, currency, and percentage are therefore uniformly normalized into exact values. Preprocessing of the database schema replaces database tables and columns with semantic names: the id column in the airports table, for instance, identifies the unique identifier of an airport, but the model cannot recognize its real meaning or whether it is a column related to the natural language query, so the semantic name (airport id) brings the schema item closer to its real semantics and eases the subsequent screening.
Specifically, semantic names replace the database schema items so that the real meaning of each database item can be recognized during screening and matching; the similarity between the Chinese natural language and the semantically represented database items is computed, and the schema items are classified. The data tables in the database and the concrete column names in each table are traversed; the real semantic name of a data table is restored from its comment, and that of a data column from the information of its table together with its column comment.
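As an illustration only, the value normalization of S11 can be sketched in Python as follows; the phrase table and the function name normalize_values are assumptions made for this example, not the implementation of the invention:

from datetime import date

# Toy sketch of S11 value normalization: rewrite a few spoken-form
# Chinese value expressions into the normalized exact values stored in
# the database. A real system would use a full number/date normalizer.
PHRASE_TABLE = {
    "去年": str(date.today().year - 1),  # "last year" -> e.g. "2024"
    "十万": "100000",                    # "one hundred thousand"
}

def normalize_values(query: str) -> str:
    # Longest phrases first so overlapping entries cannot clash.
    for phrase in sorted(PHRASE_TABLE, key=len, reverse=True):
        query = query.replace(phrase, PHRASE_TABLE[phrase])
    return query

print(normalize_values("去年从北京出发、预算十万以内的航班"))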
S12, the encoding module encodes the preprocessed natural language query and database schema items into vectors covering the Chinese natural language query, the historical queries, and the English database schema representation; the concatenated input is encoded with the cross-lingual pre-trained model XLM-R (Cross-lingual Language Model RoBERTa).
In detail, the N database tables of one database are denoted T = {t_1, t_2, ..., t_N}, the set of all tables, and all columns are denoted C, grouped by the table they belong to. For a database column, c_j^i denotes the vector representation of the j-th column of the i-th table t_i, and the column type τ_j^i of that column (number or character) is prepended to the column name. The unstructured natural language query, the historical queries, and the database schema represented by semantic names are then concatenated, with | as the delimiter, into a fused feature representation; this normalized fused representation X_concat serves as the input sequence of the PLM (Pre-trained Language Model). Because the input sequence mixes the Chinese natural language query with schema items expressed in English, this embodiment encodes it with the cross-lingual model XLM-R, obtaining semantic encoding vectors for the Chinese natural language query and for the semantic names of the database table and column names. The PLM splits a schema item into several tokens (for example, "source_airport" is split into "source" and "airport"), but as a condition for screening columns all information of a schema item should be fused into one encoding vector.
Therefore, this embodiment applies a BiLSTM layer to obtain an encoded representation of the hidden states of the tokenized schema items, represents each schema item as one whole embedding via pooling, aggregating its hidden states into a fixed-size representation, and adds a non-linear fully connected layer on top of the pooled representation so that the schema items can capture complex relations and patterns in the data. After this pooling, each table embedding is denoted T_i ∈ R^(1×d) and each column embedding C_j^i ∈ R^(1×d), where d is the hidden layer size. The detailed steps are shown in FIG. 2.
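A minimal PyTorch sketch of this pooling step follows; the mean pooling and the module name are assumptions standing in for the unspecified pooling operation:

import torch
import torch.nn as nn

class SchemaItemPooler(nn.Module):
    # Fuses the sub-token vectors of one schema item (a table or column
    # name) into a single embedding: BiLSTM over the PLM hidden states,
    # pooling, then a non-linear fully connected layer.
    def __init__(self, d: int):
        super().__init__()
        self.bilstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(d, d), nn.Tanh())

    def forward(self, token_vecs: torch.Tensor) -> torch.Tensor:
        # token_vecs: (num_subtokens, d) for one schema item
        hidden, _ = self.bilstm(token_vecs.unsqueeze(0))  # (1, n, d)
        pooled = hidden.mean(dim=1)                       # (1, d)
        return self.fc(pooled)                            # item embedding

# e.g. the sub-tokens of "source_airport" collapse into one (1, d) vector
emb = SchemaItemPooler(256)(torch.randn(4, 256))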
S13, the table embedding module enhanced with column information: when the natural language explicitly mentions only a column name of a data table and not the table name, the column information is injected into the corresponding table encoding, so that the right table can still be identified when the natural language involves only column-related information. Column information is injected into the table embeddings by stacking multi-head scaled dot-product attention layers, and a feature fusion layer yields the column-information-enhanced table embeddings.
Specifically, as in the example of FIG. 3, the natural language mentions only "Beijing" and not the table name ("airports") that the column belongs to, in which case the table could be filtered out. The invention therefore injects column information into the table embedding, merging the embeddings of a table's columns into the table embedding with a multi-head attention mechanism, where the table embedding T_i serves as the query matrix of shape (1×d) and the column matrix C_i serves as both the key and value matrices of shape (n_i×d), with n_i the number of columns and h the number of heads. Each head has its own randomly initialized Q/K/V weight matrices, so the model attends to different positions. The attention weight of each column is computed separately, a weighted sum gives the column-aware table embedding produced by the multi-head attention, and a normalization function then yields the column-enhanced table embedding T̂_i.
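The column-to-table injection can be sketched as follows; the residual add plus LayerNorm is an assumption standing in for the feature fusion layer, whose exact form is not given:

import torch
import torch.nn as nn

class ColumnEnhancedTable(nn.Module):
    # The table embedding T_i attends over its column embeddings C_i with
    # multi-head scaled dot-product attention; the fused result is added
    # back to T_i and normalized to give the column-enhanced embedding.
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, table: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
        # table: (1, d) is the query; cols: (n_i, d) are keys and values
        q, kv = table.unsqueeze(0), cols.unsqueeze(0)     # add batch dim
        fused, _ = self.attn(q, kv, kv)                   # (1, 1, d)
        return self.norm(table + fused.squeeze(0))        # column-aware T_i

t_hat = ColumnEnhancedTable(256)(torch.randn(1, 256), torch.randn(5, 256))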
S14, the loss calculation module. The screening can in fact be viewed as a similarity task between the natural language question and the tables, so a classification loss is ultimately computed. Because an SQL statement contains only a few of the database's tables and columns, the negative samples of the classification task outnumber the positive ones many times over; to handle this sample imbalance, the focal loss is adopted as the classification loss. The loss function of the pre-screening cross encoder is formed in a multi-task manner, as the sum of a table classification loss and a column classification loss.
In detail, two different MLP (Multi-Layer Perceptron) modules classify the enhanced table embeddings and the column embeddings respectively, computing each item's probability of being relevant to the natural language question:
P(t_i) = σ(MLP_t(T̂_i)),  P(c_j^i) = σ(MLP_c(C_j^i))
where all MLP weights are trainable parameters and σ(·) denotes softmax. In a real deployed scenario, the SQL query a user needs to write generally involves only a few tables of the database, so screening the data tables and columns produces many negative samples; to solve this sample-imbalance problem the invention adopts the focal loss. In a multi-task learning manner, the loss is computed over the data table classification and the data column classification:
L = Σ_(i=1..N) FL(P(t_i), y_i) + Σ_(i=1..N) Σ_(k=1..n_i) FL(P(c_k^i), y_k^i),  with  FL(p, y) = -y (1-p)^γ log p - (1-y) p^γ log(1-p)
where N is the number of all tables and n_i the number of columns of the i-th table; y_i is the true label of the i-th table, with y_i = 1 if the corresponding SQL statement contains this table and 0 otherwise; y_k^i is the true label of the k-th column of the i-th table, equal to 1 if the column is referenced in the corresponding SQL statement and 0 otherwise.
Finally, the relevance probabilities of the natural language to the different tables and columns are obtained, and in this embodiment the k1 tables and k2 columns with the highest probabilities are kept. k1 and k2 are two very important hyperparameters whose values must be chosen to guarantee a high recall rate while preventing the introduction of excessive noise. The re-ranked database schema items are restored to their original names, which the subsequent SQL generation modules use for linking, and the resulting optimal schema items serve as the input of those modules. The procedure is shown in FIG. 3.
For example, take the query sample "what is the flight number from Beijing?". As shown in FIG. 4, the tables and columns of the database are listed with the semantic names of the schema items in brackets. The correlations of the tables (airlines, airports, flights) with the query sample are (0.02, 0.93, 0.95); the column similarities are shown in FIG. 4, and the columns chiefly involved in the SQL statement are airports.airport_code, flights.flight_number, and flights.source_airport. To improve the screening performance, k1 and k2 are set relatively high: schema items whose relevance probability exceeds a given threshold, here 80%, are kept, and the threshold can be set according to the user's query needs. The screening in this example thus gives k1 = 2 and k2 = 4; k1 and k2 serve only to pre-screen the optimal schema items, and the schema items actually involved in the SQL statement are determined by the subsequent encoding part.
S2, the SQL encoding module integrating historical interaction information requires the user, when entering a natural language query, to first look in the historical queries for past generation records that better match the query intent, producing the final result over several rounds of interaction. The relations between the database schema items and the natural language, and between the natural language and the historical queries, are encoded, including the schema-item links and the dependencies within the natural language query; research has shown that representing these relations explicitly improves the expressiveness of the model. Once every relation is made explicit, a relation-aware encoder can encode them. The module specifically comprises four parts: a natural language query and database schema encoding module, a schema item linking module, a history interaction linking module, and a relation-aware graph encoding module. It specifically comprises the following sub-steps:
S21, the natural language query and database schema encoding module:
the module can be independently carried out, or the output result of the S1 module can be directly used as the input of the module, specifically: the pre-screened cross encoder has obtained the initial encoding vector of natural language and database mode items integrated with the optimal mode item and the history inquiry encoded by the cross-language pre-training encoder:
Q i represents the ith historical query statement, where |Q i I represents the total number of tokens in the ith historical natural language query,represents the j-th word vector representation, |k, in the i-th historical query natural language query 1 I represents the top k of the pre-screening in database schema 1 Watch(s)>An initial vector representation representing an ith table in the database, for a database column +.>Vector representation representing the jth column in the ith table, adding the above-listed type ++before column name>The type of the j-th column of the i-th table is shown as either a number or a character. All->Is the sum k of the numbers of all columns 2
S22, the links based on the pattern items are mainly aimed at the table, the column and the link alignment between the natural language query and the table of the database, wherein the relation is table, column, table and natural language subword, column and natural language subword and column. Representing relationships between words in a natural language query asThe relationship between natural language queries and database schema (including database columns and tables and database values) is expressed as +.>The relation between the database schema items such as foreign key information and the like is expressed as +.>The natural language and the database schema are processed as nodes, the links or relations between the natural language query and the database schema items are edges, then the whole input constitutes an interaction graph, because +.>The dependency between the syntax contains several relations, such as up to 55 syntactical dependencies contained in the SpaCy toolkit, all of which are injected into the inter-composition, possibly resulting in overfitting, thus abstracting the dependency between the syntax into three relations, forward, backward, none, respectively representing dependenciesDepending on the word preceding the current word, depending on the word following the current word, and no dependencies, such as "see, this lovely cat", "cat" forward depends on "lovely", "this" backward depends on "kitten".
Further preferably, let vertex A and vertex B be any two different vertices of the directed graph.
R_S denotes the relations among database schema items, i.e., among database tables and columns (table-table, table-column, and column-column relations, such as foreign keys).
TABLE 1: relations contained in R_S
Further, for the linking relations R_QS between the natural language query and the database schema: as shown in FIG. 5, "Beijing" and "flight number" in the natural language are linked with the city column of the airports table and the flightno column of the flights table respectively. An important value-linking step is also performed here when the database schema items and the tokens of the natural language query are correctly linked: "Beijing" is a value in the city column, and the values mentioned in the question are aligned with the schema. In R_QS, a complete match between a token and a database schema item is denoted EM (exact match) and a partial match PM (partial match).
TABLE 2: relations contained in R_QS
Little prior work analyzes the syntactic dependencies of the query question. A pre-trained model can handle dependencies to some extent, but experiments show that labeling them explicitly clearly improves the SQL generation effect. R_Q denotes the dependencies between the tokens of the natural language query; to prevent the total number of relations from causing overfitting, the relations are abstracted: when vertices A and B are both natural language tokens, Syntax-F means A has a forward syntactic dependency on B, Syntax-R means A has a reverse syntactic dependency on B, and Syntax-None means A and B have no syntactic dependency.
TABLE 3: relations contained in R_Q
Finally, R_QH denotes the relations between the tokens of the current natural language query and the tokens of the historical queries. If the user's natural language query joins several tables, it is difficult, in a practical application scenario, for the user to describe the real query intent accurately in one shot; the user is therefore expected to describe the requirement by building on the history of queries already issued, stacking them to generate more complex queries, which lets the model see the user's real intent more clearly. For example, the user wants to query "which flights go from Beijing to Shanghai with Air China as the airline"; the multi-round process of generating the complex query from the historical queries is shown in FIG. 6.
The relations contained in R_QH are: with vertex A a token of the current natural language query and vertex B a token of a historical query, Coref_Relation indicates that B and A co-refer, and Co_Relation indicates that the co-reference is regarded as one whole relation.
TABLE 4: relations contained in R_QH
S23, setting the relation features. With the above series of relations defined, the model must be made to perceive these predefined relations, so the encoder initializes every edge of the preliminary heterogeneous graph built from the nodes and edges as a vector representation. The key relation r_ij^K and the value relation r_ij^V are set by concatenating the features of every edge across the different graphs:
r_ij^K = r_ij^V = concat(m_ij^1 p_ij^1, ..., m_ij^R p_ij^R)
where i and j denote the i-th and j-th nodes, concat(·) denotes the concatenation operation, p_ij^r is a trainable edge vector representation of the r-th relation, and m_ij^r is a variable used for graph pruning: according to the input natural language query, every edge of the heterogeneous graph is binarily classified to record whether it needs to be kept; if not, m_ij sets the vector representation of the edge to an all-zero vector of the same dimension. R is the total number of relation edges.
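In implementation terms, the interaction graph reduces to a matrix of relation ids over all node pairs. The following Python sketch illustrates this; the relation names follow Tables 1 to 4 as described above, while the enum layout and the helper name are assumptions:

from enum import IntEnum, auto

class Rel(IntEnum):
    # One id per relation between an ordered node pair (A, B).
    NONE = 0
    FOREIGN_KEY = auto()       # R_S: column -> column foreign key
    TABLE_HAS_COLUMN = auto()  # R_S: table <-> one of its columns
    EXACT_MATCH = auto()       # R_QS: token == schema name (EM)
    PARTIAL_MATCH = auto()     # R_QS: partial overlap (PM)
    VALUE_MATCH = auto()       # R_QS: token matches a cell value
    SYNTAX_F = auto()          # R_Q: forward syntactic dependency
    SYNTAX_R = auto()          # R_Q: reverse syntactic dependency
    COREF = auto()             # R_QH: co-reference with a history token

def build_relation_matrix(nodes, rel_fn):
    # Dense |V| x |V| matrix of relation ids for the heterogeneous graph;
    # rel_fn decides the relation between any ordered node pair.
    n = len(nodes)
    return [[rel_fn(nodes[i], nodes[j]) for j in range(n)] for i in range(n)]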
S24, the constructed heterogeneous graph is encoded according to RGAT (Relation-aware Graph Attention Transformer), with the following formulas:
e_ij^(h) = x_i W_Q^(h) (x_j W_K^(h) + r_ij^K)^T / sqrt(d/H)
α_ij^(h) = softmax_j(e_ij^(h))
z_i^(h) = Σ_j α_ij^(h) (x_j W_V^(h) + r_ij^V)
z_i = concat(z_i^(1), ..., z_i^(H)) W_O
RGAT is stacked from several layers of multi-head self-attention; each layer contains H heads in total, and each head attends to different positions in the vectors through its own randomly initialized W_Q, W_K, W_V. The predefined relations are added on top of the original self-attention mechanism, so the new key matrix becomes x_j W_K + r_ij^K and the value matrix x_j W_V + r_ij^V, each obtained by linear transformation. Here W_Q, W_K, W_V are randomly initialized coefficient matrices set in each head; x_i and x_j are the i-th and j-th tokens of the input sequence X_init; e_ij^(h) is the computed attention score of the i-th token for the j-th token and α_ij^(h) the corresponding attention weight; z_i^(h) is the attention result of the h-th attention head, and the values of the H independent heads are concatenated and multiplied by the parameter matrix W_O to obtain the final multi-head attention value z_i.
A first add-and-normalize unit then adds the multi-head attention value of the i-th token to the composite semantic encoding vector of the i-th token and normalizes, giving ỹ_i, which is fed into a feed-forward neural network: ỹ_i passes through a linear layer, the result is processed by a ReLU activation function, and the output of a further linear layer is added back to ỹ_i by a second add-and-normalize unit, specifically:
ỹ_i = LayerNorm(x_i + z_i),  y_i = LayerNorm(ỹ_i + W_2 ReLU(W_1 ỹ_i))
Each token is processed in this way to produce y_i, and the y_i vectors of all tokens form the matrix of the final relation encoding X_encode. The overall process is shown in FIG. 7.
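One head of this relation-aware attention might be sketched as follows, consistent with the formulas above; the tensor shapes and the lookup of r^K and r^V from a relation-id matrix are assumptions:

import math
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    # Single head: relation embeddings r^K, r^V, looked up from the
    # relation-id matrix of the interaction graph, are added to the keys
    # and values before the scaled dot-product attention.
    def __init__(self, d: int, num_relations: int):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d) for _ in range(3))
        self.rel_k = nn.Embedding(num_relations, d)
        self.rel_v = nn.Embedding(num_relations, d)
        self.d = d

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x: (n, d) node features; rel_ids: (n, n) relation-id matrix
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        rk, rv = self.rel_k(rel_ids), self.rel_v(rel_ids)  # (n, n, d)
        # e_ij = q_i . (k_j + r^K_ij) / sqrt(d)
        e = (q.unsqueeze(1) * (k.unsqueeze(0) + rk)).sum(-1) / math.sqrt(self.d)
        a = torch.softmax(e, dim=-1)
        # z_i = sum_j a_ij (v_j + r^V_ij)
        return (a.unsqueeze(-1) * (v.unsqueeze(0) + rv)).sum(dim=1)

z = RelationAwareAttention(64, 10)(torch.randn(7, 64), torch.randint(0, 10, (7, 7)))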
S3, SQL decoding module
SQL decoding performs depth-first, syntax-tree-based SQL decoding on the relation encoding tensor X_encode, giving a complete SQL structure prediction model. Starting from the root node, a node sequence is generated in depth-first order according to the SQL grammar rules, fusing the features of the previous step, the parent node features, the context features, and the node type features, and maximizing the probability of the correct SQL statement corresponding to the natural language question. First the SQL abstract syntax tree is generated, then a series of decoder actions are output through an LSTM network; the update of the tree-structured decoder's LSTM state is:
m_t, h_t = LSTM([a_(t-1); a_(p_t); h_(p_t); z_t; n_t], m_(t-1), h_(t-1))
where m_t is the LSTM cell state and h_t the output of the LSTM at step t; h_(p_t) is the output at the parent node of the current node; a_(t-1) is the embedded representation of the previous action (the actions being APPLYRULE, SELECTTABLE, and SELECTCOLUMN); a_(p_t) represents the action of the parent node of the current node; z_t is the context representation; p_t is the syntax-tree parent node of the current node; and n_t is the embedded representation of the current node type. The goal of the model is to maximize the probability of predicting the correct SQL statement, i.e. to maximize Π_t Prob(a_t | a_pre, X_encode), where Prob denotes the conditional probability of a_t given a_pre and X_encode, and a_pre represents all previously predicted actions.
The decoding comprises the following sub-steps:
S31, expanding a non-leaf node: when the generated node is a non-leaf node, it is expanded by a grammar rule; this action, APPLYRULE, applies a grammar rule with probability
Prob(a_t = APPLYRULE[R] | a_pre, X_encode) = softmax(g(h_t))
where g(·) is a 2-layer MLP with Tanh as the activation function and softmax(·) is the normalized exponential function;
S32, generating leaf nodes: when the generated node is a leaf node, a table name or a column name is selected from the database schema, the SELECTTABLE and SELECTCOLUMN actions respectively, both computed by the same formula:
Prob(a_t = SELECTTABLE[t_j] | a_pre, X_encode) = softmax_j(λ_j),  λ_j = h_t^T W_s t_j
where λ_j is the selection weight and t_j is the embedded representation of the j-th table (for SELECTCOLUMN the column embedding takes its place).
For example, the following SQL statement is generated:
SELECT COUNT(*)
FROM students
WHERE name....
The generation process is shown in FIG. 8: nodes are generated in depth-first order starting from the root node of the syntax tree, where students is the name of the data table and name is a column of that table. Generating an SQL grammar keyword or the terminator [END] is the APPLYRULE action; the actions generating a data table name and a data column name are called SELECTTABLE and SELECTCOLUMN respectively. The possible types of the successor nodes of each node are determined by the SQL grammar. When all branches end with [END], the SQL decoding process is complete; the syntax tree is then traversed in depth-first order to obtain the unique SQL statement, which is handed to the database execution engine to return the SQL query result.
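As a toy illustration of this depth-first expansion, the following sketch hard-codes a grammar that covers only the example statement and replaces the learned action scoring with a lookup, so that only the control flow of APPLYRULE, SELECTTABLE, and SELECTCOLUMN is shown:

# Grammar keywords are terminals (APPLYRULE); "table" and "column" are
# leaves filled by SELECTTABLE / SELECTCOLUMN; "..." is the elided value.
GRAMMAR = {
    "root": ["SELECT", "agg", "FROM", "table", "WHERE", "cond"],
    "agg":  ["COUNT(*)"],
    "cond": ["column", "...", "[END]"],
}

def decode(node: str, pick) -> list[str]:
    # Expand nodes depth-first; leaves select a table or column name.
    if node == "table":
        return [pick("SELECTTABLE")]        # e.g. "students"
    if node == "column":
        return [pick("SELECTCOLUMN")]       # e.g. "name"
    if node not in GRAMMAR:                 # grammar keyword or [END]
        return [node]
    out = []
    for child in GRAMMAR[node]:             # depth-first order
        out.extend(decode(child, pick))
    return out

tokens = decode("root", {"SELECTTABLE": "students", "SELECTCOLUMN": "name"}.get)
print(" ".join(tokens))  # SELECT COUNT(*) FROM students WHERE name ... [END]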
Those of ordinary skill in the art will recognize that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to these specific statements and embodiments. Various modifications and variations will be apparent to those skilled in the art; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included in the scope of the claims of the invention.

Claims (8)

1. An SQL generation method injecting optimal schema items and historical interaction information, characterized by comprising the following steps:
S1, a pre-screening cross encoder computes the schema items most relevant to the natural language query from the input natural language query and the corresponding database schema items;
S2, an SQL encoding module integrating historical interaction information constructs an interaction graph over the natural language and the historical queries from the optimal schema items output by the pre-screening cross encoder and the user's historical interaction information, and encodes the resulting interaction graph;
S3, the encoding result of step S2 is decoded based on the SQL structure prediction model.
2. The SQL generation method injecting optimal schema items and historical interaction information according to claim 1, characterized in that step S1 specifically comprises:
S11, normalizing the numerical items in the input natural language query and replacing the database schema items with semantic names;
S12, concatenating the natural language query processed in step S11 with the database schema items and feeding the result into a cross-lingual pre-trained model to obtain the encoded representations of the hidden states of the tokenized schema items;
pooling the encoded representations of the hidden states of the tokenized schema items with a BiLSTM, representing each schema item as one whole embedding, and feeding the pooled result into a non-linear fully connected layer to obtain the table embeddings and the column embeddings;
S13, injecting the column embedding information into the table embeddings through stacked multi-head scaled dot-product attention layers and adding a feature fusion layer to obtain the column-embedding-information-enhanced table embeddings;
S14, classifying the column embeddings and the enhanced table embeddings with two different multi-layer perceptron modules and computing the probabilities that the natural language is relevant to the different columns and tables; from the relevance probabilities of the natural language to the different columns and tables, selecting the k1 tables and k2 columns with the highest probabilities to obtain the optimal schema items.
3. The SQL generation method injecting optimal schema items and historical interaction information according to claim 2, characterized in that replacing the database schema items with semantic names in step S11 specifically comprises: traversing the data tables in the database and the concrete column names in each table, restoring the real semantic name of a data table from its comment, and restoring the real semantic name of a data column from the information of its data table together with its column comment.
4. The SQL generation method injecting optimal schema items and historical interaction information according to claim 3, characterized in that step S2 specifically comprises:
S21, obtaining the initial encoding vectors of the natural language and the database schema items from the optimal schema items and the historical queries;
S22, treating the natural language tokens and the database schema items as nodes and the links or relations between the natural language query and the database schema items as edges, to construct a preliminary interaction graph;
S23, setting the key relation r_ij^K and the value relation r_ij^V by concatenating all edge features in the interaction graph;
S24, encoding the heterogeneous graph processed in step S23 with a relation-aware graph attention Transformer to obtain the final relation encoding X_encode.
5. The SQL generation method injecting optimal schema items and historical interaction information according to claim 4, characterized in that the links or relations between the natural language query and the database schema items in step S22 comprise: the relations between the tokens of the natural language query, denoted R_Q; the relations between the natural language query and the database schema, denoted R_QS; and the relations among the database schema items, denoted R_S.
6. The SQL generation method injecting optimal schema items and historical interaction information according to claim 5, characterized in that the dependencies in R_Q comprise: depending on a token before the current token, depending on a token after the current token, and no dependency.
7. The SQL generation method injecting optimal schema items and historical interaction information according to claim 5, characterized in that step S3 specifically comprises the following sub-steps:
S31, generating the node sequence from the root node in depth-first order; if the generated node is a non-leaf node, executing step S32, and if the generated node is a leaf node, executing step S33;
S32, expanding the non-leaf node into a grammar rule:
Prob(a_t = APPLYRULE[R] | a_pre, X_encode) = softmax(g(h_t))
where g(·) is a 2-layer multi-layer perceptron module with Tanh as the activation function, APPLYRULE denotes applying a grammar rule, a_pre represents all previously predicted actions, h_t is the output of the LSTM at step t, and a_t is the action predicted at step t;
S33, selecting a table name or a column name from the database schema, the SELECTTABLE and SELECTCOLUMN actions respectively, both computed by the same formula: Prob(a_t = SELECTTABLE[t_j] | a_pre, X_encode) = softmax_j(λ_j), where λ_j = h_t^T W_s t_j is the selection weight and t_j is the embedded representation of the j-th table;
S34, when all branches end with [END], the SQL abstract syntax tree generated through the LSTM network is complete; the SQL abstract syntax tree is then traversed depth-first to yield the SQL statement.
8. The SQL generation method injecting optimal schema items and historical interaction information according to claim 7, characterized in that the formula of the LSTM network update state is:
m_t, h_t = LSTM([a_(t-1); a_(p_t); h_(p_t); z_t; n_t], m_(t-1), h_(t-1))
where m_t is the LSTM cell state, h_t is the output of the LSTM at step t, h_(p_t) is the output at the parent node of the current node, a_(t-1) is the embedded representation of the previous action, a_(p_t) represents the action of the parent node of the current node, z_t is the context representation, p_t is the syntax-tree parent node of the current node, and n_t is the embedded representation of the current node type.
CN202311004567.1A 2023-08-10 2023-08-10 SQL generation method injecting optimal schema items and historical interaction information Pending CN117033423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311004567.1A CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311004567.1A CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information

Publications (1)

Publication Number Publication Date
CN117033423A true CN117033423A (en) 2023-11-10

Family

ID=88631235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311004567.1A Pending CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information

Country Status (1)

Country Link
CN (1) CN117033423A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591543A (en) * 2024-01-19 2024-02-23 成都工业学院 SQL sentence generation method and device for Chinese natural language
CN117591543B (en) * 2024-01-19 2024-04-02 成都工业学院 SQL sentence generation method and device for Chinese natural language


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination