CN112925794B - Complex multi-table SQL generation method and device based on bridging filling - Google Patents

Complex multi-table SQL generation method and device based on bridging filling

Info

Publication number
CN112925794B
Authority
CN
China
Prior art keywords
sql
module
natural language
template
filling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110362073.5A
Other languages
Chinese (zh)
Other versions
CN112925794A
Inventor
谭真
张啸宇
赵翔
王俞涵
黄旭倩
廖劲智
肖卫东
唐九阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110362073.5A priority Critical patent/CN112925794B/en
Publication of CN112925794A publication Critical patent/CN112925794A/en
Application granted granted Critical
Publication of CN112925794B publication Critical patent/CN112925794B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/2282 — Tablespace storage structures; Management thereof
    • G06F16/2433 — Query languages
    • G06F16/3329 — Natural language query formulation or dialogue systems
    • G06F16/3344 — Query execution using natural language analysis
    • G06F40/126 — Character encoding
    • G06F40/186 — Templates
    • G06F40/205 — Parsing
    • G06F40/30 — Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a device for generating complex multi-table SQL based on bridging filling. A two-layer decoding architecture is adopted, comprising three parts: a semantic coding layer, an SQL template generation layer and an SQL detail filling layer. The SQL template generation layer is the first decoding layer and the SQL detail filling layer is the second decoding layer. The bridging-filling SQL generation model uses sequence generation only in the first decoding layer; because the SQL template is short, computational efficiency is greatly improved and resource consumption is markedly reduced compared with a single sequence generation model.

Description

Complex multi-table SQL generation method and device based on bridging filling
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for generating complex multi-table SQL based on bridging filling.
Background
In practical SQL parsing application scenarios, the natural-language text input by the user generally concerns a single table, and for a single table the SQL statement can be generated in a full-matching manner.
Compared with the single-table SQL parsing task, the multi-table SQL parsing task has a distinct characteristic: a database may contain multiple data tables, and primary/foreign-key relationships may exist among them. Its complexity is higher in two main respects. On one hand, as the number of data tables grows, the number of fields grows with it, and the input length may exceed the maximum length allowed by the pre-training model. On the other hand, the SQL patterns in the multi-table task are more complex, adding components (Having, Order by, Group by), set operations (Union, Except, Intersect), nested SQL, and other content. At present, there is no technical scheme for automatically generating SQL statements over multiple tables.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a method, an apparatus, a computer device and a storage medium for generating complex multi-table SQL based on bridging filling.
A complex multi-table SQL generation method based on bridging filling, the method comprising:
inputting the natural language table sample into a multi-table SQL analysis model; the multi-table SQL analysis model comprises: a semantic coding module, an SQL template generating module and an SQL detail filling module; the natural language table sample includes: natural language questions, database table names, and database table fields;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes; wherein the field sequence code, the natural language question sequence code and the table name field sequence code form integral code information; the field sequence code and the table name field sequence code are connected through a serial connection character to form an enhanced sequence code;
inputting the whole coding information into the SQL template generating module to generate filling fields corresponding to SQL predefined templates in the SQL template generating module; the SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is formed by predefined SQL statement components;
inputting the filling field into the SQL detail filling module to fill the SQL predefined template to obtain a predicted multi-table SQL statement;
training the multi-table SQL analysis model according to the predicted multi-table SQL statement and a preset loss function to obtain a trained multi-table SQL analysis model;
and inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
In one embodiment, the method further comprises: obtaining the initial sequence of the natural language table sample as:
[XLS], q_1, q_2, ..., q_L, [SEP], t_{11}, t_{12}, ..., [CAT], c_{111}, c_{112}, ..., [SEP], ..., [SEP]
where [XLS] denotes the initial marker, [SEP] denotes a separator, and [CAT] denotes a concatenation symbol; q_1, q_2, ..., q_L is the natural language question sequence, t_{i1}, t_{i2}, ..., [CAT], c_{ij1}, c_{ij2}, ... is the enhanced sequence of the j-th field in the i-th data table, L is the length of the natural language question, and q_t is the t-th token in the question sequence;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes as follows:
h_{[XLS]}, h_{q1}, h_{q2}, ..., h_{qL}, h_{[SEP]}, h_{t11}, h_{t12}, ..., h_{[CAT]}, h_{c111}, h_{c112}, ..., h_{[SEP]}, ..., h_{[SEP]}
where h_{[XLS]} represents the overall coding information, h_{[SEP]} is the code of [SEP], h_{[CAT]} is the code of [CAT], h_{qt} is the code of q_t, and h_{ti1}, h_{ti2}, ..., h_{[CAT]}, h_{cij1}, h_{cij2}, ... are the codes of t_{i1}, t_{i2}, ..., [CAT], c_{ij1}, c_{ij2}, ...
In one embodiment, the method further comprises: inputting the whole coding information into the SQL template generating module, and generating the filling fields corresponding to the SQL predefined template using the LSTM unit calculation formulas as follows:
f_t = σ(W_f · [x_t, h_{t-1}] + b_f)
i_t = σ(W_i · [x_t, h_{t-1}] + b_i)
c̃_t = tanh(W_c · [x_t, h_{t-1}] + b_c)
c_t = f_t * c_{t-1} + i_t * c̃_t
o_t = σ(W_o · [x_t, h_{t-1}] + b_o)
h_t = o_t * tanh(c_t)
y_t = softmax(W_t h_t)  (output set size m > 2)
y_t = sigmoid(W_t h_t)  (output set size m = 2)
where W_t is a learnable parameter, W_t ∈ R^{m×d}, and W_t differs for different types of SQL statement components; when the output set size m is greater than 2, softmax is used as the activation function, with m the size of the output set; when the output set size is 2, sigmoid is used as the activation function. h_{[XLS]} is the overall coding information, and x_0 = h_{[XLS]}.
In one embodiment, the types of the SQL predefined template include: non-nested SQL, set operation SQL, FROM nested SQL, and VALUE nested SQL.
In one embodiment, the SQL statement component comprises: SELECT, WHERE, HAVING, ORDER BY, GROUP BY, LIMIT, and FROM.
In one embodiment, the method further comprises: inputting the filling fields into the SQL detail filling module to perform field selection, operation judgment and extraction on the SQL predefined template to obtain a predicted multi-table SQL statement.
In one embodiment, the method further comprises: obtaining the loss function of the SQL template generation module and the loss function of the SQL detail filling module, and fusing the two loss functions to obtain the overall loss function.
A complex multi-table SQL generation apparatus based on bridging filling, the apparatus comprising:
the input module is used for inputting the natural language table sample into the multi-table SQL analysis model; the multi-table SQL analysis model comprises: a semantic coding module, an SQL template generating module and an SQL detail filling module; the natural language table sample includes: natural language questions, database table names, and database table fields;
the coding module is used for analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence coding, natural language problem sequence coding and table name field sequence coding; wherein the field sequence code, the natural language question sequence code and the table name field sequence code form integral code information; the field sequence code and the table name field sequence code are connected through a serial connection character to form an enhanced sequence code;
the generating module is used for inputting the whole coding information into the SQL template generating module and generating filling fields corresponding to the SQL predefined template in the SQL template generating module; the SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is formed by predefined SQL statement components; inputting the filling field into the SQL detail filling module to fill the SQL predefined template to obtain a predicted multi-table SQL statement; training the multi-table SQL analytical model according to the predicted multi-table SQL statement and a preset loss function to obtain a trained multi-table SQL analytical model; and inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
inputting the natural language table sample into a multi-table SQL analysis model; the multi-table SQL analysis model comprises: a semantic coding module, an SQL template generating module and an SQL detail filling module; the natural language table sample includes: natural language questions, database table names, and database table fields;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes; wherein the field sequence code, the natural language question sequence code and the table name field sequence code constitute overall coded information; the field sequence code and the table name field sequence code are connected through a concatenation character to form an enhanced sequence code;
inputting the whole coding information into the SQL template generating module to generate filling fields corresponding to SQL predefined templates in the SQL template generating module; the SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is formed by predefined SQL statement components;
inputting the filling field into the SQL detail filling module to fill the SQL predefined template to obtain a predicted multi-table SQL statement;
training the multi-table SQL analysis model according to the predicted multi-table SQL statement and a preset loss function to obtain a trained multi-table SQL analysis model;
and inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
inputting the natural language table sample into a multi-table SQL analysis model; the multi-table SQL analysis model comprises: a semantic coding module, an SQL template generating module and an SQL detail filling module; the natural language table sample includes: natural language questions, database table names, and database table fields;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes; wherein the field sequence code, the natural language question sequence code and the table name field sequence code form integral code information; the field sequence code and the table name field sequence code are connected through a serial connection character to form an enhanced sequence code;
inputting the whole coding information into the SQL template generating module to generate filling fields corresponding to SQL predefined templates in the SQL template generating module; the SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is formed by predefined SQL statement components;
inputting the filling field into the SQL detail filling module to fill the SQL predefined template to obtain a predicted multi-table SQL statement;
training the multi-table SQL analysis model according to the predicted multi-table SQL statement and a preset loss function to obtain a trained multi-table SQL analysis model;
and inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
The complex multi-table SQL generation method and device based on bridging filling, the computer device and the storage medium adopt a two-layer decoding architecture comprising three parts: a semantic coding layer, an SQL template generation layer and an SQL detail filling layer. The SQL template generation layer is the first decoding layer; it decodes the SQL template with an end-to-end sequence generation technique. Because the SQL statement is order-independent, the SQL template design considers only the type information of each component and does not involve details such as fields and operations. The SQL detail filling layer is the second decoding layer; it predicts and fills the detail parts of the SQL statement with a template filling technique. Since a large number of sub-models would cause large accumulated errors, the SQL detail filling layer uses the output of the SQL template generation layer as additional input to distinguish different areas in the SQL statement, so as to cope with complex generation cases such as nested SQL. In addition, the bridging-filling SQL generation model uses sequence generation only in the first decoding layer; because the SQL template is short, computational efficiency is greatly improved and resource consumption is markedly reduced compared with a single sequence generation model.
Drawings
FIG. 1 is a flow diagram of the complex multi-table SQL generation method based on bridging filling in one embodiment;
FIG. 2 is a block diagram of the structure of the complex multi-table SQL generation device based on bridging filling in one embodiment;
FIG. 3 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a complex multi-table SQL generation method based on bridging filling is provided, comprising the following steps:
step 102, inputting the natural language table sample into a multi-table SQL analysis model.
The multi-table SQL analysis model comprises: a semantic coding module, an SQL template generating module and an SQL detail filling module. The natural language table sample includes: natural language questions, database table names, and database table fields.
Specifically, the semantic coding module may be a BERT model.
And 104, analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes.
The field sequence code, the natural language question sequence code and the table name field sequence code form integral code information; the field sequence code and the table name field sequence code are connected by a concatenation character to form an enhanced sequence code.
Here, encoding refers to vectorization, that is, converting the natural text into vector form.
And 106, inputting the whole coding information into the SQL template generating module, and generating a filling field corresponding to the SQL predefined template in the SQL template generating module.
The SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is composed of predefined SQL statement components.
And 108, inputting the filling fields into the SQL detail filling module to fill the SQL predefined template to obtain the predicted multi-table SQL statement.
And step 110, training the multi-table SQL analysis model according to the predicted multi-table SQL statement and a preset loss function to obtain the trained multi-table SQL analysis model.
And step 112, inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
The complex multi-table SQL generation method based on bridging filling adopts a two-layer decoding architecture comprising three parts: a semantic coding layer, an SQL template generation layer and an SQL detail filling layer. The SQL template generation layer is the first decoding layer; it decodes the SQL template with an end-to-end sequence generation technique. Because the SQL statement is order-independent, the SQL template design considers only the type information of each component and does not involve details such as fields and operations. The SQL detail filling layer is the second decoding layer; it predicts and fills the detail parts of the SQL statement with a template filling technique. Since a large number of sub-models would cause large accumulated errors, the SQL detail filling layer uses the output of the SQL template generation layer as additional input to distinguish different areas in the SQL statement, so as to cope with complex generation cases such as nested SQL. In addition, the bridging-filling SQL generation model uses sequence generation only in the first decoding layer; because the SQL template is short, computational efficiency is greatly improved and resource consumption is markedly reduced compared with a single sequence generation model.
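The two-stage decode described above can be sketched as a minimal pipeline. The function and argument names below are illustrative placeholders, not an API defined by this application:

```python
def generate_sql(question, schema, encoder, template_decoder, detail_filler):
    # Semantic coding layer: encode question + schema into vector codes.
    codes = encoder(question, schema)
    # First decoding layer: generate the SQL template (component types only).
    template = template_decoder(codes["xls"])
    # Second decoding layer: fill fields/operations into the template slots.
    return detail_filler(template, codes)
```

In use, `encoder`, `template_decoder` and `detail_filler` would wrap the trained semantic coding, template generation and detail filling modules respectively; sequence generation happens only inside `template_decoder`.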
In one embodiment, the initial sequence of the natural language table sample is obtained as:
[XLS], q_1, q_2, ..., q_L, [SEP], t_{11}, t_{12}, ..., [CAT], c_{111}, c_{112}, ..., [SEP], ..., [SEP]
where [XLS] denotes the initial marker, [SEP] denotes a separator, and [CAT] denotes a concatenation symbol; q_1, q_2, ..., q_L is the natural language question sequence, t_{i1}, t_{i2}, ..., [CAT], c_{ij1}, c_{ij2}, ... is the enhanced sequence of the j-th field in the i-th data table, L is the length of the natural language question, and q_t is the t-th token in the question sequence;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes as follows:
h_{[XLS]}, h_{q1}, h_{q2}, ..., h_{qL}, h_{[SEP]}, h_{t11}, h_{t12}, ..., h_{[CAT]}, h_{c111}, h_{c112}, ..., h_{[SEP]}, ..., h_{[SEP]}
where h_{[XLS]} represents the overall coding information, h_{[SEP]} is the code of [SEP], h_{[CAT]} is the code of [CAT], h_{qt} is the code of q_t, and h_{ti1}, h_{ti2}, ..., h_{[CAT]}, h_{cij1}, h_{cij2}, ... are the codes of t_{i1}, t_{i2}, ..., [CAT], c_{ij1}, c_{ij2}, ...
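The input-sequence layout above can be assembled mechanically. A minimal sketch, where the function name and argument layout are illustrative (one `(table-name tokens, field-name tokens)` pair per field):

```python
def build_input_sequence(question_tokens, enhanced_pairs):
    """Assemble [XLS] q1..qL ([SEP] table-tokens [CAT] field-tokens)* [SEP]."""
    seq = ["[XLS]"] + list(question_tokens)
    for table_tokens, field_tokens in enhanced_pairs:
        # enhanced sequence: table-name tokens joined to field-name tokens by [CAT]
        seq += ["[SEP]"] + list(table_tokens) + ["[CAT]"] + list(field_tokens)
    return seq + ["[SEP]"]
```

The semantic coding module then produces one code vector per position of this sequence, with the [XLS] position carrying the overall coding information.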
In the above embodiment: in the original BERT pre-training model, the input sequence uses [CLS] as the initial marker representing the overall information of the input sequence, and the initial code of the input sequence is obtained by pre-training on an external corpus. However, the invention is a multi-task joint learning model; using the [CLS] coding vector to represent the overall information of the input sequence causes the model to fall into a local optimum too early, so that some subtasks cannot be sufficiently trained, which in turn harms the overall SQL parsing effect. Based on these considerations, the invention replaces [CLS] with [XLS], which is randomly initialized without external pre-training.
It should be noted that invalid table fields in the data tables may also be filtered out by field ranking, reducing the input sequence length so as to meet the length constraint allowed by the pre-training model.
The specific operation is as follows:
based on the above coding, the table name and the field name are represented by using a mean pooling method, and the field sorting score is calculated as follows:
table_i = (1 / (end_{ti} − start_{ti} + 1)) Σ_{k=start_{ti}}^{end_{ti}} h_k
column_{ij} = (1 / (end_{cij} − start_{cij} + 1)) Σ_{k=start_{cij}}^{end_{cij}} h_k
s_{ij} = v · tanh(W_t · table_i + W_c · column_{ij})
score_{ij} = sigmoid(s_{ij})
where start_{ti} and end_{ti} denote the head and tail index positions of the i-th table name in the input sequence, and table_i is the representation vector of the i-th table name; start_{cij} and end_{cij} denote the head and tail index positions of the j-th field of the i-th data table in the input sequence, column_{ij} is the representation vector of that field, and score_{ij} is its ranking score.
In the testing stage, the model sorts the table fields according to their field ranking scores; then, according to the total length of the input sequence and the field ranking, the table fields are truncated and the lower-ranked table fields are discarded. The retained set of table fields changes dynamically for different natural language questions.
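A toy sketch of this rank-and-truncate step. For brevity the weight matrices W_t and W_c are replaced by assumed scalar weights `wt` and `wc` (i.e. a 1-dimensional illustration, not the trained model):

```python
import math

def mean_pool(token_vectors):
    # mean over the token code vectors of a table-name or field-name span
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[k] for v in token_vectors) / n for k in range(dim)]

def ranking_score(table_vec, column_vec, v, wt, wc):
    # score = sigmoid(v . tanh(wt*table + wc*column)); scalar weights for brevity
    s = sum(vv * math.tanh(wt * t + wc * c)
            for vv, t, c in zip(v, table_vec, column_vec))
    return 1.0 / (1.0 + math.exp(-s))

def truncate_fields(fields, scores, budget):
    # keep the `budget` highest-scoring fields, preserving original order
    keep = set(sorted(range(len(fields)), key=lambda i: -scores[i])[:budget])
    return [f for i, f in enumerate(fields) if i in keep]
```

The `budget` would be derived from the maximum input length allowed by the pre-training model, so the kept field set varies per question.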
In one embodiment, the whole encoding information is input into the SQL template generating module, and the calculation formula of the LSTM unit is used to generate the filling fields corresponding to the SQL predefined template in the SQL template generating module as follows:
f_t = σ(W_f · [x_t, h_{t-1}] + b_f)
i_t = σ(W_i · [x_t, h_{t-1}] + b_i)
c̃_t = tanh(W_c · [x_t, h_{t-1}] + b_c)
c_t = f_t * c_{t-1} + i_t * c̃_t
o_t = σ(W_o · [x_t, h_{t-1}] + b_o)
h_t = o_t * tanh(c_t)
y_t = softmax(W_t h_t)  (output set size m > 2)
y_t = sigmoid(W_t h_t)  (output set size m = 2)
where W_t is a learnable parameter, W_t ∈ R^{m×d}, and W_t differs for different types of SQL statement components; when the output set size m is greater than 2, softmax is used as the activation function, with m the size of the output set; when the output set size is 2, sigmoid is used as the activation function. h_{[XLS]} is the overall coding information, and x_0 = h_{[XLS]}.
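A minimal sketch of applying the step-specific decoding head W_t to the LSTM state h_t, with the softmax/sigmoid choice driven by the output-set size as described above (pure-Python lists; shapes and values are illustrative):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decode_step(h_t, W_t):
    """Apply the time-step-specific head W_t (m rows of length d) to h_t.

    softmax when the output-set size m > 2, sigmoid when m == 2;
    returns the index of the chosen output.
    """
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in W_t]
    if len(W_t) > 2:
        probs = softmax(logits)
    else:
        probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return max(range(len(probs)), key=probs.__getitem__)
```

Because W_t changes per time step while the gate weights (W_f, W_i, W_c, W_o) are shared, the decoding space can vary across steps without retraining the LSTM itself.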
Specifically, compared with a general sequence generation task, SQL statements are subject to strong grammatical constraints. To make the SQL generated by the model conform to the relevant grammar, the SQL template generation layer serves as the first decoding layer and generates the SQL template. An SQL statement consists of components such as SELECT, WHERE, HAVING, ORDER BY, GROUP BY, LIMIT and FROM, where the FROM component can be derived from the field information of the other components. Because the SQL statement is order-independent, the SQL template design considers only the type information of each component and does not involve details such as fields and their related operations. The component type information is shown in Table 1.
TABLE 1 component types
[Table 1 appears as an image in the original publication.]
RELATION_TYPE denotes the condition relationship in a WHERE or HAVING component; it needs to be determined when the number of fields is greater than or equal to 2. The SQL template describes the type information of each component of the SQL statement and mainly consists of the component types in Table 1. In addition, some complex SQL statements contain SQL sub-clauses, defined here as "nested SQL". According to the nesting form, four kinds of SQL predefined templates are designed: non-nested SQL, set operation SQL, FROM nested SQL and VALUE nested SQL. Non-nested SQL contains no SQL sub-clause; the latter three all contain SQL sub-clauses, but in different nesting forms. The four classes of SQL predefined templates are described in detail as follows:
Non-nested SQL: this type of SQL contains no nested structure. An example is "SELECT APP name FROM taxi-hailing APP WHERE number of covered cities > 100".
Set operation SQL: this type contains set operators (UNION, EXCEPT, INTERSECT) and is formed by joining two SQL statements. An example is "(SELECT Chinese team name FROM ball team ORDER BY founding time ASC LIMIT 3) UNION (SELECT Chinese team name FROM ball team ORDER BY number of supporters DESC LIMIT 5)".
FROM nested SQL: this type contains an SQL clause inside the FROM component. An example is "SELECT a.civilian gun count − b.civilian gun count FROM (SELECT civilian gun count FROM nation WHERE name = 'Uruguay') a, (SELECT civilian gun count FROM nation WHERE name = 'Serbia') b".
VALUE nested SQL: the VALUE in a condition is an SQL clause. An example is "SELECT name FROM college WHERE college id NOT IN (SELECT college id FROM prize)".
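The four template classes above can be captured as a small enumeration; a sketch (the enum and helper names are illustrative, not defined by this application):

```python
from enum import Enum

class TemplateType(Enum):
    NON_NESTED = "non-nested SQL"        # no SQL sub-clause
    SET_OPERATION = "set operation SQL"  # two SQLs joined by UNION/EXCEPT/INTERSECT
    FROM_NESTED = "FROM nested SQL"      # sub-clause inside the FROM component
    VALUE_NESTED = "VALUE nested SQL"    # sub-clause as a condition VALUE

def contains_subquery(t: TemplateType) -> bool:
    # only the non-nested template has no SQL sub-clause
    return t is not TemplateType.NON_NESTED
```

Predicting this template type is the first decoding step (a four-way classification), which then determines whether and where a sub-clause region must be decoded.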
In another embodiment: in a conventional sequence generation task (e.g., machine translation), the decoding space at each time step is fixed and does not change over time steps. In SQL template generation, however, the decoding space at each time step is variable. Taking a non-nested SQL output sequence as an example, the first time step decodes the SQL template type, a four-way classification whose decoding space enumerates the SQL template types; the second time step decodes the type of the SELECT component, whose decoding space enumerates the number of fields in the SELECT component. The decoding spaces of these two time steps differ: the weight parameters (W_f, W_i, W_c, W_o) in the LSTM unit are the same, but the decoding weight parameter (W_t) is different.
In one embodiment, the filled fields are input into the SQL detail filling module to perform field selection, operation judgment and extraction on the SQL predefined template, so as to obtain the predicted multi-table SQL statement.
Specifically, the steps of field selection, operation judgment and extraction are as follows:
Field selection: these subtasks predict the fields of different components in the SQL statement and comprise 6 subtasks: SELECT field prediction, WHERE field prediction, HAVING field prediction, ORDER BY field prediction, GROUP BY field prediction, and auxiliary field prediction. For convenience of subsequent description, the first field in a combined field is called the primary field and the second field the auxiliary field; for a non-combined field the auxiliary field is empty. Taking "How many people work in the functional departments of the Changsha government?" as an example, the corresponding SQL statement is "SELECT number of staff + number of non-staff FROM Changsha government"; "number of staff" is the primary field and "number of non-staff" is the auxiliary field. The auxiliary field prediction subtask predicts the auxiliary field in the SQL statement, while the other field prediction subtasks predict the primary field of each component. Field selection is modeled as a ranking problem: several fields are selected as output by probability, and the number of fields is determined by the component type produced by the SQL template generation layer. The subtasks share the same calculation except for their weight parameters; the calculation details are as follows:
h_i = (1 / (end - start + 1)) Σ_{j=start}^{end} h_ij
p_i = sigmoid(v · tanh(W_t · h_t + W_s · h_i))
where v, W_t, W_s are learnable parameters, v ∈ R^{1×d}, W_t ∈ R^{d×d}, W_s ∈ R^{d×d}; these parameters are not shared across subtasks. h_i is the representation vector of the i-th field, obtained by mean pooling; start and end denote the head and tail indexes of the field in the input sequence, end - start + 1 is the field length, and h_ij is the BERT encoding vector of the j-th token in the i-th field. p_i denotes the probability that the i-th field is selected in a particular component, and h_t is the output of the SQL template generation layer for that component type. Taking the SELECT field prediction subtask as an example, p_i denotes the probability that the i-th field is selected in the SELECT component, and h_t is the output of the LSTM unit at the SELECT_TYPE time step in the SQL template generation layer.
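The mean-pooled field vector h_i and the ranking score p_i = sigmoid(v·tanh(W_t h_t + W_s h_i)) can be sketched in plain Python as follows; the dimensions and parameter values are toy placeholders:

```python
import math

def mean_pool(vectors):
    """h_i: mean of the BERT token vectors between the field's start/end indexes."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def matvec(M, x):
    return [sum(m * xv for m, xv in zip(row, x)) for row in M]

def field_score(v, W_t, W_s, h_t, h_i):
    """p_i = sigmoid(v . tanh(W_t h_t + W_s h_i)) -- ranking score of field i."""
    z = [a + b for a, b in zip(matvec(W_t, h_t), matvec(W_s, h_i))]
    z = [math.tanh(u) for u in z]
    s = sum(a * b for a, b in zip(v, z))
    return 1.0 / (1.0 + math.exp(-s))

# Toy usage (d = 2, identity weights, zero inputs): the score is sigmoid(0) = 0.5.
I = [[1.0, 0.0], [0.0, 1.0]]
p = field_score([1.0, 0.0], I, I, [0.0, 0.0], [0.0, 0.0])  # 0.5
```

In the model, fields would be ranked by this score and the top fields kept, with the count fixed by the component type from the template generation layer.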
Operation judgment: these subtasks predict the field-related operations in the SQL statement and comprise four subtasks, one for each operation type: aggregation operation, conditional operation, combined-column operation, and sort order. The aggregation operation set is [NONE, MAX, MIN, COUNT, SUM, AVG]; the conditional operation set is [=, >, <, !=, NOT IN, IN, LIKE]; the combined-column operation set is [+, -]; and the sort-order set is [ASC, DESC]. Operation judgment is modeled as a classification problem; the four subtasks share the same calculation except for their weight parameters. For a binary task, sigmoid is used as the activation function; for a multi-class task, softmax is used. The calculation details are as follows:
h_i = (1 / (end - start + 1)) Σ_{j=start}^{end} h_ij
p_i = sigmoid(v · tanh(W_t · h_t + W_s · h_i))
p_i = softmax(W_o · tanh(W_t · h_t + W_s · h_i))
where W_t, W_s, v, W_o are learnable parameters, W_t ∈ R^{d×d}, W_s ∈ R^{d×d}, v ∈ R^{1×d}, W_o ∈ R^{m×d}. Parameters are not shared across different subtasks; for the same subtask applied to different components, parameters are shared, and the components are distinguished by the component type information from the SQL template generation layer. h_i is the representation vector of the i-th field, obtained by mean pooling, and h_ij is the BERT encoding vector of the j-th token in the i-th field. p_i denotes the prediction probability of the operation associated with the i-th field, and h_t is the output of the SQL template generation layer for the component type.
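The sigmoid-versus-softmax choice described above can be sketched as follows, with illustrative logits; the operation vocabularies follow the sets in the text (the conditional set is the reconstructed one):

```python
import math

def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Operation vocabularies from the text.
AGG_OPS   = ["NONE", "MAX", "MIN", "COUNT", "SUM", "AVG"]  # 6-way: softmax
ORDER_OPS = ["ASC", "DESC"]                                 # binary: sigmoid

def judge(logits, op_set):
    """Pick an operation: sigmoid threshold for 2-element sets, softmax argmax otherwise."""
    if len(op_set) == 2:
        p = 1.0 / (1.0 + math.exp(-logits[0]))
        return op_set[1] if p > 0.5 else op_set[0]
    p = softmax(logits)
    return op_set[max(range(len(p)), key=p.__getitem__)]
```

For example, a strongly positive binary logit selects DESC, while the largest of six aggregation logits selects the corresponding aggregation operator.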
Extraction: these subtasks extract fragments from the natural language question to fill the SQL template and comprise 2 subtasks: VALUE extraction and LIMIT extraction. A multi-task joint learning framework trains the subtasks together so that their losses remain comparable in scale. The target value is extracted with 0-1 tagging: each token in the natural language question sequence is tagged 0 or 1, where 1 means the token should be extracted and 0 means it should not. The probability that the i-th token in the natural language question should be extracted is:
p_i = sigmoid(W_v · h_qi)
where W_v is a learnable parameter, W_v ∈ R^{1×d}; h_qi is the BERT encoding vector of the i-th token in the natural language question sequence, and p_i denotes the probability that the token is extracted.
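The 0-1 tagging scheme can be sketched as follows, with toy dimensions and hand-picked weights; tokens whose p_i = sigmoid(W_v·h_qi) exceeds the threshold form the extracted VALUE:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def extract_value(tokens, W_v, H_q, threshold=0.5):
    """Tag each question token 0/1 via p_i = sigmoid(W_v . h_qi);
    return the tokens whose probability exceeds the threshold."""
    picked = []
    for tok, h in zip(tokens, H_q):
        p = sigmoid(sum(w * x for w, x in zip(W_v, h)))
        if p > threshold:
            picked.append(tok)
    return picked

# Toy usage (d = 2): only the last token's encoding scores above the threshold.
tokens = ["stations", "opened", "before", "1998"]
H_q = [[-1.0, 1.0], [-1.0, 1.0], [-1.0, 1.0], [2.0, -1.0]]
W_v = [2.0, -2.0]
extracted = extract_value(tokens, W_v, H_q)  # -> ["1998"]
```

In practice the threshold and weights are learned; here they are chosen so the VALUE token "1998" is the only one tagged 1.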
An SQL statement may contain multiple VALUEs that belong to different table fields. On the basis of the VALUE extraction subtask, a matching subtask judges the matching relationship between a VALUE and a field. When constructing the training labels, if a VALUE belongs to a field, the corresponding (VALUE, COLUMN) pair is labeled 1; otherwise it is labeled 0. The matching score between a VALUE and a field is calculated as follows:
h_v = (1 / (end_v - start_v + 1)) Σ_{j=start_v}^{end_v} h_qj
h_i = (1 / (end_i - start_i + 1)) Σ_{j=start_i}^{end_i} h_ij
score_i = sigmoid(v · tanh(W_v · h_v + W_i · h_i))
where v, W_v, W_i are learnable parameters, v ∈ R^{1×d}, W_v ∈ R^{d×d}, W_i ∈ R^{d×d}. h_v is the representation vector of the VALUE, obtained by mean pooling its BERT encoding vectors; h_i is the representation vector of the i-th field, obtained by mean pooling its BERT encoding vectors. start_v and end_v denote the head and tail indexes of the VALUE in the input sequence, and start_i and end_i denote those of the i-th field. score_i is the matching score between the VALUE and the i-th field.
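Constructing the 0/1 (VALUE, COLUMN) training labels described above can be sketched as follows; the example values and fields are illustrative, loosely reusing the FROM-nested data sample:

```python
def match_labels(values, fields, gold_pairs):
    """Build the 0/1 (VALUE, COLUMN) matching matrix used as training labels:
    entry [i][j] is 1 iff VALUE i belongs to field j in the gold SQL."""
    return [[1 if (v, f) in gold_pairs else 0 for f in fields] for v in values]

# Toy usage: both country names are VALUEs of the "name" field.
values = ["Uruguay", "Serbia"]
fields = ["name", "civilian gun count"]
gold = {("Uruguay", "name"), ("Serbia", "name")}
labels = match_labels(values, fields, gold)  # [[1, 0], [1, 0]]
```

At training time, score_i for each (VALUE, field) pair would be supervised against the corresponding entry of this matrix.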
In one embodiment, the loss function of the SQL template generation module and the loss function of the SQL detail filling module are obtained, and the two are fused to obtain the overall loss function.
Specifically, a multi-task joint learning framework is used to train multiple subtasks simultaneously. Each subtask is essentially a classification problem and uses cross entropy as its loss function; the loss function of the model is the accumulation of the subtask loss functions, calculated as follows:
loss = loss_s2s + loss_m
loss_s2s = Σ_{t=1}^{8} CrossEntropy(p_t, y_t)
loss_m = Σ_{k=1}^{13} loss_k
where loss is the loss function of the model, composed of the loss functions of the SQL template generation layer and the SQL detail filling layer. loss_s2s is the loss of the SQL template generation layer, which predicts the template type and the component types; its decoding space changes dynamically at each step and comprises 8 decoding spaces in total. loss_m is the loss of the SQL detail filling layer, which predicts and fills the detail parts of the SQL statement, covering 13 subtasks in total.
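A minimal sketch of the fused loss, assuming the template-generation steps yield per-step probability distributions with gold class indexes, and the 13 detail-filling subtasks each report a scalar loss:

```python
import math

def cross_entropy(p, gold_idx):
    """Cross entropy for one decoding step: -log p[gold]."""
    return -math.log(max(p[gold_idx], 1e-12))

def total_loss(template_step_probs, template_gold, subtask_losses):
    """loss = loss_s2s + loss_m: accumulate cross entropy over the template
    decoding steps, then add the detail-filling subtask losses."""
    loss_s2s = sum(cross_entropy(p, g)
                   for p, g in zip(template_step_probs, template_gold))
    loss_m = sum(subtask_losses)
    return loss_s2s + loss_m
```

For a uniform 4-way step distribution and two detail subtasks of 0.5 each, the total is ln 4 + 1.0, showing how the two layers' losses simply add.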
In practical applications, for a string-typed field, the target VALUE may not appear in the natural language question, which contains only a synonym of it. In such cases, the VALUE extracted from the question cannot be used directly to fill the SQL template, so two types of post-processing are applied. The first is format conversion, which converts the extracted VALUE into the specific format required. For example, if the user's question is "list the stations that opened before January 1, 1998" and the corresponding SQL statement is "SELECT name FROM station WHERE opening time < 1998-01-01", the extracted VALUE "January 1, 1998" must be converted to "1998-01-01". The second is synonym retrieval: rule matching retrieves from the database the content with the highest similarity to the extracted VALUE and uses it to fill the SQL template, with Rouge-L as the similarity measure.
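Rouge-L similarity is based on the longest common subsequence; a self-contained sketch of the synonym-retrieval step follows (the β weight of the F-score is an assumption, as the text does not state one):

```python
def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(cand, ref, beta=1.2):
    """Rouge-L F-score between two sequences (characters here)."""
    l = lcs_len(cand, ref)
    if l == 0:
        return 0.0
    p, r = l / len(cand), l / len(ref)
    return (1 + beta**2) * p * r / (r + beta**2 * p)

def best_db_match(value, candidates):
    """Synonym retrieval: pick the database content most similar to VALUE."""
    return max(candidates, key=lambda c: rouge_l(value, c))
```

For instance, an extracted VALUE "colour" would retrieve the stored content "color" ahead of unrelated entries, and the retrieved string fills the SQL template.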
It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not limited to being performed in the exact order illustrated and, unless explicitly stated herein, may be performed in other orders. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 2, there is provided a complex multi-table SQL generating apparatus based on bridge padding, including: an input module 202, an encoding module 204, and a generating module 206, wherein:
an input module 202, configured to input a natural language table sample into a multi-table SQL parsing model; the multi-table SQL analysis model comprises the following steps: the system comprises a semantic coding module, an SQL template generating module and an SQL detail filling module; the natural language table sample includes: natural language questions, database names, and database table fields;
the encoding module 204 is configured to analyze the natural language table sample according to the pre-trained semantic encoding module to obtain a field sequence code, a natural language problem sequence code, and a table name field sequence code; wherein the field sequence code, the natural language question sequence code and the table name field sequence code constitute overall coded information; the field sequence code and the table name field sequence code are connected through a serial connection character to form an enhanced sequence code;
the generating module 206 is configured to input the entire encoding information into the SQL template generating module, and generate a filling field corresponding to an SQL predefined template in the SQL template generating module; the SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is formed by predefined SQL statement components; inputting the filling field into the SQL detail filling module to fill the SQL predefined template to obtain a predicted multi-table SQL statement; training the multi-table SQL analysis model according to the predicted multi-table SQL statement and a preset loss function to obtain a trained multi-table SQL analysis model; and inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
In one embodiment, the input module 202 is further configured to obtain an initial sequence of natural language table samples as:
[XLS], q_1, q_2, ..., q_L, [SEP], t_11, t_12, ..., [CAT], c_111, c_112, ..., [SEP], ..., [SEP]
where [XLS] denotes the initial marker, [SEP] denotes a separator, and [CAT] denotes a concatenation symbol; q_1, q_2, ..., q_L is the natural language question sequence; t_i1, t_i2, ..., [CAT], c_ij1, c_ij2, ... is the enhanced sequence of the j-th field in the i-th data table; L denotes the length of the natural language question; and q_t denotes the t-th token in the natural language question sequence;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes as follows:
h_[XLS], h_q1, h_q2, ..., h_qL, h_[SEP], h_t11, h_t12, ..., h_[CAT], h_c111, h_c112, ..., h_[SEP], ..., h_[SEP]
where h_[XLS] represents the overall encoding information, h_[SEP] represents the encoding of [SEP], h_[CAT] represents the encoding of [CAT], h_qt represents the encoding of q_t, and h_ti1, h_ti2, ..., h_[CAT], h_cij1, h_cij2, ... represents the encoding of t_i1, t_i2, ..., [CAT], c_ij1, c_ij2, ...
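One plausible reading of the input layout can be sketched as below: each field's enhanced sequence concatenates the table-name tokens, [CAT], and the field tokens, with [SEP] closing each enhanced sequence. Repeating the table name per field is an assumption; the marker names follow the text:

```python
def build_input(question_tokens, tables):
    """Assemble [XLS], q_1..q_L, [SEP], then one enhanced sequence per field:
    table-name tokens + [CAT] + field tokens, each closed by [SEP].
    `tables` is a list of (table_name_tokens, [field_tokens, ...]) pairs."""
    seq = ["[XLS]"] + question_tokens + ["[SEP]"]
    for name_tokens, fields in tables:
        for field_tokens in fields:
            seq += name_tokens + ["[CAT]"] + field_tokens + ["[SEP]"]
    return seq

# Toy usage with one table ("taxi app") holding two fields.
seq = build_input(["how", "many"],
                  [(["taxi", "app"], [["app", "name"], ["cities", "covered"]])])
```

The assembled sequence would then be fed to BERT, whose output at [XLS] serves as the overall encoding h_[XLS].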
In one embodiment, the encoding module 204 is further configured to input the whole encoding information into the SQL template generating module, and generate, by using the calculation formula of the LSTM unit, a filling field corresponding to the SQL predefined template in the SQL template generating module as follows:
f_t = σ(W_f · [x_t, h_{t-1}] + b_f)
i_t = σ(W_i · [x_t, h_{t-1}] + b_i)
c̃_t = tanh(W_c · [x_t, h_{t-1}] + b_c)
c_t = f_t * c_{t-1} + i_t * c̃_t
o_t = σ(W_o · [x_t, h_{t-1}] + b_o)
h_t = o_t * tanh(c_t)
p_t = softmax(W_t · h_t)
p_t = sigmoid(W_t · h_t)
where W_t is a learnable parameter, W_t ∈ R^{m×d}, with a different W_t for each type of SQL statement component. When the output set size is greater than 2, softmax is used as the activation function, with m the output set size; when the output set size is 2, sigmoid is used as the activation function. h_[XLS] is the overall encoding information, and x_0 = h_[XLS].
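The LSTM cell equations above can be sketched for scalar (d = 1) toy sizes; all weights and biases below are placeholders:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One step of the template-generation LSTM at scalar size:
    gates f_t, i_t, o_t and candidate state over the concatenation [x_t, h_prev]."""
    z = [x_t, h_prev]
    dot = lambda w: w[0] * z[0] + w[1] * z[1]
    f = sigmoid(dot(W["f"]) + b["f"])          # forget gate
    i = sigmoid(dot(W["i"]) + b["i"])          # input gate
    c_tilde = math.tanh(dot(W["c"]) + b["c"])  # candidate cell state
    c = f * c_prev + i * c_tilde               # new cell state
    o = sigmoid(dot(W["o"]) + b["o"])          # output gate
    h = o * math.tanh(c)                       # new hidden state
    return h, c

# Toy usage: with zero weights every gate is 0.5, so c halves each step.
W = {k: [0.0, 0.0] for k in "fico"}
b = {k: 0.0 for k in "fico"}
h, c = lstm_cell(1.0, 0.0, 2.0, W, b)  # c = 1.0, h = 0.5 * tanh(1.0)
```

The hidden state h_t would then be projected by the step-specific W_t and passed through softmax or sigmoid as described above.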
In one embodiment, the types of the SQL predefined template include: non-nested SQL, set operation SQL, FROM nested SQL, and VALUE nested SQL.
In one embodiment, the SQL statement component comprises: SELECT, WHERE, HAVING, ORDER BY, GROUP BY, LIMIT, and FROM.
In one embodiment, the generating module 206 is further configured to input the filled field into the SQL detail filling module to perform field selection, operation judgment and extraction on the SQL predefined template, so as to obtain a predicted multi-table SQL statement.
In one embodiment, the generating module 206 is further configured to obtain the loss function of the SQL template generation module and the loss function of the SQL detail filling module, and to fuse the two to obtain the overall loss function.
For specific limitations of the bridge padding-based complex multi-table SQL generation apparatus, reference may be made to the above limitations of the bridge padding-based complex multi-table SQL generation method, and details are not repeated here. The modules in the complex multi-table SQL generating device based on bridge padding may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a complex multi-table SQL generation method based on bridge padding. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for generating complex multi-table SQL based on bridge padding, which is characterized by comprising the following steps:
inputting the natural language table sample into a multi-table SQL analysis model; the multi-table SQL analysis model comprises the following steps: the system comprises a semantic coding module, an SQL template generating module and an SQL detail filling module; the natural language table sample includes: natural language questions, database names, and database table fields;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes; wherein the field sequence code, the natural language question sequence code and the table name field sequence code form integral code information; the field sequence code and the table name field sequence code are connected through a serial connection character to form an enhanced sequence code;
inputting the whole coding information into the SQL template generating module to generate filling fields corresponding to SQL predefined templates in the SQL template generating module; the SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is formed by predefined SQL statement components;
inputting the filling field into the SQL detail filling module to fill the SQL predefined template to obtain a predicted multi-table SQL statement;
training the multi-table SQL analysis model according to the predicted multi-table SQL statement and a preset loss function to obtain a trained multi-table SQL analysis model;
and inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
2. The method of claim 1, wherein parsing the natural language table samples according to the pre-trained semantic code module to obtain field sequence codes, natural language problem sequence codes, and table name field sequence codes comprises:
the initial sequence of obtaining natural language table samples is:
[XLS], q_1, q_2, ..., q_L, [SEP], t_11, t_12, ..., [CAT], c_111, c_112, ..., [SEP], ..., [SEP]
wherein [XLS] denotes the initial marker, [SEP] denotes a separator, and [CAT] denotes a concatenation symbol; q_1, q_2, ..., q_L is the natural language question sequence; t_i1, t_i2, ..., [CAT], c_ij1, c_ij2, ... is the enhanced sequence of the j-th field in the i-th data table; L represents the length of the natural language question; and q_t represents the t-th token in the natural language question sequence;
analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence codes, natural language problem sequence codes and table name field sequence codes as follows:
h_[XLS], h_q1, h_q2, ..., h_qL, h_[SEP], h_t11, h_t12, ..., h_[CAT], h_c111, h_c112, ..., h_[SEP], ..., h_[SEP]
wherein h_[XLS] represents the overall encoding information, h_[SEP] represents the encoding of [SEP], h_[CAT] represents the encoding of [CAT], h_qt represents the encoding of q_t, and h_ti1, h_ti2, ..., h_[CAT], h_cij1, h_cij2, ... represents the encoding of t_i1, t_i2, ..., [CAT], c_ij1, c_ij2, ...
3. The method of claim 2, wherein inputting the overall coding information into the SQL template generation module, and generating the filled fields corresponding to the SQL predefined templates in the SQL template generation module comprises:
inputting the whole coding information into the SQL template generating module, and generating filling fields corresponding to the SQL predefined template in the SQL template generating module by using a calculation formula of an LSTM unit, wherein the filling fields are as follows:
f_t = σ(W_f · [x_t, h_{t-1}] + b_f)
i_t = σ(W_i · [x_t, h_{t-1}] + b_i)
c̃_t = tanh(W_c · [x_t, h_{t-1}] + b_c)
c_t = f_t * c_{t-1} + i_t * c̃_t
o_t = σ(W_o · [x_t, h_{t-1}] + b_o)
h_t = o_t * tanh(c_t)
p_t = softmax(W_t · h_t)
p_t = sigmoid(W_t · h_t)
wherein W_t is a learnable parameter, W_t ∈ R^{m×d}, with a different W_t for each type of SQL statement component; when the output set size is greater than 2, softmax is used as the activation function, with m the output set size; when the output set size is 2, sigmoid is used as the activation function; h_[XLS] is the overall encoding information, and x_0 = h_[XLS].
4. The method of claim 3, wherein the types of SQL predefined templates include: non-nested SQL, set operation SQL, FROM nested SQL, and VALUE nested SQL.
5. The method of claim 4, wherein the SQL statement component comprises: SELECT, WHERE, HAVING, ORDER BY, GROUP BY, LIMIT, and FROM.
6. The method according to any one of claims 1 to 5, wherein entering the fill field into the SQL detail fill module to fill the SQL predefined template, resulting in a predicted multi-table SQL statement, comprises:
and inputting the filling field into the SQL detail filling module to perform field selection, operation judgment and extraction on the SQL predefined template to obtain a predicted multi-table SQL statement.
7. The method of claim 6, wherein the step of constructing a loss function comprises:
obtaining a loss function of the SQL template generation module and a loss function of the SQL detail filling module;
and fusing the loss function of the SQL template generation module and the loss function of the SQL detail filling module to obtain a loss function.
8. An apparatus for generating complex multi-table SQL based on bridge padding, the apparatus comprising:
the input module is used for inputting the natural language table sample into the multi-table SQL analysis model; the multi-table SQL analysis model comprises the following steps: the system comprises a semantic coding module, an SQL template generating module and an SQL detail filling module; the natural language table sample includes: natural language questions, database names, and database table fields;
the coding module is used for analyzing the natural language table sample according to the pre-trained semantic coding module to obtain field sequence coding, natural language problem sequence coding and table name field sequence coding; wherein the field sequence code, the natural language question sequence code and the table name field sequence code form integral code information; the field sequence code and the table name field sequence code are connected through a serial connection character to form an enhanced sequence code;
the generating module is used for inputting the whole coding information into the SQL template generating module and generating filling fields corresponding to the SQL predefined template in the SQL template generating module; the SQL template generation module is composed of LSTM units, and a plurality of types of SQL predefined templates are pre-constructed in the SQL template generation module; the SQL predefined template is formed by predefined SQL statement components; inputting the filling field into the SQL detail filling module to fill the SQL predefined template to obtain a predicted multi-table SQL statement; training the multi-table SQL analysis model according to the predicted multi-table SQL statement and a preset loss function to obtain a trained multi-table SQL analysis model; and inputting the natural language table to be analyzed into the trained multi-table SQL analysis model to obtain a corresponding multi-table SQL statement.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110362073.5A 2021-04-02 2021-04-02 Complex multi-table SQL generation method and device based on bridging filling Active CN112925794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110362073.5A CN112925794B (en) 2021-04-02 2021-04-02 Complex multi-table SQL generation method and device based on bridging filling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110362073.5A CN112925794B (en) 2021-04-02 2021-04-02 Complex multi-table SQL generation method and device based on bridging filling

Publications (2)

Publication Number Publication Date
CN112925794A CN112925794A (en) 2021-06-08
CN112925794B true CN112925794B (en) 2022-09-16

Family

ID=76173994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110362073.5A Active CN112925794B (en) 2021-04-02 2021-04-02 Complex multi-table SQL generation method and device based on bridging filling

Country Status (1)

Country Link
CN (1) CN112925794B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626468B (en) * 2021-08-12 2024-03-01 平安科技(深圳)有限公司 SQL sentence generation method, device and equipment based on artificial intelligence and storage medium
CN116629227B (en) * 2023-07-24 2023-10-24 海信集团控股股份有限公司 Method and equipment for converting text into SQL (structured query language) sentence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101673287A (en) * 2009-10-16 2010-03-17 金蝶软件(中国)有限公司 SQL sentence generation method and system
US9875276B2 (en) * 2015-06-15 2018-01-23 Sap Se Database view generation
CN110688394B (en) * 2019-09-29 2021-11-23 浙江大学 NL generation SQL method for novel power supply urban rail train big data operation and maintenance
CN112069199B (en) * 2020-08-20 2022-08-05 浙江大学 Multi-round natural language SQL conversion method based on intermediate syntax tree
CN112559556B (en) * 2021-02-25 2021-05-25 杭州一知智能科技有限公司 Language model pre-training method and system for table mode analysis and sequence mask

Also Published As

Publication number Publication date
CN112925794A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN111310438B (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN109190120B (en) Neural network training method and device and named entity identification method and device
CN110765265A (en) Information classification extraction method and device, computer equipment and storage medium
CN111444311A (en) Semantic understanding model training method and device, computer equipment and storage medium
CN110737768B (en) Text abstract automatic generation method and device based on deep learning and storage medium
CN112925794B (en) Complex multi-table SQL generation method and device based on bridging filling
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
WO2021082086A1 (en) Machine reading method, system, device, and storage medium
CN111401065A (en) Entity identification method, device, equipment and storage medium
CN115599901A (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN112507124A (en) Chapter-level event causal relationship extraction method based on graph model
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN112989829B (en) Named entity recognition method, device, equipment and storage medium
CN112668281A (en) Automatic corpus expansion method, device, equipment and medium based on template
CN113011136B (en) SQL (structured query language) analysis method and device based on correlation judgment and computer equipment
CN111400340A (en) Natural language processing method and device, computer equipment and storage medium
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
WO2021042517A1 (en) Artificial intelligence-based article gist extraction method and device, and storage medium
CN114201957A (en) Text emotion analysis method and device and computer readable storage medium
CN113779994A (en) Element extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant