CN113515955A - Semantic understanding-based online translation system and method from text sequence to instruction sequence - Google Patents

Semantic understanding-based online translation system and method from text sequence to instruction sequence

Info

Publication number
CN113515955A
Authority
CN
China
Prior art keywords
sql
columns
column
clause
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110453842.2A
Other languages
Chinese (zh)
Inventor
张晓芳
欧睿
饶攀军
陈科
马东红
郑元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiji Computer Corp Ltd
Original Assignee
Taiji Computer Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiji Computer Corp Ltd filed Critical Taiji Computer Corp Ltd
Priority to CN202110453842.2A
Publication of CN113515955A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation
    • G06F40/56 Natural language generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides semantic understanding-based online translation from a text sequence to an instruction sequence, and in particular studies translation from a text sequence to an instruction sequence using a language pre-training model and deep learning; it belongs to the field of natural language processing. The invention aims to solve the problems in the prior art that artificial text and the JSON language cannot be converted directly and that complex sentences cannot be translated during text conversion. A template generation layer is introduced: an SQL template can be extracted from the SQL statements in the training data, the model can select different templates for different input question sentences, different SQL subtasks can be divided according to the template, and SQL statements with complex structures are generated on this basis. The method for filling SQL sub-statements is extended so that complex SQL statements can be generated. No additional encoder or decoder is needed, and no additional intermediate representation layer needs to be introduced, which reduces model complexity and can improve the generalization capability of the model.

Description

Semantic understanding-based online translation system and method from text sequence to instruction sequence
Technical Field
The invention relates to online translation from a text sequence to an instruction sequence, in particular to research on translation from the text sequence to the instruction sequence by using a language pre-training model and deep learning, and belongs to the field of natural language processing.
Background
In traditional business-process software, a user calls the program interface corresponding to a business type to complete a business requirement. However, as business requirements diversify, the corresponding program interfaces become more complicated; the number of operation steps a user needs to complete one business task increases, which reduces work efficiency and increases the probability of errors.
Essentially, business requirements can be mapped to insert, delete, update and query operations on a database. An instruction is similar to an SQL statement and can be converted into an SQL-like expression, after which the conversion from natural language to instructions can be completed using the related Text2SQL and NL2SQL techniques.
The Text2SQL problem poses three major challenges. The first is information fusion: how to jointly represent the question sentence and the database table. The second is mismatch, i.e. the gap between the intent expressed in natural language and the details of the SQL statement; the main reason is that SQL is an efficient query language designed for relational databases rather than for expressing semantic information. The third is generalization: whether a correct SQL statement can be generated for an unknown table (out-of-domain schema), i.e. a table that does not appear in the model's training data set.
Addressing these three technical problems and building on recent Text2SQL work, previous models are improved and a model called the Hybrid Ranking Filling Network (H-Net) is proposed. The model can be summarized in the following five steps: first, a column and the question are jointly encoded with the language pre-training model BERT to obtain the relation between the column and the question; second, an SQL template is selected, and the complete SQL statement task is divided into several sub-SQL statement tasks according to the template; third, the similarity between the columns and the question is calculated within the different sub-SQL statements and the results are ranked; fourth, the generation tasks of the different sub-SQL statements are decoded using the ranking results and the corresponding decoding modes to generate the sub-SQL statements; fifth, the SQL template is filled with the sub-SQL statements to generate the complete SQL statement.
Unlike traditional end-to-end models, the hybrid ranking filling model needs no additional encoder or decoder and introduces no additional intermediate representation layer; these characteristics reduce model complexity and can improve the generalization capability of the model. To verify the effectiveness of the proposed method, a text-sequence-to-instruction-sequence data set is constructed, and the experimental results on this data set demonstrate the effectiveness of the proposed H-Net method.
Disclosure of Invention
In order to solve the problems in the prior art that artificial text and the JSON language cannot be converted directly and that complex sentences cannot be translated during text conversion, the invention provides a semantic understanding-based online translation system, method and storage medium from a text sequence to an instruction sequence. The scheme is as follows:
the first scheme is as follows: the system comprises four subsystems, namely a JSON sequence conversion subsystem, a template generation subsystem, an online translation subsystem and an SQL post-processing subsystem; completing construction of a data set through JSON to SQL statement conversion and template extraction; the online translation process from the text sequence to the instruction sequence based on semantic understanding is completed by encoding and online translating the text sequence and pruning SQL sentences;
the JSON sequence conversion subsystem comprises a SQL construction statement module and a data table construction module, is responsible for converting the instruction sequence in the JSON format into an SQL statement, constructs a data table according to the instruction sequence and is used for integrating SQL initial data;
the template generation subsystem comprises a template clustering module and a template extraction module and is used for guiding the translation system to generate an instruction sequence;
the online translation subsystem comprises a training and predicting module used for completing the online translation process from a text sequence to an instruction sequence;
the SQL post-processing subsystem comprises an execution guidance module, an SQL statement pruning module and an instruction reduction module, and is used for removing redundant parts in the SQL statement and comprehensively processing the SQL statement and reducing the SQL statement into an instruction sequence in a JSON format.
Scheme II: the online translation method from a text sequence to an instruction sequence based on semantic understanding is realized on the basis of the system by utilizing the template generation subsystem, and defining translation rules and initializing statement data through template extraction, SQL statement input representation and model establishment;
the statement data comprises two parts, wherein one part comprises the type, the table name, the column name, the table annotation, the column annotation and the connection operation of the column in the SQL reference data set; the other part introduces the word sequence into which the question sentence is cut.
Further, aiming at the online translation process, the expanding SQL sub-statement subsystem and the instruction sequence conversion subsystem are core system components borne by the method, and the method specifically comprises the following steps:
step one, input data is coded;
step two, establishing an SQL model by defining translation rules;
step three, aiming at the calculation of the SQL column and the division of the subtasks of the structure, the process of initializing the template is completed;
step four, training and predicting the SQL model through the initialization process of the step one to the step three;
step five, combining an algorithm to guide each clause and finish the pruning of the sentences;
step six, combining the SQL data set in the SQL database to complete the instruction sequence conversion;
and seventhly, completing the online translation process from the text sequence to the instruction sequence through JSON and SQL data integration.
Further, in step three, the subtask partition described with respect to the calculation of the column includes a column independent task, a column dependent task, and a structure dependent task;
wherein the column independent task refers to predicting columns involved in the selection clause, columns involved in the conditional clause, columns involved in the ordering clause, and columns involved in the complete SQL statement;
the column-related tasks refer to predicting function operators, condition operators, the values corresponding to conditions, and the number of values corresponding to the conditions;
the structurally related tasks include prediction set operators and join operators.
Further, the column calculation is detailed as follows:
step three-1, analyzing the similarity between the columns and the question for calculation: the similarity between the columns and the question is calculated in the different SQL clauses, and it is then judged whether each column appears in the corresponding SQL clause;
step three-2, processing the column-independent tasks: for the column-independent tasks, the columns involved in the different SQL clauses are predicted so that the columns can be screened independently;
step three-3, processing the column-related tasks: the function operators corresponding to columns in the selection clause, the condition operators corresponding to columns in the conditional clause, and the values corresponding to columns in the conditional clause are predicted;
step three-4, processing the structure-related tasks: the set operator and the join operator of the SQL statement are predicted, completing the structure-related tasks.
Further, step four comprises a training stage and a prediction stage; the training stage first prepares the training input, then generates the SQL statement label corresponding to each question, and finally minimizes the cross-entropy loss of each type of subtask to optimize the training target;
step four-1, encoding the input question sentence;
step four-2, selecting a template from the SQL data set, and then dividing the SQL statement into subtasks;
step four-3, analyzing the similarity between the columns and the question: calculating the similarity between the columns and the question in the different SQL clauses, and then judging whether each column appears in the corresponding SQL clause;
step four-4, processing the column-independent tasks: predicting the columns involved in the different SQL clauses so as to screen the columns independently;
step four-5, processing the column-related tasks: predicting the function operators corresponding to columns in the selection clause, the condition operators corresponding to columns in the conditional clause, and the values corresponding to columns in the conditional clause;
step four-6, processing the structure-related tasks: predicting the set operator and the join operator of the SQL statement, completing the structure-related tasks;
step four-7, comparing the SQL statement label corresponding to the question with the processing results of steps four-1 to four-6, calculating the cross-entropy loss of each step, and finally completing the training target.
Further, in the prediction stage, the specific refinement steps are as follows:
step five-1, encoding the input question sentence;
step five-2, selecting a template from the SQL data set, and then dividing the SQL statement into subtasks;
step five-3, calculating the similarity between the columns and the question in the selection clause and ranking them; taking the top several columns according to the predicted number of columns in the selection clause, and finally obtaining the selection clause by calculating the function operator corresponding to each column;
step five-4, calculating the similarity between the columns and the question in the conditional clause and ranking them; taking the top several columns according to the predicted number of columns in the conditional clause, and obtaining the conditional clause according to the condition operator corresponding to each column;
step five-5, calculating the similarity between the columns and the question in the ordering clause and ranking them; taking the top several columns according to the predicted number of columns in the ordering clause, and finally obtaining the ordering clause according to the ordering operator corresponding to each column;
step five-6, predicting the structural information of the SQL statement in combination with the preceding steps to obtain the set operator and the join operator;
step five-7, according to the columns selected by the clauses, searching the SQL data set for the tables corresponding to these columns to obtain the from clause;
step five-8, finding the shortest path between the tables found in step five-7 according to the foreign-key and primary-key relations between them, and adding the join conditions between the tables on the shortest path to the conditional clause;
through the above steps, the different SQL clauses are obtained; the SQL clauses are filled into the corresponding positions in the template according to their types, finally yielding a complete SQL statement.
Further, in step five, the execution-guidance judgment and SQL statement pruning process includes executing the partially generated SQL statements in the prediction phase and adjusting SQL statement generation according to the execution results of these partial statements on the SQL data set, thereby completing the SQL statement pruning, limiting the search range of the model in the solution space, and removing search paths that do not meet the requirements; four execution guidance modes are adopted, namely selection clause execution guidance, conditional clause execution guidance, ordering clause execution guidance and structure execution guidance.
Further, in step six, the instruction sequence conversion refers to conversion into a JSON statement format, and the question type in the data set is answer retrieval based on conditional query matching.
Furthermore, a database table set consisting of at least 80 tables is obtained by extracting and constructing the tables from the JSON statements, and meanwhile, the JSON statements are converted into SQL statements to serve as a new data set through the data integration module and used for model training, verification and testing.
The invention has the beneficial effects that:
according to the invention, a template generation layer is introduced, an SQL template can be extracted according to SQL sentences in training data, different templates can be selected according to different input question sentences by a model, different SQL subtasks can be divided according to the templates, and the SQL sentences with complex structures are generated on the basis; the method for filling the SQL sub-sentences is expanded, and complex SQL sentences can be generated; in addition, additional encoders and decoders are not needed, and in addition, additional intermediate representation layers are not needed to be introduced, so that the model complexity is reduced, and the model generalization capability can be improved;
in addition, the invention provides the execution guidance and SQL statement pruning process: the parameters in JSON are divided into direct parameters and indirect parameters, and corresponding conversion rules are designed according to their attributes; further, a rule for constructing the data table from the JSON statement is provided, so that the data table can be built automatically from the JSON statement; finally, a rule for converting the JSON statement into an SQL statement is provided, so that the Text2SQL technique can be applied to realize Text2JSON;
the instruction sequence conversion studied by the invention converts a business requirement expressed in the user's natural language into an instruction that the system can execute; by executing the instruction, the business requirement expressed by the user is completed, which effectively shortens the operation steps, improves work efficiency and reduces the probability of errors.
Drawings
FIG. 1 is a general block diagram of an online translation from a text sequence to an instruction sequence based on semantic understanding;
FIG. 2 is a diagram of the context-free grammar for SQL statements;
FIG. 3 is an SQL statement template;
FIG. 4 is a schematic diagram of a SQL statement template sample;
FIG. 5 is an exemplary diagram of a question sentence and JSON sentence according to the present invention;
FIG. 6 is a diagram of SQL statements corresponding to JSON in FIG. 5;
FIG. 7 is an exemplary diagram of experimental results regarding question sentences and JSON sentences;
FIG. 8 is a diagram of the SQL statement corresponding to the JSON statement in FIG. 7;
FIG. 9 is an exemplary diagram of a context free grammar experiment result for an SQL statement;
FIG. 10 is a diagram of the syntax structure of an SQL statement;
FIG. 11 is a diagram of an online translation module from a text sequence to an instruction sequence based on semantic understanding.
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Detailed Description
The first embodiment is as follows: the specific structure and simplified logic sequence of the semantic understanding-based online translation system from text sequences to instruction sequences in this embodiment are shown in fig. 11, and the detailed modules are as follows: the system comprises three subsystems, namely template generation, SQL sub-statement expansion and instruction sequence conversion; the online translation process from a text sequence to an instruction sequence based on semantic understanding is completed through text language sequence establishment, SQL statement expansion and instruction sequence conversion; the template generation subsystem comprises a template extraction module, an input representation module and a model establishment module, and is responsible for integrating the initial SQL data; the SQL sub-statement expansion subsystem comprises a training and prediction module, an execution guidance module and an SQL statement pruning module, and is used for comprehensive SQL statement processing; the instruction sequence conversion subsystem comprises a data and data set parameter module, an instruction conversion module and a data integration module, and is responsible for converting SQL statements into instruction sequences and completing the online translation process from a text sequence to an instruction sequence.
The second embodiment is as follows: in this embodiment, in combination with some defects in the existing language translation technology, an online translation method from a text sequence to an instruction sequence based on semantic understanding is further elaborated in detail in combination with a system flow in the first embodiment:
recent work on Text2SQL can be largely classified into the following four categories:
end-to-end translation: the main idea is to generate a Sketch (Sketch) with SQL syntax structure through the end-to-end model and then fill the Sketch with the content in the question. The method can be simply summarized into the following two steps: firstly, respectively coding an input question and a database table, calculating attention of the question to the database table, generating an attention expression as a model input, and then generating a rough sketch through a decoder; second, fill in the missing details in the sketch by aligning the question with the sketch. The end-to-end model is sensitive to sentence order, but SQL sentences are not. For example, exchanging two conditional orders in a conditional statement does not affect the SQL statement, but affects the generation process of the end-to-end model.
Intermediate semantic representation: the main idea is to splice the question with all columns, encode them to obtain the input, then generate an intermediate semantic representation with a tree structure using a grammar-based decoder, and finally decode the intermediate representation again to generate the complete SQL statement. The method can be summarized in three steps: first, the question is linked with the database (schema linking) to find the tables, columns and value sequences of the database that appear in the question, and then all columns in the database table are encoded and spliced with the question as input; second, an intermediate semantic representation with a tree structure is generated by a grammar-based decoder through the two rule types ApplyRule and GenToken; third, the intermediate semantic representation is decoded again to generate the complete SQL. The intermediate semantic representation method requires a special grammar-based decoder, and the special intermediate semantic representation layer it introduces increases the complexity of the model. In addition, the intermediate semantics cannot fully express the content of the question, which causes semantic loss.
SQL subtask prediction: the main idea is to disassemble the SQL statement into a plurality of sub SQL statements, then generate the sub SQL statements respectively, and finally combine into a complete SQL statement. Such a process can be briefly summarized in the following two steps: firstly, dividing the SQL sentence into sub SQL sentences such as SELECT-column, SELECT-aggregation, WHERE-number, WHERE-column, WHERE-operator and WHERE-value; and secondly, each sub SQL statement corresponds to a sub task, and each sub task is equivalent to a classification task. The subtasks are independent from each other, generate corresponding sub SQL sentences respectively, and finally combine into a complete SQL sentence.
Disadvantages of such methods include: 1) the division of SQL subtasks is completed by manually defining rules; 2) the dependencies between the subtasks are not fully exploited.
Language pre-training model: the main idea is to capture the relation between the question and the columns with the language pre-training model BERT, and then generate the SQL statement using that relation. The method can be summarized in three steps: first, the question is spliced with all columns, or with a single column, as input, and the encoding output of the language pre-training model is used to obtain the relation between the question and the columns; second, the complete SQL is divided into several sub-SQL statements in a manner similar to SQL subtask prediction; third, each sub-SQL statement is decoded from the output of the language pre-training model with an LSTM (Long Short-Term Memory) network or a linear neural network, the sub-SQL statements are filled in, and they are finally combined to obtain the complete SQL statement. The disadvantages of such methods are mainly: 1) the division of SQL subtasks is completed by manually defined rules; 2) they cannot be extended to complicated SQL statements; 3) an additional LSTM decoder is required to complete SQL statement generation.
The hybrid ranking filling network model (H-Net) proposed here is similar to the last category of methods in that it also adopts a language pre-training model and SQL statement subtask division, but there are two main differences. First, a template generation layer is introduced: an SQL template can be extracted from the SQL statements in the training data, the model can generate different templates for different input question sentences, different SQL subtasks can be divided according to the template, and SQL statements with complex structures are generated on this basis. Second, the method for filling the SQL sub-statements is extended, so that complex SQL statements can be generated.
The overall structure of the hybrid ranking filling network model H-Net is shown in fig. 1; the functions and calculation modes of each part of the model are described in detail in the following subsections.
1. Template extraction: first, a context-free grammar is defined for the SQL statement, as shown in fig. 2;
and analyzing the SQL sentences s in the training set by using the context-free grammar. For each SQL statement s, a syntax structure of the SQL statement is obtained.
For example, for an SQL statement:
select class_type from cource where id >= 100 order by class_type asc
union
select class_type from cource_opened where id >= 100 order by id desc
after the context-free grammar is used for analysis, a template shown in FIG. 3 is obtained;
and clustering according to the obtained SQL grammar structure. For each type in the clustering result, a grammar structure capable of covering all SQL sentences of the type is taken as a Template, and all templates are recorded as a Template set T.
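As a minimal sketch of this template-extraction idea, the following Python fragment masks literals and identifiers instead of applying the full context-free grammar of FIG. 2; the helper names (extract_template, build_template_set) are illustrative assumptions, not part of the specification.

import re
from collections import Counter

def extract_template(sql: str) -> str:
    """Abstract an SQL statement into a syntax template by masking literals
    and identifiers (a simplified stand-in for parsing with the grammar of FIG. 2)."""
    s = sql.lower()
    s = re.sub(r"'[^']*'", "VALUE", s)           # mask string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "VALUE", s)   # mask numeric literals
    s = re.sub(r"(?<=select )[\w, ]+(?= from)", "COLUMNS", s)
    s = re.sub(r"(?<=from )\w+", "TABLE", s)
    s = re.sub(r"(?<=where )\w+", "COLUMN", s)
    s = re.sub(r"(?<=order by )\w+", "COLUMN", s)
    return re.sub(r"\s+", " ", s).strip()

def build_template_set(train_sqls):
    """Cluster identical templates and keep one representative per cluster."""
    counts = Counter(extract_template(s) for s in train_sqls)
    return list(counts.keys())

if __name__ == "__main__":
    sqls = [
        "select class_type from cource where id >= 100 order by class_type asc",
        "select name from student where id >= 3 order by name desc",
    ]
    print(build_template_set(sqls))   # syntax templates recorded as the set T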
2. Input representation:
A question is denoted q; one column in the database is represented as:
c_i = concat(type, t, c, t_a, c_a)
wherein: type represents the type of the column, type ∈ {string, number, date, ...};
t represents the table name;
c represents the column name;
t_a represents the table annotation;
c_a represents the column annotation;
concat(·) represents the join operation.
The question q and the column c_i are combined to obtain the input pair (c_i, q); after word segmentation of the input, the following word sequence is obtained:
[CLS], x1, x2, ..., xm, [SEP], y1, y2, ..., yn, [SEP]
wherein x1, x2, ..., xm is the word sequence of column c_i;
y1, y2, ..., yn is the word sequence of question q;
[CLS], x1, x2, ..., xm, [SEP], y1, y2, ..., yn, [SEP] is taken as the model input and denoted input.
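A minimal sketch of this input construction, assuming a BERT-style tokenizer from the transformers library; the dictionary field names used for the column record are illustrative assumptions.

from transformers import BertTokenizer

def build_input(column: dict, question: str, tokenizer):
    """Build the [CLS] x1..xm [SEP] y1..yn [SEP] sequence for the pair (c_i, q)."""
    # column text: type + table name + column name + table/column annotations
    column_text = " ".join([column["type"], column["table"], column["name"],
                            column.get("table_comment", ""), column.get("column_comment", "")])
    return tokenizer(column_text, question, return_tensors="pt")

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
col = {"type": "number", "table": "cource", "name": "id",
       "table_comment": "course table", "column_comment": "course id"}
inputs = build_input(col, "which course types have id of at least 100", tokenizer)
print(inputs["input_ids"].shape)   # [CLS] column tokens [SEP] question tokens [SEP]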
The third concrete implementation mode: according to the transformation rule described in the second embodiment, the present embodiment introduces a specific process of encoding and completing a sequence task:
A. Encoding:
the input is encoded with the BERT language pre-training model to capture the relationship between the question and the column, and a hidden-state sequence is obtained after encoding:
h_[CLS], h_x1, ..., h_xm, h_[SEP], h_y1, ..., h_yn, h_[SEP] = BERT(input)
where h_[CLS] denotes the hidden state of the [CLS] token.
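A minimal encoding sketch using the Hugging Face transformers API; the column text and question are placeholder strings, and only the [CLS] hidden state h_cls used by the classification heads below is extracted.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("number cource id", "which course types have id of at least 100",
                   return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

hidden_states = outputs.last_hidden_state   # shape [1, seq_len, hidden_size]
h_cls = hidden_states[:, 0, :]              # hidden state of the [CLS] token
print(h_cls.shape)                          # e.g. torch.Size([1, 768])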
B. SQL model building:
1. template selection:
Template selection can be regarded as a classification task; for a given question q, the corresponding template is denoted τ:
τ = argmax_k Σ_i P(c_i ∈ R_c | q) · P(τ = k | c_i, q)    (2)
P(τ = k | c_i, q) = softmax(W_template[k,:] · h_[CLS])    (3)
where P(c_i ∈ R_c | q) is the similarity between question q and column c_i in the complete SQL statement (see formula (7)).
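A sketch of the template-selection head of formula (3), assuming the [CLS] vectors of the candidate columns come from the encoder above; the aggregation of the per-column distributions weighted by the column relevance P(c_i ∈ R_c | q) follows the reconstruction of formula (2) given here and should be read as an assumption.

import torch
import torch.nn as nn

class TemplateHead(nn.Module):
    """Per-column template distribution P(tau = k | c_i, q), formula (3)."""
    def __init__(self, hidden_size: int, num_templates: int):
        super().__init__()
        self.w_template = nn.Linear(hidden_size, num_templates)

    def forward(self, h_cls):                        # h_cls: [num_columns, hidden_size]
        return torch.softmax(self.w_template(h_cls), dim=-1)

head = TemplateHead(hidden_size=768, num_templates=12)
h_cls = torch.randn(5, 768)                          # [CLS] states for 5 candidate columns
p_rel = torch.rand(5, 1)                             # P(c_i in R_c | q), formula (7)
# aggregate per-column distributions weighted by column relevance, then take the argmax
template_index = (p_rel * head(h_cls)).sum(dim=0).argmax().item()
print("selected template index:", template_index)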
2. Dividing subtasks: assume that an SQL statement template is shown in FIG. 4:
according to the SQL statement template, the predicted SQL statement can be divided into three types of subtasks:
(a) Column-independent tasks: predicting the columns S_c involved in the selection clause, the columns W_c involved in the conditional clause, the columns O_c involved in the ordering clause, and the columns R_c involved in the complete SQL statement.
(b) Column-related tasks: predicting the function operator agg, the condition operator op, the value corresponding to a condition, the number of values value_num corresponding to a condition, and so on.
(c) Structure-related tasks: predicting the set operator rel, the join operator link, and so on.
C. With respect to the columns:
1. column and question similarity calculation:
for question q, assume note ScIs the column, W, involved in selecting the clause select _ closecIs the column, O, involved in the conditional clause where _ clausecIs the column referred to in the ordering clause order _ close, RcIs the column, R, involved in the complete SQL statementc=Sc∪Wc∪Oc
For column ciThe column c needs to be calculatediSimilarity with question q in different SQL clauses, and then column c is determinediWhether it appears in the corresponding SQL clause.
The following are four different similarity calculation methods:
in selecting clause select _ close, column ciThe similarity with question q is noted as P (c)i∈Sc|q):
P(ci∈Sc|q)=sigmod(Wsc·h[CLS]) (4)
In the conditional clause where _ clause, column ciThe similarity with question q is noted as P (c)i∈Wc|q):
P(ci∈Wc|q)=sigmod(Wwc·h[CLS]) (5)
In the sort clause order _ close, column ciThe similarity with question q is noted as P (c)i∈Oc|q):
P(ci∈Oc|q)=sigmod(Woc·h[CLS]) (6)
In the complete SQL statement, column ciThe similarity with question q is noted as P (c)i∈Rc|q):
P(ci∈Rc|q)=sigmod(Wrc·h[CLS]) (7)
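A sketch of the four sigmoid similarity heads of formulas (4)-(7); each is a single linear layer over h[CLS], one per clause type, and the module and head names are illustrative assumptions.

import torch
import torch.nn as nn

class ColumnSimilarity(nn.Module):
    """Clause-level similarities of formulas (4)-(7), one sigmoid head per clause type."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.heads = nn.ModuleDict({
            "select": nn.Linear(hidden_size, 1),   # P(c_i in S_c | q)
            "where":  nn.Linear(hidden_size, 1),   # P(c_i in W_c | q)
            "order":  nn.Linear(hidden_size, 1),   # P(c_i in O_c | q)
            "full":   nn.Linear(hidden_size, 1),   # P(c_i in R_c | q)
        })

    def forward(self, h_cls):                       # h_cls: [num_columns, hidden_size]
        return {name: torch.sigmoid(head(h_cls)).squeeze(-1)
                for name, head in self.heads.items()}

similarities = ColumnSimilarity(768)(torch.randn(5, 768))
print(similarities["select"])                       # per-column similarity to the selection clause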
2. Column-independent tasks:
Column-independent tasks refer to predicting the columns involved in the different SQL clauses, e.g. S_c, W_c and O_c. There are two prediction modes:
first, a threshold is set, and for each type of clause, only columns with similarity to question sentences exceeding the threshold are retained.
Second, for each type of clause, a value of n is predicted, and only the top n columns of similarity with question sentences are reserved.
For the second method, n is calculated as follows:
The number of columns in the selection clause, select_num, is denoted n_s:
n_s = argmax_k Σ_i P(c_i ∈ S_c | q) · P(n_s = k | c_i, q)    (8)
P(n_s = k | c_i, q) = softmax(W_ns[k,:] · h_[CLS])    (9)
wherein P(n_s = k | c_i, q) corresponds to a classification task.
The number of columns in the conditional clause, where_num, is denoted n_w:
n_w = argmax_k Σ_i P(c_i ∈ W_c | q) · P(n_w = k | c_i, q)    (10)
P(n_w = k | c_i, q) = softmax(W_nw[k,:] · h_[CLS])    (11)
wherein P(n_w = k | c_i, q) corresponds to a classification task.
The number of columns in the ordering clause, order_num, is denoted n_o:
n_o = argmax_k Σ_i P(c_i ∈ O_c | q) · P(n_o = k | c_i, q)    (12)
P(n_o = k | c_i, q) = softmax(W_no[k,:] · h_[CLS])    (13)
wherein P(n_o = k | c_i, q) corresponds to a classification task.
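A sketch of the second column-screening mode (formulas (8)-(13)): a softmax head predicts the number of columns n for a clause and the top-n columns ranked by similarity are kept; the bound max_columns on the count classifier is an assumption.

import torch
import torch.nn as nn

class ColumnCount(nn.Module):
    """Predict the number of columns in a clause as a classification task, formulas (9), (11), (13)."""
    def __init__(self, hidden_size: int, max_columns: int = 6):
        super().__init__()
        self.w_n = nn.Linear(hidden_size, max_columns + 1)

    def forward(self, h_cls):                        # h_cls: [num_columns, hidden_size]
        # average the per-column count distributions, then take the most likely count
        return torch.softmax(self.w_n(h_cls), dim=-1).mean(dim=0).argmax().item()

def screen_columns(similarity: torch.Tensor, n: int):
    """Mode 2: keep the top-n columns ranked by similarity to the question."""
    n = max(1, min(n, similarity.numel()))
    return torch.topk(similarity, k=n).indices.tolist()

h_cls = torch.randn(5, 768)
select_sim = torch.rand(5)                           # P(c_i in S_c | q) from formula (4)
n_s = ColumnCount(768)(h_cls)
print(screen_columns(select_sim, n_s))               # indices of columns kept for the selection clause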
3. Column-related tasks:
Each column-related task can be regarded as a classification task:
For the function operator a_j in the selection clause:
P(a_j | c_i, q) = softmax(W_agg[j,:] · h_[CLS])    (14)
For the condition operator o_j in the conditional clause:
P(o_j | c_i, q) = softmax(W_op[j,:] · h_[CLS])    (15)
For the value corresponding to a column in the conditional clause:
first, the number of values corresponding to the condition, value_num, needs to be predicted and is denoted n_v:
P(n_v = j | c_i, q) = softmax(W_nv[j,:] · h_[CLS])    (16)
For the k-th value of the condition, value_k, only the start position and end position of value_k in the question need to be indicated:
start position: P(start_k = t | c_i, q) = softmax(W_start[k,:] · h_yt)    (17)
end position: P(end_k = t | c_i, q) = softmax(W_end[k,:] · h_yt)    (18)
where h_yt is the hidden state of the t-th question token.
For the ordering operator or_j in the ordering clause:
P(or_j | c_i, q) = softmax(W_order[j,:] · h_[CLS])    (19)
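A sketch of the condition-value extraction of formulas (16)-(18): the number of values is classified from h[CLS] and each value is located by pointer-style start/end heads over the question-token hidden states; the head shapes follow the reconstruction above and are assumptions.

import torch
import torch.nn as nn

class ValueSpanHead(nn.Module):
    """Predict value_num and the start/end positions of each condition value in the question."""
    def __init__(self, hidden_size: int, max_values: int = 4):
        super().__init__()
        self.w_num = nn.Linear(hidden_size, max_values + 1)   # formula (16)
        self.w_start = nn.Linear(hidden_size, max_values)     # formula (17)
        self.w_end = nn.Linear(hidden_size, max_values)       # formula (18)

    def forward(self, h_cls, h_question):            # h_question: [n_tokens, hidden_size]
        n_values = torch.softmax(self.w_num(h_cls), dim=-1).argmax(-1)
        start = torch.softmax(self.w_start(h_question), dim=0).argmax(dim=0)  # one position per value slot
        end = torch.softmax(self.w_end(h_question), dim=0).argmax(dim=0)
        return int(n_values), start.tolist(), end.tolist()

head = ValueSpanHead(768)
n_v, starts, ends = head(torch.randn(768), torch.randn(12, 768))
print(n_v, starts[:n_v], ends[:n_v])                 # spans of the predicted condition values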
4. Structure-related tasks:
for the set operator rel:
rel = argmax_k Σ_i P(c_i ∈ R_c | q) · P(rel = k | c_i, q)    (20)
P(rel = k | c_i, q) = softmax(W_rel[k,:] · h_[CLS])    (21)
wherein P(rel = k | c_i, q) corresponds to a classification task.
For the join operator link:
link = argmax_k Σ_i P(c_i ∈ R_c | q) · P(link = k | c_i, q)    (22)
P(link = k | c_i, q) = softmax(W_link[k,:] · h_[CLS])    (23)
wherein P(link = k | c_i, q) corresponds to a classification task.
D. Training and predicting:
1. Training stage:
Input: (q_i, c_j), where q_i is the i-th question, U_C = {c_1, c_2, ..., c_L} is the set of all columns in the database tables, and c_j ∈ U_C.
Label: the SQL statement corresponding to q_i.
Optimization target: minimizing the cross-entropy loss of each class of subtasks:
min Σ CrossEntropy(sub_task)    (24)
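A sketch of the joint training objective of formula (24), summing a cross-entropy term per subtask (binary cross-entropy for the sigmoid column-screening scores); the subtask names in the example are illustrative assumptions.

import torch
import torch.nn.functional as F

def total_loss(predictions: dict, labels: dict) -> torch.Tensor:
    """Sum the cross-entropy losses of all subtasks, formula (24)."""
    loss = torch.tensor(0.0)
    for task, pred in predictions.items():
        target = labels[task]
        if pred.dim() == 1:                          # sigmoid column-screening scores -> BCE
            loss = loss + F.binary_cross_entropy(pred, target.float())
        else:                                        # softmax classification heads -> CE
            loss = loss + F.cross_entropy(pred, target)
    return loss

predictions = {"select_col": torch.rand(5), "template": torch.randn(1, 12)}
gold = {"select_col": torch.tensor([1, 0, 0, 1, 0]), "template": torch.tensor([3])}
print(total_loss(predictions, gold))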
2. Prediction stage:
firstly, selecting a template through a formula (2), and then dividing SQL statement subtasks.
Secondly, calculating the similarity of the columns and the question sentences in the selected clauses according to a formula (4), and sorting; the number of columns in the selection clause is predicted by equation (8)
Figure BDA0003039838160000121
Then before taking the similarity
Figure BDA0003039838160000122
Column (2) of
Figure BDA0003039838160000123
And finally, calculating a function operator corresponding to each column according to a formula (14) to obtain a selection clause select _ close:
Figure BDA0003039838160000124
thirdly, calculating the similarity of the columns and the question sentences in the conditional clauses according to the formula (5), and sorting; predicting the number of columns in a conditional clause by equation (10)
Figure BDA0003039838160000125
Then before taking the similarity
Figure BDA0003039838160000126
Column (2) of
Figure BDA0003039838160000127
Finally, calculating a conditional operator corresponding to each column according to the formula (15), and calculating a value corresponding to each column according to the formulas (16), (17) and (18), so as to obtain a conditional clause where _ close:
Figure BDA0003039838160000128
fourthly, calculating the similarity of the columns and the question sentences in the sorting clauses according to the formula (6) and sorting; the number of columns in the sorting clause is predicted by equation (12)
Figure BDA0003039838160000129
Then before taking the similarity
Figure BDA00030398381600001210
Column (2) of
Figure BDA00030398381600001211
And finally, calculating an operator corresponding to each column according to a formula (19) to obtain a sorting clause order _ close:
Figure BDA00030398381600001212
fifthly, the structural information of the SQL statement is predicted through the formulas (20) and (22), and the structural information can be obtained
Figure BDA00030398381600001213
And
Figure BDA00030398381600001214
sixthly, extracting samples by combining the templates shown in the figures 5 and 6, and knowing the tables corresponding to the columns according to the columns selected by the clauses, wherein S is the table corresponding to the clauses,
Figure BDA00030398381600001215
the from clause can be found:
Figure BDA00030398381600001216
and seventhly, table connection is carried out, the shortest path between the tables S is found according to the relation between the external key and the main key between the tables S, and the connection condition between the tables on the shortest path is added into the condition clause.
And eighthly, through the steps, clauses of different SQL can be obtained. And filling the SQL clause into a corresponding position in the template according to the type of the SQL clause to finally obtain a complete SQL sentence.
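A minimal sketch of this final template-filling step; the slot-marker convention used for the template string is an assumption for illustration.

def fill_template(template: str, clauses: dict) -> str:
    """Fill the predicted clauses into the corresponding slots of the selected SQL template."""
    sql = template
    for slot, clause in clauses.items():
        sql = sql.replace("{" + slot + "}", clause or "")
    return " ".join(sql.split())

template = "select {select_clause} from {from_clause} where {where_clause} order by {order_clause}"
clauses = {
    "select_clause": "class_type",
    "from_clause": "cource",
    "where_clause": "id >= 100",
    "order_clause": "class_type asc",
}
print(fill_template(template, clauses))
# select class_type from cource where id >= 100 order by class_type asc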
E. Execution guidance and SQL statement pruning:
in the prediction phase of the SQL statement, the partially generated SQL statement may also be executed, and the generation of the SQL statement may be adjusted according to the Execution result of the partially generated SQL statement on the database, which is called Execution Guidance (EG). The execution guidance plays a role of pruning, limiting the search range of the model in the solution space and removing the search path which does not meet the requirement.
The following are four different execution guidance modes adopted by the embodiment, which are respectively directed to different subtasks:
Selection clause execution guidance: as shown in Algorithm 1;
[Algorithm 1: selection clause execution guidance]
Conditional clause execution guidance: as shown in Algorithm 2;
[Algorithm 2: conditional clause execution guidance]
Ordering clause execution guidance: similar to the conditional clause execution guidance algorithm;
Structure execution guidance: the execution guidance algorithms for the set operator rel and the join operator link are similar; the algorithm for the set operator rel is shown as Algorithm 3:
[Algorithm 3: execution guidance for the set operator rel]
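A sketch of the execution-guidance idea described above, assuming candidate partial SQL statements are tried in ranked order against the SQL data set and non-executable or empty-result candidates are pruned; sqlite3 and the database path stand in for the actual data set.

import sqlite3

def execution_guided_choice(candidate_sqls, db_path):
    """Execute partially generated SQL candidates against the data set and prune
    the ones that fail or return an empty result, keeping the best survivor."""
    conn = sqlite3.connect(db_path)
    try:
        for sql in candidate_sqls:                   # candidates in descending model score
            try:
                rows = conn.execute(sql).fetchall()
            except sqlite3.Error:
                continue                             # prune: candidate is not executable
            if rows:
                return sql                           # first executable candidate with results
        return candidate_sqls[0] if candidate_sqls else None
    finally:
        conn.close()

candidates = [
    "select class_type from cource where id >= 100",
    "select class_type from cource where id >= '100'",
]
print(execution_guided_choice(candidates, "business.db"))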
F. Instruction sequence conversion:
first, explanation is made with respect to data sets: the constructed text-to-instruction sequence data set is in a JSON format, and the JSON format is mainly used for conveniently representing instruction sequences. The dataset contains 2312 (question, JSON statement) pairs.
1923 pieces of data correspond to one JSON statement corresponding to one question, and 389 pieces of data correspond to two JSON statements corresponding to one question. The question type is mostly answer retrieval based on multi-condition query matching.
The JSON statement is composed of a method part, a url part and a params part, wherein a plurality of parameters param are contained in the params part, and each parameter param is composed of a name part, an option part and a value part.
One (question sentence, JSON sentence) pair is shown in fig. 5;
for parameter param, there are two types:
direct parameters: the value appears directly in the question, e.g. the parameters create_time and create_org in fig. 5;
indirect parameters: value values do not appear directly in the question, such as the parameter columns in FIG. 5;
for both parameters, there are respective processing functions in the instruction conversion.
Regarding the processing function, firstly, instruction conversion is performed, and in order to use Text2SQL technology, JSON statements need to be converted into SQL statements; secondly, a table is constructed, and the construction rule is as follows:
url → table name
param.name → column name, if param ∈ direct parameters
param.name & param.value → column name, if param ∈ indirect parameters
From the JSON statement in fig. 5, a table structure as shown in table 1 can be constructed:
Table 1 is the data table corresponding to the JSON in fig. 5; table name: "/api/v2/extract/query/export"
create_time | create_org | columns&contract_name | columns&contract_amount
Specifically, the JSON statement is converted into an SQL statement:
the conversion rules are as follows:
url→table name
param → (name, option, value), if param ∈ direct parameter
param → (option, name & value), if param ∈ indirect parameter
For a direct parameter, the converted parameter is placed in the conditional clause where_clause; for an indirect parameter, the converted parameter is placed in the selection clause select_clause; url is placed in the from clause. The JSON statement in FIG. 5 can be converted into the SQL statement shown in FIG. 6.
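A sketch of these conversion rules, assuming the method/url/params JSON structure described above; deciding whether a parameter is direct or indirect by checking whether its value appears in the question is an illustrative test, not the specification's exact procedure.

import json

def json_to_sql(json_statement: str, question: str) -> str:
    """Convert a JSON-format instruction into an SQL statement.

    url -> table name (from clause)
    direct param  (its value appears in the question) -> where clause: name option value
    indirect param (its value does not appear)        -> select clause: name&value
    """
    cmd = json.loads(json_statement)
    table = cmd["url"]
    select_cols, where_conds = [], []
    for param in cmd["params"]:
        name, option, value = param["name"], param["option"], str(param["value"])
        if value in question:                        # direct parameter
            where_conds.append(f"{name} {option} '{value}'")
        else:                                        # indirect parameter
            select_cols.append(f"{name}&{value}")
    select_part = ", ".join(select_cols) if select_cols else "*"
    where_part = f" where {' and '.join(where_conds)}" if where_conds else ""
    return f'select {select_part} from "{table}"{where_part}'

example = json.dumps({
    "method": "POST",
    "url": "/api/v2/extract/query/export",
    "params": [
        {"name": "create_time", "option": "=", "value": "2021-01-01"},
        {"name": "columns", "option": "in", "value": "contract_name"},
    ],
})
print(json_to_sql(example, "export the contracts created on 2021-01-01"))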
G. Data integration:
extracting and constructing tables from JSON to obtain a database table set consisting of 80 tables;
by converting the JSON statement into the SQL statement, a new 2312 (question sentence, SQL statement) pair can be obtained;
2312 (question sentence, SQL sentence) pairs and database table sets are used as new data sets for model training, verification and testing.
Specifically, the experimental result is that 2312 pieces of data (question sentences, SQL sentences) are divided, wherein 1800 data pairs are used for a training set, 100 data pairs are used for a verification set, and 412 data pairs are used for a test set.
Table 2 shows the models pre-trained using different languages and their logical accuracy on the validation set and the test set with and without execution guidance. Logical accuracy refers to the proportion of the predicted SQL statement that is completely consistent with the tagged SQL statement.
TABLE 2 logical accuracy of different models on data validation set and test set
Model | Language pre-training model | Validation set | Test set
H-Net | RoBERTa-wwm-ext, Chinese | 84.0 | 83.5
H-Net | RoBERTa-wwm-ext-large, Chinese | 85.0 | 84.7
H-Net+EG | RoBERTa-wwm-ext, Chinese | 88.0 | 87.6
H-Net+EG | RoBERTa-wwm-ext-large, Chinese | 89.0 | 89.1
And dividing the SQL sentences into different sub SQL sentences by referring to the evaluation mode of the WikiSQL data set, and respectively evaluating the accuracy of the model on the subtasks corresponding to each type of sub SQL sentences. Table 3 shows the logical accuracy of the model on the validation set and the test set with and without execution guidance on each type of subtask.
TABLE 3 logical accuracy of different models for each type of subtask on the data validation set and test set
Subtask | H-Net+EG (validation set, test set) | H-Net (validation set, test set)
S-COL | 98.7, 98.5 | 97.9, 97.6
S-AGG | 98.7, 98.5 | 98.3, 98.0
W-COL | 98.3, 97.9 | 98.1, 97.6
W-OP | 99.2, 98.8 | 97.6, 96.8
VAL_NUM | 98.7, 98.5 | 96.2, 95.7
VAL | 96.2, 96.0 | 95.7, 93.7
O-COL | 99.2, 98.8 | 98.9, 98.6
O-OR | 99.4, 99.3 | 99.2, 98.7
Rel | 99.8, 99.6 | 99.4, 99.1
Link | 99.8, 99.6 | 99.4, 99.1
Tem | 99.8, 99.6 | 99.4, 99.1
In combination with the above examples, the conclusions are as follows:
the method is characterized in that a language pre-training model and deep learning are utilized to translate text into an instruction sequence, a mixed sequencing filling network model is provided, the model can well utilize the characteristics of the language pre-training model, and complex SQL sentences can be processed. Good experimental results are obtained from the text to instruction sequence data set constructed by the user, and the logic accuracy of the translation results can reach 89.1%.
Converting the JSON sequence into an SQL sequence and describing a JSON statement structure:
the JSON statement is composed of a method part, a url part and a params part, wherein a plurality of parameters param are contained in the params part, and each parameter param is composed of a name part, an option part and a value part. One (question sentence, JSON sentence) pair is shown in fig. 7:
for parameter param, there are two types:
direct parameters: the value appears directly in the question, such as the parameters create_time and create_org in FIG. 7;
indirect parameters: value values do not appear directly in the question, such as the parameter columns in FIG. 7;
for these two parameters, there are respective ways of handling in instruction conversion.
Specifically, the construction rule of the construction table is as follows:
url → table name
param.name → column name, if param ∈ direct parameters
param.name & param.value → column name, if param ∈ indirect parameters
From the JSON statement in fig. 7, a table structure as shown in Table 4 can be constructed:
Table 4 is the data table corresponding to the JSON in fig. 7; table name: "/api/v2/extract/query/export"
create_time | create_org | columns&contract_name | columns&contract_amount
Specifically, the JSON statement is converted into an SQL statement;
the conversion rules are as follows:
url→table name
param → (name, option, value), if param ∈ direct parameter
param → (option, name & value), if param ∈ indirect parameter
For a direct parameter, the converted parameter is placed in the conditional clause where_clause; for an indirect parameter, the converted parameter is placed in the selection clause select_clause; url is placed in the from clause. The JSON statement in FIG. 7 can be converted into the SQL statement shown in FIG. 8.
Innovation points a:
firstly, dividing parameters in JSON into direct parameters and indirect parameters, and designing corresponding conversion rules according to the attributes of the direct parameters and the indirect parameters;
second, a rule for building a data table from JSON statements is proposed so that a data table can be automatically built from JSON statements.
And thirdly, a rule for converting the JSON statement into the SQL statement is provided, so that the Text2JSON can be realized by applying a Text2SQL technology.
In particular, template extraction
SQL statement parsing
First, a context-free grammar is defined for the SQL statement, as shown in fig. 9;
for SQL statement s in data set, it is parsed by context-free grammar. For each SQL statement s, a syntax structure of the SQL statement is obtained.
For example, for an SQL statement:
select class_type from cource where id >= 100 order by class_type asc
union
select class_type from cource_opened where id >= 100 order by id desc
after parsing by using the context-free grammar, the syntax structure of the SQL statement shown in fig. 10 is obtained.
Regarding template extraction: and clustering according to the obtained SQL grammar structure. And for each type in the clustering result, taking the syntax structure of the SQL sentence with the highest coverage rate of the type as a Template, and recording all the templates as a Template set T.
The innovation points are as follows:
the template is automatically generated from the SQL statements in the dataset without the need for manually defining the template.
Different templates can be derived from different data sets, and generalization capability is strong.
The generated template granularity is controllable, and clustering can be performed according to different granularities, so that the granularity of the generated template can be adjusted according to the characteristics of a data set.
Specifically, the model overview:
The model can be summarized in the following five steps: first, a column and the question are jointly encoded with the language pre-training model BERT to obtain the relation between the column and the question; second, an SQL template is selected, and the complete SQL statement task is divided into several sub-SQL statement tasks according to the template; third, the similarity between the columns and the question is calculated within the different sub-SQL statements and the results are ranked; fourth, the generation tasks of the different sub-SQL statements are decoded using the ranking results and the corresponding decoding modes to generate the sub-SQL statements; fifth, the SQL template is filled with the sub-SQL statements to generate the complete SQL statement.
Innovation points b:
firstly, a template generation layer is introduced, an SQL template can be extracted according to SQL sentences in training data, different templates can be selected according to different input question sentences by a model, different SQL subtasks can be divided according to the templates, and the SQL sentences with complex structures are generated on the basis.
Secondly, the method for filling the SQL sub-statements is expanded, and complex SQL statements can be generated.
Third, no additional encoder and decoder are required, and in addition, no additional intermediate representation layer is required, which reduces model complexity and improves model generalization capability.
Specifically, the execution guidance and SQL statement pruning:
in the prediction phase of the SQL statement, the partially generated SQL statement may also be executed, and the generation of the SQL statement may be adjusted according to the execution result of the partially generated SQL statement on the database, which is called an execution guidance. The execution guidance plays a role of pruning, limiting the search range of the model in the solution space and removing the search path which does not meet the requirement.
The following are four different execution guidance modes adopted by the embodiment, which are respectively directed to different subtasks:
Selection clause execution guidance: as shown in Algorithm 1;
[Algorithm 1: selection clause execution guidance]
Conditional clause execution guidance: as shown in Algorithm 2;
[Algorithm 2: conditional clause execution guidance]
Ordering clause execution guidance: similar to the conditional clause execution guidance algorithm;
Structure execution guidance: applied to the set operator and the join operator; their execution guidance algorithms are similar, and the algorithm for the set operator is shown as Algorithm 3:
[Algorithm 3: execution guidance for the set operator]
innovation points c:
firstly, an execution guidance algorithm is designed for each type of sub SQL statement, and when the sub SQL statement is generated, the generation of the sub SQL statement can be dynamically adjusted according to an execution result.
Secondly, the selection clause, the conditional clause and the ordering clause are independent of each other, and the execution guidance algorithms of the different clause types do not influence each other.
And thirdly, the structural execution guidance algorithm considers the structural relationship among the SQL clauses, and removes redundant SQL clauses according to the structural relationship among the SQL clauses.
According to the method example, the functional modules may be divided according to the block diagram shown in fig. 11 of the specification, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module; the integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present invention is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Specifically, the device used by the system comprises a processor, a memory, a bus and a communication device;
the memory is used for storing computer execution instructions, the processor is connected with the memory through the bus, the processor executes the computer execution instructions stored in the memory, and the communication equipment is responsible for being connected with an external network and carrying out a data receiving and sending process; the processor is connected with the memory, and data are stored in the computer readable storage medium; the processor and the memory contain instructions for causing the personal computer or the server or the network device to perform all or part of the steps of the method; the type of processor used includes central processing units, general purpose processors, digital signal processors, application specific integrated circuits, field programmable gate arrays or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof; the storage medium comprises a U disk, a mobile hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.
Specifically, the software system is partially carried by a Central Processing Unit (CPU), a general-purpose Processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs, and microprocessors, among others. The communication device for communication between the relevant person and the user may utilize a transceiver, a transceiver circuit, a communication interface, or the like.
(1) Hardware environment: processor: Nvidia GPU with 10 GB or more of video memory; hard disk space: 20 GB or more; memory: 16 GB or more;
(2) Software environment: operating system: Windows or Linux; Python: version 3.6 or above; PyTorch: version 1.7.0 or above; CUDA: version 9.2 or above; NumPy: version 1.14.3 or above; transformers: version 2.5.1 or above; chinese2digits: version 4.0.2 or above.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
It will thus be seen that the present invention is illustrative of methods and systems, and is not limited thereto, since numerous modifications and variations may be made by those skilled in the art without departing from the spirit of the invention, which is set forth in the following claims.

Claims (10)

1. A semantic understanding-based online translation system from a text sequence to an instruction sequence, characterized in that: the system comprises four subsystems, namely JSON sequence conversion, template generation, online translation, and SQL post-processing; the data set is constructed through JSON-to-SQL statement conversion and template extraction; and the semantic understanding-based online translation process from a text sequence to an instruction sequence is completed by encoding the text sequence, translating it online, and pruning the resulting SQL statement;
the JSON sequence conversion subsystem comprises an SQL statement construction module and a data table construction module; it converts an instruction sequence in JSON format into an SQL statement, builds a data table from the instruction sequence, and is used for integrating the initial SQL data;
the template generation subsystem comprises a template clustering module and a template extraction module, and guides the translation system in generating the instruction sequence;
the online translation subsystem comprises a training and prediction module that completes the online translation process from a text sequence to an instruction sequence;
the SQL post-processing subsystem comprises an execution guidance module, an SQL statement pruning module, and an instruction reduction module, which remove the redundant parts of the SQL statement, post-process it, and reduce it to an instruction sequence in JSON format.
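The following Python skeleton is a minimal, non-limiting sketch of how the four subsystems of claim 1 could be organised; every class and method name is an assumption introduced for illustration and is not taken from the original filing.

# Illustrative skeleton of the four subsystems in claim 1; all names are assumptions.
class JsonSequenceConverter:
    """JSON sequence conversion: JSON instruction -> SQL statement and data table."""
    def to_sql(self, json_instruction): ...
    def build_table(self, json_instruction): ...

class TemplateGenerator:
    """Template generation: cluster training SQL and extract SQL templates."""
    def cluster(self, sql_statements): ...
    def extract_template(self, sql_statement): ...

class OnlineTranslator:
    """Online translation: train the model and predict SQL for a question."""
    def train(self, dataset): ...
    def predict(self, question, schema): ...

class SqlPostProcessor:
    """SQL post-processing: execution guidance, pruning, reduction to JSON."""
    def prune(self, sql_statement, database): ...
    def to_json(self, sql_statement): ...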
2. A semantic understanding-based online translation method from a text sequence to an instruction sequence, implemented on the system of claim 1, characterized in that: using the template generation subsystem, the method defines the translation rules and initializes the statement data through template extraction, the input representation of SQL statements, and model construction;
the statement data comprise two parts: one part covers the column types, table names, column names, table comments, column comments, and join operations in the SQL reference data set; the other part is the word sequence obtained by segmenting the question sentence.
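A minimal sketch of the two-part statement data of claim 2, assuming a plain dictionary layout; the field names and the sample schema below are illustrative assumptions only.

# Illustrative statement data: schema metadata plus the segmented question.
statement_data = {
    "schema": [
        {"column_type": "text", "table": "employee", "column": "name",
         "table_comment": "staff table", "column_comment": "employee name",
         "join_op": None},
        {"column_type": "number", "table": "employee", "column": "salary",
         "table_comment": "staff table", "column_comment": "monthly salary",
         "join_op": None},
    ],
    # word sequence obtained by segmenting the question sentence
    "question_tokens": ["which", "employees", "earn", "more", "than", "5000"],
}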
3. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 2, characterized in that: for the online translation process, the SQL sub-statement expansion subsystem and the instruction sequence conversion subsystem are the core components on which the method relies, and the method specifically comprises the following steps:
step one, encoding the input data;
step two, establishing the SQL model by defining translation rules;
step three, completing template initialization through the calculation over the SQL columns and the division of the structure-related subtasks;
step four, training the SQL model and predicting with it, on the basis of the initialization in steps one to three;
step five, guiding each clause with the algorithm and completing statement pruning;
step six, completing the instruction sequence conversion with the SQL data set in the SQL database;
step seven, completing the online translation process from the text sequence to the instruction sequence through the integration of the JSON and SQL data.
4. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 3, characterized in that: in step three, the subtask division associated with the column calculation comprises column-independent tasks, column-dependent tasks, and structure-dependent tasks;
the column-independent tasks predict the columns involved in the selection clause, the columns involved in the conditional clause, the columns involved in the ordering clause, and the columns involved in the complete SQL statement;
the column-dependent tasks predict the function operators, the condition operators, the values corresponding to the conditions, and the number of those values;
the structure-dependent tasks predict the set operators and the join operators.
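As a non-limiting illustration, the three subtask groups of claim 4 can be written down as plain label sets; the grouping follows the claim, while the constant and label names are assumptions.

# Illustrative label sets for the three subtask groups of claim 4.
COLUMN_INDEPENDENT_TASKS = {"select_columns", "where_columns",
                            "orderby_columns", "all_columns"}
COLUMN_DEPENDENT_TASKS = {"agg_operator", "condition_operator",
                          "condition_value", "condition_value_count"}
STRUCTURE_DEPENDENT_TASKS = {"set_operator", "join_operator"}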
5. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 4, characterized in that: the column calculation is detailed as follows:
step three-one, computing the similarity between each column and the question sentence for the different SQL clauses, and from these similarities judging whether the column appears in the corresponding SQL clause;
step three-two, processing the column-independent tasks, namely predicting the columns involved in the different SQL clauses so that the columns can be screened independently;
step three-three, processing the column-dependent tasks, namely predicting the function operator of each column in the selection clause, the condition operator of each column in the conditional clause, and the value of each column in the conditional clause;
step three-four, processing the structure-dependent tasks, namely predicting the set operator and the join operator of the SQL statement.
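The per-clause column scoring of step three-one can be sketched in PyTorch as below, assuming the question and each column have already been encoded into fixed-size vectors (for example by the pre-trained language model). The hidden size, the clause set, and the concatenation-plus-linear scoring head are assumptions made for the example, not the claimed model.

import torch
import torch.nn as nn

# Minimal sketch: each column is scored against the question once per clause
# type; the score is read as the probability that the column appears there.
class ColumnClauseScorer(nn.Module):
    def __init__(self, hidden=768, clause_types=("select", "where", "orderby")):
        super().__init__()
        self.heads = nn.ModuleDict(
            {c: nn.Linear(2 * hidden, 1) for c in clause_types})

    def forward(self, question_vec, column_vecs):
        # question_vec: (hidden,)   column_vecs: (num_columns, hidden)
        q = question_vec.unsqueeze(0).expand(column_vecs.size(0), -1)
        pairs = torch.cat([column_vecs, q], dim=-1)
        return {c: torch.sigmoid(head(pairs)).squeeze(-1)
                for c, head in self.heads.items()}

# example: ColumnClauseScorer()(torch.randn(768), torch.randn(6, 768))
# returns one probability per column for each of the three clause types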
6. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 3, characterized in that: step four comprises a training stage and a prediction stage; the training stage first prepares the training input, then generates the SQL statement label corresponding to the question sentence, and finally minimizes the cross-entropy loss of each class of subtask to optimize the training objective; the training stage is refined into the following steps:
step four-one, encoding the input question sentence;
step four-two, selecting a template from the SQL data set and then dividing the SQL statement into subtasks;
step four-three, computing the similarity between each column and the question sentence for the different SQL clauses, and from these similarities judging whether the column appears in the corresponding SQL clause;
step four-four, processing the column-independent tasks, namely predicting the columns involved in the different SQL clauses so that the columns can be screened independently;
step four-five, processing the column-dependent tasks, namely predicting the function operator of each column in the selection clause, the condition operator of each column in the conditional clause, and the value of each column in the conditional clause;
step four-six, processing the structure-dependent tasks, namely predicting the set operator and the join operator of the SQL statement;
step four-seven, comparing the SQL statement label of the question sentence with the results of steps four-one to four-six, computing the cross-entropy loss of each step, and thereby completing the training objective.
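The training objective of step four-seven, minimising the cross-entropy loss of each class of subtask, can be sketched as the sum of per-subtask cross-entropies; the subtask names and tensor shapes mentioned in the comments are assumptions.

import torch.nn.functional as F

# Sketch of the claim 6 objective: sum the cross-entropy of every subtask head.
def total_loss(predictions, labels):
    # predictions / labels: dicts keyed by subtask name, e.g.
    # "agg_operator" -> logits of shape (num_columns, num_classes) and
    # integer class labels of shape (num_columns,)
    return sum(F.cross_entropy(predictions[name], labels[name])
               for name in predictions)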
7. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 6, characterized in that:
the prediction stage is refined into the following steps:
step five-one, encoding the input question sentence;
step five-two, selecting a template from the SQL data set and then dividing the SQL statement into subtasks;
step five-three, computing and ranking the similarity between each column and the question sentence for the selection clause; taking the top-ranked columns according to the predicted number of columns in the selection clause; and finally obtaining the selection clause from the function operator computed for each column;
step five-four, computing and ranking the similarity between each column and the question sentence for the conditional clause; taking the top-ranked columns according to the predicted number of columns in the conditional clause; and obtaining the conditional clause from the condition operator corresponding to each column;
step five-five, computing and ranking the similarity between each column and the question sentence for the ordering clause; taking the top-ranked columns according to the predicted number of columns in the ordering clause; and finally obtaining the ordering clause from the ordering operator corresponding to each column;
step five-six, predicting the structural information of the SQL statement on the basis of the preceding steps, obtaining the set operator and the join operator;
step five-seven, looking up, in the SQL data set, the table corresponding to each column selected by the clauses, so as to obtain the from clause;
step five-eight, finding the shortest path between the tables obtained in step five-seven according to the links between the tables and the foreign-key/primary-key relations between them, and adding the join conditions between the tables along the shortest path to the conditional clause;
through the above steps, the different SQL clauses are obtained; filling each clause into its position in the template according to its type finally yields a complete SQL statement.
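Steps five-seven and five-eight can be illustrated with a small breadth-first search over the foreign-key/primary-key graph of the schema; the schema encoding used here (a dictionary mapping table pairs to join conditions) is an assumption made only for the example.

from collections import deque

# Sketch: walk the FK/PK graph, find the shortest path between two tables,
# and turn every edge on that path into a join condition string.
def join_conditions(schema_edges, source_table, target_table):
    # schema_edges: {(table_a, table_b): "a.fk = b.pk", ...} for both directions
    graph = {}
    for (a, b) in schema_edges:
        graph.setdefault(a, []).append(b)
    queue, parents = deque([source_table]), {source_table: None}
    while queue:                                  # breadth-first search
        table = queue.popleft()
        if table == target_table:
            break
        for nxt in graph.get(table, []):
            if nxt not in parents:
                parents[nxt] = table
                queue.append(nxt)
    if target_table not in parents:
        return []                                 # tables are not connected
    path, t = [], target_table
    while parents[t] is not None:                 # walk back along the path
        path.append(schema_edges[(parents[t], t)])
        t = parents[t]
    return list(reversed(path))

# example:
# edges = {("orders", "customers"): "orders.customer_id = customers.id",
#          ("customers", "orders"): "orders.customer_id = customers.id"}
# join_conditions(edges, "orders", "customers")
# -> ["orders.customer_id = customers.id"]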
8. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 7, characterized in that: in step five, the execution guidance judgment and SQL statement pruning process executes part of the SQL statement generated during the prediction phase and adjusts the further generation of the SQL statement according to the result of executing that partial statement on the SQL data set, thereby completing SQL statement pruning, limiting the search range of the model in the solution space, and removing search paths that do not meet the requirements; four execution guidance modes are adopted, namely selection clause execution guidance, conditional clause execution guidance, ordering clause execution guidance, and structure execution guidance.
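One simplified, non-limiting reading of the execution guidance in claim 8 is a candidate filter that executes partially generated statements and prunes those that fail or return nothing; the sketch below assumes an SQLite connection and collapses the four guidance modes into a single generic filter.

import sqlite3

# Sketch: run each candidate clause against the database and keep only the
# candidates that execute without error and return at least one row.
def execution_guided_filter(candidate_sqls, connection):
    surviving = []
    for sql in candidate_sqls:
        try:
            rows = connection.execute(sql).fetchall()
        except sqlite3.Error:
            continue                  # malformed candidate: prune it
        if rows:                      # keep only candidates that return data
            surviving.append(sql)
    return surviving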
9. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 8, characterized in that: in step six, the instruction sequence conversion refers to conversion into the JSON statement format, and the question type in the data set is answer retrieval based on conditional query matching.
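The instruction reduction of claim 9 can be illustrated by mapping a simple conditional SELECT back into a JSON-format instruction; the JSON field names below are assumptions, since the filing does not fix a concrete schema.

import json

# Sketch: reduce the pieces of a conditional query to a JSON instruction.
def sql_to_instruction(select_columns, table, conditions):
    instruction = {
        "op": "query",
        "table": table,
        "select": select_columns,
        "where": [{"column": c, "cmp": op, "value": v}
                  for (c, op, v) in conditions],
    }
    return json.dumps(instruction, ensure_ascii=False)

# example: sql_to_instruction(["name"], "employee", [("salary", ">", 5000)])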
10. The semantic understanding-based online translation method from a text sequence to an instruction sequence according to claim 9, characterized in that: a database table set consisting of at least 80 tables is obtained by extracting and constructing tables from the JSON statements; meanwhile, the JSON statements are converted into SQL statements by the data integration module to form a new data set used for model training, validation, and testing.
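The table construction of claim 10 can be sketched by materialising a table description taken from a JSON statement into a relational table; the JSON layout and the use of SQLite are assumptions made for the example.

import sqlite3

# Sketch: build one database table from a JSON table description.
def build_table(connection, table_json):
    # table_json example: {"name": "employee",
    #                      "columns": [["name", "TEXT"], ["salary", "REAL"]],
    #                      "rows": [["Alice", 5200.0], ["Bob", 4800.0]]}
    cols = ", ".join(f"{c} {t}" for c, t in table_json["columns"])
    connection.execute(f"CREATE TABLE {table_json['name']} ({cols})")
    holes = ", ".join("?" for _ in table_json["columns"])
    connection.executemany(
        f"INSERT INTO {table_json['name']} VALUES ({holes})", table_json["rows"])
    connection.commit()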
CN202110453842.2A 2021-04-26 2021-04-26 Semantic understanding-based online translation system and method from text sequence to instruction sequence Pending CN113515955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110453842.2A CN113515955A (en) 2021-04-26 2021-04-26 Semantic understanding-based online translation system and method from text sequence to instruction sequence

Publications (1)

Publication Number Publication Date
CN113515955A true CN113515955A (en) 2021-10-19

Family

ID=78063496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110453842.2A Pending CN113515955A (en) 2021-04-26 2021-04-26 Semantic understanding-based online translation system and method from text sequence to instruction sequence

Country Status (1)

Country Link
CN (1) CN113515955A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160292242A1 (en) * 2015-04-06 2016-10-06 Adp, Llc Automatic verification of json data
US20200210524A1 (en) * 2018-12-28 2020-07-02 Microsoft Technology Licensing, Llc Analytical processing system supporting natural language analytic questions
US20200334233A1 (en) * 2019-04-18 2020-10-22 Sap Se One-shot learning for text-to-sql
US20200334252A1 (en) * 2019-04-18 2020-10-22 Sap Se Clause-wise text-to-sql generation
CN110888897A (en) * 2019-11-12 2020-03-17 杭州世平信息科技有限公司 Method and device for generating SQL (structured query language) statement according to natural language
CN111104423A (en) * 2019-12-18 2020-05-05 北京百度网讯科技有限公司 SQL statement generation method and device, electronic equipment and storage medium
CN111414380A * 2020-03-20 2020-07-14 华泰证券股份有限公司 Chinese database SQL statement generation method, equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
QIN LYU et al.: "Hybrid Ranking Network for Text-to-SQL", arXiv preprint, pages 1-12 *
XIAOJUN XU et al.: "SQLNet: Generating Structured Queries from Natural Language Without Reinforcement Learning", arXiv preprint, pages 1-13 *
卫凤林: "Research on JSON support in the SQL standard" (SQL标准对JSON的支持性研究), 标准化研究, pages 68-71 *
张俊驰; 胡婕; 刘梦赤: "Paraphrase-based Chinese natural language interface" (基于复述的中文自然语言接口), 计算机应用, no. 05, pages 1290-1301 *
曹金超; 黄滔; 陈刚; 吴晓凡; 陈珂: "Research on generating multi-table SQL queries from natural language" (自然语言生成多表SQL查询语句技术研究), 计算机科学与探索, no. 07, pages 57-65 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114640499A * 2022-02-11 2022-06-17 深圳昂楷科技有限公司 Method and device for anomaly identification of user behavior
CN117034968A (en) * 2023-10-10 2023-11-10 中国科学院自动化研究所 Neural machine translation method, device, electronic equipment and medium
CN117034968B (en) * 2023-10-10 2024-02-02 中国科学院自动化研究所 Neural machine translation method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
Chen et al. Semantically conditioned dialog response generation via hierarchical disentangled self-attention
CN109145294B (en) Text entity identification method and device, electronic equipment and storage medium
CN109408526B (en) SQL sentence generation method, device, computer equipment and storage medium
CN109214003B Title generation method using a recurrent neural network based on a multilayer attention mechanism
CN103189860B Machine translation apparatus and method combining a syntax transformation model and a vocabulary transformation model
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111832293B (en) Entity and relation joint extraction method based on head entity prediction
CN110717341B Method and device for constructing a Lao-Chinese bilingual corpus with Thai as pivot
CN113377897B Multilingual medical term standardization system and method based on deep adversarial learning
CN111611814B (en) Neural machine translation method based on similarity perception
CN116991869A (en) Method for automatically generating database query statement based on NLP language model
CN113515955A (en) Semantic understanding-based online translation system and method from text sequence to instruction sequence
CN110929476B (en) Task type multi-round dialogue model construction method based on mixed granularity attention mechanism
CN114091450B (en) Judicial domain relation extraction method and system based on graph convolution network
CN110874535B (en) Dependency relationship alignment component, dependency relationship alignment training method, device and medium
CN112966097A (en) NLP-based marketing company financial news-express automatic generation method and system
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114168754A (en) Relation extraction method based on syntactic dependency and fusion information
CN113779062A (en) SQL statement generation method and device, storage medium and electronic equipment
CN116910086A (en) Database query method and system based on self-attention syntax sensing
CN114218921A (en) Problem semantic matching method for optimizing BERT
CN111382333B (en) Case element extraction method in news text sentence based on case correlation joint learning and graph convolution
CN117349311A (en) Database natural language query method based on improved RetNet
CN115408506B (en) NL2SQL method combining semantic analysis and semantic component matching
CN116975161A (en) Entity relation joint extraction method, equipment and medium of power equipment partial discharge text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211019