CN112783921A

CN112783921A - Database operation method and device

Info

Publication number: CN112783921A
Application number: CN202110100847.7A
Authority: CN
Inventors: 王阳; 邱雪涛; 王宇
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2021-01-26
Filing date: 2021-01-26
Publication date: 2021-05-11

Abstract

The invention discloses a database operation method and device. The method comprises: acquiring a target natural language text; inputting the target natural language text into a natural-language-to-SQL (NL2SQL) prediction model to obtain a target Structured Query Language (SQL) statement for the target natural language text, wherein the data structure of the target SQL statement is a preset data structure, the preset data structure comprises several parts, and the parts correspond to the statement fragments of an SQL statement in a preset format; the prediction model is obtained by machine learning training on a target data set, wherein each piece of training data in the target data set comprises a natural language text and the SQL statement corresponding to that text, and the data structure of that SQL statement is the preset data structure; and operating the target database by executing the target SQL statement.

Description

Database operation method and device

Technical Field

The invention relates to the technical field of databases, in particular to a database operation method and device.

Background

In daily life, users often interact with databases, for example when shopping online, booking tickets, or ordering meals. Current ways of interacting with a database have several problems. A database is generally operated through Structured Query Language (SQL), which is a great challenge for users without database expertise. One current approach provides a dedicated condition-filtering interface on which the user operates the database by clicking different conditions, but such an interface is limited by preset SQL templates: only a few specific SQL statements can be produced, and the approach lacks flexibility.

In the prior art, a friendlier approach is natural language to SQL (NL2SQL) conversion: natural language processing techniques convert the user's natural language expression into an SQL statement, which is then used to query the database directly and return the result. However, current NL2SQL supports only fairly simple application scenarios. For example, the data form and design of the data sets are too simple, multiple columns cannot be selected in an SQL statement, and multiple query conditions cannot be handled. As a result, current NL2SQL can translate only a subset of simpler SQL statements and can perform only simpler operations on the database.

Disclosure of Invention

The invention provides a database operation method and a database operation device, which solve the problem that NL2SQL only supports some simpler application scenes in the prior art.

In a first aspect, the present invention provides a database operation method, including:

acquiring a target natural language text;

inputting the target natural language text into a natural-language-to-SQL (NL2SQL) prediction model to obtain a target Structured Query Language (SQL) statement for the target natural language text;

the data structure of the target SQL statement is a preset data structure; the preset data structure comprises parts; the parts correspond to the statement fragments of the SQL statement in the preset format;

the prediction model is obtained by machine learning training on a target data set; wherein each piece of training data in the target data set comprises a natural language text and the SQL statement corresponding to that text, and the data structure of that SQL statement is the preset data structure;

and operating the target database by executing the target SQL statement.

In this method, the data structure of the database operation statement is a preset data structure that breaks the SQL statement down into parts, each corresponding to a statement fragment of the SQL statement in the preset format, so that every possible form of the SQL statement is covered. The SQL statement in each piece of training data of the target data set also uses the preset data structure, so training on the target data set teaches the prediction model every form of the SQL statement and lets it recognize more complex scenarios. Consequently, even if the acquired target natural language text corresponds to a more complex scenario, it can be converted into the corresponding SQL statement, enabling more complex operations on the database.

Optionally, the prediction model is obtained by training in the following manner:

performing machine learning training on an initial model based on first type data of a natural language text of each training data in the target data set and SQL sentences corresponding to the natural language text to obtain an intermediate model; the first type data of any training data is data selected according to a preset rule in the training data;

performing machine learning training on the intermediate model based on the second class data of the natural language text of each training data and the SQL sentences corresponding to the natural language text to obtain the prediction model; the second type of data is data in the training data other than the first type of data.

In this method, training is staged: the intermediate model is obtained first and the prediction model afterwards. Because the first type of data and the second type of data have different characteristics, each stage learns a different kind of knowledge, so the two types of data do not interfere with each other during training.

Optionally, each of the portions specifically includes a first portion, a second portion, a third portion, and a fourth portion;

the first part is each first column name to be inquired; the second part is an aggregation function; the third part is a relation operator among all query conditions; the fourth part is the query conditions; any query condition comprises the defined second column names, the operators of the second column names and the condition values;

the first type of data comprises: the first part, the second part, the third part, and, for each query condition in the fourth part, the second column name and its operator; the second type of data is specifically the condition value of each second column name in the fourth part.

In the method, the four parts of the preset data structure are matched with the current mainstream SQL statement, and the specific query conditions are taken as one part, so that the centralized processing of the query conditions is facilitated, and the universality of the method is improved.

Optionally, the performing machine learning training on the initial model based on the first type of data of the natural language text of each training data in the target data set and the SQL statement corresponding to the natural language text to obtain an intermediate model includes:

for any round of machine learning training, inputting the first type of data of the natural language text of any training data in the target data set, together with the SQL statement corresponding to the natural language text, into the initial model to obtain predicted data for the first type of data in a predicted SQL statement;

according to the SQL sentences corresponding to the natural language texts and the prediction data of the first type of data, obtaining a function value of a first loss function, a function value of a second loss function and a function value of a third loss function;

the first loss function characterizes the degree of difference, for the first part and the second part, in the predicted data of the first type of data; the second loss function characterizes the degree of difference, for the third part, in the predicted data of the first type of data; the third loss function characterizes the degree of difference, for the column names to be operated on and their operators in the query conditions of the fourth part, in the predicted data of the first type of data;

if the initial model does not meet a first preset convergence condition, adjusting parameters of the initial model according to the function value of the first loss function, the function value of the second loss function and the function value of the third loss function, and continuing to iteratively train the initial model;

and if the initial model meets the first preset convergence condition, taking the initial model at the moment as the intermediate model.

In the above manner, the values of the first part, the second part, the third part, and the second column names and their operators in the query conditions of the fourth part can all be enumerated. Training them jointly lets the intermediate model learn knowledge of similar data at the same time, and because every parameter update refers to the function values of all three loss functions, the intermediate model is learned more accurately.

Optionally, the performing machine learning training on the intermediate model based on the second type of data of the natural language text of each training data and the SQL statement corresponding to the natural language text to obtain the prediction model includes:

for any round of machine learning training, inputting the second type of data of the natural language text of any training data in the target data set, together with the SQL statement corresponding to the natural language text, into the intermediate model to obtain predicted data for the second type of data in a predicted SQL statement;

obtaining a function value of a fourth loss function according to the SQL sentence corresponding to the natural language text and the prediction data of the second type of data;

the fourth loss function is used for representing the difference degree of the condition values in the query conditions of the fourth part in the predicted data of the second class of data;

if the intermediate model does not meet a second preset convergence condition, adjusting parameters of the intermediate model according to a function value of the fourth loss function, and continuing to iteratively train the intermediate model;

and if the intermediate model meets the second preset convergence condition, taking the intermediate model at the moment as the prediction model.

In the foregoing manner, because the condition values of the second column names in the fourth part cannot be enumerated, they are trained separately after the intermediate model has been trained. The prediction model can thus learn the related knowledge and be learned more accurately.

Optionally, for any training data of the target data set, the training data is obtained by performing normalization processing in the following manner:

matching the natural language text in the original data of the training data against a regular expression for a preset data type, and, if a substring of the natural language text matches the regular expression, replacing the substring with the corresponding regular matching result; the regular matching result is a preset standardized expression text.

In this manner, similar texts are converted into standardized expression texts through regular expression matching, so that the collected data is more accurate and the trained prediction model is more accurate.

Optionally, for at least one piece of training data in the target data set, the at least one piece of training data is obtained by performing desensitization processing in the following manner:

converting, according to a preset data conversion rule and the data type of at least one piece of original data of the at least one piece of training data, the corresponding data to be converted in the at least one piece of original data into obfuscated data; the difference between the data characteristics of the data to be converted and the data characteristics of the obfuscated data meets a preset data difference standard;

and obtaining the at least one piece of training data according to the at least one piece of original data and the obfuscated data.

In the above manner, the data to be converted in the at least one piece of original data is converted into obfuscated data according to a preset data conversion rule, and the difference between the data characteristics of the data to be converted and those of the obfuscated data meets a preset data difference standard. After obfuscation, the at least one piece of training data therefore offers stronger privacy while still meeting the preset data difference standard.

In a second aspect, the present invention provides a database operating apparatus, including:

the acquisition module is used for acquiring a target natural language text, and for inputting the target natural language text into a natural-language-to-SQL (NL2SQL) prediction model to obtain a target Structured Query Language (SQL) statement for the target natural language text;

and the execution module is used for operating the target database by executing the target SQL statement.

Optionally, the obtaining module is specifically configured to: training to obtain the prediction model according to the following modes:

Optionally, the obtaining module is specifically configured to:

Optionally, the obtaining module is further configured to:

aiming at any training data of the target data set, carrying out standardization processing according to the following mode to obtain the training data:

Optionally, the obtaining module is further configured to:

for at least one piece of training data in the target data set, performing desensitization treatment according to the following mode to obtain the at least one piece of training data:

The advantageous effects of the second aspect and the various optional apparatuses of the second aspect may refer to the advantageous effects of the first aspect and the various optional methods of the first aspect, and are not described herein again.

In a third aspect, the present invention provides a computer device comprising a program or instructions for performing the method of the first aspect and the alternatives of the first aspect when the program or instructions are executed.

In a fourth aspect, the present invention provides a storage medium comprising a program or instructions which, when executed, is adapted to perform the method of the first aspect and the alternatives of the first aspect.

These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.

Fig. 1 is a schematic flowchart illustrating steps corresponding to a database operation method according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a specific process corresponding to a database operation method according to an embodiment of the present invention;

fig. 3 is a schematic diagram of a system architecture corresponding to a database operation method according to an embodiment of the present invention;

fig. 4 is a schematic diagram of a possible model structure in an intermediate model training process in a database operation method according to an embodiment of the present invention;

fig. 5 is a schematic diagram of a possible model structure in a prediction model training process in a database operation method according to an embodiment of the present invention;

fig. 6 is a schematic diagram illustrating an experimental result of a prediction model in a database operation method according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of a database operating apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, an embodiment of the present invention provides a database operation method.

Step 101: acquire a target natural language text.

Step 102: input the target natural language text into a natural-language-to-SQL (NL2SQL) prediction model to obtain a target Structured Query Language (SQL) statement for the target natural language text.

Step 103: operate the target database by executing the target SQL statement.

It should be noted that, in steps 101 to 103, the data structure of the target SQL statement is a preset data structure; the preset data structure comprises several parts; the parts correspond to the statement fragments of an SQL statement in the preset format; the prediction model is obtained by machine learning training on a target data set; wherein each piece of training data in the target data set comprises a natural language text and the SQL statement corresponding to that text, and the data structure of that SQL statement is the preset data structure.

It should be noted that, for various types of target databases, the SQL statements have a standardized preset format.
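Steps 101 to 103 can be sketched as follows. This is a minimal illustration only: `predict_sql` is a hypothetical stand-in for the trained prediction model and simply returns a fixed statement, and the `trades` table, its columns, and the use of an in-memory SQLite database are assumptions made for the example.

```python
import sqlite3

def predict_sql(text):
    """Hypothetical stand-in for the trained NL2SQL prediction model."""
    # A real model would derive the statement from `text`; this stub is fixed.
    return "SELECT SUM(amount) FROM trades WHERE year = 2019"

def run_query(conn, text):
    sql = predict_sql(text)              # Step 102: text -> target SQL statement
    return conn.execute(sql).fetchall()  # Step 103: operate the target database

# Assumed example database (not from the source).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (year INTEGER, amount REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [(2019, 10.0), (2019, 5.0), (2020, 7.0)])

# Step 101: acquire the target natural language text.
rows = run_query(conn, "total transaction amount in 2019")
```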

In an alternative embodiment, each of the portions specifically includes a first portion, a second portion, a third portion, and a fourth portion;

the first part is each first column name to be inquired; the second part is an aggregation function; the third part is a relation operator among all query conditions; the fourth part is the query conditions; any query condition includes the defined second column names, the operators for the second column names, and the condition values.

The parts may be set according to the specific application scenario. For example, the fourth part may be set to the second column names and their operators, and a fifth part may be set to the condition values of the second column names.

It should be noted that the training data used in steps 101 to 103 may be preprocessed:

in an alternative embodiment, for any training data of the target data set, the training data is obtained by performing normalization in the following manner:

For example, if the standardized expression text is 2019, the regular expression corresponding to 2019 can standardize variants such as "year 19", the same year written in Chinese numerals, and "year 2019" to 2019.

In an alternative embodiment, for at least one piece of training data in the target data set, the at least one piece of training data is desensitized according to the following method:

For example, the values of a column in the at least one piece of original data are averaged, and the average serves as the obfuscated data for that column, so that the column's information is protected without deviating too much from the original data.

The two alternative embodiments of the training data preprocessing described above may be used in combination, or one of them may be selected.

It should be noted that, in steps 101 to 103, the prediction model may be trained in several ways. For example, it may be trained directly on the training data of the target data set, with each training pass using the complete training data and adjusting the parameters; or it may be trained in stages on parts of the training data, first training an intermediate model according to the data characteristics of one part and then training further to obtain the prediction model.

In an alternative embodiment, the predictive model is trained in the following way:

Step (1): perform machine learning training on the initial model based on the first type of data of the natural language text of each piece of training data in the target data set and the SQL statement corresponding to the natural language text, to obtain an intermediate model.

The first type of data of any training data is data selected according to a preset rule in the training data.

Step (2): perform machine learning training on the intermediate model based on the second type of data of the natural language text of each piece of training data and the SQL statement corresponding to the natural language text, to obtain the prediction model.

The second type of data is data in the training data other than the first type of data.

In the above alternative embodiment, the first and second types of data may be assigned in several ways:

one possible case is that the first type of data comprises: the first part, the second part, the third part, and, for each query condition in the fourth part, the second column name and its operator; the second type of data is specifically the condition value of each second column name in the fourth part.

Another possible case is that the second type of data comprises: the first part, the second part, the third part, and, for each query condition in the fourth part, the second column name and its operator; the first type of data is specifically the condition value of each second column name in the fourth part.

Other combinations of situations are also possible, and the setting can be flexibly set according to the scene.
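The first possible case above can be illustrated with a short sketch that splits one annotated SQL label into first-type and second-type data. The key names follow the annotation format given later in this description; the function name `split_label`, the example column names, and the example values are hypothetical.

```python
def split_label(sql):
    """Split one annotated SQL label into first-type and second-type data."""
    first = {
        "sel": sql["sel"],                    # first part: columns to query
        "agg": sql["agg"],                    # second part: aggregation functions
        "cond_conn_op": sql["cond_conn_op"],  # third part: connector between conditions
        # fourth part, without the values: (column, operator) of each condition
        "cond_sel_op": [(col, op) for col, op, _ in sql["conds"]],
    }
    # second-type data: only the condition values, which cannot be enumerated
    second = [val for _, _, val in sql["conds"]]
    return first, second

# Assumed example label (not from the source).
label = {"sel": ["amount"], "agg": [5], "cond_conn_op": 1,
         "conds": [["year", 2, "2019"], ["amount", 0, "100"]]}
first_type, second_type = split_label(label)
```

In this sketch the first-type data would drive the training of the intermediate model, and the second-type data the subsequent training of the prediction model.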

In an alternative embodiment, step (1) may specifically be:

In an alternative embodiment, step (2) may specifically be:

A database operation method provided by the embodiment of the present invention is described in detail below with reference to fig. 2.

As can be seen from fig. 2, the method proposed in this embodiment includes four steps:

step one, constructing a Chinese NL2SQL data set:

for the relevant data tables, data desensitization is performed using methods such as replacement, shuffling, averaging, and offsetting; SQL query statements are written manually for the desensitized data and annotated in a fixed format, completing the Chinese NL2SQL data set.

Step two, data standardization treatment:

for the various numeric forms in Chinese natural language expressions, including percentages and decimals, and for colloquial date expressions, the text is converted into a standard form so that values can be extracted and matched later.

Step three, building an NL2SQL algorithm model:

the SQL generation task is decomposed. The selected columns and their aggregation functions are predicted together, and the condition columns and the operators of each condition are predicted together; overall this yields three multi-class classification problems, which can be handled by the same intermediate model for joint prediction. Condition value prediction involves extracting the condition values and is handled separately by the prediction model.

Step four, applying NL2SQL:

the trained prediction model can then be used; for example, a visual report can be generated through the prediction model so that the analysis result is displayed directly.

Further, a database operation method provided by the embodiment of the present invention is described in detail below with reference to fig. 3. As can be seen from fig. 3, a database operation method and application based on NL2SQL technology includes the following steps:

step one, constructing a Chinese NL2SQL data set:

step (1.1) data desensitization:

for each piece of original data in the obtained original data set, the sensitive data columns are desensitized by methods such as replacement, shuffling, averaging, and offsetting.

Replacement: real data is replaced with fictitious data. A dictionary data table is established that records each real value together with a generated random factor, and the original data content is replaced with the dictionary content.

Shuffling: the values of the sensitive data column are randomly redistributed, which breaks the relationship between the original values and the other fields.

Averaging: for numerical data, the mean is calculated and the desensitized values are then randomly distributed around the mean, keeping the sum of the data constant.

Offsetting: numerical data is changed by a random shift.
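Three of the desensitization methods above can be sketched as follows. The function names, the `spread` and `max_shift` parameters, and the uniform noise distribution are assumptions for illustration; the source does not specify how the random values are drawn. The replacement method is omitted because it depends on a dictionary table whose construction is not detailed.

```python
import random

def shuffle_column(values, rng):
    """Shuffling: reassign the column's values to rows at random."""
    out = list(values)
    rng.shuffle(out)
    return out

def mean_preserving(values, rng, spread=1.0):
    """Averaging: scatter values around the mean while keeping the sum unchanged."""
    mean = sum(values) / len(values)
    noise = [rng.uniform(-spread, spread) for _ in values]
    shift = sum(noise) / len(noise)  # re-centre the noise so that it sums to zero
    return [mean + n - shift for n in noise]

def offset(values, rng, max_shift=5):
    """Offsetting: move each number by a random integer shift."""
    return [v + rng.randint(-max_shift, max_shift) for v in values]

rng = random.Random(0)                # fixed seed so the example is repeatable
amounts = [120.0, 80.0, 100.0, 60.0]  # assumed sensitive column
shuffled = shuffle_column(amounts, rng)
averaged = mean_preserving(amounts, rng)
shifted = offset(amounts, rng)
```

Re-centring the noise is one simple way to keep the column sum constant up to floating-point error; other schemes would satisfy the same description.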

Step (1.2) data annotation:

the desensitized data is annotated: SQL statements are first written manually and then annotated in the following format (the preset format of the SQL statement):

"SQL":{"sel":[a],"agg":[b],"cond_conn_op":c,"conds":[[d,e,f]]}.

Here a is the first part, i.e. the first column names; b is the second part, i.e. the aggregation functions; c is the third part, i.e. the relational operator between query conditions; and [d, e, f] is the fourth part, where d is cond_sel, e is cond_op, and f is cond_val, i.e. the second column name, the operator of the second column name, and the condition value, respectively.

The labeling rules are as follows:

operators of the second column names in the conditions: op_sql_dict = {0: ">", 1: "<", 2: "==", 3: "!="};

aggregation functions: agg_sql_dict = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"};

relational operators between query conditions: conn_sql_dict = {0: "", 1: "and", 2: "or"};
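To make the annotation format concrete, the following sketch renders one annotated label as an SQL string using the three labeling dictionaries. It is an illustration under assumptions: `label_to_sql`, the table name `trades`, and the example columns are hypothetical, condition values are quoted naively as strings, and the "==" operator may need to be mapped to "=" for databases that do not accept it.

```python
OP_SQL = {0: ">", 1: "<", 2: "==", 3: "!="}
AGG_SQL = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"}
CONN_SQL = {0: "", 1: "and", 2: "or"}

def label_to_sql(label, table):
    """Render a {"sel", "agg", "cond_conn_op", "conds"} label as an SQL string."""
    cols = []
    for col, agg in zip(label["sel"], label["agg"]):
        func = AGG_SQL[agg]
        cols.append(f"{func}({col})" if func else col)
    sql = "SELECT " + ", ".join(cols) + " FROM " + table
    # Each condition is a [column, operator code, value] triple.
    conds = [f"{col} {OP_SQL[op]} '{val}'" for col, op, val in label["conds"]]
    if conds:
        joiner = " " + CONN_SQL[label["cond_conn_op"]] + " "
        sql += " WHERE " + joiner.join(conds)
    return sql

# Assumed example label (not from the source).
label = {"sel": ["amount"], "agg": [5], "cond_conn_op": 1,
         "conds": [["year", 2, "2019"], ["channel", 3, "card"]]}
sql = label_to_sql(label, "trades")
```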

Using this method, a total of 1000 pieces of data relating to payment tools, branch transactions, and bank cards were constructed.

Step two, data standardization treatment:

because Chinese natural language expressions are too complex for SQL queries, especially the various numeric forms such as percentages and decimals, they need to be converted into a standard form to facilitate the later extraction and matching of values. Numbers, dates, and the like can be standardized with regular expressions.

Step (2.1) collecting common modifiers before and after numbers:

for numbers, the common prefixes and suffixes may be collected first.

The common prefixes include: 'greater than', 'equal to', 'less than', 'more than', 'together', 'up to', 'above', 'less than', 'high enough', 'less than', 'up to', 'less than', 'in total', 'ranked', 'nominal', 'broken', 'every strand', 'over', 'more than', 'old', 'recent', 'still monthly', 'behind the', 'front', 'unbroken', 'below', 'charge', 'more than', 'in need', 'every flat', etc.

Common suffixes include: 'above', 'gram', 'block', 'square meter', 'multiple person', 'each plane', 'each of two', 'hair', 'month', 'lane', 'milliliter', 'level', 'more than one hundred billion', 'square meter', 'below one yuan', 'more than one dollar', etc.

Step (2.2) Chinese digital standardization:

after the surrounding prefixes and suffixes have helped locate a number, it is matched with a regular expression:

Percentage: r'(百分之[一|二|三|四|五|六|七|八|九|十|百|千|万|亿|零|0|1|2|3|4|5|6|7|8|9|\\.]+)', where 百分之 is the Chinese prefix meaning "percent".

For purely Chinese expressions, the cn2an function is used to convert the Chinese numerals to Arabic numerals.
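The percentage normalization can be sketched with the standard library alone. This is a simplified illustration, not the method's actual implementation: `cn_to_int` is a hypothetical helper that only handles Chinese numerals below 100 and stands in for the cn2an conversion named above, and the output form "N%" is an assumed standard form.

```python
import re

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
# 百分之 followed by either simple Chinese numerals or an Arabic number.
PERCENT = re.compile(r"百分之([零一二三四五六七八九十]+|[0-9]+(?:\.[0-9]+)?)")

def cn_to_int(s):
    """Convert a Chinese numeral below 100 (e.g. 二十五) to an int."""
    if "十" in s:
        tens, _, ones = s.partition("十")
        return (CN_DIGITS[tens] if tens else 1) * 10 + (CN_DIGITS[ones] if ones else 0)
    return int("".join(str(CN_DIGITS[ch]) for ch in s))

def normalize_percent(text):
    def repl(m):
        num = m.group(1)
        if not num[0].isascii():       # Chinese numerals need converting first
            try:
                num = str(cn_to_int(num))
            except KeyError:
                return m.group(0)      # unsupported form: leave the text untouched
        return num + "%"
    return PERCENT.sub(repl, text)

a = normalize_percent("增长百分之二十五")
b = normalize_percent("超过百分之5.5")
```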

Step (2.3) date standardization:

for colloquial expressions such as "look up the business volume for year 19", "year 19" is normalized to 2019. Similar regular expression matching is used:

numeric year, month and day: '(\\d+)(\\-|…)(\\d+)…';

numeric year and month: '(\\d+)…(\\d+)';

with Chinese characters, year, month and day: '(\\d{2,4})年…';

with Chinese characters, year and month: '(\\d{2,4})年…';

with Chinese characters, year only: '(\\d{2,4})年';

From these matches, a standard time expression can be extracted.
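The year-only case can be sketched as follows. The assumption that a two-digit year belongs to the 2000s (the `century` parameter) is made for this example only; the source gives just the single case of 19 becoming 2019. The function name `normalize_year` is hypothetical.

```python
import re

# A 2-4 digit number directly followed by 年 (year), not preceded by another digit.
YEAR = re.compile(r"(?<![0-9])([0-9]{2,4})年")

def normalize_year(text, century="20"):
    def repl(m):
        year = m.group(1)
        if len(year) == 2:          # colloquial two-digit year, e.g. 19年
            year = century + year   # assumed to mean 20xx
        return year + "年"
    return YEAR.sub(repl, text)

short_form = normalize_year("查一下19年的交易量")
full_form = normalize_year("查一下2019年的交易量")
```

Both inputs end up with the same standardized year, which is what makes later value matching against the table straightforward.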

Step three, building a data query model based on NL2SQL:

the problem can first be defined simply: given a database table and a related question (the query, i.e. the natural language text), the machine automatically converts the query into a machine-executable SQL statement.

The SQL statement structure (the preset data structure) is as follows: sel is a list representing the columns selected by the SELECT clause; agg is a list corresponding one-to-one with sel and indicating which aggregation operation is applied to each column, such as sum, max, or min; conds is a list representing the conditions in the WHERE clause, each condition being a triple (condition column cond_sel, condition operator cond_op, condition value cond_val); cond_conn_op is an int representing the relationship connecting the conditions in conds, which may be "and" or "or".

Step (3.1) task merging:

the method decomposes the task into subtasks. First, sel and agg are merged into sel_agg and predicted together, forming a multi-class classification problem for each column.

The categories are ["", "AVG", "MAX", "MIN", "COUNT", "SUM", "NONE"], where "NONE" indicates that the column is not selected.

For each condition, cond_sel and cond_op are merged into cond_sel_op and predicted together, again forming a multi-class classification problem for each column. The categories are [">", "<", "==", "!=", "NONE"], where "NONE" indicates that the column is not selected.

cond_conn_op and cond_val remain unchanged as separately predicted subtasks.

In general, cond_conn_op, sel_agg, and cond_sel_op can each be regarded as a multi-class classification problem, so they are incorporated into a unified model for joint prediction; cond_val involves extracting condition values and is predicted by a separate model.
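The task merging can be sketched as a conversion from one annotated label to per-column class targets. The function name `merge_tasks` and the example headers are hypothetical, and the sketch assumes each column appears in at most one condition (a later condition on the same column would overwrite the earlier one).

```python
AGG = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"}
OP = {0: ">", 1: "<", 2: "==", 3: "!="}

def merge_tasks(label, headers):
    """Build per-column sel_agg and cond_sel_op targets from one label."""
    # Every column starts as "NONE", i.e. not selected.
    sel_agg = {h: "NONE" for h in headers}
    for col, agg in zip(label["sel"], label["agg"]):
        sel_agg[col] = AGG[agg]
    cond_sel_op = {h: "NONE" for h in headers}
    for col, op, _ in label["conds"]:   # the condition value is handled separately
        cond_sel_op[col] = OP[op]
    return sel_agg, cond_sel_op

# Assumed example table headers and label (not from the source).
headers = ["year", "channel", "amount"]
label = {"sel": ["amount"], "agg": [5], "cond_conn_op": 0,
         "conds": [["year", 2, "2019"]]}
sel_agg, cond_sel_op = merge_tasks(label, headers)
```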

Step (3.2) training of an intermediate model:

it should be noted that the overall structure of the intermediate model is shown in fig. 4.

The model input concatenates the user query with the table headers. After the BERT pre-training layer, [CLS] yields the vector representation of the whole sentence, and the [Str] or [Real] token in front of each column yields the vector representation of that column, where [Str] marks a string-type header and [Real] marks a numeric-type header.
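The input layout can be sketched at the string level as follows. This is an illustration under assumptions: the use of [SEP] as a separator and the function name `build_input` are not stated in the source, and a real system would pass the sequence through a BERT tokenizer rather than joining strings.

```python
def build_input(query, headers):
    """Concatenate the query with typed headers.

    headers: list of (name, kind) pairs, where kind is "real" for numeric
    columns and anything else for string columns.
    """
    tokens = ["[CLS]", query, "[SEP]"]
    for name, kind in headers:
        # A type marker precedes each column name.
        tokens += ["[Real]" if kind == "real" else "[Str]", name, "[SEP]"]
    return " ".join(tokens)

# Assumed example query and headers (not from the source).
sequence = build_input("total amount in 2019",
                       [("year", "real"), ("channel", "text")])
```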

For cond _ conn _ op, the inter-conditional operator in where is predicted. The [ cls ] output is classified 3 by the fully connected (dense) layer, and is "" and "" or "" respectively, which indicates that there is only one condition.

For sel_agg, the selected columns and their aggregation functions are predicted. The [Str]/[Real] output of each column is fed to a dense layer for 7-way classification over "", "AVG", "MAX", "MIN", "SUM", "COUNT", and "NONE".

For cond_sel_op, the columns used in the WHERE clause and their operators are predicted. Similarly, the [Str]/[Real] output of each column is fed to a dense layer for 5-way classification over ">", "<", "==", "!=", and "NONE".

The loss functions of the three classification tasks, i.e., the first, second, and third loss functions, may be trained jointly. The intermediate model (BERT) is trained over multiple epochs on the target data set. This yields the selected columns, aggregation functions, WHERE condition columns, condition operators, and inter-condition connector of the SQL statement. The condition values are predicted next:
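One possible way to train the three losses jointly is to add them into a single objective. The sketch below assumes cross-entropy for each task, an unweighted sum, and averaging over columns; none of these choices is stated in the text, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[target]


def mean_cross_entropy(rows: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Average cross-entropy over the columns of one table."""
    losses = [cross_entropy(logits, t) for logits, t in zip(rows, targets)]
    return sum(losses) / len(losses)


def joint_loss(
    conn_logits: Sequence[float], conn_target: int,
    sel_agg_logits: Sequence[Sequence[float]], sel_agg_targets: Sequence[int],
    cond_logits: Sequence[Sequence[float]], cond_targets: Sequence[int],
) -> float:
    first = cross_entropy(conn_logits, conn_target)                 # cond_conn_op, 3 classes
    second = mean_cross_entropy(sel_agg_logits, sel_agg_targets)    # sel_agg, 7 classes per column
    third = mean_cross_entropy(cond_logits, cond_targets)           # cond_sel_op, 5 classes per column
    return first + second + third  # assumed unweighted sum
```

Because all three heads share the same encoder, minimizing the summed loss updates the shared representation for every task at once.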

Step (3.3) training of the prediction model:

The overall structure of the model is shown in Fig. 5.

In step (3.2), the selected columns and the operators in the conditions were obtained. The condition values can be predicted as follows: using enumeration, every possible [cond_sel, cond_op, cond_value] triple is taken as a candidate, and each candidate is concatenated with the query separately to form an input.

For non-numeric columns, cond_value is taken directly from the table contents; for numeric columns, cond_value is extracted directly from the query using a regular expression.

As before, the [CLS] output yields a vector representation of the whole sentence, which is fed to a dense layer for binary classification. This is done for all candidates, and the candidate with the highest probability is selected, which determines cond_value. At this point the NL2SQL model is complete, and all components of the SQL statement are determined.
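The enumeration and selection described in step (3.3) can be sketched as below. The function names, the plain digit-matching regular expression, and the `score` callback (which stands in for the trained binary classifier) are assumptions for illustration only.

```python
import re
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[int, int, str]  # (cond_sel, cond_op, cond_value)


def enumerate_candidates(
    query: str,
    predicted: Sequence[Tuple[int, int]],
    column_types: Sequence[str],
    rows: Sequence[Sequence[object]],
) -> List[List[Candidate]]:
    """List the candidate values for each predicted (cond_sel, cond_op) pair."""
    groups = []
    for col, op in predicted:
        if column_types[col] == "real":
            # Numeric column: pull numbers out of the query text.
            values = re.findall(r"\d+(?:\.\d+)?", query)
        else:
            # Non-numeric column: take the distinct values found in the table.
            values = list(dict.fromkeys(str(row[col]) for row in rows))
        groups.append([(col, op, value) for value in values])
    return groups


def pick_values(
    groups: Sequence[Sequence[Candidate]], score: Callable[[Candidate], float]
) -> List[Candidate]:
    """Keep the highest-scoring candidate of each non-empty group."""
    return [max(group, key=score) for group in groups if group]
```

In a full system, `score` would concatenate the candidate with the query, run the encoder, and return the probability produced by the binary classifier; here it is left to the caller.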

Step (3.4), experiment:

As shown in Fig. 6, experiments were performed using the Chinese NL2SQL data set constructed in step one. The data were split in a 7:3 ratio, with 70% as the training set and 30% as the test set. The test set results are as follows:

It can be seen that the prediction accuracy for the condition connector is about 95%, the prediction accuracy for the aggregation function and selected column is about 89%, and the overall accuracy of the condition part is about 77%.
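The 7:3 division used in the experiment can be sketched as a simple split. Shuffling before the split, the fixed seed, and the function name `split_dataset` are assumptions; the text only gives the ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T], train_ratio: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 1000 samples this gives 700 training items and 300 test items, with no item appearing in both parts.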

Step four, NL2SQL scene application:

The NL2SQL-based prediction model can be applied to data reports, providing a visual display of various operating indicators for a branch company and helping it follow its operating dynamics anytime and anywhere.

In general, through steps one to four above, a user's voice question can be converted into text and then processed using the data standardization method described above. The trained NL2SQL model then converts the text into an SQL statement, which directly queries the tables of the data warehouse. The returned result can be displayed visually through the data report.

Further, the following describes in detail a flow of the database operation method and application of the method of steps 101 to 103, which is executed on a computer based on the NL2SQL technology, and may correspond to steps one to four shown in fig. 2. The following describes the implementation process of dataset construction, normalization process, NL2SQL model construction and scenario application step by step following the sequence of steps in the scenario shown in fig. 2.

NL2SQL dataset construction:

the "branch trade situation" query scenario is chosen here, and the raw data table is as follows:

Division full name	Transaction amount	Number of transactions	Active cards	Cumulative live cards	Date
Shanghai division Co Ltd	36523255	253514	35572	8521866	20200318
Zhejiang division Co Ltd	26824277	153514	22562	7541476	20200318
Shanghai division Co Ltd	36343255	143514	24868	9821496	20200322

The columns from "transaction amount" to "cumulative live card" are subjected to data desensitization by means of average value taking.

Next, the SQL statement is constructed manually, for example, as follows:

As can be seen, standardized SQL labeling is performed for the question "What was the transaction amount of the Shanghai division on March 18, 2020?". "table_id" identifies the table corresponding to the sentence according to a predefined rule; "question" holds the question; "sel": 1 indicates that the selected column is "transaction amount"; "agg": 0 indicates that the aggregation function is empty, i.e., unused; "cond_conn_op": 1 indicates that the operator between conditions is "and"; "conds": [[5, 2, "20200318"], [0, 2, "Shanghai division"]] indicates that there are two conditions, the first being "date" == "20200318" and the second being "division full name" == "Shanghai division".
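The labeled record described above can be written out and decoded as in the following sketch. The `table_id` value, the English column names, and the helper `describe` are hypothetical. The code tables follow the categories listed in step (3.1); only the codes used in this example (aggregation 0, operator 2, connector 1) are confirmed by the text, and the other positions are assumed.

```python
from typing import Dict, List

AGG = ["", "AVG", "MAX", "MIN", "COUNT", "SUM"]  # aggregation codes (order partly assumed)
OPS = [">", "<", "==", "!="]                     # condition operator codes (order partly assumed)
CONN = ["", "and", "or"]                         # inter-condition connector codes

HEADERS = [
    "division full name", "transaction amount", "number of transactions",
    "active cards", "cumulative live cards", "date",
]

record = {
    "table_id": "example_table",  # hypothetical identifier
    "question": "What was the transaction amount of the Shanghai division on March 18, 2020?",
    "sql": {
        "sel": [1],
        "agg": [0],
        "cond_conn_op": 1,
        "conds": [[5, 2, "20200318"], [0, 2, "Shanghai division"]],
    },
}


def describe(sql: Dict, headers: List[str]) -> str:
    """Render the coded structure as a readable, SQL-like string."""
    columns = []
    for col, agg in zip(sql["sel"], sql["agg"]):
        name = headers[col]
        columns.append(f"{AGG[agg]}({name})" if AGG[agg] else name)
    conditions = [f'{headers[c]} {OPS[o]} "{v}"' for c, o, v in sql["conds"]]
    text = "SELECT " + ", ".join(columns)
    if conditions:
        text += " WHERE " + f" {CONN[sql['cond_conn_op']]} ".join(conditions)
    return text


print(describe(record["sql"], HEADERS))
```

Storing column and operator indices rather than names keeps every record in the same shape, so each field maps directly onto one of the classification targets.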

Following this method, 1000 pieces of training data relating to payment tools, branch transaction conditions, and bank cards were constructed.

Data standardization processing:

Take the query "which branch companies have transaction amount greater than thirty million on March 18, 20" as an example. The user's voice question is first converted into text. Number normalization is then performed: a regular expression, aided by the prefix "greater than", locates "thirty million", which the cn2an function converts directly to 30000000. Date normalization is then performed by extracting the year, month, and day from the date expression.

After normalization, the query becomes "which branch companies have a transaction amount greater than 30000000 on 20200318".
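The normalization step can be illustrated with a standard-library sketch. The Chinese query string is an assumed reconstruction of the translated example, `chinese_to_int` is a simplified stand-in for the cn2an conversion rather than that library's implementation, and expanding a two-digit year to 20xx is also an assumption.

```python
import re

DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10**4, "亿": 10**8}
NUMERALS = "".join(DIGITS) + "".join(SMALL_UNITS) + "".join(BIG_UNITS)


def chinese_to_int(text: str) -> int:
    """Convert a simple Chinese numeral such as 三千万 to an integer."""
    total, section, digit = 0, 0, 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            total += (section + digit) * BIG_UNITS[ch]
            section, digit = 0, 0
        else:
            raise ValueError(f"unsupported character: {ch}")
    return total + section + digit


DATE_RE = re.compile(r"(\d{4}|\d{2})年(\d{1,2})月(\d{1,2})日")
COMPARE_RE = re.compile(rf"(大于|小于)([{NUMERALS}]+)")  # prefix-assisted number location


def _format_date(match: "re.Match[str]") -> str:
    year = int(match.group(1))
    if year < 100:
        year += 2000  # assumption: two-digit years mean 20xx
    return f"{year:04d}{int(match.group(2)):02d}{int(match.group(3)):02d}"


def normalize_query(text: str) -> str:
    """Normalize dates to YYYYMMDD and compared amounts to Arabic digits."""
    text = DATE_RE.sub(_format_date, text)
    return COMPARE_RE.sub(lambda m: m.group(1) + str(chinese_to_int(m.group(2))), text)
```

Anchoring the number pattern on a comparison prefix avoids rewriting numeral characters that happen to appear in other words of the query.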

Target SQL statement conversion:

In the previous step, the NL2SQL-based prediction models, comprising the joint prediction model and the condition value prediction model, were trained using a large amount of data. First, the query statement is concatenated with the table header and input into the joint prediction model. This predicts the selected column, aggregation function, WHERE condition columns, condition operators, and inter-condition connector, namely:

"SQL":{"sel":[0],"agg":[0],"cond_conn_op":1,"conds":[[5,2,"xxx"],[1,0,"xxx"]]}.

The condition values are then predicted. Using enumeration, all possible [cond_sel, cond_op, cond_value] triples are listed:

a. "date == 20200318" and "date == 20200322".

b. "transaction amount > 30000000".

The date column is non-numeric, so its values are taken directly from the table data for enumeration; the transaction amount column is numeric, so its value is extracted from the question using a regular expression.

Each candidate is concatenated with the query separately to form an input. As before, the [CLS] output yields a vector representation of the whole sentence, which is fed to a dense layer for binary classification. This is done for all candidates, and the candidate with the highest probability is selected, which determines cond_value. In this example, for a. "date == 20200318" is selected, and for b. the only candidate, "transaction amount > 30000000", is selected. The final SQL statement is "SELECT division full name WHERE date == 20200318 AND transaction amount > 30000000".

SQL statement query:

In this step, the SQL statement converted from the natural language statement is executed directly against the database table to obtain the result, namely the Shanghai division, which is displayed in visual form. This completes the entire NL2SQL flow.
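The query step can be reproduced on the sample rows with an in-memory SQLite database. The table name, column names, and the use of SQLite are assumptions for illustration; the text does not name the database or its schema.

```python
import sqlite3
from typing import List

ROWS = [
    ("Shanghai division Co Ltd", 36523255, 253514, 35572, 8521866, "20200318"),
    ("Zhejiang division Co Ltd", 26824277, 153514, 22562, 7541476, "20200318"),
    ("Shanghai division Co Ltd", 36343255, 143514, 24868, 9821496, "20200322"),
]


def run_example_query() -> List[str]:
    """Load the sample rows and run the converted SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE division_transactions ("
            "full_name TEXT, amount INTEGER, tx_count INTEGER, "
            "active_cards INTEGER, cumulative_cards INTEGER, trade_date TEXT)"
        )
        conn.executemany(
            "INSERT INTO division_transactions VALUES (?, ?, ?, ?, ?, ?)", ROWS
        )
        # Hypothetical executable form of the final statement from the example.
        cursor = conn.execute(
            "SELECT full_name FROM division_transactions "
            "WHERE trade_date = ? AND amount > ?",
            ("20200318", 30000000),
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


print(run_example_query())
```

Only the first Shanghai row satisfies both conditions: the Zhejiang row falls below the amount threshold, and the second Shanghai row has a different date.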

As shown in fig. 7, the present invention provides a database operating apparatus, including:

an obtaining module 701, configured to obtain a target natural language text, and to input the target natural language text into a natural-language-to-SQL (NL2SQL) prediction model to obtain a target structured query language (SQL) statement for the target natural language text;

an executing module 702, configured to operate the target database by executing the target SQL statement.

Optionally, the obtaining module 701 is specifically configured to: training to obtain the prediction model according to the following modes:

Optionally, the obtaining module 701 is specifically configured to:

Optionally, the obtaining module 701 is further configured to:

Based on the same inventive concept, the embodiment of the present invention also provides a computer device, which includes a program or instructions, and when the program or instructions are executed, the database operation method and any optional method provided by the embodiment of the present invention are executed.

Based on the same inventive concept, the embodiment of the present invention also provides a computer-readable storage medium, which includes a program or instructions, and when the program or instructions are executed, the database operation method and any optional method provided by the embodiment of the present invention are executed.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method of database operation, comprising:

acquiring a target natural language text;

and operating the target database by executing the target SQL statement.

2. The method of claim 1, wherein the predictive model is trained by:

3. The method according to claim 2, wherein the portions comprise in particular a first portion, a second portion, a third portion and a fourth portion;

4. The method of claim 3, wherein the performing machine learning training on the initial model based on the first type of data of the natural language text of each training data in the target data set and the SQL statement corresponding to the natural language text to obtain an intermediate model comprises:

5. The method of claim 3, wherein the performing machine learning training on the intermediate model based on the second type of data of the natural language text of each training data and the SQL sentence corresponding to the natural language text to obtain the prediction model comprises:

6. The method of any one of claims 1 to 5, wherein for any training data of the target data set, the training data is normalized in the following manner:

7. A method as claimed in any one of claims 1 to 5, wherein for at least one piece of training data in the target data set, the at least one piece of training data is desensitised by:

8. A database operating apparatus, comprising:

9. A computer device comprising a program or instructions that, when executed, perform the method of any of claims 1 to 7.

10. A computer-readable storage medium comprising a program or instructions which, when executed, perform the method of any of claims 1 to 7.