CN115422220A - Method for converting natural language into SQL (structured query language) based on deep learning model - Google Patents

Method for converting natural language into SQL (structured query language) based on deep learning model

Info

Publication number
CN115422220A
CN115422220A
Authority
CN
China
Prior art keywords
sql
natural language
deep learning
learning model
semantic features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210807719.0A
Other languages
Chinese (zh)
Inventor
郭大勇
王渊
欧阳奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tongban Information Service Co ltd
Original Assignee
Shanghai Tongban Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tongban Information Service Co ltd filed Critical Shanghai Tongban Information Service Co ltd
Priority to CN202210807719.0A priority Critical patent/CN115422220A/en
Publication of CN115422220A publication Critical patent/CN115422220A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/243Natural language query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method for converting natural language into SQL based on a deep learning model. A user's natural-language question is predicted by the deep learning model, and the predicted result is converted into an SQL statement through the CYK algorithm. The method comprises the following steps: step 1, preparing a corpus; step 2, establishing a table model; step 3, performing SQL conversion through the CYK algorithm. The method can conveniently and directly convert human natural language into query statements such as SQL, and then directly obtain the data the user wants by executing the SQL statement.

Description

Method for converting natural language into SQL (structured query language) based on deep learning model
Technical Field
The invention relates to a method for converting natural language into SQL, in particular to a method for converting natural language into SQL based on a deep learning model.
Background
With the rise of the fourth industrial revolution, artificial-intelligence-related industries have developed rapidly, and artificial intelligence can assist people's daily work and life. Natural language processing is an important branch of artificial intelligence; its goal is to enable machines to understand the language people use in daily life, thereby reducing the communication barrier between people and machines and making it convenient for the machine to perform further processing and conversion.
At present, the information data of the digital age is stored in many ways across many systems, such as databases of all kinds. For the massive digital data on the Internet, traditional relational databases are generally used for storage for convenience of management, operation, and maintenance. However, when querying data from these systems, a professional still needs to write specialized query statements with specialized tools, or a dedicated query system has to be developed so that the desired data can be queried by specifying query parameters in an easily understood form; these interactive query methods all require a certain amount of professional knowledge to use well. How to query the required information from these relational databases through natural language, that is, how to convert a human natural-language query description into an executable database query statement (SQL), has therefore become one of the hottest research directions in the field of natural language processing. A natural language query interface aims to complete the human-computer interaction between a user and a relational database through natural language so as to obtain the data the user wants, and is an indispensable part of building an automatic, intelligent database query system. The most important task in realizing a natural language query interface is how to generate the SQL statement; this is referred to as the NL2SQL task.
A natural language query interface enables users to describe and directly query data in everyday natural language, which saves users the cost of learning professional knowledge, saves the time of constructing SQL statements, and improves query efficiency.
Compared with the English NL2SQL task, the Chinese NL2SQL task is more difficult to research. Almost all research work related to the NL2SQL task has been done on English datasets, and directly migrating English models and methods to Chinese is not well suited: Chinese natural language has no natural boundaries separating words from each other and is more variable in expression than English. At present, most NL2SQL approaches handle the different sub-tasks separately and then assemble the final result; although this guarantees the uniqueness of each sub-task, it lacks the integrity of the overall task and makes machine understanding more difficult. The NL2SQL task for Chinese therefore still needs to be studied. Building on the English models and methods, this document completes multiple tasks within a single model, directly converts human natural language into query statements such as SQL, and then directly obtains the data the user wants by executing the SQL statement, providing a new solution for the Chinese NL2SQL task.
The invention provides a method for converting natural language into SQL based on a deep learning model, and discloses the training method of the deep learning model together with the corresponding algorithm. With the trained deep learning model, the invention first directly converts the human natural language into a query statement such as SQL; the system then directly obtains the data the user needs by running the SQL statement against the database; finally, the system outputs the relevant data. In the prior art, data can be queried from a system only by professionals writing specialized query statements with specialized tools, and these interactive query methods require professional knowledge to use well. The method can directly complete the conversion from human natural language to the query language, so that non-professionals can conveniently obtain answers, improving the intelligence of data query and acquisition.
Disclosure of Invention
The present invention is directed to a method for converting natural language into SQL, so as to solve the problems set forth in the technical background above.
In order to achieve this purpose, the invention adopts the following technical scheme:
the invention provides a method for converting natural language into SQL (structured query language) based on a deep learning model, which is characterized in that a question of a user is converted into an SQL (structured query language) statement through a CYK (cyclic annular K) algorithm under the prediction of the deep learning model, and the method mainly comprises the following steps:
step 1: preparing a corpus;
step 2: establishing a table model;
step 3: performing SQL conversion through the CYK algorithm.
Preferably, step 1 comprises the steps of:
step 1-1: preparing corpora for training of the model;
step 1-2: corpus labeling.
Further, step 2 comprises the steps of:
step 2-1: acquiring semantic features of natural language;
step 2-2: predicting the relation between entities from the semantic features, using the vector of the first character CLS of the semantic features;
step 2-3: predicting the function corresponding to the selected column from the semantic features, using the vector of each column of the semantic features;
step 2-4: predicting the selected entity from the semantic features, using the vector of each character of the semantic features;
step 2-5: predicting the operator of the selected entity from the semantic features, using the vector of each column of the semantic features together with the function result corresponding to the selected column.
Preferably, step 2-1 comprises the steps of:
step 2-1-1: the characters of the natural language are converted by an Embedding layer into three vectors: a word vector, a vector representing the token's basic semantic information, and a Segment Embedding vector;
step 2-1-2: semantic features are obtained through a stack of 12 Encoder layers; the Encoder layer contains self-attention, so each character can simultaneously use the information of the characters before and after it.
preferably, the relationship between entities described in step 2-2 has three results, which are: has no relation, and, or; converting the vector dimension of the initial character cls in the preprocessed inputs1 into 3 through a full link layer, and then selecting a value with the maximum probability through softmax as an entry _ type result.
Preferably, the function corresponding to the selected column described in step 2-3 comprises 6 types: no function, average, maximum, minimum, count, and sum; the vector of the cls of each col in the preprocessed inputs1 is converted to dimension 6 through a fully connected layer, and the value with the maximum probability is then selected through softmax as the agg result.
Preferably, the entity extraction in step 2-4 is performed by named entity recognition using a BERT+LSTM+CRF model; c1 is the first character of the question, the LSTM output is obtained through a forward LSTM and a backward LSTM, and a CRF layer is then added; the CRF algorithm includes two kinds of features, where the first is the state feature function, which computes a state score, and the second is the transition feature function, which computes a transition score.
Preferably, in step 2-5 the operator of the selected entity is predicted from the vector of each column of the semantic features together with the function result corresponding to the selected column. The operator has five types: no operation, equal to, not equal to, greater than, and less than. The vector of each column in the preprocessed output1 result is accumulated with the corresponding agg result, the dimension is converted to 5 through a fully connected layer, and the value with the maximum probability is selected through softmax as the relation result.
Further, step 3 comprises the steps of:
step 3-1: initializing an algorithm according to SQL grammar;
step 3-2: judging whether any given character string satisfies the SQL grammar.
Preferably, step 3-1 comprises the steps of:
step 3-1-1: constructing the identification matrix. First construct the main diagonal: let t(0,0) = 0, then place the words w_i of the input sentence x at the positions t(1,1) to t(n,n) on the main diagonal. Next construct the elements t(i,i+1) immediately above and next to the main diagonal: for an input sentence x = w_1 w_2 ... w_n, start the analysis from w_1; if the production set of grammar G contains a rule A -> w_1, fill in t(0,1) = A, and so on. For each terminal w_i on the main diagonal, every non-terminal that can derive it is written at the position just above and to the right of it on the main diagonal.
Step 3-1-2: the sentence is analyzed using the CYK algorithm.
In a preferred embodiment, if the sentence is T N W O and the grammar is:
S -> select where
select -> T
where -> P Q V
P -> N
Q -> W
V -> O
(1) There are four symbols, so a 5 x 5 upper-triangular matrix is formed; the matched keywords are filled in on the diagonal, and the matched type is noted above them.
0 select
  T P
    N Q
      W V
        O
(2) P Q V can constitute where; select has nothing new to combine with, so it is carried one cell to the right.
0 select select
  T P where
    N Q
      W V
        O
(3) select and where can constitute S; S is the goal symbol, so the algorithm ends.
0 select select select S
  T P where
    N Q
      W V
        O
The resulting structure of the SQL parse tree is shown in FIG. 6.
Preferably, before obtaining the semantic features, the data needs to be preprocessed, and the preprocessing step includes:
step (1): splice the question with the table data: add CLS at the beginning of the question and a SEP character at the end, and do the same for each column of the table; c1 represents the first character of the string entered by the user, c2 the second, and so on; col1 represents the first column of the table, col2 the second, and so on;
step (2): convert the preprocessed question into its input encoding, add the Positional Encoding position variable, apply a Multi-Head Attention layer with a residual connection followed by layer normalization (norm), and then a feed-forward neural network with another residual connection and normalization;
step (3): finally, two vectors, output1 and output2, are produced.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. A CRF model is added to the named entity task, improving accuracy;
2. An SQL rule model is established through the CYK algorithm, so SQL is generated conveniently and quickly;
3. Multiple tasks are combined into a single model, making training and deployment convenient and fast.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a flow diagram of natural language data preprocessing;
FIG. 2 is a diagram of an entity relationship prediction model;
FIG. 3 is a functional model corresponding to a predicted column;
FIG. 4 shows the model for predicting the selected entity;
FIG. 5 illustrates the model for predicting the operator of the selected entity;
FIG. 6 is a string analysis tree based on the CYK algorithm;
FIG. 7 is a logical block diagram of a preferred embodiment of the present invention.
Detailed Description
The invention provides a method for converting natural language into SQL based on a deep learning model, which is further described in detail below by referring to the attached drawings and examples in order to make the purpose, technical scheme and effect of the invention clearer and clearer. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, it being understood that the data so used may be interchanged under appropriate circumstances. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The first embodiment is as follows:
a method for converting natural language into SQL based on a deep learning model is characterized in that a question sentence of a user is converted into an SQL sentence by a CYK algorithm under the prediction of the deep learning model, and the method mainly comprises the following steps:
step 1: and (4) preparing the corpus. Wherein the step 1 comprises the following steps:
Step 1-1: prepare the corpora for training the model. Tabular data is collected and processed into the data structure required by the model, which is preferably: {"query": question, "sql": {"agg": [function type], "entity_type": entity relation, "col": [selected column in the table], "relation": [[column in table, operator, entity]]}, "ner": BIO labels}. The collected data has five features. The first feature, agg, represents the function corresponding to the selected column; the functions comprise 6 types, namely no function, average, maximum, minimum, count, and sum. The second feature, ner, represents the entities in the question; different types of entities are labeled in the BIO scheme, where B marks the beginning of an entity, I marks the rest of the entity, and O marks a non-entity. The third feature, entity_type, represents the relation between entities; there are three relations, namely none, and, or, where and indicates that multiple conditions need to be satisfied simultaneously and or indicates that only one needs to be satisfied. The fourth feature, col, represents the selected column in the table. The fifth feature, relation, represents the selected entity in the column and the type of operator; the operator has five types, namely no operation, equal to, not equal to, greater than, and less than.
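For illustration, the sketch below builds one such training sample in Python. The question text, the indices, the BIO labels, and the exact encoding of entity_type are hypothetical; they only mirror the structure described above and are not taken from the actual corpus.

```python
# A hypothetical training sample mirroring the structure described above.
# All concrete values (question text, indices, labels) are illustrative only.
sample = {
    "query": "What is the per-capita disposable income of urban residents in 2021?",
    "sql": {
        "agg": [2],                      # function for the selected column (e.g. 2 = average)
        "entity_type": 1,                # relation between entities: 0 = none, 1 = and, 2 = or
        "col": [2],                      # index of the selected column in the table
        "relation": [[1, "==", "2021"],  # [column in table, operator, entity]
                     [3, "==", "urban resident"]],
    },
    "ner": ["O", "O", "B-N", "I-N", "O"],  # BIO labels per character, truncated
}
print(sample["sql"]["col"])
```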
Step 1-2: corpus labeling.
And 2, step: and establishing a table model. The table model is mainly divided into three layers: an encoder, a context enhancement layer, and an output layer. The coding layer outputs a vector of each character based on BERT; the context enhancement layer adds the output vector corresponding to each column name into the question; the output layer obtains a column vector. The table model integrates four tasks (predicting the relationship among entities, predicting the function corresponding to the selected column, predicting the selected entity and predicting the operation sign of the entity) into the same model, the model has 7 inputs and 3 outputs, and finally four tasks of predicting the relationship among entities, predicting the function corresponding to the selected column, predicting the selected entity and predicting the operation sign of the entity are completed.
Wherein the step 2 comprises the following steps:
step 2-1: and (5) semantic features. The natural language converts the characters into three vectors through an Embedding layer, wherein the three vectors are respectively a word vector, a semantic information vector for representing token basis and a Segment Embedding vector, then semantic features are obtained through an Encoder layer stacked by 12 layers, the Encoder layer comprises self-attribute, and each character can simultaneously use the information of the rest characters before and after the character.
As shown in fig. 1, the data is first preprocessed. The preprocessing steps are as follows: (1) splice the question with the table data: add CLS at the beginning of the question and a SEP character at the end, and do the same for each column of the table; c1 represents the first character of the string entered by the user, c2 the second, and so on, while col1 represents the first column of the table, col2 the second, and so on. (2) Convert the preprocessed question into its input encoding, add the Positional Encoding position variable, apply a Multi-Head Attention layer with a residual connection followed by layer normalization (norm), and then a feed-forward neural network with another residual connection and normalization. (3) Finally, two vectors, output1 and output2, are produced.
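A minimal sketch of the splicing step is given below. It assumes character-level splitting and the literal [CLS]/[SEP] markers; the real tokenizer and vocabulary may differ.

```python
def build_model_input(question: str, columns: list) -> list:
    """Splice the question with the table header as described above:
    [CLS] + question + [SEP], then [CLS] + column name + [SEP] per column.
    Character-level splitting is assumed; a word-piece tokenizer may differ."""
    tokens = ["[CLS]"] + list(question) + ["[SEP]"]   # c1, c2, ... are the question characters
    for col in columns:                               # col1, col2, ... are the table columns
        tokens += ["[CLS]"] + list(col) + ["[SEP]"]
    return tokens

# Hypothetical question and table header.
print(build_model_input("income 2021", ["year", "per-capita disposable income"]))
```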
Step 2-2: based on the semantic features, the relation between entities is predicted from the vector of the first character CLS. As shown in fig. 2, there are three possible relations between entities: no relation, and, or. The vector of the first character cls in the preprocessed inputs1 is converted to dimension 3 through a fully connected layer, and the value with the highest probability is then selected through softmax as the entity_type result.
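A possible realization of this prediction head is sketched below in PyTorch. The hidden size of 768 and the random stand-in for the BERT cls vector are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EntityTypeHead(nn.Module):
    """Maps the [CLS] vector of the encoded question to the three entity
    relations (none / and / or); hidden size 768 is assumed."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 3)   # fully connected layer, output dimension 3

    def forward(self, cls_vector: torch.Tensor) -> torch.Tensor:
        logits = self.fc(cls_vector)          # (batch, 3)
        return torch.softmax(logits, dim=-1)  # argmax gives the entity_type result

# cls_vector would come from the BERT encoder; here a random stand-in.
probs = EntityTypeHead()(torch.randn(1, 768))
print(probs.argmax(dim=-1))  # predicted entity_type index
```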
Step 2-3: based on the semantic features, the function corresponding to the selected column is predicted from the vector of each column. As shown in fig. 3, the functions corresponding to the selected column comprise 6 types: no function, average, maximum, minimum, count, and sum. The vector of the cls of each col in the preprocessed inputs1 is converted to dimension 6 through a fully connected layer, and the value with the highest probability is then selected through softmax as the agg result.
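The corresponding per-column head could look like the following sketch; the hidden size, the number of columns, and the random stand-in inputs are assumptions.

```python
import torch
import torch.nn as nn

class AggHead(nn.Module):
    """For every column's [CLS] vector, predicts one of 6 aggregate functions
    (none / avg / max / min / count / sum); hidden size 768 is assumed."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 6)

    def forward(self, col_cls: torch.Tensor) -> torch.Tensor:
        # col_cls: (batch, num_columns, hidden) -> (batch, num_columns, 6)
        return torch.softmax(self.fc(col_cls), dim=-1)

agg_probs = AggHead()(torch.randn(1, 4, 768))   # a hypothetical table with 4 columns
print(agg_probs.argmax(dim=-1))                 # agg index per column
```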
Step 2-4: based on the semantic features, the selected entity is predicted from the vector of each character. As shown in FIG. 4, entity extraction is performed by named entity recognition using a BERT+LSTM+CRF model: c1 is the first character of the question, the LSTM output is obtained through a forward LSTM and a backward LSTM, and a CRF layer is then added. The CRF algorithm mainly comprises two kinds of features: the state feature function, which computes a state score, and the transition feature function, which computes a transition score. The former assigns an entity label to the character at the current position, while the latter focuses on which entity-label combinations may occur between the current position and its adjacent positions.
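The scoring part of such a BERT+LSTM+CRF tagger might be sketched as follows. The LSTM width, the number of BIO tags, and the simplified gold-path score (full CRF training and Viterbi decoding are omitted) are assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class BiLstmCrfScorer(nn.Module):
    """Sketch of the BERT+LSTM+CRF scoring: a BiLSTM produces per-character
    emission (state) scores and a learned transition matrix scores adjacent
    label pairs. Sizes are assumed; BERT output is a random stand-in below."""
    def __init__(self, hidden: int = 768, lstm_hidden: int = 128, num_tags: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(hidden, lstm_hidden, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)             # state feature scores
        self.trans = nn.Parameter(torch.zeros(num_tags, num_tags))   # transition scores

    def path_score(self, bert_out: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
        feats, _ = self.lstm(bert_out)                 # forward + backward LSTM
        emissions = self.emit(feats)                   # (batch, seq, num_tags)
        # state score: emission of the chosen tag at every position
        state = emissions.gather(-1, tags.unsqueeze(-1)).squeeze(-1).sum(-1)
        # transition score: learned score for each adjacent tag pair
        trans = self.trans[tags[:, :-1], tags[:, 1:]].sum(-1)
        return state + trans

scorer = BiLstmCrfScorer()
score = scorer.path_score(torch.randn(1, 20, 768), torch.zeros(1, 20, dtype=torch.long))
print(score)
```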
Step 2-5: based on the semantic features, the operator of the selected entity is predicted from the vector of each column together with the function result corresponding to the selected column. As shown in FIG. 5, there are five types of operators: no operation, equal to, not equal to, greater than, and less than. The vector of each column in the preprocessed output1 result is accumulated with the corresponding agg result, the dimension is converted to 5 through a fully connected layer, and the value with the highest probability is selected through softmax as the relation result.
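One way to realize the accumulation of the column vector with its agg result is to embed the predicted agg index and add it to the column vector, as in the sketch below; the embedding, the sizes, and the stand-in inputs are assumptions, not necessarily the patent's exact mechanism.

```python
import torch
import torch.nn as nn

class OperatorHead(nn.Module):
    """Adds each column vector to an embedding of its predicted agg index and
    maps the sum to the 5 operators (none / == / != / > / <)."""
    def __init__(self, hidden: int = 768, num_agg: int = 6):
        super().__init__()
        self.agg_emb = nn.Embedding(num_agg, hidden)
        self.fc = nn.Linear(hidden, 5)

    def forward(self, col_vec: torch.Tensor, agg_idx: torch.Tensor) -> torch.Tensor:
        fused = col_vec + self.agg_emb(agg_idx)        # accumulate column vector + agg result
        return torch.softmax(self.fc(fused), dim=-1)   # argmax gives the relation result

op_probs = OperatorHead()(torch.randn(1, 4, 768), torch.zeros(1, 4, dtype=torch.long))
print(op_probs.argmax(dim=-1))
```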
Step 3: the CYK algorithm. The CYK algorithm is a bottom-up parsing algorithm based on CFG rules. After the user's sentence passes through the table model, the result of the table model is fed into the CYK algorithm according to the SQL query rules, and the corresponding SQL statement is obtained. Step 3 comprises the following steps:
Step 3-1: initialize the algorithm according to the SQL grammar. This mainly comprises the following steps:
step 3-1-1: an identification matrix is constructed. First construct the main diagonal line, let t 0,0 =0, then from t 1,1 To t n,n Put the word w of the input sentence x at the position of the main diagonal line i Constructing an element t immediately above the principal diagonal and next to the principal diagonal i,i+1 For an input sentence x = w 1 w 2 8230w from w 1 Begin analysis if there is a rule A->w 1 Then fill in t 0,1 = A and so on, for each terminator w on the main diagonal i All possible non-terminal characters deducing it are written at a position above its right main diagonal.
Step 3-1-2: the sentence is analyzed using the CYK algorithm. If the sentence is T N W O and the grammar is:
S -> select where
select -> T
where -> P Q V
P -> N
Q -> W
V -> O
(1) There are four symbols, so a 5 x 5 upper-triangular matrix is formed; the matched keywords are filled in on the diagonal, and the matched type is noted above them.
0 select
  T P
    N Q
      W V
        O
(2) P Q V can constitute where; select has nothing new to combine with, so it is carried one cell to the right.
0 select select
  T P where
    N Q
      W V
        O
(3) select and where can constitute S; S is the goal symbol, so the algorithm ends.
0 select select select S
  T P where
    N Q
      W V
        O
The resulting structure of the SQL parse tree is shown in FIG. 6.
Step 3-2: judge whether any given character string satisfies the SQL grammar.
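A minimal CYK recognizer for the toy grammar of this embodiment is sketched below. Because textbook CYK assumes Chomsky normal form, the ternary rule where -> P Q V is binarized here (where -> P QV, QV -> Q V); the binarization, the function names, and the data layout are assumptions for illustration.

```python
def cyk_recognize(words, lexical, binary, start="S"):
    """Returns True if the word sequence can be derived from `start`.
    chart[i][j] holds the non-terminals deriving words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # diagonal: lexical rules
        chart[i][i + 1] = {A for A, rhs in lexical if rhs == w}
    for span in range(2, n + 1):                       # fill the matrix bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for A, (B, C) in binary:
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return start in chart[0][n]

lexical = [("select", "T"), ("P", "N"), ("Q", "W"), ("V", "O")]
binary = [("S", ("select", "where")), ("where", ("P", "QV")), ("QV", ("Q", "V"))]
print(cyk_recognize(["T", "N", "W", "O"], lexical, binary))  # True: the string fits the SQL grammar
```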
An embodiment logical process diagram is shown in fig. 7.
Example two:
further, a method for converting natural language into SQL based on a deep learning model comprises the following steps:
step 1:
question: woolen cloth capable of controlling income of Shanghai town residents in 2021 year
Step 2:
Recognition through the table model yields:
{"agg": [2], "entity_type": [2, 2], "col": [2], "relation": [[1, ==, 2021], [3, ==, urban resident]],
"ner": [B-A, B-I, B-I, I-A, I-A, I-A, B-N, I-N, I-N, I-N, O, O, O, O, O, O, O, O, O, O, O]}
Analysis: col indicates that the selected column is the second column of the table, whose header is per-capita disposable income; agg indicates that the average of the per-capita disposable income needs to be calculated; entity_type gives the relation between the entities urban residents of Shanghai and 2021; relation indicates that the residence type is urban resident and the year is 2021.
Step 3:
The SQL parse tree result is: select avg(per-capita disposable income) from table where year == 2021 and residence type == urban resident.
Step 4:
The SQL statement is executed against the database, and the result is: the average per-capita disposable income is 23772 yuan.
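To illustrate the last two steps, the sketch below assembles the SQL statement from a hypothetical table-model output and runs it against a toy in-memory SQLite table; the table name, column names, placeholder style, and the single-equals operator spelling are assumptions for illustration only.

```python
import sqlite3

# Hypothetical output of the table model for the question in this embodiment:
# selected column index 2 with AVG, plus two equality conditions.
columns = ["region", "year", "per_capita_disposable_income", "residence_type"]
prediction = {"col": 2, "agg": "AVG",
              "conds": [(1, "=", "2021"), (3, "=", "urban resident")]}

where = " AND ".join(f"{columns[c]} {op} ?" for c, op, _ in prediction["conds"])
sql = f"SELECT {prediction['agg']}({columns[prediction['col']]}) FROM income WHERE {where}"
params = [v for _, _, v in prediction["conds"]]

# Run the generated statement against a toy in-memory table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE income (region TEXT, year TEXT, "
            "per_capita_disposable_income REAL, residence_type TEXT)")
con.execute("INSERT INTO income VALUES ('Shanghai', '2021', 23772, 'urban resident')")
print(sql, "->", con.execute(sql, params).fetchone()[0])   # 23772.0
```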
The embodiments of the present invention have been described in detail, but these embodiments are only examples, and the present invention is not limited to the embodiments described above. Any equivalent modifications and substitutions that would be obvious to those skilled in the art are also within the scope of the present invention. Accordingly, equivalent changes and modifications made without departing from the spirit and scope of the present invention shall be covered by the present invention.

Claims (10)

1. A method for converting natural language into SQL based on a deep learning model, characterized in that the user's natural-language question is predicted by the deep learning model, and the predicted result is converted into an SQL statement through the CYK algorithm.
2. The method for converting natural language into SQL according to claim 1, characterized by comprising the following steps:
step 1: preparing a corpus;
step 2: establishing a table model;
and step 3: performing SQL conversion through the CYK algorithm.
3. The method for converting natural language into SQL based on deep learning model according to claim 2, wherein the process of preparing corpus in step 1 includes the following steps:
step 1-1: preparing corpora for training of the model;
step 1-2: corpus labeling.
4. The method for converting natural language into SQL based on deep learning model according to claim 2, wherein the process of building a table model in step 2 includes the following steps:
step 2-1: acquiring semantic features of the natural language; the characters of the natural language are converted by an Embedding layer into three vectors: a word vector, a vector representing the token's basic semantic information, and a Segment Embedding vector; semantic features are obtained through a stack of 12 Encoder layers; the Encoder layer contains self-attention, so each character can simultaneously use the information of the characters before and after it;
step 2-2: predicting the relation between entities from the semantic features, using the vector of the first character CLS of the semantic features;
step 2-3: predicting the function corresponding to the selected column from the semantic features, using the vector of each column of the semantic features;
step 2-4: predicting the selected entity from the semantic features, using the vector of each character of the semantic features;
step 2-5: predicting the operator of the selected entity from the semantic features, using the vector of each column of the semantic features together with the function result corresponding to the selected column.
5. The method for converting natural language into SQL based on the deep learning model according to claim 2, wherein the SQL conversion process performed by the CYK algorithm in step 3 comprises the following steps:
step 3-1: initializing an algorithm according to SQL grammar;
step 3-2: judging whether any given character string satisfies the SQL grammar.
6. The method for converting natural language into SQL based on deep learning model according to claim 4, wherein, before obtaining the semantic features, the data needs to be preprocessed, and the preprocessing step includes:
step (1): splicing the question with the table data: adding CLS at the beginning of the question and a SEP character at the end, and doing the same for each column of the table; c1 represents the first character of the string entered by the user, c2 the second, and so on; col1 represents the first column of the table, col2 the second, and so on;
step (2): converting the preprocessed question into its input encoding, adding the Positional Encoding position variable, applying a Multi-Head Attention layer with a residual connection followed by layer normalization (norm), and then a feed-forward neural network with another residual connection and normalization;
step (3): finally producing two vectors, output1 and output2.
7. The method for converting natural language into SQL based on a deep learning model according to claim 4, wherein the relation between entities in step 2-2 has three possible results: no relation, and, or; the vector of the initial character cls in the preprocessed inputs1 is converted to dimension 3 through a fully connected layer, and the value with the maximum probability is then selected through softmax as the entity_type result.
8. The method for converting natural language into SQL based on a deep learning model according to claim 4, wherein the functions corresponding to the selected column in step 2-3 comprise 6 types: no function, average, maximum, minimum, count, and sum; the vector of the cls of each col in the preprocessed inputs1 is converted to dimension 6 through a fully connected layer, and the value with the maximum probability is then selected through softmax as the agg result.
9. The method for converting natural language into SQL based on a deep learning model according to claim 4, wherein the entity extraction in step 2-4 is performed by named entity recognition using a BERT+LSTM+CRF model; c1 is the first character of the question, the LSTM output is obtained through a forward LSTM and a backward LSTM, and a CRF layer is then added; the CRF algorithm includes two kinds of features, wherein the first is the state feature function, which computes a state score, and the second is the transition feature function, which computes a transition score.
10. The method for converting natural language into SQL based on deep learning model according to claim 5, wherein the step 3-1 of initializing algorithm according to SQL syntax includes the following steps:
step 3-1-1: constructing an identification matrix;
step 3-1-2: the sentence is analyzed using the CYK algorithm.
CN202210807719.0A 2022-07-11 2022-07-11 Method for converting natural language into SQL (structured query language) based on deep learning model Pending CN115422220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210807719.0A CN115422220A (en) 2022-07-11 2022-07-11 Method for converting natural language into SQL (structured query language) based on deep learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210807719.0A CN115422220A (en) 2022-07-11 2022-07-11 Method for converting natural language into SQL (structured query language) based on deep learning model

Publications (1)

Publication Number Publication Date
CN115422220A true CN115422220A (en) 2022-12-02

Family

ID=84195683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210807719.0A Pending CN115422220A (en) 2022-07-11 2022-07-11 Method for converting natural language into SQL (structured query language) based on deep learning model

Country Status (1)

Country Link
CN (1) CN115422220A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116467347A (en) * 2023-03-22 2023-07-21 天云融创数据科技(北京)有限公司 Stock questioning and answering method
CN116467347B (en) * 2023-03-22 2024-04-30 天云融创数据科技(北京)有限公司 Stock questioning and answering method
CN117235206A (en) * 2023-08-30 2023-12-15 上海通办信息服务有限公司 Policy matching method based on deep learning
CN117235206B (en) * 2023-08-30 2024-04-30 上海通办信息服务有限公司 Policy matching method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 200435 11th Floor, Building 27, Lane 99, Shouyang Road, Jing'an District, Shanghai
Applicant after: Shanghai Tongban Information Service Co.,Ltd.
Address before: No. 11, Lane 100, Zhengtong Road, Yangpu District, Shanghai 200082 (centralized registration place)
Applicant before: Shanghai Tongban Information Service Co.,Ltd.