CN114020768A

CN114020768A - Construction method and application of SQL (structured query language) statement generation model of Chinese natural language

Info

Publication number: CN114020768A
Application number: CN202111191677.4A
Authority: CN
Inventors: 李瑞轩; 林毅炜; 辜希武; 李玉华; 马学旭
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2022-02-08

Abstract

The invention discloses a construction method and application of an SQL statement generation model for Chinese natural language, comprising the following steps: S1, building an SQL statement generation model; S2, taking the Chinese natural language questions and related database schemas collected in the training set as input and the corresponding correct SQL statements as output, and training the SQL statement generation model by minimizing the difference between the SQL statements it generates and the correct SQL statements. By incorporating the type information of the different data columns, the method captures the alignment relations and implicit reference relations between the Chinese natural language question and the database schema, and represents the explicitly defined relations of the database schema and the link relations between the natural language question and the database schema as a directed graph. By jointly considering the characteristics of unstructured and structured data, semantic features and relational features are combined for joint coding, which greatly improves the accuracy of the SQL statement generation model.

Description

Construction method and application of SQL (structured query language) statement generation model of Chinese natural language

Technical Field

The invention belongs to the technical field of semantic parsing in natural language processing, and particularly relates to a construction method and application of an SQL statement generation model for Chinese natural language.

Background

With the rapid development of database technology and information technology, massive amounts of digital data are generated in every industry at every moment; these data may be independent of one another or may have certain dependency relationships. To facilitate querying, updating, unified management and maintenance, data are usually structured and stored in a database, and a uniform database query language, SQL, is required to retrieve data meeting specific requirements. SQL has strict grammatical constraints, which poses a learning and usage barrier for non-professional users without knowledge of databases and SQL. With the development of deep learning and natural language processing in recent years, querying a database directly in natural language, so as to make information retrieval more convenient and efficient, has attracted extensive attention.

Early research relied on manually formulated rules for databases in specific fields; such methods do not generalize and require considerable manpower to maintain. To address these problems, most recent methods for generating complex SQL encode the natural language and the database schema separately with relatively independent encoders; they rarely consider the relations between the database schema and the user's question, and therefore have low accuracy. In addition, in practical scenarios the data tables and data columns in a database are often very complex, and data columns with the same name frequently occur in different tables, which easily causes ambiguity; current research rarely considers this problem. Moreover, most studies decode the feature vectors directly into the SQL statement when generating it, ignoring the syntax rules of complex SQL statements. Because of the complexity of the syntax constraints, such direct decoding is often inaccurate.

Disclosure of Invention

Aiming at the defects or improvement requirements of the prior art, the invention provides a construction method and application of a Chinese natural language SQL statement generation model, which are used for solving the technical problem of low accuracy rate of SQL statement generation in the prior art.

In order to achieve the above object, in a first aspect, the present invention provides a method for constructing an SQL statement generation model for Chinese natural language, including the following steps:

S1, building an SQL statement generation model; the SQL statement generation model comprises the following modules:

the word segmentation and splicing module is used for performing word segmentation on the Chinese natural language question and on the data table names and data column names in the related database schema, then splicing the segmentation results together with the data column type information to obtain a fusion expression vector of the Chinese natural language question (unstructured data) and the database schema (structured data), and outputting the fusion expression vector to the semantic coding module;

the semantic coding module is used for extracting semantic features from the fusion expression vector by means of a natural language pre-training model to obtain semantic coding vectors of the word segmentation results of the Chinese natural language question and of the data table names and data column names in the related database schema, forming them into a joint coding tensor, and outputting the joint coding tensor to the relation coding module;

the relation representation module is used for representing the defined relations in the database schema and the link relations between the natural language question and the database schema as a relational directed graph, and outputting the relational directed graph to the relation coding module; the vertices of the relational directed graph comprise the data table names, the data column names and the word segmentation results of the Chinese natural language question; the edges of the relational directed graph represent the relations existing between the vertices, including the database definition relations between vertices and the synonymy and relatedness relations between a word segmentation result of the Chinese natural language question and a data table name or data column name;

the relation coding module is used for jointly coding the alignment relation between the joint coding tensor and the relational directed graph, extracting their relational features to obtain a relation coding tensor, and outputting the relation coding tensor to the SQL decoding module;

the SQL decoding module is used for decoding the relation coding tensor into an SQL statement based on a syntax tree structure;

S2, taking the Chinese natural language questions and related database schemas collected in the training set as input and the corresponding correct SQL statements as output, and training the SQL statement generation model by minimizing the difference between the SQL statements generated by the model and the correct SQL statements.

Preferably, the word segmentation and splicing module is used for filtering stop words and punctuation marks from the Chinese natural language question and from the data table names and data column names in the related database schema, performing word segmentation, and restoring mis-segmented items occurring in the Chinese natural language question to obtain the word segmentation results of the Chinese natural language question and of the data table names and data column names in the related database schema; the segmentation results are then spliced, with the corresponding data column type information attached to the word segmentation result of each data column name, to obtain the fusion expression vector of the Chinese natural language question (unstructured data) and the database schema (structured data);

the mis-segmented items comprise Arabic numerals, unit symbols, arithmetic expressions and English words; the data column type information includes a numeric type, a time-date type and a text type.

Further preferably, the semantic coding module extracts semantic features in the fusion expression vector by using a multilingual BERT pre-training model.

Further preferably, let vertex A and vertex B be any two different vertices in the relational directed graph. When vertex A and vertex B are both data column names, the database definition relations existing between the vertices include: vertex A and vertex B belong to the same data table; and vertex A and vertex B are associated by a foreign key;

when vertex A is a data column name and vertex B is a data table name, the database definition relations existing between the vertices include: vertex A is the primary key of vertex B; and vertex A is a column of vertex B but not its primary key;

when vertex A is a data table name and vertex B is a data column name, the database definition relations existing between the vertices include: vertex B is the primary key of vertex A; and vertex B is a column of vertex A but not its primary key;

when vertex A and vertex B are both data table names, the database definition relations existing between the vertices include: vertex A has a column associated with vertex B by a foreign key; vertex B has a column associated with vertex A by a foreign key; and vertex A and vertex B each have a column associated with the other by a foreign key.
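
The database definition relations listed above can be sketched as a small edge-labelling routine. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_schema_graph`, the input layout, and all relation labels (for example `COL_TABLE_PRIMARY_KEY`) are hypothetical, since the label names used in the patent's own tables are not reproduced in the text.

```python
def build_schema_graph(tables, primary_keys, foreign_keys):
    """Label directed edges between schema vertices.

    tables:       {table_name: [column_name, ...]}
    primary_keys: {table_name: column_name}
    foreign_keys: [(table_a, col_a, table_b, col_b)], meaning a.col_a references b.col_b
    All labels below are hypothetical placeholders.
    """
    edges = {}
    for table, cols in tables.items():
        # column-column: both columns belong to the same table
        for a in cols:
            for b in cols:
                if a != b:
                    edges[(f"{table}.{a}", f"{table}.{b}")] = "COL_COL_SAME_TABLE"
        # column-table and table-column: primary key or ordinary column
        for c in cols:
            col = f"{table}.{c}"
            if primary_keys.get(table) == c:
                edges[(col, table)] = "COL_TABLE_PRIMARY_KEY"
                edges[(table, col)] = "TABLE_COL_PRIMARY_KEY"
            else:
                edges[(col, table)] = "COL_TABLE_BELONGS"
                edges[(table, col)] = "TABLE_COL_HAS"
    # column-column: foreign-key association (only the referencing direction is stored here)
    fk_tables = set()
    for ta, ca, tb, cb in foreign_keys:
        edges[(f"{ta}.{ca}", f"{tb}.{cb}")] = "COL_COL_FOREIGN_KEY"
        if ta != tb:
            fk_tables.add((ta, tb))
    # table-table: one-directional or mutual foreign-key association
    for ta, tb in fk_tables:
        if (tb, ta) in fk_tables:
            edges[(ta, tb)] = "TABLE_TABLE_FK_BOTH"
        else:
            edges[(ta, tb)] = "TABLE_TABLE_FK_FORWARD"
            edges[(tb, ta)] = "TABLE_TABLE_FK_BACKWARD"
    return edges


# Toy schema used only for illustration.
edges = build_schema_graph(
    tables={"student": ["id", "name", "class_id"], "class": ["id", "title"]},
    primary_keys={"student": "id", "class": "id"},
    foreign_keys=[("student", "class_id", "class", "id")],
)
```

Column vertices are written as `table.column` so that columns with the same name in different tables remain distinct vertices.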

Further preferably, the link relations between the natural language question and the database schema include: the link relations between the word segmentation results of the Chinese natural language question and the data tables and data columns in the database schema, and the link relations between the word segmentation results of the Chinese natural language question and the data values in the database schema;

the link relations between the word segmentation results of the Chinese natural language question and the data tables and data columns in the database schema include, in order of matching priority from high to low: TEM_S, TEM_R, TPM_S, TPM_R, CEM_S, CEM_R, CPM_S and CPM_R;

TEM_S represents that the word segmentation result of the Chinese natural language question is synonymous with the data table name and matches it completely; TEM_R represents that the word segmentation result is related to the data table name and matches it completely; TPM_S represents that the word segmentation result is synonymous with the data table name and matches it partially; TPM_R represents that the word segmentation result is related to the data table name and matches it partially; CEM_S represents that the word segmentation result is synonymous with the data column name and matches it completely; CEM_R represents that the word segmentation result is related to the data column name and matches it completely; CPM_S represents that the word segmentation result is synonymous with the data column name and matches it partially; CPM_R represents that the word segmentation result is related to the data column name and matches it partially;

the link relations between the word segmentation results of the Chinese natural language question and the data values in the database schema include, in order of matching priority from high to low: VEM and VRM;

VEM represents that the word segmentation result of the Chinese natural language question appears in the data column; VRM represents that the word segmentation result of the Chinese natural language question is related to the data column name.

Further preferably, n-gram matching is performed between the word segmentation results of the Chinese natural language question and the segmented data table names and data column names, respectively, to obtain the link relations between the word segmentation results of the Chinese natural language question and the data tables and data columns in the database schema;

for each word segmentation result of the Chinese natural language question, it is determined whether the result exists among the data values stored in columns of the same type in the database, so as to link the result to those data values; when a word segmentation result does not appear in a data column, the result is linked to the data column name by means of the knowledge graph ConceptNet, thereby obtaining the link relations between the word segmentation results of the Chinese natural language question and the data values in the database schema.
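
The two-stage data value linking described above (look the token up among the values of same-type columns first, then fall back to a relatedness lookup) might be sketched as follows. The function names, the input layout, and the `related` set standing in for a ConceptNet query are all assumptions made for illustration; type detection is reduced to numeric versus text for brevity.

```python
def is_number(token):
    """Rough numeric check used to pick the column type to search."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def link_value(token, columns, related):
    """Return {column_name: label} for one question token.

    columns: {column_name: (column_type, [stored values as strings])}
    related: set of (token, column_name) pairs; a hypothetical stand-in
             for a relatedness lookup in a knowledge graph such as ConceptNet.
    """
    token_type = "number" if is_number(token) else "text"
    links = {}
    # Stage 1 (VEM): the token appears among the values of a same-type column.
    for name, (col_type, values) in columns.items():
        if col_type == token_type and token in values:
            links[name] = "VEM"
    # Stage 2 (VRM): only if no value matched, fall back to relatedness.
    if not links:
        for name in columns:
            if (token, name) in related:
                links[name] = "VRM"
    return links


# Toy data used only for illustration.
columns = {
    "city": ("text", ["北京", "上海"]),
    "age": ("number", ["18", "20"]),
    "name": ("text", ["张三"]),
}
related = {("岁数", "age")}
```

Trying VEM before VRM mirrors the stated priority order; a token that matches neither stage simply receives no data value link.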

Further preferably, the relation coding module includes a position coding unit, a multi-head attention unit, a first addition and normalization unit, a feedforward neural network and a second addition and normalization unit, cascaded in sequence;

the position coding unit is used for position-coding the semantic coding vectors, contained in the joint coding tensor, of the word segmentation results of the Chinese natural language question and of the data table names and data column names in the related database schema, adding the position coding vector of each word segmentation result to the corresponding semantic coding vector to obtain the composite semantic coding vector of that word segmentation result, and outputting the composite semantic coding vectors to the multi-head attention unit;

the multi-head attention unit comprises a plurality of attention subunits; each attention subunit is used for applying a first linear transformation to the composite semantic coding vector of the i-th word segmentation result, and a second and a third linear transformation to the composite semantic coding vector of the j-th word segmentation result, the link relation between the i-th and the j-th word segmentation results in the relational directed graph being introduced as a bias in the second and the third linear transformations, so as to obtain a query vector, a key vector and a value vector respectively; the attention value between the i-th and the j-th word segmentation results is then obtained through the attention mechanism, where j = 1, 2, …, n and n is the number of word segmentation results; the attention values so obtained are summed to give the attention value of the i-th word segmentation result; the multi-head attention unit splices the attention values of the i-th word segmentation result obtained by the attention subunits and then performs a dimension transformation so that the result has the same dimension as the composite semantic coding vector of the i-th word segmentation result, thereby obtaining the multi-head attention value of the i-th word segmentation result;

the first addition and normalization unit adds the multi-head attention value of the i-th word segmentation result to the composite semantic coding vector of the i-th word segmentation result, normalizes the sum, and inputs it into the feedforward neural network for processing;

the second addition and normalization unit adds the result obtained by the feedforward neural network to the result obtained by the first addition and normalization unit to obtain the relation coding vector of the i-th word segmentation result, where i = 1, 2, …, n.
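
A single attention subunit of the kind described above can be sketched in plain Python. This is an illustration under assumptions rather than the patented computation: the scaled dot-product and softmax are the standard attention form and are assumed here, and the names `relation_aware_head`, `rel_k` and `rel_v` (the per-pair bias vectors added to keys and values) are hypothetical.

```python
import math


def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_out)."""
    return [sum(x[a] * w[a][b] for a in range(len(x))) for b in range(len(w[0]))]


def relation_aware_head(xs, w_q, w_k, w_v, rel_k, rel_v):
    """One attention subunit with relation biases.

    xs:     n composite semantic coding vectors
    w_q/k/v: the first, second and third linear transformations
    rel_k[i][j], rel_v[i][j]: bias vectors encoding the relation label of (i, j)
    """
    n, d = len(xs), len(w_q[0])
    outputs = []
    for i in range(n):
        query = matvec(xs[i], w_q)
        # Keys and values of every j, each shifted by the bias for relation (i, j).
        keys = [[k + b for k, b in zip(matvec(xs[j], w_k), rel_k[i][j])] for j in range(n)]
        vals = [[v + b for v, b in zip(matvec(xs[j], w_v), rel_v[i][j])] for j in range(n)]
        scores = [sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(d) for j in range(n)]
        # Numerically stable softmax over j.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum over j gives the attention value of the i-th item.
        outputs.append([sum(weights[j] * vals[j][c] for j in range(n)) for c in range(d)])
    return outputs


# Tiny example: identity transformations, two identical inputs.
identity = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[1.0, 0.0], [1.0, 0.0]]
no_bias = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
value_shift = [[[0.0, 1.0] for _ in range(2)] for _ in range(2)]
plain = relation_aware_head(inputs, identity, identity, identity, no_bias, no_bias)
shifted = relation_aware_head(inputs, identity, identity, identity, no_bias, value_shift)
```

In a trained model the bias vectors would be learned embeddings, one per relation label; here they are fixed numbers so the effect of the value bias on the output is easy to see.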

In a second aspect, the present invention provides a method for generating an SQL statement from Chinese natural language, including: inputting a Chinese natural language question and the related database schema into an SQL statement generation model constructed by the above construction method to obtain an SQL statement.

In a third aspect, the present invention provides a database retrieval method based on Chinese natural language, including: inputting a Chinese natural language question and the related database schema into an SQL statement generation model constructed by the above construction method to obtain an SQL statement, and then executing the SQL statement with a database execution engine to perform information retrieval.
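
The retrieval flow of the third aspect (generate an SQL statement, then hand it to a database execution engine) can be illustrated with Python's built-in `sqlite3` module. The function `generate_sql` is a hypothetical stand-in that returns a fixed statement in place of the trained model; the table and data are invented for the example.

```python
import sqlite3


def generate_sql(question, schema):
    """Hypothetical stand-in for the trained SQL statement generation model."""
    return "SELECT name FROM student WHERE age > 18"


def retrieve(conn, question, schema):
    """Generate an SQL statement and execute it with the database engine."""
    sql = generate_sql(question, schema)
    return conn.execute(sql).fetchall()


# Toy in-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?)", [("张三", 17), ("李四", 20)])
rows = retrieve(conn, "年龄大于18岁的学生有哪些", {"student": ["name", "age"]})
```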

In a fourth aspect, the present invention also provides a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement one or more of the above method for constructing an SQL statement generation model for Chinese natural language, the above method for generating an SQL statement from Chinese natural language, and the above database retrieval method based on Chinese natural language.

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

1. The invention provides a construction method of an SQL statement generation model for Chinese natural language, which incorporates the type information of the different data columns to capture the alignment relations and implicit reference relations between the Chinese natural language question and the database schema, and represents the explicitly defined relations of the database schema and the link relations between the natural language question and the database schema as a directed graph. The relations between the natural language and the database schema are taken into account while the model is constructed and are represented reasonably and effectively; by jointly considering the characteristics of unstructured and structured data, semantic features and relational features are combined for joint coding, which greatly improves the accuracy of the SQL statement generation model.

2. The invention provides a construction method of an SQL statement generation model for Chinese natural language, together with a cross-language semantic coding and representation method for Chinese natural language questions and English database schemas. First, a multilingual BERT pre-training model is used to extract the semantic features of the fusion expression vector. Second, the link relations between the natural language question and the database schema are divided into two categories: links to data table names and data column names, and links to data values. The former uses an n-gram statistical language model to link each segmented data table name and data column name, and uses the synonymy, relatedness and other relation edges of the ConceptNet semantic network to help the model handle the cross-language problem and to strengthen relation judgment through common sense. The latter links either by directly searching for data values in the related data columns of the database or by using the relatedness edges of ConceptNet. Relations obtained in different ways are assigned different relationship labels during linking, and the information obtained in different ways is given different importance weights during training, which greatly improves the accuracy of the SQL statement generation model and overcomes the cross-language semantic coding barrier between Chinese natural language questions and English database schemas.

3. The method for constructing an SQL statement generation model for Chinese natural language provided by the invention refines the process of generating an SQL statement from natural language: on the basis of the semantic coding and the relation representation, relation coding based on a multi-head attention mechanism is performed, different attention bias weights are trained for the different relationship labels in the directed graph during coding, and the relation vector output between the natural language question and the database schema is obtained, which improves the accuracy of the model.

4. The method for constructing an SQL statement generation model for Chinese natural language provided by the invention generates SQL statements in depth-first order with a syntax-tree decoding method. The decoding process is divided into two kinds of step, generating syntax keywords and generating database schema items; the possible candidates are determined from the type of the current node and the type of its parent node, a probability distribution is computed by an LSTM, and the candidate with the highest probability is selected. This decoding method fits the structure of SQL statements and the characteristics of structured data, avoids syntax errors, supports the generation of structurally complex SQL statements including table joins and nested queries, and gives the model high accuracy.
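
The depth-first, syntax-tree-guided decoding outlined in item 4 can be sketched with a toy grammar. Everything here is an assumption made for illustration: the grammar covers only two statement shapes, and the `SCORES` lookup table is a hypothetical replacement for the probability distribution that the patent computes with an LSTM.

```python
# Toy grammar: each non-terminal maps to its candidate expansions.
GRAMMAR = {
    "sql": [
        ["SELECT", "column", "FROM", "table"],
        ["SELECT", "column", "FROM", "table", "WHERE", "cond"],
    ],
    "cond": [["column", "=", "value"]],
}
SCHEMA_NODES = {"column", "table"}

# Hypothetical scores keyed by (node type, parent type, candidate).
SCORES = {
    ("sql", None, ("SELECT", "column", "FROM", "table", "WHERE", "cond")): 0.9,
    ("column", "sql", "name"): 0.8,
    ("column", "cond", "age"): 0.7,
    ("table", "sql", "student"): 1.0,
}


def toy_score(node, parent, candidate):
    """Look up a score; unknown candidates score 0.0."""
    key = tuple(candidate) if isinstance(candidate, list) else candidate
    return SCORES.get((node, parent, key), 0.0)


def decode(node, parent, score, schema):
    """Expand `node` depth-first and return the generated token list."""
    if node in GRAMMAR:
        # Keyword step: choose the best expansion, then decode children left to right.
        best = max(GRAMMAR[node], key=lambda cand: score(node, parent, cand))
        tokens = []
        for child in best:
            tokens.extend(decode(child, node, score, schema))
        return tokens
    if node in SCHEMA_NODES:
        # Schema step: choose the best table or column for this position.
        return [max(schema[node], key=lambda cand: score(node, parent, cand))]
    return [node]  # terminal keyword or placeholder


schema = {"column": ["name", "age"], "table": ["student"]}
tokens = decode("sql", None, toy_score, schema)
```

Because every expansion comes from the grammar, the output is always syntactically well formed; the scoring function only decides which well-formed statement is produced. Note how the same `column` node resolves to different columns depending on its parent type.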

Drawings

Fig. 1 is a flowchart of a method for constructing an SQL statement generation model for Chinese natural language according to embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of the semantic coding process provided in embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of the word segmentation process for a natural language question provided in embodiment 1 of the present invention;

Fig. 4 is a schematic diagram of the word segmentation process for data tables and data columns provided in embodiment 1 of the present invention;

Fig. 5 is a diagram illustrating an example of schema linking according to embodiment 1 of the present invention;

Fig. 6 is a diagram illustrating the schema linking results provided in embodiment 1 of the present invention;

Fig. 7 is a schematic diagram of the relation coding process in the relation coding module according to embodiment 1 of the present invention;

Fig. 8 is a schematic diagram of the attention calculation process provided in embodiment 1 of the present invention;

Fig. 9 is a schematic diagram of the syntax-tree-based decoding process for SQL statements according to embodiment 1 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1

A method for constructing a Chinese natural language SQL statement generation model, as shown in FIG. 1, includes the following steps:

S1, building an SQL statement generation model; the SQL statement generation model comprises a word segmentation and splicing module, a semantic coding module, a relation representation module, a relation coding module and an SQL decoding module.

Word segmentation and splicing module:

the word segmentation and splicing module is used for performing word segmentation on the Chinese natural language question and on the data table names and data column names in the related database schema, and then splicing the segmentation results together with the data column type information; attaching the data column type information makes the column types available during encoding, so that the properties of each column are represented in more detail; the resulting fusion expression vector of the Chinese natural language question (unstructured data) and the database schema (structured data) is output to the semantic coding module;

specifically, the word segmentation and splicing module is used for filtering meaningless Chinese and English stop words and punctuation marks from the Chinese natural language question and from the data table names and data column names in the related database schema, performing word segmentation, and restoring mis-segmented items occurring in the Chinese natural language question to obtain the word segmentation results of the Chinese natural language question and of the data table names and data column names in the related database schema; the segmentation results are then spliced, with the corresponding data column type information attached to the word segmentation result of each data column name, to obtain the fusion expression vector of the Chinese natural language question (unstructured data) and the database schema (structured data); the mis-segmented items comprise Arabic numerals, unit symbols, arithmetic expressions and English words; the data column type information includes a numeric type, a time-date type and a text type.

As shown in Fig. 2, in this embodiment the original natural language question is $Q_{origin}$; word segmentation splits it into substrings, and the segmented natural language question is $Q = (q_1, q_2, \ldots, q_{|Q|})$, where $|Q|$ is the number of characters or words obtained by segmenting the natural language question;

the data table names $T_{origin}$ and data column names $C_{origin}$ in the relational database schema consist of English words joined by underscores, and are segmented at the underscores to obtain the segmented data table name $T = (t_1, t_2, \ldots, t_{|T|})$ and the segmented data column name $C = (c_1, c_2, \ldots, c_{|C|})$, where $|T|$ is the number of words obtained by segmenting the data table name and $|C|$ is the number of words obtained by segmenting the data column name. Specifically, in an optional embodiment, as shown in Fig. 3, the high-precision Chinese word segmentation tool Jieba is used. Jieba processes the Chinese natural language question and returns the most probable segmentation; however, because natural language questions often contain Arabic numerals, unit symbols, punctuation marks and the like in addition to Chinese characters, the substrings that Jieba splits apart in these cases need to be merged again so that their original meaning is preserved. The data table names and data column names of the database, in line with engineering practice, consist of English words joined by underscores, and are therefore segmented using the underscore as the separator, as shown in Fig. 4.
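
The two segmentation steps just described, splitting schema names at underscores and merging fragments that a segmenter has split apart, can be sketched without any third-party dependency. The token list passed to `merge_fragments` is a hypothetical example of what a Chinese word segmenter might return, and the merging rule (rejoin adjacent tokens made only of digits, Latin letters, or arithmetic symbols) is a simplifying assumption, not the patent's exact restoration rule.

```python
import re


def split_schema_name(name):
    """Split an English table or column name at underscores."""
    return [part for part in name.split("_") if part]


# Tokens made only of digits, Latin letters, or common arithmetic symbols.
_FRAGMENT = re.compile(r"[0-9A-Za-z.%+\-*/=]+")


def merge_fragments(tokens):
    """Rejoin adjacent fragment tokens that a segmenter split apart."""
    merged = []
    previous_was_fragment = False
    for token in tokens:
        is_fragment = bool(_FRAGMENT.fullmatch(token))
        if is_fragment and previous_was_fragment:
            merged[-1] += token  # continue the current number/unit/expression
        else:
            merged.append(token)
        previous_was_fragment = is_fragment
    return merged


# Hypothetical segmenter output for a question mentioning "3.5kg".
raw_tokens = ["价格", "大于", "3", ".", "5", "kg", "的", "商品"]
question_tokens = merge_fragments(raw_tokens)
column_tokens = split_schema_name("student_class_id")
```

This simple rule would also glue two adjacent English words together without a space, so a fuller implementation would need to consult the original string when restoring English phrases.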

After word segmentation is finished, the segmented natural language question, data table names and data column names are spliced, and the type of the corresponding data column is attached to each data column name, to obtain the fusion expression vector X of the Chinese natural language question (unstructured data) and the database schema (structured data); the data column types comprise a numeric type (including integers and floating-point numbers), a time-date type and a text type.
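
The splicing step can be illustrated as a simple sequence concatenation. The function name `fuse`, the input layout, and the choice to place each type token directly after its column tokens are assumptions for illustration; the text states only that the type is attached to each data column.

```python
def fuse(question_tokens, table_names, columns):
    """Concatenate question, table and column tokens into one sequence.

    table_names: list of token lists, one per data table name
    columns:     list of (token list, column type) pairs
    """
    sequence = list(question_tokens)
    for table_tokens in table_names:
        sequence.extend(table_tokens)
    for column_tokens, column_type in columns:
        sequence.extend(column_tokens)
        sequence.append(column_type)  # attach the type to its column
    return sequence


# Toy input used only for illustration.
fused = fuse(
    ["学生", "姓名"],
    [["student"]],
    [(["name"], "text"), (["class", "id"], "number")],
)
```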

A semantic coding module:

the semantic coding module is used for extracting semantic features from the fusion expression vector by means of a natural language pre-training model; the semantic features take the form of semantic vectors and jointly encode the Chinese natural language question, the data tables and the data columns; the semantic coding vectors of the word segmentation results of the Chinese natural language question and of the data table names and data column names in the related database schema are finally obtained, formed into a joint coding tensor, and output to the relation coding module;

specifically, in order to capture the alignment relation between the natural language question posed by the user and the database schema, the two need to be semantically encoded jointly; since the fusion expression vector X consists of two languages, Chinese and English, this embodiment encodes it with a multilingual pre-training model (Multilingual-BERT).

Word segmentation and data splicing yield a sequence suitable for semantic coding; because the input of the BERT model additionally requires the special symbols [CLS] and [SEP], these symbols are spliced in to obtain the model input sequence.

The spliced sequence is input into the multilingual BERT pre-training model; after the semantic coding vectors of the word segmentation results of the Chinese natural language question and of the data table names and data column names in the related database schema are obtained, a joint coding tensor $X_{bert}$ is formed.

A relationship representation module:

the relation representation comprises the database schema relation representation and schema linking;

1) database schema relationship:

there are a number of known relationships in database schemas that are explicitly specified at the time of data table and data column definition. The relationships in the database schema conform to the definition of a directed graph, with data tables and data columns as vertices and edges representing the relationships between two vertices. Specifically, the relational representation of the relevant database schema is built in a directed graph mode, wherein in the directed graph, the vertex comprises a data table name and a data column name, and the edge is a database definition relation existing between the vertices.

Specifically, Table 1 lists the relationship label types in the database schema relationship directed graph, where vertex A and vertex B are any two different vertices in the relational directed graph; when vertex A and vertex B are both data column names, the database definition relations existing between the vertices include: vertex A and vertex B belong to the same data table; and vertex A and vertex B are associated by a foreign key;

TABLE 1

2) Mode linking:

the mode link is used for finding the alignment relation between the Chinese natural language problem and the relevant database mode, obtaining synonymy and relevant relation between cross-language words and phrases by comprehensively utilizing methods of character string matching, a statistical language model, a semantic network and a common sense network, and representing by using a directed graph so as to find data table, data column or data value information mentioned in the natural language problem; the vertices of the directed graph include words and data tables, data columns, and data values in the natural language question. The mode link fuses text features, numerical features, semantic features and common sense feature relations between the natural language problem and the database mode, and different relation labels are distributed to different feature relations in the directed graph. Specifically, the schema link includes a link relationship between the word segmentation result (word, word or phrase) of the chinese natural language question and the data table and the data column in the database schema, and a link relationship between the word segmentation result (word, word or phrase) of the chinese natural language question and the data value in the database schema; schema chaining is a process of chaining word segmentation results in natural language questions and data tables, data columns or data values in database schemas, discovering target tables and target columns to be queried by a questioner and discovering constraints constituting SQL statements.

Table 2 lists the relationship label types for links to data table names and data column names. The link relations between the word segmentation results of the Chinese natural language question and the data tables and data columns in the database schema comprise, in order of matching priority from high to low: TEM_S, TEM_R, TPM_S, TPM_R, CEM_S, CEM_R, CPM_S and CPM_R;

TABLE 2

Relationship label	Description
TEM_S	Synonymous with the table name, exact match
TEM_R	Related to the table name, exact match
TPM_S	Synonymous with the table name, partial match
TPM_R	Related to the table name, partial match
CEM_S	Synonymous with the column name, exact match
CEM_R	Related to the column name, exact match
CPM_S	Synonymous with the column name, partial match
CPM_R	Related to the column name, partial match

In this embodiment, n-gram matching is carried out between the word segmentation results of the Chinese natural language question and the segmented data table names and data column names, which gives the link relations between the word segmentation results of the question and the data tables and data columns in the database schema. The n-gram algorithm is based on a statistical language model: a sliding window is moved over a data table name or data column name to obtain segment sequences of length m, and each sequence is then matched against the word segmentation results of the Chinese natural language question. The matching priority is: synonym exact match > related exact match > synonym partial match > related partial match.
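The sliding-window matching and its priority order can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `ngrams`, `match_label` and `link_question`, the toy `SYNONYMS` and `RELATED` dictionaries (standing in for the semantic network and common-sense network), and the example schema are all hypothetical. The sketch returns one best label per (word, name) pair.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy lexicons (assumptions): Chinese word -> English synonyms / related words.
SYNONYMS: Dict[str, Set[str]] = {"部门": {"department"}, "年龄": {"age"}}
RELATED: Dict[str, Set[str]] = {"领导": {"head", "leader"}}


def ngrams(tokens: List[str], m: int) -> List[Tuple[str, ...]]:
    """Slide a window of length m over tokens and return every segment."""
    return [tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1)]


def match_label(word: str, name_tokens: List[str], kind: str) -> Optional[str]:
    """Best label linking `word` to one table (kind='T') or column (kind='C') name.

    Priority: synonym exact > related exact > synonym partial > related partial.
    """
    full = " ".join(name_tokens)
    # Proper sub-segments of the name: every window shorter than the full name.
    partial = {" ".join(g) for m in range(1, len(name_tokens))
               for g in ngrams(name_tokens, m)}
    for match in ("EM", "PM"):
        for rel, lexicon in (("S", SYNONYMS), ("R", RELATED)):
            candidates = lexicon.get(word, set())
            if match == "EM" and full in candidates:
                return f"{kind}{match}_{rel}"
            if match == "PM" and partial & candidates:
                return f"{kind}{match}_{rel}"
    return None


def link_question(words: List[str],
                  tables: Dict[str, List[str]],
                  columns: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Return (word, schema name, label) edges for the schema-linking graph."""
    edges = []
    for word in words:
        for kind, names in (("T", tables), ("C", columns)):
            for name, tokens in names.items():
                label = match_label(word, tokens, kind)
                if label is not None:
                    edges.append((word, name, label))
    return edges


edges = link_question(
    ["部门", "领导", "年龄", "56"],
    tables={"department_head": ["department", "head"]},
    columns={"age": ["age"], "head_ID": ["head", "id"]},
)
```

In this toy run, "部门" reaches the table name only through a sub-segment, so it receives a partial-match label, while "年龄" matches the whole column name and receives an exact-match label.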

Table 3 lists the relationship label types for data value links. The link relations between the word segmentation results of the Chinese natural language question and the data values in the database schema comprise, in order of matching priority from high to low: VEM and VRM;

TABLE 3

Relationship label	Description
VEM	The value appears in the data column
VRM	The value is related to the data column name

Specifically, when linking with the VEM relationship label, data columns of the same type are first selected according to the type of the character or word in the natural language question, where the types are number, date and text. It is then checked whether the character or word appears in such a data column as a data value; if it does, the character or word in the natural language question is linked to that data column with the VEM relationship label.
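The VEM check (filter columns by type, then test membership) can be sketched as below. This is only an illustration: the helper names `value_type` and `vem_links`, the regular expressions used to detect numbers and dates (including the `YYYY-MM-DD` date format), and the in-memory column contents are assumptions, not details given in the embodiment.

```python
import re
from typing import Dict, List, Tuple


def value_type(token: str) -> str:
    """Classify a question token as 'number', 'date' or 'text' (assumed rules)."""
    if re.fullmatch(r"-?\d+(\.\d+)?", token):
        return "number"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", token):
        return "date"
    return "text"


def vem_links(tokens: List[str],
              columns: Dict[str, Tuple[str, List[str]]]) -> List[Tuple[str, str, str]]:
    """Link tokens to columns whose type matches and whose values contain the token.

    `columns` maps a column name to (column type, stored values as strings).
    """
    links = []
    for tok in tokens:
        tok_type = value_type(tok)
        for col, (col_type, values) in columns.items():
            # Step 1: keep only columns of the same type as the token.
            # Step 2: check whether the token occurs as a data value.
            if col_type == tok_type and tok in values:
                links.append((tok, col, "VEM"))
    return links


# Hypothetical column contents for illustration.
example_columns = {
    "head_ID": ("number", ["1", "2", "3"]),
    "age": ("number", ["43", "56", "67"]),
    "name": ("text", ["Tiger Woods"]),
}
vem_result = vem_links(["年龄", "56"], example_columns)
```

Here "56" is a number, so only the two numeric columns are candidates, and it is linked only to the column that actually stores that value.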

When linking with the VRM relationship label, the linked objects are Chinese characters or words on one side and English data column names on the other, so two kinds of relation edges in ConceptNet are allowed for linking: synonym edges and related-to edges. Related words are obtained along these edges and linked to the data columns; a VRM link requires the related word to match the data column name exactly.

FIG. 5 shows one complete embodiment consisting of a natural language question, a relational database schema and the corresponding SQL statement, in which the natural language question and the relational database schema have already been processed by the word segmentation module. The word "department" is linked to the segmented data table name department_head in the database schema; this table appears in the FROM clause of the generated SQL, and the answer to the questioner's question must be looked up in the department_head data table. The word "age" is linked to the age data column of the department_head data table, and this column appears in the WHERE clause as a constraint. "56" (marked in green) is an integer value that conforms to the data type of both the head_ID and age data columns of the department_head data table; based on data value matching and the common-sense network, it is linked to the age data column, so the value appears in the WHERE clause as the limit value for age. FIG. 6 shows the schema linking result of this example, in which oval boxes represent words in the natural language question and rectangular boxes represent names in the database schema.

Relational coding module:

The relational coding module jointly encodes the joint coding tensor and the relational directed graph according to their alignment, extracts the relational features of both to obtain a relational coding tensor, and outputs the relational coding tensor to the SQL decoding module. Specifically, relational coding extracts the semantic coding vectors in the joint coding tensor and the relational features in the relational directed graph; the relational features take the form of relation vectors and jointly encode the semantic coding vectors, the database schema relations, and the alignment between the Chinese natural language question and the relevant database schema.

The relational coding module is implemented based on an improved Transformer model. As shown in FIG. 7, $X_{bert}$ is the input of the relational coding module; after position encoding, multi-head attention is computed, followed by addition and layer normalization; the result is input into a feedforward neural network, and a final addition and layer normalization gives the output $X_{encode}$ of relational coding.

Specifically, the relational coding module comprises a position coding unit, a multi-head attention unit, a first addition and normalization unit, a feedforward neural network and a second addition and normalization unit, cascaded in sequence;

specifically, in this embodiment, the position encoding vector of the ith word segmentation result is:

where k = 0, 1, …, d/2; d is the dimension of the semantic coding vector of the i-th word segmentation result; $d_{model}$ is a constant, namely the preset dimension of K, Q and V in the attention mechanism, and takes the value 10000 in this embodiment.

The composite semantic coding vector of the i-th word segmentation result is $x_{pos,i} = P_i + x_{bert,i}$, where $x_{bert,i}$ is the semantic coding vector of the i-th word segmentation result in the joint coding tensor $X_{bert}$; i = 1, 2, …, n; and n is the number of word segmentation results.
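The position coding step can be sketched as follows. The position formula itself is not reproduced in this text, so the sketch assumes the standard sinusoidal encoding of the original Transformer with base 10000 (sine on even components, cosine on odd components); the function names `position_vector` and `add_position` are hypothetical, and d is assumed to be even.

```python
import math
from typing import List


def position_vector(i: int, d: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal position vector P_i of dimension d (assumed form, d even)."""
    p = [0.0] * d
    for k in range(d // 2):
        angle = i / (base ** (2 * k / d))
        p[2 * k] = math.sin(angle)      # even component
        p[2 * k + 1] = math.cos(angle)  # odd component
    return p


def add_position(x_bert: List[List[float]]) -> List[List[float]]:
    """Return x_pos with x_pos[i] = P_i + x_bert[i] for positions i = 1..n."""
    d = len(x_bert[0])
    return [[p + x for p, x in zip(position_vector(i, d), row)]
            for i, row in enumerate(x_bert, start=1)]
```

Because the position vector is added rather than concatenated, the composite vector keeps the dimension d of the semantic coding vector, which is what the later residual additions rely on.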

The multi-head attention unit comprises a plurality of attention subunits. The h-th attention subunit performs a first linear transformation on the composite semantic coding vector of the i-th word segmentation result, and performs a second linear transformation and a third linear transformation on the composite semantic coding vector of the j-th word segmentation result. In the second and third linear transformations, the link relation between the i-th and j-th word segmentation results in the relational directed graph is introduced as a bias. This gives a query vector, a key vector and a value vector respectively, from which the attention value between the i-th and j-th word segmentation results is obtained by the attention mechanism, where j = 1, 2, …, n and n is the number of word segmentation results. The attention values so obtained are added to give the attention value of the i-th word segmentation result. The multi-head attention unit concatenates the attention values of the i-th word segmentation result obtained by each attention subunit and then converts the dimension so that it equals the dimension of the composite semantic coding vector of the i-th word segmentation result, giving the multi-head attention value of the i-th word segmentation result;

In particular, the core of the Transformer is the multi-head attention mechanism, which consists of multiple independent attention layers whose outputs together make up the final attention output, so that more "attention" is given to the critical information in the input. In the invention, the relation vector between the i-th word segmentation result and the j-th word segmentation result is defined as $r_{ij}$, which is obtained by concatenating a plurality of relation vectors $\alpha_{ij}$; specifically:

where R is the number of relationship label types in the relational directed graph, which is 20 in this embodiment, and each relationship label type is represented by a corresponding vector; when there is no relation between the i-th word segmentation result and the j-th word segmentation result, $\alpha_{ij}$ is replaced by a zero vector of the corresponding dimension.

In this embodiment, the query vector obtained by the h-th attention subunit through the first linear transformation of the composite semantic coding vector $x_{pos,i}$ of the i-th word segmentation result is:

$$q_i^{h} = x_{pos,i}\,W_h^{Q}$$

where $W_h^{Q}$ is the coefficient matrix of the first linear transformation in the h-th attention subunit.

The h-th attention subunit performs the second linear transformation on the composite semantic coding vector $x_{pos,j}$ of the j-th word segmentation result, and introduces the link relation between the i-th and j-th word segmentation results in the relational directed graph as a bias in the second linear transformation, giving the key vector:

$$k_{ij}^{h} = x_{pos,j}\,W_h^{K} + r_{ij}^{K}$$

where $W_h^{K}$ is the coefficient matrix of the second linear transformation in the h-th attention subunit, and $r_{ij}^{K}$ is the result of transforming the dimension of the relation vector $r_{ij}$ between the i-th and j-th word segmentation results to be the same as that of $x_{pos,j}\,W_h^{K}$.

The h-th attention subunit performs the third linear transformation on the composite semantic coding vector $x_{pos,j}$ of the j-th word segmentation result, and introduces the link relation between the i-th and j-th word segmentation results in the relational directed graph as a bias in the third linear transformation, giving the value vector:

$$v_{ij}^{h} = x_{pos,j}\,W_h^{V} + r_{ij}^{V}$$

where $W_h^{V}$ is the coefficient matrix of the third linear transformation in the h-th attention subunit, and $r_{ij}^{V}$ is the result of transforming the dimension of the relation vector $r_{ij}$ between the i-th and j-th word segmentation results to be the same as that of $x_{pos,j}\,W_h^{V}$.

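In this embodiment, the attention value between the i-th word segmentation result and the j-th word segmentation result is obtained by dot-product attention, specifically:

$$a_{ij}^{h} = \mathrm{softmax}_{j}\!\left(\frac{q_i^{h}\,(k_{ij}^{h})^{T}}{\sqrt{d_{model}/H}}\right) v_{ij}^{h}$$

where $q_i^{h}$, $k_{ij}^{h}$ and $v_{ij}^{h}$ are the query vector, key vector and value vector in the h-th attention subunit; H is the number of attention subunits, i.e. the number of heads of the multi-head attention; $d_{model}/H$ is the dimension of the key vector $k_{ij}^{h}$; and the dot product of $q_i^{h}$ and $k_{ij}^{h}$ is divided by $\sqrt{d_{model}/H}$ to make the gradient more stable. Further, the attention value of the i-th word segmentation result is:

$$head_i^{h} = \sum_{j=1}^{n} a_{ij}^{h}$$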

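The multi-head attention value of the i-th word segmentation result is obtained by concatenating the attention values of the individual attention subunits and multiplying the concatenated result by a coefficient matrix, specifically:

$$O_i = \mathrm{Concat}(head_i^{1}, head_i^{2}, \ldots, head_i^{H})\,W^{O}$$

where $head_i^{h}$ is the attention value of the i-th word segmentation result from the h-th attention subunit, and $W^{O}$ is the coefficient matrix that linearly transforms $\mathrm{Concat}(head_i^{1}, \ldots, head_i^{H})$ to give the final multi-head attention value.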

It should be noted that the idea of attention is to map a query to a set of key-value pairs. As shown in FIG. 8, the calculation proceeds as follows: first, the correlation between the query and each key is computed to obtain the weight of each value; the weights are then normalized by softmax to obtain the weight coefficient of each value; finally, the values are multiplied by their weight coefficients and accumulated to give the attention result. Within the attention mechanism, different attention biases are assigned to different relationship labels in the relational directed graph, so that relation vectors representing the relational features are computed.
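The three steps above, together with the relation bias on keys and values, can be sketched for a single attention subunit as follows. This is a sketch under stated assumptions rather than the embodiment's code: the names `matvec`, `softmax` and `relation_aware_head` are hypothetical, the relation biases `r_k[i][j]` and `r_v[i][j]` are assumed to be already projected to the head dimension (zero vectors when no relation exists), and the scaling uses the square root of the key dimension as in standard scaled dot-product attention.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[a] * w[a][b] for a in range(len(x))) for b in range(len(w[0]))]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_aware_head(x_pos: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix,
                        r_k: List[Matrix], r_v: List[Matrix]) -> Matrix:
    """One attention subunit with relation biases added to keys and values."""
    n = len(x_pos)
    d_head = len(w_k[0])  # plays the role of d_model / H
    out = []
    for i in range(n):
        q = matvec(x_pos[i], w_q)
        keys = [[a + b for a, b in zip(matvec(x_pos[j], w_k), r_k[i][j])]
                for j in range(n)]
        values = [[a + b for a, b in zip(matvec(x_pos[j], w_v), r_v[i][j])]
                  for j in range(n)]
        # Step 1: correlation between the query and each key, scaled.
        scores = [sum(qa * ka for qa, ka in zip(q, keys[j])) / math.sqrt(d_head)
                  for j in range(n)]
        # Step 2: softmax normalization gives the weight coefficients.
        weights = softmax(scores)
        # Step 3: weighted accumulation of the values.
        out.append([sum(weights[j] * values[j][c] for j in range(n))
                    for c in range(d_head)])
    return out
```

A non-zero key bias for a pair (i, j) raises or lowers the score of that pair, which is how a schema-linking edge can steer attention from a question word toward the column it is linked to.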

After the multi-head attention has been computed, the multi-head attention result is input into the first addition and normalization unit, where it is added to the composite semantic coding vector of the i-th word segmentation result and then normalized to give $y_i = \mathrm{LayerNorm}(x_{pos,i} + O_i)$.

$y_i$ is input into the feedforward neural network, where it passes through a linear layer, is processed by a ReLU activation function, and then passes through another linear layer to give the output $z_i$; specifically: $z_i = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(y_i)))$.

The second addition and normalization unit adds the result of the feedforward neural network to the result of the first addition and normalization unit and normalizes the sum to give the relational coding vector of the i-th word segmentation result; specifically: $X_i = \mathrm{LayerNorm}(y_i + z_i)$.
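The two residual-plus-normalization steps and the feedforward network can be sketched for one word segmentation result as below. The sketch is illustrative only: the function names are hypothetical, and layer normalization is written without the learnable gain and bias that a trained model would normally include.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize x to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Affine map x @ w + b with w of shape (len(x), len(b))."""
    return [sum(x[a] * w[a][c] for a in range(len(x))) + b[c]
            for c in range(len(b))]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def encode_position(x_pos_i: Vector, o_i: Vector,
                    w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Compute X_i from the composite vector x_pos_i and attention output O_i."""
    # y_i = LayerNorm(x_pos,i + O_i)
    y = layer_norm([a + b for a, b in zip(x_pos_i, o_i)])
    # z_i = Linear(ReLU(Linear(y_i)))
    z = linear(relu(linear(y, w1, b1)), w2, b2)
    # X_i = LayerNorm(y_i + z_i)
    return layer_norm([a + b for a, b in zip(y, z)])
```

Both additions require matching dimensions, which is why the multi-head output is converted back to the dimension of the composite semantic coding vector before this stage.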

Each word segmentation result is processed according to the above steps, finally giving the relational coding tensor $X_{encode} = [X_1, X_2, \ldots, X_n]$.

SQL decoding module:

Specifically, SQL decoding generates the SQL statement from the relational coding tensor $X_{encode}$. Because the syntax of SQL is a tree structure, a decoding scheme based on a syntax tree is used. The SQL decoding module decodes the relational coding tensor into an SQL statement using a syntax tree structure: starting from the root node root, a node sequence is generated in depth-first order according to the SQL syntax rules; the features of the previous time step, the parent node features, the context features and the node type features are fused; and the probability of the correct SQL statement corresponding to the natural language question is maximized. The node types comprise the generation of SQL keywords and terminators, and the generation of data table names or data column names.

The root node of the syntax tree is root, and nodes are generated in depth-first order. The generated nodes fall into two categories: (1) generation of an SQL grammar keyword or terminator, called GenGrammar; (2) generation of a data table name or a data column name, called GenTable and GenCol respectively. The possible types of the successor nodes of each node are determined by the SQL syntax. During decoding, the decoding sequence is generated by a long short-term memory (LSTM) neural network.

The model updates the cell state $s_t$ and the output $o_t$ of the LSTM at time t as follows:

$$s_t, o_t = \mathrm{LSTM}([action_{t-1}, action_{pt}, z_t, o_{pt}, type_t],\ s_{t-1},\ o_{t-1})$$

where action denotes a node action, namely GenGrammar, GenTable or GenCol; $action_{pt}$ denotes the action of the parent node of the current node; $z_t$ is a context vector obtained from the output $o_{t-1}$ of the previous time step through multi-head attention; $o_{pt}$ denotes the output at the time step of the parent node of the current node; and $type_t$ denotes the type vector of the current node.

The goal of the model is to maximize the probability value of the correct SQL statement, i.e.:

where $action_{pre}$ denotes the sequence of all previous actions.

Each branch of the syntax tree finally ends with [END]. When all branches have ended, the syntax tree is traversed in depth-first order to generate an SQL statement that conforms to the syntax rules.

In one embodiment, the following SQL statement is generated:

SELECT COUNT(*)

FROM tbl_students

WHERE name....

The generation process is shown in FIG. 9. Nodes are generated in depth-first order starting from the root node root of the syntax tree; tbl_students is a data table name, and name is a column name in that table. In the figure, the action of generating an SQL grammar keyword or the terminator [END] is called GenGrammar; the actions of generating a data table name and a data column name are called GenTable and GenCol respectively. The possible types of the successor nodes of each node are determined by the SQL syntax. When all branches have ended with [END], the SQL decoding process is complete; the syntax tree is traversed in depth-first order to obtain a unique SQL statement, which is handed to the database execution engine to return the SQL query result.
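The final traversal step can be sketched as follows. This is a simplified illustration, not the decoder itself: the `Node` class and the helper names are hypothetical, the tree is built by hand instead of by the LSTM, and only the SELECT and FROM branches of the example are included because the WHERE clause is elided in the text.

```python
from dataclasses import dataclass, field
from typing import List

END = "[END]"


@dataclass
class Node:
    label: str   # SQL keyword, table name, column name, "root" or "[END]"
    action: str  # "GenGrammar", "GenTable" or "GenCol"
    children: List["Node"] = field(default_factory=list)


def all_branches_ended(node: Node) -> bool:
    """True when every leaf of the tree is the terminator [END]."""
    if not node.children:
        return node.label == END
    return all(all_branches_ended(child) for child in node.children)


def to_sql_tokens(node: Node) -> List[str]:
    """Depth-first traversal collecting SQL tokens, skipping root and [END]."""
    tokens = [] if node.label in ("root", END) else [node.label]
    for child in node.children:
        tokens.extend(to_sql_tokens(child))
    return tokens


def end_node() -> Node:
    return Node(END, "GenGrammar")


# Hand-built tree for the SELECT and FROM branches of the example statement.
tree = Node("root", "GenGrammar", [
    Node("SELECT", "GenGrammar", [Node("COUNT(*)", "GenGrammar", [end_node()])]),
    Node("FROM", "GenGrammar", [Node("tbl_students", "GenTable", [end_node()])]),
])

sql = " ".join(to_sql_tokens(tree)) if all_branches_ended(tree) else None
```

Checking that every branch has ended before traversal mirrors the stopping condition of the decoder: a tree with an open branch does not yet correspond to a complete statement.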

In this embodiment, a large number of Chinese natural language questions, the relevant database schemas and the corresponding correct SQL statements are collected in advance. Each training sample is a triple consisting of a Chinese natural language question, all database schemas appearing in the SQL statement, and the corresponding correct SQL statement; all training samples together form the training set.

The SQL statement generation model extracts the semantic features of the natural language question, which is unstructured data, and the semantic features and relational features of the database schema, which is structured data, and fuses the semantic features and the relational features to obtain fusion features. It then generates from the fusion features an SQL statement, corresponding to the natural language question, that the database execution engine can recognize and execute. The error between the generated SQL statement and the correct SQL statement corresponding to the natural language question is computed as the loss, and the loss is reduced through repeated iterative training so as to train the SQL statement generation model.
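One common way to turn "maximize the probability of the correct SQL statement" into a loss is the negative log-likelihood of the correct action sequence. The sketch below assumes that formulation; the text does not state the exact loss, and the function name `sequence_nll` and the toy probabilities are hypothetical.

```python
import math
from typing import Dict, List


def sequence_nll(step_probs: List[Dict[str, float]],
                 gold_actions: List[str]) -> float:
    """Negative log-likelihood of the gold action sequence.

    step_probs[t] maps each candidate action at step t to the probability the
    decoder assigns to it given all previous actions.
    """
    loss = 0.0
    for probs, gold in zip(step_probs, gold_actions):
        # Clamp to avoid log(0) when the gold action gets zero probability.
        loss -= math.log(max(probs.get(gold, 0.0), 1e-12))
    return loss


# Toy decoder outputs for a two-step sequence (illustrative values only).
toy_probs = [
    {"SELECT": 0.5, "FROM": 0.5},
    {"COUNT(*)": 0.25, "name": 0.75},
]
toy_loss = sequence_nll(toy_probs, ["SELECT", "COUNT(*)"])
```

Minimizing this quantity over the training set is equivalent to maximizing the product of the per-step probabilities of the correct actions.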

Examples 2,

A method for generating SQL statements from Chinese natural language, comprising: inputting a Chinese natural language question and the relevant database schema into the SQL statement generation model constructed by the method for constructing an SQL statement generation model for Chinese natural language provided in Embodiment 1, to obtain the SQL statement.

The related technical solution is the same as in Embodiment 1 and is not repeated here.

Examples 3,

A database retrieval method based on Chinese natural language, comprising: inputting a Chinese natural language question and the relevant database schema into the SQL statement generation model constructed by the method for constructing an SQL statement generation model for Chinese natural language provided in Embodiment 1, and, after the SQL statement is obtained, executing the SQL statement with the database execution engine to perform information retrieval.

Specifically, the trained SQL statement generation model receives the user's Chinese natural language question and the database schema related to the question, extracts the text features and relational features, and decodes them into the corresponding SQL statement using the syntax tree structure. The database execution engine executes the SQL statement to obtain the query result for the user's question, and the result is finally returned to the user. By converting the user's natural language question into an SQL statement and having the database execution engine execute it, the invention greatly lowers the threshold for using a database and the user's learning cost, and improves the efficiency of information retrieval.
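The retrieval flow (question and schema in, SQL out, execution, result returned) can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: `fake_generate_sql` is a stand-in for the trained model and simply returns a fixed statement, and the table contents, the Chinese question and the comparison operator in the WHERE clause are hypothetical.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

Schema = Dict[str, List[str]]  # table name -> column names


def retrieve(question: str, schema: Schema,
             generate_sql: Callable[[str, Schema], str],
             conn: sqlite3.Connection) -> List[Tuple]:
    """Generate an SQL statement for the question, execute it, return the rows."""
    sql = generate_sql(question, schema)
    return conn.execute(sql).fetchall()


def fake_generate_sql(question: str, schema: Schema) -> str:
    """Stand-in for the trained model (assumption): returns a fixed statement."""
    return "SELECT COUNT(*) FROM department_head WHERE age > 56"


def demo() -> List[Tuple]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE department_head (head_ID INTEGER, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO department_head VALUES (?, ?, ?)",
        [(1, "A", 67), (2, "B", 56), (3, "C", 43), (4, "D", 60)],
    )
    schema = {"department_head": ["head_ID", "name", "age"]}
    rows = retrieve("有多少部门领导的年龄大于56岁？", schema, fake_generate_sql, conn)
    conn.close()
    return rows
```

In a deployment, running generated statements over a read-only connection would be a sensible precaution, since the statement text comes from a model rather than from a vetted query.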

Examples 4,

A machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement one or more of: the method for constructing an SQL statement generation model for Chinese natural language provided in Embodiment 1, the method for generating SQL statements from Chinese natural language provided in Embodiment 2, and the database retrieval method based on Chinese natural language provided in Embodiment 3.

The related technical solutions are the same as in Embodiment 1, Embodiment 2 and Embodiment 3 and are not repeated here.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for constructing an SQL statement generation model of a Chinese natural language is characterized by comprising the following steps:

S1, building an SQL statement generation model; wherein the SQL statement generation model comprises: a word segmentation and splicing module, a semantic coding module, a relation representation module, a relational coding module and an SQL decoding module;

the semantic coding module is used for extracting semantic features in the fusion expression vector by adopting a natural language pre-training model to obtain semantic coding vectors of word segmentation results of data table naming and data column naming in a Chinese natural language problem and a related database mode, and outputting the semantic coding vectors to the relation coding module after a combined coding tensor is formed;

the relational coding module is used for performing joint coding on the alignment relation between the joint coding tensor and the relational digraph, extracting relational features in the joint coding tensor and the relational digraph to obtain a relational coding tensor, and outputting the relational coding tensor to the SQL decoding module;

the SQL decoding module is used for decoding the relational coding tensor into SQL statements based on a syntax tree structure;

2. The method for constructing the SQL statement generating model of the chinese natural language according to claim 1, wherein the segmentation and concatenation module is configured to filter stop words and punctuation marks in the data table names and the data column names in the chinese natural language problem and the related database schema, perform segmentation, recover mis-segmentation items occurring in the chinese natural language problem, obtain the segmentation results of the data table names and the data column names in the chinese natural language problem and the related database schema, perform concatenation, connect the segmentation result of each data column name with the corresponding data column type information in the concatenation process, and obtain a fusion expression vector of the database schema of the unstructured data and the chinese natural language problem;

3. The method of claim 1, wherein the semantic coding module uses a multilingual BERT pre-training model to extract semantic features in the fusion expression vector.

4. The method for constructing an SQL statement generation model for Chinese natural language according to any one of claims 1 to 3, wherein any two different vertices in the relational directed graph are denoted vertex A and vertex B, and when vertex A and vertex B are both data column names, the database-defined relationships between the vertices include: vertex A and vertex B belong to the same data table, and vertex A and vertex B are associated by a foreign key;

when the vertex A is named for a data column and the vertex B is named for a data table, the database definition relationship existing between the vertices comprises: vertex A is the primary key for vertex B and vertex A is the column for vertex B, but not the primary key for vertex B;

when the vertex A is named for a data table and the vertex B is named for a data column, the database definition relationship existing between the vertexes comprises the following steps: vertex B is the primary key for vertex A and vertex B is the column for vertex A, but not the primary key for vertex A;

when vertex A is a data table name and vertex B is a data table name, the database-defined relationships between the vertices include: vertex A has a foreign key column associated with vertex B, vertex B has a foreign key column associated with vertex A, and vertex A and vertex B have foreign key columns associated with each other.

5. The method for constructing an SQL statement generation model for Chinese natural language according to any one of claims 1 to 3, wherein the link relations between the natural language question and the database schema include: the link relations between the word segmentation results of the Chinese natural language question and the data tables and data columns in the database schema, and the link relations between the word segmentation results of the Chinese natural language question and the data values in the database schema;

TEM_S represents that the word segmentation result of the Chinese natural language question is synonymous with and exactly matches the data table name; TEM_R represents that the word segmentation result of the Chinese natural language question is related to and exactly matches the data table name; TPM_S represents that the word segmentation result of the Chinese natural language question is synonymous with and partially matches the data table name; TPM_R represents that the word segmentation result of the Chinese natural language question is related to and partially matches the data table name; CEM_S represents that the word segmentation result of the Chinese natural language question is synonymous with and exactly matches the data column name; CEM_R represents that the word segmentation result of the Chinese natural language question is related to and exactly matches the data column name; CPM_S represents that the word segmentation result of the Chinese natural language question is synonymous with and partially matches the data column name; CPM_R represents that the word segmentation result of the Chinese natural language question is related to and partially matches the data column name;

VEM represents that the word segmentation result of the Chinese natural language question appears in a data column; VRM represents that the word segmentation result of the Chinese natural language question is related to the data column name.

6. The method for constructing an SQL statement generation model for Chinese natural language according to claim 5, wherein n-gram matching is carried out between the word segmentation results of the Chinese natural language question and the segmented data table names to obtain the link relations between the word segmentation results of the Chinese natural language question and the data tables and data columns in the database schema;

7. The method of claim 1, wherein the relational coding module comprises: the device comprises a position coding unit, a multi-head attention unit, a first addition and normalization unit, a feedforward neural network and a second addition and normalization unit which are sequentially cascaded;

the position coding unit is used for carrying out position coding on semantic coding vectors of word segmentation results of Chinese natural language problems in the combined coding tensor and data table naming and data column naming in a related database mode, adding the position coding vectors of the word segmentation results with corresponding semantic coding vectors to obtain composite semantic coding vectors of the word segmentation results, and outputting the composite semantic coding vectors to the multi-head attention unit;

the multi-head attention unit comprises a plurality of attention subunits, wherein the attention subunits are used for performing first linear transformation on a composite semantic coding vector of an ith word segmentation result, performing second linear transformation and third linear transformation on a composite semantic coding vector of a jth word segmentation result respectively, introducing a link relation of the ith word segmentation result and the jth word segmentation result in a relation directed graph as a bias in the second linear transformation and the third linear transformation respectively, obtaining a query vector, a key vector and a value vector respectively, and obtaining an attention value between the ith word segmentation result and the jth word segmentation result by combining an attention mechanism, wherein j is 1,2, …, n and n are the number of the word segmentation results; adding the obtained attention values to obtain the attention value of the ith word segmentation result; the multi-head attention unit is used for splicing the attention values of the ith word segmentation result obtained by each attention subunit and then performing dimension conversion to ensure that the dimension of the multi-head attention unit is the same as that of the composite semantic coding vector of the ith word segmentation result, so that the multi-head attention value of the ith word segmentation result is obtained;

the first adding and normalizing unit adds the multi-head attention value of the ith word segmentation result and the composite semantic coding vector of the ith word segmentation result, normalizes the result and inputs the result into the feedforward neural network for processing;

the second adding and normalizing unit adds the result obtained by the feedforward neural network and the result obtained by the first adding and normalizing unit to obtain a relation coding vector of the ith word segmentation result; wherein i is 1,2, …, n.

8. A method for generating SQL sentences of Chinese natural language is characterized by comprising the following steps: inputting the Chinese natural language question and the relevant database mode into the SQL sentence generating model constructed by the construction method of the SQL sentence generating model of the Chinese natural language according to any one of claims 1 to 7 to obtain the SQL sentence.

9. A database retrieval method based on Chinese natural language is characterized by comprising the following steps: inputting the Chinese natural language question and the relevant database mode into the SQL sentence generation model constructed by the construction method of the SQL sentence generation model of the Chinese natural language according to any one of claims 1 to 7, and executing the SQL sentence by the database execution engine to perform information retrieval after the SQL sentence is obtained.

10. A machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement one or more of the method of constructing the SQL statement generation model of the chinese natural language according to any one of claims 1 to 7, the SQL statement generation method of the chinese natural language according to claim 8, and the database retrieval method based on the chinese natural language according to claim 9.