CN110609849A - Natural language generation method based on SQL syntax tree node type - Google Patents

Natural language generation method based on SQL syntax tree node type Download PDF

Info

Publication number
CN110609849A
Authority
CN
China
Prior art keywords
node
natural language
sql
vector
syntax tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910796688.1A
Other languages
Chinese (zh)
Other versions
CN110609849B (en)
Inventor
蔡瑞初
梁智豪
许柏炎
郝志峰
温雯
李梓健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201910796688.1A priority Critical patent/CN110609849B/en
Publication of CN110609849A publication Critical patent/CN110609849A/en
Application granted granted Critical
Publication of CN110609849B publication Critical patent/CN110609849B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Devices For Executing Special Programs (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of natural language processing, and in particular to a natural language generation method based on SQL syntax tree node types. The invention requires neither extensive manual work nor handcrafted templates that restrict the generated natural language to a few fixed sentence patterns. Compared with natural language generation methods based on sequence-to-sequence learning, the method captures not only the text of the SQL statement but also, by combining the tree-structured data of the SQL syntax tree with a tree-structured long short-term memory network, its syntactic structure more fully. It therefore has practical application value: it avoids the need to manually search development documents and online material, greatly reduces time and labor costs, and improves working efficiency.

Description

Natural language generation method based on SQL syntax tree node type
Technical Field
The invention relates to the field of natural language, in particular to a natural language generation method based on SQL syntax tree node types.
Background
Structured Query Language (SQL) is a non-procedural programming language for operating relational databases: it lets a user query data interactively over a high-level data structure while keeping the concrete storage layout of the data transparent to the user. SQL is widely used in database manipulation tasks. Since SQL is a programming language, it can be converted into an Abstract Syntax Tree (AST) by means of an Abstract Syntax Description Language (ASDL), a language used in compilers to describe tree-shaped data structures. An abstract syntax tree represents the syntactic structure of SQL in the form of a tree without exposing the concrete details of the SQL text. The AST of SQL is an abstract representation of the SQL language; by representing every SQL statement as an abstract syntax tree, the syntactic structure of each statement can be obtained easily and clearly.
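As a concrete illustration of this tree representation, the following minimal Python sketch builds a simplified AST for a small query. The Node class, its field names, and the example query are illustrative assumptions, not the ASDL grammar actually used here:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One node of a simplified SQL abstract syntax tree (hypothetical)."""
    node_type: str                      # grammar-rule label, e.g. "Select"
    text: str = ""                      # surface text carried by the node
    children: List["Node"] = field(default_factory=list)

# Hypothetical AST for: SELECT name FROM students WHERE age > 18
ast = Node("Select", "SELECT", [
    Node("Column", "name"),
    Node("From", "FROM", [Node("Table", "students")]),
    Node("Where", "WHERE", [
        Node("Comparison", ">", [Node("Column", "age"), Node("Literal", "18")]),
    ]),
])
```

Each node carries both a type label and a text fragment, which is exactly the pair of inputs the encoder described later consumes.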
SQL is widely applied in all kinds of projects and products to satisfy diverse data operations and database requirements, and such systems contain large numbers of SQL statements supporting most of their data operations. To ease future maintenance, these SQL statements need clear natural language comments; likewise, when SQL statements are updated, developers must consult development documents and online material to understand the functional requirements the statements implement, which costs considerable time and effort. Faced with this practical need, a method that converts SQL into natural language is necessary. Several ideas address the problem. The first converts SQL into natural language according to pre-designed manual rules and templates; its drawbacks are that the generated language is highly similar, the sentence patterns lack diversity, and only a limited range of SQL statements can be supported, since the approach ultimately rests on hand-designed templates. The second idea treats the conversion of SQL into natural language as a sequence-to-sequence translation problem: an SQL statement and a natural language description are both viewed as sequences, a neural network encodes the SQL sequence to extract an overall representation, and the natural language sequence is generated from that representation.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a natural language generation method based on SQL syntax tree node types. The invention requires neither extensive manual work nor handcrafted templates that restrict the generated natural language to a few fixed sentence patterns. Compared with natural language generation methods based on sequence-to-sequence learning, the method captures the text of the SQL statement and, by combining the tree-structured data of the SQL syntax tree with a tree-structured long short-term memory network, acquires the syntactic structure of the SQL statement more fully, so it has practical application value.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A natural language generation method based on SQL syntax tree node types includes the following steps:
step S1: construct a natural language generation model comprising a language encoder and a language decoder built on long short-term memory networks;
step S2: collect a natural language dataset of SQL texts paired with natural language descriptions, and traverse each SQL abstract syntax tree breadth-first to obtain the tree T = {node_1, ..., node_n} with n nodes and the corresponding natural language sequence X = {x_1, ..., x_m}, where node_i denotes the i-th node of the SQL abstract syntax tree T and x_j denotes the j-th word of the natural language sentence X;
step S3: use the language encoder of the natural language generation model to compute the node state vector s_{node_i} of each node_i in the SQL abstract syntax tree;
step S4: select the state vector s_{node_1} of the root node of the SQL abstract syntax tree as the initial hidden state vector h_0 and input it into the language decoder of the natural language generation model;
step S5: at each time step t, the language decoder takes the hidden state vector h_{t-1} of the previous time step and the previously predicted word x_{t-1} as input and computes a new hidden state vector h_t;
step S6: compute the attention vector attn_t of the current time step from the hidden state vector h_t and the node state vectors s_{node_i} of the SQL syntax tree;
step S7: feed the attention vector attn_t as input into the language decoder of the natural language generation model;
step S8: based on the input attention vector attn_t, the language decoder executes a copy operation or a generation operation to generate the corresponding natural language sequence;
step S9: train the natural language generation model by gradient descent, determine its model parameters θ, and obtain the optimized natural language generation model.
Preferably, the language decoder of step S1 further includes a binary discriminator; the language encoder consists of a node-type-based tree-structured long short-term memory (LSTM) network, the language decoder consists of an LSTM network, and the binary discriminator consists of a fully connected network. The node-type-based tree LSTM is a variant of the tree LSTM. The tree LSTM is organized over a root node, parent nodes, and child nodes, where a parent node may contain several child nodes. If a node node_i of an SQL syntax tree has K child nodes, the node-type-based tree LSTM consists of 1 input gate (InputGate), K forget gates (ForgetGate), and 1 output gate (OutputGate). The network takes the text vector x_i and node type τ_i of node_i, together with the state vectors s_k, node types τ_k, and cell states c_k of its K child nodes (k = 1, ..., K), and computes the node state vector s_{node_i} through the input gate, forget gates, and output gate according to the following formulas:

i = sigmoid( W_{τ_i}^{(i)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(i)} s_k + b^{(i)} )  (3)
f_k = sigmoid( W_{τ_i}^{(f)} x_i + U_{τ_k}^{(f)} s_k + b^{(f)} ), k = 1, ..., K  (4)
o = sigmoid( W_{τ_i}^{(o)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(o)} s_k + b^{(o)} )  (5)
u = tanh( W_{τ_i}^{(u)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(u)} s_k + b^{(u)} )  (6)
c_{node_i} = i ⊙ u + Σ_{k=1}^{K} f_k ⊙ c_k  (7)
s_{node_i} = o ⊙ tanh( c_{node_i} )  (8)

where b is a bias term, and W_τ^{(·)} and U_τ^{(·)} are the learnable parameters of the node-type-based tree LSTM, with different parameter values selected according to the node type τ; sigmoid(·) and tanh(·) are nonlinear activation functions with the specific formulas:

sigmoid(z) = 1 / (1 + e^{-z})  (9)
tanh(z) = (e^{z} - e^{-z}) / (e^{z} + e^{-z})  (10)
preferably, the natural language data set in step S2 is collected by manual or machine statistics, the natural language data set includes structured query language SQL and natural language sequence pairs, and the data set is split into a training set, a verification set and a test set according to a proportion for training the reliability of the natural language generation model.
Preferably, the time step in step S5 is an input unit of the long-short term memory network when processing the sequence data.
Preferably, in step S5, x is 0 when t-1 ist-1Is a special symbol that indicates the beginning.
Preferably, the specific steps of computing the attention vector attn_t in step S6 are as follows:
first, weights are computed for the node state vectors from the hidden state vector h_t and the node state vectors s_{node_i} of the SQL syntax tree; the node state vectors are then weighted and summed according to these weights to obtain a context vector ctx_t; finally, the attention vector attn_t is computed from the context vector ctx_t and the hidden state vector h_t. The specific formulas are:

α_t = softmax( S h_t )  (11)
ctx_t = S^T α_t  (12)
attn_t = tanh( [ctx_t ; h_t] )  (13)

where S is an n × d real matrix whose rows are the n node state vectors s_{node_1}, ..., s_{node_n} of the SQL syntax tree, and d is the dimension of the vectors; softmax(·) and tanh(·) are nonlinear activation functions, where the specific formula of softmax(·) is:

softmax(z)_i = e^{z_i} / Σ_j e^{z_j}  (14)
preferably, in step S7, attention vector attn is addedtThe binary arbiter is input into a binary arbiter of the speech decoder, which is a fully connected network with an output dimension of 2, i.e.:
P(action|x1,...,xt-1,T)=W×attnt;W∈R2×d (15)
wherein W ∈ R2×dIs a fully-connected network trainable parameter, d is an attention vector attntDimension (d);
the binary arbiter outputs 2 probabilities P (action ═ copy | x)1,...,xt-1T) and P (action | x)1,...,xt-1,T),P(action=copy|x1,...,xt-1T) represents the probability of executing a copy operation, P (action | x)1,...,xt-1And T) represents the probability of executing the generating operation, the sizes of the two probability values are compared, and the operation with the higher probability is selected to be executed.
Preferably, in step S8, if the binary discriminator decides on the copy operation, the probability P(x_t | x_1, ..., x_{t-1}, T) of each node being copied is computed from the attention vector attn_t and the state vector s_{node_i} of each node i of the SQL syntax tree; the node with the highest probability is selected and its node text is copied as the output x_t of the current time step. In the copy mechanism, the probability that each node is copied is computed from attn_t and s_{node_i} by the specific formulas:

u_t^i = (s_{node_i})^T attn_t  (16)

where (s_{node_i})^T denotes the transpose of the state vector of the i-th node of the SQL syntax tree, and u_t^i is a scalar associated with the i-th node at time step t that represents the similarity between the state vector s_{node_i} and the attention vector attn_t;

P(x_t | x_1, ..., x_{t-1}, T) = softmax(u_t)  (17)

If the binary discriminator decides on the generation operation, the attention vector attn_t is input into a fully connected network whose output dimension is the size of the target dictionary, yielding the probability P(x_t | x_1, ..., x_{t-1}, T) of each word in the target dictionary; the word with the highest probability is selected as the output x_t of the current time step. Steps S5-S8 are repeated until the corresponding natural language sequence has been generated.
Preferably, the gradient descent algorithm in step S9 includes the following steps:
step S201: assume an objective function J(θ) with respect to the model parameters θ of the natural language generation model;
step S202: compute the gradient ∇_θ J(θ) of J(θ);
step S203: update the parameters θ with an update step size α (α > 0): θ ← θ - α ∇_θ J(θ).
preferably, in step S9, in the training of the natural language generation model, the model parameter θ is trained by an objective function or a loss function until the model converges, where the objective function is:
wherein, P (x)t,action=copy|x1,...,xt-1T) represents the probability that the text will perform a copy operation, P (x)t,action=generate|x1,...,xt-1T) represents the probability of the text performing the generating operation;
the corresponding loss function L is:
L=-logP(X|T)
=-∑tlog(P(xt|x1,...,xt-1,T))
=-∑tlog(P(xt,action=copy|x1,...,xt-1,T)
+P(xt,action=generate|x1,...,xt-1,T))
=-∑tlog(P(xt|x1,...,xt-1,T)×P(action=copy|x1,...,xt-1,T)
+P(xt|x1,...,xt-1,T)
×P(action=generate|x1,...,xt-1,T)) (2)
wherein X represents a natural language sentence, each sentence being X1,...,xmThe word sequence of (1); t represents an abstract syntax tree, each tree is a node1,...,nodenP (X | T) represents the conditional probability of X given the syntax tree T, P (X | T)t,action=copy|x1,...,xt-1T) represents the probability that the text will perform a copy operation, P (x)t,action=generate|x1,...,xt-1And T) represents the probability of the text performing the generating operation.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the present invention does not require extensive manual operations and does not require that natural language must support multiple patterns. Compared with a natural language generating method based on sequence-to-sequence learning, the method can acquire the text information of the SQL language, and can be used by combining the tree-shaped structured data of the SQL syntax tree and the tree-shaped long and short term memory network to more fully acquire the syntax structure information of the SQL sentences, thereby having practical application significance, avoiding the defect that the development document and the network data are searched and searched manually, greatly reducing the time cost and the labor cost and improving the working efficiency.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram of the present invention;
FIG. 3 is a schematic diagram of a tree-like long short term memory network;
fig. 4 is a schematic diagram of a replication mechanism.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a natural language generation method based on SQL syntax tree node types includes the following steps:
step S1: construct a natural language generation model comprising a language encoder and a language decoder built on long short-term memory networks;
step S2: collect a natural language dataset of SQL texts paired with natural language descriptions, and traverse each SQL abstract syntax tree breadth-first to obtain the tree T = {node_1, ..., node_n} with n nodes and the corresponding natural language sequence X = {x_1, ..., x_m}, where node_i denotes the i-th node of the SQL abstract syntax tree T and x_j denotes the j-th word of the natural language sentence X (a minimal sketch of this breadth-first traversal is given after this step list);
step S3: use the language encoder of the natural language generation model to compute the node state vector s_{node_i} of each node_i in the SQL abstract syntax tree;
step S4: select the state vector s_{node_1} of the root node of the SQL abstract syntax tree as the initial hidden state vector h_0 and input it into the language decoder of the natural language generation model;
step S5: at each time step t, the language decoder takes the hidden state vector h_{t-1} of the previous time step and the previously predicted word x_{t-1} as input and computes a new hidden state vector h_t;
step S6: compute the attention vector attn_t of the current time step from the hidden state vector h_t and the node state vectors s_{node_i} of the SQL syntax tree;
step S7: feed the attention vector attn_t as input into the language decoder of the natural language generation model;
step S8: based on the input attention vector attn_t, the language decoder executes a copy operation or a generation operation to generate the corresponding natural language sequence;
step S9: train the natural language generation model by gradient descent, determine its model parameters θ, and obtain the optimized natural language generation model.
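The breadth-first traversal referenced in step S2 can be sketched as follows; the Node class is a hypothetical stand-in for the actual AST structure, and the root-first numbering is an assumption consistent with the breadth-first order described above:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    node_type: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

def bfs_nodes(root: Node) -> List[Node]:
    """Flatten an AST breadth-first into [node_1, ..., node_n] (step S2)."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)            # nodes are numbered in visit order
        queue.extend(node.children)   # children are enqueued left to right
    return order

# order[0] is the root node_1, whose state vector later seeds h_0 (step S4).
```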
As shown in fig. 2, as a preferred embodiment, the language decoder of step S1 further includes a binary discriminator; the language encoder consists of a node-type-based tree-structured long short-term memory (LSTM) network, the language decoder consists of an LSTM network, and the binary discriminator consists of a fully connected network. The node-type-based tree LSTM is a variant of the tree LSTM, and its specific structure is shown in fig. 3. The tree LSTM is organized over a root node, parent nodes, and child nodes, where a parent node may contain several child nodes. If a node node_i of an SQL syntax tree has K child nodes, the node-type-based tree LSTM consists of 1 input gate (InputGate), K forget gates (ForgetGate), and 1 output gate (OutputGate). The network takes the text vector x_i and node type τ_i of node_i, together with the state vectors s_k, node types τ_k, and cell states c_k of its K child nodes (k = 1, ..., K), and computes the node state vector s_{node_i} through the input gate, forget gates, and output gate according to the following formulas:

i = sigmoid( W_{τ_i}^{(i)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(i)} s_k + b^{(i)} )  (3)
f_k = sigmoid( W_{τ_i}^{(f)} x_i + U_{τ_k}^{(f)} s_k + b^{(f)} ), k = 1, ..., K  (4)
o = sigmoid( W_{τ_i}^{(o)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(o)} s_k + b^{(o)} )  (5)
u = tanh( W_{τ_i}^{(u)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(u)} s_k + b^{(u)} )  (6)
c_{node_i} = i ⊙ u + Σ_{k=1}^{K} f_k ⊙ c_k  (7)
s_{node_i} = o ⊙ tanh( c_{node_i} )  (8)

where b is a bias term, and W_τ^{(·)} and U_τ^{(·)} are the learnable parameters of the node-type-based tree LSTM, with different parameter values selected according to the node type τ; sigmoid(·) and tanh(·) are nonlinear activation functions with the specific formulas:

sigmoid(z) = 1 / (1 + e^{-z})  (9)
tanh(z) = (e^{z} - e^{-z}) / (e^{z} + e^{-z})  (10)
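A minimal PyTorch sketch of such a node-type-conditioned cell is given below. It is not the reference implementation: keeping one linear map per node type in a ModuleList, summing the child contributions inside the i/o/u gates, and packing the four gate pre-activations into one projection are all illustrative choices.

```python
import torch
import torch.nn as nn

class NodeTypeTreeLSTMCell(nn.Module):
    """Sketch of a tree-LSTM cell whose gate parameters depend on node type."""

    def __init__(self, num_types: int, dim: int):
        super().__init__()
        # One projection per node type; the 4*dim outputs are the packed
        # pre-activations of the i, o, u gates plus the per-child forget gate f.
        self.W = nn.ModuleList(nn.Linear(dim, 4 * dim) for _ in range(num_types))
        self.U = nn.ModuleList(nn.Linear(dim, 4 * dim, bias=False)
                               for _ in range(num_types))

    def forward(self, x, x_type, child_states, child_cells, child_types):
        # x: (dim,) text vector of node_i; child_* are lists over its K children.
        dim = x.shape[-1]
        wx = self.W[x_type](x)                         # W selected by node type
        us = sum((self.U[t](s) for t, s in zip(child_types, child_states)),
                 torch.zeros_like(wx))                 # child contributions summed
        i, o, u, _ = (wx + us).split(dim)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        c = i * u
        for t, s, c_k in zip(child_types, child_states, child_cells):
            f_k = torch.sigmoid((wx + self.U[t](s)).split(dim)[3])  # one forget gate per child
            c = c + f_k * c_k                          # cell state accumulates children
        return o * torch.tanh(c), c                    # node state s_{node_i}, cell state
```

For a leaf node the child lists are empty, and the cell reduces to a plain LSTM-style gating of the node's own text vector.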
as a preferred embodiment, the natural language data set in step S2 is collected by human or machine statistics, the natural language data set includes structured query language SQL and natural language sequence pairs, and the data set is proportionally split into a training set, a verification set and a test set for training the reliability of the natural language generation model.
As a preferred embodiment, the time step described in step S5 is an input unit of the long-short term memory network when processing the sequence data.
As a preferred embodiment, in step S5, when t-1 is 0, xt-1Is a special symbol that indicates the beginning.
As a preferred embodiment, the specific steps of computing the attention vector attn_t in step S6 are as follows:
first, weights are computed for the node state vectors from the hidden state vector h_t and the node state vectors s_{node_i} of the SQL syntax tree; the node state vectors are then weighted and summed according to these weights to obtain a context vector ctx_t; finally, the attention vector attn_t is computed from the context vector ctx_t and the hidden state vector h_t. The specific formulas are:

α_t = softmax( S h_t )  (11)
ctx_t = S^T α_t  (12)
attn_t = tanh( [ctx_t ; h_t] )  (13)

where S is an n × d real matrix whose rows are the n node state vectors s_{node_1}, ..., s_{node_n} of the SQL syntax tree, and d is the dimension of the vectors; softmax(·) and tanh(·) are nonlinear activation functions, where the specific formula of softmax(·) is:

softmax(z)_i = e^{z_i} / Σ_j e^{z_j}  (14)
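A sketch of this attention step, assuming the node state vectors have been stacked into the matrix S and that the equation numbering above holds:

```python
import torch

def attention_vector(S: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
    """S: (n, d) node state matrix; h_t: (d,) decoder hidden state."""
    weights = torch.softmax(S @ h_t, dim=0)        # (n,)  weights, eq. (11)
    ctx_t = weights @ S                            # (d,)  weighted sum, eq. (12)
    return torch.tanh(torch.cat([ctx_t, h_t]))     # (2d,) attn_t, eq. (13)

# Toy usage: 7 nodes with 16-dimensional states.
S, h = torch.randn(7, 16), torch.randn(16)
print(attention_vector(S, h).shape)                # torch.Size([32])
```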
as a preferred embodiment, in step S7, attention vector attn is addedtThe binary arbiter is input into a binary arbiter of the speech decoder, which is a fully connected network with an output dimension of 2, i.e.:
P(action|x1,...,xt-1,T)=W×attnt;W∈R2×d (15)
wherein W ∈ R2×dIs a fully-connected network trainable parameter, d is an attention vector attntDimension (d);
the binary arbiter outputs 2 probabilities P (action ═ copy | x)1,...,xt-1T) and P (action | x)1,...,xt-1,T),P(action=copy|x1,...,xt-1T) represents the probability of executing a copy operation, P (action | x)1,...,xt-1T) stands for execution GenerationAnd comparing the sizes of the two probability values, and selecting the operation with higher probability to execute.
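The binary discriminator is a single fully connected layer. The sketch below adds a softmax to turn the two logits of eq. (15) into the two probabilities the text describes; this normalization is an assumption, since (15) itself writes only the linear map.

```python
import torch
import torch.nn as nn

class CopyGenerateDiscriminator(nn.Module):
    """Two-way action classifier over the attention vector (cf. eq. (15))."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2, bias=False)   # W in R^{2 x d}

    def forward(self, attn_t: torch.Tensor) -> str:
        # softmax added here as an assumption to obtain two probabilities
        p_copy, p_generate = torch.softmax(self.proj(attn_t), dim=-1)
        return "copy" if p_copy > p_generate else "generate"
```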
As a preferred embodiment, in step S8, if the binary discriminator decides on the copy operation, the probability P(x_t | x_1, ..., x_{t-1}, T) of each node being copied is computed from the attention vector attn_t and the state vector s_{node_i} of each node i of the SQL syntax tree; the node with the highest probability is selected and its node text is copied as the output x_t of the current time step. In the copy mechanism, the probability that each node is copied is computed from attn_t and s_{node_i} by the specific formulas:

u_t^i = (s_{node_i})^T attn_t  (16)

where (s_{node_i})^T denotes the transpose of the state vector of the i-th node of the SQL syntax tree, and u_t^i is a scalar associated with the i-th node at time step t that represents the similarity between the state vector s_{node_i} and the attention vector attn_t;

P(x_t | x_1, ..., x_{t-1}, T) = softmax(u_t)  (17)

If the binary discriminator decides on the generation operation, the attention vector attn_t is input into a fully connected network whose output dimension is the size of the target dictionary, yielding the probability P(x_t | x_1, ..., x_{t-1}, T) of each word in the target dictionary; the word with the highest probability is selected as the output x_t of the current time step. Steps S5-S8 are repeated until the corresponding natural language sequence has been generated.
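One decoding step (S8) could then be sketched as below; it assumes attn_t has the same dimension d as the node state vectors so that the dot product of eq. (16) is defined, and vocab_proj and the other argument names are illustrative:

```python
import torch

def emit_word(action, attn_t, S, node_texts, vocab_proj, vocab):
    """Sketch of step S8: emit one word by copying or generating.

    S: (n, d) node state matrix; node_texts: surface strings of the n nodes;
    vocab_proj: a torch.nn.Linear mapping attn_t to target-dictionary logits.
    """
    if action == "copy":
        u_t = S @ attn_t                               # similarities, eq. (16)
        probs = torch.softmax(u_t, dim=0)              # eq. (17)
        return node_texts[int(probs.argmax())]         # copy the best node's text
    probs = torch.softmax(vocab_proj(attn_t), dim=0)   # distribution over dictionary
    return vocab[int(probs.argmax())]                  # generate the best word
```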
As a preferred embodiment, the gradient descent algorithm in step S9 includes the following steps:
step S201: assume an objective function J(θ) with respect to the model parameters θ of the natural language generation model;
step S202: compute the gradient ∇_θ J(θ) of J(θ);
step S203: update the parameters θ with an update step size α (α > 0): θ ← θ - α ∇_θ J(θ).
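Steps S201-S203 are plain gradient descent; on a toy scalar objective the loop looks like this (the objective and the step size are illustrative, not from the method itself):

```python
def gradient_descent(theta: float, grad_J, alpha: float = 0.1, steps: int = 50) -> float:
    """Steps S201-S203: repeatedly apply theta <- theta - alpha * grad J(theta)."""
    for _ in range(steps):
        theta -= alpha * grad_J(theta)
    return theta

# Toy objective J(theta) = (theta - 3)^2, so grad J(theta) = 2 * (theta - 3):
print(gradient_descent(0.0, lambda t: 2 * (t - 3)))   # converges toward 3.0
```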
As a preferred embodiment, in step S9, during training of the natural language generation model the model parameters θ are trained through an objective function, or equivalently a loss function, until the model converges; the objective function is:

P(X | T) = Π_t P(x_t | x_1, ..., x_{t-1}, T)
= Π_t ( P(x_t, action = copy | x_1, ..., x_{t-1}, T) + P(x_t, action = generate | x_1, ..., x_{t-1}, T) )  (1)

where P(x_t, action = copy | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a copy operation, and P(x_t, action = generate | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a generation operation;
the corresponding loss function L is:

L = -log P(X | T)
= -Σ_t log( P(x_t | x_1, ..., x_{t-1}, T) )
= -Σ_t log( P(x_t, action = copy | x_1, ..., x_{t-1}, T) + P(x_t, action = generate | x_1, ..., x_{t-1}, T) )
= -Σ_t log( P(x_t | x_1, ..., x_{t-1}, T) × P(action = copy | x_1, ..., x_{t-1}, T) + P(x_t | x_1, ..., x_{t-1}, T) × P(action = generate | x_1, ..., x_{t-1}, T) )  (2)

where X represents a natural language sentence, each sentence being the word sequence x_1, ..., x_m; T represents an abstract syntax tree, each tree being the node sequence node_1, ..., node_n; P(X | T) represents the conditional probability of X given the syntax tree T; P(x_t, action = copy | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a copy operation, and P(x_t, action = generate | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a generation operation.
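A sketch of loss (2) for one training sentence, assuming the per-step probabilities have already been produced by the decoder (the argument names are illustrative):

```python
import torch

def sentence_loss(p_copy_word, p_gen_word, p_copy_action, p_gen_action):
    """Negative log-likelihood of eq. (2), summed over the m time steps.

    p_copy_word[t]: P(x_t | ..., T) under the copy distribution, eq. (17);
    p_gen_word[t]:  P(x_t | ..., T) under the dictionary distribution;
    p_copy_action[t], p_gen_action[t]: the discriminator's two probabilities.
    All entries are assumed to be scalar torch tensors.
    """
    loss = torch.zeros(())
    for pc, pg, ac, ag in zip(p_copy_word, p_gen_word, p_copy_action, p_gen_action):
        loss = loss - torch.log(pc * ac + pg * ag)   # one summand of eq. (2)
    return loss
```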
Example 2
As shown in fig. 4, in this embodiment a detailed text is input into the language encoder of the natural language generation model, and words summarizing the input content are output by the language decoder. A concrete example:
Input: Xiaoming goes to the Guangzhou Restaurant for lunch, orders 3 dishes, and eats very happily.
Output: Xiaoming eats lunch very happily.
If the word "Xiaoming" is not in the constructed dictionary, an "unknown" word would be generated in the absence of a copy mechanism; with a copy mechanism, the word "Xiaoming" can be copied from the input to the output. The copy mechanism is implemented based on a Pointer Network. The pointer network builds on the language encoder-language decoder framework: assume the input is X = {x_1, ..., x_n} and the output is Y = {y_1, ..., y_m}. At a time step i of the decoding stage, the language decoder hidden state vector d_i is combined with the hidden state vector e_j of each input time step j ∈ (1, ..., n) of the language encoder to yield, for each input time step, a probability P(y_i | y_1, ..., y_{i-1}, X) representing the possibility of copying that input word at this time step; the input word with the highest probability is selected for copying. The specific formula is:

P(y_i | y_1, ..., y_{i-1}, X) = softmax(u_i)

where softmax(·) is a nonlinear activation function with the specific formula softmax(z)_i = e^{z_i} / Σ_j e^{z_j}.
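The text gives only the softmax form of the copy probability; the additive scoring below is the standard pointer-network formulation of Vinyals et al., included as a plausible sketch rather than the exact operation used here:

```python
import torch
import torch.nn as nn

class PointerScorer(nn.Module):
    """Pointer-network copy distribution over the n input positions."""

    def __init__(self, dim: int):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)   # encoder-side projection
        self.W2 = nn.Linear(dim, dim, bias=False)   # decoder-side projection
        self.v = nn.Linear(dim, 1, bias=False)      # scoring vector v^T

    def forward(self, E: torch.Tensor, d_i: torch.Tensor) -> torch.Tensor:
        # E: (n, dim) encoder states e_1..e_n; d_i: (dim,) decoder state.
        u_i = self.v(torch.tanh(self.W1(E) + self.W2(d_i))).squeeze(-1)  # (n,)
        return torch.softmax(u_i, dim=0)            # P(y_i | y_1..y_{i-1}, X)
```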
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A natural language generation method based on SQL syntax tree node types, characterized by comprising the following steps:
step S1: construct a natural language generation model comprising a language encoder and a language decoder built on long short-term memory networks;
step S2: collect a natural language dataset of SQL texts paired with natural language descriptions, and traverse each SQL abstract syntax tree breadth-first to obtain the tree T = {node_1, ..., node_n} with n nodes and the corresponding natural language sequence X = {x_1, ..., x_m}, where node_i denotes the i-th node of the SQL abstract syntax tree T and x_j denotes the j-th word of the natural language sentence X;
step S3: use the language encoder of the natural language generation model to compute the node state vector s_{node_i} of each node_i in the SQL abstract syntax tree;
step S4: select the state vector s_{node_1} of the root node of the SQL abstract syntax tree as the initial hidden state vector h_0 and input it into the language decoder of the natural language generation model;
step S5: at each time step t, the language decoder takes the hidden state vector h_{t-1} of the previous time step and the previously predicted word x_{t-1} as input and computes a new hidden state vector h_t;
step S6: compute the attention vector attn_t of the current time step from the hidden state vector h_t and the node state vectors s_{node_i} of the SQL syntax tree;
step S7: feed the attention vector attn_t as input into the language decoder of the natural language generation model;
step S8: based on the input attention vector attn_t, the language decoder executes a copy operation or a generation operation to generate the corresponding natural language sequence;
step S9: train the natural language generation model by gradient descent, determine its model parameters θ, and obtain the optimized natural language generation model.
2. The method according to claim 1, wherein the language decoder of step S1 further includes a binary discriminator; the language encoder consists of a node-type-based tree-structured long short-term memory (LSTM) network, the language decoder consists of an LSTM network, and the binary discriminator consists of a fully connected network; the node-type-based tree LSTM is a variant of the tree LSTM; the tree LSTM is organized over a root node, parent nodes, and child nodes, where a parent node may contain several child nodes; if a node node_i of an SQL syntax tree has K child nodes, the node-type-based tree LSTM consists of 1 input gate (InputGate), K forget gates (ForgetGate), and 1 output gate (OutputGate); the network takes the text vector x_i and node type τ_i of node_i, together with the state vectors s_k, node types τ_k, and cell states c_k of its K child nodes (k = 1, ..., K), and computes the node state vector s_{node_i} through the input gate, forget gates, and output gate according to the following formulas:

i = sigmoid( W_{τ_i}^{(i)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(i)} s_k + b^{(i)} )  (3)
f_k = sigmoid( W_{τ_i}^{(f)} x_i + U_{τ_k}^{(f)} s_k + b^{(f)} ), k = 1, ..., K  (4)
o = sigmoid( W_{τ_i}^{(o)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(o)} s_k + b^{(o)} )  (5)
u = tanh( W_{τ_i}^{(u)} x_i + Σ_{k=1}^{K} U_{τ_k}^{(u)} s_k + b^{(u)} )  (6)
c_{node_i} = i ⊙ u + Σ_{k=1}^{K} f_k ⊙ c_k  (7)
s_{node_i} = o ⊙ tanh( c_{node_i} )  (8)

where b is a bias term, and W_τ^{(·)} and U_τ^{(·)} are the learnable parameters of the node-type-based tree LSTM, with different parameter values selected according to the node type τ; sigmoid(·) and tanh(·) are nonlinear activation functions with the specific formulas:

sigmoid(z) = 1 / (1 + e^{-z})  (9)
tanh(z) = (e^{z} - e^{-z}) / (e^{z} + e^{-z})  (10)
3. The method according to claim 2, wherein the natural language dataset in step S2 is collected by manual or machine statistics; it consists of pairs of structured query language (SQL) statements and natural language sequences, and is split proportionally into a training set, a validation set, and a test set for training the natural language generation model and verifying its reliability.
4. The method according to claim 3, wherein the time step in step S5 is the input unit of the long short-term memory network when it processes sequence data.
5. The method according to claim 4, wherein in step S5, when t - 1 = 0, x_{t-1} is a special symbol indicating the beginning of the sequence.
6. The method according to claim 5, wherein the specific steps of computing the attention vector attn_t in step S6 are as follows:
first, weights are computed for the node state vectors from the hidden state vector h_t and the node state vectors s_{node_i} of the SQL syntax tree; the node state vectors are then weighted and summed according to these weights to obtain a context vector ctx_t; finally, the attention vector attn_t is computed from the context vector ctx_t and the hidden state vector h_t; the specific formulas are:

α_t = softmax( S h_t )  (11)
ctx_t = S^T α_t  (12)
attn_t = tanh( [ctx_t ; h_t] )  (13)

where S is an n × d real matrix whose rows are the n node state vectors s_{node_1}, ..., s_{node_n} of the SQL syntax tree, and d is the dimension of the vectors; softmax(·) and tanh(·) are nonlinear activation functions, where the specific formula of softmax(·) is:

softmax(z)_i = e^{z_i} / Σ_j e^{z_j}  (14)
7. The method according to claim 6, wherein in step S7, the attention vector attn_t is input into the binary discriminator of the language decoder; the binary discriminator is a fully connected network with an output dimension of 2, i.e.:

P(action | x_1, ..., x_{t-1}, T) = W × attn_t;  W ∈ R^{2×d}  (15)

where W ∈ R^{2×d} is a trainable parameter of the fully connected network and d is the dimension of the attention vector attn_t;
the binary discriminator outputs 2 probabilities, P(action = copy | x_1, ..., x_{t-1}, T) and P(action = generate | x_1, ..., x_{t-1}, T): P(action = copy | x_1, ..., x_{t-1}, T) represents the probability of executing the copy operation, and P(action = generate | x_1, ..., x_{t-1}, T) represents the probability of executing the generation operation; the two probability values are compared and the operation with the higher probability is executed.
8. The method according to claim 7, wherein in step S8, if the binary discriminator decides on the copy operation, the probability P(x_t | x_1, ..., x_{t-1}, T) of each node being copied is computed from the attention vector attn_t and the state vector s_{node_i} of each node i of the SQL syntax tree; the node with the highest probability is selected and its node text is copied as the output x_t of the current time step; in the copy mechanism, the probability that each node is copied is computed from attn_t and s_{node_i} by the specific formulas:

u_t^i = (s_{node_i})^T attn_t  (16)

where (s_{node_i})^T denotes the transpose of the state vector of the i-th node of the SQL syntax tree, and u_t^i is a scalar associated with the i-th node at time step t that represents the similarity between the state vector s_{node_i} and the attention vector attn_t;

P(x_t | x_1, ..., x_{t-1}, T) = softmax(u_t)  (17)

if the binary discriminator decides on the generation operation, the attention vector attn_t is input into a fully connected network whose output dimension is the size of the target dictionary, yielding the probability P(x_t | x_1, ..., x_{t-1}, T) of each word in the target dictionary; the word with the highest probability is selected as the output x_t of the current time step; steps S5-S8 are repeated until the corresponding natural language sequence has been generated.
9. The method according to claim 8, wherein the gradient descent algorithm in step S9 comprises the following steps:
step S201: assume an objective function J(θ) with respect to the model parameters θ of the natural language generation model;
step S202: compute the gradient ∇_θ J(θ) of J(θ);
step S203: update the parameters θ with an update step size α (α > 0): θ ← θ - α ∇_θ J(θ).
10. The method according to claim 9, wherein in step S9, during training of the natural language generation model the model parameters θ are trained through an objective function, or equivalently a loss function, until the model converges; the objective function is:

P(X | T) = Π_t P(x_t | x_1, ..., x_{t-1}, T)
= Π_t ( P(x_t, action = copy | x_1, ..., x_{t-1}, T) + P(x_t, action = generate | x_1, ..., x_{t-1}, T) )  (1)

where P(x_t, action = copy | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a copy operation, and P(x_t, action = generate | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a generation operation;
the corresponding loss function L is:

L = -log P(X | T)
= -Σ_t log( P(x_t | x_1, ..., x_{t-1}, T) )
= -Σ_t log( P(x_t, action = copy | x_1, ..., x_{t-1}, T) + P(x_t, action = generate | x_1, ..., x_{t-1}, T) )
= -Σ_t log( P(x_t | x_1, ..., x_{t-1}, T) × P(action = copy | x_1, ..., x_{t-1}, T) + P(x_t | x_1, ..., x_{t-1}, T) × P(action = generate | x_1, ..., x_{t-1}, T) )  (2)

where X represents a natural language sentence, each sentence being the word sequence x_1, ..., x_m; T represents an abstract syntax tree, each tree being the node sequence node_1, ..., node_n; P(X | T) represents the conditional probability of X given the syntax tree T; P(x_t, action = copy | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a copy operation, and P(x_t, action = generate | x_1, ..., x_{t-1}, T) represents the probability that the text is produced by a generation operation.
CN201910796688.1A 2019-08-27 2019-08-27 Natural language generation method based on SQL syntax tree node type Active CN110609849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910796688.1A CN110609849B (en) 2019-08-27 2019-08-27 Natural language generation method based on SQL syntax tree node type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910796688.1A CN110609849B (en) 2019-08-27 2019-08-27 Natural language generation method based on SQL syntax tree node type

Publications (2)

Publication Number Publication Date
CN110609849A true CN110609849A (en) 2019-12-24
CN110609849B CN110609849B (en) 2022-03-25

Family

ID=68890463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910796688.1A Active CN110609849B (en) 2019-08-27 2019-08-27 Natural language generation method based on SQL syntax tree node type

Country Status (1)

Country Link
CN (1) CN110609849B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581946A (en) * 2020-04-21 2020-08-25 上海爱数信息技术股份有限公司 Language sequence model decoding method
CN112487020A (en) * 2020-12-18 2021-03-12 苏州思必驰信息科技有限公司 Method and system for converting graph of SQL to text into natural language statement
CN113254581A (en) * 2021-05-25 2021-08-13 深圳市图灵机器人有限公司 Financial text formula extraction method and device based on neural semantic analysis
CN113553411A (en) * 2021-06-30 2021-10-26 北京百度网讯科技有限公司 Query statement generation method and device, electronic equipment and storage medium
JP2022089166A (en) * 2020-12-03 2022-06-15 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method for generating data pair, apparatus, electronic device, and storage medium
CN114692208A (en) * 2022-05-31 2022-07-01 中建电子商务有限责任公司 Processing method of data query service authority
CN116089476A (en) * 2023-04-07 2023-05-09 北京宝兰德软件股份有限公司 Data query method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5805832A (en) * 1991-07-25 1998-09-08 International Business Machines Corporation System for parametric text to text language translation
CN110059100A (en) * 2019-03-20 2019-07-26 广东工业大学 Based on performer-reviewer's network SQL statement building method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5805832A (en) * 1991-07-25 1998-09-08 International Business Machines Corporation System for parametric text to text language translation
CN110059100A (en) * 2019-03-20 2019-07-26 广东工业大学 Based on performer-reviewer's network SQL statement building method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郝亮等 (Hao Liang et al.), "一种数据库汉语查询接口的设计与实现" [Design and Implementation of a Chinese Query Interface for Databases], 《计算机技术与发展》 (Computer Technology and Development) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581946A (en) * 2020-04-21 2020-08-25 上海爱数信息技术股份有限公司 Language sequence model decoding method
CN111581946B (en) * 2020-04-21 2023-10-13 上海爱数信息技术股份有限公司 Language sequence model decoding method
JP7266658B2 (en) 2020-12-03 2023-04-28 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド DATA PAIR GENERATION METHOD, APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
JP2022089166A (en) * 2020-12-03 2022-06-15 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method for generating data pair, apparatus, electronic device, and storage medium
US11748340B2 (en) 2020-12-03 2023-09-05 Beijing Baidu Netcom Science And Technology Co., Ltd. Data pair generating method, apparatus, electronic device and storage medium
CN112487020A (en) * 2020-12-18 2021-03-12 苏州思必驰信息科技有限公司 Method and system for converting graph of SQL to text into natural language statement
CN112487020B (en) * 2020-12-18 2022-07-12 思必驰科技股份有限公司 Method and system for converting graph of SQL to text into natural language statement
CN113254581A (en) * 2021-05-25 2021-08-13 深圳市图灵机器人有限公司 Financial text formula extraction method and device based on neural semantic analysis
CN113553411A (en) * 2021-06-30 2021-10-26 北京百度网讯科技有限公司 Query statement generation method and device, electronic equipment and storage medium
CN113553411B (en) * 2021-06-30 2023-08-29 北京百度网讯科技有限公司 Query statement generation method and device, electronic equipment and storage medium
CN114692208B (en) * 2022-05-31 2022-09-27 中建电子商务有限责任公司 Processing method of data query service authority
CN114692208A (en) * 2022-05-31 2022-07-01 中建电子商务有限责任公司 Processing method of data query service authority
CN116089476A (en) * 2023-04-07 2023-05-09 北京宝兰德软件股份有限公司 Data query method and device and electronic equipment

Also Published As

Publication number Publication date
CN110609849B (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN110609849B (en) Natural language generation method based on SQL syntax tree node type
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
CN107273355B (en) Chinese word vector generation method based on word and phrase joint training
CN109086270B (en) Automatic poetry making system and method based on ancient poetry corpus vectorization
CN109902159A (en) A kind of intelligent O&M statement similarity matching process based on natural language processing
CN106126507A (en) A kind of based on character-coded degree of depth nerve interpretation method and system
CN111782961B (en) Answer recommendation method oriented to machine reading understanding
CN108427665A (en) A kind of text automatic generation method based on LSTM type RNN models
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
CN113821635A (en) Text abstract generation method and system for financial field
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN114925195A (en) Standard content text abstract generation method integrating vocabulary coding and structure coding
CN113157919A (en) Sentence text aspect level emotion classification method and system
CN113255366A (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN114254645A (en) Artificial intelligence auxiliary writing system
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN116720519B (en) Seedling medicine named entity identification method
CN111259106A (en) Relation extraction method combining neural network and feature calculation
CN110705274A (en) Fusion type word meaning embedding method based on real-time learning
CN109815323B (en) Human-computer interaction training question-answer generation algorithm
CN110442693B (en) Reply message generation method, device, server and medium based on artificial intelligence
CN113901758A (en) Relation extraction method for knowledge graph automatic construction system
Lee et al. A two-level recurrent neural network language model based on the continuous Bag-of-Words model for sentence classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant