CN113536741B

CN113536741B - Method and device for converting Chinese natural language into database language

Info

Publication number: CN113536741B
Application number: CN202010303263.5A
Authority: CN
Inventors: 陈江捷; 梁家卿; 方世能; 肖仰华
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2020-04-17
Filing date: 2020-04-17
Publication date: 2022-10-14
Anticipated expiration: 2040-04-17
Also published as: CN113536741A

Abstract

The invention provides a method and a device for converting Chinese natural language into database language, which convert a natural language text input by a user into a query statement capable of querying a database. The method comprises the following steps: a preprocessing step of performing standardized correction on the natural language text to obtain a standardized text; a column filling step of performing column filling processing based on the standardized text and the headers of all data tables in the database, so as to generate a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator; a condition filling step of extracting from the standardized text, based on the standardized text and the WHERE column, the WHERE content corresponding to the WHERE column and filling it in; and an assembly output step of assembling the connector, the SELECT column and the corresponding aggregation function, the WHERE column and the corresponding WHERE operator, and the WHERE content into a query statement and outputting the query statement.

Description

Method and device for converting Chinese natural language into database language

Technical Field

The invention belongs to the field of converting natural language into structured text, and particularly relates to a method and a device for converting Chinese natural language into a structured query language for database tables.

Background

The conversion from natural language to SQL is an important topic in natural language structuring. It requires a machine to understand information such as the query intent and the constraints in a human question, and to generate an executable SQL statement corresponding to the natural language question according to the syntax of the database structured query language. Converting natural language into SQL has a wide range of application scenarios and is a key technology for intelligent customer service and intelligent assistants; however, owing to the complexity of human language, the technology still needs improvement.

In the prior art, the method for converting the natural language into the SQL can be divided into three categories:

1) A rule-based approach. This approach uses manually defined rules to extract the intent and conditions in the question, for example by recognizing table fields and table contents in the question through a predefined NER dictionary, and assembles the complete SQL statement according to the SQL grammar.

2) A sequence-generation approach. This approach treats the natural-language-to-SQL conversion task as a sequence-to-sequence generation task and adopts Seq2Seq-like methods. Some researchers also combine the SQL grammar with reinforcement learning when generating the SQL sequence, in order to improve the model's ability to generate statements that conform to the SQL grammar.

3) A slot-filling approach. SQL is a highly structured language, and the generated statements conform to a uniform template. The conversion from natural language to SQL can therefore be regarded as a slot-filling task: the slot template is filled through a series of classification or extraction tasks, which completes the conversion to an SQL statement. For example, the field to be queried in SELECT is filled by classifying the table fields.

Among these techniques for converting natural language into SQL, the rule-based approach is limited by its manually defined rules: it can generate SQL statements only for natural language questions with specific, simple sentence patterns and cannot handle more complex questions. The sequence-generation approach ignores the structural characteristics of SQL statements and cannot exploit the template information of the SQL grammar, so it often generates SQL statements that violate the grammar, which reduces the accuracy of the generated statements. Existing slot-filling approaches are limited by the ability of deep models to encode natural language questions: information in questions with complex sentence patterns and semantics cannot be encoded well into low-dimensional vectors, and the weaker natural language representation degrades the downstream classification and extraction tasks, which reduces the accuracy of slot-value filling. In addition, the prior art does not consider the synonymy and ambiguity present in natural language, which reduces the accuracy of capturing both the association between the data table and the question and the condition values in the SQL statement.

Disclosure of Invention

In order to solve the problems, the invention provides a method and a device for enhancing the expression capability of a Chinese natural language by using transfer learning and rules so as to accurately convert the Chinese natural language into an SQL statement, wherein the method adopts the following technical scheme:

The invention provides a method for converting Chinese natural language into database language, which converts a natural language text input by a user into a query statement capable of querying a database, and is characterized by comprising the following steps: a preprocessing step of performing standardized correction on the natural language text to obtain a standardized text; a column filling step of inputting the standardized text and the header of each data table in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and performing column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator; a condition filling step of inputting the standardized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, and extracting from the standardized text, based on the semantic information, the WHERE content corresponding to the WHERE column and filling it in; and an assembly output step of assembling the connector, the SELECT column and the corresponding aggregation function, the WHERE column and the corresponding WHERE operator, and the WHERE content into a query statement and outputting the query statement.

The method for converting Chinese natural language into database language provided by the invention may further have the technical feature that, in the condition filling step, when extraction is performed on the standardized text based on the semantic information, the text content corresponding to the WHERE column is extracted from the standardized text, similarity is then calculated in turn between the text content and each data content in the data table, and the data content with the highest similarity is selected as the WHERE content.

The method for converting Chinese natural language into database language provided by the invention may further have the technical feature that the input of the first BERT model is an input text formed by adding the token [CLS] before the standardized text and separating and splicing the headers with [SEP], and that the column filling processing comprises the following steps: connector filling, in which the encoding vector output for [CLS] after passing through the first BERT model and the first DGCNN model is input into a first fully-connected layer for prediction so as to fill the connector; SELECT column and aggregation function filling, in which the encoding vectors output for all [SEP] tokens after passing through the first BERT model and the first DGCNN model are input into a second fully-connected layer for prediction so as to fill the SELECT column and the aggregation function; WHERE column filling, in which the encoding vector output for the [SEP] corresponding to each header after passing through the first BERT model and the first DGCNN model is input in turn into a third fully-connected layer to predict whether the corresponding header is in the WHERE condition, and the header is filled in as a WHERE column if so; and WHERE operator filling, in which the output of the first DGCNN model is used as the input of a fourth fully-connected layer to predict the WHERE operator corresponding to each word in the standardized question. In the assembly output step, when the WHERE column, the corresponding WHERE operator, and the WHERE content are assembled, the corresponding WHERE operator is found according to the words in the standardized question that correspond to the WHERE content, and the assembly is completed.

The method for converting Chinese natural language into database language provided by the invention may further have the technical feature that the standardized correction comprises: number unification, which converts Chinese numerals into Arabic numerals by regular-expression matching; year and date normalization, which corrects the dates and times in the natural language text into the form of expression used in the database; numerical unit unification, which unifies the different numerical units in the natural language text into the numerical units used in the database; and synonymous expression correction, which corrects mentions in the natural language text into the corresponding entities in the database by means of an entity disambiguation technique.

The invention also provides a device for converting Chinese natural language into database language, which converts a natural language text input by a user into a query statement capable of querying a database, and is characterized by comprising: a preprocessing module for performing standardized correction on the natural language text to obtain a standardized text; a column filling module for inputting the standardized text and the header of each data table in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and performing column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator; a condition filling module for inputting the standardized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, and extracting from the standardized text, based on the semantic information, the WHERE content corresponding to the WHERE column and filling it in; and an assembly output module for assembling the connector, the SELECT column and the corresponding aggregation function, the WHERE column and the corresponding WHERE operator, and the WHERE content into a query statement and outputting the query statement.

Action and Effect of the invention

According to the method and the device for converting Chinese natural language into database language, the natural language text is processed into a standardized text through standardized correction, so that the form of the questions is unified and the text features can be conveniently mined and modeled in the subsequent process. Further, when the standardized text is converted into the SQL query statement, the processing is carried out in stages: a column filling step that handles classification tasks and a condition filling step that handles a reading-comprehension task, with two sets of BERT and DGCNN models that do not share parameters used for feature extraction in the respective steps. In the column filling step, the headers of the data table and the standardized text can therefore be semantically analyzed together, and the connector, the SELECT column and its corresponding aggregation function, and the WHERE column and its corresponding WHERE operator in the SQL query statement can be predicted more accurately by combining the two representations. In the condition filling step, the corresponding WHERE content can be accurately extracted from the standardized text based on the predicted WHERE column. The representation capability for Chinese natural language is thereby enhanced. The conversion method and device adapt better to Chinese text, represent synonymous entities well, and extract content more accurately, thereby ensuring the accuracy of the generated SQL query statement.

Drawings

FIG. 1 is a schematic diagram of a movie data table according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for converting Chinese natural language to database language according to an embodiment of the present invention;

FIG. 3 is a block diagram of a method for converting Chinese natural language into database language according to an embodiment of the present invention;

FIG. 4 is a block diagram of a column filling step in an embodiment of the present invention; and

FIG. 5 is a schematic diagram of the structure of the SQL statement in the embodiment of the present invention.

Detailed Description

In order to make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the method and device for converting Chinese natural language into database language are described below with reference to an embodiment and the drawings.

<Example>

When processing natural language text, the method for converting Chinese natural language into database language can also handle natural language questions (hereinafter referred to as questions) that are more complex and semantically richer than simple sentences. In this embodiment, the question "What is the total box-office share of the two films Bumblebee and Escape Room in the fourth week of '19?" is used as an example to explain how a natural language text is processed. The database adopted in this embodiment is an SQL database containing at least one film data table whose header comprises, in order, the film name, the weekly box office, the box-office share, and the number of people. A schematic diagram of the film data table is shown in FIG. 1.

In addition, in this embodiment the method for converting Chinese natural language into database language is implemented by a computer: each step of the method is programmed into a corresponding executable module and stored in the computer, and the question is converted into an SQL statement (query statement) capable of querying the database by running the executable modules in sequence.

FIG. 2 is a flowchart illustrating a method for converting Chinese natural language into database language according to an embodiment of the present invention.

FIG. 3 is a block diagram of a method for converting Chinese natural language into database language according to an embodiment of the present invention.

As shown in FIG. 2 and FIG. 3, the method for converting Chinese natural language into database language includes the following steps:

Preprocessing step S1: standardized correction is performed on the natural language question (i.e., the natural language text) to obtain a standardized question (i.e., the standardized text).

In this embodiment, Chinese natural language questions are highly colloquial and lack a uniform descriptive convention. For example, numbers may be written either in Arabic numerals or in Chinese characters, which makes the SQL generation method harder to apply. The preprocessing step S1 therefore performs standardized correction on the irregular expressions in a natural language question in four respects, namely numbers, years and dates, numerical units, and synonyms, which makes it easier for the subsequent steps to mine and model the features of the question. The standardized correction specifically comprises:

1) Number unification: to make it easier for the downstream method to extract content from the question, Chinese numerals are converted into Arabic numerals by regular-expression matching;

2) Year and date normalization: to normalize the way the question expresses dates and years, the dates and times in the question are corrected into the form of expression used in the database, for example correcting an abbreviated year such as "year 10" to "year 2010" so that it is consistent with the data table;

3) Numerical unit unification: the numerical units in the question are unified with those in the table, that is, different numerical units in the question are converted into the numerical units used in the database, for example "5000 meters" in the question is unified into "5 kilometers" as used in the database;

4) Synonymous expression correction: to resolve references in the question, an entity disambiguation technique is used to correct each mention in the question into the corresponding entity, for example correcting "little yellow bike" to "ofo shared bicycle".

Through the standardized correction, the natural language question is corrected into a standardized question. For example, after the preprocessing step S1 processes "What is the total box-office share of the two films Bumblebee and Escape Room in the fourth week of '19?", the resulting standardized question is "What is the total box-office share of Bumblebee and Escape Room in the 4th week of 2019?".
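The number unification and year normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the function names are hypothetical, only integers below 10,000 are handled, two-digit years are assumed to belong to the 2000s, and the Chinese test string is an assumed rendering of part of the example question. Unit unification and synonym correction are omitted. Note also that a plain regular expression would rewrite numeral characters that occur inside ordinary words, which a real system would have to guard against.

```python
import re

# Chinese digit and unit characters; coverage is limited to integers below 10,000.
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}
NUMERAL_RE = re.compile("[零一二两三四五六七八九十百千]+")


def chinese_numeral_to_arabic(text: str) -> str:
    """Convert a Chinese numeral such as '二十三' or '二零一九' to Arabic digits."""
    if not any(ch in UNITS for ch in text):
        # No unit characters: read digit by digit ('二零一九' -> '2019').
        return "".join(str(DIGITS[ch]) for ch in text)
    total, pending = 0, 0
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        else:
            # A unit multiplies the pending digit; a bare '十' means 10.
            total += (pending or 1) * UNITS[ch]
            pending = 0
    return str(total + pending)


def unify_numbers(question: str) -> str:
    """Number unification: replace every run of Chinese numerals by regex matching."""
    return NUMERAL_RE.sub(lambda m: chinese_numeral_to_arabic(m.group()), question)


def normalize_year(question: str) -> str:
    """Year normalization: expand a two-digit year before '年' (assumes 20xx)."""
    return re.sub(r"(?<!\d)(\d{2})年", lambda m: "20" + m.group(1) + "年", question)


def standardize(question: str) -> str:
    """Apply number unification first, then year normalization."""
    return normalize_year(unify_numbers(question))


print(standardize("19年第四周"))  # prints 2019年第4周
```

Running number unification before year normalization means that a year written in Chinese numerals is first turned into digits and can then be expanded by the same year rule.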

Column filling step S2: the standardized question and the header of each data table in the database are input into a preset first BERT model and a preset first DGCNN model so as to obtain the semantic representation of the natural language question and the header representation of each header, and the connector, the SELECT column and its corresponding aggregation function, and the WHERE column and its corresponding WHERE operator are filled in based on the semantic representation and the header representations.

FIG. 4 is a flowchart illustrating a column filling step according to an embodiment of the present invention.

As shown in FIG. 4, the input of the column filling step S2 is the standardized question produced by the preprocessing step S1 together with the header of the data table corresponding to the question, and the step predicts the SELECT query column, the corresponding aggregation function, the connector, and the WHERE condition column of the SQL statement to be generated. The column filling step S2 can be divided into two sub-steps:

Step S2-1, question and header representation: the standardized question and the header of each data table in the database are input into the preset first BERT model and the preset first DGCNN model so as to obtain the semantic representation of the natural language question and the representations of the headers.

In step S2-1 of this embodiment, a pre-trained language model (i.e., the first BERT model) is first used to obtain a semantic representation of the input natural language (i.e., the standardized question), and then a dilated gated convolutional neural network model (i.e., the first DGCNN model), based on a convolutional neural network (CNN) and an attention mechanism, is used for sequence feature extraction, so as to obtain the representations of the question and of each header.

Because the question alone is not enough to determine which part of the data table the user really wants to query, the question and the header of the data table need to be encoded together to help the computer learn the correspondence between the question and the headers. Therefore, to obtain the representations of the question and the headers, Chinese BERT is used as the input encoder, because the BERT model is pre-trained on a large-scale corpus and has strong semantic representation capability. A DGCNN model based on a CNN and an attention mechanism is then used to further extract the semantic relations between the question and the headers. This exploits the computational efficiency of CNNs relative to RNNs and allows the CNN model to capture longer-range information without increasing the number of model parameters, which improves the computer's running efficiency while maintaining prediction accuracy.

Specifically, in step S2-1, as shown in FIG. 4, the question (i.e., Question in FIG. 4) and the headers (i.e., H1, H2, H3, and H4 in FIG. 4) are spliced together by adding a token [CLS], which is used for classifying the connector, in front of the natural language question and by separating the headers from one another with [SEP]. The spliced input text is then used as the input of the BERT encoder, and the BERT output for the question is obtained after the computation of BERT's multi-head self-attention mechanism and multiple Transformer layers.
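The splicing of the question and the headers can be sketched as below. This is only an illustration: the function name and the English stand-in tokens are hypothetical, and the sketch assumes that one [SEP] is placed immediately before each header, which the text does not state explicitly. Recording the [SEP] positions is convenient because the later classification heads read the encoded vectors at exactly those positions.

```python
from typing import List, Sequence, Tuple


def build_input(question_tokens: Sequence[str],
                headers: Sequence[Sequence[str]]) -> Tuple[List[str], List[int]]:
    """Splice [CLS] + question + ([SEP] + header tokens) for every header.

    Returns the token sequence and the index of the [SEP] that precedes each
    header (assumed layout).
    """
    tokens = ["[CLS]"] + list(question_tokens)
    sep_positions = []
    for header_tokens in headers:
        sep_positions.append(len(tokens))  # position of this header's [SEP]
        tokens.append("[SEP]")
        tokens.extend(header_tokens)
    return tokens, sep_positions


# English stand-ins for the question and the four headers H1..H4 of FIG. 4.
question = ["total", "share", "of", "Bumblebee", "?"]
headers = [["film", "name"], ["weekly", "box-office"],
           ["box-office", "share"], ["people", "count"]]
tokens, sep_positions = build_input(question, headers)
print(tokens)
print(sep_positions)  # prints [6, 9, 12, 15]
```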

Meanwhile, in order to enhance the representation of the question, the part-of-speech-tagged question (i.e., Question POS Tag) is passed through an Embedding layer to obtain word-level part-of-speech information, and the BERT output of the question is added to this part-of-speech embedding to form the input of the DGCNN model. After the input passes through several one-dimensional convolution blocks of the DGCNN, an attention mechanism is used in place of the pooling layer of a conventional CNN to integrate the information of the input sequence effectively. Finally, the output of the DGCNN serves as the representation of the question and the headers in the column filling step S2.

Step S2-2, column filling: the connector, the SELECT column and its corresponding aggregation function, and the WHERE column and its corresponding WHERE operator are filled in based on the semantic representation and the header representations output in step S2-1.
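The text does not give the internal equations of the DGCNN blocks. As a rough intuition only, one common form of a gated, dilated one-dimensional convolution block is sketched below for a single-channel sequence: one convolution produces candidate values, a second produces a sigmoid gate, and the gated values are added to the input as a residual. The kernel size, the residual connection, and all names are assumptions; a real model would use many channels and learned kernels.

```python
import math
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Zero-padded 1-D convolution; kernel taps are `dilation` positions apart."""
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + (k - half) * dilation
            if 0 <= idx < len(x):  # positions outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gated_dilated_block(x: Sequence[float], value_kernel: Sequence[float],
                        gate_kernel: Sequence[float], dilation: int) -> List[float]:
    """y = x + conv_value(x) * sigmoid(conv_gate(x)) (assumed residual gated form)."""
    values = dilated_conv1d(x, value_kernel, dilation)
    gates = [sigmoid(g) for g in dilated_conv1d(x, gate_kernel, dilation)]
    return [xi + v * g for xi, v, g in zip(x, values, gates)]


x = [1.0, 2.0, 3.0, 4.0]
# With dilation 2, the left tap of a size-3 kernel looks two positions back.
print(dilated_conv1d(x, [1.0, 0.0, 0.0], dilation=2))  # prints [0.0, 0.0, 1.0, 2.0]
print(gated_dilated_block(x, [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], dilation=2))
```

Increasing the dilation widens the span of positions a kernel of fixed size can see, which is how such a block can capture longer-range information without adding parameters.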

In this embodiment, SELECT and WHERE are preset template keywords in the slot template before column filling. In step S2-2, the connector, the SELECT column and its corresponding aggregation function, and the WHERE column and the operator cop are generated by several fully-connected (Dense) layers from the encoded question and headers. As shown in FIG. 5, taking one SQL statement as an example: the connector 10 is the logical operator used when multiple conditions appear in WHERE and takes one of the values "AND", "OR", and "NULL"; the SELECT column 20 is a column name in the SELECT clause, and the corresponding aggregation function 30 is a function that performs aggregation such as counting or summation on the values of the SELECT column in the data table, such as "AVG", "COUNT", or "SUM"; the WHERE column 40 is a column name in the WHERE condition and is used by the condition filling step S3 to extract the WHERE content 50 corresponding to that column name; and cop denotes the operator 60 corresponding to each header in the input. The specific processes for the connector, the SELECT column and its aggregation function, the WHERE column, and cop are as follows:

Connector filling: the connector has three possible outputs, "AND", "OR", and "NULL", so filling the connector can be regarded as a three-class classification problem. The input here is the state corresponding to [CLS], which can be regarded as a representation of the whole question together with the headers, and a fully-connected layer (the first fully-connected layer D1) predicts the connector from it. That is, connector filling inputs the encoding vector output for [CLS] after passing through the first BERT model and the first DGCNN model into the first fully-connected layer for prediction, thereby filling the connector.

SELECT column and aggregation function filling: predicting the SELECT column and the corresponding aggregation function is treated as a single task, namely classifying each header into one of the outputs "selected with no aggregation function", "AVG", "MAX", "MIN", "COUNT", "SUM", and "unselected". The first six outputs mean the header is selected and the seventh means it is not. Predicting whether each header is selected and which aggregation function it takes is therefore essentially a seven-class classification problem. The state of each [SEP] is regarded as the representation of the corresponding header, and a fully-connected layer (the second fully-connected layer D2) predicts the SELECT column and the aggregation function. That is, the encoding vectors output for all [SEP] tokens after passing through the first BERT model and the first DGCNN model are input into the second fully-connected layer for prediction, thereby filling the SELECT column and the aggregation function.

WHERE column filling: WHERE column filling predicts whether a header is in the WHERE condition and is therefore a binary classification problem for each header. The [SEP] corresponding to each header is passed through a fully-connected layer (the third fully-connected layer D3) to predict whether the header is in the WHERE condition. That is, the encoding vector output for the [SEP] corresponding to each header after passing through the first BERT model and the first DGCNN model is input in turn into the third fully-connected layer to predict whether the corresponding header is in the WHERE condition, and if so the header is filled in as a WHERE column.

WHERE operator filling: WHERE operator filling predicts the operator corresponding to each WHERE column. Because one header may correspond to multiple operators, predicting the WHERE column and predicting its operator are split into two tasks, unlike SELECT column filling, where the SELECT column and the aggregation function are treated as one task. Specifically, an operator is predicted for each word of the original question, that is, the table contents mentioned in the question are mapped to their corresponding operators. The output of the first DGCNN model is therefore used as the input of a fully-connected layer (the fourth fully-connected layer D4) that predicts the operator corresponding to each word of the question, and in the assembly output step S4 the operator corresponding to the relevant words is looked up according to the WHERE content extracted in the condition filling step S3.
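How the four fully-connected layers D1 to D4 read different positions of the encoded sequence can be sketched as follows. Everything here is a stand-in: the encoded vectors and the layer weights are random rather than trained, so the predicted labels are meaningless; the operator label set, the hidden size, the label names, and the assumption that the question tokens occupy the positions between [CLS] and the first [SEP] are all hypothetical. The sketch only shows which vectors feed which head and how many predictions each head produces.

```python
import random

CONNECTORS = ["AND", "OR", "NULL"]                      # D1: three classes
SELECT_CLASSES = ["NO_AGG", "AVG", "MAX", "MIN",
                  "COUNT", "SUM", "UNSELECTED"]         # D2: seven classes
OPERATORS = [">", "<", "=", "!=", "NONE"]               # D4: assumed label set


def make_layer(rng, n_out, n_in):
    """Random (weights, bias) pair standing in for a trained dense layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(vec, weights, bias):
    """Plain affine map: one score per output class."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)


def predict_slots(encoded, sep_positions, layers):
    cls_vec = encoded[0]                                # whole question + headers
    question_vecs = encoded[1:sep_positions[0]]         # one vector per question token
    header_vecs = [encoded[p] for p in sep_positions]   # one [SEP] vector per header
    return {
        "connector": CONNECTORS[argmax(dense(cls_vec, *layers["D1"]))],
        "select": [SELECT_CLASSES[argmax(dense(h, *layers["D2"]))]
                   for h in header_vecs],
        "where": [argmax(dense(h, *layers["D3"])) == 1 for h in header_vecs],
        "operators": [OPERATORS[argmax(dense(q, *layers["D4"]))]
                      for q in question_vecs],
    }


rng = random.Random(0)
hidden = 8
sep_positions = [6, 9, 12, 15]          # four headers, five question tokens
encoded = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(18)]
layers = {"D1": make_layer(rng, 3, hidden), "D2": make_layer(rng, 7, hidden),
          "D3": make_layer(rng, 2, hidden),
          "D4": make_layer(rng, len(OPERATORS), hidden)}
result = predict_slots(encoded, sep_positions, layers)
print(result)
```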

Condition filling step S3: the standardized text and the WHERE column are input into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, and the WHERE content corresponding to the WHERE column is extracted from the standardized text based on the semantic information and filled in. In this embodiment, the condition filling step S3 can be divided into two sub-steps:

Step S3-1: text content is extracted from the standardized text based on the WHERE column.

In step S3-1 of this embodiment, the input is the standardized question and the WHERE column predicted in the column filling step S2, and the corresponding WHERE content is extracted from the standardized question based on the WHERE column. The task of this step is therefore essentially a reading-comprehension substring extraction problem, that is, predicting the start position and the end position in the standardized question of the content corresponding to the WHERE column, so the condition extraction can be modeled as a sequence labeling problem over the question.

Specifically, the input standardized question is spliced with a WHERE column (column name), and the semantic information of the input is extracted using a second BERT model and a second DGCNN model that have the same structure as those in the column filling step S2 (because the tasks differ, the BERT + DGCNN model in the condition filling step S3 does not share parameters with the BERT + DGCNN model in the column filling step S2). The output of the second DGCNN model is used as the input of a fully-connected layer, which then predicts the probability that each token is the start position or the end position of the content. A certain value is set as the threshold for extracting content, so that as much as possible of the text content corresponding to the WHERE column in the standardized question is extracted and coverage is ensured.
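Given per-token start and end probabilities, the threshold-based extraction can be sketched as follows. The pairing rule used here (each start position above the threshold is matched with the nearest end position at or after it that is also above the threshold) is an assumption, as are the function name, the threshold value of 0.5, and the illustrative probabilities, which are not model outputs.

```python
from typing import List, Sequence


def extract_spans(tokens: Sequence[str], start_probs: Sequence[float],
                  end_probs: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Return every span whose start and end probabilities reach the threshold."""
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue
        # Pair this start with the nearest qualifying end at or after it.
        for j in range(i, len(tokens)):
            if end_probs[j] >= threshold:
                spans.append(" ".join(tokens[i:j + 1]))
                break
    return spans


# English stand-in tokens; the WHERE column being filled is the film name.
tokens = ["share", "of", "Bumblebee", "and", "Escape", "Room"]
start_probs = [0.1, 0.0, 0.9, 0.0, 0.8, 0.1]   # made-up values for illustration
end_probs = [0.0, 0.0, 0.95, 0.0, 0.1, 0.9]
print(extract_spans(tokens, start_probs, end_probs))
# prints ['Bumblebee', 'Escape Room']
```

Lowering the threshold extracts more candidate spans, which matches the stated aim of favoring coverage here and relying on the later matching against the data table to select the final content.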

In this embodiment, the text content is used in two ways: on the one hand, its similarity to the data content of the data table is calculated in step S3-2, and the content with the highest similarity is finally selected as the filled-in condition; on the other hand, the operator corresponding to the WHERE column is obtained according to the cop generated in the column filling step S2.

Step S3-2: fuzzy matching is performed in turn between the text content and each data content in the data table, and the data content with the highest similarity is taken as the WHERE content.

Although the question has been standardized in the preprocessing step S1, a certain proportion of cases still slip through the net, that is, the content in the question does not necessarily coincide exactly with how that content is represented in the database. Therefore, the text content extracted from the standardized question still needs some post-processing to find the similar content in the database as the final extracted content. Depending on what the user asks about in the question, the extracted substrings (i.e., the values of the SQL condition columns) can be roughly divided into two types, numeric values and character strings:

1) For a numeric substring, the value needs to be modified so that its unit is consistent with the unit used by the corresponding column in the data table. For example, if the question mentions "5 ten-thousand square kilometers" (i.e., 50,000 square kilometers) and the unit of the corresponding column in the data table is "square kilometers", the extracted numeric substring "5" needs to be modified to "50000";

2) For a string-type substring, owing to the diversity of Chinese natural language, two cases need to be handled: the substring is an incomplete abbreviation of the corresponding element in the data table, such as an abbreviated form of "Shanghai Jiao Tong University", or the substring is a synonym of the corresponding element. By combining features such as edit distance and longest common substring length for fuzzy matching at the string level, and by using an entity reference resolution technique for entity disambiguation at the semantic level, string-type substrings can be matched well to the database.

Through the above processing, the data content with the highest similarity to the text content is obtained from the data table and used as the WHERE content, which ensures that the WHERE content corresponds to data content actually present in the data table.
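The string-level part of this fuzzy matching can be sketched as follows. The text only states that edit distance and longest common substring length are used together; the equal weighting, the length normalization, and the names `similarity` and `best_match` are assumptions made for illustration, and the semantic-level entity disambiguation is not covered here.

```python
from difflib import SequenceMatcher
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous block shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def similarity(text: str, candidate: str) -> float:
    """Combine both features into a score in [0, 1]; the 0.5 weights are an assumption."""
    longest = max(len(text), len(candidate))
    if longest == 0:
        return 1.0
    edit_score = 1.0 - edit_distance(text, candidate) / longest
    substring_score = longest_common_substring(text, candidate) / longest
    return 0.5 * edit_score + 0.5 * substring_score


def best_match(text: str, column_values: Sequence[str]) -> str:
    """Return the data content most similar to the extracted text content."""
    return max(column_values, key=lambda value: similarity(text, value))
```

For instance, `best_match("上海交大", ["上海交通大学", "上海大学", "复旦大学"])` selects "上海交通大学", because the abbreviation is a subsequence of the full name and shares a longer common substring with it than with the other candidates.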

Step S4, the assembly output step: the connector, the SELECT column and the corresponding aggregation function, the WHERE column and the corresponding WHERE operator, and the WHERE content are assembled into a query statement, and the query statement is output.

In the assembly output step S4 of this embodiment, based on the WHERE column acquired in the column filling step S2 and the content corresponding to that WHERE column, the operator for the WHERE column is looked up among the operators predicted for each token in the condition filling step S3. The resulting WHERE column, operator, and content together fill the WHERE condition, which completes the assembly of the SQL statement (i.e., the query statement), and the SQL statement is then output.
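The assembly of the predicted parts into a query statement can be sketched as follows. This is a minimal illustration assuming a single data table; the `Condition` class, the function `assemble_query`, the set of allowed operators, and the table and column names in the usage note are all hypothetical and do not reproduce the query shown in the figures.

```python
from dataclasses import dataclass

ALLOWED_OPERATORS = {"=", "!=", ">", "<"}   # assumed operator set
ALLOWED_CONNECTORS = {"AND", "OR"}          # assumed connector set


@dataclass
class Condition:
    column: str      # WHERE column
    operator: str    # WHERE operator
    content: object  # WHERE content (string or number)


def quote_identifier(name: str) -> str:
    """Quote a table or column name, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def format_literal(value: object) -> str:
    """Emit numbers as-is and strings as quoted SQL literals."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def assemble_query(table, select_columns, conditions, connector="AND"):
    """Build a SELECT statement from (column, aggregation) pairs and conditions."""
    select_parts = []
    for column, aggregation in select_columns:
        identifier = quote_identifier(column)
        select_parts.append(f"{aggregation}({identifier})" if aggregation else identifier)
    sql = f"SELECT {', '.join(select_parts)} FROM {quote_identifier(table)}"
    if conditions:
        if len(conditions) > 1 and connector not in ALLOWED_CONNECTORS:
            raise ValueError(f"unsupported connector: {connector!r}")
        where_parts = []
        for condition in conditions:
            if condition.operator not in ALLOWED_OPERATORS:
                raise ValueError(f"unsupported operator: {condition.operator!r}")
            where_parts.append(
                f"{quote_identifier(condition.column)} {condition.operator} "
                f"{format_literal(condition.content)}"
            )
        sql += " WHERE " + f" {connector} ".join(where_parts)
    return sql
```

For example, `assemble_query("films", [("share", "SUM")], [Condition("title", "=", "Bumblebee"), Condition("week", "=", 4)], "AND")` yields `SELECT SUM("share") FROM "films" WHERE "title" = 'Bumblebee' AND "week" = 4`. A single connector shared by all conditions is assumed here, as in the connector predicted by the column filling step.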

In this embodiment, the query statement finally obtained for the question "What is the combined box-office share of the two films Bumblebee and Escape Room in the fourth week of '19?" is shown in Fig. 3 and Fig. 5.

In this embodiment, the assembly output step S4 may output the query statement to a display screen of the computer, so that the user can confirm the converted SQL query statement or perform other operations on it, such as running it; alternatively, the query statement may be output directly to an SQL database and run there, so that the corresponding query result is obtained directly.
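The second output option, running the assembled statement directly against a database, can be sketched as follows. The sketch uses an in-memory SQLite database from the Python standard library purely for illustration; the table `films`, its columns, and its rows are made-up sample data, and the embodiment does not specify which database system is used.

```python
import sqlite3


def run_query(connection, sql):
    """Execute an assembled query statement and return all result rows."""
    return connection.execute(sql).fetchall()


# Hypothetical sample table standing in for a data table in the database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE films (title TEXT, week INTEGER, share REAL)")
connection.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 4, 10.0), ("B", 4, 5.5), ("C", 4, 20.0)],
)

# A query statement of the kind produced by the assembly output step.
sql = (
    'SELECT SUM("share") FROM "films" '
    "WHERE \"week\" = 4 AND (\"title\" = 'A' OR \"title\" = 'B')"
)
rows = run_query(connection, sql)
```

Here `rows` holds a single row containing the summed share of films "A" and "B" for week 4.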

In this embodiment, for greater convenience in practical application, steps S1 to S4 of the method for converting Chinese natural language into database language may also be packaged in advance into corresponding program modules, namely a preprocessing module, a column filling module, a condition filling module, and an assembly output module, which together form a device for converting Chinese natural language into database language. The device applies steps S1 to S4 to the natural language text input by the user and outputs the query statement obtained by the conversion.

Functions and effects of the embodiment

According to the method and the device for converting Chinese natural language into database language provided by this embodiment, the natural language text is processed into the standardized text through standardized correction, which unifies the form of the questions and makes it easier to mine and model text features in subsequent processing. Further, when the standardized text is converted into the SQL query statement, the processing is divided into two stages, a column filling step that handles a classification task and a condition filling step that handles a reading comprehension task, and two sets of BERT and DGCNN models that do not share parameters are used for feature extraction in the respective stages. In the column filling step, the headers of the data table and the standardized text can therefore be semantically analyzed together, and by combining the two representations the connector, the SELECT column and the corresponding aggregation function, and the WHERE column and the corresponding WHERE operator of the SQL query statement can be predicted more accurately. In the condition filling step, the corresponding WHERE content can be accurately extracted from the standardized text based on the predicted WHERE column, which strengthens the ability to represent Chinese natural language. The conversion method and device therefore adapt better to Chinese text, represent synonymous entities well, and extract more accurate content, thereby ensuring the accuracy of the generated SQL query statement.

In addition, in this embodiment, after the text content corresponding to the WHERE column is extracted from the standardized text, its similarity to the data content in the data table is also calculated, and the data content with the highest similarity is selected as the WHERE content. This has the effect of introducing external knowledge: the substring obtained from the question is mapped to a specific element of the corresponding data table, so that the finally obtained SQL statement can be executed in the database. That is, the accuracy of the WHERE condition values in the converted SQL statement is further improved, and errors caused by string-level differences between synonymous entities are reduced. By contrast, most previous SQL generation methods do not consider the problem of synonymy between the question and the entities in the data table, and thus neglect the semantic relationship between them.

In this embodiment, the standardized correction corrects irregularly expressed natural language text in four respects: numbers, years and dates, numerical units, and synonymous expressions. Natural language text with a high degree of colloquialism can thus be described and normalized in a uniform way. Moreover, when synonymous expressions are corrected, entity disambiguation and reference resolution are performed using an entity disambiguation technique, which reduces the semantic differences between synonymous expressions and further improves the conversion accuracy of the SQL statement.
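As an illustration of the number correction, the conversion of Chinese numerals into Arabic numerals by regular-expression matching can be sketched as follows. This is a simplified sketch under stated assumptions: the names `chinese_to_int` and `replace_chinese_numbers` are hypothetical, only values below ten thousand are handled (larger magnitudes such as 万 are left to the numerical unit processing), and no attempt is made to skip numerals that are part of ordinary words.

```python
import re

DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
CHINESE_NUMBER = re.compile("[零一二两三四五六七八九十百千]+")


def chinese_to_int(text: str) -> int:
    """Convert a run of Chinese numeral characters to an integer."""
    if not any(char in SMALL_UNITS for char in text):
        # No unit characters: read digit by digit, e.g. "二零一九" -> 2019.
        return int("".join(str(DIGITS[char]) for char in text))
    total = 0
    digit = 0
    for char in text:
        if char in DIGITS:
            digit = DIGITS[char]
        else:
            # A unit with no preceding digit implies 1, e.g. "十五" -> 15.
            total += (digit or 1) * SMALL_UNITS[char]
            digit = 0
    return total + digit


def replace_chinese_numbers(sentence: str) -> str:
    """Replace every run of Chinese numerals in the sentence with Arabic digits."""
    return CHINESE_NUMBER.sub(lambda m: str(chinese_to_int(m.group())), sentence)
```

For example, `replace_chinese_numbers("第四周")` gives "第4周", so that a question mentioning "the fourth week" can be compared with a week column stored as a number.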

The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the scope of the description of the above-described embodiments.

Claims

1. A method for converting Chinese natural language into database language, used for converting, according to a database, natural language text input by a user into a query statement capable of querying the database, characterized by comprising the following steps:

a preprocessing step, namely performing standardized correction on the natural language text to obtain a standardized text;

a column filling step, namely inputting the standardized text and the headers of the data tables in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and performing column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and a corresponding aggregation function, and a WHERE column and a corresponding WHERE operator;

a condition filling step, namely inputting the standardized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, performing extraction on the standardized text based on the semantic information, and filling in WHERE content corresponding to the WHERE column;

an assembly output step, in which the connector, the SELECT column and the corresponding aggregation function, the WHERE column and the corresponding WHERE operator, and the WHERE content are assembled into the query statement and output,

wherein, when performing extraction on the standardized text based on the semantic information, the condition filling step extracts text content corresponding to the WHERE column from the standardized text, performs fuzzy matching in turn between the text content and each data content in the data table, and takes the data content with the highest similarity as the WHERE content.

2. The method for converting Chinese natural language into database language according to claim 1, wherein:

the input of the first BERT model is an input text formed by adding the token [CLS] before the standardized text and splicing on the headers separated by [SEP],

the column filling process includes:

a connector filling step of inputting the encoding vector output for [CLS] after passing through the first BERT model and the first DGCNN model into a first fully-connected layer for prediction, so as to fill the connector;

a SELECT column and aggregation function filling step of inputting the encoding vectors output for all the [SEP] tokens after passing through the first BERT model and the first DGCNN model into a second fully-connected layer for prediction, so as to fill the SELECT column and the aggregation function;

a WHERE column filling step of sequentially inputting the encoding vector output for the [SEP] corresponding to each header after passing through the first BERT model and the first DGCNN model into a third fully-connected layer, predicting whether the corresponding header is in a WHERE condition, and filling the header in as a WHERE column if so; and

a WHERE operator filling step of taking the output of the first DGCNN model as the input of a fourth fully-connected layer and predicting the WHERE operator corresponding to each character in the standardized text,

and in the assembly output step, when the WHERE column is assembled with the corresponding WHERE operator and the WHERE content, the corresponding WHERE operator is found according to the characters in the standardized text that correspond to the WHERE content, thereby completing the assembly.

3. The method for converting Chinese natural language into database language according to claim 1, wherein:

the standardized correction comprises:

number unification processing, which converts Chinese numerals into Arabic numerals by regular-expression matching;

year and date normalization processing, which modifies the dates and times in the natural language text into a form of expression consistent with that in the database;

numerical unit unification processing, which unifies the different numerical units in the natural language text into the numerical units used in the database; and

synonymous expression correction processing, which corrects references in the natural language text into the corresponding entities in the database by adopting an entity disambiguation technique.

4. A device for converting Chinese natural language into database language, used for converting, according to a database, natural language text input by a user into a query statement capable of querying the database, characterized by comprising:

a preprocessing module, configured to perform standardized correction on the natural language text to obtain a standardized text;

a column filling module, configured to input the standardized text and the headers of the data tables in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and to perform column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and a corresponding aggregation function, and a WHERE column and a corresponding WHERE operator;

a condition filling module, configured to input the standardized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, to perform extraction on the standardized text based on the semantic information, and to fill in WHERE content corresponding to the WHERE column; and

an assembly output module, which assembles the connector, the SELECT column and the corresponding aggregation function, the WHERE column and the corresponding WHERE operator, and the WHERE content into the query statement and outputs the query statement,

wherein, when performing extraction on the standardized text based on the semantic information, the condition filling module extracts text content corresponding to the WHERE column from the standardized text, performs fuzzy matching in turn between the text content and each data content in the data table, and takes the data content with the highest similarity as the WHERE content.