CN116108175A - Language conversion method and system based on semantic analysis and data construction - Google Patents

Language conversion method and system based on semantic analysis and data construction

Info

Publication number
CN116108175A
CN116108175A
Authority
CN
China
Prior art keywords
text
model
training
column
sql
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211704106.0A
Other languages
Chinese (zh)
Inventor
沈然
孙钢
沈皓
章江铭
金良峰
王庆娟
倪琳娜
吴慧
江晗
姜伟昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Zhejiang University ZJU
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202211704106.0A priority Critical patent/CN116108175A/en
Publication of CN116108175A publication Critical patent/CN116108175A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/242: Query formulation
    • G06F16/2433: Query languages
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F40/35: Discourse or dialogue representation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a language conversion method and system based on semantic analysis and data construction. The technical scheme adopted by the invention comprises the following steps: a table selection task: the table selection task is cast as a text classification model that predicts the tables in the database corresponding to the text; a column selection task: the column selection task is cast as a sequence labeling model that predicts the columns in the database corresponding to the text; an SQL generation task: the SQL generation task is cast as a text generation task, and the optimal text generation model is saved to generate the SQL query statement; prediction: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement. The invention adopts a Pipeline-style text2sql technique and adds extra related information before each model is trained, which improves model accuracy; optimization is achieved by optimizing each model individually, making the method more convenient and effective.

Description

Language conversion method and system based on semantic analysis and data construction
Technical Field
The invention relates to the field of semantic analysis and deep learning, in particular to a language conversion method and a language conversion system based on semantic analysis and data construction.
Background
The database stores a large amount of high-value data. When a user wants to query the database, the user must write a structured query language (SQL) statement and then interact with the database, which is inconvenient for ordinary users outside professional fields; moreover, for complex query conditions, hand-written SQL is error-prone, leading to problems such as insufficient data-mining depth and weak data value-addition and presentation capability.
Text2SQL is an artificial intelligence technique that converts natural language into the SQL query language. At present there are two main approaches: rule-based filling of fixed slots, and end-to-end learning based on deep learning. In practice, however, neither achieves high accuracy, and their industrial practicality is limited.
Disclosure of Invention
In view of this, the invention provides a language conversion method and system based on semantic analysis and data construction, which use semantic analysis technology and deep learning to improve model accuracy so as to meet practical industrial requirements for SQL querying, improve the efficiency of user data mining, activate the value of data and promote the realization of data value.
For this purpose, the invention adopts a technical scheme: a language conversion method based on semantic parsing and data construction, comprising:
1) Data preparation: collecting text2sql question-answer pair data and the organization and structure of a database;
2) Table selection task: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) Column selection task: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) SQL generation task: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the optimal text generation model is saved to generate the SQL query statement;
5) Prediction: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
According to the invention, the traditional text2SQL task is decomposed into three simple subtasks, table selection, column selection and SQL generation: table selection is cast as a text classification problem, column selection as a sequence labeling problem, and SQL generation as a text generation problem. Each subtask is solved by its own NLP model, and the data formats of the different tasks are constructed from the existing data, so that the accuracy of each model is improved and automatic, efficient and accurate SQL retrieval is provided for users.
Further, the specific contents of the data preparation are as follows:
collecting text2SQL question-answer pair data, wherein each sample comprises an input text and a corresponding SQL query sentence;
the organization and structure of the database are collected, including database table names, column value types and the primary-key/foreign-key relationships between tables.
Further, the specific contents of the table selection task include:
constructing the training set and validation set required by the text classification model from the text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
obtaining the text classification model through repeated training and back propagation, and saving the optimal text classification model.
Further, the specific contents of the column selection task include:
constructing the training set and validation set of the sequence labeling model from the text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
the loss function is then computed against the gold labels; the sequence labeling model is obtained through repeated training and back propagation, and the optimal sequence labeling model is saved.
Further, the specific contents of the SQL generation task include:
constructing the training set and validation set of the text generation model from the text2sql question-answer pairs;
parsing the text question and the sql in text2sql to construct the input and label formats: the text is followed by the table names and column names, with sentences separated from table names by special separators, and identifiers corresponding to the table names and column names are added; in the label, the table names and column names in the SQL query statement corresponding to the text are replaced by the corresponding identifiers;
the corresponding text is generated directly by the pre-training model T5 + finetune, features being extracted from the input text and the output text generated token by token; after multiple iterations of loss computation and gradient back-propagation, the text generation model is trained and the optimal text generation model is saved.
Further, the specific contents of prediction include:
1) Receiving the user's text query input;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection step with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
The invention adopts another technical scheme that: a language translation system based on semantic parsing and data construction, comprising:
1) A data preparation unit: collecting text2sql question-answer pair data and the organization and structure of a database;
2) A table selection unit: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) A column selection unit: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) An SQL generation unit: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the optimal text generation model is saved to generate the SQL query statement;
5) A prediction unit: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
The invention has the following beneficial effects: the related information contained in the text query is parsed by semantic analysis; a dedicated text classification input is constructed from all table and column information, and the text classification model matches the query to the corresponding table in the database; the input of the sequence labeling model is constructed from the result of table selection, and sequence labeling matches the corresponding columns of the table; the table and column information selected in the first two steps is combined with the query text to construct the input of the text generation model, which generates the standard SQL query statement. Compared with end-to-end text2sql techniques, the Pipeline-style text2sql technique adopted here adds extra related information before each model is trained, which improves model accuracy; optimization is achieved by optimizing each model individually, making the method more convenient and effective.
Drawings
FIG. 1 is a schematic diagram of a text classification flow in a language conversion method based on semantic parsing and data construction according to the present invention;
FIG. 2 is a schematic diagram of a sequence labeling flow in a language conversion method based on semantic parsing and data construction according to the present invention;
FIG. 3 is a schematic diagram of the flow of SQL query statement generation in the language translation method based on semantic parsing and data construction of the present invention;
FIG. 4 is a schematic diagram of a complete predictive process for text-to-SQL query statement of the present invention.
Detailed Description
The invention is further illustrated and described in connection with the drawings and the detailed description which follow, but the invention is not limited to these examples.
Example 1
The embodiment provides a text2sql method based on semantic analysis and data construction, which comprises the following steps:
step 1, data preparation
Collect the text2SQL question-answer pair data set; each sample contains an input text and the corresponding SQL query statement. The specific format of each sample is as follows: { "text": "the monitoring institution of the line to which the end point of the firewood line belongs contacts fax", "sql": "SELECT T3.Fax_NO FROM ACLINEEND_basic AS T1 JOIN action_basic AS T2 ON T1.Line_id = T2.ID JOIN dcc_basic AS T3 ON T2.Monitor_org_id = T3.ID WHERE T1.Name LIKE '%end point of the firewood line%'" }.
Collect the database information, including the Chinese and English names of the tables stored in the database, the column names and column value types in each table, whether each column is a primary key, and whether the column is linked by a foreign key to another table.
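For illustration only, a minimal sketch of how the collected question-answer pairs and database information might be held in memory; the field names and this Python representation are assumptions of this description, not a format mandated by the invention:

# Illustrative in-memory layout of one question-answer sample and of the collected schema.
sample = {
    "text": "the monitoring institution of the line to which the end point of the firewood line belongs contacts fax",
    "sql": "SELECT T3.Fax_NO FROM ACLINEEND_basic AS T1 ... WHERE T1.Name LIKE '%end point of the firewood line%'",
}

schema = {
    "ACLINEEND_basic": {                        # English table name
        "cn_name": "<Chinese table name>",      # Chinese table name collected from the database
        "columns": {
            "Name":    {"type": "text", "primary_key": False, "foreign_key": None},
            "Line_id": {"type": "text", "primary_key": False,
                        "foreign_key": "<other table>.ID"},   # foreign-key link to another table
        },
    },
}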
Step 2, table selection task
The table selection task is cast as an NLP text classification task: essentially a binary classification that decides, from the question, whether a table is the desired target table.
As shown in fig. 1, firstly, constructing a training set and a verification set required by a text classification model for a data set according to text2sql questions and answers;
and carrying out sentence analysis on text in the question-answer pair, and carrying out word segmentation on the text by utilizing a word segmentation technology, wherein the text corresponds to all table names and column names in a database. And judging the table name and the column name corresponding to each text, and obtaining whether the column is a main key or an external key according to the database information. The standard sql statement is marked, the standard table name is behind the "from", the table name analyzed from text corresponds to the table name analyzed by sql, the consistency is marked as 1, and otherwise, the consistency is marked as 0.
Format of the text classification training and validation samples: { "input": "text extra0 table name extra1 column name column type column hit whether the column is a primary key extra1 ...", "label": 0 or 1 }, where extra0 is the table-name separator and extra1 is the column-name separator.
Specific examples of text classification models are as follows: { "input": "monitoring mechanism of the line to which the plateau three-wire endpoint belongs contacts fax extra0 transformer winding power characteristic value history data table extra1 measurement type text main key extra1 id text main key extra1 date and date main key", "labels":0}.
When training data is scarce, it must be expanded through data augmentation. The invention performs data augmentation by randomly replacing keywords in the question: the condition value in the sql is identified, associated with the corresponding value in the database, and replaced by other values from the same column, thereby expanding the data volume.
The augmented data set is split 9:1 into training and validation sets.
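A minimal sketch of the value-replacement augmentation and the 9:1 split described above, assuming the condition value appears literally in both the text and the sql; the function names and the LIKE-pattern heuristic are illustrative assumptions:

import random
import re

def augment_by_value_swap(sample, column_values, n_copies=3):
    """Create extra samples by swapping the sql condition value (and its mention in the
    question) with other values drawn from the same database column."""
    match = re.search(r"'%([^%]+)%'", sample["sql"])   # simplified: pull the LIKE '%...%' literal
    if not match:
        return []
    old_value = match.group(1)
    new_values = random.sample(column_values, min(n_copies, len(column_values)))
    return [{"text": sample["text"].replace(old_value, v),
             "sql": sample["sql"].replace(old_value, v)} for v in new_values]

def split_9_1(samples):
    """Split the augmented data set 9:1 into training and validation sets."""
    cut = int(len(samples) * 0.9)
    return samples[:cut], samples[cut:]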
The text classification model is built with pre-trained BERT + finetune: the input is encoded by the pre-trained model, the [CLS] vector is taken as the representation vector of the input, and a classification layer over that representation vector decides whether the table name after extra0 is the table name required by the sql query statement corresponding to the text. After multiple iterations of loss computation and gradient back-propagation, the text classification model is trained and the optimal text classification model is saved.
Through the above steps a text classification model is trained; its function is to pick out the target table names from the input.
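A minimal sketch of the BERT + finetune table-selection classifier described above, assuming PyTorch and the Hugging Face transformers library; the checkpoint name "bert-base-chinese", the learning rate and the single-sample training step are illustrative assumptions:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class TableSelector(nn.Module):
    """Binary classifier: is the table appended after the question the target table of the sql?"""
    def __init__(self, checkpoint="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]          # vector of the first [CLS] character
        return self.classifier(cls_vec)                # logits over {not target, target}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = TableSelector()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(text_with_schema, label):
    """One finetuning step: encode, predict, compute the loss against the gold label, back-propagate."""
    enc = tokenizer(text_with_schema, return_tensors="pt", truncation=True, max_length=512)
    logits = model(enc["input_ids"], enc["attention_mask"])
    loss = loss_fn(logits, torch.tensor([label]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()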
Step 3, column selection task
The column selection task is cast as a sequence labeling task that selects the column names involved in the input text.
As shown in FIG. 2, a training set and a verification set of the sequence annotation model are constructed for the data set according to text2sql questions and answers.
As in the text classification task of step 2, the text and sql of each question-answer pair are semantically parsed to obtain the input information. The "input" of the sequence labeling data is consistent with the "input" of the text classification data; the difference lies in the labels. Construct the labels: each field in the "input" corresponds to one label; when a column hit occurs, the label of the column-name separator preceding that column is "B-C", otherwise the column-name separator is labeled "B-N", and labels at all other positions are uniformly "O". The structure is as follows:
The input format of the sequence labeling model: { "text": "text, table name, column name, column type, column hit, whether the column is a primary key, whether the column is a foreign key" }, where a table-name separator precedes each table name and a column-name separator precedes each column name, and each column also carries its column information, such as the column type and whether it is a primary key or a foreign key.
Examples of samples of the sequence annotation model training set are as follows: { "input" [ "show 2 months in 2020," integrated power of Tianjin grid "," "," grid total month power history data table "," "," reported power value "," column hit "," value "," "," object id "," text "," main "]," labels "[" O "," O "," O "," B-C "," O "," O "," O "," B-N "," O "," O ] }.
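A minimal sketch of the label construction rule above ("B-C" for the column-name separator of a hit column, "B-N" otherwise, "O" elsewhere); the separator token "[unused2]" merely stands in for the column-name separator, whose concrete symbol is not reproduced here, and is purely an assumption:

def build_labels(fields, hit_columns, col_sep="[unused2]"):
    """fields: the segmented pieces of the 'input'; a column-name separator precedes each column name.
    hit_columns: the set of column names that actually appear in the gold sql."""
    labels = []
    for i, field in enumerate(fields):
        if field == col_sep and i + 1 < len(fields):
            labels.append("B-C" if fields[i + 1] in hit_columns else "B-N")
        else:
            labels.append("O")
    return labels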
The data set is augmented in the same way, and the augmented data set is split 9:1 into training and validation sets.
The sequence labeling model is likewise built with pre-trained BERT + finetune: the input is encoded by the pre-trained BERT model, the vector at each column-name separator is taken as the representation vector of that column name, and a multi-class layer over all these representation vectors decides whether each column name is required by the sql query statement corresponding to the text. Through loss computation and gradient back-propagation over multiple iterations, the sequence labeling model is trained; the FGM adversarial training method is used during training to improve the robustness of the model, and the optimal sequence labeling model is saved.
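A minimal sketch of the column-selection sequence labeling head: BERT encodes the input, the hidden vector at each column-name separator position is gathered, and a multi-class layer labels it. The checkpoint name and the way separator positions are passed in are assumptions:

import torch
import torch.nn as nn
from transformers import BertModel

class ColumnSelector(nn.Module):
    """Multi-class head (B-C / B-N / O) over the encoding of each column-name separator."""
    def __init__(self, checkpoint="bert-base-chinese", num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, sep_positions):
        # sep_positions: (batch, n_separators) token indices of the column-name separators.
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(1)
        sep_vecs = hidden[batch_idx, sep_positions]    # (batch, n_separators, hidden)
        return self.head(sep_vecs)                     # logits per separator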
Through the steps, a sequence annotation model can be trained, and the sequence annotation model is used for selecting the column names in input.
Step 4, SQL generation task
The SQL generation task is cast as a text generation model that generates the corresponding SQL query statement from the input text.
As shown in fig. 3, the training set and validation set required by the text generation model are first constructed from the text2sql question-answer pair data set. The input format of the text generation model dataset is { "input": "text extra50 extra54 table name extra51 extra0 column name ... extra52 foreign keys extra53 text" }, where extra50 is the table delimiter, extra51 the column delimiter, extra52 the foreign-key delimiter and extra53 the sentence delimiter (as in the example below); extra54 is the identifier of the table name that follows it, and extra0 is the identifier of the column that follows it. The format of the corresponding output is { "sql": the standard sql query statement, in which the table names and column names are replaced by the corresponding identifiers, because generating identifiers works better than generating table or column names directly }.
Examples of datasets for the text generation model are as follows:
{ input ": monitoring mechanism of line to which" maple white 4dq8 line end point belongs contacts fax extra50 extra54 regulation center basic information extra51 extra0 mechanism id main key extra1 contacts fax extra50 extra55 alternating line basic information extra51 extra2 monitoring mechanism extra51 extra3 id main key extra50 extra56 alternating line basic information extra51extra4 line end name extra52 extra2 extra0 extra52 extra5 extra3 extra53 maple white 4dq8 line end point belongs line monitoring mechanism contacts fax,
"sql": "select extra54@ extra1 from extra56 join extra55 on extra56@ extra5 = extra55@ extra3 join extra54 on extra55@ extra2 = extra54@ extra0 where extra56@ extra4 like'% maple white 4dq8 line end point%" }.
While the data is being constructed, keywords in the question are randomly replaced and column information in the text is randomly added or deleted, which serves as data augmentation. The augmented data set is split 9:1 into training and validation sets.
The SQL statement is generated mainly by a large pre-trained T5 model with a Seq2Seq structure: features are extracted from the input, and the sql output text is generated token by token. After multiple iterations of loss computation and gradient back-propagation, the text generation model is trained and the optimal text generation model is saved.
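A minimal sketch of the T5 + finetune generation model, assuming PyTorch and the Hugging Face transformers library; "t5-base" is only a placeholder checkpoint (a Chinese T5 checkpoint would be substituted in practice), and the lengths and learning rate are assumptions:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "t5-base"                      # placeholder; a Chinese T5 checkpoint would be used in practice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(input_text, target_sql):
    """One finetuning step: the loss is the cross-entropy over the target sql tokens."""
    enc = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(target_sql, return_tensors="pt", truncation=True, max_length=256).input_ids
    out = model(**enc, labels=labels)
    out.loss.backward()                     # gradient back-propagation
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

def generate_sql(input_text):
    """Decode the identifier-annotated sql for one input text."""
    enc = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    ids = model.generate(**enc, max_length=256, num_beams=4)   # beam-search decoding
    return tokenizer.decode(ids[0], skip_special_tokens=True)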
To reduce the learning difficulty of the model, a copy mechanism is used, so that required fields can be selected directly from the input text during generation. A beam-search method is used to avoid generating illegal sql statements. Meanwhile, the SWA and FGM adversarial training methods are used to train the model, improving its generalization and accuracy.
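A minimal sketch of the FGM adversarial-training step mentioned above: the word-embedding weights are perturbed along the gradient direction, a second forward/backward pass is run on the perturbed embeddings, and the weights are then restored. The epsilon value and the embedding parameter name are assumptions; SWA can be layered on top with torch.optim.swa_utils.

import torch

class FGM:
    """Fast Gradient Method adversarial training on the word-embedding weights."""
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)   # step along the gradient

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical use inside one training step:
#   loss.backward(); fgm.attack()
#   loss_adv = model(**batch).loss; loss_adv.backward()
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()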
Through the steps, a text generation model can be trained, and the function is to convert input text into an sql query statement with an identifier.
Step 5, prediction
As shown in fig. 4, the user's text information x is received first. In this example, x = "count the number of metering communication module archives in each state for unit 001".
First, table selection: the received text x is combined with each table in the database and processed into the input format required by the text classification model: { "input": "statistics unit 001 number of files for each state metering communication module extra0 metering communication module archive table extra1 unit code text primary key column hit extra1 unit name text column hit extra1 state name text column hit extra1 total number of text column hits" }.
Through the text classification model trained in the step 2, table names contained in the text information x, namely a metering communication module archive table, can be selected.
Then column selection: the text information x and the selected table name are combined and processed into the input format required by the sequence labeling model, namely { "input": "statistics of the number of the metering communication module files in each state of the unit 001" the number of the metering communication module files unit code text column hits area hierarchy text unit name text column hits state name text column hits total number of text column hits" }.
Through the trained sequence labeling model in the step 3, column names contained in the text information x, namely 'unit codes, unit names, state names and total numbers', can be selected.
Processing the text information x and the selected table names and column names into a format required by the text generation model input, namely 'input_text': the total number extra50 extra54 metering communication module archives of each state of the statistical unit 001 is equal to the total number extra53 of extra51 extra2 state names extra51 extra3 of the extra51 extra1 unit codes extra51 extra0 unit names extra51 extra1 unit codes, and the integral electric quantity of the Tianjin power grid is shown as 2 months in 2020.
The corresponding sql query statement is obtained through the text generation model trained in step 4. The output is: "select extra54@extra0 extra54@extra1 extra54@extra2 extra54@extra3 from extra54 where extra54@extra1 = 'value'".
The table identifiers and column identifiers in the obtained sql query statement are replaced with the corresponding table names and column names to obtain the standard sql query statement, namely "select name, zhuangtai, number from table where daima = 'Tianjin electric network'".
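A minimal sketch of this final post-processing step: the identifiers emitted by the generation model are mapped back to the real table and column names recorded when the input was built. The mapping shown and the handling of '@' (a table identifier joined to a column identifier) are illustrative assumptions; here the table prefix is dropped, matching the single-table example above.

import re

def restore_identifiers(sql_with_ids, id_to_name):
    """Replace identifiers such as 'extra54' or 'extra54@extra1' with real table/column names."""
    # 'table_id@column_id' pairs become the bare column name (single-table query).
    sql = re.sub(r"extra\d+@(extra\d+)",
                 lambda m: id_to_name[m.group(1)], sql_with_ids)
    # Remaining bare identifiers (e.g. after FROM) are table names.
    for ident, name in id_to_name.items():
        sql = re.sub(rf"\b{ident}\b", name, sql)
    return sql

# Illustrative mapping recorded while the generation-model input was constructed.
id_to_name = {"extra54": "table", "extra0": "name", "extra1": "daima",
              "extra2": "zhuangtai", "extra3": "number"}
print(restore_identifiers(
    "select extra54@extra0 extra54@extra2 extra54@extra3 "
    "from extra54 where extra54@extra1 = 'Tianjin electric network'", id_to_name))
# -> select name zhuangtai number from table where daima = 'Tianjin electric network'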
According to the invention, text2sql is decomposed into several simple subtasks, and overall accuracy is improved by improving the accuracy of each subtask, making full use of the semantic understanding capability of large pre-trained models; the data of each model is augmented to improve the model's robustness to interference; and different training tricks such as adversarial training and SWA are used to improve the accuracy of model prediction.
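Putting the steps of this example together, a minimal sketch of the prediction pipeline; the three model wrappers are assumed callables with the interfaces shown (not a fixed API of the invention), and restore_identifiers is the post-processing helper sketched above:

def text_to_sql(question, schema, select_tables, select_columns, generate_sql_with_ids):
    """Chain the three trained models: table selection, column selection, SQL generation."""
    tables = select_tables(question, schema)                       # step 2: text classification
    columns = select_columns(question, tables, schema)             # step 3: sequence labeling
    sql_with_ids, id_to_name = generate_sql_with_ids(              # step 4: text generation
        question, tables, columns, schema)
    return restore_identifiers(sql_with_ids, id_to_name)           # step 5: identifier replacement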
Example 2
The present embodiment provides a language conversion system based on semantic parsing and data construction, which includes:
1) A data preparation unit: collecting text2sql question-answer pair data and the organization and structure of a database;
2) A table selection unit: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) A column selection unit: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) An SQL generation unit: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the SQL query statement is generated;
5) A prediction unit: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
The table selection unit comprises the following specific contents:
training sets and verification sets required by constructing a text classification model according to text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the Google pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
and obtaining a text classification model through multiple times of training and back propagation, and storing the optimal text classification model.
The specific contents of the column selecting unit include:
constructing a training set and a verification set of a sequence annotation model by using text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
and then, calculating a loss function with the standard label, obtaining a sequence labeling model through multiple times of training and back propagation, and storing the optimal sequence labeling model.
The specific content of the SQL generating unit comprises:
constructing a training set and a verification set of a text generation model by using text2sql question-answer pairs;
the text problem and sql in text2sql are analyzed to construct input and label formats; the text is followed by a table name and a column name, wherein sentences are separated from the table name by special separators, and identifiers corresponding to the table name and the column name are added; the tag replaces the table names and column names in the SQL query sentences corresponding to the text with the corresponding identifiers;
directly generating corresponding texts through a pre-training model T5+finetune, and sequentially generating output texts through extracting features from the input texts; after multiple iterations, the text generation model is trained through the calculation of the loss function and the gradient back propagation, and the optimal text generation model is stored.
The specific content of the prediction unit comprises:
1) Receiving text query input of a user;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection unit with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (10)

1. The language conversion method based on semantic analysis and data construction is characterized by comprising the following steps:
1) Data preparation: collecting text2sql question-answer pair data and the organization and structure of a database;
2) Table selection task: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) Column selection task: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) SQL generation task: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the optimal text generation model is saved to generate the SQL query statement;
5) Prediction: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
2. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the specific contents of the table selection task include:
training sets and verification sets required by constructing a text classification model according to text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
and obtaining a text classification model through multiple times of training and back propagation, and storing the optimal text classification model.
3. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the specific contents of the column selecting task include:
constructing a training set and a verification set of a sequence annotation model by using text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
and then, calculating a loss function with the standard label, obtaining a sequence labeling model through multiple times of training and back propagation, and storing the optimal sequence labeling model.
4. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the specific contents of the SQL generation task include:
constructing a training set and a verification set of a text generation model by using text2sql question-answer pairs;
the text problem and sql in text2sql are analyzed to construct input and label formats; the text is followed by a table name and a column name, wherein sentences are separated from the table name by special separators, and identifiers corresponding to the table name and the column name are added; the tag replaces the table names and column names in the SQL query sentences corresponding to the text with the corresponding identifiers;
directly generating corresponding texts through a pre-training model T5+finetune, and sequentially generating output texts through extracting features from the input texts; after multiple iterations, the text generation model is trained through the calculation of the loss function and the gradient back propagation, and the optimal text generation model is stored.
5. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the predicted concrete contents include:
1) Receiving text query input of a user;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection step with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating through the text generation model a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
6. A language translation system based on semantic parsing and data construction, comprising:
1) A data preparation unit: collecting text2sql question-answer pair data and the organization and structure of a database;
2) A table selection unit: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) A column selection unit: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) An SQL generation unit: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, the optimal text generation model is saved, and the SQL query statement is generated;
5) A prediction unit: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
7. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the table selection unit include:
training sets and verification sets required by constructing a text classification model according to text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
and obtaining a text classification model through multiple times of training and back propagation, and storing the optimal text classification model.
8. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the column selection unit include:
constructing a training set and a verification set of a sequence annotation model by using text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
and then, calculating a loss function with the standard label, obtaining a sequence labeling model through multiple times of training and back propagation, and storing the optimal sequence labeling model.
9. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the SQL generating unit include:
constructing a training set and a verification set of a text generation model by using text2sql question-answer pairs;
the text problem and sql in text2sql are analyzed to construct input and label formats; the text is followed by a table name and a column name, wherein sentences are separated from the table name by special separators, and identifiers corresponding to the table name and the column name are added; the tag replaces the table names and column names in the SQL query sentences corresponding to the text with the corresponding identifiers;
directly generating corresponding texts through a pre-training model T5+finetune, and sequentially generating output texts through extracting features from the input texts; after multiple iterations, the text generation model is trained through the calculation of the loss function and the gradient back propagation, and the optimal text generation model is stored.
10. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the prediction unit include:
1) Receiving text query input of a user;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection unit with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating through the text generation model a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
CN202211704106.0A 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction Pending CN116108175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211704106.0A CN116108175A (en) 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211704106.0A CN116108175A (en) 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction

Publications (1)

Publication Number Publication Date
CN116108175A true CN116108175A (en) 2023-05-12

Family

ID=86263160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211704106.0A Pending CN116108175A (en) 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction

Country Status (1)

Country Link
CN (1) CN116108175A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131070A (en) * 2023-10-27 2023-11-28 之江实验室 Self-adaptive rule-guided large language model generation SQL system
CN117131070B (en) * 2023-10-27 2024-02-09 之江实验室 Self-adaptive rule-guided large language model generation SQL system

Similar Documents

Publication Publication Date Title
CN111310438B (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN101286161B (en) Intelligent Chinese request-answering system based on concept
CN110287482B (en) Semi-automatic participle corpus labeling training device
CN101866337A (en) Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN108717433A (en) A kind of construction of knowledge base method and device of programming-oriented field question answering system
CN101178705A (en) Free-running speech comprehend method and man-machine interactive intelligent system
CN114020862A (en) Retrieval type intelligent question-answering system and method for coal mine safety regulations
CN111274267A (en) Database query method and device and computer readable storage medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN104331449A (en) Method and device for determining similarity between inquiry sentence and webpage, terminal and server
CN109447266A (en) A kind of agricultural science and technology service intelligent sorting method based on big data
CN110888943A (en) Method and system for auxiliary generation of court referee document based on micro-template
CN114281968B (en) Model training and corpus generation method, device, equipment and storage medium
CN113032418B (en) Method for converting complex natural language query into SQL (structured query language) based on tree model
CN115062070A (en) Question and answer based text table data query method
CN101933017B (en) Document search device, document search system, and document search method
CN113065349A (en) Named entity recognition method based on conditional random field
CN116108175A (en) Language conversion method and system based on semantic analysis and data construction
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN116467437A (en) Automatic flow modeling method for complex scene description
CN116644740A (en) Dictionary automatic extraction method and system based on single text term solidification degree
CN114116779A (en) Deep learning-based power grid regulation and control field information retrieval method, system and medium
CN113449038A (en) Mine intelligent question-answering system and method based on self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination