CN116108175A - Language conversion method and system based on semantic analysis and data construction - Google Patents

Language conversion method and system based on semantic analysis and data construction

Info

Publication number
CN116108175A
CN116108175A
Authority
CN
China
Prior art keywords
text
model
training
column
sql
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211704106.0A
Other languages
Chinese (zh)
Inventor
沈然
孙钢
沈皓
章江铭
金良峰
王庆娟
倪琳娜
吴慧
江晗
姜伟昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Zhejiang University ZJU
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202211704106.0A priority Critical patent/CN116108175A/en
Publication of CN116108175A publication Critical patent/CN116108175A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/242: Query formulation
    • G06F16/2433: Query languages
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F40/35: Discourse or dialogue representation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a language conversion method and system based on semantic analysis and data construction. The technical scheme adopted by the invention comprises the following steps: a table selection task: the table selection task is cast as a text classification model that predicts the tables in the database corresponding to the text; a column selection task: the column selection task is cast as a sequence labeling model that predicts the columns in the database corresponding to the text; an SQL generation task: the SQL generation task is cast as a text generation task, and the optimal text generation model is saved to generate the SQL query statement; prediction: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement. The invention adopts a Pipeline-style text2sql technique and adds extra related information before each model is trained, which improves model accuracy; optimization is achieved by optimizing each model individually, making the method more convenient and effective.

Description

Language conversion method and system based on semantic analysis and data construction
Technical Field
The invention relates to the field of semantic analysis and deep learning, in particular to a language conversion method and a language conversion system based on semantic analysis and data construction.
Background
The database stores a large amount of high-value data. When a user wants to query the database, the user must write a structured query language (SQL) statement and then interact with the database, which is inconvenient for ordinary users outside professional fields; moreover, for complex query conditions, hand-written SQL is error-prone, leading to problems such as insufficient data-mining depth and weak data value-addition and presentation capability.
Text2SQL is an artificial intelligence technique that converts natural language into the SQL query language. At present there are two main approaches: rule-based filling of fixed slots, and end-to-end learning based on deep learning. In practice, however, neither achieves high accuracy, and their industrial practicality is limited.
Disclosure of Invention
In view of this, the invention provides a language conversion method and system based on semantic analysis and data construction, which use semantic analysis technology and deep learning to improve model accuracy so as to meet practical industrial requirements for SQL querying, improve the efficiency of user data mining, activate the value of data and promote the realization of data value.
For this purpose, the invention adopts a technical scheme: a language conversion method based on semantic parsing and data construction, comprising:
1) Data preparation: collecting text2sql question-answer pair data and the organization and structure of a database;
2) Table selection task: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) Column selection task: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) SQL generation task: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the optimal text generation model is saved to generate the SQL query statement;
5) Prediction: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
According to the invention, the traditional text2SQL task is decomposed into three simple subtasks, table selection, column selection and SQL generation: table selection is cast as a text classification problem, column selection as a sequence labeling problem, and SQL generation as a text generation problem. Each subtask is solved by its own NLP model, and the data formats of the different tasks are constructed from the existing data, so that the accuracy of each model is improved and automatic, efficient and accurate SQL retrieval is provided for users.
Further, the specific contents of the data preparation are as follows:
collecting text2SQL question-answer pair data, wherein each sample comprises an input text and a corresponding SQL query sentence;
the organization and structure of the database are collected, including database table names, column value types and the primary-key/foreign-key relationships between tables.
Further, the specific contents of the table selection task include:
constructing the training set and validation set required by the text classification model from the text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
obtaining the text classification model through repeated training and back propagation, and saving the optimal text classification model.
Further, the specific contents of the column selection task include:
constructing the training set and validation set of the sequence labeling model from the text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
the loss function is then computed against the gold labels; the sequence labeling model is obtained through repeated training and back propagation, and the optimal sequence labeling model is saved.
Further, the specific contents of the SQL generation task include:
constructing the training set and validation set of the text generation model from the text2sql question-answer pairs;
parsing the text question and the sql in text2sql to construct the input and label formats: the text is followed by the table names and column names, with sentences separated from table names by special separators, and identifiers corresponding to the table names and column names are added; in the label, the table names and column names in the SQL query statement corresponding to the text are replaced by the corresponding identifiers;
the corresponding text is generated directly by the pre-training model T5 + finetune, features being extracted from the input text and the output text generated token by token; after multiple iterations of loss computation and gradient back-propagation, the text generation model is trained and the optimal text generation model is saved.
Further, the specific contents of prediction include:
1) Receiving the user's text query input;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection step with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
The invention adopts another technical scheme that: a language translation system based on semantic parsing and data construction, comprising:
1) A data preparation unit: collecting text2sql question-answer pair data and the organization and structure of a database;
2) A table selection unit: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) A column selection unit: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) An SQL generation unit: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the optimal text generation model is saved to generate the SQL query statement;
5) A prediction unit: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
The invention has the following beneficial effects: the related information contained in the text query is parsed by semantic analysis; a dedicated text classification input is constructed from all table and column information, and the text classification model matches the query to the corresponding table in the database; the input of the sequence labeling model is constructed from the result of table selection, and sequence labeling matches the corresponding columns of the table; the table and column information selected in the first two steps is combined with the query text to construct the input of the text generation model, which generates the standard SQL query statement. Compared with end-to-end text2sql techniques, the Pipeline-style text2sql technique adopted here adds extra related information before each model is trained, which improves model accuracy; optimization is achieved by optimizing each model individually, making the method more convenient and effective.
Drawings
FIG. 1 is a schematic diagram of a text classification flow in a language conversion method based on semantic parsing and data construction according to the present invention;
FIG. 2 is a schematic diagram of a sequence labeling flow in a language conversion method based on semantic parsing and data construction according to the present invention;
FIG. 3 is a schematic diagram of the flow of SQL query statement generation in the language translation method based on semantic parsing and data construction of the present invention;
FIG. 4 is a schematic diagram of a complete predictive process for text-to-SQL query statement of the present invention.
Detailed Description
The invention is further illustrated and described in connection with the drawings and the detailed description which follow, but the invention is not limited to these examples.
Example 1
The embodiment provides a text2sql method based on semantic analysis and data construction, which comprises the following steps:
step 1, data preparation
Collect the text2SQL question-answer pair data set; each sample contains an input text and the corresponding SQL query statement. The specific format of each sample is as follows: { "text": "the monitoring institution of the line to which the end point of the firewood line belongs contacts fax", "sql": "SELECT T3.Fax_NO FROM ACLINEEND_basic AS T1 JOIN action_basic AS T2 ON T1.Line_id = T2.ID JOIN dcc_basic AS T3 ON T2.Monitor_org_id = T3.ID WHERE T1.Name LIKE '%end point of the firewood line%'" }.
Collect the database information, including the Chinese and English names of the tables stored in the database, the column names and column value types in each table, whether each column is a primary key, and whether the column is linked by a foreign key to another table.
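For illustration only, a minimal sketch of how the collected question-answer pairs and database information might be held in memory; the field names and this Python representation are assumptions of this description, not a format mandated by the invention:

# Illustrative in-memory layout of one question-answer sample and of the collected schema.
sample = {
    "text": "the monitoring institution of the line to which the end point of the firewood line belongs contacts fax",
    "sql": "SELECT T3.Fax_NO FROM ACLINEEND_basic AS T1 ... WHERE T1.Name LIKE '%end point of the firewood line%'",
}

schema = {
    "ACLINEEND_basic": {                        # English table name
        "cn_name": "<Chinese table name>",      # Chinese table name collected from the database
        "columns": {
            "Name":    {"type": "text", "primary_key": False, "foreign_key": None},
            "Line_id": {"type": "text", "primary_key": False,
                        "foreign_key": "<other table>.ID"},   # foreign-key link to another table
        },
    },
}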
Step 2, table selection task
The table selection task is cast as an NLP text classification task: essentially a binary classification that decides, from the question, whether a table is the desired target table.
As shown in fig. 1, firstly, constructing a training set and a verification set required by a text classification model for a data set according to text2sql questions and answers;
and carrying out sentence analysis on text in the question-answer pair, and carrying out word segmentation on the text by utilizing a word segmentation technology, wherein the text corresponds to all table names and column names in a database. And judging the table name and the column name corresponding to each text, and obtaining whether the column is a main key or an external key according to the database information. The standard sql statement is marked, the standard table name is behind the "from", the table name analyzed from text corresponds to the table name analyzed by sql, the consistency is marked as 1, and otherwise, the consistency is marked as 0.
Format of the text classification training and validation samples: { "input": "text extra0 table name extra1 column name column type column hit whether the column is a primary key extra1 ...", "label": 0 or 1 }, where extra0 is the table-name separator and extra1 is the column-name separator.
Specific examples of text classification models are as follows: { "input": "monitoring mechanism of the line to which the plateau three-wire endpoint belongs contacts fax extra0 transformer winding power characteristic value history data table extra1 measurement type text main key extra1 id text main key extra1 date and date main key", "labels":0}.
When training data is scarce, it must be expanded through data augmentation. The invention performs data augmentation by randomly replacing keywords in the question: the condition value in the sql is identified, associated with the corresponding value in the database, and replaced by other values from the same column, thereby expanding the data volume.
The augmented data set is split 9:1 into training and validation sets.
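A minimal sketch of the value-replacement augmentation and the 9:1 split described above, assuming the condition value appears literally in both the text and the sql; the function names and the LIKE-pattern heuristic are illustrative assumptions:

import random
import re

def augment_by_value_swap(sample, column_values, n_copies=3):
    """Create extra samples by swapping the sql condition value (and its mention in the
    question) with other values drawn from the same database column."""
    match = re.search(r"'%([^%]+)%'", sample["sql"])   # simplified: pull the LIKE '%...%' literal
    if not match:
        return []
    old_value = match.group(1)
    new_values = random.sample(column_values, min(n_copies, len(column_values)))
    return [{"text": sample["text"].replace(old_value, v),
             "sql": sample["sql"].replace(old_value, v)} for v in new_values]

def split_9_1(samples):
    """Split the augmented data set 9:1 into training and validation sets."""
    cut = int(len(samples) * 0.9)
    return samples[:cut], samples[cut:]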
The text classification model is built with pre-trained BERT + finetune: the input is encoded by the pre-trained model, the [CLS] vector is taken as the representation vector of the input, and a classification layer over that representation vector decides whether the table name after extra0 is the table name required by the sql query statement corresponding to the text. After multiple iterations of loss computation and gradient back-propagation, the text classification model is trained and the optimal text classification model is saved.
Through the above steps a text classification model is trained; its function is to pick out the target table names from the input.
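A minimal sketch of the BERT + finetune table-selection classifier described above, assuming PyTorch and the Hugging Face transformers library; the checkpoint name "bert-base-chinese", the learning rate and the single-sample training step are illustrative assumptions:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class TableSelector(nn.Module):
    """Binary classifier: is the table appended after the question the target table of the sql?"""
    def __init__(self, checkpoint="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]          # vector of the first [CLS] character
        return self.classifier(cls_vec)                # logits over {not target, target}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = TableSelector()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(text_with_schema, label):
    """One finetuning step: encode, predict, compute the loss against the gold label, back-propagate."""
    enc = tokenizer(text_with_schema, return_tensors="pt", truncation=True, max_length=512)
    logits = model(enc["input_ids"], enc["attention_mask"])
    loss = loss_fn(logits, torch.tensor([label]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()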
Step 3, column selection task
The column selection task is cast as a sequence labeling task that selects the column names involved in the input text.
As shown in FIG. 2, a training set and a verification set of the sequence annotation model are constructed for the data set according to text2sql questions and answers.
As in the text classification task of step 2, the text and sql of each question-answer pair are semantically parsed to obtain the input information. The "input" of the sequence labeling data is consistent with the "input" of the text classification data; the difference lies in the labels. Construct the labels: each field in the "input" corresponds to one label; when a column hit occurs, the label of the column-name separator preceding that column is "B-C", otherwise the column-name separator is labeled "B-N", and labels at all other positions are uniformly "O". The structure is as follows:
The input format of the sequence labeling model: { "text": "text, table name, column name, column type, column hit, whether the column is a primary key, whether the column is a foreign key" }, where a table-name separator precedes each table name and a column-name separator precedes each column name, and each column also carries its column information, such as the column type and whether it is a primary key or a foreign key.
Examples of samples of the sequence annotation model training set are as follows: { "input" [ "show 2 months in 2020," integrated power of Tianjin grid "," "," grid total month power history data table "," "," reported power value "," column hit "," value "," "," object id "," text "," main "]," labels "[" O "," O "," O "," B-C "," O "," O "," O "," B-N "," O "," O ] }.
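A minimal sketch of the label construction rule above ("B-C" for the column-name separator of a hit column, "B-N" otherwise, "O" elsewhere); the separator token "[unused2]" merely stands in for the column-name separator, whose concrete symbol is not reproduced here, and is purely an assumption:

def build_labels(fields, hit_columns, col_sep="[unused2]"):
    """fields: the segmented pieces of the 'input'; a column-name separator precedes each column name.
    hit_columns: the set of column names that actually appear in the gold sql."""
    labels = []
    for i, field in enumerate(fields):
        if field == col_sep and i + 1 < len(fields):
            labels.append("B-C" if fields[i + 1] in hit_columns else "B-N")
        else:
            labels.append("O")
    return labels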
The data set is augmented in the same way, and the augmented data set is split 9:1 into training and validation sets.
The sequence labeling model is likewise built with pre-trained BERT + finetune: the input is encoded by the pre-trained BERT model, the vector at each column-name separator is taken as the representation vector of that column name, and a multi-class layer over all these representation vectors decides whether each column name is required by the sql query statement corresponding to the text. Through loss computation and gradient back-propagation over multiple iterations, the sequence labeling model is trained; the FGM adversarial training method is used during training to improve the robustness of the model, and the optimal sequence labeling model is saved.
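A minimal sketch of the column-selection sequence labeling head: BERT encodes the input, the hidden vector at each column-name separator position is gathered, and a multi-class layer labels it. The checkpoint name and the way separator positions are passed in are assumptions:

import torch
import torch.nn as nn
from transformers import BertModel

class ColumnSelector(nn.Module):
    """Multi-class head (B-C / B-N / O) over the encoding of each column-name separator."""
    def __init__(self, checkpoint="bert-base-chinese", num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, sep_positions):
        # sep_positions: (batch, n_separators) token indices of the column-name separators.
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(1)
        sep_vecs = hidden[batch_idx, sep_positions]    # (batch, n_separators, hidden)
        return self.head(sep_vecs)                     # logits per separator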
Through the steps, a sequence annotation model can be trained, and the sequence annotation model is used for selecting the column names in input.
Step 4, SQL generation task
The SQL generation task is cast as a text generation model that generates the corresponding SQL query statement from the input text.
As shown in fig. 3, the training set and validation set required by the text generation model are first constructed from the text2sql question-answer pair data set. The input format of the text generation model dataset is { "input": "text extra50 extra54 table name extra51 extra0 column name ... extra52 foreign keys extra53 text" }, where extra50 is the table delimiter, extra51 the column delimiter, extra52 the foreign-key delimiter and extra53 the sentence delimiter (as in the example below); extra54 is the identifier of the table name that follows it, and extra0 is the identifier of the column that follows it. The format of the corresponding output is { "sql": the standard sql query statement, in which the table names and column names are replaced by the corresponding identifiers, because generating identifiers works better than generating table or column names directly }.
Examples of datasets for the text generation model are as follows:
{ input ": monitoring mechanism of line to which" maple white 4dq8 line end point belongs contacts fax extra50 extra54 regulation center basic information extra51 extra0 mechanism id main key extra1 contacts fax extra50 extra55 alternating line basic information extra51 extra2 monitoring mechanism extra51 extra3 id main key extra50 extra56 alternating line basic information extra51extra4 line end name extra52 extra2 extra0 extra52 extra5 extra3 extra53 maple white 4dq8 line end point belongs line monitoring mechanism contacts fax,
"sql": "select extra54@ extra1 from extra56 join extra55 on extra56@ extra5 = extra55@ extra3 join extra54 on extra55@ extra2 = extra54@ extra0 where extra56@ extra4 like'% maple white 4dq8 line end point%" }.
While the data is being constructed, keywords in the question are randomly replaced and column information in the text is randomly added or deleted, which serves as data augmentation. The augmented data set is split 9:1 into training and validation sets.
The SQL statement is generated mainly by a large pre-trained T5 model with a Seq2Seq structure: features are extracted from the input, and the sql output text is generated token by token. After multiple iterations of loss computation and gradient back-propagation, the text generation model is trained and the optimal text generation model is saved.
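A minimal sketch of the T5 + finetune generation model, assuming PyTorch and the Hugging Face transformers library; "t5-base" is only a placeholder checkpoint (a Chinese T5 checkpoint would be substituted in practice), and the lengths and learning rate are assumptions:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "t5-base"                      # placeholder; a Chinese T5 checkpoint would be used in practice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(input_text, target_sql):
    """One finetuning step: the loss is the cross-entropy over the target sql tokens."""
    enc = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(target_sql, return_tensors="pt", truncation=True, max_length=256).input_ids
    out = model(**enc, labels=labels)
    out.loss.backward()                     # gradient back-propagation
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

def generate_sql(input_text):
    """Decode the identifier-annotated sql for one input text."""
    enc = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    ids = model.generate(**enc, max_length=256, num_beams=4)   # beam-search decoding
    return tokenizer.decode(ids[0], skip_special_tokens=True)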
To reduce the learning difficulty of the model, a copy mechanism is used, so that required fields can be selected directly from the input text during generation. A beam-search method is used to avoid generating illegal sql statements. Meanwhile, the SWA and FGM adversarial training methods are used to train the model, improving its generalization and accuracy.
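A minimal sketch of the FGM adversarial-training step mentioned above: the word-embedding weights are perturbed along the gradient direction, a second forward/backward pass is run on the perturbed embeddings, and the weights are then restored. The epsilon value and the embedding parameter name are assumptions; SWA can be layered on top with torch.optim.swa_utils.

import torch

class FGM:
    """Fast Gradient Method adversarial training on the word-embedding weights."""
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)   # step along the gradient

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical use inside one training step:
#   loss.backward(); fgm.attack()
#   loss_adv = model(**batch).loss; loss_adv.backward()
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()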
Through the steps, a text generation model can be trained, and the function is to convert input text into an sql query statement with an identifier.
Step 5, prediction
As shown in fig. 4, the user's text information x is received first. In this example, x = "count the number of metering communication module archives in each state for unit 001".
First, table selection: the received text x is combined with each table in the database and processed into the input format required by the text classification model: { "input": "statistics unit 001 number of files for each state metering communication module extra0 metering communication module archive table extra1 unit code text primary key column hit extra1 unit name text column hit extra1 state name text column hit extra1 total number of text column hits" }.
Through the text classification model trained in the step 2, table names contained in the text information x, namely a metering communication module archive table, can be selected.
Then column selection: the text information x and the selected table name are combined and processed into the input format required by the sequence labeling model, namely { "input": "statistics of the number of the metering communication module files in each state of the unit 001" the number of the metering communication module files unit code text column hits area hierarchy text unit name text column hits state name text column hits total number of text column hits" }.
Through the trained sequence labeling model in the step 3, column names contained in the text information x, namely 'unit codes, unit names, state names and total numbers', can be selected.
Processing the text information x and the selected table names and column names into a format required by the text generation model input, namely 'input_text': the total number extra50 extra54 metering communication module archives of each state of the statistical unit 001 is equal to the total number extra53 of extra51 extra2 state names extra51 extra3 of the extra51 extra1 unit codes extra51 extra0 unit names extra51 extra1 unit codes, and the integral electric quantity of the Tianjin power grid is shown as 2 months in 2020.
The corresponding sql query statement is obtained through the text generation model trained in step 4. The output is: "select extra54@extra0 extra54@extra1 extra54@extra2 extra54@extra3 from extra54 where extra54@extra1 = 'value'".
The table identifiers and column identifiers in the obtained sql query statement are replaced with the corresponding table names and column names to obtain the standard sql query statement, namely "select name, zhuangtai, number from table where daima = 'Tianjin electric network'".
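A minimal sketch of this final post-processing step: the identifiers emitted by the generation model are mapped back to the real table and column names recorded when the input was built. The mapping shown and the handling of '@' (a table identifier joined to a column identifier) are illustrative assumptions; here the table prefix is dropped, matching the single-table example above.

import re

def restore_identifiers(sql_with_ids, id_to_name):
    """Replace identifiers such as 'extra54' or 'extra54@extra1' with real table/column names."""
    # 'table_id@column_id' pairs become the bare column name (single-table query).
    sql = re.sub(r"extra\d+@(extra\d+)",
                 lambda m: id_to_name[m.group(1)], sql_with_ids)
    # Remaining bare identifiers (e.g. after FROM) are table names.
    for ident, name in id_to_name.items():
        sql = re.sub(rf"\b{ident}\b", name, sql)
    return sql

# Illustrative mapping recorded while the generation-model input was constructed.
id_to_name = {"extra54": "table", "extra0": "name", "extra1": "daima",
              "extra2": "zhuangtai", "extra3": "number"}
print(restore_identifiers(
    "select extra54@extra0 extra54@extra2 extra54@extra3 "
    "from extra54 where extra54@extra1 = 'Tianjin electric network'", id_to_name))
# -> select name zhuangtai number from table where daima = 'Tianjin electric network'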
According to the invention, text2sql is decomposed into several simple subtasks, and overall accuracy is improved by improving the accuracy of each subtask, making full use of the semantic understanding capability of large pre-trained models; the data of each model is augmented to improve the model's robustness to interference; and different training tricks such as adversarial training and SWA are used to improve the accuracy of model prediction.
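Putting the steps of this example together, a minimal sketch of the prediction pipeline; the three model wrappers are assumed callables with the interfaces shown (not a fixed API of the invention), and restore_identifiers is the post-processing helper sketched above:

def text_to_sql(question, schema, select_tables, select_columns, generate_sql_with_ids):
    """Chain the three trained models: table selection, column selection, SQL generation."""
    tables = select_tables(question, schema)                       # step 2: text classification
    columns = select_columns(question, tables, schema)             # step 3: sequence labeling
    sql_with_ids, id_to_name = generate_sql_with_ids(              # step 4: text generation
        question, tables, columns, schema)
    return restore_identifiers(sql_with_ids, id_to_name)           # step 5: identifier replacement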
Example 2
The present embodiment provides a language conversion system based on semantic parsing and data construction, which includes:
1) A data preparation unit: collecting text2sql question-answer pair data and the organization and structure of a database;
2) A table selection unit: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) A column selection unit: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) An SQL generation unit: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the SQL query statement is generated;
5) A prediction unit: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
The table selection unit comprises the following specific contents:
training sets and verification sets required by constructing a text classification model according to text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the Google pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
and obtaining a text classification model through multiple times of training and back propagation, and storing the optimal text classification model.
The specific contents of the column selecting unit include:
constructing a training set and a verification set of a sequence annotation model by using text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
and then, calculating a loss function with the standard label, obtaining a sequence labeling model through multiple times of training and back propagation, and storing the optimal sequence labeling model.
The specific content of the SQL generating unit comprises:
constructing a training set and a verification set of a text generation model by using text2sql question-answer pairs;
the text problem and sql in text2sql are analyzed to construct input and label formats; the text is followed by a table name and a column name, wherein sentences are separated from the table name by special separators, and identifiers corresponding to the table name and the column name are added; the tag replaces the table names and column names in the SQL query sentences corresponding to the text with the corresponding identifiers;
directly generating corresponding texts through a pre-training model T5+finetune, and sequentially generating output texts through extracting features from the input texts; after multiple iterations, the text generation model is trained through the calculation of the loss function and the gradient back propagation, and the optimal text generation model is stored.
The specific content of the prediction unit comprises:
1) Receiving text query input of a user;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection unit with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (10)

1. The language conversion method based on semantic analysis and data construction is characterized by comprising the following steps:
1) Data preparation: collecting text2sql question-answer pair data and the organization and structure of a database;
2) Table selection task: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) Column selection task: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) SQL generation task: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, and the optimal text generation model is saved to generate the SQL query statement;
5) Prediction: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
2. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the specific contents of the table selection task include:
training sets and verification sets required by constructing a text classification model according to text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
and obtaining a text classification model through multiple times of training and back propagation, and storing the optimal text classification model.
3. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the specific contents of the column selecting task include:
constructing a training set and a verification set of a sequence annotation model by using text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
and then, calculating a loss function with the standard label, obtaining a sequence labeling model through multiple times of training and back propagation, and storing the optimal sequence labeling model.
4. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the specific contents of the SQL generation task include:
constructing a training set and a verification set of a text generation model by using text2sql question-answer pairs;
the text problem and sql in text2sql are analyzed to construct input and label formats; the text is followed by a table name and a column name, wherein sentences are separated from the table name by special separators, and identifiers corresponding to the table name and the column name are added; the tag replaces the table names and column names in the SQL query sentences corresponding to the text with the corresponding identifiers;
directly generating corresponding texts through a pre-training model T5+finetune, and sequentially generating output texts through extracting features from the input texts; after multiple iterations, the text generation model is trained through the calculation of the loss function and the gradient back propagation, and the optimal text generation model is stored.
5. The language conversion method based on semantic parsing and data construction according to claim 1, wherein the predicted concrete contents include:
1) Receiving text query input of a user;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection step with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating through the text generation model a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
6. A language translation system based on semantic parsing and data construction, comprising:
1) A data preparation unit: collecting text2sql question-answer pair data and the organization and structure of a database;
2) A table selection unit: the table selection task is cast as a text classification model; a training set and a validation set are constructed from the text2sql question-answer data, the text classification model is trained with the pre-training model BERT + finetune, the optimal text classification model is saved, and it predicts the tables in the database corresponding to the text;
3) A column selection unit: the column selection task is cast as a sequence labeling model; a training set and a validation set for the sequence labeling model are built from the text2sql question-answer data; the sequence labeling model is trained with the pre-training model BERT + finetune, the optimal sequence labeling model is saved, and it predicts the columns in the database corresponding to the text;
4) An SQL generation unit: the SQL generation task is cast as a text generation task; a training set and a validation set for the text generation task are built from the text2sql question-answer data, the text generation model is trained with the pre-training model T5 + finetune, the optimal text generation model is saved, and the SQL query statement is generated;
5) A prediction unit: the three trained models are assembled into a pipeline structure; text data input by a user passes through the three trained models in turn to generate the corresponding standard SQL query statement.
7. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the table selection unit include:
training sets and verification sets required by constructing a text classification model according to text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the input and output structures of the training set and validation set;
constructing the text classification model with the pre-training model BERT plus finetuning: the constructed training data are encoded by the large pre-trained BERT model, the vector of the first character [CLS] of each sample is taken as the data representation and passed into a binary classification layer for prediction, and the loss function is then computed against the gold label;
and obtaining a text classification model through multiple times of training and back propagation, and storing the optimal text classification model.
8. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the column selection unit include:
constructing a training set and a verification set of a sequence annotation model by using text2sql question-answer pairs;
parsing the text question and the sql in text2sql to obtain the values and the schema of the columns they belong to, where the schema includes whether the column is a primary key and whether the value's column is hit; constructing the labels: only when a column hit occurs is the column-name separator in front of that column labeled "B-C"; otherwise it is labeled "B-N"; labels at all other positions are uniformly marked "O"; constructing the input and output structures;
constructing the sequence labeling model for column selection with the pre-training model BERT + finetune: the pre-trained BERT yields the encoding vector of each column-name separator, and these vectors are then multi-classified;
and then, calculating a loss function with the standard label, obtaining a sequence labeling model through multiple times of training and back propagation, and storing the optimal sequence labeling model.
9. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the SQL generating unit include:
constructing a training set and a verification set of a text generation model by using text2sql question-answer pairs;
the text problem and sql in text2sql are analyzed to construct input and label formats; the text is followed by a table name and a column name, wherein sentences are separated from the table name by special separators, and identifiers corresponding to the table name and the column name are added; the tag replaces the table names and column names in the SQL query sentences corresponding to the text with the corresponding identifiers;
directly generating corresponding texts through a pre-training model T5+finetune, and sequentially generating output texts through extracting features from the input texts; after multiple iterations, the text generation model is trained through the calculation of the loss function and the gradient back propagation, and the optimal text generation model is stored.
10. The language conversion system based on semantic parsing and data construction according to claim 6, wherein the specific contents of the prediction unit include:
1) Receiving text query input of a user;
2) Table selection: converting the input into the input format of the text classification model, and selecting the tables related to the text with the saved text classification model;
3) Column selection: combining the tables chosen by the table selection unit with the input text to build the input format of the sequence labeling model, and selecting the columns related to the text with the pre-trained sequence labeling model;
4) SQL generation: combining the selected tables and columns with the input text, converting them into the input format of the text generation model, and generating through the text generation model a standard SQL query statement with identifiers;
5) SQL output: converting the table and column identifiers in the generated standard SQL query statement into the corresponding table names and column names, and outputting the standard SQL.
CN202211704106.0A 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction Pending CN116108175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211704106.0A CN116108175A (en) 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211704106.0A CN116108175A (en) 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction

Publications (1)

Publication Number Publication Date
CN116108175A true CN116108175A (en) 2023-05-12

Family

ID=86263160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211704106.0A Pending CN116108175A (en) 2022-12-29 2022-12-29 Language conversion method and system based on semantic analysis and data construction

Country Status (1)

Country Link
CN (1) CN116108175A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131070A (en) * 2023-10-27 2023-11-28 之江实验室 Self-adaptive rule-guided large language model generation SQL system
CN117131070B (en) * 2023-10-27 2024-02-09 之江实验室 Self-adaptive rule-guided large language model generation SQL system

Similar Documents

Publication Publication Date Title
CN111310438B (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN101286161B (en) Intelligent Chinese request-answering system based on concept
CN110287482B (en) Semi-automatic participle corpus labeling training device
CN101866337A (en) Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN108717433A (en) A kind of construction of knowledge base method and device of programming-oriented field question answering system
CN101178705A (en) Free-running speech comprehend method and man-machine interactive intelligent system
CN114020862A (en) Retrieval type intelligent question-answering system and method for coal mine safety regulations
CN111274267A (en) Database query method and device and computer readable storage medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN104331449A (en) Method and device for determining similarity between inquiry sentence and webpage, terminal and server
CN109447266A (en) A kind of agricultural science and technology service intelligent sorting method based on big data
CN110888943A (en) Method and system for auxiliary generation of court referee document based on micro-template
CN114281968B (en) Model training and corpus generation method, device, equipment and storage medium
CN113032418B (en) Method for converting complex natural language query into SQL (structured query language) based on tree model
CN115062070A (en) Question and answer based text table data query method
CN101933017B (en) Document search device, document search system, and document search method
CN113065349A (en) Named entity recognition method based on conditional random field
CN116108175A (en) Language conversion method and system based on semantic analysis and data construction
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN116467437A (en) Automatic flow modeling method for complex scene description
CN116644740A (en) Dictionary automatic extraction method and system based on single text term solidification degree
CN114116779A (en) Deep learning-based power grid regulation and control field information retrieval method, system and medium
CN113449038A (en) Mine intelligent question-answering system and method based on self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination