CN112559556B - Language model pre-training method and system for table schema parsing and sequence masking
- Publication number
- CN112559556B (application number CN202110210906.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- natural language
- column
- training
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses a language model pre-training method and system based on table schema parsing and sequence masking, belonging to the field of natural language processing. The method comprises the following steps: (1) given a natural language question, an association table and a target SQL sequence, search each column of the table for the unit value with the highest degree of overlap with the question; (2) splice the question, the column names, the column types and the selected unit values into one long sequence, add separators and a start marker, and apply mask processing; (3) jointly train a language model on a masked sequence prediction task, a table schema parsing task and a condition number prediction task; (4) optimize the objective function with a gradient descent algorithm; (5) after training, save the model weight parameters; the model can then be added directly to an existing Text2SQL model to provide the initial encoding of natural language questions and database schemas.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a language model pre-training method and system based on table schema parsing and sequence masking.
Background
Natural language processing (hereinafter NLP), a branch of artificial intelligence, has developed rapidly in recent years. Since Google released the pre-trained language model BERT at the end of 2018, the two-stage paradigm of pre-training followed by fine-tuning has become the standard approach in NLP. It has greatly improved machines' ability to understand human language, and parsing natural language to complete complex downstream tasks is no longer out of reach. Reports from the Chinese language understanding evaluation benchmark CLUE show that many models now surpass human evaluation results on traditional NLP tasks such as text classification, named entity recognition and even reading comprehension, to the extent that these tasks can be regarded as largely solved by current computer technology.
Meanwhile, the rise of big data has made structured data and databases ever larger. In the past, a user who wanted to query database content had to write statements in the structured query language SQL and then interact with the database, which is inconvenient for ordinary users without a computer science background. How to query databases freely through natural language has therefore become a new research hotspot. Text to SQL (hereinafter Text2SQL) is a subtask of semantic parsing in natural language understanding: it aims to convert a user's natural language question directly into the corresponding SQL and then complete the query. Its purpose can be summarized simply as breaking down the barrier between people and structured databases.
For example, consider the structured table shown in Table 1. A user asks a question about the table: "What are the weekly price changes of Renren and Sina?" The Text2SQL system generates the SQL query "SELECT weekly_change FROM Table 1 WHERE name = 'Sina' OR name = 'Renren'" from the user question and the table contents, executes the query automatically and returns the result: "-3.87, -4.09".
Table 1 structured table example
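To make the example concrete, the following sketch runs a query of the same form with SQLite. The table and column names (stocks, name, weekly_change) and the extra row are illustrative assumptions, since Table 1 itself appears only as an image in the original; the text gives only the two returned values, -3.87 and -4.09, and does not say which company each belongs to, so the assignment below is arbitrary.

```python
# Illustrative reconstruction of the Table 1 example; cell values other than the
# two returned in the text (-3.87 and -4.09) are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (name TEXT, weekly_change REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?)",
    [("Sina", -3.87), ("Renren", -4.09), ("Baidu", 1.25)],
)

# SQL produced by a Text2SQL system for the question
# "What are the weekly price changes of Renren and Sina?"
rows = conn.execute(
    "SELECT weekly_change FROM stocks WHERE name = 'Sina' OR name = 'Renren'"
).fetchall()
print([r[0] for r in rows])  # [-3.87, -4.09]
```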
In 2018, Yale University released Spider, a cross-domain, complex Text2SQL dataset that includes nested queries, cross-table queries and complex SQL constructs such as "GROUP BY", "ORDER BY" and "HAVING", with a task difficulty close to real application scenarios. Although the top-ranked models on the public leaderboard generally strengthen their encoders with pre-trained language models such as BERT, the best accuracy on the leaderboard is still only about 70%, leaving considerable room for improvement.
Current pre-trained language models such as BERT and RoBERTa are generally pre-trained on general-purpose text corpora with a masked language model prediction task (MLM), which differs significantly from the downstream Text2SQL scenario.
Disclosure of Invention
To solve the above technical problems, the invention provides a language model pre-training method and system based on table schema parsing and sequence masking. The proposed pre-training approach jointly encodes natural language text and structured tables, fully combining the two. The language model is first pre-trained with structured table data and table parsing tasks to obtain TCBERT, and is then fine-tuned on the Text2SQL task. This keeps the pre-training and fine-tuning scenarios consistent and allows the language model's ability to jointly encode natural language questions and structured tables to be exploited more fully.
By learning SQL-parsing-related tasks, the invention effectively helps solve the NLP semantic parsing subtask Text2SQL. Moreover, thanks to the strong self-encoding capability and semantic generalization of the language model, the system can be quickly migrated to database scenarios in different domains, significantly alleviating the shortage of question-SQL labeled data.
To achieve these objectives, the invention adopts the following technical scheme:
One objective of the present invention is to provide a language model pre-training method based on table schema parsing and sequence masking, comprising the following steps:
S1: given a natural language question, an association table and a target SQL sequence, search each column of the association table for the unit value with the highest degree of overlap with the question;
S2: form the long sequence in the form "natural language question; for each column of the association table, its column name, column type and the unit value with the highest degree of overlap with the question", splicing all columns in order, adding a start marker at the beginning of the long sequence and separating the per-column segments with separators;
S3: apply random mask processing to the natural language question in the long sequence and to the unit values from the association table, replacing 10% of randomly selected characters with mask characters;
S4: build a language model, take the long sequences masked in step S3 as the pre-training dataset, and jointly train the language model on a masked sequence prediction task, a table schema parsing task and a condition number prediction task to obtain the TCBERT model; during joint pre-training, the sum of the loss functions of the three tasks is used as the total pre-training loss, and the objective function is optimized with a gradient descent algorithm; after pre-training, save the model weight parameters;
S5: for the Text2SQL task, use the pre-trained TCBERT model as the initialization encoder in a Text2SQL model to produce the initial encoding of the natural language question and the database schema.
Another object of the present invention is to provide a language model pre-training system based on table schema parsing and sequence masking according to the above method, comprising:
a sequence representation module, which extracts from the association table the unit values with the highest degree of overlap with the natural language question, converts the structured association table into a text sequence, splices it with the given natural language question and outputs the masked long sequence;
an encoder module, which contains the language model, converts the masked long sequence into embedding vectors and deep-encodes them;
a loss function calculation module, which compares the output of the encoder module with the target SQL sequence, calculates the loss values of the masked sequence prediction task, the table schema parsing task and the condition number prediction task, sums the three loss values and optimizes the objective function with a gradient descent strategy.
Compared with the prior art, the invention has the following advantages:
1. The invention designs sequence masking, schema parsing and condition number prediction tasks tailored to the characteristics of text and tables, and trains a pre-trained language model, TCBERT, with the ability to jointly encode text and tables.
2. The invention can be added directly to existing Text2SQL models and used in a plug-and-play manner. Extensive experiments show that it significantly improves SQL generation accuracy, demonstrating the superiority of the model.
3. The pre-trained language model has strong self-encoding capability and generalization, can be quickly migrated to databases in various domains, and significantly alleviates the shortage of labeled Text2SQL data.
Drawings
FIG. 1 is a diagram of the overall framework design of the method of the present invention.
FIG. 2 is a schematic overall flow chart of the system of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 2, the framework of the invention is divided into three parts: (a) a sequence representation module; (b) an encoder module, a deep encoder comprising a vector embedding module and a 12-layer Transformer; (c) a loss function calculation module.
Each part is described below:
(a) Sequence representation module. It extracts the key information of the structured table, converts it into a text sequence, splices it with the given natural language question and outputs the masked long sequence. In this embodiment, as shown in fig. 1, the basic steps are as follows, with an illustrative code sketch after step 3:
1. For a natural language question in the Text2SQL dataset, use an n-gram method to find, in each column of the association table, the unit value with the greatest overlap with the question.
2. Splice the long sequence in the order "query, column, column_type, value", where the "column, column_type, value" fragments of different columns are separated by the separator "[SEP]" and an "[XLS]" marker is added at the beginning of the long sequence X. Here, query denotes the natural language question, column a column name in the association table, column_type the column type, and value the unit value with the highest degree of overlap with the question.
3. Apply mask processing to the long sequence X: 10% of the characters are randomly selected from the query and column parts of X and replaced with mask characters; in this embodiment they are replaced with "[mask]". The resulting masked long sequence X is shown in the third-to-last row of fig. 1 and serves as the complete input of the encoder module.
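A minimal Python sketch of steps 1 to 3 follows. The bigram overlap measure, the helper names and the choice to mask characters of the question and the selected unit values (following step S3 above) are simplifying assumptions, not the patent's exact implementation.

```python
# Sketch of the sequence-representation module: value selection, splicing, masking.
import random

def char_ngrams(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def best_cell_value(question, cell_values, n=2):
    # Pick the cell value whose character n-grams overlap the question the most.
    def overlap(v):
        return len(char_ngrams(question, n) & char_ngrams(v, n))
    return max(cell_values, key=overlap)

def build_masked_sequence(question, columns, mask_ratio=0.10):
    # columns: list of (column_name, column_type, list_of_cell_values)
    parts = ["[XLS]", question, "[SEP]"]
    maskable = [1]  # indices of parts whose characters may be masked (question, values)
    for name, col_type, values in columns:
        value = best_cell_value(question, values)
        parts += [name, col_type, value, "[SEP]"]
        maskable.append(len(parts) - 2)  # the value segment just added
    # Random character-level masking of the question and the selected unit values.
    for idx in maskable:
        chars = list(parts[idx])
        k = max(1, int(len(chars) * mask_ratio))
        for pos in random.sample(range(len(chars)), k):
            chars[pos] = "[mask]"
        parts[idx] = "".join(chars)
    return " ".join(parts)

print(build_masked_sequence(
    "What are the weekly price changes of Renren and Sina?",
    [("name", "text", ["Sina", "Renren", "Baidu"]),
     ("weekly_change", "number", ["-3.87", "-4.09", "1.25"])],
))
```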
(b) Encoder module. It converts the masked long sequence X into embedding vectors and deep-encodes them. In this embodiment, as shown in fig. 1, the basic steps are as follows, with a code sketch after step 2:
1. The masked long sequence X is split into characters and converted into character-level vector encodings through a word-vector matrix; at the same time, position embeddings and segment embeddings are obtained from the position and segment id of each character in the text (in this embodiment, the segment id of the query part is 0 and that of the remaining parts is 1). The three vectors are summed position-wise to give the embedded vector representation of the text.
2. The embedded vectors are encoded by a 12-layer Transformer network, which learns contextual semantic associations, in particular the joint encoding of the natural language question and the column names of the association table. The Transformer avoids the drawbacks of traditional CNNs, which only capture local features, and of RNNs, which train slowly and struggle to capture long-range dependencies.
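The sketch below illustrates the encoder module in PyTorch: summed character, position and segment embeddings fed into a 12-layer Transformer. The hidden size, number of attention heads and vocabulary size are assumptions in line with a BERT-base configuration; the patent does not state them.

```python
# Minimal encoder sketch: token + position + segment embeddings -> 12-layer Transformer.
import torch
import torch.nn as nn

class TCBertEncoder(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, layers=12, heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # character-level embedding
        self.pos_emb = nn.Embedding(max_len, hidden)       # position embedding
        self.seg_emb = nn.Embedding(2, hidden)              # 0 = question part, 1 = table part
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions) + self.seg_emb(segment_ids)
        return self.encoder(x)  # (batch, seq_len, hidden) contextual encodings

enc = TCBertEncoder()
ids = torch.randint(0, 21128, (2, 64))
segs = torch.zeros(2, 64, dtype=torch.long)
print(enc(ids, segs).shape)  # torch.Size([2, 64, 768])
```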
(c) Loss function calculation module. It compares the output of the encoder module with the target SQL sequence, calculates the loss values of the masked sequence prediction task, the table schema parsing task and the condition number prediction task, sums the three losses and optimizes the objective function with a gradient descent strategy. In this embodiment, referring to fig. 1, the loss functions of the three tasks are computed on the output of the last Transformer layer and summed; the basic steps are as follows, with a code sketch after step 3:
1. Compute the masked sequence prediction (MLM) loss. For each "[mask]" position in the output features, predict the original character over the vocabulary. The prediction loss L1 is computed with a cross-entropy function.
2. Compute the table schema parsing (SLP) loss. In the output layer, the feature corresponding to the separator "[SEP]" to the right of each column name is used to predict the operations triggered by that column. To increase the non-linearity of the network, a 2-layer fully connected network is applied to the "[SEP]" output feature, followed by an activation function and layer normalization. Finally, the cross-entropy loss is computed for each column and summed to give the total SLP loss L2.
3. Compute the condition number prediction (Count) loss. The output feature of the first character "[XLS]" of the long sequence X is used to predict the number of conditions in the WHERE part of the target SQL. The loss L3 is computed with a cross-entropy function.
L1, L2 and L3 are summed as the total training loss, and mini-batch gradient descent is used to back-propagate gradients and update the network parameters.
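The following sketch shows how the three losses could be computed from the encoder output and summed. The head definitions, the maximum condition count and the reduction over columns are assumptions; only the 108 operation classes and the use of cross entropy come from the description.

```python
# Hedged sketch of the loss-function module: MLM + SLP + Count cross-entropy losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab_size, num_ops, max_conds = 768, 21128, 108, 5

mlm_head = nn.Linear(hidden, vocab_size)                      # predicts masked characters
slp_head = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                         nn.LayerNorm(hidden), nn.Linear(hidden, num_ops))
count_head = nn.Linear(hidden, max_conds + 1)                  # predicts number of WHERE conditions

def pretraining_loss(h, mask_pos, mask_labels, sep_pos, op_labels, cond_count):
    # h: (seq_len, hidden) encoder output for one example
    l1 = F.cross_entropy(mlm_head(h[mask_pos]), mask_labels)   # MLM loss at [mask] positions
    l2 = F.cross_entropy(slp_head(h[sep_pos]), op_labels)      # SLP loss at per-column [SEP] features
    l3 = F.cross_entropy(count_head(h[0:1]), cond_count)       # Count loss from the [XLS] feature
    return l1 + l2 + l3

h = torch.randn(64, hidden)
loss = pretraining_loss(h,
                        mask_pos=torch.tensor([3, 10]), mask_labels=torch.tensor([52, 977]),
                        sep_pos=torch.tensor([20, 40]), op_labels=torch.tensor([0, 17]),
                        cond_count=torch.tensor([2]))
loss.backward()
print(float(loss))
```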
The above is the basic structure of the system; each module is further subdivided as follows to realize the system functions.
The sequence representation module comprises:
an information acquisition module, which, given the natural language question, searches each column of the association table for the unit value with the greatest overlap with the question;
a sequence splicing module, which forms the long sequence in the form "natural language question; for each column of the association table, its column name, column type and the unit value with the highest degree of overlap with the question", splits the segments into characters, splices all columns of the association table in order, adds a start marker at the beginning of the long sequence and separates the per-column segments with separators;
a sequence mask module, which applies random mask processing to the natural language question in the long sequence and to the unit values from the association table, replacing 10% of randomly selected characters with mask characters.
The encoder module comprises:
a vector embedding module, which splits the masked long sequence into characters, converts them into character-level vector encodings through a word-vector matrix, obtains position embeddings and segment embeddings from the position and segment id of each character in the long sequence, sums the character-level, position and segment encodings, and thereby converts the masked long sequence into embedding vectors of fixed size;
a Transformer network module, which deep-encodes the embedding vectors with a 12-layer Transformer network and takes the output of the last Transformer layer as the final output of the encoder module.
The invention also provides a language model pre-training method based on structured table schema parsing and sequence masking, which mainly comprises the following steps:
Step 1: given a natural language question, an association table and a target SQL sequence, search each column of the association table for the unit value with the highest degree of overlap with the question.
Step 2: form the long sequence in the form "natural language question; for each column of the association table, its column name, column type and the unit value with the highest degree of overlap with the question", splicing all columns in order, adding a start marker at the beginning of the long sequence and separating the per-column segments with separators.
Step 3: apply random mask processing to the natural language question in the long sequence and to the unit values from the association table, replacing 10% of randomly selected characters with mask characters.
Step 4: build a language model, take the long sequences masked in step 3 as the pre-training dataset, and jointly train the language model on the masked sequence prediction task, the table schema parsing task and the condition number prediction task to obtain the TCBERT model; during joint pre-training, the sum of the loss functions of the three tasks is used as the total pre-training loss, and the objective function is optimized with a gradient descent algorithm; after pre-training, save the model weight parameters.
Step 5: for the Text2SQL task, use the pre-trained TCBERT model as the initialization encoder in a Text2SQL model to produce the initial encoding of the natural language question and the database schema.
This structured table schema parsing and sequence masking method is an improved language model pre-training algorithm: it designs dedicated objective functions that combine text with structured tables and gives the model the ability to jointly learn and encode the two. The language model TCBERT trained with this method, used as the initial feature encoder, achieves excellent results on the semantic parsing Text2SQL datasets WikiSQL and Spider. Previous pre-trained language models such as BERT and RoBERTa, by contrast, do not involve table encoding and are not sensitive to the relationship between the text and the target table. In addition, the invention lets the Text2SQL model acquire prior knowledge of the text-table relationship during training, provides richer feature representations for SQL parsing and generation, and greatly improves the quality of the generated SQL.
When step 1 is performed, the format of the data to be processed needs to be defined first.
The natural language question is a sequence Q = {q_1, …, q_|Q|} containing a series of characters, where q_i denotes the i-th character of the question and |Q| denotes the number of characters in the question. The association table contains column names, column types and unit values. The column names are denoted C = {c_1, …, c_|C|}, where c_i is the column name of the i-th column of the association table and |C| is the number of columns in the association table; each column name consists of one or more characters. The unit values corresponding to each column name are denoted v_i = {v_{i_1}, …, v_{i_|v_i|}}, where v_{i_k} denotes the k-th unit value corresponding to c_i and |v_i| denotes the number of unit values corresponding to c_i. The column types include two types: text and number.
In step 2, the long sequence is represented as X = "[XLS], Q, [SEP], c_1, c_1_type, v_1, [SEP], …, c_|C|, c_|C|_type, v_|C|, [SEP]", where [XLS] denotes the start marker, Q the natural language question, [SEP] the separator between segments, and c_i, c_i_type and v_i respectively denote the column name, the column type and the unit value with the highest degree of overlap with the question for the i-th column of the association table.
In step 4, the preferred language model is a 12-layer Transformer network. Jointly training the language model on the masked sequence prediction task, the table schema parsing task and the condition number prediction task specifically comprises the following:
The long sequence masked in step 3 is converted into an embedding-vector representation and used as the input of the language model. From the output sequence of the language model, the predicted character at each mask position, the prediction probabilities for each column name and the start-marker feature are obtained. The prediction probabilities cover whether the column name appears in the target SQL sequence and which SQL operations it triggers in the target SQL sequence; the start marker is used to predict the number of conditions in the target SQL sequence.
For the masked sequence prediction task (MLM), the cross-entropy between the predicted characters and the true characters is used as the loss value L1; in this embodiment, the true character of each "[mask]" token is predicted over the vocabulary from the features output by the last Transformer layer. For the table schema parsing task (SLP), the cross-entropy loss is computed from the prediction probabilities and the target SQL sequence, and the sum of the losses of all columns of the association table is used as the loss of the table schema parsing task; in this embodiment, the features at the separator positions of each segment of the association table are used to predict whether the column appears in the target SQL sequence (i.e., whether the column is relevant to the question) and which SQL operations it triggers in the target SQL sequence.
For example, consider the question "Show the cities that have more than 2 airports", with the corresponding SQL "SELECT city FROM airports GROUP BY city HAVING COUNT(*) > 2". For the "city" column, the triggered operation is "SELECT AND GROUP BY HAVING". In this embodiment there are 108 possible operation categories, numbered 0 to 107, where 0 means that the column does not appear in the SQL. Finally, the cross-entropy loss is computed for each column and summed to give the total SLP loss L2.
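The following sketch illustrates how a per-column SLP label of this kind might be constructed. The operation vocabulary shown is a small hypothetical subset of the 108 classes, and the clause-to-column mapping is assumed to come from an SQL parser; neither detail is specified in the patent.

```python
# Hedged sketch of SLP label construction: each column is mapped to the id of the
# operation combination it triggers, with 0 for columns absent from the SQL.
OP_CLASSES = {          # hypothetical subset of the 108 operation combinations
    "NONE": 0,
    "SELECT": 1,
    "WHERE": 2,
    "SELECT AND GROUP BY HAVING": 3,
}

def slp_labels(columns, clause_columns):
    # clause_columns: clause name -> set of column names it uses (from a SQL parse)
    labels = []
    for col in columns:
        triggered = [c for c, cols in clause_columns.items() if col in cols]
        key = " AND ".join(triggered) if triggered else "NONE"
        labels.append(OP_CLASSES.get(key, 0))
    return labels

# Example from the text: "SELECT city FROM airports GROUP BY city HAVING COUNT(*) > 2"
clause_columns = {"SELECT": {"city"}, "GROUP BY HAVING": {"city"}}
print(slp_labels(["city", "country"], clause_columns))   # [3, 0]
```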
In one embodiment of the invention, a two-layer fully connected network is used to extract features from the separator positions in the output sequence of the language model.
For the condition number prediction task (Count): when generating SQL, if the condition part is predicted incorrectly, the query result obtained by executing the SQL is certain to be wrong, so the Text2SQL task must ensure the prediction accuracy of the WHERE part as far as possible. The method predicts the number of conditions in the target SQL sequence from the feature of the start marker, and uses the cross-entropy between the predicted number of conditions and the actual number of conditions in the target SQL sequence as the loss value of the condition number prediction task. In this embodiment, the start marker is "[XLS]".
The sum of the loss functions of the three tasks is used as the total training loss L = L1 + L2 + L3, and the objective function is optimized with a gradient descent algorithm. The chain rule is used when optimizing the objective function, and the model parameters are updated as:
w'_j = w_j - α · ∂L(θ)/∂w_j
where L(θ) is the objective function, α denotes the learning rate, w_j is the parameter value in the language model before the update, and w'_j is the updated parameter value.
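This update is one step of (mini-batch) stochastic gradient descent; a minimal PyTorch illustration, with two made-up parameters and a stand-in loss, is:

```python
# One manual gradient-descent step matching w'_j = w_j - alpha * dL/dw_j.
import torch

w = torch.tensor([0.5, -1.2], requires_grad=True)   # two illustrative parameters
alpha = 0.01                                          # learning rate

loss = (w ** 2).sum()        # stand-in for L = L1 + L2 + L3
loss.backward()              # back-propagation via the chain rule gives dL/dw
with torch.no_grad():
    w -= alpha * w.grad      # parameter update
print(w)
```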
The pre-trained language model can be added directly to an existing Text2SQL model to provide the initial encoding of the question and the database schema.
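As a sketch of this plug-in use, assuming the saved TCBERT weights are stored in a HuggingFace-compatible checkpoint (the path and the format are assumptions, not stated in the patent), the encoder could be loaded and applied to a question-schema pair as follows:

```python
# Hedged sketch: using the saved TCBERT weights as a drop-in encoder.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./tcbert-checkpoint")   # hypothetical path
encoder = BertModel.from_pretrained("./tcbert-checkpoint")

question = "Show the cities that have more than 2 airports"
schema = "city text [SEP] country text [SEP] airport_count number"
inputs = tokenizer(question, schema, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # fed to the downstream SQL decoder
print(hidden.shape)
```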
Examples
The invention was compared on two large public datasets, WikiSQL and Spider. WikiSQL is the first manually labeled large-scale dataset in the Text2SQL field, containing 26,375 tables from Wikipedia and 87,726 natural language questions; each example includes, besides the original question, the corresponding SQL statement and the database schema.
Spider is a large-scale, complex, cross-domain Text2SQL dataset released by Yale University, containing 10,181 natural language questions, 5,693 distinct SQL queries and 200 cross-domain databases; the SQL statements cover almost all SQL constructs, such as "ORDER BY", "HAVING", "INTERSECT", "UNION" and "JOIN".
To train TCBERT, the invention collects natural language questions with their associated tables and SQL from the WDC WebTable corpus. WDC WebTable is a large-scale table dataset crawled from the Common Crawl website. The raw data contains considerable noise and must be cleaned and filtered, for example by removing HTML tags and hyperlinks from the samples. The final result is 300,000 high-quality query-SQL data samples with associated tables for pre-training.
The evaluation metric used by the invention is the exact match accuracy of the generated SQL: a model prediction is considered correct only if all segments in the SELECT and WHERE clauses match exactly. The overall improvement brought by TCBERT was compared on 6 mainstream Text2SQL models: IncSQL, XSQL and SQLova, ranked at the top of the WikiSQL leaderboard, and RAT-SQL, IRNet and EditSQL, ranked at the top of the Spider leaderboard. During testing, the structure of each Text2SQL model was kept unchanged and only the original encoder (such as word2vec or BERT) was uniformly replaced by the TCBERT encoder. The comparative results are shown in Tables 2 and 3. In addition, to demonstrate the effectiveness of the 3 proposed training tasks, the ablation experiments shown in Table 4 were performed.
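A simplified illustration of the exact match metric is given below; real evaluation scripts normalise clause order and values far more carefully, so this is only a sketch of the idea that a prediction counts as correct when its SELECT and WHERE segments match the gold SQL.

```python
# Simplified exact-match metric over SELECT and WHERE segments.
import re

def clause(sql, name):
    m = re.search(name + r"\s+(.*?)(?=\bFROM\b|\bWHERE\b|\bGROUP\b|\bORDER\b|$)",
                  sql, re.I | re.S)
    return set(p.strip().lower() for p in m.group(1).split(",")) if m else set()

def exact_match(pred_sql, gold_sql):
    return (clause(pred_sql, "SELECT") == clause(gold_sql, "SELECT")
            and clause(pred_sql, "WHERE") == clause(gold_sql, "WHERE"))

def exact_match_accuracy(pairs):
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)

pairs = [("SELECT name FROM t WHERE age > 30", "SELECT name FROM t WHERE age > 30"),
         ("SELECT name FROM t", "SELECT id FROM t")]
print(exact_match_accuracy(pairs))   # 0.5
```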
TABLE 2
TABLE 3
TABLE 4
As can be seen from Tables 2 and 3, the proposed pre-training method based on structured table schema parsing and sequence masking achieves the best result on the evaluation metric for every model on every task, with an especially marked improvement on the more difficult Spider dataset, fully demonstrating the superiority of the proposed algorithm.
In addition, the ablation comparison in Table 4 shows that the sequence masking task (MLM), the table schema parsing task (SLP) and the condition number prediction task (Count) each improve the final performance of the model to varying degrees, with the SLP task, designed around the characteristics of structured tables, giving the most significant improvement.
The foregoing merely illustrates specific embodiments of the invention. The invention is obviously not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can derive or suggest from the disclosure of the present invention are considered to fall within the scope of the invention.
Claims (9)
1. A language model pre-training method for table pattern analysis and sequence mask is characterized by comprising the following steps:
s1: giving a natural language question, an association table and a target SQL sequence, and respectively searching a unit value with the highest overlapping degree with the natural language question from each column of the association table;
s2: synthesizing a segment in the form of 'a natural language question sentence, column names and column types in an association table, and a unit value with the highest overlapping degree with the natural language question sentence', sequentially splicing all columns in the association table into a long sequence according to the sequence, adding an initial marker at the starting position of the long sequence, and separating a plurality of segments by using separators;
s3: carrying out random mask processing on the natural language question in the long sequence and the unit values in the association table, and replacing 10% of randomly extracted characters by mask characters;
s4: establishing a language model, taking the long sequence subjected to random mask processing in the step S3 as a pre-training data set of the language model, and jointly training the language model according to a mask sequence prediction task, a table mode analysis task and a condition number prediction task to obtain a TCBERT model; the method specifically comprises the following steps: embedding vector representation is carried out on the long sequence subjected to the random mask processing in the step S3 and then the long sequence is used as the input of the language model, and the predicted character corresponding to the position of the mask character, the predicted probability of each column name and the initial marker are obtained according to the output sequence of the language model; the prediction probability comprises whether the column name appears in the target SQL sequence and SQL operation triggered in the target SQL sequence; the starting marker is used for predicting the condition quantity in the target SQL sequence;
aiming at the mask sequence prediction task, taking a cross entropy function value between a predicted character and a real character as a loss value of the mask sequence prediction task;
aiming at the table mode analysis task, calculating cross entropy loss according to the prediction probability and the target SQL sequence, and taking the sum of the losses of all columns in the associated table as a loss value of the table mode analysis task;
predicting the condition quantity in the target SQL sequence according to the characteristics of the initial marker aiming at the condition quantity prediction task, and taking a cross entropy function value between the predicted condition quantity and the actual condition quantity in the target SQL sequence as a loss value of the condition quantity prediction task;
in the combined pre-training process, the sum of the loss functions of the three tasks is used as the total loss of model pre-training, and a target function is optimized by using a gradient descent algorithm; after the pre-training is finished, saving the model weight parameters;
s5: aiming at the Text2SQL task, the pre-trained TCBERT model is used as an initialization encoder in the Text2SQL model to perform initialization encoding of a natural language question and a database mode.
2. The method of claim 1, wherein the natural language question is a sequence Q = {q_1, …, q_|Q|} containing a series of characters, q_i denoting the i-th character of the question and |Q| the number of characters in the question; the association table contains column names, column types and unit values, the column names being denoted C = {c_1, …, c_|C|}, where c_i is the column name of the i-th column of the association table and |C| is the number of columns in the association table, each column name consisting of one or more characters, and the unit values corresponding to each column name being denoted v_i = {v_{i_1}, …, v_{i_|v_i|}}, where v_{i_k} denotes the k-th unit value corresponding to c_i and |v_i| denotes the number of unit values corresponding to c_i; the column types include two types, text and number.
3. The method for language model pre-training with table pattern parsing and sequence masking as claimed in claim 2, wherein the long sequence in step S2 is represented as X = "[XLS], Q, [SEP], c_1, c_1_type, v_1, [SEP], …, c_|C|, c_|C|_type, v_|C|, [SEP]", wherein [XLS] denotes the start marker, Q denotes the natural language question, [SEP] denotes the separator between segments, and c_i, c_i_type and v_i respectively denote the column name, the column type and the unit value with the highest degree of overlap with the natural language question for the i-th column of the association table.
4. The method for pre-training a language model for table pattern parsing and sequence masking as claimed in claim 1, wherein the language model established in step S4 is a 12-layer Transformer network structure.
5. The method for pre-training a language model through table pattern parsing and sequence masking as claimed in claim 1, wherein the optimization of the objective function in step S4 uses the chain rule, and the model parameters are updated as:
w'_j = w_j - α · ∂L(θ)/∂w_j
where L(θ) is the objective function, α denotes the learning rate, w_j is the parameter value in the language model before the update, and w'_j is the updated parameter value.
6. The method of pre-training a language model for table mode parsing and sequence masking according to claim 1, wherein for the table mode parsing task, the features corresponding to the positions of the separators of each segment of the associated table are used to predict whether the column appears in the target SQL sequence and the SQL operation triggered in the target SQL sequence, and the features are extracted for the separators in the output sequence of the language model by using a two-layer fully-connected network.
7. A language model pre-training system using table pattern parsing and sequence masking according to the method of claim 1, comprising:
the sequence representation module is used for extracting a unit value with the highest overlapping degree with the natural language question in the association table, converting the structured association table into a text sequence, splicing the text sequence with the given natural language question and outputting a masked long sequence;
the coder module is internally provided with a language model and is used for converting the masked long sequence into an embedded vector and carrying out depth coding on the embedded vector;
and the loss function calculation module is used for comparing the output of the encoder module with the target SQL sequence, respectively calculating the loss values of the mask sequence prediction task, the table mode analysis task and the condition number prediction task, summing the three loss values and optimizing the target function by using a gradient descent strategy.
8. The system for language model pre-training for table pattern parsing and sequence masking as claimed in claim 7, wherein said sequence representation module comprises:
the information acquisition module is used for searching out a unit value with the maximum overlapping degree with the natural language question in each column of the association table according to the given natural language question;
the sequence splicing module is used for synthesizing a segment according to the form of 'a natural language question sentence, a column name in an associated table, a column type and a unit value with the highest overlapping degree with the natural language question sentence', cutting the segment into characters respectively, sequentially splicing all the columns in the associated table into a long sequence according to the sequence, adding an initial marker at the starting position of the long sequence, and separating a plurality of segments by using separators;
and the sequence mask module is used for carrying out random mask processing on the natural language question in the long sequence and the unit values in the association table, and each randomly extracted 10% of characters is replaced by a mask character.
9. The system for language model pre-training for table pattern parsing and sequence masking as claimed in claim 7, wherein said encoder module comprises:
the vector embedding module is used for cutting the masked long sequence according to characters, converting the long sequence into vector codes at a character level through a word vector matrix, and obtaining position embedded codes and sequence embedded codes according to the position and the serial number of each character in the long sequence; summing the vector coding, the position embedding coding and the sequence embedding coding of the character level, and converting the original masked long sequence into an embedded vector with a fixed size;
and the Transformer network module adopts 12 layers of Transformer networks to carry out depth coding on the embedded vector, and takes the output of the last layer of Transformer network as the final output of the encoder module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110210906.6A CN112559556B (en) | 2021-02-25 | 2021-02-25 | Language model pre-training method and system for table mode analysis and sequence mask |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110210906.6A CN112559556B (en) | 2021-02-25 | 2021-02-25 | Language model pre-training method and system for table mode analysis and sequence mask |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112559556A CN112559556A (en) | 2021-03-26 |
CN112559556B true CN112559556B (en) | 2021-05-25 |
Family
ID=75034763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110210906.6A Active CN112559556B (en) | 2021-02-25 | 2021-02-25 | Language model pre-training method and system for table mode analysis and sequence mask |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112559556B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112925794B (en) * | 2021-04-02 | 2022-09-16 | 中国人民解放军国防科技大学 | Complex multi-table SQL generation method and device based on bridging filling |
CN113011136B (en) * | 2021-04-02 | 2022-09-16 | 中国人民解放军国防科技大学 | SQL (structured query language) analysis method and device based on correlation judgment and computer equipment |
CN112988785B (en) * | 2021-05-10 | 2021-08-20 | 浙江大学 | SQL conversion method and system based on language model coding and multitask decoding |
CN113673236A (en) * | 2021-07-15 | 2021-11-19 | 北京三快在线科技有限公司 | Model training method, table recognition method, device, electronic equipment and storage medium |
CN113591475B (en) * | 2021-08-03 | 2023-07-21 | 美的集团(上海)有限公司 | Method and device for unsupervised interpretable word segmentation and electronic equipment |
CN113705652B (en) * | 2021-08-23 | 2024-05-28 | 西安交通大学 | Task type dialogue state tracking system and method based on pointer generation network |
CN113986958B (en) * | 2021-11-10 | 2024-02-09 | 北京有竹居网络技术有限公司 | Text information conversion method and device, readable medium and electronic equipment |
CN114281968B (en) * | 2021-12-20 | 2023-02-28 | 北京百度网讯科技有限公司 | Model training and corpus generation method, device, equipment and storage medium |
CN114942937B (en) * | 2022-04-18 | 2024-07-19 | 江苏方天电力技术有限公司 | Noisy NL2SQL method and device based on restriction constraint |
CN114528394B (en) * | 2022-04-22 | 2022-08-26 | 杭州费尔斯通科技有限公司 | Text triple extraction method and device based on mask language model |
CN114579606B (en) * | 2022-05-05 | 2022-07-29 | 阿里巴巴达摩院(杭州)科技有限公司 | Pre-training model data processing method, electronic device and computer storage medium |
CN114897163A (en) * | 2022-05-23 | 2022-08-12 | 阿里巴巴(中国)有限公司 | Pre-training model data processing method, electronic device and computer storage medium |
CN115203236B (en) * | 2022-07-15 | 2023-05-12 | 哈尔滨工业大学 | text-to-SQL generating method based on template retrieval |
CN115081428B (en) * | 2022-07-22 | 2022-11-29 | 粤港澳大湾区数字经济研究院(福田) | Method for processing natural language, natural language processing model and equipment |
CN115438183B (en) * | 2022-08-31 | 2023-07-04 | 广州宝立科技有限公司 | Business website monitoring system based on natural language processing |
CN115408506B (en) * | 2022-09-15 | 2023-09-12 | 云南大学 | NL2SQL method combining semantic analysis and semantic component matching |
CN116451708B (en) * | 2023-03-16 | 2024-10-18 | 苏州大学 | Text prediction method and system based on self-adaptive mask strategy and electronic equipment |
CN118246409B (en) * | 2024-05-28 | 2024-08-27 | 珠海金山办公软件有限公司 | Programming statement generation method and device, electronic equipment and storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200334252A1 (en) * | 2019-04-18 | 2020-10-22 | Sap Se | Clause-wise text-to-sql generation |
US20200334233A1 (en) * | 2019-04-18 | 2020-10-22 | Sap Se | One-shot learning for text-to-sql |
CN111625554A (en) * | 2020-07-30 | 2020-09-04 | 武大吉奥信息技术有限公司 | Data query method and device based on deep learning semantic understanding |
CN112287093A (en) * | 2020-12-02 | 2021-01-29 | 上海交通大学 | Automatic question-answering system based on semi-supervised learning and Text-to-SQL model |
Non-Patent Citations (2)
Title |
---|
F-SQL: Fuse Table Schema and Table Content for Single-Table Text2SQL Generation; Xiaoyu Zhang et al.; IEEE Access, vol. 8, pp. 136409-136420; 2020-07-24 *
M-SQL: Multi-Task Representation Learning for Single-Table Text2sql Generation; Xiaoyu Zhang et al.; IEEE Access, vol. 8, pp. 43156-43167; 2020-03-02 *
Also Published As
Publication number | Publication date |
---|---|
CN112559556A (en) | 2021-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112559556B (en) | Language model pre-training method and system for table mode analysis and sequence mask | |
CN111310438B (en) | Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
WO2021114745A1 (en) | Named entity recognition method employing affix perception for use in social media | |
CN114020862B (en) | Search type intelligent question-answering system and method for coal mine safety regulations | |
CN110134946B (en) | Machine reading understanding method for complex data | |
CN110737763A (en) | Chinese intelligent question-answering system and method integrating knowledge map and deep learning | |
CN113806563B (en) | Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material | |
CN111858932A (en) | Multiple-feature Chinese and English emotion classification method and system based on Transformer | |
CN106202010A (en) | The method and apparatus building Law Text syntax tree based on deep neural network | |
CN110390049B (en) | Automatic answer generation method for software development questions | |
CN103189860A (en) | Machine translation device and machine translation method in which a syntax conversion model and a vocabulary conversion model are combined | |
CN116821168B (en) | Improved NL2SQL method based on large language model | |
CN112256847B (en) | Knowledge base question-answering method integrating fact texts | |
CN110765755A (en) | Semantic similarity feature extraction method based on double selection gates | |
CN111026941A (en) | Intelligent query method for demonstration and evaluation of equipment system | |
CN111324691A (en) | Intelligent question-answering method for minority nationality field based on knowledge graph | |
CN116680377B (en) | Chinese medical term self-adaptive alignment method based on log feedback | |
CN113761890A (en) | BERT context sensing-based multi-level semantic information retrieval method | |
CN111966810A (en) | Question-answer pair ordering method for question-answer system | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112732888A (en) | Answer prediction method and device based on graph reasoning model | |
CN115730078A (en) | Event knowledge graph construction method and device for class case retrieval and electronic equipment | |
CN114757184A (en) | Method and system for realizing knowledge question answering in aviation field | |
CN115171870A (en) | Diagnosis guiding and prompting method and system based on m-BERT pre-training model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||