CN112559556B - Language model pre-training method and system for table schema parsing and sequence masking
- Publication number
- CN112559556B (application number CN202110210906.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- natural language
- column
- training
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses a language model pre-training method and system based on table schema parsing and sequence masking, belonging to the field of natural language processing. The method comprises the following steps: (1) given a natural language question, an association table and a target SQL sequence, search each column of the table for the unit value with the highest degree of overlap with the question; (2) splice the question, the column names, the column types and the selected unit values into one long sequence, add separators and a start marker, and apply mask processing; (3) jointly train a language model on a masked sequence prediction task, a table schema parsing task and a condition number prediction task; (4) optimize the objective function with a gradient descent algorithm; (5) after training, save the model weight parameters; the model can then be added directly to an existing Text2SQL model to provide the initial encoding of natural language questions and database schemas.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a language model pre-training method and system based on table schema parsing and sequence masking.
Background
Natural language processing (hereinafter NLP), a branch of artificial intelligence, has developed rapidly in recent years. Since Google released the pre-trained language model BERT at the end of 2018, the two-stage paradigm of pre-training followed by fine-tuning has become the standard approach in NLP. It has greatly improved machines' ability to understand human language, and parsing natural language to complete complex downstream tasks is no longer out of reach. Reports from the Chinese language understanding evaluation benchmark CLUE show that many models now surpass human evaluation results on traditional NLP tasks such as text classification, named entity recognition and even reading comprehension, to the extent that these tasks can be regarded as largely solved by current computer technology.
Meanwhile, the rise of big data has made structured data and databases ever larger. In the past, a user who wanted to query database content had to write statements in the structured query language SQL and then interact with the database, which is inconvenient for ordinary users without a computer science background. How to query databases freely through natural language has therefore become a new research hotspot. Text to SQL (hereinafter Text2SQL) is a subtask of semantic parsing in natural language understanding: it aims to convert a user's natural language question directly into the corresponding SQL and then complete the query. Its purpose can be summarized simply as breaking down the barrier between people and structured databases.
For example, consider the structured table shown in Table 1. A user asks a question about the table: "What are the weekly price changes of Renren and Sina?" The Text2SQL system generates the SQL query "SELECT weekly_change FROM Table 1 WHERE name = 'Sina' OR name = 'Renren'" from the user question and the table contents, executes the query automatically and returns the result: "-3.87, -4.09".
Table 1 structured table example
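To make the example concrete, the following sketch runs a query of the same form with SQLite. The table and column names (stocks, name, weekly_change) and the extra row are illustrative assumptions, since Table 1 itself appears only as an image in the original; the text gives only the two returned values, -3.87 and -4.09, and does not say which company each belongs to, so the assignment below is arbitrary.

```python
# Illustrative reconstruction of the Table 1 example; cell values other than the
# two returned in the text (-3.87 and -4.09) are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (name TEXT, weekly_change REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?)",
    [("Sina", -3.87), ("Renren", -4.09), ("Baidu", 1.25)],
)

# SQL produced by a Text2SQL system for the question
# "What are the weekly price changes of Renren and Sina?"
rows = conn.execute(
    "SELECT weekly_change FROM stocks WHERE name = 'Sina' OR name = 'Renren'"
).fetchall()
print([r[0] for r in rows])  # [-3.87, -4.09]
```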
In 2018, Yale University released Spider, a cross-domain, complex Text2SQL dataset that includes nested queries, cross-table queries and complex SQL constructs such as "GROUP BY", "ORDER BY" and "HAVING", with a task difficulty close to real application scenarios. Although the top-ranked models on the public leaderboard generally strengthen their encoders with pre-trained language models such as BERT, the best accuracy on the leaderboard is still only about 70%, leaving considerable room for improvement.
Current pre-trained language models such as BERT and RoBERTa are generally pre-trained on general-purpose text corpora with a masked language model prediction task (MLM), which differs significantly from the downstream Text2SQL scenario.
Disclosure of Invention
To solve the above technical problems, the invention provides a language model pre-training method and system based on table schema parsing and sequence masking. The proposed pre-training approach jointly encodes natural language text and structured tables, fully combining the two. The language model is first pre-trained with structured table data and table parsing tasks to obtain TCBERT, and is then fine-tuned on the Text2SQL task. This keeps the pre-training and fine-tuning scenarios consistent and allows the language model's ability to jointly encode natural language questions and structured tables to be exploited more fully.
By learning SQL-parsing-related tasks, the invention effectively helps solve the NLP semantic parsing subtask Text2SQL. Moreover, thanks to the strong self-encoding capability and semantic generalization of the language model, the system can be quickly migrated to database scenarios in different domains, significantly alleviating the shortage of question-SQL labeled data.
To achieve these objectives, the invention adopts the following technical scheme:
One objective of the present invention is to provide a language model pre-training method based on table schema parsing and sequence masking, comprising the following steps:
S1: given a natural language question, an association table and a target SQL sequence, search each column of the association table for the unit value with the highest degree of overlap with the question;
S2: form the long sequence in the form "natural language question; for each column of the association table, its column name, column type and the unit value with the highest degree of overlap with the question", splicing all columns in order, adding a start marker at the beginning of the long sequence and separating the per-column segments with separators;
S3: apply random mask processing to the natural language question in the long sequence and to the unit values from the association table, replacing 10% of randomly selected characters with mask characters;
S4: build a language model, take the long sequences masked in step S3 as the pre-training dataset, and jointly train the language model on a masked sequence prediction task, a table schema parsing task and a condition number prediction task to obtain the TCBERT model; during joint pre-training, the sum of the loss functions of the three tasks is used as the total pre-training loss, and the objective function is optimized with a gradient descent algorithm; after pre-training, save the model weight parameters;
S5: for the Text2SQL task, use the pre-trained TCBERT model as the initialization encoder in a Text2SQL model to produce the initial encoding of the natural language question and the database schema.
Another object of the present invention is to provide a language model pre-training system based on table schema parsing and sequence masking according to the above method, comprising:
a sequence representation module, which extracts from the association table the unit values with the highest degree of overlap with the natural language question, converts the structured association table into a text sequence, splices it with the given natural language question and outputs the masked long sequence;
an encoder module, which contains the language model, converts the masked long sequence into embedding vectors and deep-encodes them;
a loss function calculation module, which compares the output of the encoder module with the target SQL sequence, calculates the loss values of the masked sequence prediction task, the table schema parsing task and the condition number prediction task, sums the three loss values and optimizes the objective function with a gradient descent strategy.
Compared with the prior art, the invention has the following advantages:
1. The invention designs sequence masking, schema parsing and condition number prediction tasks tailored to the characteristics of text and tables, and trains a pre-trained language model, TCBERT, with the ability to jointly encode text and tables.
2. The invention can be added directly to existing Text2SQL models and used in a plug-and-play manner. Extensive experiments show that it significantly improves SQL generation accuracy, demonstrating the superiority of the model.
3. The pre-trained language model has strong self-encoding capability and generalization, can be quickly migrated to databases in various domains, and significantly alleviates the shortage of labeled Text2SQL data.
Drawings
FIG. 1 is a diagram of the overall framework design of the method of the present invention.
FIG. 2 is a schematic overall flow chart of the system of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 2, the framework of the invention is divided into three parts: (a) a sequence representation module; (b) an encoder module, a deep encoder comprising a vector embedding module and a 12-layer Transformer; (c) a loss function calculation module.
Each part is described below:
(a) Sequence representation module. It extracts the key information of the structured table, converts it into a text sequence, splices it with the given natural language question and outputs the masked long sequence. In this embodiment, as shown in fig. 1, the basic steps are as follows, with an illustrative code sketch after step 3:
1. For a natural language question in the Text2SQL dataset, use an n-gram method to find, in each column of the association table, the unit value with the greatest overlap with the question.
2. Splice the long sequence in the order "query, column, column_type, value", where the "column, column_type, value" fragments of different columns are separated by the separator "[SEP]" and an "[XLS]" marker is added at the beginning of the long sequence X. Here, query denotes the natural language question, column a column name in the association table, column_type the column type, and value the unit value with the highest degree of overlap with the question.
3. Apply mask processing to the long sequence X: 10% of the characters are randomly selected from the query and column parts of X and replaced with mask characters; in this embodiment they are replaced with "[mask]". The resulting masked long sequence X is shown in the third-to-last row of fig. 1 and serves as the complete input of the encoder module.
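A minimal Python sketch of steps 1 to 3 follows. The bigram overlap measure, the helper names and the choice to mask characters of the question and the selected unit values (following step S3 above) are simplifying assumptions, not the patent's exact implementation.

```python
# Sketch of the sequence-representation module: value selection, splicing, masking.
import random

def char_ngrams(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def best_cell_value(question, cell_values, n=2):
    # Pick the cell value whose character n-grams overlap the question the most.
    def overlap(v):
        return len(char_ngrams(question, n) & char_ngrams(v, n))
    return max(cell_values, key=overlap)

def build_masked_sequence(question, columns, mask_ratio=0.10):
    # columns: list of (column_name, column_type, list_of_cell_values)
    parts = ["[XLS]", question, "[SEP]"]
    maskable = [1]  # indices of parts whose characters may be masked (question, values)
    for name, col_type, values in columns:
        value = best_cell_value(question, values)
        parts += [name, col_type, value, "[SEP]"]
        maskable.append(len(parts) - 2)  # the value segment just added
    # Random character-level masking of the question and the selected unit values.
    for idx in maskable:
        chars = list(parts[idx])
        k = max(1, int(len(chars) * mask_ratio))
        for pos in random.sample(range(len(chars)), k):
            chars[pos] = "[mask]"
        parts[idx] = "".join(chars)
    return " ".join(parts)

print(build_masked_sequence(
    "What are the weekly price changes of Renren and Sina?",
    [("name", "text", ["Sina", "Renren", "Baidu"]),
     ("weekly_change", "number", ["-3.87", "-4.09", "1.25"])],
))
```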
(b) Encoder module. It converts the masked long sequence X into embedding vectors and deep-encodes them. In this embodiment, as shown in fig. 1, the basic steps are as follows, with a code sketch after step 2:
1. The masked long sequence X is split into characters and converted into character-level vector encodings through a word-vector matrix; at the same time, position embeddings and segment embeddings are obtained from the position and segment id of each character in the text (in this embodiment, the segment id of the query part is 0 and that of the remaining parts is 1). The three vectors are summed position-wise to give the embedded vector representation of the text.
2. The embedded vectors are encoded by a 12-layer Transformer network, which learns contextual semantic associations, in particular the joint encoding of the natural language question and the column names of the association table. The Transformer avoids the drawbacks of traditional CNNs, which only capture local features, and of RNNs, which train slowly and struggle to capture long-range dependencies.
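The sketch below illustrates the encoder module in PyTorch: summed character, position and segment embeddings fed into a 12-layer Transformer. The hidden size, number of attention heads and vocabulary size are assumptions in line with a BERT-base configuration; the patent does not state them.

```python
# Minimal encoder sketch: token + position + segment embeddings -> 12-layer Transformer.
import torch
import torch.nn as nn

class TCBertEncoder(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, layers=12, heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # character-level embedding
        self.pos_emb = nn.Embedding(max_len, hidden)       # position embedding
        self.seg_emb = nn.Embedding(2, hidden)              # 0 = question part, 1 = table part
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions) + self.seg_emb(segment_ids)
        return self.encoder(x)  # (batch, seq_len, hidden) contextual encodings

enc = TCBertEncoder()
ids = torch.randint(0, 21128, (2, 64))
segs = torch.zeros(2, 64, dtype=torch.long)
print(enc(ids, segs).shape)  # torch.Size([2, 64, 768])
```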
(c) Loss function calculation module. It compares the output of the encoder module with the target SQL sequence, calculates the loss values of the masked sequence prediction task, the table schema parsing task and the condition number prediction task, sums the three losses and optimizes the objective function with a gradient descent strategy. In this embodiment, referring to fig. 1, the loss functions of the three tasks are computed on the output of the last Transformer layer and summed; the basic steps are as follows, with a code sketch after step 3:
1. Compute the masked sequence prediction (MLM) loss. For each "[mask]" position in the output features, predict the original character over the vocabulary. The prediction loss L1 is computed with a cross-entropy function.
2. Compute the table schema parsing (SLP) loss. In the output layer, the feature corresponding to the separator "[SEP]" to the right of each column name is used to predict the operations triggered by that column. To increase the non-linearity of the network, a 2-layer fully connected network is applied to the "[SEP]" output feature, followed by an activation function and layer normalization. Finally, the cross-entropy loss is computed for each column and summed to give the total SLP loss L2.
3. Compute the condition number prediction (Count) loss. The output feature of the first character "[XLS]" of the long sequence X is used to predict the number of conditions in the WHERE part of the target SQL. The loss L3 is computed with a cross-entropy function.
L1, L2 and L3 are summed as the total training loss, and mini-batch gradient descent is used to back-propagate gradients and update the network parameters.
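The following sketch shows how the three losses could be computed from the encoder output and summed. The head definitions, the maximum condition count and the reduction over columns are assumptions; only the 108 operation classes and the use of cross entropy come from the description.

```python
# Hedged sketch of the loss-function module: MLM + SLP + Count cross-entropy losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab_size, num_ops, max_conds = 768, 21128, 108, 5

mlm_head = nn.Linear(hidden, vocab_size)                      # predicts masked characters
slp_head = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                         nn.LayerNorm(hidden), nn.Linear(hidden, num_ops))
count_head = nn.Linear(hidden, max_conds + 1)                  # predicts number of WHERE conditions

def pretraining_loss(h, mask_pos, mask_labels, sep_pos, op_labels, cond_count):
    # h: (seq_len, hidden) encoder output for one example
    l1 = F.cross_entropy(mlm_head(h[mask_pos]), mask_labels)   # MLM loss at [mask] positions
    l2 = F.cross_entropy(slp_head(h[sep_pos]), op_labels)      # SLP loss at per-column [SEP] features
    l3 = F.cross_entropy(count_head(h[0:1]), cond_count)       # Count loss from the [XLS] feature
    return l1 + l2 + l3

h = torch.randn(64, hidden)
loss = pretraining_loss(h,
                        mask_pos=torch.tensor([3, 10]), mask_labels=torch.tensor([52, 977]),
                        sep_pos=torch.tensor([20, 40]), op_labels=torch.tensor([0, 17]),
                        cond_count=torch.tensor([2]))
loss.backward()
print(float(loss))
```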
The above is the basic structure of the system; each module is further subdivided as follows to realize the system functions.
The sequence representation module comprises:
an information acquisition module, which, given the natural language question, searches each column of the association table for the unit value with the greatest overlap with the question;
a sequence splicing module, which forms the long sequence in the form "natural language question; for each column of the association table, its column name, column type and the unit value with the highest degree of overlap with the question", splits the segments into characters, splices all columns of the association table in order, adds a start marker at the beginning of the long sequence and separates the per-column segments with separators;
a sequence mask module, which applies random mask processing to the natural language question in the long sequence and to the unit values from the association table, replacing 10% of randomly selected characters with mask characters.
The encoder module comprises:
a vector embedding module, which splits the masked long sequence into characters, converts them into character-level vector encodings through a word-vector matrix, obtains position embeddings and segment embeddings from the position and segment id of each character in the long sequence, sums the character-level, position and segment encodings, and thereby converts the masked long sequence into embedding vectors of fixed size;
a Transformer network module, which deep-encodes the embedding vectors with a 12-layer Transformer network and takes the output of the last Transformer layer as the final output of the encoder module.
The invention also provides a language model pre-training method based on structured table schema parsing and sequence masking, which mainly comprises the following steps:
Step 1: given a natural language question, an association table and a target SQL sequence, search each column of the association table for the unit value with the highest degree of overlap with the question.
Step 2: form the long sequence in the form "natural language question; for each column of the association table, its column name, column type and the unit value with the highest degree of overlap with the question", splicing all columns in order, adding a start marker at the beginning of the long sequence and separating the per-column segments with separators.
Step 3: apply random mask processing to the natural language question in the long sequence and to the unit values from the association table, replacing 10% of randomly selected characters with mask characters.
Step 4: build a language model, take the long sequences masked in step 3 as the pre-training dataset, and jointly train the language model on the masked sequence prediction task, the table schema parsing task and the condition number prediction task to obtain the TCBERT model; during joint pre-training, the sum of the loss functions of the three tasks is used as the total pre-training loss, and the objective function is optimized with a gradient descent algorithm; after pre-training, save the model weight parameters.
Step 5: for the Text2SQL task, use the pre-trained TCBERT model as the initialization encoder in a Text2SQL model to produce the initial encoding of the natural language question and the database schema.
This structured table schema parsing and sequence masking method is an improved language model pre-training algorithm: it designs dedicated objective functions that combine text with structured tables and gives the model the ability to jointly learn and encode the two. The language model TCBERT trained with this method, used as the initial feature encoder, achieves excellent results on the semantic parsing Text2SQL datasets WikiSQL and Spider. Previous pre-trained language models such as BERT and RoBERTa, by contrast, do not involve table encoding and are not sensitive to the relationship between the text and the target table. In addition, the invention lets the Text2SQL model acquire prior knowledge of the text-table relationship during training, provides richer feature representations for SQL parsing and generation, and greatly improves the quality of the generated SQL.
When step 1 is performed, the format of the data to be processed needs to be defined first.
The natural language question is a sequence Q = {q_1, …, q_|Q|} containing a series of characters, where q_i denotes the i-th character of the question and |Q| denotes the number of characters in the question. The association table contains column names, column types and unit values. The column names are denoted C = {c_1, …, c_|C|}, where c_i is the column name of the i-th column of the association table and |C| is the number of columns in the association table; each column name consists of one or more characters. The unit values corresponding to each column name are denoted v_i = {v_{i_1}, …, v_{i_|v_i|}}, where v_{i_k} denotes the k-th unit value corresponding to c_i and |v_i| denotes the number of unit values corresponding to c_i. The column types include two types: text and number.
In step 2, the long sequence is represented as X = "[XLS], Q, [SEP], c_1, c_1_type, v_1, [SEP], …, c_|C|, c_|C|_type, v_|C|, [SEP]", where [XLS] denotes the start marker, Q the natural language question, [SEP] the separator between segments, and c_i, c_i_type and v_i respectively denote the column name, the column type and the unit value with the highest degree of overlap with the question for the i-th column of the association table.
In step 4, the preferred language model is a 12-layer Transformer network. Jointly training the language model on the masked sequence prediction task, the table schema parsing task and the condition number prediction task specifically comprises the following:
The long sequence masked in step 3 is converted into an embedding-vector representation and used as the input of the language model. From the output sequence of the language model, the predicted character at each mask position, the prediction probabilities for each column name and the start-marker feature are obtained. The prediction probabilities cover whether the column name appears in the target SQL sequence and which SQL operations it triggers in the target SQL sequence; the start marker is used to predict the number of conditions in the target SQL sequence.
For the masked sequence prediction task (MLM), the cross-entropy between the predicted characters and the true characters is used as the loss value L1; in this embodiment, the true character of each "[mask]" token is predicted over the vocabulary from the features output by the last Transformer layer. For the table schema parsing task (SLP), the cross-entropy loss is computed from the prediction probabilities and the target SQL sequence, and the sum of the losses of all columns of the association table is used as the loss of the table schema parsing task; in this embodiment, the features at the separator positions of each segment of the association table are used to predict whether the column appears in the target SQL sequence (i.e., whether the column is relevant to the question) and which SQL operations it triggers in the target SQL sequence.
For example, consider the question "Show the cities that have more than 2 airports", with the corresponding SQL "SELECT city FROM airports GROUP BY city HAVING COUNT(*) > 2". For the "city" column, the triggered operation is "SELECT AND GROUP BY HAVING". In this embodiment there are 108 possible operation categories, numbered 0 to 107, where 0 means that the column does not appear in the SQL. Finally, the cross-entropy loss is computed for each column and summed to give the total SLP loss L2.
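The following sketch illustrates how a per-column SLP label of this kind might be constructed. The operation vocabulary shown is a small hypothetical subset of the 108 classes, and the clause-to-column mapping is assumed to come from an SQL parser; neither detail is specified in the patent.

```python
# Hedged sketch of SLP label construction: each column is mapped to the id of the
# operation combination it triggers, with 0 for columns absent from the SQL.
OP_CLASSES = {          # hypothetical subset of the 108 operation combinations
    "NONE": 0,
    "SELECT": 1,
    "WHERE": 2,
    "SELECT AND GROUP BY HAVING": 3,
}

def slp_labels(columns, clause_columns):
    # clause_columns: clause name -> set of column names it uses (from a SQL parse)
    labels = []
    for col in columns:
        triggered = [c for c, cols in clause_columns.items() if col in cols]
        key = " AND ".join(triggered) if triggered else "NONE"
        labels.append(OP_CLASSES.get(key, 0))
    return labels

# Example from the text: "SELECT city FROM airports GROUP BY city HAVING COUNT(*) > 2"
clause_columns = {"SELECT": {"city"}, "GROUP BY HAVING": {"city"}}
print(slp_labels(["city", "country"], clause_columns))   # [3, 0]
```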
In one embodiment of the invention, a two-layer fully connected network is used to extract features from the separator positions in the output sequence of the language model.
For the condition number prediction task (Count): when generating SQL, if the condition part is predicted incorrectly, the query result obtained by executing the SQL is certain to be wrong, so the Text2SQL task must ensure the prediction accuracy of the WHERE part as far as possible. The method predicts the number of conditions in the target SQL sequence from the feature of the start marker, and uses the cross-entropy between the predicted number of conditions and the actual number of conditions in the target SQL sequence as the loss value of the condition number prediction task. In this embodiment, the start marker is "[XLS]".
The sum of the loss functions of the three tasks is used as the total training loss L = L1 + L2 + L3, and the objective function is optimized with a gradient descent algorithm. The chain rule is used when optimizing the objective function, and the model parameters are updated as:
w'_j = w_j - α · ∂L(θ)/∂w_j
where L(θ) is the objective function, α denotes the learning rate, w_j is the parameter value in the language model before the update, and w'_j is the updated parameter value.
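This update is one step of (mini-batch) stochastic gradient descent; a minimal PyTorch illustration, with two made-up parameters and a stand-in loss, is:

```python
# One manual gradient-descent step matching w'_j = w_j - alpha * dL/dw_j.
import torch

w = torch.tensor([0.5, -1.2], requires_grad=True)   # two illustrative parameters
alpha = 0.01                                          # learning rate

loss = (w ** 2).sum()        # stand-in for L = L1 + L2 + L3
loss.backward()              # back-propagation via the chain rule gives dL/dw
with torch.no_grad():
    w -= alpha * w.grad      # parameter update
print(w)
```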
The pre-trained language model can be added directly to an existing Text2SQL model to provide the initial encoding of the question and the database schema.
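As a sketch of this plug-in use, assuming the saved TCBERT weights are stored in a HuggingFace-compatible checkpoint (the path and the format are assumptions, not stated in the patent), the encoder could be loaded and applied to a question-schema pair as follows:

```python
# Hedged sketch: using the saved TCBERT weights as a drop-in encoder.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./tcbert-checkpoint")   # hypothetical path
encoder = BertModel.from_pretrained("./tcbert-checkpoint")

question = "Show the cities that have more than 2 airports"
schema = "city text [SEP] country text [SEP] airport_count number"
inputs = tokenizer(question, schema, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # fed to the downstream SQL decoder
print(hidden.shape)
```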
Examples
The invention was compared on two large public datasets, WikiSQL and Spider. WikiSQL is the first manually labeled large-scale dataset in the Text2SQL field, containing 26,375 tables from Wikipedia and 87,726 natural language questions; each example includes, besides the original question, the corresponding SQL statement and the database schema.
Spider is a large-scale, complex, cross-domain Text2SQL dataset released by Yale University, containing 10,181 natural language questions, 5,693 distinct SQL queries and 200 cross-domain databases; the SQL statements cover almost all SQL constructs, such as "ORDER BY", "HAVING", "INTERSECT", "UNION" and "JOIN".
To train TCBERT, the invention collects natural language questions with their associated tables and SQL from the WDC WebTable corpus. WDC WebTable is a large-scale table dataset crawled from the Common Crawl website. The raw data contains considerable noise and must be cleaned and filtered, for example by removing HTML tags and hyperlinks from the samples. The final result is 300,000 high-quality query-SQL data samples with associated tables for pre-training.
The evaluation metric used by the invention is the exact match accuracy of the generated SQL: a model prediction is considered correct only if all segments in the SELECT and WHERE clauses match exactly. The overall improvement brought by TCBERT was compared on 6 mainstream Text2SQL models: IncSQL, XSQL and SQLova, ranked at the top of the WikiSQL leaderboard, and RAT-SQL, IRNet and EditSQL, ranked at the top of the Spider leaderboard. During testing, the structure of each Text2SQL model was kept unchanged and only the original encoder (such as word2vec or BERT) was uniformly replaced by the TCBERT encoder. The comparative results are shown in Tables 2 and 3. In addition, to demonstrate the effectiveness of the 3 proposed training tasks, the ablation experiments shown in Table 4 were performed.
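A simplified illustration of the exact match metric is given below; real evaluation scripts normalise clause order and values far more carefully, so this is only a sketch of the idea that a prediction counts as correct when its SELECT and WHERE segments match the gold SQL.

```python
# Simplified exact-match metric over SELECT and WHERE segments.
import re

def clause(sql, name):
    m = re.search(name + r"\s+(.*?)(?=\bFROM\b|\bWHERE\b|\bGROUP\b|\bORDER\b|$)",
                  sql, re.I | re.S)
    return set(p.strip().lower() for p in m.group(1).split(",")) if m else set()

def exact_match(pred_sql, gold_sql):
    return (clause(pred_sql, "SELECT") == clause(gold_sql, "SELECT")
            and clause(pred_sql, "WHERE") == clause(gold_sql, "WHERE"))

def exact_match_accuracy(pairs):
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)

pairs = [("SELECT name FROM t WHERE age > 30", "SELECT name FROM t WHERE age > 30"),
         ("SELECT name FROM t", "SELECT id FROM t")]
print(exact_match_accuracy(pairs))   # 0.5
```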
TABLE 2
TABLE 3
TABLE 4
As can be seen from Tables 2 and 3, the proposed pre-training method based on structured table schema parsing and sequence masking achieves the best result on the evaluation metric for every model on every task, with an especially marked improvement on the more difficult Spider dataset, fully demonstrating the superiority of the proposed algorithm.
In addition, the ablation comparison in Table 4 shows that the sequence masking task (MLM), the table schema parsing task (SLP) and the condition number prediction task (Count) each improve the final performance of the model to varying degrees, with the SLP task, designed around the characteristics of structured tables, giving the most significant improvement.
The foregoing merely illustrates specific embodiments of the invention. The invention is obviously not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can derive or suggest from the disclosure of the present invention are considered to fall within the scope of the invention.
Claims (9)
1. A language model pre-training method for table pattern analysis and sequence mask is characterized by comprising the following steps:
s1: giving a natural language question, an association table and a target SQL sequence, and respectively searching a unit value with the highest overlapping degree with the natural language question from each column of the association table;
s2: synthesizing a segment in the form of 'a natural language question sentence, column names and column types in an association table, and a unit value with the highest overlapping degree with the natural language question sentence', sequentially splicing all columns in the association table into a long sequence according to the sequence, adding an initial marker at the starting position of the long sequence, and separating a plurality of segments by using separators;
s3: carrying out random mask processing on the natural language question in the long sequence and the unit values in the association table, and replacing 10% of randomly extracted characters by mask characters;
s4: establishing a language model, taking the long sequence subjected to random mask processing in the step S3 as a pre-training data set of the language model, and jointly training the language model according to a mask sequence prediction task, a table mode analysis task and a condition number prediction task to obtain a TCBERT model; the method specifically comprises the following steps: embedding vector representation is carried out on the long sequence subjected to the random mask processing in the step S3 and then the long sequence is used as the input of the language model, and the predicted character corresponding to the position of the mask character, the predicted probability of each column name and the initial marker are obtained according to the output sequence of the language model; the prediction probability comprises whether the column name appears in the target SQL sequence and SQL operation triggered in the target SQL sequence; the starting marker is used for predicting the condition quantity in the target SQL sequence;
aiming at the mask sequence prediction task, taking a cross entropy function value between a predicted character and a real character as a loss value of the mask sequence prediction task;
aiming at the table mode analysis task, calculating cross entropy loss according to the prediction probability and the target SQL sequence, and taking the sum of the losses of all columns in the associated table as a loss value of the table mode analysis task;
predicting the condition quantity in the target SQL sequence according to the characteristics of the initial marker aiming at the condition quantity prediction task, and taking a cross entropy function value between the predicted condition quantity and the actual condition quantity in the target SQL sequence as a loss value of the condition quantity prediction task;
in the combined pre-training process, the sum of the loss functions of the three tasks is used as the total loss of model pre-training, and a target function is optimized by using a gradient descent algorithm; after the pre-training is finished, saving the model weight parameters;
s5: aiming at the Text2SQL task, the pre-trained TCBERT model is used as an initialization encoder in the Text2SQL model to perform initialization encoding of a natural language question and a database mode.
2. The method of claim 1, wherein the natural language question is a sequence Q = {q_1, …, q_|Q|} containing a series of characters, q_i denoting the i-th character of the question and |Q| the number of characters in the question; the association table contains column names, column types and unit values, the column names being denoted C = {c_1, …, c_|C|}, where c_i is the column name of the i-th column of the association table and |C| is the number of columns in the association table, each column name consisting of one or more characters, and the unit values corresponding to each column name being denoted v_i = {v_{i_1}, …, v_{i_|v_i|}}, where v_{i_k} denotes the k-th unit value corresponding to c_i and |v_i| denotes the number of unit values corresponding to c_i; the column types include two types, text and number.
3. The method for language model pre-training with table pattern parsing and sequence masking as claimed in claim 2, wherein the long sequence in step S2 is represented as X = "[XLS], Q, [SEP], c_1, c_1_type, v_1, [SEP], …, c_|C|, c_|C|_type, v_|C|, [SEP]", wherein [XLS] denotes the start marker, Q denotes the natural language question, [SEP] denotes the separator between segments, and c_i, c_i_type and v_i respectively denote the column name, the column type and the unit value with the highest degree of overlap with the natural language question for the i-th column of the association table.
4. The method for pre-training a language model for table pattern parsing and sequence masking as claimed in claim 1, wherein the language model established in step S4 is a 12-layer Transformer network structure.
5. The method for pre-training a language model through table pattern parsing and sequence masking as claimed in claim 1, wherein the optimization of the objective function in step S4 uses the chain rule, and the model parameters are updated as:
w'_j = w_j - α · ∂L(θ)/∂w_j
where L(θ) is the objective function, α denotes the learning rate, w_j is the parameter value in the language model before the update, and w'_j is the updated parameter value.
6. The method of pre-training a language model for table mode parsing and sequence masking according to claim 1, wherein for the table mode parsing task, the features corresponding to the positions of the separators of each segment of the associated table are used to predict whether the column appears in the target SQL sequence and the SQL operation triggered in the target SQL sequence, and the features are extracted for the separators in the output sequence of the language model by using a two-layer fully-connected network.
7. A language model pre-training system using table pattern parsing and sequence masking according to the method of claim 1, comprising:
the sequence representation module is used for extracting a unit value with the highest overlapping degree with the natural language question in the association table, converting the structured association table into a text sequence, splicing the text sequence with the given natural language question and outputting a masked long sequence;
the coder module is internally provided with a language model and is used for converting the masked long sequence into an embedded vector and carrying out depth coding on the embedded vector;
and the loss function calculation module is used for comparing the output of the encoder module with the target SQL sequence, respectively calculating the loss values of the mask sequence prediction task, the table mode analysis task and the condition number prediction task, summing the three loss values and optimizing the target function by using a gradient descent strategy.
8. The system for language model pre-training for table pattern parsing and sequence masking as claimed in claim 7, wherein said sequence representation module comprises:
the information acquisition module is used for searching out a unit value with the maximum overlapping degree with the natural language question in each column of the association table according to the given natural language question;
the sequence splicing module is used for synthesizing a segment according to the form of 'a natural language question sentence, a column name in an associated table, a column type and a unit value with the highest overlapping degree with the natural language question sentence', cutting the segment into characters respectively, sequentially splicing all the columns in the associated table into a long sequence according to the sequence, adding an initial marker at the starting position of the long sequence, and separating a plurality of segments by using separators;
and the sequence mask module is used for carrying out random mask processing on the natural language question in the long sequence and the unit values in the association table, and each randomly extracted 10% of characters is replaced by a mask character.
9. The system for language model pre-training for table pattern parsing and sequence masking as claimed in claim 7, wherein said encoder module comprises:
the vector embedding module is used for cutting the masked long sequence according to characters, converting the long sequence into vector codes at a character level through a word vector matrix, and obtaining position embedded codes and sequence embedded codes according to the position and the serial number of each character in the long sequence; summing the vector coding, the position embedding coding and the sequence embedding coding of the character level, and converting the original masked long sequence into an embedded vector with a fixed size;
and the Transformer network module adopts 12 layers of Transformer networks to carry out depth coding on the embedded vector, and takes the output of the last layer of Transformer network as the final output of the encoder module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110210906.6A CN112559556B (en) | 2021-02-25 | 2021-02-25 | Language model pre-training method and system for table mode analysis and sequence mask |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110210906.6A CN112559556B (en) | 2021-02-25 | 2021-02-25 | Language model pre-training method and system for table mode analysis and sequence mask |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112559556A CN112559556A (en) | 2021-03-26 |
CN112559556B true CN112559556B (en) | 2021-05-25 |
Family
ID=75034763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110210906.6A Active CN112559556B (en) | 2021-02-25 | 2021-02-25 | Language model pre-training method and system for table mode analysis and sequence mask |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112559556B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112925794B (en) * | 2021-04-02 | 2022-09-16 | 中国人民解放军国防科技大学 | Complex multi-table SQL generation method and device based on bridging filling |
CN113011136B (en) * | 2021-04-02 | 2022-09-16 | 中国人民解放军国防科技大学 | SQL (structured query language) analysis method and device based on correlation judgment and computer equipment |
CN112988785B (en) * | 2021-05-10 | 2021-08-20 | 浙江大学 | SQL conversion method and system based on language model coding and multitask decoding |
CN113673236A (en) * | 2021-07-15 | 2021-11-19 | 北京三快在线科技有限公司 | Model training method, table recognition method, device, electronic equipment and storage medium |
CN113591475B (en) * | 2021-08-03 | 2023-07-21 | 美的集团(上海)有限公司 | Method and device for unsupervised interpretable word segmentation and electronic equipment |
CN113705652B (en) * | 2021-08-23 | 2024-05-28 | 西安交通大学 | Task type dialogue state tracking system and method based on pointer generation network |
CN113986958B (en) * | 2021-11-10 | 2024-02-09 | 北京有竹居网络技术有限公司 | Text information conversion method and device, readable medium and electronic equipment |
CN114281968B (en) * | 2021-12-20 | 2023-02-28 | 北京百度网讯科技有限公司 | Model training and corpus generation method, device, equipment and storage medium |
CN114942937B (en) * | 2022-04-18 | 2024-07-19 | 江苏方天电力技术有限公司 | Noisy NL2SQL method and device based on restriction constraint |
CN114528394B (en) * | 2022-04-22 | 2022-08-26 | 杭州费尔斯通科技有限公司 | Text triple extraction method and device based on mask language model |
CN114579606B (en) * | 2022-05-05 | 2022-07-29 | 阿里巴巴达摩院(杭州)科技有限公司 | Pre-training model data processing method, electronic device and computer storage medium |
CN114897163A (en) * | 2022-05-23 | 2022-08-12 | 阿里巴巴(中国)有限公司 | Pre-training model data processing method, electronic device and computer storage medium |
CN115203236B (en) * | 2022-07-15 | 2023-05-12 | 哈尔滨工业大学 | text-to-SQL generating method based on template retrieval |
CN115081428B (en) * | 2022-07-22 | 2022-11-29 | 粤港澳大湾区数字经济研究院(福田) | Method for processing natural language, natural language processing model and equipment |
CN115438183B (en) * | 2022-08-31 | 2023-07-04 | 广州宝立科技有限公司 | Business website monitoring system based on natural language processing |
CN115408506B (en) * | 2022-09-15 | 2023-09-12 | 云南大学 | NL2SQL method combining semantic analysis and semantic component matching |
CN116451708B (en) * | 2023-03-16 | 2024-10-18 | 苏州大学 | Text prediction method and system based on self-adaptive mask strategy and electronic equipment |
CN118246409B (en) * | 2024-05-28 | 2024-08-27 | 珠海金山办公软件有限公司 | Programming statement generation method and device, electronic equipment and storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200334252A1 (en) * | 2019-04-18 | 2020-10-22 | Sap Se | Clause-wise text-to-sql generation |
US20200334233A1 (en) * | 2019-04-18 | 2020-10-22 | Sap Se | One-shot learning for text-to-sql |
CN111625554A (en) * | 2020-07-30 | 2020-09-04 | 武大吉奥信息技术有限公司 | Data query method and device based on deep learning semantic understanding |
CN112287093A (en) * | 2020-12-02 | 2021-01-29 | 上海交通大学 | Automatic question-answering system based on semi-supervised learning and Text-to-SQL model |
Non-Patent Citations (2)
Title |
---|
F-SQL: Fuse Table Schema and Table Content for Single-Table Text2SQL Generation; Xiaoyu Zhang et al.; IEEE Access, vol. 8, pp. 136409-136420; 2020-07-24 *
M-SQL: Multi-Task Representation Learning for Single-Table Text2sql Generation; Xiaoyu Zhang et al.; IEEE Access, vol. 8, pp. 43156-43167; 2020-03-02 *
Also Published As
Publication number | Publication date |
---|---|
CN112559556A (en) | 2021-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112559556B (en) | Language model pre-training method and system for table mode analysis and sequence mask | |
CN111310438B (en) | Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
WO2021114745A1 (en) | Named entity recognition method employing affix perception for use in social media | |
CN114020862B (en) | Search type intelligent question-answering system and method for coal mine safety regulations | |
CN110134946B (en) | Machine reading understanding method for complex data | |
CN110737763A (en) | Chinese intelligent question-answering system and method integrating knowledge map and deep learning | |
CN113806563B (en) | Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material | |
CN111858932A (en) | Multiple-feature Chinese and English emotion classification method and system based on Transformer | |
CN106202010A (en) | The method and apparatus building Law Text syntax tree based on deep neural network | |
CN110390049B (en) | Automatic answer generation method for software development questions | |
CN103189860A (en) | Machine translation device and machine translation method in which a syntax conversion model and a vocabulary conversion model are combined | |
CN116821168B (en) | Improved NL2SQL method based on large language model | |
CN112256847B (en) | Knowledge base question-answering method integrating fact texts | |
CN110765755A (en) | Semantic similarity feature extraction method based on double selection gates | |
CN111026941A (en) | Intelligent query method for demonstration and evaluation of equipment system | |
CN111324691A (en) | Intelligent question-answering method for minority nationality field based on knowledge graph | |
CN116680377B (en) | Chinese medical term self-adaptive alignment method based on log feedback | |
CN113761890A (en) | BERT context sensing-based multi-level semantic information retrieval method | |
CN111966810A (en) | Question-answer pair ordering method for question-answer system | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112732888A (en) | Answer prediction method and device based on graph reasoning model | |
CN115730078A (en) | Event knowledge graph construction method and device for class case retrieval and electronic equipment | |
CN114757184A (en) | Method and system for realizing knowledge question answering in aviation field | |
CN115171870A (en) | Diagnosis guiding and prompting method and system based on m-BERT pre-training model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||