CN112287093B - Automatic question-answering system based on semi-supervised learning and Text-to-SQL model - Google Patents

Publication number
CN112287093B
CN112287093B
Authority
CN
China
Prior art keywords: sql, model, text, question, col
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011391296.6A
Other languages
Chinese (zh)
Other versions
CN112287093A (en)
Inventor
罗宇侠
饶若楠
Current Assignee: Shanghai Jiaotong University
Original Assignee: Shanghai Jiaotong University
Priority date
Application filed by Shanghai Jiaotong University
Priority to CN202011391296.6A
Publication of CN112287093A
Application granted
Publication of CN112287093B
Legal status: Active

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems (under G06F16/33, querying of unstructured textual data)
    • G06F16/24564: Applying rules; deductive queries (under G06F16/2455, query execution over structured data)
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08: Neural networks; learning methods


Abstract

An automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, comprising a massive sample generation unit, a model training unit and a model compression unit. The model training unit adopts a semi-supervised learning method to train a Text-to-SQL model on the combination of manually labeled samples and automatically generated samples, converting questions posed by users into SQL, and the model compression unit compresses the trained Text-to-SQL model. The invention converts the user's question into SQL and then queries the database with that SQL, so that the answer to the question is returned to the user directly, greatly reducing the difficulty of obtaining information and improving the efficiency of data retrieval.

Description

Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
Technical Field
The invention relates to a technology in the field of information processing, and in particular to an automatic question-answering system (Semi-Supervised learning and Text-to-SQL question answering, SST2SQLQA) based on semi-supervised learning and Text-to-Structured-Query-Language (Text-to-SQL) technology.
Background
Existing automatic question-answering systems include: FAQ-based question answering, which, given a user's question, directly returns the answer of the most similar question in a frequently-asked-question set; knowledge-graph-based question answering, which performs semantic understanding and analysis of the user's question and then queries and reasons over an established knowledge graph to obtain the answer; and reading-comprehension-based question answering, which, given a document library, understands the user's question and locates the fine-grained text span in the documents that answers it. However, these techniques target unstructured data; the questions they can answer are relatively simple, they are relatively inflexible, and they are relatively costly to build and maintain.
For a structured database, Text-to-SQL technology can directly convert a user's question into SQL and then query the database with that SQL, so the answer is returned to the user directly, achieving automatic question answering. Currently, most Text-to-SQL research builds a Text-to-SQL model on top of a pre-trained language model and trains it on a large number of labeled <question, SQL> samples, obtaining a model that directly converts questions into SQL. In practical applications, however, large-scale labeled samples are difficult to obtain, so a Text-to-SQL model needs to be trained on a small number of samples by semi-supervised learning, with accuracy comparable to a model trained on a large number of samples. In addition, Text-to-SQL models proposed so far do not fully exploit the implicit information in the question, have low accuracy, large parameter counts and slow inference, which hinders engineering practice.
Disclosure of Invention
Aiming at the difficulties in the prior art, namely that Text-to-SQL technology has not been used to build an automatic question-answering system, that accuracy is low, that a large number of samples must be labeled, and that inference is slow, the invention provides an automatic question-answering system based on semi-supervised learning and a Text-to-SQL model: the user's question is converted into SQL, the database is queried with that SQL, and the answer is returned to the user directly, greatly reducing the difficulty of obtaining information and improving the efficiency of data retrieval.
The invention is realized by the following technical scheme:
the invention relates to an automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, which comprises: the device comprises a mass sample generation unit, a model training unit and a model compression unit, wherein: the model training unit adopts a semi-supervised learning method to train a Text-to-SQL model by combining the manually marked sample and the automatically generated sample, converts the question proposed by the user into the SQL, and the model compression unit compresses the trained Text-to-SQL model.
The Text-to-SQL model adopts a Sequence-to-Slot architecture and specifically comprises: several sub-modules, each predicting one sub-part of the SQL, and an entity category encoding module that identifies the database category to which each entity in the question corresponds, wherein: the entity category encoding module feeds the identified categories back into the sub-modules as additional input, thereby encoding the entity categories.
The massive sample generation unit comprises a question generation model training module, an acquisition module and a sample generation module, wherein: the question generation model converts SQL into a question; the acquisition module generates SQL by sampling with a built-in SQL sampler; combining the SQL sampler and the question generation model automatically produces machine-labeled <question, SQL> samples, so that the Text-to-SQL model can be trained on both manually labeled and machine-labeled samples in a semi-supervised fashion; and the sample generation module automatically generates the <question, SQL> samples required for training the Text-to-SQL model from the generated SQL statements and their corresponding questions.
The model training unit comprises a preprocessing module and a Text-to-SQL model training module, wherein: the preprocessing module concatenates the column names of a database table with the special separator "[SEP]" to obtain a database column sequence c and performs primary word segmentation on the question q and the column sequence c with a BasicTokenizer; the Text-to-SQL model training module splits SQL into several sub-parts, predicts each sub-part with a different sub-module of the Text-to-SQL model, and then combines the predictions into a complete SQL statement.
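As a concrete illustration of this preprocessing step, the following Python sketch (the function name and token-list representation are hypothetical, not from the patent) builds the concatenated input sequence from question tokens, column-name tokens and entity-category tokens:

```python
def build_input(question_tokens, column_tokens, category_tokens):
    # hypothetical sketch: "[CLS]" question [SEP] col1 [SEP] col2 ... [SEP] cat1 ...
    seq = ["[CLS]"] + question_tokens
    for item in column_tokens + category_tokens:
        seq = seq + ["[SEP]"] + item
    return seq
```

For example, `build_input(["max", "price"], [["city"], ["price"]], [["city"]])` yields a single flat sequence ready for tokenization and encoding.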
The primary word segmentation means: sequentially converting the question q and the database column sequence c to Unicode, removing unknown characters, handling Chinese characters, splitting on whitespace, removing redundant characters and punctuation, and splitting on whitespace again; the resulting tokens are then further split by a WordpieceTokenizer into subword pieces (continuation pieces begin with "##"); finally each piece is converted into an integer index number (ID) through the BERT vocabulary mapping table.
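The WordPiece step described above can be sketched as greedy longest-match-first segmentation, as in BERT's WordpieceTokenizer; the helper names below are hypothetical:

```python
def wordpiece_tokenize(word, vocab):
    # greedy longest-match-first segmentation into subword pieces;
    # continuation pieces are prefixed with "##"
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:          # the word cannot be covered by the vocabulary
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

def to_ids(pieces, vocab_index):
    # map each piece to its integer ID through the vocabulary table
    return [vocab_index[p] for p in pieces]
```

With a toy vocabulary `{"quest": 0, "##ion": 1}`, the word "question" segments into `["quest", "##ion"]` and maps to IDs `[0, 1]`.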
The sub-modules for predicting the sub-parts of SQL comprise:
    • an S-NUM sub-module, predicting the number of columns selected in the select clause;
    • an S-COL sub-module, predicting the selected columns, corresponding to scol1, scol2, ...;
    • an S-AGG sub-module, predicting the aggregation function on each selected column, corresponding to agg1, agg2, ...;
    • a W-NUM sub-module, predicting the number of columns constrained in the where clause;
    • a W-COL sub-module, predicting the constrained columns, corresponding to wcol1, wcol2, ...;
    • a W-OP sub-module, predicting the comparator of each constrained column, corresponding to op1, op2, ...;
    • a W-VAL sub-module, predicting the constraint value of each column in the where clause;
    • a W-OP-COL sub-module, predicting the connector between the columns of the where clause.
The aggregation function is "MAX", "MIN", "AVERAGE", "SUM", "NONE" or "COUNT".
The comparator is ">", "<" or "=".
The connector is "AND" or "OR".
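Assuming the sub-modules' outputs are available, a minimal sketch of combining them into a complete SQL statement might look like this (the prediction-dictionary layout is an assumption for illustration, not the patent's interface):

```python
def assemble_sql(table, pred):
    # pred holds hypothetical sub-module outputs: selected columns and their
    # aggregations, where-clause columns, comparators, values, and connector
    selects = [f"{a}({c})" if a and a != "NONE" else c
               for c, a in zip(pred["s_col"], pred["s_agg"])]
    sql = f"SELECT {', '.join(selects)} FROM {table}"
    conds = [f"{c} {o} '{v}'"
             for c, o, v in zip(pred["w_col"], pred["w_op"], pred["w_val"])]
    if conds:
        sql += " WHERE " + f" {pred['w_conn']} ".join(conds)
    return sql
```

For instance, predictions {MAX on price; city = Shanghai} assemble into `SELECT MAX(price) FROM house WHERE city = 'Shanghai'`.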
The compression means: the model compression unit takes the built-in simplified Text-to-SQL model, Distill Text-to-SQL, as the Student model and the trained Text-to-SQL model output by the model training unit as the Teacher model, and trains the Student model by knowledge distillation, i.e. the output of the Teacher model is used as the label, and during training the output of the Student model is driven as close as possible to the output of the Teacher model.
The simplified model Distill Text-to-SQL is likewise built from a pre-trained language model and a downstream model, but its pre-trained language model has fewer layers, so Distill Text-to-SQL has fewer parameters and faster inference.
Technical effects
The invention as a whole addresses the problems of low accuracy, the need for a large number of labeled samples, and slow Text-to-SQL inference when building an automatic question-answering system over a structured database with Text-to-SQL technology.
Compared with the prior art, the entity category encoding module added to the Text-to-SQL model improves model accuracy by 3%; semi-supervised learning on a mixture of machine-labeled and manually labeled samples achieves the same effect as full-sample training with only 10% of the manually labeled samples; and the two-stage knowledge distillation compresses the Text-to-SQL model, improving inference speed by 50% while reducing accuracy by only 1%. One-stage knowledge distillation reduces accuracy by 2%, so the two-stage distillation proposed by the invention loses less accuracy than traditional one-stage distillation.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic diagram of a mass sample generation unit;
FIG. 3 is a diagram of a question generation model of a massive sample generation unit;
FIG. 4 is a diagram of a Text-to-SQL model training unit according to the present invention;
FIG. 5 is a diagram of a Text-to-SQL model related to the Text-to-SQL model training unit according to the present invention;
FIG. 6 is a schematic diagram of a model compression unit;
FIG. 7 is a diagram of DistillText-to-SQL model related to the model compression unit.
Detailed Description
As shown in fig. 1, this embodiment relates to an automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, comprising a massive sample generation unit, a model training unit and a model compression unit, wherein: the model training unit adopts a semi-supervised learning method to train a Text-to-SQL model on the combination of manually labeled samples and automatically generated machine-labeled samples, converting questions posed by users into SQL, and the model compression unit compresses the trained Text-to-SQL model.
As shown in fig. 2, the SQL sampler in the acquisition module automatically generates many different SQL statements for a given database table in a rule-based manner, so massive SQL is produced by feeding existing database tables into the acquisition module. The question generation model training module then trains a question generation model on a small set of <question, SQL> samples, with the SQL as the training feature and the question as the training label, so that the model can automatically generate a question for a given SQL. Finally, the SQL generated by the acquisition module is fed into the question generation model to obtain the corresponding questions, automatically yielding massive <question, SQL> samples.
The SQL sampler generates SQL as follows: according to the fixed SQL format, i.e. SELECT agg1_op(agg1_col), agg2_op(agg2_col), ... FROM table WHERE cond1_col cond1_op cond1 AND cond2_col ..., a concrete SQL statement over a table is obtained by the rules below, with agg_col and cond_col being arbitrary columns of the table. When some condition triple condk_col condk_op condk does not affect the final execution result, that triple may be removed; in addition, a sampled SQL statement is kept only when its execution result is not the empty set.
The rules are: the numbers of agg_col and cond_col are generated randomly, but the number of agg_col is always greater than 0 and the number of cond_col is greater than or equal to 0; agg_op is COUNT or the empty (no-op) aggregation, and when agg_col is a numeric column agg_op may also be MAX or MIN; cond_op is "=", and when cond_col is a numeric column cond_op may also be ">" or "<"; cond is any value of the cond_col column in the database, and when cond_col is a numeric column, cond is any value between the minimum and the maximum of that column.
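A minimal Python sketch of such a rule-based SQL sampler, under the assumptions that column values are available as Python lists and that the empty-result-set filtering happens elsewhere (all helper names are illustrative):

```python
import random

def sample_sql(table, columns, numeric_cols, values):
    # rule-based sampler sketch: pick aggregated columns (at least one) and
    # condition columns (possibly none) following the rules in the text
    n_agg = random.randint(1, min(2, len(columns)))      # count of agg_col > 0
    n_cond = random.randint(0, min(2, len(columns)))     # count of cond_col >= 0
    selects = []
    for col in random.sample(columns, n_agg):
        ops = ["", "COUNT"] + (["MAX", "MIN"] if col in numeric_cols else [])
        op = random.choice(ops)                          # "" = no aggregation
        selects.append(f"{op}({col})" if op else col)
    conds = []
    for col in random.sample(columns, n_cond):
        op = random.choice([">", "<", "="]) if col in numeric_cols else "="
        conds.append(f"{col} {op} {random.choice(values[col])!r}")
    sql = f"SELECT {', '.join(selects)} FROM {table}"
    if conds:
        sql += " WHERE " + " AND ".join(conds)
    return sql
```

Each call yields one candidate SQL statement; in the patent's pipeline these candidates would additionally be executed and dropped if the result set is empty.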
The question generation model comprises a word embedding layer, an LSTM encoding layer and an LSTM decoding layer. As shown in FIG. 3, given an input SQL statement [s1, s2, s3, ..., sn], the word embedding layer first embeds each token of the SQL statement to obtain E = [e_1, e_2, ..., e_n]; the LSTM encoding layer then encodes E to obtain the hidden state of each time step (token) on the encoder side corresponding to the input SQL, H = [h_1, h_2, ..., h_n], and the decoder-side LSTM is initialized with the hidden state h_n of the last time step.
On the decoder side, a token representing the beginning of the output sequence is used as the first decoder input; after word embedding, the LSTM decoding layer produces the hidden state of the current time step, h_t. A matrix W of dimension dim × D is then applied, where D is the size of the whole vocabulary, to obtain z_t = h_t · W; a softmax layer then yields the probability of each word of the vocabulary at this time step, P_t = softmax(z_t). The word with the highest probability, o_t = argmax(P_t), is taken as the output of this time step, and the output o_t is fed as the input i_{t+1} of the next time step, and so on until some time step outputs the token indicating the end or the output sequence exceeds a given length.
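The greedy decoding loop described above can be sketched as follows; `step` stands in for the word-embedding plus LSTM-decoder computation of one time step and is a hypothetical interface:

```python
def greedy_decode(step, bos_id, eos_id, max_len):
    # step(token_id, state) -> (P_t, new_state): one decoder time step,
    # returning a probability distribution over the vocabulary
    out, tok, state = [], bos_id, None
    for _ in range(max_len):
        probs, state = step(tok, state)
        tok = max(range(len(probs)), key=probs.__getitem__)  # o_t = argmax(P_t)
        if tok == eos_id:          # end token terminates decoding
            break
        out.append(tok)            # o_t becomes the next input i_{t+1}
    return out
```

The loop stops either on the end token or when the output reaches the given maximum length, exactly the two termination conditions described above.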
The Text-to-SQL model training unit specifically comprises a preprocessing unit, an entity category identification unit and a gradient-descent training unit. As shown in fig. 4, the preprocessing module preprocesses the <question, SQL> samples and identifies the entities in the question; the entity category identification unit matches the entities against the database, and the Text-to-SQL model encodes the identified entity categories involved in the question through the entity category encoding module. The Text-to-SQL model is trained on the preprocessed training sample set with the BERT-Adam optimization algorithm at a learning rate of 0.00002 for at most 20 epochs, and training stops early when the accuracy of the model on the validation set no longer improves.
The preprocessing is: first the column names of the tables are concatenated with the special separator "[SEP]" to obtain the database column sequence c; then the question q, the column sequence c and the entity categories t involved in the question undergo primary word segmentation with a BasicTokenizer.
The entity matching means: when an entity in the question matches a value in some column c1 of a database table, the category of that entity is taken to be c1.
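A minimal sketch of this entity matching, assuming the table is available as a column-to-values mapping (the function name is illustrative):

```python
def entity_categories(entities, table):
    # table: dict mapping column name -> list of cell values in that column;
    # an entity's category is the name of the first column containing it
    cats = {}
    for ent in entities:
        for col, cells in table.items():
            if ent in cells:
                cats[ent] = col
                break
    return cats
```

For example, the entity "Shanghai" found in the "city" column is assigned the category "city", which is then encoded alongside the question.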
The training sample set comprises: manually labeled samples and automatically generated samples.
As shown in FIG. 5, the Text-to-SQL model comprises a pre-trained language model layer and an SQL prediction layer. Given the question [q1, q2, ..., qn], all columns of the database table [c1, c2, ..., cm], and the entity categories involved in the question [t1, t2, ..., tl], they are concatenated, separated by "[SEP]" and prefixed with "[CLS]", and fed into the pre-trained language model BERT. BERT encodes this input into the corresponding outputs h_[CLS], h_q1, ..., h_qn, h_c1, ..., h_cm, h_t1, ..., h_tl, each a vector of dimension dim.
In the S-NUM sub-module, h_[CLS] is multiplied by a matrix W_snum of dimension dim × K to obtain P = softmax(h_[CLS] · W_snum), and argmax(P) predicts how many columns should be selected in the select clause of the SQL. In the W-NUM sub-module, h_[CLS] is likewise multiplied by a matrix W_wnum of dimension dim × K to obtain P = softmax(h_[CLS] · W_wnum), and argmax(P) predicts how many columns appear in the where clause. In the W-OP-COL sub-module, h_[CLS] is multiplied by a matrix W_wopcol of dimension dim × 2 to obtain P = softmax(h_[CLS] · W_wopcol), and argmax(P) predicts the connector between the columns of the where clause, either 'and' or 'or'.
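The three [CLS]-based classifier heads (S-NUM, W-NUM, W-OP-COL) share one pattern, which can be sketched with NumPy; the weight matrix shapes follow the reconstruction above and the helper names are assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D score vector
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cls_head(h_cls, W):
    # generic [CLS]-vector classification head:
    # P = softmax(h_[CLS] · W), prediction = argmax(P)
    return int(np.argmax(softmax(h_cls @ W)))
```

S-NUM and W-NUM use a dim × K matrix (K possible column counts); W-OP-COL uses dim × 2 for the 'and'/'or' choice.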
In the W-COL sub-module, the column outputs C = [h_c1, h_c2, ..., h_cm] are multiplied by a matrix W_wcol of dimension dim × 1 to obtain P = softmax(C · W_wcol), and the columns corresponding to the top-k values of P are taken as the columns selected by the where clause, where k is the output value of the W-NUM sub-module.
In the W-OP sub-module, the hidden vectors [h_wcol1, ..., h_wcolk] of the k columns selected by the W-COL sub-module are each multiplied by a matrix W_wop of dimension dim × 3 to obtain P = softmax(h_wcoli · W_wop), and argmax(P) predicts the comparator '>', '<' or '=' on each column selected by the where clause.
In the W-VAL sub-module, let Q = [h_q1, ..., h_qn] be the hidden vectors of the question and WC the hidden vector of a column selected by the W-COL sub-module. With matrices U_s, U_e, V_s, V_e of dimension dim × dim and W_s, W_e of dimension dim × 1, compute WVs = (Q · U_s + WC · V_s) · W_s and WVe = (Q · U_e + WC · V_e) · W_e. Then argmax(WVs) and argmax(WVe) are taken as the start and end positions in the question of the value corresponding to that column, and the words between the start and end positions are intercepted as the where-clause value of the column.
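A NumPy sketch of the W-VAL span computation, following the reconstructed formulas above (matrix names match that notation; the function itself is illustrative):

```python
import numpy as np

def wval_span(Q, wc, Us, Ue, Vs, Ve, Ws, We, tokens):
    # Q: (n, dim) question-token vectors; wc: (dim,) selected-column vector;
    # Us, Ue, Vs, Ve: (dim, dim); Ws, We: (dim,) projection vectors
    s = (Q @ Us + wc @ Vs) @ Ws   # WVs: one start score per question token
    e = (Q @ Ue + wc @ Ve) @ We   # WVe: one end score per question token
    i, j = int(np.argmax(s)), int(np.argmax(e))
    return tokens[i:j + 1]        # the value span for this where-clause column
```

The returned token span is the condition value cond for that column.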
In the S-COL sub-module, the column outputs C = [h_c1, h_c2, ..., h_cm] are multiplied by a matrix W_scol of dimension dim × 1 to obtain P = softmax(C · W_scol), and the columns corresponding to the top-d values of P are taken as the columns selected by the select clause, where d is the output value of the S-NUM sub-module.
In the S-AGG sub-module, the hidden vector h_scoli of each column selected by the S-COL sub-module is multiplied by a matrix W_sagg of dimension dim × 6 to obtain P = softmax(h_scoli · W_sagg), and argmax(P) gives the aggregation operation on that select-clause column: "MAX", "MIN", "AVERAGE", "SUM", "NONE" or "COUNT".
As shown in fig. 6, the compression means: the trained Text-to-SQL model is used as the Teacher model, a Student model with fewer network layers and fewer parameters is constructed, and knowledge is distilled into the Student model by a two-stage knowledge distillation scheme.
The two-stage knowledge distillation scheme is as follows:
In the first stage, knowledge distillation is applied only to the upstream (pre-trained language) model: for the same input X, the layer outputs of the Text-to-SQL upstream model are T = [T_1, T_2, ..., T_12] and the layer outputs of the Distill Text-to-SQL upstream model are S = [S_1, S_2, ..., S_6]; the difference between the Teacher and Student upstream outputs, Loss = Σ_i loss(T_{m(i)}, S_i), where m maps each Student layer to a Teacher layer, is minimized.
In the second stage, knowledge distillation is applied to the whole Text-to-SQL model: the Text-to-SQL model outputs Y, the Distill Text-to-SQL model outputs Y', and Loss = CrossEntropy(Y, Y') is minimized, so that the output of the Distill Text-to-SQL model keeps approaching that of the Text-to-SQL model, achieving the goal of replacing the Text-to-SQL model with the lighter Distill Text-to-SQL model.
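Under the assumption that each Student layer i is paired with Teacher layer 2i+1 (the text does not pin down the exact pairing, so this mapping is an assumption), the two distillation losses can be sketched with NumPy:

```python
import numpy as np

def stage1_loss(teacher_layers, student_layers):
    # first stage: distill only the upstream model; the pairing of
    # Student layer i with Teacher layer 2*i + 1 is an assumed mapping
    loss = 0.0
    for i, s in enumerate(student_layers):
        t = teacher_layers[2 * i + 1]
        loss += float(np.mean((t - s) ** 2))   # per-pair MSE, summed
    return loss

def stage2_loss(Y, Y_pred):
    # second stage: cross-entropy between Teacher output Y and Student output Y'
    return float(-np.sum(Y * np.log(Y_pred + 1e-12)))
```

With 12 Teacher layers and 6 Student layers this pairing matches every Student layer to one Teacher layer, and both losses go to zero as the Student output approaches the Teacher's.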
As shown in FIG. 7, the Distill Text-to-SQL model shares the architecture of the Text-to-SQL model in FIG. 5, except that the BERT pre-trained language model in Distill Text-to-SQL has fewer network layers: only 6, versus the 12 layers of the original Text-to-SQL (Teacher) model.
In this embodiment, an automatic question-answering system is built on Text-to-SQL technology: the user's question is converted into SQL, the database is queried with that SQL, and the answer is returned to the user directly, greatly reducing the difficulty of obtaining information and improving the efficiency of data retrieval. The embodiment solves the engineering problems of building an automatic question-answering system with Text-to-SQL technology: a Text-to-SQL model is trained on a small number of samples (thousands) by semi-supervised learning without much loss of accuracy, and compressing the trained Text-to-SQL model reduces its parameter count and raises its inference speed by more than 50% at the same compute, lowering the response time of the automatic question-answering system. Finally, the embodiment innovatively adds an entity category encoding module on top of existing Text-to-SQL research, encoding the categories of the entities involved in the question and fully exploiting the hidden information in the question, which raises the accuracy of the Text-to-SQL model by about 3%.
In concrete experiments on an Ubuntu 18.04 server with Python 3.6.8 and PyTorch 1.3.1, machine-labeled samples were generated automatically from thousands of manually labeled <question, SQL> samples; the manually and machine-labeled samples were then mixed to train a Text-to-SQL model, two-stage knowledge distillation produced a Distill Text-to-SQL model, and the Distill Text-to-SQL model was finally deployed in the automatic question-answering system. The experimental results are as follows: the Distill Text-to-SQL model converts questions into SQL with an accuracy of 83.6%, and the whole automatic question-answering system responds in 0.12 seconds.
Compared with the traditional Text-to-SQL model, the entity-category-encoding Text-to-SQL model proposed by the invention improves accuracy by 3%; combined with automatic sample generation, the semi-supervised learning method trains a high-accuracy model on only 10% of the samples; after two-stage knowledge distillation the model accuracy drops slightly (by 1%) while the parameter count drops by 50%; and the automatic question-answering system built on the distilled Distill Text-to-SQL model loses only 1% accuracy while its response speed improves by 45%.
The described embodiments may be modified in various ways by those skilled in the art without departing from the principle and spirit of the invention; the scope of the invention is defined by the appended claims and not by the described embodiments, and every implementation within that scope is covered by the present invention.

Claims (7)

1. An automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, characterized by comprising: a massive sample generation unit, a model training unit and a model compression unit, wherein: the model training unit adopts a semi-supervised learning method to train a Text-to-SQL model on the combination of manually labeled samples and automatically generated samples, converting questions posed by users into SQL, and the model compression unit compresses the trained Text-to-SQL model;
the Text-to-SQL model adopts a Sequence-to-Slot architecture and specifically comprises: several sub-modules, each predicting one sub-part of the SQL, and an entity category encoding module identifying the database category to which each entity in the question corresponds, wherein: the entity category encoding module feeds the categories back into the sub-modules as input, thereby encoding the entity categories; the sub-modules for predicting the sub-parts of SQL include: an S-NUM sub-module predicting the number of columns selected in the select clause; an S-COL sub-module predicting the selected columns, corresponding to scol1, scol2, ...; an S-AGG sub-module predicting the aggregation function on each selected column, corresponding to agg1, agg2, ...; a W-NUM sub-module predicting the number of columns constrained in the where clause; a W-COL sub-module predicting the constrained columns, corresponding to wcol1, wcol2, ...; a W-OP sub-module predicting the comparator of each constrained column, corresponding to op1, op2, ...; a W-VAL sub-module predicting the constraint value of each column in the where clause; and a W-OP-COL sub-module predicting the connector between the columns of the where clause;
the compression is that: the model compression unit takes a built-in Text-to-SQL simplified model DistillText-to-SQL as a student model, takes a trained Text-to-SQL model output by the model training unit as a Text model, trains the student model in a two-stage knowledge distillation mode, namely, takes the output of the Text model as a label, and enables the output of the student model to approach the output of the Text model as much as possible in the training process;
the two-stage knowledge distillation mode comprises the following steps:
in the first stage, knowledge distillation is applied only to the upstream model: for the same input X, the outputs of each layer of the Text-to-SQL (Teacher) upstream model are denoted h^T_1, h^T_2, ..., h^T_n, and the outputs of each layer of the DistillText-to-SQL (Student) upstream model are denoted h^S_1, h^S_2, ..., h^S_m; the difference between the Teacher and Student upstream outputs is minimized, i.e. Loss_1 = Σ_i MSE(h^T_g(i), h^S_i), where each Student layer i is aligned with a corresponding Teacher layer g(i);
in the second stage, knowledge distillation is applied to the whole Text-to-SQL model: the Text-to-SQL model outputs Y, the DistillText-to-SQL model outputs Y', and the loss between Y and Y', Loss_2 = CrossEntropy(Y, Y'), is minimized, so that the output of the DistillText-to-SQL model continuously approaches that of the Text-to-SQL model, thereby achieving the purpose of replacing the Text-to-SQL model with the lighter DistillText-to-SQL model.
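The two stages above can be sketched in PyTorch. This is a toy sketch under assumed architectures (small linear stacks, an arbitrary student-to-teacher layer mapping), not the patent's actual networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed toy upstream models: 4 teacher layers distilled into 2 student layers.
teacher = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
student = nn.ModuleList([nn.Linear(16, 16) for _ in range(2)])
teacher_head = nn.Linear(16, 8)   # downstream prediction head (Teacher)
student_head = nn.Linear(16, 8)   # downstream prediction head (Student)

def run(layers, x):
    """Return the output of every layer for input x."""
    outs = []
    for layer in layers:
        x = torch.relu(layer(x))
        outs.append(x)
    return outs

x = torch.randn(4, 16)            # the same input X for both models

# Stage 1: align each student upstream layer with a teacher layer (here 2i+1)
# and minimize the MSE between their outputs.
t_outs = [h.detach() for h in run(teacher, x)]   # teacher is frozen
s_outs = run(student, x)
loss1 = sum(F.mse_loss(s_outs[i], t_outs[2 * i + 1]) for i in range(len(s_outs)))

# Stage 2: minimize CrossEntropy(Y, Y') between the final outputs, treating the
# teacher's soft output Y as the label for the student's output Y'.
y_teacher = F.softmax(teacher_head(t_outs[-1]), dim=-1)       # Y
log_y_student = F.log_softmax(student_head(s_outs[-1]), dim=-1)  # log Y'
loss2 = -(y_teacher * log_y_student).sum(dim=-1).mean()

(loss1 + loss2).backward()
print(float(loss1) > 0 and float(loss2) > 0)
```

In practice the two losses would be minimized in separate training phases (first upstream only, then the whole model), as the claim describes; here they are computed once to show the shapes involved.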
2. The automatic question-answering system based on semi-supervised learning and a Text-to-SQL model as claimed in claim 1, wherein the question generation model converts SQL into questions; the acquisition module generates SQL through sampling by a built-in SQL sampler, and combines the SQL sampler with the question generation model to automatically produce machine-labeled &lt;question, SQL&gt; samples, so that the Text-to-SQL model is trained on both manually labeled and machine-labeled samples in a semi-supervised manner; and the sample generation module automatically generates the &lt;question, SQL&gt; samples required for training the Text-to-SQL model from the generated SQL statements and their corresponding questions.
3. The automatic question-answering system based on semi-supervised learning and a Text-to-SQL model as claimed in claim 1, wherein the question generation model is a Sequence-to-Sequence model: the input end encodes the SQL into a latent vector, and the output end decodes the latent vector to generate the corresponding question.
4. The semi-supervised learning and Text-to-SQL model based automatic question-answering system according to claim 1, wherein the model training unit comprises a preprocessing module and a Text-to-SQL model training module, wherein: the preprocessing module concatenates the column names of a database table with the special separator [SEP] to obtain a database table column sequence c, and performs primary word segmentation on the question q and the column sequence c with a BasicTokenizer; the Text-to-SQL model training module divides SQL into several sub-parts, predicts each sub-part with a different sub-module of the Text-to-SQL model, and then combines the predictions into a complete SQL statement.
5. The semi-supervised learning and Text-to-SQL model based automatic question-answering system according to claim 4, wherein the primary word segmentation is: the question q and the database table column sequence c are successively converted to Unicode, stripped of unknown characters, processed for Chinese, split on whitespace, stripped of redundant characters and punctuation, and split on whitespace again; the resulting words are then further split by a WordpieceTokenizer into sub-word pieces, and finally each piece is converted to an integer index number through the BERT vocabulary mapping table.
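The preprocessing pipeline of claims 4 and 5 can be sketched as follows. This is an illustrative sketch with a toy vocabulary (a real system would use the actual BERT vocabulary and tokenizers), showing the column names joined by [SEP], a simplified basic tokenization, and the final mapping of tokens to integer indices:

```python
import unicodedata

# Toy vocabulary standing in for the BERT vocabulary mapping table (assumed).
VOCAB = {"[UNK]": 0, "[SEP]": 1, "name": 2, "salary": 3,
         "who": 4, "earns": 5, "most": 6, "the": 7}

def basic_tokenize(text):
    """Simplified basic tokenization: Unicode-normalize, then split on whitespace."""
    text = unicodedata.normalize("NFC", text)
    return text.split()

def encode(question, columns):
    # Join the column names with the special separator to get the column sequence c.
    c = " [SEP] ".join(columns)
    tokens = basic_tokenize(question) + ["[SEP]"] + basic_tokenize(c)
    # Map each token to its integer index; unknown tokens fall back to [UNK].
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

ids = encode("who earns the most", ["name", "salary"])
print(ids)
# → [4, 5, 7, 6, 1, 2, 1, 3]
```

The real pipeline additionally handles Chinese characters, punctuation stripping and WordPiece sub-word splitting, which are omitted here for brevity.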
6. The semi-supervised learning and Text-to-SQL model based automatic question-answering system according to claim 1, wherein the different sub-modules for predicting the sub-parts of SQL comprise: an S-NUM sub-module for predicting the number of selected columns in the select clause; an S-COL sub-module, corresponding to scol1, scol2, ..., for predicting which columns the select clause selects; an S-AGG sub-module, corresponding to agg1, agg2, ..., for predicting the aggregation function applied to each selected column; a W-NUM sub-module for predicting the number of columns constrained by the where clause; a W-COL sub-module, corresponding to wcol1, wcol2, ..., for predicting the columns constrained by the where clause; a W-OP sub-module, corresponding to op1, op2, ..., for predicting the comparison operator of each constrained column; a W-VAL sub-module for predicting the constraint value of each constrained column; and a W-OP-COL sub-module for predicting the connectors between the conditions of the where clause.
7. The automatic question-answering system based on semi-supervised learning and a Text-to-SQL model according to claim 1, wherein the SQL sampler generates SQL by sampling as follows: according to the fixed format of SQL, namely SELECT agg1_op agg1_col, agg2_op agg2_col ... FROM table WHERE cond1_col cond1_op cond1 AND cond2_col ...; for a given table, a candidate SQL statement over that table is obtained according to the following rules: agg_col and cond_col are arbitrary columns of the table; when some condition condk_col condk_op condk of the SQL statement does not affect the final execution result, the triple determined by that condition is removed; and in addition, the SQL statement is kept only when its execution result is not an empty set.
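The non-empty-result filter in claim 7 can be sketched with an in-memory database. This is a hypothetical sketch (the table, columns and value pool are illustrative): a condition is sampled at random, and the candidate SQL is kept only if executing it returns at least one row:

```python
import random
import sqlite3

random.seed(3)

# Toy table standing in for a real database table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("ann", "sales", 50), ("bob", "sales", 60), ("eve", "hr", 70)])

COLUMNS = ["name", "dept", "salary"]
OPS = ["=", ">", "<"]

def sample_sql(max_tries=100):
    """Sample SELECT col FROM emp WHERE cond_col op value, keeping only
    statements whose execution result is not an empty set."""
    for _ in range(max_tries):
        sel = random.choice(COLUMNS)
        cond_col = "salary"                      # numeric column, for simplicity
        op = random.choice(OPS)
        val = random.choice([50, 60, 70])
        sql = f"SELECT {sel} FROM emp WHERE {cond_col} {op} {val}"
        if conn.execute(sql).fetchall():         # discard empty-result SQL
            return sql
    return None

sql = sample_sql()
print(sql)
```

A full sampler would also draw aggregation functions and multiple conditions, and drop any condition that does not change the execution result, as the claim specifies.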
CN202011391296.6A 2020-12-02 2020-12-02 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model Active CN112287093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011391296.6A CN112287093B (en) 2020-12-02 2020-12-02 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model

Publications (2)

Publication Number Publication Date
CN112287093A CN112287093A (en) 2021-01-29
CN112287093B true CN112287093B (en) 2022-08-12

Family

ID=74426036

Country Status (1)

Country Link
CN (1) CN112287093B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559556B (en) * 2021-02-25 2021-05-25 杭州一知智能科技有限公司 Language model pre-training method and system for table mode analysis and sequence mask
CN113204633B (en) * 2021-06-01 2022-12-30 吉林大学 Semantic matching distillation method and device
CN113609158A (en) * 2021-08-12 2021-11-05 国家电网有限公司大数据中心 SQL statement generation method, device, equipment and medium
CN114996424B (en) * 2022-06-01 2023-05-09 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815318A (en) * 2018-12-24 2019-05-28 平安科技(深圳)有限公司 The problems in question answering system answer querying method, system and computer equipment
CN110209787A (en) * 2019-05-29 2019-09-06 袁琦 A kind of intelligent answer method and system based on pet knowledge mapping
CN111382253A (en) * 2020-03-02 2020-07-07 苏州思必驰信息科技有限公司 Semantic parsing method and semantic parser

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10339453B2 (en) * 2013-12-23 2019-07-02 International Business Machines Corporation Automatically generating test/training questions and answers through pattern based analysis and natural language processing techniques on the given corpus for quick domain adaptation
US10552498B2 (en) * 2016-09-19 2020-02-04 International Business Machines Corporation Ground truth generation for machine learning based quality assessment of corpora
US10699215B2 (en) * 2016-11-16 2020-06-30 International Business Machines Corporation Self-training of question answering system using question profiles
US11023593B2 (en) * 2017-09-25 2021-06-01 International Business Machines Corporation Protecting cognitive systems from model stealing attacks
US11495332B2 (en) * 2017-12-28 2022-11-08 International Business Machines Corporation Automated prediction and answering of medical professional questions directed to patient based on EMR
CN110377275A (en) * 2019-07-18 2019-10-25 中汇信息技术(上海)有限公司 Automatically create the method and storage medium of application program
CN111639254A (en) * 2020-05-28 2020-09-08 华中科技大学 System and method for generating SPARQL query statement in medical field

Also Published As

Publication number Publication date
CN112287093A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN112287093B (en) Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN110532554B (en) Chinese abstract generation method, system and storage medium
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN111666764B (en) Automatic abstracting method and device based on XLNet
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN114780582A (en) Natural answer generating system and method based on form question and answer
CN107797986B (en) LSTM-CNN-based mixed corpus word segmentation method
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
CN118313382A (en) Small sample named entity recognition method and system based on feature pyramid
CN114611520A (en) Text abstract generating method
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN117332789A (en) Semantic analysis method and system for dialogue scene
CN115408506B (en) NL2SQL method combining semantic analysis and semantic component matching
CN114625759B (en) Model training method, intelligent question-answering method, device, medium and program product
CN111382583A (en) Chinese-Uygur name translation system with mixed multiple strategies
CN117493548A (en) Text classification method, training method and training device for model
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN115238705A (en) Semantic analysis result reordering method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant