CN112287093B - Automatic question-answering system based on semi-supervised learning and Text-to-SQL model - Google Patents

Publication number
CN112287093B
CN112287093B
Authority
CN
China
Prior art keywords: sql, model, text, question, col
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011391296.6A
Other languages
Chinese (zh)
Other versions
CN112287093A (en)
Inventor
罗宇侠
饶若楠
Current Assignee: Shanghai Jiaotong University
Original Assignee: Shanghai Jiaotong University
Priority date
Application filed by Shanghai Jiaotong University
Priority to CN202011391296.6A
Publication of CN112287093A
Application granted
Publication of CN112287093B
Legal status: Active

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems (under G06F16/33, querying of unstructured textual data)
    • G06F16/24564: Applying rules; deductive queries (under G06F16/2455, query execution over structured data)
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08: Neural networks; learning methods


Abstract

An automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, comprising a massive sample generation unit, a model training unit and a model compression unit. The model training unit adopts a semi-supervised learning method to train a Text-to-SQL model on the combination of manually labeled samples and automatically generated samples, converting questions posed by users into SQL, and the model compression unit compresses the trained Text-to-SQL model. The invention converts the user's question into SQL and then queries the database with that SQL, so that the answer to the question is returned to the user directly, greatly reducing the difficulty of obtaining information and improving the efficiency of data retrieval.

Description

Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
Technical Field
The invention relates to a technology in the field of information processing, and in particular to an automatic question-answering system (Semi-Supervised learning and Text-to-SQL question answering, SST2SQLQA) based on semi-supervised learning and Text-to-Structured-Query-Language (Text-to-SQL) technology.
Background
Existing automatic question-answering systems include: FAQ-based question answering, which, given a user's question, directly returns the answer of the most similar question in a frequently-asked-question set; knowledge-graph-based question answering, which performs semantic understanding and analysis of the user's question and then queries and reasons over an established knowledge graph to obtain the answer; and reading-comprehension-based question answering, which, given a document library, understands the user's question and locates the fine-grained text span in the documents that answers it. However, these techniques target unstructured data; the questions they can answer are relatively simple, they are relatively inflexible, and they are relatively costly to build and maintain.
For a structured database, Text-to-SQL technology can directly convert a user's question into SQL and then query the database with that SQL, so the answer is returned to the user directly, achieving automatic question answering. Currently, most Text-to-SQL research builds a Text-to-SQL model on top of a pre-trained language model and trains it on a large number of labeled <question, SQL> samples, obtaining a model that directly converts questions into SQL. In practical applications, however, large-scale labeled samples are difficult to obtain, so a Text-to-SQL model needs to be trained on a small number of samples by semi-supervised learning, with accuracy comparable to a model trained on a large number of samples. In addition, Text-to-SQL models proposed so far do not fully exploit the implicit information in the question, have low accuracy, large parameter counts and slow inference, which hinders engineering practice.
Disclosure of Invention
Aiming at the difficulties in the prior art, namely that Text-to-SQL technology has not been used to build an automatic question-answering system, that accuracy is low, that a large number of samples must be labeled, and that inference is slow, the invention provides an automatic question-answering system based on semi-supervised learning and a Text-to-SQL model: the user's question is converted into SQL, the database is queried with that SQL, and the answer is returned to the user directly, greatly reducing the difficulty of obtaining information and improving the efficiency of data retrieval.
The invention is realized by the following technical scheme:
the invention relates to an automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, which comprises: the device comprises a mass sample generation unit, a model training unit and a model compression unit, wherein: the model training unit adopts a semi-supervised learning method to train a Text-to-SQL model by combining the manually marked sample and the automatically generated sample, converts the question proposed by the user into the SQL, and the model compression unit compresses the trained Text-to-SQL model.
The Text-to-SQL model adopts a Sequence-to-Slot architecture and specifically comprises: several sub-modules, each predicting one sub-part of the SQL, and an entity category encoding module that identifies the database category to which each entity in the question corresponds, wherein: the entity category encoding module feeds the identified categories back into the sub-modules as additional input, thereby encoding the entity categories.
The massive sample generation unit comprises a question generation model training module, an acquisition module and a sample generation module, wherein: the question generation model converts SQL into a question; the acquisition module generates SQL by sampling with a built-in SQL sampler; combining the SQL sampler and the question generation model automatically produces machine-labeled <question, SQL> samples, so that the Text-to-SQL model can be trained on both manually labeled and machine-labeled samples in a semi-supervised fashion; and the sample generation module automatically generates the <question, SQL> samples required for training the Text-to-SQL model from the generated SQL statements and their corresponding questions.
The model training unit comprises a preprocessing module and a Text-to-SQL model training module, wherein: the preprocessing module concatenates the column names of a database table with the special separator "[SEP]" to obtain a database column sequence c and performs primary word segmentation on the question q and the column sequence c with a BasicTokenizer; the Text-to-SQL model training module splits SQL into several sub-parts, predicts each sub-part with a different sub-module of the Text-to-SQL model, and then combines the predictions into a complete SQL statement.
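As a concrete illustration of this preprocessing step, the following Python sketch (the function name and token-list representation are hypothetical, not from the patent) builds the concatenated input sequence from question tokens, column-name tokens and entity-category tokens:

```python
def build_input(question_tokens, column_tokens, category_tokens):
    # hypothetical sketch: "[CLS]" question [SEP] col1 [SEP] col2 ... [SEP] cat1 ...
    seq = ["[CLS]"] + question_tokens
    for item in column_tokens + category_tokens:
        seq = seq + ["[SEP]"] + item
    return seq
```

For example, `build_input(["max", "price"], [["city"], ["price"]], [["city"]])` yields a single flat sequence ready for tokenization and encoding.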
The primary word segmentation means: sequentially converting the question q and the database column sequence c to Unicode, removing unknown characters, handling Chinese characters, splitting on whitespace, removing redundant characters and punctuation, and splitting on whitespace again; the resulting tokens are then further split by a WordpieceTokenizer into subword pieces (continuation pieces begin with "##"); finally each piece is converted into an integer index number (ID) through the BERT vocabulary mapping table.
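The WordPiece step described above can be sketched as greedy longest-match-first segmentation, as in BERT's WordpieceTokenizer; the helper names below are hypothetical:

```python
def wordpiece_tokenize(word, vocab):
    # greedy longest-match-first segmentation into subword pieces;
    # continuation pieces are prefixed with "##"
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:          # the word cannot be covered by the vocabulary
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

def to_ids(pieces, vocab_index):
    # map each piece to its integer ID through the vocabulary table
    return [vocab_index[p] for p in pieces]
```

With a toy vocabulary `{"quest": 0, "##ion": 1}`, the word "question" segments into `["quest", "##ion"]` and maps to IDs `[0, 1]`.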
The sub-modules for predicting the sub-parts of SQL comprise:
    • an S-NUM sub-module, predicting the number of columns selected in the select clause;
    • an S-COL sub-module, predicting the selected columns, corresponding to scol1, scol2, ...;
    • an S-AGG sub-module, predicting the aggregation function on each selected column, corresponding to agg1, agg2, ...;
    • a W-NUM sub-module, predicting the number of columns constrained in the where clause;
    • a W-COL sub-module, predicting the constrained columns, corresponding to wcol1, wcol2, ...;
    • a W-OP sub-module, predicting the comparator of each constrained column, corresponding to op1, op2, ...;
    • a W-VAL sub-module, predicting the constraint value of each column in the where clause;
    • a W-OP-COL sub-module, predicting the connector between the columns of the where clause.
The aggregation function is "MAX", "MIN", "AVERAGE", "SUM", "NONE" or "COUNT".
The comparator is ">", "<" or "=".
The connector is "AND" or "OR".
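Assuming the sub-modules' outputs are available, a minimal sketch of combining them into a complete SQL statement might look like this (the prediction-dictionary layout is an assumption for illustration, not the patent's interface):

```python
def assemble_sql(table, pred):
    # pred holds hypothetical sub-module outputs: selected columns and their
    # aggregations, where-clause columns, comparators, values, and connector
    selects = [f"{a}({c})" if a and a != "NONE" else c
               for c, a in zip(pred["s_col"], pred["s_agg"])]
    sql = f"SELECT {', '.join(selects)} FROM {table}"
    conds = [f"{c} {o} '{v}'"
             for c, o, v in zip(pred["w_col"], pred["w_op"], pred["w_val"])]
    if conds:
        sql += " WHERE " + f" {pred['w_conn']} ".join(conds)
    return sql
```

For instance, predictions {MAX on price; city = Shanghai} assemble into `SELECT MAX(price) FROM house WHERE city = 'Shanghai'`.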
The compression means: the model compression unit takes the built-in simplified Text-to-SQL model, Distill Text-to-SQL, as the Student model and the trained Text-to-SQL model output by the model training unit as the Teacher model, and trains the Student model by knowledge distillation, i.e. the output of the Teacher model is used as the label, and during training the output of the Student model is driven as close as possible to the output of the Teacher model.
The simplified model Distill Text-to-SQL is likewise built from a pre-trained language model and a downstream model, but its pre-trained language model has fewer layers, so Distill Text-to-SQL has fewer parameters and faster inference.
Technical effects
The invention as a whole addresses the problems of low accuracy, the need for a large number of labeled samples, and slow Text-to-SQL inference when building an automatic question-answering system over a structured database with Text-to-SQL technology.
Compared with the prior art, the entity category encoding module added to the Text-to-SQL model improves model accuracy by 3%; semi-supervised learning on a mixture of machine-labeled and manually labeled samples achieves the same effect as full-sample training with only 10% of the manually labeled samples; and the two-stage knowledge distillation compresses the Text-to-SQL model, improving inference speed by 50% while reducing accuracy by only 1%. One-stage knowledge distillation reduces accuracy by 2%, so the two-stage distillation proposed by the invention loses less accuracy than traditional one-stage distillation.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a schematic diagram of a mass sample generation unit;
FIG. 3 is a diagram of a question generation model of a massive sample generation unit;
FIG. 4 is a diagram of a Text-to-SQL model training unit according to the present invention;
FIG. 5 is a diagram of a Text-to-SQL model related to the Text-to-SQL model training unit according to the present invention;
FIG. 6 is a schematic diagram of a model compression unit;
FIG. 7 is a diagram of DistillText-to-SQL model related to the model compression unit.
Detailed Description
As shown in fig. 1, this embodiment relates to an automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, comprising a massive sample generation unit, a model training unit and a model compression unit, wherein: the model training unit adopts a semi-supervised learning method to train a Text-to-SQL model on the combination of manually labeled samples and automatically generated machine-labeled samples, converting questions posed by users into SQL, and the model compression unit compresses the trained Text-to-SQL model.
As shown in fig. 2, the SQL sampler in the acquisition module automatically generates many different SQL statements for a given database table in a rule-based manner, so massive SQL is produced by feeding existing database tables into the acquisition module. The question generation model training module then trains a question generation model on a small set of <question, SQL> samples, with the SQL as the training feature and the question as the training label, so that the model can automatically generate a question for a given SQL. Finally, the SQL generated by the acquisition module is fed into the question generation model to obtain the corresponding questions, automatically yielding massive <question, SQL> samples.
The SQL sampler generates SQL as follows: according to the fixed SQL format, i.e. SELECT agg1_op(agg1_col), agg2_op(agg2_col), ... FROM table WHERE cond1_col cond1_op cond1 AND cond2_col ..., a concrete SQL statement over a table is obtained by the rules below, with agg_col and cond_col being arbitrary columns of the table. When some condition triple condk_col condk_op condk does not affect the final execution result, that triple may be removed; in addition, a sampled SQL statement is kept only when its execution result is not the empty set.
The rules are: the numbers of agg_col and cond_col are generated randomly, but the number of agg_col is always greater than 0 and the number of cond_col is greater than or equal to 0; agg_op is COUNT or the empty (no-op) aggregation, and when agg_col is a numeric column agg_op may also be MAX or MIN; cond_op is "=", and when cond_col is a numeric column cond_op may also be ">" or "<"; cond is any value of the cond_col column in the database, and when cond_col is a numeric column, cond is any value between the minimum and the maximum of that column.
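A minimal Python sketch of such a rule-based SQL sampler, under the assumptions that column values are available as Python lists and that the empty-result-set filtering happens elsewhere (all helper names are illustrative):

```python
import random

def sample_sql(table, columns, numeric_cols, values):
    # rule-based sampler sketch: pick aggregated columns (at least one) and
    # condition columns (possibly none) following the rules in the text
    n_agg = random.randint(1, min(2, len(columns)))      # count of agg_col > 0
    n_cond = random.randint(0, min(2, len(columns)))     # count of cond_col >= 0
    selects = []
    for col in random.sample(columns, n_agg):
        ops = ["", "COUNT"] + (["MAX", "MIN"] if col in numeric_cols else [])
        op = random.choice(ops)                          # "" = no aggregation
        selects.append(f"{op}({col})" if op else col)
    conds = []
    for col in random.sample(columns, n_cond):
        op = random.choice([">", "<", "="]) if col in numeric_cols else "="
        conds.append(f"{col} {op} {random.choice(values[col])!r}")
    sql = f"SELECT {', '.join(selects)} FROM {table}"
    if conds:
        sql += " WHERE " + " AND ".join(conds)
    return sql
```

Each call yields one candidate SQL statement; in the patent's pipeline these candidates would additionally be executed and dropped if the result set is empty.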
The question generation model comprises a word embedding layer, an LSTM encoding layer and an LSTM decoding layer. As shown in FIG. 3, given an input SQL statement [s1, s2, s3, ..., sn], the word embedding layer first embeds each token of the SQL statement to obtain E = [e_1, e_2, ..., e_n]; the LSTM encoding layer then encodes E to obtain the hidden state of each time step (token) on the encoder side corresponding to the input SQL, H = [h_1, h_2, ..., h_n], and the decoder-side LSTM is initialized with the hidden state h_n of the last time step.
On the decoder side, a token representing the beginning of the output sequence is used as the first decoder input; after word embedding, the LSTM decoding layer produces the hidden state of the current time step, h_t. A matrix W of dimension dim × D is then applied, where D is the size of the whole vocabulary, to obtain z_t = h_t · W; a softmax layer then yields the probability of each word of the vocabulary at this time step, P_t = softmax(z_t). The word with the highest probability, o_t = argmax(P_t), is taken as the output of this time step, and the output o_t is fed as the input i_{t+1} of the next time step, and so on until some time step outputs the token indicating the end or the output sequence exceeds a given length.
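The greedy decoding loop described above can be sketched as follows; `step` stands in for the word-embedding plus LSTM-decoder computation of one time step and is a hypothetical interface:

```python
def greedy_decode(step, bos_id, eos_id, max_len):
    # step(token_id, state) -> (P_t, new_state): one decoder time step,
    # returning a probability distribution over the vocabulary
    out, tok, state = [], bos_id, None
    for _ in range(max_len):
        probs, state = step(tok, state)
        tok = max(range(len(probs)), key=probs.__getitem__)  # o_t = argmax(P_t)
        if tok == eos_id:          # end token terminates decoding
            break
        out.append(tok)            # o_t becomes the next input i_{t+1}
    return out
```

The loop stops either on the end token or when the output reaches the given maximum length, exactly the two termination conditions described above.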
The Text-to-SQL model training unit specifically comprises a preprocessing unit, an entity category identification unit and a gradient-descent training unit. As shown in fig. 4, the preprocessing module preprocesses the <question, SQL> samples and identifies the entities in the question; the entity category identification unit matches the entities against the database, and the Text-to-SQL model encodes the identified entity categories involved in the question through the entity category encoding module. The Text-to-SQL model is trained on the preprocessed training sample set with the BERT-Adam optimization algorithm at a learning rate of 0.00002 for at most 20 epochs, and training stops early when the accuracy of the model on the validation set no longer improves.
The preprocessing is: first the column names of the tables are concatenated with the special separator "[SEP]" to obtain the database column sequence c; then the question q, the column sequence c and the entity categories t involved in the question undergo primary word segmentation with a BasicTokenizer.
The entity matching means: when an entity in the question matches a value in some column c1 of a database table, the category of that entity is taken to be c1.
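A minimal sketch of this entity matching, assuming the table is available as a column-to-values mapping (the function name is illustrative):

```python
def entity_categories(entities, table):
    # table: dict mapping column name -> list of cell values in that column;
    # an entity's category is the name of the first column containing it
    cats = {}
    for ent in entities:
        for col, cells in table.items():
            if ent in cells:
                cats[ent] = col
                break
    return cats
```

For example, the entity "Shanghai" found in the "city" column is assigned the category "city", which is then encoded alongside the question.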
The training sample set comprises: manually labeled samples and automatically generated samples.
As shown in FIG. 5, the Text-to-SQL model comprises a pre-trained language model layer and an SQL prediction layer. Given the question [q1, q2, ..., qn], all columns of the database table [c1, c2, ..., cm], and the entity categories involved in the question [t1, t2, ..., tl], they are concatenated, separated by "[SEP]" and prefixed with "[CLS]", and fed into the pre-trained language model BERT. BERT encodes this input into the corresponding outputs h_[CLS], h_q1, ..., h_qn, h_c1, ..., h_cm, h_t1, ..., h_tl, each a vector of dimension dim.
In the S-NUM sub-module, h_[CLS] is multiplied by a matrix W_snum of dimension dim × K to obtain P = softmax(h_[CLS] · W_snum), and argmax(P) predicts how many columns should be selected in the select clause of the SQL. In the W-NUM sub-module, h_[CLS] is likewise multiplied by a matrix W_wnum of dimension dim × K to obtain P = softmax(h_[CLS] · W_wnum), and argmax(P) predicts how many columns appear in the where clause. In the W-OP-COL sub-module, h_[CLS] is multiplied by a matrix W_wopcol of dimension dim × 2 to obtain P = softmax(h_[CLS] · W_wopcol), and argmax(P) predicts the connector between the columns of the where clause, either 'and' or 'or'.
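The three [CLS]-based classifier heads (S-NUM, W-NUM, W-OP-COL) share one pattern, which can be sketched with NumPy; the weight matrix shapes follow the reconstruction above and the helper names are assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D score vector
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cls_head(h_cls, W):
    # generic [CLS]-vector classification head:
    # P = softmax(h_[CLS] · W), prediction = argmax(P)
    return int(np.argmax(softmax(h_cls @ W)))
```

S-NUM and W-NUM use a dim × K matrix (K possible column counts); W-OP-COL uses dim × 2 for the 'and'/'or' choice.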
In the W-COL sub-module, the column outputs C = [h_c1, h_c2, ..., h_cm] are multiplied by a matrix W_wcol of dimension dim × 1 to obtain P = softmax(C · W_wcol), and the columns corresponding to the top-k values of P are taken as the columns selected by the where clause, where k is the output value of the W-NUM sub-module.
In the W-OP sub-module, the hidden vectors [h_wcol1, ..., h_wcolk] of the k columns selected by the W-COL sub-module are each multiplied by a matrix W_wop of dimension dim × 3 to obtain P = softmax(h_wcoli · W_wop), and argmax(P) predicts the comparator '>', '<' or '=' on each column selected by the where clause.
In the W-VAL sub-module, let Q = [h_q1, ..., h_qn] be the hidden vectors of the question and WC the hidden vector of a column selected by the W-COL sub-module. With matrices U_s, U_e, V_s, V_e of dimension dim × dim and W_s, W_e of dimension dim × 1, compute WVs = (Q · U_s + WC · V_s) · W_s and WVe = (Q · U_e + WC · V_e) · W_e. Then argmax(WVs) and argmax(WVe) are taken as the start and end positions in the question of the value corresponding to that column, and the words between the start and end positions are intercepted as the where-clause value of the column.
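A NumPy sketch of the W-VAL span computation, following the reconstructed formulas above (matrix names match that notation; the function itself is illustrative):

```python
import numpy as np

def wval_span(Q, wc, Us, Ue, Vs, Ve, Ws, We, tokens):
    # Q: (n, dim) question-token vectors; wc: (dim,) selected-column vector;
    # Us, Ue, Vs, Ve: (dim, dim); Ws, We: (dim,) projection vectors
    s = (Q @ Us + wc @ Vs) @ Ws   # WVs: one start score per question token
    e = (Q @ Ue + wc @ Ve) @ We   # WVe: one end score per question token
    i, j = int(np.argmax(s)), int(np.argmax(e))
    return tokens[i:j + 1]        # the value span for this where-clause column
```

The returned token span is the condition value cond for that column.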
In the S-COL sub-module, the column outputs C = [h_c1, h_c2, ..., h_cm] are multiplied by a matrix W_scol of dimension dim × 1 to obtain P = softmax(C · W_scol), and the columns corresponding to the top-d values of P are taken as the columns selected by the select clause, where d is the output value of the S-NUM sub-module.
In the S-AGG sub-module, the hidden vector h_scoli of each column selected by the S-COL sub-module is multiplied by a matrix W_sagg of dimension dim × 6 to obtain P = softmax(h_scoli · W_sagg), and argmax(P) gives the aggregation operation on that select-clause column: "MAX", "MIN", "AVERAGE", "SUM", "NONE" or "COUNT".
As shown in fig. 6, the compression means: the trained Text-to-SQL model is used as the Teacher model, a Student model with fewer network layers and fewer parameters is constructed, and knowledge is distilled into the Student model by a two-stage knowledge distillation scheme.
The two-stage knowledge distillation scheme is as follows:
In the first stage, knowledge distillation is applied only to the upstream (pre-trained language) model: for the same input X, the layer outputs of the Text-to-SQL upstream model are T = [T_1, T_2, ..., T_12] and the layer outputs of the Distill Text-to-SQL upstream model are S = [S_1, S_2, ..., S_6]; the difference between the Teacher and Student upstream outputs, Loss = Σ_i loss(T_{m(i)}, S_i), where m maps each Student layer to a Teacher layer, is minimized.
In the second stage, knowledge distillation is applied to the whole Text-to-SQL model: the Text-to-SQL model outputs Y, the Distill Text-to-SQL model outputs Y', and Loss = CrossEntropy(Y, Y') is minimized, so that the output of the Distill Text-to-SQL model keeps approaching that of the Text-to-SQL model, achieving the goal of replacing the Text-to-SQL model with the lighter Distill Text-to-SQL model.
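Under the assumption that each Student layer i is paired with Teacher layer 2i+1 (the text does not pin down the exact pairing, so this mapping is an assumption), the two distillation losses can be sketched with NumPy:

```python
import numpy as np

def stage1_loss(teacher_layers, student_layers):
    # first stage: distill only the upstream model; the pairing of
    # Student layer i with Teacher layer 2*i + 1 is an assumed mapping
    loss = 0.0
    for i, s in enumerate(student_layers):
        t = teacher_layers[2 * i + 1]
        loss += float(np.mean((t - s) ** 2))   # per-pair MSE, summed
    return loss

def stage2_loss(Y, Y_pred):
    # second stage: cross-entropy between Teacher output Y and Student output Y'
    return float(-np.sum(Y * np.log(Y_pred + 1e-12)))
```

With 12 Teacher layers and 6 Student layers this pairing matches every Student layer to one Teacher layer, and both losses go to zero as the Student output approaches the Teacher's.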
As shown in FIG. 7, the Distill Text-to-SQL model shares the architecture of the Text-to-SQL model in FIG. 5, except that the BERT pre-trained language model in Distill Text-to-SQL has fewer network layers: only 6, versus the 12 layers of the original Text-to-SQL (Teacher) model.
In this embodiment, an automatic question-answering system is built on Text-to-SQL technology: the user's question is converted into SQL, the database is queried with that SQL, and the answer is returned to the user directly, greatly reducing the difficulty of obtaining information and improving the efficiency of data retrieval. The embodiment solves the engineering problems of building an automatic question-answering system with Text-to-SQL technology: a Text-to-SQL model is trained on a small number of samples (thousands) by semi-supervised learning without much loss of accuracy, and compressing the trained Text-to-SQL model reduces its parameter count and raises its inference speed by more than 50% at the same compute, lowering the response time of the automatic question-answering system. Finally, the embodiment innovatively adds an entity category encoding module on top of existing Text-to-SQL research, encoding the categories of the entities involved in the question and fully exploiting the hidden information in the question, which raises the accuracy of the Text-to-SQL model by about 3%.
In concrete experiments on an Ubuntu 18.04 server with Python 3.6.8 and PyTorch 1.3.1, machine-labeled samples were generated automatically from thousands of manually labeled <question, SQL> samples; the manually and machine-labeled samples were then mixed to train a Text-to-SQL model, two-stage knowledge distillation produced a Distill Text-to-SQL model, and the Distill Text-to-SQL model was finally deployed in the automatic question-answering system. The experimental results are as follows: the Distill Text-to-SQL model converts questions into SQL with an accuracy of 83.6%, and the whole automatic question-answering system responds in 0.12 seconds.
Compared with the traditional Text-to-SQL model, the entity-category-encoding Text-to-SQL model proposed by the invention improves accuracy by 3%; combined with automatic sample generation, the semi-supervised learning method trains a high-accuracy model on only 10% of the samples; after two-stage knowledge distillation the model accuracy drops slightly (by 1%) while the parameter count drops by 50%; and the automatic question-answering system built on the distilled Distill Text-to-SQL model loses only 1% accuracy while its response speed improves by 45%.
The described embodiments may be modified in various ways by those skilled in the art without departing from the principle and spirit of the invention; the scope of the invention is defined by the appended claims and not by the described embodiments, and every implementation within that scope is covered by the present invention.

Claims (7)

1. An automatic question-answering system based on semi-supervised learning and a Text-to-SQL model, characterized by comprising: a massive sample generation unit, a model training unit and a model compression unit, wherein: the model training unit adopts a semi-supervised learning method to train a Text-to-SQL model on the combination of manually labeled samples and automatically generated samples, converting questions posed by users into SQL, and the model compression unit compresses the trained Text-to-SQL model;
the Text-to-SQL model adopts a Sequence-to-Slot architecture and specifically comprises: several sub-modules, each predicting one sub-part of the SQL, and an entity category encoding module identifying the database category to which each entity in the question corresponds, wherein: the entity category encoding module feeds the categories back into the sub-modules as input, thereby encoding the entity categories; the sub-modules for predicting the sub-parts of SQL include: an S-NUM sub-module predicting the number of columns selected in the select clause; an S-COL sub-module predicting the selected columns, corresponding to scol1, scol2, ...; an S-AGG sub-module predicting the aggregation function on each selected column, corresponding to agg1, agg2, ...; a W-NUM sub-module predicting the number of columns constrained in the where clause; a W-COL sub-module predicting the constrained columns, corresponding to wcol1, wcol2, ...; a W-OP sub-module predicting the comparator of each constrained column, corresponding to op1, op2, ...; a W-VAL sub-module predicting the constraint value of each column in the where clause; and a W-OP-COL sub-module predicting the connector between the columns of the where clause;
the compression is that: the model compression unit takes a built-in Text-to-SQL simplified model DistillText-to-SQL as a student model, takes a trained Text-to-SQL model output by the model training unit as a Text model, trains the student model in a two-stage knowledge distillation mode, namely, takes the output of the Text model as a label, and enables the output of the student model to approach the output of the Text model as much as possible in the training process;
the two-stage knowledge distillation mode comprises the following steps:
in the first stage, knowledge distillation is applied only to the upstream model: for the same input X, the outputs of each layer of the Text-to-SQL (Teacher) upstream model are denoted h^T_1, h^T_2, ..., h^T_n, and the outputs of each layer of the DistillText-to-SQL (Student) upstream model are denoted h^S_1, h^S_2, ..., h^S_m; the difference between the Teacher and Student upstream outputs is minimized, i.e. Loss_1 = Σ_i MSE(h^T_g(i), h^S_i), where each Student layer i is aligned with a corresponding Teacher layer g(i);
in the second stage, knowledge distillation is applied to the whole Text-to-SQL model: the Text-to-SQL model outputs Y, the DistillText-to-SQL model outputs Y', and the loss between Y and Y', Loss_2 = CrossEntropy(Y, Y'), is minimized, so that the output of the DistillText-to-SQL model continuously approaches that of the Text-to-SQL model, thereby achieving the purpose of replacing the Text-to-SQL model with the lighter DistillText-to-SQL model.
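The two stages above can be sketched in PyTorch. This is a toy sketch under assumed architectures (small linear stacks, an arbitrary student-to-teacher layer mapping), not the patent's actual networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed toy upstream models: 4 teacher layers distilled into 2 student layers.
teacher = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
student = nn.ModuleList([nn.Linear(16, 16) for _ in range(2)])
teacher_head = nn.Linear(16, 8)   # downstream prediction head (Teacher)
student_head = nn.Linear(16, 8)   # downstream prediction head (Student)

def run(layers, x):
    """Return the output of every layer for input x."""
    outs = []
    for layer in layers:
        x = torch.relu(layer(x))
        outs.append(x)
    return outs

x = torch.randn(4, 16)            # the same input X for both models

# Stage 1: align each student upstream layer with a teacher layer (here 2i+1)
# and minimize the MSE between their outputs.
t_outs = [h.detach() for h in run(teacher, x)]   # teacher is frozen
s_outs = run(student, x)
loss1 = sum(F.mse_loss(s_outs[i], t_outs[2 * i + 1]) for i in range(len(s_outs)))

# Stage 2: minimize CrossEntropy(Y, Y') between the final outputs, treating the
# teacher's soft output Y as the label for the student's output Y'.
y_teacher = F.softmax(teacher_head(t_outs[-1]), dim=-1)       # Y
log_y_student = F.log_softmax(student_head(s_outs[-1]), dim=-1)  # log Y'
loss2 = -(y_teacher * log_y_student).sum(dim=-1).mean()

(loss1 + loss2).backward()
print(float(loss1) > 0 and float(loss2) > 0)
```

In practice the two losses would be minimized in separate training phases (first upstream only, then the whole model), as the claim describes; here they are computed once to show the shapes involved.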
2. The automatic question-answering system based on semi-supervised learning and a Text-to-SQL model as claimed in claim 1, wherein the question generation model converts SQL into questions; the acquisition module generates SQL through sampling by a built-in SQL sampler, and combines the SQL sampler with the question generation model to automatically produce machine-labeled &lt;question, SQL&gt; samples, so that the Text-to-SQL model is trained on both manually labeled and machine-labeled samples in a semi-supervised manner; and the sample generation module automatically generates the &lt;question, SQL&gt; samples required for training the Text-to-SQL model from the generated SQL statements and their corresponding questions.
3. The automatic question-answering system based on semi-supervised learning and a Text-to-SQL model as claimed in claim 1, wherein the question generation model is a Sequence-to-Sequence model: the input end encodes the SQL into a latent vector, and the output end decodes the latent vector to generate the corresponding question.
4. The semi-supervised learning and Text-to-SQL model based automatic question-answering system according to claim 1, wherein the model training unit comprises a preprocessing module and a Text-to-SQL model training module, wherein: the preprocessing module concatenates the column names of a database table with the special separator [SEP] to obtain a database table column sequence c, and performs primary word segmentation on the question q and the column sequence c with a BasicTokenizer; the Text-to-SQL model training module divides SQL into several sub-parts, predicts each sub-part with a different sub-module of the Text-to-SQL model, and then combines the predictions into a complete SQL statement.
5. The semi-supervised learning and Text-to-SQL model based automatic question-answering system according to claim 4, wherein the primary word segmentation is: the question q and the database table column sequence c are successively converted to Unicode, stripped of unknown characters, processed for Chinese, split on whitespace, stripped of redundant characters and punctuation, and split on whitespace again; the resulting words are then further split by a WordpieceTokenizer into sub-word pieces, and finally each piece is converted to an integer index number through the BERT vocabulary mapping table.
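The preprocessing pipeline of claims 4 and 5 can be sketched as follows. This is an illustrative sketch with a toy vocabulary (a real system would use the actual BERT vocabulary and tokenizers), showing the column names joined by [SEP], a simplified basic tokenization, and the final mapping of tokens to integer indices:

```python
import unicodedata

# Toy vocabulary standing in for the BERT vocabulary mapping table (assumed).
VOCAB = {"[UNK]": 0, "[SEP]": 1, "name": 2, "salary": 3,
         "who": 4, "earns": 5, "most": 6, "the": 7}

def basic_tokenize(text):
    """Simplified basic tokenization: Unicode-normalize, then split on whitespace."""
    text = unicodedata.normalize("NFC", text)
    return text.split()

def encode(question, columns):
    # Join the column names with the special separator to get the column sequence c.
    c = " [SEP] ".join(columns)
    tokens = basic_tokenize(question) + ["[SEP]"] + basic_tokenize(c)
    # Map each token to its integer index; unknown tokens fall back to [UNK].
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

ids = encode("who earns the most", ["name", "salary"])
print(ids)
# → [4, 5, 7, 6, 1, 2, 1, 3]
```

The real pipeline additionally handles Chinese characters, punctuation stripping and WordPiece sub-word splitting, which are omitted here for brevity.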
6. The semi-supervised learning and Text-to-SQL model based automatic question-answering system according to claim 1, wherein the different sub-modules for predicting the sub-parts of SQL comprise: an S-NUM sub-module for predicting the number of selected columns in the select clause; an S-COL sub-module, corresponding to scol1, scol2, ..., for predicting which columns the select clause selects; an S-AGG sub-module, corresponding to agg1, agg2, ..., for predicting the aggregation function applied to each selected column; a W-NUM sub-module for predicting the number of columns constrained by the where clause; a W-COL sub-module, corresponding to wcol1, wcol2, ..., for predicting the columns constrained by the where clause; a W-OP sub-module, corresponding to op1, op2, ..., for predicting the comparison operator of each constrained column; a W-VAL sub-module for predicting the constraint value of each constrained column; and a W-OP-COL sub-module for predicting the connectors between the conditions of the where clause.
7. The automatic question-answering system based on semi-supervised learning and a Text-to-SQL model according to claim 1, wherein the SQL sampler generates SQL by sampling as follows: according to the fixed format of SQL, namely SELECT agg1_op agg1_col, agg2_op agg2_col ... FROM table WHERE cond1_col cond1_op cond1 AND cond2_col ...; for a given table, a candidate SQL statement over that table is obtained according to the following rules: agg_col and cond_col are arbitrary columns of the table; when some condition condk_col condk_op condk of the SQL statement does not affect the final execution result, the triple determined by that condition is removed; and in addition, the SQL statement is kept only when its execution result is not an empty set.
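The non-empty-result filter in claim 7 can be sketched with an in-memory database. This is a hypothetical sketch (the table, columns and value pool are illustrative): a condition is sampled at random, and the candidate SQL is kept only if executing it returns at least one row:

```python
import random
import sqlite3

random.seed(3)

# Toy table standing in for a real database table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("ann", "sales", 50), ("bob", "sales", 60), ("eve", "hr", 70)])

COLUMNS = ["name", "dept", "salary"]
OPS = ["=", ">", "<"]

def sample_sql(max_tries=100):
    """Sample SELECT col FROM emp WHERE cond_col op value, keeping only
    statements whose execution result is not an empty set."""
    for _ in range(max_tries):
        sel = random.choice(COLUMNS)
        cond_col = "salary"                      # numeric column, for simplicity
        op = random.choice(OPS)
        val = random.choice([50, 60, 70])
        sql = f"SELECT {sel} FROM emp WHERE {cond_col} {op} {val}"
        if conn.execute(sql).fetchall():         # discard empty-result SQL
            return sql
    return None

sql = sample_sql()
print(sql)
```

A full sampler would also draw aggregation functions and multiple conditions, and drop any condition that does not change the execution result, as the claim specifies.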
CN202011391296.6A 2020-12-02 2020-12-02 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model Active CN112287093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011391296.6A CN112287093B (en) 2020-12-02 2020-12-02 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model

Publications (2)

Publication Number Publication Date
CN112287093A CN112287093A (en) 2021-01-29
CN112287093B true CN112287093B (en) 2022-08-12

Family

ID=74426036

Country Status (1)

Country Link
CN (1) CN112287093B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559556B (en) * 2021-02-25 2021-05-25 杭州一知智能科技有限公司 Language model pre-training method and system for table mode analysis and sequence mask
CN113204633B (en) * 2021-06-01 2022-12-30 吉林大学 Semantic matching distillation method and device
CN113609158A (en) * 2021-08-12 2021-11-05 国家电网有限公司大数据中心 SQL statement generation method, device, equipment and medium
CN114996424B (en) * 2022-06-01 2023-05-09 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815318A (en) * 2018-12-24 2019-05-28 平安科技(深圳)有限公司 The problems in question answering system answer querying method, system and computer equipment
CN110209787A (en) * 2019-05-29 2019-09-06 袁琦 A kind of intelligent answer method and system based on pet knowledge mapping
CN111382253A (en) * 2020-03-02 2020-07-07 苏州思必驰信息科技有限公司 Semantic parsing method and semantic parser

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10339453B2 (en) * 2013-12-23 2019-07-02 International Business Machines Corporation Automatically generating test/training questions and answers through pattern based analysis and natural language processing techniques on the given corpus for quick domain adaptation
US10552498B2 (en) * 2016-09-19 2020-02-04 International Business Machines Corporation Ground truth generation for machine learning based quality assessment of corpora
US10699215B2 (en) * 2016-11-16 2020-06-30 International Business Machines Corporation Self-training of question answering system using question profiles
US11023593B2 (en) * 2017-09-25 2021-06-01 International Business Machines Corporation Protecting cognitive systems from model stealing attacks
US11495332B2 (en) * 2017-12-28 2022-11-08 International Business Machines Corporation Automated prediction and answering of medical professional questions directed to patient based on EMR
CN110377275A (en) * 2019-07-18 2019-10-25 中汇信息技术(上海)有限公司 Automatically create the method and storage medium of application program
CN111639254A (en) * 2020-05-28 2020-09-08 华中科技大学 System and method for generating SPARQL query statement in medical field

Also Published As

Publication number Publication date
CN112287093A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN112287093B (en) Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN110532554B (en) Chinese abstract generation method, system and storage medium
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN111666764B (en) Automatic abstracting method and device based on XLNet
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN114780582A (en) Natural answer generating system and method based on form question and answer
CN107797986B (en) LSTM-CNN-based mixed corpus word segmentation method
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
CN118313382A (en) Small sample named entity recognition method and system based on feature pyramid
CN114611520A (en) Text abstract generating method
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN117332789A (en) Semantic analysis method and system for dialogue scene
CN115408506B (en) NL2SQL method combining semantic analysis and semantic component matching
CN114625759B (en) Model training method, intelligent question-answering method, device, medium and program product
CN111382583A (en) Chinese-Uygur name translation system with mixed multiple strategies
CN117493548A (en) Text classification method, training method and training device for model
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN115238705A (en) Semantic analysis result reordering method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant