CN115982586A - Semi-supervised continual learning method for few-shot text-to-SQL task streams - Google Patents

Semi-supervised continual learning method for few-shot text-to-SQL task streams

Info

Publication number
CN115982586A
CN115982586A (application number CN202310025951.3A)
Authority
CN
China
Prior art keywords
task, semi-supervised, learning, sample
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN202310025951.3A
Other languages
Chinese (zh)
Inventor
陈永锐 (Chen Yongrui)
郭心南 (Guo Xinnan)
吴桐桐 (Wu Tongtong)
漆桂林 (Qi Guilin)
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202310025951.3A
Publication of CN115982586A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to a semi-supervised continual learning method for few-shot text-to-SQL task streams, comprising the following steps: step 1, for a new task, the model is trained with in-task semi-supervised learning; step 2, for past tasks in the task stream, the model retains its memory of the learned tasks through continual learning; step 3, the semi-supervised learning process and the continual learning process are executed separately within a teacher-student framework; and step 4, dual sampling, consisting of prompt sampling and review sampling, is used to reinforce semi-supervised learning and continual learning and to strengthen the mutual promotion of the two learning processes. The technical scheme applies a "teacher-student" framework to isolate the different optimization objectives of semi-supervised learning and continual learning, and adopts a dual-sampling strategy to strengthen the mutual promotion between them.

Description

Semi-supervised continual learning method for few-shot text-to-SQL task streams
Technical Field
The invention belongs to the field of natural language processing and relates to a semi-supervised continual learning method for few-shot text-to-SQL (Structured Query Language) task streams.
Background
Relational databases, such as those used in the electric power domain, currently store vast amounts of information and provide the foundation for applications such as customer relationship management, financial markets, and medical records. Text-to-SQL techniques train a parser to translate natural language questions into machine-executable SQL programs, offering an ideal way for non-technical users to interact easily with data stored in relational databases. Current research on text-to-SQL covers single-table, multi-table, and dialogue scenarios, under the common assumption that the sizes of the training and test data do not change over time. Unfortunately, in real-world applications, new databases constantly emerge to adapt to changing environments (e.g., new diseases and adjusted financial policies) and continually generate new tasks for the parser. While machine-learning-based text-to-SQL approaches have achieved state-of-the-art performance, they face the following two challenges when confronted with a rapidly growing set of tasks:
1) Scarce supervised data. For a new text-to-SQL task over unseen databases, it is usually impractical to annotate enough SQL labels for training in a short time, so the parser easily overfits. 2) Expensive full retraining. When a new task arrives, an intuitive idea is to retrain the model from scratch on all tasks seen so far. Unfortunately, because of the large scale of the pre-trained model, the computational cost of such retraining is prohibitive even in a few-shot scenario.
Based on this, the present work proposes to integrate semi-supervised learning and continual learning to solve few-shot text-to-SQL task streams. The parser applies self-training to predict pseudo-labeled instances, improving generalization on the current task, while replaying a small portion of past instances stored in memory to mitigate forgetting of previous tasks. In this process, semi-supervised learning and continual learning can promote each other. On the one hand, some instances from previous tasks may provide valuable information for semi-supervised learning to predict the pseudo-labels of unlabeled instances. On the other hand, high-quality pseudo-labeled instances can also enrich the memory of past tasks. A "teacher-student" framework is applied to the semi-supervised learning and continual learning processes respectively. The teacher model strives for the optimum of each single task through self-training, while the student model pursues the optimum of the whole task stream by replaying and learning from the pseudo-labels of all tasks predicted by the teacher models. To exploit the mutual promotion of continual learning and semi-supervised learning, past instances relevant to the current task are used to prompt the semi-supervised learning process when training the teacher model; when training the student model, both labeled and pseudo-labeled instances of previous tasks are sampled to ensure a complete memory for replay.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention aims to provide a semi-supervised continual learning method for few-shot text-to-SQL task streams.
The invention adopts the following technical scheme: a semi-supervised continual learning method for few-shot text-to-SQL task streams, the method comprising the following steps:
Step 1, for a new task, the model is trained with in-task semi-supervised learning.

For the i-th new task D_i in the task stream, its supervised data A_i = {a_1, a_2, …, a_n} and unsupervised data U_i = {u_1, u_2, …, u_m} are used for self-training, completing the in-task semi-supervised learning process, where a_j = (q_j, S_j, y_j), u_j = (q_j, S_j), q_j and S_j denote the input natural language question and database schema respectively, y_j denotes the target SQL program, and n and m are the numbers of supervised and unsupervised examples respectively. The process mainly comprises the following two stages:

Warm start: a model M_θ^i with an encoder-decoder architecture is adopted, where θ denotes the model parameters and i indexes the i-th task. The encoder is a tabular pre-trained language model and the decoder is a long short-term memory network (LSTM). First, the model is trained on the supervised data A_i of the i-th task D_i to obtain initialized parameters, where the loss L of each sample is computed as

L = −∑_{t=1}^{|Z|} log P(z_t | q_j, S_j, z_{<t}; θ)

where z_t is the action predicted by the model at decoding step t, z_{<t} denotes all actions before step t, Z is the whole action sequence, and |Z| is the length of the whole action sequence; when all actions have been completed, an SQL program has been generated;

Self-update: subsequently, M_θ^i performs joint iterative training on the supervised data A_i and the unsupervised data U_i. Specifically, in each training round the model M_θ^i first predicts a pseudo-label, i.e. the corresponding SQL program ŷ_j, for each unlabeled example u_j ∈ U_i, forming a pseudo-sample p_j = (q_j, S_j, ŷ_j). Then k pseudo-samples p are randomly selected to form the pseudo-sample set P_i = {p_1, p_2, …, p_k} of the i-th task D_i, and θ is further updated by optimizing the loss

L_ST = ∑_{a_j ∈ A_i} L(a_j) + ∑_{p_j ∈ P_i} μ_j · L(p_j)

where μ_j = ∏_t P(z_t | q_j, S_j, z_{<t}; θ) is the confidence score of the pseudo-sample p_j.
Step 2, for past tasks in the task stream, the model retains its memory of the learned tasks through continual learning.

When a new task is encountered, M_θ^i cannot be retrained on all previous examples, i.e. M_θ^i forgets past tasks after learning a new task; therefore, M_θ^i needs to review some examples from past tasks to ensure that its performance on them does not degrade significantly. The process mainly comprises the following two stages:

Memory construction: as preparation, a fixed-size memory M_i = {m_1, m_2, …, m_|M|} is associated with each task D_i to store a small number of replay instances of that task, where each m_j is sampled from the supervised data A_i.

Replay loss computation: whenever M_θ^i performs self-training, the losses of all replay samples in M_1 to M_{i-1} are computed and added to the semi-supervised loss L_ST; the replay loss L_EMR is computed as

L_EMR = ∑_{j=1}^{i-1} ∑_{m ∈ M_j} L(m)

After the training on task D_i is finished, |M| labeled samples are randomly selected from A_i and stored in M_i for replay in future tasks.
Step 3, the semi-supervised learning process and the continual learning process are executed separately within a teacher-student framework.

Since semi-supervised learning addresses the optimization of a single (current) task, while continual learning focuses more on the overall performance across all tasks, the method proposes to perform the two processes separately within a "teacher-student" framework. The model consists of two basic text-to-SQL models: a teacher M_tea for semi-supervised learning and a student M_stu for continual learning. When learning task D_i, M_tea and M_stu are both initialized from the same model parameters but are updated separately during each task, their parameters being denoted θ_tea^i and θ_stu^i respectively. To facilitate the mutual promotion of semi-supervised learning and continual learning, dual sampling is used: it comprises two different strategies that respectively enhance the training data of M_tea and M_stu.

Teacher model: the goal of M_tea is to provide an expert on the current task D_i through in-task self-training (ST). It only cares about D_i and is allowed to forget past tasks unrelated to D_i; at the same time, past instances related to D_i can be emphasized to deepen M_tea's understanding of D_i. To achieve this, instances carrying potential hints are replayed during ST. Specifically, when observing each task D_i, labeled samples related to D_i are extracted from the examples stored for previous tasks to compose a memory store M̂_i; this step is called prompt sampling. Afterwards, M_tea performs self-training as in step 1, except that the loss L_ST in the self-update process is replaced by the following loss L_tea:

L_tea = L_ST + ∑_{m ∈ M̂_i} L(m)

When training has converged, M_tea is considered close to the optimal model for D_i, at the cost that M_tea may have forgotten some key information of D_1, …, D_{i-1}.

Student model: after training, M_tea^i can be regarded as an expert proficient in task D_i, so the tasks D_1, …, D_K provide K experts, where K is the total number of tasks. M_stu optimizes the overall performance over the whole task stream by learning from the pseudo-labels provided by all the teachers M_tea^j. For each task D_j (1 ≤ j ≤ i−1), besides the original replay memory of D_j, M_stu can also learn from the pseudo-label instances generated by the trained M_tea^j. Considering training efficiency, cross-task sample replay is used for every D_i; specifically, the loss of M_stu comprises 1) the task loss on D_i and 2) the replay loss on D_1, …, D_{i-1}.
Step 4, dual sampling, comprising prompt sampling and review sampling respectively, is used to reinforce semi-supervised learning and continual learning and to strengthen the mutual promotion of the two learning processes.

Prompt sampling: to ensure that the sampled instances help the study of D_i, they should satisfy two criteria: 1) their corresponding database schemas S should be semantically close to the database schemas appearing in D_i; 2) the structures of their gold SQL programs should be diverse, so as to improve the generalization ability of M_tea to different structures.

The sampling process mainly comprises the following two stages. First, the top-s samples most correlated with D_i are selected from the examples stored for previous tasks; the correlation ω between each sample x_j and D_i is computed from the schema distance d_sch(x_l, x_j), which denotes the distance between samples x_l and x_j over their database schemas. Ψ denotes a dictionary covering all words that appear in the database schemas; for each word ψ ∈ Ψ, the ψ-th bit v_ψ of a sample's bag-of-words one-hot vector is 1 if ψ appears in that sample's schema and 0 otherwise, and d_sch is computed over these vectors.

Then, the obtained s samples are divided into N clusters using the K-means clustering algorithm, where a structure distance d_str is defined to guarantee the diversity of SQL structures. d_str has the same form as d_sch, but uses an SQL keyword table Φ (including ORDER BY, GROUP BY, LIMIT, etc.) instead of Ψ. Finally, the center sample of each cluster is selected to compose M̂_i.

Review sampling: besides sampling from the labeled sample set A_i, review sampling also samples from the pseudo-label instances P_i, where the label of each p in P_i is the pseudo-label predicted during in-task self-training; in this way M_stu can review task D_i more completely. To make the sampled instances representative of D_i in terms of both SQL structure and database schema, a combined distance d(x_1, x_2) = d_sch(x_1, x_2) · d_str(x_1, x_2) is first defined to measure the distance between two samples x_1 and x_2; then A_i ∪ P_i is divided into M clusters using this distance, and the resulting memory M_i is composed of the center sample of each cluster as the representative samples of D_i.
Compared with the prior art, the invention has the following beneficial effects:
1. A solution combining semi-supervised learning and continual learning is provided to solve text-to-SQL task streams in the few-shot scenario;
2. A teacher-student framework is provided to isolate the different optimization objectives of semi-supervised learning and continual learning, so that the respective advantages of the two can be fully exploited during training;
3. A dual-sampling strategy is proposed to strengthen the link between semi-supervised learning and continual learning, promoting both learning processes by reviewing different samples.
Drawings
FIG. 1 is a flow diagram of the semi-supervised continual learning method for text-to-SQL;
FIG. 2 is a schematic diagram of the text-to-SQL model of the present invention.
Detailed Description
The present application is further described below with reference to the accompanying drawings. The following examples are only used to illustrate the technical solutions of the present invention more clearly, and the protection scope of the present application is not limited thereby.
Example 1: the overall flow of the semi-supervised continual learning method for few-shot text-to-SQL task streams of the invention is shown in FIG. 1.
The method comprises the following steps:
step 1, for a new task, the model is trained by using semi-supervised learning in the task.
For the ith new task D in the task stream i Using its supervision data A i ={a 1 ,a 2 …,a n And unsupervised data U i ={u 1 ,u 2 ,…,u m Self-training is carried out, and a semi-supervised learning process in the task is completed, wherein a j =(q j ,S j ,y j ),u j =(q j ,S j ),q j And S j Respectively representing input natural language questions and database schemas, y j Representing the target SQL program, m and, respectively, the number of supervised and unsupervised data, the process mainly comprises the following two phases:
and (3) hot start: model using coding-decoding architecture
Figure BDA0004044939470000061
Where θ is the model parameter and i represents the ith task. The encoder is a table pre-training language model and the decoder is a long-short memory network (LSTM), first, the model is firstly at the ith task D i Supervision data A of i Training is performed to obtain a parameter for which a parameter is initialized, wherein the loss L per sample is calculated as follows, and ` H `>
Figure BDA0004044939470000062
Wherein
Figure BDA0004044939470000063
For the actions predicted by the model at the decoding time t, z <t For all moves before time t, Z is the entire move sequence and | Z | is the length of the entire move sequence. />
Figure BDA0004044939470000064
Indicates that the time t is generating>
Figure BDA0004044939470000065
As shown in FIG. 2, the model iteratively generates an action z using the problem and database patterns as inputs t . When all actions are completed, the SQL program is generated;
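As an illustration of the warm-start loss above, a minimal sketch is given below; it assumes a generic autoregressive parser that returns one row of action logits per decoding step (the `model` interface and tensor shapes are hypothetical and not part of the invention).

```python
import torch
import torch.nn.functional as F

def sample_loss(model, question, schema, gold_actions):
    """Warm-start loss for one sample: L = -sum_t log P(z_t | q, S, z_<t; theta).

    `model(question, schema, gold_actions)` is assumed to return one row of action
    logits per decoding step, conditioned on the prefix z_<t (teacher forcing).
    """
    logits = model(question, schema, gold_actions)        # shape [|Z|, num_actions] (assumed)
    log_probs = F.log_softmax(logits, dim=-1)
    gold = torch.as_tensor(gold_actions).unsqueeze(1)     # gold action index z_t per step
    return -log_probs.gather(1, gold).sum()               # negative log-likelihood of the sequence
```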
self-renewal: subsequently, the process of the present invention,
Figure BDA0004044939470000066
in supervision data A i And unsupervised data U i Performs a joint iterative training, in particular, the model ≥ at each round of the training process>
Figure BDA0004044939470000071
First, for each non-label data u j ∈U i Predictive pseudo-tags, i.e., corresponding SQL programs +>
Figure BDA0004044939470000072
Constitute a dummy sample>
Figure BDA0004044939470000073
Randomly selecting k pseudo samples p to form an ith task D i Pseudo sample set P of i ={p 1 ,p 2 ,…,p k Then further update theta by optimizing the penalty,
Figure BDA0004044939470000074
wherein, mu j =P(z t |q j ,S j θ) is a dummy sample p j The confidence score of.
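One self-update round can be sketched as follows, assuming a helper `predict_sql` that greedily decodes an SQL program together with its sequence probability, and the `sample_loss` sketch above; all names and interfaces are illustrative assumptions.

```python
import random

def self_update_round(model, labeled, unlabeled, k, optimizer):
    """One round of in-task self-training: pseudo-label U_i, pick k pseudo-samples, optimize L_ST."""
    pseudo = []
    for question, schema in unlabeled:
        # Hypothetical helper: returns the predicted program y_hat and its confidence mu.
        y_hat, mu = predict_sql(model, question, schema)
        pseudo.append((question, schema, y_hat, mu))

    p_i = random.sample(pseudo, min(k, len(pseudo)))      # pseudo-sample set P_i

    # L_ST = supervised loss on A_i + confidence-weighted loss on P_i.
    loss = sum(sample_loss(model, q, s, y) for q, s, y in labeled)
    loss = loss + sum(mu * sample_loss(model, q, s, y_hat) for q, s, y_hat, mu in p_i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return p_i
```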
Step 2, for past tasks in the task stream, the model retains its memory of the learned tasks through continual learning.

When a new task is encountered, M_θ^i cannot be retrained on all previous examples, i.e. M_θ^i forgets past tasks after learning a new task; therefore, M_θ^i needs to review some examples from past tasks to ensure that its performance on them does not degrade significantly. The process mainly comprises the following two stages:

Memory construction: as preparation, a fixed-size memory M_i = {m_1, m_2, …, m_|M|} is associated with each task D_i to store a small number of replay instances of that task, where each m_j is sampled from the supervised data A_i.

Replay loss computation: whenever M_θ^i performs self-training, the losses of all replay samples in M_1 to M_{i-1} are computed and added to the semi-supervised loss L_ST; the replay loss L_EMR is computed as

L_EMR = ∑_{j=1}^{i-1} ∑_{m ∈ M_j} L(m)

After the training on task D_i is finished, |M| labeled samples are randomly selected from A_i and stored in M_i for replay in future tasks.
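The memory construction and the replay loss L_EMR can be sketched as below; the memory size and the sample format follow the assumptions of the previous sketches.

```python
import random

def build_memory(labeled, memory_size):
    """After finishing task D_i, randomly keep |M| labeled samples from A_i for future replay."""
    return random.sample(labeled, min(memory_size, len(labeled)))

def replay_loss(model, memories):
    """L_EMR: sum of the per-sample losses over all stored memories M_1 .. M_{i-1}."""
    loss = 0.0
    for memory in memories:                      # memories = [M_1, ..., M_{i-1}]
        for question, schema, sql in memory:
            loss = loss + sample_loss(model, question, schema, sql)
    return loss
```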
Step 3, the semi-supervised learning process and the continual learning process are executed separately within a teacher-student framework.

Since semi-supervised learning addresses the optimization of a single (current) task, while continual learning focuses more on the overall performance across all tasks, the method proposes to perform the two processes separately within a "teacher-student" framework. The model consists of two basic text-to-SQL models: a teacher M_tea for semi-supervised learning and a student M_stu for continual learning. When learning task D_i, M_tea and M_stu are both initialized from the same model parameters but are updated separately during each task, their parameters being denoted θ_tea^i and θ_stu^i respectively. To facilitate the mutual promotion of semi-supervised learning and continual learning, dual sampling is used: it comprises two different strategies that respectively enhance the training data of M_tea and M_stu.

Teacher model: the goal of M_tea is to provide an expert on the current task D_i through in-task self-training (ST). It only cares about D_i and is allowed to forget past tasks unrelated to D_i; at the same time, past instances related to D_i can be emphasized to deepen M_tea's understanding of D_i. To achieve this, instances carrying potential hints are replayed during ST. Specifically, when observing each task D_i, labeled samples related to D_i are extracted from the examples stored for previous tasks to compose a memory store M̂_i; this step is called prompt sampling. Afterwards, M_tea performs self-training as in step 1, except that the loss L_ST in the self-update process is replaced by the following loss L_tea:

L_tea = L_ST + ∑_{m ∈ M̂_i} L(m)

When training has converged, M_tea is considered close to the optimal model for D_i, at the cost that M_tea may have forgotten some key information of D_1, …, D_{i-1}.

Student model: the trained M_tea^i can be regarded as an expert proficient in task D_i, so the tasks D_1, …, D_K provide K experts, where K is the total number of tasks. M_stu optimizes the overall performance over the whole task stream by learning from the pseudo-labels provided by all the teachers M_tea^j. For each task D_j (1 ≤ j ≤ i−1), besides the original replay memory of D_j, M_stu can also learn from the pseudo-label instances generated by the trained M_tea^j, as shown in FIG. 1. Considering training efficiency, cross-task sample replay is used for every D_i; specifically, the loss of M_stu comprises 1) the task loss on D_i and 2) the replay loss on D_1, …, D_{i-1}.
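The teacher-student schedule of step 3 can be summarized in the following sketch; `train_teacher` and `train_student` are placeholders for the training procedures described above, `prompt_sampling` and `review_sampling` stand for the dual-sampling procedures of step 4 (sketched after that step, here with simplified signatures), and the hyper-parameter values are illustrative assumptions rather than prescribed values.

```python
import copy

def train_task_stream(base_model, tasks, k, memory_size):
    """Illustrative pass over the task stream D_1..D_K with one teacher per task and one student."""
    student = copy.deepcopy(base_model)          # M_stu: optimized for the whole task stream
    memories = []                                # [M_1, ..., M_{i-1}]
    for labeled, unlabeled in tasks:             # task D_i = (A_i, U_i)
        teacher = copy.deepcopy(base_model)      # M_tea: cares only about the current task D_i

        # Prompt sampling: past stored samples related to D_i form the hint memory M̂_i.
        past_samples = [x for memory in memories for x in memory]
        hint_memory = prompt_sampling(past_samples, labeled, s=64, n_clusters=8)

        # Teacher: in-task self-training with L_tea = L_ST + loss on M̂_i; returns pseudo-samples P_i.
        pseudo = train_teacher(teacher, labeled, unlabeled, hint_memory, k)

        # Student: task loss on D_i (including teacher pseudo-labels) + replay loss on D_1..D_{i-1}.
        train_student(student, labeled, pseudo, memories)

        # Review sampling: cluster A_i ∪ P_i and keep cluster centers as the replay memory M_i.
        memories.append(review_sampling(labeled, pseudo, memory_size))
    return student
```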
Step 4, dual sampling, comprising prompt sampling and review sampling respectively, is used to reinforce semi-supervised learning and continual learning and to strengthen the mutual promotion of the two learning processes.

Prompt sampling: to ensure that the sampled instances help the study of D_i, they should satisfy two criteria: 1) their corresponding database schemas S should be semantically close to the database schemas appearing in D_i; 2) the structures of their gold SQL programs should be diverse, so as to improve the generalization ability of M_tea to different structures.

The sampling process mainly comprises the following two stages. First, the top-s samples most correlated with D_i are selected from the examples stored for previous tasks; the correlation ω between each sample x_j and D_i is computed from the schema distance d_sch(x_l, x_j), which denotes the distance between samples x_l and x_j over their database schemas. Ψ denotes a dictionary covering all words that appear in the database schemas; for each word ψ ∈ Ψ, the ψ-th bit v_ψ of a sample's bag-of-words one-hot vector is 1 if ψ appears in that sample's schema and 0 otherwise, and d_sch is computed over these vectors.

Then, the obtained s samples are divided into N clusters using the K-means clustering algorithm, where a structure distance d_str is defined to guarantee the diversity of SQL structures. d_str has the same form as d_sch, but uses an SQL keyword table Φ (including ORDER BY, GROUP BY, LIMIT, etc.) instead of Ψ. Finally, the center sample of each cluster is selected to compose M̂_i.
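A sketch of prompt sampling under stated assumptions: d_sch is taken here as the L1 distance between bag-of-words one-hot vectors over the schema dictionary Ψ, d_str as the analogous distance over the SQL keyword table Φ, and a sample's correlation ω with D_i is approximated by its schema distance to the current task (the exact aggregation of ω is not reproduced here, so this ranking rule is an assumption); the sample fields `schema_words` and `sql_tokens` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

SQL_KEYWORDS = ["SELECT", "WHERE", "JOIN", "GROUP BY", "ORDER BY", "HAVING", "LIMIT"]  # illustrative Φ

def one_hot(tokens, vocab):
    """Bag-of-words one-hot vector: bit psi is 1 iff word psi appears in `tokens`."""
    return np.array([1.0 if word in tokens else 0.0 for word in vocab])

def prompt_sampling(past_samples, current_task, vocab, s, n_clusters):
    """Pick the s past samples closest to D_i by schema, then keep N cluster centers (by SQL structure)."""
    task_words = {w for x in current_task for w in x["schema_words"]}
    task_vec = one_hot(task_words, vocab)
    # d_sch: distance between schema bag-of-words vectors (smaller = more correlated with D_i).
    ranked = sorted(past_samples,
                    key=lambda x: np.abs(one_hot(x["schema_words"], vocab) - task_vec).sum())
    candidates = ranked[:s]
    # d_str features: same construction over the SQL keyword table Φ; K-means enforces structural diversity.
    feats = np.stack([one_hot(x["sql_tokens"], SQL_KEYWORDS) for x in candidates])
    km = KMeans(n_clusters=min(n_clusters, len(candidates)), n_init=10).fit(feats)
    hint_memory = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Representative of the cluster: the candidate nearest to the cluster center.
        best = members[np.argmin(np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1))]
        hint_memory.append(candidates[best])
    return hint_memory
```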
Review sampling: besides sampling from the labeled sample set A_i, review sampling also samples from the pseudo-label instances P_i, where the label of each p in P_i is the pseudo-label predicted during in-task self-training; in this way M_stu can review task D_i more completely. To make the sampled instances representative of D_i in terms of both SQL structure and database schema, a combined distance d(x_1, x_2) = d_sch(x_1, x_2) · d_str(x_1, x_2) is first defined to measure the distance between two samples x_1 and x_2; then A_i ∪ P_i is divided into M clusters using this distance, and the resulting memory M_i is composed of the center sample of each cluster as the representative samples of D_i.
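Review sampling can be sketched in the same way; the combined distance d = d_sch · d_str is used with average-linkage hierarchical clustering over A_i ∪ P_i (the clustering backend is an implementation assumption; the step only requires dividing A_i ∪ P_i into M clusters under this distance), and the `one_hot` helper from the prompt-sampling sketch above is reused.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def review_sampling(labeled, pseudo, vocab, sql_keywords, m_clusters):
    """Cluster A_i ∪ P_i with d = d_sch * d_str and keep one medoid per cluster as memory M_i."""
    pool = list(labeled) + list(pseudo)
    # `one_hot` is the bag-of-words helper from the prompt-sampling sketch above.
    sch = np.stack([one_hot(x["schema_words"], vocab) for x in pool])
    struc = np.stack([one_hot(x["sql_tokens"], sql_keywords) for x in pool])
    n = len(pool)
    dist = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            dist[a, b] = np.abs(sch[a] - sch[b]).sum() * np.abs(struc[a] - struc[b]).sum()
    # Divide the pool into M clusters under the combined distance.
    labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                      t=min(m_clusters, n), criterion="maxclust")
    memory = []
    for c in set(labels):
        members = np.where(labels == c)[0]
        # Medoid: the member with the smallest total distance to the rest of its cluster.
        memory.append(pool[members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]])
    return memory
```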
Experimental analysis shows that this semi-supervised continual learning method for few-shot text-to-SQL task streams supplements the supervision signal with unsupervised data when labeled samples are insufficient, and adopts a replay strategy to recall past samples according to the characteristics of the task stream. In addition, the method uses a "teacher-student" framework to handle the continual learning process and the semi-supervised learning process separately, and uses dual sampling to make the two processes promote each other, thereby realizing an accurate and efficient training method for task streams.
The applicant has described and illustrated embodiments of the invention in detail with reference to the accompanying drawings, but it should be understood by those skilled in the art that the above embodiments are merely preferred embodiments of the invention. The detailed description is intended only to help the reader better understand the spirit of the invention and not to limit its protection scope; on the contrary, any improvement or modification made based on the spirit of the invention shall fall within the protection scope of the invention.

Claims (5)

1. A semi-supervised continual learning method for few-shot text-to-SQL task streams, characterized by comprising the following steps:
Step 1, for a new task, the model is trained with in-task semi-supervised learning;
Step 2, for past tasks in the task stream, the model retains its memory of the learned tasks through continual learning;
Step 3, the semi-supervised learning process and the continual learning process are executed separately within a teacher-student framework;
Step 4, dual sampling, comprising prompt sampling and review sampling respectively, is used to reinforce semi-supervised learning and continual learning and to strengthen the mutual promotion of the two learning processes.
2. The semi-supervised continual learning method for few-shot text-to-SQL task streams according to claim 1, characterized in that step 1 is specifically as follows: for the i-th new task D_i in the task stream, its supervised data A_i = {a_1, a_2, …, a_n} and unsupervised data U_i = {u_1, u_2, …, u_m} are used for self-training, completing the in-task semi-supervised learning process, where a_j = (q_j, S_j, y_j), u_j = (q_j, S_j), q_j and S_j denote the input natural language question and database schema respectively, y_j denotes the target SQL program, and n and m are the numbers of supervised and unsupervised examples respectively; the process mainly comprises the following two stages:

warm start: a model M_θ^i with an encoder-decoder architecture is adopted, where θ denotes the model parameters and i indexes the i-th task; the encoder is a tabular pre-trained language model and the decoder is a long short-term memory network (LSTM); first, the model is trained on the supervised data A_i of the i-th task D_i to obtain initialized parameters, where the loss L of each sample is computed as

L = −∑_{t=1}^{|Z|} log P(z_t | q_j, S_j, z_{<t}; θ)

where z_t is the action predicted by the model at decoding step t, z_{<t} denotes all actions before step t, Z is the whole action sequence, and |Z| is the length of the whole action sequence; when all actions have been completed, an SQL program has been generated;

self-update: subsequently, M_θ^i performs joint iterative training on the supervised data A_i and the unsupervised data U_i; specifically, in each training round the model M_θ^i first predicts a pseudo-label, i.e. the corresponding SQL program ŷ_j, for each unlabeled example u_j ∈ U_i, forming a pseudo-sample p_j = (q_j, S_j, ŷ_j); then k pseudo-samples p are randomly selected to form the pseudo-sample set P_i = {p_1, p_2, …, p_k} of the i-th task D_i, and θ is further updated by optimizing the loss

L_ST = ∑_{a_j ∈ A_i} L(a_j) + ∑_{p_j ∈ P_i} μ_j · L(p_j)

where μ_j = ∏_t P(z_t | q_j, S_j, z_{<t}; θ) is the confidence score of the pseudo-sample p_j.
3. The semi-supervised continual learning method for few-shot text-to-SQL task streams according to claim 1, characterized in that step 2 is specifically as follows: when a new task is encountered, M_θ^i cannot be retrained on all previous examples, i.e. M_θ^i forgets past tasks after learning a new task; therefore, M_θ^i needs to review some examples from past tasks to ensure that its performance on them does not degrade significantly; the process mainly comprises the following two stages:

memory construction: as preparation, a fixed-size memory M_i = {m_1, m_2, …, m_|M|} is associated with each task D_i to store a small number of replay instances of that task, where each m_j is sampled from the supervised data A_i;

replay loss computation: whenever M_θ^i performs self-training, the losses of all replay samples in M_1 to M_{i-1} are computed and added to the semi-supervised loss L_ST; the replay loss L_EMR is computed as

L_EMR = ∑_{j=1}^{i-1} ∑_{m ∈ M_j} L(m)

after the training on task D_i is finished, |M| labeled samples are randomly selected from A_i and stored in M_i for replay in future tasks.
4. The semi-supervised continual learning method for few-shot text-to-SQL task streams according to claim 1, characterized in that step 3 is specifically as follows: the semi-supervised learning process and the continual learning process are executed separately within a teacher-student framework;

the model consists of two basic text-to-SQL models, namely a teacher M_tea for semi-supervised learning and a student M_stu for continual learning; when learning task D_i, M_tea and M_stu are both initialized from the same model parameters but are updated separately during each task, their parameters being denoted θ_tea^i and θ_stu^i respectively; to facilitate the mutual promotion of semi-supervised learning and continual learning, dual sampling is used, comprising two different strategies that respectively enhance the training data of M_tea and M_stu;

teacher model: the goal of M_tea is to provide an expert on the current task D_i through in-task self-training (ST); it only cares about D_i and is allowed to forget past tasks unrelated to D_i, while past instances related to D_i can be emphasized to deepen M_tea's understanding of D_i; to achieve this, instances carrying potential hints are replayed during ST; specifically, when observing each task D_i, labeled samples related to D_i are extracted from the examples stored for previous tasks to compose a memory store M̂_i, a step called prompt sampling; afterwards, M_tea performs self-training as in step 1, except that the loss L_ST in the self-update process is replaced by the following loss L_tea:

L_tea = L_ST + ∑_{m ∈ M̂_i} L(m)

when training has converged, M_tea is considered close to the optimal model for D_i, at the cost that M_tea may have forgotten some key information of D_1, …, D_{i-1};

student model: assuming the trained M_tea^i is an expert on task D_i, the task stream D_1, …, D_K can theoretically provide K experts, where K is the total number of tasks; a parser would be the global optimum for the entire task stream if it could inherit the capabilities of all of these experts, and M_stu is intended to be such a parser; for each task D_j (1 ≤ j ≤ i−1), besides the original replay memory of D_j, M_stu can also learn from the pseudo-label instances generated by the trained M_tea^j; considering training efficiency, cross-task sample replay is used for every D_i; specifically, the loss of M_stu comprises 1) the task loss on D_i and 2) the replay loss on D_1, …, D_{i-1}.
5. The semi-supervised continual learning method for few-shot text-to-SQL task streams according to claim 1, characterized in that step 4 is specifically as follows:

prompt sampling: to ensure that the sampled instances help the study of D_i, they should satisfy two criteria: 1) their corresponding database schemas S should be semantically close to the database schemas appearing in D_i; 2) the structures of their gold SQL programs should be diverse, so as to improve the generalization ability of M_tea to different structures;

the sampling process mainly comprises the following two stages: first, the top-s samples most correlated with D_i are selected from the examples stored for previous tasks, where the correlation ω between each sample x_j and D_i is computed from the schema distance d_sch(x_l, x_j), which denotes the distance between samples x_l and x_j over their database schemas; Ψ denotes a dictionary covering all words that appear in the database schemas, and for each word ψ ∈ Ψ the ψ-th bit v_ψ of a sample's bag-of-words one-hot vector is 1 if ψ appears in that sample's schema and 0 otherwise, d_sch being computed over these vectors; then, the obtained s samples are divided into N clusters using the K-means clustering algorithm, where a structure distance d_str, which has the same form as d_sch but uses an SQL keyword table Φ (including ORDER BY, GROUP BY, LIMIT, etc.) instead of Ψ, is defined to guarantee the diversity of SQL structures; finally, the center sample of each cluster is selected to compose M̂_i;

review sampling: besides sampling from the labeled sample set A_i, review sampling also samples from the pseudo-label instances P_i, where the label of each p in P_i is the pseudo-label predicted during in-task self-training, so that M_stu can review task D_i more completely; to make the sampled instances representative of D_i in terms of both SQL structure and database schema, a combined distance d(x_1, x_2) = d_sch(x_1, x_2) · d_str(x_1, x_2) is first defined to measure the distance between two samples x_1 and x_2; then A_i ∪ P_i is divided into M clusters using this distance, and the resulting memory M_i is composed of the center sample of each cluster as the representative samples of D_i.
CN202310025951.3A, filed 2023-01-09 (priority date 2023-01-09): Semi-supervised continual learning method for few-shot text-to-SQL task streams; status: Pending; publication: CN115982586A (en)

Priority Application (1)

Application Number: CN202310025951.3A; Priority Date: 2023-01-09; Filing Date: 2023-01-09; Title: Semi-supervised continual learning method for few-shot text-to-SQL task streams

Publications (1)

Publication Number: CN115982586A (en); Publication Date: 2023-04-18

Family

ID=85972207

Family Applications (1)

Application Number: CN202310025951.3A; Title: Semi-supervised continual learning method for few-shot text-to-SQL task streams; Priority Date: 2023-01-09; Filing Date: 2023-01-09; Status: Pending

Country Status (1)

Country: CN; Link: CN115982586A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132003A (en) * 2023-10-26 2023-11-28 云南师范大学 Student academic performance early prediction method based on self-training semi-supervised learning
CN117132003B (en) * 2023-10-26 2024-02-06 云南师范大学 Early prediction method for student academic performance of online learning platform based on self-training and semi-supervised learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination