CN113420123A - Language model training method, NLP task processing method and device - Google Patents

Language model training method, NLP task processing method and device

Info

Publication number
CN113420123A
Authority
CN
China
Prior art keywords
text
language model
task
training
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110705729.9A
Other languages
Chinese (zh)
Inventor
张学君
张震
王晗
李鹏
刘建
石瑾
刘睿霖
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, National Computer Network and Information Security Management Center filed Critical Institute of Acoustics CAS
Priority to CN202110705729.9A priority Critical patent/CN113420123A/en
Publication of CN113420123A publication Critical patent/CN113420123A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a language model training method, an NLP task processing method and a device, including: acquiring a training sample set, where the training sample set includes a first task label of a new task, a plurality of first training texts of the new task, a first text label of each first training text, and a second task label of each old task in at least one old task; copying the language model to obtain a teacher language model and using the language model itself as a student language model; inputting the second task labels into the teacher language model to generate a plurality of second training texts corresponding to the old tasks and a second text label for each second training text; and inputting the first task label, the second task label, the first training texts and the second training texts into the student language model to generate a first predicted text, a first prediction result, a second predicted text and a second prediction result, and training the student language model accordingly. The embodiments of the application can solve the problem in the related art that storage resources are heavily occupied.

Description

Language model training method, NLP task processing method and device
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method for training a language model, a method and an apparatus for NLP task processing.
Background
With the development of Natural Language Processing (NLP) technology, language models have been widely applied in many fields of daily life to accomplish different kinds of natural language processing tasks, such as text classification, emotion classification, and semantic role labeling in dialog systems.
At present, after a conventional neural network learns a new task, its performance on previously learned (old) tasks drops sharply; that is, the network's performance degrades as the distribution of the learned data changes, which is known as the catastrophic forgetting problem. In the related art, in order to avoid catastrophic forgetting, multiple sub-neural-network models can be added to a main neural network model so that different sub-models learn different tasks, and already-trained tasks can be rehearsed by storing their training samples and prediction results.
However, in these implementations, constructing sub-networks makes the model grow with the number of sub-networks, so the storage resources occupied by the neural network keep increasing; likewise, saving the training samples and prediction results of already-trained tasks requires a large amount of storage.
Disclosure of Invention
The embodiments of the application provide a language model training method, an NLP task processing method and an NLP task processing device, which can generate the training texts of already-trained tasks through a teacher language model, thereby solving the problem of heavy storage-resource occupation in the related art.
In a first aspect, an embodiment of the present application provides a method for training a language model, where the method includes:
acquiring a training sample set; the training sample set comprises a first task label of a new task, a plurality of first training texts of the new task, a first text label of each first training text, and a second task label of each old task in at least one old task, wherein the old task is an NLP task with a trained language model, and the new task is an NLP task without the trained language model;
copying the language model to obtain a teacher language model, and taking the language model as a student language model;
inputting the second task labels into the teacher language model, and generating a plurality of second training texts corresponding to the old tasks and a second text label of each second training text;
inputting the first task label, the second task label, the first training text and the second training text into a student language model to generate a first prediction text corresponding to the first task label, a first prediction result corresponding to the first training text, a second prediction text corresponding to the second task label and a second prediction result corresponding to the second training text;
and training the student language model according to the first training text and the first predicted text, the second training text and the second predicted text, the first predicted result and the first text label, and the second predicted result and the second text label.
In one possible implementation manner, the first training text and the second training text are both question and answer format texts, and the question and answer format texts include question prompt information and questions generated according to the question prompt information.
In one possible implementation, training a student language model according to a first training text, a first predicted text, a second training text, and a second predicted text includes:
performing knowledge distillation on the student language model according to the second text label and the second prediction result;
and training the student language model according to the loss between the first training text and the first prediction text, the loss between the second training text and the second prediction text, the loss between the first prediction result and the first text label and the loss between the second prediction result and the second text label.
In one possible implementation, knowledge distillation is performed on the student language model according to the second text label and the second prediction result, and the knowledge distillation comprises the following steps:
calculating JS divergence of the teacher language model and the student language model according to the second text label and the second prediction result;
calculating the earth mover's distance for transferring the features of each layer in the teacher language model to the corresponding layer in the student language model;
calculating the loss between the teacher language model and the student language model according to the JS divergence and the earth mover's distance;
and updating the student language model according to the loss between the teacher language model and the student language model.
In one possible implementation, inputting the second task tag into the teacher language model to generate a plurality of second predicted texts corresponding to the new task, including:
and inputting the second task labels into the teacher language model, and generating a plurality of second prediction texts according to the number and the preset proportion of the second training texts in the second training text set.
In a second aspect, an embodiment of the present application provides an NLP task processing method, including:
acquiring a text of the NLP task;
inputting the text into a student language model as in the first aspect or any possible implementation manner of the first aspect, and generating a prediction result of the NLP task.
In one possible implementation, the text includes question prompt information and questions for the NLP task; inputting text into a student language model as in the first aspect or any possible implementation manner of the first aspect, and generating a prediction result of the NLP task, including:
and inputting the question prompt information and the question into a student language model to generate an answer corresponding to the question.
In a third aspect, an embodiment of the present application provides a training apparatus for a language model, including:
the acquisition module is used for acquiring a training sample set; the training sample set comprises a first task label of a new task, a plurality of first training texts of the new task, a first text label of each first training text, and a second task label of each old task in at least one old task, wherein the old task is an NLP task with a trained language model, and the new task is an NLP task without the trained language model;
the copying module is used for copying the language model to obtain a teacher language model and taking the language model as a student language model;
the generating module is used for inputting the second task labels into the teacher language model and generating a plurality of second training texts corresponding to the old tasks and a second text label of each second training text;
the generating module is further used for inputting the first task label, the second task label, the first training text and the second training text into the student language model, and generating a first prediction text corresponding to the first task label, a first prediction result corresponding to the first training text, a second prediction text corresponding to the second task label and a second prediction result corresponding to the second training text;
and the training module is used for training the student language model according to the first training text and the first prediction text, the second training text and the second prediction text, the first prediction result and the first text label, and the second prediction result and the second text label.
In one possible implementation manner, the first training text and the second training text are both question and answer format texts, and the question and answer format texts include question prompt information and questions generated according to the question prompt information.
In one possible implementation, the training module is configured to:
performing knowledge distillation on the student language model according to the second text label and the second prediction result;
and training the student language model according to the loss between the first training text and the first prediction text, the loss between the second training text and the second prediction text, the loss between the first prediction result and the first text label and the loss between the second prediction result and the second text label.
In one possible implementation, the training module is configured to:
calculating JS divergence of the teacher language model and the student language model according to the second text label and the second prediction result;
calculating the earth mover's distance for transferring the features of each layer in the teacher language model to the corresponding layer in the student language model;
calculating the loss between the teacher language model and the student language model according to the JS divergence and the earth mover's distance;
and updating the student language model according to the loss between the teacher language model and the student language model.
In a possible implementation manner, the generating module is configured to input the second task label into the teacher language model, and generate a plurality of second prediction texts according to the number and the preset proportion of the second training texts in the second training text set.
In a fourth aspect, an embodiment of the present application provides an NLP task processing apparatus, including:
the acquisition module is used for acquiring a text of the NLP task;
the generating module is used for inputting the text into a student language model trained as in the first aspect or any possible implementation manner of the first aspect, and generating a prediction result of the NLP task.
In one possible implementation, the text includes question prompt information and questions for the NLP task; the generation module is to: and inputting the question prompt information and the question into a student language model to generate an answer corresponding to the question.
In a fifth aspect, embodiments of the present application provide a computer device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the method as provided in the first aspect or any one of the possible implementations of the first aspect, or implements the method as provided in the second aspect or any one of the possible implementations of the second aspect.
In a sixth aspect, embodiments of the present application provide a computer storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the method provided in the first aspect or any one of the possible implementations of the first aspect, or implement the method provided in the second aspect or any one of the possible implementations of the second aspect.
According to the language model training method, the NLP task processing method and the device provided by the embodiments of the application, the second task label of an old task is input into the teacher language model that has been trained on the old task, so as to generate a second training text corresponding to the old task and a text label corresponding to the second training text. The new task is an NLP task on which the language model has not been trained, and the old task is an NLP task on which the language model has already been trained. The teacher language model is obtained by copying the language model, and the student language model is the language model itself. The first training text and first task label of the new task, together with the second training text and second task label, are then input into the student language model to generate a first predicted text corresponding to the first task label and a second predicted text corresponding to the second task label. The student language model is trained according to the first training text, the first predicted text, the second training text, the second predicted text, the first text label and the second text label. In this way, the training texts of the tasks on which the teacher language model has already been trained can be regenerated rather than stored, and the student language model, which is identical to the teacher language model, is trained on them together with the training texts of the task the model has not yet learned. A model that can execute both the trained and the untrained tasks is thus obtained, the catastrophic forgetting problem of the model is avoided, and the performance of the model is improved.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for training a language model according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart illustrating an NLP task processing method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a training apparatus for a language model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram illustrating an NLP task processing apparatus according to an embodiment of the present application;
fig. 5 shows a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.
In the description of the embodiments of the present application, the words "exemplary," "for example," and "for instance" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary," "for example," or "for instance" is not to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of these words is intended to present relevant concepts in a concrete fashion.
In the description of the embodiments of the present application, the term "and/or" is only one kind of association relationship describing an associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, B exists alone, and A and B exist at the same time. In addition, the term "plurality" means two or more unless otherwise specified. For example, the plurality of systems refers to two or more systems, and the plurality of screen terminals refers to two or more screen terminals.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
With the development of Natural Language Processing (NLP) technology, language models have been widely applied in many fields of daily life to accomplish different kinds of natural language processing tasks, such as text classification, emotion classification, and semantic role labeling in dialog systems.
At present, the catastrophic forgetting problem of neural networks degrades their performance. In the related art, in order to avoid catastrophic forgetting, multiple sub-neural-network models can be added to a main neural network model so that different sub-models learn different tasks, and already-trained tasks can be rehearsed by storing their training samples and prediction results.
However, in these implementations, constructing sub-networks makes the model grow with the number of sub-networks, so the storage resources occupied by the neural network keep increasing; likewise, saving the training samples and prediction results of already-trained tasks requires a large amount of storage.
Based on this, the embodiments of the application provide a language model training method, an NLP task processing method and an NLP task processing device, in which the training texts of already-trained tasks are generated by a teacher language model, thereby avoiding the heavy storage-resource consumption of saving training sample data.
Fig. 1 is a schematic flowchart of a method for training a language model according to an embodiment of the present application. As shown in fig. 1, the method for training a language model provided in the embodiment of the present application may include S101-S104.
S101: acquiring a training sample set; the training sample set comprises a first task label of a new task, a plurality of first training texts of the new task, a first text label of each first training text, and a second task label of each old task in at least one old task, wherein the old task is an NLP task with a trained language model, and the new task is an NLP task without the trained language model.
NLP tasks may involve many subjects, such as mathematics, poetry, music, and so on. In order to distinguish the categories of different NLP tasks, a task label can be set for each NLP task, and the task label characterizes the task category. Here, an old task is an NLP task on which the language model has already been trained, and a new task is an NLP task on which the language model has not been trained. For example, the new task is a mathematics task and the old task is a poetry task. The first training text of the new task may be "1+1 equals 2".
Here, there may be a plurality of old tasks.
In some embodiments, in order to improve the performance of the model, the first training texts may be formatted after they are obtained. Specifically, the first training texts may be unified into question-answer format texts by a regularization method according to a question-answer format. Here, a question-answer format text includes question prompt information and a question generated based on the question prompt information. For example, if the text of the new task is "add, 1+1 equals 2", it may be unified into "C#1+1, Q# equals how much, A#2", where the first training text of the new task is "C#1+1, Q# equals how much" and the first text label of the first training text is "A#2".
Here, the first training text may also include the first text label; for example, if the text of the new task is "add, 1+1 equals 2", it may be unified into "C#1+1, Q# equals how much, A#2", and the first training text may be determined to be "C#1+1, Q# equals how much, A#2".
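As an illustration of the formatting step above, the following Python sketch unifies a raw sample such as "add, 1+1 equals 2" into the C#/Q#/A# question-answer format. The regular expression, the field separators and the question wording "equals how much" are assumptions made for illustration only; the embodiment does not fix a concrete implementation.

import re

def to_qa_format(raw_text):
    """Unify a raw sample such as "add, 1+1 equals 2" into the
    question-answer format described above."""
    # Assumed pattern: "<task name>, <question prompt> equals <answer>"
    match = re.match(r"^[^,]+,\s*(?P<prompt>.+?)\s+equals\s+(?P<answer>.+)$", raw_text)
    if match is None:
        raise ValueError("sample does not match the assumed pattern: %r" % raw_text)
    prompt = match.group("prompt").strip()
    answer = match.group("answer").strip()
    first_training_text = "C#%s, Q# equals how much" % prompt   # question prompt + question
    first_text_label = "A#%s" % answer                           # ground-truth answer
    return first_training_text, first_text_label

print(to_qa_format("add, 1+1 equals 2"))
# ('C#1+1, Q# equals how much', 'A#2')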
In some embodiments, the language model is the GPT-2 model.
S102: and copying the language model to obtain a teacher language model, and taking the language model as a student language model.
In order to realize the self-distillation of the language model and avoid that the language model cannot execute old tasks due to catastrophic forgetting, the language model can be copied, the copied language model is used as a teacher language model, and then the language model is used as a student language model.
Here, in order to avoid the occupation of storage resources, the teacher language model may be deleted after the student language model is trained.
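A minimal sketch of S102 is given below. It assumes the GPT-2 implementation and the "gpt2" checkpoint name from the Hugging Face transformers library, which are illustrative assumptions; the embodiment only states that the language model is GPT-2. The teacher is a frozen copy that can be deleted once the student has been trained.

import copy
import torch
from transformers import GPT2LMHeadModel

# The current language model becomes the student; a deep copy becomes the teacher.
student = GPT2LMHeadModel.from_pretrained("gpt2")   # assumed checkpoint name
teacher = copy.deepcopy(student)
teacher.eval()
for param in teacher.parameters():
    param.requires_grad_(False)      # the teacher is only used for generation and distillation

# ... train the student on the new task and on the teacher-generated pseudo data ...

# Delete the teacher after training to avoid occupying storage resources.
del teacher
if torch.cuda.is_available():
    torch.cuda.empty_cache()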
S103: and inputting the second task labels into the teacher language model, and generating a plurality of second training texts corresponding to the old tasks and a second text label of each second training text.
In order to prevent the student language model from catastrophically forgetting during training on the new task and thus losing the ability to execute the old tasks, the training texts of the new task and of the old tasks can be combined during training. The second task label of an old task is input into the teacher language model to obtain pseudo data corresponding to that old task; the pseudo data can serve as training samples of the student language model for the old task, namely the second training texts and the second text labels. A second training text is a text randomly generated by the teacher language model conditioned on the second task label of the old task.
As the number of tasks grows, the number of training samples also grows, and the storage space they occupy becomes larger and larger. In the embodiments of the application, the training texts for the old tasks of the student language model are instead produced by having the teacher language model, which has already been trained on the old tasks, generate pseudo data, so no storage space needs to be allocated for a large number of old-task training samples, and the problem of heavy storage-resource occupation is avoided.
In some embodiments, the text format of the second training text is the question-answer format. For example, if the second task label is math, the generated second training texts may be "1+1=2" and "1×1=1".
In order to ensure the performance of the model, the second training texts may also be unified in format; for example, "1+1=2" is unified into "C#1+1, Q# equals how much" and "A#2", "1×1=1" is unified into "C#1×1, Q# equals how much" and "A#1", and so on. The second text label represents the true value corresponding to the question; for example, the "X" in "A#X" represents the true value corresponding to the second training text.
In some embodiments, to improve the training efficiency of the model, the number of second training texts may be determined according to the number of first training texts. Specifically, the product of the number of first training texts and a preset proportion is calculated to obtain the number of second training texts required.
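The pseudo-data generation of S103, together with the sample-count rule just described, could be sketched as follows. The use of the Hugging Face transformers generation API and the sampling settings are assumptions; the embodiment only specifies that the teacher language model generates texts conditioned on the second task label, in a quantity equal to the number of first training texts multiplied by a preset proportion.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")          # assumed checkpoint name
teacher = GPT2LMHeadModel.from_pretrained("gpt2")
teacher.eval()

def generate_pseudo_data(second_task_label, first_training_texts,
                         preset_proportion=1.0, max_length=64):
    """Generate second training texts for an old task by prompting the teacher
    with that old task's label (the second task label)."""
    # Number of second training texts = number of first training texts * preset proportion.
    num_samples = max(1, int(len(first_training_texts) * preset_proportion))
    prompt_ids = tokenizer(second_task_label, return_tensors="pt").input_ids
    with torch.no_grad():
        outputs = teacher.generate(
            prompt_ids,
            do_sample=True,                      # random generation of pseudo samples
            max_length=max_length,
            num_return_sequences=num_samples,
            pad_token_id=tokenizer.eos_token_id,
        )
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]

# e.g. pseudo_math = generate_pseudo_data("math", first_training_texts, preset_proportion=0.5)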
S104: and inputting the first task label, the second task label, the first training text and the second training text into a student language model to generate a first prediction text corresponding to the first task label, a first prediction result corresponding to the first training text, a second prediction text corresponding to the second task label and a second prediction result corresponding to the second training text.
Here, the second predicted text refers to a predicted text corresponding to the second task tag generated by the student language model based on the second task tag.
Specifically, the first task label is input into the student language model to generate the first predicted text, and the second task label is input into the student language model to generate the second predicted text; that is, the student language model executes the language generation task. The first training text is input into the student language model to obtain the first prediction result corresponding to the first training text. For example, if the first training text is "C# Quiet Night Thought, Q# who is the author", the first prediction result can be "A# Li Bai". The second training text is input into the student language model to obtain the second prediction result corresponding to the second training text. That is, the student language model performs the question-answering task.
S105: and training the student language model according to the first training text and the first predicted text, the second training text and the second predicted text, the first predicted result and the first text label, and the second predicted result and the second text label.
In some embodiments, in order to ensure that the student language model retains the teacher language model's performance on the old tasks, knowledge distillation needs to be performed on the student language model so that the language model completes self-distillation. Specifically, in S105, knowledge distillation is first performed on the student language model according to the second prediction result and the second text label; then the language model is trained according to the loss between the first predicted text and the first training text, the loss between the first prediction result and the first text label, the loss between the second predicted text and the second training text, and the loss between the second prediction result and the second text label.
Here, in the knowledge distillation of the student language model, first, the JS (Jensen-Shannon) divergence between the teacher language model and the student language model is calculated according to the second text label and the second prediction result, and the earth mover's distance for transferring the features of each layer in the teacher language model to the corresponding layer in the student language model is calculated; next, the loss between the teacher language model and the student language model is calculated according to the JS divergence and the earth mover's distance; finally, the student language model is updated according to the loss between the teacher language model and the student language model.
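The distillation loss described above can be sketched in PyTorch as follows. The JS divergence is computed between the teacher's and the student's output distributions; for the per-layer feature term, the exact earth mover's distance formulation is not given in the embodiment, so a simple mean-squared transport cost is used here purely as a placeholder, and the weights alpha and beta are assumptions.

import torch
import torch.nn.functional as F

def js_divergence(student_logits, teacher_logits, eps=1e-12):
    """Jensen-Shannon divergence between the student and teacher output
    distributions over the vocabulary."""
    p = F.softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + eps) - torch.log(m + eps)), dim=-1)
    kl_qm = torch.sum(q * (torch.log(q + eps) - torch.log(m + eps)), dim=-1)
    return (0.5 * (kl_pm + kl_qm)).mean()

def layer_transfer_cost(student_hidden_states, teacher_hidden_states):
    """Placeholder for the earth mover's distance of transferring each teacher
    layer's features to the corresponding student layer."""
    costs = [F.mse_loss(s, t) for s, t in zip(student_hidden_states, teacher_hidden_states)]
    return torch.stack(costs).mean()

def distillation_loss(student_outputs, teacher_outputs, alpha=1.0, beta=1.0):
    # Both models are assumed to be called with output_hidden_states=True.
    return (alpha * js_divergence(student_outputs.logits, teacher_outputs.logits)
            + beta * layer_transfer_cost(student_outputs.hidden_states,
                                         teacher_outputs.hidden_states))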
In this way, the teacher language model is obtained by copying the language model and knowledge distillation is performed on the student model, which realizes self-distillation of the language model, that is, incremental learning of the language model, and broadens the range of tasks the language model can execute.
According to the language model training method provided by the embodiment of the application, the second task label of an old task is input into the teacher language model that has been trained on the old task, so as to generate a second training text corresponding to the old task and a text label corresponding to the second training text. The new task is an NLP task on which the language model has not been trained, and the old task is an NLP task on which the language model has already been trained. The teacher language model is obtained by copying the language model, and the student language model is the language model itself. The first training text and first task label of the new task, together with the second training text and second task label, are then input into the student language model to generate a first predicted text corresponding to the first task label and a second predicted text corresponding to the second task label. The student language model is trained according to the first training text, the first predicted text, the second training text, the second predicted text, the first text label and the second text label. In this way, the training texts of the tasks on which the teacher language model has already been trained can be regenerated rather than stored, and the student language model, which is identical to the teacher language model, is trained on them together with the training texts of the task the model has not yet learned. A model that can execute both the trained and the untrained tasks is thus obtained, the catastrophic forgetting problem of the model is avoided, and the performance of the model is improved.
Based on the language model in the above embodiment, the embodiment of the present application further provides an NLP task processing method. Fig. 2 is a schematic flowchart of an NLP task processing method provided in an embodiment of the present application, and as shown in fig. 2, the NLP task processing method provided in the embodiment of the present application may include S201 to S202.
S201: and acquiring the text of the NLP task.
In some embodiments, the NLP task text is in question and answer format, and the NLP task text includes question prompt information and a question.
S202: and inputting the text into a language model to generate a prediction result of the NLP task.
Here, the language model is a language model trained by the embodiment corresponding to fig. 1.
In some embodiments, the question prompt information and the question are input into a language model, and an answer corresponding to the question is generated through greedy decoding.
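A possible inference sketch for S201-S202 is shown below, again assuming a GPT-2 student from the transformers library and the C#/Q#/A# text format used in the training examples; the concatenation details and the model path are assumptions.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")               # assumed checkpoint name
student = GPT2LMHeadModel.from_pretrained("./trained_student")  # assumed path to the trained student

def answer_question(question_prompt, question, max_length=64):
    """Feed the question prompt information and the question to the student
    language model and decode the answer greedily."""
    text = "C#%s, Q# %s, A#" % (question_prompt, question)
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = student.generate(
        input_ids,
        do_sample=False,                       # greedy decoding
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,
    )
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return decoded[len(text):].strip()          # keep only the generated answer

# e.g. answer_question("1+1", "equals how much")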
According to the NLP task processing method provided by the embodiment of the application, the NLP task is input into the language model trained in the embodiment corresponding to fig. 1 to obtain a prediction result, which improves the universality of task processing.
Based on the training method of the language model in the above embodiment, the embodiment of the present application further provides a training device of the language model. Fig. 3 is a schematic structural diagram of a training apparatus 300 for a language model according to an embodiment of the present disclosure, and as shown in fig. 3, the training apparatus 300 for a language model may include an obtaining module 301, a copying module 302, a generating module 303, and a training module 304.
An obtaining module 301, configured to obtain a training sample set; the training sample set comprises a first task label of a new task, a plurality of first training texts of the new task, a first text label of each first training text, and a second task label of each old task in at least one old task, wherein the old task is an NLP task with a trained language model, and the new task is an NLP task without the trained language model;
the copying module 302 is used for copying the language model to obtain a teacher language model and taking the language model as a student language model;
the generating module 303 is configured to input the second task label into the teacher language model, and generate a plurality of second training texts corresponding to the old task and a second text label of each second training text;
the generating module 303 is further configured to input the first task label, the second task label, the first training text, and the second training text into a student language model, and generate a first predicted text corresponding to the first task label, a first predicted result corresponding to the first training text, a second predicted text corresponding to the second task label, and a second predicted result corresponding to the second training text;
the training module 304 is configured to train the student language model according to the first training text and the first predicted text, the second training text and the second predicted text, the first predicted result and the first text label, and the second predicted result and the second text label.
In one possible implementation manner, the first training text and the second training text are both question and answer format texts, and the question and answer format texts include question prompt information and questions generated according to the question prompt information.
In one possible implementation, the training module 304 is configured to:
performing knowledge distillation on the student language model according to the second text label and the second prediction result;
and training the student language model according to the loss between the first training text and the first prediction text, the loss between the second training text and the second prediction text, the loss between the first prediction result and the first text label and the loss between the second prediction result and the second text label.
In one possible implementation, the training module 304 is configured to:
calculating JS divergence of the teacher language model and the student language model according to the second text label and the second prediction result;
calculating the earth mover's distance for transferring the features of each layer in the teacher language model to the corresponding layer in the student language model;
calculating the loss between the teacher language model and the student language model according to the JS divergence and the earth mover's distance;
and updating the student language model according to the loss between the teacher language model and the student language model.
In a possible implementation manner, the generating module 303 is configured to input the second task label into the teacher language model, and generate a plurality of second prediction texts according to the number and the preset ratio of the second training texts in the second training text set.
The training device for language models provided in the embodiment of the present application can perform the steps of the method in the embodiment corresponding to fig. 1, and can achieve the same technical effect, and in order to avoid repetition, detailed description is not provided here.
According to the training device for the language model provided by the embodiment of the application, the second task label of an old task is input into the teacher language model that has been trained on the old task, so as to generate a second training text corresponding to the old task and a text label corresponding to the second training text. The new task is an NLP task on which the language model has not been trained, and the old task is an NLP task on which the language model has already been trained. The teacher language model is obtained by copying the language model, and the student language model is the language model itself. The first training text and first task label of the new task, together with the second training text and second task label, are then input into the student language model to generate a first predicted text corresponding to the first task label and a second predicted text corresponding to the second task label. The student language model is trained according to the first training text, the first predicted text, the second training text, the second predicted text, the first text label and the second text label. In this way, the training texts of the tasks on which the teacher language model has already been trained can be regenerated rather than stored, and the student language model, which is identical to the teacher language model, is trained on them together with the training texts of the task the model has not yet learned. A model that can execute both the trained and the untrained tasks is thus obtained, the catastrophic forgetting problem of the model is avoided, and the performance of the model is improved.
Based on the NLP task processing method in the above embodiment, the present application embodiment further provides an NLP task processing device. Fig. 4 is a schematic structural diagram of an NLP task processing device 400 according to an embodiment of the present disclosure, and as shown in fig. 4, the NLP task processing device 400 according to the embodiment of the present disclosure may include an obtaining module 401 and a generating module 402.
An obtaining module 401, configured to obtain a text of an NLP task;
A generating module 402, configured to input the text into a student language model trained by the method in the embodiment corresponding to fig. 1, and generate a prediction result of the NLP task.
In one possible implementation, the text includes question prompt information and questions for the NLP task; the generation module 402 is configured to: and inputting the question prompt information and the question into a student language model to generate an answer corresponding to the question.
The NLP task processing device provided in the embodiment of the present application can perform the steps of the method in the embodiment corresponding to fig. 2, and can achieve the same technical effect; to avoid repetition, details are not described here again.
The NLP task processing device provided in the embodiment of the present application obtains a prediction result by inputting an NLP task to a language model trained in the embodiment corresponding to fig. 1, and improves the universality of task processing.
A computer device provided in an embodiment of the present application is described below.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 5, the computer device provided in the embodiment of the present application may be used to implement the method for training a language model or the NLP task processing method described in the foregoing method embodiment.
The computer device may comprise a processor 501 and a memory 502 in which computer program instructions are stored.
Specifically, the processor 501 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 502 may include mass storage for data or instructions. By way of example, and not limitation, memory 502 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 502 may include removable or non-removable (or fixed) media, where appropriate. The memory 502 may be internal or external to the integrated gateway disaster recovery device, where appropriate. In a particular embodiment, the memory 502 is a non-volatile solid-state memory.
The memory may include Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physical/tangible memory storage devices. Thus, in general, the memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software comprising computer-executable instructions and when the software is executed (e.g., by one or more processors), it is operable to perform operations described with reference to methods in accordance with the present application.
The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement the training method of the language model or the NLP task processing method in any of the above embodiments.
In one example, the computer device can also include a communication interface 505 and a bus 510. As shown in fig. 5, the processor 501, the memory 502, and the communication interface 505 are connected via the bus 510 to complete communication therebetween.
The communication interface 505 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiments of the present application.
Bus 510 includes hardware, software, or both to couple the components of the electronic device to each other. By way of example, and not limitation, a bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a Hypertransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an infiniband interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a video electronics standards association local (VLB) bus, or other suitable bus or a combination of two or more of these. Bus 510 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
In addition, in combination with the above embodiments, the embodiments of the present application may be implemented by providing a computer storage medium. The computer storage medium having computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement a method of training a language model or a method of NLP task processing in any of the above embodiments.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware for performing the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As described above, only the specific embodiments of the present application are provided, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered within the scope of the present application.

Claims (9)

1. A method for training a language model, comprising:
acquiring a training sample set; the training sample set comprises a first task label of a new task, a plurality of first training texts of the new task, a first text label of each first training text, and a second task label of each old task in at least one old task, wherein the old task is an NLP task with a trained language model, and the new task is an NLP task without the trained language model;
copying the language model to obtain a teacher language model, and taking the language model as a student language model;
inputting the second task labels into the teacher language model, and generating a plurality of second training texts corresponding to the old tasks and a second text label of each second training text;
inputting the first task label, the second task label, the first training text and the second training text into the student language model, and generating a first predicted text corresponding to the first task label, a first predicted result corresponding to the first training text, a second predicted text corresponding to the second task label and a second predicted result corresponding to the second training text;
and training the student language model according to the first training text and the first predicted text, the second training text and the second predicted text, the first predicted result and the first text label, and the second predicted result and the second text label.
2. The method according to claim 1, wherein the first training text and the second training text are question and answer formatted texts, and the question and answer formatted texts comprise question prompt information and questions generated according to the question prompt information.
3. The method of claim 2, wherein training the student language model based on the first training text, the first predictive text, the second training text, and the second predictive text comprises:
performing knowledge distillation on the student language model according to the second text label and the second prediction result;
and training the student language model according to the loss between the first training text and the first predicted text, the loss between the second training text and the second predicted text, the loss between the first predicted result and the first text label and the loss between the second predicted result and the second text label.
4. The method of claim 3, wherein knowledge distillation of the student language model based on the second text label and the second prediction comprises:
calculating JS divergence of the teacher language model and the student language model according to the second text label and the second prediction result;
calculating the earth mover's distance of transferring the features of each layer in the teacher language model to the corresponding layer in the student language model;
calculating the loss between the teacher language model and the student language model according to the JS divergence and the earth mover's distance;
updating the student language model based on the loss between the teacher language model and the student language model.
5. The method of claim 1, wherein the inputting the second task label into a teacher language model and generating a plurality of second predicted texts corresponding to the new task comprises:
and inputting the second task labels into a teacher language model, and generating a plurality of second prediction texts according to the number and preset proportion of the second training texts in the second training text set.
6. An NLP task processing method, comprising:
acquiring a text of the NLP task;
inputting the text into the student language model of any one of claims 1-5, generating a prediction result of the NLP task.
7. The method of claim 6, wherein the text includes question prompt information and questions for the NLP task; the inputting the text into the student language model of any one of claims 1-5, generating a predicted result of the NLP task, comprising:
and inputting the question prompt information and the question into the student language model to generate an answer corresponding to the question.
8. An apparatus for training a language model, comprising:
the acquisition module is used for acquiring a training sample set; the training sample set comprises a first task label of a new task, a plurality of first training texts of the new task, a first text label of each first training text, and a second task label of each old task in at least one old task, wherein the old task is an NLP task with a trained language model, and the new task is an NLP task without the trained language model;
the copying module is used for copying the language model to obtain a teacher language model and taking the language model as a student language model;
the generating module is used for inputting the second task labels into the teacher language model, and generating a plurality of second training texts corresponding to the old task and a second text label of each second training text;
the generating module is further configured to input the first task label, the second task label, the first training text, and the second training text into the student language model, and generate a first predicted text corresponding to the first task label, a first predicted result corresponding to the first training text, a second predicted text corresponding to the second task label, and a second predicted result corresponding to the second training text;
and the training module is used for training the student language model according to the first training text and the first predicted text, the second training text and the second predicted text, the first predicted result and the first text label, and the second predicted result and the second text label.
9. An NLP task processing apparatus, comprising:
the acquisition module is used for acquiring a text of the NLP task;
a generating module for inputting the text into the student language model according to any one of claims 1 to 5, and generating a prediction result of the NLP task.
CN202110705729.9A 2021-06-24 2021-06-24 Language model training method, NLP task processing method and device Pending CN113420123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110705729.9A CN113420123A (en) 2021-06-24 2021-06-24 Language model training method, NLP task processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110705729.9A CN113420123A (en) 2021-06-24 2021-06-24 Language model training method, NLP task processing method and device

Publications (1)

Publication Number Publication Date
CN113420123A true CN113420123A (en) 2021-09-21

Family

ID=77717665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110705729.9A Pending CN113420123A (en) 2021-06-24 2021-06-24 Language model training method, NLP task processing method and device

Country Status (1)

Country Link
CN (1) CN113420123A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200134506A1 (en) * 2018-10-29 2020-04-30 Fujitsu Limited Model training method, data identification method and data identification device
US20200334520A1 (en) * 2019-04-19 2020-10-22 Microsoft Technology Licensing, Llc Multi-task machine learning architectures and training procedures
CN112487182A (en) * 2019-09-12 2021-03-12 华为技术有限公司 Training method of text processing model, and text processing method and device
CN111199242A (en) * 2019-12-18 2020-05-26 浙江工业大学 Image increment learning method based on dynamic correction vector
CN111506702A (en) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN111159416A (en) * 2020-04-02 2020-05-15 腾讯科技(深圳)有限公司 Language task model training method and device, electronic equipment and storage medium
CN111767711A (en) * 2020-09-02 2020-10-13 之江实验室 Compression method and platform of pre-training language model based on knowledge distillation
CN112507209A (en) * 2020-11-10 2021-03-16 中国科学院深圳先进技术研究院 Sequence recommendation method for knowledge distillation based on land moving distance
CN112613273A (en) * 2020-12-16 2021-04-06 上海交通大学 Compression method and system of multi-language BERT sequence labeling model
CN112966712A (en) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and device, electronic equipment and computer readable medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐聪; 李擎; 张德政; 陈鹏; 崔家瑞: "Research Progress of Deep Reinforcement Learning in the Field of Text Generation" (文本生成领域的深度强化学习研究进展), Chinese Journal of Engineering (工程科学学报), no. 04, 31 March 2020 (2020-03-31) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401359A (en) * 2023-06-09 2023-07-07 深圳前海环融联易信息科技服务有限公司 Document extraction method and device, medium and equipment
CN117350407A (en) * 2023-11-20 2024-01-05 北京中关村科金技术有限公司 Model processing method, device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN110147456B (en) Image classification method and device, readable storage medium and terminal equipment
CN111737476B (en) Text processing method and device, computer readable storage medium and electronic equipment
CN113420123A (en) Language model training method, NLP task processing method and device
CN108090218B (en) Dialog system generation method and device based on deep reinforcement learning
CN116629275B (en) Intelligent decision support system and method based on big data
CN110209782B (en) Question-answering model and answer sentence generation method and device, medium and electronic equipment
CN108038541B (en) CTR (China train redundancy) estimation method, device, equipment and computer readable medium
CN116501898A (en) Financial text event extraction method and device suitable for few samples and biased data
CN114511038A (en) False news detection method and device, electronic equipment and readable storage medium
CN114443483A (en) Test method and device of artificial intelligence system, electronic equipment and medium
CN113095045A (en) Chinese mathematics application problem data enhancement method based on reverse operation
CN114548192A (en) Sample data processing method and device, electronic equipment and medium
CN111026849B (en) Data processing method and device
CN110929516A (en) Text emotion analysis method and device, electronic equipment and readable storage medium
CN110825866A (en) Automatic question-answering system and device based on deep network and text similarity
CN110852042A (en) Character type conversion method and device
CN114547308A (en) Text processing method and device, electronic equipment and storage medium
CN114647977A (en) Training method and device for optical fiber transmission signal prediction model
CN113222050A (en) Image classification method and device, readable medium and electronic equipment
CN112907409A (en) Application problem solving method, device, medium and electronic equipment
CN113870846B (en) Speech recognition method, device and storage medium based on artificial intelligence
CN114330512B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN117493688B (en) Teaching information recommendation method and device based on user portrait and electronic equipment
CN114077654A (en) Question answering method, device, equipment and storage medium
US20240135188A1 (en) Semi-supervised framework for efficient time-series ordinal classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination