CN114036300A - Language model training method and device, electronic equipment and storage medium - Google Patents

Language model training method and device, electronic equipment and storage medium

Info

Publication number
CN114036300A
Authority
CN
China
Prior art keywords
text
training
loading
text data
identification information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111367500.5A
Other languages
Chinese (zh)
Inventor
张晗
杜新凯
吕超
谷姗姗
孙垚锋
李文灏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunshine Insurance Group Co Ltd
Original Assignee
Sunshine Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunshine Insurance Group Co Ltd filed Critical Sunshine Insurance Group Co Ltd
Priority to CN202111367500.5A
Publication of CN114036300A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a language model training method and device, an electronic device and a storage medium. The method comprises the following steps: acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task; acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading training text samples; loading the initial text data according to the text loading template to obtain training text samples for training a language model, the training text samples comprising training texts with identification information and training texts without identification information; and iteratively updating an initial language model by using the training text samples to generate a target language model. Because the method and the device train the language model synchronously on training texts with identification information and training texts without identification information, model accuracy can be improved.

Description

Language model training method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for training a language model, an electronic device, and a storage medium.
Background
Pre-trained language models are foundational research work in Natural Language Processing (NLP) and are widely applied in task scenarios such as text classification, semantic similarity and entity recognition. After Google released the open-source pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) in 2019, research and application in this area have grown rapidly. The standard paradigm currently used for models of various natural language tasks is pretraining + fine-tuning (Pretrain + Finetune): a language model is pre-trained on a large amount of unlabeled corpus, modules such as fully connected layers are then added to the model, and the model is fine-tuned on labeled data for the task. However, this training mode may cause a discrepancy between the model in the pre-training stage and the model in the downstream fine-tuning stage, so that the finally obtained language model has low accuracy; the training mode also requires a large amount of manual labeling, so the training cost is high.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method and an apparatus for training a language model, an electronic device, and a storage medium, which can improve model accuracy by synchronously training the language model using a training text with identification information and a training text without identification information.
The embodiment of the application provides a training method of a language model, which comprises the following steps:
acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task;
acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading a training text sample;
loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text;
and iteratively updating the initial language model by using the training text sample to generate a target language model.
Optionally, when the preset natural language processing task is a news topic classification task, the training method includes:
acquiring initial news text data;
acquiring a text loading template of the news topic classification task;
loading the initial news text data according to the text loading template of the news topic classification task to obtain a training text sample for training a news topic classification model; the training text sample comprises an identification information training text and a non-identification information training text;
and iteratively updating the initial news topic classification model by using the training text sample to generate a target news topic classification model.
Optionally, after acquiring the initial text data, the training method further includes:
preprocessing the initial text data by removing special characters, spaces and garbled characters and clipping the initial text data to a preset length, to obtain preprocessed initial text data, wherein the preprocessed initial text data is the initial text data loaded by the text loading template; the initial text data includes text data with identification information and text data without identification information.
Optionally, before obtaining a text loading template for loading a training text sample corresponding to the preset natural language processing task, the training method further includes:
acquiring a plurality of text loading templates which are pre-designed by a user and a natural language processing task corresponding to each text loading template;
binding and storing each acquired text loading template and the corresponding natural language processing task, and constructing a text loading template library; the text loading template comprises a text loading position and a text answer position.
Optionally, the loading initial text data according to the text loading template to obtain a training text sample for training a language model includes:
loading the text in the text data with the identification information to a text loading position in the text loading template, loading the identification information corresponding to the text to a text answer position in the text loading template, and generating a training text with the identification information;
and taking the text data without the identification information as the training text without the identification information.
Optionally, the natural language processing task includes an emotion classification task, a news topic classification task, an intention recognition task, a named entity recognition task, and a semantic matching task, and when the natural language processing task to be processed is the intention recognition task, after a target language model is generated, the training method further includes:
acquiring a trained intention recognition model, an intention recognition text loading template and text data to be processed needing intention recognition;
loading the text data to be processed into the intention recognition text loading template, and determining the text data to be predicted with a blank text answer position;
and inputting the text data to be predicted into the intention recognition model, determining a predicted answer for the text answer position in the text data to be predicted, and determining the predicted answer as the intention recognition result of the text data to be processed.
The embodiment of the present application further provides a training device for a language model, where the training device includes:
the system comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task;
the second acquisition module is used for acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading a training text sample;
the loading module is used for loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text;
and the generating module is used for performing iterative updating on the initial language model by using the training text sample to generate a target language model.
Optionally, when the preset natural language processing task is a news topic classification task, the training device is configured to:
acquiring initial news text data;
acquiring a text loading template of the news topic classification task;
loading the initial news text data according to the text loading template of the news topic classification task to obtain a training text sample for training a news topic classification model; the training text sample comprises an identification information training text and a non-identification information training text;
and iteratively updating the initial news topic classification model by using the training text sample to generate a target news topic classification model.
Optionally, the training device further includes a preprocessing module, and the preprocessing module is configured to:
preprocessing the initial text data by removing special characters, spaces and garbled characters and clipping the initial text data to a preset length, to obtain preprocessed initial text data, wherein the preprocessed initial text data is the initial text data loaded by the text loading template; the initial text data includes text data with identification information and text data without identification information.
Optionally, the training apparatus further includes a template library construction module, where the template library construction module is configured to:
acquiring a plurality of text loading templates which are pre-designed by a user and a natural language processing task corresponding to each text loading template;
binding and storing each acquired text loading template and the corresponding natural language processing task, and constructing a text loading template library; the text loading template comprises a text loading position and a text answer position.
Optionally, when the loading module is configured to load initial text data according to the text loading template to obtain a training text sample for training a language model, the loading module is configured to:
loading the text in the text data with the identification information to a text loading position in the text loading template, loading the identification information corresponding to the text to a text answer position in the text loading template, and generating a training text with the identification information;
and taking the text data without the identification information as the training text without the identification information.
Optionally, the training apparatus further includes an application module, and when the natural language processing task to be processed is an intention recognition task, the application module is configured to:
acquiring a trained intention recognition model, an intention recognition text loading template and text data to be processed needing intention recognition;
loading the text data to be processed into the intention recognition text loading template, and determining the text data to be predicted with a blank text answer position;
and inputting the text data to be predicted into the intention recognition model, determining a predicted answer for the text answer position in the text data to be predicted, and determining the predicted answer as the intention recognition result of the text data to be processed.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the training method as described above.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the training method as described above.
The embodiment of the application provides a method and a device for training a language model, an electronic device and a storage medium, and the method comprises the following steps: acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task; acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading a training text sample; loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text; and iteratively updating the initial language model by using the training text sample to generate a target language model.
In the present application, the text loading template is used to generate training texts with identification information, and the language model is trained simultaneously on training texts with identification information and training texts without identification information. The pre-training and fine-tuning stage tasks are thereby combined into a single training task, so that a better language model can be obtained by training on less manually labeled data, a large amount of manual labeling cost is avoided, and model training time is shortened.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flowchart of a method for training a language model according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of an apparatus for training a language model according to an embodiment of the present disclosure;
FIG. 3 is a second schematic structural diagram of an apparatus for training a language model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
The standard paradigm used for models applied to various natural language tasks is pretraining + fine-tuning (Pretrain + Finetune): a language model is pre-trained on a large amount of unlabeled corpus, modules such as fully connected layers are then added to the model, and the model is fine-tuned on labeled data for the task. However, this training mode may cause a discrepancy between the model in the pre-training stage and the model in the downstream fine-tuning stage, so that the finally obtained language model has low accuracy; the training mode also requires a large amount of manual labeling, so the training cost is high.
Based on this, the embodiments of the present application provide a language model training method that realizes an end-to-end training objective and can improve model accuracy.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for training a language model according to an embodiment of the present disclosure. As shown in fig. 1, a method for training a language model provided in an embodiment of the present application includes:
s101, acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task.
It should be noted that the natural language processing tasks may include lexical analysis tasks, sentence analysis tasks, semantic analysis tasks, information extraction tasks, and top-level tasks.
The lexical analysis task is used for performing lexical level analysis on natural language, is fundamental work of Natural Language Processing (NLP), and specifically comprises word segmentation, new word discovery, morphological analysis, part of speech tagging, spelling correction and the like.
The sentence analysis task analyzes natural language at the sentence level and comprises syntactic analysis and other sentence-level analysis tasks, specifically: chunk analysis, supertagging, sentence constituent analysis, dependency parsing, language identification and the like.
The semantic analysis task analyzes and understands a given text to form a formal representation or a distributed representation capable of expressing its semantics, and comprises: word sense disambiguation, semantic role labeling, abstract meaning representation parsing, first-order predicate logic calculus, frame semantic parsing, vectorized representation of words/sentences/paragraphs, and the like.
The information extraction task extracts structured information from unstructured text and comprises: named entity recognition, entity disambiguation, term extraction, coreference resolution, relation extraction, event extraction, sentiment analysis, intent recognition, slot filling, and the like.
The top-level task is a system-level task directly oriented to ordinary users that provides natural language processing product services and may use natural language processing technologies at multiple levels; it specifically comprises: machine translation, text summarization, question-answering systems, dialog systems, etc.
The natural language processing task in this step may include all of the tasks described above, or may include some of the tasks described above, for example, the natural language processing task includes an emotion classification task, a news topic classification task, an intention recognition task, a named entity recognition task, and a semantic matching task.
Here, the acquired initial text data is determined according to the specific preset natural language processing task. For example, when an insurance-domain knowledge question-answering system is to be built, a crawler can be used to collect texts from the insurance domain, such as encyclopedia entries for proper nouns and insurance clauses. During crawling, as much unsupervised text data related to the task as possible is collected, together with some supervised text data; alternatively, part of the collected unsupervised text data is extracted and manually labeled to obtain supervised text data.
In one example of the present application, after obtaining the initial text data, the training method further comprises: preprocessing the initial text data by removing special characters, spaces and garbled characters and clipping the initial text data to a preset length, to obtain preprocessed initial text data, wherein the preprocessed initial text data is the initial text data loaded by the text loading template; the initial text data includes text data with identification information and text data without identification information.
Here, the preprocessing includes one or more of the following: regularization and text length clipping. Regularization cleans the initial text data by deleting meaningless special characters, spaces, garbled characters, and the like. Text length clipping cuts the text into data that meets the length required for model training.
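By way of illustration only, the following Python sketch shows one possible form of this preprocessing; the regular expression, the preset length of 128 characters and the function name are assumptions and are not specified by the present application.
```python
import re

def preprocess_texts(raw_texts, preset_length=128):
    """Regularize initial text data and clip it to a preset length."""
    cleaned = []
    for text in raw_texts:
        # Regularization: delete special characters, spaces and garbled characters,
        # keeping word characters and common punctuation.
        text = re.sub(r"[^\w\u4e00-\u9fa5，。！？、,.!?]", "", text)
        if text:
            # Text length clipping: cut to the length required by model training.
            cleaned.append(text[:preset_length])
    return cleaned

# "The Lakers defeated the Rockets by a large margin" followed by noise characters.
print(preprocess_texts(["湖人大比分击败火箭###\u00a0\u00a0", ""]))
```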
And S102, acquiring a text loading template corresponding to the preset natural language processing task and used for loading the training text sample.
In an example of the present application, before obtaining a text loading template for loading a training text sample corresponding to the preset natural language processing task, the training method further includes: acquiring a plurality of text loading templates which are pre-designed by a user and a natural language processing task corresponding to each text loading template; binding and storing each acquired text loading template and the corresponding natural language processing task, and constructing a text loading template library; the text loading template comprises a text loading position and a text answer position.
In the step, the text loading template is pre-designed by the user according to the specific natural language processing task to be processed, and the designed text loading template is a prompt template. After the design is finished, aiming at each text loading template, binding the text loading template and the corresponding natural language processing task. And constructing a text loading template library based on a plurality of text loading templates which are designed in advance. Each text loading template comprises a text loading position and a text answer position.
It should be noted that the text loading template in this step is generated based on a prompt. Such a template is usually a segment of natural language containing two or more empty positions, at least one of which is used for loading text and one for generating the text answer.
For example, referring to table 1, table 1 is an example of a text loading template. As shown in Table 1, different natural language processing tasks correspond to different text loading templates.
Table 1: text load template example
Natural language task Template example
Emotion analysis [ text ] the product is [ answer ]
News topic taxonomy [ text ] this report is about [ answer ]
Intent recognition The problem is about [ answer ]
Named entity recognition [ text ] A Beijing university is an [ answer ]
Semantic matching [ text ] and [ text ], and [ answer ] are similar
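For illustration, such a template library could be organized as in the following Python sketch; the dictionary structure, the "[text]"/"[answer]" slot markers, the task keys and the file name are assumptions built on the examples in Table 1, not a format prescribed by the present application.
```python
import json

# Each natural language processing task is bound to a text loading template that contains
# a text loading position "[text]" and a text answer position "[answer]" (see Table 1).
TEMPLATE_LIBRARY = {
    "emotion_analysis": "[text] The product is [answer]",
    "news_topic_classification": "[text] This report is about [answer]",
    "intent_recognition": "[text] The problem is about [answer]",
    "named_entity_recognition": "[text] Beijing University is a [answer]",
    "semantic_matching": "[text] and [text] are [answer] similar",
}

def get_template(task):
    """Return the text loading template bound to the given natural language processing task."""
    return TEMPLATE_LIBRARY[task]

# The library can be persisted, e.g. as a JSON file, for reuse by the training system.
with open("template_library.json", "w", encoding="utf-8") as f:
    json.dump(TEMPLATE_LIBRARY, f, ensure_ascii=False, indent=2)
```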
S103, loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text.
In an example of the present application, the loading initial text data according to the text loading template to obtain a training text sample for training a language model includes: loading the text in the text data with the identification information to a text loading position in the text loading template, loading the identification information corresponding to the text to a text answer position in the text loading template, and generating a training text with the identification information; and taking the text data without the identification information as the training text without the identification information.
After the initial text data has been preprocessed into initial text data meeting the requirements, the samples for model training are generated as follows. For text data with identification information in the initial text data, the acquired text loading template is used: the text is loaded into the text loading position of the template, the identification information corresponding to the text is loaded into the text answer position of the template, and the loaded natural language is taken as a training sample of the language model; this sample is a training text with identification information. For text data without identification information in the initial text data, because the identification information is missing and the text answer position of the text loading template cannot be filled, the text data is used directly as a training sample of the model; this sample is a training text without identification information.
Here, through the above steps, a plurality of training texts with identification information and a plurality of training texts without identification information can be obtained.
For example, as shown in Table 1, assume the natural language task to be processed is news topic classification and the obtained text loading template is "[text] This report is about [answer]". The obtained initial text data includes the text data with identification information "The Lakers defeated the Rockets by a large margin / sports", the text data without identification information "Beijing will host the Olympic Games in 2022", and the text data without identification information "The Lakers defeated the Rockets by a large margin". When training text samples are constructed with the text loading template, the following are generated: the training text with identification information "The Lakers defeated the Rockets by a large margin. This report is about sports", the training text without identification information "Beijing will host the Olympic Games in 2022", and the training text without identification information "The Lakers defeated the Rockets by a large margin".
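A minimal sketch of this sample-generation step, assuming the plain string slot markers from the template library sketch above, is:
```python
def build_training_samples(labeled_data, unlabeled_texts, template):
    """Generate training text samples from initial text data and a text loading template.

    labeled_data    -- list of (text, identification information) pairs
    unlabeled_texts -- list of texts without identification information
    template        -- e.g. "[text] This report is about [answer]"
    """
    samples = []
    # Text data with identification information: load the text into the text loading position
    # and the identification information into the text answer position.
    for text, label in labeled_data:
        samples.append(template.replace("[text]", text, 1).replace("[answer]", label, 1))
    # Text data without identification information is used directly as a training text.
    samples.extend(unlabeled_texts)
    return samples

template = "[text] This report is about [answer]"
labeled = [("The Lakers defeated the Rockets by a large margin.", "sports")]
unlabeled = ["Beijing will host the Olympic Games in 2022.",
             "The Lakers defeated the Rockets by a large margin."]
print(build_training_samples(labeled, unlabeled, template))
```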
And S104, performing iterative updating on the initial language model by using the training text sample to generate a target language model.
In an example of the present application, after obtaining a plurality of training text samples, the training method further includes: according to a preset sample segmentation proportion, segmenting the training text samples into two groups of training text samples, wherein one group is a training set, and the other group is a verification set; the training set is used for training an initial language model, and the verification set is used for verifying the model effect in the training process.
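For example, a split by a preset proportion might look like the following sketch; the 9:1 ratio and the random shuffling are assumptions rather than values stated in this application.
```python
import random

def split_samples(samples, train_ratio=0.9, seed=42):
    """Split training text samples into a training set and a verification set."""
    random.seed(seed)
    shuffled = list(samples)
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, valid_set = split_samples([f"sample {i}" for i in range(100)])
print(len(train_set), len(valid_set))  # 90 10
```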
It should be noted that the initial language model may be a BERT model. During training, methods such as gradient back-propagation, gradient descent and adaptive moment estimation (Adam) may be used; when the loss function converges or reaches a preset requirement, training is stopped and the target language model is generated.
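A highly simplified Python sketch of such a training step is shown below. It assumes the initial language model is the Hugging Face BertForMaskedLM checkpoint bert-base-chinese and that a prompt-filled sample is trained by masking its answer position and predicting the identification-information tokens; the checkpoint name, learning rate and masking scheme are illustrative assumptions, not the procedure mandated by the present application.
```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed initial language model
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)          # adaptive moment estimation

def prompt_mlm_loss(masked_prompt, answer_text):
    """Loss for one prompt-filled sample: the [MASK] positions must predict the answer tokens."""
    enc = tokenizer(masked_prompt, return_tensors="pt")
    labels = torch.full_like(enc["input_ids"], -100)                 # -100 = positions ignored by the loss
    answer_ids = tokenizer(answer_text, add_special_tokens=False)["input_ids"]
    mask_positions = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    labels[0, mask_positions] = torch.tensor(answer_ids)
    return model(**enc, labels=labels).loss

# "The Lakers defeated the Rockets by a large margin. This report is about [MASK][MASK]" -> "sports"
loss = prompt_mlm_loss("湖人大比分击败火箭，这篇报道是关于[MASK][MASK]的", "体育")
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```
Training texts without identification information can be fed through the same loss by masking tokens at random (for example with the transformers DataCollatorForLanguageModeling), so that both kinds of samples drive a single masked-language-modelling objective until the loss converges.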
In addition, after training is finished, the prompt template is saved as a JSON file, and the trained model parameters are saved as a binary file for prediction by the online system.
In another example of the present application, when the preset natural language processing task is a news topic classification task, the training method includes: acquiring initial news text data; acquiring a text loading template of the news topic classification task; loading the initial news text data according to the text loading template of the news topic classification task to obtain a training text sample for training a news topic classification model; the training text sample comprises an identification information training text and a non-identification information training text; and iteratively updating the initial news topic classification model by using the training text sample to generate a target news topic classification model.
The step is a specific process of training and generating a news theme classification model when the preset natural language processing task is the news theme classification task.
In another example of the present application, the natural language processing tasks include an emotion classification task, a news topic classification task, an intention recognition task, a named entity recognition task, and a semantic matching task, and when the natural language processing task to be processed is the intention recognition task, after generating the target language model, the training method further includes: acquiring a trained intention recognition model, an intention recognition text loading template and text data to be processed needing intention recognition; loading the text data to be processed into the intention recognition text loading template, and determining the text data to be predicted with a blank text answer position; and outputting the text data to be predicted to the intention recognition model, determining a predicted answer of a text answer position in the text data to be predicted, and determining the predicted answer as an intention recognition result of the text data to be processed.
For example, in practical application, when the natural language processing task to be solved is intent recognition, the selected text loading template is the intent recognition template "[text] The problem is about [answer]", and the selected target language model is a model trained on training samples generated by that template. When the acquired text data to be processed is "What is your WeChat official account?", the text data to be predicted becomes "What is your WeChat official account? The problem is about [answer]". The text data to be predicted is input into the target language model for prediction, which fills the [answer] position: the model outputs a probability distribution over a predefined intention word (Token) set, the intention with the maximum probability is extracted from the set as the text answer, and the text answer is determined as the prediction result for the text data to be processed.
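For illustration only, online prediction with the trained model could look like the sketch below, which restricts the prediction at the answer position to a predefined intention word (Token) set; the template wording, the single-character intent set and the reuse of BertForMaskedLM are assumptions consistent with the training sketch above.
```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # stands in for the trained model files
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

TEMPLATE = "[text]，这个问题是关于[MASK]的"   # intent recognition text loading template (assumed wording)
INTENTS = ["查", "办", "退"]                  # predefined single-token intention word set (assumed)

def predict_intent(text_to_process):
    """Load the text to be processed into the template and pick the most probable intent token."""
    prompt = TEMPLATE.replace("[text]", text_to_process, 1)
    enc = tokenizer(prompt, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]                    # scores for the text answer position
    intent_ids = [tokenizer.convert_tokens_to_ids(t) for t in INTENTS]
    probs = torch.softmax(logits[intent_ids], dim=-1)                # distribution over the intent set
    return INTENTS[int(torch.argmax(probs))], probs.tolist()

print(predict_intent("你们的微信公众号是什么"))  # "What is your WeChat official account?"
```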
The language model training method provided by the embodiment of the application comprises the following steps: acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task; acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading a training text sample; loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text; and iteratively updating the initial language model by using the training text sample to generate a target language model.
In the present application, the text loading template is used to generate training texts with identification information, and the language model is trained simultaneously on training texts with identification information and training texts without identification information. The pre-training and fine-tuning stage tasks are thereby combined into a single training task, so that a better language model can be obtained by training on less manually labeled data, a large amount of manual labeling cost is avoided, and model training time is shortened.
Referring to fig. 2 and fig. 3, fig. 2 is a schematic structural diagram of a training apparatus for a language model according to an embodiment of the present application, and fig. 3 is a second schematic structural diagram of the training apparatus for a language model according to the embodiment of the present application. As shown in fig. 2, the training apparatus 200 includes:
a first obtaining module 210, configured to obtain initial text data related to a preset natural language processing task according to the preset natural language processing task;
a second obtaining module 220, configured to obtain a text loading template, corresponding to the preset natural language processing task, for loading a training text sample;
the loading module 230 is configured to load initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text;
and the generating module 240 is configured to iteratively update the initial language model by using the training text sample to generate a target language model.
Optionally, when the preset natural language processing task is a news topic classification task, the training device 200 is configured to:
acquiring initial news text data;
acquiring a text loading template of the news topic classification task;
loading the initial news text data according to the text loading template of the news topic classification task to obtain a training text sample for training a news topic classification model; the training text sample comprises an identification information training text and a non-identification information training text;
and iteratively updating the initial news topic classification model by using the training text sample to generate a target news topic classification model.
Optionally, as shown in fig. 3, the training device 200 further includes a preprocessing module 250, where the preprocessing module 250 is configured to:
preprocessing the initial text data by removing special characters, spaces and garbled characters and clipping the initial text data to a preset length, to obtain preprocessed initial text data, wherein the preprocessed initial text data is the initial text data loaded by the text loading template; the initial text data includes text data with identification information and text data without identification information.
Optionally, the training apparatus 200 further includes a template library constructing module 260, where the template library constructing module 260 is configured to:
acquiring a plurality of text loading templates which are pre-designed by a user and a natural language processing task corresponding to each text loading template;
binding and storing each acquired text loading template and the corresponding natural language processing task, and constructing a text loading template library; the text loading template comprises a text loading position and a text answer position.
Optionally, when the loading module 230 is configured to load initial text data according to the text loading template to obtain a training text sample for training a language model, the loading module 230 is configured to:
loading the text in the text data with the identification information to a text loading position in the text loading template, loading the identification information corresponding to the text to a text answer position in the text loading template, and generating a training text with the identification information;
and taking the text data without the identification information as the training text without the identification information.
Optionally, the training apparatus 200 further includes an application module 270, and when the natural language processing task to be processed is an intention recognition task, the application module 270 is configured to:
acquiring a trained intention recognition model, an intention recognition text loading template and text data to be processed needing intention recognition;
loading the text data to be processed into the intention recognition text loading template, and determining the text data to be predicted with a blank text answer position;
and inputting the text data to be predicted into the intention recognition model, determining a predicted answer for the text answer position in the text data to be predicted, and determining the predicted answer as the intention recognition result of the text data to be processed.
The embodiment of the application provides a language model training device, which is configured for: acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task; acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading a training text sample; loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text; and iteratively updating the initial language model by using the training text sample to generate a target language model.
In the present application, the text loading template is used to generate training texts with identification information, and the language model is trained simultaneously on training texts with identification information and training texts without identification information. The pre-training and fine-tuning stage tasks are thereby combined into a single training task, so that a better language model can be obtained by training on less manually labeled data, a large amount of manual labeling cost is avoided, and model training time is shortened.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 4, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.
The memory 420 stores machine-readable instructions executable by the processor 410, when the electronic device 400 runs, the processor 410 communicates with the memory 420 through the bus 430, and when the machine-readable instructions are executed by the processor 410, the steps of the training method in the method embodiment shown in fig. 1 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the training method in the method embodiment shown in fig. 1 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for training a language model, the method comprising:
acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task;
acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading a training text sample;
loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text;
and iteratively updating the initial language model by using the training text sample to generate a target language model.
2. The training method according to claim 1, wherein when the predetermined natural language processing task is a news topic classification task, the training method comprises:
acquiring initial news text data;
acquiring a text loading template of the news topic classification task;
loading the initial news text data according to the text loading template of the news topic classification task to obtain a training text sample for training a news topic classification model; the training text sample comprises an identification information training text and a non-identification information training text;
and iteratively updating the initial news topic classification model by using the training text sample to generate a target news topic classification model.
3. The training method of claim 1, wherein after obtaining initial text data, the training method further comprises:
preprocessing the initial text data by removing special characters, spaces and garbled characters and clipping the initial text data to a preset length, to obtain preprocessed initial text data, wherein the preprocessed initial text data is the initial text data loaded by the text loading template; the initial text data includes text data with identification information and text data without identification information.
4. The training method according to claim 3, wherein before acquiring the text loading template for loading the training text sample corresponding to the preset natural language processing task, the training method further comprises:
acquiring a plurality of text loading templates which are pre-designed by a user and a natural language processing task corresponding to each text loading template;
binding and storing each acquired text loading template and the corresponding natural language processing task, and constructing a text loading template library; the text loading template comprises a text loading position and a text answer position.
5. The training method of claim 4, wherein the loading initial text data according to the text loading template to obtain training text samples for training a language model comprises:
loading the text in the text data with the identification information to a text loading position in the text loading template, loading the identification information corresponding to the text to a text answer position in the text loading template, and generating a training text with the identification information;
and taking the text data without the identification information as the training text without the identification information.
6. The training method of claim 5, wherein the natural language processing tasks include an emotion classification task, a news topic classification task, an intent recognition task, a named entity recognition task, and a semantic matching task, and when the natural language processing task to be processed is the intent recognition task, after generating the target language model, the training method further comprises:
acquiring a trained intention recognition model, an intention recognition text loading template and text data to be processed needing intention recognition;
loading the text data to be processed into the intention recognition text loading template, and determining the text data to be predicted with a blank text answer position;
and inputting the text data to be predicted into the intention recognition model, determining a predicted answer for the text answer position in the text data to be predicted, and determining the predicted answer as the intention recognition result of the text data to be processed.
7. An apparatus for training a language model, the apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring initial text data related to a preset natural language processing task according to the preset natural language processing task;
the second acquisition module is used for acquiring a text loading template which corresponds to the preset natural language processing task and is used for loading a training text sample;
the loading module is used for loading initial text data according to the text loading template to obtain a training text sample for training a language model; the training text sample comprises an identification information training text and a non-identification information training text;
and the generating module is used for performing iterative updating on the initial language model by using the training text sample to generate a target language model.
8. The training device of claim 7, further comprising an application module configured, when the natural language processing task to be processed is an intent recognition task, to:
acquiring a trained intention recognition model, an intention recognition text loading template and text data to be processed needing intention recognition;
loading the text data to be processed into the intention recognition text loading template, and determining the text data to be predicted with a blank text answer position;
and inputting the text data to be predicted into the intention recognition model, determining a predicted answer for the text answer position in the text data to be predicted, and determining the predicted answer as the intention recognition result of the text data to be processed.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operated, the machine-readable instructions being executable by the processor to perform the steps of the training method of any of claims 1 to 6.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the training method as claimed in any one of the claims 1 to 6.
CN202111367500.5A 2021-11-18 2021-11-18 Language model training method and device, electronic equipment and storage medium Pending CN114036300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111367500.5A CN114036300A (en) 2021-11-18 2021-11-18 Language model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111367500.5A CN114036300A (en) 2021-11-18 2021-11-18 Language model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114036300A true CN114036300A (en) 2022-02-11

Family

ID=80144829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111367500.5A Pending CN114036300A (en) 2021-11-18 2021-11-18 Language model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114036300A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357144A (en) * 2022-03-09 2022-04-15 北京大学 Medical numerical extraction and understanding method and device based on small samples
CN114357144B (en) * 2022-03-09 2022-08-09 北京大学 Medical numerical extraction and understanding method and device based on small samples
CN114841274A (en) * 2022-05-12 2022-08-02 百度在线网络技术(北京)有限公司 Language model training method and device, electronic equipment and storage medium
CN114861653A (en) * 2022-05-17 2022-08-05 马上消费金融股份有限公司 Language generation method, device, equipment and storage medium for virtual interaction
CN114861653B (en) * 2022-05-17 2023-08-22 马上消费金融股份有限公司 Language generation method, device, equipment and storage medium for virtual interaction
CN114970522A (en) * 2022-05-20 2022-08-30 北京百度网讯科技有限公司 Language model pre-training method, device, equipment and storage medium
CN114970522B (en) * 2022-05-20 2023-11-24 北京百度网讯科技有限公司 Pre-training method, device, equipment and storage medium of language model
CN117174177A (en) * 2023-06-25 2023-12-05 北京百度网讯科技有限公司 Training method and device for protein sequence generation model and electronic equipment
CN116523031A (en) * 2023-07-05 2023-08-01 深圳须弥云图空间科技有限公司 Training method of language generation model, language generation method and electronic equipment
CN116523031B (en) * 2023-07-05 2024-05-10 深圳须弥云图空间科技有限公司 Training method of language generation model, language generation method and electronic equipment
CN117743556A (en) * 2024-02-07 2024-03-22 创意信息技术股份有限公司 Knowledge base-based multi-round question and answer intention recognition method and device
CN117743556B (en) * 2024-02-07 2024-04-16 创意信息技术股份有限公司 Knowledge base-based multi-round question and answer intention recognition method and device

Similar Documents

Publication Publication Date Title
CN110543644B (en) Machine translation method and device containing term translation and electronic equipment
CN114036300A (en) Language model training method and device, electronic equipment and storage medium
CN110874531B (en) Topic analysis method and device and storage medium
JP7100747B2 (en) Training data generation method and equipment
US20210124876A1 (en) Evaluating the Factual Consistency of Abstractive Text Summarization
CN109408824B (en) Method and device for generating information
Gómez-Adorno et al. Improving feature representation based on a neural network for author profiling in social media texts
Schwartz et al. Story cloze task: Uw nlp system
Washington et al. Finite-state morphological transducers for three Kypchak languages.
CN113934834A (en) Question matching method, device, equipment and storage medium
Chennoufi et al. Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization
López et al. Experiments on sentence boundary detection in user-generated web content
CN112287077A (en) Statement extraction method and device for combining RPA and AI for document, storage medium and electronic equipment
CN110413996B (en) Method and device for constructing zero-index digestion corpus
Azmi et al. Light diacritic restoration to disambiguate homographs in modern Arabic texts
Ramesh et al. Interpretable natural language segmentation based on link grammar
JP2018181259A (en) Dialogue rule collation device, dialogue device, dialogue rule collation method, dialogue method, dialogue rule collation program, and dialogue program
Rofiq Indonesian news extractive text summarization using latent semantic analysis
CN114896973A (en) Text processing method and device and electronic equipment
Mekki et al. COTA 2.0: An automatic corrector of tunisian Arabic social media texts
Gardie et al. Anyuak Language Named Entity Recognition Using Deep Learning Approach
Asahiah Development of a Standard Yorùbá digital text automatic diacritic restoration system
CN112329478A (en) Method, device and equipment for constructing causal relationship determination model
Mammadov et al. Part-of-speech tagging for azerbaijani language
Babhulgaonkar et al. Experimenting with factored language model and generalized back-off for Hindi

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination