CN113961669A - Training method of pre-training language model, storage medium and server - Google Patents

Training method of pre-training language model, storage medium and server

Info

Publication number
CN113961669A
Authority
CN
China
Prior art keywords
text
word
training
language model
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111251502.8A
Other languages
Chinese (zh)
Inventor
程德生
王梨
余星
万晶
钱刚
周靖峰
陈志方
刘阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Soft Hangzhou Anren Network Communication Co ltd
Original Assignee
China Soft Hangzhou Anren Network Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Soft Hangzhou Anren Network Communication Co ltd
Priority to CN202111251502.8A
Publication of CN113961669A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G06F 16/35 - Clustering; Classification
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a training method for a pre-training language model, a storage medium and a server. The training method pre-trains a general-domain language model with a text corpus from a specific scene, so that the resulting domain-specific pre-training language model better captures the information unique to the corpus of that scene. Texts are segmented with a word-segmentation tool so that whole words, rather than single characters, become the units that may or may not be masked; this raises the difficulty of the training task, strengthens the semantic understanding of the language model, and thereby improves the accuracy of the pre-training language model obtained by training. The category label added to each text carries rich semantic information, and adding it helps the pre-training language model grasp the overall meaning of the text. As a result, accuracy and efficiency are improved when the pre-training language model is used for downstream natural language processing tasks.

Description

Training method of pre-training language model, storage medium and server
Technical Field
The invention relates to the field of natural language processing, in particular to a training method, a storage medium and a server for a pre-training language model.
Background
Natural language processing is an important branch of artificial intelligence. Pre-training language models have proven effective in practice at improving many natural language processing tasks, such as natural language inference, question answering and sequence labeling. In the currently successful masked language model (MLM) objective, single-character words in a sentence are masked at random with a probability of 15%, and the masked language model learns to fill in the single-character word at each masked position according to the given targets.
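For illustration, the character-level random masking described above can be sketched roughly as follows; this is a minimal sketch, and the helper name, the [MASK] string and the example sentence are illustrative assumptions, not taken from the patent.

```python
import random

def random_char_mask(chars, mask_prob=0.15, mask_token="[MASK]"):
    """Mask each single character independently with probability 15%;
    the MLM objective then predicts the original character at every
    masked position (None marks positions that are not scored)."""
    masked, targets = [], []
    for ch in chars:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(ch)
        else:
            masked.append(ch)
            targets.append(None)
    return masked, targets

masked, targets = random_char_mask(list("自然语言处理很重要"))
```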
This training scheme is relatively simple: it directly predicts the single-character word at each masked position, which is a much easier task than predicting whole words, so the training task offers limited challenge. Moreover, the original masked language model makes no use of labeled corpus data, even though labeled corpus data can be obtained in some scenarios.
Disclosure of Invention
The invention provides a training method for a pre-training language model, a storage medium and a server, which improve the accuracy of the pre-training language model obtained by training and thereby the accuracy and efficiency of downstream natural language processing tasks handled with the pre-training language model.
In a first aspect, the present invention provides a training method for pre-training a language model, the training method comprising:
acquiring a text corpus of a specific scene, wherein the text corpus comprises a plurality of texts;
labeling each text with a category label;
segmenting each text with a word segmentation tool to obtain a segmented text of each text;
inputting the segmented texts into a Word2vec model for training to obtain a lexicon containing the word-vector information of each word;
adding a start marker before and a first end marker after each segmented text;
appending the category label of each segmented text after its first end marker, and adding a second end marker after the category label, to obtain a labeled text of each text;
for each labeled text, randomly selecting words for masking according to a set probability value, and extracting a similar word of each masked word from the lexicon through the Word2vec model for similar-word replacement, to obtain a masked replacement text of each text;
converting each labeled text and its masked replacement text into numeric IDs;
and inputting the numeric IDs and the category label of each text into a pre-training language model for supervised training, to obtain a pre-training language model containing label information.
In the above scheme, a general-domain language model is pre-trained with the text corpus of a specific scene, so that the resulting domain-specific pre-training language model better captures the information unique to the text corpus of that scene. Because the texts are segmented with a word-segmentation tool, whole words rather than single characters become the units that may or may not be masked, which raises the difficulty of training the language model, strengthens its semantic understanding, and thereby improves the accuracy of the pre-training language model obtained by training. In addition, the category label added to each text carries rich semantic information, and adding it helps the pre-training language model grasp the overall meaning of the text. Accuracy and efficiency are therefore improved when the pre-training language model is used for downstream natural language processing tasks.
In a specific embodiment, the word segmentation tool is a Jieba word segmentation tool or a Hanlp word segmentation tool.
In a specific embodiment, inputting the segmented texts into a Word2vec model for training to obtain a lexicon containing the word-vector information of each word comprises: predicting the center word from its surrounding words based on the Word2vec model to obtain the word-vector information of each word, which improves the accuracy of the obtained word-vector information.
In a specific embodiment, the start marker is [cls], and the first end marker and the second end marker are both [sep].
In a specific embodiment, appending the category label of each segmented text after its first end marker comprises: defining the n category labels as the reserved tokens [unused1], [unused2], [unused3], ..., [unusedn], respectively; and splicing the [unused] token corresponding to the category label of each text after the first end marker of its segmented text, so that the category-label information is conveniently integrated and the accuracy of the pre-training language model is improved.
In a specific embodiment, for each labeled text, randomly selecting words for masking according to a set probability value, and extracting a similar word of each masked word from the lexicon through the Word2vec model for similar-word replacement to obtain a masked replacement text of each text, comprises: masking m words of each labeled text as continuous spans in the manner of an N-gram model, wherein m = the set probability value × the total number of words in the segmented text, rounded to an integer; skipping the current word when the current word is the start marker, the first end marker or the second end marker; and, when the current word is to be masked, replacing it with [mask] with probability P1, keeping it unchanged with probability P2, and with probability (1 - P1 - P2) extracting a similar word of the current word from the lexicon through the Word2vec model and substituting it, to obtain the masked replacement text of each text, wherein the similar word has the same length as the current word. This improves the semantic understanding of the language model and thus the accuracy of the pre-training language model obtained by training.
In a specific embodiment, the set probability value is 15%, P1 = 80% and P2 = 10%, so as to improve the accuracy of the pre-training language model obtained by the final training.
In a specific embodiment, converting each labeled text and its masked replacement text into numeric IDs comprises: cutting each text into sub-words according to BPE and converting it into numeric IDs according to the vocab.txt file; and cutting the masked replacement text of each text into sub-words according to BPE and converting it into numeric IDs according to the vocab.txt file.
In a second aspect, the present invention also provides a storage medium having a computer program stored therein, which, when run on a computer, causes the computer to perform any of the training methods described above.
In a third aspect, the present invention further provides a server, which includes a processor and a memory, wherein the memory stores a computer program, and the processor is configured to execute any one of the training methods by calling the computer program stored in the memory.
Drawings
FIG. 1 is a flowchart of a training method for pre-training a language model according to an embodiment of the present invention;
FIG. 2 is a flowchart of another training method for pre-training a language model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to facilitate understanding of the training method of the pre-training language model provided by the embodiment of the present invention, an application scenario of the pre-training language model provided by the embodiment of the present invention is described below. The training method is described in detail below with reference to the accompanying drawings.
Referring to fig. 1, a training method for a pre-training language model provided by an embodiment of the present invention includes:
S10: acquiring a text corpus of a specific scene, wherein the text corpus comprises a plurality of texts;
S20: labeling each text with a category label;
S30: segmenting each text with a word segmentation tool to obtain a segmented text of each text;
S40: inputting the segmented texts into a Word2vec model for training to obtain a lexicon containing the word-vector information of each word;
S50: adding a start marker before and a first end marker after each segmented text;
S60: appending the category label of each segmented text after its first end marker, and adding a second end marker after the category label, to obtain a labeled text of each text;
S70: for each labeled text, randomly selecting words for masking according to a set probability value, and extracting a similar word of each masked word from the lexicon through the Word2vec model for similar-word replacement, to obtain a masked replacement text of each text;
S80: converting each labeled text and its masked replacement text into numeric IDs;
S90: inputting the numeric IDs and the category label of each text into a pre-training language model for supervised training, to obtain a pre-training language model containing label information.
In the above scheme, a general-domain language model is pre-trained with the text corpus of a specific scene, so that the resulting domain-specific pre-training language model better captures the information unique to the text corpus of that scene. Because the texts are segmented with a word-segmentation tool, whole words rather than single characters become the units that may or may not be masked, which raises the difficulty of training the language model, strengthens its semantic understanding, and thereby improves the accuracy of the pre-training language model obtained by training. In addition, the category label added to each text carries rich semantic information, and adding it helps the pre-training language model grasp the overall meaning of the text. Accuracy and efficiency are therefore improved when the pre-training language model is used for downstream natural language processing tasks. Each of the above steps is described in detail below with reference to the accompanying drawings.
First, referring to FIG. 1 and FIG. 2, text corpus data of a specific scene is obtained. The text corpus includes a plurality of texts; specifically, the number of texts in the corpus may be 50, 100, 200, and so on. The specific scene may be a professional scene such as sports, finance, military affairs, entertainment, history or taxation, and the text corpus of such a scene accordingly consists of texts from that professional field.
Next, as shown in FIG. 1 and FIG. 2, each text is labeled with a category label. In practice, each text in the specific scene can be labeled manually. The label assigned to a text is the category it belongs to, and the texts in the corpus cover at least two categories. Note that these categories are finer-grained than the scene category itself. For example, for a specific scene in the sports field, the texts in the corpus may be categorized into basketball, football, badminton, and so on.
Next, with continued reference to FIG. 1 and FIG. 2, a word-segmentation tool is used to segment each text and obtain its segmented text. Specifically, an open-source word-segmentation tool such as Jieba or Hanlp can be used to segment each text and obtain the segmented text of each text.
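A rough sketch of this step is given below, assuming the open-source Jieba tokenizer; the example sentence is made up for illustration.

```python
import jieba  # open-source Chinese word-segmentation tool (pip install jieba)

texts = ["湖人队在昨晚的篮球比赛中险胜勇士队"]          # illustrative sports-scene text
segmented_texts = [jieba.lcut(t) for t in texts]         # list of whole-word tokens per text
# e.g. ['湖人队', '在', '昨晚', '的', '篮球', '比赛', '中', '险胜', '勇士队']
```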
Next, as shown in FIG. 1 and FIG. 2, the segmented texts are input into the Word2vec model for training to obtain a lexicon containing the word-vector information of each word. That is, after each text has been segmented it is fed into the Word2vec model for training, yielding a lexicon that contains the word vector of every word and serves as the similar-word lexicon for the subsequent similar-word replacement.
In addition, when the segmented texts are input into the Word2vec model for training to obtain the lexicon containing the word-vector information of each word, the center word can be predicted from its surrounding words based on the Word2vec model, which improves the accuracy of the obtained word-vector information.
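A minimal sketch of this training step with gensim's Word2vec implementation follows; sg=0 selects the CBOW scheme in which the center word is predicted from its surrounding words, and all hyper-parameter values are assumptions, not taken from the patent.

```python
from gensim.models import Word2Vec

# segmented_texts comes from the word-segmentation step above
w2v_model = Word2Vec(
    sentences=segmented_texts,
    vector_size=128,   # dimensionality of the word vectors (assumed)
    window=5,          # number of surrounding words used as context
    min_count=1,       # keep every word in this small illustration
    sg=0,              # CBOW: predict the center word from its context
)
w2v_model.save("similar_word_lexicon.model")  # the lexicon used later for similar-word replacement
```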
Next, as shown in FIG. 1 and FIG. 2, a start marker and a first end marker are added before and after each segmented text, respectively. The start marker may be [cls] and the first end marker may be [sep]; after they are added, the text has the form: [cls] segmented text [sep].
Next, with continued reference to FIG. 1 and FIG. 2, the category label of each segmented text is appended after its first end marker, and a second end marker is added after the category label, yielding the labeled text of each text. The second end marker may also be [sep]. After the category label is added, the labeled text of each text has the form: [cls] segmented text [sep] category label [sep].
In addition, when appending the category label of each segmented text after its first end marker, the n category labels can be mapped in turn to the reserved tokens [unused1], [unused2], [unused3], ..., [unusedn], where n is any integer greater than 1. The [unused] token corresponding to the category label of each text is then spliced after the first end marker of its segmented text. The final labeled text of each text is therefore: [cls] segmented text [sep] [unusedi] [sep], where [unusedi] is the [unused] token corresponding to the category label of that text. This makes it convenient to integrate the category-label information and improves the accuracy of the pre-training language model.
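A hedged sketch of assembling the labeled text is shown below; the label-to-[unusedN] mapping and the helper name are illustrative assumptions.

```python
# hypothetical mapping from category labels to reserved [unusedN] tokens
label_to_unused = {"basketball": "[unused1]", "football": "[unused2]", "badminton": "[unused3]"}

def build_labeled_text(segmented_words, category_label):
    # [cls] segmented text [sep] [unusedi] [sep]
    return ["[cls]"] + list(segmented_words) + ["[sep]", label_to_unused[category_label], "[sep]"]

labeled_tokens = build_labeled_text(segmented_texts[0], "basketball")
```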
Next, for each labeled text, words are randomly selected for masking according to a set probability value, and a similar word of each masked word is extracted from the lexicon through the Word2vec model for similar-word replacement, yielding the masked replacement text of each text. The set probability value may be about 15%, which improves the accuracy of the pre-training language model obtained by the final training.
Specifically, when, for each labeled text, words are randomly selected for masking according to the set probability value and a similar word of each masked word is extracted from the lexicon through the Word2vec model for similar-word replacement to obtain the masked replacement text, m words of each labeled text are masked as continuous spans in the manner of an N-gram model. Here m = the set probability value × the total number of words in the corresponding segmented text; that is, the number m of words to be masked in each labeled text is proportional to the total number of words in its segmented text, with the set probability value as the proportion, and the result may be rounded either up or down.
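The count m and the span selection might be sketched as follows; rounding down is used here, the span lengths are drawn at random, and the helper name is an assumption, so this is one plausible reading rather than the patent's exact procedure.

```python
import math
import random

def choose_mask_positions(segmented_words, labeled_tokens, prob=0.15, max_ngram=3):
    """Pick m = floor(prob * number of words in the segmented text) positions,
    taken as short continuous spans (N-grams) over the labeled token sequence."""
    m = math.floor(prob * len(segmented_words))
    positions = set()
    while len(positions) < m:
        n = random.randint(1, max_ngram)                    # span length
        start = random.randint(0, len(labeled_tokens) - 1)  # span start
        for pos in range(start, min(start + n, len(labeled_tokens))):
            if len(positions) < m:
                positions.add(pos)
    return sorted(positions)

positions_to_mask = choose_mask_positions(segmented_texts[0], labeled_tokens)
```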
When the current word is the start marker, the first end marker or the second end marker, the current word needs to be skipped to prevent the markers from being masked.
In addition, when the current word is to be masked, it is replaced with [mask] with probability P1, kept unchanged with probability P2, and with probability (1 - P1 - P2) a similar word of the current word is extracted from the lexicon through the Word2vec model and substituted, yielding the masked replacement text of each text; the similar word has the same length as the current word. This improves the semantic understanding of the language model and thus the accuracy of the pre-training language model obtained by training. P1 = 80% and P2 = 10% may be adopted; that is, when the current word is to be masked, it is replaced with [mask] with 80% probability, kept unchanged with 10% probability, and with the remaining 10% probability replaced by a similar word extracted from the lexicon through the Word2vec model, which improves the accuracy of the pre-training language model obtained by the final training.
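A hedged sketch of this 80/10/10 rule is given below; `similar_word_same_length` is a hypothetical helper sketched after the next paragraph, and the marker strings follow the lowercase form used in this description.

```python
import random

SPECIAL_MARKERS = {"[cls]", "[sep]"}

def mask_whole_words(labeled_tokens, positions_to_mask, w2v_model, p1=0.80, p2=0.10):
    """Apply the masking rule: 80% -> [mask], 10% -> unchanged,
    remaining 10% -> similar-word replacement; markers are never masked."""
    masked = list(labeled_tokens)
    targets = {}                              # position -> original word to predict
    for pos in positions_to_mask:
        word = labeled_tokens[pos]
        if word in SPECIAL_MARKERS:           # skip [cls]/[sep]; the [unusedN] label could be skipped too
            continue
        targets[pos] = word
        r = random.random()
        if r < p1:
            masked[pos] = "[mask]"
        elif r < p1 + p2:
            pass                              # keep the current word unchanged
        else:
            masked[pos] = similar_word_same_length(word, w2v_model)
    return masked, targets
```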
In addition, when a similar word of the current word is extracted from the lexicon through the Word2vec model for replacement, the word vector of the current word can be computed with the Word2vec model, and the word in the lexicon with the highest similarity to the current word and the same length is selected as its similar word for the replacement, which improves the accuracy of the pre-training language model obtained by the final training.
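The same-length similar-word lookup could be sketched as below using gensim's most_similar; this completes the masking sketch above and is one plausible reading, not the patent's exact code.

```python
def similar_word_same_length(word, w2v_model, topn=20):
    """Return the most similar word of the same character length from the
    Word2vec lexicon, falling back to the word itself if none qualifies."""
    if word not in w2v_model.wv:
        return word
    for candidate, _score in w2v_model.wv.most_similar(word, topn=topn):
        if len(candidate) == len(word):
            return candidate
    return word

masked_tokens, mask_targets = mask_whole_words(labeled_tokens, positions_to_mask, w2v_model)
```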
Next, as shown in FIG. 1 and FIG. 2, each labeled text and its masked replacement text are converted into numeric IDs. In some embodiments, each text is cut into sub-words according to BPE to obtain a character-level token sequence and converted into numeric IDs according to the vocab.txt file, and the masked replacement text of each text is cut according to BPE and converted into numeric IDs according to vocab.txt in the same way. The vocab.txt file is the vocabulary file of a general Chinese pre-training model and can be obtained by download. BPE stands for Byte Pair Encoding, a simple sub-word segmentation algorithm.
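A minimal sketch of the ID-conversion step follows, assuming a downloaded vocab.txt file (one token per line, line number = ID) from a general Chinese pre-training model; the full BPE sub-word cutting is omitted and tokens are looked up directly, falling back to [UNK].

```python
def load_vocab(path="vocab.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}

vocab = load_vocab()

def to_ids(tokens, unk_token="[UNK]"):
    # a fuller implementation would first cut each token into BPE sub-word pieces
    return [vocab.get(tok, vocab.get(unk_token, 0)) for tok in tokens]

input_ids = to_ids(masked_tokens)     # numeric IDs of the masked replacement text
target_ids = to_ids(labeled_tokens)   # numeric IDs of the original labeled text
```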
Finally, the numeric IDs and the category label of each text are input into the pre-training language model for supervised training, yielding a pre-training language model that contains label information, that is, an N-gram, whole-word-masking pre-training language model based on label information.
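As one possible realization of this supervised pre-training step, the sketch below uses Hugging Face's BertForMaskedLM as the backbone (an assumption, not the patent's model) and scores only the positions that were altered by masking, which is a simplification of the usual MLM loss.

```python
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-chinese")  # assumed general Chinese backbone

inputs = torch.tensor([input_ids])    # masked replacement text
labels = torch.tensor([target_ids])   # original labeled text (includes the [unusedN] label token)
labels[inputs == labels] = -100       # ignore unchanged positions in the loss

loss = model(input_ids=inputs, labels=labels).loss
loss.backward()                       # an optimizer step (e.g. AdamW) would follow in a real loop
```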
Compared with the prior art, a general-domain language model is pre-trained with the text corpus of a specific scene, so that the resulting domain-specific pre-training language model better captures the information unique to the text corpus of that scene. Because the texts are segmented with a word-segmentation tool, whole words rather than single characters become the units that may or may not be masked, which raises the difficulty of training the language model, strengthens its semantic understanding, and thereby improves the accuracy of the pre-training language model obtained by training. In addition, the category label added to each text carries rich semantic information; adding it improves the generalization ability of the language model and helps the pre-training language model grasp the overall meaning of the text. Accuracy and efficiency are therefore improved when the pre-training language model is used for downstream natural language processing tasks. The label-aware N-gram, whole-word-masking pre-training language model increases the difficulty of training and improves the semantic understanding of the pre-training language model.
Furthermore, an embodiment of the present invention provides a storage medium in which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute any one of the training methods described above. For the resulting effects, refer to the description above; they are not repeated here.
In addition, an embodiment of the present invention provides a server that includes a processor and a memory, wherein the memory stores a computer program and the processor executes any one of the training methods above by calling the computer program stored in the memory. For the resulting effects, refer to the description above; they are not repeated here.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A training method for pre-training a language model, comprising:
acquiring a text corpus of a specific scene, wherein the text corpus comprises a plurality of texts;
labeling each text with a category label;
segmenting each text with a word segmentation tool to obtain a segmented text of each text;
inputting the segmented texts into a Word2vec model for training to obtain a lexicon containing the word-vector information of each word;
adding a start marker before and a first end marker after each segmented text;
appending the category label of each segmented text after its first end marker, and adding a second end marker after the category label, to obtain a labeled text of each text;
for each labeled text, randomly selecting words for masking according to a set probability value, and extracting a similar word of each masked word from the lexicon through the Word2vec model for similar-word replacement, to obtain a masked replacement text of each text;
converting each labeled text and its masked replacement text into numeric IDs;
and inputting the numeric IDs and the category label of each text into a pre-training language model for supervised training, to obtain a pre-training language model containing label information.
2. The training method of claim 1, wherein the word segmentation tool is a Jieba word segmentation tool or a Hanlp word segmentation tool.
3. The training method of claim 1, wherein inputting the segmented texts into a Word2vec model for training to obtain a lexicon containing the word-vector information of each word comprises:
predicting the center word from its surrounding words based on the Word2vec model to obtain the word-vector information of each word.
4. The training method of claim 1, wherein the start marker is [cls] and the first end marker and the second end marker are both [sep].
5. The training method of claim 1, wherein said appending the category label of each segmented text after its first end marker comprises:
defining the n category labels as the reserved tokens [unused1], [unused2], [unused3], ..., [unusedn], respectively;
and splicing the [unused] token corresponding to the category label of each text after the first end marker of its segmented text.
6. The training method of claim 1, wherein, for each labeled text, randomly selecting words for masking according to a set probability value, and extracting a similar word of each masked word from the lexicon through the Word2vec model for similar-word replacement, to obtain a masked replacement text of each text, comprises:
masking m words of each labeled text as continuous spans in the manner of an N-gram model, wherein m = the set probability value × the total number of words in the segmented text;
skipping a current word when the current word is the start marker, the first end marker, or the second end marker;
when the current word is to be masked, replacing it with [mask] with probability P1, keeping it unchanged with probability P2, and with probability (1 - P1 - P2) extracting a similar word of the current word from the lexicon through the Word2vec model and substituting it, to obtain the masked replacement text of each text; wherein the similar word has the same length as the current word.
7. The training method of claim 6, wherein the set probability value is 15%, P1 = 80%, and P2 = 10%.
8. The training method of claim 1, wherein converting each labeled text and its masked replacement text into numeric IDs comprises:
cutting each text into sub-words according to BPE and converting it into the numeric IDs according to the vocab.txt file;
and cutting the masked replacement text of each text into sub-words according to BPE and converting it into the numeric IDs according to the vocab.txt file.
9. A storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the training method of any one of claims 1 to 8.
10. A server, characterized by comprising a processor and a memory, wherein the memory stores a computer program, and the processor executes the training method according to any one of claims 1 to 8 by calling the computer program stored in the memory.
CN202111251502.8A 2021-10-26 2021-10-26 Training method of pre-training language model, storage medium and server Pending CN113961669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111251502.8A CN113961669A (en) 2021-10-26 2021-10-26 Training method of pre-training language model, storage medium and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111251502.8A CN113961669A (en) 2021-10-26 2021-10-26 Training method of pre-training language model, storage medium and server

Publications (1)

Publication Number Publication Date
CN113961669A true CN113961669A (en) 2022-01-21

Family

ID=79467298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111251502.8A Pending CN113961669A (en) 2021-10-26 2021-10-26 Training method of pre-training language model, storage medium and server

Country Status (1)

Country Link
CN (1) CN113961669A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489555A (en) * 2019-08-21 2019-11-22 创新工场(广州)人工智能研究有限公司 A kind of language model pre-training method of combination class word information
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111950281A (en) * 2020-07-02 2020-11-17 中国科学院软件研究所 Demand entity co-reference detection method and device based on deep learning and context semantics
CN112560486A (en) * 2020-11-25 2021-03-26 国网江苏省电力有限公司电力科学研究院 Power entity identification method based on multilayer neural network, storage medium and equipment
CN112257421A (en) * 2020-12-21 2021-01-22 完美世界(北京)软件科技发展有限公司 Nested entity data identification method and device and electronic equipment
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112507628A (en) * 2021-02-03 2021-03-16 北京淇瑀信息科技有限公司 Risk prediction method and device based on deep bidirectional language model and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YIMING CUI et al.: "Revisiting Pre-trained Models for Chinese Natural Language Processing" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626371A (en) * 2022-03-22 2022-06-14 鼎富智能科技有限公司 Training method and device for pre-training language model
CN114626371B (en) * 2022-03-22 2024-05-10 鼎富智能科技有限公司 Training method and device for pre-training language model
CN117709355A (en) * 2024-02-05 2024-03-15 四川蜀天信息技术有限公司 Method, device and medium for improving training effect of large language model
CN117709355B (en) * 2024-02-05 2024-05-17 四川蜀天信息技术有限公司 Method, device and medium for improving training effect of large language model

Similar Documents

Publication Publication Date Title
CN107526967B (en) Risk address identification method and device and electronic equipment
CN111291566B (en) Event main body recognition method, device and storage medium
US11699275B2 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
Xin et al. Learning better internal structure of words for sequence labeling
CN110750993A (en) Word segmentation method, word segmentation device, named entity identification method and system
CN111159412B (en) Classification method, classification device, electronic equipment and readable storage medium
CN113961669A (en) Training method of pre-training language model, storage medium and server
CN113821605B (en) Event extraction method
CN110555136A (en) Video tag generation method and device and computer storage medium
CN112232024A (en) Dependency syntax analysis model training method and device based on multi-labeled data
CN111723569A (en) Event extraction method and device and computer readable storage medium
CN111339250A (en) Mining method of new category label, electronic equipment and computer readable medium
CN113836925B (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112257452A (en) Emotion recognition model training method, device, equipment and storage medium
CN112287100A (en) Text recognition method, spelling error correction method and voice recognition method
CN112163560A (en) Video information processing method and device, electronic equipment and storage medium
CN112270184A (en) Natural language processing method, device and storage medium
CN114817633A (en) Video classification method, device, equipment and storage medium
CN113887206B (en) Model training and keyword extraction method and device
CN108491381A (en) A kind of syntactic analysis method of Chinese bipartite structure
CN114860942A (en) Text intention classification method, device, equipment and storage medium
CN115858773A (en) Keyword mining method, device and medium suitable for long document
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN114610878A (en) Model training method, computer device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220121