CN116244432B - Pre-training method and device for language model and electronic equipment - Google Patents

Pre-training method and device for language model and electronic equipment

Info

Publication number
CN116244432B
Authority
CN
China
Prior art keywords
word
mask
sample text
words
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211703408.6A
Other languages
Chinese (zh)
Other versions
CN116244432A (en)
Inventor
徐钰坪
朱志凡
冯仕堃
黄世维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211703408.6A priority Critical patent/CN116244432B/en
Publication of CN116244432A publication Critical patent/CN116244432A/en
Application granted granted Critical
Publication of CN116244432B publication Critical patent/CN116244432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a pre-training method and device for a language model and electronic equipment, and relates to the field of artificial intelligence, in particular to deep learning, natural language processing and intelligent search. The specific implementation scheme is as follows: acquiring at least one mask sample text, reference words of mask positions in the mask sample text, and reference categories of word segmentation words in the mask sample text; inputting the mask sample text into a feature extraction network in the language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words; inputting these feature vectors into a prediction network in the language model to obtain predicted words of the mask positions and predicted categories of the word segmentation words; and constructing a loss function of the language model according to the predicted words and reference words of the mask positions and the predicted categories and reference categories of the word segmentation words, and pre-training the language model, so that the language model can learn word category information and the accuracy of the language model is improved.

Description

Pre-training method and device for language model and electronic equipment
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to the fields of deep learning, natural language processing and intelligent search, and specifically to a pre-training method and device for a language model and electronic equipment.
Background
At present, a common pre-training method for a language model mainly comprises: masking some of the words in a sample text, performing word prediction on the mask positions based on the context to obtain predicted words, constructing a loss function from the predicted words and the original words at the mask positions, and training the language model so that it can learn text understanding capability.
However, in practical applications, if the input text contains new words, that is, words that the language model did not encounter during training, the final output of the language model is severely disturbed, the accuracy of the output result decreases, and the accuracy of the language model is poor.
Disclosure of Invention
The disclosure provides a pre-training method and device of a language model and electronic equipment.
According to an aspect of the present disclosure, there is provided a pre-training method of a language model, the method comprising: acquiring at least one mask sample text, reference words of mask positions in the mask sample text, and reference categories of word segmentation words in the mask sample text; inputting the mask sample text into a feature extraction network in a language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words; inputting the feature vector of the mask position and the feature vector of the word segmentation word into a prediction network in the language model to obtain a prediction word of the mask position and a prediction category of the word segmentation word; and constructing a loss function of the language model according to the predicted words and the reference words of the mask positions, the predicted categories and the reference categories of the word segmentation words, and performing pre-training processing on the language model.
According to another aspect of the present disclosure, there is provided a pre-training apparatus of a language model, the apparatus comprising: the first acquisition module is used for acquiring at least one mask sample text, reference words of mask positions in the mask sample text and reference categories of word segmentation words in the mask sample text; the second acquisition module is used for inputting the mask sample text into a feature extraction network in a language model to acquire feature vectors of the mask positions and feature vectors of the word segmentation words; the third obtaining module is used for inputting the feature vector of the mask position and the feature vector of the word segmentation word into a prediction network in the language model to obtain the prediction word of the mask position and the prediction category of the word segmentation word; and the training module is used for constructing a loss function of the language model according to the predicted words and the reference words of the mask positions, the predicted categories and the reference categories of the word segmentation words and performing pre-training processing on the language model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the pre-training method of the language model set forth above in the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the pre-training method of the language model proposed above in the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the pre-training method of the language model proposed above in the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a pre-training schematic of a language model;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a pre-training method of a language model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, a common pre-training method for a language model mainly comprises: masking some of the words in a sample text, performing word prediction on the mask positions based on the context to obtain predicted words, constructing a loss function from the predicted words and the original words at the mask positions, and training the language model so that it can learn text understanding capability.
However, in practical applications, if the input text contains new words, that is, words that the language model did not encounter during training, the final output of the language model is severely disturbed, the accuracy of the output result decreases, and the accuracy of the language model is poor.
In view of the above problems, the present disclosure provides a method and apparatus for pre-training a language model, and an electronic device.
Fig. 1 is a schematic diagram of a first embodiment of the disclosure, and it should be noted that the method for pre-training a language model according to the embodiment of the disclosure may be applied to a device for pre-training a language model, where the device may be configured in an electronic device, so that the electronic device may perform a pre-training function of the language model. In the following embodiments, an execution body is described as an example of an electronic device.
The electronic device may be any device with computing capability, for example, may be a personal computer (Personal Computer, abbreviated as PC), a mobile terminal, a server, etc., and the mobile terminal may be, for example, a vehicle-mounted device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, etc., and may be a hardware device with various operating systems, a touch screen, and/or a display screen.
As shown in fig. 1, the pre-training method of the language model may include the steps of:
step 101, at least one mask sample text, reference words of mask positions in the mask sample text, and reference categories of word segmentation words in the mask sample text are obtained.
In the embodiment of the disclosure, the mask sample text may be obtained by masking some of the words in the sample text. For each of these words, one of two masking modes can be adopted: one is to replace the whole word with a single mask character; the other is to replace each character in the word with a mask character.
The masking position is the position of the masked word in the masking sample text. Wherein the word may include one character or a plurality of characters.
For example, the sample text may be "shiitake mushroom can stir-fry celery", and the word segmentation terms therein may be "shiitake mushroom", "can", "stir-fry", "celery" and a sentence-final particle, respectively. Mask processing can be performed on "shiitake mushroom" as a whole, and mask processing can be performed on each character in "celery", so that the resulting mask sample text may be, for example, "[mask] can stir-fry [mask][mask]", where each "[mask]" represents one mask character.
In the embodiment of the disclosure, the reference category of the word segmentation term may be an upper concept of the word segmentation term, or may be a number without actual meaning, etc., and may be set according to actual needs.
Step 102, inputting the mask sample text into the feature extraction network in the language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words.
And step 103, inputting the feature vector of the mask position and the feature vector of the word segmentation word into a prediction network in the language model to obtain a prediction word of the mask position and a prediction category of the word segmentation word.
In an embodiment of the present disclosure, the language model may include: a feature extraction network and a prediction network. The prediction network may include: a word prediction network and a category prediction network. The feature extraction network is used for extracting feature vectors of all mask positions in the mask sample text and feature vectors of all word segmentation words in the mask sample text. The feature extraction network may be, for example, an ERNIE network; the word prediction network and the category prediction network may be, for example, fully connected (FC) networks.
The word prediction network is used for combining the feature vectors of all mask positions in the mask sample text and the feature vectors of all word segmentation words to determine the predicted words of the mask positions. And the category prediction network is used for combining the feature vectors of the word segmentation words and determining the prediction category of the word segmentation words.
In the embodiment of the present disclosure, the process of executing step 103 by the electronic device may be, for example, inputting the feature vector of the masking position and the feature vector of the word segmentation word into the word prediction network, and obtaining the predicted word of the masking position output by the word prediction network; and inputting the feature vector of the word segmentation word into a category prediction network, and obtaining the prediction category of the word segmentation word output by the category prediction network.
Word prediction is performed by the word prediction network and category prediction is performed by the category prediction network, and the coefficients of the two networks can be adjusted separately, so that they have different network coefficients. This avoids mutual interference between word prediction and category prediction, and further improves the accuracy of both the word prediction and the category prediction.
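The following is a minimal PyTorch-style sketch of such a structure, with a shared feature extraction network and two separate fully connected prediction heads; the class name, hidden size, vocabulary size and number of categories are illustrative assumptions, and a generic transformer encoder stands in for an actual ERNIE encoder.

```python
import torch
import torch.nn as nn

class MaskedLMWithCategoryHead(nn.Module):
    """Sketch: shared feature extraction network plus a word prediction head
    and a category prediction head with independent coefficients."""

    def __init__(self, vocab_size=30000, num_categories=256, hidden_size=768):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=12, batch_first=True)
        # Stand-in for the feature extraction network (e.g. an ERNIE encoder).
        self.feature_extractor = nn.TransformerEncoder(encoder_layer, num_layers=12)
        # Word prediction network: fully connected layer over the vocabulary.
        self.word_head = nn.Linear(hidden_size, vocab_size)
        # Category prediction network: fully connected layer over the categories.
        self.category_head = nn.Linear(hidden_size, num_categories)

    def forward(self, input_ids):
        # Feature vectors of every position in the mask sample text.
        features = self.feature_extractor(self.embedding(input_ids))
        word_logits = self.word_head(features)          # used at mask positions
        category_logits = self.category_head(features)  # used at word positions
        return features, word_logits, category_logits
```

Because the two heads have their own parameters, adjusting one does not disturb the other, which matches the separation of word prediction and category prediction described above.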
Step 104, constructing a loss function of the language model according to the predicted words and the reference words of the mask positions and the predicted categories and the reference categories of the word segmentation words, and performing pre-training processing on the language model.
In the embodiment of the disclosure, the electronic device may construct a loss function of the language model according to the predicted word and the reference word of the mask position, the predicted category and the reference category of the word segmentation word; and adjusting the coefficient of the language model according to the numerical value of the loss function to obtain the pre-trained language model.
After the pre-trained language model is obtained, a task processing model of a downstream task can be built by combining a feature extraction network in the pre-trained language model aiming at a specific downstream task; and further, training the task processing model by combining training data of the downstream task to obtain the task processing model applicable to the downstream task.
The downstream tasks such as translation task, keyword extraction task, question-answer task, intention recognition task, and the like can be set according to actual needs.
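As a hedged illustration of the reuse described above, the sketch below builds a downstream task model from the pre-trained feature extraction network and a new task head; the specific task (text classification), the head and the pooling choice are assumptions.

```python
import torch.nn as nn

class DownstreamTaskModel(nn.Module):
    """Sketch: reuse the pre-trained feature extraction network and attach a
    new task-specific head; the pre-training prediction heads are discarded."""

    def __init__(self, pretrained_model, num_labels, hidden_size=768):
        super().__init__()
        self.embedding = pretrained_model.embedding
        self.feature_extractor = pretrained_model.feature_extractor
        self.task_head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        features = self.feature_extractor(self.embedding(input_ids))
        # Use the first position's feature vector as a sentence representation.
        return self.task_head(features[:, 0])
```

The resulting model is then trained on the training data of the downstream task in the usual supervised way.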
The pre-training method of the language model comprises: obtaining at least one mask sample text, reference words of mask positions in the mask sample text, and reference categories of word segmentation words in the mask sample text; inputting the mask sample text into a feature extraction network in the language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words; inputting the feature vectors of the mask positions and the feature vectors of the word segmentation words into a prediction network in the language model to obtain predicted words of the mask positions and predicted categories of the word segmentation words; and constructing a loss function of the language model according to the predicted words and reference words of the mask positions and the predicted categories and reference categories of the word segmentation words, and pre-training the language model. In this way, the language model can learn word category information, so that a new word can be understood in combination with other words of the same category, the interference of new words with the output result of the language model is avoided, and the accuracy of the language model is improved.
In order to obtain the mask sample text accurately and reduce the cost of obtaining it, mask processing can be performed on some of the words in the sample text. As shown in fig. 2, fig. 2 is a schematic diagram of a second embodiment according to the present disclosure, and the embodiment shown in fig. 2 may include the following steps:
Step 201, at least one sample text is obtained.
In one example, in an embodiment of the present disclosure, a sentence or multiple sentences may be included in the sample text. In another example, a phrase or multiple phrases may be included in the sample text.
It should be noted that, the sample text may be chinese, english, etc., and is not limited herein, and may be set according to actual needs.
Step 202, for each sample text, masking a part of words in the sample text to obtain a masking sample text corresponding to the sample text.
In the embodiment of the disclosure, for each word in the partial words in the sample text, two masking processing modes of the word can be adopted, one is to replace the word with one masking character; the other is to replace each character in the word with a mask character.
Correspondingly, in one example, the process of executing step 202 by the electronic device may be, for example, performing word segmentation processing on the sample text for each sample text, to obtain a plurality of word segmentation terms in the sample text; selecting a first word segmentation word from a plurality of word segmentation words; and carrying out mask processing on the position of the first word segmentation word in the sample text to obtain mask sample text corresponding to the sample text.
In this example, the electronic device may input the sample text into the word segmentation model and obtain the plurality of word segmentation terms output by the word segmentation model. Taking the sample text "shiitake mushroom can stir-fry celery" as an example, the word segmentation terms are "shiitake mushroom", "can", "stir-fry", "celery" and a sentence-final particle.
In this example, the electronic device may replace each first word segmentation word in the sample text with a mask character to obtain the mask sample text. The first word segmentation word may be, for example, "shiitake mushroom".
The electronic device may select a first word segmentation term from a plurality of word segmentation terms of the sample text according to a first preset ratio. And the first preset proportion represents the maximum ratio of the number of characters in the selected first word segmentation word to the number of characters in the sample text.
The electronic equipment performs word segmentation on the sample text, replaces partial words obtained by the word segmentation with mask characters to quickly obtain mask sample text, and performs mask processing by taking the words as units to ensure the meaning integrity of the words to be predicted at mask positions; the mask sample is combined to pretrain the language model, so that the accuracy of the language model obtained by training can be further improved.
In another example, the electronic device may perform the process of step 202, for example, perform word segmentation processing on the sample text for each sample text, to obtain a plurality of word segmentation terms in the sample text; selecting a first word segmentation word and a second word segmentation word from the plurality of word segmentation words; splitting the second word and phrase to obtain a second character in the second word and phrase; and carrying out mask processing on the position of the first word segmentation word and the position of the second character in the sample text to obtain mask sample text corresponding to the sample text.
In this example, the electronic device may select a first word segment word from the plurality of word segment words of the sample text according to a first preset ratio, and select a second word segment word from the plurality of word segment words of the sample text according to a second preset ratio. And the second preset proportion represents the maximum ratio of the number of characters in the selected second word-dividing words to the number of characters in the sample text.
The first preset proportion and the second preset proportion may be the same or different. The first word segmentation word and the second word segmentation word need to be different. That is, the same word-segmentation word in the sample text is not the same as the first word-segmentation word and the second word-segmentation word at the same time.
Taking the sample text "shiitake mushroom can stir-fry celery" as an example, the second word segmentation word may be, for example, "celery".
The electronic device performs mask processing on the position of the first word segmentation word and the position of the second character in the sample text, and the process of obtaining the mask sample text corresponding to the sample text may be, for example, performing replacement processing on each first word segmentation word by using one mask character; and respectively replacing each second character by adopting one mask character, thereby obtaining mask sample text corresponding to the sample text.
The electronic equipment performs word segmentation on the sample text, and performs replacement processing on first word segmentation words obtained by word segmentation by adopting mask characters; carrying out replacement processing on each character in the second word obtained by word segmentation by adopting mask characters; and pre-training the language model by combining the mask sample text, so that the language model can learn the information of the characters in the sample text and the association relation among the second characters, and the accuracy of the trained language model can be further improved.
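The two selection-and-masking modes described above might look roughly like the following sketch; the tokenised input, the "[MASK]" string, the preset ratios and the random selection strategy are illustrative assumptions.

```python
import random

def build_mask_sample_text(tokens, first_ratio=0.1, second_ratio=0.05,
                           mask_token="[MASK]"):
    """Sketch: whole-word masking for the selected first words and
    character-by-character masking for the selected second words, keeping the
    masked character counts below the preset ratios of the total characters."""
    total_chars = sum(len(t) for t in tokens)
    indices = list(range(len(tokens)))
    random.shuffle(indices)

    first_idx, second_idx = set(), set()
    first_chars = second_chars = 0
    for i in indices:
        n = len(tokens[i])
        if first_chars + n <= first_ratio * total_chars:
            first_idx.add(i)
            first_chars += n
        elif second_chars + n <= second_ratio * total_chars:
            second_idx.add(i)
            second_chars += n

    masked = []
    for i, tok in enumerate(tokens):
        if i in first_idx:
            masked.append(mask_token)               # one mask character per word
        elif i in second_idx:
            masked.extend(mask_token for _ in tok)  # one mask character per character
        else:
            masked.append(tok)
    return masked
```

Calling `build_mask_sample_text` on the word-segmented sample text yields a token list in which the first words are replaced by a single mask token each and the second words are replaced character by character; the remaining tokens are kept unchanged.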
Step 203, determining a mask position in the mask sample text and a reference word of the mask position according to the sample text.
In the embodiment of the present disclosure, the electronic device performs the process of step 203 may be, for example, determining a mask position in the mask sample text; the words at the masking positions in the sample text are used as reference words of the masking positions.
The electronic device may obtain the mask characters in the mask sample text and use the positions, in the sample text, of the words replaced by the mask characters as the mask positions. For example, taking the sample text "shiitake mushroom can stir-fry celery" as an example, the items replaced by mask characters are the word "shiitake mushroom" and the two characters of "celery"; the first mask position may be the positions of the first and second characters in the sample text, the second mask position may be the position of the sixth character, and the third mask position may be the position of the seventh character.
The electronic equipment takes words at the mask positions in the sample text as reference words at the mask positions, so that the corresponding relation between the subsequent predicted words at the mask positions and the reference words is conveniently determined, a loss function is further constructed to pretrain the language model, and the accuracy of the language model obtained through training is improved.
Step 204, a category vocabulary is queried according to the word segmentation words in the mask sample text to obtain the reference categories of the word segmentation words.
In embodiments of the present disclosure, multiple terms may be included in the category vocabulary, as well as the category of each term. The electronic device may perform the step 204, for example, by querying a category vocabulary according to the word segmentation term in the mask sample text, and obtaining a term matching the word segmentation term in the category vocabulary; and taking the category of the matched word in the category word list as the reference category of the word segmentation word.
In an embodiment of the present disclosure, the electronic device may obtain the category vocabulary prior to step 204. The process of the electronic device obtaining the category vocabulary may be, for example, obtaining a preset vocabulary, where the preset vocabulary includes words greater than a preset number; determining the coding vector of each word; clustering each word according to the coding vector of each word to obtain at least one clustering result; determining a reference category of each word according to the category of at least one clustering result; and generating a category vocabulary according to each word and the reference category of each word.
The process of determining the encoding vector of each word by the electronic device may be, for example, sequentially inputting each word into the bag-of-words model, and obtaining the encoding vector of each word output by the bag-of-words model.
The clustering process of the electronic device according to the encoding vectors of the words may be, for example, determining, for any two words, a vector similarity between the encoding vectors of the two words; clustering each word according to the vector similarity to obtain at least one clustering result; the vector similarity between any two words in the same clustering result is larger; vector similarity between two words in different clustering results is smaller.
The category of the at least one clustering result may be a number of the at least one clustering result. Wherein, the serial numbers of different clustering results are different. The numbers such as 0,1,2, and 3 may be set as needed. The number of clustering results may be 256, for example. The clustering method may be, for example, a k-means clustering method.
The electronic equipment performs clustering processing on each word according to the coding vector of each word so as to determine the category of each word; words of similar meaning can be made to have the same category; according to the generated category word list, the reference category of the word segmentation words in the mask sample text is determined, the reference category of the word segmentation words can be accurately determined, and the accuracy of the language model obtained through training is further improved.
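As a minimal sketch of how such a category vocabulary could be built, the code below clusters pre-computed word encoding vectors with k-means and uses each word's cluster number as its reference category; the use of scikit-learn, the encoding vectors being supplied as a dictionary, and 256 clusters (as in the example above) are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_category_vocabulary(words, word_vectors, num_clusters=256):
    """Sketch: cluster the encoding vectors of the words and use each word's
    cluster number (e.g. 0..255) as its reference category."""
    vectors = np.asarray([word_vectors[w] for w in words])
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(vectors)
    # Category vocabulary: word -> reference category (cluster number).
    return {word: int(label) for word, label in zip(words, labels)}
```

Words whose encoding vectors fall into the same cluster, i.e. words with similar meanings, receive the same reference category, which is what step 204 later looks up.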
Step 205, inputting the mask sample text into a feature extraction network in the language model to obtain feature vectors of mask positions and feature vectors of word segmentation words.
Step 206, inputting the feature vectors of the mask positions and the feature vectors of the word segmentation words into a prediction network in the language model to obtain the predicted words of the mask positions and the predicted categories of the word segmentation words.
Step 207, constructing a loss function of the language model according to the predicted word and the reference word of the mask position, the predicted category and the reference category of the word segmentation word, and performing pre-training processing on the language model.
It should be noted that, for details of step 205 to step 207, reference may be made to step 102 to step 104 in the embodiment shown in fig. 1, and detailed description thereof will not be provided here.
The pre-training method of the language model of this embodiment comprises: obtaining at least one sample text; masking some of the words in each sample text to obtain a mask sample text corresponding to the sample text; determining mask positions in the mask sample text and reference words of the mask positions according to the sample text; querying a category vocabulary according to the word segmentation words in the mask sample text to obtain reference categories of the word segmentation words; inputting the mask sample text into a feature extraction network in the language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words; inputting the feature vectors of the mask positions and the feature vectors of the word segmentation words into a prediction network in the language model to obtain predicted words of the mask positions and predicted categories of the word segmentation words; and constructing a loss function of the language model according to the predicted words and reference words of the mask positions and the predicted categories and reference categories of the word segmentation words, and pre-training the language model. In this way, the language model can learn word category information, so that a new word can be understood in combination with other words of the same category, the interference of new words with the output result of the language model is avoided, and the accuracy of the language model is improved.
In order to accurately combine the predicted words and the reference words at the mask positions and the predicted categories and the reference categories of the word segmentation words, a loss function of the language model is built, and then coefficient adjustment is carried out on the language model, so that training is realized, a word prediction loss function can be built by combining the predicted words and the reference words at the mask positions, and a category prediction loss function can be built by combining the predicted categories and the reference categories of the word segmentation words; and further constructing a loss function of the language model. As shown in fig. 3, fig. 3 is a schematic diagram of a third embodiment according to the present disclosure, and the embodiment shown in fig. 3 may include the following steps:
step 301, obtaining at least one mask sample text, reference words of mask positions in the mask sample text, and reference categories of word segmentation words in the mask sample text.
Step 302, inputting the mask sample text into a feature extraction network in the language model to obtain feature vectors of mask positions and feature vectors of word segmentation words.
Step 303, inputting the feature vector of the mask position and the feature vector of the word segmentation word into a prediction network in the language model to obtain a prediction word of the mask position and a prediction category of the word segmentation word.
Step 304, constructing a word prediction loss function according to the predicted word and the reference word of the mask position.
In the embodiment of the disclosure, the reference word of a mask position may be a word segmentation word in the sample text corresponding to the mask sample text, or may be a character obtained by splitting a word segmentation word.
In one example, the reference word of the mask position may be a word segmentation word in the sample text corresponding to the mask sample text. Correspondingly, the electronic device may perform the process of step 304 by, for example, determining, for each mask position, a feature vector of a predicted word of the mask position; determining a feature vector of a reference word of the mask position; constructing a word prediction loss term according to the feature vector of the predicted word of the mask position and the feature vector of the reference word; and constructing a word prediction loss function according to the word prediction loss terms of the mask positions.
The feature vector of the predicted word of the mask position may be a feature vector of the mask position obtained after the mask sample text is input to the feature extraction network. The feature vector of the reference word at the mask position may be a feature vector of the reference word obtained after extracting the reference word or the text input feature including the reference word from the network.
Under the condition that the reference words at the mask positions are word segmentation words in the sample text corresponding to the mask sample text, a word prediction loss function is constructed according to the feature vectors of the predicted words at each mask position and the feature vectors of the reference words, so that the accuracy of the constructed word prediction loss function can be improved, the accuracy of the constructed loss function is further improved, and the accuracy of the language model obtained through training is improved.
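For the case where the reference word of a mask position is itself a word segmentation word, a per-position word prediction loss term might be sketched as follows; comparing the two feature vectors with a mean-squared-error distance is an assumption, since the text does not fix the concrete form of the term.

```python
import torch.nn.functional as F

def mask_position_word_loss_term(mask_position_vector, reference_word_vector):
    """Sketch: word prediction loss term for one mask position, comparing the
    feature vector of the predicted word (the mask position's feature vector)
    with the feature vector of the reference word. MSE is an assumed distance."""
    return F.mse_loss(mask_position_vector, reference_word_vector)
```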
In another example, the reference word of the partial mask position may be a word segmentation word in the sample text corresponding to the mask sample text; the reference words of the partial mask positions can be obtained by splitting word segmentation words in the sample text corresponding to the mask sample text. Correspondingly, the process of executing step 304 by the electronic device may be, for example, determining a correspondence between at least one mask position and a position of a word segmentation word in the sample text, where in the correspondence, a reference word of the at least one mask position is obtained by splitting the word segmentation word of the corresponding position; aiming at the position of each word segmentation word in the sample text, constructing a word prediction loss term according to the feature vector of the reference word of the corresponding at least one mask position and the feature vector of the word segmentation word; and constructing a word prediction loss function according to each word prediction loss term.
The correspondence between at least one mask position and the position of a word segmentation word in the sample text means that the position of one word segmentation word in the sample text may correspond to one or more mask positions. For example, taking the sample text "shiitake mushroom can stir-fry celery" as an example, the corresponding mask sample text may be "[mask] can stir-fry [mask][mask]". The word segmentation word "shiitake mushroom" in the sample text corresponds to the mask position of the first mask character in the mask sample text; the word segmentation word "celery" corresponds to the positions of the second and third mask characters in the mask sample text.
Under the condition that the word segmentation words in the sample text corresponding to the mask sample text are split, the corresponding relation between at least one mask position and the position of the word segmentation word in the sample text is determined, and then the feature vector of the reference word and the feature vector of the word segmentation word in at least one mask position in the corresponding relation are combined to construct a word prediction loss function, so that the accuracy of the constructed word prediction loss function can be improved, the accuracy of the constructed loss function can be improved, and the accuracy of the language model obtained through training is improved.
The process of the electronic device constructing the word prediction loss term for the position of each word segmentation word in the sample text may be, for example: adding the feature vectors of the reference words of the corresponding at least one mask position to obtain a predicted feature vector of the position; and constructing a word prediction loss term according to the predicted feature vector and the feature vector of the word segmentation word.
The electronic device adds, for the position of each word segmentation word in the sample text, the feature vectors of the reference words of the corresponding at least one mask position to obtain a predicted feature vector of the position, and then constructs a word prediction loss term. This can improve the accuracy of the constructed word prediction loss term and thus the accuracy of the constructed word prediction loss function.
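A hedged sketch of the split-word case described above: the feature vectors of the reference characters at the mask positions belonging to one word are added to form a predicted feature vector for that word's position, which is then compared with the feature vector of the word segmentation word; again, MSE is an assumed choice of distance.

```python
import torch
import torch.nn.functional as F

def split_word_prediction_loss_term(reference_char_vectors, word_vector):
    """Sketch: sum the feature vectors of the reference characters at the
    corresponding mask positions to obtain the predicted feature vector of the
    word's position, then compare it with the word segmentation word's feature
    vector. The MSE comparison is an assumption."""
    predicted_vector = torch.stack(reference_char_vectors, dim=0).sum(dim=0)
    return F.mse_loss(predicted_vector, word_vector)
```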
Step 305, constructing a category prediction loss function according to the prediction category and the reference category of the word segmentation word.
In the embodiment of the present disclosure, the process of the electronic device performing step 305 may be, for example: for each word segmentation word in the mask sample text, constructing a category prediction loss term according to the predicted category and the reference category of the word; and constructing the category prediction loss function according to each category prediction loss term.
Step 306, constructing a loss function according to the word prediction loss function and the category prediction loss function.
The loss function may be a sum of the word prediction loss function and the category prediction loss function.
Step 307, according to the value of the loss function, coefficient adjustment is performed on the language model to realize the pre-training processing of the language model.
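Putting the pieces together, a single pre-training step over one mask sample text might be sketched as follows; cross-entropy losses for the two heads, a batch of one sample, and the optimizer choice are assumptions, since the text only states that the loss function is the sum of the word prediction loss function and the category prediction loss function and that the model's coefficients are adjusted according to its value.

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, optimizer, input_ids, mask_positions, ref_word_ids,
                     word_positions, ref_category_ids):
    """One illustrative pre-training step (single sample): word prediction loss
    at the mask positions plus category prediction loss at the word positions."""
    features, word_logits, category_logits = model(input_ids)

    # Word prediction loss function: predicted vs. reference words at mask positions.
    word_loss = F.cross_entropy(word_logits[0, mask_positions], ref_word_ids)
    # Category prediction loss function: predicted vs. reference categories of words.
    category_loss = F.cross_entropy(category_logits[0, word_positions], ref_category_ids)

    loss = word_loss + category_loss          # sum of the two loss functions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # coefficient adjustment of the model
    return float(loss)
```

Here `model` is assumed to expose the two prediction heads as in the earlier sketch; `mask_positions` and `word_positions` index into the sequence, and `ref_word_ids` and `ref_category_ids` hold the reference words and reference categories as integer ids.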
It should be noted that, for details of steps 301 to 303, reference may be made to steps 101 to 103 in the embodiment shown in fig. 1, and detailed description thereof will not be provided here.
The pre-training method of the language model of the embodiment of the disclosure comprises: obtaining at least one mask sample text, reference words of mask positions in the mask sample text, and reference categories of word segmentation words in the mask sample text; inputting the mask sample text into a feature extraction network in the language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words; inputting the feature vectors of the mask positions and the feature vectors of the word segmentation words into a prediction network in the language model to obtain predicted words of the mask positions and predicted categories of the word segmentation words; constructing a word prediction loss function according to the predicted words and reference words of the mask positions; constructing a category prediction loss function according to the predicted categories and reference categories of the word segmentation words; constructing a loss function according to the word prediction loss function and the category prediction loss function; and performing coefficient adjustment on the language model according to the value of the loss function to realize the pre-training of the language model. In this way, the language model can learn word category information and word association information, so that a new word can be understood in combination with other words of the same category and associated words, the interference of new words with the output result of the language model is avoided, and the accuracy of the language model is improved.
The following examples are illustrative. The pre-training schematic of the language model may be as shown in fig. 4, and in fig. 4, the language model may include: a feature extraction network and a prediction network. The predictive network may include: word prediction network and category prediction network. Wherein a word prediction network (not shown in fig. 4) performs a word prediction task (word prediction task) based on an output result of the feature extraction network; a category prediction network (not shown in fig. 4) performs a category prediction task (word classification task) based on the output result of the feature extraction network. Wherein the input of the feature extraction network (ERNIE pre-training model) may be a mask sample text (sample mask) and the output may be feature vectors of mask positions in the mask sample text and feature vectors of word segmentation words.
In fig. 4, the generation process of the category vocabulary (clustering vocabulary) used in the category prediction task may be, for example: obtaining coding vectors of a large number of words (open-source word codes), and performing clustering on them (k-means clustering) and category mapping (word class mapping) to obtain the clustering vocabulary.
In order to implement the above embodiment, the present disclosure further provides a pre-training device for a language model. As shown in fig. 5, fig. 5 is a schematic diagram according to a fourth embodiment of the present disclosure. The pre-training device 50 of the language model may include: a first acquisition module 501, a second acquisition module 502, a third acquisition module 503, and a training module 504.
The first obtaining module 501 is configured to obtain at least one mask sample text, a reference word of a mask position in the mask sample text, and a reference category of a word segmentation word in the mask sample text;
a second obtaining module 502, configured to input the mask sample text into a feature extraction network in a language model, to obtain a feature vector of the mask position and a feature vector of the word segmentation word;
a third obtaining module 503, configured to input the feature vector of the mask position and the feature vector of the word segmentation word into a prediction network in the language model, to obtain a predicted word of the mask position and a predicted category of the word segmentation word;
and the training module 504 is configured to construct a loss function of the language model according to the predicted word and the reference word of the mask position, the predicted category and the reference category of the word segmentation word, and perform pre-training processing on the language model.
As one possible implementation manner of the embodiment of the present disclosure, the first obtaining module 501 includes: a first acquisition unit, a mask processing unit, a determination unit, and a second acquisition unit; the first acquisition unit is used for acquiring at least one sample text; the mask processing unit is used for carrying out mask processing on partial words in each sample text to obtain a mask sample text corresponding to the sample text; the determining unit is used for determining a mask position in the mask sample text and a reference word of the mask position according to the sample text; the second obtaining unit is configured to obtain reference categories of the word segmentation words by querying a category vocabulary according to the word segmentation words in the mask sample text.
As a possible implementation manner of the embodiment of the present disclosure, the mask processing unit is specifically configured to perform word segmentation processing on each sample text to obtain a plurality of word segmentation terms in the sample text; selecting a first word segmentation word from a plurality of word segmentation words; and carrying out mask processing on the position of the first word segmentation word in the sample text to obtain mask sample text corresponding to the sample text.
As a possible implementation manner of the embodiment of the present disclosure, the mask processing unit is specifically configured to perform word segmentation processing on each sample text to obtain a plurality of word segmentation terms in the sample text; selecting a first word segmentation word and a second word segmentation word from a plurality of word segmentation words; splitting the second word and phrase to obtain a second character in the second word and phrase; and carrying out mask processing on the position of the first word segmentation word and the position of the second character in the sample text to obtain mask sample text corresponding to the sample text.
As one possible implementation manner of the embodiments of the present disclosure, the determining unit is specifically configured to determine a mask position in the mask sample text; and taking the words at the mask positions in the sample text as reference words of the mask positions.
As a possible implementation manner of the embodiments of the present disclosure, the second obtaining unit is specifically configured to obtain a preset vocabulary, where the preset vocabulary includes words greater than a preset number; determining the coding vector of each word; clustering each word according to the coding vector of each word to obtain at least one clustering result; determining the reference category of each word according to the category of at least one clustering result; and generating the category vocabulary according to each word and the reference category of each word.
As one possible implementation of the embodiments of the present disclosure, the prediction network includes: the third obtaining module 503 is specifically configured to input, into the word prediction network, a feature vector of the mask position and a feature vector of the word segmentation word, and obtain a predicted word of the mask position output by the word prediction network; and inputting the feature vector of the word segmentation word into the category prediction network to obtain the prediction category of the word segmentation word output by the category prediction network.
As one possible implementation of the embodiments of the present disclosure, the training module 504 includes: the device comprises a first construction unit, a second construction unit, a third construction unit and a training unit; the first construction unit is used for constructing a word prediction loss function according to the predicted word and the reference word of the mask position; the second construction unit is used for constructing a category prediction loss function according to the prediction category and the reference category of the word segmentation words; the third construction unit is used for constructing the loss function according to the word prediction loss function and the category prediction loss function; and the training unit is used for carrying out coefficient adjustment on the language model according to the numerical value of the loss function so as to realize the pre-training treatment on the language model.
As a possible implementation manner of the embodiment of the present disclosure, the reference word of the mask position is a word segmentation word in a sample text corresponding to the mask sample text, and the first construction unit is specifically configured to determine, for each mask position, a feature vector of a predicted word of the mask position; determining a feature vector of a reference word of the mask position; constructing a word prediction loss term according to the feature vector of the predicted word of the mask position and the feature vector of the reference word; and constructing the word prediction loss function according to the word prediction loss terms of the mask positions.
As a possible implementation manner of the embodiment of the present disclosure, at least one reference word of the mask position is obtained by splitting a word segmentation word in a sample text corresponding to the mask sample text, and the first construction unit is specifically configured to determine a correspondence between at least one mask position and a location of the word segmentation word in the sample text, where in the correspondence, at least one reference word of the mask position is obtained by splitting the word segmentation word of the corresponding location; constructing a word prediction loss term according to the feature vector of the reference word of at least one corresponding mask position and the feature vector of the word segmentation word aiming at the position of each word segmentation word in the sample text; and constructing the word prediction loss function according to each word prediction loss term.
As a possible implementation manner of the embodiment of the present disclosure, the first construction unit is specifically configured to, for a location of each word segmentation word in the sample text, add and process a feature vector of a reference word of at least one corresponding mask location to obtain a predicted feature vector of the location; and constructing a word prediction loss term according to the prediction feature vector and the feature vector of the word segmentation word.
The pre-training device of the language model of this embodiment obtains at least one mask sample text, reference words of mask positions in the mask sample text, and reference categories of word segmentation words in the mask sample text; inputs the mask sample text into a feature extraction network in the language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words; inputs the feature vectors of the mask positions and the feature vectors of the word segmentation words into a prediction network in the language model to obtain predicted words of the mask positions and predicted categories of the word segmentation words; and constructs a loss function of the language model according to the predicted words and reference words of the mask positions and the predicted categories and reference categories of the word segmentation words, and pre-trains the language model. In this way, the language model can learn word category information, so that a new word can be understood in combination with other words of the same category, the interference of new words with the output result of the language model is avoided, and the accuracy of the language model is improved.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of the user's personal information involved are all carried out on the premise of obtaining the user's consent, comply with the relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as a pre-training method of a language model. For example, in some embodiments, the pre-training method of the language model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of the language model pre-training method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the pre-training method of the language model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A method of pre-training a language model, comprising:
acquiring at least one mask sample text, reference words of mask positions in the mask sample text and reference categories of word segmentation words in the mask sample text, wherein the reference words are words corresponding to the mask positions in the mask sample text;
inputting the mask sample text into a feature extraction network in a language model to obtain feature vectors of the mask positions and feature vectors of the word segmentation words;
inputting the feature vector of the mask position and the feature vector of the word segmentation word into a prediction network in the language model, wherein the prediction network comprises: a word prediction network and a category prediction network;
inputting the feature vector of the mask position and the feature vector of the word segmentation word into the word prediction network to obtain a predicted word of the mask position output by the word prediction network;
inputting the feature vector of the word segmentation word into the category prediction network to obtain the prediction category of the word segmentation word output by the category prediction network;
and constructing a loss function of the language model according to the predicted words and the reference words of the mask positions, the predicted categories and the reference categories of the word segmentation words, and performing pre-training processing on the language model.
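By way of illustration only, the following is a minimal PyTorch sketch of how the feature extraction network, the two prediction heads, and the joint loss of claim 1 could be organized. The choice of a Transformer encoder as the feature extraction network, the module names, the dimensions, and the use of cross-entropy for both heads are assumptions of this sketch and are not drawn from the claims.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadLanguageModel(nn.Module):
    """Hypothetical sketch: a feature extraction network with a word prediction head
    and a category prediction head on top of the shared feature vectors."""
    def __init__(self, vocab_size=30000, num_categories=512, hidden=256, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, layers)   # feature extraction network
        self.word_head = nn.Linear(hidden, vocab_size)                # word prediction network
        self.category_head = nn.Linear(hidden, num_categories)        # category prediction network

    def forward(self, masked_token_ids):
        features = self.encoder(self.embed(masked_token_ids))         # one feature vector per position
        return self.word_head(features), self.category_head(features)

def joint_loss(word_logits, category_logits, reference_word_ids, reference_category_ids,
               mask_positions, word_positions):
    """Word loss only at the mask positions, category loss only at the positions of the
    word segmentation words; both losses are combined into one value."""
    word_loss = F.cross_entropy(word_logits[mask_positions], reference_word_ids)
    category_loss = F.cross_entropy(category_logits[word_positions], reference_category_ids)
    return word_loss + category_loss

# shape check with random inputs
model = TwoHeadLanguageModel()
word_logits, category_logits = model(torch.randint(0, 30000, (2, 16)))
```

In practice the shared feature vectors would come from a full pre-trained backbone, and both heads would be trained jointly on the combined loss described in claim 6.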
2. The method of claim 1, wherein the obtaining at least one mask sample text, reference words for mask locations in the mask sample text, and reference categories for word segmentation words in the mask sample text comprises:
obtaining at least one sample text;
masking part of words in each sample text to obtain a masking sample text corresponding to the sample text;
determining a mask position in the mask sample text and a reference word of the mask position according to the sample text;
and inquiring a category vocabulary according to the word segmentation words in the mask sample text, and acquiring the reference category of the word segmentation words.
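The masking and category look-up of claim 2 can be pictured with the short sketch below. The [MASK] token, the 15% masking ratio, the random selection of positions, and the representation of the category vocabulary as a plain dictionary are assumptions of the illustration.

```python
import random

MASK = "[MASK]"   # hypothetical mask token

def build_masked_sample(sample_words, category_vocab, mask_ratio=0.15, seed=None):
    """sample_words: the word segmentation words of one sample text (already segmented).
    category_vocab: dict mapping a word to its reference category id (see claim 5).
    Returns the mask sample text, the reference word at each mask position, and the
    reference category of each word segmentation word."""
    rng = random.Random(seed)
    masked = list(sample_words)
    reference_words = {}                                  # mask position -> word originally there
    for position, word in enumerate(sample_words):
        if rng.random() < mask_ratio:
            reference_words[position] = word
            masked[position] = MASK
    # the reference category is obtained by querying the category vocabulary
    reference_categories = [category_vocab.get(word, 0) for word in sample_words]
    return masked, reference_words, reference_categories
```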
3. The method of claim 2, wherein the masking the partial words in the sample text for each sample text to obtain a masked sample text corresponding to the sample text comprises:
performing word segmentation processing on each sample text to obtain a plurality of word segmentation words in the sample text;
selecting a first word segmentation word from a plurality of word segmentation words;
and carrying out mask processing on the position of the first word segmentation word in the sample text to obtain mask sample text corresponding to the sample text.
4. The method of claim 2, wherein the masking the partial words in the sample text for each sample text to obtain a masked sample text corresponding to the sample text comprises:
performing word segmentation processing on each sample text to obtain a plurality of word segmentation words in the sample text;
selecting a first word segmentation word and a second word segmentation word from a plurality of word segmentation words;
splitting the second word and phrase to obtain a second character in the second word and phrase;
and carrying out mask processing on the position of the first word segmentation word and the position of the second character in the sample text to obtain mask sample text corresponding to the sample text.
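One possible reading of claims 3 and 4 is sketched below: a first selected word segmentation word is masked as a whole, while a second selected word is split into characters and masked at the character level. The helper name, the [MASK] token, and the choice to mask a single character of the second word are assumptions; the claims leave these details open.

```python
import random

MASK = "[MASK]"   # hypothetical mask token

def mask_word_and_character(sample_words, seed=None):
    """sample_words must contain at least two word segmentation words.
    The first selected word is masked as a whole (word-level masking);
    the second selected word is split into characters and one character
    position is masked (character-level masking)."""
    rng = random.Random(seed)
    first_idx, second_idx = rng.sample(range(len(sample_words)), 2)
    masked_sequence = []
    for i, word in enumerate(sample_words):
        if i == first_idx:
            masked_sequence.append(MASK)                  # whole word becomes one mask position
        elif i == second_idx:
            chars = list(word)                            # split the word into its characters
            chars[rng.randrange(len(chars))] = MASK       # mask one character position
            masked_sequence.extend(chars)
        else:
            masked_sequence.append(word)
    return masked_sequence
```

Mixing word-level and character-level mask positions in this way gives the model both coarse-grained and fine-grained prediction targets within one mask sample text.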
5. The method of claim 2, wherein before the category vocabulary is queried according to the word segmentation words in the mask sample text to obtain the reference category of the word segmentation words, the method further comprises:
acquiring a preset word list, wherein the preset word list comprises more than a preset number of words;
determining the coding vector of each word;
clustering each word according to the coding vector of each word to obtain at least one clustering result;
determining the reference category of each word according to the category of at least one clustering result;
and generating the category vocabulary according to each word and the reference category of each word.
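Claim 5 leaves the clustering method open. As one possible illustration, the sketch below uses k-means from scikit-learn over pre-existing encoding vectors (for example, pre-trained word embeddings) and takes the cluster index of each word as its reference category; the number of categories is an assumed hyperparameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_category_vocab(words, encoding_vectors, num_categories=512, seed=0):
    """words: the preset word list; encoding_vectors: one encoding vector per word,
    shape (len(words), dim). Clusters the encoding vectors and maps every word
    to the index of its cluster, which serves as the word's reference category."""
    kmeans = KMeans(n_clusters=num_categories, random_state=seed, n_init=10)
    cluster_ids = kmeans.fit_predict(np.asarray(encoding_vectors))
    return {word: int(cluster_id) for word, cluster_id in zip(words, cluster_ids)}
```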
6. The method of claim 1, wherein constructing the loss function of the language model according to the predicted word and the reference word of the mask position, the predicted category and the reference category of the word segmentation word, and performing pre-training processing on the language model comprises:
constructing a word prediction loss function according to the predicted words and the reference words of the mask positions;
constructing a category prediction loss function according to the prediction category and the reference category of the word segmentation words;
constructing a loss function according to the word prediction loss function and the category prediction loss function;
and according to the numerical value of the loss function, carrying out coefficient adjustment on the language model to realize the pre-training treatment of the language model.
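A single pre-training step consistent with claim 6 might look as follows; the weighting factor between the two losses, the batch layout, and the loss-function callables are assumptions of this sketch rather than details taken from the claim.

```python
def pretraining_step(model, optimizer, batch, word_loss_fn, category_loss_fn, alpha=1.0):
    """Builds the word prediction loss and the category prediction loss, combines
    them into one loss value, and adjusts the model coefficients (parameters)
    according to that value."""
    word_logits, category_logits = model(batch["masked_ids"])
    word_loss = word_loss_fn(word_logits, batch)               # predicted vs. reference words
    category_loss = category_loss_fn(category_logits, batch)   # predicted vs. reference categories
    loss = word_loss + alpha * category_loss                   # combined loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                           # coefficient adjustment
    return float(loss)
```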
7. The method of claim 6, wherein the reference word of the mask position is a word segmentation word in a sample text corresponding to the mask sample text, and the constructing a word prediction loss function according to the predicted word and the reference word of the mask position includes:
for each mask position, determining a feature vector of a predicted word of the mask position;
determining a feature vector of a reference word of the mask position;
constructing a word prediction loss term according to the feature vector of the predicted word of the mask position and the feature vector of the reference word;
and constructing the word prediction loss function according to the word prediction loss terms of the mask positions.
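Claim 7 builds one loss term per mask position from the feature vector of the predicted word and the feature vector of the reference word, and then assembles the terms into the word prediction loss function. The claim does not fix the distance between the two vectors; the sketch below assumes one minus cosine similarity.

```python
import torch.nn.functional as F

def word_prediction_loss(predicted_vectors, reference_vectors):
    """predicted_vectors, reference_vectors: feature vectors of the predicted word and
    of the reference word at each mask position, both of shape (num_mask_positions, dim).
    Each per-position value is a word prediction loss term; their mean is the
    word prediction loss function over all mask positions."""
    per_position_terms = 1.0 - F.cosine_similarity(predicted_vectors, reference_vectors, dim=-1)
    return per_position_terms.mean()
```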
8. The method of claim 6, wherein the reference word of at least one mask position is obtained by splitting word segmentation words in sample text corresponding to the mask sample text, and the constructing a word prediction loss function according to the predicted word and the reference word of the mask position includes:
determining a corresponding relation between at least one mask position and the position of the word segmentation word in the sample text, wherein in the corresponding relation, at least one reference word of the mask position is obtained by splitting the word segmentation word at the corresponding position;
constructing a word prediction loss term according to the feature vector of the reference word of at least one corresponding mask position and the feature vector of the word segmentation word aiming at the position of each word segmentation word in the sample text;
and constructing the word prediction loss function according to each word prediction loss term.
9. The method of claim 8, wherein the constructing, for each word segmentation word in the sample text, a word prediction loss term according to the feature vector of the reference word of the corresponding at least one mask position and the feature vector of the word segmentation word includes:
adding the feature vector of the reference word of at least one corresponding mask position according to the position of each word segmentation word in the sample text to obtain a predicted feature vector of the position;
and constructing a word prediction loss term according to the prediction feature vector and the feature vector of the word segmentation word.
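Claims 8 and 9 can be illustrated as follows. When the reference words of several mask positions were obtained by splitting one word segmentation word, the feature vectors at those mask positions are added to form a predicted feature vector for the word's position, and that vector is compared with the feature vector of the word segmentation word itself. The correspondence structure, the mean squared error distance, and whether the added vectors come from the model output or from an embedding table are all assumptions of this sketch.

```python
import torch

def split_word_loss(mask_position_vectors, word_feature_vectors, correspondence):
    """mask_position_vectors: feature vectors at the mask positions, shape (M, dim).
    word_feature_vectors: feature vectors of the word segmentation words, shape (W, dim).
    correspondence: dict mapping a word index to the list of mask positions whose
    reference words were split from that word."""
    terms = []
    for word_idx, positions in correspondence.items():
        predicted_vector = mask_position_vectors[positions].sum(dim=0)   # added feature vectors
        term = torch.mean((predicted_vector - word_feature_vectors[word_idx]) ** 2)
        terms.append(term)                                               # word prediction loss term
    return torch.stack(terms).mean()     # word prediction loss over all split-word positions
```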
10. A pre-training apparatus for a language model, comprising:
the first acquisition module is used for acquiring at least one mask sample text, reference words of mask positions in the mask sample text and reference categories of word segmentation words in the mask sample text, wherein the reference words are words corresponding to the mask positions in the mask sample text;
the second acquisition module is used for inputting the mask sample text into a feature extraction network in a language model to acquire feature vectors of the mask positions and feature vectors of the word segmentation words;
the third acquisition module is used for inputting the feature vector of the mask position and the feature vector of the word segmentation word into a prediction network in the language model to obtain the predicted word of the mask position and the prediction category of the word segmentation word;
the training module is used for constructing a loss function of the language model according to the predicted words and the reference words of the mask positions, the predicted categories and the reference categories of the word segmentation words and performing pre-training processing on the language model;
the predictive network includes: a word prediction network and a category prediction network, the third acquisition module being specifically configured to,
inputting the feature vector of the mask position and the feature vector of the word segmentation word into the word prediction network to obtain a predicted word of the mask position output by the word prediction network;
and inputting the feature vector of the word segmentation word into the category prediction network to obtain the prediction category of the word segmentation word output by the category prediction network.
11. The apparatus of claim 10, wherein the first acquisition module comprises: a first acquisition unit, a mask processing unit, a determination unit, and a second acquisition unit;
the first acquisition unit is used for acquiring at least one sample text;
the mask processing unit is used for carrying out mask processing on partial words in the sample text aiming at each sample text to obtain mask sample text corresponding to the sample text;
the determining unit is used for determining a mask position in the mask sample text and a reference word of the mask position according to the sample text;
the second acquisition unit is used for querying a category vocabulary according to the word segmentation words in the mask sample text to obtain the reference category of the word segmentation words.
12. The apparatus of claim 11, wherein the mask processing unit is specifically configured to,
performing word segmentation processing on each sample text to obtain a plurality of word segmentation words in the sample text;
selecting a first word segmentation word from a plurality of word segmentation words;
and carrying out mask processing on the position of the first word segmentation word in the sample text to obtain mask sample text corresponding to the sample text.
13. The apparatus of claim 11, wherein the mask processing unit is specifically configured to,
performing word segmentation processing on each sample text to obtain a plurality of word segmentation words in the sample text;
selecting a first word segmentation word and a second word segmentation word from a plurality of word segmentation words;
splitting the second word and phrase to obtain a second character in the second word and phrase;
and carrying out mask processing on the position of the first word segmentation word and the position of the second character in the sample text to obtain mask sample text corresponding to the sample text.
14. The apparatus of claim 11, wherein the second acquisition unit is configured to,
acquiring a preset word list, wherein the preset word list comprises more than a preset number of words;
determining the coding vector of each word;
clustering each word according to the coding vector of each word to obtain at least one clustering result;
determining the reference category of each word according to the category of at least one clustering result;
and generating the category vocabulary according to each word and the reference category of each word.
15. The apparatus of claim 10, wherein the training module comprises: the device comprises a first construction unit, a second construction unit, a third construction unit and a training unit;
the first construction unit is used for constructing a word prediction loss function according to the predicted word and the reference word of the mask position;
the second construction unit is used for constructing a category prediction loss function according to the prediction category and the reference category of the word segmentation words;
the third construction unit is used for constructing the loss function according to the word prediction loss function and the category prediction loss function;
and the training unit is used for carrying out coefficient adjustment on the language model according to the numerical value of the loss function so as to realize the pre-training treatment on the language model.
16. The apparatus of claim 15, wherein the reference word of the mask position is a word segmentation word in a sample text corresponding to the mask sample text, and the first construction unit is specifically configured to,
for each mask position, determining a feature vector of a predicted word of the mask position;
determining a feature vector of a reference word of the mask position;
constructing a word prediction loss term according to the feature vector of the predicted word of the mask position and the feature vector of the reference word;
and constructing the word prediction loss function according to the word prediction loss terms of the mask positions.
17. The apparatus according to claim 15, wherein the reference word of at least one of the mask positions is a word segmentation word in a sample text corresponding to the mask sample text, and the first construction unit is specifically configured to,
determining a corresponding relation between at least one mask position and the position of the word segmentation word in the sample text, wherein in the corresponding relation, at least one reference word of the mask position is obtained by splitting the word segmentation word at the corresponding position;
constructing a word prediction loss term according to the feature vector of the reference word of at least one corresponding mask position and the feature vector of the word segmentation word aiming at the position of each word segmentation word in the sample text;
and constructing the word prediction loss function according to each word prediction loss term.
18. The apparatus of claim 17, wherein the first construction unit is specifically configured to,
adding the feature vector of the reference word of at least one corresponding mask position according to the position of each word segmentation word in the sample text to obtain a predicted feature vector of the position;
and constructing a word prediction loss term according to the prediction feature vector and the feature vector of the word segmentation word.
19. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
20. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 9.
CN202211703408.6A 2022-12-28 2022-12-28 Pre-training method and device for language model and electronic equipment Active CN116244432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211703408.6A CN116244432B (en) 2022-12-28 2022-12-28 Pre-training method and device for language model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211703408.6A CN116244432B (en) 2022-12-28 2022-12-28 Pre-training method and device for language model and electronic equipment

Publications (2)

Publication Number Publication Date
CN116244432A CN116244432A (en) 2023-06-09
CN116244432B true CN116244432B (en) 2023-11-14

Family

ID=86628755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211703408.6A Active CN116244432B (en) 2022-12-28 2022-12-28 Pre-training method and device for language model and electronic equipment

Country Status (1)

Country Link
CN (1) CN116244432B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111737995A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device, equipment and medium for training language model based on multiple word vectors
CN114565104A (en) * 2022-03-01 2022-05-31 腾讯科技(深圳)有限公司 Language model pre-training method, result recommendation method and related device
WO2022121178A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Training method and apparatus and recognition method and apparatus for text error correction model, and computer device
CN114638220A (en) * 2022-03-31 2022-06-17 阿里巴巴(中国)有限公司 Training method of task processing model, task processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737996B (en) * 2020-05-29 2024-03-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for obtaining word vector based on language model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111737995A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device, equipment and medium for training language model based on multiple word vectors
WO2022121178A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Training method and apparatus and recognition method and apparatus for text error correction model, and computer device
CN114565104A (en) * 2022-03-01 2022-05-31 腾讯科技(深圳)有限公司 Language model pre-training method, result recommendation method and related device
CN114638220A (en) * 2022-03-31 2022-06-17 阿里巴巴(中国)有限公司 Training method of task processing model, task processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xu Feifei; Feng Dongsheng. Research on text word vectors and pre-trained language models. Journal of Shanghai University of Electric Power, 2020, (04), full text. *
Research on text word vectors and pre-trained language models; Xu Feifei; Feng Dongsheng; Journal of Shanghai University of Electric Power, 2020, (04), full text *

Also Published As

Publication number Publication date
CN116244432A (en) 2023-06-09

Similar Documents

Publication Publication Date Title
US20220383190A1 (en) Method of training classification model, method of classifying sample, and device
US20210200813A1 (en) Human-machine interaction method, electronic device, and storage medium
CN112926306B (en) Text error correction method, device, equipment and storage medium
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
CN114861889B (en) Deep learning model training method, target object detection method and device
CN112163405A (en) Question generation method and device
CN114840671A (en) Dialogue generation method, model training method, device, equipment and medium
CN113806552A (en) Information extraction method and device, electronic equipment and storage medium
CN113053367A (en) Speech recognition method, model training method and device for speech recognition
CN113220835A (en) Text information processing method and device, electronic equipment and storage medium
US20230103728A1 (en) Method for sample augmentation
CN114861637B (en) Spelling error correction model generation method and device, and spelling error correction method and device
CN115062718A (en) Language model training method and device, electronic equipment and storage medium
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN117421403A (en) Intelligent dialogue method and device and electronic equipment
CN114758649B (en) Voice recognition method, device, equipment and medium
CN113792230B (en) Service linking method, device, electronic equipment and storage medium
CN116244432B (en) Pre-training method and device for language model and electronic equipment
CN112784599B (en) Method and device for generating poem, electronic equipment and storage medium
CN114841172A (en) Knowledge distillation method, apparatus and program product for text matching double tower model
CN114417862A (en) Text matching method, and training method and device of text matching model
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium
CN113806541A (en) Emotion classification method and emotion classification model training method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant