CN114186043B - Pre-training method, device, equipment and storage medium - Google Patents

Pre-training method, device, equipment and storage medium

Info

Publication number
CN114186043B
Authority
CN
China
Prior art keywords
character
text sentence
sentence
characters
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111505109.7A
Other languages
Chinese (zh)
Other versions
CN114186043A (en)
Inventor
李如寐
王思睿
张富峥
武威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202111505109.7A
Publication of CN114186043A
Application granted
Publication of CN114186043B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a pre-training method, apparatus, device, and storage medium, and belongs to the field of computer technology. The method comprises the following steps: obtaining an initial text sentence after character masking processing; obtaining a target text sentence based on the masked initial text sentence and an additional character placed before the sentence; determining a mask matrix corresponding to the target text sentence, wherein the mask matrix comprises a plurality of elements, each element indicates to a feature extraction model to be trained the degree of operational association, during feature extraction, between the two characters of the target text sentence to which the element corresponds, and the elements corresponding to the additional character before the sentence are not 0; and training the feature extraction model to be trained based on the initial text sentence, the target text sentence, and the mask matrix. With the method and apparatus, not only can a feature vector be obtained for each character in the target text sentence, but a feature vector for the target text sentence as a whole can also be obtained, without additional training, which reduces data operation resources and operation time.

Description

Pre-training method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a pre-training method, apparatus, device, and storage medium.
Background
With the continuous development of machine learning technology, pre-training models have become widely popular. A pre-trained model is a model pre-trained using a large amount of unlabeled training data. After pre-training is completed, a small amount of labeled training data can be used for target-specific training according to different requirements, thereby obtaining a trained model. For example, to obtain a model for predicting the emotion of a text sentence (e.g., predicting whether a text sentence is happy or sad), a BERT (Bidirectional Encoder Representations from Transformers) model may be pre-trained to obtain a pre-trained BERT model, a classifier is then added after the BERT model, and the BERT model and the classifier are trained together using a labeled training set, so as to obtain a trained model capable of predicting the emotion of a text sentence.
Generally, the method of pre-training the model is as follows: character masking processing is performed on a text sentence to mask part of its characters, and the masked text sentence is input into the model to be trained to obtain the feature vector, output by the model to be trained, corresponding to each character in the masked text sentence. The model to be trained is then trained based on the feature vector corresponding to each character and the text sentence that has not undergone character masking processing.
With this pre-training method, the trained model can only predict the feature vector of each character in a text sentence, that is, only the character vector of each character can be obtained, while the sentence vector of the whole text sentence cannot be predicted. If a vector representation of the whole text sentence is required, another machine learning module must be added after the pre-trained model and labeled training must be performed on it before a model that can predict the vector of a text sentence is obtained, which requires a large amount of data operation resources and operation time.
Disclosure of Invention
The embodiment of the application provides a pre-training method, which can solve the problem that in the prior art a large amount of data operation resources and operation time are needed to obtain a vector of a text sentence.
In a first aspect, a pre-training method is provided, where the method includes:
carrying out character covering processing on the initial text sentence to obtain the initial text sentence after the character covering processing;
obtaining a target text sentence based on the initial text sentence after the character covering processing and the additional characters before the sentence;
determining a mask matrix corresponding to the target text sentence, wherein the mask matrix comprises a plurality of elements, each element is used for indicating the operation association degree of two characters corresponding to the elements in the target text sentence in the feature extraction process to a feature extraction model to be trained, and the elements corresponding to the extra characters before the sentence in the mask matrix are not 0;
and training the feature extraction model to be trained based on the initial text sentence, the target text sentence and the mask matrix.
In a possible implementation manner, the performing a character masking process on the initial text sentence to obtain the initial text sentence after the character masking process includes:
randomly selecting characters with a preset proportion from the initial text sentence as reference characters;
and for each reference character, based on the selection probabilities respectively corresponding to a plurality of types of processing, selecting target processing corresponding to the reference character in the plurality of types of processing, and performing the target processing on the reference character to obtain the initial text sentence after the character covering processing, wherein the plurality of types of processing comprise at least one of processing of replacing a mask character, unchanging processing and processing of replacing any character.
In one possible implementation, in the mask matrix, the element corresponding to a mask character and a text character is 0, while the element corresponding to two text characters and the elements corresponding to the additional character before the sentence are 1, where the text characters are the characters in the target text sentence other than the mask characters and the additional characters.
In a possible implementation manner, the training the feature extraction model to be trained based on the initial text sentence, the target text sentence, and the mask matrix includes:
acquiring an actual ID of the reference character based on a corresponding relation between a pre-stored character and an ID (Identity);
inputting the target text sentence and the mask matrix into the feature extraction model to be trained to obtain feature information corresponding to each character in the target text sentence;
inputting the characteristic information corresponding to each character in the target text sentence into a softmax (normalization) module to be trained to obtain the prediction ID of each character in the target text sentence;
calculating a loss value based on the actual ID of the reference character and the predicted ID of the reference character;
training the feature extraction model to be trained and the softmax module to be trained based on the loss value.
In one possible implementation, the calculating a loss value based on the actual ID of the reference character and the predicted ID of the reference character includes:
for each reference character, calculating a cross entropy error value between the actual ID of the reference character and the predicted ID of the reference character;
and determining the average value of the cross entropy error values corresponding to all the reference characters as the loss value.
In a possible implementation manner, the obtaining a target text sentence based on the initial text sentence after the character masking processing and the additional characters before the sentence includes:
adding additional characters before the sentence before the initial text sentence after the character covering processing to obtain a reference text sentence;
determining the number of characters of the reference text sentence;
and if the number of the characters of the reference text sentence is less than the preset number of the characters, adding at least one additional character after the reference text sentence to obtain a target text sentence, wherein the number of the characters of the target text sentence is equal to the preset number of the characters.
In one possible implementation, in the mask matrix, the element corresponding to the additional character after the sentence is 0.
In a second aspect, there is provided a pre-training apparatus, the apparatus comprising:
the first determining module is used for carrying out character covering processing on the initial text sentence to obtain the initial text sentence after the character covering processing;
the second determining module is used for obtaining a target text sentence based on the initial text sentence after the character masking processing and the additional characters before the sentence;
a third determining module, configured to determine a mask matrix corresponding to the target text sentence, where the mask matrix includes multiple elements, each element is used to indicate, to a feature extraction model to be trained, an operation association degree of two characters corresponding to the element in the target text sentence in a feature extraction process, and an element corresponding to an additional character before the sentence in the mask matrix is not 0;
and the training module is used for training the feature extraction model to be trained on the basis of the initial text sentence, the target text sentence and the mask matrix.
In a possible implementation manner, the first determining module is configured to:
randomly selecting characters with a preset proportion from the initial text sentence as reference characters;
and for each reference character, based on the selection probabilities respectively corresponding to a plurality of types of processing, in the plurality of types of processing, selecting target processing corresponding to the reference character, and performing the target processing on the reference character to obtain the initial text sentence after the character covering processing, wherein the plurality of types of processing comprise at least one of processing of replacing mask characters, unchanging processing and processing of replacing any character.
In one possible implementation, in the mask matrix, the element corresponding to a mask character and a text character is 0, while the element corresponding to two text characters and the elements corresponding to the additional character before the sentence are 1, where the text characters are the characters in the target text sentence other than the mask characters and the additional characters.
In one possible implementation, the training module is configured to:
acquiring the actual ID of the reference character based on the corresponding relation between the pre-stored character and the ID;
inputting the target text sentence and the mask matrix into the feature extraction model to be trained to obtain feature information corresponding to each character in the target text sentence;
inputting the characteristic information corresponding to each character in the target text sentence into a softmax module to be trained to obtain the prediction ID of each character in the target text sentence;
calculating a loss value based on the actual ID of the reference character and the predicted ID of the reference character;
training the feature extraction model to be trained and the softmax module to be trained based on the loss value.
In one possible implementation, the training module is configured to:
for each reference character, calculating a cross entropy error value between the actual ID of the reference character and the predicted ID of the reference character;
and determining the average value of the cross entropy error values corresponding to all the reference characters as the loss value.
In a possible implementation manner, the second determining module is configured to:
adding additional characters before the sentence before the initial text sentence after the character covering processing to obtain a reference text sentence;
determining the number of characters of the reference text sentence;
and if the number of the characters of the reference text sentence is less than the preset number of the characters, adding at least one additional character after the reference text sentence to obtain a target text sentence, wherein the number of the characters of the target text sentence is equal to the preset number of the characters.
In one possible implementation, in the mask matrix, the element corresponding to the additional character after the sentence is 0.
In a third aspect, a computer device is provided that includes a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to perform an operation performed by a pre-training method.
In a fourth aspect, a computer-readable storage medium is provided that has at least one instruction stored therein, the instruction being loaded and executed by a processor to perform operations performed by a pre-training method.
The technical scheme provided by the embodiments of the application has the following beneficial effects: according to the scheme, character masking processing can be performed on the initial text sentence to obtain the masked initial text sentence, a target text sentence is then obtained based on the masked initial text sentence and the additional character before the sentence, a mask matrix corresponding to the target text sentence is determined, and the feature extraction model to be trained is trained based on the initial text sentence, the target text sentence, and the mask matrix. In this way, not only can the feature vector corresponding to each character in the target text sentence be obtained, but also the feature vector corresponding to the target text sentence itself, without additional training, thereby reducing data operation resources and operation time.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a pre-training method provided by an embodiment of the present application;
FIG. 2 is a flow chart of a pre-training method provided by an embodiment of the present application;
FIG. 3 is a flow chart for determining a loss value according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a mask matrix provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a mask matrix provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a pre-training apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a server according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the application provides a pre-training method, which can be implemented by a server. The server may be a single server or may be a server cluster composed of a plurality of servers.
The server may comprise a processor, a memory, a communication component, etc., to which the processor is connected, respectively.
The processor may be a Central Processing Unit (CPU). The processor may be configured to read the instruction and process the data, for example, perform character masking processing on the initial text sentence, determine the target text sentence, determine a mask matrix corresponding to the target text sentence, train a feature extraction model to be trained, and so on.
The Memory may include a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic disk, an optical data storage device, and the like. The memory may be used for data storage, for example, to store pre-stored data required for performing character masking processing on an initial text sentence, to store intermediate data generated during the character masking processing on the initial text sentence, to store an initial text sentence after the obtained character masking processing, to store intermediate data generated during the process of determining a target text sentence, to store a determined target text sentence, to store intermediate data generated during the process of determining a mask matrix corresponding to the target text sentence, to store the determined mask matrix, to store pre-stored data required for training a feature extraction model to be trained, and so on.
The communication means may be a wired network connector, a WiFi (Wireless Fidelity) module, a bluetooth module, a cellular network communication module, etc. The communication means may be used for receiving and transmitting signals.
Fig. 1 is a flowchart of a pre-training method according to an embodiment of the present disclosure. Referring to fig. 1, the embodiment includes:
101. and carrying out character masking processing on the initial text sentence to obtain the initial text sentence after the character masking processing.
In implementation, when the feature extraction model needs to be pre-trained, a training sample set may be obtained first, where the training sample set includes a plurality of different text sentences, and each time pre-training is performed, one text sentence is obtained from the training sample set and determined as an initial text sentence.
Then, the initial text sentence is subjected to character covering processing, and a part of characters in the initial text sentence are covered, so that the characters do not have any essential information. In the embodiment of the present application, the processing procedure of the character masking process may be as follows:
and randomly selecting characters with a preset proportion in the initial text sentence as reference characters. And for each reference character, based on the selection probabilities respectively corresponding to the multiple processes, selecting a target process corresponding to the reference character in the multiple processes, and performing the target process on the reference character to obtain an initial text sentence after the character masking process, wherein the multiple processes comprise at least one of a process of replacing a mask character, a process of not changing the mask character and a process of replacing any character.
In implementation, the preset ratio may be set in advance by a worker. When the initial text sentence needs to undergo character masking processing, characters in the preset proportion can be randomly selected from the initial text sentence and determined as reference characters. It is to be understood that, if the number of characters of the initial text sentence multiplied by the preset ratio is not an integer, the result may be rounded down or rounded up to obtain the number of reference characters. The specific value of the preset ratio may be any reasonable value, for example, 15% or 20%, which is not limited in the embodiments of the present application.
The staff can also preset a plurality of processing modes for the characters and the selection probability corresponding to each processing mode, and the selection probabilities of the plurality of processing modes are added to be 1. Then, for each reference character, based on the selection probabilities respectively corresponding to the multiple processes, one of the multiple processes is selected as a target process corresponding to the reference character, the reference character is subjected to the target process, and after the corresponding target process is performed on each reference character, the character covering process is completed, so that the initial text sentence after the character covering process is obtained.
The plurality of processes described above may include at least one of a process of replacing a mask character, an invariant process, and a process of replacing an arbitrary character. The process of replacing the mask character may be to replace the reference character with the mask character, the invariable process is to keep the content of the reference character unchanged, and the process of replacing the arbitrary character is to replace the reference character with an arbitrary character at random. It is understood that the characters in the embodiments of the present application include letters and punctuation marks.
For example, in the embodiment of the present application, a plurality of kinds of processing may be set as processing for replacing a mask character, invariant processing, and processing for replacing an arbitrary character, where the selection probability corresponding to the processing for replacing a mask character is 80%, the selection probability corresponding to the invariant processing is 10%, and the selection probability corresponding to the processing for replacing an arbitrary character is 10%.
The various processes described above may include other processes in addition to the process of replacing a mask character, the process of not changing, and the process of replacing an arbitrary character, and this is not limited in the embodiment of the present application. The setting of the selection probability corresponding to each kind of processing can be any reasonable value, and the sum of the selection probabilities of the various kinds of processing is only required to be 1.
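As an illustration of this character masking processing, the following is a minimal Python sketch assuming the example values above (a 15% reference-character proportion and 80%/10%/10% selection probabilities); MASK_TOKEN and VOCABULARY are placeholder names introduced only for this sketch.

```python
import random

# Illustrative values taken from the description above; MASK_TOKEN and
# VOCABULARY stand in for whatever mask token and character inventory an
# actual implementation uses.
MASK_TOKEN = "[MASK]"
VOCABULARY = list("今天天气真好早晚上雨")

def mask_sentence(sentence, ratio=0.15):
    """Return (masked characters, reference-character positions) for one initial text sentence."""
    chars = list(sentence)
    n_ref = max(1, round(len(chars) * ratio))                  # number of reference characters
    reference_positions = random.sample(range(len(chars)), n_ref)
    for pos in reference_positions:
        r = random.random()
        if r < 0.8:
            chars[pos] = MASK_TOKEN                            # processing of replacing a mask character
        elif r < 0.9:
            pass                                               # unchanged processing
        else:
            chars[pos] = random.choice(VOCABULARY)             # processing of replacing any character
    return chars, reference_positions

masked_chars, reference_positions = mask_sentence("今天天气真好")
```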
102. And obtaining the target text sentence based on the initial text sentence after the character masking processing and the additional characters before the sentence.
In practice, the target text sentence may be obtained by adding an additional character before the sentence to the initial text sentence after the character masking processing. The additional character before the sentence is commonly a CLS character, although other characters may also be used, which is not limited in the embodiments of the present application.
103. And determining a mask matrix corresponding to the target text sentence.
The mask matrix comprises a plurality of elements, each element is used for indicating the operation association degree of two characters corresponding to the elements in the target text sentence in the feature extraction process to the feature extraction model to be trained, and the elements corresponding to the additional characters before the sentence in the mask matrix are not 0.
In implementation, after the target text sentence is determined, a corresponding mask matrix may be set according to the training target. The mask matrix comprises a plurality of elements, each element corresponds to two characters in the target text sentence, and each element is used to indicate to the feature extraction model to be trained the degree of operational association, during feature extraction, between the two characters to which it corresponds.
Generally, after a text sentence is input into the feature extraction model, the feature extraction model outputs feature information corresponding to each character in the text sentence. When calculating the feature information corresponding to a character in the target text sentence, the feature extraction model does not extract features from that character in isolation; it also takes the meaning of the character within the target text sentence into account. In other words, the feature information output for a character should reflect the meaning of that character in the target text sentence, which requires that, when calculating the feature information for one character, the feature extraction model combine that character with other characters. Each element in the mask matrix in the embodiment of the present application therefore indicates to the feature extraction model to be trained whether, when calculating the feature information corresponding to one of its two characters, the content of the other character may be taken into account.
The following two cases may be included in the mask matrix:
first, if the target text sentence includes a character a and a character B, and the element corresponding to the character a and the character B is 0, after the target text sentence and the mask matrix are input into the feature extraction model to be trained, the content of the character B will not be synthesized when the feature extraction model to be trained calculates the feature information corresponding to the character a, and the content of the character a will not be synthesized when the feature information corresponding to the character B is calculated.
Secondly, if the target text sentence includes characters C and D, and the elements corresponding to the characters C and D are not 0, after the target text sentence and the mask matrix are input into the feature extraction model to be trained, the feature extraction model to be trained will comprehensively consider the contents of the characters C and D when calculating the feature information corresponding to the characters C, and will also comprehensively consider the contents of the characters C and D when calculating the feature information corresponding to the characters D.
In the embodiment of the application, the elements corresponding to the additional character before the sentence in the mask matrix are not 0. As a result, after the target text sentence and the mask matrix are input into the feature extraction model to be trained, the model takes the content of the other characters in the target text sentence into account when calculating the feature information corresponding to the additional character before the sentence. The feature information output for the additional character before the sentence therefore reflects the meaning of each character in the target text sentence, that is, the meaning of the whole sentence, and can be used as the feature information corresponding to the target text sentence.
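To make the role of the mask matrix elements concrete, the following sketch shows one common way such a matrix can gate a scaled dot-product self-attention step, with a 0 element preventing the two corresponding characters from being combined; this is an assumed illustration, not the specific internal computation mandated for the feature extraction model.

```python
import numpy as np

def masked_self_attention(q, k, v, mask_matrix):
    """Toy scaled dot-product self-attention gated by the mask matrix.
    q, k, v: (seq_len, dim) arrays of per-character features;
    mask_matrix: (seq_len, seq_len) array of 0/1 elements."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask_matrix == 0, -1e9, scores)      # a 0 element: the two characters are not combined
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                      # feature information for each character
```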
104. And training the feature extraction model to be trained based on the initial text sentence, the target text sentence and the mask matrix.
The feature extraction model to be trained may be a BERT model, a roberta model, an xlnet model, or the like, which is not limited in the embodiment of the present application.
In implementation, the target text sentence and the mask matrix may be input into the feature extraction model to be trained, the feature extraction to be trained may output feature information corresponding to each character in the predicted target text sentence, and then the feature extraction model to be trained may be trained based on the initial text sentence and the output feature information corresponding to each character.
In the following, a method for training a feature extraction model to be trained in the embodiment of the present application is described in more detail, as shown in fig. 2 and fig. 3, the processing steps thereof correspond to the following steps:
1041. the actual ID of the reference character is acquired based on the correspondence between the characters and IDs stored in advance.
In implementation, a worker may store in advance a plurality of corresponding relationships between characters and IDs in a database, that is, each character corresponds to a unique ID, and the ID may represent a serial number of the character corresponding to the ID in all characters in the database, for example, 10 characters are stored in the database, then the ID of the third character is 0010000000, and the ID of the seventh character is 0000001000. Of course, the characters stored in the database are tens of thousands or even more, and this example is merely illustrative.
The actual ID of each character in the target text sentence may be obtained in the database based on the correspondence between the characters and IDs stored in the database.
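A small sketch of this correspondence, assuming a hypothetical database of 10 characters and representing the actual ID as a one-hot vector, mirroring the example above.

```python
# Hypothetical database of 10 characters; real databases hold tens of
# thousands of characters or more.
database = list("abcdefghij")
char_to_index = {c: i for i, c in enumerate(database)}

def actual_id(char):
    """Return the actual ID of a character as a one-hot vector over the database."""
    one_hot = [0] * len(database)
    one_hot[char_to_index[char]] = 1
    return one_hot

actual_id("c")   # third character   -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
actual_id("g")   # seventh character -> [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```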
1042. And inputting the target text sentence and the mask matrix into the feature extraction model to be trained to obtain feature information corresponding to each character in the target text sentence.
1043. And inputting the characteristic information corresponding to each character in the target text sentence into a softmax module to be trained to obtain the prediction ID of each character in the target text sentence.
In implementation, after the feature information corresponding to each character in the target text sentence is obtained, the feature information is input into the softmax module to be trained, so that the prediction ID of each character in the target text sentence can be obtained, the number of numerical values contained in the prediction ID is the same as the number of characters stored in the database, and each numerical value in the prediction ID represents the predicted probability of the character at the position. Taking 10 characters in the database as an example, if the prediction ID of one character is (0.01,0.01,0.01,0.01,0.01,0.01,0.9,0.01,0.02,0.01), it indicates that the probability that the model predicts that the character is the seventh character in the database is 0.9, the probability that the character is the ninth character is 0.02, and the probabilities that the characters are the other characters are all 0.01.
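A sketch of the softmax module under the assumption that it is a linear projection followed by a softmax over the characters stored in the database; w and b are hypothetical trainable parameters.

```python
import numpy as np

def predicted_ids(feature_info, w, b):
    """Map the feature information of each character (seq_len, dim) to a
    predicted ID: a probability over the stored characters (seq_len, vocab_size).
    w (dim, vocab_size) and b (vocab_size,) are hypothetical trainable parameters."""
    logits = feature_info @ w + b
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)            # each row sums to 1
```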
1044. A loss value is calculated based on the actual ID of the reference character and the predicted ID of the reference character.
In practice, since the initial text sentence has undergone character masking processing, a reference character in the target text sentence may be an incorrect character or a mask character. Therefore, when the feature extraction model calculates the feature vector corresponding to a reference character, it needs to infer the meaning of the whole sentence from the other characters in order to predict whether the reference character is correct and, if not, to predict the correct character and the feature vector of the correct character. For example, if the initial text sentence is "today's weather is really good", after character masking processing the third character "day" is replaced by a mask character and the fifth character "true" is replaced by "me". When the feature extraction model is trained, the training target is that the feature extraction model can predict, from the input target text sentence, the feature vectors corresponding to the correct characters, i.e., those of "today's weather is really good". Therefore, when subsequently training the feature extraction model to be trained, the reference characters can be used as labels.
Therefore, after obtaining the actual ID and the predicted ID of each character, the actual ID and the predicted ID of the reference character can be input to the loss function to calculate the loss value. In the embodiment of the present application, the loss function may be an MLM (Masked Language Model) loss function.
The corresponding process of calculating the loss value may be: for each reference character, a cross entropy error value between the actual ID of the reference character and the predicted ID of the reference character is calculated. And determining the average value of the cross entropy error values corresponding to all the reference characters as a loss value.
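A sketch of this loss calculation, assuming each actual ID is given by the index of the character in the database so that the cross entropy against a one-hot target reduces to the negative log of the predicted probability at that index.

```python
import numpy as np

def mlm_loss(predicted, actual_indices, reference_positions):
    """Cross entropy between predicted IDs and actual IDs at the reference
    characters only, averaged over all reference characters.
    predicted: (seq_len, vocab_size) probabilities;
    actual_indices: the database index of the true character at each position."""
    errors = []
    for pos in reference_positions:
        p_true = predicted[pos, actual_indices[pos]]
        errors.append(-np.log(p_true + 1e-12))              # cross entropy against a one-hot actual ID
    return float(np.mean(errors))
```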
1045. And training the feature extraction model to be trained and the softmax module to be trained on the basis of the loss value.
In implementation, the feature extraction model to be trained and the softmax module to be trained can be trained based on the calculated loss value, adjusting the parameters in both so that the predicted IDs output by the softmax module become more accurate and closer to the actual IDs, thereby obtaining a feature extraction model with accurate output.
The feature extraction model to be trained and the softmax module to be trained are trained multiple times using the text sentences in the training sample set, and training stops when a preset termination condition is reached. The preset termination condition can take various forms, three of which are as follows:
first, the training times of the feature extraction model to be trained and the softmax module to be trained reach a training time threshold value by using different text sentences. In implementation, a training time threshold value can be preset by a worker, when the training time reaches the training time threshold value, the training can be stopped, and the feature extraction model obtained after the last training is determined as the trained feature extraction model. The training time threshold may be any reasonable value, for example, 200, 300, or the like, which is not limited in this embodiment of the application.
Secondly, the loss values obtained by continuous training for a preset number of times are all smaller than a preset loss value threshold value. The preset number and the preset loss value threshold may be any reasonable values, for example, the preset number may be 3 or 5, and the like, and the preset loss value threshold may be 0.05, and the like, which is not limited in this embodiment of the application.
Thirdly, the training times reach the threshold value of the training times, and the loss values obtained by continuous training for the preset number of times are all smaller than the threshold value of the preset loss value.
The preset termination condition may be any one of the three conditions, or may be another termination condition, which is not limited in this embodiment of the application.
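A sketch of how such termination conditions might be checked; the numeric values are illustrative choices, not values fixed by this description.

```python
def should_stop(num_trainings, recent_losses, max_trainings=300,
                patience=3, loss_threshold=0.05):
    """Check the termination conditions described above; the thresholds are
    example values chosen for illustration."""
    reached_count = num_trainings >= max_trainings                            # first condition
    converged = (len(recent_losses) >= patience and
                 all(l < loss_threshold for l in recent_losses[-patience:]))  # second condition
    return reached_count or converged      # the third condition would require both instead
```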
Since the text sentences in the sample training set do not all contain the same number of characters, when the target text sentence is determined in step 102, additional characters after the sentence may be appended so that the number of characters contained in the target text sentence reaches a preset number of characters, thereby ensuring that the data input into the feature extraction model to be trained has the same size for every text sentence. The corresponding processing may be as follows:
and adding additional characters before the sentence before the initial text sentence after the character covering processing to obtain a reference text sentence. The number of characters of the reference text sentence is determined. And if the number of the characters of the reference text sentence is less than the preset number of the characters, adding at least one additional character after the sentence after the reference text sentence to obtain a target text sentence, wherein the number of the characters of the target text sentence is equal to the preset number of the characters.
In practice, the preset number of characters may be set in advance by a worker. An additional character before the sentence is added before the initial text sentence after the character masking processing to obtain a reference text sentence, and the number of characters contained in the reference text sentence is then calculated. If that number is less than the preset number of characters, at least one additional character after the sentence is appended to the reference text sentence to obtain the target text sentence, so that the number of characters contained in the target text sentence equals the preset number of characters. In this way, regardless of how many characters the initial text sentence selected from the sample training set contains, the number of characters contained in the final target text sentence is fixed, i.e., equal to the preset number of characters, and the size of the data input into the feature extraction model to be trained is the same each time.
The additional character after the sentence is usually a pad character, but of course, other characters may also be used, which is not limited in the embodiment of the present application.
The specific numerical value of the preset number of characters may be any reasonable numerical value, and in order to adapt to various text sentences, the preset number of characters may be set to be larger, for example, 128, 200, and the like, which is not limited in this embodiment of the application.
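A sketch of how the target text sentence can be assembled from the masked initial text sentence, an additional character before the sentence, and additional characters after the sentence up to the preset number of characters; the cls/pad token names follow the conventions mentioned above and are not mandated by this description.

```python
def build_target_sentence(masked_chars, preset_length=128,
                          cls_token="[CLS]", pad_token="[PAD]"):
    """Prepend the additional character before the sentence, then append
    additional characters after the sentence until the preset number of
    characters is reached."""
    reference_sentence = [cls_token] + list(masked_chars)    # reference text sentence
    padding = max(0, preset_length - len(reference_sentence))
    return reference_sentence + [pad_token] * padding        # target text sentence
```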
For the above case of adding additional characters after the sentence to the reference text sentence, when the mask matrix is determined in step 103, the following setting may be made for the additional characters after the sentence: in the mask matrix, the elements corresponding to the additional characters after the sentence are 0. In implementation, the additional characters after the sentence are only used to make the number of characters contained in the target text sentence a fixed value and have no literal meaning of their own; therefore, after the target text sentence is input into the feature extraction model to be trained, the feature extraction model does not need to perform feature extraction operations on them, and the elements corresponding to the additional characters after the sentence are set to 0 in the mask matrix.
The setting of the mask matrix determined in step 103 above is described in more detail below:
in the embodiment of the present application, the mask matrix may have multiple setting modes, two of which are as follows:
first, in the mask matrix, the element corresponding to the extra character preceding the sentence, the element corresponding to the mask character, and the element corresponding to the text character are all 1.
Wherein the text characters are the characters in the target text sentence other than the mask characters and the additional characters. The additional characters include the additional character before the sentence and the additional characters after the sentence, that is: when the target text sentence contains no additional character after the sentence, the text characters are the characters other than the mask characters and the additional character before the sentence; when the target text sentence contains additional characters after the sentence, the text characters are the characters other than the mask characters, the additional character before the sentence, and the additional characters after the sentence. For example, if the target text sentence is "cls today mask good pad", the first character is the additional character before the sentence, the second, third, fifth, and seventh characters are text characters, the fourth and sixth characters are mask characters, and the eighth character is the additional character after the sentence.
In implementation, all elements corresponding to other characters except for the additional character after the sentence may be set to 1, so that the feature vector corresponding to each character may include the meaning of the character itself and may also include the meaning of the whole sentence of the target text sentence, and for the feature vector corresponding to the additional character before the sentence, since the additional character before the sentence does not have the meaning of itself, the feature vector corresponding to the additional character before the sentence may be directly used as the feature vector corresponding to the whole sentence of the target text sentence.
By the method, the characteristic vector corresponding to each character in the initial text sentence can be predicted through the mask character and the text character, and the characteristic vector corresponding to the whole sentence of the initial text sentence can be predicted through the additional character before the sentence.
It is to be understood that the elements corresponding to the extra characters before the sentence do not include elements corresponding to the extra characters before the sentence and the extra characters after the sentence, the elements corresponding to the mask characters do not include elements corresponding to the mask characters and the extra characters after the sentence, and the elements corresponding to the text characters do not include elements corresponding to the text characters and the extra characters after the sentence, as shown in fig. 4.
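A sketch of this first setting of the mask matrix: every element whose two characters are both non-pad characters is 1, and every element involving an additional character after the sentence is 0, as described above.

```python
import numpy as np

def mask_matrix_setting_one(target_sentence, pad_token="[PAD]"):
    """First setting: elements between any two non-pad characters are 1;
    elements involving an additional character after the sentence are 0."""
    not_pad = np.array([c != pad_token for c in target_sentence])
    return np.outer(not_pad, not_pad).astype(int)
```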
Second, in the mask matrix, the element corresponding to the mask character and the text character is 0, the element corresponding to the text character and the additional character preceding the sentence is 1.
Wherein the text characters are other characters in the target text sentence except for the mask character and the additional character. Similarly, the additional characters herein also include an additional character before a sentence and an additional character after the sentence, and are not described herein again.
In implementation, the elements corresponding to a mask character and a text character may be set to 0. When the feature extraction model to be trained predicts the feature vector corresponding to a mask character, since the element corresponding to the mask character and a text character is 0 while the element corresponding to the mask character and the additional character before the sentence is 1, the model must predict the correct character of the mask character from the additional character before the sentence. The additional character before the sentence has no meaning of its own, but because all of its corresponding elements are 1, its feature vector is calculated based on all of the other characters. Therefore, when the model predicts the correct character of a mask character, the additional character before the sentence is forced to absorb the meaning of all characters, i.e., to contain the meaning of the whole sentence, and the correct character corresponding to the mask character and its correct feature vector are predicted through the additional character before the sentence.
By the method, the characteristic vector corresponding to each character in the initial text sentence can be predicted through the mask character and the text character, and the characteristic vector corresponding to the initial text sentence which is more accurately predicted can be obtained, namely the characteristic vector corresponding to the additional character before the sentence.
It is to be understood that the above-described elements corresponding to the extra characters before the sentence do not include elements corresponding to the extra characters before the sentence and the extra characters after the sentence, as shown in fig. 5.
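A sketch of this second setting: elements between a mask character and a text character are 0, elements between text characters and all elements involving the additional character before the sentence are 1, and elements involving additional characters after the sentence are 0; elements between two mask characters are not specified above and are set to 1 here as an assumption.

```python
import numpy as np

def mask_matrix_setting_two(target_sentence, cls_token="[CLS]",
                            mask_token="[MASK]", pad_token="[PAD]"):
    """Second setting: mask-to-text elements are 0; text-to-text elements and
    all elements involving the additional character before the sentence are 1;
    elements involving an additional character after the sentence are 0.
    Mask-to-mask elements are set to 1 here as an assumption."""
    kinds = ["cls" if c == cls_token else
             "mask" if c == mask_token else
             "pad" if c == pad_token else "text" for c in target_sentence]
    n = len(kinds)
    m = np.ones((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if kinds[i] == "pad" or kinds[j] == "pad":
                m[i, j] = 0
            elif {kinds[i], kinds[j]} == {"mask", "text"}:
                m[i, j] = 0
    return m
```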
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described in detail herein.
The technical scheme provided by the embodiments of the application has the following beneficial effects: according to the scheme, character masking processing can be performed on the initial text sentence to obtain the masked initial text sentence, a target text sentence is then obtained based on the masked initial text sentence and the additional character before the sentence, a mask matrix corresponding to the target text sentence is determined, and the feature extraction model to be trained is trained based on the initial text sentence, the target text sentence, and the mask matrix. In this way, not only can the feature vector corresponding to each character in the target text sentence be obtained, but also the feature vector corresponding to the target text sentence itself, without additional training, thereby reducing data operation resources and operation time.
An embodiment of the present application provides a pre-training apparatus, which may be a computer device in the foregoing embodiment, as shown in fig. 6, the apparatus includes:
a first determining module 610, configured to perform character masking processing on the initial text sentence to obtain an initial text sentence after the character masking processing;
a second determining module 620, configured to obtain a target text sentence based on the initial text sentence after the character masking processing and the additional character before the sentence;
a third determining module 630, configured to determine a mask matrix corresponding to the target text sentence, where the mask matrix includes multiple elements, each element is used to indicate, to a feature extraction model to be trained, an operation association degree of two characters corresponding to the element in the target text sentence in a feature extraction process, and an element corresponding to an additional character before the sentence in the mask matrix is not 0;
and the training module 640 is configured to train the feature extraction model to be trained based on the initial text sentence, the target text sentence, and the mask matrix.
In a possible implementation manner, the first determining module 610 is configured to:
randomly selecting characters with a preset proportion from the initial text sentence as reference characters;
and for each reference character, based on the selection probabilities respectively corresponding to a plurality of types of processing, selecting target processing corresponding to the reference character in the plurality of types of processing, and performing the target processing on the reference character to obtain the initial text sentence after the character covering processing, wherein the plurality of types of processing comprise at least one of processing of replacing a mask character, unchanging processing and processing of replacing any character.
In one possible implementation, in the mask matrix, the element corresponding to a mask character and a text character is 0, while the element corresponding to two text characters and the elements corresponding to the additional character before the sentence are 1, where the text characters are the characters in the target text sentence other than the mask characters and the additional characters.
In one possible implementation manner, the training module 640 is configured to:
acquiring the actual ID of the reference character based on the corresponding relation between the pre-stored character and the ID;
inputting the target text sentence and the mask matrix into the feature extraction model to be trained to obtain feature information corresponding to each character in the target text sentence;
inputting the characteristic information corresponding to each character in the target text sentence into a softmax module to be trained to obtain the prediction ID of each character in the target text sentence;
calculating a loss value based on the actual ID of the reference character and the predicted ID of the reference character;
training the feature extraction model to be trained and the softmax module to be trained based on the loss value.
In one possible implementation manner, the training module 640 is configured to:
for each reference character, calculating a cross entropy error value between the actual ID of the reference character and the predicted ID of the reference character;
and determining the average value of the cross entropy error values corresponding to all the reference characters as the loss value.
In a possible implementation manner, the second determining module 620 is configured to:
adding an additional character before the sentence before the initial text sentence after the character covering processing to obtain a reference text sentence;
determining the number of characters of the reference text sentence;
and if the number of the characters of the reference text sentence is less than the preset number of the characters, adding at least one additional character after the reference text sentence to obtain a target text sentence, wherein the number of the characters of the target text sentence is equal to the preset number of the characters.
In one possible implementation, in the mask matrix, an element corresponding to an additional character after the sentence is 0.
It should be noted that: in the pre-training apparatus provided in the above embodiment, only the division of the above functional modules is used for illustration during pre-training, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the pre-training apparatus and the pre-training method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Fig. 7 is a schematic structural diagram of a server 700 according to an embodiment of the present application, where the server 700 may generate a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 701 and one or more memories 702, where the memory 702 stores at least one instruction, and the at least one instruction is loaded and executed by the processors 701 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, is also provided that includes instructions executable by a processor in a terminal to perform the pre-training method of the above embodiments. The computer readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a ROM (read-only memory), a RAM (random access memory), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of pre-training, the method comprising:
carrying out character masking processing on an initial text sentence to obtain the initial text sentence after the character masking processing;
obtaining a target text sentence based on the initial text sentence after the character masking processing and a pre-sentence additional character;
determining a mask matrix corresponding to the target text sentence, wherein the mask matrix comprises a plurality of elements, each element is used to indicate, to a feature extraction model to be trained, the degree of computational association during feature extraction between the two characters of the target text sentence corresponding to that element, and the elements corresponding to the pre-sentence additional character in the mask matrix are not 0;
inputting the target text sentence and the mask matrix into the feature extraction model to be trained to obtain feature information corresponding to each character in the target text sentence, wherein the feature information corresponding to the pre-sentence additional character is used to represent the feature information corresponding to the target text sentence;
and training the feature extraction model to be trained based on the initial text sentence and the feature information corresponding to each character in the target text sentence.
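By way of illustration of claim 1, the sketch below shows one way a 0/1 mask matrix can gate self-attention so that an element of 0 prevents the corresponding character pair from interacting during feature extraction, while the pre-sentence additional character keeps non-zero elements and its output can stand in for the whole sentence. The Transformer-style attention computation, the names masked_self_attention and mask_matrix, and the toy sizes are illustrative assumptions introduced here, not taken from the patent.

```python
# Minimal sketch (not the patent's implementation): a 0/1 mask matrix gating
# self-attention over character embeddings.
import numpy as np

def masked_self_attention(x, mask_matrix):
    """x: (seq_len, dim) character embeddings; mask_matrix: (seq_len, seq_len) of 0/1."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                       # pairwise association scores
    scores = np.where(mask_matrix == 0, -1e9, scores)   # a 0 element blocks that pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x                                   # feature information per character

rng = np.random.default_rng(0)
seq_len, dim = 6, 8
x = rng.normal(size=(seq_len, dim))
mask_matrix = np.ones((seq_len, seq_len), dtype=int)
mask_matrix[3, :] = mask_matrix[:, 3] = 0   # e.g. isolate one masked character
mask_matrix[0, :] = mask_matrix[:, 0] = 1   # pre-sentence character elements stay non-zero
features = masked_self_attention(x, mask_matrix)
print(features[0])   # feature information of the pre-sentence character (sentence-level)
```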
2. The method of claim 1, wherein performing the character masking processing on the initial text sentence to obtain the initial text sentence after the character masking processing comprises:
randomly selecting a preset proportion of characters from the initial text sentence as reference characters;
and for each reference character, selecting, based on the selection probabilities respectively corresponding to a plurality of types of processing, the target processing corresponding to the reference character from the plurality of types of processing, and performing the target processing on the reference character to obtain the initial text sentence after the character masking processing, wherein the plurality of types of processing comprise at least one of replacement with a mask character, keeping the character unchanged, and replacement with an arbitrary character.
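The masking step of claim 2 resembles standard masked-language-model corruption. The sketch below assumes a 15% reference-character ratio and 0.8/0.1/0.1 selection probabilities for mask-replacement, keeping, and random replacement; neither these values nor the identifiers are fixed by the claim.

```python
# Minimal sketch of selecting reference characters and applying one of several
# processing types to each, per assumed selection probabilities.
import random

MASK = "[MASK]"  # stand-in for the mask character; the patent does not name it

def mask_sentence(chars, vocab, ratio=0.15, probs=(0.8, 0.1, 0.1), seed=0):
    rng = random.Random(seed)
    n_ref = max(1, int(len(chars) * ratio))
    ref_positions = rng.sample(range(len(chars)), n_ref)   # reference characters
    masked = list(chars)
    for pos in ref_positions:
        choice = rng.choices(["mask", "keep", "random"], weights=probs)[0]
        if choice == "mask":
            masked[pos] = MASK               # replacement with a mask character
        elif choice == "random":
            masked[pos] = rng.choice(vocab)  # replacement with an arbitrary character
        # "keep" leaves the reference character unchanged
    return masked, ref_positions

chars = list("a sample sentence")
masked, refs = mask_sentence(chars, vocab=list("abcdefghijklmnopqrstuvwxyz"))
print(masked, refs)
```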
3. The method of claim 2, wherein, in the mask matrix, the element corresponding to a mask character and a text character is 0, and the element corresponding to a text character and another text character, as well as the element corresponding to a text character and the pre-sentence additional character, are 1, where a text character is a character in the target text sentence other than the mask characters and the additional characters.
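Under one reading of claim 3 (and of claim 7 below for post-sentence padding characters), the mask matrix could be built as follows. The token-type labels "cls", "text", "mask" and "pad" are stand-ins introduced here for the pre-sentence additional character, ordinary text characters, mask characters and post-sentence additional characters; they are not terms from the patent.

```python
# Minimal sketch of one possible 0/1 mask-matrix construction.
import numpy as np

def build_mask_matrix(token_types):
    """token_types[i] in {"cls", "text", "mask", "pad"}; returns an (n, n) 0/1 matrix."""
    n = len(token_types)
    m = np.zeros((n, n), dtype=int)
    for i, ti in enumerate(token_types):
        for j, tj in enumerate(token_types):
            if "pad" in (ti, tj):
                m[i, j] = 0          # claim 7: post-sentence additional-character elements are 0
            elif "cls" in (ti, tj):
                m[i, j] = 1          # claim 1: pre-sentence additional-character elements are not 0
            elif ti == "text" and tj == "text":
                m[i, j] = 1          # text/text elements are 1
            elif "mask" in (ti, tj):
                m[i, j] = 0          # mask/text elements are 0
    return m

print(build_mask_matrix(["cls", "text", "mask", "text", "pad"]))
```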
4. The method of claim 2, wherein training the feature extraction model to be trained based on the initial text sentence and the feature information corresponding to each character in the target text sentence comprises:
acquiring the actual ID of each reference character based on a pre-stored correspondence between characters and identification IDs;
inputting the feature information corresponding to each character in the target text sentence into a softmax normalization module to be trained to obtain a predicted ID of each character in the target text sentence;
calculating a loss value based on the actual ID of the reference character and the predicted ID of the reference character;
training the feature extraction model to be trained and the softmax module to be trained based on the loss value.
5. The method of claim 4, wherein calculating a loss value based on the actual ID of the reference character and the predicted ID of the reference character comprises:
for each reference character, calculating a cross entropy error value between the actual ID of the reference character and the predicted ID of the reference character;
and determining the average value of the cross entropy error values corresponding to all the reference characters as the loss value.
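Claims 4 and 5 together describe an ordinary masked-prediction objective: project each character's feature information to vocabulary logits, apply softmax, and average the cross-entropy error over the reference characters only. The numpy sketch below assumes a single linear projection W and toy dimensions; it is an illustration under those assumptions, not the patent's implementation.

```python
# Minimal sketch: cross-entropy between predicted character distributions and
# actual IDs, averaged over reference characters.
import numpy as np

def masked_lm_loss(features, W, actual_ids, ref_positions):
    logits = features @ W                                  # (seq_len, vocab_size)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)   # softmax
    errors = [-np.log(probs[pos, actual_ids[pos]] + 1e-12) for pos in ref_positions]
    return float(np.mean(errors))                          # claim 5: average over reference characters

rng = np.random.default_rng(0)
seq_len, dim, vocab = 6, 8, 30
features = rng.normal(size=(seq_len, dim))
W = rng.normal(size=(dim, vocab))
actual_ids = rng.integers(0, vocab, size=seq_len)
print(masked_lm_loss(features, W, actual_ids, ref_positions=[2, 4]))
```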
6. The method according to any one of claims 1-5, wherein obtaining the target text sentence based on the initial text sentence after the character masking processing and the pre-sentence additional character comprises:
adding the pre-sentence additional character before the initial text sentence after the character masking processing to obtain a reference text sentence;
determining the number of characters of the reference text sentence;
and if the number of characters of the reference text sentence is less than a preset number of characters, adding at least one post-sentence additional character after the reference text sentence to obtain the target text sentence, wherein the number of characters of the target text sentence is equal to the preset number of characters.
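A minimal sketch of claim 6 follows, assuming conventional "[CLS]"-like and "[PAD]"-like stand-ins for the pre-sentence and post-sentence additional characters; the patent does not name these tokens.

```python
# Minimal sketch: prepend the pre-sentence additional character and pad the
# character-masked sentence up to a preset number of characters.
CLS, PAD = "[CLS]", "[PAD]"

def build_target_sentence(masked_chars, preset_len=16):
    reference_sentence = [CLS] + list(masked_chars)    # pre-sentence additional character first
    if len(reference_sentence) < preset_len:
        reference_sentence += [PAD] * (preset_len - len(reference_sentence))
    return reference_sentence

print(build_target_sentence(["a", "[MASK]", "c", "a", "t"]))
```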
7. The method of claim 6, wherein the element corresponding to the post-sentence additional character in the mask matrix is 0.
8. A pre-training apparatus, the apparatus comprising:
a first determining module, configured to carry out character masking processing on an initial text sentence to obtain the initial text sentence after the character masking processing;
a second determining module, configured to obtain a target text sentence based on the initial text sentence after the character masking processing and a pre-sentence additional character;
a third determining module, configured to determine a mask matrix corresponding to the target text sentence, wherein the mask matrix comprises a plurality of elements, each element is used to indicate, to a feature extraction model to be trained, the degree of computational association during feature extraction between the two characters of the target text sentence corresponding to that element, and the elements corresponding to the pre-sentence additional character in the mask matrix are not 0;
and a training module, configured to input the target text sentence and the mask matrix into the feature extraction model to be trained to obtain feature information corresponding to each character in the target text sentence, wherein the feature information corresponding to the pre-sentence additional character is used to represent the feature information corresponding to the target text sentence; and to train the feature extraction model to be trained based on the initial text sentence and the feature information corresponding to each character in the target text sentence.
9. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to perform operations performed by the pre-training method of any of claims 1-7.
10. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to perform operations performed by the pre-training method of any one of claims 1 to 7.
CN202111505109.7A 2021-12-10 2021-12-10 Pre-training method, device, equipment and storage medium Active CN114186043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111505109.7A CN114186043B (en) 2021-12-10 2021-12-10 Pre-training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114186043A CN114186043A (en) 2022-03-15
CN114186043B (en) 2022-10-21

Family

ID=80604257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111505109.7A Active CN114186043B (en) 2021-12-10 2021-12-10 Pre-training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114186043B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10831997B2 (en) * 2018-08-09 2020-11-10 CloudMinds Technology, Inc. Intent classification method and system
CN110489555B (en) * 2019-08-21 2022-03-08 创新工场(广州)人工智能研究有限公司 Language model pre-training method combined with similar word information
CN110781306B (en) * 2019-10-31 2022-06-28 山东师范大学 English text aspect layer emotion classification method and system
CN111831814B (en) * 2020-06-04 2023-06-23 北京百度网讯科技有限公司 Pre-training method and device for abstract generation model, electronic equipment and storage medium
CN112182231B (en) * 2020-12-01 2021-03-09 佰聆数据股份有限公司 Text processing method, system and storage medium based on sentence vector pre-training model
CN112784585A (en) * 2021-02-07 2021-05-11 新华智云科技有限公司 Abstract extraction method and terminal for financial bulletin
CN113505200B (en) * 2021-07-15 2023-11-24 河海大学 Sentence-level Chinese event detection method combined with document key information
CN113536735B (en) * 2021-09-17 2021-12-31 杭州费尔斯通科技有限公司 Text marking method, system and storage medium based on keywords

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183058A (en) * 2020-09-22 2021-01-05 甘肃农业大学 Poetry generation method and device based on BERT sentence vector input
CN113343683A (en) * 2021-06-18 2021-09-03 山东大学 Chinese new word discovery method and device integrating self-encoder and countertraining
CN113626553A (en) * 2021-07-15 2021-11-09 人民网股份有限公司 Cascade binary Chinese entity relation extraction method based on pre-training model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Investigating of Disease Name Normalization Using Neural Network and Pre-Training; Yinxia Lou et al.; IEEE Access; 2020-05-04; Vol. 8; 85729-85739 *
A Survey of Pre-training Techniques Based on Language Models (in Chinese); Yue Zengying; Journal of Chinese Information Processing; 2021-09-15; Vol. 35, No. 9; 15-29 *

Also Published As

Publication number Publication date
CN114186043A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN109783785B (en) Method and device for generating experiment detection report and computer equipment
CN112270686B (en) Image segmentation model training method, image segmentation device and electronic equipment
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN113656547A (en) Text matching method, device, equipment and storage medium
CN110968664A (en) Document retrieval method, device, equipment and medium
CN114692889A (en) Meta-feature training model for machine learning algorithm
CN112712121A (en) Image recognition model training method and device based on deep neural network and storage medium
CN114186043B (en) Pre-training method, device, equipment and storage medium
CN112395880A (en) Error correction method and device for structured triples, computer equipment and storage medium
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN114491093B (en) Multimedia resource recommendation and object representation network generation method and device
CN115756821A (en) Online task processing model training and task processing method and device
CN112799658B (en) Model training method, model training platform, electronic device, and storage medium
CN113010687B (en) Exercise label prediction method and device, storage medium and computer equipment
CN115017886A (en) Text matching method, text matching device, electronic equipment and storage medium
CN112989040A (en) Dialog text labeling method and device, electronic equipment and storage medium
CN114048392B (en) Multimedia resource pushing method and device, electronic equipment and storage medium
CN117056836B (en) Program classification model training and program category identification method and device
CN116778264B (en) Object classification method, image classification method and related equipment based on class reinforcement learning
CN114841471B (en) Knowledge point prediction method and device, electronic equipment and storage medium
CN112487079A (en) Page big data analysis method based on cloud computing and block chain financial service center
CN117216206A (en) Session processing method and device, electronic equipment and storage medium
CN117744640A (en) Address recognition method, device, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant