CN114579699A - Training method and device for pre-training language model - Google Patents

Training method and device for pre-training language model

Info

Publication number
CN114579699A
Authority
CN
China
Prior art keywords
training
character
mask
language model
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210152672.9A
Other languages
Chinese (zh)
Inventor
陈谦
王雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210152672.9A
Publication of CN114579699A
Legal status: Pending

Classifications

    • G06F 16/3344 - Query execution using natural language analysis
    • G06F 16/3334 - Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3347 - Query execution using vector based model
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G06F 40/30 - Handling natural language data; Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the specification provides a training method and apparatus for a pre-training language model. The training method comprises: performing mask processing on a first set number of characters in a sample text to obtain a mask training sample; determining an enhanced semantic vector for each character based on the weights at the positions of non-masked characters in the mask training sample; determining a loss value of the pre-training language model from the enhanced semantic vectors of the characters; and training the pre-training language model accordingly, thereby completing its pre-training. In this way, when the enhanced semantic vector of each character in the mask training sample is computed, the weights at the positions of masked characters can be ignored, which speeds up convergence of the pre-training language model, avoids over-training, improves the model's transferability, avoids the mismatch with downstream tasks that arises when many characters are masked, and improves the accuracy of downstream task processing.

Description

Training method and device for pre-training language model
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to a training method for a pre-training language model. One or more embodiments of the present disclosure also relate to a method for training a text processing model, a method for processing a text, an apparatus for training a pre-trained language model, an apparatus for training a text processing model, a text processing apparatus, a computing device, and a computer-readable storage medium.
Background
With the rapid development of computer technology, natural language processing (NLP) has also advanced rapidly. In the NLP field, pre-trained language models have received wide attention and use: they can be pre-trained on large-scale unlabeled corpora and learn general language representations that can then be applied to other text processing tasks.
In the prior art, some words in an input sample can be masked, the semantic information of the input sample is then identified by a pre-training language model, and the words at the masked positions are predicted, so that the pre-training language model is trained and its ability to identify semantic information is improved; the trained pre-training language model can then be combined with other downstream tasks to complete text processing. However, when the pre-training language model identifies the semantic information of the input sample and predicts the words at the masked positions, it typically fuses the features of all words in the input sample, which affects the convergence speed of the pre-training language model and may cause over-training.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a training method for pre-training a language model. One or more embodiments of the present disclosure also relate to a method for training a text processing model, a method for processing a text, a device for training a pre-trained language model, a device for training a text processing model, a text processing device, a computing device, and a computer-readable storage medium, so as to overcome technical shortcomings in the prior art.
According to a first aspect of embodiments of the present specification, there is provided a training method for pre-training a language model, including:
performing mask processing on characters of a first set numerical value in a sample text to obtain a mask training sample;
inputting the mask training sample into a pre-training language model, and determining an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, wherein the enhanced semantic vector of each character is determined based on the weight of a non-mask character in the mask training sample;
determining a loss value of the pre-training language model according to the enhanced semantic vector of each character, adjusting model parameters of the pre-training language model according to the loss value, returning to execute the operation step of performing mask processing on the first set numerical character in the sample text to obtain a mask training sample until a training stopping condition is reached, and obtaining the pre-training language model after pre-training.
Optionally, determining an enhanced semantic vector of each character in the mask training sample by a self-attention layer in the pre-training language model includes:
determining the weight of each character in the mask training sample relative to a first character, wherein the first character is any character in the mask training sample, and the weight of the masked character in the mask training sample relative to the first character is 0;
an enhanced semantic vector for the first character in the mask training sample is determined based on the weight of each character relative to the first character.
Optionally, determining the weight of each character in the mask training sample relative to the first character includes:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in a mask training sample;
setting each vector element in the key vector corresponding to the masked character as 0;
and respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character.
Optionally, determining the weight of each character in the mask training sample relative to the first character includes:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in a mask training sample;
respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character;
and setting the weight of the masked character relative to the first character in the weights of the characters relative to the first character to be 0.
Optionally, performing mask processing on a first set numerical character in the sample text to obtain a mask training sample, including:
replacing characters with a first proportion in the first set numerical characters with specific symbols, replacing characters with a second proportion in the first set numerical characters with set characters, and keeping characters with a third proportion in the first set numerical characters with original characters to obtain a mask training sample;
wherein the first proportion is larger than the second proportion and larger than the third proportion, and the sum of the first proportion, the second proportion and the third proportion is 1.
Optionally, determining a loss value of the pre-training language model according to the enhanced semantic vector of each character includes:
inputting the enhanced semantic vector of each character into a classification layer of a pre-training language model to obtain a predicted character corresponding to a masked character;
and calculating the loss value of the pre-training language model according to the predicted character and the masked character corresponding to the masked character.
Optionally, performing mask processing on a first set numerical character in the sample text to obtain a mask training sample, including:
and after the pre-training language model has been trained for a second set number of rounds, increasing the first set value by a set proportion, wherein the first set value is less than or equal to a maximum value threshold.
Optionally, the pre-trained language model comprises at least two self-attention layers;
inputting the mask training sample into a pre-training language model, and determining an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, wherein the enhanced semantic vector comprises the following steps:
determining an enhanced semantic vector of each character in the mask training sample through the first self-attention layer, wherein the enhanced semantic vector of each character is determined based on the weight of the position of each character in the mask training sample;
and determining an enhanced semantic vector of each character in the mask training sample through a second self-attention layer, wherein the enhanced semantic vector of each character is determined based on the weight of the position of the non-mask character in the mask training sample, and the second self-attention layer is close to a classification layer in the pre-training language model relative to the first self-attention layer.
According to a second aspect of embodiments of the present specification, there is provided a method for training a text processing model, including:
obtaining a training sample, wherein the training sample carries a sample label;
inputting a training sample into a pre-training language model of a text processing model, and determining an enhanced semantic vector of the training sample through a self-attention layer of the pre-training language model, wherein the pre-training language model is obtained by training through the training method of the first aspect;
inputting the enhanced semantic vector of the training sample into a task processing model of a text processing model to obtain a prediction processing result corresponding to the training sample;
and calculating a loss value of the text processing model according to the prediction processing result and the sample label, adjusting model parameters of the pre-training language model and the task processing model according to the loss value, and returning to execute the operation step of acquiring the training sample until a training stopping condition is reached to obtain the trained text processing model.
According to a third aspect of embodiments herein, there is provided a text processing method including:
acquiring a text to be processed;
inputting a text to be processed into a pre-training language model of a text processing model, and determining an enhanced semantic vector of the text to be processed through a self-attention layer of the pre-training language model;
and inputting the enhanced semantic vector of the text to be processed into a task processing model of the text processing model to obtain a text processing result corresponding to the text to be processed, wherein the text processing model is obtained by training through the training method in the second aspect.
Optionally, inputting the text to be processed into a pre-training language model of the text processing model, and determining an enhanced semantic vector of the text to be processed through a self-attention layer of the pre-training language model, including:
acquiring a query vector of a second character, a key vector of each character and a value vector of each character in the text to be processed, wherein the second character is any character in the text to be processed;
respectively carrying out vector operation on the query vector of the second character and the key vector of each character in the text to be processed to obtain the weight of each character relative to the second character, wherein the weight of the masked character in the text to be processed relative to the second character is 0;
and determining an enhanced semantic vector of the second character in the text to be processed based on the weight of each character relative to the second character.
According to a fourth aspect of embodiments of the present specification, there is provided a training apparatus for pre-training a language model, including:
the processing module is configured to perform mask processing on first set numerical characters in the sample text to obtain a mask training sample;
the first determination module is configured to input the mask training sample into a pre-training language model, and determine an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, wherein the enhanced semantic vector of each character is determined based on the weight at the position of a non-mask character in the mask training sample;
and the first adjusting module is configured to determine a loss value of the pre-training language model according to the enhanced semantic vector of each character, adjust model parameters of the pre-training language model according to the loss value, return to execute the operation step of performing mask processing on the first set numerical characters in the sample text to obtain a mask training sample, and obtain the pre-training language model after pre-training until a training stopping condition is reached.
According to a fifth aspect of embodiments herein, there is provided a training apparatus for a text processing model, including:
a first obtaining module configured to obtain a training sample, wherein the training sample carries a sample label;
a second determining module, configured to input the training sample into a pre-training language model of the text processing model, and determine an enhanced semantic vector of the training sample through a self-attention layer of the pre-training language model, where the pre-training language model is obtained by training through the training method of the first aspect;
the first obtaining module is configured to input the enhanced semantic vector of the training sample into a task processing model of the text processing model, and obtain a prediction processing result corresponding to the training sample;
and the second adjusting module is configured to calculate a loss value of the text processing model according to the prediction processing result and the sample label, adjust model parameters of the pre-training language model and the task processing model according to the loss value, and return to execute the operation step of obtaining the training sample until a training stopping condition is reached to obtain the trained text processing model.
According to a sixth aspect of embodiments herein, there is provided a text processing apparatus including:
the second acquisition module is configured to acquire a text to be processed;
the third determination module is configured to input the text to be processed into a pre-training language model of the text processing model, and determine an enhanced semantic vector of the text to be processed through a self-attention layer of the pre-training language model;
and the second obtaining module is configured to input the enhanced semantic vector of the text to be processed into the task processing model of the text processing model, and obtain a text processing result corresponding to the text to be processed, wherein the text processing model is obtained by training through the training method of the second aspect.
According to a seventh aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is for storing computer executable instructions and the processor is for executing the computer executable instructions to implement the method of training a pre-trained language model of the first aspect, or the method of training a text processing model of the second aspect, or the operational steps of the text processing method of the third aspect.
According to an eighth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method for training a pre-trained language model of the first aspect, or the method for training a text processing model of the second aspect, or the operational steps of the text processing method of the third aspect.
One embodiment of the present specification provides a training method for a pre-training language model, which performs mask processing on a first set numerical character in a sample text to obtain a mask training sample; inputting the mask training sample into a pre-training language model, and determining an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, wherein the enhanced semantic vector of each character is determined based on the weight of a non-mask character in the mask training sample; determining a loss value of the pre-training language model according to the enhanced semantic vector of each character, adjusting model parameters of the pre-training language model according to the loss value, returning to execute the operation step of performing mask processing on the first set numerical character in the sample text to obtain a mask training sample until a training stopping condition is reached, and obtaining the pre-training language model after pre-training.
In this case, when the pre-training language model is trained, mask processing may be performed on a first set number of characters in a sample text to obtain a mask training sample, then an enhanced semantic vector of each character is determined based on a weight at a position of a non-mask character in the mask training sample, then a loss value of the pre-training language model is determined according to the enhanced semantic vector of each character, and the pre-training language model is trained, thereby completing a pre-training process of the language model. Therefore, when the enhanced semantic vector of each character of the mask training sample is calculated, the weight at the position of the masked character can be ignored, the convergence speed of the pre-training language model is improved, the over-training is avoided, the model migration capability is improved, the problem that the masked character is not matched with a downstream task when the number of the masked character is large is avoided, and the processing accuracy of the downstream task is improved.
Drawings
FIG. 1 is a flowchart of a training method for pre-training a language model according to an embodiment of the present disclosure;
FIG. 2a is a vector diagram of each character in a mask training sample according to an embodiment of the present disclosure;
FIG. 2b is a diagram illustrating a model structure of a pre-trained language model according to an embodiment of the present disclosure;
FIG. 3a is a flowchart of a method for training a text processing model according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram illustrating a training process of a text processing model according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of a method for text processing provided in one embodiment of the present description;
FIG. 5 is a schematic structural diagram of a training apparatus for pre-training a language model according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an apparatus for training a text processing model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present specification;
fig. 8 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Transformer model: the Encoder is composed of 6 encoding blocks (each block of the Encoder is composed of self-addressing, FFNN), the Decoder is composed of 6 decoding blocks (each block of the Decoder is composed of self-addressing, Encoder-Decoder addressing and FFNN), and the output of the Encoder is used as the input of the Decoder, which is the same as all generation models.
BERT (Bidirectional Encoder Representations from Transformers) model: a language representation model based on the bidirectional encoder representations of the Transformer. BERT aims to pre-train deep bidirectional representations by jointly conditioning on left and right context in all layers. Its network architecture uses a multi-layer Transformer structure; its most distinctive characteristic is that it abandons the traditional RNN and CNN and, through the attention mechanism, reduces the distance between two words at any positions to 1.
Masked LM: masked language modeling, a self-supervised learning task for pre-training language models.
Self-attention: from the point of attention.
Over-smoothing problem: also referred to as the over-training problem; the self-attention layers tend to map words at different positions to similar hidden representations.
Self-supervised learning (SSL): a combination of supervised and unsupervised learning; its learning procedure is the same as supervised learning, but the labels of the training data are generated automatically. The core idea is to predict any part of the input, in some form, from the other parts. In particular, the masked language model (Masked LM) is a self-supervised task that masks words in a sentence and predicts them based on the remaining words.
It should be noted that, in the NLP field, a general pre-trained language model (such as BERT) is usually fine-tuned to improve the accuracy of downstream NLP tasks, so the quality of the pre-trained language model is critical. The traditional pre-training language model is trained with a Masked LM task that uses a low mask probability: the low mask probability makes the task easy and the semantic learning inefficient; masked words are still considered when self-attention is computed, so having too many masked words participate in self-attention leads to the over-smoothing problem; and adding masked words causes a mismatch with downstream tasks.
The embodiment of the specification provides a pre-training language model with a high mask probability, which addresses the low mask probability adopted by traditional pre-training language models (for example, the Masked LM mask probability of 15%). By adopting a high mask probability (for example, more than 50%) and removing the self-attention weights of all word pairs at the positions of masked words, the over-smoothing problem caused by too many masked words participating in self-attention is alleviated, the mismatch between the large number of masked words and downstream tasks is mitigated, and the training convergence speed of the pre-training language model and the accuracy of downstream tasks are effectively improved. The downstream tasks include common tasks such as text classification, sequence labeling, and text generation.
The training method of the pre-training language model provided by the embodiment of the present specification can be applied to application scenarios such as text classification, sequence labeling, text generation tasks, and the like, including but not limited to text segmentation, title generation, key sentence extraction, and a dialogue understanding model, and the like.
In the present specification, a training method of a pre-training language model is provided, and the present specification relates to a training method of a text processing model, a text processing method, a training apparatus of a pre-training language model, a training apparatus of a text processing model, a text processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Fig. 1 is a flowchart illustrating a training method for pre-training a language model according to an embodiment of the present disclosure, and as shown in fig. 1, the method specifically includes the following steps:
step 102: and performing mask processing on the first set numerical value character in the sample text to obtain a mask training sample.
Specifically, the sample text is unlabeled text data acquired in advance, and the pre-training language model can be trained on it in a self-supervised manner. The first set value may be a threshold on the number of characters to be masked in the sample text, or a threshold on the proportion of masked characters among all characters in the sample text; for example, the first set value may be 5, 10, 50%, 60%, and so on. In practical implementation, the first set value, which is set based on training requirements and determines the number or proportion of masked characters in the sample text, may be chosen to be large, for example greater than 50%, meaning that the masked characters account for more than half of the characters in the sample text, so that the pre-training language model is trained with a high mask probability.
It should be noted that the meaning of a sentence can often be inferred from a subset of its words, and masking some of the words can strengthen the model's ability to extract information and reason. Therefore, a large amount of unlabeled text data can be obtained in advance as sample texts, certain characters in each sample text are masked to obtain a mask training sample, and the mask training sample is then input into the pre-training language model, which predicts the masked characters, so that the pre-training language model is trained according to the prediction results.
In practical application, masking the first set number of characters in the sample text may proceed as follows: select the first set number of characters from the sample text, then replace the selected characters with a specific symbol or with set characters (the replaced characters are the masked characters and the remaining characters are the non-masked characters), or keep some of the original characters unchanged, thereby obtaining a mask training sample. The specific symbol is a preset marker, such as [MASK], that replaces a masked character; a set character is a preset random word that replaces a selected masked character.
Illustratively, the obtained sample text is "T1 T2 T3 T4 T5 T6 T7 T8" and the first set value is 50%; since the sample text contains 8 characters, 4 characters may be selected for masking. Assuming T3, T4, T6, and T8 are selected, they are replaced with the specific symbol [MASK], and the resulting mask training sample is "T1 T2 [MASK] [MASK] T5 [MASK] T7 [MASK]".
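To make this step concrete, the following is a minimal Python sketch of selecting a set proportion of characters and replacing them with a mask symbol; the function name and the fixed [MASK] token are assumptions for illustration and are not part of the original disclosure.

```python
import random

def mask_sample(tokens, mask_ratio=0.5, mask_token="[MASK]"):
    """Randomly select `mask_ratio` of the tokens and replace them with the
    mask symbol; returns the masked sequence and the masked positions, which
    later serve as prediction targets."""
    num_to_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(random.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions

# Example: an 8-character sample text with a first set value of 50%
sample = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
masked_sample, masked_positions = mask_sample(sample, mask_ratio=0.5)
```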
In the embodiment of the specification, mask processing can be performed on a first set number of characters in the sample text to obtain a mask training sample. Within the sample text, the non-masked characters provide the context information of the masked characters, and the subsequent pre-training language model can learn to capture textual context by predicting the masked characters. A pre-training language model trained with the Masked LM method therefore acquires the ability to understand the deep semantics of natural language and can be used in a series of NLP-related downstream tasks.
In an optional implementation manner of this embodiment, different manners may be adopted for performing mask processing on the first set numerical value character, that is, the mask processing is performed on the first set numerical value character in the sample text to obtain a mask training sample, and a specific implementation process may be as follows:
replacing characters with a first proportion in the characters with a first set numerical value with specific symbols, replacing characters with a second proportion in the characters with the first set numerical value with set characters, and keeping characters with a third proportion in the characters with the first set numerical value with original characters to obtain a mask training sample;
wherein the first proportion is larger than the second proportion and larger than the third proportion, and the sum of the first proportion, the second proportion and the third proportion is 1.
It should be noted that the meaning of a sentence can often be inferred from a subset of its characters, and masking some characters can strengthen the extraction and inference capability of the pre-training language model. The first, second, and third proportions may be preset to represent the shares of the corresponding masking modes, and they may be set randomly. Because replacing some words with specific symbols is what most enhances the model's extraction and inference capability, the first proportion, i.e., the share of masked characters replaced with specific symbols, is typically set to be the largest.
In practical application, after the first set number of characters is selected from the sample text, the selected characters are masked; these are the masked characters, and the characters that are not masked are the non-masked characters. To improve the generalization capability of the pre-training language model and speed up training, different processing manners may be applied to the selected masked characters: for example, 80% of them may be replaced with the specific symbol [MASK], 10% may be kept as the original characters, and 10% may be replaced with random characters.
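A sketch of this replacement strategy, assuming the 80%/10%/10% split described above (the helper name and its arguments are illustrative):

```python
import random

def apply_mask_strategy(tokens, positions, vocab, mask_token="[MASK]"):
    """For each selected position: ~80% are replaced with [MASK], ~10% with a
    random character from the vocabulary, and ~10% keep the original character."""
    out = list(tokens)
    for pos in positions:
        r = random.random()
        if r < 0.8:
            out[pos] = mask_token
        elif r < 0.9:
            out[pos] = random.choice(vocab)  # random replacement
        # else: keep the original character unchanged
    return out
```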
In an optional implementation manner of this embodiment, the first set value may change along with the training process, that is, the first set value character in the sample text is subjected to mask processing to obtain a mask training sample, and a specific implementation process may be as follows:
and after the pre-training language model has been trained for a second set number of rounds, increasing the first set value by a set proportion, wherein the first set value is less than or equal to a maximum value threshold.
It should be noted that the second set value may be the number of training rounds between adjustments of the number of masked characters in the sample text; for example, the second set value may be 10, 20, 50, 100, and so on. The set proportion is the step size by which the first set value is increased each time; for example, the increment may be 5%, 10%, or 15% of the current first set value, or an absolute 5%, 10%, or 15%.
In practical application, the first set value cannot be increased infinitely, and the first set value should be less than or equal to the maximum value threshold, so that when the first set value is increased to a certain value, the increase is not continued, and the mask processing is performed on the sample text by keeping the current first set value.
In an example, assume the first set value is 50% and that it is increased by 5% after every 100 training rounds of the pre-training language model. Then during rounds 1-100, 50% of the characters in the sample text are masked; during rounds 101-200, 55% of the characters are masked; and during rounds 201-300, 60% of the characters are masked. Since 60% reaches the maximum value threshold, the first set value is not increased further in subsequent training, and 60% of the characters in the sample text continue to be masked.
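The schedule in this example could be expressed as the following sketch (round numbers are 1-based; the default values mirror the numbers above and are illustrative assumptions):

```python
def mask_ratio_for_round(round_idx, start=0.50, step=0.05,
                         interval=100, max_ratio=0.60):
    """Increase the mask ratio by `step` after every `interval` training rounds,
    never exceeding `max_ratio` (the maximum value threshold)."""
    increments = (round_idx - 1) // interval
    return min(start + increments * step, max_ratio)

# rounds 1-100   -> 0.50
# rounds 101-200 -> 0.55
# rounds 201+    -> 0.60 (capped at the maximum value threshold)
```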
In this embodiment of the present specification, after the pre-training language model has been trained for a second set number of rounds, the first set value can be increased, i.e., the number of masked characters in the sample text is increased so that more characters are masked, thereby accelerating the training of the pre-training language model.
Step 104: inputting the mask training sample into a pre-training language model, and determining an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, wherein the enhanced semantic vector of each character is determined based on the weight of the position of a non-mask character in the mask training sample.
It should be noted that, after the mask processing is performed on the first set numerical value characters in the sample text to obtain the mask training sample, the mask training sample may be input to the pre-training language model to train the pre-training language model. The pre-training language model refers to a language model pre-trained by adopting a Masked LM training method, and the pre-training language model can be a BERT model, a RoBERTA model, a MASS model and the like.
In an optional implementation manner of this embodiment, the weight of the masked character in the mask training sample relative to the first character may be set to 0, so as to ignore the weight at the position of the masked character, that is, determine the enhanced semantic vector of each character in the mask training sample through the self-attention layer in the pre-training language model, and a specific implementation process may be as follows:
determining the weight of each character in the mask training sample relative to a first character, wherein the first character is any character in the mask training sample, and the weight of the masked character in the mask training sample relative to the first character is 0;
an enhanced semantic vector for the first character in the mask training sample is determined based on the weight of each character relative to the first character.
It should be noted that, when calculating the enhanced semantic vector (self-attention) of the first character in the mask training sample, the self-attention layer may fuse the weight of each character in the mask training sample with respect to the first character, so that the self-attention layer may determine the weight of each character in the mask training sample with respect to the first character first, and then determine the enhanced semantic vector of the first character in the mask training sample based on the weight of each character with respect to the first character. Each character in the mask training sample can be used as a first character, a corresponding enhanced semantic vector is determined to be obtained, so that the enhanced semantic vector of each character in the mask training sample can be obtained, and then the loss value of the pre-training language model can be determined based on the enhanced semantic vector of each character in the mask training sample, so that the pre-training language model is trained.
In practical application, the enhanced semantic vector of the first character is determined based on the non-masked characters in the mask training sample, that is, the weight of the masked character in the mask training sample at the position needs to be masked, so the weight of the masked character in the mask training sample relative to the first character should be 0.
In the embodiment of the description, when the enhanced semantic vector of each character of the mask training sample is calculated, the weight at the position of the masked character can be ignored, the convergence speed of the pre-training language model is improved, the over-training is avoided, the model migration capability is improved, the problem that the masked character is not matched with a downstream task when the number of the masked character is large is avoided, and the accuracy rate of processing the downstream task is improved.
In an optional implementation manner of this embodiment, the weight of each character in the mask training sample relative to the first character is determined, and a specific implementation process may be as follows:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in a mask training sample;
setting each vector element in the key vector corresponding to the masked character as 0;
and respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character.
Specifically, when the self-attention layer computes the enhanced semantic vector (self-attention output), each character has three different vectors: a Query vector, a Key vector, and a Value vector.
In practical application, the attention mechanism may treat the first character as the query, the characters of its context as keys, and use the similarity between the query vector and each key vector as a weight to merge the value vectors of the context characters into the original value vector of the first character. Specifically, the attention mechanism takes the semantic vector representations of the first character and of each context character as input, first obtains the query vector of the first character, the key vector of each context character, and the original value vector of each context character through linear transformations, then computes the similarity between the query vector and each key vector as the weight, and finally fuses the value vector of the first character with the weighted value vectors of the context characters as the output of the attention, i.e., the enhanced semantic vector of the first character.
It should be noted that the weight of a masked character in the mask training sample relative to the first character should be 0. Therefore, after the query vector of the first character and the key vector and value vector of each character in the mask training sample are obtained, each vector element in the key vectors corresponding to the masked characters can be set directly to 0. When the query vector of the first character is subsequently combined with the key vector of each character by vector operations, the key vectors of the masked characters are all zeros, so the resulting weights of the masked characters relative to the first character are 0.
Following the above example, the mask training sample is "[CLS] T1 T2 [MASK] [MASK] T5 [MASK] T7 [MASK] [SEP]", where the masked characters T3, T4, T6, and T8 have been replaced by the mask symbol [MASK]. Fig. 2a is a vector diagram of each character in the mask training sample provided in an embodiment of this specification. As shown in Fig. 2a, a row represents the Query vector of a character in the mask training sample and a column represents the Key vector of a character; the Key vectors at the [MASK] positions, i.e., every element in the columns corresponding to [MASK], are set to 0 (that is, the columns corresponding to the masked characters are set to 0).
The enhanced semantic vector of the first character in the mask training sample may then be: y_i = w_i_0 * x_0 + w_i_1 * x_1 + w_i_2 * x_2 + w_i_5 * x_5 + w_i_7 * x_7 + w_i_9 * x_9, where y_i denotes the enhanced semantic vector of the first character; w_i_0 denotes the weight of the character at position 0 (i.e., [CLS], the start identifier) relative to the character at position i (i.e., the first character), and x_0 denotes the Value vector at position 0; w_i_1 denotes the weight of the character at position 1 (i.e., T1) relative to the character at position i, and x_1 denotes the Value vector at position 1; w_i_2 and x_2 correspond to position 2 (i.e., T2); w_i_5 and x_5 correspond to position 5 (i.e., T5); w_i_7 and x_7 correspond to position 7 (i.e., T7); and w_i_9 denotes the weight of the character at position 9 (i.e., [SEP], the end identifier) relative to the character at position i, with x_9 the Value vector at position 9.
In the embodiment of the specification, by directly setting each vector element in the key vector corresponding to the masked character as 0, the calculated weight of the masked character relative to the first character is 0, the calculation method is simple and convenient, the weight at the position of the masked character is ignored, the convergence speed of the pre-training language model is improved, the over-training is avoided, the model migration capability is improved, the problem that the masked character is not matched with a downstream task when the number of the masked characters is large is avoided, and the processing accuracy of the downstream task is improved.
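A minimal single-head sketch of this variant, assuming PyTorch; the softmax normalization of a standard Transformer is omitted so that zeroed key vectors literally produce zero weights, matching the linear combination in the example above (in a softmax-based layer, the same effect is usually obtained by adding a large negative value to the scores at masked positions):

```python
import math
import torch

def attention_with_zeroed_keys(x, masked_positions, w_q, w_k, w_v):
    """Variant 1: zero every element of the key vectors at masked positions, so
    the weight (scaled dot product of query and key) of each masked character
    relative to any first character is exactly 0 and contributes nothing.

    x: (seq_len, d_model) character embeddings of the mask training sample
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    k[masked_positions, :] = 0.0                  # zero the masked characters' keys
    weights = q @ k.t() / math.sqrt(k.size(-1))   # weight of every character w.r.t. every query
    return weights @ v                            # enhanced semantic vectors

# toy usage: positions of [MASK] in "[CLS] T1 T2 [MASK] [MASK] T5 [MASK] T7 [MASK] [SEP]"
d = 16
x = torch.randn(10, d)
w_q, w_k, w_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
y = attention_with_zeroed_keys(x, [3, 4, 6, 8], w_q, w_k, w_v)
```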
In an optional implementation manner of this embodiment, the weight of each character in the mask training sample relative to the first character is determined, and a specific implementation process may also be as follows:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in a mask training sample;
respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character;
and setting the weight of the masked character relative to the first character in the weight of each character relative to the first character to be 0.
It should be noted that, in addition to directly setting each vector element in the key vector corresponding to the masked character to 0 after the query vector of the first character, the key vector of each character, and the value vector of each character in the mask training sample are obtained, vector operation may be directly performed on the query vector of the first character and the key vector of each character in the mask training sample, so as to obtain the weight of each character relative to the first character, and then the weight of the masked character relative to the first character in the weight of each character relative to the first character is directly set to 0.
In the embodiment of the specification, the weight at the position of the masked character is shielded in a mode of directly setting the weight of the masked character relative to the first character to 0 in the calculated weight of each character relative to the first character, the calculation mode is simple and convenient, the weight at the position of the masked character is ignored, the convergence speed of the pre-training language model is improved, the over-training is avoided, the model migration capability is improved, the problem that the masked character is not matched with a downstream task when more characters are covered is avoided, and the processing accuracy of the downstream task is improved.
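A corresponding sketch of this second variant (again assuming PyTorch and a single head): the weights are computed first, then the weights at masked positions are set to 0 before the value vectors are combined. The weights are not renormalized here, following the literal description; a common practical equivalent is to mask the scores with a large negative value before the softmax.

```python
import math
import torch
import torch.nn.functional as F

def attention_with_zeroed_weights(x, masked_positions, w_q, w_k, w_v):
    """Variant 2: compute the weights normally, then set the weights of the
    masked characters (relative to every first character) to 0."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.t() / math.sqrt(k.size(-1))
    weights = F.softmax(scores, dim=-1)
    weights[:, masked_positions] = 0.0   # ignore masked positions as attention sources
    return weights @ v                   # enhanced semantic vectors
```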
In an optional implementation manner of this embodiment, the pre-trained language model includes at least two self-attention layers; at this time, the mask training sample is input into the pre-training language model, and the enhanced semantic vector of each character in the mask training sample is determined through the self-attention layer in the pre-training language model, and the specific implementation process can be as follows:
determining an enhanced semantic vector of each character in the mask training sample through the first self-attention layer, wherein the enhanced semantic vector of each character is determined based on the weight of the position of each character in the mask training sample;
and determining an enhanced semantic vector of each character in the mask training sample through a second self-attention layer, wherein the enhanced semantic vector of each character is determined based on the weight of the position of the non-mask character in the mask training sample, and the second self-attention layer is close to a classification layer in the pre-training language model relative to the first self-attention layer.
It should be noted that the pre-training language model may include at least two self-attention layers, and typically includes 12 self-attention layers. Besides having every self-attention layer determine the enhanced semantic vector of each character based only on the weights at the positions of non-masked characters (ignoring the weights at masked positions), it is also possible for only some of the self-attention layers in the pre-training language model to ignore the weights at the masked positions while the other layers do not, for example keeping them in the lower layers and ignoring them in the higher layers that are closer to the classification layer.
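One way to express this per-layer behavior is sketched below; the layer interface and the split point between lower and higher layers are assumptions for illustration only.

```python
def encode(x, layers, masked_positions, ignore_from_layer=6):
    """Run stacked self-attention layers: lower layers keep the weights at
    masked positions, while layers from `ignore_from_layer` onward (closer to
    the classification layer) ignore them by passing the masked positions in."""
    for idx, layer in enumerate(layers):
        positions = masked_positions if idx >= ignore_from_layer else []
        x = layer(x, positions)   # assumed layer signature: layer(x, positions_to_ignore)
    return x
```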
Step 106: determining a loss value of the pre-training language model according to the enhanced semantic vector of each character, adjusting model parameters of the pre-training language model according to the loss value, returning to execute the operation step of performing mask processing on the first set numerical character in the sample text to obtain a mask training sample until a training stopping condition is reached, and obtaining the pre-training language model after pre-training.
It should be noted that, when returning to perform the operation step of performing mask processing on the first set value characters in the sample text to obtain the mask training sample, the sample text may be the same, but the same or different first set value characters may be selected to perform the mask processing to obtain the corresponding mask training sample, and the pre-training language model is trained based on the mask training sample. And after training is carried out on the basis of the sample text for a preset number of rounds or a preset duration, the current sample text can be replaced by other sample texts, then the operation step of carrying out mask processing on the first set numerical value characters in the sample text to obtain a mask training sample is returned, so that the pre-training language model is trained on the basis of the updated sample text, and when the training stopping condition is reached, the training is completed, and the pre-training language model which is trained is obtained. The training stop condition may be that the loss value is smaller than a loss threshold value, and/or the number of iterations reaches a number threshold value.
In an optional implementation manner of this embodiment, the loss value may be determined based on a classification layer of the pre-training language model, that is, the loss value of the pre-training language model is determined according to the enhanced semantic vector of each character, and a specific implementation process may be as follows:
inputting the enhanced semantic vector of each character into a classification layer of a pre-training language model to obtain a predicted character corresponding to a masked character;
and calculating a loss value of the pre-training language model according to the predicted character and the masked character corresponding to the masked character.
It should be noted that the classification layer may be an MLP classifier. A masked character may refer to an original character at a masked position, i.e., a true result; and the predicted characters corresponding to the masked characters are characters predicted by the classification layer of the pre-training language model aiming at the masked positions, namely, the predicted results, when the difference between the predicted results and the real results is small enough, the predicted results are close to the real results enough, at the moment, the training stopping condition is reached, the training is finished, and the pre-training language model which is pre-trained can be obtained.
In the embodiment of the specification, the difference between the prediction result and the real result can be visually shown by calculating the loss value, and then the pre-training language model can be trained specifically based on the difference, and the parameters of the pre-training language model are adjusted, so that the training speed and the training effect of the model are effectively improved.
For example, fig. 2b is a schematic diagram of a model structure of a pre-training language model provided in an embodiment of the present specification. As shown in fig. 2b, the MASK training sample is "[CLS] T1 T2 [MASK] [MASK] T5 [MASK] T7 [MASK] [SEP]", where [CLS] represents a start identifier and [SEP] represents an end identifier. The mask training sample is input into the self-attention layer of the pre-training language model, and the self-attention layer determines the enhanced semantic vector of each character in the mask training sample. The enhanced semantic vectors of the characters in the mask training sample are then input into the classification layer of the pre-training language model, the classification layer outputs the predicted characters T3, T4, T6 and T8 corresponding to the masked characters, and the loss value of the pre-training language model is calculated based on T3, T4, T6, T8 and the masked characters (i.e., the original characters).
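Under the assumption that the classification layer is a linear (MLP-style) projection onto the character vocabulary and that the loss is a cross-entropy computed only at the masked positions (the embodiment does not fix the exact loss function), the prediction-and-loss step of fig. 2b could be sketched as follows; all tensor shapes and the random inputs in the usage lines are illustrative.

```python
import torch
import torch.nn as nn

def masked_lm_loss(enhanced_vectors: torch.Tensor,   # [seq_len, hidden] enhanced semantic vectors
                   original_ids: torch.Tensor,       # [seq_len] ids of the original characters
                   masked_positions: torch.Tensor,   # [seq_len] bool, True where masked
                   classifier: nn.Linear) -> torch.Tensor:
    """Predict characters at the masked positions and compare them with the
    original (masked-out) characters; cross-entropy is an assumed choice."""
    logits = classifier(enhanced_vectors)             # [seq_len, vocab_size]
    return nn.functional.cross_entropy(
        logits[masked_positions],                     # predictions at the masked positions only
        original_ids[masked_positions])               # the true characters, e.g. T3, T4, T6, T8

# Illustrative usage with random tensors standing in for the example of fig. 2b.
vocab_size, hidden = 100, 32
classifier = nn.Linear(hidden, vocab_size)
vectors = torch.randn(10, hidden)                     # [CLS] T1 T2 [MASK] [MASK] T5 [MASK] T7 [MASK] [SEP]
original = torch.randint(0, vocab_size, (10,))
masked = torch.tensor([0, 0, 0, 1, 1, 0, 1, 0, 1, 0], dtype=torch.bool)
loss = masked_lm_loss(vectors, original, masked, classifier)
```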
One embodiment of the present specification provides a training method for a pre-training language model: when the pre-training language model is trained, mask processing is performed on a first set number of characters in a sample text to obtain a mask training sample, the enhanced semantic vector of each character is determined based on the weights at the positions of the non-masked characters in the mask training sample, the loss value of the pre-training language model is determined according to the enhanced semantic vector of each character, and the pre-training language model is trained accordingly, thereby completing the pre-training process of the language model. In this way, when the enhanced semantic vector of each character in the mask training sample is calculated, the weights at the positions of the masked characters can be ignored, which speeds up the convergence of the pre-training language model, avoids over-training, and improves the model migration capability; it also avoids the mismatch between the masked characters and downstream tasks when the number of masked characters is large, thereby improving the processing accuracy of downstream tasks.
Fig. 3a is a flowchart illustrating a training method of a text processing model according to an embodiment of the present disclosure, and as shown in fig. 3a, the method specifically includes the following steps:
step 302: a training sample is obtained, wherein the training sample carries a sample label.
It should be noted that the training sample is pre-acquired labeled data and is used for fine-tuning the text processing model. The text processing model includes a pre-training language model and a task processing model: the pre-training language model extracts the semantic information of the training sample to obtain an enhanced semantic vector, and the task processing model performs the specific task processing based on the enhanced semantic vector to obtain a prediction processing result. The task processing model is the specific processing model of the downstream task matched with the pre-training language model; for example, the task processing model may be a translation model, a labeling model, a question-answering model, and the like.
Step 304: inputting the training samples into a pre-training language model of the text processing model, and determining the enhanced semantic vectors of the training samples through a self-attention layer of the pre-training language model.
The pre-training language model is obtained by training through the training method of the pre-training language model.
Step 306: and inputting the enhanced semantic vector of the training sample into a task processing model of the text processing model to obtain a prediction processing result corresponding to the training sample.
Step 308: and calculating a loss value of the text processing model according to the prediction processing result and the sample label, adjusting model parameters of the pre-training language model and the task processing model according to the loss value, and returning to execute the operation step of acquiring the training sample until a training stopping condition is reached to obtain the trained text processing model.
For example, fig. 3b is a schematic diagram of the training process of a text processing model provided in an embodiment of this specification. As shown in fig. 3b, the pre-training language model is first pre-trained with unlabeled data; a task processing model is then appended to the pre-training language model to obtain the text processing model, and the text processing model is fine-tuned with the labeled data to obtain the trained text processing model.
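A minimal sketch of this pre-train-then-fine-tune arrangement is given below (the class and function names, the linear classification head standing in for the task processing model, and the use of the [CLS] position are all assumptions; any translation, labeling, or question-answering head could replace the linear layer).

```python
import torch
import torch.nn as nn

class TextProcessingModel(nn.Module):
    """Pre-trained language model followed by a task processing model (here a
    simple classification head as an assumed example of a downstream task)."""
    def __init__(self, pretrained_encoder: nn.Module, hidden: int, num_labels: int):
        super().__init__()
        self.encoder = pretrained_encoder            # pre-trained with the masking scheme above
        self.task_head = nn.Linear(hidden, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        enhanced = self.encoder(token_ids)           # assumed shape: [batch, seq_len, hidden]
        return self.task_head(enhanced[:, 0])        # predict from the [CLS] position (assumption)

def fine_tune_step(model: TextProcessingModel, optimizer, token_ids, labels):
    """One fine-tuning step: both the encoder and the task head are updated."""
    logits = model(token_ids)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```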
The embodiment of the specification provides a training method for a text processing model, in which the pre-training language model is obtained by training through the above training method for a pre-training language model. Because the pre-training language model ignores the weights at the positions of the masked characters, it has strong model migration capability and avoids the mismatch between the masked characters and downstream tasks when the number of masked characters is large, so that the pre-trained language model can be well matched with various task processing models, which improves the training accuracy of the text processing model and thus the processing accuracy of downstream tasks.
Fig. 4 is a flowchart illustrating a text processing method according to an embodiment of the present specification, and as shown in fig. 4, the method specifically includes the following steps:
step 402: and acquiring a text to be processed.
It should be noted that the acquired text to be processed is the content on which text processing is to be performed, such as a text to be translated, a text to be labeled, a text to be answered, and the like.
Step 404: inputting the text to be processed into a pre-training language model of the text processing model, and determining an enhanced semantic vector of the text to be processed through a self-attention layer of the pre-training language model.
The text processing model is obtained by training through the training method of the text processing model.
In an optional implementation manner of this embodiment, the text to be processed is input into a pre-training language model of the text processing model, and the enhanced semantic vector of the text to be processed is determined through a self-attention layer of the pre-training language model, and a specific implementation process may be as follows:
acquiring a query vector of a second character, a key vector of each character and a value vector of each character in the text to be processed, wherein the second character is any character in the text to be processed;
respectively carrying out vector operation on the query vector of the second character and the key vector of each character in the text to be processed to obtain the weight of each character relative to the second character, wherein the weight of the masked character in the text to be processed relative to the second character is 0;
and determining an enhanced semantic vector of the second character in the text to be processed based on the weight of each character relative to the second character.
It should be noted that, when calculating the enhanced semantic vector (self-attention) of the second character in the text to be processed, the self-attention layer fuses the weights of the characters in the text to be processed with respect to the second character. The self-attention layer therefore first determines the weight of each character in the text to be processed with respect to the second character, and then determines the enhanced semantic vector of the second character based on these weights. Each character in the text to be processed can serve as the second character and have its corresponding enhanced semantic vector determined, so that the corresponding text processing result can subsequently be predicted based on the enhanced semantic vectors of all characters in the text to be processed.
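The computation just described can be sketched as follows (a minimal Python/PyTorch illustration; the scaling by the square root of the hidden size and the softmax normalization are common conventions assumed here, since the embodiment only requires that the weight of a masked character be 0).

```python
import math
import torch

def enhanced_semantic_vector(query: torch.Tensor,      # [hidden] query vector of the second character
                             keys: torch.Tensor,       # [seq_len, hidden] key vector of each character
                             values: torch.Tensor,     # [seq_len, hidden] value vector of each character
                             is_masked: torch.Tensor   # [seq_len] bool, True at masked positions
                             ) -> torch.Tensor:
    """Weights of each character relative to the second character; masked
    positions get weight 0. Scaling and softmax are assumed conventions."""
    scores = keys @ query / math.sqrt(query.shape[-1])  # vector operation: query . key
    weights = torch.softmax(scores, dim=-1)
    weights = weights * (~is_masked).float()             # weight of a masked character is 0
    return weights @ values                               # weighted sum of the value vectors
```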
Step 406: and inputting the enhanced semantic vector of the text to be processed into a task processing model of the text processing model to obtain a text processing result corresponding to the text to be processed.
The embodiment of the specification provides a text processing method in which the pre-training language model is obtained by training through the above training method for a pre-training language model and is then fine-tuned together with a specific task processing model. Because the pre-training language model ignores the weights at the positions of the masked characters, it has strong model migration capability and avoids the mismatch between the masked characters and downstream tasks when the number of masked characters is large, so that the pre-trained language model can be well matched with various task processing models, which improves the training accuracy of the text processing model and thus the processing accuracy of downstream tasks.
Corresponding to the above method embodiment, the present specification further provides an embodiment of a training apparatus for pre-training a language model, and fig. 5 shows a schematic structural diagram of the training apparatus for pre-training a language model provided in an embodiment of the present specification. As shown in fig. 5, the apparatus includes:
a processing module 502 configured to perform mask processing on a first set numerical character in a sample text to obtain a mask training sample;
a first determining module 504, configured to input the mask training sample into the pre-training language model, and determine an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, where the enhanced semantic vector of each character is determined based on a weight at a position of a non-mask character in the mask training sample;
the first adjusting module 506 is configured to determine a loss value of the pre-training language model according to the enhanced semantic vector of each character, adjust model parameters of the pre-training language model according to the loss value, and return to perform an operation step of performing mask processing on the first set numerical characters in the sample text to obtain a mask training sample until a training stop condition is reached, so as to obtain the pre-training language model after pre-training.
Optionally, the first determining module 504 is further configured to:
determining the weight of each character in the mask training sample relative to a first character, wherein the first character is any character in the mask training sample, and the weight of the masked character in the mask training sample relative to the first character is 0;
and determining an enhanced semantic vector of the first character in the mask training sample based on the weight of each character relative to the first character.
Optionally, the first determining module 504 is further configured to:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in a mask training sample;
setting each vector element in the key vector corresponding to the masked character as 0;
and respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character.
Optionally, the first determining module 504 is further configured to:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in a mask training sample;
respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character;
and setting the weight of the masked character relative to the first character in the weights of the characters relative to the first character to be 0.
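The two alternative ways of obtaining a zero weight for the masked characters, as configured for the first determining module above, could be sketched as follows (the "weight" here is taken to be the raw query-key product; whether a softmax normalization is applied afterwards is left open by the embodiment, and all function names are illustrative).

```python
import torch

def weights_variant_a(query: torch.Tensor, keys: torch.Tensor, is_masked: torch.Tensor) -> torch.Tensor:
    """Variant (a): zero every element of the key vectors at masked positions
    before the query-key vector operation, so their weights come out as 0."""
    keys = keys * (~is_masked).float().unsqueeze(-1)   # zero the key vectors of masked characters
    return keys @ query                                 # weight of each character w.r.t. the first character

def weights_variant_b(query: torch.Tensor, keys: torch.Tensor, is_masked: torch.Tensor) -> torch.Tensor:
    """Variant (b): compute the weights first, then set the weights of the
    masked characters to 0."""
    weights = keys @ query
    return weights * (~is_masked).float()

# If a softmax normalization is inserted between the dot product and the weighting,
# a comparable effect is commonly obtained by adding a large negative bias to the
# masked scores before the softmax; that is an implementation choice not fixed here.
```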
Optionally, the processing module 502 is further configured to:
replacing a first proportion of the characters of the first set numerical value with a specific symbol, replacing a second proportion of the characters of the first set numerical value with set characters, and keeping a third proportion of the characters of the first set numerical value as the original characters, so as to obtain the mask training sample;
wherein the first proportion is larger than the second proportion and larger than the third proportion, and the sum of the first proportion, the second proportion and the third proportion is 1.
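A minimal sketch of this replacement scheme is given below, assuming proportions of 0.8 / 0.1 / 0.1 (the module only requires that the first proportion exceed the other two and that the three sum to 1) and "[MASK]" as the specific symbol; the parameter names are illustrative.

```python
import random

def mask_characters(characters, selected_positions, character_set,
                    p_symbol=0.8, p_set_char=0.1, p_keep=0.1, mask_symbol="[MASK]"):
    """Three-way replacement over the selected positions: a first proportion is
    replaced with the specific symbol, a second proportion with a character from
    a given character set, and a third proportion keeps the original character.
    The 0.8/0.1/0.1 split and the '[MASK]' symbol are illustrative assumptions."""
    assert abs(p_symbol + p_set_char + p_keep - 1.0) < 1e-9
    assert p_symbol > p_set_char and p_symbol > p_keep
    out = list(characters)
    for pos in selected_positions:
        r = random.random()
        if r < p_symbol:
            out[pos] = mask_symbol                      # replace with the specific symbol
        elif r < p_symbol + p_set_char:
            out[pos] = random.choice(character_set)     # replace with a set character
        # else: keep the original character unchanged
    return out
```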
Optionally, the first adjusting module 506 is further configured to:
inputting the enhanced semantic vector of each character into a classification layer of a pre-training language model to obtain a predicted character corresponding to a masked character;
and calculating the loss value of the pre-training language model according to the predicted character corresponding to the masked character and the masked character.
Optionally, the processing module 502 is further configured to:
and increasing the first set numerical value by a set proportion each time the pre-training language model has been trained for a second set numerical value of rounds, wherein the first set numerical value is less than or equal to a maximum numerical value threshold.
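This schedule could be sketched as follows (the concrete round interval, the 10% increase, and the cap are assumptions introduced only for illustration).

```python
def next_mask_count(current_count: int, rounds_done: int,
                    rounds_per_increase: int = 1000,
                    increase_ratio: float = 0.1,
                    max_count: int = 60) -> int:
    """Increase the first set value (the number of masked characters) by a set
    proportion after every `rounds_per_increase` rounds of pre-training, never
    exceeding the maximum value threshold. All concrete numbers are assumptions."""
    if rounds_done > 0 and rounds_done % rounds_per_increase == 0:
        increased = max(current_count + 1, round(current_count * (1 + increase_ratio)))
        current_count = min(increased, max_count)
    return current_count
```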
Optionally, the pre-trained language model comprises at least two self-attention layers; the first determination module 504 is further configured to:
determining an enhanced semantic vector of each character in the mask training sample through the first self-attention layer, wherein the enhanced semantic vector of each character is determined and obtained based on the weight of the position of each character in the mask training sample;
and determining an enhanced semantic vector of each character in the mask training sample through a second self-attention layer, wherein the enhanced semantic vector of each character is determined based on the weights at the positions of the non-masked characters in the mask training sample, and the second self-attention layer is closer to the classification layer of the pre-training language model than the first self-attention layer.
One embodiment of the present specification provides a training device for a pre-training language model: when the pre-training language model is trained, mask processing is performed on a first set number of characters in a sample text to obtain a mask training sample, the enhanced semantic vector of each character is determined based on the weights at the positions of the non-masked characters in the mask training sample, the loss value of the pre-training language model is determined according to the enhanced semantic vector of each character, and the pre-training language model is trained accordingly, thereby completing the pre-training process of the language model. In this way, when the enhanced semantic vector of each character in the mask training sample is calculated, the weights at the positions of the masked characters can be ignored, which speeds up the convergence of the pre-training language model, avoids over-training, and improves the model migration capability; it also avoids the mismatch between the masked characters and downstream tasks when the number of masked characters is large, thereby improving the processing accuracy of downstream tasks.
The above is an illustrative scheme of a training apparatus for pre-training a language model according to the present embodiment. It should be noted that the technical solution of the training apparatus for the pre-trained language model and the technical solution of the training method for the pre-trained language model belong to the same concept, and details of the technical solution of the training apparatus for the pre-trained language model, which are not described in detail, can be referred to the description of the technical solution of the training method for the pre-trained language model.
Corresponding to the above method embodiment, the present specification further provides an embodiment of a training apparatus for a text processing model, and fig. 6 shows a schematic structural diagram of a training apparatus for a text processing model provided in an embodiment of the present specification. As shown in fig. 6, the apparatus includes:
a first obtaining module 602 configured to obtain a training sample, wherein the training sample carries a sample label;
a second determining module 604, configured to input the training sample into a pre-training language model of the text processing model, and determine an enhanced semantic vector of the training sample through a self-attention layer of the pre-training language model, where the pre-training language model is obtained by training through a training method of the pre-training language model;
a first obtaining module 606, configured to input the enhanced semantic vector of the training sample into a task processing model of the text processing model, and obtain a prediction processing result corresponding to the training sample;
and a second adjusting module 608 configured to calculate a loss value of the text processing model according to the prediction processing result and the sample label, adjust model parameters of the pre-training language model and the task processing model according to the loss value, and return to perform the operation step of obtaining the training sample until a training stop condition is reached, so as to obtain a trained text processing model.
The embodiment of the specification provides a training apparatus for a text processing model, in which the pre-training language model is obtained by training through the above training method for a pre-training language model. Because the pre-training language model ignores the weights at the positions of the masked characters, it has strong model migration capability and avoids the mismatch between the masked characters and downstream tasks when the number of masked characters is large, so that the pre-trained language model can be well matched with various task processing models, which improves the training accuracy of the text processing model and thus the processing accuracy of downstream tasks.
The above is a schematic scheme of a training apparatus for a text processing model according to this embodiment. It should be noted that the technical solution of the training apparatus for the text processing model and the technical solution of the training method for the text processing model belong to the same concept, and details that are not described in detail in the technical solution of the training apparatus for the text processing model can be referred to the description of the technical solution of the training method for the text processing model.
Corresponding to the above method embodiment, this specification further provides a text processing apparatus embodiment, and fig. 7 shows a schematic structural diagram of a text processing apparatus provided in an embodiment of this specification. As shown in fig. 7, the apparatus includes:
a second obtaining module 702 configured to obtain a text to be processed;
a third determining module 704, configured to input the text to be processed into a pre-trained language model of the text processing model, and determine an enhanced semantic vector of the text to be processed through a self-attention layer of the pre-trained language model;
the second obtaining module 706 is configured to input the enhanced semantic vector of the text to be processed into a task processing model of the text processing model, and obtain a text processing result corresponding to the text to be processed, where the text processing model is obtained by training through a training method of the text processing model.
Optionally, the third determining module 704 is further configured to:
acquiring a query vector of a second character, a key vector of each character and a value vector of each character in the text to be processed, wherein the second character is any character in the text to be processed;
respectively carrying out vector operation on the query vector of the second character and the key vector of each character in the text to be processed to obtain the weight of each character relative to the second character, wherein the weight of the masked character in the text to be processed relative to the second character is 0;
and determining an enhanced semantic vector of the second character in the text to be processed based on the weight of each character relative to the second character.
The embodiment of the specification provides a text processing apparatus in which the pre-training language model is obtained by training through the above training method for a pre-training language model and is then fine-tuned together with a specific task processing model. Because the pre-training language model ignores the weights at the positions of the masked characters, it has strong model migration capability and avoids the mismatch between the masked characters and downstream tasks when the number of masked characters is large, so that the pre-trained language model can be well matched with various task processing models, which improves the training accuracy of the text processing model and thus the processing accuracy of downstream tasks.
The above is an exemplary scheme of a text processing apparatus of the present embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the text processing apparatus can be referred to the description of the technical solution of the text processing method.
FIG. 8 illustrates a block diagram of a computing device, according to one embodiment of the present description. The components of the computing device 800 include, but are not limited to, memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.
Computing device 800 also includes an access device 840, which enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 840 may include one or more of any type of network interface (e.g., a Network Interface Controller), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 8 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
The processor 820 is configured to execute computer-executable instructions to implement the operation steps of the training method for the pre-training language model of the first aspect, the training method for the text processing model of the second aspect, or the text processing method of the third aspect.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device belongs to the same concept as the technical solutions of the training method of the pre-training language model, the training method of the text processing model, and the text processing method, and details that are not described in detail in the technical solutions of the computing device can be referred to the descriptions of the technical solutions of the training method of the pre-training language model, the training method of the text processing model, and the text processing method.
An embodiment of the present specification further provides a computer readable storage medium storing computer instructions which, when executed by a processor, implement the method for training a pre-trained language model of the first aspect, or the method for training a text processing model of the second aspect, or the operating steps of the text processing method of the third aspect.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium is the same as the technical solutions of the above-mentioned pre-training language model training method, text processing model training method, and text processing method, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the descriptions of the technical solutions of the above-mentioned pre-training language model training method, text processing model training method, and text processing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the teaching of the embodiments of the present disclosure. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, and to thereby enable others skilled in the art to best understand the specification and utilize the specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (13)

1. A method of training a pre-trained language model, comprising:
performing mask processing on characters of a first set numerical value in a sample text to obtain a mask training sample;
inputting the mask training sample into a pre-training language model, and determining an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, wherein the enhanced semantic vector of each character is determined based on the weight of a non-mask character in the mask training sample;
determining a loss value of the pre-training language model according to the enhanced semantic vector of each character, adjusting model parameters of the pre-training language model according to the loss value, and returning to execute the operation step of performing mask processing on the first set numerical character in the sample text to obtain a mask training sample until a training stopping condition is reached to obtain the pre-training language model after pre-training.
2. The method for training a pre-trained language model according to claim 1, wherein the determining, by a self-attention layer in the pre-trained language model, an enhanced semantic vector of each character in the mask training sample comprises:
determining the weight of each character in the mask training sample relative to a first character, wherein the first character is any character in the mask training sample, and the weight of a masked character in the mask training sample relative to the first character is 0;
and determining an enhanced semantic vector of the first character in the mask training sample based on the weight of each character relative to the first character.
3. The method of training a pre-trained language model according to claim 2, wherein said determining weights of respective characters in the mask training samples with respect to a first character comprises:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in the mask training sample;
setting each vector element in the key vector corresponding to the masked character as 0;
and respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character.
4. The method of training a pre-trained language model according to claim 3, wherein said determining weights of respective characters in said mask training samples with respect to a first character comprises:
acquiring a query vector of a first character, a key vector of each character and a value vector of each character in the mask training sample;
respectively carrying out vector operation on the query vector of the first character and the key vector of each character in the mask training sample to obtain the weight of each character relative to the first character;
and setting the weight of the masked character relative to the first character in the weights of the characters relative to the first character to be 0.
5. The method for training a pre-trained language model according to any one of claims 1-4, wherein the masking a first set value character in a sample text to obtain a masked training sample comprises:
replacing characters with a first proportion in the first set numerical value characters with specific symbols, replacing characters with a second proportion in the first set numerical value characters with set characters, and keeping characters with a third proportion in the first set numerical value characters with original characters to obtain the mask training sample;
wherein the first ratio is greater than the second ratio and greater than the third ratio, and the sum of the first ratio, the second ratio and the third ratio is 1.
6. The method for training a pre-trained language model according to any one of claims 1-4, wherein the determining a loss value of the pre-trained language model according to the enhanced semantic vector of each character comprises:
inputting the enhanced semantic vector of each character into a classification layer of the pre-training language model to obtain a predicted character corresponding to the masked character;
and calculating the loss value of the pre-training language model according to the predicted character corresponding to the masked character and the masked character.
7. The method for training a pre-trained language model according to any one of claims 1-4, wherein the step of masking the first set value characters in the sample text to obtain the masked training sample comprises:
and increasing the first set numerical value by a set proportion each time the pre-training language model has been trained for a second set numerical value of rounds, wherein the first set numerical value is less than or equal to a maximum numerical value threshold.
8. The method for training a pre-trained language model according to any one of claims 1-4, said pre-trained language model comprising at least two self-attention layers;
the inputting the mask training sample into a pre-training language model, and determining an enhanced semantic vector of each character in the mask training sample through a self-attention layer in the pre-training language model, includes:
determining an enhanced semantic vector of each character in the mask training sample through a first self-attention layer, wherein the enhanced semantic vector of each character is determined based on the weight of the position of each character in the mask training sample;
determining an enhanced semantic vector of each character in the mask training sample through a second self-attention layer, wherein the enhanced semantic vector of each character is determined based on the weight of the position of the non-mask character in the mask training sample, and the second self-attention layer is closer to a classification layer in the pre-training language model than the first self-attention layer.
9. A training method of a text processing model comprises the following steps:
obtaining a training sample, wherein the training sample carries a sample label;
inputting the training sample into a pre-training language model of a text processing model, and determining an enhanced semantic vector of the training sample through a self-attention layer of the pre-training language model, wherein the pre-training language model is obtained by training through the training method of any one of claims 1 to 8;
inputting the enhanced semantic vector of the training sample into a task processing model of the text processing model to obtain a prediction processing result corresponding to the training sample;
and calculating a loss value of the text processing model according to the prediction processing result and the sample label, adjusting model parameters of the pre-training language model and the task processing model according to the loss value, and returning to execute the operation step of obtaining the training sample until a training stopping condition is reached to obtain a trained text processing model.
10. A training apparatus for pre-training a language model, comprising:
the processing module is configured to perform mask processing on first set numerical characters in the sample text to obtain a mask training sample;
a first determining module, configured to input the mask training sample into a pre-training language model, and determine, through a self-attention layer in the pre-training language model, an enhanced semantic vector of each character in the mask training sample, where the enhanced semantic vector of each character is determined based on a weight at a position of a non-mask character in the mask training sample;
and the first adjusting module is configured to determine a loss value of the pre-training language model according to the enhanced semantic vector of each character, adjust a model parameter of the pre-training language model according to the loss value, and return to execute the operation step of performing mask processing on the first set numerical character in the sample text to obtain a mask training sample until a training stop condition is reached to obtain the pre-training language model after pre-training.
11. A training apparatus for a text processing model, comprising:
a first obtaining module configured to obtain a training sample, wherein the training sample carries a sample label;
a second determining module, configured to input the training sample into a pre-training language model of a text processing model, and determine an enhanced semantic vector of the training sample through a self-attention layer of the pre-training language model, wherein the pre-training language model is obtained by training according to any one of the preceding claims 1 to 8;
a first obtaining module, configured to input the enhanced semantic vector of the training sample into a task processing model of the text processing model, and obtain a prediction processing result corresponding to the training sample;
and the second adjusting module is configured to calculate a loss value of the text processing model according to the prediction processing result and the sample label, adjust model parameters of the pre-training language model and the task processing model according to the loss value, and return to execute the operation step of obtaining the training sample until a training stopping condition is reached to obtain a trained text processing model.
12. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the method for training a pre-trained language model according to any one of claims 1 to 8, or the operating steps of the method for training a text processing model according to claim 9.
13. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method of training a pre-trained language model according to any one of claims 1 to 8, or the operational steps of the method of training a text processing model according to claim 9.
CN202210152672.9A 2022-02-18 2022-02-18 Training method and device for pre-training language model Pending CN114579699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210152672.9A CN114579699A (en) 2022-02-18 2022-02-18 Training method and device for pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210152672.9A CN114579699A (en) 2022-02-18 2022-02-18 Training method and device for pre-training language model

Publications (1)

Publication Number Publication Date
CN114579699A true CN114579699A (en) 2022-06-03

Family

ID=81775291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210152672.9A Pending CN114579699A (en) 2022-02-18 2022-02-18 Training method and device for pre-training language model

Country Status (1)

Country Link
CN (1) CN114579699A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245197A (en) * 2023-02-21 2023-06-09 北京数美时代科技有限公司 Method, system, medium and equipment for improving training rate of language model
CN116562284A (en) * 2023-04-14 2023-08-08 湖北经济学院 Government affair text automatic allocation model training method and device
CN116932728A (en) * 2023-08-30 2023-10-24 苏州浪潮智能科技有限公司 Language interaction method, device, communication equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241304A (en) * 2020-01-16 2020-06-05 平安科技(深圳)有限公司 Answer generation method based on deep learning, electronic device and readable storage medium
CN112417139A (en) * 2020-11-19 2021-02-26 深圳大学 Abstract generation method based on pre-training language model
US20210089724A1 (en) * 2019-09-25 2021-03-25 Google Llc Contrastive Pre-Training for Language Tasks
CN113032545A (en) * 2021-05-29 2021-06-25 成都晓多科技有限公司 Method and system for conversation understanding and answer configuration based on unsupervised conversation pre-training
CN113239705A (en) * 2021-07-12 2021-08-10 北京百度网讯科技有限公司 Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN113553864A (en) * 2021-06-30 2021-10-26 北京百度网讯科技有限公司 Translation model training method and device, electronic equipment and storage medium
CN114023306A (en) * 2022-01-04 2022-02-08 阿里云计算有限公司 Processing method for pre-training language model and spoken language understanding system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089724A1 (en) * 2019-09-25 2021-03-25 Google Llc Contrastive Pre-Training for Language Tasks
CN111241304A (en) * 2020-01-16 2020-06-05 平安科技(深圳)有限公司 Answer generation method based on deep learning, electronic device and readable storage medium
CN112417139A (en) * 2020-11-19 2021-02-26 深圳大学 Abstract generation method based on pre-training language model
CN113032545A (en) * 2021-05-29 2021-06-25 成都晓多科技有限公司 Method and system for conversation understanding and answer configuration based on unsupervised conversation pre-training
CN113553864A (en) * 2021-06-30 2021-10-26 北京百度网讯科技有限公司 Translation model training method and device, electronic equipment and storage medium
CN113239705A (en) * 2021-07-12 2021-08-10 北京百度网讯科技有限公司 Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN114023306A (en) * 2022-01-04 2022-02-08 阿里云计算有限公司 Processing method for pre-training language model and spoken language understanding system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245197A (en) * 2023-02-21 2023-06-09 北京数美时代科技有限公司 Method, system, medium and equipment for improving training rate of language model
CN116562284A (en) * 2023-04-14 2023-08-08 湖北经济学院 Government affair text automatic allocation model training method and device
CN116562284B (en) * 2023-04-14 2024-01-26 湖北经济学院 Government affair text automatic allocation model training method and device
CN116932728A (en) * 2023-08-30 2023-10-24 苏州浪潮智能科技有限公司 Language interaction method, device, communication equipment and storage medium
CN116932728B (en) * 2023-08-30 2024-01-26 苏州浪潮智能科技有限公司 Language interaction method, device, communication equipment and storage medium

Similar Documents

Publication Publication Date Title
US20230100376A1 (en) Text sentence processing method and apparatus, computer device, and storage medium
CN111639175B (en) Self-supervision dialogue text abstract method and system
CN114579699A (en) Training method and device for pre-training language model
CN111460833A (en) Text generation method, device and equipment
CN111581970B (en) Text recognition method, device and storage medium for network context
CN112835585A (en) Program understanding method and system based on abstract syntax tree
CN115906815B (en) Error correction method and device for modifying one or more types of error sentences
AU2022221471A1 (en) Automatic photo editing via linguistic request
CN111027292A (en) Method and system for generating limited sampling text sequence
CN114444481B (en) Sentiment analysis and generation method of news comment
CN113177113B (en) Task type dialogue model pre-training method, device, equipment and storage medium
CN112417118B (en) Dialog generation method based on marked text and neural network
CN113297374A (en) Text classification method based on BERT and word feature fusion
Miao et al. Low‐latency transformer model for streaming automatic speech recognition
CN115221315A (en) Text processing method and device, and sentence vector model training method and device
CN116521887A (en) Knowledge graph complex question-answering system and method based on deep learning
CN113326695B (en) Emotion polarity analysis method based on transfer learning
Jalaja et al. A behavioral chatbot using encoder-decoder architecture: Humanizing conversations
CN111091011B (en) Domain prediction method, domain prediction device and electronic equipment
CN113568969A (en) Information extraction method, device, equipment and computer readable storage medium
CN114638238A (en) Training method and device of neural network model
CN110390010A (en) A kind of Method for Automatic Text Summarization
Gupta A Review of Generative AI from Historical Perspectives
US20240153259A1 (en) Single image concept encoder for personalization using a pretrained diffusion model
Wang et al. Capsule Network Based on Multi-granularity Attention Model for Text Classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination