CN116029354A - Text pair-oriented Chinese language model pre-training method - Google Patents

Text pair-oriented Chinese language model pre-training method

Info

Publication number
CN116029354A
CN116029354A (application CN202210950700.1A)
Authority
CN
China
Prior art keywords
text
word
emb
vector
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210950700.1A
Other languages
Chinese (zh)
Other versions
CN116029354B (en)
Inventor
庞帅
战科宇
曹延森
王华英
王礼鑫
张欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinaso Information Technology Co ltd
Original Assignee
Chinaso Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinaso Information Technology Co ltd filed Critical Chinaso Information Technology Co ltd
Priority to CN202210950700.1A priority Critical patent/CN116029354B/en
Publication of CN116029354A publication Critical patent/CN116029354A/en
Application granted granted Critical
Publication of CN116029354B publication Critical patent/CN116029354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a text pair-oriented Chinese language model pre-training method, which comprises the following steps: inputting a text pair, the text pair comprising a text A and a text B arranged in pairs; randomly selecting n characters in text A and masking each selected character with a mask character to obtain a masked text A1; performing word segmentation on text B and shuffling the order of the segmented words to obtain a shuffled text B1; splicing text A1, text B1 and text B to obtain a spliced text; and, after encoding the spliced text, applying a masked-character prediction task and a word-order recovery task respectively to obtain a total loss function. The method can learn the language and word-order information in the text pair more fully, thereby improving the effect of the pre-trained model.

Description

Text pair-oriented Chinese language model pre-training method
Technical Field
The invention belongs to the technical field of computer natural language processing, and particularly relates to a text pair-oriented Chinese language model pre-training method.
Background
The advent of Pre-trained Models (PTMs) has brought NLP into a new era. At present, many industrial applications adopt the paradigm of fine-tuning PTMs on downstream task data, and have achieved results exceeding previous approaches.
Many NLP tasks take the form of text pairs, such as text semantic matching and question-answer pair (QA) matching. For such tasks, academia has proposed a number of pre-training tasks for training the corresponding PTMs. In summary, two approaches dominate: first, language-model methods, whose core idea is to mask certain words and then try to recover the masked words during training; second, predicting the relationship between the two texts of a pair, for example shuffling the order of consecutive sentences of an article and training the model to judge whether the two texts are consecutive in the original. The knowledge learned by these pre-training tasks is generally simple, carries relatively limited information, and easily introduces noise. How to make a text-pair pre-training task learn the language and other information in the text pair more fully, and thereby improve its efficiency, is a problem that remains to be solved.
Disclosure of Invention
To address the above defects in the prior art, the invention provides a text pair-oriented Chinese language model pre-training method that effectively solves these problems.
The technical scheme adopted by the invention is as follows:
the invention provides a text pair-oriented Chinese language model pre-training method, which comprises the following steps:
step 1, inputting text pairs; the text pair comprises a text A and a text B which are arranged in pairs;
step 2, randomly selecting n characters in text A, and masking each randomly selected character with a mask character to obtain a masked text, which is denoted as text A1;
performing word segmentation on text B, and shuffling the order of the segmented words to obtain a shuffled text, which is denoted as text B1;
step 3, splitting text A1, text B1 and text B into individual characters;
splicing text A1, text B1 and text B to obtain the spliced text [CLS] A1 [SEP] B1 [SOS] B [EOS]; wherein [CLS], [SEP], [SOS] and [EOS] are respectively a first separator, a second separator, a third separator and a fourth separator;
step 4, taking the spliced text as input to an encoding module, and respectively obtaining the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B;
Wherein:
step 4.1, the encoding vector matrix V_A1 of text A1 is obtained as follows:
1) The character encoding Emb_A1(i) of the i-th character in text A1 is the sum of the character vector Emb_A1char(i) of the i-th character in text A1, the position encoding Emb_A1pos(i) of the i-th character in text A1, and the type encoding Emb_type(A1) of text A1:
Emb_A1(i) = Emb_A1char(i) + Emb_A1pos(i) + Emb_type(A1)
2) The character encodings of all characters in text A1 form the encoding vector matrix V_A1 of text A1.
step 4.2, the encoding vector matrix V_B1 of text B1 is obtained as follows:
1) The character encoding Emb_B1(i) of the i-th character in text B1 is the sum of the character vector Emb_B1char(i) of the i-th character in text B1, the position encoding Emb_B1pos(i) of the i-th character in text B1, and the type encoding Emb_type(B1) of text B1:
Emb_B1(i) = Emb_B1char(i) + Emb_B1pos(i) + Emb_type(B1)
2) The character encodings of all characters in text B1 form the encoding vector matrix V_B1 of text B1.
step 4.3, the encoding vector matrix V_B of text B is obtained as follows:
1) The character encoding Emb_B(i) of the i-th character in text B is the sum of the character vector Emb_Bchar(i) of the i-th character in text B, the position encoding Emb_Bpos(i) of the i-th character in text B, and the type encoding Emb_type(B) of text B:
Emb_B(i) = Emb_Bchar(i) + Emb_Bpos(i) + Emb_type(B)
2) The character encodings of all characters in text B form the encoding vector matrix V_B of text B.
Wherein:
the character vectors Emb_A1char(i), Emb_B1char(i) and Emb_Bchar(i) are all obtained by looking up a dictionary;
the position encodings Emb_A1pos(i), Emb_B1pos(i) and Emb_Bpos(i) refer to the position of each character within the spliced text;
the type encodings Emb_type(A1), Emb_type(B1) and Emb_type(B) correspond to three different text types;
step 5, inputting the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B into a pre-training task layer, and computing the total loss function Loss_total as follows:
step 5.1, the pre-training task layer comprises a masked-character prediction task and a word-order recovery task;
step 5.2, the masked-character prediction task yields the first loss function Loss_1(x, θ):
Loss_1(x, θ) = E_{a ∈ len(A1)} [ -log P(x_a | V_A1, V_B1) ]
Wherein:
P(x_a | V_A1, V_B1) means: the vector of the masked character x_a to be predicted is read from the encoding vector matrix V_A1 of text A1; the read vector is spliced with the encoding vector matrix V_B of text B to obtain a spliced vector, and the spliced vector is multiplied by the dictionary matrix to obtain a probability matrix; the maximum probability value in the probability matrix is P(x_a | V_A1, V_B1); the dictionary matrix is the matrix formed by the character vectors of all characters in the dictionary;
-log P(x_a | V_A1, V_B1): a cross-entropy computation, namely applying the standard cross entropy to P(x_a | V_A1, V_B1) to obtain the loss value of the masked character x_a;
E(): an averaging operation;
a ∈ len(A1): the a-th character of text A1 is masked;
thus, each masked character yields a prediction loss value; the loss values of the masked characters are then summed and divided by the number of masked characters to obtain the average loss value;
step 5.3, the word-order recovery task yields the second loss function Loss_2(x, θ):
Loss_2(x, θ) = E_{b = 1, …, c} [ -log P(x_b | x_{b-1:0}, V_A1, V_B1) ]
Wherein:
b denotes the position of the character being predicted in text B, which has c characters in total, so b = 1, 2, …, c;
for the character vector x_b predicted at position b of text B, the loss value -log P(x_b | x_{b-1:0}, V_A1, V_B1) is obtained as follows:
1) Input V_A1, V_B1 and x_{b-1:0};
wherein x_{b-1:0} is the vector formed by splicing the separator vector x_0 preceding text B, the vector x_1 of the 1st character of text B, …, and the vector x_{b-1} of the (b-1)-th character of text B;
2) Splice V_A1, V_B1 and x_{b-1:0} to obtain the context vector of the character vector x_b to be predicted;
3) P(x_b | x_{b-1:0}, V_A1, V_B1) means: a sequence-to-sequence (seq2seq) model comprising an encoder side and a decoder side is used; the context vector is input at the encoder side, and the decoder side outputs the predicted probability value of the character vector x_b;
4) -log P(x_b | x_{b-1:0}, V_A1, V_B1): the standard cross entropy is applied to P(x_b | x_{b-1:0}, V_A1, V_B1) to obtain the loss value of the character vector x_b;
each predicted character vector in text B yields a loss value; these loss values are averaged to obtain the second loss function Loss_2(x, θ);
step 5.4, combining the first loss function Loss_1(x, θ) and the second loss function Loss_2(x, θ) to obtain the total loss function Loss_total;
step 6, judging whether training has reached the maximum number of iterations; if not, computing the gradient from the total loss function Loss_total, back-propagating and updating the model parameters θ, and returning to step 4; if so, stopping training to obtain the pre-trained language model.
Preferably, in step 2, for each randomly selected character, the mask character [MASK] is used to replace the corresponding character, resulting in the masked text.
Preferably, text A is a question text and text B is the answer text.
The text pair-oriented Chinese language model pre-training method provided by the invention has the following advantages:
the invention provides a text pair-oriented Chinese language model pre-training method, which can learn language and word sequence information in a text pair more fully, thereby improving the pre-training model effect.
Drawings
FIG. 1 is a flow chart of a text-pair-oriented Chinese language model pre-training method provided by the invention.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a text pair-oriented Chinese language model pre-training method, which can learn language and word sequence information in a text pair more fully, thereby improving the pre-training model effect.
Referring to fig. 1, the invention provides a text-pair-oriented Chinese language model pre-training method, which comprises the following steps:
step 1, inputting text pairs; the text pair comprises a text A and a text B which are arranged in pairs;
In the invention, the sources of the corpus text pairs include:
firstly, open-source similarity corpora on the Internet;
secondly, open-source question-answer pair corpora on the Internet; in this case, text A is a question text and text B is the answer text;
thirdly, anonymized user queries from a search engine together with the corresponding user click information (article titles, abstracts, body text and the like, spliced together).
Step 2, randomly selecting n words in the text A, wherein each randomly selected word adopts shielding characters to carry out shielding treatment to obtain a shielded text, and the shielded text is expressed as a text A1;
in this step, for each randomly selected word, the masking character [ MASK ] is used to replace the corresponding word, resulting in a masked text.
Word segmentation is carried out on the text B, and each word after word segmentation is disordered, so that a text with disordered sequence is obtained and is expressed as a text B1;
for example, text A is the query of the user in the search engine and text B is the click browsing information of the user.
Assuming that the text A is "I love history museum", masking the two words of "calendar" and "history" to obtain the text A1 as follows: "I love [ MASK ] [ MASK ] museum".
Assuming that the text B is "rising sun in museum", the text B is "rising sun in museum" after word segmentation, the text B1 is obtained by scrambling the sequence: "museum on sun up".
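As a concrete illustration of the masking and shuffling above, the following Python sketch masks n randomly chosen characters of text A and shuffles the segmented words of text B. The `segment` argument stands in for any Chinese word segmenter (the patent does not name one), and all function and variable names are illustrative assumptions rather than the patent's own implementation.

```python
import random

MASK = "[MASK]"

def mask_chars(text_a: str, n: int):
    """Step 2: replace n randomly chosen characters of text A with [MASK] -> text A1."""
    chars = list(text_a)
    for idx in random.sample(range(len(chars)), k=min(n, len(chars))):
        chars[idx] = MASK
    return chars                      # character/token list of text A1

def shuffle_words(text_b: str, segment):
    """Segment text B into words and shuffle their order -> text B1."""
    words = segment(text_b)           # e.g. ["museum", "sun", "rises"] for the example above
    shuffled = list(words)
    random.shuffle(shuffled)
    return shuffled, words            # (shuffled words of B1, original words of B)
```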
Step 3, dividing the text A1, the text B1 and the text B according to the word, correspondingly obtaining the text A1, the text B1 and the text B;
splicing the text A1, the text B1 and the text B to obtain a spliced text [ CLS ] A1[ SEP ] B1[ SOS ] B [ EOS ]; wherein: [ CLS ], [ SEP ], [ SOS ] and [ EOS ] are respectively: a first separator, a second separator, a third separator, and a fourth separator;
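A minimal sketch of this character splitting and splicing step, under the assumption that B1 and B are re-split into characters before concatenation and that the separators are single tokens; the function name is illustrative.

```python
def build_spliced_text(a1_tokens, b1_words, b_words):
    """Step 3: build [CLS] A1 [SEP] B1 [SOS] B [EOS] as a flat token sequence."""
    b1_chars = list("".join(b1_words))    # split the shuffled text B1 into characters
    b_chars = list("".join(b_words))      # split text B into characters
    return (["[CLS]"] + list(a1_tokens)
            + ["[SEP]"] + b1_chars
            + ["[SOS]"] + b_chars
            + ["[EOS]"])
```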
step 4, taking the spliced text as input to an encoding module, and respectively obtaining the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B;
Wherein:
step 4.1, the encoding vector matrix V_A1 of text A1 is obtained as follows:
1) The character encoding Emb_A1(i) of the i-th character in text A1 is the sum of the character vector Emb_A1char(i) of the i-th character in text A1, the position encoding Emb_A1pos(i) of the i-th character in text A1, and the type encoding Emb_type(A1) of text A1:
Emb_A1(i) = Emb_A1char(i) + Emb_A1pos(i) + Emb_type(A1)
2) The character encodings of all characters in text A1 form the encoding vector matrix V_A1 of text A1.
step 4.2, the encoding vector matrix V_B1 of text B1 is obtained as follows:
1) The character encoding Emb_B1(i) of the i-th character in text B1 is the sum of the character vector Emb_B1char(i) of the i-th character in text B1, the position encoding Emb_B1pos(i) of the i-th character in text B1, and the type encoding Emb_type(B1) of text B1:
Emb_B1(i) = Emb_B1char(i) + Emb_B1pos(i) + Emb_type(B1)
2) The character encodings of all characters in text B1 form the encoding vector matrix V_B1 of text B1.
step 4.3, the encoding vector matrix V_B of text B is obtained as follows:
1) The character encoding Emb_B(i) of the i-th character in text B is the sum of the character vector Emb_Bchar(i) of the i-th character in text B, the position encoding Emb_Bpos(i) of the i-th character in text B, and the type encoding Emb_type(B) of text B:
Emb_B(i) = Emb_Bchar(i) + Emb_Bpos(i) + Emb_type(B)
2) The character encodings of all characters in text B form the encoding vector matrix V_B of text B.
Wherein:
the character vectors Emb_A1char(i), Emb_B1char(i) and Emb_Bchar(i) are all obtained by looking up a dictionary;
the position encodings Emb_A1pos(i), Emb_B1pos(i) and Emb_Bpos(i) refer to the position of each character within the spliced text;
the type encodings Emb_type(A1), Emb_type(B1) and Emb_type(B) correspond to three different text types.
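The step-4 character encoding (character vector + position encoding + type encoding, with positions counted over the spliced text and three segment types for A1, B1 and B) can be sketched as follows. The hidden size, maximum length and the use of PyTorch are assumptions, not specified by the patent.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 768, max_len: int = 512, n_types: int = 3):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden)   # character vector Emb_char, looked up in the dictionary
        self.pos_emb = nn.Embedding(max_len, hidden)       # position encoding Emb_pos over the spliced text
        self.type_emb = nn.Embedding(n_types, hidden)      # type encoding Emb_type for the A1 / B1 / B segments

    def forward(self, char_ids: torch.Tensor, type_ids: torch.Tensor) -> torch.Tensor:
        # char_ids, type_ids: (batch, seq_len) over the spliced text [CLS] A1 [SEP] B1 [SOS] B [EOS]
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        return (self.char_emb(char_ids)
                + self.pos_emb(positions)[None, :, :]
                + self.type_emb(type_ids))
```

In a typical implementation these embeddings would then pass through a Transformer-style encoder whose outputs over the A1, B1 and B segments give V_A1, V_B1 and V_B respectively; the patent leaves the internal structure of the encoding module open.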
step 5, inputting the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B into a pre-training task layer, and computing the total loss function Loss_total as follows:
step 5.1, the pre-training task layer comprises a masked-character prediction task and a word-order recovery task;
step 5.2, the masked-character prediction task yields the first loss function Loss_1(x, θ):
Loss_1(x, θ) = E_{a ∈ len(A1)} [ -log P(x_a | V_A1, V_B1) ]
Wherein:
P(x_a | V_A1, V_B1) means: the vector of the masked character x_a to be predicted is read from the encoding vector matrix V_A1 of text A1; the read vector is spliced with the encoding vector matrix V_B of text B to obtain a spliced vector, and the spliced vector is multiplied by the dictionary matrix to obtain a probability matrix; the maximum probability value in the probability matrix is P(x_a | V_A1, V_B1); the dictionary matrix is the matrix formed by the character vectors of all characters in the dictionary;
-log P(x_a | V_A1, V_B1): a cross-entropy computation, namely applying the standard cross entropy to P(x_a | V_A1, V_B1) to obtain the loss value of the masked character x_a;
E(): an averaging operation;
a ∈ len(A1): the a-th character of text A1 is masked;
thus, each masked character yields a prediction loss value; the loss values of the masked characters are then summed and divided by the number of masked characters to obtain the average loss value.
This task predicts the characters of text A1 that were masked out, taking text A1 and text B1 as input. Since text A1 and text B1 form a text pair and can attend to each other, the task improves the understanding of text A1 and increases the diversity of the information about text A1.
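A minimal sketch of Loss_1, assuming the encoder output vector at each masked position is scored against the dictionary (character-embedding) matrix and averaged with cross entropy. The splicing of the masked-position vector with V_B described above is omitted here for brevity, so this is an approximation of step 5.2, not the patent's exact computation; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_char_loss(hidden: torch.Tensor,
                     char_emb_weight: torch.Tensor,
                     mask_positions: torch.Tensor,
                     target_ids: torch.Tensor) -> torch.Tensor:
    """Loss_1: average cross entropy over the masked characters of text A1.

    hidden:          (seq_len, d) encoder output over the spliced text
    char_emb_weight: (vocab, d) dictionary matrix of character vectors
    mask_positions:  (n_mask,) indices of the masked characters within A1's span
    target_ids:      (n_mask,) true character ids at the masked positions
    """
    masked_states = hidden[mask_positions]          # vectors read at the masked positions
    logits = masked_states @ char_emb_weight.T      # score against every character in the dictionary
    return F.cross_entropy(logits, target_ids)      # E[-log P(x_a | V_A1, V_B1)]
```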
Step 5.3, obtaining a second Loss function Loss by recovering tasks through the word order and adopting the following formula 2 (x,θ):
Figure SMS_4
Wherein:
b represents the position of the word predicted in text B, which has c words in total, so b=1, 2, …, c;
for word vector x predicted at bit B in text B b The loss value-log P (x) was obtained by the following method b |x b-1:0 ,V A1 ,V B1 ):
1) Input V A1 ,V B1 And x b-1:0
Wherein:
x b-1:0 the meaning is as follows: the 0 th bit separator vector x in front of text B 0 Text B1 st bit word vector x 1 …, text B-1 bit word vector x b-1 Vector formed by splicing;
2) Let V A1 ,V B1 And x b-1:0 Performing splicing operation to obtain word vector x to be predicted b Context vector of (a);
3)P(x b |x b-1:0 ,V A1 ,V B1 ) The meaning is as follows: using a sequence-to-sequence seq2seq model, including an encoding end and a decoding end; inputting a context vector at an encoding end; outputting the predicted word direction at the decoding endQuantity x b Is set, and a predicted probability value thereof;
4)-log P(x b |x b-1:0 ,V A1 ,V B1 ): using standard cross entropy pairs P (x b |x b-1:0 ,V A1 ,V B1 ) Calculating to obtain word vector x b Is a loss value of (2);
each predicted word vector in the text B obtains a loss value; averaging the Loss values to obtain a second Loss function Loss 2 (x,θ);
This step restores the character order of text B and is implemented in a generative manner, as illustrated below.
For example, text A1 is "I love [MASK][MASK] museum", text B1 is the shuffled "sun museum rises", and text B is "museum sun rises".
Splicing text A1, text B1 and text B yields the spliced text:
[CLS] I love [MASK][MASK] museum [SEP] sun museum rises [SOS] museum sun rises [EOS]
When predicting the two [MASK] positions of text A1, only the part "[CLS] I love [MASK][MASK] museum [SEP] sun museum rises" is visible.
When restoring the order of text B, one character of "museum sun rises [EOS]" is predicted at a time.
First, when predicting the first character of text B (b = 1), the part "[CLS] I love [MASK][MASK] museum [SEP] sun museum rises [SOS]" is mutually visible, namely: x_{b-1:0} = x_0 = [SOS];
when predicting the second character (b = 2), that part plus the first predicted character of text B is mutually visible, namely: x_{b-1:0} = [SOS] plus the first character of B;
when predicting the third character (b = 3), the first two predicted characters of text B are additionally visible;
and so on.
Thus, when restoring text B, each prediction of the next character of B takes as input the character vectors of B that have already been predicted; that is, the next character of B is predicted conditioned on the encodings of A1 and B1 and on the characters of B generated so far.
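The word-order recovery loss Loss_2 can be sketched as teacher-forced autoregressive cross entropy: at each position b the decoder sees the encoder memory (V_A1 and V_B1) and the characters of B up to position b-1, and is trained to predict the b-th character. Here `decoder` stands in for any seq2seq decoder (the patent only requires an encoder side and a decoder side); its signature and the other names are assumptions.

```python
import torch
import torch.nn.functional as F

def order_recovery_loss(decoder, memory: torch.Tensor, b_ids: torch.Tensor, sos_id: int) -> torch.Tensor:
    """Loss_2: average of -log P(x_b | x_{b-1:0}, V_A1, V_B1) over the c characters of B.

    memory: encoder states over [CLS] A1 [SEP] B1, i.e. V_A1 and V_B1, shape (m, d)
    b_ids:  character ids of text B in the correct order, shape (c,)
    """
    sos = torch.tensor([sos_id], dtype=b_ids.dtype)
    decoder_inputs = torch.cat([sos, b_ids[:-1]])    # teacher forcing: [SOS], x_1, ..., x_{c-1}
    logits = decoder(decoder_inputs, memory)         # assumed output shape (c, vocab)
    return F.cross_entropy(logits, b_ids)            # mean cross entropy over the c positions of B
```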
Step 5.4, for the first Loss function Loss 1 (x, θ) and a second Loss function Loss 2 (x, θ) to obtain the total Loss function Loss Total (S)
Step 6, judging whether the training reaches the maximum iteration number, if not, according to the total Loss function Loss Total (S) Obtaining a gradient, carrying out back transmission and parameter updating on the model parameter theta, and returning to the step 4; if yes, stopping training to obtain the pre-trained language model.
The invention provides a text pair-oriented Chinese language model pre-training method that applies a generative method on top of encoding the text-pair information: when the order of text B is restored, the semantic relation between text A1 and text B1 is learned (the information of A1 is encoded into the order restoration of B), and at the same time restoring the order of text B can be regarded as a higher-level form of language modeling.
In the text pair-oriented Chinese language model pre-training method of the invention, the text pair is pre-trained with a language-model task and a generative order-recovery task, so that the language and word-order information in the text pair can be learned more fully, thereby improving the effect of the pre-trained model.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications are also intended to fall within the scope of the present invention.

Claims (3)

1. A text pair-oriented Chinese language model pre-training method is characterized by comprising the following steps:
step 1, inputting text pairs; the text pair comprises a text A and a text B which are arranged in pairs;
step 2, randomly selecting n characters in text A, and masking each randomly selected character with a mask character to obtain a masked text, which is denoted as text A1;
performing word segmentation on text B, and shuffling the order of the segmented words to obtain a shuffled text, which is denoted as text B1;
step 3, splitting text A1, text B1 and text B into individual characters;
splicing text A1, text B1 and text B to obtain the spliced text [CLS] A1 [SEP] B1 [SOS] B [EOS]; wherein [CLS], [SEP], [SOS] and [EOS] are respectively a first separator, a second separator, a third separator and a fourth separator;
step 4, taking the spliced text as input to an encoding module, and respectively obtaining the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B;
wherein:
step 4.1, the encoding vector matrix V_A1 of text A1 is obtained as follows:
1) the character encoding Emb_A1(i) of the i-th character in text A1 is the sum of the character vector Emb_A1char(i) of the i-th character in text A1, the position encoding Emb_A1pos(i) of the i-th character in text A1, and the type encoding Emb_type(A1) of text A1:
Emb_A1(i) = Emb_A1char(i) + Emb_A1pos(i) + Emb_type(A1)
2) the character encodings of all characters in text A1 form the encoding vector matrix V_A1 of text A1;
step 4.2, the encoding vector matrix V_B1 of text B1 is obtained as follows:
1) the character encoding Emb_B1(i) of the i-th character in text B1 is the sum of the character vector Emb_B1char(i) of the i-th character in text B1, the position encoding Emb_B1pos(i) of the i-th character in text B1, and the type encoding Emb_type(B1) of text B1:
Emb_B1(i) = Emb_B1char(i) + Emb_B1pos(i) + Emb_type(B1)
2) the character encodings of all characters in text B1 form the encoding vector matrix V_B1 of text B1;
step 4.3, the encoding vector matrix V_B of text B is obtained as follows:
1) the character encoding Emb_B(i) of the i-th character in text B is the sum of the character vector Emb_Bchar(i) of the i-th character in text B, the position encoding Emb_Bpos(i) of the i-th character in text B, and the type encoding Emb_type(B) of text B:
Emb_B(i) = Emb_Bchar(i) + Emb_Bpos(i) + Emb_type(B)
2) the character encodings of all characters in text B form the encoding vector matrix V_B of text B;
wherein:
the character vectors Emb_A1char(i), Emb_B1char(i) and Emb_Bchar(i) are all obtained by looking up a dictionary;
the position encodings Emb_A1pos(i), Emb_B1pos(i) and Emb_Bpos(i) refer to the position of each character within the spliced text;
the type encodings Emb_type(A1), Emb_type(B1) and Emb_type(B) correspond to three different text types;
step 5, inputting the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B into a pre-training task layer, and computing the total loss function Loss_total as follows:
step 5.1, the pre-training task layer comprises a masked-character prediction task and a word-order recovery task;
step 5.2, the masked-character prediction task yields the first loss function Loss_1(x, θ):
Loss_1(x, θ) = E_{a ∈ len(A1)} [ -log P(x_a | V_A1, V_B1) ]
wherein:
P(x_a | V_A1, V_B1) means: reading the vector of the masked character x_a to be predicted from the encoding vector matrix V_A1 of text A1, splicing the read vector with the encoding vector matrix V_B of text B to obtain a spliced vector, and multiplying the spliced vector by the dictionary matrix to obtain a probability matrix; the maximum probability value in the probability matrix is P(x_a | V_A1, V_B1); the dictionary matrix is the matrix formed by the character vectors of all characters in the dictionary;
-log P(x_a | V_A1, V_B1): a cross-entropy computation, namely applying the standard cross entropy to P(x_a | V_A1, V_B1) to obtain the loss value of the masked character x_a;
E(): an averaging operation;
a ∈ len(A1): the a-th character of text A1 is masked;
thus, each masked character yields a prediction loss value; the loss values of the masked characters are then summed and divided by the number of masked characters to obtain the average loss value;
step 5.3, the word-order recovery task yields the second loss function Loss_2(x, θ):
Loss_2(x, θ) = E_{b = 1, …, c} [ -log P(x_b | x_{b-1:0}, V_A1, V_B1) ]
wherein:
b denotes the position of the character being predicted in text B, which has c characters in total, so b = 1, 2, …, c;
for the character vector x_b predicted at position b of text B, the loss value -log P(x_b | x_{b-1:0}, V_A1, V_B1) is obtained as follows:
1) inputting V_A1, V_B1 and x_{b-1:0};
wherein x_{b-1:0} is the vector formed by splicing the separator vector x_0 preceding text B, the vector x_1 of the 1st character of text B, …, and the vector x_{b-1} of the (b-1)-th character of text B;
2) splicing V_A1, V_B1 and x_{b-1:0} to obtain the context vector of the character vector x_b to be predicted;
3) P(x_b | x_{b-1:0}, V_A1, V_B1) means: using a sequence-to-sequence (seq2seq) model comprising an encoder side and a decoder side; the context vector is input at the encoder side, and the decoder side outputs the predicted probability value of the character vector x_b;
4) -log P(x_b | x_{b-1:0}, V_A1, V_B1): applying the standard cross entropy to P(x_b | x_{b-1:0}, V_A1, V_B1) to obtain the loss value of the character vector x_b;
each predicted character vector in text B yields a loss value; these loss values are averaged to obtain the second loss function Loss_2(x, θ);
step 5.4, combining the first loss function Loss_1(x, θ) and the second loss function Loss_2(x, θ) to obtain the total loss function Loss_total;
step 6, judging whether training has reached the maximum number of iterations; if not, computing the gradient from the total loss function Loss_total, back-propagating and updating the model parameters θ, and returning to step 4; if so, stopping training to obtain the pre-trained language model.
2. The text pair-oriented Chinese language model pre-training method of claim 1, wherein in step 2, for each randomly selected character, the mask character [MASK] is used to replace the corresponding character, resulting in the masked text.
3. The text pair-oriented Chinese language model pre-training method of claim 1, wherein text A is a question text and text B is the answer text.
CN202210950700.1A 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method Active CN116029354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210950700.1A CN116029354B (en) 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210950700.1A CN116029354B (en) 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method

Publications (2)

Publication Number Publication Date
CN116029354A (en) 2023-04-28
CN116029354B CN116029354B (en) 2023-08-01

Family

ID=86076489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210950700.1A Active CN116029354B (en) 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method

Country Status (1)

Country Link
CN (1) CN116029354B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569500A (en) * 2019-07-23 2019-12-13 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN112487786A (en) * 2019-08-22 2021-03-12 创新工场(广州)人工智能研究有限公司 Natural language model pre-training method based on disorder rearrangement and electronic equipment
KR20220044406A (en) * 2020-10-01 2022-04-08 네이버 주식회사 Method and system for controlling distributions of attributes in language models for text generation
CN112632997A (en) * 2020-12-14 2021-04-09 河北工程大学 Chinese entity identification method based on BERT and Word2Vec vector fusion
CN113343683A (en) * 2021-06-18 2021-09-03 山东大学 Chinese new word discovery method and device integrating self-encoder and countertraining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡益淮: "基于XLNET的抽取式多级语义融合模型" [An extractive multi-level semantic fusion model based on XLNET], 通信技术 (Communications Technology), No. 07

Also Published As

Publication number Publication date
CN116029354B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN110134771B (en) Implementation method of multi-attention-machine-based fusion network question-answering system
Kowsher et al. Bangla-bert: transformer-based efficient model for transfer learning and language understanding
Zhao et al. Attention-Based Convolutional Neural Networks for Sentence Classification.
CN115392259A (en) Microblog text sentiment analysis method and system based on confrontation training fusion BERT
CN115658890A (en) Chinese comment classification method based on topic-enhanced emotion-shared attention BERT model
Li et al. Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
Tang et al. Full attention-based bi-GRU neural network for news text classification
CN115129819A (en) Text abstract model production method and device, equipment and medium thereof
CN111428518B (en) Low-frequency word translation method and device
Yang et al. Generation-based parallel particle swarm optimization for adversarial text attacks
Waghela et al. Saliency attention and semantic similarity-driven adversarial perturbation
Jahan et al. A Comprehensive Study on NLP Data Augmentation for Hate Speech Detection: Legacy Methods, BERT, and LLMs
Behere et al. Text summarization and classification of conversation data between service chatbot and customer
CN116029354B (en) Text pair-oriented Chinese language model pre-training method
CN115688703B (en) Text error correction method, storage medium and device in specific field
Mastronardo et al. Enhancing a text summarization system with ELMo
CN115309898A (en) Word granularity Chinese semantic approximate countermeasure sample generation method based on knowledge enhanced BERT
Wei Research on internet text sentiment classification based on BERT and CNN-BiGRU
Zhu A Simple Survey of Pre-trained Language Models
Sattari et al. Improving image captioning with local attention mechanism
Liu et al. Raw-to-end name entity recognition in social media
Croce et al. Grammatical Feature Engineering for Fine-grained IR Tasks.
Xu et al. Incorporating forward and backward instances in a bi-lstm-cnn model for relation classification
CN110990385A (en) Software for automatically generating news headlines based on Sequence2Sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant