CN116029354A - Text pair-oriented Chinese language model pre-training method - Google Patents

Text pair-oriented Chinese language model pre-training method

Info

Publication number
CN116029354A
CN116029354A (application CN202210950700.1A)
Authority
CN
China
Prior art keywords
text
word
emb
vector
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210950700.1A
Other languages
Chinese (zh)
Other versions
CN116029354B (en)
Inventor
庞帅
战科宇
曹延森
王华英
王礼鑫
张欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinaso Information Technology Co ltd
Original Assignee
Chinaso Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinaso Information Technology Co ltd filed Critical Chinaso Information Technology Co ltd
Priority to CN202210950700.1A priority Critical patent/CN116029354B/en
Publication of CN116029354A publication Critical patent/CN116029354A/en
Application granted granted Critical
Publication of CN116029354B publication Critical patent/CN116029354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a text pair-oriented Chinese language model pre-training method, which comprises the following steps: inputting a text pair, the text pair comprising a text A and a text B arranged in pairs; randomly selecting n characters in text A and masking each selected character with a mask character to obtain a masked text A1; performing word segmentation on text B and shuffling the order of the segmented words to obtain a shuffled text B1; splicing text A1, text B1 and text B to obtain a spliced text; and, after encoding the spliced text, applying a masked-character prediction task and a word-order recovery task respectively to obtain a total loss function. The method can learn the language and word-order information in the text pair more fully, thereby improving the effect of the pre-trained model.

Description

Text pair-oriented Chinese language model pre-training method
Technical Field
The invention belongs to the technical field of computer natural language processing, and particularly relates to a text pair-oriented Chinese language model pre-training method.
Background
The advent of Pre-trained Models (PTMs) has brought NLP into a new era. At present, many industrial applications adopt the paradigm of fine-tuning PTMs on downstream task data, and have achieved results exceeding previous approaches.
Many NLP tasks take the form of text pairs, such as text semantic matching and question-answer pair (QA) matching. For such tasks, academia has proposed a number of pre-training tasks for training the corresponding PTMs. In summary, two approaches dominate: first, language-model methods, whose core idea is to mask certain words and then try to recover the masked words during training; second, predicting the relationship between the two texts of a pair, for example shuffling the order of consecutive sentences of an article and training the model to judge whether the two texts are consecutive in the original. The knowledge learned by these pre-training tasks is generally simple, carries relatively limited information, and easily introduces noise. How to make a text-pair pre-training task learn the language and other information in the text pair more fully, and thereby improve its efficiency, is a problem that remains to be solved.
Disclosure of Invention
To address the above defects in the prior art, the invention provides a text pair-oriented Chinese language model pre-training method that effectively solves these problems.
The technical scheme adopted by the invention is as follows:
the invention provides a text pair-oriented Chinese language model pre-training method, which comprises the following steps:
step 1, inputting text pairs; the text pair comprises a text A and a text B which are arranged in pairs;
step 2, randomly selecting n characters in text A, and masking each randomly selected character with a mask character to obtain a masked text, which is denoted as text A1;
performing word segmentation on text B, and shuffling the order of the segmented words to obtain a shuffled text, which is denoted as text B1;
step 3, splitting text A1, text B1 and text B into individual characters;
splicing text A1, text B1 and text B to obtain the spliced text [CLS] A1 [SEP] B1 [SOS] B [EOS]; wherein [CLS], [SEP], [SOS] and [EOS] are respectively a first separator, a second separator, a third separator and a fourth separator;
step 4, taking the spliced text as input to an encoding module, and respectively obtaining the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B;
Wherein:
step 4.1, the encoding vector matrix V_A1 of text A1 is obtained as follows:
1) The character encoding Emb_A1(i) of the i-th character in text A1 is the sum of the character vector Emb_A1char(i) of the i-th character in text A1, the position encoding Emb_A1pos(i) of the i-th character in text A1, and the type encoding Emb_type(A1) of text A1:
Emb_A1(i) = Emb_A1char(i) + Emb_A1pos(i) + Emb_type(A1)
2) The character encodings of all characters in text A1 form the encoding vector matrix V_A1 of text A1.
step 4.2, the encoding vector matrix V_B1 of text B1 is obtained as follows:
1) The character encoding Emb_B1(i) of the i-th character in text B1 is the sum of the character vector Emb_B1char(i) of the i-th character in text B1, the position encoding Emb_B1pos(i) of the i-th character in text B1, and the type encoding Emb_type(B1) of text B1:
Emb_B1(i) = Emb_B1char(i) + Emb_B1pos(i) + Emb_type(B1)
2) The character encodings of all characters in text B1 form the encoding vector matrix V_B1 of text B1.
step 4.3, the encoding vector matrix V_B of text B is obtained as follows:
1) The character encoding Emb_B(i) of the i-th character in text B is the sum of the character vector Emb_Bchar(i) of the i-th character in text B, the position encoding Emb_Bpos(i) of the i-th character in text B, and the type encoding Emb_type(B) of text B:
Emb_B(i) = Emb_Bchar(i) + Emb_Bpos(i) + Emb_type(B)
2) The character encodings of all characters in text B form the encoding vector matrix V_B of text B.
Wherein:
the character vectors Emb_A1char(i), Emb_B1char(i) and Emb_Bchar(i) are all obtained by looking up a dictionary;
the position encodings Emb_A1pos(i), Emb_B1pos(i) and Emb_Bpos(i) refer to the position of each character within the spliced text;
the type encodings Emb_type(A1), Emb_type(B1) and Emb_type(B) correspond to three different text types;
step 5, inputting the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B into a pre-training task layer, and computing the total loss function Loss_total as follows:
step 5.1, the pre-training task layer comprises a masked-character prediction task and a word-order recovery task;
step 5.2, the masked-character prediction task yields the first loss function Loss_1(x, θ):
Loss_1(x, θ) = E_{a ∈ len(A1)} [ -log P(x_a | V_A1, V_B1) ]
Wherein:
P(x_a | V_A1, V_B1) means: the vector of the masked character x_a to be predicted is read from the encoding vector matrix V_A1 of text A1; the read vector is spliced with the encoding vector matrix V_B of text B to obtain a spliced vector, and the spliced vector is multiplied by the dictionary matrix to obtain a probability matrix; the maximum probability value in the probability matrix is P(x_a | V_A1, V_B1); the dictionary matrix is the matrix formed by the character vectors of all characters in the dictionary;
-log P(x_a | V_A1, V_B1): a cross-entropy computation, namely applying the standard cross entropy to P(x_a | V_A1, V_B1) to obtain the loss value of the masked character x_a;
E(): an averaging operation;
a ∈ len(A1): the a-th character of text A1 is masked;
thus, each masked character yields a prediction loss value; the loss values of the masked characters are then summed and divided by the number of masked characters to obtain the average loss value;
step 5.3, the word-order recovery task yields the second loss function Loss_2(x, θ):
Loss_2(x, θ) = E_{b = 1, …, c} [ -log P(x_b | x_{b-1:0}, V_A1, V_B1) ]
Wherein:
b denotes the position of the character being predicted in text B, which has c characters in total, so b = 1, 2, …, c;
for the character vector x_b predicted at position b of text B, the loss value -log P(x_b | x_{b-1:0}, V_A1, V_B1) is obtained as follows:
1) Input V_A1, V_B1 and x_{b-1:0};
wherein x_{b-1:0} is the vector formed by splicing the separator vector x_0 preceding text B, the vector x_1 of the 1st character of text B, …, and the vector x_{b-1} of the (b-1)-th character of text B;
2) Splice V_A1, V_B1 and x_{b-1:0} to obtain the context vector of the character vector x_b to be predicted;
3) P(x_b | x_{b-1:0}, V_A1, V_B1) means: a sequence-to-sequence (seq2seq) model comprising an encoder side and a decoder side is used; the context vector is input at the encoder side, and the decoder side outputs the predicted probability value of the character vector x_b;
4) -log P(x_b | x_{b-1:0}, V_A1, V_B1): the standard cross entropy is applied to P(x_b | x_{b-1:0}, V_A1, V_B1) to obtain the loss value of the character vector x_b;
each predicted character vector in text B yields a loss value; these loss values are averaged to obtain the second loss function Loss_2(x, θ);
step 5.4, combining the first loss function Loss_1(x, θ) and the second loss function Loss_2(x, θ) to obtain the total loss function Loss_total;
step 6, judging whether training has reached the maximum number of iterations; if not, computing the gradient from the total loss function Loss_total, back-propagating and updating the model parameters θ, and returning to step 4; if so, stopping training to obtain the pre-trained language model.
Preferably, in step 2, for each randomly selected character, the mask character [MASK] is used to replace the corresponding character, resulting in the masked text.
Preferably, text A is a question text and text B is the answer text.
The text pair-oriented Chinese language model pre-training method provided by the invention has the following advantages:
the invention provides a text pair-oriented Chinese language model pre-training method, which can learn language and word sequence information in a text pair more fully, thereby improving the pre-training model effect.
Drawings
FIG. 1 is a flow chart of a text-pair-oriented Chinese language model pre-training method provided by the invention.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a text pair-oriented Chinese language model pre-training method, which can learn language and word sequence information in a text pair more fully, thereby improving the pre-training model effect.
Referring to fig. 1, the invention provides a text-pair-oriented Chinese language model pre-training method, which comprises the following steps:
step 1, inputting text pairs; the text pair comprises a text A and a text B which are arranged in pairs;
In the invention, the sources of the corpus text pairs include:
firstly, open-source similarity corpora on the Internet;
secondly, open-source question-answer pair corpora on the Internet; in this case, text A is a question text and text B is the answer text;
thirdly, anonymized user queries from a search engine together with the corresponding user click information (article titles, abstracts, body text and the like, spliced together).
Step 2, randomly selecting n words in the text A, wherein each randomly selected word adopts shielding characters to carry out shielding treatment to obtain a shielded text, and the shielded text is expressed as a text A1;
in this step, for each randomly selected word, the masking character [ MASK ] is used to replace the corresponding word, resulting in a masked text.
Word segmentation is carried out on the text B, and each word after word segmentation is disordered, so that a text with disordered sequence is obtained and is expressed as a text B1;
for example, text A is the query of the user in the search engine and text B is the click browsing information of the user.
Assuming that the text A is "I love history museum", masking the two words of "calendar" and "history" to obtain the text A1 as follows: "I love [ MASK ] [ MASK ] museum".
Assuming that the text B is "rising sun in museum", the text B is "rising sun in museum" after word segmentation, the text B1 is obtained by scrambling the sequence: "museum on sun up".
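As a concrete illustration of the masking and shuffling above, the following Python sketch masks n randomly chosen characters of text A and shuffles the segmented words of text B. The `segment` argument stands in for any Chinese word segmenter (the patent does not name one), and all function and variable names are illustrative assumptions rather than the patent's own implementation.

```python
import random

MASK = "[MASK]"

def mask_chars(text_a: str, n: int):
    """Step 2: replace n randomly chosen characters of text A with [MASK] -> text A1."""
    chars = list(text_a)
    for idx in random.sample(range(len(chars)), k=min(n, len(chars))):
        chars[idx] = MASK
    return chars                      # character/token list of text A1

def shuffle_words(text_b: str, segment):
    """Segment text B into words and shuffle their order -> text B1."""
    words = segment(text_b)           # e.g. ["museum", "sun", "rises"] for the example above
    shuffled = list(words)
    random.shuffle(shuffled)
    return shuffled, words            # (shuffled words of B1, original words of B)
```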
Step 3, dividing the text A1, the text B1 and the text B according to the word, correspondingly obtaining the text A1, the text B1 and the text B;
splicing the text A1, the text B1 and the text B to obtain a spliced text [ CLS ] A1[ SEP ] B1[ SOS ] B [ EOS ]; wherein: [ CLS ], [ SEP ], [ SOS ] and [ EOS ] are respectively: a first separator, a second separator, a third separator, and a fourth separator;
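A minimal sketch of this character splitting and splicing step, under the assumption that B1 and B are re-split into characters before concatenation and that the separators are single tokens; the function name is illustrative.

```python
def build_spliced_text(a1_tokens, b1_words, b_words):
    """Step 3: build [CLS] A1 [SEP] B1 [SOS] B [EOS] as a flat token sequence."""
    b1_chars = list("".join(b1_words))    # split the shuffled text B1 into characters
    b_chars = list("".join(b_words))      # split text B into characters
    return (["[CLS]"] + list(a1_tokens)
            + ["[SEP]"] + b1_chars
            + ["[SOS]"] + b_chars
            + ["[EOS]"])
```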
step 4, taking the spliced text as input to an encoding module, and respectively obtaining the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B;
Wherein:
step 4.1, the encoding vector matrix V_A1 of text A1 is obtained as follows:
1) The character encoding Emb_A1(i) of the i-th character in text A1 is the sum of the character vector Emb_A1char(i) of the i-th character in text A1, the position encoding Emb_A1pos(i) of the i-th character in text A1, and the type encoding Emb_type(A1) of text A1:
Emb_A1(i) = Emb_A1char(i) + Emb_A1pos(i) + Emb_type(A1)
2) The character encodings of all characters in text A1 form the encoding vector matrix V_A1 of text A1.
step 4.2, the encoding vector matrix V_B1 of text B1 is obtained as follows:
1) The character encoding Emb_B1(i) of the i-th character in text B1 is the sum of the character vector Emb_B1char(i) of the i-th character in text B1, the position encoding Emb_B1pos(i) of the i-th character in text B1, and the type encoding Emb_type(B1) of text B1:
Emb_B1(i) = Emb_B1char(i) + Emb_B1pos(i) + Emb_type(B1)
2) The character encodings of all characters in text B1 form the encoding vector matrix V_B1 of text B1.
step 4.3, the encoding vector matrix V_B of text B is obtained as follows:
1) The character encoding Emb_B(i) of the i-th character in text B is the sum of the character vector Emb_Bchar(i) of the i-th character in text B, the position encoding Emb_Bpos(i) of the i-th character in text B, and the type encoding Emb_type(B) of text B:
Emb_B(i) = Emb_Bchar(i) + Emb_Bpos(i) + Emb_type(B)
2) The character encodings of all characters in text B form the encoding vector matrix V_B of text B.
Wherein:
the character vectors Emb_A1char(i), Emb_B1char(i) and Emb_Bchar(i) are all obtained by looking up a dictionary;
the position encodings Emb_A1pos(i), Emb_B1pos(i) and Emb_Bpos(i) refer to the position of each character within the spliced text;
the type encodings Emb_type(A1), Emb_type(B1) and Emb_type(B) correspond to three different text types.
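The step-4 character encoding (character vector + position encoding + type encoding, with positions counted over the spliced text and three segment types for A1, B1 and B) can be sketched as follows. The hidden size, maximum length and the use of PyTorch are assumptions, not specified by the patent.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 768, max_len: int = 512, n_types: int = 3):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden)   # character vector Emb_char, looked up in the dictionary
        self.pos_emb = nn.Embedding(max_len, hidden)       # position encoding Emb_pos over the spliced text
        self.type_emb = nn.Embedding(n_types, hidden)      # type encoding Emb_type for the A1 / B1 / B segments

    def forward(self, char_ids: torch.Tensor, type_ids: torch.Tensor) -> torch.Tensor:
        # char_ids, type_ids: (batch, seq_len) over the spliced text [CLS] A1 [SEP] B1 [SOS] B [EOS]
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        return (self.char_emb(char_ids)
                + self.pos_emb(positions)[None, :, :]
                + self.type_emb(type_ids))
```

In a typical implementation these embeddings would then pass through a Transformer-style encoder whose outputs over the A1, B1 and B segments give V_A1, V_B1 and V_B respectively; the patent leaves the internal structure of the encoding module open.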
step 5, inputting the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B into a pre-training task layer, and computing the total loss function Loss_total as follows:
step 5.1, the pre-training task layer comprises a masked-character prediction task and a word-order recovery task;
step 5.2, the masked-character prediction task yields the first loss function Loss_1(x, θ):
Loss_1(x, θ) = E_{a ∈ len(A1)} [ -log P(x_a | V_A1, V_B1) ]
Wherein:
P(x_a | V_A1, V_B1) means: the vector of the masked character x_a to be predicted is read from the encoding vector matrix V_A1 of text A1; the read vector is spliced with the encoding vector matrix V_B of text B to obtain a spliced vector, and the spliced vector is multiplied by the dictionary matrix to obtain a probability matrix; the maximum probability value in the probability matrix is P(x_a | V_A1, V_B1); the dictionary matrix is the matrix formed by the character vectors of all characters in the dictionary;
-log P(x_a | V_A1, V_B1): a cross-entropy computation, namely applying the standard cross entropy to P(x_a | V_A1, V_B1) to obtain the loss value of the masked character x_a;
E(): an averaging operation;
a ∈ len(A1): the a-th character of text A1 is masked;
thus, each masked character yields a prediction loss value; the loss values of the masked characters are then summed and divided by the number of masked characters to obtain the average loss value.
This task predicts the characters of text A1 that were masked out, taking text A1 and text B1 as input. Since text A1 and text B1 form a text pair and can attend to each other, the task improves the understanding of text A1 and increases the diversity of the information about text A1.
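A minimal sketch of Loss_1, assuming the encoder output vector at each masked position is scored against the dictionary (character-embedding) matrix and averaged with cross entropy. The splicing of the masked-position vector with V_B described above is omitted here for brevity, so this is an approximation of step 5.2, not the patent's exact computation; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_char_loss(hidden: torch.Tensor,
                     char_emb_weight: torch.Tensor,
                     mask_positions: torch.Tensor,
                     target_ids: torch.Tensor) -> torch.Tensor:
    """Loss_1: average cross entropy over the masked characters of text A1.

    hidden:          (seq_len, d) encoder output over the spliced text
    char_emb_weight: (vocab, d) dictionary matrix of character vectors
    mask_positions:  (n_mask,) indices of the masked characters within A1's span
    target_ids:      (n_mask,) true character ids at the masked positions
    """
    masked_states = hidden[mask_positions]          # vectors read at the masked positions
    logits = masked_states @ char_emb_weight.T      # score against every character in the dictionary
    return F.cross_entropy(logits, target_ids)      # E[-log P(x_a | V_A1, V_B1)]
```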
Step 5.3, obtaining a second Loss function Loss by recovering tasks through the word order and adopting the following formula 2 (x,θ):
Figure SMS_4
Wherein:
b represents the position of the word predicted in text B, which has c words in total, so b=1, 2, …, c;
for word vector x predicted at bit B in text B b The loss value-log P (x) was obtained by the following method b |x b-1:0 ,V A1 ,V B1 ):
1) Input V A1 ,V B1 And x b-1:0
Wherein:
x b-1:0 the meaning is as follows: the 0 th bit separator vector x in front of text B 0 Text B1 st bit word vector x 1 …, text B-1 bit word vector x b-1 Vector formed by splicing;
2) Let V A1 ,V B1 And x b-1:0 Performing splicing operation to obtain word vector x to be predicted b Context vector of (a);
3)P(x b |x b-1:0 ,V A1 ,V B1 ) The meaning is as follows: using a sequence-to-sequence seq2seq model, including an encoding end and a decoding end; inputting a context vector at an encoding end; outputting the predicted word direction at the decoding endQuantity x b Is set, and a predicted probability value thereof;
4)-log P(x b |x b-1:0 ,V A1 ,V B1 ): using standard cross entropy pairs P (x b |x b-1:0 ,V A1 ,V B1 ) Calculating to obtain word vector x b Is a loss value of (2);
each predicted word vector in the text B obtains a loss value; averaging the Loss values to obtain a second Loss function Loss 2 (x,θ);
This step restores the character order of text B and is implemented in a generative manner, as illustrated below.
For example, text A1 is "I love [MASK][MASK] museum", text B1 is the shuffled "sun museum rises", and text B is "museum sun rises".
Splicing text A1, text B1 and text B yields the spliced text:
[CLS] I love [MASK][MASK] museum [SEP] sun museum rises [SOS] museum sun rises [EOS]
When predicting the two [MASK] positions of text A1, only the part "[CLS] I love [MASK][MASK] museum [SEP] sun museum rises" is visible.
When restoring the order of text B, one character of "museum sun rises [EOS]" is predicted at a time.
First, when predicting the first character of text B (b = 1), the part "[CLS] I love [MASK][MASK] museum [SEP] sun museum rises [SOS]" is mutually visible, namely: x_{b-1:0} = x_0 = [SOS];
when predicting the second character (b = 2), that part plus the first predicted character of text B is mutually visible, namely: x_{b-1:0} = [SOS] plus the first character of B;
when predicting the third character (b = 3), the first two predicted characters of text B are additionally visible;
and so on.
Thus, when restoring text B, each prediction of the next character of B takes as input the character vectors of B that have already been predicted; that is, the next character of B is predicted conditioned on the encodings of A1 and B1 and on the characters of B generated so far.
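The word-order recovery loss Loss_2 can be sketched as teacher-forced autoregressive cross entropy: at each position b the decoder sees the encoder memory (V_A1 and V_B1) and the characters of B up to position b-1, and is trained to predict the b-th character. Here `decoder` stands in for any seq2seq decoder (the patent only requires an encoder side and a decoder side); its signature and the other names are assumptions.

```python
import torch
import torch.nn.functional as F

def order_recovery_loss(decoder, memory: torch.Tensor, b_ids: torch.Tensor, sos_id: int) -> torch.Tensor:
    """Loss_2: average of -log P(x_b | x_{b-1:0}, V_A1, V_B1) over the c characters of B.

    memory: encoder states over [CLS] A1 [SEP] B1, i.e. V_A1 and V_B1, shape (m, d)
    b_ids:  character ids of text B in the correct order, shape (c,)
    """
    sos = torch.tensor([sos_id], dtype=b_ids.dtype)
    decoder_inputs = torch.cat([sos, b_ids[:-1]])    # teacher forcing: [SOS], x_1, ..., x_{c-1}
    logits = decoder(decoder_inputs, memory)         # assumed output shape (c, vocab)
    return F.cross_entropy(logits, b_ids)            # mean cross entropy over the c positions of B
```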
Step 5.4, for the first Loss function Loss 1 (x, θ) and a second Loss function Loss 2 (x, θ) to obtain the total Loss function Loss Total (S)
Step 6, judging whether the training reaches the maximum iteration number, if not, according to the total Loss function Loss Total (S) Obtaining a gradient, carrying out back transmission and parameter updating on the model parameter theta, and returning to the step 4; if yes, stopping training to obtain the pre-trained language model.
The invention provides a text pair-oriented Chinese language model pre-training method that applies a generative method on top of encoding the text-pair information: when the order of text B is restored, the semantic relation between text A1 and text B1 is learned (the information of A1 is encoded into the order restoration of B), and at the same time restoring the order of text B can be regarded as a higher-level form of language modeling.
In the text pair-oriented Chinese language model pre-training method of the invention, the text pair is pre-trained with a language-model task and a generative order-recovery task, so that the language and word-order information in the text pair can be learned more fully, thereby improving the effect of the pre-trained model.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications are also intended to fall within the scope of the present invention.

Claims (3)

1. A text pair-oriented Chinese language model pre-training method is characterized by comprising the following steps:
step 1, inputting text pairs; the text pair comprises a text A and a text B which are arranged in pairs;
step 2, randomly selecting n characters in text A, and masking each randomly selected character with a mask character to obtain a masked text, which is denoted as text A1;
performing word segmentation on text B, and shuffling the order of the segmented words to obtain a shuffled text, which is denoted as text B1;
step 3, splitting text A1, text B1 and text B into individual characters;
splicing text A1, text B1 and text B to obtain the spliced text [CLS] A1 [SEP] B1 [SOS] B [EOS]; wherein [CLS], [SEP], [SOS] and [EOS] are respectively a first separator, a second separator, a third separator and a fourth separator;
step 4, taking the spliced text as input to an encoding module, and respectively obtaining the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B;
wherein:
step 4.1, the encoding vector matrix V_A1 of text A1 is obtained as follows:
1) the character encoding Emb_A1(i) of the i-th character in text A1 is the sum of the character vector Emb_A1char(i) of the i-th character in text A1, the position encoding Emb_A1pos(i) of the i-th character in text A1, and the type encoding Emb_type(A1) of text A1:
Emb_A1(i) = Emb_A1char(i) + Emb_A1pos(i) + Emb_type(A1)
2) the character encodings of all characters in text A1 form the encoding vector matrix V_A1 of text A1;
step 4.2, the encoding vector matrix V_B1 of text B1 is obtained as follows:
1) the character encoding Emb_B1(i) of the i-th character in text B1 is the sum of the character vector Emb_B1char(i) of the i-th character in text B1, the position encoding Emb_B1pos(i) of the i-th character in text B1, and the type encoding Emb_type(B1) of text B1:
Emb_B1(i) = Emb_B1char(i) + Emb_B1pos(i) + Emb_type(B1)
2) the character encodings of all characters in text B1 form the encoding vector matrix V_B1 of text B1;
step 4.3, the encoding vector matrix V_B of text B is obtained as follows:
1) the character encoding Emb_B(i) of the i-th character in text B is the sum of the character vector Emb_Bchar(i) of the i-th character in text B, the position encoding Emb_Bpos(i) of the i-th character in text B, and the type encoding Emb_type(B) of text B:
Emb_B(i) = Emb_Bchar(i) + Emb_Bpos(i) + Emb_type(B)
2) the character encodings of all characters in text B form the encoding vector matrix V_B of text B;
wherein:
the character vectors Emb_A1char(i), Emb_B1char(i) and Emb_Bchar(i) are all obtained by looking up a dictionary;
the position encodings Emb_A1pos(i), Emb_B1pos(i) and Emb_Bpos(i) refer to the position of each character within the spliced text;
the type encodings Emb_type(A1), Emb_type(B1) and Emb_type(B) correspond to three different text types;
step 5, inputting the encoding vector matrix V_A1 of text A1, the encoding vector matrix V_B1 of text B1 and the encoding vector matrix V_B of text B into a pre-training task layer, and computing the total loss function Loss_total as follows:
step 5.1, the pre-training task layer comprises a masked-character prediction task and a word-order recovery task;
step 5.2, the masked-character prediction task yields the first loss function Loss_1(x, θ):
Loss_1(x, θ) = E_{a ∈ len(A1)} [ -log P(x_a | V_A1, V_B1) ]
wherein:
P(x_a | V_A1, V_B1) means: reading the vector of the masked character x_a to be predicted from the encoding vector matrix V_A1 of text A1, splicing the read vector with the encoding vector matrix V_B of text B to obtain a spliced vector, and multiplying the spliced vector by the dictionary matrix to obtain a probability matrix; the maximum probability value in the probability matrix is P(x_a | V_A1, V_B1); the dictionary matrix is the matrix formed by the character vectors of all characters in the dictionary;
-log P(x_a | V_A1, V_B1): a cross-entropy computation, namely applying the standard cross entropy to P(x_a | V_A1, V_B1) to obtain the loss value of the masked character x_a;
E(): an averaging operation;
a ∈ len(A1): the a-th character of text A1 is masked;
thus, each masked character yields a prediction loss value; the loss values of the masked characters are then summed and divided by the number of masked characters to obtain the average loss value;
step 5.3, the word-order recovery task yields the second loss function Loss_2(x, θ):
Loss_2(x, θ) = E_{b = 1, …, c} [ -log P(x_b | x_{b-1:0}, V_A1, V_B1) ]
wherein:
b denotes the position of the character being predicted in text B, which has c characters in total, so b = 1, 2, …, c;
for the character vector x_b predicted at position b of text B, the loss value -log P(x_b | x_{b-1:0}, V_A1, V_B1) is obtained as follows:
1) inputting V_A1, V_B1 and x_{b-1:0};
wherein x_{b-1:0} is the vector formed by splicing the separator vector x_0 preceding text B, the vector x_1 of the 1st character of text B, …, and the vector x_{b-1} of the (b-1)-th character of text B;
2) splicing V_A1, V_B1 and x_{b-1:0} to obtain the context vector of the character vector x_b to be predicted;
3) P(x_b | x_{b-1:0}, V_A1, V_B1) means: using a sequence-to-sequence (seq2seq) model comprising an encoder side and a decoder side; the context vector is input at the encoder side, and the decoder side outputs the predicted probability value of the character vector x_b;
4) -log P(x_b | x_{b-1:0}, V_A1, V_B1): applying the standard cross entropy to P(x_b | x_{b-1:0}, V_A1, V_B1) to obtain the loss value of the character vector x_b;
each predicted character vector in text B yields a loss value; these loss values are averaged to obtain the second loss function Loss_2(x, θ);
step 5.4, combining the first loss function Loss_1(x, θ) and the second loss function Loss_2(x, θ) to obtain the total loss function Loss_total;
step 6, judging whether training has reached the maximum number of iterations; if not, computing the gradient from the total loss function Loss_total, back-propagating and updating the model parameters θ, and returning to step 4; if so, stopping training to obtain the pre-trained language model.
2. The text pair-oriented Chinese language model pre-training method of claim 1, wherein in step 2, for each randomly selected character, the mask character [MASK] is used to replace the corresponding character, resulting in the masked text.
3. The text pair-oriented Chinese language model pre-training method of claim 1, wherein text A is a question text and text B is the answer text.
CN202210950700.1A 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method Active CN116029354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210950700.1A CN116029354B (en) 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210950700.1A CN116029354B (en) 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method

Publications (2)

Publication Number Publication Date
CN116029354A (en) 2023-04-28
CN116029354B CN116029354B (en) 2023-08-01

Family

ID=86076489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210950700.1A Active CN116029354B (en) 2022-08-09 2022-08-09 Text pair-oriented Chinese language model pre-training method

Country Status (1)

Country Link
CN (1) CN116029354B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569500A (en) * 2019-07-23 2019-12-13 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN112487786A (en) * 2019-08-22 2021-03-12 创新工场(广州)人工智能研究有限公司 Natural language model pre-training method based on disorder rearrangement and electronic equipment
KR20220044406A (en) * 2020-10-01 2022-04-08 네이버 주식회사 Method and system for controlling distributions of attributes in language models for text generation
CN112632997A (en) * 2020-12-14 2021-04-09 河北工程大学 Chinese entity identification method based on BERT and Word2Vec vector fusion
CN113343683A (en) * 2021-06-18 2021-09-03 山东大学 Chinese new word discovery method and device integrating self-encoder and countertraining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡益淮: "基于XLNET的抽取式多级语义融合模型" [An extractive multi-level semantic fusion model based on XLNET], 通信技术 (Communications Technology), No. 07

Also Published As

Publication number Publication date
CN116029354B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN110134771B (en) Implementation method of multi-attention-machine-based fusion network question-answering system
Kowsher et al. Bangla-bert: transformer-based efficient model for transfer learning and language understanding
Zhao et al. Attention-Based Convolutional Neural Networks for Sentence Classification.
CN115392259A (en) Microblog text sentiment analysis method and system based on confrontation training fusion BERT
CN115658890A (en) Chinese comment classification method based on topic-enhanced emotion-shared attention BERT model
Li et al. Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
Tang et al. Full attention-based bi-GRU neural network for news text classification
CN115129819A (en) Text abstract model production method and device, equipment and medium thereof
CN111428518B (en) Low-frequency word translation method and device
Yang et al. Generation-based parallel particle swarm optimization for adversarial text attacks
Waghela et al. Saliency attention and semantic similarity-driven adversarial perturbation
Jahan et al. A Comprehensive Study on NLP Data Augmentation for Hate Speech Detection: Legacy Methods, BERT, and LLMs
Behere et al. Text summarization and classification of conversation data between service chatbot and customer
CN116029354B (en) Text pair-oriented Chinese language model pre-training method
CN115688703B (en) Text error correction method, storage medium and device in specific field
Mastronardo et al. Enhancing a text summarization system with ELMo
CN115309898A (en) Word granularity Chinese semantic approximate countermeasure sample generation method based on knowledge enhanced BERT
Wei Research on internet text sentiment classification based on BERT and CNN-BiGRU
Zhu A Simple Survey of Pre-trained Language Models
Sattari et al. Improving image captioning with local attention mechanism
Liu et al. Raw-to-end name entity recognition in social media
Croce et al. Grammatical Feature Engineering for Fine-grained IR Tasks.
Xu et al. Incorporating forward and backward instances in a bi-lstm-cnn model for relation classification
CN110990385A (en) Software for automatically generating news headlines based on Sequence2Sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant