CN110489555A - Language model pre-training method combining word-like information - Google Patents

Language model pre-training method combining word-like information

Info

Publication number
CN110489555A
Authority
CN
China
Prior art keywords
training
word
character string
model
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910775453.4A
Other languages
Chinese (zh)
Other versions
CN110489555B (en)
Inventor
白佳欣
宋彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Workshop (guangzhou) Artificial Intelligence Research Co Ltd
Original Assignee
Innovation Workshop (guangzhou) Artificial Intelligence Research Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Workshop (guangzhou) Artificial Intelligence Research Co Ltd filed Critical Innovation Workshop (guangzhou) Artificial Intelligence Research Co Ltd
Priority to CN201910775453.4A priority Critical patent/CN110489555B/en
Publication of CN110489555A publication Critical patent/CN110489555A/en
Application granted granted Critical
Publication of CN110489555B publication Critical patent/CN110489555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The present invention relates to the field of language processing technology, and in particular to a language model pre-training method combining word-like information, which comprises the following steps: S1, providing a pre-training model and a pre-training text; S2, extracting character strings to form a vocabulary; S3, extracting two sentences as training sentences and splitting the training sentences into single-character sequences; S4, matching the character strings from step S2 against the characters in the single-character sequences, and marking the character strings that match characters in the single-character sequences; S5, masking or replacing a preset proportion of characters selected from the single-character sequences, and feeding the masked or replaced training sentences together with the marked character strings into the pre-training model to train and optimize it; S6, repeating steps S2-S5 until the pre-training model reaches a set optimization condition, thereby obtaining an optimized pre-training model. The language model pre-training method and pre-training model combining word-like information provided by the invention achieve better performance on multiple downstream tasks.

Description

Language model pre-training method combining word-like information
[Technical Field]
The present invention relates to the field of language processing technology, and in particular to a language model pre-training method combining word-like information.
[Background Art]
Current state-of-the-art pre-trained language models fall into two classes: autoregressive language models and autoencoding language models. GPT and GPT-2 are autoregressive language models with strong performance. The training objective of an autoregressive model is to correctly predict the next word from the preceding context. BERT is the representative autoencoding language model; its training objective is to correctly infer masked or replaced words from the surrounding context. Each type of model has its own advantages and disadvantages. An autoregressive model can only condition on the preceding context and therefore cannot handle tasks that require combining left and right context. An autoencoding language model, on the other hand, can use context in both directions, but during pre-training a [mask] token is added to the training corpus to replace the original target word so that the prediction target is hidden, while this [mask] token never appears during fine-tuning on specific tasks. As a result, the inputs of the pre-trained language model during pre-training and fine-tuning do not match, which degrades the overall performance of the model. Recently, XLNet was proposed to solve both of these problems at once, allowing a pre-trained language model to use bidirectional context without introducing the [mask] token.
However, the above language models do not make full use of coarser-grained units such as words, phrases, and entities that appear in the pre-training and fine-tuning corpora. Such information is especially important for Chinese tasks. Unlike English, Chinese has no explicit word boundaries such as spaces, which makes it harder for a model to learn the overall meaning of two-character or multi-character words from a sequence of single characters.
Recently, the BERT-wwm model was proposed as an optimization of BERT for this problem on Chinese. BERT-wwm differs from BERT only in the pre-processing of the training corpus. When BERT applies the masking operation to the pre-training corpus, 15% of the single characters are replaced with [mask] and the remaining characters are kept. BERT-wwm instead first segments the raw corpus with a word segmentation tool and then performs the same masking operation on whole words. Slightly earlier, ERNIE released by Baidu was also an improvement of BERT for this problem. ERNIE uses a multi-level masking strategy, which includes character-level masking, phrase-level masking, and entity-level masking. To achieve multi-level masking, Baidu additionally used Baidu Baike, Baidu Tieba, and question-answering data on top of Chinese Wikipedia data. Although ERNIE used more training data and learned more knowledge, its performance on downstream tasks at the time was comparable to BERT-wwm.
However, learning word boundary information through multi-level masking strategies still has problems. First, the effectiveness of the masking strategy depends on information beyond the text itself: BERT-wwm depends on the output of a word segmenter, and ERNIE depends on external knowledge. In practice, using such additional information has the following drawbacks. First, the quality of the information cannot be guaranteed; for example, the effectiveness of BERT-wwm depends on the quality of Chinese word segmentation. Second, high-quality information requires large-scale collection and annotation, which adds extra cost to pre-training a language model. Third, masking only whole words still under-uses word information, because a word may carry an extended meaning unrelated to its literal characters, for example transliterations such as "Romania", idioms such as "the old frontiersman loses his horse", and two-part allegorical sayings.
To address this problem, this patent proposes a new method that incorporates word-like information into the pre-training and fine-tuning of a language model, built on top of existing language models.
[Summary of the Invention]
To overcome the low prediction accuracy and high cost of existing language models, the present invention provides a language model pre-training method combining word-like information.
To solve the above technical problem, the present invention provides a language model pre-training method combining word-like information, which comprises the following steps: S1, providing a pre-training model and a pre-training text; S2, extracting character strings from the pre-training text to form a vocabulary; S3, extracting two sentences from the pre-training text as training sentences and splitting the training sentences into single-character sequences; S4, matching the character strings obtained in step S2 against the characters in the single-character sequences, and marking the character strings that match characters in the single-character sequences; S5, masking or replacing a preset proportion of characters selected from the single-character sequences, and feeding the masked or replaced training sentences together with the marked character strings into the pre-training model to train and optimize the pre-training model; S6, repeating steps S2-S5 until the pre-training model reaches a set optimization condition, thereby obtaining an optimized pre-training model.
Preferably, in step S2, the character strings are obtained by a word extraction algorithm or extracted manually.
Preferably, in step S3, a [sep] marker is appended to the end of each of the two extracted training sentences, and a [cls] marker is added to the beginning of the first sentence; in step S4, the character strings are marked using their position information and/or length information.
Preferably, in step S6, each time step S2 is executed, two sentences are extracted from the pre-training text as training sentences, one pair at a time, until all sentences in the pre-training text have been extracted. The two sentences extracted each time are either adjacent or non-adjacent; when the extraction is complete, adjacent sentence pairs and non-adjacent sentence pairs each account for 40-70% of the pairs, the two proportions summing to 100%.
Preferably, step S5 specifically comprises the following steps: S51, establishing an objective function for the pre-training model; S52, masking or replacing 15% of the characters selected from the single-character sequences; S53, feeding the masked or replaced training sentences and the marked character strings into the pre-training model at the same time; S54, predicting the masked or replaced characters with the pre-training model to obtain vector representations of the masked or replaced characters; and S55, computing the objective function from the vector representations and optimizing the pre-training model.
Preferably, the language model pre-training method combining word-like information further comprises the following step: step S7, fine-tuning the optimized pre-training model obtained in step S6 on a task, using the vocabulary formed in step S2.
Preferably, step S7 specifically comprises the following steps: S71, providing a fine-tuning task text; S72, splitting the fine-tuning task text into single-character sequences; S73, matching the character strings from step S2 against the characters in the single-character sequences from step S72 and marking the matched character strings; S74, feeding the single-character sequences and the marked character strings into the optimized pre-training model at the same time to fine-tune the pre-training model.
Preferably, in step S74, the optimized pre-training model is further optimized by optimizing the objective function through a fully connected layer or a CRF network.
Preferably, the pre-training model comprises an embedding layer, a character-level encoder, a word-level encoder, and multiple attention encoders. The embedding layer receives the masked or replaced training sentences from step S5 and the marked character strings from step S4; it converts each single character into a corresponding character embedding vector and each character string into a corresponding string embedding vector, and adds the corresponding position encoding to each character embedding vector and each string embedding vector. The character-level encoder takes the character embedding vectors and their position encodings as input and computes character vector representations of the characters that are not masked or replaced. The word-level encoder takes the string embedding vectors and their position encodings as input and computes word vector representations. The attention encoders, of which there are several, take the character vector representations of the unmasked characters and the word vector representations as joint input to obtain vector representations of the masked or replaced characters.
Preferably, the pre-training model further comprises a Linear network layer and a Softmax network layer; the character vector representations and the word vector representations output by the attention encoders are fed into the Linear network layer and the Softmax network layer for further training and fine-tuning of the pre-training model.
Compared with the prior art, the language model pre-training method and pre-training model combining word-like information provided by the present invention have the following beneficial effects:
First, a pre-training model and a pre-training text are provided, character strings are extracted from the pre-training text to form a vocabulary, and two sentences are extracted from the pre-training text as training sentences and split into single-character sequences. By matching the character strings against the characters in the single-character sequences and marking the character strings that match, the model can predict the masked or replaced characters using the information carried by the character strings rather than only the character vectors of the single-character sequence. The character strings are usually marked with their position and length information, so the associations between character strings can be exploited to predict the masked or replaced characters. This improves the accuracy of the pre-training model on masked or replaced characters, and the optimized pre-training model performs better on multiple downstream tasks, for example Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis, natural language inference, sentence classification, machine reading comprehension, and article classification.
In the pre-training model provided by the present invention, which comprises an embedding layer, a character-level encoder, a word-level encoder, and multiple attention encoders, the word-level encoder takes the string embedding vectors and their position encodings as input and computes word vector representations, and the attention encoders take the character vector representations of the unmasked characters and the word vector representations as joint input to obtain vector representations of the masked or replaced characters. Pairing a word-level encoder with the character-level encoder means that, when the attention encoders compute the vector representations of the masked or replaced characters, those representations capture the masked or replaced characters more faithfully, which improves prediction accuracy. The optimized pre-training model therefore performs better on multiple tasks, for example Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis, and document classification. Moreover, because the pre-training model provided by the present invention incorporates word boundary information, it has stronger generation ability and can be applied to tasks such as keyword generation, article continuation, and article summarization; the sentences it generates in these tasks are of higher quality.
[Brief Description of the Drawings]
Fig. 1 is a flow diagram of the language model pre-training method combining word-like information in the first embodiment of the present invention;
Fig. 2 is a block diagram of the pre-training model used by the language model pre-training method combining word-like information in the first embodiment of the present invention;
Fig. 3 is a schematic diagram of matching the single-character sequence against the character strings in step S4 of the language model pre-training method combining word-like information in the first embodiment of the present invention;
Fig. 4 is a detailed flow chart of step S5 of the language model pre-training method combining word-like information in the first embodiment of the present invention;
Fig. 5 is a schematic diagram of the inputs to the pre-training model and the corresponding operations in steps S53 and S54 of the language model pre-training method combining word-like information in the first embodiment of the present invention;
Fig. 6 is a flow diagram of a variant of the language model pre-training method combining word-like information in the first embodiment of the present invention;
Fig. 7 is a detailed flow chart of step S7 of the language model pre-training method combining word-like information in the first embodiment of the present invention;
Fig. 8 is a block diagram of the electronic device provided in the second embodiment of the present invention;
Fig. 9 is a structural diagram of a computer system suitable for implementing the server of the embodiments of the present invention.
Description of reference numerals:
11, embedding layer; 12, character-level encoder; 13, word-level encoder; 14, attention encoder; 60, electronic device; 601, memory; 602, processor; 800, computer system; 801, central processing unit (CPU); 802, read-only memory (ROM); 803, RAM; 804, bus; 805, I/O interface; 806, input portion; 807, output portion; 808, storage portion; 809, communication portion; 810, driver; 811, removable medium.
[Detailed Description of the Embodiments]
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only serve to illustrate the present invention and are not intended to limit it.
Referring to Fig. 1, the first embodiment of the present invention provides a language model pre-training method combining word-like information, which comprises the following steps:
S1, providing a pre-training model and a pre-training text;
S2, extracting character strings from the pre-training text to form a vocabulary;
S3, extracting two sentences from the pre-training text as training sentences and splitting the training sentences into single-character sequences;
S4, matching the character strings from step S2 against the characters in the single-character sequences, and marking the character strings that match characters in the single-character sequences;
S5, masking or replacing a preset proportion of characters selected from the single-character sequences, and feeding the masked or replaced training sentences together with the marked character strings into the pre-training model at the same time to train and optimize the pre-training model;
S6, repeating steps S2-S5 until the pre-training model reaches the set optimization condition, thereby obtaining the optimized pre-training model.
In step S1, the pre-training text is selected from plain-text sources such as Wikipedia, news corpora, medical question-answering corpora, and financial report data.
Referring to Fig. 2, in step S1 the pre-training model is obtained by improving an existing autoencoding language model, including but not limited to the BERT (Bidirectional Encoder Representations from Transformers) language model. The pre-training model comprises modules such as an embedding layer 11, a character-level encoder 12, a word-level encoder 13, and multiple attention encoders 14; the xN in Fig. 2 indicates that several attention encoders 14 are omitted from the drawing.
The embedding layer 11 receives the masked or replaced training sentences from step S5 and the marked character strings from step S4; it converts each single character into a corresponding character embedding vector and each character string into a corresponding string embedding vector, and adds the corresponding position encoding to each character embedding vector and each string embedding vector.
The character-level encoder 12 takes the character embedding vectors and their position encodings as input and computes character vector representations of the characters that are not masked or replaced.
The word-level encoder 13 takes the string embedding vectors and their position encodings as input and computes word vector representations.
The attention encoders 14, of which there are several, take the character vector representations of the unmasked characters and the word vector representations as joint input to obtain vector representations of the masked or replaced characters.
The pre-training model further comprises a Linear network layer 15 and a Softmax network layer 16. The vector representations of the masked or replaced characters output by the attention encoders 14 are fed into the Linear network layer 15 and the Softmax network layer 16 to complete the optimization of the pre-training model and the fine-tuning tasks.
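By way of illustration only, a minimal Python (PyTorch) sketch of such a prediction layer is given below; the class name, the hidden size, the character vocabulary size, and the folding of the Softmax into a cross-entropy loss are assumptions introduced here and do not define the embodiment:

import torch.nn as nn
import torch.nn.functional as F

class MaskedCharPredictionHead(nn.Module):
    """Linear layer over the vector representations of the masked or replaced
    characters; the Softmax is applied implicitly inside the cross-entropy loss."""
    def __init__(self, dim=768, char_vocab_size=21128):
        super().__init__()
        self.linear = nn.Linear(dim, char_vocab_size)

    def forward(self, masked_repr, target_ids):
        # masked_repr: (num_masked, dim); target_ids: (num_masked,)
        logits = self.linear(masked_repr)
        return F.cross_entropy(logits, target_ids)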
In step S2, character strings are extracted from the pre-training text to form a vocabulary. In this step, the character strings can be obtained by a word extraction algorithm or by manual extraction. In general, a word extraction algorithm is also called a segmentation algorithm or a string-matching segmentation algorithm; such an algorithm matches candidate strings against the entries of a sufficiently large pre-built dictionary according to some strategy, and if an entry is found the match succeeds, that is, the word is recognized. Optionally, the word extraction algorithm includes, but is not limited to, Accessor Variety. It should of course be understood that in this step words can also be extracted from the pre-training text manually and the resulting character strings placed into the vocabulary. When the pre-training text contains rarely used character strings such as two-part allegorical sayings, idioms, or common sayings, adding these character strings to the vocabulary by manual extraction enriches the vocabulary and improves the optimization effect of the pre-training model.
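For illustration only, forming the vocabulary of step S2 might be sketched in Python as follows; the simple frequency threshold stands in for whichever extraction criterion (for example Accessor Variety) is actually used, and the function name and parameters are assumptions introduced here:

from collections import Counter

def extract_string_vocab(texts, max_len=4, min_count=5, manual_entries=()):
    """Collect frequent multi-character strings from the pre-training text
    and add manually supplied strings (idioms, sayings, etc.)."""
    counts = Counter()
    for sentence in texts:
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    vocab = {s for s, c in counts.items() if c >= min_count}
    vocab.update(manual_entries)
    return vocab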
In step S3, two sentences are extracted from the pre-training text as training sentences and split into single-character sequences. In this step, splitting a training sentence into a single-character sequence means dividing the sentence with the single character as the smallest unit; the training sentence can be split into a single-character sequence by a split function. Step S3 further includes the following operation: a [sep] marker is appended to the end of each of the two extracted training sentences, and a [cls] marker is added to the beginning of the first sentence.
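For illustration only, the splitting and marker insertion of step S3 might be sketched as follows; the function name is an assumption introduced here:

def build_character_sequence(sentence_a, sentence_b):
    """Split two training sentences into one single-character sequence,
    appending [sep] after each sentence and [cls] before the first."""
    return (["[cls]"] + list(sentence_a) + ["[sep]"]
            + list(sentence_b) + ["[sep]"])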
Referring to Fig. 3, in step S4 the character strings from step S2 are matched against the characters in the single-character sequence, and the character strings that match characters in the single-character sequence are marked. In this embodiment, the single-character sequence is "人, 工, 智, 能, 实, 验, 室, 真, 犀, 利" (character by character, "the artificial intelligence laboratory is really sharp"), and the vocabulary contains the character strings "人工" (artificial), "智能" (intelligence), "实验" (experiment), "实验室" (laboratory), and "犀利" (sharp); the result of matching the characters in the single-character sequence against the character strings is shown in Fig. 3. After the matching, the character strings are marked using their position information and/or length information within the corresponding training sentence.
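For illustration only, the matching and marking of step S4 might be sketched as follows, using the example of Fig. 3; the exhaustive enumeration of candidate substrings is an assumption, the embodiment only requiring that each matched string be recorded together with its position and length:

def mark_matching_strings(chars, vocab, max_len=4):
    """Match the vocabulary strings against the single-character sequence and
    record each hit as (string, start position, length)."""
    marks = []
    for i in range(len(chars)):
        for n in range(2, max_len + 1):
            candidate = "".join(chars[i:i + n])
            if candidate in vocab:
                marks.append((candidate, i, n))
    return marks

chars = list("人工智能实验室真犀利")
vocab = {"人工", "智能", "实验", "实验室", "犀利"}
print(mark_matching_strings(chars, vocab))
# [('人工', 0, 2), ('智能', 2, 2), ('实验', 4, 2), ('实验室', 4, 3), ('犀利', 8, 2)]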
In step S5, the characters selected in a preset proportion from the single-character sequence are masked or replaced, and the masked or replaced training sentences together with the marked character strings are fed into the pre-training model at the same time to train and optimize the pre-training model. In this step, the preset proportion is the percentage of masked or replaced characters relative to the number of characters in the whole sentence; its range is 10-30%, and in this embodiment the chosen proportion is 15%.
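For illustration only, the selection and masking of step S5 might be sketched as follows; whether selected characters are masked or replaced with other characters is left to the training recipe, and this sketch only masks:

import random

def mask_characters(chars, ratio=0.15, mask_token="[mask]"):
    """Mask a preset proportion of the ordinary characters; [cls]/[sep]
    markers are never selected."""
    candidates = [i for i, c in enumerate(chars) if c not in ("[cls]", "[sep]")]
    k = max(1, round(len(candidates) * ratio))
    chosen = random.sample(candidates, k)
    masked = list(chars)
    for i in chosen:
        masked[i] = mask_token
    return masked, sorted(chosen)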
Referring to Fig. 4, step S5 specifically comprises the following steps:
S51, establishing an objective function for the pre-training model;
S52, masking or replacing 15% of the characters selected from the single-character sequence;
S53, feeding the masked or replaced training sentences and the marked character strings into the pre-training model at the same time;
S54, predicting the masked or replaced characters with the pre-training model to obtain vector representations of the masked or replaced characters; and
S55, computing the objective function from the vector representations and optimizing the pre-training model.
Referring to Fig. 5, in step S53 the extracted training sentence is "人工智能实验室真犀利" ("the artificial intelligence laboratory is really sharp"); the characters "工" and "犀" in it are replaced with [mask], and the masked or replaced training sentence is fed into the embedding layer 11 of the pre-training model together with the marked character strings. The embedding layer 11 converts each single character into a corresponding character embedding vector and each character string into a corresponding string embedding vector, and adds the corresponding position encoding to each character embedding vector and each string embedding vector. It can be understood that a position encoder is provided in the embedding layer 11; the position encoder adds the corresponding position encoding to each character embedding vector and each string embedding vector. The position encoding indicates the position at which each character string occurs in the training sentence.
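By way of illustration only, the embedding layer 11 might be sketched in Python (PyTorch) as follows; the class name, the hidden size, and the use of a learned position embedding as the position encoder are assumptions introduced here:

import torch
import torch.nn as nn

class HybridEmbedding(nn.Module):
    """Embed single characters and marked character strings, and add the
    position encoding of the place where each occurs in the training sentence."""
    def __init__(self, num_chars, num_strings, dim=768, max_len=512):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, dim)
        self.string_emb = nn.Embedding(num_strings, dim)
        self.pos_emb = nn.Embedding(max_len, dim)

    def forward(self, char_ids, string_ids, string_positions):
        # char_ids: (batch, seq_len); string_ids, string_positions: (batch, n_strings)
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        char_vectors = self.char_emb(char_ids) + self.pos_emb(positions)
        string_vectors = self.string_emb(string_ids) + self.pos_emb(string_positions)
        return char_vectors, string_vectors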
As shown in Fig. 5, c1, c2, c3, c4, c5, c6, c7, c8, c9, and c10 correspond to "人", "[mask]" (originally "工"), "智", "能", "实", "验", "室", "真", "[mask]" (originally "犀"), and "利" in the single-character sequence; w1, w2, w3, w4, and w5 correspond to the embedding vectors of the character strings "人工" (artificial), "智能" (intelligence), "实验" (experiment), "实验室" (laboratory), and "犀利" (sharp).
Further, the character embedding vectors and their corresponding position encodings are input to the character-level encoder 12, which computes character vector representations of the characters that are not masked or replaced; the string embedding vectors and their corresponding position encodings are input to the word-level encoder 13, which computes word vector representations.
In step S54, the character vector representations of the unmasked characters and the word vector representations are fed into the attention encoders 14 at the same time to obtain the vector representations of the "[cls]" marker and of the masked or replaced characters. In step S54, the word vector representations carry position information and length information as marking information, and the character strings have been matched against the characters, so when the attention encoders 14 compute the vector representations of the masked or replaced characters the accuracy is higher and the predicted characters are more accurate. As a result, the model performs better on multiple tasks, for example Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis, natural language inference, sentence classification, machine reading comprehension, and article classification.
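For illustration only, one attention encoder 14 operating jointly on the character and word representations might be sketched as follows; the self-attention-over-concatenation design, the layer sizes, and the normalization arrangement are assumptions, since the embodiment does not fix these details:

import torch
import torch.nn as nn

class JointAttentionEncoder(nn.Module):
    """Self-attention over the concatenation of character vector representations
    and word vector representations, so the prediction of a masked character
    can draw on the matched strings."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, char_repr, word_repr):
        joint = torch.cat([char_repr, word_repr], dim=1)
        attended, _ = self.attn(joint, joint, joint)
        joint = self.norm1(joint + attended)
        joint = self.norm2(joint + self.ffn(joint))
        return joint[:, :char_repr.size(1)]  # keep the character positions, incl. [cls] and [mask]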
After step S55 has been executed, the pre-training model has been optimized to some extent. Step S6 is then executed: steps S2-S5 are repeated until the pre-training model reaches the set optimization condition, thereby obtaining the optimized pre-training model. In this step, the set optimization condition is that the objective function has converged. Each time step S2 is executed, two sentences are extracted from the pre-training text as training sentences, one pair at a time, until all sentences in the pre-training text have been extracted; the two sentences extracted each time are either adjacent or non-adjacent. When the extraction is complete, adjacent sentence pairs and non-adjacent sentence pairs each account for 40-70% of the pairs, the two proportions summing to 100%. In this embodiment, each accounts for 50%.
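For illustration only, the sampling of adjacent and non-adjacent sentence pairs described above might be sketched as follows; the helper name and the rejection-sampling loop are assumptions introduced here (the loop presumes a corpus of more than a few sentences):

import random

def sample_sentence_pair(sentences, adjacent_ratio=0.5):
    """Draw a training pair: adjacent with probability adjacent_ratio (50% in
    this embodiment, within the claimed 40-70% range), otherwise non-adjacent."""
    if random.random() < adjacent_ratio:
        i = random.randrange(len(sentences) - 1)
        return sentences[i], sentences[i + 1], True
    while True:
        i, j = random.sample(range(len(sentences)), 2)
        if abs(i - j) > 1:
            return sentences[i], sentences[j], False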
Referring to Fig. 6, the language model pre-training method combining word-like information further comprises the following step: step S7, fine-tuning the optimized pre-training model obtained in step S6 on a task, using the vocabulary formed in step S2.
Referring to Fig. 7, step S7 specifically comprises the following steps:
S71, providing a fine-tuning task text;
S72, splitting the fine-tuning task text into single-character sequences;
S73, matching the character strings from step S2 against the characters in the single-character sequences from step S72 and marking the matched character strings;
S74, feeding the single-character sequences and the marked character strings into the optimized pre-training model at the same time to fine-tune the pre-training model.
In step S71, the fine-tuning task text is also selected from plain-text sources such as Wikipedia, news corpora, medical question-answering corpora, and financial report data, but the fine-tuning task text cannot be identical to the pre-training text of step S1.
In step S74, the optimized pre-training model is further optimized by optimizing the objective function through a fully connected layer or a CRF network.
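For illustration only, the fully connected fine-tuning head of step S74 might be sketched as follows; a CRF layer from a third-party package could replace the per-character cross-entropy shown here, as the embodiment allows, and the label count is an illustrative assumption:

import torch.nn as nn
import torch.nn.functional as F

class TaggingHead(nn.Module):
    """Fully connected layer over the character representations with a
    per-character cross-entropy objective for fine-tuning tasks."""
    def __init__(self, dim=768, num_labels=4):
        super().__init__()
        self.proj = nn.Linear(dim, num_labels)

    def forward(self, char_repr, labels=None):
        logits = self.proj(char_repr)            # (batch, seq_len, num_labels)
        if labels is None:
            return logits
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1))
        return logits, loss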
Referring to Fig. 8, the second embodiment of the present invention provides an electronic device 60 comprising a memory 601 and a processor 602. A computer program is stored in the memory 601, and the computer program is configured to execute, when run, the language model pre-training method combining word-like information described in the first embodiment.
The processor 602 is configured to execute, through the computer program, the language model pre-training method combining word-like information described in the first embodiment.
Referring now to Fig. 9, it shows a structural diagram of a computer system 800 suitable for implementing a terminal device/server of the embodiments of the present application. The terminal device/server shown in Fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 9, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage portion 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the system 800. The CPU 801, the ROM 802, and the RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 808 including a hard disk and the like; and a communication portion 809 including a network interface card such as a LAN card or a modem. The communication portion 809 performs communication processing via a network such as the Internet. A driver 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the driver 810 as needed, so that a computer program read from it can be installed into the storage portion 808 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flow chart may be implemented as a computer software program. For example, the disclosed embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 809 and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the above-described functions defined in the method of the present application are executed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
Computer program code for carrying out the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the C language or similar programming languages. The program code may execute entirely on a management-side computer, partly on a management-side computer, as a stand-alone software package, partly on a management-side computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the management-side computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flow charts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to the various embodiments of the present application. In this regard, each box in a flow chart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flow charts, and combinations of boxes in the block diagrams and/or flow charts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A language model pre-training method combining word-like information, characterized in that it comprises the following steps:
S1, providing a pre-training model and a pre-training text;
S2, extracting character strings from the pre-training text to form a vocabulary;
S3, extracting two sentences from the pre-training text as training sentences and splitting the training sentences into single-character sequences;
S4, matching the character strings from step S2 against the characters in the single-character sequences, and marking the character strings that match characters in the single-character sequences;
S5, masking or replacing a preset proportion of characters selected from the single-character sequences, and feeding the masked or replaced training sentences together with the marked character strings into the pre-training model at the same time to train and optimize the pre-training model;
S6, repeating steps S2-S5 until the pre-training model reaches a set optimization condition, thereby obtaining an optimized pre-training model.
2. The language model pre-training method combining word-like information according to claim 1, characterized in that: in step S2, the character strings are obtained by a word extraction algorithm or extracted manually.
3. The language model pre-training method combining word-like information according to claim 1, characterized in that: in step S3, a [sep] marker is appended to the end of each of the two extracted training sentences, and a [cls] marker is added to the beginning of the first sentence; in step S4, the character strings are marked using their position information and/or length information.
4. The language model pre-training method combining word-like information according to claim 1, characterized in that: in step S6, each time step S2 is executed, two sentences are extracted from the pre-training text as training sentences, one pair at a time, until all sentences in the pre-training text have been extracted; the two sentences extracted each time are either adjacent or non-adjacent, and when the extraction is complete, adjacent sentence pairs and non-adjacent sentence pairs each account for 40-70% of the pairs, the two proportions summing to 100%.
5. The language model pre-training method combining word-like information according to claim 1, characterized in that step S5 specifically comprises the following steps:
S51, establishing an objective function for the pre-training model;
S52, masking or replacing 15% of the characters selected from the single-character sequences;
S53, feeding the masked or replaced training sentences and the marked character strings into the pre-training model at the same time;
S54, predicting the masked or replaced characters with the pre-training model to obtain vector representations of the masked or replaced characters; and
S55, computing the objective function from the vector representations and optimizing the pre-training model.
6. The language model pre-training method combining word-like information according to claim 1, characterized in that the language model pre-training method combining word-like information further comprises the following step: step S7, fine-tuning the optimized pre-training model obtained in step S6 on a task, using the vocabulary formed in step S2.
7. The language model pre-training method combining word-like information according to claim 6, characterized in that step S7 specifically comprises the following steps:
S71, providing a fine-tuning task text;
S72, splitting the fine-tuning task text into single-character sequences;
S73, matching the character strings from step S2 against the characters in the single-character sequences from step S72 and marking the matched character strings;
S74, feeding the single-character sequences and the marked character strings into the optimized pre-training model at the same time to fine-tune the pre-training model.
8. The language model pre-training method combining word-like information according to claim 7, characterized in that: in step S74, the optimized pre-training model is further optimized by optimizing the objective function through a fully connected layer or a CRF network.
9. The language model pre-training method combining word-like information according to any one of claims 1-8, characterized in that the pre-training model comprises an embedding layer, a character-level encoder, a word-level encoder, and multiple attention encoders; wherein,
the embedding layer receives the masked or replaced training sentences from step S5 and the marked character strings from step S4, converts each single character into a corresponding character embedding vector and each character string into a corresponding string embedding vector, and adds the corresponding position encoding to each character embedding vector and each string embedding vector;
the character-level encoder takes the character embedding vectors and their position encodings as input and computes character vector representations of the characters that are not masked or replaced;
the word-level encoder takes the string embedding vectors and their position encodings as input and computes word vector representations;
the attention encoders are multiple, and take the character vector representations of the unmasked characters and the word vector representations as joint input to obtain vector representations of the masked or replaced characters.
10. The language model pre-training method combining word-like information according to claim 9, characterized in that the pre-training model further comprises a Linear network layer and a Softmax network layer, and the character vector representations and the word vector representations output by the attention encoders are fed into the Linear network layer and the Softmax network layer for further training and fine-tuning of the pre-training model.
CN201910775453.4A 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information Active CN110489555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910775453.4A CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910775453.4A CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Publications (2)

Publication Number Publication Date
CN110489555A true CN110489555A (en) 2019-11-22
CN110489555B CN110489555B (en) 2022-03-08

Family

ID=68552689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910775453.4A Active CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Country Status (1)

Country Link
CN (1) CN110489555B (en)

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008531A (en) * 2019-12-06 2020-04-14 北京金山数字娱乐科技有限公司 Training method and device for sentence word selection model and sentence word selection method and device
CN111144115A (en) * 2019-12-23 2020-05-12 北京百度网讯科技有限公司 Pre-training language model obtaining method and device, electronic equipment and storage medium
CN111222337A (en) * 2020-01-08 2020-06-02 山东旗帜信息有限公司 Training method and device for entity recognition model
CN111259663A (en) * 2020-01-14 2020-06-09 北京百度网讯科技有限公司 Information processing method and device
CN111401077A (en) * 2020-06-02 2020-07-10 腾讯科技(深圳)有限公司 Language model processing method and device and computer equipment
CN111460832A (en) * 2020-03-27 2020-07-28 北京百度网讯科技有限公司 Object coding method, device, system, equipment and computer storage medium
CN111522944A (en) * 2020-04-10 2020-08-11 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for outputting information
CN111581383A (en) * 2020-04-30 2020-08-25 上海电力大学 Chinese text classification method based on ERNIE-BiGRU
CN111737383A (en) * 2020-05-21 2020-10-02 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN111798986A (en) * 2020-07-07 2020-10-20 云知声智能科技股份有限公司 Data enhancement method and equipment
CN111814448A (en) * 2020-07-03 2020-10-23 苏州思必驰信息科技有限公司 Method and device for quantizing pre-training language model
CN112016300A (en) * 2020-09-09 2020-12-01 平安科技(深圳)有限公司 Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium
CN112307212A (en) * 2020-11-11 2021-02-02 上海昌投网络科技有限公司 Public opinion delivery monitoring method for advertisement delivery
CN112329391A (en) * 2020-11-02 2021-02-05 上海明略人工智能(集团)有限公司 Target encoder generation method, target encoder generation device, electronic equipment and computer readable medium
CN112329392A (en) * 2020-11-05 2021-02-05 上海明略人工智能(集团)有限公司 Target encoder construction method and device for bidirectional encoding
CN112635013A (en) * 2020-11-30 2021-04-09 泰康保险集团股份有限公司 Medical image information processing method and device, electronic equipment and storage medium
CN112749251A (en) * 2020-03-09 2021-05-04 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN113011176A (en) * 2021-03-10 2021-06-22 云从科技集团股份有限公司 Language model training and language reasoning method, device and computer storage medium thereof
CN113032559A (en) * 2021-03-15 2021-06-25 新疆大学 Language model fine-tuning method for low-resource adhesion language text classification
CN113033192A (en) * 2019-12-09 2021-06-25 株式会社理光 Training method and device for sequence labels and computer readable storage medium
CN113032560A (en) * 2021-03-16 2021-06-25 北京达佳互联信息技术有限公司 Sentence classification model training method, sentence processing method and equipment
WO2021169288A1 (en) * 2020-02-26 2021-09-02 平安科技(深圳)有限公司 Semantic understanding model training method and apparatus, computer device, and storage medium
CN113360751A (en) * 2020-03-06 2021-09-07 百度在线网络技术(北京)有限公司 Intention recognition method, apparatus, device and medium
CN113468877A (en) * 2021-07-09 2021-10-01 浙江大学 Language model fine-tuning method and device, computing equipment and storage medium
CN113486141A (en) * 2021-07-29 2021-10-08 宁波薄言信息技术有限公司 Text, resume and financing bulletin extraction method based on SegaBert pre-training model
CN113496122A (en) * 2020-04-08 2021-10-12 中移(上海)信息通信科技有限公司 Named entity identification method, device, equipment and medium
CN113591475A (en) * 2021-08-03 2021-11-02 美的集团(上海)有限公司 Unsupervised interpretable word segmentation method and device and electronic equipment
CN113656763A (en) * 2020-04-24 2021-11-16 支付宝(杭州)信息技术有限公司 Method and device for determining small program feature vector and electronic equipment
CN113779185A (en) * 2020-06-10 2021-12-10 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN113836297A (en) * 2021-07-23 2021-12-24 北京三快在线科技有限公司 Training method and device for text emotion analysis model
CN113887245A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Model training method and related device
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server
WO2022022421A1 (en) * 2020-07-29 2022-02-03 北京字节跳动网络技术有限公司 Language representation model system, pre-training method and apparatus, device and medium
CN114186043A (en) * 2021-12-10 2022-03-15 北京三快在线科技有限公司 Pre-training method, device, equipment and storage medium
CN114444488A (en) * 2022-01-26 2022-05-06 中国科学技术大学 Reading understanding method, system, device and storage medium for few-sample machine
CN114792097A (en) * 2022-05-14 2022-07-26 北京百度网讯科技有限公司 Method and device for determining prompt vector of pre-training model and electronic equipment
CN115017915A (en) * 2022-05-30 2022-09-06 北京三快在线科技有限公司 Model training and task executing method and device
WO2022222854A1 (en) * 2021-04-18 2022-10-27 华为技术有限公司 Data processing method and related device
CN117235233A (en) * 2023-10-24 2023-12-15 之江实验室 Automatic financial report question-answering method and device based on large model
CN113033192B (en) * 2019-12-09 2024-04-26 株式会社理光 Training method and device for sequence annotation and computer readable storage medium


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160027433A1 (en) * 2014-07-24 2016-01-28 International Business Machines Corporation Method of selecting training text for language model, and method of training language model using the training text, and computer and computer program for executing the methods
CN108415896A (en) * 2017-02-09 2018-08-17 北京京东尚科信息技术有限公司 Deep learning model training method, segmenting method, training system and Words partition system
CN108228758A (en) * 2017-12-22 2018-06-29 北京奇艺世纪科技有限公司 A kind of file classification method and device
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110032644A (en) * 2019-04-03 2019-07-19 人立方智能科技有限公司 Language model pre-training method
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF

Cited By (58)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008531B (en) * 2019-12-06 2023-05-26 北京金山数字娱乐科技有限公司 Training method and device for sentence selection model, sentence selection method and device
CN111008531A (en) * 2019-12-06 2020-04-14 北京金山数字娱乐科技有限公司 Training method and device for sentence word selection model and sentence word selection method and device
CN113033192A (en) * 2019-12-09 2021-06-25 株式会社理光 Training method and device for sequence labels and computer readable storage medium
CN113033192B (en) * 2019-12-09 2024-04-26 株式会社理光 Training method and device for sequence annotation and computer readable storage medium
CN111144115A (en) * 2019-12-23 2020-05-12 北京百度网讯科技有限公司 Pre-training language model obtaining method and device, electronic equipment and storage medium
CN111144115B (en) * 2019-12-23 2023-10-20 北京百度网讯科技有限公司 Pre-training language model acquisition method, device, electronic equipment and storage medium
CN111222337A (en) * 2020-01-08 2020-06-02 山东旗帜信息有限公司 Training method and device for entity recognition model
CN111259663A (en) * 2020-01-14 2020-06-09 北京百度网讯科技有限公司 Information processing method and device
CN111259663B (en) * 2020-01-14 2023-05-26 北京百度网讯科技有限公司 Information processing method and device
US11775776B2 (en) 2020-01-14 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
WO2021169288A1 (en) * 2020-02-26 2021-09-02 平安科技(深圳)有限公司 Semantic understanding model training method and apparatus, computer device, and storage medium
CN113360751A (en) * 2020-03-06 2021-09-07 百度在线网络技术(北京)有限公司 Intention recognition method, apparatus, device and medium
CN112749251A (en) * 2020-03-09 2021-05-04 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN112749251B (en) * 2020-03-09 2023-10-31 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN111460832A (en) * 2020-03-27 2020-07-28 北京百度网讯科技有限公司 Object coding method, device, system, equipment and computer storage medium
CN111460832B (en) * 2020-03-27 2023-11-24 北京百度网讯科技有限公司 Method, device, system, equipment and computer storage medium for object coding
CN113496122A (en) * 2020-04-08 2021-10-12 中移(上海)信息通信科技有限公司 Named entity identification method, device, equipment and medium
CN111522944B (en) * 2020-04-10 2023-11-14 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for outputting information
CN111522944A (en) * 2020-04-10 2020-08-11 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for outputting information
CN113656763B (en) * 2020-04-24 2024-01-09 支付宝(中国)网络技术有限公司 Method and device for determining feature vector of applet and electronic equipment
CN113656763A (en) * 2020-04-24 2021-11-16 支付宝(杭州)信息技术有限公司 Method and device for determining small program feature vector and electronic equipment
CN111581383A (en) * 2020-04-30 2020-08-25 上海电力大学 Chinese text classification method based on ERNIE-BiGRU
CN111737383B (en) * 2020-05-21 2021-11-23 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN111737383A (en) * 2020-05-21 2020-10-02 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN111401077A (en) * 2020-06-02 2020-07-10 腾讯科技(深圳)有限公司 Language model processing method and device and computer equipment
CN111401077B (en) * 2020-06-02 2020-09-18 腾讯科技(深圳)有限公司 Language model processing method and device and computer equipment
CN113779185A (en) * 2020-06-10 2021-12-10 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN113779185B (en) * 2020-06-10 2023-12-29 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN111814448A (en) * 2020-07-03 2020-10-23 苏州思必驰信息科技有限公司 Method and device for quantizing pre-training language model
CN111814448B (en) * 2020-07-03 2024-01-16 思必驰科技股份有限公司 Pre-training language model quantization method and device
CN111798986B (en) * 2020-07-07 2023-11-03 云知声智能科技股份有限公司 Data enhancement method and device
CN111798986A (en) * 2020-07-07 2020-10-20 云知声智能科技股份有限公司 Data enhancement method and equipment
WO2022022421A1 (en) * 2020-07-29 2022-02-03 北京字节跳动网络技术有限公司 Language representation model system, pre-training method and apparatus, device and medium
CN112016300A (en) * 2020-09-09 2020-12-01 平安科技(深圳)有限公司 Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium
CN112329391A (en) * 2020-11-02 2021-02-05 上海明略人工智能(集团)有限公司 Target encoder generation method, target encoder generation device, electronic equipment and computer readable medium
CN112329392A (en) * 2020-11-05 2021-02-05 上海明略人工智能(集团)有限公司 Target encoder construction method and device for bidirectional encoding
CN112329392B (en) * 2020-11-05 2023-12-22 上海明略人工智能(集团)有限公司 Method and device for constructing target encoder of bidirectional encoding
CN112307212A (en) * 2020-11-11 2021-02-02 上海昌投网络科技有限公司 Public opinion monitoring method for advertisement delivery
CN112635013A (en) * 2020-11-30 2021-04-09 泰康保险集团股份有限公司 Medical image information processing method and device, electronic equipment and storage medium
CN112635013B (en) * 2020-11-30 2023-10-27 泰康保险集团股份有限公司 Medical image information processing method and device, electronic equipment and storage medium
CN113011176A (en) * 2021-03-10 2021-06-22 云从科技集团股份有限公司 Language model training and language reasoning method, device and computer storage medium thereof
CN113032559A (en) * 2021-03-15 2021-06-25 新疆大学 Language model fine-tuning method for low-resource agglutinative language text classification
CN113032560B (en) * 2021-03-16 2023-10-27 北京达佳互联信息技术有限公司 Sentence classification model training method, sentence processing method and equipment
CN113032560A (en) * 2021-03-16 2021-06-25 北京达佳互联信息技术有限公司 Sentence classification model training method, sentence processing method and equipment
WO2022222854A1 (en) * 2021-04-18 2022-10-27 华为技术有限公司 Data processing method and related device
CN113468877A (en) * 2021-07-09 2021-10-01 浙江大学 Language model fine-tuning method and device, computing equipment and storage medium
CN113836297A (en) * 2021-07-23 2021-12-24 北京三快在线科技有限公司 Training method and device for text emotion analysis model
CN113836297B (en) * 2021-07-23 2023-04-14 北京三快在线科技有限公司 Training method and device for text emotion analysis model
CN113486141A (en) * 2021-07-29 2021-10-08 宁波薄言信息技术有限公司 Text, resume and financing bulletin extraction method based on SegaBert pre-training model
CN113591475A (en) * 2021-08-03 2021-11-02 美的集团(上海)有限公司 Unsupervised interpretable word segmentation method and device and electronic equipment
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server
CN113887245B (en) * 2021-12-02 2022-03-25 腾讯科技(深圳)有限公司 Model training method and related device
CN113887245A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Model training method and related device
CN114186043A (en) * 2021-12-10 2022-03-15 北京三快在线科技有限公司 Pre-training method, device, equipment and storage medium
CN114444488A (en) * 2022-01-26 2022-05-06 中国科学技术大学 Reading understanding method, system, device and storage medium for few-sample machine
CN114792097A (en) * 2022-05-14 2022-07-26 北京百度网讯科技有限公司 Method and device for determining prompt vector of pre-training model and electronic equipment
CN115017915A (en) * 2022-05-30 2022-09-06 北京三快在线科技有限公司 Model training and task executing method and device
CN117235233A (en) * 2023-10-24 2023-12-15 之江实验室 Automatic financial report question-answering method and device based on large model

Also Published As

Publication number Publication date
CN110489555B (en) 2022-03-08

Similar Documents

Publication Publication Date Title
CN110489555A (en) A kind of language model pre-training method of combination class word information
CN107943847B (en) 2019-06-25 Business relation extraction method, device and storage medium
CN110232114A (en) Sentence intention recognition method, device and computer readable storage medium
CN109145294B (en) Text entity identification method and device, electronic equipment and storage medium
CN106844349B (en) Comment spam recognition method based on coordinated training
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN110020438A (en) Enterprise or organization Chinese entity disambiguation method and device based on recognition sequence
CN107330011A (en) Named entity recognition method and device based on multi-strategy fusion
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN110222188A (en) A kind of company announcement processing method and server side based on multi-task learning
CN110489750A (en) Burmese word segmentation and part-of-speech tagging method and device based on bidirectional LSTM-CRF
CN110196982A (en) Hyponymy extraction method, device and computer equipment
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN111738018A (en) Intention understanding method, device, equipment and storage medium
Yolchuyeva et al. Self-attention networks for intent detection
CN113095063A (en) Two-stage emotion transfer method and system based on masked language model
Yuan et al. Personalized sentence generation using generative adversarial networks with author-specific word usage
CN110287396A (en) Text matching technique and device
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN114239555A (en) Training method of keyword extraction model and related device
Jiang et al. Construction of segmentation and part-of-speech annotation model in ancient Chinese
Liu Supervised ensemble learning for Vietnamese tokenization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant