CN110489555B - Language model pre-training method combined with similar word information - Google Patents

Language model pre-training method combined with similar word information

Info

Publication number
CN110489555B
Authority
CN
China
Prior art keywords
training
word
character
sentences
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910775453.4A
Other languages
Chinese (zh)
Other versions
CN110489555A (en)
Inventor
白佳欣
宋彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd
Original Assignee
Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd filed Critical Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd
Priority to CN201910775453.4A priority Critical patent/CN110489555B/en
Publication of CN110489555A publication Critical patent/CN110489555A/en
Application granted granted Critical
Publication of CN110489555B publication Critical patent/CN110489555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention relates to the technical field of language processing, in particular to a language model pre-training method combined with similar word information, which comprises the following steps: s1, providing a pre-training model and a pre-training text; s2, extracting character strings and forming a word list; s3, extracting two sentences as training sentences and simultaneously dividing the training sentences into single character sequences; s4, matching the character string in the step S2 with the characters in the single character sequence, and marking the character string matched with the characters in the single character sequence; s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model; and S6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model. The language model pre-training method combined with the similar word information and the pre-training model provided by the invention have better performance on a plurality of downstream tasks.

Description

Language model pre-training method combined with similar word information
[ technical field ]
The invention relates to the technical field of language processing, in particular to a language model pre-training method combined with word-like information.
[ background of the invention ]
Currently, the most advanced pre-trained language models fall into two types: autoregressive language models (Autoregressive Model) and autoencoding language models (Autoencoding Model). GPT and GPT-2 are representative autoregressive language models; the training goal of an autoregressive model is to correctly guess the next word from the preceding context. BERT is a representative autoencoding language model; its training goal is to correctly infer words that are masked or replaced based on the surrounding context. Each type has advantages and disadvantages. An autoregressive model can only use the preceding context and cannot draw on both the preceding and following context at the same time when completing a specific task. An autoencoding language model, on the other hand, can use context on both sides, but during pre-training a mask token is inserted into the corpus in place of the target word to be predicted, whereas no mask token appears during fine-tuning on a specific task. This mismatch between the pre-training and fine-tuning inputs degrades the overall performance of the model. Recently, XLNet was proposed to solve both of the above problems at once, allowing the pre-trained language model to use context on both sides without introducing a mask token.
However, the above language models do not fully use larger-granularity information such as words, phrases and entities that appear in the pre-training and fine-tuning corpora. Such information is particularly important for Chinese tasks. Unlike English, Chinese has no explicit word boundaries such as spaces, so it is harder for a model to learn the overall meaning of two-character or multi-character words from a sequence of single characters.
Recently, the BERT-wwm model was proposed as an optimization of BERT for Chinese that addresses the above problem. BERT-wwm differs from BERT only in the preprocessing of the corpus: when BERT masks the pre-training corpus, 15% of the single characters are replaced by the mask token and the rest are kept, whereas BERT-wwm first segments the corpus with a word segmentation tool and then performs the same masking operation with whole words as the unit. Earlier, Baidu released ERNIE, which is also an improvement of BERT aimed at the above problem. ERNIE adopts a multi-level masking strategy comprising character-level, phrase-level and entity-level masking. To support multi-level masking, ERNIE additionally uses Baidu Baike, Baidu Tieba and question-answering data on top of Chinese Wikipedia. Although ERNIE uses more training data and learns more knowledge, it performs only on par with BERT-wwm on downstream tasks.
Learning word boundary information through a multi-level masking strategy, however, has several problems. The effectiveness of the masking strategy relies on information beyond the text itself: BERT-wwm depends on the output of a word segmenter, while ERNIE depends on external knowledge. In practice this extra information has the following disadvantages. First, its quality cannot be guaranteed; the effect of BERT-wwm, for example, depends on the quality of the Chinese word segmentation. Second, high-quality information requires large-scale collection and annotation, which adds extra cost to pre-training a language model. Third, masking whole words alone does not fully exploit word information, because the meaning of a word may not be the literal sum of its characters, as with transliterated loanwords, idioms and two-part allegorical sayings.
To address these problems, this patent provides a new method that integrates word-like information into the pre-training and fine-tuning of a language model, building on existing language models.
[ summary of the invention ]
To overcome the low prediction accuracy and high cost of existing language models, the invention provides a language model pre-training method that combines similar word information.
In order to solve the technical problem, the invention provides a language model pre-training method combined with similar word information, which comprises the following steps: s1, providing a pre-training model and a pre-training text; s2, extracting character strings from the pre-training text and forming a word list; s3, extracting two sentences from the pre-training text as training sentences and simultaneously dividing the training sentences into single character sequences; s4, matching the character strings in the step S2 with the characters in the single character sequence, and marking the character strings matched with the characters in the single character sequence; s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model; and S6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model.
Preferably, in the step S2, the character strings are obtained by a word extraction algorithm or by manual extraction.
Preferably, in the step S3, adding [ sep ] to the end of each sentence in the extracted two training sentences, and adding [ cls ] to the beginning of the first sentence; in step S4, the character string is marked with the position information and/or the length information of the character string.
Preferably, in step S6, each time step S2 is executed, two sentences are extracted from the pre-training text as training sentences, until all sentences in the pre-training text have been extracted; the two sentences extracted each time are either adjacent or non-adjacent, and when the extraction is complete the proportion of adjacent pairs ranges from 40% to 70%, with adjacent and non-adjacent pairs together accounting for 100%.
Preferably, the step S5 specifically includes the following steps: s51, establishing an objective function related to the pre-training model; s52, selecting 15% of single characters from the single character sequence for covering or replacing; s53, inputting the covered or replaced training sentences and the marked character strings into a pre-training model simultaneously; s54, predicting covered or replaced words through a pre-training model to obtain vector expressions representing the covered or replaced words; and S55, calculating an objective function by using the vector expression and optimizing the pre-training model.
Preferably, the method for pre-training a language model in combination with word-like information further includes the following steps: and step S7, performing task fine adjustment on the optimized pre-training model obtained in the step S6 by combining the vocabulary formed in the step S2.
Preferably, the step S7 specifically includes the following steps: s71, providing fine tuning task texts; s72, segmenting the fine tuning task text into single character sequences; s73, matching the character string in the step S2 with the words in the single word sequence in the step S72 and marking the character string after matching; and S74, simultaneously inputting the single character sequence and the marked character string into the optimized pre-training model to finely adjust the pre-training model.
Preferably, in step S74, the objective function of the optimized pre-training model is optimized through a fully connected layer or a CRF network.
Preferably, the pre-training model comprises an embedding layer, a character-level encoder, a word-level encoder and a plurality of attention encoders. The embedding layer receives the covered or replaced training sentence from step S5 and the character strings marked in step S4; it converts each single character into a corresponding single-character embedding vector, converts each character string into a corresponding character-string embedding vector, and adds a position code to every single-character embedding vector and character-string embedding vector. The character-level encoder receives the single-character embedding vectors and their position codes and computes the vector expressions of the characters that are not covered or replaced. The word-level encoder receives the character-string embedding vectors and their position codes and computes the string vector expressions. The attention encoders, of which there are several, receive the uncovered-character vector expressions and the string vector expressions at the same time and produce the vector expressions of the covered or replaced characters.
Preferably, the pre-training model further includes a Linear network layer and a Softmax network layer, and the word vector expressions are output by the attention encoder and then input into the Linear network layer and the Softmax network layer to further train and fine tune the pre-training model.
Compared with the prior art, the language model pre-training method and the pre-training model combining the similar word information have the following beneficial effects:
Firstly, a pre-training model and a pre-training text are provided, character strings are extracted from the pre-training text to form a vocabulary, and then two sentences are extracted from the pre-training text as training sentences and divided into single-character sequences. By matching the character strings against the characters in the single-character sequence and marking the strings that match, a covered or replaced character can be predicted from string-related information rather than from the vectors of the individual characters alone. A character string is usually marked with its position information and length information, so the correlations among strings can be exploited when predicting covered or replaced characters, which improves the accuracy of the pre-training model on these characters. At the same time, the optimized pre-training model performs better on many tasks, for example downstream tasks such as Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis, natural language inference, sentence classification, machine reading comprehension and article classification.
The pre-training model provided by the invention comprises an embedding layer, a character-level encoder, a word-level encoder and a plurality of attention encoders. The word-level encoder receives the character-string embedding vectors and their position codes and computes the string vector expressions, and the attention encoders receive the uncovered-character vector expressions and the string vector expressions at the same time to obtain the vector expressions of the covered or replaced characters. Pairing a word-level encoder with the character-level encoder means that, when the attention encoders compute the vector expression of a covered or replaced character, the result represents that character more faithfully, which improves prediction accuracy. The optimized pre-training model therefore performs better on many tasks, for example downstream tasks such as Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis and document classification. Because word boundary information is added, the pre-training model also has stronger generation ability: it can be applied to tasks such as keyword generation, article continuation and article summarization, and the sentences it generates on these tasks are of higher quality.
[ description of the drawings ]
FIG. 1 is a flowchart illustrating a method for pre-training a language model in conjunction with information about word classes according to a first embodiment of the present invention;
FIG. 2 is a block diagram of a pre-training model used in the pre-training method of a language model with associated word-like information according to the first embodiment of the present invention;
FIG. 3 is a diagram illustrating matching of a single character sequence and a character string in step S4 of the language model pre-training method with word-like information according to the first embodiment of the present invention;
FIG. 4 is a flowchart illustrating the details of step S5 in the method for pre-training a language model according to the information of similar words in the first embodiment of the present invention;
FIG. 5 is a diagram illustrating the inputs to the pre-training model and the operations performed in steps S53 and S54 of the language model pre-training method combined with word-like information according to the first embodiment of the present invention;
FIG. 6 is a flowchart illustrating a variant embodiment of the language model pre-training method according to the first embodiment of the present invention;
FIG. 7 is a flowchart illustrating the details of step S7 in the method for pre-training a language model according to the information of word classes according to the first embodiment of the present invention;
fig. 8 is a block schematic diagram of an electronic device provided in a second embodiment of the invention;
FIG. 9 is a schematic block diagram of a computer system suitable for use with a server implementing an embodiment of the invention.
Description of reference numerals:
11. an embedding layer; 12. a character-level encoder; 13. a word-level encoder; 14. an attention encoder; 60. an electronic device; 601. a memory; 602. a processor; 800. a computer system; 801. a central processing unit (CPU); 802. a read-only memory (ROM); 803. a random access memory (RAM); 804. a bus; 805. an I/O interface; 806. an input section; 807. an output section; 808. a storage section; 809. a communication section; 810. a driver; 811. a removable medium.
[ detailed description of the embodiments ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, a first embodiment of the present invention provides a language model pre-training method combined with word-like information, which includes the following steps:
s1, providing a pre-training model and a pre-training text;
s2, extracting character strings from the pre-training text and forming a word list;
s3, extracting two sentences from the pre-training text as training sentences and simultaneously dividing the training sentences into single character sequences;
s4, matching the character strings in the step S2 with the characters in the single character sequence, and marking the character strings matched with the characters in the single character sequence;
s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model;
and S6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model.
In step S1, the pre-training text is selected from plain-text sources such as Wikipedia, news corpora, medical question-answering corpora and financial report data.
Referring to fig. 2, in the step S1, the pre-training model is obtained by improving an existing self-coding language model, which includes but is not limited to the BERT (Bidirectional Encoder Representations from Transformers) language model. The pre-training model comprises modules such as an embedding layer 11, a character-level encoder 12, a word-level encoder 13 and a plurality of attention encoders 14. The notation xN in fig. 2 indicates that a number of attention encoders 14 have been omitted from the figure.
The embedding layer 11 receives as input the covered or replaced training sentence from step S5 and the character strings marked in step S4; it converts each single character into a corresponding single-character embedding vector, converts each character string into a corresponding character-string embedding vector, and adds a position code to every single-character embedding vector and character-string embedding vector.
The character-level encoder 12 receives the single-character embedding vectors and their position codes and computes the vector expressions of the characters that are not covered or replaced.
The word-level encoder 13 receives the character-string embedding vectors and their position codes and computes the string vector expressions.
There are several attention encoders 14; the vector expressions of the uncovered characters and the string vector expressions are input into them at the same time to obtain the vector expressions of the covered or replaced characters.
The pre-training model further comprises a Linear network layer 15 and a Softmax network layer 16; the vector expressions of the covered or replaced characters output by the attention encoders 14 are fed into the Linear network layer 15 and the Softmax network layer 16 to complete the optimization and fine-tuning tasks of the pre-training model.
In step S2, character strings are extracted from the pre-training text to form a vocabulary. The character strings can be obtained by a word extraction algorithm or by manual extraction. A word extraction algorithm is also called a word segmentation algorithm or a string-matching word segmentation algorithm: it matches candidate strings against the entries of a sufficiently large dictionary according to a certain strategy, and a successful match means that a word has been recognized. Optionally, the extraction algorithm includes but is not limited to accessor variety. Of course, the pre-training text may also be processed manually and the extracted character strings placed in the vocabulary. When the pre-training text contains strings with relatively low frequency, such as two-part allegorical sayings, idioms or colloquialisms, adding them to the vocabulary by manual extraction enriches the vocabulary and improves the optimization of the pre-training model.
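The patent leaves the choice of extraction algorithm open, so the following Python sketch uses a plain n-gram frequency cut-off as a stand-in for algorithms such as accessor variety; the length limit max_len and the threshold min_count are illustrative assumptions, not values taken from the patent.

```python
from collections import Counter

def build_vocab(corpus_sentences, max_len=4, min_count=10):
    """Collect candidate character strings (n-grams) for the vocabulary of step S2.
    A raw frequency cut-off is used here only as a stand-in for a real word
    extraction algorithm; manually curated strings can simply be added to the set."""
    counts = Counter()
    for sent in corpus_sentences:
        for n in range(2, max_len + 1):                 # strings of 2..max_len characters
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {s for s, c in counts.items() if c >= min_count}

# Toy usage on two sentences (min_count lowered so that some strings survive):
vocab = build_vocab(["人工智能实验室很犀利", "人工智能改变世界"], min_count=2)
```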
In step S3, two sentences are extracted from the pre-training text as training sentences and divided into a single-character sequence. Dividing a training sentence into a single-character sequence means splitting the sentence so that a single character is the smallest unit, which can be done, for example, with a split function. Step S3 further includes the following operation: adding [ sep ] at the end of each of the two extracted training sentences and adding [ cls ] at the beginning of the first sentence.
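A minimal sketch of the segmentation in step S3, assuming the [cls]/[sep] markers are added exactly as described above; the function name and the way the two sentences are passed in are illustrative.

```python
def to_char_sequence(sentence_a, sentence_b):
    """Split two training sentences into one single-character sequence (step S3),
    appending [sep] to each sentence and prepending [cls] to the first."""
    return ["[cls]"] + list(sentence_a) + ["[sep]"] + list(sentence_b) + ["[sep]"]

# e.g. to_char_sequence("人工智能实验室很犀利", "人工智能改变世界")
# -> ['[cls]', '人', '工', ..., '[sep]', '人', ..., '[sep]']
```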
Referring to fig. 3, in step S4 the character strings from step S2 are matched against the characters in the single-character sequence, and the strings that match are marked. In this embodiment the single-character sequence corresponds to the characters of the Chinese sentence 人工智能实验室很犀利 ("the artificial intelligence laboratory is very sharp"), i.e. 人, 工, 智, 能, 实, 验, 室, 很, 犀, 利 (rendered character by character in the original translation as "human, artificial, intelligent, energy, reality, experience, laboratory, very, rhinoceros, and profit"), and the vocabulary contains the strings 人工 ("artificial"), 智能 ("intelligent"), 实验室 ("laboratory") and 犀利 ("sharp"). The characters in the single-character sequence and the strings are then matched as shown in fig. 3. After a character string is matched with the training sentence, it is marked with the position information and/or length information of the corresponding part of the training sentence.
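A sketch of the matching and marking in step S4, assuming a mark is simply the (string, start position, length) triple; the patent only requires that position and/or length information be recorded, so the exact data structure is an assumption.

```python
def match_strings(char_seq, vocab, max_len=4):
    """Mark every vocabulary string that occurs in the single-character sequence
    (step S4) with its start position and length."""
    marks = []
    for i in range(len(char_seq)):
        for n in range(2, max_len + 1):
            candidate = "".join(char_seq[i:i + n])
            if candidate in vocab:
                marks.append((candidate, i, n))         # (string, position, length)
    return marks

# With vocab = {"人工", "智能", "实验室", "犀利"} and the example sentence above,
# this returns marks such as ("人工", 1, 2) and ("实验室", 5, 3).
```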
In step S5, a preset proportion of the single characters in the single-character sequence is selected for covering or replacement, and the covered or replaced training sentences together with the marked character strings are input into the pre-training model at the same time to train and optimize it. The preset proportion is usually the proportion of covered or replaced characters in the whole sentence; it lies in the range of 10-30%, and in this embodiment 15% is selected.
Referring to fig. 4, the step S5 specifically includes the following steps:
s51, establishing an objective function related to the pre-training model;
s52, selecting 15% of single characters from the single character sequence for covering or replacing;
s53, inputting the covered or replaced training sentences and the marked character strings into a pre-training model simultaneously;
s54, predicting covered or replaced words through a pre-training model to obtain vector expressions representing the covered or replaced words; and
and S55, calculating an objective function by using the vector expression and optimizing the pre-training model.
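A minimal sketch of the covering in steps S52-S53, assuming the special markers are never selected and that every selected character is simply replaced by a [mask] token; the patent also allows replacement by other characters, which is omitted here.

```python
import random

def mask_characters(char_seq, ratio=0.15, mask_token="[mask]"):
    """Cover a preset proportion (15% by default) of the single characters in the
    sequence and return the masked sequence plus the covered positions, whose
    original characters become the prediction targets of step S54."""
    candidates = [i for i, c in enumerate(char_seq) if c not in ("[cls]", "[sep]")]
    n_mask = max(1, int(len(candidates) * ratio))
    picked = set(random.sample(candidates, n_mask))
    masked = [mask_token if i in picked else c for i, c in enumerate(char_seq)]
    return masked, sorted(picked)
```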
Referring to fig. 5, in step S53 the extracted training sentence is 人工智能实验室很犀利 ("the artificial intelligence laboratory is very sharp"); the characters 工 and 犀 are replaced with the [mask] token, and the covered or replaced training sentence and the marked character strings are input into the embedding layer 11 of the pre-training model at the same time. The embedding layer 11 converts each single character into a corresponding single-character embedding vector, converts each character string into a corresponding character-string embedding vector, and adds a position code to every single-character embedding vector and character-string embedding vector. A position encoder is provided in the embedding layer 11, and the position codes are added to the single-character embedding vectors and the character-string embedding vectors by this position encoder; the position code indicates the position at which each character or string appears in the training sentence.
As shown in fig. 5, c1-c10 correspond to the single-character sequence 人, [mask], 智, 能, 实, 验, 室, 很, [mask], 利 (with 工 and 犀 covered); w1, w2, w3, w4 and w5 correspond to the embedding vectors of the character strings 人工 ("artificial"), 智能 ("intelligent"), 实验 ("experimental"), 实验室 ("laboratory") and 犀利 ("sharp"), respectively.
Further, the single-character embedding vectors and their position codes are input into the character-level encoder 12, which computes the vector expressions of the characters that are not covered or replaced; the character-string embedding vectors and their position codes are input into the word-level encoder 13, which computes the string vector expressions.
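A condensed PyTorch sketch of the embedding layer 11, character-level encoder 12 and word-level encoder 13 described above. The use of standard Transformer encoder layers, the layer counts and the dimensions are assumptions for illustration; the patent does not fix these internals.

```python
import torch
import torch.nn as nn

class CharAndWordEncoders(nn.Module):
    """Embedding layer plus character-level and word-level encoders (modules 11-13)."""
    def __init__(self, n_chars, n_strings, d_model=256, max_pos=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)       # single-character embeddings
        self.string_emb = nn.Embedding(n_strings, d_model)   # character-string embeddings
        self.pos_emb = nn.Embedding(max_pos, d_model)        # position coding
        self.char_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.word_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

    def forward(self, char_ids, char_pos, string_ids, string_pos):
        # vector expressions of the (partly masked) single-character sequence
        char_states = self.char_encoder(self.char_emb(char_ids) + self.pos_emb(char_pos))
        # vector expressions of the matched character strings
        word_states = self.word_encoder(self.string_emb(string_ids) + self.pos_emb(string_pos))
        return char_states, word_states
```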
In step S54, the vector expressions of the uncovered characters and the string vector expressions are input into the attention encoders 14 at the same time to obtain the vector expression of the "[ cls ]" marker and the vector expressions of the covered or replaced single characters. Because each string vector carries position information and length information as its mark and has been matched with the characters, the attention encoders 14 compute the vector expressions of covered or replaced characters more accurately, so the predicted characters are more accurate and the model performs better on many tasks, for example downstream tasks such as Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis, natural language inference, sentence classification, machine reading comprehension and article classification.
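A sketch of one attention encoder 14 as cross-attention from the character representations to the string representations, so that a covered character can draw on matched-string information. Treating the attention encoder as standard multi-head attention with a residual connection is an assumption; the patent does not spell out its internal structure.

```python
import torch.nn as nn

class CharWordAttentionEncoder(nn.Module):
    """One attention encoder (module 14): character states attend to string states."""
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, char_states, word_states):
        # queries come from the characters, keys/values from the matched strings
        attended, _ = self.attn(char_states, word_states, word_states)
        return self.norm(char_states + attended)
```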
After step S55 is completed, one round of optimization of the pre-training model is done, and step S6 then repeats steps S2-S5 until the pre-training model reaches the set optimization condition, yielding the optimized pre-training model. The set optimization condition corresponds to the objective function reaching convergence. Each time step S2 is executed, two sentences are extracted from the pre-training text as training sentences, until all sentences in the pre-training text have been extracted; the two sentences extracted each time are either adjacent or non-adjacent, and when the extraction is complete the proportion of adjacent pairs ranges from 40% to 70%, with adjacent and non-adjacent pairs together accounting for 100%. In this embodiment the proportion of each is 50%.
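A simplified sampler for the sentence-pair construction described above, assuming each document is a list of at least two sentences. The patent walks through the whole corpus rather than sampling at random, and only requires the adjacent-pair proportion to end up in the 40-70% range (50% in this embodiment), so the random draw below is an illustrative simplification.

```python
import random

def sample_sentence_pair(documents, p_adjacent=0.5):
    """Return (first sentence, second sentence, adjacent?) for one training pair."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    first = doc[i]
    if random.random() < p_adjacent:
        return first, doc[i + 1], True               # adjacent pair
    other = random.choice(documents)
    return first, random.choice(other), False        # non-adjacent pair
```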
Referring to fig. 6, the method for pre-training a language model in combination with word-like information further includes the following steps: and step S7, performing task fine adjustment on the optimized pre-training model obtained in the step S6 by combining the vocabulary formed in the step S2.
Referring to fig. 7, the step S7 specifically includes the following steps:
s71, providing fine tuning task texts;
s72, segmenting the fine tuning task text into single character sequences;
s73, matching the character string in the step S2 with the words in the single word sequence in the step S72 and marking the character string after matching;
and S74, simultaneously inputting the single character sequence and the marked character string into the optimized pre-training model to finely adjust the pre-training model.
In step S71, the fine-tuning task text is also selected from plain-text sources such as Wikipedia, news corpora, medical question-answering corpora and financial data, but it must not be the same as the pre-training text in step S1.
In the above step S74, the objective function of the optimized pre-training model is optimized through a fully connected layer or a CRF network.
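A minimal sketch of a task head for step S74, showing only the fully connected (Linear) option; the CRF alternative mentioned above is omitted. The number of labels, the hidden size and the use of the [cls] position for sentence-level tasks are assumptions for illustration.

```python
import torch.nn as nn

class FineTuneHead(nn.Module):
    """Fully connected task head placed on top of the optimized pre-training model."""
    def __init__(self, d_model=256, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, encoder_states):
        # sentence-level tasks typically read the [cls] position (index 0);
        # sequence labelling tasks would instead classify every position
        return self.classifier(encoder_states[:, 0, :])
```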
Referring to fig. 8, a second embodiment of the present invention provides an electronic device 60, which includes a memory 601 and a processor 602, where the memory 601 stores a computer program, and the computer program is configured to execute the method for pre-training a language model according to the first embodiment in combination with generic word information when running;
the processor 602 is arranged to perform a method of pre-training a language model in combination with word-like information as described in the first embodiment by means of the computer program.
Referring now to fig. 9, a block diagram of a computer system 800 suitable for implementing a terminal device/server of an embodiment of the present application is shown. The terminal device/server shown in fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 9, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 810 as necessary, so that a computer program read out from it is installed into the storage section 808 as necessary.
According to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the management-side computer, partly on the management-side computer, as a stand-alone software package, partly on the management-side computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the management-side computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A language model pre-training method combined with similar word information is characterized in that: which comprises the following steps:
s1, providing a pre-training model and a pre-training text;
s2, extracting character strings from the pre-training text and forming a word list;
s3, extracting two sentences from the pre-training text as training sentences and simultaneously dividing the training sentences into single character sequences;
s4, matching the character strings in the step S2 with the characters in the single character sequence, and marking the character strings matched with the characters in the single character sequence;
s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model;
s6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model;
s7, combining the vocabulary formed in the step S2 to perform task fine adjustment on the optimized pre-training model obtained in the step S6;
the step S7 specifically includes the following steps:
s71, providing fine tuning task texts;
s72, segmenting the fine tuning task text into single character sequences;
s73, matching the character string in the step S2 with the words in the single word sequence in the step S72 and marking the character string after matching;
and S74, simultaneously inputting the single character sequence and the marked character string into the optimized pre-training model to finely adjust the pre-training model.
2. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in the above step S2, the character string is obtained by the word extraction algorithm or artificially extracted.
3. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in step S3, adding [ sep ] to the end of each sentence in the two extracted training sentences and adding [ cls ] to the beginning of the first sentence; in step S4, the character string is marked with the position information and/or the length information of the character string.
4. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in the step S6, each time step S2 is executed, two sentences are extracted one by one from the pre-training document as training sentences until all sentences in the pre-training document are extracted, the two sentences extracted each time are adjacent or non-adjacent, and when the extraction is completed, the ratio range of the two adjacent sentences to the two non-adjacent sentences is 40-70%, and the sum of the two sentences is 100%.
5. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: the step S5 specifically includes the following steps:
s51, establishing an objective function related to the pre-training model;
s52, selecting 15% of single characters from the single character sequence for covering or replacing;
s53, inputting the covered or replaced training sentences and the marked character strings into a pre-training model simultaneously;
s54, predicting covered or replaced words through a pre-training model to obtain vector expressions representing the covered or replaced words; and
and S55, calculating an objective function by using the vector expression and optimizing the pre-training model.
6. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in the above step S74, the optimized pre-training model is optimized through a full connection layer or a CRF network optimization objective function.
7. A method for language model pre-training in combination with word-like information as claimed in any one of claims 1 to 6, wherein: the pre-training model comprises an embedded layer, a character-level encoder, a word-level encoder and a plurality of attention encoders; wherein,
the embedding layer for inputting the covered or replaced training sentence of the step S5 and the string marked in the step S4, converting the words into word embedding vectors corresponding to each word and converting each string into string embedding vectors corresponding to each string while adding position codes to each word embedding vector and string embedding vector;
the character-level encoder is used for inputting the single-character embedded vector and the position code corresponding to the single-character embedded vector and calculating to obtain the word vector expression of the uncovered or replaced word;
the word level encoder is used for inputting the character string embedding vector and the position code corresponding to the character string embedding vector and calculating to obtain word vector expression;
the attention encoder is in a plurality, and the uncovered or replaced word vector expression and the word vector expression are simultaneously input to obtain a vector expression of the covered or replaced word; the pre-training model further comprises a Linear network layer and a Softmax network layer, and the word vector expressions are output by the attention encoder and then input into the Linear network layer and the Softmax network layer to further train and fine tune the pre-training model.
CN201910775453.4A 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information Active CN110489555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910775453.4A CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910775453.4A CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Publications (2)

Publication Number Publication Date
CN110489555A CN110489555A (en) 2019-11-22
CN110489555B (en) 2022-03-08

Family

ID=68552689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910775453.4A Active CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Country Status (1)

Country Link
CN (1) CN110489555B (en)

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008531B (en) * 2019-12-06 2023-05-26 北京金山数字娱乐科技有限公司 Training method and device for sentence selection model, sentence selection method and device
CN111144115B (en) * 2019-12-23 2023-10-20 北京百度网讯科技有限公司 Pre-training language model acquisition method, device, electronic equipment and storage medium
CN111222337A (en) * 2020-01-08 2020-06-02 山东旗帜信息有限公司 Training method and device for entity recognition model
CN111259663B (en) 2020-01-14 2023-05-26 北京百度网讯科技有限公司 Information processing method and device
CN111444311A (en) * 2020-02-26 2020-07-24 平安科技(深圳)有限公司 Semantic understanding model training method and device, computer equipment and storage medium
CN113360751A (en) * 2020-03-06 2021-09-07 百度在线网络技术(北京)有限公司 Intention recognition method, apparatus, device and medium
CN112749251B (en) * 2020-03-09 2023-10-31 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN111460832B (en) * 2020-03-27 2023-11-24 北京百度网讯科技有限公司 Method, device, system, equipment and computer storage medium for object coding
CN113496122A (en) * 2020-04-08 2021-10-12 中移(上海)信息通信科技有限公司 Named entity identification method, device, equipment and medium
CN111522944B (en) * 2020-04-10 2023-11-14 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for outputting information
CN111241496B (en) * 2020-04-24 2021-06-29 支付宝(杭州)信息技术有限公司 Method and device for determining small program feature vector and electronic equipment
CN111581383A (en) * 2020-04-30 2020-08-25 上海电力大学 Chinese text classification method based on ERNIE-BiGRU
CN111737383B (en) * 2020-05-21 2021-11-23 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN111401077B (en) * 2020-06-02 2020-09-18 腾讯科技(深圳)有限公司 Language model processing method and device and computer equipment
CN113779185B (en) * 2020-06-10 2023-12-29 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN111814448B (en) * 2020-07-03 2024-01-16 思必驰科技股份有限公司 Pre-training language model quantization method and device
CN111798986B (en) * 2020-07-07 2023-11-03 云知声智能科技股份有限公司 Data enhancement method and device
CN111914551B (en) * 2020-07-29 2022-05-20 北京字节跳动网络技术有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112016300B (en) * 2020-09-09 2022-10-14 平安科技(深圳)有限公司 Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium
CN112329391A (en) * 2020-11-02 2021-02-05 上海明略人工智能(集团)有限公司 Target encoder generation method, target encoder generation device, electronic equipment and computer readable medium
CN112329392B (en) * 2020-11-05 2023-12-22 上海明略人工智能(集团)有限公司 Method and device for constructing target encoder of bidirectional encoding
CN112307212A (en) * 2020-11-11 2021-02-02 上海昌投网络科技有限公司 Public opinion delivery monitoring method for advertisement delivery
CN112635013B (en) * 2020-11-30 2023-10-27 泰康保险集团股份有限公司 Medical image information processing method and device, electronic equipment and storage medium
CN113011176A (en) * 2021-03-10 2021-06-22 云从科技集团股份有限公司 Language model training and language reasoning method, device and computer storage medium thereof
CN113032559B (en) * 2021-03-15 2023-04-28 新疆大学 Language model fine tuning method for low-resource adhesive language text classification
CN113032560B (en) * 2021-03-16 2023-10-27 北京达佳互联信息技术有限公司 Sentence classification model training method, sentence processing method and equipment
CN115292439A (en) * 2021-04-18 2022-11-04 华为技术有限公司 Data processing method and related equipment
CN113468877A (en) * 2021-07-09 2021-10-01 浙江大学 Language model fine-tuning method and device, computing equipment and storage medium
CN113836297B (en) * 2021-07-23 2023-04-14 北京三快在线科技有限公司 Training method and device for text emotion analysis model
CN113486141A (en) * 2021-07-29 2021-10-08 宁波薄言信息技术有限公司 Text, resume and financing bulletin extraction method based on SegaBert pre-training model
CN113591475B (en) * 2021-08-03 2023-07-21 美的集团(上海)有限公司 Method and device for unsupervised interpretable word segmentation and electronic equipment
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server
CN113887245B (en) * 2021-12-02 2022-03-25 腾讯科技(深圳)有限公司 Model training method and related device
CN114186043B (en) * 2021-12-10 2022-10-21 北京三快在线科技有限公司 Pre-training method, device, equipment and storage medium
CN114444488B (en) * 2022-01-26 2023-03-24 中国科学技术大学 Few-sample machine reading understanding method, system, equipment and storage medium
CN114792097B (en) * 2022-05-14 2022-12-06 北京百度网讯科技有限公司 Method and device for determining prompt vector of pre-training model and electronic equipment
CN115017915B (en) * 2022-05-30 2023-05-30 北京三快在线科技有限公司 Model training and task execution method and device
CN117235233A (en) * 2023-10-24 2023-12-15 之江实验室 Automatic financial report question-answering method and device based on large model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6004452B2 (en) * 2014-07-24 2016-10-05 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method for selecting learning text for language model, method for learning language model using the learning text, and computer and computer program for executing the same

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415896A (en) * 2017-02-09 2018-08-17 北京京东尚科信息技术有限公司 Deep learning model training method, segmenting method, training system and Words partition system
CN108228758A (en) * 2017-12-22 2018-06-29 北京奇艺世纪科技有限公司 A kind of file classification method and device
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110032644A (en) * 2019-04-03 2019-07-19 人立方智能科技有限公司 Language model pre-training method
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF

Also Published As

Publication number Publication date
CN110489555A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN113723105A (en) Training method, device and equipment of semantic feature extraction model and storage medium
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN116306600B (en) MacBert-based Chinese text error correction method
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Moeng et al. Canonical and surface morphological segmentation for nguni languages
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN114970536A (en) Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition
CN114742069A (en) Code similarity detection method and device
CN114333838A (en) Method and system for correcting voice recognition text
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113095063A (en) Two-stage emotion migration method and system based on masking language model
US20230289528A1 (en) Method for constructing sentiment classification model based on metaphor identification
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
CN115359323A (en) Image text information generation method and deep learning model training method
CN114330350A (en) Named entity identification method and device, electronic equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN113761875A (en) Event extraction method and device, electronic equipment and storage medium
Chen et al. Fast OOV words incorporation using structured word embeddings for neural network language model
CN115114915B (en) Phrase identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant