CN110489555B - Language model pre-training method combined with similar word information - Google Patents

Language model pre-training method combined with similar word information

Info

Publication number
CN110489555B
Authority
CN
China
Prior art keywords
training
word
character
sentences
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910775453.4A
Other languages
Chinese (zh)
Other versions
CN110489555A (en)
Inventor
白佳欣
宋彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd
Original Assignee
Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd filed Critical Innovation Workshop Guangzhou Artificial Intelligence Research Co ltd
Priority to CN201910775453.4A priority Critical patent/CN110489555B/en
Publication of CN110489555A publication Critical patent/CN110489555A/en
Application granted granted Critical
Publication of CN110489555B publication Critical patent/CN110489555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention relates to the technical field of language processing, in particular to a language model pre-training method combined with similar word information, which comprises the following steps: s1, providing a pre-training model and a pre-training text; s2, extracting character strings and forming a word list; s3, extracting two sentences as training sentences and simultaneously dividing the training sentences into single character sequences; s4, matching the character string in the step S2 with the characters in the single character sequence, and marking the character string matched with the characters in the single character sequence; s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model; and S6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model. The language model pre-training method combined with the similar word information and the pre-training model provided by the invention have better performance on a plurality of downstream tasks.

Description

Language model pre-training method combined with similar word information
[ technical field ]
The invention relates to the technical field of language processing, in particular to a language model pre-training method combined with word-like information.
[ background of the invention ]
Currently, the most advanced pre-trained language models fall into two types: autoregressive language models (Autoregressive Model) and autoencoding language models (Autoencoding Model). GPT and GPT-2 are representative autoregressive language models; the training goal of an autoregressive model is to correctly guess the next word from the preceding context. BERT is a representative autoencoding language model; its training goal is to correctly infer words that are masked or replaced based on the surrounding context. Each type has advantages and disadvantages. An autoregressive model can only use the preceding context and cannot draw on both the preceding and following context at the same time when completing a specific task. An autoencoding language model, on the other hand, can use context on both sides, but during pre-training a mask token is inserted into the corpus in place of the target word to be predicted, whereas no mask token appears during fine-tuning on a specific task. This mismatch between the pre-training and fine-tuning inputs degrades the overall performance of the model. Recently, XLNet was proposed to solve both of the above problems at once, allowing the pre-trained language model to use context on both sides without introducing a mask token.
However, the above language models do not fully use larger-granularity information such as words, phrases and entities that appear in the pre-training and fine-tuning corpora. Such information is particularly important for Chinese tasks. Unlike English, Chinese has no explicit word boundaries such as spaces, so it is harder for a model to learn the overall meaning of two-character or multi-character words from a sequence of single characters.
Recently, the BERT-wwm model was proposed as an optimization of BERT for Chinese that addresses the above problem. BERT-wwm differs from BERT only in the preprocessing of the corpus: when BERT masks the pre-training corpus, 15% of the single characters are replaced by the mask token and the rest are kept, whereas BERT-wwm first segments the corpus with a word segmentation tool and then performs the same masking operation with whole words as the unit. Earlier, Baidu released ERNIE, which is also an improvement of BERT aimed at the above problem. ERNIE adopts a multi-level masking strategy comprising character-level, phrase-level and entity-level masking. To support multi-level masking, ERNIE additionally uses Baidu Baike, Baidu Tieba and question-answering data on top of Chinese Wikipedia. Although ERNIE uses more training data and learns more knowledge, it performs only on par with BERT-wwm on downstream tasks.
Learning word boundary information through a multi-level masking strategy, however, has several problems. The effectiveness of the masking strategy relies on information beyond the text itself: BERT-wwm depends on the output of a word segmenter, while ERNIE depends on external knowledge. In practice this extra information has the following disadvantages. First, its quality cannot be guaranteed; the effect of BERT-wwm, for example, depends on the quality of the Chinese word segmentation. Second, high-quality information requires large-scale collection and annotation, which adds extra cost to pre-training a language model. Third, masking whole words alone does not fully exploit word information, because the meaning of a word may not be the literal sum of its characters, as with transliterated loanwords, idioms and two-part allegorical sayings.
To address these problems, this patent provides a new method that integrates word-like information into the pre-training and fine-tuning of a language model, building on existing language models.
[ summary of the invention ]
To overcome the low prediction accuracy and high cost of existing language models, the invention provides a language model pre-training method that combines similar word information.
In order to solve the technical problem, the invention provides a language model pre-training method combined with similar word information, which comprises the following steps: s1, providing a pre-training model and a pre-training text; s2, extracting character strings from the pre-training text and forming a word list; s3, extracting two sentences from the pre-training text as training sentences and simultaneously dividing the training sentences into single character sequences; s4, matching the character strings in the step S2 with the characters in the single character sequence, and marking the character strings matched with the characters in the single character sequence; s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model; and S6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model.
Preferably, in the step S2, the character strings are obtained by a word extraction algorithm or by manual extraction.
Preferably, in the step S3, adding [ sep ] to the end of each sentence in the extracted two training sentences, and adding [ cls ] to the beginning of the first sentence; in step S4, the character string is marked with the position information and/or the length information of the character string.
Preferably, in step S6, each time step S2 is executed, two sentences are extracted from the pre-training text as training sentences, until all sentences in the pre-training text have been extracted; the two sentences extracted each time are either adjacent or non-adjacent, and when the extraction is complete the proportion of adjacent pairs ranges from 40% to 70%, with adjacent and non-adjacent pairs together accounting for 100%.
Preferably, the step S5 specifically includes the following steps: s51, establishing an objective function related to the pre-training model; s52, selecting 15% of single characters from the single character sequence for covering or replacing; s53, inputting the covered or replaced training sentences and the marked character strings into a pre-training model simultaneously; s54, predicting covered or replaced words through a pre-training model to obtain vector expressions representing the covered or replaced words; and S55, calculating an objective function by using the vector expression and optimizing the pre-training model.
Preferably, the method for pre-training a language model in combination with word-like information further includes the following steps: and step S7, performing task fine adjustment on the optimized pre-training model obtained in the step S6 by combining the vocabulary formed in the step S2.
Preferably, the step S7 specifically includes the following steps: s71, providing fine tuning task texts; s72, segmenting the fine tuning task text into single character sequences; s73, matching the character string in the step S2 with the words in the single word sequence in the step S72 and marking the character string after matching; and S74, simultaneously inputting the single character sequence and the marked character string into the optimized pre-training model to finely adjust the pre-training model.
Preferably, in step S74, the objective function of the optimized pre-training model is optimized through a fully connected layer or a CRF network.
Preferably, the pre-training model comprises an embedding layer, a character-level encoder, a word-level encoder and a plurality of attention encoders. The embedding layer receives the covered or replaced training sentence from step S5 and the character strings marked in step S4; it converts each single character into a corresponding single-character embedding vector, converts each character string into a corresponding character-string embedding vector, and adds a position code to every single-character embedding vector and character-string embedding vector. The character-level encoder receives the single-character embedding vectors and their position codes and computes the vector expressions of the characters that are not covered or replaced. The word-level encoder receives the character-string embedding vectors and their position codes and computes the string vector expressions. The attention encoders, of which there are several, receive the uncovered-character vector expressions and the string vector expressions at the same time and produce the vector expressions of the covered or replaced characters.
Preferably, the pre-training model further includes a Linear network layer and a Softmax network layer, and the word vector expressions are output by the attention encoder and then input into the Linear network layer and the Softmax network layer to further train and fine tune the pre-training model.
Compared with the prior art, the language model pre-training method and the pre-training model combining the similar word information have the following beneficial effects:
Firstly, a pre-training model and a pre-training text are provided, character strings are extracted from the pre-training text to form a vocabulary, and then two sentences are extracted from the pre-training text as training sentences and divided into single-character sequences. By matching the character strings against the characters in the single-character sequence and marking the strings that match, a covered or replaced character can be predicted from string-related information rather than from the vectors of the individual characters alone. A character string is usually marked with its position information and length information, so the correlations among strings can be exploited when predicting covered or replaced characters, which improves the accuracy of the pre-training model on these characters. At the same time, the optimized pre-training model performs better on many tasks, for example downstream tasks such as Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis, natural language inference, sentence classification, machine reading comprehension and article classification.
The pre-training model provided by the invention comprises an embedding layer, a character-level encoder, a word-level encoder and a plurality of attention encoders. The word-level encoder receives the character-string embedding vectors and their position codes and computes the string vector expressions, and the attention encoders receive the uncovered-character vector expressions and the string vector expressions at the same time to obtain the vector expressions of the covered or replaced characters. Pairing a word-level encoder with the character-level encoder means that, when the attention encoders compute the vector expression of a covered or replaced character, the result represents that character more faithfully, which improves prediction accuracy. The optimized pre-training model therefore performs better on many tasks, for example downstream tasks such as Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis and document classification. Because word boundary information is added, the pre-training model also has stronger generation ability: it can be applied to tasks such as keyword generation, article continuation and article summarization, and the sentences it generates on these tasks are of higher quality.
[ description of the drawings ]
FIG. 1 is a flowchart illustrating a method for pre-training a language model in conjunction with information about word classes according to a first embodiment of the present invention;
FIG. 2 is a block diagram of a pre-training model used in the pre-training method of a language model with associated word-like information according to the first embodiment of the present invention;
FIG. 3 is a diagram illustrating matching of a single character sequence and a character string in step S4 of the language model pre-training method with word-like information according to the first embodiment of the present invention;
FIG. 4 is a flowchart illustrating the details of step S5 in the method for pre-training a language model according to the information of similar words in the first embodiment of the present invention;
FIG. 5 is a diagram illustrating the inputs to the pre-training model and the operations performed in steps S53 and S54 of the language model pre-training method combined with word-like information according to the first embodiment of the present invention;
FIG. 6 is a flowchart illustrating a variant embodiment of the language model pre-training method according to the first embodiment of the present invention;
FIG. 7 is a flowchart illustrating the details of step S7 in the method for pre-training a language model according to the information of word classes according to the first embodiment of the present invention;
fig. 8 is a block schematic diagram of an electronic device provided in a second embodiment of the invention;
FIG. 9 is a schematic block diagram of a computer system suitable for use with a server implementing an embodiment of the invention.
Description of reference numerals:
11. an embedding layer; 12. a character-level encoder; 13. a word-level encoder; 14. an attention encoder; 60. an electronic device; 601. a memory; 602. a processor; 800. a computer system; 801. a central processing unit (CPU); 802. a read-only memory (ROM); 803. a random access memory (RAM); 804. a bus; 805. an I/O interface; 806. an input section; 807. an output section; 808. a storage section; 809. a communication section; 810. a driver; 811. a removable medium.
[ detailed description of the embodiments ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, a first embodiment of the present invention provides a language model pre-training method combined with word-like information, which includes the following steps:
s1, providing a pre-training model and a pre-training text;
s2, extracting character strings from the pre-training text and forming a word list;
s3, extracting two sentences from the pre-training text as training sentences and simultaneously dividing the training sentences into single character sequences;
s4, matching the character strings in the step S2 with the characters in the single character sequence, and marking the character strings matched with the characters in the single character sequence;
s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model;
and S6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model.
In step S1, the pre-training text is selected from plain-text sources such as Wikipedia, news corpora, medical question-answering corpora and financial report data.
Referring to fig. 2, in the step S1, the pre-training model is obtained by improving an existing self-coding language model, which includes but is not limited to the BERT (Bidirectional Encoder Representations from Transformers) language model. The pre-training model comprises modules such as an embedding layer 11, a character-level encoder 12, a word-level encoder 13 and a plurality of attention encoders 14. The notation xN in fig. 2 indicates that a number of attention encoders 14 have been omitted from the figure.
The embedding layer 11 receives as input the covered or replaced training sentence from step S5 and the character strings marked in step S4; it converts each single character into a corresponding single-character embedding vector, converts each character string into a corresponding character-string embedding vector, and adds a position code to every single-character embedding vector and character-string embedding vector.
The character-level encoder 12 receives the single-character embedding vectors and their position codes and computes the vector expressions of the characters that are not covered or replaced.
The word-level encoder 13 receives the character-string embedding vectors and their position codes and computes the string vector expressions.
There are several attention encoders 14; the vector expressions of the uncovered characters and the string vector expressions are input into them at the same time to obtain the vector expressions of the covered or replaced characters.
The pre-training model further comprises a Linear network layer 15 and a Softmax network layer 16; the vector expressions of the covered or replaced characters output by the attention encoders 14 are fed into the Linear network layer 15 and the Softmax network layer 16 to complete the optimization and fine-tuning tasks of the pre-training model.
In step S2, character strings are extracted from the pre-training text to form a vocabulary. The character strings can be obtained by a word extraction algorithm or by manual extraction. A word extraction algorithm is also called a word segmentation algorithm or a string-matching word segmentation algorithm: it matches candidate strings against the entries of a sufficiently large dictionary according to a certain strategy, and a successful match means that a word has been recognized. Optionally, the extraction algorithm includes but is not limited to accessor variety. Of course, the pre-training text may also be processed manually and the extracted character strings placed in the vocabulary. When the pre-training text contains strings with relatively low frequency, such as two-part allegorical sayings, idioms or colloquialisms, adding them to the vocabulary by manual extraction enriches the vocabulary and improves the optimization of the pre-training model.
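The patent leaves the choice of extraction algorithm open, so the following Python sketch uses a plain n-gram frequency cut-off as a stand-in for algorithms such as accessor variety; the length limit max_len and the threshold min_count are illustrative assumptions, not values taken from the patent.

```python
from collections import Counter

def build_vocab(corpus_sentences, max_len=4, min_count=10):
    """Collect candidate character strings (n-grams) for the vocabulary of step S2.
    A raw frequency cut-off is used here only as a stand-in for a real word
    extraction algorithm; manually curated strings can simply be added to the set."""
    counts = Counter()
    for sent in corpus_sentences:
        for n in range(2, max_len + 1):                 # strings of 2..max_len characters
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {s for s, c in counts.items() if c >= min_count}

# Toy usage on two sentences (min_count lowered so that some strings survive):
vocab = build_vocab(["人工智能实验室很犀利", "人工智能改变世界"], min_count=2)
```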
In step S3, two sentences are extracted from the pre-training text as training sentences and divided into a single-character sequence. Dividing a training sentence into a single-character sequence means splitting the sentence so that a single character is the smallest unit, which can be done, for example, with a split function. Step S3 further includes the following operation: adding [ sep ] at the end of each of the two extracted training sentences and adding [ cls ] at the beginning of the first sentence.
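A minimal sketch of the segmentation in step S3, assuming the [cls]/[sep] markers are added exactly as described above; the function name and the way the two sentences are passed in are illustrative.

```python
def to_char_sequence(sentence_a, sentence_b):
    """Split two training sentences into one single-character sequence (step S3),
    appending [sep] to each sentence and prepending [cls] to the first."""
    return ["[cls]"] + list(sentence_a) + ["[sep]"] + list(sentence_b) + ["[sep]"]

# e.g. to_char_sequence("人工智能实验室很犀利", "人工智能改变世界")
# -> ['[cls]', '人', '工', ..., '[sep]', '人', ..., '[sep]']
```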
Referring to fig. 3, in step S4 the character strings from step S2 are matched against the characters in the single-character sequence, and the strings that match are marked. In this embodiment the single-character sequence corresponds to the characters of the Chinese sentence 人工智能实验室很犀利 ("the artificial intelligence laboratory is very sharp"), i.e. 人, 工, 智, 能, 实, 验, 室, 很, 犀, 利 (rendered character by character in the original translation as "human, artificial, intelligent, energy, reality, experience, laboratory, very, rhinoceros, and profit"), and the vocabulary contains the strings 人工 ("artificial"), 智能 ("intelligent"), 实验室 ("laboratory") and 犀利 ("sharp"). The characters in the single-character sequence and the strings are then matched as shown in fig. 3. After a character string is matched with the training sentence, it is marked with the position information and/or length information of the corresponding part of the training sentence.
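A sketch of the matching and marking in step S4, assuming a mark is simply the (string, start position, length) triple; the patent only requires that position and/or length information be recorded, so the exact data structure is an assumption.

```python
def match_strings(char_seq, vocab, max_len=4):
    """Mark every vocabulary string that occurs in the single-character sequence
    (step S4) with its start position and length."""
    marks = []
    for i in range(len(char_seq)):
        for n in range(2, max_len + 1):
            candidate = "".join(char_seq[i:i + n])
            if candidate in vocab:
                marks.append((candidate, i, n))         # (string, position, length)
    return marks

# With vocab = {"人工", "智能", "实验室", "犀利"} and the example sentence above,
# this returns marks such as ("人工", 1, 2) and ("实验室", 5, 3).
```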
In step S5, a preset proportion of the single characters in the single-character sequence is selected for covering or replacement, and the covered or replaced training sentences together with the marked character strings are input into the pre-training model at the same time to train and optimize it. The preset proportion is usually the proportion of covered or replaced characters in the whole sentence; it lies in the range of 10-30%, and in this embodiment 15% is selected.
Referring to fig. 4, the step S5 specifically includes the following steps:
s51, establishing an objective function related to the pre-training model;
s52, selecting 15% of single characters from the single character sequence for covering or replacing;
s53, inputting the covered or replaced training sentences and the marked character strings into a pre-training model simultaneously;
s54, predicting covered or replaced words through a pre-training model to obtain vector expressions representing the covered or replaced words; and
and S55, calculating an objective function by using the vector expression and optimizing the pre-training model.
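A minimal sketch of the covering in steps S52-S53, assuming the special markers are never selected and that every selected character is simply replaced by a [mask] token; the patent also allows replacement by other characters, which is omitted here.

```python
import random

def mask_characters(char_seq, ratio=0.15, mask_token="[mask]"):
    """Cover a preset proportion (15% by default) of the single characters in the
    sequence and return the masked sequence plus the covered positions, whose
    original characters become the prediction targets of step S54."""
    candidates = [i for i, c in enumerate(char_seq) if c not in ("[cls]", "[sep]")]
    n_mask = max(1, int(len(candidates) * ratio))
    picked = set(random.sample(candidates, n_mask))
    masked = [mask_token if i in picked else c for i, c in enumerate(char_seq)]
    return masked, sorted(picked)
```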
Referring to fig. 5, in step S53 the extracted training sentence is 人工智能实验室很犀利 ("the artificial intelligence laboratory is very sharp"); the characters 工 and 犀 are replaced with the [mask] token, and the covered or replaced training sentence and the marked character strings are input into the embedding layer 11 of the pre-training model at the same time. The embedding layer 11 converts each single character into a corresponding single-character embedding vector, converts each character string into a corresponding character-string embedding vector, and adds a position code to every single-character embedding vector and character-string embedding vector. A position encoder is provided in the embedding layer 11, and the position codes are added to the single-character embedding vectors and the character-string embedding vectors by this position encoder; the position code indicates the position at which each character or string appears in the training sentence.
As shown in fig. 5, c1-c10 correspond to the single-character sequence 人, [mask], 智, 能, 实, 验, 室, 很, [mask], 利 (with 工 and 犀 covered); w1, w2, w3, w4 and w5 correspond to the embedding vectors of the character strings 人工 ("artificial"), 智能 ("intelligent"), 实验 ("experimental"), 实验室 ("laboratory") and 犀利 ("sharp"), respectively.
Further, the single-character embedding vectors and their position codes are input into the character-level encoder 12, which computes the vector expressions of the characters that are not covered or replaced; the character-string embedding vectors and their position codes are input into the word-level encoder 13, which computes the string vector expressions.
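A condensed PyTorch sketch of the embedding layer 11, character-level encoder 12 and word-level encoder 13 described above. The use of standard Transformer encoder layers, the layer counts and the dimensions are assumptions for illustration; the patent does not fix these internals.

```python
import torch
import torch.nn as nn

class CharAndWordEncoders(nn.Module):
    """Embedding layer plus character-level and word-level encoders (modules 11-13)."""
    def __init__(self, n_chars, n_strings, d_model=256, max_pos=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)       # single-character embeddings
        self.string_emb = nn.Embedding(n_strings, d_model)   # character-string embeddings
        self.pos_emb = nn.Embedding(max_pos, d_model)        # position coding
        self.char_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.word_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

    def forward(self, char_ids, char_pos, string_ids, string_pos):
        # vector expressions of the (partly masked) single-character sequence
        char_states = self.char_encoder(self.char_emb(char_ids) + self.pos_emb(char_pos))
        # vector expressions of the matched character strings
        word_states = self.word_encoder(self.string_emb(string_ids) + self.pos_emb(string_pos))
        return char_states, word_states
```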
In step S54, the vector expressions of the uncovered characters and the string vector expressions are input into the attention encoders 14 at the same time to obtain the vector expression of the "[ cls ]" marker and the vector expressions of the covered or replaced single characters. Because each string vector carries position information and length information as its mark and has been matched with the characters, the attention encoders 14 compute the vector expressions of covered or replaced characters more accurately, so the predicted characters are more accurate and the model performs better on many tasks, for example downstream tasks such as Chinese word segmentation, part-of-speech tagging, entity recognition, sentiment analysis, natural language inference, sentence classification, machine reading comprehension and article classification.
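A sketch of one attention encoder 14 as cross-attention from the character representations to the string representations, so that a covered character can draw on matched-string information. Treating the attention encoder as standard multi-head attention with a residual connection is an assumption; the patent does not spell out its internal structure.

```python
import torch.nn as nn

class CharWordAttentionEncoder(nn.Module):
    """One attention encoder (module 14): character states attend to string states."""
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, char_states, word_states):
        # queries come from the characters, keys/values from the matched strings
        attended, _ = self.attn(char_states, word_states, word_states)
        return self.norm(char_states + attended)
```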
After step S55 is completed, one round of optimization of the pre-training model is done, and step S6 then repeats steps S2-S5 until the pre-training model reaches the set optimization condition, yielding the optimized pre-training model. The set optimization condition corresponds to the objective function reaching convergence. Each time step S2 is executed, two sentences are extracted from the pre-training text as training sentences, until all sentences in the pre-training text have been extracted; the two sentences extracted each time are either adjacent or non-adjacent, and when the extraction is complete the proportion of adjacent pairs ranges from 40% to 70%, with adjacent and non-adjacent pairs together accounting for 100%. In this embodiment the proportion of each is 50%.
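A simplified sampler for the sentence-pair construction described above, assuming each document is a list of at least two sentences. The patent walks through the whole corpus rather than sampling at random, and only requires the adjacent-pair proportion to end up in the 40-70% range (50% in this embodiment), so the random draw below is an illustrative simplification.

```python
import random

def sample_sentence_pair(documents, p_adjacent=0.5):
    """Return (first sentence, second sentence, adjacent?) for one training pair."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    first = doc[i]
    if random.random() < p_adjacent:
        return first, doc[i + 1], True               # adjacent pair
    other = random.choice(documents)
    return first, random.choice(other), False        # non-adjacent pair
```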
Referring to fig. 6, the method for pre-training a language model in combination with word-like information further includes the following steps: and step S7, performing task fine adjustment on the optimized pre-training model obtained in the step S6 by combining the vocabulary formed in the step S2.
Referring to fig. 7, the step S7 specifically includes the following steps:
s71, providing fine tuning task texts;
s72, segmenting the fine tuning task text into single character sequences;
s73, matching the character string in the step S2 with the words in the single word sequence in the step S72 and marking the character string after matching;
and S74, simultaneously inputting the single character sequence and the marked character string into the optimized pre-training model to finely adjust the pre-training model.
In step S71, the fine-tuning task text is also selected from plain-text sources such as Wikipedia, news corpora, medical question-answering corpora and financial data, but it must not be the same as the pre-training text in step S1.
In the above step S74, the objective function of the optimized pre-training model is optimized through a fully connected layer or a CRF network.
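A minimal sketch of a task head for step S74, showing only the fully connected (Linear) option; the CRF alternative mentioned above is omitted. The number of labels, the hidden size and the use of the [cls] position for sentence-level tasks are assumptions for illustration.

```python
import torch.nn as nn

class FineTuneHead(nn.Module):
    """Fully connected task head placed on top of the optimized pre-training model."""
    def __init__(self, d_model=256, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, encoder_states):
        # sentence-level tasks typically read the [cls] position (index 0);
        # sequence labelling tasks would instead classify every position
        return self.classifier(encoder_states[:, 0, :])
```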
Referring to fig. 8, a second embodiment of the present invention provides an electronic device 60, which includes a memory 601 and a processor 602, where the memory 601 stores a computer program, and the computer program is configured to execute the method for pre-training a language model according to the first embodiment in combination with generic word information when running;
the processor 602 is arranged to perform a method of pre-training a language model in combination with word-like information as described in the first embodiment by means of the computer program.
Referring now to fig. 9, a block diagram of a computer system 800 suitable for implementing a terminal device/server of an embodiment of the present application is shown. The terminal device/server shown in fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 9, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 810 as necessary, so that a computer program read out from it is installed into the storage section 808 as necessary.
According to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the management-side computer, partly on the management-side computer, as a stand-alone software package, partly on the management-side computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the management-side computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A language model pre-training method combined with similar word information is characterized in that: which comprises the following steps:
s1, providing a pre-training model and a pre-training text;
s2, extracting character strings from the pre-training text and forming a word list;
s3, extracting two sentences from the pre-training text as training sentences and simultaneously dividing the training sentences into single character sequences;
s4, matching the character strings in the step S2 with the characters in the single character sequence, and marking the character strings matched with the characters in the single character sequence;
s5, selecting single characters with preset proportion from the single character sequence to cover or replace, and inputting the covered or replaced training sentences and the marked character strings into a pre-training model to train and optimize the pre-training model;
s6, repeating the steps S2-S5 until the pre-training model reaches the set optimization condition to obtain the optimized pre-training model;
s7, combining the vocabulary formed in the step S2 to perform task fine adjustment on the optimized pre-training model obtained in the step S6;
the step S7 specifically includes the following steps:
s71, providing fine tuning task texts;
s72, segmenting the fine tuning task text into single character sequences;
s73, matching the character string in the step S2 with the words in the single word sequence in the step S72 and marking the character string after matching;
and S74, simultaneously inputting the single character sequence and the marked character string into the optimized pre-training model to finely adjust the pre-training model.
2. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in the above step S2, the character string is obtained by the word extraction algorithm or artificially extracted.
3. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in step S3, adding [ sep ] to the end of each sentence in the two extracted training sentences and adding [ cls ] to the beginning of the first sentence; in step S4, the character string is marked with the position information and/or the length information of the character string.
4. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in the step S6, each time step S2 is executed, two sentences are extracted one by one from the pre-training document as training sentences until all sentences in the pre-training document are extracted, the two sentences extracted each time are adjacent or non-adjacent, and when the extraction is completed, the ratio range of the two adjacent sentences to the two non-adjacent sentences is 40-70%, and the sum of the two sentences is 100%.
5. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: the step S5 specifically includes the following steps:
s51, establishing an objective function related to the pre-training model;
s52, selecting 15% of single characters from the single character sequence for covering or replacing;
s53, inputting the covered or replaced training sentences and the marked character strings into a pre-training model simultaneously;
s54, predicting covered or replaced words through a pre-training model to obtain vector expressions representing the covered or replaced words; and
and S55, calculating an objective function by using the vector expression and optimizing the pre-training model.
6. A method for pre-training a language model in conjunction with word-like information as recited in claim 1, wherein: in the above step S74, the optimized pre-training model is optimized through a full connection layer or a CRF network optimization objective function.
7. A method for language model pre-training in combination with word-like information as claimed in any one of claims 1 to 6, wherein: the pre-training model comprises an embedded layer, a character-level encoder, a word-level encoder and a plurality of attention encoders; wherein,
the embedding layer for inputting the covered or replaced training sentence of the step S5 and the string marked in the step S4, converting the words into word embedding vectors corresponding to each word and converting each string into string embedding vectors corresponding to each string while adding position codes to each word embedding vector and string embedding vector;
the character-level encoder is used for inputting the single-character embedded vector and the position code corresponding to the single-character embedded vector and calculating to obtain the word vector expression of the uncovered or replaced word;
the word level encoder is used for inputting the character string embedding vector and the position code corresponding to the character string embedding vector and calculating to obtain word vector expression;
the attention encoder is in a plurality, and the uncovered or replaced word vector expression and the word vector expression are simultaneously input to obtain a vector expression of the covered or replaced word; the pre-training model further comprises a Linear network layer and a Softmax network layer, and the word vector expressions are output by the attention encoder and then input into the Linear network layer and the Softmax network layer to further train and fine tune the pre-training model.
CN201910775453.4A 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information Active CN110489555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910775453.4A CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910775453.4A CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Publications (2)

Publication Number Publication Date
CN110489555A CN110489555A (en) 2019-11-22
CN110489555B (en) 2022-03-08

Family

ID=68552689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910775453.4A Active CN110489555B (en) 2019-08-21 2019-08-21 Language model pre-training method combined with similar word information

Country Status (1)

Country Link
CN (1) CN110489555B (en)

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008531B (en) * 2019-12-06 2023-05-26 北京金山数字娱乐科技有限公司 Training method and device for sentence selection model, sentence selection method and device
CN111144115B (en) * 2019-12-23 2023-10-20 北京百度网讯科技有限公司 Pre-training language model acquisition method, device, electronic equipment and storage medium
CN111222337A (en) * 2020-01-08 2020-06-02 山东旗帜信息有限公司 Training method and device for entity recognition model
CN111259663B (en) 2020-01-14 2023-05-26 北京百度网讯科技有限公司 Information processing method and device
CN111444311A (en) * 2020-02-26 2020-07-24 平安科技(深圳)有限公司 Semantic understanding model training method and device, computer equipment and storage medium
CN113360751A (en) * 2020-03-06 2021-09-07 百度在线网络技术(北京)有限公司 Intention recognition method, apparatus, device and medium
CN112749251B (en) * 2020-03-09 2023-10-31 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN111460832B (en) * 2020-03-27 2023-11-24 北京百度网讯科技有限公司 Method, device, system, equipment and computer storage medium for object coding
CN113496122A (en) * 2020-04-08 2021-10-12 中移(上海)信息通信科技有限公司 Named entity identification method, device, equipment and medium
CN111522944B (en) * 2020-04-10 2023-11-14 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for outputting information
CN111241496B (en) * 2020-04-24 2021-06-29 支付宝(杭州)信息技术有限公司 Method and device for determining small program feature vector and electronic equipment
CN111581383A (en) * 2020-04-30 2020-08-25 上海电力大学 Chinese text classification method based on ERNIE-BiGRU
CN111737383B (en) * 2020-05-21 2021-11-23 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN111401077B (en) * 2020-06-02 2020-09-18 腾讯科技(深圳)有限公司 Language model processing method and device and computer equipment
CN113779185B (en) * 2020-06-10 2023-12-29 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN111814448B (en) * 2020-07-03 2024-01-16 思必驰科技股份有限公司 Pre-training language model quantization method and device
CN111798986B (en) * 2020-07-07 2023-11-03 云知声智能科技股份有限公司 Data enhancement method and device
CN111914551B (en) * 2020-07-29 2022-05-20 北京字节跳动网络技术有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112016300B (en) * 2020-09-09 2022-10-14 平安科技(深圳)有限公司 Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium
CN112329391A (en) * 2020-11-02 2021-02-05 上海明略人工智能(集团)有限公司 Target encoder generation method, target encoder generation device, electronic equipment and computer readable medium
CN112329392B (en) * 2020-11-05 2023-12-22 上海明略人工智能(集团)有限公司 Method and device for constructing target encoder of bidirectional encoding
CN112307212A (en) * 2020-11-11 2021-02-02 上海昌投网络科技有限公司 Public opinion delivery monitoring method for advertisement delivery
CN112635013B (en) * 2020-11-30 2023-10-27 泰康保险集团股份有限公司 Medical image information processing method and device, electronic equipment and storage medium
CN113011176A (en) * 2021-03-10 2021-06-22 云从科技集团股份有限公司 Language model training and language reasoning method, device and computer storage medium thereof
CN113032559B (en) * 2021-03-15 2023-04-28 新疆大学 Language model fine tuning method for low-resource adhesive language text classification
CN113032560B (en) * 2021-03-16 2023-10-27 北京达佳互联信息技术有限公司 Sentence classification model training method, sentence processing method and equipment
CN115292439A (en) * 2021-04-18 2022-11-04 华为技术有限公司 Data processing method and related equipment
CN113468877A (en) * 2021-07-09 2021-10-01 浙江大学 Language model fine-tuning method and device, computing equipment and storage medium
CN113836297B (en) * 2021-07-23 2023-04-14 北京三快在线科技有限公司 Training method and device for text emotion analysis model
CN113486141A (en) * 2021-07-29 2021-10-08 宁波薄言信息技术有限公司 Text, resume and financing bulletin extraction method based on SegaBert pre-training model
CN113591475B (en) * 2021-08-03 2023-07-21 美的集团(上海)有限公司 Method and device for unsupervised interpretable word segmentation and electronic equipment
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server
CN113887245B (en) * 2021-12-02 2022-03-25 腾讯科技(深圳)有限公司 Model training method and related device
CN114186043B (en) * 2021-12-10 2022-10-21 北京三快在线科技有限公司 Pre-training method, device, equipment and storage medium
CN114444488B (en) * 2022-01-26 2023-03-24 中国科学技术大学 Few-sample machine reading understanding method, system, equipment and storage medium
CN114792097B (en) * 2022-05-14 2022-12-06 北京百度网讯科技有限公司 Method and device for determining prompt vector of pre-training model and electronic equipment
CN115017915B (en) * 2022-05-30 2023-05-30 北京三快在线科技有限公司 Model training and task execution method and device
CN117235233A (en) * 2023-10-24 2023-12-15 之江实验室 Automatic financial report question-answering method and device based on large model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6004452B2 (en) * 2014-07-24 2016-10-05 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method for selecting learning text for language model, method for learning language model using the learning text, and computer and computer program for executing the same

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415896A (en) * 2017-02-09 2018-08-17 北京京东尚科信息技术有限公司 Deep learning model training method, segmenting method, training system and Words partition system
CN108228758A (en) * 2017-12-22 2018-06-29 北京奇艺世纪科技有限公司 A kind of file classification method and device
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110032644A (en) * 2019-04-03 2019-07-19 人立方智能科技有限公司 Language model pre-training method
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF

Also Published As

Publication number Publication date
CN110489555A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN113723105A (en) Training method, device and equipment of semantic feature extraction model and storage medium
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN116306600B (en) MacBert-based Chinese text error correction method
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Moeng et al. Canonical and surface morphological segmentation for nguni languages
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN114970536A (en) Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition
CN114742069A (en) Code similarity detection method and device
CN114333838A (en) Method and system for correcting voice recognition text
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113095063A (en) Two-stage emotion migration method and system based on masking language model
US20230289528A1 (en) Method for constructing sentiment classification model based on metaphor identification
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
CN115359323A (en) Image text information generation method and deep learning model training method
CN114330350A (en) Named entity identification method and device, electronic equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN113761875A (en) Event extraction method and device, electronic equipment and storage medium
Chen et al. Fast OOV words incorporation using structured word embeddings for neural network language model
CN115114915B (en) Phrase identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant