CN112528650A - Method, system and computer equipment for pretraining Bert model - Google Patents


Info

Publication number
CN112528650A
CN112528650A
Authority
CN
China
Prior art keywords
word
bert model
model
data set
training
Prior art date
Legal status
Granted
Application number
CN202011503784.1A
Other languages
Chinese (zh)
Other versions
CN112528650B (en)
Inventor
佘璇
段少毅
Current Assignee
Enyike Beijing Data Technology Co ltd
Original Assignee
Enyike Beijing Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Enyike Beijing Data Technology Co ltd filed Critical Enyike Beijing Data Technology Co ltd
Priority to CN202011503784.1A
Publication of CN112528650A
Application granted
Publication of CN112528650B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a method, a system and computer equipment for pre-training a Bert model. The method for pre-training the Bert model comprises the following steps: an original data set acquisition step, for acquiring an original data set; a data set preprocessing step, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes; and a Bert model pre-training step, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again. Through the application, the convergence of the model parameters is optimized and model oscillation is effectively prevented.

Description

Method, system and computer equipment for pretraining Bert model
Technical Field
The application relates to the technical field of the Internet, and in particular to a method, a system and computer equipment for pre-training a Bert model.
Background
With the rise of deep learning, more and more pre-trained models are applied to natural language processing tasks and greatly improve model performance. Early natural language pre-training used word vector methods such as word2vec, which map high-dimensional sparse word vectors to dense low-dimensional vectors that are fed into subsequent models. As deep learning developed, larger and more powerful pre-trained models were proposed; Bert is a representative example, and by virtue of its strong pre-training it has achieved the best results on many tasks.
Besides natural language processing tasks, natural language pre-training models can also be applied to other sequence data. For example, the records collected by big-data Internet companies of the advertisements or goods that users have viewed or clicked can be modeled as data sets similar to natural language. However, such data also differs from natural language data sets in that the amount of advertisement and merchandise data is usually very large: the number of advertisement types on the Internet typically exceeds a million, and the number of e-commerce goods may be even larger. In natural language data sets the vocabulary usually contains only tens of thousands of words, so a vocabulary of millions of entries (i.e., millions of goods or advertisements) is difficult for a model to handle: an excessively large vocabulary requires a large word embedding matrix to store it, i.e., the model must learn more parameters, which makes the model far too large. Furthermore, the frequency distribution of the words in such a vocabulary is usually long-tailed, i.e., a small part of the vocabulary appears very frequently while a large part appears very rarely, which makes model learning even more difficult.
The existing method for pre-training a Bert model on a large-vocabulary data set generally adopts two-step pre-training: (1) preprocess the data set: sort all words that appear by frequency, keep only the n most frequent words, and represent all remaining words by the single token "UNK" ("unknown"); then pre-train word2vec to obtain the word embedding vector of every word; (2) substitute the word embedding matrix obtained by word2vec pre-training into the Bert model's word embedding matrix as its initialization, and then pre-train the Bert model.
In this pre-training method, the words ranked below the top n by frequency are all set to the same token "UNK" before word2vec is trained, so a great deal of word information is lost: many words that carry different information are represented by the same token during preprocessing. On the other hand, directly initializing the Bert model's word embedding matrix with the word2vec pre-trained word embedding matrix and training it along with the Bert model may degrade the already-trained word embeddings.
Disclosure of Invention
The embodiments of the application provide a method, a system, computer equipment and a computer-readable storage medium for pre-training a Bert model, which optimize the convergence of the model parameters and effectively prevent model oscillation.
In a first aspect, an embodiment of the present application provides a method for pre-training a Bert model, including:
an original data set acquisition step for acquiring an original data set;
a data set preprocessing step, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and a Bert model pre-training step, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on these steps, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, so the parameters converge better still.
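For illustration only, the two-stage schedule can be sketched in Python with a PyTorch-style training loop. This is a minimal sketch under assumed names: bert, mlm_loss, word_vector_batches and code_batches are illustrative placeholders, the attribute path bert.embeddings.word_embeddings follows the layout of the HuggingFace transformers BertModel, and the learning rates are examples; none of these are prescribed by the application.

    import torch

    def pretrain_in_two_stages(bert, mlm_loss, word_vector_batches, code_batches):
        # Stage 1: freeze the word embedding matrix and feed the Word2Vec word vectors
        # directly (inputs_embeds), so only the non-word-embedding-matrix parameters
        # (attention, feed-forward, etc.) receive gradient updates.
        bert.embeddings.word_embeddings.weight.requires_grad = False
        opt1 = torch.optim.AdamW((p for p in bert.parameters() if p.requires_grad), lr=1e-4)
        for vectors, labels in word_vector_batches:
            loss = mlm_loss(bert(inputs_embeds=vectors), labels)
            loss.backward()
            opt1.step()
            opt1.zero_grad()

        # Stage 2: unfreeze the embedding matrix (initialised from the Word2Vec vectors,
        # as described later), lower the learning rate and feed the integer vocabulary
        # codes so that every parameter, including the embeddings, is trained.
        bert.embeddings.word_embeddings.weight.requires_grad = True
        opt2 = torch.optim.AdamW(bert.parameters(), lr=1e-5)  # smaller learning rate
        for codes, labels in code_batches:
            loss = mlm_loss(bert(input_ids=codes), labels)
            loss.backward()
            opt2.step()
            opt2.zero_grad()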
In some of these embodiments, the dataset preprocessing step further comprises:
a word segmentation data set acquisition step, for performing word segmentation on the original data set to obtain the word segmentation data set;
a word embedding matrix acquisition step, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
a vocabulary sorting step, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000;
and a vocabulary encoding step, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
These steps pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
In some embodiments, the pre-training step of the Bert model further comprises:
a partial parameter pre-training step, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters;
and a model parameter pre-training step, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
Through these steps, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
In some of these embodiments, the model parameter pre-training step further comprises:
a model adjustment step for reducing the learning rate of the Bert model;
initializing a word embedding matrix, namely initializing the word embedding matrix of the Bert model by using the word vector;
and a model secondary pre-training step, for inputting the vocabulary codes into the Bert model to train the parameters of every layer.
In some embodiments, the word embedding matrix initializing step further comprises:
obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model;
and obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model.
Based on these steps, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
In a second aspect, an embodiment of the present application provides a Bert model pre-training system, including:
the original data set acquisition module is used for acquiring an original data set;
the data set preprocessing module, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and the Bert model pre-training module, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on these modules, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, makes the parameters converge better still, and effectively handles a large vocabulary.
In some of these embodiments, the dataset preprocessing module further comprises:
the word segmentation data set acquisition module, for performing word segmentation on the original data set to obtain the word segmentation data set;
the word embedding matrix acquisition module, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
the vocabulary sorting module, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000;
and the vocabulary encoding module, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
These modules pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
In some embodiments, the Bert model pre-training module further comprises:
the partial parameter pre-training module, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters;
and the model parameter pre-training module, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
Through these modules, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
In some of these embodiments, the model parameter pre-training module further comprises:
the model adjusting module is used for reducing the learning rate of the Bert model;
the word embedding matrix initialization module, for initializing the word embedding matrix of the Bert model with the word vectors; specifically, the word vectors of the N high-frequency words are obtained to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model, and the mean of the word vectors of the high-frequency words is obtained to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model;
and the model secondary pre-training module, which inputs the vocabulary codes into the Bert model to train the parameters of every layer.
Based on these modules, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the method for pre-training a Bert model according to the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the Bert model pre-training method as described in the first aspect above.
Compared with the related art, the method, the system, the computer equipment and the computer-readable storage medium for pre-training a Bert model provided by the embodiments of the application use multi-step pre-training to solve the problem that the Bert model pre-trains poorly on a million-word vocabulary data set, prevent the loss of most vocabulary information, prevent model oscillation, and optimize the convergence of the model parameters, so that both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model converge better.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a method of pretraining a Bert model in accordance with an embodiment of the present application;
FIG. 2 is a flowchart of step S2 of a method for pre-training a Bert model according to an embodiment of the present application;
FIG. 3 is a flowchart of step S32 of a method for pre-training a Bert model according to an embodiment of the present application;
FIG. 4 is a block diagram of a Bert model pre-training system according to an embodiment of the present application;
FIG. 5 is a block diagram of sub-module structures of a Bert model pre-training system according to an embodiment of the present application.
Description of reference numerals:
1. an original data set acquisition module; 2. a data set preprocessing module; 3. a Bert model pre-training module;
21. a word segmentation data set acquisition module; 22. a word embedding matrix acquisition module;
23. a vocabulary ordering module; 24. a vocabulary encoding module; 31. a partial parameter pre-training module;
32. a model parameter pre-training module; 321. a model adjustment module; 322. a word embedding matrix initialization module; 323. and a model secondary pre-training module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The embodiment provides a pretraining method for a Bert model. Fig. 1-3 are flowcharts of a method for pretraining a Bert model according to an embodiment of the present application, and referring to fig. 1-3, the flowcharts include the following steps:
a raw data set acquisition step S1 for acquiring a raw data set.
A data set preprocessing step S2, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes; optionally, the word segmentation may use, but is not limited to, Jieba word segmentation.
A Bert model pre-training step S3, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on these steps, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, so the parameters converge better still.
In some of these embodiments, the data set preprocessing step S2 further includes:
A word segmentation data set acquisition step S21, for performing word segmentation on the original data set to obtain the word segmentation data set;
A word embedding matrix acquisition step S22, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
A vocabulary sorting step S23, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000.
A vocabulary encoding step S24, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK" to obtain the vocabulary codes; specifically, the vocabulary codes comprise the codes of the high-frequency words and the code of the low-frequency token "UNK". For example, and without limitation, if word A is ranked 1 and word B is ranked 2, the codes of word A and word B are obtained by converting ranks 1 and 2 into the encoding format.
These steps pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
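A minimal sketch of steps S21 to S24 follows, for illustration only. It assumes the jieba and gensim libraries as one possible tool choice (the application only states that Jieba-style word segmentation and a Word2Vec model may be used); raw_corpus, the vector size and the concrete value of N are illustrative assumptions rather than values from the application.

    from collections import Counter
    import jieba
    from gensim.models import Word2Vec

    N = 100000                                   # illustrative threshold within the 50,000-150,000 range
    raw_corpus = ["placeholder raw text line one", "placeholder raw text line two"]

    # S21: word segmentation of the original data set
    segmented = [list(jieba.cut(line)) for line in raw_corpus]

    # S22: train Word2Vec on all words to obtain the word embedding matrix of all words
    w2v = Word2Vec(sentences=segmented, vector_size=128, min_count=1)

    # S23: sort the vocabulary by occurrence frequency and split at the threshold N
    freq = Counter(token for sentence in segmented for token in sentence)
    ranked = [token for token, _ in freq.most_common()]
    high_freq, low_freq = ranked[:N], ranked[N:]

    # S24: encode high-frequency words by rank (1..N) and every low-frequency word as "UNK" (code 0)
    code = {token: rank + 1 for rank, token in enumerate(high_freq)}
    encoded = [[code.get(token, 0) for token in sentence] for sentence in segmented]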
In some of these embodiments, the Bert model pre-training step S3 further includes:
A partial parameter pre-training step S31, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters.
A model parameter pre-training step S32, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model, where the parameters of every layer include the word embedding matrix and the non-word-embedding-matrix parameters.
Through these steps, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
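The following sketch illustrates step S31 using the HuggingFace transformers library as one assumed implementation; the application does not prescribe a framework, and the configuration values (hidden size, number of attention heads, learning rate) are assumptions chosen so that the hidden size matches the Word2Vec vector size used above and the word vectors can be fed in directly.

    import torch
    from transformers import BertConfig, BertForMaskedLM

    N = 100000
    # Vocabulary size N + 1: one row per high-frequency word plus one row for "UNK".
    config = BertConfig(vocab_size=N + 1, hidden_size=128,
                        num_attention_heads=8, intermediate_size=512)
    model = BertForMaskedLM(config)

    # Freeze the word embedding matrix so that only the non-word-embedding-matrix
    # parameters (attention, intermediate / feed-forward layers, etc.) are trained.
    model.bert.embeddings.word_embeddings.weight.requires_grad = False

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)

    # In this stage the Word2Vec word vectors are passed in directly, e.g.
    #   outputs = model(inputs_embeds=word_vector_batch, labels=mlm_labels)
    # so the frozen embedding lookup is bypassed.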
In some of these embodiments, the model parameter pre-training step S32 further includes:
a model adjustment step S321 for reducing the learning rate of the Bert model;
a word embedding matrix initializing step S322, configured to initialize a word embedding matrix of the Bert model by using the word vector;
and a model secondary pre-training step S323, inputting the vocabulary codes into the Bert model to train parameters of each layer.
In some embodiments, the word embedding matrix initializing step further comprises:
Step S3221, obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model;
Step S3222, obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model.
Based on these steps, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
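Continuing the sketches above (and reusing their illustrative w2v, high_freq and model objects), steps S3221 and S3222 might look as follows; placing "UNK" in row 0 and the high-frequency words in rows 1..N is an assumption that matches the encoding used in the preprocessing sketch, not a layout fixed by the application.

    import numpy as np
    import torch

    with torch.no_grad():
        emb = model.bert.embeddings.word_embeddings.weight          # shape (N + 1, 128)
        vecs = torch.from_numpy(
            np.stack([w2v.wv[token] for token in high_freq])).to(emb.dtype)
        emb[1:] = vecs              # S3221: rows 1..N take the high-frequency Word2Vec vectors
        emb[0] = vecs.mean(dim=0)   # S3222: the "UNK" row takes the mean of those vectors

    # S321/S323: unfreeze the embedding matrix and train all layers at a reduced learning
    # rate, now feeding the integer vocabulary codes (input_ids) instead of word vectors.
    model.bert.embeddings.word_embeddings.weight.requires_grad = True
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)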
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
The embodiment also provides a Bert model pre-training system, which implements the foregoing embodiments and preferred embodiments; descriptions already given are not repeated. As used hereinafter, the terms "module", "unit", "subunit" and the like may be a combination of software and/or hardware that implements a predetermined function. Although the system described in the embodiments below is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Fig. 4-5 are block diagrams of the structure of a Bert model pre-training system according to an embodiment of the present application, and referring to fig. 4-5, the system includes:
an original data set obtaining module 1, configured to obtain an original data set;
the data set preprocessing module 2, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes. Specifically, the data set preprocessing module 2 further includes: the word segmentation data set acquisition module 21, for performing word segmentation on the original data set to obtain the word segmentation data set; the word embedding matrix acquisition module 22, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words; the vocabulary sorting module 23, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000; and the vocabulary encoding module 24, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK" to obtain the vocabulary codes. These modules pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
The Bert model pre-training module 3, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again. Specifically, the Bert model pre-training module 3 further includes: the partial parameter pre-training module 31, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters; and the model parameter pre-training module 32, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model, where the parameters of every layer include the word embedding matrix and the non-word-embedding-matrix parameters. Through these modules, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
The model parameter pre-training module 32 further includes: the model adjustment module 321, for reducing the learning rate of the Bert model; the word embedding matrix initialization module 322, for initializing the word embedding matrix of the Bert model with the word vectors, optionally by obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model and obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model; and the model secondary pre-training module 323, which inputs the vocabulary codes into the Bert model to train the parameters of every layer. Based on these modules, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
Based on these modules, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, makes the parameters converge better still, and effectively handles a large vocabulary.
In addition, the method for pretraining the Bert model in the embodiment of the present application described in conjunction with fig. 1 may be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions.
In particular, the processor may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
The memory may include mass storage for data or instructions. By way of example, and not limitation, the memory may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is Non-Volatile memory. In particular embodiments, the memory includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or FLASH memory, or a combination of two or more of these. The RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPMDRAM), Extended Data Out DRAM (EDODRAM), Synchronous DRAM (SDRAM), and the like.
The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor.
The processor reads and executes the computer program instructions stored in the memory to implement any one of the Bert model pre-training methods in the above embodiments.
In addition, in combination with the method for pretraining the Bert model in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the Bert model pre-training methods in the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for pretraining a Bert model, comprising:
an original data set acquisition step for acquiring an original data set;
a data set preprocessing step, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and a Bert model pre-training step, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
2. The method of pretraining a Bert model as recited in claim 1, wherein the dataset preprocessing step further comprises:
a word segmentation data set acquisition step, for performing word segmentation on the original data set to obtain the word segmentation data set;
a word embedding matrix acquisition step, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
a vocabulary sorting step, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words;
and a vocabulary encoding step, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
3. The method of pre-training a Bert model according to claim 1 or 2, wherein the step of pre-training the Bert model further comprises:
a partial parameter pre-training step, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1, freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model;
and a model parameter pre-training step, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
4. The method of pretraining a Bert model in accordance with claim 3, wherein the model parameter pretraining step further comprises:
a model adjustment step for reducing the learning rate of the Bert model;
initializing a word embedding matrix, namely initializing the word embedding matrix of the Bert model by using the word vector;
and a model secondary pre-training step, for inputting the vocabulary codes into the Bert model to train the parameters of every layer.
5. The method of pretraining the Bert model as claimed in claim 4, wherein the word embedding matrix initializing step further comprises:
obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model;
and obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model.
6. A Bert model pre-training system, comprising:
the original data set acquisition module is used for acquiring an original data set;
the data set preprocessing module, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and the Bert model pre-training module, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
7. The Bert model pre-training system of claim 6, wherein the dataset pre-processing module further comprises:
the word segmentation data set acquisition module, for performing word segmentation on the original data set to obtain the word segmentation data set;
the word embedding matrix acquisition module, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
the vocabulary sorting module, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words;
and the vocabulary encoding module, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
8. The Bert model pre-training system of claim 6 or 7, wherein the Bert model pre-training module further comprises:
the partial parameter pre-training module, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1, freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model;
and the model parameter pre-training module, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
9. The Bert model pre-training system of claim 8, wherein the model parameter pre-training module further comprises:
the model adjusting module is used for reducing the learning rate of the Bert model;
the word embedding matrix initialization module is used for initializing the word embedding matrix of the Bert model by utilizing the word vector;
and the model secondary pre-training module inputs the vocabulary codes into the Bert model to train parameters of each layer.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the Bert model pre-training method as recited in any one of claims 1 to 5 when executing the computer program.
CN202011503784.1A 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment Active CN112528650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011503784.1A CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011503784.1A CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Publications (2)

Publication Number Publication Date
CN112528650A true CN112528650A (en) 2021-03-19
CN112528650B CN112528650B (en) 2024-04-02

Family

ID=75001494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011503784.1A Active CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Country Status (1)

Country Link
CN (1) CN112528650B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN114861651A (en) * 2022-05-05 2022-08-05 北京百度网讯科技有限公司 Model training optimization method, computing device, electronic device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN108509427A (en) * 2018-04-24 2018-09-07 北京慧闻科技发展有限公司 The data processing method of text data and application
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CA3065784A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant classifier based on deep neural networks
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CA3081242A1 (en) * 2019-05-22 2020-11-22 Royal Bank Of Canada System and method for controllable machine text generation architecture

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CA3065784A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant classifier based on deep neural networks
CN108509427A (en) * 2018-04-24 2018-09-07 北京慧闻科技发展有限公司 The data processing method of text data and application
CA3081242A1 (en) * 2019-05-22 2020-11-22 Royal Bank Of Canada System and method for controllable machine text generation architecture
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
KIM YUNSU 等: "Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies", 《PROCEEDINGS OF THE 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》, pages 1246 - 1257 *
POERNER NINA 等: "E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT", 《FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: EMNLP 2020》, pages 803 - 818 *
徐菲菲 et al.: "Research on text word vectors and pre-trained language models", Journal of Shanghai University of Electric Power, vol. 36, no. 4, pages 320-328 *
梁仕威 et al.: "Personalized news recommendation based on collaborative representation learning", Journal of Chinese Information Processing, vol. 32, no. 11, pages 72-78 *
王晶: "Research on text representation and classification based on deep learning", China Masters' Theses Full-text Database, Information Science and Technology, no. 08, pages 138-1395 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN114861651A (en) * 2022-05-05 2022-08-05 北京百度网讯科技有限公司 Model training optimization method, computing device, electronic device and storage medium

Also Published As

Publication number Publication date
CN112528650B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN109685116B (en) Image description information generation method and device and electronic device
CN108875807B (en) Image description method based on multiple attention and multiple scales
US11636283B2 (en) Committed information rate variational autoencoders
US11531889B2 (en) Weight data storage method and neural network processor based on the method
US11403528B2 (en) Self-tuning incremental model compression solution in deep neural network with guaranteed accuracy performance
US20210004677A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
CN113011581B (en) Neural network model compression method and device, electronic equipment and readable storage medium
US11403520B2 (en) Neural network machine translation method and apparatus
CN112528650A (en) Method, system and computer equipment for pretraining Bert model
CN112380319A (en) Model training method and related device
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
CN114880452A (en) Text retrieval method based on multi-view contrast learning
CN112800757A (en) Keyword generation method, device, equipment and medium
CN109145946B (en) Intelligent image recognition and description method
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN109145107A (en) Subject distillation method, apparatus, medium and equipment based on convolutional neural networks
CN115908641A (en) Text-to-image generation method, device and medium based on features
CN113743277A (en) Method, system, equipment and storage medium for short video frequency classification
CN116018589A (en) Method and system for product quantization based matrix compression
CN107247944B (en) Face detection speed optimization method and device based on deep learning
Li et al. A graphical approach for filter pruning by exploring the similarity relation between feature maps
CN113344060A (en) Text classification model training method, litigation shape classification method and device
CN113011555B (en) Data processing method, device, equipment and storage medium
CN115937650A (en) Graph network model for time sequence action detection
Kermarrec et al. Improving communication-efficiency in Decentralized and Federated Learning Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant