CN112528650B - Bert model pre-training method, system and computer equipment - Google Patents

Bert model pre-training method, system and computer equipment

Info

Publication number
CN112528650B
CN112528650B
Authority
CN
China
Prior art keywords
word
bert model
data set
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011503784.1A
Other languages
Chinese (zh)
Other versions
CN112528650A (en)
Inventor
佘璇
段少毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Enyike Beijing Data Technology Co ltd
Original Assignee
Enyike Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Enyike Beijing Data Technology Co ltd filed Critical Enyike Beijing Data Technology Co ltd
Priority to CN202011503784.1A priority Critical patent/CN112528650B/en
Publication of CN112528650A publication Critical patent/CN112528650A/en
Application granted granted Critical
Publication of CN112528650B publication Critical patent/CN112528650B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a Bert model pre-training method, system and computer equipment, wherein the Bert model pre-training method comprises the following steps: an original data set acquisition step of acquiring an original data set; a data set preprocessing step of carrying out word segmentation processing on the original data set to obtain a word segmentation data set, training the word segmentation data set through a Word2Vec model to obtain word embedding matrixes of all words, and sorting and encoding the words according to their occurrence frequency to obtain high-frequency words, low-frequency words and word codes; and a Bert model pre-training step of freezing the word embedding matrix parameters of the Bert model, training the Bert model based on the word embedding matrixes of all the words, and then reducing the learning rate and inputting the word codes to train the Bert model again. Through the method and the device, the convergence of the model parameters is optimized, and model oscillation is effectively prevented.

Description

Bert model pre-training method, system and computer equipment
Technical Field
The application relates to the technical field of the Internet, and in particular to a Bert model pre-training method, a Bert model pre-training system and computer equipment.
Background
With the rise of deep learning technology, more and more pre-trained models have been applied to natural language processing tasks, greatly improving model performance. Early natural language pre-training used word2vec word vector methods, which map high-dimensional sparse word vectors to dense low-dimensional vectors that serve as inputs to subsequent models. With the development of deep learning, more powerful pre-training models have been proposed, of which Bert can be regarded as a representative; by means of its powerful pre-training, Bert has achieved the current best results on many tasks.
Besides natural language processing tasks, natural language pre-training models can also be applied to other sequence data. For example, the advertisements or goods that users view and click on, as collected by big-data Internet companies, can be treated as a data set similar to natural language and modeled in the same way. However, there are some important differences between such data and natural language data sets. The number of advertisements and goods is usually very large; for example, the number of advertisement types on the Internet is generally over a million, and the number of e-commerce goods may be even larger. In natural language data sets, the vocabulary usually contains only tens of thousands of words, so it is difficult for the model to handle a million-entry vocabulary (i.e., millions of goods or advertisements), because an oversized vocabulary requires a large word embedding matrix to store, i.e., the model needs to learn more parameters, which results in an oversized model. Furthermore, the frequency distribution of the vocabulary is typically long-tailed, i.e. a small part of the vocabulary appears very frequently while a large part appears very rarely, which also makes model learning more difficult.
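As a rough illustration of the storage problem (the vocabulary size and embedding dimension below are assumed round numbers for illustration, not figures from this application), the word embedding matrix alone already becomes very large at the million-word scale:

```python
# Back-of-the-envelope size of a word embedding matrix for a million-entry vocabulary
# (illustrative numbers: 1,000,000 words, 768-dimensional float32 embeddings).
vocab_size, hidden_size, bytes_per_float = 1_000_000, 768, 4
size_gib = vocab_size * hidden_size * bytes_per_float / 2**30
print(f"word embedding matrix alone: {size_gib:.2f} GiB")   # roughly 2.86 GiB of parameters
```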
The existing method for pre-training a Bert model on a large-vocabulary data set generally adopts two-step pre-training: (1) the data set is preprocessed, all the words are sorted by frequency, only the n most frequent words are retained, and the remaining words are represented by the word "UNK" (unknown); word2vec pre-training is then used to obtain the word embedding vectors of all the words; (2) the word embedding matrix obtained by word2vec pre-training is substituted into the Bert model word embedding matrix as its initialization, and the Bert model is then pre-trained.
In this pre-training method, on the one hand, the words ranked outside the top n by frequency are directly set to the same word UNK before word2vec is trained, so that many words carrying different information are represented by a single word during preprocessing; on the other hand, directly initializing the Bert model word embedding matrix with the word2vec pre-trained word embedding matrix and then training it with the Bert model may result in a poorly learned word embedding matrix.
Disclosure of Invention
The embodiments of the present application provide a Bert model pre-training method, a Bert model pre-training system, computer equipment and a computer readable storage medium, which optimize the convergence of the model parameters and effectively prevent model oscillation.
In a first aspect, an embodiment of the present application provides a Bert model pretraining method, including:
an original data set acquisition step of acquiring an original data set;
a data set preprocessing step, which is used for carrying out word segmentation processing on the original data set to obtain a word segmentation data set, training the word segmentation data set through a Word2Vec model to obtain word embedding matrixes of all words, and sorting and encoding the words according to their occurrence frequency to obtain high-frequency words, low-frequency words and word codes;
and a Bert model pre-training step, which is used for freezing the word embedding matrix parameters of the Bert model, training the Bert model based on the word embedding matrixes of all the vocabularies, and then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on the above steps, the Bert model pre-training is realized in two steps with different model inputs. In the first step, the word vectors of the words in the word segmentation data set, i.e. the word embedding matrix, are input so as to train the non-word-embedding-matrix parameters of the Bert model; in the second step, the codes of the words are input so as to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters. Because the first step uses the word vectors of all words based on the word embedding matrix of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second step, training the word embedding matrix parameters with a smaller learning rate effectively prevents these parameters from oscillating, so that the parameters converge still better.
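As a minimal, illustrative sketch of this two-step schedule (assuming PyTorch and the HuggingFace transformers library; the tiny configuration, the random stand-in batches and the masked-language-model objective are assumptions for illustration, not details fixed by this application):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

N = 100_000                                        # high-frequency vocabulary threshold
config = BertConfig(vocab_size=N + 1,              # N high-frequency words + "UNK"
                    hidden_size=128, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=512)
model = BertForMaskedLM(config)

# Step 1: freeze the word embedding matrix and feed Word2Vec vectors directly via
# inputs_embeds, so only non-embedding parameters (attention, feed-forward, ...) update.
# (With the default weight tying, the MLM decoder weight is this same frozen matrix.)
for p in model.bert.embeddings.word_embeddings.parameters():
    p.requires_grad = False
opt1 = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
word_vectors = torch.randn(2, 16, 128)             # stand-in (batch, seq, dim) Word2Vec vectors of ALL words
labels = torch.randint(0, N + 1, (2, 16))          # stand-in vocabulary codes used as targets
loss = model(inputs_embeds=word_vectors, labels=labels).loss
loss.backward()
opt1.step()
opt1.zero_grad()

# Step 2: unfreeze everything, re-initialize the embedding matrix from the Word2Vec
# vectors (see the initialization sketch later in the description), lower the learning
# rate and train every layer on the integer vocabulary codes.
for p in model.parameters():
    p.requires_grad = True
opt2 = torch.optim.AdamW(model.parameters(), lr=1e-5)   # smaller learning rate against oscillation
token_ids = torch.randint(0, N + 1, (2, 16))       # stand-in vocabulary codes as input_ids
loss = model(input_ids=token_ids, labels=labels).loss
loss.backward()
opt2.step()
opt2.zero_grad()
```

In the first step the Word2Vec vectors bypass the frozen embedding lookup through inputs_embeds, which is why no word has to be collapsed into "UNK" at that stage.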
In some of these embodiments, the dataset preprocessing step further comprises:
a word segmentation data set obtaining step, which is used for carrying out word segmentation processing on the original data set to obtain the word segmentation data set;
a Word embedding matrix obtaining step, which is used for inputting all words in the Word segmentation data set into the Word2Vec model for training to obtain a Word embedding matrix of all words;
and a vocabulary sorting step, which is used for sorting all the vocabularies in the word segmentation data set according to the occurrence frequency, and dividing the vocabularies in the word segmentation data set into N high-frequency vocabularies with higher frequency and the rest low-frequency vocabularies according to a set threshold value N, wherein the value range of N is 50000-150000.
And a vocabulary coding step, which is used for sequentially coding the high-frequency vocabularies according to the sorting and coding the low-frequency vocabularies into UNK to obtain the vocabulary coding.
The above steps pre-train based on all the words in the word segmentation data set, preventing the loss of most of the lexical information.
In some of these embodiments, the Bert model pre-training step further comprises:
a partial parameter pre-training step, which is used for initializing the word list size of the word embedding matrix of the Bert model to be N+1, specifically, the word embedding matrix comprises N+1 rows of word vectors, freezing the word embedding matrix parameters of the Bert model, and inputting the word vectors in the word embedding matrix of all the vocabularies to the Bert model so as to train the non-word embedding matrix parameters in the Bert model; by way of example, and not limitation, such as: attention layer parameters, forward propagation parameters, intermediate layer parameters.
And model parameter pre-training, namely reducing the learning rate of the Bert model, initializing a word embedding matrix of the Bert model by using the word vector, and inputting the vocabulary codes to the Bert model to train each layer of parameters of the Bert model.
Through the above steps, the problem of poor training effect when the Bert model is used to pre-train a million-level large-vocabulary data set is solved through multi-step pre-training. By reducing the model learning rate, model oscillation is prevented and the Bert model parameters are learned.
In some of these embodiments, the model parameter pre-training step further comprises:
a model adjustment step for reducing the learning rate of the Bert model;
a word embedding matrix initializing step, which is used for initializing a word embedding matrix of the Bert model by utilizing the word vector;
and model secondary pre-training, namely inputting the vocabulary codes into the Bert model to train parameters of each layer.
In some of these embodiments, the word embedding matrix initializing step further comprises:
acquiring word vectors corresponding to N high-frequency words in the word vectors, and initializing word vectors of words corresponding to word embedding matrixes of the Bert model;
and acquiring the average value of the word vectors corresponding to the high-frequency words in the word vectors to initialize the word vectors corresponding to the UNK of the word embedding matrix of the Bert model.
Based on the steps, the Word vector in the Bert model is initialized based on the Word vector trained by the Word2Vec model, so that the overfitting phenomenon is effectively prevented.
In a second aspect, an embodiment of the present application provides a Bert model pretraining system, including:
the original data set acquisition module is used for acquiring an original data set;
the data set preprocessing module is used for carrying out word segmentation processing on the original data set to obtain a word segmentation data set, training the word segmentation data set through a Word2Vec model to obtain word embedding matrixes of all words, and sorting and encoding the words according to their occurrence frequency to obtain high-frequency words, low-frequency words and word codes;
and the Bert model pre-training module is used for freezing the word embedding matrix parameters of the Bert model, training the Bert model based on the word embedding matrixes of all the vocabularies, and then reducing the learning rate and inputting the codes of the vocabularies to train the Bert model again.
Based on the above modules, the Bert model pre-training is realized in two steps with different model inputs. In the first step, the word vectors of the words in the word segmentation data set, i.e. the word embedding matrix, are input so as to train the non-word-embedding-matrix parameters of the Bert model; in the second step, the codes of the words are input so as to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters. Because the first step uses the word vectors of all words based on the word embedding matrix of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second step, training the word embedding matrix parameters with a smaller learning rate effectively prevents the model from oscillating, the parameters converge still better, and the large-vocabulary problem is effectively solved.
In some of these embodiments, the dataset preprocessing module further comprises:
the word segmentation data set acquisition module is used for carrying out word segmentation on the original data set to obtain the word segmentation data set;
the Word embedding matrix acquisition module is used for inputting all the words in the Word segmentation data set into the Word2Vec model for training to obtain a Word embedding matrix of all the words;
and the vocabulary sorting module is used for sorting all the vocabularies in the word segmentation data set according to the occurrence frequency, and dividing the vocabularies in the word segmentation data set into N high-frequency vocabularies with higher frequency and the rest low-frequency vocabularies according to a set threshold value N, wherein the value range of N is 50000-150000.
And the vocabulary coding module is used for sequentially coding the high-frequency vocabularies according to the sorting and coding the low-frequency vocabularies into UNK so as to obtain the vocabulary coding.
The above modules pretrain based on all words in the word segmentation dataset to prevent loss of most of the lexical information.
In some of these embodiments, the Bert model pre-training module further comprises:
the partial parameter pre-training module is used for initializing the word list size of the word embedding matrix of the Bert model to be N+1, specifically, the word embedding matrix comprises N+1 rows of word vectors, freezing the word embedding matrix parameters of the Bert model, and inputting the word vectors in the word embedding matrix of all the vocabularies to the Bert model so as to train the non-word embedding matrix parameters in the Bert model; by way of example, and not limitation, such as: attention layer parameters, forward propagation parameters, intermediate layer parameters.
And the model parameter pre-training module is used for reducing the learning rate of the Bert model, initializing a word embedding matrix of the Bert model by using the word vector, and inputting the vocabulary code to the Bert model so as to train each layer of parameters of the Bert model.
Through the above modules, the problem of poor training effect when the Bert model is used to pre-train a million-level large-vocabulary data set is solved through multi-step pre-training. By reducing the model learning rate, model oscillation is prevented and the Bert model parameters are learned.
In some of these embodiments, the model parameter pre-training module further comprises:
the model adjustment module is used for reducing the learning rate of the Bert model;
the word embedding matrix initializing module is used for initializing a word embedding matrix of the Bert model by utilizing the word vectors, and specifically, acquiring word vectors corresponding to N high-frequency words in the word vectors for initializing word vectors of words corresponding to the word embedding matrix of the Bert model; and acquiring the average value of the word vectors corresponding to the high-frequency words in the word vectors to initialize the word vector corresponding to the word UNK of the word embedding matrix of the Bert model.
And the model secondary pre-training module inputs the vocabulary codes into the Bert model to train parameters of each layer.
Based on the above modules, the Word vector in the Bert model is initialized based on the Word vector trained by the Word2Vec model, so that the overfitting phenomenon is effectively prevented.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the Bert model pretraining method according to the first aspect described above when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a Bert model pre-training method as described in the first aspect above.
Compared with the related art, the Bert model pre-training method, system, computer equipment and computer readable storage medium of the present application solve the problem of poor effect when pre-training a million-level large-vocabulary data set with a Bert model through multi-step pre-training, prevent the loss of most of the vocabulary information, prevent model oscillation, and optimize the model parameter convergence process, so that both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model converge better.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a Bert model Pre-training method according to an embodiment of the present application;
FIG. 2 is a substep flow chart of step S2 of the Bert model pre-training method according to an embodiment of the present application;
FIG. 3 is a substep flow chart of step S32 of the Bert model pre-training method according to an embodiment of the present application;
FIG. 4 is a block diagram of the structure of a Bert model Pre-training system according to an embodiment of the present application;
fig. 5 is a block diagram of a submodule structure of the Bert model pre-training system according to an embodiment of the present application.
Description of the drawings:
1. an original data set acquisition module; 2. a data set preprocessing module; 3. a Bert model pre-training module;
21. the word segmentation data set acquisition module; 22. a word embedding matrix acquisition module;
23. a vocabulary ordering module; 24. a vocabulary encoding module; 31. a partial parameter pre-training module;
32. model parameter pre-training module; 321. a model adjustment module; 322. a word embedding matrix initializing module; 323. and a model secondary pre-training module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it. All other embodiments obtained by one of ordinary skill in the art without creative effort, based on the embodiments provided herein, fall within the scope of protection of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that although such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication or manufacture for those of ordinary skill having the benefit of this disclosure, and should not be construed as limiting the present disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The embodiment provides a Bert model pre-training method. FIGS. 1-3 are flowcharts of a Bert model pre-training method according to embodiments of the present application, and with reference to FIGS. 1-3, the process includes the steps of:
an original data set obtaining step S1, for obtaining an original data set.
A data set preprocessing step S2, which is used for carrying out word segmentation processing on the original data set to obtain a word segmentation data set, training the word segmentation data set through a Word2Vec model to obtain word embedding matrixes of all words, and sorting and encoding the words according to their occurrence frequency to obtain high-frequency words, low-frequency words and word codes; optionally, the word segmentation process may employ, but is not limited to, jieba segmentation.
And a Bert model pre-training step S3, which is used for freezing the word embedding matrix parameters of the Bert model, training the Bert model based on the word embedding matrixes of all the vocabularies, and then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on the above steps, the Bert model pre-training is realized in two steps with different model inputs. In the first step, the word vectors of the words in the word segmentation data set, i.e. the word embedding matrix, are input so as to train the non-word-embedding-matrix parameters of the Bert model; in the second step, the codes of the words are input so as to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters. Because the first step uses the word vectors of all words based on the word embedding matrix of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second step, training the word embedding matrix parameters with a smaller learning rate effectively prevents these parameters from oscillating, so that the parameters converge still better.
In some of these embodiments, the data set preprocessing step S2 further includes:
a word segmentation data set obtaining step S21, which is used for carrying out word segmentation processing on the original data set to obtain a word segmentation data set;
a Word embedding matrix obtaining step S22, which is used for inputting all words in the Word segmentation data set into a Word2Vec model for training to obtain Word embedding matrixes of all words;
and a vocabulary sorting step S23, which is used for sorting all the vocabularies in the word segmentation data set according to the occurrence frequency, and dividing the vocabularies in the word segmentation data set into N high-frequency vocabularies with higher frequency and the rest low-frequency vocabularies according to a set threshold value N, wherein the value range of N is 50000-150000.
A vocabulary encoding step S24, configured to encode the high-frequency vocabularies sequentially according to the ordering and encode the low-frequency vocabularies as "UNK", so as to obtain vocabulary encoding, where the vocabulary encoding specifically includes encoding the high-frequency vocabularies and encoding the low-frequency vocabularies as "UNK"; for example, but not by way of limitation, the word A has a rank of 1 and the word B has a rank of 2, and the codes of the word A and the word B can be obtained by converting the ranks 1 and 2 into the coding format.
The above steps pre-train based on all the words in the word segmentation data set, preventing the loss of most of the lexical information.
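A minimal sketch of this preprocessing flow, assuming the jieba tokenizer and the gensim 4.x Word2Vec API; the corpus path, the threshold N, the vector size and the 0-based code assignment are illustrative placeholders rather than values fixed by this application:

```python
# Data set preprocessing: segment the raw corpus, train Word2Vec on ALL words,
# sort words by occurrence frequency, and encode the top-N words by rank with "UNK" for the rest.
from collections import Counter

import jieba
from gensim.models import Word2Vec

N = 100_000  # set threshold, chosen from the 50000-150000 range

with open("raw_corpus.txt", encoding="utf-8") as f:            # placeholder corpus file
    segmented = [jieba.lcut(line.strip()) for line in f if line.strip()]

# min_count=1 keeps every word, so low-frequency words also get embedding vectors.
w2v = Word2Vec(sentences=segmented, vector_size=128, window=5, min_count=1)

freq = Counter(tok for sent in segmented for tok in sent)      # occurrence frequencies
high_freq = [w for w, _ in freq.most_common(N)]                # high-frequency vocabulary, rank order

code = {w: i for i, w in enumerate(high_freq)}                 # codes 0..N-1 by frequency rank
UNK_ID = N                                                     # all low-frequency words share one code
encoded = [[code.get(tok, UNK_ID) for tok in sent] for sent in segmented]
```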
In some of these embodiments, the Bert model pre-training step S3 further comprises:
a partial parameter pre-training step S31, which is used for initializing the word list size of a word embedding matrix of the Bert model to be N+1, specifically, the word embedding matrix comprises N+1 rows of word vectors, freezing the word embedding matrix parameters of the Bert model, and inputting the word vectors in the word embedding matrix of all vocabularies to the Bert model so as to train the non-word embedding matrix parameters in the Bert model; by way of example, and not limitation, such as: attention layer parameters, forward propagation parameters, intermediate layer parameters.
And a model parameter pre-training step S32, which is used for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model by using the word vectors, and inputting the vocabulary codes to the Bert model so as to train each layer of parameters of the Bert model, wherein each layer of parameters comprises the word embedding matrix and the non-word-embedding-matrix parameters.
Through the above steps, the problem of poor training effect when the Bert model is used to pre-train a million-level large-vocabulary data set is solved through multi-step pre-training. By reducing the model learning rate, model oscillation is prevented and the Bert model parameters are learned.
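As an illustrative continuation of the preprocessing sketch above, reusing the assumed names w2v and segmented (the helper sentence_to_embeds is a hypothetical name introduced here), the step S31 inputs can be built by looking up every token, including low-frequency ones, in the full Word2Vec vocabulary:

```python
# Building step-S31 inputs: each token keeps its own Word2Vec vector, so no word is
# collapsed into "UNK" at this stage and no lexical information is lost.
import numpy as np
import torch

def sentence_to_embeds(tokens, w2v):
    """Stack the Word2Vec vectors of one segmented sentence into a (seq_len, dim) tensor."""
    return torch.from_numpy(np.stack([w2v.wv[tok] for tok in tokens])).float()

embeds = sentence_to_embeds(segmented[0], w2v).unsqueeze(0)    # shape: (1, seq_len, 128)
# Passed to the Bert model as `inputs_embeds` while its (N+1)-row word embedding matrix
# stays frozen, so only attention, feed-forward and other non-embedding parameters update.
```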
In some of these embodiments, the model parameter pre-training step S32 further includes:
a model adjustment step S321, configured to reduce the learning rate of the Bert model;
a word embedding matrix initializing step S322, which is used for initializing the word embedding matrix of the Bert model by utilizing the word vectors;
and a model secondary pre-training step S323, namely inputting vocabulary codes into the Bert model to train parameters of each layer.
In some of these embodiments, the word embedding matrix initializing step further comprises:
step S3221, word vectors corresponding to N high-frequency words in the word vectors are obtained and used for initializing word vectors of words corresponding to word embedding matrixes of the Bert model;
step S3222, the mean value of the word vectors corresponding to the high-frequency words in the word vectors is obtained to initialize the word vectors corresponding to the UNK of the word embedding matrix of the Bert model.
Based on the steps, the Word vector in the Bert model is initialized based on the Word vector trained by the Word2Vec model, so that the overfitting phenomenon is effectively prevented.
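A minimal sketch of this initialization under the same assumptions as the earlier sketches (reusing the illustrative names model, w2v, high_freq and UNK_ID, and assuming the corpus contains at least N distinct words):

```python
# Initialize the Bert word embedding matrix from the Word2Vec vectors before the second
# pre-training step: rows 0..N-1 receive the vectors of the high-frequency words in
# frequency-rank order, and the "UNK" row receives the mean of those vectors.
import numpy as np
import torch

with torch.no_grad():
    emb = model.bert.embeddings.word_embeddings.weight         # shape: (N + 1, hidden_size)
    ranked = np.stack([w2v.wv[w] for w in high_freq])          # ordered by frequency rank
    emb[:UNK_ID] = torch.from_numpy(ranked)                    # codes 0..N-1: their own vectors
    emb[UNK_ID] = torch.from_numpy(ranked.mean(axis=0))        # "UNK" row: mean vector
```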
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
The embodiment also provides a Bert model pre-training system, which is used for implementing the above embodiments and preferred embodiments, and is not described in detail. As used below, the terms "module," "unit," "sub-unit," and the like may be a combination of software and/or hardware that implements a predetermined function. While the system described in the following embodiments is preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4-5 are block diagrams of the structure of a Bert model pre-training system according to embodiments of the present application, with reference to fig. 4-5, the system comprising:
the original data set acquisition module 1 is used for acquiring an original data set;
the data set preprocessing module 2 is used for carrying out word segmentation processing on the original data set to obtain a word segmentation data set, training the word segmentation data set through a Word2Vec model to obtain word embedding matrixes of all words, and sorting and encoding the words according to their occurrence frequency to obtain high-frequency words, low-frequency words and word codes; specifically, the data set preprocessing module 2 further includes: a word segmentation data set acquisition module 21, configured to perform word segmentation processing on the original data set to obtain the word segmentation data set; a word embedding matrix acquisition module 22, configured to input all the vocabularies in the word segmentation data set into the Word2Vec model for training to obtain a word embedding matrix of all the vocabularies; a vocabulary sorting module 23, configured to sort all the vocabularies in the word segmentation data set according to their occurrence frequency, and divide the vocabularies in the word segmentation data set into N high-frequency vocabularies with higher frequency and the remaining low-frequency vocabularies according to a set threshold value N, where the value range of N is 50000-150000; and a vocabulary encoding module 24, configured to encode the high-frequency vocabularies sequentially according to the ordering and encode the low-frequency vocabularies as "UNK" to obtain the vocabulary codes. The above modules pre-train based on all the words in the word segmentation data set, preventing the loss of most of the lexical information.
And the Bert model pre-training module 3 is used for freezing the word embedding matrix parameters of the Bert model, training the Bert model based on the word embedding matrixes of all the vocabularies, and then reducing the learning rate and inputting the codes of the vocabularies to train the Bert model again. Specifically, the Bert model pre-training module 3 further includes: the partial parameter pre-training module 31, configured to initialize the vocabulary size of the word embedding matrix of the Bert model to N+1 (specifically, the word embedding matrix includes N+1 rows of word vectors), freeze the word embedding matrix parameters of the Bert model, and input the word vectors in the word embedding matrix of all vocabularies to the Bert model to train the non-word-embedding-matrix parameters of the Bert model, by way of example and not limitation: attention layer parameters, forward propagation parameters, intermediate layer parameters; and the model parameter pre-training module 32, configured to reduce the learning rate of the Bert model, initialize the word embedding matrix of the Bert model by using the word vectors, and input the vocabulary codes into the Bert model to train each layer of parameters of the Bert model, where each layer of parameters includes the word embedding matrix and the non-word-embedding-matrix parameters. Through these modules, the problem of poor training effect when the Bert model is used to pre-train a million-level large-vocabulary data set is solved through multi-step pre-training. By reducing the model learning rate, model oscillation is prevented and the Bert model parameters are learned.
Wherein the model parameter pre-training module 32 further comprises: the model adjustment module 321 is configured to reduce the learning rate of the Bert model; the word embedding matrix initializing module 322 is configured to initialize a word embedding matrix of the Bert model by using word vectors, and optionally, obtain word vectors corresponding to N high-frequency words in the word vectors for initializing word vectors of words corresponding to the word embedding matrix of the Bert model; acquiring the average value of word vectors corresponding to high-frequency words in the word vectors, and initializing the word vector of word embedding matrix corresponding to word UNK of the Bert model; the model secondary pre-training module 323 inputs the vocabulary code into the Bert model to train the parameters of each layer. Based on the above modules, the Word vector in the Bert model is initialized based on the Word vector trained by the Word2Vec model, so that the overfitting phenomenon is effectively prevented.
Based on the above modules, the Bert model pre-training is realized in two steps with different model inputs. In the first step, the word vectors of the words in the word segmentation data set, i.e. the word embedding matrix, are input so as to train the non-word-embedding-matrix parameters of the Bert model; in the second step, the codes of the words are input so as to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters. Because the first step uses the word vectors of all words based on the word embedding matrix of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second step, training the word embedding matrix parameters with a smaller learning rate effectively prevents the model from oscillating, the parameters converge still better, and the large-vocabulary problem is effectively solved.
In addition, the Bert model pre-training method of the embodiment of the present application described in connection with fig. 1 may be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions.
In particular, the processor may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
The memory may include mass storage for data or instructions. By way of example, and not limitation, the memory may comprise a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a universal serial bus (USB) drive, or a combination of two or more of the foregoing. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is a non-volatile memory. In particular embodiments, the memory includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), an electrically alterable ROM (EAROM) or a flash memory (FLASH), or a combination of two or more of these. The RAM may be a static random access memory (SRAM) or a dynamic random access memory (DRAM), where the DRAM may be a fast page mode DRAM (FPMDRAM), an extended data out DRAM (EDODRAM), a synchronous DRAM (SDRAM), or the like, where appropriate.
The memory may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by the processor.
The processor reads and executes the computer program instructions stored in the memory to implement any of the Bert model pre-training methods of the above embodiments.
In addition, in combination with the Bert model pre-training method in the above embodiments, embodiments of the present application may provide a computer readable storage medium for implementation. The computer readable storage medium has stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the Bert model pre-training methods of the above embodiments.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of this description.
The above examples merely represent a few embodiments of the present application; their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the present application, and all of them fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (8)

1. A Bert model pre-training method, comprising:
an original data set acquisition step of acquiring an original data set;
a data set preprocessing step, which is used for carrying out Word segmentation processing on the original data set to obtain a Word segmentation data set, training the Word segmentation data set through a Word2Vec model to obtain Word embedding matrixes of all words, sorting and encoding the words according to the occurrence frequency, dividing the sorted words into high-frequency words and low-frequency words according to a set threshold N, and generating Word encoding based on the high-frequency words and the low-frequency words;
a Bert model pre-training step, which is used for training the Bert model based on the word embedding matrix parameters of the whole vocabulary and then reducing the learning rate, inputting the vocabulary code and training the Bert model again, and further comprises the following steps:
a partial parameter pre-training step, which is used for initializing the word list size of the word embedding matrix of the Bert model to be N+1, freezing the word embedding matrix parameters of the Bert model, and inputting word vectors in the word embedding matrix of all the vocabularies to the Bert model so as to train non-word embedding matrix parameters in the Bert model;
and model parameter pre-training, namely reducing the learning rate of the Bert model, initializing a word embedding matrix of the Bert model by using the word vector, and inputting the vocabulary code to the Bert model so as to train each layer of parameters of the Bert model.
2. The Bert model pretraining method of claim 1, wherein the dataset pretraining step further comprises:
a word segmentation data set obtaining step, which is used for carrying out word segmentation processing on the original data set to obtain the word segmentation data set;
a Word embedding matrix obtaining step, which is used for inputting all words in the Word segmentation data set into the Word2Vec model for training to obtain a Word embedding matrix of all words;
a vocabulary ordering step, which is used for ordering all the vocabularies in the word segmentation data set according to the occurrence frequency, and dividing the vocabularies in the word segmentation data set into N high-frequency vocabularies with higher frequency and the rest low-frequency vocabularies according to the set threshold N;
and a vocabulary coding step, which is used for sequentially coding the high-frequency vocabularies according to the sorting and coding the low-frequency vocabularies into UNK to obtain the vocabulary coding.
3. The Bert model pre-training method of claim 2, wherein the model parameter pre-training step further comprises:
a model adjustment step for reducing the learning rate of the Bert model;
a word embedding matrix initializing step, which is used for initializing a word embedding matrix of the Bert model by utilizing the word vector;
and model secondary pre-training, namely inputting the vocabulary codes into the Bert model to train parameters of each layer.
4. The Bert model pre-training method of claim 3, wherein the word embedding matrix initializing step further comprises:
acquiring word vectors corresponding to N high-frequency words in the word vectors, and initializing word vectors of words corresponding to word embedding matrixes of the Bert model;
and acquiring the average value of the word vectors corresponding to the high-frequency words in the word vectors to initialize the word vectors corresponding to the UNK of the word embedding matrix of the Bert model.
5. A Bert model pretraining system, comprising:
the original data set acquisition module is used for acquiring an original data set;
the data set preprocessing module is used for carrying out Word segmentation processing on the original data set to obtain a Word segmentation data set, training the Word segmentation data set through a Word2Vec model to obtain Word embedding matrixes of all words, sorting and encoding the words according to the occurrence frequency, dividing the sorted words into high-frequency words and low-frequency words according to a set threshold N, and generating Word encoding based on the high-frequency words and the low-frequency words;
the Bert model pre-training module is configured to freeze word embedding matrix parameters of the Bert model, train the Bert model based on word embedding matrices of all vocabularies, reduce learning rate, and input codes of the vocabularies to train the Bert model again, where the Bert model pre-training module further includes:
the partial parameter pre-training module is used for initializing the word list size of the word embedding matrix of the Bert model to be N+1, freezing the word embedding matrix parameters of the Bert model, and inputting word vectors in the word embedding matrix of all the vocabularies to the Bert model so as to train the non-word embedding matrix parameters in the Bert model;
and the model parameter pre-training module is used for reducing the learning rate of the Bert model, initializing a word embedding matrix of the Bert model by using the word vector, and inputting the vocabulary code to the Bert model so as to train each layer of parameters of the Bert model.
6. The Bert model pretraining system of claim 5, wherein the dataset pretraining module further comprises:
the word segmentation data set acquisition module is used for carrying out word segmentation on the original data set to obtain the word segmentation data set;
the Word embedding matrix acquisition module is used for inputting all the words in the Word segmentation data set into the Word2Vec model for training to obtain a Word embedding matrix of all the words;
the vocabulary sorting module is used for sorting all the vocabularies in the word segmentation data set according to the occurrence frequency, and dividing the vocabularies in the word segmentation data set into N high-frequency vocabularies with higher frequency and the rest low-frequency vocabularies according to a set threshold N;
and the vocabulary coding module is used for sequentially coding the high-frequency vocabularies according to the sorting and coding the low-frequency vocabularies into UNK so as to obtain the vocabulary coding.
7. The Bert model pre-training system of claim 6, wherein the model parameter pre-training module further comprises:
the model adjustment module is used for reducing the learning rate of the Bert model;
the word embedding matrix initializing module is used for initializing a word embedding matrix of the Bert model by utilizing the word vector;
and the model secondary pre-training module inputs the vocabulary codes into the Bert model to train parameters of each layer.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the Bert model pre-training method according to any of claims 1 to 4 when executing the computer program.
CN202011503784.1A 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment Active CN112528650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011503784.1A CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011503784.1A CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Publications (2)

Publication Number Publication Date
CN112528650A CN112528650A (en) 2021-03-19
CN112528650B true CN112528650B (en) 2024-04-02

Family

ID=75001494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011503784.1A Active CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Country Status (1)

Country Link
CN (1) CN112528650B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN114861651B (en) * 2022-05-05 2023-05-30 北京百度网讯科技有限公司 Model training optimization method, computing device, electronic device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN108509427A (en) * 2018-04-24 2018-09-07 北京慧闻科技发展有限公司 The data processing method of text data and application
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CA3065784A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant classifier based on deep neural networks
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CA3081242A1 (en) * 2019-05-22 2020-11-22 Royal Bank Of Canada System and method for controllable machine text generation architecture

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CA3065784A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant classifier based on deep neural networks
CN108509427A (en) * 2018-04-24 2018-09-07 北京慧闻科技发展有限公司 The data processing method of text data and application
CA3081242A1 (en) * 2019-05-22 2020-11-22 Royal Bank Of Canada System and method for controllable machine text generation architecture
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT;Poerner Nina 等;《Findings of the Association for Computational Linguistics: EMNLP 2020》;803-818 *
Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies;Kim Yunsu 等;《Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics》;1246-1257 *
Personalized news recommendation based on collaborative representation learning; 梁仕威 et al.; Journal of Chinese Information Processing; Vol. 32, No. 11; 72-78 *
Research on text representation and classification based on deep learning; 王晶; China Masters' Theses Full-text Database, Information Science and Technology; No. 08; I138-1395 *
Research on text word vectors and pre-trained language models; 徐菲菲 et al.; Journal of Shanghai University of Electric Power; Vol. 36, No. 04; 320-328 *

Also Published As

Publication number Publication date
CN112528650A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
US11544539B2 (en) Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
US20180260709A1 (en) Calculating device and method for a sparsely connected artificial neural network
US20210004677A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
CN112528650B (en) Bert model pre-training method, system and computer equipment
CN109948735B (en) Multi-label classification method, system, device and storage medium
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
CN113312505B (en) Cross-modal retrieval method and system based on discrete online hash learning
CN111723914A (en) Neural network architecture searching method based on convolution kernel prediction
CN111651576B (en) Multi-round reading understanding method based on transfer learning
CN109918507B (en) textCNN (text-based network communication network) improved text classification method
CN115080749B (en) Weak supervision text classification method, system and device based on self-supervision training
CN111651668B (en) User portrait label generation method and device, storage medium and terminal
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
CN116415170A (en) Prompt learning small sample classification method, system, equipment and medium based on pre-training language model
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN115017178A (en) Training method and device for data-to-text generation model
CN114329029A (en) Object retrieval method, device, equipment and computer storage medium
Guo et al. On trivial solution and high correlation problems in deep supervised hashing
Li et al. GAAF: searching activation functions for binary neural networks through genetic algorithm
WO2022052468A1 (en) Methods and systems for product quantization-based compression of matrix
CN113743277A (en) Method, system, equipment and storage medium for short video frequency classification
CN116797850A (en) Class increment image classification method based on knowledge distillation and consistency regularization
CN116738983A (en) Word embedding method, device and equipment for performing financial field task processing by model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant