CN112528650A - Method, system and computer equipment for pretraining Bert model - Google Patents


Info

Publication number
CN112528650A
CN112528650A
Authority
CN
China
Prior art keywords
word
bert model
model
data set
training
Prior art date
Legal status
Granted
Application number
CN202011503784.1A
Other languages
Chinese (zh)
Other versions
CN112528650B (en)
Inventor
佘璇
段少毅
Current Assignee
Enyike Beijing Data Technology Co ltd
Original Assignee
Enyike Beijing Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Enyike Beijing Data Technology Co ltd filed Critical Enyike Beijing Data Technology Co ltd
Priority to CN202011503784.1A
Publication of CN112528650A
Application granted
Publication of CN112528650B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a method, a system and computer equipment for pre-training a Bert model. The method for pre-training the Bert model comprises the following steps: an original data set acquisition step, for acquiring an original data set; a data set preprocessing step, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes; and a Bert model pre-training step, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again. Through the application, the convergence of the model parameters is optimized and model oscillation is effectively prevented.

Description

Method, system and computer equipment for pretraining Bert model
Technical Field
The application relates to the technical field of the Internet, and in particular to a method, a system and computer equipment for pre-training a Bert model.
Background
With the rise of deep learning, more and more pre-trained models are applied to natural language processing tasks and greatly improve model performance. Early natural language pre-training used word vector methods such as word2vec, which map high-dimensional sparse word vectors to dense low-dimensional vectors that are fed into subsequent models. As deep learning developed, larger and more powerful pre-trained models were proposed; Bert is a representative example, and by virtue of its strong pre-training it has achieved the best results on many tasks.
Besides natural language processing tasks, natural language pre-training models can also be applied to other sequence data. For example, the records collected by big-data Internet companies of the advertisements or goods that users have viewed or clicked can be modeled as data sets similar to natural language. However, such data also differs from natural language data sets in that the amount of advertisement and merchandise data is usually very large: the number of advertisement types on the Internet typically exceeds a million, and the number of e-commerce goods may be even larger. In natural language data sets the vocabulary usually contains only tens of thousands of words, so a vocabulary of millions of entries (i.e., millions of goods or advertisements) is difficult for a model to handle: an excessively large vocabulary requires a large word embedding matrix to store it, i.e., the model must learn more parameters, which makes the model far too large. Furthermore, the frequency distribution of the words in such a vocabulary is usually long-tailed, i.e., a small part of the vocabulary appears very frequently while a large part appears very rarely, which makes model learning even more difficult.
The existing method for pre-training a Bert model on a large-vocabulary data set generally adopts two-step pre-training: (1) preprocess the data set: sort all words that appear by frequency, keep only the n most frequent words, and represent all remaining words by the single token "UNK" ("unknown"); then pre-train word2vec to obtain the word embedding vector of every word; (2) substitute the word embedding matrix obtained by word2vec pre-training into the Bert model's word embedding matrix as its initialization, and then pre-train the Bert model.
In this pre-training method, the words ranked below the top n by frequency are all set to the same token "UNK" before word2vec is trained, so a great deal of word information is lost: many words that carry different information are represented by the same token during preprocessing. On the other hand, directly initializing the Bert model's word embedding matrix with the word2vec pre-trained word embedding matrix and training it along with the Bert model may degrade the already-trained word embeddings.
Disclosure of Invention
The embodiments of the application provide a method, a system, computer equipment and a computer-readable storage medium for pre-training a Bert model, which optimize the convergence of the model parameters and effectively prevent model oscillation.
In a first aspect, an embodiment of the present application provides a method for pre-training a Bert model, including:
an original data set acquisition step for acquiring an original data set;
a data set preprocessing step, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and a Bert model pre-training step, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on these steps, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, so the parameters converge better still.
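For illustration only, the two-stage schedule can be sketched in Python with a PyTorch-style training loop. This is a minimal sketch under assumed names: bert, mlm_loss, word_vector_batches and code_batches are illustrative placeholders, the attribute path bert.embeddings.word_embeddings follows the layout of the HuggingFace transformers BertModel, and the learning rates are examples; none of these are prescribed by the application.

    import torch

    def pretrain_in_two_stages(bert, mlm_loss, word_vector_batches, code_batches):
        # Stage 1: freeze the word embedding matrix and feed the Word2Vec word vectors
        # directly (inputs_embeds), so only the non-word-embedding-matrix parameters
        # (attention, feed-forward, etc.) receive gradient updates.
        bert.embeddings.word_embeddings.weight.requires_grad = False
        opt1 = torch.optim.AdamW((p for p in bert.parameters() if p.requires_grad), lr=1e-4)
        for vectors, labels in word_vector_batches:
            loss = mlm_loss(bert(inputs_embeds=vectors), labels)
            loss.backward()
            opt1.step()
            opt1.zero_grad()

        # Stage 2: unfreeze the embedding matrix (initialised from the Word2Vec vectors,
        # as described later), lower the learning rate and feed the integer vocabulary
        # codes so that every parameter, including the embeddings, is trained.
        bert.embeddings.word_embeddings.weight.requires_grad = True
        opt2 = torch.optim.AdamW(bert.parameters(), lr=1e-5)  # smaller learning rate
        for codes, labels in code_batches:
            loss = mlm_loss(bert(input_ids=codes), labels)
            loss.backward()
            opt2.step()
            opt2.zero_grad()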
In some of these embodiments, the dataset preprocessing step further comprises:
a word segmentation data set acquisition step, for performing word segmentation on the original data set to obtain the word segmentation data set;
a word embedding matrix acquisition step, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
a vocabulary sorting step, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000;
and a vocabulary encoding step, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
These steps pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
In some embodiments, the pre-training step of the Bert model further comprises:
a partial parameter pre-training step, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters;
and a model parameter pre-training step, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
Through these steps, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
In some of these embodiments, the model parameter pre-training step further comprises:
a model adjustment step for reducing the learning rate of the Bert model;
initializing a word embedding matrix, namely initializing the word embedding matrix of the Bert model by using the word vector;
and a model secondary pre-training step, for inputting the vocabulary codes into the Bert model to train the parameters of every layer.
In some embodiments, the word embedding matrix initializing step further comprises:
obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model;
and obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model.
Based on these steps, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
In a second aspect, an embodiment of the present application provides a Bert model pre-training system, including:
the original data set acquisition module is used for acquiring an original data set;
the data set preprocessing module, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and the Bert model pre-training module, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on these modules, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, makes the parameters converge better still, and effectively handles a large vocabulary.
In some of these embodiments, the dataset preprocessing module further comprises:
the word segmentation data set acquisition module, for performing word segmentation on the original data set to obtain the word segmentation data set;
the word embedding matrix acquisition module, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
the vocabulary sorting module, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000;
and the vocabulary encoding module, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
These modules pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
In some embodiments, the Bert model pre-training module further comprises:
the partial parameter pre-training module, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters;
and the model parameter pre-training module, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
Through these modules, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
In some of these embodiments, the model parameter pre-training module further comprises:
the model adjusting module is used for reducing the learning rate of the Bert model;
the word embedding matrix initialization module, for initializing the word embedding matrix of the Bert model with the word vectors; specifically, the word vectors of the N high-frequency words are obtained to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model, and the mean of the word vectors of the high-frequency words is obtained to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model;
and the model secondary pre-training module, which inputs the vocabulary codes into the Bert model to train the parameters of every layer.
Based on these modules, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the method for pre-training a Bert model according to the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the Bert model pre-training method as described in the first aspect above.
Compared with the related art, the method, the system, the computer equipment and the computer-readable storage medium for pre-training a Bert model provided by the embodiments of the application use multi-step pre-training to solve the problem that the Bert model pre-trains poorly on a million-word vocabulary data set, prevent the loss of most vocabulary information, prevent model oscillation, and optimize the convergence of the model parameters, so that both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model converge better.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a method of pretraining a Bert model in accordance with an embodiment of the present application;
FIG. 2 is a flowchart of step S2 of a method for pre-training a Bert model according to an embodiment of the present application;
FIG. 3 is a flowchart of step S32 of a method for pre-training a Bert model according to an embodiment of the present application;
FIG. 4 is a block diagram of a Bert model pre-training system according to an embodiment of the present application;
FIG. 5 is a block diagram of sub-module structures of a Bert model pre-training system according to an embodiment of the present application.
Description of reference numerals:
1. an original data set acquisition module; 2. a data set preprocessing module; 3. a Bert model pre-training module;
21. a word segmentation data set acquisition module; 22. a word embedding matrix acquisition module;
23. a vocabulary ordering module; 24. a vocabulary encoding module; 31. a partial parameter pre-training module;
32. a model parameter pre-training module; 321. a model adjustment module; 322. a word embedding matrix initialization module; 323. and a model secondary pre-training module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The embodiment provides a pretraining method for a Bert model. Fig. 1-3 are flowcharts of a method for pretraining a Bert model according to an embodiment of the present application, and referring to fig. 1-3, the flowcharts include the following steps:
a raw data set acquisition step S1 for acquiring a raw data set.
A data set preprocessing step S2, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes; optionally, the word segmentation may use, but is not limited to, Jieba word segmentation.
A Bert model pre-training step S3, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
Based on these steps, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, so the parameters converge better still.
In some of these embodiments, the data set preprocessing step S2 further includes:
A word segmentation data set acquisition step S21, for performing word segmentation on the original data set to obtain the word segmentation data set;
A word embedding matrix acquisition step S22, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
A vocabulary sorting step S23, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000.
A vocabulary encoding step S24, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK" to obtain the vocabulary codes; specifically, the vocabulary codes comprise the codes of the high-frequency words and the code of the low-frequency token "UNK". For example, and without limitation, if word A is ranked 1 and word B is ranked 2, the codes of word A and word B are obtained by converting ranks 1 and 2 into the encoding format.
These steps pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
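A minimal sketch of steps S21 to S24 follows, for illustration only. It assumes the jieba and gensim libraries as one possible tool choice (the application only states that Jieba-style word segmentation and a Word2Vec model may be used); raw_corpus, the vector size and the concrete value of N are illustrative assumptions rather than values from the application.

    from collections import Counter
    import jieba
    from gensim.models import Word2Vec

    N = 100000                                   # illustrative threshold within the 50,000-150,000 range
    raw_corpus = ["placeholder raw text line one", "placeholder raw text line two"]

    # S21: word segmentation of the original data set
    segmented = [list(jieba.cut(line)) for line in raw_corpus]

    # S22: train Word2Vec on all words to obtain the word embedding matrix of all words
    w2v = Word2Vec(sentences=segmented, vector_size=128, min_count=1)

    # S23: sort the vocabulary by occurrence frequency and split at the threshold N
    freq = Counter(token for sentence in segmented for token in sentence)
    ranked = [token for token, _ in freq.most_common()]
    high_freq, low_freq = ranked[:N], ranked[N:]

    # S24: encode high-frequency words by rank (1..N) and every low-frequency word as "UNK" (code 0)
    code = {token: rank + 1 for rank, token in enumerate(high_freq)}
    encoded = [[code.get(token, 0) for token in sentence] for sentence in segmented]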
In some of these embodiments, the Bert model pre-training step S3 further includes:
A partial parameter pre-training step S31, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters.
A model parameter pre-training step S32, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model, where the parameters of every layer include the word embedding matrix and the non-word-embedding-matrix parameters.
Through these steps, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
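The following sketch illustrates step S31 using the HuggingFace transformers library as one assumed implementation; the application does not prescribe a framework, and the configuration values (hidden size, number of attention heads, learning rate) are assumptions chosen so that the hidden size matches the Word2Vec vector size used above and the word vectors can be fed in directly.

    import torch
    from transformers import BertConfig, BertForMaskedLM

    N = 100000
    # Vocabulary size N + 1: one row per high-frequency word plus one row for "UNK".
    config = BertConfig(vocab_size=N + 1, hidden_size=128,
                        num_attention_heads=8, intermediate_size=512)
    model = BertForMaskedLM(config)

    # Freeze the word embedding matrix so that only the non-word-embedding-matrix
    # parameters (attention, intermediate / feed-forward layers, etc.) are trained.
    model.bert.embeddings.word_embeddings.weight.requires_grad = False

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)

    # In this stage the Word2Vec word vectors are passed in directly, e.g.
    #   outputs = model(inputs_embeds=word_vector_batch, labels=mlm_labels)
    # so the frozen embedding lookup is bypassed.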
In some of these embodiments, the model parameter pre-training step S32 further includes:
a model adjustment step S321 for reducing the learning rate of the Bert model;
a word embedding matrix initializing step S322, configured to initialize a word embedding matrix of the Bert model by using the word vector;
and a model secondary pre-training step S323, inputting the vocabulary codes into the Bert model to train parameters of each layer.
In some embodiments, the word embedding matrix initializing step further comprises:
Step S3221, obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model;
Step S3222, obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model.
Based on these steps, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
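Continuing the sketches above (and reusing their illustrative w2v, high_freq and model objects), steps S3221 and S3222 might look as follows; placing "UNK" in row 0 and the high-frequency words in rows 1..N is an assumption that matches the encoding used in the preprocessing sketch, not a layout fixed by the application.

    import numpy as np
    import torch

    with torch.no_grad():
        emb = model.bert.embeddings.word_embeddings.weight          # shape (N + 1, 128)
        vecs = torch.from_numpy(
            np.stack([w2v.wv[token] for token in high_freq])).to(emb.dtype)
        emb[1:] = vecs              # S3221: rows 1..N take the high-frequency Word2Vec vectors
        emb[0] = vecs.mean(dim=0)   # S3222: the "UNK" row takes the mean of those vectors

    # S321/S323: unfreeze the embedding matrix and train all layers at a reduced learning
    # rate, now feeding the integer vocabulary codes (input_ids) instead of word vectors.
    model.bert.embeddings.word_embeddings.weight.requires_grad = True
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)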
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
The embodiment also provides a Bert model pre-training system, which implements the foregoing embodiments and preferred embodiments; descriptions already given are not repeated. As used hereinafter, the terms "module", "unit", "subunit" and the like may be a combination of software and/or hardware that implements a predetermined function. Although the system described in the embodiments below is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Fig. 4-5 are block diagrams of the structure of a Bert model pre-training system according to an embodiment of the present application, and referring to fig. 4-5, the system includes:
an original data set obtaining module 1, configured to obtain an original data set;
the data set preprocessing module 2, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes. Specifically, the data set preprocessing module 2 further includes: the word segmentation data set acquisition module 21, for performing word segmentation on the original data set to obtain the word segmentation data set; the word embedding matrix acquisition module 22, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words; the vocabulary sorting module 23, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words, where N ranges from 50,000 to 150,000; and the vocabulary encoding module 24, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK" to obtain the vocabulary codes. These modules pre-train on all words in the word segmentation data set, preventing the loss of the information carried by most of the vocabulary.
The Bert model pre-training module 3, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again. Specifically, the Bert model pre-training module 3 further includes: the partial parameter pre-training module 31, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1 (specifically, the word embedding matrix contains N + 1 rows of word vectors), freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model, which by way of example and not limitation include attention layer parameters, forward propagation parameters and Intermediate layer parameters; and the model parameter pre-training module 32, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model, where the parameters of every layer include the word embedding matrix and the non-word-embedding-matrix parameters. Through these modules, multi-step pre-training solves the problem that the Bert model trains poorly when pre-training on a million-word vocabulary data set; reducing the learning rate prevents model oscillation while the Bert model parameters are learned.
The model parameter pre-training module 32 further includes: the model adjustment module 321, for reducing the learning rate of the Bert model; the word embedding matrix initialization module 322, for initializing the word embedding matrix of the Bert model with the word vectors, optionally by obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model and obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model; and the model secondary pre-training module 323, which inputs the vocabulary codes into the Bert model to train the parameters of every layer. Based on these modules, the word vectors of the Bert model are initialized from the word vectors obtained by training the Word2Vec model, which effectively prevents overfitting.
Based on these modules, the Bert model is pre-trained in two stages with different inputs. In the first stage, the word vectors of the words in the word segmentation data set, i.e., the word embedding matrix, are input to train the non-word-embedding-matrix parameters of the Bert model. In the second stage, the vocabulary codes of the words are input to train both the word embedding matrix parameters and the non-word-embedding-matrix parameters of the Bert model. Because the first stage uses the word vectors of all words, no information is lost and the non-word-embedding-matrix parameters converge better; in the second stage, training the word embedding matrix parameters with a smaller learning rate effectively prevents model oscillation, makes the parameters converge better still, and effectively handles a large vocabulary.
In addition, the method for pretraining the Bert model in the embodiment of the present application described in conjunction with fig. 1 may be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions.
In particular, the processor may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
The memory may include mass storage for data or instructions. By way of example, and not limitation, the memory may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is Non-Volatile memory. In particular embodiments, the memory includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or FLASH memory, or a combination of two or more of these. The RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPMDRAM), Extended Data Out DRAM (EDODRAM), Synchronous DRAM (SDRAM), and the like.
The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor.
The processor reads and executes the computer program instructions stored in the memory to implement any one of the Bert model pre-training methods in the above embodiments.
In addition, in combination with the method for pretraining the Bert model in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the Bert model pre-training methods in the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for pretraining a Bert model, comprising:
an original data set acquisition step for acquiring an original data set;
a data set preprocessing step, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and a Bert model pre-training step, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
2. The method of pretraining a Bert model as recited in claim 1, wherein the dataset preprocessing step further comprises:
a word segmentation data set acquisition step, for performing word segmentation on the original data set to obtain the word segmentation data set;
a word embedding matrix acquisition step, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
a vocabulary sorting step, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words;
and a vocabulary encoding step, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
3. The method of pre-training a Bert model according to claim 1 or 2, wherein the step of pre-training the Bert model further comprises:
a partial parameter pre-training step, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1, freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model;
and a model parameter pre-training step, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
4. The method of pretraining a Bert model in accordance with claim 3, wherein the model parameter pretraining step further comprises:
a model adjustment step for reducing the learning rate of the Bert model;
initializing a word embedding matrix, namely initializing the word embedding matrix of the Bert model by using the word vector;
and a model secondary pre-training step, for inputting the vocabulary codes into the Bert model to train the parameters of every layer.
5. The method of pretraining the Bert model as claimed in claim 4, wherein the word embedding matrix initializing step further comprises:
obtaining the word vectors of the N high-frequency words to initialize the word vectors of the corresponding words in the word embedding matrix of the Bert model;
and obtaining the mean of the word vectors of the high-frequency words to initialize the word vector corresponding to "UNK" in the word embedding matrix of the Bert model.
6. A Bert model pre-training system, comprising:
the original data set acquisition module is used for acquiring an original data set;
the data set preprocessing module, for performing word segmentation on the original data set to obtain a word segmentation data set, training a Word2Vec model on the word segmentation data set to obtain the word embedding matrix of all words, and sorting and encoding the words by occurrence frequency to obtain high-frequency words, low-frequency words and vocabulary codes;
and the Bert model pre-training module, for freezing the word embedding matrix parameters of the Bert model and training the Bert model on the word embedding matrix of all words, then reducing the learning rate and inputting the vocabulary codes to train the Bert model again.
7. The Bert model pre-training system of claim 6, wherein the dataset pre-processing module further comprises:
the word segmentation data set acquisition module, for performing word segmentation on the original data set to obtain the word segmentation data set;
the word embedding matrix acquisition module, for inputting all words of the word segmentation data set into the Word2Vec model for training to obtain the word embedding matrix of all words;
the vocabulary sorting module, for sorting all words in the word segmentation data set by occurrence frequency and, according to a set threshold N, dividing them into the N most frequent (high-frequency) words and the remaining low-frequency words;
and the vocabulary encoding module, for encoding the high-frequency words sequentially according to their rank and encoding every low-frequency word as "UNK", thereby obtaining the vocabulary codes.
8. The Bert model pre-training system of claim 6 or 7, wherein the Bert model pre-training module further comprises:
the partial parameter pre-training module, for initializing the vocabulary size of the word embedding matrix of the Bert model to N + 1, freezing the word embedding matrix parameters of the Bert model, inputting the word vectors of all words from the word embedding matrix into the Bert model, and training the non-word-embedding-matrix parameters of the Bert model;
and the model parameter pre-training module, for reducing the learning rate of the Bert model, initializing the word embedding matrix of the Bert model with the word vectors, and inputting the vocabulary codes into the Bert model to train the parameters of every layer of the Bert model.
9. The Bert model pre-training system of claim 8, wherein the model parameter pre-training module further comprises:
the model adjusting module is used for reducing the learning rate of the Bert model;
the word embedding matrix initialization module is used for initializing the word embedding matrix of the Bert model by utilizing the word vector;
and the model secondary pre-training module inputs the vocabulary codes into the Bert model to train parameters of each layer.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the Bert model pre-training method as recited in any one of claims 1 to 5 when executing the computer program.
CN202011503784.1A 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment Active CN112528650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011503784.1A CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011503784.1A CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Publications (2)

Publication Number Publication Date
CN112528650A true CN112528650A (en) 2021-03-19
CN112528650B CN112528650B (en) 2024-04-02

Family

ID=75001494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011503784.1A Active CN112528650B (en) 2020-12-18 2020-12-18 Bert model pre-training method, system and computer equipment

Country Status (1)

Country Link
CN (1) CN112528650B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN114861651A (en) * 2022-05-05 2022-08-05 北京百度网讯科技有限公司 Model training optimization method, computing device, electronic device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN108509427A (en) * 2018-04-24 2018-09-07 北京慧闻科技发展有限公司 The data processing method of text data and application
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CA3065784A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant classifier based on deep neural networks
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CA3081242A1 (en) * 2019-05-22 2020-11-22 Royal Bank Of Canada System and method for controllable machine text generation architecture

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CA3065784A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant classifier based on deep neural networks
CN108509427A (en) * 2018-04-24 2018-09-07 北京慧闻科技发展有限公司 The data processing method of text data and application
CA3081242A1 (en) * 2019-05-22 2020-11-22 Royal Bank Of Canada System and method for controllable machine text generation architecture
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN111950540A (en) * 2020-07-24 2020-11-17 浙江师范大学 Knowledge point extraction method, system, device and medium based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
KIM YUNSU 等: "Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies", 《PROCEEDINGS OF THE 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》, pages 1246 - 1257 *
POERNER NINA 等: "E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT", 《FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: EMNLP 2020》, pages 803 - 818 *
徐菲菲 et al.: "Research on text word vectors and pre-trained language models", Journal of Shanghai University of Electric Power, vol. 36, no. 4, pages 320-328 *
梁仕威 et al.: "Personalized news recommendation based on collaborative representation learning", Journal of Chinese Information Processing, vol. 32, no. 11, pages 72-78 *
王晶: "Research on text representation and classification based on deep learning", China Masters' Theses Full-text Database, Information Science and Technology, no. 08, pages 138-1395 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN114861651A (en) * 2022-05-05 2022-08-05 北京百度网讯科技有限公司 Model training optimization method, computing device, electronic device and storage medium

Also Published As

Publication number Publication date
CN112528650B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN109685116B (en) Image description information generation method and device and electronic device
CN108875807B (en) Image description method based on multiple attention and multiple scales
US11636283B2 (en) Committed information rate variational autoencoders
US11531889B2 (en) Weight data storage method and neural network processor based on the method
US11403528B2 (en) Self-tuning incremental model compression solution in deep neural network with guaranteed accuracy performance
US20210004677A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
CN113011581B (en) Neural network model compression method and device, electronic equipment and readable storage medium
US11403520B2 (en) Neural network machine translation method and apparatus
CN112528650A (en) Method, system and computer equipment for pretraining Bert model
CN112380319A (en) Model training method and related device
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
CN114880452A (en) Text retrieval method based on multi-view contrast learning
CN112800757A (en) Keyword generation method, device, equipment and medium
CN109145946B (en) Intelligent image recognition and description method
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN109145107A (en) Subject distillation method, apparatus, medium and equipment based on convolutional neural networks
CN115908641A (en) Text-to-image generation method, device and medium based on features
CN113743277A (en) Method, system, equipment and storage medium for short video frequency classification
CN116018589A (en) Method and system for product quantization based matrix compression
CN107247944B (en) Face detection speed optimization method and device based on deep learning
Li et al. A graphical approach for filter pruning by exploring the similarity relation between feature maps
CN113344060A (en) Text classification model training method, litigation shape classification method and device
CN113011555B (en) Data processing method, device, equipment and storage medium
CN115937650A (en) Graph network model for time sequence action detection
Kermarrec et al. Improving communication-efficiency in Decentralized and Federated Learning Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant