CN113255328A - Language model training method and application method - Google Patents

Language model training method and application method Download PDF

Info

Publication number
CN113255328A
Authority
CN
China
Prior art keywords
training
corpus
model
word
word block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110719988.7A
Other languages
Chinese (zh)
Other versions
CN113255328B (en)
Inventor
冀潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Technology Development Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202110719988.7A priority Critical patent/CN113255328B/en
Publication of CN113255328A publication Critical patent/CN113255328A/en
Application granted granted Critical
Publication of CN113255328B publication Critical patent/CN113255328B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the present application provides a training method and an application method for a language model. The training method comprises the following steps: acquiring a first corpus, a second corpus and word block training results; matching word blocks of the first corpus and the second corpus by word-frequency rank; mapping the word block training results onto the first corpus according to the matching result; in the pre-training model, initializing the word blocks of the first corpus with the mapped word block training results; and inputting the first corpus into the pre-training model and then performing task-specific training to generate a language model. The pre-training model is a BERT model, and the word block training results are the word block vectors of the second corpus after training with the BERT model. In this training method, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, so the first corpus achieves a good training effect without relying on a large dataset and the language model can be built more easily.

Description

Language model training method and application method
Technical Field
The embodiment of the application relates to the technical field of natural language processing, in particular to a training method and an application method of a language model.
Background
In the field of Natural Language Processing (NLP), training a language model depends on a very large amount of corpus data, which is a serious limitation.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a method for training a language model and an application method thereof.
In a first aspect, an embodiment of the present application provides a method for training a language model, including:
acquiring a first corpus, a second corpus and word block training results;
performing word frequency ordering matching on word blocks in the first corpus and the second corpus;
mapping the word block training result to the first corpus according to a matching result;
in a pre-training model, initializing word blocks in the first corpus by adopting the mapped word block training results;
inputting the first corpus into the pre-training model, and then performing specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
In the training method of the language model provided by the embodiment of the application, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, and the initialized BERT model is then trained on the specific task with the first corpus to generate the language model. With this design, the first corpus obtains a good training effect without relying on a large training dataset, which alleviates the lack of language models for languages with little corpus data and makes it easier to build a language model.
In one possible embodiment, the performing word frequency ordering matching on the word blocks in the first corpus and the second corpus includes:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
according to the word frequency statistical result, the first word block and the second word block are subjected to forward ordering;
and establishing a matching relation between the first word block and the second word block with the same sequence.
In a possible implementation, the mapping the word block training result to the first corpus according to the matching result includes:
acquiring a word block vector of the second word block based on the word block training result;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
In one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
In a possible implementation, the inputting the first corpus into the pre-training model for task-specific training includes:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting the characteristic vector corresponding to the specific task in the output result, and inputting the characteristic vector to a full connection layer.
In one possible embodiment, the inputting the first corpus into the pre-training model includes:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the pre-training model;
wherein the data enhancement method comprises at least one of out-of-order, stretch, truncation, and MASK.
In one possible embodiment, the loss function of the language model is:
Loss=L1+L2
wherein, L1 is a loss function of the BERT model unsupervised training task; L2 is a loss function for the classification task.
In a second aspect, an embodiment of the present application provides an application method of a language model, where the language model is trained by using the training method in the embodiment of the first aspect;
the application method comprises the following steps:
acquiring an input text;
inputting the input text into a pre-trained model of the language models;
acquiring a feature vector corresponding to a specific task from an output result of the pre-training model;
inputting the feature vectors into a fully connected layer in the language model.
In a possible implementation manner, the specific task includes a classification task, and the feature vector is a classification vector corresponding to the CLS input in the output result.
In one possible implementation, the classification task includes text sentiment analysis and text semantic matching.
In a possible implementation manner, the specific tasks include a question and answer task and a named entity recognition task, and the feature vector is a part of the output result corresponding to a word block in the input text.
In a third aspect, an embodiment of the present application provides an apparatus for generating a language model, including:
the first acquisition module is configured to acquire a first corpus, a second corpus and word block training results;
the ordering matching module is configured to perform word frequency ordering matching on the word blocks in the first corpus and the second corpus;
a matching mapping module configured to map the word block training result to the first corpus according to a matching result;
an initialization module configured to initialize word blocks in the first corpus with the mapped word block training results in a pre-training model;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
In a fourth aspect, an embodiment of the present application provides an application apparatus of a language model, where the language model is trained by using the training method in the embodiment of the first aspect; the application device comprises:
a second obtaining module configured to obtain an input text;
an input module configured to input the input text into a pre-trained model of the language models;
a feature extraction module configured to obtain a feature vector corresponding to a specific task from an output result of the pre-training model BERT model;
a specific task module configured to input the feature vector into a fully-connected layer.
In a fifth aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the program to implement the method according to any one of the first and second aspects.
In a sixth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the first and second aspects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only examples of the embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first flowchart of a method for training a language model according to an embodiment of the present disclosure;
fig. 2 is an example of word frequency statistics of english and japanese vocabularies provided by the embodiment of the present application;
FIG. 3 is a schematic structural diagram of a language model according to an embodiment of the present application;
fig. 4 is a second flowchart of a method for training a language model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for applying a language model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a device for generating a language model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an application apparatus of a language model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application are further described in detail below with reference to the accompanying drawings.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the embodiments of the present application belong, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
In the field of Natural Language Processing (NLP), training a language model requires a very large amount of corpus data and abundant computing resources. For some languages with insufficient corpus data, there is not enough data to support the training task, and a language model cannot be constructed.
In view of this, an embodiment of the present application provides a method for training a language model, as shown in fig. 1, which includes:
step S10: acquiring a first corpus, a second corpus and word block training results;
the first corpus is a corpus that needs to generate a language model, and is usually a corpus with a small corpus data size such as a small language. The second corpus is usually a general corpus, which is a training corpus used for training the language model in the NLP domain, and is a corpus with a large corpus data volume, such as an enwiki data set. That is, in one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
The word block (Token) training result is the word block vector (Token Embedding) of the second corpus after training with the BERT model. It should be noted that a word block in this document refers to a basic semantic unit used to compose a text, i.e., a Token; it may be a character or a word, and the division into Tokens may differ across languages and granularities.
BERT is a pre-training model for natural language processing. Its full name is Bidirectional Encoder Representations from Transformers, i.e., the encoder of a bidirectional Transformer, and its main structure is a stack of Transformer encoders. BERT comes in two sizes, BERT-base and BERT-large: BERT-base uses 12 Transformer encoder layers with about 110 million parameters, while BERT-large uses 24 Transformer encoder layers with about 340 million parameters. The BERT model needs a very large amount of corpus data during pre-training; the first corpus cannot meet this requirement on corpus size, and even with BERT-base many low-resource languages cannot meet it, so the first corpus cannot use the BERT model directly.
Step S20: performing word frequency ordering matching on word blocks in the first corpus and the second corpus;
Herein, a word block in the first corpus is defined as a first word block and a word block in the second corpus as a second word block; the first corpus includes a plurality of first word blocks and the second corpus includes a plurality of second word blocks. Word frequency ordering means counting the occurrence frequency of the first word blocks and the second word blocks and sorting them by frequency; for example, fig. 2 shows word frequency statistics of English and Japanese vocabularies provided in the embodiment of the present application.
Word frequency ordering matching means that a first word block and a second word block whose word-frequency ranks in their respective corpora are the same are matched with each other.
In one possible implementation, step S20 includes:
performing word frequency statistics on a first word block in a first corpus and a second word block in a second corpus;
according to the word frequency statistical result, the first word block and the second word block are subjected to forward ordering;
and establishing a matching relation between first word blocks and second word blocks that have the same rank (a minimal code sketch of this matching is given below).
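As an illustration only (not part of the patent), word-frequency rank matching can be sketched in a few lines of Python; the whitespace tokenization and the function name are simplifying assumptions, since the patent leaves the Token granularity open.

    from collections import Counter

    def rank_match(first_corpus, second_corpus):
        """Match word blocks of two corpora by word-frequency rank.

        Both arguments are lists of texts. Whitespace tokenization is a
        simplifying assumption; the patent leaves the Token granularity open.
        """
        # Word frequency statistics for the first and second word blocks.
        freq_first = Counter(tok for text in first_corpus for tok in text.split())
        freq_second = Counter(tok for text in second_corpus for tok in text.split())

        # Sort both vocabularies by descending frequency.
        ranked_first = [tok for tok, _ in freq_first.most_common()]
        ranked_second = [tok for tok, _ in freq_second.most_common()]

        # Word blocks with the same rank are matched with each other.
        return dict(zip(ranked_first, ranked_second))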
Step S30: mapping the word block training result to a first corpus according to the matching result;
It can generally be assumed that the word block distributions of the first corpus and the second corpus follow similar rules, i.e., the common word blocks of different languages behave similarly. Based on this, the word block training results can be mapped onto the first corpus according to the word frequency ordering matching result of step S20: the word block vector of each second word block is obtained from the word block training results, and a mapping relation is established between the matched first word block and that word block vector.
Step S40: in the pre-training model, initializing word blocks in the first corpus by adopting mapped word block training results;
The pre-training model is a BERT model, and the word block training result is the word block vector of the second corpus after training with the BERT model; these vector values have already converged. The word block training results are mapped onto the first corpus according to the word frequency ordering matching result, and even though a matched first word block and second word block may differ in textual semantics, initializing the BERT model with the mapped word block training results gives the first corpus a far better starting point than random initialization (see the sketch below).
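A hedged sketch of steps S30 and S40, assuming the second-corpus word block vectors are available as a Python dict and the first-corpus model is a randomly initialized Hugging Face BertModel; the argument names and the 768-dimensional hidden size are illustrative, while get_input_embeddings() is a standard transformers accessor.

    import torch
    from transformers import BertConfig, BertModel

    def init_bert_with_mapped_vectors(matching, second_token_vectors,
                                      first_vocab, hidden_size=768):
        """Initialize the token-embedding rows of a first-corpus BERT model with the
        word block vectors borrowed from the second corpus (steps S30/S40).

        matching:             dict first_token -> second_token (from rank matching)
        second_token_vectors: dict second_token -> vector of length hidden_size
        first_vocab:          dict first_token -> row index in the new embedding table
        """
        config = BertConfig(vocab_size=len(first_vocab), hidden_size=hidden_size)
        model = BertModel(config)  # randomly initialized BERT-base-like model

        embeddings = model.get_input_embeddings()  # nn.Embedding(vocab_size, hidden_size)
        with torch.no_grad():
            for first_tok, row in first_vocab.items():
                second_tok = matching.get(first_tok)
                if second_tok in second_token_vectors:
                    # Overwrite the random row with the trained vector of the
                    # rank-matched second-corpus word block.
                    embeddings.weight[row] = torch.tensor(second_token_vectors[second_tok])
        return model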
Step S50: inputting the first corpus into a pre-training model, and then performing specific task training to generate a language model;
fig. 3 is a schematic structural diagram of a language model provided in an embodiment of the present application, and as shown in fig. 3, the language model includes a BERT model and a fully-connected layer for implementing a specific task, where the fully-connected layer may be one layer or more than one layer.
After the BERT model has been initialized with the first corpus, training on the specific task is still needed in order to generate a language model that can be applied to that task.
In the training method of the language model provided by the embodiment of the application, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, and the initialized BERT model is then trained on the specific task with the first corpus to generate the language model. With this design, the first corpus obtains a good training effect without relying on a large training dataset, which alleviates the lack of language models for languages with little corpus data and makes it easier to build a language model.
In one possible implementation, as shown in fig. 4, step S50 includes:
step S51: inputting a first corpus into a pre-training model;
The pre-training model is a BERT model. The input of the BERT model is the sum of three vectors: a word block vector (Token Embedding), a text vector (Segment Embedding) and a position vector (Position Embedding). The Token Embedding represents the embedding of the current word block; the Segment Embedding is the index embedding of the sentence in which the current word block is located; and the Position Embedding is the index embedding of the position of the current word block.
The BERT model also uses several special tokens: [CLS], [SEP] and [MASK]. [CLS] is placed at the head of the first sentence; its value is learned automatically during BERT training and captures the global semantic information of the text, and after passing through the BERT model it yields a classification output C that can be used for subsequent classification tasks. [SEP] separates two input sentences; for example, for input sentences A and B, [SEP] is inserted after sentence A and sentence B follows it. [MASK] masks some of the Tokens in a sentence; after a Token is masked with [MASK], the [MASK] vector output by the BERT model is used to predict that Token.
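Purely as an illustration of this layout (not the patent's own tokenizer), the Hugging Face BERT tokenizer shows how a sentence pair is wrapped with [CLS] and [SEP]; the "bert-base-uncased" checkpoint is only a placeholder, and the printed tokens are approximate.

    from transformers import BertTokenizer

    # "bert-base-uncased" is a placeholder checkpoint; the patent's model would use a
    # tokenizer built from the first corpus instead.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    encoded = tokenizer("the movie was great", "i like it", return_tensors="pt")
    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
    # Roughly: ['[CLS]', 'the', 'movie', 'was', 'great', '[SEP]', 'i', 'like', 'it', '[SEP]']
    # token_type_ids gives the Segment Embedding index (0 for sentence A, 1 for B);
    # the Position Embedding index is simply the token position 0, 1, 2, ...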
Step S52: obtaining an output result of the first corpus after a pre-training model;
The output of the BERT model includes a classification output C and text outputs T: the classification output C is the output vector of [CLS] after passing through the BERT model, and the text outputs T are the output vectors of each Token of the input text after passing through the BERT model.
Step S53: and extracting the characteristic vector corresponding to the specific task in the output result, and inputting the characteristic vector to the full connection layer.
There are various tasks in NLP, such as classification tasks, Question Answering (QA) tasks and Named Entity Recognition (NER) tasks.
The classification task classifies the input text according to preset conditions; in the BERT model, it includes single-sentence classification and sentence-pair classification. Single-sentence classification takes a single sentence as input and is used for text sentiment analysis (e.g., like or dislike) of texts such as movie or product reviews. Sentence-pair classification takes a sentence pair as input and can be used for semantic matching of texts, such as judging whether two sentences have the same meaning, measuring the similarity between them, or determining their relation (entailment, neutral or contradiction).
In the classification task, the feature vector is the classification output C in the output of the BERT model, i.e., the classification vector produced by [CLS] after passing through the BERT model; the classification task is realized by feeding this classification vector into the fully connected layer (a sketch is given below).
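A minimal sketch of this classification branch, assuming a PyTorch/Hugging Face backbone; the class name and the use of the first output position as the [CLS] vector are reasonable assumptions, not the patent's exact implementation.

    import torch.nn as nn
    from transformers import BertModel

    class BertClassifier(nn.Module):
        """BERT followed by one fully connected layer for the classification task."""

        def __init__(self, bert: BertModel, num_classes: int):
            super().__init__()
            self.bert = bert
            self.fc = nn.Linear(bert.config.hidden_size, num_classes)

        def forward(self, input_ids, attention_mask=None, token_type_ids=None):
            out = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
            cls_vector = out.last_hidden_state[:, 0]  # classification output C ([CLS])
            return self.fc(cls_vector)                # class logits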
The question-answering task is given a question together with a paragraph and must mark the exact position of the answer within the paragraph. In such tasks, all text outputs T (one per Token) in the output result are fed into the fully connected layer, which then computes the start position and end position of the answer.
Named entity recognition labels the word blocks of the input text to decide whether each belongs to a person name (Person), a place name (Location), an organization name (Organization), a mixture (Miscellaneous) or other (Other). In this task, the text outputs T are fed into the fully connected layer, as in the sketch below.
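For the question-answering and named-entity-recognition branches, every token output T goes through the fully connected layer. The sketch below assumes the same kind of backbone as above and is illustrative only; num_labels = 2 would give answer start/end logits, while num_labels equal to the number of entity tags gives a NER tagger.

    import torch.nn as nn
    from transformers import BertModel

    class BertTokenHead(nn.Module):
        """BERT followed by a per-token fully connected layer (QA / NER branch)."""

        def __init__(self, bert: BertModel, num_labels: int):
            super().__init__()
            self.bert = bert
            self.fc = nn.Linear(bert.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask=None, token_type_ids=None):
            out = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
            token_vectors = out.last_hidden_state  # text output T, one vector per Token
            return self.fc(token_vectors)          # shape (batch, seq_len, num_labels)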
Because the first corpus is a small dataset with insufficient data, while BERT performs unsupervised training on long texts, the usage scenarios do not fully match. For example, in a sentiment classification task most of the first corpus consists of short texts and the dataset contains little data, so the masked positions cover most of the text and under-training easily occurs. In addition, with so little text the model easily overfits. Therefore, data enhancement is applied to the first corpus; the enhancement methods include at least one of shuffling (out-of-order), extension, truncation and MASK, and the enhanced first corpus increases the generalization of the language model.
Illustratively, the enhancement procedure for the sentiment classification task is as follows (a code sketch is given after the list):
1) Let the first corpus be t; back up (duplicate) the first corpus t so that the total amount of data is doubled, obtaining an amplified corpus T.
2) Randomly extend 20% of the texts in the amplified corpus T to a random length between 400 and 512; the extension can be realized by repeating the text itself several times.
The maximum extension length is 512 here because the maximum input length allowed by the BERT-base model is 512; the maximum extension length may be adjusted to the BERT variant used, for example, in an application scenario of the BERT-large model it may be increased to the maximum length that model allows.
3) In the portion T - t that the amplified corpus T adds relative to the first corpus t, randomly reverse the word order of 40% of the texts with a maximum distance of 2, for example abcde -> adcbe.
4) Following BERT's preprocessing, in the data obtained from steps 1) to 3), take 20% of the text: replace 80% of it with [MASK], replace 10% with random characters, and leave 10% unchanged.
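A character-level sketch of steps 1) to 4), offered only as an illustration: the percentages mirror the example above, but the use of characters instead of word blocks, the random sampling, and the helper name are all simplifying assumptions.

    import random

    def augment_corpus(texts, max_len=512):
        """Sketch of the four enhancement steps for the sentiment classification task."""
        # 1) Back up the first corpus t, doubling the total amount of data (corpus T).
        amplified = list(texts) + list(texts)

        # 2) Randomly extend 20% of the texts to a random length between 400 and 512
        #    by repeating the text itself.
        for i in random.sample(range(len(amplified)), k=len(amplified) // 5):
            if not amplified[i]:
                continue
            target = random.randint(400, max_len)
            while len(amplified[i]) < target:
                amplified[i] += amplified[i]
            amplified[i] = amplified[i][:target]

        # 3) In the duplicated portion T - t, reverse the local order of 40% of the
        #    texts with a maximum distance of 2 (e.g. "abcde" -> "adcbe").
        dup_indices = list(range(len(texts), len(amplified)))
        for i in random.sample(dup_indices, k=len(texts) * 2 // 5):
            chars = list(amplified[i])
            if len(chars) > 2:
                j = random.randrange(len(chars) - 2)
                chars[j], chars[j + 2] = chars[j + 2], chars[j]
                amplified[i] = "".join(chars)

        # 4) BERT-style masking: for roughly 20% of the positions, replace 80% of them
        #    with "[MASK]", 10% with a random character, and leave 10% unchanged.
        vocab = sorted({c for t in amplified for c in t})
        masked = []
        for text in amplified:
            chars = list(text)
            for j in range(len(chars)):
                if random.random() < 0.2:
                    r = random.random()
                    if r < 0.8:
                        chars[j] = "[MASK]"
                    elif r < 0.9 and vocab:
                        chars[j] = random.choice(vocab)
            masked.append("".join(chars))
        return masked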
It should be noted that the embodiments of the present application are not limited to the enhancement parameters, and the parameters may be adjusted for different specific tasks.
The loss function of the language model provided by the embodiment of the application is as follows:
Loss=L1+L2
Therefore, the loss function comprises two parts, namely L1 and L2, where L1 is the loss function of the BERT model's unsupervised training task and L2 is the loss function of the classification task.
L2 may use the cross-entropy loss function; the classification task includes binary classification and multi-class classification.
In the binary classification case, the language model only needs to predict one of two outcomes; for each sample the predicted probabilities of the two classes are p and 1 - p, and the expression is:
L2 = -(1/N) * Σ_i [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]
where N is the number of samples; y_i is the label of sample i, 1 for the positive class and 0 for the negative class; and p_i is the predicted probability that sample i is the positive class.
The multi-class case is an extension of the binary case:
L2 = -(1/N) * Σ_i Σ_c y_ic·log(p_ic),  c = 1, ..., M
where M is the number of classes; y_ic is an indicator (0 or 1) that equals 1 if the true class of sample i is c and 0 otherwise; and p_ic is the predicted probability that the observed sample i belongs to class c.
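As a hedged numeric sketch (not part of the patent), the combined loss can be computed as follows in PyTorch; the tensors, batch size and the value used for L1 are placeholders, and in a real run L1 would come from the BERT masked-language-modelling head.

    import torch
    import torch.nn.functional as F

    # Placeholder classification logits for a batch of 4 samples over 3 classes,
    # plus their gold labels; in practice these come from the fully connected layer.
    logits = torch.randn(4, 3)
    labels = torch.tensor([0, 2, 1, 2])

    # L2: cross-entropy loss of the classification task. F.cross_entropy implements
    # the multi-class expression above; with two classes it reduces to the binary form.
    l2 = F.cross_entropy(logits, labels)

    # L1: loss of the BERT unsupervised (masked language modelling) training task.
    # Here it is only a placeholder scalar.
    l1 = torch.tensor(0.35)

    loss = l1 + l2  # Loss = L1 + L2
    print(float(loss))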
The embodiment of the application also provides an application method of the language model, wherein the language model comprises a BERT model and a full connection layer for realizing specific tasks; the language model is trained by the training method in the above embodiment, and as shown in fig. 5, the application method includes:
step S100: acquiring an input text;
here, the input text is text that needs to implement a specific task, for example, comments that need to be subjected to text sentiment analysis.
Step S200: inputting the input text into a pre-training model in a language model;
as can be seen from the above description, the pre-training model is a BERT model, and the input of the BERT model is the sum of three vectors, so in the input step, the input text needs to be converted into an input vector meeting the requirements of the BERT model.
Step S300: acquiring a feature vector corresponding to a specific task from an output result of the pre-training model;
the output result of the BERT model comprises a classification output C and a text output T, wherein the classification output C is an output vector of [ CLS ] after passing through the BERT model; the text output T is an output vector of each Token in the input text after passing through the BERT model.
The specific tasks include a classification task, a question and answer task, and a named entity identification task, and for the specific tasks and the relationship between the specific tasks and the feature vectors, reference may be made to the above description, which is not repeated herein.
Step S400: the feature vectors are input into the fully-connected layer.
After the feature vector is input into the fully connected layer, the fully connected layer outputs the task result for the specific task, as sketched below.
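Putting steps S100 to S400 together, the application flow might look like the sketch below, which reuses the tokenizer and classifier assumed in the earlier sketches; the label names and the 512-token limit are illustrative.

    import torch

    LABELS = ["negative", "positive"]  # illustrative label names for sentiment analysis

    def classify(text, tokenizer, classifier):
        """S100-S400: take an input text, run it through BERT, classify via the FC layer."""
        # S100/S200: obtain the input text and convert it into BERT input vectors.
        encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            # S300/S400: the classifier extracts the [CLS] feature vector internally and
            # passes it through the fully connected layer to produce class logits.
            logits = classifier(**encoded)
        return LABELS[int(logits.argmax(dim=-1))]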
In the application method of the language model provided by the embodiment of the application, the first corpus borrows the word block training results trained on the second corpus to initialize the BERT model in the pre-training stage. With this design, the pre-training model obtains a good training effect without relying on a large training dataset, which alleviates the lack of pre-trained models for languages with little corpus data, makes it easier to build the language model, and allows specific tasks to be handled with only a small amount of corpus data.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that some of the embodiments of the present application have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, the embodiment of the present application further provides a device for generating a language model, corresponding to any of the above embodiments.
Referring to fig. 6, the generating means of the language model includes:
a first obtaining module 100 configured to obtain a first corpus, a second corpus, and word block training results;
a rank matching module 200 configured to perform word frequency rank matching on word blocks in the first corpus and the second corpus;
a matching mapping module 300 configured to map the word block training result to a first corpus according to the matching result;
an initialization module 400 configured to initialize word blocks in the first corpus with the mapped word block training results in a pre-training model;
a task training module 500 configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
The apparatus of the foregoing embodiment is used to implement the corresponding language model training method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, the embodiment of the present application further provides an application apparatus of a language model, corresponding to any of the above embodiments.
Referring to fig. 7, the apparatus for applying a language model includes:
a second obtaining module 600 configured to obtain an input text;
an input module 700 configured to input the input text into a pre-trained model of the language models;
a feature extraction module 800 configured to obtain a feature vector corresponding to a specific task from an output result of the pre-training model;
a specific task module 900 configured to input the feature vectors into a fully connected layer in the language model.
The language model is formed by training by adopting the training method in the embodiment.
The apparatus in the foregoing embodiment is used to implement the application method of the corresponding language model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functions of the modules may be implemented in the same or multiple software and/or hardware when implementing the embodiments of the present application.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the embodiments of the present application further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the method according to any of the above embodiments is implemented.
Fig. 8 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the embodiments of the present application further provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method according to any of the above-mentioned embodiments.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the method according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiment, and are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of embodiments of the application, including the claims, is limited to those examples; within the context of the embodiments of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the present application, it should be apparent to one skilled in the art that the present application embodiments may be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the embodiments of the present application have been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the embodiments of the present application.

Claims (15)

1. A method for training a language model, comprising:
acquiring a first corpus, a second corpus and word block training results;
performing word frequency ordering matching on word blocks in the first corpus and the second corpus;
mapping the word block training result to the first corpus according to a matching result;
in a pre-training model, initializing word blocks in the first corpus by adopting the mapped word block training results;
inputting the first corpus into the pre-training model, and then performing specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
2. The training method according to claim 1, wherein said performing word-frequency rank matching on the word blocks in the first corpus and the second corpus comprises:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
according to the word frequency statistical result, the first word block and the second word block are subjected to forward ordering;
and establishing a matching relation between the first word block and the second word block with the same sequence.
3. The training method of claim 2, wherein the mapping the word block training result to the first corpus according to the matching result comprises:
acquiring a word block vector of the second word block based on the word block training result;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
4. A training method as claimed in claim 1, 2 or 3, characterized in that the first corpus text dataset is smaller than the second corpus text dataset.
5. The training method according to claim 1, wherein the task-specific training performed after the first corpus is input into the pre-training model comprises:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting the characteristic vector corresponding to the specific task in the output result, and inputting the characteristic vector to a full connection layer.
6. A training method as recited in claim 5, wherein said inputting the first corpus into the pre-trained model comprises:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the second model;
wherein the data enhancement method comprises at least one of out-of-order, stretch, truncation, and MASK.
7. Training method according to claim 1,
the loss function of the language model is:
Loss=L1+L2;
wherein, L1 is a loss function of the BERT model unsupervised training task; L2 is a loss function for the classification task.
8. A method for applying a language model, wherein the language model is formed by training according to the training method of any one of claims 1 to 7; the application method comprises the following steps:
acquiring an input text;
inputting the input text into a pre-trained model of the language models;
acquiring a feature vector corresponding to a specific task from an output result of the pre-training model;
inputting the feature vectors into a fully connected layer in the language model.
9. The application method of claim 8, wherein the specific task comprises a classification task, and the feature vector is a classification vector corresponding to the CLS input in the output result.
10. The application method of claim 9, wherein the classification task comprises text emotion analysis and text semantic matching.
11. The application method of claim 8, wherein the specific tasks include a question and answer task and a named entity recognition task, and the feature vector is a part of the output result corresponding to a word block in the input text.
12. An apparatus for generating a language model, comprising:
the first acquisition module is configured to acquire a first corpus, a second corpus and word block training results;
the ordering matching module is configured to perform word frequency ordering matching on the word blocks in the first corpus and the second corpus;
a matching mapping module configured to map the word block training result to the first corpus according to a matching result;
an initialization module configured to initialize word blocks in the first corpus with the mapped word block training results in a pre-training model;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
13. An application apparatus of a language model, characterized in that, it is formed by training with the training method of any one of claims 1-7; the application device comprises:
a second obtaining module configured to obtain an input text;
an input module configured to input the input text into a pre-trained model of the language models;
the characteristic extraction module is configured to obtain a characteristic vector corresponding to a specific task from an output result of the pre-training model;
a specific task module configured to input the feature vectors into a fully connected layer in the language model.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 11 when executing the program.
15. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 11.
CN202110719988.7A 2021-06-28 2021-06-28 Training method and application method of language model Active CN113255328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110719988.7A CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110719988.7A CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Publications (2)

Publication Number Publication Date
CN113255328A true CN113255328A (en) 2021-08-13
CN113255328B CN113255328B (en) 2024-02-02

Family

ID=77189900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110719988.7A Active CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Country Status (1)

Country Link
CN (1) CN113255328B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656573A (en) * 2021-08-27 2021-11-16 北京大数医达科技有限公司 Text information generation method and device and terminal equipment
CN114881141A (en) * 2022-05-06 2022-08-09 中国人民解放军国防科技大学 Event type analysis method and related equipment
CN115329062A (en) * 2022-10-17 2022-11-11 中邮消费金融有限公司 Dialogue model training method under low-data scene and computer equipment
CN114817469B (en) * 2022-04-27 2023-08-08 马上消费金融股份有限公司 Text enhancement method, training method and training device for text enhancement model

Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
US20140200878A1 (en) * 2013-01-14 2014-07-17 Xerox Corporation Multi-domain machine translation model adaptation
KR20160058531A (en) * 2014-11-17 2016-05-25 포항공과대학교 산학협력단 Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
US20160306794A1 (en) * 2015-04-20 2016-10-20 Alibaba Group Holding Limited System and method for training a machine translation system
US20180157641A1 (en) * 2016-12-07 2018-06-07 International Business Machines Corporation Automatic Detection of Required Tools for a Task Described in Natural Language Content
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
CN109255117A (en) * 2017-07-13 2019-01-22 普天信息技术有限公司 Chinese word cutting method and device
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111695336A (en) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Disease name code matching method and device, computer equipment and storage medium
CN111737994A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
US20200334334A1 (en) * 2019-04-18 2020-10-22 Salesforce.Com, Inc. Systems and methods for unifying question answering and text classification via span extraction
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
WO2020242567A1 (en) * 2019-05-27 2020-12-03 Microsoft Technology Licensing, Llc Cross-lingual task training
WO2020253042A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Intelligent sentiment judgment method and device, and computer readable storage medium
CN112307181A (en) * 2020-10-28 2021-02-02 刘玲玲 Corpus-specific-corpus-based corpus extraction method and corpus extractor
CN112507101A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Method and device for establishing pre-training language model
US20210081832A1 (en) * 2019-09-12 2021-03-18 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for optimizing training set for text classification and storage medium
US20210089936A1 (en) * 2019-09-24 2021-03-25 International Business Machines Corporation Opinion snippet detection for aspect-based sentiment analysis
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
US20210142164A1 (en) * 2019-11-07 2021-05-13 Salesforce.Com, Inc. Multi-Task Knowledge Distillation for Language Model
CN112908315A (en) * 2021-03-10 2021-06-04 北京思图场景数据科技服务有限公司 Question-answer intention judgment method based on voice characteristics and voice recognition
CN112966712A (en) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and device, electronic equipment and computer readable medium
US20210182498A1 (en) * 2019-12-12 2021-06-17 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, electronic device and storage medium for processing a semantic representation model
CN112989828A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Training method, device, medium and electronic equipment for named entity recognition model

Patent Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
US20140200878A1 (en) * 2013-01-14 2014-07-17 Xerox Corporation Multi-domain machine translation model adaptation
KR20160058531A (en) * 2014-11-17 2016-05-25 포항공과대학교 산학협력단 Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
US20160306794A1 (en) * 2015-04-20 2016-10-20 Alibaba Group Holding Limited System and method for training a machine translation system
US20180157641A1 (en) * 2016-12-07 2018-06-07 International Business Machines Corporation Automatic Detection of Required Tools for a Task Described in Natural Language Content
CN109255117A (en) * 2017-07-13 2019-01-22 普天信息技术有限公司 Chinese word cutting method and device
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
US20200334334A1 (en) * 2019-04-18 2020-10-22 Salesforce.Com, Inc. Systems and methods for unifying question answering and text classification via span extraction
WO2020242567A1 (en) * 2019-05-27 2020-12-03 Microsoft Technology Licensing, Llc Cross-lingual task training
WO2020253042A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Intelligent sentiment judgment method and device, and computer readable storage medium
US20210081832A1 (en) * 2019-09-12 2021-03-18 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for optimizing training set for text classification and storage medium
US20210089936A1 (en) * 2019-09-24 2021-03-25 International Business Machines Corporation Opinion snippet detection for aspect-based sentiment analysis
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
US20210142164A1 (en) * 2019-11-07 2021-05-13 Salesforce.Com, Inc. Multi-Task Knowledge Distillation for Language Model
US20210182498A1 (en) * 2019-12-12 2021-06-17 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, electronic device and storage medium for processing a semantic representation model
CN112989828A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Training method, device, medium and electronic equipment for named entity recognition model
CN111695336A (en) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Disease name code matching method and device, computer equipment and storage medium
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111737994A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
CN112307181A (en) * 2020-10-28 2021-02-02 刘玲玲 Corpus-specific-corpus-based corpus extraction method and corpus extractor
CN112507101A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Method and device for establishing pre-training language model
CN112966712A (en) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and device, electronic equipment and computer readable medium
CN112908315A (en) * 2021-03-10 2021-06-04 北京思图场景数据科技服务有限公司 Question-answer intention judgment method based on voice characteristics and voice recognition

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JING SU, QINGYUN DAI ET AL.: "BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling", 《COMPUTER STANDARDS & INTERFACES》, vol. 67 *
LIN XIAOJUN, TIAN HAO ET AL.: "Corpus processing for language model training and the design of decoding dictionaries", Proceedings of the 8th National Conference on Man-Machine Speech Communication, pages 164-168 (in Chinese)
XIE TENG; YANG JUN'AN; LIU HUI: "Chinese entity recognition based on the BERT-BiLSTM-CRF model", Computer Systems & Applications, no. 07 (in Chinese)
WEI ZHONG; ZHOU JUN; SHI YUANBING; HUANG MINGHAO: "A Word2vec-based content situation awareness method", Communications Technology, no. 05 (in Chinese)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656573A (en) * 2021-08-27 2021-11-16 北京大数医达科技有限公司 Text information generation method and device and terminal equipment
CN113656573B (en) * 2021-08-27 2024-02-06 北京大数医达科技有限公司 Text information generation method, device and terminal equipment
CN114817469B (en) * 2022-04-27 2023-08-08 马上消费金融股份有限公司 Text enhancement method, training method and training device for text enhancement model
CN114881141A (en) * 2022-05-06 2022-08-09 中国人民解放军国防科技大学 Event type analysis method and related equipment
CN114881141B (en) * 2022-05-06 2024-09-27 中国人民解放军国防科技大学 Event type analysis method and related equipment
CN115329062A (en) * 2022-10-17 2022-11-11 中邮消费金融有限公司 Dialogue model training method under low-data scene and computer equipment

Also Published As

Publication number Publication date
CN113255328B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
EP3926531A1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
CN112632225B (en) Semantic searching method and device based on case and event knowledge graph and electronic equipment
CN113255328A (en) Language model training method and application method
CN114298121B (en) Multi-mode-based text generation method, model training method and device
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN110619050B (en) Intention recognition method and device
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN113239169A (en) Artificial intelligence-based answer generation method, device, equipment and storage medium
CN111507250B (en) Image recognition method, device and storage medium
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN110728147A (en) Model training method and named entity recognition method
CN110968725A (en) Image content description information generation method, electronic device, and storage medium
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN114860905A (en) Intention identification method, device and equipment
CN115510232A (en) Text sentence classification method and classification device, electronic equipment and storage medium
CN116127348A (en) Text label generation, model training, text classification method and related equipment
CN113535912B (en) Text association method and related equipment based on graph rolling network and attention mechanism
CN113435531A (en) Zero sample image classification method and system, electronic equipment and storage medium
CN116738956A (en) Prompt template generation method and device, computer equipment and storage medium
CN110889290A (en) Text encoding method and apparatus, text encoding validity checking method and apparatus
CN114881141B (en) Event type analysis method and related equipment
CN115408523A (en) Medium-length and long-text classification method and system based on abstract extraction and keyword extraction
CN114896973A (en) Text processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant