CN113255328A - Language model training method and application method - Google Patents
- Publication number
- CN113255328A (application CN202110719988.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- corpus
- model
- word
- word block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The embodiment of the present application provides a training method and an application method for a language model. The training method comprises the following steps: acquiring a first corpus, a second corpus and word block training results; performing word frequency ordering matching on the word blocks in the first corpus and the second corpus; mapping the word block training results to the first corpus according to the matching result; in a pre-training model, initializing the word blocks in the first corpus with the mapped word block training results; and inputting the first corpus into the pre-training model and then performing task-specific training to generate the language model. The pre-training model is a BERT model, and the word block training results are the word block vectors of the second corpus after training with the BERT model. In this training method, the first corpus reuses the word block training results already learned on the second corpus to initialize the BERT model, so the first corpus can achieve a good training effect without relying on a large data set and a language model can be built more easily.
Description
Technical Field
The embodiment of the application relates to the technical field of natural language processing, in particular to a training method and an application method of a language model.
Background
In the field of natural language processing (NLP), training a language model depends on a very large amount of corpus data, which is a significant limitation for languages where such data is scarce.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a method for training a language model and an application method thereof.
In a first aspect, an embodiment of the present application provides a method for training a language model, including:
acquiring a first corpus, a second corpus and word block training results;
performing word frequency ordering matching on word blocks in the first corpus and the second corpus;
mapping the word block training result to the first corpus according to a matching result;
in a pre-training model, initializing word blocks in the first corpus by adopting the mapped word block training results;
inputting the first corpus into the pre-training model, and then performing specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
In the training method of the language model provided by the embodiment of the present application, the first corpus reuses the word block training results obtained from the second corpus to initialize the BERT model, and task-specific training is then performed on the initialized BERT model with the first corpus to generate the language model. With this design, the first corpus can achieve a good training effect without relying on a large training data set, which alleviates the lack of language models for languages with little corpus data and makes a language model easier to build.
In one possible embodiment, the performing word frequency ordering matching on the word blocks in the first corpus and the second corpus includes:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
sorting the first word blocks and the second word blocks according to the word frequency statistics;
and establishing a matching relationship between first word blocks and second word blocks that have the same rank.
In a possible implementation, the mapping the word block training result to the first corpus according to the matching result includes:
acquiring a word block vector of the second word block based on the word block training result;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
In one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
In a possible implementation, the inputting the first corpus into the pre-training model for task-specific training includes:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting the feature vector corresponding to the specific task from the output result and inputting the feature vector into a fully connected layer.
In one possible embodiment, the inputting the first corpus into the pre-training model includes:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the pre-training model;
wherein the data enhancement method comprises at least one of reordering, stretching, truncation, and MASK replacement.
In one possible embodiment, the loss function of the language model is:
Loss=L1+L2
wherein L1 is the loss function of the BERT model's unsupervised training task, and L2 is the loss function of the classification task.
In a second aspect, an embodiment of the present application provides an application method of a language model, where the language model is trained by using the training method in the embodiment of the first aspect;
the application method comprises the following steps:
acquiring an input text;
inputting the input text into the pre-training model of the language model;
acquiring a feature vector corresponding to a specific task from an output result of the pre-training model;
inputting the feature vectors into a fully connected layer in the language model.
In a possible implementation manner, the specific task includes a classification task, and the feature vector is the classification vector in the output result corresponding to the [CLS] input.
In one possible implementation, the classification task includes text sentiment analysis and text semantic matching.
In a possible implementation manner, the specific tasks include a question and answer task and a named entity recognition task, and the feature vector is a part of the output result corresponding to a word block in the input text.
In a third aspect, an embodiment of the present application provides an apparatus for generating a language model, including:
the first acquisition module is configured to acquire a first corpus, a second corpus and word block training results;
the ordering matching module is configured to perform word frequency ordering matching on the word blocks in the first corpus and the second corpus;
a matching mapping module configured to map the word block training result to the first corpus according to a matching result;
an initialization module configured to initialize word blocks in the first corpus with the mapped word block training results in a pre-training model;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
In a fourth aspect, an embodiment of the present application provides an application apparatus of a language model, where the language model is trained by using the training method in the embodiment of the first aspect; the application device comprises:
a second obtaining module configured to obtain an input text;
an input module configured to input the input text into the pre-training model of the language model;
a feature extraction module configured to obtain a feature vector corresponding to a specific task from an output result of the pre-training model (BERT model);
a specific task module configured to input the feature vector into a fully-connected layer.
In a fifth aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the program to implement the method according to any one of the first and second aspects.
In a sixth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the first and second aspects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only examples of the embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a first flowchart of a method for training a language model according to an embodiment of the present disclosure;
fig. 2 is an example of word frequency statistics of English and Japanese vocabularies provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a language model according to an embodiment of the present application;
fig. 4 is a second flowchart of a method for training a language model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for applying a language model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a device for generating a language model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an application apparatus of a language model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application are further described in detail below with reference to the accompanying drawings.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have the general meaning understood by those having ordinary skill in the art to which the embodiments of the present application belong, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application does not denote any order, quantity, or importance; such terms are only used to distinguish one element from another. The word "comprising" or "comprises" and the like means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described changes, the relative positional relationships may change accordingly.
In the field of natural language processing (NLP), training a language model requires a very large amount of corpus data and abundant computing resources. For some languages, there is not enough corpus data to support the training task of a language model, and a language model cannot be constructed.
In view of this, an embodiment of the present application provides a method for training a language model, as shown in fig. 1, which includes:
step S10: acquiring a first corpus, a second corpus and word block training results;
the first corpus is a corpus that needs to generate a language model, and is usually a corpus with a small corpus data size such as a small language. The second corpus is usually a general corpus, which is a training corpus used for training the language model in the NLP domain, and is a corpus with a large corpus data volume, such as an enwiki data set. That is, in one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
The word block (Token) training result is a word block vector (Token Embedding) of the second corpus after being trained by the BERT model. It should be noted that the term block in this document refers to a basic semantic unit for composing a text, that is, Token, and may be a word or a word, and the division of the term block Token may be different according to different languages and different granularities.
BERT is a pre-training model for natural language processing, which is called Bidirectional Encoder reproduction from Transformers, namely the Encoder of Bidirectional Transformer, and the main model structure is formed by stacking the encoders of the Transformer. BERT is divided into BERT-base and BERT-large, in the BERT-base, 12 layers of transform encoders are adopted, and the parameter amount is 1.1 hundred million; in BERT-large, a 24-layer transform Encoder was used with a parameter of 3.4 hundred million. The BERT model needs a great amount of corpus data in the pre-training process, the first corpus cannot meet the requirement of the BERT model on the size of the corpus, even though the BERT-base is adopted, more small languages cannot meet the requirement of the BERT model on the size of the corpus, and therefore the first corpus cannot directly use the BERT model.
Step S20: performing word frequency ordering matching on word blocks in the first corpus and the second corpus;
Herein, a word block in the first corpus is called a first word block, and a word block in the second corpus is called a second word block; the first corpus contains a plurality of first word blocks, and the second corpus contains a plurality of second word blocks. Word frequency ordering means counting the occurrence frequency of the first word blocks and the second word blocks and sorting them by frequency; for example, fig. 2 shows word frequency statistics for English and Japanese vocabularies provided by an embodiment of the present application.
Word frequency ordering matching means that a first word block and a second word block that occupy the same position in the word frequency orderings of the first corpus and the second corpus, respectively, are matched with each other.
In one possible implementation, step S20 includes:
performing word frequency statistics on a first word block in a first corpus and a second word block in a second corpus;
sorting the first word blocks and the second word blocks according to the word frequency statistics;
and establishing a matching relationship between first word blocks and second word blocks that have the same rank.
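For illustration only, the following Python sketch (not code from the patent; the function name rank_match and the toy inputs are assumptions) counts word block frequencies in both corpora, sorts the blocks by frequency, and pairs blocks that share the same rank:

```python
from collections import Counter

def rank_match(first_blocks, second_blocks):
    """Map each first-corpus word block to the second-corpus word block
    that occupies the same word-frequency rank."""
    first_ranked = [tok for tok, _ in Counter(first_blocks).most_common()]
    second_ranked = [tok for tok, _ in Counter(second_blocks).most_common()]
    n = min(len(first_ranked), len(second_ranked))  # only shared ranks can match
    return {first_ranked[i]: second_ranked[i] for i in range(n)}

# Toy usage: the most frequent first-corpus block pairs with the most
# frequent second-corpus block, and so on down the two orderings.
matching = rank_match(["a", "b", "a", "c"], ["the", "of", "the", "and"])
```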
Step S30: mapping the word block training result to a first corpus according to the matching result;
In general, the word block distributions of the first corpus and the second corpus can be assumed to follow similar rules; that is, the common word blocks of different languages behave similarly. Based on this, the word block training results can be mapped to the first corpus according to the word frequency ordering matching result of step S20: a word block vector is obtained for each second word block from the word block training results, and a mapping relationship is established between each matched first word block and that word block vector.
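Continuing the sketch above under the same assumptions (the helper name map_vectors is hypothetical), the mapping of step S30 reduces to a lookup from each first word block to the trained vector of its matched second word block:

```python
import numpy as np

def map_vectors(matching, second_vectors):
    """matching: first block -> second block (e.g. from rank_match above).
    second_vectors: second block -> trained Token Embedding vector."""
    return {
        first_blk: second_vectors[second_blk]
        for first_blk, second_blk in matching.items()
        if second_blk in second_vectors
    }

# Toy usage with 4-dimensional dummy vectors in place of real BERT embeddings.
second_vectors = {"the": np.ones(4), "of": np.zeros(4), "and": np.full(4, 0.5)}
mapped = map_vectors({"a": "the", "b": "of", "c": "and"}, second_vectors)
```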
Step S40: in the pre-training model, initializing word blocks in the first corpus by adopting mapped word block training results;
The pre-training model is a BERT model. The word block training results are the word block vectors of the second corpus after training with the BERT model, and these vector values converge easily. The word block training results are mapped to the first corpus according to the word frequency ordering matching result; even if the text semantics of a matched first word block and second word block differ, initializing the BERT model with the mapped word block training results gives the first word blocks a far better starting point than random initialization.
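A minimal sketch of step S40, assuming the PyTorch and Hugging Face transformers libraries (the patent does not prescribe a framework): the mapped vectors are copied into the token-embedding matrix of a freshly constructed BERT model, while unmatched rows keep their random initialization.

```python
import torch
from transformers import BertConfig, BertModel

config = BertConfig(vocab_size=30000, hidden_size=768)  # first-corpus vocabulary size (assumed)
model = BertModel(config)                               # randomly initialized BERT

def init_embeddings(model, vocab, mapped_vectors):
    """vocab: first word block -> row index in the embedding matrix.
    mapped_vectors: first word block -> mapped vector from step S30."""
    emb = model.get_input_embeddings().weight           # shape (vocab_size, hidden_size)
    with torch.no_grad():
        for block, vec in mapped_vectors.items():
            if block in vocab:
                emb[vocab[block]] = torch.as_tensor(vec, dtype=emb.dtype)
```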
Step S50: inputting the first corpus into a pre-training model, and then performing specific task training to generate a language model;
fig. 3 is a schematic structural diagram of a language model provided in an embodiment of the present application, and as shown in fig. 3, the language model includes a BERT model and a fully-connected layer for implementing a specific task, where the fully-connected layer may be one layer or more than one layer.
After the BERT model has been initialized with the first corpus, task-specific training is still required to generate a usable language model for the target task.
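To illustrate the structure of fig. 3 for the classification case, the sketch below stacks a fully connected layer on top of a BERT encoder; the class name BertClassifier and the layer sizes are assumptions rather than part of the patent:

```python
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, bert, num_classes, hidden_size=768):
        super().__init__()
        self.bert = bert                       # BERT encoder initialized as in step S40
        self.fc = nn.Linear(hidden_size, num_classes)  # fully connected layer

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # classification output C ([CLS] position)
        return self.fc(cls_vec)                # logits for the specific task
```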
In the training method of the language model provided by the embodiment of the application, the first corpus borrows the word block training result trained by the second corpus, initializes the BERT model, and performs task training on the initialized BERT model by adopting the first corpus to generate the language model. By the design, the first corpus can obtain a better training effect without depending on a large number of training data sets, the problem that languages with less corpus lack language models is solved, and the language models can be formed more easily.
In one possible implementation, as shown in fig. 4, step S50 includes:
step S51: inputting a first corpus into a pre-training model;
The pre-training model is a BERT model. The input of the BERT model is the sum of three vectors: the word block vector (Token Embedding), the text vector (Segment Embedding), and the position vector (Position Embedding). The Token Embedding represents the embedding of the current word block; the Segment Embedding is the index embedding of the sentence to which the current word block belongs; and the Position Embedding is the index embedding of the position of the current word block.
The BERT model also uses several special tokens: [CLS], [SEP], and [MASK]. [CLS] is placed at the beginning of the first sentence; its value is learned automatically during BERT training and captures the global semantic information of the text, and the classification output C it produces through the BERT model can be used for subsequent classification tasks. [SEP] separates two input sentences: for example, for input sentences A and B, [SEP] is inserted after sentence A and before sentence B. [MASK] masks some Tokens in a sentence; after a Token is masked with [MASK], the [MASK] vector output by the BERT model is used to predict that Token.
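The following sketch illustrates the standard BERT input construction described above, i.e., summing the three embeddings for a "[CLS] ... [SEP] ... [SEP]" sequence; the vocabulary size, token ids, and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)    # Token Embedding
segment_emb = nn.Embedding(2, hidden)           # Segment Embedding (sentence A / B)
position_emb = nn.Embedding(max_len, hidden)    # Position Embedding

# Example ids for "[CLS] w1 w2 [SEP] w3 [SEP]"; 101/102 are the usual
# [CLS]/[SEP] ids in BERT vocabularies (an assumption about the vocabulary).
input_ids = torch.tensor([[101, 7, 8, 102, 9, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

bert_input = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
```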
Step S52: obtaining the output result of the first corpus from the pre-training model;
the output result of the BERT model comprises a classification output C and a text output T, wherein the classification output C is an output vector of [ CLS ] after passing through the BERT model; the text output T is an output vector of each Token in the input text after passing through the BERT model.
Step S53: extracting the feature vector corresponding to the specific task from the output result and inputting the feature vector into the fully connected layer.
There are various tasks in NLP, such as the classification task, the question answering (QA) task, and named entity recognition (NER).
The classification task classifies the input text according to set conditions; in the BERT model, it includes single-sentence classification and sentence-pair classification. Single-sentence classification takes a single sentence as input and is used for text sentiment analysis (like or dislike) of texts such as movie reviews or product reviews. Sentence-pair classification takes a sentence pair as input and can be used for semantic matching of texts, for example judging whether the meanings of the two sentences are the same, judging their similarity, or judging the relation between them (entailment, neutral, or contradiction).
In the classification task, the feature vector is the classification output C in the output result of the BERT model, i.e., the classification vector output for [CLS] after passing through the BERT model; the classification task is realized by inputting this classification vector into the fully connected layer.
The question answering task gives a question together with a paragraph and then marks the specific position of the answer within the paragraph. For such tasks, all of the text output T (one vector per Token) in the output result is input into the fully connected layer, which then outputs the start position and the end position of the answer.
Named entity recognition labels the word blocks of the input text to determine whether each belongs to a person name (Person), a place name (Location), an organization name (Organization), miscellaneous (Miscellaneous), or other (Other). In this task, the text output T is input into the fully connected layer.
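For the question answering and named entity recognition cases, a token-level head can be sketched as follows (the layer names and label counts are assumptions); every per-Token text output T is passed through the fully connected layer:

```python
import torch.nn as nn

hidden = 768
qa_head = nn.Linear(hidden, 2)    # start / end logits per Token (question answering)
ner_head = nn.Linear(hidden, 5)   # Person / Location / Organization / Miscellaneous / Other

def token_level_logits(text_output, head):
    """text_output: tensor of shape (batch, seq_len, hidden) -- the per-Token outputs T."""
    return head(text_output)      # shape (batch, seq_len, num_labels)
```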
Because the first corpus is a small data set with insufficient data, while BERT performs unsupervised training over long texts, the usage scenarios do not match completely. For example, in an emotion classification task most of the first corpus consists of short texts and the data set is small, so when positions in most of these texts are masked, insufficient training easily occurs. In addition, with so little text, overfitting is likely. Therefore, the first corpus is augmented; the data enhancement method comprises at least one of reordering, stretching, truncation, and MASK replacement, and the first corpus after data enhancement increases the generalization of the language model.
Illustratively, the enhancement method for the emotion classification task is as follows:
1) Denote the first corpus by t. Back up (duplicate) t, doubling the total amount of data, to obtain an amplified corpus T;
2) randomly stretch 20% of the texts in the amplified corpus T to a random length between 400 and 512; the stretching can be done by repeating the text itself several times.
The maximum stretched length here is 512 because the maximum input length allowed by the BERT-base model is 512; the maximum stretched length can be adjusted according to the BERT model used. For example, in an application scenario using the BERT-large model it can be increased to the maximum input length allowed by that model.
3) In the data that the amplified corpus T adds relative to the first corpus t (i.e., T - t), randomly reverse the word order of 40% of the texts within a maximum distance of 2, e.g., abcde -> adcbe;
4) following BERT's preprocessing, in the data obtained from steps 1) to 3), of the 20% of text selected for masking, replace 80% with [MASK], replace 10% with random characters, and leave 10% unchanged.
It should be noted that the embodiments of the present application are not limited to these enhancement parameters; the parameters may be adjusted for different specific tasks.
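A possible implementation of enhancement steps 1) to 4) with the stated percentages might look like the sketch below; the function name augment, the whitespace tokenization, and the simplification of masking a single token per selected text are assumptions made purely for illustration:

```python
import random

def augment(texts, max_len=512, mask_token="[MASK]", vocab=None):
    amplified = list(texts) + list(texts)                # step 1: duplicate (t -> T)
    for i, text in enumerate(amplified):
        toks = text.split()
        if not toks:
            continue
        if random.random() < 0.2:                        # step 2: stretch 20% of texts
            target = random.randint(400, max_len)
            while len(toks) < target:
                toks += text.split()
            toks = toks[:target]
        if i >= len(texts) and len(toks) > 2 and random.random() < 0.4:
            j = random.randrange(len(toks) - 2)          # step 3: swap within distance 2
            toks[j], toks[j + 2] = toks[j + 2], toks[j]  # only in the added data T - t
        if random.random() < 0.2:                        # step 4: BERT-style 80/10/10
            k = random.randrange(len(toks))
            r = random.random()
            if r < 0.8:
                toks[k] = mask_token
            elif r < 0.9 and vocab:
                toks[k] = random.choice(vocab)           # else: leave unchanged
        amplified[i] = " ".join(toks)
    return amplified
```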
The loss function of the language model provided by the embodiment of the application is as follows:
Loss=L1+L2
The loss function therefore comprises two parts, L1 and L2, where L1 is the loss function of the BERT model's unsupervised training task and L2 is the loss function of the classification task.
L2 may use the cross-entropy loss function; the classification task includes binary classification and multi-class classification.
In the binary classification case, the language model only needs to predict two outcomes; for each sample the predicted probabilities of the two classes are p and 1 - p, and the expression is:
L2 = -(1/N) Σ_i [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]
where y_i is the label of sample i, with 1 for the positive class and 0 for the negative class;
p_i is the predicted probability that sample i belongs to the positive class.
The multi-class case is in fact an extension of the binary case:
L2 = -(1/N) Σ_i Σ_{c=1}^{M} y_ic · log(p_ic)
where M is the number of categories;
y_ic is an indicator (0 or 1) that equals 1 if the true class of sample i is c and 0 otherwise;
p_ic is the predicted probability that the observed sample i belongs to class c.
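Under the assumption that the model is implemented in PyTorch, the combined loss Loss = L1 + L2 could be computed as in the sketch below, where L1 is the masked-language-model cross-entropy of the unsupervised task and L2 the classification cross-entropy:

```python
import torch
import torch.nn.functional as F

def combined_loss(mlm_logits, mlm_labels, cls_logits, cls_labels):
    """mlm_logits: (batch, seq_len, vocab); mlm_labels: (batch, seq_len) with
    -100 at unmasked positions; cls_logits: (batch, num_classes)."""
    l1 = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                         mlm_labels.view(-1), ignore_index=-100)   # L1: unsupervised task
    l2 = F.cross_entropy(cls_logits, cls_labels)                    # L2: classification task
    return l1 + l2                                                   # Loss = L1 + L2
```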
The embodiment of the present application also provides an application method of the language model, where the language model comprises a BERT model and a fully connected layer for implementing a specific task. The language model is trained with the training method of the above embodiment; as shown in fig. 5, the application method comprises:
step S100: acquiring an input text;
Here, the input text is the text on which a specific task needs to be performed, for example a comment that requires text sentiment analysis.
Step S200: inputting the input text into the pre-training model of the language model;
As described above, the pre-training model is a BERT model whose input is the sum of three vectors, so in this input step the input text needs to be converted into an input vector that meets the requirements of the BERT model.
Step S300: acquiring a feature vector corresponding to a specific task from an output result of the pre-training model;
the output result of the BERT model comprises a classification output C and a text output T, wherein the classification output C is an output vector of [ CLS ] after passing through the BERT model; the text output T is an output vector of each Token in the input text after passing through the BERT model.
The specific tasks include a classification task, a question and answer task, and a named entity identification task, and for the specific tasks and the relationship between the specific tasks and the feature vectors, reference may be made to the above description, which is not repeated herein.
Step S400: the feature vectors are input into the fully-connected layer.
After the feature vector is input into the fully connected layer, the fully connected layer outputs the task result of the specific task.
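An end-to-end inference sketch for the classification case is given below; the tokenizer, the checkpoint name, and the freshly constructed fully connected layer are assumptions for illustration (in practice the weights would come from the training procedure of fig. 4):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

# Assumed checkpoint; the patent does not prescribe a specific tokenizer or model name.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
bert = BertModel.from_pretrained("bert-base-multilingual-cased")
fc = nn.Linear(bert.config.hidden_size, 2)       # fully connected layer of the language model

text = "..."                                      # step S100: comment to be analysed
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    out = bert(input_ids=enc["input_ids"],        # step S200: input to the pre-training model
               attention_mask=enc["attention_mask"])
    cls_vec = out.last_hidden_state[:, 0]         # step S300: feature vector ([CLS] output C)
    logits = fc(cls_vec)                          # step S400: fully connected layer
prediction = logits.argmax(dim=-1).item()         # e.g. 0 = negative, 1 = positive
```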
In the application method of the language model provided by the embodiment of the present application, the pre-training model of the language model is initialized with word block training results that the first corpus borrows from the second corpus. With this design, the pre-training model can achieve a good training effect without relying on a large training data set, which alleviates the lack of pre-training models for languages with little corpus data, makes a language model easier to build, and allows specific tasks to be handled with only a small amount of corpus data.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that some of the embodiments of the present application have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, the embodiment of the present application further provides a device for generating a language model, corresponding to any of the above embodiments.
Referring to fig. 6, the generating means of the language model includes:
a first obtaining module 100 configured to obtain a first corpus, a second corpus, and word block training results;
a rank matching module 200 configured to perform word frequency rank matching on word blocks in the first corpus and the second corpus;
a matching mapping module 300 configured to map the word block training result to a first corpus according to the matching result;
an initialization module 400 configured to initialize word blocks in the first corpus with the mapped word block training results in a pre-training model;
a task training module 500 configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
The apparatus of the foregoing embodiment is used to implement the corresponding language model training method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, the embodiment of the present application further provides an application apparatus of a language model, corresponding to any of the above embodiments.
Referring to fig. 7, the apparatus for applying a language model includes:
a second obtaining module 600 configured to obtain an input text;
an input module 700 configured to input the input text into a pre-trained model of the language models;
a feature extraction module 800 configured to obtain a feature vector corresponding to a specific task from an output result of the pre-training model;
a specific task module 900 configured to input the feature vectors into a fully connected layer in the language model.
The language model is formed by training by adopting the training method in the embodiment.
The apparatus in the foregoing embodiment is used to implement the application method of the corresponding language model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functions of the modules may be implemented in the same or multiple software and/or hardware when implementing the embodiments of the present application.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the embodiments of the present application further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the method according to any of the above embodiments is implemented.
Fig. 8 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the embodiments of the present application further provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method according to any of the above-mentioned embodiments.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the method according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiment, and are not described herein again.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the embodiments of the application, including the claims, is limited to those examples. Within the context of the embodiments of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments as described above which, for the sake of brevity, are not provided in detail.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the present application, it should be apparent to one skilled in the art that the present application embodiments may be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the embodiments of the present application have been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the embodiments of the present application.
Claims (15)
1. A method for training a language model, comprising:
acquiring a first corpus, a second corpus and word block training results;
performing word frequency ordering matching on word blocks in the first corpus and the second corpus;
mapping the word block training result to the first corpus according to a matching result;
in a pre-training model, initializing word blocks in the first corpus by adopting the mapped word block training results;
inputting the first corpus into the pre-training model, and then performing specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
2. The training method according to claim 1, wherein said performing word-frequency rank matching on the word blocks in the first corpus and the second corpus comprises:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
according to the word frequency statistical result, the first word block and the second word block are subjected to forward ordering;
and establishing a matching relation between the first word block and the second word block with the same sequence.
3. The training method of claim 2, wherein the mapping the word block training result to the first corpus according to the matching result comprises:
acquiring a word block vector of the second word block based on the word block training result;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
4. A training method as claimed in claim 1, 2 or 3, characterized in that the first corpus text dataset is smaller than the second corpus text dataset.
5. The training method according to claim 1, wherein the task-specific training performed after the first corpus is input into the pre-training model comprises:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting the characteristic vector corresponding to the specific task in the output result, and inputting the characteristic vector to a full connection layer.
6. A training method as recited in claim 5, wherein said inputting the first corpus into the pre-trained model comprises:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the second model;
wherein the data enhancement method comprises at least one of out-of-order, stretch, truncation, and MASK.
7. Training method according to claim 1,
the loss function of the language model is:
Loss=L1+L2;
wherein, L1 is a loss function of the BERT model unsupervised training task; l2 is a loss function for the classification task.
8. A method for applying a language model, wherein the language model is formed by training according to the training method of any one of claims 1 to 7; the application method comprises the following steps:
acquiring an input text;
inputting the input text into a pre-trained model of the language models;
acquiring a feature vector corresponding to a specific task from an output result of the pre-training model;
inputting the feature vectors into a fully connected layer in the language model.
9. The application method of claim 8, wherein the specific task comprises a classification task, and the feature vector is a classification vector corresponding to the CLS input in the output result.
10. The application method of claim 9, wherein the classification task comprises text emotion analysis and text semantic matching.
11. The application method of claim 8, wherein the specific tasks include a question and answer task and a named entity recognition task, and the feature vector is a part of the output result corresponding to a word block in the input text.
12. An apparatus for generating a language model, comprising:
the first acquisition module is configured to acquire a first corpus, a second corpus and word block training results;
the ordering matching module is configured to perform word frequency ordering matching on the word blocks in the first corpus and the second corpus;
a matching mapping module configured to map the word block training result to the first corpus according to a matching result;
an initialization module configured to initialize word blocks in the first corpus with the mapped word block training results in a pre-training model;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus after being trained by the BERT model.
13. An application apparatus of a language model, characterized in that, it is formed by training with the training method of any one of claims 1-7; the application device comprises:
a second obtaining module configured to obtain an input text;
an input module configured to input the input text into a pre-trained model of the language models;
the characteristic extraction module is configured to obtain a characteristic vector corresponding to a specific task from an output result of the pre-training model;
a specific task module configured to input the feature vectors into a fully connected layer in the language model.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 11 when executing the program.
15. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110719988.7A CN113255328B (en) | 2021-06-28 | 2021-06-28 | Training method and application method of language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110719988.7A CN113255328B (en) | 2021-06-28 | 2021-06-28 | Training method and application method of language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113255328A true CN113255328A (en) | 2021-08-13 |
CN113255328B CN113255328B (en) | 2024-02-02 |
Family
ID=77189900
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110719988.7A Active CN113255328B (en) | 2021-06-28 | 2021-06-28 | Training method and application method of language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113255328B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113656573A (en) * | 2021-08-27 | 2021-11-16 | 北京大数医达科技有限公司 | Text information generation method and device and terminal equipment |
CN114881141A (en) * | 2022-05-06 | 2022-08-09 | 中国人民解放军国防科技大学 | Event type analysis method and related equipment |
CN115329062A (en) * | 2022-10-17 | 2022-11-11 | 中邮消费金融有限公司 | Dialogue model training method under low-data scene and computer equipment |
CN114817469B (en) * | 2022-04-27 | 2023-08-08 | 马上消费金融股份有限公司 | Text enhancement method, training method and training device for text enhancement model |
Patent Citations (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009140499A (en) * | 2007-12-07 | 2009-06-25 | Toshiba Corp | Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language |
US20140200878A1 (en) * | 2013-01-14 | 2014-07-17 | Xerox Corporation | Multi-domain machine translation model adaptation |
KR20160058531A (en) * | 2014-11-17 | 2016-05-25 | 포항공과대학교 산학협력단 | Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method |
US20160306794A1 (en) * | 2015-04-20 | 2016-10-20 | Alibaba Group Holding Limited | System and method for training a machine translation system |
US20180157641A1 (en) * | 2016-12-07 | 2018-06-07 | International Business Machines Corporation | Automatic Detection of Required Tools for a Task Described in Natural Language Content |
CN109255117A (en) * | 2017-07-13 | 2019-01-22 | 普天信息技术有限公司 | Chinese word cutting method and device |
CN109190110A (en) * | 2018-08-02 | 2019-01-11 | 厦门快商通信息技术有限公司 | Training method and system for a named entity extraction model, and electronic device |
US20200334334A1 (en) * | 2019-04-18 | 2020-10-22 | Salesforce.Com, Inc. | Systems and methods for unifying question answering and text classification via span extraction |
WO2020242567A1 (en) * | 2019-05-27 | 2020-12-03 | Microsoft Technology Licensing, Llc | Cross-lingual task training |
WO2020253042A1 (en) * | 2019-06-18 | 2020-12-24 | 平安科技(深圳)有限公司 | Intelligent sentiment judgment method and device, and computer readable storage medium |
US20210081832A1 (en) * | 2019-09-12 | 2021-03-18 | Beijing Xiaomi Intelligent Technology Co., Ltd. | Method and device for optimizing training set for text classification and storage medium |
US20210089936A1 (en) * | 2019-09-24 | 2021-03-25 | International Business Machines Corporation | Opinion snippet detection for aspect-based sentiment analysis |
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
US20210142164A1 (en) * | 2019-11-07 | 2021-05-13 | Salesforce.Com, Inc. | Multi-Task Knowledge Distillation for Language Model |
US20210182498A1 (en) * | 2019-12-12 | 2021-06-17 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, apparatus, electronic device and storage medium for processing a semantic representation model |
CN112989828A (en) * | 2019-12-17 | 2021-06-18 | 医渡云(北京)技术有限公司 | Training method, device, medium and electronic equipment for named entity recognition model |
CN111695336A (en) * | 2020-04-26 | 2020-09-22 | 平安科技(深圳)有限公司 | Disease name code matching method and device, computer equipment and storage medium |
CN111539223A (en) * | 2020-05-29 | 2020-08-14 | 北京百度网讯科技有限公司 | Language model training method and device, electronic equipment and readable storage medium |
CN111737994A (en) * | 2020-05-29 | 2020-10-02 | 北京百度网讯科技有限公司 | Method, device and equipment for obtaining word vector based on language model and storage medium |
CN112000805A (en) * | 2020-08-24 | 2020-11-27 | 平安国际智慧城市科技股份有限公司 | Text matching method, device, terminal and storage medium based on pre-training model |
CN112307181A (en) * | 2020-10-28 | 2021-02-02 | 刘玲玲 | Corpus extraction method and corpus extractor based on a specific corpus |
CN112507101A (en) * | 2020-12-18 | 2021-03-16 | 北京百度网讯科技有限公司 | Method and device for establishing pre-training language model |
CN112966712A (en) * | 2021-02-01 | 2021-06-15 | 北京三快在线科技有限公司 | Language model training method and device, electronic equipment and computer readable medium |
CN112908315A (en) * | 2021-03-10 | 2021-06-04 | 北京思图场景数据科技服务有限公司 | Question-answer intention judgment method based on voice characteristics and voice recognition |
Non-Patent Citations (4)
Title |
---|
Jing Su, Qingyun Dai, et al.: "BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling", Computer Standards & Interfaces, vol. 67 *
Lin Xiaojun, Tian Hao, et al.: "Corpus processing methods and decoding dictionary design for language model training", Proceedings of the 8th National Conference on Man-Machine Speech Communication, pages 164-168 *
Xie Teng, Yang Jun'an, Liu Hui: "Chinese entity recognition based on the BERT-BiLSTM-CRF model", Computer Systems & Applications, no. 07 *
Wei Zhong, Zhou Jun, Shi Yuanbing, Huang Minghao: "A Word2vec-based content situation awareness method", Communications Technology, no. 05 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113656573A (en) * | 2021-08-27 | 2021-11-16 | 北京大数医达科技有限公司 | Text information generation method and device and terminal equipment |
CN113656573B (en) * | 2021-08-27 | 2024-02-06 | 北京大数医达科技有限公司 | Text information generation method, device and terminal equipment |
CN114817469B (en) * | 2022-04-27 | 2023-08-08 | 马上消费金融股份有限公司 | Text enhancement method, training method and training device for text enhancement model |
CN114881141A (en) * | 2022-05-06 | 2022-08-09 | 中国人民解放军国防科技大学 | Event type analysis method and related equipment |
CN114881141B (en) * | 2022-05-06 | 2024-09-27 | 中国人民解放军国防科技大学 | Event type analysis method and related equipment |
CN115329062A (en) * | 2022-10-17 | 2022-11-11 | 中邮消费金融有限公司 | Dialogue model training method under low-data scene and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113255328B (en) | 2024-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3926531A1 (en) | Method and system for visio-linguistic understanding using contextual language model reasoners | |
CN112632225B (en) | Semantic searching method and device based on case and event knowledge graph and electronic equipment | |
CN113255328A (en) | Language model training method and application method | |
CN114298121B (en) | Multi-mode-based text generation method, model training method and device | |
CN112632226B (en) | Semantic search method and device based on legal knowledge graph and electronic equipment | |
CN111539197A (en) | Text matching method and device, computer system and readable storage medium | |
CN110619050B (en) | Intention recognition method and device | |
US20200364216A1 (en) | Method, apparatus and storage medium for updating model parameter | |
CN113239169A (en) | Artificial intelligence-based answer generation method, device, equipment and storage medium | |
CN111507250B (en) | Image recognition method, device and storage medium | |
CN116304748B (en) | Text similarity calculation method, system, equipment and medium | |
CN112581327B (en) | Knowledge graph-based law recommendation method and device and electronic equipment | |
CN110728147A (en) | Model training method and named entity recognition method | |
CN110968725A (en) | Image content description information generation method, electronic device, and storage medium | |
CN116821781A (en) | Classification model training method, text analysis method and related equipment | |
CN114860905A (en) | Intention identification method, device and equipment | |
CN115510232A (en) | Text sentence classification method and classification device, electronic equipment and storage medium | |
CN116127348A (en) | Text label generation, model training, text classification method and related equipment | |
CN113535912B (en) | Text association method and related equipment based on graph rolling network and attention mechanism | |
CN113435531A (en) | Zero sample image classification method and system, electronic equipment and storage medium | |
CN116738956A (en) | Prompt template generation method and device, computer equipment and storage medium | |
CN110889290A (en) | Text encoding method and apparatus, text encoding validity checking method and apparatus | |
CN114881141B (en) | Event type analysis method and related equipment | |
CN115408523A (en) | Medium-length and long-text classification method and system based on abstract extraction and keyword extraction | |
CN114896973A (en) | Text processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||