CN113255328B - Training method and application method of language model - Google Patents

Training method and application method of language model

Info

Publication number
CN113255328B
CN113255328B
Authority
CN
China
Prior art keywords
corpus
training
word
model
word block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110719988.7A
Other languages
Chinese (zh)
Other versions
CN113255328A (en)
Inventor
冀潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Technology Development Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202110719988.7A priority Critical patent/CN113255328B/en
Publication of CN113255328A publication Critical patent/CN113255328A/en
Application granted granted Critical
Publication of CN113255328B publication Critical patent/CN113255328B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The embodiment of the application provides a training method and an application method of a language model. The training method comprises the following steps: acquiring a first corpus, a second corpus and word block training results; performing word frequency ordering matching on the word blocks in the first corpus and the second corpus; mapping the word block training results to the first corpus according to the matching result; in the pre-training model, initializing the word blocks in the first corpus with the mapped word block training results; and inputting the first corpus into the pre-training model and then performing specific task training to generate the language model. The pre-training model is a BERT model, and the word block training results are the word block vectors of the second corpus trained by the BERT model. In this training method, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, so that the first corpus can achieve a good training effect without depending on a large dataset, and the language model can be formed more easily.

Description

Training method and application method of language model
Technical Field
The embodiment of the application relates to the technical field of natural language processing, in particular to a training method and an application method of a language model.
Background
In the field of computer natural language processing (Natural Language Processing, abbreviated NLP), training a language model requires extremely large amounts of corpus data, which is a significant limitation.
Disclosure of Invention
In view of the foregoing, an object of an embodiment of the present application is to provide a training method and an application method for a language model.
In a first aspect, an embodiment of the present application provides a training method for a language model, including:
acquiring a first corpus, a second corpus and word block training results;
word frequency sorting matching is carried out on word blocks in the first corpus and the second corpus;
mapping the word block training result to the first corpus according to the matching result;
in a pre-training model, initializing word blocks in the first corpus by using mapped word block training results;
inputting the first corpus into the pre-training model, and then training a specific task to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
In the training method of the language model provided by the embodiment of the application, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, and the initialized BERT model is then trained on a task with the first corpus to generate the language model. With this design, the first corpus can obtain a good training effect without depending on a large training dataset, which alleviates the problem that languages with little corpus data lack language models and makes it easier to form a language model.
In a possible implementation manner, the word frequency ordering matching of the word blocks in the first corpus and the second corpus includes:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
forward ordering is carried out on the first word block and the second word block according to word frequency statistics results;
and establishing a matching relationship between the first word blocks and the second word blocks that occupy the same position in the ordering.
In a possible implementation manner, the mapping the word block training result to the first corpus according to the matching result includes:
acquiring word block vectors of the second word blocks based on the word block training results;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
In one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
In a possible implementation manner, the step of inputting the first corpus into the pre-training model for training a specific task includes:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting a characteristic vector corresponding to the specific task from the output result, and inputting the characteristic vector to a full-connection layer.
In one possible implementation, the inputting the first corpus into the pre-training model includes:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the pre-training model;
the data enhancement method comprises at least one of disorder, extension, truncation and MASK.
In one possible implementation, the loss function of the language model is:
Loss=L1+L2
wherein L1 is the loss function of the unsupervised training task of the BERT model; L2 is the loss function of the classification task.
In a second aspect, embodiments of the present application provide a method for applying a language model, where the language model is trained by using the training method in the embodiment of the first aspect;
the application method comprises the following steps:
acquiring an input text;
inputting the input text into a pre-trained model in the language model;
obtaining a feature vector corresponding to a specific task from the output result of the pre-training model;
the feature vectors are input to a fully connected layer in the language model.
In a possible implementation manner, the specific task includes a classification task, and the feature vector is a classification vector corresponding to the CLS input in the output result.
In one possible implementation, the classification tasks include text emotion analysis and text semantic matching.
In a possible implementation manner, the specific tasks include a question-answer task and a named entity recognition task, and the feature vector is a part of the output result corresponding to a word block in the input text.
In a third aspect, an embodiment of the present application provides a generating device of a language model, including:
the first acquisition module is configured to acquire a first corpus, a second corpus and word block training results;
the ordering matching module is configured to perform word frequency ordering matching on word blocks in the first corpus and the second corpus;
the matching mapping module is configured to map the word block training result to the first corpus according to the matching result;
the initialization module is configured to initialize word blocks in the first corpus by using mapped word block training results in a pre-training model;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
In a fourth aspect, embodiments of the present application provide an application apparatus of a language model, where the language model is trained by using a training method in the embodiment of the first aspect; the application device comprises:
a second acquisition module configured to acquire an input text;
an input module configured to input the input text into a pre-trained model of the language models;
the feature extraction module is configured to acquire a feature vector corresponding to a specific task from the output result of the pre-training model;
and the specific task module is configured to input the feature vector into the full connection layer.
In a fifth aspect, embodiments of the present application provide an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method according to any one of the embodiments of the first and second aspects when executing the program.
In a sixth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any one of the methods of the embodiments of the first and second aspects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a training method of a language model according to an embodiment of the present application;
FIG. 2 is an example of word frequency statistics for English and another language provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a language model according to an embodiment of the present disclosure;
FIG. 4 is a second flowchart of a training method of a language model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for applying a language model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a device for generating a language model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an application device of a language model according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be further described in detail below with reference to the accompanying drawings.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present application should be given the ordinary meaning understood by those of ordinary skill in the art to which the embodiments belong. The terms "first", "second" and the like used in the embodiments of the present application do not denote any order, quantity or importance, but are merely used to distinguish one element from another. The word "comprising", "comprises" or the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected", "coupled" and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
In the field of computer natural language processing (Natural Language Processing, abbreviated NLP), training a language model requires a large amount of corpus data and abundant computing resources. For some small languages with insufficient corpus data, a language model cannot be constructed because there is not enough corpus data to support the training task of the language model.
In view of this, an embodiment of the present application provides a training method for a language model, as shown in fig. 1, including:
step S10: acquiring a first corpus, a second corpus and word block training results;
the first corpus is a corpus with a small amount of corpus data, such as a language model, and the like, which needs to be generated. The second corpus is usually a general corpus, which is a training corpus used for training a language model in the NLP field, and is a corpus with a larger corpus data volume, such as an enwiki dataset. That is, in one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
The word block (Token) training results are the word block vectors (Token Embeddings) of the second corpus trained by the BERT model. It should be noted that the word blocks here refer to the basic semantic units, namely Tokens, used to form text; they may be characters or words, and the division of word blocks (Tokens) may differ according to the language and the granularity.
BERT is a pre-trained model for natural language processing, whose full name is Bidirectional Encoder Representations from Transformers, i.e. the encoder of a bidirectional Transformer; its main model structure is a stack of Transformer encoders. BERT comes in BERT-base and BERT-large: BERT-base uses 12 Transformer encoder layers with about 110 million parameters, and BERT-large uses 24 Transformer encoder layers with about 340 million parameters. The BERT model needs a very large amount of corpus data during pre-training, and the first corpus cannot meet the BERT model's requirement on corpus size; even for BERT-base, many small languages cannot meet this requirement, so the first corpus cannot use the BERT model directly.
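As an illustrative sketch of this stacked-encoder structure (a minimal stand-in built from PyTorch's generic TransformerEncoder with the published BERT-base dimensions, not the actual BERT implementation):

```python
import torch.nn as nn

# BERT-base-like stack: 12 Transformer encoder layers, hidden size 768, 12 attention heads,
# feed-forward size 3072. This is only a structural stand-in, not the real BERT code.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
bert_base_like_stack = nn.TransformerEncoder(encoder_layer, num_layers=12)
```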
Step S20: word frequency sorting matching is carried out on word blocks in the first corpus and the second corpus;
herein, the word blocks in the first corpus are defined as first word blocks and the word blocks in the second corpus as second word blocks; the first corpus comprises a plurality of first word blocks, and the second corpus comprises a plurality of second word blocks. Word frequency ordering refers to calculating the occurrence frequency of the first word blocks and the second word blocks and sorting them by frequency; for example, FIG. 2 shows an example of word frequency statistics for English and another language provided in an embodiment of the present application.
Word frequency ordering matching refers to performing the same word frequency ordering on the first corpus and the second corpus and establishing a matching relationship between first word blocks and second word blocks that occupy the same rank.
In one possible implementation, step S20 includes:
performing word frequency statistics on a first word block in a first corpus and a second word block in a second corpus;
forward ordering is carried out on the first word block and the second word block according to word frequency statistics results;
and establishing a matching relationship for the first word blocks and the second word blocks which are ranked the same.
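A minimal sketch of this word-frequency ordering and matching, assuming toy whitespace-tokenized corpora (the real first and second corpora and their tokenization are not specified here):

```python
from collections import Counter

def rank_by_frequency(word_blocks):
    """Count word block frequencies and return the blocks sorted from most to least frequent."""
    return [block for block, _ in Counter(word_blocks).most_common()]

# Toy stand-ins for the first (small) corpus and the second (large) corpus.
first_corpus_blocks = "a b a c a b d".split()
second_corpus_blocks = "x y x z x y w".split()

first_ranked = rank_by_frequency(first_corpus_blocks)
second_ranked = rank_by_frequency(second_corpus_blocks)

# Establish a matching relationship between first and second word blocks at the same rank.
matching = dict(zip(first_ranked, second_ranked))
print(matching)  # {'a': 'x', 'b': 'y', 'c': 'z', 'd': 'w'}
```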
Step S30: mapping the word block training result to a first corpus according to the matching result;
it can generally be considered that the word block distribution rules of the first corpus and the second corpus are similar, i.e. the common word blocks in different languages behave similarly. Based on this, the word block training results can be mapped to the first corpus according to the word frequency ordering matching result of step S20. That is, the word block vectors of the second word blocks are obtained from the word block training results, and a mapping relationship between the first word blocks and these word block vectors is established according to the matching result.
Step S40: in the pre-training model, initializing word blocks in a first corpus by adopting mapped word block training results;
the pre-training model is a BERT model, and the word block training results are the word block vectors of the second corpus trained by the BERT model; the values of these word block vectors have already converged well. The word block training results are mapped to the first corpus according to the word frequency ordering matching result. Even if the text semantics of a matched first word block and second word block differ, initializing the BERT model with the mapped word block training results is far better than random initialization.
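A minimal sketch of steps S30 and S40 under assumed inputs: the matching dictionary, the trained second-corpus word block vectors, the first-corpus vocabulary and the torch.nn.Embedding stand-in for the BERT word block embedding layer are all illustrative, and 768 is BERT-base's hidden size.

```python
import torch

hidden_size = 768  # hidden size of BERT-base

# Assumed inputs: the rank matching from step S20 and the second corpus's trained vectors.
matching = {"first_blk_a": "second_blk_x", "first_blk_b": "second_blk_y"}
second_block_vectors = {
    "second_blk_x": torch.randn(hidden_size),
    "second_blk_y": torch.randn(hidden_size),
}
first_vocab = {"first_blk_a": 0, "first_blk_b": 1}  # first-corpus word block ids

# Step S30: map each first word block to the vector of its matched second word block.
first_block_vectors = {blk: second_block_vectors[matching[blk]] for blk in matching}

# Step S40: initialize a (stand-in) word block embedding layer with the mapped vectors
# instead of random values.
embedding = torch.nn.Embedding(len(first_vocab), hidden_size)
with torch.no_grad():
    for block, idx in first_vocab.items():
        embedding.weight[idx] = first_block_vectors[block]
```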
Step S50: inputting the first corpus into a pre-training model, and then training a specific task to generate a language model;
fig. 3 is a schematic structural diagram of a language model provided in the embodiment of the present application, where, as shown in fig. 3, the language model includes a BERT model and a full-connection layer for implementing a specific task, where the full-connection layer may be one layer or more than one layer.
After the BERT model is initialized with the first corpus, specific task training is further required in order to apply it to a specific task, so as to generate a usable language model.
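The structure of FIG. 3 can be sketched as a pre-trained encoder followed by a fully connected task head. The encoder argument, the hidden size and the two-class output below are illustrative assumptions; the trivial embedding-only encoder in the usage example only stands in for the initialized BERT model.

```python
import torch
import torch.nn as nn

class LanguageModelForTask(nn.Module):
    """Sketch of FIG. 3: a pre-trained (BERT-style) encoder plus a fully connected task head."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder                                 # the initialized BERT model would go here
        self.classifier = nn.Linear(hidden_size, num_labels)   # fully connected layer for the specific task

    def forward(self, token_ids):
        hidden_states = self.encoder(token_ids)                # [batch, seq_len, hidden_size]
        cls_vector = hidden_states[:, 0, :]                    # classification output C at the [CLS] position
        return self.classifier(cls_vector)

# Usage with a trivial stand-in encoder (an embedding layer only, not a real BERT):
model = LanguageModelForTask(nn.Sequential(nn.Embedding(30000, 768)))
logits = model(torch.randint(0, 30000, (2, 16)))               # [2, num_labels]
```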
In the training method of the language model provided by the embodiment of the application, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, and the initialized BERT model is then trained on a task with the first corpus to generate the language model. With this design, the first corpus can obtain a good training effect without depending on a large training dataset, which alleviates the problem that languages with little corpus data lack language models and makes it easier to form a language model.
In one possible implementation, as shown in fig. 4, step S50 includes:
step S51: inputting the first corpus into a pre-training model;
the pre-training model is a BERT model, whose input is the sum of three vectors, namely a word block vector (Token Embedding), a text vector (Segment Embedding) and a position vector (Position Embedding). The Token Embedding represents the embedding of the current word block; the Segment Embedding indicates the index of the sentence in which the current word block is located; the Position Embedding indicates the position index of the current word block.
The BERT model also includes some special Tokens, namely [CLS], [SEP] and [MASK]. [CLS] is placed at the first position of the first sentence; its value is learned automatically during BERT model training and is used to describe the global semantic information of the text, and its output through the BERT model is the classification output C, which can be used for subsequent classification tasks. [SEP] is used to separate the two input sentences: for input sentences A and B, a [SEP] is added after each of sentences A and B. [MASK] is used to mask some Tokens in a sentence; the masked Tokens are replaced with [MASK], and the [MASK] vectors output by the BERT model are used to predict these Tokens.
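A minimal sketch of how the three input embeddings are summed; the vocabulary size, the example ids for [CLS]/[SEP] and the sentence split are illustrative assumptions rather than the actual BERT preprocessing.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)    # Token Embedding: embedding of each word block
segment_emb = nn.Embedding(2, hidden)           # Segment Embedding: which of the two sentences
position_emb = nn.Embedding(max_len, hidden)    # Position Embedding: position index of each block

# Hypothetical ids for "[CLS] sentence A [SEP] sentence B [SEP]".
token_ids = torch.tensor([[101, 2054, 102, 2003, 102]])
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])                  # sentence A = 0, sentence B = 1
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)    # 0, 1, 2, ...

# The BERT input is the element-wise sum of the three embeddings.
bert_input = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
```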
Step S52: obtaining an output result of the first corpus after the pre-training model;
the output result of the BERT model comprises a classification output C and a text output T, wherein the classification output C is an output vector of [ CLS ] after passing through the BERT model; the text output T is an output vector of each Token in the input text after the BERT model.
Step S53: and extracting the feature vector corresponding to the specific task from the output result, and inputting the feature vector to the full connection layer.
There are various tasks in NLP, such as Classification (Classification) task, question-answering (QA) task, and Named Entity Recognition (NER), etc.
The classification task refers to classifying the input text according to set conditions; in the BERT model, classification tasks include single-sentence classification tasks and sentence-pair classification tasks. The input of a single-sentence classification task is a single sentence, used for example for text emotion analysis (like or dislike) of texts such as movie reviews or product reviews. The input of a sentence-pair classification task is a sentence pair, which can be used for text semantic matching, such as judging whether the meanings of the two sentences are the same, judging the similarity between the sentence pair, judging the relationship between the sentence pair (entailment, neutral or contradiction), and the like.
In the classification task, the feature vector is a classification output C in the BERT model output result, the classification output C is a classification vector output by [ CLS ] after passing through the BERT model, and the classification output C is input to the full connection layer to realize the classification task.
A question-and-answer task refers to giving a question and a paragraph and then marking the specific location of the answer in the paragraph. In such tasks, all text outputs T (one per Token) in the output result are input to the fully connected layer, and the fully connected layer then outputs the starting position and the ending position of the answer.
Named entity recognition refers to labeling the word blocks of the input text and judging whether each word block belongs to a person name (Person), a place name (Location), an organization name (Organization), miscellaneous (Miscellaneous) or other (Other). In this task, the text output T is input to the fully connected layer.
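For the question-answer and named entity recognition tasks, the per-word-block text outputs T are fed to the fully connected layer. The sketch below shows an assumed span-extraction head for the question-answer case; the batch size, sequence length and random tensor standing in for T are illustrative.

```python
import torch
import torch.nn as nn

hidden_size = 768
span_head = nn.Linear(hidden_size, 2)               # fully connected layer producing start/end logits

text_outputs = torch.randn(1, 128, hidden_size)     # stand-in for T: per-Token BERT output

logits = span_head(text_outputs)                    # [batch, seq_len, 2]
start_logits, end_logits = logits.split(1, dim=-1)
start_position = start_logits.squeeze(-1).argmax(dim=-1)  # predicted answer start index
end_position = end_logits.squeeze(-1).argmax(dim=-1)      # predicted answer end index
```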
Because the first corpus is a small dataset with an insufficient amount of data, while the BERT model is pre-trained in an unsupervised way on long texts, the usage scenarios do not fully match. For example, in an emotion classification task, the first corpus consists mostly of short texts and contains little long-text data, so most position embeddings are masked and under-training easily occurs. In addition, over-fitting easily occurs because the amount of text is small. Therefore, data enhancement is performed on the first corpus; the data enhancement method includes at least one of disorder, extension, truncation and MASK, and the first corpus after data enhancement can improve the generalization of the language model.
Illustratively, the enhancement method for emotion classification tasks is:
1) Let the first corpus be T; back up the first corpus T so that the total data amount is doubled, obtaining an augmented corpus T';
2) Randomly expand 20% of the texts in the augmented corpus T' to a random length between 400 and 512; the expansion can be realized by repeating the text itself multiple times.
The maximum expansion length here is 512 because the maximum input length allowed by the BERT-base model is 512; the maximum expansion length can be adjusted according to the BERT model used. For example, in an application scenario using the BERT-large model, the maximum expansion length can be increased to the maximum length allowed by the BERT-large model.
3) In the data T' - T of the augmented corpus T' that is added relative to the first corpus T, randomly reverse the word order within a maximum distance of 2 in 40% of the texts, for example abcde -> adcbe;
4) Following the BERT preprocessing operation, in the data obtained from steps 1) to 3), select 20% of the word blocks; of these, replace 80% with [MASK], replace 10% with random text, and leave 10% unchanged.
It should be noted that the embodiments of the present application are not limited to the enhancement parameters described above, and parameter adjustment may be performed for different specific tasks.
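A rough sketch of enhancement steps 1) to 4) under stated assumptions: whitespace tokenization, a plain "[MASK]" placeholder string and unseeded randomness are illustrative choices, and the percentages follow the values given above.

```python
import random

def augment_corpus(first_corpus, max_len=512):
    """Rough sketch of enhancement steps 1) to 4) for the emotion classification task."""
    # 1) Back up the first corpus T to double the total data amount (augmented corpus T').
    augmented = list(first_corpus) + list(first_corpus)

    # 2) Randomly expand 20% of the texts to a random length between 400 and max_len
    #    by repeating the text itself.
    for i, text in enumerate(augmented):
        if random.random() < 0.2:
            words = text.split()
            target = random.randint(400, max_len)
            while len(words) < target:
                words += text.split()
            augmented[i] = " ".join(words[:target])

    # 3) In the added portion T' - T, randomly swap word order with a maximum distance of 2
    #    in 40% of the texts, e.g. "a b c d e" -> "a d c b e".
    for i in range(len(first_corpus), len(augmented)):
        words = augmented[i].split()
        if random.random() < 0.4 and len(words) > 2:
            j = random.randrange(len(words) - 2)
            words[j], words[j + 2] = words[j + 2], words[j]
            augmented[i] = " ".join(words)

    # 4) BERT-style preprocessing: select 20% of word blocks; replace 80% of those with [MASK],
    #    10% with a random word block, and leave 10% unchanged.
    vocab = [w for t in augmented for w in t.split()]
    for i, text in enumerate(augmented):
        words = text.split()
        for j in range(len(words)):
            if random.random() < 0.2:
                r = random.random()
                if r < 0.8:
                    words[j] = "[MASK]"
                elif r < 0.9:
                    words[j] = random.choice(vocab)
        augmented[i] = " ".join(words)
    return augmented
```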
The loss function of the language model provided by the embodiment of the application is as follows:
Loss=L1+L2
it can be seen that the loss function comprises two parts, L1 and L2, where L1 is the loss function of the BERT model in the unsupervised training task and L2 is the loss function of the classification task.
L2 may employ the cross entropy loss function; the classification tasks include binary classification and multi-class classification.
In the binary classification case, the language model only needs to predict two outcomes, with predicted probabilities p and 1-p for the two classes; the expression is:
L2 = -(1/N) Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
where y_i is the label of sample i (1 for the positive class, 0 for the negative class) and p_i is the probability that sample i is predicted to be positive.
The multi-class case is effectively an extension of the binary case:
L2 = -(1/N) Σ_i Σ_c y_ic log(p_ic), with c running from 1 to M
where M is the number of categories; y_ic is an indicator function (0 or 1) that takes 1 if the true class of sample i equals c and 0 otherwise; and p_ic is the predicted probability that sample i belongs to category c.
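A small concrete sketch of Loss = L1 + L2; the masked-word-block cross entropy standing in for L1 and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# L1: loss of the BERT unsupervised training task, sketched here as a cross entropy over
# the predictions made at masked word block positions.
mlm_logits = torch.randn(8, 30000)           # predictions for 8 masked word blocks over a 30000-word vocab
mlm_labels = torch.randint(0, 30000, (8,))   # true ids of the masked word blocks
L1 = F.cross_entropy(mlm_logits, mlm_labels)

# L2: cross entropy loss of the classification task (binary or multi-class).
class_logits = torch.randn(4, 2)             # 4 samples, 2 classes
class_labels = torch.tensor([0, 1, 1, 0])
L2 = F.cross_entropy(class_logits, class_labels)

loss = L1 + L2                               # Loss = L1 + L2
```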
The embodiment of the application simultaneously provides an application method of the language model, wherein the language model comprises a BERT model and a full-connection layer for realizing specific tasks; the language model is trained by the training method in the above embodiment, as shown in fig. 5, and the application method includes:
step S100: acquiring an input text;
the input text here is the text on which a specific task needs to be performed, for example a comment requiring text emotion analysis.
Step S200: inputting the input text into a pre-trained model in the language model;
from the above description, the pre-training model is a BERT model, and the input of the BERT model is the sum of three vectors, so in the input step, the input text needs to be converted into the input vector meeting the requirements of the BERT model.
Step S300: obtaining a feature vector corresponding to a specific task from an output result of the pre-training model;
the output result of the BERT model comprises classified output C and text output T, wherein the classified output C is an output vector of [ CLS ] after passing through the BERT model; the text output T is an output vector of each Token in the input text after the BERT model.
Specific tasks include classification tasks, question-answer tasks and named entity recognition tasks; for the specific tasks and the relationship between specific tasks and feature vectors, reference may be made to the description above, which is not repeated here.
Step S400: the feature vectors are input into the full connection layer.
After the feature vector is input to the full connection layer, the full connection layer outputs a task result related to the specific task.
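Putting steps S100 to S400 together for the classification case, for example with the LanguageModelForTask sketch shown earlier; the whitespace tokenizer, the vocabulary dictionary and the unknown-word id 0 are illustrative assumptions.

```python
import torch

def classify(text, model, vocab, max_len=512):
    """Sketch of steps S100-S400: input text in, task result out."""
    # S100/S200: convert the input text into word block ids accepted by the pre-training model
    # (a real system would also add [CLS]/[SEP] and build the segment and position inputs).
    token_ids = [vocab.get(word, 0) for word in text.split()][:max_len]
    batch = torch.tensor([token_ids])

    # S300/S400: run the model; inside it, the [CLS] feature vector is fed to the fully connected layer.
    with torch.no_grad():
        logits = model(batch)
    return logits.argmax(dim=-1).item()      # index of the predicted class
```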
In the application method of the language model provided by the embodiment of the application, the first corpus is initialized in the pre-training model of the language model using the word block training results already trained on the second corpus, so that the first corpus can obtain a good training effect without depending on a large training dataset. This alleviates the problem that languages with little corpus data lack pre-trained models, makes it easier to form a language model, and allows specific tasks to be handled with only a small amount of corpus data.
It should be noted that, the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present application, and the devices may interact with each other to complete the methods.
It should be noted that some embodiments of the present application have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the embodiment of the application also provides a device for generating a language model, which corresponds to the method of any embodiment.
Referring to fig. 6, the language model generating apparatus includes:
a first obtaining module 100 configured to obtain a first corpus, a second corpus, and a word block training result;
the ranking matching module 200 is configured to perform word frequency ranking matching on word blocks in the first corpus and the second corpus;
the matching mapping module 300 is configured to map the word block training result to the first corpus according to the matching result;
an initialization module 400 configured to initialize word blocks in the first corpus with mapped word block training results in the pre-training model;
the task training module 500 is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
The device of the foregoing embodiment is used to implement the training method of the corresponding language model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the embodiment of the application also provides an application device of the language model, which corresponds to the method of any embodiment.
Referring to fig. 7, the language model application apparatus includes:
a second acquisition module 600 configured to acquire an input text;
an input module 700 configured to input the input text into a pre-trained model of the language models;
the feature extraction module 800 is configured to obtain a feature vector corresponding to a specific task from the output result of the pre-training model;
a specific task module 900 is configured to input the feature vector into a fully connected layer in the language model.
The language model is formed by training by the training method in the embodiment.
The device of the foregoing embodiment is configured to implement the application method of the corresponding language model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in one or more pieces of software and/or hardware when implementing the embodiments of the present application.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device corresponding to the method of any embodiment, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the program to implement the method of any embodiment.
Fig. 8 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit ), microprocessor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. Memory 1020 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in memory 1020 and executed by processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, corresponding to any of the above-described embodiments of the method, the embodiments of the present application further provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described in any of the above-described embodiments.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to perform the method of any of the foregoing embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
Those of ordinary skill in the art will appreciate that the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the embodiments of the application (including the claims) is limited to these examples. Under the concept of the embodiments of the present application, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be implemented in any order, and many other variations of the different aspects of the embodiments described above exist, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the embodiments, it should be apparent to one skilled in the art that the embodiments may be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While embodiments of the present application have been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the embodiments of the present application, are intended to be included within the scope of the embodiments of the present application.

Claims (15)

1. A method for training a language model, comprising:
acquiring a first corpus and a second corpus; inputting the second corpus into a pre-training model for training to obtain word block training results;
word frequency sorting matching is carried out on word blocks in the first corpus and the second corpus, and a matching result is obtained;
mapping the word block training result to the first corpus according to the matching result to obtain a matching relationship between a first word block in the first corpus and a second word block in the second corpus;
initializing the pre-training model by adopting first word blocks in the first corpus, wherein the first word blocks are in one-to-one correspondence with second word blocks in the second corpus;
inputting the first corpus into the pre-training model, and then training a specific task to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
2. The training method of claim 1, wherein the word-frequency ordering matching of word blocks in the first corpus and the second corpus comprises:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
forward ordering is carried out on the first word block and the second word block according to word frequency statistics results;
and establishing a matching relationship between the first word blocks and the second word blocks that occupy the same position in the ordering.
3. The training method of claim 2, wherein mapping the word block training results to the first corpus based on the matching results comprises:
acquiring word block vectors of the second word blocks based on the word block training results;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
4. A training method as claimed in claim 1 or 2 or 3, wherein the first corpus text data set is smaller than the second corpus text data set.
5. The training method according to claim 1, wherein the inputting the first corpus into the pre-training model for specific task training comprises:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting a characteristic vector corresponding to the specific task from the output result, and inputting the characteristic vector to a full-connection layer.
6. The training method of claim 5, wherein said inputting the first corpus into the pre-training model comprises:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the pre-training model;
the data enhancement method comprises at least one of disorder, extension, truncation and MASK.
7. The training method of claim 1, wherein,
the loss function of the language model is:
Loss=L1+L2;
wherein L1 is the loss function of the unsupervised training task of the BERT model; L2 is the loss function of the specific task.
8. A method of applying a language model, wherein the language model is trained and formed by the training method according to any one of claims 1 to 7; the application method comprises the following steps:
acquiring an input text;
inputting the input text into the language model for training to obtain an output result;
acquiring a feature vector corresponding to a specific task from the output result;
the feature vectors are input to a fully connected layer in the language model.
9. The application method according to claim 8, wherein the specific task includes a classification task, and the feature vector is a classification vector corresponding to a CLS input in the output result.
10. The application method according to claim 9, wherein the classification tasks include text emotion analysis and text semantic matching.
11. The application method according to claim 8, wherein the specific task includes a question-answer task and a named entity recognition task, and the feature vector is a portion of the output result corresponding to a word block in the input text.
12. A language model generating apparatus, comprising:
the first acquisition module is configured to acquire a first corpus and a second corpus; inputting the second corpus into a pre-training model for training to obtain word block training results;
the ordering and matching module is configured to perform word frequency ordering and matching on word blocks in the first corpus and the second corpus to obtain a matching result;
the matching mapping module is configured to map the word block training result to the first corpus according to the matching result to obtain a matching relationship between a first word block in the first corpus and a second word block in the second corpus;
the initialization module is configured to initialize the pre-training model by adopting first word blocks in the first corpus, which are in one-to-one correspondence with second word blocks in the second corpus;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
13. An application device of a language model, wherein the language model is trained and formed by the training method according to any one of claims 1 to 7; the application device comprises:
a second acquisition module configured to acquire an input text;
the input module is configured to input the input text into the language model for training to obtain an output result;
the feature extraction module is configured to acquire a feature vector corresponding to a specific task from the output result;
and a specific task module configured to input the feature vector into a fully connected layer in the language model.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 11 when the program is executed.
15. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 11.
CN202110719988.7A 2021-06-28 2021-06-28 Training method and application method of language model Active CN113255328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110719988.7A CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110719988.7A CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Publications (2)

Publication Number Publication Date
CN113255328A CN113255328A (en) 2021-08-13
CN113255328B true CN113255328B (en) 2024-02-02

Family

ID=77189900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110719988.7A Active CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Country Status (1)

Country Link
CN (1) CN113255328B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656573B (en) * 2021-08-27 2024-02-06 北京大数医达科技有限公司 Text information generation method, device and terminal equipment
CN114817469B (en) * 2022-04-27 2023-08-08 马上消费金融股份有限公司 Text enhancement method, training method and training device for text enhancement model
CN115329062B (en) * 2022-10-17 2023-01-06 中邮消费金融有限公司 Dialogue model training method under low-data scene and computer equipment

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
KR20160058531A (en) * 2014-11-17 2016-05-25 포항공과대학교 산학협력단 Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
CN109255117A (en) * 2017-07-13 2019-01-22 普天信息技术有限公司 Chinese word cutting method and device
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111695336A (en) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Disease name code matching method and device, computer equipment and storage medium
CN111737994A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
WO2020242567A1 (en) * 2019-05-27 2020-12-03 Microsoft Technology Licensing, Llc Cross-lingual task training
WO2020253042A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Intelligent sentiment judgment method and device, and computer readable storage medium
CN112307181A (en) * 2020-10-28 2021-02-02 刘玲玲 Corpus-specific-corpus-based corpus extraction method and corpus extractor
CN112507101A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Method and device for establishing pre-training language model
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112908315A (en) * 2021-03-10 2021-06-04 北京思图场景数据科技服务有限公司 Question-answer intention judgment method based on voice characteristics and voice recognition
CN112966712A (en) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and device, electronic equipment and computer readable medium
CN112989828A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Training method, device, medium and electronic equipment for named entity recognition model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235567B2 (en) * 2013-01-14 2016-01-12 Xerox Corporation Multi-domain machine translation model adaptation
CN106156010B (en) * 2015-04-20 2019-10-11 阿里巴巴集团控股有限公司 Translate training method, device, system and translation on line method and device
US20180157641A1 (en) * 2016-12-07 2018-06-07 International Business Machines Corporation Automatic Detection of Required Tools for a Task Described in Natural Language Content
US11281863B2 (en) * 2019-04-18 2022-03-22 Salesforce.Com, Inc. Systems and methods for unifying question answering and text classification via span extraction
CN110580290B (en) * 2019-09-12 2022-12-13 北京小米智能科技有限公司 Method and device for optimizing training set for text classification
US11501187B2 (en) * 2019-09-24 2022-11-15 International Business Machines Corporation Opinion snippet detection for aspect-based sentiment analysis
US11620515B2 (en) * 2019-11-07 2023-04-04 Salesforce.Com, Inc. Multi-task knowledge distillation for language model
CN110717339B (en) * 2019-12-12 2020-06-30 北京百度网讯科技有限公司 Semantic representation model processing method and device, electronic equipment and storage medium

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
KR20160058531A (en) * 2014-11-17 2016-05-25 포항공과대학교 산학협력단 Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
CN109255117A (en) * 2017-07-13 2019-01-22 普天信息技术有限公司 Chinese word cutting method and device
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
WO2020242567A1 (en) * 2019-05-27 2020-12-03 Microsoft Technology Licensing, Llc Cross-lingual task training
WO2020253042A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Intelligent sentiment judgment method and device, and computer readable storage medium
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112989828A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Training method, device, medium and electronic equipment for named entity recognition model
CN111695336A (en) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Disease name code matching method and device, computer equipment and storage medium
CN111737994A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
CN112307181A (en) * 2020-10-28 2021-02-02 刘玲玲 Corpus-specific-corpus-based corpus extraction method and corpus extractor
CN112507101A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Method and device for establishing pre-training language model
CN112966712A (en) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and device, electronic equipment and computer readable medium
CN112908315A (en) * 2021-03-10 2021-06-04 北京思图场景数据科技服务有限公司 Question-answer intention judgment method based on voice characteristics and voice recognition

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling; Jing Su, Qingyun Dai et al.; Computer Standards & Interfaces; Vol. 67; full text *
A Word2vec-based content situation awareness method; 魏忠; 周俊; 石元兵; 黄明浩; Communications Technology (No. 05); full text *
Chinese entity recognition based on the BERT-BiLSTM-CRF model; 谢腾; 杨俊安; 刘辉; Computer Systems & Applications (No. 07); full text *
Processing of training corpora for language models and design of decoding dictionaries; 林小俊, 田浩 et al.; The 8th National Conference on Man-Machine Speech Communication; pp. 164-168 *

Also Published As

Publication number Publication date
CN113255328A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN113255328B (en) Training method and application method of language model
WO2022007823A1 (en) Text data processing method and device
CN108846077B (en) Semantic matching method, device, medium and electronic equipment for question and answer text
JP6601470B2 (en) NATURAL LANGUAGE GENERATION METHOD, NATURAL LANGUAGE GENERATION DEVICE, AND ELECTRONIC DEVICE
CN111539197B (en) Text matching method and device, computer system and readable storage medium
CN113239169B (en) Answer generation method, device, equipment and storage medium based on artificial intelligence
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN114676704B (en) Sentence emotion analysis method, device and equipment and storage medium
CN112632226A (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN113535912B (en) Text association method and related equipment based on graph rolling network and attention mechanism
CN114490926A (en) Method and device for determining similar problems, storage medium and terminal
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN112131884A (en) Method and device for entity classification and method and device for entity presentation
US20230130662A1 (en) Method and apparatus for analyzing multimodal data
CN116738956A (en) Prompt template generation method and device, computer equipment and storage medium
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN115129863A (en) Intention recognition method, device, equipment, storage medium and computer program product
CN115408523A (en) Medium-length and long-text classification method and system based on abstract extraction and keyword extraction
He et al. Discovering interdisciplinary research based on neural networks
CN115952317A (en) Video processing method, device, equipment, medium and program product
CN114995729A (en) Voice drawing method and device and computer equipment
CN109740162B (en) Text representation method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant