CN113255328B - Training method and application method of language model - Google Patents

Training method and application method of language model

Info

Publication number
CN113255328B
CN113255328B
Authority
CN
China
Prior art keywords
corpus
training
word
model
word block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110719988.7A
Other languages
Chinese (zh)
Other versions
CN113255328A (en)
Inventor
冀潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Technology Development Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202110719988.7A priority Critical patent/CN113255328B/en
Publication of CN113255328A publication Critical patent/CN113255328A/en
Application granted granted Critical
Publication of CN113255328B publication Critical patent/CN113255328B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The embodiment of the application provides a training method and an application method of a language model. The training method comprises the following steps: acquiring a first corpus, a second corpus and word block training results; performing word frequency ordering matching on the word blocks in the first corpus and the second corpus; mapping the word block training results to the first corpus according to the matching result; in the pre-training model, initializing the word blocks in the first corpus with the mapped word block training results; and inputting the first corpus into the pre-training model and then performing specific task training to generate the language model. The pre-training model is a BERT model, and the word block training results are the word block vectors of the second corpus trained by the BERT model. In this training method, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, so that the first corpus can achieve a good training effect without depending on a large dataset, and the language model can be formed more easily.

Description

Training method and application method of language model
Technical Field
The embodiment of the application relates to the technical field of natural language processing, in particular to a training method and an application method of a language model.
Background
In the field of computer natural language processing (Natural Language Processing, abbreviated NLP), training a language model requires extremely large amounts of corpus data, which is a significant limitation.
Disclosure of Invention
In view of the foregoing, an object of an embodiment of the present application is to provide a training method and an application method for a language model.
In a first aspect, an embodiment of the present application provides a training method for a language model, including:
acquiring a first corpus, a second corpus and word block training results;
word frequency sorting matching is carried out on word blocks in the first corpus and the second corpus;
mapping the word block training result to the first corpus according to the matching result;
in a pre-training model, initializing word blocks in the first corpus by using mapped word block training results;
inputting the first corpus into the pre-training model, and then training a specific task to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
In the training method of the language model provided by the embodiment of the application, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, and the initialized BERT model is then trained on a task with the first corpus to generate the language model. With this design, the first corpus can obtain a good training effect without depending on a large training dataset, which alleviates the problem that languages with little corpus data lack language models and makes it easier to form a language model.
In a possible implementation manner, the word frequency ordering matching of the word blocks in the first corpus and the second corpus includes:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
forward ordering is carried out on the first word block and the second word block according to word frequency statistics results;
and establishing a matching relationship between the first word blocks and the second word blocks that occupy the same position in the ordering.
In a possible implementation manner, the mapping the word block training result to the first corpus according to the matching result includes:
acquiring word block vectors of the second word blocks based on the word block training results;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
In one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
In a possible implementation manner, the step of inputting the first corpus into the pre-training model for training a specific task includes:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting a characteristic vector corresponding to the specific task from the output result, and inputting the characteristic vector to a full-connection layer.
In one possible implementation, the inputting the first corpus into the pre-training model includes:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the pre-training model;
the data enhancement method comprises at least one of disorder, extension, truncation and MASK.
In one possible implementation, the loss function of the language model is:
Loss=L1+L2
wherein L1 is the loss function of the unsupervised training task of the BERT model; L2 is the loss function of the classification task.
In a second aspect, embodiments of the present application provide a method for applying a language model, where the language model is trained by using the training method in the embodiment of the first aspect;
the application method comprises the following steps:
acquiring an input text;
inputting the input text into a pre-trained model in the language model;
obtaining a feature vector corresponding to a specific task from the output result of the pre-training model;
the feature vectors are input to a fully connected layer in the language model.
In a possible implementation manner, the specific task includes a classification task, and the feature vector is a classification vector corresponding to the CLS input in the output result.
In one possible implementation, the classification tasks include text emotion analysis and text semantic matching.
In a possible implementation manner, the specific tasks include a question-answer task and a named entity recognition task, and the feature vector is a part of the output result corresponding to a word block in the input text.
In a third aspect, an embodiment of the present application provides a generating device of a language model, including:
the first acquisition module is configured to acquire a first corpus, a second corpus and word block training results;
the ordering matching module is configured to perform word frequency ordering matching on word blocks in the first corpus and the second corpus;
the matching mapping module is configured to map the word block training result to the first corpus according to the matching result;
the initialization module is configured to initialize word blocks in the first corpus by using mapped word block training results in a pre-training model;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
In a fourth aspect, embodiments of the present application provide an application apparatus of a language model, where the language model is trained by using a training method in the embodiment of the first aspect; the application device comprises:
a second acquisition module configured to acquire an input text;
an input module configured to input the input text into a pre-trained model of the language models;
the feature extraction module is configured to acquire a feature vector corresponding to a specific task from the output result of the pre-training model;
and the specific task module is configured to input the feature vector into the full connection layer.
In a fifth aspect, embodiments of the present application provide an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method according to any one of the embodiments of the first and second aspects when executing the program.
In a sixth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any one of the methods of the embodiments of the first and second aspects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a training method of a language model according to an embodiment of the present application;
FIG. 2 is an example of word frequency statistics for English and another language provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a language model according to an embodiment of the present disclosure;
FIG. 4 is a second flowchart of a training method of a language model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for applying a language model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a device for generating a language model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an application device of a language model according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be further described in detail below with reference to the accompanying drawings.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present application should be given the ordinary meaning understood by those of ordinary skill in the art to which the embodiments belong. The terms "first", "second" and the like used in the embodiments of the present application do not denote any order, quantity or importance, but are merely used to distinguish one element from another. The word "comprising", "comprises" or the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected", "coupled" and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
In the field of computer natural language processing (Natural Language Processing, abbreviated NLP), training a language model requires a large amount of corpus data and abundant computing resources. For some small languages with insufficient corpus data, a language model cannot be constructed because there is not enough corpus data to support the training task of the language model.
In view of this, an embodiment of the present application provides a training method for a language model, as shown in fig. 1, including:
step S10: acquiring a first corpus, a second corpus and word block training results;
the first corpus is a corpus with a small amount of corpus data, such as a language model, and the like, which needs to be generated. The second corpus is usually a general corpus, which is a training corpus used for training a language model in the NLP field, and is a corpus with a larger corpus data volume, such as an enwiki dataset. That is, in one possible implementation, the first corpus text dataset is smaller than the second corpus text dataset.
The word block (Token) training results are the word block vectors (Token Embeddings) of the second corpus trained by the BERT model. It should be noted that the word blocks here refer to the basic semantic units, namely Tokens, used to form text; they may be characters or words, and the division of word blocks (Tokens) may differ according to the language and the granularity.
BERT is a pre-trained model for natural language processing, whose full name is Bidirectional Encoder Representations from Transformers, i.e. the encoder of a bidirectional Transformer; its main model structure is a stack of Transformer encoders. BERT comes in BERT-base and BERT-large: BERT-base uses 12 Transformer encoder layers with about 110 million parameters, and BERT-large uses 24 Transformer encoder layers with about 340 million parameters. The BERT model needs a very large amount of corpus data during pre-training, and the first corpus cannot meet the BERT model's requirement on corpus size; even for BERT-base, many small languages cannot meet this requirement, so the first corpus cannot use the BERT model directly.
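As an illustrative sketch of this stacked-encoder structure (a minimal stand-in built from PyTorch's generic TransformerEncoder with the published BERT-base dimensions, not the actual BERT implementation):

```python
import torch.nn as nn

# BERT-base-like stack: 12 Transformer encoder layers, hidden size 768, 12 attention heads,
# feed-forward size 3072. This is only a structural stand-in, not the real BERT code.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
bert_base_like_stack = nn.TransformerEncoder(encoder_layer, num_layers=12)
```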
Step S20: word frequency sorting matching is carried out on word blocks in the first corpus and the second corpus;
herein, the word blocks in the first corpus are defined as first word blocks and the word blocks in the second corpus as second word blocks; the first corpus comprises a plurality of first word blocks, and the second corpus comprises a plurality of second word blocks. Word frequency ordering refers to calculating the occurrence frequency of the first word blocks and the second word blocks and sorting them by frequency; for example, FIG. 2 shows an example of word frequency statistics for English and another language provided in an embodiment of the present application.
Word frequency ordering matching refers to performing the same word frequency ordering on the first corpus and the second corpus and establishing a matching relationship between first word blocks and second word blocks that occupy the same rank.
In one possible implementation, step S20 includes:
performing word frequency statistics on a first word block in a first corpus and a second word block in a second corpus;
forward ordering is carried out on the first word block and the second word block according to word frequency statistics results;
and establishing a matching relationship for the first word blocks and the second word blocks which are ranked the same.
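A minimal sketch of this word-frequency ordering and matching, assuming toy whitespace-tokenized corpora (the real first and second corpora and their tokenization are not specified here):

```python
from collections import Counter

def rank_by_frequency(word_blocks):
    """Count word block frequencies and return the blocks sorted from most to least frequent."""
    return [block for block, _ in Counter(word_blocks).most_common()]

# Toy stand-ins for the first (small) corpus and the second (large) corpus.
first_corpus_blocks = "a b a c a b d".split()
second_corpus_blocks = "x y x z x y w".split()

first_ranked = rank_by_frequency(first_corpus_blocks)
second_ranked = rank_by_frequency(second_corpus_blocks)

# Establish a matching relationship between first and second word blocks at the same rank.
matching = dict(zip(first_ranked, second_ranked))
print(matching)  # {'a': 'x', 'b': 'y', 'c': 'z', 'd': 'w'}
```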
Step S30: mapping the word block training result to a first corpus according to the matching result;
it can generally be considered that the word block distribution rules of the first corpus and the second corpus are similar, i.e. the common word blocks in different languages behave similarly. Based on this, the word block training results can be mapped to the first corpus according to the word frequency ordering matching result of step S20. That is, the word block vectors of the second word blocks are obtained from the word block training results, and a mapping relationship between the first word blocks and these word block vectors is established according to the matching result.
Step S40: in the pre-training model, initializing word blocks in a first corpus by adopting mapped word block training results;
the pre-training model is a BERT model, and the word block training results are the word block vectors of the second corpus trained by the BERT model; the values of these word block vectors have already converged well. The word block training results are mapped to the first corpus according to the word frequency ordering matching result. Even if the text semantics of a matched first word block and second word block differ, initializing the BERT model with the mapped word block training results is far better than random initialization.
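A minimal sketch of steps S30 and S40 under assumed inputs: the matching dictionary, the trained second-corpus word block vectors, the first-corpus vocabulary and the torch.nn.Embedding stand-in for the BERT word block embedding layer are all illustrative, and 768 is BERT-base's hidden size.

```python
import torch

hidden_size = 768  # hidden size of BERT-base

# Assumed inputs: the rank matching from step S20 and the second corpus's trained vectors.
matching = {"first_blk_a": "second_blk_x", "first_blk_b": "second_blk_y"}
second_block_vectors = {
    "second_blk_x": torch.randn(hidden_size),
    "second_blk_y": torch.randn(hidden_size),
}
first_vocab = {"first_blk_a": 0, "first_blk_b": 1}  # first-corpus word block ids

# Step S30: map each first word block to the vector of its matched second word block.
first_block_vectors = {blk: second_block_vectors[matching[blk]] for blk in matching}

# Step S40: initialize a (stand-in) word block embedding layer with the mapped vectors
# instead of random values.
embedding = torch.nn.Embedding(len(first_vocab), hidden_size)
with torch.no_grad():
    for block, idx in first_vocab.items():
        embedding.weight[idx] = first_block_vectors[block]
```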
Step S50: inputting the first corpus into a pre-training model, and then training a specific task to generate a language model;
fig. 3 is a schematic structural diagram of a language model provided in the embodiment of the present application, where, as shown in fig. 3, the language model includes a BERT model and a full-connection layer for implementing a specific task, where the full-connection layer may be one layer or more than one layer.
After the BERT model is initialized with the first corpus, specific task training is further required in order to apply it to a specific task, so as to generate a usable language model.
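The structure of FIG. 3 can be sketched as a pre-trained encoder followed by a fully connected task head. The encoder argument, the hidden size and the two-class output below are illustrative assumptions; the trivial embedding-only encoder in the usage example only stands in for the initialized BERT model.

```python
import torch
import torch.nn as nn

class LanguageModelForTask(nn.Module):
    """Sketch of FIG. 3: a pre-trained (BERT-style) encoder plus a fully connected task head."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder                                 # the initialized BERT model would go here
        self.classifier = nn.Linear(hidden_size, num_labels)   # fully connected layer for the specific task

    def forward(self, token_ids):
        hidden_states = self.encoder(token_ids)                # [batch, seq_len, hidden_size]
        cls_vector = hidden_states[:, 0, :]                    # classification output C at the [CLS] position
        return self.classifier(cls_vector)

# Usage with a trivial stand-in encoder (an embedding layer only, not a real BERT):
model = LanguageModelForTask(nn.Sequential(nn.Embedding(30000, 768)))
logits = model(torch.randint(0, 30000, (2, 16)))               # [2, num_labels]
```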
In the training method of the language model provided by the embodiment of the application, the first corpus borrows the word block training results already trained on the second corpus to initialize the BERT model, and the initialized BERT model is then trained on a task with the first corpus to generate the language model. With this design, the first corpus can obtain a good training effect without depending on a large training dataset, which alleviates the problem that languages with little corpus data lack language models and makes it easier to form a language model.
In one possible implementation, as shown in fig. 4, step S50 includes:
step S51: inputting the first corpus into a pre-training model;
the pre-training model is a BERT model, whose input is the sum of three vectors, namely a word block vector (Token Embedding), a text vector (Segment Embedding) and a position vector (Position Embedding). The Token Embedding represents the embedding of the current word block; the Segment Embedding indicates the index of the sentence in which the current word block is located; the Position Embedding indicates the position index of the current word block.
The BERT model also includes some special Tokens, namely [CLS], [SEP] and [MASK]. [CLS] is placed at the first position of the first sentence; its value is learned automatically during BERT model training and is used to describe the global semantic information of the text, and its output through the BERT model is the classification output C, which can be used for subsequent classification tasks. [SEP] is used to separate the two input sentences: for input sentences A and B, a [SEP] is added after each of sentences A and B. [MASK] is used to mask some Tokens in a sentence; the masked Tokens are replaced with [MASK], and the [MASK] vectors output by the BERT model are used to predict these Tokens.
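A minimal sketch of how the three input embeddings are summed; the vocabulary size, the example ids for [CLS]/[SEP] and the sentence split are illustrative assumptions rather than the actual BERT preprocessing.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)    # Token Embedding: embedding of each word block
segment_emb = nn.Embedding(2, hidden)           # Segment Embedding: which of the two sentences
position_emb = nn.Embedding(max_len, hidden)    # Position Embedding: position index of each block

# Hypothetical ids for "[CLS] sentence A [SEP] sentence B [SEP]".
token_ids = torch.tensor([[101, 2054, 102, 2003, 102]])
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])                  # sentence A = 0, sentence B = 1
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)    # 0, 1, 2, ...

# The BERT input is the element-wise sum of the three embeddings.
bert_input = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
```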
Step S52: obtaining an output result of the first corpus after the pre-training model;
the output result of the BERT model comprises a classification output C and a text output T, wherein the classification output C is an output vector of [ CLS ] after passing through the BERT model; the text output T is an output vector of each Token in the input text after the BERT model.
Step S53: and extracting the feature vector corresponding to the specific task from the output result, and inputting the feature vector to the full connection layer.
There are various tasks in NLP, such as Classification (Classification) task, question-answering (QA) task, and Named Entity Recognition (NER), etc.
The classification task refers to classifying the input text according to set conditions; in the BERT model, classification tasks include single-sentence classification tasks and sentence-pair classification tasks. The input of a single-sentence classification task is a single sentence, used for example for text emotion analysis (like or dislike) of texts such as movie reviews or product reviews. The input of a sentence-pair classification task is a sentence pair, which can be used for text semantic matching, such as judging whether the meanings of the two sentences are the same, judging the similarity between the sentence pair, judging the relationship between the sentence pair (entailment, neutral or contradiction), and the like.
In the classification task, the feature vector is a classification output C in the BERT model output result, the classification output C is a classification vector output by [ CLS ] after passing through the BERT model, and the classification output C is input to the full connection layer to realize the classification task.
A question-and-answer task refers to giving a question and a paragraph and then marking the specific location of the answer in the paragraph. In such tasks, all text outputs T (one per Token) in the output result are input to the fully connected layer, and the fully connected layer then outputs the starting position and the ending position of the answer.
Named entity recognition refers to labeling the word blocks of the input text and judging whether each word block belongs to a person name (Person), a place name (Location), an organization name (Organization), miscellaneous (Miscellaneous) or other (Other). In this task, the text output T is input to the fully connected layer.
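For the question-answer and named entity recognition tasks, the per-word-block text outputs T are fed to the fully connected layer. The sketch below shows an assumed span-extraction head for the question-answer case; the batch size, sequence length and random tensor standing in for T are illustrative.

```python
import torch
import torch.nn as nn

hidden_size = 768
span_head = nn.Linear(hidden_size, 2)               # fully connected layer producing start/end logits

text_outputs = torch.randn(1, 128, hidden_size)     # stand-in for T: per-Token BERT output

logits = span_head(text_outputs)                    # [batch, seq_len, 2]
start_logits, end_logits = logits.split(1, dim=-1)
start_position = start_logits.squeeze(-1).argmax(dim=-1)  # predicted answer start index
end_position = end_logits.squeeze(-1).argmax(dim=-1)      # predicted answer end index
```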
Because the first corpus is a small dataset with an insufficient amount of data, while the BERT model is pre-trained in an unsupervised way on long texts, the usage scenarios do not fully match. For example, in an emotion classification task, the first corpus consists mostly of short texts and contains little long-text data, so most position embeddings are masked and under-training easily occurs. In addition, over-fitting easily occurs because the amount of text is small. Therefore, data enhancement is performed on the first corpus; the data enhancement method includes at least one of disorder, extension, truncation and MASK, and the first corpus after data enhancement can improve the generalization of the language model.
Illustratively, the enhancement method for emotion classification tasks is:
1) Let the first corpus be T; back up the first corpus T so that the total data amount is doubled, obtaining an augmented corpus T';
2) Randomly expand 20% of the texts in the augmented corpus T' to a random length between 400 and 512; the expansion can be realized by repeating the text itself multiple times.
The maximum expansion length here is 512 because the maximum input length allowed by the BERT-base model is 512; the maximum expansion length can be adjusted according to the BERT model used. For example, in an application scenario using the BERT-large model, the maximum expansion length can be increased to the maximum length allowed by the BERT-large model.
3) In the data T' - T of the augmented corpus T' that is added relative to the first corpus T, randomly reverse the word order within a maximum distance of 2 in 40% of the texts, for example abcde -> adcbe;
4) Following the BERT preprocessing operation, in the data obtained from steps 1) to 3), select 20% of the word blocks; of these, replace 80% with [MASK], replace 10% with random text, and leave 10% unchanged.
It should be noted that the embodiments of the present application are not limited to the enhancement parameters described above, and parameter adjustment may be performed for different specific tasks.
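A rough sketch of enhancement steps 1) to 4) under stated assumptions: whitespace tokenization, a plain "[MASK]" placeholder string and unseeded randomness are illustrative choices, and the percentages follow the values given above.

```python
import random

def augment_corpus(first_corpus, max_len=512):
    """Rough sketch of enhancement steps 1) to 4) for the emotion classification task."""
    # 1) Back up the first corpus T to double the total data amount (augmented corpus T').
    augmented = list(first_corpus) + list(first_corpus)

    # 2) Randomly expand 20% of the texts to a random length between 400 and max_len
    #    by repeating the text itself.
    for i, text in enumerate(augmented):
        if random.random() < 0.2:
            words = text.split()
            target = random.randint(400, max_len)
            while len(words) < target:
                words += text.split()
            augmented[i] = " ".join(words[:target])

    # 3) In the added portion T' - T, randomly swap word order with a maximum distance of 2
    #    in 40% of the texts, e.g. "a b c d e" -> "a d c b e".
    for i in range(len(first_corpus), len(augmented)):
        words = augmented[i].split()
        if random.random() < 0.4 and len(words) > 2:
            j = random.randrange(len(words) - 2)
            words[j], words[j + 2] = words[j + 2], words[j]
            augmented[i] = " ".join(words)

    # 4) BERT-style preprocessing: select 20% of word blocks; replace 80% of those with [MASK],
    #    10% with a random word block, and leave 10% unchanged.
    vocab = [w for t in augmented for w in t.split()]
    for i, text in enumerate(augmented):
        words = text.split()
        for j in range(len(words)):
            if random.random() < 0.2:
                r = random.random()
                if r < 0.8:
                    words[j] = "[MASK]"
                elif r < 0.9:
                    words[j] = random.choice(vocab)
        augmented[i] = " ".join(words)
    return augmented
```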
The loss function of the language model provided by the embodiment of the application is as follows:
Loss=L1+L2
it can be seen that the loss function comprises two parts, L1 and L2, where L1 is the loss function of the BERT model in the unsupervised training task and L2 is the loss function of the classification task.
L2 may employ the cross entropy loss function; the classification tasks include binary classification and multi-class classification.
In the binary classification case, the language model only needs to predict two outcomes, with predicted probabilities p and 1-p for the two classes; the expression is:
L2 = -(1/N) Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
where y_i is the label of sample i (1 for the positive class, 0 for the negative class) and p_i is the probability that sample i is predicted to be positive.
The multi-class case is effectively an extension of the binary case:
L2 = -(1/N) Σ_i Σ_c y_ic log(p_ic), with c running from 1 to M
where M is the number of categories; y_ic is an indicator function (0 or 1) that takes 1 if the true class of sample i equals c and 0 otherwise; and p_ic is the predicted probability that sample i belongs to category c.
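A small concrete sketch of Loss = L1 + L2; the masked-word-block cross entropy standing in for L1 and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# L1: loss of the BERT unsupervised training task, sketched here as a cross entropy over
# the predictions made at masked word block positions.
mlm_logits = torch.randn(8, 30000)           # predictions for 8 masked word blocks over a 30000-word vocab
mlm_labels = torch.randint(0, 30000, (8,))   # true ids of the masked word blocks
L1 = F.cross_entropy(mlm_logits, mlm_labels)

# L2: cross entropy loss of the classification task (binary or multi-class).
class_logits = torch.randn(4, 2)             # 4 samples, 2 classes
class_labels = torch.tensor([0, 1, 1, 0])
L2 = F.cross_entropy(class_logits, class_labels)

loss = L1 + L2                               # Loss = L1 + L2
```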
The embodiment of the application simultaneously provides an application method of the language model, wherein the language model comprises a BERT model and a full-connection layer for realizing specific tasks; the language model is trained by the training method in the above embodiment, as shown in fig. 5, and the application method includes:
step S100: acquiring an input text;
the input text here is the text on which a specific task needs to be performed, for example a comment requiring text emotion analysis.
Step S200: inputting the input text into a pre-trained model in the language model;
from the above description, the pre-training model is a BERT model, and the input of the BERT model is the sum of three vectors, so in the input step, the input text needs to be converted into the input vector meeting the requirements of the BERT model.
Step S300: obtaining a feature vector corresponding to a specific task from an output result of the pre-training model;
the output result of the BERT model comprises classified output C and text output T, wherein the classified output C is an output vector of [ CLS ] after passing through the BERT model; the text output T is an output vector of each Token in the input text after the BERT model.
Specific tasks include classification tasks, question-answer tasks and named entity recognition tasks; for the specific tasks and the relationship between specific tasks and feature vectors, reference may be made to the description above, which is not repeated here.
Step S400: the feature vectors are input into the full connection layer.
After the feature vector is input to the full connection layer, the full connection layer outputs a task result related to the specific task.
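Putting steps S100 to S400 together for the classification case, for example with the LanguageModelForTask sketch shown earlier; the whitespace tokenizer, the vocabulary dictionary and the unknown-word id 0 are illustrative assumptions.

```python
import torch

def classify(text, model, vocab, max_len=512):
    """Sketch of steps S100-S400: input text in, task result out."""
    # S100/S200: convert the input text into word block ids accepted by the pre-training model
    # (a real system would also add [CLS]/[SEP] and build the segment and position inputs).
    token_ids = [vocab.get(word, 0) for word in text.split()][:max_len]
    batch = torch.tensor([token_ids])

    # S300/S400: run the model; inside it, the [CLS] feature vector is fed to the fully connected layer.
    with torch.no_grad():
        logits = model(batch)
    return logits.argmax(dim=-1).item()      # index of the predicted class
```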
In the application method of the language model provided by the embodiment of the application, the first corpus is initialized in the pre-training model of the language model using the word block training results already trained on the second corpus, so that the first corpus can obtain a good training effect without depending on a large training dataset. This alleviates the problem that languages with little corpus data lack pre-trained models, makes it easier to form a language model, and allows specific tasks to be handled with only a small amount of corpus data.
It should be noted that, the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present application, and the devices may interact with each other to complete the methods.
It should be noted that some embodiments of the present application have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the embodiment of the application also provides a device for generating a language model, which corresponds to the method of any embodiment.
Referring to fig. 6, the language model generating apparatus includes:
a first obtaining module 100 configured to obtain a first corpus, a second corpus, and a word block training result;
the ranking matching module 200 is configured to perform word frequency ranking matching on word blocks in the first corpus and the second corpus;
the matching mapping module 300 is configured to map the word block training result to the first corpus according to the matching result;
an initialization module 400 configured to initialize word blocks in the first corpus with mapped word block training results in the pre-training model;
the task training module 500 is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
The device of the foregoing embodiment is used to implement the training method of the corresponding language model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the embodiment of the application also provides an application device of the language model, which corresponds to the method of any embodiment.
Referring to fig. 7, the language model application apparatus includes:
a second acquisition module 600 configured to acquire an input text;
an input module 700 configured to input the input text into a pre-trained model of the language models;
the feature extraction module 800 is configured to obtain a feature vector corresponding to a specific task from the output result of the pre-training model;
a specific task module 900 is configured to input the feature vector into a fully connected layer in the language model.
The language model is formed by training by the training method in the embodiment.
The device of the foregoing embodiment is configured to implement the application method of the corresponding language model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in one or more pieces of software and/or hardware when implementing the embodiments of the present application.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device corresponding to the method of any embodiment, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the program to implement the method of any embodiment.
Fig. 8 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit ), microprocessor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. Memory 1020 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in memory 1020 and executed by processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, corresponding to any of the above-described embodiments of the method, the embodiments of the present application further provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described in any of the above-described embodiments.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to perform the method of any of the foregoing embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
Those of ordinary skill in the art will appreciate that the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the embodiments of the application (including the claims) is limited to these examples. Under the concept of the embodiments of the present application, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be implemented in any order, and many other variations of the different aspects of the embodiments described above exist, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the embodiments, it should be apparent to one skilled in the art that the embodiments may be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While embodiments of the present application have been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the embodiments of the present application, are intended to be included within the scope of the embodiments of the present application.

Claims (15)

1. A method for training a language model, comprising:
acquiring a first corpus and a second corpus; inputting the second corpus into a pre-training model for training to obtain word block training results;
word frequency sorting matching is carried out on word blocks in the first corpus and the second corpus, and a matching result is obtained;
mapping the word block training result to the first corpus according to the matching result to obtain a matching relationship between a first word block in the first corpus and a second word block in the second corpus;
initializing the pre-training model by adopting first word blocks in the first corpus, wherein the first word blocks are in one-to-one correspondence with second word blocks in the second corpus;
inputting the first corpus into the pre-training model, and then training a specific task to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
2. The training method of claim 1, wherein the word-frequency ordering matching of word blocks in the first corpus and the second corpus comprises:
performing word frequency statistics on a first word block in the first corpus and a second word block in the second corpus;
forward ordering is carried out on the first word block and the second word block according to word frequency statistics results;
and establishing a matching relationship between the first word blocks and the second word blocks that occupy the same position in the ordering.
3. The training method of claim 2, wherein mapping the word block training results to the first corpus based on the matching results comprises:
acquiring word block vectors of the second word blocks based on the word block training results;
and establishing a mapping relation between the first word block and the word block vector according to a matching result.
4. A training method as claimed in claim 1 or 2 or 3, wherein the first corpus text data set is smaller than the second corpus text data set.
5. The training method according to claim 1, wherein the inputting the first corpus into the pre-training model for specific task training comprises:
inputting the first corpus into the pre-training model;
obtaining an output result of the first corpus after the pre-training model;
and extracting a characteristic vector corresponding to the specific task from the output result, and inputting the characteristic vector to a full-connection layer.
6. The training method of claim 5, wherein said inputting the first corpus into the pre-training model comprises:
performing data enhancement on the first corpus;
inputting the first corpus after data enhancement into the pre-training model;
the data enhancement method comprises at least one of disorder, extension, truncation and MASK.
7. The training method of claim 1, wherein,
the loss function of the language model is:
Loss=L1+L2;
wherein L1 is the loss function of the unsupervised training task of the BERT model; L2 is the loss function of the specific task.
8. A method of applying a language model, wherein the language model is trained and formed by the training method according to any one of claims 1 to 7; the application method comprises the following steps:
acquiring an input text;
inputting the input text into the language model for training to obtain an output result;
acquiring a feature vector corresponding to a specific task from the output result;
the feature vectors are input to a fully connected layer in the language model.
9. The application method according to claim 8, wherein the specific task includes a classification task, and the feature vector is a classification vector corresponding to a CLS input in the output result.
10. The application method according to claim 9, wherein the classification tasks include text emotion analysis and text semantic matching.
11. The application method according to claim 8, wherein the specific task includes a question-answer task and a named entity recognition task, and the feature vector is a portion of the output result corresponding to a word block in the input text.
12. A language model generating apparatus, comprising:
the first acquisition module is configured to acquire a first corpus and a second corpus; inputting the second corpus into a pre-training model for training to obtain word block training results;
the ordering and matching module is configured to perform word frequency ordering and matching on word blocks in the first corpus and the second corpus to obtain a matching result;
the matching mapping module is configured to map the word block training result to the first corpus according to the matching result to obtain a matching relationship between a first word block in the first corpus and a second word block in the second corpus;
the initialization module is configured to initialize the pre-training model by adopting first word blocks in the first corpus, which are in one-to-one correspondence with second word blocks in the second corpus;
the task training module is configured to input the first corpus into the pre-training model and then perform specific task training to generate a language model;
the pre-training model is a BERT model, and the word block training result is a word block vector of the second corpus trained by the BERT model.
13. An application device of a language model, wherein the language model is trained and formed by the training method according to any one of claims 1 to 7; the application device comprises:
a second acquisition module configured to acquire an input text;
the input module is configured to input the input text into the language model for training to obtain an output result;
the feature extraction module is configured to acquire a feature vector corresponding to a specific task from the output result;
and a specific task module configured to input the feature vector into a fully connected layer in the language model.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 11 when the program is executed.
15. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 11.
CN202110719988.7A 2021-06-28 2021-06-28 Training method and application method of language model Active CN113255328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110719988.7A CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110719988.7A CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Publications (2)

Publication Number Publication Date
CN113255328A CN113255328A (en) 2021-08-13
CN113255328B true CN113255328B (en) 2024-02-02

Family

ID=77189900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110719988.7A Active CN113255328B (en) 2021-06-28 2021-06-28 Training method and application method of language model

Country Status (1)

Country Link
CN (1) CN113255328B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656573B (en) * 2021-08-27 2024-02-06 北京大数医达科技有限公司 Text information generation method, device and terminal equipment
CN114817469B (en) * 2022-04-27 2023-08-08 马上消费金融股份有限公司 Text enhancement method, training method and training device for text enhancement model
CN115329062B (en) * 2022-10-17 2023-01-06 中邮消费金融有限公司 Dialogue model training method under low-data scene and computer equipment

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
KR20160058531A (en) * 2014-11-17 2016-05-25 포항공과대학교 산학협력단 Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
CN109255117A (en) * 2017-07-13 2019-01-22 普天信息技术有限公司 Chinese word cutting method and device
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111695336A (en) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Disease name code matching method and device, computer equipment and storage medium
CN111737994A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
WO2020242567A1 (en) * 2019-05-27 2020-12-03 Microsoft Technology Licensing, Llc Cross-lingual task training
WO2020253042A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Intelligent sentiment judgment method and device, and computer readable storage medium
CN112307181A (en) * 2020-10-28 2021-02-02 刘玲玲 Corpus-specific-corpus-based corpus extraction method and corpus extractor
CN112507101A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Method and device for establishing pre-training language model
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112908315A (en) * 2021-03-10 2021-06-04 北京思图场景数据科技服务有限公司 Question-answer intention judgment method based on voice characteristics and voice recognition
CN112966712A (en) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and device, electronic equipment and computer readable medium
CN112989828A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Training method, device, medium and electronic equipment for named entity recognition model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235567B2 (en) * 2013-01-14 2016-01-12 Xerox Corporation Multi-domain machine translation model adaptation
CN106156010B (en) * 2015-04-20 2019-10-11 阿里巴巴集团控股有限公司 Translate training method, device, system and translation on line method and device
US20180157641A1 (en) * 2016-12-07 2018-06-07 International Business Machines Corporation Automatic Detection of Required Tools for a Task Described in Natural Language Content
US11281863B2 (en) * 2019-04-18 2022-03-22 Salesforce.Com, Inc. Systems and methods for unifying question answering and text classification via span extraction
CN110580290B (en) * 2019-09-12 2022-12-13 北京小米智能科技有限公司 Method and device for optimizing training set for text classification
US11501187B2 (en) * 2019-09-24 2022-11-15 International Business Machines Corporation Opinion snippet detection for aspect-based sentiment analysis
US11620515B2 (en) * 2019-11-07 2023-04-04 Salesforce.Com, Inc. Multi-task knowledge distillation for language model
CN110717339B (en) * 2019-12-12 2020-06-30 北京百度网讯科技有限公司 Semantic representation model processing method and device, electronic equipment and storage medium

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
KR20160058531A (en) * 2014-11-17 2016-05-25 포항공과대학교 산학협력단 Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
CN109255117A (en) * 2017-07-13 2019-01-22 普天信息技术有限公司 Chinese word cutting method and device
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
WO2020242567A1 (en) * 2019-05-27 2020-12-03 Microsoft Technology Licensing, Llc Cross-lingual task training
WO2020253042A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Intelligent sentiment judgment method and device, and computer readable storage medium
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112989828A (en) * 2019-12-17 2021-06-18 医渡云(北京)技术有限公司 Training method, device, medium and electronic equipment for named entity recognition model
CN111695336A (en) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Disease name code matching method and device, computer equipment and storage medium
CN111737994A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
CN112307181A (en) * 2020-10-28 2021-02-02 刘玲玲 Corpus-specific-corpus-based corpus extraction method and corpus extractor
CN112507101A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Method and device for establishing pre-training language model
CN112966712A (en) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and device, electronic equipment and computer readable medium
CN112908315A (en) * 2021-03-10 2021-06-04 北京思图场景数据科技服务有限公司 Question-answer intention judgment method based on voice characteristics and voice recognition

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BERT-hLSTMs: BERT and hierarchical LSTMs for visual storytelling; Jing Su, Qingyun Dai et al.; Computer Standards & Interfaces; Vol. 67; full text *
A Word2vec-based content situation awareness method; 魏忠; 周俊; 石元兵; 黄明浩; Communications Technology (No. 05); full text *
Chinese entity recognition based on the BERT-BiLSTM-CRF model; 谢腾; 杨俊安; 刘辉; Computer Systems & Applications (No. 07); full text *
Processing of training corpora for language models and design of decoding dictionaries; 林小俊, 田浩 et al.; The 8th National Conference on Man-Machine Speech Communication; pp. 164-168 *

Also Published As

Publication number Publication date
CN113255328A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN113255328B (en) Training method and application method of language model
WO2022007823A1 (en) Text data processing method and device
CN108846077B (en) Semantic matching method, device, medium and electronic equipment for question and answer text
JP6601470B2 (en) NATURAL LANGUAGE GENERATION METHOD, NATURAL LANGUAGE GENERATION DEVICE, AND ELECTRONIC DEVICE
CN111539197B (en) Text matching method and device, computer system and readable storage medium
CN113239169B (en) Answer generation method, device, equipment and storage medium based on artificial intelligence
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN114676704B (en) Sentence emotion analysis method, device and equipment and storage medium
CN112632226A (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN113535912B (en) Text association method and related equipment based on graph rolling network and attention mechanism
CN114490926A (en) Method and device for determining similar problems, storage medium and terminal
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN112131884A (en) Method and device for entity classification and method and device for entity presentation
US20230130662A1 (en) Method and apparatus for analyzing multimodal data
CN116738956A (en) Prompt template generation method and device, computer equipment and storage medium
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN115129863A (en) Intention recognition method, device, equipment, storage medium and computer program product
CN115408523A (en) Medium-length and long-text classification method and system based on abstract extraction and keyword extraction
He et al. Discovering interdisciplinary research based on neural networks
CN115952317A (en) Video processing method, device, equipment, medium and program product
CN114995729A (en) Voice drawing method and device and computer equipment
CN109740162B (en) Text representation method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant