CN112084317B - Method and apparatus for pre-training language model - Google Patents

Method and apparatus for pre-training language model

Info

Publication number
CN112084317B
CN112084317B (application CN202011009914.6A)
Authority
CN
China
Prior art keywords
sample
sentence
word
task
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011009914.6A
Other languages
Chinese (zh)
Other versions
CN112084317A (en)
Inventor
王福东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011009914.6A priority Critical patent/CN112084317B/en
Publication of CN112084317A publication Critical patent/CN112084317A/en
Application granted granted Critical
Publication of CN112084317B publication Critical patent/CN112084317B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Abstract

The embodiment of the specification provides a method and a device for pre-training a language model, wherein the method comprises the following steps: acquiring a first sentence of a first role and a second sentence of a second role in a history dialogue record, the history dialogue record comprising sentences of each round of dialogue in a multi-round dialogue; splicing the first sentence and the second sentence into a first sample; masking a preset proportion of words in the first sample to obtain a second sample; superposing the word embedding vector, word type embedding vector, position embedding vector and additional embedding vector of any word in the second sample to obtain an initial word expression vector of the word; and inputting the initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, the first task being used to predict the masked words in the second sample. After pre-training, the language model is better suited to language characterization in the dialogue field.

Description

Method and apparatus for pre-training language model
Technical Field
One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for pre-training a language model.
Background
With the development of artificial intelligence, robots are increasingly used to converse with users in place of human agents, and such a conversation often needs to proceed over multiple rounds, which is referred to as a multi-round dialogue for short. During the multi-round dialogue between the robot and the user, the intent expressed by the user's sentence is identified through an intent recognition model, a corresponding robot answer sentence is given for that intent, and the established business objective is completed through this continuous interaction, for example, solving the user's problem or prompting the user to perform a preset behavior.
The intent recognition model is a classification model that determines the intent expressed by a user's sentence based on the language characterization obtained from the language model. Existing language models are general-purpose models trained on public encyclopedia corpora and cannot represent sentences in the dialogue field well; accordingly, the intent recognition model cannot accurately recognize the intent expressed by the user's sentences, and the set business target cannot be completed.
It is therefore desirable to have improved schemes that can make a language model more suitable for language characterization in the dialog domain after pre-training the language model.
Disclosure of Invention
One or more embodiments of the present specification describe a method and apparatus for pre-training a language model, which can make the language model more suitable for language characterization in the dialog field after pre-training the language model.
In a first aspect, a method of pre-training a language model for language characterization in the field of dialog is provided, the method comprising:
acquiring a first sentence of a first role in a history dialogue record of a dialogue field and a second sentence of a second role in the history dialogue record; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role;
splicing the first sentence and the second sentence into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample;
superposing a word embedding vector of any word in the second sample, a word type embedding vector of the word, a position embedding vector of the word and an additional embedding vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
Inputting an initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
In one possible implementation, the masked words in the second sample are used as a sample tag for determining the predicted loss of the first task.
In a possible implementation manner, the pre-training task further includes a second task, where the second task is used to predict whether the first sentence and the second sentence are two sentences that are connected in sequence.
Further, the first sample corresponds to a positive sample of the second task, and the first sentence and the second sentence are two sentences connected in sequence; alternatively, the first sample corresponds to a negative sample of the second task, and the first sentence and the second sentence are not two sentences that are sequentially connected.
In a possible implementation, the pre-training task further includes a third task for predicting pinyin of the masked words in the second sample.
Further, the pinyin of the masked word in the second sample is used as a sample label to determine the predicted loss of the third task.
In one possible implementation manner, the additional embedded vector includes at least one of a role embedded vector of the role to which the sentence corresponding to the word belongs and a pinyin embedded vector of the pinyin corresponding to the word;
the pre-training task further includes a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round.
Further, the first sample corresponds to a positive sample of the fourth task, and the first sentence and the second sentence are two sentences of the same round; alternatively, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same round.
In a possible implementation manner, after the pre-training the language model based on at least one pre-training task including the first task, the method further includes:
acquiring a third statement of the first role and a fourth statement of the second role in the history dialogue record; the third sentence and the fourth sentence belong to the same round;
Splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the language model after pre-training to obtain a language characterization vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a predicted intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
Further, after the fine tuning of the language model, the method further comprises:
acquiring a fifth sentence of a first role and a sixth sentence of a second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the language model after fine tuning to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain the predicted intention category corresponding to the fourth sample.
In a second aspect, there is provided an apparatus for pre-training a language model for language characterization in the field of dialog, the apparatus comprising:
A first obtaining unit, configured to obtain a first sentence of a first role in a history dialogue record in a dialogue field and a second sentence of a second role in the history dialogue record; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role;
the first sample generation unit is used for splicing the first statement and the second statement acquired by the first acquisition unit into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample;
the initial expression unit is used for superposing the word embedding vector of any word, the word type embedding vector of the word, the position embedding vector of the word and the additional embedding vector corresponding to the word in the second sample obtained by the first sample generating unit to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
the pre-training unit is used for inputting the initial word expression vector of each word in the second sample obtained by the initial expression unit into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which, when executing the executable code, implements the method of the first aspect.
Through the method and the device provided by the embodiment of the specification, first, the first statement of the first role in the history dialogue record of the dialogue field and the second statement of the second role in the history dialogue record are acquired; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role; then splicing the first sentence and the second sentence into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample; then, superposing a word embedding vector of any word in the second sample, a word type embedding vector of the word, a position embedding vector of the word and an additional embedding vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word; and finally, inputting the initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample. From the above, in the embodiment of the present disclosure, a second sample is obtained based on the history dialogue record in the dialogue domain, and the second sample is used to pretrain the language model, so that the trained language model is more suitable for the language characterization in the dialogue domain; when determining the initial word expression vector of each word in the second sample, not only the word embedding vector of any word in the second sample, the word type embedding vector of the word, and the position embedding vector of the word are superimposed, but also the additional embedding vector corresponding to the word is superimposed, where the additional embedding vector includes at least one of the round embedding vector of the round to which the sentence corresponding to the word belongs, the role embedding vector of the role to which the sentence corresponding to the word belongs, and the pinyin embedding vector of the pinyin corresponding to the word, the additional embedding vector reflects information specific to the dialogue field, and then the initial word expression vector of each word in the second sample is input into the language model, and the language model is pre-trained, so that the language model can better extract the information specific to the dialogue field, and after pre-training the language model, the language model is more suitable for language characterization in the dialogue field.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a method flow diagram of a pre-trained language model, according to one embodiment;
FIG. 3 illustrates a process diagram of a pre-trained language model according to one embodiment;
FIG. 4 illustrates a schematic block diagram of an apparatus for pre-training a language model, according to one embodiment.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in the present specification. The implementation scenario involves a pre-trained language model for language characterization in the dialog domain. Referring to fig. 1, at least a sentence of a user is input into a language model, a corresponding language characterization vector is output through the language model, the language characterization vector is input into an intention recognition model, and a corresponding predicted intention category is output through the intention recognition model. It can be understood that the intent recognition model is a classification model and is based on language characterization obtained by the language model, so whether the language model can well characterize sentences in the dialogue field has a great influence on the recognition effect of the intent recognition model.
The language model can adopt the structural design of the Bidirectional Encoder Representations from Transformers (BERT) model. In general, a BERT model is obtained after pre-training with pre-training tasks on training data derived from encyclopedia corpora. The pre-training tasks include word-masking training and continuous-sentence-prediction training: word-masking training masks several words in a passage and then predicts the masked words, and continuous-sentence-prediction training judges whether two sentences are in a contextual relationship. A BERT model trained in this way is relatively general and cannot represent sentences in the dialogue field well.
In the embodiment of the specification, a training sample is constructed based on a history dialogue record in the dialogue field, and a language model is pre-trained based on at least one pre-training task by using the training sample; and when the initial word expression vector of each word in the training sample is determined, the specific information of the dialogue field is reflected, the initial word expression vector of each word in the training sample is input into the language model, and the language model is pre-trained, so that the language model can better extract the specific information of the dialogue field, and the language model is more suitable for language characterization of the dialogue field after the language model is pre-trained.
FIG. 2 illustrates a flow diagram of a method of pre-training a language model for language characterization in the dialog domain, which may be based on the implementation scenario illustrated in FIG. 1, according to one embodiment. As shown in fig. 2, the method for pre-training the language model in this embodiment includes the steps of: step 21, obtaining a first sentence of a first role in a history dialogue record in the dialogue field and a second sentence of a second role in the history dialogue record; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role; step 22, splicing the first sentence and the second sentence into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample; step 23, superposing the word embedding vector of any word in the second sample, the word type embedding vector of the word, the position embedding vector of the word and the additional embedding vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word; step 24, inputting the initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample. Specific implementations of the above steps are described below.
Firstly, in step 21, a first sentence of a first role in a history dialogue record of the dialogue field and a second sentence of a second role in the history dialogue record are obtained; the history dialogue record includes sentences of each round of dialogue in the multiple rounds of dialogue between the first role and the second role. It will be appreciated that the parties to a conversation typically belong to different roles, e.g., one party is the customer service role and the other is the user role.
In the embodiment of the present disclosure, the history dialogue record corresponds to one session of the first role and the second role. Taking a dialogue between customer service and a user as an example, the history dialogue record includes multiple rounds of dialogue between the robot customer service and the user and, when the robot customer service fails to achieve the predetermined goal, multiple rounds of dialogue between human customer service and the user. A round of dialogue includes a customer-service sentence and a user sentence, and starts with the customer-service sentence.
It can be appreciated that the first sentence and the second sentence may belong to the same round of dialogue or to different rounds of dialogue. In the embodiment of the present disclosure, a sentence is not limited to a single grammatical sentence; depending on the actual expression in the dialogue, it may be a word, one sentence, or two sentences. Such a sentence is the actual expression of a party in the conversation and may therefore also be referred to as an utterance.
In the embodiment of the present disclosure, the history dialogue record may be a history dialogue record of an intelligent outbound scenario, in which a robot interacts with a user through outbound phone calls to complete an outbound task with a specific target; the history dialogue record may also be a history dialogue record of a user inbound-call scenario, in which the user interacts with a robot or human customer service through an inbound phone call to complete consultation on specific problems.
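To make the data layout concrete, the following Python sketch shows one possible in-memory representation of such a history dialogue record; the class and field names are illustrative assumptions, not part of the patent.

```python
# A minimal sketch (illustrative names, not from the patent) of one session's
# history dialogue record: each utterance carries its speaker role, its text
# and the round it belongs to, recorded in chronological order.
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    role: str      # e.g. "customer_service" or "user"
    text: str      # the actual expression; may be a word, a sentence or two sentences
    round_id: int  # index of the dialogue round this utterance belongs to

@dataclass
class DialogueRecord:
    session_id: str
    utterances: List[Utterance]  # all sentences of the multi-round dialogue

# Hypothetical record mirroring Table 1 below.
record = DialogueRecord(
    session_id="session-001",
    utterances=[
        Utterance("customer_service", "sentence 1", 1),
        Utterance("user", "sentence 2", 1),
        Utterance("customer_service", "sentence 3", 2),
        Utterance("user", "sentence 4", 2),
    ],
)
```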
Then, at step 22, the first sentence and the second sentence are spliced into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample. It will be appreciated that this second sample corresponds to a pre-training task of word masking training of the BERT model.
The predetermined ratio may be a small value, for example, 10% or 15%.
The preset word may be an ordinary Chinese character or a special mark, for example, the "[mask]" mark. In one example, a first proportion of the words to be masked are replaced with the "[mask]" mark, a second proportion are replaced with a randomly sampled word, and a third proportion are left unchanged.
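As an illustration of this masking step, the sketch below masks a preset proportion of the words in a spliced sample. The 80/10/10 split among the "[mask]" mark, a random word and the unchanged word is an assumption borrowed from common BERT practice; the patent only speaks of a first, second and third proportion without giving numbers.

```python
import random

MASK_TOKEN = "[mask]"

def mask_sample(words, vocab, mask_ratio=0.15):
    """Mask roughly `mask_ratio` of the words in a spliced first sample.

    Returns the corrupted word list (the second sample) and a label list that
    holds the original word at masked positions and None elsewhere.
    """
    corrupted, labels = list(words), [None] * len(words)
    for i, word in enumerate(words):
        if random.random() < mask_ratio:
            labels[i] = word                         # masked word becomes the sample tag
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK_TOKEN            # replace with the special mark
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # replace with a random word
            # else: keep the original word unchanged
    return corrupted, labels
```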
Then, in step 23, the word embedding vector of any word in the second sample, the word type embedding vector of the word, the position embedding vector of the word and the additional embedding vector corresponding to the word are overlapped to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which the sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of the pinyin corresponding to the word. It may be understood that, the word embedding vector of a word, the word type embedding vector of the word, and the position embedding vector of the word are superimposed to obtain an initial word expression vector of the word, which is a manner adopted by a general BERT model, and in this embodiment of the present disclosure, an additional embedding vector corresponding to the word is further superimposed on this basis, where the additional embedding vector is specific to the dialogue domain, so that after the language model is pre-trained based on the initial word expression vector, the language model can learn information specific to the dialogue domain.
Introducing the pinyin embedding vector of the pinyin corresponding to a word into the initial word expression vector of the word helps the pre-trained language model suppress automatic speech recognition (ASR) errors.
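A minimal PyTorch sketch of the superposition in step 23 follows: the three standard BERT embeddings and the three dialogue-specific embeddings are looked up per word and summed. Layer names, dimensions and the final normalization are illustrative assumptions rather than the patent's actual implementation.

```python
import torch
import torch.nn as nn

class DialogueInputEmbedding(nn.Module):
    def __init__(self, vocab_size, pinyin_vocab_size, hidden=768,
                 max_positions=512, max_rounds=32, num_roles=2, num_types=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)            # word embedding
        self.word_type = nn.Embedding(num_types, hidden)        # word type embedding
        self.position = nn.Embedding(max_positions, hidden)     # position embedding
        self.round = nn.Embedding(max_rounds, hidden)           # additional: round embedding
        self.role = nn.Embedding(num_roles, hidden)             # additional: role embedding
        self.pinyin = nn.Embedding(pinyin_vocab_size, hidden)   # additional: pinyin embedding
        self.norm = nn.LayerNorm(hidden)

    def forward(self, word_ids, type_ids, round_ids, role_ids, pinyin_ids):
        positions = torch.arange(word_ids.size(1), device=word_ids.device)
        x = (self.word(word_ids) + self.word_type(type_ids) + self.position(positions)
             + self.round(round_ids) + self.role(role_ids) + self.pinyin(pinyin_ids))
        return self.norm(x)  # normalized initial word expression vectors
```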
Finally, at step 24, the initial word expression vector for each word in the second sample is input into the language model, which is pre-trained based on at least one pre-training task including a first task for predicting the masked words in the second sample. It will be appreciated that since only a predetermined proportion of words in the second sample are masked, the language model may predict the masked words in the second sample based on the context of the masked words. The first task may correspond to a word masking training pre-training task of a general BERT model, and by executing the pre-training task, the language model may better implement language characterization in the dialog field.
In one example, the masked words in the second sample are used as sample tags to determine the predicted loss of the first task.
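For illustration, the prediction loss of the first task can be computed as a cross entropy over the vocabulary evaluated only at the masked positions, with the masked words serving as labels. The sketch assumes a language-modeling head that produces per-position logits and that unmasked positions are encoded with an ignore index.

```python
import torch.nn.functional as F

def masked_word_loss(logits, label_ids, ignore_index=-100):
    """First-task loss.

    logits:    [batch, seq_len, vocab_size] scores from an assumed prediction head
    label_ids: [batch, seq_len] original word ids at masked positions,
               `ignore_index` at all other positions
    """
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           label_ids.view(-1),
                           ignore_index=ignore_index)
```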
In one example, the pre-training task further includes a second task for predicting whether the first sentence and the second sentence are two sentences connected in sequence. The second task may correspond to a continuous sentence predictive training pre-training task of a general BERT model.
Further, the first sample corresponds to a positive sample of the second task, and the first sentence and the second sentence are two sentences connected in sequence; alternatively, the first sample corresponds to a negative sample of the second task, and the first sentence and the second sentence are not two sentences connected in sequence. Taking the history dialogue record shown in Table 1 as an example, the following describes what it means for two sentences to be connected in sequence.
Table one: historical dialog records
Roles and roles Statement Round of
Customer service Statement 1 1
User' s Statement 2 1
Customer service Statement 3 2
User' s Statement 4 2
Customer service Statement 5 3
User' s Statement 6 3
Referring to Table 1, the sentences in the history dialogue record are recorded in chronological order. Sentence 1 and sentence 2 are two sentences connected in sequence, and so are sentence 2 and sentence 3, but sentence 1 and sentence 3 are not two sentences connected in sequence.
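A possible way to draw positive and negative samples for the second task from such a record is sketched below, reusing the DialogueRecord structure from the earlier sketch; the 50/50 sampling ratio and the helper name are assumptions for illustration.

```python
import random

def make_sequential_pair(record, negative_prob=0.5):
    """Return (first_sentence, second_sentence, label) for the second task,
    where label = 1 means the two sentences are connected in sequence.
    Assumes the record contains at least three utterances."""
    utts = record.utterances
    i = random.randrange(len(utts) - 1)
    first = utts[i].text
    if random.random() < negative_prob:
        # negative sample: a sentence that does not directly follow `first`
        j = random.choice([k for k in range(len(utts)) if k not in (i, i + 1)])
        return first, utts[j].text, 0
    return first, utts[i + 1].text, 1  # positive sample: consecutive sentences
```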
In one example, the pre-training task further includes a third task for predicting the pinyin of the masked words in the second sample. The third task is adapted to a scenario specific to the dialogue field: during a dialogue, speech is often recognized as text, and ASR errors sometimes occur in this process; the third task helps to effectively suppress such errors.
Further, the pinyin of the masked word in the second sample is used as a sample label to determine the predicted loss of the third task.
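For illustration, pinyin labels for the third task can be derived from the original (pre-masking) words, for example with the open-source pypinyin package; the patent does not name a particular tool, so this is only an assumed implementation.

```python
from pypinyin import lazy_pinyin

def pinyin_labels(original_words, mask_flags):
    """Return the pinyin of each masked word as the third-task sample label,
    and None at unmasked positions."""
    return [lazy_pinyin(word)[0] if masked else None
            for word, masked in zip(original_words, mask_flags)]
```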
In one example, the additional embedded vector includes at least one of a role embedded vector of the role to which the sentence corresponding to the word belongs and a pinyin embedded vector of the pinyin corresponding to the word; the pre-training task further includes a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round. The fourth task is also adapted to a scenario specific to the dialogue field and helps the language model express round information.
Further, the first sample corresponds to a positive sample of the fourth task, and the first sentence and the second sentence are two sentences of the same round; alternatively, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same round.
In the embodiment of the present disclosure, in order for the language model to better adapt to the target task, fine-tuning training is further required on the language model on the target task, where the target task may be, but is not limited to, an intention recognition task.
In one example, after the pre-training the language model based on at least one pre-training task including the first task, the method further comprises:
acquiring a third statement of the first role and a fourth statement of the second role in the history dialogue record; the third sentence and the fourth sentence belong to the same round;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the language model after pre-training to obtain a language characterization vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a predicted intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
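A minimal sketch of one fine-tuning step on the intent recognition task is shown below. It assumes the pre-trained language model returns a sentence-level characterization vector for the third sample and that the intent recognition model is a classifier head; all names, the batch layout and the optimizer usage are illustrative assumptions.

```python
import torch.nn.functional as F

def fine_tune_step(language_model, intent_model, optimizer, batch):
    """One gradient step on the intent recognition task."""
    language_model.train()
    repr_vec = language_model(batch["word_ids"], batch["type_ids"],
                              batch["round_ids"], batch["role_ids"],
                              batch["pinyin_ids"])            # [batch, hidden] characterization
    logits = intent_model(repr_vec)                           # [batch, num_intent_classes]
    loss = F.cross_entropy(logits, batch["intent_labels"])    # actual vs. predicted intent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```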
In the embodiment of the specification, after the fine-tuning training is performed on the language model on the target task, the target task can be executed based on the language model.
In one example, after the fine tuning of the language model, the method further comprises:
acquiring a fifth sentence of a first role and a sixth sentence of a second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
Splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the language model after fine tuning to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain the predicted intention category corresponding to the fourth sample.
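For completeness, a sketch of the inference step just described, under the same assumptions as the fine-tuning sketch:

```python
import torch

@torch.no_grad()
def predict_intent(language_model, intent_model, fourth_sample_features):
    """Return the predicted intent category for a fourth sample."""
    language_model.eval()
    repr_vec = language_model(**fourth_sample_features)  # language characterization vector
    return intent_model(repr_vec).argmax(dim=-1)         # predicted intent category
```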
FIG. 3 illustrates a process diagram of pre-training the language model according to one embodiment. Referring to FIG. 3, a sample is spliced from a robot utterance (context) and the corresponding user utterance (query) taken from the history dialogue logs of different outbound application scenarios, where the history dialogue logs may also be referred to as history dialogue records. In the example in the figure, the robot utterance asks the user to return the money and the user answers "no money". For such a sample, the word embedding vector of each word in the sample, the word type embedding vector of the word and the position embedding vector of the word are first acquired; these three embedding vectors belong to the original BERT model. On this basis, three additional embedding vectors are added: the round embedding vector of the round to which the sentence corresponding to the word belongs, the role embedding vector of the role to which the sentence corresponding to the word belongs, and the pinyin embedding vector of the pinyin corresponding to the word. The round embedding vector helps the language model better learn dialogue knowledge of different rounds; the role embedding vector introduces role information and helps the language model better learn the different speaking styles of the robot and the user; and the pinyin embedding vector is used to suppress sample instability caused by ASR errors. All embedding vectors are added, processed by regularization, and input into the language model. Two pre-training tasks are added on the basis of the conventional pre-training tasks of the BERT model: the conventional pre-training tasks include the first task, which uses the surrounding text to predict the missing text, and the second task, which predicts whether a robot utterance and a user utterance are connected in sequence; the two added pre-training tasks include the third task, which uses the pinyin of the surrounding text to predict the pinyin of the missing text, and the fourth task, a binary classification task that predicts whether the robot utterance and the user utterance belong to the same round.
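The four pre-training tasks around FIG. 3 can be combined into one training objective, for example by summing their losses; the equal weighting below is an assumption, since the patent does not specify loss weights. The sketch reuses the masked_word_loss helper from the first-task example above, and the output/label keys are illustrative.

```python
import torch.nn.functional as F

def pretraining_loss(outputs, labels):
    loss = masked_word_loss(outputs["word_logits"], labels["word_ids"])          # task 1: masked words
    loss += F.cross_entropy(outputs["next_logits"], labels["is_next"])           # task 2: connected in sequence
    loss += masked_word_loss(outputs["pinyin_logits"], labels["pinyin_ids"])     # task 3: pinyin of masked words
    loss += F.cross_entropy(outputs["same_round_logits"], labels["same_round"])  # task 4: same round
    return loss
```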
Through the method provided by the embodiment of the specification, first, a first sentence of a first role in a history dialogue record in the dialogue field and a second sentence of a second role in the history dialogue record are acquired; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role; then splicing the first sentence and the second sentence into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample; then, superposing a word embedding vector of any word in the second sample, a word type embedding vector of the word, a position embedding vector of the word and an additional embedding vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word; and finally, inputting the initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample. From the above, in the embodiment of the present disclosure, a second sample is obtained based on the history dialogue record in the dialogue domain, and the second sample is used to pretrain the language model, so that the trained language model is more suitable for the language characterization in the dialogue domain; when determining the initial word expression vector of each word in the second sample, not only the word embedding vector of any word in the second sample, the word type embedding vector of the word, and the position embedding vector of the word are superimposed, but also the additional embedding vector corresponding to the word is superimposed, where the additional embedding vector includes at least one of the round embedding vector of the round to which the sentence corresponding to the word belongs, the role embedding vector of the role to which the sentence corresponding to the word belongs, and the pinyin embedding vector of the pinyin corresponding to the word, the additional embedding vector reflects information specific to the dialogue field, and then the initial word expression vector of each word in the second sample is input into the language model, and the language model is pre-trained, so that the language model can better extract the information specific to the dialogue field, and after pre-training the language model, the language model is more suitable for language characterization in the dialogue field.
According to an embodiment of another aspect, there is further provided an apparatus for pre-training a language model, which is used for executing the method for pre-training a language model provided in the embodiments of the present specification. FIG. 4 illustrates a schematic block diagram of an apparatus for pre-training a language model, according to one embodiment. As shown in fig. 4, the apparatus 400 includes:
a first obtaining unit 41, configured to obtain a first sentence of a first role in a history dialogue record in a dialogue domain and a second sentence of a second role in the history dialogue record; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role;
a first sample generation unit 42 that splices the first sentence and the second sentence acquired by the first acquisition unit 41 into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample;
an initial expression unit 43, configured to superimpose a word embedding vector of any word, a word type embedding vector of the word, a position embedding vector of the word, and an additional embedding vector corresponding to the word in the second sample obtained by the first sample generating unit 42, so as to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
A pre-training unit 44, configured to input the initial word expression vector of each word in the second sample obtained by the initial expression unit 43 into the language model, and pre-train the language model based on at least one pre-training task including a first task, where the first task is used for predicting the masked word in the second sample.
Optionally, as an embodiment, the masked word in the second sample is used as a sample tag for determining a prediction loss of the first task.
Optionally, as an embodiment, the pre-training task further includes a second task, where the second task is used to predict whether the first sentence and the second sentence are two sentences that are connected in sequence.
Further, the first sample corresponds to a positive sample of the second task, and the first sentence and the second sentence are two sentences connected in sequence; alternatively, the first sample corresponds to a negative sample of the second task, and the first sentence and the second sentence are not two sentences that are sequentially connected.
Optionally, as an embodiment, the pre-training task further includes a third task, where the third task is used to predict pinyin of the masked word in the second sample.
Further, the pinyin of the masked word in the second sample is used as a sample label to determine the predicted loss of the third task.
Optionally, as an embodiment, the additional embedded vector includes at least one of a role embedded vector of the role to which the sentence corresponding to the word belongs and a pinyin embedded vector of the pinyin corresponding to the word;
the pre-training task further includes a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round.
Further, the first sample corresponds to a positive sample of the fourth task, and the first sentence and the second sentence are two sentences of the same round; alternatively, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same round.
Optionally, as an embodiment, the apparatus further includes:
a second obtaining unit, configured to obtain a third sentence of the first role and a fourth sentence of the second role in the history dialogue record after the pre-training unit performs pre-training on the language model based on at least one pre-training task including the first task; the third sentence and the fourth sentence belong to the same round;
A second sample generation unit, configured to splice the third sentence and the fourth sentence acquired by the second acquisition unit into a third sample;
the language characterization unit is used for inputting the initial word expression vector of each word in the third sample obtained by the second sample generation unit into the language model after pre-training to obtain the language characterization vector of the third sample;
the prediction unit is used for inputting the language characterization vector of the third sample obtained by the language characterization unit into an intention recognition model to obtain a prediction intention category corresponding to the third sample;
and the fine tuning unit is used for fine tuning the language model according to the actual intention category corresponding to the third sample and the predicted intention category obtained by the prediction unit.
Further, the apparatus further comprises:
a third obtaining unit, configured to obtain, after the fine tuning unit performs fine tuning on the language model, a fifth sentence of the first role and a sixth sentence of the second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
a third sample generation unit, configured to splice the fifth sentence and the sixth sentence acquired by the third acquisition unit into a fourth sample;
The language characterization unit is further configured to input the fourth sample obtained by the third sample generation unit into the language model after fine tuning to obtain a language characterization vector of the fourth sample;
the prediction unit is further configured to input the language characterization vector of the fourth sample obtained by the language characterization unit into the intent recognition model, so as to obtain a predicted intent category corresponding to the fourth sample.
With the apparatus provided by the embodiment of the present specification, first, the first obtaining unit 41 obtains a first sentence of a first character in a history dialogue record in a dialogue field, and a second sentence of a second character in the history dialogue record; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role; then the first sample generation unit 42 concatenates the first sentence and the second sentence into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample; then, the initial expression unit 43 superimposes the word embedding vector of any word in the second sample, the word type embedding vector of the word, the position embedding vector of the word and the additional embedding vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word; finally, the pre-training unit 44 inputs the initial word expression vector of each word in the second sample into the language model, and pre-trains the language model based on at least one pre-training task including a first task for predicting the masked words in the second sample. From the above, in the embodiment of the present disclosure, a second sample is obtained based on the history dialogue record in the dialogue domain, and the second sample is used to pretrain the language model, so that the trained language model is more suitable for the language characterization in the dialogue domain; when determining the initial word expression vector of each word in the second sample, not only the word embedding vector of any word in the second sample, the word type embedding vector of the word, and the position embedding vector of the word are superimposed, but also the additional embedding vector corresponding to the word is superimposed, where the additional embedding vector includes at least one of the round embedding vector of the round to which the sentence corresponding to the word belongs, the role embedding vector of the role to which the sentence corresponding to the word belongs, and the pinyin embedding vector of the pinyin corresponding to the word, the additional embedding vector reflects information specific to the dialogue field, and then the initial word expression vector of each word in the second sample is input into the language model, and the language model is pre-trained, so that the language model can better extract the information specific to the dialogue field, and after pre-training the language model, the language model is more suitable for language characterization in the dialogue field.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing embodiments are provided to illustrate the general principles of the present invention in further detail and are not to be construed as limiting the scope of the invention; any modifications, equivalents, improvements, etc. made based on the teachings of the invention are intended to fall within the scope of protection of the invention.

Claims (22)

1. A method of pre-training a language model for language characterization in the field of dialog, the method comprising:
acquiring a first sentence of a first role in a history dialogue record of a dialogue field and a second sentence of a second role in the history dialogue record; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role;
splicing the first sentence and the second sentence into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample;
superposing a word embedding vector of any word in the second sample, a word type embedding vector of the word, a position embedding vector of the word and an additional embedding vector corresponding to the word to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
inputting an initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
2. The method of claim 1, wherein the masked words in the second sample are used as sample tags for determining a predicted penalty for the first task.
3. The method of claim 1, wherein the pre-training task further comprises a second task for predicting whether the first sentence and the second sentence are two sentences connected in sequence.
4. A method as claimed in claim 3, wherein the first sample corresponds to a positive sample of the second task, the first statement and the second statement being two consecutive statements; alternatively, the first sample corresponds to a negative sample of the second task, and the first sentence and the second sentence are not two sentences that are sequentially connected.
5. The method of claim 1, wherein the pre-training task further comprises a third task for predicting pinyin for the masked word in the second sample.
6. The method of claim 5, wherein pinyin of the masked word in the second sample is used as a sample tag to determine a predicted loss of the third task.
7. The method of claim 1, wherein the additional embedded vector comprises at least one of a role embedded vector of the role to which the sentence corresponding to the word belongs and a pinyin embedded vector of the pinyin corresponding to the word;
The pre-training task further includes a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round.
8. The method of claim 7, wherein the first sample corresponds to a positive sample of the fourth task, the first statement and the second statement being two statements of a same round; alternatively, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same round.
9. The method of claim 1, wherein after the pre-training the language model based on at least one pre-training task including a first task, the method further comprises:
acquiring a third statement of the first role and a fourth statement of the second role in the history dialogue record; the third sentence and the fourth sentence belong to the same round;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the language model after pre-training to obtain a language characterization vector of the third sample;
Inputting the language characterization vector of the third sample into an intention recognition model to obtain a predicted intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
10. The method of claim 9, wherein after said fine-tuning the language model, the method further comprises:
acquiring a fifth sentence of a first role and a sixth sentence of a second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the language model after fine tuning to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain the predicted intention category corresponding to the fourth sample.
11. An apparatus for pre-training a language model for language characterization in the field of dialog, the apparatus comprising:
a first obtaining unit, configured to obtain a first sentence of a first role in a history dialogue record in a dialogue field and a second sentence of a second role in the history dialogue record; wherein the history dialogue record comprises sentences of each round of dialogue in the multi-round dialogue of the first role and the second role;
The first sample generation unit is used for splicing the first statement and the second statement acquired by the first acquisition unit into a first sample; masking the words with preset proportion in the first sample by using preset words to obtain a second sample;
the initial expression unit is used for superposing the word embedding vector of any word, the word type embedding vector of the word, the position embedding vector of the word and the additional embedding vector corresponding to the word in the second sample obtained by the first sample generating unit to obtain an initial word expression vector of the word; the additional embedded vector comprises at least one of a round embedded vector of a round to which a sentence corresponding to the word belongs, a role embedded vector of a role to which the sentence corresponding to the word belongs and a pinyin embedded vector of a pinyin corresponding to the word;
the pre-training unit is used for inputting the initial word expression vector of each word in the second sample obtained by the initial expression unit into the language model, and pre-training the language model based on at least one pre-training task including a first task, wherein the first task is used for predicting the masked words in the second sample.
12. The apparatus of claim 11, wherein the masked words in the second sample are used as sample tags for determining a predicted loss of the first task.
13. The apparatus of claim 11, wherein the pre-training task further comprises a second task for predicting whether the first sentence and the second sentence are two sentences connected in sequence.
14. The apparatus of claim 13, wherein the first sample corresponds to a positive sample of the second task, the first statement and the second statement being two sequentially connected statements; alternatively, the first sample corresponds to a negative sample of the second task, and the first sentence and the second sentence are not two sentences that are sequentially connected.
15. The apparatus of claim 11, wherein the pre-training task further comprises a third task to predict pinyin for a masked word in the second sample.
16. The apparatus of claim 15, wherein pinyin of the masked word in the second sample is used as a sample tag to determine a predicted loss of the third task.
17. The apparatus of claim 11, wherein the additional embedded vector comprises at least one of a role embedded vector of the role to which the sentence corresponding to the word belongs and a pinyin embedded vector of the pinyin corresponding to the word;
The pre-training task further includes a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round.
18. The apparatus of claim 17, wherein the first sample corresponds to a positive sample of the fourth task, the first statement and the second statement being two statements of a same round; alternatively, the first sample corresponds to a negative sample of the fourth task, and the first sentence and the second sentence are not two sentences of the same round.
19. The apparatus of claim 11, wherein the apparatus further comprises:
a second obtaining unit, configured to obtain a third sentence of the first role and a fourth sentence of the second role in the history dialogue record after the pre-training unit performs pre-training on the language model based on at least one pre-training task including the first task; the third sentence and the fourth sentence belong to the same round;
a second sample generation unit, configured to splice the third sentence and the fourth sentence acquired by the second acquisition unit into a third sample;
the language characterization unit is used for inputting the initial word expression vector of each word in the third sample obtained by the second sample generation unit into the language model after pre-training, to obtain a language characterization vector of the third sample;
the prediction unit is used for inputting the language characterization vector of the third sample obtained by the language characterization unit into an intention recognition model, to obtain a predicted intention category corresponding to the third sample;
and the fine-tuning unit is used for fine-tuning the language model according to the actual intention category corresponding to the third sample and the predicted intention category obtained by the prediction unit.
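A minimal sketch of the fine-tuning step in claim 19 is shown below, assuming the intention recognition model is a linear classifier over the language characterization vector and that the language model returns per-token hidden states whose first position is used as the sample representation; these details, as well as the optimizer and learning rate, are assumptions for the example.

```python
import torch
import torch.nn as nn

class IntentionRecognitionModel(nn.Module):
    """Maps the language characterization vector of a sample to intention-category logits."""
    def __init__(self, hidden_dim, num_intents):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_intents)

    def forward(self, sample_repr):                 # [batch, hidden_dim]
        return self.classifier(sample_repr)

def fine_tune_step(language_model, intent_model, optimizer, batch, actual_intents):
    """One fine-tuning step: predicted vs. actual intention category drives the loss."""
    optimizer.zero_grad()
    hidden = language_model(**batch)                # assumed to return [batch, seq_len, hidden_dim]
    logits = intent_model(hidden[:, 0])             # first-position vector as the sample representation
    loss = nn.CrossEntropyLoss()(logits, actual_intents)
    loss.backward()                                 # gradients flow into both models
    optimizer.step()
    return loss.item()

# example optimizer covering both the language model and the intention recognition model
# optimizer = torch.optim.AdamW(
#     list(language_model.parameters()) + list(intent_model.parameters()), lr=2e-5)
```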
20. The apparatus of claim 19, wherein the apparatus further comprises:
a third obtaining unit, configured to obtain, after the fine-tuning unit performs fine-tuning on the language model, a fifth sentence of the first role and a sixth sentence of the second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
a third sample generation unit, configured to splice the fifth sentence and the sixth sentence acquired by the third acquisition unit into a fourth sample;
the language characterization unit is further configured to input the fourth sample obtained by the third sample generation unit into the language model after fine-tuning, to obtain a language characterization vector of the fourth sample;
the prediction unit is further configured to input the language characterization vector of the fourth sample obtained by the language characterization unit into the intention recognition model, to obtain a predicted intention category corresponding to the fourth sample.
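At inference time (claim 20), the fine-tuned language model and the intention recognition model are applied to the spliced fourth sample from the current dialogue; a sketch follows, with the same assumptions about model outputs as in the fine-tuning example above.

```python
import torch

@torch.no_grad()
def predict_intent(language_model, intent_model, fourth_sample_inputs):
    """Predicts the intention category for the spliced sentence pair of the current round."""
    hidden = language_model(**fourth_sample_inputs)   # fine-tuned language model
    logits = intent_model(hidden[:, 0])
    return logits.argmax(dim=-1)                      # predicted intention category
```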
21. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-10.
22. A computing device comprising a memory having executable code stored therein and a processor, which when executing the executable code, implements the method of any of claims 1-10.
CN202011009914.6A 2020-09-23 2020-09-23 Method and apparatus for pre-training language model Active CN112084317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011009914.6A CN112084317B (en) 2020-09-23 2020-09-23 Method and apparatus for pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011009914.6A CN112084317B (en) 2020-09-23 2020-09-23 Method and apparatus for pre-training language model

Publications (2)

Publication Number Publication Date
CN112084317A CN112084317A (en) 2020-12-15
CN112084317B true CN112084317B (en) 2023-11-14

Family

ID=73739659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011009914.6A Active CN112084317B (en) 2020-09-23 2020-09-23 Method and apparatus for pre-training language model

Country Status (1)

Country Link
CN (1) CN112084317B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905772B (en) * 2021-02-10 2022-04-19 网易有道信息技术(北京)有限公司 Semantic correlation analysis method and device and related products
CN113177113B (en) * 2021-05-27 2023-07-25 中国平安人寿保险股份有限公司 Task type dialogue model pre-training method, device, equipment and storage medium
CN113609275B (en) * 2021-08-24 2024-03-26 腾讯科技(深圳)有限公司 Information processing method, device, equipment and storage medium
CN113688245B (en) * 2021-08-31 2023-09-26 中国平安人寿保险股份有限公司 Processing method, device and equipment of pre-training language model based on artificial intelligence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563208B (en) * 2019-01-29 2023-06-30 株式会社理光 Method and device for identifying intention and computer readable storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001057851A1 (en) * 2000-02-02 2001-08-09 Famoice Technology Pty Ltd Speech system
CN109992648A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 The word-based depth text matching technique and device for migrating study
CN111291166A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for training language model based on Bert

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张鹏远; 卢春晖; 王睿敏. Chinese prosodic structure prediction based on a pre-trained language representation model. Journal of Tianjin University (Science and Technology). 2020, (03), full text. *
徐菲菲; 冯东升. Research on text word vectors and pre-trained language models. Journal of Shanghai University of Electric Power. 2020, (04), full text. *

Also Published As

Publication number Publication date
CN112084317A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN112084317B (en) Method and apparatus for pre-training language model
CN111309889B (en) Method and device for text processing
CN110413746B (en) Method and device for identifying intention of user problem
CN109670035B (en) Text abstract generating method
WO2019200923A1 (en) Pinyin-based semantic recognition method and device and human-machine conversation system
CN110472224B (en) Quality of service detection method, apparatus, computer device and storage medium
CN111212190B (en) Conversation management method, device and system based on conversation strategy management
CN112951240B (en) Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN108897896B (en) Keyword extraction method based on reinforcement learning
CN110543554A (en) Classification method and device for multi-turn conversations
CN113268610B (en) Intent jump method, device, equipment and storage medium based on knowledge graph
CN111339302A (en) Method and device for training element classification model
CN110399472B (en) Interview question prompting method and device, computer equipment and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN110021293A (en) Audio recognition method and device, readable storage medium storing program for executing
CN110717021A (en) Input text and related device for obtaining artificial intelligence interview
CN112395887A (en) Dialogue response method, dialogue response device, computer equipment and storage medium
CN110706710A (en) Voice recognition method and device, electronic equipment and storage medium
CN115269836A (en) Intention identification method and device
CN111046674A (en) Semantic understanding method and device, electronic equipment and storage medium
KR20210059995A (en) Method for Evaluating Foreign Language Speaking Based on Deep Learning and System Therefor
CN113792133B (en) Question judging method and device, electronic equipment and medium
CN115346520A (en) Method, apparatus, electronic device and medium for speech recognition
CN115270728A (en) Conference record processing method, device, equipment and storage medium
CN111782775B (en) Dialogue method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant