CN112084317A - Method and apparatus for pre-training a language model - Google Patents

Method and apparatus for pre-training a language model

Info

Publication number
CN112084317A
Authority
CN
China
Prior art keywords: sample, word, statement, task, vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011009914.6A
Other languages
Chinese (zh)
Other versions
CN112084317B (en)
Inventor
王福东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011009914.6A
Publication of CN112084317A
Application granted
Publication of CN112084317B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this specification provide a method and apparatus for pre-training a language model. The method comprises: acquiring a first sentence of a first role and a second sentence of a second role from a historical dialogue record, where the historical dialogue record comprises the sentences of each round of a multi-round dialogue; splicing the first sentence and the second sentence into a first sample; masking a preset proportion of the words in the first sample to obtain a second sample; superposing, for any word in the second sample, its word embedding vector, word type embedding vector, position embedding vector, and an additional embedding vector to obtain the initial word expression vector of that word; and inputting the initial word expression vectors of the words in the second sample into the language model and pre-training it based on at least one pre-training task including a first task, the first task being used to predict the masked words in the second sample. After such pre-training, the language model is better suited to language characterization in the dialogue domain.

Description

Method and apparatus for pre-training a language model
Technical Field
One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for pre-training a language model.
Background
With the development of artificial intelligence, robots are increasingly used in place of human agents to converse with users, and such a conversation often needs to span multiple rounds, that is, a multi-round dialogue. During the multi-round dialogue between the robot and the user, an intention recognition model identifies the intention expressed by each user sentence, the robot gives a response sentence suited to that intention, and through this continued interaction a given business goal is accomplished, such as answering the user's question or prompting the user to perform a predetermined action.
The intention recognition model is a classification model that determines the intention expressed by the user's sentences based on the language representation produced by a language model. Existing language models are general-purpose models trained on public encyclopedia corpora; they cannot represent sentences in the dialogue domain well, so the intention recognition model cannot accurately recognize the intentions expressed by the user's sentences and in turn cannot accomplish the given business goal.
Therefore, an improved approach is desired that, after pre-training, makes a language model better suited to language characterization in the dialogue domain.
Disclosure of Invention
One or more embodiments of this specification describe a method and apparatus for pre-training a language model such that, after pre-training, the language model is better suited to language characterization in the dialogue domain.
In a first aspect, a method for pre-training a language model is provided, the language model being used for language characterization in the dialogue domain, the method comprising:
acquiring a first sentence of a first role and a second sentence of a second role in a historical dialogue record of the dialogue domain; wherein the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role;
splicing the first sentence and the second sentence into a first sample; masking a preset proportion of the words in the first sample with a preset word to obtain a second sample;
superposing, for any word in the second sample, the word embedding vector of that word, the word type embedding vector of that word, the position embedding vector of that word, and an additional embedding vector corresponding to that word, to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of a round embedding vector of the round to which the sentence containing the word belongs, a role embedding vector of the role to which that sentence belongs, and a pinyin embedding vector of the pinyin corresponding to the word;
and inputting the initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, the first task being used to predict the masked words in the second sample.
In one possible embodiment, the words masked in the second sample serve as sample labels for determining the prediction loss of the first task.
In one possible implementation, the pre-training task further includes a second task, the second task being used to predict whether the first sentence and the second sentence are two sentences connected in sequence.
Further, the first sample corresponds to a positive sample of the second task, in which case the first sentence and the second sentence are two sentences connected in sequence; or the first sample corresponds to a negative sample of the second task, in which case the first sentence and the second sentence are not two sentences connected in sequence.
In one possible embodiment, the pre-training task further comprises a third task, the third task being used to predict the pinyin of the masked words in the second sample.
Further, the pinyin of the masked words in the second sample serves as the sample labels for determining the prediction loss of the third task.
In one possible implementation, the additional embedding vector includes at least one of the role embedding vector of the role to which the sentence containing the word belongs and the pinyin embedding vector of the pinyin corresponding to the word;
the pre-training task further comprises a fourth task, the fourth task being used to predict whether the first sentence and the second sentence are two sentences of the same round.
Further, the first sample corresponds to a positive sample of the fourth task, in which case the first sentence and the second sentence are two sentences of the same round; or the first sample corresponds to a negative sample of the fourth task, in which case the first sentence and the second sentence are not two sentences of the same round.
In one possible embodiment, after the language model is pre-trained based on at least one pre-training task including the first task, the method further comprises:
acquiring a third sentence of the first role and a fourth sentence of the second role in the historical dialogue record; the third sentence and the fourth sentence belong to the same round;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the pre-trained language model to obtain a language characterization vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a predicted intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
Further, after the fine-tuning of the language model, the method further includes:
acquiring a fifth sentence of the first role and a sixth sentence of the second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the fine-tuned language model to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain a predicted intention category corresponding to the fourth sample.
In a second aspect, an apparatus for pre-training a language model is provided, the language model being used for language characterization in the dialogue domain, the apparatus comprising:
a first acquisition unit configured to acquire a first sentence of a first role in a historical dialogue record of the dialogue domain and a second sentence of a second role in the historical dialogue record; wherein the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role;
a first sample generation unit configured to splice the first sentence and the second sentence acquired by the first acquisition unit into a first sample, and to mask a preset proportion of the words in the first sample with a preset word to obtain a second sample;
an initial expression unit configured to superpose, for any word in the second sample obtained by the first sample generation unit, the word embedding vector of that word, the word type embedding vector of that word, the position embedding vector of that word, and the additional embedding vector corresponding to that word, to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector of the pinyin corresponding to the word;
and a pre-training unit configured to input the initial word expression vector of each word in the second sample obtained by the initial expression unit into the language model, and to pre-train the language model based on at least one pre-training task including a first task, the first task being used to predict the masked words in the second sample.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and apparatus provided by the embodiments of this specification, a first sentence of a first role and a second sentence of a second role are first acquired from a historical dialogue record of the dialogue domain, where the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role. The first sentence and the second sentence are then spliced into a first sample, and a preset proportion of the words in the first sample are masked with a preset word to obtain a second sample. Next, for any word in the second sample, the word embedding vector of that word, the word type embedding vector of that word, the position embedding vector of that word, and the additional embedding vector corresponding to that word are superposed to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector of the pinyin corresponding to the word. Finally, the initial word expression vector of each word in the second sample is input into the language model, and the language model is pre-trained based on at least one pre-training task including a first task for predicting the masked words in the second sample. As can be seen from the above, in the embodiments of this specification the second sample is obtained from a historical dialogue record of the dialogue domain, so pre-training on it makes the trained language model better suited to language representation in the dialogue domain. Moreover, when determining the initial word expression vector of each word in the second sample, not only are the word embedding vector, word type embedding vector, and position embedding vector superposed, but also the additional embedding vector, which embodies information specific to the dialogue domain. Pre-training the language model on these initial word expression vectors therefore enables it to better extract dialogue-specific information, so that after pre-training the language model is more suitable for language characterization of the dialogue domain.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a method of pre-training a language model, according to one embodiment;
FIG. 3 illustrates a process diagram of a pre-trained language model according to one embodiment;
FIG. 4 shows a schematic block diagram of an apparatus to pre-train a language model, according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The scenario involves a pre-trained language model used for language characterization in the dialogue domain. Referring to fig. 1, at least a sentence of the user is input into the language model, which outputs the corresponding language characterization vector; the language characterization vector is then input into the intention recognition model, which outputs the corresponding predicted intention category. It can be understood that the intention recognition model is a classification model built on the language representation obtained by the language model, so whether the language model can represent sentences in the dialogue domain well has a great influence on the recognition performance of the intention recognition model.
The language model can adopt the structure of the Bidirectional Encoder Representations from Transformers (BERT) model. Typically, a BERT model is obtained through pre-training tasks on training data drawn from encyclopedia corpora. The pre-training tasks include word-mask training and consecutive-sentence prediction training: word-mask training masks several words in a passage and then predicts the masked words, while consecutive-sentence prediction training judges whether two sentences stand in a contextual (preceding and following) relationship. A BERT model trained in this way is relatively general-purpose and cannot represent sentences in the dialogue domain well.
In the embodiments of this specification, training samples are constructed from historical dialogue records of the dialogue domain and used to pre-train the language model based on at least one pre-training task. Furthermore, the initial word expression vector determined for each word in a training sample embodies information specific to the dialogue domain. Inputting these initial word expression vectors into the language model and pre-training it therefore enables the model to better extract dialogue-specific information, so that after pre-training the language model is more suitable for language characterization of the dialogue domain.
Fig. 2 shows a flow diagram of a method for pre-training a language model used for language characterization in the dialogue domain, according to one embodiment; the method may be based on the implementation scenario shown in fig. 1. As shown in fig. 2, the method of this embodiment includes the following steps. Step 21: acquire a first sentence of a first role in a historical dialogue record of the dialogue domain and a second sentence of a second role in the historical dialogue record, where the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role. Step 22: splice the first sentence and the second sentence into a first sample, and mask a preset proportion of the words in the first sample with a preset word to obtain a second sample. Step 23: superpose, for any word in the second sample, the word embedding vector of that word, the word type embedding vector of that word, the position embedding vector of that word, and the additional embedding vector corresponding to that word, to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector of the pinyin corresponding to the word. Step 24: input the initial word expression vector of each word in the second sample into the language model, and pre-train the language model based on at least one pre-training task including a first task for predicting the masked words in the second sample. Specific ways of executing the above steps are described below.
First, in step 21, a first sentence of a first role in a historical dialogue record of the dialogue domain and a second sentence of a second role in the historical dialogue record are acquired; the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role. It will be appreciated that the two parties to a dialogue usually belong to different roles; for example, one role is customer service and the other is the user.
In the embodiments of this specification, the historical dialogue record corresponds to one session between the first role and the second role. Taking a dialogue between customer service and a user as an example, the historical dialogue record includes multiple rounds of dialogue between the robot customer service and the user, and, when the robot customer service cannot achieve the predetermined target, multiple rounds of dialogue between the human customer service and the user. One round of dialogue comprises a customer-service sentence and a user sentence, and begins with the customer-service sentence.
It is understood that the first sentence and the second sentence may belong to the same round of dialogue or to different rounds. In the embodiments of this specification, a "sentence" is not limited to a single grammatical sentence: based on the actual expression in the dialogue, it may be a single word, one sentence, two sentences, and so on. These sentences are the actual utterances of the parties in the dialogue and may therefore also be referred to as dialogue turns.
In the embodiments of this specification, the historical dialogue record may come from an intelligent outbound-call scenario, in which the robot interacts with the user through outbound telephone calls to complete an outbound task with a specific goal; it may also come from a user call-in scenario, in which the user interacts with a robot or human customer service by calling in, to complete a consultation on a specific question.
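For concreteness, such a historical dialogue record can be pictured as an ordered list of (role, round, sentence) entries. The sketch below is a minimal illustration under that assumption; the field names and the sample dialogue are illustrative and not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    role: str       # "customer_service" or "user" (illustrative labels)
    round_id: int   # 1-based round number within the session
    text: str       # the actual sentence spoken in the dialogue

# One session, recorded in time order; each round starts with a
# customer-service sentence followed by a user sentence.
history = [
    Utterance("customer_service", 1, "请问您今天方便还款吗"),
    Utterance("user",             1, "我现在没有钱"),
    Utterance("customer_service", 2, "您大概什么时候可以还上"),
    Utterance("user",             2, "下个月发工资之后"),
]

# A first sentence of the first role and a second sentence of the second
# role may come from the same round or from different rounds.
first_sentence = history[0].text   # customer-service sentence, round 1
second_sentence = history[3].text  # user sentence, round 2
```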
Then, in step 22, the first sentence and the second sentence are spliced into a first sample, and a preset proportion of the words in the first sample are masked with a preset word to obtain a second sample. It will be appreciated that the second sample corresponds to the word-mask pre-training task of the BERT model.
The preset proportion may be a small value, for example, 10% or 15%.
The preset word may be an ordinary Chinese character, or may be a special mark, for example the special mark "[mask]". In one example, at a first proportion the word to be masked is replaced with the "[mask]" mark, at a second proportion it is replaced with a randomly sampled word, and at a third proportion it is left unchanged.
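A minimal sketch of this masking scheme follows. The concrete proportions (15% of positions selected, and of those 80% replaced by "[mask]", 10% by a random word, 10% left unchanged) are assumptions in the spirit of BERT; the text above only states that there are three proportions, not their values.

```python
import random

VOCAB = list("请问您今天方便还款吗我现在没有钱")  # toy vocabulary for random replacement

def mask_words(tokens, mask_rate=0.15):
    """Mask a preset proportion of words in a sample.

    Returns (masked_tokens, labels), where labels[i] holds the original
    word at a masked position and None elsewhere, so the masked words can
    later serve as sample labels for the first task."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                       # remember the original word
            r = random.random()
            if r < 0.8:
                masked.append("[mask]")              # replace with the special mark
            elif r < 0.9:
                masked.append(random.choice(VOCAB))  # replace with a random word
            else:
                masked.append(tok)                   # leave the word unchanged
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# First sample: first sentence spliced with second sentence.
first_sample = list("请问您今天方便还款吗" + "我现在没有钱")
second_sample, mlm_labels = mask_words(first_sample)
```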
Then, in step 23, the word embedding vector of any word in the second sample, the word type embedding vector of that word, the position embedding vector of that word, and the additional embedding vector corresponding to that word are superposed to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector of the pinyin corresponding to the word. It can be understood that superposing the word embedding vector, word type embedding vector, and position embedding vector of a word to obtain its initial word expression vector is the approach adopted by an ordinary BERT model.
Introducing the pinyin embedding vector of the pinyin corresponding to a word into that word's initial word expression vector helps the pre-trained language model suppress automatic speech recognition (ASR) errors.
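The superposition in step 23 can be sketched as six lookup tables whose outputs are added element-wise. All sizes below (vocabulary, round, role, and pinyin table sizes, hidden width) are illustrative assumptions, and reading the later "regularization" step of fig. 3 as layer normalization is likewise an assumption:

```python
import torch
import torch.nn as nn

class DialogueInputEmbedding(nn.Module):
    """Initial word expression vectors: the three BERT embeddings (word,
    word type, position) plus the additional round / role / pinyin ones."""

    def __init__(self, vocab_size=21128, num_types=2, max_positions=512,
                 max_rounds=32, num_roles=2, num_pinyins=1300, hidden=768):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.word_type = nn.Embedding(num_types, hidden)
        self.position = nn.Embedding(max_positions, hidden)
        self.round = nn.Embedding(max_rounds, hidden)    # round of the sentence
        self.role = nn.Embedding(num_roles, hidden)      # role of the speaker
        self.pinyin = nn.Embedding(num_pinyins, hidden)  # pinyin of the word
        self.norm = nn.LayerNorm(hidden)  # assumed reading of "regularization"

    def forward(self, word_ids, type_ids, round_ids, role_ids, pinyin_ids):
        # word_ids etc.: (batch, seq_len) integer id tensors
        positions = torch.arange(word_ids.size(1), device=word_ids.device)
        x = (self.word(word_ids) + self.word_type(type_ids)
             + self.position(positions)          # broadcast over the batch
             + self.round(round_ids) + self.role(role_ids)
             + self.pinyin(pinyin_ids))
        return self.norm(x)  # (batch, seq_len, hidden) initial word vectors
```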
Finally, in step 24, the initial word expression vector of each word in the second sample is input into the language model, and the language model is pre-trained based on at least one pre-training task including a first task for predicting the masked words in the second sample. It will be appreciated that since only a preset proportion of the words in the second sample are masked, the language model can predict a masked word from the context around it. The first task corresponds to the word-mask pre-training task of a typical BERT model, and performing it enables the language model to better realize language characterization of the dialogue domain.
In one example, the masked words in the second sample serve as sample labels for determining the prediction loss of the first task.
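A common way to realize this, sketched below under the usual PyTorch convention, is to compute cross-entropy only at the masked positions by marking every other position with an ignore index; this is an assumed implementation, not one mandated by the text.

```python
import torch.nn.functional as F

def first_task_loss(logits, label_ids):
    """Prediction loss of the first task.

    logits: (batch, seq_len, vocab_size) scores from the language model.
    label_ids: (batch, seq_len) holding the original word id at each masked
    position and -100 everywhere else, so only masked words contribute."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (batch*seq, vocab)
        label_ids.view(-1),                # flatten to (batch*seq,)
        ignore_index=-100,                 # skip the unmasked positions
    )
```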
In one example, the pre-training task further includes a second task for predicting whether the first sentence and the second sentence are two sentences connected in sequence. This second task corresponds to the consecutive-sentence prediction pre-training task of a typical BERT model.
Further, the first sample corresponds to a positive sample of the second task, in which case the first sentence and the second sentence are two sentences connected in sequence; or the first sample corresponds to a negative sample of the second task, in which case the first sentence and the second sentence are not two sentences connected in sequence. The historical dialogue record shown in Table one is used below as an example to explain what two sentences connected in sequence are.
Table one: historical conversation record
Character Sentence Number of rounds
Customer service Statement 1 1
User' s Statement 2 1
Customer service Statement 3 2
User' s Statement 4 2
Customer service Statement 5 3
User' s Statement 6 3
Referring to Table one, the sentences in the historical dialogue record are recorded in time order: sentence 1 and sentence 2 are two sentences connected in sequence, and sentence 2 and sentence 3 are likewise two sentences connected in sequence, but sentence 1 and sentence 3 are not two sentences connected in sequence.
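Positive and negative samples for the second task can be drawn from a record like Table one as sketched below; the even split between positive and negative pairs is an assumption in the style of BERT's next-sentence prediction.

```python
import random

# Sentences of Table one in time order, as (role, sentence) pairs.
record = [("customer_service", "sentence 1"), ("user", "sentence 2"),
          ("customer_service", "sentence 3"), ("user", "sentence 4"),
          ("customer_service", "sentence 5"), ("user", "sentence 6")]

def make_second_task_sample(record):
    """Return (first_sentence, second_sentence, label), where label 1 means
    the two sentences are connected in sequence and 0 means they are not."""
    i = random.randrange(len(record) - 1)
    first = record[i][1]
    if random.random() < 0.5:
        return first, record[i + 1][1], 1  # positive: the true successor
    # Negative: any sentence other than the first one and its true successor.
    j = random.choice([k for k in range(len(record)) if k not in (i, i + 1)])
    return first, record[j][1], 0
```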
In one example, the pre-training tasks further include a third task for predicting the pinyin of the masked words in the second sample. The third task is adapted to a scenario specific to the dialogue domain: speech is often transcribed into text during a dialogue, and ASR errors sometimes occur in that process; the third task can effectively suppress such errors.
Further, the pinyin of the masked words in the second sample serves as the sample labels for determining the prediction loss of the third task.
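Pinyin labels for the third task can be derived from the masked words themselves, for example with the pypinyin package; the choice of library and the toneless-syllable label space are assumptions, since the text does not name either.

```python
# pip install pypinyin  (assumed library; the patent does not name one)
from pypinyin import lazy_pinyin

def third_task_labels(masked_words, pinyin_vocab):
    """Map each masked word to the id of its toneless pinyin syllable, which
    serves as the sample label for the pinyin-prediction task."""
    return [pinyin_vocab[lazy_pinyin(word)[0]] for word in masked_words]

pinyin_vocab = {"kuan": 0, "qian": 1}                 # toy pinyin-to-id table
print(third_task_labels(["款", "钱"], pinyin_vocab))  # -> [0, 1]
```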
In one example, the additional embedding vector includes at least one of the role embedding vector of the role to which the sentence containing the word belongs and the pinyin embedding vector of the pinyin corresponding to the word, and the pre-training task further comprises a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round. The fourth task is likewise adapted to a scenario specific to the dialogue domain and helps the language model express round information. Note that in this case the additional embedding vector does not include the round embedding vector, presumably because a round embedding would directly reveal the answer to the fourth task.
Further, the first sample corresponds to a positive sample of the fourth task, in which case the first sentence and the second sentence are two sentences of the same round; or the first sample corresponds to a negative sample of the fourth task, in which case the first sentence and the second sentence are not two sentences of the same round.
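Labels for the fourth task come directly from the round numbers in the record; a minimal sketch under the same illustrative layout as the earlier snippets:

```python
import random

# Sentences of Table one as (round, sentence) pairs.
record = [(1, "sentence 1"), (1, "sentence 2"), (2, "sentence 3"),
          (2, "sentence 4"), (3, "sentence 5"), (3, "sentence 6")]

def make_fourth_task_sample(record):
    """Return (first_sentence, second_sentence, label), where label 1 means
    the two sentences belong to the same round and 0 means they do not."""
    (round_a, sent_a), (round_b, sent_b) = random.sample(record, 2)
    return sent_a, sent_b, int(round_a == round_b)
```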
In the embodiments of this specification, in order to make the language model fit the target task better, the language model also needs fine-tuning training on the target task, which may be, but is not limited to, an intention recognition task.
In one example, after the language model is pre-trained based on at least one pre-training task including the first task, the method further comprises:
acquiring a third sentence of the first role and a fourth sentence of the second role in the historical dialogue record; the third sentence and the fourth sentence belong to the same round;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the pre-trained language model to obtain a language characterization vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a predicted intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
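A minimal sketch of one such fine-tuning step follows. Taking the language characterization vector from the first position of the encoder output and using a linear layer as the intention recognition model are both common choices and are assumed here; the encoder below is only a stand-in for the pre-trained language model.

```python
import torch
import torch.nn as nn

hidden, num_intents = 768, 20                 # illustrative sizes
encoder = nn.TransformerEncoder(              # stand-in for the pre-trained model
    nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True),
    num_layers=2)
intent_head = nn.Linear(hidden, num_intents)  # the intention recognition model

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(intent_head.parameters()), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(third_sample_vectors, actual_intent):
    """third_sample_vectors: (batch, seq_len, hidden) initial word expression
    vectors of the third sample; actual_intent: (batch,) intent class ids."""
    hidden_states = encoder(third_sample_vectors)
    sentence_vec = hidden_states[:, 0]     # language characterization vector
    logits = intent_head(sentence_vec)     # scores per intention category
    loss = loss_fn(logits, actual_intent)  # predicted vs. actual intention
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```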
In the embodiments of this specification, after the language model has undergone fine-tuning training on the target task, the target task can be executed based on the language model.
In one example, after the fine-tuning of the language model, the method further comprises:
acquiring a fifth sentence of the first role and a sixth sentence of the second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the fine-tuned language model to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain a predicted intention category corresponding to the fourth sample.
FIG. 3 illustrates a process diagram of pre-training a language model according to one embodiment. Referring to fig. 3, the robot's utterance (context) and the corresponding user's utterance (query) are extracted from the historical dialogue logs of different outbound application scenarios and spliced into a sample; the historical dialogue logs may also be referred to as historical dialogue records. In the example in the figure, the robot says "pay back" and the user answers "no money". For a sample, the word embedding vector of each word, the word type embedding vector of the word, and the position embedding vector of the word are first acquired; these are the three embedding vectors of the original BERT model. On this basis, three additional embedding vectors are added: the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector corresponding to the word. The round embedding vector helps the language model better learn the dialogue knowledge of different rounds; the role embedding vector introduces role information, helping the model learn the different language-style knowledge of robot and user; and the pinyin embedding vector suppresses the sample instability caused by ASR errors. All of the embedding vectors are then added together and, after regularization, input into the language model. On top of the traditional pre-training tasks of the BERT model, namely the first task, which uses the surrounding text to predict the missing text, and the second task, a binary task that predicts whether the robot's utterance and the user's utterance are connected in sequence, two pre-training tasks are added: the third task uses the pinyin of the surrounding text to predict the pinyin of the missing text, and the fourth task is a binary task that predicts whether the robot's utterance and the user's utterance belong to the same round.
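Putting the four tasks together, pre-training can minimize a combination of the four per-task losses; the equal weighting in the sketch below is an assumption, as the text does not specify how the losses are combined.

```python
def pretraining_loss(mask_word_loss, next_sentence_loss,
                     mask_pinyin_loss, same_round_loss):
    """Total loss over the first to fourth pre-training tasks; equal
    weights are assumed and could be tuned per task in practice."""
    return (mask_word_loss + next_sentence_loss
            + mask_pinyin_loss + same_round_loss)
```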
According to the method provided by the embodiments of this specification, a first sentence of a first role and a second sentence of a second role are first acquired from a historical dialogue record of the dialogue domain, where the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role. The first sentence and the second sentence are then spliced into a first sample, and a preset proportion of the words in the first sample are masked with a preset word to obtain a second sample. Next, for any word in the second sample, the word embedding vector, word type embedding vector, position embedding vector, and additional embedding vector of that word are superposed to obtain its initial word expression vector; the additional embedding vector comprises at least one of the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector of the pinyin corresponding to the word. Finally, the initial word expression vector of each word in the second sample is input into the language model, and the language model is pre-trained based on at least one pre-training task including a first task for predicting the masked words in the second sample. Because the second sample is obtained from a historical dialogue record of the dialogue domain, pre-training on it makes the trained language model better suited to language representation in the dialogue domain; and because the additional embedding vector embodies information specific to the dialogue domain, pre-training on the resulting initial word expression vectors enables the language model to better extract that information, so that after pre-training the language model is more suitable for language characterization of the dialogue domain.
According to an embodiment of another aspect, an apparatus for pre-training a language model is also provided, which is used to execute the method for pre-training a language model provided by the embodiments of this specification. Fig. 4 shows a schematic block diagram of an apparatus for pre-training a language model according to one embodiment. As shown in fig. 4, the apparatus 400 includes:
a first acquisition unit 41, configured to acquire a first sentence of a first role in a historical dialogue record of the dialogue domain and a second sentence of a second role in the historical dialogue record; wherein the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role;
a first sample generation unit 42, configured to splice the first sentence and the second sentence acquired by the first acquisition unit 41 into a first sample, and to mask a preset proportion of the words in the first sample with a preset word to obtain a second sample;
an initial expression unit 43, configured to superpose, for any word in the second sample obtained by the first sample generation unit 42, the word embedding vector of that word, the word type embedding vector of that word, the position embedding vector of that word, and the additional embedding vector corresponding to that word, to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector of the pinyin corresponding to the word;
and a pre-training unit 44, configured to input the initial word expression vector of each word in the second sample obtained by the initial expression unit 43 into the language model, and to pre-train the language model based on at least one pre-training task including a first task for predicting the masked words in the second sample.
Optionally, as an embodiment, the words masked in the second sample serve as sample labels for determining the prediction loss of the first task.
Optionally, as an embodiment, the pre-training task further includes a second task for predicting whether the first sentence and the second sentence are two sentences connected in sequence.
Further, the first sample corresponds to a positive sample of the second task, in which case the first sentence and the second sentence are two sentences connected in sequence; or the first sample corresponds to a negative sample of the second task, in which case the first sentence and the second sentence are not two sentences connected in sequence.
Optionally, as an embodiment, the pre-training task further includes a third task for predicting the pinyin of the masked words in the second sample.
Further, the pinyin of the masked words in the second sample serves as the sample labels for determining the prediction loss of the third task.
Optionally, as an embodiment, the additional embedding vector includes at least one of the role embedding vector of the role to which the sentence containing the word belongs and the pinyin embedding vector of the pinyin corresponding to the word;
the pre-training task further comprises a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round.
Further, the first sample corresponds to a positive sample of the fourth task, in which case the first sentence and the second sentence are two sentences of the same round; or the first sample corresponds to a negative sample of the fourth task, in which case the first sentence and the second sentence are not two sentences of the same round.
Optionally, as an embodiment, the apparatus further includes:
a second acquisition unit, configured to acquire a third sentence of the first role and a fourth sentence of the second role in the historical dialogue record after the pre-training unit has pre-trained the language model based on at least one pre-training task including the first task; the third sentence and the fourth sentence belong to the same round;
a second sample generation unit, configured to splice the third sentence and the fourth sentence acquired by the second acquisition unit into a third sample;
a language characterization unit, configured to input the initial word expression vector of each word in the third sample obtained by the second sample generation unit into the pre-trained language model to obtain a language characterization vector of the third sample;
a prediction unit, configured to input the language characterization vector of the third sample obtained by the language characterization unit into an intention recognition model to obtain a predicted intention category corresponding to the third sample;
and a fine-tuning unit, configured to fine-tune the language model according to the actual intention category corresponding to the third sample and the predicted intention category obtained by the prediction unit.
Further, the apparatus further comprises:
a third acquisition unit, configured to acquire a fifth sentence of the first role and a sixth sentence of the second role in the current dialogue after the language model has been fine-tuned by the fine-tuning unit; the fifth sentence and the sixth sentence belong to the same round;
a third sample generation unit, configured to splice the fifth sentence and the sixth sentence acquired by the third acquisition unit into a fourth sample;
the language characterization unit is further configured to input the fourth sample obtained by the third sample generation unit into the fine-tuned language model to obtain a language characterization vector of the fourth sample;
and the prediction unit is further configured to input the language characterization vector of the fourth sample obtained by the language characterization unit into the intention recognition model to obtain a predicted intention category corresponding to the fourth sample.
With the apparatus provided by the embodiments of this specification, the first acquisition unit 41 first acquires a first sentence of a first role and a second sentence of a second role from a historical dialogue record of the dialogue domain, where the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role. The first sample generation unit 42 then splices the first sentence and the second sentence into a first sample and masks a preset proportion of the words in the first sample with a preset word to obtain a second sample. The initial expression unit 43 then superposes, for any word in the second sample, the word embedding vector, word type embedding vector, position embedding vector, and additional embedding vector of that word to obtain its initial word expression vector; the additional embedding vector comprises at least one of the round embedding vector of the round to which the sentence containing the word belongs, the role embedding vector of the role to which that sentence belongs, and the pinyin embedding vector of the pinyin corresponding to the word. Finally, the pre-training unit 44 inputs the initial word expression vector of each word in the second sample into the language model and pre-trains it based on at least one pre-training task including a first task for predicting the masked words in the second sample. Because the second sample is obtained from a historical dialogue record of the dialogue domain, pre-training on it makes the trained language model better suited to language representation in the dialogue domain; and because the additional embedding vector embodies information specific to the dialogue domain, pre-training on the resulting initial word expression vectors enables the language model to better extract that information, so that after pre-training the language model is more suitable for language characterization of the dialogue domain.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing specific embodiments further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, or the like made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (22)

1. A method of pre-training a language model used for language characterization in the dialogue domain, the method comprising:
acquiring a first sentence of a first role in a historical dialogue record of the dialogue domain and a second sentence of a second role in the historical dialogue record; wherein the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role;
splicing the first sentence and the second sentence into a first sample; masking a preset proportion of the words in the first sample with a preset word to obtain a second sample;
superposing, for any word in the second sample, the word embedding vector of that word, the word type embedding vector of that word, the position embedding vector of that word, and an additional embedding vector corresponding to that word, to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of a round embedding vector of the round to which the sentence containing the word belongs, a role embedding vector of the role to which that sentence belongs, and a pinyin embedding vector of the pinyin corresponding to the word;
and inputting the initial word expression vector of each word in the second sample into the language model, and pre-training the language model based on at least one pre-training task including a first task, the first task being used to predict the masked words in the second sample.
2. The method of claim 1, wherein the words masked in the second sample serve as sample labels for determining the prediction loss of the first task.
3. The method of claim 1, wherein the pre-training task further comprises a second task for predicting whether the first sentence and the second sentence are two sentences connected in sequence.
4. The method of claim 3, wherein the first sample corresponds to a positive sample of the second task, in which case the first sentence and the second sentence are two sentences connected in sequence; or the first sample corresponds to a negative sample of the second task, in which case the first sentence and the second sentence are not two sentences connected in sequence.
5. The method of claim 1, wherein the pre-training task further comprises a third task for predicting the pinyin of the masked words in the second sample.
6. The method of claim 5, wherein the pinyin of the masked words in the second sample serves as the sample labels for determining the prediction loss of the third task.
7. The method of claim 1, wherein the additional embedding vector comprises at least one of a role embedding vector of the role to which the sentence containing the word belongs and a pinyin embedding vector of the pinyin corresponding to the word;
and the pre-training task further comprises a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round.
8. The method of claim 7, wherein the first sample corresponds to a positive sample of the fourth task, in which case the first sentence and the second sentence are two sentences of the same round; or the first sample corresponds to a negative sample of the fourth task, in which case the first sentence and the second sentence are not two sentences of the same round.
9. The method of claim 1, wherein after the pre-training of the language model based on at least one pre-training task comprising the first task, the method further comprises:
acquiring a third sentence of the first role and a fourth sentence of the second role in the historical dialogue record; the third sentence and the fourth sentence belong to the same round;
splicing the third sentence and the fourth sentence into a third sample;
inputting the initial word expression vector of each word in the third sample into the pre-trained language model to obtain a language characterization vector of the third sample;
inputting the language characterization vector of the third sample into an intention recognition model to obtain a predicted intention category corresponding to the third sample;
and fine-tuning the language model according to the actual intention category and the predicted intention category corresponding to the third sample.
10. The method of claim 9, wherein after the fine-tuning of the language model, the method further comprises:
acquiring a fifth sentence of the first role and a sixth sentence of the second role in the current dialogue; the fifth sentence and the sixth sentence belong to the same round;
splicing the fifth sentence and the sixth sentence into a fourth sample;
inputting the fourth sample into the fine-tuned language model to obtain a language characterization vector of the fourth sample;
and inputting the language characterization vector of the fourth sample into the intention recognition model to obtain a predicted intention category corresponding to the fourth sample.
11. An apparatus for pre-training a language model used for language characterization in the dialogue domain, the apparatus comprising:
a first acquisition unit, configured to acquire a first sentence of a first role in a historical dialogue record of the dialogue domain and a second sentence of a second role in the historical dialogue record; wherein the historical dialogue record comprises the sentences of each round of a multi-round dialogue between the first role and the second role;
a first sample generation unit, configured to splice the first sentence and the second sentence acquired by the first acquisition unit into a first sample, and to mask a preset proportion of the words in the first sample with a preset word to obtain a second sample;
an initial expression unit, configured to superpose, for any word in the second sample obtained by the first sample generation unit, the word embedding vector of that word, the word type embedding vector of that word, the position embedding vector of that word, and an additional embedding vector corresponding to that word, to obtain the initial word expression vector of that word; the additional embedding vector comprises at least one of a round embedding vector of the round to which the sentence containing the word belongs, a role embedding vector of the role to which that sentence belongs, and a pinyin embedding vector of the pinyin corresponding to the word;
and a pre-training unit, configured to input the initial word expression vector of each word in the second sample obtained by the initial expression unit into the language model, and to pre-train the language model based on at least one pre-training task including a first task, the first task being used to predict the masked words in the second sample.
12. The apparatus of claim 11, wherein the words masked in the second sample serve as sample labels for determining the prediction loss of the first task.
13. The apparatus of claim 11, wherein the pre-training task further comprises a second task for predicting whether the first sentence and the second sentence are two sentences connected in sequence.
14. The apparatus of claim 13, wherein the first sample corresponds to a positive sample of the second task, in which case the first sentence and the second sentence are two sentences connected in sequence; or the first sample corresponds to a negative sample of the second task, in which case the first sentence and the second sentence are not two sentences connected in sequence.
15. The apparatus of claim 11, wherein the pre-training task further comprises a third task for predicting the pinyin of the masked words in the second sample.
16. The apparatus of claim 15, wherein the pinyin of the masked words in the second sample serves as the sample labels for determining the prediction loss of the third task.
17. The apparatus of claim 11, wherein the additional embedding vector comprises at least one of a role embedding vector of the role to which the sentence containing the word belongs and a pinyin embedding vector of the pinyin corresponding to the word;
and the pre-training task further comprises a fourth task for predicting whether the first sentence and the second sentence are two sentences of the same round.
18. The apparatus of claim 17, wherein the first sample corresponds to a positive sample of the fourth task, in which case the first sentence and the second sentence are two sentences of the same round; or the first sample corresponds to a negative sample of the fourth task, in which case the first sentence and the second sentence are not two sentences of the same round.
19. The apparatus of claim 11, wherein the apparatus further comprises:
a second obtaining unit, configured to obtain a third sentence of the first role and a fourth sentence of the second role in the historical dialog record after the pre-training unit pre-trains the language model based on at least one pre-training task including the first task; the third sentence and the fourth sentence belong to the same turn;
a second sample generation unit, configured to splice the third statement and the fourth statement acquired by the second acquisition unit into a third sample;
a language characterization unit, configured to input the initial word expression vector of each word in the third sample obtained by the second sample generation unit into the pre-trained language model, so as to obtain a language characterization vector of the third sample;
a prediction unit, configured to input the language characterization vector of the third sample obtained by the language characterization unit into an intention recognition model, so as to obtain a predicted intention category corresponding to the third sample;
and a fine-tuning unit, configured to fine-tune the language model according to the actual intention category corresponding to the third sample and the predicted intention category obtained by the prediction unit.
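The fine-tuning of claim 19 can be pictured as a supervised step over pairs of third samples and actual intention categories; `language_model`, `intent_model`, and `optimizer` below are stand-ins for the claimed units, not actual APIs:

```python
import torch.nn.functional as F

def finetune_step(language_model, intent_model, third_sample_inputs,
                  actual_intent, optimizer):
    """One fine-tuning step: characterize the third sample, predict its
    intention category, and update on the gap between the predicted and the
    actual intention category."""
    char_vec = language_model(third_sample_inputs)  # language characterization vector
    logits = intent_model(char_vec)                 # scores over intention categories
    loss = F.cross_entropy(logits, actual_intent)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```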
20. The apparatus of claim 19, wherein the apparatus further comprises:
a third acquisition unit, configured to acquire, after the language model is fine-tuned by the fine-tuning unit, a fifth statement of the first role and a sixth statement of the second role in the current dialog, wherein the fifth statement and the sixth statement belong to the same turn;
a third sample generation unit, configured to splice the fifth statement and the sixth statement acquired by the third acquisition unit into a fourth sample;
the language characterization unit is further configured to input the fourth sample obtained by the third sample generation unit into the fine-tuned language model, so as to obtain a language characterization vector of the fourth sample;
the prediction unit is further configured to input the language characterization vector of the fourth sample obtained by the language characterization unit into the intention recognition model, so as to obtain a predicted intention category corresponding to the fourth sample.
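Claim 20 is the corresponding inference path with the fine-tuned model; a sketch under the same assumed interfaces:

```python
import torch

@torch.no_grad()
def predict_intent(language_model, intent_model, fourth_sample_inputs):
    """Splice the current turn's two statements into the fourth sample,
    characterize it, and return the argmax intention category."""
    char_vec = language_model(fourth_sample_inputs)
    return intent_model(char_vec).argmax(dim=-1)
```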
21. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-10.
22. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-10.
CN202011009914.6A 2020-09-23 2020-09-23 Method and apparatus for pre-training language model Active CN112084317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011009914.6A CN112084317B (en) 2020-09-23 2020-09-23 Method and apparatus for pre-training language model

Publications (2)

Publication Number Publication Date
CN112084317A (en) 2020-12-15
CN112084317B CN112084317B (en) 2023-11-14

Family

ID=73739659

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001057851A1 (en) * 2000-02-02 2001-08-09 Famoice Technology Pty Ltd Speech system
US20200242302A1 (en) * 2019-01-29 2020-07-30 Ricoh Company, Ltd. Intention identification method, intention identification apparatus, and computer-readable recording medium
CN109992648A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 The word-based depth text matching technique and device for migrating study
CN111291166A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for training language model based on Bert

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Pengyuan; LU Chunhui; WANG Ruimin: "Chinese Prosodic Structure Prediction Based on a Pre-trained Language Representation Model", Journal of Tianjin University (Science and Technology), no. 03 *
XU Feifei; FENG Dongsheng: "Research on Text Word Vectors and Pre-trained Language Models", Journal of Shanghai University of Electric Power, no. 04 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905772A (en) * 2021-02-10 2021-06-04 网易有道信息技术(北京)有限公司 Semantic correlation analysis method and device and related products
CN112905772B (en) * 2021-02-10 2022-04-19 网易有道信息技术(北京)有限公司 Semantic correlation analysis method and device and related products
CN113177113A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Task type dialogue model pre-training method, device, equipment and storage medium
CN113177113B (en) * 2021-05-27 2023-07-25 中国平安人寿保险股份有限公司 Task type dialogue model pre-training method, device, equipment and storage medium
CN113609275A (en) * 2021-08-24 2021-11-05 腾讯科技(深圳)有限公司 Information processing method, device, equipment and storage medium
CN113609275B (en) * 2021-08-24 2024-03-26 腾讯科技(深圳)有限公司 Information processing method, device, equipment and storage medium
CN113688245A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Method, device and equipment for processing pre-training language model based on artificial intelligence
CN113688245B (en) * 2021-08-31 2023-09-26 中国平安人寿保险股份有限公司 Processing method, device and equipment of pre-training language model based on artificial intelligence
WO2024109546A1 (en) * 2022-11-22 2024-05-30 北京猿力未来科技有限公司 Dialogue detection model training method and device

Similar Documents

Publication Publication Date Title
CN110413746B (en) Method and device for identifying intention of user problem
CN111309889B (en) Method and device for text processing
CN112084317A (en) Method and apparatus for pre-training a language model
WO2019200923A1 (en) Pinyin-based semantic recognition method and device and human-machine conversation system
CN111291166B (en) Method and device for training language model based on Bert
CN111177359A (en) Multi-turn dialogue method and device
CN111177324B (en) Method and device for carrying out intention classification based on voice recognition result
CN110019742B (en) Method and device for processing information
CN111339781A (en) Intention recognition method and device, electronic equipment and storage medium
CN111159364B (en) Dialogue system, dialogue device, dialogue method, and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
US11636272B2 (en) Hybrid natural language understanding
CN111625634A (en) Word slot recognition method and device, computer-readable storage medium and electronic device
CN110399472B (en) Interview question prompting method and device, computer equipment and storage medium
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN113268610A (en) Intent skipping method, device and equipment based on knowledge graph and storage medium
KR20210059995A (en) Method for Evaluating Foreign Language Speaking Based on Deep Learning and System Therefor
CN117370512A (en) Method, device, equipment and storage medium for replying to dialogue
CN111046674A (en) Semantic understanding method and device, electronic equipment and storage medium
CN115510213A (en) Question answering method and system for working machine and working machine
CN115346520A (en) Method, apparatus, electronic device and medium for speech recognition
CN111091011B (en) Domain prediction method, domain prediction device and electronic equipment
Weigelt et al. Integrating a dialog component into a framework for spoken language understanding
CN113326359A (en) Training method and device for dialogue response and response strategy matching model
CN110781072A (en) Code auditing method, device and equipment based on machine learning and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant