CN113312453A - Model pre-training system for cross-language dialogue understanding - Google Patents

Model pre-training system for cross-language dialogue understanding

Info

Publication number
CN113312453A
Authority
CN
China
Prior art keywords
module
dialogue
word
language
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110667409.9A
Other languages
Chinese (zh)
Other versions
CN113312453B (en)
Inventor
车万翔
李祺欣
覃立波
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110667409.9A priority Critical patent/CN113312453B/en
Publication of CN113312453A publication Critical patent/CN113312453A/en
Application granted granted Critical
Publication of CN113312453B publication Critical patent/CN113312453B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3337 Translation of the query language, e.g. Chinese to English
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention relates to a model pre-training system for cross-language dialogue understanding. The invention aims to solve the problems that, in existing cross-language dialogue understanding scenarios, labeled corpora for low-resource languages are scarce, so the model training effect is limited, an accurate dialogue understanding system cannot be obtained, and accurate replies to user utterances cannot be given. The model pre-training system for cross-language dialogue understanding comprises: a data acquisition module, a dialogue field label sorting and merging module, a training corpus sorting module, a target language type determining module, a static dictionary determining module, a word replacing module, a coding module, a word replacing and predicting module, a sample belonging dialogue field predicting module, an integral model acquiring module, a training module and a cross-language dialogue understanding field downstream task fine-tuning module. The invention is used in the field of cross-language dialogue understanding.

Description

Model pre-training system for cross-language dialogue understanding
Technical Field
The invention relates to a model pre-training system for cross-language dialogue understanding, and in particular to a cross-language model pre-training system and a dialogue understanding model training system in the field of natural language processing.
Background
Currently, human-machine dialogue systems have become a leading research hotspot in industry because of their huge practical value and prospects. In fact, as early as the 1960s, Professor Joseph Weizenbaum of the Massachusetts Institute of Technology began developing the human-machine dialogue system ELIZA (Weizenbaum J. ELIZA - a computer program for the study of natural language communication between man and machine [J]. Communications of the ACM, 1966, 9(1): 36-45.), which could mimic the responses of a psychotherapist and provide assistance to patients with psychological illnesses. In the years since, human-machine dialogue systems for various purposes have been developed, thanks to the rapid development of natural language processing (Chowdhury G. Natural language processing [J]. Annual Review of Information Science and Technology, 2003, 37(1): 51-89.) and deep learning (LeCun Y, Bengio Y, Hinton G. Deep learning [J]. Nature, 2015, 521(7553): 436-444.). The most important module behind these human-machine dialogue systems is the dialogue understanding system.
The dialogue understanding system is able to understand the user's intention and give corresponding replies and help, such as weather inquiry, flight reservation, meal ordering, device control for smart homes, voice control for vehicle-mounted devices, and so on. At present, industry has deployed many dialogue understanding systems on mobile phones or smart home devices, but most of them only support widely used languages such as Chinese and English. Likewise, model pre-training for dialogue understanding systems in academia (Wu C S, Hoi S, Socher R, et al. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue [J]. arXiv preprint arXiv:2004.06871, 2020.) has been limited to English, and the cross-language scenario has rarely been studied. An important reason for this situation is that labeled dialogue understanding corpora for low-resource languages are scarce, so how to effectively utilize existing dialogue understanding corpora to assist training in the cross-language scenario is a problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to solve the problems that, in existing cross-language dialogue understanding scenarios, labeled corpora for low-resource languages are scarce, so the model training effect is limited, an accurate dialogue understanding system cannot be obtained, and accurate replies to user utterances cannot be given, and provides a model pre-training system for cross-language dialogue understanding.
A model pre-training system for cross-language dialogue understanding comprises:
the system comprises a data acquisition module, a dialogue field label sorting and merging module, a training corpus sorting module, a target language type determining module, a static dictionary determining module, a word replacing module, a coding module, a word replacing and predicting module, a sample belonging dialogue field predicting module, an integral model acquiring module, a training module and a cross-language dialogue understanding field downstream task fine-tuning module;
the data acquisition module is used for collecting an English data set in the labeled dialogue understanding field;
the dialogue domain label sorting and merging module is used for sorting dialogue domain labels marked on all data sets in the data acquisition module and merging dialogue domain labels with the same meaning on different data sets;
the training corpus sorting module is used for segmenting the dialogue corpora in all data sets collected by the data acquisition module, taking the user utterance and the system reply in one round of dialogue as one sample, tokenizing the user utterance and the system reply respectively, and at the same time labeling each sample with a dialogue domain label by using the dialogue domain label information merged in the dialogue domain label sorting and merging module;
the target language determining module is used for determining a target language;
the static dictionary determining module is used for respectively collecting static dictionaries translated from English vocabulary to various target languages according to the target languages determined by the target language determining module;
the word replacing module is used for randomly selecting a certain proportion of English words on each sample marked with the dialogue field labels in the training corpus sorting module, randomly selecting a language from the target language determined in the target language determining module for each randomly selected word, translating each randomly selected word to a word corresponding to the target language by using a static dictionary collected by the static dictionary determining module, replacing the English word with the word corresponding to the target language, and simultaneously keeping the original English word as a label to be predicted;
the coding module obtains a coded representation of the processed sample in the word replacement module by using a cross-language coding model;
the word replacement prediction module uses a fully-connected neural network to calculate, from the encoded representation of each word in the sample obtained by the coding module, the probability over the dictionary of the word that may have been replaced, and calculates the cross-entropy loss against the label to be predicted retained in the word replacement module;
the dialogue domain prediction module to which the sample belongs uses a fully-connected neural network to judge the dialogue domain to which the sample belongs from the encoded representation of the whole sample sentence obtained by the coding module, and calculates the cross-entropy loss against the dialogue domain label annotated in the training corpus sorting module;
the integral model acquisition module adds the cross-entropy loss obtained by the word replacement prediction module and the cross-entropy loss obtained by the dialogue domain prediction module to which the sample belongs to obtain the final loss;
through the final loss, back propagation is performed on the integral model and the parameters of the integral model are updated;
the integral model in the integral model acquisition module consists of the cross-language coding model in the coding module, the fully-connected neural network in the word replacement prediction module, and the fully-connected neural network in the dialogue domain prediction module to which the sample belongs;
the training module trains an integral model in the integral model acquisition module by using the processed data in the training corpus sorting module and the word replacement module;
and the downstream task fine-tuning module in the cross-language dialogue understanding field uses the whole model trained by the training module as a pre-training model, and completes the tasks in the cross-language dialogue understanding field based on the pre-training model.
The invention has the beneficial effects that:
the invention provides a model pre-training system for cross-language dialogue understanding, which does not depend on cross-language labeled dialogue understanding data and can pre-train a dialogue understanding model in a cross-language scene only by utilizing the existing English data. In addition, the invention designs a self-supervision task, and utilizes a dictionary to automatically label, so that the model can learn the mapping relation between English words and words of other languages which are translation pairs in the pre-training process, thereby improving the overall expression between other languages and English on the pre-training model. Particularly, the invention also summarizes the dialogue domain labels in different English dialogue understanding data sets, and trains the model by using the labeled information, so that the model can learn the special knowledge of the dialogue understanding domain in the pre-training process. The method solves the problems that in the existing cross-language dialogue understanding scene, due to the fact that corpus of a small language is scarce, model training effect is limited, an accurate dialogue understanding system cannot be obtained, and accurate reply cannot be completed to user words.
The invention is evaluated on a dialogue language understanding dataset in ten languages: Arabic, German, Spanish, French, Italian, Malay, Polish, Russian, Thai and Turkish. The dataset covers the two most classical subtasks in the dialogue understanding field: intent recognition and slot extraction. Experimental results show that the model pre-trained by the present method obtains better results than the baseline model when trained on the downstream task.
The invention trains on the dialogue language understanding dataset in each of the ten languages with five random seeds, takes the average over the five seeds as the result for each language, and then averages over the ten languages. The model pre-trained by the present method achieves an intent recognition accuracy of 93.73%, a 4.17% improvement over the baseline model; a slot extraction F1 value of 66.80%, a 3.03% improvement over the baseline model; and an overall intent and slot prediction accuracy of 38.01%, a 3.60% improvement over the baseline model. The large improvement in every metric shows that the proposed system is very effective for pre-training cross-language dialogue understanding models.
Drawings
FIG. 1 is a Sankey diagram of the dialogue domain label summarization and categorization results over multiple dialogue understanding datasets.
Detailed Description
The first embodiment is as follows: the model pre-training system for cross-language dialogue understanding includes:
the system comprises a data acquisition module, a dialogue field label sorting and merging module, a training corpus sorting module, a target language type determining module, a static dictionary determining module, a word replacing module, a coding module, a word replacing and predicting module, a sample belonging dialogue field predicting module, an integral model acquiring module, a training module and a cross-language dialogue understanding field downstream task fine-tuning module;
the data acquisition module is used for collecting an English data set in the labeled dialogue understanding field;
Eight classical public English dialogue understanding datasets widely used in industry were collected, including CamRest676 (Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.), WOZ (Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2016. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777.), SMD (Mihail Eric and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414.), MSR-E2E (Xiujun Li, Sarah Panda, JJ (Jingjing) Liu, and Jianfeng Gao. 2018. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. In SLT 2018.), Taskmaster (Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Andy Cedilnik, and Kyu-Young Kim. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358.), Schema (Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.), MetaLWOZ (Sungjin Lee, Hannes Schulz, Adam Atkinson, Jianfeng Gao, Kaheer Suleman, Layla El Asri, Mahmoud Adada, Minlie Huang, Shikhar Sharma, Wendy Tay, and Xiujun Li. 2019. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges 8.), and MultiWOZ (Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.).
The dialogue domain label sorting and merging module is used for sorting dialogue domain labels marked on all data sets in the data acquisition module (such as weather inquiry, scheduled flight, meal ordering, intelligent household equipment control, vehicle-mounted equipment voice control and the like), and merging dialogue domain labels with the same meaning on different data sets (such as weather inquiry, scheduled flight, meal ordering, intelligent household equipment control, vehicle-mounted equipment voice control and the like);
the training corpus sorting module is used for segmenting the dialogue corpora in all data sets collected by the data acquisition module, taking the user utterance and the system reply in one round of dialogue as one sample, tokenizing the user utterance and the system reply by whitespace, and labeling each sample with a dialogue domain label by using the dialogue domain label information merged in the dialogue domain label sorting and merging module;
the target language determining module is used for determining a target language in the pre-training process according to the research current situations in the academic world and the industry and the use range and frequency of each international language;
we manually selected 10 representative languages from among the languages used in various countries: Arabic, German, Spanish, French, Italian, Malay, Polish, Russian, Thai and Turkish;
the static dictionary determining module is used for respectively collecting static dictionaries translated from English vocabulary to various target languages according to the target languages determined by the target language determining module;
the word replacing module is used for randomly selecting a certain proportion of English words on each sample marked with the dialogue field labels in the training corpus sorting module, randomly selecting a language from the target language determined in the target language determining module for each randomly selected word, translating each randomly selected word to a word corresponding to the target language by using a static dictionary collected by the static dictionary determining module, replacing the English words with the words corresponding to the target language, and simultaneously keeping original English words (randomly selecting a certain proportion of English words on each sample marked with the dialogue field labels in the training corpus sorting module) as labels to be predicted;
the coding module obtains a coded representation of the processed sample in the word replacement module by using a cross-language coding model;
the word replacement prediction module uses a fully-connected neural network (the fully-connected neural networks of the word replacement prediction module and of the dialogue domain prediction module to which the sample belongs are different and have different parameters) to calculate, from the encoded representation of each word in the sample obtained by the coding module, the probability over the dictionary of the word that may have been replaced, and calculates the cross-entropy loss against the label to be predicted retained in the word replacement module;
the dialogue domain prediction module to which the sample belongs uses a fully-connected neural network (different from, and with different parameters than, the one in the word replacement prediction module) to judge the dialogue domain to which the sample belongs from the encoded representation of the whole sample sentence obtained by the coding module (one sample is the user utterance and the system reply in one round of dialogue; a sample contains several words, which form one sentence, so one sample is one whole sentence), and calculates the cross-entropy loss against the dialogue domain label annotated in the training corpus sorting module;
the integral model acquisition module adds the cross-entropy loss obtained by the word replacement prediction module and the cross-entropy loss obtained by the dialogue domain prediction module to which the sample belongs to obtain the final loss;
through the final loss, back propagation is performed on the integral model and the parameters of the integral model are updated;
the integral model in the integral model acquisition module consists of the cross-language coding model in the coding module, the fully-connected neural network in the word replacement prediction module, and the fully-connected neural network in the dialogue domain prediction module to which the sample belongs;
the training module trains an integral model in the integral model acquisition module by using the processed data in the training corpus sorting module and the word replacement module;
the downstream task fine tuning module in the cross-language dialogue understanding field uses the whole model trained by the training module as a pre-training model, and completes tasks in the cross-language dialogue understanding field based on the pre-training model;
tasks in the cross-language dialogue understanding field include cross-language dialogue language understanding, cross-language intent recognition, cross-language dialogue state tracking, cross-language dialogue behavior prediction, cross-language reply selection, and the like. The parameters of the pre-trained model are used as the initialization parameters of BERT-architecture models for cross-language dialogue language understanding, cross-language intent recognition, cross-language dialogue state tracking, cross-language dialogue behavior prediction, cross-language reply selection and other tasks; these models are then trained separately to obtain the corresponding trained models, thereby completing the tasks of cross-language dialogue language understanding, cross-language intent recognition, cross-language dialogue state tracking, cross-language dialogue behavior prediction, cross-language reply selection, and so on.
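For illustration, a minimal sketch of this initialization step, assuming the HuggingFace Transformers library and a hypothetical directory name for the saved pre-trained encoder (the invention does not prescribe a particular toolkit):

```python
# Illustrative sketch: reuse the pre-trained cross-language encoder to initialize a
# downstream dialogue-understanding model. The directory name is hypothetical.
from transformers import AutoModel

# after pre-training, the encoder would have been saved, e.g. with
# encoder.save_pretrained("cross-lingual-dialogue-pretrained/")
downstream_encoder = AutoModel.from_pretrained("cross-lingual-dialogue-pretrained/")
# a task head (intent recognition, slot filling, dialogue state tracking, ...) is then
# stacked on top of this encoder and the whole model is fine-tuned on the target task.
```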
The second embodiment is as follows: this embodiment differs from the first embodiment in that the dialogue domain label sorting and merging module is configured to sort the dialogue domain labels annotated on all data sets in the data acquisition module (for example, weather inquiry, flight booking, meal ordering, device control for smart homes, voice control for vehicle-mounted devices, and the like) and to merge dialogue domain labels having the same meaning on different data sets; the specific process is as follows:
Step 2.1: the dialogue domain labels annotated on all data sets in the data acquisition module are sorted; CamRest676 has 1 dialogue domain label, WOZ has 1, SMD has 3, MSR-E2E has 3, Taskmaster has 6, Schema has 17, MetaLWOZ has 47, and MultiWOZ has 6;
Step 2.2: through manual screening, dialogue domain labels with the same meaning on different data sets are grouped into the same category; the categorization result is shown in FIG. 1. In FIG. 1, the text on the left gives the name of each data set and its number of samples, and the text on the right gives the name of each merged dialogue domain and the number of samples it contains; the sums of the numbers on the two sides are equal. An arc connecting the left and right sides indicates that a portion of the samples in the data set on the left is assigned the dialogue domain label on the right, and the width of the arc indicates the proportion of those samples among all samples. After sorting and merging, the 8 data sets yield 59 dialogue domain labels in total.
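As a concrete illustration of this merging step, the sketch below (Python; the dataset and label strings are hypothetical examples, not the actual 59 merged labels) shows how a manually curated mapping can be applied:

```python
# Map each (dataset, raw domain label) pair to its manually merged category.
DOMAIN_MERGE_MAP = {
    ("SMD", "weather"): "weather_query",
    ("Schema", "Weather_1"): "weather_query",
    ("MultiWOZ", "restaurant"): "restaurant_booking",
    ("MSR-E2E", "restaurant"): "restaurant_booking",
    # ... remaining pairs curated by manual screening
}

def merge_domain_label(dataset_name: str, raw_label: str) -> str:
    """Return the merged dialogue-domain label for a dataset-specific label."""
    return DOMAIN_MERGE_MAP.get((dataset_name, raw_label), raw_label)
```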
Other steps and parameters are the same as those in the first embodiment.
The third embodiment is as follows: this embodiment differs from the first or second embodiment in that the training corpus sorting module is used for segmenting the dialogue corpora in all data sets collected by the data acquisition module, taking the user utterance and the system reply in one round of dialogue as one sample, tokenizing the user utterance and the system reply by whitespace, and labeling each sample with a dialogue domain label by using the dialogue domain label information merged in the second embodiment; the specific process is as follows:
Step 3.1: the dialogue understanding corpora in the data sets collected by the data acquisition module consist of multi-turn dialogues, and each dialogue can be expressed as D = {U_1, R_1, ..., U_N, R_N};

where N is the number of dialogue turns, U_1 and R_1 are the user utterance and the system reply of the 1st turn, and U_N and R_N are the user utterance and the system reply of the N-th turn;

the user utterance and the system reply of one turn are taken as one sample and tokenized by whitespace; a separator [SEP] is inserted between them and an identifier [CLS] is inserted at the beginning of the sentence to represent global information, giving a sample S = {[CLS], u_1, u_2, ..., u_i, [SEP], r_1, r_2, ..., r_j};

where u_1 and r_1 are the 1st word in the user utterance and in the system reply respectively, u_2 and r_2 are the 2nd word in the user utterance and in the system reply respectively, u_i is the i-th word in the user utterance, r_j is the j-th word in the system reply, i is the length of the tokenized user utterance, and j is the length of the tokenized system reply;

Step 3.2: each sample is labeled with a dialogue domain label by using the dialogue domain label information merged in the dialogue domain label sorting and merging module (since some of the samples collected in step 1 are annotated with several dialogue domain labels and the invention only considers the single-label case, samples that belong to several dialogue domains after labeling are discarded); each sample labeled with a dialogue domain label is expressed as:

S = {S_tokens = [CLS], u_1, u_2, ..., u_i, [SEP], r_1, r_2, ..., r_j; S_domain = d},

where d is the dialogue domain label of the sample, S_tokens is the processed input token sequence of the sample (the tokens are the words u_1, u_2, ..., r_1, r_2, ... together with [CLS] and [SEP]), and S_domain is the dialogue domain label of the sample;

the finished samples amount to 457,555 in total;
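For illustration, a minimal sketch of how one such sample can be built from a single dialogue turn (Python; the dictionary field names are assumptions, while the [CLS]/[SEP] layout, whitespace tokenization and the single-domain filter follow the description above):

```python
def build_sample(user_utterance: str, system_reply: str, domains: list):
    """Build one training sample, or return None if the turn carries several domain labels."""
    if len(domains) != 1:                      # only single-domain samples are kept
        return None
    tokens = ["[CLS]"] + user_utterance.split() + ["[SEP]"] + system_reply.split()
    return {"tokens": tokens, "domain": domains[0]}

sample = build_sample("i need a cheap restaurant in the north",
                      "what kind of food would you like",
                      ["restaurant_booking"])
```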
other steps and parameters are the same as those in the first or second embodiment.
The fourth embodiment is as follows: this embodiment differs from the first to third embodiments in that the static dictionary determining module is configured to collect, according to the target languages determined by the target language determining module, static dictionaries that translate English vocabulary into each target language; the specific process is as follows:
The dictionaries that translate English into each target language are downloaded from https://github.com/facebookresearch/MUSE.
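For illustration, a small sketch for loading such a static dictionary, assuming the usual MUSE ground-truth dictionary format of one "english_word target_word" pair per line (an assumption about the file layout, not a statement from the patent):

```python
from collections import defaultdict

def load_static_dict(path: str) -> dict:
    """Load an English-to-target-language dictionary; a word may have several translations."""
    en2tgt = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                en2tgt[parts[0]].append(parts[1])
    return dict(en2tgt)

# e.g. en_de = load_static_dict("en-de.txt")   # hypothetical file name
```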
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth embodiment is as follows: this embodiment differs from the first to fourth embodiments in that the word replacement module is configured to randomly select a certain proportion of English words in each sample labeled with a dialogue domain label in the training corpus sorting module, randomly select a language from the target languages determined by the target language determining module for each selected word, translate each selected word into the corresponding word of that target language using the static dictionary collected by the static dictionary determining module, replace the English word with the corresponding target-language word, and at the same time retain the original English word as the label to be predicted; the specific process is as follows:
The randomly selected proportion is set to p% (here 15%);

at the same time, an array S_goldens of the same length as S_tokens is created to store the gold labels that the model should predict, and is initialized with the placeholder [PAD], i.e. S_goldens = {[PAD], ..., [PAD]};

in addition, an array S_masks of the same length as S_tokens is created to store the positions of the replaced words, and is initialized with all zeros, i.e. S_masks = {0, ..., 0};

for each token t in the S_tokens of every sample labeled with a dialogue domain label in the training corpus sorting module, where t ∈ {t | t ∈ S_tokens, t ≠ [CLS], t ≠ [SEP]}, a random number between 0 and 1 is generated; if the random number is smaller than p%, a target language is selected with equal probability from the 10 languages determined in the target language determining module, t is translated into the corresponding word t_x of the selected target language using the static dictionary collected in the static dictionary determining module, t is replaced by t_x at its position in the sample, the replaced t is stored at the same position of S_goldens as the label to be predicted, and the value of S_masks at that position is set to 1;

an example of a sample after word replacement is

S = {S_tokens = [CLS], u_1, ..., u_k^x, ..., u_i, [SEP], r_1, ..., r_l^x, ..., r_m^x, ..., r_j; S_goldens = [PAD], ..., u_k, ..., r_l, ..., r_m, ..., [PAD]; S_masks = 0, ..., 1, ..., 1, ..., 1, ..., 0}

where u_k^x is the target-language word that replaces u_k at position k of the user utterance, r_l^x is the target-language word that replaces r_l at position l of the system reply, and r_m^x is the target-language word that replaces r_m at position m of the system reply.
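A minimal sketch of this word-replacement (code-switching) step in Python; the 15% ratio and the equal-probability language choice follow the description above, while the handling of words missing from the dictionary is an added assumption:

```python
import random

def replace_words(tokens, dictionaries, languages, p=0.15):
    """tokens: S_tokens; dictionaries: {language: english->translations}; returns S_tokens', S_goldens, S_masks."""
    goldens = ["[PAD]"] * len(tokens)          # S_goldens, initialized with [PAD]
    masks = [0] * len(tokens)                  # S_masks, initialized with zeros
    out = list(tokens)
    for i, t in enumerate(tokens):
        if t in ("[CLS]", "[SEP]"):            # special tokens are never replaced
            continue
        if random.random() < p:
            lang = random.choice(languages)    # equal probability over the 10 target languages
            translations = dictionaries[lang].get(t.lower())
            if not translations:               # assumption: skip words absent from the dictionary
                continue
            out[i] = random.choice(translations)   # t_x, the target-language word
            goldens[i] = t                     # original English word kept as the label
            masks[i] = 1
    return out, goldens, masks
```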
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth embodiment is as follows: this embodiment differs from the first to fifth embodiments in that S_goldens, S_tokens and S_masks all have the same length, but S_goldens holds the replaced t only at the positions of the replaced words, and all other positions are [PAD], meaning that no prediction is required there.
Other steps and parameters are the same as those in one of the first to fifth embodiments.
The seventh embodiment is as follows: this embodiment differs from the first to sixth embodiments in that the coding module uses a cross-language coding model to obtain the encoded representation of the samples processed by the word replacement module; the specific process is as follows:

XLM-RoBERTa-base (Conneau A, Khandelwal K, Goyal N, et al. Unsupervised cross-lingual representation learning at scale [J]. arXiv preprint arXiv:1911.02116, 2019.) is selected as the cross-language coding model, and the S_tokens processed by the word replacement module is encoded to obtain the encoded representation of every token:

H = {h_[CLS], h_u1, ..., h_ui, h_[SEP], h_r1, ..., h_rj} = Cross_Lingual_Encoder(S_tokens)

where Cross_Lingual_Encoder is the cross-language coding model, h_[CLS] and h_[SEP] are the encoded representations of the [CLS] and [SEP] tokens after encoding by the cross-language coding model, h_u1 is the encoded representation of u_1 after encoding by the cross-language coding model, and h_r1 is the encoded representation of r_1 after encoding by the cross-language coding model.
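For illustration, a hedged sketch of the coding module using xlm-roberta-base from HuggingFace Transformers (an assumed implementation choice; the patent only names the model). XLM-R's own <s>/</s> special tokens play the role of [CLS]/[SEP], and each word is represented by its first sub-token:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def encode(words):
    """words: the word list of one code-switched sample; returns (h_cls, per-word encodings)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]        # (num_subtokens, 768)
    h_cls = hidden[0]                                   # sentence-level <s> representation
    first = {}                                          # first sub-token index of every word
    for pos, wid in enumerate(enc.word_ids()):
        if wid is not None and wid not in first:
            first[wid] = pos
    h_words = hidden[[first[w] for w in sorted(first)]] # one vector per original word
    return h_cls, h_words
```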
Other steps and parameters are the same as those in one of the first to sixth embodiments.
The eighth embodiment is as follows: this embodiment differs from the first to seventh embodiments in that the word replacement prediction module uses a fully-connected neural network (the fully-connected neural networks of the word replacement prediction module and of the dialogue domain prediction module to which the sample belongs are different and have different parameters) to calculate, from the encoded representation of each word in the sample obtained by the coding module, the probability over the dictionary of the word that may have been replaced, and calculates the cross-entropy loss against the label to be predicted retained in the word replacement module; the specific process is as follows:

Step 8.1: using a fully-connected neural network, the probability over the dictionary of the possibly replaced word is calculated from the encoded representation of each word in the sample obtained by the coding module:

z_i = softmax(W · h_i + b)

where W is the weight of the fully-connected neural network, b is the bias of the fully-connected neural network, h_i is the encoded representation of the i-th position obtained in the coding module, and z_i is the predicted probability distribution for the word at the i-th position (the prediction for the word at the 1st position is z_1, for the word at the x-th position z_x; the word replacement task predicts the replaced word separately for each position);

Step 8.2: through the S_goldens and S_masks constructed in the word replacement module, the cross-entropy loss of the word replacement task (Word Replacement, WR for short) is calculated:

L_WR^(i) = - Σ_{k=1}^{V} y_{i,k} · log z_{i,k}

where V is the size of the vocabulary, z_{i,k} is the predicted probability of the k-th word at the i-th position, y_{i,k} is the true label (0 or 1) of the k-th word at the i-th position (1 means the k-th word is identical to the word at the i-th position of S_goldens, otherwise 0), and L_WR^(i) is the cross-entropy loss at position i;

L_WR = Σ_{i : S_masks[i] = 1} L_WR^(i)

where i ranges over the positions of the replaced words stored in S_masks, S_masks[i] is the value at the i-th position of S_masks, and L_WR is the sum of the losses over the positions of all replaced words.
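A minimal sketch of the word-replacement prediction head and its masked cross-entropy loss in PyTorch, following the formulas above (the hidden size, the use of ignore_index, and mean rather than sum reduction are implementation assumptions):

```python
import torch
import torch.nn as nn

class WordReplacementHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)          # W and b of the formula
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # averaged over replaced positions

    def forward(self, hidden, golden_ids, masks):
        """hidden: (seq, H) encodings; golden_ids: (seq,) vocabulary ids of S_goldens; masks: (seq,) S_masks."""
        logits = self.proj(hidden)                               # z_i before the softmax
        targets = golden_ids.clone()
        targets[masks == 0] = -100                               # [PAD] positions contribute no loss
        return self.loss_fn(logits, targets)                     # L_WR (mean instead of sum)
```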
Other steps and parameters are the same as those in one of the first to seventh embodiments.
The ninth embodiment is as follows: this embodiment differs from the first to eighth embodiments in that the dialogue domain prediction module to which the sample belongs uses a fully-connected neural network (different from, and with different parameters than, the one in the word replacement prediction module) to judge the dialogue domain to which the sample belongs from the encoded representation of the whole sample sentence obtained by the coding module (one sample is the user utterance and the system reply in one round of dialogue; a sample contains several words, which form one sentence, so one sample is one whole sentence), and calculates the cross-entropy loss against the dialogue domain label annotated in the training corpus sorting module; the specific process is as follows:

Step 9.1: using a fully-connected neural network, the probability of the dialogue domain to which the sample belongs is calculated from the encoded representation h_[CLS] of the identifier [CLS] in the sample obtained by the coding module:

z' = softmax(W' · h_[CLS] + b')

where W' is the weight of the fully-connected neural network and b' is the bias of the fully-connected neural network;

Step 9.2: the cross-entropy loss of the dialogue domain classification task (Domain Classifier, DC for short) is calculated through the dialogue domain label annotated in the training corpus sorting module.
Other steps and parameters are the same as those in the first to eighth embodiments.
The tenth embodiment is as follows: this embodiment differs from the first to ninth embodiments in that, in Step 9.2, the cross-entropy loss of the dialogue domain classification task (Domain Classifier, DC for short) is calculated through the dialogue domain label annotated in the training corpus sorting module as follows:

L_DC = - Σ_{i=1}^{D} y'_i · log z'_i

where D is the number of dialogue domain labels collected in the dialogue domain label sorting and merging module, z'_i is the predicted probability of the i-th dialogue domain label in z', and y'_i is the true label (0 or 1) of the i-th dialogue domain for the current sample (1 means the i-th dialogue domain label is consistent with S_domain, otherwise 0).
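A corresponding sketch of the dialogue-domain classification head over the [CLS] representation, with its own parameters W' and b' (separate from the word-replacement head); the 59-label default is taken from the second embodiment:

```python
import torch.nn as nn

class DomainHead(nn.Module):
    def __init__(self, hidden_size: int, num_domains: int = 59):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_domains)   # W' and b'
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, h_cls, domain_id):
        """h_cls: (H,) encoding of [CLS]; domain_id: scalar tensor with the gold domain index."""
        logits = self.proj(h_cls.unsqueeze(0))            # z'
        return self.loss_fn(logits, domain_id.view(1))    # L_DC
```

The final loss of the integral model acquisition module is then simply the sum of the two head losses, which is back-propagated to update the cross-language encoder and both fully-connected networks.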
Other steps and parameters are the same as those in one of the first to ninth embodiments.
The following examples were used to demonstrate the beneficial effects of the present invention:
Example one is as follows:
This example selects the dialogue language understanding task as the downstream task in the dialogue understanding field. Given a dialogue language understanding dataset in the cross-language setting, the task is to classify the intent of an utterance and to extract the corresponding slots in the sentence. The task is prepared according to the following steps:
Step one, collecting English dialogue language understanding data, translating it into the cross-language setting, and annotating the translated texts for training, validating and testing the model;
We downloaded the SNIPS dataset (Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.), extracted 700 samples from its validation set (100 samples per intent, 7 intents in total), split them into two halves (350 samples each, 50 per intent) as the training and validation sets, and at the same time randomly extracted a test set of 350 samples (50 per intent).
For the extracted training, validation and test sets (1050 samples in total), experts were asked to translate them into Arabic, German, Spanish, French, Italian, Malay, Polish, Russian, Thai and Turkish (10 languages in total), keeping the original intent labels and re-annotating the slots in the translated sentences for use by the model.
Step two, setting the baseline cross-language pre-training model;
XLM-RoBERTa-base was chosen as the baseline cross-language pre-training model for this example.
Step three, setting a dialogue language understanding task model architecture;
Our overall model adopts a pipeline architecture and consists of two models: an intent classification model and a slot extraction model.
Step four, training an intention classification model;
step four, obtaining the coding representation of the sample by using the cross-language pre-training model
Figure BDA0003117425670000111
Where Input is the Input sample, k is the sample length, h[CLS]Is a sample [ CLS]The coded representation at the label is represented by,
Figure BDA0003117425670000112
for the coded representation of the first word in the sample,
Figure BDA0003117425670000121
is the coded representation of the kth word in the sample;
step four, using a full-connection neural network to calculate the probability of the intention label of the current sample
Figure BDA0003117425670000122
Wherein
Figure BDA0003117425670000123
B is the weight of the fully-connected neural network, b is the bias of the fully-connected neural network;
step four and step three, calculating cross entropy loss through the prediction probability in the step four and the step two
Figure BDA0003117425670000124
Wherein I is the number of intents summarized in step two,
Figure BDA0003117425670000125
for the true label of the current sample to the i-th intention (0 or 1, where 1 means that the i-th intention label is the golden label of the sample, and vice versa is 0), ziA predicted probability for the model for the ith intention tag;
fourthly, performing back propagation through the loss calculated in the fourth step and the third step and updating model parameters;
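A minimal sketch of one intent-classification training step, assuming `encoder` is the pre-trained cross-language model, `intent_head` is a torch.nn.Linear layer over the [CLS] representation, and the learning rate is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(intent_head.parameters()), lr=2e-5)

def intent_train_step(enc_inputs, intent_labels):
    """enc_inputs: a tokenized batch; intent_labels: (B,) gold intent ids."""
    h_cls = encoder(**enc_inputs).last_hidden_state[:, 0]   # [CLS]/<s> representation
    loss = F.cross_entropy(intent_head(h_cls), intent_labels)
    optimizer.zero_grad()
    loss.backward()                                          # Step 4.4: back-propagate
    optimizer.step()                                         # and update the parameters
    return loss.item()
```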
Step five, training the slot extraction model;

Step 5.1: the encoded representation of the input sample is obtained using the cross-language pre-training model:

H = {h_[CLS], h_1, ..., h_k}

where k is the sample length, h_[CLS] is the encoded representation at the [CLS] token of the sample, h_1 is the encoded representation of the first word in the sample, and h_k is the encoded representation of the k-th word in the sample;

Step 5.2: a separate fully-connected neural network is created for each intent, and the probability of the slot label at every token position of the current sample is predicted through the network corresponding to the gold intent label of the sample:

z_k = softmax(W_i · h_k + b_i)

where W_i is the weight of the fully-connected neural network corresponding to the i-th intent label, b_i is the bias of the fully-connected neural network corresponding to the i-th intent label, h_k is the encoded representation of the word at position k of the sample, and z_k is the predicted slot probability distribution of that word;

Step 5.3: the cross-entropy loss is calculated from the predicted probabilities in Step 5.2:

L_Slot = - Σ_{k=1}^{L} Σ_{s=1}^{S_i} y_{k,s} · log z_{k,s}

where L is the sample length, S_i is the number of slot labels corresponding to the i-th intent label, y_{k,s} is the true label (0 or 1) of the s-th slot at position k of the current sample (1 means that the s-th slot label is the gold slot label at position k of the sample, otherwise 0), and z_{k,s} is the predicted probability of the model for the s-th slot label at position k of the current sample;

Step 5.4: back-propagation is performed through the loss calculated in Step 5.3 and the model parameters are updated;

the training processes of the intent classification model in step four and the slot extraction model in step five are independent of each other, and the two trained models together form the overall model of step three.
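For illustration, a sketch of the slot extraction model with one fully-connected head per intent, selected by the sample's intent label (the gold intent during training, the predicted intent at test time); names and shapes are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class PerIntentSlotTagger(nn.Module):
    def __init__(self, hidden_size: int, slots_per_intent: dict):
        super().__init__()
        # one fully-connected network per intent, each with its own slot label inventory
        self.heads = nn.ModuleDict(
            {str(i): nn.Linear(hidden_size, n) for i, n in slots_per_intent.items()})

    def forward(self, word_encodings, intent_id, slot_labels=None):
        """word_encodings: (L, H); intent_id: intent index; slot_labels: (L,) gold slot ids or None."""
        logits = self.heads[str(intent_id)](word_encodings)   # z_k for every position
        if slot_labels is None:
            return logits.argmax(-1)                          # slot predictions at test time
        return F.cross_entropy(logits, slot_labels)           # cross-entropy over all positions
```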
Step six, predicting the final results and calculating the metrics;

Step 6.1: predicting the final results;

first, the intent label of a sample is predicted with the intent classification model trained in step four; then, using the predicted intent, the slot labels of the sample are predicted with the corresponding fully-connected neural network of the slot extraction model trained in step five.
Sixthly, calculating indexes;
let the number of mean predictions correct for all samples be CIntentIf the total number of samples is A, the Intent recognition accuracy (Intent Acc) is
Figure BDA0003117425670000131
Assuming that the number of correct Slot tag predictions is TP, the number of incorrect predictions is FP, and the number of unpredicted slots is FN in all tokens of all samples, the calculation method of Slot extraction F1 value (Slot F1) is as follows:
Figure BDA0003117425670000132
Figure BDA0003117425670000133
Figure BDA0003117425670000134
let C be the number of all samples for which the intent and all slots are predicted correctlyOverallWhen the total number of samples is A, the Overall recognition accuracy (Overall Acc) is
Figure BDA0003117425670000135
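The three metrics can be computed as in the following sketch (token-level counts for the slot F1, per the definitions above; input formats are assumptions):

```python
def intent_accuracy(pred_intents, gold_intents):
    """Intent Acc = C_Intent / A."""
    return sum(p == g for p, g in zip(pred_intents, gold_intents)) / len(gold_intents)

def slot_f1(tp: int, fp: int, fn: int) -> float:
    """Slot F1 from token-level true-positive / false-positive / false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def overall_accuracy(pred_frames, gold_frames):
    """Overall Acc: a sample counts only if the intent and every slot are correct."""
    return sum(p == g for p, g in zip(pred_frames, gold_frames)) / len(gold_frames)
```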
In order to smooth out the result fluctuations caused by the small amount of test data, experiments are run with 5 different random seeds on the training set, the average of each metric for each language over the 5 seeds is computed, and finally the average experimental result over the 10 languages is reported.
The final experimental results on the test set are shown in table 1.
Table 1: Average experimental results of the dialogue language understanding task over ten languages
[Table 1 is reproduced as an image in the original document; its five rows correspond to the settings described below.]
The best results are shown in bold in the table.
Where the first row of experimental results shows our experimental results on the baseline model.
The second row shows the experimental results of a model pre-training system oriented to cross-language dialogue understanding according to the present invention.
The third row shows the experimental results when the word replacement method in the above scheme of the invention is changed to a Masked Language Model (Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [J]. arXiv preprint arXiv:1810.04805, 2018.).
The fourth row shows the experimental results after the dialogue domain classification fully-connected neural network in the scheme of the invention is removed.
The fifth row shows the experimental results when the word replacement method in the above scheme of the invention is changed to a Masked Language Model and the dialogue domain classification fully-connected neural network is removed.
As can be seen from the ablation results in the third, fourth and fifth rows of Table 1, every part of the proposed scheme is indispensable, and jointly training the word replacement task and the dialogue domain classification task makes the model more effective.
As can be seen from Table 1, the intention recognition accuracy of the cross-language dialogue understanding pre-training model trained by the method is improved by 4.17% compared with that of the baseline model, the slot extraction F1 value is improved by 3.03% compared with that of the baseline model, and the accuracy of the whole intention and slot prediction is improved by 3.60% compared with that of the baseline model. The method also proves that the overall effect of the cross-language dialogue understanding model can be remarkably improved.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (10)

1. A model pre-training system for cross-language conversational understanding, characterized by: the system comprises:
the system comprises a data acquisition module, a dialogue field label sorting and merging module, a training corpus sorting module, a target language type determining module, a static dictionary determining module, a word replacing module, a coding module, a word replacing and predicting module, a sample belonging dialogue field predicting module, an integral model acquiring module, a training module and a cross-language dialogue understanding field downstream task fine-tuning module;
the data acquisition module is used for collecting an English data set in the labeled dialogue understanding field;
the dialogue domain label sorting and merging module is used for sorting dialogue domain labels marked on all data sets in the data acquisition module and merging dialogue domain labels with the same meaning on different data sets;
the training corpus sorting module is used for segmenting the dialogue corpora in all data sets collected by the data acquisition module, taking the user utterance and the system reply in one round of dialogue as one sample, tokenizing the user utterance and the system reply respectively, and at the same time labeling each sample with a dialogue domain label by using the dialogue domain label information merged in the dialogue domain label sorting and merging module;
the target language determining module is used for determining a target language;
the static dictionary determining module is used for respectively collecting static dictionaries translated from English vocabulary to various target languages according to the target languages determined by the target language determining module;
the word replacing module is used for randomly selecting a certain proportion of English words on each sample marked with the dialogue field labels in the training corpus sorting module, randomly selecting a language from the target language determined in the target language determining module for each randomly selected word, translating each randomly selected word to a word corresponding to the target language by using a static dictionary collected by the static dictionary determining module, replacing the English word with the word corresponding to the target language, and simultaneously keeping the original English word as a label to be predicted;
the coding module obtains a coded representation of the processed sample in the word replacement module by using a cross-language coding model;
the word replacement prediction module uses a fully-connected neural network to calculate, from the encoded representation of each word in the sample obtained by the coding module, the probability over the dictionary of the word that may have been replaced, and calculates the cross-entropy loss against the label to be predicted retained in the word replacement module;
the dialogue domain prediction module to which the sample belongs uses a fully-connected neural network to judge the dialogue domain to which the sample belongs from the encoded representation of the whole sample sentence obtained by the coding module, and calculates the cross-entropy loss against the dialogue domain label annotated in the training corpus sorting module;
the integral model acquisition module adds the cross-entropy loss obtained by the word replacement prediction module and the cross-entropy loss obtained by the dialogue domain prediction module to which the sample belongs to obtain the final loss;
through the final loss, back propagation is performed on the integral model and the parameters of the integral model are updated;
the integral model in the integral model acquisition module consists of the cross-language coding model in the coding module, the fully-connected neural network in the word replacement prediction module, and the fully-connected neural network in the dialogue domain prediction module to which the sample belongs;
the training module trains an integral model in the integral model acquisition module by using the processed data in the training corpus sorting module and the word replacement module;
and the downstream task fine-tuning module in the cross-language dialogue understanding field uses the whole model trained by the training module as a pre-training model, and completes the tasks in the cross-language dialogue understanding field based on the pre-training model.
2. The model pre-training system for cross-language dialogue understanding according to claim 1, wherein: the dialogue domain label sorting and merging module is used for sorting dialogue domain labels marked on all data sets in the data acquisition module and merging dialogue domain labels with the same meaning on different data sets; the specific process is as follows:
Step 2.1: sorting the dialogue domain labels annotated on all data sets in the data acquisition module;
Step 2.2: classifying dialogue domain labels with the same meaning on different data sets into the same category through manual screening.
3. The model pre-training system for cross-language dialogue understanding according to claim 2, wherein: the training corpus sorting module is used for segmenting the dialogue corpora in all data sets collected by the data acquisition module, taking the user utterance and the system reply in one round of dialogue as one sample, tokenizing the user utterance and the system reply respectively, and labeling each sample with a dialogue domain label by using the dialogue domain label information merged in step two; the specific process is as follows:
Step 3.1: the dialogue understanding corpora in the data sets collected by the data acquisition module consist of multi-turn dialogues, and each dialogue can be expressed as D = {U_1, R_1, ..., U_N, R_N};

where N is the number of dialogue turns, U_1 and R_1 are the user utterance and the system reply of the 1st turn, and U_N and R_N are the user utterance and the system reply of the N-th turn;

the user utterance and the system reply of one turn are taken as one sample and tokenized respectively; a separator [SEP] is inserted between them and an identifier [CLS] is inserted at the beginning of the sentence to represent global information, giving a sample S = {[CLS], u_1, u_2, ..., u_i, [SEP], r_1, r_2, ..., r_j};

where u_1 and r_1 are the 1st word in the user utterance and in the system reply respectively, u_2 and r_2 are the 2nd word in the user utterance and in the system reply respectively, u_i is the i-th word in the user utterance, r_j is the j-th word in the system reply, i is the length of the tokenized user utterance, and j is the length of the tokenized system reply;

Step 3.2: each sample is labeled with a dialogue domain label by using the dialogue domain label information merged in the dialogue domain label sorting and merging module; each sample labeled with a dialogue domain label is expressed as:

S = {S_tokens = [CLS], u_1, u_2, ..., u_i, [SEP], r_1, r_2, ..., r_j; S_domain = d},

where d is the dialogue domain label of the sample, S_tokens is the processed input token sequence of the sample, and S_domain is the dialogue domain label of the sample.
4. A model pre-training system for cross-language dialogue understanding according to claim 3, wherein: the static dictionary determining module is used for respectively collecting static dictionaries translated from English vocabulary to various target languages according to the target languages determined by the target language determining module; the specific process is as follows:
downloading the dictionaries that translate English to each target language from https://github.com/facebookresearch/MUSE.
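For illustration, a MUSE ground-truth dictionary file stores one source-target word pair per line, so it could be loaded into an English-to-target-language lookup like this; the file name en-zh.txt is an assumption about a local copy.

```python
# Load one MUSE bilingual dictionary file into a plain dict (first translation kept).
def load_muse_dict(path: str) -> dict:
    en2tgt = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            en2tgt.setdefault(parts[0], parts[1])
    return en2tgt

en_zh = load_muse_dict("en-zh.txt")   # hypothetical local copy of the en-zh dictionary
```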
5. The model pre-training system for cross-language dialogue understanding according to claim 4, wherein: the word replacement module is used for randomly selecting a certain proportion of English words in each sample labeled with a dialogue domain label in the training corpus sorting module, randomly selecting, for each selected word, one language from the target languages determined in the target language determining module, translating the selected word to the corresponding word in that target language by using the static dictionary collected by the static dictionary determining module, replacing the English word with the target-language word, and at the same time keeping the original English word as a label to be predicted; the specific process is as follows:
setting the randomly selected proportion to p%;
creating an array S_goldens for storing the labels to be predicted, and initializing it with the placeholder [PAD], i.e. S_goldens = [PAD], ..., [PAD];
creating an array S_masks for storing the position information of the replaced words, and initializing it with all zeros, i.e. S_masks = 0, ..., 0;
for each word t in S_tokens of each sample labeled with a dialogue domain label in the training corpus sorting module, generating a random number between 0 and 1; if the random number is less than p%, translating t to the corresponding word t^x in the randomly selected target language by using the static dictionary collected in the static dictionary determining module, replacing t with t^x at its position in the sample, storing the replaced t in S_goldens as the label to be predicted, and at the same time setting the value of this position in S_masks to 1;
t ∈ {t | t ∈ S_tokens, t ≠ [CLS], t ≠ [SEP]}
an example of a sample after word replacement is
S = {S_tokens = [CLS], u_1, ..., u_k^x, ..., u_i, [SEP], r_1, ..., r_l^x, ..., r_m^x, ..., r_j;
S_goldens = [PAD], ..., u_k, ..., r_l, ..., r_m, ..., [PAD]; S_masks = 0, ..., 1, ..., 1, ..., 1, ..., 0}
wherein u_k^x represents the target-language word obtained by word replacement of u_k at the k-th position in the user utterance, r_l^x represents the target-language word obtained by word replacement of r_l at the l-th position in the system reply, and r_m^x represents the target-language word obtained by word replacement of r_m at the m-th position in the system reply.
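The replacement procedure of claim 5 could be sketched as follows, assuming dictionaries maps each target-language code to a static English-to-target dictionary like the one loaded above; here p is passed as a fraction rather than a percentage, and words absent from the dictionary are simply skipped, a choice the claim does not specify.

```python
import random

def replace_words(tokens, dictionaries, p=0.15):
    """Code-switch a fraction p of the words; return (S_tokens, S_goldens, S_masks)."""
    goldens = ["[PAD]"] * len(tokens)   # S_goldens: original words to be predicted
    masks = [0] * len(tokens)           # S_masks: 1 where a word was replaced
    out = list(tokens)                  # S_tokens after replacement
    for i, t in enumerate(tokens):
        if t in ("[CLS]", "[SEP]"):
            continue
        if random.random() < p:
            lang = random.choice(list(dictionaries))   # random target language
            t_x = dictionaries[lang].get(t)
            if t_x is None:                            # word not in the static dictionary
                continue
            out[i] = t_x
            goldens[i] = t
            masks[i] = 1
    return out, goldens, masks
```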
6. A model pre-training system for cross-language dialogue understanding according to claim 4 or 5, characterized in that: said S_goldens, S_tokens and S_masks all have the same length, but S_goldens holds the replaced word t only at the positions of replaced words, and all other positions are [PAD], meaning that no prediction is required.
7. The cross-language dialogue understanding-oriented model pre-training system of claim 6, wherein: the encoding module obtains an encoded representation of the processed samples in the word replacement module using a cross-language encoding model; the specific process is as follows:
selecting XLM-RoBERTa-base as the cross-language coding model, and encoding the S_tokens processed in the word replacement module to obtain the coded representation of each token:
h_[CLS], h_{u_1}, ..., h_{u_i}, h_[SEP], h_{r_1}, ..., h_{r_j} = Cross_Lingual_Encoder(S_tokens)
wherein Cross_Lingual_Encoder is the cross-language coding model, h_[CLS] and h_[SEP] respectively represent the coded representations of the [CLS] and [SEP] tags after encoding by the cross-language coding model, h_{u_1} represents the coded representation of u_1 after encoding by the cross-language coding model, and h_{r_1} represents the coded representation of r_1 after encoding by the cross-language coding model.
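With the Hugging Face transformers library, the encoding in claim 7 could look roughly like the sketch below; note that XLM-RoBERTa's tokenizer uses <s> and </s> where the claims write [CLS] and [SEP], and the example sentence pair is invented.

```python
import torch
from transformers import XLMRobertaTokenizer, XLMRobertaModel

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")

# Encode one (user utterance, system reply) pair; the tokenizer inserts the
# <s>/</s> special tokens that play the role of [CLS]/[SEP] in the claims.
inputs = tokenizer("book a table for two", "which restaurant do you prefer",
                   return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
h_cls = hidden[:, 0]                               # coded representation of the sentence start
```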
8. The cross-language dialogue understanding-oriented model pre-training system of claim 7, wherein: the word replacement prediction module uses a fully-connected neural network to calculate, from the coded representation of each word in the sample obtained by the encoding module, the probability of each possibly replaced word in the vocabulary, and calculates the cross entropy loss against the labels to be predicted in the word replacement module; the specific process is as follows:
step eight-one, using a fully-connected neural network, calculating from the coded representation of each word in the sample obtained by the encoding module the probability of each possibly replaced word in the vocabulary:
p_i = softmax(W · h_i + b)
wherein W is the weight of the fully-connected neural network, b is the bias of the fully-connected neural network, h_i is the coded representation of the i-th position obtained in the encoding module, and p_i is the predicted word probability distribution at the i-th position;
step eight-two, calculating the cross entropy loss of the word replacement task through S_goldens and S_masks constructed in the word replacement module:
CE_i = -∑_{k=1}^{y} ŷ_{i,k} · log(p_{i,k})
wherein y is the size of the vocabulary, p_{i,k} represents the predicted probability of the k-th word at the i-th position, ŷ_{i,k} represents the true label of the k-th word at the i-th position, and CE_i is the cross entropy loss at the i-th position;
L_replace = ∑_{i: S_masks[i]=1} CE_i
wherein i is the position of a replaced word stored in S_masks, S_masks[i] denotes the value at the i-th position of S_masks, and L_replace is the sum of the losses over the positions of all replaced words.
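A sketch of how steps eight-one and eight-two could be computed in PyTorch, assuming the golden labels have already been converted to vocabulary ids and the mask zeroes out positions that need no prediction; all function and tensor names are assumptions.

```python
import torch.nn as nn

def word_replacement_loss(hidden, golden_ids, masks, word_head: nn.Linear):
    """hidden: (batch, seq, d); golden_ids: (batch, seq) vocabulary ids of S_goldens;
    masks: (batch, seq) with 1 at replaced positions (S_masks)."""
    logits = word_head(hidden)                       # fully-connected head: (batch, seq, vocab)
    ce = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)),            # softmax + cross entropy per position
        golden_ids.view(-1),
        reduction="none").view(masks.shape)
    return (ce * masks.float()).sum()                # sum losses only over replaced positions
```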
9. The cross-language dialogue understanding-oriented model pre-training system of claim 8, wherein: the dialogue domain prediction module to which the sample belongs uses a fully-connected neural network to judge, from the coded representation of the whole sentence of the sample obtained by the encoding module, the dialogue domain to which the sample belongs, and calculates the cross entropy loss through the dialogue domain label annotated in the training corpus sorting module; the specific process is as follows:
step nine-one, using a fully-connected neural network, calculating the probability of the dialogue domain to which the sample belongs from the coded representation h_[CLS] of the identifier [CLS] in the sample obtained by the encoding module:
p_domain = softmax(W' · h_[CLS] + b')
wherein W' is the weight of the fully-connected neural network, b' is the bias of the fully-connected neural network, and p_domain is the predicted probability distribution over the dialogue domains;
step nine-two, calculating the cross entropy loss of the dialogue domain classification task through the dialogue domain labels annotated in the training corpus sorting module.
10. The cross-language dialogue understanding-oriented model pre-training system of claim 9, wherein: in step nine-two, the cross entropy loss of the dialogue domain classification task is calculated through the dialogue domain labels annotated in the training corpus sorting module:
L_domain = -∑_{i=1}^{D} ŷ_i^{domain} · log(p_i^{domain})
wherein D is the number of dialogue domain labels summarized in the dialogue domain label sorting and merging module, p_i^{domain} is the predicted probability of the i-th dialogue domain label, and ŷ_i^{domain} is the true label of the current sample for the i-th dialogue domain.
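Correspondingly, the dialogue domain classification loss of claims 9 and 10 could be sketched as below; the head and variable names are assumptions, kept consistent with the earlier sketches.

```python
import torch.nn as nn

def domain_loss(h_cls, domain_labels, domain_head: nn.Linear):
    """h_cls: (batch, d) coded representation of [CLS]; domain_labels: (batch,) domain ids."""
    logits = domain_head(h_cls)                                  # (batch, D) scores over domains
    return nn.functional.cross_entropy(logits, domain_labels)   # softmax + cross entropy
```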
CN202110667409.9A 2021-06-16 2021-06-16 Model pre-training system for cross-language dialogue understanding Active CN113312453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110667409.9A CN113312453B (en) 2021-06-16 2021-06-16 Model pre-training system for cross-language dialogue understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110667409.9A CN113312453B (en) 2021-06-16 2021-06-16 Model pre-training system for cross-language dialogue understanding

Publications (2)

Publication Number Publication Date
CN113312453A true CN113312453A (en) 2021-08-27
CN113312453B CN113312453B (en) 2022-09-23

Family

ID=77379146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110667409.9A Active CN113312453B (en) 2021-06-16 2021-06-16 Model pre-training system for cross-language dialogue understanding

Country Status (1)

Country Link
CN (1) CN113312453B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960317A (en) * 2018-06-27 2018-12-07 哈尔滨工业大学 Across the language text classification method with Classifier combination training is indicated based on across language term vector
CN109213851A (en) * 2018-07-04 2019-01-15 中国科学院自动化研究所 Across the language transfer method of speech understanding in conversational system
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DECHUAN TENG et al.: "Injecting Word Information with Multi-Level Word Adapter for Chinese Spoken Language Understanding", 2021 IEEE International Conference on Acoustics, Speech and Signal Processing *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115455981A (en) * 2022-11-11 2022-12-09 合肥智能语音创新发展有限公司 Semantic understanding method, device, equipment and storage medium for multi-language sentences
CN115455981B (en) * 2022-11-11 2024-03-19 合肥智能语音创新发展有限公司 Semantic understanding method, device and equipment for multilingual sentences and storage medium
CN116628160A (en) * 2023-05-24 2023-08-22 中南大学 Task type dialogue method, system and medium based on multiple knowledge bases
CN116628160B (en) * 2023-05-24 2024-04-19 中南大学 Task type dialogue method, system and medium based on multiple knowledge bases
CN116805004A (en) * 2023-08-22 2023-09-26 中国科学院自动化研究所 Zero-resource cross-language dialogue model training method, device, equipment and medium
CN116805004B (en) * 2023-08-22 2023-11-14 中国科学院自动化研究所 Zero-resource cross-language dialogue model training method, device, equipment and medium
CN117149987A (en) * 2023-10-31 2023-12-01 中国科学院自动化研究所 Training method and device for multilingual dialogue state tracking model
CN117149987B (en) * 2023-10-31 2024-02-13 中国科学院自动化研究所 Training method and device for multilingual dialogue state tracking model
CN117648430A (en) * 2024-01-30 2024-03-05 南京大经中医药信息技术有限公司 Dialogue type large language model supervision training evaluation system
CN117648430B (en) * 2024-01-30 2024-04-16 南京大经中医药信息技术有限公司 Dialogue type large language model supervision training evaluation system

Also Published As

Publication number Publication date
CN113312453B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN113312453B (en) Model pre-training system for cross-language dialogue understanding
CN108614875B (en) Chinese emotion tendency classification method based on global average pooling convolutional neural network
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN109635280A (en) A kind of event extraction method based on mark
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN106980608A (en) A kind of Chinese electronic health record participle and name entity recognition method and system
CN110083700A (en) A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN109359293A (en) Mongolian name entity recognition method neural network based and its identifying system
CN110851599B (en) Automatic scoring method for Chinese composition and teaching assistance system
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN110046356B (en) Label-embedded microblog text emotion multi-label classification method
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN110263325A (en) Chinese automatic word-cut
CN108563725A (en) A kind of Chinese symptom and sign composition recognition methods
CN111523420A (en) Header classification and header list semantic identification method based on multitask deep neural network
CN110472245A (en) A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks
CN110222338A (en) A kind of mechanism name entity recognition method
CN113128232A (en) Named entity recognition method based on ALBERT and multi-word information embedding
CN109977402A (en) A kind of name entity recognition method and system
CN109446523A (en) Entity attribute extraction model based on BiLSTM and condition random field
CN114254645A (en) Artificial intelligence auxiliary writing system
CN113312918B (en) Word segmentation and capsule network law named entity identification method fusing radical vectors
CN114003700A (en) Method and system for processing session information, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant