CN114443827A - Local information perception dialogue method and system based on pre-training language model - Google Patents
Local information perception dialogue method and system based on pre-training language model
- Publication number
- CN114443827A · CN202210109478.2A
- Authority
- CN
- China
- Prior art keywords
- reply
- training
- context
- language model
- dialogue
- Prior art date
- Legal status
- Pending
Classifications
- G06F16/3329: Information retrieval of unstructured textual data; querying; query formulation; natural language query formulation or dialogue systems
- G06F16/3344: Information retrieval of unstructured textual data; querying; query processing; query execution using natural language analysis
- G06N3/045: Computing arrangements based on biological models; neural networks; architecture; combinations of networks
Abstract
The invention relates to a local information perception dialogue method and system based on a pre-training language model, wherein the method comprises the following steps: step A: collecting multi-turn dialogue texts of a specific scene, labeling the category to which each multi-turn dialogue reply belongs, and constructing a training set D with positive and negative category labels; step B: using the training set D to train a local information perception deep learning network model PLIP based on the pre-training language model, for selecting the reply corresponding to a given multi-turn dialogue context; step C: inputting the multi-turn dialogue context and the reply set into the trained local information perception deep learning network model PLIP to obtain the most appropriate reply corresponding to the multi-turn dialogue context. The method and the system can effectively improve the accuracy of multi-turn dialogue reply selection.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a local information perception dialogue method and system based on a pre-training language model.
Background
In recent years, with the development of machine learning and deep learning networks, great progress has been made in intelligent human-computer dialogue, and dialogue systems have gradually come into public view. Dialogue systems have important research value for both industry and academia and can be widely applied in many fields. Current dialogue system algorithms mainly fall into two categories: generative dialogue and retrieval-based dialogue. A generative dialogue model can generate an answer word by word for a question at inference time without relying on any corpus; the generated answers have the advantage of diversity, but they are often weak in logic and sometimes fall into the trap of safe replies. A retrieval-based dialogue model lets the algorithm find the most appropriate answer in a corpus for a specific question, extracting the information relevant to the correct reply from the question and inferring the appropriate answer from that information. Retrieval-based dialogue models are widely used in multi-turn dialogue systems such as Microsoft XiaoIce and, compared with generative dialogue models, are more reliable and more practical.
Lowe et al. constructed two baseline models for the reply selection task in retrieval-based multi-turn dialogue, based respectively on the recurrent neural network (RNN) and the long short-term memory network (LSTM). When encoding text, these two baseline models memorize the text features of the previous moment by means of the RNN hidden units, which introduces temporal information into the model and overcomes the shortcoming of the bag-of-words models used in earlier algorithms. However, in multi-turn dialogue the conversation history can be lengthy and not all of its content is related to the reply; the two baseline models encode the whole dialogue directly, so they cannot extract important information from the dialogue in a targeted way and bring unnecessary noise into the model. To extract important information from long text, researchers proposed extracting it by matching the context with the reply, decomposing the reply selection task into three steps: first, extract features from each utterance and from the reply with an RNN-based algorithm; second, match the extracted utterance features with the reply features; third, extract the information needed to compute the score from the matching matrix with a method such as a CNN. However, the semantic information an RNN can extract is limited. RNN encoding assumes that the data are sequentially correlated, whereas topics in dialogue data are dynamic and two distant passages may also be highly related, which an RNN finds hard to learn accurately; moreover, when the encoded passage is long, RNN encoding may suffer from vanishing gradients and cannot capture long-range dependencies well. These limitations of the RNN mean that the above method may already lose important information in its first step. The Transformer architecture proposed by Vaswani et al. in 2017 can fully capture global dependency information through extensive self-attention and interactive attention operations without being limited by sequence distance. Researchers adapted the Transformer encoder for the encoding module of the model, strengthening the model's ability to extract information; influenced by the multi-head attention mechanism of the Transformer, this line of work also used multi-head attention in the matching stage to construct semantic information of various granularities, enriching the feature representation of the model and achieving a clear improvement. However, the above models still have the following problems. First, global sequence information is insufficiently considered. These models mainly use methods such as RNNs to encode all utterance representations after matching is finished, and important information may already be lost in the encoding and matching stages. Second, the word vector representations used do not take the context into account. These models mainly use static word vectors such as Word2vec, which have difficulty handling polysemy and cannot express semantic information accurately according to different contexts, thereby introducing noise in the encoding stage.
In 2018, Google proposed the well-known BERT (Bidirectional Encoder Representations from Transformers) model. When encoding dialogue data, BERT applies a deep self-attention mechanism at the word granularity over the whole sequence and, combined with its position embeddings, can effectively address the two problems of insufficient consideration of global sequence information and context-independent word vector representations. Research on the reply selection task in multi-turn dialogue has therefore largely turned to methods based on pre-trained language models. The basic procedure of such methods is to first encode the whole dialogue with a pre-trained language model composed of multi-layer Transformer encoders, and then feed the output representation at the [CLS] position, which can represent the global information, into a classification layer for prediction. On this basis, researchers strengthen the adaptability of the pre-trained language model to the dialogue task through a post-training strategy that lets the model learn from domain data before fine-tuning. In addition, some works try data augmentation methods such as generating additional dialogue text or generating speaker embeddings, and some works extract the output of the pre-trained language model for further fine-grained matching to filter out noise irrelevant to the reply; these works add new data or modules on top of the pre-trained language model and achieve good results. However, such methods simply feed the concatenation of context and reply into the pre-trained language model, so some latent features in the dialogue data cannot be learned; the additional modules greatly increase the number of model parameters while the improvement is limited, and the potential of the pre-trained language model is not mined efficiently. Recently, some works have introduced multi-task learning into the multi-turn dialogue reply selection task, designing multiple subtasks according to characteristics of dialogue data such as continuity and consistency. The subtasks share the parameters of the pre-trained language model with the main task, and additional loss functions are designed to optimize the pre-trained language model in a targeted way. Compared with previous frameworks, methods combining pre-trained language models with multi-task learning show remarkable results on the multi-turn dialogue reply selection task: they further exploit the potential of the pre-trained language model, reduce the number of model parameters, learn latent features in the dialogue data more efficiently, and greatly improve the pre-trained language model's understanding of dialogue data.
In summary, although models for multi-turn dialogue reply selection based on pre-trained language models have progressed, the following problems remain. First, the designed subtasks are not efficient enough. Designing multiple subtasks for the pre-trained language model means the model must be fine-tuned many times during training, at the cost of considerable time and computing resources, and the more subtasks are used, the higher the training cost. Meanwhile, the optimization objectives of some subtasks differ greatly from that of the main task; for example, a subtask that deletes an utterance and predicts its position may introduce noise into the main task. Second, the semantic understanding capability of the pre-trained language model is not fully exploited. These methods use only the [CLS] token, which represents the global information, for prediction in the main task and ignore the large amount of information output at positions other than [CLS]. Meanwhile, although the model can understand dialogue semantics from different angles and learn valuable dialogue information in several auxiliary tasks, the main task only performs a simple classification with the output at the [CLS] position and does not extract the information learned in the auxiliary tasks in a targeted way, so the model cannot make full use of the important semantic information learned in the auxiliary tasks.
Disclosure of Invention
The invention aims to provide a local information perception dialogue method and system based on a pre-training language model, which are beneficial to improving the accuracy of multi-turn dialogue reply selection.
In order to achieve the purpose, the invention adopts the technical scheme that: a local information perception dialogue method based on a pre-training language model comprises the following steps:
step A: collecting multi-turn dialog texts of a specific scene, labeling the category to which each multi-turn dialog reply belongs, and constructing a training set D with positive and negative category labels;
step B: training a local information perception deep learning network model PLIP based on the pre-training language model by using the training set D, for selecting the reply corresponding to a given multi-turn dialogue context;
step C: inputting the multi-turn dialogue context and the reply set into the trained local information perception deep learning network model PLIP to obtain the most appropriate reply corresponding to the multi-turn dialogue context.
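To make step C concrete, the following is a minimal, hypothetical sketch of the reply selection procedure (it is not part of the patent disclosure); `score_reply` is an assumed stand-in for the trained PLIP scoring model described in step B.

```python
# Hypothetical sketch of step C: rank a candidate reply set for one multi-turn
# context with a trained scoring function and return the best reply.
# `score_reply(context, reply)` is assumed to wrap the trained PLIP network and
# return the rationality score g(c, r) in [0, 1].

from typing import Callable, List

def select_reply(context: List[str],
                 candidates: List[str],
                 score_reply: Callable[[List[str], str], float]) -> str:
    """Return the candidate reply with the highest context-reply score."""
    scores = [score_reply(context, r) for r in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]

if __name__ == "__main__":
    ctx = ["how do I reset my password ?", "which system are you using ?", "the web portal"]
    cands = ["click the forgot-password link on the portal login page",
             "I like pizza",
             "the weather is nice today"]
    # dummy scorer (lexical overlap), standing in for the trained PLIP score g(c, r)
    dummy = lambda c, r: float(len(set(" ".join(c).split()) & set(r.split())))
    print(select_reply(ctx, cands, dummy))
```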
Further, the step B specifically includes the following steps:
step B1: inputting each sample of the training set D into the deep learning network model in the form of a triplet (c, r, y), wherein c = {u_1, u_2, ..., u_m} denotes a dialogue context containing m utterances, the t-th utterance u_t consisting of l_t words; r is a candidate reply consisting of l_r words; and y ∈ {0, 1} is the sample label, y = 1 indicating that the candidate reply is a reasonable reply for the current context and y = 0 indicating that it is unreasonable;
the deep learning network model PLIP encodes and computes the triplet and outputs an evaluation score reflecting the degree of correlation between the context and the reply; the deep learning network model uses the multi-layer attention mechanism of the pre-training language model to learn contextualized semantic representations and adopts a multi-task learning strategy: while optimizing the main task, i.e. the multi-turn dialogue reply selection task, it strengthens the pre-training language model's learning of the local context of the multi-turn dialogue in an auxiliary task, prompting the characterization vector to understand the global information, learning the degree of correlation between context and reply, and fully exploiting the semantic understanding capability of the pre-training language model;
step B2: in the auxiliary task part, the deep learning network model PLIP uses a random sliding window reply prediction task to further strengthen the pre-training language model's understanding of the local context of the multi-turn dialogue;
the random sliding window reply prediction task samples dialogue context data at different starting positions in the multi-turn dialogue context to obtain dialogue segments, encodes the dialogue segments with the pre-training language model, and predicts the reply of the window, so that the pre-training language model fully learns the semantic information of local contexts;
step B3: in the multi-turn dialogue reply selection task, the deep learning network model PLIP adopts a local information perception module to prompt the pre-training language model to generate local semantic information, fuses the global information with the local semantic information, computes the rationality score between the multi-turn dialogue context and the reply, and evaluates whether the current reply corresponds to the given multi-turn dialogue context; finally, according to the target loss function, the gradient of each parameter in the deep learning network model is computed by back propagation and the parameters are updated by stochastic gradient descent;
step B4: terminating the training of the deep learning network model when the iterative change of the loss value produced by the deep learning network model PLIP is smaller than a set threshold or the maximum number of iterations is reached.
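A minimal sketch of the stopping rule in step B4, under the assumption that `train_step` performs one optimization pass over the combined loss and returns the current loss value; the threshold and iteration cap below are illustrative only.

```python
# Sketch of the step-B4 stopping rule: stop when the change in loss falls below
# a set threshold or when the maximum number of iterations is reached.
# `train_step` is a stand-in for one forward/backward/SGD pass.

def train_until_converged(train_step, threshold: float = 1e-4,
                          max_iterations: int = 100_000) -> float:
    previous_loss = float("inf")
    for _ in range(max_iterations):
        loss = train_step()                        # forward, backward, parameter update
        if abs(previous_loss - loss) < threshold:  # loss change smaller than threshold
            break
        previous_loss = loss
    return loss

# Example usage with a dummy, decaying loss sequence:
losses = iter([1.0, 0.5, 0.49, 0.48999])
print(train_until_converged(lambda: next(losses)))
```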
Further, the step B1 specifically includes the following steps:
step B11: splicing the words and the replies in the conversation context to obtain an input x of the deep learning network model;
x = {[CLS], u_1, [EOT], u_2, [EOT], …, [EOT], u_m, [SEP], r, [SEP]}
wherein, x is a long text obtained by splicing, [ SEP ] is a separator, [ CLS ] is a mark used for learning global features by a deep learning network model, and [ EOT ] is a special mark used for learning local information by the deep learning network model;
step B12: mapping x into a form of a number sequence through a dictionary of a pre-training language model, wherein each number is an id of a word in a word list, inputting the id sequence into an embedding layer in the pre-training language model, and mapping the id sequence into word embedding representation, position embedding representation and paragraph embedding representation according to three initialized embedding matrixes;
X = Embedding_word(x) + Embedding_pos(x_pos) + Embedding_type(x_type)
wherein Embedding_word denotes the word-embedding mapping, which maps the input sequence to word vectors according to the vocabulary; Embedding_pos denotes the position-embedding mapping, which maps the position of each word to the corresponding position embedding matrix; Embedding_type denotes the paragraph-embedding mapping, which maps the context and the reply to different vector spaces; the three word vectors thus obtained are added to give the fused word vector X, where l is the number of words in x and [CLS], [SEP] and [EOT] are each treated as one word;
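For illustration, a hedged sketch of steps B11-B12 using the HuggingFace transformers library; the bert-base-uncased checkpoint, the specific tokenizer calls and the placement of [EOT] after every utterance are assumptions made for this sketch, since the text does not name a particular implementation.

```python
# Sketch of steps B11-B12 (assumptions: HuggingFace `transformers` and the
# `bert-base-uncased` checkpoint). [EOT] is registered as an extra special token
# so that it receives its own vocabulary id and embedding row.

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[EOT]"]})

def build_input(context, reply):
    """Splice utterances and reply into x = [CLS] u1 [EOT] u2 [EOT] ... [SEP] r [SEP]."""
    spliced = " [EOT] ".join(context) + " [EOT]"   # [EOT] after each utterance (assumed)
    return tokenizer(spliced, reply,               # pair input -> paragraph (token_type) ids
                     return_tensors="pt",
                     truncation=True,
                     max_length=512)

encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.resize_token_embeddings(len(tokenizer))    # account for the new [EOT] row

batch = build_input(["hello", "hi , how can I help ?"], "I need a refund")
hidden = encoder(**batch).last_hidden_state        # word + position + paragraph embeddings,
                                                   # encoded by the multi-layer Transformer
print(hidden.shape)                                # (1, sequence_length, 768)
```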
step B13: adding the word embedded representation, the sentence representation and the position representation of each word to obtain a fused embedded representation, and coding by using a multilayer Transformer network to obtain high-level semantic feature representation of a sequence;
the multi-layer Transformer network is formed by stacking a plurality of Transformer coding blocks; each Transformer coding block comprises a multi-head self-attention mechanism and a forward feedback layer, and a residual error connection and normalization layer is arranged behind each sublayer; x is firstly mapped into three vectors, namely a query vector Q, a key vector K and a value vector V, and the calculation formula is as follows:
Q = XW_Q + b_Q
K = XW_K + b_K
V = XW_V + b_V
wherein W_Q, W_K, W_V, b_Q, b_K and b_V denote trainable parameters;
step B14: feeding the Q, K and V vectors into the multi-head self-attention mechanism, splitting each of them along the word-vector dimension d into h sub-vectors of dimension d/h, feeding the sub-vectors into the self-attention mechanism separately for training, and finally concatenating the h sub-vectors into a d-dimensional output vector C; in order to prevent overfitting, make the vectors more holistic and accelerate network convergence, residual connection and normalization are added to the multi-head self-attention sublayer to obtain the vector T, with the following calculation formulas:
C = Concat(head_1, head_2, ..., head_h)W_C + b_C
T = LayerNorm(X + C)
wherein head_i denotes the self-attention output of the i-th sub-vector, W_C and b_C denote trainable parameters, Concat denotes the concatenation operation, and LayerNorm is the layer normalization transformation;
step B15: feeding the vector T into a fully connected feed-forward sublayer, which performs two linear transformations on T to obtain the comprehensive features FFN of the sequence; T and FFN are then connected by a residual connection and layer normalization is applied to obtain the final high-level features H of the sequence, with the following calculation formulas:
FFN = (W_F T + b_F)W_N + b_N
H = LayerNorm(T + FFN)
wherein W_F, W_N, b_F and b_N denote trainable parameters.
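The following is a minimal PyTorch sketch of one Transformer encoding block as described in steps B13-B15; it is an illustrative re-implementation, not the patent's code. Note that the standard Transformer inserts a non-linearity between the two linear maps of the feed-forward sublayer, which the formula above leaves implicit; a GELU is assumed here.

```python
# Minimal sketch of one encoder block: Q/K/V projections, multi-head
# self-attention, residual + LayerNorm, feed-forward, residual + LayerNorm.

import math
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d: int = 768, h: int = 12, d_ff: int = 3072):
        super().__init__()
        assert d % h == 0
        self.h, self.d_head = h, d // h
        self.W_Q = nn.Linear(d, d)          # Q = X W_Q + b_Q
        self.W_K = nn.Linear(d, d)          # K = X W_K + b_K
        self.W_V = nn.Linear(d, d)          # V = X W_V + b_V
        self.W_C = nn.Linear(d, d)          # projection after Concat(head_1..head_h)
        self.norm1 = nn.LayerNorm(d)        # T = LayerNorm(X + C)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.norm2 = nn.LayerNorm(d)        # H = LayerNorm(T + FFN)

    def forward(self, X: torch.Tensor) -> torch.Tensor:          # X: (batch, len, d)
        b, n, d = X.shape
        split = lambda t: t.view(b, n, self.h, self.d_head).transpose(1, 2)
        Q, K, V = split(self.W_Q(X)), split(self.W_K(X)), split(self.W_V(X))
        att = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        heads = (att @ V).transpose(1, 2).reshape(b, n, d)        # concatenate the h heads
        T = self.norm1(X + self.W_C(heads))                       # residual + layer norm
        return self.norm2(T + self.ff(T))                         # residual + layer norm

X = torch.randn(2, 16, 768)
print(EncoderBlock()(X).shape)   # torch.Size([2, 16, 768])
```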
Further, the step B2 specifically includes the following steps:
step B21: in the auxiliary-task random sliding window reply prediction, the model sets the length and position of the sliding window at random, samples from the dialogue context a large amount of local dialogue context data falling within the sliding window, and inserts a special tag [EOT] after each utterance of the local dialogue context data, as shown in the following formula:
wherein x′ is the input of the subtask; unlike the main task, x′ retains only the information inside the window and the other information is replaced by [PAD]; i is the starting position of the sliding window, w denotes the size of the current window, m denotes the number of utterances of the current context, and κ is a hyper-parameter denoting the minimum window size;
step B22: the dialogue window data is encoded using the pre-training language model BERT, the formula is as follows:
E=BERT(x′)
step B23: the vector E obtained in step B22 contains all the semantic representations of the dialogue segment encoded by the pre-training language model BERT, and the semantic representation that best represents the current dialogue segment is further selected from E to optimize the auxiliary task; in order not to disturb the [CLS] representation in the pre-training language model, which represents the global information, the model selects only the [EOT] representation E_[EOT] closest to the window reply in the output of the pre-training language model as the final characterization vector of the random sliding window reply prediction task; the auxiliary task evaluates the reasonableness of the window data, and the [EOT] tags in BERT learn information from different segments and different moments of the conversation, enriching the ability of the [EOT] representations to understand local information;
step B24: after obtaining the final characterization vector E_[EOT], it is fed into the classification layer to compute the score, with the following calculation formula:
g(w_c, w_r) = σ(W_w^T E_[EOT] + b_w)
wherein w_c and w_r denote the context and the reply within the sliding window, W_w is a trainable parameter of the prediction layer, and σ(·) denotes the sigmoid activation function;
step B25: the random sliding window reply prediction task is optimized with respect to its objective function by gradient descent; the objective function uses a cross-entropy loss to evaluate the difference between the current label and the true dialogue window label, with the following formula:
where D' represents a window data set.
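A hedged sketch of the window sampling in step B21 follows; the exact layout of x′ is not reproduced in the text above, so the placement of [PAD] and [EOT] below is an assumption.

```python
# Hedged sketch of step B21: sample a random window of at least `kappa`
# consecutive utterances, keep only the utterances inside the window (each
# followed by [EOT]), and replace everything outside the window with [PAD].

import random
from typing import List, Tuple

def sample_window(context: List[str], kappa: int = 2,
                  rng: random.Random = random.Random(0)) -> Tuple[str, int, int]:
    m = len(context)
    w = rng.randint(min(kappa, m), m)        # random window size, at least kappa
    i = rng.randint(0, m - w)                # random starting position
    pieces = []
    for t, utt in enumerate(context):
        if i <= t < i + w:
            pieces.append(utt + " [EOT]")    # utterance kept, [EOT] appended
        else:
            pieces.append("[PAD]")           # content outside the window masked out
    return " ".join(pieces), i, w

ctx = ["hi", "hello , what can I do for you ?", "my order is late", "which order id ?"]
x_window, start, width = sample_window(ctx)
print(x_window)
# The window reply (a true continuation, or a sampled negative) is then appended
# after [SEP] and scored with g(w_c, w_r) as in steps B22-B24.
```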
Further, the step B3 specifically includes the following steps:
step B31: the local information perception module embeds a special tag [EOT] behind each sentence in the dialogue context, as shown in the following formula:
x = {[CLS], u_1, [EOT], u_2, [EOT], …, [EOT], u_m, [SEP], r, [SEP]}
under the combined action of the pre-training language model's deep attention mechanism and position embeddings, the special tag [EOT] at each position can learn the interaction information with the surrounding text at that specific position; meanwhile, during the optimization of the random sliding window reply prediction task, the last [EOT] tag in the window is used to establish the classification task and gradually learns the ability to identify the window reply; thus the representation of the [EOT] tag gradually learns the correct representation of its sentence and pays more attention to the text of the local region;
step B32: in the feature fusion stage, the local information perception module selects from the output of the pre-training language model the n local semantic representations closest to the reply as local information of multiple granularities, and aggregates this local information into a whole by concatenation, with the following formula:
wherein, l represents the entry closest to the reply, and n is a hyper-parameter used for representing the number of [ EOT ] representations to be taken out;
step B33: the local information perception module integrally fuses local information and global information to obtain a final characterization vector of a main task, and the aggregation process is as follows:
step B34: inputting the aggregated characterization vectors into a classification layer to calculate the rationality score between the current multi-turn conversation context and the reply, wherein the formula is as follows:
g(c, r) = σ(W^T E_ensemble + b)
where W is a trainable parameter, σ(·) denotes the sigmoid activation function, and b is the bias term of the current classification layer;
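A hedged PyTorch sketch of steps B32-B34 follows; the aggregation formulas are not reproduced in the text above, so concatenating the [CLS] state with the n [EOT] states closest to the reply before the classification layer is an assumed reading.

```python
# Hedged sketch of steps B32-B34. H is the encoder output; eot_positions are the
# indices of the [EOT] tokens in the input sequence, ordered from the first to
# the last utterance, so the last n entries are the ones closest to the reply.

import torch
import torch.nn as nn

class LocalInfoHead(nn.Module):
    def __init__(self, d: int = 768, n: int = 3):
        super().__init__()
        self.n = n
        self.classifier = nn.Linear((n + 1) * d, 1)   # fuse global [CLS] + n local [EOT]

    def forward(self, H: torch.Tensor, eot_positions: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, d); eot_positions: (batch, num_eot)
        cls_vec = H[:, 0, :]                                      # global representation
        nearest = eot_positions[:, -self.n:]                      # n [EOT]s nearest the reply
        idx = nearest.unsqueeze(-1).expand(-1, -1, H.size(-1))    # (batch, n, d)
        local = torch.gather(H, 1, idx).flatten(1)                # concatenated local info
        fused = torch.cat([cls_vec, local], dim=-1)               # assumed E_ensemble
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # rationality score g(c, r)

H = torch.randn(2, 64, 768)
eots = torch.tensor([[5, 11, 20, 29], [4, 9, 15, 33]])
print(LocalInfoHead()(H, eots))    # two scores in (0, 1)
```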
step B35: the PLIP model updates parameters in the learning model in a gradient descending mode, and meanwhile, cross entropy is adopted as a loss function for a multi-round dialogue reply selection task, and the specific formula is as follows:
and combining the optimization target of the auxiliary task, wherein the final loss function of the model is as follows:
Loss = Loss_main + αLoss_window
wherein α is a hyper-parameter used to control the influence of the auxiliary random sliding window reply prediction task on the model.
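The explicit forms of Loss_main and Loss_window are not reproduced in the text above; assuming standard binary cross-entropy over the training set D and the window data set D′, a plausible reconstruction is:

```latex
% Assumed explicit forms (binary cross-entropy); reconstructions, not the
% patent's own formula images.
\begin{aligned}
\mathrm{Loss}_{\mathrm{window}} &= -\sum_{(w_c,\,w_r,\,y')\in D'} \Big[\, y'\log g(w_c,w_r) + (1-y')\log\big(1-g(w_c,w_r)\big) \Big] \\
\mathrm{Loss}_{\mathrm{main}}   &= -\sum_{(c,\,r,\,y)\in D} \Big[\, y\log g(c,r) + (1-y)\log\big(1-g(c,r)\big) \Big] \\
\mathrm{Loss} &= \mathrm{Loss}_{\mathrm{main}} + \alpha\,\mathrm{Loss}_{\mathrm{window}}
\end{aligned}
```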
The invention also provides a local information perception dialogue system adopting the above method, which comprises:
the data collection module is used for collecting multi-round conversation samples in a specific field, labeling answer positive and negative labels corresponding to each question in the multi-round conversation data, and constructing a multi-round conversation reply selection training set D with the positive and negative labels;
the pre-training language model coding module, which is mainly composed of an embedding layer and a multi-layer multi-head attention mechanism, and is used for sending each triplet-form training sample of the training set D into the pre-training language model BERT and learning contextualized semantic representations by means of the multi-layer attention mechanism of the pre-training language model, while fully exploiting the semantic understanding capability of the pre-training language model through multi-task learning;
the auxiliary task module is used for exporting parameters of the pre-training language model BERT, and replying a prediction task by using a random sliding window to further enhance the comprehension capability of the pre-training language model on local dialogue information; the random sliding window replying prediction task samples window data with different positions and sizes in a multi-turn conversation context, a derived pre-training language model is used for coding a conversation window, and the pre-training language model is made to fully learn local language characteristics of different conversation stages and conversation lengths by utilizing the reply of a newly added special label [ EOT ] prediction window;
the local information perception module is used for promoting a pre-training language model BERT to generate multi-granularity local semantic information by adopting the local information perception module in a multi-round dialogue reply selection task, meanwhile, global information and the local semantic information are fused to perform classification score calculation, and whether the current reply corresponds to a given multi-round dialogue context is evaluated; finally, calculating the gradient of each parameter in the deep learning network model by using a back propagation method according to the target loss function, and updating the parameters by using a random gradient descent method; and
and the network training module, which is used for terminating the training of the deep learning network model when the iterative change of the loss value produced by the deep learning network model is smaller than a set threshold and no longer decreases, or when the maximum number of iterations is reached.
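Purely as an illustration of how the five modules above could be composed, the following is a hypothetical skeleton; all class and method names are invented for this sketch and do not appear in the patent.

```python
# Hypothetical composition of the five modules; the encoder is shared between
# the auxiliary task and the local-information-aware main task.

class PLIPSystem:
    def __init__(self, data_module, encoder_module, aux_task_module,
                 local_info_module, trainer_module):
        self.data = data_module              # builds training set D with pos/neg labels
        self.encoder = encoder_module        # pre-training language model (BERT) encoder
        self.aux_task = aux_task_module      # random sliding-window reply prediction
        self.local_info = local_info_module  # [CLS]/[EOT] fusion + classification layer
        self.trainer = trainer_module        # loss-threshold / max-iteration stopping

    def train(self):
        dataset = self.data.build()
        self.trainer.fit(dataset, self.encoder, self.aux_task, self.local_info)

    def respond(self, context, candidate_replies):
        scores = [self.local_info.score(self.encoder, context, r)
                  for r in candidate_replies]
        return candidate_replies[max(range(len(scores)), key=scores.__getitem__)]
```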
Compared with the prior art, the invention has the following beneficial effects: the method and system adopt a multi-task learning strategy; while optimizing the main task, i.e. the multi-turn dialogue reply selection task, and learning the degree of correlation between context and reply, they strengthen the pre-training language model's learning of the local regions of the multi-turn dialogue in an auxiliary task and fully exploit its semantic understanding capability, thereby fusing global information with local semantic information and obtaining the most appropriate reply corresponding to the multi-turn dialogue context. Therefore, the invention can effectively improve the accuracy of multi-turn dialogue reply selection and has strong practicability and broad application prospects.
Drawings
FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention;
FIG. 2 is a diagram of a local information-aware deep learning network model architecture based on a pre-trained language model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating random sliding window reply prediction in accordance with an embodiment of the present invention;
FIG. 4 is a structural diagram of the random sliding window reply prediction according to an embodiment of the present invention;
FIG. 5 is a block diagram of a local information awareness module according to an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a local information-aware dialogue method based on a pre-trained language model, which includes the following steps:
step A: and collecting multi-turn dialog texts of a specific scene, labeling the category to which each multi-turn dialog reply belongs, and constructing a training set D with positive and negative category labels.
Step B: training a local information perception deep learning network model PLIP based on the pre-training language model by using the training set D, for selecting the reply corresponding to the given multi-turn dialogue context.
The step B specifically comprises the following steps:
Step B1: inputting each sample of the training set D into the deep learning network model in the form of a triplet (c, r, y), wherein c = {u_1, u_2, ..., u_m} denotes a dialogue context containing m utterances, the t-th utterance u_t consisting of l_t words; r is a candidate reply consisting of l_r words; and y ∈ {0, 1} is the sample label, where y = 1 indicates that the candidate reply is a reasonable reply for the current context and y = 0 indicates that it is unreasonable.
The deep learning network model PLIP encodes and computes the triplet and outputs an evaluation score reflecting the degree of correlation between the context and the reply. The deep learning network model uses the multi-layer attention mechanism of the pre-training language model to learn contextualized semantic representations and adopts a multi-task learning strategy: while optimizing the main task, i.e. the multi-turn dialogue reply selection task, it strengthens the pre-training language model's learning of the local context of the multi-turn dialogue in an auxiliary task, prompting the characterization vector to understand the global information, learning the degree of correlation between context and reply, and fully exploiting the semantic understanding capability of the pre-training language model. The architecture of the deep learning network model is shown in fig. 2.
The step B1 specifically includes the following steps:
step B11: splicing the words and the replies in the conversation context to obtain an input x of the deep learning network model;
x = {[CLS], u_1, [EOT], u_2, [EOT], …, [EOT], u_m, [SEP], r, [SEP]}
wherein, x is a long text obtained by splicing, [ SEP ] is a separator, [ CLS ] is a mark used by the deep learning network model for learning global features, and [ EOT ] is a special mark used by the deep learning network model for learning local information.
Step B12: mapping x into a form of a number sequence through a dictionary of a pre-training language model, wherein each number is an id of a word in a word list, inputting the id sequence into an embedding layer in the pre-training language model, and mapping the id sequence into word embedding representation, position embedding representation and paragraph embedding representation according to three initialized embedding matrixes;
X = Embedding_word(x) + Embedding_pos(x_pos) + Embedding_type(x_type)
wherein Embedding_word denotes the word-embedding mapping, which maps the input sequence to word vectors according to the vocabulary; Embedding_pos denotes the position-embedding mapping, which maps the position of each word to the corresponding position embedding matrix; Embedding_type denotes the paragraph-embedding mapping, which maps the context and the reply to different vector spaces; the three word vectors thus obtained are added to give the fused word vector X, where l is the number of words in x and [CLS], [SEP] and [EOT] are each treated as one word.
Step B13: adding the word embedded representation, the sentence representation and the position representation of each word to obtain a fused embedded representation, and coding by using a multilayer Transformer network to obtain the high-level semantic feature representation of the sequence.
The multi-layer Transformer network is formed by stacking a plurality of Transformer coding blocks; each Transformer coding block comprises a multi-head self-attention mechanism and a forward feedback layer, and a residual error connection and normalization layer is arranged behind each sublayer; x is firstly mapped into three vectors, namely a query vector Q, a key vector K and a value vector V, and the calculation formula is as follows:
Q = XW_Q + b_Q
K = XW_K + b_K
V = XW_V + b_V
wherein W_Q, W_K, W_V, b_Q, b_K and b_V denote trainable parameters.
Step B14: sending Q, K, V vectors into a multi-head self-attention machine system, dividing h sub-vectors on the word vector dimension d of the vectors, wherein the dimension of each sub-vector is d/h, respectively sending the sub-vectors into the self-attention machine system for training, and finally splicing the h self-attention sub-vectors to obtain a d-dimensional output vector C again; in order to prevent overfitting, make the vector more integral and accelerate network convergence, residual connection and normalization are added to the multi-head self-attention mechanism sublayer to obtain a vector T, and the calculation formula is as follows:
C = Concat(head_1, head_2, ..., head_h)W_C + b_C
T = LayerNorm(X + C)
wherein head_i denotes the self-attention output of the i-th sub-vector, W_C and b_C denote trainable parameters, Concat denotes the concatenation operation, and LayerNorm is the layer normalization transformation.
Step B15: feeding the vector T into a fully connected feed-forward sublayer, which performs two linear transformations on T to obtain the comprehensive features FFN of the sequence; T and FFN are then connected by a residual connection and layer normalization is applied to obtain the final high-level features H of the sequence, with the following calculation formulas:
FFN = (W_F T + b_F)W_N + b_N
H = LayerNorm(T + FFN)
wherein W_F, W_N, b_F and b_N denote trainable parameters.
Step B2: in the auxiliary task part, the PLIP deep learning network model uses a random sliding window to reply to a prediction task to further strengthen the comprehensibility of the pre-training language model on the local context of the multi-turn dialogue.
The random sliding window reply prediction task samples dialogue context data at different starting positions in the multi-turn dialogue context to obtain dialogue segments, encodes the dialogue segments with the pre-training language model, and predicts the reply of the window, so that the pre-training language model fully learns the semantic information of local contexts. The process and structure of the random sliding window reply prediction are shown in fig. 3 and fig. 4.
The step B2 specifically includes the following steps:
Step B21: in the auxiliary-task random sliding window reply prediction, the model sets the length and position of the sliding window at random, samples from the dialogue context a large amount of local dialogue context data falling within the sliding window, and inserts a special tag [EOT] after each utterance of the local dialogue context data, as shown in the following formula:
wherein x′ is the input of the subtask; unlike the main task, x′ retains only the information inside the window and the other information is replaced by [PAD]; i is the starting position of the sliding window, w denotes the size of the current window, m denotes the number of utterances of the current context, and κ is a hyper-parameter denoting the minimum window size.
Step B22: the dialog window data is encoded using the pre-training language model BERT, the formula being as follows:
E=BERT(x′)
Step B23: the vector E obtained in step B22 contains all the semantic representations of the dialogue segment encoded by the pre-training language model BERT, and the semantic representation that best represents the current dialogue segment is further selected from E to optimize the auxiliary task; in order not to disturb the [CLS] representation in the pre-training language model, which represents the global information, the model selects only the [EOT] representation E_[EOT] closest to the window reply in the output of the pre-training language model as the final characterization vector of the random sliding window reply prediction task; the auxiliary task evaluates the reasonableness of the window data, and the [EOT] tags in BERT learn information from different segments and different moments of the conversation, enriching the ability of the [EOT] representations to understand local information.
Step B24: after obtaining the final characterization vector E_[EOT], it is fed into the classification layer to compute the score, with the following calculation formula:
g(w_c, w_r) = σ(W_w^T E_[EOT] + b_w)
wherein w_c and w_r denote the context and the reply within the sliding window, W_w is a trainable parameter of the prediction layer, and σ(·) denotes the sigmoid activation function.
Step B25: the random sliding window reply prediction task is optimized with respect to its objective function by gradient descent; the objective function uses a cross-entropy loss to evaluate the difference between the current label and the true dialogue window label, with the following formula:
where D' represents a window data set.
Step B3: in the multi-round dialogue reply selection task, a local information perception module shown in figure 5 is adopted by a deep learning network model PLIP to promote a pre-training language model to generate local semantic information, global information and the local semantic information are fused at the same time, the rationality score between the multi-round dialogue context and the reply is calculated, whether the current reply corresponds to the given multi-round dialogue context is evaluated, finally, the gradient of each parameter in the deep learning network model is calculated by using a back propagation method according to a target loss function, and the parameter is updated by using a random gradient descent method.
The step B3 specifically includes the following steps:
Step B31: the local information perception module embeds a special tag [EOT] behind each sentence in the dialogue context, as shown in the following formula:
x = {[CLS], u_1, [EOT], u_2, [EOT], …, [EOT], u_m, [SEP], r, [SEP]}
Under the combined action of the pre-training language model's deep attention mechanism and position embeddings, the special tag [EOT] at each position can learn the interaction information with the surrounding text at that specific position; meanwhile, during the optimization of the random sliding window reply prediction task, the last [EOT] tag in the window is used to establish the classification task and gradually learns the ability to identify the window reply; thus the representation of the [EOT] tag gradually learns the correct representation of its sentence and pays more attention to the text of the local region.
Step B32: in the feature fusion stage, the local information perception module selects from the output of the pre-training language model the n local semantic representations closest to the reply as local information of multiple granularities, and aggregates this local information into a whole by concatenation, with the following formula:
where l represents the entry closest to the reply, and n is a hyper-parameter representing the number of [EOT] representations to be fetched.
Step B33: the local information perception module integrally fuses local information and global information to obtain a final characterization vector of a main task, and the aggregation process is as follows:
step B34: inputting the aggregated characterization vectors into a classification layer to calculate the rationality score between the current multi-turn conversation context and the reply, wherein the formula is as follows:
g(c, r) = σ(W^T E_ensemble + b)
where W is a trainable parameter, σ(·) denotes the sigmoid activation function, and b is the bias term of the current classification layer.
Step B35: the PLIP model updates parameters in the learning model in a gradient descending mode, and meanwhile, cross entropy is adopted as a loss function for a multi-round dialogue reply selection task, and the specific formula is as follows:
combining the optimization objective of the auxiliary task, the final loss function of the model is:
Loss = Loss_main + αLoss_window
wherein α is a hyper-parameter used to control the influence of the auxiliary random sliding window reply prediction task on the model.
Step B4: and terminating the training of the deep learning network model when the iterative change of the loss value generated by the PLIP of the deep learning network model is smaller than a set threshold value or reaches the maximum iteration times.
Step C: inputting the multi-turn dialogue context and the reply set into the trained local information perception deep learning network model PLIP to obtain the most appropriate reply corresponding to the multi-turn dialogue context.
This embodiment also provides a local information perception dialogue system adopting the above method, which comprises a data collection module, a pre-training language model coding module, an auxiliary task module, a local information perception module and a network training module.
The data collection module is used for collecting multi-round conversation samples in a specific field, labeling answer positive and negative labels corresponding to each question in the multi-round conversation data, and constructing a multi-round conversation reply selection training set D with the positive and negative labels.
The pre-training language model coding module comprises a pre-training language model, and the pre-training language model mainly comprises an embedded layer and a multi-layer multi-head attention mechanism; sending each training sample in the form of a triplet of the training set D into a pre-training language model BERT, and learning to combine context semantic representation by utilizing a multi-layer attention mechanism of the pre-training language model; meanwhile, the model fully excavates the semantic understanding ability of the pre-training language model in a multi-task learning mode.
In an auxiliary task module, parameters of a pre-training language model BERT are exported by the model, and the comprehension capability of the pre-training language model on conversation local information is further enhanced by replying a prediction task by using a random sliding window; the random sliding window replying prediction task samples window data with different positions and sizes in a multi-turn conversation context, encodes a conversation window by using a derived pre-training language model, and makes the pre-training language model fully learn local language characteristics of different conversation stages and conversation lengths by using the replying of a newly-added special label [ EOT ] prediction window.
In the multi-round dialog reply selection task, the model adopts a local information perception module to promote a pre-training language model BERT to generate multi-granularity local semantic information, meanwhile, global information and the local semantic information are fused to perform classification score calculation, and whether the current reply corresponds to a given multi-round dialog context is evaluated; and finally, calculating the gradient of each parameter in the deep learning network model by using a back propagation method according to the target loss function, and updating the parameter by using a random gradient descent method.
The network training module is used for training the network model; when the iterative change of the loss value produced by the deep learning network model is smaller than a set threshold and no longer decreases, or when the maximum number of iterations is reached, the training of the deep learning network model is terminated.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.
Claims (6)
1. A local information perception dialogue method based on a pre-training language model is characterized by comprising the following steps:
step A: collecting multi-turn dialog texts of a specific scene, labeling the category to which each multi-turn dialog reply belongs, and constructing a training set D with positive and negative category labels;
step B: training a local information perception deep learning network model PLIP based on the pre-training language model by using the training set D, for selecting the reply corresponding to a given multi-turn dialogue context;
step C: inputting the multi-turn dialogue context and the reply set into the trained local information perception deep learning network model PLIP to obtain the most appropriate reply corresponding to the multi-turn dialogue context.
2. The local information perception dialogue method based on the pre-trained language model according to claim 1, wherein the step B specifically comprises the following steps:
step B1: inputting each sample of the training set D into the deep learning network model in the form of a triplet (c, r, y), wherein c = {u_1, u_2, ..., u_m} denotes a dialogue context containing m utterances, the t-th utterance u_t consisting of l_t words; r is a candidate reply consisting of l_r words; and y ∈ {0, 1} is the sample label, y = 1 indicating that the candidate reply is a reasonable reply for the current context and y = 0 indicating that it is unreasonable;
the deep learning network model PLIP encodes and computes the triplet and outputs an evaluation score reflecting the degree of correlation between the context and the reply; the deep learning network model uses the multi-layer attention mechanism of the pre-training language model to learn contextualized semantic representations and adopts a multi-task learning strategy: while optimizing the main task, i.e. the multi-turn dialogue reply selection task, it strengthens the pre-training language model's learning of the local context of the multi-turn dialogue in an auxiliary task, prompting the characterization vector to understand the global information, learning the degree of correlation between context and reply, and fully exploiting the semantic understanding capability of the pre-training language model;
step B2: in the auxiliary task part, the deep learning network model PLIP uses a random sliding window reply prediction task to further strengthen the pre-training language model's understanding of the local context of the multi-turn dialogue;
the random sliding window reply prediction task samples dialogue context data at different starting positions in the multi-turn dialogue context to obtain dialogue segments, encodes the dialogue segments with the pre-training language model, and predicts the reply of the window, so that the pre-training language model fully learns the semantic information of local contexts;
step B3: in the multi-turn dialogue reply selection task, the deep learning network model PLIP adopts a local information perception module to prompt the pre-training language model to generate local semantic information, fuses the global information with the local semantic information, computes the rationality score between the multi-turn dialogue context and the reply, and evaluates whether the current reply corresponds to the given multi-turn dialogue context; finally, according to the target loss function, the gradient of each parameter in the deep learning network model is computed by back propagation and the parameters are updated by stochastic gradient descent;
step B4: terminating the training of the deep learning network model when the iterative change of the loss value produced by the deep learning network model PLIP is smaller than a set threshold or the maximum number of iterations is reached.
3. The local information perception dialogue method based on the pre-trained language model of claim 2, wherein the step B1 specifically comprises the following steps:
step B11: splicing the words and the replies in the conversation context to obtain an input x of the deep learning network model;
x = {[CLS], u_1, [EOT], u_2, [EOT], …, [EOT], u_m, [SEP], r, [SEP]}
wherein, x is a long text obtained by splicing, [ SEP ] is a separator, [ CLS ] is a mark used for learning global features of a deep learning network model, and [ EOT ] is a special mark used for learning local information of the deep learning network model;
step B12: mapping x into a digital sequence through the dictionary of the pre-trained language model, each number being the id of a word in the vocabulary; inputting the id sequence into the embedding layer of the pre-trained language model, where it is mapped into a word embedding representation, a position embedding representation and a paragraph embedding representation according to three initialized embedding matrices;
X = Embedding_word(x) + Embedding_pos(x_pos) + Embedding_type(x_type)
wherein Embedding_word denotes the word embedding mapping, which maps the input sequence into word vectors according to the vocabulary; Embedding_pos denotes the position embedding mapping, which maps each word to the corresponding position embedding according to its position in the sequence; Embedding_type denotes the paragraph embedding mapping, which maps the context and the reply into different vector spaces; the three word vectors are added to obtain the word vector X, where l is the number of words in x and [CLS], [SEP] and [EOT] are each treated as a single word;
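A minimal sketch of the three-way embedding lookup of step B12 (illustrative only; the vocabulary size, maximum length and hidden dimension are assumed values taken from a typical BERT configuration):

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, max_len=512, num_types=2, hidden=768):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)   # Embedding_word
        self.pos = nn.Embedding(max_len, hidden)       # Embedding_pos
        self.type = nn.Embedding(num_types, hidden)    # Embedding_type: context vs. reply

    def forward(self, ids, type_ids):
        # ids, type_ids: (batch, l) tensors of token ids and segment ids
        positions = torch.arange(ids.size(1), device=ids.device).unsqueeze(0)
        # X = Embedding_word(x) + Embedding_pos(x_pos) + Embedding_type(x_type)
        return self.word(ids) + self.pos(positions) + self.type(type_ids)
```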
step B13: adding the word embedding representation, the paragraph embedding representation and the position embedding representation of each word to obtain a fused embedding representation, and encoding it with a multi-layer Transformer network to obtain a high-level semantic feature representation of the sequence;
the multi-layer Transformer network is formed by stacking multiple Transformer encoding blocks; each Transformer encoding block contains a multi-head self-attention mechanism and a feed-forward layer, and each sublayer is followed by a residual connection and a normalization layer; X is first mapped into three vectors, namely a query vector Q, a key vector K and a value vector V, with the following calculation formulas:
Q = X W_Q + b_Q
K = X W_K + b_K
V = X W_V + b_V
wherein W_Q, W_K, W_V, b_Q, b_K, b_V are trainable parameters;
step B14: sending the Q, K, V vectors into the multi-head self-attention mechanism, which splits the word-vector dimension d into h sub-vectors of dimension d/h each, feeds the sub-vectors into the self-attention mechanism separately, and finally concatenates the h self-attention sub-vectors to obtain a d-dimensional output vector C; to prevent overfitting, keep the vector coherent and accelerate network convergence, a residual connection and normalization are added after the multi-head self-attention sublayer to obtain a vector T, with the following calculation formulas:
C = Concat(head_1, head_2, ..., head_h) W_C + b_C
T = LayerNorm(X + C)
wherein head_i denotes the self-attention score of the i-th sub-vector, W_C and b_C are trainable parameters, Concat denotes the concatenation operation, and LayerNorm denotes the layer normalization transformation;
step B15: sending the vector T into a fully-connected feed-forward sublayer, which applies two linear transformations to T to obtain the comprehensive feature FFN of the sequence; T and FFN are then combined through a residual connection and layer normalization to obtain the final high-level feature H of the sequence, with the following calculation formulas:
FFN = (W_F T + b_F) W_N + b_N
H = LayerNorm(T + FFN)
wherein W_F, W_N, b_F, b_N are trainable parameters.
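Steps B13 to B15 describe a standard Transformer encoder block; a minimal PyTorch sketch with the same structure is given below for illustration (the GELU activation in the feed-forward sublayer follows common BERT practice and is an assumption, since the claim writes the sublayer as two linear transformations):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, hidden=768, heads=12, ff_dim=3072, dropout=0.1):
        super().__init__()
        # multi-head self-attention: the d-dimensional vectors are split into h sub-vectors
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)       # residual connection + LayerNorm -> T
        self.ff = nn.Sequential(                # two linear transformations -> FFN
            nn.Linear(hidden, ff_dim), nn.GELU(), nn.Linear(ff_dim, hidden)
        )
        self.norm2 = nn.LayerNorm(hidden)       # residual connection + LayerNorm -> H

    def forward(self, x):
        c, _ = self.attn(x, x, x)               # C = Concat(head_1 .. head_h) W_C + b_C
        t = self.norm1(x + c)                   # T = LayerNorm(X + C)
        return self.norm2(t + self.ff(t))       # H = LayerNorm(T + FFN)
```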
4. The local information perception dialogue method based on the pre-trained language model of claim 3, wherein the step B2 specifically comprises the following steps:
step B21: in the auxiliary random sliding window reply prediction task, the model sets the length and the starting position of the sliding window at random, samples a large amount of local dialogue context data falling inside the sliding window from the dialogue context, and inserts a special tag [EOT] after each utterance of the local dialogue context data to form the sub-task input x';
wherein x' is the input of the sub-task; unlike the main-task input, x' only retains the information inside the window, all other information being replaced by [PAD]; i is the starting position of the sliding window, w denotes the size of the current window, m denotes the number of utterances of the current context, and k is a hyper-parameter denoting the minimum window size;
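A sketch of the random window sampling of step B21 (illustrative only; the exact arrangement of the [PAD] tokens and the splicing of the window reply are assumptions consistent with the description above):

```python
import random
from typing import List

def build_window_input(context: List[str], window_reply: str, k: int = 2) -> List[str]:
    """Sample a random window of the context and build the sub-task input x'."""
    m = len(context)
    w = random.randint(min(k, m), m)        # window size w, at least the minimum window k
    i = random.randint(0, m - w)            # random starting position i
    tokens = ["[CLS]"]
    for t, utterance in enumerate(context):
        if i <= t < i + w:
            tokens.extend(utterance.split())
            tokens.append("[EOT]")          # [EOT] follows each utterance inside the window
        else:
            tokens.append("[PAD]")          # information outside the window is discarded
    tokens += ["[SEP]"] + window_reply.split() + ["[SEP]"]
    return tokens
```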
step B22: encoding the dialogue window data using the pre-trained language model BERT, with the following formula:
E=BERT(x′)
step B23: the vector E obtained in step B22 contains the semantic representations of the entire dialogue segment encoded by the pre-trained language model BERT; the representation most characteristic of the current dialogue segment is selected from E to optimize the auxiliary task; in order not to disturb the [CLS] representation, which carries the global information in the pre-trained language model, the model only selects the [EOT] representation E_[EOT] closest to the window reply from the output of the pre-trained language model as the final characterization vector of the random sliding window reply prediction task; because the auxiliary task judges the rationality of the window data, the [EOT] tag in BERT learns information from different segments and different moments of the dialogue, enriching the ability of [EOT] to understand local information;
step B24: after obtaining the final characterization vector E_[EOT], inputting it into the classification layer to calculate a score, with the following calculation formula:
g(w_c, w_r) = σ(W_w^T E_[EOT] + b_w)
wherein w_c and w_r denote the context and the reply within the sliding window, W_w and b_w are trainable parameters of the prediction layer, and σ(·) denotes the sigmoid activation function;
step B25: the random sliding window reply prediction task is optimized by gradient descent on its objective function; the objective function uses a cross-entropy loss to measure the difference between the predicted score and the true label of the dialogue window over the window data set D'.
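A sketch of the auxiliary prediction head and cross-entropy objective of steps B24 and B25 (illustrative; the `eot_index` argument assumes the position of the [EOT] closest to the window reply has already been located):

```python
import torch
import torch.nn as nn

class WindowReplyHead(nn.Module):
    """Computes g(w_c, w_r) = sigmoid(W_w^T E_[EOT] + b_w)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.linear = nn.Linear(hidden, 1)     # W_w, b_w

    def forward(self, encoder_output, eot_index):
        # encoder_output: (batch, seq_len, hidden) BERT output over x'
        # eot_index: (batch,) position of the [EOT] closest to the window reply
        e_eot = encoder_output[torch.arange(encoder_output.size(0)), eot_index]
        return torch.sigmoid(self.linear(e_eot)).squeeze(-1)

def window_loss(scores, labels):
    """Binary cross entropy over the window data set D' (Loss_window)."""
    return nn.functional.binary_cross_entropy(scores, labels.float())
```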
5. The local information perception dialogue method according to claim 4, wherein the step B3 specifically comprises the following steps:
step B31: the local information perception module embeds a special tag [EOT] after each sentence in the dialogue context, as shown in the following formula:
x={[CLS],u1,[EOT],u2,[EOT],…,[EOT],um,[SEP],r,[SEP]}
under the combined action of the deep attention mechanism of the pre-trained language model and the position embeddings, the special tag [EOT] at each position learns interaction information with the surrounding text at that position; meanwhile, during the optimization of the random sliding window reply prediction task, the last [EOT] tag in the window is used to establish a classification task and gradually learns the ability to identify the window reply; thus the representation of the [EOT] tag gradually learns a correct representation of its sentence and focuses more on the text of the local region;
step B32: in the feature fusion stage, the local information perception module selects, from the output of the pre-trained language model, the n local semantic representations closest to the reply as multi-granularity local information, and aggregates them into a whole by splicing, where l denotes the [EOT] representation closest to the reply and n is a hyper-parameter denoting the number of [EOT] representations to be taken;
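A sketch of the multi-granularity local-information selection of step B32 (illustrative; it assumes the [EOT] positions in the main-task input are known in advance):

```python
import torch

def gather_local_info(encoder_output, eot_positions, n=3):
    """Concatenate the n [EOT] representations closest to the reply.

    encoder_output: (batch, seq_len, hidden) BERT output over x
    eot_positions:  (batch, num_eot) indices of the [EOT] tokens, left to right;
                    the right-most one (index l) is the [EOT] closest to the reply.
    """
    nearest = eot_positions[:, -n:]                                # positions l-n+1 .. l
    batch_idx = torch.arange(encoder_output.size(0)).unsqueeze(1)  # (batch, 1)
    local = encoder_output[batch_idx, nearest]                     # (batch, n, hidden)
    return local.reshape(local.size(0), -1)                        # spliced local information
```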
step B33: the local information perception module fuses the local information and the global information into a whole to obtain the final characterization vector E_ensemble of the main task;
step B34: inputting the aggregated characterization vector into the classification layer to calculate the rationality score between the current multi-round dialogue context and the reply, with the following formula:
g(c, r) = σ(W^T E_ensemble + b)
where W is a trainable parameter, σ(·) denotes the sigmoid activation function, and b is the bias term of the current classification layer;
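Steps B33 and B34 fuse the local information with the global [CLS] representation and score the (context, reply) pair; a minimal sketch is given below (the concatenation-based fusion is an assumption consistent with the splicing strategy of step B32):

```python
import torch
import torch.nn as nn

class ReplySelectionHead(nn.Module):
    """Computes g(c, r) = sigmoid(W^T E_ensemble + b) for the main task."""
    def __init__(self, hidden=768, n_local=3):
        super().__init__()
        self.linear = nn.Linear(hidden * (1 + n_local), 1)   # W, b over the fused vector

    def forward(self, cls_vector, local_vector):
        # cls_vector:   (batch, hidden)            global information from [CLS]
        # local_vector: (batch, n_local * hidden)  output of gather_local_info
        e_ensemble = torch.cat([cls_vector, local_vector], dim=-1)
        return torch.sigmoid(self.linear(e_ensemble)).squeeze(-1)
```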
step B35: the PLIP model updates the parameters of the model by gradient descent; the multi-round dialogue reply selection task adopts cross entropy as its loss function Loss_main; combining the optimization target of the auxiliary task, the final loss function of the model is:
Loss = Loss_main + α Loss_window
wherein α is a hyper-parameter used to control the influence of the random sliding window reply prediction auxiliary task on the model.
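A sketch of the joint objective of step B35 (illustrative; the default value of `alpha` is an assumption):

```python
import torch.nn as nn

def joint_loss(main_scores, main_labels, window_scores, window_labels, alpha=0.5):
    """Loss = Loss_main + alpha * Loss_window, both binary cross entropy."""
    bce = nn.functional.binary_cross_entropy
    loss_main = bce(main_scores, main_labels.float())
    loss_window = bce(window_scores, window_labels.float())
    return loss_main + alpha * loss_window
```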
6. A local information-aware dialog system employing the method of any one of claims 1 to 5, comprising:
the data collection module is used for collecting multi-round dialogue samples in a specific field, labeling each question in the multi-round dialogue data with positive and negative answer labels, and constructing a multi-round dialogue reply selection training set D with positive and negative labels;
the pre-trained language model coding module, mainly composed of an embedding layer and a multi-layer multi-head attention mechanism, sends each training sample of the training set D, in triplet form, into the pre-trained language model BERT and learns context-aware semantic representations using the multi-layer attention mechanism of the pre-trained language model; meanwhile, the module fully exploits the semantic understanding ability of the pre-trained language model through multi-task learning;
the auxiliary task module is used for deriving the parameters of the pre-trained language model BERT and using a random sliding window reply prediction task to further enhance the pre-trained language model's understanding of local dialogue information; the random sliding window reply prediction task samples window data of different positions and sizes from the multi-turn dialogue context, encodes the dialogue window with the derived pre-trained language model, and predicts the window reply using the newly added special tag [EOT], so that the pre-trained language model fully learns the local language characteristics of different dialogue stages and dialogue lengths;
the local information perception module is used for prompting the pre-trained language model BERT to generate multi-granularity local semantic information in the multi-round dialogue reply selection task, calculating the rationality score between the multi-round dialogue context and the reply, and evaluating whether the current reply corresponds to the given multi-round dialogue context; finally, according to the target loss function, the gradient of each parameter in the deep learning network model is calculated by back propagation and the parameters are updated by stochastic gradient descent; and
the network training module, which terminates the training of the deep learning network model when the change of the loss value produced by the deep learning network model between iterations is smaller than a set threshold or the maximum number of iterations is reached.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210109478.2A CN114443827A (en) | 2022-01-28 | 2022-01-28 | Local information perception dialogue method and system based on pre-training language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114443827A true CN114443827A (en) | 2022-05-06 |
Family
ID=81370746
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210109478.2A Pending CN114443827A (en) | 2022-01-28 | 2022-01-28 | Local information perception dialogue method and system based on pre-training language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114443827A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021077974A1 (en) * | 2019-10-24 | 2021-04-29 | 西北工业大学 | Personalized dialogue content generating method |
CN111274375A (en) * | 2020-01-20 | 2020-06-12 | 福州大学 | Multi-turn dialogue method and system based on bidirectional GRU network |
CN112818105A (en) * | 2021-02-05 | 2021-05-18 | 江苏实达迪美数据处理有限公司 | Multi-turn dialogue method and system fusing context information |
CN113806508A (en) * | 2021-09-17 | 2021-12-17 | 平安普惠企业管理有限公司 | Multi-turn dialogue method and device based on artificial intelligence and storage medium |
Non-Patent Citations (2)
Title |
---|
ZELIN CHEN 等: ""Improving BERT with local context comprehension for multi-turn response selection in retrieval-based dialogue systems"", 《COMPUTER SPEECH&LANGUAGE》, vol. 82, 31 July 2023 (2023-07-31), pages 1 - 15 * |
廖彬 等: ""一种局部信息增强与对话结构感知的多轮对话模型"", 《小型微型计算机系统》, vol. 44, no. 11, 30 November 2023 (2023-11-30), pages 2408 - 2415 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115129838A (en) * | 2022-06-08 | 2022-09-30 | 阿里巴巴(中国)有限公司 | Model training method, dialogue data processing method, device, equipment and storage medium |
CN115081437A (en) * | 2022-07-20 | 2022-09-20 | 中国电子科技集团公司第三十研究所 | Machine-generated text detection method and system based on linguistic feature contrast learning |
CN115081437B (en) * | 2022-07-20 | 2022-12-09 | 中国电子科技集团公司第三十研究所 | Machine-generated text detection method and system based on linguistic feature contrast learning |
CN115310429A (en) * | 2022-08-05 | 2022-11-08 | 厦门靠谱云股份有限公司 | Data compression and high-performance calculation method in multi-turn listening dialogue model |
CN115129824B (en) * | 2022-08-15 | 2024-09-13 | 山东交通学院 | Search type multi-round dialogue method and system |
CN115129824A (en) * | 2022-08-15 | 2022-09-30 | 山东交通学院 | Search type multi-turn dialogue method and system |
CN115048944A (en) * | 2022-08-16 | 2022-09-13 | 之江实验室 | Open domain dialogue reply method and system based on theme enhancement |
CN115329062A (en) * | 2022-10-17 | 2022-11-11 | 中邮消费金融有限公司 | Dialogue model training method under low-data scene and computer equipment |
CN115617971A (en) * | 2022-11-14 | 2023-01-17 | 湖南君安科技有限公司 | Dialog text generation method based on ALBERT-Coref model |
CN116932703A (en) * | 2023-09-19 | 2023-10-24 | 苏州元脑智能科技有限公司 | User controllable content generation method, device, equipment and medium |
CN116957047B (en) * | 2023-09-19 | 2024-01-23 | 苏州元脑智能科技有限公司 | Sampling network updating method, device, equipment and medium |
CN116932703B (en) * | 2023-09-19 | 2024-01-23 | 苏州元脑智能科技有限公司 | User controllable content generation method, device, equipment and medium |
CN116957047A (en) * | 2023-09-19 | 2023-10-27 | 苏州元脑智能科技有限公司 | Sampling network updating method, device, equipment and medium |
CN117875434A (en) * | 2024-03-13 | 2024-04-12 | 中国科学技术大学 | Financial large model length extrapolation method for expanding input context length |
CN117875434B (en) * | 2024-03-13 | 2024-06-04 | 中国科学技术大学 | Financial large model length extrapolation method for expanding input context length |
CN118227796A (en) * | 2024-05-23 | 2024-06-21 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
CN118227796B (en) * | 2024-05-23 | 2024-07-19 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114443827A (en) | Local information perception dialogue method and system based on pre-training language model | |
CN111783462A (en) | Chinese named entity recognition model and method based on dual neural network fusion | |
CN111460176B (en) | Multi-document machine reading and understanding method based on hash learning | |
CN111274375A (en) | Multi-turn dialogue method and system based on bidirectional GRU network | |
CN114490991A (en) | Dialog structure perception dialog method and system based on fine-grained local information enhancement | |
CN113673535B (en) | Image description generation method of multi-modal feature fusion network | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN112364148B (en) | Deep learning method-based generative chat robot | |
CN116484024A (en) | Multi-level knowledge base construction method based on knowledge graph | |
CN114168754A (en) | Relation extraction method based on syntactic dependency and fusion information | |
CN118132674A (en) | Text information extraction method based on large language model and high-efficiency parameter fine adjustment | |
CN111597816A (en) | Self-attention named entity recognition method, device, equipment and storage medium | |
CN113887836A (en) | Narrative event prediction method fusing event environment information | |
CN117933226A (en) | Context-aware dialogue information extraction system and method | |
CN115422388B (en) | Visual dialogue method and system | |
CN111813907A (en) | Question and sentence intention identification method in natural language question-answering technology | |
CN114548090B (en) | Fast relation extraction method based on convolutional neural network and improved cascade labeling | |
CN115422945A (en) | Rumor detection method and system integrating emotion mining | |
CN113626537B (en) | Knowledge graph construction-oriented entity relation extraction method and system | |
CN114564568A (en) | Knowledge enhancement and context awareness based dialog state tracking method and system | |
CN114860908A (en) | Task-based dialogue state tracking method fusing slot association and semantic association | |
CN115169363A (en) | Knowledge-fused incremental coding dialogue emotion recognition method | |
CN114648005A (en) | Multi-fragment machine reading understanding method and device for multitask joint learning | |
CN111158640B (en) | One-to-many demand analysis and identification method based on deep learning | |
CN118114667B (en) | Named entity recognition model based on multitask learning and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||