CN116150334A - Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism

Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism

Info

Publication number
CN116150334A
CN116150334A (application CN202211591710.7A)
Authority
CN
China
Prior art keywords
emotion
model
training
unilm
replies
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211591710.7A
Other languages
Chinese (zh)
Inventor
朱国华
姚盛根
胡晓莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jianghan University
Original Assignee
Jianghan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jianghan University
Priority to CN202211591710.7A
Publication of CN116150334A
Legal status: Pending


Classifications

    • G06F16/3329 - Information retrieval; querying; query formulation; natural language query formulation or dialogue systems
    • G06F16/3346 - Information retrieval; querying; query processing; query execution using probabilistic model
    • G06F16/353 - Information retrieval; clustering or classification into predefined classes
    • G06N3/08 - Computing arrangements based on biological models; neural networks; learning methods
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of Chinese-oriented natural language generation and provides a Chinese co-emotion (empathetic) sentence training method and system based on a UniLM model and a Copy mechanism. To address the problem that Transformer-based generation often misses emotion keywords and complex event details, a Copy mechanism is fused into the decoder so that these keywords and details are copied from the source sequence into the output. Meanwhile, to address the lack of sufficient and diversified training corpora, the generated co-emotion replies are comprehensively evaluated, and high-quality replies that meet the expected standard are put, together with the corresponding user inputs, back into the original training corpus for compound automatic iterative training, thereby enhancing the training data.

Description

Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism
Technical Field
The invention belongs to the technical field of Chinese-oriented natural language generation, and particularly relates to a Chinese co-emotion reply generation method based on a UniLM model and a Copy mechanism.
Background
As deep learning is applied in more and more fields, intelligent conversation systems have also developed rapidly. Users want to communicate emotionally with an intelligent conversation system, and co-emotion (empathy) makes this possible; co-emotion reply generation thus emerged. Co-emotion is defined by Carl Rogers (Carl Ransom Rogers) as: in interpersonal interaction, imagining the other person's experiences and logic from the other person's standpoint, experiencing the other person's thoughts and feelings, and seeing and solving problems from the other person's perspective. Co-emotion reply generation means that the intelligent conversation system judges the user's emotional state from the conversation history and thereby generates a reply that makes the user feel emotionally understood. Prior research shows that an intelligent conversation system with co-emotion capability can not only improve user satisfaction but also obtain more positive feedback from users.
In mental health counseling conversations, an intelligent conversation system can, as an auxiliary tool, help counselors handle part of their tasks, and is considered key to service applications such as mental health intervention and assisted counseling diagnosis. Intelligent conversation systems endowed with co-emotion capability are therefore becoming a research hotspot. A good conversation model must have very strong contextual relevance between its inputs and outputs; contextual relevance refers to the interrelationship between the user's input and the model's output. Currently, the mainstream reply generation methods are deep-learning-based sequence-to-sequence methods or pre-trained models.
Traditional sequence-to-sequence encoders are mainly RNNs, LSTMs, and the like. Compared with Transformers, RNNs and LSTMs are weaker at extracting semantic features and deficient in modeling long-range dependencies. Although the replies generated by Transformer-based language models are more readable than those of RNNs and LSTMs, the problems of inaccurate generated details and context-irrelevant replies still exist.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a Chinese co-emotion sentence training method based on a UniLM model and a Copy mechanism.
The invention is realized as follows: in a Chinese co-emotion reply generation method based on a UniLM model and a Copy mechanism, the Copy mechanism is fused in order to copy emotion keywords and complex event details from the source sequence into the output; the output co-emotion replies are then evaluated against criteria such as perplexity, and replies that meet expectations are put, together with the user statements, into the original training corpus for compound automatic iterative training, yielding a further updated and optimized co-emotion reply generation model.
The technical scheme adopted by the invention is a Chinese co-emotion reply generation method based on a UniLM model and a Copy mechanism, which specifically comprises the following steps:
step 1, crawling corpora with co-emotion capability in the psychological dialogue domain by using crawler technology, and preprocessing them to obtain the input representation;
step 2, pre-training based on the UniLM model, using three types of language models simultaneously, each with a different self-attention mask mechanism;
step 3, calculating the loss by using a cross-entropy loss function and completing the UniLM-based pre-training to obtain a co-emotion reply generation model;
step 4, carrying out the co-emotion reply generation task based on the UniLM model, decoding through the self-attention mechanism of the sequence-to-sequence language model to obtain the vocabulary probability distribution;
step 5, constructing a decoder containing the Copy mechanism on the basis of step 4, introducing the generation probability and the copy probability, and optimizing the vocabulary probability distribution of step 4;
step 6, using the cross-entropy loss function as the loss function of the model and obtaining the generated co-emotion reply with a beam search algorithm;
and step 7, putting the generated high-quality co-emotion replies and the users' statements into the corpus of step 1, and further carrying out compound automatic iterative training based on the UniLM model to obtain an updated and optimized co-emotion reply generation model.
Further, two text sequences are input at a time: Segment1, denoted S1, and Segment2, denoted S2, for example: "[CLS] I always feel that certain people or things annoy me [SEP] I know you are confused; try to forget the negative events in life and think more about the positive ones [SEP]". The [CLS] tag marks the start of the sequence and each [SEP] tag marks the end of a segment; the text sequence pair is represented by these three types of symbols.
Further, the UniLM model is a stack of 12 Transformer layers, each with 768 hidden units and 12 attention heads; the structure is the same as BERT-BASE, so its parameters can be initialized from a trained BERT-BASE model. The UniLM model completes three pre-training objectives simultaneously: the prediction tasks of a unidirectional language model, a bidirectional language model, and a sequence-to-sequence language model, which enables the model to be applied to natural language generation tasks. Different MASK mechanisms are adopted for the different language models. The masking scheme is: 15% of tokens overall are selected; of these, 80% are replaced directly with [MASK], 10% are replaced with a word randomly chosen from the dictionary, and the remaining 10% keep their true values with no processing. Likewise, in 80% of cases only one word is masked at a time, and in 20% of cases a bigram or trigram is masked. For a [MASK] to be predicted, the unidirectional language model uses one-sided context: for example, when predicting the [MASK] in "X1 X2 [MASK] X4", only X1, X2 and the mask's own information are available, not the information of X4. The bidirectional language model encodes context from both directions; taking "X1 X2 [MASK] X4" as an example, X1, X2, X4 and the mask's own information are all available. In the sequence-to-sequence language model, if the [MASK] is in S1, only the context information of S1 can be encoded; if the [MASK] is in S2, it can obtain the context information to its left, including all of S1.
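To make the three masking regimes concrete, the following is a minimal sketch of the corresponding additive attention-mask matrices (assuming PyTorch; the function name and the toy lengths are illustrative, not taken from the patent):

```python
import torch

def build_seq2seq_mask(len_s1: int, len_s2: int) -> torch.Tensor:
    """Additive attention mask for the sequence-to-sequence objective.

    Tokens in S1 (the source) may attend to all of S1; tokens in S2
    (the target) may attend to all of S1 plus the target tokens to
    their left. 0.0 = visible, -inf = blocked.
    """
    total = len_s1 + len_s2
    mask = torch.full((total, total), float("-inf"))
    mask[:, :len_s1] = 0.0                           # every row sees the source
    causal = torch.tril(torch.ones(len_s2, len_s2))  # left-to-right within S2
    mask[len_s1:, len_s1:][causal.bool()] = 0.0
    mask[:len_s1, len_s1:] = float("-inf")           # source never sees target
    return mask

# Unidirectional LM: plain causal mask; bidirectional LM: all positions visible.
unidirectional = torch.triu(torch.full((8, 8), float("-inf")), diagonal=1)
bidirectional = torch.zeros(8, 8)
seq2seq = build_seq2seq_mask(len_s1=5, len_s2=3)
```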
Further, the text representation output by the Transformer network is input into a Softmax classifier to predict the masked words; the cross-entropy loss between the predicted tokens and the original tokens is used to optimize the model parameters and complete pre-training.
Further, the sequence-to-sequence language model learns to recover masked words by randomly masking a certain proportion of the words in the target sequence; the training objective is to maximize the probability of the masked words given the context information. The [SEP] at the end of the target sequence can also be masked so that the model learns when to terminate generation of the target sequence. The model uses the MASK mechanism together with the attention mechanism to obtain text feature vectors, which are fed into a fully connected layer to obtain the vocabulary probability distribution.
Further, the vocabulary probability distribution is fed into a fully connected layer and a Sigmoid layer to obtain the generation probability. A copy probability is introduced, and the generation and copy probabilities are combined to obtain an updated and improved vocabulary probability distribution.
Further, the cross-entropy loss function is used to complete the fine-tuning task of the model, and a Beam Search algorithm is used to generate the co-emotion reply.
Further, four evaluation indexes, namely perplexity, BLEU-4, F1, and expert evaluation, are used to comprehensively evaluate the co-emotion replies generated in step 6; replies that meet the expected standard are automatically put, together with the user inputs, into the original corpus of step 1 for compound automatic iterative training, enhancing the training data and yielding an updated and optimized Chinese co-emotion reply generation model.
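As an illustration, an automatic quality gate over the generated replies might look as follows (a sketch assuming Python with NLTK; the threshold values and function names are assumptions, and the manual expert-evaluation criterion is omitted):

```python
import math
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def token_f1(candidate: list, reference: list) -> float:
    """Token-overlap F1 between a generated reply and a reference reply."""
    common = Counter(candidate) & Counter(reference)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def passes_quality_bar(candidate, reference, mean_ce_loss,
                       max_ppl=50.0, min_bleu=0.25, min_f1=0.30) -> bool:
    """Decide whether a generated reply re-enters the training corpus."""
    ppl = math.exp(mean_ce_loss)  # perplexity from the mean cross-entropy
    bleu4 = sentence_bleu([reference], candidate,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    return (ppl <= max_ppl and bleu4 >= min_bleu
            and token_f1(candidate, reference) >= min_f1)
```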
The invention aims to solve the problem that co-emotion replies generated by a Transformer network fail to reproduce emotion keywords and complex event details, and proposes fusing a Copy mechanism into the decoder to copy these emotion keywords and complex event details into the output.
Another object of the invention is to enhance the training data through compound automatic iterative training, addressing the shortage of Chinese psychological-dialogue corpora with co-emotion capability.
In combination with the above technical solution and the technical problems to be solved, the claimed technical solution has the following advantages and positive effects:
First, with respect to the technical problems existing in the prior art and the difficulty of solving them, and in close combination with the claimed technical solution and the results and data obtained during research and development, the technical problems solved by the invention and the creative technical effects brought about after solving them are described as follows:
In interpersonal interaction, people hope to imagine others' experiences and logic from others' standpoints, to experience others' thoughts and feelings, and to see and solve problems from others' perspectives. Intelligent conversation systems endowed with co-emotion capability are gradually becoming a research hotspot. The invention solves the problem of co-emotion reply generation: the intelligent conversation system judges the user's emotional state from the conversation history and thereby generates a reply that makes the user feel emotionally understood. An intelligent conversation system with co-emotion capability can not only improve user satisfaction but also obtain more positive feedback from users.
Secondly, considering the technical solution as a whole or from the perspective of the product, the claimed technical solution has the following technical effects and advantages:
The invention provides a Chinese co-emotion reply generation method based on a UniLM model and a Copy mechanism. The invention uses the UniLM model as the basic framework; aiming at the problem that co-emotion replies generated by a Transformer network fail to produce emotion keywords and complex event details, it proposes fusing a Copy mechanism into the decoder to copy those keywords and details into the output. Aiming at the shortage of Chinese psychological-dialogue corpora with co-emotion capability, the invention adopts compound automatic iterative training to enhance the training data.
The invention copies emotion keywords and complex event details from the source sequence into the output; it then evaluates the output co-emotion replies against criteria such as perplexity, and puts replies that meet expectations, together with the user statements, into the original training corpus for compound automatic iterative training, obtaining a further updated and optimized co-emotion reply generation model.
Drawings
FIG. 1 is a framework diagram of the Chinese co-emotion reply generation model based on the UniLM model and Copy mechanism provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the UniLM model architecture used in an embodiment of the invention;
FIG. 3 is a specific flowchart of the Chinese co-emotion reply generation method based on the UniLM model and Copy mechanism provided by an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
So that those skilled in the art can fully understand how the invention may be embodied, this section presents an illustrative embodiment of the claimed subject matter.
With reference to the drawings and specific embodiments, the invention further provides a Chinese co-emotion reply generation method based on a UniLM model and a Copy mechanism.
As shown in FIG. 1, the invention is mainly based on a UniLM model, with a Copy mechanism fused at the decoding end, so as to make full use of the contextual relevance of complex event details in co-emotion conversation. The method comprises four stages: input processing, pre-training, co-emotion reply generation, and compound training. The specific implementation is as follows:
the pre-trained corpus includes statements of the psychological consultants about the physical problems and the counselor's returns with co-emotion capabilities. The visitor 'S statement Segment1, denoted S1, the consultant' S reply Segment2, denoted S2, adds special tags [ CLS ] and [ SEP ], shaped as "[ CLS ] S1[ SEP ] S2[ SEP ]". As shown in FIG. 2, the input representation of the model consists of the sum of the three parts Segment Embedding, position Embedding, token Embedding.
In model pre-training, the embedding vectors are input and each Transformer layer encodes its input vectors, using a multi-head attention mechanism to aggregate the output of the previous layer; the attention range of each word or position is controlled through a mask matrix, yielding the attention distribution of the current position over the other positions, from which the feature vector of the current position of the decoder is calculated.
The attention distribution A_t of the generated word vector over the text feature vector X_Input at time t is as follows:
A_t = Softmax((X_t * W_q) * (X_Input * W_k)^T / sqrt(d_k) + M)
the feature vector XOutput output by the decoder at time t is as follows:
X Output =A t *W v *X Intput
wherein Xt is a target vector at time t; XInput is the text feature vector at time t; m is a mask matrix, which acts to control the word attention range; dk is the dimension of the word vector; wq, wk, wv are learning parameters.
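A direct transcription of the two formulas above (assuming PyTorch; written single-head for readability, whereas the model itself uses 12 heads):

```python
import torch

def masked_attention(x_t, x_input, w_q, w_k, w_v, m):
    """Compute A_t and X_Output for one decoding step.

    x_t:     (1, d) target vector at time t
    x_input: (n, d) text feature vectors
    m:       (1, n) additive mask (0 = visible, -inf = blocked)
    """
    d_k = w_k.size(-1)
    scores = (x_t @ w_q) @ (x_input @ w_k).transpose(0, 1) / d_k ** 0.5 + m
    a_t = torch.softmax(scores, dim=-1)   # attention distribution A_t
    x_output = a_t @ (x_input @ w_v)      # decoder feature vector X_Output
    return a_t, x_output
```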
The Softmax function maps a vector of scores s into a probability distribution and is defined as follows:
Softmax(s_i) = exp(s_i) / sum_{j=1..N} exp(s_j)
where i is the index of an output node; s_i is the output value of the i-th node; and N is the number of output nodes, i.e., the number of classes.
Further, the cross-entropy loss between the model prediction X_Output (denoted s) and the masked original word s_t is calculated to optimize the parameters of the model. The cross-entropy function is defined as follows:
L = -sum_t log P(s_t)
where P(s_t) is the probability that the prediction assigns to the original word s_t at masked position t.
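In a typical implementation, this loss is computed over the masked positions only by marking all other label positions with an ignore index (a PyTorch sketch; the helper name is illustrative):

```python
import torch.nn.functional as F

def masked_lm_loss(logits, labels, ignore_index=-100):
    """Cross-entropy over masked positions; labels at unmasked positions
    are set to ignore_index beforehand so they contribute no loss."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=ignore_index)
```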
the pretreatment process comprises the following steps: training the preprocessed data input model, wherein the total training number of the preprocessed data input model is 20, the number of the epochs is 0.1, the hidden vector dimension is 768, the Learning rate learning_rate is 2e-5, the epochs is 20, the Batch processing size is 32, the attention head number is 12, the hidden layer number is 12, the embedded layer number is 12, the hidden layer unit number is 768, and the vocabulary size is 21128. The maximum input length is set to 512 and the maximum length to generate a co-emotion reply is set to 40, and the loss is calculated using a cross entropy function.
After pre-training is completed, the UniLM sequence-to-sequence language model is fine-tuned to perform the co-emotion reply generation task. During decoding, for example, the user inputs a statement X1 describing a psychological problem; at time t = 1 the input sequence is "[CLS] X1 [SEP] [MASK]", i.e., a "[MASK]" is appended to the end of the sequence, and its corresponding feature representation predicts the next word. "[CLS] X1 [SEP]" is the known source sequence, whose tokens can see each other's context within the sentence during the encoding stage. "Y1 [MASK]" is the predicted target sequence, which during the decoding stage can see the information of the source sequence and of the target tokens to its left. The model fuses the encoder and decoder together through the mask matrix.
After a corpus sample is encoded by the UniLM model, a sequence_length × hidden_size matrix is obtained; the first row is the feature representation of [CLS], the second row that of X1, and so on. In the decoding stage, the feature representation of [MASK] is passed through a linear layer and a Softmax function to obtain the probability distribution over the vocabulary, and the word with the highest probability is selected as the decoded word. These steps are repeated, the decoder outputting a feature vector X_Output at each time t, until [SEP] is generated. The specific calculation is as follows:
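The append-[MASK]-and-predict loop can be sketched as follows (greedy variant for clarity; the model and tokenizer interfaces are hypothetical placeholders rather than the patent's API):

```python
def generate_reply(model, tokenizer, source, max_len=40):
    """Append [MASK], predict it, emit the word, repeat until [SEP]."""
    ids = tokenizer.encode(f"[CLS]{source}[SEP]")
    reply_ids = []
    for _ in range(max_len):
        logits = model(ids + [tokenizer.mask_id])  # features incl. final [MASK]
        next_id = logits[-1].argmax().item()       # highest-probability word
        if next_id == tokenizer.sep_id:            # [SEP] terminates generation
            break
        reply_ids.append(next_id)
        ids.append(next_id)                        # [MASK] is re-appended next step
    return tokenizer.decode(reply_ids)
```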
the vocabulary probability distribution Pv is obtained by performing linear transformation on the XOutput twice and a Softmax function.
P v =Softmax(W (W*X Output +b)+b )
Wherein W is 、W、b、b Is a learnable parameter.
A generation probability P_g is introduced, representing the probability of generating a word from the vocabulary, and a copy probability P_c is introduced, representing the probability of copying a word from the source text, where P_g + P_c = 1. P_g is obtained from X_Output, A_t, and X_t through a fully connected layer and a Sigmoid function:
P_g = Sigmoid(W * [X_t, X_Output, A_t] + b)
where W and b are learnable parameters.
The updated and improved vocabulary probability distribution is then calculated as:
P(w) = P_g * P_v(w) + P_c * A_t
where, when w is not a word in the vocabulary, P_v(w) = 0 and the predicted word is generated from the source sequence; when w is not a word in the source sequence, A_t = 0 and the predicted word is generated from the vocabulary. The Copy mechanism copies emotion keywords and complex event details (high-probability words) from the source sequence into the generated co-emotion reply, which to some extent controls the accuracy of co-emotion reply generation. The Copy mechanism also dynamically expands the vocabulary to some extent, reducing the probability of generating out-of-vocabulary words.
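The gated combination can be sketched as follows (assuming PyTorch; shapes are illustrative, and in practice the vocabulary axis is extended with source-only words so that copied out-of-vocabulary tokens receive ids):

```python
import torch

def copy_augmented_distribution(p_v, a_t, p_g, source_ids, vocab_size):
    """P(w) = P_g * P_v(w) + P_c * A_t, with P_c = 1 - P_g.

    p_v:        (vocab_size,) generation distribution from the Softmax layer
    a_t:        (n,) attention weights over the n source tokens
    p_g:        scalar gate from the Sigmoid layer
    source_ids: (n,) vocabulary ids of the source tokens
    """
    p_copy = torch.zeros(vocab_size)
    p_copy.scatter_add_(0, source_ids, a_t)  # sum attention over repeated words
    return p_g * p_v + (1.0 - p_g) * p_copy
```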
The beam size is set to 1, and a Beam Search algorithm is used to search for a near-optimal target sequence and generate the co-emotion reply. The generated co-emotion replies are evaluated; replies that meet the standard are put, together with the users' statements, into the original corpus for compound automatic iterative training, enhancing the training data and obtaining an updated and optimized Chinese co-emotion reply generation model.
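A minimal beam-search sketch (the step_log_probs callable is a hypothetical stand-in for the decoder above; with the beam size set to 1, as in this embodiment, the search reduces to greedy decoding):

```python
import torch

def beam_search(step_log_probs, sep_id, beam_size=1, max_len=40):
    """step_log_probs(prefix) -> log-probability vector over the vocabulary."""
    beams = [([], 0.0)]  # (token ids, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == sep_id:   # finished hypothesis: keep as-is
                candidates.append((seq, score))
                continue
            top = torch.topk(step_log_probs(seq), beam_size)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((seq + [idx.item()], score + lp.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == sep_id for seq, _ in beams):
            break
    return beams[0][0]
```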
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto; any modifications, equivalents, improvements, and alternatives within the spirit and principles of the invention that would be apparent to those skilled in the art fall within the scope of the present invention.

Claims (10)

1. A Chinese co-emotion reply generation method based on a UniLM model and a Copy mechanism, characterized in that emotion keywords and complex event details in the source sequence are copied into the output; the output co-emotion replies are evaluated against evaluation criteria such as perplexity, and replies that meet expectations are put, together with the user statements, into the original training corpus for compound automatic iterative training, obtaining a further updated and optimized co-emotion reply generation model.
2. The method for generating Chinese co-emotion replies based on a UniLM model and a Copy mechanism as claimed in claim 1, characterized by comprising the following steps:
step 1, crawling corpora with co-emotion capability in the psychological dialogue domain by using crawler technology, and preprocessing them to obtain the input representation;
step 2, pre-training based on the UniLM model, using three types of language models simultaneously, each with a different self-attention mask mechanism;
step 3, calculating the loss by using a cross-entropy loss function and completing the UniLM-based pre-training to obtain a co-emotion reply generation model;
step 4, carrying out the co-emotion reply generation task based on the UniLM model, decoding through the self-attention mechanism of the sequence-to-sequence language model to obtain the vocabulary probability distribution;
step 5, constructing a decoder containing the Copy mechanism on the basis of step 4, introducing the generation probability and the copy probability, and optimizing the vocabulary probability distribution of step 4;
step 6, using the cross-entropy loss function as the loss function of the model and obtaining the generated co-emotion reply with a beam search algorithm;
and step 7, putting the generated high-quality co-emotion replies and the users' statements into the corpus of step 1, and further carrying out compound automatic iterative training based on the UniLM model to obtain an updated and optimized co-emotion reply generation model.
3. The method for generating Chinese co-emotion replies based on a UniLM model and a Copy mechanism as claimed in claim 2, wherein step 2 specifically comprises: initializing parameters with a BERT-BASE pre-trained model; based on the same Transformer network structure, predicting different MASKs as the pre-training objectives, completing the prediction tasks of the unidirectional, bidirectional, and sequence-to-sequence language models, with the different language models used in uniform allocation.
4. The method for generating Chinese co-emotion replies based on a UniLM model and a Copy mechanism as claimed in claim 2, wherein step 4 specifically comprises: using the self-attention MASK mechanism of the sequence-to-sequence language model to randomly mask tokens in the target sequence, also masking the end of the sequence so that the model learns when to stop generating the co-emotion reply; taking the maximization of the masked-token probability given the context information as the training objective, fusing encoding and decoding with the MASK mechanism, and obtaining text feature vectors in combination with the attention mechanism; and inputting the decoded feature vectors into a fully connected layer and obtaining the vocabulary probability distribution with a Softmax function.
5. The method for generating Chinese co-emotion replies based on a UniLM model and a Copy mechanism as claimed in claim 2, wherein step 5 specifically comprises: inputting the vocabulary probability distribution obtained in the previous step into a fully connected layer and a Sigmoid layer to obtain the generation probability, introducing the copy probability, and fusing the generation and copy probabilities to obtain an updated and improved vocabulary probability distribution; the Copy mechanism effectively copies emotion keywords and complex event details from the user input into the output, improving the accuracy of details in the generated co-emotion replies while effectively reducing the probability of generating out-of-vocabulary words.
6. The method for generating Chinese co-emotion replies based on a UniLM model and a Copy mechanism as claimed in claim 2, wherein step 7 specifically comprises: evaluating the co-emotion replies generated in step 6 against evaluation criteria such as perplexity, automatically putting the replies that meet expectations and the user inputs into the corpus of step 1 for iterative training, enhancing the training data and obtaining an updated and optimized co-emotion reply generation model.
7. A Chinese co-emotion reply generation system based on the generation method of any one of claims 1 to 6, comprising:
a detail copying module, used for copying emotion keywords and complex event details from the source sequence into the output;
and a co-emotion reply generation model module, used for evaluating the output co-emotion replies against evaluation criteria such as perplexity and putting the replies that meet expectations, together with the user statements, into the original training corpus for compound automatic iterative training, to obtain a further updated and optimized co-emotion reply generation model.
8. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the Chinese co-emotion reply generation method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the Chinese co-emotion reply generation method according to any one of claims 1 to 6.
10. An information data processing terminal for implementing the Chinese co-emotion reply generation system according to claim 7.
CN202211591710.7A (filed 2022-12-12, priority date 2022-12-12): Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism. Status: Pending. Published as CN116150334A (en).

Priority Applications (1)

Application Number: CN202211591710.7A; Priority Date: 2022-12-12; Filing Date: 2022-12-12; Title: Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism

Publications (1)

Publication Number: CN116150334A (en); Publication Date: 2023-05-23

Family

ID=86357427

Family Applications (1)

Application Number: CN202211591710.7A (pending); Priority Date: 2022-12-12; Filing Date: 2022-12-12; Title: Chinese co-emotion sentence training method and system based on UniLM model and Copy mechanism

Country Status (1)

Country: CN; Publication: CN116150334A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

CN117591866A * (priority 2024-01-16, published 2024-02-23) - 中国传媒大学 (Communication University of China) - Multi-mode false information detection method based on coercion theory
CN117591866B * (priority 2024-01-16, published 2024-05-07) - 中国传媒大学 (Communication University of China) - Multi-mode false information detection method based on coercion theory


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination