CN117610513B

CN117610513B - Knowledge protection and selection-based theme text generation method

Info

Publication number: CN117610513B
Application number: CN202410086840.8A
Authority: CN
Inventors: 宋春瑶; 王杰永; 吴依豪
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2024-01-22
Filing date: 2024-01-22
Publication date: 2024-04-02
Anticipated expiration: 2044-01-22
Also published as: CN117610513A

Abstract

The invention relates to the technical field of natural language theme text generation, and provides a theme text generation method based on knowledge protection and selection. Comprising the following steps: collecting theme text data to be generated; introducing a pre-trained language model, freezing the parameters of the language model and constructing a trainable dynamic prefix vector; encoding the training set through a language model encoder to obtain a hidden intermediate state, and decoding the hidden intermediate state as an initial state of a decoder to obtain vocabulary probability distribution corresponding to the training set; calculating and obtaining the similarity between the decoding state of the decoder and the topic text data and the copy probability distribution corresponding to the unexplained topic through a copy mechanism comprising knowledge selection; calculating the distribution of the prediction result, and calculating to obtain the negative log likelihood loss; and generating the subject text after updating the dynamic prefix vector. The invention keeps higher theme consistency by taking the input theme information and the theme semantics which are not expressed in the generated text into consideration through the copy mechanism with knowledge selection.

Description

Knowledge protection and selection-based theme text generation method

Technical Field

The invention relates to the technical field of natural language theme text generation, in particular to a theme text generation method based on knowledge protection and selection.

Background

The topic text generation technology is a special natural language generation technology, and by designing a proper model structure, information is extracted from a given plurality of topic words, so that a coherent paragraph level text is generated. The technology can be used in the fields of automatic advertisement generation, mail generation of specific subject and the like, can also be used as a test platform for controllable text generation, and has a wide application range.

The topic text generation technology is different from other text generation technologies such as text abstracts, machine translation and the like, and for the text abstracts, input source information is a large-section article, and the amount of information contained in the text abstracts is far greater than that of information contained in abstracts needing to be generated; the input text and the output text of the machine translation contain almost equal information; the topic text generation technology takes a small amount of a plurality of topic words as input, the output is paragraph level text, the information quantity contained in the input is far smaller than the information quantity contained in the output, and the phenomenon greatly increases the difficulty of the topic text generation technology.

The method is limited by the phenomenon, the prior art mostly introduces knowledge in a third-party external knowledge base to make up for the deficiency of input and output information, but the knowledge storage form of the external knowledge base is different from that of the knowledge in a model, so that the external knowledge cannot be fully utilized, and the generated subject text still has the problems of unsmooth expression, poor semantic logic and lower diversity.

Disclosure of Invention

The present invention is directed to solving at least one of the technical problems existing in the related art. Therefore, the invention provides a topic text generation method based on knowledge protection and selection.

The invention provides a topic text generation method based on knowledge protection and selection, which comprises the following steps:

s1: collecting topic text data to be generated, and dividing the topic text data into a training set and a testing set;

s2: introducing a pre-trained first language model, freezing parameters of the first language model, constructing a trainable dynamic prefix vector, and obtaining a second language model;

s3: coding the training set through the second language model to obtain a hidden intermediate state, recording the hidden intermediate state as an initial state of a decoder of the second language model, and decoding by the decoder according to the initial state to obtain vocabulary probability distribution corresponding to the training set;

s4: calculating and obtaining the similarity of the decoding state of the decoder and the text data of the subject and the copy probability distribution corresponding to the unexplained subject through a copy mechanism comprising knowledge selection;

s5: calculating according to the vocabulary probability distribution and the copy probability distribution to obtain a predicted result distribution, and calculating according to the predicted result distribution to obtain negative log likelihood loss;

s6: updating the dynamic prefix vector in the second language model with the negative log-likelihood loss to obtain a third language model;

s7: and predicting the test set through the third language model based on a beam search strategy to generate a subject text.

According to the method for generating the topic text based on knowledge protection and selection provided by the invention, the length of the dynamic prefix vector in the step S2 is in direct correlation with the discrete linear relation of the number of the topic words input into the second language model.

According to the method for generating the topic text based on knowledge protection and selection provided by the invention, in step S4, the expression of the similarity is:

wherein,for the similarity of the decoding status of the decoder to the subject text data,/for example>For the current sampling time step,for cosine similarity calculation, ++>For encoder state->For decoder state, ++>To select the auxiliary function->To select the auxiliary function threshold +.>Tensors to be filtered and selected for the content.

According to the method for generating the topic text based on knowledge protection and selection provided by the invention, in step S4, the expression of the knowledge selection process contained in the copying mechanism is as follows:

wherein,first status information after selection for knowledge, < >>Second status information after selection for knowledge, < >>Is a normalization function.

According to the method for generating the topic text based on knowledge protection and selection provided by the invention, in step S4, the expression of the copy probability distribution is as follows:

wherein,for copying the current time step token probability generated in the probability distribution,/a method for generating a current time step token probability in the probability distribution>To activate the function +.>For the full connection layer weight,/>For the full connection layer bias term,/->For calculating the obtained context vector +.>For accumulating index value, ++>The attention score for the target sequence to the source sequence.

According to the method for generating the topic text based on knowledge protection and selection provided by the invention, the expression of the predicted result distribution in the step S5 is as follows:

wherein,for calculating the prediction result distribution obtained +.>For the current time stepToken generated->For generating a probability distribution of tokens from a vocabulary, < +.>Is a copy probability distribution having the attention score of the target sequence to the source sequence as a function.

According to the method for generating the topic text based on knowledge protection and selection provided by the invention, the expression of the negative log likelihood loss in the step S5 is as follows:

wherein,is a negative log likelihood loss>For the total number of sampling time steps>For tag distribution for model prediction, +.>Is a random variable generated in a time step.

The invention provides a topic text generation method based on knowledge protection and selection, which aims at solving the problems that the current topic text generation model cannot fully utilize external knowledge and the generated text is lack of diversity, and improves the generated text diversity of the existing generation method on the basis of protecting rich semantic knowledge of a pre-training language model.

According to the topic text generation method GCS-IPT based on knowledge protection and selection, the weight of the pre-training language model GENIUS is frozen, only the dynamic prefix vector is trained, the prior knowledge of the pre-training can be better protected by the GCS-IPT, and the input of different numbers of topic words is self-adapted; in addition, through a copy mechanism module with knowledge selection, the GCS-IPT can simultaneously consider the input topic information and the topic semantics which are not expressed in the current generated text during generation, so that the generated text keeps higher topic consistency.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for generating a theme text based on knowledge protection and selection according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.

In the description of the embodiments of the present invention, it should be noted that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In describing embodiments of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "coupled," "coupled," and "connected" should be construed broadly, and may be either a fixed connection, a removable connection, or an integral connection, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in embodiments of the present invention will be understood in detail by those of ordinary skill in the art.

In embodiments of the invention, unless expressly specified and limited otherwise, a first feature "up" or "down" on a second feature may be that the first and second features are in direct contact, or that the first and second features are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.

An embodiment of the present invention is described below with reference to fig. 1.

further, the purpose of this stage is to pre-process the topic text generation data. And counting the occurrence frequency of the topics in the topic text generated data, counting the occurrence frequency of different topic words in the whole corpus, screening topic words with higher occurrence frequency by taking the occurrence frequency as a reference, preventing the potential negative influence of the training corpus with too low frequency on the model, screening and filtering the topic words with lower occurrence frequency, and randomly sampling 25000 pieces of the rest as a training set and 2000 pieces of the rest as a test set.

furthermore, the purpose of this stage is to protect knowledge of the language model, freeze all parameters of the pre-trained language model, so as to protect the abundant semantic knowledge from being destroyed; meanwhile, a dynamic prefix vector is constructed to adapt to the input conditions of different numbers of subject words.

Wherein the length of the dynamic prefix vector in step S2 is linearly and positively correlated with the number of discrete keywords entered into the second language model.

Further, in order to adapt to the input conditions of different numbers of keywords, a dynamic prefix vector is constructed, and the actual length of the vector participating in training and reasoning is related to the number of the input keywords, and the actual length increases discretely and linearly with the increase of the number of the keywords.

further, for each data sample in the training data, the data sample is sent to an encoder of the pre-training language model to generate a hidden intermediate state, which indicates semantic information of the input subject word, and the hidden intermediate state is used as an initial state of a decoder.

further, for each decoder state at each moment, the similarity degree between the decoder state and the initial topic distribution is calculated through a copy mechanism module with knowledge selection, and the copy probability distribution of the topic which is not expressed yet is obtained.

In the steps S3 and S4, knowledge associated with the input topic is selected from the language model rich semantic knowledge, and the expression of the input topic semantic in the current generated content is considered, so as to guide the generation process of the future time step.

In step S4, the expression of the similarity is:

Further, the decoder state at this time is first calculated using cosine similarityAnd encoder state->And uses the auxiliary function +.>Is +_associated with a particular threshold value>Auxiliary knowledge selection, wherein the auxiliary function +.>Can be judged by similarity->Different values of (2) and a specific threshold +.>Tensor->Filtering and selecting specific content in the content file.

In step S4, the expression of the knowledge selection process included in the copy mechanism is:

Further, the normalization function is configured to scale a maximum value of the tensor to 1 and a minimum value of the tensor to 0, and the expression of the knowledge selection process included in the copy mechanism represents similarity between states of the encoder and the decoder through the current time step, and filter and select the knowledge to obtain state information after knowledge selectionAnd->。

Therein, whereinThe non-zero elements of (a) come from the information in the encoder most relevant to the current time step topic,/->And filtering the information by taking the difference between the similarity and 1 as a reference, and ensuring that the topic semantic information which is not expressed is considered in the decoding process.

In step S4, the expression of the copy probability distribution is:

wherein,for copying the current time step token probability generated in the probability distribution,/a method for generating a current time step token probability in the probability distribution>To activate the function +.>For the full connection layer weight,/>For the full connection layer bias term,/->For calculating the obtained context vector +.>For accumulating index value, ++>Attention fraction for the target sequence to the source sequence, < >>Representing tensor stitching.

wherein, the expression of the prediction result distribution in the step S5 is:

wherein,for calculating the prediction result distribution obtained +.>Token generated for the current time step, +.>For generating a probability distribution of tokens from a vocabulary, < +.>Is a copy probability distribution having the attention score of the target sequence to the source sequence as a function.

Wherein the expression of the negative log likelihood loss in step S5 is:

further, the objective of this stage is to update the prefix vector by first generating a text token for the current time step and calculating a negative log likelihood loss with the tag to obtain a gradient and update the dynamic prefix vector, specifically described as a pairAnd carrying out back propagation to obtain gradient information of the dynamic prefix vector, and updating the dynamic prefix vector value according to the gradient information.

Further, the above steps are repeated, so that training of the prefix vector can be completed, the third language model in the steps S6 and S7 is obtained, and then the complete subject text can be generated through sampling by the beam search strategy.

The validity verification is carried out on the topic text generation method GCS-IPT based on knowledge protection and selection, and experiments are carried out on two widely used public topic text generation data sets ESSAY and ZHIHU.

Table 1 the subject text generation method based on knowledge protection and selection of the present invention compares experimental results with other existing methods

Table 1 shows a comparative test of GCS-IPT with the prior art method. Wherein the previous method comprises the following steps of MTA: the earliest topic text generation method adopts topic coverage vectors to ensure that the content generated by a decoder tightly surrounds the semantics of the input topic words; CTEG: introducing information in a common sense library of the graph structure as additional external information into a topic text generation model for the first time; SCTKG: the information of the common sense library is introduced, and meanwhile, the method for controlling the subject text semantics from the emotion angle is adopted; GENIUS: and re-training the BART language model by adopting a sketch reconstruction pre-training task.

The experiments were repeated 5 times using automatic evaluation indicators BLEU (BilingualEvaluationUnderstudy), DIST-2 (Diversity-2), consistency, novelty (subject Novelty) which are common in the art.

Experimental results show that compared with the method SCTKG with the best effect in the past, the method GCS-IPT provided by the invention has the advantages of improvement in diversity, theme consistency and novelty, and particularly the average improvement of diversity is more than 40%. The comparison result fully illustrates that the method provided by the invention has excellent effect on the theme text generation task.

Aiming at the problems that the current topic text generation model cannot fully utilize external knowledge and the generated text is lack of diversity, the invention provides a topic text generation method with knowledge selection on the basis of protecting rich semantic knowledge of a pre-training language model, and improves the generated text diversity of the current generation method. The invention provides a topic text generation method based on knowledge protection and selection, which comprises the following steps: (1) Generating topic number distribution analysis in a training set for topic texts, screening topic words with low occurrence frequency, and constructing training, testing and verifying data sets; (2) Freezing all parameters of the pre-training language model to protect rich semantic knowledge from being destroyed, and constructing dynamic prefix vectors to adapt to the input conditions of different numbers of subject matters; (3) For each data sample in the training data, sending the data sample into an encoder of a pre-training language model to generate a hidden intermediate state which indicates semantic information of an input subject word and takes the semantic information as an initial state of a decoder; for the decoder state at each moment, calculating the similarity degree of the decoder state and the initial theme distribution through a copy mechanism module with knowledge selection, and obtaining copy probability distribution of the un-expressed theme; (4) Calculating the prediction result distribution of the current time step by using the copy probability distribution and the vocabulary probability distribution, and calculating the negative log likelihood loss with the training label so as to update the dynamic prefix vector; (5) Repeating the steps (3) to (4), and completing the training of the prefix vector and the output of the subject text.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The method for generating the theme text based on knowledge protection and selection is characterized by comprising the following steps of:

2. The knowledge-based protected and selected subject text generation method of claim 1 wherein the length of said dynamic prefix vector in step S2 is linearly and positively correlated with the number of subject words entered into said second language model.

3. The method for generating a theme text based on knowledge protection and selection according to claim 1, wherein in step S4, the similarity is expressed as follows:

4. The method for generating a topic text based on knowledge protection and selection as claimed in claim 3, wherein in step S4, the expression of the knowledge selection process included in the copy mechanism is:

5. The method for generating a topic text based on knowledge protection and selection as claimed in claim 4, wherein in step S4, the expression of the copy probability distribution is:

6. The method for generating a topic text based on knowledge protection and selection as claimed in claim 5, wherein the expression of the predicted result distribution in step S5 is:

7. The method for generating a topic text based on knowledge protection and selection as claimed in claim 6, wherein the expression of the negative log likelihood loss in step S5 is: