CN116484858A - Text abstract generation method based on diffusion model - Google Patents
Text abstract generation method based on diffusion model
- Publication number
- CN116484858A CN116484858A CN202310419493.1A CN202310419493A CN116484858A CN 116484858 A CN116484858 A CN 116484858A CN 202310419493 A CN202310419493 A CN 202310419493A CN 116484858 A CN116484858 A CN 116484858A
- Authority
- CN
- China
- Prior art keywords
- text
- vector
- model
- diffusion model
- generated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a text abstract generation method based on a diffusion model. First, a natural language processing model performs word segmentation on the source text. The preprocessed source text is then pre-encoded by the input module into a sentence representation using word embedding, and the evaluation module maps the text vectors into a continuously transformable spatial dimension, realizing controllable linear continuous transformation of the text vectors. Finally, the condition guidance module obtains the target text, samples are generated under gradient guidance during the diffusion model's sampling process, the control module reduces the error between the generated text vectors and the real text vectors, and the output module generates a text abstract from the obtained sample vectors. The method realizes text abstract generation with a generative model, which helps improve the accuracy of the text abstract and improves the semantic generalization of the model.
Description
Technical Field
The invention relates to a text abstract generation method based on a diffusion model, and belongs to the technical field of computer application.
Background
In recent years, driven by the growth of data generated on the internet and improvements in computer performance, deep learning technologies represented by neural networks have developed rapidly and achieved great success in fields such as natural language processing and computer vision, triggering a new round of artificial intelligence research. Text generation is an important branch of natural language processing and a subfield studied by many scholars. It is the process of generating human-readable, understandable text that satisfies specific constraints, given some input information. It has great research and application value, and affects human life in many areas such as machine translation, automatic summarization, dialogue generation, and image description.
To improve the quality of text generation, researchers in recent years have applied a number of generative deep neural network models to text generation tasks and made notable progress. Their advantage is that a deep neural network can learn the semantic mapping from input data to output text end to end, without manual feature engineering. However, deep neural network models tend to have many parameters, while most text generation tasks offer only very small data sets, so deep neural networks overfit easily on these data sets, which limits their generalization in practical applications. As the difficulty of text generation tasks keeps increasing, higher demands arise for capturing the hidden semantic relations between topic and text, imitating human writing style, and so on. Traditional generative models include autoregressive models, variational autoencoders, generative adversarial networks, and other neural network models, but likelihood-based models and implicit generative models each have their limitations: for example, the variational autoencoder cannot directly compute an exact maximum likelihood because of its model design, and generative adversarial network training is prone to instabilities such as mode collapse. Traditional generative models therefore cannot fully cope with fine-grained tasks.
In view of this, a text abstract generation method based on a diffusion model is provided, which is of great significance for solving the above text abstract generation problems.
Disclosure of Invention
The invention aims to provide a text abstract generation method based on a diffusion model that makes full use of the context information of the input text to generate a high-quality text abstract, thereby improving abstract quality. The method has good generalization ability and adapts to input texts of various types, including both long and short texts.
In order to achieve the above object, the present invention provides a text abstract generating method based on a diffusion model, which mainly includes the following steps:
step 1, preprocessing text word segmentation: a natural language processing model in the input module performs preliminary preprocessing on the corpus, and the corpus is pre-encoded by word embedding to obtain the sentence representation W = [w_1, w_2, w_3, ..., w_n], where W is the encoded sentence representation and w_n is the code of the n-th token of the sentence;
step 2, generating a text vector: an auxiliary function in the evaluation module converts the sentence representation W = [w_1, w_2, w_3, ..., w_n] generated in step 1 into the word-embedding vectors EMB(W) = [EMB(w_1), ..., EMB(w_n)] ∈ R^d, mapping each text vector into the vector space R^d so that each text vector can be represented in R^d, where EMB(W) is the set of word-embedding vectors converted from the sentence representation W and EMB(w_n) is the word-embedding vector produced from the code of the n-th token of the sentence;
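The pre-coding and embedding of steps 1 and 2 can be sketched as follows. This is a minimal illustration: the toy vocabulary and the random lookup table are assumptions standing in for the patent's fine-tuned bert-base-chinese embeddings.

```python
import numpy as np

# Toy stand-in for the word-embedding pre-coding of steps 1-2.
# Vocabulary and random lookup table are illustrative assumptions;
# the patent itself uses a fine-tuned bert-base-chinese model.
rng = np.random.default_rng(0)
vocab = {"[PAD]": 0, "文本": 1, "摘要": 2, "生成": 3}
d = 8                                    # dimension of the vector space R^d
emb_table = rng.standard_normal((len(vocab), d))

def embed(token_codes):
    """EMB(W): map a sentence representation W = [w_1, ..., w_n] of
    token codes to their word-embedding vectors in R^d."""
    return emb_table[np.asarray(token_codes)]

W = [1, 2, 3]                            # encoded sentence representation
EMB_W = embed(W)
print(EMB_W.shape)                       # one d-dimensional vector per token
```

Each row of `EMB_W` is one text vector in R^d, which is the continuous space the diffusion model later operates on.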
step 3, generating a condition guide: the guidance module feeds the text vectors generated from the source text into the diffusion model so as to introduce an additional conditional input co_θ(x_t, y); then, at each moment x_{t-1} of the diffusion model's sampling process, a gradient update of the condition-guidance network is run, where x_t denotes the predicted distribution generated by the diffusion model at time t, y denotes the conditional input introduced at time t, and co_θ denotes the conditional distribution generated by the model parameters θ given the additional conditional input y and the predicted distribution x_t;
step 4, sampling generation: the initial input of the diffusion model's sampling process is random Gaussian noise in the vector space R^d; at each moment x_{t-1} of the whole sampling process a gradient update is run to generate the prediction vector f_θ(x_{t-1}, t), where t denotes the moment after the model has sampled t steps, x_{t-1} denotes the sampling distribution at time t-1 generated by the diffusion model at time t, and f_θ denotes the prediction vector generated by the model from the distribution x_{t-1} at time t; the whole sampling process runs over the range t = T, ..., 0;
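The iterative sampling of step 4 can be sketched as a loop from t = T down to 0. The denoiser below is only a placeholder for the trained diffusion network f_θ, not the patent's actual model:

```python
import numpy as np

# Sketch of the sampling loop of step 4: start from random Gaussian
# noise in R^d and run one update per step for t = T, ..., 1.
# f_theta below is a placeholder denoiser, not a trained network.
rng = np.random.default_rng(1)
d, T = 8, 50

def f_theta(x_t, t):
    # A trained model would predict a less noisy sample from x_t and
    # the step index t; here we simply shrink the sample toward zero.
    return x_t * (1.0 - 1.0 / T)

x_init = rng.standard_normal(d)          # initial input: noise in R^d
x = x_init.copy()
for t in range(T, 0, -1):                # whole process over t = T, ..., 0
    x = f_theta(x, t)
print(np.linalg.norm(x) < np.linalg.norm(x_init))   # noise is reduced
```

The final `x` plays the role of the step-4 output f_θ(x_0, t) that is handed to the control module of step 5.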
step 5, controlling vector errors: according to the control function, the control module maps the prediction vector f_θ(x_0, t) finally generated in step 4 to its nearest real word-embedding vector, generating the final prediction vector Squeeze(f_θ(x_0, t)), where x_0 is the sampling distribution at time 0 generated by the diffusion model at the end of the sampling process;
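A minimal sketch of the nearest-embedding mapping performed by the control function of step 5; the function name `squeeze` mirrors the Squeeze(·) notation of the text, and the toy embedding table is an illustrative assumption:

```python
import numpy as np

# Sketch of the control function of step 5: map a predicted vector to
# its nearest real word-embedding row under Euclidean distance.
rng = np.random.default_rng(2)
emb_table = rng.standard_normal((5, 4))  # 5 real word embeddings in R^4

def squeeze(pred):
    dists = np.linalg.norm(emb_table - pred, axis=1)
    return emb_table[int(np.argmin(dists))]

noisy = emb_table[3] + 0.01              # small perturbation of row 3
rounded = squeeze(noisy)                 # snaps back to the real embedding
```

Snapping each predicted vector onto the embedding table is what reduces the gap between generated and real text vectors before decoding.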
step 6, generating a summary: the final prediction vector Squeeze(f_θ(x_0, t)) generated in step 5 is converted into text information according to the word-embedding pre-coding of the input module, and the text abstract is output in text order.
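The conversion back to text in step 6 can be sketched as an inverse lookup of the word-embedding pre-coding. The vocabulary and embedding table below are toy assumptions, not the patent's:

```python
import numpy as np

# Sketch of step 6: invert the word-embedding pre-coding by finding
# each final prediction vector's row in the embedding table and
# emitting the corresponding tokens in text order.
rng = np.random.default_rng(3)
id2tok = {0: "文", 1: "本", 2: "摘", 3: "要"}
emb_table = rng.standard_normal((4, 6))

def decode(vectors):
    tokens = []
    for v in vectors:
        idx = int(np.argmin(np.linalg.norm(emb_table - v, axis=1)))
        tokens.append(id2tok[idx])
    return "".join(tokens)

summary = decode(emb_table[[2, 3]])      # exact rows decode to themselves
print(summary)                           # → 摘要
```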
As a further improvement of the present invention, in step 1, a trained fine-tuned bert-base-chinese model in the input module segments the corpus into sentences and tokens, and the corpus is pre-encoded by word embedding to obtain the sentence representation W = [w_1, w_2, w_3, ..., w_n].
As a further improvement of the present invention, in step 2, the dimension d of the vector space R^d is determined by the word-embedding vectors.
As a further improvement of the present invention, step 3 is specifically: the guidance module converts the source text whose abstract is to be generated into a text vector c and inputs it into the guidance network; control is exerted through the differentiable quantity ∇_{x_{t-1}} log p(c | x_{t-1}); the guidance network runs a gradient update at each moment x_{t-1}, generated through the guidance network's parameterized association between the condition text and the target text, where p(c | x_{t-1}) denotes the probability that the target text is c given the sampling distribution x_{t-1} generated by the diffusion model at time t-1, and ∇_{x_{t-1}} log p(c | x_{t-1}) denotes the gradient update performed on x_{t-1} at the t-th step of the sampling process.
As a further improvement of the invention, the guidance network used by the guidance module is a fine-tuned bert-base-chinese model.
As a further improvement of the invention, in step 4 the input to the diffusion model at the initial sampling time T is random Gaussian noise in the vector space R^d; a gradient update is run at each moment x_{t-1} of the whole sampling process, and sampling finally ends at x_0, generating the abstract text vector f_θ(x_0, t) that is input to step 5.
As a further improvement of the invention, the sampling mode used in step 4 is DDIM sampling: the abstract text vector f_θ(x_0, t) at the final time x_0 is generated by the denoising diffusion implicit model sampling method.
As a further improvement of the present invention, in step 5 the control module maps the prediction vector f_θ(x_{t-1}, t) to its nearest real word-embedding vector according to the control function, generating the final prediction vector so as to reduce the difference between the prediction vector and the real word-embedding vector.
As a further improvement of the invention, in step 5 the control function Squeeze(f_θ(x_t, t)) is applied to obtain the final prediction vector, making the prediction vector approximate a real word-embedding vector.
The beneficial effects of the invention are as follows: by encoding and modeling the input text, the method captures the semantic information in the text better, avoids the omissions and misunderstandings that can occur with traditional keyword-extraction methods, improves the accuracy of the text abstract, and improves the semantic generalization of the model.
Drawings
FIG. 1 is a technical roadmap of the text abstract generation method based on the diffusion model of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
In this case, in order to avoid obscuring the present invention due to unnecessary details, only the structures and/or processing steps closely related to the aspects of the present invention are shown in the drawings, and other details not greatly related to the present invention are omitted.
In addition, it should be further noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides a text abstract generation method based on a diffusion model. First, a natural language processing model performs word segmentation on the source text. The preprocessed source text is then pre-encoded by the input module into a sentence representation using word embedding, and the evaluation module maps the text vectors into a continuously transformable spatial dimension, realizing controllable linear continuous transformation of the text vectors. Finally, the condition guidance module obtains the target text, samples are generated under gradient guidance during the diffusion model's sampling process, the control module reduces the error between the generated text vectors and the real text vectors, and the output module generates a text abstract from the obtained sample vectors.
As shown in fig. 1, the text abstract generating method based on the diffusion model mainly comprises the following steps:
step 1, preprocessing text word segmentation: a natural language processing model in the input module performs preliminary preprocessing on the corpus, and the corpus is pre-encoded by word embedding to obtain the sentence representation W = [w_1, w_2, w_3, ..., w_n], where W is the encoded sentence representation and w_n is the code of the n-th token of the sentence;
step 2, generating a text vector: an auxiliary function in the evaluation module converts the sentence representation W = [w_1, w_2, w_3, ..., w_n] generated in step 1 into the word-embedding vectors EMB(W) = [EMB(w_1), ..., EMB(w_n)] ∈ R^d, mapping each text vector into the vector space R^d so that each text vector can be represented in R^d, where EMB(W) is the set of word-embedding vectors converted from the sentence representation W and EMB(w_n) is the word-embedding vector produced from the code of the n-th token of the sentence;
step 3, generating a condition guide: the guidance module feeds the text vectors generated from the source text into the diffusion model so as to introduce an additional conditional input co_θ(x_t, y); then, at each moment x_{t-1} of the diffusion model's sampling process, a gradient update of the condition-guidance network is run, where x_t denotes the predicted distribution generated by the diffusion model at time t, y denotes the conditional input introduced at time t, and co_θ denotes the conditional distribution generated by the model parameters θ given the additional conditional input y and the predicted distribution x_t;
step 4, sampling generation: the initial input of the diffusion model's sampling process is random Gaussian noise in the vector space R^d; at each moment x_{t-1} of the whole sampling process a gradient update is run to generate the prediction vector f_θ(x_{t-1}, t), where t denotes the moment after the model has sampled t steps, x_{t-1} denotes the sampling distribution at time t-1 generated by the diffusion model at time t, and f_θ denotes the prediction vector generated by the model from the distribution x_{t-1} at time t; the whole sampling process runs over the range t = T, ..., 0;
step 5, controlling vector errors: according to the control function, the control module maps the prediction vector f_θ(x_0, t) finally generated in step 4 to its nearest real word-embedding vector, generating the final prediction vector Squeeze(f_θ(x_0, t)), where x_0 is the sampling distribution at time 0 generated by the diffusion model at the end of the sampling process;
step 6, generating a summary: the final prediction vector Squeeze(f_θ(x_0, t)) generated in step 5 is converted into text information according to the word-embedding pre-coding of the input module, and the text abstract is output in text order.
The steps 1 to 6 will be described in detail below.
In step 1, a trained fine-tuned bert-base-chinese model in the input module performs preliminary preprocessing such as sentence segmentation and tokenization on the corpus, and the corpus is pre-encoded by word embedding to obtain the sentence representation W = [w_1, w_2, w_3, ..., w_n].
In step 2, the dimension d of the vector space R^d is determined by the word-embedding vectors.
The step 3 is specifically as follows: converting the source text expected to execute abstract into text vector c through the guiding module, inputting the text vector c into a guiding network, and enabling the text vector c to be micro-processedControl is performed by the director network for each time X t-1 Run gradient update generated by association of a director network parameterized control conditions text and a target text, where p (c|X t-1 ) Representing a sampling distribution X at the moment of generating t-1 by a diffusion model t-1 Probability of target text being c under the condition, +.>Representing the diffusion model at X when the step length is the t-th step in the process of sampling t-1 And performing gradient updating on the data, wherein a guide network used by the guide module is a fine-tuning bert-base-Chinese model.
In the sampling process of the diffusion model in step 4, the input at the initial time T is random Gaussian noise in the vector space R^d; a gradient update is run at each moment x_{t-1} of the whole sampling process, and sampling finally ends at x_0, generating the abstract text vector f_θ(x_0, t) that is input to step 5. The sampling mode used is DDIM sampling: the abstract text vector f_θ(x_0, t) at the final time x_0 is generated by the denoising diffusion implicit model sampling method.
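One deterministic DDIM update (η = 0) has the standard closed form sketched below. The noise prediction `eps` is a placeholder for a network output, and the alpha-bar schedule values are illustrative, not the patent's:

```python
import numpy as np

# One deterministic DDIM step: estimate x_0 from the current sample
# and the predicted noise, then re-project to the previous step.
def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps

x_t = np.array([1.0, -0.5])
eps = np.array([0.1, 0.2])               # stand-in for predicted noise
x_prev = ddim_step(x_t, eps, alpha_bar_t=0.5, alpha_bar_prev=0.8)
# with alpha_bar_prev = 1 the update returns the x_0 estimate itself
```

Because the update is deterministic, DDIM can sample with far fewer steps than ancestral sampling while ending at the same kind of x_0 estimate.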
In step 5, the control module maps the prediction vector f_θ(x_{t-1}, t) generated in step 4 to its nearest real word-embedding vector according to the control function, generating the final prediction vector so as to reduce the difference between the prediction vector and the real word-embedding vector. This difference is controlled by the control function Squeeze(f_θ(x_t, t)), which yields a prediction vector that approximates a real word-embedding vector, thereby accomplishing the task of controlling the accuracy of the generated vectors.
The step 6 is specifically as follows: embedding the text vector according to the final predicted word generated in step 5And according to word embedding pre-coding of the input module, converting the word embedding pre-coding into text information, and outputting a text abstract according to a text sequence.
Table 1 below shows the text generated after inputting the source text and the text abstract obtained from that source text:
TABLE 1
In summary, the invention performs word segmentation on the source text with a natural language processing model; the preprocessed source text is then pre-encoded by the input module into a sentence representation using word embedding, and the evaluation module maps the text vectors into a continuously transformable spatial dimension, realizing controllable linear continuous transformation of the text vectors; finally, the guidance module obtains the target text, samples are generated under gradient guidance during the diffusion model's sampling process, the control module reduces the error between the generated text vectors and the real text vectors, and the output module generates the text abstract from the obtained sample vectors. Compared with the prior art, the invention realizes text abstract generation based on a generative model, which helps improve the accuracy of the text abstract and improves the semantic generalization of the model.
In summary, the invention can make full use of the context information of the input text to generate a high-quality text abstract, improving abstract quality. It has strong generalization ability and is applicable to input texts of various types, including both long and short texts. It accurately captures the semantic information of the input text, avoiding the omissions and misunderstandings that can occur with traditional keyword-extraction methods, and improves the accuracy of the text abstract, so that users obtain higher-quality text abstract data.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.
Claims (9)
1. A text abstract generation method based on a diffusion model is characterized by mainly comprising the following steps:
step 1, preprocessing text word segmentation: a natural language processing model in the input module performs preliminary preprocessing on the corpus, and the corpus is pre-encoded by word embedding to obtain the sentence representation W = [w_1, w_2, w_3, ..., w_n], where W is the encoded sentence representation and w_n is the code of the n-th token of the sentence;
step 2, generating a text vector: an auxiliary function in the evaluation module converts the sentence representation W = [w_1, w_2, w_3, ..., w_n] generated in step 1 into the word-embedding vectors EMB(W) = [EMB(w_1), ..., EMB(w_n)] ∈ R^d, mapping each text vector into the vector space R^d so that each text vector can be represented in R^d, where EMB(W) is the set of word-embedding vectors converted from the sentence representation W and EMB(w_n) is the word-embedding vector produced from the code of the n-th token of the sentence;
step 3, generating a condition guide: the guidance module feeds the text vectors generated from the source text into the diffusion model so as to introduce an additional conditional input co_θ(x_t, y); then, at each moment x_{t-1} of the diffusion model's sampling process, a gradient update of the condition-guidance network is run, where x_t denotes the predicted distribution generated by the diffusion model at time t, y denotes the conditional input introduced at time t, and co_θ denotes the conditional distribution generated by the model parameters θ given the additional conditional input y and the predicted distribution x_t;
step 4, sampling generation: the initial input of the diffusion model's sampling process is random Gaussian noise in the vector space R^d; at each moment x_{t-1} of the whole sampling process a gradient update is run to generate the prediction vector f_θ(x_{t-1}, t), where t denotes the moment after the model has sampled t steps, x_{t-1} denotes the sampling distribution at time t-1 generated by the diffusion model at time t, and f_θ denotes the prediction vector generated by the model from the distribution x_{t-1} at time t; the whole sampling process runs over the range t = T, ..., 0;
step 5, controlling vector errors: according to the control function, the control module maps the prediction vector f_θ(x_0, t) finally generated in step 4 to its nearest real word-embedding vector, generating the final prediction vector Squeeze(f_θ(x_0, t)), where x_0 is the sampling distribution at time 0 generated by the diffusion model at the end of the sampling process;
step 6, generating a summary: the final prediction vector Squeeze(f_θ(x_0, t)) generated in step 5 is converted into text information according to the word-embedding pre-coding of the input module, and the text abstract is output in text order.
2. The text abstract generation method based on the diffusion model according to claim 1, wherein: in step 1, a trained fine-tuned bert-base-chinese model in the input module performs sentence segmentation and word segmentation on the corpus, and the corpus is pre-encoded by word embedding to obtain the sentence representation W = [w_1, w_2, w_3, ..., w_n].
3. The text abstract generation method based on the diffusion model according to claim 1, wherein: in step 2, the dimension d of the vector space R^d is determined by the word-embedding vectors.
4. The text abstract generation method based on the diffusion model according to claim 1, wherein step 3 is specifically: the guidance module converts the source text whose abstract is to be generated into a text vector c and inputs it into the guidance network; control is exerted through the differentiable quantity ∇_{x_{t-1}} log p(c | x_{t-1}); the guidance network runs a gradient update at each moment x_{t-1}, generated through the guidance network's parameterized association between the condition text and the target text, where p(c | x_{t-1}) denotes the probability that the target text is c given the sampling distribution x_{t-1} generated by the diffusion model at time t-1, and ∇_{x_{t-1}} log p(c | x_{t-1}) denotes the gradient update performed on x_{t-1} at the t-th step of the sampling process.
5. The text abstract generation method based on the diffusion model according to claim 4, wherein: the guidance network used by the guidance module is a fine-tuned bert-base-chinese model.
6. The text abstract generation method based on the diffusion model according to claim 1, wherein: in the sampling process of the diffusion model in step 4, the input at the initial time T is random Gaussian noise in the vector space R^d; a gradient update is run at each moment x_{t-1} of the whole sampling process, and sampling finally ends at x_0, generating the abstract text vector f_θ(x_0, t) that is input to step 5.
7. The text abstract generation method based on the diffusion model according to claim 6, wherein: the sampling mode used in the sampling process of step 4 is DDIM sampling, and the abstract text vector f_θ(x_0, t) at the final time x_0 is generated by the denoising diffusion implicit model sampling method.
8. The text abstract generation method based on the diffusion model according to claim 1, wherein: in step 5, the control module maps the prediction vector f_θ(x_{t-1}, t) finally generated in step 4 to its nearest real word-embedding vector according to the control function, generating the final prediction vector so as to reduce the difference between the prediction vector and the real word-embedding vector.
9. The text abstract generation method based on the diffusion model according to claim 8, characterized in that: in step 5, the control function Squeeze(f_θ(x_t, t)) is used to obtain the final prediction vector, making the prediction vector approximate a real word-embedding vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310419493.1A CN116484858A (en) | 2023-04-19 | 2023-04-19 | Text abstract generation method based on diffusion model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116484858A true CN116484858A (en) | 2023-07-25 |
Family
ID=87226310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310419493.1A Pending CN116484858A (en) | 2023-04-19 | 2023-04-19 | Text abstract generation method based on diffusion model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116484858A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117131187A (en) * | 2023-10-26 | 2023-11-28 | 中国科学技术大学 | Dialogue abstracting method based on noise binding diffusion model |
CN117131187B (en) * | 2023-10-26 | 2024-02-09 | 中国科学技术大学 | Dialogue abstracting method based on noise binding diffusion model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111522839B (en) | Deep learning-based natural language query method | |
CN113987209A (en) | Natural language processing method and device based on knowledge-guided prefix fine tuning, computing equipment and storage medium | |
CN111177376A (en) | Chinese text classification method based on BERT and CNN hierarchical connection | |
CN110162766B (en) | Word vector updating method and device | |
CN110442880B (en) | Translation method, device and storage medium for machine translation | |
CN115599901B (en) | Machine question-answering method, device, equipment and storage medium based on semantic prompt | |
WO2023137911A1 (en) | Intention classification method and apparatus based on small-sample corpus, and computer device | |
CN115618022B (en) | Low-resource relation extraction method based on data synthesis and two-stage self-training | |
CN116484858A (en) | Text abstract generation method based on diffusion model | |
CN114692621A (en) | Method for explaining influence function from sequence to sequence task based on sample in NLP | |
Elbedwehy et al. | Efficient Image Captioning Based on Vision Transformer Models. | |
CN110609895B (en) | Sample automatic generation method for actively selecting examples to conduct efficient text classification | |
CN116303966A (en) | Dialogue behavior recognition system based on prompt learning | |
CN116485962A (en) | Animation generation method and system based on contrast learning | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
CN114743539A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN115080736A (en) | Model adjusting method and device of discriminant language model | |
CN113220892A (en) | BERT-based self-adaptive text classification method and device | |
CN112015921A (en) | Natural language processing method based on learning-assisted knowledge graph | |
CN110619118A (en) | Automatic text generation method | |
CN111967580B (en) | Low-bit neural network training method and system based on feature migration | |
CN116975344B (en) | Chinese character library generation method and device based on Stable diffration | |
CN116070638B (en) | Training updating method and system for Chinese sentence feature construction | |
CN114492332A (en) | Controlled text generation method based on hidden variable manipulation of variational self-encoder | |
US20240005905A1 (en) | End-to-end natural and controllable emotional speech synthesis methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||