CN116108157A - Method for training text generation model, text generation method and device - Google Patents

Method for training text generation model, text generation method and device

Info

Publication number
CN116108157A
Authority
CN
China
Prior art keywords
text
sample
training
output
diffusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310387160.5A
Other languages
Chinese (zh)
Other versions
CN116108157B (en)
Inventor
袁正
苑洪意
谭传奇
黄非
黄松芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202310387160.5A priority Critical patent/CN116108157B/en
Publication of CN116108157A publication Critical patent/CN116108157A/en
Application granted granted Critical
Publication of CN116108157B publication Critical patent/CN116108157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application disclose a method for training a text generation model, a text generation method, and corresponding devices. The main technical scheme comprises the following steps: acquiring training data comprising a plurality of training samples, wherein each training sample is a sample pair formed by an input text sample and an output text sample; acquiring a feature representation of the output text sample in a sample pair, and applying a noising diffusion process to the feature representation to obtain a noised feature representation; and training the text generation model with the input text sample of the sample pair and the noised feature representation as input, wherein during training the text generation model simulates the inverse of the noising diffusion process based on the input text sample and the noised feature representation, with the output text sample as the target output. By introducing a diffusion probability generation mechanism into the field of text generation, the method and device can improve the quality of generated text.

Description

Method for training text generation model, text generation method and device
Technical Field
The present application relates to the fields of natural language processing and artificial intelligence, and in particular to a method for training a text generation model, a text generation method, and a text generation device.
Background
Text-to-text generation techniques transform and process input text to obtain new text. They mainly include text summarization, text rewriting, machine translation, automatic question answering, and the like. Text-to-text generation mostly uses a text generation model with an encoder-decoder architecture. Conventional text generation models are usually trained by maximum likelihood estimation, i.e., the model parameters are chosen so that the probability of the desired sample result occurring is maximized. However, the text generation quality achieved by maximum likelihood estimation leaves room for improvement.
Disclosure of Invention
In view of this, the present application provides a method for training a text generation model, a text generation method, and corresponding devices, so as to improve the quality of generated text.
The application provides the following scheme:
in a first aspect, a method of training a text generation model is provided, the method comprising:
acquiring training data comprising a plurality of training samples, wherein each training sample is a sample pair formed by an input text sample and an output text sample;
acquiring a feature representation of the output text sample in a sample pair, and applying a noising diffusion process to the feature representation to obtain a noised feature representation;
and training the text generation model with the input text sample of the sample pair and the noised feature representation as input, wherein during training the text generation model simulates the inverse of the noising diffusion process based on the input text sample and the noised feature representation, with the output text sample as the target output.
According to one implementation in an embodiment of the present application, the text generation model includes an encoder and a decoder;
the encoder obtains a feature representation of the input text sample, and the decoder performs the inverse diffusion processing using the feature representation of the input text sample and the noised feature representation to obtain the output text sample;
the training targets include: minimizing the difference between the distribution produced by the noising diffusion process and the distribution produced by the inverse diffusion process.
According to an implementation of the embodiments of the present application, the training targets further include: minimizing the difference between the distribution of the feature representation obtained at the last diffusion time step and a standard normal distribution; and/or minimizing the difference between the feature representation produced by the last inverse diffusion time step and the feature representation of the output text sample.
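Under stated assumptions, the combined training targets above can be sketched numerically. The function below is illustrative only: `mu_q` and `mu_theta` stand for the means of the noising and inverse diffusion distributions at one time step, `z_last` for the feature representation after the last diffusion step, and the simple squared-error terms are one possible way to measure the three differences; the patent does not fix these choices.

```python
import numpy as np

def diffusion_training_loss(z0, z0_hat, mu_q, mu_theta, z_last, beta_tilde):
    """Hypothetical combination of the three training targets described above.

    - l_step:  difference between the forward (noising) posterior mean and the
               mean produced by the model's inverse diffusion step
    - l_prior: difference between the last-step noisy representation and a
               standard normal (measured here against zero mean, unit scale)
    - l_recon: difference between the final back-diffused representation and
               the feature representation z0 of the output text sample
    """
    l_step = np.mean((mu_q - mu_theta) ** 2) / (2.0 * beta_tilde)
    l_prior = np.mean(z_last ** 2)          # encourages z_last ~ N(0, I)
    l_recon = np.mean((z0 - z0_hat) ** 2)
    return l_step + l_prior + l_recon
```

When the model's step means match the forward posterior, the last noisy representation is centered, and the reconstruction is exact, every term vanishes, which is one way to see that the three targets are compatible.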
According to an implementation of the embodiments of the present application, obtaining the feature representation of the output text sample in the sample pair includes:
performing word embedding on the output text sample using an embedding network to obtain the feature representation of the output text sample; or,
encoding the output text sample with the encoder of the text generation model to obtain the feature representation of the output text sample.
According to an implementation of the embodiments of the present application, in the noising diffusion process, the first diffusion time step adds noise to the feature representation of the output text sample, each subsequent time step adds noise to the feature representation obtained at the previous time step, and the feature representation obtained at each time step follows a normal distribution.
According to an implementation of the embodiments of the present application, in the inverse diffusion process, the feature representation obtained by the inverse diffusion at each time step is sampled from the posterior distribution conditioned on the feature representation obtained at the previous time step; alternatively, the feature representation obtained by the inverse diffusion at each time step is sampled from the prior distribution conditioned on a predicted ẑ_0, where ẑ_0 is a prediction of the feature representation that was input to the first diffusion step.
In a second aspect, there is provided a text generation method, the method comprising:
acquiring an input text;
inputting the input text and random noise into a text generation model, and performing inverse diffusion processing by the text generation model based on the input text and the random noise to obtain an output text;
wherein the text generation model is pre-trained by the method described in the first aspect.
According to one implementation in an embodiment of the present application, the text generation model includes an encoder and a decoder;
the encoder obtains a feature representation of the input text;
the decoder performs inverse diffusion processing using the feature representation of the input text and the random noise to predict the output text; wherein, in the inverse diffusion process, the feature representation obtained at each time step is sampled from the posterior distribution conditioned on the feature representation obtained at the previous time step; alternatively, it is sampled from the prior distribution conditioned on a predicted ẑ_0, where ẑ_0 is a prediction of the feature representation that was input to the first diffusion step.
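As an illustrative sketch of this inference procedure, the loop below starts from random noise and repeatedly predicts ẑ_0 and re-samples; the name `predict_z0` is an assumption standing in for the trained decoder, not the patent's actual interface.

```python
import numpy as np

def reverse_diffusion(x_feat, z_T, predict_z0, alpha_bar, rng):
    """Minimal sketch of the inverse diffusion loop described above.

    `predict_z0(z_t, t, x_feat)` stands in for the trained decoder: it returns
    a prediction z0_hat of the original feature representation. Each step then
    samples z_{t-1} from the closed-form prior conditioned on z0_hat:
    N(sqrt(alpha_bar[t-1]) * z0_hat, (1 - alpha_bar[t-1]) * I).
    `alpha_bar[i]` holds the cumulative product for forward step i+1.
    """
    z = z_T
    T = len(alpha_bar)
    for t in range(T, 0, -1):
        z0_hat = predict_z0(z, t, x_feat)
        if t == 1:
            z = z0_hat                      # final step returns the prediction
        else:
            mean = np.sqrt(alpha_bar[t - 2]) * z0_hat
            std = np.sqrt(1.0 - alpha_bar[t - 2])
            z = mean + std * rng.standard_normal(z0_hat.shape)
    return z
```

Note that only the decoder runs inside the loop; the encoder's feature representation `x_feat` is computed once and reused at every time step, which is the saving described under the technical effects below.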
In a third aspect, a summary generating method is provided, the method includes:
acquiring an input text;
inputting the input text and random noise into a text generation model, and performing inverse diffusion processing by the text generation model based on the input text and the random noise to obtain a summary of the input text;
wherein the text generation model is pre-trained by the method described in the first aspect.
In a fourth aspect, a machine translation method is provided, the method comprising:
acquiring a text in a first language;
inputting the text in the first language and random noise into a text generation model, and performing inverse diffusion processing by the text generation model based on the text in the first language and the random noise to obtain a text in a second language;
wherein the text generation model is pre-trained by the method described in the first aspect.
In a fifth aspect, there is provided an apparatus for training a text generation model, the apparatus comprising:
a sample acquisition unit configured to acquire training data comprising a plurality of training samples, each training sample being a sample pair of an input text sample and an output text sample;
a noising diffusion unit configured to acquire the feature representation of the output text sample in the sample pair and apply a noising diffusion process to it to obtain a noised feature representation;
and a model training unit configured to train the text generation model with the input text sample of the sample pair and the noised feature representation as input, wherein during training the text generation model simulates the inverse of the noising diffusion process based on the input text sample and the noised feature representation, with the output text sample as the target output.
In a sixth aspect, there is provided a text generating apparatus, the apparatus comprising:
a text acquisition unit configured to acquire an input text;
a text generation unit configured to input the input text and random noise into a text generation model, and to obtain the output text produced by the text generation model through inverse diffusion processing based on the input text and the random noise;
wherein the text generation model is pre-trained by the apparatus of the fifth aspect.
According to a seventh aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects above.
According to an eighth aspect, there is provided an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the first aspects above.
According to a specific embodiment provided by the application, the application discloses the following technical effects:
1) Instead of training the text generation model by maximum likelihood estimation, the present application provides a brand-new approach: a diffusion probability generation mechanism is introduced into the field of text generation, the text generation process is modeled as the inverse of a noising diffusion process, and the information loss caused by noise is counteracted, thereby achieving a better text generation effect.
2) In the actual prediction process, the input and the processing of the encoder are unchanged; that is, the encoder still performs only one feed-forward pass of the neural network and does not participate in the inverse diffusion process, which may require hundreds of time steps, so computing resources are greatly saved.
3) In the inverse diffusion process, sampling from the prior distribution conditioned on the predicted feature representation of the output text sample is closer to the training target, and feature representations of higher quality can be obtained.
Of course, not all of the above-described advantages need be achieved at the same time in practicing any one of the products of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a system architecture to which embodiments of the present application are applicable;
FIG. 2 is a flowchart of a method for training a text generation model according to an embodiment of the present application;
fig. 3 is a schematic diagram of training principle of a text generation model according to an embodiment of the present application;
fig. 4 is a flowchart of a text generation method provided in an embodiment of the present application;
fig. 5 is a schematic diagram of prediction principle of a text generation model according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of an apparatus for training a text generation model provided by an embodiment of the present application;
FIG. 7 is a schematic block diagram of a text generating device provided in an embodiment of the present application;
fig. 8 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determining" or "when detected (stated condition or event)" or "in response to detecting (stated condition or event)", depending on the context.
To facilitate an understanding of the present application, a brief description of the system architecture to which it applies is first provided. As shown in FIG. 1, the exemplary system architecture to which embodiments of the present application may be applied includes a model training device and a text generation device.
After acquiring the training data in an offline stage, the model training device performs model training using the method provided by the embodiments of the present application to obtain the text generation model.
The text generation device uses the established text generation model to generate output text based on input text online. The text generation model is in effect a sequence-to-sequence (seq2seq) model, enabling prediction from one text sequence to another.
The model training device and the text generation device may each be set up as an independent server, may be set up in the same server or server group, or may be set up in independent or the same cloud server. A cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability in traditional physical host and VPS (Virtual Private Server) services. The model training device and the text generation device can also be arranged on a computer terminal with strong computing power.
In addition to generating text online, the text generation device may also generate text offline, for example, to generate summaries for a batch of texts.
It should be understood that the number of model training devices, text generation devices, and text generation models in fig. 1 are merely illustrative. There may be any number of model training means, text generating means, and text generating models, as desired for implementation.
Fig. 2 is a flowchart of a method for training a text generation model according to an embodiment of the present application, where the method flow may be performed by a model training apparatus in the system shown in fig. 1. As shown in fig. 2, the method may include:
Step 202: training data is obtained that includes a plurality of training samples, the training samples including pairs of input text samples and output text samples.
Step 204: and obtaining the characteristic representation of the output text sample in the sample pair, and carrying out noise adding diffusion treatment on the characteristic representation of the output text sample to obtain the characteristic representation after noise adding.
Step 206: and taking the input text sample and the characteristic representation after noise addition of the sample pair as an input training text generation model, and simulating inverse diffusion processing of noise addition diffusion based on the input text sample and the characteristic representation after noise addition in the training process by the text generation model so as to take the output text sample as a target to be output.
As can be seen from the above flow, the present application no longer trains the text generation model by maximum likelihood estimation, but provides a brand-new approach: a diffusion probability generation mechanism is introduced into the field of text generation, and the text generation process is modeled as the inverse of a noising diffusion process, counteracting the information loss caused by noise and thereby achieving a better text generation effect.
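The three steps can be sketched as one hypothetical training iteration. All callables here are stand-ins for model components, not the patent's actual interfaces:

```python
import numpy as np

def train_step(x_tokens, y_tokens, embed, encode, decode_denoise, diffuse, loss_fn):
    """One hypothetical training step following steps 202-206 above.

    embed/encode produce feature representations, diffuse applies the forward
    noising, and decode_denoise simulates the inverse diffusion conditioned on
    the encoder output. All six callables are illustrative placeholders.
    """
    z0 = embed(y_tokens)                        # representation of the output sample
    z_noisy = diffuse(z0)                       # noised representation (step 204)
    x_feat = encode(x_tokens)                   # encoder output for the input sample
    z0_hat = decode_denoise(x_feat, z_noisy)    # inverse diffusion (step 206)
    return loss_fn(z0, z0_hat)                  # target: recover the output sample
```

With toy stand-ins (e.g., `diffuse` adding a constant and `decode_denoise` removing it), the loss goes to zero, mirroring the training target of undoing the noising process.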
Each step of the above flow is described in detail below. Step 202, i.e., "acquiring training data comprising a plurality of training samples", is described first with reference to embodiments.
Although the present application does not adopt maximum likelihood estimation, the acquired training data is the same as the training data used by conventional maximum likelihood training. The training data comprises a plurality of training samples, each training sample being a sample pair formed by an input text sample x and an output text sample y.
For example, in a summary generation application scenario, some articles may be taken as input text samples, and summaries of these articles may be taken as output text samples.
For another example, in the field of text rewriting, sentences or paragraphs may be taken as input text samples, and text expressing them differently, i.e., rewritten text, taken as the output text samples.
For another example, in the field of machine translation, text in a first language may be used as the input text sample, and the corresponding text in a second language as the output text sample. For example, a Chinese sentence is taken as the input text sample, and the English sentence it translates to as the output text sample.
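A minimal sketch of what such sample pairs might look like across the three example tasks; the contents below are invented for illustration only.

```python
# Hypothetical training pairs for the three task types mentioned above:
# summarization, text rewriting, and machine translation.
training_samples = [
    {"task": "summarization",
     "input_text": "A long article about diffusion models for text ...",
     "output_text": "A short summary."},
    {"task": "rewrite",
     "input_text": "The cat sat on the mat.",
     "output_text": "A cat was sitting on the mat."},
    {"task": "translation",
     "input_text": "你好，世界",
     "output_text": "Hello, world"},
]

def as_pairs(samples):
    """Each training sample reduces to the (x, y) pair used in steps 202-206."""
    return [(s["input_text"], s["output_text"]) for s in samples]
```

The training procedure itself is task-agnostic: whatever the task, only the (x, y) pair enters steps 202 through 206.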
Step 204, i.e., "obtaining the feature representation of the output text sample in the sample pair, and applying a noising diffusion process to the feature representation to obtain the noised feature representation", is described below in connection with embodiments.
Denoising diffusion probability models have been applied in the field of image generation, where they surpass traditional generative adversarial models, but they remain unexplored in the field of natural language processing. A denoising diffusion probability model mainly comprises two processes: forward noising diffusion and backward denoising. The forward noising diffusion process gradually adds noise, time step by time step, on the basis of the output text sample.
In the field of natural language processing, the denoising diffusion probability model cannot be directly applied to natural language generation tasks because natural language is discrete. In the embodiments of the present application, a sequence y composed of Tokens (elements) is first mapped to a continuous feature representation z_0 ∈ R^(n×d), i.e., a feature representation composed of the word vectors of each Token, where n and d are the length of y and the dimension of the word vectors, respectively. Each Token of a text is an element constituting the text: the text is segmented into a sequence of characters or words, and the characters or words, as well as the start and separator symbols, in the text sequence are Tokens.
As one implementation, an embedding network can be used to perform word embedding on the output text sample y, obtaining its feature representation z_0.

As another implementation, the encoder in the text generation model can be used to encode the output text sample y, obtaining its feature representation z_0.
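The first option, word embedding, can be sketched as a simple table lookup; the table contents, vocabulary size, and vector dimension below are illustrative assumptions.

```python
import numpy as np

def embed_tokens(token_ids, embedding_table):
    # Look up one d-dimensional word vector per token id; the result is the
    # continuous feature representation z_0 of shape (n, d) described above.
    return embedding_table[np.asarray(token_ids)]

# Hypothetical vocabulary size and word-vector dimension.
vocab_size, d = 100, 8
rng = np.random.default_rng(7)
table = rng.standard_normal((vocab_size, d))
z0 = embed_tokens([2, 17, 5], table)   # a 3-token output sample y -> shape (3, d)
```

The second option replaces the lookup with a full encoder forward pass over y, but either way the result is a continuous z_0 on which the noising diffusion can operate.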
The noising diffusion process is a multi-time-step diffusion process applied to the feature representation of the output text sample. The diffusion at the first time step adds noise to the feature representation z_0 of the output text sample to obtain z_1. The diffusion at each subsequent time step adds noise to the feature representation obtained at the previous time step. The feature representation obtained at each time step follows a normal distribution; that is, the forward noising diffusion process can be seen as adding a Markov transition distribution at each step.

As shown in FIG. 3, the Markov transition distribution of the first diffusion time step can be defined as q(z_1 | z_0), for example:

q(z_1 | z_0) = N(z_1; √(1−β_1) z_0, β_1 I)    (1)

For the diffusion at the other time steps, taking time step t+1 as an example, the Markov transition distribution can be defined as q(z_{t+1} | z_t):

q(z_{t+1} | z_t) = N(z_{t+1}; √(1−β_{t+1}) z_t, β_{t+1} I)    (2)
where q(z_{t+1} | z_t) is a distribution over z_{t+1}: a normal distribution with mean √(1−β_{t+1}) z_t and variance β_{t+1} I. The parameters β_t used by the diffusion at each time step are preset, and I is the identity matrix. z_{t+1} and z_t are the feature representations obtained by the diffusion at time steps t+1 and t, respectively. After a preset number of diffusion time steps (e.g., T+1 time steps), z_{T+1} is obtained, and z_{T+1} should be as close to a standard normal distribution as possible. The more diffusion time steps, the closer z_{T+1} is to a normal distribution and the better the effect, but correspondingly the more computing resources are occupied and the longer the process takes; a relatively balanced value, for example 2000 time steps, therefore needs to be chosen empirically or experimentally.

Through the forward noising diffusion process, the discrete output text samples are incorporated into the continuous denoising diffusion probability model: noise is gradually added to z_0 until a sample z_{T+1} that follows the prior distribution is obtained. The prior distribution adopted in the embodiments of the present application is the normal distribution.
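The step-by-step forward noising of equations (1) and (2) can be sketched as follows; the linear β schedule is an assumption, since the patent only states that the β values are preset.

```python
import numpy as np

def forward_diffuse(z0, betas, rng):
    """Forward noising diffusion per equations (1) and (2):
    z_t = sqrt(1 - beta_t) * z_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, I).
    Returns the list [z_0, z_1, ..., z_T]."""
    zs = [z0]
    for beta in betas:
        eps = rng.standard_normal(z0.shape)
        zs.append(np.sqrt(1.0 - beta) * zs[-1] + np.sqrt(beta) * eps)
    return zs

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8))            # a toy (n, d) feature representation
betas = np.linspace(1e-4, 0.5, 2000)        # assumed schedule; 2000 steps as in the text
zs = forward_diffuse(z0, betas, rng)
```

With enough steps, the final representation is statistically indistinguishable from standard normal noise, which is exactly the property the text says must hold at the end of the forward process.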
Step 206, i.e., "taking the input text sample of the sample pair and the noised feature representation as input to train the text generation model, the text generation model simulating the inverse of the noising diffusion process based on the input text sample and the noised feature representation during training so as to produce the output text sample as the target", is described in detail below in connection with embodiments.
Training the text generation model is actually a process of simulating (i.e., learning) the inverse diffusion on the basis of the forward noising diffusion. The architecture of the text generation model adopted in the embodiments of the present application is an encoder-decoder structure. As shown in FIG. 3, the input text sample x is fed into the encoder, which encodes x to obtain the feature representation of the input text sample.
The encoder may be implemented based on a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers), XLNet (an autoregressive model that captures bidirectional context through a permutation language model), or a GPT (Generative Pre-Training) model, used as the initial encoder and further trained on this basis. BERT is a bidirectional pre-trained language model that uses the Transformer encoder as its model structure and can make good use of context information for feature learning. XLNet is a BERT-like, more generalized autoregressive pre-training model. GPT uses the Transformer decoder structure, retaining only the masked multi-head attention in the decoder.
The Transformer network is a model that uses a self-attention mechanism to encode each Token of the input into a feature representation. In addition, besides the Transformer-based encoder-decoder architecture, encoder-decoder architectures based on other networks, such as RNNs (Recurrent Neural Networks), may be employed.
The decoder performs the inverse diffusion processing using the feature representation of the input text sample and the noised feature representation to obtain the output text sample.

For the text generation task, each time step can be viewed as removing noise, conditioned on the input text sample, from the feature representation obtained by the inverse diffusion at the previous time step. For the first inverse diffusion time step, noise is removed from the noised feature representation z_{T+1}.
Each time step's denoising (i.e., back-diffusion processing) can be considered as simulating the inverse of the noise-adding process, i.e., simulating the posterior distribution of the forward noise-adding diffusion process, expressed as $q(z_{t-1} \mid z_t, z_0)$, which follows the form of the Gaussian distribution family. Conditioned on $z_t$ and $z_0$, $z_{t-1}$ can be expressed as:

$$q(z_{t-1} \mid z_t, z_0) = \mathcal{N}\big(z_{t-1};\ \tilde{\mu}_t(z_t, z_0),\ \tilde{\beta}_t I\big) \tag{3}$$

which is a Gaussian distribution over $z_{t-1}$ with mean $\tilde{\mu}_t(z_t, z_0)$ and variance $\tilde{\beta}_t$, where:

$$\tilde{\mu}_t(z_t, z_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, z_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, z_t \tag{4}$$

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \tag{5}$$

$$\alpha_t = 1 - \beta_t \tag{6}$$

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \tag{7}$$

where $\beta_t$ is a preset parameter, $p_\theta(z_{t-1} \mid z_t)$ is the processing function that the text generation model needs to simulate (it can also be regarded as the denoising function learned by the model), and $\theta$ refers to the model parameters.
As can be seen from the above description, as one implementation, in the back-diffusion process the feature representation obtained by the back-diffusion processing of each time step is obtained by sampling from the posterior distribution of the feature representation obtained by the back-diffusion processing of the previous time step; i.e., $z_{t-1}$ is sampled from $q(z_{t-1} \mid z_t, z_0)$.
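A minimal numerical sketch of sampling from this posterior, assuming the standard Gaussian-diffusion parameterization of eqs. (3) to (7) and a hypothetical linear noise schedule (the schedule values and dimensions are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
betas = np.linspace(1e-4, 0.02, T)       # preset parameters beta_t (hypothetical schedule)
alphas = 1.0 - betas                     # eq. (6): alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)          # eq. (7): cumulative product of alpha_s

def posterior_sample(z_t, z_0, t):
    """Sample z_{t-1} from the Gaussian posterior q(z_{t-1} | z_t, z_0), eqs. (3)-(5)."""
    a_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    mean = (np.sqrt(a_bar_prev) * betas[t] / (1 - alpha_bars[t]) * z_0
            + np.sqrt(alphas[t]) * (1 - a_bar_prev) / (1 - alpha_bars[t]) * z_t)
    var = (1 - a_bar_prev) / (1 - alpha_bars[t]) * betas[t]
    return mean + np.sqrt(var) * rng.normal(size=z_t.shape)

z_0 = rng.normal(size=(4,))              # feature representation of the output text sample
z_t = rng.normal(size=(4,))              # noisy feature representation at time step t
z_prev = posterior_sample(z_t, z_0, t=50)
```

In the trained model, the true $z_0$ is of course unknown at generation time; the decoder's learned denoising function supplies the quantity that plays its role.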
As another implementation, each time step in the noise-adding diffusion process can be expressed as a prior distribution conditioned on $z_0$:

$$q(z_t \mid z_0) = \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t}\,z_0,\ (1-\bar{\alpha}_t)\,I\big) \tag{8}$$

Thus, in the back-diffusion process, the feature representation $z_{t-1}$ obtained by the back-diffusion processing of each time step can be obtained by sampling from the prior distribution conditioned on the predicted $\hat{z}_0$, where $\hat{z}_0$ is the prediction of the feature representation subjected to the first diffusion processing (i.e., the feature representation of the output text sample). That is, each back-diffusion step first predicts a $\hat{z}_0$, and then samples $z_{t-1}$ from the prior distribution $q(z_{t-1} \mid \hat{z}_0)$. The initial $\hat{z}_0$ is inaccurate, but as the time steps proceed the prediction of $\hat{z}_0$ becomes more and more accurate; in the final back-diffusion step, which yields $z_0$, the goal is to make $\hat{z}_0$ consistent with $z_0$. Such a sampling method is closer to the training target and can yield a higher-quality feature representation.
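The alternative, prediction-based sampling route can be sketched as follows; the $\hat{z}_0$-prediction function here is a hypothetical placeholder standing in for the trained decoder, and the schedule values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
alphas = 1.0 - np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(alphas)

def predict_z0(z_t, t):
    """Stand-in for the model's denoising function: in the real system the
    decoder predicts z0_hat from z_t conditioned on the encoded input text."""
    return z_t / np.sqrt(alpha_bars[t])   # hypothetical placeholder prediction

def prior_sample(z0_hat, t_prev):
    """Sample z_{t-1} from the prior q(z_{t-1} | z0_hat) of eq. (8):
    N(sqrt(alpha_bar_{t-1}) * z0_hat, (1 - alpha_bar_{t-1}) I)."""
    return (np.sqrt(alpha_bars[t_prev]) * z0_hat
            + np.sqrt(1 - alpha_bars[t_prev]) * rng.normal(size=z0_hat.shape))

z_t = rng.normal(size=(4,))
z0_hat = predict_z0(z_t, t=50)            # each back-diffusion step first predicts z0_hat
z_prev = prior_sample(z0_hat, t_prev=49)  # then samples z_{t-1} from the prior
```

As the description notes, early predictions of $\hat{z}_0$ are rough, but each later step conditions on a less noisy $z_t$, so the prediction improves as the loop approaches $t=0$.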
Because the denoising process is the inverse of the noise-adding diffusion, ideally the text generation model completely learns this inverse process so as to finally predict $z_0$. Therefore, the training targets adopted for training the text generation model in the embodiment of the application mainly include: minimizing the difference between the distribution generated by the noise-adding diffusion process and the distribution generated by the back-diffusion process.
Further, in the noise-adding diffusion process, it is desirable that the characteristic representation obtained by the diffusion process of the last time step is the same as random noise, so the training target may further include: the difference between the distribution of the characteristic representation obtained by the diffusion processing of the last time step and the normal distribution is minimized.
Further, in the back-diffusion (i.e., denoising) process, it is desirable that the inverse of the noise-adding diffusion be completely simulated, with the feature representation obtained by the back diffusion of the last time step being fully consistent with the feature representation $z_0$ of the output text sample. Thus, the training objectives described above may also include minimizing the difference between the feature representation obtained by the back diffusion of the last time step and the feature representation of the output text sample.
In this embodiment of the present disclosure, a loss function may be constructed according to the training targets, and in each iteration the model parameters may be updated with the value of the loss function, e.g., by gradient descent, until a preset training end condition is satisfied. The training end condition may include, for example, the value of the loss function being less than or equal to a preset loss-function threshold, the number of iterations reaching a preset threshold, etc.
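The iterate-until-converged procedure just described can be sketched generically; the quadratic loss below is a toy stand-in for the actual diffusion training loss, and all names and thresholds are illustrative:

```python
import numpy as np

def train(loss_fn, grad_fn, params, lr=0.1, loss_threshold=1e-4, max_iters=1000):
    """Generic training loop: update parameters by gradient descent until the
    loss falls below a preset threshold or the iteration budget is exhausted."""
    for _ in range(max_iters):            # training end condition 2: iteration cap
        loss = loss_fn(params)
        if loss <= loss_threshold:        # training end condition 1: loss threshold
            break
        params = params - lr * grad_fn(params)
    return params, loss_fn(params)

# toy quadratic stand-in for the diffusion loss
target = np.array([1.0, -2.0])
loss_fn = lambda p: float(np.sum((p - target) ** 2))
grad_fn = lambda p: 2 * (p - target)
params, final_loss = train(loss_fn, grad_fn, np.zeros(2))
```

In practice the parameter update would be carried out by an autodiff framework over the encoder-decoder network rather than a hand-written gradient.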
As one of the realizations, a loss function $\mathcal{L}$ may be constructed as follows:

$$\mathcal{L} = \mathbb{E}_{q}\Big[\, D_{KL}\big(q(z_T \mid z_0)\,\|\,p(z_T)\big) + \sum_{t=2}^{T} D_{KL}\big(q(z_{t-1} \mid z_t, z_0)\,\|\,p_\theta(z_{t-1} \mid z_t)\big) - \log p_\theta(z_0 \mid z_1) \,\Big]$$

where $\mathbb{E}_q[\cdot]$ denotes taking the expectation, i.e., the expectation of the bracketed content under the constraint of the distribution $q$. $p_\theta(z_{t-1} \mid z_t)$ refers to the distribution of $z_{t-1}$ obtained by the model's back diffusion from $z_t$, which should conform to the distribution $q(z_{t-1} \mid z_t, z_0)$. The term $D_{KL}\big(q(z_{t-1} \mid z_t, z_0)\,\|\,p_\theta(z_{t-1} \mid z_t)\big)$ reflects the difference between the distribution produced by the back-diffusion process and the distribution produced by the noise-adding diffusion process. The term $D_{KL}\big(q(z_T \mid z_0)\,\|\,p(z_T)\big)$ concerns the distribution of the feature representation obtained by the diffusion processing of the last time step, and thus embodies the difference between that distribution and the normal distribution. $p_\theta(z_0 \mid z_1)$ indicates the probability of predicting $z_0$ on the premise of obtaining $z_1$; thus $-\log p_\theta(z_0 \mid z_1)$ embodies the difference between the feature representation obtained by the back diffusion of the last time step and the feature representation $z_0$ of the output text sample.
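Because all the distributions involved are Gaussian, each KL term of such a loss reduces to a closed form; a sketch with hypothetical means and variances for a 2-dimensional feature (the numbers are illustrative, not outputs of any trained model):

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL divergence between two diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# hypothetical per-step distributions
mu_post, var_post = np.array([0.1, -0.2]), np.array([0.5, 0.5])    # posterior of noising
mu_model, var_model = np.array([0.15, -0.1]), np.array([0.5, 0.5]) # model's back diffusion

term_prior = kl_gauss(np.array([0.05, 0.0]), np.array([0.99, 0.99]),
                      np.zeros(2), np.ones(2))   # last noising step vs. standard normal
term_steps = kl_gauss(mu_post, var_post, mu_model, var_model)
# reconstruction term: -log-likelihood of z_0 under a unit-variance Gaussian decoder,
# dropping additive constants
term_recon = 0.5 * np.sum((np.array([1.0, -2.0]) - np.array([0.9, -1.9])) ** 2)

loss = term_prior + term_steps + term_recon
```

Summing the per-step KL terms over all time steps (here only one representative step is shown) gives the full variational objective.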
Based on the text generation model obtained through training, a specific text generation task can be executed by using the text generation model. Fig. 4 is a flowchart of a text generation method according to an embodiment of the present application, where the method may be performed by the text generation device in the system shown in fig. 1. As shown in fig. 4, the method may include the steps of:
step 402: input text is obtained.
Step 404: and inputting the input text and random noise into a text generation model, and performing inverse diffusion processing by the text generation model based on the input text and random noise to obtain an output text. The text generation model is obtained by training in advance by adopting a method shown in fig. 2.
The structure of the text generation model trained in advance in the embodiment of the present application is shown in fig. 5, and includes an encoder and a decoder.
The encoder obtains a characteristic representation of the input text.
The decoder performs the back-diffusion processing using the feature representation of the input text and random noise to predict the output text. In the back-diffusion process, the feature representation obtained by the back-diffusion processing of each time step is obtained by sampling from the posterior distribution of the feature representation obtained by the back-diffusion processing of the previous time step; alternatively, it is obtained by sampling from the prior distribution conditioned on the predicted $\hat{z}_0$, where $\hat{z}_0$ is the prediction of the feature representation subjected to the first diffusion processing.
That is, during actual prediction the input and processing of the encoder are unchanged: the encoder still only needs to perform the feed-forward computation of the neural network once and does not need to participate in the back-diffusion process, which may require hundreds of time steps; this can greatly save computing resources.
The input of the decoder is not only the output of the encoder; random noise is also fed into the decoder. The decoder performs the denoising processing step by step conditioned on the feature representation of the input text, obtains the feature representation $z_0$ at the last time step, and then maps $z_0$ to the output text.
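Putting the prediction procedure together as a toy numerical sketch (the encoder and per-step denoising functions are hypothetical stand-ins; a real system would run the trained networks):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 50, 4
alphas = 1.0 - np.linspace(1e-4, 0.02, T)

def encode(input_text):
    """Stand-in encoder: runs exactly once per generation (one feed-forward pass)."""
    return rng.normal(size=(d,))

def denoise_step(z_t, t, cond):
    """Stand-in for one decoder back-diffusion step conditioned on the input text."""
    eps_hat = 0.1 * cond                  # hypothetical noise prediction
    return (z_t - np.sqrt(1 - alphas[t]) * eps_hat) / np.sqrt(alphas[t])

cond = encode("input text")               # encoder output, computed once
z = rng.normal(size=(d,))                 # start back diffusion from pure random noise
for t in reversed(range(T)):              # the decoder iterates over all T time steps
    z = denoise_step(z, t, cond)
# z is now z_0; a final mapping (e.g. to the nearest word embeddings) yields the text
```

Note that only the loop body, not `encode`, runs T times, which is exactly the computational saving the description points out.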
As one of the realizable ways, the text generation method may be executed by a cloud server; that is, the text generation function is integrated in the cloud. A cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system, intended to remedy the large management difficulty and weak service expansibility of traditional physical hosts and Virtual Private Server (VPS) services.
When the user wishes to generate an output text for the input text, the input text may be uploaded to the cloud server through the user terminal.
The above-mentioned user terminal may be, but is not limited to, such as: a cell phone, tablet, notebook, PDA (Personal Digital Assistant ), wearable device, PC (Personal Computer, personal computer), etc.
The cloud server acquires the input text from the user terminal, then performs the back-diffusion processing using the input text and random noise with the text generation model obtained by pre-training to obtain an output text, and returns the output text to the user terminal.
The method provided by the embodiment of the application can be applied to various application scenes, and only a few of the methods are described herein:
application scenario 1: summary generation scenario
In this scenario, when training a text generation model, some articles may be used as input text samples, and summaries of the articles may be used as output text samples, thereby forming sample pairs. For example, some news text may be taken as an input text sample and a summary of the news text as an output text sample. For another example, some papers may be taken as input text samples and summaries of papers may be taken as output text samples. The news text and the abstract thereof, the paper and the abstract thereof and the like are easy to obtain on the network, so that a large number of training samples can be obtained as training data.
Then, obtaining the characteristic representation of the output text sample in the sample pair, and carrying out noise adding diffusion treatment on the characteristic representation of the output text sample to obtain the characteristic representation after noise adding; and in the training process of the text generation model, simulating inverse diffusion processing of noise adding diffusion based on the input text sample and the characteristic representation after noise adding so as to obtain an output text sample. The specific training process may be referred to in the method embodiment for the relevant descriptions of fig. 2 and 3, which are not repeated here.
When the abstract is actually generated, an input text is obtained, the input text and random noise are input into a text generation model which is trained in advance, and the text generation model carries out inverse diffusion processing based on the input text and the random noise to obtain the abstract of the input text.
By the above method, an accurate summary can be automatically generated for the input text; for example, when news texts and paper texts are released online, their summaries can be automatically generated and released together. The text generating device may also be provided to the user as a tool: the user uploads a document as the input text and obtains an automatically generated summary.
Application scenario 2: machine translation scenario
In this scenario, when training the text generation model, some bilingual corpus may be used as a sample pair, where the bilingual corpus includes text in a first language as an input text sample, and text in a second language as an output text sample. For example, some chinese text and corresponding english text may be formed into sample pairs as training samples.
Then, obtaining the characteristic representation of the output text sample in the sample pair, and carrying out noise adding diffusion treatment on the characteristic representation of the output text sample to obtain the characteristic representation after noise adding; and in the training process of the text generation model, simulating inverse diffusion processing of noise adding diffusion based on the input text sample and the characteristic representation after noise adding so as to obtain an output text sample. The specific training process may be referred to in the method embodiment for the relevant descriptions of fig. 2 and 3, which are not repeated here.
When the machine translation is actually carried out, a text adopting a first language is obtained, the text adopting the first language and random noise are input into a text generation model which is obtained by training in advance, and the text generation model carries out back diffusion processing based on the text adopting the first language and the random noise, so that the text adopting a second language is obtained.
In this way, text in the first language can be automatically translated into text in the second language. For example, text may be automatically translated into another language for viewing by users in different countries or regions as the text is published on-line. For another example, the text generating device may be provided to the user as a tool, and the user uploads the document to be translated as the input text, and the tool may be used to obtain text in the specified language obtained by automatic translation.
But also to other application scenarios, not explicitly recited herein.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
According to an embodiment of another aspect, an apparatus for training a text generation model is provided. FIG. 6 illustrates a schematic block diagram of an apparatus for training text generation models, i.e., a model training apparatus in the architecture shown in FIG. 1, according to one embodiment. As shown in fig. 6, the apparatus 600 may include: a sample acquisition unit 601, a noise adding diffusion unit 602, and a model training unit 603. Wherein the main functions of each constituent unit are as follows:
The sample acquiring unit 601 is configured to acquire training data including a plurality of training samples, the training samples including sample pairs of input text samples and output text samples.
And the noise adding and diffusing unit 602 is configured to acquire the feature representation of the output text sample in the sample pair, and perform noise-adding diffusion processing on the feature representation of the output text sample to obtain a noised feature representation.
The model training unit 603 is configured to take the input text sample and the feature representation after noise addition of the sample pair as an input training text generation model, and the text generation model simulates inverse diffusion processing of noise addition diffusion based on the input text sample and the feature representation after noise addition in the training process so as to take the output text sample as a target to be output.
Wherein the text generation model includes an encoder and a decoder. The encoder acquires the characteristic representation of the input text sample of the input text generation model, and the decoder performs inverse diffusion processing by using the characteristic representation of the input text sample and the characteristic representation after noise addition to obtain an output text sample.
When the model training unit 603 trains the text to generate a model, the training targets used include: the difference between the distribution produced by the noisy diffusion process and the distribution produced by the inverse diffusion process is minimized.
Further, when the model training unit 603 trains the text generation model, the training target used may further include: minimizing the difference between the distribution of the characteristic representation obtained by the diffusion treatment of the last time step and the normal distribution; and/or minimizing the difference between the feature representation resulting from the last time step back-diffusion and the feature representation of the output text sample.
As one of the possible manners, the noise adding and diffusing unit 602 may perform word embedding processing on the output text sample by using an embedding network, so as to obtain a feature representation of the output text sample.
As one of the possible ways, the noise adding and diffusing unit 602 may perform encoding processing on the output text sample by using an encoder in the text generation model, so as to obtain a feature representation of the output text sample.
As one of the realizable ways, in the noise-adding diffusion unit 602, in the noise-adding diffusion process, the diffusion process of the first time step adds noise to the feature representation of the output text sample, and the diffusion process of each subsequent time step adds noise to the feature representation obtained by the diffusion process of the last time step, and the feature representation obtained by the diffusion process of each time step conforms to the normal distribution.
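The step-by-step noising just described can be sketched as follows; with all noise draws set to zero, the iterated mean matches the closed-form mean $\sqrt{\bar{\alpha}_T}\,z_0$ of $q(z_T \mid z_0)$ (schedule values and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def diffuse_one_step(z_prev, t, eps):
    """One noise-adding step: z_t = sqrt(1 - beta_t) * z_{t-1} + sqrt(beta_t) * eps,
    i.e. a sample from N(sqrt(1 - beta_t) z_{t-1}, beta_t I)."""
    return np.sqrt(alphas[t]) * z_prev + np.sqrt(betas[t]) * eps

z_0 = np.array([1.0, -1.0, 0.5, 2.0])    # feature representation of the output text sample
z = z_0.copy()
for t in range(T):                        # first step noises z_0, later steps noise z_{t-1}
    z = diffuse_one_step(z, t, rng.normal(size=z_0.shape))

# deterministic check of the closed-form mean: run the chain with zero noise
z_mean = z_0.copy()
for t in range(T):
    z_mean = diffuse_one_step(z_mean, t, np.zeros_like(z_0))
```

Each step's output remains Gaussian given $z_0$, which is why every intermediate representation conforms to a normal distribution as stated above.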
As one of the realizations, in the back-diffusion process, the feature representation obtained by the back-diffusion processing of each time step is obtained by sampling from the posterior distribution of the feature representation obtained by the back-diffusion processing of the previous time step.
As another implementation, in the back-diffusion processing of the text generation model, the feature representation obtained by the back-diffusion processing of each time step is obtained by sampling from the prior distribution conditioned on the predicted $\hat{z}_0$, where $\hat{z}_0$ is the prediction of the feature representation subjected to the first diffusion processing.
According to an embodiment of another aspect, a text generating apparatus is provided. Fig. 7 shows a schematic block diagram of a text generating apparatus according to an embodiment. As shown in fig. 7, the apparatus 700 may include: a text acquisition unit 701 and a text generation unit 702. Wherein the main functions of each constituent unit are as follows:
the text acquisition unit 701 is configured to acquire an input text.
The text generation unit 702 is configured to input the input text and random noise into a text generation model, and acquire an output text obtained by performing a back diffusion process on the text generation model based on the input text and random noise. The text generation model is trained in advance by the model training device shown in fig. 6.
Wherein the text generation model includes an encoder and a decoder. The encoder obtains the feature representation of the input text. The decoder performs the back-diffusion processing using the feature representation of the input text and random noise to predict the output text; in the back-diffusion process, the feature representation obtained by the back-diffusion processing of each time step is obtained by sampling from the posterior distribution of the feature representation obtained by the back-diffusion processing of the previous time step; alternatively, it is obtained by sampling from the prior distribution conditioned on the predicted $\hat{z}_0$, where $\hat{z}_0$ is the prediction of the feature representation subjected to the first diffusion processing.
In this specification, each embodiment is described in a progressive manner; identical and similar parts of the embodiments are mutually referable, and each embodiment mainly describes its differences from the others. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the description of the method embodiments for relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
In addition, the embodiment of the application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method of any one of the foregoing method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the preceding method embodiments.
Fig. 8 illustrates an architecture of an electronic device, which may include, inter alia, a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814, and a memory 820. The processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820 may be communicatively coupled via a communication bus 830.
The processor 810 may be implemented by a general-purpose CPU, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., for executing relevant programs to implement the technical solutions provided herein.
The Memory 820 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. The memory 820 may store an operating system 821 for controlling the operation of the electronic device 800, and a Basic Input Output System (BIOS) 822 for controlling the low-level operation of the electronic device 800. In addition, a web browser 823, a data storage management system 824, a model training device/text generation device 825, and the like may also be stored. The model training device/text generating device 825 may be an application program that specifically implements the operations of the foregoing steps in the embodiments of the present application. In general, when implemented in software or firmware, the relevant program code is stored in memory 820 and executed by processor 810.
The input/output interface 813 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Network interface 814 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 830 includes a path for transferring information between components of the device (e.g., processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820).
It is noted that although the above-described devices illustrate only the processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, memory 820, bus 830, etc., the device may include other components necessary to achieve proper operation in an implementation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the present application, and not all the components shown in the drawings.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer program product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The foregoing has outlined the detailed description of the preferred embodiment of the present application, and the detailed description of the principles and embodiments of the present application has been provided herein by way of example only to facilitate the understanding of the method and core concepts of the present application; also, as will occur to those of ordinary skill in the art, many modifications are possible in view of the teachings of the present application, both in the detailed description and the scope of its applications. In view of the foregoing, this description should not be construed as limiting the application.

Claims (14)

1. A method of training a text generation model, the method comprising:
acquiring training data comprising a plurality of training samples, wherein the training samples comprise sample pairs formed by input text samples and output text samples;
acquiring a characteristic representation of an output text sample in a sample pair, and carrying out noise adding diffusion treatment on the characteristic representation of the output text sample to obtain a noisy characteristic representation;
and taking the input text sample of the sample pair and the characteristic representation after noise addition as input training text generation models, wherein the text generation models simulate inverse diffusion processing of the noise addition diffusion based on the input text sample and the characteristic representation after noise addition in the training process so as to take the output text sample as a target to be output.
2. The method of claim 1, wherein the text generation model comprises an encoder and a decoder;
the encoder acquires a feature representation of an input text sample input into the text generation model, and the decoder performs the inverse diffusion processing by using the feature representation of the input text sample and the noised feature representation to obtain the output text sample;
The training targets include: minimizing the difference between the distribution produced by the noisy diffusion process and the distribution produced by the inverse diffusion process.
3. The method of claim 2, wherein the training goal further comprises: minimizing the difference between the distribution of the characteristic representation obtained by the diffusion treatment of the last time step and the normal distribution; and/or minimizing the difference between the feature representation resulting from the back-diffusion of the last time step and the feature representation of the output text sample.
4. The method of claim 2, wherein obtaining a characteristic representation of the output text sample in the sample pair comprises:
performing word-embedding processing on the output text sample by using an embedding network to obtain the feature representation of the output text sample; or
and encoding the output text sample by using an encoder in the text generation model to obtain the characteristic representation of the output text sample.
5. The method of claim 1, wherein in the noise-added diffusion process, the first time step of diffusion process adds noise to the feature representation of the output text sample, and each subsequent time step of diffusion process adds noise to the feature representation obtained by the previous time step of diffusion process, and the feature representation obtained by the diffusion process of each time step conforms to a normal distribution.
6. The method according to claim 1 or 5, wherein in the back-diffusion process, the feature representation obtained by the back-diffusion processing of each time step is obtained by sampling from the posterior distribution of the feature representation obtained by the back-diffusion processing of the previous time step; alternatively, the feature representation obtained by the back-diffusion processing of each time step is obtained by sampling from the prior distribution conditioned on the predicted $\hat{z}_0$, where $\hat{z}_0$
7. A method of text generation, the method comprising:
acquiring an input text;
inputting the input text and random noise into a text generation model, and performing inverse diffusion processing by the text generation model based on the input text and the random noise to obtain an output text;
wherein the text generation model is pre-trained using the method of any one of claims 1 to 6.
8. The method of claim 7, wherein the text generation model comprises an encoder and a decoder;
the encoder obtains a feature representation of the input text;
the decoder performs the back-diffusion process using the feature representation of the input text and the random noise to predict the output text; wherein in the back-diffusion process, the feature representation obtained at each time step is sampled from the posterior distribution of the feature representation obtained at the previous time step; alternatively, the feature representation obtained at each time step is sampled from the prior distribution of a predicted representation ẑ₀, where ẑ₀ is the prediction of the feature representation before the first noising step of the diffusion process.
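Under standard DDPM assumptions, claim 8's posterior-sampling alternative amounts to predicting ẑ₀ with the decoder and then sampling z_{t-1} from the posterior q(z_{t-1} | z_t, ẑ₀). A minimal sketch of one reverse step (the `predict_z0` callable stands in for the decoder and is a hypothetical placeholder; the coefficient formulas are the usual DDPM posterior, an assumption rather than text from the patent):

```python
import math
import random

def p_sample_step(z_t, t, betas, alpha_bar, predict_z0, rng):
    """One back-diffusion step: predict the clean representation z0_hat from
    z_t, then sample z_{t-1} from the posterior q(z_{t-1} | z_t, z0_hat)."""
    z0_hat = predict_z0(z_t, t)  # decoder's prediction of the clean representation
    a_bar_t = alpha_bar[t]
    a_bar_prev = alpha_bar[t - 1] if t > 0 else 1.0
    beta_t = betas[t]
    # Standard DDPM posterior mean coefficients and variance.
    coef_z0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_zt = math.sqrt(1.0 - beta_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    std = math.sqrt(var) if t > 0 else 0.0
    return [coef_z0 * z0 + coef_zt * zt + std * rng.gauss(0.0, 1.0)
            for z0, zt in zip(z0_hat, z_t)]
```

At t = 0 the posterior collapses onto the prediction itself (the standard deviation becomes zero), so the final step returns ẑ₀ deterministically; the output text is then decoded from that representation.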
9. A digest generation method, the method comprising:
acquiring an input text;
inputting the input text and random noise into a text generation model, and performing inverse diffusion processing by the text generation model based on the input text and the random noise to obtain a summary of the input text;
wherein the text generation model is pre-trained using the method of any one of claims 1 to 6.
10. A machine translation method, the method comprising:
acquiring a text in a first language;
inputting the text in the first language and random noise into a text generation model, and performing inverse diffusion processing by the text generation model based on the text in the first language and the random noise to obtain a text in a second language;
wherein the text generation model is pre-trained using the method of any one of claims 1 to 6.
11. An apparatus for training a text generation model, the apparatus comprising:
a sample acquisition unit configured to acquire training data including a plurality of training samples including a sample pair of an input text sample and an output text sample;
a noise-adding diffusion unit configured to acquire the feature representation of the output text sample in the sample pair, and perform noise-adding diffusion processing on that feature representation to obtain a noise-added feature representation;
and a model training unit configured to take the input text sample of the sample pair and the noise-added feature representation as inputs to train the text generation model, wherein during training the text generation model simulates the back-diffusion of the noise-adding diffusion based on the input text sample and the noise-added feature representation, with the output text sample as the target output.
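The three units of claim 11 can be read as one training step: noise the output sample's representation to a random time step, then train the model to recover it conditioned on the input sample. A minimal sketch under DDPM-style assumptions (the `model` callable, the MSE objective, and all names are illustrative placeholders, not the patent's actual training objective):

```python
import math
import random

def training_step(input_repr, output_repr, alpha_bar, model, rng):
    """One step of the claimed procedure: sample a time step t, noise the
    output text sample's feature representation to step t, and score the
    model's reconstruction of the clean representation with an MSE loss."""
    t = rng.randrange(len(alpha_bar))
    a = alpha_bar[t]
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
           for z in output_repr]
    # The model simulates the back-diffusion: conditioned on the input
    # sample, it predicts the clean output representation from z_t and t.
    z0_hat = model(input_repr, z_t, t)
    return sum((p - z) ** 2 for p, z in zip(z0_hat, output_repr)) / len(output_repr)
```

A model that already outputs the target representation gets zero loss; with a real network, gradient descent on this loss drives the model toward claim 11's target output.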
12. A text generation apparatus, the apparatus comprising:
a text acquisition unit configured to acquire an input text;
a text generation unit configured to input the input text and random noise into a text generation model, and obtain the output text produced by the text generation model through inverse diffusion processing based on the input text and the random noise;
wherein the text generation model is pre-trained by the apparatus of claim 11.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of any of claims 1 to 10.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 10.
CN202310387160.5A 2023-04-11 2023-04-11 Method for training text generation model, text generation method and device Active CN116108157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310387160.5A CN116108157B (en) 2023-04-11 2023-04-11 Method for training text generation model, text generation method and device

Publications (2)

Publication Number Publication Date
CN116108157A true CN116108157A (en) 2023-05-12
CN116108157B CN116108157B (en) 2023-09-12

Family

ID=86264079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310387160.5A Active CN116108157B (en) 2023-04-11 2023-04-11 Method for training text generation model, text generation method and device

Country Status (1)

Country Link
CN (1) CN116108157B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590531A (en) * 2017-08-14 2018-01-16 华南理工大学 A kind of WGAN methods based on text generation
CN114022582A (en) * 2021-09-22 2022-02-08 浙江工业大学 Text image generation method
CN114841142A (en) * 2022-04-22 2022-08-02 北京字跳网络技术有限公司 Text generation method and device, electronic equipment and storage medium
CN115641485A (en) * 2022-11-02 2023-01-24 阿里巴巴(中国)有限公司 Generative model training method and device
CN115641834A (en) * 2022-09-09 2023-01-24 平安科技(深圳)有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN115687565A (en) * 2022-09-23 2023-02-03 阿里巴巴达摩院(杭州)科技有限公司 Text generation method and device
US20230109379A1 (en) * 2021-10-05 2023-04-06 Nvidia Corporation Diffusion-based generative modeling for synthetic data generation systems and applications


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HONGYI YUAN, ZHENG YUAN, CHUANQI TAN, FEI HUANG, SONGFANG HUANG: "SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers", ARXIV, pages 1-9 *
HONGYI YUAN: "SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers", ARXIV, pages 1-9 *
LING YANG: "Diffusion Models: A Comprehensive Survey of Methods and Applications", ARXIV, pages 1-49 *
YIFAN LI, KUN ZHOU, WAYNE XIN ZHAO, JI-RONG WEN: "Diffusion Models for Non-autoregressive Text Generation: A Survey", ARXIV, pages 1-10 *
ZOMI酱: "Diffusion Models: Generative Diffusion Models", pages 1-10, Retrieved from the Internet <URL:https://blog.csdn.net/m0_37046057/article/details/126151446> *
卡卡猡特: "DDPM Explained (Part 1) | Mathematical Foundations, the Diffusion and Reverse Diffusion Processes, and Training/Inference Methods", pages 1-9, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com//p/530602853> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116962657A (en) * 2023-09-21 2023-10-27 中国科学院深圳先进技术研究院 Color video generation method, device, electronic equipment and storage medium
CN116962657B (en) * 2023-09-21 2024-02-27 中国科学院深圳先进技术研究院 Color video generation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116108157B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
US11734375B2 (en) Automatic navigation of interactive web documents
WO2022007823A1 (en) Text data processing method and device
CN112487182A (en) Training method of text processing model, and text processing method and device
WO2018085577A1 (en) Implicit bridging of machine learning tasks
US11397892B2 (en) Method of and system for training machine learning algorithm to generate text summary
CN111738025B (en) Artificial intelligence based translation method and device, electronic equipment and storage medium
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
US11709893B2 (en) Search method, electronic device and storage medium
EP3732629A1 (en) Training sequence generation neural networks using quality scores
CN116108157B (en) Method for training text generation model, text generation method and device
CN110929532B (en) Data processing method, device, equipment and storage medium
CN116050425A (en) Method for establishing pre-training language model, text prediction method and device
CN117173269A (en) Face image generation method and device, electronic equipment and storage medium
CN114792097B (en) Method and device for determining prompt vector of pre-training model and electronic equipment
CN116662496A (en) Information extraction method, and method and device for training question-answering processing model
CN115718830A (en) Method for training information extraction model, information extraction method and corresponding device
CN112464654B (en) Keyword generation method and device, electronic equipment and computer readable medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115270719A (en) Text abstract generating method, training method and device based on multi-mode information
CN110442706B (en) Text abstract generation method, system, equipment and storage medium
CN113919372A (en) Machine translation quality evaluation method, device and storage medium
CN116629336A (en) Method for training generation model, resource generation method and device
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN115982343B (en) Abstract generation method, and method and device for training abstract generation model
CN111988673B (en) Method and related equipment for generating video description sentences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant