CN113435183A - Text generation method, device and storage medium - Google Patents

Text generation method, device and storage medium

Info

Publication number
CN113435183A
CN113435183A (application CN202110745108.3A)
Authority
CN
China
Prior art keywords
text
label
target
reinforcement learning
generation
Prior art date
Legal status
Granted
Application number
CN202110745108.3A
Other languages
Chinese (zh)
Other versions
CN113435183B (en)
Inventor
于凤英 (Yu Fengying)
王健宗 (Wang Jianzong)
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Priority to CN202110745108.3A
Publication of CN113435183A
Application granted
Publication of CN113435183B
Status: Active

Classifications

    • G06F 40/216: Handling natural language data; natural language analysis; parsing using statistical methods
    • G06F 16/353: Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F 40/237: Natural language analysis; lexical tools
    • G06F 40/279: Natural language analysis; recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/02: Computing arrangements based on biological models; neural networks
    • G06N 3/08: Neural networks; learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a text generation method, a text generation apparatus and a storage medium. The text generation method comprises the following steps: acquiring an input text; configuring a first label, first hidden state information and a first pointing parameter for the input text; obtaining first generation probabilities according to the first hidden state information; calculating a saliency coefficient of the input text, and screening out a second label according to the saliency coefficient; iteratively training the reinforcement learning model until the return is maximized, and taking the corresponding feedback as the target feedback; updating the first pointing parameter into a second pointing parameter according to the target feedback; and decoding the second pointing parameter to obtain a target label, and screening out a target generated text. According to the text generation method, the generated text is guided toward a specific target label during reinforcement learning according to the first label of the input text, and text corresponding to the target label is generated in a targeted manner, which ensures the high readability and label consistency of the generated text and makes the generated text more diverse in overall sentence structure.

Description

Text generation method, device and storage medium
Technical Field
The embodiments of the present invention relate to, but are not limited to, the field of artificial intelligence, and in particular to a text generation method, apparatus and storage medium.
Background
Text generation is an important technology in the field of natural language processing. Using given information and a text generation model, a text sequence meeting a specific target can be generated. Text generation models have rich application scenarios, including generative reading comprehension, man-machine dialogue and intelligent writing. However, text generation for a given type often faces the challenge of data-set scarcity. To alleviate this scarcity, data augmentation is usually performed on data sets containing only a small amount of relevant text data; the augmented text data, however, cannot be guaranteed to be readable and relevant, the replacement is usually performed only at the word or phrase level, and the augmented text data therefore lacks diversity.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a text generation method, a text generation apparatus and a computer-readable storage medium, which can guide the generated text toward a specific target label during reinforcement learning, so that the generated text remains highly readable and consistent with its labels.
In a first aspect, an embodiment of the present invention provides a text generation method, including:
acquiring an input text;
configuring a first label, first hidden state information and a first pointing parameter for the input text, wherein the first pointing parameter is used for pointing the input text to the first hidden state information;
according to the first hidden state information and a preset first pre-generated text, carrying out probability prediction on the input text to obtain a plurality of first generation probabilities in one-to-one correspondence with the first pre-generated text;
calculating a saliency coefficient of the input text, and screening the first label according to the saliency coefficient to obtain a second label;
inputting the input text, the first pre-generated text and the first generation probability into the reinforcement learning model for iterative training until the return of the reinforcement learning model is maximized, and obtaining a feedback corresponding to the reinforcement learning model with the maximized return as a target feedback, wherein the state of the reinforcement learning model is the input text and the action of the reinforcement learning model is the first pre-generated text;
updating the first pointing parameter into a second pointing parameter according to the target feedback, wherein the second pointing parameter is used for pointing the second label to a target label;
and decoding the second pointing parameter to obtain the target label, and screening the first pre-generated text according to the target label to obtain a target generated text.
In a second aspect, an embodiment of the present invention further provides a text generating apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the text generation method as described above when executing the computer program.
In a third aspect, an embodiment of the present invention further provides a storage medium, where executable instructions are stored in the storage medium, and when executed by a processor, the executable instructions implement the text generation method described above.
The embodiment of the invention comprises the following steps: acquiring an input text; configuring a first label, first hidden state information and a first pointing parameter for the input text, wherein the first pointing parameter is used for pointing the input text to the first hidden state information; performing probability prediction on the input text according to the first hidden state information and preset first pre-generated texts to obtain a plurality of first generation probabilities in one-to-one correspondence with the first pre-generated texts; calculating a saliency coefficient of the input text, and screening the first label according to the saliency coefficient to obtain a second label; inputting the input text, the first pre-generated texts and the first generation probabilities into the reinforcement learning model for iterative training until the return of the reinforcement learning model is maximized, and taking the feedback corresponding to the reinforcement learning model with the maximized return as the target feedback, wherein the state of the reinforcement learning model is the input text and the action of the reinforcement learning model is the first pre-generated text; updating the first pointing parameter into a second pointing parameter according to the target feedback, wherein the second pointing parameter is used for pointing the second label to the target label; and decoding the second pointing parameter to obtain the target label, and screening the first pre-generated texts according to the target label to obtain a target generated text. According to the text generation method, the generated text is guided toward a specific target label during reinforcement learning according to the first label of the input text, and text corresponding to the target label is generated in a targeted manner, which ensures the high readability and label consistency of the generated text and makes the generated text more diverse in overall sentence structure.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
FIG. 1 is a flow chart of a method of generating text in accordance with an embodiment of the present invention;
FIG. 2 is a detailed flowchart of step S400 in FIG. 1;
FIG. 3 is a flowchart of obtaining the second generation probabilities and the similarity coefficient between the second label and the target label;
FIG. 4 is a detailed flowchart of step S500 in FIG. 1;
FIG. 5 is a detailed flowchart of step S700 in FIG. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in the schematic diagram of the apparatus and a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the block partitioning in the apparatus or the order in the flowchart. The terms first, second and the like in the description, in the claims and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including the stated number.
The invention provides a text generation method, a text generation apparatus and a storage medium, comprising the following steps: acquiring an input text; configuring a first label, first hidden state information and a first pointing parameter for the input text, wherein the first pointing parameter is used for pointing the input text to the first hidden state information; performing probability prediction on the input text according to the first hidden state information and preset first pre-generated texts to obtain a plurality of first generation probabilities in one-to-one correspondence with the first pre-generated texts; calculating a saliency coefficient of the input text, and screening the first label according to the saliency coefficient to obtain a second label; inputting the input text, the first pre-generated texts and the first generation probabilities into the reinforcement learning model for iterative training until the return of the reinforcement learning model is maximized, and taking the feedback corresponding to the reinforcement learning model with the maximized return as the target feedback, wherein the state of the reinforcement learning model is the input text and the action of the reinforcement learning model is the first pre-generated text; updating the first pointing parameter into a second pointing parameter according to the target feedback, wherein the second pointing parameter is used for pointing the second label to the target label; and decoding the second pointing parameter to obtain the target label, and screening the first pre-generated texts according to the target label to obtain a target generated text. According to the text generation method, the generated text is guided toward a specific target label during reinforcement learning according to the first label of the input text, and text corresponding to the target label is generated in a targeted manner, which ensures the high readability and label consistency of the generated text and makes the generated text more diverse in overall sentence structure.
The embodiments of the present invention will be further explained with reference to the drawings.
Referring to fig. 1, fig. 1 is a flowchart of a text generation method.
As shown in fig. 1, a text generation method includes:
and step S100, acquiring an input text.
For step S100, the obtained input text includes one or more sub-texts, and a sub-text may be a character or a word. Generally, in view of the grammatical meaning of a sentence, a word is used as the sub-text, and the input text is divided into a plurality of words. This also facilitates subsequently adding a first label to each word.
The input text is entered into a text generation model, typically deployed in a processing device, via an input device, such as a keyboard, touch screen, microphone, etc.
S200, configuring a first label, first hidden state information and a first pointing parameter for an input text; the first pointing parameter is used to point the input text to the first hidden state information, i.e. a mapping of the input text to the first hidden state information may be constructed by the first pointing parameter.
For step S200, since the input text includes one or more sub-texts, a first tag, first hidden state information, and a first pointing parameter need to be respectively configured for each sub-text, and the first pointing parameter is used to point the sub-text to the first hidden state information.
It should be noted that the first label is usually determined according to the semantics of the sub-text. For example, the first label may include three labels: a positive label, a neutral label and a negative label. A positive label indicates that the sub-text has positive semantics, e.g., likes; a neutral label indicates that the sub-text has neutral semantics, e.g., plain; a negative label indicates that the sub-text has negative semantics, e.g., sadness. Of course, the positive, neutral and negative labels are only examples, and in other embodiments the labels may be set according to actual needs.
The first pointing parameter is denoted by θ, and the first hidden state information is denoted by h_t. The first hidden state information is applied to a Markov decision process.
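As an illustration of steps S100 and S200, the following Python sketch (all names and the toy lexicon are hypothetical; the patent does not prescribe an implementation) splits the input text into word-level sub-texts and attaches a first label, a slot for the hidden state information and the shared pointing parameter to each:

```python
from dataclasses import dataclass, field

# Hypothetical sentiment lexicon standing in for the first-label source;
# the patent only requires that each sub-text receive one of the labels.
LEXICON = {"likes": "positive", "sadness": "negative"}

@dataclass
class SubText:
    word: str
    first_label: str                  # positive / neutral / negative
    hidden_state: list = field(default_factory=list)  # filled by the model in step S300
    pointing_param: str = "theta"     # the shared first pointing parameter

def configure(input_text: str) -> list[SubText]:
    """Steps S100 and S200: tokenize into words and configure each sub-text."""
    return [SubText(w, LEXICON.get(w, "neutral")) for w in input_text.split()]

for st in configure("she likes rainy days"):
    print(st.word, st.first_label)
```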
Step S300, according to the first hidden state information and a preset first pre-generated text, probability prediction is carried out on the input text, and a plurality of first generation probabilities corresponding to the first pre-generated text one by one are obtained.
In the text generation model, text generation is essentially a process of predicting the word that has the highest probability of being appended to the end of the input text sequence.
For the text generation model, a Generative Pre-Training (GPT) model may be used, although other text generation models, such as a pointer-generator network, may be used in other embodiments. In the GPT model, all words form a vocabulary, and each word is assigned an ID value and can thereby be converted into a numeric vector. The input text is encoded according to the vocabulary: each sub-text can be encoded as a one-dimensional vector consisting of 1s and 0s, and the multiple sub-texts of the input text then form a multi-dimensional vector matrix consisting of 1s and 0s.
Since a multi-dimensional vector matrix encoded in this way is usually filled with a large number of 0s, it easily wastes storage space and computation. Therefore, the multi-dimensional vector matrix is passed through an embedding function, which projects the word-meaning information into a smaller space. The numeric vector of each word is fed to the embedding function, with each word's encoding corresponding to one row of the multi-dimensional vector matrix; the embedding function is an embedding weight matrix.
Position information is then encoded into the multi-dimensional vector matrix. A multi-head attention mechanism next predicts, for each output in the sequence, the degree to which each input feature influences that output. The GPT model is also provided with a feedforward module, which is a multi-layer perceptron with a single hidden layer: it multiplies the input by learned weights, adds a learned bias, and applies a linear rectification (ReLU) activation function.
It should be noted that after the multi-head attention mechanism and the feedforward module, the input of each module is added to its output (a residual connection) and the result is normalized. A plurality of first generation probabilities in one-to-one correspondence with the first pre-generated texts is then obtained from the hidden state through a normalized exponential (softmax) function; the first generation probability is written p_θ(x_t | x_{<t}). The first pre-generated texts are all the words in the vocabulary; the number of first generation probabilities equals the number of words in the vocabulary, and they correspond to the vocabulary words one to one. The first generation probability represents the probability that the corresponding first pre-generated text is generated as the prediction from the input text.
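A minimal numerical sketch of this prediction pipeline, with one-hot encoding, an embedding weight matrix and a softmax head; the mean-pooling line is a toy stand-in for the attention and feedforward blocks, and every name and number here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["she", "likes", "rainy", "days", "sunshine"]   # toy vocabulary
V, D = len(VOCAB), 8                                    # vocab size, embedding dim

E = rng.normal(size=(V, D))       # embedding weight matrix (the embedding function)
W_out = rng.normal(size=(D, V))   # head projecting the hidden state back to the vocab

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(V)
    vec[VOCAB.index(word)] = 1.0
    return vec

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()               # normalized exponential function
    return np.exp(z) / np.exp(z).sum()

tokens = ["she", "likes", "rainy"]            # the input text, word by word
X = np.stack([one_hot(t) for t in tokens])    # multi-dimensional 0/1 matrix
H = X @ E                                     # embedding lookup as a matrix product
h_t = H.mean(axis=0)   # toy stand-in for the attention + feedforward hidden state

p_first = softmax(h_t @ W_out)   # one first generation probability per vocab word
print(dict(zip(VOCAB, p_first.round(3))))
```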
S400, calculating a saliency coefficient of the input text, and screening the first label according to the saliency coefficient to obtain a second label;
referring to fig. 2, for step S400, the specific steps are as follows:
step S410, calculating the significance coefficient of each sub text respectively; wherein the significance coefficient is expressed as:
Figure BDA0003142470000000042
in the formula, Sx,cDenotes a saliency coefficient, V denotes the total number of sub-texts, GM denotes a geometric mean, c is a first label, x denotes a sub-text, and K is the total number of first labels.
S420, sorting the significance coefficients according to a descending order, and selecting all sub-texts with the ranks before a preset number value as first sub-texts; the sub texts are sorted from large to small according to the numerical value of the significance coefficient, the first N sub texts in the sorted text sequence are selected as first sub texts, and N is a preset numerical value. The preset quantity value can be set manually according to actual needs. Of course, the significance coefficients may also be sorted in the order from small to large, and all the sub-texts ranked after the preset number value are selected as the first sub-texts.
And step S430, taking the first label corresponding to the first sub-text as a second label. The target label to which the generated text corresponds is actually selected from the second labels.
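Under the saliency formula as reconstructed above (an assumed reading of the patent's equation image), steps S410 to S430 can be sketched as follows, with hypothetical counts:

```python
import numpy as np

WORDS = ["she", "likes", "rainy", "days"]          # sub-texts of the input text
LABELS = ["positive", "neutral", "negative"]       # the K first labels
V = len(WORDS)                                     # total number of sub-texts

# Hypothetical label-conditional counts count(x, c); a real system would
# gather these from a labelled corpus.
counts = np.array([[1, 4, 1],    # she
                   [9, 1, 1],    # likes
                   [2, 3, 2],    # rainy
                   [1, 5, 1]],   # days
                  dtype=float)

rel = counts / V                                          # count(x, c) / V
gm = np.exp(np.log(rel).mean(axis=1, keepdims=True))      # GM over the K labels
S = rel / gm                                              # saliency S_{x,c}

N = 2                                            # the preset number value
strongest = S.max(axis=1)                        # strongest saliency per sub-text
first_subtexts = np.argsort(-strongest)[:N]      # descending rank, take top N
second_labels = {WORDS[i]: LABELS[int(S[i].argmax())] for i in first_subtexts}
print(second_labels)   # first labels of the top-N sub-texts become second labels
```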
Referring to fig. 3, fig. 3 is a flowchart of obtaining the second generation probabilities and the similarity coefficient between the second label and the target label.
In step S400, a saliency coefficient of the input text is calculated, and after the step of obtaining the second label from the first label according to the saliency coefficient, the method further includes the following steps:
step S441, configuring second hidden state information for the second label; likewise, the second hidden state information is applied to a Markov decision process;
step S442, selecting a first pre-generated text with the same first label and second label as a second pre-generated text; first labels identical to the second labels can be obtained firstly, then first pre-generated texts corresponding to the first labels are obtained according to the first labels identical to the second labels, and finally the obtained first pre-generated texts are used as second pre-generated texts;
step S443, performing probability prediction on the second pre-generated text according to the second hidden state information to obtain a plurality of second generation probabilities corresponding to the second pre-generated text one by one; the process of obtaining the second generation probability is the same as the process of obtaining the first generation probability, and is not described in detail here.
In step S443, after the step of performing probability prediction on the second pre-generated text according to the second hidden state information to obtain a plurality of second generation probabilities in one-to-one correspondence with the second pre-generated texts, the method further includes the following steps:

S450, calculating similarity according to the saliency coefficient and the second generation probability to obtain a similarity coefficient between the second label and the target label. The calculation of the similarity coefficient can be expressed as

sim(c, c*) = S_{x,c} · p_{θ_c}(x_t | h_t^c)

where sim(c, c*) represents the similarity coefficient between the second label c and the target label c*, h_t^c is the second hidden state information, and p_{θ_c}(x_t | h_t^c) is the second generation probability.

Step S500, inputting the input text, the first pre-generated text and the first generation probability into the reinforcement learning model for iterative training until the return of the reinforcement learning model is maximized, and obtaining the feedback corresponding to the reinforcement learning model with the maximized return as the target feedback, where the state of the reinforcement learning model is the input text and the action of the reinforcement learning model is the first pre-generated text.
Referring to fig. 4, for step S500, the specific steps are as follows:

Step S510, inputting the input text, the first pre-generated text, the first generation probability, the second generation probability and the similarity coefficient into the reinforcement learning model, where the state of the reinforcement learning model is the input text and the action of the reinforcement learning model is the first pre-generated text. Specifically, the state at time t is the input text, i.e. s_t = x_{<t}, where s_t denotes the state of the reinforcement learning model at time t and x_{<t} denotes the input text at time t; the action at time t is one of the first pre-generated texts, i.e. a_t = x_t, where a_t denotes the action of the reinforcement learning model at time t and x_t denotes a first pre-generated text. Then, at time t, the probability of producing a_t in the state s_t is the first generation probability, which can be expressed as π_θ(a_t | s_t) = p_θ(x_t | x_{<t}), where π_θ(a_t | s_t) denotes the probability of generating the action a_t in the state s_t.

Step S520, calculating the first generation probability, the second generation probability and the similarity coefficient between the second label and the target label according to a value function of the reinforcement learning model to obtain a first sub-feedback, where the first sub-feedback expresses the expectation of future feedback of the current reinforcement learning model. The first sub-feedback can be written as

R̂_t = E_t[ ( π_{θ_c}(a_t | s_t) / π_θ(a_t | s_t) ) · sim(c, c*) ]

where E_t is the value function of the reinforcement learning model, used to calculate the expectation of the future feedback the model can obtain based on the state at time t, or based on the state and the action taken at time t, and π_{θ_c}(a_t | s_t) is the second generation probability.

Step S530, obtaining a relative entropy from the first generation probability and the second generation probability, expressed as

KL(θ || θ_c) = Σ_a π_θ(a | s_t) · log( π_θ(a | s_t) / π_{θ_c}(a | s_t) )

where KL(θ || θ_c) denotes the relative entropy, which represents the difference between the influence of the first pointing parameter on the reinforcement learning model and the influence of the second pointing parameter on the reinforcement learning model.

Step S540, determining a discount value according to the relative entropy, and deducting the discount value from the first sub-feedback to obtain a second sub-feedback, where the discount value represents the negative influence of the relative entropy on the first sub-feedback. The second sub-feedback is expressed as

R_t = R̂_t - β · KL(θ || θ_c)

where R_t represents the second sub-feedback and β represents a weight that makes it possible to dynamically change the transition from the first pointing parameter to the second pointing parameter.

Step S550, iteratively training the reinforcement learning model with the second sub-feedback as the feedback of each iteration until the return of the reinforcement learning model is maximized, where the return is the sum of all the second sub-feedbacks generated by the reinforcement learning model during the iterative training. It should be noted that the reinforcement learning model is a standard Markov decision process. The rationale behind reinforcement learning is that if an agent's behavior strategy results in positive feedback from the environment, the agent's tendency to produce this behavior strategy later is strengthened. The goal of the agent is to dynamically adjust the parameters so as to maximize the return, finding the optimal strategy in each discrete state to maximize the expected sum of feedback. The agent selects an action to apply to the environment; upon receiving the action, the state of the environment changes and a feedback is produced for the agent; the agent then selects the next action according to the feedback and the current state of the environment, the selection principle being to increase the probability of receiving positive feedback. The selected action affects not only the immediate feedback value but also the state of the environment at the next moment and the final feedback value.

Step S560, taking the second sub-feedback corresponding to the reinforcement learning model with the maximized return as the target feedback. When the reinforcement learning model has been iteratively trained until the return is maximized, the model has completed training and reached its target, and the second sub-feedback can be used as the target feedback.
And step S600, updating the first pointing parameter into a second pointing parameter according to the target feedback.
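Before the update itself is detailed, the pipeline that produces its target feedback (steps S442 to S560) can be condensed into a toy numerical sketch; the masked realization of the second generation probabilities and the ratio-weighted reward follow the assumed readings reconstructed above, and all names and numbers are illustrative:

```python
import numpy as np

VOCAB = ["she", "likes", "rainy", "days", "sunshine"]
FIRST_LABEL = ["neutral", "positive", "neutral", "neutral", "positive"]

def second_probs(p_first: np.ndarray, second_label: str) -> np.ndarray:
    """Steps S442-S443 (assumed reading): keep only first pre-generated texts
    whose first label equals the second label, then renormalize to obtain the
    second generation probabilities."""
    mask = np.array([lab == second_label for lab in FIRST_LABEL], dtype=float)
    p = p_first * mask + 1e-9
    return p / p.sum()

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """Step S530: relative entropy KL(theta || theta_c) between the two
    next-word distributions."""
    return float(np.sum(p * np.log(p / q)))

def second_sub_feedback(p_first: np.ndarray, p_second: np.ndarray,
                        sim: np.ndarray, beta: float = 0.02) -> float:
    """Steps S520-S540: ratio-weighted expected similarity (the first
    sub-feedback) minus the KL-based discount value."""
    ratio = p_second / p_first                        # pi_theta_c / pi_theta
    first_fb = float(np.sum(p_first * ratio * sim))   # E_t[ratio * sim]
    return first_fb - beta * kl(p_first, p_second)

p_first = np.array([0.10, 0.40, 0.15, 0.05, 0.30])  # first generation probabilities
sim = np.array([0.1, 0.9, 0.2, 0.1, 0.7])           # toy similarity coefficients
p_second = second_probs(p_first, "positive")
print(second_sub_feedback(p_first, p_second, sim))
# Step S550 would iterate, summing these feedbacks into the return, and
# step S560 keeps the feedback observed at the maximized return.
```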
In step S600, the first pointing parameter is updated using a gradient algorithm according to the target feedback to obtain the second pointing parameter. That is, a mapping of the second label to the target label can be constructed through the second pointing parameter, where the second pointing parameter is denoted by θ_c.
Step S600 can be expressed by the following formula:

θ_c = θ + η · ∇_θ R_t / T

where η represents the learning rate of the reinforcement learning model, T is a temperature used to control random sampling and to scale the gradient term ∇_θ R_t, and ∇_θ denotes the gradient.
In step S600, during the reinforcement learning performed by the reinforcement learning model, the parameters are updated through the feedback obtained from the action of generating the next word in a given text sequence state, so as to optimize the reinforcement learning model and make the generated text carry the character of the target label.
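Under the update formula as reconstructed above, step S600 amounts to a single temperature-scaled gradient step; in practice ∇_θ R_t would come from a policy-gradient estimator, and the numbers below are toy values:

```python
import numpy as np

def update_pointing_parameter(theta: np.ndarray, grad_R: np.ndarray,
                              eta: float = 1e-3, T: float = 1.5) -> np.ndarray:
    """Step S600: move from the first pointing parameter theta to the second
    pointing parameter theta_c, with the temperature T scaling the gradient
    of the target feedback R_t."""
    return theta + eta * grad_R / T

theta = np.zeros(4)                        # toy first pointing parameter
grad_R = np.array([0.5, -0.2, 0.1, 0.0])   # toy gradient of the target feedback
theta_c = update_pointing_parameter(theta, grad_R)
print(theta_c)
```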
And S700, decoding the second pointing parameter to obtain a target label, and obtaining a target generated text from the first pre-generated texts according to the target label.
Referring to fig. 5, specifically, step S700 includes the following sub-steps:
Step S710, decoding the second pointing parameter through the second generation probability to obtain the target label;
s720, selecting a first pre-generated text with the first label being the same as the target label as a third pre-generated text;
step S730, the third pre-generated text corresponding to the maximum value of the first generation probability is used as the target generated text.
Specifically, step S700 may be implemented by an argmax function. That is, for a mapping f: X → Y, where X is the first generation probability and Y is the first pre-generated text, the mapping is from the first generation probability to the first pre-generated text, and under the specified condition that the first label equals the target label, the argmax function outputs the first pre-generated text corresponding to the maximum first generation probability. Taking "the first label equals the target label" as the specified condition defines the range of X for the argmax function, and the first pre-generated texts whose first label equals the target label are taken as the third pre-generated texts. The third pre-generated text corresponding to the maximum first generation probability is taken as the output target generated text.
In fact, the reinforcement learning model sits between the softmax function and the argmax function of the text generation model; it acts as a conditional generator that, during decoding, is guided toward the target label determined by the input text, generating text data conditioned on the target label category so that the text data is similar to the original input text in semantics and readability.
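A sketch of the constrained argmax decoding of step S700 (the label table and probabilities are illustrative):

```python
import numpy as np

VOCAB = ["she", "likes", "rainy", "days", "sunshine"]
FIRST_LABEL = ["neutral", "positive", "neutral", "neutral", "positive"]

def decode_target_text(p_first: np.ndarray, target_label: str) -> str:
    """Step S700: restrict the argmax to first pre-generated texts whose
    first label equals the target label (the third pre-generated texts),
    then take the one with the largest first generation probability."""
    candidates = [i for i, lab in enumerate(FIRST_LABEL) if lab == target_label]
    best = max(candidates, key=lambda i: p_first[i])
    return VOCAB[best]

p_first = np.array([0.10, 0.40, 0.15, 0.05, 0.30])
print(decode_target_text(p_first, "positive"))   # -> likes
```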
It should be noted that, when an original input text is input into the text generation model, a first target generated text is output after computation by the text generation model. The original input text and the first target generated text can then be merged as a new input text and fed into the text generation model, and the computation and output continue to yield a second target generated text. The target label of the second target generated text is obtained from the new input text, i.e. from the labels of all the words of the original input text and of the first target generated text, which ensures label consistency between the target generated text and the new input text. By analogy, different target generated texts can be output in turn from the original input text to form a new long text sequence, making the generated text more diverse.
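This iterative scheme reduces to a simple feedback loop; in the sketch below, `model` stands for any callable implementing steps S100 to S700 (an assumed interface, not named by the patent):

```python
def generate_long_text(model, input_text: str, rounds: int = 3) -> str:
    """Merge each target generated text with the previous input and feed the
    result back in, so successive outputs stay label-consistent with the
    growing sequence."""
    sequence = input_text
    for _ in range(rounds):
        target_text = model(sequence)       # steps S100-S700 on the new input
        sequence = sequence + " " + target_text
    return sequence
```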
In addition, an embodiment of the present invention also provides a text generation apparatus, including: a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor, when executing the computer program, implements the text generation method described above.
The processor and memory may be connected by a bus or other means.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions necessary to implement the text generation method of the above-described embodiment are stored in the memory and, when executed by the processor, perform the text generation method of the above-described embodiment, for example, steps S100 to S700 described above.
The above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, i.e. they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions, which are executed by a processor or a controller, for example, by a processor, and can cause the processor to execute the text generation method in the above embodiment, for example, execute the above-described steps S100 to S700.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as is known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. A text generation method, comprising:
acquiring an input text;
configuring a first label, first hidden state information and a first pointing parameter for the input text, wherein the first pointing parameter is used for pointing the input text to the first hidden state information;
according to the first hidden state information and a preset first pre-generated text, carrying out probability prediction on the input text to obtain a plurality of first generation probabilities in one-to-one correspondence with the first pre-generated text;
calculating a saliency coefficient of the input text, and screening the first label according to the saliency coefficient to obtain a second label;
inputting the input text, the first pre-generated text and the first generation probability into the reinforcement learning model for iterative training until the return of the reinforcement learning model is maximized, and obtaining a feedback corresponding to the reinforcement learning model with the maximized return as a target feedback, wherein the state of the reinforcement learning model is the input text and the action of the reinforcement learning model is the first pre-generated text;
updating the first pointing parameter into a second pointing parameter according to the target feedback, wherein the second pointing parameter is used for pointing the second label to a target label;
and decoding the second pointing parameter to obtain the target label, and screening the first pre-generated text according to the target label to obtain a target generated text.
2. The method of claim 1, wherein the input text comprises a plurality of sub-texts, and configuring a first tag, first hidden state information, and a first pointing parameter for the input text comprises:
and respectively configuring a first label, first hidden state information and a first pointing parameter for each sub-text.
3. The method of claim 2, wherein the calculating a saliency coefficient of the input text and filtering the first label according to the saliency coefficient to obtain a second label comprises:
respectively calculating the saliency coefficient of each sub-text;
sorting the saliency coefficients in descending order, and selecting all the sub-texts ranked before a preset number value as first sub-texts;
and taking the first label corresponding to the first sub-text as the second label.
4. The text generation method according to claim 1, wherein after the step of calculating a saliency coefficient of the input text, and filtering the first label according to the saliency coefficient to obtain a second label, the text generation method further comprises:
configuring second hidden state information for the second tag;
selecting the first pre-generated text with the same first label and the second label as a second pre-generated text;
and performing probability prediction on the second pre-generated text according to the second hidden state information to obtain a plurality of second generation probabilities corresponding to the second pre-generated text one by one.
5. The text generation method according to claim 4, wherein after the step of filtering the second pre-generated text according to the second hidden state information to obtain a plurality of second generation probabilities in one-to-one correspondence with the second pre-generated text, the text generation method further comprises:
and performing similarity calculation according to the saliency coefficient and the second generation probability to obtain a similarity coefficient between the second label and the target label.
6. The method of claim 5, wherein the inputting the input text, the first pre-generated text, and the first generation probability into the reinforcement learning model for iterative training until the return of the reinforcement learning model is maximized, and obtaining a feedback corresponding to the reinforcement learning model with the maximized return as a target feedback, comprises:
inputting the input text, the first pre-generated text, the first generation probability, the second generation probability, and the similarity coefficient to the reinforcement learning model;
calculating the first generation probability, the second generation probability, and a similarity coefficient between the second label and a target label according to a cost function of the reinforcement learning model to obtain a first sub-feedback, wherein the first sub-feedback is used for representing a future feedback expectation of the current reinforcement learning model;
obtaining a relative entropy according to the first generation probability and the second generation probability, wherein the relative entropy represents the difference between the influence of the first directional parameter on the reinforcement learning model and the influence of the second directional parameter on the reinforcement learning model;
determining a discount value according to the relative entropy, and deducting the discount value from the first sub-feedback to obtain the second sub-feedback, wherein the discount value is used for representing the negative influence of the relative entropy on the first sub-feedback;
enabling the reinforcement learning model to perform iterative training by taking the second sub-feedback as feedback of each iteration until the return of the reinforcement learning model is maximized, wherein the return is the sum of all the second sub-feedbacks generated by the reinforcement learning model in the iterative training process;
and taking the second sub-feedback corresponding to the reinforcement learning model with the maximized return as target feedback.
7. The text generation method of claim 1, wherein the updating the first pointing parameter to a second pointing parameter according to the target feedback comprises:
and updating the first pointing parameter by using a gradient algorithm according to the target feedback to obtain the second pointing parameter.
8. The text generation method according to claim 4, wherein the decoding the second pointing parameter to obtain a target label, and filtering the first pre-generated text according to the target label to obtain a target generated text, comprises:
decoding the second pointing parameter through the second generation probability to obtain the target label;
selecting a first pre-generated text with the first label being the same as the target label as a third pre-generated text;
and taking the third pre-generated text corresponding to the maximum value of the first generation probability as the target generation text.
9. A text generation apparatus comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the text generation method according to any of claims 1 to 8 when executing the computer program.
10. A storage medium having stored therein executable instructions which, when executed by a processor, implement a text generation method as claimed in any one of claims 1 to 8.
CN202110745108.3A 2021-06-30 2021-06-30 Text generation method, device and storage medium Active CN113435183B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110745108.3A CN113435183B (en) 2021-06-30 2021-06-30 Text generation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110745108.3A CN113435183B (en) 2021-06-30 2021-06-30 Text generation method, device and storage medium

Publications (2)

Publication Number | Publication Date
CN113435183A (en) | 2021-09-24
CN113435183B (en) | 2023-08-29

Family

ID=77758612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110745108.3A Active CN113435183B (en) 2021-06-30 2021-06-30 Text generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113435183B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415977A * 2018-02-09 2018-08-17 South China University of Technology: Generative machine reading comprehension method based on a deep neural network and reinforcement learning
CN111309912A * 2020-02-24 2020-06-19 Shenzhen Huayun Zhongsheng Technology Co., Ltd.: Text classification method and device, computer equipment and storage medium
CN111858931A * 2020-07-08 2020-10-30 Central China Normal University: Text generation method based on deep learning
CN112052329A * 2020-09-02 2020-12-08 Ping An Technology (Shenzhen) Co., Ltd.: Text abstract generation method and device, computer equipment and readable storage medium
CN112257456A * 2020-10-22 2021-01-22 Ping An Technology (Shenzhen) Co., Ltd.: Text-editing-technology-based training method and device for a text generation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAN GUO ET AL.: "Text Generation with Efficient (Soft) Q-Learning", arXiv:2106.07704v2, pages 1-22 *
JIANG LI ET AL.: "Automatic generation of prose poetry based on recursive neural networks", Computer Systems & Applications, vol. 27, no. 8, pages 259-264 *

Also Published As

Publication number Publication date
CN113435183B (en) 2023-08-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant