CN110738026B - Method and device for generating description text - Google Patents

Method and device for generating description text

Info

Publication number
CN110738026B
CN110738026B (application CN201911012473.2A)
Authority
CN
China
Prior art keywords
text
sequence
keyword
style
hidden layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911012473.2A
Other languages
Chinese (zh)
Other versions
CN110738026A (en)
Inventor
闭玮
刘晓江
冯骁骋
孙亚威
秦兵
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Tencent Technology Shenzhen Co Ltd
Original Assignee
Harbin Institute of Technology
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, Tencent Technology Shenzhen Co Ltd filed Critical Harbin Institute of Technology
Priority to CN201911012473.2A priority Critical patent/CN110738026B/en
Publication of CN110738026A publication Critical patent/CN110738026A/en
Application granted granted Critical
Publication of CN110738026B publication Critical patent/CN110738026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Abstract

A method and apparatus for generating descriptive text are described herein. The method comprises the following steps: inputting a keyword sequence and a reference text having a predetermined style into a trained neural network, wherein the neural network comprises a keyword encoder, a text encoder, a mutual attention encoder and a decoder; encoding the keyword sequence with the keyword encoder to obtain a hidden state sequence of the keyword sequence; encoding the reference text with the text encoder to obtain a hidden state sequence of the reference text; encoding the hidden state sequence of the keyword sequence and the hidden state sequence of the reference text with the mutual attention encoder to obtain a hidden state sequence of the keyword sequence fused with the predetermined style; and decoding the fused hidden state sequence with the decoder to output descriptive text having the predetermined style.

Description

Method and device for generating description text
Technical Field
The present disclosure relates to the technical field of natural language processing, and in particular, to a method and apparatus for generating description text.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e., the language people use every day, and is thus closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
In recent years, intelligent writing technology has developed considerably, and intelligent writing using neural networks in particular has advanced rapidly. Intelligent writing generally refers to using a neural network to generate, given a keyword sequence comprising one or more keywords, a piece of descriptive text related to that sequence. For example, given several keywords that describe an appearance, a neural network is used to generate a passage describing the appearance from the provided words. However, the sentence patterns and style of the text generated by currently used neural networks are fixed and monotonous, and cannot meet users' daily writing or creative needs.
Disclosure of Invention
In view of the above, the present disclosure provides methods and apparatus for generating descriptive text that desirably overcome some or all of the above-referenced deficiencies and possibly others.
According to a first aspect of the present disclosure, there is provided a method for generating descriptive text, comprising: inputting a keyword sequence and a reference text having a predetermined style into a trained neural network, wherein the neural network comprises a keyword encoder, a text encoder, a mutual attention encoder and a decoder; encoding the keyword sequence with the keyword encoder to obtain a hidden state sequence of the keyword sequence; encoding the reference text with the text encoder to obtain a hidden state sequence of the reference text; encoding the hidden state sequence of the keyword sequence and the hidden state sequence of the reference text with the mutual attention encoder to obtain a hidden state sequence of the keyword sequence fused with the predetermined style; and decoding the fused hidden state sequence with the decoder to output descriptive text having the predetermined style.
In some embodiments, the keyword sequence includes a plurality of keywords that describe the same topic.
In some embodiments, the theme includes appearance, action, mind and environment, among others.
In some embodiments, the decoder has a joint attention mechanism, and decoding the fused hidden state sequence with the decoder further comprises: at each current decoding time, calculating, with the joint attention mechanism, a first attention weight of the decoder's hidden state at the previous decoding time over the hidden state sequence of the reference text and a second attention weight over the hidden state sequence fused with the predetermined-style keyword sequence; determining the hidden state at the current decoding time based on the first attention weight, the second attention weight, and the decoder's hidden state and decoded word at the previous decoding time; and decoding, with the decoder, the fused hidden state sequence based on the hidden state at the current decoding time.
In some embodiments, decoding the fused hidden state sequence with the decoder based on the hidden state at the current decoding time comprises: determining, based on the hidden state at the current decoding time, a copy probability of selecting a word from the keyword sequence and a generation probability of selecting a word from a word-selection vocabulary; calculating a selection probability for each word in the keyword sequence based on the copy probability, and a selection probability for each word in the word-selection vocabulary based on the generation probability; and selecting the word with the maximum selection probability from the keyword sequence and the word-selection vocabulary as the word output by the decoder at the current time.
In some embodiments, the trained neural network is trained by training steps comprising: acquiring a text for training and a corresponding keyword sequence for training; acquiring a reference text having a predetermined style and a corresponding keyword sequence having the predetermined style; inputting the keyword sequence for training and the reference text having the predetermined style into the neural network to obtain a first descriptive text, and calculating a first inconsistency loss between the first descriptive text and the text for training; inputting the keyword sequence having the predetermined style and the reference text having the predetermined style into the neural network to obtain a second descriptive text, and calculating a second inconsistency loss between the second descriptive text and the reference text having the predetermined style; and back-propagating the gradient of a total loss to update the parameters of the neural network, wherein the total loss is the sum of the first inconsistency loss and the second inconsistency loss.
In some embodiments, the trained neural network is trained by training steps comprising: acquiring a text for training and a corresponding keyword sequence for training; acquiring a reference text having a predetermined style and a corresponding keyword sequence having the predetermined style; inputting the keyword sequence for training and the reference text having the predetermined style into the neural network to obtain a first descriptive text, and calculating a first inconsistency loss between the first descriptive text and the text for training; inputting the keyword sequence having the predetermined style and the reference text having the predetermined style into the neural network to obtain a second descriptive text, and calculating a second inconsistency loss between the second descriptive text and the reference text having the predetermined style; inputting the keyword sequence having the predetermined style and the first descriptive text, taken as the reference text, into the neural network to obtain a third descriptive text, and calculating a third inconsistency loss between the third descriptive text and the reference text having the predetermined style; and back-propagating the gradient of the sum of the first, second and third inconsistency losses, taken as a total loss, to update the parameters of the neural network.
In some embodiments, obtaining text for training and a sequence of keywords corresponding thereto for training comprises: for each text in a corpus which is used as a training set, determining the text as a text for training and performing word segmentation processing on the text; and extracting the keyword sequence for training from the text after word segmentation processing.
In some embodiments, extracting the keyword sequence for training from the word-segmented text includes: extracting text keywords from the word-segmented text based on term frequency-inverse document frequency; and, according to a keyword vocabulary of the same topic, extracting from the text keywords a keyword sequence included in that vocabulary as the keyword sequence for training.
In some embodiments, any one or more of the keyword encoder, text encoder, and decoder may be a recurrent neural network or a convolutional neural network.
According to a second aspect of the present disclosure, an apparatus for generating a descriptive text, comprises: an input module and a neural network. The input module is configured to input a sequence of keywords and a reference text having a predetermined style into the trained neural network; the neural network includes: a keyword encoding module configured to encode the keyword sequence to obtain a hidden state sequence of the keyword sequence; a text encoding module configured to encode the reference text to obtain a hidden state sequence of the reference text; the mutual attention coding module is configured to code the hidden layer state sequence of the keyword sequence and the hidden layer state sequence of the reference text to obtain a hidden layer state sequence fused with a keyword sequence of a preset style; the decoding module is configured to decode the hidden layer state sequence fused with the keyword sequence of the preset style so as to output the descriptive text with the preset style.
In some embodiments, the decoding module has a joint attention mechanism and the decoding module is configured to: at each current decoding moment, calculating a first attention weight of a hidden layer state of a decoding module at the last decoding moment to a hidden layer state sequence of a reference text and a second attention weight of the hidden layer state sequence fused with a keyword sequence of a preset style by using a joint attention mechanism; determining the hidden layer state of the current decoding moment based on the first attention weight, the second attention weight, the hidden layer state of the decoding module at the previous decoding moment and the decoded words; and decoding the hidden layer state sequence fused with the keyword sequence of the preset style by using the decoding module based on the hidden layer state at the current decoding moment.
In some embodiments, the apparatus further comprises a first acquisition module, a second acquisition module, and an update module. The first acquisition module is configured to acquire a text for training and a corresponding keyword sequence for training, and the second acquisition module is configured to acquire a reference text having a predetermined style and a corresponding keyword sequence having the predetermined style. In this case, the input module is configured to: input the keyword sequence for training and the reference text having the predetermined style into the neural network to obtain a first descriptive text, and calculate a first inconsistency loss between the first descriptive text and the text for training; input the keyword sequence having the predetermined style and the reference text having the predetermined style into the neural network to obtain a second descriptive text, and calculate a second inconsistency loss between the second descriptive text and the reference text having the predetermined style; and input the keyword sequence having the predetermined style and the first descriptive text, taken as the reference text, into the neural network to obtain a third descriptive text, and calculate a third inconsistency loss between the third descriptive text and the reference text having the predetermined style. The update module is configured to back-propagate the gradient of the sum of the first, second and third inconsistency losses as a total loss to update the parameters of the neural network.
According to a third aspect of the present disclosure, there is provided a computing device comprising a processor; and a memory configured to have computer-executable instructions stored thereon that, when executed by the processor, perform any of the methods described above.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
By the method for generating descriptive text claimed by the present disclosure, a novel scheme for generating descriptive text having a desired style is provided. In this approach, by using a mutual attention encoder and optionally a decoder with a joint attention mechanism, the input keyword sequences can better fuse the desired text style, thereby enabling the generated description text to better embody the text style.
These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates a schematic diagram for generating description text according to one embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of a method for generating descriptive text in accordance with one embodiment of the present disclosure;
FIG. 3 illustrates an exemplary model schematic of a mutual attention encoder in accordance with one embodiment of the present disclosure;
FIG. 4 illustrates a flow diagram of a method for decoding a hidden-state sequence fused with a predetermined-style keyword sequence using a decoder with a joint attention mechanism according to one embodiment of the present disclosure;
fig. 5 illustrates a flow diagram of a method for training the neural network described above with reference to fig. 1 and 2, according to one embodiment of the present disclosure;
FIG. 6 illustrates a flow diagram of another method for training the neural network described above with reference to FIGS. 1 and 2, according to one embodiment of the present disclosure;
FIG. 7 illustrates an exemplary block diagram of a neural network used in one embodiment according to the present disclosure;
FIG. 8 illustrates an exemplary block diagram of an apparatus for generating descriptive text in accordance with one embodiment of the present disclosure; and
fig. 9 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details for a thorough understanding and enabling description of various embodiments of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these details. In some instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The terminology used in the present disclosure is to be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms referred to in the embodiments of the present application are explained so that those skilled in the art can understand them:
Neural network: a deep learning model, in the fields of machine learning and cognitive science, that simulates the structure and function of biological neural networks;
Recurrent neural network: a network model that converts sequence modeling into temporal modeling; it recurses in the direction of sequence evolution, its nodes (recurrent units) are chained together, and it usually takes sequence data as input;
Convolutional neural network: a feedforward neural network with a deep structure that includes convolution computations, whose artificial neurons can respond to surrounding units within part of their coverage;
Deep learning: a branch of machine learning; algorithms that attempt high-level abstraction of data using multiple processing layers that contain complex structures or consist of multiple non-linear transformations;
Attention mechanism: a method of modeling the dependency between the hidden states of an encoder and a decoder in a neural network;
Mutual attention mechanism: a variant of the attention mechanism; a bidirectional attention;
Back translation: translating the translated information back into the original text in the reverse direction;
LSTM network: a Long Short-Term Memory network, a recurrent neural network commonly used for processing long-sequence data;
BLEU: Bilingual Evaluation Understudy, a standard method for machine translation evaluation; the higher the value, the better the performance;
TF-IDF: term frequency-inverse document frequency, a common weighting technique for information retrieval and data mining that processes information according to term frequency and an inverse document frequency coefficient.
Fig. 1 illustrates a schematic diagram for generating description text according to one embodiment of the present disclosure. As shown in fig. 1, a keyword sequence X and a reference text Y' having a predetermined style are input to a trained neural network 101. The neural network 101 comprises a keyword encoder 102, a text encoder 103, a mutual attention encoder 104 and a decoder 105. The keyword encoder 102 receives the keyword sequence X and encodes it to obtain a hidden state sequence R of the keyword sequence. The text encoder 103 receives the reference text Y' having the predetermined style and encodes it to obtain a hidden state sequence W of the reference text. The mutual attention encoder 104 encodes the hidden state sequence R of the keyword sequence and the hidden state sequence W of the reference text to obtain a hidden state sequence F fused with a keyword sequence of a predetermined style. Finally, the decoder 105 decodes the hidden state sequence F fused with the predetermined style of keyword sequence to output the descriptive text Z having the predetermined style.
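Before each component is discussed in turn, a compact sketch may help fix the data flow. The following Python code (PyTorch assumed; all class names, dimensions and the simplified one-sided fusion are this sketch's assumptions, and the decoder 105 is sketched separately below) mirrors the pipeline of FIG. 1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class StyledTextGenerator(nn.Module):
    """Minimal sketch of the FIG. 1 pipeline. All dimensions, names and the
    exact fusion recipe are illustrative assumptions, not the disclosure's."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        enc_out = 2 * hid_dim  # bidirectional LSTMs emit 2 * hid_dim per step
        self.keyword_encoder = nn.LSTM(emb_dim, hid_dim, bidirectional=True,
                                       batch_first=True)   # encoder 102
        self.text_encoder = nn.LSTM(emb_dim, hid_dim, bidirectional=True,
                                    batch_first=True)      # encoder 103
        self.fusion_encoder = nn.LSTM(2 * enc_out, hid_dim,
                                      batch_first=True)    # part of encoder 104

    def encode(self, keywords, reference):
        R, _ = self.keyword_encoder(self.embed(keywords))   # hidden states of X
        W, _ = self.text_encoder(self.embed(reference))     # hidden states of Y'
        # Simplified mutual attention: each keyword attends over the
        # reference text and the attended style context is spliced onto it.
        alpha = Fn.softmax(torch.bmm(R, W.transpose(1, 2)), dim=-1)
        style_ctx = torch.bmm(alpha, W)
        S = torch.cat([R, style_ctx], dim=-1)
        F, _ = self.fusion_encoder(S)   # sequence fused with the style
        return R, W, F
```

Here R, W and F correspond to the like-named sequences of FIG. 1; as noted later with reference to FIG. 7, R may additionally be used to initialize the decoder.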
In some embodiments, the decoder 105 may have a joint attention mechanism 106. At this time, at the t-th time (t is a positive integer greater than or equal to 2) when the decoder decodes the hidden layer state sequence F fused with the predetermined style of keyword sequence, the joint attention mechanism 106 calculates a first attention weight of the hidden layer state of the decoder at the last decoding time (i.e., t-1 time) to the hidden layer state sequence W of the reference text and a second attention weight of the hidden layer state sequence F fused with the predetermined style of keyword sequence, respectively; and determining the hidden layer state at the current decoding time t based on the first attention weight, the second attention weight and the hidden layer state of the decoder and the output of the decoder at the last decoding time. Then, the decoder 105 decodes the hidden state sequence F in which the predetermined style of keyword sequence is fused, based on the hidden state at the current decoding time. It should be noted that at the first (i.e., t = 1) instant of decoding, the hidden state of the decoder at its last decoding instant is typically set to the last hidden state in the sequence of hidden states of the encoder and the output of the decoder may be set to the symbol "BOS" or "0", although this is not limiting.
With the method according to embodiments of the present disclosure, descriptive text having a desired predetermined style can be generated from an input keyword sequence. For example, by inputting a keyword sequence related to describing appearance together with a reference text having a "positive" style, a passage describing appearance in a positive style can be obtained, and that passage generally includes the input keywords. The neural network may likewise be trained to generate descriptive text in a plurality of predetermined styles, so that a user may generate descriptive text having a selected one of those styles by inputting a keyword sequence and a reference text having the selected style.
Fig. 2 illustrates a flow diagram of a method 200 for generating descriptive text in accordance with one embodiment of the present disclosure. As shown in fig. 2, the method 200 includes the following steps.
In step 201, a sequence of keywords and a reference text having a predetermined style are input into a trained neural network. The neural network may include a keyword encoder, a text encoder, a mutual attention encoder, and a decoder. In some embodiments, the decoder may have a joint attention mechanism. The keyword sequence includes one or more keywords. As an example, the keyword sequence may include a plurality of keywords for describing the same topic. The theme may be, for example, appearance, action, mind, and environment, among others. The predetermined style may be, for example, an active style, a passive style, an lovely style, a fresh style, a naive style, and the like, without limitation.
In step 202, the keyword sequence is encoded by the keyword encoder to obtain a hidden state sequence of the keyword sequence. The keyword encoder may be a recurrent neural network, a convolutional neural network, or the like, and the hidden state sequence of the keyword sequence is the sequence of hidden states of the keyword encoder. For example, the keyword encoder may be a bidirectional LSTM (long short-term memory) network, although this is not limiting.
In step 203, the reference text is encoded by a text encoder to obtain a hidden state sequence of the reference text. The text encoder may be a recurrent neural network, a convolutional neural network, or the like, and the hidden state sequence of the reference text is a sequence of hidden states of the text encoder. For example, the text encoder may be an LSTM network, although this is not limiting.
In step 204, the hidden state sequence of the keyword sequence and the hidden state sequence of the reference text are encoded by using a mutual attention encoder to obtain a hidden state sequence fused with a predetermined style of keyword sequence.
FIG. 3 illustrates an exemplary model schematic of a mutual attention encoder, according to an embodiment of the disclosure. The mutual attention encoder encodes the hidden state sequence R of the keyword sequence and the hidden state sequence W of the reference text to obtain a hidden state sequence F fused with a keyword sequence of a predetermined style. In FIG. 3, α denotes the weights obtained after attention is calculated between each word in the reference text and the keyword sequence, β denotes the weights obtained after attention is calculated between each keyword in the keyword sequence and the reference text, and W₁ and W₂ are trainable parameters of the model. As shown in FIG. 3, based on the weights α and β, after a series of product and splicing operations are performed on the hidden state sequence R of the keyword sequence and the hidden state sequence W of the reference text, a mutual attention spliced sequence S is obtained. Finally, the mutual attention spliced sequence is encoded, for example with an LSTM encoder, to obtain the hidden state sequence F fused with the keyword sequence of the predetermined style. It should be noted that the LSTM encoder used here is not limiting and other types of encoders are contemplated.
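One plausible reading of FIG. 3 in code form is sketched below (PyTorch assumed; the bilinear scoring with trainable W₁, W₂ and the exact splicing recipe are this sketch's assumptions, since the drawing itself is not reproduced here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class MutualAttentionEncoder(nn.Module):
    """Sketch of the FIG. 3 mutual attention encoder; the splicing recipe
    below is one plausible reading, not the drawing itself."""

    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)  # trainable parameter W1
        self.W2 = nn.Linear(dim, dim, bias=False)  # trainable parameter W2
        self.lstm = nn.LSTM(4 * dim, dim, batch_first=True)

    def forward(self, R, W):
        # alpha: attention of each reference-text word over the keyword sequence.
        alpha = Fn.softmax(torch.bmm(self.W1(W), R.transpose(1, 2)), dim=-1)
        # beta: attention of each keyword over the reference text.
        beta = Fn.softmax(torch.bmm(self.W2(R), W.transpose(1, 2)), dim=-1)
        ref_ctx = torch.bmm(beta, W)   # style context per keyword
        kw_ctx = torch.bmm(alpha, R)   # keyword context per reference word
        pooled = kw_ctx.mean(dim=1, keepdim=True).expand(-1, R.size(1), -1)
        # "Product and splicing": elementwise product plus concatenation,
        # then re-encode the spliced sequence S with an LSTM to obtain F.
        S = torch.cat([R, ref_ctx, R * ref_ctx, pooled], dim=-1)
        F_seq, _ = self.lstm(S)
        return F_seq
```

The LSTM in the last stage can be swapped for any other sequence encoder, consistent with the note above.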
In step 205, the decoder is used to decode the hidden layer state sequence fused with the predetermined style of keyword sequence to output the descriptive text with the predetermined style. The decoder may likewise be a recurrent neural network, a convolutional neural network, or the like. For example, the decoder may be an LSTM network, although this is not limiting.
As described with reference to fig. 1, in some embodiments the decoder may have a joint attention mechanism. In this case, more attention can be paid to the style of the reference text during decoding, so that the style of the output descriptive text is closer to what is desired. Fig. 4 illustrates a flow diagram of a method 400 of decoding a hidden state sequence fused with a predetermined style of keyword sequence using a decoder with a joint attention mechanism, according to an embodiment of the present disclosure; the method 400 may be used for step 205 of the method 200 described above.
In step 401, at each current decoding time, a first attention weight of a hidden layer state of a decoder at the previous decoding time to a hidden layer state sequence of a reference text and a second attention weight of the hidden layer state sequence fused with a predetermined style of keyword sequence are respectively calculated. The first attention weight and the second attention weight may be calculated using various methods of calculating attention weights known in the art, which are not limiting.
In step 402, a hidden layer state at the current decoding time is determined based on the first attention weight, the second attention weight, and the hidden layer state of the decoder and the decoded word at the previous decoding time. As an example, the first attention weight, the second attention weight, and the hidden layer state of the decoder and the decoded word at the previous decoding time may be used as inputs of the multi-layer perceptron to obtain an output of the multi-layer perceptron, i.e. the hidden layer state at the current decoding time.
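Steps 401 and 402 can be made concrete with the following sketch (PyTorch assumed; the bilinear score functions and the tanh multi-layer perceptron are this sketch's choices, not necessarily the disclosure's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class JointAttentionStep(nn.Module):
    """One decoding step of method 400 (hypothetical layer choices)."""

    def __init__(self, dim, emb_dim):
        super().__init__()
        self.att_w = nn.Linear(dim, dim, bias=False)  # scores against W (reference text)
        self.att_f = nn.Linear(dim, dim, bias=False)  # scores against F (fused sequence)
        self.mlp = nn.Sequential(nn.Linear(3 * dim + emb_dim, dim), nn.Tanh())

    def forward(self, s_prev, y_prev_emb, W, F):
        # At t = 1, s_prev is typically the encoder's last hidden state and
        # y_prev_emb the embedding of a "BOS" symbol (see the FIG. 1 discussion).
        # Step 401: attention weights of the previous hidden state over W and F.
        a1 = Fn.softmax(torch.bmm(self.att_w(s_prev).unsqueeze(1),
                                  W.transpose(1, 2)), dim=-1)  # first attention weight
        a2 = Fn.softmax(torch.bmm(self.att_f(s_prev).unsqueeze(1),
                                  F.transpose(1, 2)), dim=-1)  # second attention weight
        ctx_w = torch.bmm(a1, W).squeeze(1)  # style context from the reference text
        ctx_f = torch.bmm(a2, F).squeeze(1)  # content context from the fused sequence
        # Step 402: new hidden state from both contexts, the previous hidden
        # state, and the previously decoded word, via a multi-layer perceptron.
        return self.mlp(torch.cat([ctx_w, ctx_f, s_prev, y_prev_emb], dim=-1))
```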
In step 403, based on the hidden layer state at the current decoding time, decoding the hidden layer state sequence fused with the keyword sequence of the predetermined style by using the decoder.
Specifically, the copy probability of selecting a word from the keyword sequence and the generation probability of selecting a word from the word selection vocabulary may be determined based on the hidden layer state at the current decoding time, then the selection probability of each word in the keyword sequence is calculated based on the copy probability and the selection probability of each word in the word selection vocabulary is calculated based on the generation probability, and finally the word with the maximum selection probability is selected from the keyword sequence and the word selection vocabulary as the word output by the decoder at the current time. The word selection list may be, for example, a preset fixed list or a list obtained according to training set statistics.
As an example, the hidden state at the current decoding time may be mapped through a linear layer and an activation function to obtain a value, namely the copy probability p_c, and the generation probability can then be determined as 1 - p_c. The hidden state at the current decoding time is then mapped onto the keyword sequence through a linear function and an activation function to obtain the probabilities p_1, p_2, ..., p_T (where T is the number of keywords) that each keyword in the keyword sequence is related to the decoder output at the current time; multiplying these probabilities by p_c gives the probability of each keyword being selected at the current time. Likewise, the hidden state at the current decoding time is mapped onto each word in the word-selection vocabulary through a linear function and an activation function to obtain the probabilities q_1, q_2, ..., q_M (where M is the number of words in the vocabulary) that each word in the vocabulary is related to the decoder output at the current time; multiplying these probabilities by 1 - p_c gives the selection probability of each word in the vocabulary at the current time. Finally, the word with the maximum selection probability over the keyword sequence and the vocabulary is selected as the word output by the decoder at the current time.
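This word-selection procedure is essentially a copy/pointer-style mechanism. A minimal sketch follows (PyTorch assumed; all layer shapes and names are assumptions of this sketch rather than the disclosure's):

```python
import torch
import torch.nn as nn

class CopyOrGenerate(nn.Module):
    """Select the decoder's output word from the keyword sequence (copy)
    or the word-selection vocabulary (generate). Shapes are assumptions."""

    def __init__(self, dim, vocab_size):
        super().__init__()
        self.copy_gate = nn.Linear(dim, 1)          # linear layer + activation -> p_c
        self.kw_proj = nn.Linear(dim, dim)          # maps s_t onto the keyword states
        self.gen_proj = nn.Linear(dim, vocab_size)  # maps s_t onto the vocabulary

    def forward(self, s_t, keyword_states):
        p_c = torch.sigmoid(self.copy_gate(s_t))    # copy probability, shape (B, 1)
        # p_1 .. p_T: relevance of each keyword to the current output, times p_c.
        kw_scores = torch.bmm(keyword_states,
                              self.kw_proj(s_t).unsqueeze(-1)).squeeze(-1)
        p_kw = torch.softmax(kw_scores, dim=-1) * p_c
        # q_1 .. q_M: relevance of each vocabulary word, times (1 - p_c).
        p_vocab = torch.softmax(self.gen_proj(s_t), dim=-1) * (1 - p_c)
        # Output whichever candidate word has the larger selection probability.
        best_kw_p, best_kw_idx = p_kw.max(dim=-1)
        best_v_p, best_v_idx = p_vocab.max(dim=-1)
        copy_chosen = best_kw_p > best_v_p
        return copy_chosen, best_kw_idx, best_v_idx
```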
Fig. 5 illustrates a flow diagram of a method 500 for training the neural network described above with reference to fig. 1 and 2, in accordance with an embodiment of the present disclosure. The method may comprise the following steps 501-505.
In step 501, a text for training Y1 and a corresponding keyword sequence for training X1 are obtained. In some embodiments, the text for training and its corresponding keyword sequence for training may be obtained from a corpus serving as a training set, the corpus including a plurality of texts. By way of example, each text in the corpus may be taken as a text for training and subjected to word segmentation, and the keyword sequence for training may then be extracted from the word-segmented text. Various word segmentation techniques may be used. For example, when the text is an English text, word segmentation can be completed simply on the basis of spaces and punctuation marks; when the text is a Chinese text, word segmentation may be performed based on matching and statistical methods. Text keywords may then be extracted from the word-segmented text based on TF-IDF. Further, based on a keyword vocabulary having the same topic, a keyword sequence included in that vocabulary may be extracted from the text keywords as the keyword sequence for training, so that keywords related to the same topic (e.g., appearance) are obtained.
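As an illustration of this extraction step only (plain Python; the scoring function, the top_k cutoff and the topic-vocabulary filter are assumptions of this sketch):

```python
import math
from collections import Counter

def tfidf_keywords(tokenized_corpus, doc_index, topic_vocab, top_k=10):
    """Rank one document's tokens by TF-IDF, then keep only tokens that
    appear in a same-topic keyword vocabulary (e.g., appearance words)."""
    doc = tokenized_corpus[doc_index]
    tf = Counter(doc)
    n_docs = len(tokenized_corpus)

    def idf(word):
        df = sum(1 for d in tokenized_corpus if word in d)
        return math.log(n_docs / (1 + df))

    ranked = sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)
    return [w for w in ranked[:top_k] if w in topic_vocab]

# Hypothetical usage:
# X1 = tfidf_keywords(corpus, i, appearance_vocab)
```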
In step 502, a reference text Y2 with a predetermined style and a keyword sequence X2 with the predetermined style corresponding to the reference text Y2 are obtained. The reference text Y2 having the predetermined style and the keyword sequence X2 corresponding thereto having the predetermined style may be, for example, crawled from web text, and may even be edited by the user himself, which is not limitative.
In step 503, a keyword sequence X1 for training and a reference text Y2 with a predetermined style are input into the neural network to obtain a first descriptive text Z1, and a first inconsistency Loss1 of the first descriptive text Z1 and the text Y1 for training is calculated. Here, as can be understood from the above, the keyword encoder encodes the keyword sequence for training to obtain the hidden state sequence for the keyword sequence, the text encoder encodes the reference text with a predetermined style to obtain the hidden state sequence for the reference text, and then encodes the hidden state sequence for the keyword sequence and the hidden state sequence for the reference text through the mutual attention encoder, so that the decoder decodes and outputs the first description text. The main role of this training step is to allow the neural network to learn the ability to select content, i.e. to select more words or content from the text used for training as the decoder generates the text.
In step 504, the keyword sequence X2 with the predetermined style and the reference text Y2 with the predetermined style are input into the neural network to obtain a second descriptive text Z2, and a second inconsistency Loss2 of the second descriptive text Z2 and the reference text Y2 with the predetermined style is calculated. The main role of this training step is to allow the neural network to better learn the writing style in which the text is to be generated.
At step 505, the gradient of the total loss L_total is back-propagated to update the parameters of the neural network, where the total loss is the sum of the first and second inconsistency losses, i.e., L_total = Loss1 + Loss2.
Fig. 6 illustrates a flow diagram of a method 600 for training the neural network described above with reference to figs. 1 and 2, in accordance with an embodiment of the present disclosure. As shown in FIG. 6, steps 601 to 604 of the method 600 are the same as steps 501 to 504 of the method 500, respectively, and are not described again here; only steps 605 and 606 are described.
In step 605, the keyword sequence X2 having the predetermined style and the first descriptive text Z1, taken as the reference text, are input into the neural network to obtain a third descriptive text Z3, and a third inconsistency loss Loss3 between the third descriptive text Z3 and the reference text Y2 having the predetermined style is calculated. The main function of this training step is to enable the neural network to learn the ability of back translation and to ensure that style words are not lost during text generation, so that the finally generated descriptive text achieves the desired effect.
At step 606, the gradient of the total loss L_total is back-propagated to update the parameters of the neural network, where the total loss L_total is the sum of the first, second and third inconsistency losses, i.e., L_total = Loss1 + Loss2 + Loss3.
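One way to realize this three-loss objective in code (PyTorch assumed; the model(keywords, reference=...) interface, teacher forcing, and the cross-entropy choice are this sketch's assumptions):

```python
import torch

def training_step(model, optimizer, ce_loss, batch):
    """Sketch of method 600: three forward passes, one parameter update.
    Assumes model(keywords, reference=...) returns logits of shape (B, T, V)
    aligned with the padded targets (teacher forcing)."""
    X1, Y1 = batch["train_keywords"], batch["train_text"]      # step 601
    X2, Y2 = batch["style_keywords"], batch["style_text"]      # step 602

    z1_logits = model(X1, reference=Y2)                        # step 603
    loss1 = ce_loss(z1_logits.transpose(1, 2), Y1)             # content selection

    z2_logits = model(X2, reference=Y2)                        # step 604
    loss2 = ce_loss(z2_logits.transpose(1, 2), Y2)             # style learning

    z1_tokens = z1_logits.argmax(dim=-1).detach()              # first descriptive text Z1
    z3_logits = model(X2, reference=z1_tokens)                 # step 605: back translation
    loss3 = ce_loss(z3_logits.transpose(1, 2), Y2)

    total = loss1 + loss2 + loss3                              # step 606
    optimizer.zero_grad()
    total.backward()                                           # gradient back-pass
    optimizer.step()
    return total.item()
```

For the two-loss variant of method 500, loss3 is simply dropped from the total.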
Fig. 7 illustrates an exemplary block diagram of a neural network 101 used in embodiments according to the present disclosure. As shown in fig. 7, the neural network 101 includes a keyword encoder 102, a text encoder 103, a mutual attention encoder 104, and a decoder 105, wherein the decoder 105 has a joint attention mechanism 106. As shown in FIG. 7, the keyword encoder 102 employs a bidirectional LSTM network and is used to encode the keyword sequence to obtain the hidden state sequence of the keyword sequence. The text encoder 103 employs a bidirectional LSTM network and is used to encode the reference text to obtain the hidden state sequence of the reference text. The mutual attention encoder 104 is the mutual attention encoder described with reference to fig. 3 and is used to encode the hidden state sequence of the keyword sequence and the hidden state sequence of the reference text to obtain a hidden state sequence fused with a keyword sequence of a predetermined style. The decoder 105 likewise employs an LSTM network and is used to decode the fused hidden state sequence to output descriptive text having the predetermined style. The decoder 105 is shown with a joint attention mechanism 106, whose working principle is described with reference to fig. 4.
As shown in FIG. 7, a keyword sequence consisting of the three keywords "cloak, animated, charming" and a reference text with a "lovely" style ("Wow! She is simply a super beauty! As lovely as she is beautiful, I simply cannot control my eyes!") are input into the neural network 101, and the neural network outputs, via the decoder 105, a descriptive text with the "lovely" style: "Wow! She wears a super beautiful cloak! So charming, everyone is looking at her sexy, charming appeal!".
It should be noted that in fig. 7, the hidden state sequence R of the keyword sequence output by the keyword encoder is fed into the decoder for initializing the decoder, so that the decoder obtains the overall information of the keyword sequence, which is beneficial to the accuracy of decoding, although this is not restrictive.
The inventors have tested the technical solutions of the embodiments of the present disclosure and evaluated the generated results according to commonly used metrics; the evaluation results are shown in the following table:
[Table: evaluation of the disclosed scheme against a common baseline model and an unsupervised style migration model in terms of accuracy, perplexity, content BLEU, and style BLEU.]
Here, accuracy describes how accurately the text is generated (higher is better); perplexity evaluates the quality of the descriptive text generated by the model (lower is better); content BLEU is a machine translation metric (higher is better); and style BLEU is a style migration metric (higher is better). It can be seen that the technical solution of the present disclosure generates descriptive text more effectively than the common baseline model and the unsupervised style migration model.
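For the two BLEU metrics, a minimal reading in code (NLTK's corpus_bleu assumed; treating content BLEU and style BLEU as ordinary BLEU against two different reference sets is this sketch's interpretation, not a statement of the exact evaluation protocol):

```python
from nltk.translate.bleu_score import corpus_bleu

def content_and_style_bleu(generated, content_refs, style_refs):
    """generated: list of token lists; *_refs: one list of reference
    token lists per generated sentence."""
    content = corpus_bleu(content_refs, generated)  # overlap with source content
    style = corpus_bleu(style_refs, generated)      # overlap with the style corpus
    return content, style
```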
Fig. 8 illustrates an exemplary block diagram of an apparatus 800 for generating descriptive text according to one embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 includes an input module 801 and a neural network 802. The neural network 802 includes a keyword encoding module 803, a text encoding module 804, a mutual attention encoding module 805, and a decoding module 806.
The keyword encoding module 803 is configured to encode the keyword sequence to obtain a hidden state sequence of keyword sequences. The text encoding module 804 is configured to encode the reference text to obtain a hidden state sequence of the reference text. The mutual attention coding module 805 is configured to code the hidden state sequence of the keyword sequence and the hidden state sequence of the reference text to obtain a hidden state sequence fused with a predetermined style of keyword sequence. The decoding module 806 is configured to decode the hidden-layer state sequence fused with the predetermined style of keyword sequence to output the descriptive text having the predetermined style.
In some embodiments, the decode module 806 may have a joint attention mechanism, and the decode module 806 is configured to: at each current decoding moment, calculating a first attention weight of a hidden layer state of a decoding module at the last decoding moment to a hidden layer state sequence of a reference text and a second attention weight of the hidden layer state sequence fused with a keyword sequence of a preset style by using a joint attention mechanism; determining the hidden layer state of the current decoding moment based on the first attention weight, the second attention weight, the hidden layer state of the decoding module at the previous decoding moment and the decoded words; and decoding the hidden layer state sequence fused with the keyword sequence of the preset style by using the decoding module based on the hidden layer state at the current decoding moment.
In some embodiments, the decode module 806 is configured to: determining the copy probability of selecting words from the keyword sequence and the generation probability of selecting words from the word selection word list based on the hidden layer state at the current decoding moment; calculating the selection probability of each word in the keyword sequence based on the copy probability and calculating the selection probability of each word in the word selection word list based on the generation probability; and selecting the word with the maximum selected probability from the keyword sequence and the word selection list as the word output by the decoder at the current moment.
In some embodiments, the apparatus 800 may further include a first obtaining module 807, a second obtaining module 808, and an updating module 809. The first obtaining module 807 is configured to obtain a text for training and a keyword sequence for training corresponding thereto. The second obtaining module 808 is configured to obtain a reference text having a predetermined style and a keyword sequence corresponding thereto having the predetermined style. In this case, the input module 801 may be configured such that the input module is configured to: inputting a keyword sequence for training and a reference text with a predetermined style into the neural network to obtain a first description text, and calculating a first inconsistency loss of the first description text and the text for training; inputting the keyword sequence with the preset style and the reference text with the preset style into the neural network to obtain a second descriptive text, and calculating a second inconsistency loss of the second descriptive text and the reference text with the preset style; and inputting the keyword sequence with the preset style and the first description text as reference texts into the neural network to obtain third description texts, and calculating third inconsistency losses of the third description texts and the reference texts with the preset style. The update module 809 is configured to perform a gradient pass back of the sum of the first inconsistency loss, the second inconsistency loss, the third inconsistency loss as a total loss to update the parameters of the neural network.
Fig. 9 illustrates an example system 900 that includes an example computing device 910 that represents one or more systems and/or devices that can implement the various techniques described herein. The computing device 910 may be, for example, a server of a service provider, a device associated with a server, a system on a chip, and/or any other suitable computing device or computing system. The device 800 for generating descriptive text described above with respect to fig. 8 may take the form of a computing device 910. Alternatively, the apparatus 800 for generating description text may be implemented as a computer program in the form of a text generation application 916.
The example computing device 910 as illustrated includes a processing system 911, one or more computer-readable media 912, and one or more I/O interfaces 913 communicatively coupled to each other. Although not shown, the computing device 910 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 911 represents functionality to perform one or more operations using hardware. Accordingly, the processing system 911 is illustrated as including hardware elements 914 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware element 914 is not limited by the material from which it is formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 912 is illustrated as including a memory/storage 915. Memory/storage 915 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 915 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 915 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 912 may be configured in various other ways as further described below.
One or more I/O interfaces 913 represent functionality that allows a user to enter commands and information to computing device 910, and optionally also allows information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that may not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Thus, the computing device 910 may be configured in various ways to support user interaction, as described further below.
The computing device 910 also includes a text generation application 916. The text generation application 916 may be, for example, a software instance of the device 800 described in fig. 8 for generating descriptive text, and in combination with other elements in the computing device 910 implement the techniques described herein.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 910. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or an article of manufacture suitable for storing the desired information and accessible by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to hardware of computing device 910, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, hardware element 914 and computer-readable medium 912 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that, in some embodiments, may be used to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 914. The computing device 910 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, a module that is executable by the computing device 910 as software may be implemented at least partially in hardware, for example, using computer-readable storage media and/or hardware elements 914 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 910 and/or processing system 911) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 910 may assume a variety of different configurations. For example, the computing device 910 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 910 may also be implemented as a mobile device-like device including mobile devices such as mobile telephones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. The computing device 910 may also be implemented as a television-like device that includes or is connected to a device having a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of the computing device 910 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on "cloud" 920 through the use of a distributed system, such as through platform 922 as described below.
Cloud 920 includes and/or is representative of a platform 922 for resources 924. The platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 920. The resources 924 may include applications and/or data that may be used when executing computer processes on servers remote from the computing device 910. The resources 924 may also include services provided over the internet and/or over a subscriber network such as a cellular or Wi-Fi network.
The platform 922 may abstract resources and functionality to connect the computing device 910 with other computing devices. The platform 922 may also serve to abstract scaling of resources to provide a corresponding level of scale to encountered demand for the resources 924 implemented via the platform 922. Thus, in interconnected device embodiments, implementation of functions described herein may be distributed throughout the system 900. For example, the functionality may be implemented in part on the computing device 910 and by the platform 922 that abstracts the functionality of the cloud 920.
It should be understood that embodiments of the disclosure have been described with reference to different functional blocks for clarity. However, it will be apparent that the functionality of each functional module may be implemented in a single module, in multiple modules, or as part of other functional modules without departing from the disclosure. For example, functionality illustrated to be performed by a single module may be performed by multiple different modules. Thus, references to specific functional blocks are only to be seen as references to suitable blocks for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single module or may be physically and functionally distributed between different modules and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component or section from another device, element, component or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the indefinite article "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (13)

1. A method for generating descriptive text, comprising:
inputting a keyword sequence and a reference text having a predetermined style into a trained neural network, wherein the neural network comprises a keyword encoder, a text encoder, a mutual attention encoder and a decoder;
encoding the keyword sequence with the keyword encoder to obtain a hidden state sequence of the keyword sequence;
encoding the reference text with the text encoder to obtain a hidden state sequence of the reference text;
encoding the hidden state sequence of the keyword sequence and the hidden state sequence of the reference text with the mutual attention encoder to obtain a hidden state sequence of the keyword sequence fused with the predetermined style;
decoding, with the decoder, the hidden state sequence of the keyword sequence fused with the predetermined style to output a description text having the predetermined style;
wherein the decoder has a joint attention mechanism, and wherein decoding, with the decoder, the hidden state sequence of the keyword sequence fused with the predetermined style further comprises, at each current decoding time step:
calculating, with the joint attention mechanism, a first attention weight of the decoder's hidden state at the previous decoding time step over the hidden state sequence of the reference text, and a second attention weight over the hidden state sequence of the keyword sequence fused with the predetermined style;
determining the hidden state at the current decoding time step based on the first attention weight, the second attention weight, the hidden state of the decoder at the previous decoding time step, and the previously decoded word;
decoding, with the decoder, the hidden state sequence of the keyword sequence fused with the predetermined style based on the hidden state at the current decoding time step;
wherein the hidden state sequence of the keyword sequence is used to initialize the decoder.
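By way of illustration only, the architecture recited in claim 1 can be sketched in PyTorch as below. This is a minimal sketch, not the patent's implementation: the module names, dimensions, GRU cells, dot-product attention, and mean-pooled decoder initialization are all assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_emb, d_hid, vocab = 128, 256, 10000

class Encoder(nn.Module):
    # Bidirectional GRU encoder, usable as either the keyword encoder or the text encoder.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.GRU(d_emb, d_hid // 2, bidirectional=True, batch_first=True)

    def forward(self, ids):                        # ids: (B, T)
        h, _ = self.rnn(self.emb(ids))             # hidden state sequence: (B, T, d_hid)
        return h

class MutualAttention(nn.Module):
    # Fuses the keyword hidden states with the reference-text (style) hidden states.
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2 * d_hid, d_hid)

    def forward(self, h_kw, h_ref):
        scores = torch.bmm(h_kw, h_ref.transpose(1, 2))     # (B, Tk, Tr)
        ctx = torch.bmm(F.softmax(scores, dim=-1), h_ref)   # style context per keyword state
        return torch.tanh(self.proj(torch.cat([h_kw, ctx], dim=-1)))

class JointAttnDecoderCell(nn.Module):
    # One decoding time step with joint attention over both state sequences.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.cell = nn.GRUCell(d_emb + 2 * d_hid, d_hid)
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, y_prev, s_prev, h_ref, h_fused):
        # first attention weight: previous decoder state over the reference-text states
        a1 = F.softmax(torch.bmm(h_ref, s_prev.unsqueeze(2)).squeeze(2), dim=-1)
        c1 = torch.bmm(a1.unsqueeze(1), h_ref).squeeze(1)
        # second attention weight: previous decoder state over the fused states
        a2 = F.softmax(torch.bmm(h_fused, s_prev.unsqueeze(2)).squeeze(2), dim=-1)
        c2 = torch.bmm(a2.unsqueeze(1), h_fused).squeeze(1)
        # new hidden state from both contexts plus the previously decoded word
        s = self.cell(torch.cat([self.emb(y_prev), c1, c2], dim=-1), s_prev)
        return self.out(s), s

# Usage: the decoder state is initialized from the keyword hidden state sequence,
# here (as an assumption of the sketch) by mean-pooling it.
kw_enc, ref_enc, fuse, dec = Encoder(), Encoder(), MutualAttention(), JointAttnDecoderCell()
h_kw = kw_enc(torch.randint(0, vocab, (2, 5)))
h_ref = ref_enc(torch.randint(0, vocab, (2, 30)))
h_fused = fuse(h_kw, h_ref)
s = h_kw.mean(dim=1)
logits, s = dec(torch.zeros(2, dtype=torch.long), s, h_ref, h_fused)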
2. The method of claim 1, wherein the sequence of keywords comprises a plurality of keywords that describe the same topic.
3. The method of claim 2, wherein the topics include appearance, action, psychology and environment.
4. The method of claim 1, wherein decoding, with the decoder, the hidden state sequence of the keyword sequence fused with the predetermined style based on the hidden state at the current decoding time step comprises:
determining, based on the hidden state at the current decoding time step, a copy probability of selecting a word from the keyword sequence and a generation probability of selecting a word from a vocabulary;
calculating a selection probability for each word in the keyword sequence based on the copy probability, and a selection probability for each word in the vocabulary based on the generation probability;
and selecting, from the keyword sequence and the vocabulary, the word with the highest selection probability as the word output by the decoder at the current time step.
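The selection step of claim 4 resembles a pointer-generator mechanism. The sketch below is an assumed formulation only: the sigmoid gate producing the copy probability, the shared output distribution, and the premise that keyword tokens have vocabulary ids are illustrative choices, not details recited in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

def select_word(s_t, kw_ids, kw_attn, gen_logits, gate):
    # s_t: (B, d) hidden state at the current decoding time step
    # kw_ids: (B, Tk) vocabulary ids of the keyword-sequence tokens
    # kw_attn: (B, Tk) attention weights over the keyword sequence
    # gen_logits: (B, V) scores over the vocabulary
    p_copy = torch.sigmoid(gate(s_t)).squeeze(-1)          # copy probability
    p_gen = 1.0 - p_copy                                   # generation probability
    dist = p_gen.unsqueeze(1) * F.softmax(gen_logits, dim=-1)
    # add the copy probability mass onto the vocabulary slots of the keyword tokens
    dist = dist.scatter_add(1, kw_ids, p_copy.unsqueeze(1) * kw_attn)
    return dist.argmax(dim=-1)   # word with the highest selection probability

# Example wiring with the sketch's assumed dimensions:
gate = nn.Linear(256, 1)                       # copy/generate gate over the decoder state
s_t = torch.randn(2, 256)
kw_ids = torch.randint(0, 10000, (2, 5))
kw_attn = F.softmax(torch.randn(2, 5), dim=-1)
gen_logits = torch.randn(2, 10000)
next_word = select_word(s_t, kw_ids, kw_attn, gen_logits, gate)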
5. The method of claim 1, wherein the trained neural network is trained by training steps comprising:
obtaining a text for training and a corresponding keyword sequence for training;
obtaining a reference text having the predetermined style and a corresponding keyword sequence with the predetermined style;
inputting the keyword sequence for training and the reference text having the predetermined style into the neural network to obtain a first description text, and calculating a first inconsistency loss between the first description text and the text for training;
inputting the keyword sequence with the predetermined style and the reference text having the predetermined style into the neural network to obtain a second description text, and calculating a second inconsistency loss between the second description text and the reference text having the predetermined style;
and backpropagating the gradient of a total loss to update the parameters of the neural network, wherein the total loss is the sum of the first inconsistency loss and the second inconsistency loss.
6. The method of claim 1, wherein the trained neural network is trained by training steps comprising:
obtaining a text for training and a corresponding keyword sequence for training;
obtaining a reference text having the predetermined style and a corresponding keyword sequence with the predetermined style;
inputting the keyword sequence for training and the reference text having the predetermined style into the neural network to obtain a first description text, and calculating a first inconsistency loss between the first description text and the text for training;
inputting the keyword sequence with the predetermined style and the reference text having the predetermined style into the neural network to obtain a second description text, and calculating a second inconsistency loss between the second description text and the reference text having the predetermined style;
inputting the keyword sequence with the predetermined style and, as the reference text, the first description text into the neural network to obtain a third description text, and calculating a third inconsistency loss between the third description text and the reference text having the predetermined style; and
backpropagating the gradient of the sum of the first, second and third inconsistency losses as a total loss to update the parameters of the neural network.
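For illustration only, a training step in the shape of claim 6 might look as follows. The `model` here is assumed to return per-step vocabulary logits together with the greedily decoded token sequence, and token-level cross-entropy stands in for the unspecified inconsistency loss; both are assumptions of this sketch, not the patent's stated loss. Omitting the third pass and its loss yields the two-loss training of claim 5.

import torch.nn.functional as F

def train_step(model, optimizer, kw_train, text_train, kw_style, ref_style):
    # first pass: training keywords + styled reference -> first description text
    logits1, tokens1 = model(kw_train, ref_style)
    loss1 = F.cross_entropy(logits1.transpose(1, 2), text_train)
    # second pass: styled keywords + styled reference -> second description text
    logits2, _ = model(kw_style, ref_style)
    loss2 = F.cross_entropy(logits2.transpose(1, 2), ref_style)
    # third pass: styled keywords + first description text used as the reference text
    logits3, _ = model(kw_style, tokens1.detach())
    loss3 = F.cross_entropy(logits3.transpose(1, 2), ref_style)
    total = loss1 + loss2 + loss3      # total loss is the sum of the three losses
    optimizer.zero_grad()
    total.backward()                   # backpropagate the gradient of the total loss
    optimizer.step()
    return total.item()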
7. The method of claim 5 or 6, wherein obtaining the text for training and the corresponding keyword sequence for training comprises:
for each text in a corpus serving as the training set, taking the text as a text for training and performing word segmentation on it;
and extracting the keyword sequence for training from the segmented text.
8. The method of claim 7, wherein extracting the keyword sequence for training from the segmented text comprises:
extracting text keywords from the segmented text based on term frequency-inverse document frequency (TF-IDF);
and, according to a keyword vocabulary of the same topic, extracting from the text keywords those keywords contained in that vocabulary as the keyword sequence for training.
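The extraction of claims 7 and 8 can be sketched with scikit-learn's TfidfVectorizer, as below. The toy topic lexicon, the top-k cutoff, and the whitespace-joined segmented input are all assumptions of this sketch, not values from the patent.

from sklearn.feature_extraction.text import TfidfVectorizer

topic_lexicon = {"tall", "slender", "graceful"}   # hypothetical same-topic keyword vocabulary

def extract_training_keywords(segmented_texts, top_k=5):
    # segmented_texts: word-segmented training texts, tokens joined by spaces
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(segmented_texts)    # (n_docs, n_terms) TF-IDF matrix
    terms = vec.get_feature_names_out()
    sequences = []
    for row in tfidf.toarray():
        # rank the document's terms by descending TF-IDF and keep the top candidates
        ranked = [terms[i] for i in row.argsort()[::-1] if row[i] > 0][:top_k]
        # keep only candidates contained in the same-topic keyword vocabulary
        sequences.append([w for w in ranked if w in topic_lexicon])
    return sequences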
9. The method of claim 1, wherein any one or more of the keyword encoder, the text encoder and the decoder may be a recurrent neural network or a convolutional neural network.
10. An apparatus for generating descriptive text, comprising:
an input module configured to input a sequence of keywords and a reference text having a predetermined style into a trained neural network;
a neural network, comprising:
a keyword encoding module configured to encode the keyword sequence to obtain a hidden state sequence of the keyword sequence;
a text encoding module configured to encode the reference text to obtain a hidden state sequence of the reference text;
a mutual attention encoding module configured to encode the hidden state sequence of the keyword sequence and the hidden state sequence of the reference text to obtain a hidden state sequence of the keyword sequence fused with the predetermined style; and
a decoding module configured to decode the hidden state sequence of the keyword sequence fused with the predetermined style to output a description text having the predetermined style;
wherein the decoding module has a joint attention mechanism and is configured to, at each current decoding time step:
calculate, with the joint attention mechanism, a first attention weight of the decoding module's hidden state at the previous decoding time step over the hidden state sequence of the reference text, and a second attention weight over the hidden state sequence of the keyword sequence fused with the predetermined style;
determine the hidden state at the current decoding time step based on the first attention weight, the second attention weight, the hidden state of the decoding module at the previous decoding time step, and the previously decoded word;
and decode the hidden state sequence of the keyword sequence fused with the predetermined style based on the hidden state at the current decoding time step.
11. The apparatus of claim 10, further comprising:
a first obtaining module configured to obtain a text for training and a corresponding keyword sequence for training;
a second obtaining module configured to obtain a reference text having the predetermined style and a corresponding keyword sequence with the predetermined style; and
an update module;
wherein the input module is configured to:
input the keyword sequence for training and the reference text having the predetermined style into the neural network to obtain a first description text, and calculate a first inconsistency loss between the first description text and the text for training;
input the keyword sequence with the predetermined style and the reference text having the predetermined style into the neural network to obtain a second description text, and calculate a second inconsistency loss between the second description text and the reference text having the predetermined style;
input the keyword sequence with the predetermined style and, as the reference text, the first description text into the neural network to obtain a third description text, and calculate a third inconsistency loss between the third description text and the reference text having the predetermined style;
and wherein the update module is configured to backpropagate the gradient of the sum of the first, second and third inconsistency losses as a total loss to update the parameters of the neural network.
12. A computing device, comprising:
a memory configured to store computer-executable instructions; and
a processor configured to perform the method of any one of claims 1-9 when the computer-executable instructions are executed by the processor.
13. A computer-readable storage medium storing computer-executable instructions that, when executed, perform the method of any one of claims 1-9.
CN201911012473.2A 2019-10-23 2019-10-23 Method and device for generating description text Active CN110738026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911012473.2A CN110738026B (en) 2019-10-23 2019-10-23 Method and device for generating description text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911012473.2A CN110738026B (en) 2019-10-23 2019-10-23 Method and device for generating description text

Publications (2)

Publication Number Publication Date
CN110738026A CN110738026A (en) 2020-01-31
CN110738026B true CN110738026B (en) 2022-04-19

Family

ID=69270973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911012473.2A Active CN110738026B (en) 2019-10-23 2019-10-23 Method and device for generating description text

Country Status (1)

Country Link
CN (1) CN110738026B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325002A (en) * 2020-02-17 2020-06-23 广东博智林机器人有限公司 Text generation method and device, electronic equipment and storage medium
CN111414733B (en) * 2020-03-18 2022-08-19 联想(北京)有限公司 Data processing method and device and electronic equipment
CN111402367B (en) * 2020-03-27 2023-09-26 维沃移动通信有限公司 Image processing method and electronic equipment
CN111507726B (en) * 2020-04-07 2022-06-24 支付宝(杭州)信息技术有限公司 Message generation method, device and equipment
CN113672737A (en) * 2020-05-13 2021-11-19 复旦大学 Knowledge graph entity concept description generation system
CN111709229A (en) * 2020-06-16 2020-09-25 平安科技(深圳)有限公司 Text generation method and device based on artificial intelligence, computer equipment and medium
CN111767744B (en) * 2020-07-06 2024-02-09 北京猿力未来科技有限公司 Training method and device of text style migration system
CN112084841B (en) * 2020-07-27 2023-08-04 齐鲁工业大学 Cross-mode image multi-style subtitle generating method and system
CN112104919B (en) * 2020-09-11 2022-05-06 腾讯科技(深圳)有限公司 Content title generation method, device, equipment and computer readable storage medium based on neural network
CN113157910A (en) * 2021-04-28 2021-07-23 北京小米移动软件有限公司 Commodity description text generation method and device and storage medium
CN113627162A (en) * 2021-06-30 2021-11-09 北京海纳数聚科技有限公司 Character beautifying method based on text style migration technology
CN114357989B (en) * 2022-01-10 2023-09-26 北京百度网讯科技有限公司 Video title generation method and device, electronic equipment and storage medium
CN116152711B (en) * 2022-08-25 2024-03-22 北京凯利时科技有限公司 Multi-mode-based broadcasting guiding method and system and computer program product
CN115879469B (en) * 2022-12-30 2023-10-03 北京百度网讯科技有限公司 Text data processing method, model training method, device and medium
CN117786092A (en) * 2024-02-27 2024-03-29 成都晓多科技有限公司 Commodity comment key phrase extraction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679225A (en) * 2017-10-20 2018-02-09 哈尔滨工业大学 A kind of reply generation method based on keyword
CN108563622A (en) * 2018-05-04 2018-09-21 清华大学 A kind of poem of four lines generation method and device with style varied
CN108733742A (en) * 2017-04-13 2018-11-02 百度(美国)有限责任公司 Global normalization's reader system and method
CN109086408A (en) * 2018-08-02 2018-12-25 腾讯科技(深圳)有限公司 Document creation method, device, electronic equipment and computer-readable medium
CN109344391A (en) * 2018-08-23 2019-02-15 昆明理工大学 Multiple features fusion Chinese newsletter archive abstraction generating method neural network based

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474709B2 (en) * 2017-04-14 2019-11-12 Salesforce.Com, Inc. Deep reinforced model for abstractive summarization
CN108427771B (en) * 2018-04-09 2020-11-10 腾讯科技(深圳)有限公司 Abstract text generation method and device and computer equipment
CN109063174B (en) * 2018-08-21 2022-06-07 腾讯科技(深圳)有限公司 Query answer generation method and device, computer storage medium and electronic equipment
CN110059307B (en) * 2019-04-15 2021-05-14 百度在线网络技术(北京)有限公司 Writing method, device and server
CN110288973B (en) * 2019-05-20 2024-03-29 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN110264991B (en) * 2019-05-20 2023-12-22 平安科技(深圳)有限公司 Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733742A (en) * 2017-04-13 2018-11-02 百度(美国)有限责任公司 Global normalization's reader system and method
CN107679225A (en) * 2017-10-20 2018-02-09 哈尔滨工业大学 A kind of reply generation method based on keyword
CN108563622A (en) * 2018-05-04 2018-09-21 清华大学 A kind of poem of four lines generation method and device with style varied
CN109086408A (en) * 2018-08-02 2018-12-25 腾讯科技(深圳)有限公司 Document creation method, device, electronic equipment and computer-readable medium
CN109344391A (en) * 2018-08-23 2019-02-15 昆明理工大学 Multiple features fusion Chinese newsletter archive abstraction generating method neural network based

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Language Style Transfer from Sentences with Arbitrary Unknown Styles; Yanpeng Zhao; arXiv:1808.04071v1; 2018-08-13; pp. 1-12 *
Research on Keyword-Semantics-Controlled Text Generation Algorithms; Li Zuochao; China Master's Theses Full-text Database, Information Science and Technology; 2019-09-15 (No. 9); pp. I138-1414 *

Also Published As

Publication number Publication date
CN110738026A (en) 2020-01-31

Similar Documents

Publication Publication Date Title
CN110738026B (en) Method and device for generating description text
US11423233B2 (en) On-device projection neural networks for natural language understanding
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
Li et al. Imbalanced text sentiment classification using universal and domain-specific knowledge
Cao et al. A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt
Alvarez-Melis et al. A causal framework for explaining the predictions of black-box sequence-to-sequence models
Goyal et al. Deep learning for natural language processing
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN110678881A (en) Natural language processing using context-specific word vectors
JP2023539532A (en) Text classification model training method, text classification method, device, equipment, storage medium and computer program
CN108062388A (en) Interactive reply generation method and device
US11010664B2 (en) Augmenting neural networks with hierarchical external memory
CN109271644A (en) A kind of translation model training method and device
US11353833B2 (en) Systems and methods for learning and predicting time-series data using deep multiplicative networks
CN113392210A (en) Text classification method and device, electronic equipment and storage medium
CN113704460A (en) Text classification method and device, electronic equipment and storage medium
August et al. Generating scientific definitions with controllable complexity
WO2017136674A1 (en) Generating feature embeddings from a co-occurrence matrix
CN116797280A (en) Advertisement document generation method and device, equipment and medium thereof
Mai et al. A unimodal representation learning and recurrent decomposition fusion structure for utterance-level multimodal embedding learning
Jia et al. Attention in character-based BiLSTM-CRF for Chinese named entity recognition
CN111046157B (en) Universal English man-machine conversation generation method and system based on balanced distribution
Kurup et al. Evolution of neural text generation: Comparative analysis
Mathur et al. A scaled‐down neural conversational model for chatbots
CN114519353B (en) Model training method, emotion message generation method and device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant