CN113076756A - Text generation method and device - Google Patents

Text generation method and device

Info

Publication number
CN113076756A
Authority
CN
China
Prior art keywords
control signal
target
sequence
metadata
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010010862.8A
Other languages
Chinese (zh)
Inventor
王刚
佘志东
张涛
张亮
饶正锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010010862.8A priority Critical patent/CN113076756A/en
Publication of CN113076756A publication Critical patent/CN113076756A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text generation method and device, relating to the technical field of computers. One embodiment of the method comprises: generating a keyword sequence and a tag sequence corresponding to an article sample according to the content of the article sample; generating a control signal corresponding to the article sample according to metadata of the article sample; generating a training sample from the keyword sequence, the tag sequence and the control signal corresponding to the article sample, and training a controllable text generation model; generating a target control signal according to target metadata; and generating, according to the target control signal and a target keyword sequence, a text corresponding to the target keyword sequence by using the trained controllable text generation model. This embodiment addresses the problem that the length and novelty of generated text are partly random, so that the generated text meets actual online requirements, and texts of different styles, different lengths and different article types can be generated from the input keyword group according to the content-output requirements of different scenes.

Description

Text generation method and device
Technical Field
The invention relates to the technical field of computers, in particular to a text generation method and a text generation device.
Background
Content marketing conveys valuable information to users through data such as text and pictures, thereby achieving the marketing purpose. Different scenes place different requirements on text content, such as novelty and length; for example, a mobile client may impose strict limits on the word count of the content, and exceeding that range may affect the appearance of the UI (user interface) design and the user experience. Existing text generation technology mainly comprises generation methods without a control mechanism and simple text rewriting. The length and novelty of text produced by a generation method without a control mechanism are partly random, so the text cannot meet actual online requirements and cannot be filtered by a subsequent module; simple text rewriting cannot generate texts of different styles, different lengths and different article types.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the length and novelty of the generated text are partly random, so the text cannot meet actual online requirements, and texts of different styles, different lengths and different article types cannot be generated based on the content-output requirements of different scenes.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text generation method and apparatus, which can solve the problem that the length and novelty of generated text are partly random, so that the generated text fully meets actual online requirements, and texts of different styles, different lengths and different article types can be generated from an input keyword group based on the content-output requirements of different scenes.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a text generation method.
A text generation method, comprising: acquiring the content and metadata of each article sample in a collected original article corpus; generating a keyword sequence and a tag sequence corresponding to the article sample according to the content of the article sample, generating a control signal corresponding to the article sample from the metadata of the article sample according to a preset rule, generating a training sample from the keyword sequence, the tag sequence and the control signal corresponding to the article sample, and training a controllable text generation model; and generating a target control signal from input target metadata according to the preset rule, and generating, according to the target control signal and an input target keyword sequence, a text corresponding to the target keyword sequence by using the trained controllable text generation model.
Optionally, the step of generating a keyword sequence corresponding to the article sample according to the content of the article sample includes: segmenting the content of the article sample with a first word segmentation algorithm to obtain a segmented word sequence; extracting the segmented words belonging to a preset part of speech from the segmented word sequence, and generating a candidate keyword sequence from the extracted words in their original order in the segmented word sequence; and deleting the words that do not meet a preset condition from the candidate keyword sequence to obtain the keyword sequence. The step of generating a tag sequence corresponding to the article sample according to the content of the article sample includes: segmenting the content of the article sample with a second word segmentation algorithm to obtain the tag sequence.
Optionally, the deleting of the words in the candidate keyword sequence that do not meet the preset condition includes: counting, based on the original article corpus, the word frequency of each word in the candidate keyword sequence, and deleting from the candidate keyword sequence the words whose word frequency is smaller than a preset threshold and the words that appear in a preset blacklist.
Optionally, a first control signal is generated from first metadata according to the following preset rule: in the case that the first metadata is in numerical form, its numerical value is taken as the first control signal; in the case that the first metadata is in non-numerical form, it is converted into a discrete value in a finite numerical interval according to a conversion rule to serve as the first control signal. The first metadata is the metadata of the article sample and the first control signal is the control signal corresponding to the article sample, or the first metadata is the target metadata and the first control signal is the target control signal.
Optionally, the metadata of the article sample and the input target metadata are each one or more of an author, an article category, an article type and an article length, and they belong to the same one or more types of metadata.
Optionally, the generating, according to the target control signal and the input target keyword sequence, a text corresponding to the target keyword sequence by using the trained controllable text generation model includes: adding a first word vector of each word in the target keyword sequence to a corresponding first position vector, and encoding the result with the encoder of the controllable text generation model to obtain an encoded vector, where the first word vector is obtained by word embedding the words of the target keyword sequence and the first position vector is obtained by position-encoding the position information of the words of the target keyword sequence; and calculating, at each step of the decoder of the controllable text generation model, the probability distribution of the generated word, and selecting the word sequence with the maximum probability as the text corresponding to the target keyword sequence, where the target sequence is formed from a specific marker word and the words calculated so far, the probability distribution of the next generated word is calculated based on the vector obtained by adding the second word vector of each word of the target sequence to its corresponding second position vector, the target control signal vector and the encoded vector, the second word vector is obtained by word embedding the words of the target sequence, the second position vector is obtained by position-encoding the position information of the words of the target sequence, and the target control signal vector is obtained by word embedding the target control signal.
Optionally, the target metadata is the multiple types of metadata, and the target control signal is a corresponding multiple types of control signals; the obtaining of the target control signal vector by performing word embedding processing on the target control signal includes: and performing word embedding processing on each type of control signal of the target control signal to obtain a control signal vector corresponding to each type of control signal, and splicing the obtained control signal vectors of various types according to a preset sequence to obtain the target control signal vector.
According to another aspect of the embodiments of the present invention, there is provided a text generating apparatus.
A text generation apparatus comprising: the data acquisition module is used for acquiring the content and metadata of each article sample in the collected original article corpus; the model training module is used for generating a keyword sequence and a tag sequence corresponding to the article sample according to the content of the article sample, generating a control signal corresponding to the article sample according to metadata of the article sample and a preset rule, generating a training sample according to the keyword sequence, the tag sequence and the control signal corresponding to the article sample, and training a controllable text generation model; and the text generation module is used for generating a target control signal according to the input target metadata and the preset rule, and generating a text corresponding to the target keyword sequence by utilizing the trained controllable text generation model according to the target control signal and the input target keyword sequence.
Optionally, the model training module includes a keyword sequence generation sub-module configured to: segment the content of the article sample with a first word segmentation algorithm to obtain a segmented word sequence; extract the segmented words belonging to a preset part of speech from the segmented word sequence, and generate a candidate keyword sequence from the extracted words in their original order in the segmented word sequence; and delete the words that do not meet a preset condition from the candidate keyword sequence to obtain the keyword sequence. The model training module may further include a tag sequence generation sub-module configured to: segment the content of the article sample with a second word segmentation algorithm to obtain the tag sequence.
Optionally, the keyword sequence generation sub-module includes a word filtering sub-unit configured to: count, based on the original article corpus, the word frequency of each word in the candidate keyword sequence, and delete from the candidate keyword sequence the words whose word frequency is smaller than the preset threshold and the words that appear in a preset blacklist.
Optionally, a first control signal is generated from first metadata according to the following preset rule: in the case that the first metadata is in numerical form, its numerical value is taken as the first control signal; in the case that the first metadata is in non-numerical form, it is converted into a discrete value in a finite numerical interval according to a conversion rule to serve as the first control signal. The first metadata is the metadata of the article sample and the first control signal is the control signal corresponding to the article sample, or the first metadata is the target metadata and the first control signal is the target control signal.
Optionally, the metadata of the article sample and the input target metadata are each one or more of an author, an article category, an article type and an article length, and they belong to the same one or more types of metadata.
Optionally, the text generation module is further configured to: add a first word vector of each word in the target keyword sequence to a corresponding first position vector, and encode the result with the encoder of the controllable text generation model to obtain an encoded vector, where the first word vector is obtained by word embedding the words of the target keyword sequence and the first position vector is obtained by position-encoding the position information of the words of the target keyword sequence; and calculate, at each step of the decoder of the controllable text generation model, the probability distribution of the generated word, and select the word sequence with the maximum probability as the text corresponding to the target keyword sequence, where the target sequence is formed from a specific marker word and the words calculated so far, the probability distribution of the next generated word is calculated based on the vector obtained by adding the second word vector of each word of the target sequence to its corresponding second position vector, the target control signal vector and the encoded vector, the second word vector is obtained by word embedding the words of the target sequence, the second position vector is obtained by position-encoding the position information of the words of the target sequence, and the target control signal vector is obtained by word embedding the target control signal.
Optionally, the target metadata is the multiple types of metadata, and the target control signal is a corresponding multiple types of control signals; the text generation module comprises a target control signal vector generation submodule for: and performing word embedding processing on each type of control signal of the target control signal to obtain a control signal vector corresponding to each type of control signal, and splicing the obtained control signal vectors of various types according to a preset sequence to obtain the target control signal vector.
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the text generation method provided by embodiments of the present invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements a text generation method provided by an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: a keyword sequence and a tag sequence corresponding to an article sample are generated according to the content of the article sample in the original article corpus; a control signal corresponding to the article sample is generated from the metadata of the article sample according to a preset rule; a training sample is generated from the keyword sequence, the tag sequence and the control signal corresponding to the article sample, and a controllable text generation model is trained; a target control signal is generated from input target metadata according to the preset rule; and a text corresponding to the target keyword sequence is generated according to the target control signal and the input target keyword sequence by using the trained controllable text generation model. This solves the problem that the length and novelty of generated text are partly random, so that the generated text fully meets actual online requirements, and texts of different styles, lengths and article types can be generated from the input keyword group based on the content-output requirements of different scenes.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
fig. 1 is a schematic diagram of the main steps of a text generation method according to a first embodiment of the present invention;
FIG. 2 is an architectural diagram of a controllable text generation model according to a second embodiment of the invention;
FIG. 3 is a diagram illustrating a processing method of a control signal vector Embedding layer according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a text generation flow according to a third embodiment of the invention;
fig. 5 is a schematic diagram of main blocks of a text generating apparatus according to a fourth embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of main steps of a text generation method according to a first embodiment of the present invention.
As shown in fig. 1, the text generation method according to an embodiment of the present invention mainly includes steps S101 to S103 as follows.
Step S101: and acquiring the content and metadata of each article sample in the collected corpus of the original articles.
Step S102: generating a keyword sequence and a tag sequence corresponding to the article sample according to the content of the article sample, generating a control signal corresponding to the article sample from the metadata of the article sample according to a preset rule, generating a training sample from the keyword sequence, the tag sequence and the control signal corresponding to the article sample, and training a controllable text generation model.
Step S103: generating a target control signal from input target metadata according to a preset rule, and generating, according to the target control signal and the input target keyword sequence, a text corresponding to the target keyword sequence by using the trained controllable text generation model.
A plurality of article samples can be collected in advance and stored in a database system, and original article corpora can be collected from the database system, wherein each article sample comprises metadata and content. The metadata of the article sample may include one or more of an author, a category of the article, a type of the article, and a length of the article. Similarly, the input target metadata is also one or more of author, article category, article type and article length. And, the metadata of the article sample and the input target metadata belong to the same one or more types of metadata, that is, assuming that the metadata of the article sample of the original article corpus includes an author and an article category, the input target metadata should also include the author and the article category.
In one embodiment, the step of generating a keyword sequence corresponding to the article sample according to the content of the article sample specifically includes: segmenting the content of the article sample with a first word segmentation algorithm to obtain a segmented word sequence; extracting the segmented words belonging to a preset part of speech from the segmented word sequence, and generating a candidate keyword sequence from the extracted words in their original order in the segmented word sequence; and deleting the words that do not meet a preset condition from the candidate keyword sequence to obtain the keyword sequence.
The preset part of speech can be set according to the needs of a specific field. For example, in the scene of generating commodity text in the e-commerce field, the preset part of speech can be set to extract nouns and adjectives, so as to obtain words describing the commodity and the experience of using it; for the media field, such as generating news text, the preset part of speech can be set to extract nouns, verbs and other words describing events; for other writing fields, the preset part of speech can be set according to the specific situation, which is not enumerated here.
The method for generating the tag sequence corresponding to the article sample according to the content of the article sample specifically comprises: segmenting the content of the article sample with a second word segmentation algorithm to obtain the tag sequence.
The first and second word segmentation algorithms can be general-purpose word segmentation algorithms, and they may be the same or different. When generating the keyword sequence, the word segmentation algorithm should be as accurate as possible. When generating the tag sequence, a subword strategy (e.g., SentencePiece, a subword segmentation algorithm) may be considered to reduce the complexity of the model; the requirement for segmentation accuracy when generating the tag sequence is not as high as when generating the keyword sequence.
Deleting the words that do not meet the preset condition from the candidate keyword sequence specifically includes: counting, based on the original article corpus, the word frequency of each word in the candidate keyword sequence, and deleting the words whose word frequency is smaller than a preset threshold and the words that appear in a preset blacklist.
Counting the word frequency of each word in the candidate keyword sequence based on the original article corpus means counting how often each word of the candidate keyword sequence occurs in the original article corpus. The preset blacklist may be a pre-established list of keywords that must not be used to generate text.
A first control signal is generated from first metadata according to the following preset rule: in the case that the first metadata is in numerical form, its numerical value is taken as the first control signal; in the case that the first metadata is in non-numerical form, it is converted into a discrete value in a finite numerical interval according to a conversion rule to serve as the first control signal. The first metadata is the metadata of an article sample and the first control signal is the control signal corresponding to that article sample, or the first metadata is the target metadata and the first control signal is the target control signal.
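This preset rule can be sketched minimally as follows; the function name and the running vocabulary dictionary for non-numerical metadata are illustrative assumptions, not part of the patent:

```python
def to_control_signal(value, vocab):
    """Convert one piece of metadata into a control signal.

    Numerical metadata (e.g. article length) is used directly as a number;
    non-numerical metadata (e.g. an author name) is mapped to a discrete
    value in the closed interval [1, V] via a shared vocabulary.
    """
    if isinstance(value, (int, float)):
        return float(value)
    # First occurrence of a new value gets the next free ID.
    return vocab.setdefault(value, len(vocab) + 1)
```

Since the same rule is applied both to article samples during training and to target metadata during generation, the vocabulary must be shared between the two phases.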
Generating, according to the target control signal and the input target keyword sequence, a text corresponding to the target keyword sequence by using the trained controllable text generation model specifically includes: adding the first word vector of each word in the target keyword sequence to its corresponding first position vector, and encoding the result with the encoder of the controllable text generation model to obtain an encoded vector, where the first word vectors are obtained by word embedding the words of the target keyword sequence (each word has its own first word vector) and the first position vectors are obtained by position-encoding the position information of those words (each word has its own first position vector); and calculating, at each step of the decoder of the controllable text generation model, the probability distribution of the generated word, and selecting the word sequence with the highest probability as the text corresponding to the target keyword sequence.
The probability distribution of the next generated word is calculated based on the vector obtained by adding the second word vector of each word of the target sequence to its corresponding second position vector, the target control signal vector, and the encoded vector output by the encoder. The second word vectors are obtained by word embedding the words of the target sequence, each word having its own second word vector. The second position vectors are obtained by position-encoding the position information of the words of the target sequence, each word having its own second position vector. The target control signal vector is obtained by word embedding the target control signal.
With each calculation step of the decoder, the words calculated so far increase step by step, so the target sequence changes dynamically. The specific marker word BOS is a token marking the beginning of a sentence. The vector obtained by adding the second word vector of each word of the target sequence to its corresponding second position vector is, for each position, the sum of the second word vector and the second position vector of the word at that position.
When the probability distribution of the next generated word is calculated based on the vector obtained by adding the second word vector of each word of the target sequence to its corresponding second position vector, the target control signal vector and the encoded vector output by the encoder, the summed word-plus-position vector is concatenated or added with the target control signal vector, the resulting vector is input into the decoder together with the encoded vector output by the encoder, and the decoder calculates the probability distribution of the next generated word based on these inputs.
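The combination of vectors described above can be sketched with plain lists standing in for tensors; the dimensions and function name are illustrative:

```python
def decoder_input(word_vecs, pos_vecs, control_vec, mode="concat"):
    """For each target-sequence position: add the word vector to its
    position vector, then either splice (concatenate) the control signal
    vector onto the sum or add it element-wise, the two options named in
    the text."""
    inputs = []
    for w, p in zip(word_vecs, pos_vecs):
        base = [wi + pi for wi, pi in zip(w, p)]  # word vector + position vector
        if mode == "concat":
            inputs.append(base + control_vec)
        else:  # element-wise addition requires matching dimensions
            inputs.append([bi + ci for bi, ci in zip(base, control_vec)])
    return inputs
```

Note that the "concat" variant changes the decoder input dimension by the length of the control signal vector, whereas the "add" variant keeps it unchanged.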
Word embedding is a mechanism in deep learning for mapping discrete values into dense vectors.
After the probability distribution of the generated word is calculated at each step of the decoder of the controllable text generation model, and before the text corresponding to the target keyword sequence is output, the output of the decoder can be processed through a linear layer and a normalization layer.
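The step-by-step decoding just described can be sketched as a greedy loop; the toy step function stands in for the decoder plus the linear and normalization layers, and all names are illustrative. Selecting "the word sequence with the maximum probability" exactly would call for beam search; greedy decoding is shown as the simplest approximation:

```python
BOS, EOS = "<bos>", "<eos>"

def greedy_decode(step_fn, control_vec, enc_vec, max_len=50):
    """Start the target sequence from the marker word BOS and, at each
    step, append the word with the highest probability until EOS."""
    target = [BOS]
    for _ in range(max_len):
        probs = step_fn(target, control_vec, enc_vec)  # word -> probability
        word = max(probs, key=probs.get)               # most probable next word
        if word == EOS:
            break
        target.append(word)
    return target[1:]  # drop the BOS marker
```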
In one embodiment, the target metadata is a plurality of types of metadata and the target control signals are a corresponding plurality of types of control signals. The obtaining of the target control signal vector by performing word embedding processing on the target control signal may specifically include: and performing word embedding processing on each type of control signal of the target control signal to obtain a control signal vector corresponding to each type of control signal, and splicing the obtained control signal vectors of various types according to a preset sequence to obtain the target control signal vector. For example, the target metadata includes an author, an article category, an article type, and an article length, and then the control signal vectors corresponding to the author, the article category, the article type, and the article length may be concatenated in this order to obtain the target control signal vector.
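The splicing step can be sketched with a toy embedding table; the preset order and vector dimensions here are illustrative assumptions:

```python
def target_control_vector(signals, embed_tables,
                          order=("author", "category", "type", "length")):
    """Embed each type of control signal separately, then splice the
    resulting control signal vectors in a fixed preset order."""
    vec = []
    for key in order:
        vec.extend(embed_tables[key][signals[key]])  # look up and concatenate
    return vec
```

Keeping the order fixed matters: the model learns which slice of the spliced vector carries which type of signal, so training and generation must use the same order.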
The text generation method of the embodiment of the invention is described in detail below by taking the generation of the commodity text in the e-commerce field as an example.
The embodiment of the invention can fully utilize the information of the commodity data to control the generation of the text so as to meet the requirements of generating different commodity texts in different application scenes. The embodiment of the invention mainly comprises three parts: preparing data, constructing a model and generating a text.
A first part: preparing data
The first step is data collection, namely collecting the original article corpus from a database system; the original article corpus comprises a large number of article samples, each comprising metadata and content.
And performing data preprocessing on the collected data, wherein the data preprocessing is mainly used for subsequently constructing a training sample. The data preprocessing specifically includes performing the following operations: generating a control signal C, generating a keyword sequence X and generating a label sequence Y.
A control signal C is generated by converting the metadata into a discrete or continuous signal. Continuous metadata, such as article length in numerical form, can be processed directly as a floating-point number; the article length can be obtained by directly counting the content, e.g., the number of characters or the number of words after segmentation, which the embodiment of the invention does not limit. Discrete metadata, such as an author in non-numerical form, is converted into discrete values in the closed interval [1, V], where V is the number of authors that can ultimately be controlled, and the discrete value converted from an author serves as that author's ID (identification). For example, the authors may be ranked by their number of articles: each author with many articles is mapped to a unique ID, while a group of authors with few articles are all mapped to the same ID. The threshold for "many articles" can be defined as needed; in the example of Table 1 it is 100 or more, so the authors in the last three rows share the same ID, 4.
TABLE 1
Author       Number of articles   Mapped control signal (ID)
Zhang San    1000                 1
Li Si        800                  2
Wang Wu      100                  3
Liu Qi       7                    4
Jiang Ba     6                    4
Zhu Jiu      5                    4
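The author-to-ID mapping described above can be sketched as follows. This is a minimal illustration, assuming the threshold of 100 articles shown in Table 1; the function name and threshold parameter are not from the patent.

```python
def build_author_id_map(article_counts, threshold=100):
    """Map authors to discrete control-signal IDs in [1, V].

    Authors with at least `threshold` articles each receive a unique ID
    (ordered by article count, descending); all remaining authors share
    one ID, reproducing the mapping of Table 1.
    """
    ranked = sorted(article_counts.items(), key=lambda kv: -kv[1])
    id_map, next_id = {}, 1
    for author, count in ranked:
        if count >= threshold:
            id_map[author] = next_id
            next_id += 1
    shared_id = next_id  # a single bucket for all low-volume authors
    for author, count in ranked:
        if count < threshold:
            id_map[author] = shared_id
    return id_map

counts = {"Zhang San": 1000, "Li Si": 800, "Wang Wu": 100,
          "Liu Qi": 7, "Jiang Ba": 6, "Zhu Jiu": 5}
id_map = build_author_id_map(counts)
```

With these counts the mapping reproduces Table 1: three unique IDs, and the three authors with fewer than 100 articles share ID 4.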
Generating the keyword sequence X: the content of the article sample is segmented with a first word segmentation algorithm, and nouns and adjectives are extracted based on part-of-speech judgment. The extracted tokens are kept in their original order to form a candidate keyword sequence. Based on the original article corpus, the corpus-wide word frequency of each token in the candidate sequence is counted, and tokens whose frequency is below Thd (a preset threshold) or that appear in a blacklist are removed. The resulting sequence is the keyword sequence X.
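The filtering steps above can be sketched as follows. This is an illustrative sketch that assumes the content has already been segmented and POS-tagged by some external tool; the function name, tag labels ('n', 'a'), and toy data are assumptions, not from the patent.

```python
from collections import Counter

def extract_keywords(tagged_tokens, corpus_tokens, thd, blacklist=()):
    """Build the keyword sequence X from pre-segmented, POS-tagged content.

    tagged_tokens: (word, pos) pairs in original order; 'n' and 'a' stand
    in here for the noun and adjective tags of whatever segmenter is used.
    Tokens with corpus frequency below thd or in the blacklist are removed.
    """
    freq = Counter(corpus_tokens)  # word frequency over the whole corpus
    candidates = [w for w, pos in tagged_tokens if pos in ("n", "a")]
    return [w for w in candidates if freq[w] >= thd and w not in blacklist]

tagged = [("bright", "a"), ("sun", "n"), ("rises", "v"),
          ("sky", "n"), ("quaint", "a")]
corpus = ["bright"] * 3 + ["sun"] * 5 + ["sky"] * 2 + ["quaint"]
x = extract_keywords(tagged, corpus, thd=2, blacklist=("sky",))
```

Here the verb is dropped by the part-of-speech filter, "quaint" by the frequency threshold, and "sky" by the blacklist, leaving X in the original token order.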
Through keyword extraction, article data are converted into a supervised task of predicting the article sequence from the keyword sequence, so that massive amounts of continuous text data can be exploited.
Generating the label sequence Y: the content of the article sample is segmented with a second word segmentation algorithm, and the resulting token sequence is the label sequence Y.
The word segmentation algorithms used to generate X and Y can be any general-purpose algorithms, and the two may be the same or different. When generating X, the segmentation should be as accurate as possible, for example the Chinese word segmenter of Stanford CoreNLP. When generating Y, segmentation accuracy is less critical, and a subword strategy such as SentencePiece may be considered to reduce the complexity of the controllable text generation model.
After data collection and preprocessing, a training sample (X, Y, C) can be constructed for each article sample from the preprocessed X, Y, and C; the (X, Y, C) triples of all article samples together form the final data set.
Table 2 shows the raw data of one article sample, and Table 3 shows the data obtained by preprocessing it. The article type, article category, and author are represented as discrete information; Table 3 does not show the specific discrete values, and the target text Y in Table 3 is the label sequence Y described above.
TABLE 2
(Table 2 appears only as an image in the original publication; its contents are not reproducible here.)
TABLE 3
(Table 3 appears only as an image in the original publication; its contents are not reproducible here.)
A second part: construction model
Constructing the model means constructing the controllable text generation model of the embodiment of the invention. As shown in Fig. 2, the model mainly follows an encoder-decoder architecture in which both the encoder and the decoder adopt the Transformer model (a model based on the self-attention mechanism). The encoder consists of N Transformer encoder layers and the decoder of N Transformer decoder layers; the two "N ×" marks in the figure indicate these N encoder layers and N decoder layers, respectively. The embodiment of the invention introduces a control signal mechanism into the decoder.
The input keyword sequence X is first word-embedded by an input Embedding layer, added to the output of a position encoding layer (referred to as the first position encoding layer; shown as position code I in Fig. 2), and then fed into the N Transformer encoder layers to obtain an encoding vector, which is passed to the decoder.
The words of the sequence obtained by right-shifting the target text Y by one position are embedded by an output Embedding layer and added to the output of a position encoding layer (the second position encoding layer; shown as position code II in Fig. 2) to obtain a vector Ye. The control signal C is embedded by a control signal Embedding layer to obtain a vector Ce. The vectors Ye and Ce are concatenated and fed into the N Transformer decoder layers; the probability of the predicted sequence is finally output after processing by a linear layer and a Softmax (normalization) layer.
As an alternative embodiment, the concatenation of Ye and Ce mentioned above may be replaced by addition or by a mapping through one linear layer.
The input Embedding layer is used for performing word Embedding processing on each word of the keyword sequence X, so that each word of the keyword sequence X is converted into a word vector (namely a first word vector).
The first position encoding layer position-encodes the position information of each word of the keyword sequence X, mapping each position to a vector, namely the first position vector. For example, for a keyword sequence X of three words, the position information of the words is [1, 2, 3], and position encoding maps each position to a vector, e.g., position 1 to one vector. The output of the first position encoding layer is the first position vector of each word of X.
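The patent does not fix how positions are mapped to vectors; one common choice is the sinusoidal encoding of the original Transformer, sketched here as an assumption (positions are 0-based in the code, while the text labels them from 1).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding: maps each of seq_len positions to a
    d_model-dimensional vector, using sine on even dimensions and cosine
    on odd dimensions (one standard choice, not mandated by the patent)."""
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1)
    i = np.arange(d_model)[None, :]     # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(3, 8)  # position vectors for a 3-word sequence X
```

Each row of `pe` is the first position vector for one word of X; it is added element-wise to that word's first word vector before entering the encoder.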
The output Embedding layer performs word embedding on each word of the sequence obtained by right-shifting the target text Y by one position, converting each word into a word vector (the second word vector). In the right-shifted sequence, the first token is BOS (a marker for the beginning of the sentence) and the tokens after BOS are the words of the target text Y.
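The right-shift can be stated in two lines. A hedged sketch: the text only specifies prepending BOS, although many implementations also drop the last token so decoder input and output have equal length.

```python
def shift_right(target_tokens, bos="<BOS>"):
    """Right-shift the target text Y by one position: prepend BOS so the
    decoder input at step j exposes only BOS, Y1, ..., Y(j-1)."""
    return [bos] + list(target_tokens)

decoder_input = shift_right(["Y1", "Y2", "Y3"])
```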
The second position encoding layer position-encodes the position information of each word of the right-shifted sequence, mapping each position to a vector, namely the second position vector.
When computing the probability of the predicted sequence, each step in the decoder produces a probability distribution over the vocabulary, and finally the word sequence with the maximum probability is selected as the output predicted sequence. Denote the elements of Y by Y1, Y2, Y3, Y4, ...; right-shifting the target text Y by one position yields the sequence BOS, Y1, Y2, Y3, .... When computing Yj (the j-th element of Y), the already computed Y1 through Y(j-1) are needed. Specifically, a target sequence is formed from BOS and the currently computed Y1 through Y(j-1); the probability distribution of Yj is then computed from the vector obtained by adding the second word vector of each word of the target sequence to its corresponding second position vector (i.e., Ye), together with the target control signal vector (i.e., Ce) and the encoding vector output by the encoder.
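The step-by-step selection above can be sketched as a greedy decoding loop. Note this is an assumption: "selecting the word sequence with the maximum probability" is in practice approximated by greedy or beam search, and `step_probs` below is a toy stand-in for the full decoder forward pass (which would consume Ye, Ce, and the encoding vector).

```python
def greedy_decode(step_probs, vocab, max_len, bos="<BOS>", eos="<EOS>"):
    """At each step, take the distribution over the vocabulary given the
    target sequence so far, append the highest-probability word, and stop
    at EOS or max_len."""
    prefix = [bos]
    for _ in range(max_len):
        probs = step_probs(prefix)  # distribution over vocab for step j
        word = vocab[max(range(len(vocab)), key=probs.__getitem__)]
        if word == eos:
            break
        prefix.append(word)
    return prefix[1:]  # drop BOS

vocab = ["a", "b", "<EOS>"]
def toy_step(prefix):  # fixed toy distributions keyed by prefix length
    return {1: [0.7, 0.2, 0.1],
            2: [0.1, 0.8, 0.1],
            3: [0.0, 0.0, 1.0]}[len(prefix)]

result = greedy_decode(toy_step, vocab, max_len=10)
```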
When the control signal C is embedded by the control signal Embedding layer, if there are several types of control signals, the vectors obtained by embedding each type are concatenated to form the final control signal vector Ce. Discrete control signals such as author, category, and type follow the same processing logic as generating word vectors; a continuous control signal such as length can be converted directly into a floating-point number of dimension 1. The discrete control signal vectors and the continuous control signals are then concatenated in a fixed order to form the final control signal vector. Fig. 3 is a schematic diagram of the processing performed by the control signal Embedding layer according to an embodiment of the present invention: the author, category, and type control signals are each embedded, the length is converted into the floating-point number 88.0, and the four are concatenated to obtain the final control signal vector.
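The construction of Ce can be sketched as follows. The table sizes, embedding dimension, and IDs are toy assumptions; only the structure (embed each discrete signal, keep length as a dimension-1 float, concatenate in a fixed order) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension per discrete signal (assumed)

# Toy embedding tables for the discrete signals (author, category, type).
author_emb = rng.normal(size=(5, d))
category_emb = rng.normal(size=(10, d))
type_emb = rng.normal(size=(3, d))

def control_vector(author_id, category_id, type_id, length):
    """Embed each discrete control signal, represent the continuous length
    signal as a 1-dimensional float, and concatenate in a fixed order to
    form the final control signal vector Ce."""
    parts = [author_emb[author_id],
             category_emb[category_id],
             type_emb[type_id],
             np.array([float(length)])]  # e.g. 88.0 as in Fig. 3
    return np.concatenate(parts)

ce = control_vector(1, 4, 2, 88)
```

The resulting Ce has dimension 3 × d + 1; in a trained model the embedding tables would be learned parameters rather than random draws.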
By introducing the control signal at the Embedding layer of the decoder, the embodiment of the invention enables controlled text generation.
The controllable text generation model constructed in the second part is trained with the data set obtained in the first part (data preparation), yielding the trained controllable text generation model.
And a third part: generating text
The user inputs metadata and a keyword group, and the metadata is converted into discrete or continuous control signals. For a discrete control signal, the user selects within the range supported by the controllable text generation model of the embodiment of the invention (e.g., the supported authors, categories, and types); for a continuous control signal, the user selects within the length interval supported by the model (determined by the maximum and minimum article lengths in the data set, e.g., an article length in [20, 100]).
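The inference-side conversion and range check can be sketched as follows. Field names, the author map, and the default length interval are illustrative assumptions.

```python
def make_control_signal(metadata, author_map, length_range=(20, 100)):
    """Convert user-supplied metadata into the model's control signals,
    validating that each value lies in the supported range (the length
    interval comes from the min/max article lengths in the data set)."""
    lo, hi = length_range
    length = metadata["length"]
    if not lo <= length <= hi:
        raise ValueError(f"length must be in [{lo}, {hi}]")
    author = metadata["author"]
    if author not in author_map:
        raise ValueError("author not supported by the trained model")
    return {"author_id": author_map[author], "length": float(length)}

signal = make_control_signal({"author": "Zhang San", "length": 88},
                             author_map={"Zhang San": 1, "Li Si": 2})
```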
The trained controllable text generation model then generates the corresponding output text Y from the control signals and the keyword group input by the user.
For example, the keyword group X input by the user is: elegant, beautiful. Table 4 shows the output text under four different groups of control signals. In Table 4, each control signal C is shown in its original form (i.e., the initially extracted metadata); when generating text, the embodiment of the invention converts each C into the corresponding discrete numerical form.
TABLE 4
(Table 4 appears only as an image in the original publication; its contents are not reproducible here.)
Fig. 4 is a schematic diagram of a text generation flow according to a third embodiment of the present invention.
As shown in FIG. 4, the text generation process of one embodiment of the present invention includes a training phase and a prediction phase of a controllable text generation model.
In the training stage, raw article corpora (raw data) containing metadata and content are collected from a database system and preprocessed into training samples (X, C, Y), where C is the control signal, X the keyword sequence, and Y the label sequence. The training samples are used to train the controllable text generation model, whose output in the training stage is the probability distribution P(Y | X, C) of Y conditioned on X and C.
In the prediction stage, given an input control signal C' and an input keyword sequence X', the trained controllable text generation model outputs the generated result Y_est.
The metadata of the embodiment of the invention can include article length, author, category information of the article, article type, and the like. Texts of different styles, lengths, and article types can therefore be planned and generated from the keyword groups and metadata input by the user, according to the content output requirements of different scenes, rather than being mere text continuation.
The text generation process of the embodiment of the invention is applicable not only to product text generation in the e-commerce field but also to other writing fields; the model of the embodiment of the invention can be applied wherever text and metadata exist. When the metadata includes only a length, control over the length of the generated text alone can be achieved.
Fig. 5 is a schematic diagram of main blocks of a text generating apparatus according to a fourth embodiment of the present invention.
As shown in fig. 5, a text generating apparatus 500 according to an embodiment of the present invention mainly includes: a data acquisition module 501, a model training module 502 and a text generation module 503.
The data obtaining module 501 is configured to obtain content and metadata of each article sample in the collected original article corpus.
The model training module 502 is configured to generate a keyword sequence and a tag sequence corresponding to the article sample according to the content of the article sample, generate a control signal corresponding to the article sample according to a preset rule according to metadata of the article sample, generate a training sample according to the keyword sequence, the tag sequence, and the control signal of the corresponding article sample, and train the controllable text generation model.
The text generating module 503 is configured to generate a target control signal according to a preset rule based on the input target metadata, and generate a text corresponding to the target keyword sequence by using a trained controllable text generating model based on the target control signal and the input target keyword sequence.
Model training module 502 may include a keyword sequence generation submodule to: performing word segmentation on the content of the article sample by using a first word segmentation algorithm to obtain a word segmentation sequence; extracting the participles belonging to a preset part of speech from the participle sequence, and generating a candidate keyword sequence according to the original sequence of the extracted participles in the participle sequence based on the extracted participles; and deleting the participles which do not accord with the preset conditions in the candidate keyword sequence to obtain the keyword sequence.
Model training module 502 may also include a tag sequence generation submodule to: and performing word segmentation on the content of the article sample by using a second word segmentation algorithm to obtain a label sequence.
The keyword sequence generation sub-module may include a segmentation filtering sub-unit for: counting the word frequency of each participle in the candidate keyword sequence based on the corpus of the original article, deleting the participles with the word frequency smaller than a preset threshold value in the candidate keyword sequence and the participles in a preset blacklist.
The first control signal is generated from the first metadata according to the preset rule as follows: in the case that the first metadata is in numerical form, its numerical value is taken as the first control signal; in the case that the first metadata is in non-numerical form, it is converted according to a conversion rule into a discrete value in a finite numerical interval, which serves as the first control signal. Here the first metadata is the metadata of an article sample and the first control signal is the control signal corresponding to that article sample, or the first metadata is the target metadata and the first control signal is the target control signal.
The metadata of the article sample and the input target metadata may be one or more of an author, a category of the article, a type of the article, and a length of the article, and the metadata of the article sample and the input target metadata belong to the same one or more types of metadata.
The text generation module 503 may be specifically configured to: add the first word vector of each word in the target keyword sequence to the corresponding first position vector, and encode the result with the encoder of the controllable text generation model to obtain an encoding vector, where the first word vector is obtained by word-embedding the words of the target keyword sequence and the first position vector by position-encoding their position information; and compute, at each step in the decoder of the controllable text generation model, the probability distribution of the generated word, selecting the word sequence with the maximum probability as the text corresponding to the target keyword sequence. The target sequence is obtained from a specific marker word and the currently computed words; the probability distribution of the next generated word is computed from the vector obtained by adding the second word vector of each word of the target sequence to its corresponding second position vector, together with the target control signal vector and the encoding vector. The second word vector is obtained by word-embedding the words of the target sequence, the second position vector by position-encoding their position information, and the target control signal vector by word-embedding the target control signal.
In one embodiment, the target metadata is a plurality of types of metadata, and the target control signals are a corresponding plurality of types of control signals; the text generation module 503 may include a target control signal vector generation submodule for: and performing word embedding processing on each type of control signal of the target control signal to obtain a control signal vector corresponding to each type of control signal, and splicing the obtained control signal vectors of various types according to a preset sequence to obtain the target control signal vector.
In addition, the detailed implementation of the text generation device of the embodiment of the present invention has already been described in detail in the text generation method, so the description is not repeated here.
Fig. 6 illustrates an exemplary system architecture 600 to which the text generation method or text generation apparatus of embodiments of the invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 601, 602, 603. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the text generation method provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the text generation apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, a block diagram of a computer system 700 suitable for use with a terminal device or server implementing an embodiment of the invention is shown. The terminal device or the server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a data acquisition module, a model training module, and a text generation module. The names of these modules do not in some cases constitute a limitation on the modules themselves, for example, the data acquisition module may also be described as a "module for acquiring the content and metadata of each article sample in the collected original article corpus".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: acquiring the content and metadata of each article sample in the collected corpus of the original articles; generating a keyword sequence and a tag sequence corresponding to the article sample according to the content of the article sample, generating a control signal corresponding to the article sample according to a preset rule according to metadata of the article sample, generating a training sample according to the keyword sequence, the tag sequence and the control signal corresponding to the article sample, and training a controllable text generation model; generating a target control signal according to the input target metadata and the preset rule, and generating a text corresponding to the target keyword sequence by using the trained controllable text generation model according to the target control signal and the input target keyword sequence.
According to the technical scheme of the embodiment of the invention, a keyword sequence and a tag sequence corresponding to an article sample are generated according to the content of the article sample in the corpus of an original article, a control signal corresponding to the article sample is generated according to metadata of the article sample and a preset rule, a training sample is generated according to the keyword sequence, the tag sequence and the control signal of the corresponding article sample, a controllable text generation model is trained, a target control signal is generated according to input target metadata and a preset rule, and a text corresponding to the target keyword sequence is generated by utilizing the trained controllable text generation model according to the target control signal and the input target keyword sequence. The problem that the text length and the novelty of the generated text have certain randomness can be solved, the generated text fully meets the actual online requirement, and texts with different styles, lengths and article types can be generated based on the requirements of different scenes on content output according to the input key phrases.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A text generation method, comprising:
acquiring the content and metadata of each article sample in the collected corpus of the original articles;
generating a keyword sequence and a tag sequence corresponding to the article sample according to the content of the article sample, generating a control signal corresponding to the article sample according to a preset rule according to metadata of the article sample, generating a training sample according to the keyword sequence, the tag sequence and the control signal corresponding to the article sample, and training a controllable text generation model;
generating a target control signal according to the input target metadata and the preset rule, and generating a text corresponding to the target keyword sequence by using the trained controllable text generation model according to the target control signal and the input target keyword sequence.
2. The method of claim 1, wherein the step of generating a keyword sequence corresponding to the article sample from the content of the article sample comprises:
performing word segmentation on the content of the article sample by using a first word segmentation algorithm to obtain a word segmentation sequence;
extracting the participles belonging to a preset part of speech from the participle sequence, and generating a candidate keyword sequence according to the original sequence of the extracted participles in the participle sequence based on the extracted participles;
deleting the participles which do not meet the preset conditions in the candidate keyword sequence to obtain the keyword sequence;
generating a label sequence corresponding to the article sample according to the content of the article sample, wherein the step comprises the following steps:
and performing word segmentation on the content of the article sample by using a second word segmentation algorithm to obtain the label sequence.
3. The method according to claim 2, wherein the deleting the participles in the candidate keyword sequence that do not meet a preset condition comprises:
and counting the word frequency of each participle in the candidate keyword sequence based on the corpus of the original article, deleting the participle with the word frequency smaller than a preset threshold value in the candidate keyword sequence and the participle in a preset blacklist.
4. The method according to claim 1, wherein the first control signal is generated from the first metadata according to the preset rule as follows:
in the case that the first metadata is in a numerical form, taking the numerical value thereof as the first control signal;
under the condition that the first metadata is in a non-numerical value form, converting the first metadata into discrete numerical values in a finite numerical value interval according to a conversion rule to serve as the first control signal;
the first metadata is metadata of the article sample, and the first control signal is a control signal corresponding to the article sample, or the first metadata is the target metadata, and the first control signal is the target control signal.
5. The method according to claim 1, wherein the metadata of the article sample and the input target metadata are each one or more of: author, article category, article type, and article length; and the metadata of the article sample and the input target metadata belong to the same one or more metadata types.
6. The method according to claim 5, wherein the generating of a text corresponding to the target keyword sequence by using the trained controllable text generation model according to the target control signal and the input target keyword sequence comprises:
adding the first word vector of each word in the target keyword sequence to the corresponding first position vector, and encoding the result with the encoder of the controllable text generation model to obtain an encoded vector, wherein the first word vectors are obtained by performing word embedding on the words of the target keyword sequence, and the first position vectors are obtained by position-encoding the position information of the words of the target keyword sequence;
calculating a probability distribution over generated words at each step of the decoder of the controllable text generation model, and selecting the word sequence with the maximum probability as the text corresponding to the target keyword sequence, wherein a target sequence is formed from a specific marker word and the words calculated so far, and the probability distribution of the next generated word is calculated from the encoded vector, a target control signal vector, and the vector obtained by adding the second word vector of each word of the target sequence to the corresponding second position vector; the second word vectors are obtained by performing word embedding on the words of the target sequence, the second position vectors are obtained by position-encoding the position information of the words of the target sequence, and the target control signal vector is obtained by performing word embedding on the target control signal.
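The input construction of claim 6 — summing a word vector with a position vector per token, plus an embedded control signal — can be sketched as follows. The sinusoidal position encoding and the deterministic toy word embedding are illustrative stand-ins for learned components, and the encoder/decoder Transformer layers themselves are omitted.

```python
import math

# Minimal sketch of the claim-6 input construction; all values are illustrative.
DIM = 8  # embedding dimension (assumption)

def position_encoding(pos, dim=DIM):
    """Sinusoidal position encoding, a common choice for Transformer models."""
    return [math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def embed(token_id, dim=DIM):
    """Toy deterministic word embedding (stand-in for a learned table)."""
    return [math.sin(token_id * (i + 1)) for i in range(dim)]

def encoder_input(token_ids):
    # first word vector + corresponding first position vector, per claim 6
    return [[w + p for w, p in zip(embed(t), position_encoding(pos))]
            for pos, t in enumerate(token_ids)]

enc_in = encoder_input([3, 17, 42])          # three keyword tokens
print(len(enc_in), len(enc_in[0]))           # 3 8
```

The decoder side repeats the same sum over the partially generated target sequence and additionally conditions each step on the encoded vector and the embedded target control signal.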
7. The method according to claim 6, wherein the target metadata comprises a plurality of types of metadata, and the target control signal comprises a corresponding plurality of types of control signals;
the obtaining of the target control signal vector by performing word embedding on the target control signal comprises:
performing word embedding on each type of control signal in the target control signal to obtain a control signal vector for each type, and concatenating the resulting control signal vectors in a preset order to obtain the target control signal vector.
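Claim 7's concatenation can be sketched with one embedding table per control-signal type; the tables, dimensions, and signal types below are illustrative assumptions.

```python
# Sketch of claim 7: each metadata type has its own control-signal embedding
# table, and the per-type vectors are concatenated in a preset order.
TABLES = {
    "category": {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]},
    "length":   {0: [1.0, 1.1], 1: [1.2, 1.3]},
}
SIGNAL_ORDER = ["category", "length"]   # preset concatenation order

def target_control_vector(signals):
    vec = []
    for name in SIGNAL_ORDER:           # a fixed order keeps positions stable
        vec.extend(TABLES[name][signals[name]])
    return vec

print(target_control_vector({"category": 1, "length": 0}))
# [0.3, 0.4, 1.0, 1.1]
```

Keeping the concatenation order fixed means the downstream model always finds a given signal type at the same offset of the target control signal vector.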
8. A text generation apparatus, comprising:
a data acquisition module, configured to acquire the content and metadata of each article sample in a collected original-article corpus;
a model training module, configured to generate a keyword sequence and a tag sequence corresponding to the article sample according to the content of the article sample, generate a control signal corresponding to the article sample according to the metadata of the article sample and a preset rule, generate training samples from the keyword sequence, tag sequence and control signal corresponding to each article sample, and train a controllable text generation model; and
a text generation module, configured to generate a target control signal according to input target metadata and the preset rule, and to generate, according to the target control signal and an input target keyword sequence, a text corresponding to the target keyword sequence by using the trained controllable text generation model.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010010862.8A 2020-01-06 2020-01-06 Text generation method and device Pending CN113076756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010010862.8A CN113076756A (en) 2020-01-06 2020-01-06 Text generation method and device

Publications (1)

Publication Number Publication Date
CN113076756A true CN113076756A (en) 2021-07-06

Family

ID=76609265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010010862.8A Pending CN113076756A (en) 2020-01-06 2020-01-06 Text generation method and device

Country Status (1)

Country Link
CN (1) CN113076756A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239564A * 2017-06-13 2017-10-10 Nanjing University Text label recommendation method based on a supervised topic model
US9934777B1 * 2016-07-01 2018-04-03 Amazon Technologies, Inc. Customized speech processing language models
CN108197109A * 2017-12-29 2018-06-22 Beijing Percent Information Technology Co., Ltd. Multilingual analysis method and device based on natural language processing
WO2018153265A1 * 2017-02-23 2018-08-30 Tencent Technology (Shenzhen) Co., Ltd. Keyword extraction method, computer device, and storage medium
CN109992646A * 2019-03-29 2019-07-09 Tencent Technology (Shenzhen) Co., Ltd. Text label extraction method and device
US20190266250A1 * 2018-02-24 2019-08-29 Twenty Lane Media, LLC Systems and Methods for Generating Jokes
CN110263340A * 2019-06-20 2019-09-20 Beijing Baidu Netcom Science and Technology Co., Ltd. Comment generation method, device, server and storage medium
CN110532560A * 2019-08-30 2019-12-03 Hainan Chezhiyitong Information Technology Co., Ltd. Method and computing device for generating a text title

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU, YC et al.: "Automatic Tag Recommendation for Weblogs", 2009 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND COMPUTER SCIENCE, vol. 1, 1 January 2019 (2019-01-01) *
GUAN CHENYU: "Research on Event-Oriented Automatic Summarization of Social Media Text", China Master's Theses Full-text Database, no. 8, 15 August 2017 (2017-08-15) *
LI ZUOCHAO: "Research on Text Generation Algorithms Based on Keyword Semantic Control", China Master's Theses Full-text Database, no. 9, 15 September 2019 (2019-09-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988027A * 2021-09-23 2022-01-28 Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. Text generation method, device, equipment and storage medium
WO2023071242A1 * 2021-11-01 2023-05-04 Shenzhen Qianhai WeBank Co., Ltd. Text generation method and apparatus, and storage medium
CN114970524A * 2022-05-31 2022-08-30 Beijing Shenyan Technology Co., Ltd. Controllable text generation method and device
CN114970524B * 2022-05-31 2024-02-02 Beijing Shenyan Technology Co., Ltd. Controllable text generation method and device

Similar Documents

Publication Publication Date Title
CN109376234B (en) Method and device for training abstract generation model
US20180365231A1 (en) Method and apparatus for generating parallel text in same language
CN106960030B (en) Information pushing method and device based on artificial intelligence
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
CN108628830B (en) Semantic recognition method and device
CN112633947B (en) Text generation model generation method, text generation method, device and equipment
CN113076756A (en) Text generation method and device
CN107862058B (en) Method and apparatus for generating information
CN109933217B (en) Method and device for pushing sentences
CN111159409A (en) Text classification method, device, equipment and medium based on artificial intelligence
CN111523960A (en) Product pushing method and device based on sparse matrix, computer equipment and medium
CN110874532A (en) Method and device for extracting keywords of feedback information
CN113268560A (en) Method and device for text matching
CN111861596A (en) Text classification method and device
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN112188311A (en) Method and apparatus for determining video material of news
CN111538817A (en) Man-machine interaction method and device
CN110807097A (en) Method and device for analyzing data
CN113139558B (en) Method and device for determining multi-stage classification labels of articles
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN111783433A (en) Text retrieval error correction method and device
CN110598049A (en) Method, apparatus, electronic device and computer readable medium for retrieving video
CN110555204A (en) emotion judgment method and device
CN110895655A (en) Method and device for extracting text core phrase
CN113011152B (en) Text processing method, device and equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination