CN117787224A - Controllable story generation method based on multi-source heterogeneous feature fusion - Google Patents


Info

Publication number
CN117787224A
CN117787224A (application number CN202311828251.4A)
Authority
CN
China
Prior art keywords
story
fusion
features
text
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311828251.4A
Other languages
Chinese (zh)
Other versions
CN117787224B (en)
Inventor
夏鸿斌 (Xia Hongbin)
孟祥仲 (Meng Xiangzhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202311828251.4A priority Critical patent/CN117787224B/en
Publication of CN117787224A publication Critical patent/CN117787224A/en
Application granted granted Critical
Publication of CN117787224B publication Critical patent/CN117787224B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a controllable story generation method based on multi-source heterogeneous feature fusion. For each story sample, a guide text, a keyword sequence and common-sense knowledge are acquired and respectively input into a BART encoder to obtain text features, keyword features and common-sense features. These are input into a multi-source heterogeneous feature fusion model, where pairwise fusion yields three fusion features, and a multi-stage recursive training method starting from the public BART checkpoint produces the heterogeneous fusion feature. The reference story and the heterogeneous fusion feature corresponding to each story sample are input into a BART decoder, and a predicted story text is acquired using a preset sampling strategy. A model total loss function is constructed and the model is trained on the story sample set until the total loss converges. After the trained multi-source heterogeneous feature fusion model is obtained, the text features, keyword features and common-sense features of a story to be generated are input into it to obtain the heterogeneous fusion feature, which is input into the BART decoder to generate the controllable story text.

Description

Controllable story generation method based on multi-source heterogeneous feature fusion
Technical Field
The invention relates to the technical field of natural language processing, in particular to a controllable story generation method based on multi-source heterogeneous feature fusion.
Background
Controllable story generation has been a hot research direction in natural language processing in recent years. Its goal is to control a neural network model through a keyword sequence so that it generates story texts that are fluent, thematically relevant and rich in plot. Because of the difficulty of generating long text from short input text, current research approaches share two main characteristics. First, a pre-trained language model such as GPT-2 or BART is used as the foundation; such models offer strong generalization ability, large parameter scale and good overall performance, which makes them highly attractive for story generation. Second, the generation process is controlled by a keyword sequence, usually an action sequence or an emotion sequence, which correspondingly controls the behavioral or psychological changes of the characters in the story. This essentially imitates the writing habit of a human author, who typically outlines the behavioral or psychological changes of a character after writing a topic sentence, thereby facilitating the expansion and refinement of the story.
Since the guide text and the keyword sequence are heterogeneous data, previous work has generally concatenated the two directly in the numerical vector space as model input, which prevents the model from effectively capturing the features of either. To address this, Tang et al. proposed the end-to-end EtriCA model, which uses a cross-attention mechanism to fuse the features of the guide text and the keyword sequence, reducing incoherent sentences and contradictory logic in the generated text. However, taking FIG. 1 as an example, a schematic of a story sample generated by EtriCA, the story text generated by EtriCA still exhibits two significant problems compared to the reference story: first, the grammatical structure is too uniform, with sentences basically limited to a subject-verb-object pattern and lacking modifiers; second, the semantic descriptions are too plain and the plot development too monotonous, reducing the richness and interest of the story. Early story generation systems relied on symbolic planning, which requires extremely tedious manual design and feature engineering and consumes substantial human and material resources. With the development of neural networks, the seq2seq story generation models used by Roemmele and by Fan et al. greatly alleviated this problem. To improve coherence between generated story sentences, Yao et al. proposed a two-stage story generation method that first generates an intermediate representation from the input and then generates the complete story from that representation. Fan et al. and Goldfarb-Tarrant et al. successively extended this approach, replacing the abstract intermediate representation with a concrete action sequence.
The advent of pre-trained language models raised story generation to a new level; because of their powerful overall performance, many researchers have incorporated them into their own model constructions. For example, Guan et al. proposed the HINT model using BART as the infrastructure. In recent years, with the rise of multimodal tasks, story generation has expanded from the original text-to-text mode to several emerging modes such as image-to-text and video-to-text.
In summary, most existing research adopts an end-to-end or two-stage story generation system that directly concatenates the guide text, the keyword sequence and the common-sense knowledge in the numerical vector space as model input, which damages the structure of the input text and lowers the model's utilization of it. In addition, existing methods add external common-sense knowledge only implicitly, so the model can exploit it only at the entity level and cannot build the association between the input as a whole and the common-sense knowledge. This lack of efficient application of common-sense knowledge leaves the generated stories with a uniform grammatical structure and low richness of semantic description, unable to attract readers.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is that the prior art cannot construct the association between the input as a whole and the common-sense knowledge and lacks efficient application of common-sense knowledge, so that the generated stories have a uniform grammatical structure and low richness of semantic description and cannot attract readers.
In order to solve the technical problems, the invention provides a controllable story generation method based on multi-source heterogeneous feature fusion, which comprises the following steps:
Acquiring a story sample set; extracting the first sentence of each story sample as the guide text, and taking the remaining sentences as the reference story; extracting verb phrases in the reference story as the keyword sequence;
carrying out coreference resolution on each story sample to obtain the story protagonist; obtaining the subject and object of each sentence of the story sample by using the NLTK toolkit, generating core plots taking the corresponding story protagonist as subject by using a pairing mechanism, and carrying out bidirectional reasoning to obtain the corresponding common-sense knowledge; the common-sense knowledge of all story samples constitutes a common-sense knowledge base;
encoding each word in all story samples in the story sample set to obtain a vocabulary;
the method comprises the steps of respectively inputting a guide text, a keyword sequence and common sense knowledge of any story sample into a BART encoder to obtain corresponding text features, keyword features and common sense features, and inputting the corresponding text features, keyword features and common sense features into a multi-source heterogeneous feature fusion model to obtain heterogeneous fusion features corresponding to the story sample, wherein the method specifically comprises the following steps:
inputting the text features and the common sense features into a first fusion module of the multi-source heterogeneous feature fusion model, and acquiring first fusion features based on a multi-head attention mechanism;
Inputting the keyword features and common-sense features into a second fusion module of the multi-source heterogeneous feature fusion model, processing them with a multi-head attention mechanism, and performing residual connection with the keyword features to obtain a second fusion feature;
inputting the text features and the keyword features into a third fusion module of the multi-source heterogeneous feature fusion model, processing the text features and the keyword features through a multi-head attention mechanism, and performing residual connection with the keyword features to obtain third fusion features;
acquiring a first checkpoint based on the first fusion feature and the published BART checkpoint by using a multi-stage recursive training method; acquiring a second checkpoint based on the second fusion feature and the first checkpoint; and acquiring the heterogeneous fusion feature corresponding to the story sample based on the third fusion feature and the second checkpoint;
inputting the reference story and the heterogeneous fusion feature corresponding to the story sample into a BART decoder for decoding to obtain the probability distribution over the vocabulary; starting from the guide text and using a preset sampling strategy, continuously selecting the word with the highest probability as the next word until the selected next word is the preset ending marker, thereby acquiring the predicted story text;
Constructing a model total loss function based on a cross entropy loss function of a predicted story text and a corresponding reference story and a negative log likelihood function of text features and keyword features; training the multi-source heterogeneous feature fusion model by using a story sample set until the total loss function of the model converges, and obtaining the multi-source heterogeneous feature fusion model after training;
acquiring a guide text and a keyword sequence of a story to be generated, and extracting text features and keyword features of the story to be generated; acquiring common sense features of all common sense knowledge in the common sense knowledge base to form common sense features of a story to be generated;
inputting the text features, the keyword features and the common sense features of the story to be generated into the multi-source heterogeneous feature fusion model for completing training; and generating heterogeneous fusion characteristics of the story to be generated, inputting the heterogeneous fusion characteristics into a BART decoder, and acquiring controllable story text corresponding to the story to be generated.
In one embodiment of the present invention, the extracting of verb phrases in the reference story as the keyword sequence includes:
extracting action elements, including the action trigger, the agent, the event, the instrument, the time and the place, according to the part of speech of each word and the dependency relationships between words in the reference story;
Taking the action triggers of all sentences as keywords, acquiring the keyword sequence of the reference story, represented as: <s> action_1 <sep> action_2 … <e>;
where <s> denotes the start marker of the keyword sequence, action_i denotes the i-th keyword in the sequence, <sep> denotes the separator between keywords, and <e> denotes the end marker of the keyword sequence.
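As a minimal sketch, the tagged keyword-sequence format above can be assembled as follows; the helper name `build_keyword_sequence` and the exact spacing around the markers are illustrative assumptions, not part of the patent:

```python
def build_keyword_sequence(actions):
    """Join action-trigger keywords into the tagged format
    <s>action_1<sep>action_2...<e> described above."""
    return "<s>" + "<sep>".join(actions) + "<e>"

# Example: action triggers extracted from three sentences of a reference story.
seq = build_keyword_sequence(["decided", "packed", "arrived"])
print(seq)  # <s>decided<sep>packed<sep>arrived<e>
```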
In one embodiment of the invention, the carrying out of coreference resolution on each story sample to obtain the story protagonist, the obtaining of the subject and object of each sentence using the NLTK toolkit, the generating of core plots taking the corresponding story protagonist as subject by using a pairing mechanism, and the carrying out of bidirectional reasoning to obtain the corresponding common-sense knowledge include:
performing coreference resolution on each story sample in the story sample set, and selecting the mention with the most occurrences as the story protagonist of that sample;
extracting the subject, predicate and object of each sentence in the story sample by using the NLTK toolkit to form the plot set of the story sample;
selecting, by a pairing mechanism, the triples taking the story protagonist as subject from the plot set to form the core plot sequence;
based on the reasoning modes of the psychological states caused by the events and the action behaviors caused by the events, respectively carrying out forward reasoning and reverse reasoning on two adjacent core episodes to generate an optimal path connection relation of common knowledge, wherein the optimal path connection relation is expressed as follows:
where G_link is the optimal path of common-sense reasoning between core plot p_i and its adjacent core plot p_j; XI and XR respectively denote the pre-event intent and the post-event reaction in the event-to-mental-state reasoning mode; XN and XW respectively denote the pre-event need and the post-event want in the event-to-action reasoning mode; P_XR, P_XW, P_XI and P_XN denote the maximum probabilities under the XR, XW, XI and XN modes, respectively; α_XR, α_XI, α_XN and α_XW denote the normalization constants under the XR, XI, XN and XW modes, respectively.
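The selection of the optimal reasoning path can be sketched as follows. Since the formula itself is not reproduced in the text, the scoring form (normalization constant times maximum probability, highest score wins) is an assumption, and the function name is illustrative:

```python
def best_reasoning_path(mode_probs, norm_consts):
    """Pick the common-sense relation mode (XI/XR/XN/XW) whose
    normalized maximum probability is highest, as a sketch of
    selecting the optimal path between two adjacent core plots.
    The scoring form alpha_m * P_m is an assumption."""
    scored = {m: norm_consts[m] * p for m, p in mode_probs.items()}
    return max(scored, key=scored.get)

# toy per-mode maximum probabilities and normalization constants
probs = {"XI": 0.42, "XR": 0.61, "XN": 0.33, "XW": 0.50}
alphas = {"XI": 1.0, "XR": 0.9, "XN": 1.1, "XW": 1.0}
print(best_reasoning_path(probs, alphas))  # XR
```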
In one embodiment of the present invention, the inputting the text feature and the common sense feature into the first fusion module of the multi-source heterogeneous feature fusion model, obtaining a first fusion feature based on a multi-head attention mechanism includes:
based on the multi-head attention mechanism, computing the query Q, key K and value V through trainable projection parameters, expressed as:
based on the query, key and value, acquiring the normalized first weight matrix, expressed as:
weighting and summing the elements of the first weight matrix to obtain the first fusion feature F_cy, expressed as:
where i denotes the i-th head of the multi-head attention mechanism; W_i^Q, W_i^K and W_i^V are preset trainable parameters; F_y = BART_Encoder_y(Y) denotes the common-sense features, where Y denotes the common-sense knowledge; F_c = BART_Encoder_c(C) denotes the text features, where C denotes the guide text; BART_Encoder(·) denotes the BART encoder; d denotes the dimension of the key K, and h denotes the number of heads in the multi-head attention mechanism.
In one embodiment of the present invention, the second fusion module for inputting the keyword feature and the common sense feature into the multi-source heterogeneous feature fusion model performs residual connection with the keyword feature after processing by a multi-head attention mechanism, to obtain a second fusion feature, including:
based on the multi-head attention mechanism, computing the query Q, key K and value V through trainable projection parameters, expressed as:
based on the query, key and value, acquiring the normalized second weight matrix, expressed as:
weighting and summing the elements of the second weight matrix to obtain the second preliminary fusion feature F_sy, expressed as:
connecting the second preliminary fusion feature F_sy and the keyword feature F_s by a residual connection to obtain the second fusion feature F̂_sy, expressed as:
where i denotes the i-th head of the multi-head attention mechanism; W_i^Q, W_i^K and W_i^V are preset trainable parameters; F_y = BART_Encoder_y(Y) denotes the common-sense features, where Y denotes the common-sense knowledge; F_s = BART_Encoder_s(S) denotes the keyword features, where S denotes the keyword sequence; β denotes the first preset scaling factor.
In one embodiment of the present invention, the third fusion module for inputting the text feature and the keyword feature into the multi-source heterogeneous feature fusion model performs residual connection with the keyword feature after processing by a multi-head attention mechanism, to obtain a third fusion feature, including:
based on the multi-head attention mechanism, computing the query Q, key K and value V through trainable projection parameters, expressed as:
based on the query, key and value, acquiring the normalized third weight matrix, expressed as:
weighting and summing the elements of the third weight matrix to obtain the third preliminary fusion feature F_cs, expressed as:
connecting the third preliminary fusion feature F_cs and the keyword feature F_s by a residual connection to obtain the third fusion feature F̂_cs, expressed as:
where W_i^Q, W_i^K and W_i^V are preset trainable parameters; F_c = BART_Encoder_c(C) denotes the text features, where C denotes the guide text; F_s = BART_Encoder_s(S) denotes the keyword features, where S denotes the keyword sequence; γ denotes the second preset scaling factor.
In one embodiment of the present invention, the acquiring of the first checkpoint based on the first fusion feature and the published BART checkpoint by using a multi-stage recursive training method, the acquiring of the second checkpoint based on the second fusion feature and the first checkpoint, and the acquiring of the heterogeneous fusion feature corresponding to the story sample based on the third fusion feature and the second checkpoint include:
acquiring the first checkpoint based on the first fusion feature F_cy and the published BART checkpoint, expressed as: H_cy = FMHF(H_0, F_cy);
acquiring the second checkpoint based on the second fusion feature F̂_sy and the first checkpoint, expressed as: H_sy = FMHF(H_cy, F̂_sy);
acquiring the heterogeneous fusion feature F_csy corresponding to the story sample based on the third fusion feature and the second checkpoint, expressed as: F_csy = FMHF(H_sy, F̂_cs);
where FMHF(·) denotes the multi-source heterogeneous feature fusion module, H_cy denotes the first checkpoint, H_0 denotes the published BART checkpoint, and H_sy denotes the second checkpoint.
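The multi-stage recursion above can be sketched as a simple composition of calls, each stage feeding the checkpoint produced by the previous one back into the fusion module; `toy_fmhf` is a stand-in that only records the call order, and all names are illustrative:

```python
def multi_stage_fusion(fmhf, h0, f_cy, f_sy_hat, f_cs_hat):
    """Sketch of the multi-stage recursive training flow, mirroring
    H_cy = FMHF(H_0, F_cy), then H_sy from H_cy, then F_csy from H_sy."""
    h_cy = fmhf(h0, f_cy)         # stage 1: start from BART's public checkpoint
    h_sy = fmhf(h_cy, f_sy_hat)   # stage 2: continue from the stage-1 checkpoint
    f_csy = fmhf(h_sy, f_cs_hat)  # stage 3: final heterogeneous fusion feature
    return f_csy

# toy stand-in for FMHF that just concatenates, to show the composition order
def toy_fmhf(ckpt, feat):
    return ckpt + "+" + feat

out = multi_stage_fusion(toy_fmhf, "H0", "Fcy", "Fsy", "Fcs")
print(out)  # H0+Fcy+Fsy+Fcs
```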
In one embodiment of the invention, the inputting of the reference story and the heterogeneous fusion feature corresponding to the story sample into a BART decoder for decoding to obtain the probability distribution over the vocabulary, and the acquiring of the predicted story text by continuously selecting the word with the highest probability as the next word, starting from the guide text and using a preset sampling strategy, until the selected next word is the preset ending marker, include:
inputting the reference story R and the heterogeneous fusion feature F_csy into the BART decoder for decoding, and obtaining the hidden state of the decoder at step t, expressed as: H_t = BART_Decoder(δ_<t, F_csy);
obtaining the probability distribution over the vocabulary, expressed as:
P(δ_t | δ_<t, X) = softmax(H_t W + b);
starting from the guide text and using the preset sampling strategy, continuously selecting the word with the highest probability as the next word until the selected next word is the preset ending marker, completing the prediction and obtaining the predicted story text;
where t denotes the decoding step; H_t is the hidden state of the decoder at step t, computed from the encoding-stage output F_csy and the already generated text δ_<t; W and b are trainable parameters; R denotes the reference story input to the decoder; P(δ_t | δ_<t, X) denotes the probability distribution over the vocabulary, and X denotes the model input.
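The preset sampling strategy described above is essentially greedy decoding, which can be sketched as follows; `next_token_probs` is a stand-in for the decoder's vocabulary distribution P(δ_t | δ_<t, X), and all names are illustrative:

```python
def greedy_decode(next_token_probs, prompt, end_token, max_len=50):
    """Starting from the guide text, repeatedly append the
    highest-probability word until the end marker is produced."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        dist = next_token_probs(tokens)
        best = max(dist, key=dist.get)  # argmax over the vocabulary
        if best == end_token:
            break
        tokens.append(best)
    return tokens

# toy distribution: favors "the" once, then the end marker
def toy_dist(tokens):
    return {"the": 0.1, "<e>": 0.9} if "the" in tokens else {"the": 0.9, "<e>": 0.1}

print(greedy_decode(toy_dist, ["Once"], "<e>"))  # ['Once', 'the']
```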
In an embodiment of the present invention, the acquiring of the heterogeneous fusion feature corresponding to the story sample further includes training the first fusion module, the second fusion module and the third fusion module to obtain the trained modules, which includes:
constructing the text-common-sense loss function L_cy based on the negative log-likelihood of the probability of the t-th character in the first fusion module, expressed as L_cy = -Σ_{t=1}^{r} log P(r_t | r_<t); training the first fusion module with the story sample set until the text-common-sense loss function converges, obtaining the trained first fusion module;
constructing the keyword-common-sense loss function L_sy based on the negative log-likelihood of the probability of the t-th character in the second fusion module, expressed as L_sy = -Σ_{t=1}^{u} log P(u_t | u_<t); training the second fusion module with the story sample set until the keyword-common-sense loss function converges, obtaining the trained second fusion module;
constructing the text-keyword loss function L_cs based on the negative log-likelihood of the probability of the t-th character in the third fusion module, expressed as L_cs = -Σ_{t=1}^{v} log P(v_t | v_<t); training the third fusion module with the story sample set until the text-keyword loss function converges, obtaining the trained third fusion module;
where r denotes the total number of characters in the first fusion module and r_t denotes its t-th character; u denotes the total number of characters in the second fusion module and u_t denotes its t-th character; v denotes the total number of characters in the third fusion module and v_t denotes its t-th character.
In one embodiment of the present invention, the model total loss function is constructed based on the cross-entropy loss function between the predicted story text and the corresponding reference story and the text-keyword loss function of the text features and keyword features, expressed as:
model total loss function L_all = L_lm + λ·L_cs;
where L_lm, the cross-entropy loss function between the predicted story text and the corresponding reference story, is expressed as: L_lm = -Σ_{t=1}^{n} P_*(δ_t | δ_<t, R) · log P(δ_t | δ_<t, X);
where λ is a preset scale factor; X denotes the model input and R denotes the reference story; the one-hot vector P_*(δ_t | δ_<t, R) represents the probability distribution of the reference story; and n denotes the total number of words in the reference story.
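The total loss combination can be sketched numerically as follows. With one-hot reference distributions, the cross-entropy L_lm reduces to a negative log-likelihood over reference tokens; the value of λ and all function names here are illustrative:

```python
import math

def nll(probs_of_targets):
    """Negative log-likelihood of a sequence of target-token probabilities."""
    return -sum(math.log(p) for p in probs_of_targets)

def total_loss(p_ref_tokens, p_cs_tokens, lam=0.5):
    """Sketch of L_all = L_lm + lambda * L_cs."""
    l_lm = nll(p_ref_tokens)  # cross-entropy with one-hot reference targets
    l_cs = nll(p_cs_tokens)   # text-keyword fusion-module loss
    return l_lm + lam * l_cs

loss = total_loss([0.9, 0.8], [0.7], lam=0.5)
```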
Compared with the prior art, the technical scheme of the invention has the following advantages:
According to the controllable story generation method based on multi-source heterogeneous feature fusion, the guide text, the keyword sequence and the common-sense knowledge are encoded to obtain the corresponding text features, keyword features and common-sense features; these are fused pairwise by a multi-head attention mechanism to obtain three fusion features, and multi-stage recursive training yields the heterogeneous fusion feature, fully exploiting the associations between the heterogeneous inputs and realizing the explicit addition of common-sense knowledge and the accurate capture of information features. The heterogeneous fusion feature is input into the decoder for decoding to obtain the controllable story text; based on this accurate and rich heterogeneous fusion feature, the richness of the generated story is improved while its fluency and relevance are preserved. In the invention, residual connections with the keyword features are added in the second and third fusion modules when generating the fusion features, so that the subsequent development of the story is controlled more effectively and the controllability of the generated text is enhanced.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings, in which
FIG. 1 is a schematic illustration of a story sample generated by EtriCA;
FIG. 2 is a flow chart of a controllable story generation method based on multi-source heterogeneous feature fusion provided by the present invention;
FIG. 3 is a schematic diagram of the common-sense knowledge reasoning flow provided by the present invention;
FIG. 4 (a) is a graph of consistency versus graph on the ROC validation set; FIG. 4 (b) is a graph of consistency versus WP verification set; FIG. 4 (c) is a correlation comparison plot over the ROC validation set; fig. 4 (d) is a correlation comparison plot over the WP validation set.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
For a story, fluent and logical sentences alone are not enough; the diversity of grammatical structures and the richness of semantic descriptions are key to attracting readers. Existing story generation models, first, lack efficient application of common-sense knowledge, which deprives the model of one information source and compromises its overall performance; second, their end-to-end structure allows the model to capture the features of heterogeneous data only coarsely, so the generated text lacks detailed descriptions. The invention proposes a novel multi-stage controllable story generation model, FMHF, based on EtriCA. FMHF has three modules that respectively fuse the features of the guide text, the keyword sequence and the common-sense knowledge in the numerical vector space by means of a cross-attention mechanism; the three new features are then recursively trained in multiple stages into a final feature, which controls the generation of the story text.
Referring to fig. 2, a flowchart of a controllable story generation method based on multi-source heterogeneous feature fusion according to the present invention includes the specific steps of:
s101: acquiring a story sample set; extracting the first sentence of each story sample as the guide text, and taking the remaining sentences as the reference story; extracting verb phrases in the reference story as the keyword sequence;
s102: carrying out coreference resolution on each story sample to obtain the story protagonist; obtaining the subject and object of each sentence by using the NLTK toolkit, generating core plots taking the corresponding story protagonist as subject by using a pairing mechanism, and carrying out bidirectional reasoning to obtain the corresponding common-sense knowledge; the common-sense knowledge of all story samples constitutes a common-sense knowledge base;
s102-1: performing coreference resolution on each story sample in the story sample set, and selecting the mention with the most occurrences as the story protagonist of that sample;
s102-2: extracting the subject, predicate and object of each sentence in the story sample by using the NLTK toolkit to form the plot set of the story sample;
s102-3: selecting, by a pairing mechanism, the triples taking the story protagonist as subject from the plot set to form the core plot sequence;
S102-4: based on the reasoning modes of the psychological states caused by the events and the action behaviors caused by the events, respectively carrying out forward reasoning and reverse reasoning on two adjacent core episodes to generate an optimal path connection relation of common knowledge, wherein the optimal path connection relation is expressed as follows:
where G_link is the optimal path of common-sense reasoning between core plot p_i and its adjacent core plot p_j; XI and XR respectively denote the pre-event intent and the post-event reaction in the event-to-mental-state reasoning mode; XN and XW respectively denote the pre-event need and the post-event want in the event-to-action reasoning mode; P_XR, P_XW, P_XI and P_XN denote the maximum probabilities under the XR, XW, XI and XN modes, respectively; α_XR, α_XI, α_XN and α_XW denote the normalization constants under the XR, XI, XN and XW modes, respectively.
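The pairing step of s102-3 can be sketched as a simple filter over subject-predicate-object triples; the function and data are illustrative, and the real system extracts the triples with NLTK rather than writing them by hand:

```python
def core_plot_sequence(protagonist, svo_triples):
    """From the per-sentence (subject, predicate, object) triples,
    keep only those whose subject is the story protagonist,
    forming the core plot sequence."""
    return [t for t in svo_triples if t[0] == protagonist]

triples = [("Anna", "packed", "bag"),
           ("train", "left", "station"),
           ("Anna", "missed", "train")]
print(core_plot_sequence("Anna", triples))  # [('Anna', 'packed', 'bag'), ('Anna', 'missed', 'train')]
```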
S103: encoding each word in all story samples in the story sample set to obtain a vocabulary;
s104: the method comprises the steps of respectively inputting a guide text, a keyword sequence and common sense knowledge of any story sample into a BART encoder to obtain corresponding text features, keyword features and common sense features, and inputting the corresponding text features, keyword features and common sense features into a multi-source heterogeneous feature fusion model to obtain heterogeneous fusion features corresponding to the story sample, wherein the method specifically comprises the following steps:
S104-1: inputting the text features and the common sense features into a first fusion module of the multi-source heterogeneous feature fusion model, and acquiring first fusion features based on a multi-head attention mechanism;
based on the multi-head attention mechanism, computing the query Q, key K and value V through trainable projection parameters, expressed as:
based on the query, key and value, acquiring the normalized first weight matrix, expressed as:
weighting and summing the elements of the first weight matrix to obtain the first fusion feature F_cy, expressed as:
where i denotes the i-th head of the multi-head attention mechanism; W_i^Q, W_i^K and W_i^V are preset trainable parameters; F_y = BART_Encoder_y(Y) denotes the common-sense features, where Y denotes the common-sense knowledge; F_c = BART_Encoder_c(C) denotes the text features, where C denotes the guide text; BART_Encoder(·) denotes the BART encoder; d denotes the dimension of the key K, and h denotes the number of heads in the multi-head attention mechanism.
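A minimal single-head sketch of the cross-attention fusion in S104-1, with the text features acting as queries and the common-sense features as keys and values; the real module is multi-head with trainable projection matrices, which are omitted here for brevity:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """softmax(QK^T / sqrt(d)) V over toy feature lists."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)  # normalized weight row
        fused = [sum(wi * v[j] for wi, v in zip(w, values))
                 for j in range(len(values[0]))]  # weighted sum of values
        out.append(fused)
    return out

F_c = [[1.0, 0.0]]              # toy text feature (one token, d = 2)
F_y = [[1.0, 0.0], [0.0, 1.0]]  # toy common-sense features
fused = cross_attention(F_c, F_y, F_y)
```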
S104-2: inputting the keyword features and common sense features into a second fusion module of the multi-source heterogeneous feature fusion model, processing the second fusion module by a multi-head attention mechanism, and performing residual connection with the keyword features to obtain second fusion features;
Based on the multi-head attention mechanism, the query Q, key K and value V are obtained through trainable projection matrices, expressed as:

Q_i = F_s W_i^Q; K_i = F_y W_i^K; V_i = F_y W_i^V;

Based on the query, key and value, a normalized second weight matrix A_i^sy is obtained, expressed as:

A_i^sy = softmax(Q_i K_i^T / √d);

The values are weighted and summed by the elements of the second weight matrix to obtain the second preliminary fusion feature F_sy, expressed as:

F_sy = Concat(A_1^sy V_1, …, A_h^sy V_h) W^O;

The second preliminary fusion feature F_sy and the keyword features F_s are connected residually to obtain the second fusion feature F'_sy, expressed as:

F'_sy = F_s + β·F_sy;

wherein i denotes the i-th head of the multi-head attention mechanism; W_i^Q, W_i^K, W_i^V and W^O are all preset trainable parameters; F_y denotes the common-sense features, expressed as F_y = BART_Encoder_y(Y), where Y denotes the common-sense knowledge; F_s denotes the keyword features, expressed as F_s = BART_Encoder_s(S), where S denotes the keyword sequence; β denotes a first preset scale factor.
S104-3: inputting the text features and the keyword features into a third fusion module of the multi-source heterogeneous feature fusion model, processing the text features and the keyword features through a multi-head attention mechanism, and performing residual connection with the keyword features to obtain third fusion features;
Based on the multi-head attention mechanism, the query Q, key K and value V are obtained through trainable projection matrices, expressed as:

Q_i = F_c W_i^Q; K_i = F_s W_i^K; V_i = F_s W_i^V;

Based on the query, key and value, a normalized third weight matrix A_i^cs is obtained, expressed as:

A_i^cs = softmax(Q_i K_i^T / √d);

The values are weighted and summed by the elements of the third weight matrix to obtain the third preliminary fusion feature F_cs, expressed as:

F_cs = Concat(A_1^cs V_1, …, A_h^cs V_h) W^O;

The third preliminary fusion feature F_cs and the keyword features F_s are connected residually to obtain the third fusion feature F'_cs, expressed as:

F'_cs = F_s + γ·F_cs;

wherein W_i^Q, W_i^K, W_i^V and W^O are all preset trainable parameters; F_c denotes the text features, expressed as F_c = BART_Encoder_c(C), where C denotes the guide text; F_s denotes the keyword features, expressed as F_s = BART_Encoder_s(S), where S denotes the keyword sequence; γ denotes a second preset scale factor.
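The cross-attention fusion shared by the three fusion modules can be sketched as follows. This is a single-head, numpy-only simplification (the method uses h heads with separate trainable projections per head); the function names, toy shapes and random weights are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(F_q, F_kv, Wq, Wk, Wv, residual=None, scale=0.1):
    # F_q supplies the query; F_kv supplies key and value (e.g. for the
    # CY module: F_q = text features, F_kv = common-sense features).
    Q, K, V = F_q @ Wq, F_kv @ Wk, F_kv @ Wv
    d = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # normalized weight matrix
    fused = A @ V                       # weighted sum of the values
    if residual is not None:            # SY/CS-style scaled residual
        fused = residual + scale * fused
    return fused

rng = np.random.default_rng(0)
F_c = rng.normal(size=(5, 8))   # text features: 5 tokens, dim 8
F_y = rng.normal(size=(7, 8))   # common-sense features: 7 tokens
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
F_cy = cross_attention_fuse(F_c, F_y, Wq, Wk, Wv)
print(F_cy.shape)  # (5, 8): the fused feature keeps the query length
```

Passing `residual`/`scale` models the residual connection applied in the keyword-side modules.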
S104-4: Using a multi-stage recursive training method, obtain a first checkpoint based on the first fusion feature and the published BART checkpoint; obtain a second checkpoint based on the second fusion feature and the first checkpoint; and obtain the heterogeneous fusion feature corresponding to the story sample based on the third fusion feature and the second checkpoint;

Based on the first fusion feature F_cy and the published BART checkpoint, the first checkpoint is obtained, expressed as: H_cy = FMHF(H_0, F_cy);

Based on the second fusion feature F'_sy and the first checkpoint, the second checkpoint is obtained, expressed as: H_sy = FMHF(H_cy, F'_sy);

Based on the third fusion feature F'_cs and the second checkpoint, the heterogeneous fusion feature F_csy corresponding to the story sample is obtained, expressed as: F_csy = FMHF(H_sy, F'_cs);

wherein FMHF() denotes the multi-source heterogeneous feature fusion module; H_cy denotes the first checkpoint; H_0 denotes the published BART checkpoint; H_sy denotes the second checkpoint.
S104-5: Construct a text common-sense loss function L_cy based on the negative log-likelihood of the probability of the t-th character in the first fusion module, expressed as L_cy = -Σ_{t=1}^{r} log P(r_t | r_{<t}), and train the first fusion module with the story sample set until the text common-sense loss function converges, obtaining the trained first fusion module. Construct a keyword common-sense loss function L_sy based on the negative log-likelihood of the probability of the t-th character in the second fusion module, expressed as L_sy = -Σ_{t=1}^{u} log P(u_t | u_{<t}), and train the second fusion module with the story sample set until the keyword common-sense loss function converges, obtaining the trained second fusion module. Construct a text keyword loss function L_cs based on the negative log-likelihood of the probability of the t-th character in the third fusion module, expressed as L_cs = -Σ_{t=1}^{v} log P(v_t | v_{<t}), and train the third fusion module with the story sample set until the text keyword loss function converges, obtaining the trained third fusion module. Here r denotes the total number of characters in the first fusion module and r_t its t-th character; u denotes the total number of characters in the second fusion module and u_t its t-th character; v denotes the total number of characters in the third fusion module and v_t its t-th character.
S105: Input the reference story and the heterogeneous fusion feature corresponding to the story sample into the BART decoder for decoding to obtain a probability distribution over the vocabulary; starting from the guide text, use a preset sampling strategy to repeatedly select the word with the highest probability as the next word until the selected next word is a preset end mark, obtaining the predicted story text;
The reference story R and the heterogeneous fusion feature F_csy are input into the BART decoder for decoding, and the hidden state of the decoder at step t is obtained, expressed as: H_t = BART_Decoder(δ_{<t}, F_csy);

The probability distribution over the vocabulary is obtained, expressed as:

P(δ_t | δ_{<t}, R) = softmax(H_t W + b);

Starting from the guide text, the preset sampling strategy repeatedly selects the word with the highest probability as the next word until the selected next word is a preset end mark, completing prediction and obtaining the predicted story text;

wherein t denotes the step number; H_t is the hidden state of the decoder at step t, calculated from the fusion feature F_csy output by the encoding stage and the already generated text δ_{<t}; W and b are trainable parameters; R denotes the reference story input to the decoder; P(δ_t | δ_{<t}, R) denotes the probability distribution over the vocabulary.
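The per-step vocabulary distribution P(δ_t | δ_{<t}, R) = softmax(H_t W + b) can be sketched as below; the helper name and the toy hidden/vocabulary sizes are illustrative assumptions.

```python
import numpy as np

def vocab_distribution(H_t, W, b):
    # softmax(H_t W + b): H_t is the decoder hidden state at step t,
    # W and b are the trainable output projection parameters.
    logits = H_t @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, vocab = 8, 12                       # toy sizes
P = vocab_distribution(rng.normal(size=hidden),
                       rng.normal(size=(hidden, vocab)),
                       np.zeros(vocab))
print(P.shape, round(float(P.sum()), 6))    # (12,) 1.0
```

Greedy selection of the next word then amounts to taking the argmax of `P` at each step.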
S106: training the multi-source heterogeneous feature fusion model by using a story sample set based on a cross entropy loss function of a predicted story text and a corresponding story sample until the cross entropy loss function converges, and obtaining a multi-source heterogeneous feature fusion model after training;
The model total loss function is L_all = L_lm + λL_cs;

wherein L_lm is the cross-entropy loss function between the predicted story text and the corresponding reference story, expressed as:

L_lm = -Σ_{t=1}^{N} P*(δ_t | δ_{<t}, R) · log P(δ_t | δ_{<t}, R);

wherein λ is a preset scale factor; X denotes the model input and R the reference story; the one-hot vector P*(δ_t | δ_{<t}, R) denotes the probability distribution of the reference story; N denotes the total number of words in the reference story.
S107: acquiring a guide text and a keyword sequence of a story to be generated, and extracting text features and keyword features of the story to be generated; acquiring common sense features of all common sense knowledge in the common sense knowledge base to form common sense features of a story to be generated;
S108: Input the text features, keyword features and common-sense features of the story to be generated into the trained multi-source heterogeneous feature fusion model to generate the heterogeneous fusion feature of the story to be generated, and input it into the BART decoder to obtain the controllable story text corresponding to the story to be generated.
Controllable story generation can control the subsequent development of the story text according to a keyword sequence, thereby improving consistency and contextual relevance between sentences, and researchers usually select different types of keyword sequences to control story generation. For example, Brahman and Chaturvedi use reinforcement learning to control changes in the emotion trajectories of characters in stories; Kong et al. first extract action sequences and emotion sequences from the original text and then use the two sequences respectively to guide story generation; Xie et al. form a chain of mental states for characters according to both needs and emotions, thereby controlling story generation; Wang et al. achieve controllable story generation by concatenating three elements, namely characters, actions and emotions, and inputting them into the model; Tang et al. use a cross-attention mechanism to fuse the guide text with an event sequence, controlling story generation through the fused features. In this embodiment, the action sequence is selected as the keyword sequence, on the one hand because the behavior of characters in the selected datasets changes far more than their psychology does, and on the other hand because complicated emotion labels require a large amount of manual annotation.
Specifically, in step S101, extracting verb phrases in the reference story as the keyword sequence comprises: extracting action behaviors, including the behavior trigger, actor, event, tool, time and place, according to the part of speech of each word and the dependency relationships between words in the reference story; then taking the behavior triggers of all sentences as keywords, the keyword sequence of the reference story is obtained and represented as: <s> action_1 <sep> action_2 … <e>; wherein <s> denotes the start mark of the keyword sequence, action_i denotes the i-th keyword in the keyword sequence, <sep> denotes the separator of the keyword sequence, and <e> denotes the end mark of the keyword sequence.
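The keyword-sequence formatting described above can be sketched as a one-line helper; the exact whitespace around the special marks is an assumption, since the text only specifies the marks themselves, and the example actions are invented.

```python
def build_keyword_sequence(actions):
    # Join extracted behavior triggers into the plain-text format
    # <s> action_1 <sep> action_2 ... <e> used as model input.
    return "<s> " + " <sep> ".join(actions) + " <e>"

seq = build_keyword_sequence(
    ["packed her bags", "drove to the coast", "watched the sunrise"])
print(seq)
# <s> packed her bags <sep> drove to the coast <sep> watched the sunrise <e>
```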
Existing research has already shown that adding external common-sense knowledge helps improve the overall performance of a model. Guan et al. directly pre-train the model on a common-sense reasoning dataset to provide it with additional semantic information; Xu et al. add a local knowledge base to the model and generate story text using the relevance between entities; Zhang et al. propose a method that dynamically retrieves a local knowledge base so that the model better captures associations between entities when generating story text. The above methods basically belong to implicit addition of common-sense knowledge, in which case the model can only use common-sense knowledge at the entity level and cannot build a correlation between the input as a whole and the common-sense knowledge, so this embodiment uses a method of explicitly adding common-sense knowledge. The biggest obstacle to explicitly adding knowledge to a model in the past was the pairing problem of heterogeneous data, but this embodiment realizes the fusion of multi-source heterogeneous features with a cross-attention mechanism, making explicit addition of knowledge possible.
In summary, the invention provides a model structure that fuses multi-source heterogeneous features with a cross-attention mechanism, which can make full use of the correlations between heterogeneous inputs, accurately capture information features, and improve the richness of the generated text while ensuring fluency and relevance. To use the additional information provided by common-sense knowledge more efficiently, the invention provides a method for applying common-sense knowledge explicitly: first the common-sense knowledge matching the guide text and the keyword sequence is obtained and converted into readable natural language, and then it is input into the model together with the guide text and the keyword sequence. Compared with implicit methods, this method can establish an association between the input as a whole and the common-sense knowledge, and more directly provides the model with an additional information source.
Based on the above embodiments, in the embodiments of the present invention, the model of Multi-source heterogeneous feature Fusion (FMHF) provided by the present invention is utilized, which fuses three heterogeneous features of guide text, keyword sequences and common knowledge through a cross-attention mechanism. The fusion mechanism enables the model to fully utilize the relation among three heterogeneous inputs, accurately capture information characteristics and generate high-quality story text.
In order for the model to adequately capture information features from the guide text, keyword sequence and common-sense knowledge, and to generate story text with better fluency, consistency, relevance and richness, the model takes three inputs: the guide text C = {c_1, c_2, …, c_l}, taken from the first sentence of each story, which guides the development of the story; the keyword sequence S = {s_1, s_2, …, s_m}, extracted from the subsequent sentences of each story, which controls the subsequent development of events; and the common-sense knowledge K = {k_1, k_2, …, k_n}, generated by reasoning over the key plots of the story, which provides additional reference information for the development of the story. The output of the model is the generated story text, whose element w_i^j denotes the j-th word of the i-th sentence.
The generation process of the story text specifically comprises the following steps:

S201: The approach taken in EtriCA is followed in constructing keyword sequences. For each sentence in the dataset, the behavior trigger and other important components are extracted according to the part-of-speech (POS) of each word and the dependency relationships (DEP) between words to form a complete action behavior. Here the action sequence is composed of verb phrases rather than single verbs, and EtriCA has demonstrated the superiority of this approach. To turn the extracted keyword sequence into plain natural text convenient for input into the model, special characters are used to join the keywords together, e.g. <s> action_1 <sep> action_2 … <e>, where <s> denotes the beginning of the keyword sequence, action_i denotes the i-th keyword, <sep> denotes the separator of the keyword sequence, and <e> denotes the end of the keyword sequence.
S202: Adding common-sense knowledge to the model benefits natural-language understanding and generation, and ATOMIC is selected as the common-sense reasoning dataset in this embodiment. ATOMIC contains a large amount of causal-reasoning knowledge, with content composed in if-then format and mainly divided into three reasoning modes: events cause mental states, events cause action behaviors, and events cause personalities. The event-causes-mental-state and event-causes-action reasoning modes are selected here, together with the 4 reasoning relations XIntent, XReaction, XNeed and XWant; explanations of the reasoning relations are shown in Table 1.
Table 1 explanation of the reasoning relationship
To enable explicit addition of common-sense knowledge to model training, it must be converted from relational triples into readable natural language and matched with the guide text and keyword sequence. Inspired by the work of Amantarou et al., as shown in Fig. 3, in the core-plot extraction module, on the one hand, coreference resolution is performed on the story text and the mention with the largest number of occurrences is selected and set as the protagonist c of the story; on the other hand, the NLTK toolkit is used to extract the subject, predicate and object of each sentence in the story text to form a set T = {(s_1, r_1, o_1), (s_2, r_2, o_2), …, (s_n, r_n, o_n)}; then a pairing mechanism is used to screen out the triples whose subject s is the protagonist c, composing the core plot sequence.
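Assuming coreference resolution and subject-verb-object extraction have already been run upstream (for example with NLTK), the protagonist selection and core-plot filtering step might look like this sketch; the function name and toy data are illustrative, not from the patent.

```python
from collections import Counter

def core_plots(svo_triples, mention_counts):
    # Pick the most frequently mentioned character as protagonist c,
    # then keep only the (subject, relation, object) triples whose
    # subject is c, forming the core plot sequence.
    protagonist, _ = Counter(mention_counts).most_common(1)[0]
    return [t for t in svo_triples if t[0] == protagonist]

triples = [("Anna", "found", "a map"),
           ("Ben", "laughed", ""),
           ("Anna", "followed", "the trail")]
mentions = {"Anna": 4, "Ben": 1}
print(core_plots(triples, mentions))
# [('Anna', 'found', 'a map'), ('Anna', 'followed', 'the trail')]
```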
In the common-sense reasoning module, the COMET model performs common-sense reasoning between adjacent core plots p_i and p_j based on the 4 reasoning relations. Forward reasoning starts from p_i, inferring n times over m plot candidates based on the XReaction and XWant relations; reverse reasoning starts from p_j, inferring n times over m plot candidates based on the XIntent and XNeed relations. Let P_XR, P_XW, P_XI and P_XN be the maximum probabilities of the common-sense knowledge generated when COMET reasons n times under the 4 relations; then the optimal path G_link is given by:

G_link = argmax(α_XR·P_XR + α_XW·P_XW + α_XI·P_XI + α_XN·P_XN);

wherein G_link is the optimal path of common-sense reasoning between the core plots p_i and p_j, and α_XR, α_XI, α_XN, α_XW are normalization constants.
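A sketch of the optimal-path selection, under the assumption that G_link is the candidate maximizing the α-normalized sum of the four maximum probabilities; the source defines P_XR, P_XW, P_XI, P_XN and the α constants, but this exact combination and all candidate data below are reconstructions.

```python
def best_link(candidates, alpha=0.5):
    # Score each candidate reasoning path by the alpha-weighted sum of
    # its maximum probabilities under XReaction, XWant, XIntent, XNeed,
    # and return the highest-scoring path as G_link.
    def score(probs):
        return sum(alpha * probs[r] for r in ("XR", "XW", "XI", "XN"))
    return max(candidates.items(), key=lambda kv: score(kv[1]))[0]

candidates = {
    "felt relieved":  {"XR": 0.6, "XW": 0.4, "XI": 0.5, "XN": 0.3},
    "wanted revenge": {"XR": 0.2, "XW": 0.7, "XI": 0.3, "XN": 0.2},
}
print(best_link(candidates))  # felt relieved
```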
S203: Conventional story generation models usually concatenate the guide text, keyword sequence and common-sense knowledge directly in the digital vector space as model input, which destroys the structure of the input text and leads to low utilization of the input. To solve the pairing problem of heterogeneous multi-source features and fuse them, as shown in Fig. 2, the guide text C, keyword sequence S and common-sense knowledge K are first encoded separately with BART to capture the different features F_c, F_s and F_y; then the three features are input into the multi-source heterogeneous feature fusion module, which comprises three sub-modules: the CY fusion module, SY fusion module and CS fusion module, using a cross-attention mechanism to fuse (F_c, F_y), (F_s, F_y) and (F_c, F_s) respectively. Taking the CY module as an example, the fusion feature F_cy is calculated as:

F_c = BART_Encoder_c(C);

F_y = BART_Encoder_y(Y);

Q_i = F_c W_i^Q; K_i = F_y W_i^K; V_i = F_y W_i^V;

F_cy = Concat(softmax(Q_1 K_1^T / √d) V_1, …, softmax(Q_h K_h^T / √d) V_h) W^O;

wherein C and Y denote the guide text and the common-sense knowledge as input; BART_Encoder_c and BART_Encoder_y both inherit the public pre-training parameters of BART but differ in fine-tuning; W_i^Q, W_i^K, W_i^V and W^O are trainable parameters; d denotes the dimension of the key; h denotes the number of heads in the multi-head attention, and i denotes the i-th head. The other two fusion features F_sy and F_cs are calculated similarly to F_cy.
Among the three heterogeneous features F_c, F_s and F_y, F_y provides reference information, F_c guides the development of the story, and F_s controls the subsequent development of events with the aim of enhancing the controllability of the generated text. Following the practice in EtriCA, this embodiment adds a residual connection of F_s to the SY fusion module and the CS fusion module, so the calculations of F_sy and F_cs change to:

F'_sy = F_s + β·F_sy;

F'_cs = F_s + γ·F_cs;

where β and γ are scale factors controlling the proportions of F_sy and F_cs in the residual connections.
After the three fusion features F_cy, F'_sy and F'_cs are obtained, multi-stage recursive training is applied to generate the final feature F_csy, calculated as:

H_cy = FMHF(H_0, F_cy);

H_sy = FMHF(H_cy, F'_sy);

F_csy = FMHF(H_sy, F'_cs);

wherein H_0 is the published checkpoint of BART_large; H_cy and H_sy are the checkpoints after the first and second training stages; F_cy, F'_sy and F'_cs are the three fusion features; F_csy is the resulting fusion feature, which fuses the information of the guide text C, the keyword sequence S and the common-sense knowledge Y, and is input to the decoder for character prediction and story text generation.
S204: The decoding stage adopts the conventional autoregressive decoding method. This embodiment decodes with the BART decoder and computes the loss function by comparison with the reference story to generate the character δ_t. The specific calculation formulas are:

H_t = BART_Decoder(δ_{<t}, F_csy);

P(δ_t | δ_{<t}, R) = softmax(H_t W + b);

wherein t denotes the step number; H_t is the hidden state of the decoder at step t, calculated from the fusion feature F_csy output by the encoding stage and the already generated text δ_{<t}; W and b are both trainable parameters; R denotes the reference story input to the decoder; P(δ_t | δ_{<t}, R) denotes the probability distribution over the vocabulary, and character prediction at each step can then be realized with a sampling strategy such as Top-k sampling or Top-p sampling.
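Top-p (nucleus) sampling, mentioned above as one usable per-step sampling strategy, can be sketched in pure Python; the toy vocabulary and probabilities are illustrative.

```python
import random

def nucleus_sample(probs, p=0.9, seed=42):
    # Keep the smallest set of words whose cumulative probability
    # reaches p, renormalize, and sample from that nucleus. Greedy
    # decoding is the small-p limit (the nucleus shrinks to the argmax).
    rng = random.Random(seed)
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    nucleus, cum = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(pr for _, pr in nucleus)
    r, acc = rng.random() * total, 0.0
    for word, prob in nucleus:
        acc += prob
        if acc >= r:
            return word
    return nucleus[-1][0]

vocab_probs = {"the": 0.5, "a": 0.3, "dragon": 0.15, "<e>": 0.05}
print(nucleus_sample(vocab_probs, p=0.8))  # a word from the {"the", "a"} nucleus
```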
Because FMHF is a multi-stage story generation model, a corresponding loss function is produced during the training of each stage and when finally generating the story text. The loss functions are calculated as follows:

L_all = L_lm + λL_cs;

wherein L_lm is the cross-entropy loss function between the predicted story text and the corresponding reference story, expressed as L_lm = -Σ_{t=1}^{N} P*(δ_t | δ_{<t}, R) · log P(δ_t | δ_{<t}, R);

L_cy is the text common-sense loss function, expressed as L_cy = -Σ_{t=1}^{r} log P(r_t | r_{<t});

L_sy is the keyword common-sense loss function, expressed as L_sy = -Σ_{t=1}^{u} log P(u_t | u_{<t});

L_cs is the text keyword loss function, expressed as L_cs = -Σ_{t=1}^{v} log P(v_t | v_{<t});

wherein L_lm denotes the language-model loss function, and L_cy, L_sy, L_cs denote the loss functions of the three training stages, all of which use the negative log-likelihood of the corresponding character probabilities; R denotes the reference story input to the decoder; r, u, v respectively denote the total numbers of characters in the sentences during each training stage, and r_t, u_t, v_t their t-th characters; N denotes the number of characters in the generated story text and δ_t the generated t-th character; the scale factor λ controls the proportion of L_cs in L_all; L_all is the total loss function, and by continually reducing L_all, FMHF can generate high-quality story text.
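The negative log-likelihood form shared by the stage losses, together with the total objective L_all = L_lm + λL_cs, can be sketched as follows; the helper names and toy probabilities are illustrative.

```python
import math

def stage_nll(token_probs):
    # Negative log-likelihood of the gold characters, the common form of
    # L_cy, L_sy and L_cs: -sum_t log P(r_t | r_<t).
    return -sum(math.log(p) for p in token_probs)

def total_loss(l_lm, l_cs, lam=1.0):
    # Total objective: L_all = L_lm + lambda * L_cs.
    return l_lm + lam * l_cs

gold_probs = [0.9, 0.8, 0.95]        # P assigned to three gold characters
l = stage_nll(gold_probs)
print(round(l, 4))                   # 0.3798
print(round(total_loss(l, 0.5), 4))  # 0.8798
```

A perfectly confident model (all probabilities 1) gives zero loss, matching the intuition that the loss only penalizes probability mass placed off the reference characters.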
Specifically, in step S203, the algorithmic process of generating the final feature F_csy from the guide text C, the keyword sequence S and the common-sense knowledge Y comprises:

Input: guide text C, keyword sequence S, common-sense knowledge Y;

Output: final fusion feature F_csy.

Encode the three heterogeneous inputs C, S and Y separately with the BART encoder to obtain the three heterogeneous features F_c, F_s, F_y;

Fuse the features F_c and F_y with the CY fusion module to obtain the fusion feature F_cy;

First stage: train the fusion feature F_cy based on the BART checkpoint;

Fuse the features F_s and F_y with the SY fusion module to obtain the fusion feature F'_sy;

Second stage: train the fusion feature F'_sy based on the checkpoint obtained in the first stage;

Fuse the features F_c and F_s with the CS fusion module to obtain the fusion feature F'_cs;

Third stage: train the fusion feature F'_cs based on the checkpoint obtained in the second stage to obtain the final fusion feature F_csy.
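The three-stage recursive training above can be sketched as a pipeline skeleton; `train_stage` stands in for one stage of fine-tuning, and every name and the toy stand-in below are illustrative, not from the patent.

```python
def fmhf_multistage(H0, F_cy, F_sy, F_cs, train_stage):
    # Each stage starts from the checkpoint produced by the previous one
    # (H0 is the public BART checkpoint) and trains on one fusion feature.
    H_cy = train_stage(H0, F_cy)     # stage 1: CY fusion on the BART ckpt
    H_sy = train_stage(H_cy, F_sy)   # stage 2: SY fusion on the stage-1 ckpt
    return train_stage(H_sy, F_cs)   # stage 3: CS fusion -> final F_csy

# Toy stand-in: a "checkpoint" is the chain of features it was trained on.
chain = fmhf_multistage("H0", "Fcy", "Fsy", "Fcs",
                        train_stage=lambda ckpt, f: f"{ckpt}->{f}")
print(chain)  # H0->Fcy->Fsy->Fcs
```

The toy output makes the recursion visible: each stage's checkpoint accumulates everything learned in earlier stages.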
Based on the above embodiments, in this embodiment, to verify the effectiveness of the controllable story generation method based on multi-source heterogeneous feature fusion provided by the invention, experiments are performed on the ROC Stories and WritingPrompts datasets. The experimental results show that FMHF achieves significant improvements over the other baseline models on both automatic and manual evaluation metrics, verifying the feasibility and superiority of the model in fusing multi-source heterogeneous features.
This embodiment uses two well-known datasets in the story generation field, ROC Stories (ROC) and WritingPrompts (WP). Because the datasets used here lack emotion sequences, only action sequences can be used as the control factor. ROC is a short-story dataset in which each story consists of 5 short sentences, while WP is a long-story dataset in which each story contains 734 words on average. The preprocessing and partitioning of ROC and WP follow the methods adopted in previous work: in the ROC dataset the character names in the original stories are replaced with [MALE], [FEMALE], [NEUTRAL] for better generalization, and in the WP dataset the first 11 sentences of each story are taken to suit the task requirements of longer story generation. For the story samples in ROC and WP, this embodiment takes the first sentence as the guide text, with the subsequent sentences constituting the reference story. The keyword sequences and common-sense knowledge are generated according to the methods set forth in the embodiments of the invention. The statistics of each dataset are detailed in Table 2.
Table 2 statistics of dataset
In this embodiment, 6 baseline models are selected for comparison with the multi-source heterogeneous feature fusion model provided by the invention in order to verify its effectiveness. The selected baseline models include:
(1) P&W: it first plans a sequence of keywords based on the input and then generates the story according to the keyword sequence; its core framework is a BiLSTM with an attention mechanism;
(2) GPT-2: an autoregressive generation model based on the Transformer decoder structure, widely applied in various generation models;
(3) BART: it merges the encoder structure of BERT and the decoder structure of GPT, and is also widely used in various generative models.
(4) HINT: it adjusts the decoder structure of BART from the two perspectives of sentence level and discourse level;
(5) EtriCA: an end-to-end model that performs heterogeneous feature fusion between the guide text and the event sequence with a cross-attention mechanism, controlling story generation through the event sequence;
(6) CHAE: an end-to-end model that controls story generation by combining input characters, actions, emotions and other factors, with a copy mechanism added to the model to capture key features.
The framework of the FMHF (multi-source heterogeneous feature fusion) model provided by this embodiment is based on the pre-trained BART model published by Huggingface, so the parameter settings of the encoder, decoder and embedding layer are consistent with the published checkpoint. Regarding hyperparameter settings: the encoder and decoder each comprise 6 hidden layers, the number of heads in each layer's multi-head attention mechanism is set to 12, and BPE encoding is used; the normalization constants α_XR, α_XI, α_XN, α_XW in the optimal path G_link are all set to 0.5, the scale factors β, γ in the pairwise feature fusion are set to 0.1, and λ_1, λ_2, λ_3 in the loss functions are all set to 1; the training batch size of the ROC dataset is set to 16 and the validation batch size to 10; the training batch size of the WP dataset is set to 4 and the validation batch size to 2; the random seed is set to 42, the learning rate to 8e-5, the maximum text input length to 1024, the ε in the Adam optimizer to 1e-8, and the maximum number of training epochs to 5, with an early stopping mechanism. Nucleus sampling is used when generating text. Experiments were performed on a cloud platform with an RTX 4090, and all code was written based on PyTorch.
The common evaluation metrics in the story generation field comprise automatic metrics and manual metrics, and automatic metrics are further divided into referenced and unreferenced metrics. Chun et al. point out that when evaluating text quality in the story generation field, it is inadvisable to use the two referenced metrics ROUGE and BLEU, because they evaluate by computing the n-gram overlap between the generated text and the reference text, and scores are instead lowered when the text contains many modifiers. Therefore, the referenced metric selected in this embodiment is Perplexity (PPL), which uses information entropy to measure the uncertainty of the generated story text and has certain reference value. The unreferenced metrics selected in this embodiment are: Repetition-n (Rep-n), which computes the text repetition rate from the number of repeated n-grams in the generated story; Distinct-n (Dis-n), which defines distinctness as the ratio of distinct n-grams to all n-grams in the generated story; and Coherence and Relevance, which measure inter-sentence coherence and the relevance between sentences and the guide text via cosine similarity in the semantic vector space.
Story generation is an open-ended task; although researchers have strived to propose more reasonable automatic evaluation metrics, none has been widely accepted, so manual evaluation is indispensable. In this embodiment, the 2 baseline models that performed best on the automatic metrics are selected for comparison with the model of this embodiment. First, 100 story texts generated by each of the 3 models are randomly extracted from the test sets of ROC and WP respectively; voters on Amazon Mechanical Turk (AMT) are then employed to vote on the generated stories in 4 aspects, namely fluency, coherence, relevance and richness, with each set of story comparisons voted on by 3 voters under a five-level scoring system and each voter paid $0.04; finally the data statistics are computed. To pursue the effectiveness of the manual evaluation and reduce invalid information, voters are required to have an excellent level of English, so they are selected from countries where English is the native language.
In the present embodiment, definitions concerning fluency, coherence, correlation, and richness are as follows: fluency measures whether grammar application is accurate and whether semantic understanding is difficult; continuity measures whether the connection between adjacent sentences of story text is tight; the relevance measures the degree of relevance between the generated story and the guide text; the richness measures whether sentences of story text have modifier descriptions and whether plot development is interesting.
Referring to table 3, the results of the FMHF and baseline models are shown for the reference indicators on both ROC and WP data sets, with the optimal results indicated in bold and the suboptimal results indicated in underline.
TABLE 3 results of automatic evaluation of indicators
From the perspective of the referenced metric, compared with each baseline model, the FMHF model of this embodiment achieves lower Perplexity when generating story text, which means the generated story text is less confusing, reduces ambiguity between sentences, makes the story logic clearer, and improves the understandability of the story text.
From the perspective of the unreferenced metrics, compared with each baseline model, the FMHF of this embodiment achieves lower Repetition and higher Distinct scores, which indicates a low n-gram repetition rate and high distinctness between n-grams in the story text generated by FMHF. This means the story text has more unique content, reduces text redundancy and plot monotony, and is more appealing and interesting to readers.
Comparing the results on the ROC and WP datasets, FMHF performs noticeably better on short story generation than on long story generation: when generating longer texts, its perplexity rises sharply while fluency and relevance drop. This situation appears in all baseline models and is a known problem in this field, which is expected to be alleviated in the future.
The average results of the manual evaluation metrics are shown in Table 4, where Kappa coefficients are used to verify the consistency of voter opinions, and all data passed the Wilcoxon signed-rank test at a confidence level of p < 0.05. Compared with EtriCA and CHAE, the story text generated by FMHF received more votes for fluency, coherence, relevance and richness, which indicates that, from the perspective of human readers, the story text generated by FMHF is of higher quality and can improve the reading experience. In particular, in terms of richness FMHF leads EtriCA and CHAE by a large margin, which shows that FMHF can accurately capture information features and add many detailed descriptions to the story text.
Table 4 Results of the manual evaluation metrics
To investigate the contributions made by the various modules in FMHF, this embodiment designed the following model variants for ablation experiments, with results shown in Table 5:
(1) w/o CY: the CY fusion module is removed from the multi-source heterogeneous feature fusion model.
(2) w/o SY: the SY fusion module is removed from the multi-source heterogeneous feature fusion model.
(3) w/o CS: the CS fusion module is removed from the multi-source heterogeneous feature fusion model.
Table 5 Results of the ablation experiments
Table 5 shows the automatic-metric results for FMHF and each ablated model on the ROC and WP datasets. It can be clearly observed that removing any one of the three fusion modules degrades model performance, meaning the guide text, the keyword sequence and the common-sense knowledge all play important roles in improving performance: as described above, the guide text steers the development of the story, the keyword sequence controls the subsequent development of events, and the common-sense knowledge provides additional reference information for story generation; the three are complementary. Furthermore, performance degrades most severely when the SY or CS module is removed, indicating that the keyword sequence, as the core control factor, is the most important of the three heterogeneous inputs: it appears in every sentence of the generated story text and is directly related to the coherence and fluency of the whole text, which also verifies the necessity of the residual connection with F_s added in this embodiment. Taken together, these results demonstrate the effectiveness of the proposed FMHF in fusing heterogeneous features.
To further explore the influence of the multi-stage model structure on model performance, this embodiment ran additional experiments comparing the end-to-end and multi-step model structures. In the end-to-end structure, the three fusion features are combined through a linear combination layer and converted by linear mapping into the final feature F_csy; F_csy in the end-to-end structure is computed as: F_csy = Linear(Concat(F_cy, F_sy, F_cs)).
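The end-to-end variant above can be sketched in a few lines of PyTorch; the hidden size of 768 (BART-base) and the module name are illustrative assumptions, not part of the claimed method:

```python
import torch
import torch.nn as nn

class EndToEndFusion(nn.Module):
    """Minimal sketch of the end-to-end variant: the three fusion features
    are concatenated along the hidden dimension and linearly mapped back to
    the model dimension, F_csy = Linear(Concat(F_cy, F_sy, F_cs))."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.linear = nn.Linear(3 * d_model, d_model)

    def forward(self, f_cy, f_sy, f_cs):
        # each feature: (batch, seq_len, d_model); concat -> (batch, seq_len, 3*d_model)
        return self.linear(torch.cat([f_cy, f_sy, f_cs], dim=-1))
```

As the experiments show, this single linear combination discards the staged refinement of the multi-step structure, which is one plausible reason for its weaker coherence and relevance.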
Table 6 shows the statistics of the models on the ROC and WP datasets. It can be seen that the end-to-end metrics fall far short of multi-step, and are even worse than CHAE and EtriCA. This is because, although the cross-attention mechanism is still used for heterogeneous multi-source feature fusion, the addition of the huge common-sense training data prevents the end-to-end model from accurately capturing the key information features in a single pass; the captured information features become redundant, so the generated text performs poorly in coherence and relevance. To verify this analysis, FIG. 4 shows the Coherence among the first 4 sentences generated by the 4 models on the ROC and WP validation sets, and the Relevance between those sentences and the guide text, where word-vector embedding uses the "Wikipedia 2014 + Gigaword 5" GloVe embeddings. FIG. 4(a) is the coherence comparison on the ROC validation set; FIG. 4(b) is the coherence comparison on the WP validation set; FIG. 4(c) is the relevance comparison on the ROC validation set; FIG. 4(d) is the relevance comparison on the WP validation set. End-to-end shows low coherence and relevance in all cases, meaning it lacks the capability to accurately capture information features, while multi-step consistently stays at the optimal level, demonstrating the advantage of the multi-stage model structure.
Table 6 Comparison of the end-to-end and multi-stage structures
Tables 7 and 8 show story cases generated by the models on the ROC and WP datasets, where the first sentence of each story is the guide text, underlined text marks the keyword sequence, and bold text marks modifier language. The samples show that, compared with EtriCA and CHAE, the short story text generated by FMHF is fluent, on-topic and free of logical or grammatical errors, with varied sentence structures and abundant modifiers; the plot development is more engaging, and the reading experience is very close to that of the reference story. When generating long-form story text, every model's quality drops markedly and falls far short of the reference story, which is one of the difficulties currently facing the story-generation field; still, compared with the long-form texts generated by CHAE and EtriCA, FMHF's output rarely contains disfluent sentences or logical contradictions, greatly reducing reading obstacles and illustrating the superiority of the model.
Table 7 Story samples generated on the ROC dataset
Table 8 Story samples generated on the WP dataset
The embodiment of the invention addresses the task of controllable story generation by fusing multi-source heterogeneous features with a cross-attention mechanism and provides a new method for explicitly applying common-sense knowledge. The FMHF model architecture used in this embodiment can effectively fuse the three heterogeneous features of guide text, keyword sequence and common-sense knowledge, providing the model with more information for story generation; the ablation experiments prove that all three feature-fusion sub-modules play an important role in the model. Comparison with multiple baseline models demonstrates the superiority of the proposed heterogeneous feature fusion model FMHF from the perspectives of both automatic and manual evaluation metrics, and the generated stories have better fluency, relevance, coherence and richness.
The embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, realizes the steps of the controllable story generation method based on multi-source heterogeneous feature fusion. The embodiment provides a model structure that fuses multi-source heterogeneous features using a cross-attention mechanism, which can fully exploit the correlations between heterogeneous inputs, accurately capture information features, and improve the richness of the generated text while guaranteeing fluency and relevance. By explicitly applying common-sense knowledge, it can establish correlations between the whole input and the common-sense knowledge, providing the model with an additional information source more directly than implicit methods.
According to the controllable story generation method based on multi-source heterogeneous feature fusion, the guide text, the keyword sequence and the common-sense knowledge are encoded to obtain the corresponding text features, keyword features and common-sense features; these are fused pairwise with a multi-head attention mechanism to obtain three fusion features, and multi-stage recursive training yields the heterogeneous fusion features, fully exploiting the correlations between heterogeneous inputs and realizing the explicit addition of common-sense knowledge and the accurate capture of information features. The heterogeneous fusion features are input into the decoder for decoding to obtain the controllable story text; based on these accurate and rich heterogeneous fusion features, the richness of the generated story is improved while its fluency and relevance are guaranteed. In the invention, residual connections with the keyword features are added in the second and third fusion modules when generating the fusion features, which controls the subsequent development of the story more effectively and enhances the controllability of the generated text.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above embodiments are given by way of illustration only and are not limiting. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description; it is neither necessary nor possible to exhaustively list all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. The controllable story generation method based on multi-source heterogeneous feature fusion is characterized by comprising the following steps of:
acquiring a story sample set; extracting the first sentence of each story sample as the guide text, and taking the remaining sentences as the reference story; extracting verb phrases in the reference story as the keyword sequence;
carrying out coreference resolution on each story sample to obtain its story protagonist; obtaining the subjects and objects of each story sample using the NLTK toolkit, generating core plots with the corresponding story protagonist as the subject using a pairing mechanism, and performing bidirectional reasoning to obtain the corresponding common-sense knowledge; the common-sense knowledge of all story samples constitutes a common-sense knowledge base;
encoding each word in all story samples in the story sample set to obtain a vocabulary;
respectively inputting the guide text, keyword sequence and common-sense knowledge of any story sample into a BART encoder to obtain the corresponding text features, keyword features and common-sense features, and inputting them into a multi-source heterogeneous feature fusion model to obtain the heterogeneous fusion features corresponding to the story sample, which specifically comprises:
inputting the text features and the common sense features into a first fusion module of the multi-source heterogeneous feature fusion model, and acquiring first fusion features based on a multi-head attention mechanism;
inputting the keyword features and the common-sense features into a second fusion module of the multi-source heterogeneous feature fusion model, and, after processing by a multi-head attention mechanism, performing residual connection with the keyword features to obtain second fusion features;
inputting the text features and the keyword features into a third fusion module of the multi-source heterogeneous feature fusion model, and, after processing by a multi-head attention mechanism, performing residual connection with the keyword features to obtain third fusion features;
using a multi-stage recursive training method, acquiring a first checkpoint based on the first fusion features and the published checkpoint of BART; acquiring a second checkpoint based on the second fusion features and the first checkpoint; and acquiring the heterogeneous fusion features corresponding to the story sample based on the third fusion features and the second checkpoint;
inputting the reference story and the heterogeneous fusion features corresponding to the story sample into a BART decoder for decoding to obtain a probability distribution over the vocabulary; starting from the guide text, using a preset sampling strategy to repeatedly select the word with the highest probability as the next word until the selected next word is the preset end marker, thereby acquiring the predicted story text;
Constructing a model total loss function based on a cross entropy loss function of a predicted story text and a corresponding reference story and a negative log likelihood function of text features and keyword features; training the multi-source heterogeneous feature fusion model by using a story sample set until the total loss function of the model converges, and obtaining the multi-source heterogeneous feature fusion model after training;
acquiring a guide text and a keyword sequence of a story to be generated, and extracting text features and keyword features of the story to be generated; acquiring common sense features of all common sense knowledge in the common sense knowledge base to form common sense features of a story to be generated;
inputting the text features, the keyword features and the common sense features of the story to be generated into the multi-source heterogeneous feature fusion model for completing training; and generating heterogeneous fusion characteristics of the story to be generated, inputting the heterogeneous fusion characteristics into a BART decoder, and acquiring controllable story text corresponding to the story to be generated.
2. The controllable story generation method based on multi-source heterogeneous feature fusion of claim 1, wherein extracting verb phrases in the reference story as the keyword sequence comprises:
extracting action elements, including the action trigger, agent, event, instrument, time and place, according to the part of speech of each word and the dependency relations between words in the reference story;
taking the action triggers of all sentences as keywords to obtain the keyword sequence of the reference story, represented as: <s> action_1 <sep> action_2 … <e>;
wherein <s> denotes the start marker of the keyword sequence, action_i denotes the i-th keyword in the keyword sequence, <sep> denotes the separator of the keyword sequence, and <e> denotes the end marker of the keyword sequence.
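The trigger extraction and formatting of claim 2 can be sketched as follows; this is an illustrative simplification (not part of the claims), using a first-verb-per-sentence heuristic over pre-tagged text in place of the dependency-based extraction, with hypothetical helper names:

```python
def action_triggers(tagged_sentences):
    """Pick the first verb (Penn Treebank tag starting with 'VB') of each
    sentence as its action trigger; a simplification of the patent's
    part-of-speech and dependency-relation based extraction."""
    triggers = []
    for sent in tagged_sentences:
        for word, tag in sent:
            if tag.startswith("VB"):
                triggers.append(word)
                break
    return triggers

def format_keyword_sequence(triggers):
    """Join the triggers with the markers of claim 2: <s> ... <sep> ... <e>."""
    return "<s>" + "<sep>".join(triggers) + "<e>"

# toy (word, POS-tag) input, as produced by e.g. an NLTK tagger
tagged = [[("Tom", "NNP"), ("decided", "VBD"), ("to", "TO"), ("travel", "VB")],
          [("He", "PRP"), ("packed", "VBD"), ("his", "PRP$"), ("bags", "NNS")]]
seq = format_keyword_sequence(action_triggers(tagged))  # "<s>decided<sep>packed<e>"
```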
3. The controllable story generation method based on multi-source heterogeneous feature fusion of claim 1, wherein performing coreference resolution on each story sample to obtain its story protagonist, obtaining the subjects and objects of each story sample using the NLTK toolkit, generating core plots with the corresponding story protagonist as the subject using a pairing mechanism, and performing bidirectional reasoning to obtain the corresponding common-sense knowledge comprises:
performing coreference resolution on each story sample in the story sample set, and selecting the mention with the largest number of occurrences as the story protagonist of the story sample;
extracting the subject and object of each sentence in the story sample using the NLTK toolkit to form the plot set of the story sample;
selecting, with a pairing mechanism, the triples whose subject is the story protagonist from the plot set to form the core plot sequence;
based on the two inference modes of mental states caused by events and actions caused by events, performing forward and backward reasoning on each pair of adjacent core plots to generate the optimal path connection relation of common-sense knowledge, expressed as:
wherein G_link is the optimal path of common-sense reasoning from core plot p_i to its adjacent core plot p_j; XI and XR respectively denote the pre-event idea and the post-event experience in the mental-state inference mode caused by events; XN and XW respectively denote the pre-event need and the post-event want in the action inference mode caused by events; P_XR, P_XW, P_XI and P_XN respectively denote the maximum probabilities under the XR, XW, XI and XN modes; and α_XR, α_XI, α_XN and α_XW respectively denote the normalization constants under the XR, XW, XI and XN modes.
4. The controllable story generation method based on multi-source heterogeneous feature fusion of claim 1, wherein inputting the text features and the common-sense features into the first fusion module of the multi-source heterogeneous feature fusion model and acquiring the first fusion features based on a multi-head attention mechanism comprises:
based on the multi-head attention mechanism, computing the query Q, the key K and the value V through trainable parameters, expressed as:
based on the query, the key and the value, acquiring a normalized first weight matrix, expressed as:
the elements in the first weight matrix are weighted and summed to obtain the first fusion feature F_cy, expressed as:
wherein i denotes the i-th head of the multi-head attention mechanism; the associated projection matrices are preset trainable parameters; F_y denotes the common-sense features, expressed as F_y = BART_Encoder_y(Y), where Y denotes the common-sense knowledge; F_c denotes the text features, expressed as F_c = BART_Encoder_c(C), where C denotes the guide text; BART_Encoder(·) denotes a BART encoder; d denotes the dimension of the key K; and h denotes the number of heads in the multi-head attention mechanism.
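Such a fusion module can be sketched with PyTorch's built-in multi-head attention. This is an illustrative sketch, not part of the claims: which feature supplies the query is not recoverable from the claim text, so the choice of F_c as query over F_y, and the dimensions, are assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of the first fusion module: multi-head cross attention between
    two heterogeneous feature sequences; the normalized weight matrix
    softmax(QK^T / sqrt(d)) is computed internally, and its weighted sum of
    values is returned as the fusion feature F_cy."""
    def __init__(self, d_model: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, f_c, f_y):
        # f_c (text features) queries f_y (common-sense features) -- an assumption
        f_cy, _ = self.attn(query=f_c, key=f_y, value=f_y)
        return f_cy
```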
5. The controllable story generation method based on multi-source heterogeneous feature fusion of claim 4, wherein inputting the keyword features and the common-sense features into the second fusion module of the multi-source heterogeneous feature fusion model, performing residual connection with the keyword features after processing by a multi-head attention mechanism, and obtaining the second fusion features comprises:
based on the multi-head attention mechanism, computing the query Q, the key K and the value V through trainable parameters, expressed as:
based on the query, the key and the value, acquiring a normalized second weight matrix, expressed as:
the elements in the second weight matrix are weighted and summed to obtain the second preliminary fusion feature F_sy, expressed as:
performing residual connection between the second preliminary fusion feature F_sy and the keyword feature F_s to obtain the second fusion feature F̃_sy, expressed as:
wherein i denotes the i-th head of the multi-head attention mechanism; the associated projection matrices are preset trainable parameters; F_y denotes the common-sense features, expressed as F_y = BART_Encoder_y(Y), where Y denotes the common-sense knowledge; F_s denotes the keyword features, expressed as F_s = BART_Encoder_s(S), where S denotes the keyword sequence; and β denotes the first preset scaling factor.
6. The controllable story generation method based on multi-source heterogeneous feature fusion of claim 5, wherein inputting the text features and the keyword features into the third fusion module of the multi-source heterogeneous feature fusion model, performing residual connection with the keyword features after processing by a multi-head attention mechanism, and obtaining the third fusion features comprises:
based on the multi-head attention mechanism, computing the query Q, the key K and the value V through trainable parameters, expressed as:
based on the query, the key and the value, acquiring a normalized third weight matrix, expressed as:
the elements in the third weight matrix are weighted and summed to obtain the third preliminary fusion feature F_cs, expressed as:
performing residual connection between the third preliminary fusion feature F_cs and the keyword feature F_s to obtain the third fusion feature F̃_cs, expressed as:
wherein the associated projection matrices are preset trainable parameters; F_c denotes the text features, expressed as F_c = BART_Encoder_c(C), where C denotes the guide text; F_s denotes the keyword features, expressed as F_s = BART_Encoder_s(S), where S denotes the keyword sequence; and γ denotes the second preset scaling factor.
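Claims 5 and 6 differ from claim 4 only in the scaled residual connection with the keyword feature F_s. The following sketch (not part of the claims) assumes F_s supplies the query, so the residual shapes match, and assumes an additive form F + γ·F_s; neither choice is explicit in the claims:

```python
import torch
import torch.nn as nn

class ResidualCrossAttentionFusion(nn.Module):
    """Sketch of the second/third fusion modules: multi-head cross attention
    followed by a residual connection with the keyword feature, scaled by a
    preset factor (β in claim 5, γ in claim 6)."""
    def __init__(self, d_model: int = 768, num_heads: int = 8, gamma: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.gamma = gamma  # preset scaling factor; 0.5 is an assumed default

    def forward(self, f_s, other):
        # F_s queries the other heterogeneous feature, so the output length
        # matches F_s and the scaled residual is well-defined (query choice assumed)
        preliminary, _ = self.attn(query=f_s, key=other, value=other)
        return preliminary + self.gamma * f_s
```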
7. The controllable story generation method of claim 6, wherein using the multi-stage recursive training method to acquire the first checkpoint based on the first fusion features and the published checkpoint of BART, acquire the second checkpoint based on the second fusion features and the first checkpoint, and acquire the heterogeneous fusion features corresponding to the story sample based on the third fusion features and the second checkpoint comprises:
based on the first fusion feature F_cy and the published checkpoint of BART, acquiring the first checkpoint, expressed as: H_cy = FMHF(H_0, F_cy);
based on the second fusion feature F̃_sy and the first checkpoint, acquiring the second checkpoint, expressed as: H_sy = FMHF(H_cy, F̃_sy);
based on the third fusion feature F̃_cs and the second checkpoint, acquiring the heterogeneous fusion feature F_csy corresponding to the story sample, expressed as: F_csy = FMHF(H_sy, F̃_cs);
wherein FMHF(·) denotes the multi-source heterogeneous feature fusion module, H_cy denotes the first checkpoint, H_0 denotes the published checkpoint of BART, and H_sy denotes the second checkpoint.
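The multi-stage recursion of claim 7 — each stage starts from the checkpoint produced by the previous stage — can be sketched independently of any framework; `train_stage` is a hypothetical callable standing in for one FMHF training pass, and the sketch is illustrative rather than part of the claims:

```python
def multistage_training(initial_checkpoint, fusion_features, train_stage):
    """Run the recursive schedule: stage 1 starts from BART's published
    checkpoint H_0, and every later stage starts from its predecessor's
    checkpoint, mirroring H_cy = FMHF(H_0, F_cy), then H_sy, and so on."""
    checkpoint = initial_checkpoint
    for feature in fusion_features:
        checkpoint = train_stage(checkpoint, feature)
    return checkpoint

# illustrative run with a dummy training step that just records the lineage
lineage = multistage_training("H_0", ["F_cy", "F_sy", "F_cs"],
                              lambda ckpt, f: ckpt + "->" + f)
```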
8. The controllable story generation method based on multi-source heterogeneous feature fusion of claim 7, wherein inputting the reference story and the heterogeneous fusion features corresponding to the story sample into a BART decoder for decoding to obtain a probability distribution over the vocabulary, and, starting from the guide text, using a preset sampling strategy to repeatedly select the word with the highest probability as the next word until the selected next word is the preset end marker, thereby acquiring the predicted story text, comprises:
inputting the reference story R and the heterogeneous fusion feature F_csy into the BART decoder for decoding to obtain the hidden state of the decoder at step t, expressed as: H_t = BART_Decoder(δ_<t, F_csy);
acquiring the probability distribution over the vocabulary, expressed as:
P(δ_t | δ_<t, X) = softmax(H_t·W + b);
starting from the guide text, using the preset sampling strategy to repeatedly select the word with the highest probability as the next word until the selected next word is the preset end marker, completing the prediction and obtaining the predicted story text;
wherein t denotes the step number; H_t is the hidden state of the decoder at step t, computed from the fusion feature F_csy output by the encoding stage and the already-generated text δ_<t; W and b are trainable parameters; R denotes the reference story input to the decoder; P(δ_t | δ_<t, X) denotes the probability distribution over the vocabulary; and X denotes the model input.
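The sampling strategy of claim 8 is plain greedy decoding. A framework-free illustrative sketch; `step_probs_fn` is a hypothetical stand-in for the BART decoder plus the softmax projection H_t·W + b:

```python
def greedy_decode(step_probs_fn, bos_id, eos_id, max_len=50):
    """Starting from the guide text (here reduced to a begin marker),
    repeatedly append the highest-probability word until the preset end
    marker is selected or a length limit is reached."""
    prefix = [bos_id]
    for _ in range(max_len):
        probs = step_probs_fn(prefix)                 # distribution over the vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix

# toy vocabulary of 4 words where the end marker (id 3) follows word 1
toy_model = lambda prefix: [0.1, 0.6, 0.2, 0.1] if prefix[-1] != 1 else [0.0, 0.0, 0.1, 0.9]
tokens = greedy_decode(toy_model, bos_id=0, eos_id=3)  # [0, 1, 3]
```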
9. The controllable story generation method based on multi-source heterogeneous feature fusion of claim 8, wherein acquiring the heterogeneous fusion features corresponding to the story samples further comprises training the first, second and third fusion modules to obtain the trained first, second and third fusion modules, comprising:
constructing the text-common-sense loss function L_cy based on the negative log-likelihood of the probability of the t-th character in the first fusion module, expressed as: L_cy = -Σ_{t=1}^{r} log P(r_t); training the first fusion module with the story sample set until the text-common-sense loss function converges, obtaining the trained first fusion module;
constructing the keyword-common-sense loss function L_sy based on the negative log-likelihood of the probability of the t-th character in the second fusion module, expressed as: L_sy = -Σ_{t=1}^{u} log P(u_t); training the second fusion module with the story sample set until the keyword-common-sense loss function converges, obtaining the trained second fusion module;
constructing the text-keyword loss function L_cs based on the negative log-likelihood of the probability of the t-th character in the third fusion module, expressed as: L_cs = -Σ_{t=1}^{v} log P(v_t); training the third fusion module with the story sample set until the text-keyword loss function converges, obtaining the trained third fusion module;
wherein r denotes the total number of characters in the first fusion module and r_t denotes its t-th character; u denotes the total number of characters in the second fusion module and u_t denotes its t-th character; v denotes the total number of characters in the third fusion module and v_t denotes its t-th character.
10. The controllable story generation method of claim 9, wherein the model total loss function, constructed based on the cross-entropy loss function of the predicted story text and the corresponding reference story and the text-keyword loss function of the text features and the keyword features, is expressed as:
model total loss function L_all = L_lm + λ·L_cs;
wherein L_lm is the cross-entropy loss function of the predicted story text and the corresponding reference story, expressed as:
L_lm = -Σ_{t=1}^{n} P_*(δ_t | δ_<t, R)·log P(δ_t | δ_<t, X);
wherein λ is a preset scale factor; X denotes the model input and R denotes the reference story; the one-hot vector P_*(δ_t | δ_<t, R) denotes the probability distribution of the reference story; and n denotes the total number of words in the reference story.
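A hedged PyTorch sketch of the objective of claim 10 (illustrative, not part of the claims); the value of λ and the mean reduction over tokens are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(lm_logits, reference_ids, l_cs, lam=1.0):
    """L_all = L_lm + λ·L_cs: token-level cross entropy between the predicted
    distribution and the one-hot reference story, plus the text-keyword loss
    L_cs weighted by the preset scale factor λ."""
    l_lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                           reference_ids.reshape(-1))
    return l_lm + lam * l_cs
```

With uniform logits over a vocabulary of size V, L_lm reduces to log V, which serves as a convenient sanity check.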
CN202311828251.4A 2023-12-27 2023-12-27 Controllable story generation method based on multi-source heterogeneous feature fusion Active CN117787224B (en)

Publications (2)

Publication Number Publication Date
CN117787224A (en) 2024-03-29
CN117787224B CN117787224B (en) 2024-06-14


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750652A (en) * 2019-10-21 2020-02-04 广西大学 Story ending generation method combining context entity words and knowledge
CN113515951A (en) * 2021-07-19 2021-10-19 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics
WO2022036616A1 (en) * 2020-08-20 2022-02-24 中山大学 Method and apparatus for generating inferential question on basis of low labeled resource
CN115358289A (en) * 2022-07-20 2022-11-18 南京航空航天大学 Text generation algorithm fusing multi-type knowledge base and inference technology
CN116522894A (en) * 2023-04-20 2023-08-01 北京大学 Multi-stage text generation method
CN116860960A (en) * 2023-08-10 2023-10-10 山西大学 Multi-document abstracting method based on knowledge graph and BART semantics
WO2023225858A1 (en) * 2022-05-24 2023-11-30 中山大学 Reading type examination question generation system and method based on commonsense reasoning
CN117251524A (en) * 2023-04-24 2023-12-19 国家计算机网络与信息安全管理中心 Short text classification method based on multi-strategy fusion


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHEN TANG et al.: "A Cross-Attention Augmented Model for Event-Triggered Context-Aware Story Generation", ARXIV:2311.11271V1, 19 November 2023 (2023-11-19), pages 1 - 20 *
CHEN TANG等: "EtriCA: Event-Triggered Context-Aware Story Generation Augmented by Cross Attention", FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: EMNLP 2022, 11 December 2022 (2022-12-11), pages 1 - 15 *
JIAN GUAN等: "A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation", TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, vol. 8, 15 January 2020 (2020-01-15), pages 93 *
SHRIMAI PRABHUMOYE等: "Exploring Controllable Text Generation Techniques", ARXIV:2005.01822V2, 30 October 2020 (2020-10-30), pages 1 - 13 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant