CN113127632B - Text summarization method and device based on heterogeneous graph, storage medium and terminal - Google Patents


Info

Publication number
CN113127632B
CN113127632B (application CN202110533278.5A)
Authority
CN
China
Prior art keywords
sentence
text
word
vectors
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110533278.5A
Other languages
Chinese (zh)
Other versions
CN113127632A (en)
Inventor
蒋昌俊
闫春钢
丁志军
王俊丽
张亚英
张超波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202110533278.5A, granted as CN113127632B
Priority to PCT/CN2021/103504, published as WO2022241913A1
Publication of CN113127632A
Application granted
Publication of CN113127632B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/34: Browsing; Visualisation therefor
    • G06F16/345: Summarisation for human users
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a text summarization method and device based on heterogeneous graphs, a storage medium and a terminal. The method comprises the following steps: performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features; updating the text heterogeneous graph through a graph attention network based on the edge weights and attention weights to obtain an updated version of the text heterogeneous graph; calculating multi-class abstract indexes of the sentence vectors in the updated version text heterogeneous graph, and calculating the classification weight of each sentence vector according to its corresponding abstract indexes; and weighting the sentence features in the updated version text heterogeneous graph by the classification weights of the sentence vectors, acquiring the corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels. The invention adopts a more direct approach in which sentences and words serve as two types of nodes to construct the heterogeneous graph, with word nodes acting as intermediaries between sentences, thereby enriching the associations among sentences and transferring information indirectly.

Description

Text summarization method and device based on heterogeneous graph, storage medium and terminal
Technical Field
The invention relates to the technical field of text generation, in particular to a method and a device for text summarization based on heterogeneous graphs, a storage medium and a terminal.
Background
The automatic generation of text summaries is an important task in the field of natural language processing, which aims to compress an original text and generate a short description containing its main content. Research falls into two main categories: abstractive and extractive. Abstractive methods encode the whole document and generate the summary word by word, while extractive methods directly select sentences from the document and combine them into the summary. Compared with abstractive methods, extractive methods are more efficient and the generated summaries are more readable.
The key step in the extractive summarization task is to establish the relation between each sentence and the article. Most existing methods acquire the relations of sentences based on recurrent neural networks (RNNs), but such methods cannot capture long-distance dependency relationships between sentences. Using graph structures to represent text is a more effective way to solve the above problem, but how to reasonably model text as a graph remains to be studied. Recently, graph neural networks (GNNs) have shown powerful feature-extraction capability for graph data, and text summarization methods based on GNNs have been proposed. Some work decomposes sentences into elementary discourse units (EDUs) using Rhetorical Structure Theory (RST), constructs RST trees, and then uses a graph convolution network (GCN) to complete graph information aggregation and updating. Although the EDU-based approach achieves a better effect, the process of generating the elementary units is complicated, and only one kind of node is used to construct the graph. The strength of the associations between sentences is especially important in extractive summarization, but in current work on heterogeneous graphs, edges are only added between nodes of different types, so sentences are not directly related.
Disclosure of Invention
The technical problem to be solved by the invention is that, in existing graph-neural-network-based text abstract generation, the process of generating elementary discourse units is complex, only one type of node is used to construct the graph, and the associations among sentences are weak, which is not conducive to generating an extractive abstract.
In order to solve the technical problem, the invention provides a text summarization method based on heterogeneous graphs, which comprises the following steps:
performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features;
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated text heterogeneous graph;
calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph, and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
and respectively weighting sentence features in the updated version text heterogeneous graph based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels.
Preferably, the knowledge fusion of a preset knowledge base and a target text, and the obtaining of the word features and sentence features of the target text comprises:
respectively encoding and vectorizing knowledge in a preset knowledge base and content in a target text to acquire a knowledge vector in the preset knowledge base and a word vector in the target text;
respectively calculating, for each word vector in the target text, the attention weights between the word vector and the knowledge vectors in the preset knowledge base, so as to obtain the attention weight of each word vector in the target text;
sequentially taking the attention weight of the word vector in the target text as a weight, and respectively weighting and combining the knowledge vectors in the preset knowledge base to obtain the knowledge weight of each word vector in the target text;
acquiring word features of corresponding word vectors based on knowledge weight of each word vector in the target text;
and respectively performing local feature capture and global feature capture on the word features of the word vectors contained in each sentence vector in the target text to obtain the local features and the global features of each sentence vector, and respectively obtaining the sentence features of the corresponding sentence vectors according to the local features and the global features of each sentence vector.
Preferably, constructing the text heterogeneous map of the target text based on the word features and the sentence features comprises:
based on sentence features of sentence vectors in the target text, calculating the homogeneous edge weight between every two sentence vectors of all the sentence vectors in the target text in a cosine similarity calculation mode;
calculating heterogeneous edge weights among all word vectors and sentence vectors to which the word vectors belong in the target text through a TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong;
and taking the word vector in the target text as a word node, taking the sentence vector in the target text as a sentence node, and constructing a text heterogeneous graph of the target text based on the word features of the word vector, the sentence features of the sentence vector, the homogeneous edge weights among the sentence vectors and the heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong.
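The graph-construction steps above can be sketched in code. This is a minimal illustration, not the patented implementation: sentence vectors are assumed to be plain numeric lists, the `cosine` and `tf_idf` helpers are hypothetical simplifications (treating each sentence as a "document" for IDF), and the exact TF-IDF variant used by the invention may differ.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def tf_idf(word, sentence, sentences):
    """TF-IDF of `word` within `sentence`, with sentences as 'documents'."""
    tf = Counter(sentence)[word] / len(sentence)
    df = sum(1 for s in sentences if word in s)
    idf = math.log(len(sentences) / df) if df else 0.0
    return tf * idf

def build_text_graph(sentence_vecs, sentences):
    """Return homogeneous (sentence-sentence) and heterogeneous
    (word-sentence) edge weights of the text heterogeneous graph."""
    homo = {}   # (i, j) -> e'_ij, cosine similarity of sentence vectors
    for i in range(len(sentence_vecs)):
        for j in range(i + 1, len(sentence_vecs)):
            homo[(i, j)] = cosine(sentence_vecs[i], sentence_vecs[j])
    heter = {}  # (word, i) -> e_ij, TF-IDF of word in sentence i
    for i, sent in enumerate(sentences):
        for word in set(sent):
            heter[(word, i)] = tf_idf(word, sent, sentences)
    return homo, heter

sents = [["graph", "text", "summary"], ["graph", "attention", "network"]]
vecs = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
homo, heter = build_text_graph(vecs, sents)
```

A word shared by every sentence (such as "graph" here) gets a zero heterogeneous weight under this toy IDF, which matches the intuition that TF-IDF down-weights ubiquitous words.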
Preferably, the step of updating the text heterogeneous map through the graph attention network based on the edge weight and the attention weight comprises:
calculating attention weights between every two sentence vectors of all sentence vectors in the target text, calculating attention weights between all word vectors in the target text and the sentence vectors to which the word vectors belong, and acquiring all homogeneous edge weights and all heterogeneous edge weights in the target text;
and updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on the attention weight between every two sentence vectors of all sentence vectors in the target text, the attention weight between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights to obtain an updated version of the text heterogeneous graph.
Preferably, updating all word nodes and all sentence nodes in the text heterogeneous graph through the graph attention network based on the attention weight between every two sentence vectors of all sentence vectors in the target text, the attention weight between all word vectors and sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights comprises:
taking word nodes as central nodes, taking the product of attention weight between sentence nodes connected with the central nodes and heterogeneous edge weight between the sentence nodes connected with the central nodes as weight, and performing weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the updating of the word nodes;
taking sentence nodes as central nodes, taking the product of attention weight between word nodes connected with the central nodes and heterogeneous edge weight between the word nodes connected with the central nodes as weight, and carrying out weighted aggregation on word characteristics of the word nodes connected with the central nodes to realize the updating of the sentence nodes;
and taking sentence nodes as central nodes, taking the product of the attention weight between the sentence nodes connected with the central nodes and the homogeneous edge weight between the sentence nodes connected with the central nodes as the weight, and carrying out weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the update of the sentence nodes.
Preferably, the step of calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector comprises:
calculating a relevance score, a redundancy score, a new information score and a recall-oriented evaluation metric (ROUGE) score of each sentence vector in the updated version text heterogeneous graph;
and calculating the classification weight of the corresponding sentence vector through a Sigmoid function based on the relevance score, the redundancy score, the new information score and the ROUGE score of each sentence vector.
Preferably, the step of calculating the relevance score, the redundancy score, the new information score and the ROUGE score of a single sentence vector in the updated version text heterogeneous graph comprises the following steps:
calculating the relevance score of the sentence vector through a bilinear function based on the text features of the updated version text heterogeneous graph and the sentence features of the sentence vector;
calculating the redundancy score of the sentence vector through a bilinear function based on the sentence features of the sentence vector in the updated version text heterogeneous graph;
calculating the new information score of the sentence vector through a bilinear function based on the sentence features of the sentence vector in the updated version text heterogeneous graph and the knowledge vectors in the preset knowledge base;
and calculating the ROUGE score of the sentence vector through the recall-oriented evaluation metric function based on the non-encoded target text and the text content of the sentence vector.
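The scoring-and-weighting step can be illustrated with a small sketch. The bilinear forms are as stated above, but the way the four scores are combined before the Sigmoid (a signed sum here, penalising redundancy) is an assumption: the patent only states that the classification weight is obtained from the four indexes via a Sigmoid function. All matrices and vectors below are hypothetical toy values.

```python
import math

def bilinear(u, W, v):
    """u^T W v: a bilinear score between two vectors."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classification_weight(sent, doc, know, rouge, W_rel, W_red, W_new):
    """Combine the four abstract indexes into one weight in (0, 1).
    The additive combination is an assumption made for illustration."""
    relevance = bilinear(doc, W_rel, sent)    # sentence vs. whole text
    redundancy = bilinear(sent, W_red, sent)  # sentence self-overlap
    novelty = bilinear(sent, W_new, know)     # sentence vs. knowledge base
    return sigmoid(relevance - redundancy + novelty + rouge)

I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, purely illustrative
w = classification_weight([1.0, 0.0], [1.0, 1.0], [0.0, 1.0], 0.5,
                          I2, I2, I2)
```

With identity weight matrices the relevance and redundancy terms cancel here, so the weight reduces to Sigmoid of the ROUGE score alone.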
In order to solve the above technical problem, the present invention further provides a text summarization apparatus based on heterogeneous graphs, comprising:
the text heterogeneous graph building module is used for carrying out knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and building a text heterogeneous graph of the target text based on the word features and the sentence features;
the updating module is used for updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
the classification weight acquisition module is used for calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
and the abstract generating module is used for weighting the sentence characteristics in the updated version text heterogeneous graph respectively based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence characteristics, and generating a text abstract according to the acquired sentence labels.
In order to solve the above technical problem, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor implements the heterogeneous graph-based text summarization method.
In order to solve the above technical problem, the present invention further provides a terminal, including: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the terminal executes the text summarization method based on the heterogeneous graph.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
the text summarization method based on the heterogeneous graph provided by the embodiment of the invention is applied to connect texts according to semantic and syntactic relations to construct the text heterogeneous graph, updates two types of node characteristics of words and sentences by combining a graph attention network, and designs a plurality of measure indexes related to summaries to perform weighted evaluation on the sentences for final summarization extraction, thereby not only considering information transfer between the words and the sentences, but also considering mutual influence between the sentences. The further added external knowledge base can better help the model to understand the text, corresponding weights can be added to sentences before classification according to multi-angle indexes of abstract task design, the utilization capacity of the model on text features is effectively improved, and then the abstract which is accurate and high in readability is generated.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a method for text summarization based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 2 is a process diagram of a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a text heterogeneous graph construction process in a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a single-layer update process of a heterogeneous graph in a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the ablation-study experimental results of the text summarization method based on the heterogeneous graph according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating an experimental result based on a CNN & DailyMail data set and a comparative experimental result performed with other abstract methods in the first embodiment of the present invention;
FIG. 7 is a diagram illustrating the influence of the multi-angle indexes on the abstract in the text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a multi-angle index quantization sample according to an embodiment of the present invention;
FIG. 9 is a structural diagram of a text summarization device based on heterogeneous graphs according to a second embodiment of the present invention;
fig. 10 shows a schematic structural diagram of a terminal according to a fourth embodiment of the present invention.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments and the features in the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.
The automatic generation of text summaries is an important task in the field of natural language processing, and existing text summary generation mainly takes two forms: abstractive and extractive. Abstractive methods encode the whole document and generate the summary word by word, while extractive methods directly select sentences from the document and combine them into the summary. Compared with abstractive methods, the extractive approach has higher efficiency and good readability. The key step in the extractive text summarization task is to establish the connection between each sentence and the article, and existing methods generally cannot capture the long-distance dependency relationships of sentences. Recently, graph neural networks (GNNs) have shown powerful feature-extraction capability for graph data, and text summarization methods based on graph neural networks have also been proposed. Although the graph-neural-network-based abstract generation approach achieves a good effect, the process of generating elementary discourse units is complex, only one type of node is used to construct the graph, and edges are only added between different types of nodes, so the associations among sentences are weak.
Example one
In order to solve the technical problems in the prior art, the embodiment of the invention provides a text summarization method based on heterogeneous graphs.
FIG. 1 is a flow chart of a method for text summarization based on heterogeneous graphs according to an embodiment of the present invention; fig. 2 is a process diagram of a text summarization method based on heterogeneous graphs according to an embodiment of the present invention. Referring to fig. 1 and 2, the heterogeneous-graph-based text summarization method according to an embodiment of the present invention includes the following steps.
Further, in order to more clearly illustrate the specific implementation method of the method for abstracting text abstract based on heterogeneous graph of the present invention, the following definitions are made in advance:
definition of sentence sets and word sets: given a target text d containing m sentences and n words, S ═ S 1 ,s 2 ...s m Is then the set of sentences of d,
Figure BDA0003068795020000051
is the word set for sentence i.
The definition of the text map is: g ═ V, E } represents a graph, V represents a set of nodes, and E represents a set of edges. Since the heterogeneous graph used in the present invention contains two types of nodes, it can be divided into a word node set and a sentence node set. Therefore, the text graph TG ═ V TG ,E TG Is designed as a metamorphic graph, where:
(1)V TG where W ═ S contains two types of nodes, W ═ W 1 ,W 2 ,...,W m Represents a set of sets of words. S ═ S 1 ,s 2 ...s m Represents a sentence set.
(2)E TG =E heter ∪E homo In which
Figure BDA0003068795020000061
Represents a heterogeneous edge, E homo ={(s i ,s j )|s i ,s j E S represents a homogenous edge.
(3)e ij Represents a heterogeneous edge (w) ij ,s i ) Weight of e' ij Denotes homogeneous edge(s) i ,s j ) The weight of (c).
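Under these definitions, the text heterogeneous graph could be held in a structure such as the following hypothetical sketch; the class name, field names, and dictionary-based edge storage are illustrative choices, not part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class TextGraph:
    """Minimal container mirroring TG = {V_TG, E_TG}: two node types
    (per-sentence word sets W_i, sentence nodes s_i) and two edge types."""
    words: list       # W = [W_1, ..., W_m], W_i = word list of sentence i
    sentences: list   # S = [s_1, ..., s_m]
    heter_edges: dict = field(default_factory=dict)  # (word_idx, i) -> e_ij
    homo_edges: dict = field(default_factory=dict)   # (i, j) -> e'_ij

tg = TextGraph(
    words=[["graph", "model"], ["graph", "attention"]],
    sentences=["a graph model", "graph attention"],
)
tg.heter_edges[(0, 0)] = 0.4  # word 0 of sentence 0 <-> sentence 0
tg.homo_edges[(0, 1)] = 0.7   # sentence 0 <-> sentence 1
```

Storing homogeneous and heterogeneous edges in separate maps keeps the two update paths of the later attention layers independent of each other.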
Step S101, performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features.
The method first realizes knowledge fusion, that is, an external knowledge base is used to enrich the word features with knowledge, so that the feature representation of the text is simultaneously semantics-aware and knowledge-aware. Specifically, a preset knowledge base is selected whose language type must be the same as that of the target text: when the target text is Chinese, a Chinese knowledge base is selected; when the target text is English, an English knowledge base is selected. Secondly, in order to integrate the knowledge of the preset knowledge base into the word features of the target text, the knowledge in the selected knowledge base and the content of the target text are each encoded and vectorized to obtain all knowledge vectors of the preset knowledge base and all word vectors of the target text. To simplify the description, d directly represents the encoded and vectorized target text, W represents the collection of word vectors of the encoded target text, and w_i represents a word vector in W; further, K represents the encoded and vectorized preset knowledge base, and k represents a knowledge vector in K.
Then, the word feature of each word vector in the target text is obtained. The word feature of each word vector is solved as follows: the attention weights between a word vector w_i and all knowledge vectors k in the preset knowledge base are computed by a bilinear operation to obtain the attention weight β_i of the word vector w_i, calculated as:
β_i = BiLinear(K, W_KB, w_i) (1)
where W_KB is a trainable weight parameter.
After the attention weight of each word vector in the target text is obtained through calculation, the attention weight of the word vectors in the target text is sequentially taken as the weight, and the knowledge vectors in the preset knowledge base are respectively weighted and combined to obtain the knowledge weight of each word vector in the target text. The acquisition mode of the knowledge weight knowledge of the single word vector is as follows:
knowledge = β_i K (2)
the knowledge weight knowledge of the word vector now already contains word-related knowledge.
Then the word feature of the corresponding word vector is acquired based on the knowledge weight of each word vector in the target text. That is, the knowledge weight of each word vector is concatenated with the corresponding word vector, w_k = [w, knowledge], yielding the word feature of the corresponding word vector, which is simultaneously semantics-aware and knowledge-aware.
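A minimal sketch of this knowledge-fusion step follows. The softmax normalisation of the bilinear attention scores β_i is an assumption (the patent does not state how the weights are normalised), and all numeric values are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_knowledge(w, K, W_KB):
    """Attend from word vector w over knowledge vectors K (list of rows)
    via a bilinear score, combine K with the attention weights, and
    concatenate the attended knowledge onto w: w_k = [w, knowledge]."""
    scores = [sum(w[a] * W_KB[a][b] * k[b]
                  for a in range(len(w)) for b in range(len(k)))
              for k in K]                    # bilinear score per knowledge vector
    beta = softmax(scores)                   # attention weights beta_i (assumed softmax)
    knowledge = [sum(b * k[d] for b, k in zip(beta, K))
                 for d in range(len(K[0]))]  # weighted knowledge combination
    return w + knowledge                     # concatenation [w, knowledge]

W_KB = [[1.0, 0.0], [0.0, 1.0]]  # toy trainable parameter
w_k = fuse_knowledge([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], W_KB)
```

The fused feature doubles in dimension: its first half is the original word vector, its second half the knowledge summary, matching the concatenation w_k = [w, knowledge].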
After the word features of all word vectors of the target text are obtained, the sentence features of all sentence vectors in the target text can be obtained. Specifically, local feature capture and global feature capture are respectively performed on the word features of the word vectors contained in each sentence vector of the encoded and vectorized target text to obtain the local and global features of each sentence vector, and the sentence feature of each sentence vector is then obtained from its local and global features. Local features are extracted by a convolutional neural network (CNN), and global features are extracted by a BiLSTM. Meanwhile, after the sentence features of all sentence vectors in the encoded target text are obtained, the text feature of the encoded target text can also be obtained. The sentence features and the text feature are calculated as follows:
s_i = [CNN(W_i); BiLSTM(W_i)] (3)
D = BiLSTM([s_1, ..., s_m]) (4)
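The local/global feature capture can be illustrated with a deliberately simplified stand-in: a sliding-window average with max-pooling in place of the CNN filter, and a fixed-weight bidirectional tanh-RNN in place of the BiLSTM. All weights here are constants chosen purely for illustration, not learned parameters.

```python
import math

def conv1d_local(word_vecs, kernel=2):
    """Local features: window-averaged word vectors (a stand-in for one
    CNN filter), max-pooled over positions."""
    dim = len(word_vecs[0])
    windows = [[sum(w[d] for w in word_vecs[i:i + kernel]) / kernel
                for d in range(dim)]
               for i in range(len(word_vecs) - kernel + 1)]
    return [max(win[d] for win in windows) for d in range(dim)]

def birnn_global(word_vecs):
    """Global features: a minimal bidirectional tanh-RNN standing in
    for the BiLSTM; forward and backward final states are concatenated."""
    def run(seq):
        h = [0.0] * len(seq[0])
        for x in seq:
            h = [math.tanh(0.5 * h[d] + 0.5 * x[d]) for d in range(len(x))]
        return h
    return run(word_vecs) + run(word_vecs[::-1])

def sentence_feature(word_vecs):
    """Sentence feature as concatenation of local and global features."""
    return conv1d_local(word_vecs) + birnn_global(word_vecs)

s = sentence_feature([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

For 2-dimensional word features this yields a 6-dimensional sentence feature: 2 pooled local dimensions plus 2 forward and 2 backward recurrent dimensions.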
after word features and sentence features in a target text after encoding vectorization are obtained, a text heterogeneous graph of the target text can be constructed through semantic grammar, and word-sentence heterogeneous edge weights and sentence-sentence homogeneous edge weights of the target text are obtained before the text heterogeneous graph. The construction process of the text heterogeneous graph is shown in fig. 3, and referring to fig. 3, in the embodiment of the present invention, a text abstract is regarded as a classification problem, and sentences are regarded as a minimum unit to be classified, so that an association relationship between sentences is particularly important when generating a digest. The homogeneous edge expression quantity request mode is as follows: based on sentence characteristics of sentence vectors in the target text after the code vectorization, calculating the homogeneous edge weight between every two sentence vectors of all the sentence vectors in the target text after the code vectorization in a cosine similarity calculation mode. In order to add more information related to the text, the embodiment calculates all word vectors in the target text and the heterogeneous edge weights among the sentence vectors to which the word vectors belong by the TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong after encoding and vectorization.
And constructing a text heterogeneous graph of the target text based on word features of the word vectors, sentence features of the sentence vectors, homogeneous edge weights among the sentence vectors and heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong.
And step S102, updating the text heterogeneous graph through the graph attention network based on the edge weight and the attention weight, and acquiring an updated version of the text heterogeneous graph.
To further illustrate the process of updating the text heterogeneous graph by the graph attention network, the word features and sentence features of the word vectors in the encoded and vectorized target text are respectively expressed as the hidden states H_w of the word nodes and the hidden states H_s of the sentence nodes, and the text feature is expressed as H_D.
Fig. 4 is a schematic diagram illustrating a single-layer updating process of a text heterogeneous graph in a text summarization method based on a heterogeneous graph according to an embodiment of the present invention. Referring to fig. 4, the attention weight between every two sentence vectors of all sentence vectors in the target text is calculated, the attention weight between all word vectors and sentences to which the word vectors belong in the target text is calculated, and all homogeneous edge weights and all heterogeneous edge weights in the target text are obtained. The attention weight calculation method among sentence vectors is as follows, and the attention weight calculation method among word vectors and sentences to which the word vectors belong can also refer to the following formula:
The attention weight between two sentence vectors is calculated as follows:

z_ij = LeakyReLU(W_a [W_q h_i ; W_k h_j])

α_ij = exp(z_ij) / Σ_{l∈N_i} exp(z_il)

where h_i and h_j represent the hidden states of two sentence nodes, W_a, W_q, W_k and W_v are trainable parameters, and α_ij is the attention weight between h_i and h_j.
The update increment u_i is then calculated from the attention weights by equation (6):

u_i = σ( Σ_{j∈N_i} α_ij W_v h_j )    (6)

where u_i can represent either a word node increment or a sentence node increment, and N_i represents the set of neighbor nodes of node i.
In order to make the semantic association information participate in the update, the heterogeneous edge weight e_ij and the homogeneous edge weight e'_ij are introduced, controlling the update degree of the nodes from both the semantic side and the attention side. The heterogeneous and homogeneous edge weights are calculated as follows:

e_ij = TF-IDF(w_i, s_j)

or

e'_ij = (h_i · h_j) / (‖h_i‖ ‖h_j‖)

where e_ij is the TF-IDF weight between word w_i and sentence s_j, and e'_ij is the cosine similarity between the features of sentences i and j. Equation (6) is then modified as:

u_i = σ( Σ_{j∈N_i} e_ij α_ij W_v h_j )
The above is the process by which the graph attention network calculates the update increments using the attention weights between sentence vectors, the attention weights between word vectors and the sentences to which they belong, the homogeneous edge weights and the heterogeneous edge weights.
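As a concrete illustration, the edge-weighted attention update of a single node, following the modified form of equation (6), can be sketched in NumPy as follows. The matrix shapes, the LeakyReLU slope and the softmax normalization are assumptions consistent with standard graph attention networks, not details fixed by the patent.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def update_node(h_i, h_neigh, e_neigh, W_a, W_q, W_k, W_v):
    # Attention score per neighbor j: z_ij = LeakyReLU(W_a [W_q h_i ; W_k h_j]).
    q = W_q @ h_i
    z = np.array([float(leaky_relu(W_a @ np.concatenate([q, W_k @ h_j])))
                  for h_j in h_neigh])
    # Softmax over the neighborhood gives the attention weights alpha_ij.
    alpha = np.exp(z - z.max())
    alpha /= alpha.sum()
    # Edge-weighted aggregation: u_i = sigma(sum_j e_ij * alpha_ij * W_v h_j).
    agg = sum(e * a * (W_v @ h_j) for e, a, h_j in zip(e_neigh, alpha, h_neigh))
    return 1.0 / (1.0 + np.exp(-agg))  # sigma taken as the logistic function
```

The edge weights e_neigh (cosine or TF-IDF values) scale the attention weights, so a neighbor that is both attended to and semantically associated contributes most to the increment.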
The process of updating the text heterogeneous graph through the graph attention network actually comprises updating both the word nodes and the sentence nodes. More specifically, it comprises three processes: sentence nodes updating word nodes, word nodes updating sentence nodes, and sentence nodes updating each other.
The updating of word nodes by sentence nodes comprises: taking a word node as the central node, and weighting and aggregating the sentence features of the sentence nodes connected to it, using as weight the product of the attention weight and the heterogeneous edge weight between the central node and each connected sentence node. The updating of sentence nodes by word nodes comprises: taking a sentence node as the central node, and weighting and aggregating the word features of the word nodes connected to it, using as weight the product of the attention weight and the heterogeneous edge weight between the central node and each connected word node. The mutual updating between sentence nodes comprises: taking a sentence node as the central node, and weighting and aggregating the sentence features of the sentence nodes connected to it, using as weight the product of the attention weight and the homogeneous edge weight between the central node and each connected sentence node. An LSTM can also be applied at the sentence node level to update the text feature H_D. In each case, the aggregation is weighted by the corresponding attention weights and edge weights. To illustrate the above, the t-th update of the graph attention network proceeds as follows:
U^t_{s→w} = GAT(G, H^{t-1}_w, H^{t-1}_s)

H^t_w = MLP(U^t_{s→w} + H^{t-1}_w)

U^t_{w→s} = GAT(G, H^{t-1}_s, H^t_w)

U^t_{s→s} = GAT(G, H^{t-1}_s, H^{t-1}_s)

H^t_s = MLP(U^t_{w→s} + U^t_{s→s} + H^{t-1}_s)
where GAT(G, H_s, H_w) denotes a graph attention update layer, G is the text heterogeneous graph, H_s are the sentence features serving as the query matrix in the attention mechanism, and H_w are the word features serving as the key and value matrices.
U^t_{w→s} represents the message passed from the words to the sentences, which is updated by a multi-layer perceptron (MLP). Preferably, the multi-layer perceptron comprises two linear hidden layers.
After each update iteration, the text feature H_D is updated as well:

H^t_D = LSTM(H^t_s, H^{t-1}_D)
Through the update iterations of this homogeneous-heterogeneous graph structure based on the graph attention network (GAT), sentences acquire more cross-sentence information through the indirect connections provided by words, and the homogeneous edges between sentence vectors give sentences long-distance correlations, thereby providing more information for summary extraction.
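The alternating update schedule described above (sentence nodes updating word nodes, word nodes updating sentence nodes, and sentence nodes updating each other) can be sketched structurally as follows. The attention layer is replaced by a uniform-average placeholder and the MLP by a tanh, purely to show the message flow of one iteration; these stand-ins are not the patent's actual layers.

```python
import numpy as np

def attend(H_query, H_kv):
    # Placeholder for GAT(G, H_query, H_kv): every query node attends to all
    # key/value nodes with uniform weight (the real layer uses the
    # edge-weighted attention described earlier).
    return np.full((H_query.shape[0], H_kv.shape[0]), 1.0 / H_kv.shape[0]) @ H_kv

def mlp(X):
    # Stand-in for the two-hidden-layer perceptron update.
    return np.tanh(X)

def iterate_graph(H_w, H_s, T=2):
    # One iteration: sentences -> words, then (words -> sentences) plus
    # (sentences -> sentences over homogeneous edges), each with a residual.
    for _ in range(T):
        H_w = mlp(attend(H_w, H_s) + H_w)
        H_s = mlp(attend(H_s, H_w) + attend(H_s, H_s) + H_s)
    return H_w, H_s
```

Swapping `attend` for the edge-weighted attention layer and `mlp` for the two-layer perceptron recovers the full update; the residual additions match the "+ H^{t-1}" terms in the iteration.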
Step S103, calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph, and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector.
Specifically, a relevance score, a redundancy score, a new information score and a recall-oriented evaluation metric (Rouge) score are calculated for each sentence vector in the updated version text heterogeneous graph; the classification weight of the corresponding sentence vector is then calculated through a Sigmoid function based on these four scores. In order to extract suitable sentences as the summary, the invention sets multi-angle sentence evaluation indexes, scores the sentences from the four angles of relevance (Rel), redundancy (Red), new information (Info) and Rouge-F1, weights the sentence features with the scores, and selects the best N sentences as the extraction result.
Relevance is a very intuitive metric that represents the relevance of a sentence to the full text; the higher its value, the better the sentence represents the topic of the article. The relevance score of a single sentence in the updated text heterogeneous graph is calculated through a bilinear function, based on the text feature of the updated graph and the sentence feature of the sentence vector. Redundancy is a concept complementary to relevance: a good summary not only matches the topic of the original text but is also kept as concise as possible, i.e. its own redundancy should be as low as possible. The redundancy score of a single sentence vector is calculated through a bilinear function, based on the sentence feature of the sentence vector in the updated graph. Relevance is a standard that ignores background knowledge and other information sources, whereas the amount of new information is evaluated in combination with background knowledge: readers read a summary hoping to learn something they did not know before, and this new knowledge is the new information. The new information score of a single sentence vector is calculated through a bilinear function, based on the sentence feature of the sentence vector in the updated graph and the knowledge vectors in the preset knowledge base.
Rouge is a machine scoring method commonly used in text summarization, and taking it as an evaluation index can further improve the accuracy of the summary. The recall-oriented evaluation metric (Rouge) score of a single sentence vector in the updated text heterogeneous graph is calculated through the recall-oriented evaluation metric function, based on the original (not encoded and vectorized) target text and the text content of the sentence vector.
The sentence s is scored by combining the four summary indexes; the specific calculation formulas are as follows:
Rel = h_s W_rel H_D^T    (15)

Red = h_s W_red h_s^T    (16)

Info = h_s W_info H_k^T    (17)

Rouge = R(s, ref)    (18)

Score = Sigmoid(Rel - Red + Info + Rouge)    (19)

where h_s is the feature vector of sentence s, H_D is the feature vector of the text to which it belongs, H_k is the knowledge base encoding, s is the sentence itself, ref is the reference summary, W_rel, W_red and W_info are learnable parameters, and R is the Rouge calculation function. The combined result of the indexes is processed by the Sigmoid function and used as the classification weight of the current sentence.
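Formulas (15)-(19) amount to three bilinear scores plus a Rouge score, combined through a Sigmoid. A minimal NumPy sketch with row vectors, where the Rouge value R(s, ref) is passed in precomputed and the weight-matrix shapes are assumptions:

```python
import numpy as np

def sentence_score(h_s, H_D, H_k, W_rel, W_red, W_info, rouge):
    rel = h_s @ W_rel @ H_D     # (15) relevance to the whole text
    red = h_s @ W_red @ h_s     # (16) redundancy of the sentence itself
    info = h_s @ W_info @ H_k   # (17) new information w.r.t. the knowledge base
    # (19) combine, penalizing redundancy; the result is the classification
    # weight of the sentence, squashed into (0, 1) by the sigmoid.
    return 1.0 / (1.0 + np.exp(-(rel - red + info + rouge)))
```

Note the sign on the redundancy term: a sentence that scores high on relevance, new information and Rouge but low on redundancy receives the largest classification weight.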
Step S104: weighting the corresponding updated sentence features based on the classification weight of each sentence vector, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text summary according to the acquired sentence labels.
Specifically, after the classification weight of each sentence vector is obtained in the above steps, the corresponding updated sentence features are weighted to obtain the weighted sentence features; corresponding sentence labels are then obtained from the weighted sentence features through a perceptron classifier, and the text summary is generated according to the obtained labels. In the process of selecting summary sentences with the perceptron classifier, the best N sentences are selected as the summary, and a trigram blocking strategy is used to reduce the redundancy of the summary.
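The top-N selection with trigram blocking can be sketched as follows, assuming whitespace tokenization; a candidate sentence is skipped when it shares any trigram with the sentences already selected. The function names are illustrative.

```python
def trigrams(tokens):
    # All consecutive 3-token spans of a sentence.
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_summary(sentences, scores, n=3):
    # Greedily take the highest-scoring sentences, blocking any candidate
    # that repeats a trigram already present in the summary.
    chosen, seen = [], set()
    for i in sorted(range(len(sentences)), key=lambda k: -scores[k]):
        tg = trigrams(sentences[i].split())
        if tg & seen:
            continue
        chosen.append(i)
        seen |= tg
        if len(chosen) == n:
            break
    return sorted(chosen)
```

Returning the indices in document order preserves the original reading order of the extracted summary.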
In order to verify the effectiveness of the invention, the heterogeneous-graph-based text summarization method is compared experimentally with other methods; the effect of each summarization method is evaluated by the Rouge evaluation method, and the Rouge scores are shown in fig. 5. Here, KHHGS denotes the heterogeneous-graph-based text summarization method of the invention.
In the comparative experiments, the CNN & Daily Mail corpus is used as the data set, and the invention is compared with other summarization methods; the results are shown in fig. 5. Compared with the RNN model BiGRU, the KHHGS model (the method of the present application) has obvious advantages, and it is also superior to the Transformer model. From the data in the table it can be found that the Transformer score is similar to that of the homogeneous graph method, which indicates that the Transformer can be regarded as a fully-connected graph at the sentence level. The KHHGS effect is also superior to that of the previous heterogeneous graph text summarization model HSG, with the three Rouge indexes improved by 0.14/0.46/0.97 respectively, indicating that the strategies proposed by the invention can effectively improve the effect of the heterogeneous graph summarization model on the text summarization task.
In order to better explain the effectiveness of the proposed strategies on the text summarization task, ablation experiments were performed on the model: the external knowledge base, the homogeneous graph and the summary index calculation module were removed in turn for experimental analysis. As shown in fig. 6, each entry in the table represents the experimental result after removing the corresponding module. Adding the knowledge base improves Rouge-2 and Rouge-L to a certain extent but does not obviously improve Rouge-1; since the knowledge base used here is a general-purpose one, its benefit on a news corpus is weaker, and no mature news knowledge base is currently known. Adding the homogeneous graph clearly increases all Rouge indexes, especially the Rouge-L value, probably because it strengthens the relations between sentences, allowing the model to exploit them better and affecting the length of the longest overlapping substrings in the final extraction result. In addition, the multi-angle summary indexes effectively improve the effect. Further experiments were conducted on the influence of the individual indexes on the summary, as shown in fig. 7, where the horizontal axis represents the range of a given index score and the vertical axis represents the probability that a sentence with a score in that range belongs to the reference summary. As can be seen from the figure, the relevance and Rouge scores have a large influence on the summary result: sentences scoring high on these two are very likely to belong to the reference summary. The amount of new information also has a certain, though less obvious, influence; since the invention uses a general knowledge base as the background knowledge for calculating the new information, low-discrimination general knowledge is in effect filtered out of the content, similar to removing stop words during data processing, letting the model focus on other key sentences. Redundancy has no obvious influence, probably because the invention proposes a sentence-level summarization method while the Rouge score used for model evaluation is a summary-level evaluation, so evaluating the redundancy of a single sentence cannot effectively improve the final result.
In order to express the function of each proposed index more intuitively, a test sample is selected to quantify the indexes, as shown in fig. 8. Each row of the table is a sentence of the original text, and the right side gives the normalized score of each index calculated by the formulas and the total score; since the above analysis shows that redundancy plays no key role in the model, it is not included in the quantification list. The data in the table show that sentence length has a certain relation with the relevance index: the longer the sentence, the higher its relevance to the original text and the more information it contains, so the longest sentence 2 in the table obtains the highest relevance score. At the same time, the content of sentence 2 is easy to judge, and its description is very similar to that of the reference summary, so sentence 2 also obtains a higher Rouge score and finally the highest comprehensive score.
In conclusion, compared with the existing commonly used summarization method, the text summarization method based on the heterogeneous graph has great advantages in capturing the long-distance dependency relationship.
The heterogeneous-graph-based text summarization method provided by the embodiment of the invention connects the text according to semantic and syntactic relations to construct a text heterogeneous graph, updates the two types of node features (words and sentences) with a graph attention network, and designs several summary-related metrics to weight and evaluate sentences for the final summary extraction, thereby considering not only the information transfer between words and sentences but also the mutual influence between sentences. The further added external knowledge base helps the model understand the text better; the multi-angle indexes designed for the summarization task add corresponding weights to sentences before classification, effectively improving the model's utilization of text features and producing a more accurate and more readable summary. The embodiment of the invention also adopts a more direct approach: sentences and words are used as two types of nodes to construct the heterogeneous graph, with word nodes serving as intermediaries between sentences, which enriches the associations between sentences and transfers information indirectly.
Example two
In order to solve the technical problems in the prior art, the embodiment of the invention provides a text summarization device based on heterogeneous graphs.
FIG. 9 is a schematic structural diagram of a text summarization apparatus based on heterogeneous graphs according to a second embodiment of the present invention; referring to fig. 9, the text summarization device based on heterogeneous graphs of the present invention includes a text heterogeneous graph construction module, an update module, a classification weight acquisition module, and a summary generation module, which are connected in sequence.
The text heterogeneous graph building module is used for carrying out knowledge fusion on a preset knowledge base and the target text, obtaining word features and sentence features of the target text, and building a text heterogeneous graph of the target text based on the word features and the sentence features.
The updating module is used for updating the text heterogeneous graph through the graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph.
The classification weight acquisition module is used for calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph and calculating the classification weights of the corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector.
The abstract generating module is used for weighting the sentence features in the updated version text heterogeneous graph respectively based on the classification weights of the sentence vectors, acquiring corresponding sentence labels based on the weighted sentence features, and generating the text abstract according to the acquired sentence labels.
The heterogeneous-graph-based text summarization device provided by the embodiment of the invention connects the text according to semantic and syntactic relations to construct a text heterogeneous graph, updates the two types of node features (words and sentences) with a graph attention network, and designs several summary-related metrics to weight and evaluate sentences for the final summary extraction, thereby considering not only the information transfer between words and sentences but also the mutual influence between sentences. The further added external knowledge base helps the model understand the text better; the multi-angle indexes designed for the summarization task add corresponding weights to sentences before classification, effectively improving the model's utilization of text features and producing a more accurate and more readable summary. The embodiment of the invention also adopts a more direct approach: sentences and words are used as two types of nodes to construct the heterogeneous graph, with word nodes serving as intermediaries between sentences, which enriches the associations between sentences and transfers information indirectly.
EXAMPLE III
In order to solve the above technical problems in the prior art, an embodiment of the present invention further provides a storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program can implement all steps in the text summarization method based on heterogeneous graphs in the first embodiment.
The specific steps of the text summarization method based on heterogeneous images and the beneficial effects obtained by applying the readable storage medium provided by the embodiment of the invention are the same as those of the first embodiment, and are not described herein again.
It should be noted that: the storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Example four
In order to solve the technical problems in the prior art, the embodiment of the invention also provides a terminal.
Fig. 10 is a schematic structural diagram of a four-terminal according to an embodiment of the present invention, and referring to fig. 10, the terminal according to this embodiment includes a processor and a memory, which are connected to each other; the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory, so that the terminal can realize all steps in the text summarization method based on heterogeneous graphs in the embodiment when executing.
The specific steps of the text summarization method based on the heterogeneous graph and the beneficial effects obtained by the terminal applying the embodiment of the invention are the same as those of the first embodiment, and are not described herein again.
It should be noted that the Memory may include a Random Access Memory (RAM) and may also include a non-volatile Memory, such as at least one disk Memory. Similarly, the Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A text summarization method based on heterogeneous graphs comprises the following steps:
performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features;
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph, and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
respectively weighting sentence features in the updated version text heterogeneous graph based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels;
wherein constructing a text heterogeneous graph of the target text based on the word features and sentence features comprises:
based on sentence features of sentence vectors in the target text, calculating the homogeneous edge weight between every two sentence vectors of all the sentence vectors in the target text in a cosine similarity calculation mode;
calculating heterogeneous edge weights among all word vectors and sentence vectors to which the word vectors belong in the target text through a TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong;
taking word vectors in the target text as word nodes and sentence vectors in the target text as sentence nodes, constructing a text heterogeneous graph of the target text based on word features of the word vectors, sentence features of the sentence vectors, homogeneous edge weights among the sentence vectors and heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong,
and the multiclass abstract indexes of the sentence vector comprise a relevance score, a redundancy score, a new information score and a recall-oriented evaluation metric score,
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight, wherein the step of acquiring the updated text heterogeneous graph comprises the following steps:
calculating attention weights between every two sentence vectors of all sentence vectors in the target text, calculating attention weights between all word vectors in the target text and the sentence vectors to which the word vectors belong, and acquiring all homogeneous edge weights and all heterogeneous edge weights in the target text;
updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on the attention weight between every two sentence vectors in the target text, the attention weight between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights to obtain an updated version text heterogeneous graph,
wherein updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on attention weights between every two sentence vectors of all sentence vectors in the target text, attention weights between all word vectors and sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights comprises:
taking word nodes as central nodes, taking the product of attention weight between sentence nodes connected with the central nodes and heterogeneous edge weight between sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the updating of the word nodes;
taking sentence nodes as central nodes, taking the product of attention weight between word nodes connected with the central nodes and heterogeneous edge weight between the word nodes connected with the central nodes as weight, and carrying out weighted aggregation on word characteristics of the word nodes connected with the central nodes to realize the updating of the sentence nodes;
and taking sentence nodes as central nodes, taking the product of attention weight between the sentence nodes connected with the central nodes and homogeneous edge weight between the sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on the sentence characteristics of the sentence nodes connected with the central nodes to realize the update of the sentence nodes.
2. The method of claim 1, wherein performing knowledge fusion on a preset knowledge base and a target text, and obtaining word features and sentence features of the target text comprises:
respectively encoding and vectorizing knowledge in a preset knowledge base and content in a target text to acquire a knowledge vector in the preset knowledge base and a word vector in the target text;
respectively calculating the attention weight of each word vector in the target text and the attention weight of the knowledge vector in the preset knowledge base to obtain the attention weight of each word vector in the target text;
sequentially taking the attention weight of the word vector in the target text as a weight, and respectively weighting and combining the knowledge vectors in the preset knowledge base to obtain the knowledge weight of each word vector in the target text;
acquiring word features of corresponding word vectors based on knowledge weight of each word vector in the target text;
and respectively performing local feature capture and global feature capture on the word features of the word vectors contained in each sentence vector in the target text to obtain the local features and the global features of each sentence vector, and respectively obtaining the sentence features of the corresponding sentences according to the local features and the global features of each sentence vector.
3. The method of claim 1, wherein calculating multiple types of summarization indexes of sentence vectors in the updated text heterogeneous map and calculating classification weights of corresponding sentence vectors according to the multiple types of summarization indexes corresponding to each sentence vector respectively comprises:
calculating a relevance score, a redundancy score, a new information score and a recall rate evaluation-oriented metric score of each sentence vector in the updated text heterogeneous graph;
and calculating the classification weight of the corresponding sentence vector through a Sigmoid function based on the relevance score, the redundancy score, the new information score and the recall rate evaluation-oriented metric score of each sentence vector.
4. The method of claim 3, wherein the step of calculating relevance scores, redundancy scores, new information scores, and recall-assessment-oriented metric scores for individual sentence vectors in the updated textual heterogeneous graph comprises:
calculating the relevance score of the sentence vector through a bilinear function based on the text characteristics of the updated version text heterogeneous graph and the sentence characteristics of the sentence vector;
calculating a redundancy score of the sentence vector through a bilinear function based on the sentence characteristics of the sentence vector in the updated text heterogeneous graph;
calculating a new information quantity score of the sentence vector through a bilinear function based on the sentence characteristics of the sentence vector in the updated text heterogeneous graph and the knowledge vector in the preset knowledge base;
and calculating the recall-rate evaluation-oriented metric score of the sentence vector through the recall-rate evaluation-oriented metric function based on the target text which is not coded and vectorized and the text content of the sentence vector.
5. A text summarization device based on heterogeneous graphs is characterized by comprising:
the text heterogeneous graph building module is used for carrying out knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and building a text heterogeneous graph of the target text based on the word features and the sentence features;
the updating module is used for updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
the classification weight acquisition module is used for calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous image and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
the abstract generating module is used for weighting sentence characteristics in the updated version text heterogeneous graph respectively based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence characteristics and generating a text abstract according to the acquired sentence labels;
wherein constructing a text heterogeneous graph of the target text based on the word features and sentence features comprises:
based on the sentence features of the sentence vectors in the target text, calculating homogeneous edge weights between every two of the sentence vectors in the target text by means of cosine similarity;
calculating heterogeneous edge weights between each word vector in the target text and the sentence vectors to which the word vector belongs through a TF-IDF algorithm, based on the word features of the word vectors and the sentence features of the sentence vectors to which they belong;
taking the word vectors in the target text as word nodes and the sentence vectors in the target text as sentence nodes, and constructing the text heterogeneous graph of the target text based on the word features of the word vectors, the sentence features of the sentence vectors, the homogeneous edge weights between the sentence vectors, and the heterogeneous edge weights between the word vectors and the sentence vectors to which they belong,
and the multi-class abstract indexes of the sentence vectors comprise a relevance score, a redundancy score, a new-information score and a recall-oriented evaluation metric score,
wherein updating the text heterogeneous graph through a graph attention network based on the edge weights and the attention weights to acquire the updated text heterogeneous graph comprises the following steps:
calculating attention weights between every two sentence vectors of all sentence vectors in the target text, calculating attention weights between all word vectors in the target text and the sentence vectors to which the word vectors belong, and acquiring all homogeneous edge weights and all heterogeneous edge weights in the target text;
updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on the attention weights between every two sentence vectors in the target text, the attention weights between all word vectors and the sentence vectors to which they belong, all homogeneous edge weights and all heterogeneous edge weights, to obtain the updated text heterogeneous graph,
wherein, based on the attention weights between every two sentence vectors of all sentence vectors in the target text, the attention weights between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights, updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network comprises:
taking word nodes as central nodes, taking the product of attention weight between sentence nodes connected with the central nodes and heterogeneous edge weight between the sentence nodes connected with the central nodes as weight, and performing weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the updating of the word nodes;
taking sentence nodes as central nodes, taking the product of attention weight between word nodes connected with the central nodes and heterogeneous edge weight between the word nodes connected with the central nodes as weight, and performing weighted aggregation on the word characteristics of the word nodes connected with the central nodes to realize the updating of the sentence nodes;
and taking sentence nodes as central nodes, taking the product of attention weight between the sentence nodes connected with the central nodes and homogeneous edge weight between the sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on the sentence characteristics of the sentence nodes connected with the central nodes to realize the update of the sentence nodes.
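The edge-weight construction and graph-attention update described in claim 5 can be sketched as follows. This is a minimal NumPy prototype under assumed toy sentences and random features, not the patent's actual encoder or trained attention: homogeneous edges use cosine similarity, heterogeneous edges use TF-IDF, and a sentence node is updated by aggregating its word neighbours weighted by attention times edge weight.

```python
import math
from collections import Counter

import numpy as np

# Toy target text; tokenisation and features are stand-ins.
sentences = [
    "graph attention networks aggregate neighbour features",
    "sentence nodes aggregate word features",
    "edge weights come from tf idf scores",
]
tokenized = [s.split() for s in sentences]
vocab = sorted({w for toks in tokenized for w in toks})

rng = np.random.default_rng(1)
d = 6
sent_feats = rng.normal(size=(len(sentences), d))
word_feats = {w: rng.normal(size=d) for w in vocab}

# Homogeneous (sentence-sentence) edge weights: cosine similarity.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

homo_w = {(i, j): cosine(sent_feats[i], sent_feats[j])
          for i in range(len(sentences))
          for j in range(len(sentences)) if i != j}

# Heterogeneous (word-sentence) edge weights: TF-IDF.
df = Counter(w for toks in tokenized for w in set(toks))

def tfidf(word, toks):
    tf = toks.count(word) / len(toks)
    idf = math.log(len(sentences) / df[word])
    return tf * idf

hetero_w = {(w, i): tfidf(w, toks)
            for i, toks in enumerate(tokenized) for w in set(toks)}

# GAT-style update of a sentence node: dot-product attention over its
# word neighbours, multiplied by the heterogeneous edge weight, then
# weighted aggregation of the word features.
def update_sentence(i):
    words = sorted(set(tokenized[i]))
    att = np.array([sent_feats[i] @ word_feats[w] for w in words])
    att = np.exp(att - att.max())
    att /= att.sum()                                    # attention weights
    edge = np.array([hetero_w[(w, i)] for w in words])  # TF-IDF edge weights
    coeff = att * edge
    return sum(c * word_feats[w] for c, w in zip(coeff, words))
```

The word-node update and the sentence-to-sentence update in the claim are symmetric: swap the roles of the centre node and its neighbours, and use `homo_w` in place of `hetero_w` for the sentence-to-sentence case.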
6. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the text summarization method based on heterogeneous graphs according to any one of claims 1 to 4.
7. A terminal, comprising: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the terminal executes the text summarization method based on heterogeneous graphs according to any one of claims 1 to 4.
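The classification-weight and abstract-generation modules of claim 5 can be sketched as below. This is a hypothetical prototype, not the patented implementation: the four per-sentence indexes are random stand-ins for the scores of claim 4, the combining layer is an assumed linear-plus-sigmoid mapping, and top-k selection stands in for the label-acquisition step.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 8                          # sentences, feature dimension (assumed)
sent_feats = rng.normal(size=(n, d))

# Four per-sentence abstract indexes (relevance, redundancy,
# new-information, ROUGE) — random stand-ins here.
scores = rng.normal(size=(n, 4))

# Combine the indexes into one classification weight per sentence via an
# assumed learned linear layer followed by a sigmoid.
w = rng.normal(size=4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class_weights = sigmoid(scores @ w)            # shape (n,)

# Weight the sentence features, score each weighted sentence with a
# linear classifier, and label the top-k sentences as summary sentences.
weighted = class_weights[:, None] * sent_feats
clf = rng.normal(size=d)
logits = weighted @ clf
k = 2
labels = np.zeros(n, dtype=int)
labels[np.argsort(logits)[-k:]] = 1            # 1 = selected for the summary

summary = [i for i in range(n) if labels[i] == 1]
```

In the claimed device the combining weights and the classifier would be learned end to end with the graph attention network.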
CN202110533278.5A 2021-05-17 2021-05-17 Text summarization method and device based on heterogeneous graph, storage medium and terminal Active CN113127632B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110533278.5A CN113127632B (en) 2021-05-17 2021-05-17 Text summarization method and device based on heterogeneous graph, storage medium and terminal
PCT/CN2021/103504 WO2022241913A1 (en) 2021-05-17 2021-06-30 Heterogeneous graph-based text summarization method and apparatus, storage medium, and terminal

Publications (2)

Publication Number Publication Date
CN113127632A CN113127632A (en) 2021-07-16
CN113127632B true CN113127632B (en) 2022-07-26

Family

ID=76783109

Country Status (2)

Country Link
CN (1) CN113127632B (en)
WO (1) WO2022241913A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672706B (en) * 2021-08-31 2024-04-26 清华大学苏州汽车研究院(相城) Text abstract extraction method based on attribute heterogeneous network
CN114860920A (en) * 2022-04-20 2022-08-05 内蒙古工业大学 Method for generating monolingual subject abstract based on heteromorphic graph
CN115906867B (en) * 2022-11-30 2023-10-31 华中师范大学 Test question feature extraction and knowledge point labeling method based on hidden knowledge space mapping
CN116450813B (en) * 2023-06-19 2023-09-19 深圳得理科技有限公司 Text key information extraction method, device, equipment and computer storage medium
CN117520995B (en) * 2024-01-03 2024-04-02 中国海洋大学 Abnormal user detection method and system in network information platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046698A (en) * 2019-04-28 2019-07-23 北京邮电大学 Heterogeneous figure neural network generation method, device, electronic equipment and storage medium
CN112380435A (en) * 2020-11-16 2021-02-19 北京大学 Literature recommendation method and recommendation system based on heterogeneous graph neural network

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9195941B2 (en) * 2013-04-23 2015-11-24 International Business Machines Corporation Predictive and descriptive analysis on relations graphs with heterogeneous entities
US10055486B1 (en) * 2014-08-05 2018-08-21 Hrl Laboratories, Llc System and method for real world event summarization with microblog data
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
US10824815B2 (en) * 2019-01-02 2020-11-03 Netapp, Inc. Document classification using attention networks
CN110334192B (en) * 2019-07-15 2021-09-24 河北科技师范学院 Text abstract generation method and system, electronic equipment and storage medium
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111858913A (en) * 2020-07-08 2020-10-30 北京嘀嘀无限科技发展有限公司 Method and system for automatically generating text abstract
CN112084331A (en) * 2020-08-27 2020-12-15 清华大学 Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN112288091B (en) * 2020-10-30 2023-03-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Knowledge inference method based on multi-mode knowledge graph
CN112464657B (en) * 2020-12-07 2022-07-08 上海交通大学 Hybrid text abstract generation method, system, terminal and storage medium
CN112818113A (en) * 2021-01-26 2021-05-18 山西三友和智慧信息技术股份有限公司 Automatic text summarization method based on heteromorphic graph network

Also Published As

Publication number Publication date
CN113127632A (en) 2021-07-16
WO2022241913A1 (en) 2022-11-24

Similar Documents

Publication Publication Date Title
CN113127632B (en) Text summarization method and device based on heterogeneous graph, storage medium and terminal
WO2022068196A1 (en) Cross-modal data processing method and device, storage medium, and electronic device
Xie et al. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN111914950B (en) Unsupervised cross-modal retrieval model training method based on depth dual variational hash
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN113971222A (en) Multi-mode composite coding image retrieval method and system
CN112148831A (en) Image-text mixed retrieval method and device, storage medium and computer equipment
Hou et al. Remote sensing image retrieval with deep features encoding of Inception V4 and largevis dimensionality reduction
CN113704466A (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN113919338A (en) Method and device for processing text data
CN111259650A (en) Text automatic generation method based on class mark sequence generation type countermeasure model
CN117609479B (en) Model processing method, device, equipment, medium and product
CN116703531B (en) Article data processing method, apparatus, computer device and storage medium
CN116991877B (en) Method, device and application for generating structured query statement
CN113886535B (en) Knowledge graph-based question and answer method and device, storage medium and electronic equipment
CN117033646A (en) Information query method, device, electronic equipment and computer readable storage medium
CN117931858A (en) Data query method, device, computer equipment and storage medium
CN117034928A (en) Model construction method, device, equipment and storage medium
CN116975315A (en) Text matching method, device, computer equipment and storage medium
CN117743839A (en) Model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant