CN114298037A - Text abstract acquisition method based on deep learning - Google Patents

Text abstract acquisition method based on deep learning

Info

Publication number
CN114298037A
CN114298037A
Authority
CN
China
Prior art keywords
text
keywords
semantic information
information
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111662780.2A
Other languages
Chinese (zh)
Inventor
张丽
遆敬苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111662780.2A priority Critical patent/CN114298037A/en
Publication of CN114298037A publication Critical patent/CN114298037A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text abstract acquisition method based on deep learning, which comprises: first extracting the keywords of the original document; constructing an Encoder module to extract global semantic information; constructing a graph convolution module to extract local semantic information; and constructing a Decoder module to generate the text summary. The text summarization task is to refine and condense massive text data; by compressing it into a concise, intuitive summary, the time a user spends browsing the text is saved. The method takes the keywords as local features and the original text as the global feature to obtain a rich semantic representation of the original text, since understanding the semantics of the original text is the prerequisite for generating a high-quality summary. The weights among the features are updated by graph convolution, which further promotes the transmission of meaningful semantic information and suppresses meaningless message passing, so that the obtained semantic information better reflects the central idea of the original text; the generated summary thus reflects the core of the original text, and summaries without a central idea are avoided.

Description

Text abstract acquisition method based on deep learning
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text abstract acquisition method based on deep learning.
Background
With the rapid development of the internet industry, more and more people rely on internet platforms to publish and obtain information, and the amount of text that people encounter daily has grown explosively. A large amount of information can be accessed quickly through internet platforms, but because online information is vast and disorganized, people must spend ever more time screening out the key information in a text. Extracting the important content from massive text has therefore become an urgent need. Traditional text summarization relies mainly on manual summarization, which requires enormous time and labor costs; given the explosive growth of text, relying on human labor alone to write summaries is impractical. Automatic text summarization, the technology of summarizing text automatically by machine, has therefore become a popular and actively researched field.
By output type, automatic text summarization can be divided into two categories: extractive summarization and abstractive (generative) summarization. Extractive summarization selects important segments from the original text and combines them to form the summary; it condenses the content effectively, is easy for readers to understand, and is simple to implement, making it currently the most mainstream, most widely applied, and easiest approach. However, it has a non-negligible drawback: adjacent segments in the summary are not necessarily adjacent in content, which can make the summary semantically incoherent. In contrast, abstractive summarization does not merely extract existing segments from the original text; it produces a condensed restatement of the main content, possibly using vocabulary that does not appear in the original. It is more flexible than extractive summarization and closer to the way humans write summaries, but it requires understanding the original document and generating a concise, highly readable summary, which makes the task difficult and challenging. By document type, automatic text summarization can be divided into single-document and multi-document summarization: single-document summarization generates a summary from one given document, while multi-document summarization generates a summary from a given set of topically related documents. With the rapid development of artificial intelligence, natural language processing based on neural networks and deep learning has advanced remarkably, and automatic text summarization, as an important field of natural language processing, has received wide attention. More and more researchers are working on automatic summarization with deep neural networks; abstractive summarization has made substantial progress, and extractive summarization has also improved greatly. Despite these advances, current technology is still far from sufficient for generating high-quality summaries. Abstractive summarization requires a model with strong abilities to represent, understand, and generate text, and existing abstractive summaries still suffer from problems of readability, redundancy, limited information content, and factual errors.
Disclosure of Invention
The text summarization task is to refine and condense massive text data; by compressing it into a concise, intuitive summary, the time a user spends browsing the text is saved. As the text information people encounter daily keeps growing, text summarization has become an urgent need. Automatic text summarization is an important field of natural language processing that aims to produce, automatically by machine, a summary that is concise, coherent, informative, and accurate. With the development of deep learning, text summarization technology has made some progress, but it is still far from meeting practical needs. For a computer, summarization is a very challenging task: to generate a summary, the computer must read and understand the original text, then prune, reorganize, and splice the content according to its importance, finally producing a fluent short text.
To address the quality problem of generated summaries, the invention fuses local information with global features to strengthen the model's semantic representation of the input text, thereby improving the quality of the generated summary. The method comprises the following steps:
Step 1, extracting keywords of the original text.
For a text, the topic of the whole document can often be glimpsed through a few keywords, so the method extracts several keywords that represent the semantic content of the article as the local information of the text. Because the data sets used for text summarization do not provide keywords, the invention must first extract the keywords of the original document; the extraction method is essentially unsupervised.
The steps for extracting the keywords of the original text are as follows (a code sketch follows step 1.4):
Step 1.1, take the position information of words into account: words appearing in the first and last sentences are more likely to be keywords, so the first and last sentences of the document are each repeated 3 times, which increases the term frequency of the keywords they contain.
Step 1.2, segment the text into words and select 20 words as candidate keywords according to each word's tf-idf statistics.
Step 1.3, the desired keywords should represent the central idea of the original text as far as possible, which keywords obtained purely from statistical information cannot guarantee, so the keywords obtained in step 1.2 are further screened: a vector representation d of the document is obtained with Doc2Vec, and a vector representation w of each candidate keyword is obtained with Word2Vec. The candidate keywords are ranked by the cosine distance between w and d, and the candidates closest to the document are selected from the initial set; the closer a keyword is to the document, the more of the document's information it describes, which ensures that the retained keywords are more relevant to the document.
Step 1.4, to avoid redundancy among the final keywords, i.e. extracted keywords that have the same meaning despite different wording, the keywords obtained in step 1.3 are screened a second time: the candidates are likewise ranked by the cosine distances among themselves, and only one keyword is kept for each distinct meaning.
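The pipeline of steps 1.1 to 1.4 can be sketched as follows. This is a minimal illustration, not the patent's exact implementation: the pre-trained Doc2Vec and Word2Vec models are represented by the hypothetical callables embed_doc and embed_word, and the number of final keywords and the redundancy threshold are assumed values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def extract_keywords(doc, corpus, embed_word, embed_doc,
                     n_candidates=20, n_keywords=5, sim_thresh=0.8):
    sents = doc.split(". ")
    # Step 1.1: repeat the first and last sentence 3 times to boost their term frequency.
    boosted = " ".join([sents[0]] * 3 + sents + [sents[-1]] * 3)
    # Step 1.2: select the top candidates by tf-idf (idf statistics from the corpus).
    vectorizer = TfidfVectorizer().fit(corpus)
    scores = vectorizer.transform([boosted]).toarray()[0]
    vocab = np.array(vectorizer.get_feature_names_out())
    candidates = vocab[np.argsort(scores)[::-1][:n_candidates]]
    # Step 1.3: rank candidates by cosine similarity to the document vector.
    d = embed_doc(doc)
    ranked = sorted(candidates, key=lambda w: cosine(embed_word(w), d), reverse=True)
    # Step 1.4: redundancy filter -- keep only one keyword per distinct meaning.
    kept = []
    for w in ranked:
        if all(cosine(embed_word(w), embed_word(u)) < sim_thresh for u in kept):
            kept.append(w)
        if len(kept) == n_keywords:
            break
    return kept
```

The sentence splitting and threshold here are deliberately crude; any tokenizer and similarity cutoff consistent with steps 1.1 to 1.4 would serve the same role.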
Step 2, constructing the Encoder module.
The purpose of this module is to encode, i.e. vectorize, the input text. The Encoder module of the invention uses the Encoder of the Transformer, finally obtaining a semantic representation of the original text that carries both semantic and contextual features; this representation serves as the global semantic information.
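A minimal sketch of such an encoder, using PyTorch's built-in Transformer encoder; the model dimensions and vocabulary size are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

vocab_size, d_model, nhead, num_layers = 30000, 512, 8, 6
embedding = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

token_ids = torch.randint(0, vocab_size, (2, 128))   # (batch, seq_len) toy input
global_semantics = encoder(embedding(token_ids))     # (2, 128, 512) global semantic info
```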
Step 3, constructing the graph convolution module.
The purpose of this module is to integrate relational information into the local semantic information. Step 1 yields the semantic information of the different keywords; to mine more effective local semantic features, a graph convolution is used to enrich the local features with relational features, producing local semantic information that carries relation information. The input to the graph convolution consists of nodes and an adjacency matrix: the nodes are the local semantic information extracted in step 1, the nodes are interrelated, and the adjacency matrix represents the degree of relation between the nodes. The graph convolution then adaptively learns the relation weight between each pair of keywords; once the adjacency matrix among the keywords is obtained, it is multiplied with the initial semantic information to obtain the relation features, which are then fused with the initial features to obtain a new round of features.
The method comprises the following steps (a code sketch follows step 3.7):
Step 3.1, take the K pieces of local semantic information and the 1 piece of global semantic information obtained in steps 1 and 2 as the nodes of the graph.
Step 3.2, construct the adjacency matrix of the graph and initialize all of its entries to 1.
Step 3.3, the larger the difference between a local feature and the global feature, the more of an outlier that local feature is; therefore the differences between the local semantic information and the global semantic information are computed to build a difference matrix, and these differences are used to dynamically update the edge weights of the nodes in the graph. Concretely, the global semantic information is repeated K times, the degree of difference between each of the K pieces of local semantic information and the global semantic information is computed, and the difference matrix is obtained.
Step 3.4, convert the difference matrix obtained in step 3.3 into a matrix of dimension (K, K) using operations such as a linear transformation; this matrix is called the update matrix.
Step 3.5, multiply the update matrix obtained in step 3.4 element-wise with the adjacency matrix; the purpose of this operation is to learn the adjacency matrix adaptively through the update matrix.
Step 3.6, multiply the adjacency matrix obtained in step 3.5 with the node information to obtain the relation features of the semantic information.
Step 3.7, concatenate the local relation features obtained in step 3.6 with the local semantic information of the nodes to obtain the local semantic information carrying relation information.
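The following module sketches steps 3.1 to 3.7. It is an illustration under stated assumptions: the hidden size d, the use of the absolute difference as the degree of difference in step 3.3, and the layer names are choices made here, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class LocalRelationGraphConv(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.to_update = nn.Linear(d, k)   # step 3.4: maps the difference matrix to (K, K)

    def forward(self, local, global_sem):
        # local: (K, d) keyword features (step 3.1); global_sem: (d,) text feature (step 2).
        k, d = local.shape
        adj = torch.ones(k, k)                          # step 3.2: adjacency initialized to 1
        diff = (local - global_sem.expand(k, d)).abs()  # step 3.3: (K, d) difference matrix
        update = self.to_update(diff)                   # step 3.4: (K, K) update matrix
        adj = adj * update                              # step 3.5: element-wise product
        relation = adj @ local                          # step 3.6: relation features
        return torch.cat([relation, local], dim=-1)     # step 3.7: fuse with local info
```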
Step 4, constructing the Decoder module.
The purpose of this module is to generate the summary of the original text. A pointer-generator network is a seq2seq model with a copy mechanism that predicts words from the probability distributions of a generator and a pointer. The generator predicts the word of the current step mainly from the background vector output by the encoder module, the decoder's hidden state at the current step, and the decoder's prediction from the previous step; the words it predicts come from the vocabulary, so it can predict words that do not appear in the original document. The pointer's probability distribution, by contrast, predicts text from the original document that the pointer points to. The summary generated by a pointer-generator network can therefore both produce new words and copy text from the original document. The pointer-generator network can be viewed as a balance between extractive and abstractive methods: copying words improves accuracy and the handling of unknown words, while the ability to generate new words is preserved. The invention uses an RNN with an attention mechanism as the decoder to output the summary. The specific steps are as follows (a code sketch follows step 4.3):
and 4.1, fusing the global semantic information obtained in the step 2 and the step 3 and the local semantic information with the relation information in a summing mode to serve as an initialized hidden vector of the decoder module.
And 4.2, calculating the importance degree of each word in the original text to the Decoder word according to an attention mechanism, and obtaining semantic representation of the original text with different attention information according to different importance degrees for each step of the Decoder, wherein the semantic representation is called as a background vector.
And 4.3, predicting the output of the current time step according to the semantic representation of the original text, the output of the previous time step and the hidden vector of the current time step, and finally obtaining the predicted output of each time step so as to obtain the text abstract of the original text.
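One decoding step of steps 4.1 to 4.3, in pointer-generator form, can be sketched as below. The dimensions, the GRU cell, and the layer names are assumptions for illustration; the patent specifies only an attention RNN decoder with a copy mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerGeneratorStep(nn.Module):
    def __init__(self, d, vocab_size):
        super().__init__()
        self.cell = nn.GRUCell(2 * d, d)              # decoder RNN cell
        self.vocab_out = nn.Linear(2 * d, vocab_size) # generator distribution
        self.p_gen = nn.Linear(3 * d, 1)              # soft switch: generate vs. copy

    def forward(self, y_prev, h, enc, src_ids):
        # y_prev: (B, d) embedding of the previous prediction; h: (B, d) hidden state
        # (initialized per step 4.1 as global semantics + relation-aware local semantics);
        # enc: (B, T, d) encoder states; src_ids: (B, T) source token ids (LongTensor).
        # Step 4.2: attention over the source -> background (context) vector.
        attn = F.softmax(torch.bmm(enc, h.unsqueeze(-1)).squeeze(-1), dim=-1)   # (B, T)
        ctx = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)                      # (B, d)
        # Step 4.3: update the hidden state and predict the current word.
        h = self.cell(torch.cat([y_prev, ctx], dim=-1), h)
        vocab_dist = F.softmax(self.vocab_out(torch.cat([h, ctx], dim=-1)), dim=-1)
        # Copy mechanism: mix the generator and pointer distributions with p_gen.
        p = torch.sigmoid(self.p_gen(torch.cat([ctx, h, y_prev], dim=-1)))      # (B, 1)
        final_dist = (p * vocab_dist).scatter_add(1, src_ids, (1 - p) * attn)
        return final_dist, h
```

At inference time this step would be applied repeatedly, feeding each predicted word's embedding back in as y_prev until an end-of-summary token is produced.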
Compared with the prior art, the invention has the following advantages:
(1) The method takes the keywords as local features and the original text as the global feature, obtaining a richer semantic representation of the original text; understanding the semantics of the original text is the prerequisite for generating a high-quality summary.
(2) Meaningful features lie closer to the global feature than meaningless ones. The method updates the weights among the features with graph convolution, which further promotes the transmission of meaningful semantic information and suppresses meaningless message passing. The semantic information obtained for the original text therefore better reflects its central idea, the generated summary reflects the core of the original text, and summaries lacking a central idea are avoided.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
The flow chart of an embodiment is shown in fig. 1, and comprises the following steps:
step S10, extracting the key words of the original document;
step S20, constructing an Encoder module to extract global semantic information;
step S30, constructing a graph convolution module to extract local semantic information;
and step S40, the Decoder module is constructed to generate a text abstract.
Steps S10 to S40 are carried out exactly as steps 1 to 4 described in the Disclosure of Invention above: step 1 extracts the keywords of the original text, step 2 constructs the Encoder module to obtain the global semantic information, step 3 constructs the graph convolution module to obtain the relation-aware local semantic information, and step 4 constructs the Decoder module to generate the text summary.

Claims (3)

1. A text abstract acquisition method based on deep learning, characterized by comprising the following steps: step 1, extracting keywords of an original text;
extracting several keywords that represent the semantic content of the article as the local information of the text; the keywords of the original document being extracted on an unsupervised basis by the following steps:
step 1.1, taking the position information of words into account, words appearing in the first and last sentences being more likely to be keywords, and repeating the first and last sentences of the document 3 times each, thereby increasing the term frequency of the keywords they contain;
step 1.2, segmenting the text, and selecting 20 words as candidate keywords by utilizing tf-idf statistical information of each word;
step 1.3, further screening the keywords obtained in step 1.2: obtaining a vector representation d of the document with Doc2Vec and a vector representation w of each candidate keyword with Word2Vec; ranking the candidate keywords by the cosine distance between w and d, and selecting the candidates closest to the document from the initial set, wherein the closer a keyword is to the document, the more of the document's information it describes, which ensures that the retained keywords are more relevant to the document;
step 1.4, to avoid redundancy among the final keywords, i.e. extracted keywords that have the same meaning despite different wording, screening the keywords obtained in step 1.3 a second time: ranking by the cosine distances among the candidate keywords and keeping only one keyword for each distinct meaning;
step 2, constructing an Encoder module;
the purpose of the Encoder module being to encode, i.e. vectorize, the input text; the Encoder module uses a Transformer Encoder, finally obtaining a semantic representation of the original text that carries semantic and contextual features, which serves as the global semantic information;
step 3, constructing a graph convolution module;
the semantic information of the different keywords being obtained in step 1, and, in order to mine more effective local semantic features, a graph convolution being used to enrich the local features with relational features, obtaining local semantic information carrying relation information; in the graph convolution, the input comprises nodes and an adjacency matrix, the nodes being the local semantic information extracted in step 1, the nodes being interrelated, and the adjacency matrix representing the degree of relation between the nodes; the graph convolution then adaptively learns the relation weight between each pair of keywords; after the adjacency matrix among the keywords is obtained, it is multiplied with the initial semantic information to obtain relation features, which are fused with the initial features to obtain a new round of features;
step 4, constructing a Decoder module;
the Decoder module being used to generate the summary of the original text; a pointer-generator network is a seq2seq model with a copy mechanism that predicts words from the probability distributions of a generator and a pointer, wherein the generator predicts the word of the current step mainly from the background vector output by the encoder module, the hidden state of the decoder at the current step, and the decoder's previous prediction; the summary words predicted by the generator come from the vocabulary, so words outside the original document can be predicted, while the words predicted by the pointer distribution are text from the original document pointed to by the pointer, so that the summary generated by the pointer-generator network can both produce new words and copy text from the original document; the pointer-generator network is regarded as a balance between extractive and abstractive methods, improving accuracy and the handling of unknown words by copying words while preserving the ability to generate new words; the summary is output using an RNN with attention as the decoder.
2. The text abstract acquisition method based on deep learning according to claim 1, wherein step 3 comprises the following steps:
step 3.1, K pieces of local semantic information and 1 piece of global semantic information are obtained in step 1 and step 2 and are used as nodes of the graph;
step 3.2, constructing an adjacency matrix of the graph, and initializing to 1;
step 3.3, the larger the difference between a local feature and the global feature, the more of an outlier the local feature is; therefore computing the differences between the local semantic information and the global semantic information to build a difference matrix, and using these differences to dynamically update the edge weights of the nodes in the graph; first repeating the global semantic information K times to obtain the degree of difference between each of the K pieces of local semantic information and the global semantic information, finally obtaining the difference matrix;
step 3.4, converting the difference matrix obtained in step 3.3 into a matrix of dimension (K, K) by a linear transformation, the result being called the update matrix;
step 3.5, multiplying the update matrix obtained in step 3.4 element-wise with the adjacency matrix, the purpose of this operation being to learn the adjacency matrix adaptively through the update matrix;
step 3.6, multiplying the adjacency matrix obtained in step 3.5 with the node information to obtain the relation features of the semantic information;
and step 3.7, concatenating the local relation features obtained in step 3.6 with the local semantic information of the nodes to obtain the local semantic information carrying relation information.
3. The text abstract acquisition method based on deep learning according to claim 1, wherein step 4 comprises the following steps:
step 4.1, fusing the global semantic information obtained in step 2 with the relation-aware local semantic information obtained in step 3 by summation, the result serving as the initial hidden vector of the decoder module;
step 4.2, computing, via the attention mechanism, the importance of each word in the original text to the word being decoded, and obtaining at each decoder step a semantic representation of the original text weighted by the different degrees of importance, called the background vector;
and step 4.3, predicting the output of the current time step from the semantic representation of the original text, the output of the previous time step, and the hidden vector of the current time step, finally obtaining the predicted output of each time step and thereby the text summary of the original text.
CN202111662780.2A 2021-12-31 2021-12-31 Text abstract acquisition method based on deep learning Pending CN114298037A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111662780.2A CN114298037A (en) 2021-12-31 2021-12-31 Text abstract acquisition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111662780.2A CN114298037A (en) 2021-12-31 2021-12-31 Text abstract acquisition method based on deep learning

Publications (1)

Publication Number Publication Date
CN114298037A (en) 2022-04-08

Family

ID=80972900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111662780.2A Pending CN114298037A (en) 2021-12-31 2021-12-31 Text abstract acquisition method based on deep learning

Country Status (1)

Country Link
CN (1) CN114298037A (en)

Similar Documents

Publication Title
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
KR20210116379A (en) Method, apparatus for text generation, device and storage medium
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN111324728A (en) Text event abstract generation method and device, electronic equipment and storage medium
CN113127624B (en) Question-answer model training method and device
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN111178053B (en) Text generation method for generating abstract extraction by combining semantics and text structure
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN111767394A (en) Abstract extraction method and device based on artificial intelligence expert system
CN111723295A (en) Content distribution method, device and storage medium
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN111859950A (en) Method for automatically generating lecture notes
CN112765977B (en) Word segmentation method and device based on cross-language data enhancement
CN113407711A (en) Gibbs limited text abstract generation method by using pre-training model
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN111008277B (en) Automatic text summarization method
CN117235250A (en) Dialogue abstract generation method, device and equipment
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
CN115455152A (en) Writing material recommendation method and device, electronic equipment and storage medium
CN114298037A (en) Text abstract acquisition method based on deep learning
CN114118087A (en) Entity determination method, entity determination device, electronic equipment and storage medium
CN114330296A (en) New word discovery method, device, equipment and storage medium
CN113887244A (en) Text processing method and device

Legal Events

Code Description
PB01: Publication
SE01: Entry into force of request for substantive examination