CN109597886B

CN109597886B - Extraction-Generation Hybrid Summary Generation Method

Info

Publication number: CN109597886B
Application number: CN201811238086.6A
Authority: CN
Inventors: 周玉; 朱军楠; 张家俊; 宗成庆
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2018-10-23
Filing date: 2018-10-23
Publication date: 2021-07-06
Anticipated expiration: 2038-10-23
Also published as: CN109597886A

Abstract

The invention belongs to the field of natural language processing and provides an extraction-generation hybrid summary generation method, which aims to solve the problems of existing extractive and generative automatic summarization methods. The method includes: identifying entities and numbers in a document and replacing them with preset tags; extracting a plurality of first key sentences from the tag-replaced document; compressing each first key sentence to obtain a corresponding second key sentence; selecting, according to a comparison between the length of the first key sentence and a preset length threshold, either the first key sentence or its second key sentence as a first key sentence to be synthesized; and generating a summary of the document from all the first key sentences to be synthesized. The method can generate a summary that conforms to the semantics of the document while also ensuring readability.

Description

Extraction-generation hybrid summary generation method

Technical Field

The invention belongs to the technical field of natural language processing, and in particular relates to an extraction-generation hybrid summary generation method.

Background

Automatic summarization is a technology that uses a computer system to analyze text, condense its content and generate a summary automatically, expressing the main content of the original text in a concise form according to the needs of the reader (or user). It can effectively help a reader (or user) find content of interest in retrieved articles, improving both reading speed and reading quality. The technique compresses a document into a more compact representation while preserving coverage of the valuable topics of the original document.

Existing automatic summarization technology mainly comprises two kinds of method: extractive automatic summarization and generative automatic summarization. An extractive method combines segments extracted from the document into a summary; it is simple to implement and yields readable output, but the precision of the resulting summary is not high. A generative method produces the summary directly from a representation of the document's meaning; it is considerably more difficult, but is closer to the essence of summarization.

Therefore, how to provide a scheme that filters unimportant text content out of a document and preserves the fluency of the summary while also improving its precision is a problem that those skilled in the art currently need to solve.

Disclosure of Invention

In order to solve the above problems in the prior art, namely the problems of existing extractive and generative automatic summarization methods, the invention provides an extraction-generation hybrid summary generation method, which comprises the following steps:

identifying entities and numbers in a document and replacing the entities and numbers in the document with preset tags;

extracting a plurality of first key sentences from the tag-replaced document by using an extractive document summarization method;

compressing each of the plurality of first key sentences to obtain a second key sentence corresponding to each first key sentence;

judging whether the length of the first key sentence is greater than or equal to a preset length threshold: if so, taking the second key sentence corresponding to the first key sentence as a first key sentence to be synthesized; if not, directly taking the first key sentence as the first key sentence to be synthesized;

and generating the summary of the document from all the first key sentences to be synthesized.

In a preferred embodiment of the above method, the step of "extracting a plurality of first key sentences from the tag-replaced document by using an extractive document summarization method" includes:

extracting a plurality of first key sentences from the tag-replaced document by using an extractive document summarization method based on a submodular function;

acquiring the original key sentence corresponding to each first key sentence in the document before tag replacement;

and sorting the first key sentences according to the order in which their original key sentences appear in the document before tag replacement.

In a preferred embodiment of the above method, the step of "compressing each of the plurality of first key sentences to obtain a second key sentence corresponding to each first key sentence" includes:

compressing the first key sentence with a pre-constructed sentence summarization model to obtain the corresponding second key sentence;

wherein the sentence summarization model is a model built on an attention mechanism.

In a preferred embodiment of the above method, the step of "compressing the first key sentence with a pre-constructed sentence summarization model to obtain the corresponding second key sentence" includes:

acquiring the unknown words generated when the first key sentence is compressed;

for each unknown word, acquiring the word with the highest attention value at the time step at which the unknown word was generated, and replacing the unknown word with that word.

In a preferred embodiment of the above method, before the step of "compressing each of the plurality of first key sentences to obtain a second key sentence corresponding to each first key sentence", the method further includes:

identifying entities and numbers in a preset text data set;

replacing the entities and numbers in the text data set with preset tags;

and training the sentence summarization model on the tag-replaced text data set.

In a preferred embodiment of the above method, the step of "generating the summary of the document from all the first key sentences to be synthesized" includes:

restoring the tags in each first key sentence to be synthesized to the corresponding entities and numbers to obtain a corresponding second key sentence to be synthesized;

and generating the summary of the document from the second key sentences to be synthesized.

Compared with the closest prior art, the above technical solution has at least the following beneficial effects:

1. The extraction-generation hybrid summary generation method provided by the invention extracts first key sentences with an extractive document summarization method, compresses each first key sentence to obtain a second key sentence, selects either the first key sentence or the second key sentence as the first key sentence to be synthesized according to a comparison between the length of the first key sentence and a preset length threshold, and generates the document summary from the first key sentences to be synthesized.

2. The method judges whether the length of the first key sentence is greater than or equal to a preset length threshold; if so, the second key sentence corresponding to the first key sentence is used as the first key sentence to be synthesized, and if not, the first key sentence is used directly. A more robust summary can therefore be obtained, that is, readability is ensured as far as possible while a certain degree of fidelity to the facts is maintained.

3. The method extracts first key sentences from the document with an extractive document summarization method and thereby filters out unimportant text content, so that the generative automatic summarization method can later generate the summary of the document quickly and a high-precision document summary can be obtained.

Drawings

Fig. 1 is a schematic diagram of the main steps of an extraction-generation hybrid summary generation method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the main framework of an extraction-generation hybrid summary generation method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.

Fig. 1 illustrates the main steps of the extraction-generation hybrid summary generation method in the present embodiment. As shown in Fig. 1, the method in this embodiment includes the following steps:

Step S101: identifying entities and numbers in the document and replacing them with preset tags.

Inspired by the way people write summaries by hand (extracting some important sentences from the original text and then rewriting them in condensed form), the invention generates the summary of a long text with an extraction-generation hybrid method. The method not only filters out unimportant text content, as an extractive document summarization method does, but also retains the fluency of summaries produced by a generative document summarization method. It mainly comprises two parts: extracting the important sentences in the document, and compressing and rewriting the extracted sentences.

Specifically, entities and numbers in the document may be identified and replaced with preset tags. Suppose the following input document is given:

It’s just an example for illustration. There are 56 nationalities in China.

After the entities and numbers are replaced with preset tags, the document reads:

It’s just an example for illustration. There are number-1 nationalities in entity-1.

Named entities may be personal names, organization names, place names and any other entities identified by a name; in a broader sense, entities may also include numbers, dates, currencies, addresses and the like. The entities and numbers in the document may be identified with the named entity recognition tool spaCy, and replaced with the preset tags by means of Python regular expressions.
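
The tag replacement of step S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the patent: the function name `replace_with_tags`, the number pattern, and the idea of passing the entity strings in as a plain list (standing in for the output of a named entity recognizer) are all hypothetical. Tags are numbered in order of first appearance, and the mapping is returned so that the tags can be restored later.

```python
import re

# Assumed number pattern: digit runs with optional "." or "," groups (e.g. 56, 3.5, 1,000).
NUMBER = r"\d+(?:[.,]\d+)*"

def replace_with_tags(text, entities):
    """Replace entity strings and numbers with indexed tags.

    `entities` is a list of entity surface strings; it stands in for the
    output of a named entity recognizer (hypothetical interface).
    Returns the tagged text and a dict mapping each tag to its original string.
    """
    mapping = {}                              # tag -> original surface string
    tag_of = {}                               # (kind, surface) -> tag, so repeats share a tag
    counters = {"entity": 0, "number": 0}

    # Longest entity first, so a long name is not split by a shorter one inside it.
    escaped = [re.escape(e) for e in sorted(set(entities), key=len, reverse=True)]
    entity_part = "|".join(escaped)
    pattern = re.compile(
        (rf"(?P<entity>\b(?:{entity_part})\b)|" if entity_part else "")
        + rf"(?P<number>{NUMBER})"
    )

    def substitute(match):
        kind = match.lastgroup                # "entity" or "number"
        surface = match.group(kind)
        key = (kind, surface)
        if key not in tag_of:
            counters[kind] += 1
            tag_of[key] = f"{kind}-{counters[kind]}"
            mapping[tag_of[key]] = surface
        return tag_of[key]

    return pattern.sub(substitute, text), mapping


tagged, mapping = replace_with_tags("There are 56 nationalities in China.", ["China"])
print(tagged)   # There are number-1 nationalities in entity-1.
print(mapping)  # {'number-1': '56', 'entity-1': 'China'}
```

Doing the substitution in a single regular-expression pass means the digits inside an already inserted tag such as number-1 are never rescanned.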

Step S102: extracting a plurality of first key sentences from the tag-replaced document by using an extractive document summarization method.

An extractive document summarization method extracts representative text segments from the original document to form a summary; the segments may be sentences, paragraphs or sections of the document. Specifically, a plurality of first key sentences may be extracted from the tag-replaced document with an extractive document summarization method based on a submodular function; the original key sentence corresponding to each first key sentence in the document before tag replacement is then obtained, and the first key sentences are sorted according to the order in which their original key sentences appear in the document before tag replacement. The total number of words in the first key sentences is smaller than a preset word-count threshold, which may be 200.
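
The text does not state which submodular function is used, so the following is only a sketch of the general technique. It assumes a simple word-coverage objective (the number of distinct words covered by the selected sentences, which is monotone and submodular) and a greedy rule that picks the sentence with the best gain per word while the total stays below the word budget. The function name `greedy_submodular_select` and the whitespace tokenization are hypothetical.

```python
def greedy_submodular_select(sentences, budget=200):
    """Greedily pick sentences under a total word budget.

    Objective (an assumption, not taken from the patent): the number of
    distinct lowercase words covered by the selected sentences.
    Returns the indices of the chosen sentences in document order.
    """
    tokenized = [s.lower().split() for s in sentences]
    covered = set()          # words covered so far
    chosen = []              # indices of selected sentences
    used = 0                 # words spent so far
    remaining = set(range(len(sentences)))

    while remaining:
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            cost = len(tokenized[i])
            # Keep the total strictly below the budget, as the text requires.
            if cost == 0 or used + cost >= budget:
                continue
            gain = len(set(tokenized[i]) - covered)   # marginal coverage gain
            ratio = gain / cost
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:     # nothing fits, or nothing adds new words
            break
        chosen.append(best)
        covered |= set(tokenized[best])
        used += len(tokenized[best])
        remaining.discard(best)

    return sorted(chosen)    # restore the order of appearance in the document


docs = ["the cat sat on the mat", "the cat sat", "dogs bark loudly at night", "the mat"]
print(greedy_submodular_select(docs, budget=10))  # [1, 2]
```

Returning the indices sorted corresponds to the final sorting step described above: the selected sentences are put back in the order in which they appear in the document.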

Step S103: compressing each of the plurality of first key sentences to obtain a second key sentence corresponding to each first key sentence.

Although extracting the first key sentences with an extractive document summarization method filters out unimportant text content, the precision of the resulting summary is low. So that the generated summary better matches the meaning expressed by the document and is closer to a manually written summary, the first key sentences may be compressed. Specifically, each first key sentence may be compressed with a pre-constructed sentence summarization model to obtain the corresponding second key sentence, where the sentence summarization model is built on an attention mechanism.

The step of compressing the first key sentence with the pre-constructed sentence summarization model to obtain the corresponding second key sentence comprises:

acquiring the unknown words generated when the first key sentence is compressed;

for each unknown word, acquiring the word with the highest attention value at the time step at which the unknown word was generated, and replacing the unknown word with that word.

The sentence summarization model is built on an attention mechanism and may be attached to an Encoder-Decoder framework, which can be regarded as a general research paradigm in the field of deep learning. The Encoder (the encoding end) encodes the input sentence and converts it into an intermediate semantic representation through nonlinear transformation. The Decoder (the decoding end) generates the word for a given time step from the intermediate semantic representation of the sentence and the history already generated. When an unknown word appears in the generated sentence, the word with the highest attention value at the time step at which the unknown word was generated can be acquired and used to replace the unknown word, which improves the readability of the summary.
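
The unknown-word replacement can be illustrated independently of any particular model. The sketch below assumes that the decoder output is available as a token list, that unknown words are marked with the placeholder `<unk>`, and that the attention weights are given as a matrix with one row per output token and one column per input token. These names and data layouts are hypothetical and chosen only for illustration.

```python
def replace_unknown_words(output_tokens, source_tokens, attention, unk="<unk>"):
    """Replace each unknown output token with the most-attended source token.

    attention[t][j] is the attention weight on source token j at the time
    step at which output token t was generated (assumed layout).
    """
    result = []
    for t, token in enumerate(output_tokens):
        if token == unk:
            weights = attention[t]
            # Index of the source token with the highest attention value.
            j = max(range(len(source_tokens)), key=lambda k: weights[k])
            token = source_tokens[j]
        result.append(token)
    return result


source = ["entity-1", "announced", "record", "quarterly", "profits"]
output = ["entity-1", "<unk>", "profits"]
attention = [
    [0.90, 0.05, 0.02, 0.02, 0.01],
    [0.10, 0.20, 0.60, 0.05, 0.05],
    [0.00, 0.00, 0.10, 0.10, 0.80],
]
print(replace_unknown_words(output, source, attention))
# ['entity-1', 'record', 'profits']
```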

Before the first key sentences are compressed to obtain the second key sentences, the sentence summarization model may be trained as follows:

identifying entities and numbers in a preset text data set;

replacing the entities and numbers in the text data set with preset tags;

and training the sentence summarization model on the tag-replaced text data set until the model converges, where the text data set may be the Gigaword data set.

Step S104: judging whether the length of the first key sentence is greater than or equal to a preset length threshold; if so, executing step S105; if not, executing step S106.

In order to obtain a more robust summary, that is, to ensure readability as far as possible while maintaining a certain degree of fidelity to the facts, it may be judged whether the length of the first key sentence is greater than or equal to the preset length threshold, and the corresponding operation is performed according to the result.

Step S105: taking the second key sentence corresponding to the first key sentence as the first key sentence to be synthesized.

If the length of the first key sentence is greater than or equal to the preset length threshold, the second key sentence corresponding to the first key sentence may be used as the first key sentence to be synthesized, so as to keep the word count of the final summary at a reasonable length and improve readability.

Step S106: directly taking the first key sentence as the first key sentence to be synthesized.

If the length of the first key sentence is smaller than the preset length threshold, the first key sentence extracted from the document can be considered to already meet the word-count requirement of the final summary, and it is used directly as the first key sentence to be synthesized.
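
Steps S104 to S106 amount to a per-sentence choice between the extracted sentence and its compressed version. A minimal sketch follows, assuming that length is measured in whitespace-separated words and that the threshold is supplied by the caller (the text gives no concrete value); the function name `choose_sentences` is hypothetical.

```python
def choose_sentences(first_sentences, second_sentences, length_threshold):
    """Pick, for each pair, the sentence to be synthesized.

    first_sentences:  extracted key sentences
    second_sentences: their compressed versions, in the same order
    Length is counted in whitespace-separated words (an assumption).
    """
    chosen = []
    for original, compressed in zip(first_sentences, second_sentences):
        if len(original.split()) >= length_threshold:
            chosen.append(compressed)   # S105: long sentence, use the compressed version
        else:
            chosen.append(original)     # S106: short sentence, keep it as extracted
    return chosen


first = ["entity-1 reported number-1 new cases in the northern region on Monday", "Officials declined to comment"]
second = ["entity-1 reported number-1 new cases", "Officials declined"]
print(choose_sentences(first, second, length_threshold=8))
# ['entity-1 reported number-1 new cases', 'Officials declined to comment']
```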

Step S107: generating the summary of the document from all the first key sentences to be synthesized.

Specifically, the tags in each first key sentence to be synthesized may be restored to the corresponding entities and numbers to obtain a corresponding second key sentence to be synthesized, and the second key sentences to be synthesized are arranged in the order in which their original sentences appear in the document to generate the summary of the document.
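
Step S107 can be sketched as a tag restoration followed by an ordering step. The code below assumes tags of the form entity-N and number-N as in the example of step S101, a tag-to-string mapping recorded when the tags were inserted, and sentences paired with their position in the original document. The function names and the data layout are hypothetical.

```python
import re

TAG = re.compile(r"\b(?:entity|number)-\d+\b")

def restore_tags(sentence, mapping):
    """Replace each tag with its original string; unknown tags are left as-is."""
    return TAG.sub(lambda m: mapping.get(m.group(0), m.group(0)), sentence)

def assemble_summary(selected, mapping):
    """Build the summary from (position_in_document, tagged_sentence) pairs."""
    ordered = sorted(selected, key=lambda pair: pair[0])   # original document order
    return " ".join(restore_tags(sentence, mapping) for _, sentence in ordered)


mapping = {"entity-1": "China", "number-1": "56"}
selected = [(3, "It is large."), (1, "There are number-1 nationalities in entity-1.")]
print(assemble_summary(selected, mapping))
# There are 56 nationalities in China. It is large.
```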

Attached Table 1 shows, by way of example, the ROUGE values of the extraction-generation hybrid summary generation method and of a sequence-to-sequence attention (S2S + attn) model on the CNN/Daily Mail data set (100 randomly selected documents were used as test data). The sentence-title training data set contains 3,803,957 pairs, the validation data set contains 189,651 pairs, and the test data set contains 1,951 pairs. As can be seen from attached Table 1, the extraction-generation hybrid summary generation method of this embodiment significantly improves two metrics, ROUGE-1 and ROUGE-L. In addition, the sentence summarization model of this embodiment is trained on the Gigaword data set following the idea of transfer learning, whereas the existing S2S + attn model is trained on the CNN/Daily Mail data set, so the model of this embodiment transfers better.

Attached Table 1: Comparison of ROUGE values between the method of the invention and a sequence-to-sequence attention model (S2S + attn)
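
For readers unfamiliar with the two metrics named above, the following simplified sketch shows what ROUGE-1 and ROUGE-L measure. It is not the evaluation script behind the reported values: it assumes lowercase whitespace tokenization, a single reference summary and a balanced F-measure, and it omits the stemming and other options of the standard ROUGE toolkit. ROUGE-1 counts overlapping unigrams, while ROUGE-L is based on the longest common subsequence (LCS) of the two token sequences.

```python
from collections import Counter

def _f1(overlap, candidate_len, reference_len):
    """Balanced F-measure from an overlap count; 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_len
    recall = overlap / reference_len
    return 2 * precision * recall / (precision + recall)

def rouge_1(candidate, reference):
    """Unigram-overlap F-measure, with counts clipped by the reference."""
    c, r = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(c) & Counter(r)).values())
    return _f1(overlap, len(c), len(r))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    """LCS-based F-measure."""
    c, r = candidate.lower().split(), reference.lower().split()
    return _f1(lcs_length(c, r), len(c), len(r))


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_1(candidate, reference), rouge_l(candidate, reference))
```

In this usage example five of the six words match in the same order, so both scores equal 5/6; the two metrics diverge when the same words appear in a different order.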

Fig. 2 illustrates the main framework of the extraction-generation hybrid summary generation method in the present embodiment. As shown in Fig. 2, the main framework is as follows:

First, a plurality of first key sentences are extracted from the original document. Next, the first key sentences are compressed with the pre-constructed sentence summarization model to obtain the corresponding second key sentences. Finally, either the first key sentence or the second key sentence is selected as the first key sentence to be synthesized according to a comparison between the length of the first key sentence and the preset length threshold, and the summary of the document is generated from all the first key sentences to be synthesized.

The extraction-generation hybrid summary generation method provided by the invention combines the advantages of extractive and generative document summarization methods: it can generate a summary that conforms to the semantics of the document while also ensuring readability. Extracting the first key sentences with an extractive method filters out unimportant text content, so that the generative automatic summarization method can later generate the summary of the document quickly and a high-precision document summary can be obtained.

Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, to achieve the effect of the present embodiments, the steps need not be executed in that order; they may be executed simultaneously (in parallel) or in reverse order, and such simple variations fall within the scope of protection of the present invention.

Those of skill in the art will appreciate that the method steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing or implying any particular order or sequence. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. An extraction-generation hybrid summary generation method, characterized by comprising the following steps:

2. The extraction-generation hybrid summary generation method according to claim 1, wherein the step of extracting a plurality of first key sentences from the tag-replaced document by using an extractive document summarization method comprises:

3. The extraction-generation hybrid summary generation method according to claim 1, wherein the step of compressing each of the plurality of first key sentences to obtain a second key sentence corresponding to each first key sentence comprises:

4. The extraction-generation hybrid summary generation method according to claim 3, wherein the step of compressing the first key sentence with a pre-constructed sentence summarization model to obtain the corresponding second key sentence comprises:

acquiring the unknown words generated when the first key sentence is compressed;

5. The extraction-generation hybrid summary generation method according to claim 4, wherein before the step of compressing each of the plurality of first key sentences to obtain the second key sentence corresponding to each first key sentence, the method further comprises:

identifying entities and numbers in a preset text data set;

replacing the entities and numbers in the text data set with preset tags;

6. The extraction-generation hybrid summary generation method according to any one of claims 1 to 5, wherein the step of generating the summary of the document from all the first key sentences to be synthesized comprises: