CN111400486B

CN111400486B - Automatic text abstract generation system and method

Info

Publication number: CN111400486B
Application number: CN202010175795.5A
Authority: CN
Inventors: 汪泽鸿; 吕昱峰; 钟将
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2020-03-13
Filing date: 2020-03-13
Publication date: 2023-05-26
Anticipated expiration: 2040-03-13
Also published as: CN111400486A

Abstract

The invention discloses an automatic generation system and method of text abstract, which converts original document into sentence sequence S= [ S ] ₁ ,S ₂ ,…,S _n ]And corresponding labels y are allocated to each sentence _i The method comprises the steps of carrying out a first treatment on the surface of the Extracting feature quantity of each sentence to obtain sentence-level features, scoring the sentence-level features by using a factor ROUGE-1 to obtain corresponding sentence scores and sequencing; and selecting the first K sentences as first candidate sentences, extracting refined words and reserving significant words by using a non-key information shielding mechanism, and further automatically generating a abstract. The invention can update part of characters while maintaining the extracted important information in the original document by combining the extracted and generated two-stage models, so that the automatically generated abstract can accurately reflect the content information of the original document. Meanwhile, the automatic generation process of the abstract is not simply two stages of generation or screening, but is a highly uniform process, so that the integrity of abstract expression is improved.

Description

Automatic text abstract generation system and method

Technical Field

The invention relates to the technical field of document abstract generation, in particular to an automatic text abstract generation system and method.

Background

Automatically generating text excerpts is a natural language generation task that identifies key information and filters non-important content in the original document to preserve the key information in the document. The automatic generation of the text abstract can be widely applied to a plurality of fields such as news reports, investigation and research, education and teaching and the like.

The methods for automatically generating the text abstract in the prior art mainly comprise two methods: extraction method and generation method. The extraction method is to select key sentences from the original document and splice the key sentences together to form abstract content. Instead of simple selection and replication, the generation method typically generates new sentences based on semantic understanding of the original text. Both of these methods have respective advantages and problems. Therefore, researchers have proposed using deep learning methods to combine the two methods to achieve a unified abstract model. The existing model has two problems: 1) These methods directly refine the extracted sentences, possibly ignoring the persistence of the keywords; 2) Using a similar decoder to re-decode the new word may cause the refining process to predict the same word.

Disclosure of Invention

Aiming at the problem that the automatically generated abstract can not accurately reflect the content of the original document in the prior art, the automatic text abstract generation system and method automatically generate the text abstract by combining two stages of sentence extraction and generation so as to accurately reflect the content information of the original document.

In order to achieve the above object, the present invention provides the following technical solutions:

the automatic text abstract generation system comprises a document sentence extraction module, a document sentence push module and an abstract generation module;

the original document is input into the document sentence extraction module to extract a first candidate sentence, the extracted first candidate sentence and the original document are input into the document sentence push module to be extracted, so that a second candidate sentence is obtained, and the second candidate sentence is recombined in the abstract generation module to automatically generate an abstract.

Preferably, the document sentence extracting module comprises a decomposing module, a scoring module and a first candidate sentence generating module; wherein,,

the decomposition module is used for respectively dividing the input original document into a sentence sequence, and each sentence is provided with a corresponding label;

the scoring module is used for scoring sentences with different labels;

and the first candidate sentence generation module is used for sequencing sentences from high score to low score in sequence and screening the first K sentences as first candidate sentences.

Preferably, the scoring module comprises 3 transducer layers and 4 attention heads.

Preferably, the document statement push module includes 8 converter decoder layers and 12 attention headers.

The invention also provides an automatic text abstract generating method, which comprises the following steps:

s1: will be originalConversion of an initial document into a sentence sequence s= [ S ] ₁ ,S ₂ ,…,S _n ]And corresponding labels y are allocated to each sentence _i ；

S2: extracting feature quantity of each sentence to obtain sentence-level features, scoring the sentence-level features by using a factor ROUGE-1 to obtain corresponding sentence scores and sequencing;

s3: and selecting the first K sentences as first candidate sentences, extracting refined words and reserving significant words for the first candidate sentences, and further automatically generating abstracts.

Preferably, the step S2 specifically includes the following steps:

s2-1: in the present embodiment, any sentence S _i Input to the BERT model to extract the feature quantity, thereby obtaining a representation vector h _i ：h _i ＝BERT(S _i ) Where i.epsilon.1, n, i is an integer, h _i Is m;

s2-2: will represent vector h _i Converting to obtain sentence-level feature vectors:

in the formulas (1) and (2),

transition sentence level feature vector representing a layer-l converter,/->

Sentence-level feature vectors representing l-1 layer converters, LN representing layer normalization, MHAt representing multi-head attention,/->

Sentence-level feature vectors representing l-layer converters, FFN representingA feed-forward layer;

s2-3: output layer output tag y _i ：

In the formula (3), y _i Representing sentence S _i Sigma represents a Sigmoid function, W ₀ The weights of the output layer are represented as,

sentence-level feature vector representing layer L converter, b ₀ Representing the deviation of the output layer;

s2-4: using Sigmoid function to send

Conversion to an attention value:

in the formula (4) of the present invention,

representing sentence S _i Attention value of (2), and->

S2-5: the information quantity of each sentence is represented by ROUGE-1 as a score, and the sentences are ordered according to the height of the score.

Preferably, the extracting step of the refined word in S3 includes:

s3-1: selecting the first K sentences after sequencing as first candidate sentences, and performing word-by-word shielding (so as to obtain second candidate sentences;

s3-2: refining the feature H to generate a refined word, the refining process can be expressed as

d _j ＝f _delib (r _≠j ，H) (5)

In the formula (5), d _j Representing the refining result of the jth mask word, f _delib Representing a push function, r _≠j Representing mask words, H representing contextually relevant bi-directional features;

all d _j Forming a vector, the refined word D can be obtained:

D＝[d ₁ ,d ₂ ,L,d _m ]。

preferably, the salient words in S3 remain as follows:

y _j ＝(1-a _j )d _j +a _j r _j (6)

in the formula (7), y _j Representing the reserved value of the salient word, r _j Representing the extraction result, d _j Representing the refining result of the jth mask word, a _j Representing the weight.

In summary, due to the adoption of the technical scheme, compared with the prior art, the invention has at least the following beneficial effects:

1. the invention can update part of characters while maintaining the extracted important information in the original document by combining the extracted and generated two-stage models, so that the automatically generated abstract can accurately reflect the content information of the original document.

2. The automatic summary generation process in the invention is not two stages of simple generation or screening, but is a highly unified process, thereby improving the integrity of summary expression.

Description of the drawings:

fig. 1 is a schematic diagram of an automatic text summary generation system according to an exemplary embodiment of the present invention.

Fig. 2 is a schematic diagram of a document sentence extraction module according to an exemplary embodiment of the present invention.

Fig. 3 is a flowchart illustrating a text summary automatic generation method according to an exemplary embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and embodiments. It should not be understood that the scope of the subject matter described above is limited to the following examples, and all techniques implemented based on the present disclosure are within the scope of the present invention.

In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the invention.

The invention provides an automatic text abstract generating system, referring to fig. 1, comprising a document sentence extracting module 10, a document sentence pushing module 20 and an abstract generating module 30. The original document is input to the document sentence extraction module 10 for screening of a plurality of first candidate sentences; the first candidate sentences and the original document are input into the document sentence push module 20 for extraction so as to correct the information (such as missing sentences, incorrect sentences and the like) of the first candidate sentences and obtain a plurality of second candidate sentences; the summary generating module 30 reorganizes the plurality of second candidate sentences to automatically generate a summary to reflect the information content of the original document.

In this embodiment, referring to fig. 2, the document sentence extraction module 10 includes a decomposition module 101, a scoring module 102, and a first candidate sentence generation module 103.

A separate module 101 for decomposing the input original document into a sentence sequence, such as s= [ S ] ₁ , S ₂ ,…,S _i ,…,S _n ]Wherein S is _i An ith sentence representing the entire original document; sentence S _i Is composed of strict phases, S _i ＝[W ₁ ,W ₂ ,…,W _m ]Wherein W is _m Represents the mth word and each sentence S _i All having a corresponding tag y _i ∈{0,1}。

A scoring module 102 for scoring sentences having different labels. The scoring module 102 sets 3 transducer layers and 4 attention headers. Through the pre-training of 500000 steps, gradient accumulation exists in every two steps; checkpoints are saved and evaluated on the validation set every 1000 steps. The evaluation loss of the validation set is selected according to the first three checkpoints and the average result is output on the test set. Thus, the training completion scoring module 102 may be used to score sentences.

The first candidate sentence generating module 103 is configured to sequentially arrange the sentences with scores from high to low, and filter the first K sentences as first candidate sentences.

In this embodiment, the document sentence push module 20 includes a converter decoder layer and an attention header number. Setting the converter decoder layer to 8 and the Attention header number (Multi-Head Attention) to 12; and for regularization, an exit and layer normalization mechanism is used and the exit rate is set to 0.15.

For gradient accumulation during training of the document sentence push module 20, the accumulation step size is set to 12, and 3 samples are input in each training step. A readable text sequence is also generated using a directed search size of 4 and a length penalty of 1.0.

In this embodiment, both the document sentence extraction module 10 and the document sentence push module 20 employ trigram (trigram) skills. For the document sentence extraction module 10, a ternary language model is used, meaning that if there is a ternary language model overlap between the selected sentence and the candidate sentence, the candidate sentence will be skipped; a similar approach is also used for the document statement push module 20. If a ternary language model is generated that exists in the refined abstract, the model sets the word probability to zero to avoid phrase and sentence duplication.

All the modular programs of the present invention are based on BERT-BASE, which is suitable for GPU memory (GTX 1080 Ti), and has low performance penalty. Since the invention uses converters with the same stack layerBERT, most superparameters are equal; and the learning rate used is 3e-4 beta ₁ ＝0.9、β ₂ =0.999 and ε=10 ^-9 The Adam optimizer of (c) trains the program in the module and uses a dynamic learning rate during training and fine tuning, with a batch size of approximately 36. Because of GPU memory limitations, program training in all modules uses a gradient accumulation approach.

Referring to fig. 3, the invention further provides an automatic text abstract generating method, which specifically comprises the following steps:

s1: the original document is converted into a sequence of sentences and each sentence is assigned a corresponding tag.

For example sentence sequence s= [ S ] ₁ ,S ₂ ,…,S _i ,…,S _n ]Wherein S is _i An ith sentence representing the entire original document; each sentence S _i Is composed of strict phases, S _i ＝[W ₁ ,W ₂ ,…,W _m ]M is the maximum word index, that W _m Represents the mth word and each sentence S _i All having a corresponding label y _i ∈{0,1}。

S2: extracting feature quantity of any sentence to obtain sentence-level features, and scoring the sentence-level features by using a factor ROUGE-1 as a score to obtain corresponding sentence scores and sequencing.

S2-1: in this embodiment, there will be a tag y _i Each sentence S of E {0,1} _i Input to the BERT model to extract the feature quantity, thereby obtaining a representation vector h _i ：h _i ＝BERT(S _i ) Where i.epsilon.1, n, i is an integer, h _i Is m.

S2-2: and then the vector h will be represented _i Converting to obtain sentence-level feature vectors:

in the formula (1),

transition sentence level feature vector representing l-layer converter, l representing index of converter layer,/>

Sentence-level feature vectors representing the l-1 layer converter, LN representing layer normalization, MHAt representing multi-head attention;

in the formula (2),

sentence-level feature vector representing a layer-i converter,/->

Representing the transition sentence level eigenvectors of the l-layer converter, LN represents layer normalization, and FFN represents the feed-forward layer.

S2-3: output layer output tag y _i 。

/>

In the formula (3), y _i Tag representing ith sentence, sigma represents Sigmoid function, W ₀ Representing the weight of the output layer,

sentence-level feature vector representing an L-th layer (also last layer) converter, b ₀ Indicating the deviation of the output layer.

S2-4: using Sigmoid function to send

Converting into an attention value.

In the formula (4) of the present invention,

representing sentence S _i Attention value of (2), and->

S2-5: the information amount of each sentence is represented by ROUGE-1 as a score, and the sentences are ordered in this order according to the high and low ROUGE scores.

In this embodiment, ROUGE-1 is a method for evaluating automatic summarization, which is a conventional technology, and therefore, a detailed description thereof is omitted. When sorting, sentences corresponding to high ROUGE scores are ranked in the front column, and sentences corresponding to low ROUGE scores are ranked in the rear column.

Meanwhile, in order to improve the matching rate, the first 3 sentences of the ordered sentences are selected as correct marks and used for training the extractor, so that the double-cross entropy loss function is minimized, and the number of correct words is improved.

In the formula (5), L _ext Represents the double cross entropy loss function g _n E {0,1} represents the correct tag for the nth sentence, N is the total number of sentences, y _i A tag representing the i-th sentence.

S3: the first K sentences are selected as first candidate sentences, and the first candidate sentences are pushed by using a non-key information shielding mechanism (according to the attention value), so that the abstract is automatically generated.

In this embodiment, the tapping includes two processes of masking word refining and salient word preserving, namely, re-decoding the words in the first candidate sentence twice, finding out missing information, and refining incorrect words.

In the present embodiment, the original document is in the form of w= [ W ] ₁ ,w ₂ ,L,w _n×m ]Which is provided withWherein n is the numbered index of the sentence, m is the sentence S _i The maximum word index of (a). The invention marks the mask word in the first candidate sentence as r _≠j ＝[r ₁ ，L，r _j-1 ，r _j+1 ，L，r _m ]。

The extraction process of the refined words comprises the following steps:

s3-1: selecting the first K sentences as first candidate sentences, and performing word-by-word masking (namely, inputting the candidate sentences with the masked words removed into a knockout) so as to obtain second candidate sentences; inputting the second candidate sentences and the original documents into the BERT model to extract the characteristic quantity, and obtaining a vector H: h=bert (W), H representing a contextually relevant bi-directional feature.

S3-2: and refining the vector H to generate refined words, combining the refined words with the second candidate sentences to obtain third candidate sentences, and combining the reserved significant words to automatically generate abstracts.

In this embodiment, the refining process can be expressed as

d _j ＝f _delib (r _≠j ，H) (6)

In the formula (6), d _j Representing the refining result of the jth mask word, f _delib Representing a push function, r _≠j Representing mask words, H represents a contextually relevant bi-directional feature.

All d _j Constructing a vector, refining this D:

D＝[d ₁ ,d ₂ ,L,d _m ]

in this embodiment, the process of preserving the salient word specifically includes:

humans do not rewrite each word during the refinement process, they replace only some of the unused information phrases, and retain most of the salient words. The salient word retaining process is to extract three sentences with high information content from the original document and obtain self attention (self attention) distribution weight thereof, namely, self attention distribution

As the weight, the reserved value of the salient word is a snapshotTaking a weighted sum of the result r and the refined result d:

y _j ＝(1-a _j )d _j +a _j r _j (7)

in the formula (7), y _j Representing the reserved value of the salient word, r _j Representing the extraction result, d _j Representing the refining result of the jth mask word, a _j Representing the weight. I.e. the retention value is smaller than the preset retention threshold y _{Pre-preparation} The corresponding words of the (c) are removed so that the retained words can reflect the information of the original document.

In this embodiment, therefore, the learning objective of the knock stage is:

in the formula (8), L _delib Represents a loss function, M represents the total number of words, y _j The abstract words are represented by a sequence of abstract words,

representing the correct abstract word.

In this embodiment, L _delib And L _ext Combining to obtain a text abstract model L _model I.e. L _model ＝ L _delib +L _ext And make text abstract model L _model And minimizing, namely automatically generating the abstract.

The present model was evaluated on a CNN/Daily Mail data set containing news stories on CNN and Daily Mail websites. Each article in the dataset is paired with a manually written abstract of multiple sentences. This time a non-anonymous version of the data set was used, which contained 287113 training pairs, 13368 verification pairs and 11490 test pairs. In terms of evaluation, the criteria ROUGE-1, REGUO-2 and ROUGE-L were used.

Table 1: ROUGE scores for different models on CNN/Daily Mail dataset

As shown in the last line of Table 1, the model of the present invention exhibits higher ROUGE-1, REGUO-2 and REGUO-L scores than the other models. Compared with a unified model, the performance of the model is improved by approximately 5%, and the performance of the model is improved by 3% compared with a two-stage method.

Meanwhile, the invention also provides a computer storage medium.

The computer storage medium according to the embodiments of the present invention stores a computer program, which when executed by a processor implements any one of the steps of the two-stage automatic text-extracting generation method for text salient word preservation as described above, and the computer storage medium may be any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of implementations of the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. The automatic text abstract generating method is characterized by comprising the following steps of:

s1: converting an original document into a sentence sequence s= [ S ] ₁ ,S ₂ ,…,S _n ]And corresponding labels y are allocated to each sentence _i ；

the step S2 specifically comprises the following steps:

s2-2: will beRepresenting vector h _i Converting to obtain sentence-level feature vectors:

in the formulas (1) and (2),

transition sentence level feature vector representing a layer-l converter,/->

Sentence-level feature vectors representing the l-layer converter, FFN representing the feed-forward layer;

s2-3: output layer output tag y _i ：

s2-4: using Sigmoid function to send

Conversion to an attention value:

in the formula (4) of the present invention,

representing sentence S _i Attention value of (2), and->

S2-5: the ROUGE-1 is used as a score to represent the information quantity of each sentence, and the sentences are ordered according to the score;

s3: selecting the first K sentences as first candidate sentences, extracting refined words and reserving significant words for the first candidate sentences, and further automatically generating abstracts;

the step of extracting the refined words in the step S3 is as follows:

s3-1: selecting the first K ordered sentences as first candidate sentences, and carrying out word-by-word shielding so as to obtain second candidate sentences; inputting the second candidate sentences and the original documents into the BERT model to extract the features, and obtaining the features H: h=bert (W), H representing a contextually relevant bi-directional feature;

d _j ＝f _delib (r _≠j ，H) (5)

In the formula (5), d _j Representing the refining result of the jth mask word, f _delib Representing a push function, r _≠j Representing mask words, H representing a contextually relevant bi-directional feature;

all d _j Forming a vector, the refined word D can be obtained:

D＝[d ₁ ,d ₂ ,…,d _m ]；

the salient words in the S3 are reserved as follows:

y _j ＝(1-a _j )d _j +a _j r _j (6)

in the formula (6), y _j Representing the reserved value of the salient word, r _j Representing the extraction result, d _j Representing the refining result of the jth mask word, a _j Representing the weight;

will L _delib And L _ext Combining to obtain a text abstract model L _model I.e. L _model ＝L _delib +L _ext And make text abstract model L _model Minimizing, namely automatically generating the abstract;

in the formula (7), L _ext Represents the double cross entropy loss function g _n E {0,1} represents the correct tag for the nth sentence, N is the total number of sentences, y _i A tag representing an i-th sentence;

representing the correct abstract word.

2. An automatic text summary generation system for implementing the method of claim 1, comprising a document sentence extraction module, a document sentence push module, and a summary generation module;

the original document is input into the document sentence extraction module to extract a first candidate sentence, and the extracted first candidate sentence and the original document are input into the document sentence push module to be extracted, so that a second candidate sentence is obtained, and the second candidate sentence is recombined in the abstract generation module to automatically generate an abstract.

3. The automatic text summarization generation system of claim 2 wherein the document sentence extraction module comprises a decomposition module, a scoring module, and a first candidate sentence generation module; wherein,,

the scoring module is used for scoring sentences with different labels;

4. A text excerpt automatic generation system as claimed in claim 3, wherein the scoring module comprises 3 transducer layers and 4 attention headers.

5. The automatic text summarization system of claim 2 wherein the document sentence push module comprises 8 converter decoder layers and 12 attention headers.