CN113626584A - Automatic text abstract generation method, system, computer equipment and storage medium - Google Patents

Automatic text abstract generation method, system, computer equipment and storage medium

Info

Publication number
CN113626584A
CN113626584A
Authority
CN
China
Prior art keywords
text
sentence
vector
matrix
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110921956.5A
Other languages
Chinese (zh)
Inventor
郑超
窦凤虎
张欢
顾钊铨
王乐
张登辉
韩伟红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Jizhi Hainan Information Technology Co Ltd
Original Assignee
Zhongdian Jizhi Hainan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Jizhi Hainan Information Technology Co Ltd filed Critical Zhongdian Jizhi Hainan Information Technology Co Ltd
Priority to CN202110921956.5A priority Critical patent/CN113626584A/en
Publication of CN113626584A publication Critical patent/CN113626584A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides an automatic text abstract generation method, system, computer equipment and storage medium. A sentence feature vector extraction algorithm is constructed from the linguistic features of Chinese text to form a text feature vector matrix. The algorithm uses 7 linguistic features as text vector features: sentence relevance, similarity between the sentence and the central sentence, the number of keywords contained in the sentence, the number of domain nouns contained in the sentence, sentence informativeness, sentence length and sentence position. The text feature vector matrix is then input into an encoding-decoding model combined with an attention mechanism, and the semantic information before and after each sentence is modeled by a long short-term memory (LSTM) neural network. This overcomes the limitations of traditional statistical vector representations: the text is represented by semantic vectors, a text abstract closer to one written by a human is generated, and the quality of the generated abstract is improved.

Description

Automatic text abstract generation method, system, computer equipment and storage medium
Technical Field
The invention relates to the fields of natural language processing, informatics and deep learning, and in particular to an automatic text abstract generation method, system, computer equipment and storage medium based on linguistic features and an encoding-decoding model combined with an attention mechanism.
Background
With the rapid development of artificial intelligence and the internet, text information on the network has grown explosively in recent years; people receive massive amounts of text every day, such as news, blogs, chats, reports, microblogs and papers. This information overload forces people to spend a great deal of time screening information, which is inefficient. Automatic text summarization is an information compression technique that uses a computer to automatically convert a text or a collection of texts into a short summary for some type of application. Compressing and extracting concise, readable abstracts from big data with text summarization technology accelerates the process of acquiring information and effectively alleviates information overload. At present, text summarization is widely applied in scenarios such as news digests and retrieval systems.
How to extract the key information of a long, redundant, unstructured text into a concise and fluent abstract is the core problem of text summarization. Among automatic summarization techniques, extractive summarization is a method with stable results and a low rate of grammatical and syntactic errors. Existing extractive methods include approaches based on traditional machine learning algorithms, such as TextRank, Lead-3 and clustering, as well as approaches based on deep neural networks, such as Seq2Seq sequence labeling and RNN sentence-importance scoring. Although the abstracts generated by existing extractive methods meet application requirements to a certain extent, they suffer from poor semantic coherence, redundant sentences and similar problems. Abstractive summarization, by contrast, aims to generate a text abstract creatively through a neural network model, fitting the human process of writing a summary as closely as possible.
Therefore, it is desirable to provide an automatic text summary generation method that fully considers sentence coherence and information content while guaranteeing a stable generation effect and freedom from grammatical errors.
Disclosure of Invention
The invention aims to provide an automatic text abstract generation method that uses the linguistic features of Chinese text to construct a sentence feature vector extraction algorithm and form a text feature vector matrix. The matrix is then input into the encoding-decoding model combined with an attention mechanism proposed by the invention: a bidirectional long short-term memory (Bi-LSTM) neural network encodes an intermediate semantic vector, and a unidirectional LSTM combined with the attention mechanism decodes it, realizing automatic extraction of the text abstract.
In order to achieve the above object, it is necessary to provide an automatic text summary generating method, system, computer device and storage medium for solving the above technical problems.
In a first aspect, an embodiment of the present invention provides an automatic text summary generation method, where the method includes the following steps:
acquiring an original text and a neural network model;
segmenting and compressing an original text to obtain a new text representation;
extracting sentence characteristic vectors from the new text to obtain a text vector matrix;
inputting the text matrix into a bidirectional long short-term memory (Bi-LSTM) neural network model, and encoding the text matrix into a text semantic matrix;
and inputting the text semantic matrix into the attention model and the LSTM model, decoding it into a text vector matrix, and inversely mapping the text vectors to obtain the text abstract.
Further, the text segmentation and compression steps include:
segmenting the text by taking a sentence as a unit to generate a sentence set;
counting the sentence length and calculating the average sentence length;
re-segmenting sentences whose length exceeds twice the average sentence length;
and removing the sentences whose text length is less than 3 from the sentence set and updating the set.
Further, the sentence feature vector extraction step includes:
constructing 7 sentence features from Chinese linguistic features;
calculating the 7 feature scores of each sentence to form a one-dimensional vector representing the sentence;
combining the sentence vectors into a two-dimensional vector to obtain the vector matrix representation of the text.
Further, the step of encoding the text matrix into the text semantic matrix includes:
the encoder receives each sentence in turn;
the encoder outputs a semantic vector V.
Further, the step of decoding the text semantic matrix into a text vector matrix and obtaining a text abstract includes:
the decoder takes BOS (the beginning-of-text symbol) as input and, from BOS and the semantic vector V, predicts the probability of each sentence being used in the next round, selecting the sentence with the highest probability;
and the highest-probability sentence and the semantic vector V are input into the decoder to obtain the next round's highest-probability sentence, looping until EOS (the end-of-text symbol) is obtained and generation of the text abstract is finished, as sketched below.
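As an illustrative, non-limiting sketch, the decoding loop described above can be written in Python as follows. The decoder callable, its interface and the sentinel indices are assumptions for illustration and are not fixed by the invention.

    def generate_summary(decoder, V, bos_id, eos_id, num_candidates, max_steps=32):
        """Greedy sentence-level decoding: start from BOS, repeatedly select the
        most probable next sentence given the semantic vector V, stop at EOS.
        decoder(prev_id, V) is assumed to return one probability per candidate."""
        prev, summary = bos_id, []
        for _ in range(max_steps):
            probs = decoder(prev, V)                # P(next sentence | prev, V)
            nxt = max(range(num_candidates), key=lambda i: probs[i])
            if nxt == eos_id:                       # end-of-text symbol reached
                break
            summary.append(nxt)
            prev = nxt                              # feed the selection back in
        return summary                              # indices of the chosen sentences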
Further, the step of constructing 7 sentence features and obtaining a sentence feature vector by using the chinese language features includes:
performing necessary preprocessing on the text: first segmenting each sentence into words, then removing useless function words, and recombining the remaining words into a new sentence;
combining the multiple factors that influence the extraction of summary sentences, the 7 features with the strongest effect are selected: sentence relevance, similarity between the sentence and the central sentence, the number of keywords contained in the sentence, the number of domain entity nouns contained in the sentence, sentence informativeness, sentence length and sentence position. Each feature is transformed mathematically into a reasonable formula; a feature score is then calculated per sentence, and the 7 scores of each sentence form its sentence feature vector.
Further, selecting the linguistic features as sentence vector features and constructing the feature calculation formulas comprises the following:
① Sentence relevance
Sentence relevance is the basic requirement of summary generation. The invention models, via cross entropy, the reduction of the summary's uncertainty with respect to the original text, so that the summary text infers the original text with minimal information loss. Meanwhile, entropy from information theory measures the information content of the summary and models its redundancy: the larger the entropy, the higher the uncertainty of the text, the greater the information content and the smaller the redundancy. The combined relevance and redundancy model is:
Score1(S) = Rel(S, D) - Red(S)
[formula images in the original define Rel(S, D) and Red(S)]
where Rel(S, D) denotes the relevance of sentence S to document D, and Red(S) denotes the redundancy of the sentence.
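The formula images defining Rel and Red are not reproduced above, so the following Python sketch is only one plausible reading of the prose: Rel(S, D) as a negative cross entropy between the word distributions of the document and of the sentence, and Red(S) shrinking as the sentence's entropy grows. The distributions, the smoothing constant and the function names are assumptions for illustration.

    import math
    from collections import Counter

    def word_dist(words, vocab, eps=1e-9):
        """Smoothed unigram distribution over a shared vocabulary."""
        counts = Counter(words)
        total = len(words) + eps * len(vocab)
        return {w: (counts[w] + eps) / total for w in vocab}

    def score1(sentence_words, doc_words):
        """Score1(S) = Rel(S, D) - Red(S) under the assumed definitions above."""
        vocab = set(doc_words) | set(sentence_words)
        p_d = word_dist(doc_words, vocab)
        p_s = word_dist(sentence_words, vocab)
        rel = sum(p_d[w] * math.log(p_s[w]) for w in vocab)        # -H(P_D, P_S)
        entropy = -sum(p_s[w] * math.log(p_s[w]) for w in vocab)   # H(P_S)
        red = 1.0 / (1.0 + entropy)   # assumed: redundancy falls as entropy grows
        return rel - red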
② Similarity between the sentence and the central sentence
The central sentence is the sentence containing the richest text information; in the present invention, the sentence containing the most feature words is selected as the central sentence. For each remaining sentence, the higher its similarity to the central sentence, the richer the text information it contains and the higher the probability that it is selected as a summary sentence. The model is expressed as:
[formula image in the original defines Sim(S, S_cen)]
where Sim(S, S_cen) is the similarity between the two sentences; d is the number of occurrences of the i-th word in the sentence; F(w_ij) is the term frequency of the co-occurring word; k and b are tuning factors; and idf(w_ij) is the degree of correlation between the co-occurring word and the text, expressed as:
[formula image in the original defines idf(w_ij)]
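The Sim(S, S_cen) and idf formula images are likewise missing, but the quantities named in the prose (a co-occurrence term frequency, tuning factors k and b, an idf term) match the textbook BM25 scheme, so the following sketch uses standard BM25 as a stand-in; the default values of k and b are assumptions.

    import math

    def bm25_sim(sentence, center, corpus, k=1.2, b=0.75):
        """Assumed BM25-style similarity between a sentence and the central
        sentence; corpus is the list of all sentences, each a list of words."""
        avgdl = sum(len(s) for s in corpus) / len(corpus)
        n = len(corpus)
        score = 0.0
        for w in set(center):
            f = sentence.count(w)                   # term frequency of w in S
            df = sum(1 for s in corpus if w in s)   # sentences containing w
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * f * (k + 1) / (f + k * (1 - b + b * len(sentence) / avgdl))
        return score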
③ Number of keywords contained in the sentence
The number of feature words strongly influences summary extraction: a sentence containing no feature words has weight 1, and the weight increases with the feature words it contains. The model is:
Score3(S) = 1 + α1·Nf
where α1 is a hyperparameter with value 0.5, and Nf is the number of feature words in the sentence.
④ Number of domain entity nouns contained in the sentence
In practical applications, texts from different fields develop their own formats and domain nouns; taking domain nouns into account during extraction improves the quality of the summary. The relevant domain nouns are counted, and the extraction weight is increased for sentences containing them. The model is:
Score4(S) = 1 + α2·Ne
where α2 is a hyperparameter with value 0.3, and Ne is the number of entity nouns contained in the sentence.
⑤ Sentence informativeness
A sentence's informativeness is likewise modeled through domain vocabulary: backbone domain nouns are counted, and the extraction weight is increased for sentences containing them. The model is:
Score5(S) = 1 + α3·Ne
where α3 is a hyperparameter with value 0.3, and Ne is the number of entity nouns contained in the sentence.
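The three count-based features ③, ④ and ⑤ translate directly into code; a minimal sketch follows. The keyword and noun lists are assumed inputs, and since the text of feature ⑤ mirrors feature ④, the separate backbone-noun list is an assumption.

    def score3(sentence, keywords, alpha1=0.5):
        """Score3(S) = 1 + α1·Nf, with Nf the number of keywords in the sentence."""
        return 1 + alpha1 * sum(1 for w in sentence if w in keywords)

    def score4(sentence, domain_nouns, alpha2=0.3):
        """Score4(S) = 1 + α2·Ne, with Ne the number of domain entity nouns."""
        return 1 + alpha2 * sum(1 for w in sentence if w in domain_nouns)

    def score5(sentence, backbone_nouns, alpha3=0.3):
        """Score5(S) = 1 + α3·Ne over the assumed backbone-domain noun list."""
        return 1 + alpha3 * sum(1 for w in sentence if w in backbone_nouns)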
⑥ Sentence length
Summary length is a dimension the generation task must consider: a summary that is too long loses its value as condensed information, while one that is too short conveys insufficient information. In the generation phase, sentence length is therefore taken into account; for a single sentence, the average sentence length is taken as the optimal length. The model is:
[formula image in the original defines Score6(S)]
where x̄ is the average sentence length and x is the length of sentence S.
⑦ Sentence position
Research shows that in manually written summaries the first sentence of a paragraph is chosen as a summary sentence with probability 85%, while the closing sentence is chosen in 7% of cases. Based on this conclusion, the weights of first-paragraph sentences, tail-paragraph sentences and the remaining sentences are adjusted. The model is:
[formula image in the original defines Score7(S)]
where F_S is the total number of first sentences, E_S is the total number of tail sentences, N_S is the total number of sentences in the text, m is the index of the current sentence, and ε0, ε1, ε2 are hyperparameters.
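The Score6 and Score7 formula images are not recoverable from this text, so the sketch below is a hedged approximation only: a Gaussian penalty around the average length for feature ⑥, and piecewise position weights for feature ⑦. Both functional forms and the ε defaults are assumptions, loosely guided by the quoted 85% and 7% statistics.

    import math

    def score6(sent_len, avg_len):
        """Assumed Gaussian length penalty, maximal at the average length."""
        return math.exp(-((sent_len - avg_len) ** 2) / (2.0 * avg_len ** 2))

    def score7(m, first_ids, tail_ids, eps0=0.85, eps1=0.07, eps2=0.01):
        """Assumed piecewise position weight: first-paragraph sentences get the
        largest boost, tail sentences the next, remaining sentences the least."""
        if m in first_ids:
            return 1 + eps0
        if m in tail_ids:
            return 1 + eps1
        return 1 + eps2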
In a second aspect, an embodiment of the present invention provides an automatic text summary generating system, where the system includes:
the sample acquisition module is used for acquiring an original Chinese long text and a neural network model;
the text preprocessing module is used for performing initial compression on the original text, removing useless sentences, and representing the original text as a sentence set through sentence re-segmentation;
the text characteristic representation module is used for representing each sentence in the sentence set by using a one-dimensional vector to generate a two-dimensional characteristic vector matrix of the text;
the semantic vector generating module is used for constructing sentence semantic relations in the text through a neural network model and generating semantic vectors of the text;
and the automatic text abstract generation module is used for predicting the next vector from the semantic vector and the current input vector through the neural network model, iteratively generating output sentence feature vectors until a text end symbol is generated, and finally converting all generated sentence feature vectors back into text to obtain the generated abstract.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the above method.
The present application provides an automatic text abstract generation method, system, computer equipment and storage medium that realize automatic summary generation with an encoding-decoding neural network model based on sentence vector representation and combined with an attention mechanism. Through a sentence vector construction algorithm, the 7 linguistic features with the greatest influence on extraction quality are selected; modeling each feature's calculation formula yields the feature vector representation of every sentence and hence the text vector matrix. This matrix is input into the encoding-decoding neural network with attention: the encoding layer produces a semantic vector, the semantic vector and the original text vectors are input into the decoding layer combined with attention to obtain a new text vector composition, and inverse mapping of these vectors yields a piece of text, which is the generated abstract.
Compared with the prior art, the sentence feature selection fully considers linguistic information, ensuring that the feature values are positively correlated with relevance and information content and negatively correlated with redundancy. The 7 proposed linguistic features greatly enrich the sentence representation compared with common extraction algorithms that use only partial information such as similarity and keywords, so the vectors better express the differences and similarities between sentences. The method creatively uses sentences as the basic input unit and obtains coherent text abstracts through the neural network model, effectively avoiding the grammatically and syntactically incorrect sentences produced by word-level neural models. With sentences as the input unit, the abstract produced by the generative neural network consists essentially of important sentences extracted from the article, but semantic information between sentences is fully considered, so the extracted sentences are coherent and a high-quality text abstract is generated.
Drawings
FIG. 1 is a schematic flow chart diagram of an automatic text summarization method implemented in accordance with the present invention;
FIG. 2 is a schematic flowchart of the segmentation and compression of the original text in step S12 of FIG. 1;
FIG. 3 is a schematic flowchart of obtaining the sentence feature vectors in step S13 of FIG. 1;
FIG. 4 is a schematic flowchart of steps S14 and S15 of FIG. 1, in which the text vector matrix is semantically encoded and then decoded with the attention mechanism to obtain a new text vector matrix, which is inversely mapped to obtain the text abstract;
FIG. 5 is a schematic structural diagram of an automatic text summarization generation system according to an embodiment of the present invention;
fig. 6 is an internal structural diagram of a computer device in the embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solution and advantages of the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. The embodiments described below are only part of the embodiments of the invention and are used to illustrate it, not to limit its scope. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The method constructs text vectors based on linguistic features and generates the text abstract with an encoding-decoding model combined with an attention mechanism. Chinese linguistic features are selected and modeled to obtain the vector representation of each sentence, and the matrix vector of the text is then built sentence by sentence. The encoding-decoding model with attention effectively learns the semantic relations between sentences: the original matrix vector is input into the model, the model outputs a new matrix vector (the input and output sizes may differ), and the output matrix vector is inversely mapped to obtain the text abstract.
In one embodiment, as shown in fig. 1, there is provided an automatic text summary generation method, including the steps of:
S11, acquiring an original text and a neural network model;
the neural network model is a trained and stable model, and only a certain corresponding relation needs to exist between the original input text and the neural network model, for example, the invention provides an LSTM model which is suitable for the task and combines an attention mechanism. The type of the original input text and the neural network model are not limited and can be determined according to actual use requirements.
S12, segmenting and compressing the original text to obtain a new text representation;
wherein a sentence set is obtained by splitting and compressing; as shown in fig. 2, step S12 includes:
S121, segmenting the text to obtain a sentence set; for example, the original Text is segmented in sentence units to obtain a set T = (S1, S2, S3, ..., Sn), where n is the number of sentences.
S122, counting sentence lengths and calculating the average sentence length; for the set T of the previous step, len(Si) denotes the length of each sentence, and the average length of the n sentences is x̄ = (1/n) Σi len(Si).
S123, re-segmenting sentences longer than twice the average sentence length; for the previous step, if len(Si) > 2x̄, then Si is re-segmented, yielding n + C1 sentences.
And S124, removing the sentences whose text length is less than 3 from the sentence set and updating the set. Suppose there are C2 sentences of length less than 3; the updated set is T = (s1, s2, s3, ..., sM), where M = n + C1 - C2. It should be noted that the sentence counts here can be adjusted to the actual application requirements or experimental conditions; the above is only an illustrative example and is not limiting.
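As an illustrative, non-limiting sketch, steps S121 to S124 can be written in Python as follows; the sentence- and clause-level punctuation sets are assumptions, since the delimiters are not fixed by the invention.

    import re

    def preprocess(text, min_len=3):
        """S121-S124: split into sentences, re-split sentences longer than twice
        the average length, and drop sentences shorter than min_len."""
        sents = [s for s in re.split(r"[。！？!?]", text) if s]          # S121
        avg = sum(len(s) for s in sents) / len(sents)                    # S122
        resplit = []
        for s in sents:                                                  # S123
            if len(s) > 2 * avg:
                resplit.extend(p for p in re.split(r"[，；,;]", s) if p)
            else:
                resplit.append(s)
        return [s for s in resplit if len(s) >= min_len]                 # S124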
S13, extracting sentence characteristic vectors of the new text to obtain a text vector matrix;
wherein the sentence feature scores are calculated by the sentence feature algorithm and formed into vector representations to obtain the vector matrix of the text; as shown in fig. 3, step S13 includes:
S131, segmenting the sentences into words, removing useless function words, and recombining the words into new sentences; for example, for the set T = (S1, S2, S3, ..., Sm) of the previous step, a sentence S1 is segmented into words, its useless function words (such as the Chinese particle 的) are removed, and the remaining words are recombined to obtain the new S1.
Table 1 (formula images in the original: the calculation formulas of the seven sentence features described above)
S132, calculating the feature score of each sentence with the sentence feature vector extraction algorithm; Table 1 gives the formulas for calculating the sentence features. For example, suppose the scores of the previous step's S1 on the seven features are 1.23, 0.96, 1.13, 1.56, 0.99, 0.78 and 1.16.
S133, constructing the vector representation of each sentence from its feature scores; for the scores obtained in the previous step, the vector of S1 is [1.23, 0.96, 1.13, 1.56, 0.99, 0.78, 1.16], and the vectors of the other sentences are obtained in the same way. It should be noted that the number, kind and formulas of the chosen features do not affect the effectiveness of the invention.
S134, combining the sentence vectors into a two-dimensional vector to obtain the vector matrix representation of the text; for example, for T = (S1, S2, S3, ..., Sm), the final matrix T has shape [7, m].
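Assembling the per-sentence scores into the [7, m] matrix T is then a simple stacking step; the sketch below assumes the seven scoring functions have already been bound to the text so that each takes a single sentence.

    import numpy as np

    def feature_matrix(sentences, feature_fns):
        """Stack the 7 feature scores of every sentence into the [7, m] matrix T;
        feature_fns is the assumed list of seven one-argument scorers."""
        rows = [[fn(s) for fn in feature_fns] for s in sentences]  # m rows of 7
        return np.array(rows).T                                    # shape [7, m]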
And S14 and S15, inputting the text matrix into the neural network model, obtaining a new text vector matrix through the encoding-decoding operation, and inversely mapping the text vectors to obtain the text abstract. As shown in fig. 4, steps S14 and S15 include:
S1451, inputting the text vector matrix into the encoder model to output a semantic vector; for example, for the set T = (S1, S2, S3, ..., Sm) of the previous step, the intermediate semantic vector C = (c1, c2, c3, ..., cm) is obtained.
S1452 to S1454, inputting the sentence start symbol into the decoder model, predicting the next sentence from the text start symbol and the semantic vector, selecting the highest-probability sentence as the new sentence, and iterating the previous step until a text end symbol is generated; the generation task then terminates, yielding a new text vector matrix whose inverse mapping gives the text abstract. For example, the formula yi = g(C, y1, ..., y(i-1)) gives the new sentence yi; y(i+1), y(i+2), ... are generated iteratively until the end symbol appears, producing the output representation Y = (y1, y2, ..., yt), where t is the number of summary sentences. t can be set in the design stage of the neural network model, and its value does not affect the effectiveness of the method.
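As a minimal, non-limiting PyTorch sketch of the described architecture: a bidirectional LSTM encoder over the sentence-feature sequence, dot-product attention, and a unidirectional LSTM decoder predicting the next sentence's feature vector. The layer sizes, the attention variant and the training objective are assumptions, not the exact design of the invention.

    import torch
    import torch.nn as nn

    class Seq2SeqSummarizer(nn.Module):
        def __init__(self, feat_dim=7, hidden=64):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
            self.decoder = nn.LSTM(feat_dim, 2 * hidden, batch_first=True)
            self.out = nn.Linear(4 * hidden, feat_dim)  # context + state -> features

        def forward(self, src, tgt):
            enc, _ = self.encoder(src)                               # [B, m, 2H] semantic matrix
            dec, _ = self.decoder(tgt)                               # [B, t, 2H] decoder states
            attn = torch.softmax(dec @ enc.transpose(1, 2), dim=-1)  # [B, t, m] attention weights
            ctx = attn @ enc                                         # [B, t, 2H] context vectors
            return self.out(torch.cat([dec, ctx], dim=-1))           # predicted sentence vectors

At inference time the decoder is run step by step from the BOS vector, reusing the greedy selection loop sketched earlier.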
It should be noted that, although the steps in the above flowcharts are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence; unless explicitly stated otherwise, they may be performed in other orders.
In one embodiment, as shown in fig. 5, an automatic text summary generation system is provided, which is explained above and will not be described in detail.
For specific limitations of an automatic text summary generation system, reference may be made to the above limitations of the generation method, which are not described herein again.
Fig. 6 shows the internal structure of a computer device in an embodiment; the computer device may specifically be a terminal or a server. As shown in fig. 6, the computer device includes a processor, a memory, and input/output devices connected by a system bus.
It will be appreciated by those of ordinary skill in the art that the architecture shown in FIG. 6 is a block diagram of only part of the architecture associated with the disclosed aspects and does not limit the computing devices to which they may be applied; a particular computing device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In the present specification, the embodiments are described progressively, and the technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only some preferred implementations of the present application, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. Those skilled in the art can make various modifications and substitutions without departing from the technical principle of the present invention, and these shall fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the claims.

Claims (9)

1. An automatic text summary generation method, characterized in that the method comprises the steps of:
acquiring an original text and a neural network model;
segmenting and compressing an original text to obtain a new text representation;
extracting sentence characteristic vectors from the new text to obtain a text vector matrix;
inputting the text matrix into a bidirectional long short-term memory (Bi-LSTM) neural network model, and encoding the text matrix into a text semantic matrix;
and inputting the text semantic matrix into the attention model and the LSTM model, decoding it into a text vector matrix, and inversely mapping the text vectors to obtain the text abstract.
2. The automatic text summary generation method of claim 1, wherein the step of segmenting and compressing the original text comprises:
segmenting the text by taking a sentence as a unit to generate a sentence set;
counting the sentence length and calculating the average sentence length;
re-segmenting sentences whose length exceeds twice the average sentence length;
and removing the sentences whose text length is less than 3 from the sentence set and updating the set.
3. The method of automatic text summarization of claim 1 wherein the step of extracting the sentence feature vectors from the new text to obtain a text vector matrix comprises:
constructing 7 sentence characteristics through Chinese language characteristics;
calculating 7 feature scores of each sentence to form a one-dimensional vector to represent each sentence;
the sentence vectors are combined into a two-dimensional vector to obtain the vector matrix representation of the text.
4. The method of automatic text summarization of claim 1, wherein the step of inputting the text matrix into a bidirectional long short-term memory (Bi-LSTM) neural network model and encoding it into a text semantic matrix comprises:
the encoder receives the vectorized representation of each sentence in turn;
the encoder outputs a semantic vector V.
5. The method of automatic text summarization of claim 1, wherein the step of inputting the text semantic matrix into the attention model and the long short-term memory (LSTM) neural network model, decoding it into a text vector matrix, and inversely mapping the text vectors to obtain the text abstract comprises:
the decoder takes BOS (the beginning-of-text symbol) as input and, from BOS and the semantic vector V, predicts the probability of each sentence being used in the next round, selecting the sentence with the highest probability;
and the highest-probability sentence and the semantic vector V are input into the decoder to obtain the next round's highest-probability sentence, looping until EOS (the end-of-text symbol) is obtained and generation of the text abstract is finished.
6. The automatic text summary generation method of claim 3, wherein the step of constructing 7 sentence features from Chinese linguistic features and obtaining the sentence feature vectors comprises:
performing necessary preprocessing on the text: first segmenting each sentence into words, then removing useless function words, and recombining the remaining words into a new sentence;
combining the multiple factors that influence the extraction of summary sentences, the 7 features with the strongest effect are selected: sentence relevance, similarity between the sentence and the central sentence, the number of keywords contained in the sentence, the number of domain entity nouns contained in the sentence, sentence informativeness, sentence length and sentence position; each feature is transformed mathematically into a reasonable formula, a feature score is calculated per sentence, and the 7 scores of each sentence form its sentence feature vector.
7. An automatic text summary generation system, the system comprising:
the sample acquisition module is used for acquiring an original Chinese long text and a neural network model;
the text preprocessing module is used for carrying out primary compression processing on the original text, removing useless sentences and expressing the original text into a sentence set through sentence re-segmentation work;
the text characteristic representation module is used for representing each sentence in the sentence set by using a one-dimensional vector to generate a two-dimensional characteristic vector matrix of the text;
the semantic vector generating module is used for constructing sentence semantic relations in the text through a neural network model and generating semantic vectors of the text;
and the automatic text abstract generation module is used for predicting the next vector from the semantic vector and the current input vector through the neural network model, iteratively generating output sentence feature vectors until a text end symbol is generated, and finally converting all generated sentence feature vectors back into text to obtain the generated abstract.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202110921956.5A 2021-08-12 2021-08-12 Automatic text abstract generation method, system, computer equipment and storage medium Pending CN113626584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110921956.5A CN113626584A (en) 2021-08-12 2021-08-12 Automatic text abstract generation method, system, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113626584A

Family

ID=78384682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110921956.5A Pending CN113626584A (en) 2021-08-12 2021-08-12 Automatic text abstract generation method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113626584A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304445A (en) * 2017-12-07 2018-07-20 新华网股份有限公司 A kind of text snippet generation method and device
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
CN110348016A (en) * 2019-07-15 2019-10-18 昆明理工大学 Text snippet generation method based on sentence association attention mechanism
US10902191B1 (en) * 2019-08-05 2021-01-26 International Business Machines Corporation Natural language processing techniques for generating a document summary
CN111177365A (en) * 2019-12-20 2020-05-19 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN111966820A (en) * 2020-07-21 2020-11-20 西北工业大学 Method and system for constructing and extracting generative abstract model
CN112560456A (en) * 2020-11-03 2021-03-26 重庆安石泽太科技有限公司 Generation type abstract generation method and system based on improved neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629804A (en) * 2023-06-06 2023-08-22 河北华正信息工程有限公司 Letters, interviews, supervision and tracking management system and management method
CN116629804B (en) * 2023-06-06 2024-01-09 河北华正信息工程有限公司 Letters, interviews, supervision and tracking management system and management method

Similar Documents

Publication Publication Date Title
CN111897949B (en) Guided text abstract generation method based on Transformer
WO2018214486A1 (en) Method and apparatus for generating multi-document summary, and terminal
CN106970910B (en) Keyword extraction method and device based on graph model
CN111177365A (en) Unsupervised automatic abstract extraction method based on graph model
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
CN108538286A (en) A kind of method and computer of speech recognition
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN109657053B (en) Multi-text abstract generation method, device, server and storage medium
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN111178053B (en) Text generation method for generating abstract extraction by combining semantics and text structure
CN113032552B (en) Text abstract-based policy key point extraction method and system
CN111984782A (en) Method and system for generating text abstract of Tibetan language
CN113065349A (en) Named entity recognition method based on conditional random field
Landthaler et al. Extending Thesauri Using Word Embeddings and the Intersection Method.
CN115906805A (en) Long text abstract generating method based on word fine granularity
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN116501861A (en) Long text abstract generation method based on hierarchical BERT model and label migration
CN116628186B (en) Text abstract generation method and system
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
CN110929022A (en) Text abstract generation method and system
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN112115256A (en) Method and device for generating news text abstract integrated with Chinese stroke information
CN110413770B (en) Method and device for classifying group messages into group topics
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
CN108427769B (en) Character interest tag extraction method based on social network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination