CN113626584A - Automatic text abstract generation method, system, computer equipment and storage medium - Google Patents

Automatic text abstract generation method, system, computer equipment and storage medium

Info

Publication number
CN113626584A
CN113626584A
Authority
CN
China
Prior art keywords
text
sentence
vector
matrix
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110921956.5A
Other languages
Chinese (zh)
Inventor
郑超
窦凤虎
张欢
顾钊铨
王乐
张登辉
韩伟红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Jizhi Hainan Information Technology Co Ltd
Original Assignee
Zhongdian Jizhi Hainan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Jizhi Hainan Information Technology Co Ltd filed Critical Zhongdian Jizhi Hainan Information Technology Co Ltd
Priority to CN202110921956.5A priority Critical patent/CN113626584A/en
Publication of CN113626584A publication Critical patent/CN113626584A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides an automatic text abstract generation method, system, computer equipment and storage medium. A sentence feature vector extraction algorithm is constructed from the linguistic features of Chinese text to form a text feature vector matrix. The algorithm uses 7 linguistic features as text vector features: sentence relevance, similarity between the sentence and the central sentence, the number of keywords contained in the sentence, the number of domain nouns contained in the sentence, sentence informativeness, sentence length and sentence position. The text feature vector matrix is then input into an encoding-decoding model combined with an attention mechanism, and the semantic information before and after each sentence is modeled by a long short-term memory (LSTM) neural network. This overcomes the limitations of traditional statistical vector representations: the text is represented by semantic vectors, a text abstract closer to one written by a human is generated, and the quality of the generated abstract is improved.

Description

Automatic text abstract generation method, system, computer equipment and storage medium
Technical Field
The invention relates to the fields of natural language processing, informatics and deep learning, and in particular to an automatic text abstract generation method, system, computer equipment and storage medium based on linguistic features and an encoding-decoding model combined with an attention mechanism.
Background
With the rapid development of artificial intelligence and the internet, text information on the network has grown explosively in recent years; people receive massive amounts of text every day, such as news, blogs, chats, reports, microblogs and papers. This information overload forces people to spend a great deal of time screening information, which is inefficient. Automatic text summarization is an information compression technique that uses a computer to automatically convert a text or a collection of texts into a short summary for some type of application. Compressing and extracting concise, readable abstracts from big data with text summarization technology accelerates the process of acquiring information and effectively alleviates information overload. At present, text summarization is widely applied in scenarios such as news digests and retrieval systems.
How to extract the key information of a long, redundant, unstructured text into a concise and fluent abstract is the core problem of text summarization. Among automatic summarization techniques, extractive summarization is a method with stable results and a low rate of grammatical and syntactic errors. Existing extractive methods include approaches based on traditional machine learning algorithms, such as TextRank, Lead-3 and clustering, as well as approaches based on deep neural networks, such as Seq2Seq sequence labeling and RNN sentence-importance scoring. Although the abstracts generated by existing extractive methods meet application requirements to a certain extent, they suffer from poor semantic coherence, redundant sentences and similar problems. Abstractive summarization, by contrast, aims to generate a text abstract creatively through a neural network model, fitting the human process of writing a summary as closely as possible.
Therefore, it is desirable to provide an automatic text summary generation method that fully considers sentence coherence and information content while guaranteeing a stable generation effect and freedom from grammatical errors.
Disclosure of Invention
The invention aims to provide an automatic text abstract generation method that uses the linguistic features of Chinese text to construct a sentence feature vector extraction algorithm and form a text feature vector matrix. The matrix is then input into the encoding-decoding model combined with an attention mechanism proposed by the invention: a bidirectional long short-term memory (Bi-LSTM) neural network encodes an intermediate semantic vector, and a unidirectional LSTM combined with the attention mechanism decodes it, realizing automatic extraction of the text abstract.
In order to achieve the above object, it is necessary to provide an automatic text summary generating method, system, computer device and storage medium for solving the above technical problems.
In a first aspect, an embodiment of the present invention provides an automatic text summary generation method, where the method includes the following steps:
acquiring an original text and a neural network model;
segmenting and compressing an original text to obtain a new text representation;
extracting sentence characteristic vectors from the new text to obtain a text vector matrix;
inputting the text matrix into a bidirectional long short-term memory (Bi-LSTM) neural network model, and encoding the text matrix into a text semantic matrix;
and inputting the text semantic matrix into the attention model and the LSTM model, decoding it into a text vector matrix, and inversely mapping the text vectors to obtain the text abstract.
Further, the text segmentation and compression steps include:
segmenting the text by taking a sentence as a unit to generate a sentence set;
counting the sentence length and calculating the average sentence length;
re-segmenting sentences whose length exceeds twice the average sentence length;
and removing the sentences whose text length is less than 3 from the sentence set and updating the set.
Further, the sentence feature vector extraction step includes:
constructing 7 sentence features from Chinese linguistic features;
calculating the 7 feature scores of each sentence to form a one-dimensional vector representing the sentence;
combining the sentence vectors into a two-dimensional vector to obtain the vector matrix representation of the text.
Further, the step of encoding the text matrix into the text semantic matrix includes:
the encoder receives each sentence in turn;
the encoder outputs a semantic vector V.
Further, the step of decoding the text semantic matrix into a text vector matrix and obtaining a text abstract includes:
the decoder takes BOS (the beginning-of-text symbol) as input and, from BOS and the semantic vector V, predicts the probability of each sentence being used in the next round, selecting the sentence with the highest probability;
and the highest-probability sentence and the semantic vector V are input into the decoder to obtain the next round's highest-probability sentence, looping until EOS (the end-of-text symbol) is obtained and generation of the text abstract is finished, as sketched below.
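As an illustrative, non-limiting sketch, the decoding loop described above can be written in Python as follows. The decoder callable, its interface and the sentinel indices are assumptions for illustration and are not fixed by the invention.

    def generate_summary(decoder, V, bos_id, eos_id, num_candidates, max_steps=32):
        """Greedy sentence-level decoding: start from BOS, repeatedly select the
        most probable next sentence given the semantic vector V, stop at EOS.
        decoder(prev_id, V) is assumed to return one probability per candidate."""
        prev, summary = bos_id, []
        for _ in range(max_steps):
            probs = decoder(prev, V)                # P(next sentence | prev, V)
            nxt = max(range(num_candidates), key=lambda i: probs[i])
            if nxt == eos_id:                       # end-of-text symbol reached
                break
            summary.append(nxt)
            prev = nxt                              # feed the selection back in
        return summary                              # indices of the chosen sentences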
Further, the step of constructing 7 sentence features and obtaining a sentence feature vector by using the chinese language features includes:
performing necessary preprocessing on the text: first segmenting each sentence into words, then removing useless function words, and recombining the remaining words into a new sentence;
combining the multiple factors that influence the extraction of summary sentences, the 7 features with the strongest effect are selected: sentence relevance, similarity between the sentence and the central sentence, the number of keywords contained in the sentence, the number of domain entity nouns contained in the sentence, sentence informativeness, sentence length and sentence position. Each feature is transformed mathematically into a reasonable formula; a feature score is then calculated per sentence, and the 7 scores of each sentence form its sentence feature vector.
Further, selecting the linguistic features as sentence vector features and constructing the feature calculation formulas comprises the following:
① Sentence relevance
Sentence relevance is the basic requirement of summary generation. The invention models, via cross entropy, the reduction of the summary's uncertainty with respect to the original text, so that the summary text infers the original text with minimal information loss. Meanwhile, entropy from information theory measures the information content of the summary and models its redundancy: the larger the entropy, the higher the uncertainty of the text, the greater the information content and the smaller the redundancy. The combined relevance and redundancy model is:
Score1(S) = Rel(S, D) - Red(S)
[formula images in the original define Rel(S, D) and Red(S)]
where Rel(S, D) denotes the relevance of sentence S to document D, and Red(S) denotes the redundancy of the sentence.
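The formula images defining Rel and Red are not reproduced above, so the following Python sketch is only one plausible reading of the prose: Rel(S, D) as a negative cross entropy between the word distributions of the document and of the sentence, and Red(S) shrinking as the sentence's entropy grows. The distributions, the smoothing constant and the function names are assumptions for illustration.

    import math
    from collections import Counter

    def word_dist(words, vocab, eps=1e-9):
        """Smoothed unigram distribution over a shared vocabulary."""
        counts = Counter(words)
        total = len(words) + eps * len(vocab)
        return {w: (counts[w] + eps) / total for w in vocab}

    def score1(sentence_words, doc_words):
        """Score1(S) = Rel(S, D) - Red(S) under the assumed definitions above."""
        vocab = set(doc_words) | set(sentence_words)
        p_d = word_dist(doc_words, vocab)
        p_s = word_dist(sentence_words, vocab)
        rel = sum(p_d[w] * math.log(p_s[w]) for w in vocab)        # -H(P_D, P_S)
        entropy = -sum(p_s[w] * math.log(p_s[w]) for w in vocab)   # H(P_S)
        red = 1.0 / (1.0 + entropy)   # assumed: redundancy falls as entropy grows
        return rel - red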
② Similarity between the sentence and the central sentence
The central sentence is the sentence containing the richest text information; in the present invention, the sentence containing the most feature words is selected as the central sentence. For each remaining sentence, the higher its similarity to the central sentence, the richer the text information it contains and the higher the probability that it is selected as a summary sentence. The model is expressed as:
[formula image in the original defines Sim(S, S_cen)]
where Sim(S, S_cen) is the similarity between the two sentences; d is the number of occurrences of the i-th word in the sentence; F(w_ij) is the term frequency of the co-occurring word; k and b are tuning factors; and idf(w_ij) is the degree of correlation between the co-occurring word and the text, expressed as:
[formula image in the original defines idf(w_ij)]
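The Sim(S, S_cen) and idf formula images are likewise missing, but the quantities named in the prose (a co-occurrence term frequency, tuning factors k and b, an idf term) match the textbook BM25 scheme, so the following sketch uses standard BM25 as a stand-in; the default values of k and b are assumptions.

    import math

    def bm25_sim(sentence, center, corpus, k=1.2, b=0.75):
        """Assumed BM25-style similarity between a sentence and the central
        sentence; corpus is the list of all sentences, each a list of words."""
        avgdl = sum(len(s) for s in corpus) / len(corpus)
        n = len(corpus)
        score = 0.0
        for w in set(center):
            f = sentence.count(w)                   # term frequency of w in S
            df = sum(1 for s in corpus if w in s)   # sentences containing w
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * f * (k + 1) / (f + k * (1 - b + b * len(sentence) / avgdl))
        return score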
③ Number of keywords contained in the sentence
The number of feature words strongly influences summary extraction: a sentence containing no feature words has weight 1, and the weight increases with the feature words it contains. The model is:
Score3(S) = 1 + α1·Nf
where α1 is a hyperparameter with value 0.5, and Nf is the number of feature words in the sentence.
④ Number of domain entity nouns contained in the sentence
In practical applications, texts from different fields develop their own formats and domain nouns; taking domain nouns into account during extraction improves the quality of the summary. The relevant domain nouns are counted, and the extraction weight is increased for sentences containing them. The model is:
Score4(S) = 1 + α2·Ne
where α2 is a hyperparameter with value 0.3, and Ne is the number of entity nouns contained in the sentence.
⑤ Sentence informativeness
A sentence's informativeness is likewise modeled through domain vocabulary: backbone domain nouns are counted, and the extraction weight is increased for sentences containing them. The model is:
Score5(S) = 1 + α3·Ne
where α3 is a hyperparameter with value 0.3, and Ne is the number of entity nouns contained in the sentence.
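The three count-based features ③, ④ and ⑤ translate directly into code; a minimal sketch follows. The keyword and noun lists are assumed inputs, and since the text of feature ⑤ mirrors feature ④, the separate backbone-noun list is an assumption.

    def score3(sentence, keywords, alpha1=0.5):
        """Score3(S) = 1 + α1·Nf, with Nf the number of keywords in the sentence."""
        return 1 + alpha1 * sum(1 for w in sentence if w in keywords)

    def score4(sentence, domain_nouns, alpha2=0.3):
        """Score4(S) = 1 + α2·Ne, with Ne the number of domain entity nouns."""
        return 1 + alpha2 * sum(1 for w in sentence if w in domain_nouns)

    def score5(sentence, backbone_nouns, alpha3=0.3):
        """Score5(S) = 1 + α3·Ne over the assumed backbone-domain noun list."""
        return 1 + alpha3 * sum(1 for w in sentence if w in backbone_nouns)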
⑥ Sentence length
Summary length is a dimension the generation task must consider: a summary that is too long loses its value as condensed information, while one that is too short conveys insufficient information. In the generation phase, sentence length is therefore taken into account; for a single sentence, the average sentence length is taken as the optimal length. The model is:
[formula image in the original defines Score6(S)]
where x̄ is the average sentence length and x is the length of sentence S.
⑦ Sentence position
Research shows that in manually written summaries the first sentence of a paragraph is chosen as a summary sentence with probability 85%, while the closing sentence is chosen in 7% of cases. Based on this conclusion, the weights of first-paragraph sentences, tail-paragraph sentences and the remaining sentences are adjusted. The model is:
[formula image in the original defines Score7(S)]
where F_S is the total number of first sentences, E_S is the total number of tail sentences, N_S is the total number of sentences in the text, m is the index of the current sentence, and ε0, ε1, ε2 are hyperparameters.
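The Score6 and Score7 formula images are not recoverable from this text, so the sketch below is a hedged approximation only: a Gaussian penalty around the average length for feature ⑥, and piecewise position weights for feature ⑦. Both functional forms and the ε defaults are assumptions, loosely guided by the quoted 85% and 7% statistics.

    import math

    def score6(sent_len, avg_len):
        """Assumed Gaussian length penalty, maximal at the average length."""
        return math.exp(-((sent_len - avg_len) ** 2) / (2.0 * avg_len ** 2))

    def score7(m, first_ids, tail_ids, eps0=0.85, eps1=0.07, eps2=0.01):
        """Assumed piecewise position weight: first-paragraph sentences get the
        largest boost, tail sentences the next, remaining sentences the least."""
        if m in first_ids:
            return 1 + eps0
        if m in tail_ids:
            return 1 + eps1
        return 1 + eps2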
In a second aspect, an embodiment of the present invention provides an automatic text summary generating system, where the system includes:
the sample acquisition module is used for acquiring an original Chinese long text and a neural network model;
the text preprocessing module is used for performing initial compression on the original text, removing useless sentences, and representing the original text as a sentence set through sentence re-segmentation;
the text characteristic representation module is used for representing each sentence in the sentence set by using a one-dimensional vector to generate a two-dimensional characteristic vector matrix of the text;
the semantic vector generating module is used for constructing sentence semantic relations in the text through a neural network model and generating semantic vectors of the text;
and the automatic text abstract generation module is used for predicting the next vector from the semantic vector and the current input vector through the neural network model, iteratively generating output sentence feature vectors until a text end symbol is generated, and finally converting all generated sentence feature vectors back into text to obtain the generated abstract.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the above method.
The present application provides an automatic text abstract generation method, system, computer equipment and storage medium that realize automatic summary generation with an encoding-decoding neural network model based on sentence vector representation and combined with an attention mechanism. Through a sentence vector construction algorithm, the 7 linguistic features with the greatest influence on extraction quality are selected; modeling each feature's calculation formula yields the feature vector representation of every sentence and hence the text vector matrix. This matrix is input into the encoding-decoding neural network with attention: the encoding layer produces a semantic vector, the semantic vector and the original text vectors are input into the decoding layer combined with attention to obtain a new text vector composition, and inverse mapping of these vectors yields a piece of text, which is the generated abstract.
Compared with the prior art, the sentence feature selection fully considers linguistic information, ensuring that the feature values are positively correlated with relevance and information content and negatively correlated with redundancy. The 7 proposed linguistic features greatly enrich the sentence representation compared with common extraction algorithms that use only partial information such as similarity and keywords, so the vectors better express the differences and similarities between sentences. The method creatively uses sentences as the basic input unit and obtains coherent text abstracts through the neural network model, effectively avoiding the grammatically and syntactically incorrect sentences produced by word-level neural models. With sentences as the input unit, the abstract produced by the generative neural network consists essentially of important sentences extracted from the article, but semantic information between sentences is fully considered, so the extracted sentences are coherent and a high-quality text abstract is generated.
Drawings
FIG. 1 is a schematic flow chart diagram of an automatic text summarization method implemented in accordance with the present invention;
FIG. 2 is a schematic flowchart of the segmentation and compression of the original text in step S12 of FIG. 1;
FIG. 3 is a schematic flowchart of obtaining the sentence feature vectors in step S13 of FIG. 1;
FIG. 4 is a schematic flowchart of steps S14 and S15 of FIG. 1, in which the text vector matrix is semantically encoded and then decoded with the attention mechanism to obtain a new text vector matrix, which is inversely mapped to obtain the text abstract;
FIG. 5 is a schematic structural diagram of an automatic text summarization generation system according to an embodiment of the present invention;
fig. 6 is an internal structural diagram of a computer device in the embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solution and advantages of the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. The embodiments described below are only part of the embodiments of the invention and are used to illustrate it, not to limit its scope. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The method constructs text vectors based on linguistic features and generates the text abstract with an encoding-decoding model combined with an attention mechanism. Chinese linguistic features are selected and modeled to obtain the vector representation of each sentence, and the matrix vector of the text is then built sentence by sentence. The encoding-decoding model with attention effectively learns the semantic relations between sentences: the original matrix vector is input into the model, the model outputs a new matrix vector (the input and output sizes may differ), and the output matrix vector is inversely mapped to obtain the text abstract.
In one embodiment, as shown in fig. 1, there is provided an automatic text summary generation method, including the steps of:
S11, acquiring an original text and a neural network model;
the neural network model is a trained and stable model, and only a certain corresponding relation needs to exist between the original input text and the neural network model, for example, the invention provides an LSTM model which is suitable for the task and combines an attention mechanism. The type of the original input text and the neural network model are not limited and can be determined according to actual use requirements.
S12, segmenting and compressing the original text to obtain a new text representation;
wherein a sentence set is obtained by splitting and compressing; as shown in fig. 2, step S12 includes:
S121, segmenting the text to obtain a sentence set; for example, the original Text is segmented in sentence units to obtain a set T = (S1, S2, S3, ..., Sn), where n is the number of sentences.
S122, counting sentence lengths and calculating the average sentence length; for the set T of the previous step, len(Si) denotes the length of each sentence, and the average length of the n sentences is x̄ = (1/n) Σi len(Si).
S123, re-segmenting sentences longer than twice the average sentence length; for the previous step, if len(Si) > 2x̄, then Si is re-segmented, yielding n + C1 sentences.
And S124, removing the sentences whose text length is less than 3 from the sentence set and updating the set. Suppose there are C2 sentences of length less than 3; the updated set is T = (s1, s2, s3, ..., sM), where M = n + C1 - C2. It should be noted that the sentence counts here can be adjusted to the actual application requirements or experimental conditions; the above is only an illustrative example and is not limiting.
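As an illustrative, non-limiting sketch, steps S121 to S124 can be written in Python as follows; the sentence- and clause-level punctuation sets are assumptions, since the delimiters are not fixed by the invention.

    import re

    def preprocess(text, min_len=3):
        """S121-S124: split into sentences, re-split sentences longer than twice
        the average length, and drop sentences shorter than min_len."""
        sents = [s for s in re.split(r"[。！？!?]", text) if s]          # S121
        avg = sum(len(s) for s in sents) / len(sents)                    # S122
        resplit = []
        for s in sents:                                                  # S123
            if len(s) > 2 * avg:
                resplit.extend(p for p in re.split(r"[，；,;]", s) if p)
            else:
                resplit.append(s)
        return [s for s in resplit if len(s) >= min_len]                 # S124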
S13, extracting sentence characteristic vectors of the new text to obtain a text vector matrix;
wherein the sentence feature scores are calculated by the sentence feature algorithm and formed into vector representations to obtain the vector matrix of the text; as shown in fig. 3, step S13 includes:
S131, segmenting the sentences into words, removing useless function words, and recombining the words into new sentences; for example, for the set T = (S1, S2, S3, ..., Sm) of the previous step, a sentence S1 is segmented into words, its useless function words (such as the Chinese particle 的) are removed, and the remaining words are recombined to obtain the new S1.
Table 1 (formula images in the original: the calculation formulas of the seven sentence features described above)
S132, calculating the feature score of each sentence with the sentence feature vector extraction algorithm; Table 1 gives the formulas for calculating the sentence features. For example, suppose the scores of the previous step's S1 on the seven features are 1.23, 0.96, 1.13, 1.56, 0.99, 0.78 and 1.16.
S133, constructing the vector representation of each sentence from its feature scores; for the scores obtained in the previous step, the vector of S1 is [1.23, 0.96, 1.13, 1.56, 0.99, 0.78, 1.16], and the vectors of the other sentences are obtained in the same way. It should be noted that the number, kind and formulas of the chosen features do not affect the effectiveness of the invention.
S134, combining the sentence vectors into a two-dimensional vector to obtain the vector matrix representation of the text; for example, for T = (S1, S2, S3, ..., Sm), the final matrix T has shape [7, m].
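Assembling the per-sentence scores into the [7, m] matrix T is then a simple stacking step; the sketch below assumes the seven scoring functions have already been bound to the text so that each takes a single sentence.

    import numpy as np

    def feature_matrix(sentences, feature_fns):
        """Stack the 7 feature scores of every sentence into the [7, m] matrix T;
        feature_fns is the assumed list of seven one-argument scorers."""
        rows = [[fn(s) for fn in feature_fns] for s in sentences]  # m rows of 7
        return np.array(rows).T                                    # shape [7, m]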
And S14 and S15, inputting the text matrix into the neural network model, obtaining a new text vector matrix through the encoding-decoding operation, and inversely mapping the text vectors to obtain the text abstract. As shown in fig. 4, steps S14 and S15 include:
S1451, inputting the text vector matrix into the encoder model to output a semantic vector; for example, for the set T = (S1, S2, S3, ..., Sm) of the previous step, the intermediate semantic vector C = (c1, c2, c3, ..., cm) is obtained.
S1452 to S1454, inputting the sentence start symbol into the decoder model, predicting the next sentence from the text start symbol and the semantic vector, selecting the highest-probability sentence as the new sentence, and iterating the previous step until a text end symbol is generated; the generation task then terminates, yielding a new text vector matrix whose inverse mapping gives the text abstract. For example, the formula yi = g(C, y1, ..., y(i-1)) gives the new sentence yi; y(i+1), y(i+2), ... are generated iteratively until the end symbol appears, producing the output representation Y = (y1, y2, ..., yt), where t is the number of summary sentences. t can be set in the design stage of the neural network model, and its value does not affect the effectiveness of the method.
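As a minimal, non-limiting PyTorch sketch of the described architecture: a bidirectional LSTM encoder over the sentence-feature sequence, dot-product attention, and a unidirectional LSTM decoder predicting the next sentence's feature vector. The layer sizes, the attention variant and the training objective are assumptions, not the exact design of the invention.

    import torch
    import torch.nn as nn

    class Seq2SeqSummarizer(nn.Module):
        def __init__(self, feat_dim=7, hidden=64):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
            self.decoder = nn.LSTM(feat_dim, 2 * hidden, batch_first=True)
            self.out = nn.Linear(4 * hidden, feat_dim)  # context + state -> features

        def forward(self, src, tgt):
            enc, _ = self.encoder(src)                               # [B, m, 2H] semantic matrix
            dec, _ = self.decoder(tgt)                               # [B, t, 2H] decoder states
            attn = torch.softmax(dec @ enc.transpose(1, 2), dim=-1)  # [B, t, m] attention weights
            ctx = attn @ enc                                         # [B, t, 2H] context vectors
            return self.out(torch.cat([dec, ctx], dim=-1))           # predicted sentence vectors

At inference time the decoder is run step by step from the BOS vector, reusing the greedy selection loop sketched earlier.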
It should be noted that, although the steps in the above flowcharts are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence; unless explicitly stated otherwise, they may be performed in other orders.
In one embodiment, as shown in fig. 5, an automatic text summary generation system is provided, which is explained above and will not be described in detail.
For specific limitations of an automatic text summary generation system, reference may be made to the above limitations of the generation method, which are not described herein again.
Fig. 6 shows the internal structure of a computer device in an embodiment; the computer device may specifically be a terminal or a server. As shown in fig. 6, the computer device includes a processor, a memory, and input/output devices connected by a system bus.
It will be appreciated by those of ordinary skill in the art that the architecture shown in FIG. 6 is a block diagram of only part of the architecture associated with the disclosed aspects and does not limit the computing devices to which they may be applied; a particular computing device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In the present specification, the embodiments are described progressively, and the technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only some preferred implementations of the present application, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. Those skilled in the art can make various modifications and substitutions without departing from the technical principle of the present invention, and these shall fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the claims.

Claims (9)

1. An automatic text summary generation method, characterized in that the method comprises the steps of:
acquiring an original text and a neural network model;
segmenting and compressing an original text to obtain a new text representation;
extracting sentence characteristic vectors from the new text to obtain a text vector matrix;
inputting the text matrix into a bidirectional long short-term memory (Bi-LSTM) neural network model, and encoding the text matrix into a text semantic matrix;
and inputting the text semantic matrix into the attention model and the LSTM model, decoding it into a text vector matrix, and inversely mapping the text vectors to obtain the text abstract.
2. The automatic text summary generation method of claim 1, wherein the step of segmenting and compressing the original text comprises:
segmenting the text by taking a sentence as a unit to generate a sentence set;
counting the sentence length and calculating the average sentence length;
re-segmenting sentences whose length exceeds twice the average sentence length;
and removing the sentences whose text length is less than 3 from the sentence set and updating the set.
3. The method of automatic text summarization of claim 1 wherein the step of extracting the sentence feature vectors from the new text to obtain a text vector matrix comprises:
constructing 7 sentence characteristics through Chinese language characteristics;
calculating 7 feature scores of each sentence to form a one-dimensional vector to represent each sentence;
the sentence vectors are combined into a two-dimensional vector to obtain the vector matrix representation of the text.
4. The method of automatic text summarization of claim 1, wherein the step of inputting the text matrix into a bidirectional long short-term memory (Bi-LSTM) neural network model and encoding it into a text semantic matrix comprises:
the encoder receives the vectorized representation of each sentence in turn;
the encoder outputs a semantic vector V.
5. The method of automatic text summarization of claim 1, wherein the step of inputting the text semantic matrix into the attention model and the long short-term memory (LSTM) neural network model, decoding it into a text vector matrix, and inversely mapping the text vectors to obtain the text abstract comprises:
the decoder takes BOS (the beginning-of-text symbol) as input and, from BOS and the semantic vector V, predicts the probability of each sentence being used in the next round, selecting the sentence with the highest probability;
and the highest-probability sentence and the semantic vector V are input into the decoder to obtain the next round's highest-probability sentence, looping until EOS (the end-of-text symbol) is obtained and generation of the text abstract is finished.
6. The automatic text summary generation method of claim 3, wherein the step of constructing 7 sentence features from Chinese linguistic features and obtaining the sentence feature vectors comprises:
performing necessary preprocessing on the text: first segmenting each sentence into words, then removing useless function words, and recombining the remaining words into a new sentence;
combining the multiple factors that influence the extraction of summary sentences, the 7 features with the strongest effect are selected: sentence relevance, similarity between the sentence and the central sentence, the number of keywords contained in the sentence, the number of domain entity nouns contained in the sentence, sentence informativeness, sentence length and sentence position; each feature is transformed mathematically into a reasonable formula, a feature score is calculated per sentence, and the 7 scores of each sentence form its sentence feature vector.
7. An automatic text summary generation system, the system comprising:
the sample acquisition module is used for acquiring an original Chinese long text and a neural network model;
the text preprocessing module is used for carrying out primary compression processing on the original text, removing useless sentences and expressing the original text into a sentence set through sentence re-segmentation work;
the text characteristic representation module is used for representing each sentence in the sentence set by using a one-dimensional vector to generate a two-dimensional characteristic vector matrix of the text;
the semantic vector generating module is used for constructing sentence semantic relations in the text through a neural network model and generating semantic vectors of the text;
and the automatic text abstract generation module is used for predicting the next vector from the semantic vector and the current input vector through the neural network model, iteratively generating output sentence feature vectors until a text end symbol is generated, and finally converting all generated sentence feature vectors back into text to obtain the generated abstract.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202110921956.5A 2021-08-12 2021-08-12 Automatic text abstract generation method, system, computer equipment and storage medium Pending CN113626584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110921956.5A CN113626584A (en) 2021-08-12 2021-08-12 Automatic text abstract generation method, system, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113626584A

Family

ID=78384682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110921956.5A Pending CN113626584A (en) 2021-08-12 2021-08-12 Automatic text abstract generation method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113626584A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304445A (en) * 2017-12-07 2018-07-20 新华网股份有限公司 A kind of text snippet generation method and device
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
CN110348016A (en) * 2019-07-15 2019-10-18 昆明理工大学 Text snippet generation method based on sentence association attention mechanism
US10902191B1 (en) * 2019-08-05 2021-01-26 International Business Machines Corporation Natural language processing techniques for generating a document summary
CN111177365A (en) * 2019-12-20 2020-05-19 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN111966820A (en) * 2020-07-21 2020-11-20 西北工业大学 Method and system for constructing and extracting generative abstract model
CN112560456A (en) * 2020-11-03 2021-03-26 重庆安石泽太科技有限公司 Generation type abstract generation method and system based on improved neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629804A (en) * 2023-06-06 2023-08-22 河北华正信息工程有限公司 Letters, interviews, supervision and tracking management system and management method
CN116629804B (en) * 2023-06-06 2024-01-09 河北华正信息工程有限公司 Letters, interviews, supervision and tracking management system and management method

Similar Documents

Publication Publication Date Title
CN111897949B (en) Guided text abstract generation method based on Transformer
WO2018214486A1 (en) Method and apparatus for generating multi-document summary, and terminal
CN106970910B (en) Keyword extraction method and device based on graph model
CN111177365A (en) Unsupervised automatic abstract extraction method based on graph model
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
CN108538286A (en) A kind of method and computer of speech recognition
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN109657053B (en) Multi-text abstract generation method, device, server and storage medium
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN111178053B (en) Text generation method for generating abstract extraction by combining semantics and text structure
CN113032552B (en) Text abstract-based policy key point extraction method and system
CN111984782A (en) Method and system for generating text abstract of Tibetan language
CN113065349A (en) Named entity recognition method based on conditional random field
Landthaler et al. Extending Thesauri Using Word Embeddings and the Intersection Method.
CN115906805A (en) Long text abstract generating method based on word fine granularity
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN116501861A (en) Long text abstract generation method based on hierarchical BERT model and label migration
CN116628186B (en) Text abstract generation method and system
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
CN110929022A (en) Text abstract generation method and system
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN112115256A (en) Method and device for generating news text abstract integrated with Chinese stroke information
CN110413770B (en) Method and device for classifying group messages into group topics
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
CN108427769B (en) Character interest tag extraction method based on social network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination