CN116775855A - Automatic TextRank Chinese abstract generation method based on Bi-LSTM - Google Patents
Automatic TextRank Chinese abstract generation method based on Bi-LSTM
- Publication number
- CN116775855A (application CN202310463558.2A)
- Authority
- CN
- China
- Prior art keywords
- lstm
- information
- text
- sentence
- input
- Prior art date
- 2023-04-26
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
Abstract
The invention relates to the technical field of computers, in particular to a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts. Word vectors converted by a Word2vec model are taken as input information and further processed with the Bi-LSTM model, and the output information is taken as the sentence vector of each sentence of the text and used to calculate the similarity between sentences. Taking the inter-sentence similarity as edge weights and each sentence as a node, a TextRank graph structure is constructed, the TextRank value of each sentence is calculated as its weight, the sentences are ranked by weight, and candidate abstract sentences are finally extracted to form the final abstract. By fusing Bi-LSTM into the Word2vec+TextRank automatic abstract model, the invention provides a new fusion model, W2v-BiL-TR, which improves the quality of the abstract extraction results.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a TextRank Chinese abstract automatic generation method based on Bi-LSTM.
Background
Automatic text summarization technology was proposed so that Internet users can quickly obtain effective, condensed information from today's large volume of complex Internet text. The technology condenses large amounts of complex, redundant information, letting users quickly find the information they want. Current automatic abstract generation technology is divided into two major categories, extractive and abstractive summarization, and the invention is mainly an improvement of extractive summarization technology. Scholars at home and abroad have made many contributions to extractive text summarization:
Extractive summarization mainly calculates and ranks weights for the sentences of a text, and the final abstract is composed of the top-ranked candidate sentences; the result is therefore highly readable and expresses the subject of the original text well. In 2004, Mihalcea proposed the TextRank algorithm on the basis of the PageRank algorithm, borrowing PageRank's graph structure; the algorithm computes the weight of each sentence mainly from the similarity between sentences. In 2013, Tomas Mikolov et al. proposed the Word2vec model, which extracts word vectors from text; sentence vectors obtained by fusing the word vectors carry certain textual semantic features, and combining them with the TextRank algorithm can improve the quality of abstract results. To represent textual semantic features more deeply, some scholars have tried processing text with neural network structures. Cheng et al. encode the text separately at the sentence and word levels and compute sentence and word features with an encoder-decoder architecture. In 2020, Luo Feixiong used the BERT model to further process text and express its semantic information accurately and deeply; applying the processed text in the TextRank algorithm yields generated abstracts of high quality.
A survey of current automatic summarization technology shows that mainstream extractive techniques do not extract textual semantic features deeply. The mainstream Word2vec-fused TextRank extractive technique mainly uses the Word2vec model to convert the text into word vectors from which text sentence vectors are constructed. Although this method is simple, the Word2vec model itself has a simple structure and cannot handle longer texts well, so the overall semantic features of the full text are not adequately represented.
Disclosure of Invention
The invention aims to provide a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts, in order to solve the technical problem that existing mainstream extractive summarization does not extract textual semantic features deeply: word vectors converted by a Word2vec model are used as input information to a Bi-LSTM model, and after further processing an abstract of higher quality is generated.
In order to achieve the above purpose, the invention provides a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts, comprising the following steps (an illustrative code sketch of steps 1 to 3 is given after the list):
step 1: preprocessing the text and then performing sentence segmentation, dividing the text into a one-dimensional sentence list S, wherein the list length represents the number of sentences in the text and S[i] is defined as the i-th sentence of the text;
step 2: preserving the punctuation marks of the text and performing word segmentation on the sentence list to obtain a two-dimensional word list W, wherein W[i][j] is defined as the j-th word of the i-th sentence of the text;
step 3: processing the two-dimensional word list W of step 2 with a Word2vec model to extract the text's word vectors, obtaining a two-dimensional word-vector table WV, wherein WV[i][j] corresponds to the word vector of the j-th word of the i-th sentence of the text;
step 4: processing the obtained word-vector table WV as input information with a Bi-LSTM model and taking the resulting output states as sentence vectors, generating a sentence-vector table SV, wherein SV[i] represents the sentence vector of the i-th sentence;
step 5: computing over the sentence vectors obtained in step 4 to obtain the similarity between sentences, forming a two-dimensional matrix X;
step 6: using the TextRank model to construct a graph structure from the two-dimensional matrix X, wherein the nodes of the graph correspond to the sentences of the text and the edge weights of the graph correspond to the similarity between sentences, and finally calculating the TextRank value of each sentence as its weight, used to score the sentence;
step 7: selecting the top-ranked sentences by weight as candidate abstract sentences and taking them as the final extractive abstract result.
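As an illustration of steps 1 to 3, the following minimal Python sketch splits a text into the sentence list S, word-segments it into W, and builds the word-vector table WV. The patent does not name specific tools, so jieba (word segmentation) and gensim (Word2vec), as well as all parameter values, are assumptions here.

```python
import re
import jieba                        # assumed Chinese word segmenter
from gensim.models import Word2Vec  # assumed Word2vec implementation

def split_sentences(text: str) -> list[str]:
    # Step 1: split the preprocessed text on Chinese end-of-sentence
    # punctuation into the one-dimensional sentence list S.
    parts = re.split(r"(?<=[。！？])", text)
    return [s.strip() for s in parts if s.strip()]

def segment_words(S: list[str]) -> list[list[str]]:
    # Step 2: word-segment every sentence (the tokenizer keeps the
    # punctuation marks), giving the two-dimensional word list W.
    return [list(jieba.cut(s)) for s in S]

def build_word_vectors(W: list[list[str]], dim: int = 100):
    # Step 3: train Word2vec on the segmented text; WV[i][j] is the word
    # vector of the j-th word of the i-th sentence.
    model = Word2Vec(sentences=W, vector_size=dim, window=5, min_count=1)
    WV = [[model.wv[w] for w in sent] for sent in W]
    return WV, model

S = split_sentences("今天天气很好。我们去公园散步！")
W = segment_words(S)
WV, w2v = build_word_vectors(W)
```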
Preferably, in the course of processing with the Bi-LSTM model, word vectors converted by the Word2vec model are used as input information; after input at the different time steps, the input information is screened by the input gates, hidden gates and output gates of the two LSTM structures, forward and reverse, and passed to the current cell state. Meanwhile, the forget gate screens the cell state of the previous time step and combines the retained information with the current cell state; hidden-state information for the two directions is obtained through the processing of the output gates and spliced to serve as the final hidden-state output of the Bi-LSTM model.
Preferably, the specific processing of the text information by the Bi-LSTM model at time t comprises the following steps:
processing and analyzing the input data information and calculating the candidate cell state $\tilde{C}_t$;
updating and calculating the input gate and the forget gate at the current time from the input information and the information of the previous time;
calculating the cell state $C_t$ at the current time from the bidirectional input-gate, forget-gate and hidden-gate state information at the current time;
calculating the output-gate state from the previously obtained states to obtain the output $h_{st}$ of the unidirectional LSTM, and finally obtaining the final Bi-LSTM output $h_t$ by splicing the outputs of the forward and reverse LSTMs.
Preferably, the calculation formula of the candidate cell state $\tilde{C}_t$ is:

$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

where $W_c$ represents the weight, $h_{t-1}$ the output vector of the previous time step, and $b_c$ the bias.

The calculation formula of the input gate at the current time is:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

where $W_i$ is the weight mapping the output of the previous time and the input of the current time to the input gate $i_t$, $\sigma$ is the activation function, and $b_i$ is the bias.

The calculation formula of the forget gate at the current time is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $W_f$ is the weight mapping the output of the previous time and the input of the current time to the forget gate $f_t$, $\sigma$ is the activation function, and $b_f$ is the bias.

The calculation formula of the memory cell state $C_t$ at the current time is:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

The forget gate performs probabilistic forgetting on the memory cell state of the previous time, yielding the information still retained from the previous time; the input gate selects information from the candidate cell state; and the two are combined to obtain the current memory cell state information.

The calculation formula of the output gate at the current time is:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

where $W_o$ is the weight mapping the output of the previous time and the input of the current time to the output gate $o_t$, $\sigma$ is the activation function, and $b_o$ is the bias.

The output information of the current time is obtained through the output gate's processing of the current cell state:

$$h_{st} = o_t * \tanh(C_t)$$

where $h_{st}$ represents the output of a unidirectional LSTM and $\tanh$ is the activation function. The outputs of the LSTMs in the two directions are spliced to obtain the complete Bi-LSTM output information, namely:

$$h_t = [\overrightarrow{h_{st}}, \overleftarrow{h_{st}}]$$
preferably, word vectors obtained by Word2vec processing the text are used as input sequences, and sentences S in the text are processed through an embedding layer of Bi-LSTM i Is converted into a vector representation of a fixed dimension. The input vector is then encoded in a bi-directional LSTM and passed through the input gate i in time steps t For the current input information x t And (5) processing. Then pass through the forgetting door f t For the cell state C at the last moment t-1 Probability forgetting is carried out, useful information is kept, and the probability forgetting and the useful information are combined to obtain the current memory cell state C t . Through the output gate o t Processing the cell state at the current moment by combining the tanh function to obtain unidirectional output information h st . Splicing the forward and reverse unidirectional output information to obtain complete Bi-LSTM output information h t Output h of final Bi-LSTM t Contextual information representing text.
The invention provides a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts, in which word vectors converted by a Word2vec model are taken as input information and further processed with the Bi-LSTM model, and the output information is taken as the sentence vector of each sentence of the text and used to calculate the similarity between sentences. Taking the inter-sentence similarity as edge weights and each sentence as a node, a TextRank graph structure is constructed, the TextRank value of each sentence is calculated as its weight, the sentences are ranked by weight, and candidate abstract sentences are finally extracted to form the final abstract. By fusing Bi-LSTM into the Word2vec+TextRank automatic abstract model, the invention provides a new fusion model, W2v-BiL-TR, which improves the quality of the abstract extraction results.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the Bi-LSTM-based TextRank Chinese abstract automatic generation method.
FIG. 2 is a schematic diagram of the structure of the fusion model W2v-BiL-TR of the Bi-LSTM-based TextRank Chinese abstract automatic generation method.
FIG. 3 is a flow chart of Bi-LSTM feature extraction in the Bi-LSTM-based TextRank Chinese abstract automatic generation method.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to FIG. 1, the invention provides a Bi-LSTM-based TextRank Chinese abstract automatic generation method, comprising the following steps:
s1: preprocessing the text and then performing sentence segmentation, dividing the text into a one-dimensional sentence list S, wherein the list length represents the number of sentences in the text and S[i] is defined as the i-th sentence of the text;
s2: preserving the punctuation marks of the text and performing word segmentation on the sentence list to obtain a two-dimensional word list W, wherein W[i][j] is defined as the j-th word of the i-th sentence of the text;
s3: processing the two-dimensional word list W of step S2 with a Word2vec model to extract the text's word vectors, obtaining a two-dimensional word-vector table WV, wherein WV[i][j] corresponds to the word vector of the j-th word of the i-th sentence of the text;
s4: processing the obtained word-vector table WV as input information with a Bi-LSTM model and taking the resulting output states as sentence vectors, generating a sentence-vector table SV, wherein SV[i] represents the sentence vector of the i-th sentence;
s5: computing over the sentence vectors obtained in step S4 to obtain the similarity between sentences, forming a two-dimensional matrix X;
s6: using the TextRank model to construct a graph structure from the two-dimensional matrix X, wherein the nodes of the graph correspond to the sentences of the text and the edge weights of the graph correspond to the similarity between sentences, and finally calculating the TextRank value of each sentence as its weight, used to score the sentence;
s7: selecting the top-ranked sentences by weight as candidate abstract sentences and taking them as the final extractive abstract result.
Specifically, the text information is further processed by the fused Bi-LSTM model, and the processed information is used as the sentence vector of each sentence to calculate the similarity between all sentences. TextRank is used to construct a graph structure in which the graph nodes are sentences and the edge weights between nodes are represented by the inter-sentence similarities. The weight of each sentence is calculated through the graph structure and used to rank the sentences, finally generating the abstract result. The invention mainly provides a new fusion model, W2v-BiL-TR, based on the Word2vec model, the Bi-LSTM model and the TextRank algorithm; its structure is shown in FIG. 2.
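For step 4 of the model, a sentence vector SV[i] can be read off as the spliced final hidden states of a Bi-LSTM run over the sentence's word vectors. The sketch below uses PyTorch; the framework, the hidden size and the use of an untrained nn.LSTM are all illustrative assumptions, not prescriptions of the patent.

```python
import numpy as np
import torch
import torch.nn as nn

def sentence_vectors(WV, word_dim=100, hidden_dim=128):
    # WV: list of sentences, each a list of word vectors (numpy arrays).
    bilstm = nn.LSTM(input_size=word_dim, hidden_size=hidden_dim,
                     bidirectional=True, batch_first=True)
    SV = []
    with torch.no_grad():
        for sent in WV:
            x = torch.tensor(np.array(sent), dtype=torch.float32).unsqueeze(0)
            _, (h_n, _) = bilstm(x)          # h_n: (2, 1, hidden_dim)
            # Splice the forward and backward final hidden states (h_t).
            SV.append(torch.cat([h_n[0, 0], h_n[1, 0]]).numpy())
    return SV
```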
Further, the following description refers to a specific implementation. In the course of processing with the Bi-LSTM model, the information obtained after the text is processed by the Bi-LSTM model can be used as the sentence-vector information of each sentence of the text.
The specific workflow of Bi-LSTM processing of text information is shown in FIG. 3. When extracting features from the text, the Bi-LSTM model mainly takes the word vectors converted by the Word2vec model as input information, shown as $\{x_1, x_2, \ldots, x_n\}$ in FIG. 3. After input at the different time steps, two pieces of hidden-state information are obtained through the processing of the input gates, hidden gates and output gates of the two LSTM structures, forward and reverse, and spliced; the result, shown as $\{h_1, h_2, \ldots, h_n\}$ in FIG. 3, is output as the final hidden state of the Bi-LSTM model.
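Running the single-step cell from the sketch above over a whole sentence in both directions, as this paragraph describes, and splicing the per-step hidden states yields the sequence {h_1, ..., h_n}. This sketch reuses the lstm_step function and parameter conventions assumed earlier.

```python
def bilstm_sequence(xs, params_fwd, params_bwd, H):
    # Forward pass over x_1 .. x_n.
    h, C = np.zeros(H), np.zeros(H)
    fwd = []
    for x in xs:
        h, C = lstm_step(x, h, C, *params_fwd)
        fwd.append(h)
    # Backward pass over x_n .. x_1.
    h, C = np.zeros(H), np.zeros(H)
    bwd = []
    for x in reversed(xs):
        h, C = lstm_step(x, h, C, *params_bwd)
        bwd.append(h)
    bwd.reverse()
    # Splice forward and backward hidden states at each time step: h_t.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```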
The specific processing steps of the Bi-LSTM model on the text information at the time t are as follows:
Processing and analyzing the input data information, the candidate cell state $\tilde{C}_t$ is calculated as:

$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \tag{1}$$

where $W_c$ represents the weight, $h_{t-1}$ the output vector of the previous time step, and $b_c$ the bias.

According to the input information and the information of the previous time, the input gate and the forget gate at the current time are updated and calculated:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$

where $W_i$ is the weight mapping the output of the previous time and the input of the current time to the input gate $i_t$, $\sigma$ is the activation function, and $b_i$ is the bias;

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{3}$$

where $W_f$ is the weight mapping the output of the previous time and the input of the current time to the forget gate $f_t$, $\sigma$ is the activation function, and $b_f$ is the bias.

The cell state $C_t$ at the current time is calculated from the bidirectional input-gate, forget-gate and hidden-gate state information at the current time:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{4}$$

The forget gate performs probabilistic forgetting on the memory cell state of the previous time, yielding the information still retained from the previous time; the input gate selects information from the candidate cell state; and the two are combined to obtain the current memory cell state information.

Finally, from the states obtained above, the output-gate state is calculated to give the output $h_{st}$ of the unidirectional LSTM, and the final Bi-LSTM output $h_t$ is obtained by splicing the outputs of the forward and reverse LSTMs:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$

where $W_o$ is the weight mapping the output of the previous time and the input of the current time to the output gate $o_t$, $\sigma$ is the activation function, and $b_o$ is the bias;

$$h_{st} = o_t * \tanh(C_t) \tag{6}$$

where $h_{st}$ represents the output of a unidirectional LSTM and $\tanh$ is the activation function.

The outputs of the LSTMs in the two directions are spliced to obtain the complete Bi-LSTM output information $h_t = [\overrightarrow{h_{st}}, \overleftarrow{h_{st}}]$.

The final Bi-LSTM output $h_t$ represents the contextual information of the text. Because the input information of the forward and reverse LSTMs is combined, and because the gating mechanism of the Bi-LSTM captures dependencies across long sequences and can update the memory state at each time t, the long-range dependency problem is handled well and the semantic features of the text can be mined more deeply.
The text information is further processed through the Bi-LSTM model, and the resulting output information is taken as the sentence vector of each sentence. Finally, the similarity between sentences is calculated from the Bi-LSTM sentence vectors, a TextRank graph structure is constructed, and the weight of each sentence is calculated for extracting candidate abstract sentences.
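The following sketch carries these steps through: cosine similarity between sentence vectors as the matrix X, a weighted graph over sentences, and TextRank scores for ranking. networkx's pagerank is used here as a stand-in for the TextRank iteration, and top_k is an illustrative parameter; neither is specified by the patent.

```python
import numpy as np
import networkx as nx

def extract_summary(S, SV, top_k=3):
    n = len(SV)
    X = np.zeros((n, n))                  # step 5: similarity matrix
    for i in range(n):
        for j in range(n):
            if i != j:
                X[i, j] = np.dot(SV[i], SV[j]) / (
                    np.linalg.norm(SV[i]) * np.linalg.norm(SV[j]) + 1e-12)
    graph = nx.from_numpy_array(X)        # step 6: sentences as nodes,
    scores = nx.pagerank(graph)           # similarities as edge weights
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Step 7: keep the top-k sentences, restored to original order.
    return [S[i] for i in sorted(ranked[:top_k])]
```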
In order to analyze the quality of the abstract results finally obtained by this technique, the invention also provides a specific embodiment in which the traditional TextRank summarization technique, the Word2vec-fused TextRank technique and the fusion model W2v-BiL-TR newly proposed by the invention are each evaluated with the Rouge-1, Rouge-2 and Rouge-L evaluation methods. Rouge-1 and Rouge-2 belong to Rouge-N, where N refers to the length of the character or word n-grams matched between the generated abstract and the standard abstract; for example, Rouge-1 is calculated by matching single characters or words between the two. Rouge-L is calculated using the coverage of the longest common subsequence of the generated abstract and the standard abstract as its criterion. The specific evaluation process for the three evaluation standards is as follows:
(1) Converting the generated abstract and the standard abstract into sequences of single characters or words and of double characters or words;
(2) Counting the occurrences of each unigram and bigram in the standard abstract, and constructing the longest common subsequence LCS from it;
(3) Counting the occurrences of each unigram and bigram in the generated abstract, and constructing the longest common subsequence LCS from it;
(4) Sequentially calculating the P, R and $F_1$ values of Rouge-1, Rouge-2 and Rouge-L:

$$P = \frac{n}{u}, \qquad R = \frac{n}{v}, \qquad F_1 = \frac{2PR}{P + R}$$

where n represents the number of segmented word sequences that overlap between the generated abstract and the standard abstract; X and Y denote the generated abstract and the standard abstract, respectively; u and v denote the numbers of word sequences occurring in the generated abstract and in the standard abstract, respectively; R represents the recall rate, P the precision rate, and $F_1$ the harmonic mean of the P and R values.
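As a hedged sketch of the Rouge-1 case of these formulas, at the character level (the granularity here is an assumption): n is the unigram overlap, u and v the unigram counts in the generated and the standard abstract, and P, R and F1 follow directly.

```python
from collections import Counter

def rouge_1(generated: str, reference: str):
    gen, ref = Counter(generated), Counter(reference)  # character unigrams
    n = sum((gen & ref).values())                      # overlap count n
    u, v = sum(gen.values()), sum(ref.values())
    P = n / u if u else 0.0
    R = n / v if v else 0.0
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

print(rouge_1("今天天气很好", "今天天气不错"))  # (0.667, 0.667, 0.667)
```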
The following are the evaluation results of different evaluation standards:
TABLE 1 Rouge-1 values for summary extraction results from different techniques
As shown in Table 1, the abstract obtained by further processing the text with the fused W2v-BiL-TR model scores higher on every index of the Rouge-1 evaluation method. This shows that after the Bi-LSTM model is fused in, the semantic features of the text are better represented and the generated abstract is of better quality.
TABLE 2 Rouge-2 values for summary extraction results from different techniques
Similarly, as shown in Table 2, the abstract obtained by further processing the text with the fused W2v-BiL-TR model scores higher on every index of the Rouge-2 evaluation method, which further shows that fusing in the Bi-LSTM model better represents the semantic features of the text and yields a generated abstract of better quality.
TABLE 3 Rouge-L values for summary extraction results from different techniques
When the longest common subsequence is used as the evaluation criterion, i.e. under the Rouge-L evaluation method, the technique fusing the W2v-BiL-TR model still has the highest indices. It can therefore be concluded that, with the addition of the Bi-LSTM model, the text word vectors extracted by the Word2vec model are further processed, so the resulting sentence vectors have richer features and contain more semantic information than sentence vectors obtained by simple fusion of word vectors. Consequently, when the TextRank algorithm is used to calculate sentence weights, the effect is better and the quality of the finally obtained abstract is higher.
In summary, the invention uses the Bi-LSTM model to take the word vectors converted by the Word2vec model as input information and process them further; the resulting output information serves as the sentence vector of each sentence, the similarity between sentences is calculated, and the computed inter-sentence similarities are then used by TextRank to construct a graph structure. The quality of the finally generated abstract is higher.
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.
Claims (5)
1. A Bi-LSTM-based TextRank Chinese abstract automatic generation method, characterized by comprising the following steps:
step 1: preprocessing the text and then performing sentence segmentation, dividing the text into a one-dimensional sentence list S, wherein the list length represents the number of sentences in the text and S[i] is defined as the i-th sentence of the text;
step 2: preserving the punctuation marks of the text and performing word segmentation on the sentence list to obtain a two-dimensional word list W, wherein W[i][j] is defined as the j-th word of the i-th sentence of the text;
step 3: processing the two-dimensional word list W of step 2 with a Word2vec model to extract the text's word vectors, obtaining a two-dimensional word-vector table WV, wherein WV[i][j] corresponds to the word vector of the j-th word of the i-th sentence of the text;
step 4: processing the obtained word-vector table WV as input information with a Bi-LSTM model and taking the resulting output states as sentence vectors, generating a sentence-vector table SV, wherein SV[i] represents the sentence vector of the i-th sentence;
step 5: computing over the sentence vectors obtained in step 4 to obtain the similarity between sentences, forming a two-dimensional matrix X;
step 6: using the TextRank model to construct a graph structure from the two-dimensional matrix X, wherein the nodes of the graph correspond to the sentences of the text and the edge weights of the graph correspond to the similarity between sentences, and finally calculating the TextRank value of each sentence as its weight, used to score the sentence;
step 7: selecting the top-ranked sentences by weight as candidate abstract sentences and taking them as the final extractive abstract result.
2. The Bi-LSTM-based TextRank Chinese abstract automatic generation method of claim 1, wherein:
in the course of processing with the Bi-LSTM model, word vectors converted by the Word2vec model are used as input information; after input at the different time steps, the input information is screened by the input gates, hidden gates and output gates of the two LSTM structures, forward and reverse, and passed to the current cell state; meanwhile, the forget gate screens the cell state of the previous time step and combines the retained information with the current cell state, and hidden-state information for the two directions is obtained through the processing of the output gates and spliced to serve as the final hidden-state output of the Bi-LSTM model.
3. The Bi-LSTM-based TextRank Chinese abstract automatic generation method of claim 2, wherein:
the specific processing of the text information by the Bi-LSTM model at time t comprises the following steps:
processing and analyzing the input data information and calculating the candidate cell state $\tilde{C}_t$;
updating and calculating the input gate and the forget gate at the current time from the input information and the information of the previous time;
calculating the cell state $C_t$ at the current time from the bidirectional input-gate, forget-gate and hidden-gate state information at the current time;
calculating the output-gate state from the previously obtained states to obtain the output $h_{st}$ of the unidirectional LSTM, and finally obtaining the final Bi-LSTM output $h_t$ by splicing the outputs of the forward and reverse LSTMs.
4. The Bi-LSTM-based TextRank Chinese abstract automatic generation method of claim 3, wherein:
the calculation formula of the candidate cell state $\tilde{C}_t$ is:

$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

where $W_c$ represents the weight, $h_{t-1}$ the output vector of the previous time step, and $b_c$ the bias;

the calculation formula of the input gate $i_t$ at the current time is:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

where $W_i$ is the weight mapping the output of the previous time and the input of the current time to the input gate $i_t$, $\sigma$ is the activation function, and $b_i$ is the bias;

the calculation formula of the forget gate $f_t$ at the current time is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $W_f$ is the weight mapping the output of the previous time and the input of the current time to the forget gate $f_t$, $\sigma$ is the activation function, and $b_f$ is the bias;

the calculation formula of the memory cell state $C_t$ at the current time is:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

the forget gate performs probabilistic forgetting on the memory cell state of the previous time, yielding the information still retained in the cell state of the previous time, the input gate selects and inputs information from the candidate cell state, and the two are combined to obtain the current memory cell state information;

the calculation formula of the output gate $o_t$ at the current time is:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

where $W_o$ is the weight mapping the output of the previous time and the input of the current time to the output gate $o_t$, $\sigma$ is the activation function, and $b_o$ is the bias;

the output information of the current time is obtained through the output gate's processing of the current cell state:

$$h_{st} = o_t * \tanh(C_t)$$

where $h_{st}$ represents the output of a unidirectional LSTM and $\tanh$ is the activation function, and the complete Bi-LSTM output information is obtained by splicing the outputs of the LSTMs in the two directions, namely:

$$h_t = [\overrightarrow{h_{st}}, \overleftarrow{h_{st}}]$$
5. The Bi-LSTM-based TextRank Chinese abstract automatic generation method of claim 4, wherein the final Bi-LSTM output $h_t$ is used to represent the sentence vector of each sentence $S_i$ in the text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310463558.2A CN116775855A (en) | 2023-04-26 | 2023-04-26 | Automatic TextRank Chinese abstract generation method based on Bi-LSTM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310463558.2A CN116775855A (en) | 2023-04-26 | 2023-04-26 | Automatic TextRank Chinese abstract generation method based on Bi-LSTM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116775855A (en) | 2023-09-19
Family
ID=87995184
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310463558.2A Pending CN116775855A (en) | 2023-04-26 | 2023-04-26 | Automatic TextRank Chinese abstract generation method based on Bi-LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116775855A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116956759A (en) * | 2023-09-21 | 2023-10-27 | 宝德计算机系统股份有限公司 | Method, system and device for adjusting rotation speed of BMC fan |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |