CN116775855A - Automatic TextRank Chinese abstract generation method based on Bi-LSTM - Google Patents

Automatic TextRank Chinese abstract generation method based on Bi-LSTM

Info

Publication number
CN116775855A
CN116775855A
Authority
CN
China
Prior art keywords
lstm
sentence
information
text
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310463558.2A
Other languages
Chinese (zh)
Inventor
程珠鸿
周娅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202310463558.2A priority Critical patent/CN116775855A/en
Publication of CN116775855A publication Critical patent/CN116775855A/en
Pending legal-status Critical Current

Classifications

    • G06F16/345 Summarisation for human users
    • G06F16/367 Ontology
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]


Abstract

The invention relates to the field of computer technology, and specifically to a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts. Word vectors converted by a Word2vec model are used as input information and further processed by a Bi-LSTM model; the output information is taken as the sentence vector of each sentence in the text and used to compute the similarity between sentences. A TextRank graph structure is constructed with the sentences as nodes and inter-sentence similarities as edge weights; the TextRank value of each sentence is computed as its weight, the sentences are ranked by weight, and the top candidate summary sentences are extracted to form the final summary. By fusing Bi-LSTM with the Word2vec+TextRank automatic summarization model, the invention proposes a new fusion model, W2v-BiL-TR, which improves the quality of the extracted summaries.

Description

Automatic TextRank Chinese abstract generation method based on Bi-LSTM

Technical Field

The invention relates to the field of computer technology, and in particular to a method for automatically generating TextRank Chinese abstracts based on Bi-LSTM.

Background Art

Automatic text summarization was proposed to cope with today's complex and diverse Internet information and to let Internet users quickly obtain effective, condensed information from large volumes of complex Internet text. The technology condenses large amounts of complex, redundant information so that users can quickly obtain the information they want. Current automatic summarization techniques fall into two main categories, extractive and abstractive; the present invention is implemented mainly as an improvement on extractive summarization. Scholars at home and abroad have already made many contributions to extractive text summarization:

Extractive summarization mainly computes and ranks a weight for each sentence of the text; the final summary is composed of the top-ranked candidate sentences, is highly readable, and expresses the topic of the original text well. In 2004, Mihalcea proposed the TextRank algorithm on the basis of the PageRank algorithm, borrowing PageRank's graph structure; the algorithm computes each sentence's weight mainly from the similarity between sentences. In 2013, Tomas Mikolov et al. proposed the Word2vec model, which extracts word vectors from text; sentence vectors obtained by fusing word vectors carry certain semantic features of the text and, combined with the TextRank algorithm, can improve summary quality. To represent text semantics more deeply, some scholars have combined neural network structures to process text: Cheng et al. encoded sentences and words in different forms and computed sentence and word features separately in an encoder-decoder architecture. In 2020, Luo Feixiong used the BERT model to further process text and express its semantic information accurately and deeply; the processed text was used in the TextRank algorithm, and the resulting summaries were of relatively high quality.

Research on current automatic summarization shows that mainstream extractive techniques do not extract text semantic features in depth. The mainstream TextRank extractive technique fused with Word2vec mainly uses the Word2vec model to convert the text into word vectors, which are then used to construct sentence vectors. Although this approach is simple, the Word2vec model has a relatively simple structure and cannot handle longer texts well, so it does not fully represent the overall semantic features of the whole text.

Summary of the Invention

The purpose of the invention is to provide a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts, aiming to solve the technical problem that existing mainstream extractive summarization does not extract text semantic features in depth: a Bi-LSTM model takes the word vectors converted by the Word2vec model as input information and, after further processing, produces summaries of higher quality.

To achieve the above purpose, the invention provides a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts, comprising the following steps:

Step 1: preprocess the text, then split it into sentences to obtain a one-dimensional sentence list S, where the list length is the number of sentences in the text and S[i] is defined as the i-th sentence of the text;

Step 2: keep the punctuation of the text and segment the sentence list into words to obtain a two-dimensional word list W, where W[i][j] is defined as the j-th word of the i-th sentence of the text;

Step 3: process the two-dimensional word list W from step 2 with the Word2vec model to extract text word vectors, obtaining the two-dimensional word-vector table WV of the text, where WV[i][j] corresponds to the word vector of the j-th word of the i-th sentence;

Step 4: take the obtained word-vector table WV as input information, process it with the Bi-LSTM model, and take the resulting output states as sentence vectors, generating the sentence-vector table SV, where SV[i] denotes the sentence vector of the i-th sentence;

Step 5: compute the similarity between sentences from the sentence vectors obtained in step 4, forming a two-dimensional matrix X;

Step 6: use the TextRank model to build a graph structure from the two-dimensional matrix X, where the graph nodes correspond to the sentences of the text and the edge weights correspond to inter-sentence similarity; finally compute each sentence's TextRank value as its weight and use it to score the sentences;

Step 7: rank the sentences by their weights and select the top-ranked sentences as candidate summary sentences, forming the final extractive summary result (an illustrative end-to-end sketch of these steps follows).
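
For illustration only, the following is a minimal Python sketch of steps 1 to 7. The library choices (jieba for word segmentation, gensim for Word2vec, networkx for the graph computation) and the use of cosine similarity in step 5 are assumptions made for the sketch, not requirements of the invention; the Bi-LSTM encoding of step 4 is abstracted behind an encode parameter, with a simple word-vector mean as a stand-in default.

import re
import numpy as np
import jieba
import networkx as nx
from gensim.models import Word2Vec

def summarize(text, top_k=3, encode=None):
    # Step 1: split the text into a one-dimensional sentence list S.
    S = [s for s in re.split(r'(?<=[。！？])', text) if s.strip()]
    # Step 2: segment each sentence into words, keeping punctuation (list W).
    W = [list(jieba.cut(s)) for s in S]
    # Step 3: train Word2vec on the word lists and look up word vectors (table WV).
    w2v = Word2Vec(W, vector_size=100, min_count=1)
    WV = [[w2v.wv[w] for w in sent] for sent in W]
    # Step 4: encode each sentence into a sentence vector (table SV).
    # The invention uses a Bi-LSTM here; a word-vector mean stands in by default.
    encode = encode or (lambda vecs: np.mean(vecs, axis=0))
    SV = np.array([encode(vecs) for vecs in WV])
    # Step 5: cosine similarity between all sentence pairs (matrix X).
    norm = SV / np.linalg.norm(SV, axis=1, keepdims=True)
    X = norm @ norm.T
    # Step 6: sentences as nodes, similarities as edge weights, TextRank scores.
    np.fill_diagonal(X, 0.0)               # drop self-similarity self-loops
    scores = nx.pagerank(nx.from_numpy_array(X))
    # Step 7: extract the top-ranked sentences, restored to original order.
    best = sorted(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return ''.join(S[i] for i in best)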

Preferably, when processing with the Bi-LSTM model, the word vectors converted by the Word2vec model serve as input information. After being input at successive moments, they are processed by the input, hidden and output gates of two LSTM structures running in the forward and reverse directions: the input gate screens the input information and passes it into the current cell state, while the forget gate screens the cell state of the previous moment and combines the retained information with the current cell state; the output gate then produces hidden-state information for each direction, which is spliced together as the final hidden-state output of the Bi-LSTM model.

Preferably, the specific processing of the text information by the Bi-LSTM model at time t comprises the following steps:

process and analyze the input data information and compute the candidate cell state C̃_t;

update and compute the input gate and the forget gate at the current moment from the input information and the information of the previous moment;

compute the cell state C_t at the current moment from the state information of the bidirectional input, forget and hidden gates at the current moment;

from the states obtained above, compute the output gate state and obtain the output h_st of the unidirectional LSTM; finally, splice the outputs of the forward and reverse LSTMs to obtain the final Bi-LSTM output h_t.

Preferably, the candidate cell state C̃_t is computed as:

C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

where W_c represents the weight, h_{t-1} represents the output vector of the previous time step, and b_c is the bias.

The input gate at the current moment is computed as:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

where W_i represents the weight applied to the previous moment's output and the current moment's input for the input gate i_t at the current moment, σ is the activation function, and b_i is the bias.

The forget gate at the current moment is computed as:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where W_f represents the weight applied to the previous moment's output and the current moment's input for the forget gate f_t at the current moment, σ is the activation function, and b_f is the bias.

The memory cell state C_t at the current moment is computed as:

C_t = f_t * C_{t-1} + i_t * C̃_t

The forget gate probabilistically forgets the memory cell state of the previous moment, yielding the information that survives from the previous cell state; the input gate selects information from the candidate cell state; combining the two gives the current memory cell state information.

The output gate at the current moment is computed as:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

where W_o represents the weight applied to the previous moment's output and the current moment's input for the output gate o_t at the current moment, σ is the activation function, and b_o is the bias.

The output gate processes the cell state at the current moment to give the output information at the current moment:

h_st = o_t * tanh(C_t)

where h_st represents the output of the LSTM in a single direction and tanh is the activation function. Splicing the outputs of the LSTMs in the two directions gives the complete Bi-LSTM output information, namely:

h_t = [h_st^forward, h_st^backward]

Preferably, the word vectors obtained by processing the text with Word2vec are used as the input sequence, and the embedding layer of the Bi-LSTM converts the word vectors of sentence S_i into a fixed-dimensional vector representation. The input vectors are then encoded in the bidirectional LSTM: at each time step the input gate i_t processes the current input information x_t; the forget gate f_t probabilistically forgets the previous cell state C_{t-1} while retaining useful information, and combining the two gives the current memory cell state C_t; the output gate o_t, combined with the tanh function, processes the current cell state to give the unidirectional output information h_st. Splicing the forward and reverse unidirectional outputs gives the complete Bi-LSTM output information h_t, and the final Bi-LSTM output h_t is used to represent the contextual information of the text.
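
As an illustration of this preferred processing path, the following is a minimal sketch of a Bi-LSTM sentence encoder in PyTorch; the framework, the hidden size of 128 and the 100-dimensional word vectors are assumptions of the sketch rather than requirements of the invention. The encoder takes the Word2vec word vectors of one sentence and returns the concatenation of the final forward and backward hidden states as the sentence vector SV[i].

import torch
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    # Hypothetical encoder: a sentence's word vectors in, one sentence vector out.
    def __init__(self, word_dim=100, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, word_vectors):      # word_vectors: (seq_len, word_dim)
        x = word_vectors.unsqueeze(0)     # add a batch dimension
        _, (h_n, _) = self.lstm(x)        # h_n: (2, 1, hidden_dim), forward/backward
        # Splice the final forward and backward hidden states:
        # h_t = [h_st^forward, h_st^backward].
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)

encoder = BiLSTMSentenceEncoder()
sentence_vec = encoder(torch.randn(12, 100))   # 12 words, 100-dim word vectors
print(sentence_vec.shape)                      # torch.Size([256])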

The invention provides a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts. Word vectors converted by the Word2vec model serve as input information and are further processed by the Bi-LSTM model; the output information is taken as the sentence vector of each sentence of the text and used to compute the similarity between sentences. A TextRank graph structure is built with the sentences as nodes and inter-sentence similarities as edge weights; each sentence's TextRank value is computed as its weight, the sentences are ranked by weight, and candidate summary sentences are extracted to form the final summary. By fusing Bi-LSTM with the Word2vec+TextRank automatic summarization model, the invention proposes a new fusion model, W2v-BiL-TR, which improves the quality of summary extraction results.

Brief Description of the Drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a flow chart of the Bi-LSTM-based TextRank Chinese abstract automatic generation method of the invention.

Figure 2 is a structural diagram of the fusion model W2v-BiL-TR of the Bi-LSTM-based TextRank Chinese abstract automatic generation method of the invention.

Figure 3 is a flow chart of Bi-LSTM feature extraction in the Bi-LSTM-based TextRank Chinese abstract automatic generation method of the invention.

Detailed Description of Embodiments

Embodiments of the invention are described in detail below, examples of which are illustrated in the drawings, where identical or similar reference numerals denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting it.

Referring to Figure 1, the invention provides a Bi-LSTM-based method for automatically generating TextRank Chinese abstracts, comprising the following steps:

S1: preprocess the text, then split it into sentences to obtain a one-dimensional sentence list S, where the list length is the number of sentences in the text and S[i] is defined as the i-th sentence of the text;

S2: keep the punctuation of the text and segment the sentence list into words to obtain a two-dimensional word list W, where W[i][j] is defined as the j-th word of the i-th sentence of the text;

S3: process the two-dimensional word list W from step S2 with the Word2vec model to extract text word vectors, obtaining the two-dimensional word-vector table WV of the text, where WV[i][j] corresponds to the word vector of the j-th word of the i-th sentence;

S4: take the obtained word-vector table WV as input information, process it with the Bi-LSTM model, and take the resulting output states as sentence vectors, generating the sentence-vector table SV, where SV[i] denotes the sentence vector of the i-th sentence;

S5: compute the similarity between sentences from the sentence vectors obtained in step S4, forming a two-dimensional matrix X;

S6: use the TextRank model to build a graph structure from the two-dimensional matrix X, where the graph nodes correspond to the sentences of the text and the edge weights correspond to inter-sentence similarity; finally compute each sentence's TextRank value as its weight and use it to score the sentences;

S7: rank the sentences by their weights and select the top-ranked sentences as candidate summary sentences, forming the final extractive summary result.

Specifically, the invention fuses the Bi-LSTM model to further process the text information; the processed information serves as the sentence vector of each sentence and is used to compute the similarity between sentences. TextRank is used to build a graph structure in which the graph nodes are the sentences and the edge weights between nodes are given by inter-sentence similarity. The weight of each sentence is computed from the graph structure and used for sentence ranking, and the summary result is finally generated. Based mainly on the Word2vec model, the Bi-LSTM model and the TextRank algorithm, the invention proposes a new fusion model, W2v-BiL-TR, whose structure is shown in Figure 2.
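
For reference, the sentence-scoring step can be written as plain power iteration over the similarity matrix X using the standard damped TextRank update from Mihalcea's algorithm; the damping factor 0.85 and the convergence tolerance below are conventional choices assumed for the sketch, not values specified by the invention.

import numpy as np

def textrank_scores(X, d=0.85, tol=1e-6, max_iter=100):
    # TextRank update: WS(Vi) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(Vj)
    n = X.shape[0]
    W = X.copy()
    np.fill_diagonal(W, 0.0)               # no self-loops in the sentence graph
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    scores = np.ones(n)
    for _ in range(max_iter):
        new = (1 - d) + d * (P.T @ scores)   # pull weighted score from neighbours
        if np.abs(new - scores).sum() < tol:
            break
        scores = new
    return scores                            # one weight per sentence, for ranking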

Further, the following description is given in conjunction with the specific implementation. In the processing with the Bi-LSTM model, the information obtained after the text is processed by the Bi-LSTM model can serve as the sentence-vector information of each sentence of the text.

The specific workflow of Bi-LSTM processing of text information is shown in Figure 3. When the Bi-LSTM model processes text information and extracts features from the text, the word vectors converted by the Word2vec model serve as the main input information, shown as {x_1, x_2, ..., x_n} in Figure 1. After being input at successive moments, they are processed by the input, hidden and output gates of the two forward and reverse LSTM structures to obtain two sets of hidden-state information, which are spliced together, shown as {h_1, h_2, ..., h_n} in Figure 1, as the final hidden-state output of the Bi-LSTM model.

The specific steps by which the Bi-LSTM model processes text information at time t are as follows:

The input data information is processed and analyzed, and the candidate cell state C̃_t is computed as:

C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (1)

where W_c represents the weight, h_{t-1} represents the output vector of the previous time step, and b_c is the bias.

The input gate and the forget gate at the current moment are updated and computed from the input information and the information of the previous moment:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (2)

where W_i represents the weight applied to the previous moment's output and the current moment's input for the input gate i_t at the current moment, σ is the activation function, and b_i is the bias.

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (3)

where W_f represents the weight applied to the previous moment's output and the current moment's input for the forget gate f_t at the current moment, σ is the activation function, and b_f is the bias.

The cell state C_t at the current moment is computed from the state information of the bidirectional input, forget and hidden gates at the current moment:

C_t = f_t * C_{t-1} + i_t * C̃_t    (4)

The forget gate probabilistically forgets the memory cell state of the previous moment, yielding the information that survives from the previous cell state; the input gate selects information from the candidate cell state; combining the two gives the current memory cell state information.

Finally, from the states obtained above, the output gate state is computed and the output h_st of the unidirectional LSTM is obtained; the final Bi-LSTM output h_t is then obtained by splicing the outputs of the forward and reverse LSTMs. The formulas are as follows:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (5)

where W_o represents the weight applied to the previous moment's output and the current moment's input for the output gate o_t at the current moment, σ is the activation function, and b_o is the bias.

h_st = o_t * tanh(C_t)    (6)

where h_st represents the output of the LSTM in a single direction and tanh is the activation function.

Splicing the outputs of the LSTMs in the two directions gives the complete Bi-LSTM output information h_t = [h_st^forward, h_st^backward].

The final h_t output by the Bi-LSTM is used to represent the contextual information of the text. Because the input information of the forward and reverse LSTMs is combined, and because of the Bi-LSTM's memory-gate mechanism, dependencies across long-distance sequences are captured and the memory state can be changed at each moment t; this resolves the long-dependency problem well and allows deeper mining and extraction of the semantic features of the text.
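
To make equations (1) to (6) concrete, the following is a minimal numpy sketch of a single LSTM time step and of the bidirectional splicing; the weight shapes, random initialization and dimensions are illustrative assumptions only, since a trained model would learn these parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    # One time step of a single-direction LSTM, following equations (1)-(6).
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    C_cand = np.tanh(p['W_c'] @ z + p['b_c'])         # (1) candidate cell state
    i_t = sigmoid(p['W_i'] @ z + p['b_i'])            # (2) input gate
    f_t = sigmoid(p['W_f'] @ z + p['b_f'])            # (3) forget gate
    C_t = f_t * C_prev + i_t * C_cand                 # (4) memory cell state
    o_t = sigmoid(p['W_o'] @ z + p['b_o'])            # (5) output gate
    h_st = o_t * np.tanh(C_t)                         # (6) unidirectional output
    return h_st, C_t

def init_params(word_dim, hidden, rng):
    p = {'W_' + g: 0.1 * rng.standard_normal((hidden, hidden + word_dim))
         for g in 'cifo'}
    p.update({'b_' + g: np.zeros(hidden) for g in 'cifo'})
    return p

def bilstm(xs, word_dim=100, hidden=128, seed=0):
    # Run forward and reverse LSTMs over the word vectors of one sentence and
    # splice the final hidden states: h_t = [h_st^forward, h_st^backward].
    rng = np.random.default_rng(seed)
    fw = init_params(word_dim, hidden, rng)
    bw = init_params(word_dim, hidden, rng)
    h_f = C_f = h_b = C_b = np.zeros(hidden)
    for x in xs:                       # forward pass
        h_f, C_f = lstm_step(x, h_f, C_f, fw)
    for x in reversed(xs):             # reverse pass
        h_b, C_b = lstm_step(x, h_b, C_b, bw)
    return np.concatenate([h_f, h_b])  # sentence vector of size 2 * hidden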

After the text information is further processed by the Bi-LSTM model, the resulting output information serves as the sentence vector of each sentence. Finally, the inter-sentence similarities computed from the Bi-LSTM-based sentence vectors are used to construct the TextRank graph structure, and each sentence's weight is computed and used to extract the candidate summary sentences.

To analyze the quality of the summaries finally produced by this technique, the invention also presents a specific embodiment in which the Rouge-1, Rouge-2 and Rouge-L evaluation methods are used to evaluate, respectively, the traditional TextRank summarization technique, the TextRank summarization technique fused with Word2vec, and the fusion model W2v-BiL-TR newly proposed by the invention. Rouge-1 and Rouge-2 both belong to Rouge-N, where N refers to the matching rate of characters and words between the generated summary and the reference summary; Rouge-1, for example, is computed from matches of single characters and single words between the two. Rouge-L is computed from the coverage of the longest common subsequence of the generated summary and the reference summary. The specific evaluation process of the three metrics is as follows:

(1) convert the generated summary and the reference summary into sequences of single characters or words and of character or word pairs;

(2) count the occurrences of each single character or word and each character or word pair in the reference summary, and build the longest common subsequence (LCS) from the single characters or words;

(3) count the occurrences of each single character or word and each character or word pair in the generated summary, and build the longest common subsequence (LCS) from the single characters or words;

(4) compute the P, R and F_1 values of Rouge-1, Rouge-2 and Rouge-L in turn. The formulas are:

R = n / v,  P = n / u,  F_1 = 2PR / (P + R)

where n is the number of segmented word sequences that overlap between the generated summary X and the reference summary Y, and u and v are the numbers of word sequences appearing in the generated summary and in the reference summary, respectively. R is the recall, P is the precision, and F_1 is the harmonic mean of P and R (a small computational sketch follows).

The evaluation results under the different metrics are as follows:

Table 1 Rouge-1 values of the summaries extracted by the different techniques

As shown in Table 1, the summaries obtained after further processing the text with the fused W2v-BiL-TR model score higher on every indicator under the Rouge-1 evaluation method. This shows that after fusing the Bi-LSTM model, the semantic features of the text are better represented and the generated summaries are of better quality.

Table 2 Rouge-2 values of the summaries extracted by the different techniques

Likewise, as shown in Table 2, the summaries obtained after further processing the text with the fused W2v-BiL-TR model score higher on every indicator under the Rouge-2 evaluation method, further showing that after fusing the Bi-LSTM model, the semantic features of the text are better represented and the generated summaries are of better quality.

Table 3 Rouge-L values of the summaries extracted by the different techniques

When the longest common subsequence is taken as the evaluation criterion, i.e. under the Rouge-L evaluation method, the new technique's fused W2v-BiL-TR model again scores highest on every indicator. It can therefore be concluded that, with the addition of the Bi-LSTM model, the text word vectors extracted by the Word2vec model are further processed, and the resulting sentence vectors are richer in features and carry more semantic information than sentence vectors obtained by simply fusing word vectors. Consequently, the TextRank algorithm computes sentence weights more effectively, and the final summary results are of better quality.

In summary, the invention uses the Bi-LSTM model with the word vectors converted by the Word2vec model as input information; after further processing, the obtained output information serves as the sentence vector of each sentence, the similarity between sentences is computed and used by TextRank to build the graph structure, and the summaries finally generated are of higher quality.

What is disclosed above is only a preferred embodiment of the invention and of course cannot limit the scope of its rights; those of ordinary skill in the art will understand that all or part of the processes for implementing the above embodiment, and equivalent changes made according to the claims of the invention, still fall within the scope covered by the invention.

Claims (5)

1. A TextRank Chinese abstract automatic generation method based on Bi-LSTM, characterized by comprising the following steps:
step 1: preprocessing a text, then carrying out sentence dividing processing, and dividing the text into a one-dimensional sentence list S, wherein the list length represents the number of sentences in the text, and S [ i ] is defined as the ith sentence in the text;
step 2: preserving punctuation marks of the text, and carrying out word segmentation processing on the sentence list to obtain a two-dimensional word list W, wherein W [ i ] [ j ] is defined as a j-th word of an i-th sentence in the text;
step 3: processing the two-dimensional Word list W in the step 2 by adopting a Word2vec model, extracting text Word vectors to obtain a Word vector two-dimensional table WV of the text, wherein WV [ i ] [ j ] corresponds to the Word vector of the j-th Word of the i-th sentence of the text;
step 4: processing the obtained word vector two-dimensional table WV serving as input information by using a Bi-LSTM model, and generating a sentence vector table SV by taking the obtained output state as a sentence vector, wherein SV [ i ] represents the sentence vector of the ith sentence;
step 5: calculating sentence vectors obtained in the step 4, obtaining similarity among sentences, and forming a two-dimensional matrix X;
step 6: constructing a graph structure according to a two-dimensional matrix X by using a TextRank model, wherein nodes of the graph correspond to sentences of the text, edge weight values of the graph correspond to similarity among the sentences, and finally, calculating TextRank values of all the sentences as sentence weight values and using the TextRank values for scoring the sentences;
step 7: selecting the top-ranked sentences as candidate abstract sentences according to the ranking of the sentence weight values, and taking them as the final extractive abstract result.
2. The automatic generation method of TextRank Chinese abstract based on Bi-LSTM of claim 1, wherein,
in the process of processing by using the Bi-LSTM model, word vectors converted by the Word2vec model are used as input information; after the word vectors are input at different moments, they are processed by the input gates, hidden gates and output gates of the two forward and reverse LSTM structures, specifically: the input gate screens the input information and transmits it to the current cell state; meanwhile, the forget gate screens the cell state at the previous moment and combines the retained information with the current cell state; hidden-state information in the two directions is obtained through the processing of the output gates and spliced to serve as the final hidden-state output result of the Bi-LSTM model.
3. The automatic Bi-LSTM based TextRank chinese summary generation method of claim 2, wherein,
the specific processing of text information by the Bi-LSTM model at time t comprises the following steps:
processing and analyzing the input data information and calculating the candidate cell state C̃_t;
updating and calculating the input gate and the forget gate at the current moment according to the input information and the information at the previous moment;
calculating the cell state C_t at the current moment through the state information of the bidirectional input gate, forget gate and hidden gate at the current moment;
calculating the output gate state from the previously obtained states to obtain the output h_st of the unidirectional LSTM, and finally obtaining the final Bi-LSTM output h_t by splicing the outputs of the forward LSTM and the reverse LSTM.
4. The automatic generation method of TextRank Chinese abstract based on Bi-LSTM of claim 3 wherein,
the candidate cell state C̃_t is calculated as:

C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

where W_c represents the weight, h_{t-1} represents the output vector of the previous time step, and b_c is the bias;

the input gate i_t at the current moment is calculated as:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

where W_i represents the weight of the previous moment's output and the current moment's input for the input gate i_t at the current moment, σ is the activation function, and b_i is the bias;

the forget gate f_t at the current moment is calculated as:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where W_f represents the weight of the previous moment's output and the current moment's input for the forget gate f_t at the current moment, σ is the activation function, and b_f is the bias;

the memory cell state C_t at the current moment is calculated as:

C_t = f_t * C_{t-1} + i_t * C̃_t

the forget gate probabilistically forgets the memory cell state of the previous moment, yielding the information remaining from the previous cell state, which is combined with the candidate cell state information selected by the input gate to obtain the current memory cell state information;

the output gate o_t at the current moment is calculated as:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

where W_o represents the weight of the previous moment's output and the current moment's input for the output gate o_t at the current moment, σ is the activation function, and b_o is the bias;

the output information at the current moment is obtained through the processing of the cell state at the current moment by the output gate:

h_st = o_t * tanh(C_t)

where h_st represents the output of the LSTM in a single direction and tanh is the activation function; the complete Bi-LSTM output information is obtained by splicing the outputs of the LSTMs in the two directions, namely:

h_t = [h_st^forward, h_st^backward]
5. The automatic generation method of TextRank Chinese abstract based on Bi-LSTM of claim 4, wherein the final Bi-LSTM output h_t is used to represent the sentence vector of the sentence S_i in the text.
CN202310463558.2A 2023-04-26 2023-04-26 Automatic TextRank Chinese abstract generation method based on Bi-LSTM Pending CN116775855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310463558.2A CN116775855A (en) 2023-04-26 2023-04-26 Automatic TextRank Chinese abstract generation method based on Bi-LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310463558.2A CN116775855A (en) 2023-04-26 2023-04-26 Automatic TextRank Chinese abstract generation method based on Bi-LSTM

Publications (1)

Publication Number Publication Date
CN116775855A true CN116775855A (en) 2023-09-19

Family

ID=87995184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310463558.2A Pending CN116775855A (en) 2023-04-26 2023-04-26 Automatic TextRank Chinese abstract generation method based on Bi-LSTM

Country Status (1)

Country Link
CN (1) CN116775855A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116956759A (en) * 2023-09-21 2023-10-27 宝德计算机系统股份有限公司 A method, system and device for adjusting BMC fan speed



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination