CN108446408B - Short text summarization method based on PageRank - Google Patents
- Publication number: CN108446408B
- Application number: CN201810329318.2A
- Authority: CN (China)
- Prior art keywords: item, word, item set, frequent, state
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention relates to a short text summarization method based on PageRank, comprising the following steps: generating frequent item sets; modeling the relations between item sets; and calculating the item-set model to produce a summary. Based on the PageRank model, the method processes the short texts describing an event to form keyword sets, estimates the relative importance of those sets through model calculation, and selects the most general set as the keyword summary of the event. In practical application the main content of an event is described clearly, saving labor cost and improving working efficiency.
Description
Technical Field
The invention relates to a short text summarization method based on PageRank, which mainly solves the problem of selecting a representative description when multiple descriptions exist for the same problem. In particular, it relates to a text item ranking method by which a relatively representative description can be selected from among several descriptions of the same kind of problem.
Background
Text is one of the most important information carriers in daily life and production, and text classification is therefore highly valued and widely used in many fields. A given class of texts can be regarded as descriptions of a corresponding type of event; such texts are usually short, relatively general, and information-rich. Analyzing and summarizing them into a single general description has very positive significance for production and daily life, and is thus a problem worth solving.
Existing short text summarization methods include topic modeling and automatic summarization, but both have shortcomings. The common topic model, LDA, is relatively complex, handles short texts poorly, and has low accuracy. Automatic summarization has two main modes: extractive, in which sentences are selected from the text as the summary; and abstractive, in which a summary is generated by understanding the meaning of the text. The extractive mode is currently the more mature of the two, but its results are often poor, and it is usually applied to a single long text rather than to a corpus of many short texts.
In the practical application scenario of the invention, the various appeals of enterprise users need to be analyzed and summarized so that enterprises can address them in a targeted manner and improve service quality. Because the volume of user appeals is large, the existing manual process costs too much time, is error-prone, and is inefficient; follow-up work is hard to advance, and results cannot be fed back to users in time. Meanwhile, human resources are limited and staff can hardly be spared for this work. An effective solution is therefore urgently needed that automates this complex and tedious process with computer technology, reducing errors, improving efficiency, and saving human resources.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a PageRank-based method for ranking short text item sets by representativeness: the keyword sets formed from the processed appeals are ranked, and the most general set is selected as the keyword description of the appeal, so that an analyst can clearly understand the main content of the appeal, saving labor cost and improving working efficiency.
According to one aspect of the invention, a short text classification summarization method based on PageRank is provided, comprising the following steps: generating frequent item sets; modeling the relations between item sets; and calculating the item-set model to produce a summary.
Step 1: frequent item set generation
The method comprises the following steps: perform word segmentation and filtering on the text to be processed, remove stop words and replace synonyms, generating the initial word set of each text; after all texts are processed, count the frequency of each word across the segmentation results and sort all words by frequency; reorder the words within each text's segmentation result in descending order of word frequency; set a support threshold minSupport and delete words whose frequency is below the threshold; finally, generate frequent item sets with the FP-growth algorithm, which is based on the frequent pattern tree (FP-tree) data structure.
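As a concrete sketch of this preprocessing pipeline (the function name, data shapes, and the alphabetical tie-breaking rule are illustrative assumptions, not from the patent), the steps before FP-growth can be written as:

```python
from collections import Counter


def prepare_transactions(texts, stopwords, synonyms, min_support):
    """Preprocess segmented short texts for FP-growth (step 1).

    texts: list of token lists (already word-segmented).
    Returns transactions with stop words removed, synonyms replaced,
    infrequent words dropped, and words sorted by descending global
    frequency, the canonical item order FP-tree insertion expects.
    """
    # Remove stop words and replace synonyms.
    cleaned = [[synonyms.get(w, w) for w in t if w not in stopwords]
               for t in texts]
    # Count each word's frequency across all texts.
    tf = Counter(w for t in cleaned for w in t)
    # Drop words below the minSupport threshold, then sort each
    # transaction by descending frequency (alphabetical tie-break
    # for determinism; the patent does not specify ties).
    return [
        sorted((w for w in t if tf[w] >= min_support),
               key=lambda w: (-tf[w], w))
        for t in cleaned
    ]
```

The returned transactions are in the descending-frequency order that FP-tree construction requires; any FP-growth implementation (for example, `mlxtend.frequent_patterns.fpgrowth` after one-hot encoding) can then mine the frequent item sets from them.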
Step 2: item set relational modeling
In this step, a PageRank relational model is constructed through statistical analysis of the data and simple calculation:
step 2.1: set of weights of initialization items
Count the total number n of frequent item sets generated in step 1 for a class of problems, and count the word frequency tf_w of each word w appearing in the item sets. Combining which words each item set contains, the initial weight of item set S_i, i ∈ [1, n], is

p_i = (Σ_{w ∈ S_i} tf_w) / (Σ_w tf_w),

i.e., the accumulated word frequency of the words an item set contains, as a proportion of the total word frequency. This gives the initial weight vector of the collection, P_0 = (p_1, p_2, …, p_n)^T.
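A minimal sketch of step 2.1, assuming the initial weight is the item set's summed word frequency normalised by the total word frequency (a reconstruction, since the patent's formula image does not survive in the text):

```python
def initial_weights(itemsets, tf):
    """Initial weight p_i of item set S_i: the summed frequency of the
    words it contains, divided by the total word frequency."""
    total = sum(tf.values())
    return [sum(tf[w] for w in s) / total for s in itemsets]
```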
Step 2.2: constructing a state transition probability matrix
Because there are overlapping words between the frequent item sets, the method describes the association between them by constructing a graph: the numerical relation between each pair of frequent item sets is represented by the word frequency of their intersection, which gives the edge weights of the directed graph formed by all item sets in the collection. Each item set can be regarded as a state, and the physical meaning of an edge weight is the probability of transitioning from one state to another, i.e., the transition probability.
For each pair of item sets S_i and S_j, let x_ij be the total word frequency of the words in the intersection S_i ∩ S_j (with x_ij = 0 when i = j); the intersection vector of S_i is X_i = (x_i1, x_i2, …, x_in)^T, and together these form a matrix W (an n × n matrix, since the objects measured are the n frequent item sets of the collection):

w_ij = x_ij / Σ_{k=1}^{n} x_ik,

i.e., the intersecting word frequency of S_i and S_j as a ratio of the sum of the word frequencies of the intersections of S_i with all the other item sets. These ratios are the edge weights between item sets and form the state transition probability matrix.
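A sketch of step 2.2 under the normalisation just described (each row is normalised over that item set's intersections; the function name and the handling of all-zero rows are assumptions):

```python
def transition_matrix(itemsets, tf):
    """State transition probability matrix W (step 2.2): w_ij is the
    word frequency of S_i ∩ S_j, normalised over all of S_i's
    intersections. itemsets is a list of sets of words."""
    n = len(itemsets)
    # x[i][j]: summed frequency of the words shared by S_i and S_j.
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                x[i][j] = sum(tf[w] for w in itemsets[i] & itemsets[j])
    # Normalise each row; rows of isolated item sets stay all-zero.
    w = []
    for row in x:
        s = sum(row)
        w.append([v / s if s else 0.0 for v in row])
    return w
```

The all-zero row produced for an item set with no intersections is the "isolated state" the patent discusses next; the step 2.3 correction gives it somewhere to go.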
Step 2.3: modifying state transition probability matrices
The aim of the invention is to obtain a representative item set weight through model calculation. As the previous step shows, item sets are associated through their intersection words, so the weight of each item set will change with the weights of the other item sets during calculation; the model must therefore be corrected so that a stable value can be calculated.
According to the Markov convergence theorem, a Markov process converges to an equilibrium state, and the equilibrium is unique, when the following conditions hold:
(1) the number of states is finite;
(2) the state transition probabilities are fixed;
(3) any state can be reached from any other, and not by a single transfer path only.
The present invention satisfies these conditions as follows: for (1), the number of states is the number n of item sets; for (2), the state transition probability matrix is determined by the item sets and does not change; for (3), the edges formed by item set intersections are all bidirectional, so mutually reachable states have multiple transfer modes between them; however, a correction is still required for condition (3) to hold in full.
Consider the special case in which the intersection of some item set with every other item set is empty, so that no edge can be constructed; such an item set is said to be in an isolated state, and once the process reaches it, no further transition is possible. To accommodate this, the matrix W is further modified to W1:

W1 = (1 − α)W + (α/n)·e·e^T,

where e is the all-ones column vector (so e·e^T is the all-ones matrix) and α is an empirical value representing the probability of making a state transition out of an isolated state during iteration, which can be tuned to the actual situation; the second term gives every state, including isolated ones, a direct access probability of α/n. Viewed as a graph, the physical meaning of the correction is that the graph becomes connected, satisfying condition (3).
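Step 2.3 in the damping form above (α = 0.15 is an assumed default; the patent leaves α as an empirical value to be tuned):

```python
def correct(w, alpha=0.15):
    """W1 = (1 - alpha) * W + (alpha / n) * E, with E the all-ones
    matrix. Every state, including isolated item sets whose rows of W
    are all zero, becomes directly reachable with probability
    alpha / n. This mirrors the classic PageRank damping."""
    n = len(w)
    return [[(1 - alpha) * w[i][j] + alpha / n for j in range(n)]
            for i in range(n)]
```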
Step 3: Model calculation
Specify the iteration count max_iter and the threshold min_diff, and iterate P_{n+1} = W1 · P_n from the initial value P_0. When the difference between two successive results is within the threshold, i.e., |P_{n+1} − P_n| ≤ min_diff, or when the iteration count reaches the specified number, i.e., n = max_iter, the result is considered converged and the ranking can be output as required.
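Step 3 as a power iteration (the L1 convergence test and the use of the transpose are implementation assumptions; with a column-oriented W1 the multiplication is exactly the patent's P_{n+1} = W1 · P_n):

```python
def rank_itemsets(w1, p0, max_iter=100, min_diff=1e-6):
    """Iterate p <- W1^T p until successive iterates differ by at most
    min_diff in L1 norm or max_iter rounds pass; return the item-set
    indices sorted by final weight (most general first) and the final
    weight vector."""
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        # nxt[i] collects the probability mass flowing into state i.
        nxt = [sum(w1[j][i] * p[j] for j in range(n)) for i in range(n)]
        diff = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if diff <= min_diff:
            break
    return sorted(range(n), key=lambda i: -p[i]), p
```

The first index returned points at the most general item set, which the method takes as the keyword summary of the event.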
The method is based on the PageRank model: it processes the short texts describing an event to form keyword sets, estimates the relative importance of those sets through model calculation, and selects the most general set as the keyword summary of the event. In practical application the main content of an event is described clearly, saving labor cost and improving working efficiency.
The advantages of the invention are as follows: for a set of short texts concerning a class of events, the method automatically generates several candidate keyword summaries, calculates the importance of each through the PageRank model, and finally obtains the most general keyword summary of the event. The resulting summary clearly describes the main content of the event, the method is computationally efficient, and practical feedback shows that the obtained event descriptions help people understand events better, thereby saving labor cost.
Drawings
FIG. 1 is an overall flowchart of an example of a short text classification summarization method based on PageRank according to the present invention.
FIG. 2 is a flowchart of an example of generating a frequent item set by the PageRank-based short text classification summarization method of the present invention.
FIG. 3 is a flowchart of the method for short text classification summarization based on PageRank according to the present invention.
FIG. 4 is a schematic diagram of the physical significance of the short text classification summarization method based on PageRank to construct a state transition probability matrix.
FIG. 5 is a schematic diagram of the physical significance of the short text classification summarization method based on PageRank for correcting the state transition probability matrix.
FIG. 6 is a model calculation flow chart of the short text classification summarization method based on PageRank.
Detailed Description
In order to illustrate the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The following description covers only some embodiments of the invention; a person skilled in the art can obtain other embodiments from them without inventive effort.
One embodiment of the invention is to adopt a short text classification summarization method based on PageRank to perform short text worksheet summarization.
Referring to fig. 1, a schematic diagram of an example of a short text classification summarization method based on PageRank includes the following steps:
S101, generating frequent item sets;
S102, establishing the item set relation model;
S103, calculating the model to obtain the result.
the 101 step specifically comprises: performing word segmentation and filtering on the text to be processed, removing stop words, replacing synonyms, and generating a set of initial words of the text; after all texts are processed, the word frequency of each word in the text word segmentation result is counted, and all words are sequenced according to the word frequency; adjusting the word sequence in the text word segmentation result, and arranging the words in a descending order according to the word frequency; setting a threshold value minSupport, and deleting words with the word frequency smaller than the threshold value in the word segmentation result; and generating a frequent item set by using a frequent pattern growing method (FP-growth) based on a data structure of a frequent pattern tree (FP-tree).
Step S102 specifically comprises:
S201, initializing item set weight values;
S202, constructing a state transition probability matrix;
S203, correcting the state transition probability matrix.
the step 201 specifically comprises:
counting the total number n of frequent item sets of a class of problems generated in the step 1, and counting the word frequency tf of each word in the item setsi,i∈[1,n]Combining the word containing situation in the item set, the initial weight of each item set in the statistical calculation set is as follows:i.e., the term set implies that the accumulation of words and their word frequency products is proportional to the total word frequency. Further obtain the initial weight vector P of the set0={p1,p2,…,pn}T。
Step S202 specifically comprises: according to the intersections between the frequent item sets in the collection, represent the numerical relation between each pair of frequent item sets by their intersection, and construct the relation matrix.
For each pair of item sets S_i and S_j, let x_ij be the total word frequency of the words in the intersection S_i ∩ S_j (with x_ij = 0 when i = j); the intersection vector of S_i is X_i = (x_i1, x_i2, …, x_in)^T, and together these form a matrix W (an n × n matrix, since the objects measured are the n frequent item sets of the collection):

w_ij = x_ij / Σ_{k=1}^{n} x_ik,

i.e., the intersecting word frequency of S_i and S_j as a ratio of the sum of the word frequencies of the intersections of S_i with all the other item sets. These ratios are the edge weights between item sets and form the state transition probability matrix.
Step S203 specifically comprises: the aim is a stable weight value to serve as the index for selecting the summary; since the model meets only part of the Markov convergence conditions, a correction is required for the Markov convergence theorem to apply.
Consider the special case in which the intersection of some item set with every other item set is empty, so that no edge can be constructed; such an item set is said to be in an isolated state, and once the process reaches it, no further transition is possible. To accommodate this, the matrix W is further modified to W1:

W1 = (1 − α)W + (α/n)·e·e^T,

where e is the all-ones column vector and α is an empirical value representing the probability of making a state transition out of an isolated state during iteration, which can be tuned to the actual situation. Viewed as a graph, the physical meaning of the correction is that the graph becomes connected, satisfying condition (3).
Step S103 specifically comprises: specify the iteration count max_iter and the threshold min_diff, and iterate P_{n+1} = W1 · P_n from the initial value P_0. When the difference between two successive results is within the threshold, i.e., |P_{n+1} − P_n| ≤ min_diff, or when the iteration count reaches the specified number, i.e., n = max_iter, the result is considered converged and the ranking can be output as required.
Claims (1)
1. A short text classification summarization method based on PageRank comprises the following steps:
step 1: generating a frequent item set;
the method comprises the following steps: performing word segmentation and filtering on the text to be processed, removing stop words, replacing synonyms, and generating a set of initial words of the text; after all texts are processed, the word frequency of each word in the text word segmentation result is counted, and all words are sequenced according to the word frequency; adjusting the word sequence in the text word segmentation result, and arranging the words in a descending order according to the word frequency; setting a threshold value minSupport, and deleting words with the word frequency smaller than the threshold value in the word segmentation result; based on a data structure of a frequent pattern tree FP-tree, a frequent pattern growing method FP-growth is used for generating a frequent item set;
step 2: modeling item set relations;
the method includes the following steps that a PageRank relational model is constructed through analysis statistics of data and simple calculation:
step 2.1: initializing item set weight;
count the total number n of frequent item sets generated in step 1 for a class of problems, and count the word frequency tf_w of each word w appearing in the item sets; combining which words each item set contains, the initial weight of item set S_i, i ∈ [1, n], is

p_i = (Σ_{w ∈ S_i} tf_w) / (Σ_w tf_w),

that is, the accumulated word frequency of the words the item set contains as a ratio of the total word frequency;
this gives the initial weight vector of the collection, P_0 = (p_1, p_2, …, p_n)^T;
Step 2.2: constructing a state transition probability matrix;
because there are overlapping words between the frequent item sets, the method describes the association between them by constructing a graph; the numerical relation between each pair of frequent item sets is represented by the word frequency of their intersection, which gives the edge weights of the directed graph formed by all item sets in the collection; each item set can be regarded as a state, and the physical meaning of an edge weight is the probability of transitioning from one state to another, namely the transition probability;
for each pair of item sets S_i and S_j, let x_ij be the total word frequency of the words in the intersection S_i ∩ S_j, with x_ij = 0 when i = j; the intersection vector of S_i is X_i = (x_i1, x_i2, …, x_in)^T, and together these form a matrix W, an n × n matrix since the objects measured are the n frequent item sets:

w_ij = x_ij / Σ_{k=1}^{n} x_ik,

namely the intersecting word frequency of S_i and S_j as a ratio of the sum of the word frequencies of the intersections of S_i with all the other item sets; these ratios are the edge weights between item sets and form the state transition probability matrix;
step 2.3: correcting the state transition probability matrix;
because the item sets are associated through their intersection words, the weight of each item set will change with the weights of the other item sets during calculation; the model therefore needs to be corrected so that a stable value can be calculated;
according to the Markov convergence theorem, a Markov process converges to an equilibrium state, and the equilibrium is unique, when the following conditions hold:
(1) the number of states is finite;
(2) the state transition probabilities are fixed;
(3) any state can be reached from any other, and not by a single transfer path only;
the present invention satisfies these conditions as follows: for (1), the number of states is the number n of item sets; for (2), the state transition probability matrix is determined by the item sets and does not change; for (3), the edges formed by item set intersections are all bidirectional, so mutually reachable states have multiple transfer modes between them; a correction is still required for condition (3) to hold in full;
considering the special case in which the intersection of some item set with every other item set is empty, so that no edge can be constructed, such an item set is said to be in an isolated state, and once the process reaches it, no further transition is possible; to accommodate this, the matrix W is further modified to W1:

W1 = (1 − α)W + (α/n)·e·e^T,

where e is the all-ones column vector, so the second term gives every state, including isolated ones, a direct access probability of α/n; α is an empirical value representing the probability of a state transition out of an isolated state during iteration and can be tuned to the actual situation; viewed as a graph, the physical meaning of the correction is that the graph becomes connected, satisfying condition (3);
step 3: item set model calculation and summarization;
specify the iteration count max_iter and the threshold min_diff; iterate P_{n+1} = W1 · P_n from the initial value P_0; when the difference between two successive results is within the threshold, i.e., |P_{n+1} − P_n| ≤ min_diff, or when the iteration count reaches the specified number, i.e., n = max_iter, the result is considered converged and the ranking can be output as required.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201810329318.2A | 2018-04-13 | 2018-04-13 | Short text summarization method based on PageRank
Publications (2)
Publication Number | Publication Date
---|---
CN108446408A | 2018-08-24
CN108446408B | 2021-04-06
Family
ID=63199842
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201810329318.2A (Active) | Short text summarization method based on PageRank | 2018-04-13 | 2018-04-13

Country Status (1)

Country | Link
---|---
CN | CN108446408B (en)
Legal Events

Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| GR01 | Patent grant