CN108446408A - Short text summarization method based on PageRank - Google Patents
Short text summarization method based on PageRank
- Publication number
- CN108446408A (application CN201810329318.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- item
- item collection
- state
- collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention relates to a short text summarization method based on PageRank. The method comprises the following steps: generating frequent itemsets; modeling itemset relations; and computing the itemset model to produce the summary. Based on the PageRank model, the method processes the short texts of an event to form keyword sets, simulates the importance of these sets through model computation, and selects the most representative set as the keyword summary of the event. In practical applications, the main content of an event is described clearly, saving labor cost and improving working efficiency.
Description
Technical field
The present invention relates to a short text summarization method based on PageRank. It mainly addresses the problem of how to select a representative description when multiple descriptions exist for the same class of problem, and more particularly relates to a method for ranking text items. In this way, a relatively representative description can be selected from among the various descriptions of the same class of problem.
Background art
It is well known that text is one of the most important information carriers in everyday life and production. Text classification is therefore highly valued and widely applied in many fields. Under normal conditions, a given class of texts can be regarded as descriptions of a corresponding particular event, and such texts are typically short texts that are relatively general and rich in information. Analyzing and summarizing these texts to form a general description therefore plays a very positive role in production and daily life, and has become an urgent problem to be solved.
According to our investigation, existing short text summarization methods include topic modeling and automatic summarization, but both still have defects. The common topic model, LDA, is relatively complex, handles short texts poorly, and its accuracy is not high. Automatic summarization has two main modes: one is extractive, i.e., selecting certain sentences from the text as the summary; the other is abstractive, producing a summary by understanding the context. The relatively mature approach at present is the extractive one, but its results are often unsatisfactory, and it is usually applied to single long texts rather than to corpora of many short texts.
In the practical application scenario of the present invention, the various demands raised by an enterprise's users need to be analyzed and summarized so that the enterprise can address them in a targeted way and improve service quality. In actual operation, because the volume of user demands is huge, existing processing methods take too much time and are error-prone, leading to low efficiency; follow-up work is hard to advance, and users cannot be given timely feedback on how their demands were handled. At the same time, human resources are limited, and it is difficult to allocate staff to this work. An effective solution is therefore urgently needed that uses computer technology to automate these complicated and cumbersome operating processes, reducing errors, improving efficiency, and saving human resources.
Summary of the invention
The present invention overcomes the above disadvantages of the prior art by providing a PageRank-based method for ranking short texts by representativeness. Keyword sets formed from the processed demands are ranked, and the keywords of the most representative set are chosen as the description of the demand, so that analysts can clearly understand its main content, saving labor cost and improving working efficiency.
According to one aspect of the invention, a PageRank-based short text summarization method is provided, comprising: frequent itemset generation; itemset relationship modeling; and itemset model computation and summarization.
Step 1: Frequent itemset generation
This step includes the following: segment and filter the text to be processed, remove stop words, and replace synonyms to generate the set of initial words for the text. After all texts have been processed, count the frequency of each word across the segmentation results and sort all words by frequency. Reorder the words inside each text's segmentation result so that they are arranged in descending order of frequency. Set a threshold minSupport and delete from the segmentation results any word whose frequency is below the threshold. Finally, based on the frequent pattern tree (FP-tree) data structure, generate frequent itemsets with the frequent pattern growth method (FP-growth).
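As an illustration only, the following Python sketch shows one way step 1 could be implemented with off-the-shelf tools; the tokenizer (jieba), the FP-growth implementation (mlxtend), the stop word list, the synonym map, and the threshold values are all assumptions, not part of the patent.

```python
# Hypothetical sketch of step 1: segmentation, filtering, and FP-growth mining.
# jieba and mlxtend are assumed stand-ins; the patent names no specific library.
from collections import Counter

import jieba                                          # Chinese word segmentation
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

STOP_WORDS = {"的", "了", "是"}                        # placeholder stop word list
SYNONYMS = {}                                          # placeholder synonym map

def preprocess(texts, min_support_count=2):
    # Segment each text, drop stop words, and map synonyms to a canonical word.
    docs = [[SYNONYMS.get(w, w) for w in jieba.lcut(t) if w not in STOP_WORDS]
            for t in texts]
    # Global word frequencies over all segmentation results.
    tf = Counter(w for doc in docs for w in doc)
    # Drop words below the threshold and sort each doc by descending frequency.
    docs = [sorted((w for w in doc if tf[w] >= min_support_count),
                   key=lambda w: -tf[w]) for doc in docs]
    return docs, tf

def mine_frequent_itemsets(docs, min_support=0.3):
    # FP-growth over the preprocessed transactions (one transaction per text).
    te = TransactionEncoder()
    onehot = te.fit(docs).transform(docs)
    df = pd.DataFrame(onehot, columns=te.columns_)
    result = fpgrowth(df, min_support=min_support, use_colnames=True)
    return [set(s) for s in result["itemsets"]]        # itemsets as sets of words
```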
Step 2: Itemset relationship modeling
Simple statistics and computations on the data are needed to build the PageRank relational model, comprising the following steps:
Step 2.1: Initialize itemset weights
Count the total number n of frequent itemsets generated in step 1 for a class of problem, count the word frequency tf_i (i ∈ [1, n]) of each word appearing in the itemsets, and record which words each itemset contains. The initial weight of each itemset in the set is then computed as follows: the accumulated product of the words contained in the itemset and their word frequencies, taken as a proportion of the total word frequency. This yields the initial weight vector of the set, P_0 = {p_1, p_2, …, p_n}^T.
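Since the weight formula itself is not reproduced in the text, the following is only one plausible reading of step 2.1, sketched in Python: the initial weight p_i is the summed frequency of the words contained in itemset S_i as a share of the total word frequency.

```python
import numpy as np

def initial_weights(itemsets, tf):
    """One plausible reading of step 2.1 (the original formula image is not
    reproduced): p_i is the summed frequency of the words in itemset S_i,
    divided by the total word frequency."""
    total = float(sum(tf.values()))
    p0 = np.array([sum(tf[w] for w in s) for s in itemsets]) / total
    return p0  # initial weight vector P_0 = (p_1, ..., p_n)^T
```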
Step 2.2: Build the state transition probability matrix
Overlapping words exist between the frequent itemsets in the set, and the purpose of this method is precisely to describe the associations between frequent itemsets by constructing a graph. The number of intersection words between every pair of frequent itemsets in the set is therefore used to represent the numerical relationship between the corresponding two frequent itemsets. In the directed graph formed by all the itemsets in the set, the edge weights are computed. An itemset can be regarded as a specific state, and the physical meaning of an edge weight is the probability of transitioning from one state to another, i.e., the transition probability.
For each pair of itemsets S_i and S_j there is an intersection term vector X_ij = {x_i1, x_i2, …, x_in}^T, where x_ij denotes the word frequency of the intersection words of itemsets S_i and S_j, with value 0 when i = j. These values form a matrix W (because the objects being weighted are all frequent itemsets, W is an n-dimensional matrix), whose entry w_ij is the ratio of the intersection word frequency of itemsets S_i and S_j to the sum of the intersection word frequencies of itemset S_i with all remaining itemsets, i.e., w_ij = x_ij / Σ_k x_ik (k ≠ i). This ratio represents the edge weight between each pair of itemsets and forms the state transition probability matrix.
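A small Python sketch of step 2.2 under the definition just stated: x_ij is the summed frequency of the words shared by S_i and S_j (zero on the diagonal), and each row of W is normalized by S_i's total intersection frequency with the remaining itemsets; the variable names are illustrative.

```python
def transition_matrix(itemsets, tf):
    """Step 2.2 as described: intersection word frequencies x_ij, row-normalized
    to w_ij = x_ij / sum_k x_ik, so each row sums to 1 (or stays 0 for an
    isolated itemset with no intersections)."""
    n = len(itemsets)
    X = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                X[i, j] = sum(tf[w] for w in itemsets[i] & itemsets[j])
    row_sums = X.sum(axis=1, keepdims=True)
    W = np.divide(X, row_sums, out=np.zeros_like(X), where=row_sums > 0)
    return W
```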
Step 2.3: Correct the state transition probability matrix
An object of the present invention is to compute a representative itemset weight through the model. As can be seen in the process above, because itemsets are associated through their intersection words, the weight of an itemset changes during computation according to the weights of the other itemsets. The model therefore needs a correction so that a stable value can be computed.
According to the Markov convergence theorem, when the following conditions are met:
(1) the number of states is finite; (2) the state transition probabilities are fixed;
(3) any state can transition to any other state; (4) the transition paths between states are not unique;
the Markov process converges to an equilibrium state, and this equilibrium is unique.
The present invention already satisfies the following conditions: (1) the number of states is the number of itemsets n; (2) the state transition probability matrix is determined by the itemsets and does not change; (4) the intersections between itemsets are all bidirectional when they exist, so there can be many transition paths between states. A correction is still needed to satisfy condition (3).
Consider the special case in which the intersection of a certain itemset with all remaining itemsets is empty, so that no edge can be built; such an itemset is called an isolated-state itemset. When this itemset is visited, no state transition can be made. To handle this case, the matrix W is further corrected to W1.
From the perspective of the graph, the physical meaning of this correction is to make the graph connected, satisfying condition (3). Here α is an empirical value representing the probability that an isolated state performs a state transition during the iterative process; it can be adjusted according to the actual situation. E is the identity matrix, so the latter half of the formula represents the probability of directly accessing the isolated state.
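Because the W1 formula is not reproduced in the text, the sketch below is only an assumed correction consistent with the surrounding description: an isolated itemset (an all-zero row of W) transfers to the other itemsets with probability α and keeps probability 1 - α of being accessed directly (the identity-matrix term E); α is a tunable empirical value.

```python
def correct_matrix(W, alpha=0.85):
    """Assumed form of step 2.3 (the exact W1 formula is not shown): rows of
    isolated itemsets get probability alpha of moving uniformly to any itemset
    and probability 1 - alpha of staying put, matching the role the text gives
    to alpha and to the identity matrix E."""
    n = W.shape[0]
    W1 = W.copy()
    isolated = np.where(W.sum(axis=1) == 0)[0]
    for i in isolated:
        W1[i, :] = alpha / n           # transition out of the isolated state
        W1[i, i] += 1.0 - alpha        # probability of accessing the state directly
    return W1
```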
Step 3: Model computation
Given the number of iterations max_iter and the threshold min_diff, the computation proceeds according to P_{n+1} = W1 · P_n, starting from the initial value P_n = P_0. When the difference between two successive iteration results falls below the threshold, i.e., P_{n+1} - P_n < min_diff, or when the number of iterations exceeds the predetermined number, i.e., n > max_iter, the result can be regarded as converged and the ranking can be output as required.
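Step 3 is a plain power iteration; the following minimal Python sketch assumes the W1 and P_0 produced above and uses a sum-of-absolute-differences convergence test.

```python
def rank_itemsets(W1, p0, max_iter=100, min_diff=1e-6):
    """Iterate P_{n+1} = W1 · P_n (as written in step 3) until the change between
    two successive results falls below min_diff or max_iter is reached, then
    return the final itemset weights for ranking."""
    p = p0
    for _ in range(max_iter):
        p_next = W1 @ p
        if np.abs(p_next - p).sum() < min_diff:
            return p_next
        p = p_next
    return p
```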
Based on the PageRank model, the present invention processes the short texts of an event to form keyword sets, simulates the importance of these sets through model computation, and chooses the most representative set as the keyword summary of the event. In practical applications it describes the main content of the event clearly, saving labor cost and improving working efficiency.
The advantage of the present invention is that it can automatically generate multiple candidate event keyword summaries from a given set of short texts related to a class of events and, through the PageRank model, compute the importance of each candidate summary to obtain the most representative event keyword summary. The resulting event keyword summary clearly describes the main content of the event, and the computational efficiency of the invention is high. Feedback from practical applications shows that the event descriptions obtained by the invention help people understand events better, thereby saving labor cost.
Description of the drawings
Fig. 1 is an overall flow chart of an example of the PageRank-based short text summarization method of the present invention.
Fig. 2 is a flow chart of frequent itemset generation in an example of the PageRank-based short text summarization method of the present invention.
Fig. 3 is a flow chart of weight initialization in the PageRank-based short text summarization method of the present invention.
Fig. 4 is a schematic diagram of the physical meaning of building the state transition probability matrix in the PageRank-based short text summarization method of the present invention.
Fig. 5 is a schematic diagram of the physical meaning of correcting the state transition probability matrix in the PageRank-based short text summarization method of the present invention.
Fig. 6 is a flow chart of the model computation in the PageRank-based short text summarization method of the present invention.
Detailed description of the embodiments
In order to explain the embodiments of the present invention, or the technical solutions in the prior art, more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, what follows describes only some embodiments of the present invention; those of ordinary skill in the art can obtain other embodiments from them without creative labor.
One example of the present invention applies the PageRank-based short text summarization method to summarize short text work orders.
Referring to Fig. 1, an example of the PageRank-based short text summarization method comprises the following steps:
S101, frequent itemset generation;
S102, establishment of the itemset relationship model;
S103, computation of the model result.
Step 101 is specifically: segment and filter the text to be processed, remove stop words, and replace synonyms to generate the set of initial words for the text; after all texts have been processed, count the frequency of each word across the segmentation results and sort all words by frequency; reorder the words inside each text's segmentation result so that they are arranged in descending order of frequency; set a threshold minSupport and delete from the segmentation results any word whose frequency is below the threshold; and, based on the frequent pattern tree (FP-tree) data structure, generate frequent itemsets with the frequent pattern growth method (FP-growth).
Step 102 is specifically:
S201, initialize itemset weights;
S202, build the state transition probability matrix;
S203, correct the state transition probability matrix.
Step 201 is specifically: count the total number n of frequent itemsets generated in step 1 for a class of problem, count the word frequency tf_i (i ∈ [1, n]) of each word appearing in the itemsets, and record which words each itemset contains. The initial weight of each itemset in the set is computed as the accumulated product of the words contained in the itemset and their word frequencies, taken as a proportion of the total word frequency, yielding the initial weight vector of the set, P_0 = {p_1, p_2, …, p_n}^T.
Step 202 is specifically: according to the intersections between the frequent itemsets in the set, the intersection situation is used to represent the numerical relationship between the corresponding two frequent itemsets, and the relational matrix is built. For each pair of itemsets S_i and S_j there is an intersection term vector X_ij = {x_i1, x_i2, …, x_in}^T, where x_ij denotes the word frequency of the intersection words of itemsets S_i and S_j, with value 0 when i = j. These values form a matrix W (because the objects being weighted are all frequent itemsets, W is an n-dimensional matrix), whose entry w_ij is the ratio of the intersection word frequency of itemsets S_i and S_j to the sum of the intersection word frequencies of itemset S_i with all remaining itemsets. This ratio represents the edge weight between each pair of itemsets and forms the state transition probability matrix.
Step 203 is specifically: since the object of the invention is to obtain a stable weight as the index for selecting the summary, and this model satisfies only part of the Markov convergence conditions, it must be modified so that the Markov convergence theorem applies and the purpose of the invention is met. Consider the special case in which the intersection of a certain set with all remaining sets is empty, so that no edge can be built; such a set is called an isolated state. When this set is visited, no state transition can be made. To handle this case, the matrix W is further corrected to W1. From the perspective of the graph, the physical meaning of this correction is to make the graph connected, satisfying condition (3); α is an empirical value representing the probability that an isolated state performs a state transition during the iterative process, and it can be adjusted according to the actual situation.
Step 103 is specifically: given the number of iterations max_iter and the threshold min_diff, the computation proceeds according to P_{n+1} = W1 · P_n, starting from the initial value P_n = P_0. When the difference between two successive iteration results falls below the threshold, i.e., P_{n+1} - P_n < min_diff, or when the number of iterations exceeds the predetermined number, i.e., n > max_iter, the result can be regarded as converged and the ranking can be output as required.
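Tying the earlier sketches together, a hypothetical end-to-end run over a list of work-order texts might look as follows; every helper used here (preprocess, mine_frequent_itemsets, initial_weights, transition_matrix, correct_matrix, rank_itemsets) is one of the illustrative functions sketched above, not an API defined by the patent.

```python
def summarize(texts):
    # End-to-end sketch: mine keyword itemsets from the short texts, rank them
    # with the PageRank-style iteration, and return the most representative one.
    docs, tf = preprocess(texts)
    itemsets = mine_frequent_itemsets(docs)
    p0 = initial_weights(itemsets, tf)
    W = transition_matrix(itemsets, tf)
    W1 = correct_matrix(W)
    weights = rank_itemsets(W1, p0)
    best = max(range(len(itemsets)), key=lambda i: weights[i])
    return itemsets[best]   # keyword summary of the event
```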
Claims (1)
1. A PageRank-based short text summarization method, comprising the following steps:
Step 1: frequent itemset generation, including the following: segmenting and filtering the text to be processed, removing stop words, and replacing synonyms to generate the set of initial words for the text; after all texts have been processed, counting the frequency of each word across the segmentation results and sorting all words by frequency; reordering the words inside each text's segmentation result so that they are arranged in descending order of frequency; setting a threshold minSupport and deleting from the segmentation results any word whose frequency is below the threshold; and, based on the frequent pattern tree (FP-tree) data structure, generating frequent itemsets with the frequent pattern growth method (FP-growth);
Step 2: itemset relationship modeling, in which simple statistics and computations on the data are used to build the PageRank relational model, specifically comprising the following steps:
Step 2.1: initializing itemset weights: counting the total number n of frequent itemsets generated in step 1 for a class of problem, counting the word frequency tf_i (i ∈ [1, n]) of each word appearing in the itemsets, recording which words each itemset contains, and computing the initial weight of each itemset in the set as the accumulated product of the words contained in the itemset and their word frequencies, taken as a proportion of the total word frequency, thereby obtaining the initial weight vector of the set, P_0 = {p_1, p_2, …, p_n}^T;
Step 2.2: building the state transition probability matrix: because overlapping words exist between the frequent itemsets in the set, and the purpose of the method is to describe the associations between frequent itemsets by constructing a graph, the number of intersection words between every pair of frequent itemsets is used to represent the numerical relationship between the corresponding two frequent itemsets; in the directed graph formed by all the itemsets in the set, the edge weights are computed; an itemset is regarded as a specific state, and the physical meaning of an edge weight is the probability of transitioning from one state to another, i.e., the transition probability; for each pair of itemsets S_i and S_j there is an intersection term vector X_ij = {x_i1, x_i2, …, x_in}^T, where x_ij denotes the word frequency of the intersection words of itemsets S_i and S_j, with value 0 when i = j, and these values form a matrix W (because the objects being weighted are all frequent itemsets, W is an n-dimensional matrix); the entry w_ij is the ratio of the intersection word frequency of itemsets S_i and S_j to the sum of the intersection word frequencies of itemset S_i with all remaining itemsets, which represents the edge weight between the itemsets and forms the state transition probability matrix;
Step 2.3: correcting the state transition probability matrix: because itemsets are associated through their intersection words, the weight of an itemset changes during computation according to the weights of the other itemsets, so the model must be corrected so that a stable value can be computed; according to the Markov convergence theorem, when the following conditions are met: (1) the number of states is finite; (2) the state transition probabilities are fixed; (3) any state can transition to any other state; (4) the transition paths between states are not unique; the Markov process converges to an equilibrium state, and this equilibrium is unique; the method satisfies condition (1), the number of states being the number of itemsets n, condition (2), the state transition probability matrix being determined by the itemsets and unchanging, and condition (4), the intersections between itemsets being bidirectional when they exist so that there can be many transition paths between states, and a correction is still needed to satisfy condition (3); in the special case where the intersection of a certain itemset with all remaining itemsets is empty, so that no edge can be built, the itemset is called an isolated-state itemset, and when it is visited no state transition can be made; to handle this case, the matrix W is further corrected to W1, whose physical meaning, from the perspective of the graph, is to make the graph connected and satisfy condition (3), where α is an empirical value representing the probability that an isolated state performs a state transition during the iterative process and can be adjusted according to the actual situation, and E is the identity matrix, so that the latter part of the formula represents the probability of directly accessing the isolated state;
Step 3: itemset model computation and summarization: given the number of iterations max_iter and the threshold min_diff, performing the computation according to P_{n+1} = W1 · P_n, starting from the initial value P_n = P_0; when the difference between two successive iteration results falls below the threshold, i.e., P_{n+1} - P_n < min_diff, or when the number of iterations exceeds the predetermined number, i.e., n > max_iter, the result is regarded as converged and the ranking is output as required.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810329318.2A CN108446408B (en) | 2018-04-13 | 2018-04-13 | Short text summarization method based on PageRank |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108446408A true CN108446408A (en) | 2018-08-24 |
CN108446408B CN108446408B (en) | 2021-04-06 |
Family
ID=63199842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810329318.2A Active CN108446408B (en) | 2018-04-13 | 2018-04-13 | Short text summarization method based on PageRank |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108446408B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090169097A1 (en) * | 2007-12-30 | 2009-07-02 | Jianguo Li | Markov stationary color descriptor |
CN101727437A (en) * | 2009-11-26 | 2010-06-09 | 上海大学 | Method for computing importance degree of events in text set |
CN102043851A (en) * | 2010-12-22 | 2011-05-04 | 四川大学 | Multiple-document automatic abstracting method based on frequent itemset |
CN103699611A (en) * | 2013-12-16 | 2014-04-02 | 浙江大学 | Microblog flow information extracting method based on dynamic digest technology |
Non-Patent Citations (2)
Title |
---|
ELENA BARALIS: "GraphSum: Discovering correlations among multiple terms for graph-based summarization", Elsevier *
LIN Liyuan (林莉媛): "Chinese multi-document sentiment summarization based on PageRank" (基于PageRank的中文多文档文本情感摘要), Journal of Chinese Information Processing (中文信息学报) *
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109739953A (en) * | 2018-12-30 | 2019-05-10 | 广西财经学院 | The text searching method extended based on chi-square analysis-Confidence Framework and consequent |
CN109739953B (en) * | 2018-12-30 | 2021-07-20 | 广西财经学院 | Text retrieval method based on chi-square analysis-confidence framework and back-part expansion |
CN110533194A (en) * | 2019-03-25 | 2019-12-03 | 东北大学 | The optimization method of maintenance system construction |
US10657416B1 (en) | 2019-07-17 | 2020-05-19 | Capital One Services, Llc | Method and system for detecting drift in image streams |
US10579894B1 (en) * | 2019-07-17 | 2020-03-03 | Capital One Service, LLC | Method and system for detecting drift in text streams |
US11138458B2 (en) | 2019-07-17 | 2021-10-05 | Capital One Services, Llc | Method and system for detecting drift in text streams |
US11386286B2 (en) | 2019-07-17 | 2022-07-12 | Capital One Services, Llc | Method and system for detecting drift in image streams |
US11694457B2 (en) | 2019-07-17 | 2023-07-04 | Capital One Services, Llc | Method and system for detecting drift in image streams |
CN111984688A (en) * | 2020-08-19 | 2020-11-24 | 中国银行股份有限公司 | Method and device for determining business knowledge association relation |
CN111984688B (en) * | 2020-08-19 | 2023-09-19 | 中国银行股份有限公司 | Method and device for determining business knowledge association relationship |
CN111797945A (en) * | 2020-08-21 | 2020-10-20 | 成都数联铭品科技有限公司 | Text classification method |
CN111797945B (en) * | 2020-08-21 | 2020-12-15 | 成都数联铭品科技有限公司 | Text classification method |
CN112256801A (en) * | 2020-10-10 | 2021-01-22 | 深圳力维智联技术有限公司 | Method, system and storage medium for extracting key entities in entity relationship graph |
CN112256801B (en) * | 2020-10-10 | 2024-04-09 | 深圳力维智联技术有限公司 | Method, system and storage medium for extracting key entity in entity relation diagram |
CN112883080A (en) * | 2021-02-22 | 2021-06-01 | 重庆邮电大学 | UFIM-Matrix algorithm-based improved uncertain frequent item set marketing data mining algorithm |
CN112883080B (en) * | 2021-02-22 | 2022-10-18 | 重庆邮电大学 | UFIM-Matrix algorithm-based improved uncertain frequent item set marketing data mining algorithm |
CN116777525A (en) * | 2023-06-21 | 2023-09-19 | 深圳市创致联创科技有限公司 | Popularization and delivery system based on group optimization algorithm |
CN116777525B (en) * | 2023-06-21 | 2024-06-28 | 深圳市创致联创科技有限公司 | Popularization and delivery system based on group optimization algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN108446408B (en) | 2021-04-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108446408A (en) | Short text summarization method based on PageRank | |
EP3605358A1 (en) | Olap precomputed model, automatic modeling method, and automatic modeling system | |
US10902022B2 (en) | OLAP pre-calculation model, automatic modeling method, and automatic modeling system | |
CN110209807A (en) | A kind of method of event recognition, the method for model training, equipment and storage medium | |
CN109960763B (en) | Photography community personalized friend recommendation method based on user fine-grained photography preference | |
CN104700190B (en) | One kind is for project and the matched method and apparatus of professional | |
EP3131021A1 (en) | Hybrid data storage system and method and program for storing hybrid data | |
CN103488662A (en) | Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit | |
CN105005589A (en) | Text classification method and text classification device | |
CN106021364A (en) | Method and device for establishing picture search correlation prediction model, and picture search method and device | |
CN103714084A (en) | Method and device for recommending information | |
CN110148023A (en) | The electric power integral Method of Commodity Recommendation and system that logic-based returns | |
CN106874292A (en) | Topic processing method and processing device | |
CN103257921A (en) | Improved random forest algorithm based system and method for software fault prediction | |
CN114647465B (en) | Single program splitting method and system for multi-channel attention map neural network clustering | |
CN108665148B (en) | Electronic resource quality evaluation method and device and storage medium | |
CN106600067A (en) | Method and device for optimizing multidimensional cube model | |
CN109857457B (en) | Function level embedding representation method in source code learning in hyperbolic space | |
CN103324765A (en) | Multi-core synchronization data query optimization method based on column storage | |
CN107046557A (en) | The intelligent medical calling inquiry system that dynamic Skyline is inquired about under mobile cloud computing environment | |
CN109255012A (en) | A kind of machine reads the implementation method and device of understanding | |
CN106570173A (en) | High-dimensional sparse text data clustering method based on Spark | |
CN115795131A (en) | Electronic file classification method and device based on artificial intelligence and electronic equipment | |
CN115757735A (en) | Intelligent retrieval method and system for power grid digital construction result resources | |
CN115270921A (en) | Power load prediction method, system and storage medium based on combined prediction model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||