CN108446408B - Short text summarization method based on PageRank - Google Patents

Short text summarization method based on PageRank

Info

Publication number
CN108446408B
CN108446408B CN201810329318.2A
Authority
CN
China
Prior art keywords
item
word
item set
frequent
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810329318.2A
Other languages
Chinese (zh)
Other versions
CN108446408A (en)
Inventor
曹斌
吴佳伟
王思超
范菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810329318.2A priority Critical patent/CN108446408B/en
Publication of CN108446408A publication Critical patent/CN108446408A/en
Application granted granted Critical
Publication of CN108446408B publication Critical patent/CN108446408B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a short text summarization method based on PageRank. The method comprises the following steps: generating frequent item sets; modeling the relations between item sets; and calculating the item set model to extract the summary. Based on the PageRank model, the method processes the short texts of an event to form keyword sets, computes the relative importance of the sets through the model, and selects the most representative set as the keyword summary of the event. In practical application, the main content of events is described clearly, saving labor cost and improving working efficiency.

Description

Short text summarization method based on PageRank
Technical Field
The invention relates to a short text summarization method based on PageRank, which mainly solves the problem of selecting a representative description when the same problem is described in many different ways. In particular, it relates to a method for ranking text item sets by representativeness. With this method, a relatively representative description can be selected from multiple descriptions of the same kind of problem.
Background Art
As is well known, text is one of the most important information carriers in life and production, and text classification has therefore been highly valued and widely applied in many fields. Generally, texts of a given class can be regarded as descriptions of a corresponding kind of event; such texts are usually short yet rich in information. Analyzing these texts and summarizing them into a general description therefore has very positive significance for production and daily life, and has become a problem to be solved.
Research shows that existing short text summarization approaches include topic modeling and automatic summarization, but both still have shortcomings. The common topic model, LDA, is relatively complex, performs poorly on short texts, and has low accuracy. Automatic summarization has two main modes: one is extractive, i.e., selecting some sentences from the text as the summary; the other is abstractive, i.e., generating the summary by understanding the meaning of the text. At present the more mature mode is the extractive one, but its effect is often poor, and it is usually applied to single long texts rather than corpora of many short texts.
In the practical application scenario of the invention, various enterprise appeals need to be analyzed and summarized so that user appeals can be addressed in a targeted manner and service quality improved. In actual work, because the volume of user appeals is large, the existing manual processing method costs too much time, is error-prone, and is inefficient; follow-up work is difficult to advance, and processing results cannot be fed back to users in time. Meanwhile, human resources are limited and staff can hardly be spared for this work. An effective solution is therefore urgently needed to automate this complex and tedious process with computer technology, thereby reducing errors, improving efficiency, and saving human resources.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a PageRank-based method for ranking short text item sets by representativeness. The method ranks the keyword sets formed from the processed appeals and selects the most representative set as the keyword description of the appeals, so that an analyst can clearly understand the main content of the appeals, labor cost is saved, and working efficiency is improved.
According to one aspect of the invention, a short text classification summarization method based on PageRank is provided, which comprises the following steps: generating frequent item sets; modeling the relations between item sets; and calculating the item set model to extract the summary.
Step 1: frequent item set generation
The method comprises the following steps: performing word segmentation and filtering on each text to be processed, removing stop words and replacing synonyms to generate the initial word set of the text; after all texts are processed, counting the word frequency of each word over all segmentation results and sorting all words by word frequency; reordering the words within each text's segmentation result in descending order of word frequency; setting a threshold minSupport and deleting words whose word frequency is below the threshold; and generating the frequent item sets with the frequent pattern growth method (FP-growth) based on the frequent pattern tree (FP-tree) data structure.
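The preprocessing described above can be sketched as follows (a minimal illustration in Python: the stop-word list, the whitespace tokenizer, and the MIN_SUPPORT value are placeholders, and synonym replacement and the final FP-growth mining over the ordered transactions are omitted):

```python
# A minimal sketch of the step-1 preprocessing (illustrative only).
from collections import Counter

STOP_WORDS = {"the", "a", "in", "is"}  # placeholder stop-word list
MIN_SUPPORT = 2                        # threshold minSupport

def preprocess(texts):
    # Segment each text (here: whitespace split) and remove stop words.
    segmented = [[w for w in t.lower().split() if w not in STOP_WORDS]
                 for t in texts]
    # Count the frequency of every word over all texts.
    tf = Counter(w for words in segmented for w in words)
    # Reorder each text's words in descending frequency and drop words
    # whose frequency is below MIN_SUPPORT, as FP-tree building requires.
    ordered = [sorted((w for w in words if tf[w] >= MIN_SUPPORT),
                      key=lambda w: -tf[w])
               for words in segmented]
    return ordered, tf
```

The resulting ordered transactions would then be fed to an FP-growth implementation to mine the frequent item sets.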
Step 2: item set relational modeling
The PageRank relational model is constructed through statistical analysis of the data and simple calculation, as follows:
step 2.1: set of weights of initialization items
Count the total number n of frequent item sets of a class of problems generated in step 1, and count the word frequency tf_w of each word w. Combining the words contained in each item set S_i, i ∈ [1, n], the initial weight of each item set of the class is computed as:

p_i = ( Σ_{w∈S_i} tf_w ) / ( Σ_w tf_w )

that is, the weight of an item set is the ratio of the accumulated word frequencies of the words it contains to the total word frequency.
This further yields the initial weight vector of the class, P_0 = (p_1, p_2, …, p_n)^T.
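In code, the initial weights of step 2.1 can be sketched as below (a hedged reading of the formula above: each set's weight is the sum of its words' frequencies over the total word frequency; the example sets and frequencies are illustrative):

```python
# A sketch of the step-2.1 initial item-set weights: sum of the
# frequencies of the set's words divided by the total word frequency.
def initial_weights(item_sets, tf):
    total = sum(tf.values())
    return [sum(tf[w] for w in s) / total for s in item_sets]
```

For example, with tf = {"network": 3, "outage": 2, "billing": 1}, the sets {"network", "outage"} and {"billing"} receive weights 5/6 and 1/6.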
Step 2.2: constructing a state transition probability matrix
Because overlapping words exist between the frequent item sets of a class, the method describes the association between the frequent item sets by constructing a graph. The numerical relationship between two frequent item sets is therefore represented by the word frequency of the intersection between every two frequent item sets of the class; that is, edge weights are calculated in the directed graph formed by all item sets of the class. Each item set can be regarded as a specific state, and the physical meaning of an edge weight is the probability of transitioning from one state to another, i.e., the transition probability.
For each item set S_i there is an intersection word vector X_i = (x_i1, x_i2, …, x_in)^T, where x_ij represents the word frequency of the intersection words of item set S_i and item set S_j, and x_ij = 0 when i = j. A matrix W is then formed (an n-dimensional matrix, since the measured objects are all the frequent item sets of the class):

W = (w_ij), i, j ∈ [1, n]
wherein

w_ij = x_ij / Σ_{k=1}^{n} x_ik

namely, the edge weight from item set S_i to item set S_j is the ratio of their intersection word frequency to the sum of the intersection word frequencies between S_i and all the other item sets. These edge weights form the state transition probability matrix.
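The construction of step 2.2 can be sketched as follows (illustrative only: x[i][j] is computed as the summed frequency of the words shared by S_i and S_j, and each row of W is normalised by its row sum as in the formula above; rows of isolated item sets are left all-zero here, to be handled by the correction of step 2.3):

```python
# A sketch of step 2.2: build the state transition probability matrix
# from pairwise intersection word frequencies (0 on the diagonal),
# matching w_ij = x_ij / sum_k x_ik.
def transition_matrix(item_sets, tf):
    n = len(item_sets)
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                x[i][j] = sum(tf[w] for w in item_sets[i] & item_sets[j])
    w = []
    for row in x:
        s = sum(row)
        # Isolated item sets (no intersections) keep an all-zero row.
        w.append([v / s for v in row] if s > 0 else row[:])
    return w
```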
Step 2.3: correcting the state transition probability matrix
The invention aims to obtain representative item-set weights through model calculation. As seen above, because item sets are associated through intersection words, the weight of an item set changes with the weights of the other item sets during the calculation. The model therefore needs to be corrected so that a stable value can be computed.
According to the Markov convergence theorem, when the following conditions are satisfied:
① the number of states is finite;
② the state transition probabilities are fixed;
③ a transition between any two states is possible;
④ the manner of transition is not unique;
the Markov process converges to an equilibrium state, and the equilibrium is unique.
The present invention satisfies these conditions as follows. ①: the number of states is the number n of item sets. ②: the state transition probability matrix is determined by the item sets and does not change. ④: the edges formed by item-set intersections are all bidirectional, so multiple transfer modes exist between mutually reachable states. Condition ③, however, still requires a correction to be satisfied.
Consider the special case in which the intersection of some item set with all the other item sets is empty, that is, no edge can be constructed; such an item set is called an item set in an isolated state. When this item set is reached, the state cannot be transferred. To handle this, the matrix W is further corrected to W1:
W1 = (1 − α)·W + (α/n)·e
From the graph perspective, the physical meaning of this correction is that it makes the graph connected, satisfying condition ③.
Here α is an empirical value representing the probability of performing a state transition from an isolated state during iteration, and can be adjusted according to the actual situation; e is the n × n all-ones matrix, so the second term of the formula represents the probability of directly accessing the isolated state.
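A sketch of the correction, read as the standard PageRank-style damping W1 = (1 − α)W + (α/n)E with E the all-ones matrix (the patent's formula image is not reproduced in the text, so this reading is an assumption based on the surrounding description):

```python
# A sketch of the step-2.3 correction: PageRank-style damping that adds
# a uniform jump probability alpha / n to every entry, so every state
# (including an isolated one) is reachable and the graph is connected.
def correct(w, alpha=0.15):
    n = len(w)
    return [[(1 - alpha) * w[i][j] + alpha / n for j in range(n)]
            for i in range(n)]
```

With this correction every state can be reached with probability at least α/n, which is exactly what connects the graph.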
Step 3: model calculation
Specify the number of iterations max_iter and the threshold min_diff. Iterate according to P_{n+1} = W1·P_n with initial value P_0. When the difference between two successive iteration results is no larger than the threshold, i.e., ||P_{n+1} − P_n|| ≤ min_diff, or when the number of iterations equals the specified number, i.e., n = max_iter, the result is considered converged, and the ranking can be output as required.
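The iteration of step 3 can be sketched as a plain power iteration (following the formula P_{n+1} = W1·P_n as written; the max_iter and min_diff defaults are illustrative):

```python
# A sketch of the step-3 power iteration: repeat P_{n+1} = W1 * P_n
# (matrix-vector product) until the largest per-component change is at
# most min_diff, or max_iter iterations have been performed.
def power_iteration(w1, p0, max_iter=100, min_diff=1e-6):
    p = list(p0)
    for _ in range(max_iter):
        nxt = [sum(w1[i][j] * p[j] for j in range(len(p)))
               for i in range(len(p))]
        if max(abs(a - b) for a, b in zip(nxt, p)) <= min_diff:
            return nxt
        p = nxt
    return p
```

The item sets are then ranked by their converged weights and the highest-weighted set is taken as the keyword summary.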
Based on the PageRank model, the method processes the short texts of an event to form keyword sets, computes the relative importance of the sets through the model, and selects the most representative set as the keyword summary of the event. In practical application, the main content of events is described clearly, saving labor cost and improving working efficiency.
The advantages of the invention are as follows: for a set of short texts concerning a class of events, the method automatically generates several candidate event keyword summaries and calculates the importance of each summary through the PageRank model, finally obtaining the most representative event keyword summary. The obtained summary clearly describes the main content of the event, the method is computationally efficient, and practical application feedback shows that the event descriptions it produces help people understand events better, thereby saving labor cost.
Drawings
FIG. 1 is an overall flowchart of an example of a short text classification summarization method based on PageRank according to the present invention.
FIG. 2 is a flowchart of an example of generating a frequent item set by the PageRank-based short text classification summarization method of the present invention.
FIG. 3 is a flowchart of the method for short text classification summarization based on PageRank according to the present invention.
FIG. 4 is a schematic diagram of the physical significance of the short text classification summarization method based on PageRank to construct a state transition probability matrix.
FIG. 5 is a schematic diagram of the physical significance of the short text classification summarization method based on PageRank for correcting the state transition probability matrix.
FIG. 6 is a model calculation flow chart of the short text classification summarization method based on PageRank.
Detailed Description
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the following describes only some embodiments of the invention, and a person skilled in the art can obtain other embodiments from them without inventive effort.
One embodiment of the invention applies the PageRank-based short text classification summarization method to summarizing short text work orders.
Referring to fig. 1, a schematic diagram of an example of a short text classification summarization method based on PageRank includes the following steps:
s101, generating a frequent item set;
s102, establishing a set relation model;
s103, calculating a model to obtain a result;
the 101 step specifically comprises: performing word segmentation and filtering on the text to be processed, removing stop words, replacing synonyms, and generating a set of initial words of the text; after all texts are processed, the word frequency of each word in the text word segmentation result is counted, and all words are sequenced according to the word frequency; adjusting the word sequence in the text word segmentation result, and arranging the words in a descending order according to the word frequency; setting a threshold value minSupport, and deleting words with the word frequency smaller than the threshold value in the word segmentation result; and generating a frequent item set by using a frequent pattern growing method (FP-growth) based on a data structure of a frequent pattern tree (FP-tree).
The 102 step specifically comprises:
s201, initializing item set weight values;
s202, constructing a state transition probability matrix;
s203, correcting the state transition probability matrix;
the step 201 specifically comprises:
counting the total number n of frequent item sets of a class of problems generated in the step 1, and counting the word frequency tf of each word in the item setsi,i∈[1,n]Combining the word containing situation in the item set, the initial weight of each item set in the statistical calculation set is as follows:
Figure GDA0002621302570000061
i.e., the term set implies that the accumulation of words and their word frequency products is proportional to the total word frequency. Further obtain the initial weight vector P of the set0={p1,p2,…,pn}T
The step 202 specifically comprises:
According to the intersections between the frequent item sets of the class, the numerical relationship between every two frequent item sets is represented by their intersection word frequency, and a relationship matrix is constructed.
For each item set S_i there is an intersection word vector X_i = (x_i1, x_i2, …, x_in)^T, where x_ij represents the word frequency of the intersection words of item set S_i and item set S_j, and x_ij = 0 when i = j. A matrix W is then formed (an n-dimensional matrix, since the measured objects are all the frequent item sets of the class):

W = (w_ij), i, j ∈ [1, n]

wherein

w_ij = x_ij / Σ_{k=1}^{n} x_ik

namely, the edge weight from item set S_i to item set S_j is the ratio of their intersection word frequency to the sum of the intersection word frequencies between S_i and all the other item sets. These edge weights form the state transition probability matrix.
The step 203 specifically comprises:
the invention aims to obtain a stable weight value as an index for selecting the abstract, and meanwhile, the model accords with partial conditions of Markov convergence, and the Markov convergence theorem needs to be corrected to adapt to the aim of the invention.
Consider the special case in which the intersection of some set with all the other sets is empty, that is, no edge can be constructed; such a set is called an isolated state. When this set is reached, the state cannot be transferred. To handle this, the matrix W is further corrected to W1:

W1 = (1 − α)·W + (α/n)·e
From the graph perspective, the physical meaning of this correction is that it makes the graph connected, satisfying condition ③.
Here α is an empirical value representing the probability of performing a state transition from an isolated state during iteration, and can be adjusted according to the actual situation.
The step 103 specifically comprises: specifying the number of iterations max_iter and the threshold min_diff, and iterating according to P_{n+1} = W1·P_n with initial value P_0. When the difference between two successive iteration results is no larger than the threshold, i.e., ||P_{n+1} − P_n|| ≤ min_diff, or when the number of iterations equals the specified number, i.e., n = max_iter, the result is considered converged, and the ranking can be output as required.

Claims (1)

1. A short text classification summarization method based on PageRank comprises the following steps:
step 1: generating a frequent item set;
the method comprises the following steps: performing word segmentation and filtering on the text to be processed, removing stop words, replacing synonyms, and generating a set of initial words of the text; after all texts are processed, the word frequency of each word in the text word segmentation result is counted, and all words are sequenced according to the word frequency; adjusting the word sequence in the text word segmentation result, and arranging the words in a descending order according to the word frequency; setting a threshold value minSupport, and deleting words with the word frequency smaller than the threshold value in the word segmentation result; based on a data structure of a frequent pattern tree FP-tree, a frequent pattern growing method FP-growth is used for generating a frequent item set;
step 2: modeling item set relations;
the method includes the following steps that a PageRank relational model is constructed through analysis statistics of data and simple calculation:
step 2.1: initializing item set weight;
counting the total number n of frequent item sets of a class of problems generated in the step 1, and counting the word frequency tf of each word in the item setsi,i∈[1,n]Combining the word containing situation in the item set, the initial weight of each item set in the statistical calculation set is as follows:
Figure FDA0002621302560000011
that is, the term set contains the ratio of the accumulation of the word and the word frequency product thereof in the total word frequency;
further obtain the initial weight vector P of the set0={p1,p2,...,pn}T
Step 2.2: constructing a state transition probability matrix;
because there are overlapping words between every frequent item set in the set, and the purpose of this method lies in describing the association between the frequent item sets through constructing the figure; therefore, the numerical relationship between the two corresponding frequent item sets is represented by calculating the number of terms of the intersection between every two frequent item sets in the set; calculating the edge weight of the directed graph formed by all item sets in the set; the item set can be regarded as a specific state, and the physical meaning of the edge weight is the probability of converting from one state to another state, namely the transition probability;
for eachItem set SiAnd SjAll have an intersection word vector Xij={xi1,xi2,...,xin}TWherein x isijRepresenting a set of items SiAnd item set SjThe word frequency of the intersection words is 0 when i is j, and then a matrix W is formed, because the measurement object is a set of all frequent items, the matrix W is an n-dimensional matrix:
Figure FDA0002621302560000021
wherein
Figure FDA0002621302560000022
Namely item set SiAnd item set SjThe set S of intersecting word frequency pairsiThe ratio of the sum of the word frequency of the intersection of all the other item sets represents the edge weight value among the item sets to form a state transition probability matrix;
step 2.3: correcting the state transition probability matrix;
because the association of the intersection words exists between the item sets, the weight values of the item sets can be changed according to the weight values of other item sets in the calculation process; therefore, a correction model needs to be calculated so that a stable value can be calculated;
according to the Markov convergence theorem, when the following conditions are satisfied:
firstly, the number of finite states is limited; a fixed state transition probability;
the states can be changed in any mode; the state transfer mode is not unique;
the Markov process will converge to an equilibrium state, and the equilibrium is unique;
in the case where the following condition is satisfied,
the method comprises the following steps: the number of states is n of the item set; secondly, the step of: the state transition probability matrix is determined by the item set and does not change; fourthly, the method comprises the following steps: the edges formed by the intersection of the item sets are all bidirectional edges, and various transfer modes exist among the states under the condition of being reachable; still need to revise, in order to meet condition c;
considering the special case in which the intersection of some item set with all the other item sets is empty, that is, no edge can be constructed, such an item set is called an item set in an isolated state; the state cannot be transferred when this item set is reached; to handle this, the matrix W is further corrected to W1:

W1 = (1 − α)·W + (α/n)·e
from the graph perspective, the physical meaning of the correction is that it makes the graph connected, satisfying condition ③;
wherein α is an empirical value representing the probability of performing a state transition from an isolated state during iteration, which can be adjusted according to the actual situation; e is the n × n all-ones matrix, so the second term of the formula represents the probability of directly accessing the isolated state;
and step 3: calculating and abstracting an item set model;
specifying iteration times max _ iter and a threshold min _ diff; according to Pn+1=W1PnInitial value Pn=P0Performing operation; when the difference between the results of two iterations is less than or equal to the threshold value, i.e. Pn+1-PnWhen min _ diff is less than or equal to; or when the iteration number is equal to the specified number, namely n is max _ iter, the operation result is considered to be converged, and the ranking can be output as required.
CN201810329318.2A 2018-04-13 2018-04-13 Short text summarization method based on PageRank Active CN108446408B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810329318.2A CN108446408B (en) 2018-04-13 2018-04-13 Short text summarization method based on PageRank

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810329318.2A CN108446408B (en) 2018-04-13 2018-04-13 Short text summarization method based on PageRank

Publications (2)

Publication Number Publication Date
CN108446408A CN108446408A (en) 2018-08-24
CN108446408B true CN108446408B (en) 2021-04-06

Family

ID=63199842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810329318.2A Active CN108446408B (en) 2018-04-13 2018-04-13 Short text summarization method based on PageRank

Country Status (1)

Country Link
CN (1) CN108446408B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739953B (en) * 2018-12-30 2021-07-20 广西财经学院 Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
CN110533194B (en) * 2019-03-25 2022-11-25 东北大学 Optimization method for maintenance system construction
US10579894B1 (en) * 2019-07-17 2020-03-03 Capital One Service, LLC Method and system for detecting drift in text streams
US10657416B1 (en) 2019-07-17 2020-05-19 Capital One Services, Llc Method and system for detecting drift in image streams
CN111984688B (en) * 2020-08-19 2023-09-19 中国银行股份有限公司 Method and device for determining business knowledge association relationship
CN111797945B (en) * 2020-08-21 2020-12-15 成都数联铭品科技有限公司 Text classification method
CN112256801B (en) * 2020-10-10 2024-04-09 深圳力维智联技术有限公司 Method, system and storage medium for extracting key entity in entity relation diagram
CN112883080B (en) * 2021-02-22 2022-10-18 重庆邮电大学 UFIM-Matrix algorithm-based improved uncertain frequent item set marketing data mining algorithm
CN116777525A (en) * 2023-06-21 2023-09-19 深圳市创致联创科技有限公司 Popularization and delivery system based on group optimization algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8059891B2 (en) * 2007-12-30 2011-11-15 Intel Corporation Markov stationary color descriptor
CN101727437A (en) * 2009-11-26 2010-06-09 上海大学 Method for computing importance degree of events in text set
CN102043851A (en) * 2010-12-22 2011-05-04 四川大学 Multiple-document automatic abstracting method based on frequent itemset
CN103699611B (en) * 2013-12-16 2017-01-11 浙江大学 Microblog flow information extracting method based on dynamic digest technology

Also Published As

Publication number Publication date
CN108446408A (en) 2018-08-24

Similar Documents

Publication Publication Date Title
CN108446408B (en) Short text summarization method based on PageRank
CN110457581B (en) Information recommendation method and device, electronic equipment and storage medium
EP3605358A1 (en) Olap precomputed model, automatic modeling method, and automatic modeling system
US10902022B2 (en) OLAP pre-calculation model, automatic modeling method, and automatic modeling system
US9104733B2 (en) Web search ranking
CN105740424A (en) Spark platform based high efficiency text classification method
CN109948121A (en) Article similarity method for digging, system, equipment and storage medium
CN105005589A (en) Text classification method and text classification device
CN112307762B (en) Search result sorting method and device, storage medium and electronic device
CN104268142B (en) Based on the Meta Search Engine result ordering method for being rejected by strategy
JP2021166109A (en) Fusion sorting model training method and device, search sorting method and device, electronic device, storage medium, and program
JP2008257732A (en) Method for document clustering or categorization
CN112348629A (en) Commodity information pushing method and device
CN108665148B (en) Electronic resource quality evaluation method and device and storage medium
CN110472016B (en) Article recommendation method and device, electronic equipment and storage medium
CN112818230B (en) Content recommendation method, device, electronic equipment and storage medium
CN110083766B (en) Query recommendation method and device based on meta-path guiding embedding
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
CN110750717B (en) Sequencing weight updating method
Li et al. Exploit latent Dirichlet allocation for collaborative filtering
CN102103604B (en) Method and device for determining core weight of term
CN111259117B (en) Short text batch matching method and device
CN109117436A (en) Synonym automatic discovering method and its system based on topic model
Rautray et al. Performance analysis of modified shuffled frog leaping algorithm for multi-document summarization problem
Zuo et al. A tag-aware recommendation algorithm based on deep learning and multi-objective optimization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant