CN108446408B - Short text summarization method based on PageRank - Google Patents
- Publication number: CN108446408B
- Application number: CN201810329318.2A
- Authority: CN (China)
- Prior art keywords: item, word, item set, frequent, state
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention relates to a short text summarization method based on PageRank, comprising the following steps: generating frequent item sets; modeling the relations between item sets; and calculating the item-set model to produce a summary. Based on the PageRank model, the method processes the short texts describing an event to form keyword sets, estimates the relative importance of those sets through model calculation, and selects the most general set as the keyword summary of the event. In practical application the main content of an event is described clearly, saving labor cost and improving working efficiency.
Description
Technical Field
The invention relates to a short text summarization method based on PageRank, which mainly solves the problem of selecting a representative description when multiple descriptions exist for the same problem. In particular, it relates to a text item ranking method by which a relatively representative description can be selected from among several descriptions of the same kind of problem.
Background
Text is one of the most important information carriers in daily life and production, and text classification is therefore highly valued and widely used in many fields. A given class of texts can be regarded as descriptions of a corresponding type of event; such texts are usually short, relatively general, and information-rich. Analyzing and summarizing them into a single general description has very positive significance for production and daily life, and is thus a problem worth solving.
Existing short text summarization methods include topic modeling and automatic summarization, but both have shortcomings. The common topic model, LDA, is relatively complex, handles short texts poorly, and has low accuracy. Automatic summarization has two main modes: extractive, in which sentences are selected from the text as the summary; and abstractive, in which a summary is generated by understanding the meaning of the text. The extractive mode is currently the more mature of the two, but its results are often poor, and it is usually applied to a single long text rather than to a corpus of many short texts.
In the practical application scenario of the invention, the various appeals of enterprise users need to be analyzed and summarized so that enterprises can address them in a targeted manner and improve service quality. Because the volume of user appeals is large, the existing manual process costs too much time, is error-prone, and is inefficient; follow-up work is hard to advance, and results cannot be fed back to users in time. Meanwhile, human resources are limited and staff can hardly be spared for this work. An effective solution is therefore urgently needed that automates this complex and tedious process with computer technology, reducing errors, improving efficiency, and saving human resources.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a PageRank-based method for ranking short text item sets by representativeness: the keyword sets formed from the processed appeals are ranked, and the most general set is selected as the keyword description of the appeal, so that an analyst can clearly understand the main content of the appeal, saving labor cost and improving working efficiency.
According to one aspect of the invention, a short text classification summarization method based on PageRank is provided, comprising the following steps: generating frequent item sets; modeling the relations between item sets; and calculating the item-set model to produce a summary.
Step 1: frequent item set generation
The method comprises the following steps: perform word segmentation and filtering on the text to be processed, remove stop words and replace synonyms, generating the initial word set of each text; after all texts are processed, count the frequency of each word across the segmentation results and sort all words by frequency; reorder the words within each text's segmentation result in descending order of word frequency; set a support threshold minSupport and delete words whose frequency is below the threshold; finally, generate frequent item sets with the FP-growth algorithm, which is based on the frequent pattern tree (FP-tree) data structure.
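As a concrete sketch of this preprocessing pipeline (the function name, data shapes, and the alphabetical tie-breaking rule are illustrative assumptions, not from the patent), the steps before FP-growth can be written as:

```python
from collections import Counter


def prepare_transactions(texts, stopwords, synonyms, min_support):
    """Preprocess segmented short texts for FP-growth (step 1).

    texts: list of token lists (already word-segmented).
    Returns transactions with stop words removed, synonyms replaced,
    infrequent words dropped, and words sorted by descending global
    frequency, the canonical item order FP-tree insertion expects.
    """
    # Remove stop words and replace synonyms.
    cleaned = [[synonyms.get(w, w) for w in t if w not in stopwords]
               for t in texts]
    # Count each word's frequency across all texts.
    tf = Counter(w for t in cleaned for w in t)
    # Drop words below the minSupport threshold, then sort each
    # transaction by descending frequency (alphabetical tie-break
    # for determinism; the patent does not specify ties).
    return [
        sorted((w for w in t if tf[w] >= min_support),
               key=lambda w: (-tf[w], w))
        for t in cleaned
    ]
```

The returned transactions are in the descending-frequency order that FP-tree construction requires; any FP-growth implementation (for example, `mlxtend.frequent_patterns.fpgrowth` after one-hot encoding) can then mine the frequent item sets from them.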
Step 2: item set relational modeling
In this step, a PageRank relational model is constructed through statistical analysis of the data and simple calculation:
step 2.1: set of weights of initialization items
Count the total number n of frequent item sets generated in step 1 for a class of problems, and count the word frequency tf_w of each word w appearing in the item sets. Combining which words each item set contains, the initial weight of item set S_i, i ∈ [1, n], is

p_i = (Σ_{w ∈ S_i} tf_w) / (Σ_w tf_w),

i.e., the accumulated word frequency of the words an item set contains, as a proportion of the total word frequency. This gives the initial weight vector of the collection, P_0 = (p_1, p_2, …, p_n)^T.
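A minimal sketch of step 2.1, assuming the initial weight is the item set's summed word frequency normalised by the total word frequency (a reconstruction, since the patent's formula image does not survive in the text):

```python
def initial_weights(itemsets, tf):
    """Initial weight p_i of item set S_i: the summed frequency of the
    words it contains, divided by the total word frequency."""
    total = sum(tf.values())
    return [sum(tf[w] for w in s) / total for s in itemsets]
```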
Step 2.2: constructing a state transition probability matrix
Because there are overlapping words between the frequent item sets, the method describes the association between them by constructing a graph: the numerical relation between each pair of frequent item sets is represented by the word frequency of their intersection, which gives the edge weights of the directed graph formed by all item sets in the collection. Each item set can be regarded as a state, and the physical meaning of an edge weight is the probability of transitioning from one state to another, i.e., the transition probability.
For each pair of item sets S_i and S_j, let x_ij be the total word frequency of the words in the intersection S_i ∩ S_j (with x_ij = 0 when i = j); the intersection vector of S_i is X_i = (x_i1, x_i2, …, x_in)^T, and together these form a matrix W (an n × n matrix, since the objects measured are the n frequent item sets of the collection):

w_ij = x_ij / Σ_{k=1}^{n} x_ik,

i.e., the intersecting word frequency of S_i and S_j as a ratio of the sum of the word frequencies of the intersections of S_i with all the other item sets. These ratios are the edge weights between item sets and form the state transition probability matrix.
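A sketch of step 2.2 under the normalisation just described (each row is normalised over that item set's intersections; the function name and the handling of all-zero rows are assumptions):

```python
def transition_matrix(itemsets, tf):
    """State transition probability matrix W (step 2.2): w_ij is the
    word frequency of S_i ∩ S_j, normalised over all of S_i's
    intersections. itemsets is a list of sets of words."""
    n = len(itemsets)
    # x[i][j]: summed frequency of the words shared by S_i and S_j.
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                x[i][j] = sum(tf[w] for w in itemsets[i] & itemsets[j])
    # Normalise each row; rows of isolated item sets stay all-zero.
    w = []
    for row in x:
        s = sum(row)
        w.append([v / s if s else 0.0 for v in row])
    return w
```

The all-zero row produced for an item set with no intersections is the "isolated state" the patent discusses next; the step 2.3 correction gives it somewhere to go.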
Step 2.3: modifying state transition probability matrices
The aim of the invention is to obtain a representative item set weight through model calculation. As the previous step shows, item sets are associated through their intersection words, so the weight of each item set will change with the weights of the other item sets during calculation; the model must therefore be corrected so that a stable value can be calculated.
According to the Markov convergence theorem, a Markov process converges to an equilibrium state, and the equilibrium is unique, when the following conditions hold:
(1) the number of states is finite;
(2) the state transition probabilities are fixed;
(3) any state can be reached from any other, and not by a single transfer path only.
The present invention satisfies these conditions as follows: for (1), the number of states is the number n of item sets; for (2), the state transition probability matrix is determined by the item sets and does not change; for (3), the edges formed by item set intersections are all bidirectional, so mutually reachable states have multiple transfer modes between them; however, a correction is still required for condition (3) to hold in full.
Consider the special case in which the intersection of some item set with every other item set is empty, so that no edge can be constructed; such an item set is said to be in an isolated state, and once the process reaches it, no further transition is possible. To accommodate this, the matrix W is further modified to W1:

W1 = (1 − α)W + (α/n)·e·e^T,

where e is the all-ones column vector (so e·e^T is the all-ones matrix) and α is an empirical value representing the probability of making a state transition out of an isolated state during iteration, which can be tuned to the actual situation; the second term gives every state, including isolated ones, a direct access probability of α/n. Viewed as a graph, the physical meaning of the correction is that the graph becomes connected, satisfying condition (3).
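Step 2.3 in the damping form above (α = 0.15 is an assumed default; the patent leaves α as an empirical value to be tuned):

```python
def correct(w, alpha=0.15):
    """W1 = (1 - alpha) * W + (alpha / n) * E, with E the all-ones
    matrix. Every state, including isolated item sets whose rows of W
    are all zero, becomes directly reachable with probability
    alpha / n. This mirrors the classic PageRank damping."""
    n = len(w)
    return [[(1 - alpha) * w[i][j] + alpha / n for j in range(n)]
            for i in range(n)]
```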
Step 3: Model calculation
Specify the iteration count max_iter and the threshold min_diff, and iterate P_{n+1} = W1 · P_n from the initial value P_0. When the difference between two successive results is within the threshold, i.e., |P_{n+1} − P_n| ≤ min_diff, or when the iteration count reaches the specified number, i.e., n = max_iter, the result is considered converged and the ranking can be output as required.
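Step 3 as a power iteration (the L1 convergence test and the use of the transpose are implementation assumptions; with a column-oriented W1 the multiplication is exactly the patent's P_{n+1} = W1 · P_n):

```python
def rank_itemsets(w1, p0, max_iter=100, min_diff=1e-6):
    """Iterate p <- W1^T p until successive iterates differ by at most
    min_diff in L1 norm or max_iter rounds pass; return the item-set
    indices sorted by final weight (most general first) and the final
    weight vector."""
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        # nxt[i] collects the probability mass flowing into state i.
        nxt = [sum(w1[j][i] * p[j] for j in range(n)) for i in range(n)]
        diff = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if diff <= min_diff:
            break
    return sorted(range(n), key=lambda i: -p[i]), p
```

The first index returned points at the most general item set, which the method takes as the keyword summary of the event.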
The method is based on the PageRank model: it processes the short texts describing an event to form keyword sets, estimates the relative importance of those sets through model calculation, and selects the most general set as the keyword summary of the event. In practical application the main content of an event is described clearly, saving labor cost and improving working efficiency.
The advantages of the invention are as follows: for a set of short texts concerning a class of events, the method automatically generates several candidate keyword summaries, calculates the importance of each through the PageRank model, and finally obtains the most general keyword summary of the event. The resulting summary clearly describes the main content of the event, the method is computationally efficient, and practical feedback shows that the obtained event descriptions help people understand events better, thereby saving labor cost.
Drawings
FIG. 1 is an overall flowchart of an example of a short text classification summarization method based on PageRank according to the present invention.
FIG. 2 is a flowchart of an example of generating a frequent item set by the PageRank-based short text classification summarization method of the present invention.
FIG. 3 is a flowchart of the method for short text classification summarization based on PageRank according to the present invention.
FIG. 4 is a schematic diagram of the physical significance of the short text classification summarization method based on PageRank to construct a state transition probability matrix.
FIG. 5 is a schematic diagram of the physical significance of the short text classification summarization method based on PageRank for correcting the state transition probability matrix.
FIG. 6 is a model calculation flow chart of the short text classification summarization method based on PageRank.
Detailed Description
In order to illustrate the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The following description covers only some embodiments of the invention; a person skilled in the art can obtain other embodiments from them without inventive effort.
One embodiment of the invention is to adopt a short text classification summarization method based on PageRank to perform short text worksheet summarization.
Referring to fig. 1, a schematic diagram of an example of a short text classification summarization method based on PageRank includes the following steps:
S101, generating frequent item sets;
S102, establishing the item set relation model;
S103, calculating the model to obtain the result.
the 101 step specifically comprises: performing word segmentation and filtering on the text to be processed, removing stop words, replacing synonyms, and generating a set of initial words of the text; after all texts are processed, the word frequency of each word in the text word segmentation result is counted, and all words are sequenced according to the word frequency; adjusting the word sequence in the text word segmentation result, and arranging the words in a descending order according to the word frequency; setting a threshold value minSupport, and deleting words with the word frequency smaller than the threshold value in the word segmentation result; and generating a frequent item set by using a frequent pattern growing method (FP-growth) based on a data structure of a frequent pattern tree (FP-tree).
Step S102 specifically comprises:
S201, initializing item set weight values;
S202, constructing a state transition probability matrix;
S203, correcting the state transition probability matrix.
the step 201 specifically comprises:
counting the total number n of frequent item sets of a class of problems generated in the step 1, and counting the word frequency tf of each word in the item setsi,i∈[1,n]Combining the word containing situation in the item set, the initial weight of each item set in the statistical calculation set is as follows:i.e., the term set implies that the accumulation of words and their word frequency products is proportional to the total word frequency. Further obtain the initial weight vector P of the set0={p1,p2,…,pn}T。
Step S202 specifically comprises: according to the intersections between the frequent item sets in the collection, represent the numerical relation between each pair of frequent item sets by their intersection, and construct the relation matrix.
For each pair of item sets S_i and S_j, let x_ij be the total word frequency of the words in the intersection S_i ∩ S_j (with x_ij = 0 when i = j); the intersection vector of S_i is X_i = (x_i1, x_i2, …, x_in)^T, and together these form a matrix W (an n × n matrix, since the objects measured are the n frequent item sets of the collection):

w_ij = x_ij / Σ_{k=1}^{n} x_ik,

i.e., the intersecting word frequency of S_i and S_j as a ratio of the sum of the word frequencies of the intersections of S_i with all the other item sets. These ratios are the edge weights between item sets and form the state transition probability matrix.
Step S203 specifically comprises: the aim is a stable weight value to serve as the index for selecting the summary; since the model meets only part of the Markov convergence conditions, a correction is required for the Markov convergence theorem to apply.
Consider the special case in which the intersection of some item set with every other item set is empty, so that no edge can be constructed; such an item set is said to be in an isolated state, and once the process reaches it, no further transition is possible. To accommodate this, the matrix W is further modified to W1:

W1 = (1 − α)W + (α/n)·e·e^T,

where e is the all-ones column vector and α is an empirical value representing the probability of making a state transition out of an isolated state during iteration, which can be tuned to the actual situation. Viewed as a graph, the physical meaning of the correction is that the graph becomes connected, satisfying condition (3).
Step S103 specifically comprises: specify the iteration count max_iter and the threshold min_diff, and iterate P_{n+1} = W1 · P_n from the initial value P_0. When the difference between two successive results is within the threshold, i.e., |P_{n+1} − P_n| ≤ min_diff, or when the iteration count reaches the specified number, i.e., n = max_iter, the result is considered converged and the ranking can be output as required.
Claims (1)
1. A short text classification summarization method based on PageRank comprises the following steps:
step 1: generating a frequent item set;
the method comprises the following steps: performing word segmentation and filtering on the text to be processed, removing stop words, replacing synonyms, and generating a set of initial words of the text; after all texts are processed, the word frequency of each word in the text word segmentation result is counted, and all words are sequenced according to the word frequency; adjusting the word sequence in the text word segmentation result, and arranging the words in a descending order according to the word frequency; setting a threshold value minSupport, and deleting words with the word frequency smaller than the threshold value in the word segmentation result; based on a data structure of a frequent pattern tree FP-tree, a frequent pattern growing method FP-growth is used for generating a frequent item set;
step 2: modeling item set relations;
the method includes the following steps that a PageRank relational model is constructed through analysis statistics of data and simple calculation:
step 2.1: initializing item set weight;
count the total number n of frequent item sets generated in step 1 for a class of problems, and count the word frequency tf_w of each word w appearing in the item sets; combining which words each item set contains, the initial weight of item set S_i, i ∈ [1, n], is

p_i = (Σ_{w ∈ S_i} tf_w) / (Σ_w tf_w),

that is, the accumulated word frequency of the words the item set contains as a ratio of the total word frequency;
this gives the initial weight vector of the collection, P_0 = (p_1, p_2, …, p_n)^T;
Step 2.2: constructing a state transition probability matrix;
because there are overlapping words between the frequent item sets, the method describes the association between them by constructing a graph; the numerical relation between each pair of frequent item sets is represented by the word frequency of their intersection, which gives the edge weights of the directed graph formed by all item sets in the collection; each item set can be regarded as a state, and the physical meaning of an edge weight is the probability of transitioning from one state to another, namely the transition probability;
for each pair of item sets S_i and S_j, let x_ij be the total word frequency of the words in the intersection S_i ∩ S_j, with x_ij = 0 when i = j; the intersection vector of S_i is X_i = (x_i1, x_i2, …, x_in)^T, and together these form a matrix W, an n × n matrix since the objects measured are the n frequent item sets:

w_ij = x_ij / Σ_{k=1}^{n} x_ik,

namely the intersecting word frequency of S_i and S_j as a ratio of the sum of the word frequencies of the intersections of S_i with all the other item sets; these ratios are the edge weights between item sets and form the state transition probability matrix;
step 2.3: correcting the state transition probability matrix;
because the item sets are associated through their intersection words, the weight of each item set will change with the weights of the other item sets during calculation; the model therefore needs to be corrected so that a stable value can be calculated;
according to the Markov convergence theorem, a Markov process converges to an equilibrium state, and the equilibrium is unique, when the following conditions hold:
(1) the number of states is finite;
(2) the state transition probabilities are fixed;
(3) any state can be reached from any other, and not by a single transfer path only;
the present invention satisfies these conditions as follows: for (1), the number of states is the number n of item sets; for (2), the state transition probability matrix is determined by the item sets and does not change; for (3), the edges formed by item set intersections are all bidirectional, so mutually reachable states have multiple transfer modes between them; a correction is still required for condition (3) to hold in full;
considering the special case in which the intersection of some item set with every other item set is empty, so that no edge can be constructed, such an item set is said to be in an isolated state, and once the process reaches it, no further transition is possible; to accommodate this, the matrix W is further modified to W1:

W1 = (1 − α)W + (α/n)·e·e^T,

where e is the all-ones column vector, so the second term gives every state, including isolated ones, a direct access probability of α/n; α is an empirical value representing the probability of a state transition out of an isolated state during iteration and can be tuned to the actual situation; viewed as a graph, the physical meaning of the correction is that the graph becomes connected, satisfying condition (3);
step 3: item set model calculation and summarization;
specify the iteration count max_iter and the threshold min_diff; iterate P_{n+1} = W1 · P_n from the initial value P_0; when the difference between two successive results is within the threshold, i.e., |P_{n+1} − P_n| ≤ min_diff, or when the iteration count reaches the specified number, i.e., n = max_iter, the result is considered converged and the ranking can be output as required.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201810329318.2A | 2018-04-13 | 2018-04-13 | Short text summarization method based on PageRank
Publications (2)
Publication Number | Publication Date
---|---
CN108446408A | 2018-08-24
CN108446408B | 2021-04-06
Family
ID=63199842
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201810329318.2A (Active) | Short text summarization method based on PageRank | 2018-04-13 | 2018-04-13

Country Status (1)

Country | Link
---|---
CN | CN108446408B (en)
Legal Events

Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| GR01 | Patent grant