CN103927398B

CN103927398B - Method for discovering microblog hype groups based on maximal frequent itemset mining

Info

Publication number: CN103927398B
Application number: CN201410188004.7A
Authority: CN
Inventors: 刘琰; 张进; 罗军勇; 罗向阳; 董雨辰; 陈静; 常斌
Original assignee: PLA Information Engineering University
Current assignee: PLA Information Engineering University
Priority date: 2014-05-07
Filing date: 2014-05-07
Publication date: 2016-12-28
Anticipated expiration: 2034-05-07
Also published as: CN103927398A

Abstract

The present invention relates to a method for discovering microblog hype groups based on maximal frequent itemset mining. It addresses the problem of discovering the groups that hype microblogs, so that false and malicious hype can be prevented. The method is as follows. Using the relatedness of hyped microblogs as a clue, the set of accounts that took part in spreading the hyped microblogs is obtained through crawler technology or the microblog's public open platform. Each microblog is treated as a transaction and each account that took part in spreading it as an item, and a hyped-microblog transaction database is built. For the transactions in the transaction database corresponding to the group of microblogs under examination, the maximal frequent itemsets contained in the transactions are found, the overlap rate between maximal frequent itemsets is calculated, and smaller itemsets are merged into larger ones. The number of intersection operations is reduced, and when transactions are intersected, binary search is used to decide whether a transaction contains a given item, which improves the efficiency of maximal frequent itemset mining. Microblog hype groups are thereby discovered. The method is simple, can accurately discover malicious microblog hype groups, and helps prevent the harm they cause to society.

Description

Method for discovering microblog hype groups based on maximal frequent itemset mining

Technical field

The present invention relates to the field of microblog public opinion monitoring, and in particular to a method for discovering microblog hype groups based on maximal frequent itemset mining.

Background technology

The microblog, as an emerging form of social media, combines the functions of blogs, media and instant messaging. Its immediacy, grassroots character, mobility and interactivity make it a natural carrier for the spread of online public opinion. In online public opinion, the microblog has become not only a center and channel of opinion dissemination but also a participant in the formation, development and steering of public opinion.

Microblog dissemination is a double-edged sword. On the one hand, the microblog provides a platform for rapid response when information about social events is made public, and to some extent it makes up for the shortcomings of traditional media and other network tools. On the other hand, unlike traditional news media, the news published on microblogs is repetitive and its authenticity cannot be guaranteed; it may be exploited as a carrier for rumors and a fuse for the spread of discontent, with serious consequences, even for national security and social stability. False information on the network originates with its creators and spreads through its disseminators.

A social computing research team at Hewlett-Packard stated in a recent report that Sina Weibo has a serious topic-hyping problem, with half of the microblogs reposted in trending topics being sent by hype users. The study found that the number of fake reposts in artificially manipulated trending topics is high, and that the 1% of senders who post spam generated 49% of the reposts. Since August 2013, government departments have stepped up their efforts to guide online public opinion. Judging from the investigations into the online promotion companies of "Qin Huohuo", "Li'er Chaisi" and others, there are many organized promotion teams on the network. In collusion with a small number of "opinion leaders", they organize an online "water army" of paid posters, fabricate fake news over long periods, deliberately distort facts, stir up trouble online, blur right and wrong, and seriously disrupt the order of online public opinion. This behavior has drawn close attention from the national authorities responsible for managing public opinion, and the persons involved have been placed in criminal detention in accordance with the law on suspicion of committing crimes.

Therefore, in the new media environment and in the face of various concealed attempts to manipulate public opinion, identifying hyped microblogs, analyzing the characteristics of the groups that spread them, collecting evidence of fake promotion behavior, and screening out artificially manufactured hot topics have important theoretical value and practical significance for discovering, predicting and guiding online public opinion, improving the government's ability to supervise public opinion, and maintaining social harmony and stability.

With the explosive growth of microblogs, research on microblog accounts has attracted wide interest from scholars in China and abroad, and some results have been published in recent years at major conferences such as WWW and KDD. Current research on microblog accounts falls roughly into three categories: 1) feature analysis, including account attribute features and behavioral features; 2) influence analysis, including the construction of influence evaluation systems and measurement methods; 3) analysis of the relationship network between accounts, including the basic properties, generation and evolution of account relationship networks.

However, there is still relatively little literature, in China or abroad, on discovering hype groups. The main related work concerns the identification of spam accounts (spammers), sockpuppet accounts and zombie accounts. A spam account is an account that frequently posts junk information; Z. Yi et al. analyzed the features of spam accounts from several angles and used machine learning to identify them automatically. Chao Yang et al. analyzed the social relationships between spam accounts in depth and proposed a method for finding spam accounts based on the closeness between accounts. A sockpuppet account is one of several fake accounts registered in order to post, repost and comment; Xueling Zheng et al. proposed a method that identifies sockpuppet accounts using text content and similarity matching. A zombie account is an account registered maliciously in order to buy and sell followers; Fang Ming et al. proposed an intelligent classification method based on feature extraction from microblog login account names, which achieves high accuracy. However, these methods do not solve the problem of discovering microblog hype groups and preventing false hype. The biggest difference between hype accounts and the kinds of accounts above is that hype accounts are characterized by their hyping behavior: the accounts that take part in hyping are more dispersed, the direct relationships between them are not obvious, and they are more concealed and better organized, so they are harder to discover.

Group hyping resembles ordinary microblog dissemination: on the surface, the posting, reposting and commenting of the participants appear to be isolated acts. Malicious dissemination, however, is usually not the behavior of a single person but an organized group behavior, and this group behavior is concealed and hard to detect. How to discover microblog hype groups, and so prevent false and malicious hype from causing harm to society and unnecessary economic loss, is therefore a technical problem that needs to be solved.

Summary of the invention

In view of the above, and to overcome the defects of the prior art, the object of the present invention is to provide a method for discovering microblog hype groups based on maximal frequent itemset mining, which can effectively solve the problem of discovering microblog hype groups and preventing false and malicious hype.

The technical solution of the present invention is a method for discovering microblog hype groups based on maximal frequent itemset mining, comprising the following steps:

(1) Hyped-microblog sample collection: using the relatedness of hyped microblogs as a clue, obtain the set of accounts that took part in spreading the hyped microblogs, by means of crawler technology or the microblog's public open platform;

(2) Transaction database construction: treating each microblog as a transaction and each account that took part in spreading it as an item, build the hyped-microblog transaction database;

(3) Maximal frequent itemset mining: for the transactions in the transaction database corresponding to the group of microblogs under examination, use the iterative intersection method to find the maximal frequent itemsets contained in the transactions, obtaining a set of maximal frequent itemsets;

Because each transaction in the hyped-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. Binary search is therefore used to quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets, and reduce the size of the transaction database;

(4) Maximal frequent itemset merging: for each maximal frequent itemset, calculate the overlap rate between itemsets and merge the maximal frequent itemsets, merging smaller itemsets into larger ones as far as possible while ensuring that the accounts in a merged itemset remain related to one another. By reducing the size of the transaction database, the number of intersection operations is reduced; when transactions are intersected, binary search is used to decide whether a transaction contains a given item, which improves the efficiency of maximal frequent itemset mining, and microblog hype groups are thereby discovered.

The method of the invention is simple and easy to operate. It can accurately discover malicious microblog hype groups and helps prevent harm to society and unnecessary economic loss, and it has practical application value.

Brief description of the drawings

Fig. 1 is a flow block diagram of the present invention.

Fig. 2 is a schematic diagram of the hyped-microblog transaction database of the present invention.

Fig. 3 is a screenshot of the hyped-microblog transaction database of the present invention.

Fig. 4 is an execution time comparison chart for the algorithm of the present invention on the Mushroom data set.

Fig. 5 is an execution time comparison chart for the algorithm of the present invention on the hyped-microblog data set.

Fig. 6 is a chart of the variation in the number of itemsets in the MFS of the present invention.

Fig. 7 is a chart of the variation in the maximum length of the itemsets in the MFS of the present invention.

Detailed description of the invention

The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention comprises three parts: construction of the hyped-microblog transaction database, maximal frequent itemset mining, and maximal frequent itemset merging. The transaction database construction module is mainly responsible for collecting and preprocessing data and building the transaction database D. The maximal frequent itemset mining module first screens candidate maximal frequent itemsets using binary search, and then mines the maximal frequent itemsets MFS from the transaction database D using the iterative intersection method. The maximal frequent itemset merging module merges the itemsets in MFS so as to recover the real hype groups as far as possible. The specific steps are:

1) Collecting hyped-microblog samples

Collecting hyped-microblog samples is the initial step of the present invention. The selected microblog samples should be related to one another, for example several microblogs that a certain hype account has taken part in, or several microblogs related to a certain topic. Whether a microblog counts as a hyped sample should be judged with reference to existing mature discrimination methods or expert systems. There are two ways to collect hyped-microblog samples: one is to use crawler technology to download web pages from the microblog site, parse the page structure and extract the information of the accounts that spread the microblog; the other is to call the microblog's public open platform, using the API functions that the microblog operator provides externally to obtain the information of the accounts that spread the microblog. To facilitate the discovery of hype groups, the following principles should also be followed when choosing hyped-microblog samples:

A. choose popular microblogs with a relatively high number of reposts;

B. the time span over which the microblogs were published is < 180 days;

Given what the algorithm for mining hype accounts needs as input, the collected sample content should include the microblog identifier, the microblog account identifier, and the basic information of the microblog account;
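The two selection principles can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Post` record, the `min_reposts` cutoff and the rule of anchoring the 180-day window on the newest post are all hypothetical, since the text gives no numeric threshold for "relatively high" and does not say how the time span is enforced.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Post:
    post_id: str        # microblog identifier
    repost_count: int   # number of reposts
    posted_on: date     # publication date


def select_samples(posts, min_reposts=1000, max_span_days=180):
    """Keep popular posts whose publication dates span fewer than max_span_days.

    min_reposts is a hypothetical cutoff; anchoring the window on the newest
    popular post is likewise an assumption of this sketch.
    """
    popular = sorted(
        (p for p in posts if p.repost_count >= min_reposts),
        key=lambda p: p.posted_on,
        reverse=True,
    )
    if not popular:
        return []
    newest = popular[0].posted_on
    window = timedelta(days=max_span_days)
    # Strict inequality mirrors the "< 180 days" condition.
    return [p for p in popular if newest - p.posted_on < window]
```

In practice the cutoff would be tuned to the platform and topic; the sketch only shows where principles A and B enter the collection step.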

2) Building the transaction database

The problem of discovering hype groups is converted into the problem of maximal frequent itemset mining in data mining. On the basis of the collected hyped-microblog samples, each hyped microblog corresponds to a transaction and each account that reposted the microblog corresponds to an item in that transaction; the transaction database is built as shown in Fig. 2;
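A minimal sketch of this mapping is given below, assuming the collected data arrives as (microblog ID, account ID) pairs. The function name and input layout are hypothetical. Storing each transaction as a sorted list of distinct account IDs is a choice of this sketch that makes a later binary-search membership test possible.

```python
from collections import defaultdict


def build_transaction_db(repost_records):
    """Group (microblog_id, account_id) pairs into one transaction per microblog.

    Each transaction is a sorted list of distinct account IDs; an account
    that reposted the same microblog several times is counted once.
    """
    accounts_by_post = defaultdict(set)
    for microblog_id, account_id in repost_records:
        accounts_by_post[microblog_id].add(account_id)
    return {mid: sorted(accounts) for mid, accounts in accounts_by_post.items()}
```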

3) Screening candidate maximal frequent itemsets based on binary search

Because each transaction in the hyped-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. A method based on binary search can quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets, and reduce the size of the transaction database. Given the transaction database D and the minimum support count S, candidate maximal frequent itemsets are screened as follows:

(1) Sort the transactions in the transaction database D in descending order of the number of items they contain;

(2) Let the set of frequent items be FI = ∅ and the set of infrequent items be NFI = ∅. Starting from i = 1, traverse each transaction T_i (1 ≤ i ≤ |D|) in D in order, and for each item u in T_i:

a) if u ∈ FI, keep u;

b) if u ∈ NFI, remove u from T_i;

c) if u ∉ FI and u ∉ NFI, go to the next step to determine whether u is a frequent item;

(3) Starting from j = i + 1, traverse the remaining transactions and use binary search to determine whether each T_j (i < j ≤ |D|) contains u. The termination conditions are:

A) when the number of transactions containing u reaches S, u is a frequent item, and u is added to FI;

B) when the number of remaining transactions plus the number of transactions containing u is less than S, u is an infrequent item, and u is removed from T_i. If the number of transactions containing u is greater than 1 at this point, u also appears in transactions other than T_i, so u is added to NFI;

(4) After the infrequent items have been removed from all transactions in D, the reduced transaction database D₁ is obtained;
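The screening steps above can be sketched in Python as follows. This is a reading of the procedure and not the patented implementation (which the text says was written in Java): the function names are hypothetical, `frequent` and `infrequent` play the roles of FI and NFI, and dropping transactions that end up empty is an assumption of this sketch.

```python
from bisect import bisect_left


def contains(sorted_items, item):
    """Binary-search membership test on an ascending list."""
    pos = bisect_left(sorted_items, item)
    return pos < len(sorted_items) and sorted_items[pos] == item


def prune_infrequent(transactions, min_support):
    """Remove items that occur in fewer than min_support transactions."""
    # Step (1): largest transactions first; items sorted for binary search.
    db = sorted((sorted(set(t)) for t in transactions), key=len, reverse=True)
    frequent, infrequent = set(), set()  # FI and NFI
    reduced = []
    for i, t_i in enumerate(db):
        kept = []
        for u in t_i:
            if u in frequent:          # case a): keep u
                kept.append(u)
                continue
            if u in infrequent:        # case b): drop u
                continue
            # Case c) and step (3): count u in T_i and the transactions after it.
            count = 1
            for j in range(i + 1, len(db)):
                unchecked = len(db) - j
                if count >= min_support or count + unchecked < min_support:
                    break              # outcome already decided
                if contains(db[j], u):
                    count += 1
            if count >= min_support:
                frequent.add(u)
                kept.append(u)
            elif count > 1:
                infrequent.add(u)      # u also occurs outside T_i
        if kept:                       # empty transactions are dropped (sketch choice)
            reduced.append(kept)
    return reduced


db = [[1, 2, 3, 4], [1, 2, 3], [1, 2, 5], [2, 6]]
print(prune_infrequent(db, min_support=2))  # [[1, 2, 3], [1, 2, 3], [1, 2], [2]]
```

In the toy database, items 4, 5 and 6 each occur in a single transaction and are removed, while 1, 2 and 3 each reach the support count of 2 and are kept.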

4) Mining maximal frequent itemsets based on iterative intersection:

Maximal frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database D₁ and the minimum support count S, maximal frequent itemsets are mined as follows:

(1) Sort the transactions in D₁ in descending order of the number of items, so that maximal frequent itemsets are found as early as possible. To reduce the size of the transaction database, merge the duplicate transactions in it and keep a count for each transaction;

(2) To reduce the number of intersection operations, for each transaction T_i (1 ≤ i ≤ |D₁|-S+1), starting from i = 1, first find the set of transactions that contain any item of T_i, {T_j | T_j contains at least one item of T_i, j > i}. Intersect T_i with each such T_j in turn, move each intersection into a new transaction database D₂, and remove T_j at the same time;

(3) For each transaction T in the new transaction database D₂, if T was obtained by intersecting no fewer than S transactions, move T into the maximal frequent candidate set MFCS and remove the sub-transactions of T from D₂;

(4) If the number of transactions remaining in the new transaction database D₂ is less than S, stop processing D₂ and return to the upper-level transaction database; otherwise, repeat this procedure on D₂ starting from step (1);

(5) When the number of transactions remaining in D₁ is less than S, i.e. i > |D₁|-S+1, stop processing the current transaction database D₁;

(6) Merge the itemsets in MFCS and remove those that are not maximal frequent itemsets; the final result is the required set of maximal frequent itemsets, MFS;
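The iterative procedure above is an optimized search and is not reproduced here. For checking results on very small inputs, the underlying idea can be sketched by brute force: every frequent itemset is contained in the intersection of some S transactions, so the maximal frequent itemsets are the largest (by inclusion) non-empty intersections of S transactions. The function below is a hypothetical reference implementation of that observation only. It enumerates every combination of S transactions and is therefore unsuitable for databases of realistic size.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force reference: intersect every group of min_support transactions
    and keep the intersections not strictly contained in another one."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    sets = [frozenset(t) for t in transactions]
    candidates = set()
    for group in combinations(sets, min_support):
        common = frozenset.intersection(*group)
        if common:
            candidates.add(common)
    # An itemset is maximal if no other candidate is a strict superset of it.
    return [c for c in candidates if not any(c < other for other in candidates)]
```

For example, with the transactions `[1, 2, 3]`, `[1, 2, 4]`, `[3, 4, 5]`, `[3, 4, 6]` and a support count of 2, the result contains exactly the itemsets {1, 2} and {3, 4}.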

5) Merging maximal frequent itemsets:

Because of the minimum support count constraint, the maximal frequent itemsets in MFS are relatively small, and some itemsets share a large number of items; the account groups that these itemsets represent probably belong to the same hype group. To address this, the overlap rate is used to reflect the similarity between two itemsets. For itemsets X₁, X₂ ∈ MFS, the overlap rate of X₁ and X₂ is defined as:

ORate (X_{1}, X_{2}) = \frac{| X_{1} \cap X_{2} |}{Min (| X_{1} |, | X_{2} |)}

In the above formula, |X₁∩X₂| is the number of items shared by X₁ and X₂, and Min(|X₁|,|X₂|) is the number of items in the smaller of the two itemsets. Itemsets are merged as follows:

(1) Sort the maximal frequent itemsets in MFS in descending order of the number of items;

(2) Traverse the maximal frequent itemsets in MFS starting from i = 1. For each X_i ∈ MFS, if ORate(X_i, X_j) ≥ minOR for some X_j with i < j ≤ |MFS|, add the union of X_i and X_j to a new set MMFS and remove X_j;

(3) Repeat the two steps above on the itemsets in MMFS;

(4) Stop when the overlap rate of every pair of itemsets in MMFS is less than minOR.
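The merging steps can be sketched as below. This is one possible reading of the procedure, with hypothetical function names: each X_i absorbs every later X_j whose overlap rate with it reaches `min_or`, and passes are repeated until a full pass merges nothing, which is equivalent to every remaining pair having an overlap rate below the threshold. Itemsets are assumed to be non-empty.

```python
def overlap_rate(x1, x2):
    """ORate(X1, X2) = |X1 ∩ X2| / min(|X1|, |X2|), for non-empty sets."""
    return len(x1 & x2) / min(len(x1), len(x2))


def merge_itemsets(mfs, min_or=0.5):
    """Repeatedly union itemsets whose overlap rate is at least min_or."""
    current = [set(x) for x in mfs]
    while True:
        # Step (1): largest itemsets first.
        current.sort(key=len, reverse=True)
        merged, changed = [], False
        # Step (2): each X_i absorbs the later X_j that overlap it enough.
        while current:
            x_i = current.pop(0)
            rest = []
            for x_j in current:
                if overlap_rate(x_i, x_j) >= min_or:
                    x_i = x_i | x_j
                    changed = True
                else:
                    rest.append(x_j)
            merged.append(x_i)
            current = rest
        current = merged
        # Steps (3) and (4): repeat until a full pass merges nothing.
        if not changed:
            return current
```

With the inputs {1, 2, 3, 4}, {3, 4, 5}, {8, 9}, {9, 10, 11} and {20} at a threshold of 0.5, the sketch returns the three groups {1, 2, 3, 4, 5}, {8, 9, 10, 11} and {20}.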

The method of the invention is simple and easy to operate. Practical trials show that the method is stable and reliable and has practical application value. The relevant details are as follows:

1) Data sets

Sina Weibo was used as the research platform, and 81 microblogs suspected of being hyped were taken as the objects of study. The number of accounts that actually took part in reposting them is 380,726 (accounts that reposted more than once are counted only once), and each transaction contains 6,286 items on average. Most of these microblogs are advertising and marketing posts, and several hype groups may have taken part in spreading them. A crawler program was used to collect the account identifiers (UIDs) of all accounts that reposted these microblogs and store them in the transaction database; the format of part of the data is shown in Fig. 3.

To verify the efficiency of the algorithm of the present invention (hereinafter IIA) for maximal frequent itemset mining, a performance test was run on the classic Mushroom data set and the results were compared with known methods. This data set contains 8,124 records; each record has 23 items, recording 23 attributes of a mushroom.

2) Performance Evaluation

The performance of the method was evaluated first. The experimental environment was 4 GB of memory, a 2.0 GHz dual-core Duo T5800 CPU and the 32-bit Windows 7 operating system. The algorithm was implemented in Java and compared with the classic MAFIA algorithm and the DFMFI algorithm.

Fig. 4 shows the execution of the three algorithms on the Mushroom data set under different support levels. The efficiency of the present method is clearly higher than that of the other two algorithms, and it retains its advantage even when the minimum support is very low. Fig. 5 shows the execution of the three algorithms on the hyped-microblog data set; here too the present method has the highest execution efficiency.

3) Selecting parameter thresholds

Fig. 6 and Fig. 7 show the maximal frequent itemsets found in the hyped-microblog data set under different minimum support counts: Fig. 6 shows how the number of itemsets among the maximal frequent itemsets varies with the minimum support count, and Fig. 7 shows how the maximum itemset length varies with it. Given the research background of the present invention, the larger minSup (the minimum support count) is set, the more strongly the discovered account groups are suspected of hyping, but the size and number of groups decrease accordingly; conversely, the smaller minSup is set, the weaker the suspicion, but the size and number of groups increase. A reasonable threshold for minSup is therefore needed in order to find groups of a certain size that are strongly suspected of hyping.

In addition, when the itemsets among the maximal frequent itemsets are merged, the setting of minOR directly affects the size of the merged itemsets. After repeated analysis of the data, minOR was set to 50%, i.e. two itemsets are merged when more than half of the items of the smaller itemset are the same as items of the other.
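As a worked illustration with hypothetical itemsets (not taken from the experiments), let X₁ = {a, b, c, d} and X₂ = {c, d, e}. They share the two items c and d, and the smaller itemset has three items, so:

ORate (X_{1}, X_{2}) = \frac{| \{c, d\} |}{Min (4, 3)} = \frac{2}{3} \approx 0.67 \geq 0.5

With minOR set to 50%, X₁ and X₂ would therefore be merged into {a, b, c, d, e}. A third hypothetical itemset X₃ = {d, f, g} shares only one item with X₁, giving an overlap rate of 1/3, which is below the threshold, so X₃ would not be merged with X₁ on that comparison.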

To further determine the value of minSup, Table 1 lists the results of merging the maximal frequent itemsets for minSup = 3, 4 and 5, sorted by the length of the merged itemsets; only the first 8 itemsets (suspected hype groups) are listed. The table shows that for minSup = 3 and minSup = 5, all itemsets other than the first, which is very large, are small, whereas for minSup = 4 the itemset sizes do not change abruptly and are of suitable size, which indicates that this value is reasonable.

Table 1. Results of merging maximal frequent itemsets under different support counts (itemset lengths)

No.	minSup=3	minSup=4	minSup=5
1	14,863	2,623	963
2	311	1,755	65
3	156	688	29
4	77	410	19
5	59	129	9
6	56	98	9
7	55	82	7
8	55	54	5

4) Accuracy analysis

To verify the accuracy of the hype group discovery algorithm of the present invention, i.e. the proportion of actual hype accounts in a discovered hype group, the results are checked by combining an existing hype account identification method based on multi-feature analysis with manual labeling. Let the hype group to be verified be H. First, each account is judged with the existing multi-feature hype account identification method, and the resulting set of hype accounts is denoted H₁. Then the remaining accounts are judged by manual labeling, and the resulting set of hype accounts is denoted H₂. The accuracy of hype group H is calculated as:

Precision = \frac{| H_{1} | + | H_{2} |}{| H |} \times 100\% \qquad (1)

In the above formula, |H| is the total number of accounts in H and |H₁|+|H₂| is the number of actual hype accounts in H. The groups in Table 1 with minSup = 4 and a group size (i.e. itemset length) greater than 100 were verified; the results are shown in Table 2.
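Formula (1) is a simple ratio and can be checked directly. The short sketch below uses a hypothetical helper name and takes the three set sizes as plain integers; the example call uses the counts reported for group 1 in Table 2.

```python
def group_precision(h1, h2, h):
    """Precision (%) = (|H1| + |H2|) / |H| * 100, computed from the set sizes."""
    if h <= 0:
        raise ValueError("group size |H| must be positive")
    return (h1 + h2) / h * 100


# Group 1 in Table 2: |H1| = 2,016, |H2| = 451, |H| = 2,623.
print(f"{group_precision(2016, 451, 2623):.1f}%")  # prints "94.1%"
```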

Table 2. Accuracy of hype group discovery (minSup=4)

No.	|H₁|	|H₂|	|H|	Precision
1	2,016	451	2,623	94.1%
2	1,465	163	1,755	92.8%
3	571	78	688	94.3%
4	354	33	410	94.4%
5	109	10	129	92.2%

Table 2 shows that in every hype group discovered by the present method, actual hype accounts make up more than 90% of the group. This indicates that the method can identify relatively well-hidden hype accounts (i.e. H₂), which are often highly influential big-name accounts that take part in hyping only occasionally. The present invention therefore has practical application value as well as economic and social benefits.

Claims

1. A method for discovering microblog hype groups based on maximal frequent itemset mining, characterized by comprising the following steps:

(1) Hyped-microblog sample collection: using the relatedness of hyped microblogs as a clue, obtain the set of accounts that took part in spreading the hyped microblogs, by means of crawler technology or the microblog's public open platform;

(2) Transaction database construction: treating each microblog as a transaction and each account that took part in spreading it as an item, build the hyped-microblog transaction database;

(3) Maximal frequent itemset mining: for the transactions in the transaction database corresponding to the group of microblogs under examination, use the iterative intersection method to find the maximal frequent itemsets contained in the transactions, obtaining a set of maximal frequent itemsets;

Because each transaction in the hyped-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm; binary search is used to quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets, and reduce the size of the transaction database;

(4) Maximal frequent itemset merging: for each maximal frequent itemset, calculate the overlap rate between itemsets and merge the maximal frequent itemsets, merging smaller itemsets into larger ones while ensuring that the accounts in a merged itemset remain related to one another; by reducing the size of the transaction database, the number of intersection operations is reduced; when transactions are intersected, binary search is used to decide whether a transaction contains a given item, which improves the efficiency of maximal frequent itemset mining, and microblog hype groups are thereby discovered.

2. The method for discovering microblog hype groups based on maximal frequent itemset mining according to claim 1, characterized by comprising three parts: construction of the hyped-microblog transaction database, maximal frequent itemset mining, and maximal frequent itemset merging; the transaction database construction module is mainly responsible for collecting and preprocessing data and building the transaction database D; the maximal frequent itemset mining module first screens candidate maximal frequent itemsets using binary search, and then mines the maximal frequent itemsets MFS from the transaction database D using the iterative intersection method; the maximal frequent itemset merging module merges the itemsets in MFS so as to recover the real hype groups; the specific steps are:

1) Collecting hyped-microblog samples

Collecting hyped-microblog samples is the initial step of the present invention. The selected microblog samples should be related to one another, for example several microblogs that a certain hype account has taken part in, or several microblogs related to a certain topic. Whether a microblog counts as a hyped sample should be judged with reference to existing mature discrimination methods or expert systems. There are two ways to collect hyped-microblog samples: one is to use crawler technology to download web pages from the microblog site, parse the page structure and extract the information of the accounts that spread the microblog; the other is to call the microblog's public open platform, using the API functions that the microblog operator provides externally to obtain the information of the accounts that spread the microblog;

2) Building the transaction database

The problem of discovering hype groups is converted into the problem of maximal frequent itemset mining in data mining. On the basis of the collected hyped-microblog samples, each hyped microblog corresponds to a transaction and each account that reposted the microblog corresponds to an item in that transaction, and the transaction database is built;

3) Screening candidate maximal frequent itemsets based on binary search

Because each transaction in the hyped-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. A method based on binary search can quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets, and reduce the size of the transaction database. Given the transaction database D and the minimum support count S, candidate maximal frequent itemsets are screened as follows:

(2) Let the set of frequent items be FI = ∅ and the set of infrequent items be NFI = ∅. Starting from i = 1, traverse each transaction T_i (1 ≤ i ≤ |D|) in D in order, and for each item u in T_i:

a) if u ∈ FI, keep u;

b) if u ∈ NFI, remove u from T_i;

c) if u ∉ FI and u ∉ NFI, go to the next step to determine whether u is a frequent item;

(3) Starting from j = i + 1, traverse the remaining transactions and use binary search to determine whether each T_j (i < j ≤ |D|) contains u. The termination conditions are:

B) when the number of remaining transactions plus the number of transactions containing u is less than S, u is an infrequent item, and u is removed from T_i; if the number of transactions containing u is greater than 1 at this point, u also appears in transactions other than T_i, so u is added to NFI;

Maximal frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database D₁ and the minimum support count S, maximal frequent itemsets are mined as follows:

(2) To reduce the number of intersection operations, for each transaction T_i (1 ≤ i ≤ |D₁|-S+1), starting from i = 1, first find the set of transactions that contain any item of T_i, D_{T_i} = {T_j | T_j contains at least one item of T_i, j > i}. Intersect T_i with each such T_j in turn, move each intersection into a new transaction database D₂, and remove T_j at the same time;

(3) For each transaction T in the new transaction database D₂, if T was obtained by intersecting no fewer than S transactions, move T into the maximal frequent candidate set MFCS and remove the sub-transactions of T from D₂;

(4) If the number of transactions remaining in the new transaction database D₂ is less than S, stop processing D₂ and return to the upper-level transaction database; otherwise, repeat this procedure on D₂ starting from step (1);

(5) When the number of transactions remaining in D₁ is less than S, i.e. i > |D₁|-S+1, stop processing the current transaction database D₁;

(6) Merge the itemsets in MFCS and remove those that are not maximal frequent itemsets; the final result is the required set of maximal frequent itemsets, MFS;

5) Merging maximal frequent itemsets:

Because of the minimum support count constraint, the maximal frequent itemsets in MFS are relatively small, and some itemsets share a large number of items; the account groups that these itemsets represent probably belong to the same hype group. To address this, the overlap rate is used to reflect the similarity between two itemsets. For itemsets X₁, X₂ ∈ MFS, the overlap rate of X₁ and X₂ is defined as:

ORate (X_{1}, X_{2}) = \frac{| X_{1} \cap X_{2} |}{Min (| X_{1} |, | X_{2} |)}

In the above formula, |X₁∩X₂| is the number of items shared by X₁ and X₂, and Min(|X₁|,|X₂|) is the number of items in the smaller of the two itemsets. Itemsets are merged as follows:

(2) Traverse the maximal frequent itemsets in MFS starting from i = 1. For each X_i ∈ MFS, if ORate(X_i, X_j) ≥ minOR for some X_j with i < j ≤ |MFS|, add the union of X_i and X_j to a new set MMFS and remove X_j;

(3) Repeat the two steps above on the itemsets in MMFS;

3. The method for discovering microblog hype groups based on maximal frequent itemset mining according to claim 2, characterized in that, in said step 1), the collected hyped-microblog samples should meet the following conditions:

B. the time span over which the microblogs were published is < 180 days; so as to facilitate the discovery of hype groups.