CN103927398B - Method for discovering microblog hype groups based on maximal frequent itemset mining - Google Patents


Info

Publication number
CN103927398B
Authority
CN
China
Prior art keywords
microblog
transaction
maximal frequent
frequent itemsets
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410188004.7A
Other languages
Chinese (zh)
Other versions
CN103927398A (en
Inventor
刘琰
张进
罗军勇
罗向阳
董雨辰
陈静
常斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Information Engineering University
Original Assignee
PLA Information Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Information Engineering University
Priority to CN201410188004.7A
Publication of CN103927398A
Application granted
Publication of CN103927398B
Status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a method for discovering microblog hype groups based on maximal frequent itemset mining, which effectively addresses the discovery of microblog hype groups and helps prevent false, malicious hype. The method is as follows: taking the relevance of hyped microblogs as a clue, the set of accounts that took part in spreading the hyped microblogs is obtained by means of crawler technology or the microblog public open platform; a hyped-microblog transaction database is built in which each single microblog is a transaction and each account that took part in spreading it is an item; for the transactions in the transaction database corresponding to the microblog group under examination, the maximal frequent itemsets contained in the transactions are found; the overlap rate between the maximal frequent itemsets is computed and smaller itemsets are merged into larger ones; the number of intersection operations is reduced, and when transactions are intersected, binary search is used to determine whether a transaction contains a given item, which improves the efficiency of mining maximal frequent itemsets; the microblog hype groups are thereby discovered. The method is simple, can accurately discover malicious microblog hype groups, and helps prevent harm to society.

Description

Method for discovering microblog hype groups based on maximal frequent itemset mining
Technical field
The present invention relates to the field of microblog public opinion monitoring, and in particular to a method for discovering microblog hype groups based on maximal frequent itemset mining.
Background art
As an emerging form of social media, the microblog combines the functions of blogs, media and instant messaging. Its immediacy, grassroots character, mobility and interactivity make it a natural carrier for the spread of online public opinion. In online public opinion, the microblog has become not only the center and channel of opinion dissemination but also a participant in the formation, development and steering of public opinion.
Microblog dissemination is a double-edged sword. On the one hand, microblogs provide a rapid-response platform for disclosing information about social events, which to some extent makes up for the shortcomings of traditional media and other network tools. On the other hand, microblogs differ from traditional news media: the news they publish is repetitive and its authenticity cannot be guaranteed, so they may be exploited as a carrier for rumors and a fuse for spreading discontent, with severe consequences that can extend to national security and social stability. False online information originates with its creators and spreads through its disseminators.
A recent report from a social computing research team at Hewlett-Packard states that Sina Weibo has a very serious topic hype problem: half of the forwards of microblogs in trending topics come from hype users. The research found that the number of fake forwards in artificially manipulated trending topics is high, and that the 1% of senders who post spam generated 49% of the forwarding volume. Since August 2013, government departments have stepped up their guidance of online public opinion. Investigations of online promotion companies such as those behind "Qin Huohuo" and "Li Er Chai Si" (立二拆四) show that a large number of organized promoter teams exist on the network. In collusion with a small number of "opinion leaders", they organize paid posters (the online "water army"), fabricate false news over long periods, deliberately distort facts, stir up trouble online and blur the line between right and wrong, seriously disrupting the order of online public opinion. Their behavior has drawn close attention from the national public opinion regulators, and the persons involved have been placed under criminal detention on suspicion of crimes.
Therefore, in the new media environment, identifying hyped microblogs in the face of various hidden public opinion manipulation behaviors, analyzing the characteristics of the groups that spread them, collecting evidence for identifying fake promotion, and screening out artificially manufactured hot topics have important theoretical value and practical significance for discovering, predicting and guiding online public opinion, improving the government's ability to supervise public opinion, and maintaining social harmony and stability.
With the explosive growth of microblogs, research on microblog accounts has attracted broad interest from scholars in China and abroad, and a number of results have been published in recent years at major conferences such as WWW and KDD. Current research on microblog accounts falls roughly into three categories: 1) feature analysis, including account attribute features and behavioral features; 2) influence analysis, including the construction of influence evaluation systems and measurement methods; 3) analysis of the relationship network between accounts, including the basic properties, generation and evolution of the account relationship network.
However, there is comparatively little literature, in China or abroad, on the discovery of hype groups. The main related work concerns the identification of spam accounts (spammers), sockpuppet accounts and zombie accounts. A spam account is an account that frequently posts junk information. Z. Yi et al. analyzed the characteristics of spam accounts from several angles and identified them automatically using machine learning. Chao Yang et al. analyzed in depth the social relationships between spam accounts and proposed a method that finds spam accounts according to the closeness between accounts. A sockpuppet account is one of several accounts registered by the same person and used to fake behaviors such as posting, forwarding and commenting. Xueling Zheng et al. proposed a method that identifies sockpuppet accounts using text content and similarity matching. A zombie account is an account registered maliciously for buying and selling followers. Fang Ming et al. proposed an intelligent classification method based on feature extraction from registered microblog account names, which achieves a high accuracy. None of these methods, however, solves the problem of how to discover microblog hype groups and prevent false hype. The biggest difference between hype accounts and the account types above is that hype accounts are characterized by their "hyping" behavior: the accounts taking part in hype are scattered, the direct relationships between them are not obvious, and they are more covert and better organized, which makes them harder to find.
On the surface, group hype resembles ordinary microblog dissemination: the posts, forwards and comments of the participants appear to be isolated acts. Abnormal malicious dissemination, however, is usually not the act of a single person but organized group behavior, and this group behavior is concealed and hard to detect. How to discover microblog hype groups, and thereby prevent false, malicious hype from causing harm to society and unnecessary economic loss, is therefore a technical problem that needs to be solved.
Summary of the invention
In view of the above, and to overcome the shortcomings of the prior art, the object of the present invention is to provide a method for discovering microblog hype groups based on maximal frequent itemset mining, which can effectively solve the problem of discovering microblog hype groups and preventing false, malicious hype.
The technical solution of the present invention is a method for discovering microblog hype accounts based on maximal frequent itemset mining, comprising the following steps:
(1) Hyped microblog sample collection: taking the relevance of hyped microblogs as a clue, obtain the set of accounts that took part in spreading the hyped microblogs by means of crawler technology or the microblog public open platform;
(2) Transaction database construction: build the hyped-microblog transaction database, in which each single microblog is a transaction and each account that took part in spreading it is an item;
(3) Maximal frequent itemset mining: for the transactions in the transaction database corresponding to the microblog group under examination, use the iterative intersection method to find the maximal frequent itemsets contained in the transactions, yielding a set of maximal frequent itemsets;
Because each transaction in the hyped-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. Binary search is therefore used to quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets and reduce the size of the transaction database;
(4) Maximal frequent itemset merging: compute the overlap rate between the maximal frequent itemsets and merge them, folding smaller itemsets into larger ones as far as possible while ensuring that the accounts in each merged itemset remain related to one another. Reducing the size of the transaction database reduces the number of intersection operations, and when transactions are intersected, binary search is used to determine whether a transaction contains a given item. This improves the efficiency of mining maximal frequent itemsets, and the microblog hype groups are thereby discovered.
The method of the present invention is simple and easy to carry out, can accurately discover malicious microblog hype groups, helps prevent harm to society and unnecessary economic loss, and has practical application value.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention.
Fig. 2 is a schematic diagram of the hyped-microblog transaction database of the present invention.
Fig. 3 is a screenshot of the hyped-microblog transaction database of the present invention.
Fig. 4 compares the execution times of the algorithm of the present invention on the Mushroom data set.
Fig. 5 compares the execution times of the algorithm of the present invention on the hyped-microblog data set.
Fig. 6 shows how the number of itemsets in MFS varies.
Fig. 7 shows how the maximum length of the itemsets in MFS varies.
Detailed description
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the present invention comprises three parts: hyped-microblog transaction database construction, maximal frequent itemset mining, and maximal frequent itemset merging. The transaction database construction module is mainly responsible for collecting and preprocessing the data and building the transaction database D. The maximal frequent itemset mining module first screens candidate maximal frequent itemsets using binary search and then mines the maximal frequent itemsets MFS from the transaction database D using the iterative intersection method. The merging module merges the itemsets in MFS so as to recover the real hype groups as far as possible. The specific steps are:
1) Collect hyped microblog samples
Collecting hyped microblog samples is the first step in carrying out the present invention. The selected microblog samples should be related to one another, for example several microblogs that a given hype account has taken part in, or several microblogs related to a given topic. Whether a microblog counts as a hyped sample should be judged with an existing mature discrimination method or an expert system. There are two ways to collect the samples. One is to use crawler technology: download the web pages from the microblog site, parse the page structure and extract the information of the accounts that spread the microblog. The other is to use the microblog public open platform: call the API functions that the microblog operator provides externally to obtain the information of the accounts that spread the microblog. To make the hype groups easier to discover, the following principles should also be observed when choosing hyped microblog samples:
A. Choose popular microblogs with a relatively high number of forwards;
B. The publication times of the microblogs should span less than 180 days;
Depending on what the analysis of the hype accounts to be mined requires, the collected sample content should include the microblog identifier, the microblog account identifier and the basic information of the microblog account;
2) Build the transaction database
The problem of discovering hype groups is converted into the problem of mining maximal frequent itemsets in data mining. On the basis of the collected hyped microblog samples, each hyped microblog corresponds to a transaction and each account that took part in forwarding the microblog corresponds to an item in that transaction; the transaction database is built accordingly, as shown in Fig. 2;
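The mapping from forwarding records to transactions can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_transaction_db`, the record layout of (microblog ID, account UID) pairs and the sample IDs are all assumptions made for the example. Each transaction is stored as a sorted list so that binary search can be applied to it later.

```python
from collections import defaultdict

def build_transaction_db(forward_records):
    """Group (microblog_id, account_id) pairs into one transaction per microblog.

    Hypothetical helper: each microblog is a transaction and each account
    that forwarded it is an item of that transaction.
    """
    accounts_by_microblog = defaultdict(set)
    for microblog_id, account_id in forward_records:
        # A set drops repeated forwards of the same microblog by the same account.
        accounts_by_microblog[microblog_id].add(account_id)
    # Sorted item lists allow binary search in later steps.
    return {mid: sorted(accounts) for mid, accounts in accounts_by_microblog.items()}

# Made-up sample records, for illustration only.
records = [("m1", 1003), ("m1", 1001), ("m2", 1001), ("m2", 1002), ("m1", 1001)]
db = build_transaction_db(records)
print(db)
```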
3) Screen candidate maximal frequent itemsets using binary search
Because each transaction in the hyped-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. A method based on binary search can quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets and reduce the size of the transaction database. Given the transaction database D and the minimum support count S, candidate maximal frequent itemsets are screened as follows:
(1) Sort the transactions in the transaction database D in descending order of the number of items;
(2) Let the set of frequent items be FI = ∅ and the set of infrequent items be NFI = ∅. Starting from i = 1, traverse each transaction T_i (1 ≤ i ≤ |D|) in D in order, and for each item u in T_i:
a) if u ∈ FI, keep u;
b) if u ∈ NFI, remove u from T_i;
c) if u ∉ FI and u ∉ NFI, go to the next step to determine whether u is a frequent item;
(3) Starting from j = i + 1, traverse the remaining transactions and use binary search to determine whether T_j (i < j ≤ |D|) contains u. The termination conditions are:
A) when the number of transactions containing u reaches S, u is a frequent item, and u is added to FI;
B) when the number of remaining transactions plus the number of transactions containing u is less than S, u is an infrequent item, and u is removed from T_i. If the number of transactions containing u is greater than 1 at this point, u also appears in transactions other than T_i, and u is added to NFI;
(4) After the infrequent items have been removed from all transactions in D, the reduced transaction database D_1 is obtained;
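The screening steps above can be sketched as follows. This is an illustrative reading of the procedure under stated assumptions, not the patent's code: the names `contains` and `screen_candidates` are hypothetical, transactions are assumed to be lists of comparable item IDs, and, as a simplification, every infrequent item is added to NFI (the text adds it only when it occurs in more than one transaction, which gives the same reduced database).

```python
from bisect import bisect_left

def contains(sorted_items, item):
    """Binary search: True if item occurs in the sorted list."""
    pos = bisect_left(sorted_items, item)
    return pos < len(sorted_items) and sorted_items[pos] == item

def screen_candidates(transactions, min_sup):
    """Remove infrequent items from every transaction and return the reduced database."""
    # Step (1): sort transactions by item count, largest first; items sorted for bisect.
    db = sorted((sorted(set(t)) for t in transactions), key=len, reverse=True)
    frequent, infrequent = set(), set()          # FI and NFI
    reduced = []
    for i, t_i in enumerate(db):
        kept = []
        for u in t_i:
            if u in frequent:                    # case a): keep u
                kept.append(u)
                continue
            if u in infrequent:                  # case b): drop u
                continue
            count = 1                            # case c): T_i itself contains u
            for j in range(i + 1, len(db)):
                if count >= min_sup:             # condition A): u is frequent
                    break
                remaining = len(db) - j          # transactions not yet checked
                if count + remaining < min_sup:  # condition B): u cannot reach S
                    break
                if contains(db[j], u):
                    count += 1
            if count >= min_sup:
                frequent.add(u)
                kept.append(u)
            else:
                infrequent.add(u)                # simplification: always remember u
        reduced.append(kept)
    return [t for t in reduced if t]             # drop transactions left empty

# Made-up data: with S = 2, items 4, 5 and 6 are infrequent.
d1 = screen_candidates([[1, 2, 3, 4], [1, 2, 3], [2, 3, 5], [6]], min_sup=2)
print(d1)
```

Because an item is classified the first time it is met, no earlier transaction can contain an unclassified item, so counting only from T_i onward gives its full support.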
4) Mine maximal frequent itemsets by iterative intersection:
Maximal frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database D_1 and the minimum support count S, the mining method is as follows:
(1) Sort the transactions in D_1 in descending order of the number of items so that maximal frequent itemsets are found as early as possible. To reduce the size of the transaction database, merge duplicate transactions and record how many times each occurs;
(2) To reduce the number of intersections, for each transaction T_i, 1 ≤ i ≤ |D_1| - S + 1, starting from i = 1, first find the set of transactions that contain any item of T_i, {T_j | T_j contains at least one item of T_i; j > i}. Intersect T_i with each T_j in turn, move the intersection of the two into a new transaction database D_2, and remove T_j;
(3) For each transaction T in the new transaction database D_2, if T was obtained by intersecting no fewer than S transactions, move T into the maximal frequent candidate set MFCS and remove the sub-transactions of T from D_2;
(4) If the number of transactions remaining in D_2 is less than S, stop processing D_2 and return to the upper-level transaction database; otherwise, repeat this process on D_2 starting from step (1);
(5) When the number of transactions remaining in D_1 is less than S, i.e. i > |D_1| - S + 1, stop processing the current transaction database D_1;
(6) Merge the itemsets in MFCS and remove those that are not maximal frequent itemsets; the final result is the required set of maximal frequent itemsets MFS;
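To make the expected output MFS concrete, the following sketch finds maximal frequent itemsets by intersecting transactions. It is deliberately simpler than the procedure above and should not be read as the patent's algorithm: it omits the nested database D_2, duplicate merging and the early-termination rules, and instead accumulates all non-empty intersections of transactions, keeps those whose support reaches S, and discards any that are contained in a larger frequent one. The function name and sample data are assumptions.

```python
def maximal_frequent_itemsets(transactions, min_sup):
    """Simplified intersection-based mining of maximal frequent itemsets."""
    txns = [frozenset(t) for t in transactions]
    # Every maximal frequent itemset is an intersection of some transactions,
    # so collect all non-empty intersections incrementally.
    intersections = set()
    for t in txns:
        new_sets = {t}
        for c in intersections:
            common = c & t
            if common:
                new_sets.add(common)
        intersections |= new_sets
    # Keep the intersections supported by at least min_sup transactions.
    frequent = [c for c in intersections
                if sum(1 for t in txns if c <= t) >= min_sup]
    # Remove itemsets that are proper subsets of another frequent itemset.
    return [c for c in frequent if not any(c < other for other in frequent)]

# Made-up data: with S = 2 the maximal frequent itemsets are {1, 2, 3} and {4, 5}.
sample = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3}, {4, 5, 6}, {4, 5, 7}]
mfs = maximal_frequent_itemsets(sample, min_sup=2)
print(sorted(sorted(x) for x in mfs))
```

The number of intersections can grow quickly on large inputs, which is the cost that the database reduction and pruning steps in the text are meant to control.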
5) Merge the maximal frequent itemsets:
Because of the constraint imposed by the minimum support count, the maximal frequent itemsets in MFS are small, and some itemsets share a large number of items; the account groups that these itemsets represent probably belong to the same hype group. To address this, the overlap rate is used to reflect the similarity between two itemsets. For itemsets X_1, X_2 ∈ MFS, the overlap rate of X_1 and X_2 is defined as:
\[ \mathrm{ORate}(X_1, X_2) = \frac{|X_1 \cap X_2|}{\min(|X_1|, |X_2|)} \]
In the formula above, |X_1 ∩ X_2| is the number of items common to X_1 and X_2, and min(|X_1|, |X_2|) is the number of items in the smaller of the two itemsets. The itemsets are merged as follows:
(1) Sort the maximal frequent itemsets in MFS in descending order of the number of items;
(2) Traverse the maximal frequent itemsets in MFS starting from i = 1. For each X_j with i < j ≤ |MFS|, if ORate(X_i, X_j) ≥ minOR, add the union of X_i and X_j to a new set MMFS and remove X_j;
(3) Repeat the two steps above on the itemsets in MMFS;
(4) Stop when the overlap rate of every pair of itemsets in MMFS is less than minOR.
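The overlap rate and the merging loop can be sketched as follows. This is one possible reading of the steps above, with assumptions labeled: the names `overlap_rate` and `merge_itemsets` are hypothetical, itemsets are assumed to be non-empty, an itemset that merges with nothing is carried into MMFS unchanged, and each X_j is compared with the original X_i rather than with the growing union.

```python
def overlap_rate(x1, x2):
    """ORate(X1, X2) = |X1 ∩ X2| / min(|X1|, |X2|); both sets must be non-empty."""
    return len(x1 & x2) / min(len(x1), len(x2))

def merge_itemsets(mfs, min_or):
    """Repeatedly merge itemsets whose overlap rate reaches min_or."""
    itemsets = [set(x) for x in mfs]
    while True:
        # Step (1): largest itemsets first.
        itemsets.sort(key=len, reverse=True)
        merged, removed, changed = [], set(), False
        # Step (2): fold each later X_j into X_i when ORate(X_i, X_j) >= min_or.
        for i, x_i in enumerate(itemsets):
            if i in removed:
                continue
            union = set(x_i)
            for j in range(i + 1, len(itemsets)):
                if j not in removed and overlap_rate(x_i, itemsets[j]) >= min_or:
                    union |= itemsets[j]
                    removed.add(j)
                    changed = True
            merged.append(union)
        itemsets = merged
        # Steps (3) and (4): repeat until a full pass performs no merge.
        if not changed:
            return itemsets

# Made-up itemsets: {1,2,3,4} and {3,4,5} share 2 of 3 items, so they merge at minOR = 0.5.
groups = merge_itemsets([{3, 4, 5}, {1, 2, 3, 4}, {9, 10}], min_or=0.5)
print(groups)
```

A pass that performs no merge has compared every remaining pair, so stopping there matches the termination condition in step (4).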
The method of the present invention is simple and easy to carry out. Practical trials show that the method is stable and reliable and has practical application value. The relevant details are as follows:
1) Data sets
Sina Weibo was used as the research platform, and 81 microblogs suspected of being hyped were taken as the objects of study. The number of accounts that actually took part in forwarding them is 380,726 (excluding repeat participation by the same account), and the average number of items per transaction is 6,286. Most of these microblogs are advertising and marketing posts, and several hype groups may have taken part in spreading them. A crawler program was used to collect the account identifiers (UIDs) of all accounts that forwarded these microblogs and store them in the transaction database; the format of part of the data is shown in Fig. 3.
To verify the efficiency of the algorithm of the present invention (hereinafter IIA) for mining maximal frequent itemsets, a performance test was run on the classic Mushroom data set and the results were compared with known methods. This data set contains 8,124 records; each record has 23 items, recording 23 attributes of a mushroom.
2) Performance evaluation
The performance of the method of the present invention was evaluated first. The experimental environment was 4 GB of memory, a 2.0 GHz dual-core Core T5800 CPU and the 32-bit Windows 7 operating system. The algorithm was implemented in Java and compared with the classic MAFIA algorithm and the DFMFI algorithm.
Fig. 4 shows the execution of the three algorithms on the Mushroom data set under different support levels. The efficiency of the present method is clearly higher than that of the other two algorithms, and it retains its advantage even when the minimum support is very low. Fig. 5 shows the execution of the three algorithms on the hyped-microblog data set; the present method again has the highest execution efficiency.
3) Parameter threshold selection
Fig. 6 and Fig. 7 show the maximal frequent itemsets found in the hyped-microblog data set under different minimum support counts: Fig. 6 shows how the number of maximal frequent itemsets varies with the minimum support count, and Fig. 7 shows how the maximum length of the maximal frequent itemsets varies with it. Given the research background of the present invention, the larger minSup (the minimum support count) is set, the stronger the suspicion of hype attached to the account groups found, but the smaller and fewer those groups become; conversely, the smaller minSup is set, the weaker the suspicion of hype, but the larger and more numerous the groups. A reasonable threshold therefore needs to be set for minSup in order to find groups of a certain size that are strongly suspected of hype.
In addition, when the itemsets in the set of maximal frequent itemsets are merged, the setting of minOR directly affects the size of the merged itemsets. After repeated analysis of the data, minOR was set to 50%, i.e. two itemsets are merged when more than half of the items are the same.
To further determine the value of minSup, Table 1 lists the results of merging the maximal frequent itemsets for minSup = 3, 4 and 5, sorted by the length of the merged itemsets; only the first 8 itemsets (suspected hype groups) are listed. The table shows that for minSup = 3 and minSup = 5, all itemsets other than the first, which is very large, are small; for minSup = 4, the itemset sizes do not change abruptly and are of suitable scale, which indicates that this value is reasonable.
Table 1. Maximal frequent itemset merging results under different minimum support counts
No.    minSup=3    minSup=4    minSup=5
1      14,863      2,623       963
2      311         1,755       65
3      156         688         29
4      77          410         19
5      59          129         9
6      56          98          9
7      55          82          7
8      55          54          5
4) Accuracy analysis
To verify the accuracy of the hype group discovery algorithm of the present invention, i.e. the proportion of actual hype accounts in the hype groups found, the results were verified by combining an existing hype account identification method based on multi-feature analysis with manual labeling. Let the hype group to be verified be H. First, each account is judged with the existing multi-feature hype account identification method, and the resulting set of hype accounts is denoted H_1. Then the remaining accounts are judged by manual labeling, and the resulting set of hype accounts is denoted H_2. The accuracy of hype group H is computed as:
\[ \mathrm{Precision} = \frac{|H_1| + |H_2|}{|H|} \times 100\% \tag{1} \]
In the formula above, |H| is the number of accounts in H, and |H_1| + |H_2| is the number of actual hype accounts in H. The groups in Table 1 with minSup = 4 and a group size (i.e. itemset length) greater than 100 were verified; the results are shown in Table 2.
Table 2. Accuracy of hype group discovery (minSup=4)
No.    |H1|     |H2|    |H|      Precision
1      2,016    451     2,623    94.1%
2      1,465    163     1,755    92.8%
3      571      78      688      94.3%
4      354      33      410      94.4%
5      109      10      129      92.2%
Table 2 shows that in every hype group found by this method, the proportion of actual hype accounts is above 90%. This indicates that the method can also identify the more covert hype accounts (i.e. H_2), which are often highly influential major accounts that only occasionally take part in hype. The present invention therefore has practical application value and brings economic and social benefits.
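As a quick arithmetic check of formula (1), the short sketch below recomputes the Precision column from the |H1|, |H2| and |H| values reported in Table 2. The function name `precision` is chosen for the example only.

```python
def precision(h1, h2, h):
    """Formula (1): (|H1| + |H2|) / |H| x 100%."""
    return (h1 + h2) / h * 100

# (|H1|, |H2|, |H|) for the five groups listed in Table 2.
rows = [(2016, 451, 2623), (1465, 163, 1755), (571, 78, 688), (354, 33, 410), (109, 10, 129)]
values = [round(precision(*row), 1) for row in rows]
print(values)  # [94.1, 92.8, 94.3, 94.4, 92.2]
```

The recomputed values agree with the percentages given in the table.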

Claims (3)

1. A method for discovering microblog hype groups based on maximal frequent itemset mining, characterized by comprising the following steps:
(1) Hyped microblog sample collection: taking the relevance of hyped microblogs as a clue, obtain the set of accounts that took part in spreading the hyped microblogs by means of crawler technology or the microblog public open platform;
(2) Transaction database construction: build the hyped-microblog transaction database, in which each single microblog is a transaction and each account that took part in spreading it is an item;
(3) Maximal frequent itemset mining: for the transactions in the transaction database corresponding to the microblog group under examination, use the iterative intersection method to find the maximal frequent itemsets contained in the transactions, yielding a set of maximal frequent itemsets;
Because each transaction in the hyped-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm; binary search is used to quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets and reduce the size of the transaction database;
(4) Maximal frequent itemset merging: compute the overlap rate between the maximal frequent itemsets and merge them, folding smaller itemsets into larger ones while ensuring that the accounts in each merged itemset remain related to one another; reducing the size of the transaction database reduces the number of intersection operations, and when transactions are intersected, binary search is used to determine whether a transaction contains a given item, which improves the efficiency of mining maximal frequent itemsets, and the microblog hype groups are thereby discovered.
2. The method for discovering microblog hype groups based on maximal frequent itemset mining according to claim 1, characterized in that it comprises three parts: hyped-microblog transaction database construction, maximal frequent itemset mining, and maximal frequent itemset merging; the transaction database construction module is mainly responsible for collecting and preprocessing the data and building the transaction database D; the maximal frequent itemset mining module first screens candidate maximal frequent itemsets using binary search and then mines the maximal frequent itemsets MFS from the transaction database D using the iterative intersection method; the merging module merges the itemsets in MFS so as to recover the real hype groups; the specific steps are:
1) Collect hyped microblog samples
Collecting hyped microblog samples is the first step in carrying out the present invention. The selected microblog samples should be related to one another, for example several microblogs that a given hype account has taken part in, or several microblogs related to a given topic. Whether a microblog counts as a hyped sample should be judged with an existing mature discrimination method or an expert system. There are two ways to collect the samples: one is to use crawler technology, downloading the web pages from the microblog site, parsing the page structure and extracting the information of the accounts that spread the microblog; the other is to use the microblog public open platform, calling the API functions that the microblog operator provides externally to obtain the information of the accounts that spread the microblog;
Depending on what the analysis of the hype accounts to be mined requires, the collected sample content should include the microblog identifier, the microblog account identifier and the basic information of the microblog account;
2) Build the transaction database
The problem of discovering hype groups is converted into the problem of mining maximal frequent itemsets in data mining. On the basis of the collected hyped microblog samples, each hyped microblog corresponds to a transaction and each account that took part in forwarding the microblog corresponds to an item in that transaction; the transaction database is built accordingly;
3) Candidate maximal frequent itemset screening based on binary search
Because each transaction in the hype-microblog transaction database often contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. A binary-search-based method can quickly remove the infrequent items from the transactions, find the candidate set of maximal frequent itemsets, and reduce the scale of the transaction database. Given a transaction database D and a minimum support count S, candidate maximal frequent itemsets are screened as follows:
(1) Sort the transactions in the transaction database D in descending order of the number of items they contain;
(2) Denote the set of frequent items by FI = ∅ and the set of infrequent items by NFI = ∅. Starting from i=1, traverse each transaction Ti (1≤i≤|D|) in D in order, and for each item u in transaction Ti:
a) if u ∈ FI, keep u;
b) if u ∈ NFI, remove u from Ti;
c) if u ∉ FI and u ∉ NFI, go to the next step to determine whether u is a frequent item;
(3) Starting from j=i+1, traverse the remaining transactions and use binary search to determine whether Tj (i<j≤|D|) contains u. The termination conditions are:
a) when the number of transactions containing u reaches S, u is a frequent item, and u is added to FI;
b) when the sum of the number of remaining transactions and the number of transactions containing u is less than S, u is an infrequent item and is removed from Ti; if the number of transactions containing u is greater than 1 at this point, u also appears in transactions other than Ti, so u is added to NFI;
(4) After the infrequent items have been removed from all transactions in D, the reduced transaction database D1 is obtained;
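The screening steps above can be sketched in Python as follows. This is an illustrative reading of the claim, not a definitive implementation: the names `contains` and `screen_candidates` are hypothetical, every infrequent item is recorded in NFI (the claim records it only when it occurs in more than one transaction, which gives the same result), and transactions left empty after pruning are dropped, which the claim does not state.

```python
from bisect import bisect_left


def contains(sorted_items, u):
    """Binary search: True if u occurs in the sorted item list."""
    k = bisect_left(sorted_items, u)
    return k < len(sorted_items) and sorted_items[k] == u


def screen_candidates(transactions, min_support):
    """Remove infrequent items from every transaction (steps (1)-(4))."""
    # Step (1): sort transactions by item count, largest first.
    db = sorted((sorted(set(t)) for t in transactions), key=len, reverse=True)
    frequent, infrequent = set(), set()  # FI and NFI
    reduced = []
    for i, t in enumerate(db):
        kept = []
        for u in t:
            if u in frequent:          # case a): keep u
                kept.append(u)
            elif u in infrequent:      # case b): drop u
                continue
            else:                      # case c): count the support of u
                count = 1              # T_i itself contains u
                for j in range(i + 1, len(db)):
                    if count >= min_support:
                        break          # u is frequent
                    if count + (len(db) - j) < min_support:
                        break          # u can no longer reach S
                    if contains(db[j], u):
                        count += 1
                if count >= min_support:
                    frequent.add(u)
                    kept.append(u)
                else:
                    infrequent.add(u)
        if kept:                       # assumption: drop empty transactions
            reduced.append(kept)
    return reduced
```

An item reaches case c) only at its first occurrence, so counting from T_i onward gives its full support unless one of the two termination conditions stops the scan early.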
4) Maximal frequent itemset mining based on iterative intersection:
Maximal frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database D1 and the minimum support count S, the maximal frequent itemsets are mined as follows:
(1) Sort the transactions in D1 in descending order of the number of items so that maximal frequent itemsets are found as early as possible; to reduce the scale of the transaction database, merge duplicate transactions and record the number of occurrences of each merged transaction;
(2) To reduce the number of intersection operations, for each transaction Ti (1≤i≤|D1|-S+1), starting from i=1, first find the set of transactions that share at least one item with Ti, D_Ti = {Tj | Tj contains at least one item of Ti, j > i}. Intersect Ti with each Tj in turn, move each intersection into a new transaction database D2, and at the same time remove Tj;
(3) For a transaction T in the new transaction database D2, if T was obtained by intersecting no fewer than S transactions, move T into the maximal frequent candidate set MFCS and at the same time remove the sub-transactions of T from D2;
(4) If the number of transactions remaining in D2 is less than S, terminate the processing of D2 and return to the upper-level transaction database; otherwise, repeat this procedure on D2 starting from step (1);
(5) When the number of transactions remaining in D1 is less than S, i.e. i > |D1|-S+1, terminate the processing of the current transaction database D1;
(6) Merge the itemsets in MFCS and remove those that are not maximal frequent itemsets; the final result is the required set of maximal frequent itemsets MFS;
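The idea behind step 4), that every maximal frequent itemset is the intersection of at least S transactions, can be illustrated with the simplified Python sketch below. It is not the recursive D1/D2 procedure of the claim: it swaps in a plain cumulative intersection pass, recounts support directly, and has exponential worst-case cost, so it is suitable only for small examples. The function name `maximal_frequent_itemsets` is hypothetical.

```python
def maximal_frequent_itemsets(transactions, min_support):
    """Return the maximal frequent itemsets of a small transaction list.

    Simplified illustration: collect every non-empty intersection of
    transactions, keep those with support >= min_support, then keep
    only the ones that have no frequent proper superset.
    """
    txs = [frozenset(t) for t in transactions]

    # All non-empty intersections of one or more transactions.
    intersections = set()
    for t in txs:
        new_sets = {t}
        for c in intersections:
            common = c & t
            if common:
                new_sets.add(common)
        intersections |= new_sets

    def support(itemset):
        # Number of transactions that contain the whole itemset.
        return sum(1 for t in txs if itemset <= t)

    frequent = [s for s in intersections if support(s) >= min_support]
    # Keep itemsets that are not a proper subset of another frequent one.
    return [s for s in frequent if not any(s < other for other in frequent)]
```

Sorting, duplicate merging, and the early termination tests of steps (1) to (5) are optimizations that this sketch leaves out.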
5) Maximal frequent itemset merging:
Because of the minimum support count constraint, the maximal frequent itemsets in MFS are relatively small, and some itemsets share a large number of overlapping items; the account groups represented by such itemsets are likely to belong to the same hype group. To solve this problem, an overlap rate is used to reflect the similarity between two itemsets. For itemsets X1, X2 ∈ MFS, the overlap rate of X1 and X2 is defined as:
$$ORate(X_1, X_2) = \frac{|X_1 \cap X_2|}{Min(|X_1|, |X_2|)}$$
In the above formula, |X1∩X2| is the number of items shared by X1 and X2, and Min(|X1|,|X2|) is the number of items in the smaller of the two itemsets. The itemsets are merged as follows:
(1) Sort the maximal frequent itemsets in MFS in descending order of the number of items;
(2) Traverse each maximal frequent itemset in MFS; starting from i=1, for every Xj (i<j≤|MFS|), if ORate(Xi,Xj) ≥ minOR, add the union of Xi and Xj to a new set MMFS and at the same time remove Xj;
(3) Repeat the above two steps on the itemsets in MMFS;
(4) Terminate when the overlap rate of every pair of itemsets in MMFS is less than minOR.
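The overlap rate and the merging loop of step 5) can be sketched in Python as follows. This is one possible reading of the claim: all itemsets Xj that overlap sufficiently with the same Xi are folded into a single union, and the itemsets are assumed to be non-empty. The names `orate` and `merge_itemsets` are hypothetical.

```python
def orate(x1, x2):
    """Overlap rate: shared items divided by the size of the smaller set."""
    return len(x1 & x2) / min(len(x1), len(x2))


def merge_itemsets(mfs, min_or):
    """Merge itemsets until no pair has an overlap rate >= min_or."""
    sets = [frozenset(s) for s in mfs]
    while True:
        # Step (1): largest itemsets first.
        sets.sort(key=len, reverse=True)
        used = [False] * len(sets)
        merged, changed = [], False
        # Step (2): compare X_i with every later X_j.
        for i, xi in enumerate(sets):
            if used[i]:
                continue
            union = set(xi)
            for j in range(i + 1, len(sets)):
                if not used[j] and orate(xi, sets[j]) >= min_or:
                    union |= sets[j]   # merge X_j into the union
                    used[j] = True     # and remove X_j
                    changed = True
            merged.append(frozenset(union))
        sets = merged
        # Steps (3)-(4): repeat until a pass merges nothing.
        if not changed:
            return sets
```

Each pass that merges anything reduces the number of itemsets, so the loop terminates; a pass without merges means every remaining pair has an overlap rate below minOR.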
The microblog hype group discovery method based on maximal frequent itemset mining according to claim 2, characterized in that, in step 1), the collected hype microblog samples should meet the following conditions:
A. select popular microblogs with a relatively high number of forwards;
B. the publication times of the microblogs span less than 180 days; this facilitates the discovery of hype groups.
CN201410188004.7A 2014-05-07 2014-05-07 Microblog hype group discovery method based on maximal frequent itemset mining Expired - Fee Related CN103927398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410188004.7A CN103927398B (en) 2014-05-07 2014-05-07 Microblog hype group discovery method based on maximal frequent itemset mining

Publications (2)

Publication Number Publication Date
CN103927398A (en) 2014-07-16
CN103927398B (en) 2016-12-28

Family

ID=51145617

Country Status (1)

Country Link
CN (1) CN103927398B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102111296A (en) * 2011-01-10 2011-06-29 浪潮通信信息系统有限公司 Mining method for communication alarm association rule based on maximal frequent item set

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9563669B2 (en) * 2012-06-12 2017-02-07 International Business Machines Corporation Closed itemset mining using difference update

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
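Spam user discovery in microblogs based on statistical features and bidirectional voting (original title: 微博中基于统计特征与双向投票的垃圾用户发现); Ding Zhaoyun et al.; Journal of Computer Research and Development; 20131231; pp. 2336-2347 *
A transaction-set iteration algorithm for mining maximal frequent itemsets (original title: 挖掘最大频繁项集的事务集迭代算法); Chen Bo et al.; Computer Engineering and Applications; 20091231; pp. 141-144 *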

Also Published As

Publication number Publication date
CN103927398A (en) 2014-07-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161228
