CN103927398B - Method for discovering microblog hype groups based on maximal frequent itemset mining - Google Patents
Method for discovering microblog hype groups based on maximal frequent itemset mining
- Publication number
- CN103927398B CN201410188004.7A CN103927398A
- Authority
- CN
- China
- Prior art keywords
- microblog
- transaction
- maximal frequent
- frequent itemset
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a method for discovering microblog hype groups based on maximal frequent itemset mining, which effectively addresses the problem of discovering microblog hype groups and preventing false, malicious hype. In the method, using the relevance of hyped microblogs as a clue, the set of accounts that took part in spreading the hyped microblogs is obtained by means of crawler technology or the microblog public open platform. Each microblog is treated as a transaction and each account that took part in spreading it as an item, and a hyped-microblog transaction database is built. For the transactions in the transaction database corresponding to the group of microblogs to be examined, the maximal frequent itemsets contained in the transactions are found, the overlap rate between maximal frequent itemsets is calculated, and smaller itemsets are merged into larger ones. The number of intersection operations is reduced, and when transactions are intersected, binary search is used to decide whether a transaction contains a given item, which improves the efficiency of maximal frequent itemset mining and leads to the discovery of microblog hype groups. The method is simple, can accurately discover malicious microblog hype groups, and helps prevent harmful effects on society.
Description
Technical field
The present invention relates to the field of microblog public opinion monitoring, and in particular to a method for discovering microblog hype groups based on maximal frequent itemset mining.
Background art
The microblog, as an emerging form of social media, combines the functions of blogs, media and instant messaging. Its immediacy, grassroots character, mobility and interactivity make it a natural carrier for the spread of online public opinion. In online public opinion, the microblog has not only become a center and channel of opinion propagation, but also takes part in the formation, development and steering of public opinion.
Microblog propagation is a double-edged sword. On the one hand, the microblog provides a fast-response platform for disclosing information about social events, and to some extent makes up for the shortcomings of traditional media and other network tools. On the other hand, unlike traditional news media, the news it publishes is repetitive and its authenticity cannot be guaranteed, so it may be exploited as a carrier of rumors and a trigger for spreading discontent, with consequences that can even harm national security and social stability. False information on the network originates with its creators and spreads through its disseminators.
A social computing research team at Hewlett-Packard stated in a recent report that Sina Weibo has a serious topic-hyping problem: about half of the reposts of hot-topic microblogs are sent by hype users. The research found that the number of artificially manipulated false reposts in hot-topic propagation is high, with 1% of spam senders generating 49% of the reposts. Since August 2013, government departments have stepped up efforts to guide online public opinion. Judging from the results of investigations into online promotion companies such as those of "Qin Huohuo" and "Li'er Chaisi", a large number of organized promoter teams exist on the network. They collude with a small number of "opinion leaders" and organize an online "water army" to fabricate false news over long periods, deliberately distort facts, stir up trouble online and confuse right and wrong, seriously disrupting the order of online public opinion. Their behavior has drawn close attention from the national public opinion authorities, and the persons involved have been placed in criminal detention according to law on suspicion of crimes.
Therefore, in the new-media environment and in the face of various covert opinion-manipulation behaviors, identifying hyped microblogs, analyzing the characteristics of the groups that hype them, collecting evidence for identifying false promotion behavior and screening out artificially manufactured hot topics have important theoretical value and practical significance for discovering, predicting and guiding online public opinion, improving the government's capacity to supervise public opinion, and maintaining social harmony and stability.
With the explosive growth of microblogs, research on microblog accounts has attracted broad interest from scholars in China and abroad, and some results have been published in recent years at major conferences such as WWW and KDD. Current research on microblog accounts falls roughly into three categories: 1) feature analysis, including account attribute features and behavioral features; 2) influence analysis, including the construction of influence evaluation systems and measurement methods; 3) analysis of relationship networks between accounts, including the basic attributes, generation and evolution of account relationship networks.
However, there is relatively little literature in China or abroad on the discovery of hype groups. The main related work concerns the identification of spam accounts (spammers), sockpuppet accounts and zombie accounts. A spam account is one that frequently posts junk information; Z. Yi et al. analyzed the features of spam accounts from several angles and used machine learning to identify them automatically. Chao Yang et al. analyzed in depth the social relationships between spam accounts and proposed a method for finding spam accounts according to the closeness between accounts. A sockpuppet account is a false account created by registering multiple accounts in order to post, repost and comment; Xueling Zheng et al. proposed a method that identifies sockpuppet accounts using text content and similarity matching. A zombie account is one maliciously registered for buying and selling followers; Fang Ming et al. proposed an intelligent classification method based on feature extraction from microblog registered account names, which achieves high accuracy. These methods, however, do not solve the problem of how to discover microblog hype groups and prevent false hype. The biggest difference between hype accounts and the account types above is that a hype account is defined by its "hyping" behavior: the accounts taking part in hype are more dispersed, the direct relationships between them are not obvious, and they are more covert and more organized, which makes them harder to discover.
On the surface, group hype resembles ordinary microblog use: the posting, reposting and commenting of the hype crowd appear to be isolated behaviors. Abnormal malicious propagation, however, is usually not the behavior of a single person but organized group behavior, and this group behavior is hidden and hard to detect. How to discover microblog hype groups, and thereby prevent false, malicious hype from causing adverse effects on society and unnecessary economic loss, is therefore a technical problem that urgently needs to be solved.
Summary of the invention
In view of the above situation, and to overcome the defects of the prior art, the purpose of the present invention is to provide a method for discovering microblog hype groups based on maximal frequent itemset mining, which effectively addresses the problem of discovering microblog hype groups and preventing false, malicious hype.
The technical solution of the present invention is that the method for discovering microblog hype accounts based on maximal frequent itemset mining comprises the following steps:
(1) Collection of hyped-microblog samples: using the relevance of hyped microblogs as a clue, obtain the set of accounts that took part in spreading the hyped microblogs by means of crawler technology or the microblog public open platform;
(2) Construction of the transaction database: treat each microblog as a transaction and each account that took part in spreading it as an item, and build the hyped-microblog transaction database;
(3) Maximal frequent itemset mining: for the transactions in the transaction database corresponding to the group of microblogs to be examined, use the iterative intersection method to find the maximal frequent itemsets contained in the transactions, obtaining a set of maximal frequent itemsets;
Because each transaction in the hyped-microblog transaction database usually contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. Binary search is therefore used to quickly remove infrequent items from the transactions, find the candidate sets for maximal frequent itemsets and reduce the scale of the transaction database;
(4) Maximal frequent itemset merging: for each maximal frequent itemset, calculate the overlap rate between itemsets and merge the maximal frequent itemsets, merging smaller itemsets into larger ones wherever possible while ensuring that the accounts in a merged itemset still have a certain degree of association. By reducing the scale of the transaction database, the number of intersection operations is reduced, and when transactions are intersected, binary search is used to decide whether a transaction contains a given item, so as to improve the efficiency of maximal frequent itemset mining and thereby discover microblog hype groups.
The method of the invention is simple and easy to operate, can accurately discover malicious microblog hype groups, helps prevent adverse effects on society and unnecessary economic loss, and has practical application value.
Brief description of the drawings
Fig. 1 is a flow block diagram of the present invention.
Fig. 2 is a schematic diagram of the hyped-microblog transaction database of the present invention.
Fig. 3 is a screenshot of the hyped-microblog transaction database of the present invention.
Fig. 4 compares the execution time of the algorithm of the present invention on the Mushroom data set.
Fig. 5 compares the execution time of the algorithm of the present invention on the hyped-microblog data set.
Fig. 6 shows how the number of itemsets in the MFS changes.
Fig. 7 shows how the maximum length of the itemsets in the MFS changes.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the present invention comprises three parts: construction of the hyped-microblog transaction database, maximal frequent itemset mining, and maximal frequent itemset merging. The transaction database construction module is mainly responsible for collecting and preprocessing data and building the transaction database D. The maximal frequent itemset mining module first screens candidate maximal frequent itemsets using binary search, and then mines the maximal frequent itemsets MFS from the transaction database D using the iterative intersection method. The maximal frequent itemset merging module merges the itemsets in the MFS so as to recover the actual hype groups as far as possible. The specific steps are:
1) Collect hyped-microblog samples
Collecting hyped-microblog samples is the initial step in carrying out the present invention. The selected microblog samples should be related to one another, for example several microblogs that a particular hype account has taken part in, or several microblogs related to a particular topic. Whether a microblog counts as a sample should be judged with reference to existing mature discrimination methods or expert systems. There are two ways to collect hyped-microblog samples: one is to use crawler technology to download web pages from the microblog site, parse the page structure and extract the information of the accounts that spread the microblog; the other is to use the microblog public open platform, calling the API functions that the microblog operator provides externally to obtain the information of the accounts that spread the microblog. To make hype groups easier to discover, the following principles should also be followed when choosing hyped-microblog samples:
A. choose popular microblogs with a relatively high number of reposts;
B. the span of microblog posting times should be less than 180 days;
According to the analysis requirements of the algorithm for mining hype accounts, the collected sample content should include the microblog identification number, the microblog account identification number, and the basic information of the microblog account;
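The two selection principles can be illustrated with a minimal Python sketch. The record fields (`mid`, `reposts`, `posted`), the threshold `min_reposts` and the reading of principle B as "keep posts within 180 days of the newest popular post" are all assumptions made for illustration; the text does not fix a repost threshold or a data format.

```python
from datetime import date, timedelta

def select_samples(posts, min_reposts=1000, max_span_days=180):
    """Keep popular posts whose dates lie within max_span_days of the newest popular post.

    min_reposts is a hypothetical threshold; the text only asks for a
    "relatively high" repost count.
    """
    popular = [p for p in posts if p["reposts"] >= min_reposts]
    if not popular:
        return []
    newest = max(p["posted"] for p in popular)
    cutoff = newest - timedelta(days=max_span_days)
    # Strict comparison keeps the span between oldest and newest kept post below the limit.
    return [p for p in popular if p["posted"] > cutoff]

# Hypothetical sample records.
posts = [
    {"mid": "m1", "reposts": 5200, "posted": date(2014, 3, 1)},
    {"mid": "m2", "reposts": 40, "posted": date(2014, 3, 5)},
    {"mid": "m3", "reposts": 8100, "posted": date(2013, 6, 1)},
    {"mid": "m4", "reposts": 3000, "posted": date(2014, 1, 15)},
]
selected = [p["mid"] for p in select_samples(posts)]
print(selected)
```

Here `m2` is dropped for its low repost count and `m3` because it falls outside the 180-day window.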
2) Build the transaction database
The problem of discovering hype groups is converted into maximal frequent itemset mining in data mining. On the basis of the collected hyped-microblog samples, each hyped microblog corresponds to a transaction and each account that took part in reposting the microblog corresponds to an item in that transaction, and the transaction database is built as shown in Fig. 2;
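The mapping from microblogs to transactions can be sketched in a few lines of Python. The input format, a list of `(microblog_id, account_id)` pairs, is an assumption; the identifiers below are made up. Items are stored sorted so that later steps can test membership by binary search.

```python
from collections import defaultdict

def build_transactions(repost_records):
    """Map each microblog ID to the sorted, de-duplicated account IDs that reposted it."""
    items = defaultdict(set)
    for mid, uid in repost_records:
        items[mid].add(uid)  # an account reposting the same microblog twice counts once
    return {mid: sorted(uids) for mid, uids in items.items()}

# Hypothetical (microblog ID, account UID) pairs.
records = [("m1", 1003), ("m1", 1001), ("m2", 1001), ("m1", 1003), ("m2", 1002)]
db = build_transactions(records)
print(db)
```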
3) Screen candidate maximal frequent itemsets using binary search
Because each transaction in the hyped-microblog transaction database usually contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. A method based on binary search can quickly remove the infrequent items from the transactions, find the candidate sets for maximal frequent itemsets and reduce the scale of the transaction database. Given the transaction database D and the minimum support count S, candidate maximal frequent itemsets are screened as follows:
(1) Sort the transactions in D by number of items in descending order;
(2) Denote the set of frequent items by FI and the set of infrequent items by NFI, both initially empty. Starting from i=1, traverse each transaction Ti (1≤i≤|D|) in D in order, and for each item u in Ti:
a) if u ∈ FI, keep u;
b) if u ∈ NFI, remove u from Ti;
c) if u ∉ FI and u ∉ NFI, go to the next step to determine whether u is a frequent item;
(3) Starting from j=i+1, traverse the remaining transactions and use binary search to determine whether Tj (i < j ≤ |D|) contains u. The termination conditions are:
a) when the number of transactions containing u reaches S, u is a frequent item, and u is added to FI;
b) when the number of remaining transactions plus the number of transactions containing u is less than S, u is an infrequent item, and u is removed from Ti. If the number of transactions containing u is greater than 1 at this point, u also appears in transactions other than Ti, so u is added to NFI;
(4) After the infrequent items have been removed from all transactions in D, the reduced transaction database D1 is obtained;
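The screening steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the patented implementation: it assumes each transaction is a sorted list of item IDs, it adds every infrequent item to NFI (the text adds only those seen more than once, which affects caching but not the result), and it drops transactions that end up empty.

```python
from bisect import bisect_left

def contains(sorted_items, u):
    """Binary-search membership test on a sorted list."""
    i = bisect_left(sorted_items, u)
    return i < len(sorted_items) and sorted_items[i] == u

def prune_infrequent(transactions, s):
    """Remove items that occur in fewer than s transactions."""
    db = sorted(transactions, key=len, reverse=True)  # step (1)
    fi, nfi = set(), set()                            # frequent / infrequent items
    reduced = []
    for i, t in enumerate(db):                        # step (2)
        kept = []
        for u in t:
            if u in fi:
                kept.append(u)
                continue
            if u in nfi:
                continue
            count = 1                                 # u occurs in t itself
            for j in range(i + 1, len(db)):           # step (3)
                if count >= s:
                    break                             # u is already known to be frequent
                if count + (len(db) - j) < s:
                    break                             # u can no longer reach s
                if contains(db[j], u):
                    count += 1
            if count >= s:
                fi.add(u)
                kept.append(u)
            else:
                nfi.add(u)
        reduced.append(kept)
    return [t for t in reduced if t]                  # step (4), empty transactions dropped

d1 = prune_infrequent([[1, 2, 3, 4], [1, 2, 3], [1, 2, 5], [2, 6]], 3)
print(d1)
```

With a minimum support count of 3, only items 1 and 2 survive in this toy database.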
4) Mine maximal frequent itemsets by iterative intersection:
Maximal frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database D1 and the minimum support count S, maximal frequent itemsets are mined as follows:
(1) Sort the transactions in D1 by number of items in descending order, so that maximal frequent itemsets are found as early as possible; to reduce the scale of the transaction database, merge the duplicate transactions in it and keep a count for each transaction;
(2) To reduce the number of intersections, for transaction Ti, 1≤i≤|D1|-S+1, starting from i=1, first find the set of transactions that contain any item of Ti, {Tj | Tj contains at least one item of Ti; j>i}. Intersect Ti with each such Tj in turn, move the intersection of the two into a new transaction database D2, and at the same time remove Tj;
(3) For a transaction T in the new transaction database D2, if T was obtained by intersecting no fewer than S transactions, move T into the maximal frequent candidate set MFCS and at the same time remove the subsets of T from D2;
(4) If the number of remaining transactions in the new transaction database D2 is less than S, stop processing D2 and return to the transaction database one level up; otherwise, repeat this process on D2 starting from step (1);
(5) When the number of remaining transactions in D1 is less than S, i.e. i>|D1|-S+1, stop processing the current transaction database D1;
(6) Merge the itemsets in MFCS and remove those that are not maximal frequent itemsets; the final result is the required set of maximal frequent itemsets MFS;
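For checking results on small inputs, the idea that frequent itemsets arise as intersections of transactions can be illustrated with a plain cumulative-intersection baseline in Python. This is a simplified stand-in, not the recursive D1/D2 procedure described above: it omits the sorting, duplicate merging and early-termination optimizations, and it can be exponentially slow on large data.

```python
def maximal_frequent_itemsets(transactions, s):
    """Return the maximal itemsets contained in at least s transactions."""
    closed = {}  # intersection of transactions (frozenset) -> support count so far
    for t in map(frozenset, transactions):
        if not t:
            continue
        candidates = {t: 1}
        for c, sup in closed.items():
            inter = c & t
            if inter:
                # t supports every itemset it shares with an earlier closed itemset
                candidates[inter] = max(candidates.get(inter, 0), sup + 1)
        for itemset, sup in candidates.items():
            closed[itemset] = max(closed.get(itemset, 0), sup)
    frequent = [c for c, sup in closed.items() if sup >= s]
    # keep only itemsets that are not proper subsets of another frequent itemset
    return [c for c in frequent if not any(c < other for other in frequent)]

mfs = maximal_frequent_itemsets(
    [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}, {2, 4}, {2, 4, 5}], 2)
print(sorted(sorted(x) for x in mfs))
```

In this toy database, accounts 1, 2 and 3 reposted two microblogs together and accounts 2 and 4 reposted three together, so both sets are reported as maximal frequent itemsets.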
5) Merge maximal frequent itemsets:
Because of the minimum support count constraint, the maximal frequent itemsets in the MFS are relatively small, and some itemsets share a large number of overlapping items; the account groups represented by such itemsets probably belong to the same hype group. To address this, the overlap rate is used to reflect the similarity between two itemsets. For itemsets X1, X2 ∈ MFS, the overlap rate of X1 and X2 is denoted:
$$\mathrm{ORate}(X_1, X_2) = \frac{|X_1 \cap X_2|}{\mathrm{Min}(|X_1|, |X_2|)}$$
In the formula, |X1∩X2| is the number of items shared by X1 and X2, and Min(|X1|,|X2|) is the number of items in the smaller of the two itemsets. Itemsets are merged as follows:
(1) Sort the maximal frequent itemsets in the MFS by number of items in descending order;
(2) Traverse each maximal frequent itemset in the MFS, starting from i=1. For Xi, Xj ∈ MFS with i < j ≤ |MFS|, if ORate(Xi,Xj)≥minOR, add the union of Xi and Xj to a new set MMFS and at the same time remove Xj;
(3) Repeat the two steps above for the itemsets in MMFS;
(4) Terminate when the overlap rate of every pair of itemsets in MMFS is less than minOR.
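The merging step can be sketched in Python under the assumption that the overlap rate is the number of shared items divided by the size of the smaller itemset, as the definitions of |X1∩X2| and Min(|X1|,|X2|) suggest. The function names and the greedy "absorb into the first sufficiently overlapping larger set" strategy are illustrative choices; itemsets are assumed to be non-empty.

```python
def overlap_rate(x1, x2):
    """Shared items divided by the size of the smaller itemset."""
    return len(x1 & x2) / min(len(x1), len(x2))

def merge_itemsets(mfs, min_or=0.5):
    """Repeatedly merge itemsets until every pair overlaps by less than min_or."""
    groups = sorted((set(x) for x in mfs), key=len, reverse=True)
    merged_any = True
    while merged_any:
        merged_any = False
        result = []
        for x in groups:
            for g in result:
                if overlap_rate(g, x) >= min_or:
                    g |= x              # absorb the smaller itemset into the larger one
                    merged_any = True
                    break
            else:
                result.append(x)        # no sufficiently similar group so far
        groups = sorted(result, key=len, reverse=True)
    return groups

merged = merge_itemsets([{1, 2, 3, 4}, {3, 4, 5}, {7, 8}, {8, 9, 10}, {20}])
print(merged)
```

With minOR = 0.5, {3, 4, 5} shares two of its three items with {1, 2, 3, 4} and is absorbed, {7, 8} shares one of its two items with {8, 9, 10} and is absorbed, and {20} stays on its own.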
The method of the invention is simple and easy to operate. Practical trials show that the method is stable and reliable and has practical application value. The relevant information is as follows:
1) Data sets
Sina Weibo was used as the research platform, with 81 microblogs suspected of being hyped as the objects of study. The number of accounts that actually took part in reposting them is 380,726 (accounts that took part repeatedly are not counted again), and the average number of items per transaction is 6,286. Most of these microblogs are advertising and marketing posts, and several hype groups may have taken part in their propagation. A crawler program was used to fetch the account identifiers (UIDs) of all accounts that reposted these microblogs and store them in the transaction database; the format of part of the data is shown in Fig. 3.
To verify the efficiency of the algorithm of the present invention (hereinafter IIA) when applied to maximal frequent itemset mining, a performance test was also run on the classic Mushroom data set and compared with known methods. This data set contains 8,124 records; each record has 23 items, recording 23 attributes of a mushroom.
2) Performance evaluation
The performance of the method was evaluated first. The experimental environment was 4 GB of memory, a 2.0 GHz dual-core Core T5800 CPU and the Windows 7 32-bit operating system. The algorithm was implemented in Java and compared with the classic MAFIA algorithm and the DFMFI algorithm.
Fig. 4 shows the execution of the three algorithms on the Mushroom data set under different supports. The efficiency of this method is clearly higher than that of the other two algorithms, and it retains an advantage even when the minimum support is very low. Fig. 5 shows the execution of the three algorithms on the hyped-microblog data set, where this method again has the highest execution efficiency.
3) Parameter threshold selection
Fig. 6 and Fig. 7 show the maximal frequent itemsets found in the hyped-microblog data set under different minimum support counts: Fig. 6 shows how the number of maximal frequent itemsets changes with the minimum support count, and Fig. 7 shows how the maximum length of the maximal frequent itemsets changes with it. In the context of this research, the larger minSup (the minimum support count) is set, the stronger the suspicion of hype for the account groups found, but the size and number of groups decrease accordingly; conversely, the smaller minSup is set, the weaker the suspicion of hype for the account groups found, but the size and number of groups increase. A reasonable threshold therefore has to be set for minSup in order to find groups of reasonable size with a high suspicion of hype.
On the other hand, when the itemsets in the set of maximal frequent itemsets are merged, the setting of minOR directly affects the size of the merged itemsets. After repeated analysis of the data, minOR was set to 50%, that is, two itemsets are merged when more than half of the items of the smaller itemset are shared.
To further determine the value of minSup, Table 1 lists the results of merging the maximal frequent itemsets for minSup = 3, 4 and 5, sorted by the length of the merged itemsets; only the first 8 itemsets (suspected hype groups) are listed. As the table shows, for minSup = 3 and 5 the first itemset is very large while the other itemsets are small, whereas for minSup = 4 the itemset sizes do not change drastically and are of suitable scale, indicating that this value is relatively reasonable.
Table 1. Sizes of the merged maximal frequent itemsets under different minimum support counts
No. | minSup=3 | minSup=4 | minSup=5 |
1 | 14,863 | 2,623 | 963 |
2 | 311 | 1,755 | 65 |
3 | 156 | 688 | 29 |
4 | 77 | 410 | 19 |
5 | 59 | 129 | 9 |
6 | 56 | 98 | 9 |
7 | 55 | 82 | 7 |
8 | 55 | 54 | 5 |
4) Accuracy analysis
To verify the accuracy of the hype group discovery algorithm of the present invention, i.e. the proportion of actual hype accounts in the hype groups found, the results were verified by combining an existing hype account identification method based on multi-feature analysis with manual labeling. Let the hype group to be verified be H. First, the existing multi-feature hype account identification method is applied to each account, and the resulting set of hype accounts is denoted H1. Then, the remaining accounts are judged by manual labeling, and the resulting set of hype accounts is denoted H2. The accuracy of hype group H is calculated as:
$$\mathrm{Precision}(H) = \frac{|H_1| + |H_2|}{|H|}$$
In the formula, |H| is the number of accounts in H and |H1|+|H2| is the number of actual hype accounts in H. The groups in Table 1 with minSup = 4 and group size (i.e. itemset length) greater than 100 were verified; the results are shown in Table 2.
Table 2. Accuracy of hype group discovery (minSup=4)
No. | |H1| | |H2| | |H| | Precision |
1 | 2,016 | 451 | 2,623 | 94.1% |
2 | 1,465 | 163 | 1,755 | 92.8% |
3 | 571 | 78 | 688 | 94.3% |
4 | 354 | 33 | 410 | 94.4% |
5 | 109 | 10 | 129 | 92.2% |
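The Precision column of Table 2 can be reproduced from the other three columns, assuming the accuracy of a group H is (|H1| + |H2|) / |H| as the definitions of |H|, |H1| and |H2| indicate. The short Python check below uses only the figures from Table 2; the function name is illustrative.

```python
# (|H1|, |H2|, |H|) for the five groups in Table 2
rows = [(2016, 451, 2623), (1465, 163, 1755), (571, 78, 688),
        (354, 33, 410), (109, 10, 129)]

def precision(h1, h2, h):
    """Share of accounts in the group confirmed as hype accounts."""
    return (h1 + h2) / h

percentages = [round(100 * precision(*r), 1) for r in rows]
print(percentages)  # [94.1, 92.8, 94.3, 94.4, 92.2]
```

The computed values match the table, which supports this reading of the accuracy formula.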
As Table 2 shows, in every hype group found by this method the proportion of actual hype accounts is above 90%, which indicates that the method can also identify relatively well-hidden hype accounts (i.e. H2); these are often highly influential big-name accounts that take part in hype only occasionally. The present invention therefore has practical application value and brings economic and social benefits.
Claims (3)
1. A method for discovering microblog hype groups based on maximal frequent itemset mining, characterized in that it comprises the following steps:
(1) Collection of hyped-microblog samples: using the relevance of hyped microblogs as a clue, obtain the set of accounts that took part in spreading the hyped microblogs by means of crawler technology or the microblog public open platform;
(2) Construction of the transaction database: treat each microblog as a transaction and each account that took part in spreading it as an item, and build the hyped-microblog transaction database;
(3) Maximal frequent itemset mining: for the transactions in the transaction database corresponding to the group of microblogs to be examined, use the iterative intersection method to find the maximal frequent itemsets contained in the transactions, obtaining a set of maximal frequent itemsets;
because each transaction in the hyped-microblog transaction database usually contains tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm; binary search is used to quickly remove infrequent items from the transactions, find the candidate sets for maximal frequent itemsets and reduce the scale of the transaction database;
(4) Maximal frequent itemset merging: for each maximal frequent itemset, calculate the overlap rate between itemsets and merge the maximal frequent itemsets, merging smaller itemsets into larger ones while ensuring that the accounts in a merged itemset still have a certain degree of association; by reducing the scale of the transaction database, the number of intersection operations is reduced, and when transactions are intersected, binary search is used to decide whether a transaction contains a given item, so as to improve the efficiency of maximal frequent itemset mining and thereby discover microblog hype groups.
2. The microblog hype group discovery method based on maximal frequent itemset mining according to claim 1, characterized in that it comprises three parts: hype microblog transaction database construction, maximal frequent itemset mining, and maximal frequent itemset merging. The transaction database construction module is mainly responsible for collecting and preprocessing data and for building the transaction database D; the maximal frequent itemset mining module first screens candidate maximal frequent itemsets using binary search and then mines the maximal frequent itemsets MFS from the transaction database D using the iterative intersection method; the maximal frequent itemset merging module merges the itemsets in MFS to recover the actual hype groups. The specific steps are:
1) Collecting hype microblog samples
Collecting hype microblog samples is the initial step in implementing the invention. The selected samples should be related to one another, for example several microblogs that a given hype account has participated in, or several microblogs on a given topic; whether a microblog qualifies as a hype sample should be judged with existing mature discrimination methods or expert systems. Samples can be collected in two ways: one is to use crawler technology to download microblog pages, parse the page structure, and extract information on the accounts that propagated each microblog; the other is to use the microblog open platform and call the API functions provided by the microblog operator to obtain information on the propagating accounts;
According to the requirements of the algorithm that analyzes the propagating accounts to be mined, the collected sample data should include the microblog ID, the microblog account ID, and the basic information of the microblog account;
2) Building the transaction database
The hype group discovery problem is converted into the maximal frequent itemset mining problem of data mining: on the basis of the collected hype microblog samples, each hype microblog corresponds to a transaction and each account that took part in forwarding that microblog corresponds to an item in the transaction, and the transaction database is built accordingly;
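The mapping from forwarding records to transactions can be illustrated with a minimal sketch. The function name `build_transaction_db` and the `(microblog_id, account_id)` record layout are assumptions made for this example and do not come from the patent text.

```python
from collections import defaultdict


def build_transaction_db(forward_records):
    """Group forwarding records into transactions.

    forward_records: iterable of (microblog_id, account_id) pairs
    (a hypothetical record layout chosen for illustration).
    Returns a dict mapping each microblog ID (one transaction) to the
    sorted list of distinct account IDs (its items).
    """
    accounts_by_microblog = defaultdict(set)
    for microblog_id, account_id in forward_records:
        accounts_by_microblog[microblog_id].add(account_id)
    # Items are stored sorted so that later steps can use binary search.
    return {mid: sorted(accounts) for mid, accounts in accounts_by_microblog.items()}


records = [("m1", "a3"), ("m1", "a1"), ("m2", "a1"), ("m2", "a2"), ("m1", "a1")]
db = build_transaction_db(records)
```

Using a set per microblog removes repeated forwards by the same account, so each account appears at most once per transaction.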
3) Screening candidate maximal frequent itemsets with binary search
Because each transaction in the hype microblog transaction database can contain tens of thousands of items, mining maximal frequent itemsets directly in the original transaction database would impair the efficiency of the algorithm. A method based on binary search can quickly remove the infrequent items from the transactions, find the candidate set for the maximal frequent itemsets, and reduce the size of the transaction database. Given the transaction database D and the minimum support count S, candidate maximal frequent itemsets are screened as follows:
(1) Sort the transactions in D in descending order of item count;
(2) Let the frequent item set $FI = \emptyset$ and the infrequent item set $NFI = \emptyset$. Starting from $i = 1$, traverse each transaction $T_i$ ($1 \le i \le |D|$) in D in order, and for each item $u$ in $T_i$:
a) if $u \in FI$, keep $u$;
b) if $u \in NFI$, remove $u$ from $T_i$;
c) if $u \notin FI$ and $u \notin NFI$, go to the next step to determine whether $u$ is a frequent item;
(3) Starting from $j = i + 1$, traverse the remaining transactions and use binary search to decide whether $T_j$ ($i < j \le |D|$) contains $u$. The termination conditions are:
a) when the number of transactions containing $u$ reaches S, $u$ is a frequent item and is added to FI;
b) when the number of remaining transactions plus the number of transactions found to contain $u$ is less than S, $u$ is an infrequent item and is removed from $T_i$; if the number of transactions containing $u$ is at that point greater than 1, $u$ also occurs in transactions other than $T_i$, so $u$ is added to NFI;
(4) Once the infrequent items have been removed from all transactions in D, the reduced transaction database $D_1$ is obtained;
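The screening steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `contains` and `prune_infrequent` are hypothetical, each transaction is assumed to be a sorted list of item IDs (binary search requires sorted input), every infrequent item is recorded in NFI regardless of its count, and transactions left empty after pruning are dropped.

```python
from bisect import bisect_left


def contains(sorted_items, u):
    """Binary search: True if u occurs in the sorted list."""
    pos = bisect_left(sorted_items, u)
    return pos < len(sorted_items) and sorted_items[pos] == u


def prune_infrequent(transactions, min_support):
    """Remove items that occur in fewer than min_support transactions."""
    db = sorted(transactions, key=len, reverse=True)  # step (1)
    frequent, infrequent = set(), set()               # FI and NFI
    reduced = []
    for i, t in enumerate(db):
        kept = []
        for u in t:
            if u in frequent:
                kept.append(u)
                continue
            if u in infrequent:
                continue
            count = 1  # T_i itself contains u
            for j in range(i + 1, len(db)):
                if count >= min_support:
                    break
                # Even if all remaining transactions held u, S is out of reach.
                if count + (len(db) - j) < min_support:
                    break
                if contains(db[j], u):
                    count += 1
            if count >= min_support:
                frequent.add(u)
                kept.append(u)
            else:
                infrequent.add(u)
        if kept:  # assumption: empty transactions are discarded
            reduced.append(kept)
    return reduced


d1 = prune_infrequent([["a", "b", "c"], ["a", "b"], ["a", "c", "d"], ["b", "e"]], 2)
```

In this example the items `d` and `e` each occur in only one transaction and are removed, while `a`, `b`, and `c` are kept.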
4) Mining maximal frequent itemsets by iterative intersection:
Maximal frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database $D_1$ and the minimum support count S, the mining method is as follows:
(1) Sort the transactions in $D_1$ in descending order of item count, so that maximal frequent itemsets are found as early as possible; to reduce the size of the database, merge duplicate transactions and record how many times each occurs;
(2) To reduce the number of intersections, for each transaction $T_i$, $1 \le i \le |D_1| - S + 1$, starting from $i = 1$, first find the set of transactions that share an item with $T_i$, $D_{T_i} = \{T_j \mid T_j \text{ contains at least one item of } T_i,\ j > i\}$; intersect $T_i$ with each $T_j$ in turn, move each intersection into a new transaction database $D_2$, and remove $T_j$;
(3) For a transaction T in the new database $D_2$, if T was obtained by intersecting no fewer than S transactions, move T into the maximal frequent candidate set MFCS and remove the subsets of T from $D_2$;
(4) If the number of transactions remaining in $D_2$ is less than S, finish processing $D_2$ and return to the transaction database one level up; otherwise, repeat this process on $D_2$ starting from step (1);
(5) When the number of transactions remaining in $D_1$ is less than S, i.e. $i > |D_1| - S + 1$, finish processing the current database $D_1$;
(6) Merge the itemsets in MFCS while removing those that are not maximal; the final result is the required set of maximal frequent itemsets MFS;
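The idea of finding maximal frequent itemsets by intersecting transactions can be illustrated with a deliberately simplified sketch. It is not the layered $D_1$/$D_2$ procedure described above: it enumerates every non-empty intersection of transactions, counts support by direct scanning, and keeps the frequent intersections that have no frequent proper superset. The function name `maximal_frequent_itemsets` is hypothetical, and the approach is only practical for small inputs because the number of intersections can grow exponentially.

```python
def maximal_frequent_itemsets(transactions, min_support):
    """Simplified intersection-based mining of maximal frequent itemsets."""
    db = [frozenset(t) for t in transactions]

    # Every maximal frequent itemset equals the intersection of the
    # transactions that contain it, so collecting all intersections
    # of transactions covers every candidate.
    intersections = set()
    for t in db:
        intersections |= {t} | {c & t for c in intersections}
    intersections.discard(frozenset())

    # Keep intersections supported by at least min_support transactions.
    frequent = [
        c for c in intersections
        if sum(1 for t in db if c <= t) >= min_support
    ]

    # Drop itemsets that have a frequent proper superset.
    return [x for x in frequent if not any(x < y for y in frequent)]


mfs = maximal_frequent_itemsets(
    [["a", "b", "c"], ["a", "c"], ["a", "b"], ["b"]], 2
)
```

With a minimum support count of 2, the example yields the itemsets {a, b} and {a, c}; {a, b, c} is excluded because only one transaction contains it.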
5) Merging maximal frequent itemsets:
Because of the minimum support count constraint, the maximal frequent itemsets in MFS are relatively small, and some itemsets share a large number of items; the account groups that these itemsets represent probably belong to the same hype group. To address this, the overlap rate is used to measure the similarity between two itemsets. For itemsets $X_1, X_2 \in MFS$, the overlap rate of $X_1$ and $X_2$ is written as:
$$ORate(X_1, X_2) = \frac{|X_1 \cap X_2|}{Min(|X_1|, |X_2|)}$$
In the formula above, $|X_1 \cap X_2|$ is the number of items common to $X_1$ and $X_2$, and $Min(|X_1|, |X_2|)$ is the number of items in the smaller itemset. Itemsets are merged as follows:
(1) Sort the maximal frequent itemsets in MFS in descending order of item count;
(2) Traverse each maximal frequent itemset in MFS starting from $i = 1$: for $X_i$ and $X_j$ ($i < j \le |MFS|$), if $ORate(X_i, X_j) \ge minOR$, add the union of $X_i$ and $X_j$ to a new set MMFS and remove $X_j$;
(3) Repeat the two steps above on the itemsets in MMFS;
(4) Stop when the overlap rate of every pair of itemsets in MMFS is less than minOR.
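The overlap rate and the merging loop can be sketched as follows. The names `overlap_rate` and `merge_itemsets` are hypothetical, itemsets are assumed to be non-empty, and, as a simplification of step (2), an itemset that absorbs another is compared with the remaining itemsets in its enlarged form within the same pass.

```python
def overlap_rate(x, y):
    """ORate(X1, X2) = |X1 ∩ X2| / min(|X1|, |X2|) for non-empty sets."""
    return len(x & y) / min(len(x), len(y))


def merge_itemsets(itemsets, min_or):
    """Repeatedly merge itemsets whose overlap rate reaches min_or."""
    groups = [set(s) for s in itemsets]
    merged_any = True
    while merged_any:
        merged_any = False
        groups.sort(key=len, reverse=True)  # largest itemsets first
        result = []
        while groups:
            base = groups.pop(0)
            rest = []
            for other in groups:
                if overlap_rate(base, other) >= min_or:
                    base |= other           # fold the smaller set into base
                    merged_any = True
                else:
                    rest.append(other)
            groups = rest
            result.append(base)
        groups = result
    return groups


groups = merge_itemsets(
    [{"a", "b", "c", "d"}, {"b", "c", "d", "e"}, {"x", "y"}, {"y", "z"}], 0.6
)
```

Here the first two itemsets share three of four items (overlap rate 0.75) and are merged, while {x, y} and {y, z} have an overlap rate of 0.5 and stay separate at a threshold of 0.6.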
3. The microblog hype group discovery method based on maximal frequent itemset mining according to claim 2, characterized in that, in step 1), the collected hype microblog samples should satisfy the following conditions:
A. select popular microblogs with relatively high forward counts;
B. the span of microblog publication times is less than 180 days, which facilitates the discovery of hype groups.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410188004.7A CN103927398B (en) | 2014-05-07 | 2014-05-07 | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410188004.7A CN103927398B (en) | 2014-05-07 | 2014-05-07 | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103927398A CN103927398A (en) | 2014-07-16 |
CN103927398B true CN103927398B (en) | 2016-12-28 |
Family
ID=51145617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410188004.7A Expired - Fee Related CN103927398B (en) | 2014-05-07 | 2014-05-07 | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103927398B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550175B (en) * | 2014-10-28 | 2019-03-01 | 阿里巴巴集团控股有限公司 | The recognition methods of malice account and device |
CN104516978B (en) * | 2014-12-31 | 2018-11-27 | 天津南大通用数据技术股份有限公司 | The method of compression intermediate candidate frequent item set for Database Intrusion Detection field |
CN105808988B (en) * | 2014-12-31 | 2020-07-03 | 阿里巴巴集团控股有限公司 | Method and device for identifying abnormal account |
CN104778475B (en) * | 2015-03-30 | 2018-01-19 | 南京邮电大学 | A kind of image classification method based on annular region Maximum Frequent vision word |
CN104954360B (en) * | 2015-04-17 | 2018-09-04 | 腾讯科技(深圳)有限公司 | Sharing contents screen method and device |
CN104991956B (en) * | 2015-07-21 | 2018-07-31 | 中国人民解放军信息工程大学 | Microblogging based on theme probabilistic model is propagated group and is divided and account liveness appraisal procedure |
CN105224593B (en) * | 2015-08-25 | 2019-08-16 | 中国人民解放军信息工程大学 | Frequent co-occurrence account method for digging in the of short duration online affairs of one kind |
CN106533893B (en) * | 2015-09-09 | 2020-11-27 | 腾讯科技(深圳)有限公司 | Message processing method and system |
CN105681312B (en) * | 2016-01-28 | 2019-03-05 | 李青山 | A kind of mobile Internet abnormal user detection method based on frequent item set mining |
CN105530265B (en) * | 2016-01-28 | 2019-01-18 | 李青山 | A kind of mobile Internet malicious application detection method based on frequent item set description |
CN107870956B (en) * | 2016-09-28 | 2021-04-27 | 腾讯科技(深圳)有限公司 | High-utility item set mining method and device and data processing equipment |
CN106484679B (en) * | 2016-10-20 | 2020-02-11 | 北京邮电大学 | False comment information identification method and device applied to consumption platform |
CN106650273B (en) * | 2016-12-28 | 2019-08-23 | 东方网力科技股份有限公司 | A kind of behavior prediction method and apparatus |
CN106921565B (en) * | 2017-03-30 | 2019-12-13 | 北京奇艺世纪科技有限公司 | Junk information identification method and device |
CN109783531A (en) * | 2018-12-07 | 2019-05-21 | 北京明略软件系统有限公司 | A kind of relationship discovery method and apparatus, computer readable storage medium |
CN109948641B (en) * | 2019-01-17 | 2020-08-04 | 阿里巴巴集团控股有限公司 | Abnormal group identification method and device |
CN112115305B (en) * | 2019-06-21 | 2024-04-09 | 杭州海康威视数字技术股份有限公司 | Group identification method apparatus and computer-readable storage medium |
CN110874786B (en) * | 2019-10-11 | 2022-10-18 | 支付宝(杭州)信息技术有限公司 | False transaction group identification method, device and computer readable medium |
US11620344B2 (en) | 2020-03-04 | 2023-04-04 | Honeywell International Inc. | Frequent item set tracking |
CN112948864B (en) * | 2021-03-19 | 2022-12-06 | 西安电子科技大学 | Verifiable PPFIM method based on vertical partition database |
CN113254755B (en) * | 2021-07-19 | 2021-10-08 | 南京烽火星空通信发展有限公司 | Public opinion parallel association mining method based on distributed framework |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102111296A (en) * | 2011-01-10 | 2011-06-29 | 浪潮通信信息系统有限公司 | Mining method for communication alarm association rule based on maximal frequent item set |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9563669B2 (en) * | 2012-06-12 | 2017-02-07 | International Business Machines Corporation | Closed itemset mining using difference update |
Non-Patent Citations (2)
Title |
---|
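Spam user discovery in microblogs based on statistical features and bidirectional voting; Ding Zhaoyun et al.; Journal of Computer Research and Development; 2013-12-31; pp. 2336-2347 * |
A transaction-set iteration algorithm for mining maximal frequent itemsets; Chen Bo et al.; Computer Engineering and Applications; 2009-12-31; pp. 141-144 * |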
Also Published As
Publication number | Publication date |
---|---|
CN103927398A (en) | 2014-07-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20161228 |
CF01 | Termination of patent right due to non-payment of annual fee |