CN110866047A - Community discovery algorithm based on improved association rule - Google Patents

Publication number
CN110866047A
Authority
CN
China
Prior art keywords
algorithm
association rule
support degree
community discovery
community
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911108340.5A
Other languages
Chinese (zh)
Inventor
王永贵
邢若楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Application filed by Liaoning Technical University
Priority to CN201911108340.5A
Publication of CN110866047A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Health & Medical Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a community discovery algorithm based on improved association rules (ARCD). First, the support degree is made self-adaptive, and the minimum support degree is calculated by a mathematical method. Second, a Boolean matrix and the idea of transaction weights are introduced to improve the Apriori algorithm and reduce the number of database scans. Finally, the improved algorithm is combined with the Spark platform to parallelize the community discovery algorithm. The ARCD algorithm mines community members from MAC addresses and mines the relationships among community members by mining frequent item sets. Experimental results show that the ARCD algorithm addresses the subjectivity of manually setting the support degree and the redundancy of community mining results, scales well, and speeds up community discovery.

Description

Community discovery algorithm based on improved association rule
Technical Field
The invention relates to a community discovery algorithm based on an improved association rule.
Background
With the advent of the big data era, the construction of wireless cities is gradually maturing. Complex network research has long been a research hotspot, and community discovery plays an important role in the study of complex networks. Mining the community relationships that exist in wireless cities from massive data has become a new challenge.
To improve on traditional community discovery algorithms, a hybrid algorithm that combines community discovery with association rules has been proposed. It improves the accuracy of community discovery but inherits the drawbacks of association rule algorithms, which increases search time and reduces search efficiency. The CS algorithm proposed by Mawei et al. improves community discovery with an undirected weighted graph and markedly improves the time and space cost of mining communities, but it generates a large amount of group redundancy after weight sorting, and the member relationships it finds are strong ones, which does not always hold in real life. Zhang Yan et al. propose improving the community discovery algorithm with a binary tree structure and combining MapReduce with the binary tree to parallelize the algorithm. This solves the problems of low efficiency and data overflow when processing massive data, but MapReduce has to read the disk frequently during iteration, which increases computation time. Yang Qinliu et al. propose improving association rules with a matrix. Their algorithm avoids the frequent scans of the transaction data set required by traditional association rule algorithms and improves operating efficiency, but it still consumes a large amount of time on massive data. Wangxue et al. propose making the support and confidence of the Apriori algorithm self-adaptive. Their algorithm removes the subjectivity and lack of scientific basis in manually setting support and confidence, but it does not address the other defects of the traditional Apriori algorithm.
Disclosure of Invention
In view of the defects of the prior art, the problem to be solved by the invention is to provide a community discovery algorithm based on improved association rules, which improves the Apriori algorithm by combining the idea of self-adaptive support with a weighted Boolean matrix method, and fuses the improved algorithm with community discovery on the Spark platform.
To solve the above technical problem, the invention adopts the following technical scheme:
The invention provides a community discovery algorithm based on improved association rules, comprising the following steps:
S1: make the support degree self-adaptive, and calculate the minimum support degree by a mathematical method;
S2: introduce a Boolean matrix and the idea of transaction weights to improve the Apriori algorithm and reduce the number of database scans;
S3: combine the improved association rule algorithm with the Spark platform to parallelize the community discovery algorithm.
Optionally, in step S1, the Apriori algorithm is optimized as follows:
S11: count the support degree of each item in the transaction data set D, and sort the supports in descending order;
S12: perform degree-k polynomial curve fitting on the resulting data pairs.
Optionally, in step S2, to address the problem that the Apriori algorithm scans the transaction data frequently and generates redundant candidate sets, the ARCD algorithm improves the Apriori algorithm by combining the weights with AND operations on the Boolean matrix to obtain the candidate sets.
Further, step S3 comprises the following steps:
S31: scan the data set to generate the frequent 1-item set L1 and store the result on HDFS; treat the data set stored on HDFS as an RDD, divide it into n blocks and distribute the blocks to m worker nodes;
S32: construct a local matrix and calculate its support counts;
S33: merge the local frequent item sets with the reduceByKey operation to obtain the global candidate item set.
Therefore, to address the redundant candidate and result sets and the high time complexity of existing association-rule-based community discovery, the invention provides the community discovery algorithm ARCD (a Community Detection algorithm based on improved Association Rules). The algorithm mines community members from MAC addresses, improves the Apriori algorithm by introducing self-adaptive support and a Boolean matrix built with transaction weights, parallelizes the improved algorithm on Spark, and mines the relationships among community members by mining frequent item sets. Experimental results show that the ARCD algorithm addresses the subjectivity of manually setting the support degree and the redundancy of community mining results, scales well, and speeds up community discovery.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages become more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a flow chart of the CS algorithm;
FIG. 2 is a flow chart of the MP-T-CS algorithm;
FIG. 3 is a flow chart of the SACS algorithm;
FIG. 4 is a flow chart of support adaptation;
FIG. 5 is a flow chart of the modified Apriori algorithm;
FIG. 6 is a flow chart of the modified Apriori algorithm running on Spark;
FIG. 7 is a running time diagram of five algorithms at different data volumes;
FIG. 8 is a graph comparing the speedup ratios of two algorithms at different numbers of cluster nodes;
FIG. 9 is a graph of the speedup ratio of the ARCD algorithm at different data volumes.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
The method improves the Apriori algorithm by combining the idea of self-adaptive support with a weighted Boolean matrix method, and integrates the improved algorithm with community discovery on the Spark platform. Experimental results show that the ARCD algorithm runs quickly and computes efficiently on massive data, and parallelizes well when mining community membership.
The core of community discovery is finding the direct relationships between users who take part in community activities. A wireless city is a service that covers the administrative region of a city with high-speed broadband wireless technology and provides information to the public anytime and anywhere. MAC address: the physical address, used to identify the position of a device on the network. MAC addresses are unique.
An association rule has the form X → Y, where X is the antecedent item set, Y is the consequent item set, and X ∩ Y = ∅.
Transaction: each row in Table 1 is a transaction.
Item: each column in Table 1 is an item.
Table 1 Binary representation of shopping basket data
Figure BDA0002271986970000041
Item set: a collection of zero or more items is called an item set.
Support count: the number of transactions that contain a given item set.
Support degree: used to determine whether an item set is frequent. It is defined by formula (1):
Figure BDA0002271986970000051
Frequent item set: an item set whose support is not less than the support threshold.
Second derivative: the function y = f(x) is differentiated twice; a point where the second derivative is 0 may be an inflection point.
Curve fitting: an appropriate type of curve is selected to fit the observed data, and the fitted curve equation is used to analyze the relationship between two variables.
Coefficient of determination: denoted R^2, it measures how well the model fits the data.
Spark is a computing engine designed in 2009 by the RAD Lab at the University of California, Berkeley, that supports interactive queries and iterative computation. It computes in memory and has an efficient fault-tolerance mechanism. Spark extends the MapReduce model; its operation speed is reported to be up to 100 times that of MapReduce in Hadoop, and it is better suited to real-time data than Hadoop, which suits non-real-time processing of massive data. It also supports more computational models (e.g., interactive queries and stream processing). The resilient distributed dataset (RDD) is the core data structure of Spark: a fault-tolerant collection that can be operated on in parallel, mainly through transformations and actions.
Community discovery based on Community Search (CS)
The CS algorithm is a community discovery algorithm improved with an undirected weighted graph. It compresses the MAC addresses in the original data and uses each MAC address in the sorted data as a vertex V_i of the graph. If MAC_i (vertex V_i) and MAC_j (vertex V_j) appear in the same place or share the same AP, they are connected by an edge e_ij with weight W_ij = 1. The MAC addresses are mapped into the graph in sequence, and each further occurrence of e_ij increases W_ij by 1, until all MAC addresses have been processed. The groups formed by the edges with larger weights are the communities found by the CS algorithm. The algorithm flow chart is shown in FIG. 1.
The algorithm has the following defects: because the CS algorithm sorts the edges of all vertices one by one, it produces redundant edges, and generating groups from edges produces redundant groups. The CS algorithm mines groups with high weights, which implies a strong relationship between the members, but in real life the relationship between members is not always as strong as the algorithm suggests.
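The edge-weight bookkeeping described for the CS algorithm can be illustrated with a minimal Python sketch. It is not the CS implementation itself: the function name `cs_edge_weights`, the sample groups and the weight threshold of 2 are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations

def cs_edge_weights(groups):
    """Count how often each pair of MAC addresses co-occurs.

    `groups` is an iterable of collections of MAC addresses that were seen
    together (same place or same AP). Returns a Counter that maps an
    unordered pair (mac_i, mac_j) to its edge weight W_ij.
    """
    weights = Counter()
    for group in groups:
        # sorted() gives every undirected edge a single canonical key
        for pair in combinations(sorted(set(group)), 2):
            weights[pair] += 1  # first sighting sets W_ij = 1, later ones add 1
    return weights

# Hypothetical co-occurrence groups, not taken from the patent's tables
groups = [
    ["MAC1", "MAC2", "MAC3"],
    ["MAC1", "MAC2"],
    ["MAC2", "MAC3"],
]
weights = cs_edge_weights(groups)

# Keep only the heavier edges; the threshold 2 is an illustrative choice
heavy_edges = {edge for edge, w in weights.items() if w >= 2}
```

Because every co-occurring pair becomes an edge, the number of edges grows quadratically with group size, which is one way to see the edge redundancy noted above.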
T-CS algorithm based on MapReduce
The MP-T-CS algorithm applies the T-CS algorithm to the MapReduce platform of Hadoop and parallelizes the group search algorithm with a special binary tree. The MP-T-CS algorithm first stores the original MAC data in the form of a special binary tree, converts the binary tree into text for storage, and performs map and reduce operations on the text data to generate (key, value) pairs. The key-value pairs are then sorted by weight using k-selection. Finally, a depth-first search traversal of the sorted graph yields the maximal connected graph. The maximal connected graph is the community found by the MP-T-CS algorithm. The flow chart of the algorithm is shown in FIG. 2.
The algorithm has the following defects: although the MP-T-CS algorithm mines better than the CS algorithm both globally and locally and improves time and space performance, it is a parallel computation based on MapReduce. Intermediate results are stored in HDFS files and read from disk each time they are used, so a large number of iterations increases the running time.
Spark-based ACS algorithm
The SACS algorithm parallelizes the Apriori algorithm mainly through the distributed computation of Spark. The SACS algorithm first stores the data set in HDFS and splits the generated RDD into n partitions. It processes the generated (key, value) pairs to obtain n local frequent item sets and performs a reduce operation on them to obtain the global candidate item sets. It then partitions the global candidate item sets to generate new (key, value) pairs, counts the global support, compares it with the minimum support, and filters out the item sets that do not meet the condition to obtain the global frequent item sets. Finally it outputs the community member sets and the algorithm ends. The algorithm flow chart is shown in FIG. 3.
The algorithm has the following defects: the SACS algorithm parallelizes Apriori on Spark and overcomes the frequent disk scans of Hadoop-based computation. However, it does not address the defects of the traditional Apriori algorithm: first, the support threshold has to be set manually; second, the database still has to be scanned multiple times, which consumes running time.
ARCD algorithm
The ARCD algorithm: a Community Detection algorithm based on improved Association Rules. The ARCD algorithm obtains the location data of community members from MAC address transaction data in the wireless city. First, because the support of the Apriori algorithm is mostly set by experience, the ARCD algorithm adopts a self-adaptive support method so that the minimum support is obtained by calculation rather than by experience. Second, to avoid the frequent database scans of the Apriori algorithm, frequent item sets are generated with an algorithm improved by transaction weights and a Boolean matrix. Finally, to avoid writing results to disk every time Hadoop processes data, the improved algorithm is applied to community discovery and parallelized on the Spark platform.
Data pre-processing
To address the problem that an excessively large data volume obscures the co-occurrence patterns among users, a method is provided that compresses the MAC address records in time and space (place). Table 2 shows the original data, where Service_code is the access location number, AP_MAC is the address of the access AP, User_MAC is the MAC address of the user, and time is the access time.
Table 2 original MAC address dataset
Figure BDA0002271986970000071
MAC address preprocessing compresses records that share the same AP address or occur within the same 2 minutes. Compressing the data in Table 2 gives Table 3.
Table 3 MAC address transaction data
Figure BDA0002271986970000072
Figure BDA0002271986970000081
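The compression step can be sketched as follows. This is an illustration under stated assumptions rather than the preprocessing actually used: records are assumed to be tuples of (Service_code, AP_MAC, User_MAC, time in seconds), and two records are assumed to fall into the same transaction when they share an AP address and the same 2-minute bucket. The function name `compress` and the sample records are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # the 2-minute window named in the text

def compress(records):
    """Group raw records into transactions.

    Each record is (service_code, ap_mac, user_mac, timestamp_seconds).
    Assumption: records belong to the same transaction when they share an
    AP address and fall into the same 2-minute time bucket.
    """
    buckets = defaultdict(set)
    for _service_code, ap_mac, user_mac, ts in records:
        key = (ap_mac, ts // WINDOW_SECONDS)
        buckets[key].add(user_mac)  # the set collapses repeat sightings
    # one transaction (sorted list of user MACs) per bucket
    return [sorted(users) for _key, users in sorted(buckets.items())]

# Hypothetical records, not the contents of Table 2
records = [
    ("S1", "AP1", "MAC1", 0),
    ("S1", "AP1", "MAC2", 60),
    ("S1", "AP1", "MAC1", 90),   # repeat sighting in the same bucket
    ("S1", "AP1", "MAC3", 130),  # next 2-minute bucket
    ("S2", "AP2", "MAC2", 30),
]
transactions = compress(records)
```

Fixed buckets are the simplest choice; a sliding window would group records that straddle a bucket boundary, at the cost of more bookkeeping.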
The frequent item sets of MAC addresses are found with the improved Apriori algorithm, and this process of finding frequent item sets is called community discovery. After data compression, each MAC address is equated to an item I_n, e.g. MAC_1 = I_1, where I_n denotes an item and T_n denotes a transaction. The owners of the MAC_n (I_n) addresses in a frequent item set are members of a community. For example, if the algorithm finally obtains the frequent 3-item set {I_1, I_2, I_3}, the corresponding MAC addresses are {MAC_1, MAC_2, MAC_3} and the corresponding community members are {1, 2, 3}. We conclude that users 1, 2 and 3 belong to the same community.
Improvement based on Apriori algorithm
To address the low computational efficiency of the traditional Apriori algorithm caused by frequent iteration, the Apriori algorithm is optimized in three respects. First, the support degree is made self-adaptive, and the minimum support degree is calculated by a mathematical method. Second, a Boolean matrix and the idea of transaction weights are introduced to improve the Apriori algorithm and reduce the number of database scans. Third, the improved algorithm is combined with the Spark platform to parallelize the community discovery algorithm.
Support degree self-adaptation
Support self-adaptation analyzes the statistics of the data itself: a degree-k polynomial curve is fitted to the supports arranged in descending order, and the point where the second derivative of the fitted curve is 0 is taken as min_sup. The procedure first counts the support of each item in the transaction data set D and sorts the supports in descending order. A set of "sequence number, support" pairs is established according to formula (2).
{(x, y)} = {(1, V_1), (2, V_2), ..., (i, V_i)} (2)
In formula (2), V_1, V_2, ..., V_i are the supports after sorting in descending order, and i is the total number of items in the transaction data set D. The data pairs (i, V_i) are plotted in a plane coordinate system, and a degree-k polynomial curve is fitted to them, as shown in formula (3), where a_0, ..., a_k are the fitted coefficients.
$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_k x^k \quad (3)$$
The value of k starts at 2 and is increased by 1 each time. Whether to keep increasing k is judged by formula (4).
Figure BDA0002271986970000083
R_k^2 is the coefficient of determination of the degree-k fit, with values in [0, 1]. The end condition is R_end = 0.05. While formula (4) holds, k is increased by 1; the loop stops once formula (4) no longer holds. The second derivative of the current degree-k polynomial is then calculated, as shown in formula (5).
$$f''(x) = \sum_{j=2}^{k} j(j-1)\, a_j x^{j-2} \quad (5)$$
The point x_0 at which the second derivative equals 0 gives min_sup. When 1 < x_0 < 2, x_0 is rounded up to 2; when x_0 > 2, x_0 is rounded to an integer. The execution flow chart is shown in FIG. 4.
The pseudocode for support self-adaptation is as follows:
Algorithm 1: Support self-adaptation
Input: transaction data set D, R_end.
Output: min_sup.
C1 = find_candidate_1-itemsets(D)
for c ∈ C1 do
    δ(c) = δ(c) + 1    // support count
L1 = Dec_sort(C1)    // sort supports in descending order
// adaptively determine the degree k of the polynomial fitting curve
k = find_k_mult_curve_fitting(L1)
f(x) = mult_curve_fitting(L1, k)    // degree-k polynomial curve
find x0 where f''(x0) = 0
if 1 < x0 < 2
    x0 = Round(x0)    // round x0 up
else x0 = Trunc(x0)    // round x0 down
min_sup = x0
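The last steps of the pseudocode, taking an already fitted polynomial and deriving min_sup from the zero of its second derivative, can be sketched in plain Python. The curve fitting itself is not reproduced here. The helper names are hypothetical, the choice of the smallest root in the search range is an assumption, and the else branch truncates as in the pseudocode. The coefficients are those of the fitted curve f_T given in the worked example later in this description.

```python
def second_derivative(coeffs):
    """coeffs[j] is the coefficient of x**j; returns the coefficients of f''."""
    return [j * (j - 1) * a for j, a in enumerate(coeffs)][2:]

def evaluate(coeffs, x):
    """Evaluate a polynomial with Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result

def first_root(coeffs, lo, hi, steps=1000, tol=1e-9):
    """Smallest root in [lo, hi]: scan for a sign change, then bisect."""
    width = (hi - lo) / steps
    for s in range(steps):
        a, b = lo + s * width, lo + (s + 1) * width
        fa, fb = evaluate(coeffs, a), evaluate(coeffs, b)
        if fa * fb <= 0:
            while b - a > tol:
                mid = (a + b) / 2
                if fa * evaluate(coeffs, mid) <= 0:
                    b = mid
                else:
                    a, fa = mid, evaluate(coeffs, mid)
            return (a + b) / 2
    return None

def min_sup_from_fit(coeffs, n_items):
    """Apply the rounding rule of Algorithm 1 to the zero of f''."""
    x0 = first_root(second_derivative(coeffs), 1.0, float(n_items))
    if x0 is None:
        raise ValueError("f'' has no root in the search range")
    return 2 if 1 < x0 < 2 else int(x0)  # int() truncates, as Trunc does

# f_T from the worked example, lowest degree first
f_T = [3, 9.167, -6.833, 1.833, -0.16667]
x0 = first_root(second_derivative(f_T), 1.0, 5.0)
min_sup = min_sup_from_fit(f_T, 5)
```

For this curve the smallest zero of the second derivative lies between 1 and 2, so the rounding rule yields a minimum support of 2, which agrees with the worked example.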
Apriori based on transaction weights and matrices
To address the problems that the Apriori algorithm scans the transaction data frequently and generates redundant candidate sets, the ARCD algorithm improves Apriori by combining transaction weights with AND operations on a Boolean matrix to obtain the candidate sets. The database is scanned only once, which improves the operating efficiency of the algorithm.
Boolean matrix: a matrix whose elements take only the values 0 or 1.
Theorem 1: if two frequent (k-1)-item sets can be joined to generate a k-item set, then their first k-2 items must be identical.
Theorem 2: if an item set is frequent, then all of its subsets must also be frequent. Conversely, if an item set is infrequent, then all of its supersets must be infrequent.
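A transaction data set D containing m transactions and n items is mapped to a Boolean matrix, as shown in formula (6).
$$R = (r_{ij})_{m \times n} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{pmatrix} \quad (6)$$
where
$$r_{ij} = \begin{cases} 1, & I_j \in T_i \\ 0, & I_j \notin T_i \end{cases}$$
i = 1, 2, ..., m, j = 1, 2, ..., n. T_i represents the i-th transaction record in the transaction data set and I_j the j-th item.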
The ARCD algorithm introduces the weight formula sup_count(I) = W_T · I, where sup_count is the support count of the item, W_T is the weight vector of the transactions T, and I is a column vector of the transaction matrix. The algorithm initializes W_T = {1, 1, ..., 1}. When a duplicate transaction is found while scanning the transactions T, the weight of that transaction is increased by 1 and the duplicate is removed. The Boolean matrix is then updated for the processed transactions, and transactions whose rows are all 0 and items whose support is less than the minimum support are deleted. An AND operation on the columns yields the candidate (k+1)-item sets. This work is repeated until no more frequent item sets can be found, at which point the algorithm ends. A flow chart of the improved Apriori algorithm is shown in FIG. 5.
The pseudocode of the improved Apriori algorithm is as follows:
Algorithm 2: Obtaining frequent item sets
Input: transaction data set T, min_sup, t = 1, W_T = {1, 1, ..., 1}.
Output: the frequent t-item sets.
Figure BDA0002271986970000103
Figure BDA0002271986970000111
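Since the pseudocode of Algorithm 2 survives only as an image, the following Python sketch illustrates the described idea rather than the patent's exact procedure: duplicate transactions are collapsed into weights, support counts are computed as the weighted sum of a Boolean column, and candidate columns are produced by AND operations. All function names and the sample transactions are hypothetical, and the deletion of all-zero rows is omitted because it only saves work and does not change the counts.

```python
from collections import Counter
from itertools import combinations

def weighted_boolean_matrix(transactions):
    """Collapse duplicate transactions; return (items, rows, weights)."""
    counts = Counter(frozenset(t) for t in transactions)
    items = sorted({i for t in counts for i in t})
    rows = [[1 if i in t else 0 for i in items] for t in counts]
    weights = list(counts.values())  # same order as rows
    return items, rows, weights

def sup_count(column, weights):
    """sup_count(I) = W_T . I, the weighted sum of a Boolean column."""
    return sum(w * bit for w, bit in zip(weights, column))

def frequent_itemsets(transactions, min_sup):
    items, rows, weights = weighted_boolean_matrix(transactions)
    # one Boolean column per 1-item set
    columns = {(item,): [row[j] for row in rows] for j, item in enumerate(items)}
    frequent = {}
    while columns:
        kept = {s: c for s, c in columns.items() if sup_count(c, weights) >= min_sup}
        frequent.update({s: sup_count(c, weights) for s, c in kept.items()})
        columns = {}
        for (s1, c1), (s2, c2) in combinations(sorted(kept.items()), 2):
            if s1[:-1] == s2[:-1]:  # Theorem 1: identical prefix
                cand = s1 + s2[-1:]
                # Theorem 2: every subset one item smaller must be frequent
                if all(sub in kept for sub in combinations(cand, len(s1))):
                    columns[cand] = [a & b for a, b in zip(c1, c2)]
    return frequent

# Hypothetical transactions, not the contents of Table 4
transactions = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I4"],  # duplicate: stored once with weight 2
    ["I1", "I3"],
]
result = frequent_itemsets(transactions, min_sup=2)
```

The data set is read once to build the matrix; every later level works only on in-memory columns, which is the saving the text attributes to the Boolean matrix.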
Improved parallel implementation of Apriori algorithm
The ARCD algorithm improves the Apriori algorithm with transaction weights and a Boolean matrix so that the data set needs to be scanned only once, and it is combined with the Spark engine to parallelize the algorithm and increase its speed. The ARCD algorithm runs on Spark as follows. First, the data set is scanned to generate the frequent 1-item set L1, and the result is stored on HDFS; the data set stored on HDFS is treated as an RDD and divided into n blocks, which are distributed to m worker nodes. Second, a local matrix is constructed on each node and its support counts are calculated. An AND operation on the columns of the local matrix yields the local candidate sets. Items whose support is less than the minimum support and transactions whose rows are all 0 are deleted to obtain the local frequent item sets. Finally, the local frequent item sets are merged with the reduceByKey operation to obtain the global candidate item sets. The global support is calculated, and columns (item sets) below the minimum support and rows (transactions) that are all 0 are filtered out to obtain the global frequent item sets of the matrix. The frequent item sets are mapped back to MAC addresses, which gives the set of members who own these MAC addresses. A flow chart of the improved Apriori algorithm running on Spark is shown in FIG. 6.
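The partition, local count and merge pattern can be imitated in plain Python without a Spark cluster. The sketch below is a simulation, not Spark code: `split` stands in for dividing the RDD into blocks, and `reduce_by_key` mimics what a reduceByKey with addition would compute. As a simplification, and unlike the description above, local counts are not filtered before merging, so the global counts stay exact. All names and data are hypothetical.

```python
from collections import Counter
from functools import reduce
from itertools import combinations

def split(transactions, n):
    """Deal transactions into n partitions, standing in for RDD blocks."""
    return [transactions[i::n] for i in range(n)]

def local_counts(partition, k):
    """Support counts of every k-item set inside one partition."""
    counts = Counter()
    for t in partition:
        for itemset in combinations(sorted(set(t)), k):
            counts[itemset] += 1
    return counts

def reduce_by_key(counters):
    """Sum the counts per key across partitions."""
    return reduce(lambda a, b: a + b, counters, Counter())

def global_frequent(transactions, k, min_sup, n_partitions=2):
    partials = [local_counts(p, k) for p in split(transactions, n_partitions)]
    merged = reduce_by_key(partials)
    # filter against the minimum support only after the global merge
    return {s: c for s, c in merged.items() if c >= min_sup}

# Hypothetical transactions, not the contents of Table 4
transactions = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I4"],
    ["I1", "I3"],
]
frequent_pairs = global_frequent(transactions, k=2, min_sup=2)
```

Filtering only after the merge avoids losing an item set that is infrequent in every partition but frequent overall; pruning locally first, as the text describes, trades that guarantee for less data sent between nodes.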
Examples of algorithms
To explain the main idea of the ARCD algorithm more clearly, the transaction data set shown in table 4 is analyzed by way of example.
Table 4 transaction data set T
Figure BDA0002271986970000121
The transaction data set T is scanned, the support of each item in the candidate item set is counted, and the supports are arranged in descending order. The data pairs are obtained with formula (2) and listed as coordinates in Table 5.
Table 5 list of support and number
Figure BDA0002271986970000122
The points in the table are plotted in a coordinate system and curve fitting is performed according to formula (3) and formula (4), giving k = 4. The fitted support curve of the data set T is shown in formula (6):
$$f_T(x) = -0.16667x^4 + 1.833x^3 - 6.833x^2 + 9.167x + 3 \quad (6)$$
The point where the second derivative of f_T is 0 is then found. The second derivative of f_T is shown in formula (7):
$$f_T''(x) = -2.0004x^2 + 10.998x - 13.666 \quad (7)$$
Setting f_T''(x) = 0 gives a root of approximately 1.9, which lies between 1 and 2 and is rounded up to x_0 = 2, so the minimum support of the data set T is 2. The transaction data set T is mapped to a Boolean matrix in which each row represents a transaction and each column represents an item. If an item exists in a transaction, the corresponding position is 1, otherwise it is 0, generating the Boolean matrix R_t:
Figure BDA0002271986970000131
In the transaction data set T, transactions T_4 and T_5 are duplicates, so the weight vector is W_T = {1, 1, 1, 2, 1, 1, 1, 1}. The support count of each column of the matrix is calculated, e.g. sup_count(I_1) = W_T I_1 = 0+0+1+2+1+0+0+0+1 = 5. That is, I_1 has a support of 5 and the minimum support is 2, so item I_1 is frequent. The supports of the other items are calculated in the same way. All items have a support not less than the minimum support threshold, so the frequent 1-item sets are {I_1}, {I_2}, {I_3}, {I_4}, {I_5}. An AND operation on each pair of columns gives the candidate 2-item set matrix R_2:
Figure BDA0002271986970000141
It can be seen that transaction T_9 is all 0. According to theorem 2, T_9 is deleted and the support count of each item set in the candidate 2-item sets is calculated. For example, sup_count(I_1, I_2) = W_T(I_1 I_2) = 0+0+1+0+1+0+0+0 = 2 and sup_count(I_3, I_5) = W_T(I_3 I_5) = 0+0+1+0+0+0+0 = 1. The 2-item set I_1 I_2 has a support of 2, so it is frequent. The 2-item set I_3 I_5 has a support of 1, which is less than the minimum support of 2, so the column of I_3 I_5 is deleted. Calculating the support of all item sets gives the frequent 2-item sets {I_1 I_2}, {I_1 I_3}, {I_1 I_4}, {I_1 I_5}, {I_2 I_4}, {I_2 I_5}, {I_3 I_4}, {I_4 I_5}. The columns of the frequent item sets are joined according to theorem 1 and combined with an AND operation, giving the candidate 3-item set matrix R_3:
Figure BDA0002271986970000142
It can be seen that transaction rows T_2, T_7, T_8 and T_10 are all 0. According to theorem 1, these rows are deleted and the support count of each column is calculated, where sup_count(I_1, I_2, I_4) = 1, sup_count(I_1, I_2, I_5) = 1, sup_count(I_1, I_3, I_4) = 1 and sup_count(I_2, I_4, I_5) = 1. These item sets have a support less than the minimum support of 2, so the item sets that do not satisfy min_sup are deleted, and the frequent 3-item sets are {I_1 I_2 I_3}, {I_1 I_4 I_5}, {I_2 I_3 I_4}. By theorem 1, no candidate 4-item set can be obtained by joining and AND operations, so the algorithm ends.
To sum up, the frequent item sets {I_1 I_2 I_3}, {I_1 I_4 I_5}, {I_2 I_3 I_4} meet the requirements. Converting each item I_i into its MAC address gives the frequent 3-item sets {MAC_1, MAC_2, MAC_3}, {MAC_1, MAC_4, MAC_5}, {MAC_2, MAC_3, MAC_4}. The users who own these MAC addresses are members of the same community: user 1, user 2 and user 3 form one community; user 1, user 4 and user 5 form another; and user 2, user 3 and user 4 form a third.
Experimental validation and analysis
Experimental Environment
The experiment uses a Spark cluster comprising 1 master node and 7 slave nodes. Each cluster node is configured as follows: Linux operating system (CentOS 7.3), Scala 2.11.8, JDK 1.8, Hadoop 2.7.3, and Spark 2.1.1.
Comparison of experiments
To verify the validity of the ARCD algorithm, MAC address samples were generated from UCI data sets; their characteristics are shown in Table 6.
TABLE 6 data set characteristic information statistics
[Figure BDA0002271986970000151: image of Table 6]
To verify the accuracy of the mining results, data sets D1 and D2 are used to compare the CS algorithm, the MP-T-CS algorithm, the Apriori algorithm, the SACS algorithm, and the ARCD algorithm. Let min_sup = 20% and δ = 0.05. N() denotes the number of MAC addresses after mining. The results of the experiment are shown in Table 7.
TABLE 7 comparison of data set mining results
[Figures BDA0002271986970000152 and BDA0002271986970000161: images of Table 7]
As can be seen from Table 7, the ARCD algorithm matches the traditional Apriori algorithm and the SACS algorithm parallelized on Spark in both the number and the quality of the MAC addresses mined, which shows that it mines community members more accurately than the CS algorithm and the MP-T-CS algorithm.
To verify the computational efficiency of the algorithm, data sets D1-D5 are selected and the ARCD algorithm is compared with the CS algorithm, the MP-T-CS algorithm, the Apriori algorithm, and the SACS algorithm. The experimental results are shown in FIG. 7.
As can be seen from FIG. 7, when the data volume is small, the computing advantages of the Hadoop and Spark platforms are not apparent. As the data volume grows, the advantage of the ARCD algorithm becomes increasingly clear: it runs faster than the other algorithms, and its computing time is one tenth of that of the conventional Apriori algorithm. There are two reasons. First, the ARCD algorithm compresses the data set with a Boolean matrix, and the improved algorithm obtains the frequent item sets with a single scan of the database, which reduces the running time. Second, the ARCD algorithm exploits Spark's in-memory computation: the larger the data, the more pronounced Spark's advantage and the greater the gain in speed.
To verify the scalability and the parallelization effect of the algorithm, the speed-up ratio of the ARCD algorithm is compared with that of the MP-T-CS algorithm on data set D4.
As can be seen from FIG. 8, the speed-up ratio of both algorithms increases with the number of cluster nodes, which shows that both algorithms parallelize well. The speed-up ratio of the ARCD algorithm is higher than that of the SACS algorithm, mainly because the ARCD algorithm maps the data set to a Boolean matrix and obtains the frequent item sets through AND operations, so it needs to scan the database only once and the operation speed improves. The experiments show that the ARCD algorithm parallelizes better and runs faster when processing massive data sets.
To verify the scalability of the algorithm on different data sets, data sets D1-D5 are used and the speed-up ratio of the ARCD algorithm is measured with different numbers of nodes. The results of the experiment are shown in FIG. 9.
As can be seen from FIG. 9, when the data volume is small, the speed-up does not improve noticeably as the number of cluster nodes increases: the computation time is already short, so parallel computation offers little advantage. When the data volume is large enough, the speed-up ratio grows almost linearly. The experiments show that the ARCD algorithm has good parallel processing capability and good scalability on massive data sets.
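The speed-up ratio is not defined in this section. The sketch below assumes the usual definition, the running time on one node divided by the running time on p nodes, and adds parallel efficiency (speed-up divided by p) as a way to judge how close a curve is to linear. The function names and all timings are hypothetical; they are not measurements from FIG. 8 or FIG. 9.

```python
def speedup_curve(seconds_by_nodes):
    """Map node count -> speed-up relative to the 1-node run."""
    baseline = seconds_by_nodes[1]
    return {p: baseline / t for p, t in sorted(seconds_by_nodes.items())}


def parallel_efficiency(seconds_by_nodes):
    """Speed-up divided by node count; 1.0 would be perfectly linear scaling."""
    return {p: s / p for p, s in speedup_curve(seconds_by_nodes).items()}


# Hypothetical running times in seconds, keyed by number of nodes.
timings = {1: 800.0, 2: 400.0, 4: 250.0, 8: 160.0}
curve = speedup_curve(timings)
efficiency = parallel_efficiency(timings)
for nodes in curve:
    print(nodes, round(curve[nodes], 2), round(efficiency[nodes], 2))
```

An efficiency that stays close to 1.0 as nodes are added corresponds to the almost linear speed-up described for the larger data sets.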
To address the low computational efficiency of traditional community discovery algorithms, the ARCD algorithm is proposed, which improves association rule mining with transaction weights and a Boolean matrix on the Spark platform. The ARCD algorithm incorporates an adaptive support method, which removes the subjectivity of setting the support threshold manually. The improved algorithm is parallelized on Spark, which avoids the frequent disk scans and increased I/O cost of MapReduce. Experimental results show that the ARCD algorithm mines community members with high computational efficiency, has good parallelism and scalability, and has a clear advantage when processing massive data.
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (4)

1. A community discovery algorithm based on an improved association rule, characterized by comprising the following steps:
S1: making the support degree adaptive, and calculating the minimum support degree by a mathematical method;
S2: introducing a Boolean matrix and transaction weights to improve the Apriori algorithm, thereby reducing the number of database scans;
S3: combining the improved association rule with the Spark platform to parallelize the community discovery algorithm.
2. The community discovery algorithm based on an improved association rule according to claim 1, wherein in step S1 the Apriori algorithm is optimized as follows:
S11: counting the support degree of each item in the transaction data set D, and sorting the items in descending order;
S12: performing k-degree polynomial curve fitting on the resulting data pairs.
3. The community discovery algorithm based on an improved association rule according to claim 1, wherein in step S2, to address the problems that the Apriori algorithm scans the transaction data frequently and generates redundant candidate sets, the ARCD algorithm improves the Apriori algorithm by using transaction weights and AND operations on a Boolean matrix to obtain the candidate sets.
4. The community discovery algorithm based on an improved association rule according to claim 1, wherein step S3 comprises the following steps:
S31: scanning the data set to generate the frequent 1-item set L1 and storing the result on HDFS; treating the data set stored on HDFS as an RDD, dividing the RDD into n blocks, and distributing the n blocks to m worker nodes;
S32: constructing a local matrix and calculating the support counts of the local matrix;
S33: merging the local frequent item sets with the reduceByKey operation to obtain the global candidate item set.
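As an illustration of steps S31 to S33, the following single-process sketch imitates the data flow without using Spark. Everything in it is an assumption made for illustration: the helper names (`partition`, `local_support`, `reduce_by_key`, `global_frequent`), the round-robin split standing in for RDD partitioning, and the sample transactions. `reduce_by_key` only mimics the effect of Spark's `reduceByKey` with addition; it is not the Spark API, and the local step counts item combinations directly rather than building the local Boolean matrix.

```python
from collections import Counter
from itertools import combinations


def partition(transactions, n_blocks):
    """Round-robin split into n blocks (stand-in for RDD partitioning)."""
    return [transactions[i::n_blocks] for i in range(n_blocks)]


def local_support(block, k):
    """Support counts of every k-item combination within one block."""
    counts = Counter()
    for t in block:
        for itemset in combinations(sorted(set(t)), k):
            counts[itemset] += 1
    return list(counts.items())


def reduce_by_key(pairs):
    """Sum the values that share a key, as reduceByKey with addition would."""
    merged = Counter()
    for key, value in pairs:
        merged[key] += value
    return dict(merged)


def global_frequent(transactions, k, min_count, n_blocks=3):
    """Merge per-block counts and keep item sets meeting the minimum count."""
    pairs = [pair
             for block in partition(transactions, n_blocks)
             for pair in local_support(block, k)]
    return {s: c for s, c in reduce_by_key(pairs).items() if c >= min_count}


# Hypothetical transactions, each a list of item identifiers.
sample = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
    ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
pairs_2 = global_frequent(sample, k=2, min_count=2)
print(sorted(pairs_2.items()))
```

Because the per-block counts are summed by key, the merged result does not depend on how the transactions were split across blocks.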
CN201911108340.5A 2019-11-13 2019-11-13 Community discovery algorithm based on improved association rule Pending CN110866047A (en)

Publications (1)

Publication Number Publication Date
CN110866047A (en) 2020-03-06

Family

ID=69654104

Country Status (1)

Country Link
CN (1) CN110866047A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination