CN110866047A

CN110866047A - Community discovery algorithm based on improved association rule

Info

Publication number: CN110866047A
Application number: CN201911108340.5A
Authority: CN
Inventors: 王永贵; 邢若楠
Original assignee: Liaoning Technical University
Current assignee: Liaoning Technical University
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2020-03-06

Abstract

The invention discloses a community discovery algorithm based on improved association rules. First, the support is made self-adaptive, and the minimum support is calculated by a mathematical method. Second, a Boolean matrix and the idea of transaction weights are introduced to improve the Apriori algorithm and reduce the number of database scans. Finally, the improved algorithm is combined with the Spark platform to parallelize the community discovery algorithm. The resulting ARCD algorithm mines community members using MAC addresses and mines the relationships among community members by mining frequent item sets. Experimental results show that the ARCD algorithm removes the subjectivity of manually setting the support, reduces the redundancy of community mining results, scales well, and improves the speed of community discovery.

Description

Community discovery algorithm based on improved association rule

Technical Field

The invention relates to a community discovery algorithm based on an improved association rule.

Background

With the advent of the big data era, the construction of wireless cities is gradually perfected. Complex network research is always a hotspot of social research, and community discovery plays an important role in researching complex networks. Mining the community relations existing in the wireless cities in massive data becomes a new challenge.

To improve on traditional community discovery algorithms, a hybrid algorithm that integrates community discovery with association rules has been proposed. It improves the accuracy of community discovery but inherits the defects of the association rule algorithm, which increases search time and reduces search efficiency. The CS algorithm proposed by Ma Wei et al. uses an undirected weighted graph to improve community discovery and significantly reduces the time and space needed to mine communities, but it generates a large amount of group redundancy after weight sorting, and the membership it discovers is stronger than what usually holds in real life. Zhang Yan et al. propose improving the community discovery algorithm with a binary tree structure and combining MapReduce with the binary tree to parallelize the algorithm, which solves the problems of low efficiency and data overflow when processing massive data; however, MapReduce must frequently scan the disk during iteration, which increases computation time. Yang Qinliu et al. propose improving association rules with a matrix; the algorithm avoids the frequent scans of the transaction data set required by the traditional association rule algorithm and improves operating efficiency, but it still consumes a large amount of time when processing massive data. Wang Xue et al. propose making the support and confidence of the Apriori algorithm self-adaptive, which removes the subjectivity and lack of scientific basis in manually setting the support and confidence, but it does not address the other defects of the traditional Apriori algorithm.

Disclosure of Invention

In view of the defects of the prior art, the problem to be solved by the invention is to provide a community discovery algorithm based on improved association rules that improves the Apriori algorithm by combining self-adaptive support with a weighted Boolean matrix, and fuses the improved algorithm with community discovery on the Spark platform.

To solve the above technical problem, the invention adopts the following technical solution:

The invention provides a community discovery algorithm based on improved association rules, comprising the following steps:

S1: make the support self-adaptive, and calculate the minimum support by a mathematical method;

S2: introduce a Boolean matrix and the idea of transaction weights to improve the Apriori algorithm and reduce the number of database scans;

S3: combine the improved association rule algorithm with the Spark platform to parallelize the community discovery algorithm.

Optionally, in step S1, the Apriori algorithm is optimized as follows:

S11: count the support of each item in the transaction data set D, and sort the supports in descending order;

S12: perform degree-k polynomial curve fitting on the resulting data pairs.

Optionally, in step S2, to address the problem that the Apriori algorithm frequently scans the transaction data and generates redundant candidate sets, the ARCD algorithm improves the Apriori algorithm by performing an AND operation on the weights and the Boolean matrix to obtain the candidate sets.

Further, the step S3 includes the following steps:

S31: scan the data set to generate the frequent 1-item set L₁, store the result on HDFS, treat the data set stored on HDFS as an RDD, and divide the RDD into n blocks distributed to m worker nodes;

S32: construct a local matrix, and calculate the support count of the local matrix;

S33: merge the local frequent item sets with the reduceByKey operation to obtain the global candidate item set.

Thus, to address the problems that community discovery based on association rules generates a large number of redundant candidate results and has high time complexity, the invention provides the ARCD (a Community Detection algorithm based on improved Association Rules) algorithm. The algorithm mines community members using MAC addresses, improves the Apriori algorithm by introducing self-adaptive support and a Boolean matrix generated with added transaction weights, parallelizes the improved algorithm with Spark, and mines the relationships among community members by mining frequent item sets. Experimental results show that the ARCD algorithm removes the subjectivity of manually setting the support, reduces the redundancy of community mining results, scales well, and improves the speed of community discovery.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages can be understood more readily, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.

FIG. 1 is a flow chart of the CS algorithm;

FIG. 2 is a flow chart of the MP-T-CS algorithm;

FIG. 3 is a SACS flow chart;

FIG. 4 is a flow chart of support adaptation;

FIG. 5 is a flow chart of the modified Apriori algorithm;

FIG. 6 is a flow chart of the modified Apriori algorithm running on Spark;

FIG. 7 is a running time diagram of five algorithms at different data volumes;

FIG. 8 is a graph comparing acceleration ratios for two algorithms at different numbers of clusters;

FIG. 9 is a graph of acceleration ratio change for the ARCD algorithm at different data volumes.

Detailed Description

Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.

The method improves an Apriori algorithm by combining the idea of self-adaption of support degree and a method for generating a Boolean matrix by using weight, and integrates the improved algorithm with a community discovery algorithm on a Spark platform. Experimental results show that the ARCD algorithm has the advantages of fast running time and high calculation efficiency when processing mass data, and has good parallelism when mining community membership.

The core of community discovery is finding the direct relationships between users who participate in community activities. A wireless city is a service that covers a city's administrative region with high-speed broadband wireless technology and provides information to the public at any time and place. MAC address: the physical address, used to identify the location of a device on the network. A MAC address is unique.

Association rule: an implication of the form X → Y, where X is the antecedent item set, Y is the consequent item set, and X ∩ Y = ∅.

Transaction: each row in Table 1 is a transaction.

Item: each column in Table 1 is an item.

TABLE 1 Binary representation of shopping basket data

Item set: a collection containing zero or more items is called an item set.

Support count: the number of transactions that contain the item set.

Support: used to determine whether an item set is frequent; it is defined by formula (1).

Frequent item set: an item set whose support is not less than the support threshold.

Second derivative: the result of differentiating the original function y = f(x) twice; a point where the second derivative is 0 may be an inflection point.

Curve fitting: selecting an appropriate curve type to fit the observed data, and using the fitted curve equation to analyze the relationship between two variables.

Coefficient of determination: denoted R², it measures the goodness of fit of the model.
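The coefficient of determination can be illustrated with a minimal sketch. It uses the standard definition R² = 1 - SS_res / SS_tot; the function name `r_squared` and the sample values are hypothetical and are not taken from the patent's data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations and two candidate fits (illustrative values only).
observed = [9.0, 7.0, 4.0, 2.0]
perfect_fit = list(observed)
mean_fit = [sum(observed) / len(observed)] * len(observed)

print(r_squared(observed, perfect_fit))  # a fit that reproduces every point
print(r_squared(observed, mean_fit))     # a fit that predicts the mean everywhere
```

A fit that reproduces the data gives R² = 1, while a fit no better than the mean gives R² = 0, which is why R² is usable as a stopping measure when increasing the polynomial degree.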

Spark is a computing engine developed in 2009 at the RAD Lab of the University of California, Berkeley, that supports interactive queries and iterative computation. It computes in memory and has an efficient fault-tolerance mechanism. Spark extends the MapReduce model and can run up to 100 times faster than MapReduce on Hadoop; compared with Hadoop, which suits non-real-time processing of massive data, Spark is better suited to real-time data. It also supports more computing models (e.g., interactive queries and stream processing). The resilient distributed dataset (RDD) is Spark's core data structure: a fault-tolerant collection that can be operated on in parallel, supporting two main kinds of operations, transformations and actions.

Community discovery based on Community Search (CS)

The CS algorithm is a community discovery algorithm improved with an undirected weighted graph. It compresses the MAC addresses in the original data and uses the sorted MAC data as the vertices V of the graph. If MAC_i (denoted V_i) and MAC_j (denoted V_j) appear at the same place or share the same AP, they are connected by an edge e, denoted e_ij, with edge weight W_ij = 1. The MAC addresses are mapped into the graph in sequence, and each time e_ij occurs again, W_ij is increased by 1, until all MAC addresses have been processed. The groups formed by the edges with larger weights are the communities found by the CS algorithm. The flow chart of the algorithm is shown in FIG. 1.
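The edge-weighting step described above can be sketched as follows. This is only an illustration under stated assumptions: the input format (a list of groups of MAC addresses observed together at one place or AP), the function name `build_edge_weights`, and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_edge_weights(groups):
    """Count how often each pair of MAC addresses co-occurs.

    `groups` is an iterable of collections of MAC addresses seen at the
    same place or on the same AP (a hypothetical input format).
    """
    weights = Counter()
    for macs in groups:
        # sorted() gives every undirected edge one canonical key (i, j).
        for i, j in combinations(sorted(set(macs)), 2):
            weights[(i, j)] += 1  # first sighting sets W_ij = 1, later ones add 1
    return weights


groups = [
    ["MAC1", "MAC2", "MAC3"],
    ["MAC1", "MAC2"],
    ["MAC2", "MAC3"],
    ["MAC1", "MAC2"],
]
weights = build_edge_weights(groups)
for edge, w in weights.most_common():
    print(edge, w)
```

Edges with the largest weights, here the pair MAC1 and MAC2, would be the ones the CS algorithm groups into a community.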

The algorithm has the following defects: because the CS algorithm orders the edges of all vertices one by one, it produces edge redundancy, and generating groups from edges produces group redundancy. A group with a high weight mined by the CS algorithm indicates a strong relationship between two members, but in real life the relationship between members is often not as strong as the algorithm suggests.

T-CS algorithm based on MapReduce

The MP-T-CS algorithm applies the T-CS algorithm to Hadoop's MapReduce platform and parallelizes the group search algorithm using a special binary tree. It first stores the original MAC data in the form of a special binary tree, converts the binary tree into text for storage, and performs map and reduce operations on the text data to generate (key, value) pairs. The key-value pairs are then sorted by weight with k-selection. Finally, a depth-first search traversal of the sorted graph yields the maximal connected graph, which is the community found by the MP-T-CS algorithm. The flow chart of the algorithm is shown in FIG. 2.

The algorithm has the following defects: although the MP-T-CS algorithm mines better than the CS algorithm both globally and locally and improves time and space performance, it is a parallel computation based on MapReduce. Its results are stored in HDFS files and must be read from disk each time they are used, so a large number of iterations increases the running time.

Spark-based ACS algorithm

The SACS algorithm parallelizes the Apriori algorithm mainly through Spark's distributed computation. It first stores the data set in HDFS and splits the generated RDD into n partitions for processing. The (key, value) pairs generated from each partition are processed to obtain n local frequent item sets, and a reduce operation on the local frequent item sets yields the global candidate item set. The global candidate item set is then partitioned to generate new (key, value) pairs, and the global support counts are computed and compared with the minimum support. Item sets that do not meet the condition are filtered out to obtain the global frequent item set, the community member set is output, and the algorithm ends. The flow chart of the algorithm is shown in FIG. 3.

The algorithm has the following defects: the SACS algorithm parallelizes Apriori on Spark and overcomes the frequent disk scans of Hadoop-based computation. However, it does not address the defects of the traditional Apriori algorithm: first, the support threshold must still be set manually; second, the database must still be scanned multiple times, which consumes running time.

ARCD algorithm

The ARCD algorithm: a Community Detection algorithm based on improved Association Rules. The ARCD algorithm obtains the location data of community members from MAC address transaction data in the wireless city. First, because the support in the Apriori algorithm is mostly set by experience, the ARCD algorithm incorporates a self-adaptive support method to obtain the minimum support by a mathematical method. Second, to address the Apriori algorithm's frequent database scans, frequent item sets are generated using transaction weights and a Boolean matrix. Finally, because Hadoop writes its results to disk each time it processes data, the improved algorithm is applied to community discovery and parallelized on the Spark platform.

Data pre-processing

Because an excessive data volume prevents the co-occurrence patterns among users from being reflected well, a processing method that compresses MAC addresses in time and space (place) is proposed. Table 2 shows the original data, where Service_code is the access location number, AP_MAC is the address of the accessed AP, User_MAC is the MAC address of the user, and time is the access time.

Table 2 original MAC address dataset

The principle of MAC address preprocessing is to compress records that share the same AP address or that occur within 2 minutes of each other. Compressing the data in Table 2 gives Table 3.

Table 3 MAC address transaction data
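One way to picture the compression step is the sketch below. It is an assumption-laden illustration, not the patented procedure: it assumes that records are grouped when they share the same AP address and fall into the same fixed 2-minute bucket, that timestamps are given in seconds, and that the record layout follows the Service_code, AP_MAC, User_MAC, time columns. The function name `compress` and all sample values are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # the 2-minute window mentioned in the text


def compress(records):
    """Group user MACs by (AP_MAC, 2-minute time bucket).

    Each record is (service_code, ap_mac, user_mac, timestamp_seconds).
    Bucketing by fixed windows is an assumption of this sketch.
    """
    buckets = defaultdict(set)
    for _service_code, ap_mac, user_mac, ts in records:
        buckets[(ap_mac, ts // WINDOW_SECONDS)].add(user_mac)
    # One transaction per bucket, emitted in a deterministic order.
    return [sorted(buckets[key]) for key in sorted(buckets)]


records = [
    ("S1", "AP_A", "MAC1", 10),
    ("S1", "AP_A", "MAC2", 70),
    ("S1", "AP_A", "MAC1", 100),  # repeated sighting in the same window
    ("S1", "AP_A", "MAC3", 130),  # falls into the next 2-minute window
    ("S2", "AP_B", "MAC2", 20),
]
transactions = compress(records)
print(transactions)
```

Each resulting list plays the role of one transaction T_n whose items are the user MAC addresses seen together.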

The frequent item sets of MAC addresses are found using the improved Apriori algorithm, and this process of finding frequent item sets is called community discovery. After data compression, each MAC address is equated with an item I_n, e.g. MAC₁ = I₁, where I_n denotes an item and T_n denotes a transaction. The owners of the MAC_n (I_n) addresses in a frequent item set are members of a community. For example, if the algorithm finally obtains the frequent 3-item set {I₁, I₂, I₃}, the corresponding MAC addresses are {MAC₁, MAC₂, MAC₃} and the corresponding community members are {1, 2, 3}. We conclude that users 1, 2, and 3 belong to the same community.

Improvement based on Apriori algorithm

To address the low computational efficiency of the traditional Apriori algorithm caused by frequent iteration, the Apriori algorithm is optimized in three respects. First, the support is made self-adaptive, and the minimum support is calculated by a mathematical method. Second, a Boolean matrix and transaction weights are introduced to improve the Apriori algorithm and reduce the number of database scans. Third, the improved association rule algorithm is combined with the Spark platform to parallelize the community discovery algorithm.

Support degree self-adaptation

Support self-adaptation means analyzing the statistics of the data itself: the supports are sorted in descending order and fitted with a degree-k polynomial curve, and the point where the second derivative of the fitted curve is 0 is taken as min_sup. Support adaptation first counts the support of each item in the transaction data set D and sorts the supports in descending order. A set of "sequence number, support" pairs is then established according to formula (2).

{(x,y)}＝{(1,V₁),(2,V₂),...,(i,V_i)} (2)

In formula (2), V_i denotes the supports sorted in descending order and assigned to V₁, V₂, ..., V_i, where i is the total number of items in the transaction data set D. A plane coordinate system is established from the data pairs (i, V_i), and a degree-k polynomial curve is fitted to the data pairs; the fitted polynomial curve is given by formula (3).

The value of k starts at 2 and is increased by 1 each time. Whether to keep increasing k is judged by formula (4).

R_k² is the coefficient of determination, with values in [0, 1]. The end condition is Rend = 0.05. While formula (4) holds, the value of k is increased by 1; when formula (4) no longer holds, the loop stops. The second derivative of the current degree-k polynomial is then calculated, as shown in formula (5).

The point x₀ at which the second derivative is 0 gives min_sup. If 1 < x₀ < 2, x₀ is rounded up to 2; if x₀ > 2, x₀ is rounded off. The execution flow chart is shown in FIG. 4.

The pseudocode for support self-adaptation is as follows:

Algorithm 1 Support self-adaptation

Input: transaction data set T, Rend.

Output: min_sup.

C1 = find_candidate_1-itemsets(D)

for c ∈ C1 do

    δ(c) = δ(c) + 1 // support count

L1 = Dec_sort(c) // sort supports in descending order

// adaptively solve for the degree k of the polynomial fitting curve

k = find_k_mult_curve_fitting(L1)

f(x) = mult_curve_fitting(L1, k) // degree-k polynomial curve

where f"(x0) = 0

if 1 < x0 < 2

    x0 = Round(x0) // round x0 up

else x0 = Trunx(x0) // floor x0

min_sup = x0

Apriori based on transaction weights and matrices

To address the problems that the Apriori algorithm frequently scans the transaction data and generates redundant candidate sets, the ARCD algorithm improves the Apriori algorithm by performing an AND operation on the weights and the Boolean matrix to obtain the candidate sets. The database is scanned only once, which improves the operating efficiency of the algorithm.

Boolean matrix: a matrix whose elements take only the values 0 or 1.

Theorem 1: if two frequent (k-1)-item sets can be joined to generate a k-item set, then their first k-2 items must be identical.

Theorem 2: if a set of items is frequent, then all of its subsets must also be frequent. Conversely, if a set of items is infrequent, then all of its supersets must be infrequent.
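The two theorems correspond to the join and prune steps of classical Apriori candidate generation. The sketch below illustrates them on tuples of item names; the function name `join_and_prune` and the sample item sets are hypothetical, and the code shows the classical technique rather than the matrix-based variant of the ARCD algorithm.

```python
from itertools import combinations


def join_and_prune(frequent, k):
    """Build candidate k-item sets from frequent (k-1)-item sets.

    `frequent` is a set of sorted tuples, each of length k-1.
    """
    candidates = set()
    for a in frequent:
        for b in frequent:
            # Join (Theorem 1): first k-2 items identical, last items ordered.
            if a[:k - 2] == b[:k - 2] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune (Theorem 2): every (k-1)-subset must itself be frequent.
                if all(sub in frequent for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates


frequent_2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
print(sorted(join_and_prune(frequent_2, 3)))
```

In this example ("B", "C") and ("B", "D") can be joined, but the result is pruned because its subset ("C", "D") is not frequent.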

The data set D, which contains m transactions and n items, is mapped to a Boolean matrix as shown in formula (6).

Here i = 1, 2, ..., m and j = 1, 2, ..., n, and T_i denotes the i-th transaction record in the transaction data set.

The ARCD algorithm introduces the weight formula sup_count(I) = W_T · I, where sup_count is the support count of the item, W_T is the weight vector of the transactions T, and I is a column vector of the transaction matrix.

The algorithm initializes W_T = {1, 1, ..., 1}. When scanning the transactions T, if a duplicate transaction is found, the weight of that transaction is increased by 1 and the duplicate is removed. The Boolean matrix is updated for the processed transactions, and transactions (rows) that are all 0 and items whose support is smaller than the minimum support are deleted. An AND operation on the columns yields the candidate (k+1)-item sets. This is repeated until no more frequent item sets can be found, at which point the algorithm ends. The flow chart of the improved Apriori algorithm is shown in FIG. 5.
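The weighted Boolean-matrix idea can be sketched as follows. This is a minimal illustration only: it merges duplicate transactions into weights, computes sup_count as the weighted sum over rows where the AND of the chosen columns is 1, and stops at 2-item sets. The function names and the sample transactions are hypothetical and do not reproduce Table 4.

```python
from collections import Counter
from itertools import combinations


def build_weighted_matrix(transactions, items):
    """Merge duplicate transactions; return (weights, Boolean rows)."""
    counts = Counter(frozenset(t) for t in transactions)
    rows, weights = [], []
    for itemset, weight in counts.items():
        rows.append([1 if item in itemset else 0 for item in items])
        weights.append(weight)
    return weights, rows


def weighted_support(weights, rows, columns):
    """Sum W_T over the rows where the AND of the given columns is 1."""
    return sum(w for w, row in zip(weights, rows)
               if all(row[c] for c in columns))


items = ["I1", "I2", "I3"]
transactions = [
    {"I1", "I2"}, {"I1", "I2"},  # duplicates, merged into one row of weight 2
    {"I1", "I3"}, {"I2"}, {"I1", "I2", "I3"},
]
min_sup = 2
weights, rows = build_weighted_matrix(transactions, items)

frequent_1 = [c for c in range(len(items))
              if weighted_support(weights, rows, (c,)) >= min_sup]
frequent_2 = [pair for pair in combinations(frequent_1, 2)
              if weighted_support(weights, rows, pair) >= min_sup]
print([items[c] for c in frequent_1])
print([(items[a], items[b]) for a, b in frequent_2])
```

Because the matrix and the weights are built in a single pass over the transactions, every later support count is computed from memory rather than by rescanning the data.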

The pseudocode of the improved Apriori algorithm is as follows:

Algorithm 2 Acquisition of frequent item sets

Input: transaction data set T, min_sup, t = 1, WT = {1, 1, …, 1}.

Output: frequent t-item sets.

Improved parallel implementation of Apriori algorithm

The ARCD algorithm improves the Apriori algorithm with transaction weights and a matrix so that only one scan of the data set is needed, and combines it with the Spark computing engine to parallelize the algorithm and increase the running speed. The ARCD algorithm runs on Spark as follows. First, the data set is scanned to generate the frequent 1-item set L₁ and the result is stored on HDFS; the data set stored on HDFS is treated as an RDD, which is divided into n blocks distributed to m worker nodes. Second, a local matrix is constructed and its support counts are calculated. An AND operation on the columns of the local matrix yields the local candidate sets. Items whose support is smaller than the minimum support and transactions that are all 0 are deleted to obtain the local frequent item sets. Finally, the local frequent item sets are merged with the reduceByKey operation to obtain the global candidate item sets. The global support is calculated, and columns (item sets) below the minimum support and rows (transactions) that are all 0 are filtered out to obtain the global frequent item sets of the matrix. The frequent item sets are mapped back to MAC addresses, giving the set of members who own those MAC addresses. The flow chart of the improved Apriori algorithm running on Spark is shown in FIG. 6.
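The local-count-then-merge pattern can be simulated without a cluster. The sketch below is plain Python, not Spark code: each list in `partitions` stands in for one block of the RDD, and `merge_counts` adds counts key by key in the way an addition function passed to reduceByKey would. As a simplifying assumption, raw local counts are merged and the minimum support is applied only globally. All names and data are hypothetical.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def local_counts(partition, size):
    """Count every `size`-item set inside one partition (one worker's block)."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), size):
            counts[itemset] += 1
    return counts


def merge_counts(left, right):
    """Add counts key by key, mimicking a reduce-by-key with addition."""
    return left + right


# Hypothetical data already split into n = 2 blocks.
partitions = [
    [{"I1", "I2"}, {"I1", "I2", "I3"}],
    [{"I1", "I3"}, {"I2", "I3"}, {"I1", "I2"}],
]
min_sup = 3

global_counts = reduce(merge_counts, (local_counts(p, 2) for p in partitions))
global_frequent = {k: v for k, v in global_counts.items() if v >= min_sup}
print(global_frequent)
```

Filtering only after the merge avoids discarding an item set that is infrequent in every block but frequent overall; the text above instead filters locally first, which trades that guarantee for less data movement.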

Examples of algorithms

To explain the main idea of the ARCD algorithm more clearly, the transaction data set shown in table 4 is analyzed by way of example.

Table 4 transaction data set T

Scan the transaction data set T, count the support of each item of the candidate item set, sort the supports in descending order, obtain the data pairs using formula (2), and build the coordinate table shown in Table 5.

Table 5 Supports and sequence numbers

Plotting the points in the table in a coordinate system and performing curve fitting according to formula (3) and formula (4) gives k = 4. The fitted support curve of the data set T is shown in formula (6):

f_T(x) = -0.16667x⁴ + 1.833x³ - 6.833x² + 9.167x + 3 (6)

Next, the point where the second derivative of f_T is 0 is found. The second derivative of f_T is shown in formula (7):

f_T"(x) = -2.00004x² + 10.998x - 13.666 (7)

Setting f_T"(x) = 0 and rounding up the root that lies between 1 and 2 gives x₀ = 2, so the minimum support of the data set T is 2. The transaction data set T is then mapped to a Boolean matrix in which each row represents a transaction and each column represents an item. If an item exists in a transaction, the corresponding position is 1; otherwise it is 0. This generates the Boolean matrix R_t.
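The second-derivative step of this example can be checked numerically. The sketch below differentiates the quartic of formula (6) twice, solves the resulting quadratic, and rounds up the root between 1 and 2, following the rounding rule stated earlier. The helper names `second_derivative` and `quadratic_roots` are illustrative only.

```python
import math


def second_derivative(coeffs):
    """Second derivative of a polynomial given highest-degree-first coefficients."""
    n = len(coeffs) - 1  # degree of the polynomial
    return [c * (n - i) * (n - i - 1) for i, c in enumerate(coeffs[:n - 1])]


def quadratic_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0, in ascending order."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    r = math.sqrt(disc)
    return sorted([(-b - r) / (2 * a), (-b + r) / (2 * a)])


f_T = [-0.16667, 1.833, -6.833, 9.167, 3.0]  # coefficients of formula (6)
a, b, c = second_derivative(f_T)
roots = quadratic_roots(a, b, c)
x0 = next(x for x in roots if 1 < x < 2)     # the root used in the example
min_sup = math.ceil(x0)                       # 1 < x0 < 2 is rounded up to 2
print((a, b, c), roots, min_sup)
```

The computed coefficients agree with formula (7), and the root near 1.9 rounds up to the minimum support of 2 used in the rest of the example.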

From the transaction data set T, transactions T₄ and T₅ are duplicates, so the weights are W_T = {1, 1, 1, 2, 1, 1, 1, 1, 1}. The support count of each column of the matrix is calculated, e.g. sup_count(I₁) = W_T · I₁ = 0+0+1+2+1+0+0+0+1 = 5. That is, I₁ has a support of 5 and the minimum support is 2, so item I₁ is frequent. The supports of the other items are calculated in the same way, and all of them are greater than the minimum support threshold, so the frequent 1-item sets are {I₁}, {I₂}, {I₃}, {I₄}, {I₅}. An AND operation on each pair of columns yields the candidate 2-item set matrix R₂.

It can be seen that transaction T₉ is all 0, so T₉ is deleted according to Theorem 2, and the support count of each item set in the candidate 2-item sets is calculated. For example, sup_count(I₁, I₂) = W_T · (I₁I₂) = 0+0+1+0+1+0+0+0 = 2 and sup_count(I₃, I₅) = W_T · (I₃I₅) = 0+0+1+0+0+0+0 = 1. The 2-item set I₁I₂ has a support of 2, so it is frequent. The 2-item set I₃I₅ has a support of 1, which is less than the minimum support of 2, so the column of I₃I₅ is deleted. Calculating the supports of all item sets gives the frequent 2-item sets {I₁I₂}, {I₁I₃}, {I₁I₄}, {I₁I₅}, {I₂I₄}, {I₂I₅}, {I₃I₄}, {I₄I₅}. The columns of the frequent item sets are joined according to Theorem 1 and combined with the AND operation to obtain the candidate 3-item set matrix R₃.

It can be seen that transaction rows T₂, T₇, T₈, and T₁₀ are all 0; according to Theorem 1, these rows are deleted and the support count of each column is calculated. Here sup_count(I₁, I₂, I₄) = 1, sup_count(I₁, I₂, I₅) = 1, and sup_count(I₁, I₃, I₄) = 1, and sup_count(I₂, I₄, I₅) is likewise smaller than the minimum support of 2. The item sets that do not satisfy min_sup are deleted, so the frequent 3-item sets are {I₁I₂I₃}, {I₁I₄I₅}, {I₂I₃I₄}. By Theorem 1, no candidate 4-item set can be obtained by joining and performing the AND operation, so the algorithm ends.

In summary, the frequent item sets {I₁I₂I₃}, {I₁I₄I₅}, {I₂I₃I₄} meet the requirements. Converting each item I_i into its MAC address gives the frequent 3-item sets {MAC₁, MAC₂, MAC₃}, {MAC₁, MAC₄, MAC₅}, {MAC₂, MAC₃, MAC₄}. The users who own these MAC addresses form communities: users 1, 2, and 3 are members of one community; users 1, 4, and 5 are members of one community; and users 2, 3, and 4 are members of one community.

Experimental validation and analysis

Experimental Environment

The experiment builds a Spark cluster comprising 1 master node and 7 slave nodes. Each cluster node is configured as follows: Linux operating system (CentOS 7.3), Scala 2.11.8, JDK 1.8, Hadoop 2.7.3, and Spark 2.1.1.

Comparison of experiments

To verify the validity of the ARCD algorithm, MAC address samples generated from a UCI data set are used, as shown in Table 6.

TABLE 6 data set characteristic information statistics

To verify the accuracy of the mining results, data sets D₁ and D₂ are used to compare the CS algorithm, the MP-T-CS algorithm, the Apriori algorithm, the SACS algorithm, and the ARCD algorithm. Let min_sup = 20% and δ = 0.05. N() denotes the number of MAC addresses after mining. The experimental results are shown in Table 7.

TABLE 7 comparison of data set mining results

As can be seen from Table 7, the ARCD algorithm matches the traditional Apriori algorithm and the Spark-parallelized SACS algorithm in the number and quality of mined MAC addresses, and it mines community members more accurately than the CS algorithm and the MP-T-CS algorithm.

To verify the computational efficiency of the algorithm, data sets D1-D5 are selected and the ARCD algorithm is compared with the CS algorithm, the MP-T-CS algorithm, the Apriori algorithm, and the SACS algorithm. The experimental results are shown in FIG. 7.

As can be seen from FIG. 7, when the data volume is small, the computing advantages of the Hadoop and Spark platforms do not show. As the data volume increases, the advantage of the ARCD algorithm becomes increasingly clear: it runs faster than the other algorithms, and its computation time is one tenth of that of the traditional Apriori algorithm. First, the ARCD algorithm compresses the data set with the Boolean matrix and obtains the frequent item sets with a single database scan, which reduces the running time. Second, the ARCD algorithm exploits Spark's in-memory computation; the larger the data, the more pronounced Spark's advantage and the faster the computation.

To verify the scalability and the parallelization effect of the algorithm, the acceleration ratio of the ARCD algorithm is compared with that of the MP-T-CS algorithm on the D4 data set.

As can be seen from FIG. 8, the acceleration ratio of both algorithms increases with the number of cluster nodes, which shows that both algorithms parallelize well. The acceleration ratio of the ARCD algorithm is higher than that of the SACS algorithm, mainly because the ARCD algorithm maps the data set to a Boolean matrix and obtains the frequent item sets through AND operations, so the database is scanned only once and the running speed improves. The experiments show that the ARCD algorithm has better parallelism and runs faster when processing massive data sets.

To verify the scalability of the algorithm on different data sets, the experiments take data sets D1-D5 and test the acceleration ratio of the ARCD algorithm with different numbers of nodes. The experimental results are shown in FIG. 9.

As can be seen from FIG. 9, when the data volume is small, the acceleration effect is not significant as the number of cluster nodes increases, because the computation time is short and parallel computation offers little advantage. When the data volume is large enough, the acceleration ratio grows almost linearly. The experiments show that the ARCD algorithm has good parallel processing capability and good scalability when facing massive data sets.

To address the low computational efficiency of traditional community discovery algorithms, the ARCD algorithm is proposed, which improves association rules with transaction weights and a Boolean matrix on the Spark platform. The ARCD algorithm incorporates a self-adaptive support method, which removes the subjectivity of manually setting the support. The improved algorithm is parallelized on Spark, which avoids the frequent disk scans and increased I/O cost of MapReduce. Experimental results show that the ARCD algorithm is computationally efficient when mining community membership, has good parallelism and scalability, and has a clear advantage when processing massive data.

While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

1. A community discovery algorithm based on an improved association rule is characterized by comprising the following steps:

2. The community discovery algorithm based on an improved association rule according to claim 1, wherein in step S1, the Apriori algorithm is optimized:

3. The community discovery algorithm based on an improved association rule according to claim 1, wherein in step S2, to address the problem that the Apriori algorithm frequently scans transaction data and generates redundant candidate sets, the ARCD algorithm improves the Apriori algorithm by performing an AND operation on the weights and the Boolean matrix to obtain the candidate sets.

4. The association rule improvement-based community discovery algorithm according to claim 1, wherein the step S3 comprises the steps of: