CN107145548B

CN107145548B - A parallel sequential pattern mining method based on the Spark platform

Info

Publication number: CN107145548B
Application number: CN201710284017.8A
Authority: CN
Inventors: 余啸; 刘进; 吴思尧; 崔晓晖; 张建升; 井溢洋
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2017-04-26
Filing date: 2017-04-26
Publication date: 2019-08-20
Anticipated expiration: 2037-04-26
Also published as: CN107145548A

Abstract

The invention discloses a parallel sequential pattern mining method based on the Spark platform. It addresses two problems: existing serial sequential pattern mining algorithms lack the computing capacity to process massive data efficiently, and existing Hadoop-based parallel sequential pattern mining algorithms suffer from high I/O overhead and load imbalance. A reasonable sequence database partitioning strategy is designed that mitigates load imbalance as far as possible. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data.

Description

A parallel sequential pattern mining method based on the Spark platform

Technical field

The invention belongs to the technical field of sequential pattern mining, and in particular relates to a parallel sequential pattern mining method based on the Spark platform.

Background art

(1) Sequential pattern mining technology

[document 1] first proposed the concept of sequential pattern mining. Sequential pattern mining discovers frequently occurring ordered events or subsequences in a sequence database. As one of the important research topics in data mining, it has a wide range of applications, such as user purchasing behavior analysis, biological sequence analysis, discovery of frequent taxi trajectory patterns, and analysis of human mobility patterns. [document 2] proposed the GSP algorithm, which prunes redundant candidate patterns and uses a hash tree for fast access to candidate patterns. [document 3] proposed the SPADE algorithm, based on a vertical data representation. [document 4] proposed the PrefixSpan algorithm, based on projected databases. Although optimized data structures and improved mining mechanisms have raised the performance of these traditional serial algorithms, their processing speed often falls short of requirements on large-scale data sets. From the early 21st century, the rapid development of computer hardware greatly promoted research on parallel sequential pattern mining algorithms, and researchers in China and abroad successively proposed various distributed sequential pattern mining algorithms.

[document 5] proposed two different parallel algorithms, based on tree projection, to solve the sequential pattern discovery problem on distributed-memory parallel computers. [document 6] proposed the DMGSP algorithm, which reduces the volume of transmitted data by means of a lexicographic sequence tree. [document 7] proposed the FMGSP algorithm for fast mining of global maximal frequent itemsets. However, because parallel platforms such as distributed-memory systems and grid computing systems provide no fault-tolerance mechanism, the parallel sequential pattern mining algorithms implemented on them are not fault tolerant. In addition, developing parallel algorithms on these platforms requires programmers with extensive parallel algorithm development experience.

The emergence of cloud computing platforms provides new methods and approaches for implementing parallel algorithms, making efficient, low-cost sequential pattern mining on massive data possible. The Hadoop cloud computing platform, developed by the Apache Foundation, is open source, scalable and highly fault tolerant, so programmers without extensive parallel algorithm development experience can easily develop parallel programs on it. Many researchers have therefore proposed parallel sequential pattern mining algorithms based on the Hadoop platform. [document 8] proposed DPSP, a Hadoop-based parallel incremental sequential pattern mining algorithm. [document 9] proposed BIDE-MR, a Hadoop-based parallel closed sequential pattern mining algorithm. [document 10] proposed the Hadoop-based SPAMC algorithm. [document 11] proposed a Hadoop-based parallel PrefixSpan algorithm. [document 12] proposed a Hadoop-based parallel PrefixSpan algorithm built on the idea of transaction decomposition. [document 13] proposed the Hadoop-based DGSP algorithm, built on database partitioning. The algorithms proposed in documents [8], [9], [10] and [11] are based on iterative MapReduce tasks: they must execute multiple MapReduce tasks, each of which reads the sequence database from HDFS, and this incurs a large I/O overhead. The algorithms proposed in documents [12] and [13] are based on non-iterative MapReduce tasks and cannot distribute the computation evenly across the compute nodes, which causes load imbalance.

(2) Map-Reduce programming framework

Map-Reduce is a programming framework, proposed in [document 14], for parallel computation over large-scale data sets (larger than 1 TB). It is built on the concepts of "Map" and "Reduce". The user only needs to write two functions, called Map and Reduce; the system manages the execution of the parallel Map and Reduce tasks and the coordination between them, handles the failure of any of these tasks, and provides fault tolerance against hardware failures.

Calculating process based on Map-Reduce is as follows:

1) The Map-Reduce library in the user program first splits the input file into M data splits, each typically 16 MB to 64 MB in size (the user can control the split size through an optional parameter). The Map-Reduce library then starts many copies of the program on a cluster of machines.

2) One of these program copies is special: the master. The other copies are workers, which are assigned tasks by the master. There are M Map tasks and R Reduce tasks to assign; the master assigns one Map task or one Reduce task to each idle worker.

3) A worker assigned a Map task reads the corresponding input split, parses key-value pairs (key, value) from it, and passes each pair to the user-defined Map function. The intermediate key-value pairs produced by the Map function are buffered in local memory.

4) The buffered key-value pairs are divided into R regions by the partition function and periodically written to local disk. The locations of these pairs on local disk are passed back to the master, which forwards them to the workers assigned Reduce tasks.

5) When a worker assigned a Reduce task receives the storage locations from the master, it uses remote procedure calls to read the buffered data from the local disks of the hosts running the Map workers. Once it has read all the intermediate data, it sorts the data by key so that all data with the same key are grouped together. Sorting is necessary because many different keys are mapped to the same Reduce task. If the intermediate data are too large to be sorted in memory, an external sort is used.

6) The worker assigned a Reduce task iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the set of intermediate values associated with it to the user-defined Reduce function. The output of the Reduce function is appended to the output file of the corresponding partition.

7) After all Map and Reduce tasks have completed, the master wakes up the user program, and the Map-Reduce call in the user program returns.
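The seven steps above can be imitated inside a single process. The following Python sketch is only an illustration under simplifying assumptions: there is no master, no worker process, no disk and no remote procedure call, and the names `run_mapreduce`, `word_map` and `count_reduce` are hypothetical rather than part of any Map-Reduce library.

```python
from collections import defaultdict


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    """Single-process imitation of the map, partition, sort and reduce phases."""
    # Map phase: every record of every input split is passed to the
    # user-defined map function; the partition function (here a hash of the
    # key) assigns each intermediate pair to one of R regions.
    regions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[hash(key) % num_reducers][key].append(value)

    # Reduce phase: each region is sorted by key, and each unique key is
    # passed with all of its values to the user-defined reduce function.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output


def word_map(line):
    """User-defined Map function: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]


def count_reduce(key, values):
    """User-defined Reduce function: sum the values of one key."""
    return sum(values)


splits = [["a b a"], ["b c"]]          # M = 2 input splits
counts = run_mapreduce(splits, word_map, count_reduce)
print(counts)
```

The result does not depend on which region a key lands in, because every pair with the same key reaches the same reduce call.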

(3) Spark cloud computing platform

Spark is an open-source, general-purpose parallel cloud computing platform developed by the UC Berkeley AMP Lab. It is a distributed computing platform based on the MapReduce idea and has the advantages of Hadoop MapReduce. Unlike Hadoop MapReduce, however, the intermediate results of a job can be kept in memory, so the distributed file system (HDFS) need not be read and written between stages. Spark is therefore better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning. Spark provides in-memory distributed data sets: it supports interactive queries and can also cache data sets in memory, which speeds up the reading and writing of data sets, allows data sets to be reused during computation, and optimizes iterative workloads. At the storage layer Spark can use a variety of distributed file systems, such as HDFS, but it is more often deployed together with the resource scheduling platforms Mesos and YARN.

RDD (resilient distributed dataset) is the core of Spark. An RDD is a collection of data objects distributed across the compute nodes and stored in memory. RDDs allow the user to explicitly cache a working set in memory when executing multiple queries, so that subsequent queries can reuse the working set, which greatly improves query speed. An RDD is distributed over multiple nodes and can be processed in parallel. RDDs are scalable and elastic: when memory is smaller than the RDD during computation, data can be spilled to disk so that the computation can continue. An RDD is a partitioned, read-only, immutable data collection that can be operated on in parallel; it can be created only by applying deterministic transformations (such as map, join, filter and group by) to other RDDs. These restrictions, however, make fault tolerance very cheap. Unlike distributed shared memory systems, which pay for expensive checkpointing and rollback, an RDD rebuilds lost partitions through its lineage: an RDD contains the information needed to derive it from other RDDs, so lost data partitions can be rebuilt without checkpointing. Although the RDD is not a general shared-memory abstraction, it offers good expressiveness, scalability and reliability, and can be widely used in data-parallel applications.
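The idea of rebuilding lost data from lineage can be illustrated with a toy class. This is a minimal single-process sketch, not Spark's implementation: the class `ToyRDD` and its methods `cache`, `drop_cache` and `collect` are hypothetical names chosen for illustration, and partitions, nodes and spilling to disk are not modelled.

```python
class ToyRDD:
    """Immutable data set that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # base data, only for the root data set
        self._parent = parent        # data set this one was derived from
        self._transform = transform  # deterministic function applied to the parent
        self._cache = None           # optional in-memory copy of the result

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        """Keep the computed result in memory for reuse by later queries."""
        self._cache = self.collect()
        return self

    def drop_cache(self):
        """Simulate losing the in-memory copy."""
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        if self._parent is None:
            return list(self._source)
        # No cached copy: recompute from the parent by replaying the lineage.
        return self._transform(self._parent.collect())


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
first = evens_squared.collect()      # served from the cached copy
evens_squared.drop_cache()           # the cached copy is lost
rebuilt = evens_squared.collect()    # rebuilt from lineage, no checkpoint needed
print(first, rebuilt)
```

Because every transformation is deterministic and the source is read-only, the rebuilt result is identical to the cached one.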

Related documents:

[document 1] Agrawal R, Srikant R. Mining sequential patterns[C]//Proceedings of the 11th International Conference on Data Engineering. Taipei: IEEE Computer Society, 1995: 3-14.

[document 2] Srikant R, Agrawal R. Mining sequential patterns: Generalizations and performance improvements[C]//Proceedings of the 5th International Conference on Extending Database Technology. Avignon: Lecture Notes in Computer Science, 1996: 3-17.

[document 3] Zaki M. SPADE: An efficient algorithm for mining frequent sequences[J]. Machine Learning, 2001, 41(2): 31-60.

[document 4] Pei J, Han J, Pinto H. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth[C]//Proceedings of the 17th International Conference on Data Engineering. Washington, IEEE Transactions on Data Engineering, 2004, 16(1): 1424-1440.

[document 5] Gurainikv, Gargn, Vipink. Parallel tree projection algorithm for sequence mining[C]//Proceedings of the 7th International European Conference on Parallel Processing. London, 2001: 310-320.

[document 6] Gong Zhenzhi, Hu Kongfa, Da Qingli, Zhang Changhai. DMGSP: a fast distributed global sequential pattern mining algorithm[J]. Journal of Southeast University, 2007, 16(04): 574-579.

[document 7] Zhang Changhai, Hu Kongfa, Liu Haidong. FMGSP: an efficient method of mining global sequential patterns[C]//Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery. Los Alamitos: IEEE Computer Society, 2007: 761-765.

[document 8] J. Huang, S. Lin, M. Chen, "DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud," Lecture Notes in Computer Science, pp. 27-34, 2010.

[document 9] D. Yu, W. Wu, S. Zheng, Z. Zhu, "BIDE-Based Parallel Mining of Frequent Closed Sequences with MapReduce," In: Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing, pp. 177-186, 2012.

[document 10] Chun-Chieh Chen, Chi-Yao Tseng, Chi-Yao Tseng, "Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud," In 2013 IEEE International Congress on Big Data, pp. 310-317, 2013.

[document 11] P. N. Sabrina, "Multiple MapReduce and Derivative projected database: new approach for supporting prefixspan scalability," IEEE, pp. 148-153, Nov. 2015.

[document 12] X. Wang, "Parallel sequential pattern mining by transaction decomposition," IEEE Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on, vol. 4, pp. 1746-1750.

[document 13] X. Yu, J. Liu, C. Ma, B. Li, "A MapReduce reinforced distributed sequential pattern mining algorithm," Algorithms and Architectures for Parallel Processing, vol. 9529, pp. 183-197, Dec. 2015.

[document 14] Jeffrey Dean and Sanjay Ghemawat. Map-Reduce: Simplified data processing on large clusters[C]//Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. New York: ACM Press, 2004: 137-149.

Summary of the invention

To address the inefficiency of existing serial sequential pattern mining algorithms when processing massive data, and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, the present invention provides a parallel sequential pattern mining method based on the Spark platform.

The technical solution adopted by the invention is a parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:

Step 1: database partitioning;

The sequence database is partitioned into database fragments of equal size; the number of fragments is determined by the number of worker nodes in the cluster;

Step 2: database preparation;

All 1-sequence patterns are generated using one MapReduce task;

Step 3: database mining;

All k-sequence patterns (k > 1) are found iteratively using MapReduce tasks.

The invention designs a reasonable sequence database partitioning strategy that mitigates load imbalance as far as possible. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data. The technical solution of the invention is simple and fast, and improves the efficiency of sequential pattern mining.

Brief description of the drawings

Fig. 1 is a flow chart of the embodiment of the invention;

Fig. 2 is a schematic diagram of sequence database partitioning in the embodiment of the invention;

Fig. 3 is a schematic diagram of the sequence database partitioning result in the embodiment of the invention;

Fig. 4 is a schematic diagram of the execution of the database preparation step in the embodiment of the invention;

Fig. 5 is a schematic diagram of the execution of the first MapReduce task of the database mining step in the embodiment of the invention;

Fig. 6 is a schematic diagram of the execution of the second MapReduce task of the database mining step in the embodiment of the invention.

Detailed description of the embodiments

To help those of ordinary skill in the art understand and implement the invention, it is described in further detail below with reference to the drawings and an embodiment. It should be understood that the embodiment described here serves only to illustrate and explain the invention and is not intended to limit it.

The flow of the Spark-based sequential pattern mining algorithm designed by the invention is shown in Fig. 1. All steps can be run automatically by those skilled in the art using computer software. The embodiment is implemented as follows:

Step 1, database partitioning;

The sequence database is partitioned into database fragments of equal size (the number of fragments is determined by the number of worker nodes in the cluster).

As shown in Fig. 2, the specific steps of sequence database partitioning are as follows:

(1) Sort all sequences in the database in descending order of sequence length.

(2) The first n sequences form n initial database fragments, each containing one sequence. The total sequence length of each database fragment is initialized to the length of the sequence it contains.

(3) Build a min-heap Ψ = {D₁, D₂, D₃, …, D_n} keyed on the total sequence length of the n database fragments, where the root node D₁ of the min-heap is the database fragment that was assigned the shortest sequence.

(4) Take the root node D_i of the min-heap, add the longest of the unassigned sequences to D_i, and re-adjust the min-heap.

(5) Repeat step (4) until every sequence has been assigned to a database fragment.

As shown in Fig. 3, this embodiment partitions the original sequence database into n = 3 sub-sequence databases.

The content of the original sequence database is shown in Table 1:

Table 1

Sequence number	Sequence
S₁	<(a b) a c>
S₂	<(c d)(e f g)>
S₃	<h>
S₄	<c g>
S₅	<g a>
S₆	<a c g h>

The database is sorted first, giving the order S₂, S₁, S₆, S₄, S₅, S₃. The first three sequences are taken to build the initial heap. At this point the three sub-sequence databases and the sequences they contain are: sub-database P₁: S₂, length 5; sub-database P₂: S₁, length 4; sub-database P₃: S₆, length 4. The root node of the min-heap is P₂. The remaining sequences of the sorted database are then read in order. S₄ is read first and added to P₂, so P₂ now has length 6; after the min-heap is adjusted, its root node is P₃. S₅ is read and added to P₃, so P₃ now has length 6; after the min-heap is adjusted, its root node is P₁. S₃ is read and added to P₁, so P₁ now has length 6. All sequences have now been read, and the database partitioning step ends. In this embodiment the partitioning result gives every sequence database fragment the same total sequence length.

The resulting sub-sequence databases 1, 2 and 3 are shown in Tables 2, 3 and 4 respectively:

Table 2

Sequence number	Sequence
S₂	<(c d)(e f g)>
S₃	<h>

Table 3

Sequence number	Sequence
S₁	<(a b) a c>
S₄	<c g>

Table 4

Sequence number	Sequence
S₆	<a c g h>
S₅	<g a>
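The partitioning steps (1) to (5) can be sketched with Python's `heapq` module. This is a minimal sketch, assuming that a sequence is represented as a list of itemsets (tuples of items), that its length is its total number of items, and that n does not exceed the number of sequences; the names `seq_length` and `partition_database` are hypothetical. Ties in the heap are broken by the lower fragment index, which is an assumption that happens to match the order described in the embodiment.

```python
import heapq

# Table 1, with each sequence written as a list of itemsets.
database = {
    "S1": [("a", "b"), ("a",), ("c",)],
    "S2": [("c", "d"), ("e", "f", "g")],
    "S3": [("h",)],
    "S4": [("c",), ("g",)],
    "S5": [("g",), ("a",)],
    "S6": [("a",), ("c",), ("g",), ("h",)],
}


def seq_length(sequence):
    """Sequence length = total number of items over all itemsets."""
    return sum(len(itemset) for itemset in sequence)


def partition_database(db, n):
    """Greedy min-heap partitioning; assumes 1 <= n <= len(db)."""
    # (1) Sort the sequences by length in descending order (stable sort).
    ordered = sorted(db.items(), key=lambda kv: seq_length(kv[1]), reverse=True)
    # (2) The first n sequences seed the n fragments.
    fragments = [[sid] for sid, _ in ordered[:n]]
    # (3) Min-heap of (total length, fragment index).
    heap = [(seq_length(seq), idx) for idx, (_, seq) in enumerate(ordered[:n])]
    heapq.heapify(heap)
    # (4)-(5) Give the longest unassigned sequence to the lightest fragment.
    for sid, seq in ordered[n:]:
        total, idx = heapq.heappop(heap)
        fragments[idx].append(sid)
        heapq.heappush(heap, (total + seq_length(seq), idx))
    totals = [0] * n
    for total, idx in heap:
        totals[idx] = total
    return fragments, totals


fragments, totals = partition_database(database, 3)
print(fragments)  # [['S2', 'S3'], ['S1', 'S4'], ['S6', 'S5']]
print(totals)     # [6, 6, 6]
```

On the data of Table 1 the sketch reproduces Tables 2, 3 and 4, with a total sequence length of 6 in every fragment.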

Let q be the number of Map nodes in the Spark platform. It is recommended that the number of sub-sequence databases equal the number of Map nodes, i.e. n = q. If n < q, then, assuming no task fails, (q - n) Map nodes are left unused when the method runs, and node utilization is low. If n > q, then, assuming no task fails, the remaining n - q sub-sequence databases can be processed only after the q Map nodes have finished the first q sub-sequence databases, and processing efficiency is low. Setting n = q therefore satisfies both node utilization and processing efficiency.
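The argument for n = q can be checked with a little arithmetic. The sketch below is an illustration only: it assumes that every fragment takes the same time to process, that each Map node handles one fragment at a time, and that no task fails. The function name `schedule_stats` is hypothetical.

```python
import math


def schedule_stats(n, q):
    """Return (rounds, utilization) for n equal fragments on q Map nodes."""
    rounds = math.ceil(n / q)          # waves of Map tasks needed
    utilization = n / (rounds * q)     # busy node-slots / available node-slots
    return rounds, utilization


for n in (2, 3, 4):
    rounds, utilization = schedule_stats(n, q=3)
    print(f"n={n}, q=3: rounds={rounds}, utilization={utilization:.2f}")
```

With q = 3, only n = 3 finishes in one round with every node busy; n = 2 leaves a node idle, and n = 4 needs a second round in which two nodes are idle.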

Step 2, database preparation;

In this step, all 1-sequence patterns are generated using one MapReduce task. The step first calls a flatMap function to read each sequence from the sequence database fragment, where sequences are stored as <LongWritable offset, Text sequence> key-value pairs. It then calls another flatMap function to split each sequence into items and generate <item, 1> key-value pairs. Key-value pairs with the same key are merged and passed to a Reduce node. The Reduce node calls the reduceByKey() function to compute the support of each <item, 1> key-value pair and outputs the pairs whose support is greater than or equal to the set minimum support. The key of each such pair is a 1-sequence pattern, and the value is the support count of that 1-sequence pattern.

The embodiment sets the minimum support to 2. The execution of the preparation step is shown in Fig. 4. The key-value pairs that the Map node generates for database fragment 1 are shown in Table 5:

Table 5

Output result
<c,1>
<d,1>
<e,1>
<f,1>
<g,1>
<h,1>

The key-value pairs that the Map node generates for database fragment 2 are shown in Table 6:

Table 6

Output result
<a,1>
<b,1>
<a,1>
<c,1>
<c,1>
<g,1>

The key-value pairs that the Map node generates for database fragment 3 are shown in Table 7:

Table 7

Output result
<a,1>
<c,1>
<g,1>
<h,1>
<g,1>
<a,1>

The Reduce node merges the key-value pairs that have the same key and outputs the pairs whose support is greater than or equal to 2, as shown in Table 8:

Table 8

Sequence pattern	Support
a	3
c	4
g	4
h	2
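The preparation step can be imitated in plain Python, with a list comprehension standing in for flatMap and a `Counter` standing in for reduceByKey(). This is a sketch under stated assumptions, not the patented implementation: the helper names `map_items`, `reduce_by_key` and `frequent_items` are hypothetical, and each sequence emits one <item, 1> pair per distinct item, so that support counts sequences and the result agrees with Table 8.

```python
from collections import Counter

# Database fragments 1, 2 and 3 (Tables 2, 3 and 4) as lists of itemsets.
fragments = [
    [[("c", "d"), ("e", "f", "g")], [("h",)]],                 # S2, S3
    [[("a", "b"), ("a",), ("c",)], [("c",), ("g",)]],          # S1, S4
    [[("a",), ("c",), ("g",), ("h",)], [("g",), ("a",)]],      # S6, S5
]


def map_items(sequence):
    """Stand-in for flatMap: emit (item, 1) once per distinct item (assumption)."""
    distinct = {item for itemset in sequence for item in itemset}
    return [(item, 1) for item in sorted(distinct)]


def reduce_by_key(pairs):
    """Stand-in for reduceByKey: sum the values of pairs sharing a key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def frequent_items(frags, min_support):
    pairs = [pair for frag in frags for seq in frag for pair in map_items(seq)]
    counts = reduce_by_key(pairs)
    return {item: n for item, n in counts.items() if n >= min_support}


one_patterns = frequent_items(fragments, min_support=2)
print(sorted(one_patterns.items()))  # [('a', 3), ('c', 4), ('g', 4), ('h', 2)]
```

Items b, d, e and f each occur in only one sequence and are filtered out by the minimum support of 2.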

Step 3, database mining;

This step iteratively finds all k-sequence patterns (k > 1) using MapReduce tasks. The 1-sequence patterns generated in the preparation step are stored in an RDD rather than in HDFS, to reduce I/O overhead. In the k-th MapReduce task, each Map node first reads the (k-1)-sequence patterns from the RDD and generates the candidate k-sequence patterns (C_k) through the candidate sequence pattern generation step. It then calls a map function to read each sequence s in its database fragment and tests whether the candidate k-sequence pattern C_k is a subsequence of s; if it is, a <C_k, 1> key-value pair is generated. Key-value pairs with the same key are merged and passed to a Reduce node. Finally, each Reduce node calls the reduceByKey() function to compute the support of each <C_k, 1> key-value pair and outputs the pairs whose support is greater than or equal to the set minimum support as the final k-sequence patterns (L_k).
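The subsequence test performed by each Map node can be sketched as follows. The sketch assumes the usual definition without time or gap constraints: a candidate is a subsequence of s if its itemsets can be matched, in order, to distinct elements of s that contain them. The function name `is_subsequence` and the tuple-based representation are assumptions made for illustration.

```python
def is_subsequence(candidate, sequence):
    """True if every itemset of `candidate` is contained, in order,
    in a distinct element of `sequence` (greedy left-to-right match)."""
    pos = 0
    for itemset in candidate:
        # Advance to the first remaining element that contains the itemset.
        while pos < len(sequence) and not set(itemset) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a later element
    return True


s2 = [("c", "d"), ("e", "f", "g")]      # S2 from Table 1
s1 = [("a", "b"), ("a",), ("c",)]       # S1 from Table 1

print(is_subsequence([("c",), ("g",)], s2))   # True:  <c g> occurs in S2
print(is_subsequence([("g",), ("c",)], s2))   # False: order matters
print(is_subsequence([("a",), ("a",)], s1))   # True:  two distinct elements contain a
print(is_subsequence([("a", "c")], s1))       # False: a and c never share an element
```

Matching greedily at the earliest possible element is sufficient here, because taking an earlier match never removes a later option.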

This embodiment sets the minimum support to 2. The execution of the first MapReduce task of the mining step is shown in Fig. 5. The key-value pairs that Map node 1 generates for database fragment 1 are shown in Table 9:

Table 9

Output result
<c g,1>

The key-value pairs that Map node 2 generates for database fragment 2 are shown in Table 10:

Table 10

Output result
<a c,1>
<c g,1>

The key-value pairs that Map node 3 generates for database fragment 3 are shown in Table 11:

Table 11

Output result
<a c,1>
<a g,1>
<a h,1>
<c g,1>
<c h,1>
<g h,1>
<g a,1>

The Reduce node merges the key-value pairs generated by all Map nodes and outputs the pairs whose support is greater than or equal to the set minimum support, as shown in Table 12:

Table 12

Sequence pattern	Support
a c	2
c g	3

The execution of the second MapReduce task of the mining step is shown in Fig. 6. Each Map node reads the 2-sequence patterns from the RDD and runs the candidate sequence pattern generation step, but no candidate 3-sequence pattern is generated. Map node 1 therefore outputs no key-value pairs for database fragment 1, Map node 2 outputs none for database fragment 2, and Map node 3 outputs none for database fragment 3. The Reduce node merges the output of all Map nodes, finds that no Map node produced any output, and the program therefore terminates.

The specific embodiment described here merely illustrates the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiment, or substitute similar methods, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.
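The iteration and its termination can be traced with a compact single-process sketch. It is a simplification under stated assumptions, not the patented implementation: candidates contain only single-item elements (itemset candidates such as <(a b)> are not generated), candidate generation uses a GSP-style join followed by pruning of any candidate that has an infrequent (k-1)-subsequence, and the frequent patterns of the previous level are held in an ordinary Python list where the method keeps them in an RDD. The names `contains`, `generate_candidates` and `mine` are hypothetical.

```python
from collections import Counter
from itertools import combinations

fragments = [
    [[("c", "d"), ("e", "f", "g")], [("h",)]],                 # fragment 1
    [[("a", "b"), ("a",), ("c",)], [("c",), ("g",)]],          # fragment 2
    [[("a",), ("c",), ("g",), ("h",)], [("g",), ("a",)]],      # fragment 3
]


def contains(sequence, candidate):
    """True if the items of `candidate` occur in order in distinct elements."""
    pos = 0
    for item in candidate:
        while pos < len(sequence) and item not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def generate_candidates(prev_patterns, k):
    """Join (k-1)-patterns, then prune candidates with an infrequent subsequence."""
    prev = set(prev_patterns)
    if k == 2:
        items = sorted(p[0] for p in prev)
        return [(x, y) for x in items for y in items]
    joined = {p + (q[-1],) for p in prev for q in prev if p[1:] == q[:-1]}
    return sorted(c for c in joined
                  if all(sub in prev for sub in combinations(c, k - 1)))


def mine(frags, frequent_items, min_support):
    """Return {k: {pattern: support}} for every k > 1 that has frequent patterns."""
    levels = {}
    prev = [(item,) for item in frequent_items]
    k = 2
    while prev:
        candidates = generate_candidates(prev, k)
        counts = Counter()
        for fragment in frags:                      # one Map node per fragment
            for sequence in fragment:
                for cand in candidates:
                    if contains(sequence, cand):    # emit <C_k, 1>
                        counts[cand] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        if level:
            levels[k] = level
        prev = list(level)                          # empty list ends the loop
        k += 1
    return levels


patterns = mine(fragments, ["a", "c", "g", "h"], min_support=2)
print(patterns)  # {2: {('a', 'c'): 2, ('c', 'g'): 3}}
```

In the second iteration the join of <a c> and <c g> yields <a c g>, which is pruned because its subsequence <a g> is not frequent, so no candidate 3-sequence pattern remains and the loop stops.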

Claims

1. A parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:

Step 1: database partitioning;

The sequence database is partitioned into database fragments of equal size, and the number of fragments is determined by the number of worker nodes in the cluster;

The specific implementation of step 1 comprises the following sub-steps:

Step 1.1: sort all sequences in the database in descending order of sequence length;

Step 1.2: the first n sequences form n initial database fragments, each containing one sequence; the total sequence length of each database fragment is initialized to the length of the sequence it contains;

Step 1.3: build a min-heap Ψ = {D₁, D₂, D₃, …, D_n} keyed on the total sequence length of the n database fragments, where the root node D₁ of the min-heap is the database fragment that was assigned the shortest sequence;

Step 1.4: take the root node D_i of the min-heap, add the longest of the unassigned sequences to D_i, and re-adjust the min-heap;

Step 1.5: repeat step 1.4 until every sequence has been assigned to a database fragment;

Step 2: database preparation;

All 1-sequence patterns are generated using one MapReduce task;

The specific implementation of step 2 comprises the following sub-steps:

Step 2.1: call a first flatMap function to read each sequence from the sequence database fragment, where sequences are stored as <LongWritable offset, Text sequence> key-value pairs;

Step 2.2: call another flatMap function to split each sequence into items and generate <item, 1> key-value pairs;

Step 2.3: key-value pairs with the same key are merged and passed to a Reduce node; the Reduce node calls the reduceByKey() function to compute the support of each <item, 1> key-value pair and outputs the pairs whose support is greater than or equal to the set minimum support; the key of each such pair is a 1-sequence pattern, and the value is the support count of that 1-sequence pattern;

Step 3: database mining;

All k-sequence patterns (k > 1) are found iteratively using MapReduce tasks;

The specific implementation of step 3 comprises the following sub-steps:

Step 3.1: the 1-sequence patterns generated in step 2 are stored in an RDD rather than in HDFS, to reduce I/O overhead;

Step 3.2: in the k-th MapReduce task, each Map node first reads the (k-1)-sequence patterns from the RDD and generates the candidate k-sequence patterns C_k through the candidate sequence pattern generation step;

Step 3.3: call a map function to read each sequence s in the database fragment and test whether the candidate k-sequence pattern C_k is a subsequence of s; if it is, generate a <C_k, 1> key-value pair; key-value pairs with the same key are merged and passed to a Reduce node;

Step 3.4: each Reduce node calls the reduceByKey() function to compute the support of each <C_k, 1> key-value pair and outputs the pairs whose support is greater than or equal to the set minimum support as the final k-sequence patterns L_k.