CN107145548A - A kind of Parallel Sequence mode excavation method based on Spark platforms - Google Patents
- Publication number
- CN107145548A CN107145548A CN201710284017.8A CN201710284017A CN107145548A CN 107145548 A CN107145548 A CN 107145548A CN 201710284017 A CN201710284017 A CN 201710284017A CN 107145548 A CN107145548 A CN 107145548A
- Authority
- CN
- China
- Prior art keywords
- sequence
- key
- database
- value pair
- burst
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2308—Concurrency control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Abstract
The invention discloses a parallel sequential pattern mining method based on the Spark platform. To address the poor computational efficiency of existing serial sequential pattern mining algorithms on massive data, and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, a rational sequence database partitioning strategy is designed that resolves the load-imbalance problem to the greatest extent. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data.
Description
Technical field
The invention belongs to the technical field of sequential pattern mining, and in particular relates to a parallel sequential pattern mining method based on the Spark platform.
Background art
(1) Sequential pattern mining technology
[Document 1] first proposed the concept of sequential pattern mining. Sequential pattern mining discovers frequently occurring ordered events or subsequences in a sequence database. As one of the important research topics in data mining, it has a wide range of applications, such as user purchasing behavior analysis, biological sequence analysis, taxi frequent-trajectory pattern discovery, and human mobility pattern analysis. [Document 2] proposed the GSP algorithm, which uses a redundant-candidate pruning strategy and a hash tree to achieve fast access to candidate patterns. [Document 3] proposed the SPADE algorithm based on a vertical data representation. [Document 4] proposed the PrefixSpan algorithm based on projected databases. Although these traditional serial algorithms improved performance through changes in data structures and mining mechanisms, their processing speed on large-scale datasets often fails to meet practical requirements. By the early 21st century, the rapid development of computer hardware had greatly promoted research on parallel sequential pattern mining algorithms, and scholars at home and abroad have successively proposed various distributed sequential pattern mining algorithms.
[Document 5] proposed two parallel algorithms based on tree projection to solve the sequential pattern discovery problem on distributed-memory parallel computers. [Document 6] proposed the DMGSP algorithm, which reduces the volume of transmitted data through a lexicographic sequence tree. [Document 7] proposed the FMGSP algorithm for fast mining of globally maximal frequent sequences. However, because distributed-memory systems and grid computing systems provide no fault-tolerance mechanism, the parallel sequential pattern mining algorithms built on these platforms are not fault-tolerant. In addition, developing parallel algorithms on these platforms requires programmers to have substantial parallel-algorithm development experience.
The emergence of cloud computing platforms provides new methods and approaches for implementing parallel algorithms, making efficient, low-cost sequential pattern mining on massive data possible. The Hadoop cloud computing platform developed by the Apache Software Foundation, thanks to its open-source nature, scalability, and high fault tolerance, allows programmers without extensive parallel-algorithm development experience to develop concurrent programs easily on the Hadoop platform, and many scholars have therefore proposed parallel sequential pattern mining algorithms based on Hadoop. [Document 8] proposed DPSP, a parallel incremental sequential pattern mining algorithm based on Hadoop. [Document 9] proposed BIDE-MR, a Hadoop-based parallel closed-sequence mining algorithm. [Document 10] proposed the SPAMC algorithm based on Hadoop. [Document 11] proposed a parallel PrefixSpan algorithm based on Hadoop. [Document 12] proposed a Hadoop-based parallel PrefixSpan algorithm built on the idea of transaction decomposition. [Document 13] proposed the Hadoop-based DGSP algorithm built on database partitioning. The algorithms of documents [8][9][10][11], which are based on iterative MapReduce jobs, must execute multiple MapReduce jobs that each read the sequence database from HDFS, incurring very large I/O overhead. The algorithms of documents [12][13], which are based on non-iterative MapReduce jobs, cannot distribute the computation evenly across the compute nodes, causing load imbalance.
(2) The Map-Reduce programming framework
Map-Reduce, proposed in [Document 14], is a programming framework for concurrent computation over large-scale datasets (larger than 1 TB) that adopts the concepts "Map" and "Reduce". The user only needs to write two functions, called Map and Reduce; the system manages the execution of the parallel Map and Reduce tasks and the coordination between them, handles the failure of any of these tasks, and guarantees fault tolerance against hardware failures.
The Map-Reduce computation proceeds as follows:
1) The Map-Reduce library in the user program first splits the input files into M data fragments, each typically 16 to 64 MB in size (the user can control the fragment size through an optional parameter), and then starts many copies of the program on a cluster of machines.
2) One of these program copies is special: the master. All the other copies are workers that are assigned work by the master. There are M Map tasks and R Reduce tasks to assign; the master assigns a Map task or a Reduce task to an idle worker.
3) A worker assigned a Map task reads the corresponding input fragment, parses (key, value) pairs out of the input data, and passes each pair to the user-defined Map function. The intermediate key-value pairs produced by the Map function are buffered in local memory.
4) Periodically, the buffered key-value pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which forwards them to the workers assigned Reduce tasks.
5) When a worker assigned a Reduce task is notified by the master of these storage locations, it uses remote procedure calls to read the buffered data from the local disks of the Map workers. When the Reduce worker has read all intermediate data, it sorts the data by key so that all occurrences of the same key are grouped together. The sort is necessary because many different keys map to the same Reduce task; if the intermediate data is too large to be sorted in memory, an external sort is used.
6) The Reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of values to the user-defined Reduce function. The output of the Reduce function is appended to the output file of the partition.
7) When all Map and Reduce tasks have completed, the master wakes up the user program, and the Map-Reduce call in the user program returns.
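The seven steps above can be condensed into a small single-process sketch (plain Python, no Hadoop; `map_reduce` and its parameters are illustrative names, not part of any real framework):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn, num_partitions=3):
    """Minimal single-process sketch of the Map-Reduce flow:
    map -> partition by key -> group/sort -> reduce."""
    # Map phase: each input record yields intermediate (key, value) pairs,
    # routed by a partition function into one of R regions.
    regions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in map_fn(record):
            regions[hash(key) % num_partitions][key].append(value)
    # Reduce phase: each region is processed with its keys in sorted order,
    # so all values for the same key are handled together.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output

# Word count, the canonical Map-Reduce example.
lines = ["a b a", "b c"]
result = map_reduce(lines,
                    map_fn=lambda line: [(w, 1) for w in line.split()],
                    reduce_fn=lambda key, values: sum(values))
print(result)  # counts: a -> 2, b -> 2, c -> 1
```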
(3) The Spark cloud computing platform
Spark is an open-source general-purpose parallel cloud computing platform developed by the AMP Lab at UC Berkeley. Spark is a distributed computing engine built on the MapReduce idea and has the advantages of Hadoop MapReduce; the difference is that intermediate results can be kept in memory, so the distributed file system (HDFS) need not be read and written between stages. Spark is therefore better suited to MapReduce algorithms that require iteration, such as those used in data mining and machine learning. Spark supports in-memory distributed datasets, provides interactive queries, and can cache datasets in memory to improve their read/write rate, enabling reuse of datasets across the computation and optimizing iterative workloads. At the bottom layer, Spark can store data on a variety of distributed file systems such as HDFS, and it often works together with the resource-scheduling platforms Mesos and YARN.
The RDD (Resilient Distributed Dataset) is the core of Spark. An RDD is a collection of data objects distributed across the compute nodes and stored in memory. RDDs allow the user to explicitly cache a working set in memory when executing multiple queries; subsequent queries can reuse the working set, which greatly improves query speed. Because an RDD is distributed over multiple nodes, it can be processed in parallel. RDDs are scalable and elastic: during computation, when memory is insufficient to hold an RDD, partitions can be spilled to disk, ensuring that the computation can continue within the available memory. An RDD is a partitioned, read-only, immutable data collection that can be operated on in parallel; it can only be created by applying deterministic transformations (such as map, join, filter, and groupBy) to other RDDs, but these restrictions make fault tolerance very cheap. Unlike distributed shared-memory systems, which must pay for expensive checkpointing and rollback, an RDD rebuilds lost partitions through its lineage: each RDD carries the information needed to derive it from other RDDs, so a lost data partition can be reconstructed without checkpointing. Although the RDD is not a general shared-memory abstraction, it possesses good expressiveness, scalability, and reliability, and is widely applicable to data-parallel applications.
Related documents:
[Document 1] Agrawal R, Srikant R. Mining sequential patterns[C]//Proceedings of the 11th International Conference on Data Engineering. Taipei: IEEE Computer Society, 1995: 3-14.
[Document 2] Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements[C]//Proceedings of the 5th International Conference on Extending Database Technology. Avignon: Lecture Notes in Computer Science, 1996: 3-17.
[Document 3] Zaki M. SPADE: an efficient algorithm for mining frequent sequences[J]. Machine Learning, 2001, 42(1-2): 31-60.
[Document 4] Pei J, Han J, Pinto H, et al. Mining sequential patterns by pattern-growth: the PrefixSpan approach[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(11): 1424-1440.
[Document 5] Guralnik V, Garg N, Karypis G. Parallel tree projection algorithm for sequence mining[C]//Proceedings of the 7th International European Conference on Parallel Processing. London, 2001: 310-320.
[Document 6] Gong Zhenzhi, Hu Kongfa, Da Qingli, Zhang Changhai. DMGSP: a fast distributed global sequential pattern mining algorithm[J]. Journal of Southeast University, 2007, 16(04): 574-579.
[Document 7] Zhang Changhai, Hu Kongfa, Liu Haidong. FMGSP: an efficient method of mining global sequential patterns[C]//Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery. Los Alamitos: IEEE Computer Society, 2007: 761-765.
[Document 8] Huang J, Lin S, Chen M. DPSP: distributed progressive sequential pattern mining on the cloud[C]. Lecture Notes in Computer Science, 2010: 27-34.
[Document 9] Yu D, Wu W, Zheng S, Zhu Z. BIDE-based parallel mining of frequent closed sequences with MapReduce[C]//Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing, 2012: 177-186.
[Document 10] Chen Chun-Chieh, Tseng Chi-Yao, Chen Ming-Syan. Highly scalable sequential pattern mining based on MapReduce model on the cloud[C]//2013 IEEE International Congress on Big Data, 2013: 310-317.
[Document 11] Sabrina P N. Multiple MapReduce and derivative projected database: new approach for supporting PrefixSpan scalability[C]. IEEE, 2015: 148-153.
[Document 12] Wang X. Parallel sequential pattern mining by transaction decomposition[C]//2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 4: 1746-1750.
[Document 13] Yu X, Liu J, Ma C, Li B. A MapReduce reinforced distributed sequential pattern mining algorithm[C]//Algorithms and Architectures for Parallel Processing, 2015, vol. 9529: 183-197.
[Document 14] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[C]//Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. New York: ACM Press, 2004: 137-149.
Summary of the invention
To address the poor computational efficiency of existing serial sequential pattern mining algorithms on massive data and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, the present invention provides a parallel sequential pattern mining method based on the Spark platform.
The technical solution adopted by the present invention is a parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
the sequence database is partitioned into database partitions of equal size, the number of partitions being determined by the number of worker nodes in the cluster, so that the total sequence length in each database partition is nearly equal;
Step 2: database preparation;
all 1-sequence patterns are produced using one MapReduce task;
Step 3: database mining;
all k-sequence patterns, k > 1, are found iteratively using MapReduce tasks.
The present invention designs a rational sequence database partitioning strategy that resolves the load-imbalance problem to the greatest extent. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data. The technical solution is simple and fast, and can substantially improve the efficiency of sequential pattern mining.
Brief description of the drawings
Fig. 1 is the flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of sequence database partitioning in the embodiment of the present invention;
Fig. 3 is a schematic diagram of the sequence database partitioning result in the embodiment of the present invention;
Fig. 4 is a schematic diagram of the database preparation process in the embodiment of the present invention;
Fig. 5 is a schematic diagram of the execution of the first MapReduce task of the database mining step in the embodiment of the present invention;
Fig. 6 is a schematic diagram of the execution of the second MapReduce task of the database mining step in the embodiment of the present invention.
Embodiment
To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and an embodiment. It should be understood that the embodiment described here serves only to illustrate and explain the present invention, not to limit it.
The flow of the Spark-based sequential pattern mining method designed by the present invention is shown in Fig. 1; all steps can be run automatically by those skilled in the art using computer software technology. The embodiment proceeds as follows:
Step 1, database partitioning;
The sequence database is partitioned into database partitions of equal size (the number of partitions is determined by the number of worker nodes in the cluster), so that the total sequence length in each database partition is nearly equal.
Referring to Fig. 2, the sequence database is partitioned as follows:
(1) All sequences in the database are sorted in descending order of sequence length.
(2) The first n sequences form n initial database partitions, each containing one sequence. The total sequence length of each partition is initialized to the length of the sequence it contains.
(3) A min-heap Ψ = {D1, D2, D3, …, Dn} is built on the total sequence lengths of the n database partitions, where the root D1 is the partition whose assigned total sequence length is shortest.
(4) The min-heap root Di is taken, the longest of the unassigned sequences is added to Di, and the min-heap is adjusted.
(5) Step (4) is repeated until all sequences have been assigned to database partitions.
As shown in Fig. 3, this embodiment partitions the original sequence database into n = 3 sub-databases.
Original sequence data storehouse content such as table 1 below:
Table 1
Sequence number | Sequence |
S1 | <(a b)a c> |
S2 | <(c d)(e f g)> |
S3 | <h> |
S4 | <c g> |
S5 | <g a> |
S6 | <a c g h> |
First the database is sorted, giving the order S2 S1 S6 S4 S5 S3. The first three sequences seed the initial heap, which is then adjusted into a min-heap. The three sub-databases and their sequences are now: sub-database P1: S2, length 5; sub-database P2: S1, length 4; sub-database P3: S6, length 4. The min-heap root is P2. The sorted sequence database is then read sequence by sequence. First S4 is read and added to P2, whose length becomes 6. The min-heap is adjusted; the root is now P3. S5 is read and added to P3, whose length becomes 6. The min-heap is adjusted; the root is now P1. S3 is read and added to P1, whose length becomes 6. All sequences have now been read, and the database partitioning step ends. In this embodiment the partitioning result guarantees that the total sequence length in every partition is identical.
The resulting sub-databases 1, 2, and 3 are shown in Tables 2, 3, and 4 below:
Table 2
Sequence number | Sequence |
S2 | <(c d)(e f g)> |
S3 | <h> |
Table 3
Sequence number | Sequence |
S1 | <(a b)a c> |
S4 | <c g> |
Table 4
Sequence number | Sequence |
S6 | <a c g h> |
S5 | <g a> |
Let q be the number of Map nodes in the Spark platform. It is recommended that the number of sub-databases equal the number of Map nodes, i.e., n = q. If n < q, then (in the absence of task failures) (q - n) Map nodes are never used, and node utilization is low. If n > q, then (in the absence of task failures) the remaining (n - q) sub-databases can only be processed after the q Map nodes have finished the first q sub-databases, and processing efficiency is low. Setting n = q therefore satisfies both node utilization and processing efficiency.
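The partitioning procedure above can be sketched in plain Python. This is a simplification: sequences are represented as flattened item lists, and the standard-library `heapq` stands in for the min-heap Ψ of step 1:

```python
import heapq

def partition_database(sequences, n):
    """Greedy longest-first partitioning with a min-heap, as in step 1.
    `sequences` maps sequence id -> item list; returns n partitions
    balanced by total sequence length."""
    # (1) Sort sequence ids by sequence length, descending.
    order = sorted(sequences, key=lambda sid: len(sequences[sid]), reverse=True)
    # (2) The n longest sequences seed the n initial partitions.
    parts = [[sid] for sid in order[:n]]
    # (3) Heap entries are (total_length, partition_index); the root is the
    #     partition with the smallest total assigned length so far.
    heap = [(len(sequences[order[i]]), i) for i in range(n)]
    heapq.heapify(heap)
    # (4)-(5) Add each remaining sequence to the currently lightest partition.
    for sid in order[n:]:
        total, i = heapq.heappop(heap)
        parts[i].append(sid)
        heapq.heappush(heap, (total + len(sequences[sid]), i))
    return parts

# The example database from Table 1 (itemsets flattened to item lists).
db = {"S1": list("abac"), "S2": list("cdefg"), "S3": list("h"),
      "S4": list("cg"), "S5": list("ga"), "S6": list("acgh")}
parts = partition_database(db, 3)
totals = [sum(len(db[sid]) for sid in p) for p in parts]
```

On the example data this reproduces the result of Tables 2-4: {S2, S3}, {S1, S4}, and {S6, S5}, each with total length 6.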
Step 2, database preparation;
In this step, all 1-sequence patterns are produced using one MapReduce task. The step first calls a flatMap function to read each sequence from a sequence database partition, where a sequence is stored as a key-value pair of the form <LongWritable offset, Text sequence>. A second flatMap function then splits each sequence into items and produces <item, 1> key-value pairs. Key-value pairs with the same key are merged and passed to the Reduce nodes, which call the reduceByKey() function to compute the support of each <item, 1> key and output the key-value pairs whose support is at least the configured minimum support. The keys of these pairs are the 1-sequence patterns, and the values are their support counts.
The embodiment sets the minimum support to 2; Fig. 4 shows the execution of the preparation step. For database partition 1, the Map node produces the key-value pairs in Table 5:
Table 5
Output result |
<c,1> |
<d,1> |
<e,1> |
<f,1> |
<g,1> |
<h,1> |
For database partition 2, the Map node produces the key-value pairs in Table 6:
Table 6
Output result |
<a,1> |
<b,1> |
<a,1> |
<c,1> |
<c,1> |
<g,1> |
For database partition 3, the Map node produces the key-value pairs in Table 7:
Table 7
Output result |
<a,1> |
<c,1> |
<g,1> |
<h,1> |
<g,1> |
<a,1> |
The Reduce nodes merge the key-value pairs with identical keys and output the pairs whose support is at least 2, as shown in Table 8:
Table 8
Sequence pattern | Support |
a | 3 |
c | 4 |
g | 4 |
h | 2 |
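The preparation step can be sketched in plain Python as follows. One assumption is made explicit here: support counts the number of sequences containing an item, not the number of occurrences, which is what Table 8 implies (item a has support 3 even though it occurs twice in S1). `Counter` stands in for the merge-and-reduceByKey stage:

```python
from collections import Counter

def frequent_1_patterns(partitions, min_support):
    """Preparation-step sketch: the Map side emits <item, 1> pairs for each
    sequence, the Reduce side sums the counts per key (as reduceByKey would)
    and keeps the items meeting min_support."""
    counts = Counter()
    for part in partitions:          # one Map node per database partition
        for seq in part:             # one <offset, sequence> record
            counts.update(set(seq))  # each sequence supports an item at most once
    return {item: c for item, c in counts.items() if c >= min_support}

# The three partitions from Tables 2-4, itemsets flattened to item lists.
partitions = [[list("cdefg"), list("h")],   # partition 1: S2, S3
              [list("abac"), list("cg")],   # partition 2: S1, S4
              [list("acgh"), list("ga")]]   # partition 3: S6, S5
patterns = frequent_1_patterns(partitions, 2)
print(patterns)  # matches Table 8: a -> 3, c -> 4, g -> 4, h -> 2
```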
Step 3, database mining;
This step iteratively uses MapReduce tasks to find all k-sequence patterns (k > 1). The 1-sequence patterns produced in the preparation step are stored in an RDD rather than in HDFS, to reduce I/O overhead. In the k-th MapReduce task, each Map node first reads the (k-1)-sequence patterns from the RDD and produces the candidate k-sequence patterns Ck through the candidate-generation step. A map function is then called to read each sequence s in the database partition and to test whether each candidate k-sequence pattern c is a subsequence of s; if so, a <c,1> key-value pair is produced. Key-value pairs with the same key are merged and passed to the Reduce nodes. Finally, each Reduce node calls the reduceByKey() function to compute the support of each <c,1> key and outputs the key-value pairs whose support is at least the configured minimum support; these are the frequent k-sequence patterns Lk.
This embodiment sets the minimum support to 2; Fig. 5 shows the execution of the first MapReduce task of the mining step. For database partition 1, Map node 1 produces the key-value pairs in Table 9:
Table 9
Output result |
<c g,1> |
For database partition 2, Map node 2 produces the key-value pairs in Table 10:
Table 10
Output result |
<a c,1> |
<c g,1> |
For database partition 3, Map node 3 produces the key-value pairs in Table 11:
Table 11
Output result |
<a c,1> |
<a g,1> |
<a h,1> |
<c g,1> |
<c h,1> |
<g h,1> |
<g a,1> |
The Reduce nodes merge the key-value pairs produced by all Map nodes and output the pairs whose support is at least the configured minimum support, as shown in Table 12:
Table 12
Sequence pattern | Support |
<a c> | 2 |
<c g> | 3 |
Fig. 6 shows the execution of the second MapReduce task of the mining step. Each Map node reads the 2-sequence patterns from the RDD, but the candidate-generation step produces no candidate 3-sequence patterns (the only GSP join result, <a c g>, contains the infrequent subsequence <a g> and is pruned). Map node 1 therefore outputs no key-value pairs for database partition 1, Map node 2 none for partition 2, and Map node 3 none for partition 3. The Reduce nodes merge the key-value pairs produced by all Map nodes, find that no Map node produced any output, and the program therefore terminates.
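Under the same simplifications as before (itemsets flattened to single items; support counting each containing sequence once), the whole mining step can be sketched in plain Python. `gen_candidates` follows the standard GSP join-and-prune, which is why the iteration stops here: the only join result, ('a','c','g'), contains the infrequent subsequence ('a','g') and is pruned, so no third round runs:

```python
from collections import Counter

def is_subsequence(pattern, seq):
    """True if pattern's items occur in seq in order (not necessarily adjacent)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def gen_candidates(freq, k):
    """Candidate generation: for k == 2, pair up the frequent items; for
    k > 2, GSP-style join (p joins q when p minus its first item equals q
    minus its last item). Either way, prune any candidate that has an
    infrequent (k-1)-subsequence."""
    if k == 2:
        items = sorted({p[0] for p in freq})
        joined = {(a, b) for a in items for b in items}
    else:
        joined = {p + (q[-1],) for p in freq for q in freq if p[1:] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in freq for i in range(len(c)))}

def mine(partitions, min_support):
    """Mining-step sketch: seed with frequent 1-patterns, then iterate
    candidate generation + support counting until no candidates remain."""
    counts = Counter()
    for part in partitions:
        for seq in part:
            counts.update({(item,) for item in seq})  # one vote per sequence
    freq = {p for p, c in counts.items() if c >= min_support}
    patterns = {p: counts[p] for p in freq}
    k = 1
    while freq:
        k += 1
        candidates = gen_candidates(freq, k)
        counts = Counter()
        for part in partitions:   # Map side: emit <c, 1> per containing sequence
            for seq in part:
                counts.update(c for c in candidates if is_subsequence(c, seq))
        freq = {c for c in candidates if counts[c] >= min_support}  # Reduce side
        patterns.update({c: counts[c] for c in freq})
    return patterns

partitions = [[list("cdefg"), list("h")],   # partition 1: S2, S3
              [list("abac"), list("cg")],   # partition 2: S1, S4
              [list("acgh"), list("ga")]]   # partition 3: S6, S5
result = mine(partitions, 2)
```

On the example data this yields exactly the frequent patterns of Tables 8 and 12: the 1-patterns a, c, g, h and the 2-patterns <a c> (support 2) and <c g> (support 3), with nothing at length 3.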
The specific embodiment described here merely illustrates the spirit of the present invention. Those skilled in the art may make various modifications, additions, or substitutions to the described embodiment without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.
Claims (4)
1. A parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
partitioning the sequence database into database partitions of equal size, the number of partitions being determined by the number of worker nodes in the cluster, so that the total sequence length in each database partition is nearly equal;
Step 2: database preparation;
producing all 1-sequence patterns using one MapReduce task;
Step 3: database mining;
iteratively finding all k-sequence patterns, k > 1, using MapReduce tasks.
2. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 1 comprises the following sub-steps:
Step 1.1: sorting all sequences in the database in descending order of sequence length;
Step 1.2: forming n initial database partitions from the first n sequences, each partition containing one sequence, and initializing the total sequence length of each partition to the length of the sequence it contains;
Step 1.3: building a min-heap Ψ = {D1, D2, D3, …, Dn} on the total sequence lengths of the n database partitions, where the root D1 is the partition whose assigned total sequence length is shortest;
Step 1.4: taking the min-heap root Di, adding the longest unassigned sequence to Di, and adjusting the min-heap;
Step 1.5: repeating step 1.4 until all sequences have been assigned to database partitions.
3. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 2 comprises the following sub-steps:
Step 2.1: calling a first flatMap function to read each sequence from a sequence database partition, where a sequence is stored as a key-value pair of the form <LongWritable offset, Text sequence>;
Step 2.2: calling a second flatMap function to split each sequence into items and produce <item, 1> key-value pairs;
Step 2.3: merging the key-value pairs with the same key and passing them to the Reduce nodes, which call the reduceByKey() function to compute the support of each <item, 1> key and output the key-value pairs whose support is at least the configured minimum support; the keys of these pairs are the 1-sequence patterns, and the values are their support counts.
4. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 3.1: storing the 1-sequence patterns produced in step 2 in an RDD rather than in HDFS, to reduce I/O overhead;
Step 3.2: in the k-th MapReduce task, each Map node first reading the (k-1)-sequence patterns from the RDD and producing the candidate k-sequence patterns Ck through the candidate-generation step;
Step 3.3: calling a map function to read each sequence s in the database partition and to test whether each candidate k-sequence pattern c is a subsequence of s, producing a <c,1> key-value pair if so; merging the key-value pairs with the same key and passing them to the Reduce nodes;
Step 3.4: each Reduce node calling the reduceByKey() function to compute the support of each <c,1> key and outputting the key-value pairs whose support is at least the configured minimum support, these being the frequent k-sequence patterns Lk.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710284017.8A CN107145548B (en) | 2017-04-26 | 2017-04-26 | A kind of Parallel Sequence mode excavation method based on Spark platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107145548A true CN107145548A (en) | 2017-09-08 |
CN107145548B CN107145548B (en) | 2019-08-20 |
Family
ID=59774891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710284017.8A Expired - Fee Related CN107145548B (en) | 2017-04-26 | 2017-04-26 | A kind of Parallel Sequence mode excavation method based on Spark platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145548B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866904A (en) * | 2015-06-16 | 2015-08-26 | 中电科软件信息服务有限公司 | Parallelization method of BP neural network optimized by genetic algorithm based on spark |
CN105740424A (en) * | 2016-01-29 | 2016-07-06 | 湖南大学 | Spark platform based high efficiency text classification method |
CN106021412A (en) * | 2016-05-13 | 2016-10-12 | 上海市计算技术研究所 | Large-scale vehicle-passing data oriented accompanying vehicle identification method |
CN106126341A (en) * | 2016-06-23 | 2016-11-16 | 成都信息工程大学 | It is applied to many Computational frames processing system and the association rule mining method of big data |
2017
- 2017-04-26 CN CN201710284017.8A patent/CN107145548B/en not_active Expired - Fee Related
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866904A (en) * | 2015-06-16 | 2015-08-26 | 中电科软件信息服务有限公司 | A Spark-based parallelization method for a BP neural network optimized by a genetic algorithm |
CN105740424A (en) * | 2016-01-29 | 2016-07-06 | 湖南大学 | An efficient text classification method based on the Spark platform |
CN106021412A (en) * | 2016-05-13 | 2016-10-12 | 上海市计算技术研究所 | A companion-vehicle identification method for large-scale vehicle-passing data |
CN106126341A (en) * | 2016-06-23 | 2016-11-16 | 成都信息工程大学 | A multi-computing-framework processing system and association rule mining method for big data |
Non-Patent Citations (1)
Title |
---|
CAO Bo (曹博) et al.: "A parallel frequent pattern mining algorithm based on Spark", Computer Engineering and Applications (《计算机工程与应用》) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107665291A (en) * | 2017-09-27 | 2018-02-06 | 华南理工大学 | A mutation detection method based on the Spark cloud computing platform |
CN107665291B (en) * | 2017-09-27 | 2020-05-22 | 华南理工大学 | Mutation detection method based on the Spark cloud computing platform |
Also Published As
Publication number | Publication date |
---|---|
CN107145548B (en) | 2019-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Parallel processing systems for big data: a survey | |
Dean et al. | MapReduce: simplified data processing on large clusters | |
He et al. | Comet: batched stream processing for data intensive distributed computing | |
Zhang et al. | Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation | |
Vulimiri et al. | Global analytics in the face of bandwidth and regulatory constraints | |
Verma et al. | Breaking the MapReduce stage barrier | |
Bu et al. | Pregelix: Big(ger) graph analytics on a dataflow engine | |
Malewicz et al. | Pregel: a system for large-scale graph processing | |
Chen et al. | Computation and communication efficient graph processing with distributed immutable view | |
Liang et al. | Express supervision system based on NodeJS and MongoDB | |
Gu et al. | Chronos: An elastic parallel framework for stream benchmark generation and simulation | |
Labouseur et al. | Scalable and Robust Management of Dynamic Graph Data. | |
CN111966677A (en) | Data report processing method and device, electronic equipment and storage medium | |
Zhao et al. | ZenLDA: Large-scale topic model training on distributed data-parallel platform | |
Shi et al. | DFPS: Distributed FP-growth algorithm based on Spark | |
Sun et al. | Survey of distributed computing frameworks for supporting big data analysis | |
Fang et al. | Integrating workload balancing and fault tolerance in distributed stream processing system | |
CN107346331B (en) | A parallel sequential pattern mining method based on the Spark cloud computing platform | |
Kostenetskii et al. | Simulation of hierarchical multiprocessor database systems | |
Zhang et al. | Egraph: efficient concurrent GPU-based dynamic graph processing | |
CN107145548B (en) | A parallel sequential pattern mining method based on the Spark platform | |
Yang | From Google file system to omega: a decade of advancement in big data management at Google | |
Alemi et al. | CCFinder: using Spark to find clustering coefficient in big graphs | |
Chen et al. | Applying segmented right-deep trees to pipelining multiple hash joins | |
Azez et al. | JOUM: an indexing methodology for improving join in hive star schema |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 2019-08-20. Termination date: 2021-04-26.