CN107145548B - Parallel sequential pattern mining method based on the Spark platform - Google Patents


Info

Publication number
CN107145548B
Authority
CN
China
Prior art keywords
sequence
database
key
value pair
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710284017.8A
Other languages
Chinese (zh)
Other versions
CN107145548A (en)
Inventor
余啸
刘进
吴思尧
崔晓晖
张建升
井溢洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN201710284017.8A
Publication of CN107145548A
Application granted
Publication of CN107145548B
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2308 Concurrency control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel sequential pattern mining method based on the Spark platform. It addresses two problems: existing serial sequential pattern mining algorithms lack the computing capacity to process massive data efficiently, and existing Hadoop-based parallel sequential pattern mining algorithms suffer from high I/O overhead and load imbalance. A reasonable sequence database partitioning strategy is designed that resolves load imbalance to the greatest extent possible. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data.

Description

Parallel sequential pattern mining method based on the Spark platform
Technical field
The invention belongs to the technical field of sequential pattern mining, and in particular relates to a parallel sequential pattern mining method based on the Spark platform.
Background
(1) Sequential pattern mining
[document 1] first proposed the concept of sequential pattern mining. Sequential pattern mining finds frequently occurring ordered events or subsequences in a sequence database. As one of the important research topics in data mining, it has a wide range of applications, such as analysis of user purchasing behavior, biological sequence analysis, discovery of frequent taxi trajectory patterns, and analysis of human mobility patterns. [document 2] proposed the GSP algorithm, which prunes redundant candidate patterns and uses a hash tree for fast access to candidate patterns. [document 3] proposed the SPADE algorithm, based on a vertical data representation. [document 4] proposed the PrefixSpan algorithm, based on projected databases. Although these traditional serial algorithms have improved in performance through optimized data structures and changes to the mining mechanism, their processing speed often falls short of requirements on large-scale datasets. In the early 21st century, the rapid development of computer hardware greatly promoted research on parallel sequential pattern mining algorithms, and researchers in China and abroad successively proposed various distributed sequential pattern mining algorithms.
[document 5] proposed two different parallel algorithms based on tree projection to solve the sequential pattern discovery problem on distributed-memory parallel computers. [document 6] proposed the DMGSP algorithm, which reduces the volume of transmitted data by means of a lexicographic sequence tree. [document 7] proposed the FMGSP algorithm for fast mining of global maximal frequent itemsets. However, parallel platforms such as distributed-memory systems and grid computing systems provide no fault-tolerance mechanism, so the parallel sequential pattern mining algorithms implemented on them are not fault tolerant. In addition, developing parallel algorithms on these platforms requires programmers with extensive experience in parallel algorithm development.
The emergence of cloud computing platforms provides new methods and approaches for implementing parallel algorithms, making efficient, low-cost sequential pattern mining from massive data possible. The Hadoop cloud computing platform, developed by the Apache Foundation, is open source, scalable, and highly fault tolerant, so programmers without extensive parallel algorithm development experience can easily develop parallel programs on it. Many researchers have therefore proposed parallel sequential pattern mining algorithms based on Hadoop. [document 8] proposed DPSP, a Hadoop-based parallel incremental sequential pattern mining algorithm. [document 9] proposed BIDE-MR, a Hadoop-based parallel closed sequential pattern mining algorithm. [document 10] proposed the Hadoop-based SPAMC algorithm. [document 11] proposed a Hadoop-based parallel PrefixSpan algorithm. [document 12] proposed a Hadoop-based parallel PrefixSpan algorithm built on the idea of transaction decomposition. [document 13] proposed the Hadoop-based DGSP algorithm, built on database partitioning. The algorithms proposed in documents [8][9][10][11] are based on iterative MapReduce tasks: they must run multiple MapReduce tasks, each of which reads the sequence database from HDFS, which incurs substantial I/O overhead. The algorithms proposed in documents [12][13] are based on non-iterative MapReduce tasks: they cannot distribute the computation evenly across the compute nodes, which causes load imbalance.
(2) Map-Reduce programming framework
Map-Reduce is a programming framework, proposed in [document 14], that uses the concepts of "Map" and "Reduce" for parallel computation over large-scale datasets (larger than 1 TB). The user only needs to write two functions, called Map and Reduce. The system manages the execution of the parallel Map and Reduce tasks and the coordination between tasks, handles the failure of any such task, and also provides tolerance of hardware faults.
The computation process based on Map-Reduce is as follows:
1) The Map-Reduce library in the user program first splits the input files into M data splits, each typically 16 MB to 64 MB in size (the user can control the split size through an optional parameter). The library then starts many copies of the program on a cluster of machines.
2) One of these copies is special: the master. The others are workers, which are assigned tasks by the master. There are M Map tasks and R Reduce tasks to assign, and the master assigns a Map task or a Reduce task to each idle worker.
3) A worker assigned a Map task reads the corresponding input split, parses key-value pairs (key, value) from it, and passes each pair to the user-defined Map function. The intermediate key-value pairs produced by the Map function are buffered in local memory.
4) The buffered key-value pairs are divided into R regions by the partition function and periodically written to local disk. The locations of these pairs on local disk are passed back to the master, which forwards them to the workers assigned Reduce tasks.
5) When a worker assigned a Reduce task receives these storage locations from the master, it uses remote procedure calls to read the buffered data from the disks of the machines that ran the Map tasks. Once it has read all the intermediate data, it sorts the data by key so that all entries with the same key are grouped together. Sorting is necessary because many different keys map to the same Reduce task. If the intermediate data is too large to sort in memory, an external sort is used.
6) The worker assigned a Reduce task iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of intermediate values to the user-defined Reduce function. The output of the Reduce function is appended to the output file of the corresponding partition.
7) When all Map and Reduce tasks have completed, the master wakes up the user program, and the Map-Reduce call in the user program returns.
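The seven steps above can be illustrated with a small single-process simulation. This is only a sketch of the data flow (map, partition into R regions, sort by key, reduce), not the distributed implementation: the names `run_mapreduce`, `word_map` and `count_reduce`, the toy partition function, and the word-count input are all assumptions made for the example.

```python
from collections import defaultdict


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers):
    """Simulate the Map, partition, sort and Reduce phases in one process."""
    # Map phase: every record of every split is passed to the user-defined
    # Map function; the partition function assigns each intermediate pair
    # to one of R regions.
    regions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                # Toy deterministic partition function (an assumption).
                region = sum(key.encode()) % num_reducers
                regions[region][key].append(value)

    # Reduce phase: each region is sorted by key, and every unique key is
    # passed to the user-defined Reduce function with all of its values.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output


def word_map(line):
    """User-defined Map function: emit (word, 1) for each word."""
    return [(word, 1) for word in line.split()]


def count_reduce(key, values):
    """User-defined Reduce function: sum the values of one key."""
    return sum(values)


splits = [["a b a"], ["b c"], ["a c c"]]  # M = 3 input splits
counts = run_mapreduce(splits, word_map, count_reduce, num_reducers=2)
print(counts)
```

In a real deployment the regions live on the local disks of the Map workers and are fetched remotely by the Reduce workers. Here they are plain in-memory dictionaries.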
(3) Spark cloud computing platform
Spark is an open-source, general-purpose parallel cloud computing platform developed by the UC Berkeley AMP Lab. Spark is a distributed computing system built on the MapReduce idea and has the advantages of Hadoop MapReduce. The difference is that intermediate results of a job can be kept in memory, so they need not be written to and read back from the distributed file system (HDFS). Spark is therefore better suited to MapReduce-style algorithms that require iteration, such as data mining and machine learning. Spark uses in-memory distributed datasets: besides supporting interactive queries, it can cache datasets in memory to speed up reads and writes, which allows datasets to be reused during computation and optimizes iterative workloads. At the storage layer Spark can use a variety of distributed file systems, such as HDFS, but it is more commonly deployed together with the resource scheduling platforms Mesos and YARN.
RDD (Resilient Distributed Dataset) is the core of Spark. An RDD is a collection of data objects distributed across the compute nodes and held in memory. RDDs let the user explicitly cache a working set in memory when running multiple queries, so that subsequent queries can reuse the working set, which greatly improves query speed. An RDD is distributed over multiple nodes and can be processed in parallel. RDDs are scalable and resilient: if memory is smaller than the RDD during computation, data can be spilled to disk so that enough memory remains for the computation to continue. An RDD is a partitioned, read-only, immutable collection of data that can be operated on in parallel. It can only be created by applying deterministic transformations (such as map, join, filter and group by) to other RDDs, and these restrictions make fault tolerance cheap. Unlike distributed shared memory systems, which pay for expensive checkpointing and rollback, an RDD rebuilds lost partitions through lineage: each RDD contains the information needed to derive it from other RDDs, so lost data partitions can be reconstructed without checkpointing. Although the RDD is not a general shared-memory abstraction, it has good expressiveness, scalability and reliability, and is widely applicable to data-parallel applications.
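The lineage idea can be illustrated with a toy, single-process class. This is not Spark's API: `ToyRDD` and its methods are hypothetical names invented for the sketch, which only shows how a read-only dataset that records its parent and its transformation can recompute a lost partition without a checkpoint.

```python
class ToyRDD:
    """A read-only, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions=None, parent=None, transform=None):
        self._source = partitions   # only set for the root dataset
        self.parent = parent        # lineage: dataset this one derives from
        self.transform = transform  # lineage: per-partition function
        self.cache = {}             # in-memory partitions, may be lost

    def map(self, fn):
        return ToyRDD(parent=self,
                      transform=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self,
                      transform=lambda part: [x for x in part if pred(x)])

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if needed."""
        if index not in self.cache:
            if self.parent is None:
                self.cache[index] = list(self._source[index])
            else:
                parent_part = self.parent.compute(index)
                self.cache[index] = self.transform(parent_part)
        return self.cache[index]


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.compute(1)    # computed and cached
even_squares.cache.clear()         # simulate losing the cached partition
rebuilt = even_squares.compute(1)  # recomputed from lineage
print(first, rebuilt)
```

Because every transformation is deterministic, recomputing a partition from its parent gives the same result as the lost copy, which is what keeps the cost of fault tolerance low.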
Related documents:
[document 1] Agrawal R, Srikant R. Mining sequential patterns[C]//Proceedings of the 11th International Conference on Data Engineering. Taipei: IEEE Computer Society, 1995: 3-141.
[document 2] Srikant R, Agrawal R. Mining sequential patterns: Generalizations and performance improvements[C]//Proceedings of the 5th International Conference on Extending Database Technology. Avignon: Lecture Notes in Computer Science, 1996: 3-17.
[document 3] Zaki M. SPADE: An efficient algorithm for mining frequent sequences[J]. Machine Learning, 2001, 41(2): 31-60.
[document 4] Pei J, Han J, Pinto H. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth[C]//Proceedings of the 17th International Conference on Data Engineering. Washington, IEEE Transactions on Data Engineering, 2004, 16(1): 1424-1440.
[document 5] Guralnik V, Garg N, Vipin K. Parallel tree projection algorithm for sequence mining[C]//Proceedings of the 7th International European Conference on Parallel Processing. London, 2001: 310-320.
[document 6] Gong Zhenzhi, Hu Kongfa, Da Qingli, Zhang Changhai. DMGSP: a fast distributed global sequential pattern mining algorithm[J]. Journal of Southeast University, 2007, 16(04): 574-579.
[document 7] Zhang Changhai, Hu Kongfa, Liu Haidong. FMGSP: an efficient method of mining global sequential patterns[C]//Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery. Los Alamitos: IEEE Computer Society, 2007: 761-765.
[document 8] J. Huang, S. Lin, M. Chen, "DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud," Lecture Notes in Computer Science, pp. 27-34, 2010.
[document 9] D. Yu, W. Wu, S. Zheng, Z. Zhu, "BIDE-Based Parallel Mining of Frequent Closed Sequences with MapReduce," in Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing, pp. 177-186, 2012.
[document 10] Chun-Chieh Chen, Chi-Yao Tseng, Chi-Yao Tseng, "Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud," in 2013 IEEE International Congress on Big Data, pp. 310-317, 2013.
[document 11] P. N. Sabrina, "Multiple MapReduce and Derivative projected database: new approach for supporting prefixspan scalability," IEEE, pp. 148-153, Nov. 2015.
[document 12] X. Wang, "Parallel sequential pattern mining by transaction decomposition," IEEE Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on, vol. 4, pp. 1746-1750.
[document 13] X. Yu, J. Liu, C. Ma, B. Li, "A MapReduce reinforced distributed sequential pattern mining algorithm," Algorithms and Architectures for Parallel Processing, vol. 9529, pp. 183-197, Dec. 2015.
[document 14] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters[C]//Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. New York: ACM Press, 2004: 137-149.
Summary of the invention
To address the insufficient computing capacity of existing serial sequential pattern mining algorithms on massive data, and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, the present invention provides a parallel sequential pattern mining method based on the Spark platform.
The technical solution adopted by the invention is a parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
The sequence database is split into database fragments of equal size. The number of fragments is determined by the number of worker nodes in the cluster;
Step 2: database preparation;
All 1-sequence patterns are generated by a single MapReduce task;
Step 3: database mining;
MapReduce tasks are run iteratively to find all k-sequence patterns, k > 1.
The present invention designs a reasonable sequence database partitioning strategy that resolves load imbalance to the greatest extent possible. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data. The technical solution is simple and fast, and improves the efficiency of sequential pattern mining.
Brief description of the drawings
Fig. 1 is a flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of sequence database partitioning in the embodiment of the present invention;
Fig. 3 is a schematic diagram of the sequence database partitioning result in the embodiment of the present invention;
Fig. 4 is a schematic diagram of the execution of the database preparation step in the embodiment of the present invention;
Fig. 5 is a schematic diagram of the execution of the first MapReduce task of the database mining step in the embodiment of the present invention;
Fig. 6 is a schematic diagram of the execution of the second MapReduce task of the database mining step in the embodiment of the present invention.
Detailed description of the embodiment
To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the drawings and an embodiment. The embodiment described here only illustrates and explains the invention and is not intended to limit it.
The flow of the sequential pattern mining algorithm based on the Spark platform designed by the present invention is shown in Fig. 1. All steps can be run automatically by those skilled in the art using computer software. The embodiment is implemented as follows:
Step 1, database partitioning;
The sequence database is split into database fragments of equal size (the number of fragments is determined by the number of worker nodes in the cluster).
As shown in Fig. 2, the sequence database is partitioned as follows:
(1) All sequences in the database are sorted in descending order of sequence length.
(2) The first n sequences form n initial database fragments, each containing one sequence. The total sequence length of each fragment is initialized to the length of the sequence it contains.
(3) A min-heap Ψ = {D1, D2, D3, …, Dn} is built on the total sequence lengths of the n database fragments, where D1, the root of the min-heap, is the fragment with the smallest total sequence length.
(4) The root Di of the min-heap is taken, the longest of the unassigned sequences is added to Di, and the min-heap is adjusted.
(5) Step (4) is repeated until all sequences have been assigned to a sequence database fragment.
As shown in Fig. 3, this embodiment divides the original sequence database into n = 3 sub-sequence databases.
The contents of the original sequence database are shown in Table 1:
Table 1
Sequence ID    Sequence
S1             <(a b) a c>
S2             <(c d)(e f g)>
S3             <h>
S4             <c g>
S5             <g a>
S6             <a c g h>
The database is first sorted by sequence length (the number of items in the sequence), giving the order S2, S1, S6, S4, S5, S3. The first three sequences are taken to build the initial heap. At this point the three sub-sequence databases and the sequences they contain are: sub-database P1: S2, length 5; sub-database P2: S1, length 4; sub-database P3: S6, length 4. The root of the min-heap is P2. The remaining sequences are then read in sorted order. S4 is read first and added to P2, whose length becomes 6. The min-heap is adjusted, and its root is now P3. S5 is read and added to P3, whose length becomes 6. The min-heap is adjusted, and its root is now P1. S3 is read and added to P1, whose length becomes 6. All sequences have now been read, and the database partitioning step ends. In this embodiment, the partitioning result gives every sequence database fragment the same total sequence length.
The resulting sub-sequence databases 1, 2 and 3 are shown in Tables 2, 3 and 4 respectively:
Table 2
Sequence ID    Sequence
S2             <(c d)(e f g)>
S3             <h>
Table 3
Sequence ID    Sequence
S1             <(a b) a c>
S4             <c g>
Table 4
Sequence ID    Sequence
S6             <a c g h>
S5             <g a>
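The partitioning procedure of step 1 can be sketched with Python's `heapq` module. This is a minimal illustration under stated assumptions: sequences are represented as lists of itemsets (tuples), sequence length is taken to be the number of items, and the function names `sequence_length` and `partition_database` are invented for the example.

```python
import heapq


def sequence_length(sequence):
    """Length of a sequence = total number of items over all its elements."""
    return sum(len(element) for element in sequence)


def partition_database(database, n):
    """Greedy min-heap partitioning of a sequence database into n fragments.

    Returns a list of (sequence IDs, total sequence length), one per fragment.
    """
    # (1) Sort by descending length; sorted() is stable, so ties keep
    #     their original order.
    ordered = sorted(database.items(),
                     key=lambda kv: -sequence_length(kv[1]))

    # (2)-(3) The first n sequences seed the fragments; build a min-heap
    #         keyed on total length (the index breaks ties).
    heap = [(sequence_length(seq), index, [sid])
            for index, (sid, seq) in enumerate(ordered[:n])]
    heapq.heapify(heap)

    # (4)-(5) Repeatedly add the longest unassigned sequence to the
    #         fragment at the root, i.e. the one with the smallest total.
    for sid, seq in ordered[n:]:
        total, index, members = heapq.heappop(heap)
        members.append(sid)
        heapq.heappush(heap, (total + sequence_length(seq), index, members))

    return [(members, total)
            for total, index, members in sorted(heap, key=lambda f: f[1])]


# The sequence database of Table 1.
database = {
    "S1": [("a", "b"), ("a",), ("c",)],
    "S2": [("c", "d"), ("e", "f", "g")],
    "S3": [("h",)],
    "S4": [("c",), ("g",)],
    "S5": [("g",), ("a",)],
    "S6": [("a",), ("c",), ("g",), ("h",)],
}
fragments = partition_database(database, 3)
for members, total in fragments:
    print(members, total)
```

On the data of Table 1 this reproduces Tables 2 to 4: fragments {S2, S3}, {S1, S4} and {S6, S5}, each with total length 6.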
Let q be the number of Map nodes in the Spark platform. It is recommended that the number of sub-sequence databases equal the number of Map nodes, i.e. n = q. If n < q, then when the method runs without task failures, (q - n) Map nodes go unused, and resource utilization is low. If n > q, then when the method runs without task failures, n - q sub-sequence databases can only be processed after the q Map nodes have finished the first q sub-sequence databases, and processing efficiency is low. Setting n = q therefore satisfies both resource utilization and processing efficiency.
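The trade-off above can be checked with a few lines of arithmetic. The sketch assumes that each Map node processes one fragment at a time and that all fragments take about the same time (which is what the equal-length partitioning aims for). The function name `schedule` is hypothetical.

```python
import math


def schedule(n_fragments, q_map_nodes):
    """Return (idle Map nodes, number of processing rounds)."""
    idle_nodes = max(q_map_nodes - n_fragments, 0)
    rounds = math.ceil(n_fragments / q_map_nodes)
    return idle_nodes, rounds


print(schedule(2, 3))  # n < q: one node idle, one round
print(schedule(5, 3))  # n > q: no idle node, but two rounds
print(schedule(3, 3))  # n = q: no idle node, one round
```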
Step 2, database preparation;
In this step, all 1-sequence patterns are generated by a single MapReduce task. The step first calls a flatMap function to read each sequence from the sequence database fragment, where sequences are stored as <LongWritable offset, Text sequence> key-value pairs. It then calls another flatMap function to split each sequence into items, generating <item, 1> key-value pairs. Key-value pairs with the same key are merged and passed to a Reduce node. The Reduce node calls the reduceByKey() function to compute the support of the <item, 1> key-value pairs and outputs those whose support is greater than or equal to the configured minimum support. The keys of these key-value pairs are the 1-sequence patterns, and the values are the support counts of those patterns.
The embodiment sets the minimum support to 2. The execution of the preparation step is shown in Fig. 4. The key-value pairs generated by the Map node for database fragment 1 are shown in Table 5:
Table 5
Output
<c,1>
<d,1>
<e,1>
<f,1>
<g,1>
<h,1>
The key-value pairs generated by the Map node for database fragment 2 are shown in Table 6:
Table 6
Output
<a,1>
<b,1>
<a,1>
<c,1>
<c,1>
<g,1>
The key-value pairs generated by the Map node for database fragment 3 are shown in Table 7:
Table 7
Output
<a,1>
<c,1>
<g,1>
<h,1>
<g,1>
<a,1>
The Reduce node merges the key-value pairs that have the same key and outputs the key-value pairs whose support is greater than or equal to 2, as shown in Table 8:
Table 8
Sequence pattern    Support
a                   3
c                   4
g                   4
h                   2
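The preparation step can be sketched in plain Python, with ordinary functions standing in for Spark's flatMap and reduceByKey. One assumption is labeled here: each item is counted once per sequence, so that support is the number of sequences containing the item, which is what the support of 3 for item a in Table 8 implies. The names `map_items`, `reduce_by_key` and `frequent_items` are invented for the sketch.

```python
from collections import Counter

# The three database fragments of Tables 2 to 4.
fragments = [
    [[("c", "d"), ("e", "f", "g")], [("h",)]],            # S2, S3
    [[("a", "b"), ("a",), ("c",)], [("c",), ("g",)]],     # S1, S4
    [[("a",), ("c",), ("g",), ("h",)], [("g",), ("a",)]], # S6, S5
]


def map_items(sequence):
    """flatMap stand-in: emit (item, 1) once per distinct item (assumption)."""
    items = {item for element in sequence for item in element}
    return [(item, 1) for item in sorted(items)]


def reduce_by_key(pairs):
    """reduceByKey stand-in: sum the values of pairs that share a key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def frequent_items(fragments, min_support):
    """Return the 1-sequence patterns with their support counts."""
    pairs = [pair
             for fragment in fragments
             for sequence in fragment
             for pair in map_items(sequence)]
    counts = reduce_by_key(pairs)
    return {item: n for item, n in counts.items() if n >= min_support}


l1 = frequent_items(fragments, min_support=2)
print(sorted(l1.items()))
```

With a minimum support of 2 the result matches Table 8: a 3, c 4, g 4, h 2.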
Step 3, database mining;
This step iteratively runs MapReduce tasks to find all k-sequence patterns (k > 1). The 1-sequence patterns generated in the preparation step are stored in an RDD rather than in HDFS, to reduce I/O overhead. In the k-th MapReduce task, each Map node first reads the (k-1)-sequence patterns from the RDD and generates the candidate k-sequence patterns (Ck) through the candidate sequence pattern generation step. It then calls a map function to read each sequence s in the database fragment and checks whether a candidate k-sequence pattern Ck is a subsequence of s. If it is, a <Ck, 1> key-value pair is generated. Key-value pairs with the same key are merged and passed to a Reduce node. Finally, each Reduce node calls the reduceByKey() function to compute the support of the <Ck, 1> key-value pairs and outputs those whose support is greater than or equal to the configured minimum support. These are the final k-sequence patterns (Lk).
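One mining iteration can be sketched in plain Python as follows. The sketch is a simplification under stated assumptions: candidate 2-sequences are built only from single-item elements (full GSP candidate generation also produces candidates whose items share one element, such as <(a c)>), the fragments are those of Tables 2 to 4, and the names `is_subsequence`, `candidates_2` and `mine_iteration` are invented for the example.

```python
from collections import Counter
from itertools import product


def is_subsequence(pattern, sequence):
    """True if every element of `pattern` is contained, in order, in a
    distinct element of `sequence` (greedy left-to-right matching)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def candidates_2(frequent_items):
    """Candidate 2-sequences <x y> from frequent items (simplified)."""
    return [((x,), (y,)) for x, y in product(sorted(frequent_items), repeat=2)]


def mine_iteration(fragments, candidates, min_support):
    """Map: emit (candidate, 1) per containing sequence. Reduce: sum, filter."""
    counts = Counter()
    for fragment in fragments:          # one Map node per fragment
        for sequence in fragment:
            for candidate in candidates:
                if is_subsequence(candidate, sequence):
                    counts[candidate] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


fragments = [
    [[("c", "d"), ("e", "f", "g")], [("h",)]],            # S2, S3
    [[("a", "b"), ("a",), ("c",)], [("c",), ("g",)]],     # S1, S4
    [[("a",), ("c",), ("g",), ("h",)], [("g",), ("a",)]], # S6, S5
]
l2 = mine_iteration(fragments, candidates_2(["a", "c", "g", "h"]), 2)
print(l2)
```

On this data, with a minimum support of 2, only <a c> (support 2) and <c g> (support 3) survive.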
This embodiment sets the minimum support to 2. The execution of the first MapReduce task in the mining step is shown in Fig. 5. The key-value pairs generated by Map node 1 for database fragment 1 are shown in Table 9:
Table 9
Output
<c g,1>
The key-value pairs generated by Map node 2 for database fragment 2 are shown in Table 10:
Table 10
Output
<a c,1>
<c g,1>
The key-value pairs generated by Map node 3 for database fragment 3 are shown in Table 11:
Table 11
Output
<a c,1>
<a g,1>
<a h,1>
<c g,1>
<c h,1>
<g h,1>
<g a,1>
The Reduce node merges the key-value pairs generated by all Map nodes and outputs the key-value pairs whose support is greater than or equal to the configured minimum support, as shown in Table 12:
Table 12
Sequence pattern    Support
a c                 2
c g                 3
The execution of the second MapReduce task in the mining step is shown in Fig. 6. Each Map node reads the 2-sequence patterns from the RDD and runs the candidate sequence pattern generation step, but no candidate 3-sequence patterns are generated. Therefore Map node 1 outputs no key-value pairs for database fragment 1, Map node 2 outputs none for database fragment 2, and Map node 3 outputs none for database fragment 3. The Reduce node merges the key-value pairs generated by all Map nodes, finds that no Map node produced any output, and the program terminates.
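Why no candidate 3-sequence pattern arises can be seen from a sketch of GSP-style candidate generation (join, then prune). The sketch is simplified to patterns whose elements are single items, represented as flat tuples. The function name `generate_candidates` is an assumption of the example, and the input is the pair of frequent 2-sequence patterns, <a c> and <c g>, that the tables of the first mining task yield at minimum support 2.

```python
def generate_candidates(frequent_prev):
    """Join and prune step for patterns made of single-item elements.

    Join: s1 and s2 are joined when s1 without its first item equals
    s2 without its last item; the candidate is s1 extended by the last
    item of s2.
    Prune: a candidate is kept only if every subsequence obtained by
    deleting one item is itself frequent.
    """
    frequent_prev = set(frequent_prev)
    joined = {s1 + (s2[-1],)
              for s1 in frequent_prev
              for s2 in frequent_prev
              if s1[1:] == s2[:-1]}
    return sorted(
        cand for cand in joined
        if all(cand[:i] + cand[i + 1:] in frequent_prev
               for i in range(len(cand)))
    )


l2 = {("a", "c"), ("c", "g")}
c3 = generate_candidates(l2)
print(c3)
```

The join produces <a c g>, but its subsequence <a g> is not frequent, so the candidate is pruned and the candidate set is empty. If <a g> had also been frequent, <a c g> would have been kept.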
The specific embodiment described herein merely illustrates the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiment, or substitute similar methods, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (1)

1. A parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
The sequence database is split into database fragments of equal size, the number of fragments being determined by the number of worker nodes in the cluster;
The specific implementation of step 1 comprises the following sub-steps:
Step 1.1: all sequences in the database are sorted in descending order of sequence length;
Step 1.2: the first n sequences form n initial database fragments, each containing one sequence; the total sequence length of each database fragment is initialized to the length of the sequence it contains;
Step 1.3: a min-heap Ψ = {D1, D2, D3, …, Dn} is built on the total sequence lengths of the n database fragments, where D1, the root of the min-heap, is the sequence database fragment with the smallest total sequence length;
Step 1.4: the root Di of the min-heap is taken, the longest of the unassigned sequences is added to Di, and the min-heap is adjusted;
Step 1.5: step 1.4 is repeated until all sequences have been assigned to a sequence database fragment;
Step 2: database preparation;
All 1-sequence patterns are generated by a single MapReduce task;
The specific implementation of step 2 comprises the following sub-steps:
Step 2.1: a first flatMap function is called to read each sequence from the sequence database fragment, where sequences are stored as <LongWritable offset, Text sequence> key-value pairs;
Step 2.2: another flatMap function is called to split each sequence into items, generating <item, 1> key-value pairs;
Step 2.3: key-value pairs with the same key are merged and passed to a Reduce node; the Reduce node calls the reduceByKey() function to compute the support of the <item, 1> key-value pairs and outputs those whose support is greater than or equal to the configured minimum support; the keys of these key-value pairs are the 1-sequence patterns, and the values are the support counts of those patterns;
Step 3: database mining;
MapReduce tasks are run iteratively to find all k-sequence patterns, k > 1;
The specific implementation of step 3 comprises the following sub-steps:
Step 3.1: the 1-sequence patterns generated in step 2 are stored in an RDD rather than in HDFS, to reduce I/O overhead;
Step 3.2: in the k-th MapReduce task, each Map node first reads the (k-1)-sequence patterns from the RDD and generates the candidate k-sequence patterns Ck through the candidate sequence pattern generation step;
Step 3.3: a map function is called to read each sequence s in the database fragment and to check whether a candidate k-sequence pattern Ck is a subsequence of s; if it is, a <Ck, 1> key-value pair is generated; key-value pairs with the same key are merged and passed to a Reduce node;
Step 3.4: each Reduce node calls the reduceByKey() function to compute the support of the <Ck, 1> key-value pairs and outputs those whose support is greater than or equal to the configured minimum support, which are the final k-sequence patterns Lk.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710284017.8A CN107145548B (en) 2017-04-26 2017-04-26 Parallel sequential pattern mining method based on the Spark platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710284017.8A CN107145548B (en) 2017-04-26 2017-04-26 Parallel sequential pattern mining method based on the Spark platform

Publications (2)

Publication Number Publication Date
CN107145548A (en) 2017-09-08
CN107145548B (en) 2019-08-20

Family

ID=59774891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710284017.8A Expired - Fee Related CN107145548B (en) 2017-04-26 2017-04-26 Parallel sequential pattern mining method based on the Spark platform

Country Status (1)

Country Link
CN (1) CN107145548B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665291B (en) * 2017-09-27 2020-05-22 华南理工大学 Mutation detection method based on cloud computing platform Spark

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866904A (en) * 2015-06-16 2015-08-26 中电科软件信息服务有限公司 Parallelization method of BP neural network optimized by genetic algorithm based on spark
CN105740424A (en) * 2016-01-29 2016-07-06 湖南大学 Spark platform based high efficiency text classification method
CN106021412A (en) * 2016-05-13 2016-10-12 上海市计算技术研究所 Large-scale vehicle-passing data oriented accompanying vehicle identification method
CN106126341A (en) * 2016-06-23 2016-11-16 成都信息工程大学 It is applied to many Computational frames processing system and the association rule mining method of big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Parallel frequent pattern mining algorithm based on Spark; Cao Bo et al.; Computer Engineering and Applications; 2016-06-14; Vol. 52, No. 20; pp. 86-91

Also Published As

Publication number Publication date
CN107145548A (en) 2017-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20190820
Termination date: 20210426