CN107145548A - A parallel sequential pattern mining method based on the Spark platform - Google Patents

A parallel sequential pattern mining method based on the Spark platform

Info

Publication number
CN107145548A
CN107145548A (application CN201710284017.8A)
Authority
CN
China
Prior art keywords
sequence
key
database
value pair
burst
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710284017.8A
Other languages
Chinese (zh)
Other versions
CN107145548B (en)
Inventor
余啸
刘进
吴思尧
崔晓晖
张建升
井溢洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201710284017.8A priority Critical patent/CN107145548B/en
Publication of CN107145548A publication Critical patent/CN107145548A/en
Application granted granted Critical
Publication of CN107145548B publication Critical patent/CN107145548B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2308Concurrency control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel sequential pattern mining method based on the Spark platform. To address the inefficiency of existing serial sequential pattern mining algorithms on massive data, and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, a rational sequence-database decomposition strategy is designed that alleviates load imbalance to the greatest extent. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud platform is exploited to improve the efficiency of sequential pattern mining on massive data.

Description

A parallel sequential pattern mining method based on the Spark platform
Technical field
The invention belongs to the technical field of sequential pattern mining, and in particular relates to a parallel sequential pattern mining method based on the Spark platform.
Background technology
(1) Sequential pattern mining technology
[Document 1] first proposed the concept of sequential pattern mining. Sequential pattern mining discovers frequently occurring ordered events or subsequences in a sequence database. As an important research topic in the field of data mining, it has wide application demands, such as analysis of user purchasing behavior, biological sequence analysis, discovery of frequent taxi trajectory patterns, and analysis of human mobility patterns. [Document 2] proposed the GSP algorithm, which uses a redundant-candidate pruning strategy and a hash tree to achieve fast candidate access. [Document 3] proposed the SPADE algorithm based on a vertical data representation. [Document 4] proposed the PrefixSpan algorithm based on projected databases. Although these traditional serial algorithms improved performance through optimized data structures and changed mining mechanisms, their processing speed often falls short of requirements on large-scale datasets. In the early 2000s, the rapid development of computer hardware greatly promoted research on parallel sequential pattern mining algorithms, and scholars at home and abroad successively proposed various distributed sequential pattern mining algorithms.
[Document 5] proposed two different parallel formulations based on the tree-projection technique to solve sequential pattern discovery on distributed-memory computers. [Document 6] proposed the DMGSP algorithm, which reduces the volume of transmitted data through a lexicographic sequence tree. [Document 7] proposed the FMGSP algorithm for fast mining of globally maximal frequent sequences. However, because these distributed-memory or grid computing platforms provide no fault-tolerance mechanism, the parallel sequential pattern mining algorithms implemented on them are not fault-tolerant. In addition, developing parallel algorithms on these platforms requires programmers with substantial parallel-algorithm development experience.
The emergence of cloud computing platforms provides new methods and approaches for implementing parallel algorithms, making efficient, low-cost sequential pattern mining on massive data feasible. The Hadoop cloud computing platform developed by the Apache Software Foundation, thanks to its openness, scalability, and high fault tolerance, allows programmers without rich parallel-algorithm development experience to develop concurrent programs easily, so many scholars have proposed Hadoop-based parallel sequential pattern mining algorithms. [Document 8] proposed DPSP, a parallel incremental sequential pattern mining algorithm based on Hadoop. [Document 9] proposed BIDE-MR, a Hadoop-based parallel closed-sequence mining algorithm. [Document 10] proposed the Hadoop-based SPAMC algorithm. [Document 11] proposed a parallel PrefixSpan algorithm based on Hadoop. [Document 12] proposed a Hadoop-based parallel PrefixSpan algorithm built on the idea of transaction decomposition. [Document 13] proposed DGSP, a Hadoop-based algorithm built on database partitioning. The algorithms of documents [8][9][10][11], based on iterative MapReduce jobs, must execute multiple MapReduce jobs that each read the sequence database from HDFS, incurring very large I/O overhead. The algorithms of documents [12][13], based on non-iterative MapReduce jobs, cannot distribute the computation evenly across the compute nodes, causing load imbalance.
(2) The Map-Reduce programming framework
Map-Reduce is a programming framework built around the concepts "Map (mapping)" and "Reduce (reduction)" for concurrent operations on large-scale datasets (larger than 1 TB); it was proposed in [Document 14]. The user only needs to write two functions, called Map and Reduce; the system manages the parallel execution of the Map and Reduce tasks and the coordination between them, handles failures of any of these tasks, and guarantees tolerance of hardware faults.
The Map-Reduce computation proceeds as follows:
1) The Map-Reduce library in the user program first splits the input files into M data splits, each typically 16 to 64 MB in size (the user can control the split size through an optional parameter), and then creates a large number of program copies on the cluster.
2) One of these copies is special: the master. The rest are workers to which the master assigns tasks. There are M Map tasks and R Reduce tasks to allocate; the master assigns each Map or Reduce task to an idle worker.
3) A worker assigned a Map task reads the corresponding input split, parses key-value (key, value) pairs from it, and passes each pair to the user-defined Map function. The intermediate key-value pairs produced by the Map function are buffered in local memory.
4) Periodically, the buffered key-value pairs are partitioned into R regions by the partitioning function and written to local disk. The locations of these pairs on local disk are passed back to the master, which forwards them to the workers assigned Reduce tasks.
5) When a worker assigned a Reduce task is notified by the master of these data storage locations, it uses remote procedure calls to read the buffered data from the local disks of the workers that ran the Map tasks. After a Reduce worker has read all intermediate data, it sorts the data by key so that all occurrences of the same key are grouped together. Sorting is necessary because many different keys may map to the same Reduce task. If the intermediate data are too large to sort in memory, an external sort is used.
6) The Reduce worker iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of values to the user-defined Reduce function. The output of the Reduce function is appended to the output file of the corresponding partition.
7) When all Map and Reduce tasks are complete, the master wakes up the user program, and the Map-Reduce call in the user program returns.
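The seven steps above amount to: split the input, map each split to intermediate pairs, partition the pairs by key, group, then reduce each group. A single-process Python sketch (illustrative only — the real system is distributed and fault-tolerant) captures this dataflow:

```python
from collections import defaultdict

def map_reduce(splits, map_fn, reduce_fn, num_reducers=2):
    # "Map phase": each split yields intermediate (key, value) pairs,
    # partitioned into R regions by a hash partitioning function.
    regions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            regions[hash(key) % num_reducers][key].append(value)
    # "Reduce phase": within each region, keys are sorted so equal keys
    # are grouped, then reduce_fn folds each group's values.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output

# Word count, the canonical Map-Reduce example.
splits = ["a b a", "b c"]
result = map_reduce(
    splits,
    map_fn=lambda text: ((w, 1) for w in text.split()),
    reduce_fn=lambda key, values: sum(values),
)
```

Here `result` is `{"a": 2, "b": 2, "c": 1}`; the same map/reduce pair of functions is all a user of the real framework writes.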
(3) The Spark cloud computing platform
Spark is a general-purpose parallel cloud computing platform developed at the UC Berkeley AMP Lab. Spark is a distributed computing framework based on the MapReduce idea and shares the advantages of Hadoop MapReduce; the difference is that intermediate results can be kept in memory, so the distributed file system (HDFS) need not be read and written between stages. Spark is therefore better suited to the iterative MapReduce algorithms common in data mining and machine learning. Spark supports in-memory distributed datasets and interactive queries; datasets can also be cached in memory, which improves dataset read/write rates, enables reuse of datasets during a computation, and optimizes iterative workloads. The Spark layer can store data on various distributed file systems such as HDFS, and it more commonly runs together with the resource-scheduling platforms Mesos and YARN.
The Resilient Distributed Dataset (RDD) is the core of Spark. An RDD is a collection of data objects distributed across the compute nodes and held in memory. RDDs allow the user to explicitly cache a working set in memory when executing multiple queries; subsequent queries can reuse the working set, which greatly improves query speed. Because an RDD is distributed over multiple nodes, it can be processed in parallel. RDDs are scalable and elastic: during computation, partitions that do not fit in memory can be spilled to disk, so memory remains sufficient to continue the computation. An RDD is a partitioned, read-only, immutable collection that can be operated on in parallel; it can only be created by applying deterministic transformations (such as map, join, filter, and groupBy) to other RDDs, but these restrictions make fault tolerance very cheap to realize. Unlike distributed shared-memory systems, which must pay for expensive checkpointing and rollback, an RDD rebuilds lost partitions through its lineage: each RDD records how it was derived from other RDDs, so a lost data partition can be reconstructed without checkpointing. Although an RDD is not a general shared-memory abstraction, it offers good expressiveness, scalability, and reliability, and it can be widely used in data-parallel applications.
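The lineage mechanism can be illustrated with a toy sketch (plain Python, not Spark's actual API; the class and method names here are invented for illustration): each dataset records its parent and the deterministic transformation that produced it, so a lost partition is recomputed rather than restored from a checkpoint.

```python
class MiniRDD:
    """Toy illustration of RDD lineage: a partition can be dropped and
    rebuilt by re-applying the recorded transformation to the parent."""

    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions  # list of lists (an entry may be lost)
        self.parent = parent          # lineage: which RDD this one came from
        self.transform = transform    # deterministic per-element function

    def map(self, fn):
        child = [[fn(x) for x in part] for part in self.partitions]
        return MiniRDD(child, parent=self, transform=fn)

    def recover(self, i):
        # Rebuild partition i from the parent's partition i via lineage.
        self.partitions[i] = [self.transform(x) for x in self.parent.partitions[i]]

base = MiniRDD([[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)  # [[1, 4], [9, 16]]
squares.partitions[1] = None         # simulate losing a partition
squares.recover(1)                   # recompute it from lineage, no checkpoint
```

Because `map` is deterministic, recovery reproduces exactly the lost data — the property that lets Spark avoid checkpointing and rollback.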
Relevant documents:
[Document 1] Agrawal R, Srikant R. Mining sequential patterns[C]// Proceedings of the 11th International Conference on Data Engineering. Taipei: IEEE Computer Society, 1995: 3-14.
[Document 2] Srikant R, Agrawal R. Mining sequential patterns: Generalizations and performance improvements[C]// Proceedings of the 5th International Conference on Extending Database Technology. Avignon: Lecture Notes in Computer Science, 1996: 3-17.
[Document 3] Zaki M. SPADE: An efficient algorithm for mining frequent sequences[J]. Machine Learning, 2001, 42(1/2): 31-60.
[Document 4] Pei J, Han J, Mortazavi-Asl B, Pinto H, et al. Mining sequential patterns by pattern-growth: The PrefixSpan approach[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(11): 1424-1440.
[Document 5] Guralnik V, Garg N, Karypis G. Parallel tree projection algorithm for sequence mining[C]// Proceedings of the 7th International Euro-Par Conference on Parallel Processing. London, 2001: 310-320.
[Document 6] Gong Zhenzhi, Hu Kongfa, Da Qingli, Zhang Changhai. DMGSP: A fast distributed global sequential pattern mining algorithm[J]. Journal of Southeast University, 2007, 16(4): 574-579.
[Document 7] Zhang Changhai, Hu Kongfa, Liu Haidong. FMGSP: An efficient method of mining global sequential patterns[C]// Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery. Los Alamitos: IEEE Computer Society, 2007: 761-765.
[Document 8] Huang J, Lin S, Chen M. DPSP: Distributed progressive sequential pattern mining on the cloud[J]. Lecture Notes in Computer Science, 2010: 27-34.
[Document 9] Yu D, Wu W, Zheng S, Zhu Z. BIDE-based parallel mining of frequent closed sequences with MapReduce[C]// Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing, 2012: 177-186.
[Document 10] Chen C C, Tseng C Y, Chen M S. Highly scalable sequential pattern mining based on MapReduce model on the cloud[C]// 2013 IEEE International Congress on Big Data, 2013: 310-317.
[Document 11] Sabrina P N. Multiple MapReduce and derivative projected database: New approach for supporting PrefixSpan scalability[C]// IEEE, Nov. 2015: 148-153.
[Document 12] Wang X. Parallel sequential pattern mining by transaction decomposition[C]// 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 4: 1746-1750.
[Document 13] Yu X, Liu J, Ma C, Li B. A MapReduce reinforced distributed sequential pattern mining algorithm[C]// Algorithms and Architectures for Parallel Processing, vol. 9529, Dec. 2015: 183-197.
[Document 14] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters[C]// Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. New York: ACM Press, 2004: 137-149.
Summary of the invention
To address the inefficiency of existing serial sequential pattern mining algorithms on massive data and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, the present invention provides a parallel sequential pattern mining method based on the Spark platform.
The technical solution adopted by the present invention is a parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
the sequence database is cut into database partitions of equal size, the number of partitions being determined by the number of worker nodes in the cluster, so that the total sequence length in each database partition is nearly equal;
Step 2: database preparation;
all 1-sequence patterns are produced with one MapReduce job;
Step 3: database mining;
all k-sequence patterns, k > 1, are found iteratively with MapReduce jobs.
The present invention designs a rational sequence-database decomposition strategy that alleviates load imbalance to the greatest extent. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud platform is used to improve the efficiency of sequential pattern mining on massive data. The technical solution of the invention is simple and fast and can markedly improve the efficiency of sequential pattern mining.
Brief description of the drawings
Fig. 1 is the flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of sequence-database partitioning in the embodiment;
Fig. 3 is a schematic diagram of the sequence-database partitioning result in the embodiment;
Fig. 4 is a schematic diagram of the database preparation process in the embodiment;
Fig. 5 is a schematic diagram of the execution of the first MapReduce job of database mining in the embodiment;
Fig. 6 is a schematic diagram of the execution of the second MapReduce job of database mining in the embodiment.
Embodiment
To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and an embodiment. It should be understood that the embodiment described here only illustrates and explains the present invention and does not limit it.
The flow of the Spark-based sequential pattern mining algorithm designed by the present invention is shown in Fig. 1; all steps can be run automatically by those skilled in the art using computer software technology. The embodiment proceeds as follows:
Step 1, database partitioning;
The sequence database is cut into database partitions of equal size (the number of partitions is determined by the number of worker nodes in the cluster), so that the total sequence length in each partition is nearly equal.
See Fig. 2; the concrete steps of sequence-database partitioning are as follows:
(1) Sort all sequences in the database in descending order of sequence length.
(2) The first n sequences form n initial database partitions, each containing one sequence. The total sequence length of each partition is initialized to the length of the sequence it contains.
(3) Build a min-heap Ψ = {D1, D2, D3, …, Dn} on the total sequence lengths of the n partitions, where the root D1 is the partition with the smallest total sequence length.
(4) Take the heap root Di, add the longest unassigned sequence to Di, and re-adjust the heap.
(5) Repeat step (4) until every sequence has been assigned to a partition.
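Steps (1)-(5) are a greedy longest-first assignment driven by a min-heap, and can be sketched with Python's heapq (a sketch of the strategy described, applied to the embodiment's sequences flattened to item lists):

```python
import heapq

def partition_sequences(sequences, n):
    """Greedily assign sequences (longest first) to n partitions,
    always adding to the partition with the smallest total length."""
    ordered = sorted(sequences, key=len, reverse=True)
    partitions = [[] for _ in range(n)]
    heap = []  # entries: (total_length, partition_index)
    for i, seq in enumerate(ordered[:n]):  # the n longest seed the partitions
        partitions[i].append(seq)
        heap.append((len(seq), i))
    heapq.heapify(heap)
    for seq in ordered[n:]:                # the rest go to the lightest partition
        total, i = heapq.heappop(heap)
        partitions[i].append(seq)
        heapq.heappush(heap, (total + len(seq), i))
    return partitions

# The embodiment's database, S1..S6, flattened to item lists.
db = [list("abac"), list("cdefg"), list("h"),
      list("cg"), list("ga"), list("acgh")]
parts = partition_sequences(db, 3)
totals = [sum(len(s) for s in p) for p in parts]  # each partition totals 6
```

On this input every partition ends up with total length 6, matching the balanced result of the worked example below Fig. 3.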
As shown in Fig. 3, this embodiment partitions the original sequence database into n = 3 sub-databases.
The contents of the original sequence database are listed in Table 1 below:
Table 1
Sequence number Sequence
S1 <(a b)a c>
S2 <(c d)(e f g)>
S3 <h>
S4 <c g>
S5 <g a>
S6 <a c g h>
First the database is sorted, giving the order S2 S1 S6 S4 S5 S3. The first three sequences seed the initial heap structure, which is then adjusted into a min-heap. The three sub-databases and their sequences are now: sub-database P1: S2, length 5; sub-database P2: S1, length 4; sub-database P3: S6, length 4. The min-heap root is P2. The sorted database is then read sequence by sequence. Sequence S4 is read first and added to P2; P2's length is now 6. The heap is adjusted, and its root is now P3. Sequence S5 is read and added to P3; P3's length is now 6. The heap is adjusted, and its root is now P1. Sequence S3 is read and added to P1; P1's length is now 6. All sequences have now been read, and the database partitioning step ends. In this embodiment the partitioning result makes the total sequence length in every partition identical.
The resulting sub-databases 1, 2, and 3 are shown in Tables 2, 3, and 4 below:
Table 2
Sequence number Sequence
S2 <(c d)(e f g)>
S3 <h>
Table 3
Sequence number Sequence
S1 <(a b)a c>
S4 <c g>
Table 4
Sequence number Sequence
S6 <a c g h>
S5 <g a>
Let q be the number of Map nodes in the Spark platform. It is recommended that the number of sub-databases equal the number of Map nodes, i.e. n = q. If n < q, then when this method runs without task failures, (q - n) Map nodes sit idle and cluster utilization is low. If n > q, then when this method runs without task failures, the remaining n - q sub-databases can only be processed after the q Map nodes finish the first q sub-databases, and processing efficiency is low. Therefore n = q satisfies both cluster utilization and processing efficiency.
Step 2, database preparation;
In this step, one MapReduce job produces all 1-sequence patterns. The step first calls a flatMap function that reads each sequence from a database partition, where sequences are stored as key-value pairs of the form <LongWritable offset, Text sequence>. A second flatMap function then splits each sequence into its items and emits an <item, 1> key-value pair for each distinct item appearing in the sequence. Pairs with the same key are merged and passed to the Reduce nodes, which call the reduceByKey() function to compute the support of each <item, 1> pair and output the pairs whose support is at least the configured minimum support. The keys of these pairs are the 1-sequence patterns; the values are their support counts.
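The preparation step can be simulated in plain Python (the flatMap/reduceByKey pipeline here is a single-process stand-in for the Spark RDD operations; consistent with the support counts of Table 8, each item is counted at most once per sequence):

```python
from collections import Counter
from itertools import chain

def mine_1_patterns(partitions, min_support):
    """Emit <item, 1> for each distinct item of each sequence in each
    partition (the flatMap phase), then sum the counts per key and keep
    items whose support meets min_support (the reduceByKey phase)."""
    pairs = chain.from_iterable(
        ((item, 1) for seq in part for item in dict.fromkeys(seq))
        for part in partitions
    )
    counts = Counter()
    for item, one in pairs:
        counts[item] += one
    return {item: c for item, c in counts.items() if c >= min_support}

# The embodiment's three partitions (itemset structure flattened away).
partitions = [
    [list("cdefg"), list("h")],   # partition 1: S2, S3
    [list("abac"), list("cg")],   # partition 2: S1, S4
    [list("acgh"), list("ga")],   # partition 3: S6, S5
]
patterns = mine_1_patterns(partitions, min_support=2)
```

With minimum support 2 this reproduces Table 8: a:3, c:4, g:4, h:2 (b, d, e, f each appear in only one sequence and are filtered out).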
This embodiment sets the minimum support to 2. The concrete execution of the preparation step is shown in Fig. 4. The Map node produces the following key-value pairs for database partition 1 (Table 5):
Table 5
Output result
<c,1>
<d,1>
<e,1>
<f,1>
<g,1>
<h,1>
The Map node produces the following key-value pairs for database partition 2 (Table 6):
Table 6
Output result
<a,1>
<b,1>
<c,1>
<c,1>
<g,1>
The Map node produces the following key-value pairs for database partition 3 (Table 7):
Table 7
Output result
<a,1>
<c,1>
<g,1>
<h,1>
<g,1>
<a,1>
The Reduce nodes merge the key-value pairs with identical keys and output the pairs whose support is at least 2 (Table 8):
Table 8
Sequence pattern Support
a 3
c 4
g 4
h 2
Step 3, database mining;
This step iteratively finds all k-sequence patterns (k > 1) using MapReduce jobs. The 1-sequence patterns produced in the preparation step are stored in an RDD rather than in HDFS, to reduce I/O overhead. In the k-th MapReduce job, each Map node first reads the (k-1)-sequence patterns from the RDD and generates the candidate k-sequence patterns (Ck) through the candidate-generation step. A map function then reads each sequence s of a database partition and, for each candidate k-sequence pattern c, checks whether c is a subsequence of s; if so, it emits a <c, 1> key-value pair. Pairs with the same key are merged and passed to the Reduce nodes. Finally, each Reduce node calls the reduceByKey() function to compute the support of each <c, 1> pair and outputs the pairs whose support is at least the configured minimum support; these are the k-sequence patterns (Lk).
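One iteration of this step can be sketched in plain Python (a simplified sketch that flattens itemsets into item lists; candidate generation uses the standard GSP join plus apriori pruning, which is why no candidates survive into the third job of the embodiment):

```python
from collections import Counter

def is_subsequence(pattern, seq):
    """Ordered (not necessarily contiguous) containment check."""
    it = iter(seq)
    return all(item in it for item in pattern)

def gen_candidates(freq_patterns):
    """GSP join: if p1 minus its first item equals p2 minus its last
    item, emit p1 + last(p2); then apriori-prune any candidate having
    an infrequent (k-1)-subsequence."""
    freq = set(freq_patterns)
    cands = set()
    for p1 in freq:
        for p2 in freq:
            if p1[1:] == p2[:-1]:
                c = p1 + (p2[-1],)
                if all(c[:i] + c[i + 1:] in freq for i in range(len(c))):
                    cands.add(c)
    return cands

def mine_k_patterns(partitions, prev_patterns, min_support):
    candidates = gen_candidates(prev_patterns)
    counts = Counter()
    for part in partitions:        # each Map node handles one partition
        for seq in part:
            for c in candidates:
                if is_subsequence(c, seq):
                    counts[c] += 1  # emit <c, 1>
    # reduceByKey: keep the candidates meeting the minimum support
    return {c: n for c, n in counts.items() if n >= min_support}

partitions = [
    [list("cdefg"), list("h")],    # S2, S3
    [list("abac"), list("cg")],    # S1, S4
    [list("acgh"), list("ga")],    # S6, S5
]
l1 = [("a",), ("c",), ("g",), ("h",)]
l2 = mine_k_patterns(partitions, l1, 2)        # {('a','c'): 2, ('c','g'): 3}
l3 = mine_k_patterns(partitions, list(l2), 2)  # empty: <a c g> pruned via <a g>
```

The first call reproduces the worked example (only <a c> and <c g> are frequent); in the second call the only joinable candidate <a c g> is pruned because its subsequence <a g> is not frequent, so the iteration terminates.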
This embodiment sets the minimum support to 2. The concrete execution of the first MapReduce job of the mining step is shown in Fig. 5. Map node 1 produces the following key-value pairs for database partition 1 (Table 9):
Table 9
Output result
<c g,1>
Map node 2 produces the following key-value pairs for database partition 2 (Table 10):
Table 10
Output result
<a c,1>
<c g,1>
Map node 3 produces the following key-value pairs for database partition 3 (Table 11):
Table 11
Output result
<a c,1>
<a g,1>
<a h,1>
<c g,1>
<c h,1>
<g h,1>
<g a,1>
The Reduce nodes merge the key-value pairs produced by all Map nodes and output the pairs whose support is at least the configured minimum support (Table 12):
Table 12
Sequence pattern Support
a c 2
c g 3
The concrete execution of the second MapReduce job of the mining step is shown in Fig. 6. Each Map node reads the 2-sequence patterns from the RDD and runs the candidate-generation step, but no candidate 3-sequence patterns are produced. Therefore Map node 1 outputs no key-value pairs for database partition 1, Map node 2 none for partition 2, and Map node 3 none for partition 3. The Reduce nodes merge the key-value pairs produced by all Map nodes, find that no Map node produced any output, and the program terminates.
The specific embodiment described here merely illustrates the spirit of the present invention. Those skilled in the art may make various modifications or additions to the described embodiment, or substitute similar alternatives, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims (4)

1. A parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
the sequence database is cut into database partitions of equal size, the number of partitions being determined by the number of worker nodes in the cluster, so that the total sequence length in each database partition is nearly equal;
Step 2: database preparation;
all 1-sequence patterns are produced with one MapReduce job;
Step 3: database mining;
all k-sequence patterns, k > 1, are found iteratively with MapReduce jobs.
2. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 1 comprises the following sub-steps:
Step 1.1: sort all sequences in the database in descending order of sequence length;
Step 1.2: the first n sequences form n initial database partitions, each containing one sequence; the total sequence length of each partition is initialized to the length of the sequence it contains;
Step 1.3: build a min-heap Ψ = {D1, D2, D3, …, Dn} on the total sequence lengths of the n partitions, where the root D1 is the partition with the smallest total sequence length;
Step 1.4: take the heap root Di, add the longest unassigned sequence to Di, and re-adjust the heap;
Step 1.5: repeat step 1.4 until every sequence has been assigned to a partition.
3. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 2 comprises the following sub-steps:
Step 2.1: call a first flatMap function to read each sequence from a database partition, where sequences are stored as key-value pairs of the form <LongWritable offset, Text sequence>;
Step 2.2: call a second flatMap function to split each sequence into items and emit <item, 1> key-value pairs;
Step 2.3: merge the key-value pairs with the same key and pass them to the Reduce nodes; the Reduce nodes call the reduceByKey() function to compute the support of each <item, 1> pair and output the pairs whose support is at least the configured minimum support; the keys of these pairs are the 1-sequence patterns, and the values are their support counts.
4. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 3.1: store the 1-sequence patterns produced in step 2 in an RDD rather than in HDFS, to reduce I/O overhead;
Step 3.2: in the k-th MapReduce job, each Map node first reads the (k-1)-sequence patterns from the RDD and generates the candidate k-sequence patterns Ck through the candidate-generation step;
Step 3.3: call a map function to read each sequence s of a database partition and check, for each candidate k-sequence pattern c, whether c is a subsequence of s; if so, emit a <c, 1> key-value pair; merge the key-value pairs with the same key and pass them to the Reduce nodes;
Step 3.4: each Reduce node calls the reduceByKey() function to compute the support of each <c, 1> pair and outputs the pairs whose support is at least the configured minimum support; these are the k-sequence patterns Lk.
CN201710284017.8A 2017-04-26 2017-04-26 A parallel sequential pattern mining method based on the Spark platform Expired - Fee Related CN107145548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710284017.8A CN107145548B (en) 2017-04-26 2017-04-26 A parallel sequential pattern mining method based on the Spark platform


Publications (2)

Publication Number Publication Date
CN107145548A true CN107145548A (en) 2017-09-08
CN107145548B CN107145548B (en) 2019-08-20

Family

ID=59774891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710284017.8A Expired - Fee Related CN107145548B (en) A parallel sequential pattern mining method based on the Spark platform

Country Status (1)

Country Link
CN (1) CN107145548B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866904A * 2015-06-16 2015-08-26 CETC Software and Information Service Co., Ltd. Spark-based parallelization method for a BP neural network optimized by a genetic algorithm
CN105740424A * 2016-01-29 2016-07-06 Hunan University Efficient text classification method based on the Spark platform
CN106021412A * 2016-05-13 2016-10-12 Shanghai Institute of Computing Technology Companion-vehicle identification method oriented to large-scale vehicle-passing data
CN106126341A * 2016-06-23 2016-11-16 Chengdu University of Information Technology Multi-computation-framework processing system and association rule mining method for big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO Bo et al.: "Parallel Frequent Pattern Mining Algorithm Based on Spark", Computer Engineering and Applications *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665291A * 2017-09-27 2018-02-06 South China University of Technology Mutation detection method based on the Spark cloud computing platform
CN107665291B * 2017-09-27 2020-05-22 South China University of Technology Mutation detection method based on the Spark cloud computing platform

Also Published As

Publication number Publication date
CN107145548B (en) 2019-08-20

Similar Documents

Publication Publication Date Title
Zhang et al. Parallel processing systems for big data: a survey
Dean et al. MapReduce: simplified data processing on large clusters
He et al. Comet: batched stream processing for data intensive distributed computing
Zhang et al. Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation
Vulimiri et al. Global analytics in the face of bandwidth and regulatory constraints
Verma et al. Breaking the MapReduce stage barrier
Bu et al. Pregelix: Big (ger) graph analytics on a dataflow engine
Malewicz et al. Pregel: a system for large-scale graph processing
Chen et al. Computation and communication efficient graph processing with distributed immutable view
Liang et al. Express supervision system based on NodeJS and MongoDB
Gu et al. Chronos: An elastic parallel framework for stream benchmark generation and simulation
Labouseur et al. Scalable and Robust Management of Dynamic Graph Data.
CN111966677A (en) Data report processing method and device, electronic equipment and storage medium
Zhao et al. ZenLDA: Large-scale topic model training on distributed data-parallel platform
Shi et al. DFPS: Distributed FP-growth algorithm based on Spark
Sun et al. Survey of distributed computing frameworks for supporting big data analysis
Fang et al. Integrating workload balancing and fault tolerance in distributed stream processing system
CN107346331B (en) Parallel sequential pattern mining method based on the Spark cloud computing platform
Kostenetskii et al. Simulation of hierarchical multiprocessor database systems
Zhang et al. Egraph: efficient concurrent GPU-based dynamic graph processing
CN107145548B (en) Parallel sequential pattern mining method based on the Spark platform
Yang From Google file system to omega: a decade of advancement in big data management at Google
Alemi et al. CCFinder: using Spark to find clustering coefficient in big graphs
Chen et al. Applying segmented right-deep trees to pipelining multiple hash joins
Azez et al. JOUM: an indexing methodology for improving join in hive star schema

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190820

Termination date: 20210426