CN107145548A - A kind of Parallel Sequence mode excavation method based on Spark platforms - Google Patents
- Publication number
- CN107145548A CN107145548A CN201710284017.8A CN201710284017A CN107145548A CN 107145548 A CN107145548 A CN 107145548A CN 201710284017 A CN201710284017 A CN 201710284017A CN 107145548 A CN107145548 A CN 107145548A
- Authority
- CN
- China
- Prior art keywords
- sequence
- key
- database
- value pair
- burst
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2308—Concurrency control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Abstract
The invention discloses a parallel sequential pattern mining method based on the Spark platform. To address the poor computational efficiency of existing serial sequential pattern mining algorithms on massive data, and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, a rational sequence database partitioning strategy is designed that resolves the load-imbalance problem to the greatest extent. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data.
Description
Technical field
The invention belongs to the technical field of sequential pattern mining, and in particular relates to a parallel sequential pattern mining method based on the Spark platform.
Background art
(1) Sequential pattern mining technology
[Document 1] first proposed the concept of sequential pattern mining. Sequential pattern mining discovers frequently occurring ordered events or subsequences in a sequence database. As one of the important research topics in data mining, it has a wide range of applications, such as user purchasing behavior analysis, biological sequence analysis, taxi frequent-trajectory pattern discovery, and human mobility pattern analysis. [Document 2] proposed the GSP algorithm, which uses a redundant-candidate pruning strategy and a hash tree to achieve fast access to candidate patterns. [Document 3] proposed the SPADE algorithm based on a vertical data representation. [Document 4] proposed the PrefixSpan algorithm based on projected databases. Although these traditional serial algorithms improved performance through changes in data structures and mining mechanisms, their processing speed on large-scale datasets often fails to meet practical requirements. By the early 21st century, the rapid development of computer hardware had greatly promoted research on parallel sequential pattern mining algorithms, and scholars at home and abroad have successively proposed various distributed sequential pattern mining algorithms.
[Document 5] proposed two parallel algorithms based on tree projection to solve the sequential pattern discovery problem on distributed-memory parallel computers. [Document 6] proposed the DMGSP algorithm, which reduces the volume of transmitted data through a lexicographic sequence tree. [Document 7] proposed the FMGSP algorithm for fast mining of globally maximal frequent sequences. However, because distributed-memory systems and grid computing systems provide no fault-tolerance mechanism, the parallel sequential pattern mining algorithms built on these platforms are not fault-tolerant. In addition, developing parallel algorithms on these platforms requires programmers to have substantial parallel-algorithm development experience.
The emergence of cloud computing platforms provides new methods and approaches for implementing parallel algorithms, making efficient, low-cost sequential pattern mining on massive data possible. The Hadoop cloud computing platform developed by the Apache Software Foundation, thanks to its open-source nature, scalability, and high fault tolerance, allows programmers without extensive parallel-algorithm development experience to develop concurrent programs easily on the Hadoop platform, and many scholars have therefore proposed parallel sequential pattern mining algorithms based on Hadoop. [Document 8] proposed DPSP, a parallel incremental sequential pattern mining algorithm based on Hadoop. [Document 9] proposed BIDE-MR, a Hadoop-based parallel closed-sequence mining algorithm. [Document 10] proposed the SPAMC algorithm based on Hadoop. [Document 11] proposed a parallel PrefixSpan algorithm based on Hadoop. [Document 12] proposed a Hadoop-based parallel PrefixSpan algorithm built on the idea of transaction decomposition. [Document 13] proposed the Hadoop-based DGSP algorithm built on database partitioning. The algorithms of documents [8][9][10][11], which are based on iterative MapReduce jobs, must execute multiple MapReduce jobs that each read the sequence database from HDFS, incurring very large I/O overhead. The algorithms of documents [12][13], which are based on non-iterative MapReduce jobs, cannot distribute the computation evenly across the compute nodes, causing load imbalance.
(2) The Map-Reduce programming framework
Map-Reduce, proposed in [Document 14], is a programming framework for concurrent computation over large-scale datasets (larger than 1 TB) that adopts the concepts "Map" and "Reduce". The user only needs to write two functions, called Map and Reduce; the system manages the execution of the parallel Map and Reduce tasks and the coordination between them, handles the failure of any of these tasks, and guarantees fault tolerance against hardware failures.
The Map-Reduce computation proceeds as follows:
1) The Map-Reduce library in the user program first splits the input files into M data fragments, each typically 16 to 64 MB in size (the user can control the fragment size through an optional parameter), and then starts many copies of the program on a cluster of machines.
2) One of these program copies is special: the master. All the other copies are workers that are assigned work by the master. There are M Map tasks and R Reduce tasks to assign; the master assigns a Map task or a Reduce task to an idle worker.
3) A worker assigned a Map task reads the corresponding input fragment, parses (key, value) pairs out of the input data, and passes each pair to the user-defined Map function. The intermediate key-value pairs produced by the Map function are buffered in local memory.
4) Periodically, the buffered key-value pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which forwards them to the workers assigned Reduce tasks.
5) When a worker assigned a Reduce task is notified by the master of these storage locations, it uses remote procedure calls to read the buffered data from the local disks of the Map workers. When the Reduce worker has read all intermediate data, it sorts the data by key so that all occurrences of the same key are grouped together. The sort is necessary because many different keys map to the same Reduce task; if the intermediate data is too large to be sorted in memory, an external sort is used.
6) The Reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of values to the user-defined Reduce function. The output of the Reduce function is appended to the output file of the partition.
7) When all Map and Reduce tasks have completed, the master wakes up the user program, and the Map-Reduce call in the user program returns.
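The seven steps above can be condensed into a small single-process sketch (plain Python, no Hadoop; `map_reduce` and its parameters are illustrative names, not part of any real framework):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn, num_partitions=3):
    """Minimal single-process sketch of the Map-Reduce flow:
    map -> partition by key -> group/sort -> reduce."""
    # Map phase: each input record yields intermediate (key, value) pairs,
    # routed by a partition function into one of R regions.
    regions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in map_fn(record):
            regions[hash(key) % num_partitions][key].append(value)
    # Reduce phase: each region is processed with its keys in sorted order,
    # so all values for the same key are handled together.
    output = {}
    for region in regions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output

# Word count, the canonical Map-Reduce example.
lines = ["a b a", "b c"]
result = map_reduce(lines,
                    map_fn=lambda line: [(w, 1) for w in line.split()],
                    reduce_fn=lambda key, values: sum(values))
print(result)  # counts: a -> 2, b -> 2, c -> 1
```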
(3) The Spark cloud computing platform
Spark is an open-source general-purpose parallel cloud computing platform developed by the AMP Lab at UC Berkeley. Spark is a distributed computing engine built on the MapReduce idea and has the advantages of Hadoop MapReduce; the difference is that intermediate results can be kept in memory, so the distributed file system (HDFS) need not be read and written between stages. Spark is therefore better suited to MapReduce algorithms that require iteration, such as those used in data mining and machine learning. Spark supports in-memory distributed datasets, provides interactive queries, and can cache datasets in memory to improve their read/write rate, enabling reuse of datasets across the computation and optimizing iterative workloads. At the bottom layer, Spark can store data on a variety of distributed file systems such as HDFS, and it often works together with the resource-scheduling platforms Mesos and YARN.
The RDD (Resilient Distributed Dataset) is the core of Spark. An RDD is a collection of data objects distributed across the compute nodes and stored in memory. RDDs allow the user to explicitly cache a working set in memory when executing multiple queries; subsequent queries can reuse the working set, which greatly improves query speed. Because an RDD is distributed over multiple nodes, it can be processed in parallel. RDDs are scalable and elastic: during computation, when memory is insufficient to hold an RDD, partitions can be spilled to disk, ensuring that the computation can continue within the available memory. An RDD is a partitioned, read-only, immutable data collection that can be operated on in parallel; it can only be created by applying deterministic transformations (such as map, join, filter, and groupBy) to other RDDs, but these restrictions make fault tolerance very cheap. Unlike distributed shared-memory systems, which must pay for expensive checkpointing and rollback, an RDD rebuilds lost partitions through its lineage: each RDD carries the information needed to derive it from other RDDs, so a lost data partition can be reconstructed without checkpointing. Although the RDD is not a general shared-memory abstraction, it possesses good expressiveness, scalability, and reliability, and is widely applicable to data-parallel applications.
Related documents:
[Document 1] Agrawal R, Srikant R. Mining sequential patterns[C]//Proceedings of the 11th International Conference on Data Engineering. Taipei: IEEE Computer Society, 1995: 3-14.
[Document 2] Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements[C]//Proceedings of the 5th International Conference on Extending Database Technology. Avignon: Lecture Notes in Computer Science, 1996: 3-17.
[Document 3] Zaki M. SPADE: an efficient algorithm for mining frequent sequences[J]. Machine Learning, 2001, 42(1-2): 31-60.
[Document 4] Pei J, Han J, Pinto H, et al. Mining sequential patterns by pattern-growth: the PrefixSpan approach[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(11): 1424-1440.
[Document 5] Guralnik V, Garg N, Karypis G. Parallel tree projection algorithm for sequence mining[C]//Proceedings of the 7th International European Conference on Parallel Processing. London, 2001: 310-320.
[Document 6] Gong Zhenzhi, Hu Kongfa, Da Qingli, Zhang Changhai. DMGSP: a fast distributed global sequential pattern mining algorithm[J]. Journal of Southeast University, 2007, 16(04): 574-579.
[Document 7] Zhang Changhai, Hu Kongfa, Liu Haidong. FMGSP: an efficient method of mining global sequential patterns[C]//Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery. Los Alamitos: IEEE Computer Society, 2007: 761-765.
[Document 8] Huang J, Lin S, Chen M. DPSP: distributed progressive sequential pattern mining on the cloud[C]. Lecture Notes in Computer Science, 2010: 27-34.
[Document 9] Yu D, Wu W, Zheng S, Zhu Z. BIDE-based parallel mining of frequent closed sequences with MapReduce[C]//Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing, 2012: 177-186.
[Document 10] Chen Chun-Chieh, Tseng Chi-Yao, Chen Ming-Syan. Highly scalable sequential pattern mining based on MapReduce model on the cloud[C]//2013 IEEE International Congress on Big Data, 2013: 310-317.
[Document 11] Sabrina P N. Multiple MapReduce and derivative projected database: new approach for supporting PrefixSpan scalability[C]. IEEE, 2015: 148-153.
[Document 12] Wang X. Parallel sequential pattern mining by transaction decomposition[C]//2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 4: 1746-1750.
[Document 13] Yu X, Liu J, Ma C, Li B. A MapReduce reinforced distributed sequential pattern mining algorithm[C]//Algorithms and Architectures for Parallel Processing, 2015, vol. 9529: 183-197.
[Document 14] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[C]//Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. New York: ACM Press, 2004: 137-149.
Summary of the invention
To address the poor computational efficiency of existing serial sequential pattern mining algorithms on massive data and the high I/O overhead and load imbalance of existing Hadoop-based parallel sequential pattern mining algorithms, the present invention provides a parallel sequential pattern mining method based on the Spark platform.
The technical solution adopted by the present invention is a parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
the sequence database is partitioned into database partitions of equal size, the number of partitions being determined by the number of worker nodes in the cluster, so that the total sequence length in each database partition is nearly equal;
Step 2: database preparation;
all 1-sequence patterns are produced using one MapReduce task;
Step 3: database mining;
all k-sequence patterns, k > 1, are found iteratively using MapReduce tasks.
The present invention designs a rational sequence database partitioning strategy that resolves the load-imbalance problem to the greatest extent. On this basis, the original GSP algorithm is parallelized according to the characteristics of the MapReduce programming framework, and the large-scale parallel computing capability of the Spark cloud computing platform is used to improve the efficiency of sequential pattern mining on massive data. The technical solution is simple and fast, and can substantially improve the efficiency of sequential pattern mining.
Brief description of the drawings
Fig. 1 is the flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of sequence database partitioning in the embodiment of the present invention;
Fig. 3 is a schematic diagram of the sequence database partitioning result in the embodiment of the present invention;
Fig. 4 is a schematic diagram of the database preparation process in the embodiment of the present invention;
Fig. 5 is a schematic diagram of the execution of the first MapReduce task of the database mining step in the embodiment of the present invention;
Fig. 6 is a schematic diagram of the execution of the second MapReduce task of the database mining step in the embodiment of the present invention.
Embodiment
To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and an embodiment. It should be understood that the embodiment described here serves only to illustrate and explain the present invention, not to limit it.
The flow of the Spark-based sequential pattern mining method designed by the present invention is shown in Fig. 1; all steps can be run automatically by those skilled in the art using computer software technology. The embodiment proceeds as follows:
Step 1, database partitioning;
The sequence database is partitioned into database partitions of equal size (the number of partitions is determined by the number of worker nodes in the cluster), so that the total sequence length in each database partition is nearly equal.
Referring to Fig. 2, the sequence database is partitioned as follows:
(1) All sequences in the database are sorted in descending order of sequence length.
(2) The first n sequences form n initial database partitions, each containing one sequence. The total sequence length of each partition is initialized to the length of the sequence it contains.
(3) A min-heap Ψ = {D1, D2, D3, …, Dn} is built on the total sequence lengths of the n database partitions, where the root D1 is the partition whose assigned total sequence length is shortest.
(4) The min-heap root Di is taken, the longest of the unassigned sequences is added to Di, and the min-heap is adjusted.
(5) Step (4) is repeated until all sequences have been assigned to database partitions.
As shown in Fig. 3, this embodiment partitions the original sequence database into n = 3 sub-databases.
Original sequence data storehouse content such as table 1 below:
Table 1
Sequence number | Sequence |
S1 | <(a b)a c> |
S2 | <(c d)(e f g)> |
S3 | <h> |
S4 | <c g> |
S5 | <g a> |
S6 | <a c g h> |
First the database is sorted, giving the order S2 S1 S6 S4 S5 S3. The first three sequences seed the initial heap, which is then adjusted into a min-heap. The three sub-databases and their sequences are now: sub-database P1: S2, length 5; sub-database P2: S1, length 4; sub-database P3: S6, length 4. The min-heap root is P2. The sorted sequence database is then read sequence by sequence. First S4 is read and added to P2, whose length becomes 6. The min-heap is adjusted; the root is now P3. S5 is read and added to P3, whose length becomes 6. The min-heap is adjusted; the root is now P1. S3 is read and added to P1, whose length becomes 6. All sequences have now been read, and the database partitioning step ends. In this embodiment the partitioning result guarantees that the total sequence length in every partition is identical.
The resulting sub-databases 1, 2, and 3 are shown in Tables 2, 3, and 4 below:
Table 2
Sequence number | Sequence |
S2 | <(c d)(e f g)> |
S3 | <h> |
Table 3
Sequence number | Sequence |
S1 | <(a b)a c> |
S4 | <c g> |
Table 4
Sequence number | Sequence |
S6 | <a c g h> |
S5 | <g a> |
Let q be the number of Map nodes in the Spark platform. It is recommended that the number of sub-databases equal the number of Map nodes, i.e., n = q. If n < q, then (in the absence of task failures) (q - n) Map nodes are never used, and node utilization is low. If n > q, then (in the absence of task failures) the remaining (n - q) sub-databases can only be processed after the q Map nodes have finished the first q sub-databases, and processing efficiency is low. Setting n = q therefore satisfies both node utilization and processing efficiency.
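The partitioning procedure above can be sketched in plain Python. This is a simplification: sequences are represented as flattened item lists, and the standard-library `heapq` stands in for the min-heap Ψ of step 1:

```python
import heapq

def partition_database(sequences, n):
    """Greedy longest-first partitioning with a min-heap, as in step 1.
    `sequences` maps sequence id -> item list; returns n partitions
    balanced by total sequence length."""
    # (1) Sort sequence ids by sequence length, descending.
    order = sorted(sequences, key=lambda sid: len(sequences[sid]), reverse=True)
    # (2) The n longest sequences seed the n initial partitions.
    parts = [[sid] for sid in order[:n]]
    # (3) Heap entries are (total_length, partition_index); the root is the
    #     partition with the smallest total assigned length so far.
    heap = [(len(sequences[order[i]]), i) for i in range(n)]
    heapq.heapify(heap)
    # (4)-(5) Add each remaining sequence to the currently lightest partition.
    for sid in order[n:]:
        total, i = heapq.heappop(heap)
        parts[i].append(sid)
        heapq.heappush(heap, (total + len(sequences[sid]), i))
    return parts

# The example database from Table 1 (itemsets flattened to item lists).
db = {"S1": list("abac"), "S2": list("cdefg"), "S3": list("h"),
      "S4": list("cg"), "S5": list("ga"), "S6": list("acgh")}
parts = partition_database(db, 3)
totals = [sum(len(db[sid]) for sid in p) for p in parts]
```

On the example data this reproduces the result of Tables 2-4: {S2, S3}, {S1, S4}, and {S6, S5}, each with total length 6.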
Step 2, database preparation;
In this step, all 1-sequence patterns are produced using one MapReduce task. The step first calls a flatMap function to read each sequence from a sequence database partition, where a sequence is stored as a key-value pair of the form <LongWritable offset, Text sequence>. A second flatMap function then splits each sequence into items and produces <item, 1> key-value pairs. Key-value pairs with the same key are merged and passed to the Reduce nodes, which call the reduceByKey() function to compute the support of each <item, 1> key and output the key-value pairs whose support is at least the configured minimum support. The keys of these pairs are the 1-sequence patterns, and the values are their support counts.
The embodiment sets the minimum support to 2; Fig. 4 shows the execution of the preparation step. For database partition 1, the Map node produces the key-value pairs in Table 5:
Table 5
Output result |
<c,1> |
<d,1> |
<e,1> |
<f,1> |
<g,1> |
<h,1> |
For database partition 2, the Map node produces the key-value pairs in Table 6:
Table 6
Output result |
<a,1> |
<b,1> |
<a,1> |
<c,1> |
<c,1> |
<g,1> |
For database partition 3, the Map node produces the key-value pairs in Table 7:
Table 7
Output result |
<a,1> |
<c,1> |
<g,1> |
<h,1> |
<g,1> |
<a,1> |
The Reduce nodes merge the key-value pairs with identical keys and output the pairs whose support is at least 2, as shown in Table 8:
Table 8
Sequence pattern | Support |
a | 3 |
c | 4 |
g | 4 |
h | 2 |
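The preparation step can be sketched in plain Python as follows. One assumption is made explicit here: support counts the number of sequences containing an item, not the number of occurrences, which is what Table 8 implies (item a has support 3 even though it occurs twice in S1). `Counter` stands in for the merge-and-reduceByKey stage:

```python
from collections import Counter

def frequent_1_patterns(partitions, min_support):
    """Preparation-step sketch: the Map side emits <item, 1> pairs for each
    sequence, the Reduce side sums the counts per key (as reduceByKey would)
    and keeps the items meeting min_support."""
    counts = Counter()
    for part in partitions:          # one Map node per database partition
        for seq in part:             # one <offset, sequence> record
            counts.update(set(seq))  # each sequence supports an item at most once
    return {item: c for item, c in counts.items() if c >= min_support}

# The three partitions from Tables 2-4, itemsets flattened to item lists.
partitions = [[list("cdefg"), list("h")],   # partition 1: S2, S3
              [list("abac"), list("cg")],   # partition 2: S1, S4
              [list("acgh"), list("ga")]]   # partition 3: S6, S5
patterns = frequent_1_patterns(partitions, 2)
print(patterns)  # matches Table 8: a -> 3, c -> 4, g -> 4, h -> 2
```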
Step 3, database mining;
This step iteratively uses MapReduce tasks to find all k-sequence patterns (k > 1). The 1-sequence patterns produced in the preparation step are stored in an RDD rather than in HDFS, to reduce I/O overhead. In the k-th MapReduce task, each Map node first reads the (k-1)-sequence patterns from the RDD and produces the candidate k-sequence patterns Ck through the candidate-generation step. A map function is then called to read each sequence s in the database partition and to test whether each candidate k-sequence pattern c is a subsequence of s; if so, a <c,1> key-value pair is produced. Key-value pairs with the same key are merged and passed to the Reduce nodes. Finally, each Reduce node calls the reduceByKey() function to compute the support of each <c,1> key and outputs the key-value pairs whose support is at least the configured minimum support; these are the frequent k-sequence patterns Lk.
This embodiment sets the minimum support to 2; Fig. 5 shows the execution of the first MapReduce task of the mining step. For database partition 1, Map node 1 produces the key-value pairs in Table 9:
Table 9
Output result |
<c g,1> |
For database partition 2, Map node 2 produces the key-value pairs in Table 10:
Table 10
Output result |
<a c,1> |
<c g,1> |
For database partition 3, Map node 3 produces the key-value pairs in Table 11:
Table 11
Output result |
<a c,1> |
<a g,1> |
<a h,1> |
<c g,1> |
<c h,1> |
<g h,1> |
<g a,1> |
The Reduce nodes merge the key-value pairs produced by all Map nodes and output the pairs whose support is at least the configured minimum support, as shown in Table 12:
Table 12
Sequence pattern | Support |
<a c> | 2 |
<c g> | 3 |
Fig. 6 shows the execution of the second MapReduce task of the mining step. Each Map node reads the 2-sequence patterns from the RDD, but the candidate-generation step produces no candidate 3-sequence patterns (the only GSP join result, <a c g>, contains the infrequent subsequence <a g> and is pruned). Map node 1 therefore outputs no key-value pairs for database partition 1, Map node 2 none for partition 2, and Map node 3 none for partition 3. The Reduce nodes merge the key-value pairs produced by all Map nodes, find that no Map node produced any output, and the program therefore terminates.
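Under the same simplifications as before (itemsets flattened to single items; support counting each containing sequence once), the whole mining step can be sketched in plain Python. `gen_candidates` follows the standard GSP join-and-prune, which is why the iteration stops here: the only join result, ('a','c','g'), contains the infrequent subsequence ('a','g') and is pruned, so no third round runs:

```python
from collections import Counter

def is_subsequence(pattern, seq):
    """True if pattern's items occur in seq in order (not necessarily adjacent)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def gen_candidates(freq, k):
    """Candidate generation: for k == 2, pair up the frequent items; for
    k > 2, GSP-style join (p joins q when p minus its first item equals q
    minus its last item). Either way, prune any candidate that has an
    infrequent (k-1)-subsequence."""
    if k == 2:
        items = sorted({p[0] for p in freq})
        joined = {(a, b) for a in items for b in items}
    else:
        joined = {p + (q[-1],) for p in freq for q in freq if p[1:] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in freq for i in range(len(c)))}

def mine(partitions, min_support):
    """Mining-step sketch: seed with frequent 1-patterns, then iterate
    candidate generation + support counting until no candidates remain."""
    counts = Counter()
    for part in partitions:
        for seq in part:
            counts.update({(item,) for item in seq})  # one vote per sequence
    freq = {p for p, c in counts.items() if c >= min_support}
    patterns = {p: counts[p] for p in freq}
    k = 1
    while freq:
        k += 1
        candidates = gen_candidates(freq, k)
        counts = Counter()
        for part in partitions:   # Map side: emit <c, 1> per containing sequence
            for seq in part:
                counts.update(c for c in candidates if is_subsequence(c, seq))
        freq = {c for c in candidates if counts[c] >= min_support}  # Reduce side
        patterns.update({c: counts[c] for c in freq})
    return patterns

partitions = [[list("cdefg"), list("h")],   # partition 1: S2, S3
              [list("abac"), list("cg")],   # partition 2: S1, S4
              [list("acgh"), list("ga")]]   # partition 3: S6, S5
result = mine(partitions, 2)
```

On the example data this yields exactly the frequent patterns of Tables 8 and 12: the 1-patterns a, c, g, h and the 2-patterns <a c> (support 2) and <c g> (support 3), with nothing at length 3.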
The specific embodiment described here merely illustrates the spirit of the present invention. Those skilled in the art may make various modifications, additions, or substitutions to the described embodiment without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.
Claims (4)
1. A parallel sequential pattern mining method based on the Spark platform, characterized by comprising the following steps:
Step 1: database partitioning;
partitioning the sequence database into database partitions of equal size, the number of partitions being determined by the number of worker nodes in the cluster, so that the total sequence length in each database partition is nearly equal;
Step 2: database preparation;
producing all 1-sequence patterns using one MapReduce task;
Step 3: database mining;
iteratively finding all k-sequence patterns, k > 1, using MapReduce tasks.
2. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 1 comprises the following sub-steps:
Step 1.1: sorting all sequences in the database in descending order of sequence length;
Step 1.2: forming n initial database partitions from the first n sequences, each partition containing one sequence, and initializing the total sequence length of each partition to the length of the sequence it contains;
Step 1.3: building a min-heap Ψ = {D1, D2, D3, …, Dn} on the total sequence lengths of the n database partitions, where the root D1 is the partition whose assigned total sequence length is shortest;
Step 1.4: taking the min-heap root Di, adding the longest unassigned sequence to Di, and adjusting the min-heap;
Step 1.5: repeating step 1.4 until all sequences have been assigned to database partitions.
3. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 2 comprises the following sub-steps:
Step 2.1: calling a first flatMap function to read each sequence from a sequence database partition, where a sequence is stored as a key-value pair of the form <LongWritable offset, Text sequence>;
Step 2.2: calling a second flatMap function to split each sequence into items and produce <item, 1> key-value pairs;
Step 2.3: merging the key-value pairs with the same key and passing them to the Reduce nodes, which call the reduceByKey() function to compute the support of each <item, 1> key and output the key-value pairs whose support is at least the configured minimum support; the keys of these pairs are the 1-sequence patterns, and the values are their support counts.
4. The parallel sequential pattern mining method based on the Spark platform according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 3.1: storing the 1-sequence patterns produced in step 2 in an RDD rather than in HDFS, to reduce I/O overhead;
Step 3.2: in the k-th MapReduce task, each Map node first reading the (k-1)-sequence patterns from the RDD and producing the candidate k-sequence patterns Ck through the candidate-generation step;
Step 3.3: calling a map function to read each sequence s in the database partition and to test whether each candidate k-sequence pattern c is a subsequence of s, producing a <c,1> key-value pair if so; merging the key-value pairs with the same key and passing them to the Reduce nodes;
Step 3.4: each Reduce node calling the reduceByKey() function to compute the support of each <c,1> key and outputting the key-value pairs whose support is at least the configured minimum support, these being the frequent k-sequence patterns Lk.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710284017.8A CN107145548B (en) | 2017-04-26 | 2017-04-26 | A kind of Parallel Sequence mode excavation method based on Spark platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107145548A true CN107145548A (en) | 2017-09-08 |
CN107145548B CN107145548B (en) | 2019-08-20 |
Family
ID=59774891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710284017.8A Expired - Fee Related CN107145548B (en) | 2017-04-26 | 2017-04-26 | A kind of Parallel Sequence mode excavation method based on Spark platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145548B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866904A (en) * | 2015-06-16 | 2015-08-26 | 中电科软件信息服务有限公司 | Parallelization method of BP neural network optimized by genetic algorithm based on spark |
CN105740424A (en) * | 2016-01-29 | 2016-07-06 | 湖南大学 | Spark platform based high efficiency text classification method |
CN106021412A (en) * | 2016-05-13 | 2016-10-12 | 上海市计算技术研究所 | Large-scale vehicle-passing data oriented accompanying vehicle identification method |
CN106126341A (en) * | 2016-06-23 | 2016-11-16 | 成都信息工程大学 | It is applied to many Computational frames processing system and the association rule mining method of big data |
2017
- 2017-04-26 CN CN201710284017.8A patent/CN107145548B/en not_active Expired - Fee Related
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866904A (en) * | 2015-06-16 | 2015-08-26 | 中电科软件信息服务有限公司 | A Spark-based parallelization method for a BP neural network optimized by a genetic algorithm |
CN105740424A (en) * | 2016-01-29 | 2016-07-06 | 湖南大学 | An efficient text classification method based on the Spark platform |
CN106021412A (en) * | 2016-05-13 | 2016-10-12 | 上海市计算技术研究所 | A companion-vehicle identification method for large-scale vehicle-passing data |
CN106126341A (en) * | 2016-06-23 | 2016-11-16 | 成都信息工程大学 | A multi-computing-framework processing system and association rule mining method for big data |
Non-Patent Citations (1)
Title |
---|
CAO Bo (曹博) et al.: "A parallel frequent pattern mining algorithm based on Spark", Computer Engineering and Applications (《计算机工程与应用》) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107665291A (en) * | 2017-09-27 | 2018-02-06 | 华南理工大学 | A mutation detection method based on the Spark cloud computing platform |
CN107665291B (en) * | 2017-09-27 | 2020-05-22 | 华南理工大学 | Mutation detection method based on the Spark cloud computing platform |
Also Published As
Publication number | Publication date |
---|---|
CN107145548B (en) | 2019-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Parallel processing systems for big data: a survey | |
Dean et al. | MapReduce: simplified data processing on large clusters | |
He et al. | Comet: batched stream processing for data intensive distributed computing | |
Zhang et al. | Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation | |
Vulimiri et al. | Global analytics in the face of bandwidth and regulatory constraints | |
Verma et al. | Breaking the MapReduce stage barrier | |
Bu et al. | Pregelix: Big(ger) graph analytics on a dataflow engine | |
Malewicz et al. | Pregel: a system for large-scale graph processing | |
Chen et al. | Computation and communication efficient graph processing with distributed immutable view | |
Liang et al. | Express supervision system based on NodeJS and MongoDB | |
Gu et al. | Chronos: An elastic parallel framework for stream benchmark generation and simulation | |
Labouseur et al. | Scalable and Robust Management of Dynamic Graph Data. | |
CN111966677A (en) | Data report processing method and device, electronic equipment and storage medium | |
Zhao et al. | ZenLDA: Large-scale topic model training on distributed data-parallel platform | |
Shi et al. | DFPS: Distributed FP-growth algorithm based on Spark | |
Sun et al. | Survey of distributed computing frameworks for supporting big data analysis | |
Fang et al. | Integrating workload balancing and fault tolerance in distributed stream processing system | |
CN107346331B (en) | A parallel sequential pattern mining method based on the Spark cloud computing platform | |
Kostenetskii et al. | Simulation of hierarchical multiprocessor database systems | |
Zhang et al. | Egraph: efficient concurrent GPU-based dynamic graph processing | |
CN107145548B (en) | A parallel sequential pattern mining method based on the Spark platform | |
Yang | From Google file system to omega: a decade of advancement in big data management at Google | |
Alemi et al. | CCFinder: using Spark to find clustering coefficient in big graphs | |
Chen et al. | Applying segmented right-deep trees to pipelining multiple hash joins | |
Azez et al. | JOUM: an indexing methodology for improving join in hive star schema |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 2019-08-20. Termination date: 2021-04-26.