CN103995690B - A GPU-based parallel sequential pattern mining method - Google Patents


Info

Publication number
CN103995690B
CN103995690B (application CN201410172991.1A)
Authority
CN
China
Prior art keywords
length
sequence
gpu
frequent episodes
support
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410172991.1A
Other languages
Chinese (zh)
Other versions
CN103995690A (en)
Inventor
杨世权
袁博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201410172991.1A priority Critical patent/CN103995690B/en
Publication of CN103995690A publication Critical patent/CN103995690A/en
Application granted granted Critical
Publication of CN103995690B publication Critical patent/CN103995690B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a GPU-based parallel sequential pattern mining method, comprising the following steps: scanning the records of an input sequence database into a memory buffer of the CPU; computing the length-1 candidate sequences and their supports; deriving the length-1 frequent sequences; computing a candidate-sequence information matrix and a candidate-event information matrix and copying them into GPU memory; computing the length-2 candidate sequences and their supports in parallel on the GPU with CUDA and saving the results in GPU memory; deriving the length-2 frequent sequences; converting the input sequence database into a vertical-format database; copying the length-1 and length-2 frequent sequences together with the vertical-format database into GPU memory; and computing the frequent sequences of all remaining lengths in parallel on the GPU. The invention improves computational efficiency.

Description

A GPU-based parallel sequential pattern mining method
Technical field
The present invention relates to database mining technology, and in particular to a GPU-based parallel sequential pattern mining method.
Background technology
The rapid development of Internet technology has brought us into an era of information explosion, and big data has become an irreversible trend. With the continuous fall in the cost of data storage devices and the diversification of data acquisition methods and channels, more and more companies and organizations build their own databases to store massive user data. However, the rapid accumulation of data creates an information-overload problem: the information that enterprises and users are actually interested in is buried in a huge volume of miscellaneous data, and useful information is hard to extract effectively. Data mining is currently regarded as one of the effective tools for solving the information-overload problem. By analyzing and mining massive data, we can extract a large amount of valuable information and make big data serve people better.
Sequential pattern mining, an important research direction in data mining, has attracted growing attention from researchers. Its goal is to find sequence patterns that occur frequently in large databases. A sequence pattern is an ordered sequence that occurs frequently in the database in a particular order. Traditional data mining tasks only find the sets of items a user may buy, without considering the order among those itemsets; sequential pattern mining takes temporal information into account, mining not only the itemsets a user buys but also the temporal precedence among them. In this way we can more accurately predict the next purchase from the items a user has already bought, provide users with more valuable recommendations, and help them find what they need faster. Precisely because sequential pattern mining introduces ordering information among patterns, it has very important and wide applications in practice. Many real-world scientific and business problems can be converted into finding sequences with precedence relations. In a real-time web recommendation system, sequence mining over the records of the web access log can determine the temporal relations between accessed pages and the order in which visitors view the pages of a site; based on these sequence patterns the page layout can be optimized, and different users can even be offered personalized browsing experiences. In bioinformatics, gene sequence detection, sequence-function prediction, and the study of relations and interactions between sequences can all be supported by frequent-pattern mining over DNA sequences, effectively guiding gene identification and functional annotation and the recognition of the compositional information of protein sequences.
At present, sequential pattern mining algorithms fall into two classes: candidate-generation-and-test methods and pattern-growth methods. Candidate-generation-and-test methods apply the Apriori principle from association-rule mining: in each iteration, the frequent sequential patterns of length k are merged, following a set-union scheme, into candidate sequences of length k+1; the database is then scanned to count the support of each candidate, and candidates whose support exceeds the threshold become frequent (k+1)-sequences and serve as the input of the next iteration, the algorithm terminating once all frequent patterns have been found. Pattern-growth methods generate no candidate sequences at all while producing frequent sequences: starting from a special representation built from the raw database, they partition the search space and grow sequences by appending frequent suffixes to existing frequent sequences.
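For illustration, the candidate-generation-and-test join described above can be sketched as follows. This is a simplified Python sketch, not part of the patent: sequences are modeled as tuples of single items (ignoring multi-item itemsets), and the function name is illustrative.

```python
from itertools import product

def generate_candidates(frequent_k):
    """GSP/Apriori-style join: merge two frequent k-sequences whose
    (k-1)-suffix and (k-1)-prefix coincide into a (k+1)-candidate."""
    out = set()
    for a, b in product(frequent_k, repeat=2):
        if a[1:] == b[:-1]:          # suffix of a matches prefix of b
            out.add(a + b[-1:])      # extend a by the last item of b
    return out
```

Each candidate produced this way still has to be tested against the database by support counting, which is exactly the expensive step the invention moves to the GPU.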
In the big-data era, however, both enterprise databases and data on the Internet are characterized by sheer volume. Mining sequences from massive data inevitably produces a large number of candidate sequences, and filtering them by computation is quite time-consuming, which poses a huge challenge to both processor performance and memory capacity. Moreover, most existing algorithms execute serially; for the many real-world applications with strict real-time requirements, their execution efficiency cannot meet the demands of massive-data processing.
It can be seen that existing time-series mining techniques still have the following problem: when executing mining tasks over massive data, the algorithms are very slow and execution efficiency is low. Real-time performance and speed have therefore become the key technical challenge and difficulty of time-series mining.
Summary of the invention
The technical problem to be solved by the present invention is to provide a GPU-based parallel sequential pattern mining method, so as to improve mining efficiency.
The GPU, as the newest parallel computing platform, offers extremely high floating-point throughput together with low power consumption, strong scalability, and low price; as an indispensable hardware component of modern computers it is very widely deployed. By fully exploiting the massive multi-threaded parallel processing capability of the GPU, the execution of sequence mining algorithms can be accelerated, greatly improving the mining efficiency over massive data and satisfying the efficiency demands of many practical sequential pattern mining applications. Accordingly, the present invention solves the aforementioned technical problem by the following means:
A GPU-based parallel sequential pattern mining method comprises the following steps:
Step 101: scanning the records of an input sequence database into a memory buffer of the CPU;
Step 102: according to the records of the sequence database, computing the length-1 candidate sequences and their supports;
Step 103: according to the supports of the length-1 candidate sequences, deriving the length-1 frequent sequences;
Step 104: scanning the records of the input sequence database and computing a candidate-sequence information matrix and a candidate-event information matrix;
Step 105: copying the candidate-sequence information matrix and the candidate-event information matrix into GPU memory, computing the length-2 candidate sequences and their supports in parallel on the GPU with CUDA, and saving the results in GPU memory;
Step 106: copying the length-2 candidate sequences and their supports into the memory buffer of the CPU and deriving the length-2 frequent sequences;
Step 107: converting the input sequence database into a vertical-format database;
Step 108: copying the length-1 frequent sequences, the length-2 frequent sequences, and the vertical-format database into GPU memory, and computing the frequent sequences of all remaining lengths in parallel on the GPU.
Compared with the prior art, the technical scheme provided by the present invention uses the GPU (Graphic Processing Unit) to accelerate, through parallel optimization, both the computation of frequent sequences of length greater than 1 and the computation of candidate-sequence supports via equivalence classes. The computation is carried out mainly in GPU memory using CUDA (Compute Unified Device Architecture, the general-purpose GPU programming model developed by NVIDIA), and is much faster than the corresponding computation on the CPU. This solves the prior-art problem of slow frequent-sequence computation, improves the efficiency of time-series mining, and achieves fast time-series mining.
Accompanying drawing explanation
Fig. 1 is the flow chart of the GPU-based parallel sequential pattern mining method provided by the embodiment of the present invention.
Fig. 2 is a schematic diagram of the input sequence database in the embodiment of the present invention.
Fig. 3 is a schematic diagram of the vertical-format database in the embodiment of the present invention.
Detailed description of the invention
The invention is further described below in conjunction with preferred embodiments.
As shown in Fig. 1, the GPU-based parallel sequential pattern mining method of this embodiment comprises the following steps:
Step 101: scanning the records of the input sequence database into a memory buffer of the CPU.
This step may use the following concrete method: the records of the input sequence database in the storage device are read block by block into a memory buffer of the CPU; the capacity of the memory buffer is larger than a preset minimum buffer threshold and smaller than the maximum free memory of the system. When the records in the buffer have been fully processed, the next block of data is read from the input sequence database in the storage device into the buffer, until all records of the sequence database have been scanned. A preferred concrete implementation is as follows:
The input sequence database is usually stored as a file on the system hard disk; the storage type may vary, e.g. binary format or text format. The records of the sequence database can be read from the hard disk, and the sequence data read in is preferably stored in system memory. The in-memory storage format of the input sequence data can be predefined as required, so that subsequent processing can read from memory according to this predefined format.
Specifically, a sequence-database interface may also be defined, comprising the size of the memory buffer allocated for the input database records, the sequence-data pointer, the current data block size, the read/write position within the current block, and a flag marking whether the current block has been fully read. A database-read function is defined on this interface; it opens the input sequence database file from the hard disk, obtains the database records, and stores them in memory in the defined format. For example, with a defined buffer size of 2048, the read function reads 2048 integers of data from the file into memory; once that block has been processed, the read function reads the next 2048 integers from the hard disk into memory, until the sequence database file has been read completely.
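The block-wise scan just described can be sketched as follows. This is a minimal Python sketch under the embodiment's assumption of a 2048-integer buffer; the names are illustrative, not the patent's interface.

```python
import io

BUFFER_INTS = 2048   # preset buffer size, in integers (per the embodiment)
INT_BYTES = 4        # assumed 4-byte integers

def scan_blocks(f, handle_block):
    """Read a binary sequence-database file block by block into a
    fixed-size buffer, invoking handle_block on each chunk until EOF."""
    while True:
        chunk = f.read(BUFFER_INTS * INT_BYTES)
        if not chunk:
            break
        handle_block(chunk)
```

A caller would pass an open file and a callback that parses the records held in the current block, so that memory use stays bounded regardless of database size.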
Step 102: according to the records of the sequence database, computing the length-1 candidate sequences and their supports.
As a preferred scheme, this step may use the following method: scan the sequence database records in the memory buffer; when an item appears for the first time in a record and has not appeared in any earlier record, save the item as a length-1 candidate sequence and increase its support by 1. If the item appears multiple times within one record, save it and increase its support only at its first appearance, ignoring later appearances within that record; whenever the item first appears in another record, increase its support by 1 again. A preferred concrete method is as follows:
Two arrays, ItemCount and FrequentIndex, are allocated in memory. The size of ItemCount equals the value of the largest item appearing in the sequence database; it counts the support of the items appearing in the records. The size of FrequentIndex is likewise the value of the largest item; it records the numbering of the length-1 frequent sequences.
Scan the sequence database records in the buffer: when an item appears in a record for the first time and has not appeared in earlier records, increment the corresponding position of ItemCount by 1. If the item appears multiple times in the currently processed record, increment ItemCount only at the first appearance and make no change for later appearances within that record; when the item appears in another record, increment its position in ItemCount again, and again only once per record no matter how many times it recurs there. After the input database file on the hard disk has been scanned once in this manner, the final ItemCount array holds the support of every length-1 candidate sequence of the input sequence database.
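The once-per-sequence counting rule of ItemCount can be sketched as follows (a Python sketch, not the patent's array implementation; a dictionary stands in for the item-indexed array):

```python
from collections import Counter

def count_length1_support(database):
    """Support of each length-1 candidate: an item is counted at most
    once per sequence, however often it recurs within that sequence
    (mirrors the ItemCount logic). Each sequence is a list of itemsets."""
    support = Counter()
    for sequence in database:
        seen = {item for itemset in sequence for item in itemset}
        for item in seen:
            support[item] += 1
    return support
```

The deduplication per sequence is what makes the count a support (number of sequences containing the item) rather than a raw frequency.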
Step 103: according to the supports of the length-1 candidate sequences, deriving the length-1 frequent sequences.
As a preferred scheme, this step may use the following method: compare the support of each length-1 candidate sequence with a preset minimum support threshold; if the support is greater than or equal to the threshold, save the candidate as a length-1 frequent sequence; if the support is below the threshold, save it as a length-1 non-frequent sequence. A preferred concrete method is as follows:
Compare each element of the ItemCount array with the preset minimum support threshold; if it exceeds the threshold, the corresponding item becomes a length-1 frequent sequence. Traverse the ItemCount array in increasing index order: for the first length-1 frequent sequence found, write 1 into the corresponding position of FrequentIndex; for the second, write 2; and so on. For length-1 non-frequent sequences, write -1 into the corresponding FrequentIndex position. FrequentIndex is used later when computing the length-2 frequent sequences.
The preset minimum support threshold can be an absolute support or a relative support. In this embodiment it is a relative support of 0.01: when 1% of the input sequences contain a candidate sequence, that candidate becomes frequent.
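The FrequentIndex numbering can be sketched as follows (a Python sketch of the array logic described above; names are illustrative):

```python
def build_frequent_index(support, max_item, min_support):
    """Assign consecutive numbers (from 1) to frequent length-1 items
    in increasing item order; non-frequent items get -1
    (mirrors the FrequentIndex array)."""
    index = [-1] * (max_item + 1)
    next_id = 1
    for item in range(max_item + 1):
        if support.get(item, 0) >= min_support:
            index[item] = next_id
            next_id += 1
    return index
```

The compact numbering lets later stages address the frequent items with dense indices instead of raw item values.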
Step 104: scanning the records of the input sequence database and computing the candidate-sequence information matrix and the candidate-event information matrix.
As a preferred scheme, this step may use the following method: read the records of the input sequence database into the CPU memory buffer and process the records in the buffer one by one with an iterative algorithm, producing the candidate-sequence information matrix and the candidate-event information matrix. Traverse the records in the buffer one at a time; save in the candidate-sequence information matrix the positions of the first and last appearance of each length-1 frequent sequence in the record, and save in the candidate-event information matrix the positions of the length-1 frequent sequences that appear after each length-1 frequent sequence. A preferred concrete method is as follows:
Length-2 candidate sequences can be divided into the sequence type and the event type. The two items of a sequence-type candidate belong to two different itemsets of one sequence record; the two items of an event-type candidate belong to the same itemset of one sequence record. For example, suppose an input sequence record is <(1,5,6) (2) (3,7) (2,4)>, where the angle brackets denote a sequence, each pair of parentheses denotes an itemset, the digits inside the parentheses are the items of that itemset, and the order of the itemsets expresses temporal precedence. The candidate sequence <(1) (3)> is of the sequence type, because item 1 and item 3 belong to two different itemsets of the record, (1,5,6) and (3,7), so item 1 temporally precedes item 3. The candidate sequence <(1,5)> is of the event type, because items 1 and 5 belong to the same itemset (1,5,6) of the record, so items 1 and 5 occur concurrently. Generating sequence-type candidates requires the candidate-sequence information matrix; generating event-type candidates requires the candidate-event information matrix. Specifically:
The matrix SequenceInfo stores the candidate-sequence information. SequenceInfo is a two-dimensional matrix: its first dimension has size NumFrequent (the number of length-1 frequent sequences), and its second dimension may be any integer greater than 2, preset to 10 in this embodiment. The first index of SequenceInfo identifies the corresponding length-1 frequent sequence; the first two storage cells of the second dimension store, respectively, the itemset number of the first appearance and the itemset number of the last appearance of that length-1 frequent sequence in the current input sequence. For example, for the input record <(1,5,6) (2) (3,7) (2,4)> and the length-1 frequent sequence 2, the itemset number of its first appearance is 2 and that of its last appearance is 4; hence SequenceInfo[2][0] is set to 2 and SequenceInfo[2][1] is set to 4. Scanning the input sequence database yields the whole SequenceInfo matrix.
The matrix ItemsetInfo stores the candidate-event information. ItemsetInfo is a one-dimensional matrix of size NumFrequent*(NumFrequent-1)/2, where NumFrequent is the number of length-1 frequent sequences; it records, for each length-1 frequent sequence, the other length-1 frequent sequences that appear after it. For example, for the input record <(1,5,6) (2) (3,7) (2,4)>, where items 1, 2, 3, 4 are length-1 frequent sequences, the frequent sequences 2, 3, 4 appear after the frequent sequence 1, so positions 2, 3, 4 of the storage area corresponding to frequent sequence 1 in ItemsetInfo are set to 1, indicating that the frequent items 2, 3, 4 appear after the frequent item 1.
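The per-record construction of the two matrices can be sketched as follows. This is a simplified Python sketch: dictionaries and sets stand in for the SequenceInfo rows and the packed ItemsetInfo bit matrix, and the "appears after" relation is derived from first/last occurrences as in the document's example.

```python
def build_info_matrices(sequence, frequent_items):
    """For one input sequence (a list of itemsets), record per frequent
    item its first and last itemset number (the SequenceInfo row), and
    the set of frequent items seen after it (the ItemsetInfo data)."""
    first_last = {}
    for eid, itemset in enumerate(sequence, start=1):
        for item in itemset:
            if item not in frequent_items:
                continue
            if item not in first_last:
                first_last[item] = [eid, eid]
            else:
                first_last[item][1] = eid
    # y "appears after" x if some occurrence of y lies in a later
    # itemset than the first occurrence of x
    after = {x: set() for x in frequent_items}
    for x in first_last:
        for y in first_last:
            if y != x and first_last[y][1] > first_last[x][0]:
                after[x].add(y)
    return first_last, after
```

Running this on the example record <(1,5,6) (2) (3,7) (2,4)> with frequent items {1,2,3,4} reproduces the values given in the text: item 2 first appears at itemset 2 and last at itemset 4, and items 2, 3, 4 appear after item 1.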
Step 105: copying the candidate-sequence information matrix and the candidate-event information matrix into GPU memory, computing the length-2 candidate sequences and their supports in parallel on the GPU with CUDA, and saving the results in GPU memory. Everything done before this step takes place in main memory, where all intermediate data are kept; the computation of this step is carried out in the global memory on the GPU side.
As a preferred scheme, this step may use the following method: allocate storage space in the global memory of the GPU and copy the candidate-sequence information matrix and the candidate-event information matrix from main memory into it; for each length-1 frequent sequence, launch an independent GPU thread that computes the length-2 candidate sequences prefixed by that length-1 frequent sequence together with their supports, and save the length-2 candidate sequences and their supports in the global memory of the GPU. A preferred concrete method is as follows:
Storage space is allocated in the global memory of the GPU with the functions provided by CUDA, and the candidate-sequence information matrix and the candidate-event information matrix are copied from main memory into it. In this embodiment, the CUDA memory allocation function cudaMalloc() allocates the space, and the memory-copy API function cudaMemcpy() copies the above data items from main memory into GPU global memory.
The GPU computes the length-2 candidate sequences and their supports in parallel from the candidate-sequence information matrix and the candidate-event information matrix. Using the CUDA kernel mechanism, a kernel function is launched on the GPU; it starts many threads executing in parallel, each responsible for computing the group of candidate 2-sequences prefixed by one particular length-1 frequent sequence. In this embodiment, the CUDA built-in variables threadIdx and blockIdx determine the index of each GPU thread, which is matched with the data of the two information matrices to compute the length-2 candidates and their supports. For example, thread 1 computes the group of candidate 2-sequences prefixed by the first length-1 frequent sequence: by reading, from the candidate-sequence information matrix, the itemset numbers of the first and last appearances of the first length-1 frequent sequence and of every other length-1 frequent sequence, it obtains the length-2 candidate sequences. If the itemset number of the first appearance of the first length-1 frequent sequence is smaller than the itemset number of the last appearance of another length-1 frequent sequence, the two together form a length-2 candidate sequence. In addition, during its computation thread 1 can launch another kernel function that processes in parallel the data block of the candidate-event information matrix corresponding to the first length-1 frequent sequence, finding the event-type length-2 candidate sequences. In this embodiment, the CUDA Dynamic Parallelism feature is used to launch the secondary kernel function from within the main kernel function.
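The first-before-last rule applied by each thread can be emulated on the CPU as follows. This Python sketch loops over prefixes where the CUDA kernel assigns one thread per prefix; it is an illustration of the rule, not the kernel itself.

```python
def length2_sequence_candidates(first_last):
    """Emulates the per-thread rule: (x, y) is a sequence-type length-2
    candidate within one input sequence iff the first occurrence of x
    precedes the last occurrence of y (SequenceInfo rule).
    first_last maps item -> [first itemset id, last itemset id]."""
    cands = set()
    for x, (fx, _) in first_last.items():
        for y, (_, ly) in first_last.items():
            if x != y and fx < ly:
                cands.add((x, y))
    return cands
```

Because each (prefix, other-item) pair is checked independently, the work partitions naturally across threads, which is what makes the step a good fit for the GPU.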
Step 106: copying the length-2 candidate sequences and their supports into the memory buffer of the CPU and deriving the length-2 frequent sequences.
As a preferred scheme, this step may use the following method: copy the length-2 candidate sequences and their supports from the global memory of the GPU into the memory buffer of the CPU, and derive the length-2 frequent sequences on the CPU. Compare the support of each length-2 candidate sequence with the preset minimum support threshold; if the support is greater than or equal to the threshold, save the candidate as a length-2 frequent sequence; if the support is below the threshold, save it as a length-2 non-frequent sequence. A preferred concrete implementation is as follows:
This step is carried out in main memory. First, the length-2 candidate sequences and their supports in GPU global memory are copied into main memory with the CUDA API function cudaMemcpy(); then the support of each candidate is compared with the preset support threshold. If the support of a candidate exceeds the preset value, the length-2 candidate is frequent and is kept in memory as a length-2 frequent sequence; if its support is below the preset value, it is a non-frequent sequence. In this embodiment, the preset support threshold is a relative support of 0.01: a candidate is frequent if 1% of all sequence records contain it.
Step 107: converting the input sequence database into a vertical-format database.
As a preferred scheme, this step may use the following method: scan the records of the input sequence database; whenever a record contains a length-1 frequent sequence, save the sequence number of the record and the itemset number at which the length-1 frequent sequence occurs. Iteratively scan all records of the input sequence database, compute the sequence numbers and itemset numbers at which every length-1 frequent sequence appears, and save the length-1 frequent sequences together with these sequence numbers and itemset numbers as the vertical-format database of the input sequence database. A preferred concrete method is as follows:
As shown in Figs. 2-3, this operation takes place in main memory. A two-dimensional array VerticalDatabase stores the equivalence classes after conversion; its first dimension equals the number of length-1 frequent sequences, and its second dimension must be large enough to hold the numbers of all sequences containing a given length-1 frequent sequence, preset in this embodiment to the total number of sequences in the input database. Read a database record, add its sequence number to the equivalence class of every contained length-1 frequent sequence, and add the itemset numbers at which each length-1 frequent sequence occurs in that sequence to the equivalence class as well. For example, in the input sequence database of this embodiment the preset minimum support threshold is 2, and the length-1 frequent sequences are found to be items 1, 2, 5, where item 1 has support 3, item 2 has support 2, and item 5 has support 3. When the record (1,2,6) with sequence number 1 and itemset number 1 is read, it contains the length-1 frequent sequences 1 and 2, so sequence number 1 and itemset number 1 are added to the equivalence classes of frequent sequences 1 and 2 in VerticalDatabase. When the record (1,7) with sequence number 1 and itemset number 2 is read, it contains the length-1 frequent sequence 1, so sequence number 1 and itemset number 2 are added to the equivalence class of frequent sequence 1 in VerticalDatabase. Scanning all records of the input sequence database iteratively in this way yields the equivalence class of every frequent sequence.
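The conversion to vertical format can be sketched as follows (a Python sketch with per-item lists of (sequence id, itemset id) pairs standing in for the fixed-size VerticalDatabase array):

```python
def build_vertical_database(database, frequent_items):
    """Horizontal -> vertical format: for each frequent length-1 item,
    the list of (sequence number, itemset number) pairs where it occurs
    (its equivalence class)."""
    vertical = {item: [] for item in frequent_items}
    for sid, sequence in enumerate(database, start=1):
        for eid, itemset in enumerate(sequence, start=1):
            for item in itemset:
                if item in frequent_items:
                    vertical[item].append((sid, eid))
    return vertical
```

With the vertical layout, the support of longer candidates can be counted by joining occurrence lists instead of rescanning the whole database, which is the basis of Step 108.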
Step 108: copy the length-1 frequent sequences, the length-2 frequent sequences and the vertical-format database into the video memory of the GPU, and compute the frequent sequences of the remaining lengths in parallel on the GPU.
In a preferred embodiment, this step may be carried out as follows: allocate storage space in the global memory of the GPU and copy the length-1 frequent sequences, the length-2 frequent sequences and the vertical-format database into the storage space in the global memory; launch multiple threads on the GPU, each thread being responsible for computing all length-3 frequent sequences having one length-1 frequent sequence as prefix; after the length-3 frequent sequences have been obtained, launch multiple threads according to the number of length-2 frequent sequences, each thread being responsible for computing all length-4 frequent sequences having one length-2 frequent sequence as prefix, and iterate until frequent sequences of all lengths have been obtained. When computing the support of a length-(k+1) candidate sequence, the records in the equivalence classes of the two length-k frequent sequences that generate the length-(k+1) candidate are partitioned into blocks, and multiple threads are launched on the GPU to process the record blocks in parallel, yielding the support of the length-(k+1) candidate sequence.
Functions provided by CUDA are used to allocate storage space in the global memory of the GPU and to copy the length-1 frequent sequences, the length-2 frequent sequences and the vertical-format database into it. In this embodiment, the CUDA global-memory allocation function cudaMalloc() allocates the space in the GPU's global memory, and the memory-copy function cudaMemcpy() copies each of the above data items from main memory into the GPU's global memory.
The computation of the frequent sequences of the remaining lengths can be divided into subprocesses, each computing the frequent sequences having one length-1 frequent sequence as prefix. The subprocesses are mutually independent: all frequent sequences computed within one subprocess depend only on frequent sequences generated within that same subprocess, never on information from another subprocess. Since there are no dependencies between subprocesses, the computations of the frequent sequences prefixed by the different length-1 frequent sequences can be executed in parallel. In this embodiment, CUDA launches a kernel function on the GPU side; the kernel creates threads that execute in parallel, the number of threads being the number of length-1 frequent sequences. The built-in CUDA variables threadIdx and blockIdx determine the index of each thread, which is associated with the corresponding subprocess, so that each thread executes one subprocess. During execution, a subprocess may dynamically launch additional kernel functions, decomposing itself further into smaller subprocesses; this raises the parallelism of the computation and exploits the parallel processing capability of the GPU to the greatest extent. The dynamic parallelism feature provided by CUDA can be used to launch the additional kernel functions from within a subprocess.
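The one-thread-per-prefix decomposition can be mimicked on the CPU with a thread pool. This is only an illustrative analogue of the CUDA kernel, not the patent's GPU code, and for brevity each "subprocess" performs a single level of itemset-extension growth rather than the full recursive mining; the point it demonstrates is that each prefix's subprocess reads only the shared vertical database and so runs independently of the others:

```python
from concurrent.futures import ThreadPoolExecutor

def grow_prefix(prefix, vertical, min_support):
    # One "subprocess": extend a single length-1 frequent sequence by
    # itemset extension (one growth level only, for brevity) and keep the
    # extensions whose support reaches the threshold.  It reads only the
    # shared, read-only vertical database, so no subprocess depends on
    # results produced by another -- which is what makes them parallelizable.
    found = []
    base = set(vertical[prefix])
    for other, entries in vertical.items():
        if other <= prefix:          # avoid generating (a, b) and (b, a) twice
            continue
        joined = [e for e in entries if e in base]
        if len({seq_no for seq_no, _ in joined}) >= min_support:
            found.append((prefix, other))
    return found

def mine_parallel(vertical, min_support):
    # One worker per length-1 frequent sequence, mirroring the
    # one-GPU-thread-per-prefix mapping via threadIdx/blockIdx in the text.
    prefixes = sorted(vertical)
    with ThreadPoolExecutor(max_workers=max(1, len(prefixes))) as pool:
        parts = pool.map(grow_prefix, prefixes,
                         [vertical] * len(prefixes),
                         [min_support] * len(prefixes))
    return [cand for part in parts for cand in part]
```

On the embodiment's equivalence classes with a support threshold of 2, only the candidate (1,2) survives this growth level.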
The support of a candidate sequence can be computed by a join over the vertical-format databases of its two generating sequences. For example, in this embodiment, to compute the support of the length-2 candidate (1,2), in which item 1 and item 2 must occur in the same itemset of a sequence, the equivalence-class records of item 1 are searched and the entries whose sequence number and itemset number both match an entry in the equivalence-class records of item 2 are collected; these matching entries form the vertical-format database of the candidate (1,2), and the number of distinct sequence numbers in that database is the support of the candidate. The join over the vertical-format databases of the generating sequences can be executed in parallel on the GPU: the sequence numbers in the equivalence-class records are split into sub-sections at a fixed stride, each thread computes one sub-section, and when all threads have finished, the results are reduced and merged into the complete vertical-format database.
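A minimal Python sketch of this join, with the blocked parallel matching simulated by a CPU thread pool (the GPU kernel, thread indexing and reduction are of course not reproduced). The equivalence-class contents follow the embodiment's example for candidate (1,2); entries beyond those spelled out in the text, and the resulting support of 2, are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def join_support(class_a, class_b, block_size=2):
    # Itemset-extension join: an entry survives when the same
    # (sequence number, itemset number) pair appears in both equivalence
    # classes.  class_a is split into fixed-stride blocks that are matched
    # in parallel, and the per-block results are merged ("reduced") at the
    # end, as the text describes for the GPU threads.
    b = set(class_b)
    blocks = [class_a[i:i + block_size]
              for i in range(0, len(class_a), block_size)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda blk: [e for e in blk if e in b], blocks)
    joined = [e for part in parts for e in part]       # candidate's vertical DB
    support = len({seq_no for seq_no, _ in joined})    # distinct sequences
    return joined, support

# Embodiment example: candidate (1,2), items 1 and 2 in the same itemset.
item1_class = [(1, 1), (1, 2), (2, 1), (4, 1)]   # (sequence no., itemset no.)
item2_class = [(1, 1), (2, 1)]
cand_db, sup = join_support(item1_class, item2_class)
```

The blocked map-then-merge shape is what makes the join amenable to one-thread-per-block execution on the GPU.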
The frequent sequences of the remaining lengths computed in parallel on the GPU are copied back into main memory, merged with the length-1 and length-2 frequent sequences into a complete frequent-sequence file, and written out to the hard disk.
In the method provided by this embodiment of the present invention, the GPU is used to accelerate, through parallel optimization, the computation of frequent sequences of length greater than 1, and candidate-sequence supports are computed through equivalence classes. The main computation is carried out in the video memory of the GPU using CUDA, and is much faster than the existing CPU-based computation. This solves the problem in the prior art that frequent-sequence algorithms compute too slowly, raises the efficiency of time-series mining, and achieves fast time-series mining. The GPU combines high performance with flexibility and is in effect an ultra-large-scale parallel processor, with powerful floating-point capability and very high memory bandwidth; the high performance brought by its large-scale parallel computing, its continually strengthening programmability and its low price make the technical solution provided by this embodiment highly practical, easy to use and flexible.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention shall not be considered as limited to these descriptions. For those of ordinary skill in the art, several equivalent substitutions or obvious modifications with the same performance or use may be made without departing from the concept of the present invention, and all shall be deemed to fall within the protection scope of the present invention.

Claims (9)

1. A GPU-based parallel time-series mining method, characterized by comprising the following steps:
Step 101: scanning records of an input sequence database into a memory buffer of a CPU;
Step 102: computing, according to the records in the sequence database, candidate sequences of length 1 and the supports of the length-1 candidate sequences;
Step 103: computing the length-1 frequent sequences according to the supports of the length-1 candidate sequences;
Step 104: scanning the records in the input sequence database and computing a candidate sequence information matrix and a candidate event information matrix;
Step 105: copying the candidate sequence information matrix and the candidate event information matrix into the video memory of a GPU, computing the length-2 candidate sequences and the supports of the length-2 candidate sequences in parallel on the GPU using the Compute Unified Device Architecture (CUDA), and saving the results in the video memory of the GPU;
Step 106: copying the length-2 candidate sequences and the supports of the length-2 candidate sequences into the memory buffer of the CPU, and computing the length-2 frequent sequences;
Step 107: computing a vertical-format database according to the input sequence database;
Step 108: copying the length-1 frequent sequences, the length-2 frequent sequences and the vertical-format database into the video memory of the GPU, and computing the frequent sequences of the remaining lengths in parallel on the GPU.
2. The method according to claim 1, characterized in that step 101 comprises:
reading the records of the input sequence database in the storage device into the memory buffer of the CPU block by block, the capacity of the memory buffer being greater than a preset minimum buffer threshold and less than the maximum free memory of the system; when the records in the memory buffer have been processed, reading the next block of data from the input sequence database in the storage device into the memory buffer for processing, until all records in the sequence database have been scanned.
3. The method according to claim 2, characterized in that step 102 comprises:
scanning a sequence database record in the memory buffer; when an item occurs in the sequence database record for the first time and has not occurred in any earlier sequence database record, saving the item as a length-1 candidate sequence and increasing the support of the item by 1;
if the item occurs multiple times in the sequence database record, saving the item and increasing its support only on the first occurrence, and performing no processing when the item occurs again in the same sequence database record; and when the item occurs for the first time in any other sequence database record, increasing the support of the item by 1.
4. The method according to claim 1, characterized in that step 103 comprises:
comparing the support of each length-1 candidate sequence with a preset minimum support threshold; if the support is greater than or equal to the minimum support threshold, saving the length-1 candidate sequence as a length-1 frequent sequence; if the support is less than the minimum support threshold, saving the length-1 candidate sequence as a length-1 non-frequent sequence.
5. The method according to claim 1, characterized in that step 104 comprises:
reading the records of the input sequence database into the memory buffer of the CPU, and processing each record in the memory buffer in turn with an iterative algorithm to obtain the candidate sequence information matrix and the candidate event information matrix;
traversing the records in the memory buffer one by one, saving into the candidate sequence information matrix the first and last positions at which each length-1 frequent sequence occurs in the record, and saving into the candidate event information matrix the positions of the first occurrences of the different length-1 frequent sequences that appear after each length-1 frequent sequence in the record.
6. The method according to claim 1, characterized in that step 105 comprises:
allocating storage space in the video memory of the GPU, and copying the candidate sequence information matrix and the candidate event information matrix from main memory into the video memory of the GPU; for each length-1 frequent sequence, launching an independent thread on the GPU to compute the length-2 candidate sequences having the length-1 frequent sequence as prefix and the supports of the length-2 candidate sequences, and saving the length-2 candidate sequences and their supports in the video memory of the GPU.
7. The method according to claim 1, characterized in that step 106 comprises:
copying the length-2 candidate sequences and the supports of the length-2 candidate sequences from the video memory of the GPU into the memory buffer of the CPU, and computing the length-2 frequent sequences with the CPU; comparing the support of each length-2 candidate sequence with the preset minimum support threshold; if the support of the candidate sequence is greater than or equal to the minimum support threshold, saving the length-2 candidate sequence as a length-2 frequent sequence; if the support of the candidate sequence is less than the minimum support threshold, saving the length-2 candidate sequence as a length-2 non-frequent sequence.
8. The method according to claim 1, characterized in that step 107 comprises:
scanning a record in the input sequence database; if the record contains a length-1 frequent sequence, saving the sequence number of the record together with the itemset number at which the length-1 frequent sequence occurs within the record;
iteratively scanning all records in the input sequence database, computing the sequence numbers and itemset numbers at which all the length-1 frequent sequences occur, and saving the length-1 frequent sequences with the sequence numbers and itemset numbers as the vertical-format database of the input sequence database.
9. The method according to claim 1, characterized in that step 108 comprises:
allocating storage space in the video memory of the GPU, and copying the length-1 frequent sequences, the length-2 frequent sequences and the vertical-format database into the storage space in the video memory;
launching multiple threads on the GPU, each thread being responsible for computing all length-3 frequent sequences having one length-1 frequent sequence as prefix; after the length-3 frequent sequences have been obtained, launching multiple threads according to the number of length-2 frequent sequences, each thread being responsible for computing all length-4 frequent sequences having one length-2 frequent sequence as prefix, and iterating until frequent sequences of all lengths have been obtained;
when computing the support of a length-(k+1) candidate sequence, partitioning into blocks the records, in the equivalence classes, of the two length-k frequent sequences that generate the length-(k+1) candidate sequence, and launching multiple threads on the GPU to process the record blocks in parallel to obtain the support of the length-(k+1) candidate sequence.
CN201410172991.1A 2014-04-25 2014-04-25 A kind of parallel time sequential mining method based on GPU Active CN103995690B (en)

Publications (2)

Publication Number Publication Date
CN103995690A CN103995690A (en) 2014-08-20
CN103995690B true CN103995690B (en) 2016-08-17

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6473757B1 (en) * 2000-03-28 2002-10-29 Lucent Technologies Inc. System and method for constraint based sequential pattern mining
CN103150515A (en) * 2012-12-29 2013-06-12 江苏大学 Association rule mining method for privacy protection under distributed environment
CN103279332A (en) * 2013-06-09 2013-09-04 浪潮电子信息产业股份有限公司 Data flow parallel processing method based on GPU-CUDA platform and genetic algorithm
CN103559016A (en) * 2013-10-23 2014-02-05 江西理工大学 Frequent subgraph excavating method based on graphic processor parallel computing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI464608B (en) * 2010-11-18 2014-12-11 Wang Yen Yao Fast algorithm for mining high utility itemsets

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He, Qiong Luo. "Frequent Itemset Mining on Graphics Processors." Proceedings of the Fifth International Workshop on Data Management on, 2009-06-28, pp. 34-42. *
Liu Ying; Jian Liheng; Liang Shenshen; Li Xiaojun; Gao Yang; Wang Cheng. "Research on Parallel Data Mining Technology for GPUs Based on the CUDA Architecture." Science Research Informatization Technology and Application, 2011, Vol. 1, No. 4. *
