CN103995690A - Parallel time sequence mining method based on GPU - Google Patents

Parallel time sequence mining method based on GPU

Info

Publication number
CN103995690A
CN103995690A (application CN201410172991.1A)
Authority
CN
China
Prior art keywords
sequence
length
candidate
frequent
gpu
Prior art date
Legal status
Granted
Application number
CN201410172991.1A
Other languages
Chinese (zh)
Other versions
CN103995690B (en)
Inventor
杨世权
袁博
Current Assignee
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201410172991.1A priority Critical patent/CN103995690B/en
Publication of CN103995690A publication Critical patent/CN103995690A/en
Application granted granted Critical
Publication of CN103995690B publication Critical patent/CN103995690B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a GPU-based parallel sequential pattern mining method, characterized by the following steps: scanning the records of an input sequence database into a CPU memory buffer; computing the candidate sequences of length 1 and their supports; computing the frequent sequences of length 1; computing a candidate-sequence information matrix and a candidate-event information matrix; copying the two matrices into GPU device memory; computing the candidate sequences of length 2 and their supports in parallel on the GPU with CUDA and keeping the result in GPU device memory; computing the frequent sequences of length 2; deriving a vertical-format database from the input sequence database; copying the frequent sequences of lengths 1 and 2 into GPU device memory; and computing the frequent sequences of the remaining lengths in parallel on the GPU. The method improves computational efficiency.

Description

A GPU-based parallel sequential pattern mining method
Technical field
The present invention relates to database mining technology, and in particular to a GPU-based parallel sequential pattern mining method.
Background art
The rapid development of Internet technology has brought us into an era of information explosion, and big data has become an irreversible trend. With the continually falling cost of data storage devices and the diversification of data acquisition methods and channels, more and more companies and organizations have built their own databases to store massive amounts of user data. However, this rapid accumulation of data has created the problem of information overload: the information that enterprises and users are truly interested in is buried in a huge volume of miscellaneous data, and useful information is difficult to mine effectively. Data mining is considered one of the most effective tools for addressing information overload. By analyzing and mining massive data, we can extract a large amount of valuable information and make big data serve people better.
Sequential pattern mining, an important research direction in the data mining field, has received increasing attention from researchers. Its goal is to discover sequential patterns that occur frequently in large databases; a sequential pattern is an ordered sequence that occurs frequently in the database in a particular order. Traditional data mining tasks only look for the itemsets a user may buy and do not care about the order among those itemsets, whereas sequential pattern mining takes temporal information into account: it mines not only the itemsets a user buys but also the temporal precedence among those itemsets. We can therefore predict the user's next purchase more accurately from the items currently being bought, provide recommendations of greater value, and help users find what they need faster. Because sequential pattern mining captures the order information between patterns, it has very important and widespread practical applications. Many real-life scientific and business problems can be converted into the problem of finding ordered sequences. In real-time web recommendation systems, mining the records of web access logs can determine the temporal relationships between accessed pages and the order in which visitors browse a site; based on these sequential patterns, page layout can be optimized and personalized browsing experiences can even be offered to different users. In bioinformatics, gene sequence detection, sequence function prediction, and the study of interactions and effects between sequences can all be supported by frequent pattern mining over DNA sequences, which effectively guides gene identification, functional annotation, and the analysis of protein sequence composition.
At present, sequential pattern mining algorithms fall mainly into two classes: candidate-generation-and-test methods and pattern-growth methods. Candidate-generation-and-test methods exploit the Apriori rule from association rule mining. In each iteration, the frequent sequences of length k are merged, following a set-merging scheme, into candidate sequences of length k+1; the database is then scanned to count the support of each candidate, and the candidates whose support exceeds the threshold become frequent (k+1)-sequences and serve as the input of the next iteration, until all frequent patterns have been found and the algorithm terminates. Pattern-growth methods generate no candidate sequences at all while producing frequent sequences: they build a special representation of the raw database, partition the search space, and grow sequences by appending frequent suffixes to existing frequent sequences.
In the big data era, however, whether the data is an enterprise database or information on the Internet, data volumes are large. Mining sequences from massive data inevitably produces a huge number of candidate sequences, and sifting them by computation is very time-consuming, which poses a huge challenge to processor performance and memory capacity. Moreover, most existing algorithms run serially, so for many real-world applications with strict real-time requirements their execution efficiency cannot satisfy the demands of massive data processing.
Existing sequence mining techniques therefore suffer from the following problem: on massive data mining tasks the algorithms are very slow and execution efficiency is low. Real-time response and speed have thus become the key technical difficulties of sequence mining.
Summary of the invention
The technical problem to be solved by this invention is to provide a GPU-based parallel sequential pattern mining method, so as to improve mining efficiency.
The GPU, as the latest parallel computing platform, has extremely high floating-point throughput together with low power consumption, strong scalability, and low price; as an indispensable component of modern computers it is applied very widely. Fully exploiting the massive multi-threaded parallel processing power of the GPU to accelerate sequence mining algorithms greatly improves mining efficiency on massive data and can satisfy the high-efficiency demands of many practical sequential pattern mining applications. Accordingly, the present invention solves the aforementioned technical problem by the following means:
A GPU-based parallel sequential pattern mining method comprises the following steps:
Step 101: scan the records of the input sequence database into a memory buffer of the CPU;
Step 102: according to the records in the sequence database, compute the candidate sequences of length 1 and their supports;
Step 103: according to the supports of the length-1 candidate sequences, compute the frequent sequences of length 1;
Step 104: scan the records of the input sequence database and compute the candidate-sequence information matrix and the candidate-event information matrix;
Step 105: copy the candidate-sequence information matrix and the candidate-event information matrix into GPU device memory, use CUDA to compute in parallel on the GPU the candidate sequences of length 2 and their supports, and keep the result in GPU device memory;
Step 106: copy the length-2 candidate sequences and their supports into the CPU memory buffer and compute the frequent sequences of length 2;
Step 107: compute a vertical-format database from the input sequence database;
Step 108: copy the frequent sequences of length 1, the frequent sequences of length 2, and the vertical-format database into GPU device memory, and compute the frequent sequences of all remaining lengths in parallel on the GPU.
Compared with the prior art, the technical scheme above uses the GPU (Graphics Processing Unit) to accelerate, through parallel optimization, the computation of frequent sequences of length greater than 1 and the computation of candidate sequence supports through equivalence classes. The computation is carried out in GPU device memory mainly using CUDA (Compute Unified Device Architecture, the programming model for general-purpose GPU programming developed by NVIDIA), and is much faster than the existing CPU-based computation. This solves the slowness of frequent sequence algorithms in the prior art, improves the efficiency of sequence mining, and achieves fast sequence mining.
Brief description of the drawings
Fig. 1 is a flow chart of the GPU-based parallel sequential pattern mining method provided by the embodiment of the present invention.
Fig. 2 is a schematic diagram of the input sequence database in the embodiment of the present invention.
Fig. 3 is a schematic diagram of the vertical-format database in the embodiment of the present invention.
Embodiment
The invention is further described below in conjunction with a preferred embodiment.
As shown in Fig. 1, the GPU-based parallel sequential pattern mining method of this embodiment comprises the following steps:
Step 101: scan the records of the input sequence database into a memory buffer of the CPU.
This step can adopt the following concrete method: read the records of the input sequence database from the storage device into a CPU memory buffer block by block, where the capacity of the buffer is greater than a preset minimum buffer threshold and smaller than the maximum free memory of the system. When the records in the buffer have been fully processed, read the next block of data from the input sequence database on the storage device into the buffer and process it, until all records of the sequence database have been scanned. A more preferred concrete implementation is as follows:
The input sequence database is usually stored as a file on the system's hard disk, in any of several formats such as binary or text. When fetching records of the sequence database, they can specifically be read from the hard disk and, preferably, the sequence data read in is stored in system memory. The in-memory storage format of the input sequence data can be predefined as needed, so that subsequent steps can conveniently read from memory according to this predefined format.
Specifically, an interface of the sequence database can also be defined, including the size of the memory buffer opened for holding input sequence database records, a sequence-data pointer, the size of the current data block, the read/write position within the current data block, a flag indicating whether the data block has been fully read, and so on. A database reading function is defined against this interface; it opens the input sequence database file from the hard disk, fetches sequence database records, and stores them in memory in the defined format. For example, with a buffer size of 2048, the reading function reads 2048 integers' worth of data from the sequence database file on disk into memory; after the block that has been read in is fully processed, the reading function reads the next 2048 integers into memory, until the whole sequence database file has been read from disk.
Step 102: according to the records in the sequence database, compute the candidate sequences of length 1 and their supports.
As a preferred scheme, this step can adopt the following method: scan the sequence database records in a memory buffer; when an item occurs for the first time in a sequence database record and has not occurred in any earlier record, save the item as a candidate sequence of length 1 and increase its support by 1. If the item occurs multiple times within one record, save it and increase its support only on its first occurrence in that record, and ignore later occurrences in the same record; when the item occurs for the first time in any other sequence database record, increase its support by 1. A more preferred concrete method is as follows:
Open up two arrays, ItemCount and FrequentIndex, in memory. The size of ItemCount is the value of the maximum item that occurs in the sequence database, and it records the support of each item that occurs; the size of FrequentIndex is also the maximum item value, and it records the numbering of the frequent sequences of length 1.
Scan the sequence database records in a memory buffer. When an item occurs for the first time in the current record, add 1 to the corresponding position of ItemCount. If the item occurs multiple times in the record being processed, add 1 to the corresponding position of ItemCount only on the first occurrence and ignore subsequent occurrences within that record; likewise, for any other sequence database record, add 1 to the corresponding position only on the item's first occurrence in that record. After the input database file on disk has been scanned once in this manner, the final ItemCount array holds the support of every length-1 candidate sequence in the input sequence database.
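The once-per-record counting rule above can be sketched sequentially in Python; the record representation (a list of itemsets) and the function name are illustrative assumptions:

```python
def count_item_supports(records, max_item):
    """ItemCount[i] = number of records (sequences) containing item i,
    counting each item at most once per record, as in Step 102."""
    item_count = [0] * (max_item + 1)
    for record in records:                   # record: list of itemsets
        seen = set()
        for itemset in record:
            for item in itemset:
                if item not in seen:         # first occurrence in this record
                    seen.add(item)
                    item_count[item] += 1
    return item_count
```

Note that an item repeated within one sequence, such as item 2 in <(1,5,6) (2) (3,7) (2,4)>, still contributes only 1 to its support for that sequence.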
Step 103: according to the supports of the length-1 candidate sequences, compute the frequent sequences of length 1.
As a preferred scheme, this step can adopt the following method: compare the support of each length-1 candidate sequence with a preset minimum support threshold. If the support is greater than or equal to the threshold, save the candidate as a frequent sequence of length 1; if the support is below the threshold, save it as a non-frequent sequence of length 1. A more preferred concrete method is as follows:
Compare each element of the ItemCount array with the preset minimum support threshold; if it reaches the threshold, the corresponding item becomes a frequent sequence of length 1. Traverse the elements of ItemCount by index from small to large: for the first length-1 frequent sequence obtained, write 1 into the corresponding position of FrequentIndex; for the second, write 2; and so on. For a non-frequent length-1 sequence, write -1 into its position in FrequentIndex. FrequentIndex is used when computing the frequent sequences of length 2.
The preset minimum support threshold can be an absolute support or a relative support. In this embodiment it is the relative support 0.01: a candidate sequence becomes frequent when 1% of the input sequences contain it.
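Continuing the sketch, the FrequentIndex numbering can be built as follows; the threshold is handled here as an absolute count purely for illustration:

```python
def build_frequent_index(item_count, min_support):
    """FrequentIndex[i] = 1-based rank of item i among the frequent
    length-1 sequences (in index order), or -1 if item i is not frequent."""
    frequent_index = [-1] * len(item_count)
    rank = 0
    for item, support in enumerate(item_count):
        if support >= min_support:           # candidate reaches the threshold
            rank += 1
            frequent_index[item] = rank
    return frequent_index
```

The compact 1, 2, 3, ... numbering lets later steps index the SequenceInfo and ItemsetInfo matrices densely by frequent item rather than by raw item value.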
Step 104: scan the records of the input sequence database and compute the candidate-sequence information matrix and the candidate-event information matrix.
As a preferred scheme, this step can adopt the following method: read the records of the input sequence database into the CPU memory buffer and process each record in the buffer in turn with an iterative algorithm, obtaining the candidate-sequence information matrix and the candidate-event information matrix. Traverse the records in the buffer one by one: for each length-1 frequent sequence in a record, save the positions of its first and last occurrences into the candidate-sequence information matrix, and save the positions of the length-1 frequent sequences that occur after it into the candidate-event information matrix. A more preferred concrete method is as follows:
Candidate sequences of length 2 can be divided into two kinds: sequence type and event type. The two items of a sequence-type candidate belong to two different itemsets within one sequence record, while the two items of an event-type candidate belong to the same itemset within one sequence record. For example, suppose an input sequence record is <(1,5,6) (2) (3,7) (2,4)>, where the angle brackets denote a sequence, each pair of parentheses denotes an itemset, the numbers inside the parentheses are the items that the itemset contains, and the order of the itemsets represents temporal precedence. The candidate sequence <(1)(3)> is of sequence type, because itemsets (1) and (3) belong to two different itemsets of this record, (1,5,6) and (3,7), and therefore items 1 and 3 have a temporal precedence. The candidate sequence <(1,5)> is of event type, because itemsets (1) and (5) belong to the same itemset (1,5,6) of this record, and therefore items 1 and 5 occur simultaneously. Generating the sequence-type candidates requires the candidate-sequence information matrix; generating the event-type candidates requires the candidate-event information matrix. The details are as follows:
Candidate sequence information is stored in a matrix SequenceInfo. SequenceInfo is a two-dimensional matrix whose first dimension has size NumFrequent (the number of length-1 frequent sequences) and whose second dimension can be any integer greater than 2; in this embodiment it is preferably preset to 10. The first-dimension index of SequenceInfo identifies a length-1 frequent sequence, and the first two slots of the second dimension store, respectively, the itemset number of that sequence's first occurrence and the itemset number of its last occurrence in the current input sequence. For example, for the input record <(1,5,6) (2) (3,7) (2,4)> and the length-1 frequent sequence 2, the itemset number of its first occurrence is 2 and that of its last occurrence is 4, so SequenceInfo[2][0] is set to 2 and SequenceInfo[2][1] is set to 4. The whole SequenceInfo matrix is computed by scanning the input sequence database.
Candidate event information is stored in a matrix ItemsetInfo. ItemsetInfo is a one-dimensional matrix of size NumFrequent*(NumFrequent-1)/2, where NumFrequent is the number of length-1 frequent sequences; it records, for each length-1 frequent sequence, the other length-1 frequent sequences that occur after it. For example, for the input record <(1,5,6) (2) (3,7) (2,4)>, in which items 1, 2, 3, and 4 are length-1 frequent sequences, the frequent sequences 2, 3, and 4 occur after the frequent sequence 1; therefore the 2nd, 3rd, and 4th positions of the storage area of ItemsetInfo corresponding to frequent sequence 1 are set to 1, indicating that frequent items 2, 3, and 4 appear after frequent item 1.
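For a single record, the first/last-occurrence bookkeeping behind SequenceInfo can be sketched as plain Python; the real matrix lives in GPU-bound buffers with the fixed layout described above, so this list-of-pairs form is an illustrative simplification:

```python
def build_sequence_info(record, num_items):
    """seq_info[item] = [first_itemset_no, last_itemset_no] for each item
    in one record, with itemsets numbered from 1; [-1, -1] if absent."""
    seq_info = [[-1, -1] for _ in range(num_items + 1)]
    for pos, itemset in enumerate(record, start=1):
        for item in itemset:
            if seq_info[item][0] == -1:
                seq_info[item][0] = pos      # first occurrence
            seq_info[item][1] = pos          # last occurrence so far
    return seq_info
```

For the example record <(1,5,6) (2) (3,7) (2,4)>, item 2 gets first occurrence 2 and last occurrence 4, matching the SequenceInfo[2][0] and SequenceInfo[2][1] values above.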
Step 105: copy the candidate-sequence information matrix and the candidate-event information matrix into GPU device memory, use CUDA to compute in parallel on the GPU the candidate sequences of length 2 and their supports, and keep the result in GPU device memory. All processing before this step is carried out in host memory, and all data computed so far are likewise kept in host memory; the computation in this step is completed in the global memory of the GPU.
As a preferred scheme, this step can adopt the following method: allocate storage in the global memory of the GPU, and copy the candidate-sequence information matrix and the candidate-event information matrix from host memory into it. For each length-1 frequent sequence, open an independent thread on the GPU to compute the length-2 candidate sequences having that length-1 frequent sequence as prefix, together with their supports, and keep the length-2 candidates and their supports in GPU global memory. A more preferred concrete method is as follows:
Storage is allocated in GPU global memory using functions provided by CUDA, and the candidate-sequence information matrix and candidate-event information matrix are copied from host memory into it. In this embodiment, the CUDA memory allocation function cudaMalloc() is used to allocate the storage, and the memory-copy API function cudaMemcpy() is used to copy the above data from host memory into GPU global memory.
On the GPU, the candidate sequences of length 2 and their supports are computed in parallel from the candidate-sequence information matrix and the candidate-event information matrix. Using the kernel mechanism provided by CUDA, a kernel function is launched on the GPU; it starts many threads that execute in parallel, each thread being responsible for computing the group of candidate 2-sequences prefixed by one length-1 frequent sequence. In this embodiment, the CUDA built-in variables threadIdx and blockIdx determine the index number of each GPU thread, which is matched against the data in the two matrices to compute the length-2 candidates and their supports. For example, thread 1 computes the group of candidate 2-sequences prefixed by the first length-1 frequent sequence: by reading, from the candidate-sequence information matrix, the first- and last-occurrence itemset numbers of the first length-1 frequent sequence and of all the other length-1 frequent sequences, it obtains the candidate sequences of length 2. If the itemset number of the first occurrence of the first length-1 frequent sequence is smaller than the itemset number of the last occurrence of another length-1 frequent sequence, the two together form a candidate sequence of length 2. In addition, during its computation, thread 1 can launch a further kernel function that processes in parallel the data block of the candidate-event information matrix corresponding to the first length-1 frequent sequence, finding the event-type candidate sequences of length 2. In this embodiment, the Dynamic Parallelism feature provided by CUDA is used to launch the secondary kernel function from within the main kernel function.
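The per-thread ordering test above (the prefix's first occurrence precedes the other item's last occurrence) can be emulated sequentially; this only illustrates the rule for sequence-type candidates over one record, not the CUDA kernels themselves, and the function name and record encoding are assumptions:

```python
def candidate_2_pairs(record, frequent_items):
    """Return pairs (a, b) meaning the sequence-type candidate <(a)(b)>
    is supported by this record: a's first occurrence strictly precedes
    b's last occurrence, per the Step 105 rule."""
    first, last = {}, {}
    for pos, itemset in enumerate(record, start=1):
        for item in itemset:
            first.setdefault(item, pos)      # keep earliest position
            last[item] = pos                 # keep latest position
    pairs = []
    for a in frequent_items:
        if a not in first:
            continue
        for b in frequent_items:
            if b != a and b in last and last[b] > first[a]:
                pairs.append((a, b))
    return pairs
```

For the record <(1,5,6) (2) (3,7) (2,4)> with frequent items 1-4, this yields <(1)(2)>, <(1)(3)>, <(1)(4)>, <(2)(3)>, <(2)(4)>, <(3)(2)>, and <(3)(4)>; each GPU thread would produce the pairs sharing one fixed prefix a.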
Step 106: copy the length-2 candidate sequences and their supports into the CPU memory buffer and compute the frequent sequences of length 2.
As a preferred scheme, this step can adopt the following method: copy the length-2 candidate sequences and their supports from GPU global memory into the CPU memory buffer, and use the CPU to compute the frequent sequences of length 2. Compare the support of each length-2 candidate with the preset minimum support threshold: if the support is greater than or equal to the threshold, save the candidate as a frequent sequence of length 2; otherwise save it as a non-frequent sequence of length 2. A more preferred concrete implementation is as follows:
This step is completed in host memory. The length-2 candidate sequences and their supports in GPU global memory can first be copied into host memory, and the frequent sequences of length 2 computed from these data. The detailed process is as follows: the CUDA API function cudaMemcpy() copies the length-2 candidates and their supports into host memory; the support of each candidate is then compared with the preset support threshold. If the support reaches the threshold, the length-2 candidate is frequent and is kept in memory as a frequent sequence of length 2; otherwise the length-2 candidate is non-frequent. In this embodiment the preset support threshold is the relative support 0.01: a candidate is frequent when 1% of all sequence records contain it.
Step 107: compute a vertical-format database from the input sequence database.
As a preferred scheme, this step can adopt the following method: scan a record of the input sequence database; if the record contains a length-1 frequent sequence, save the sequence number of the record together with the itemset numbers at which that frequent sequence occurs in the record. Iteratively scan all records of the input sequence database in this way, computing the sequence numbers and itemset numbers at which every length-1 frequent sequence occurs, and save the length-1 frequent sequences with these sequence numbers and itemset numbers as the vertical-format database of the input sequence database. A more preferred concrete method is as follows:
As shown in Figs. 2-3, this operation is carried out in host memory. A two-dimensional array VerticalDatabase is opened to store the equivalence classes after conversion. Its first dimension has size equal to the number of length-1 frequent sequences, and its second dimension must be large enough to hold all sequence numbers containing a given length-1 frequent sequence; in this embodiment it is preset to the total number of sequences in the input sequence database. Each sequence database record is read in; the record's sequence number is added to the equivalence class of every length-1 frequent sequence that the record contains, and the itemset numbers at which each such frequent sequence occurs in the record are also added to the equivalence class. For example, suppose the preset minimum support threshold for the input sequence database of this embodiment is 2, and the length-1 frequent sequences are found to be 1, 2, and 5, with supports 3, 2, and 3 respectively. When the record (1,2,6) with sequence number 1 and itemset number 1 is read in, it contains the length-1 frequent sequences 1 and 2, so sequence number 1 and itemset number 1 are added to the equivalence classes of frequent sequences 1 and 2 in VerticalDatabase. When the record (1,7) with sequence number 1 and itemset number 2 is read in, it contains the length-1 frequent sequence 1, so sequence number 1 and itemset number 2 are added to the equivalence class of frequent sequence 1. Iterating over all records of the input sequence database in this manner computes the equivalence class corresponding to each frequent sequence.
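The conversion to the vertical format can be sketched as follows; an equivalence class is represented here as a list of (sequence number, itemset number) pairs, an assumed simplification of the fixed-size VerticalDatabase array:

```python
def build_vertical_db(database, frequent_items):
    """database: {seq_no: [(itemset_no, itemset), ...]}.
    Returns {item: [(seq_no, itemset_no), ...]} — the equivalence class
    of each length-1 frequent sequence, as in Step 107."""
    vertical = {item: [] for item in frequent_items}
    for seq_no, record in database.items():
        for itemset_no, itemset in record:
            for item in itemset:
                if item in vertical:          # only frequent items recorded
                    vertical[item].append((seq_no, itemset_no))
    return vertical
```

This mirrors the worked example: itemset (1,2,6) at (sequence 1, itemset 1) contributes (1,1) to the classes of items 1 and 2, and itemset (1,7) at (sequence 1, itemset 2) contributes (1,2) to the class of item 1.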
Step 108: the frequent sequence that is 1 by described length, frequent sequence that length is 2 and described vertical format database copy are in the video memory of GPU, and parallel computation obtains the frequent sequence of all the other length in GPU.
In a preferred scheme, this step proceeds as follows: allocate storage space in the global memory of the GPU and copy the length-1 frequent sequences, the length-2 frequent sequences and the vertical format database into it. Launch multiple threads on the GPU, each responsible for computing all length-3 frequent sequences prefixed by one length-1 frequent sequence; once the length-3 frequent sequences are obtained, launch one thread per length-2 frequent sequence, each responsible for computing all length-4 frequent sequences with that prefix, and iterate until the frequent sequences of every length have been obtained. When computing the support of a length-(k+1) candidate sequence, split into blocks the equivalence-class records of the two length-k frequent sequences that generate the candidate, and launch multiple GPU threads to process the record blocks in parallel, yielding the support of the length-(k+1) candidate sequence.
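The record-blocking scheme just described — splitting the generating equivalence-class records into blocks, matching each block independently (one GPU thread per block in the patent), then merging the partial results by a reduction — can be sketched sequentially in Python. The block size and function names are assumptions for illustration only.

```python
def blocked_support(idlist_a, idlist_b, block_size=2):
    """Support of the candidate generated from two equivalence-class
    id-lists of (sequence number, itemset number) pairs, block by block."""
    lookup = set(idlist_b)
    blocks = [idlist_a[i:i + block_size]
              for i in range(0, len(idlist_a), block_size)]
    # Each block is matched independently, as one GPU thread would do.
    partial = [[pair for pair in block if pair in lookup] for block in blocks]
    # Reduction step: merge the per-block results into one id-list.
    merged = [pair for part in partial for pair in part]
    # The support is the number of distinct sequence numbers.
    return len({seq_id for seq_id, _ in merged})

print(blocked_support([(1, 1), (1, 2), (2, 1), (3, 2)], [(1, 2), (2, 1)]))  # 2
```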
Storage space is allocated in the global memory of the GPU with functions provided by CUDA, and the length-1 frequent sequences, the length-2 frequent sequences and the vertical format database are copied into it. In this embodiment, the global-memory allocation function cudaMalloc() provided by CUDA is used to allocate the storage space, and the memory-copy function cudaMemcpy() is used to copy the above data from main memory into the global memory of the GPU.
Computing the frequent sequences of the remaining lengths decomposes into subprocesses, one per length-1 frequent sequence, each mining all frequent sequences with that prefix. The subprocesses are mutually independent: every frequent sequence computed inside a subprocess is generated only from other frequent sequences of the same subprocess and never needs frequent-sequence information from another subprocess. Because there are no dependences between them, the subprocesses — one per length-1 frequent sequence prefix — can execute in parallel. In this embodiment, CUDA is used to launch a kernel function on the GPU; the kernel creates as many parallel threads as there are length-1 frequent sequences, the index of each thread is determined with the CUDA built-in variables threadIdx and blockIdx and associated with the corresponding subprocess, and each thread executes one subprocess. While running, a subprocess may dynamically launch additional kernel functions to decompose itself into smaller subprocesses, increasing the parallelism of the computation and exploiting the parallel processing capability of the GPU as fully as possible; the dynamic parallelism facility provided by CUDA can be used to launch these extra kernels from within a subprocess.
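The independence of the per-prefix subprocesses can be illustrated with a CPU analogue: one task per length-1 frequent sequence, run concurrently, each extending only its own prefix. The thread pool stands in for the GPU threads of the kernel, and the single sequence-extension step shown is a simplified stand-in for the full iterative mining; all names here are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def support(idlist):
    """Support = number of distinct sequence numbers in an id-list."""
    return len({seq_id for seq_id, _ in idlist})

def mine_prefix(item, vertical_db, min_support):
    """Subprocess for one length-1 frequent sequence: find the frequent
    2-sequences <(item)(other)> by a sequence-extension of id-lists."""
    earliest = {}  # earliest itemset number of `item` per sequence
    for seq_id, itemset_id in vertical_db[item]:
        earliest[seq_id] = min(itemset_id, earliest.get(seq_id, itemset_id))
    frequent = []
    for other, idlist in sorted(vertical_db.items()):
        joined = [(s, e) for s, e in idlist if s in earliest and e > earliest[s]]
        if support(joined) >= min_support:
            frequent.append(((item, other), support(joined)))
    return frequent

def mine_all(vertical_db, min_support):
    # Each prefix is an independent task, mirroring one GPU thread each.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(mine_prefix, item, vertical_db, min_support)
                   for item in vertical_db]
    return [seq for fut in futures for seq in fut.result()]

vdb = {1: [(1, 1), (2, 1)], 2: [(1, 2), (2, 2)]}
print(mine_all(vdb, min_support=2))  # [((1, 2), 2)]
```

No task reads another task's output, which is what lets the patent map each prefix to its own GPU thread.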
The support of a candidate sequence is computed by a join of the vertical format databases of its two generating subsequences. For example, in this embodiment, to compute the support of the length-2 candidate sequence (1, 2), items 1 and 2 must occur in the same itemset of a sequence; the equivalence-class records of item 1 are searched for entries whose sequence number and itemset number also appear in the equivalence-class records of item 2, the matching entries form the vertical format database of the candidate (1, 2), and the number of distinct sequence numbers in that database is the candidate's support. The join of the two vertical format databases can execute in parallel on the GPU: the sequence numbers in the equivalence-class records are divided into sub-parts at a fixed stride, each thread handles one sub-part, and when all threads finish, their results are merged by a reduction into the complete vertical format database.
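The equality join used above for the case where both items must occur in the same itemset, together with the support count over distinct sequence numbers, can be sketched as follows; this is a hedged CPU sketch of the join with assumed names, not the GPU kernel.

```python
def itemset_join(idlist_a, idlist_b):
    """Keep entries whose (sequence number, itemset number) appear in both
    equivalence classes: the two items occur in the same itemset."""
    in_b = set(idlist_b)
    return [pair for pair in idlist_a if pair in in_b]

def support(idlist):
    """Support = number of distinct sequence numbers in the joined id-list."""
    return len({seq_id for seq_id, _ in idlist})

# Toy id-lists: items 1 and 2 co-occur only in itemset 1 of sequence 1.
item1 = [(1, 1), (1, 2), (2, 1)]
item2 = [(1, 1), (2, 3)]
candidate = itemset_join(item1, item2)
print(candidate, support(candidate))  # [(1, 1)] 1
```

The join's output is itself a vertical format database, so the result can be joined again to extend the candidate further.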
The frequent sequences of the remaining lengths computed in parallel on the GPU are copied back into main memory, merged with the length-1 and length-2 frequent sequences into a complete frequent-sequence file, and written out to disk.
In the method provided by this embodiment of the invention, the GPU is used to accelerate, in parallel, both the computation of frequent sequences of length greater than 1 and the computation of candidate-sequence supports from equivalence classes. The computation is carried out mainly in the video memory of the GPU using CUDA and is much faster than existing CPU-based computation, which solves the problem that existing frequent-sequence algorithms are slow, improves the efficiency of time-series mining, and achieves fast time-series mining. A GPU combines high performance with flexibility and is in effect an ultra-large-scale parallel processor, with powerful floating-point capability and very high memory bandwidth; its massive parallelism brings high performance, its programmability keeps improving, and it is inexpensive, so the technical scheme provided by this example is practical, easy to use and flexible.
The above further describes the present invention with reference to specific preferred embodiments, but the specific implementation of the invention is not limited to these descriptions. Those skilled in the art may, without departing from the concept of the invention, make equivalent substitutions or obvious modifications of identical performance or purpose, and all such variants shall be deemed to fall within the protection scope of the present invention.

Claims (9)

1. A GPU-based parallel time sequence mining method, characterized by comprising the following steps:
Step 101: scan the records of an input sequence database into a memory buffer of a CPU;
Step 102: from the records of the sequence database, compute the candidate sequences of length 1 and their supports;
Step 103: from the supports of the length-1 candidate sequences, compute the frequent sequences of length 1;
Step 104: scan the records of the input sequence database and compute a candidate sequence information matrix and a candidate event information matrix;
Step 105: copy the candidate sequence information matrix and the candidate event information matrix into the video memory of a GPU, compute the candidate sequences of length 2 and their supports in parallel on the GPU using CUDA, and keep the results in the video memory of the GPU;
Step 106: copy the length-2 candidate sequences and their supports into the memory buffer of the CPU and compute the frequent sequences of length 2;
Step 107: compute a vertical format database from the input sequence database;
Step 108: copy the length-1 frequent sequences, the length-2 frequent sequences and the vertical format database into the video memory of the GPU, and compute the frequent sequences of the remaining lengths in parallel on the GPU.
2. The method according to claim 1, characterized in that step 101 comprises:
reading the records of the input sequence database from a storage device into the memory buffer of the CPU block by block, the capacity of the memory buffer being greater than a preset minimum buffer threshold and smaller than the maximum free memory of the system; when the records in the memory buffer have been fully processed, reading the next block of data from the input sequence database on the storage device into the memory buffer for processing, until all records of the sequence database have been scanned.
3. The method according to claim 2, characterized in that step 102 comprises:
scanning the sequence database records in the memory buffer; when an item occurs for the first time in a sequence database record and has not occurred in any earlier record, saving the item as a candidate sequence of length 1 and increasing its support by 1;
if the item occurs several times within the same sequence database record, saving it and increasing its support only on its first occurrence and ignoring the repeated occurrences within that record; on its first occurrence in any other sequence database record, increasing its support by 1.
4. The method according to claim 1, characterized in that step 103 comprises:
comparing the support of each length-1 candidate sequence with a preset minimum support threshold; if the support is greater than or equal to the minimum support threshold, saving the candidate as a frequent sequence of length 1, and if the support is less than the minimum support threshold, saving the candidate as a non-frequent sequence of length 1.
5. The method according to claim 1, characterized in that step 104 comprises:
reading the records of the input sequence database into the memory buffer of the CPU, and processing the records in the memory buffer one after another with an iterative algorithm to obtain the candidate sequence information matrix and the candidate event information matrix;
traversing the records in the memory buffer one by one, saving in the candidate sequence information matrix the first and last positions at which each length-1 frequent sequence occurs in a record, and saving in the candidate event information matrix the positions of the length-1 frequent sequences that occur after each length-1 frequent sequence in the record.
6. The method according to claim 1, characterized in that step 105 comprises:
allocating storage space in the global memory of the GPU and copying the candidate sequence information matrix and the candidate event information matrix from main memory into it; for each length-1 frequent sequence, launching an independent thread on the GPU that computes the length-2 candidate sequences prefixed by that length-1 frequent sequence and their supports, and keeping the length-2 candidate sequences and their supports in the global memory of the GPU.
7. The method according to claim 1, characterized in that step 106 comprises:
copying the length-2 candidate sequences and their supports from the global memory of the GPU into the memory buffer of the CPU, and computing the frequent sequences of length 2 on the CPU: comparing the support of each length-2 candidate sequence with a preset minimum support threshold, saving the candidate as a frequent sequence of length 2 if its support is greater than or equal to the minimum support threshold, and saving the candidate as a non-frequent sequence of length 2 if its support is less than the minimum support threshold.
8. The method according to claim 1, characterized in that step 107 comprises:
scanning a record of the input sequence database and, if the record contains a length-1 frequent sequence, saving the sequence number of the record and the itemset number at which that length-1 frequent sequence occurs in the record;
iteratively scanning all records of the input sequence database to compute the sequence numbers and itemset numbers at which every length-1 frequent sequence occurs, and saving the length-1 frequent sequences together with these sequence numbers and itemset numbers as the vertical format database of the input sequence database.
9. The method according to claim 1, characterized in that step 108 comprises:
allocating storage space in the global memory of the GPU and copying the length-1 frequent sequences, the length-2 frequent sequences and the vertical format database into it;
launching multiple threads on the GPU, each thread computing all length-3 frequent sequences prefixed by one length-1 frequent sequence; after the length-3 frequent sequences are obtained, launching as many threads as there are length-2 frequent sequences, each thread computing all length-4 frequent sequences prefixed by one length-2 frequent sequence, and iterating until the frequent sequences of every length are obtained;
when computing the support of a length-(k+1) candidate sequence, splitting into blocks the equivalence-class records of the two length-k frequent sequences that generate the candidate, and launching multiple GPU threads to process the record blocks in parallel to obtain the support of the length-(k+1) candidate sequence.
CN201410172991.1A 2014-04-25 2014-04-25 Parallel time sequence mining method based on GPU Active CN103995690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410172991.1A CN103995690B (en) Parallel time sequence mining method based on GPU

Publications (2)

Publication Number Publication Date
CN103995690A true CN103995690A (en) 2014-08-20
CN103995690B CN103995690B (en) 2016-08-17



