CN103577160A

CN103577160A - Characteristic extraction parallel-processing method for big data

Info

Publication number: CN103577160A
Application number: CN201310487250.8A
Authority: CN
Inventors: 刘镇; 焦弘杰; 吕超; 钱萍
Original assignee: Jiangsu University of Science and Technology
Current assignee: Jiangsu University of Science and Technology
Priority date: 2013-10-17
Filing date: 2013-10-17
Publication date: 2014-02-12

Abstract

The invention discloses a parallel-processing feature extraction method for big data. Built on CUDA (Compute Unified Device Architecture), the method uses the parallel computing capability of a GPU (Graphics Processing Unit) to process big data. During processing, a parallelizable matrix-array method is used so that multiple threads operate on the data concurrently, which greatly increases the speed of feature extraction. In this matrix-array method, each character of the task data is matched in turn against each character of the feature data to form a "01" array, and the "01" array is then processed in parallel according to the length of the feature data to obtain the correct matching result. The method exploits the properties of the matrix array, parallelizes well, allows the data processing to be fully and effectively parallelized, and is particularly suitable for fast feature extraction from big data.

Description

A parallel-processing feature extraction method for big data

Technical field

The invention belongs to the technical field of big data processing and relates to a feature extraction method, and more specifically to a parallel-processing feature extraction method for big data.

Technical background

With the arrival of the big data era, how to process big data quickly and extract useful information from it has become a frontier research topic in the IT industry. "Big data" refers to data sets that are especially large in volume, varied in data type, and in need of sufficiently fast processing; the content of such data sets cannot be extracted and managed with traditional database tools.

According to a search of the existing patent literature, current approaches to processing big data mainly include increasing the number of CPU cores, building distributed cluster systems, and optimizing parallel algorithms. However, these methods all rely solely on the computing capability of the CPU, and they are further constrained by the limited number of CPU cores and the relatively high cost of building a distributed cluster system, so the methods and capabilities for processing big data still need further innovation and improvement.

At present, feature extraction technology is used more and more widely in image processing, pattern recognition, network intrusion detection and similar fields. In big data environments in particular, the efficiency of feature extraction has become the bottleneck that limits fast data processing.

Summary of the invention

The object of the invention is to address the current situation in which, in a big data environment, a conventional computer relies mainly on the CPU to perform feature extraction serially, by proposing a parallel-processing feature extraction method for big data that makes feature extraction faster and gives the computer greater processing capability.

To achieve this object, the technical solution of the invention is a parallel-processing feature extraction method for big data. When big data is processed within the range the hardware can handle, the method builds a matrix array that can be operated on in parallel from the task data to be processed and the feature data. By processing this array in parallel, multiple threads perform feature matching on the data concurrently, the data that match the feature are extracted, and the number of successful extractions is counted.

In the above technical solution, the parallel processing is based on the CUDA architecture and is implemented using the parallel computing capability of the GPU.

The task data must be transferred from the CPU to the storage of the GPU so that the GPU can perform the parallel computation.

For the parallel computation in a big data environment, the speed at which features are extracted in real time from the data in the buffer is greater than or equal to the transmission rate of the data stream, and the concurrency width of the feature extraction is adjusted adaptively according to that transmission rate, so that the concurrency of dynamic data-stream processing remains controllable.
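The text does not say how the concurrency width is computed. One way to read the requirement (extraction speed at least equal to the stream rate) is sketched below in C++. The helper `requiredWidth`, its parameters, and the assumption that throughput grows linearly with the number of concurrent workers are all hypothetical and are not taken from the patent.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: smallest number of concurrent workers whose combined
// throughput is at least the stream's arrival rate, clamped to [1, maxWidth].
// Assumes throughput scales linearly with the number of workers.
// Returns 0 when the inputs make the question meaningless.
std::uint64_t requiredWidth(std::uint64_t streamRate,     // e.g. bytes per second arriving
                            std::uint64_t perWorkerRate,  // e.g. bytes per second per worker
                            std::uint64_t maxWidth) {     // hardware limit on concurrency
    if (perWorkerRate == 0 || maxWidth == 0) return 0;
    // Ceiling division: round up so the combined rate is not below streamRate.
    const std::uint64_t needed =
        streamRate / perWorkerRate + (streamRate % perWorkerRate != 0 ? 1 : 0);
    // If needed exceeds maxWidth, the hardware cannot keep up with the stream;
    // this sketch simply returns the limit in that case.
    return std::min(std::max<std::uint64_t>(needed, 1), maxWidth);
}
```

Under these assumptions, a stream arriving at 1000 units per second with workers that each handle 300 units per second would need a width of 4.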

In the above method, taking the hardware characteristics of the GPU into account and staying within its processing capability, the matching algorithm processes the data with a parallelizable matrix array in the following two steps, both of which are executed in parallel.

Step 1: match each character of the task data against each character of the feature data in turn, in parallel, to form a valid matrix array.

Step 2: according to the length of the feature data, process this array in parallel to obtain the correct matching result, that is, the number of successful feature matches.

In the feature extraction process, to reduce the number of times the feature data must be read repeatedly while the program runs and thus further increase the computing speed, the feature data key is stored in constant memory; the feature data must therefore be transferred from the CPU to the constant memory of the GPU. Access to constant memory is read-only. After the feature data has been read from an address in constant memory for the first time, other threads that request the same address read the feature data directly from the cache, which saves time.

Step 1 is carried out as follows. Given the task data length STRLEN and the feature data length KEYLEN, each character of the task data is matched in turn, in parallel, against each character of the feature data to form a KEYLEN*STRLEN "01" matrix array. Row i of the matrix array is obtained by comparing each character of the task data with the i-th character of the feature data: identical characters are recorded as "1" and different characters as "0".
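The construction of the "01" matrix array can be sketched on the CPU as follows. This is a minimal serial C++ illustration, not the CUDA kernel of the invention: the function name `buildMatchMatrix` and the flat row-major layout are assumptions made for this example, and each loop iteration stands in for work that one GPU thread could perform independently.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds the KEYLEN x STRLEN "01" matrix in row-major order: entry (i, j) is 1
// when the i-th feature character equals the j-th task character, otherwise 0.
// Name and layout are assumptions for illustration.
std::vector<std::uint8_t> buildMatchMatrix(const std::string& task,
                                           const std::string& key) {
    const std::size_t strLen = task.size();  // STRLEN
    const std::size_t keyLen = key.size();   // KEYLEN
    std::vector<std::uint8_t> m(keyLen * strLen, 0);
    // Every (i, j) cell depends only on key[i] and task[j], so the cells
    // can be computed in any order, or all at once by separate threads.
    for (std::size_t i = 0; i < keyLen; ++i) {
        for (std::size_t j = 0; j < strLen; ++j) {
            m[i * strLen + j] = (key[i] == task[j]) ? 1 : 0;
        }
    }
    return m;
}
```

For the task data "abcab" and the feature data "ab", row 0 is 1 0 0 1 0 and row 1 is 0 1 0 0 1.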

Step 2 is carried out as follows. According to the feature data length KEYLEN, the (STRLEN-KEYLEN+1) small arrays of size KEYLEN*KEYLEN are processed in parallel, and each is tested for whether the values on its diagonal are all "1". The first value on the diagonal of the small array is tested first. If it is not "1" (that is, it is "0"), the remaining values need not be tested and processing moves directly to the next small array. If it is "1", the next value on the diagonal is tested, and so on; when every value on the diagonal is "1", one successful feature extraction has occurred and one successful match is recorded.
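The diagonal test can be sketched in serial C++ as shown below. The sketch assumes that the small arrays are the consecutive column windows of the KEYLEN*STRLEN matrix, which the count (STRLEN-KEYLEN+1) suggests; the function name `countDiagonalMatches` and the row-major layout are likewise assumptions. In a GPU version, each iteration of the outer loop could be handled by its own thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts the KEYLEN x KEYLEN small arrays whose main diagonal is all 1s.
// `m` is a row-major KEYLEN x STRLEN "01" matrix; small array s is assumed
// to cover columns s .. s+KEYLEN-1. Returns 0 for inconsistent sizes.
std::size_t countDiagonalMatches(const std::vector<std::uint8_t>& m,
                                 std::size_t keyLen, std::size_t strLen) {
    if (keyLen == 0 || strLen < keyLen || m.size() != keyLen * strLen) return 0;
    std::size_t sum = 0;  // number of successful matches
    for (std::size_t s = 0; s + keyLen <= strLen; ++s) {
        bool allOnes = true;
        for (std::size_t i = 0; i < keyLen; ++i) {
            // Diagonal element i of small array s is matrix entry (i, s + i).
            if (m[i * strLen + s + i] != 1) {
                allOnes = false;  // first 0 found: skip the rest of this diagonal
                break;
            }
        }
        if (allOnes) ++sum;
    }
    return sum;
}
```

With the matrix for task data "abcab" and feature data "ab" (rows 1 0 0 1 0 and 0 1 0 0 1), the small arrays starting at columns 0 and 3 have all-1 diagonals, so the count is 2.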

Brief description of the drawings

Figure 1 is a flowchart of the feature extraction algorithm for a big data environment according to the invention.

Figure 2 is a flowchart of an embodiment of the feature extraction algorithm for a big data environment according to the invention.

Figure 3 is a schematic diagram of matching the characters of the task data against the feature data according to the invention.

Figure 4 is a schematic diagram of processing the "01" matrix array in parallel by dividing it into small arrays according to the invention.

Figure 5 is a flowchart of the algorithm for processing the matrix array in parallel according to the invention.

Detailed description of the embodiments

The invention is described in further detail below with reference to the accompanying drawings.

1. The overall flow of the parallel-processing feature extraction method for big data according to the invention is as follows: when big data is processed within the processing capability of the hardware, a matrix array that can be operated on in parallel is built from the task data to be processed and the feature data; by processing this array in parallel, multiple threads perform feature matching on the data concurrently, the data that match the feature are extracted, and the number of correct feature extractions is counted (see Figure 1).

2. The known feature data and the task data are each transferred from the CPU to the memory space of the GPU: the feature data is stored in constant memory (Constant Memory), and the task data is stored in global memory (Global Memory) (see Figure 2).

3. The GPU kernel function (kernel) is called under the CUDA architecture to perform the parallel computation. The detailed process is as follows:

(1) Each character of the task data is matched in turn, in parallel, against each character of the feature data to form a valid matrix array. Given the task data length STRLEN and the feature data length KEYLEN, the result is a KEYLEN*STRLEN "01" matrix array. Row i of the matrix array is obtained by comparing each character of the task data with the i-th character of the feature data: identical characters are recorded as "1" and different characters as "0" (see Figure 3).

(2) According to the feature data length KEYLEN, the (STRLEN-KEYLEN+1) small arrays of size KEYLEN*KEYLEN (see Figure 4) are processed in parallel, and each is tested for whether the values on its diagonal are all "1". The test proceeds as follows (see Figure 5):

① Take the first value on the diagonal of the small array.

② Test whether this value is "1". If it is, go to step ③; otherwise go to step ⑥.

③ Test whether this value is the last one on the diagonal of the small array. If it is, go to step ⑤; otherwise go to step ④.

④ Take the next value on the diagonal and go to step ②.

⑤ The match succeeds, and the counting variable sum is incremented by 1.

⑥ Move on to testing the next small array.

By exploiting the behavior of the compiler, the processing steps above can be completed with a single conditional statement, which saves time and improves the speedup. For example, if the feature data has 3 characters and the three corresponding diagonal values are denoted a[0], a[1] and a[2], it is only necessary to test the condition ((a[0]==1) && (a[1]==1) && (a[2]==1)) to determine whether the match succeeds.
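The single-condition form can be written out in C++ as below. In C and C++ the `&&` operator short-circuits, so evaluation stops at the first value that is not 1, which corresponds to the early exit in the step-by-step test. The function `matches3` follows the three-character example in the text; `matchesAll` is a hypothetical generalization that is not part of the patent and swaps the short-circuit for a branch-free bitwise AND over the diagonal values.

```cpp
#include <cstddef>
#include <cstdint>

// Diagonal test for a 3-character feature written as one condition.
// a[0], a[1], a[2] are the three diagonal values of one small array.
bool matches3(const std::uint8_t a[3]) {
    // && short-circuits: a[1] is only examined if a[0] == 1, and so on.
    return (a[0] == 1) && (a[1] == 1) && (a[2] == 1);
}

// Hypothetical generalization to any length: AND all diagonal values together
// with no branch inside the loop. Assumes every value is exactly 0 or 1.
// An empty diagonal (keyLen == 0) is treated as a match.
bool matchesAll(const std::uint8_t* a, std::size_t keyLen) {
    std::uint8_t acc = 1;
    for (std::size_t i = 0; i < keyLen; ++i) {
        acc = static_cast<std::uint8_t>(acc & a[i]);
    }
    return acc == 1;
}
```

Whether the short-circuit form or the branch-free form is faster on a given GPU is not stated in the source and would need to be measured.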

4. The threads in the kernel are synchronized to ensure that all parallel computation on the GPU has completed, and the result obtained by the GPU is then transferred back to host memory (see Figure 2).

5. The memory space allocated on the GPU for the task data and the feature data is released, and the result of the feature extraction is displayed on the host (see Figure 2).

Claims

1. A parallel-processing feature extraction method for big data, characterized in that, within the processing capability of the hardware, the method comprises the following steps:

Step 1: allocate storage space on the GPU for the task data and the feature data;

Step 2: when processing big data, build in parallel a matrix array with good parallelism from the task data to be processed and the feature data;

Step 3: by processing the matrix array in parallel, perform feature matching on the data with multiple threads executing concurrently;

Step 4: extract the data that match the feature, and count the number of successful extractions.

2. The parallel-processing feature extraction method for big data according to claim 1, characterized in that the parallel processing of the matrix array is based on the CUDA architecture and is implemented using the parallel computing capability of the GPU.

3. The parallel-processing feature extraction method for big data according to claim 1, characterized in that the task data must be transferred from the CPU to the storage of the GPU so that the GPU can perform the parallel computation.

4. The parallel-processing feature extraction method for big data according to claim 1, characterized in that, in extracting the data that match the feature in a big data environment, the speed at which features are extracted in real time from the data in the buffer is greater than or equal to the transmission rate of the data stream, and the concurrency width of the feature extraction is adjusted adaptively according to that transmission rate, so that the concurrency of dynamic data-stream processing remains controllable.

5. The parallel-processing feature extraction method for big data according to claim 1, characterized in that, in performing feature matching on the data with multiple threads executing concurrently, taking the hardware characteristics of the GPU into account and staying within its processing capability, the matching algorithm processes the data with a parallelizable matrix array in the following two steps, both of which are executed in parallel:

Step 1: match each character of the task data against each character of the feature data in turn, in parallel, to form a valid matrix array;

Step 2: according to the length of the feature data, process this array in parallel to obtain the correct matching result, that is, the number of successful feature matches.

6. The parallel-processing feature extraction method for big data according to claims 1 and 5, characterized in that the feature data must be transferred from the CPU to the constant memory of the GPU, and the feature data key is stored in constant memory; access to constant memory is read-only; after the feature data has been read from an address in constant memory for the first time, other threads that request the same address read the feature data directly from the cache.

7. The parallel-processing feature extraction method for big data according to claim 5, characterized in that the parallel matching of step 1 is as follows: given the task data length STRLEN and the feature data length KEYLEN, each character of the task data is matched in turn, in parallel, against each character of the feature data to form a KEYLEN*STRLEN "01" matrix array; row i of the matrix array is obtained by comparing each character of the task data with the i-th character of the feature data, identical characters being recorded as "1" and different characters as "0".

8. The parallel-processing feature extraction method for big data according to claim 5, characterized in that, in the parallel processing according to the feature data length KEYLEN, the (STRLEN-KEYLEN+1) small arrays of size KEYLEN*KEYLEN are processed in parallel, and each is tested for whether the values on its diagonal are all "1"; the first value on the diagonal of the small array is tested first; if it is not "1" (that is, it is "0"), the remaining values need not be tested and processing moves directly to the next small array; if it is "1", the next value on the diagonal is tested, and so on; when every value on the diagonal is "1", one successful feature extraction has occurred and one successful match is recorded.