CN103427844A - High-speed lossless data compression method based on GPU-CPU hybrid platform - Google Patents


Info

Publication number
CN103427844A
CN103427844A (application CN201310321071.7A); granted as CN103427844B
Authority
CN
China
Prior art keywords
thread
data
length
window
compression
Prior art date
Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Granted
Application number
CN2013103210717A
Other languages
Chinese (zh)
Other versions
CN103427844B (en)
Inventor
金海
郑然
周斌
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201310321071.7A
Publication of CN103427844A
Application granted
Publication of CN103427844B
Legal status: Active

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a high-speed lossless data compression method based on a GPU-CPU hybrid platform. The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU. A thread-block set bk[a] is configured on the GPU with b threads per block; the compression dictionary window length is set to c, with a head pointer p_dic_h to the first dictionary window; the pre-read window size is set to d, with a pointer p_pre_r to the first pre-read window, initialized to p_dic_h + c. A working thread set threads[a*b] is initialized, together with (a*b/2)/c result matrices gMatrix of size c*d. The first a*b/2 threads of threads[a*b] are called to process q = (a*b/2)/c data segments of length c+d in the file to be compressed; in each of the q result matrices gMatrix the diagonal segment with the most consecutive 1s is found, and a triple result array locations[p] is determined for each result matrix. The method can greatly increase the compression speed of massive data.

Description

A high-speed lossless data compression method based on a GPU-CPU hybrid platform
Technical field
The invention belongs to the field of computer data compression, and more particularly relates to a high-speed lossless data compression method based on a GPU-CPU hybrid platform.
Background technology
Research by International Data Corporation (IDC) shows that over the last decade the total amount of global information has doubled every two years. In 2011 the total amount of data created and replicated worldwide was 1.8 ZB (1800 EB); it was projected to reach 8 ZB (8000 EB) by 2015, and by around 2020 the world's data is expected to exceed 50 times today's volume. Meanwhile, although network bandwidth and new storage technologies have also developed rapidly, they still fall far short of the performance requirements of current mass data transfer and storage. Data compression is one of the key technologies for meeting this challenge: it effectively reduces the volume of data that must be transferred and stored, controlling transfer and storage costs and enabling low-cost, efficient data management.
Traditional compression theory and algorithms on the CPU platform focus on improving the compression ratio, but the big-data era places higher demands on compression speed. To improve speed, parallel data compression has become a new direction of development. The prerequisite for parallel compression is finding data-level parallelism; partitioning the data into blocks and compressing the blocks simultaneously is a natural form of such parallelism. Existing algorithms such as those of J. Gilchrist, GZIP, and S. Pradhan realize parallel compression on the CPU platform using multithreading, multicore, and cluster approaches respectively. In practice, however, as the volume of data to be processed grows, the heavy communication traffic generated during interaction, the large memory footprint of intermediate results, and the CPU's intrinsic hardware architecture, which is ill-suited to large-scale parallel computation, together prevent CPU-based parallel compression from reaching the targeted speed.
Some scholars have also ported classical CPU lossless compression algorithms, such as run-length encoding and BZIP2, to the GPU platform with improvements. To address the shortcomings of CPU compression described above, these ports concentrate on exploiting the GPU's shared memory and global memory so as to minimize communication between modules and reduce memory usage, thereby improving compression speed. However, because these algorithms were originally designed for the CPU platform, they are inherently unsuited to the different hardware architecture of the GPU, and in practice the speed improvement still leaves much room for further gains.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a high-speed lossless data compression method based on a GPU-CPU hybrid platform. Its purpose is to decompose the data compression process into a parallel-computation part and a serial-computation part: the parallel part divides the file into multiple compression dictionary windows and pre-read windows, organizes them as multiple matrices, and hands the file to be compressed to multiple thread blocks on the GPU for parallel processing, while the generation and output of the compressed encoding in the serial part is completed by the CPU. This mode of operation combines the respective strengths of the GPU and the CPU, greatly improving the compression speed of massive data without reducing the compression ratio.
To achieve the above object, according to one aspect of the present invention, a high-speed lossless data compression method based on a GPU-CPU hybrid platform is provided, comprising the following steps:
(1) the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) a thread-block set bk[a] is configured on the GPU, with b threads in each thread block, where a is the total number of thread blocks;
(3) the length of the compression dictionary window is set to c, and a head pointer p_dic_h pointing to the first compression dictionary window is set;
(4) the pre-read window size is set to d, and a pointer p_pre_r pointing to the first pre-read window is set, with initial value p_dic_h + c;
(5) a working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d, are initialized;
(6) the first a*b/2 threads of the working thread set threads[a*b] are called to process q = (a*b/2)/c data segments of length c+d in the file to be compressed;
(7) in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s is found, and a triple result array locations[p] is determined for each result matrix; each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment;
(8) the element with the maximum length value in the locations[p] array of each gMatrix is found: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 is responsible for finding the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and depositing its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
(9) the file to be compressed is compressed according to the match result array match[q];
(10) whether the pointer p_pre_r has reached the tail of the file to be compressed is judged; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to step (6).
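The ten steps above amount to a sliding-window, LZ-style longest-match search driven in batches of q window pairs per loop iteration. The following is a rough serial sketch in Python, purely for illustration: it is not the patent's GPU implementation, and the helper longest_match stands in for the matrix-based steps (6)-(8).

```python
def longest_match(dic, pre):
    """Best (x, y, length) with pre[y:y+L] == dic[x:x+L]; (-1, -1, 0) if L < 3.
    Stands in for the GPU matrix matching of steps (6)-(8)."""
    best = (-1, -1, 0)
    for x in range(len(dic)):
        for y in range(len(pre)):
            L = 0
            while x + L < len(dic) and y + L < len(pre) and dic[x + L] == pre[y + L]:
                L += 1
            if L >= 3 and L > best[2]:
                best = (x, y, L)
    return best

def compress(data, c=8, d=4, q=2):
    """Toy serial re-creation of the outer loop of steps (1)-(10): q
    dictionary/pre-read window pairs are matched per iteration, then both
    windows slide forward by q*d bytes."""
    p_dic_h = 0              # head of the first compression dictionary window
    p_pre_r = p_dic_h + c    # head of the first pre-read window
    out = []
    while p_pre_r < len(data):
        for k in range(q):   # processed in parallel on the GPU in the patent
            dic = data[p_dic_h + k * d : p_dic_h + k * d + c]
            pre = data[p_pre_r + k * d : p_pre_r + k * d + d]
            if not pre:
                break
            out.append(longest_match(dic, pre))
        p_dic_h += q * d
        p_pre_r += q * d
    return out
```

For example, compress("abcdabcd", c=4, d=4, q=1) yields a single match triple (0, 0, 4), since the pre-read window repeats the dictionary window exactly.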
Preferably, the value of d equals 16*n, where n ranges from 1 to 8.
Preferably, in step (6), p_dic_h points to the head of the first compression dictionary window of the file to be compressed and p_pre_r to the head of its first pre-read window, while p_dic_h + d points to the head of the second compression dictionary window and p_pre_r + d to the head of the second pre-read window. Divided in this way, one loop iteration can process q pairs of compression dictionary windows and pre-read data windows.
Preferably, step (6) specifically consists of carrying out the following steps for each pending compression dictionary window and its corresponding pre-read data window:
(6-1) a counter i = 0 is set;
(6-2) a thread T1 with thread number th1 is one of threads c*k to c*(k+1)-1 of the thread set threads[a*b/2], where 0 <= k < q; it judges whether byte (th1 mod c) of the k-th compression dictionary window matches each of bytes i*16 to (i+1)*16-1 of the k-th pre-read data window, returning 1 when two bytes are equal and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d + i*16) to ((th1 mod c)*d + i*16 + 16) of the k-th gMatrix matrix in global memory;
(6-3) i = i+1 is set and whether i < n is judged; if so, the process returns to step (6-2), otherwise it proceeds to step (7).
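Conceptually, steps (6-1) to (6-3) fill a c*d zero/one matrix whose entry (r, col) records whether dictionary byte r equals pre-read byte col. A minimal illustrative sketch, with a plain nested comprehension in place of the patent's per-thread GPU kernel:

```python
def build_gmatrix(dic, pre):
    """gMatrix[r][col] = 1 iff dictionary byte r equals pre-read byte col.
    Serial stand-in for step (6): on the GPU each thread fills one
    16-byte slice of one row of this matrix."""
    return [[1 if dr == pc else 0 for pc in pre] for dr in dic]
```

A run of 1s along a down-right diagonal of this matrix is exactly a matching substring between the dictionary window and the pre-read window, which is what step (7) goes on to extract.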
Preferably, when length is less than 3, no match has been found; in that case x and y are directly assigned the value -1.
Preferably, step (7) specifically comprises the following sub-steps:
(7-1) a thread T2 with thread number th2 is one of threads c*k to c*(k+1)+d-1 of the thread set threads[a*b]; T2 is responsible for finding the maximal sub-segment of consecutive 1s within one diagonal segment and recording the parameters x, y and length of that sub-segment;
(7-2) thread T2 stores the data x, y and length it obtains into element (th2 mod p) of the triple result array locations.
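Steps (7-1) and (7-2) can be pictured as scanning the p = c+d-1 down-right diagonals of gMatrix: a run of 1s starting at matrix position (x, y) means dictionary bytes starting at x match pre-read bytes starting at y. An illustrative serial sketch, where the per-diagonal loop plays the role of thread T2 and the minimum run length of 3 follows the no-match rule stated above:

```python
def diagonal_best_runs(g):
    """Step (7) sketch: for each of the p = c+d-1 down-right diagonals of a
    c*d 0/1 matrix, return the longest run of 1s on that diagonal as
    (x, y, length), or (-1, -1, 0) when the run is shorter than 3."""
    c, d = len(g), len(g[0])
    locations = []
    for k in range(-(d - 1), c):          # diagonal defined by r - col == k
        r, col = (k, 0) if k >= 0 else (0, -k)
        best = (-1, -1, 0)
        run, run_start = 0, (r, col)
        while r < c and col < d:
            if g[r][col]:
                if run == 0:
                    run_start = (r, col)  # (x, y) where this run begins
                run += 1
                if run >= 3 and run > best[2]:
                    best = (run_start[0], run_start[1], run)
            else:
                run = 0
            r += 1
            col += 1
        locations.append(best)
    return locations
```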
Preferably, step (9) specifically comprises the following sub-steps:
(9-1) the match result array match[q] is transferred from the GPU to the main memory of the CPU;
(9-2) the CPU restores the data stored in the match result array match[q] into the offset and length of the longest matching substring of the pre-read window within the compression dictionary window, and outputs the compressed-encoding triple array compress[q]; each element of the array stores a triple (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are respectively the offset and length of the longest matching substring of the pre-read window within the corresponding compression dictionary window;
(9-3) the file to be compressed is compressed according to the flag byte.
Preferably, in step (9-3), the flag bytes obtained comprise raw flag bytes and mixed flag bytes. The first bit of a raw flag byte is 0 and its last 7 bits give the length of the raw data output; at most 128 consecutive raw data bytes can follow. The first bit of a mixed flag byte is 1 and its last 7 bits describe 7 items of mixed raw and compressed data: a 0 in the corresponding bit means the output is raw data, a 1 means the output is a compressed code. Data compression is thereby realized; where no match is found, the data is output as-is.
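The two flag-byte layouts can be sketched as follows. Note the assumptions: the exact bit ordering (most significant bit first, item i mapped to bit 6-i) is one reading of the description and is not confirmed by the patent text, and the raw count is capped at 127 here since 7 bits directly encode at most 127 (the stated maximum of 128 presumably relies on a length-minus-one encoding).

```python
def raw_flag(n):
    """Raw flag byte: top bit 0, low 7 bits the count of raw bytes that
    follow (capped at 127 in this sketch; see the assumption above)."""
    assert 1 <= n <= 127
    return n

def mixed_flag(item_is_code):
    """Mixed flag byte: top bit 1; bit (6 - i) set iff item i of the next
    7 output items is a compressed code rather than raw data
    (bit ordering is an assumption, not stated in the patent)."""
    assert 1 <= len(item_is_code) <= 7
    flag = 0x80
    for i, is_code in enumerate(item_is_code):
        if is_code:
            flag |= 1 << (6 - i)
    return flag
```

Under these assumptions, a raw flag byte announcing 3 raw bytes has value 3 (00000011), and a mixed flag byte whose first two items are compressed codes has value 224 (11100000), matching the two flag values that appear in the worked example later in this document.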
According to another aspect of the present invention, a high-speed lossless data compression system based on a GPU-CPU hybrid platform is provided, comprising:
The first module is for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;
The second module is for configuring the thread-block set bk[a] on the GPU, with b threads in each thread block, where a is the total number of thread blocks;
The third module is for setting the compression dictionary window length to c and setting a head pointer p_dic_h pointing to the first compression dictionary window;
The fourth module is for setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, with initial value p_dic_h + c;
The fifth module is for initializing the working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
The sixth module is for calling the first a*b/2 threads of the working thread set threads[a*b] to process q = (a*b/2)/c data segments of length c+d in the file to be compressed;
The seventh module is for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining a triple result array locations[p] for each result matrix; each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment;
The eighth module is for finding the element with the maximum length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 finds the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and deposits its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
The ninth module is for compressing the file to be compressed according to the match result array match[q];
The tenth module is for judging whether the pointer p_pre_r has reached the tail of the file to be compressed; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to the sixth module.
In general, compared with the prior art, the above technical scheme conceived by the present invention can obtain the following beneficial effects:
(1) the present invention accelerates data matching through parallel matrix matching: the proposed scheme matches q matrices of size c*d in parallel in one loop iteration, increasing the step length of the matching loop from d bytes to q*d bytes and thereby accelerating matching;
(2) the present invention realizes asynchronous processing, overlapping the search for redundant data with the output of the compressed encoding, so that operations on the CPU and GPU proceed in a pipeline-like fashion; this also reduces the total compression time to some extent and improves compression speed.
Brief description of the drawings
Fig. 1 is a flow chart of the high-speed lossless data compression method based on the GPU-CPU hybrid platform of the present invention.
Fig. 2 is a schematic diagram of the division of the compression dictionary windows of the file to be compressed and their corresponding pre-read windows.
Fig. 3 to Fig. 6 are schematic diagrams of an application example of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below can be combined with each other provided that they do not conflict.
As shown in Figure 1, the high-speed lossless data compression method based on the GPU-CPU hybrid platform of the present invention comprises the following steps:
(1) the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) a thread-block set bk[a] is configured on the GPU, with b threads in each thread block, where a is the total number of thread blocks and is a positive integer, and b ranges from 256 to 1024;
(3) the length of the compression dictionary window is set to c, and a head pointer p_dic_h pointing to the first compression dictionary window is set; specifically, c ranges from 2 KB to 8 KB with a preferred value of 4 KB, and the initial value of p_dic_h points to the start of the file to be compressed;
(4) the pre-read window size is set to d, and a pointer p_pre_r pointing to the first pre-read window is set, with initial value p_dic_h + c; d equals 16*n, where n ranges from 1 to 8 with a preferred value of 4, so the preferred size of each pre-read window is 64 B;
(5) a working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d, are initialized;
(6) the first a*b/2 threads of the working thread set threads[a*b] are called to process q = (a*b/2)/c data segments of length c+d in the file to be compressed. The division of the compression dictionary windows of the file and their corresponding pre-read windows is shown in Figure 2: p_dic_h points to the head of the first compression dictionary window and p_pre_r to the head of the first pre-read window, while p_dic_h + d points to the head of the second compression dictionary window and p_pre_r + d to the head of the second pre-read window. Divided in this way, one loop iteration can process q pairs of compression dictionary windows and pre-read data windows. Specifically, for each pending compression dictionary window and its corresponding pre-read data window, the following steps are carried out:
(6-1) a counter i = 0 is set;
(6-2) a thread T1 with thread number th1 is one of threads c*k to c*(k+1)-1 of the thread set threads[a*b/2], where 0 <= k < q; it judges whether byte (th1 mod c) of the k-th compression dictionary window matches (i.e. equals) each of bytes i*16 to (i+1)*16-1 of the k-th pre-read data window, returning 1 when two bytes match and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d + i*16) to ((th1 mod c)*d + i*16 + 16) of the k-th gMatrix matrix in global memory;
(6-3) i = i+1 is set and whether i < n is judged; if so, the process returns to step (6-2), otherwise it proceeds to step (7);
(7) in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s is found, and a triple result array locations[p] is determined for each result matrix. Each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment (the number of 1s it contains). When length is less than 3, no match has been found (if the matched substring is too short, the code after compression is longer than the uncompressed data, which is meaningless); in that case x and y are meaningless and are directly assigned -1.
This step specifically comprises the following sub-steps:
(7-1) a thread T2 with thread number th2 is one of threads c*k to c*(k+1)+d-1 of the thread set threads[a*b]; T2 is responsible for finding the maximal sub-segment of consecutive 1s within one diagonal segment and recording the parameters x, y and length of that sub-segment;
(7-2) thread T2 stores the data x, y and length it obtains into element (th2 mod p) of the triple result array locations;
(8) the element with the maximum length value in the locations[p] array of each gMatrix is found: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 finds the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and deposits its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
(9) the file to be compressed is compressed according to the match result array match[q], specifically comprising the following sub-steps:
(9-1) the match result array match[q] is transferred from the GPU to the main memory of the CPU;
(9-2) the CPU restores the data stored in the match result array match[q] into the offset and length of the longest matching substring of the pre-read window within the compression dictionary window, and outputs the compressed-encoding triple array compress[q]; each element of the array stores a triple (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are respectively the offset and length of the longest matching substring of the pre-read window within the corresponding compression dictionary window;
(9-3) the file to be compressed is compressed according to the flag byte. Specifically, the flag bytes obtained are of two types, the raw flag byte and the mixed flag byte. The first bit of a raw flag byte is 0 and its last 7 bits give the length of the raw data output; at most 128 consecutive raw data bytes can follow. The first bit of a mixed flag byte is 1 and its last 7 bits describe 7 items of mixed raw and compressed data: a 0 in the corresponding bit means the output is raw data, a 1 means the output is a compressed code. Data compression is thereby realized; where no match is found, the data is output as-is;
(10) whether the pointer p_pre_r has reached the tail of the file to be compressed is judged; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to step (6).
The high-speed lossless data compression system based on the GPU-CPU hybrid platform of the present invention comprises:
The first module is for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;
The second module is for configuring the thread-block set bk[a] on the GPU, with b threads in each thread block, where a is the total number of thread blocks;
The third module is for setting the compression dictionary window length to c and setting a head pointer p_dic_h pointing to the first compression dictionary window;
The fourth module is for setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, with initial value p_dic_h + c;
The fifth module is for initializing the working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
The sixth module is for calling the first a*b/2 threads of the working thread set threads[a*b] to process q = (a*b/2)/c data segments of length c+d in the file to be compressed;
The seventh module is for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining a triple result array locations[p] for each result matrix; each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment;
The eighth module is for finding the element with the maximum length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 finds the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and deposits its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
The ninth module is for compressing the file to be compressed according to the match result array match[q];
The tenth module is for judging whether the pointer p_pre_r has reached the tail of the file to be compressed; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to the sixth module.
Example
In order to clearly set forth the principle of the present invention, its implementation process is illustrated below.
(1) the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) the thread-block set bk[1024] is configured on the GPU, with 512 threads in each thread block;
(3) the length of the compression dictionary window is set to 4096 bytes, and the head pointer to the first compression dictionary window is p_dic_h = 0;
(4) the pre-read window length is set to 64 bytes, and the pointer to the first pre-read window is p_pre_r = 4096;
(5) the working thread set threads[1024*512] and 64 gMatrix matrices, each of size 4096*64, are initialized;
(6) 1024*256 threads of the working thread set threads[1024*512] are called to process 64 data segments of length (4096+64) B in the file to be compressed; the division of the compression dictionary windows of the file and their corresponding pre-read windows is shown in Figure 3:
p_dic_h points to the head of the first compression dictionary window and p_pre_r to the head of the first pre-read window, while p_dic_h + 64 points to the head of the second compression dictionary window and p_pre_r + 64 to the head of the second pre-read window. Divided in this way, one loop iteration can process 64 pairs of compression dictionary windows and pre-read data windows. Specifically, for each pending compression dictionary window and its corresponding pre-read data window the following steps are carried out (here we describe the processing of the first pair of compression dictionary window and pre-read data window):
For a simple and clear description of the compression process, suppose the data in the first compression dictionary window of the file to be compressed is "hellothisaeybc...isa", with a length of 4096 bytes, and the first 16 B of the first pre-read window are "thisisaexampletosh". Threads T0 to T4095, 4096 threads in total, have thread numbers 0 to 4095 respectively; Tm1 (m1 ∈ [0,4095]) is any one of them, with thread number m1. This thread judges whether byte m1 of the compression dictionary window matches (i.e. equals) each of bytes 0 to 15 of the pre-read data window, returning 1 when two bytes match and 0 otherwise, and writes these matching results back to positions m1*64 to m1*64+16 of the corresponding gMatrix matrix in global memory. Because only 16 B of data are selected for compression, only one loop iteration with i = 0 is carried out, obtaining 1/4 of the 4096*64 result matrix gMatrix; the result obtained this time has size 4096*16, as shown in Figure 4.
(7) Below we describe only how the diagonal segment with the most consecutive 1s is found in the first of the 64 result matrices gMatrix.
(7-1) threads T0 to T4110 in the thread set, (4096+15) threads in total, have thread numbers 0 to 4110 respectively; Tm2 (m2 ∈ [0,4110]) is any one of them, with thread number m2. Tm2 is responsible for finding the maximal sub-segment of consecutive 1s within one diagonal segment, as shown in Figure 5:
(7-2) thread Tm2 obtains its parameters x, y and length and stores them into element m2 of the triple result array locations;
In this example, threads T5 and T10 have each found a sub-segment of consecutive 1s and assign the corresponding locations elements: locations(5) = {5, 0, 6} and locations(10) = {10, 2, 3}; the value of every other element of the locations array is {-1, -1, 0}, as shown in Figure 6.
(8) the element with the maximum length value in the locations[p] array of each gMatrix is found: threads T0 to T63 in the thread set, 64 threads in total, have thread numbers 0 to 63 respectively; Tm3 (m3 ∈ [0,63]) is any one of them, with thread number m3. Tm3 is responsible for finding the element with the maximum length value in the locations array of each gMatrix matrix, and deposits its parameters x, y and length in the global match result array match[64], each element of which also stores a triple (x, y, length). In this example, the locations element with the maximum length value found by thread T0 in the 0th result matrix gMatrix is element 5, and its parameters are written to element 0 of the array match, i.e. match(0) = {5, 0, 6};
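Step (8) is a simple argmax reduction over each locations array; with the example values above it selects {5, 0, 6}. A one-line sketch, assuming the length field is the third element of each triple:

```python
def best_match(locations):
    """Step (8) sketch: pick the (x, y, length) triple with the greatest
    length from one gMatrix's locations array (one thread per matrix on
    the GPU, a keyed max() here)."""
    return max(locations, key=lambda t: t[2])
```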
(9) Compress the data file to be compressed according to the matching-result array match[64], which specifically comprises the following sub-steps:
(9-1) Transfer the matching-result array match[64] from the GPU to CPU main memory;
(9-2) The CPU reduces the data stored in the matching-result array match[64] to the offset and length of the longest matching substring of each pre-read window within its compression dictionary window, and outputs the compression-coding triples compress[64]. In this example, the data of match(0) reduce to an offset of 5 and a length of 6 for the longest match of the 0th pre-read window within the 0th compression dictionary window, and a mixed flag byte is also output, with flag value 224 (11100000), so that compress(0)={224, 5, 6}. After compression, the first (4096+16) bytes of the data file to be compressed are output as: hellothisaeybc ... isa22456amplet3osh. That is: the 4096 B of data in the first compression dictionary window cannot be compressed and are output as-is; the 22456 immediately following means that a substring 6 bytes long in the pre-read window has been compressed, its original text starting at offset 5 of the compression dictionary window and being 6 bytes long; then a string of 6 uncompressed raw bytes, amplet, is output; then a raw flag byte with value 3 (00000011); and finally the last 3 uncompressed raw bytes, osh.
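The flag-byte arithmetic in the worked example is hard to reproduce exactly, so the Python sketch below shows only one plausible reading of the format of step (9-2) (and of claim 8): a flag byte with high bit 0 announces a run of raw bytes, while a flag byte with high bit 1 describes 7 items, each bit selecting either one raw byte (0) or an (offset, length) code into the dictionary window (1). `decode_block` and the sample stream are illustrative assumptions, not taken from the patent.

```python
def decode_block(dictionary: bytes, stream: bytes) -> bytes:
    """Decode one compressed block under the assumed flag-byte format:
    high bit 0 -> the low 7 bits give a count of raw bytes that follow;
    high bit 1 -> the low 7 bits, most significant first, each select
    either an (offset, length) code pair (1) or a single raw byte (0)."""
    out = bytearray()
    i = 0
    while i < len(stream):
        flag = stream[i]
        i += 1
        if flag & 0x80:                      # mixed flag byte
            for bit in range(6, -1, -1):
                if i >= len(stream):
                    break
                if (flag >> bit) & 1:        # compression code
                    off, length = stream[i], stream[i + 1]
                    out += dictionary[off:off + length]
                    i += 2
                else:                        # one raw byte
                    out.append(stream[i])
                    i += 1
        else:                                # raw flag byte
            n = flag & 0x7F
            out += stream[i:i + n]
            i += n
    return bytes(out)

# mixed flag 0b11000000: the first item is the code (5, 6), the remaining
# six items are raw bytes; then a raw flag byte announcing 2 raw bytes
dictionary = b"hellothisaeybc"
stream = bytes([0b11000000, 5, 6]) + b"xample" + bytes([2]) + b"!!"
```

Because decompression only indexes into the dictionary window, it is sequential and cheap; the parallel effort of the method is entirely on the matching side.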
(10) Judge whether the pointer p_pre_r has reached the end of the data file to be compressed. If it points to the end of the file, the process finishes; otherwise, slide the dictionary window and the pre-read window forward by setting p_pre_r=p_pre_r+64*64 and p_dic_h=p_dic_h+64*64, and return to step (6).
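End to end, the loop of steps (6) through (10) amounts to sliding both windows over the file and emitting one best-match triple per pre-read window. The sketch below is a sequential CPU stand-in for the GPU pipeline, using brute-force matching in place of the gMatrix diagonals; all names are illustrative.

```python
def best_match(dictionary: bytes, window: bytes) -> tuple:
    """Longest substring of the pre-read window found anywhere in the
    dictionary window, as an (x, y, length) triple; (-1, -1, 0) if none."""
    best = (-1, -1, 0)
    for x in range(len(dictionary)):
        for y in range(len(window)):
            length = 0
            while (x + length < len(dictionary) and y + length < len(window)
                   and dictionary[x + length] == window[y + length]):
                length += 1
            if length > best[2]:
                best = (x, y, length)
    return best

def compress_stream(data: bytes, c: int, d: int) -> list:
    """Slide a c-byte dictionary window and a d-byte pre-read window
    forward over the data, collecting one match triple per window pair."""
    matches = []
    pos = c                                # first pre-read window begins
    while pos < len(data):                 # right after the first dictionary
        dictionary = data[pos - c:pos]
        window = data[pos:pos + d]
        matches.append(best_match(dictionary, window))
        pos += d                           # slide both windows forward
    return matches

matches = compress_stream(b"abcabcab", c=3, d=3)
```

Unlike the patent's version, this does the O(c*d) comparisons per window on one core; the GPU method assigns them to a*b/2 threads at once, which is where the claimed speedup comes from.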
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall all be included within its scope of protection.

Claims (9)

1. A high-speed lossless data compression method based on a GPU-CPU hybrid platform, characterized in that it comprises the following steps:
(1) the CPU reads the data file to be compressed and copies it from main memory to the global memory of the GPU;
(2) setting the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;
(3) setting the length of the compression dictionary window to c, and setting a head pointer p_dic_h pointing to the first compression dictionary window;
(4) setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, the initial value of this pointer being set to p_dic_h-c;
(5) initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
(6) calling (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of data of length c+d in the data file to be compressed;
(7) finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s, and determining the ternary result array locations[p] of each result matrix, each element of which stores a ternary result (x, y, length), where p is the number of diagonal segments of the result matrix and equals c+d-1, x denotes the offset of a diagonal segment relative to the compression dictionary window corresponding to its result matrix, y denotes the offset of the diagonal segment relative to the pre-read window corresponding to its result matrix, and length denotes the length of the diagonal segment;
(8) finding the element with the maximum length value in the locations[p] array corresponding to each gMatrix: a thread T3 with thread number th3 is set, T3 being one of the 0th to (q-1)th threads of the thread group threads[a*b]; thread T3 is responsible for finding the element with the maximum length value in the ternary result array locations[p] corresponding to one gMatrix matrix, and for depositing its parameters x, y and length into the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
(9) compressing the data file to be compressed according to the matching-result array match[q];
(10) judging whether the pointer p_pre_r has reached the end of the data file to be compressed; if so, the process finishes; otherwise, the dictionary window and the pre-read window are slid forward by setting p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and the process returns to step (6).
2. The high-speed lossless data compression method according to claim 1, characterized in that the value of d equals 16*n, where n ranges from 1 to 8.
3. The high-speed lossless data compression method according to claim 1, characterized in that in step (6), p_dic_h points to the head of the first compression dictionary window of the data file to be compressed, p_pre_r points to the head of its first pre-read window, (p_dic_h-d) points to the head of its second compression dictionary window, and (p_pre_r-d) points to the head of its second pre-read window; divided in this way, one iteration can process q pairs of compression dictionary windows and pre-read windows.
4. The high-speed lossless data compression method according to claim 1, characterized in that step (6) specifically comprises performing, for each pending compression dictionary window and its corresponding pre-read window, the following steps:
(6-1) setting a counter i=0;
(6-2) setting a thread T1 with thread number th1, T1 being one of the (c*k)th to (c*(k+1)-1)th threads of the thread group threads[a*b/2], where 0<=k<q; T1 judges whether byte (th1 mod c) of the kth compression dictionary window matches each of bytes i*16 to (i+1)*16-1 of the kth pre-read window, returning the value 1 when two bytes match and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d+i*16) through ((th1 mod c)*d+i*16+16) of the kth gMatrix matrix in global memory;
(6-3) setting i=i+1 and judging whether i<n; if so, returning to step (6-2), otherwise proceeding to step (7).
5. The high-speed lossless data compression method according to claim 1, characterized in that when length is less than 3, no match has been found, and in that case x and y are both assigned the value -1.
6. The high-speed lossless data compression method according to claim 1, characterized in that step (7) specifically comprises the following sub-steps:
(7-1) setting a thread T2 with thread number th2, T2 being one of the (c*k)th to (c*(k+1)+d-1)th threads of the thread group threads[a*b]; T2 is responsible for finding the longest sub-segment of consecutive 1s within one diagonal segment and for recording the parameters x, y and length corresponding to that sub-segment;
(7-2) thread T2 storing the resulting data x, y and length in element (th2 mod p) of the ternary result array locations.
7. The high-speed lossless data compression method according to claim 1, characterized in that step (9) specifically comprises the following sub-steps:
(9-1) transferring the matching-result array match[q] from the GPU to CPU main memory;
(9-2) the CPU reducing the data stored in the matching-result array match[q] to the offset and length of the longest matching substring of each pre-read window within its compression dictionary window, and outputting the compression-coding triples compress[q], each element of which stores a ternary result (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length respectively denote the offset and length of the longest matching substring of the pre-read window within the corresponding compression dictionary window;
(9-3) compressing the data file to be compressed according to the flag bytes.
8. The high-speed lossless data compression method according to claim 7, characterized in that in step (9-3), the flag byte obtained is either a raw flag byte or a mixed flag byte: the first bit of a raw flag byte is 0 and its last 7 bits give the length of the raw data output, so that at most 128 consecutive raw data bytes can follow; the first bit of a mixed flag byte is 1 and its last 7 bits describe 7 items of mixed raw and compressed data, the corresponding bit being 0 when raw data is output and 1 when a compression code is output, thereby achieving data compression; if no match is found, the data are output as-is.
9. A high-speed lossless data compression system based on a GPU-CPU hybrid platform, characterized in that it comprises:
a first module for reading the data file to be compressed and copying it from main memory to the global memory of the GPU;
a second module for setting the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;
a third module for setting the length of the compression dictionary window to c and setting a head pointer p_dic_h pointing to the first compression dictionary window;
a fourth module for setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, the initial value of this pointer being set to p_dic_h-c;
a fifth module for initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
a sixth module for calling (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of data of length c+d in the data file to be compressed;
a seventh module for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s, and determining the ternary result array locations[p] of each result matrix, each element of which stores a ternary result (x, y, length), where p is the number of diagonal segments of the result matrix and equals c+d-1, x denotes the offset of a diagonal segment relative to the compression dictionary window corresponding to its result matrix, y denotes the offset of the diagonal segment relative to the pre-read window corresponding to its result matrix, and length denotes the length of the diagonal segment;
an eighth module for finding the element with the maximum length value in the locations[p] array corresponding to each gMatrix: a thread T3 with thread number th3 is set, T3 being one of the 0th to (q-1)th threads of the thread group threads[a*b]; thread T3 is responsible for finding the element with the maximum length value in the ternary result array locations[p] corresponding to one gMatrix matrix, and for depositing its parameters x, y and length into the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
a ninth module for compressing the data file to be compressed according to the matching-result array match[q];
a tenth module for judging whether the pointer p_pre_r has reached the end of the data file to be compressed; if so, the process finishes; otherwise, the dictionary window and the pre-read window are slid forward by setting p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and the process returns to the sixth module.
CN201310321071.7A 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform Active CN103427844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310321071.7A CN103427844B (en) 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310321071.7A CN103427844B (en) 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform

Publications (2)

Publication Number Publication Date
CN103427844A true CN103427844A (en) 2013-12-04
CN103427844B CN103427844B (en) 2016-03-02

Family

ID=49652097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310321071.7A Active CN103427844B (en) 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform

Country Status (1)

Country Link
CN (1) CN103427844B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091305A (en) * 2014-08-11 2014-10-08 詹曙 Quick image segmentation method used for computer graph and image processing and based on GPU platform and morphological component analysis
CN104965761A (en) * 2015-07-21 2015-10-07 华中科技大学 Flow program multi-granularity division and scheduling method based on GPU/CPU hybrid architecture
CN105630529A (en) * 2014-11-05 2016-06-01 京微雅格(北京)科技有限公司 Loading method of FPGA (Field Programmable Gate Array) configuration file, and decoder
CN106019858A (en) * 2016-07-22 2016-10-12 合肥芯碁微电子装备有限公司 Direct writing type photoetching machine image data bit-by-bit compression method based on CUDA technology
CN107508602A (en) * 2017-09-01 2017-12-22 郑州云海信息技术有限公司 A kind of data compression method, system and its CPU processor
CN110007855A (en) * 2019-02-28 2019-07-12 华中科技大学 A kind of the 3D stacking NVM internal storage data compression method and system of hardware supported
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN111628779A (en) * 2020-05-29 2020-09-04 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN112463388A (en) * 2020-12-09 2021-03-09 广州科莱瑞迪医疗器材股份有限公司 SGRT data processing method and device based on multithreading

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101123723A (en) * 2006-08-11 2008-02-13 北京大学 Digital video decoding method based on image processor
CN101937082A (en) * 2009-07-02 2011-01-05 北京理工大学 GPU (Graphic Processing Unit) many-core platform based parallel imaging method of synthetic aperture radar
CN101937555A (en) * 2009-07-02 2011-01-05 北京理工大学 Parallel generation method of pulse compression reference matrix based on GPU (Graphic Processing Unit) core platform
US20110157192A1 (en) * 2009-12-29 2011-06-30 Microsoft Corporation Parallel Block Compression With a GPU
CN102436438A (en) * 2011-12-13 2012-05-02 华中科技大学 Sparse matrix data storage method based on ground power unit (GPU)
US8374242B1 (en) * 2008-12-23 2013-02-12 Elemental Technologies Inc. Video encoder using GPU
CN103177414A (en) * 2013-03-27 2013-06-26 天津大学 Structure-based dependency graph node similarity concurrent computation method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FARACH, M. et al.: "Optimal parallel dictionary matching and compression", Proceedings of the 7th ACM Symposium on Parallel Algorithms and Architectures (SPAA '95), 26 April 1995, pages 244-253 *
SHEN, K. et al.: "Overview of parallel processing approaches to image and video compression", SPIE 2186: Proceedings of the Conference on Image and Video Compression, 1 May 1994, pages 197-208 *
CUI, Chen: "Parallel algorithm design and implementation of key modules of an H.264 encoder based on GPU", China Master's Theses Full-text Database, Information Science and Technology, no. 10, 15 October 2012, pages 1-64 *
HU, Xiaoling: "Parallel implementation of H.264/AVC video compression coding on the CUDA platform", China Master's Theses Full-text Database, Information Science and Technology, no. 09, 15 September 2011, pages 1-68 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091305A (en) * 2014-08-11 2014-10-08 詹曙 Quick image segmentation method used for computer graph and image processing and based on GPU platform and morphological component analysis
CN104091305B (en) * 2014-08-11 2016-05-18 詹曙 Fast image segmentation method for computer graphics and image processing based on a GPU platform and morphological component analysis
CN105630529A (en) * 2014-11-05 2016-06-01 京微雅格(北京)科技有限公司 Loading method of FPGA (Field Programmable Gate Array) configuration file, and decoder
CN104965761A (en) * 2015-07-21 2015-10-07 华中科技大学 Flow program multi-granularity division and scheduling method based on GPU/CPU hybrid architecture
CN104965761B (en) * 2015-07-21 2018-11-02 华中科技大学 Multi-granularity partitioning and scheduling method for streaming programs based on a GPU/CPU hybrid architecture
CN106019858B (en) * 2016-07-22 2018-05-22 合肥芯碁微电子装备有限公司 Bit-by-bit compression method for direct-write lithography machine image data based on CUDA technology
CN106019858A (en) * 2016-07-22 2016-10-12 合肥芯碁微电子装备有限公司 Direct writing type photoetching machine image data bit-by-bit compression method based on CUDA technology
CN107508602A (en) * 2017-09-01 2017-12-22 郑州云海信息技术有限公司 A kind of data compression method, system and its CPU processor
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
CN110007855A (en) * 2019-02-28 2019-07-12 华中科技大学 A kind of the 3D stacking NVM internal storage data compression method and system of hardware supported
CN110007855B (en) * 2019-02-28 2020-04-28 华中科技大学 Hardware-supported 3D stacked NVM (non-volatile memory) memory data compression method and system
CN111628779A (en) * 2020-05-29 2020-09-04 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN111628779B (en) * 2020-05-29 2023-10-20 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN112463388A (en) * 2020-12-09 2021-03-09 广州科莱瑞迪医疗器材股份有限公司 SGRT data processing method and device based on multithreading

Also Published As

Publication number Publication date
CN103427844B (en) 2016-03-02

Similar Documents

Publication Publication Date Title
CN103427844B (en) High-speed lossless data compression method based on a GPU-CPU hybrid platform
US10915450B2 (en) Methods and systems for padding data received by a state machine engine
US10606787B2 (en) Methods and apparatuses for providing data received by a state machine engine
US8959135B2 (en) Data structure for tiling and packetizing a sparse matrix
EP2895968B1 (en) Optimal data representation and auxiliary structures for in-memory database query processing
CN108416427A (en) Convolution kernel accumulates data flow, compressed encoding and deep learning algorithm
TWI836132B (en) Storage system and method for dynamically scaling sort operation for storage system
CN102970043A (en) GZIP (GNUzip)-based hardware compressing system and accelerating method thereof
CN109672449B (en) Device and method for rapidly realizing LZ77 compression based on FPGA
CN109889205A (en) Encoding method and system, decoding method and system, and encoding and decoding method and system
CN114697654B (en) Neural network quantization compression method and system
CN114697672B (en) Neural network quantization compression method and system based on run Cheng Quanling coding
CN103995827B (en) High-performance sort method in MapReduce Computational frames
US11580055B2 (en) Devices for time division multiplexing of state machine engine signals
US20210027148A1 (en) Compression of neural network activation data
EP0961966A1 (en) N-way processing of bit strings in a dataflow architecture
Arming et al. Data compression in hardware—the burrows-wheeler approach
CN103336810B (en) A kind of Topology Analysis of Power Distribution Network method based on multi-core computer
CN101572693A (en) Equipment and method for parallel mode matching
CN202931290U (en) Compression hardware system based on GZIP
CN110349635A (en) A kind of parallel compression method of gene sequencing quality of data score
CN113392963B (en) FPGA-based CNN hardware acceleration system design method
CN102571107B (en) System and method for decoding high-speed parallel Turbo codes in LTE (Long Term Evolution) system
CN111897513B (en) Multiplier based on reverse polarity technology and code generation method thereof
CN107341113A (en) Cache compression method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant