CN103427844A - High-speed lossless data compression method based on GPU-CPU hybrid platform - Google Patents


Info

Publication number
CN103427844A
CN103427844A (application CN201310321071.7A); granted as CN103427844B
Authority
CN
China
Prior art keywords
thread
data
length
window
compression
Prior art date
Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Granted
Application number
CN2013103210717A
Other languages
Chinese (zh)
Other versions
CN103427844B (en)
Inventor
金海
郑然
周斌
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201310321071.7A
Publication of CN103427844A
Application granted
Publication of CN103427844B
Legal status: Active

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a high-speed lossless data compression method based on a GPU-CPU hybrid platform. The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU. A thread-block set bk[a] is configured on the GPU with b threads per block; the compression dictionary window length is set to c, with a head pointer p_dic_h to the first dictionary window; the pre-read window size is set to d, with a pointer p_pre_r to the first pre-read window, initialized to p_dic_h + c. A working thread set threads[a*b] is initialized, together with (a*b/2)/c result matrices gMatrix of size c*d. The first a*b/2 threads of threads[a*b] are called to process q = (a*b/2)/c data segments of length c+d in the file to be compressed; in each of the q result matrices gMatrix the diagonal segment with the most consecutive 1s is found, and a triple result array locations[p] is determined for each result matrix. The method can greatly increase the compression speed of massive data.

Description

A high-speed lossless data compression method based on a GPU-CPU hybrid platform
Technical field
The invention belongs to the field of computer data compression, and more particularly relates to a high-speed lossless data compression method based on a GPU-CPU hybrid platform.
Background technology
Research by International Data Corporation (IDC) shows that over the last decade the total amount of global information has doubled every two years. In 2011 the total amount of data created and replicated worldwide was 1.8 ZB (1800 EB); it was projected to reach 8 ZB (8000 EB) by 2015, and by around 2020 the world's data is expected to exceed 50 times today's volume. Meanwhile, although network bandwidth and new storage technologies have also developed rapidly, they still fall far short of the performance requirements of current mass data transfer and storage. Data compression is one of the key technologies for meeting this challenge: it effectively reduces the volume of data that must be transferred and stored, controlling transfer and storage costs and enabling low-cost, efficient data management.
Traditional compression theory and algorithms on the CPU platform focus on improving the compression ratio, but the big-data era places higher demands on compression speed. To improve speed, parallel data compression has become a new direction of development. The prerequisite for parallel compression is finding data-level parallelism; partitioning the data into blocks and compressing the blocks simultaneously is a natural form of such parallelism. Existing algorithms such as those of J. Gilchrist, GZIP, and S. Pradhan realize parallel compression on the CPU platform using multithreading, multicore, and cluster approaches respectively. In practice, however, as the volume of data to be processed grows, the heavy communication traffic generated during interaction, the large memory footprint of intermediate results, and the CPU's intrinsic hardware architecture, which is ill-suited to large-scale parallel computation, together prevent CPU-based parallel compression from reaching the targeted speed.
Some scholars have also ported classical CPU lossless compression algorithms, such as run-length encoding and BZIP2, to the GPU platform with improvements. To address the shortcomings of CPU compression described above, these ports concentrate on exploiting the GPU's shared memory and global memory so as to minimize communication between modules and reduce memory usage, thereby improving compression speed. However, because these algorithms were originally designed for the CPU platform, they are inherently unsuited to the different hardware architecture of the GPU, and in practice the speed improvement still leaves much room for further gains.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a high-speed lossless data compression method based on a GPU-CPU hybrid platform. Its purpose is to decompose the data compression process into a parallel-computation part and a serial-computation part: the parallel part divides the file into multiple compression dictionary windows and pre-read windows, organizes them as multiple matrices, and hands the file to be compressed to multiple thread blocks on the GPU for parallel processing, while the generation and output of the compressed encoding in the serial part is completed by the CPU. This mode of operation combines the respective strengths of the GPU and the CPU, greatly improving the compression speed of massive data without reducing the compression ratio.
To achieve the above object, according to one aspect of the present invention, a high-speed lossless data compression method based on a GPU-CPU hybrid platform is provided, comprising the following steps:
(1) the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) a thread-block set bk[a] is configured on the GPU, with b threads in each thread block, where a is the total number of thread blocks;
(3) the length of the compression dictionary window is set to c, and a head pointer p_dic_h pointing to the first compression dictionary window is set;
(4) the pre-read window size is set to d, and a pointer p_pre_r pointing to the first pre-read window is set, with initial value p_dic_h + c;
(5) a working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d, are initialized;
(6) the first a*b/2 threads of the working thread set threads[a*b] are called to process q = (a*b/2)/c data segments of length c+d in the file to be compressed;
(7) in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s is found, and a triple result array locations[p] is determined for each result matrix; each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment;
(8) the element with the maximum length value in the locations[p] array of each gMatrix is found: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 is responsible for finding the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and depositing its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
(9) the file to be compressed is compressed according to the match result array match[q];
(10) whether the pointer p_pre_r has reached the tail of the file to be compressed is judged; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to step (6).
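The ten steps above amount to a sliding-window, LZ-style longest-match search driven in batches of q window pairs per loop iteration. The following is a rough serial sketch in Python, purely for illustration: it is not the patent's GPU implementation, and the helper longest_match stands in for the matrix-based steps (6)-(8).

```python
def longest_match(dic, pre):
    """Best (x, y, length) with pre[y:y+L] == dic[x:x+L]; (-1, -1, 0) if L < 3.
    Stands in for the GPU matrix matching of steps (6)-(8)."""
    best = (-1, -1, 0)
    for x in range(len(dic)):
        for y in range(len(pre)):
            L = 0
            while x + L < len(dic) and y + L < len(pre) and dic[x + L] == pre[y + L]:
                L += 1
            if L >= 3 and L > best[2]:
                best = (x, y, L)
    return best

def compress(data, c=8, d=4, q=2):
    """Toy serial re-creation of the outer loop of steps (1)-(10): q
    dictionary/pre-read window pairs are matched per iteration, then both
    windows slide forward by q*d bytes."""
    p_dic_h = 0              # head of the first compression dictionary window
    p_pre_r = p_dic_h + c    # head of the first pre-read window
    out = []
    while p_pre_r < len(data):
        for k in range(q):   # processed in parallel on the GPU in the patent
            dic = data[p_dic_h + k * d : p_dic_h + k * d + c]
            pre = data[p_pre_r + k * d : p_pre_r + k * d + d]
            if not pre:
                break
            out.append(longest_match(dic, pre))
        p_dic_h += q * d
        p_pre_r += q * d
    return out
```

For example, compress("abcdabcd", c=4, d=4, q=1) yields a single match triple (0, 0, 4), since the pre-read window repeats the dictionary window exactly.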
Preferably, the value of d equals 16*n, where n ranges from 1 to 8.
Preferably, in step (6), p_dic_h points to the head of the first compression dictionary window of the file to be compressed and p_pre_r to the head of its first pre-read window, while p_dic_h + d points to the head of the second compression dictionary window and p_pre_r + d to the head of the second pre-read window. Divided in this way, one loop iteration can process q pairs of compression dictionary windows and pre-read data windows.
Preferably, step (6) specifically consists of carrying out the following steps for each pending compression dictionary window and its corresponding pre-read data window:
(6-1) a counter i = 0 is set;
(6-2) a thread T1 with thread number th1 is one of threads c*k to c*(k+1)-1 of the thread set threads[a*b/2], where 0 <= k < q; it judges whether byte (th1 mod c) of the k-th compression dictionary window matches each of bytes i*16 to (i+1)*16-1 of the k-th pre-read data window, returning 1 when two bytes are equal and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d + i*16) to ((th1 mod c)*d + i*16 + 16) of the k-th gMatrix matrix in global memory;
(6-3) i = i+1 is set and whether i < n is judged; if so, the process returns to step (6-2), otherwise it proceeds to step (7).
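Conceptually, steps (6-1) to (6-3) fill a c*d zero/one matrix whose entry (r, col) records whether dictionary byte r equals pre-read byte col. A minimal illustrative sketch, with a plain nested comprehension in place of the patent's per-thread GPU kernel:

```python
def build_gmatrix(dic, pre):
    """gMatrix[r][col] = 1 iff dictionary byte r equals pre-read byte col.
    Serial stand-in for step (6): on the GPU each thread fills one
    16-byte slice of one row of this matrix."""
    return [[1 if dr == pc else 0 for pc in pre] for dr in dic]
```

A run of 1s along a down-right diagonal of this matrix is exactly a matching substring between the dictionary window and the pre-read window, which is what step (7) goes on to extract.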
Preferably, when length is less than 3, no match has been found; in that case x and y are directly assigned the value -1.
Preferably, step (7) specifically comprises the following sub-steps:
(7-1) a thread T2 with thread number th2 is one of threads c*k to c*(k+1)+d-1 of the thread set threads[a*b]; T2 is responsible for finding the maximal sub-segment of consecutive 1s within one diagonal segment and recording the parameters x, y and length of that sub-segment;
(7-2) thread T2 stores the data x, y and length it obtains into element (th2 mod p) of the triple result array locations.
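Steps (7-1) and (7-2) can be pictured as scanning the p = c+d-1 down-right diagonals of gMatrix: a run of 1s starting at matrix position (x, y) means dictionary bytes starting at x match pre-read bytes starting at y. An illustrative serial sketch, where the per-diagonal loop plays the role of thread T2 and the minimum run length of 3 follows the no-match rule stated above:

```python
def diagonal_best_runs(g):
    """Step (7) sketch: for each of the p = c+d-1 down-right diagonals of a
    c*d 0/1 matrix, return the longest run of 1s on that diagonal as
    (x, y, length), or (-1, -1, 0) when the run is shorter than 3."""
    c, d = len(g), len(g[0])
    locations = []
    for k in range(-(d - 1), c):          # diagonal defined by r - col == k
        r, col = (k, 0) if k >= 0 else (0, -k)
        best = (-1, -1, 0)
        run, run_start = 0, (r, col)
        while r < c and col < d:
            if g[r][col]:
                if run == 0:
                    run_start = (r, col)  # (x, y) where this run begins
                run += 1
                if run >= 3 and run > best[2]:
                    best = (run_start[0], run_start[1], run)
            else:
                run = 0
            r += 1
            col += 1
        locations.append(best)
    return locations
```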
Preferably, step (9) specifically comprises the following sub-steps:
(9-1) the match result array match[q] is transferred from the GPU to the main memory of the CPU;
(9-2) the CPU restores the data stored in the match result array match[q] into the offset and length of the longest matching substring of the pre-read window within the compression dictionary window, and outputs the compressed-encoding triple array compress[q]; each element of the array stores a triple (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are respectively the offset and length of the longest matching substring of the pre-read window within the corresponding compression dictionary window;
(9-3) the file to be compressed is compressed according to the flag byte.
Preferably, in step (9-3), the flag bytes obtained comprise raw flag bytes and mixed flag bytes. The first bit of a raw flag byte is 0 and its last 7 bits give the length of the raw data output; at most 128 consecutive raw data bytes can follow. The first bit of a mixed flag byte is 1 and its last 7 bits describe 7 items of mixed raw and compressed data: a 0 in the corresponding bit means the output is raw data, a 1 means the output is a compressed code. Data compression is thereby realized; where no match is found, the data is output as-is.
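The two flag-byte layouts can be sketched as follows. Note the assumptions: the exact bit ordering (most significant bit first, item i mapped to bit 6-i) is one reading of the description and is not confirmed by the patent text, and the raw count is capped at 127 here since 7 bits directly encode at most 127 (the stated maximum of 128 presumably relies on a length-minus-one encoding).

```python
def raw_flag(n):
    """Raw flag byte: top bit 0, low 7 bits the count of raw bytes that
    follow (capped at 127 in this sketch; see the assumption above)."""
    assert 1 <= n <= 127
    return n

def mixed_flag(item_is_code):
    """Mixed flag byte: top bit 1; bit (6 - i) set iff item i of the next
    7 output items is a compressed code rather than raw data
    (bit ordering is an assumption, not stated in the patent)."""
    assert 1 <= len(item_is_code) <= 7
    flag = 0x80
    for i, is_code in enumerate(item_is_code):
        if is_code:
            flag |= 1 << (6 - i)
    return flag
```

Under these assumptions, a raw flag byte announcing 3 raw bytes has value 3 (00000011), and a mixed flag byte whose first two items are compressed codes has value 224 (11100000), matching the two flag values that appear in the worked example later in this document.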
According to another aspect of the present invention, a high-speed lossless data compression system based on a GPU-CPU hybrid platform is provided, comprising:
The first module is for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;
The second module is for configuring the thread-block set bk[a] on the GPU, with b threads in each thread block, where a is the total number of thread blocks;
The third module is for setting the compression dictionary window length to c and setting a head pointer p_dic_h pointing to the first compression dictionary window;
The fourth module is for setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, with initial value p_dic_h + c;
The fifth module is for initializing the working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
The sixth module is for calling the first a*b/2 threads of the working thread set threads[a*b] to process q = (a*b/2)/c data segments of length c+d in the file to be compressed;
The seventh module is for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining a triple result array locations[p] for each result matrix; each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment;
The eighth module is for finding the element with the maximum length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 finds the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and deposits its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
The ninth module is for compressing the file to be compressed according to the match result array match[q];
The tenth module is for judging whether the pointer p_pre_r has reached the tail of the file to be compressed; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to the sixth module.
In general, compared with the prior art, the above technical scheme conceived by the present invention can obtain the following beneficial effects:
(1) the present invention accelerates data matching through parallel matrix matching: the proposed scheme matches q matrices of size c*d in parallel in one loop iteration, increasing the step length of the matching loop from d bytes to q*d bytes and thereby accelerating matching;
(2) the present invention realizes asynchronous processing, overlapping the search for redundant data with the output of the compressed encoding, so that operations on the CPU and GPU proceed in a pipeline-like fashion; this also reduces the total compression time to some extent and improves compression speed.
Brief description of the drawings
Fig. 1 is a flow chart of the high-speed lossless data compression method based on the GPU-CPU hybrid platform of the present invention.
Fig. 2 is a schematic diagram of the division of the compression dictionary windows of the file to be compressed and their corresponding pre-read windows.
Fig. 3 to Fig. 6 are schematic diagrams of an application example of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below can be combined with each other provided that they do not conflict.
As shown in Figure 1, the high-speed lossless data compression method based on the GPU-CPU hybrid platform of the present invention comprises the following steps:
(1) the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) a thread-block set bk[a] is configured on the GPU, with b threads in each thread block, where a is the total number of thread blocks and is a positive integer, and b ranges from 256 to 1024;
(3) the length of the compression dictionary window is set to c, and a head pointer p_dic_h pointing to the first compression dictionary window is set; specifically, c ranges from 2 KB to 8 KB with a preferred value of 4 KB, and the initial value of p_dic_h points to the start of the file to be compressed;
(4) the pre-read window size is set to d, and a pointer p_pre_r pointing to the first pre-read window is set, with initial value p_dic_h + c; d equals 16*n, where n ranges from 1 to 8 with a preferred value of 4, so the preferred size of each pre-read window is 64 B;
(5) a working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d, are initialized;
(6) the first a*b/2 threads of the working thread set threads[a*b] are called to process q = (a*b/2)/c data segments of length c+d in the file to be compressed. The division of the compression dictionary windows of the file and their corresponding pre-read windows is shown in Figure 2: p_dic_h points to the head of the first compression dictionary window and p_pre_r to the head of the first pre-read window, while p_dic_h + d points to the head of the second compression dictionary window and p_pre_r + d to the head of the second pre-read window. Divided in this way, one loop iteration can process q pairs of compression dictionary windows and pre-read data windows. Specifically, for each pending compression dictionary window and its corresponding pre-read data window, the following steps are carried out:
(6-1) a counter i = 0 is set;
(6-2) a thread T1 with thread number th1 is one of threads c*k to c*(k+1)-1 of the thread set threads[a*b/2], where 0 <= k < q; it judges whether byte (th1 mod c) of the k-th compression dictionary window matches (i.e. equals) each of bytes i*16 to (i+1)*16-1 of the k-th pre-read data window, returning 1 when two bytes match and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d + i*16) to ((th1 mod c)*d + i*16 + 16) of the k-th gMatrix matrix in global memory;
(6-3) i = i+1 is set and whether i < n is judged; if so, the process returns to step (6-2), otherwise it proceeds to step (7);
(7) in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s is found, and a triple result array locations[p] is determined for each result matrix. Each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment (the number of 1s it contains). When length is less than 3, no match has been found (if the matched substring is too short, the code after compression is longer than the uncompressed data, which is meaningless); in that case x and y are meaningless and are directly assigned -1.
This step specifically comprises the following sub-steps:
(7-1) a thread T2 with thread number th2 is one of threads c*k to c*(k+1)+d-1 of the thread set threads[a*b]; T2 is responsible for finding the maximal sub-segment of consecutive 1s within one diagonal segment and recording the parameters x, y and length of that sub-segment;
(7-2) thread T2 stores the data x, y and length it obtains into element (th2 mod p) of the triple result array locations;
(8) the element with the maximum length value in the locations[p] array of each gMatrix is found: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 finds the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and deposits its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
(9) the file to be compressed is compressed according to the match result array match[q], specifically comprising the following sub-steps:
(9-1) the match result array match[q] is transferred from the GPU to the main memory of the CPU;
(9-2) the CPU restores the data stored in the match result array match[q] into the offset and length of the longest matching substring of the pre-read window within the compression dictionary window, and outputs the compressed-encoding triple array compress[q]; each element of the array stores a triple (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are respectively the offset and length of the longest matching substring of the pre-read window within the corresponding compression dictionary window;
(9-3) the file to be compressed is compressed according to the flag byte. Specifically, the flag bytes obtained are of two types, the raw flag byte and the mixed flag byte. The first bit of a raw flag byte is 0 and its last 7 bits give the length of the raw data output; at most 128 consecutive raw data bytes can follow. The first bit of a mixed flag byte is 1 and its last 7 bits describe 7 items of mixed raw and compressed data: a 0 in the corresponding bit means the output is raw data, a 1 means the output is a compressed code. Data compression is thereby realized; where no match is found, the data is output as-is;
(10) whether the pointer p_pre_r has reached the tail of the file to be compressed is judged; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to step (6).
The high-speed lossless data compression system based on the GPU-CPU hybrid platform of the present invention comprises:
The first module is for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;
The second module is for configuring the thread-block set bk[a] on the GPU, with b threads in each thread block, where a is the total number of thread blocks;
The third module is for setting the compression dictionary window length to c and setting a head pointer p_dic_h pointing to the first compression dictionary window;
The fourth module is for setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, with initial value p_dic_h + c;
The fifth module is for initializing the working thread set threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
The sixth module is for calling the first a*b/2 threads of the working thread set threads[a*b] to process q = (a*b/2)/c data segments of length c+d in the file to be compressed;
The seventh module is for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining a triple result array locations[p] for each result matrix; each element of the array stores a triple (x, y, length), where p is the number of diagonal segments of the result matrix, equal to c+d-1; x is the offset of the diagonal segment within the compression dictionary corresponding to its result matrix, y is its offset within the corresponding pre-read window, and length is the length of the segment;
The eighth module is for finding the element with the maximum length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3 is one of threads 0 to q-1 of the thread set threads[a*b]; T3 finds the element with the maximum length value in the triple result array locations[p] of each gMatrix matrix and deposits its parameters x, y and length in the global match result array match[q], each element of which also stores a triple (x, y, length);
The ninth module is for compressing the file to be compressed according to the match result array match[q];
The tenth module is for judging whether the pointer p_pre_r has reached the tail of the file to be compressed; if so, the process ends; otherwise the dictionary window and pre-read window slide forward, p_pre_r = p_pre_r + q*d and p_dic_h = p_dic_h + q*d are set, and the process returns to the sixth module.
Example
In order to clearly set forth the principle of the present invention, its implementation process is illustrated below.
(1) the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) the thread-block set bk[1024] is configured on the GPU, with 512 threads in each thread block;
(3) the length of the compression dictionary window is set to 4096 bytes, and the head pointer to the first compression dictionary window is p_dic_h = 0;
(4) the pre-read window length is set to 64 bytes, and the pointer to the first pre-read window is p_pre_r = 4096;
(5) the working thread set threads[1024*512] and 64 gMatrix matrices, each of size 4096*64, are initialized;
(6) 1024*256 threads of the working thread set threads[1024*512] are called to process 64 data segments of length (4096+64) B in the file to be compressed; the division of the compression dictionary windows of the file and their corresponding pre-read windows is shown in Figure 3:
p_dic_h points to the head of the first compression dictionary window and p_pre_r to the head of the first pre-read window, while p_dic_h + 64 points to the head of the second compression dictionary window and p_pre_r + 64 to the head of the second pre-read window. Divided in this way, one loop iteration can process 64 pairs of compression dictionary windows and pre-read data windows. Specifically, for each pending compression dictionary window and its corresponding pre-read data window the following steps are carried out (here we describe the processing of the first pair of compression dictionary window and pre-read data window):
For a simple and clear description of the compression process, suppose the data in the first compression dictionary window of the file to be compressed is "hellothisaeybc...isa", with a length of 4096 bytes, and the first 16 B of the first pre-read window are "thisisaexampletosh". Threads T0 to T4095, 4096 threads in total, have thread numbers 0 to 4095 respectively; Tm1 (m1 ∈ [0,4095]) is any one of them, with thread number m1. This thread judges whether byte m1 of the compression dictionary window matches (i.e. equals) each of bytes 0 to 15 of the pre-read data window, returning 1 when two bytes match and 0 otherwise, and writes these matching results back to positions m1*64 to m1*64+16 of the corresponding gMatrix matrix in global memory. Because only 16 B of data are selected for compression, only one loop iteration with i = 0 is carried out, obtaining 1/4 of the 4096*64 result matrix gMatrix; the result obtained this time has size 4096*16, as shown in Figure 4.
(7) Below we describe only how the diagonal segment with the most consecutive 1s is found in the first of the 64 result matrices gMatrix.
(7-1) threads T0 to T4110 in the thread set, (4096+15) threads in total, have thread numbers 0 to 4110 respectively; Tm2 (m2 ∈ [0,4110]) is any one of them, with thread number m2. Tm2 is responsible for finding the maximal sub-segment of consecutive 1s within one diagonal segment, as shown in Figure 5:
(7-2) thread Tm2 obtains its parameters x, y and length and stores them into element m2 of the triple result array locations;
In this example, threads T5 and T10 have each found a sub-segment of consecutive 1s and assign the corresponding locations elements: locations(5) = {5, 0, 6} and locations(10) = {10, 2, 3}; the value of every other element of the locations array is {-1, -1, 0}, as shown in Figure 6.
(8) the element with the maximum length value in the locations[p] array of each gMatrix is found: threads T0 to T63 in the thread set, 64 threads in total, have thread numbers 0 to 63 respectively; Tm3 (m3 ∈ [0,63]) is any one of them, with thread number m3. Tm3 is responsible for finding the element with the maximum length value in the locations array of each gMatrix matrix, and deposits its parameters x, y and length in the global match result array match[64], each element of which also stores a triple (x, y, length). In this example, the locations element with the maximum length value found by thread T0 in the 0th result matrix gMatrix is element 5, and its parameters are written to element 0 of the array match, i.e. match(0) = {5, 0, 6};
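Step (8) is a simple argmax reduction over each locations array; with the example values above it selects {5, 0, 6}. A one-line sketch, assuming the length field is the third element of each triple:

```python
def best_match(locations):
    """Step (8) sketch: pick the (x, y, length) triple with the greatest
    length from one gMatrix's locations array (one thread per matrix on
    the GPU, a keyed max() here)."""
    return max(locations, key=lambda t: t[2])
```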
(9) Compress the data file to be compressed according to the matching-result array match[64], which specifically comprises the following sub-steps:
(9-1) Transfer the matching-result array match[64] from the GPU to CPU main memory;
(9-2) The CPU reduces the data stored in the matching-result array match[64] to the offset and length of the longest matching substring of each pre-read window within its compression dictionary window, and outputs the compression-coding triples compress[64]. In this example, the data of match(0) reduce to an offset of 5 and a length of 6 for the longest match of the 0th pre-read window within the 0th compression dictionary window, and a mixed flag byte is also output, with flag value 224 (11100000), so that compress(0)={224, 5, 6}. After compression, the first (4096+16) bytes of the data file to be compressed are output as: hellothisaeybc ... isa22456amplet3osh. That is: the 4096 B of data in the first compression dictionary window cannot be compressed and are output as-is; the 22456 immediately following means that a substring 6 bytes long in the pre-read window has been compressed, its original text starting at offset 5 of the compression dictionary window and being 6 bytes long; then a string of 6 uncompressed raw bytes, amplet, is output; then a raw flag byte with value 3 (00000011); and finally the last 3 uncompressed raw bytes, osh.
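The flag-byte arithmetic in the worked example is hard to reproduce exactly, so the Python sketch below shows only one plausible reading of the format of step (9-2) (and of claim 8): a flag byte with high bit 0 announces a run of raw bytes, while a flag byte with high bit 1 describes 7 items, each bit selecting either one raw byte (0) or an (offset, length) code into the dictionary window (1). `decode_block` and the sample stream are illustrative assumptions, not taken from the patent.

```python
def decode_block(dictionary: bytes, stream: bytes) -> bytes:
    """Decode one compressed block under the assumed flag-byte format:
    high bit 0 -> the low 7 bits give a count of raw bytes that follow;
    high bit 1 -> the low 7 bits, most significant first, each select
    either an (offset, length) code pair (1) or a single raw byte (0)."""
    out = bytearray()
    i = 0
    while i < len(stream):
        flag = stream[i]
        i += 1
        if flag & 0x80:                      # mixed flag byte
            for bit in range(6, -1, -1):
                if i >= len(stream):
                    break
                if (flag >> bit) & 1:        # compression code
                    off, length = stream[i], stream[i + 1]
                    out += dictionary[off:off + length]
                    i += 2
                else:                        # one raw byte
                    out.append(stream[i])
                    i += 1
        else:                                # raw flag byte
            n = flag & 0x7F
            out += stream[i:i + n]
            i += n
    return bytes(out)

# mixed flag 0b11000000: the first item is the code (5, 6), the remaining
# six items are raw bytes; then a raw flag byte announcing 2 raw bytes
dictionary = b"hellothisaeybc"
stream = bytes([0b11000000, 5, 6]) + b"xample" + bytes([2]) + b"!!"
```

Because decompression only indexes into the dictionary window, it is sequential and cheap; the parallel effort of the method is entirely on the matching side.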
(10) Judge whether the pointer p_pre_r has reached the end of the data file to be compressed. If it points to the end of the file, the process finishes; otherwise, slide the dictionary window and the pre-read window forward by setting p_pre_r=p_pre_r+64*64 and p_dic_h=p_dic_h+64*64, and return to step (6).
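End to end, the loop of steps (6) through (10) amounts to sliding both windows over the file and emitting one best-match triple per pre-read window. The sketch below is a sequential CPU stand-in for the GPU pipeline, using brute-force matching in place of the gMatrix diagonals; all names are illustrative.

```python
def best_match(dictionary: bytes, window: bytes) -> tuple:
    """Longest substring of the pre-read window found anywhere in the
    dictionary window, as an (x, y, length) triple; (-1, -1, 0) if none."""
    best = (-1, -1, 0)
    for x in range(len(dictionary)):
        for y in range(len(window)):
            length = 0
            while (x + length < len(dictionary) and y + length < len(window)
                   and dictionary[x + length] == window[y + length]):
                length += 1
            if length > best[2]:
                best = (x, y, length)
    return best

def compress_stream(data: bytes, c: int, d: int) -> list:
    """Slide a c-byte dictionary window and a d-byte pre-read window
    forward over the data, collecting one match triple per window pair."""
    matches = []
    pos = c                                # first pre-read window begins
    while pos < len(data):                 # right after the first dictionary
        dictionary = data[pos - c:pos]
        window = data[pos:pos + d]
        matches.append(best_match(dictionary, window))
        pos += d                           # slide both windows forward
    return matches

matches = compress_stream(b"abcabcab", c=3, d=3)
```

Unlike the patent's version, this does the O(c*d) comparisons per window on one core; the GPU method assigns them to a*b/2 threads at once, which is where the claimed speedup comes from.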
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall all be included within its scope of protection.

Claims (9)

1. A high-speed lossless data compression method based on a GPU-CPU hybrid platform, characterized in that it comprises the following steps:
(1) the CPU reads the data file to be compressed and copies it from main memory to the global memory of the GPU;
(2) setting the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;
(3) setting the length of the compression dictionary window to c, and setting a head pointer p_dic_h pointing to the first compression dictionary window;
(4) setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, the initial value of this pointer being set to p_dic_h-c;
(5) initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
(6) calling (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of data of length c+d in the data file to be compressed;
(7) finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s, and determining the ternary result array locations[p] of each result matrix, each element of which stores a ternary result (x, y, length), where p is the number of diagonal segments of the result matrix and equals c+d-1, x denotes the offset of a diagonal segment relative to the compression dictionary window corresponding to its result matrix, y denotes the offset of the diagonal segment relative to the pre-read window corresponding to its result matrix, and length denotes the length of the diagonal segment;
(8) finding the element with the maximum length value in the locations[p] array corresponding to each gMatrix: a thread T3 with thread number th3 is set, T3 being one of the 0th to (q-1)th threads of the thread group threads[a*b]; thread T3 is responsible for finding the element with the maximum length value in the ternary result array locations[p] corresponding to one gMatrix matrix, and for depositing its parameters x, y and length into the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
(9) compressing the data file to be compressed according to the matching-result array match[q];
(10) judging whether the pointer p_pre_r has reached the end of the data file to be compressed; if so, the process finishes; otherwise, the dictionary window and the pre-read window are slid forward by setting p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and the process returns to step (6).
2. The high-speed lossless data compression method according to claim 1, characterized in that the value of d equals 16*n, where n ranges from 1 to 8.
3. The high-speed lossless data compression method according to claim 1, characterized in that in step (6), p_dic_h points to the head of the first compression dictionary window of the data file to be compressed, p_pre_r points to the head of its first pre-read window, (p_dic_h-d) points to the head of its second compression dictionary window, and (p_pre_r-d) points to the head of its second pre-read window; divided in this way, one iteration can process q pairs of compression dictionary windows and pre-read windows.
4. The high-speed lossless data compression method according to claim 1, characterized in that step (6) specifically comprises performing, for each pending compression dictionary window and its corresponding pre-read window, the following steps:
(6-1) setting a counter i=0;
(6-2) setting a thread T1 with thread number th1, T1 being one of the (c*k)th to (c*(k+1)-1)th threads of the thread group threads[a*b/2], where 0<=k<q; T1 judges whether byte (th1 mod c) of the kth compression dictionary window matches each of bytes i*16 to (i+1)*16-1 of the kth pre-read window, returning the value 1 when two bytes match and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d+i*16) through ((th1 mod c)*d+i*16+16) of the kth gMatrix matrix in global memory;
(6-3) setting i=i+1 and judging whether i<n; if so, returning to step (6-2), otherwise proceeding to step (7).
5. The high-speed lossless data compression method according to claim 1, characterized in that when length is less than 3, no match has been found, and in that case x and y are both assigned the value -1.
6. The high-speed lossless data compression method according to claim 1, characterized in that step (7) specifically comprises the following sub-steps:
(7-1) setting a thread T2 with thread number th2, T2 being one of the (c*k)th to (c*(k+1)+d-1)th threads of the thread group threads[a*b]; T2 is responsible for finding the longest sub-segment of consecutive 1s within one diagonal segment and for recording the parameters x, y and length corresponding to that sub-segment;
(7-2) thread T2 storing the resulting data x, y and length in element (th2 mod p) of the ternary result array locations.
7. The high-speed lossless data compression method according to claim 1, characterized in that step (9) specifically comprises the following sub-steps:
(9-1) transferring the matching-result array match[q] from the GPU to CPU main memory;
(9-2) the CPU reducing the data stored in the matching-result array match[q] to the offset and length of the longest matching substring of each pre-read window within its compression dictionary window, and outputting the compression-coding triples compress[q], each element of which stores a ternary result (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length respectively denote the offset and length of the longest matching substring of the pre-read window within the corresponding compression dictionary window;
(9-3) compressing the data file to be compressed according to the flag bytes.
8. The high-speed lossless data compression method according to claim 7, characterized in that in step (9-3), the flag byte obtained is either a raw flag byte or a mixed flag byte: the first bit of a raw flag byte is 0 and its last 7 bits give the length of the raw data output, so that at most 128 consecutive raw data bytes can follow; the first bit of a mixed flag byte is 1 and its last 7 bits describe 7 items of mixed raw and compressed data, the corresponding bit being 0 when raw data is output and 1 when a compression code is output, thereby achieving data compression; if no match is found, the data are output as-is.
9. A high-speed lossless data compression system based on a GPU-CPU hybrid platform, characterized in that it comprises:
a first module for reading the data file to be compressed and copying it from main memory to the global memory of the GPU;
a second module for setting the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;
a third module for setting the length of the compression dictionary window to c and setting a head pointer p_dic_h pointing to the first compression dictionary window;
a fourth module for setting the pre-read window size to d and a pointer p_pre_r pointing to the first pre-read window, the initial value of this pointer being set to p_dic_h-c;
a fifth module for initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
a sixth module for calling (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of data of length c+d in the data file to be compressed;
a seventh module for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s, and determining the ternary result array locations[p] of each result matrix, each element of which stores a ternary result (x, y, length), where p is the number of diagonal segments of the result matrix and equals c+d-1, x denotes the offset of a diagonal segment relative to the compression dictionary window corresponding to its result matrix, y denotes the offset of the diagonal segment relative to the pre-read window corresponding to its result matrix, and length denotes the length of the diagonal segment;
an eighth module for finding the element with the maximum length value in the locations[p] array corresponding to each gMatrix: a thread T3 with thread number th3 is set, T3 being one of the 0th to (q-1)th threads of the thread group threads[a*b]; thread T3 is responsible for finding the element with the maximum length value in the ternary result array locations[p] corresponding to one gMatrix matrix, and for depositing its parameters x, y and length into the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
a ninth module for compressing the data file to be compressed according to the matching-result array match[q];
a tenth module for judging whether the pointer p_pre_r has reached the end of the data file to be compressed; if so, the process finishes; otherwise, the dictionary window and the pre-read window are slid forward by setting p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and the process returns to the sixth module.
CN201310321071.7A 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform Active CN103427844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310321071.7A CN103427844B (en) 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310321071.7A CN103427844B (en) 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform

Publications (2)

Publication Number Publication Date
CN103427844A true CN103427844A (en) 2013-12-04
CN103427844B CN103427844B (en) 2016-03-02

Family

ID=49652097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310321071.7A Active CN103427844B (en) 2013-07-26 2013-07-26 High-speed lossless data compression method based on a GPU-CPU hybrid platform

Country Status (1)

Country Link
CN (1) CN103427844B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091305A (en) * 2014-08-11 2014-10-08 詹曙 Quick image segmentation method used for computer graph and image processing and based on GPU platform and morphological component analysis
CN104965761A (en) * 2015-07-21 2015-10-07 华中科技大学 Flow program multi-granularity division and scheduling method based on GPU/CPU hybrid architecture
CN105630529A (en) * 2014-11-05 2016-06-01 京微雅格(北京)科技有限公司 Loading method of FPGA (Field Programmable Gate Array) configuration file, and decoder
CN106019858A (en) * 2016-07-22 2016-10-12 合肥芯碁微电子装备有限公司 Direct writing type photoetching machine image data bit-by-bit compression method based on CUDA technology
CN107508602A (en) * 2017-09-01 2017-12-22 郑州云海信息技术有限公司 A kind of data compression method, system and its CPU processor
CN110007855A (en) * 2019-02-28 2019-07-12 华中科技大学 A kind of the 3D stacking NVM internal storage data compression method and system of hardware supported
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN111628779A (en) * 2020-05-29 2020-09-04 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN112463388A (en) * 2020-12-09 2021-03-09 广州科莱瑞迪医疗器材股份有限公司 SGRT data processing method and device based on multithreading

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101123723A (en) * 2006-08-11 2008-02-13 北京大学 Digital video decoding method based on image processor
CN101937082A (en) * 2009-07-02 2011-01-05 北京理工大学 GPU (Graphic Processing Unit) many-core platform based parallel imaging method of synthetic aperture radar
CN101937555A (en) * 2009-07-02 2011-01-05 北京理工大学 Parallel generation method of pulse compression reference matrix based on GPU (Graphic Processing Unit) core platform
US20110157192A1 (en) * 2009-12-29 2011-06-30 Microsoft Corporation Parallel Block Compression With a GPU
CN102436438A (en) * 2011-12-13 2012-05-02 华中科技大学 Sparse matrix data storage method based on ground power unit (GPU)
US8374242B1 (en) * 2008-12-23 2013-02-12 Elemental Technologies Inc. Video encoder using GPU
CN103177414A (en) * 2013-03-27 2013-06-26 天津大学 Structure-based dependency graph node similarity concurrent computation method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FARACH, M. et al.: "Optimal parallel dictionary matching and compression", Proceedings of the 7th ACM Symposium on Parallel Algorithms and Architectures (SPAA '95), 26 April 1995, pages 244-253 *
SHEN, K. et al.: "Overview of parallel processing approaches to image and video compression", SPIE 2186: Proceedings of the Conference on Image and Video Compression, 1 May 1994, pages 197-208 *
CUI, Chen: "Parallel algorithm design and implementation of key modules of an H.264 encoder based on GPU", China Master's Theses Full-text Database, Information Science and Technology, no. 10, 15 October 2012, pages 1-64 *
HU, Xiaoling: "Parallel implementation of H.264/AVC video compression coding on the CUDA platform", China Master's Theses Full-text Database, Information Science and Technology, no. 09, 15 September 2011, pages 1-68 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091305A (en) * 2014-08-11 2014-10-08 詹曙 Quick image segmentation method used for computer graph and image processing and based on GPU platform and morphological component analysis
CN104091305B (en) * 2014-08-11 2016-05-18 詹曙 Fast image segmentation method for computer graphics and image processing based on a GPU platform and morphological component analysis
CN105630529A (en) * 2014-11-05 2016-06-01 京微雅格(北京)科技有限公司 Loading method of FPGA (Field Programmable Gate Array) configuration file, and decoder
CN104965761A (en) * 2015-07-21 2015-10-07 华中科技大学 Flow program multi-granularity division and scheduling method based on GPU/CPU hybrid architecture
CN104965761B (en) * 2015-07-21 2018-11-02 华中科技大学 Multi-granularity partitioning and scheduling method for streaming programs based on a GPU/CPU hybrid architecture
CN106019858B (en) * 2016-07-22 2018-05-22 合肥芯碁微电子装备有限公司 Bit-by-bit compression method for direct-write lithography machine image data based on CUDA technology
CN106019858A (en) * 2016-07-22 2016-10-12 合肥芯碁微电子装备有限公司 Direct writing type photoetching machine image data bit-by-bit compression method based on CUDA technology
CN107508602A (en) * 2017-09-01 2017-12-22 郑州云海信息技术有限公司 A kind of data compression method, system and its CPU processor
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
CN110007855A (en) * 2019-02-28 2019-07-12 华中科技大学 A kind of the 3D stacking NVM internal storage data compression method and system of hardware supported
CN110007855B (en) * 2019-02-28 2020-04-28 华中科技大学 Hardware-supported 3D stacked NVM (non-volatile memory) memory data compression method and system
CN111628779A (en) * 2020-05-29 2020-09-04 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN111628779B (en) * 2020-05-29 2023-10-20 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN112463388A (en) * 2020-12-09 2021-03-09 广州科莱瑞迪医疗器材股份有限公司 SGRT data processing method and device based on multithreading

Also Published As

Publication number Publication date
CN103427844B (en) 2016-03-02

Similar Documents

Publication Publication Date Title
CN103427844B (en) High-speed lossless data compression method based on a GPU-CPU hybrid platform
US10915450B2 (en) Methods and systems for padding data received by a state machine engine
US10606787B2 (en) Methods and apparatuses for providing data received by a state machine engine
US8959135B2 (en) Data structure for tiling and packetizing a sparse matrix
EP2895968B1 (en) Optimal data representation and auxiliary structures for in-memory database query processing
CN108416427A (en) Convolution kernel accumulates data flow, compressed encoding and deep learning algorithm
TWI836132B (en) Storage system and method for dynamically scaling sort operation for storage system
CN102970043A (en) GZIP (GNUzip)-based hardware compressing system and accelerating method thereof
CN109672449B (en) Device and method for rapidly realizing LZ77 compression based on FPGA
CN109889205A (en) Encoding method and system, decoding method and system, and encoding and decoding method and system
CN114697654B (en) Neural network quantization compression method and system
CN114697672B (en) Neural network quantization compression method and system based on run Cheng Quanling coding
CN103995827B (en) High-performance sort method in MapReduce Computational frames
US11580055B2 (en) Devices for time division multiplexing of state machine engine signals
US20210027148A1 (en) Compression of neural network activation data
EP0961966A1 (en) N-way processing of bit strings in a dataflow architecture
Arming et al. Data compression in hardware—the burrows-wheeler approach
CN103336810B (en) A kind of Topology Analysis of Power Distribution Network method based on multi-core computer
CN101572693A (en) Equipment and method for parallel mode matching
CN202931290U (en) Compression hardware system based on GZIP
CN110349635A (en) A kind of parallel compression method of gene sequencing quality of data score
CN113392963B (en) FPGA-based CNN hardware acceleration system design method
CN102571107B (en) System and method for decoding high-speed parallel Turbo codes in LTE (Long Term Evolution) system
CN111897513B (en) Multiplier based on reverse polarity technology and code generation method thereof
CN107341113A (en) Cache compression method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant