CN103427844B - High-speed lossless data compression method based on a hybrid GPU-CPU platform - Google Patents

High-speed lossless data compression method based on a hybrid GPU-CPU platform

Info

Publication number
CN103427844B
CN103427844B CN201310321071.7A
Authority
CN
China
Prior art keywords
data
thread
window
length
compression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310321071.7A
Other languages
Chinese (zh)
Other versions
CN103427844A (en)
Inventor
金海
郑然
周斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201310321071.7A priority Critical patent/CN103427844B/en
Publication of CN103427844A publication Critical patent/CN103427844A/en
Application granted granted Critical
Publication of CN103427844B publication Critical patent/CN103427844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a high-speed lossless data compression method based on a hybrid GPU-CPU platform, comprising: the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU; a thread-block group bk[a] on the GPU and a number of threads b per block are configured; the length of the compression-dictionary window is set to c, and the head pointer to the first dictionary window is set to p_dic_h; the pre-read window size is set to d, with a pointer p_pre_r to the first pre-read window whose initial value is set to p_dic_h-c; a worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices of size c*d are initialized; (a*b/2) threads of the worker thread group threads[a*b] are invoked to process q=(a*b/2)/c segments of length c+d in the file to be compressed; in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s is found, and the ternary result array locations[p] of each result matrix is determined. The invention can greatly improve the compression speed of massive data.

Description

High-speed lossless data compression method based on a hybrid GPU-CPU platform
Technical field
The invention belongs to the field of computer data compression, and more particularly relates to a high-speed lossless data compression method based on a hybrid GPU-CPU platform.
Background technology
Research by International Data Corporation (IDC) shows that over the past decade the total amount of information in the world has doubled roughly every two years. In 2011 the total amount of data created and replicated worldwide was 1.8 ZB (1800 EB); it was expected to reach 8 ZB (8000 EB) by 2015, and in the coming decade, by 2020, the world's data was expected to grow more than 50-fold. Meanwhile, although network bandwidth and new storage technologies have developed rapidly, they still fall far short of the performance required for transferring and storing today's massive data volumes. One of the key technologies for meeting this challenge is data compression: it effectively reduces the volume of data that must be transferred and stored, thereby controlling transfer and storage costs and enabling low-cost, efficient data management.
Traditional compression theory and algorithms on the CPU platform focus on improving the compression ratio, whereas the era of massive data places higher demands on compression speed. To improve speed, parallel data compression has become a new direction of development. A prerequisite for parallel compression is finding data-level parallelism; partitioning the data into blocks and compressing the blocks simultaneously is a natural parallel approach. Existing schemes such as those of J. Gilchrist, GZIP, and S. Pradhan achieve parallel compression on the CPU platform using multithreading, multicore processors, and clusters, respectively. In practice, however, as the volume of data to be processed grows, the large communication volume produced by their interaction, the large amounts of memory occupied by intermediate results, and a CPU hardware architecture that is inherently unsuitable for large-scale parallel computation together prevent parallel compression algorithms on the CPU platform from reaching the targeted speed.
Some scholars have also ported classical lossless compression algorithms from the CPU platform, such as run-length encoding and BZIP2, to the GPU platform with improvements. To address the shortcomings of CPU-based compression described above, these algorithms concentrate on exploiting the GPU's shared and global memory so as to minimize communication between modules and reduce memory consumption, thereby improving compression speed. However, because these algorithms were originally designed for the CPU platform, they are inherently unsuited to the very different hardware architecture of the GPU, so in practical applications the achievable improvement in compression speed still leaves much to be desired.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a high-speed lossless data compression method based on a hybrid GPU-CPU platform. Its object is to decompose the data compression procedure into a parallel-computation part and a serial-computation part: the parallel part partitions the data into multiple compression dictionaries and pre-read windows, organizes them as multiple matrices, and hands the file to be compressed to multiple thread blocks on the GPU for parallel processing, while the serial part leaves the generation and output of the compressed encoding to the CPU. This mode of operation combines the respective strengths of the GPU and the CPU, greatly improving the compression speed of massive data while ensuring that the compression ratio does not decrease.
To achieve the above object, according to one aspect of the present invention, there is provided a high-speed lossless data compression method based on a hybrid GPU-CPU platform, comprising the following steps:
(1) The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) Configure the thread-block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;
(3) Set the length of the compression-dictionary window to c, and set the head pointer to the first dictionary window to p_dic_h;
(4) Set the pre-read window size to d and the pointer p_pre_r to the first pre-read window; the initial value of this pointer is set to p_dic_h-c;
(5) Initialize the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
(6) Invoke (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the file to be compressed;
(7) In each of the q result matrices gMatrix, find the diagonal segment with the most consecutive 1s, and determine the ternary result array locations[p] of each result matrix. Each element of the array stores a ternary result (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset, relative to its result matrix, of the dictionary position corresponding to the diagonal segment, y is the offset, relative to its result matrix, of the pre-read-window position corresponding to the diagonal segment, and length is the length of the diagonal segment;
(8) Find, in the locations[p] array of each gMatrix, the element with the maximum length value: a thread T3, with thread number th3, is one of threads 0 through (q-1) of thread group threads[a*b]; each such thread finds the element with the maximum length value in the ternary result array locations[p] of its gMatrix matrix and stores the corresponding parameters x, y, and length in the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
(9) Compress the file to be compressed according to the matching-result array match[q];
(10) Judge whether the pointer p_pre_r has reached the end of the file to be compressed; if so, the process ends. Otherwise slide the dictionary and pre-read windows forward, i.e. set p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and return to step (6).
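As a concrete illustration, the control flow of steps (6) through (10) can be sketched serially in Python. This is a sketch only: `longest_match` stands in for the GPU matrix matching of steps (6)-(8), the function names are illustrative, and the pre-read window is placed immediately after its dictionary window, consistent with the worked example later in this document (p_dic_h=0, p_pre_r=4096):

```python
def longest_match(dic: bytes, pre: bytes):
    """Stand-in for steps (6)-(8): the longest substring of `pre` that
    also occurs in `dic`, returned as (x, y, length) with x the offset
    in the dictionary window and y the offset in the pre-read window;
    (-1, -1, 0) when the best match is shorter than 3 bytes."""
    best = (-1, -1, 0)
    for y in range(len(pre)):
        for x in range(len(dic)):
            n = 0
            while (x + n < len(dic) and y + n < len(pre)
                   and dic[x + n] == pre[y + n]):
                n += 1
            if n > best[2]:
                best = (x, y, n)
    return best if best[2] >= 3 else (-1, -1, 0)


def compress_file(data: bytes, c: int = 4096, d: int = 64, q: int = 64):
    """Serial sketch of the window-sliding loop of steps (6)-(10);
    q corresponds to (a*b/2)/c on the GPU."""
    matches = []
    p_dic_h = 0
    p_pre_r = p_dic_h + c      # pre-read window right after the dictionary
    while p_pre_r < len(data):
        for k in range(q):     # one iteration handles q window pairs
            dic = data[p_dic_h + k * d : p_dic_h + k * d + c]
            pre = data[p_pre_r + k * d : p_pre_r + k * d + d]
            if not pre:
                break
            matches.append(longest_match(dic, pre))   # fills match[q]
        p_dic_h += q * d       # step (10): slide both windows forward
        p_pre_r += q * d
    return matches
```

The sketch only shows how the windows tile the file and how each iteration advances the matching loop by q*d bytes; on the GPU, `longest_match` is replaced by the massively parallel matrix matching of steps (6)-(8).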
Preferably, the value of d equals 16*n, where n ranges from 1 to 8.
Preferably, in step (6), p_dic_h points to the head of the first dictionary window of the file to be compressed and p_pre_r points to the head of its first pre-read window, while (p_dic_h-d) points to the head of the second dictionary window and (p_pre_r-d) to the head of the second pre-read window. Divided this way, one iteration of the loop can process q dictionary-window/pre-read-window pairs.
Preferably, step (6) specifically comprises, for each pending dictionary window and its corresponding pre-read data window, performing the following steps:
(6-1) Set a counter i=0;
(6-2) Configure threads T1 with thread numbers th1 running from (c*k) through (c*(k+1)-1) in thread group threads[a*b/2], where 0<=k<q. Each such thread judges whether byte (th1 mod c) of the k-th dictionary window matches each of bytes i*16 through (i+1)*16-1 of the k-th pre-read data window, returning 1 when two bytes match and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d+i*16) through ((th1 mod c)*d+i*16+16) of the k-th gMatrix matrix in global memory;
(6-3) Set i=i+1 and judge whether i<n; if so, return to step (6-2), otherwise proceed to step (7).
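Sub-steps (6-1) through (6-3) amount to filling a c*d bit matrix whose entry (x, y) records whether dictionary byte x equals pre-read byte y; on the GPU each thread T1 fills one row, 16 columns per pass over i. A serial Python sketch (the function name is illustrative):

```python
def build_gmatrix(dic: bytes, pre: bytes):
    """Serial sketch of step (6-2): gMatrix[x][y] = 1 exactly when byte x
    of the dictionary window equals byte y of the pre-read window.
    On the GPU, thread th1 fills row (th1 mod c), writing one 16-entry
    strip per pass i of sub-step (6-3)."""
    return [[1 if db == pb else 0 for pb in pre] for db in dic]
```

A match of length L between the two windows then shows up in this matrix as a run of L consecutive 1s along a diagonal, which is what step (7) searches for.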
Preferably, when length is less than 3, no match was found; in this case x and y are simply assigned the value -1.
Preferably, step (7) specifically comprises the following sub-steps:
(7-1) Configure threads T2 with thread numbers th2 running from (c*k) through (c*(k+1)+d-1) in thread group threads[a*b]; each T2 searches its diagonal segment for the maximal sub-segment of consecutive 1s and records the parameters x, y, and length of that sub-segment;
(7-2) The data x, y, and length obtained by thread T2 are stored in element (th2 mod p) of the ternary result array locations.
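A match between the dictionary substring at offset x and the pre-read substring at offset y appears in gMatrix as a run of consecutive 1s along the diagonal through (x, y), which is why step (7) assigns one thread T2 to each of the p = c+d-1 diagonals. A serial sketch of that scan (the indexing convention and names are illustrative):

```python
def scan_diagonals(g):
    """Serial sketch of step (7): for each of the p = c+d-1 diagonals of
    the c*d matrix g, record the longest run of consecutive 1s as
    (x, y, length), or (-1, -1, 0) when the run is shorter than 3
    (a match under 3 bytes would encode longer than the literals)."""
    c, d = len(g), len(g[0])
    locations = []
    for diag in range(c + d - 1):      # one GPU thread T2 per diagonal
        # Diagonals starting in column 0, then those starting in row 0.
        x, y = (diag, 0) if diag < c else (0, diag - c + 1)
        best, run, sx, sy = (-1, -1, 0), 0, x, y
        while x < c and y < d:
            if g[x][y] == 1:
                if run == 0:
                    sx, sy = x, y      # a new run starts here
                run += 1
                if run > best[2]:
                    best = (sx, sy, run)
            else:
                run = 0
            x += 1
            y += 1
        locations.append(best if best[2] >= 3 else (-1, -1, 0))
    return locations
```

Each diagonal is independent of the others, which is what makes one-thread-per-diagonal a natural GPU mapping.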
Preferably, step (9) specifically comprises the following sub-steps:
(9-1) The matching-result array match[q] is transferred from the GPU to the main memory of the CPU;
(9-2) The CPU restores the data stored in the matching-result array match[q] into the offset and length of the longest matching substring of each pre-read window within its dictionary window, and outputs the compressed-encoding triple array compress[q]. Each element of the array stores a ternary result (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are respectively the offset and length of the longest matching substring of the pre-read window within the corresponding dictionary window;
(9-3) Compress the file to be compressed according to the flag byte.
Preferably, in step (9-3), the flag bytes obtained are of two types, literal flag bytes and mixed flag bytes. The first bit of a literal flag byte is 0 and its remaining 7 bits give the length of the literal data output, so at most 128 consecutive literal bytes can follow it. The first bit of a mixed flag byte is 1 and its remaining 7 bits describe the next 7 items of mixed literal and compressed data: a 0 bit indicates that the corresponding output item is literal data, and a 1 bit indicates that it is a compressed encoding. Compression is realized in this way; if no match is found, the data is output unchanged.
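The two flag-byte formats of step (9-3) can be sketched as follows. The exact bit order within the low 7 bits of the mixed flag byte is not spelled out in the text, so the sketch assumes most-significant-bit-first, an assumption merely arithmetically consistent with the worked example's flag value 224 (11100000); both function names are illustrative:

```python
def literal_flag(n: int) -> int:
    """Literal flag byte: leading bit 0, low 7 bits = the number n of
    literal bytes that follow. The worked example uses flag 3
    (00000011) for 3 literals; reaching the stated maximum of 128
    would need an n-1 encoding, which is an assumption not made here."""
    assert 0 < n <= 127
    return n                         # leading bit is already 0

def mixed_flag(kinds) -> int:
    """Mixed flag byte: leading bit 1; each of the low 7 bits describes
    one of the next (up to 7) output items, 1 = compressed
    (offset, length) token, 0 = literal byte. MSB-first bit order is
    an assumption consistent with flag 224 = 0b11100000."""
    assert len(kinds) <= 7
    bits = 0
    for i, is_compressed in enumerate(kinds):
        if is_compressed:
            bits |= 1 << (6 - i)     # MSB-first within the low 7 bits
    return 0x80 | bits
```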
According to another aspect of the present invention, there is provided a high-speed lossless data compression system based on a hybrid GPU-CPU platform, comprising:
A first module for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;
A second module for configuring the thread-block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;
A third module for setting the length of the compression-dictionary window to c, and setting the head pointer to the first dictionary window to p_dic_h;
A fourth module for setting the pre-read window size to d and the pointer p_pre_r to the first pre-read window, the initial value of this pointer being set to p_dic_h-c;
A fifth module for initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
A sixth module for invoking (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the file to be compressed;
A seventh module for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining the ternary result array locations[p] of each result matrix, each element of the array storing a ternary result (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset, relative to its result matrix, of the dictionary position corresponding to the diagonal segment, y is the offset, relative to its result matrix, of the pre-read-window position corresponding to the diagonal segment, and length is the length of the diagonal segment;
An eighth module for finding, in the locations[p] array of each gMatrix, the element with the maximum length value: a thread T3, with thread number th3, is one of threads 0 through (q-1) of thread group threads[a*b]; each such thread finds the element with the maximum length value in the ternary result array locations[p] of its gMatrix matrix and stores the corresponding parameters x, y, and length in the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
A ninth module for compressing the file to be compressed according to the matching-result array match[q];
A tenth module for judging whether the pointer p_pre_r has reached the end of the file to be compressed: if so, the process ends; otherwise the dictionary and pre-read windows slide forward, i.e. p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d are set, and control returns to the sixth module.
In general, compared with the prior art, the technical scheme conceived above by the present invention yields the following beneficial effects:
(1) The present invention accelerates data matching through parallel matrix matching: the proposed mode of operation matches q matrices of size c*d in parallel in a single iteration, increasing the step length of the matching loop from d bytes to q*d bytes and thereby accelerating matching;
(2) The present invention achieves asynchronous processing, overlapping the search for redundant data with the output of the compressed encoding, so that the operations on the CPU and the GPU can execute in a pipeline-like fashion; this also reduces the overall compression time to some extent and improves compression speed.
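The overlap described in (2) can be sketched as a two-stage producer/consumer pipeline. This is a schematic of the idea only: in the real system the producer stage would be the GPU matching kernel plus asynchronous copies and the consumer stage the CPU encoder; all names here are illustrative:

```python
import queue
import threading

def pipelined(blocks, gpu_match, cpu_encode):
    """Sketch of the asynchronous overlap: while the consumer (standing
    in for the CPU) encodes the match results of block k, the producer
    (standing in for the GPU) is already matching block k+1."""
    handoff = queue.Queue(maxsize=2)   # bounded, like a double buffer
    results = []

    def producer():                    # plays the role of the GPU stage
        for block in blocks:
            handoff.put(gpu_match(block))
        handoff.put(None)              # end-of-stream marker

    worker = threading.Thread(target=producer)
    worker.start()
    while (m := handoff.get()) is not None:   # CPU stage
        results.append(cpu_encode(m))
    worker.join()
    return results
```

The bounded queue is what gives the pipeline behavior: the producer can run at most two blocks ahead, so matching and encoding proceed concurrently without unbounded buffering.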
Accompanying drawing explanation
Fig. 1 is a flow chart of the high-speed lossless data compression method based on the hybrid GPU-CPU platform of the present invention.
Fig. 2 is a schematic diagram of the division of the file to be compressed into compression-dictionary windows and their corresponding pre-read windows.
Fig. 3 to Fig. 6 are schematic diagrams of an application example of the present invention.
Detailed description
In order to make the objects, technical scheme, and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with one another as long as they do not conflict.
As shown in Fig. 1, the high-speed lossless data compression method of the present invention based on the hybrid GPU-CPU platform comprises the following steps:
(1) The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) Configure the thread-block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks and a positive integer, and b ranges from 256 to 1024;
(3) Set the length of the compression-dictionary window to c, and set the head pointer to the first dictionary window to p_dic_h. Specifically, c ranges from 2 KB to 8 KB, with a preferred value of 4 KB, and the initial value of p_dic_h points to the beginning of the file to be compressed;
(4) Set the pre-read window size to d and the pointer p_pre_r to the first pre-read window; the initial value of this pointer is set to p_dic_h-c. The value of d equals 16*n, where n ranges from 1 to 8 with a preferred value of 4, i.e. the preferred size of each pre-read window is 64 B;
(5) Initialize the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
(6) Invoke (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the file to be compressed. The division of the file into compression-dictionary windows and their corresponding pre-read windows is shown in Fig. 2: p_dic_h points to the head of the first dictionary window and p_pre_r to the head of the first pre-read window, while (p_dic_h-d) points to the head of the second dictionary window and (p_pre_r-d) to the head of the second pre-read window. Divided this way, one iteration of the loop can process q dictionary-window/pre-read-window pairs. Specifically, for each pending dictionary window and its corresponding pre-read data window, the following steps are performed:
(6-1) Set a counter i=0;
(6-2) Configure threads T1 with thread numbers th1 running from (c*k) through (c*(k+1)-1) in thread group threads[a*b/2], where 0<=k<q. Each such thread judges whether byte (th1 mod c) of the k-th dictionary window matches (i.e. equals) each of bytes i*16 through (i+1)*16-1 of the k-th pre-read data window, returning 1 when two bytes match (i.e. are equal) and 0 otherwise, and writes these matching results back to positions ((th1 mod c)*d+i*16) through ((th1 mod c)*d+i*16+16) of the k-th gMatrix matrix in global memory;
(6-3) Set i=i+1 and judge whether i<n; if so, return to step (6-2), otherwise proceed to step (7);
(7) In each of the q result matrices gMatrix, find the diagonal segment with the most consecutive 1s, and determine the ternary result array locations[p] of each result matrix. Each element of the array stores a ternary result (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset, relative to its result matrix, of the dictionary position corresponding to the diagonal segment, y is the offset, relative to its result matrix, of the pre-read-window position corresponding to the diagonal segment, and length is the length of the diagonal segment (i.e. the number of 1s it contains). When length is less than 3, no match was found (if the matching substring is shorter than three bytes, the code after compression is longer than the uncompressed data, which is pointless); in this case x and y are meaningless and can simply be assigned -1.
This step specifically comprises the following sub-steps:
(7-1) Configure threads T2 with thread numbers th2 running from (c*k) through (c*(k+1)+d-1) in thread group threads[a*b]; each T2 searches its diagonal segment for the maximal sub-segment of consecutive 1s and records the parameters x, y, and length of that sub-segment;
(7-2) The data x, y, and length obtained by thread T2 are stored in element (th2 mod p) of the ternary result array locations;
(8) Find, in the locations[p] array of each gMatrix, the element with the maximum length value: a thread T3, with thread number th3, is one of threads 0 through (q-1) of thread group threads[a*b]; each such thread finds the element with the maximum length value in the ternary result array locations[p] of its gMatrix matrix and stores the corresponding parameters x, y, and length in the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
(9) Compress the file to be compressed according to the matching-result array match[q], which specifically comprises the following sub-steps:
(9-1) The matching-result array match[q] is transferred from the GPU to the main memory of the CPU;
(9-2) The CPU restores the data stored in the matching-result array match[q] into the offset and length of the longest matching substring of each pre-read window within its dictionary window, and outputs the compressed-encoding triple array compress[q]. Each element of the array stores a ternary result (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are respectively the offset and length of the longest matching substring of the pre-read window within the corresponding dictionary window;
(9-3) Compress the file to be compressed according to the flag byte. Specifically, the flag bytes obtained are of two types, literal flag bytes and mixed flag bytes. The first bit of a literal flag byte is 0 and its remaining 7 bits give the length of the literal data output, so at most 128 consecutive literal bytes can follow it. The first bit of a mixed flag byte is 1 and its remaining 7 bits describe the next 7 items of mixed literal and compressed data: a 0 bit indicates that the corresponding output item is literal data, and a 1 bit indicates that it is a compressed encoding. Compression is realized in this way; if no match is found, the data is output unchanged;
(10) Judge whether the pointer p_pre_r has reached the end of the file to be compressed; if so, the process ends. Otherwise slide the dictionary and pre-read windows forward, i.e. set p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and return to step (6).
The high-speed lossless data compression system based on the hybrid GPU-CPU platform of the present invention comprises:
A first module for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;
A second module for configuring the thread-block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;
A third module for setting the length of the compression-dictionary window to c, and setting the head pointer to the first dictionary window to p_dic_h;
A fourth module for setting the pre-read window size to d and the pointer p_pre_r to the first pre-read window, the initial value of this pointer being set to p_dic_h-c;
A fifth module for initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;
A sixth module for invoking (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the file to be compressed;
A seventh module for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining the ternary result array locations[p] of each result matrix, each element of the array storing a ternary result (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset, relative to its result matrix, of the dictionary position corresponding to the diagonal segment, y is the offset, relative to its result matrix, of the pre-read-window position corresponding to the diagonal segment, and length is the length of the diagonal segment;
An eighth module for finding, in the locations[p] array of each gMatrix, the element with the maximum length value: a thread T3, with thread number th3, is one of threads 0 through (q-1) of thread group threads[a*b]; each such thread finds the element with the maximum length value in the ternary result array locations[p] of its gMatrix matrix and stores the corresponding parameters x, y, and length in the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
A ninth module for compressing the file to be compressed according to the matching-result array match[q];
A tenth module for judging whether the pointer p_pre_r has reached the end of the file to be compressed: if so, the process ends; otherwise the dictionary and pre-read windows slide forward, i.e. p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d are set, and control returns to the sixth module.
Example
In order to clearly set forth the principle of the present invention, its implementation process is illustrated below.
(1) The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU;
(2) Configure the thread-block group bk[1024] on the GPU, with 512 threads in each thread block;
(3) Set the length of the compression-dictionary window to 4096 bytes; the head pointer to the first dictionary window is p_dic_h=0;
(4) Set the pre-read window length to 64 bytes; the pointer to the first pre-read window is p_pre_r=4096;
(5) Initialize the worker thread group threads[1024*512] and 64 gMatrix matrices, each of size 4096*64;
(6) Invoke 1024*256 threads of the worker thread group threads[1024*512] to process 64 segments of length (4096+64) B in the file to be compressed. The division of the file into compression-dictionary windows and their corresponding pre-read windows is shown in Fig. 3:
p_dic_h points to the head of the first dictionary window and p_pre_r to the head of the first pre-read window, while (p_dic_h-64) points to the head of the second dictionary window and (p_pre_r-64) to the head of the second pre-read window. Divided this way, one iteration can process 64 dictionary-window/pre-read-window pairs. Specifically, for each pending dictionary window and its corresponding pre-read data window, the following steps are performed (here we describe the processing of the first dictionary-window/pre-read-window pair):
To describe the compression process simply and clearly, let the data of the first compression dictionary in the file to be compressed be "hellothisaeybc…isa", with a length of 4096 bytes, and let the first 16 B of the data of the first pre-read window be "thisisaexampletosh". Threads T0 through T4095, 4096 threads in all, have thread numbers 0 through 4095. Tm1 (m1 ∈ [0,4095]) is any one of these threads, with thread number m1; it judges whether byte m1 of the dictionary window matches (i.e. equals) each of bytes 0 through 15 of the pre-read data window, returning 1 when two bytes match (i.e. are equal) and 0 otherwise, and writes these matching results back to positions m1*64 through m1*64+16 of the corresponding gMatrix matrix in global memory. Because only 16 B of data are selected for compression, we execute only the iteration i=0 of the loop and obtain a result occupying one quarter of the 4096*64 result matrix gMatrix, i.e. a result of size 4096*16, as shown in Fig. 4:
(7) The following describes only how the diagonal segments with the most consecutive 1s are found in the first of the 64 result matrices gMatrix.
(7-1) Threads T0 through T4110, (4096+15) threads in all, have thread numbers 0 through 4110. Tm2 (m2 ∈ [0,4110]) is any one of these threads, with thread number m2; Tm2 searches its diagonal segment for the maximal sub-segment of consecutive 1s, as shown in Fig. 5:
(7-2) Thread Tm2 obtains the parameters x, y, and length and stores them in element m2 of the ternary result array locations;
In this example, threads T5 and T10 have found sub-segments of consecutive 1s and assign the corresponding elements of the locations array: locations(5)={5,0,6} and locations(10)={10,2,3}; all other elements of the locations array take the value {-1,-1,0}, as shown in Fig. 6:
(8) Find, in the locations array of each gMatrix, the element with the maximum length value: threads T0 through T63, 64 threads in all, have thread numbers 0 through 63. Tm3 (m3 ∈ [0,63]) is any one of these threads, with thread number m3; Tm3 finds the element with the maximum length value in the locations array of its gMatrix matrix and stores the corresponding parameters x, y, and length in the global matching-result array match[64], each element of which also stores a ternary result (x, y, length). In this example, thread T0 finds that the element of the locations array with the maximum length value in the 0th result matrix gMatrix is the 5th element, and writes its parameters into the 0th element of the array match, i.e. match(0)={5,0,6};
(9) The data file to be compressed is compressed according to the matching-result array match[64]; this step specifically comprises the following sub-steps:
(9-1) The matching-result array match[64] is transferred from the GPU to the main storage of the CPU;
(9-2) The CPU restores the data stored in the matching-result array match[64] into the offset and length of the longest match of each pre-read window within its compression dictionary window, and outputs the compressed-coding triples compress[64]. In this example, the data of match(0) are restored into the offset and length of the longest match of the 0th pre-read window within the 0th compression dictionary window, namely offset = 5 and length = 6, and a mixed flag byte with value 224 (11100000) is output, so that compress(0) = {224, 5, 6}. The first (4096+16) bytes of the data file to be compressed are then output after compression as: hellothisaeybc ... isa22456amplet3osh. Here the 4096 B of data in the first compression dictionary window cannot be compressed and are output verbatim; the "22456" following them indicates that a 6-byte substring of the pre-read data window has been compressed, its original text starting at offset 5 of the compression dictionary window and being 6 bytes long; next an uncompressed string "amplet", 6 bytes long, is output; then an original flag byte with value 3 (00000011) is output; and finally the 3 uncompressed original bytes "osh" are output.
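The flag-byte convention used in this output (and spelled out in claim 4) can be illustrated with a small decoder sketch; the exact bit ordering inside a mixed flag byte is an assumption made here for illustration:

```python
# A sketch of the flag-byte convention: a byte with its top bit clear is
# an "original" flag whose low 7 bits count the literal bytes that
# follow; a byte with its top bit set is a "mixed" flag whose low 7 bits
# mark each of the next 7 items as compressed (1) or literal (0).
def parse_flag(flag):
    """Classify a flag byte; returns ('literal', count) or ('mixed', bits)."""
    if flag & 0x80:
        # Bit ordering (most significant of the low 7 bits first) is an
        # illustrative assumption, not fixed by the patent text.
        bits = [(flag >> (6 - k)) & 1 for k in range(7)]
        return ("mixed", bits)
    return ("literal", flag & 0x7F)

print(parse_flag(224))  # ('mixed', [1, 1, 0, 0, 0, 0, 0])
print(parse_flag(3))    # ('literal', 3)
```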
(10) It is judged whether the pointer p_pre_r has reached the end of the data file to be compressed. If it points to the end of the file, the process terminates; otherwise the dictionary window and the pre-read window are slid forward, i.e. p_pre_r = p_pre_r + 64*64 and p_dic_h = p_dic_h + 64*64 are set, and the process returns to step (6).
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the present invention shall all fall within the scope of protection of the present invention.

Claims (4)

1. A high-speed lossless data compression method based on a hybrid GPU and CPU platform, characterized in that it comprises the following steps:
(1) the CPU reads the data file to be compressed and copies it from main memory into the global storage of the GPU;
(2) the thread-block group bk[a] on the GPU and the number of threads b in each thread block are set, where a is the total number of thread blocks;
(3) the length of the compression dictionary window is set to c, and the head pointer p_dic_h is set to point to the first compression dictionary window;
(4) the pre-read window size is set to d, along with the pointer p_pre_r pointing to the first pre-read window; the initial value of this pointer is set to p_dic_h-c;
(5) the worker thread group threads[a*b] is initialized, together with (a*b/2)/c gMatrix matrices, each of size c*d;
(6) the (a*b/2) threads in the worker thread group threads[a*b] are called to process q=(a*b/2)/c stretches of data of length c+d in the data file to be compressed; in step (6), p_dic_h points to the head of the first compression dictionary window of the data file to be compressed, p_pre_r points to the head of its first pre-read window, (p_dic_h-d) points to the head of the second compression dictionary window, and (p_pre_r-d) points to the head of the second pre-read window; divided in this way, q compression dictionary windows and pre-read data windows can be processed in one iteration; specifically, for the data of each pending compression dictionary window and of the pre-read data window corresponding to it, the following steps are performed respectively:
(6-1) a counter i=0 is set;
(6-2) a thread T1 with thread number th1 is set, T1 being one of the (c*k)-th to (c*(k+1)-1)-th threads in thread group threads[a*b/2], where 0<=k<q; T1 judges whether byte (th1 mod c) of the k-th compression dictionary window matches each of bytes i*16 to (i+1)*16-1 of the k-th pre-read data window, returning 1 when the two bytes match and 0 otherwise, and writes the matching results back to positions ((th1 mod c)*d+i*16) through ((th1 mod c)*d+i*16+16) of the k-th gMatrix matrix in global storage;
(6-3) i=i+1 is set and it is judged whether i<n; if so, proceed to step (6-2), otherwise proceed to step (7);
(7) in each of the q result matrices gMatrix, the diagonal segment with the longest run of consecutive 1s is found, and the ternary result array locations[p] of each result matrix is determined, each element of the array storing a ternary result (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x denotes the offset of the compression dictionary position corresponding to the diagonal segment within its result matrix, y denotes the offset of the pre-read window position corresponding to the diagonal segment within its result matrix, and length denotes the length of the diagonal segment; this step specifically comprises the following sub-steps:
(7-1) a thread T2 with thread number th2 is set, T2 being one of the (c*k)-th to (c*(k+1)+d-1)-th threads in thread group threads[a*b]; T2 is responsible for searching its diagonal segment for the sub-segment with the longest run of consecutive 1s and records that sub-segment's corresponding parameters x, y and length;
(7-2) thread T2 stores the obtained data x, y and length in the (th2 mod p)-th element of the ternary result array locations;
(8) the element with the largest length value in the locations[p] array of each gMatrix is found: a thread T3 with thread number th3 is set, T3 being one of the 0th to (q-1)-th threads in thread group threads[a*b]; T3 is responsible for finding the element with the largest length value in the ternary result array locations[p] corresponding to its gMatrix matrix and stores the corresponding parameters x, y and length in the global matching-result array match[q], each element of which also stores a ternary result (x, y, length);
(9) the data file to be compressed is compressed according to the matching-result array match[q]; this step specifically comprises the following sub-steps:
(9-1) the matching-result array match[q] is transferred from the GPU to the main storage of the CPU;
(9-2) the CPU restores the data stored in the matching-result array match[q] into the offset and length of the longest match of each pre-read window within its compression dictionary window, and outputs the compressed-coding triples compress[q]; each element of the compressed-coding triple array stores a ternary result (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length denote, respectively, the offset and length of the longest match of the pre-read window within its corresponding compression dictionary window;
(9-3) the data file to be compressed is compressed according to the flag byte;
(10) it is judged whether the pointer p_pre_r has reached the end of the data file to be compressed; if so, the process terminates; otherwise the dictionary window and the pre-read window are slid forward, i.e. p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d are set, and the process returns to step (6).
2. The high-speed lossless data compression method according to claim 1, characterized in that the value of d equals 16*n, where n ranges from 1 to 8.
3. The high-speed lossless data compression method according to claim 1, characterized in that a length value less than 3 indicates that no match was found, in which case x and y are both assigned the value -1.
4. The high-speed lossless data compression method according to claim 1, characterized in that in step (9-3) the flag bytes obtained comprise original flag bytes and mixed flag bytes; in an original flag byte the first bit is 0 and the remaining 7 bits denote the length of the original data output, so that at most 128 consecutive original data bytes can follow; in a mixed flag byte the first bit is 1 and the remaining 7 bits describe 7 items of mixed original and compressed data, the corresponding bit indicating original data when it is 0 and a compressed code when it is 1; data compression is thereby achieved, and if no match is found, the data are output verbatim.
CN201310321071.7A 2013-07-26 2013-07-26 A kind of high-speed lossless data compression method based on GPU and CPU mixing platform Active CN103427844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310321071.7A CN103427844B (en) 2013-07-26 2013-07-26 A kind of high-speed lossless data compression method based on GPU and CPU mixing platform

Publications (2)

Publication Number Publication Date
CN103427844A CN103427844A (en) 2013-12-04
CN103427844B true CN103427844B (en) 2016-03-02

Family

ID=49652097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310321071.7A Active CN103427844B (en) 2013-07-26 2013-07-26 A kind of high-speed lossless data compression method based on GPU and CPU mixing platform

Country Status (1)

Country Link
CN (1) CN103427844B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091305B (en) * 2014-08-11 2016-05-18 詹曙 A kind of for Fast image segmentation method computer graphic image processing, based on GPU platform and morphology PCA
CN105630529A (en) * 2014-11-05 2016-06-01 京微雅格(北京)科技有限公司 Loading method of FPGA (Field Programmable Gate Array) configuration file, and decoder
CN104965761B (en) * 2015-07-21 2018-11-02 华中科技大学 A kind of more granularity divisions of string routine based on GPU/CPU mixed architectures and dispatching method
CN106019858B (en) * 2016-07-22 2018-05-22 合肥芯碁微电子装备有限公司 A kind of direct-write type lithography machine image data bitwise compression method based on CUDA technologies
CN107508602A (en) * 2017-09-01 2017-12-22 郑州云海信息技术有限公司 A kind of data compression method, system and its CPU processor
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
CN110007855B (en) * 2019-02-28 2020-04-28 华中科技大学 Hardware-supported 3D stacked NVM (non-volatile memory) memory data compression method and system
CN111628779B (en) * 2020-05-29 2023-10-20 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN112463388B (en) * 2020-12-09 2023-03-10 广州科莱瑞迪医疗器材股份有限公司 SGRT data processing method and device based on multithreading

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101123723A (en) * 2006-08-11 2008-02-13 北京大学 Digital video decoding method based on image processor
CN101937082A (en) * 2009-07-02 2011-01-05 北京理工大学 GPU (Graphic Processing Unit) many-core platform based parallel imaging method of synthetic aperture radar
CN101937555A (en) * 2009-07-02 2011-01-05 北京理工大学 Parallel generation method of pulse compression reference matrix based on GPU (Graphic Processing Unit) core platform
CN102436438A (en) * 2011-12-13 2012-05-02 华中科技大学 Sparse matrix data storage method based on ground power unit (GPU)
US8374242B1 (en) * 2008-12-23 2013-02-12 Elemental Technologies Inc. Video encoder using GPU
CN103177414A (en) * 2013-03-27 2013-06-26 天津大学 Structure-based dependency graph node similarity concurrent computation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110157192A1 (en) * 2009-12-29 2011-06-30 Microsoft Corporation Parallel Block Compression With a GPU


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Parallel Implementation of H.264/AVC Video Compression Coding on the CUDA Platform; Hu Xiaoling; China Master's Theses Full-text Database, Information Science and Technology; 2011-09-15 (No. 09); pp. 1-68 *
Optimal parallel dictionary matching and compression; Farach M et al.; Proceedings of the 7th ACM Symposium on Parallel Algorithms and Architectures (SPAA 95); 1995-04-26; pp. 244-253 *
Overview of parallel processing approaches to image and video compression; Shen Ke et al.; SPIE 2186: Proceedings of the Conference on Image and Video Compression; 1994-05-01; pp. 197-208 *
Design and Implementation of Parallel Algorithms for Key Modules of a GPU-Based H.264 Encoder; Cui Chen; China Master's Theses Full-text Database, Information Science and Technology; 2012-10-15 (No. 10); pp. 1-64 *

Also Published As

Publication number Publication date
CN103427844A (en) 2013-12-04


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant