CN101866493A - Method for realizing fast wavelet transform by using GPU - Google Patents

Method for realizing fast wavelet transform by using GPU

Info

Publication number
CN101866493A
Authority
CN
China
Prior art keywords
data
gpu
memory
transposition
memory area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010204236
Other languages
Chinese (zh)
Other versions
CN101866493B (en)
Inventor
李云松
宋长贺
吴宪云
刘凯
王柯俨
肖嵩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN2010102042369A
Publication of CN101866493A
Application granted
Publication of CN101866493B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a method for realizing a fast wavelet transform by using a GPU, which comprises the following steps: (1) copying data from the host memory of a computer to memory area A of the GPU device; (2) performing a row transform on the data in memory area A with a horizontal filter, and storing the transform result back in memory area A; (3) performing a row-column transposition on the data in memory area A with a row-column transposer, and storing the transposed data in memory area B of the GPU device; (4) performing a row transform again on the data in memory area B with the horizontal filter, and storing the transform result back in memory area B; (5) performing a row-column transposition again on the data in memory area B with the row-column transposer, and storing the transposed data back in memory area A; and (6) copying the data in memory area A of the GPU device back to the host memory of the computer. The method processes data in parallel, is simple to implement, and improves the processing speed of the wavelet transform.

Description

Method for realizing a fast wavelet transform with a GPU
Technical field
The present invention relates to the technical field of image processing, and specifically to a method of realizing a fast wavelet transform using the graphics processing unit (GPU) of a computer, so that high-speed image coding can be implemented on a general-purpose computer.
Background art
In the field of digital image processing, the wavelet transform has found wide application owing to its outstanding performance in the JPEG2000 image compression standard. However, the wavelet transform is computationally complex and the amount of data to be processed is generally large, so it is slow, which has seriously limited its application; optimization measures are therefore needed. At present, most optimizations of the wavelet transform are based on application-specific integrated circuits (ASIC) or programmable logic devices (FPGA), which suffer from long development cycles, high difficulty, and high cost.
In view of this situation, researchers have proposed performing the wavelet transform on GPU devices. Relying on many cores working together and very high memory bandwidth, GPU devices achieve very high computing power. In the article "Implementation of the Daubechies 9/7 discrete wavelet transform on GPU" (China Multimedia Communications, 2007, No. 12, pp. 45-50), researchers at Nanjing University of Posts and Telecommunications proposed a fast wavelet transform method aimed at the characteristics of GPU devices. That method implements the wavelet transform with a parallel strategy and, to cope with the limited shared memory of the GPU device, splits every row or column of the data into segments. Although the method has good generality and can handle large images, as the authors themselves acknowledge it inevitably introduces new problems. The one that harms efficiency most is that the head and tail of every segment of every row or column must be symmetrically extended; each extension requires a single thread to handle the head and tail data of its segment alone, during which all the other threads are idle, wasting resources. Another shortcoming is that the column transform follows the row transform, so every column of data must be gathered out, and the method does not propose an effective way to rearrange the data, which makes the column transform inefficient.
Summary of the invention
In view of the problems above, the present invention proposes a fast wavelet transform method that is realized with a GPU, does not segment the data, and has a dedicated row-column transposition function.
The steps of the present invention are as follows:
Step 1: copy the image data to be wavelet-transformed from the host memory to memory area A of the GPU device;
Step 2: using the horizontal filter, perform a row transform on the data in memory area A of the GPU device obtained in step 1, and store the transform result back in memory area A of the GPU device;
Step 3: using the row-column transposer, perform a row-column transposition on the data in memory area A of the GPU device obtained in step 2, and store the transposed data in memory area B of the GPU device;
Step 4: using the horizontal filter, perform a row transform again on the data in memory area B of the GPU device obtained in step 3, and store the transform result back in memory area B of the GPU device;
Step 5: using the row-column transposer, perform a row-column transposition again on the data in memory area B of the GPU device obtained in step 4, and store the transposed data in memory area A of the GPU device;
Step 6: copy the data in memory area A of the GPU device back to the host memory.
Compared with the prior art, the present invention has the following advantages:
First, the present invention only needs enough GPU shared memory to buffer one row or one column of data, and as the design level of GPU devices improves, the available shared memory grows accordingly. The present invention is therefore not easily limited by a shortage of GPU shared memory.
Second, the horizontal filter designed by the present invention makes full use of the multiprocessors and high-speed shared memory of the GPU device, and therefore has a very high processing speed.
Third, the row-column transposer designed by the present invention likewise uses the multiprocessors and high-speed shared memory of the GPU device, and therefore has a very high processing speed. By using this high-speed row-column transposer, the loss of efficiency caused by the non-contiguous storage of column data during the column transform is avoided. At the same time, because the column transform is realized through a row transform, the present invention no longer needs a separate vertical filter, which makes a system realized by this method simpler.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention.
Detailed description of the embodiments
The present invention is written in the CUDA language and can be realized on any NVIDIA GPU device that supports the CUDA architecture. Before the method of the present invention is carried out, the cudaMalloc function should first be called to allocate two memory areas of equal size on the GPU device, denoted memory areas A and B respectively. After the method of the present invention has been used, the cudaFree function should be called to release these two memory areas.
With reference to Fig. 1, the present invention can be realized by the following steps:
Step 1: copy the data from the computer host memory to the GPU device memory. The copy from the computer host memory to the GPU device memory can be performed with the function cudaMemcpy or cudaMemcpy2D, as sketched below.
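A minimal sketch of the preparation step and step 1, assuming a single-precision image of width w and height h whose rows are stored contiguously; the names h_img, d_A, d_B and bytes are illustrative and not part of the patent:

    // Preparation: allocate two device memory areas of equal size (areas A and B).
    float *d_A = NULL, *d_B = NULL;
    size_t bytes = (size_t)w * h * sizeof(float);
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);

    // Step 1: copy the image from the host memory into memory area A.
    cudaMemcpy(d_A, h_img, bytes, cudaMemcpyHostToDevice);
    // For host data whose row pitch differs from w, cudaMemcpy2D(dst, dpitch,
    // src, spitch, widthInBytes, height, kind) may be used instead.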
Step 2: using the horizontal filter, perform a row transform on the data in memory area A of the GPU device obtained in step 1, and store the transform result back in memory area A of the GPU device.
The present invention designs a row filter that performs the row transform on one row of image data; the horizontal filter then realizes the row transform of the whole image by using multiple row filters. In terms of GPU device resources, each row filter uses one thread block consisting of 256 threads and an amount of shared memory equal to one row of data. Given the starting memory address coef_row of a row of data and the row width w, the row filter completes the filtering of that row. To achieve this, the row filter is implemented as follows:
2a) Separate the odd-indexed and even-indexed samples of the row, i.e. de-interleave them, and store the separated data in the shared memory of the thread block. The present invention uses a particular loop control scheme so that the data are processed by many threads in parallel. Every thread block on the GPU device, and every thread within a thread block, has an index value, and the present invention uses these index values to control the loops. Since a row filter is executed by a single thread block, the index value threadIdx.x of each thread within the block is used when designing the loop. Let the loop variable be i; its initial value is set to threadIdx.x, so across the 256 threads of the block the initial values of i are 0 to 255. The loop condition is i < width; thus, if the row width is smaller than the number of threads in the block, only some of the threads do work, and if the width is larger than the number of threads in the block, the loop iterates. After each execution of the loop body, i is incremented by the number of threads in the block, blockDim.x, so the next iteration continues with the data left over from the previous one. Through the cooperation of the 256 threads, every sample of the row is processed in turn and the number of iterations needed is greatly reduced. The shared memory opened in the preparation step is divided into two segments; the starting address of the first segment is labelled s and that of the second segment p. Following the loop control scheme above, the datum at offset 2i from coef_row is placed at offset i from s, and the datum at offset 2i+1 from coef_row is placed at offset i from p. The data movement is completed in at most w/(2×256)+1 iterations. At the end, the function __syncthreads should be called to synchronize all threads in the block; only after synchronization are the data written by each thread visible to the other threads.
2b) Using the loop control scheme described in 2a, apply four lifting passes to the de-interleaved data in shared memory. The first lifting pass adds the two samples s[i] and s[i+1], multiplies the sum by the lifting coefficient, and adds the product to p[i]; after all the data have been processed, one thread synchronization is performed. The second lifting pass is similar to the first, except that p[i] and p[i+1] are added, multiplied by the lifting coefficient, and added to s[i+1]. The third and fourth lifting passes are identical in form to the first and second.
2c) Using the loop control scheme described in 2a, take the data stored at the start of the s and p segments of shared memory after the four lifting passes, namely the low-frequency and high-frequency data, multiply them by their respective normalization coefficients, and write them back to shared memory over the original data of the row.
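A minimal CUDA sketch of one row filter (steps 2a to 2c) under the following assumptions, none of which are stated in the patent text: the lifting and normalization coefficients are the standard CDF 9/7 values, the row width w is even, boundary samples are simply clamped rather than symmetrically extended, and the second and fourth lifting passes are written in the equivalent index-shifted form. Names other than coef_row, w, threadIdx.x, blockDim.x and __syncthreads are illustrative:

    // One row filter: de-interleave a row into shared memory, run four lifting
    // passes, normalize, and write the result back over the row (steps 2a-2c).
    __device__ void row_filter(float *coef_row, int w)
    {
        // Illustrative CDF 9/7 constants; the patent only speaks of
        // "lifting coefficients" and "normalization coefficients".
        const float ALPHA = -1.586134342f;
        const float BETA  = -0.052980119f;
        const float GAMMA =  0.882911076f;
        const float DELTA =  0.443506852f;
        const float K     =  1.230174105f;

        extern __shared__ float buf[];     // one row of dynamic shared memory
        float *s = buf;                    // even samples (future low-pass)
        float *p = buf + w / 2;            // odd samples (future high-pass)
        const int half = w / 2;

        // 2a) de-interleave: even samples go to s, odd samples go to p.
        for (int i = threadIdx.x; i < half; i += blockDim.x) {
            s[i] = coef_row[2 * i];
            p[i] = coef_row[2 * i + 1];
        }
        __syncthreads();

        // 2b) four lifting passes, each followed by a block-wide synchronization.
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            p[i] += ALPHA * (s[i] + ((i + 1 < half) ? s[i + 1] : s[i]));
        __syncthreads();
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            s[i] += BETA * (p[i] + ((i > 0) ? p[i - 1] : p[i]));
        __syncthreads();
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            p[i] += GAMMA * (s[i] + ((i + 1 < half) ? s[i + 1] : s[i]));
        __syncthreads();
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            s[i] += DELTA * (p[i] + ((i > 0) ? p[i - 1] : p[i]));
        __syncthreads();

        // 2c) normalize and write back: low-frequency half, then high-frequency half.
        for (int i = threadIdx.x; i < half; i += blockDim.x) {
            coef_row[i]        = s[i] / K;
            coef_row[i + half] = p[i] * K;
        }
        __syncthreads();
    }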
After the row filter has been designed, the horizontal filter completes the processing of all rows by using 512 row filters. The horizontal filter is implemented as follows: with the same loop control scheme as the row filter, the initial value of the loop variable i is the block index blockIdx.x, the loop condition is i < h, and i is incremented by 512 after each iteration; inside the loop body, the starting memory address coef_all+i*w of one row of data and the row width w are passed to a row filter, which then filters the corresponding row.
The horizontal filter uses 512 row filters working in parallel, and each row filter uses many threads working in parallel, so the data processing is highly parallel. At the same time, the row filter performs its main data operations entirely in shared memory, so the data are transferred at very high speed. Highly parallel data processing and high-speed data transfer give the horizontal filter its high speed.
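A minimal kernel sketch of the horizontal filter under the same assumptions, reusing the row_filter sketch above; the launch uses 512 blocks of 256 threads and one row of floats as dynamic shared memory, as the description specifies, and the kernel name and the coef_all, w, h parameter names are illustrative:

    // Horizontal filter: block blockIdx.x processes rows blockIdx.x,
    // blockIdx.x + 512, blockIdx.x + 1024, ... by invoking one row filter per row.
    __global__ void horizontal_filter(float *coef_all, int w, int h)
    {
        for (int i = blockIdx.x; i < h; i += gridDim.x)   // gridDim.x == 512 here
            row_filter(coef_all + (size_t)i * w, w);
    }

    // Illustrative launch for an image of h rows of width w stored in area A:
    // horizontal_filter<<<512, 256, w * sizeof(float)>>>(d_A, w, h);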
Step 3: using the row-column transposer, perform a row-column transposition on the data in memory area A of the GPU device obtained in step 2, and store the transposed data in memory area B of the GPU device.
In terms of GPU device resources, the row-column transposer uses 256 thread blocks, and each thread block uses 256 threads and an amount of shared memory equal to one column of data. Given two addresses indicating where the data are stored before and after the transposition, together with the image width and height, the row-column transposer completes the transposition of a whole image.
The row-column transposer consists mainly of a two-level loop: the outer loop schedules all thread blocks on the GPU device, and the inner loop schedules the threads within one thread block. The initial value of the outer loop variable j is the thread-block index, the loop condition is that j is less than the image width, j is incremented by 256 on each iteration, and the loop body is the inner loop. The initial value of the inner loop variable i is the index of the thread within its block, the loop condition is that i is less than the image height, i is incremented by 256 on each iteration, and the loop body moves the data. The data movement first loads the input datum at address offset i*w+j into position i of the shared memory, and then writes the datum at position i of the shared memory to address offset i+j*h of the output.
Through this two-level loop, the row-column transposer realizes two things. First, by controlling a large number of threads with the loops, it performs highly parallel data operations. Second, instead of moving each datum directly from its position in the input to the corresponding position in the output, it first caches the data in shared memory and then writes them from shared memory to the output positions, achieving high-speed data transfer. Highly parallel data operations and high-speed data transfer give the row-column transposer its high speed.
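A minimal kernel sketch of the row-column transposer, assuming a launch of 256 blocks of 256 threads with one column of floats as dynamic shared memory; src and dst are the areas before and after the transposition, w and h are the image width and height, and the kernel and variable names are illustrative:

    // Row-column transposer: the outer loop distributes columns j over the 256
    // blocks, the inner loop distributes rows i over the 256 threads of a block.
    // Each column is staged in shared memory before being written out as a row.
    __global__ void transpose_rc(const float *src, float *dst, int w, int h)
    {
        extern __shared__ float col[];             // one column of shared memory

        for (int j = blockIdx.x; j < w; j += gridDim.x) {
            for (int i = threadIdx.x; i < h; i += blockDim.x)
                col[i] = src[(size_t)i * w + j];   // gather column j into shared memory
            for (int i = threadIdx.x; i < h; i += blockDim.x)
                dst[(size_t)j * h + i] = col[i];   // write it out as row j of the output
            // No __syncthreads is needed between the two inner loops, because each
            // thread writes back only the elements it staged itself.
        }
    }

    // Illustrative launch for step 3 (area A -> area B):
    // transpose_rc<<<256, 256, h * sizeof(float)>>>(d_A, d_B, w, h);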
Step 4: using the horizontal filter, perform a row transform again on the data in memory area B of the GPU device obtained in step 3, and store the transform result back in memory area B of the GPU device. The horizontal filter in this step is identical to the horizontal filter in step 2; filtering the transposed data horizontally is equivalent to filtering the data before the transposition in the vertical direction.
Step 5: using the row-column transposer, perform a row-column transposition again on the data in memory area B of the GPU device obtained in step 4, and store the transposed data in memory area A of the GPU device. Performing the row-column transposition once more on the result of step 4 undoes the transposition of step 3.
Step 6: copy the data in memory area A of the GPU device back to the host memory. After the data have been copied to the host memory, the original image data are overwritten by the wavelet-transformed data.
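Putting the pieces together, a minimal host-side sketch of steps 2 to 6, reusing the buffers and kernels sketched above (d_A, d_B, h_img, horizontal_filter and transpose_rc are illustrative names, and one row or column of floats is assumed to fit in the device's shared memory, as the description requires):

    // Step 2: row transform of area A (the rows of the original image).
    horizontal_filter<<<512, 256, w * sizeof(float)>>>(d_A, w, h);
    // Step 3: transpose area A into area B; the rows of B are the columns of A.
    transpose_rc<<<256, 256, h * sizeof(float)>>>(d_A, d_B, w, h);
    // Step 4: row transform of area B, i.e. the column transform of the image
    // (width and height swap roles after the transposition).
    horizontal_filter<<<512, 256, h * sizeof(float)>>>(d_B, h, w);
    // Step 5: transpose area B back into area A, restoring the original orientation.
    transpose_rc<<<256, 256, w * sizeof(float)>>>(d_B, d_A, h, w);
    // Step 6: copy the transformed data back over the original host image.
    cudaMemcpy(h_img, d_A, bytes, cudaMemcpyDeviceToHost);

    // Release the two device memory areas allocated before step 1.
    cudaFree(d_A);
    cudaFree(d_B);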
Through the above steps, the present invention realizes a high-speed wavelet transform, which can be used to implement high-speed image coding on a general-purpose computer.

Claims (3)

1. A method for realizing a fast wavelet transform with a GPU, comprising:
Step 1: copying the image data to be wavelet-transformed from the host memory to memory area A of the GPU device;
Step 2: using the horizontal filter, performing a row transform on the data in memory area A of the GPU device obtained in step 1, and storing the transform result back in memory area A of the GPU device;
Step 3: using the row-column transposer, performing a row-column transposition on the data in memory area A of the GPU device obtained in step 2, and storing the transposed data in memory area B of the GPU device;
Step 4: using the horizontal filter, performing a row transform again on the data in memory area B of the GPU device obtained in step 3, and storing the transform result back in memory area B of the GPU device;
Step 5: using the row-column transposer, performing a row-column transposition again on the data in memory area B of the GPU device obtained in step 4, and storing the transposed data in memory area A of the GPU device;
Step 6: copying the data in memory area A of the GPU device back to the host memory.
2. The method for realizing a fast wavelet transform with a GPU according to claim 1, wherein the horizontal filter described in steps 2 and 4 uses multiple threads to process data in parallel and uses shared memory for data transfer.
3. The method for realizing a fast wavelet transform with a GPU according to claim 1, wherein the row-column transposer described in steps 3 and 5 uses multiple threads to process data in parallel and uses shared memory for data transfer.
CN2010102042369A 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU Expired - Fee Related CN101866493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102042369A CN101866493B (en) 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102042369A CN101866493B (en) 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU

Publications (2)

Publication Number Publication Date
CN101866493A true CN101866493A (en) 2010-10-20
CN101866493B CN101866493B (en) 2012-01-04

Family

ID=42958207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102042369A Expired - Fee Related CN101866493B (en) 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU

Country Status (1)

Country Link
CN (1) CN101866493B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198451A (en) * 2013-01-31 2013-07-10 西安电子科技大学 Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
CN103414901A (en) * 2013-08-26 2013-11-27 江苏新瑞峰信息科技有限公司 Quick JPED 2000 image compression system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020168113A1 (en) * 2001-03-13 2002-11-14 Tadayoshi Nakayama Filter processing apparatus
CN1588451A (en) * 2004-07-22 2005-03-02 华中科技大学 Circuit for realizing direct two dimension discrete small wave change
CN1589021A (en) * 2004-09-28 2005-03-02 华中科技大学 Parallel two-dimension discrete small wave transform circuit

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020168113A1 (en) * 2001-03-13 2002-11-14 Tadayoshi Nakayama Filter processing apparatus
CN1588451A (en) * 2004-07-22 2005-03-02 华中科技大学 Circuit for realizing direct two dimension discrete small wave change
CN1589021A (en) * 2004-09-28 2005-03-02 华中科技大学 Parallel two-dimension discrete small wave transform circuit

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tien-Tsin Wong, et al., "Discrete Wavelet Transform on Consumer-Level Graphics Hardware," IEEE Transactions on Multimedia, vol. 9, no. 3, Apr. 2007. *
Keyan Wang, et al., "Efficient Line-Based VLSI Architecture for 2-D Lifting DWT," 2006 IEEE International Conference on Image Processing, Oct. 2006. *
Wladimir J. van der Laan, et al., "Accelerating wavelet-based video coding on graphics hardware using CUDA," Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis, Sep. 2009. *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198451A (en) * 2013-01-31 2013-07-10 西安电子科技大学 Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
CN103198451B (en) * 2013-01-31 2016-01-20 西安电子科技大学 A kind of GPU realizes the method for fast wavelet transform by piecemeal
CN103414901A (en) * 2013-08-26 2013-11-27 江苏新瑞峰信息科技有限公司 Quick JPED 2000 image compression system

Also Published As

Publication number Publication date
CN101866493B (en) 2012-01-04

Similar Documents

Publication Publication Date Title
Zhang et al. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system
US11775801B2 (en) Neural processor
DE112013004078B4 (en) Share storage over a unified storage architecture
Klöckner et al. Nodal discontinuous Galerkin methods on graphics processors
CN104077233B (en) Multichannel convolutive layer treating method and apparatus
DE102013014169A1 (en) Dynamically variable circular buffers
Negrut et al. Unified memory in cuda 6.0. a brief overview of related data access and transfer issues
DE112009004320T5 (en) Memory Subsystem
Kono et al. Scalability analysis of tightly-coupled FPGA-cluster for lattice boltzmann computation
DE112013004079T5 (en) Shared virtual memory
CN103049241A (en) Method for improving computation performance of CPU (Central Processing Unit) +GPU (Graphics Processing Unit) heterogeneous device
Zlateski et al. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs
CN101866493B (en) Method for realizing fast wavelet transform by using GPU
CN103198451A (en) Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
Quan et al. A fast discrete wavelet transform using hybrid parallelism on GPUs
CN107943592A (en) A kind of method for avoiding GPU resource contention towards GPU cluster environment
DE102023105572A1 (en) Efficient matrix multiplication and addition with a group of warps
DE102013201195A1 (en) Previously scheduled repetitions of divergent operations
CN102300092B (en) Lifting scheme-based 9/7 wavelet inverse transformation image decompressing method
Filippas et al. Streaming dilated convolution engine
Liu et al. Parallel program design for JPEG compression encoding
WO2006014528A1 (en) A method of and apparatus for implementing fast orthogonal transforms of variable size
CN201111042Y (en) Two-dimension wavelet transform integrate circuit structure
Ikuzawa et al. Reducing memory usage by the lifting-based discrete wavelet transform with a unified buffer on a GPU
Lee et al. VLSI design of a wavelet processing core

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120104

Termination date: 20170618