CN101866493B - Method for realizing fast wavelet transform by using GPU - Google Patents
- Publication number
- CN101866493B · CN2010102042369A · CN201010204236A
- Authority
- CN
- China
- Prior art keywords
- data
- graphics processing unit GPU
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Image Processing (AREA)
Abstract
The invention discloses a method for realizing a fast wavelet transform by using a GPU, which comprises the following steps: (1) copying data from the host memory of a computer to a memory region A of the GPU device; (2) performing a row transform on the data in memory region A by using a horizontal filter, and storing the transform result in memory region A; (3) performing a row-column transposition on the data in memory region A by using a row-column transposer, and storing the transposed data in a memory region B of the GPU device; (4) performing a row transform on the data in memory region B again by using the horizontal filter, and storing the transform result in memory region B; (5) performing a row-column transposition on the data in memory region B again by using the row-column transposer, and storing the transposed data in memory region A; and (6) copying the data in memory region A of the GPU device back to the host memory of the computer. The method processes the data in parallel, has a simple implementation process, and improves the processing speed of the wavelet transform.
Description
Technical field
The present invention relates to the technical field of image processing, and in particular to a method for realizing a fast wavelet transform using the graphics processing unit (GPU) of a computer, so that high-speed image coding can be implemented on a general-purpose computer.
Background technology
In the field of digital image processing, the wavelet transform has found wide application because of its outstanding performance in the JPEG2000 image compression standard. However, the wavelet transform is computationally complex and the amount of data to be processed is generally large, so the transform is slow, which has seriously limited its application. Optimization measures should therefore be taken. At present, most optimizations of the wavelet transform are based on application-specific integrated circuits (ASIC) or programmable logic devices (FPGA), which suffer from long development cycles, high design difficulty and high cost.
In view of this situation, researchers have proposed schemes for performing the wavelet transform on GPU devices. Relying on many cores working together and a very high memory bandwidth, GPU devices achieve very high arithmetic capability. Addressing the characteristics of GPU devices, Chen Dawei et al. of Nanjing University of Posts and Telecommunications proposed a fast wavelet transform method in the article "Implementation of the Daubechies 9/7 discrete wavelet transform on the GPU" (China Multimedia Communication, 2007, Issue 12, pages 45-50). That method realizes the wavelet transform with a parallel strategy and focuses on overcoming the limited shared memory of GPU devices by segmenting every row or column of the data. Although the method has good versatility and can handle images of larger size, it inevitably introduces new problems, as the authors themselves state. What affects efficiency most is that the head and tail of every segment of each row or column must be symmetrically extended; each extension requires a single thread to process the head and tail data of a segment on its own, and during that time all other threads are idle, wasting resources. Another shortcoming is that a column transform must follow the row transform, which requires every column of the data to be extracted, and the method does not propose an effective way of rearranging the data, so the column transform is inefficient.
Summary of the invention
In view of the above problems, the present invention proposes a fast wavelet transform method that is realized with a GPU, does not segment the data, and provides a dedicated row-column transposition function.
The steps of the present invention are as follows:
Step 1: copy the image data to be wavelet transformed from the host memory to memory region A of the GPU device;
Step 2: use the horizontal filter to perform a row transform on the data in memory region A of the GPU device obtained in step 1, and store the transform result back in memory region A of the GPU device;
Step 3: use the row-column transposer to transpose the data in memory region A of the GPU device obtained in step 2, and store the transposed data in memory region B of the GPU device;
Step 4: use the horizontal filter again to perform a row transform on the data in memory region B of the GPU device obtained in step 3, and store the transform result back in memory region B of the GPU device;
Step 5: use the row-column transposer again to transpose the data in memory region B of the GPU device obtained in step 4, and store the transposed data in memory region A of the GPU device;
Step 6: copy the data in memory region A of the GPU device back to the host memory.
Compared with the prior art, the present invention has the following advantages:
First, the present invention only requires the shared memory of the GPU device to buffer one row or one column of data, and as the design level of GPU devices improves, the shared memory available for this purpose increases correspondingly. The present invention is therefore not easily limited by insufficient shared memory on the GPU device.
Second, the horizontal filter designed in the present invention makes full use of the multiprocessors and the high-speed shared memory of the GPU device and therefore has a very high processing speed.
Third, the row-column transposer designed in the present invention likewise uses the multiprocessors and the high-speed shared memory of the GPU device and therefore has a very high processing speed. By using this high-speed row-column transposer, the loss of efficiency caused by the non-contiguous storage of column data during the column transform is avoided. At the same time, because the column transform is realized through a row transform, the present invention no longer needs a separate vertical filter, which makes a system realized by this method simpler.
Description of drawings
Fig. 1 is a process flow diagram of the present invention.
Embodiment
The present invention is written in the CUDA language and can be realized on any NVIDIA GPU device that supports the CUDA architecture. Before carrying out the method described in this embodiment, the cudaMalloc function should first be called to allocate two memory regions of equal size on the GPU device, denoted memory regions A and B respectively. After the method of the present invention has been used, the cudaFree function should be called to release these two memory regions.
With reference to Fig. 1, the present invention can be realized through the following steps:
Step 1: copy the data from the computer host memory to the GPU device memory. The copy from the computer host memory to the GPU device memory can be performed with the function cudaMemcpy or cudaMemcpy2D.
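A minimal host-side sketch of the preparation described above and of steps 1 and 6 follows. The image size, the float sample type, the buffer names d_A and d_B and the omission of error checking are illustrative assumptions, not part of the claimed method.

```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int w = 512, h = 512;              // assumed image width and height
    std::vector<float> image(w * h, 0.0f);   // image data in host memory

    // Preparation: allocate the two equal-sized device memory regions A and B.
    float *d_A = nullptr, *d_B = nullptr;
    cudaMalloc(&d_A, w * h * sizeof(float));
    cudaMalloc(&d_B, w * h * sizeof(float));

    // Step 1: copy from host memory to device memory region A
    // (cudaMemcpy2D could be used instead for pitched 2-D copies).
    cudaMemcpy(d_A, image.data(), w * h * sizeof(float), cudaMemcpyHostToDevice);

    // ... steps 2-5: horizontal filter and row-column transposer kernels ...

    // Step 6: copy the result from device memory region A back to host memory.
    cudaMemcpy(image.data(), d_A, w * h * sizeof(float), cudaMemcpyDeviceToHost);

    // Release the two device memory regions.
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}
```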
Step 2: use the horizontal filter to perform a row transform on the data in memory region A of the GPU device obtained in step 1, and store the transform result back in memory region A of the GPU device.
The present invention designs a row filter that performs the row transform on one row of image data; by using a number of such row filters, the horizontal filter can then carry out the row transform of the whole image. In terms of GPU device resources, each row filter uses one thread block consisting of 256 threads and an amount of shared memory equal to the size of one row of data. Given the device memory start address coef_row of a row of data and the row width w, the row filter completes the filtering of that row. The row filter is implemented as follows:
2a) Separate the data of one row into odd-indexed and even-indexed items, i.e. deinterleave them, and store the separated data in the shared memory of the thread block. The present invention uses a particular loop-control scheme so that the data are processed by many threads in parallel. Every thread block of the GPU device, and every thread within a thread block, has an index value, and the present invention uses these index values to control the loops. Because a row filter is executed by one thread block, the index value threadIdx.x of each thread within that block is used when designing the loop. Let the loop variable be i; its initial value is set to threadIdx.x, so among the 256 threads of the block the initial values of i range from 0 to 255. The loop condition is i < width; thus, if the width of a row is smaller than the number of threads in the block, only some of the threads do any work, while if the width is larger than the number of threads in the block, the loop iterates. After each execution of the loop body, i is incremented by the number of threads in the block, blockDim.x, so that the next iteration continues with the data left over by the previous one. Through the cooperation of these 256 threads every element of the row is processed in turn, and the number of loop iterations required is greatly reduced. The shared memory opened in the preparation step is divided into two segments; the start address of the first segment is denoted s and that of the second segment is denoted p. Following the above loop-control scheme, the data item at offset 2i from the address coef_row is placed at offset i from the address s, and the data item at offset 2i+1 from the address coef_row is placed at offset i from the address p. At most w/(2*256)+1 iterations are needed to complete the data movement. Finally the function __syncthreads should be called to synchronize all threads in the block; only after this synchronization are the data written by each thread visible to the other threads.
2b) Using the loop-control scheme described in 2a), apply four lifting passes to the deinterleaved data in shared memory. In the first lifting pass, the two data items s[i] and s[i+1] are added, multiplied by the lifting coefficient, and added onto p[i]; after all the data have been processed, one thread synchronization is performed. The second lifting pass is similar to the first, except that p[i] and p[i+1] are added, multiplied by the lifting coefficient, and added onto s[i+1]. The third and fourth lifting passes have the same form as the first and second respectively.
2c) Using the loop-control scheme described in 2a), the data starting at the addresses s and p in shared memory after the four lifting passes, namely the low-frequency and the high-frequency data, are multiplied by their respective normalization coefficients and written back over the original data of the row.
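A minimal CUDA sketch of one such row filter, implementing 2a) to 2c), is given below. The float sample type, the maximum row width of 1024, the concrete CDF 9/7 lifting and scaling constants, and the output layout (low-frequency half followed by high-frequency half) are assumptions for illustration and are not prescribed by this description.

```cuda
// Sketch of one row filter (2a-2c), executed by a single thread block of
// 256 threads.  Assumptions: float samples, an even row width w of at most
// 1024, and the standard CDF 9/7 lifting constants; clamping the index at
// the row ends plays the role of symmetric extension.
#include <cuda_runtime.h>

#define MAX_W 1024

__device__ void row_filter(float *coef_row, int w)
{
    // Shared memory split into two segments: s (even samples, low-pass
    // branch) and p (odd samples, high-pass branch).
    __shared__ float s[MAX_W / 2];
    __shared__ float p[MAX_W / 2];
    const int half = w / 2;

    // Assumed CDF 9/7 lifting coefficients and scaling constant.
    const float a = -1.586134342f, b = -0.052980119f;
    const float g =  0.882911076f, d =  0.443506852f;
    const float k =  1.230174105f;

    // 2a) Deinterleave the row into shared memory.
    for (int i = threadIdx.x; i < half; i += blockDim.x) {
        s[i] = coef_row[2 * i];
        p[i] = coef_row[2 * i + 1];
    }
    __syncthreads();

    // 2b) Four lifting passes.  The update passes are written so that each
    // thread modifies only its own element s[i]; this is the same relation
    // as "add onto s[i+1]" in the text, merely re-indexed, and it avoids
    // two threads writing the same shared memory location in one pass.
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        p[i] += a * (s[i] + s[min(i + 1, half - 1)]);
    __syncthreads();
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        s[i] += b * (p[max(i - 1, 0)] + p[i]);
    __syncthreads();
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        p[i] += g * (s[i] + s[min(i + 1, half - 1)]);
    __syncthreads();
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        s[i] += d * (p[max(i - 1, 0)] + p[i]);
    __syncthreads();

    // 2c) Normalize and write back over the original row.  The output
    // layout (low-frequency half followed by high-frequency half) is an
    // assumption; the text only states that the row is overwritten.
    for (int i = threadIdx.x; i < half; i += blockDim.x) {
        coef_row[i]        = s[i] / k;   // low-frequency coefficients
        coef_row[half + i] = p[i] * k;   // high-frequency coefficients
    }
    __syncthreads();
}
```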
After the row filter has been designed, the horizontal filter completes the processing of all rows by using 512 row filters. The horizontal filter is implemented as follows: using the same loop-control scheme as the row filter, the initial value of the loop variable i is the thread block index blockIdx.x, the loop condition is i < h, and i is incremented by 512 after each iteration; inside the loop body, the device memory start address coef_all + i*w of one row of data and the row width w are provided to a row filter, which performs the filtering of the corresponding row.
The horizontal filter uses 512 row filters working in parallel, and each row filter uses many threads working in parallel, so highly parallel data processing is realized. At the same time, the row filter performs the main operations on the data entirely in shared memory, so the data enjoy a very high transfer speed. The highly parallel data processing and the high-speed data transfer give the horizontal filter a very high speed.
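Continuing the previous sketch in the same source file, the horizontal filter can be expressed as a kernel that strides over the rows by block index; the launch configurations shown are the 512 blocks of 256 threads described above.

```cuda
// Sketch of the horizontal filter: each block runs the row_filter sketched
// above on rows blockIdx.x, blockIdx.x + gridDim.x, ...; with the launches
// below, gridDim.x equals 512, matching the loop increment in the text.
__global__ void horizontal_filter(float *coef_all, int w, int h)
{
    for (int i = blockIdx.x; i < h; i += gridDim.x)
        row_filter(coef_all + i * w, w);   // filter row i in place
}

// Step 2: horizontal_filter<<<512, 256>>>(d_A, w, h);
// Step 4 (on the transposed data in region B, whose rows have length h):
//         horizontal_filter<<<512, 256>>>(d_B, h, w);
```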
Step 3: use the row-column transposer to transpose the data in memory region A of the GPU device obtained in step 2, and store the transposed data in memory region B of the GPU device.
In terms of GPU device resources, the row-column transposer uses 256 thread blocks, and each thread block uses 256 threads and an amount of shared memory equal to the size of one column of data. Given two addresses indicating where the data are stored before and after the transposition, together with the row width and the column height of the image, the row-column transposer can complete the transposition of an entire image.
The row-column transposer consists mainly of a doubly nested loop: the outer loop schedules all the thread blocks of the whole GPU device, and the inner loop schedules the threads within a thread block. The initial value of the outer loop variable j is the thread block index, the loop condition is j < row width, j is incremented by 256 after each iteration, and the loop body is the inner loop. The initial value of the inner loop variable i is the index of the thread within the thread block, the loop condition is i < column height, i is incremented by 256 after each iteration, and the loop body performs the data movement. The data movement first stores the input data at address offset i*w+j into position i of the shared memory, and then places the data at position i of the shared memory into the output at address offset i + j*h.
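A minimal CUDA sketch of the row-column transposer, following the doubly nested loop described above, is given below; the float sample type, the maximum column height of 1024 and the launch configuration are assumptions for illustration.

```cuda
// Sketch of the row-column transposer.  The outer loop walks over the
// columns j handled by this block; the inner loops let the 256 threads
// walk over the rows i, staging one column in shared memory before
// writing it out as row j of the output.
#include <cuda_runtime.h>

#define MAX_H 1024   // assumed upper bound on the column height

__global__ void transpose(const float *in, float *out, int w, int h)
{
    __shared__ float col[MAX_H];

    for (int j = blockIdx.x; j < w; j += gridDim.x) {       // outer loop over columns
        for (int i = threadIdx.x; i < h; i += blockDim.x)   // inner loop over rows
            col[i] = in[i * w + j];                          // buffer column j in shared memory
        __syncthreads();
        for (int i = threadIdx.x; i < h; i += blockDim.x)
            out[j * h + i] = col[i];                         // emit it as row j of the output
        __syncthreads();
    }
}

// Step 3: transpose<<<256, 256>>>(d_A, d_B, w, h);
// Step 5 (the transposed image is h wide and w high, so the roles of
//         w and h are swapped): transpose<<<256, 256>>>(d_B, d_A, h, w);
```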
Through this doubly nested loop, the row-column transposer realizes two things. First, by controlling a large number of threads through the loops, it realizes highly parallel data manipulation. Second, instead of moving the data directly from the input position to the corresponding output position, it first buffers the data in shared memory and then fetches them from shared memory to the output position, realizing high-speed data transfer. The highly parallel data manipulation and the high-speed data transfer give the row-column transposer a very high speed.
Step 4: use the horizontal filter again to perform a row transform on the data in memory region B of the GPU device obtained in step 3, and store the transform result back in memory region B of the GPU device. The horizontal filter in this step is identical to the horizontal filter in step 2; performing horizontal filtering on the transposed data is equivalent to performing the vertical-direction filtering of the data before the transposition.
Step 5: use the row-column transposer again to transpose the data in memory region B of the GPU device obtained in step 4, and store the transposed data in memory region A of the GPU device. Repeating the row-column transposition on the result of step 4 restores the orientation changed by the transposition of step 3.
Step 6: copy the data in memory region A of the GPU device back to the host memory. After the data have been copied to the host memory, the original image data are overwritten by the wavelet-transformed data.
Through the above steps, the present invention realizes a high-speed wavelet transform, which can be used to implement high-speed image coding on a general-purpose computer.
Claims (3)
1. A method for realizing a fast wavelet transform with a graphics processing unit (GPU), comprising:
Step 1: copying the image data to be wavelet transformed from the host memory to memory region A of the graphics processing unit GPU device;
Step 2: using the horizontal filter to perform a row transform on the data in memory region A of the GPU device obtained in step 1, and storing the transform result back in memory region A of the GPU device;
Step 3: using the row-column transposer to transpose the data in memory region A of the GPU device obtained in step 2, and storing the transposed data in memory region B of the GPU device;
Step 4: using the horizontal filter again to perform a row transform on the data in memory region B of the GPU device obtained in step 3, and storing the transform result back in memory region B of the GPU device;
Step 5: using the row-column transposer again to transpose the data in memory region B of the GPU device obtained in step 4, and storing the transposed data in memory region A of the GPU device;
Step 6: copying the data in memory region A of the GPU device back to the host memory.
2. The method for realizing a fast wavelet transform with a graphics processing unit GPU according to claim 1, wherein the horizontal filter described in step 2 and step 4 processes the data in parallel with multiple threads and uses shared memory for the transfer of the data.
3. The method for realizing a fast wavelet transform with a graphics processing unit GPU according to claim 1, wherein the row-column transposer described in step 3 and step 5 processes the data in parallel with multiple threads and uses shared memory for the transfer of the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102042369A CN101866493B (en) | 2010-06-18 | 2010-06-18 | Method for realizing fast wavelet transform by using GPU |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102042369A CN101866493B (en) | 2010-06-18 | 2010-06-18 | Method for realizing fast wavelet transform by using GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101866493A CN101866493A (en) | 2010-10-20 |
CN101866493B true CN101866493B (en) | 2012-01-04 |
Family
ID=42958207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010102042369A Expired - Fee Related CN101866493B (en) | 2010-06-18 | 2010-06-18 | Method for realizing fast wavelet transform by using GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101866493B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103198451B (en) * | 2013-01-31 | 2016-01-20 | 西安电子科技大学 | A kind of GPU realizes the method for fast wavelet transform by piecemeal |
CN103414901A (en) * | 2013-08-26 | 2013-11-27 | 江苏新瑞峰信息科技有限公司 | Quick JPED 2000 image compression system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1588451A (en) * | 2004-07-22 | 2005-03-02 | 华中科技大学 | Circuit for realizing direct two dimension discrete small wave change |
CN1589021A (en) * | 2004-09-28 | 2005-03-02 | 华中科技大学 | Parallel two-dimension discrete small wave transform circuit |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6904177B2 (en) * | 2001-03-13 | 2005-06-07 | Canon Kabushiki Kaisha | Filter processing apparatus |
2010-06-18 CN CN2010102042369A patent/CN101866493B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1588451A (en) * | 2004-07-22 | 2005-03-02 | 华中科技大学 | Circuit for realizing direct two dimension discrete small wave change |
CN1589021A (en) * | 2004-09-28 | 2005-03-02 | 华中科技大学 | Parallel two-dimension discrete small wave transform circuit |
Non-Patent Citations (3)
Title |
---|
Keyan Wang, et al., "Efficient line-based VLSI architecture for 2-D lifting DWT," Image Processing, 2006 IEEE International Conference on, 2006. *
Tien-Tsin Wong, et al., "Discrete wavelet transform on consumer-level graphics hardware," IEEE Transactions on Multimedia, 2007, vol. 9, no. 3. *
Wladimir J. van der Laan, et al., "Accelerating wavelet-based video coding on graphics hardware using CUDA," Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis, 2009. *
Also Published As
Publication number | Publication date |
---|---|
CN101866493A (en) | 2010-10-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20120104; Termination date: 20170618