CN101866493B - Method for realizing fast wavelet transform by using GPU - Google Patents
- Publication number
- CN101866493B · CN2010102042369A · CN201010204236A
- Authority
- CN
- China
- Prior art keywords
- data
- graphics processing unit GPU
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Image Processing (AREA)
Abstract
The invention discloses a method for realizing a fast wavelet transform by using a GPU, which comprises the following steps: (1) copying data from the host memory of a computer to a memory region A of the GPU device; (2) performing a row transform on the data in memory region A by using a horizontal filter, and storing the transform result in memory region A; (3) performing a row-column transposition on the data in memory region A by using a row-column transposer, and storing the transposed data in a memory region B of the GPU device; (4) performing a row transform on the data in memory region B again by using the horizontal filter, and storing the transform result in memory region B; (5) performing a row-column transposition on the data in memory region B again by using the row-column transposer, and storing the transposed data in memory region A; and (6) copying the data in memory region A of the GPU device back to the host memory of the computer. The method processes the data in parallel, has a simple implementation process, and improves the processing speed of the wavelet transform.
Description
Technical field
The present invention relates to the technical field of image processing, and in particular to a method for realizing a fast wavelet transform using the graphics processing unit (GPU) of a computer, so that high-speed image coding can be implemented on a general-purpose computer.
Background technology
In the field of digital image processing, the wavelet transform has found wide application because of its outstanding performance in the JPEG2000 image compression standard. However, the wavelet transform is computationally complex and the amount of data to be processed is generally large, so the transform is slow, which has seriously limited its application. Optimization measures should therefore be taken. At present, most optimizations of the wavelet transform are based on application-specific integrated circuits (ASIC) or programmable logic devices (FPGA), which suffer from long development cycles, high design difficulty and high cost.
In view of this situation, researchers have proposed schemes for performing the wavelet transform on GPU devices. Relying on many cores working together and a very high memory bandwidth, GPU devices achieve very high arithmetic capability. Addressing the characteristics of GPU devices, Chen Dawei et al. of Nanjing University of Posts and Telecommunications proposed a fast wavelet transform method in the article "Implementation of the Daubechies 9/7 discrete wavelet transform on the GPU" (China Multimedia Communication, 2007, Issue 12, pages 45-50). That method realizes the wavelet transform with a parallel strategy and focuses on overcoming the limited shared memory of GPU devices by segmenting every row or column of the data. Although the method has good versatility and can handle images of larger size, it inevitably introduces new problems, as the authors themselves state. What affects efficiency most is that the head and tail of every segment of each row or column must be symmetrically extended; each extension requires a single thread to process the head and tail data of a segment on its own, and during that time all other threads are idle, wasting resources. Another shortcoming is that a column transform must follow the row transform, which requires every column of the data to be extracted, and the method does not propose an effective way of rearranging the data, so the column transform is inefficient.
Summary of the invention
In view of the above problems, the present invention proposes a fast wavelet transform method that is realized with a GPU, does not segment the data, and provides a dedicated row-column transposition function.
The steps of the present invention are as follows:
Step 1: copy the image data to be wavelet transformed from the host memory to memory region A of the GPU device;
Step 2: use the horizontal filter to perform a row transform on the data in memory region A of the GPU device obtained in step 1, and store the transform result back in memory region A of the GPU device;
Step 3: use the row-column transposer to transpose the data in memory region A of the GPU device obtained in step 2, and store the transposed data in memory region B of the GPU device;
Step 4: use the horizontal filter again to perform a row transform on the data in memory region B of the GPU device obtained in step 3, and store the transform result back in memory region B of the GPU device;
Step 5: use the row-column transposer again to transpose the data in memory region B of the GPU device obtained in step 4, and store the transposed data in memory region A of the GPU device;
Step 6: copy the data in memory region A of the GPU device back to the host memory.
Compared with the prior art, the present invention has the following advantages:
First, the present invention only requires the shared memory of the GPU device to buffer one row or one column of data, and as the design level of GPU devices improves, the shared memory available for this purpose increases correspondingly. The present invention is therefore not easily limited by insufficient shared memory on the GPU device.
Second, the horizontal filter designed in the present invention makes full use of the multiprocessors and the high-speed shared memory of the GPU device and therefore has a very high processing speed.
Third, the row-column transposer designed in the present invention likewise uses the multiprocessors and the high-speed shared memory of the GPU device and therefore has a very high processing speed. By using this high-speed row-column transposer, the loss of efficiency caused by the non-contiguous storage of column data during the column transform is avoided. At the same time, because the column transform is realized through a row transform, the present invention no longer needs a separate vertical filter, which makes a system realized by this method simpler.
Description of drawings
Fig. 1 is a process flow diagram of the present invention.
Embodiment
The present invention is written in the CUDA language and can be realized on any NVIDIA GPU device that supports the CUDA architecture. Before carrying out the method described in this embodiment, the cudaMalloc function should first be called to allocate two memory regions of equal size on the GPU device, denoted memory regions A and B respectively. After the method of the present invention has been used, the cudaFree function should be called to release these two memory regions.
With reference to Fig. 1, the present invention can be realized through the following steps:
Step 1: copy the data from the computer host memory to the GPU device memory. The copy from the computer host memory to the GPU device memory can be performed with the function cudaMemcpy or cudaMemcpy2D.
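A minimal host-side sketch of the preparation described above and of steps 1 and 6 follows. The image size, the float sample type, the buffer names d_A and d_B and the omission of error checking are illustrative assumptions, not part of the claimed method.

```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int w = 512, h = 512;              // assumed image width and height
    std::vector<float> image(w * h, 0.0f);   // image data in host memory

    // Preparation: allocate the two equal-sized device memory regions A and B.
    float *d_A = nullptr, *d_B = nullptr;
    cudaMalloc(&d_A, w * h * sizeof(float));
    cudaMalloc(&d_B, w * h * sizeof(float));

    // Step 1: copy from host memory to device memory region A
    // (cudaMemcpy2D could be used instead for pitched 2-D copies).
    cudaMemcpy(d_A, image.data(), w * h * sizeof(float), cudaMemcpyHostToDevice);

    // ... steps 2-5: horizontal filter and row-column transposer kernels ...

    // Step 6: copy the result from device memory region A back to host memory.
    cudaMemcpy(image.data(), d_A, w * h * sizeof(float), cudaMemcpyDeviceToHost);

    // Release the two device memory regions.
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}
```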
Step 2: use the horizontal filter to perform a row transform on the data in memory region A of the GPU device obtained in step 1, and store the transform result back in memory region A of the GPU device.
The present invention designs a row filter that performs the row transform on one row of image data; by using a number of such row filters, the horizontal filter can then carry out the row transform of the whole image. In terms of GPU device resources, each row filter uses one thread block consisting of 256 threads and an amount of shared memory equal to the size of one row of data. Given the device memory start address coef_row of a row of data and the row width w, the row filter completes the filtering of that row. The row filter is implemented as follows:
2a) Separate the data of one row into odd-indexed and even-indexed items, i.e. deinterleave them, and store the separated data in the shared memory of the thread block. The present invention uses a particular loop-control scheme so that the data are processed by many threads in parallel. Every thread block of the GPU device, and every thread within a thread block, has an index value, and the present invention uses these index values to control the loops. Because a row filter is executed by one thread block, the index value threadIdx.x of each thread within that block is used when designing the loop. Let the loop variable be i; its initial value is set to threadIdx.x, so among the 256 threads of the block the initial values of i range from 0 to 255. The loop condition is i < width; thus, if the width of a row is smaller than the number of threads in the block, only some of the threads do any work, while if the width is larger than the number of threads in the block, the loop iterates. After each execution of the loop body, i is incremented by the number of threads in the block, blockDim.x, so that the next iteration continues with the data left over by the previous one. Through the cooperation of these 256 threads every element of the row is processed in turn, and the number of loop iterations required is greatly reduced. The shared memory opened in the preparation step is divided into two segments; the start address of the first segment is denoted s and that of the second segment is denoted p. Following the above loop-control scheme, the data item at offset 2i from the address coef_row is placed at offset i from the address s, and the data item at offset 2i+1 from the address coef_row is placed at offset i from the address p. At most w/(2*256)+1 iterations are needed to complete the data movement. Finally the function __syncthreads should be called to synchronize all threads in the block; only after this synchronization are the data written by each thread visible to the other threads.
2b) Using the loop-control scheme described in 2a), apply four lifting passes to the deinterleaved data in shared memory. In the first lifting pass, the two data items s[i] and s[i+1] are added, multiplied by the lifting coefficient, and added onto p[i]; after all the data have been processed, one thread synchronization is performed. The second lifting pass is similar to the first, except that p[i] and p[i+1] are added, multiplied by the lifting coefficient, and added onto s[i+1]. The third and fourth lifting passes have the same form as the first and second respectively.
2c) Using the loop-control scheme described in 2a), the data starting at the addresses s and p in shared memory after the four lifting passes, namely the low-frequency and the high-frequency data, are multiplied by their respective normalization coefficients and written back over the original data of the row.
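A minimal CUDA sketch of one such row filter, implementing 2a) to 2c), is given below. The float sample type, the maximum row width of 1024, the concrete CDF 9/7 lifting and scaling constants, and the output layout (low-frequency half followed by high-frequency half) are assumptions for illustration and are not prescribed by this description.

```cuda
// Sketch of one row filter (2a-2c), executed by a single thread block of
// 256 threads.  Assumptions: float samples, an even row width w of at most
// 1024, and the standard CDF 9/7 lifting constants; clamping the index at
// the row ends plays the role of symmetric extension.
#include <cuda_runtime.h>

#define MAX_W 1024

__device__ void row_filter(float *coef_row, int w)
{
    // Shared memory split into two segments: s (even samples, low-pass
    // branch) and p (odd samples, high-pass branch).
    __shared__ float s[MAX_W / 2];
    __shared__ float p[MAX_W / 2];
    const int half = w / 2;

    // Assumed CDF 9/7 lifting coefficients and scaling constant.
    const float a = -1.586134342f, b = -0.052980119f;
    const float g =  0.882911076f, d =  0.443506852f;
    const float k =  1.230174105f;

    // 2a) Deinterleave the row into shared memory.
    for (int i = threadIdx.x; i < half; i += blockDim.x) {
        s[i] = coef_row[2 * i];
        p[i] = coef_row[2 * i + 1];
    }
    __syncthreads();

    // 2b) Four lifting passes.  The update passes are written so that each
    // thread modifies only its own element s[i]; this is the same relation
    // as "add onto s[i+1]" in the text, merely re-indexed, and it avoids
    // two threads writing the same shared memory location in one pass.
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        p[i] += a * (s[i] + s[min(i + 1, half - 1)]);
    __syncthreads();
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        s[i] += b * (p[max(i - 1, 0)] + p[i]);
    __syncthreads();
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        p[i] += g * (s[i] + s[min(i + 1, half - 1)]);
    __syncthreads();
    for (int i = threadIdx.x; i < half; i += blockDim.x)
        s[i] += d * (p[max(i - 1, 0)] + p[i]);
    __syncthreads();

    // 2c) Normalize and write back over the original row.  The output
    // layout (low-frequency half followed by high-frequency half) is an
    // assumption; the text only states that the row is overwritten.
    for (int i = threadIdx.x; i < half; i += blockDim.x) {
        coef_row[i]        = s[i] / k;   // low-frequency coefficients
        coef_row[half + i] = p[i] * k;   // high-frequency coefficients
    }
    __syncthreads();
}
```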
After the row filter has been designed, the horizontal filter completes the processing of all rows by using 512 row filters. The horizontal filter is implemented as follows: using the same loop-control scheme as the row filter, the initial value of the loop variable i is the thread block index blockIdx.x, the loop condition is i < h, and i is incremented by 512 after each iteration; inside the loop body, the device memory start address coef_all + i*w of one row of data and the row width w are provided to a row filter, which performs the filtering of the corresponding row.
The horizontal filter uses 512 row filters working in parallel, and each row filter uses many threads working in parallel, so highly parallel data processing is realized. At the same time, the row filter performs the main operations on the data entirely in shared memory, so the data enjoy a very high transfer speed. The highly parallel data processing and the high-speed data transfer give the horizontal filter a very high speed.
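Continuing the previous sketch in the same source file, the horizontal filter can be expressed as a kernel that strides over the rows by block index; the launch configurations shown are the 512 blocks of 256 threads described above.

```cuda
// Sketch of the horizontal filter: each block runs the row_filter sketched
// above on rows blockIdx.x, blockIdx.x + gridDim.x, ...; with the launches
// below, gridDim.x equals 512, matching the loop increment in the text.
__global__ void horizontal_filter(float *coef_all, int w, int h)
{
    for (int i = blockIdx.x; i < h; i += gridDim.x)
        row_filter(coef_all + i * w, w);   // filter row i in place
}

// Step 2: horizontal_filter<<<512, 256>>>(d_A, w, h);
// Step 4 (on the transposed data in region B, whose rows have length h):
//         horizontal_filter<<<512, 256>>>(d_B, h, w);
```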
Step 3: use the row-column transposer to transpose the data in memory region A of the GPU device obtained in step 2, and store the transposed data in memory region B of the GPU device.
In terms of GPU device resources, the row-column transposer uses 256 thread blocks, and each thread block uses 256 threads and an amount of shared memory equal to the size of one column of data. Given two addresses indicating where the data are stored before and after the transposition, together with the row width and the column height of the image, the row-column transposer can complete the transposition of an entire image.
The row-column transposer consists mainly of a doubly nested loop: the outer loop schedules all the thread blocks of the whole GPU device, and the inner loop schedules the threads within a thread block. The initial value of the outer loop variable j is the thread block index, the loop condition is j < row width, j is incremented by 256 after each iteration, and the loop body is the inner loop. The initial value of the inner loop variable i is the index of the thread within the thread block, the loop condition is i < column height, i is incremented by 256 after each iteration, and the loop body performs the data movement. The data movement first stores the input data at address offset i*w+j into position i of the shared memory, and then places the data at position i of the shared memory into the output at address offset i + j*h.
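A minimal CUDA sketch of the row-column transposer, following the doubly nested loop described above, is given below; the float sample type, the maximum column height of 1024 and the launch configuration are assumptions for illustration.

```cuda
// Sketch of the row-column transposer.  The outer loop walks over the
// columns j handled by this block; the inner loops let the 256 threads
// walk over the rows i, staging one column in shared memory before
// writing it out as row j of the output.
#include <cuda_runtime.h>

#define MAX_H 1024   // assumed upper bound on the column height

__global__ void transpose(const float *in, float *out, int w, int h)
{
    __shared__ float col[MAX_H];

    for (int j = blockIdx.x; j < w; j += gridDim.x) {       // outer loop over columns
        for (int i = threadIdx.x; i < h; i += blockDim.x)   // inner loop over rows
            col[i] = in[i * w + j];                          // buffer column j in shared memory
        __syncthreads();
        for (int i = threadIdx.x; i < h; i += blockDim.x)
            out[j * h + i] = col[i];                         // emit it as row j of the output
        __syncthreads();
    }
}

// Step 3: transpose<<<256, 256>>>(d_A, d_B, w, h);
// Step 5 (the transposed image is h wide and w high, so the roles of
//         w and h are swapped): transpose<<<256, 256>>>(d_B, d_A, h, w);
```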
Through this doubly nested loop, the row-column transposer realizes two things. First, by controlling a large number of threads through the loops, it realizes highly parallel data manipulation. Second, instead of moving the data directly from the input position to the corresponding output position, it first buffers the data in shared memory and then fetches them from shared memory to the output position, realizing high-speed data transfer. The highly parallel data manipulation and the high-speed data transfer give the row-column transposer a very high speed.
Step 4: use the horizontal filter again to perform a row transform on the data in memory region B of the GPU device obtained in step 3, and store the transform result back in memory region B of the GPU device. The horizontal filter in this step is identical to the horizontal filter in step 2; performing horizontal filtering on the transposed data is equivalent to performing the vertical-direction filtering of the data before the transposition.
Step 5: use the row-column transposer again to transpose the data in memory region B of the GPU device obtained in step 4, and store the transposed data in memory region A of the GPU device. Repeating the row-column transposition on the result of step 4 restores the orientation changed by the transposition of step 3.
Step 6: copy the data in memory region A of the GPU device back to the host memory. After the data have been copied to the host memory, the original image data are overwritten by the wavelet-transformed data.
Through the above steps, the present invention realizes a high-speed wavelet transform, which can be used to implement high-speed image coding on a general-purpose computer.
Claims (3)
1. A method for realizing a fast wavelet transform with a graphics processing unit (GPU), comprising:
Step 1: copying the image data to be wavelet transformed from the host memory to memory region A of the graphics processing unit GPU device;
Step 2: using the horizontal filter to perform a row transform on the data in memory region A of the GPU device obtained in step 1, and storing the transform result back in memory region A of the GPU device;
Step 3: using the row-column transposer to transpose the data in memory region A of the GPU device obtained in step 2, and storing the transposed data in memory region B of the GPU device;
Step 4: using the horizontal filter again to perform a row transform on the data in memory region B of the GPU device obtained in step 3, and storing the transform result back in memory region B of the GPU device;
Step 5: using the row-column transposer again to transpose the data in memory region B of the GPU device obtained in step 4, and storing the transposed data in memory region A of the GPU device;
Step 6: copying the data in memory region A of the GPU device back to the host memory.
2. The method for realizing a fast wavelet transform with a graphics processing unit GPU according to claim 1, wherein the horizontal filter described in step 2 and step 4 processes the data in parallel with multiple threads and uses shared memory for the transfer of the data.
3. The method for realizing a fast wavelet transform with a graphics processing unit GPU according to claim 1, wherein the row-column transposer described in step 3 and step 5 processes the data in parallel with multiple threads and uses shared memory for the transfer of the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102042369A CN101866493B (en) | 2010-06-18 | 2010-06-18 | Method for realizing fast wavelet transform by using GPU |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102042369A CN101866493B (en) | 2010-06-18 | 2010-06-18 | Method for realizing fast wavelet transform by using GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101866493A CN101866493A (en) | 2010-10-20 |
CN101866493B true CN101866493B (en) | 2012-01-04 |
Family
ID=42958207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010102042369A Expired - Fee Related CN101866493B (en) | 2010-06-18 | 2010-06-18 | Method for realizing fast wavelet transform by using GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101866493B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103198451B (en) * | 2013-01-31 | 2016-01-20 | 西安电子科技大学 | A kind of GPU realizes the method for fast wavelet transform by piecemeal |
CN103414901A (en) * | 2013-08-26 | 2013-11-27 | 江苏新瑞峰信息科技有限公司 | Quick JPED 2000 image compression system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1588451A (en) * | 2004-07-22 | 2005-03-02 | 华中科技大学 | Circuit for realizing direct two dimension discrete small wave change |
CN1589021A (en) * | 2004-09-28 | 2005-03-02 | 华中科技大学 | Parallel two-dimension discrete small wave transform circuit |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6904177B2 (en) * | 2001-03-13 | 2005-06-07 | Canon Kabushiki Kaisha | Filter processing apparatus |
2010-06-18 CN CN2010102042369A patent/CN101866493B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1588451A (en) * | 2004-07-22 | 2005-03-02 | 华中科技大学 | Circuit for realizing direct two dimension discrete small wave change |
CN1589021A (en) * | 2004-09-28 | 2005-03-02 | 华中科技大学 | Parallel two-dimension discrete small wave transform circuit |
Non-Patent Citations (3)
Title |
---|
Keyan Wang, et al., "Efficient line-based VLSI architecture for 2-D lifting DWT," Image Processing, 2006 IEEE International Conference on, 2006. *
Tien-Tsin Wong, et al., "Discrete wavelet transform on consumer-level graphics hardware," IEEE Transactions on Multimedia, 2007, vol. 9, no. 3. *
Wladimir J. van der Laan, et al., "Accelerating wavelet-based video coding on graphics hardware using CUDA," Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis, 2009. *
Also Published As
Publication number | Publication date |
---|---|
CN101866493A (en) | 2010-10-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20120104; Termination date: 20170618