CN101866493A - Method for realizing fast wavelet transform by using GPU - Google Patents

Method for realizing fast wavelet transform by using GPU

Info

Publication number
CN101866493A
Authority
CN
China
Prior art keywords
data
gpu
memory
transposition
memory area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010204236
Other languages
Chinese (zh)
Other versions
CN101866493B (en)
Inventor
李云松
宋长贺
吴宪云
刘凯
王柯俨
肖嵩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN2010102042369A
Publication of CN101866493A
Application granted
Publication of CN101866493B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a method for realizing a fast wavelet transform by using a GPU, which comprises the following steps: (1) copying data from the host memory of a computer to memory area A of the GPU device; (2) performing a row transform on the data in memory area A with a horizontal filter, and storing the transform result back in memory area A; (3) performing a row-column transposition on the data in memory area A with a row-column transposer, and storing the transposed data in memory area B of the GPU device; (4) performing a row transform again on the data in memory area B with the horizontal filter, and storing the transform result back in memory area B; (5) performing a row-column transposition again on the data in memory area B with the row-column transposer, and storing the transposed data back in memory area A; and (6) copying the data in memory area A of the GPU device back to the host memory of the computer. The method processes data in parallel, is simple to implement, and improves the processing speed of the wavelet transform.

Description

Method for realizing a fast wavelet transform with a GPU
Technical field
The present invention relates to the technical field of image processing, and specifically to a method of realizing a fast wavelet transform using the graphics processing unit (GPU) of a computer, so that high-speed image coding can be implemented on a general-purpose computer.
Background art
In the field of digital image processing, the wavelet transform has found wide application owing to its outstanding performance in the JPEG2000 image compression standard. However, the wavelet transform is computationally complex and the amount of data to be processed is generally large, so it is slow, which has seriously limited its application; optimization measures are therefore needed. At present, most optimizations of the wavelet transform are based on application-specific integrated circuits (ASIC) or programmable logic devices (FPGA), which suffer from long development cycles, high difficulty, and high cost.
In view of this situation, researchers have proposed performing the wavelet transform on GPU devices. Relying on many cores working together and very high memory bandwidth, GPU devices achieve very high computing power. In the article "Implementation of the Daubechies 9/7 discrete wavelet transform on GPU" (China Multimedia Communications, 2007, No. 12, pp. 45-50), researchers at Nanjing University of Posts and Telecommunications proposed a fast wavelet transform method aimed at the characteristics of GPU devices. That method implements the wavelet transform with a parallel strategy and, to cope with the limited shared memory of the GPU device, splits every row or column of the data into segments. Although the method has good generality and can handle large images, as the authors themselves acknowledge it inevitably introduces new problems. The one that harms efficiency most is that the head and tail of every segment of every row or column must be symmetrically extended; each extension requires a single thread to handle the head and tail data of its segment alone, during which all the other threads are idle, wasting resources. Another shortcoming is that the column transform follows the row transform, so every column of data must be gathered out, and the method does not propose an effective way to rearrange the data, which makes the column transform inefficient.
Summary of the invention
In view of the problems above, the present invention proposes a fast wavelet transform method that is realized with a GPU, does not segment the data, and has a dedicated row-column transposition function.
The steps of the present invention are as follows:
Step 1: copy the image data to be wavelet-transformed from the host memory to memory area A of the GPU device;
Step 2: using the horizontal filter, perform a row transform on the data in memory area A of the GPU device obtained in step 1, and store the transform result back in memory area A of the GPU device;
Step 3: using the row-column transposer, perform a row-column transposition on the data in memory area A of the GPU device obtained in step 2, and store the transposed data in memory area B of the GPU device;
Step 4: using the horizontal filter, perform a row transform again on the data in memory area B of the GPU device obtained in step 3, and store the transform result back in memory area B of the GPU device;
Step 5: using the row-column transposer, perform a row-column transposition again on the data in memory area B of the GPU device obtained in step 4, and store the transposed data in memory area A of the GPU device;
Step 6: copy the data in memory area A of the GPU device back to the host memory.
Compared with the prior art, the present invention has the following advantages:
First, the present invention only needs enough GPU shared memory to buffer one row or one column of data, and as the design level of GPU devices improves, the available shared memory grows accordingly. The present invention is therefore not easily limited by a shortage of GPU shared memory.
Second, the horizontal filter designed by the present invention makes full use of the multiprocessors and high-speed shared memory of the GPU device, and therefore has a very high processing speed.
Third, the row-column transposer designed by the present invention likewise uses the multiprocessors and high-speed shared memory of the GPU device, and therefore has a very high processing speed. By using this high-speed row-column transposer, the loss of efficiency caused by the non-contiguous storage of column data during the column transform is avoided. At the same time, because the column transform is realized through a row transform, the present invention no longer needs a separate vertical filter, which makes a system realized by this method simpler.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention.
Detailed description of the embodiments
The present invention is written in the CUDA language and can be realized on any NVIDIA GPU device that supports the CUDA architecture. Before the method of the present invention is carried out, the cudaMalloc function should first be called to allocate two memory areas of equal size on the GPU device, denoted memory areas A and B respectively. After the method of the present invention has been used, the cudaFree function should be called to release these two memory areas.
With reference to Fig. 1, the present invention can be realized by the following steps:
Step 1: copy the data from the computer host memory to the GPU device memory. The copy from the computer host memory to the GPU device memory can be performed with the function cudaMemcpy or cudaMemcpy2D, as sketched below.
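A minimal sketch of the preparation step and step 1, assuming a single-precision image of width w and height h whose rows are stored contiguously; the names h_img, d_A, d_B and bytes are illustrative and not part of the patent:

    // Preparation: allocate two device memory areas of equal size (areas A and B).
    float *d_A = NULL, *d_B = NULL;
    size_t bytes = (size_t)w * h * sizeof(float);
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);

    // Step 1: copy the image from the host memory into memory area A.
    cudaMemcpy(d_A, h_img, bytes, cudaMemcpyHostToDevice);
    // For host data whose row pitch differs from w, cudaMemcpy2D(dst, dpitch,
    // src, spitch, widthInBytes, height, kind) may be used instead.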
Step 2: using the horizontal filter, perform a row transform on the data in memory area A of the GPU device obtained in step 1, and store the transform result back in memory area A of the GPU device.
The present invention designs a row filter that performs the row transform on one row of image data; the horizontal filter then realizes the row transform of the whole image by using multiple row filters. In terms of GPU device resources, each row filter uses one thread block consisting of 256 threads and an amount of shared memory equal to one row of data. Given the starting memory address coef_row of a row of data and the row width w, the row filter completes the filtering of that row. To achieve this, the row filter is implemented as follows:
2a) Separate the odd-indexed and even-indexed samples of the row, i.e. de-interleave them, and store the separated data in the shared memory of the thread block. The present invention uses a particular loop control scheme so that the data are processed by many threads in parallel. Every thread block on the GPU device, and every thread within a thread block, has an index value, and the present invention uses these index values to control the loops. Since a row filter is executed by a single thread block, the index value threadIdx.x of each thread within the block is used when designing the loop. Let the loop variable be i; its initial value is set to threadIdx.x, so across the 256 threads of the block the initial values of i are 0 to 255. The loop condition is i < width; thus, if the row width is smaller than the number of threads in the block, only some of the threads do work, and if the width is larger than the number of threads in the block, the loop iterates. After each execution of the loop body, i is incremented by the number of threads in the block, blockDim.x, so the next iteration continues with the data left over from the previous one. Through the cooperation of the 256 threads, every sample of the row is processed in turn and the number of iterations needed is greatly reduced. The shared memory opened in the preparation step is divided into two segments; the starting address of the first segment is labelled s and that of the second segment p. Following the loop control scheme above, the datum at offset 2i from coef_row is placed at offset i from s, and the datum at offset 2i+1 from coef_row is placed at offset i from p. The data movement is completed in at most w/(2×256)+1 iterations. At the end, the function __syncthreads should be called to synchronize all threads in the block; only after synchronization are the data written by each thread visible to the other threads.
2b) Using the loop control scheme described in 2a, apply four lifting passes to the de-interleaved data in shared memory. The first lifting pass adds the two samples s[i] and s[i+1], multiplies the sum by the lifting coefficient, and adds the product to p[i]; after all the data have been processed, one thread synchronization is performed. The second lifting pass is similar to the first, except that p[i] and p[i+1] are added, multiplied by the lifting coefficient, and added to s[i+1]. The third and fourth lifting passes are identical in form to the first and second.
2c) Using the loop control scheme described in 2a, take the data stored at the start of the s and p segments of shared memory after the four lifting passes, namely the low-frequency and high-frequency data, multiply them by their respective normalization coefficients, and write them back to shared memory over the original data of the row.
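A minimal CUDA sketch of one row filter (steps 2a to 2c) under the following assumptions, none of which are stated in the patent text: the lifting and normalization coefficients are the standard CDF 9/7 values, the row width w is even, boundary samples are simply clamped rather than symmetrically extended, and the second and fourth lifting passes are written in the equivalent index-shifted form. Names other than coef_row, w, threadIdx.x, blockDim.x and __syncthreads are illustrative:

    // One row filter: de-interleave a row into shared memory, run four lifting
    // passes, normalize, and write the result back over the row (steps 2a-2c).
    __device__ void row_filter(float *coef_row, int w)
    {
        // Illustrative CDF 9/7 constants; the patent only speaks of
        // "lifting coefficients" and "normalization coefficients".
        const float ALPHA = -1.586134342f;
        const float BETA  = -0.052980119f;
        const float GAMMA =  0.882911076f;
        const float DELTA =  0.443506852f;
        const float K     =  1.230174105f;

        extern __shared__ float buf[];     // one row of dynamic shared memory
        float *s = buf;                    // even samples (future low-pass)
        float *p = buf + w / 2;            // odd samples (future high-pass)
        const int half = w / 2;

        // 2a) de-interleave: even samples go to s, odd samples go to p.
        for (int i = threadIdx.x; i < half; i += blockDim.x) {
            s[i] = coef_row[2 * i];
            p[i] = coef_row[2 * i + 1];
        }
        __syncthreads();

        // 2b) four lifting passes, each followed by a block-wide synchronization.
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            p[i] += ALPHA * (s[i] + ((i + 1 < half) ? s[i + 1] : s[i]));
        __syncthreads();
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            s[i] += BETA * (p[i] + ((i > 0) ? p[i - 1] : p[i]));
        __syncthreads();
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            p[i] += GAMMA * (s[i] + ((i + 1 < half) ? s[i + 1] : s[i]));
        __syncthreads();
        for (int i = threadIdx.x; i < half; i += blockDim.x)
            s[i] += DELTA * (p[i] + ((i > 0) ? p[i - 1] : p[i]));
        __syncthreads();

        // 2c) normalize and write back: low-frequency half, then high-frequency half.
        for (int i = threadIdx.x; i < half; i += blockDim.x) {
            coef_row[i]        = s[i] / K;
            coef_row[i + half] = p[i] * K;
        }
        __syncthreads();
    }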
After the row filter has been designed, the horizontal filter completes the processing of all rows by using 512 row filters. The horizontal filter is implemented as follows: with the same loop control scheme as the row filter, the initial value of the loop variable i is the block index blockIdx.x, the loop condition is i < h, and i is incremented by 512 after each iteration; inside the loop body, the starting memory address coef_all+i*w of one row of data and the row width w are passed to a row filter, which then filters the corresponding row.
The horizontal filter uses 512 row filters working in parallel, and each row filter uses many threads working in parallel, so the data processing is highly parallel. At the same time, the row filter performs its main data operations entirely in shared memory, so the data are transferred at very high speed. Highly parallel data processing and high-speed data transfer give the horizontal filter its high speed.
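A minimal kernel sketch of the horizontal filter under the same assumptions, reusing the row_filter sketch above; the launch uses 512 blocks of 256 threads and one row of floats as dynamic shared memory, as the description specifies, and the kernel name and the coef_all, w, h parameter names are illustrative:

    // Horizontal filter: block blockIdx.x processes rows blockIdx.x,
    // blockIdx.x + 512, blockIdx.x + 1024, ... by invoking one row filter per row.
    __global__ void horizontal_filter(float *coef_all, int w, int h)
    {
        for (int i = blockIdx.x; i < h; i += gridDim.x)   // gridDim.x == 512 here
            row_filter(coef_all + (size_t)i * w, w);
    }

    // Illustrative launch for an image of h rows of width w stored in area A:
    // horizontal_filter<<<512, 256, w * sizeof(float)>>>(d_A, w, h);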
Step 3: using the row-column transposer, perform a row-column transposition on the data in memory area A of the GPU device obtained in step 2, and store the transposed data in memory area B of the GPU device.
In terms of GPU device resources, the row-column transposer uses 256 thread blocks, and each thread block uses 256 threads and an amount of shared memory equal to one column of data. Given two addresses indicating where the data are stored before and after the transposition, together with the image width and height, the row-column transposer completes the transposition of a whole image.
The row-column transposer consists mainly of a two-level loop: the outer loop schedules all thread blocks on the GPU device, and the inner loop schedules the threads within one thread block. The initial value of the outer loop variable j is the thread-block index, the loop condition is that j is less than the image width, j is incremented by 256 on each iteration, and the loop body is the inner loop. The initial value of the inner loop variable i is the index of the thread within its block, the loop condition is that i is less than the image height, i is incremented by 256 on each iteration, and the loop body moves the data. The data movement first loads the input datum at address offset i*w+j into position i of the shared memory, and then writes the datum at position i of the shared memory to address offset i+j*h of the output.
Through this two-level loop, the row-column transposer realizes two things. First, by controlling a large number of threads with the loops, it performs highly parallel data operations. Second, instead of moving each datum directly from its position in the input to the corresponding position in the output, it first caches the data in shared memory and then writes them from shared memory to the output positions, achieving high-speed data transfer. Highly parallel data operations and high-speed data transfer give the row-column transposer its high speed.
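A minimal kernel sketch of the row-column transposer, assuming a launch of 256 blocks of 256 threads with one column of floats as dynamic shared memory; src and dst are the areas before and after the transposition, w and h are the image width and height, and the kernel and variable names are illustrative:

    // Row-column transposer: the outer loop distributes columns j over the 256
    // blocks, the inner loop distributes rows i over the 256 threads of a block.
    // Each column is staged in shared memory before being written out as a row.
    __global__ void transpose_rc(const float *src, float *dst, int w, int h)
    {
        extern __shared__ float col[];             // one column of shared memory

        for (int j = blockIdx.x; j < w; j += gridDim.x) {
            for (int i = threadIdx.x; i < h; i += blockDim.x)
                col[i] = src[(size_t)i * w + j];   // gather column j into shared memory
            for (int i = threadIdx.x; i < h; i += blockDim.x)
                dst[(size_t)j * h + i] = col[i];   // write it out as row j of the output
            // No __syncthreads is needed between the two inner loops, because each
            // thread writes back only the elements it staged itself.
        }
    }

    // Illustrative launch for step 3 (area A -> area B):
    // transpose_rc<<<256, 256, h * sizeof(float)>>>(d_A, d_B, w, h);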
Step 4: using the horizontal filter, perform a row transform again on the data in memory area B of the GPU device obtained in step 3, and store the transform result back in memory area B of the GPU device. The horizontal filter in this step is identical to the horizontal filter in step 2; filtering the transposed data horizontally is equivalent to filtering the data before the transposition in the vertical direction.
Step 5: using the row-column transposer, perform a row-column transposition again on the data in memory area B of the GPU device obtained in step 4, and store the transposed data in memory area A of the GPU device. Performing the row-column transposition once more on the result of step 4 undoes the transposition of step 3.
Step 6: copy the data in memory area A of the GPU device back to the host memory. After the data have been copied to the host memory, the original image data are overwritten by the wavelet-transformed data.
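Putting the pieces together, a minimal host-side sketch of steps 2 to 6, reusing the buffers and kernels sketched above (d_A, d_B, h_img, horizontal_filter and transpose_rc are illustrative names, and one row or column of floats is assumed to fit in the device's shared memory, as the description requires):

    // Step 2: row transform of area A (the rows of the original image).
    horizontal_filter<<<512, 256, w * sizeof(float)>>>(d_A, w, h);
    // Step 3: transpose area A into area B; the rows of B are the columns of A.
    transpose_rc<<<256, 256, h * sizeof(float)>>>(d_A, d_B, w, h);
    // Step 4: row transform of area B, i.e. the column transform of the image
    // (width and height swap roles after the transposition).
    horizontal_filter<<<512, 256, h * sizeof(float)>>>(d_B, h, w);
    // Step 5: transpose area B back into area A, restoring the original orientation.
    transpose_rc<<<256, 256, w * sizeof(float)>>>(d_B, d_A, h, w);
    // Step 6: copy the transformed data back over the original host image.
    cudaMemcpy(h_img, d_A, bytes, cudaMemcpyDeviceToHost);

    // Release the two device memory areas allocated before step 1.
    cudaFree(d_A);
    cudaFree(d_B);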
Through the above steps, the present invention realizes a high-speed wavelet transform, which can be used to implement high-speed image coding on a general-purpose computer.

Claims (3)

1. A method for realizing a fast wavelet transform with a GPU, comprising:
Step 1: copying the image data to be wavelet-transformed from the host memory to memory area A of the GPU device;
Step 2: using the horizontal filter, performing a row transform on the data in memory area A of the GPU device obtained in step 1, and storing the transform result back in memory area A of the GPU device;
Step 3: using the row-column transposer, performing a row-column transposition on the data in memory area A of the GPU device obtained in step 2, and storing the transposed data in memory area B of the GPU device;
Step 4: using the horizontal filter, performing a row transform again on the data in memory area B of the GPU device obtained in step 3, and storing the transform result back in memory area B of the GPU device;
Step 5: using the row-column transposer, performing a row-column transposition again on the data in memory area B of the GPU device obtained in step 4, and storing the transposed data in memory area A of the GPU device;
Step 6: copying the data in memory area A of the GPU device back to the host memory.
2. The method for realizing a fast wavelet transform with a GPU according to claim 1, wherein the horizontal filter described in steps 2 and 4 uses multiple threads to process data in parallel and uses shared memory for data transfer.
3. The method for realizing a fast wavelet transform with a GPU according to claim 1, wherein the row-column transposer described in steps 3 and 5 uses multiple threads to process data in parallel and uses shared memory for data transfer.
CN2010102042369A 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU Expired - Fee Related CN101866493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102042369A CN101866493B (en) 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102042369A CN101866493B (en) 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU

Publications (2)

Publication Number Publication Date
CN101866493A true CN101866493A (en) 2010-10-20
CN101866493B CN101866493B (en) 2012-01-04

Family

ID=42958207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102042369A Expired - Fee Related CN101866493B (en) 2010-06-18 2010-06-18 Method for realizing fast wavelet transform by using GPU

Country Status (1)

Country Link
CN (1) CN101866493B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198451A (en) * 2013-01-31 2013-07-10 西安电子科技大学 Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
CN103414901A (en) * 2013-08-26 2013-11-27 江苏新瑞峰信息科技有限公司 Quick JPED 2000 image compression system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020168113A1 (en) * 2001-03-13 2002-11-14 Tadayoshi Nakayama Filter processing apparatus
CN1588451A (en) * 2004-07-22 2005-03-02 华中科技大学 Circuit for realizing direct two dimension discrete small wave change
CN1589021A (en) * 2004-09-28 2005-03-02 华中科技大学 Parallel two-dimension discrete small wave transform circuit

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020168113A1 (en) * 2001-03-13 2002-11-14 Tadayoshi Nakayama Filter processing apparatus
CN1588451A (en) * 2004-07-22 2005-03-02 华中科技大学 Circuit for realizing direct two dimension discrete small wave change
CN1589021A (en) * 2004-09-28 2005-03-02 华中科技大学 Parallel two-dimension discrete small wave transform circuit

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tien-Tsin Wong, et al., "Discrete Wavelet Transform on Consumer-Level Graphics Hardware," IEEE Transactions on Multimedia, vol. 9, no. 3, Apr. 2007. *
Keyan Wang, et al., "Efficient Line-Based VLSI Architecture for 2-D Lifting DWT," 2006 IEEE International Conference on Image Processing, Oct. 2006. *
Wladimir J. van der Laan, et al., "Accelerating wavelet-based video coding on graphics hardware using CUDA," Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis, Sep. 2009. *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198451A (en) * 2013-01-31 2013-07-10 西安电子科技大学 Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
CN103198451B (en) * 2013-01-31 2016-01-20 西安电子科技大学 A kind of GPU realizes the method for fast wavelet transform by piecemeal
CN103414901A (en) * 2013-08-26 2013-11-27 江苏新瑞峰信息科技有限公司 Quick JPED 2000 image compression system

Also Published As

Publication number Publication date
CN101866493B (en) 2012-01-04

Similar Documents

Publication Publication Date Title
Zhang et al. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system
US11775801B2 (en) Neural processor
DE112013004078B4 (en) Share storage over a unified storage architecture
Klöckner et al. Nodal discontinuous Galerkin methods on graphics processors
CN104077233B (en) Multichannel convolutive layer treating method and apparatus
DE102013014169A1 (en) Dynamically variable circular buffers
Negrut et al. Unified memory in cuda 6.0. a brief overview of related data access and transfer issues
DE112009004320T5 (en) Memory Subsystem
Kono et al. Scalability analysis of tightly-coupled FPGA-cluster for lattice boltzmann computation
DE112013004079T5 (en) Shared virtual memory
CN103049241A (en) Method for improving computation performance of CPU (Central Processing Unit) +GPU (Graphics Processing Unit) heterogeneous device
Zlateski et al. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs
CN101866493B (en) Method for realizing fast wavelet transform by using GPU
CN103198451A (en) Method utilizing graphic processing unit (GPU) for achieving rapid wavelet transformation through segmentation
Quan et al. A fast discrete wavelet transform using hybrid parallelism on GPUs
CN107943592A (en) A kind of method for avoiding GPU resource contention towards GPU cluster environment
DE102023105572A1 (en) Efficient matrix multiplication and addition with a group of warps
DE102013201195A1 (en) Previously scheduled repetitions of divergent operations
CN102300092B (en) Lifting scheme-based 9/7 wavelet inverse transformation image decompressing method
Filippas et al. Streaming dilated convolution engine
Liu et al. Parallel program design for JPEG compression encoding
WO2006014528A1 (en) A method of and apparatus for implementing fast orthogonal transforms of variable size
CN201111042Y (en) Two-dimension wavelet transform integrate circuit structure
Ikuzawa et al. Reducing memory usage by the lifting-based discrete wavelet transform with a unified buffer on a GPU
Lee et al. VLSI design of a wavelet processing core

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120104

Termination date: 20170618