CN102999316A

CN102999316A - Parallel implementation method of the orthogonal matching pursuit algorithm on a GPU (Graphics Processing Unit)

Info

Publication number: CN102999316A
Application number: CN2012104657992A
Authority: CN
Inventors: 张颢; 陈帅; 孟华东; 王希勤
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2012-11-16
Filing date: 2012-11-16
Publication date: 2013-03-27

Abstract

The invention discloses a parallel implementation method of the orthogonal matching pursuit (OMP) algorithm on a GPU (Graphics Processing Unit). The method comprises the following steps: S1, generating an observation matrix on the GPU; S2, iteratively estimating an original signal with the OMP algorithm, using the observation matrix to calculate the observation data corresponding to the estimated original signal, and comparing it with the real observation data to decide whether to terminate the iteration; S3, calculating the column of the observation matrix that has the maximum correlation with the residual and adding it to a basis matrix, the basis matrix being a part of the observation matrix; and S4, estimating the non-zero elements of the original signal on the basis matrix by the least squares method, updating the original signal, and returning to step S2. The method shortens the running time of the OMP algorithm, thereby improving data processing efficiency and reducing cost.

Description

Parallel implementation method of the orthogonal matching pursuit algorithm on a GPU

Technical field

The present invention relates to the field of signal processing technology, and in particular to a parallel implementation method of OMP (Orthogonal Matching Pursuit) on a GPU.

Background technology

In recent years, compressed sensing (CS) theory has attracted wide attention. Provided that the signal is sparse, it samples the data at a rate far below the Nyquist sampling rate and can still recover the original signal completely. Compressed sensing is described by the following mathematical expressions:

For an original signal x ∈ R^N, an observation matrix Φ ∈ R^{M×N} yields the observation vector y ∈ R^M:

y=Φx （1）

where M ≪ N, the number of significant elements in x is S, and S ≪ N. CS theory studies the following problem: given the observation y, estimate a sparse solution x satisfying formula (1), that is, find an x̃ satisfying:

\min \| \tilde{x} \|_0 \quad \text{s.t.} \quad y = \Phi \tilde{x} \qquad (2)

where ‖·‖₀ denotes the L₀ norm, that is, the number of non-zero elements.

At present, a series of algorithms has been proposed for solving the optimization problem of formula (2), including approximate L1 optimization, greedy algorithms and the FOCUSS algorithm; all of them can effectively recover sparse signals in specific scenarios. However, these algorithms share a high computational complexity: when solving large-scale problems, a traditional serial CPU implementation runs for a long time and cannot recover the original sparse signal in real time. Fast computation can be achieved on mainframe computers or clusters, but the cost is high and does not meet the needs of engineering applications.

In recent years, the graphics processing unit (GPU) has developed into a high-speed parallel, multi-core, multi-threaded general-purpose computing platform that is highly cost-effective for compute-intensive problems. The present invention uses the GPU platform to increase the execution speed of the OMP algorithm.

The following documents describe the main background art in this field:

1. Tropp J A, Gilbert A C. Signal recovery from random measurements via orthogonal matching pursuit[J]. IEEE Transactions on Information Theory, 2007, 53(12): 4655-4666.

This document proposes a greedy algorithm for solving the L0-norm minimization problem. Compared with convex optimization algorithms based on the L1-norm approximation, it has lower computational complexity and higher resolution. Compared with the traditional matching pursuit algorithm, the orthogonal projection performed in each iteration increases the probability of successful recovery and the convergence speed.

2. Sangkyun Lee S W. Implementing algorithms for signal and image reconstruction on graphical processing units. Computer Sciences Department, University of Wisconsin-Madison, Tech. Rep., November, 2008.

In this document, Sangkyun Lee et al. of the University of Wisconsin implemented the SpaRSA compressed sensing algorithm on the GPU platform. SpaRSA is a convex optimization algorithm with high computational complexity, so even on the GPU platform it still needs a long computing time. SpaRSA also shares the common drawback of convex optimization algorithms, namely relatively high sidelobes.

3. Andrecut M. Fast GPU implementation of sparse signal recovery from random projections[J]. Engineering Letters, 2009, 17(3): 151-158.

In this document, Andrecut et al. of the University of Calgary implemented a GPU parallelization of the matching pursuit (MP) algorithm. The drawback of this method is that the MP algorithm itself converges slowly, and when the correlation is large the probability of successful recovery is low.

Summary of the invention

(1) Technical problem to be solved

The present invention mainly solves the technical problem that, when existing algorithms solve large-scale problems, a traditional serial CPU implementation has a long running time and a high cost.

(2) Technical solution

To solve the above problem, the invention provides a parallel implementation method of the orthogonal matching pursuit algorithm on a GPU, comprising the following steps:

S1, generating an observation matrix on the GPU;

S2, iteratively estimating the original signal with the orthogonal matching pursuit algorithm, using the observation matrix to calculate the observation data corresponding to the estimated original signal, and comparing it with the true observation data to decide whether to stop the iteration;

S3, calculating the column of the observation matrix that has the maximum correlation with the residual and adding it to a basis matrix, the basis matrix being a part of the observation matrix;

S4, estimating the non-zero elements of the original signal on the basis matrix by the least squares method, updating the original signal, and returning to step S2.

In step S1, the observation matrix is obtained by randomly selecting rows of the DCT (discrete cosine transform) matrix, and its elements are calculated according to the following formula:

\Phi(m, n) = \begin{cases} \frac{1}{\sqrt{N}}, & h(m) = 0,\ 0 \leq n \leq N - 1 \\ \sqrt{\frac{2}{N}} \cos \frac{\pi (2n + 1) h(m)}{2N}, & 1 \leq h(m) \leq N - 1,\ 0 \leq n \leq N - 1 \end{cases}

where Φ is the observation matrix, m is the row index, m ∈ {0, 1, 2, ..., N-1}, n is the column index, N is the length of the signal to be recovered by the orthogonal matching pursuit algorithm, and h is a computer-generated pseudo-random sequence with M elements, M being the number of observations in compressed sensing, with M < N.
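
The element formula above can be sketched on the CPU in plain Python. This is only an illustrative sketch, not the GPU implementation of the invention: the function names `partial_dct_matrix` and `draw_row_indices` are hypothetical, and drawing the row indices without repetition is an assumption (the text only says that h is pseudo-random).

```python
import math
import random

def partial_dct_matrix(h, N):
    """Return the M x N matrix whose m-th row is row h[m] of the N x N DCT matrix."""
    Phi = []
    for hm in h:
        if hm == 0:
            # First case of the formula: constant row 1/sqrt(N).
            row = [1.0 / math.sqrt(N)] * N
        else:
            # Second case: sqrt(2/N) * cos(pi * (2n + 1) * h(m) / (2N)).
            scale = math.sqrt(2.0 / N)
            row = [scale * math.cos(math.pi * (2 * n + 1) * hm / (2.0 * N))
                   for n in range(N)]
        Phi.append(row)
    return Phi

def draw_row_indices(M, N, seed=0):
    # Assumption: the M row indices are distinct; the text does not state this.
    return random.Random(seed).sample(range(N), M)

N, M = 16, 6
h = draw_row_indices(M, N)
Phi = partial_dct_matrix(h, N)
print(len(Phi), len(Phi[0]))  # 6 16
```

Because the rows are taken from an orthonormal DCT matrix, distinct selected rows remain mutually orthogonal with unit norm, which gives a quick sanity check on the generated matrix.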

Further, the GPU distributes the task of generating the elements over 64 threads, where the i-th thread generates (Φ(i, 0), Φ(i, 1), Φ(i, 2), ..., Φ(i, N-1)), and the threads complete the generation of the observation matrix Φ in parallel on multiple processors.

In step S2, the iteration terminates when the variance between the true observation data and the observation data calculated with the observation matrix falls below a specified threshold, expressed mathematically as:

\| y - \Phi \hat{x}_k \|_2 < \epsilon \| y \|_2

where y is the true observation data, Φ is the observation matrix, x̂_k is the original signal estimated after the k-th iteration, ε is the relative error, which is related to the observation noise, and ‖·‖₂ denotes the two-norm of a vector.
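
The termination test can be written down directly. The following is a minimal serial sketch in plain Python; the helper names `l2_norm`, `matvec` and `should_stop` and the small example matrix are hypothetical and serve only to illustrate the inequality.

```python
import math

def l2_norm(v):
    """Two-norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

def matvec(A, x):
    """Matrix-vector product with A stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def should_stop(Phi, x_hat, y, eps):
    """True when ||y - Phi x_hat||_2 < eps * ||y||_2."""
    residual = [yi - pi for yi, pi in zip(y, matvec(Phi, x_hat))]
    return l2_norm(residual) < eps * l2_norm(y)

Phi = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
y = [2.0, 0.0]
print(should_stop(Phi, [0.0, 0.0, 0.0], y, 1e-3))  # False: the residual equals y
print(should_stop(Phi, [2.0, 0.0, 0.0], y, 1e-3))  # True: the residual is zero
```

Scaling the threshold by the norm of y makes ε a relative tolerance, so the same value can be used for observations of different magnitude.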

Further, on the GPU platform the multiplication of the observation matrix and the original signal vector is performed in parallel: each stream processor computes the inner product of one row of the observation matrix with the original signal vector; within a single stream processor, multiple threads multiply parts of the elements of the original signal vector in parallel; and the two-norm is computed by multiple threads in parallel.

In step S3, when the GPU calculates the correlation between each column of the observation matrix and the residual, each stream processor inside the GPU computes the correlation between a column and the residual; the results of the stream processors are then compared, and the column with the maximum correlation is added to the basis matrix.

In step S4, the least-squares estimation is performed by calling the cublasDger function of cublas.

(3) Beneficial effects

The method of the invention shortens the running time of the OMP algorithm, thereby improving data processing efficiency and reducing cost.

Description of drawings

Fig. 1 is a flowchart of the method of the invention;

Fig. 2 is a flowchart of an embodiment of the invention;

Fig. 3 is a schematic diagram of the two-level parallel implementation of matrix-vector multiplication on the GPU;

Fig. 4 compares the computing time of the OMP algorithm on the GPU and on the CPU.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.

Fig. 1 is a flowchart of the method of the invention. The invention provides a parallel implementation method of the orthogonal matching pursuit algorithm on a GPU, comprising the following steps:

S1, generating an observation matrix on the GPU;

In step S1, the observation matrix is obtained by randomly selecting rows of the DCT (discrete cosine transform) matrix, and its elements are calculated according to the following formula:

\Phi(m, n) = \begin{cases} \frac{1}{\sqrt{N}}, & h(m) = 0,\ 0 \leq n \leq N - 1 \\ \sqrt{\frac{2}{N}} \cos \frac{\pi (2n + 1) h(m)}{2N}, & 1 \leq h(m) \leq N - 1,\ 0 \leq n \leq N - 1 \end{cases}

where Φ is the observation matrix, m is the row index, m ∈ {0, 1, 2, ..., N-1}, n is the column index, N is the length of the signal to be recovered by the orthogonal matching pursuit algorithm, and h is a computer-generated pseudo-random sequence with M elements, M being the number of observations in compressed sensing, with M < N.

Further, the GPU distributes the task of generating the elements over 64 threads, where the i-th thread generates (Φ(i, 0), Φ(i, 1), Φ(i, 2), ..., Φ(i, N-1)), and the threads complete the generation of the observation matrix Φ in parallel on multiple processors.

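In step S2, the iteration terminates when the variance between the true observation data and the observation data calculated with the observation matrix falls below a specified threshold, expressed mathematically as:

\| y - \Phi \hat{x}_k \|_2 < \epsilon \| y \|_2

where y is the true observation data, Φ is the observation matrix, x̂_k is the original signal estimated after the k-th iteration, ε is the relative error, which is related to the observation noise, and ‖·‖₂ denotes the two-norm of a vector.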

Embodiment

Fig. 2 is a flowchart of an embodiment of the invention, which comprises the following steps:

Step S1: generate the observation matrix Φ on the GPU. The observation matrix used in the present invention is formed from randomly selected rows of the DCT (Discrete Cosine Transform) matrix. The random row selection is simulated by computer: a series of pseudo-random numbers determines which rows are selected. According to the characteristics of the DCT matrix, the computer generates a pseudo-random sequence of length M, h = (h_0, h_1, h_2, ..., h_{M-1}), h_i ∈ {0, 1, 2, ..., N-1}, which determines the randomness of the random sampling. This yields the underdetermined observation matrix Φ, whose elements are calculated according to the following formula:

\Phi(m, n) = \begin{cases} \frac{1}{\sqrt{N}}, & h(m) = 0,\ 0 \leq n \leq N - 1 \\ \sqrt{\frac{2}{N}} \cos \frac{\pi (2n + 1) h(m)}{2N}, & 1 \leq h(m) \leq N - 1,\ 0 \leq n \leq N - 1 \end{cases}

where N is the length of the vector to be recovered by the OMP algorithm and M is the number of observations in the OMP algorithm, with M < N. To make performance testing convenient, a sparse signal x is generated at random; the number of significant elements in x is S, which is defined as the sparsity in the compressed sensing problem, with S ≪ N, and the amplitudes of the significant elements are generated at random. The observation data y is calculated from Φ and x and is used for recovery by the OMP algorithm.
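
The test-data generation described in this step can be sketched as follows. The function names `make_sparse_signal` and `observe` are hypothetical, the amplitude range is an assumption (the text only says the amplitudes are random), and a small Gaussian matrix stands in for the DCT-based Φ purely to keep the sketch short.

```python
import random

def make_sparse_signal(N, S, seed=0):
    """Length-N signal with exactly S non-zero entries at random positions."""
    rng = random.Random(seed)
    x = [0.0] * N
    for idx in rng.sample(range(N), S):
        # Assumption: amplitudes of magnitude 1..2 with random sign.
        x[idx] = rng.choice((-1.0, 1.0)) * rng.uniform(1.0, 2.0)
    return x

def observe(Phi, x):
    """Observation vector y = Phi x, with Phi stored as a list of rows."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in Phi]

N, M, S = 32, 8, 3
x = make_sparse_signal(N, S, seed=1)
# Stand-in observation matrix (Gaussian entries), not the DCT rows of the text.
rng = random.Random(2)
Phi = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
y = observe(Phi, x)
print(sum(1 for v in x if v != 0.0), len(y))  # 3 8
```

Keeping the true x available makes it possible to compare the recovered signal against the ground truth when measuring recovery performance.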

In the GPU implementation, storage for the M × N float elements of Φ is first allocated on the GPU, and the generation task is distributed over 64 threads: thread i generates (Φ(i, 0), Φ(i, 1), Φ(i, 2), ..., Φ(i, N-1)), and the threads run in a highly parallel manner on multiple processors to complete the generation of the observation matrix. Moreover, the pseudo-random sequence h is accessed repeatedly and only read, so, following the GPU parallel programming model of Nvidia, h can be stored in constant memory. The GPU's optimized access to constant memory effectively reduces the I/O access latency and thus the overall running time.

Step S2: check whether to end the iteration. The data are transferred from the CPU to the GPU and initialized. Before the OMP algorithm steps are executed, memory must first be allocated on the GPU for the observation data and the intermediate variables, and the observation data must be transferred to the GPU. In the concrete implementation, memory for the variables is allocated through the cublasAlloc interface of the cublas vector computation library, and the observation data are transferred from CPU memory to GPU memory with cublasSetVector.

The OMP algorithm estimates the original signal iteratively. The iteration terminates when the energy of the difference between the true observation and the observation calculated from the estimated signal falls below a certain threshold, expressed mathematically as:

\| y - \Phi \hat{x}_k \|_2 < \epsilon \| y \|_2

where y is the true observation, x̂_k is the recovery result after the k-th iteration, ε is the relative error, and ‖·‖₂ denotes the two-norm of a vector.

On the GPU platform the multiplication of a matrix and a vector is highly parallel: each stream processor computes the inner product of one row of the matrix with the vector, and within a single stream processor multiple threads multiply parts of the elements of the vector in parallel. The concrete operation is shown in Fig. 3. The two-norm is likewise computed by multiple threads in parallel: each thread squares the elements of its part of the vector and sums them, and finally the partial sums are added together.
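
The partial-sum structure of the two-norm calculation can be emulated serially in plain Python. In this sketch the loop over `t` stands in for GPU threads; the function name `two_norm_chunked`, the thread count and the strided assignment of elements to threads are assumptions made for illustration.

```python
import math

def two_norm_chunked(v, num_threads=4):
    """Two-norm computed from per-thread partial sums of squares."""
    partial = [0.0] * num_threads
    for t in range(num_threads):
        # "Thread" t handles elements t, t + num_threads, t + 2 * num_threads, ...
        for n in range(t, len(v), num_threads):
            partial[t] += v[n] * v[n]
    # Final step: add the partial sums and take the square root.
    return math.sqrt(sum(partial))

print(two_norm_chunked([3.0, 4.0]))  # 5.0
```

On a real GPU the inner loops would run concurrently and the final addition would itself typically be a parallel reduction; the serial version only shows how the work is divided.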

According to the two-level parallel granularity of the CUDA (Compute Unified Device Architecture) model, the parallel implementation of matrix-vector multiplication is divided into coarse-grained parallelism and fine-grained parallelism. In v = Φ^T r, the multiplication of r with each row of Φ^T (that is, each column φ_i of Φ) is coarse-grained parallelism and is carried out by thread blocks: thread block i computes v_i = ⟨φ_i, r⟩, where ⟨·,·⟩ denotes the inner product of two vectors. Within the computation of v_i = ⟨φ_i, r⟩, the element-by-element multiplications are carried out by multiple threads in parallel, that is, thread j of thread block i computes the products for its share of the elements, where t_i is the intermediate result of the calculation and T is the number of threads in each thread block. A call to __syncthreads guarantees that all threads in the same thread block have finished, after which the intermediate results are combined to give v_i, which completes one matrix-vector multiplication. Because threads in the same thread block can communicate through shared memory, storing the resources v_i and t_i accessed by the same thread block in shared memory effectively reduces the access latency.
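
The two-level decomposition can be emulated serially in plain Python. In this sketch the outer loop plays the role of the thread blocks and the inner loop the threads of one block; the function name `correlate_two_level`, the strided split of elements among threads and the list `t` standing in for shared memory are all assumptions, since the text does not give the exact index mapping.

```python
def correlate_two_level(Phi, r, threads_per_block=4):
    """Compute v = Phi^T r with a block-per-column, thread-per-slice layout."""
    M, N = len(Phi), len(Phi[0])
    T = threads_per_block
    v = [0.0] * N
    for i in range(N):          # coarse grain: "thread block" i handles column i
        t = [0.0] * T           # stand-in for per-block shared memory
        for j in range(T):      # fine grain: "thread" j of block i
            for m in range(j, M, T):
                t[j] += Phi[m][i] * r[m]
        # A barrier (__syncthreads in CUDA) would go here before combining.
        v[i] = sum(t)
    return v

Phi = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]
r = [1.0, -1.0]
print(correlate_two_level(Phi, r))  # [-3.0, -3.0, -3.0]
```

The barrier matters on real hardware because the final sum must not start until every thread of the block has written its partial result.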

Step S3: compute in parallel the column of the observation matrix that has the maximum correlation with the residual, and expand the basis matrix. The OMP algorithm inherits the characteristics of greedy algorithms: in each iteration, the column of the observation matrix with the maximum correlation with the residual is selected and added to the basis matrix. When the GPU calculates the correlation between each column of the matrix and the residual, each stream processor inside the GPU computes the correlation between a column and the residual; the results of the stream processors are then compared, and the column with the maximum correlation is added to the support set. At the same time, the index of the column with the maximum correlation is recorded; the vector formed by the index values of the first k steps is v.
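
The selection rule of this step can be sketched serially in plain Python. The function name `select_column` is hypothetical; taking the absolute value of the inner product as the correlation measure and skipping columns that were already chosen are assumptions in line with the usual formulation of OMP, not details stated in the text.

```python
def select_column(Phi, r, already_chosen=()):
    """Index of the column of Phi with the largest |<phi_i, r>|."""
    M, N = len(Phi), len(Phi[0])
    best_idx, best_val = -1, -1.0
    for i in range(N):
        if i in already_chosen:
            continue  # assumption: a column is never selected twice
        corr = abs(sum(Phi[m][i] * r[m] for m in range(M)))
        if corr > best_val:
            best_idx, best_val = i, corr
    return best_idx

Phi = [[1.0, 0.0, 2.0],
       [0.0, 1.0, -2.0]]
r = [1.0, -1.0]
print(select_column(Phi, r))                       # 2
print(select_column(Phi, r, already_chosen=(2,)))  # 0
```

In the example the correlations are 1, 1 and 4, so column 2 is chosen first; once it is excluded, the tie between columns 0 and 1 is resolved in favour of the lower index.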

Step S4: estimate the non-zero elements of the original signal by the least squares method on the basis matrix of step k. The least-squares estimation is performed by calling the cublasDger function of cublas, which yields the current estimated signal; then return to step S2.
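
Steps S2 to S4 can be put together in a small serial reference sketch in plain Python. This is not the cublas-based implementation described above: here the least-squares step solves the normal equations by Gaussian elimination, which is a swapped-in technique chosen for brevity, and the names `dot`, `least_squares` and `omp`, the cap of M iterations and the example matrix are all assumptions. The sketch also assumes the selected columns are linearly independent.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def least_squares(cols, y):
    """Solve (B^T B) a = B^T y by Gaussian elimination; cols are the columns of B."""
    k = len(cols)
    G = [[dot(cols[a], cols[b]) for b in range(k)] + [dot(cols[a], y)]
         for a in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda row: abs(G[row][c]))
        G[c], G[piv] = G[piv], G[c]
        for row in range(c + 1, k):
            f = G[row][c] / G[c][c]
            for cc in range(c, k + 1):
                G[row][cc] -= f * G[c][cc]
    coeffs = [0.0] * k
    for c in range(k - 1, -1, -1):
        tail = sum(G[c][cc] * coeffs[cc] for cc in range(c + 1, k))
        coeffs[c] = (G[c][k] - tail) / G[c][c]
    return coeffs

def omp(Phi, y, eps=1e-6):
    M, N = len(Phi), len(Phi[0])
    cols = [[Phi[m][n] for m in range(M)] for n in range(N)]
    support, x_hat, r = [], [0.0] * N, list(y)
    norm_y = math.sqrt(dot(y, y))
    if norm_y == 0.0:
        return x_hat, support
    # S2: stop once ||y - Phi x_hat||_2 < eps * ||y||_2 (cap of M columns is an added safeguard).
    while math.sqrt(dot(r, r)) >= eps * norm_y and len(support) < M:
        # S3: pick the not-yet-selected column most correlated with the residual.
        corr = [-1.0 if n in support else abs(dot(cols[n], r)) for n in range(N)]
        support.append(max(range(N), key=lambda n: corr[n]))
        # S4: least squares on the selected columns, then update signal and residual.
        coeffs = least_squares([cols[n] for n in support], y)
        x_hat = [0.0] * N
        for n, a in zip(support, coeffs):
            x_hat[n] = a
        r = [y[m] - sum(Phi[m][n] * x_hat[n] for n in support) for m in range(M)]
    return x_hat, support

s = 1.0 / math.sqrt(3.0)
Phi = [[1.0, 0.0, 0.0, s],
       [0.0, 1.0, 0.0, s],
       [0.0, 0.0, 1.0, s]]
y = [2.0, 0.0, -1.0]   # observations of x = (2, 0, -1, 0)
x_hat, support = omp(Phi, y)
print(support)  # [0, 2]
```

In this example the first iteration selects column 0 and the second selects column 2, after which the residual vanishes and the loop stops with the original two non-zero entries recovered.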

The method of the invention shortens the running time of the OMP algorithm, thereby improving data processing efficiency and reducing cost. Fig. 4 compares the computing time of the OMP algorithm on the GPU and on the CPU.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make further improvements and substitutions without departing from the technical principle of the present invention, and these improvements and substitutions shall also be regarded as falling within the scope of protection of the present invention.

Claims

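1. A parallel implementation method of the orthogonal matching pursuit algorithm on a GPU, characterized by comprising the following steps:

S1, generating an observation matrix on the GPU;

S2, iteratively estimating the original signal with the orthogonal matching pursuit algorithm, using the observation matrix to calculate the observation data corresponding to the estimated original signal, and comparing it with the true observation data to decide whether to stop the iteration;

S3, calculating the column of the observation matrix that has the maximum correlation with the residual and adding it to a basis matrix, the basis matrix being a part of the observation matrix;

S4, estimating the non-zero elements of the original signal on the basis matrix by the least squares method, updating the original signal, and returning to step S2.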

2. the method for claim 1 is characterized in that, in step S1, described observing matrix is for taking out at random the matrix that row obtains to DCT discrete cosine transform matrix, and its element calculates according to following formula:

Φ (m, n) = \{\begin{matrix} \frac{1}{\sqrt{N}}, h (m) = 0,0 \leq n \leq N - 1 \\ \sqrt{\frac{2}{N}} \cos \frac{π (2 n + 1) h (m)}{2 N}, 1 \leq h (m) \leq N - 1,0 \leq n \leq N - 1 \end{matrix}

3. method as claimed in claim 2 is characterized in that, GPU is assigned to the generation task of described element in 64 threads, wherein i thread is responsible for generating (Φ (i, 0), Φ (i, 1), Φ (i, 2) ..., Φ (i, N-1)), a plurality of threads are in the parallel generation of finishing observing matrix Φ of a plurality of processors.

4. the method for claim 1 is characterized in that, in step S2, to be true observation data be lower than the appointed threshold value with the variance of the observation data of utilizing described observing matrix to calculate to the condition that described iteration stops, and is described as with mathematical formulae:

{| | y - Φ {\hat{x}}_{k} | |}_{2} < ϵ {| | y | |}_{2}

Wherein y is true observation data, and Φ is observing matrix,

5. The method of claim 4, characterized in that, on the GPU platform, the multiplication of the observation matrix and the original signal vector is performed in parallel: each stream processor computes the inner product of one row of the observation matrix with the original signal vector; within a single stream processor, multiple threads multiply parts of the elements of the original signal vector in parallel; and the two-norm is computed by multiple threads in parallel.

6. the method for claim 1, it is characterized in that, in step S3, in the correlation process of GPU each row and residual error in calculating described observing matrix, each stream handle of GPU inside is carried out the correlativity of row and residual error, the last result of every flow processor relatively expands to the row of correlativity maximum in the basis matrix.

7. the method for claim 1 is characterized in that, in step S4, finishes least-squares estimation by the cublasDger function that calls cublas.