CN102750262A

CN102750262A - Method for realizing sparse signal recovery on a GPU (Graphics Processing Unit) based on the OMP (Orthogonal Matching Pursuit) algorithm

Info

Publication number: CN102750262A
Application number: CN2012102162247A
Authority: CN
Inventors: 张颢; 陈帅; 孟华东; 王希勤
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2012-06-26
Filing date: 2012-06-26
Publication date: 2012-10-24

Abstract

The invention discloses a method for realizing sparse signal recovery on a GPU (Graphics Processing Unit) based on the OMP (Orthogonal Matching Pursuit) algorithm. The method comprises the following steps: generating an observation matrix on the GPU; selecting the column of the observation matrix that has the greatest correlation with the residual and adding it to a basis matrix, wherein the residual is the difference between the actual observation and the observation produced by the estimated signal, and the basis matrix is the matrix formed by the column vectors of the observation matrix that correspond to the nonzero-element index values; estimating, by the method of least squares, the nonzero elements of the original signal on the basis matrix of the k-th step; and continuing to select, on the GPU, the column of the observation matrix that has the greatest correlation with the residual to extend the basis matrix, the iteration ending when the difference between the true observation and the estimated observation falls below a specified threshold. The advantages of the invention are as follows: the parallel GPU implementation of the OMP algorithm combines the low computational complexity and fast convergence of OMP with the GPU's marked acceleration of vector computations, effectively improving the running speed of the sparse recovery algorithm.

Description

Method for realizing sparse signal recovery on a GPU based on the OMP algorithm

Technical field

The invention belongs to the field of signal processing technology, and in particular relates to a method for realizing sparse signal recovery on a GPU based on the OMP algorithm.

Background technology

In recent years, compressed sensing (CS) theory has attracted wide attention. It states that, provided the signal is sparse, the original signal can be recovered completely from data sampled at a rate far below the Nyquist sampling rate. Compressed sensing can be described by the following mathematical expressions:

For an original signal x ∈ R^N and an observation matrix Φ ∈ R^{M×N}, the observation vector y ∈ R^M is obtained as:

y = Φx (1)

where M << N, and the number of significant elements in x is S, with S << N. CS theory studies the following problem: given the observation y, estimate the sparse solution x that satisfies formula (1), that is, find an x̃ that satisfies:

\min \| \tilde{x} \|_{0}, \quad s.t. \quad y = Φ \tilde{x} (2)

where ||·||₀ denotes the L₀ norm, that is, the number of nonzero elements.
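The formulation above can be illustrated with a minimal sketch in plain Python. The matrix and signal values are invented for illustration only; they show that two different signals can produce the same observation y, and that the L₀ norm prefers the sparser one.

```python
# Illustrative values only: phi, x_sparse and x_dense are made up.

def matvec(phi, x):
    """Return y = phi * x for a matrix stored as a list of rows."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]

def l0_norm(x, tol=0.0):
    """Count the entries of x whose magnitude exceeds tol."""
    return sum(1 for xi in x if abs(xi) > tol)

# M = 2 observations of a length N = 4 signal.
phi = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 1.0, 0.0]]
x_sparse = [0.0, 0.0, 3.0, 0.0]   # S = 1 significant element
x_dense = [3.0, 3.0, 0.0, 0.0]    # same observation, but two nonzeros

y = matvec(phi, x_sparse)
```

Both `x_sparse` and `x_dense` satisfy formula (1) for this y; problem (2) selects the one with the smaller nonzero count.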

At present, a series of algorithms have been proposed for the optimization problem of formula (2), including approximate L1 optimization, greedy algorithms, and the FOCUSS algorithm; all of them can recover sparse signals effectively in specific scenarios. However, these algorithms share a high computational complexity: when solving large-scale problems, a traditional serial CPU implementation runs too long to recover the original sparse signal in real time, and although large computers or clusters can compute quickly, their cost is too high for practical applications.

In recent years, the graphics processing unit (GPU) has developed into a multi-core, multi-threaded, highly parallel general-purpose computing platform that offers very good cost-performance on compute-intensive problems. The present invention uses the GPU platform to improve the execution speed of the OMP algorithm.

The following articles and patent documents cover the main background art in this field. To show how the technology evolved, they are listed in chronological order, and the main contribution and shortcomings of each are described in turn.

1. Tropp J A, Gilbert A C. Signal recovery from random measurements via orthogonal matching pursuit [J]. IEEE Transactions on Information Theory, 2007, 53(12): 4655-4666.

This document proposes a greedy algorithm for solving the zero-norm minimization problem. Compared with convex optimization algorithms based on the one-norm approximation, it has lower computational complexity and higher resolution. Compared with the traditional matching pursuit algorithm, the orthogonal projection in each iteration increases the probability of successful recovery and the convergence speed.

2. Lee S, Wright S J. Implementing algorithms for signal and image reconstruction on graphical processing units. Computer Sciences Department, University of Wisconsin-Madison, Tech. Rep., November 2008.

In this document, Sangkyun Lee et al. of the University of Wisconsin implemented the SpaRSA compressed sensing algorithm on the GPU platform. SpaRSA is a convex optimization algorithm with relatively high computational complexity, so it still requires a long computation time even on the GPU. SpaRSA also shares a drawback of convex optimization algorithms, namely relatively high sidelobes.

3. Andrecut M. Fast GPU implementation of sparse signal recovery from random projections [J]. Engineering Letters, 2009, 17(3): 151-158.

In this document, Andrecut et al. of the University of Calgary implemented a GPU parallelization of the matching pursuit (MP) algorithm. The shortcoming of this method is that the MP algorithm itself converges slowly, and when the correlation among the basis vectors is high, the probability of successful recovery is low.

Summary of the invention

To overcome the deficiencies of the prior art described above, the object of the present invention is to provide a method for realizing sparse signal recovery on a GPU based on the OMP algorithm, in which the OMP algorithm is implemented in parallel on the GPU in order to recover sparse signals.

To achieve this object, the technical solution adopted by the present invention is as follows:

A method for realizing sparse signal recovery on a GPU based on the OMP algorithm comprises the following steps:

Step 1: Generate the observation matrix Φ on the GPU, with the matrix elements calculated according to the following formula:

Φ(m, n) = \begin{cases} \frac{1}{\sqrt{N}}, & h(m) = 0,\ 0 \leq n \leq N - 1 \\ \sqrt{\frac{2}{N}} \cos \frac{π (2 n + 1) h(m)}{2 N}, & 1 \leq h(m) \leq N - 1,\ 0 \leq n \leq N - 1 \end{cases}

where h = (h₀, h₁, h₂, ..., h_{M-1}), h_i ∈ {0, 1, 2, ..., N-1}, is a pseudo-random number sequence generated by the computer, N is the length of the signal to be recovered by the OMP algorithm, M is the number of observations in compressed sensing, and M < N;

Step 2: On the GPU, select the column of the observation matrix Φ that has the greatest correlation with the residual and add it to the basis matrix. The residual is defined as the difference between the actual observation and the observation produced by the estimated signal, expressed mathematically as

r = y - Φ \hat{x}

and the basis matrix is defined as the matrix formed by the column vectors of Φ that correspond to the nonzero-element index values;

When the GPU computes the correlation of each column of the matrix with the residual, v = Φ^T r, where φ_i denotes the i-th column of Φ, each stream processor inside the GPU computes the correlation of one column with the residual, that is, v_i = <φ_i, r>. Finally, the results of the stream processors are compared and the column with the greatest correlation is added to the support set; at the same time, the index value of that column is recorded, and the index values from the first k steps form the vector v;

Each stream processor in the GPU is responsible for the inner product of one vector φ_i with the vector r; within each stream processor, φ_i and r are divided into corresponding segments, and multiple threads multiply the segments in parallel;

Step 3: Use the least squares method to estimate the nonzero elements of the original signal on the basis matrix of the k-th step; the least-squares estimate is computed by means of QR decomposition;

Step 4: Return to step 2. When the difference between the true observation and the estimated observation falls below the specified threshold, that is,

\| y - Φ \hat{x}_{k} \|_{2} < ε \| y \|_{2}

the iteration ends, where y is the true observation, x̂_k is the recovery result after the k-th iteration, ε is the relative error, which is related to the observation noise, and ||a||₂ denotes the two-norm of a vector a.

The observation matrix Φ in step 1 is a random row-sampling matrix of the DCT matrix, in which the random row selection is generated by computer simulation: the positions of the selected rows are determined by generating a series of pseudo-random numbers.

A storage space of M × N elements of type float is allocated on the GPU for the observation matrix Φ.

The generation of the observation matrix Φ in step 1 is carried out in parallel on the GPU. Specifically, the generation task is distributed to 64 threads, thread i being responsible for generating (Φ(i, 0), Φ(i, 1), Φ(i, 2), ..., Φ(i, N-1)), and the threads complete the generation of Φ in parallel on multiple processors.

The parallel implementation of the multiplication is divided into coarse-grained parallelism and fine-grained parallelism. In the matrix-vector multiplication Φ^T r, the vector multiplication between each row of Φ^T and r is the coarse-grained parallelism and is carried out by thread blocks: thread block i is responsible for computing v_i = <φ_i, r>, where <·,·> denotes the inner product of two vectors. Within the computation of v_i = <φ_i, r>, the element-by-element multiplications are carried out by multiple threads in parallel; this is the fine-grained parallelism. Specifically, thread j of thread block i is responsible for computing

[equation missing in the source text]

where T_i is the intermediate result of the calculation and T is the number of threads in each thread block. By calling the __syncthreads function in the SDK for GPU parallel development, it is ensured that all threads in the same thread block have finished, after which

[equation missing in the source text]

is calculated, which completes Φ^T r.
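The two levels of parallelism can be emulated sequentially with the following minimal sketch. The per-thread formula is not reproduced in the text, so the strided assignment used here (thread j handles elements j, j+T, j+2T, ...) is an assumption, as are the names `block_inner_product`, `correlate` and `num_threads`; the sketch is plain Python, not GPU code.

```python
def block_inner_product(phi_i, r, num_threads=4):
    """Emulate one thread block computing v_i = <phi_i, r>.

    Assumed scheme: thread j accumulates elements j, j+T, j+2T, ...
    where T = num_threads.
    """
    n = len(r)
    partial = [0.0] * num_threads
    for j in range(num_threads):            # fine-grained: one pass per thread
        for l in range(j, n, num_threads):
            partial[j] += phi_i[l] * r[l]
    # On a GPU a barrier (__syncthreads in CUDA) would be placed here so that
    # every partial sum is complete before the final addition.
    return sum(partial)

def correlate(columns, r, num_threads=4):
    """Coarse-grained level: one emulated thread block per column of Phi."""
    return [block_inner_product(col, r, num_threads) for col in columns]

columns = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0]]
r = [1, 1, 1, 1, 1]
v = correlate(columns, r)
```

Each entry of `v` equals the ordinary inner product of the corresponding column with `r`; only the order of the additions differs.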

The two-norm calculation is carried out by multiple threads in parallel; its concrete implementation can follow the fine-grained parallel implementation of Φ^T r.

Compared with the prior art, the advantages of the present invention are as follows: the parallel GPU implementation of the OMP algorithm combines the low computational complexity and fast convergence of OMP with the GPU's marked acceleration of vector computations, effectively improving the running speed of the sparse recovery algorithm.

Description of drawings

Fig. 1 is a flow chart of the parallel OMP algorithm.

Fig. 2 shows the coarse-grained parallelism of matrix-vector multiplication on the GPU.

Fig. 3 shows the fine-grained parallelism of matrix-vector multiplication on the GPU.

Fig. 4 compares the computation time of the OMP algorithm on the GPU and the CPU.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

In the parallel OMP algorithm flow chart of Fig. 1, storage space is first allocated on the GPU and initialized. The iteration is then carried out, and it is divided into the following four steps:

Step 1: compute the residual energy and check whether to terminate the iteration.

Step 2: compute in parallel the correlation between the observation matrix and the residual vector, then select in parallel the index of the column with the greatest correlation.

Step 3: add the column vector corresponding to the column index produced in the previous step to the basis matrix.

Step 4: estimate the recovery result by the method of least squares on the expanded basis matrix. Jump to step 1.

After the iteration terminates, the recovery result of the last iteration is the result of the algorithm.

Specifically, the method of the present invention comprises the following steps:

Step 1: Generate the observation matrix Φ on the GPU. A pseudo-random sequence h = (h₀, h₁, h₂, ..., h_{M-1}) of length M, with h_i ∈ {0, 1, 2, ..., N-1}, is generated by computer simulation; it determines the randomness of the random sampling and thereby yields an underdetermined observation matrix Φ. The elements of Φ are calculated according to the following formula:

Φ(m, n) = \begin{cases} \frac{1}{\sqrt{N}}, & h(m) = 0,\ 0 \leq n \leq N - 1 \\ \sqrt{\frac{2}{N}} \cos \frac{π (2 n + 1) h(m)}{2 N}, & 1 \leq h(m) \leq N - 1,\ 0 \leq n \leq N - 1 \end{cases}

where N is the length of the vector to be recovered by the OMP algorithm, M is the number of observations in the OMP algorithm, and M < N. To make performance testing convenient, a sparse signal x is generated at random, in which the number of significant elements is S; S is defined as the sparsity in the compressed sensing problem, S << N, and the amplitudes of the significant elements are generated at random. The observation data y is calculated from Φ and x and is used for recovery by the OMP algorithm.

In the GPU implementation, storage for the M × N float elements of Φ is first allocated on the GPU, and the generation task is distributed to 64 threads, thread i being responsible for generating (Φ(i, 0), Φ(i, 1), Φ(i, 2), ..., Φ(i, N-1)); the threads run in a highly parallel fashion on multiple processors to complete the generation of the observation matrix. In addition, the pseudo-random sequence h is accessed repeatedly and only ever read, so, following the Nvidia GPU parallel programming model, h can be stored in constant memory. The GPU optimizes read access to constant memory, which effectively reduces the I/O access latency and thus the overall running time.
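The generation of Φ in step 1 can be sketched sequentially as follows. This is plain Python rather than GPU code; the function names and the `seed` parameter are hypothetical, and drawing distinct row indices with `random.sample` is an assumption, since the text only states that h is a pseudo-random sequence over {0, ..., N-1}.

```python
import math
import random

def dct_row(k, n_len):
    """Row k of the orthonormal N-point DCT matrix, per the formula for Phi."""
    if k == 0:
        return [1.0 / math.sqrt(n_len)] * n_len
    scale = math.sqrt(2.0 / n_len)
    return [scale * math.cos(math.pi * (2 * n + 1) * k / (2 * n_len))
            for n in range(n_len)]

def observation_matrix(m_rows, n_len, seed=0):
    """Return (h, phi): the sampled row indices and the M x N matrix."""
    rng = random.Random(seed)
    # Assumption: row indices are distinct, so the sampled rows stay orthogonal.
    h = rng.sample(range(n_len), m_rows)
    # Each iteration builds one full row; on the GPU this is the unit of work
    # that the text assigns to thread i.
    phi = [dct_row(h[m], n_len) for m in range(m_rows)]
    return h, phi

h, phi = observation_matrix(4, 16, seed=1)
```

Because each row depends only on its own index h(m), the rows can be produced independently, which is what makes the per-thread assignment in the text possible.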

Step 2: Transfer the data from the CPU to the GPU and initialize the data. Before the OMP algorithm steps are executed, storage space must first be allocated on the GPU for the observation data and the intermediate variables, and the observation data must be transferred to the GPU. In the concrete implementation, memory for the variables is allocated through the cublasAlloc interface of the cuBLAS vector computation library, and the observation data is transferred from CPU memory to GPU memory through cublasSetVector.

Step 3: Check whether to end the iteration. The OMP algorithm estimates the original signal iteratively; the iteration terminates when the energy of the difference between the true observation and the observation calculated from the estimated signal falls below a certain threshold, which is expressed mathematically as:

\| y - Φ \hat{x}_{k} \|_{2} < ε \| y \|_{2}

where y is the true observation, x̂_k is the recovery result after the k-th iteration, ε is the relative error, and ||·||₂ denotes the two-norm of a vector.

The matrix-vector multiplication is highly parallelized on the GPU platform: each stream processor is responsible for the inner product of one row of the matrix with the vector, and within a single stream processor multiple threads multiply parts of the vector in parallel. The concrete operation is shown in Fig. 2. The two-norm calculation is also carried out by multiple threads in parallel: each thread computes the squares of its part of the vector and sums them, and the partial sums are added together at the end.
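The termination test of step 3 can be sketched as below, with the two-norm built from per-thread partial sums of squares as just described. The strided split of the vector among threads and the names `two_norm`, `should_stop` and `num_threads` are assumptions made for illustration; the code runs sequentially in plain Python.

```python
import math

def two_norm(vec, num_threads=4):
    """Two-norm from partial sums of squares, one partial sum per thread."""
    # Assumed split: emulated thread j handles elements j, j+T, j+2T, ...
    partial = [sum(v * v for v in vec[j::num_threads])
               for j in range(num_threads)]
    return math.sqrt(sum(partial))

def should_stop(y, y_est, eps):
    """True when ||y - y_est||_2 < eps * ||y||_2."""
    residual = [a - b for a, b in zip(y, y_est)]
    return two_norm(residual) < eps * two_norm(y)

y = [3.0, 4.0]
close_enough = should_stop(y, [3.0, 4.01], eps=0.01)   # small residual
too_far = should_stop(y, [0.0, 0.0], eps=0.01)         # residual equals y
```

Here `y_est` stands for the product of Φ with the current estimate; it is passed in directly so that the sketch stays independent of how that product is computed.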

Step 4: Compute in parallel the column of the observation matrix that has the greatest correlation with the residual, and expand the basis matrix. The OMP algorithm inherits the characteristics of greedy algorithms: in each iteration, the column of the observation matrix with the greatest correlation with the residual is selected and added to the basis matrix. When the GPU computes the correlation of each column of the matrix with the residual, each stream processor inside the GPU computes the correlation of one column with the residual; finally, the results of the stream processors are compared and the column with the greatest correlation is added to the support set. At the same time, the index value of that column is recorded; the vector formed by the index values of the first k steps is v.
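The column selection of step 4 can be sketched as follows. Measuring correlation by the absolute value of the inner product is the usual OMP convention and is an assumption here, since the text only says "greatest correlation"; `select_column` is a hypothetical name and the numbers are illustrative.

```python
def select_column(phi, r):
    """Return (best_index, v) where v = Phi^T r and best_index maximizes |v_n|.

    phi is an M x N matrix stored as a list of rows; r has length M.
    """
    m_rows, n_cols = len(phi), len(phi[0])
    # One inner product per column; on the GPU each would go to its own
    # stream processor.
    v = [sum(phi[m][n] * r[m] for m in range(m_rows)) for n in range(n_cols)]
    # Final comparison across all per-column results.
    best = max(range(n_cols), key=lambda n: abs(v[n]))
    return best, v

phi = [[1.0, 0.0, 0.5],
       [0.0, 1.0, 0.5]]
r = [0.2, -2.0]
best, v = select_column(phi, r)
support = [best]   # indices recorded so far, extended once per iteration
```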

Step 5: Use the least squares method to estimate the nonzero elements of the original signal on the basis matrix of the k-th step.

The least-squares estimation is completed by calling the cublasDger function of cuBLAS, which yields the current estimated signal. Return to step 3.
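The loop formed by steps 3 to 5 can be sketched end to end in plain Python. This is not the cuBLAS-based implementation described above: the least-squares step here uses a modified Gram-Schmidt QR factorization, which is only one possible way to realize a QR-based solve, and the absolute-value correlation, the function names, the default `eps` and the early exit on a repeated index are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def qr_least_squares(cols, y):
    """Minimize ||y - A c||_2 for A with the given (independent) columns."""
    k = len(cols)
    q, r = [], [[0.0] * k for _ in range(k)]
    for j, col in enumerate(cols):           # modified Gram-Schmidt
        w = list(col)
        for i in range(j):
            r[i][j] = dot(q[i], w)
            w = [wv - r[i][j] * qv for wv, qv in zip(w, q[i])]
        r[j][j] = math.sqrt(dot(w, w))
        q.append([wv / r[j][j] for wv in w])
    b = [dot(qi, y) for qi in q]             # Q^T y
    c = [0.0] * k
    for i in reversed(range(k)):             # back-substitution on R c = Q^T y
        c[i] = (b[i] - sum(r[i][j] * c[j] for j in range(i + 1, k))) / r[i][i]
    return c

def omp(phi, y, eps=1e-6):
    """Sequential OMP sketch; phi is M x N as a list of rows."""
    m_rows, n_cols = len(phi), len(phi[0])
    columns = [[phi[m][n] for m in range(m_rows)] for n in range(n_cols)]
    support, coeffs, residual = [], [], list(y)
    y_norm = math.sqrt(dot(y, y))
    for _ in range(m_rows):
        if math.sqrt(dot(residual, residual)) < eps * y_norm:
            break                            # step 3: termination test
        corr = [abs(dot(col, residual)) for col in columns]
        best = max(range(n_cols), key=lambda n: corr[n])
        if best in support:
            break                            # no new column to add
        support.append(best)                 # step 4: expand the basis
        basis = [columns[n] for n in support]
        coeffs = qr_least_squares(basis, y)  # step 5: least squares
        est = [sum(c * col[m] for c, col in zip(coeffs, basis))
               for m in range(m_rows)]
        residual = [a - b for a, b in zip(y, est)]
    x_hat = [0.0] * n_cols
    for n, c in zip(support, coeffs):
        x_hat[n] = c
    return x_hat

phi = [[1.0, 0.0, 0.0, 0.6],
       [0.0, 1.0, 0.0, 0.8],
       [0.0, 0.0, 1.0, 0.0]]
x_hat = omp(phi, [0.6, 0.8, 2.0])   # observation of x = (0, 0, 2, 1)
```

In this small example the two nonzero entries are picked up in two iterations; recovery is not guaranteed for arbitrary matrices and sparsity levels.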

Fig. 2 shows the coarse-grained parallelism in the implementation of matrix-vector multiplication on the GPU: the vector multiplication between each row of the matrix Φ^T and r is the coarse-grained parallelism and is carried out by thread blocks, thread block i being responsible for computing v_i = <φ_i, r>, where <·,·> denotes the inner product of two vectors.

Fig. 3 shows the fine-grained parallelism in the implementation of matrix-vector multiplication on the GPU: within the computation of v_i = <φ_i, r>, the element-by-element multiplications are carried out by multiple threads in parallel. Specifically, thread j of thread block i is responsible for computing

[equation missing in the source text]

where T_i is the intermediate result of the calculation and T is the number of threads in each thread block. By calling the __syncthreads function in the SDK for GPU parallel development, it is ensured that all threads in the same thread block have finished, after which

[equation missing in the source text]

is calculated, which completes Φ^T r.

As shown in Fig. 4, in the comparison of the computation time of the OMP algorithm on the GPU and the CPU, when the data size is very small the GPU spends a long time on start-up and cannot exploit its parallelism on so little data, so its overall computation time is longer. As the data size increases, the parallel advantage of the GPU gradually emerges, and the time difference between the GPU implementation and the traditional CPU implementation grows exponentially.

Claims

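1. A method for realizing sparse signal recovery on a GPU based on the OMP algorithm, characterized in that it comprises the following steps:

Step 1: generating the observation matrix Φ on the GPU, the matrix elements being calculated according to the following formula:

Φ(m, n) = \begin{cases} \frac{1}{\sqrt{N}}, & h(m) = 0,\ 0 \leq n \leq N - 1 \\ \sqrt{\frac{2}{N}} \cos \frac{π (2 n + 1) h(m)}{2 N}, & 1 \leq h(m) \leq N - 1,\ 0 \leq n \leq N - 1 \end{cases}

where h = (h₀, h₁, h₂, ..., h_{M-1}), h_i ∈ {0, 1, 2, ..., N-1}, is a pseudo-random number sequence generated by the computer, N is the length of the signal to be recovered by the OMP algorithm, M is the number of observations in compressed sensing, and M < N;

Step 2: on the GPU, selecting the column of the observation matrix Φ that has the greatest correlation with the residual and adding it to the basis matrix, wherein the residual is defined as the difference between the actual observation and the observation produced by the estimated signal, and the basis matrix is defined as the matrix formed by the column vectors of Φ that correspond to the nonzero-element index values;

each stream processor inside the GPU computes the correlation of one column with the residual, that is, v_i = <φ_i, r>; finally, the results of the stream processors are compared and the column with the greatest correlation is added to the support set; at the same time, the index value of that column is recorded, and the index values from the first k steps form the vector v;

Step 3: using the least squares method to estimate the nonzero elements of the original signal on the basis matrix of the k-th step, the least-squares estimate being computed by means of QR decomposition;

Step 4: returning to step 2, and ending the iteration when the difference between the true observation and the estimated observation falls below the specified threshold, that is, when

\| y - Φ \hat{x}_{k} \|_{2} < ε \| y \|_{2}

where y is the true observation, x̂_k is the recovery result after the k-th iteration, ε is the relative error, which is related to the observation noise, and ||a||₂ denotes the two-norm of a vector a.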

2. The method for realizing sparse signal recovery on a GPU according to claim 1, characterized in that the observation matrix Φ in step 1 is a random row-sampling matrix of the DCT matrix, in which the random row selection is generated by computer simulation, the positions of the selected rows being determined by generating a series of pseudo-random numbers.

3. The method for realizing sparse signal recovery on a GPU according to claim 1, characterized in that a storage space of M × N elements of type float is allocated on the GPU for the observation matrix Φ.

4. The method for realizing sparse signal recovery on a GPU according to claim 1, characterized in that the generation of the observation matrix Φ in step 1 is carried out in parallel on the GPU; specifically, the generation task is distributed to 64 threads, thread i being responsible for generating (Φ(i, 0), Φ(i, 1), Φ(i, 2), ..., Φ(i, N-1)), and the threads complete the generation of Φ in parallel on multiple processors.

5. The method for realizing sparse signal recovery on a GPU according to claim 1, characterized in that the parallel implementation of the multiplication is divided into coarse-grained parallelism and fine-grained parallelism; in the matrix-vector multiplication Φ^T r, the vector multiplication between each row of Φ^T and r is the coarse-grained parallelism and is carried out by thread blocks, thread block i being responsible for computing v_i = <φ_i, r>, where <·,·> denotes the inner product of two vectors; within the computation of v_i = <φ_i, r>, the element-by-element multiplications are carried out by multiple threads in parallel, which is the fine-grained parallelism; specifically, thread j of thread block i is responsible for computing

[equation missing in the source text]

which completes Φ^T r.

6. The method for realizing sparse signal recovery on a GPU according to claim 1, characterized in that the two-norm calculation is carried out by multiple threads in parallel.