CN105739951A - GPU-based L1 minimization problem fast solving method - Google Patents

GPU-based L1 minimization problem fast solving method

Info

Publication number
CN105739951A
Authority
CN
China
Prior art keywords
minimization problem
thread
parallel
vector
gpu
Prior art date
Legal status
Granted
Application number
CN201610116008.3A
Other languages
Chinese (zh)
Other versions
CN105739951B (en)
Inventor
高家全
李泽界
王宇
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201610116008.3A
Publication of CN105739951A
Application granted
Publication of CN105739951B
Current legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3867 - Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5044 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities

Abstract

The invention provides a GPU-based method for rapidly solving L1 minimization problems. On NVIDIA GPU devices of the Maxwell architecture, the CUDA parallel computing model is used together with new GPU features and kernel-fusion optimization techniques to solve L1 minimization problems quickly. The method not only includes adaptively optimized designs for vector computation, non-transposed matrix-vector multiplication and transposed matrix-vector multiplication, but can also solve either a single L1 minimization problem or multiple L1 minimization problems concurrently in parallel, requiring only a simple change of the CUDA thread allocation configuration. Experimental results show that the method is efficient and exhibits high parallelism and adaptability, with a large performance improvement over existing parallel solution methods.

Description

A GPU-based fast solution method for L1 minimization problems
Technical field
The present invention relates to the fields of signal processing and face recognition, and more specifically to a GPU-based fast solution method for L1 minimization problems.
Background technology
The L1 minimization problem is min ‖x‖₁ subject to the constraint Ax = b, where A ∈ R^(m×n) (m << n) is a dense matrix of full rank, b ∈ R^m is a given vector, and x ∈ R^n is the unknown solution. The solution of the L1 minimization problem, also called a sparse representation, has been widely applied in many fields, for instance signal processing, machine learning and statistical inference. Researchers have designed many effective algorithms for solving the L1 minimization problem, for example gradient projection methods, truncated Newton interior-point methods, homotopy methods, iterative shrinkage-thresholding methods and augmented Lagrangian methods. In practice b often contains noise, so a variant of this problem, known as the unconstrained basis pursuit denoising problem (BPDN problem) or Lasso problem, is considered:
min_x (1/2)·‖Ax - b‖₂² + λ·‖x‖₁
where λ is a scalar weight.
As the problem scale grows, the execution efficiency of these algorithms drops considerably. An effective way to improve efficiency is to port the algorithms to distributed or many-core architectures, for instance the currently popular Graphics Processing Unit (GPU). Since NVIDIA introduced the CUDA programming model in 2007, GPU-accelerated data processing has become a popular research topic.
Most L1 minimization algorithms consist mainly of dense matrix-vector products and vector operations. Because the CUBLAS library already contains efficient implementations of these operations, existing GPU-accelerated L1 minimization algorithms are mostly built on CUBLAS. However, experiments show that the matrix-vector product routines in CUBLAS exhibit inconsistent performance as the number of matrix rows or columns grows, with a significant gap between the best and worst cases. CUBLAS also does not support kernel fusion, and when several L1 minimization problems are solved concurrently it cannot fully exploit the new features of current GPUs or optimally configure the computing resources of the whole GPU, incurring considerable overhead. Therefore, the present invention provides an efficient parallel solution method for L1 minimization problems on Maxwell-architecture GPU devices, based on the fast iterative shrinkage-thresholding algorithm and fully exploiting GPU hardware resources and computing capability.
Summary of the invention
The object of the present invention is to overcome the deficiencies of existing methods and, by exploiting GPU hardware resources and computing capability, to provide an efficient parallel solution method for L1 minimization problems. The invention provides two solvers, a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems, including an adaptively optimized parallel design for non-transposed matrix-vector multiplication, a parallel design for transposed matrix-vector multiplication, and a streaming parallel design.
To achieve the above object, the present invention adopts the following technical scheme.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding method that can solve the unconstrained basis pursuit denoising problem; it mainly involves matrix-vector products and vector operations and is easy to parallelize. The present invention is therefore based on FISTA and uses the CUDA parallel computing model on NVIDIA Maxwell-architecture GPU devices to solve L1 minimization problems in parallel. Adaptively optimized vector operations, a non-transposed matrix-vector product and a transposed matrix-vector product are designed, and a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems are realized through suitable CUDA thread allocation.
The concrete steps of the solution method are as follows:
1) According to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp allocation setting and the thread allocation setting;
2) Store the data dictionary as a 0-indexed, row-major matrix padded to 32-byte alignment, and transfer the data dictionary and the vector from the host side to the GPU device side;
3) Meanwhile, on the host side, asynchronously compute the input parameters of FISTA;
4) According to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device side; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for multiple concurrent L1 minimization problems on the GPU device side;
5) On the GPU device side, use the adaptively optimized parallel design for non-transposed matrix-vector multiplication to realize the non-transposed matrix-vector product in FISTA;
6) On the GPU device side, use the adaptively optimized parallel design for transposed matrix-vector multiplication to realize the transposed matrix-vector product in FISTA;
7) On the GPU device side, fuse the remaining vector operations and compute them in FISTA in a fused, streaming parallel fashion;
8) Meanwhile, on the host side, asynchronously compute the scalar value;
9) If the convergence condition is reached, stop iterating and transfer the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
FISTA mainly involves vector operations and matrix-vector products, with no matrix inversion or matrix factorization. It is therefore well suited to parallelization and scales easily to large, high-dimensional data. Vector operations and matrix-vector products are low arithmetic-intensity, bandwidth-bound operations; the present invention therefore adopts kernel fusion, merging several vector operations and a matrix-vector product into one kernel, eliminating global-memory accesses for intermediate results and exploiting data locality. In addition, a matrix-vector product consists of multiple inner products, each row of the matrix forming an inner product with the vector, and the vector is reused through shared memory. At the same time, an adaptively optimized allocation strategy is used to make full use of the memory hierarchy of GPU devices of compute capability 5.0 and above; combined with data locality, multi-level cache control is achieved and global-memory accesses are reduced.
In the parallel design for the non-transposed matrix-vector product of step 5), one warp or several warps are allocated, in an adaptively optimized way according to the warp allocation setting, to compute one inner product, and the sparsity of the solution is exploited to reduce the amount of computation.
This parallel design includes the following two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread of a warp then completes its partial reduction, after which shuffle instructions complete the reduction within the warp and the result is stored in contiguous shared memory; this is repeated until the whole vector has been loaded;
2) In the second stage, shuffle instructions reduce the shared-memory data produced by the first stage to obtain the corresponding inner-product results.
In this parallel design a warp contains 32 threads. To obtain the optimal number of warps for computing one inner product, the following adaptive allocation strategy is proposed:
minimize w = sm × 2048 / (k × 32), subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data dictionary matrix. When k = 1 this parallel design needs only the first reduction stage; when k = 32 the vector can be loaded directly into registers.
In the parallel design for the transposed matrix-vector product of step 6), one thread or several threads are allocated, in an adaptively optimized way according to the thread allocation setting, to compute one inner product.
This parallel design also includes a two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction and stores the result in contiguous shared memory;
2) In the second stage, the shared-memory data obtained in the first stage are reduced to obtain the corresponding inner-product results.
This parallel design adopts the following adaptive thread allocation strategy:
minimize t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary matrix. When k = 1, only the first stage is needed.
In the streaming parallel design of step 7), each element of the vector operations, including the soft-thresholding operator, is processed in a streaming load fashion; the operations can also be vectorized, and the CUDA built-in functions are used to eliminate branches.
In step 4), when the parallel solver for a single L1 minimization problem is enabled, one GPU device solves only one L1 minimization problem. The parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by three separate CUDA kernel functions.
In step 4), when the parallel solver for multiple concurrent L1 minimization problems is enabled, one GPU device can solve several L1 minimization problems concurrently. Each L1 minimization problem is solved by one or more thread blocks, and the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by a single CUDA kernel function. In addition, the CUDA built-in function is used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency.
The parallel solution method for L1 minimization problems proposed by the invention fully exploits GPU hardware resources and computing capability, and has high parallelism and adaptability.
Brief description of the drawings
Fig. 1 is a diagram of the memory hierarchy of GPU devices of compute capability 5.0 and above.
Fig. 2 is a schematic diagram of the matrix storage format used in the present invention.
Fig. 3 is a schematic diagram of the 32-byte-aligned matrix padding used in the present invention.
Fig. 4 is a schematic diagram of the kernel fusion of FISTA in the present invention.
Fig. 5 is a schematic diagram of the performance comparison between the GPU and CPU versions of the parallel solver for a single L1 minimization problem in the present invention.
Fig. 6 is a schematic diagram of the performance comparison between the parallel solver for multiple concurrent L1 minimization problems and the single-problem version in the present invention.
Fig. 7 is the flow chart of the method of the present invention.
Detailed description of the invention
In the following description, the present invention is explained in further detail with reference to Figs. 1-7 and a specific implementation.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding algorithm accelerated by combining it with Nesterov's optimal gradient method, and it has a non-asymptotic convergence rate of O(1/k²). The algorithm introduces a new auxiliary sequence {y_k}, k = 1, 2, ..., and its concrete iterative steps are as follows (the update formulas are restated after this paragraph):
Here λ is a scalar weight, soft(u, a) = sign(u)·max{|u| - a, 0} is the soft-thresholding operator, y_1 = x_0, t_1 = 1, L_f is the Lipschitz constant associated with ∇f(·), which can be obtained by computing the spectral norm of A^T A (‖A^T A‖₂), and ∇f(y_k) = A^T(Ay_k - b).
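The iteration itself appears only as a figure in the original filing; for readability, the standard FISTA update consistent with the quantities defined above is restated here (a restatement of the well-known algorithm, not a reproduction of the patent figure):

```latex
x_k     = \operatorname{soft}\!\left(y_k - \tfrac{1}{L_f}\nabla f(y_k),\ \tfrac{\lambda}{L_f}\right), \quad
t_{k+1} = \tfrac{1}{2}\left(1 + \sqrt{1 + 4t_k^2}\right), \quad
y_{k+1} = x_k + \tfrac{t_k - 1}{t_{k+1}}\left(x_k - x_{k-1}\right).
```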
The present invention solves L1 minimization problems using the fast iterative shrinkage-thresholding algorithm, which mainly involves vector operations and matrix-vector products. On NVIDIA Maxwell-architecture GPU devices, the invention accelerates the fast iterative shrinkage-thresholding algorithm in parallel based on the CUDA parallel computing model.
The present invention proposes an adaptively optimized parallel design for the non-transposed matrix-vector product, a parallel design for the transposed matrix-vector product, and a streaming parallel design. Using these parallel designs and a suitable CUDA thread allocation, a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems are realized.
The concrete steps of the parallel solution method are as follows:
1) Read the GPU device information, including the compute capability and the number of streaming multiprocessors; according to the dimensions of the data dictionary and the GPU device information, complete the warp allocation setting and the thread allocation setting;
2) Store the data dictionary A as a 0-indexed, row-major matrix padded to 32-byte alignment, and transfer the data dictionary A, the data vector b and the sparse representation x from the host side to the GPU device side;
3) Meanwhile, on the host side, asynchronously compute the input parameters of the fast iterative shrinkage-thresholding algorithm (such as L_f);
4) According to the number of L1 minimization problems to be solved: if a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device side; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for multiple concurrent L1 minimization problems on the GPU device side;
5) On the GPU device side, use the adaptively optimized parallel design for non-transposed matrix-vector multiplication to realize the non-transposed matrix-vector product Ay_k - b of the fast iterative shrinkage-thresholding algorithm;
6) On the GPU device side, use the adaptively optimized parallel design for transposed matrix-vector multiplication to realize the transposed matrix-vector product A^T(·) of the fast iterative shrinkage-thresholding algorithm;
7) On the GPU device side, use the streaming parallel design to compute the remaining vector operations of the fast iterative shrinkage-thresholding algorithm in a fused manner;
8) Meanwhile, on the host side, asynchronously compute the value of t_{k+1};
9) If the number of iterations or the sparsity of the solution satisfies the preset condition, stop iterating and transfer the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
In step 4) above, when the parallel solver for a single L1 minimization problem is enabled, one GPU device solves only one L1 minimization problem, and three CUDA kernel functions are called to realize, respectively, the non-transposed matrix-vector product, the transposed matrix-vector product and all vector operations of the fast iterative shrinkage-thresholding algorithm; the concrete flow of FISTA is shown in Algorithm 1. The first kernel function realizes Ay_k - b using the parallel design for the non-transposed matrix-vector product; the second kernel realizes A^T(·) using the parallel design for the transposed matrix-vector product; the remaining vector operations are then fused into the third kernel, which uses the streaming parallel design, as shown in Fig. 4. The dimensions of the input and output objects of each kernel function may differ, so each kernel uses its own launch configuration (thread-grid and thread-block configuration).
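For readability, the host-side structure of this three-kernel iteration can be sketched as follows; the kernel names, signatures and launch configurations are illustrative assumptions rather than the patent's actual code, and the handling of the previous iterate x_{k-1} is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Assumed kernel interfaces for the three stages (illustrative placeholders).
__global__ void gemv_notrans_kernel(const float*, const float*, const float*, float*, int, int);
__global__ void gemv_trans_kernel(const float*, const float*, float*, int, int);
__global__ void fused_vector_kernel(float*, float*, const float*, float, float, float, int);

// Host-side FISTA loop of the single-problem solver: three kernel launches per
// iteration, with the scalar t_{k+1} computed on the host (step 8 of the method).
void fista_single(const float* d_A, const float* d_b,
                  float* d_x, float* d_y, float* d_r, float* d_g,
                  int m, int n, float lambda, float Lf, int maxIter)
{
    dim3 blk(256);
    dim3 gridMV((m + 255) / 256), gridVec((n + 255) / 256);
    float t = 1.0f;                                                      // t_1 = 1
    for (int k = 0; k < maxIter; ++k) {
        gemv_notrans_kernel<<<gridMV,  blk>>>(d_A, d_y, d_b, d_r, m, n); // r = A*y_k - b
        gemv_trans_kernel  <<<gridVec, blk>>>(d_A, d_r, d_g, m, n);      // g = A^T * r
        float tNext = 0.5f * (1.0f + sqrtf(1.0f + 4.0f * t * t));        // t_{k+1}
        fused_vector_kernel<<<gridVec, blk>>>(d_x, d_y, d_g,             // soft-threshold and
                                              lambda, Lf,                // momentum update,
                                              (t - 1.0f) / tNext, n);    // fused in one kernel
        t = tNext;
    }
    cudaDeviceSynchronize();
}
```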
In step 4) above, when the parallel solver for multiple concurrent L1 minimization problems is enabled, one GPU device can solve several L1 minimization problems concurrently. Each L1 minimization problem is solved by one or more thread blocks, and a single CUDA kernel function, combining the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations, realizes the fast iterative shrinkage-thresholding algorithm. The __ldg() function is used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency; in addition, the value of t_{k+1} is no longer computed asynchronously on the host side.
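As a minimal illustration of the read-only-cache access mentioned here (the indexing scheme and helper name are assumptions for illustration, not the patent's actual layout):

```cuda
// Dictionary load routed through the read-only data cache inside the fused
// multi-problem kernel. __ldg() is available on devices of compute capability
// 3.5 and above; ldA is the padded leading dimension of the row-major matrix A.
__device__ inline float load_dict(const float* __restrict__ A,
                                  int row, int col, int ldA)
{
    return __ldg(&A[row * ldA + col]);   // cached read of the dictionary element A(row, col)
}
```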
The non-transposed matrix-vector product is defined as Ax (A ∈ R^(m×n), x ∈ R^n) and consists of m inner products (each row of A forms an inner product with x); each inner product can be computed independently. In the parallel design for the non-transposed matrix-vector product Ax of step 5) above, one warp or several warps are allocated to compute one inner product of Ax, several inner products are computed simultaneously, and the warps are assigned to the inner products cyclically. For different matrix sizes and different GPU devices (with different amounts of computing resources), an adaptive warp allocation strategy is proposed that automatically chooses the optimal number k of warps to compute one dot product, so that more CUDA cores and other execution units participate in the computation. This design also exploits the sparsity of the solution to reduce the amount of computation of this kernel.
This parallel design uses shared memory to cache the vector x and includes the following two-stage reduction (a code sketch of this reduction is given after the description of the second stage):
The first stage comprises the following steps:
1) x-load step: all threads of each thread block first cooperatively read a contiguous segment of the vector x into the shared-memory buffer xP, and then perform the partial-reduction step. In this way the accesses to the vector x are coalesced, and sharing the segment of x reduces the number of accesses.
2) partial-reduction step: each thread of a thread block performs a reduction over the segment of x already loaded into shared memory, according to the formula
bVal += xP_i · A_rj
where bVal is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_rj is the element of matrix A corresponding to xP_i. If the vector x has not yet been fully loaded, return to the x-load step; otherwise perform the warp-reduction step. Each thread may need to perform the reduction several times, and the accesses to the matrix A in global memory are also coalesced.
3) warp-reduction step: within each warp, after each thread has computed the partial reduction value it is responsible for, the CUDA shuffle instructions are used to complete the final reduction, and the result is stored in contiguous shared memory.
In the second stage, several warps read this contiguous shared memory and use shuffle instructions to complete the reduction within each warp, obtaining the corresponding inner-product results; if not all inner products have been computed, return to the first stage and continue with the next group of inner products.
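A simplified CUDA sketch of this design for the one-warp-per-inner-product case (k = 1, so only the first reduction stage is needed) is given below. The tiling constants and names are illustrative, the sparsity optimization and the multi-warp second stage are omitted, and the modern __shfl_down_sync variant of the shuffle instruction is used; this is not the patent's actual kernel.

```cuda
#define TILE 256   // elements of x staged in shared memory per pass
#define WARP 32

// y = A*x - b with one warp per row (k = 1): cooperative x-load, per-thread
// partial reduction, and a warp-level shuffle reduction.
__global__ void gemv_notrans_sketch(const float* __restrict__ A,
                                    const float* __restrict__ x,
                                    const float* __restrict__ b,
                                    float* __restrict__ y,
                                    int m, int n, int ldA)
{
    __shared__ float xP[TILE];                       // shared-memory tile of x
    const int lane          = threadIdx.x % WARP;    // lane within the warp
    const int warpInBlock   = threadIdx.x / WARP;
    const int warpsPerBlock = blockDim.x / WARP;

    // Warps are assigned to rows cyclically, as in the adaptive allocation scheme.
    for (int rowBase = blockIdx.x * warpsPerBlock; rowBase < m;
         rowBase += gridDim.x * warpsPerBlock) {
        const int row = rowBase + warpInBlock;       // row handled by this warp
        float bVal = 0.0f;                           // per-thread partial sum
        for (int base = 0; base < n; base += TILE) {
            // x-load step: all threads of the block cooperatively stage x.
            for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                xP[i] = (base + i < n) ? x[base + i] : 0.0f;
            __syncthreads();
            // partial-reduction step: bVal += xP_i * A_rj (coalesced reads of A).
            if (row < m)
                for (int i = lane; i < TILE; i += WARP)
                    if (base + i < n) bVal += xP[i] * A[row * ldA + base + i];
            __syncthreads();                         // xP is reused in the next pass
        }
        // warp-reduction step: shuffle instructions finish the row's inner product.
        for (int off = WARP / 2; off > 0; off >>= 1)
            bVal += __shfl_down_sync(0xffffffff, bVal, off);
        if (row < m && lane == 0) y[row] = bVal - b[row];  // fused subtraction A*y_k - b
    }
}
```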
This parallel design adopts the following adaptive warp allocation strategy:
minimize w = sm × 2048 / (k × 32), subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data dictionary matrix. When k = 1, only the first-stage reduction is needed; when k = 32, the vector is loaded directly into registers.
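A host-side sketch of this selection rule is shown below; stepping k through power-of-two candidates up to 32, and the function name, are assumptions made for illustration.

```cuda
// Pick the largest number k of warps per inner product such that the number of
// warp groups w = sm * 2048 / (k * 32) still covers all m rows (m <= w).
// sm is the number of streaming multiprocessors; 2048 is the resident-thread
// capacity per multiprocessor assumed by the strategy above.
int pick_warps_per_inner_product(int sm, int m)
{
    int k = 1;
    for (int cand = 2; cand <= 32; cand *= 2) {
        int w = sm * 2048 / (cand * 32);    // warp groups produced by this choice
        if (m <= w) k = cand;               // still enough groups: keep growing k
        else break;
    }
    return k;                               // k = 1 means only the first-stage reduction
}
```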
The transposed matrix-vector product is defined as A^T x (A ∈ R^(m×n), x ∈ R^m) and consists of n inner products (each column of A forms an inner product with x); each inner product can be computed independently. In the parallel design for the transposed matrix-vector product of step 6) above, one thread or several threads are allocated to compute one inner product of the transposed matrix-vector product, several inner products are computed simultaneously, and the threads are assigned to the inner products cyclically. For different matrix sizes and different GPU devices (with different amounts of computing resources), an adaptive thread allocation strategy is proposed that automatically chooses the optimal number k of threads to compute one inner product, so that more CUDA cores and other execution units participate in the computation.
This parallel design uses shared memory to cache the vector x and includes the following two-stage reduction (a sketch of the thread-group organisation follows the description of the second stage):
The first stage comprises the following steps:
1) x-load step: all threads of each thread block first cooperatively read a contiguous segment of the vector x into shared memory, and then perform the partial-reduction step.
2) partial-reduction step: each thread of a thread block performs a reduction over the segment of x already loaded into shared memory, according to the formula
bVal += xP_i · A_jc
where bVal is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_jc is the element of matrix A corresponding to xP_i. If x has not yet been fully loaded, return to the x-load step; once x has been fully loaded, perform the second stage. Because the matrix is stored row-major with 0-based indexing, the accesses to A in global memory are not coalesced if the thread groups (k threads forming one group) are organized unreasonably. The thread groups are therefore created according to the following definition, which guarantees coalesced accesses.
Definition 1: Suppose the thread block size is s, h threads are jointly assigned to one dot product of A^T x, and z = s/h. The thread groups are then organized as follows: {0, z, ..., (h-1)z}, {1, z+1, ..., (h-1)z+1}, ..., {z-1, 2z-1, ..., hz-1}.
In the second stage, several warps read this contiguous shared memory and perform the reduction, obtaining the corresponding inner-product results.
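The index mapping implied by Definition 1 can be sketched as follows (names are illustrative): consecutive thread indices fall into consecutive groups, so threads with the same position inside their groups touch adjacent columns of the row-major matrix and their global-memory accesses coalesce.

```cuda
// Thread-group organisation of Definition 1 for A^T * x: a block of s threads
// is split into z = s / h groups of h threads; group g = {g, g+z, ..., g+(h-1)z}.
__device__ inline void thread_group_of(int tid, int s, int h,
                                       int* group, int* posInGroup)
{
    const int z = s / h;     // number of groups, i.e. inner products per block
    *group      = tid % z;   // which column (inner product) this thread works on
    *posInGroup = tid / z;   // which slice of the rows this thread accumulates
}
```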
This parallel design adopts the following adaptive thread allocation strategy:
minimize t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary matrix. When k = 1, only the first-stage reduction is needed.
In the streaming parallel design of step 7) above, each element of the vector operations, including the soft-thresholding operator, is processed in a streaming load fashion, i.e. each thread computes one element; the operations can also be vectorized, and the CUDA built-in function fmax() is used to eliminate branches.
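A minimal device-side sketch of this branch-free soft-thresholding (the function name is illustrative):

```cuda
// soft(u, a) = sign(u) * max(|u| - a, 0) without a branch: fmaxf() replaces
// the conditional, and copysignf() reattaches the sign of u.
__device__ inline float soft_threshold(float u, float a)
{
    float mag = fmaxf(fabsf(u) - a, 0.0f);
    return copysignf(mag, u);
}
```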
To demonstrate the effect, the solution method of the present invention was tested with single-precision matrices on an Intel Xeon dual-core CPU machine equipped with an NVIDIA GTX 980 graphics card; the compilation and runtime environment was CUDA 6.5. Figs. 5 and 6 show the performance of the two parallel solvers proposed by the invention, where CFISTA denotes the CUBLAS-based implementation of FISTA, GFISTA denotes the parallel solver of the present invention for a single L1 minimization problem, and MFISTASOL denotes the parallel solver of the present invention for multiple concurrent L1 minimization problems. Compared with the CUBLAS-based solution method, the single-problem parallel solver of the present invention shows a significant performance improvement; compared with the single-problem parallel solver, the concurrent multi-problem parallel solver of the present invention achieves a further performance gain.
Referring to Fig. 1, the memory hierarchy of NVIDIA GPU devices of compute capability 5.0 and above has multiple levels: each thread can access the shared memory shared within its thread block; the L2 cache automatically caches global memory (located in dynamic random access memory); and the read-only data cache (L1 cache) can be controlled by the program to cache global memory.
Referring to Figs. 2 and 3, the data dictionary is stored as a 0-indexed, row-major matrix padded to 32-byte alignment, which optimizes global-memory access performance and reduces the number of memory transactions.
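For illustration, the padded leading dimension of such a layout can be computed as follows (a sketch under the assumption of single-precision elements; the helper name is illustrative):

```cuda
// Round the row length n up so that every row starts on a 32-byte boundary
// (8 floats); element (i, j) is then stored at A[i * ld + j].
size_t padded_leading_dim(size_t n)
{
    const size_t elemsPer32B = 32 / sizeof(float);              // 8 floats per 32 bytes
    return (n + elemsPer32B - 1) / elemsPer32B * elemsPer32B;   // next multiple of 8
}
```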
Referring to Fig. 4, the kernel fusion of the parallel solver for a single L1 minimization problem: the first kernel function fuses the non-transposed matrix-vector product with the vector subtraction to realize Ay_k - b; the second kernel realizes A^T(·); the remaining vector operations are fused into the third kernel.
Referring to Fig. 5, for each test case the initial x_0 always contains 1024 non-zero elements and b = Ax_0, and the solvers terminate after 50 iterations. The figure lists the execution times of all algorithms in seconds. Compared with CFISTA, GFISTA obtains speed-ups ranging from 37.68 to 53.66 times, with an average speed-up of 48.22, a significant performance improvement.
Referring to Fig. 6, the solver for multiple concurrent L1 minimization problems, MFISTASOL: the test configuration is the same as in Fig. 5, and for each test case 128 L1 minimization problems are solved concurrently. Compared with executing the single-problem parallel solver GFISTA sequentially, MFISTASOL achieves an average speed-up of more than 3.0.

Claims (6)

1. A GPU-based fast solution method for L1 minimization problems, characterised in that, based on the fast iterative shrinkage-thresholding algorithm, the CUDA parallel computing model is adopted on NVIDIA Maxwell-architecture GPU devices to solve L1 minimization problems in parallel; adaptively optimized vector operations, a non-transposed matrix-vector product and a transposed matrix-vector product are designed, and a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems are realized through suitable CUDA thread allocation;
the concrete steps of the solution method are as follows:
1) according to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp allocation setting and the thread allocation setting;
2) store the data dictionary as a 0-indexed, row-major matrix padded to 32-byte alignment; transfer the data dictionary and the vector from the host side to the GPU device side;
3) meanwhile, on the host side, asynchronously compute the input parameters of FISTA;
4) according to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device side; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for multiple concurrent L1 minimization problems on the GPU device side;
5) on the GPU device side, use the adaptively optimized parallel design for non-transposed matrix-vector multiplication to realize the non-transposed matrix-vector product in FISTA;
6) on the GPU device side, use the adaptively optimized parallel design for transposed matrix-vector multiplication to realize the transposed matrix-vector product in FISTA;
7) on the GPU device side, fuse the remaining vector operations and compute them in FISTA in a fused, streaming parallel fashion;
8) meanwhile, on the host side, asynchronously compute the scalar value;
9) if the convergence condition is reached, stop iterating and transfer the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
2. The parallel design for the non-transposed matrix-vector product in step 5) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that: according to the warp allocation setting, one warp or several warps are allocated, in an adaptively optimized way, to compute one inner product, and the sparsity of the solution is exploited to reduce the amount of computation;
this parallel design includes the following two-stage reduction:
1) in the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread of a warp then completes its partial reduction, after which shuffle instructions complete the reduction within the warp and the result is stored in contiguous shared memory; this is repeated until the whole vector has been loaded;
2) in the second stage, shuffle instructions reduce the shared-memory data produced by the first stage to obtain the corresponding inner-product results;
in this parallel design a warp contains 32 threads; to obtain the optimal number of warps for computing one inner product, the following adaptive allocation strategy is proposed:
minimize w = sm × 2048 / (k × 32), subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data dictionary matrix; when k = 1 this parallel design needs only the first reduction stage; when k = 32 the vector can be loaded directly into registers.
3. The parallel design for the transposed matrix-vector product in step 6) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that, according to the thread allocation setting, one thread or several threads are allocated, in an adaptively optimized way, to compute one inner product;
this parallel design also includes a two-stage reduction:
1) in the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction and stores the result in contiguous shared memory;
2) in the second stage, the shared-memory data obtained in the first stage are reduced to obtain the corresponding inner-product results;
this parallel design adopts the following adaptive thread allocation strategy:
minimize t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary matrix; when k = 1, only the first stage is needed.
4. The streaming parallel design in step 7) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that each element of the vector operations, including the soft-thresholding operator, is processed in a streaming load fashion; the operations can also be vectorized, and the CUDA built-in functions are used to eliminate branches.
5. The single-problem parallel solver setting enabled in step 4) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that one GPU device solves only one L1 minimization problem, and the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by three separate CUDA kernel functions.
6. The parallel solver setting for multiple concurrent L1 minimization problems enabled in step 4) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that one GPU device can solve several L1 minimization problems concurrently; each L1 minimization problem is solved by one or more thread blocks, and the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by a single CUDA kernel function; in addition, the CUDA built-in function is used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency.
CN201610116008.3A 2016-03-01 2016-03-01 A GPU-based fast solution method for L1 minimization problems Active CN105739951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610116008.3A CN105739951B (en) 2016-03-01 2016-03-01 A GPU-based fast solution method for L1 minimization problems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610116008.3A CN105739951B (en) 2016-03-01 2016-03-01 A GPU-based fast solution method for L1 minimization problems

Publications (2)

Publication Number Publication Date
CN105739951A true CN105739951A (en) 2016-07-06
CN105739951B CN105739951B (en) 2018-05-08

Family

ID=56248952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610116008.3A Active CN105739951B (en) A GPU-based fast solution method for L1 minimization problems

Country Status (1)

Country Link
CN (1) CN105739951B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502771A (en) * 2016-09-09 2017-03-15 中国农业大学 Time overhead model building method and system based on kernel functions
CN107886519A (en) * 2017-10-17 2018-04-06 杭州电子科技大学 Multichannel chromatogram three-dimensional image fast partition method based on CUDA
WO2019000435A1 (en) * 2017-06-30 2019-01-03 华为技术有限公司 Task processing method and device, medium, and device thereof
CN109709547A (en) * 2019-01-21 2019-05-03 电子科技大学 A kind of reality beam scanning radar acceleration super-resolution imaging method
CN114943194A (en) * 2022-05-16 2022-08-26 水利部交通运输部国家能源局南京水利科学研究院 River pollution tracing method based on geostatistics
US20220358206A1 (en) * 2021-05-10 2022-11-10 Commissariat à l'Energie Atomique et aux Energies Alternatives Method for the execution of a binary code by a microprocessor
CN117785480A (en) * 2024-02-07 2024-03-29 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment
CN117785480B (en) * 2024-02-07 2024-04-26 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120057770A1 (en) * 2010-09-07 2012-03-08 Kwang Eun Jang Method and apparatus for reconstructing image and medical image system employing the method
CN103505206A (en) * 2012-06-18 2014-01-15 山东大学威海分校 Fast and parallel dynamic MRI method based on compressive sensing technology
US9118347B1 (en) * 2011-08-30 2015-08-25 Marvell International Ltd. Method and apparatus for OFDM encoding and decoding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120057770A1 (en) * 2010-09-07 2012-03-08 Kwang Eun Jang Method and apparatus for reconstructing image and medical image system employing the method
US9118347B1 (en) * 2011-08-30 2015-08-25 Marvell International Ltd. Method and apparatus for OFDM encoding and decoding
CN103505206A (en) * 2012-06-18 2014-01-15 山东大学威海分校 Fast and parallel dynamic MRI method based on compressive sensing technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAO YA ZHANG et al.: "Accelerated proximal algorithms for L1-minimization problem", Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2014 11th International Computer Conference on *
LIU Jie et al.: "Performance analysis and comparison of fast L1-norm minimization algorithms", Computer Knowledge and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502771A (en) * 2016-09-09 2017-03-15 中国农业大学 Time overhead model building method and system based on kernel functions
CN106502771B (en) * 2016-09-09 2019-08-02 中国农业大学 Time overhead model building method and system based on kernel function
WO2019000435A1 (en) * 2017-06-30 2019-01-03 华为技术有限公司 Task processing method and device, medium, and device thereof
CN110088730A (en) * 2017-06-30 2019-08-02 华为技术有限公司 Task processing method, device, medium and its equipment
CN110088730B (en) * 2017-06-30 2021-05-18 华为技术有限公司 Task processing method, device, medium and equipment
CN107886519A (en) * 2017-10-17 2018-04-06 杭州电子科技大学 Multichannel chromatogram three-dimensional image fast partition method based on CUDA
CN109709547A (en) * 2019-01-21 2019-05-03 电子科技大学 A kind of reality beam scanning radar acceleration super-resolution imaging method
US20220358206A1 (en) * 2021-05-10 2022-11-10 Commissariat à l'Energie Atomique et aux Energies Alternatives Method for the execution of a binary code by a microprocessor
CN114943194A (en) * 2022-05-16 2022-08-26 水利部交通运输部国家能源局南京水利科学研究院 River pollution tracing method based on geostatistics
CN117785480A (en) * 2024-02-07 2024-03-29 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment
CN117785480B (en) * 2024-02-07 2024-04-26 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment

Also Published As

Publication number Publication date
CN105739951B (en) 2018-05-08

Similar Documents

Publication Publication Date Title
CN105739951A (en) GPU-based L1 minimization problem fast solving method
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
Martín et al. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
CN106055311B (en) MapReduce tasks in parallel methods based on assembly line multithreading
CN104765589B (en) Grid parallel computation preprocess method based on MPI
CN105608135B (en) Data mining method and system based on Apriori algorithm
Rostrup et al. Fast and memory-efficient minimum spanning tree on the GPU
Hugues et al. Sparse matrix formats evaluation and optimization on a GPU
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
US20230409885A1 (en) Hardware Environment-Based Data Operation Method, Apparatus and Device, and Storage Medium
Martínez-del-Amor et al. Population Dynamics P systems on CUDA
CN108984483B (en) Electric power system sparse matrix solving method and system based on DAG and matrix rearrangement
CN110264392B (en) Strong connection graph detection method based on multiple GPUs
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
Liu et al. GPU accelerated fast FEM deformation simulation
Qiao et al. Parallelizing and optimizing neural Encoder–Decoder models without padding on multi-core architecture
Liu et al. Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA
US20160224902A1 (en) Parallel gibbs sampler using butterfly-patterned partial sums
CN109522127B (en) Fluid machinery simulation program heterogeneous acceleration method based on GPU
CN109741421B (en) GPU-based dynamic graph coloring method
CN103678888A (en) Cardiac blood flowing indicating and displaying method based on Euler fluid simulation algorithm
Wen et al. A swap dominated tensor re-generation strategy for training deep learning models
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
Yang et al. Efficient dense structure mining using mapreduce
CN104866297B (en) A kind of method and apparatus for optimizing kernel function

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant