CN105739951A - GPU-based L1 minimization problem fast solving method - Google Patents

GPU-based L1 minimization problem fast solving method

Info

Publication number
CN105739951A
Authority
CN
China
Prior art keywords
minimization problem
thread
parallel
vector
gpu
Prior art date
Legal status
Granted
Application number
CN201610116008.3A
Other languages
Chinese (zh)
Other versions
CN105739951B (en)
Inventor
高家全
李泽界
王宇
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201610116008.3A
Publication of CN105739951A
Application granted
Publication of CN105739951B
Current legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3867 - Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5044 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities

Abstract

The invention provides a GPU-based method for rapidly solving L1 minimization problems. On NVIDIA GPU devices of the Maxwell architecture, the CUDA parallel computing model is used together with new GPU features and kernel-fusion optimization techniques to solve L1 minimization problems quickly. The method not only includes adaptively optimized designs for vector computation, non-transposed matrix-vector multiplication and transposed matrix-vector multiplication, but can also solve either a single L1 minimization problem or multiple L1 minimization problems concurrently in parallel, requiring only a simple change of the CUDA thread allocation configuration. Experimental results show that the method is efficient and exhibits high parallelism and adaptability, with a large performance improvement over existing parallel solution methods.

Description

A GPU-based fast solution method for L1 minimization problems
Technical field
The present invention relates to the fields of signal processing and face recognition, and more specifically to a GPU-based fast solution method for L1 minimization problems.
Background technology
The L1 minimization problem is min ‖x‖₁ subject to the constraint Ax = b, where A ∈ R^(m×n) (m << n) is a dense matrix of full rank, b ∈ R^m is a given vector, and x ∈ R^n is the unknown solution. The solution of the L1 minimization problem, also called a sparse representation, has been widely applied in many fields, for instance signal processing, machine learning and statistical inference. Researchers have designed many effective algorithms for solving the L1 minimization problem, for example gradient projection methods, truncated Newton interior-point methods, homotopy methods, iterative shrinkage-thresholding methods and augmented Lagrangian methods. In practice b often contains noise, so a variant of this problem, known as the unconstrained basis pursuit denoising problem (BPDN problem) or Lasso problem, is considered:
min_x (1/2)·‖Ax - b‖₂² + λ·‖x‖₁
where λ is a scalar weight.
As the problem scale grows, the execution efficiency of these algorithms drops considerably. An effective way to improve efficiency is to port the algorithms to distributed or many-core architectures, for instance the currently popular Graphics Processing Unit (GPU). Since NVIDIA introduced the CUDA programming model in 2007, GPU-accelerated data processing has become a popular research topic.
Most L1 minimization algorithms consist mainly of dense matrix-vector products and vector operations. Because the CUBLAS library already contains efficient implementations of these operations, existing GPU-accelerated L1 minimization algorithms are mostly built on CUBLAS. However, experiments show that the matrix-vector product routines in CUBLAS exhibit inconsistent performance as the number of matrix rows or columns grows, with a significant gap between the best and worst cases. CUBLAS also does not support kernel fusion, and when several L1 minimization problems are solved concurrently it cannot fully exploit the new features of current GPUs or optimally configure the computing resources of the whole GPU, incurring considerable overhead. Therefore, the present invention provides an efficient parallel solution method for L1 minimization problems on Maxwell-architecture GPU devices, based on the fast iterative shrinkage-thresholding algorithm and fully exploiting GPU hardware resources and computing capability.
Summary of the invention
The object of the present invention is to overcome the deficiencies of existing methods and, by exploiting GPU hardware resources and computing capability, to provide an efficient parallel solution method for L1 minimization problems. The invention provides two solvers, a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems, including an adaptively optimized parallel design for non-transposed matrix-vector multiplication, a parallel design for transposed matrix-vector multiplication, and a streaming parallel design.
To achieve the above object, the present invention adopts the following technical scheme.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding method that can solve the unconstrained basis pursuit denoising problem; it mainly involves matrix-vector products and vector operations and is easy to parallelize. The present invention is therefore based on FISTA and uses the CUDA parallel computing model on NVIDIA Maxwell-architecture GPU devices to solve L1 minimization problems in parallel. Adaptively optimized vector operations, a non-transposed matrix-vector product and a transposed matrix-vector product are designed, and a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems are realized through suitable CUDA thread allocation.
The concrete steps of the solution method are as follows:
1) According to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp allocation setting and the thread allocation setting;
2) Store the data dictionary as a 0-indexed, row-major matrix padded to 32-byte alignment, and transfer the data dictionary and the vector from the host side to the GPU device side;
3) Meanwhile, on the host side, asynchronously compute the input parameters of FISTA;
4) According to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device side; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for multiple concurrent L1 minimization problems on the GPU device side;
5) On the GPU device side, use the adaptively optimized parallel design for non-transposed matrix-vector multiplication to realize the non-transposed matrix-vector product in FISTA;
6) On the GPU device side, use the adaptively optimized parallel design for transposed matrix-vector multiplication to realize the transposed matrix-vector product in FISTA;
7) On the GPU device side, fuse the remaining vector operations and compute them in FISTA in a fused, streaming parallel fashion;
8) Meanwhile, on the host side, asynchronously compute the scalar value;
9) If the convergence condition is reached, stop iterating and transfer the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
FISTA mainly involves vector operations and matrix-vector products, with no matrix inversion or matrix factorization. It is therefore well suited to parallelization and scales easily to large, high-dimensional data. Vector operations and matrix-vector products are low arithmetic-intensity, bandwidth-bound operations; the present invention therefore adopts kernel fusion, merging several vector operations and a matrix-vector product into one kernel, eliminating global-memory accesses for intermediate results and exploiting data locality. In addition, a matrix-vector product consists of multiple inner products, each row of the matrix forming an inner product with the vector, and the vector is reused through shared memory. At the same time, an adaptively optimized allocation strategy is used to make full use of the memory hierarchy of GPU devices of compute capability 5.0 and above; combined with data locality, multi-level cache control is achieved and global-memory accesses are reduced.
In the parallel design for the non-transposed matrix-vector product of step 5), one warp or several warps are allocated, in an adaptively optimized way according to the warp allocation setting, to compute one inner product, and the sparsity of the solution is exploited to reduce the amount of computation.
This parallel design includes the following two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread of a warp then completes its partial reduction, after which shuffle instructions complete the reduction within the warp and the result is stored in contiguous shared memory; this is repeated until the whole vector has been loaded;
2) In the second stage, shuffle instructions reduce the shared-memory data produced by the first stage to obtain the corresponding inner-product results.
In this parallel design a warp contains 32 threads. To obtain the optimal number of warps for computing one inner product, the following adaptive allocation strategy is proposed:
minimize w = sm × 2048 / (k × 32), subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data dictionary matrix. When k = 1 this parallel design needs only the first reduction stage; when k = 32 the vector can be loaded directly into registers.
In the parallel design for the transposed matrix-vector product of step 6), one thread or several threads are allocated, in an adaptively optimized way according to the thread allocation setting, to compute one inner product.
This parallel design also includes a two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction and stores the result in contiguous shared memory;
2) In the second stage, the shared-memory data obtained in the first stage are reduced to obtain the corresponding inner-product results.
This parallel design adopts the following adaptive thread allocation strategy:
minimize t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary matrix. When k = 1, only the first stage is needed.
In the streaming parallel design of step 7), each element of the vector operations, including the soft-thresholding operator, is processed in a streaming load fashion; the operations can also be vectorized, and the CUDA built-in functions are used to eliminate branches.
In step 4), when the parallel solver for a single L1 minimization problem is enabled, one GPU device solves only one L1 minimization problem. The parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by three separate CUDA kernel functions.
In step 4), when the parallel solver for multiple concurrent L1 minimization problems is enabled, one GPU device can solve several L1 minimization problems concurrently. Each L1 minimization problem is solved by one or more thread blocks, and the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by a single CUDA kernel function. In addition, the CUDA built-in function is used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency.
The parallel solution method for L1 minimization problems proposed by the invention fully exploits GPU hardware resources and computing capability, and has high parallelism and adaptability.
Brief description of the drawings
Fig. 1 is a diagram of the memory hierarchy of GPU devices of compute capability 5.0 and above.
Fig. 2 is a schematic diagram of the matrix storage format used in the present invention.
Fig. 3 is a schematic diagram of the 32-byte-aligned matrix padding used in the present invention.
Fig. 4 is a schematic diagram of the kernel fusion of FISTA in the present invention.
Fig. 5 is a schematic diagram of the performance comparison between the GPU and CPU versions of the parallel solver for a single L1 minimization problem in the present invention.
Fig. 6 is a schematic diagram of the performance comparison between the parallel solver for multiple concurrent L1 minimization problems and the single-problem version in the present invention.
Fig. 7 is the flow chart of the method of the present invention.
Detailed description of the invention
In the following description, the present invention is explained in further detail with reference to Figs. 1-7 and a specific implementation.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding algorithm accelerated by combining it with Nesterov's optimal gradient method, and it has a non-asymptotic convergence rate of O(1/k²). The algorithm introduces a new auxiliary sequence {y_k}, k = 1, 2, ..., and its concrete iterative steps are as follows (the update formulas are restated after this paragraph):
Here λ is a scalar weight, soft(u, a) = sign(u)·max{|u| - a, 0} is the soft-thresholding operator, y_1 = x_0, t_1 = 1, L_f is the Lipschitz constant associated with ∇f(·), which can be obtained by computing the spectral norm of A^T A (‖A^T A‖₂), and ∇f(y_k) = A^T(Ay_k - b).
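The iteration itself appears only as a figure in the original filing; for readability, the standard FISTA update consistent with the quantities defined above is restated here (a restatement of the well-known algorithm, not a reproduction of the patent figure):

```latex
x_k     = \operatorname{soft}\!\left(y_k - \tfrac{1}{L_f}\nabla f(y_k),\ \tfrac{\lambda}{L_f}\right), \quad
t_{k+1} = \tfrac{1}{2}\left(1 + \sqrt{1 + 4t_k^2}\right), \quad
y_{k+1} = x_k + \tfrac{t_k - 1}{t_{k+1}}\left(x_k - x_{k-1}\right).
```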
The present invention solves L1 minimization problems using the fast iterative shrinkage-thresholding algorithm, which mainly involves vector operations and matrix-vector products. On NVIDIA Maxwell-architecture GPU devices, the invention accelerates the fast iterative shrinkage-thresholding algorithm in parallel based on the CUDA parallel computing model.
The present invention proposes an adaptively optimized parallel design for the non-transposed matrix-vector product, a parallel design for the transposed matrix-vector product, and a streaming parallel design. Using these parallel designs and a suitable CUDA thread allocation, a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems are realized.
The concrete steps of the parallel solution method are as follows:
1) Read the GPU device information, including the compute capability and the number of streaming multiprocessors; according to the dimensions of the data dictionary and the GPU device information, complete the warp allocation setting and the thread allocation setting;
2) Store the data dictionary A as a 0-indexed, row-major matrix padded to 32-byte alignment, and transfer the data dictionary A, the data vector b and the sparse representation x from the host side to the GPU device side;
3) Meanwhile, on the host side, asynchronously compute the input parameters of the fast iterative shrinkage-thresholding algorithm (such as L_f);
4) According to the number of L1 minimization problems to be solved: if a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device side; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for multiple concurrent L1 minimization problems on the GPU device side;
5) On the GPU device side, use the adaptively optimized parallel design for non-transposed matrix-vector multiplication to realize the non-transposed matrix-vector product Ay_k - b of the fast iterative shrinkage-thresholding algorithm;
6) On the GPU device side, use the adaptively optimized parallel design for transposed matrix-vector multiplication to realize the transposed matrix-vector product A^T(·) of the fast iterative shrinkage-thresholding algorithm;
7) On the GPU device side, use the streaming parallel design to compute the remaining vector operations of the fast iterative shrinkage-thresholding algorithm in a fused manner;
8) Meanwhile, on the host side, asynchronously compute the value of t_{k+1};
9) If the number of iterations or the sparsity of the solution satisfies the preset condition, stop iterating and transfer the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
In step 4) above, when the parallel solver for a single L1 minimization problem is enabled, one GPU device solves only one L1 minimization problem, and three CUDA kernel functions are called to realize, respectively, the non-transposed matrix-vector product, the transposed matrix-vector product and all vector operations of the fast iterative shrinkage-thresholding algorithm; the concrete flow of FISTA is shown in Algorithm 1. The first kernel function realizes Ay_k - b using the parallel design for the non-transposed matrix-vector product; the second kernel realizes A^T(·) using the parallel design for the transposed matrix-vector product; the remaining vector operations are then fused into the third kernel, which uses the streaming parallel design, as shown in Fig. 4. The dimensions of the input and output objects of each kernel function may differ, so each kernel uses its own launch configuration (thread-grid and thread-block configuration).
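For readability, the host-side structure of this three-kernel iteration can be sketched as follows; the kernel names, signatures and launch configurations are illustrative assumptions rather than the patent's actual code, and the handling of the previous iterate x_{k-1} is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Assumed kernel interfaces for the three stages (illustrative placeholders).
__global__ void gemv_notrans_kernel(const float*, const float*, const float*, float*, int, int);
__global__ void gemv_trans_kernel(const float*, const float*, float*, int, int);
__global__ void fused_vector_kernel(float*, float*, const float*, float, float, float, int);

// Host-side FISTA loop of the single-problem solver: three kernel launches per
// iteration, with the scalar t_{k+1} computed on the host (step 8 of the method).
void fista_single(const float* d_A, const float* d_b,
                  float* d_x, float* d_y, float* d_r, float* d_g,
                  int m, int n, float lambda, float Lf, int maxIter)
{
    dim3 blk(256);
    dim3 gridMV((m + 255) / 256), gridVec((n + 255) / 256);
    float t = 1.0f;                                                      // t_1 = 1
    for (int k = 0; k < maxIter; ++k) {
        gemv_notrans_kernel<<<gridMV,  blk>>>(d_A, d_y, d_b, d_r, m, n); // r = A*y_k - b
        gemv_trans_kernel  <<<gridVec, blk>>>(d_A, d_r, d_g, m, n);      // g = A^T * r
        float tNext = 0.5f * (1.0f + sqrtf(1.0f + 4.0f * t * t));        // t_{k+1}
        fused_vector_kernel<<<gridVec, blk>>>(d_x, d_y, d_g,             // soft-threshold and
                                              lambda, Lf,                // momentum update,
                                              (t - 1.0f) / tNext, n);    // fused in one kernel
        t = tNext;
    }
    cudaDeviceSynchronize();
}
```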
In step 4) above, when the parallel solver for multiple concurrent L1 minimization problems is enabled, one GPU device can solve several L1 minimization problems concurrently. Each L1 minimization problem is solved by one or more thread blocks, and a single CUDA kernel function, combining the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations, realizes the fast iterative shrinkage-thresholding algorithm. The __ldg() function is used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency; in addition, the value of t_{k+1} is no longer computed asynchronously on the host side.
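As a minimal illustration of the read-only-cache access mentioned here (the indexing scheme and helper name are assumptions for illustration, not the patent's actual layout):

```cuda
// Dictionary load routed through the read-only data cache inside the fused
// multi-problem kernel. __ldg() is available on devices of compute capability
// 3.5 and above; ldA is the padded leading dimension of the row-major matrix A.
__device__ inline float load_dict(const float* __restrict__ A,
                                  int row, int col, int ldA)
{
    return __ldg(&A[row * ldA + col]);   // cached read of the dictionary element A(row, col)
}
```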
The non-transposed matrix-vector product is defined as Ax (A ∈ R^(m×n), x ∈ R^n) and consists of m inner products (each row of A forms an inner product with x); each inner product can be computed independently. In the parallel design for the non-transposed matrix-vector product Ax of step 5) above, one warp or several warps are allocated to compute one inner product of Ax, several inner products are computed simultaneously, and the warps are assigned to the inner products cyclically. For different matrix sizes and different GPU devices (with different amounts of computing resources), an adaptive warp allocation strategy is proposed that automatically chooses the optimal number k of warps to compute one dot product, so that more CUDA cores and other execution units participate in the computation. This design also exploits the sparsity of the solution to reduce the amount of computation of this kernel.
This parallel design uses shared memory to cache the vector x and includes the following two-stage reduction (a code sketch of this reduction is given after the description of the second stage):
The first stage comprises the following steps:
1) x-load step: all threads of each thread block first cooperatively read a contiguous segment of the vector x into the shared-memory buffer xP, and then perform the partial-reduction step. In this way the accesses to the vector x are coalesced, and sharing the segment of x reduces the number of accesses.
2) partial-reduction step: each thread of a thread block performs a reduction over the segment of x already loaded into shared memory, according to the formula
bVal += xP_i · A_rj
where bVal is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_rj is the element of matrix A corresponding to xP_i. If the vector x has not yet been fully loaded, return to the x-load step; otherwise perform the warp-reduction step. Each thread may need to perform the reduction several times, and the accesses to the matrix A in global memory are also coalesced.
3) warp-reduction step: within each warp, after each thread has computed the partial reduction value it is responsible for, the CUDA shuffle instructions are used to complete the final reduction, and the result is stored in contiguous shared memory.
In the second stage, several warps read this contiguous shared memory and use shuffle instructions to complete the reduction within each warp, obtaining the corresponding inner-product results; if not all inner products have been computed, return to the first stage and continue with the next group of inner products.
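A simplified CUDA sketch of this design for the one-warp-per-inner-product case (k = 1, so only the first reduction stage is needed) is given below. The tiling constants and names are illustrative, the sparsity optimization and the multi-warp second stage are omitted, and the modern __shfl_down_sync variant of the shuffle instruction is used; this is not the patent's actual kernel.

```cuda
#define TILE 256   // elements of x staged in shared memory per pass
#define WARP 32

// y = A*x - b with one warp per row (k = 1): cooperative x-load, per-thread
// partial reduction, and a warp-level shuffle reduction.
__global__ void gemv_notrans_sketch(const float* __restrict__ A,
                                    const float* __restrict__ x,
                                    const float* __restrict__ b,
                                    float* __restrict__ y,
                                    int m, int n, int ldA)
{
    __shared__ float xP[TILE];                       // shared-memory tile of x
    const int lane          = threadIdx.x % WARP;    // lane within the warp
    const int warpInBlock   = threadIdx.x / WARP;
    const int warpsPerBlock = blockDim.x / WARP;

    // Warps are assigned to rows cyclically, as in the adaptive allocation scheme.
    for (int rowBase = blockIdx.x * warpsPerBlock; rowBase < m;
         rowBase += gridDim.x * warpsPerBlock) {
        const int row = rowBase + warpInBlock;       // row handled by this warp
        float bVal = 0.0f;                           // per-thread partial sum
        for (int base = 0; base < n; base += TILE) {
            // x-load step: all threads of the block cooperatively stage x.
            for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                xP[i] = (base + i < n) ? x[base + i] : 0.0f;
            __syncthreads();
            // partial-reduction step: bVal += xP_i * A_rj (coalesced reads of A).
            if (row < m)
                for (int i = lane; i < TILE; i += WARP)
                    if (base + i < n) bVal += xP[i] * A[row * ldA + base + i];
            __syncthreads();                         // xP is reused in the next pass
        }
        // warp-reduction step: shuffle instructions finish the row's inner product.
        for (int off = WARP / 2; off > 0; off >>= 1)
            bVal += __shfl_down_sync(0xffffffff, bVal, off);
        if (row < m && lane == 0) y[row] = bVal - b[row];  // fused subtraction A*y_k - b
    }
}
```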
This parallel design adopts the following adaptive warp allocation strategy:
minimize w = sm × 2048 / (k × 32), subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data dictionary matrix. When k = 1, only the first-stage reduction is needed; when k = 32, the vector is loaded directly into registers.
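A host-side sketch of this selection rule is shown below; stepping k through power-of-two candidates up to 32, and the function name, are assumptions made for illustration.

```cuda
// Pick the largest number k of warps per inner product such that the number of
// warp groups w = sm * 2048 / (k * 32) still covers all m rows (m <= w).
// sm is the number of streaming multiprocessors; 2048 is the resident-thread
// capacity per multiprocessor assumed by the strategy above.
int pick_warps_per_inner_product(int sm, int m)
{
    int k = 1;
    for (int cand = 2; cand <= 32; cand *= 2) {
        int w = sm * 2048 / (cand * 32);    // warp groups produced by this choice
        if (m <= w) k = cand;               // still enough groups: keep growing k
        else break;
    }
    return k;                               // k = 1 means only the first-stage reduction
}
```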
The transposed matrix-vector product is defined as A^T x (A ∈ R^(m×n), x ∈ R^m) and consists of n inner products (each column of A forms an inner product with x); each inner product can be computed independently. In the parallel design for the transposed matrix-vector product of step 6) above, one thread or several threads are allocated to compute one inner product of the transposed matrix-vector product, several inner products are computed simultaneously, and the threads are assigned to the inner products cyclically. For different matrix sizes and different GPU devices (with different amounts of computing resources), an adaptive thread allocation strategy is proposed that automatically chooses the optimal number k of threads to compute one inner product, so that more CUDA cores and other execution units participate in the computation.
This parallel design uses shared memory to cache the vector x and includes the following two-stage reduction (a sketch of the thread-group organisation follows the description of the second stage):
The first stage comprises the following steps:
1) x-load step: all threads of each thread block first cooperatively read a contiguous segment of the vector x into shared memory, and then perform the partial-reduction step.
2) partial-reduction step: each thread of a thread block performs a reduction over the segment of x already loaded into shared memory, according to the formula
bVal += xP_i · A_jc
where bVal is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_jc is the element of matrix A corresponding to xP_i. If x has not yet been fully loaded, return to the x-load step; once x has been fully loaded, perform the second stage. Because the matrix is stored row-major with 0-based indexing, the accesses to A in global memory are not coalesced if the thread groups (k threads forming one group) are organized unreasonably. The thread groups are therefore created according to the following definition, which guarantees coalesced accesses.
Definition 1: Suppose the thread block size is s, h threads are jointly assigned to one dot product of A^T x, and z = s/h. The thread groups are then organized as follows: {0, z, ..., (h-1)z}, {1, z+1, ..., (h-1)z+1}, ..., {z-1, 2z-1, ..., hz-1}.
In the second stage, several warps read this contiguous shared memory and perform the reduction, obtaining the corresponding inner-product results.
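The index mapping implied by Definition 1 can be sketched as follows (names are illustrative): consecutive thread indices fall into consecutive groups, so threads with the same position inside their groups touch adjacent columns of the row-major matrix and their global-memory accesses coalesce.

```cuda
// Thread-group organisation of Definition 1 for A^T * x: a block of s threads
// is split into z = s / h groups of h threads; group g = {g, g+z, ..., g+(h-1)z}.
__device__ inline void thread_group_of(int tid, int s, int h,
                                       int* group, int* posInGroup)
{
    const int z = s / h;     // number of groups, i.e. inner products per block
    *group      = tid % z;   // which column (inner product) this thread works on
    *posInGroup = tid / z;   // which slice of the rows this thread accumulates
}
```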
This parallel design adopts the following adaptive thread allocation strategy:
minimize t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary matrix. When k = 1, only the first-stage reduction is needed.
In the streaming parallel design of step 7) above, each element of the vector operations, including the soft-thresholding operator, is processed in a streaming load fashion, i.e. each thread computes one element; the operations can also be vectorized, and the CUDA built-in function fmax() is used to eliminate branches.
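A minimal device-side sketch of this branch-free soft-thresholding (the function name is illustrative):

```cuda
// soft(u, a) = sign(u) * max(|u| - a, 0) without a branch: fmaxf() replaces
// the conditional, and copysignf() reattaches the sign of u.
__device__ inline float soft_threshold(float u, float a)
{
    float mag = fmaxf(fabsf(u) - a, 0.0f);
    return copysignf(mag, u);
}
```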
To demonstrate the effect, the solution method of the present invention was tested with single-precision matrices on an Intel Xeon dual-core CPU machine equipped with an NVIDIA GTX 980 graphics card; the compilation and runtime environment was CUDA 6.5. Figs. 5 and 6 show the performance of the two parallel solvers proposed by the invention, where CFISTA denotes the CUBLAS-based implementation of FISTA, GFISTA denotes the parallel solver of the present invention for a single L1 minimization problem, and MFISTASOL denotes the parallel solver of the present invention for multiple concurrent L1 minimization problems. Compared with the CUBLAS-based solution method, the single-problem parallel solver of the present invention shows a significant performance improvement; compared with the single-problem parallel solver, the concurrent multi-problem parallel solver of the present invention achieves a further performance gain.
Referring to Fig. 1, the memory hierarchy of NVIDIA GPU devices of compute capability 5.0 and above has multiple levels: each thread can access the shared memory shared within its thread block; the L2 cache automatically caches global memory (located in dynamic random access memory); and the read-only data cache (L1 cache) can be controlled by the program to cache global memory.
Referring to Figs. 2 and 3, the data dictionary is stored as a 0-indexed, row-major matrix padded to 32-byte alignment, which optimizes global-memory access performance and reduces the number of memory transactions.
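For illustration, the padded leading dimension of such a layout can be computed as follows (a sketch under the assumption of single-precision elements; the helper name is illustrative):

```cuda
// Round the row length n up so that every row starts on a 32-byte boundary
// (8 floats); element (i, j) is then stored at A[i * ld + j].
size_t padded_leading_dim(size_t n)
{
    const size_t elemsPer32B = 32 / sizeof(float);              // 8 floats per 32 bytes
    return (n + elemsPer32B - 1) / elemsPer32B * elemsPer32B;   // next multiple of 8
}
```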
Referring to Fig. 4, the kernel fusion of the parallel solver for a single L1 minimization problem: the first kernel function fuses the non-transposed matrix-vector product with the vector subtraction to realize Ay_k - b; the second kernel realizes A^T(·); the remaining vector operations are fused into the third kernel.
Referring to Fig. 5, for each test case the initial x_0 always contains 1024 non-zero elements and b = Ax_0, and the solvers terminate after 50 iterations. The figure lists the execution times of all algorithms in seconds. Compared with CFISTA, GFISTA obtains speed-ups ranging from 37.68 to 53.66 times, with an average speed-up of 48.22, a significant performance improvement.
Referring to Fig. 6, the solver for multiple concurrent L1 minimization problems, MFISTASOL: the test configuration is the same as in Fig. 5, and for each test case 128 L1 minimization problems are solved concurrently. Compared with executing the single-problem parallel solver GFISTA sequentially, MFISTASOL achieves an average speed-up of more than 3.0.

Claims (6)

1. A GPU-based fast solution method for L1 minimization problems, characterised in that, based on the fast iterative shrinkage-thresholding algorithm, the CUDA parallel computing model is adopted on NVIDIA Maxwell-architecture GPU devices to solve L1 minimization problems in parallel; adaptively optimized vector operations, a non-transposed matrix-vector product and a transposed matrix-vector product are designed, and a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems are realized through suitable CUDA thread allocation;
the concrete steps of the solution method are as follows:
1) according to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp allocation setting and the thread allocation setting;
2) store the data dictionary as a 0-indexed, row-major matrix padded to 32-byte alignment; transfer the data dictionary and the vector from the host side to the GPU device side;
3) meanwhile, on the host side, asynchronously compute the input parameters of FISTA;
4) according to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device side; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for multiple concurrent L1 minimization problems on the GPU device side;
5) on the GPU device side, use the adaptively optimized parallel design for non-transposed matrix-vector multiplication to realize the non-transposed matrix-vector product in FISTA;
6) on the GPU device side, use the adaptively optimized parallel design for transposed matrix-vector multiplication to realize the transposed matrix-vector product in FISTA;
7) on the GPU device side, fuse the remaining vector operations and compute them in FISTA in a fused, streaming parallel fashion;
8) meanwhile, on the host side, asynchronously compute the scalar value;
9) if the convergence condition is reached, stop iterating and transfer the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
2. The parallel design for the non-transposed matrix-vector product in step 5) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that: according to the warp allocation setting, one warp or several warps are allocated, in an adaptively optimized way, to compute one inner product, and the sparsity of the solution is exploited to reduce the amount of computation;
this parallel design includes the following two-stage reduction:
1) in the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread of a warp then completes its partial reduction, after which shuffle instructions complete the reduction within the warp and the result is stored in contiguous shared memory; this is repeated until the whole vector has been loaded;
2) in the second stage, shuffle instructions reduce the shared-memory data produced by the first stage to obtain the corresponding inner-product results;
in this parallel design a warp contains 32 threads; to obtain the optimal number of warps for computing one inner product, the following adaptive allocation strategy is proposed:
minimize w = sm × 2048 / (k × 32), subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data dictionary matrix; when k = 1 this parallel design needs only the first reduction stage; when k = 32 the vector can be loaded directly into registers.
3. The parallel design for the transposed matrix-vector product in step 6) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that, according to the thread allocation setting, one thread or several threads are allocated, in an adaptively optimized way, to compute one inner product;
this parallel design also includes a two-stage reduction:
1) in the first stage, all threads of each thread block first cooperatively read a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction and stores the result in contiguous shared memory;
2) in the second stage, the shared-memory data obtained in the first stage are reduced to obtain the corresponding inner-product results;
this parallel design adopts the following adaptive thread allocation strategy:
minimize t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads allocated to one inner product (set to 1 if it would be smaller than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary matrix; when k = 1, only the first stage is needed.
4. The streaming parallel design in step 7) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that each element of the vector operations, including the soft-thresholding operator, is processed in a streaming load fashion; the operations can also be vectorized, and the CUDA built-in functions are used to eliminate branches.
5. The single-problem parallel solver setting enabled in step 4) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that one GPU device solves only one L1 minimization problem, and the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by three separate CUDA kernel functions.
6. The parallel solver setting for multiple concurrent L1 minimization problems enabled in step 4) of the GPU-based fast solution method for L1 minimization problems according to claim 1, characterised in that one GPU device can solve several L1 minimization problems concurrently; each L1 minimization problem is solved by one or more thread blocks, and the parallel design for the non-transposed matrix-vector product, the parallel design for the transposed matrix-vector product and the streaming parallel design of the fused vector operations are realized by a single CUDA kernel function; in addition, the CUDA built-in function is used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency.
CN201610116008.3A 2016-03-01 2016-03-01 A GPU-based fast solution method for L1 minimization problems Active CN105739951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610116008.3A CN105739951B (en) 2016-03-01 2016-03-01 A GPU-based fast solution method for L1 minimization problems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610116008.3A CN105739951B (en) 2016-03-01 2016-03-01 A GPU-based fast solution method for L1 minimization problems

Publications (2)

Publication Number Publication Date
CN105739951A true CN105739951A (en) 2016-07-06
CN105739951B CN105739951B (en) 2018-05-08

Family

ID=56248952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610116008.3A Active CN105739951B (en) A GPU-based fast solution method for L1 minimization problems

Country Status (1)

Country Link
CN (1) CN105739951B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502771A (en) * 2016-09-09 2017-03-15 中国农业大学 Time overhead model building method and system based on kernel functions
CN107886519A (en) * 2017-10-17 2018-04-06 杭州电子科技大学 Multichannel chromatogram three-dimensional image fast partition method based on CUDA
WO2019000435A1 (en) * 2017-06-30 2019-01-03 华为技术有限公司 Task processing method and device, medium, and device thereof
CN109709547A (en) * 2019-01-21 2019-05-03 电子科技大学 A kind of reality beam scanning radar acceleration super-resolution imaging method
CN114943194A (en) * 2022-05-16 2022-08-26 水利部交通运输部国家能源局南京水利科学研究院 River pollution tracing method based on geostatistics
US20220358206A1 (en) * 2021-05-10 2022-11-10 Commissariat à l'Energie Atomique et aux Energies Alternatives Method for the execution of a binary code by a microprocessor
CN117785480A (en) * 2024-02-07 2024-03-29 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment
CN117785480B (en) * 2024-02-07 2024-04-26 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120057770A1 (en) * 2010-09-07 2012-03-08 Kwang Eun Jang Method and apparatus for reconstructing image and medical image system employing the method
CN103505206A (en) * 2012-06-18 2014-01-15 山东大学威海分校 Fast and parallel dynamic MRI method based on compressive sensing technology
US9118347B1 (en) * 2011-08-30 2015-08-25 Marvell International Ltd. Method and apparatus for OFDM encoding and decoding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120057770A1 (en) * 2010-09-07 2012-03-08 Kwang Eun Jang Method and apparatus for reconstructing image and medical image system employing the method
US9118347B1 (en) * 2011-08-30 2015-08-25 Marvell International Ltd. Method and apparatus for OFDM encoding and decoding
CN103505206A (en) * 2012-06-18 2014-01-15 山东大学威海分校 Fast and parallel dynamic MRI method based on compressive sensing technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAO YA ZHANG et al.: "Accelerated proximal algorithms for L1-minimization problem", Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2014 11th International Computer Conference on *
LIU Jie et al.: "Performance analysis and comparison of fast L1-norm minimization algorithms", Computer Knowledge and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502771A (en) * 2016-09-09 2017-03-15 中国农业大学 Time overhead model building method and system based on kernel functions
CN106502771B (en) * 2016-09-09 2019-08-02 中国农业大学 Time overhead model building method and system based on kernel function
WO2019000435A1 (en) * 2017-06-30 2019-01-03 华为技术有限公司 Task processing method and device, medium, and device thereof
CN110088730A (en) * 2017-06-30 2019-08-02 华为技术有限公司 Task processing method, device, medium and its equipment
CN110088730B (en) * 2017-06-30 2021-05-18 华为技术有限公司 Task processing method, device, medium and equipment
CN107886519A (en) * 2017-10-17 2018-04-06 杭州电子科技大学 Multichannel chromatogram three-dimensional image fast partition method based on CUDA
CN109709547A (en) * 2019-01-21 2019-05-03 电子科技大学 A kind of reality beam scanning radar acceleration super-resolution imaging method
US20220358206A1 (en) * 2021-05-10 2022-11-10 Commissariat à l'Energie Atomique et aux Energies Alternatives Method for the execution of a binary code by a microprocessor
CN114943194A (en) * 2022-05-16 2022-08-26 水利部交通运输部国家能源局南京水利科学研究院 River pollution tracing method based on geostatistics
CN117785480A (en) * 2024-02-07 2024-03-29 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment
CN117785480B (en) * 2024-02-07 2024-04-26 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment

Also Published As

Publication number Publication date
CN105739951B (en) 2018-05-08

Similar Documents

Publication Publication Date Title
CN105739951A (en) GPU-based L1 minimization problem fast solving method
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
Martín et al. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
CN106055311B (en) MapReduce tasks in parallel methods based on assembly line multithreading
CN104765589B (en) Grid parallel computation preprocess method based on MPI
CN105608135B (en) Data mining method and system based on Apriori algorithm
Rostrup et al. Fast and memory-efficient minimum spanning tree on the GPU
Hugues et al. Sparse matrix formats evaluation and optimization on a GPU
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
US20230409885A1 (en) Hardware Environment-Based Data Operation Method, Apparatus and Device, and Storage Medium
Martínez-del-Amor et al. Population Dynamics P systems on CUDA
CN108984483B (en) Electric power system sparse matrix solving method and system based on DAG and matrix rearrangement
CN110264392B (en) Strong connection graph detection method based on multiple GPUs
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
Liu et al. GPU accelerated fast FEM deformation simulation
Qiao et al. Parallelizing and optimizing neural Encoder–Decoder models without padding on multi-core architecture
Liu et al. Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA
US20160224902A1 (en) Parallel gibbs sampler using butterfly-patterned partial sums
CN109522127B (en) Fluid machinery simulation program heterogeneous acceleration method based on GPU
CN109741421B (en) GPU-based dynamic graph coloring method
CN103678888A (en) Cardiac blood flowing indicating and displaying method based on Euler fluid simulation algorithm
Wen et al. A swap dominated tensor re-generation strategy for training deep learning models
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
Yang et al. Efficient dense structure mining using mapreduce
CN104866297B (en) A kind of method and apparatus for optimizing kernel function

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant