CN102609393A - Method for processing data of systems of linear equations and device - Google Patents


Info

Publication number
CN102609393A
CN102609393A · CN2012100273404A · CN201210027340A
Authority
CN
China
Prior art keywords
gpu
linear equations
solution
thread
equations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100273404A
Other languages
Chinese (zh)
Other versions
CN102609393B (en)
Inventor
张清
沈铂
吴庆
廖文军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd filed Critical Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201210027340.4A priority Critical patent/CN102609393B/en
Publication of CN102609393A publication Critical patent/CN102609393A/en
Application granted granted Critical
Publication of CN102609393B publication Critical patent/CN102609393B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a data processing method and device for systems of linear equations, relating to the fields of computer high-performance computing and scientific and engineering numerical computation. The method includes: determining, from the memory size of the GPU (graphics processing unit) and the dimensions of the coefficient matrices of the systems of linear equations to be computed, the maximum number M of systems the GPU can solve in parallel in one pass; when M is greater than or equal to the number of stream processors of the GPU, solving M systems in parallel per GPU kernel launch until all systems are solved; and when M is less than the number of stream processors of the GPU, calling a GPU-based many-core matrix algebra library to compute the systems, solving one system at a time. The invention further discloses a corresponding data processing device. The method and device of this technical scheme meet the requirements of special-purpose industrial algorithms, improve computational performance, and greatly reduce power consumption.

Description

Data processing method and device for systems of linear equations
Technical field
The present invention relates to the fields of computer high-performance computing and scientific and engineering numerical computation, and in particular to a data processing method and device for systems of linear equations.
Background art
In recent years, high-performance computers in China and worldwide have advanced considerably. Supercomputers delivering hundreds to thousands of teraflops have been developed and put into commercial production one after another, making many previously intractable problems tractable.
Large-scale scientific and engineering numerical computation is one of the important application fields of high-performance computing, and solving systems of linear equations is a very common problem within it, arising in many engineering fields such as petroleum prospecting, weather forecasting, and turbulence simulation.
In some industrial fields, the core algorithms of certain functions involve solving systems of linear equations. At present this is usually handled on large CPU clusters: one CPU core solves one system (or several cores together solve one system), and the results from all cores are gathered for output. This approach carries high cluster maintenance and power costs, and its processing time is long, falling far short of industrial demand. With the advent of the GPU (Graphics Processing Unit), whose many-core design gives it peak performance and memory bandwidth far beyond a CPU's, GPUs have proved well suited to the compute-intensive task of solving systems of linear equations. The common practice at present is for one GPU kernel to solve one system, as in the LU-decomposition solver of the open-source MAGMA library. That suits large systems very well, but for the repeated solving of many systems required in industry, the repeated GPU kernel launches incur extra call overhead.
It is thus clear that, to meet industrial demand, a scheme targeting the solving of systems of linear equations of different scales is needed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data processing method and device for systems of linear equations capable of solving systems of different scales.
To solve the above technical problem, the invention discloses a data processing method for systems of linear equations, comprising:
determining, by the system, according to the memory size of the graphics processing unit (GPU) and the dimensions of the coefficient matrices of the systems of linear equations to be computed, the maximum number M of systems the GPU can solve in parallel in one pass;
when M is greater than or equal to the number of stream processors of the GPU, determining that the systems currently to be computed are small, and solving M systems in parallel per launch of one GPU kernel until all systems are solved;
when M is less than the number of stream processors of the GPU, determining that the systems currently to be computed are large, and calling a GPU-based many-core matrix algebra library to perform the computation, solving one system at a time.
Preferably, in the above method, the system determines the maximum number M of systems the GPU can solve in parallel in one pass, from the GPU memory size and the coefficient-matrix dimensions, as follows:
determine the available GPU memory W; compute, from the dimensions of the coefficient matrix, the memory W0 required to solve one system; tally the memory W1 required for the auxiliary data of the solve; then, from W, W1, and W0, compute the maximum number of systems solvable in parallel in one pass, wherein:
M = (W - W1) / W0.
Preferably, in the above method, solving M systems in parallel per launch of one GPU kernel until all systems are solved proceeds as follows:
determine the Compute Unified Device Architecture (CUDA) thread model: the M systems are designed as one grid (Grid), each system as one thread block (Block), and the number of threads (Thread) per block is adjusted dynamically with the dimension of the coefficient matrix;
determine the CUDA parallel algorithm: the elimination operation is parallelized across GPU threads; one block contains multiple warps, one warp handles the elimination of one column, and multiple warps eliminate multiple columns in parallel; within a warp, one thread handles the elimination of one row of the column, so multiple threads eliminate multiple rows in parallel;
determine the CUDA memory model, which covers global-memory use, shared-memory use, and L1-cache use.
Preferably, in the above method, global-memory use means that the 32 threads of a warp access adjacent locations in global memory simultaneously.
Preferably, in the above method, shared-memory use means that data shared within each block is placed in shared memory.
Preferably, in the above method, L1-cache use means that data accessed repeatedly is placed in the L1 cache.
The invention also discloses a data processing device for systems of linear equations, comprising:
a first unit, which determines, from the GPU memory size and the coefficient-matrix dimensions of the systems to be computed, the maximum number M of systems the GPU can solve in parallel in one pass;
a second unit, which judges whether M is greater than or equal to the number of stream processors of the GPU;
a third unit, which, when M is greater than or equal to the number of stream processors of the GPU, determines that the systems currently to be computed are small, and solves M systems in parallel per launch of one GPU kernel until all systems are solved;
a fourth unit, which, when M is less than the number of stream processors (SMs) of the GPU, determines that the systems currently to be computed are large, and calls a GPU-based many-core matrix algebra library, solving one system at a time.
Preferably, in the above device, the first unit determines the available GPU memory W; computes, from the dimensions of the coefficient matrix, the memory W0 required to solve one system; tallies the memory W1 required for the auxiliary data of the solve; and then, from W, W1, and W0, computes the maximum number of systems solvable in parallel in one pass, wherein:
M = (W - W1) / W0.
Preferably, in the above device, the third unit determines the Compute Unified Device Architecture (CUDA) thread model: the M systems are designed as one grid, each system as one thread block, the thread count of each block adjusted dynamically with the dimension of the coefficient matrix;
determines the CUDA parallel algorithm: the elimination operation is parallelized across GPU threads, one GPU block containing multiple warps, one warp eliminating one column, multiple warps eliminating multiple columns in parallel; within a warp, one thread eliminates one row of the column, multiple threads eliminating multiple rows in parallel;
and determines the CUDA memory model, which covers global-memory use, shared-memory use, and L1-cache use.
Preferably, in the above device, global-memory use means that the 32 threads of a warp access adjacent locations in global memory simultaneously;
shared-memory use means that data shared within each block is placed in shared memory;
L1-cache use means that data accessed repeatedly is placed in the L1 cache.
The present technical scheme uses the GPU to solve systems of linear equations of different scales rapidly, meeting the demands of special-purpose industrial algorithms. Compared with the conventional scheme of solving the systems on CPU processors, it improves computational performance, greatly reduces power consumption, and lowers machine-room construction costs as well as management, operation, and maintenance costs.
Brief description of the drawings
Fig. 1 is a flowchart of the solving of systems of linear equations in this embodiment.
Fig. 2 shows the correspondence between the systems of linear equations and the GPU hardware in this embodiment.
Detailed description
To make the purpose, technical scheme, and advantages of the invention clearer, the technical scheme of the invention is explained further below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another arbitrarily.
Embodiment 1
The applicant found that although many algorithms exist for solving systems of linear equations, some are highly efficient for large systems but very poor for small ones, while others can solve many systems simultaneously with high efficiency yet are very inefficient when solving one system at a time. The applicant therefore proposes solving systems of different scales on the GPU: for systems of smaller scale, solve as many systems in parallel in one pass as the GPU allows, maximizing the parallel scale and making full use of the compute resources; for systems of larger scale, call the open-source GPU library MAGMA to solve one system at a time. In this way, systems of linear equations of all scales are handled. Specifically, this embodiment provides a data processing method for systems of linear equations which, as shown in Fig. 1, comprises the following steps.
Step 100: the system determines, from the GPU memory size and the coefficient-matrix dimensions of the systems to be computed, the maximum number M of systems the GPU can solve in parallel in one pass.
This step can be divided into the following sub-steps:
a) compute the available GPU memory W;
b) from the dimensions of the coefficient matrix, compute the memory W0 required to solve one system of linear equations;
c) tally the memory W1 required for the auxiliary data of the solve;
d) from W, W1, and W0, compute the maximum number of systems solvable in parallel in one pass: M = (W - W1) / W0.
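As a check on the sizing rule in step 100, a small Python sketch (function name and binary units are our assumptions, not the patent's) floors M = (W - W1) / W0 to an integer count of systems, using the figures from the seismic-prospecting example later in this embodiment:

```python
def max_parallel_systems(free_mem, aux_mem, per_system_mem):
    """M = (W - W1) / W0, floored: how many systems fit in GPU memory."""
    if per_system_mem <= 0:
        raise ValueError("per-system memory must be positive")
    return max((free_mem - aux_mem) // per_system_mem, 0)

GB, MB = 1 << 30, 1 << 20                # binary units assumed
W = int(2.6 * GB)                        # available C2050 memory, ECC on
W1 = int(1.5 * GB)                       # auxiliary data
print(max_parallel_systems(W, W1, int(0.897 * MB)))  # 343x343 systems -> 1255
print(max_parallel_systems(W, W1, int(86.9 * MB)))   # 3375x3375 systems -> 12
```

With binary units the two results reproduce the M values (1255 and 12) derived in the example below.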
Step 200: judge whether M is greater than or equal to the number of stream processors (SMs) of the GPU; if so, go to step 300, otherwise go to step 400.
Step 300: the dimension of the coefficient matrix is small, i.e. the systems currently to be computed are small, so M systems can be solved in parallel per launch of one GPU kernel until all systems are solved; the process then ends.
Solving M systems per launch in this step makes full use of the GPU's many cores: if one GPU kernel solved only one system, parallelism would be low and the GPU's compute resources would be underused.
In this embodiment, the concrete computation for small systems proceeds as follows:
a) Determine the CUDA thread model.
The M systems are designed as one GPU grid, each system as one GPU block, and the thread count of each block is adjusted dynamically with the dimension of the coefficient matrix.
b) Determine the CUDA parallel algorithm.
The CUDA parallel algorithm is designed around LU decomposition, whose most time-consuming part is the elimination operation; this is parallelized across GPU threads. One GPU block contains multiple warps (of 32 GPU threads each); one warp handles the elimination of one column, and multiple warps eliminate multiple columns in parallel. Within a warp, one thread handles the elimination of one row of the column, so the 32 threads eliminate multiple rows in parallel.
c) Determine the CUDA memory model, which covers global-memory (global memory) use, shared-memory (shared memory) use, and L1-cache use.
Specifically, global-memory use means: a Fermi GPU accesses memory per warp, so to achieve coalesced access to global memory and optimal memory performance, the 32 threads of a warp should access adjacent locations in global memory simultaneously. Since the two-dimensional H matrix is stored column-major, the elements within each column are stored contiguously, so during elimination the 32 threads should access 32 row elements of the same column simultaneously, improving memory performance.
Shared-memory use means: shared memory is on-chip GPU memory with fast access, so data shared within a block, such as the common column elements during elimination, can be placed in shared memory to improve memory performance.
L1-cache use means: a Fermi GPU provides an L1 cache, and data accessed repeatedly can be placed in it to improve memory performance. Because it is on-chip GPU memory and resource-constrained, the combined size of the L1 cache and shared memory is only 64 KB, dynamically configurable as 16 KB or 48 KB for the L1 cache. When shared memory suffices, as when the systems are small, the L1 cache can be configured as 48 KB; the larger L1 cache further improves memory performance.
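The grid/block/warp mapping above distributes ordinary Gaussian elimination. As a hardware-free illustration, the following plain-Python sketch (all names ours; a sequential stand-in for, not a copy of, the patent's CUDA kernel) solves a batch of independent systems, with comments marking which loop level the mapping assigns to which GPU resource:

```python
def solve_batch(systems):
    """Solve each (A, b) pair independently by Gaussian elimination
    with partial pivoting; one entry per system of linear equations."""
    results = []
    for A, b in systems:                      # one GPU block per system
        n = len(b)
        aug = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
        for k in range(n):                    # column k: one warp per column
            p = max(range(k, n), key=lambda i: abs(aug[i][k]))  # pivot row
            aug[k], aug[p] = aug[p], aug[k]
            for i in range(k + 1, n):         # rows below: one thread per row
                f = aug[i][k] / aug[k][k]
                for j in range(k, n + 1):
                    aug[i][j] -= f * aug[k][j]
        x = [0.0] * n
        for i in range(n - 1, -1, -1):        # back substitution
            s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (aug[i][n] - s) / aug[i][i]
        results.append(x)
    return results

# Two independent 2x2 systems solved "in one pass":
print(solve_batch([([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0]),
                   ([[1.0, 0.0], [0.0, 1.0]], [5.0, 7.0])]))
# -> [[1.0, 1.0], [5.0, 7.0]]
```

On the GPU the outer loop disappears: each block processes its own system concurrently, which is the point of batching M small systems per kernel launch.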
Step 400: the dimension of the coefficient matrix is large, i.e. the systems currently to be computed are large, so the MAGMA library (a GPU-based many-core matrix algebra library) is called to perform the computation, solving one system at a time; the process then ends.
The implementation of the above method is illustrated below with a concrete application scenario, a seismic-prospecting processing module.
Step A: for the seismic-prospecting processing module, analysis of its serial CPU algorithm shows that its core algorithm solves 1806 systems of linear equations in a loop by LU decomposition, with no data dependence between the systems; in principle, therefore, up to N = 1806 systems could be solved in parallel in one pass.
Step B: determine, from the GPU memory size and the coefficient-matrix dimensions, the maximum number M of systems solvable in parallel in one pass.
Specifically, suppose the available memory of an NVIDIA Tesla C2050 is W = 2.6 GB, with ECC enabled.
Each coefficient matrix is two-dimensional. For a small job (i.e. small systems) its dimension is 343×343, and the memory required to solve one such system is computed as W0 = 0.897 MB; for a large job (i.e. large systems) the dimension is 3375×3375 and W0 = 86.9 MB.
The memory required for the other data is tallied as W1 = 1.5 GB.
From the available GPU memory and the memory required per system, the maximum number of systems solvable in parallel in one pass is, for the small job:
M = (W - W1) / W0 = 1255;
and for the large job:
M = (W - W1) / W0 = 12.
Step C: the NVIDIA C2050 contains 14 SMs. For the small job, M = 1255 is greater than the GPU's SM count of 14, indicating that the coefficient-matrix dimension is small and this is small-scale solving; proceed to step D. For the large job, M = 12 is less than the SM count of 14, indicating that the coefficient-matrix dimension is large and this is large-scale solving; the open-source MAGMA library is called, solving one system at a time.
Step D: solve M = 1255 systems in parallel per launch of one GPU kernel; the 1806 systems complete in two GPU kernel launches.
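The dispatch in steps C and D reduces to two integer calculations; a sketch (function names and the path labels are ours, modeled on the C2050 figures above):

```python
import math

SM_COUNT = 14  # streaming multiprocessors on the Tesla C2050

def choose_path(m, sm_count=SM_COUNT):
    """Batch M small systems into one kernel launch, or fall back to a
    one-system-at-a-time library (e.g. MAGMA) when batching is too narrow."""
    return "batched_kernel" if m >= sm_count else "library_per_system"

def kernel_launches(total_systems, batch):
    """Kernel launches needed to cover all systems at M systems per launch."""
    return math.ceil(total_systems / batch)

print(choose_path(1255))            # small job  -> batched_kernel
print(choose_path(12))              # large job  -> library_per_system
print(kernel_launches(1806, 1255))  # -> 2 launches for the 1806 systems
```

Ceiling division is what gives the "two launches" figure: the first launch solves 1255 systems, the second the remaining 551.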
In this embodiment, the concrete operations of step D are as follows:
Determine the CUDA thread model: the 1255 systems are designed as one GPU grid, each system as one GPU block, 1255 blocks in all, declared as dim3 dimGrid(1255); the correspondence to the GPU hardware is shown in Fig. 2. Each GPU block has 448 threads, declared as (32, 14).
Determine the CUDA parallel algorithm: the algorithm is designed around LU decomposition, whose most time-consuming part is the elimination operation, parallelized across GPU threads. For this small job, one GPU block contains 14 warps (of 32 GPU threads each); one warp handles the elimination of one column, and the 14 warps eliminate 14 columns in parallel. Within a warp, one thread handles the elimination of one row of the column, so the 32 threads eliminate multiple rows in parallel.
Determine the CUDA memory-model design. First, global-memory use: a Fermi GPU accesses memory per warp, so to achieve coalesced access to global memory and optimal memory performance, the 32 threads of a warp should access adjacent locations in global memory simultaneously; since the two-dimensional H matrix is stored column-major with the elements of each column contiguous, the 32 threads should access 32 row elements of the same column simultaneously during elimination. Second, shared-memory use: shared memory is on-chip GPU memory with fast access, so data shared within a block, such as the common column elements during elimination, is placed in shared memory to improve memory performance. Last, L1-cache use: a Fermi GPU provides an L1 cache in which repeatedly accessed data can be placed; because it is on-chip and resource-constrained, the combined size of the L1 cache and shared memory is only 64 KB, dynamically configurable as 16 KB or 48 KB. For this small job, shared memory suffices, so the L1 cache is configured as 48 KB; the larger L1 cache further improves memory performance.
The above instance is performance-tested below.
First, the test environment and test data: the environment comprises the hardware, the software, and the serial and parallel programs, with the specific parameters listed in Table 1.
Table 1: test environment and test data
Next, the performance results: the jobs of the two scales (the small job being small systems, the large job being large systems) were tested, each test repeated 10 times and the mean taken, recording the mean runtime of the single-threaded serial CPU program (RNA_CPU_Ave_Time) and of the parallel GPU program (RNA_GPU_Ave_Time); the results are listed in Table 2.
Table 2: performance comparison of the two algorithms
System scale            RNA_CPU_Ave_Time   RNA_GPU_Ave_Time   Speedup
Small job (343×343)     258 s              13.89 s            18.57×
Large job (3375×3375)   32116 s            637 s              50.4×
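The speedup column of Table 2 is simply the ratio of the two mean runtimes; a quick arithmetic check of the reported figures:

```python
def speedup(cpu_seconds, gpu_seconds):
    """Ratio of serial CPU runtime to parallel GPU runtime."""
    return cpu_seconds / gpu_seconds

print(round(speedup(258, 13.89), 2))   # small job: ~18.57x
print(round(speedup(32116, 637), 1))   # large job: ~50.4x
```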
Table 2 shows that: (1) for jobs of different scales, i.e. systems of linear equations of different scales, the acceleration method is selected dynamically, and the performance of the parallel GPU program improves markedly over the single-threaded serial CPU program, with speedups of 18.57× to 50.4×; (2) the larger the systems, the greater the computational load, and the higher the speedup of the parallel GPU program.
The above embodiment shows that the present technical scheme solves systems of linear equations of different scales on NVIDIA's Fermi GPU platform, exploiting the GPU's many-core compute power and dynamically solving multiple systems in one pass according to the coefficient-matrix dimension. Compared with solving on a traditional CPU platform, performance improves by a factor of 18.57 to 50.4, satisfying industrial demand well while reducing machine-room construction costs and management, operation, and maintenance costs.
Embodiment 2
This embodiment introduces a data processing device for systems of linear equations capable of implementing the scheme of Embodiment 1. The device comprises at least the following units.
A first unit determines, from the GPU memory size and the coefficient-matrix dimensions of the systems to be computed, the maximum number M of systems the GPU can solve in parallel in one pass.
In this embodiment, the first unit computes M by the formula:
M = (W - W1) / W0
where W is the available GPU memory;
W0 is the memory required to solve one system, computed from the dimensions of the coefficient matrix;
and W1 is the memory required for the auxiliary data of the solve.
A second unit judges whether M is greater than or equal to the number of stream processors of the GPU.
A third unit, when the second unit judges that M is greater than or equal to the number of stream processors of the GPU, determines that the systems currently to be computed are small, and solves M systems in parallel per launch of one GPU kernel until all systems are solved.
Specifically, the third unit first determines the CUDA thread model: the M systems are designed as one GPU grid, each system as one GPU block, the thread count of each block adjusted dynamically with the dimension of the coefficient matrix. It then determines the CUDA parallel algorithm: the elimination operation is parallelized across GPU threads, one GPU block containing multiple warps, one warp eliminating one column, multiple warps eliminating multiple columns in parallel; within a warp, one thread eliminates one row of the column, multiple threads eliminating multiple rows in parallel. Finally, it determines the CUDA memory model, which covers global-memory use, shared-memory use, and L1-cache use. The concrete processes follow the operations of Embodiment 1 and are not repeated here.
A fourth unit, when the second unit judges that M is less than the number of stream processors (SMs) of the GPU, determines that the systems currently to be computed are large, and calls a GPU-based many-core matrix algebra library, solving one system at a time.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above method may be carried out by a program instructing the related hardware, the program being stored in a computer-readable storage medium such as a read-only memory, magnetic disk, or optical disc. Alternatively, all or some of the steps of the above embodiments may be implemented with one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware or as a software functional module. The application is not limited to any particular combination of hardware and software.
The above are merely preferred embodiments of the present invention and do not limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A data processing method for systems of linear equations, characterized in that the method comprises:
determining, according to the video memory size of a graphics processing unit (GPU) and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems of linear equations that the GPU can solve in parallel at one time;
when M is greater than or equal to the number of stream processors of the GPU, determining that the current computation involves small-scale systems of linear equations, and solving M systems in parallel at one time through a GPU kernel until all the systems of linear equations are solved;
when M is less than the number of stream processors of the GPU, determining that the current computation involves large-scale systems of linear equations, and calling a matrix algebra library based on the GPU and its many-core architecture to perform the computation, wherein one system of linear equations is solved at a time.
2. the method for claim 1 is characterized in that, the system of linear equations matrix of coefficients dimension size that system will calculate according to GPU video memory size and institute confirms that the process of the disposable maximum linear system of equations number M found the solution of walking abreast of GPU is following:
Confirm the available video memory space W of GPU,, calculate and find the solution a required video memory space W 0 of system of linear equations according to the dimension size of system of linear equations matrix of coefficients; Statistics is found the solution the required video memory space W 1 of the extra auxiliary data of system of linear equations; Again according to GPU video memory space W capable of using, find the solution the required video memory space W 1 of the extra auxiliary data of system of linear equations and the required video memory space W 0 of system of linear equations, calculate the disposable parallel maximum linear system of equations number M of finding the solution, wherein:
M=(W-W1)/W0。
3. The method of claim 1 or 2, characterized in that solving M systems in parallel at one time through the GPU kernel until all the systems of linear equations are solved proceeds as follows:
determining a Compute Unified Device Architecture (CUDA) threading model, wherein the M systems of linear equations are mapped to one grid, each system of linear equations is mapped to one thread block, and the thread count of each thread block is dynamically adjusted according to the dimension of the coefficient matrix;
determining a CUDA parallel algorithm that performs the elimination operations with GPU threads in parallel, wherein a thread block comprises multiple warps, one warp is responsible for the elimination of one column, multiple warps eliminate multiple columns in parallel, each thread of a warp is responsible for the elimination of one row within the column, and multiple threads eliminate multiple rows in parallel;
determining a CUDA memory model, wherein the CUDA memory model supports global memory usage, shared memory usage, and first-level cache usage.
4. The method of claim 3, characterized in that the global memory usage means that:
the 32 threads within a warp simultaneously access contiguous locations in global memory.
5. The method of claim 3, characterized in that the shared memory usage means that:
data shared within each block is placed in shared memory.
6. The method of claim 3, characterized in that the first-level cache usage means that:
frequently accessed data is placed in the first-level cache.
7. A data processing device for systems of linear equations, characterized in that the device comprises:
a first unit, which determines, according to the video memory size of a graphics processing unit (GPU) and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems of linear equations that the GPU can solve in parallel at one time;
a second unit, which judges whether M is greater than or equal to the number of stream processors of the GPU;
a third unit, which, when M is greater than or equal to the number of stream processors of the GPU, determines that the current computation involves small-scale systems of linear equations and solves M systems in parallel at one time through a GPU kernel until all the systems of linear equations are solved;
a fourth unit, which, when M is less than the number of streaming multiprocessors (SMs) of the GPU, determines that the current computation involves large-scale systems of linear equations and calls a matrix algebra library based on the GPU and its many-core architecture to perform the computation, wherein one system of linear equations is solved at a time.
8. The device of claim 7, characterized in that:
the first unit determines the available video memory space W of the GPU; computes, according to the dimension of the coefficient matrix, the video memory space W0 required to solve one system of linear equations; counts the video memory space W1 required by the extra auxiliary data for solving the systems; and then computes, from W, W1, and W0, the maximum number M of systems of linear equations solvable in parallel at one time, wherein:
M = (W - W1) / W0.
9. The device of claim 7 or 8, characterized in that:
the third unit determines a Compute Unified Device Architecture (CUDA) threading model, wherein the M systems of linear equations are mapped to one grid, each system of linear equations is mapped to one thread block, and the thread count of each thread block is dynamically adjusted according to the dimension of the coefficient matrix;
determines a CUDA parallel algorithm that performs the elimination operations with GPU threads in parallel, wherein a thread block comprises multiple warps, one warp is responsible for the elimination of one column, multiple warps eliminate multiple columns in parallel, each thread of a warp is responsible for the elimination of one row within the column, and multiple threads eliminate multiple rows in parallel;
and determines a CUDA memory model, wherein the CUDA memory model supports global memory usage, shared memory usage, and first-level cache usage.
10. The device of claim 9, characterized in that:
the global memory usage means that the 32 threads within a warp simultaneously access contiguous locations in global memory;
the shared memory usage means that data shared within each block is placed in shared memory;
the first-level cache usage means that frequently accessed data is placed in the first-level cache.
CN201210027340.4A 2012-02-08 2012-02-08 Method for processing data of systems of linear equations and device Active CN102609393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210027340.4A CN102609393B (en) 2012-02-08 2012-02-08 Method for processing data of systems of linear equations and device


Publications (2)

Publication Number Publication Date
CN102609393A true CN102609393A (en) 2012-07-25
CN102609393B CN102609393B (en) 2015-07-22

Family

ID=46526778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210027340.4A Active CN102609393B (en) 2012-02-08 2012-02-08 Method for processing data of systems of linear equations and device

Country Status (1)

Country Link
CN (1) CN102609393B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661457A (en) * 2008-08-29 2010-03-03 国际商业机器公司 Method and device for solving triangular linear equation set of multiprocessor system
CN101751376A (en) * 2009-12-30 2010-06-23 中国人民解放军国防科学技术大学 Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set


Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657219A (en) * 2015-02-27 2015-05-27 西安交通大学 Application program thread count dynamic regulating method used under isomerous many-core system
CN104657219B (en) * 2015-02-27 2017-10-20 西安交通大学 A kind of application program threads number dynamic adjusting method being used under isomery many-core system
CN105955713A (en) * 2016-05-10 2016-09-21 河北省科学院应用数学研究所 Spline interpolation and data parallel-based data processing method
CN105955713B (en) * 2016-05-10 2018-04-03 河北省科学院应用数学研究所 Data processing method based on spline interpolation and data parallel
CN106407006A (en) * 2016-08-31 2017-02-15 上海交通大学 GPU (Graphics Processing Unit) dynamic task allocation method based on Whippletree model
CN107817969A (en) * 2016-08-31 2018-03-20 华为技术有限公司 A kind of program creating method, device and computer system
CN106598913A (en) * 2016-12-23 2017-04-26 郑州云海信息技术有限公司 KNL cluster acceleration solving method and apparatus
CN107480043A (en) * 2016-12-23 2017-12-15 宝沃汽车(中国)有限公司 The method of testing and system of code execution time
CN107608786A (en) * 2017-08-25 2018-01-19 北京科技大学 A kind of high stored building group Method of Seismic Disaster Analysisof based on GPU and Distributed Calculation
CN110021339A (en) * 2017-12-27 2019-07-16 北京大学 Cluster parallel computing accelerated method based on protein folding measuring and calculating protein structure
CN110021339B (en) * 2017-12-27 2021-04-30 北京大学 Cluster parallel computing acceleration method based on protein folding calculation protein structure
CN108920412A (en) * 2018-06-20 2018-11-30 中国科学院计算技术研究所 For the algorithm automated tuning method of Heterogeneous Computing machine architecture
CN108920412B (en) * 2018-06-20 2020-12-29 中国科学院计算技术研究所 Algorithm automatic tuning method for heterogeneous computer system structure
CN112446004A (en) * 2019-08-28 2021-03-05 无锡江南计算技术研究所 Unstructured grid DILU preconditioned child-many-core parallel optimization algorithm
CN112446004B (en) * 2019-08-28 2023-07-07 无锡江南计算技术研究所 Non-structural grid DILU preconditioned sub-many-core parallel optimization method
CN110675490A (en) * 2019-09-27 2020-01-10 武汉中旗生物医疗电子有限公司 Three-dimensional ultrasonic rendering imaging method and device
CN110675490B (en) * 2019-09-27 2023-04-28 武汉中旗生物医疗电子有限公司 Three-dimensional ultrasonic rendering imaging method and device
CN112486671A (en) * 2020-11-16 2021-03-12 青海大学 GRAPES system optimization method, system, medium and device based on GPU
CN117473212A (en) * 2023-12-27 2024-01-30 粤港澳大湾区数字经济研究院(福田) GPU acceleration method, device, equipment and storage medium of NTT algorithm
CN117473212B (en) * 2023-12-27 2024-04-16 粤港澳大湾区数字经济研究院(福田) GPU acceleration method, device, equipment and storage medium of NTT algorithm

Also Published As

Publication number Publication date
CN102609393B (en) 2015-07-22

Similar Documents

Publication Publication Date Title
CN102609393B (en) Method for processing data of systems of linear equations and device
Pattnaik et al. Scheduling techniques for GPU architectures with processing-in-memory capabilities
US8782645B2 (en) Automatic load balancing for heterogeneous cores
Chen et al. Adaptive cache management for energy-efficient GPU computing
Burtscher et al. A quantitative study of irregular programs on GPUs
Hong-Tao et al. K-means on commodity GPUs with CUDA
US8132172B2 (en) Thread scheduling on multiprocessor systems
CN102253919A (en) Concurrent numerical simulation method and system based on GPU and CPU cooperative computing
EP3742350A1 (en) Parallelization strategies for training a neural network
Zhang et al. Locality based warp scheduling in GPGPUs
Martínez-del-Amor et al. Population Dynamics P systems on CUDA
CN102110079A (en) Tuning calculation method of distributed conjugate gradient method based on MPI
Chen et al. tpSpMV: a two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures
US20130138923A1 (en) Multithreaded data merging for multi-core processing unit
US10558500B2 (en) Scheduling heterogenous processors
Zhang et al. NUMA-Aware DGEMM based on 64-bit ARMv8 multicore processors architecture
Wang Power analysis and optimizations for GPU architecture using a power simulator
Valero et al. Towards a more efficient use of gpus
CN115756605A (en) Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs
Ding et al. An efficient and comprehensive scheduler on Asymmetric Multicore Architecture systems
Chu et al. Efficient Algorithm Design of Optimizing SpMV on GPU
US10488911B2 (en) Method and computing system of allocating registers
Madruga et al. Parallel shared-memory workloads performance on asymmetric multi-core architectures
Wang Modeling and minimizing memory contention in general-purpose GPUs
Zhan et al. A graph-theory-based method for parallelizing the multiple-flow-direction algorithm on cuda compatible graphics processing units

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant