CN102609393A - Method and device for processing data of systems of linear equations - Google Patents
Abstract
The invention discloses a data processing method and device for systems of linear equations, relating to the fields of high-performance computing and scientific and engineering numerical computation. The method includes: determining, according to the video memory size of a graphics processing unit (GPU) and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems that the GPU can solve in parallel in one pass; when M is greater than or equal to the number of streaming multiprocessors of the GPU, solving M systems in parallel per launch of one GPU kernel until all systems are solved; and when M is less than the number of streaming multiprocessors, calling a GPU-based many-core matrix algebra library and solving one system of linear equations at a time. The invention also discloses a corresponding data processing device. The method and device meet the requirements of special-purpose industrial algorithms, improve computational performance, and greatly reduce power consumption.
Description
Technical field
The present invention relates to the fields of computer high-performance computing and scientific and engineering numerical computation, and in particular to a data processing method and device for systems of linear equations.
Background art
In recent years, high-performance computers have made considerable progress both worldwide and in China. Supercomputers delivering hundreds to thousands of teraflops have been developed one after another and put into commercial production, making it possible to solve and study many problems that were previously intractable.
Large-scale scientific and engineering numerical computation is one of the most important application areas of high-performance computing. Within it, solving systems of linear equations is a very common problem, arising in many engineering fields such as petroleum exploration, weather forecasting, and turbulence simulation.
In some industrial fields, the core algorithms of certain functions involve solving systems of linear equations. At present this is usually handled on large CPU clusters: one CPU core solves one system of linear equations (or several cores cooperate on one system), and the results from all CPU cores are finally gathered and output. This approach incurs very high cluster maintenance and management costs and power consumption, and its processing time is long, falling far short of industrial requirements. With the advent of the GPU (Graphics Processing Unit), whose many-core processing power, peak performance, and memory bandwidth far exceed those of the CPU, the GPU is well suited to the compute-intensive task of solving systems of linear equations. The common practice today is for one GPU kernel to solve one system; for example, the LU decomposition solver in the open-source MAGMA library adopts exactly this approach. It is well suited to solving large systems, but when systems must be solved many times in industrial workloads, the repeated GPU kernel launches incur extra overhead.
It is therefore clear that, to meet industrial requirements, a scheme for solving systems of linear equations of different scales is needed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data processing method and device for systems of linear equations that can solve systems of different scales.
To solve the above technical problem, the invention discloses a data processing method for systems of linear equations, comprising:
determining, according to the video memory size of a graphics processing unit (GPU) and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems of linear equations that the GPU can solve in parallel in one pass;
when M is greater than or equal to the number of streaming multiprocessors (SMs) of the GPU, determining that the systems to be computed are small, and solving M systems in parallel per launch of one GPU kernel until all systems are solved;
when M is less than the number of streaming multiprocessors of the GPU, determining that the systems to be computed are large, and calling a GPU-based many-core matrix algebra library to perform the computation, solving one system of linear equations at a time.
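The three determinations above form a simple dispatch rule. As an illustration only (the function and parameter names below are placeholders, not from the patent), the rule can be sketched in Python:

```python
def dispatch(free_mem_mb, aux_mem_mb, mem_per_system_mb, num_sms):
    """Choose a solving strategy from the maximum one-pass batch size M.

    Sketch of the patent's top-level rule; all names are illustrative.
    """
    # M: maximum number of systems solvable in parallel in one pass
    m = int((free_mem_mb - aux_mem_mb) / mem_per_system_mb)
    if m >= num_sms:
        # Small systems: one kernel launch solves M systems at once
        return ("batched", m)
    # Large systems: call the GPU matrix algebra library, one at a time
    return ("library", 1)

# Figures from the worked example later in the description:
# C2050 with 14 SMs, 2.6 GB of free video memory, 1.5 GB of auxiliary data
small = dispatch(2.6 * 1024, 1.5 * 1024, 0.897, 14)   # 343x343 job
large = dispatch(2.6 * 1024, 1.5 * 1024, 86.9, 14)    # 3375x3375 job
```

With those figures the 343x343 job falls on the batched path with M = 1255, and the 3375x3375 job on the one-at-a-time library path.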
Preferably, in the above method, the process of determining the maximum number M of systems of linear equations that the GPU can solve in parallel in one pass, according to the GPU video memory size and the dimension of the coefficient matrix, is as follows:
determine the available video memory space W of the GPU; compute, from the dimension of the coefficient matrix, the video memory space W0 required to solve one system of linear equations; count the video memory space W1 required by the extra auxiliary data for solving the systems; then compute from W, W0, and W1 the maximum number M of systems solvable in parallel in one pass, where:
M = (W - W1) / W0.
Preferably, in the above method, the process of solving M systems in parallel per GPU kernel launch until all systems are solved is as follows:
determine a Compute Unified Device Architecture (CUDA) threading model: the M systems of linear equations are mapped to one grid (Grid), each system is mapped to one thread block (Block), and the number of threads (Thread) per block is adjusted dynamically with the dimension of the coefficient matrix;
determine a CUDA parallel algorithm that performs the elimination operations in parallel across GPU threads: one thread block contains multiple warps; one warp is responsible for eliminating one column, and multiple warps eliminate multiple columns in parallel; within a warp, each thread is responsible for eliminating one row of that column, and multiple threads eliminate multiple rows in parallel;
determine a CUDA memory model that supports global-memory usage, shared-memory usage, and L1-cache usage.
Preferably, in the above method, said global-memory usage means: the 32 threads within a warp simultaneously access adjacent locations in global memory.
Preferably, in the above method, said shared-memory usage means: data shared within each block is placed in shared memory.
Preferably, in the above method, said L1-cache usage means: data that is accessed repeatedly is placed in the L1 cache.
The invention also discloses a data processing device for systems of linear equations, comprising:
a first unit, which determines, according to the video memory size of the GPU and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems that the GPU can solve in parallel in one pass;
a second unit, which judges whether M is greater than or equal to the number of streaming multiprocessors (SMs) of the GPU;
a third unit, which, when M is greater than or equal to the number of SMs, determines that the systems to be computed are small and solves M systems in parallel per GPU kernel launch until all systems are solved;
a fourth unit, which, when M is less than the number of SMs, determines that the systems to be computed are large and calls a GPU-based many-core matrix algebra library, solving one system of linear equations at a time.
Preferably, in the above device, the first unit determines the available video memory space W of the GPU; computes, from the dimension of the coefficient matrix, the video memory space W0 required to solve one system of linear equations; counts the video memory space W1 required by the extra auxiliary data; and computes from W, W0, and W1 the maximum number M of systems solvable in parallel in one pass, where:
M = (W - W1) / W0.
Preferably, in the above device, the third unit determines a Compute Unified Device Architecture (CUDA) threading model, in which the M systems of linear equations are mapped to one grid, each system is mapped to one thread block, and the number of threads per block is adjusted dynamically with the dimension of the coefficient matrix;
determines a CUDA parallel algorithm that performs the elimination operations in parallel across GPU threads: one GPU block contains multiple warps; one warp eliminates one column, and multiple warps eliminate multiple columns in parallel; within a warp, each thread eliminates one row of that column, and multiple threads eliminate multiple rows in parallel;
and determines a CUDA memory model that supports global-memory usage, shared-memory usage, and L1-cache usage.
Preferably, in the above device, said global-memory usage means: the 32 threads within a warp simultaneously access adjacent locations in global memory;
said shared-memory usage means: data shared within each block is placed in shared memory;
said L1-cache usage means: data that is accessed repeatedly is placed in the L1 cache.
The technical scheme can use the GPU to rapidly solve systems of linear equations of different scales, meeting the needs of special-purpose industrial algorithms. Compared with conventional schemes that solve the systems on CPU processors, it improves computational performance, greatly reduces power consumption, and lowers machine-room construction costs as well as management, operation, and maintenance costs.
Brief description of the drawings
Fig. 1 is a flowchart of solving systems of linear equations in this embodiment.
Fig. 2 shows the correspondence between systems of linear equations and GPU hardware in this embodiment.
Detailed description
To make the objects, technical scheme, and advantages of the present invention clearer, the technical scheme of the present invention is further described below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of this application and the features within them may be combined with each other arbitrarily.
Embodiment 1
The applicant found that although there are many algorithms for solving systems of linear equations, some are very efficient for large systems but extremely inefficient for small ones, while others can solve many systems simultaneously with high efficiency but are very inefficient when solving one system at a time. The applicant therefore proposes using the GPU to solve systems of different scales: for smaller systems, solve as many systems as possible in parallel in one pass on the GPU, maximizing the parallel scale and fully exploiting the computational resources; for larger systems, call the open-source GPU library MAGMA and solve one system at a time. Together these cover solving systems of linear equations of different scales. Specifically, this embodiment provides a data processing method for systems of linear equations, shown in Fig. 1, comprising the following steps.
The first step, determining M, can be divided into the following sub-steps:
a) compute the available video memory space W of the GPU;
b) from the dimension of the coefficient matrix, compute the video memory space W0 required to solve one system of linear equations;
c) count the video memory space W1 required by the extra auxiliary data;
d) from W, W0, and W1, compute the maximum number M of systems solvable in parallel in one pass: M = (W - W1) / W0.
In the next step, when the systems to be computed are determined to be small, M systems are solved in parallel per GPU kernel launch until all systems are solved. This makes full use of the GPU's many-core resources: if one GPU kernel solved only one system, parallelism would be low and the GPU's computational resources would be underutilized.
In this embodiment, the concrete computation for small systems of linear equations is as follows:
a) Determine the CUDA threading model.
The M systems of linear equations are mapped to one GPU grid; each system is mapped to one GPU block, and the number of threads per GPU block is adjusted dynamically with the dimension of the coefficient matrix.
b) Determine the CUDA parallel algorithm.
The CUDA parallel algorithm is designed around LU decomposition, whose most time-consuming part is the elimination operation, which can be parallelized across GPU threads. One GPU block contains multiple warps (of 32 threads each); one warp is responsible for eliminating one column, and multiple warps eliminate multiple columns in parallel; within a warp, each thread is responsible for eliminating one row of that column, and the 32 threads eliminate multiple rows in parallel.
c) Determine the CUDA memory model, which supports global memory (Global memory) usage, shared memory (Shared memory) usage, and first-level cache (L1 cache) usage.
Specifically, global-memory usage means: a Fermi GPU issues memory accesses per warp, and to achieve coalesced global-memory access with optimal performance, the 32 threads within a warp should simultaneously access adjacent locations in global memory. Because the two-dimensional H matrix is stored with the elements of each column contiguous, the 32 threads should simultaneously access 32 row elements within the same column during elimination, improving memory performance.
Shared-memory usage means: because shared memory is on-chip GPU memory with fast access, data shared within a block, such as the common column elements used during elimination, can be placed in shared memory to improve memory performance.
L1-cache usage means: a Fermi GPU provides an L1 cache in which repeatedly accessed data can be placed to improve memory performance. Because the L1 cache is on-chip and resource-constrained, the combined size of the L1 cache and shared memory is only 64 KB, dynamically configurable as either 16 KB or 48 KB for the L1 cache. When shared memory suffices, for example when the system scale is small, the L1 cache can be configured to 48 KB, and the enlarged L1 cache further improves memory performance.
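The coalescing condition can be made concrete by listing the byte offsets one warp touches. The sketch below assumes column-major storage and 8-byte elements (assumptions: the patent states neither the layout code nor the element size, though 8-byte elements are consistent with its memory figures):

```python
def addresses_touched(n, col, rows, elem_size=8):
    """Byte offsets read when each thread of a warp loads one row element
    of the same column, with the n x n matrix stored column-major:
    element (row, col) lives at byte offset (col * n + row) * elem_size."""
    return [(col * n + r) * elem_size for r in rows]

# 32 threads of one warp read rows 0..31 of column 5 of a 343x343 matrix
offs = addresses_touched(343, 5, range(32))
# Consecutive threads hit consecutive 8-byte words: a coalesced access
contiguous = all(b - a == 8 for a, b in zip(offs, offs[1:]))
```

Had the matrix been stored the other way round, consecutive threads would be n elements apart and the warp's accesses would not coalesce.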
The implementation of the above method is described below in a concrete application scenario, a seismic exploration processing module.
Step A. For the seismic exploration processing module, analysis of its serial CPU algorithm shows that its core algorithm solves 1806 systems of linear equations in a loop using LU decomposition, with no data dependences between the systems. The algorithm can therefore solve up to N = 1806 systems in parallel in one pass.
Step B. Determine the maximum number M of systems solvable in parallel in one pass, according to the GPU video memory size and the coefficient matrix dimension.
Specifically, suppose the available video memory of an NVIDIA C2050 GPU is W = 2.6 GB, with ECC enabled.
Each coefficient matrix is two-dimensional. For a small job (small systems of linear equations) its dimension is 343*343, and the video memory required to solve one small system is computed as W0 = 0.897 MB; for a large job (large systems) its dimension is 3375*3375, and the video memory required to solve one large system is computed as W0 = 86.9 MB.
The video memory required by the other data is counted as W1 = 1.5 GB.
From the available GPU video memory and the video memory required per system, the maximum number of systems solvable in parallel in one pass for the small job is:
M = (W - W1) / W0 = 1255;
and for the large job:
M = (W - W1) / W0 = 12.
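These figures can be checked arithmetically. Assuming 8-byte double-precision matrix elements (an assumption; the patent does not state the element type) reproduces the quoted W0 values, and the quoted M values follow from M = (W - W1) / W0:

```python
MB = 1024 * 1024

def w0_mb(n, elem_size=8):
    """Video memory (MB) for one n x n coefficient matrix, assuming
    elem_size bytes per element (8 bytes is an assumption)."""
    return n * n * elem_size / MB

# The W0 figures quoted for the two job scales
assert abs(w0_mb(343) - 0.897) < 1e-3
assert abs(w0_mb(3375) - 86.9) < 1e-2

# M = (W - W1) / W0, with W = 2.6 GB available and W1 = 1.5 GB auxiliary
W_mb, W1_mb = 2.6 * 1024, 1.5 * 1024
assert int((W_mb - W1_mb) / 0.897) == 1255   # small job
assert int((W_mb - W1_mb) / 86.9) == 12      # large job
```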
Step C. The NVIDIA C2050 contains 14 SMs. For the small job, M = 1255 is greater than the GPU's SM count of 14, indicating that the coefficient matrix dimension is small and this is a small-system solve; proceed to Step D. For the large job, M = 12 is less than the SM count of 14, indicating that the dimension is large and this is a large-system solve; in that case the open-source MAGMA library is called, solving one system of linear equations at a time.
Step D. One GPU kernel launch solves M = 1255 systems in parallel, so the 1806 systems are completed with two kernel launches.
In this embodiment, the concrete operations of Step D are as follows.
Determine the CUDA threading model: the 1255 systems of linear equations are mapped to one GPU grid, each system to one GPU block, giving 1255 blocks in total, defined as dim3 dimGrid(1255); the correspondence to GPU hardware is shown in Fig. 2. Each GPU block has 448 threads, defined as (32, 14).
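The launch geometry for the small job follows directly from these numbers. A sketch of the host-side arithmetic (the helper name is illustrative, not from the patent):

```python
import math

def launch_geometry(num_systems, max_parallel, warps_per_block, warp_size=32):
    """Blocks, threads per block, and kernel launches for the batched path."""
    blocks = min(num_systems, max_parallel)           # one block per system
    threads_per_block = warps_per_block * warp_size   # e.g. (32, 14) = 448
    launches = math.ceil(num_systems / max_parallel)  # passes over all systems
    return blocks, threads_per_block, launches

# Small job: 1806 systems in total, at most M = 1255 per pass, 14 warps/block
geom = launch_geometry(1806, 1255, 14)
```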
Determine the CUDA parallel algorithm: the algorithm is designed around LU decomposition, whose most time-consuming part is the elimination operation, parallelized across GPU threads. For this small job, one GPU block contains 14 warps (each of 32 threads); one warp is responsible for eliminating one column, and the 14 warps eliminate 14 columns in parallel; within a warp, each thread is responsible for eliminating one row of that column, and the 32 threads eliminate multiple rows in parallel.
Determine the CUDA memory model design. First, global-memory usage: a Fermi GPU issues memory accesses per warp, and to achieve coalesced global-memory access with optimal performance, the 32 threads within a warp should simultaneously access adjacent locations in global memory; because the two-dimensional H matrix is stored with the elements of each column contiguous, the 32 threads simultaneously access 32 row elements within the same column during elimination, improving memory performance. Second, shared-memory usage: because shared memory is on-chip GPU memory with fast access, data shared within a block, such as the common column elements used during elimination, is placed in shared memory to improve memory performance. Third, L1-cache usage: a Fermi GPU provides an L1 cache in which repeatedly accessed data can be placed; since the L1 cache is on-chip and resource-constrained, the combined size of the L1 cache and shared memory is only 64 KB, dynamically configurable as 16 KB or 48 KB. For this small job, shared memory suffices, so the L1 cache is configured to 48 KB; the enlarged L1 cache further improves memory performance.
Performance tests of the above instance are given below.
First, the test environment and test data: the environment comprises the hardware environment, the software environment, and the serial and parallel programs, with concrete parameters shown in Table 1.
Table 1. Test environment and test data.
Next, the performance results. The jobs of the two scales (the small job being small systems of linear equations, the large job being large systems) were each tested 10 times and the mean taken, recording the average running time of the single-threaded serial CPU program (RNA_CPU_Ave_Time) and of the parallel GPU program (RNA_GPU_Ave_Time); the results are shown in Table 2.
Table 2. Performance comparison of the two algorithms.
| System scale | RNA_CPU_Ave_Time | RNA_GPU_Ave_Time | Speedup |
| Small job (343*343) | 258 s | 13.89 s | 18.57× |
| Large job (3375*3375) | 32116 s | 637 s | 50.4× |
Table 2 shows that: 1. for different job scales, that is, different scales of systems of linear equations, the method dynamically selects the acceleration scheme, and the parallel GPU program clearly outperforms the single-threaded serial CPU program, with speedups of 18.57× to 50.4×;
2. the larger the scale of the systems of linear equations, the larger the computation, and the higher the speedup of the parallel GPU program.
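The speedup column of Table 2 is simply the ratio of the two mean running times; a quick check:

```python
# Mean running times from Table 2, in seconds: (CPU serial, GPU parallel)
results = {
    "small job (343x343)": (258.0, 13.89),
    "large job (3375x3375)": (32116.0, 637.0),
}

speedups = {scale: cpu_t / gpu_t for scale, (cpu_t, gpu_t) in results.items()}
for scale, s in speedups.items():
    print(f"{scale}: speedup {s:.2f}x")
```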
As the above embodiment shows, the technical scheme solves systems of linear equations of different scales on NVIDIA's Fermi GPU platform, exploiting the GPU's many-core computing power and dynamically solving multiple systems in one pass according to the coefficient matrix dimension. Compared with solving on a traditional CPU platform, performance improves by 18.57× to 50.4×, well satisfying industrial requirements while reducing machine-room construction costs and management, operation, and maintenance costs.
Embodiment 2
This embodiment describes a data processing device for systems of linear equations that can implement the scheme of Embodiment 1. The device comprises at least the following units.
A first unit, which determines, according to the GPU video memory size and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems that the GPU can solve in parallel in one pass.
In this embodiment, the first unit computes M by the following formula:
M = (W - W1) / W0
where W is the available GPU video memory space;
W0 is the video memory space required for one system of linear equations, computed from the dimension of the coefficient matrix;
W1 is the video memory space required by the extra auxiliary data for solving the systems.
A second unit, which judges whether M is greater than or equal to the number of SMs of the GPU.
A third unit, which, when the second unit judges that M is greater than or equal to the number of SMs of the GPU, determines that the systems to be computed are small and solves M systems in parallel per GPU kernel launch until all systems are solved.
Specifically, the third unit first determines the CUDA threading model: the M systems of linear equations are mapped to one GPU grid, each system to one GPU block, with the number of threads per block adjusted dynamically with the coefficient matrix dimension. It then determines the CUDA parallel algorithm, performing the elimination operations in parallel across GPU threads: one GPU block contains multiple warps; one warp eliminates one column, and multiple warps eliminate multiple columns in parallel; within a warp, each thread eliminates one row of that column, and multiple threads eliminate multiple rows in parallel. Finally it determines the CUDA memory model, which supports global-memory, shared-memory, and L1-cache usage. For the concrete process see the operations in Embodiment 1, not repeated here.
A fourth unit, which, when the second unit judges that M is less than the number of SMs of the GPU, determines that the systems to be computed are large and calls the GPU-based many-core matrix algebra library, solving one system of linear equations at a time.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method may be carried out by a program instructing the related hardware, the program being stored in a computer-readable storage medium such as a read-only memory, magnetic disk, or optical disc. Optionally, all or part of the steps of the above embodiments may also be implemented with one or more integrated circuits. Accordingly, each module or unit in the above embodiments may be implemented in hardware or as a software functional module. The application is not restricted to any particular combination of hardware and software.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A data processing method for systems of linear equations, characterized in that the method comprises:
determining, according to the video memory size of a graphics processing unit (GPU) and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems of linear equations that the GPU can solve in parallel in one pass;
when M is greater than or equal to the number of streaming multiprocessors of the GPU, determining that the systems to be computed are small, and solving M systems in parallel per launch of one GPU kernel until all systems are solved;
when M is less than the number of streaming multiprocessors of the GPU, determining that the systems to be computed are large, and calling a GPU-based many-core matrix algebra library to perform the computation, solving one system of linear equations at a time.
2. The method of claim 1, characterized in that the process of determining the maximum number M of systems solvable in parallel in one pass, according to the GPU video memory size and the coefficient matrix dimension, is as follows:
determining the available video memory space W of the GPU; computing, from the dimension of the coefficient matrix, the video memory space W0 required to solve one system of linear equations; counting the video memory space W1 required by the extra auxiliary data; and computing from W, W0, and W1 the maximum number M of systems solvable in parallel in one pass, where:
M = (W - W1) / W0.
3. The method of claim 1 or 2, characterized in that the process of solving M systems in parallel per GPU kernel launch until all systems are solved is as follows:
determining a Compute Unified Device Architecture (CUDA) threading model, in which the M systems of linear equations are mapped to one grid, each system is mapped to one thread block, and the number of threads per block is adjusted dynamically with the dimension of the coefficient matrix;
determining a CUDA parallel algorithm that performs the elimination operations in parallel across GPU threads: one thread block contains multiple warps; one warp eliminates one column, and multiple warps eliminate multiple columns in parallel; within a warp, each thread eliminates one row of that column, and multiple threads eliminate multiple rows in parallel;
determining a CUDA memory model that supports global-memory usage, shared-memory usage, and L1-cache usage.
4. The method of claim 3, wherein the global memory usage means that the 32 threads within a warp simultaneously access contiguous locations in global memory.
5. The method of claim 3, wherein the shared memory usage means that data shared by the threads of each block is placed in shared memory.
6. The method of claim 3, wherein the first-level cache usage means that data that is repeatedly accessed is placed in the first-level cache.
7. A data processing device for systems of linear equations, characterized in that the device comprises:
a first unit, which determines, according to the video memory size of a graphics processing unit (GPU) and the dimension of the coefficient matrix of the systems of linear equations to be computed, the maximum number M of systems of linear equations that the GPU can solve in parallel in one pass;
a second unit, which judges whether M is greater than or equal to the number of stream processors of the GPU;
a third unit, which, when M is greater than or equal to the number of stream processors of the GPU, determines that the systems of linear equations to be computed are small-scale and solves M equations in parallel in one pass through the GPU kernel until all of the systems of linear equations are solved;
a fourth unit, which, when M is less than the number of streaming multiprocessors (SMs) of the GPU, determines that the system of linear equations to be computed is large-scale and calls a matrix algebra library based on the GPU and a many-core architecture to perform the computation, wherein one system of linear equations is solved at a time.
8. The device of claim 7, wherein the first unit determines the available video memory space W of the GPU; calculates, according to the dimension of the coefficient matrix, the video memory space W0 required to solve one system of linear equations; counts the video memory space W1 required by the extra auxiliary data for solving the systems of linear equations; and calculates, from W, W1, and W0, the maximum number M of systems of linear equations that can be solved in parallel in one pass, wherein:
M = (W - W1) / W0.
9. The device of claim 7 or 8, wherein the third unit:
determines a Compute Unified Device Architecture (CUDA) threading model, in which the M systems of linear equations are mapped to one grid, each system of linear equations is mapped to one thread block, and the number of threads in each thread block is dynamically adjusted according to the dimension of the coefficient matrix;
determines a CUDA parallel algorithm that performs the elimination operation with GPU threads in parallel, wherein a GPU block comprises multiple warps, one warp is responsible for eliminating one column, and multiple warps eliminate multiple columns in parallel; within a warp, each thread is responsible for eliminating one row of the column, and multiple threads eliminate multiple rows in parallel;
and determines a CUDA memory model, wherein the CUDA memory model supports global memory usage, shared memory usage, and first-level cache usage.
10. The device of claim 9, wherein:
the global memory usage means that the 32 threads within a warp simultaneously access contiguous locations in global memory;
the shared memory usage means that data shared by the threads of each block is placed in shared memory;
the first-level cache usage means that data that is repeatedly accessed is placed in the first-level cache.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210027340.4A CN102609393B (en) | 2012-02-08 | 2012-02-08 | Method for processing data of systems of linear equations and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102609393A true CN102609393A (en) | 2012-07-25 |
CN102609393B CN102609393B (en) | 2015-07-22 |
Family
ID=46526778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210027340.4A Active CN102609393B (en) | 2012-02-08 | 2012-02-08 | Method for processing data of systems of linear equations and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102609393B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101661457A (en) * | 2008-08-29 | 2010-03-03 | 国际商业机器公司 | Method and device for solving triangular linear equation set of multiprocessor system |
CN101751376A (en) * | 2009-12-30 | 2010-06-23 | 中国人民解放军国防科学技术大学 | Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104657219A (en) * | 2015-02-27 | 2015-05-27 | 西安交通大学 | Application program thread count dynamic regulating method used under isomerous many-core system |
CN104657219B (en) * | 2015-02-27 | 2017-10-20 | 西安交通大学 | A kind of application program threads number dynamic adjusting method being used under isomery many-core system |
CN105955713A (en) * | 2016-05-10 | 2016-09-21 | 河北省科学院应用数学研究所 | Spline interpolation and data parallel-based data processing method |
CN105955713B (en) * | 2016-05-10 | 2018-04-03 | 河北省科学院应用数学研究所 | Data processing method based on spline interpolation and data parallel |
CN106407006A (en) * | 2016-08-31 | 2017-02-15 | 上海交通大学 | GPU (Graphics Processing Unit) dynamic task allocation method based on Whippletree model |
CN107817969A (en) * | 2016-08-31 | 2018-03-20 | 华为技术有限公司 | A kind of program creating method, device and computer system |
CN106598913A (en) * | 2016-12-23 | 2017-04-26 | 郑州云海信息技术有限公司 | KNL cluster acceleration solving method and apparatus |
CN107480043A (en) * | 2016-12-23 | 2017-12-15 | 宝沃汽车(中国)有限公司 | The method of testing and system of code execution time |
CN107608786A (en) * | 2017-08-25 | 2018-01-19 | 北京科技大学 | A kind of high stored building group Method of Seismic Disaster Analysisof based on GPU and Distributed Calculation |
CN110021339A (en) * | 2017-12-27 | 2019-07-16 | 北京大学 | Cluster parallel computing accelerated method based on protein folding measuring and calculating protein structure |
CN110021339B (en) * | 2017-12-27 | 2021-04-30 | 北京大学 | Cluster parallel computing acceleration method based on protein folding calculation protein structure |
CN108920412A (en) * | 2018-06-20 | 2018-11-30 | 中国科学院计算技术研究所 | For the algorithm automated tuning method of Heterogeneous Computing machine architecture |
CN108920412B (en) * | 2018-06-20 | 2020-12-29 | 中国科学院计算技术研究所 | Algorithm automatic tuning method for heterogeneous computer system structure |
CN112446004A (en) * | 2019-08-28 | 2021-03-05 | 无锡江南计算技术研究所 | Unstructured grid DILU preconditioned child-many-core parallel optimization algorithm |
CN112446004B (en) * | 2019-08-28 | 2023-07-07 | 无锡江南计算技术研究所 | Non-structural grid DILU preconditioned sub-many-core parallel optimization method |
CN110675490A (en) * | 2019-09-27 | 2020-01-10 | 武汉中旗生物医疗电子有限公司 | Three-dimensional ultrasonic rendering imaging method and device |
CN110675490B (en) * | 2019-09-27 | 2023-04-28 | 武汉中旗生物医疗电子有限公司 | Three-dimensional ultrasonic rendering imaging method and device |
CN112486671A (en) * | 2020-11-16 | 2021-03-12 | 青海大学 | GRAPES system optimization method, system, medium and device based on GPU |
CN117473212A (en) * | 2023-12-27 | 2024-01-30 | 粤港澳大湾区数字经济研究院(福田) | GPU acceleration method, device, equipment and storage medium of NTT algorithm |
CN117473212B (en) * | 2023-12-27 | 2024-04-16 | 粤港澳大湾区数字经济研究院(福田) | GPU acceleration method, device, equipment and storage medium of NTT algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN102609393B (en) | 2015-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102609393B (en) | Method for processing data of systems of linear equations and device | |
Pattnaik et al. | Scheduling techniques for GPU architectures with processing-in-memory capabilities | |
US8782645B2 (en) | Automatic load balancing for heterogeneous cores | |
Chen et al. | Adaptive cache management for energy-efficient GPU computing | |
Burtscher et al. | A quantitative study of irregular programs on GPUs | |
Hong-Tao et al. | K-means on commodity GPUs with CUDA | |
US8132172B2 (en) | Thread scheduling on multiprocessor systems | |
CN102253919A (en) | Concurrent numerical simulation method and system based on GPU and CPU cooperative computing | |
EP3742350A1 (en) | Parallelization strategies for training a neural network | |
Zhang et al. | Locality based warp scheduling in GPGPUs | |
Martínez-del-Amor et al. | Population Dynamics P systems on CUDA | |
CN102110079A (en) | Tuning calculation method of distributed conjugate gradient method based on MPI | |
Chen et al. | tpSpMV: a two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures | |
US20130138923A1 (en) | Multithreaded data merging for multi-core processing unit | |
US10558500B2 (en) | Scheduling heterogenous processors | |
Zhang et al. | NUMA-Aware DGEMM based on 64-bit ARMv8 multicore processors architecture | |
Wang | Power analysis and optimizations for GPU architecture using a power simulator | |
Valero et al. | Towards a more efficient use of gpus | |
CN115756605A (en) | Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs | |
Ding et al. | An efficient and comprehensive scheduler on Asymmetric Multicore Architecture systems | |
Chu et al. | Efficient Algorithm Design of Optimizing SpMV on GPU | |
US10488911B2 (en) | Method and computing system of allocating registers | |
Madruga et al. | Parallel shared-memory workloads performance on asymmetric multi-core architectures | |
Wang | Modeling and minimizing memory contention in general-purpose GPUs | |
Zhan et al. | A graph-theory-based method for parallelizing the multiple-flow-direction algorithm on cuda compatible graphics processing units |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |