CN106598913A - KNL cluster acceleration solving method and apparatus - Google Patents
- Publication number
- CN106598913A (application CN201611208888.3A)
- Authority
- CN
- China
- Prior art keywords
- knl
- solving
- approximate solution
- solution
- mpi
- Legal status (an assumption, not a legal conclusion)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Computational Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Operations Research (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a KNL (Knights Landing) cluster accelerated solving method and apparatus. The method comprises: reading the coefficient matrix and the constant term of a symmetric positive definite linear system, and setting an initial solution and a solving-precision requirement; using MPI (Message Passing Interface) to control each KNL kernel to perform the program main-body computation and construct an approximate solution, where the program main body consists of the code segments, integrated as KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products, and scalar-vector products; judging whether the approximate solution meets the precision requirement; and, if so, outputting the approximate solution. By porting the conjugate gradient algorithm onto a KNL cluster platform, the method raises the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, lowers energy consumption, and reduces machine-room management and operation-and-maintenance costs; the acceleration scheme is simple to implement, which also reduces development cost. The disclosed KNL cluster accelerated solving apparatus has the same beneficial effects.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a KNL cluster accelerated solving method and apparatus.
Background art
Solving mathematical-physics models is indispensable work in many fields of engineering production and scientific research. With the development of computers, a series of numerical methods (finite differences, finite elements, boundary elements, meshless methods, and so on) have emerged. These methods share a common trait: through some specific discretization, the mathematical-physics model derived from a practical problem is reduced to a system of linear algebraic equations. The linear systems obtained by finite-element discretization are often symmetric positive definite, or become symmetric positive definite after simple processing. As problem sizes grow, however, solving these linear systems has become a major bottleneck in engineering production and scientific research. How to shorten the solution time of large-scale symmetric positive definite linear systems while reducing energy consumption and the cost of machine-room management and operation is therefore a technical problem that those skilled in the art need to address.
Summary of the invention
An object of the present invention is to provide a KNL cluster accelerated solving method and apparatus that port the conjugate gradient algorithm onto a KNL cluster platform, improving the utilization of hardware resources, thereby shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption, and lowering development cost.
To solve the above technical problem, the present invention provides a KNL cluster accelerated solving method, comprising:
reading the coefficient matrix and the constant term of a symmetric positive definite linear system, and setting an initial solution and a solving-precision requirement;
using MPI to control each KNL kernel to perform the program main-body computation and construct an approximate solution, where the program main body consists of the code segments, integrated as KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products, and scalar-vector products;
judging whether the approximate solution meets the solving-precision requirement;
and if so, outputting the approximate solution that meets the requirement.
Optionally, using MPI to control each KNL kernel to perform the program main-body computation and construct an approximate solution includes:
dividing the solution task of the symmetric positive definite linear system;
starting a corresponding number of processes according to the number of task divisions, and allocating a private storage space for each process;
the MPI master process reading the predetermined data and sending it to all processes, where the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
and the MPI master process receiving the results computed by all processes from the predetermined data, and processing all of the results to obtain the approximate solution.
Optionally, dividing the solution task of the symmetric positive definite linear system includes:
using a static division scheme, partitioning the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and the processing cores in each compute node are divided into N_grp groups.
Optionally, a KNL kernel performing the program main-body computation includes:
the KNL core group opening 4 * N_knl_core OpenMP threads to perform the program main-body computation.
Optionally, a KNL kernel performing the program main-body computation includes:
allocating the memory-bandwidth-limited data or arrays of the program main body in MCDRAM high-bandwidth memory.
The present invention also provides a KNL cluster accelerated solving apparatus, comprising:
a reading module, for reading the coefficient matrix and the constant term of a symmetric positive definite linear system, and setting an initial solution and a solving-precision requirement;
an approximate-solution solving module, for using MPI to control each KNL kernel to perform the program main-body computation and construct an approximate solution, where the program main body consists of the code segments, integrated as KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products, and scalar-vector products;
a solving-precision judging module, for judging whether the approximate solution meets the solving-precision requirement;
and a result output module, for outputting the approximate solution that meets the solving-precision requirement.
Optionally, the approximate-solution solving module includes:
a task division unit, for dividing the solution task of the symmetric positive definite linear system;
a task allocation unit, for starting a corresponding number of processes according to the number of task divisions and allocating a private storage space for each process;
a data allocation unit, for the MPI master process to read the predetermined data and send it to all processes, where the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
and an approximate-solution solving unit, for the MPI master process to receive the results computed by all processes from the predetermined data and process them to obtain the approximate solution.
Optionally, the task division unit specifically uses a static division scheme to partition the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and the processing cores in each compute node are divided into N_grp groups.
Optionally, the approximate-solution solving unit includes:
an approximate-solution solving subunit, for the KNL core group to open 4 * N_knl_core OpenMP threads to perform the program main-body computation.
Optionally, the approximate-solution solving module includes:
a bandwidth-limited-array allocation unit, for allocating the memory-bandwidth-limited data or arrays of the program main body in MCDRAM high-bandwidth memory.
The KNL cluster accelerated solving method provided by the present invention comprises: reading the coefficient matrix and the constant term of a symmetric positive definite linear system, and setting an initial solution and a solving-precision requirement; using MPI to control each KNL kernel to perform the program main-body computation and construct an approximate solution, where the program main body consists of the code segments, integrated as KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products, and scalar-vector products; judging whether the approximate solution meets the solving-precision requirement; and if so, outputting the approximate solution that meets the requirement.
It can be seen that the method ports the conjugate gradient algorithm onto a KNL cluster platform: MPI handles task distribution and message passing between nodes, and the KNL chips realize the parallel acceleration of large-scale matrix-vector computation, thereby improving hardware-resource utilization, shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption, and lowering the cost of machine-room management and operation; moreover, the acceleration scheme is simple to implement, which reduces development cost. The invention also discloses a KNL cluster accelerated solving apparatus with the same beneficial effects, which will not be repeated here.
Description of the drawings
In order to illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the KNL cluster accelerated solving method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the task division provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the MPI design flow provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of the KNL cluster accelerated solving apparatus provided by an embodiment of the present invention.
Specific embodiment
The core of the present invention is to provide a KNL cluster accelerated solving method and apparatus that port the conjugate gradient algorithm onto a KNL cluster platform, improving the utilization of hardware resources and thereby shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption, and lowering development cost.
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art, based on the embodiments of the present invention and without creative effort, fall within the scope of protection of the present invention.
At present, the conjugate gradient method (CG) is one of the most popular classes of methods for solving large-scale sparse symmetric linear systems. It is popular because CG needs only first-derivative information, converges faster than steepest descent, requires less computation than Newton iteration, and needs no externally tuned parameters. CG therefore combines low storage, fast convergence, strong stability, freedom from external parameters, and suitability for parallelization.
CG was first proposed by Hestenes and Stiefel in the early 1950s and has developed enormously through decades of subsequent research; the related theory and methods are by now very mature. For a sparse linear system Ax = b, the conjugate gradient algorithm adopted here proceeds as follows:
KNL (Knights Landing) is the second-generation Xeon Phi many-core processor released by Intel for high-performance parallel computing. A KNL chip can serve as a standalone host processor; it uses an improved, customized Silvermont microarchitecture on a 14 nm process, carries 64 to 72 cores, each of which can run up to 4 hardware threads (at most 288 threads per chip), and delivers over 3 TFlops of double-precision and over 6 TFlops of single-precision floating-point performance. The OPA (Omni-Path Architecture) fabric is a new interconnect technology designed specifically for high-performance computing; as an end-to-end interconnect solution, it brings the performance advantages of HPC clusters to a wide range of users.
MPI (Message Passing Interface), released in May 1994 and maintained jointly by numerous parallel-computer vendors, software developers, and parallel-application users, is a message-passing interface that is currently one of the most popular parallel programming environments in the world, and the standard programming paradigm for scalable parallel computers, especially distributed-memory machines, workstation networks, and clusters. MPI programs are written mainly in Fortran+MPI or C+MPI; the standard defines hundreds of directly callable function interfaces. MPI has many advantages: portability and ease of use, complete asynchronous-communication capability, and a formal, detailed specification. MPI has been implemented on PCs, on MS Windows, and on all the main UNIX/Linux workstations and mainstream parallel machines; for distributed-memory environments programmed through higher-level, abstract message passing, the benefits brought by the MPI standard are obvious. This embodiment therefore combines CG with MPI to realize accelerated solving of symmetric positive definite linear systems on a KNL cluster. Referring to Fig. 1, a flow chart of the KNL cluster accelerated solving method provided by an embodiment of the present invention, the method can include:
S100: read the coefficient matrix and the constant term of the symmetric positive definite linear system, and set an initial solution and a solving-precision requirement;
S110: use MPI to control each KNL kernel to perform the program main-body computation and construct an approximate solution, where the program main body is the set of code segments, integrated as KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products, and scalar-vector products;
S120: judge whether the approximate solution meets the solving-precision requirement;
if so, perform S130 and output the approximate solution that meets the requirement; if not, return to S110.
The conjugate gradient algorithm can be divided into three parts: preprocessing, the iteration loop, and result output. The iteration loop, namely step S110 of Fig. 1 in which the approximate solution is computed, is the main body of the CG algorithm and accounts for more than 98% of the total operation count. The remainder consists mainly of data preparation (I/O, setting initial values, and so on) and the precision-requirement check; it is ill-suited to parallel processing and is therefore computed on the CPU. A single iteration, apart from a small amount of branching, consists mainly of matrix-vector operations: large-scale sparse matrix-vector multiplication, scalar-vector multiplication, vector norms, and vector addition, which are very well suited to parallel acceleration on a KNL cluster. In step S110 these four kinds of matrix-vector code segments are therefore integrated as KNL kernels, forming the program main body, and processed in parallel.
That is, in this embodiment, S100 to S120 form the flow of solving a symmetric positive definite linear system by the conjugate gradient algorithm. It differs from prior-art conjugate gradient algorithms in that the approximate-solution computation runs in parallel, at high speed, on the KNL cluster through MPI: an MPI-parallel version of the conjugate gradient algorithm is realized, and on top of that MPI version the code segments whose main bodies are the four operations (large-scale sparse matrix-vector multiplication, vector addition, vector inner products, and scalar-vector products) are used as KNL kernels.
The computation of step S110 is controlled by MPI. Specifically, the parallelism between compute nodes (i.e., between KNL kernels) is realized through the distributed-memory message-passing programming model (MPI): MPI divides the data and tasks across the different compute nodes, and completes the message passing and data exchange between processes. Optionally, using MPI to control each KNL kernel to perform the program main-body computation and construct an approximate solution can include:
dividing the solution task of the symmetric positive definite linear system;
starting a corresponding number of processes according to the number of task divisions, and allocating a private storage space for each process;
the MPI master process reading the predetermined data (the coefficient matrix, the constant term, and the initial solution) and sending it to all processes;
and the MPI master process receiving the results computed by all processes from the predetermined data, and processing all of the results to obtain the approximate solution.
In short, the above procedure decomposes the task, sets up the processes, and assigns each process its share; each process computes its allocated task in its private storage space and feeds the result back to the MPI master process (which may also be called the master thread). The master process then processes the received partial results to obtain the approximate solution of the current iteration.
To make the KNL cluster compute as fast as possible, the configuration of the compute nodes should be balanced, so that no single slow node delays the whole pipeline in each pass of the loop. Task division and distribution therefore need to be balanced, ideally so that every compute node achieves the same computational efficiency. For example, if the first compute node has compute capability 2 and the second has capability 1, two tasks can be assigned to the first node and one task to the second when tasks are distributed, keeping the nodes' loads balanced. A better practice, however, is to keep the hardware of all compute nodes identical, so that tasks can simply be distributed evenly; that is, every node should as far as possible have the same amount and type of memory (both DDR physical memory and MCDRAM high-bandwidth memory) and the same number of processors. To reduce inter-node communication latency, the nodes are interconnected with a high-performance network such as InfiniBand, 10-gigabit Ethernet, or Intel OPA, and the network switches operate in a fully non-blocking (total-exchange) mode.
With the hardware of all compute nodes identical, dividing the solution task of the symmetric positive definite linear system can therefore include:
using a static division scheme, partitioning the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and the processing cores in each compute node are divided into N_grp groups.
Specifically, processes are opened according to the number of compute nodes (also simply called nodes) in the cluster and the number of KNL core groups per node. Suppose the cluster has N_node compute nodes and the processing cores of each node are divided into N_grp groups; then N_p = N_node * N_grp MPI processes can be opened, each responsible for the computation of one KNL core group and for the message passing between core groups. The data are divided into N_p blocks, and each process is responsible for the computation of one data block.
Taking Ax = b as an example (see Fig. 2): the coefficient matrix A and the right-hand-side vector b of the linear system Ax = b are partitioned by rows. Let the number of nonzero elements in A be size and the number of rows be n. Every process needs the full vector x, so each process defines x with size n but computes only n/N_p of its elements; after every iteration the processes communicate to assemble the full x. The master process builds an out-of-order index and is responsible for gathering and broadcasting the data. Each process reads its required data from a binary file, then statically divides the task according to its own process rank. Each process starts multiple threads: the main thread handles communication with the other processes and also takes on part of the computation, while the other threads handle the remaining computational tasks; the local process terminates when its computation completes.
The coefficient matrix A is thus partitioned by rows into N_p blocks, which are handed to the N_p processes to compute. The partition keeps the number of nonzero elements held by each process roughly equal, guaranteeing load balance between processes; the master process is responsible for gathering and broadcasting data within the process group, each process owns independent data and completes its own computation, and the information between processes is exchanged through the message-passing interface.
The task division and the MPI multi-process design of the above procedure are described concretely below.
The MPI task division is as follows. The tasks between nodes are divided statically, by rows. Suppose the linear system to be solved is Ax = b, the number of nonzero elements in the sparse matrix A is size, the number of rows is n, and l processes P_0, P_1, ..., P_{l-1} are opened. The coefficient matrix A and the vector b can then be partitioned by rows into l blocks, i.e. A = [A_0^T, A_1^T, ..., A_{l-1}^T]^T and b = [b_0^T, b_1^T, ..., b_{l-1}^T]^T. The data blocks A_0 through A_{l-1} and b_0 through b_{l-1} are assigned to processes P_0 through P_{l-1} respectively, while the vector x is shared by all processes, as shown in Fig. 2. Each process therefore defines x with size n but computes only n/l of its elements; after every iteration the processes communicate to assemble the full vector x. The array H[N] records the number of rows handled by each process.
For convenience of communication, the array Hpos[l+1] records the position at which each process's data begin:
Hpos[0] = 0;
for (i = 1; i < l + 1; i++)
    Hpos[i] = Hpos[i - 1] + H[i - 1];
The MPI multi-process design is as follows; its flow is shown in Fig. 3. Each MPI process (simply "process") controls the computation and data transfer of one KNL core group. Suppose the cluster contains l KNL core groups. First, l processes P_0, P_1, ..., P_{l-1} are started, with P_0 as the master process. Each process allocates a private storage space according to its task; P_0 is responsible for reading the data and broadcasting it, on demand, to the other processes. Processes P_0 through P_{l-1} then each execute their own computational task and feed the results back to the master process P_0; the master process processes and integrates the fed-back results and broadcasts the necessary results to the other processes; these last two steps repeat until the computation completes. In the parallel implementation, MPI mainly uses collective communication to complete the message passing between processes; the MPI message-passing library functions called include MPI_Reduce, MPI_Allreduce, MPI_Bcast, and MPI_Allgatherv. The pseudocode of the MPI design framework is as follows:
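The framework just described can be sketched in C syntax as follows; this is pseudocode under stated assumptions, not the patent's listing, and the commented-out helpers (`local_spmv`, `local_dot`, `local_axpy`, `read_input`) are placeholders for the KNL kernels:

```c
#include <mpi.h>

/* Sketch of the MPI framework: rank 0 is the master process P0;
 * every rank owns one row block and a full copy of the vector x.
 * H[i] rows per process, starting at row Hpos[i] (see above). */
void cg_mpi_frame(int n, int l, const int *H, const int *Hpos,
                  double *x, double tol, int max_iter) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* P0 reads A, b, x0 and broadcasts the shared data on demand */
    /* if (rank == 0) read_input(...); */
    MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int k = 0; k < max_iter; k++) {
        /* each process computes its own block: q_local = A_local * p */
        /* local_spmv(...); */

        /* global inner products (e.g. <p,Ap>, <r,r>) by reduction */
        double dot_local = 0.0 /* local_dot(...) */, dot;
        MPI_Allreduce(&dot_local, &dot, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        /* local axpy updates of x, r, p on this row block */
        /* local_axpy(...); */

        /* assemble the full x from the per-process row blocks */
        MPI_Allgatherv(MPI_IN_PLACE, H[rank], MPI_DOUBLE,
                       x, (int *)H, (int *)Hpos, MPI_DOUBLE,
                       MPI_COMM_WORLD);

        /* convergence check on the reduced residual norm */
        if (dot < tol * tol) break;
    }
    (void)l;
}
```

Using MPI_Allreduce for the scalars and MPI_Allgatherv for x matches the collective-communication functions the description names; the out-of-order index and on-demand broadcast of the master process are omitted from this sketch.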
Based on the above technical solution, the KNL cluster accelerated solving method provided by the embodiment of the present invention ports the conjugate gradient algorithm onto a KNL cluster platform: MPI realizes the distribution of tasks and the message passing between nodes, and the KNL chips realize the parallel acceleration of large-scale matrix-vector computation, thereby improving hardware-resource utilization, shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption, and lowering the cost of machine-room management and operation; moreover, the acceleration scheme is simple to implement, which reduces development cost.
Based on the above embodiment, the computation within each compute node can be further accelerated using KNL's numerous cores: inside each node, parallelism is realized with the shared-memory multithreaded programming model OpenMP. The OpenMP implementation aims to make full use of the many compute cores of the KNL processor and the 4 threads each core can open, placing the computation-intensive code segments onto the KNL many-core processor for parallel processing and thereby accelerating the solution of the linear system. That is, in this embodiment, a KNL kernel performing the program main-body computation can include:
the KNL core group opening 4 * N_knl_core OpenMP threads to perform the program main-body computation.
Specifically, multiple threads are opened according to the number of cores in the KNL to realize parallel acceleration within the device: if each core group contains N_knl_core cores, at most 4 * N_knl_core threads can be opened.
In this embodiment, the matrix-vector operations of the algorithm (large-scale sparse matrix-vector multiplication, vector inner products, scalar-vector multiplication, and vector addition) serve as the multithreaded parallel regions. The algorithm therefore performs these four matrix-vector operations by calling subfunctions, and the kernel acceleration is designed with "#pragma omp" directives. Each KNL core supports 4 hardware threads, so at most 4 * N_knl_core OpenMP threads can be opened. The OpenMP design frameworks of the four kernel subfunctions are as follows:
1> matrix-vector multiplication kernel function
2> scalar-vector multiplication kernel function
3> vector inner-product kernel function
4> vector addition kernel function
Based on the above embodiment, to further improve the speed of each KNL kernel's main-body computation, in this embodiment a KNL kernel performing the program main-body computation can include:
allocating the memory-bandwidth-limited data or arrays of the program main body in MCDRAM high-bandwidth memory.
Specifically, the MCDRAM high-bandwidth memory within the node is fully exploited by allocating the data of the memory-access-intensive code segments in high-bandwidth memory. For example, when the memory occupied by the program main body is less than 16 GB, all of its data are allocated in high-bandwidth memory; when the program main body occupies more than 16 GB, only the most frequently accessed data segments are allocated in high-bandwidth memory.
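One common way to place individual arrays in MCDRAM is the memkind library's hbwmalloc interface; the patent names no API, so the sketch below is an assumption about the implementation, with a fallback to ordinary DDR memory when no high-bandwidth memory is present:

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory allocator */
#include <stdlib.h>

/* Allocate a bandwidth-critical array in MCDRAM when available,
 * falling back to ordinary DDR memory otherwise. */
double *alloc_hot(size_t n, int *in_hbw) {
    if (hbw_check_available() == 0) {   /* 0 means HBM is present */
        *in_hbw = 1;
        return hbw_malloc(n * sizeof(double));
    }
    *in_hbw = 0;
    return malloc(n * sizeof(double));
}

void free_hot(double *p, int in_hbw) {
    if (in_hbw) hbw_free(p);
    else free(p);
}
```

Alternatively, when the whole working set fits within the 16 GB of MCDRAM (the first case above) and the chip runs in flat memory mode, the entire program can simply be bound to the MCDRAM NUMA node with numactl, with no code changes at all.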
This makes full use of the computing resources of the KNL cluster, improves computing performance, and reduces energy consumption, thereby lowering machine-room management and operation costs. In the above embodiments, the hardware platform construction plan is first designed according to the characteristics of the application; then the MPI parallel version of the conjugate gradient algorithm is implemented; then, on the basis of the MPI version, the code segments whose bodies are the four operations of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product are taken as KNL kernels. Each KNL core group can open at most 4*N_knl_core OpenMP threads, completing the parallel computation on the node's KNL chips.
Based on the above technical solution, the KNL cluster accelerated solving method provided by the embodiment of the present invention mainly adopts the MPI+OpenMP hybrid programming paradigm. MPI is responsible for the division of data and tasks among devices and for message passing between devices, while the shared-memory OpenMP multithreaded programming model is mainly responsible for the parallel acceleration of the kernels in the algorithm. In the method, sparse matrix-vector multiplication, scalar-vector multiplication, vector subtraction, and vector inner product can serve as the important components of the kernel functions and are processed in parallel with multiple threads. This improves the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, and reduces energy consumption, thereby lowering machine-room management and operation costs; moreover, the acceleration method is simple and easy to implement, reducing development cost.
The KNL cluster accelerated solving apparatus provided by the embodiment of the present invention is introduced below; the KNL cluster accelerated solving apparatus described below and the KNL cluster accelerated solving method described above may be referred to in correspondence with each other.
Please refer to Fig. 4, which is a structural block diagram of the KNL cluster accelerated solving apparatus provided by an embodiment of the present invention; the apparatus may include:
Read module 100, for reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving precision requirement;
Approximate solution solving module 200, for using MPI to control each KNL kernel to perform the program-body calculation and construct an approximate solution; wherein the program body is the operation code segment, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product;
Solving precision judging module 300, for judging whether the approximate solution meets the solving precision requirement;
Result output module 400, for outputting the approximate solution that meets the solving precision requirement.
Specifically, building the hardware platform environment of the KNL cluster accelerated solving apparatus involves the balanced configuration design between compute nodes, the balanced configuration design within each node, the choice of the inter-node interconnect, the configuration of computing devices within each node, the way KNL memory is allocated within each node, and so on.
For the balanced configuration design, the memory of each compute node (including DDR physical memory and MCDRAM high-bandwidth memory) should be as identical as possible in size and type, and the number of central processing units should be identical. To reduce the communication latency between nodes, high-performance interconnects such as InfiniBand, 10-Gigabit Ethernet, or Intel OPA are used between nodes, and the network switch builds the cluster in a full-exchange manner. The memory and operating-mode configuration of every KNL node in the KNL cluster is consistent, and every KNL node has the same KNL processor chip; the KNL nodes are connected by a high-speed network, and the network switch in the KNL cluster adopts the full-exchange mode.
Building a high-performance hardware cluster platform with a balanced configuration within each node improves the data exchange capacity and resource utilization. Suppose a KNL PC cluster is adopted: the memory configuration of each KNL node is consistent, using DDR memory and MCDRAM high-bandwidth storage of the same type and size, and each KNL node is set to the same mode, so as to avoid the large disparity in processing capability caused by differences in memory read/write speed, which would make the processing power of the whole node low. Meanwhile, the KNL processor chips adopted in the compute nodes are identical, ensuring that the core count and clock frequency of every processor chip are the same. In addition, communication between processes places high demands on the inter-node interconnect; therefore, the compute nodes are interconnected with 10-Gigabit Ethernet or InfiniBand high-speed networks, and, to avoid message blocking caused by inconsistent bandwidth, the network switch adopts the full-exchange mode.
Each node is configured with two identical processors to ensure that the core counts and clock frequencies of the processors are identical; each node is configured with memory of identical type and size (not less than 128GB), ensuring that the core count and clock frequency of the KNL chips in the nodes are consistent and that the MCDRAM integrated on the chips is consistent. The nodes are interconnected with a high-speed network (the network may be chosen from the InfiniBand high-speed interconnect, Ethernet, or Intel OPA network connections), and the switch adopts the fully connected mode.
Based on the above embodiment, the approximate solution solving module 200 may include:
A task division unit, for dividing the solving task of the symmetric positive definite linear system;
A task allocation unit, for starting a corresponding number of processes according to the number of divisions of the solving task, and setting a private memory space for each process;
A data allocation unit, for the MPI host process to read predetermined data and send the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
An approximate solution solving unit, for the MPI host process to receive the results computed by all processes according to the predetermined data, and to process all the results to obtain the approximate solution.
Based on the above embodiment, the task division unit specifically adopts a static division mode, dividing the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks; wherein N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
Based on the above embodiment, the approximate solution solving unit includes:
An approximate solution solving subunit, for the KNL core group to open 4*N_knl_core OpenMP threads to perform the program-body calculation.
Based on any of the above embodiments, the approximate solution solving module 200 may include:
A bandwidth-limited array allocation unit, for allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program body into MCDRAM high-bandwidth memory.
Specifically, the KNL platform of this embodiment has far more cores than an ordinary CPU, and each core supports 4 hardware threads, so its computing capability is very powerful. Meanwhile, the KNL platform is configured with 16GB of high-bandwidth on-chip MCDRAM storage. In addition, MPI programming has good portability and complete communication functionality, and KNL programs are consistent and binary-compatible with CPU platforms, which makes it simple and efficient. Meanwhile, the conjugate gradient method has the characteristics of simplicity, fast convergence, high stability, and ease of parallelization. Therefore, a KNL cluster accelerated solving apparatus based on the conjugate gradient method is designed; the apparatus mainly adopts the MPI+OpenMP hybrid programming paradigm. MPI is responsible for the division of data and tasks among devices and for message passing between devices, while the shared-memory OpenMP multithreaded programming model is mainly responsible for the parallel acceleration of the kernels in the algorithm. In the apparatus, operations such as setting the initial solution, preprocessing before the iteration, constructing a new approximate solution, and judging whether the approximate solution meets the precision requirement are completed by a single thread, while sparse matrix-vector multiplication, scalar-vector multiplication, vector subtraction, and vector inner product can serve as the important components of the kernel functions and are processed in parallel with multiple threads.
In this description, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts among the embodiments, reference may be made to one another. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple; for the relevant parts, refer to the description of the method.
Those skilled in the art may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The KNL cluster accelerated solving method and apparatus provided by the present invention have been described in detail above. Specific examples are applied herein to explain the principles and embodiments of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core concept. It should be pointed out that those of ordinary skill in the art can also make several improvements and modifications to the present invention without departing from the principles of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.
Claims (10)
1. A KNL cluster accelerated solving method, characterized by comprising:
reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving precision requirement;
using MPI to control each KNL kernel to perform a program-body calculation and construct an approximate solution; wherein the program body is the operation code segment, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product;
judging whether the approximate solution meets the solving precision requirement;
if so, outputting the approximate solution that meets the solving precision requirement.
2. The KNL cluster accelerated solving method according to claim 1, characterized in that using MPI to control each KNL kernel to perform the program-body calculation and construct the approximate solution comprises:
dividing the solving task of the symmetric positive definite linear system;
starting a corresponding number of processes according to the number of divisions of the solving task, and setting a private memory space for each process;
the MPI host process reading predetermined data and sending the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
the MPI host process receiving the results computed by all processes according to the predetermined data, and processing all the results to obtain the approximate solution.
3. The KNL cluster accelerated solving method according to claim 2, characterized in that dividing the solving task of the symmetric positive definite linear system comprises:
adopting a static division mode, dividing the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks; wherein N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
4. The KNL cluster accelerated solving method according to claim 3, characterized in that the KNL kernels performing the program-body calculation comprises:
the KNL core group opening 4*N_knl_core OpenMP threads to perform the program-body calculation.
5. The KNL cluster accelerated solving method according to claim 4, characterized in that the KNL kernels performing the program-body calculation comprises:
allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program body into MCDRAM high-bandwidth memory.
6. A KNL cluster accelerated solving apparatus, characterized by comprising:
a read module, for reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving precision requirement;
an approximate solution solving module, for using MPI to control each KNL kernel to perform a program-body calculation and construct an approximate solution; wherein the program body is the operation code segment, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product;
a solving precision judging module, for judging whether the approximate solution meets the solving precision requirement;
a result output module, for outputting the approximate solution that meets the solving precision requirement.
7. The KNL cluster accelerated solving apparatus according to claim 6, characterized in that the approximate solution solving module comprises:
a task division unit, for dividing the solving task of the symmetric positive definite linear system;
a task allocation unit, for starting a corresponding number of processes according to the number of divisions of the solving task, and setting a private memory space for each process;
a data allocation unit, for the MPI host process to read predetermined data and send the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
an approximate solution solving unit, for the MPI host process to receive the results computed by all processes according to the predetermined data, and to process all the results to obtain the approximate solution.
8. The KNL cluster accelerated solving apparatus according to claim 7, characterized in that the task division unit specifically adopts a static division mode, dividing the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks; wherein N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
9. The KNL cluster accelerated solving apparatus according to claim 8, characterized in that the approximate solution solving unit comprises:
an approximate solution solving subunit, for the KNL core group to open 4*N_knl_core OpenMP threads to perform the program-body calculation.
10. The KNL cluster accelerated solving apparatus according to claim 9, characterized in that the approximate solution solving module comprises:
a bandwidth-limited array allocation unit, for allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program body into MCDRAM high-bandwidth memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611208888.3A CN106598913A (en) | 2016-12-23 | 2016-12-23 | KNL cluster acceleration solving method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106598913A | 2017-04-26 |
Family
ID=58601476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611208888.3A Pending CN106598913A (en) | 2016-12-23 | 2016-12-23 | KNL cluster acceleration solving method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598913A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102609393A (en) * | 2012-02-08 | 2012-07-25 | 浪潮(北京)电子信息产业有限公司 | Method for processing data of systems of linear equations and device |
CN104408019A (en) * | 2014-10-29 | 2015-03-11 | 浪潮电子信息产业股份有限公司 | Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform |
CN105260342A (en) * | 2015-09-22 | 2016-01-20 | 浪潮(北京)电子信息产业有限公司 | Solving method and system for symmetric positive definite linear equation set |
Non-Patent Citations (1)
Title |
---|
Avinash Sodani et al.: "Knights Landing: Second-Generation Intel Xeon Phi Product", IEEE Micro |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107562691A (en) * | 2017-08-14 | 2018-01-09 | 中国科学院力学研究所 | A kind of micro thrust dynamic testing method based on least square method |
CN107562691B (en) * | 2017-08-14 | 2020-03-17 | 中国科学院力学研究所 | Micro thrust dynamic test method based on least square method |
CN108986063A (en) * | 2018-07-25 | 2018-12-11 | 浪潮(北京)电子信息产业有限公司 | The method, apparatus and computer readable storage medium of gradient fusion |
CN115099031A (en) * | 2022-06-21 | 2022-09-23 | 南京维拓科技股份有限公司 | Method for realizing parallel computing of simulation solver under Windows environment |
CN115099031B (en) * | 2022-06-21 | 2024-04-26 | 南京维拓科技股份有限公司 | Parallel computing method for realizing simulation solver in Windows environment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks | |
Navaridas et al. | Simulating and evaluating interconnection networks with INSEE | |
Agrawal et al. | Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills | |
CN106598913A (en) | KNL cluster acceleration solving method and apparatus | |
Bienz et al. | Node aware sparse matrix–vector multiplication | |
US20220121928A1 (en) | Enhanced reconfigurable interconnect network | |
He et al. | A survey to predict the trend of AI-able server evolution in the cloud | |
Ahn et al. | Soft memory box: A virtual shared memory framework for fast deep neural network training in distributed high performance computing | |
US20230305967A1 (en) | Banked memory architecture for multiple parallel datapath channels in an accelerator | |
Kidane et al. | NoC based virtualized accelerators for cloud computing | |
Ding et al. | Leveraging one-sided communication for sparse triangular solvers | |
Del Sozzo et al. | A scalable FPGA design for cloud n-body simulation | |
CN107679409A (en) | A kind of acceleration method and system of data encryption | |
Robertsen et al. | Lattice Boltzmann simulations at petascale on multi-GPU systems with asynchronous data transfer and strictly enforced memory read alignment | |
Hironaka et al. | Multi-fpga management on flow-in-cloud prototype system | |
KR101656693B1 (en) | Apparatus and method for simulating computational fluid dynamics using Hadoop platform | |
Gharan et al. | Flexible simulation and modeling for 2D topology NoC system design | |
Hou et al. | Co-designing the topology/algorithm to accelerate distributed training | |
McManus | A strategy for mapping unstructured mesh computational mechanics programs onto distributed memory parallel architectures | |
EP4006736A1 (en) | Connecting processors using twisted torus configurations | |
da Rosa Righi et al. | Designing Cloud-Friendly HPC Applications | |
CN114218874A (en) | Parafoil fluid-solid coupling parallel computing method | |
TW202217564A (en) | Runtime virtualization of reconfigurable data flow resources | |
Ho et al. | Towards FPGA-assisted spark: An SVM training acceleration case study | |
Kidane et al. | Run-time scalable noc for fpga based virtualized ips |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170426 |