CN106598913A - KNL cluster acceleration solving method and apparatus - Google Patents

KNL cluster acceleration solving method and apparatus

Info

Publication number
CN106598913A
CN106598913A (application CN201611208888.3A)
Authority
CN
China
Prior art keywords
knl
solving
approximate solution
solution
mpi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611208888.3A
Other languages
Chinese (zh)
Inventor
王明清 (Wang Mingqing)
张清 (Zhang Qing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201611208888.3A priority Critical patent/CN106598913A/en
Publication of CN106598913A publication Critical patent/CN106598913A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/11: Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a KNL (Knights Landing) cluster accelerated solving method and apparatus. The method comprises the steps of: reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving accuracy requirement; controlling each KNL kernel via MPI (Message Passing Interface) to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products; judging whether the approximate solution meets the solving accuracy requirement; and if so, outputting the approximate solution. By porting the conjugate gradient algorithm onto a KNL cluster platform, the method increases the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, reduces energy consumption, and lowers the management and operation/maintenance costs of the machine room; moreover, the acceleration method is simple and easy to implement, which reduces development cost. The KNL cluster accelerated solving apparatus disclosed by the invention has the abovementioned beneficial effects.

Description

KNL cluster accelerated solving method and apparatus
Technical field
The present invention relates to the field of computer technology, and in particular to a KNL cluster accelerated solving method and apparatus.
Background technology
Solving mathematical-physical models is indispensable work in numerous fields of engineering production and scientific research. With the development of computers, a series of numerical methods such as finite differences, finite elements, boundary elements and mesh-free methods have emerged one after another. These numerical methods have one thing in common: through a specific discretization, the mathematical-physical model derived from a practical problem is reduced to a system of linear algebraic equations. The linear systems obtained by finite element discretization are often symmetric positive definite, or become symmetric positive definite after simple processing. However, as the problem scale grows, solving the linear system becomes a major bottleneck in engineering production and scientific research. Therefore, how to shorten the time needed to solve large-scale symmetric positive definite linear systems while reducing energy consumption and lowering the cost of machine-room management and operation/maintenance is a technical problem that those skilled in the art need to address.
Summary of the invention
It is an object of the present invention to provide a KNL cluster accelerated solving method and apparatus that port the conjugate gradient algorithm onto a KNL cluster platform, improving the utilization of hardware resources, thereby shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption and lowering development cost.
To solve the above technical problem, the present invention provides a KNL cluster accelerated solving method, including:
reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving accuracy requirement;
controlling each KNL kernel via MPI to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products;
judging whether the approximate solution meets the solving accuracy requirement;
and if so, outputting the approximate solution that meets the solving accuracy requirement.
Optionally, controlling each KNL kernel via MPI to perform the program main body computation and construct the approximate solution includes:
dividing the solving task of the symmetric positive definite linear system;
starting a corresponding number of processes according to the number of divisions of the solving task, and arranging a private storage space for each process;
reading predetermined data by the MPI master process and sending the predetermined data to all processes, wherein the predetermined data comprises the coefficient matrix, the constant term and the initial solution;
receiving, by the MPI master process, the results computed by all processes from the predetermined data, and processing all the results to obtain the approximate solution.
Optionally, dividing the solving task of the symmetric positive definite linear system includes:
dividing, in a static division mode, the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
Optionally, the KNL kernels performing the program main body computation includes:
each KNL core group opening 4*N_knl_core OpenMP threads to perform the program main body computation.
Optionally, the KNL kernels performing the program main body computation includes:
allocating the memory-bandwidth-limited data or arrays in the program main body on MCDRAM high-bandwidth memory.
The present invention also provides a KNL cluster accelerated solving apparatus, including:
a reading module for reading the coefficient matrix and constant term of the symmetric positive definite linear system, and setting the initial solution and the solving accuracy requirement;
an approximate solution solving module for controlling each KNL kernel via MPI to perform the program main body computation and construct the approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products;
a solving accuracy judging module for judging whether the approximate solution meets the solving accuracy requirement;
a result output module for outputting the approximate solution that meets the solving accuracy requirement.
Optionally, the approximate solution solving module includes:
a task division unit for dividing the solving task of the symmetric positive definite linear system;
a task allocation unit for starting a corresponding number of processes according to the number of divisions of the solving task, and arranging a private storage space for each process;
a data allocation unit for reading predetermined data by the MPI master process and sending the predetermined data to all processes, wherein the predetermined data comprises the coefficient matrix, the constant term and the initial solution;
an approximate solution solving unit for receiving, by the MPI master process, the results computed by all processes from the predetermined data, and processing all the results to obtain the approximate solution.
Optionally, the task division unit is specifically configured to divide, in a static division mode, the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
Optionally, the approximate solution solving unit includes:
an approximate solution solving subunit for opening, by each KNL core group, 4*N_knl_core OpenMP threads to perform the program main body computation.
Optionally, the approximate solution solving module includes:
a bandwidth-limited array allocation unit for allocating the memory-bandwidth-limited data or arrays in the program main body on MCDRAM high-bandwidth memory.
The KNL cluster accelerated solving method provided by the present invention includes: reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving accuracy requirement; controlling each KNL kernel via MPI to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products; judging whether the approximate solution meets the solving accuracy requirement; and if so, outputting the approximate solution that meets the solving accuracy requirement.
It can be seen that the method ports the conjugate gradient algorithm onto a KNL cluster platform: MPI realizes the distribution of tasks and the message passing between nodes, and the KNL chips realize the parallel acceleration of large-scale matrix-vector computation. This improves the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, reduces energy consumption, lowers the cost of machine-room management and operation/maintenance, and, since the acceleration method is simple and easy to implement, reduces development cost. The invention also discloses a KNL cluster accelerated solving apparatus with the abovementioned beneficial effects, which will not be described again here.
Description of the drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flow chart of the KNL cluster accelerated solving method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the task division provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the MPI design flow provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of the KNL cluster accelerated solving apparatus provided by an embodiment of the present invention.
Specific embodiment
The core of the present invention is to provide a KNL cluster accelerated solving method and apparatus that port the conjugate gradient algorithm onto a KNL cluster platform, improving the utilization of hardware resources, thereby shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption and lowering development cost.
To make the purpose, technical solution and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
At present, the conjugate gradient method (CG) is one of the most popular classes of methods for solving large sparse symmetric linear systems. It is popular because CG needs only first-derivative information, converges faster than the steepest descent method, requires less computation than Newton iteration, and needs no tunable parameters. CG therefore has the advantages of low storage, fast convergence, strong stability, no external parameters, and suitability for parallelization.
CG was first proposed by Hestenes and Stiefel in the early 1950s; decades of related research have developed it enormously, and the relevant theory and methods are now very mature. For a sparse linear system Ax = b, the conjugate gradient algorithm flow currently adopted is as follows:
KNL (Knights Landing) is the second generation of Intel's Xeon Phi many-core processors, designed for high-performance parallel computing. A KNL chip can serve as a stand-alone central processor; it uses an improved, customized version of the Silvermont architecture and a new 14 nm process, with 64-72 cores, each of which can open up to 4 threads, for at most 288 threads per chip; its double-precision floating-point performance exceeds 3 TFlops and its single-precision performance exceeds 6 TFlops. The OPA (Omni-Path Architecture) fabric is a brand-new interconnect technology designed for optimizing high-performance computing, and an end-to-end interconnect solution that lets a wide range of users enjoy the performance advantages of HPC clusters.
MPI (Message Passing Interface) is a message passing interface released in May 1994 and jointly maintained by numerous parallel computer vendors, software development organizations and parallel application units. It is currently one of the world's most popular parallel programming environments, and in particular the programming paradigm for scalable parallel computers with distributed storage, workstation networks and clusters. MPI programs are mainly written in Fortran+MPI or C+MPI, and the interface has hundreds of function call interfaces that can be invoked directly. MPI has many advantages: portability and ease of use; complete asynchronous communication functionality; and a formal, detailed specification. MPI implementations exist on PCs, on MS Windows, and on all major UNIX/Linux workstations and mainstream parallel machines; in distributed storage environments, programs built on the higher-level, abstract message-passing layer of the MPI standard gain obvious benefits. Therefore, this embodiment combines CG with MPI to realize the accelerated solution of symmetric positive definite linear systems on a KNL cluster. Refer to Fig. 1, a flow chart of the KNL cluster accelerated solving method provided by an embodiment of the present invention; the method may include:
S100: read the coefficient matrix and constant term of the symmetric positive definite linear system, and set the initial solution and the solving accuracy requirement;
S110: control each KNL kernel via MPI to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products;
S120: judge whether the approximate solution meets the solving accuracy requirement;
if so, perform S130: output the approximate solution that meets the solving accuracy requirement; if not, return to S110.
The conjugate gradient algorithm can be divided into three parts: preprocessing, the iterative part, and result output. The iterative part, i.e. step S110 in Fig. 1 which computes the approximate solution, is the main body of the CG algorithm and accounts for more than 98% of the total operations. The remainder consists mainly of data preparation such as I/O operations and initial-value setting, plus the judgment of whether the accuracy requirement is met; it is unsuitable for parallel processing and is therefore computed on the CPU. Apart from a small amount of branching, a single iteration mainly comprises matrix-vector operations such as large-scale sparse matrix-vector multiplication, scalar-vector multiplication, vector norms and vector addition, which are very well suited to parallel acceleration on a KNL cluster. Step S110 therefore integrates these four kinds of matrix-vector operation code segments into the KNL kernels for parallel processing.
That is, in this embodiment, S100 to S120 form the flow by which the conjugate gradient algorithm solves the symmetric positive definite linear system. This conjugate gradient algorithm differs from the prior art in that the approximate-solution procedure runs in parallel at high speed on the KNL cluster via MPI: an MPI-parallel version of the conjugate gradient algorithm is implemented, and on top of the MPI version the code segments whose main body is the four operations (large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products) serve as the KNL kernels.
The computation of step S110 is controlled by MPI. Specifically, the parallelism between compute nodes (i.e. KNL kernels) is realized through the distributed-storage message-passing programming model (MPI): the MPI layer divides the data and tasks across the different compute node devices to complete the computing task, and completes the message passing and data interaction between processes. Optionally, controlling each KNL kernel via MPI to perform the program main body computation and construct the approximate solution may include:
dividing the solving task of the symmetric positive definite linear system;
starting a corresponding number of processes according to the number of divisions of the solving task, and arranging a private storage space for each process;
reading predetermined data by the MPI master process and sending the predetermined data to all processes, wherein the predetermined data comprises the coefficient matrix, the constant term and the initial solution;
receiving, by the MPI master process, the results computed by all processes from the predetermined data, and processing all the results to obtain the approximate solution.
Specifically, the above procedure divides the task, sets up the processes and distributes tasks to them; each process computes its allocated task in its own private storage space and feeds the result back to the MPI master process (which may also be called the main thread), and the master process processes the received results to obtain the approximate solution of the current computation.
To make the KNL cluster compute faster, the configuration of the compute nodes needs to be balanced, so that in each iteration no single slow compute node delays the whole procedure. Task division and distribution therefore need to be balanced, ideally so that every compute node achieves the same computational efficiency. For example, if the first compute node has computing capability 2 and the second has computing capability 1, two tasks can be distributed to the first node and one task to the second when tasks are distributed, to keep the tasks of the compute nodes balanced. A better practice, however, is to make the hardware of all compute nodes identical, so that tasks can simply be distributed evenly and more quickly: the memory size and memory type of each compute node (including DDR physical memory and MCDRAM high-bandwidth memory) and its number of central processors should be as identical as possible. To reduce inter-node communication latency, the nodes are interconnected by a high-performance network such as InfiniBand, 10-Gigabit Ethernet or Intel OPA, and the cluster is built with a network switch in full-crossbar mode.
Therefore, with each compute node having the same hardware, dividing the solving task of the symmetric positive definite linear system may include:
dividing, in a static division mode, the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
Specifically, processes are opened according to the number of compute nodes in the cluster and the number of KNL core groups per node. Suppose the cluster has N_node compute nodes and the processing cores of each compute node can be divided into N_grp groups; then N_p = N_node * N_grp MPI processes can be opened, each responsible for controlling the computation of one KNL core group and the message passing between core groups. The data are divided into N_p data blocks, and each process is responsible for the computing task of one data block.
The above procedure is illustrated taking Ax = b as an example; refer to Fig. 2. The coefficient matrix A and the right-hand-side vector b of the linear system are divided by rows; the number of nonzero elements in A is size and the number of rows is n. Every process needs the complete vector x, so each process defines x with size n but computes only n/N_p of its elements; after each computation the processes need to communicate to obtain the complete x data. The master process builds an out-of-order index and is responsible for the collection and broadcasting of data. Each process reads its required data from a binary file; each process then statically divides the tasks according to its own process number. Each process starts multiple threads: the main thread is responsible for communication with other processes and also takes on its share of the computing task, while the other threads are responsible for computation and other tasks; the local process terminates when the computation completes.
The coefficient matrix A of the linear system is divided by rows into N_p blocks, which are given to N_p processes to compute respectively, keeping the number of nonzero elements in each process as equal as possible to guarantee load balance between processes. The master process is responsible for the collection and broadcasting of data within the process group; each process owns independent data and completes its own computing task, and the information between processes is exchanged through the message passing interface.
The above procedure is described concretely below in terms of the MPI task division and the MPI multi-process design. The MPI task division is specifically as follows:
the tasks between nodes are divided row-wise in a static division mode. Suppose the linear system to solve is Ax = b, the sparse matrix A contains size nonzero elements and has n rows, and l processes P_0, P_1, ..., P_{l-1} are opened; the coefficient matrix A and the vector b can then be divided by rows into l blocks, i.e. A = [A_0^T, A_1^T, ..., A_{l-1}^T]^T and b = [b_0^T, b_1^T, ..., b_{l-1}^T]^T. The data blocks A_0 ~ A_{l-1} and b_0 ~ b_{l-1} are assigned to processes P_0 ~ P_{l-1} respectively, while the vector x is shared by all processes, as shown in Fig. 2. Each process therefore defines x with size n but computes only n/l of its elements, and the processes communicate after each computation to obtain the complete vector x. The number of rows handled by each process is recorded in the array H[N]:
For convenience of communication, the array Hpos[l+1] is defined to record the starting position of the data computed by each process:
Hpos[0] = 0;
for (i = 1; i < l + 1; i++)
    Hpos[i] = Hpos[i - 1] + H[i - 1];
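The static row partition just described can be sketched as a runnable C function. This even split is illustrative (the function name is not from the patent); the patent additionally balances the nonzero counts per block, which the sketch does not model:

```c
/* n matrix rows are split as evenly as possible over l processes.
   H[i] is the row count of process i, and Hpos[i] is the index of the
   first row handled by process i, so Hpos[l] == n. */
void partition_rows(int n, int l, int *H, int *Hpos) {
    for (int i = 0; i < l; i++)
        H[i] = n / l + (i < n % l ? 1 : 0);   /* spread the remainder */
    Hpos[0] = 0;
    for (int i = 1; i < l + 1; i++)
        Hpos[i] = Hpos[i - 1] + H[i - 1];
}
```

For example, n = 10 rows over l = 4 processes yields H = {3, 3, 2, 2} and Hpos = {0, 3, 6, 8, 10}.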
The MPI multi-process design is specifically as follows:
The MPI multi-process design flow is shown in Fig. 3. Each MPI process (hereinafter "process") controls the computation and data transfer of one KNL core group. Suppose the cluster contains l KNL core groups; first, l processes P_0, P_1, ..., P_{l-1} are started, with process P_0 as the master process. Each process opens a private storage space according to its task; process P_0 is responsible for reading the data and broadcasting it on demand to the other processes; processes P_0 ~ P_{l-1} each perform their respective computing tasks and feed the results back to master process P_0; the master process processes and integrates the fed-back results and broadcasts the necessary results to the other processes; the previous two steps are repeated until the computation is complete. The concrete parallel implementation mainly uses collective communication to complete the message passing between processes; the MPI message-passing library functions called include MPI_Reduce, MPI_Allreduce, MPI_Bcast and MPI_Allgatherv. The pseudocode of the MPI design framework is as follows:
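The framework pseudocode itself did not survive extraction. The following is a hedged C+MPI sketch assembled from the collective calls named above; it is an illustrative skeleton under stated assumptions, not the patent's actual framework, and it requires an MPI installation (launched with mpirun) rather than running stand-alone:

```c
/* Illustrative sketch only: function and variable names are assumptions.
   H[i] / Hpos[i] are the per-process row counts and offsets from the
   task division above. */
#include <mpi.h>
#include <math.h>

void cg_mpi(int n, double *x, int *H, int *Hpos, double tol, int maxit) {
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* master process P0 reads A, b, x0 from file; broadcast the data
       every process needs (here: the full initial vector x) */
    MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double rr = 0.0;
    for (int it = 0; it < maxit; it++) {
        /* ... local SpMV and vector updates on this process's row
           block, rows Hpos[rank] .. Hpos[rank] + H[rank] - 1 ... */

        double rr_loc = 0.0;  /* local partial inner product */
        /* global inner product: sum the partial results on all ranks */
        MPI_Allreduce(&rr_loc, &rr, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        /* each process updated only its slice of x; gather the slices
           in place so every process again holds the full vector */
        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DOUBLE,
                       x, H, Hpos, MPI_DOUBLE, MPI_COMM_WORLD);

        if (sqrt(rr) <= tol) break;  /* accuracy requirement met */
    }
}
```

The in-place MPI_Allgatherv with recvcounts H and displacements Hpos matches the Hpos bookkeeping introduced in the task-division step.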
Based on the above technical solution, the KNL cluster accelerated solving method provided by the embodiment of the present invention ports the conjugate gradient algorithm onto a KNL cluster platform: MPI realizes the distribution of tasks and the message passing between nodes, and the KNL chips realize the parallel acceleration of large-scale matrix-vector computation. This improves the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, reduces energy consumption, lowers the cost of machine-room management and operation/maintenance, and, since the acceleration method is simple and easy to implement, reduces development cost.
Based on the above embodiment, the computation within each compute node can be further accelerated by using KNL's numerous cores, realizing intra-node parallelism with the shared-memory multithreaded programming model (OpenMP). The OpenMP layer is intended to make full use of the numerous computing cores of the KNL processor, each of which can open 4 threads, by placing the computation-intensive code segments on the KNL many-core processor for parallel processing, thereby accelerating the solution of the linear system. In this embodiment, the KNL kernels performing the program main body computation may include:
each KNL core group opening 4*N_knl_core OpenMP threads to perform the program main body computation.
Specifically, multithreading is opened according to the number of computing cores of the KNL chip to realize parallel acceleration within the device; supposing each core group contains N_knl_core cores, at most 4*N_knl_core threads can be opened.
In this embodiment, the matrix-vector operations in the algorithm (large-scale sparse matrix-vector multiplication, vector inner products, scalar-vector multiplication and vector addition) serve as the multithreaded parallel regions. The algorithm therefore completes the above four kinds of matrix-vector operations by calling subfunctions, and the kernel acceleration is designed by means of the "#pragma omp" directive. Each core of a KNL chip supports 4 hardware threads, so at most 4*N_knl_core OpenMP threads can be opened. The OpenMP design frameworks of the four kernel subfunctions are as follows:
1> the kernel function for matrix-vector multiplication
2> the kernel function for scalar-vector multiplication
3> the kernel function for vector inner products
4> the kernel function for vector addition
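The bodies of the four kernel subfunctions are not reproduced in this text. The following is a hedged sketch of what such OpenMP kernels typically look like in C, assuming a CSR sparse-matrix layout (row_ptr/col_idx/val); the function names and the absence of scheduling clauses are illustrative choices, not taken from the patent:

```c
/* 1> matrix-vector multiplication y = A*x, with A in CSR form */
void spmv(int n, const int *row_ptr, const int *col_idx,
          const double *val, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}

/* 2> scalar-vector multiplication y = a*x */
void scal(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) y[i] = a * x[i];
}

/* 3> vector inner product, accumulated with an OpenMP reduction */
double dotp(int n, const double *x, const double *y) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* 4> vector addition in the axpy form CG uses: z = x + a*y */
void axpy(int n, const double *x, double a, const double *y, double *z) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) z[i] = x[i] + a * y[i];
}
```

Without an OpenMP-enabled compiler the pragmas are ignored and the kernels run serially with identical results, which makes the sketch easy to check on any hardware.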
Based on the above embodiments, in order to further improve the speed of the main body computation in each KNL kernel, in this embodiment the KNL kernels performing the program main body computation may include:
allocating the memory-bandwidth-limited data or arrays in the program main body on MCDRAM high-bandwidth memory.
Specifically, the MCDRAM high-bandwidth memory in each node is fully exploited by allocating the data of memory-access-intensive code segments on the high-bandwidth memory. For example, when the memory occupied by the program main body is less than 16 GB, all of its data are allocated on the high-bandwidth memory; when the memory occupied by the program main body exceeds 16 GB, the data segments with the highest access frequency in that section of the program are allocated on the high-bandwidth memory.
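The patent does not say how the MCDRAM allocation is realized in code; one common route on KNL is the memkind library's hbw_malloc/hbw_free. The sketch below is an assumption along those lines, with a plain-malloc fallback so it also builds on ordinary hardware without memkind:

```c
/* Assumption: MCDRAM placement via memkind's hbw_malloc; compile with
   -DUSE_MEMKIND -lmemkind on a KNL system, or without the flag to fall
   back to ordinary DDR memory. */
#include <stdlib.h>

#ifdef USE_MEMKIND
#include <hbwmalloc.h>
#define HB_MALLOC(sz) hbw_malloc(sz)
#define HB_FREE(p)    hbw_free(p)
#else
#define HB_MALLOC(sz) malloc(sz)   /* fallback: ordinary DDR memory */
#define HB_FREE(p)    free(p)
#endif

/* The frequently read/written CG vectors (x, r, p, A*p) are the
   bandwidth-limited arrays the text refers to; name is illustrative. */
double *alloc_vector(size_t n) {
    return (double *)HB_MALLOC(n * sizeof(double));
}
```

When the working set exceeds the 16 GB of MCDRAM, only the hottest arrays would be routed through HB_MALLOC and the rest through plain malloc, matching the two cases described above.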
The computing resources of the KNL cluster are fully utilized to improve computing performance and reduce energy consumption, thereby lowering machine-room management and operation costs. In the above embodiments, the hardware platform construction scheme is first designed in combination with the characteristics of the application; then the MPI parallel version of the conjugate gradient algorithm is implemented; then, on the basis of the MPI version, the code segments dominated by the four operations of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product are taken as the KNL kernels. Each KNL core group can launch at most 4*N_knl_core OpenMP threads, completing the parallel computation on the node's KNL chip.
Based on the above technical solution, the KNL cluster acceleration solving method provided by the embodiment of the present invention mainly employs the MPI+OpenMP hybrid programming model. MPI is responsible for dividing data and tasks between devices and for message passing between devices, while the shared-memory OpenMP multi-thread programming model is mainly responsible for the parallel acceleration of the kernels in the algorithm. In this method, large-scale sparse matrix-vector multiplication, scalar-vector multiplication, vector subtraction, and vector inner product serve as the main components of the kernel functions and are processed in parallel by multiple threads. This improves hardware resource utilization, shortens the time for solving large-scale symmetric positive definite systems of linear equations, reduces energy consumption, and lowers machine-room management and operation costs; moreover, the acceleration method is simple to implement, reducing development cost.
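The division of labor between the two levels can be illustrated with a hedged sketch (all names are illustrative): OpenMP accelerates the kernel inside each process, while the global inner product needed by conjugate gradient additionally requires a sum over the partial results of all MPI processes, which a real run would perform with MPI_Allreduce and MPI_SUM. The reduction is simulated serially below so the logic is self-contained:

```c
#include <stddef.h>

/* OpenMP level: local piece of the global inner product, computed over
 * the row block owned by one MPI process. */
double local_dot(long n_local, const double *x, const double *y) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n_local; i++) s += x[i] * y[i];
    return s;
}

/* MPI level: stand-in for MPI_Allreduce(&s, &g, 1, MPI_DOUBLE, MPI_SUM,
 * MPI_COMM_WORLD) -- sums the partial value each rank would hold. */
double allreduce_sum(const double *partial, int n_ranks) {
    double g = 0.0;
    for (int r = 0; r < n_ranks; r++) g += partial[r];
    return g;
}
```

In a real hybrid run, each rank would call local_dot on its own row block and the single MPI_Allreduce would both sum and broadcast the result, so every rank can proceed with the same scalar.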
The KNL cluster acceleration solving apparatus provided by the embodiment of the present invention is introduced below; the KNL cluster acceleration solving apparatus described below and the KNL cluster acceleration solving method described above may be referred to correspondingly.
Referring to Fig. 4, Fig. 4 is a structural block diagram of the KNL cluster acceleration solving apparatus provided by the embodiment of the present invention; the apparatus may include:
a read module 100 for reading the coefficient matrix and constant term of the symmetric positive definite system of linear equations, and setting the initial solution and the solving precision requirement;
an approximate solution solving module 200 for controlling each KNL core with MPI to perform the program main-body computation and construct an approximate solution; wherein the program main body is the operation code segment of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product integrated on the KNL cores;
a solving precision judging module 300 for judging whether the approximate solution meets the solving precision requirement;
a result output module 400 for outputting the approximate solution that meets the solving precision requirement.
Specifically, the construction of the hardware platform environment of the KNL cluster acceleration solving apparatus includes the balanced design of the configuration between compute nodes, the balanced design of the configuration within a node, the choice of the inter-node interconnection network, the configuration of computing devices within a node, and the allocation mode of KNL memory within a node.
For the balanced configuration design, the memory size and memory type of each compute node (including DDR physical memory and MCDRAM high-bandwidth memory) should be as identical as possible, and the number of central processing units should be identical. To reduce the communication latency between nodes, high-performance interconnects such as InfiniBand, 10-Gigabit Ethernet, or Intel OPA are used between nodes, and the network switches build the cluster in a full-exchange manner. The memory and operating-mode configuration of each KNL node in the KNL cluster is consistent, and each KNL node has the same KNL processor chip; the KNL nodes are connected by a high-speed network, and the network switches in the KNL cluster adopt the full-exchange mode.
A high-performance hardware cluster platform is built with a balanced intra-node configuration to improve the data exchange capacity and resource utilization. Assuming a KNL cluster is adopted, the memory configuration of each KNL node is consistent, using DDR memory and MCDRAM high-bandwidth storage of the same type and size, and each KNL node is set to the same mode. This avoids the disparity in processing capability caused by differences in memory read/write speed, which would make the processing capability of the whole node low. Meanwhile, the KNL processor chips adopted in the compute nodes are identical, ensuring that the number of cores and the base frequency of each processor chip are the same. In addition, inter-process communication places high demands on the inter-node interconnection network; therefore, the compute nodes are interconnected with 10-Gigabit Ethernet or InfiniBand high-speed networks to avoid information blocking caused by inconsistent bandwidth, and the network switches adopt the full-exchange mode.
Each node is internally configured with two identical processors to ensure that the core count and base frequency of the processors are identical; each node is configured with memory of identical type and size (not less than 128GB), ensuring that the KNL chip core count and base frequency within the nodes are consistent and that the MCDRAM integrated on the chips is consistent. The nodes are interconnected with a high-speed network (the network may be chosen from InfiniBand high-speed interconnect, Ethernet, and Intel OPA network connections), and the switches adopt the fully connected mode.
Based on the above embodiment, the approximate solution solving module 200 may include:
a task division unit for dividing the solution task of the symmetric positive definite system of linear equations;
a task allocation unit for starting a corresponding number of processes according to the division number of the solution task, and setting a private storage space for each process;
a data allocation unit for the MPI host process to read predetermined data and send the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
an approximate solution solving unit for the MPI host process to receive the results computed by all processes according to the predetermined data, and to process all the results to obtain the approximate solution.
Based on the above embodiment, the task division unit specifically divides the coefficient matrix of the symmetric positive definite system of linear equations by rows into N_p blocks in a static division manner; wherein N_p = N_node * N_grp; N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores in each compute node are divided.
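A minimal sketch of this static by-row division follows. The formula N_p = N_node * N_grp is from the embodiment; the handling of the rows left over when the row count is not evenly divisible by N_p is an assumption:

```c
/* Block p (0 <= p < N_p) of a static by-row partition into
 * N_p = n_node * n_grp blocks owns rows [first, first + count). */
typedef struct { long first; long count; } RowBlock;

RowBlock row_block(long n_rows, int n_node, int n_grp, int p) {
    int n_p = n_node * n_grp;        /* total number of blocks N_p */
    long base = n_rows / n_p;
    long rem  = n_rows % n_p;        /* leftover rows, assumed here to be
                                        assigned to the leading blocks */
    RowBlock b;
    b.count = base + (p < rem ? 1 : 0);
    b.first = p * base + (p < rem ? p : rem);
    return b;
}
```

Static division means each block's row range is fixed before the iteration starts, so no load-balancing communication is needed during the solve.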
Based on the above embodiment, the approximate solution solving unit includes:
an approximate solution solving subunit for the KNL core group to launch 4*N_knl_core OpenMP threads to perform the program main-body computation.
Based on any of the above embodiments, the approximate solution solving module 200 may include:
a memory-bandwidth-limited array allocation unit for allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program main body into the MCDRAM high-bandwidth memory.
Specifically, the KNL platform of this embodiment has a core count much larger than that of an ordinary CPU, each core supports 4 hardware threads, and its computing capability is very powerful. Meanwhile, the KNL platform is configured with 16GB of on-chip MCDRAM storage, whose bandwidth far exceeds that of ordinary DDR memory. In addition, MPI programming has good portability and complete communication functions, and KNL programs are binary compatible and consistent with the CPU platform, making development simple and efficient. Meanwhile, the conjugate gradient method is simple, converges quickly, is highly stable, and is easy to parallelize. Therefore, a KNL cluster acceleration solving apparatus based on the conjugate gradient method is designed, which mainly employs the MPI+OpenMP hybrid programming model. MPI is responsible for dividing data and tasks between devices and for message passing between devices, while the shared-memory OpenMP multi-thread programming model is mainly responsible for the parallel acceleration of the kernels in the algorithm. In the apparatus, operations such as setting the initial solution, the pre-processing before iteration, constructing the new approximate solution, and judging whether the approximate solution meets the precision requirement are completed by a single thread, while large-scale sparse matrix-vector multiplication, scalar-vector multiplication, vector subtraction, and vector inner product serve as the main components of the kernel functions and are processed in parallel by multiple threads.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the identical or similar parts among the embodiments may be referred to one another. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts may be referred to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described generally in terms of function in the above description. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
A KNL cluster acceleration solving method and apparatus provided by the present invention have been described in detail above. Specific examples have been used herein to illustrate the principle and embodiments of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications may also be made to the present invention without departing from the principle of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (10)

1. A KNL cluster acceleration solving method, characterized by comprising:
reading the coefficient matrix and constant term of a symmetric positive definite system of linear equations, and setting an initial solution and a solving precision requirement;
controlling each KNL core with MPI to perform a program main-body computation and construct an approximate solution; wherein the program main body is the operation code segment of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product integrated on the KNL cores;
judging whether the approximate solution meets the solving precision requirement;
if so, outputting the approximate solution that meets the solving precision requirement.
2. The KNL cluster acceleration solving method according to claim 1, characterized in that controlling each KNL core with MPI to perform the program main-body computation and construct the approximate solution comprises:
dividing the solution task of the symmetric positive definite system of linear equations;
starting a corresponding number of processes according to the division number of the solution task, and setting a private storage space for each process;
the MPI host process reading predetermined data and sending the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
the MPI host process receiving the results computed by all processes according to the predetermined data, and processing all the results to obtain the approximate solution.
3. The KNL cluster acceleration solving method according to claim 2, characterized in that dividing the solution task of the symmetric positive definite system of linear equations comprises:
dividing the coefficient matrix of the symmetric positive definite system of linear equations by rows into N_p blocks in a static division manner; wherein N_p = N_node * N_grp; N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores in each compute node are divided.
4. The KNL cluster acceleration solving method according to claim 3, characterized in that the KNL cores performing the program main-body computation comprises:
the KNL core group launching 4*N_knl_core OpenMP threads to perform the program main-body computation.
5. The KNL cluster acceleration solving method according to claim 4, characterized in that the KNL cores performing the program main-body computation comprises:
allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program main body into the MCDRAM high-bandwidth memory.
6. A KNL cluster acceleration solving apparatus, characterized by comprising:
a read module for reading the coefficient matrix and constant term of a symmetric positive definite system of linear equations, and setting an initial solution and a solving precision requirement;
an approximate solution solving module for controlling each KNL core with MPI to perform a program main-body computation and construct an approximate solution; wherein the program main body is the operation code segment of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product integrated on the KNL cores;
a solving precision judging module for judging whether the approximate solution meets the solving precision requirement;
a result output module for outputting the approximate solution that meets the solving precision requirement.
7. The KNL cluster acceleration solving apparatus according to claim 6, characterized in that the approximate solution solving module comprises:
a task division unit for dividing the solution task of the symmetric positive definite system of linear equations;
a task allocation unit for starting a corresponding number of processes according to the division number of the solution task, and setting a private storage space for each process;
a data allocation unit for the MPI host process to read predetermined data and send the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
an approximate solution solving unit for the MPI host process to receive the results computed by all processes according to the predetermined data, and to process all the results to obtain the approximate solution.
8. The KNL cluster acceleration solving apparatus according to claim 7, characterized in that the task division unit specifically divides the coefficient matrix of the symmetric positive definite system of linear equations by rows into N_p blocks in a static division manner; wherein N_p = N_node * N_grp; N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores in each compute node are divided.
9. The KNL cluster acceleration solving apparatus according to claim 8, characterized in that the approximate solution solving unit comprises:
an approximate solution solving subunit for the KNL core group to launch 4*N_knl_core OpenMP threads to perform the program main-body computation.
10. The KNL cluster acceleration solving apparatus according to claim 9, characterized in that the approximate solution solving module comprises:
a memory-bandwidth-limited array allocation unit for allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program main body into the MCDRAM high-bandwidth memory.
CN201611208888.3A 2016-12-23 2016-12-23 KNL cluster acceleration solving method and apparatus Pending CN106598913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611208888.3A CN106598913A (en) 2016-12-23 2016-12-23 KNL cluster acceleration solving method and apparatus


Publications (1)

Publication Number Publication Date
CN106598913A true CN106598913A (en) 2017-04-26

Family

ID=58601476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611208888.3A Pending CN106598913A (en) 2016-12-23 2016-12-23 KNL cluster acceleration solving method and apparatus

Country Status (1)

Country Link
CN (1) CN106598913A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609393A (en) * 2012-02-08 2012-07-25 浪潮(北京)电子信息产业有限公司 Method for processing data of systems of linear equations and device
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN105260342A (en) * 2015-09-22 2016-01-20 浪潮(北京)电子信息产业有限公司 Solving method and system for symmetric positive definite linear equation set


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AVINASH SODANI ET AL.: "KNIGHTS LANDING: SECOND-GENERATION INTEL XEON PHI PRODUCT", 《IEEE MICRO》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562691A (en) * 2017-08-14 2018-01-09 中国科学院力学研究所 A kind of micro thrust dynamic testing method based on least square method
CN107562691B (en) * 2017-08-14 2020-03-17 中国科学院力学研究所 Micro thrust dynamic test method based on least square method
CN108986063A (en) * 2018-07-25 2018-12-11 浪潮(北京)电子信息产业有限公司 The method, apparatus and computer readable storage medium of gradient fusion
CN115099031A (en) * 2022-06-21 2022-09-23 南京维拓科技股份有限公司 Method for realizing parallel computing of simulation solver under Windows environment
CN115099031B (en) * 2022-06-21 2024-04-26 南京维拓科技股份有限公司 Parallel computing method for realizing simulation solver in Windows environment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170426)