CN106598913A - KNL cluster acceleration solving method and apparatus - Google Patents

KNL cluster acceleration solving method and apparatus

Info

Publication number
CN106598913A
CN106598913A (application CN201611208888.3A)
Authority
CN
China
Prior art keywords
knl
solving
approximate solution
solution
mpi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611208888.3A
Other languages
Chinese (zh)
Inventor
王明清 (Wang Mingqing)
张清 (Zhang Qing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201611208888.3A priority Critical patent/CN106598913A/en
Publication of CN106598913A publication Critical patent/CN106598913A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/11: Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a KNL (Knights Landing) cluster accelerated solving method and apparatus. The method comprises the steps of: reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving accuracy requirement; controlling each KNL kernel via MPI (Message Passing Interface) to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products; judging whether the approximate solution meets the solving accuracy requirement; and if so, outputting the approximate solution. By porting the conjugate gradient algorithm onto a KNL cluster platform, the method increases the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, reduces energy consumption, and lowers the management and operation/maintenance costs of the machine room; moreover, the acceleration method is simple and easy to implement, which reduces development cost. The KNL cluster accelerated solving apparatus disclosed by the invention has the abovementioned beneficial effects.

Description

KNL cluster accelerated solving method and apparatus
Technical field
The present invention relates to the field of computer technology, and in particular to a KNL cluster accelerated solving method and apparatus.
Background technology
Solving mathematical-physical models is indispensable work in numerous fields of engineering production and scientific research. With the development of computers, a series of numerical methods such as finite differences, finite elements, boundary elements and mesh-free methods have emerged one after another. These numerical methods have one thing in common: through a specific discretization, the mathematical-physical model derived from a practical problem is reduced to a system of linear algebraic equations. The linear systems obtained by finite element discretization are often symmetric positive definite, or become symmetric positive definite after simple processing. However, as the problem scale grows, solving the linear system becomes a major bottleneck in engineering production and scientific research. Therefore, how to shorten the time needed to solve large-scale symmetric positive definite linear systems while reducing energy consumption and lowering the cost of machine-room management and operation/maintenance is a technical problem that those skilled in the art need to address.
Summary of the invention
It is an object of the present invention to provide a KNL cluster accelerated solving method and apparatus that port the conjugate gradient algorithm onto a KNL cluster platform, improving the utilization of hardware resources, thereby shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption and lowering development cost.
To solve the above technical problem, the present invention provides a KNL cluster accelerated solving method, including:
reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving accuracy requirement;
controlling each KNL kernel via MPI to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products;
judging whether the approximate solution meets the solving accuracy requirement;
and if so, outputting the approximate solution that meets the solving accuracy requirement.
Optionally, controlling each KNL kernel via MPI to perform the program main body computation and construct the approximate solution includes:
dividing the solving task of the symmetric positive definite linear system;
starting a corresponding number of processes according to the number of divisions of the solving task, and arranging a private storage space for each process;
reading predetermined data by the MPI master process and sending the predetermined data to all processes, wherein the predetermined data comprises the coefficient matrix, the constant term and the initial solution;
receiving, by the MPI master process, the results computed by all processes from the predetermined data, and processing all the results to obtain the approximate solution.
Optionally, dividing the solving task of the symmetric positive definite linear system includes:
dividing, in a static division mode, the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
Optionally, the KNL kernels performing the program main body computation includes:
each KNL core group opening 4*N_knl_core OpenMP threads to perform the program main body computation.
Optionally, the KNL kernels performing the program main body computation includes:
allocating the memory-bandwidth-limited data or arrays in the program main body on MCDRAM high-bandwidth memory.
The present invention also provides a KNL cluster accelerated solving apparatus, including:
a reading module for reading the coefficient matrix and constant term of the symmetric positive definite linear system, and setting the initial solution and the solving accuracy requirement;
an approximate solution solving module for controlling each KNL kernel via MPI to perform the program main body computation and construct the approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products;
a solving accuracy judging module for judging whether the approximate solution meets the solving accuracy requirement;
a result output module for outputting the approximate solution that meets the solving accuracy requirement.
Optionally, the approximate solution solving module includes:
a task division unit for dividing the solving task of the symmetric positive definite linear system;
a task allocation unit for starting a corresponding number of processes according to the number of divisions of the solving task, and arranging a private storage space for each process;
a data allocation unit for reading predetermined data by the MPI master process and sending the predetermined data to all processes, wherein the predetermined data comprises the coefficient matrix, the constant term and the initial solution;
an approximate solution solving unit for receiving, by the MPI master process, the results computed by all processes from the predetermined data, and processing all the results to obtain the approximate solution.
Optionally, the task division unit is specifically configured to divide, in a static division mode, the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
Optionally, the approximate solution solving unit includes:
an approximate solution solving subunit for opening, by each KNL core group, 4*N_knl_core OpenMP threads to perform the program main body computation.
Optionally, the approximate solution solving module includes:
a bandwidth-limited array allocation unit for allocating the memory-bandwidth-limited data or arrays in the program main body on MCDRAM high-bandwidth memory.
The KNL cluster accelerated solving method provided by the present invention includes: reading the coefficient matrix and constant term of a symmetric positive definite linear system, and setting an initial solution and a solving accuracy requirement; controlling each KNL kernel via MPI to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products; judging whether the approximate solution meets the solving accuracy requirement; and if so, outputting the approximate solution that meets the solving accuracy requirement.
It can be seen that the method ports the conjugate gradient algorithm onto a KNL cluster platform: MPI realizes the distribution of tasks and the message passing between nodes, and the KNL chips realize the parallel acceleration of large-scale matrix-vector computation. This improves the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, reduces energy consumption, lowers the cost of machine-room management and operation/maintenance, and, since the acceleration method is simple and easy to implement, reduces development cost. The invention also discloses a KNL cluster accelerated solving apparatus with the abovementioned beneficial effects, which will not be described again here.
Description of the drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flow chart of the KNL cluster accelerated solving method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the task division provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the MPI design flow provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of the KNL cluster accelerated solving apparatus provided by an embodiment of the present invention.
Specific embodiment
The core of the present invention is to provide a KNL cluster accelerated solving method and apparatus that port the conjugate gradient algorithm onto a KNL cluster platform, improving the utilization of hardware resources, thereby shortening the time needed to solve large-scale symmetric positive definite linear systems, reducing energy consumption and lowering development cost.
To make the purpose, technical solution and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
At present, the conjugate gradient method (CG) is one of the most popular classes of methods for solving large sparse symmetric linear systems. It is popular because CG needs only first-derivative information, converges faster than the steepest descent method, requires less computation than Newton iteration, and needs no tunable parameters. CG therefore has the advantages of low storage, fast convergence, strong stability, no external parameters, and suitability for parallelization.
CG was first proposed by Hestenes and Stiefel in the early 1950s; decades of related research have developed it enormously, and the relevant theory and methods are now very mature. For a sparse linear system Ax = b, the conjugate gradient algorithm flow currently adopted is as follows:
KNL (Knights Landing) is the second generation of Intel's Xeon Phi many-core processors, designed for high-performance parallel computing. A KNL chip can serve as a stand-alone central processor; it uses an improved, customized version of the Silvermont architecture and a new 14 nm process, with 64-72 cores, each of which can open up to 4 threads, for at most 288 threads per chip; its double-precision floating-point performance exceeds 3 TFlops and its single-precision performance exceeds 6 TFlops. The OPA (Omni-Path Architecture) fabric is a brand-new interconnect technology designed for optimizing high-performance computing, and an end-to-end interconnect solution that lets a wide range of users enjoy the performance advantages of HPC clusters.
MPI (Message Passing Interface) is a message passing interface released in May 1994 and jointly maintained by numerous parallel computer vendors, software development organizations and parallel application units. It is currently one of the world's most popular parallel programming environments, and in particular the programming paradigm for scalable parallel computers with distributed storage, workstation networks and clusters. MPI programs are mainly written in Fortran+MPI or C+MPI, and the interface has hundreds of function call interfaces that can be invoked directly. MPI has many advantages: portability and ease of use; complete asynchronous communication functionality; and a formal, detailed specification. MPI implementations exist on PCs, on MS Windows, and on all major UNIX/Linux workstations and mainstream parallel machines; in distributed storage environments, programs built on the higher-level, abstract message-passing layer of the MPI standard gain obvious benefits. Therefore, this embodiment combines CG with MPI to realize the accelerated solution of symmetric positive definite linear systems on a KNL cluster. Refer to Fig. 1, a flow chart of the KNL cluster accelerated solving method provided by an embodiment of the present invention; the method may include:
S100: read the coefficient matrix and constant term of the symmetric positive definite linear system, and set the initial solution and the solving accuracy requirement;
S110: control each KNL kernel via MPI to perform the program main body computation and construct an approximate solution, wherein the program main body is the operation code segments, integrated in the KNL kernels, for large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products;
S120: judge whether the approximate solution meets the solving accuracy requirement;
if so, perform S130: output the approximate solution that meets the solving accuracy requirement; if not, return to S110.
The conjugate gradient algorithm can be divided into three parts: preprocessing, the iterative part, and result output. The iterative part, i.e. step S110 in Fig. 1 which computes the approximate solution, is the main body of the CG algorithm and accounts for more than 98% of the total operations. The remainder consists mainly of data preparation such as I/O operations and initial-value setting, plus the judgment of whether the accuracy requirement is met; it is unsuitable for parallel processing and is therefore computed on the CPU. Apart from a small amount of branching, a single iteration mainly comprises matrix-vector operations such as large-scale sparse matrix-vector multiplication, scalar-vector multiplication, vector norms and vector addition, which are very well suited to parallel acceleration on a KNL cluster. Step S110 therefore integrates these four kinds of matrix-vector operation code segments into the KNL kernels for parallel processing.
That is, in this embodiment, S100 to S120 form the flow by which the conjugate gradient algorithm solves the symmetric positive definite linear system. This conjugate gradient algorithm differs from the prior art in that the approximate-solution procedure runs in parallel at high speed on the KNL cluster via MPI: an MPI-parallel version of the conjugate gradient algorithm is implemented, and on top of the MPI version the code segments whose main body is the four operations (large-scale sparse matrix-vector multiplication, vector addition, vector inner products and scalar-vector products) serve as the KNL kernels.
The computation of step S110 is controlled by MPI. Specifically, the parallelism between compute nodes (i.e. KNL kernels) is realized through the distributed-storage message-passing programming model (MPI): the MPI layer divides the data and tasks across the different compute node devices to complete the computing task, and completes the message passing and data interaction between processes. Optionally, controlling each KNL kernel via MPI to perform the program main body computation and construct the approximate solution may include:
dividing the solving task of the symmetric positive definite linear system;
starting a corresponding number of processes according to the number of divisions of the solving task, and arranging a private storage space for each process;
reading predetermined data by the MPI master process and sending the predetermined data to all processes, wherein the predetermined data comprises the coefficient matrix, the constant term and the initial solution;
receiving, by the MPI master process, the results computed by all processes from the predetermined data, and processing all the results to obtain the approximate solution.
Specifically, the above procedure divides the task, sets up the processes and distributes tasks to them; each process computes its allocated task in its own private storage space and feeds the result back to the MPI master process (which may also be called the main thread), and the master process processes the received results to obtain the approximate solution of the current computation.
To make the KNL cluster compute faster, the configuration of the compute nodes needs to be balanced, so that in each iteration no single slow compute node delays the whole procedure. Task division and distribution therefore need to be balanced, ideally so that every compute node achieves the same computational efficiency. For example, if the first compute node has computing capability 2 and the second has computing capability 1, two tasks can be distributed to the first node and one task to the second when tasks are distributed, to keep the tasks of the compute nodes balanced. A better practice, however, is to make the hardware of all compute nodes identical, so that tasks can simply be distributed evenly and more quickly: the memory size and memory type of each compute node (including DDR physical memory and MCDRAM high-bandwidth memory) and its number of central processors should be as identical as possible. To reduce inter-node communication latency, the nodes are interconnected by a high-performance network such as InfiniBand, 10-Gigabit Ethernet or Intel OPA, and the cluster is built with a network switch in full-crossbar mode.
Therefore, with each compute node having the same hardware, dividing the solving task of the symmetric positive definite linear system may include:
dividing, in a static division mode, the coefficient matrix of the symmetric positive definite linear system by rows into N_p blocks, where N_p = N_node * N_grp, N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores of each compute node are divided.
Specifically, processes are opened according to the number of compute nodes in the cluster and the number of KNL core groups per node. Suppose the cluster has N_node compute nodes and the processing cores of each compute node can be divided into N_grp groups; then N_p = N_node * N_grp MPI processes can be opened, each responsible for controlling the computation of one KNL core group and the message passing between core groups. The data are divided into N_p data blocks, and each process is responsible for the computing task of one data block.
The above procedure is illustrated taking Ax = b as an example; refer to Fig. 2. The coefficient matrix A and the right-hand-side vector b of the linear system are divided by rows; the number of nonzero elements in A is size and the number of rows is n. Every process needs the complete vector x, so each process defines x with size n but computes only n/N_p of its elements; after each computation the processes need to communicate to obtain the complete x data. The master process builds an out-of-order index and is responsible for the collection and broadcasting of data. Each process reads its required data from a binary file; each process then statically divides the tasks according to its own process number. Each process starts multiple threads: the main thread is responsible for communication with other processes and also takes on its share of the computing task, while the other threads are responsible for computation and other tasks; the local process terminates when the computation completes.
The coefficient matrix A of the linear system is divided by rows into N_p blocks, which are given to N_p processes to compute respectively, keeping the number of nonzero elements in each process as equal as possible to guarantee load balance between processes. The master process is responsible for the collection and broadcasting of data within the process group; each process owns independent data and completes its own computing task, and the information between processes is exchanged through the message passing interface.
The above procedure is described concretely below in terms of the MPI task division and the MPI multi-process design. The MPI task division is specifically as follows:
the tasks between nodes are divided row-wise in a static division mode. Suppose the linear system to solve is Ax = b, the sparse matrix A contains size nonzero elements and has n rows, and l processes P_0, P_1, ..., P_{l-1} are opened; the coefficient matrix A and the vector b can then be divided by rows into l blocks, i.e. A = [A_0^T, A_1^T, ..., A_{l-1}^T]^T and b = [b_0^T, b_1^T, ..., b_{l-1}^T]^T. The data blocks A_0 ~ A_{l-1} and b_0 ~ b_{l-1} are assigned to processes P_0 ~ P_{l-1} respectively, while the vector x is shared by all processes, as shown in Fig. 2. Each process therefore defines x with size n but computes only n/l of its elements, and the processes communicate after each computation to obtain the complete vector x. The number of rows handled by each process is recorded in the array H[N]:
For convenience of communication, the array Hpos[l+1] is defined to record the starting position of the data computed by each process:
Hpos[0] = 0;
for (i = 1; i < l + 1; i++)
    Hpos[i] = Hpos[i - 1] + H[i - 1];
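The static row partition just described can be sketched as a runnable C function. This even split is illustrative (the function name is not from the patent); the patent additionally balances the nonzero counts per block, which the sketch does not model:

```c
/* n matrix rows are split as evenly as possible over l processes.
   H[i] is the row count of process i, and Hpos[i] is the index of the
   first row handled by process i, so Hpos[l] == n. */
void partition_rows(int n, int l, int *H, int *Hpos) {
    for (int i = 0; i < l; i++)
        H[i] = n / l + (i < n % l ? 1 : 0);   /* spread the remainder */
    Hpos[0] = 0;
    for (int i = 1; i < l + 1; i++)
        Hpos[i] = Hpos[i - 1] + H[i - 1];
}
```

For example, n = 10 rows over l = 4 processes yields H = {3, 3, 2, 2} and Hpos = {0, 3, 6, 8, 10}.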
The MPI multi-process design is specifically as follows:
The MPI multi-process design flow is shown in Fig. 3. Each MPI process (hereinafter "process") controls the computation and data transfer of one KNL core group. Suppose the cluster contains l KNL core groups; first, l processes P_0, P_1, ..., P_{l-1} are started, with process P_0 as the master process. Each process opens a private storage space according to its task; process P_0 is responsible for reading the data and broadcasting it on demand to the other processes; processes P_0 ~ P_{l-1} each perform their respective computing tasks and feed the results back to master process P_0; the master process processes and integrates the fed-back results and broadcasts the necessary results to the other processes; the previous two steps are repeated until the computation is complete. The concrete parallel implementation mainly uses collective communication to complete the message passing between processes; the MPI message-passing library functions called include MPI_Reduce, MPI_Allreduce, MPI_Bcast and MPI_Allgatherv. The pseudocode of the MPI design framework is as follows:
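The framework pseudocode itself did not survive extraction. The following is a hedged C+MPI sketch assembled from the collective calls named above; it is an illustrative skeleton under stated assumptions, not the patent's actual framework, and it requires an MPI installation (launched with mpirun) rather than running stand-alone:

```c
/* Illustrative sketch only: function and variable names are assumptions.
   H[i] / Hpos[i] are the per-process row counts and offsets from the
   task division above. */
#include <mpi.h>
#include <math.h>

void cg_mpi(int n, double *x, int *H, int *Hpos, double tol, int maxit) {
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* master process P0 reads A, b, x0 from file; broadcast the data
       every process needs (here: the full initial vector x) */
    MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double rr = 0.0;
    for (int it = 0; it < maxit; it++) {
        /* ... local SpMV and vector updates on this process's row
           block, rows Hpos[rank] .. Hpos[rank] + H[rank] - 1 ... */

        double rr_loc = 0.0;  /* local partial inner product */
        /* global inner product: sum the partial results on all ranks */
        MPI_Allreduce(&rr_loc, &rr, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        /* each process updated only its slice of x; gather the slices
           in place so every process again holds the full vector */
        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DOUBLE,
                       x, H, Hpos, MPI_DOUBLE, MPI_COMM_WORLD);

        if (sqrt(rr) <= tol) break;  /* accuracy requirement met */
    }
}
```

The in-place MPI_Allgatherv with recvcounts H and displacements Hpos matches the Hpos bookkeeping introduced in the task-division step.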
Based on the above technical solution, the KNL cluster accelerated solving method provided by the embodiment of the present invention ports the conjugate gradient algorithm onto a KNL cluster platform: MPI realizes the distribution of tasks and the message passing between nodes, and the KNL chips realize the parallel acceleration of large-scale matrix-vector computation. This improves the utilization of hardware resources, shortens the time needed to solve large-scale symmetric positive definite linear systems, reduces energy consumption, lowers the cost of machine-room management and operation/maintenance, and, since the acceleration method is simple and easy to implement, reduces development cost.
Based on the above embodiment, the computation within each compute node can be further accelerated by using KNL's numerous cores, realizing intra-node parallelism with the shared-memory multithreaded programming model (OpenMP). The OpenMP layer is intended to make full use of the numerous computing cores of the KNL processor, each of which can open 4 threads, by placing the computation-intensive code segments on the KNL many-core processor for parallel processing, thereby accelerating the solution of the linear system. In this embodiment, the KNL kernels performing the program main body computation may include:
each KNL core group opening 4*N_knl_core OpenMP threads to perform the program main body computation.
Specifically, multithreading is opened according to the number of computing cores of the KNL chip to realize parallel acceleration within the device; supposing each core group contains N_knl_core cores, at most 4*N_knl_core threads can be opened.
In this embodiment, the matrix-vector operations in the algorithm (large-scale sparse matrix-vector multiplication, vector inner products, scalar-vector multiplication and vector addition) serve as the multithreaded parallel regions. The algorithm therefore completes the above four kinds of matrix-vector operations by calling subfunctions, and the kernel acceleration is designed by means of the "#pragma omp" directive. Each core of a KNL chip supports 4 hardware threads, so at most 4*N_knl_core OpenMP threads can be opened. The OpenMP design frameworks of the four kernel subfunctions are as follows:
1> the kernel function for matrix-vector multiplication
2> the kernel function for scalar-vector multiplication
3> the kernel function for vector inner products
4> the kernel function for vector addition
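The bodies of the four kernel subfunctions are not reproduced in this text. The following is a hedged sketch of what such OpenMP kernels typically look like in C, assuming a CSR sparse-matrix layout (row_ptr/col_idx/val); the function names and the absence of scheduling clauses are illustrative choices, not taken from the patent:

```c
/* 1> matrix-vector multiplication y = A*x, with A in CSR form */
void spmv(int n, const int *row_ptr, const int *col_idx,
          const double *val, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}

/* 2> scalar-vector multiplication y = a*x */
void scal(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) y[i] = a * x[i];
}

/* 3> vector inner product, accumulated with an OpenMP reduction */
double dotp(int n, const double *x, const double *y) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* 4> vector addition in the axpy form CG uses: z = x + a*y */
void axpy(int n, const double *x, double a, const double *y, double *z) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) z[i] = x[i] + a * y[i];
}
```

Without an OpenMP-enabled compiler the pragmas are ignored and the kernels run serially with identical results, which makes the sketch easy to check on any hardware.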
Based on the above embodiments, in order to further improve the speed of the main body computation in each KNL kernel, in this embodiment the KNL kernels performing the program main body computation may include:
allocating the memory-bandwidth-limited data or arrays in the program main body on MCDRAM high-bandwidth memory.
Specifically, the MCDRAM high-bandwidth memory in each node is fully exploited by allocating the data of memory-access-intensive code segments on the high-bandwidth memory. For example, when the memory occupied by the program main body is less than 16 GB, all of its data are allocated on the high-bandwidth memory; when the memory occupied by the program main body exceeds 16 GB, the data segments with the highest access frequency in that section of the program are allocated on the high-bandwidth memory.
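The patent does not say how the MCDRAM allocation is realized in code; one common route on KNL is the memkind library's hbw_malloc/hbw_free. The sketch below is an assumption along those lines, with a plain-malloc fallback so it also builds on ordinary hardware without memkind:

```c
/* Assumption: MCDRAM placement via memkind's hbw_malloc; compile with
   -DUSE_MEMKIND -lmemkind on a KNL system, or without the flag to fall
   back to ordinary DDR memory. */
#include <stdlib.h>

#ifdef USE_MEMKIND
#include <hbwmalloc.h>
#define HB_MALLOC(sz) hbw_malloc(sz)
#define HB_FREE(p)    hbw_free(p)
#else
#define HB_MALLOC(sz) malloc(sz)   /* fallback: ordinary DDR memory */
#define HB_FREE(p)    free(p)
#endif

/* The frequently read/written CG vectors (x, r, p, A*p) are the
   bandwidth-limited arrays the text refers to; name is illustrative. */
double *alloc_vector(size_t n) {
    return (double *)HB_MALLOC(n * sizeof(double));
}
```

When the working set exceeds the 16 GB of MCDRAM, only the hottest arrays would be routed through HB_MALLOC and the rest through plain malloc, matching the two cases described above.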
The computing resources of the KNL cluster are fully utilized to improve computing performance and reduce energy consumption, thereby lowering machine-room management and operation costs. In the above embodiments, the hardware platform construction scheme is first designed in combination with the characteristics of the application; then the MPI parallel version of the conjugate gradient algorithm is implemented; then, on the basis of the MPI version, the code segments dominated by the four operations of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product are taken as the KNL kernels. Each KNL core group can launch at most 4*N_knl_core OpenMP threads, completing the parallel computation on the node's KNL chip.
Based on the above technical solution, the KNL cluster acceleration solving method provided by the embodiment of the present invention mainly employs the MPI+OpenMP hybrid programming model. MPI is responsible for dividing data and tasks between devices and for message passing between devices, while the shared-memory OpenMP multi-thread programming model is mainly responsible for the parallel acceleration of the kernels in the algorithm. In this method, large-scale sparse matrix-vector multiplication, scalar-vector multiplication, vector subtraction, and vector inner product serve as the main components of the kernel functions and are processed in parallel by multiple threads. This improves hardware resource utilization, shortens the time for solving large-scale symmetric positive definite systems of linear equations, reduces energy consumption, and lowers machine-room management and operation costs; moreover, the acceleration method is simple to implement, reducing development cost.
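The division of labor between the two levels can be illustrated with a hedged sketch (all names are illustrative): OpenMP accelerates the kernel inside each process, while the global inner product needed by conjugate gradient additionally requires a sum over the partial results of all MPI processes, which a real run would perform with MPI_Allreduce and MPI_SUM. The reduction is simulated serially below so the logic is self-contained:

```c
#include <stddef.h>

/* OpenMP level: local piece of the global inner product, computed over
 * the row block owned by one MPI process. */
double local_dot(long n_local, const double *x, const double *y) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n_local; i++) s += x[i] * y[i];
    return s;
}

/* MPI level: stand-in for MPI_Allreduce(&s, &g, 1, MPI_DOUBLE, MPI_SUM,
 * MPI_COMM_WORLD) -- sums the partial value each rank would hold. */
double allreduce_sum(const double *partial, int n_ranks) {
    double g = 0.0;
    for (int r = 0; r < n_ranks; r++) g += partial[r];
    return g;
}
```

In a real hybrid run, each rank would call local_dot on its own row block and the single MPI_Allreduce would both sum and broadcast the result, so every rank can proceed with the same scalar.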
The KNL cluster acceleration solving apparatus provided by the embodiment of the present invention is introduced below; the KNL cluster acceleration solving apparatus described below and the KNL cluster acceleration solving method described above may be referred to correspondingly.
Referring to Fig. 4, Fig. 4 is a structural block diagram of the KNL cluster acceleration solving apparatus provided by the embodiment of the present invention; the apparatus may include:
a read module 100 for reading the coefficient matrix and constant term of the symmetric positive definite system of linear equations, and setting the initial solution and the solving precision requirement;
an approximate solution solving module 200 for controlling each KNL core with MPI to perform the program main-body computation and construct an approximate solution; wherein the program main body is the operation code segment of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product integrated on the KNL cores;
a solving precision judging module 300 for judging whether the approximate solution meets the solving precision requirement;
a result output module 400 for outputting the approximate solution that meets the solving precision requirement.
Specifically, the construction of the hardware platform environment of the KNL cluster acceleration solving apparatus includes the balanced design of the configuration between compute nodes, the balanced design of the configuration within a node, the choice of the inter-node interconnection network, the configuration of computing devices within a node, and the allocation mode of KNL memory within a node.
For the balanced configuration design, the memory size and memory type of each compute node (including DDR physical memory and MCDRAM high-bandwidth memory) should be as identical as possible, and the number of central processing units should be identical. To reduce the communication latency between nodes, high-performance interconnects such as InfiniBand, 10-Gigabit Ethernet, or Intel OPA are used between nodes, and the network switches build the cluster in a full-exchange manner. The memory and operating-mode configuration of each KNL node in the KNL cluster is consistent, and each KNL node has the same KNL processor chip; the KNL nodes are connected by a high-speed network, and the network switches in the KNL cluster adopt the full-exchange mode.
A high-performance hardware cluster platform is built with a balanced intra-node configuration to improve the data exchange capacity and resource utilization. Assuming a KNL cluster is adopted, the memory configuration of each KNL node is consistent, using DDR memory and MCDRAM high-bandwidth storage of the same type and size, and each KNL node is set to the same mode. This avoids the disparity in processing capability caused by differences in memory read/write speed, which would make the processing capability of the whole node low. Meanwhile, the KNL processor chips adopted in the compute nodes are identical, ensuring that the number of cores and the base frequency of each processor chip are the same. In addition, inter-process communication places high demands on the inter-node interconnection network; therefore, the compute nodes are interconnected with 10-Gigabit Ethernet or InfiniBand high-speed networks to avoid information blocking caused by inconsistent bandwidth, and the network switches adopt the full-exchange mode.
Each node is internally configured with two identical processors to ensure that the core count and base frequency of the processors are identical; each node is configured with memory of identical type and size (not less than 128GB), ensuring that the KNL chip core count and base frequency within the nodes are consistent and that the MCDRAM integrated on the chips is consistent. The nodes are interconnected with a high-speed network (the network may be chosen from InfiniBand high-speed interconnect, Ethernet, and Intel OPA network connections), and the switches adopt the fully connected mode.
Based on the above embodiment, the approximate solution solving module 200 may include:
a task division unit for dividing the solution task of the symmetric positive definite system of linear equations;
a task allocation unit for starting a corresponding number of processes according to the division number of the solution task, and setting a private storage space for each process;
a data allocation unit for the MPI host process to read predetermined data and send the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
an approximate solution solving unit for the MPI host process to receive the results computed by all processes according to the predetermined data, and to process all the results to obtain the approximate solution.
Based on the above embodiment, the task division unit specifically divides the coefficient matrix of the symmetric positive definite system of linear equations by rows into N_p blocks in a static division manner; wherein N_p = N_node * N_grp; N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores in each compute node are divided.
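A minimal sketch of this static by-row division follows. The formula N_p = N_node * N_grp is from the embodiment; the handling of the rows left over when the row count is not evenly divisible by N_p is an assumption:

```c
/* Block p (0 <= p < N_p) of a static by-row partition into
 * N_p = n_node * n_grp blocks owns rows [first, first + count). */
typedef struct { long first; long count; } RowBlock;

RowBlock row_block(long n_rows, int n_node, int n_grp, int p) {
    int n_p = n_node * n_grp;        /* total number of blocks N_p */
    long base = n_rows / n_p;
    long rem  = n_rows % n_p;        /* leftover rows, assumed here to be
                                        assigned to the leading blocks */
    RowBlock b;
    b.count = base + (p < rem ? 1 : 0);
    b.first = p * base + (p < rem ? p : rem);
    return b;
}
```

Static division means each block's row range is fixed before the iteration starts, so no load-balancing communication is needed during the solve.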
Based on the above embodiment, the approximate solution solving unit includes:
an approximate solution solving subunit for the KNL core group to launch 4*N_knl_core OpenMP threads to perform the program main-body computation.
Based on any of the above embodiments, the approximate solution solving module 200 may include:
a memory-bandwidth-limited array allocation unit for allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program main body into the MCDRAM high-bandwidth memory.
Specifically, the KNL platform of this embodiment has a core count much larger than that of an ordinary CPU, each core supports 4 hardware threads, and its computing capability is very powerful. Meanwhile, the KNL platform is configured with 16GB of on-chip MCDRAM storage, whose bandwidth far exceeds that of ordinary DDR memory. In addition, MPI programming has good portability and complete communication functions, and KNL programs are binary compatible and consistent with the CPU platform, making development simple and efficient. Meanwhile, the conjugate gradient method is simple, converges quickly, is highly stable, and is easy to parallelize. Therefore, a KNL cluster acceleration solving apparatus based on the conjugate gradient method is designed, which mainly employs the MPI+OpenMP hybrid programming model. MPI is responsible for dividing data and tasks between devices and for message passing between devices, while the shared-memory OpenMP multi-thread programming model is mainly responsible for the parallel acceleration of the kernels in the algorithm. In the apparatus, operations such as setting the initial solution, the pre-processing before iteration, constructing the new approximate solution, and judging whether the approximate solution meets the precision requirement are completed by a single thread, while large-scale sparse matrix-vector multiplication, scalar-vector multiplication, vector subtraction, and vector inner product serve as the main components of the kernel functions and are processed in parallel by multiple threads.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the identical or similar parts among the embodiments may be referred to one another. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts may be referred to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described generally in terms of function in the above description. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
A KNL cluster acceleration solving method and apparatus provided by the present invention have been described in detail above. Specific examples have been used herein to illustrate the principle and embodiments of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications may also be made to the present invention without departing from the principle of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (10)

1. A KNL cluster acceleration solving method, characterized by comprising:
reading the coefficient matrix and constant term of a symmetric positive definite system of linear equations, and setting an initial solution and a solving precision requirement;
controlling each KNL core with MPI to perform a program main-body computation and construct an approximate solution; wherein the program main body is the operation code segment of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product integrated on the KNL cores;
judging whether the approximate solution meets the solving precision requirement;
if so, outputting the approximate solution that meets the solving precision requirement.
2. The KNL cluster acceleration solving method according to claim 1, characterized in that controlling each KNL core with MPI to perform the program main-body computation and construct the approximate solution comprises:
dividing the solution task of the symmetric positive definite system of linear equations;
starting a corresponding number of processes according to the division number of the solution task, and setting a private storage space for each process;
the MPI host process reading predetermined data and sending the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
the MPI host process receiving the results computed by all processes according to the predetermined data, and processing all the results to obtain the approximate solution.
3. The KNL cluster acceleration solving method according to claim 2, characterized in that dividing the solution task of the symmetric positive definite system of linear equations comprises:
dividing the coefficient matrix of the symmetric positive definite system of linear equations by rows into N_p blocks in a static division manner; wherein N_p = N_node * N_grp; N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores in each compute node are divided.
4. The KNL cluster acceleration solving method according to claim 3, characterized in that the KNL cores performing the program main-body computation comprises:
the KNL core group launching 4*N_knl_core OpenMP threads to perform the program main-body computation.
5. The KNL cluster acceleration solving method according to claim 4, characterized in that the KNL cores performing the program main-body computation comprises:
allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program main body into the MCDRAM high-bandwidth memory.
6. A KNL cluster acceleration solving apparatus, characterized by comprising:
a read module for reading the coefficient matrix and constant term of a symmetric positive definite system of linear equations, and setting an initial solution and a solving precision requirement;
an approximate solution solving module for controlling each KNL core with MPI to perform a program main-body computation and construct an approximate solution; wherein the program main body is the operation code segment of large-scale sparse matrix-vector multiplication, vector addition, vector inner product, and scalar-vector product integrated on the KNL cores;
a solving precision judging module for judging whether the approximate solution meets the solving precision requirement;
a result output module for outputting the approximate solution that meets the solving precision requirement.
7. The KNL cluster acceleration solving apparatus according to claim 6, characterized in that the approximate solution solving module comprises:
a task division unit for dividing the solution task of the symmetric positive definite system of linear equations;
a task allocation unit for starting a corresponding number of processes according to the division number of the solution task, and setting a private storage space for each process;
a data allocation unit for the MPI host process to read predetermined data and send the predetermined data to all processes; wherein the predetermined data includes the coefficient matrix, the constant term, and the initial solution;
an approximate solution solving unit for the MPI host process to receive the results computed by all processes according to the predetermined data, and to process all the results to obtain the approximate solution.
8. The KNL cluster acceleration solving apparatus according to claim 7, characterized in that the task division unit specifically divides the coefficient matrix of the symmetric positive definite system of linear equations by rows into N_p blocks in a static division manner; wherein N_p = N_node * N_grp; N_node is the number of compute nodes in the KNL cluster, and N_grp is the number of groups into which the processing cores in each compute node are divided.
9. The KNL cluster acceleration solving apparatus according to claim 8, characterized in that the approximate solution solving unit comprises:
an approximate solution solving subunit for the KNL core group to launch 4*N_knl_core OpenMP threads to perform the program main-body computation.
10. The KNL cluster acceleration solving apparatus according to claim 9, characterized in that the approximate solution solving module comprises:
a memory-bandwidth-limited array allocation unit for allocating the data or arrays whose memory reads and writes are bandwidth-limited in the program main body into the MCDRAM high-bandwidth memory.
CN201611208888.3A 2016-12-23 2016-12-23 KNL cluster acceleration solving method and apparatus Pending CN106598913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611208888.3A CN106598913A (en) 2016-12-23 2016-12-23 KNL cluster acceleration solving method and apparatus


Publications (1)

Publication Number Publication Date
CN106598913A true CN106598913A (en) 2017-04-26

Family

ID=58601476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611208888.3A Pending CN106598913A (en) 2016-12-23 2016-12-23 KNL cluster acceleration solving method and apparatus

Country Status (1)

Country Link
CN (1) CN106598913A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609393A (en) * 2012-02-08 2012-07-25 浪潮(北京)电子信息产业有限公司 Method for processing data of systems of linear equations and device
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN105260342A (en) * 2015-09-22 2016-01-20 浪潮(北京)电子信息产业有限公司 Solving method and system for symmetric positive definite linear equation set


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AVINASH SODANI ET AL.: "KNIGHTS LANDING: SECOND-GENERATION INTEL XEON PHI PRODUCT", 《IEEE MICRO》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562691A (en) * 2017-08-14 2018-01-09 中国科学院力学研究所 A kind of micro thrust dynamic testing method based on least square method
CN107562691B (en) * 2017-08-14 2020-03-17 中国科学院力学研究所 Micro thrust dynamic test method based on least square method
CN108986063A (en) * 2018-07-25 2018-12-11 浪潮(北京)电子信息产业有限公司 The method, apparatus and computer readable storage medium of gradient fusion
CN115099031A (en) * 2022-06-21 2022-09-23 南京维拓科技股份有限公司 Method for realizing parallel computing of simulation solver under Windows environment
CN115099031B (en) * 2022-06-21 2024-04-26 南京维拓科技股份有限公司 Parallel computing method for realizing simulation solver in Windows environment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170426)