CN109739678A

CN109739678A - Method for reducing redundant reads based on inter-register communication

Info

Publication number: CN109739678A
Application number: CN201910022567.1A
Authority: CN
Inventors: 甘霖; 徐敬蘅; 付昊桓; 杨晋喆; 王紫薇; 杨广文
Original assignee: National Supercomputing Wuxi Center
Current assignee: National Supercomputing Wuxi Center
Priority date: 2019-01-10
Filing date: 2019-01-10
Publication date: 2019-05-10

Abstract

The invention provides a method for reducing redundant reads based on inter-register communication, belonging to the field of computer technology. The method comprises: m slave cores along one y direction of a slave-core cluster each read data, along the y direction, from the set of data points to be computed that is stored in DMA; if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, slave core n does not read from DMA the y-direction boundary data already held by the adjacent slave core; slave core n and the adjacent slave core obtain the y-direction boundary of slave core n through register communication; slave core n reads its boundary data from the adjacent slave core; where m is the number of slave cores along the y direction of the slave-core cluster. The invention reduces the amount of data read directly from DMA, mitigates redundant reads during computation, avoids wasted data, and improves the utilization of DMA bandwidth.

Description

Method for reducing redundant reads based on inter-register communication

Technical field

The present invention relates to the field of computer technology, and in particular to a method for reducing redundant reads based on inter-register communication.

Background art

Since the 1970s, with the rise of supercomputers, scientific computing has become one of the mainstream scientific paradigms, and its importance is no less than that of the theoretical and experimental branches of a discipline. The use of this computational paradigm has greatly advanced various scientific fields, such as atmospheric simulation and earthquake simulation. In these fields, unaligned memory access operations, which consist of one or more contiguous memory access patterns that cannot always be aligned, are a primary memory access behavior, for example in stencil-based applications (e.g., atmospheric simulation and seismic modeling) and in collaborative accesses. In programs based on convolutional neural networks (such as deep learning applications) such accesses make up a large proportion and usually have a large impact on performance.

Over the past ten years, as the demand of HPC applications (such as numerical simulation and deep learning) for computing capability has kept growing, supercomputers built from traditional general-purpose CPUs can no longer meet the performance requirements. To meet this growing demand, heterogeneous systems or chips have become one of the most popular solutions for large-scale scientific computing. For example, China's Sunway TaihuLight became in 2016 the first supercomputer in the world to exceed 100 PFLOPS, and is a representative of this architecture. Since its deployment, Sunway TaihuLight has supported the largest atmospheric and earthquake simulations in the world, greatly helping with disaster prevention and climate change simulation. According to China's announced roadmap for building next-generation (exascale) supercomputers, this architecture is the most promising choice. To achieve better performance and power efficiency, the computing chip of Sunway TaihuLight (the SW26010 processor) discards the cache structure entirely, saving space for more computing units.

When computing, each slave core must load lattice data from DMA into its on-chip memory. Taking the standard 2D stencil computation model as an example, each data point is computed from a total of 12 neighboring data points above, below, left and right. Assuming that each slave core computes an 8*8 lattice, each slave core must read a 12*12 block of data from DMA for its computation; with 8 slave cores in total, a data volume of 8*12*12 must be read. A large amount of this data overlaps between slave cores, which causes a large amount of wasted data transfer.
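The arithmetic of this background example can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical and the figures are the 8*8 and 12*12 sizes quoted above.

```python
# Illustrative arithmetic for the prior-art example; names are hypothetical.
TILE = 8    # each slave core computes an 8*8 lattice
READ = 12   # each slave core reads a 12*12 block from DMA
CORES = 8   # number of slave cores in the example

computed_points = CORES * TILE * TILE      # points that are actually computed
dma_points = CORES * READ * READ           # points transferred from DMA
boundary_points = dma_points - computed_points  # extra boundary points read

print(dma_points, computed_points, boundary_points)
```

The extra boundary points are where the overlap between neighboring slave cores arises, since a boundary row of one core is a computed row of its neighbor.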

Summary of the invention

The present invention provides a method for reducing redundant reads based on inter-register communication, to solve the redundant-read problem in the prior art.

To achieve the above object, the technical scheme adopted by the invention is as follows:

The method for reducing redundant reads based on inter-register communication provided by the present invention comprises:

m slave cores along one y direction of a slave-core cluster each read data, along the y direction, from the set of data points to be computed that is stored in DMA;

if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, slave core n does not read from DMA the y-direction boundary data already held by the adjacent slave core;

slave core n and the adjacent slave core obtain the y-direction boundary of slave core n through register communication;

slave core n reads its boundary data from the adjacent slave core;

where m is the number of slave cores along the y direction of the slave-core cluster.

Preferably, in the method for reducing redundant reads based on inter-register communication provided by the present invention, the lattice data to be computed by each slave core is of size NX*NY;

the step "m slave cores along one y direction of a slave-core cluster each read data, along the y direction, from the set of data points to be computed that is stored in DMA" is specifically "m slave cores along one y direction of the slave-core cluster each read NY rows of data, along the y direction, from the set of data points to be computed that is stored in DMA".

Preferably, in the method provided by the present invention, the slave cores located at the two ends of the slave-core cluster along one y direction read directly from DMA the boundary data that is not included in the data read by their adjacent slave cores.

Preferably, in the method provided by the present invention, the x-direction boundary data of each slave core is read directly from DMA.
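The rule for where each slave core obtains its y-direction boundary data can be summarized as a small plan. The sketch below is an illustration only: the function name `y_read_plan`, the dictionary keys, and the convention that core 0 is the head and core m-1 is the tail are assumptions made here, not part of the disclosure.

```python
def y_read_plan(m):
    """For m slave cores along y, list the source of each core's upper and
    lower y-direction boundary: the string "DMA" or a neighbor's index."""
    plan = []
    for n in range(m):
        # The head core has no upper neighbor, so it reads that boundary from DMA.
        upper = "DMA" if n == 0 else n - 1
        # The tail core has no lower neighbor, so it reads that boundary from DMA.
        lower = "DMA" if n == m - 1 else n + 1
        plan.append({"core": n, "upper_from": upper, "lower_from": lower})
    return plan


for entry in y_read_plan(8):
    print(entry)
```

In every case the NY rows to be computed and the x-direction boundaries are still read from DMA; only the y-direction boundaries of interior cores come from a neighbor.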

The above technical solution has the following advantages or beneficial effects:

With the method for reducing redundant reads based on inter-register communication provided by the present invention, the portions of data that overlap between the computations of the slave cores are read directly through inter-register communication. This reduces the amount of data read directly from DMA, mitigates redundant reads during computation, avoids wasted data, and improves the utilization of DMA bandwidth.

Brief description of the drawings

The present invention and its features, form and advantages will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings. The same reference numerals denote the same parts throughout the drawings. The drawings are not deliberately drawn to scale; the emphasis is on showing the gist of the invention.

Fig. 1 is a schematic flowchart of the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of the data each slave core requires for its computation in the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.

Embodiment 1:

As shown in Figs. 1 and 2, the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the present invention comprises:

S101: m slave cores along one y direction of a slave-core cluster each read data, along the y direction, from the set of data points to be computed that is stored in DMA;

S102: if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, slave core n does not read from DMA the y-direction boundary data already held by the adjacent slave core;

S103: slave core n and the adjacent slave core obtain the y-direction boundary of slave core n through register communication;

S104: slave core n reads its boundary data from the adjacent slave core.

As shown in Fig. 1, the set of data points to be computed that is stored in DMA does not include the boundary regions at the two ends of its x direction or the boundary regions at the two ends of its y direction; the lattice data to be computed by each slave core is of size NX*NY.

The slave-core cluster in Embodiment 1 is an 8*8 slave-core cluster, i.e., m=8. As shown in Fig. 2, the data read by each slave core comprises the NX*NY lattice to be computed, an X1*NY left boundary lattice on the left side of the lattice to be computed, an X2*NY right boundary lattice on its right side, an NX*Y1 upper boundary lattice above it, and an NX*Y2 lower boundary lattice below it. In the prior art, each slave core separately reads from DMA the data required for its computation, which is a data volume of (NX+X1+X2)*(NY+Y1+Y2), so the 8 slave cores along the y direction need to read a data volume of (NX+X1+X2)*(8NY+8Y1+8Y2). There is considerable data redundancy between the slave cores, which causes a large amount of wasted data transfer.
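For reference, the prior-art read volume stated above can be written out as follows. The symbols V_core and V_prior are labels introduced here for illustration, and the numeric instance assumes NX=NY=8 and X1=X2=Y1=Y2=2, values chosen only so that each block is 12*12 as in the background example.

```latex
\begin{aligned}
V_{\text{core}}  &= (NX + X1 + X2)\,(NY + Y1 + Y2) \\
V_{\text{prior}} &= 8\,V_{\text{core}} = (NX + X1 + X2)\,(8NY + 8Y1 + 8Y2) \\
\text{e.g. } V_{\text{core}} &= 12 \cdot 12 = 144, \qquad V_{\text{prior}} = 8 \cdot 144 = 1152
\end{aligned}
```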

With the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the present invention, redundant reads are reduced through steps S101~S102. Specifically, in this embodiment, for every intermediate slave core other than the two at the head and tail of the y direction, the y-direction boundary regions (i.e., the upper boundary lattice and the lower boundary lattice in Fig. 2) are contained in the data read by its adjacent slave cores. An intermediate slave core therefore only needs to read from DMA the NY rows of the lattice it computes, and does not need to read the y-direction boundary data. Each of the two slave cores at the head and tail has one boundary region in the y direction that is not included in the data read by an adjacent slave core; while reading the NY rows of its lattice in the y direction, it must also read that region directly from DMA: the slave core at the head reads its upper boundary lattice from DMA, and the slave core at the tail reads its lower boundary lattice from DMA. Any given slave core communicates directly with its adjacent slave core to obtain its y-direction boundary, and reads the corresponding boundary data from that adjacent slave core. The x-direction boundary regions of every slave core (i.e., the left boundary lattice and the right boundary lattice in Fig. 2) are read directly from DMA. The data volume an intermediate slave core needs to read is thus (NX+X1+X2)*NY, the data volume for the head slave core is (NX+X1+X2)*(NY+Y1), and the data volume for the tail slave core is (NX+X1+X2)*(NY+Y2). The 8 slave cores along the y direction read a total data volume of (NX+X1+X2)*(8NY+Y1+Y2), which is (NX+X1+X2)*(7Y1+7Y2) less than the (NX+X1+X2)*(8NY+8Y1+8Y2) data volume of the prior art.
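The scheme of this embodiment can be checked with a small host-side simulation. The sketch below is not the disclosed implementation: the function `simulate`, the synthetic `grid`, and the use of plain list slicing in place of DMA transfers and register communication are all assumptions made for illustration. It also assumes Y1 <= NY and Y2 <= NY, so that a neighbor's NY rows fully contain the boundary rows needed.

```python
def simulate(NX, NY, X1, X2, Y1, Y2, m=8):
    """Simulate m slave cores along y.
    Returns (grid, tiles, dma_reads): the synthetic data, the block each core
    ends up holding, and the number of elements read from DMA."""
    assert Y1 <= NY and Y2 <= NY, "neighbor rows must cover the boundary"
    width = NX + X1 + X2           # row width including x-direction boundaries
    height = m * NY + Y1 + Y2      # all rows needed by this column of cores
    # Synthetic data: each value encodes its (row, column) position.
    grid = [[r * width + c for c in range(width)] for r in range(height)]

    dma_reads = 0
    own = []
    top = [None] * m
    bottom = [None] * m
    # S101: every core reads its own NY rows (with x boundaries) from DMA.
    for n in range(m):
        start = Y1 + n * NY
        own.append(grid[start:start + NY])
        dma_reads += NY * width
    # Head and tail cores read the one boundary no neighbor holds from DMA.
    top[0] = grid[0:Y1]
    bottom[m - 1] = grid[height - Y2:height]
    dma_reads += (Y1 + Y2) * width
    # S102 to S104: interior boundaries are copied from neighbors' rows,
    # standing in for register communication.
    for n in range(m):
        if n > 0:
            top[n] = own[n - 1][NY - Y1:]   # last Y1 rows of the upper neighbor
        if n < m - 1:
            bottom[n] = own[n + 1][:Y2]     # first Y2 rows of the lower neighbor
    tiles = [top[n] + own[n] + bottom[n] for n in range(m)]
    return grid, tiles, dma_reads


grid, tiles, dma_reads = simulate(NX=8, NY=8, X1=2, X2=2, Y1=2, Y2=2)
prior_art_reads = 8 * (8 + 2 + 2) * (8 + 2 + 2)
print(dma_reads, prior_art_reads)
```

With these assumed sizes, each core ends up holding the same (NY+Y1+Y2)-row block it would have read in the prior art, while the count of elements read from DMA matches (NX+X1+X2)*(8NY+Y1+Y2).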

Those skilled in the art will understand that variations can be realized by combining the prior art with the above embodiment; they are not described here. Such variations do not affect the substance of the present invention.

The preferred embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments described above; devices and structures not described in detail herein should be understood as being implemented in a manner common in the art. Any person skilled in the art may, without departing from the technical solution of the present invention, make many possible changes and modifications to it, or modify it into equivalent embodiments; this does not affect the substance of the invention. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A method for reducing redundant reads based on inter-register communication, characterized by comprising:

m slave cores along one y direction of a slave-core cluster each read data, along the y direction, from the set of data points to be computed that is stored in DMA;

if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, slave core n does not read from DMA the y-direction boundary data already held by the adjacent slave core;

slave core n reads its boundary data from the adjacent slave core.

2. The method for reducing redundant reads based on inter-register communication according to claim 1, characterized in that the lattice data to be computed by each slave core is of size NX*NY;

the step "m slave cores along one y direction of a slave-core cluster each read data, along the y direction, from the set of data points to be computed that is stored in DMA" is specifically "m slave cores along one y direction of the slave-core cluster each read NY rows of data, along the y direction, from the set of data points to be computed that is stored in DMA".

3. The method for reducing redundant reads based on inter-register communication according to claim 2, characterized in that the slave cores located at the two ends of the slave-core cluster along one y direction read directly from DMA the boundary data that is not included in the data read by their adjacent slave cores.

4. The method for reducing redundant reads based on inter-register communication according to claim 3, characterized in that the x-direction boundary data of each slave core is read directly from DMA.