CN109739678A - Method for reducing redundant reads based on inter-register communication - Google Patents


Info

Publication number
CN109739678A
Authority
CN
China
Prior art keywords
core
data
read
dma
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910022567.1A
Other languages
Chinese (zh)
Inventor
甘霖
徐敬蘅
付昊桓
杨晋喆
王紫薇
杨广文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Supercomputing Wuxi Center
Original Assignee
National Supercomputing Wuxi Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Supercomputing Wuxi Center
Priority to CN201910022567.1A
Publication of CN109739678A
Legal status: Pending

Landscapes

  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention provides a method for reducing redundant reads based on inter-register communication, which belongs to the field of computer technology. The method comprises: m slave cores along one y direction of a slave-core cluster each read data along the y direction from the data point set to be calculated that is stored in the DMA; if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, the y-direction boundary data covered by the adjacent slave core is not read from the DMA; slave core n and the adjacent slave core obtain the y-direction boundary through register communication; and slave core n reads its boundary data from the adjacent slave core; where m is the number of slave cores in the y direction of the slave-core cluster. The invention reduces the volume of data read directly from the DMA, reduces redundant reads during data calculation, avoids wasted data transfers, and improves DMA bandwidth utilization.

Description

Method for reducing redundant reads based on inter-register communication
Technical field
The present invention relates to the field of computer technology, and in particular to a method for reducing redundant reads based on inter-register communication.
Background art
Since the 1970s, with the rise of supercomputers, scientific computing has become one of the mainstream scientific paradigms, no less important than the theoretical and experimental branches of a discipline. This computational paradigm has greatly advanced many scientific fields, such as atmospheric simulation and earthquake simulation. In these fields, unaligned memory access operations are main-memory access behaviors that involve one or more contiguous memory access patterns but cannot always be aligned, as in stencil-based applications (for example, atmospheric simulation and seismic modeling) and collaborative access. Such accesses also account for a large share of programs based on convolutional neural networks (such as deep learning applications) and usually have a large impact on performance.
Over the past decade, as HPC applications (such as numerical simulation and deep learning) have demanded ever more computing power, supercomputers built from traditional general-purpose CPUs can no longer meet performance requirements. To meet this growing demand, heterogeneous systems or chips have become one of the most popular solutions for large-scale scientific computing. For example, China's Sunway TaihuLight, which in 2016 became the first supercomputer in the world to exceed 100 PFLOPS, is representative of this architecture. Since its deployment, Sunway TaihuLight has supported the largest atmospheric and earthquake simulations in the world, greatly helping with disaster prevention and climate change simulation. As China follows its announced roadmap to build next-generation (exascale) supercomputers, this architecture is the most promising choice. To achieve better performance and power efficiency, the computing chip of Sunway TaihuLight (the SW26010 processor) abandons the cache structure entirely, saving space for more computing units.
When computing data, each slave core must load lattice data from the DMA into its on-chip memory. Taking the standard 2D stencil computation model as an example, each data point is calculated from a total of 12 surrounding data points. Suppose each slave core computes an 8*8 block of lattice data: each slave core then has to read 12*12 data points from the DMA for the calculation, and 8 slave cores have to read 8*12*12 data points in total. The data read by the slave cores overlaps heavily, which wastes a large amount of data transfer.
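The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch: the boundary width of 2 on each side is an assumption inferred from 12 = 8 + 2*2, and the variable names are hypothetical.

```python
# Illustrative arithmetic for the background example.
# HALO = 2 is an assumption inferred from 12 = 8 + 2 * 2.
TILE = 8    # each slave core computes an 8*8 block
HALO = 2    # assumed boundary width on each side of the block
CORES = 8   # number of slave cores in the example

per_core_read = (TILE + 2 * HALO) ** 2   # 12*12 points read per slave core
total_read = CORES * per_core_read       # 8*12*12 points read in total
total_computed = CORES * TILE * TILE     # points that are actually computed

print(per_core_read, total_read, total_computed)
```

Under these assumptions, more than half of the points read are boundary points, and many of them are fetched by more than one slave core.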
Summary of the invention
The method for reducing redundant reads based on inter-register communication provided by the invention solves the redundant-read problem in the prior art.
To achieve the above object, the invention adopts the following technical solution:
The method for reducing redundant reads based on inter-register communication provided by the invention comprises:
m slave cores along one y direction of a slave-core cluster each read data along the y direction from the data point set to be calculated that is stored in the DMA;
if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, the y-direction boundary data covered by the adjacent slave core is not read from the DMA;
slave core n and the adjacent slave core obtain the y-direction boundary through register communication;
slave core n reads its boundary data from the adjacent slave core;
where m is the number of slave cores in the y direction of the slave-core cluster.
Preferably, in the method provided by the invention, each slave core is to calculate lattice data of size NX*NY;
the step "m slave cores along one y direction of a slave-core cluster each read data along the y direction from the data point set to be calculated that is stored in the DMA" is specifically "m slave cores along one y direction of the slave-core cluster each read NY rows of data along the y direction from the data point set to be calculated that is stored in the DMA".
Preferably, in the method provided by the invention, the slave cores at both ends of one y direction of the slave-core cluster read directly from the DMA the boundary data that is not included in the data read by their adjacent slave cores.
Preferably, in the method provided by the invention, the x-direction boundary data of each slave core is read directly from the DMA.
The above technical solution has the following advantages or beneficial effects:
In the method for reducing redundant reads based on inter-register communication provided by the invention, the portions that overlap between the data needed by different slave cores are read directly through inter-register communication. This reduces the volume of data read directly from the DMA, reduces redundant reads during data calculation, avoids wasted data transfers, and improves DMA bandwidth utilization.
Brief description of the drawings
The invention and its features, shapes, and advantages will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings. The same reference numerals denote the same parts throughout the drawings. The drawings are not deliberately drawn to scale; the emphasis is on showing the gist of the invention.
Fig. 1 is a schematic flowchart of the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the invention;
Fig. 2 is a schematic diagram of the data each slave core needs for its calculation in the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.
Embodiment 1:
As shown in Figs. 1 and 2, the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the invention comprises:
S101: m slave cores along one y direction of a slave-core cluster each read data along the y direction from the data point set to be calculated that is stored in the DMA;
S102: if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, the y-direction boundary data covered by the adjacent slave core is not read from the DMA;
S103: slave core n and the adjacent slave core obtain the y-direction boundary through register communication;
S104: slave core n reads its boundary data from the adjacent slave core;
where m is the number of slave cores in the y direction of the slave-core cluster.
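Steps S101 to S104 can be sketched as a small host-side simulation. This is a hedged illustration, not the patented implementation: the function name `read_with_neighbour_exchange`, the list `grid` (standing in for the data set the source describes as stored in the DMA), and the use of list slicing in place of hardware register communication are all assumptions made for this example. Each row of `grid` is taken to span the full x extent, so the x-direction boundaries arrive with the DMA read.

```python
def read_with_neighbour_exchange(grid, m, ny, y1, y2):
    """Assemble each slave core's rows while counting rows fetched by DMA.

    grid holds y1 upper boundary rows, then m*ny rows to calculate, then
    y2 lower boundary rows. Assumes y1 <= ny and y2 <= ny, so a single
    neighbour covers each y-direction boundary.
    """
    dma_rows = 0
    own = []
    for n in range(m):                      # S101: NY rows per core via DMA
        start = y1 + n * ny
        own.append(grid[start:start + ny])
        dma_rows += ny

    tiles = []
    for n in range(m):
        if n == 0:                          # head core: upper boundary via DMA
            top = grid[:y1]
            dma_rows += y1
        else:                               # S102-S104: rows held by core n-1
            top = own[n - 1][ny - y1:]
        if n == m - 1:                      # tail core: lower boundary via DMA
            bottom = grid[y1 + m * ny:y1 + m * ny + y2]
            dma_rows += y2
        else:                               # S102-S104: rows held by core n+1
            bottom = own[n + 1][:y2]
        tiles.append(top + own[n] + bottom)
    return tiles, dma_rows


# Hypothetical sizes: 8 cores, 4 rows each, 2 boundary rows on either side.
m, ny, y1, y2 = 8, 4, 2, 2
grid = [[row] * 3 for row in range(y1 + m * ny + y2)]
tiles, dma_rows = read_with_neighbour_exchange(grid, m, ny, y1, y2)
print(dma_rows, m * (ny + y1 + y2))  # rows via DMA: with exchange vs. without
```

Every core still ends up with the same NY+Y1+Y2 rows it would have read on its own; only the source of the y-direction boundary rows changes.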
As shown in Figure 1, the data point set to be calculated that is stored in the DMA does not include the boundary regions at both ends of its x direction or the boundary regions at both ends of its y direction; each slave core is to calculate lattice data of size NX*NY.
The slave-core cluster in Embodiment 1 is an 8*8 slave-core cluster, i.e. m=8. As shown in Fig. 2, the data read by each slave core comprises the NX*NY lattice to be calculated, an X1*NY left boundary lattice on its left, an X2*NY right boundary lattice on its right, an NX*Y1 upper boundary lattice above it, and an NX*Y2 lower boundary lattice below it. In the prior art, each slave core reads all the data it needs from the DMA, which amounts to (NX+X1+X2)*(NY+Y1+Y2), so the 8 slave cores in the y direction read (NX+X1+X2)*(8NY+8Y1+8Y2) in total. Much of this data is redundant between slave cores, which wastes a large amount of data transfer.
By contrast, the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 reduces redundant data reads through steps S101~S102. Specifically, in this embodiment, for every slave core in the y direction other than the two at the head and tail, the y-direction boundary regions (the upper and lower boundary lattices in Fig. 2) are contained in the data read by its adjacent slave cores. Such a middle slave core therefore only needs to read from the DMA the NY rows of lattice data it is to calculate, and does not read the y-direction boundary data. Each of the two slave cores at the head and tail has one boundary region in the y direction that is not contained in the data read by an adjacent slave core, so in addition to the NY rows to be calculated it must read that region directly from the DMA: the slave core at the head reads its upper boundary lattice from the DMA, and the slave core at the tail reads its lower boundary lattice from the DMA. Each slave core communicates directly with its adjacent slave cores to obtain the y-direction boundary and reads the corresponding boundary data from them. The x-direction boundary regions of every slave core (the left and right boundary lattices in Fig. 2) are read directly from the DMA. A middle slave core thus reads a data volume of (NX+X1+X2)*NY, the slave core at the head reads (NX+X1+X2)*(NY+Y1), and the slave core at the tail reads (NX+X1+X2)*(NY+Y2). The 8 slave cores in the y direction therefore read (NX+X1+X2)*(8NY+Y1+Y2) in total, far less than the (NX+X1+X2)*(8NY+8Y1+8Y2) of the prior art.
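The two totals can be compared with a short calculation. The sketch below simply encodes the expressions stated in this embodiment; the function names and the sample sizes are hypothetical, and m >= 2 is assumed so that distinct head and tail slave cores exist.

```python
def dma_volume_prior(nx, ny, x1, x2, y1, y2, m):
    """Prior art: every slave core reads its full (NX+X1+X2)*(NY+Y1+Y2) block."""
    return m * (nx + x1 + x2) * (ny + y1 + y2)


def dma_volume_reduced(nx, ny, x1, x2, y1, y2, m):
    """Embodiment 1: only the head and tail cores read a y boundary via DMA."""
    width = nx + x1 + x2             # x boundaries are always read via DMA
    head = width * (ny + y1)         # head core adds its upper boundary
    tail = width * (ny + y2)         # tail core adds its lower boundary
    middle = width * ny              # middle cores read only their NY rows
    return head + (m - 2) * middle + tail


# Hypothetical sizes: 8*8 lattice per core, boundary width 2, m = 8 cores.
prior = dma_volume_prior(8, 8, 2, 2, 2, 2, 8)
reduced = dma_volume_reduced(8, 8, 2, 2, 2, 2, 8)
print(prior, reduced, prior - reduced)
```

The difference equals (NX+X1+X2)*(m-1)*(Y1+Y2), so the saving grows with both the number of slave cores in the y direction and the y-direction boundary width.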
In the method for reducing redundant reads based on inter-register communication provided by the invention, the portions that overlap between the data needed by different slave cores are read directly through inter-register communication. This reduces the volume of data read directly from the DMA, reduces redundant reads during data calculation, avoids wasted data transfers, and improves DMA bandwidth utilization.
Those skilled in the art will understand that variations can be realized by combining the prior art with the above embodiment, and they are not described here. Such variations do not affect the substance of the invention.
The preferred embodiments of the invention have been described above. It should be understood that the invention is not limited to the specific embodiments above; devices and structures not described in detail should be understood as being implemented in the ordinary manner of the art. Anyone skilled in the art may, without departing from the technical solution of the invention, make many possible changes and modifications to it, or modify it into equivalent embodiments, without affecting the substance of the invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the protection scope of the technical solution of the invention.

Claims (4)

1. A method for reducing redundant reads based on inter-register communication, characterized by comprising:
m slave cores along one y direction of a slave-core cluster each read data along the y direction from the data point set to be calculated that is stored in the DMA;
if the data read by a slave core adjacent to slave core n includes the y-direction boundary data of slave core n, the y-direction boundary data covered by the adjacent slave core is not read from the DMA;
slave core n and the adjacent slave core obtain the y-direction boundary through register communication;
slave core n reads its boundary data from the adjacent slave core;
where m is the number of slave cores in the y direction of the slave-core cluster.
2. The method for reducing redundant reads based on inter-register communication according to claim 1, characterized in that each slave core is to calculate lattice data of size NX*NY;
the step "m slave cores along one y direction of a slave-core cluster each read data along the y direction from the data point set to be calculated that is stored in the DMA" is specifically "m slave cores along one y direction of the slave-core cluster each read NY rows of data along the y direction from the data point set to be calculated that is stored in the DMA".
3. The method for reducing redundant reads based on inter-register communication according to claim 2, characterized in that the slave cores at both ends of one y direction of the slave-core cluster read directly from the DMA the boundary data that is not included in the data read by their adjacent slave cores.
4. The method for reducing redundant reads based on inter-register communication according to claim 3, characterized in that the x-direction boundary data of each slave core is read directly from the DMA.
CN201910022567.1A 2019-01-10 2019-01-10 Method for reducing redundant reads based on inter-register communication Pending CN109739678A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910022567.1A CN109739678A (en) 2019-01-10 2019-01-10 Method for reducing redundant reads based on inter-register communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910022567.1A CN109739678A (en) 2019-01-10 2019-01-10 Method for reducing redundant reads based on inter-register communication

Publications (1)

Publication Number Publication Date
CN109739678A true CN109739678A (en) 2019-05-10

Family

ID=66364264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910022567.1A Pending CN109739678A (en) 2019-01-10 2019-01-10 Method for reducing redundant reads based on inter-register communication

Country Status (1)

Country Link
CN (1) CN109739678A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080126601A1 (en) * 2006-09-22 2008-05-29 Sony Computer Entertainment Inc. Methods and apparatus for allocating DMA activity between a plurality of entities
US7536669B1 (en) * 2006-08-30 2009-05-19 Xilinx, Inc. Generic DMA IP core interface for FPGA platform design
CN107168683A (en) * 2017-05-05 2017-09-15 中国科学院软件研究所 High-performance implementation method for GEMM dense matrix multiplication on the domestic Sunway 26010 many-core CPU
CN109002659A (en) * 2018-09-07 2018-12-14 西安交通大学 Fluid machinery simulation program optimization method based on a supercomputer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
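Yao Wenjun et al.: "Porting and optimization of the NAMD software on Sunway TaihuLight", Computer Engineering & Science *
Meng Delong et al.: "Porting and optimization of OpenFOAM on Sunway TaihuLight", Computer Science *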

Similar Documents

Publication Publication Date Title
CN110309837B (en) Data processing method and image processing method based on convolutional neural network characteristic diagram
CN104636273B (en) A kind of sparse matrix storage method on SIMD many-core processors with Multi-Level Cache
CN103336758A (en) Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
CN107918525A (en) The storage device of peer-to-peer communications can be performed and include its data-storage system
CN106021182B (en) A kind of row transposition architecture design method based on Two-dimensional FFT processor
CN102541774B (en) Multi-grain parallel storage system and storage
CN101717817B (en) Method for accelerating RNA secondary structure prediction based on stochastic context-free grammar
CN103617150A (en) GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN102292748A (en) Multi level display control list in tile based 3D computer graphics system
CN110569979A (en) Logical-physical bit remapping method for noisy medium-sized quantum equipment
CN110209353B (en) I/O parallel acceleration method, device and medium for ROMS mode in area coupling forecast system
CN102207904B (en) Device and method for being emulated to reconfigurable processor
CN103413569B (en) One reads and one writes static RAM
CN106991656A (en) A kind of distributed geometric correction system and method for mass remote sensing image
CN111783933A (en) Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation
CN109766208A (en) Based on the non-alignment internal storage access accelerated method communicated between register
CN102438149A (en) Realization method of AVS (Audio Video Standard) inverse transformation based on reconfiguration technology
CN113254391B (en) Neural network accelerator convolution calculation and data loading parallel method and device
CN103226977B (en) Quick NAND FLASH controller based on FPGA and control method thereof
CN109739678A (en) Based on the reduction redundancy read method communicated between register
CN102064835A (en) Decoder suitable for quasi-cyclic LDPC decoding
US10452356B2 (en) Arithmetic processing apparatus and control method for arithmetic processing apparatus
CN109408148A (en) A kind of production domesticization computing platform and its apply accelerated method
CN102646071A (en) Device and method for executing write hit operation of high-speed buffer memory at single period
CN104391676A (en) Instruction fetching method and instruction fetching structure thereof for low-cost high-band-width microprocessor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190510