CN109739678A - Method for reducing redundant reads based on inter-register communication - Google Patents
- Publication number
- CN109739678A (application number CN201910022567.1A)
- Authority
- CN
- China
- Prior art keywords
- core
- data
- read
- dma
- register
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
The invention provides a method for reducing redundant reads based on inter-register communication, belonging to the field of computer technology. The method comprises: m slave cores along one y direction of a slave core cluster each read data from the y direction of the data point set to be calculated, which is stored in the DMA; if the data read by a slave core adjacent to slave core n contains the y-direction boundary data of slave core n, slave core n does not read from DMA the y-direction boundary data already contained in the adjacent slave core; slave core n and the adjacent slave core obtain the y-direction boundary through register communication; and slave core n reads its boundary data from the adjacent slave core; wherein m is the number of slave cores along the y direction of the slave core cluster. The invention reduces the volume of data read directly from DMA, lessens redundant reads during computation, avoids wasted data transfers, and improves DMA bandwidth utilization.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method for reducing redundant reads based on inter-register communication.
Background art
Since the 1970s, with the rise of supercomputers, scientific computing has become one of the mainstream paradigms of science, no less important than the theoretical and experimental branches of a discipline. This computational paradigm has greatly advanced many scientific fields, such as atmospheric simulation and earthquake simulation. In these fields, unaligned memory access operations are main-memory access patterns that involve one or more contiguous memory accesses but cannot always be aligned, as in stencil-based applications (for example, atmospheric simulation and seismic modeling) and collaborative access. Such accesses make up a large share of programs based on convolutional neural networks (such as deep learning applications) and usually have a large impact on performance.
Over the past decade, as the computing power demanded by HPC applications (such as numerical simulation and deep learning) has kept growing, supercomputers built from traditional general-purpose CPUs can no longer meet performance requirements. To satisfy this growing demand, heterogeneous systems or chips have become one of the most popular solutions for large-scale scientific computing. For example, China's Sunway TaihuLight became, in 2016, the first supercomputer in the world to exceed 100 PFLOPS, and it is representative of this architecture. Since its deployment, Sunway TaihuLight has run the largest atmospheric and earthquake simulations in the world, greatly helping with disaster prevention and climate change simulation. According to China's published roadmap for building next-generation (exascale) supercomputers, this architecture is the most promising choice. To achieve better performance and power efficiency, the compute chip of Sunway TaihuLight (the SW26010 processor) abandons the cache structure entirely, saving space for more computing units.
When computing, each slave core must load grid data from DMA into its on-chip memory. Take the standard 2D stencil computation model as an example: each data point is computed from 12 surrounding data points in total (above, below, left, and right). Suppose each slave core computes an 8*8 grid; each slave core must then read 12*12 data points from DMA, and with 8 slave cores in total, 8*12*12 data points must be read. The data read by the individual slave cores overlap heavily, which wastes a large amount of data transfer.
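The read volume in this example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are not from the patent, and the halo width of 2 points per side is an assumption inferred from the 12*12 tile size stated above.

```python
# Illustrative arithmetic for the background example (all names hypothetical).
CORES = 8   # slave cores in the example
TILE = 8    # each slave core computes an 8*8 grid
HALO = 2    # assumed halo width per side, so each core reads a 12*12 tile

read_side = TILE + 2 * HALO
per_core_read = read_side * read_side   # points one core reads from DMA
total_read = CORES * per_core_read      # points read by all cores together
total_useful = CORES * TILE * TILE      # points that are actually computed

print(per_core_read, total_read, total_useful)
```

Under these assumptions, fewer than half of the points fetched are points that the fetching core itself computes; the remainder is boundary data, part of which neighbouring cores fetch as well.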
Summary of the invention
The invention provides a method for reducing redundant reads based on inter-register communication, to solve the redundant-read problem in the prior art.
To achieve the above object, the invention adopts the following technical solution.
The method for reducing redundant reads based on inter-register communication provided by the invention comprises:
m slave cores along one y direction of the slave core cluster each read data from the y direction of the data point set to be calculated, which is stored in the DMA;
if the data read by a slave core adjacent to slave core n contains the y-direction boundary data of slave core n, slave core n does not read from DMA the y-direction boundary data already contained in the adjacent slave core;
slave core n and the adjacent slave core obtain the y-direction boundary through register communication;
slave core n reads its boundary data from the adjacent slave core;
wherein m is the number of slave cores along the y direction of the slave core cluster.
Preferably, in the method provided by the invention, each slave core computes grid data of size NX*NY;
the step "m slave cores along one y direction of the slave core cluster each read data from the y direction of the data point set to be calculated, which is stored in the DMA" is specifically "m slave cores along one y direction of the slave core cluster each read NY rows of data along the y direction from the data point set to be calculated, which is stored in the DMA".
Preferably, in the method provided by the invention, the slave cores at the two ends of one y direction of the slave core cluster read directly from DMA the boundary data that is not contained in the data read by their adjacent slave cores.
Preferably, in the method provided by the invention, the x-direction boundary data of each slave core is read directly from DMA.
The above technical solution has the following advantages or beneficial effects:
In the method provided by the invention, the portion of each slave core's required data that overlaps with the data of another slave core is read directly through inter-register communication. This reduces the volume of data read directly from DMA, lessens redundant reads during computation, avoids wasted data transfers, and improves DMA bandwidth utilization.
Description of the drawings
The invention and its features, form, and advantages will become more apparent from the detailed description of non-limiting embodiments with reference to the following drawings. The same reference numerals denote the same parts throughout the drawings. The drawings are not deliberately drawn to scale; the emphasis is on showing the gist of the invention.
Fig. 1 is a schematic flowchart of the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the invention;
Fig. 2 is a schematic diagram of the data each slave core needs for computation in the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.
Embodiment 1:
As shown in Figs. 1 and 2, the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the invention comprises:
S101: m slave cores along one y direction of the slave core cluster each read data from the y direction of the data point set to be calculated, which is stored in the DMA;
S102: if the data read by a slave core adjacent to slave core n contains the y-direction boundary data of slave core n, slave core n does not read from DMA the y-direction boundary data already contained in the adjacent slave core;
S103: slave core n and the adjacent slave core obtain the y-direction boundary through register communication;
S104: slave core n reads its boundary data from the adjacent slave core;
wherein m is the number of slave cores along the y direction of the slave core cluster.
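Steps S101 to S104 can be sketched with a small host-side model. This is only an illustration under stated assumptions, not the SW26010 programming interface: "DMA" is modeled as slicing a global list of rows, "register communication" is modeled as copying rows out of a neighbour's local buffer, and the names `load_column`, `own`, and `tiles` are hypothetical. The model also assumes the boundary rows of the first and last slave core are stored at the two ends of the global grid.

```python
def load_column(grid, m, NY, Y1, Y2):
    """Model steps S101-S104 for m slave cores stacked along y.

    grid is a list of Y1 + m*NY + Y2 rows; each row already holds the
    x-direction boundary columns. Requires Y1 <= NY and Y2 <= NY.
    Returns (tiles, dma_rows): the Y1+NY+Y2 rows each core ends up with,
    and the number of rows fetched through the modeled DMA.
    """
    assert len(grid) == Y1 + m * NY + Y2 and Y1 <= NY and Y2 <= NY
    dma_rows = 0

    # S101: every core reads only its own NY rows from DMA.
    own = []
    for n in range(m):
        start = Y1 + n * NY
        own.append(grid[start:start + NY])
        dma_rows += NY

    tiles = []
    for n in range(m):
        if n == 0:
            upper = grid[:Y1]               # first core: no upper neighbour, use DMA
            dma_rows += Y1
        else:
            upper = own[n - 1][NY - Y1:]    # S102-S104: copy from the upper neighbour
        if n == m - 1:
            lower = grid[Y1 + m * NY:]      # last core: no lower neighbour, use DMA
            dma_rows += Y2
        else:
            lower = own[n + 1][:Y2]         # S102-S104: copy from the lower neighbour
        tiles.append(upper + own[n] + lower)
    return tiles, dma_rows


# Example values are assumptions chosen for illustration.
m, NY, Y1, Y2, width = 8, 8, 2, 2, 12
grid = [[r * 100 + c for c in range(width)] for r in range(Y1 + m * NY + Y2)]
tiles, dma_rows = load_column(grid, m, NY, Y1, Y2)
print(dma_rows, m * (NY + Y1 + Y2))
```

In this model every core ends up with the same rows it would have fetched on its own, while the number of rows fetched through DMA drops from m*(NY+Y1+Y2) to m*NY+Y1+Y2.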
As shown in Fig. 1, the data point set to be calculated, which is stored in the DMA, does not include the boundary regions at the two ends of its x direction or at the two ends of its y direction; each slave core computes grid data of size NX*NY.
The slave core cluster that slave core cluster in the present embodiment 1 is 8*8, i.e. m=8.As shown in Fig. 2, each read from core
Data include the dot matrix to be calculated of NX*NY, the left margin dot matrix for the X1*NY being arranged on the left of dot matrix to be calculated, setting wait count
The right margin dot matrix of X2*NY on the right side of dot matrix, the coboundary dot matrix for the NX*Y1 being arranged in above dot matrix to be calculated and setting is calculated to exist
The lower boundary dot matrix of NX*Y2 below dot matrix to be calculated.In the prior art, it each is respectively necessary for reading from DMA from core and calculate
Required data, need the data volume of (NX+X1+X2) * (NY+Y1+Y2), and 8 of the direction y need to read (NX+X1+ from core
X2) the data volume of * (8NY+8Y1+8Y2);Each from there is many data redundancies between core, a large amount of data are caused
Waste.
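As a quick check of the prior-art figure, the formula can be evaluated for concrete sizes. The function name and the example values below are assumptions chosen for illustration (they match the 8*8 grid and 12*12 tile of the background example), not values given for this embodiment.

```python
def prior_art_volume(NX, NY, X1, X2, Y1, Y2, m=8):
    """Points read from DMA when each of m slave cores fetches its full tile."""
    per_core = (NX + X1 + X2) * (NY + Y1 + Y2)
    return m * per_core


# Illustrative sizes: 8*8 grid per core with a 2-point boundary on every side.
volume = prior_art_volume(NX=8, NY=8, X1=2, X2=2, Y1=2, Y2=2)
print(volume)  # 8 * 12 * 12 = 1152
```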
With the method for reducing redundant reads based on inter-register communication provided by Embodiment 1 of the invention, steps S101 to S102 reduce redundant reads. Specifically, in this embodiment, for every intermediate slave core along the y direction (every slave core other than the first and the last), the y-direction boundary regions (the upper and lower boundary grids in Fig. 2) are contained in the data read by the adjacent slave cores. An intermediate slave core therefore only reads from DMA the NY rows of grid data it has to compute and does not read the boundary data in the y direction. The first and last slave cores along the y direction each have one boundary region that is not contained in the data read by an adjacent slave core, so besides reading their NY rows of grid data they must read that region directly from DMA: the first slave core reads its upper boundary grid from DMA, and the last slave core reads its lower boundary grid from DMA. A given slave core communicates directly with its adjacent slave cores to obtain the y-direction boundary and reads the corresponding boundary data from them. The x-direction boundary regions of each slave core (the left and right boundary grids in Fig. 2) are read directly from DMA. An intermediate slave core thus reads a data volume of (NX+X1+X2)*NY, the first slave core reads (NX+X1+X2)*(NY+Y1), and the last slave core reads (NX+X1+X2)*(NY+Y2). In total, the 8 slave cores along the y direction read a data volume of (NX+X1+X2)*(8NY+Y1+Y2), far less than the (NX+X1+X2)*(8NY+8Y1+8Y2) of the prior art.
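The totals above follow from summing the per-core volumes. The worked equation below restates them with W = NX+X1+X2; the symbols W, V_prior, and V_new are introduced here only for illustration and do not appear in the patent.

```latex
\begin{aligned}
W &= NX + X1 + X2,\\
V_{\text{prior}} &= 8\,W\,(NY + Y1 + Y2) = W\,(8NY + 8Y1 + 8Y2),\\
V_{\text{new}} &= \underbrace{W\,(NY + Y1)}_{\text{first core}}
               + \underbrace{6\,W\,NY}_{\text{intermediate cores}}
               + \underbrace{W\,(NY + Y2)}_{\text{last core}}
               = W\,(8NY + Y1 + Y2),\\
V_{\text{prior}} - V_{\text{new}} &= 7\,W\,(Y1 + Y2).
\end{aligned}
```

As an illustration with assumed sizes NX = NY = 8 and X1 = X2 = Y1 = Y2 = 2, this gives V_prior = 1152, V_new = 816, and a saving of 336 data points per column of 8 slave cores.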
In the method for reducing redundant reads based on inter-register communication provided by the invention, the portion of each slave core's required data that overlaps with the data of another slave core is read directly through inter-register communication. This reduces the volume of data read directly from DMA, lessens redundant reads during computation, avoids wasted data transfers, and improves DMA bandwidth utilization.
Those skilled in the art should understand that variations can be realized by combining the prior art with the above embodiment; they are not described in detail here. Such variations do not affect the essence of the invention.
The preferred embodiments of the invention have been described above. It should be understood that the invention is not limited to the specific embodiments described; devices and structures not described in detail here should be understood as being implemented in the manner common in the art. Anyone skilled in the art may, without departing from the technical solution of the invention, make many possible changes and modifications, or modify the embodiments into equivalent embodiments, without affecting the essence of the invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the protection scope of the technical solution of the invention.
Claims (4)
1. A method for reducing redundant reads based on inter-register communication, characterized by comprising:
m slave cores along one y direction of a slave core cluster each reading data from the y direction of the data point set to be calculated, which is stored in the DMA;
if the data read by a slave core adjacent to slave core n contains the y-direction boundary data of slave core n, slave core n not reading from DMA the y-direction boundary data already contained in the adjacent slave core;
slave core n and the adjacent slave core obtaining the y-direction boundary through register communication;
slave core n reading its boundary data from the adjacent slave core;
wherein m is the number of slave cores along the y direction of the slave core cluster.
2. The method for reducing redundant reads based on inter-register communication according to claim 1, characterized in that each slave core computes grid data of size NX*NY;
the step "m slave cores along one y direction of a slave core cluster each reading data from the y direction of the data point set to be calculated, which is stored in the DMA" is specifically "m slave cores along one y direction of the slave core cluster each reading NY rows of data along the y direction from the data point set to be calculated, which is stored in the DMA".
3. The method for reducing redundant reads based on inter-register communication according to claim 2, characterized in that the slave cores at the two ends of one y direction of the slave core cluster read directly from DMA the boundary data that is not contained in the data read by their adjacent slave cores.
4. The method for reducing redundant reads based on inter-register communication according to claim 3, characterized in that the x-direction boundary data of each slave core is read directly from DMA.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910022567.1A CN109739678A (en) | 2019-01-10 | 2019-01-10 | Method for reducing redundant reads based on inter-register communication |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739678A (en) | 2019-05-10 |
Family
ID=66364264
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201910022567.1A | Pending | CN109739678A (en) | 2019-01-10 | 2019-01-10 | Method for reducing redundant reads based on inter-register communication |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739678A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080126601A1 (en) * | 2006-09-22 | 2008-05-29 | Sony Computer Entertainment Inc. | Methods and apparatus for allocating DMA activity between a plurality of entities |
US7536669B1 (en) * | 2006-08-30 | 2009-05-19 | Xilinx, Inc. | Generic DMA IP core interface for FPGA platform design |
CN107168683A (en) * | 2017-05-05 | 2017-09-15 | 中国科学院软件研究所 | GEMM dense matrix multiply high-performance implementation method on the domestic many-core CPU of Shen prestige 26010 |
CN109002659A (en) * | 2018-09-07 | 2018-12-14 | 西安交通大学 | A kind of fluid machinery simulated program optimization method based on supercomputer |
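Non-Patent Citations (2)
Title |
---|
Yao Wenjun (姚文军) et al.: "Porting and optimization of NAMD software on Sunway TaihuLight" (基于神威太湖之光的NAMD软件的移植与优化), Computer Engineering & Science (《计算机工程与科学》) * |
Meng Delong (孟德龙) et al.: "Porting and optimization of OpenFOAM on Sunway TaihuLight" (神威太湖之光上OpenFOAM的移植与优化), Computer Science (《计算机科学》) * |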
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190510 |