CN105573959B - A distributed computer integrating computation and storage - Google Patents

A distributed computer integrating computation and storage

Info

Publication number
CN105573959B
CN105573959B (application CN201610077605.XA)
Authority
CN
China
Prior art keywords
mpcu
cpu
central processing
processing unit
computing unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610077605.XA
Other languages
Chinese (zh)
Other versions
CN105573959A (en)
Inventor
何虎
侯毓敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201610077605.XA priority Critical patent/CN105573959B/en
Publication of CN105573959A publication Critical patent/CN105573959A/en
Application granted granted Critical
Publication of CN105573959B publication Critical patent/CN105573959B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/161Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A distributed computer architecture integrating computation and storage comprises a central processing unit (CPU) and one or more computing units. Based on 3D packaging technology, the computing units are packaged inside DDR chips; each DDR chip stacks multiple DRAM layers and one logic circuit layer, and the logic layer contains a DMA engine and one or more computing units. The computing units access the DRAM directly, while the CPU accesses the DRAM through a memory controller and the layered storage hierarchy. The CPU and the computing units exchange blocks of data rapidly through the DMA. The CPU runs the operating system and performs the necessary control operations; the computing units are responsible for completing the computing tasks. Based on 3D packaging technology, the invention realizes the cooperative configuration of computing units and central processing unit, converting the centralized computing model into a distributed one and greatly reducing the burden on the central processing unit.

Description

A distributed computer integrating computation and storage
Technical field
The invention belongs to the field of computer architecture. It relates to an architecture integrating computation and storage and to the design of a programming model for that architecture, and in particular to a distributed computer integrating computation and storage.
Background technology
In recent decades, the rapid development of processors and memory has brought great convenience to people's lives. While microprocessor clock frequencies keep rising and performance soars, memory capacity per chip keeps reaching new peaks and access times keep falling. However, microprocessor performance has been improving at roughly 60% per year, while DRAM access time improves by only about 7% per year. Computer designers therefore face an ever-widening performance gap between processor and memory. At present this is the biggest obstacle to improving overall computer performance: memory performance has become the dominant bottleneck of the whole computer system.
With advances in process technology, introducing more and more levels of cache into the architecture has become feasible. Introducing caches shortens the latency of the storage system, but the number of cache levels cannot grow without bound; in the worst case, too many levels even increase system latency. Cache bandwidth is also limited, and adding cache levels brings little bandwidth gain.
In a traditional computing architecture, the computing units are mainly CPUs, GPUs, DSPs, and the like. During data processing, data is fetched from memory, travels through the layers of the storage hierarchy to where the computing unit is, gets processed, and is finally written back to memory. Such a structure is called a computation-centric architecture. Application demands have changed greatly over time. We have now entered the big-data era, and the amount of data a computer must process keeps growing. Relative to computation, the share of time and energy taken by data access keeps increasing; besides the latency and bandwidth limits described above, power consumption is an increasingly severe problem. The traditional computation-centric architecture is less and less suited to today's application demands.
One effective solution is to place additional computing units close to the memory. Such computing units do not need to access data through the layers of the storage hierarchy, turning the original computation-centric model into a data-centric one. Early architectures integrating computation and storage generally integrated processing logic and memory on a single die. However, since the fabrication processes for processors and for memory are completely different, producing such a chip integrating processor and memory is very expensive, and no commercial computer was ever produced with that architecture.
As microelectronics has advanced, the 3D packaging technologies that have emerged allow DRAM and computing units fabricated in different processes to be integrated. For example, Micron proposed the HMC (Hybrid Memory Cube) structure. HMC uses 3D packaging to stack multiple DRAM layers on a logic circuit layer. The logic layer is responsible for sequencing and refreshing each DRAM layer, data routing, error correction, and tasks such as the high-speed interconnect to the host processor. The layers are interconnected by TSVs (Through-Silicon Vias) and fine copper pillars. TSVs can provide thousands of vertical interconnects and support multi-layer stacking, which greatly shortens the distance data must travel and reduces power consumption. Stacking multiple DRAM layers yields a high-density pin arrangement. Compared with ordinary DDR3, HMC improves energy efficiency by more than a factor of six. HMC raises storage density, shortens access latency, and increases data bandwidth. For multi-core systems, HMC improves request responsiveness; by designing an abstract interface, it realizes a new DRAM control strategy and reduces the interaction between CPU and DRAM.
The emergence of these new technologies brings new opportunities for research on computer architectures integrating computation and storage. In recent years the concept of NDP (Near-Data Processing) has been proposed: using 3D packaging, computing units are integrated into the logic circuit layer of a 3D DRAM, so computation can be completed where the data resides, reducing the distance data must move. However, concrete architecture designs for this idea are still being explored, and so far no specific design has been realized.
Summary of the invention
To overcome the shortcomings of the prior art described above, the purpose of the present invention is to provide a distributed computer integrating computation and storage that, based on 3D packaging technology, realizes the cooperative configuration of computing units and central processing unit.
To achieve the above goals, the technical solution adopted by the present invention is as follows.
A distributed computer integrating computation and storage comprises a central processing unit and one or more computing units. The central processing unit runs the operating system and performs the necessary control operations; the computing units are responsible for completing the computing tasks.
A computing unit is a massive parallel computation unit (MPCU), such as a SIMD machine, a GPU-like multithreaded machine, a configurable array, or a multi-core system. The central processing unit is a computation-light, lightweight processor core, such as an ARM processor or a similar computation-light lightweight processor.
Based on 3D packaging technology, the computing units are packaged inside DDR chips. Each DDR chip packages multiple DRAM layers and one logic circuit layer; the logic layer contains a DMA engine and one or more computing units. The computing units access the DRAM directly, while the central processing unit accesses the DRAM through a memory controller and the layered storage hierarchy. The central processing unit and the computing units realize fast block data exchange through the DMA.
Within the DRAM, a region is carved out as a non-cacheable part; this region is used by the computing units to run their programs. The other memory regions are read-only for the computing units, while the central processing unit can access the entire memory.
The computing units are integrated in the DDR chips, and one central processing unit connects to multiple DDR chips. The application program is reasonably partitioned between the central processing unit and the computing units so as to maximize runtime performance.
The whole architecture provides an API for application developers and provides a driver for users to control the computing units. Application developers program the distributed computer integrating computation and storage proposed by the invention through the provided API.
Compared with the prior art, the present invention greatly reduces the burden on the central processing unit by converting the centralized computing model into a distributed one. Because its computational burden is relieved, the central processing unit no longer needs large computing resources, so its area and power consumption are both greatly reduced. Studies have shown that exchanging data between processor and memory consumes about 19.5 times the power of exchanging data inside the processor. In this computer architecture the central processing unit no longer needs to access memory frequently, so the power consumed by memory accesses drops sharply. The computing units integrated in the 3D DRAM can access the DRAM directly; thanks to the advantages of 3D packaging, the speed at which the computing units access the DRAM rises markedly and the power consumption falls substantially. Therefore, relative to a traditional computing architecture, this distributed computer architecture has large advantages in both power and performance, while also reducing the area of the central processing unit.
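The 19.5× figure above can be turned into a back-of-the-envelope estimate of the exchange-energy saving. The sketch below is illustrative only: the 19.5 ratio comes from the text, but the function name and the workload fraction are hypothetical assumptions, not part of the patent.

```python
# Back-of-the-envelope model of the power argument above.
# Assumption (from the text): a processor<->memory data exchange costs
# about 19.5x the energy of an on-chip data exchange. The workload
# fraction below is illustrative.

OFFCHIP_RATIO = 19.5  # off-chip vs. on-chip exchange energy (from the text)

def relative_exchange_energy(local_fraction: float) -> float:
    """Total exchange energy relative to the all-off-chip baseline,
    when a fraction of exchanges become MPCU-local (on-chip cost)."""
    off = (1.0 - local_fraction) * OFFCHIP_RATIO
    local = local_fraction * 1.0
    return (off + local) / OFFCHIP_RATIO

# If the MPCU absorbs 90% of the exchanges, exchange energy drops
# to about 15% of the baseline.
print(f"{relative_exchange_energy(0.9):.3f}")  # prints 0.146
```

The model simply weights the two exchange costs by how often each path is taken; it ignores the separate savings from shrinking the CPU itself, which the text discusses qualitatively.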
Brief description of the drawings
Fig. 1 shows the architecture integrating computation and storage and its enabling technologies.
Fig. 2 shows the overall design of the architecture integrating computation and storage.
Fig. 3 shows an implementation of the architecture integrating computation and storage.
Fig. 4 shows the data transfer scheme in the architecture integrating computation and storage.
Fig. 5 shows a programming example for the architecture integrating computation and storage.
Fig. 6 shows the distributed computer integrating computation and storage.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples.
As shown in Fig. 1, on top of the traditional computing architecture an additional computing unit is placed close to the memory. This computing unit can be an MPCU (massive parallel computation unit) responsible for the bulk of the computing tasks, while the CPU in the architecture only needs to run the operating system and perform simple control operations. As shown in Fig. 1, the MPCU and the DDR memory are packaged in the same chip using 3D packaging technology; that is, the MPCU corresponds to the logic circuit layer (logic die) in an HMC.
Fig. 2 gives the overall design of the architecture integrating computation and storage. The whole architecture comprises a computation-light processor core, a massive parallel computation unit (MPCU), a memory controller, and memory (DRAM). The MPCU can be a SIMD machine, a GPU-like multithreaded machine, a configurable array, or a multi-core system. The MPCU is integrated inside the memory and can access it directly; the computation-light processor core must access memory through the memory controller. The computation-light processor core is responsible for program control flow and can also run an operating system or virtual machine, while the MPCU runs the computation-heavy or memory-access-intensive parts of the program.
The computation-light processor core and the massive parallel computation unit share one piece of physical memory. Within the DRAM, a region is carved out as a non-cacheable part; this region is used by the MPCU to run its programs, and the other memory regions are read-only for the MPCU. The computation-light processor core can access the entire memory. The whole system provides an API for programmers and a driver that lets users control the MPCU. Application developers program the distributed computer integrating computation and storage proposed by the invention through the provided API.
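The sharing rules in the paragraph above (one non-cacheable MPCU program region, everything else read-only for the MPCU, everything accessible to the CPU) can be sketched as a small permission model. This is a simulation for illustration only; the region boundaries and function names are hypothetical, not taken from the patent.

```python
# Toy model of the shared-memory access rules described above:
# the CPU may read and write anywhere; the MPCU may read anywhere,
# but may write only inside its dedicated non-cacheable program region.
# The region boundaries below are hypothetical.

MPCU_REGION = range(0x1000, 0x2000)  # carved-out, non-cacheable region

def may_access(agent: str, addr: int, write: bool) -> bool:
    """Return whether `agent` ('cpu' or 'mpcu') may perform the access."""
    if agent == "cpu":
        return True                      # CPU can access the entire memory
    if agent == "mpcu":
        if write:
            return addr in MPCU_REGION   # writable only in its own region
        return True                      # rest of memory is read-only
    raise ValueError(f"unknown agent {agent!r}")
```

A real implementation would enforce these rules in the memory controller and the MPCU TLB rather than in software, but the policy being enforced is the one stated in the text.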
The computation-light processor core can connect to multiple MPCU-integrated memories, and these units complete the main computing tasks. This greatly reduces the burden on the central processing unit and converts the centralized computing model into a distributed one.
Fig. 3 shows a concrete implementation of the distributed computer integrating computation and storage. The whole system is divided into two parts: we call the main processor in the system the CPU, and the computing unit integrated in memory the MPCU. The chip containing the CPU is called the host chip; it also contains the CPU cache, the CPU TLB, the internal system bus, and the DDR physical-layer protocol. The chip containing the MPCU is the DDR chip, which also contains the MPCU TLB and a DMA (Direct Memory Access) engine. The DMA is used to realize fast exchange of data blocks between CPU and MPCU. The CPU and the MPCU each need to reserve a program buffer region in the DDR for running their own programs. The whole system exposes an API and a driver, through which application programs control the system. The CPU can run an operating system. An application can be reasonably partitioned between CPU and MPCU so as to maximize runtime performance.
Fig. 4 gives the data transfer scheme between CPU and MPCU. In the whole system, CPU and MPCU share the same physical memory. The CPU must obtain data through its cache, while the MPCU obtains data directly from memory. When the CPU wants to pass data to the MPCU, it can also realize a fast block transfer directly through the DMA, bypassing the CPU cache. The CPU first translates the virtual address to a physical address through the CPU TLB; the DMA then copies the data from the CPU's program buffer to the MPCU's program buffer. The MPCU passes data to the CPU in the same way. The DMA is integrated in the DDR and does not follow the cache coherence protocol; therefore, after a DMA transfer has modified memory, the CPU must invalidate its cache so that later operations read the correct data.
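The Fig. 4 sequence (translate through the TLB, DMA the block past the cache, then invalidate the stale cache lines) can be illustrated with a toy simulation. This is a sketch under stated assumptions: the page table, cache, and DMA here are plain Python dictionaries standing in for the hardware, and the address layout is hypothetical.

```python
# Toy simulation of the Fig. 4 transfer: the DMA writes memory directly,
# bypassing the CPU cache, so the CPU must invalidate before re-reading.
# All structures are illustrative stand-ins for the real hardware.

memory = {}           # physical address -> value
cache = {}            # CPU cache: physical address -> value
tlb = {0x400: 0x10}   # virtual page -> physical page (hypothetical mapping)

def cpu_read(vaddr):
    paddr = tlb[vaddr >> 12] << 12 | (vaddr & 0xFFF)  # TLB translation
    if paddr not in cache:            # miss: fill the line from memory
        cache[paddr] = memory.get(paddr, 0)
    return cache[paddr]               # CPU always reads through its cache

def dma_write(paddr, value):
    memory[paddr] = value             # DMA bypasses the CPU cache

def invalidate(paddr):
    cache.pop(paddr, None)            # drop the stale cache line

vaddr = 0x400 << 12 | 0x8
paddr = 0x10 << 12 | 0x8
memory[paddr] = 1
assert cpu_read(vaddr) == 1           # the line is now cached
dma_write(paddr, 2)                   # MPCU result lands in memory
assert cpu_read(vaddr) == 1           # stale: cache is not DMA-coherent
invalidate(paddr)
assert cpu_read(vaddr) == 2           # after invalidate, CPU sees new data
```

The middle assertion is the whole point of the text's invalidate requirement: without the explicit invalidation, the CPU keeps serving the pre-DMA value from its cache.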
The CPU distributes binary executable programs to the MPCU. Communication between CPU and MPCU is realized through system calls. To realize this communication, 11 system call functions are designed in total:
(1) settargetmemory: when the CPU/MPCU executes this system call, the system establishes a memory region through which the CPU/MPCU passes data to the MPCU/CPU.
(2) switch2mpcu: suspend the CPU and switch from CPU work to MPCU work.
(3) switch2cpu: suspend the MPCU and switch from MPCU work to CPU work.
(4) cpu2mpcu: while switching from CPU to MPCU, transfer the CPU's data to the MPCU.
(5) mpcu2cpu: while switching from MPCU to CPU, transfer the MPCU's data to the CPU.
(6) freecpu: release the CPU.
(7) freempcu: release the MPCU.
(8) cacheflush: write the data in the cache back to memory.
(9) invalidate: after memory has been modified, invalidate the relevant parts of the cache.
(10) suspendmpcu: suspend the MPCU.
(11) suspendcpu: suspend the CPU.
These system call functions are provided to application developers as an API, and implementing the corresponding functions in the operating system realizes control over the CPU and the MPCU device. As an example, Fig. 5 gives the way an Mpeg2decode program runs on the computer integrating computation and storage.
Mpeg2 is a lossy audio/video compression standard, and the Mpeg2decode program decodes compressed audio/video files. As shown in Fig. 5, first the CPU and the MPCU are initialized separately. The MPCU is suspended, waiting to be woken by the CPU. The CPU fetches one frame of data, establishes a data buffer region, and passes the data to the MPCU. The MPCU is then woken, receives the data and performs the decoding work, and passes the result back to the CPU when finished. The CPU receives the decoded data, writes it to a file, and waits to fetch the next frame. The loop continues until all data has been decoded. Throughout this process, the data transfers and the communication between CPU and MPCU rely on the system call functions added by the design.
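The Fig. 5 control flow can be sketched on top of the system calls listed above. A minimal sketch: the decode step is a stub, and the CPU/MPCU hand-off is modeled with ordinary function calls rather than real processor suspension, so only the structure of the loop comes from the text.

```python
# Sketch of the Fig. 5 Mpeg2decode flow: the CPU feeds one frame at a
# time to the MPCU and writes back each decoded result. The decoder and
# the switch2mpcu/switch2cpu calls below are stubs for illustration.

def switch2mpcu():
    pass  # stub: would suspend the CPU and wake the MPCU

def switch2cpu():
    pass  # stub: would suspend the MPCU and wake the CPU

def run_decode(frames, output):
    for frame in frames:
        buf = frame               # settargetmemory: shared buffer (stub)
        switch2mpcu()             # CPU suspends; MPCU wakes with the data
        decoded = b"dec:" + buf   # MPCU decode work (stub)
        switch2cpu()              # MPCU suspends; CPU resumes with result
        output.append(decoded)    # CPU writes the decoded frame out
    # loop ends once every frame has been decoded, as in Fig. 5

out = []
run_decode([b"f0", b"f1"], out)
```

In the real system each iteration would use cpu2mpcu/mpcu2cpu to move the frame and result through the DMA; the stubbed identity "decoder" just makes the ping-pong structure visible.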
Fig. 6 shows the distributed computer integrating computation and storage built from this architecture. One CPU can connect to multiple DDR chips. Each DDR chip is a 3D-packaged DRAM whose logic layer integrates an MPCU. Compared with the traditional centralized computer architecture, the distributed computer integrating computation and storage has the advantages of high speed and low power consumption.

Claims (7)

1. A distributed computer integrating computation and storage, comprising a central processing unit and one or more computing units, wherein the central processing unit runs an operating system and performs the necessary control operations, and the computing units are responsible for completing computing tasks, characterized in that, based on 3D packaging technology, the computing units are packaged inside DDR chips; each DDR chip packages multiple DRAM layers and one logic circuit layer; the logic circuit layer contains a DMA and one or more computing units; the computing units access the DRAM directly, while the central processing unit accesses the DRAM through a memory controller and the layered storage hierarchy; and the central processing unit and the computing units realize fast block data exchange through the DMA.
2. The distributed computer integrating computation and storage according to claim 1, characterized in that the computing unit is a massive parallel computation unit (MPCU) and the central processing unit is a computation-light lightweight processor core.
3. The distributed computer integrating computation and storage according to claim 1, characterized in that the central processing unit and the computing units share one piece of physical memory.
4. The distributed computer integrating computation and storage according to claim 1, characterized in that, within the DRAM, a region is carved out as a non-cacheable part; this region is used by the computing units to run their programs; the other memory regions are read-only for the computing units; and the central processing unit can access the entire memory.
5. The distributed computer integrating computation and storage according to claim 1, characterized in that the computing units are integrated in the DDR chips, and one central processing unit connects to multiple DDR chips.
6. The distributed computer integrating computation and storage according to claim 1, characterized in that the whole architecture provides an API for application developers and provides a driver for users to realize control over the computing units; application developers program the distributed computer integrating computation and storage through the provided API.
7. The distributed computer integrating computation and storage according to claim 6, characterized in that the API realizes the following functions so as to realize communication between the central processing unit (CPU) and the computing unit (MPCU):
(1) settargetmemory: when the CPU/MPCU executes this system call, the system establishes a memory region through which the CPU/MPCU passes data to the MPCU/CPU;
(2) switch2mpcu: suspend the CPU and switch from CPU work to MPCU work;
(3) switch2cpu: suspend the MPCU and switch from MPCU work to CPU work;
(4) cpu2mpcu: while switching from CPU to MPCU, transfer the CPU's data to the MPCU;
(5) mpcu2cpu: while switching from MPCU to CPU, transfer the MPCU's data to the CPU;
(6) freecpu: release the CPU;
(7) freempcu: release the MPCU;
(8) cacheflush: write the data in the cache back to memory;
(9) invalidate: after memory has been modified, invalidate the relevant parts of the cache;
(10) suspendmpcu: suspend the MPCU;
(11) suspendcpu: suspend the CPU.
CN201610077605.XA 2016-02-03 2016-02-03 A distributed computer integrating computation and storage Active CN105573959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610077605.XA CN105573959B (en) 2016-02-03 2016-02-03 A distributed computer integrating computation and storage


Publications (2)

Publication Number Publication Date
CN105573959A CN105573959A (en) 2016-05-11
CN105573959B 2018-10-19

Family

ID=55884114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610077605.XA Active CN105573959B (en) 2016-02-03 2016-02-03 A distributed computer integrating computation and storage

Country Status (1)

Country Link
CN (1) CN105573959B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10732866B2 (en) 2016-10-27 2020-08-04 Samsung Electronics Co., Ltd. Scaling out architecture for DRAM-based processing unit (DPU)
CN109144717A (en) * 2017-07-07 2019-01-04 广东网金控股股份有限公司 A dual-ARM-chip master control method and terminal device
CN109558370A (en) * 2017-09-23 2019-04-02 成都海存艾匹科技有限公司 Three-dimensional computations encapsulation
US10884672B2 (en) 2018-04-02 2021-01-05 Samsung Electronics Co., Ltd. NDP-server: a data-centric computing architecture based on storage server in data center
CN111581124A (en) * 2019-02-19 2020-08-25 睿宽智能科技有限公司 Method for shortening text exchange time and semiconductor device thereof
CN111174805A (en) * 2019-04-30 2020-05-19 奥特酷智能科技(南京)有限公司 Distributed centralized automatic driving system
CN112804297B (en) * 2020-12-30 2022-08-19 之江实验室 Assembled distributed computing and storage system and construction method thereof
CN113377293B (en) * 2021-07-08 2022-07-05 支付宝(杭州)信息技术有限公司 Method and device for calculating in storage device and storage device
CN114912587B (en) * 2022-06-09 2023-05-26 上海燧原科技有限公司 Neural network distributed training system, method, device, computing unit and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4101960A (en) * 1977-03-29 1978-07-18 Burroughs Corporation Scientific processor
CN101751244A (en) * 2010-01-04 2010-06-23 清华大学 Microprocessor
CN101882302A (en) * 2010-06-02 2010-11-10 北京理工大学 Motion blur image restoration system based on multi-core
CN102282542A (en) * 2008-10-14 2011-12-14 奇托尔·V·斯里尼瓦桑 TICC-paradigm to build formally verified parallel software for multi-core chips
CN104820657A (en) * 2015-05-14 2015-08-05 西安电子科技大学 Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Memristor: The Enabler of Computation-in-Memory Architecture for Big-Data"; Said Hamdioui et al.; 2015 International Conference on Memristive Systems (MEMRISYS); Dec. 31, 2015; see abstract lines 1-3, p. 2 right column paragraph 2, Fig. 1(e) *
"Research on Program Optimization Strategies in a GPU-Based Heterogeneous Parallel Environment"; Liu Xing et al.; Journal of Hubei University of Education (湖北第二师范学院学报); Aug. 31, 2010; Vol. 27, No. 8; entire document *

Also Published As

Publication number Publication date
CN105573959A (en) 2016-05-11

Similar Documents

Publication Publication Date Title
CN105573959B (en) A distributed computer integrating computation and storage
Ke et al. Near-memory processing in action: Accelerating personalized recommendation with axdimm
JP6974270B2 (en) Intelligent high bandwidth memory system and logic die for it
Binnig et al. The end of slow networks: It's time for a redesign
Siegl et al. Data-centric computing frontiers: A survey on processing-in-memory
CN104699631A (en) Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN106462501A (en) Hybrid memory cube system interconnect directory-based cache coherence methodology
Huang et al. Active-routing: Compute on the way for near-data processing
US11276459B2 (en) Memory die including local processor and global processor, memory device, and electronic device
US11789644B2 (en) Memory centric system incorporating computational memory
Hassan et al. Near data processing: Impact and optimization of 3D memory system architecture on the uncore
CN109791507A (en) Improve the mechanism of the data locality of distribution GPUS
Sun et al. 3D DRAM design and application to 3D multicore systems
CN108256643A (en) An HMC-based neural network computing device and method
KR20230041593A (en) Scalable address decoding scheme for cxl type-2 devices with programmable interleave granularity
US11966330B2 (en) Link affinitization to reduce transfer latency
Wang et al. Application defined on-chip networks for heterogeneous chiplets: An implementation perspective
WO2016078205A1 (en) Directory structure implementation method and system for host system
JP2022151611A (en) Integrated three-dimensional (3D) DRAM cache
CN109491934A (en) A control method for a storage management system with integrated computing function
CN101404177B (en) Computation type memory with data processing capability
CN112579487A (en) Techniques for decoupled access-execution near memory processing
Chen et al. GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory
Woo et al. Pragmatic integration of an SRAM row cache in heterogeneous 3-D DRAM architecture using TSV
CN100580804C (en) Dynamic RAM device with data-handling capacity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant