CN105573959B - A distributed computer integrating computation and storage - Google Patents

A distributed computer integrating computation and storage

Info

Publication number
CN105573959B
CN105573959B (application CN201610077605.XA)
Authority
CN
China
Prior art keywords
mpcu
cpu
central processing
processing unit
computing unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610077605.XA
Other languages
Chinese (zh)
Other versions
CN105573959A (en)
Inventor
何虎
侯毓敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201610077605.XA priority Critical patent/CN105573959B/en
Publication of CN105573959A publication Critical patent/CN105573959A/en
Application granted granted Critical
Publication of CN105573959B publication Critical patent/CN105573959B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/161Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A distributed computer architecture integrating computation and storage comprises a central processing unit (CPU) and one or more computing units. Based on 3D packaging technology, the computing units are packaged inside DDR chips; each DDR chip stacks multiple DRAM layers and one logic circuit layer, and the logic layer contains a DMA engine and one or more computing units. The computing units access the DRAM directly, while the CPU accesses the DRAM through a memory controller and the layered storage hierarchy. The CPU and the computing units exchange blocks of data rapidly through the DMA. The CPU runs the operating system and performs the necessary control operations; the computing units are responsible for completing the computing tasks. Based on 3D packaging technology, the invention realizes the cooperative configuration of computing units and central processing unit, converting the centralized computing model into a distributed one and greatly reducing the burden on the central processing unit.

Description

A distributed computer integrating computation and storage
Technical field
The invention belongs to the field of computer architecture. It relates to an architecture integrating computation and storage and to the design of a programming model for that architecture, and in particular to a distributed computer integrating computation and storage.
Background technology
In recent decades, the rapid development of processors and memory has brought great convenience to people's lives. While microprocessor clock frequencies keep rising and performance soars, memory capacity per chip keeps reaching new peaks and access times keep falling. However, microprocessor performance has been improving at roughly 60% per year, while DRAM access time improves by only about 7% per year. Computer designers therefore face an ever-widening performance gap between processor and memory. At present this is the biggest obstacle to improving overall computer performance: memory performance has become the dominant bottleneck of the whole computer system.
With advances in process technology, introducing more and more levels of cache into the architecture has become feasible. Introducing caches shortens the latency of the storage system, but the number of cache levels cannot grow without bound; in the worst case, too many levels even increase system latency. Cache bandwidth is also limited, and adding cache levels brings little bandwidth gain.
In a traditional computing architecture, the computing units are mainly CPUs, GPUs, DSPs, and the like. During data processing, data is fetched from memory, travels through the layers of the storage hierarchy to where the computing unit is, gets processed, and is finally written back to memory. Such a structure is called a computation-centric architecture. Application demands have changed greatly over time. We have now entered the big-data era, and the amount of data a computer must process keeps growing. Relative to computation, the share of time and energy taken by data access keeps increasing; besides the latency and bandwidth limits described above, power consumption is an increasingly severe problem. The traditional computation-centric architecture is less and less suited to today's application demands.
One effective solution is to place additional computing units close to the memory. Such computing units do not need to access data through the layers of the storage hierarchy, turning the original computation-centric model into a data-centric one. Early architectures integrating computation and storage generally integrated processing logic and memory on a single die. However, since the fabrication processes for processors and for memory are completely different, producing such a chip integrating processor and memory is very expensive, and no commercial computer was ever produced with that architecture.
As microelectronics has advanced, the 3D packaging technologies that have emerged allow DRAM and computing units fabricated in different processes to be integrated. For example, Micron proposed the HMC (Hybrid Memory Cube) structure. HMC uses 3D packaging to stack multiple DRAM layers on a logic circuit layer. The logic layer is responsible for sequencing and refreshing each DRAM layer, data routing, error correction, and tasks such as the high-speed interconnect to the host processor. The layers are interconnected by TSVs (Through-Silicon Vias) and fine copper pillars. TSVs can provide thousands of vertical interconnects and support multi-layer stacking, which greatly shortens the distance data must travel and reduces power consumption. Stacking multiple DRAM layers yields a high-density pin arrangement. Compared with ordinary DDR3, HMC improves energy efficiency by more than a factor of six. HMC raises storage density, shortens access latency, and increases data bandwidth. For multi-core systems, HMC improves request responsiveness; by designing an abstract interface, it realizes a new DRAM control strategy and reduces the interaction between CPU and DRAM.
The emergence of these new technologies brings new opportunities for research on computer architectures integrating computation and storage. In recent years the concept of NDP (Near-Data Processing) has been proposed: using 3D packaging, computing units are integrated into the logic circuit layer of a 3D DRAM, so computation can be completed where the data resides, reducing the distance data must move. However, concrete architecture designs for this idea are still being explored, and so far no specific design has been realized.
Summary of the invention
To overcome the shortcomings of the prior art described above, the purpose of the present invention is to provide a distributed computer integrating computation and storage that, based on 3D packaging technology, realizes the cooperative configuration of computing units and central processing unit.
To achieve the above goals, the technical solution adopted by the present invention is as follows.
A distributed computer integrating computation and storage comprises a central processing unit and one or more computing units. The central processing unit runs the operating system and performs the necessary control operations; the computing units are responsible for completing the computing tasks.
A computing unit is a massive parallel computation unit (MPCU), such as a SIMD machine, a GPU-like multithreaded machine, a configurable array, or a multi-core system. The central processing unit is a computation-light, lightweight processor core, such as an ARM processor or a similar computation-light lightweight processor.
Based on 3D packaging technology, the computing units are packaged inside DDR chips. Each DDR chip packages multiple DRAM layers and one logic circuit layer; the logic layer contains a DMA engine and one or more computing units. The computing units access the DRAM directly, while the central processing unit accesses the DRAM through a memory controller and the layered storage hierarchy. The central processing unit and the computing units realize fast block data exchange through the DMA.
Within the DRAM, a region is carved out as a non-cacheable part; this region is used by the computing units to run their programs. The other memory regions are read-only for the computing units, while the central processing unit can access the entire memory.
The computing units are integrated in the DDR chips, and one central processing unit connects to multiple DDR chips. The application program is reasonably partitioned between the central processing unit and the computing units so as to maximize runtime performance.
The whole architecture provides an API for application developers and provides a driver for users to control the computing units. Application developers program the distributed computer integrating computation and storage proposed by the invention through the provided API.
Compared with the prior art, the present invention greatly reduces the burden on the central processing unit by converting the centralized computing model into a distributed one. Because its computational burden is relieved, the central processing unit no longer needs large computing resources, so its area and power consumption are both greatly reduced. Studies have shown that exchanging data between processor and memory consumes about 19.5 times the power of exchanging data inside the processor. In this computer architecture the central processing unit no longer needs to access memory frequently, so the power consumed by memory accesses drops sharply. The computing units integrated in the 3D DRAM can access the DRAM directly; thanks to the advantages of 3D packaging, the speed at which the computing units access the DRAM rises markedly and the power consumption falls substantially. Therefore, relative to a traditional computing architecture, this distributed computer architecture has large advantages in both power and performance, while also reducing the area of the central processing unit.
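The 19.5× figure above can be turned into a back-of-the-envelope estimate of the exchange-energy saving. The sketch below is illustrative only: the 19.5 ratio comes from the text, but the function name and the workload fraction are hypothetical assumptions, not part of the patent.

```python
# Back-of-the-envelope model of the power argument above.
# Assumption (from the text): a processor<->memory data exchange costs
# about 19.5x the energy of an on-chip data exchange. The workload
# fraction below is illustrative.

OFFCHIP_RATIO = 19.5  # off-chip vs. on-chip exchange energy (from the text)

def relative_exchange_energy(local_fraction: float) -> float:
    """Total exchange energy relative to the all-off-chip baseline,
    when a fraction of exchanges become MPCU-local (on-chip cost)."""
    off = (1.0 - local_fraction) * OFFCHIP_RATIO
    local = local_fraction * 1.0
    return (off + local) / OFFCHIP_RATIO

# If the MPCU absorbs 90% of the exchanges, exchange energy drops
# to about 15% of the baseline.
print(f"{relative_exchange_energy(0.9):.3f}")  # prints 0.146
```

The model simply weights the two exchange costs by how often each path is taken; it ignores the separate savings from shrinking the CPU itself, which the text discusses qualitatively.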
Brief description of the drawings
Fig. 1 shows the architecture integrating computation and storage and its enabling technologies.
Fig. 2 shows the overall design of the architecture integrating computation and storage.
Fig. 3 shows an implementation of the architecture integrating computation and storage.
Fig. 4 shows the data transfer scheme in the architecture integrating computation and storage.
Fig. 5 shows a programming example for the architecture integrating computation and storage.
Fig. 6 shows the distributed computer integrating computation and storage.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples.
As shown in Fig. 1, on top of the traditional computing architecture an additional computing unit is placed close to the memory. This computing unit can be an MPCU (massive parallel computation unit) responsible for the bulk of the computing tasks, while the CPU in the architecture only needs to run the operating system and perform simple control operations. As shown in Fig. 1, the MPCU and the DDR memory are packaged in the same chip using 3D packaging technology; that is, the MPCU corresponds to the logic circuit layer (logic die) in an HMC.
Fig. 2 gives the overall design of the architecture integrating computation and storage. The whole architecture comprises a computation-light processor core, a massive parallel computation unit (MPCU), a memory controller, and memory (DRAM). The MPCU can be a SIMD machine, a GPU-like multithreaded machine, a configurable array, or a multi-core system. The MPCU is integrated inside the memory and can access it directly; the computation-light processor core must access memory through the memory controller. The computation-light processor core is responsible for program control flow and can also run an operating system or virtual machine, while the MPCU runs the computation-heavy or memory-access-intensive parts of the program.
The computation-light processor core and the massive parallel computation unit share one piece of physical memory. Within the DRAM, a region is carved out as a non-cacheable part; this region is used by the MPCU to run its programs, and the other memory regions are read-only for the MPCU. The computation-light processor core can access the entire memory. The whole system provides an API for programmers and a driver that lets users control the MPCU. Application developers program the distributed computer integrating computation and storage proposed by the invention through the provided API.
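The sharing rules in the paragraph above (one non-cacheable MPCU program region, everything else read-only for the MPCU, everything accessible to the CPU) can be sketched as a small permission model. This is a simulation for illustration only; the region boundaries and function names are hypothetical, not taken from the patent.

```python
# Toy model of the shared-memory access rules described above:
# the CPU may read and write anywhere; the MPCU may read anywhere,
# but may write only inside its dedicated non-cacheable program region.
# The region boundaries below are hypothetical.

MPCU_REGION = range(0x1000, 0x2000)  # carved-out, non-cacheable region

def may_access(agent: str, addr: int, write: bool) -> bool:
    """Return whether `agent` ('cpu' or 'mpcu') may perform the access."""
    if agent == "cpu":
        return True                      # CPU can access the entire memory
    if agent == "mpcu":
        if write:
            return addr in MPCU_REGION   # writable only in its own region
        return True                      # rest of memory is read-only
    raise ValueError(f"unknown agent {agent!r}")
```

A real implementation would enforce these rules in the memory controller and the MPCU TLB rather than in software, but the policy being enforced is the one stated in the text.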
The computation-light processor core can connect to multiple MPCU-integrated memories, and these units complete the main computing tasks. This greatly reduces the burden on the central processing unit and converts the centralized computing model into a distributed one.
Fig. 3 shows a concrete implementation of the distributed computer integrating computation and storage. The whole system is divided into two parts: we call the main processor in the system the CPU, and the computing unit integrated in memory the MPCU. The chip containing the CPU is called the host chip; it also contains the CPU cache, the CPU TLB, the internal system bus, and the DDR physical-layer protocol. The chip containing the MPCU is the DDR chip, which also contains the MPCU TLB and a DMA (Direct Memory Access) engine. The DMA is used to realize fast exchange of data blocks between CPU and MPCU. The CPU and the MPCU each need to reserve a program buffer region in the DDR for running their own programs. The whole system exposes an API and a driver, through which application programs control the system. The CPU can run an operating system. An application can be reasonably partitioned between CPU and MPCU so as to maximize runtime performance.
Fig. 4 gives the data transfer scheme between CPU and MPCU. In the whole system, CPU and MPCU share the same physical memory. The CPU must obtain data through its cache, while the MPCU obtains data directly from memory. When the CPU wants to pass data to the MPCU, it can also realize a fast block transfer directly through the DMA, bypassing the CPU cache. The CPU first translates the virtual address to a physical address through the CPU TLB; the DMA then copies the data from the CPU's program buffer to the MPCU's program buffer. The MPCU passes data to the CPU in the same way. The DMA is integrated in the DDR and does not follow the cache coherence protocol; therefore, after a DMA transfer has modified memory, the CPU must invalidate its cache so that later operations read the correct data.
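The Fig. 4 sequence (translate through the TLB, DMA the block past the cache, then invalidate the stale cache lines) can be illustrated with a toy simulation. This is a sketch under stated assumptions: the page table, cache, and DMA here are plain Python dictionaries standing in for the hardware, and the address layout is hypothetical.

```python
# Toy simulation of the Fig. 4 transfer: the DMA writes memory directly,
# bypassing the CPU cache, so the CPU must invalidate before re-reading.
# All structures are illustrative stand-ins for the real hardware.

memory = {}           # physical address -> value
cache = {}            # CPU cache: physical address -> value
tlb = {0x400: 0x10}   # virtual page -> physical page (hypothetical mapping)

def cpu_read(vaddr):
    paddr = tlb[vaddr >> 12] << 12 | (vaddr & 0xFFF)  # TLB translation
    if paddr not in cache:            # miss: fill the line from memory
        cache[paddr] = memory.get(paddr, 0)
    return cache[paddr]               # CPU always reads through its cache

def dma_write(paddr, value):
    memory[paddr] = value             # DMA bypasses the CPU cache

def invalidate(paddr):
    cache.pop(paddr, None)            # drop the stale cache line

vaddr = 0x400 << 12 | 0x8
paddr = 0x10 << 12 | 0x8
memory[paddr] = 1
assert cpu_read(vaddr) == 1           # the line is now cached
dma_write(paddr, 2)                   # MPCU result lands in memory
assert cpu_read(vaddr) == 1           # stale: cache is not DMA-coherent
invalidate(paddr)
assert cpu_read(vaddr) == 2           # after invalidate, CPU sees new data
```

The middle assertion is the whole point of the text's invalidate requirement: without the explicit invalidation, the CPU keeps serving the pre-DMA value from its cache.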
The CPU distributes binary executable programs to the MPCU. Communication between CPU and MPCU is realized through system calls. To realize this communication, 11 system call functions are designed in total:
(1) settargetmemory: when the CPU/MPCU executes this system call, the system establishes a memory region through which the CPU/MPCU passes data to the MPCU/CPU.
(2) switch2mpcu: suspend the CPU and switch from CPU work to MPCU work.
(3) switch2cpu: suspend the MPCU and switch from MPCU work to CPU work.
(4) cpu2mpcu: while switching from CPU to MPCU, transfer the CPU's data to the MPCU.
(5) mpcu2cpu: while switching from MPCU to CPU, transfer the MPCU's data to the CPU.
(6) freecpu: release the CPU.
(7) freempcu: release the MPCU.
(8) cacheflush: write the data in the cache back to memory.
(9) invalidate: after memory has been modified, invalidate the relevant parts of the cache.
(10) suspendmpcu: suspend the MPCU.
(11) suspendcpu: suspend the CPU.
These system call functions are provided to application developers as an API, and implementing the corresponding functions in the operating system realizes control over the CPU and the MPCU device. As an example, Fig. 5 gives the way an Mpeg2decode program runs on the computer integrating computation and storage.
Mpeg2 is a lossy audio/video compression standard, and the Mpeg2decode program decodes compressed audio/video files. As shown in Fig. 5, first the CPU and the MPCU are initialized separately. The MPCU is suspended, waiting to be woken by the CPU. The CPU fetches one frame of data, establishes a data buffer region, and passes the data to the MPCU. The MPCU is then woken, receives the data and performs the decoding work, and passes the result back to the CPU when finished. The CPU receives the decoded data, writes it to a file, and waits to fetch the next frame. The loop continues until all data has been decoded. Throughout this process, the data transfers and the communication between CPU and MPCU rely on the system call functions added by the design.
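The Fig. 5 control flow can be sketched on top of the system calls listed above. A minimal sketch: the decode step is a stub, and the CPU/MPCU hand-off is modeled with ordinary function calls rather than real processor suspension, so only the structure of the loop comes from the text.

```python
# Sketch of the Fig. 5 Mpeg2decode flow: the CPU feeds one frame at a
# time to the MPCU and writes back each decoded result. The decoder and
# the switch2mpcu/switch2cpu calls below are stubs for illustration.

def switch2mpcu():
    pass  # stub: would suspend the CPU and wake the MPCU

def switch2cpu():
    pass  # stub: would suspend the MPCU and wake the CPU

def run_decode(frames, output):
    for frame in frames:
        buf = frame               # settargetmemory: shared buffer (stub)
        switch2mpcu()             # CPU suspends; MPCU wakes with the data
        decoded = b"dec:" + buf   # MPCU decode work (stub)
        switch2cpu()              # MPCU suspends; CPU resumes with result
        output.append(decoded)    # CPU writes the decoded frame out
    # loop ends once every frame has been decoded, as in Fig. 5

out = []
run_decode([b"f0", b"f1"], out)
```

In the real system each iteration would use cpu2mpcu/mpcu2cpu to move the frame and result through the DMA; the stubbed identity "decoder" just makes the ping-pong structure visible.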
Fig. 6 shows the distributed computer integrating computation and storage built from this architecture. One CPU can connect to multiple DDR chips. Each DDR chip is a 3D-packaged DRAM whose logic layer integrates an MPCU. Compared with the traditional centralized computer architecture, the distributed computer integrating computation and storage has the advantages of high speed and low power consumption.

Claims (7)

1. A distributed computer integrating computation and storage, comprising a central processing unit and one or more computing units, wherein the central processing unit runs an operating system and performs the necessary control operations, and the computing units are responsible for completing computing tasks, characterized in that, based on 3D packaging technology, the computing units are packaged inside DDR chips; each DDR chip packages multiple DRAM layers and one logic circuit layer; the logic circuit layer contains a DMA and one or more computing units; the computing units access the DRAM directly, while the central processing unit accesses the DRAM through a memory controller and the layered storage hierarchy; and the central processing unit and the computing units realize fast block data exchange through the DMA.
2. The distributed computer integrating computation and storage according to claim 1, characterized in that the computing unit is a massive parallel computation unit (MPCU) and the central processing unit is a computation-light lightweight processor core.
3. The distributed computer integrating computation and storage according to claim 1, characterized in that the central processing unit and the computing units share one piece of physical memory.
4. The distributed computer integrating computation and storage according to claim 1, characterized in that, within the DRAM, a region is carved out as a non-cacheable part; this region is used by the computing units to run their programs; the other memory regions are read-only for the computing units; and the central processing unit can access the entire memory.
5. The distributed computer integrating computation and storage according to claim 1, characterized in that the computing units are integrated in the DDR chips, and one central processing unit connects to multiple DDR chips.
6. The distributed computer integrating computation and storage according to claim 1, characterized in that the whole architecture provides an API for application developers and provides a driver for users to realize control over the computing units; application developers program the distributed computer integrating computation and storage through the provided API.
7. The distributed computer integrating computation and storage according to claim 6, characterized in that the API realizes the following functions so as to realize communication between the central processing unit (CPU) and the computing unit (MPCU):
(1) settargetmemory: when the CPU/MPCU executes this system call, the system establishes a memory region through which the CPU/MPCU passes data to the MPCU/CPU;
(2) switch2mpcu: suspend the CPU and switch from CPU work to MPCU work;
(3) switch2cpu: suspend the MPCU and switch from MPCU work to CPU work;
(4) cpu2mpcu: while switching from CPU to MPCU, transfer the CPU's data to the MPCU;
(5) mpcu2cpu: while switching from MPCU to CPU, transfer the MPCU's data to the CPU;
(6) freecpu: release the CPU;
(7) freempcu: release the MPCU;
(8) cacheflush: write the data in the cache back to memory;
(9) invalidate: after memory has been modified, invalidate the relevant parts of the cache;
(10) suspendmpcu: suspend the MPCU;
(11) suspendcpu: suspend the CPU.
CN201610077605.XA 2016-02-03 2016-02-03 A distributed computer integrating computation and storage Active CN105573959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610077605.XA CN105573959B (en) 2016-02-03 2016-02-03 A distributed computer integrating computation and storage


Publications (2)

Publication Number Publication Date
CN105573959A CN105573959A (en) 2016-05-11
CN105573959B 2018-10-19

Family

ID=55884114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610077605.XA Active CN105573959B (en) 2016-02-03 2016-02-03 A distributed computer integrating computation and storage

Country Status (1)

Country Link
CN (1) CN105573959B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10732866B2 (en) 2016-10-27 2020-08-04 Samsung Electronics Co., Ltd. Scaling out architecture for DRAM-based processing unit (DPU)
CN109144717A (en) * 2017-07-07 2019-01-04 广东网金控股股份有限公司 A dual-ARM-chip master control method and terminal device
CN109558370A (en) * 2017-09-23 2019-04-02 成都海存艾匹科技有限公司 Three-dimensional computations encapsulation
US10884672B2 (en) 2018-04-02 2021-01-05 Samsung Electronics Co., Ltd. NDP-server: a data-centric computing architecture based on storage server in data center
CN111581124A (en) * 2019-02-19 2020-08-25 睿宽智能科技有限公司 Method for shortening text exchange time and semiconductor device thereof
CN111174805A (en) * 2019-04-30 2020-05-19 奥特酷智能科技(南京)有限公司 Distributed centralized automatic driving system
CN112804297B (en) * 2020-12-30 2022-08-19 之江实验室 Assembled distributed computing and storage system and construction method thereof
CN113377293B (en) * 2021-07-08 2022-07-05 支付宝(杭州)信息技术有限公司 Method and device for calculating in storage device and storage device
CN114912587B (en) * 2022-06-09 2023-05-26 上海燧原科技有限公司 Neural network distributed training system, method, device, computing unit and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4101960A (en) * 1977-03-29 1978-07-18 Burroughs Corporation Scientific processor
CN101751244A (en) * 2010-01-04 2010-06-23 清华大学 Microprocessor
CN101882302A (en) * 2010-06-02 2010-11-10 北京理工大学 Motion blur image restoration system based on multi-core
CN102282542A (en) * 2008-10-14 2011-12-14 奇托尔·V·斯里尼瓦桑 TICC-paradigm to build formally verified parallel software for multi-core chips
CN104820657A (en) * 2015-05-14 2015-08-05 西安电子科技大学 Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Memristor: The Enabler of Computation-in-Memory Architecture for Big-Data"; Said Hamdioui et al.; 2015 International Conference on Memristive Systems (MEMRISYS); Dec. 31, 2015; see abstract lines 1-3, p. 2 right column paragraph 2, Fig. 1(e) *
"Research on Program Optimization Strategies in a GPU-Based Heterogeneous Parallel Environment"; Liu Xing et al.; Journal of Hubei University of Education (湖北第二师范学院学报); Aug. 31, 2010; Vol. 27, No. 8; entire document *

Also Published As

Publication number Publication date
CN105573959A (en) 2016-05-11

Similar Documents

Publication Publication Date Title
CN105573959B (en) A distributed computer integrating computation and storage
Ke et al. Near-memory processing in action: Accelerating personalized recommendation with axdimm
JP6974270B2 (en) Intelligent high bandwidth memory system and logic die for it
Binnig et al. The end of slow networks: It's time for a redesign
Siegl et al. Data-centric computing frontiers: A survey on processing-in-memory
CN104699631A (en) Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN106462501A (en) Hybrid memory cube system interconnect directory-based cache coherence methodology
Huang et al. Active-routing: Compute on the way for near-data processing
US11276459B2 (en) Memory die including local processor and global processor, memory device, and electronic device
US11789644B2 (en) Memory centric system incorporating computational memory
Hassan et al. Near data processing: Impact and optimization of 3D memory system architecture on the uncore
CN109791507A (en) Improve the mechanism of the data locality of distribution GPUS
Sun et al. 3D DRAM design and application to 3D multicore systems
CN108256643A (en) An HMC-based neural network computing device and method
KR20230041593A (en) Scalable address decoding scheme for cxl type-2 devices with programmable interleave granularity
US11966330B2 (en) Link affinitization to reduce transfer latency
Wang et al. Application defined on-chip networks for heterogeneous chiplets: An implementation perspective
WO2016078205A1 (en) Directory structure implementation method and system for host system
JP2022151611A (en) Integrated three-dimensional (3D) DRAM cache
CN109491934A (en) A control method for a storage management system with integrated computing function
CN101404177B (en) Computation type memory with data processing capability
CN112579487A (en) Techniques for decoupled access-execution near memory processing
Chen et al. GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory
Woo et al. Pragmatic integration of an SRAM row cache in heterogeneous 3-D DRAM architecture using TSV
CN100580804C (en) Dynamic RAM device with data-handling capacity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant