CN105573959B - A distributed computer integrating computing and storage - Google Patents
A distributed computer integrating computing and storage
- Publication number
- CN105573959B (application CN201610077605.XA)
- Authority
- CN
- China
- Prior art keywords
- mpcu
- cpu
- central processing
- processing unit
- computing unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/161—Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning
Abstract
A distributed computer architecture integrating computing and storage comprises a central processing unit and one or more computing units. The computing units are packaged into DDR chips using 3D packaging technology; each DDR chip stacks multiple DRAM layers and one logic-circuit layer, and the logic-circuit layer contains a DMA engine and one or more computing units. The computing units access the DRAM directly, while the central processing unit accesses the DRAM through the layered memory hierarchy via a memory controller, and the central processing unit and the computing units exchange blocks of data quickly through the DMA engine. The central processing unit runs the operating system and performs the necessary control operations, while the computing units are responsible for the computational tasks. Based on 3D packaging technology, the invention realizes a cooperative configuration of computing units and central processing unit, converting the centralized computing model into a distributed one and greatly reducing the load on the central processing unit.
Description
Technical field
The invention belongs to the field of computer architecture and relates to an architecture integrating computing and storage and to the design of a programming model for that architecture, and in particular to a distributed computer integrating computing and storage.
Background technology
In recent decades, the rapid development of processors and memories has brought great convenience to people's lives. While microprocessor clock frequencies have kept rising and performance has soared, memory capacity per chip has kept setting new records and access times have kept falling. However, microprocessor performance has been improving at roughly 60% per year, while DRAM access time has been improving at only about 7% per year. Computer designers therefore face an ever-growing performance gap between processor and memory. At present this gap is the largest obstacle to improving overall computer performance, and memory performance has become the biggest bottleneck of the whole computer system.
As process technology has advanced, introducing more and more levels of cache into the architecture has become feasible. Caches shorten the latency of the memory system, but the number of cache levels cannot grow without bound; in the worst case, too many levels even increase system latency. Cache bandwidth is also limited, and adding further cache levels yields only a small bandwidth gain.
In a traditional computer architecture, the computing units are mainly the CPU, GPU and DSP. During data processing, data is fetched from memory, traverses the memory hierarchy level by level to reach the computing unit, is processed there, and is finally written back to memory. Such a structure is called a computation-centric architecture. As times have changed, so have the demands placed on computer applications. We have now entered the era of big data, and the amount of data computers must process keeps growing. Relative to the computation itself, the share of time spent on data access keeps increasing; besides the latency and bandwidth limitations described above, power consumption is becoming an ever more severe problem. The traditional computation-centric architecture is increasingly unsuited to present application demands.
An effective solution is to place additional computing units close to the memory, so that they do not need to traverse the memory hierarchy to access data, turning the computation-centric model into a data-centric one. Early architectures integrating computing and storage generally integrated the processing logic and the memory on a single die. However, because the fabrication processes for processors and for memories are completely different, producing a die that integrates both is very expensive, and no commercial computer has been built with that architecture.
As microelectronics has continued to progress, the 3D packaging technologies that have now emerged allow DRAM and computing units fabricated in different processes to be integrated. For example, Micron has proposed the HMC (Hybrid Memory Cube) structure. HMC uses 3D packaging to stack multiple DRAM layers on top of a logic-circuit layer. The logic layer is responsible for sequencing and refreshing the DRAM layers, data routing, error correction, and high-speed interconnection with the host processor. The layers are interconnected by TSVs (Through-Silicon Vias) and fine copper pillars. TSVs provide thousands of vertical interconnections and support multi-layer stacking, which greatly shortens data-transmission distances and reduces power consumption. Stacking multiple DRAM layers yields a high-density pin arrangement. Compared with ordinary DDR3, HMC improves energy efficiency more than six-fold. HMC increases storage density, shortens access latency and raises data bandwidth. For multi-core systems, HMC improves request responsiveness and, through an abstracted interface, implements a new DRAM control strategy that reduces the interaction between CPU and DRAM.
The emergence of these new technologies brings new opportunities for research on computer architectures that integrate computing and storage. The concept of NDP (Near-Data Processing), proposed in recent years, uses 3D packaging to integrate computing units into the logic layer of a 3D DRAM, so that computation can be completed where the data resides and data movement is reduced. However, concrete architecture designs embodying this idea are still being explored, and no specific design has yet been realized.
Summary of the invention
To overcome the shortcomings of the prior art described above, the purpose of the present invention is to provide a distributed computer integrating computing and storage, based on 3D packaging technology, in which computing units and a central processing unit are configured cooperatively.
To achieve the above goal, the technical solution adopted by the present invention is as follows:
A distributed computer integrating computing and storage comprises a central processing unit and one or more computing units. The central processing unit runs the operating system and performs the necessary control operations; the computing units are responsible for completing the computational tasks.
The computing unit is a massively parallel computing unit (MPCU), such as a SIMD machine, a GPU-like multithreaded machine, a configurable array, or a multi-core system; the central processing unit is a non-computational lightweight processor core, such as an ARM processor or a similar lightweight non-computational processor.
The computing units are packaged into DDR chips using 3D packaging technology. Each DDR chip packages multiple DRAM layers and one logic-circuit layer; the logic layer contains a DMA engine and one or more computing units. The computing units access the DRAM directly, while the central processing unit accesses the DRAM through the layered memory hierarchy via a memory controller, and the central processing unit and the computing units exchange blocks of data quickly through the DMA engine.
In the DRAM, one region is carved out as a non-cacheable part; this region is used by the computing units to run their programs, while the other memory regions are read-only for the computing units. The central processing unit can access the entire memory.
The computing units are integrated into the DDR chips, and one central processing unit is connected to multiple DDR chips. Application programs are distributed appropriately between the central processing unit and the computing units so as to maximize program performance.
The overall architecture provides an API for application developers and a driver through which users control the computing units; application developers program the distributed computer integrating computing and storage proposed by the invention through the provided API.
Compared with the prior art, the present invention greatly reduces the load on the central processing unit by converting the centralized computing model into a distributed one. Because its computational burden is relieved, the central processing unit no longer needs large computational resources, so its area and power consumption both drop substantially. Studies have shown that exchanging data between the processor and memory consumes about 19.5 times as much power as exchanging data inside the processor. In this computer architecture the central processing unit no longer needs to access memory frequently, which greatly reduces the power consumed by memory accesses. The computing units integrated into the 3D DRAM can access the DRAM directly; thanks to the advantages of 3D packaging, their DRAM access speed rises markedly and their power consumption drops substantially. Compared with a traditional computing architecture, the distributed computer architecture therefore has large advantages in both power and performance, while also reducing the area of the central processing unit.
Description of the drawings
Fig. 1: the architecture integrating computing and storage and its technical support.
Fig. 2: overall design idea of the architecture integrating computing and storage.
Fig. 3: implementation of the architecture integrating computing and storage.
Fig. 4: data-exchange scheme in the architecture integrating computing and storage.
Fig. 5: programming example for the architecture integrating computing and storage.
Fig. 6: the distributed computer integrating computing and storage.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, on top of a traditional computer architecture an additional computing unit is placed close to the memory. This computing unit can be an MPCU (massively parallel computing unit) and is responsible for the bulk of the computation, while the CPU in the architecture only needs to run the operating system and perform simple control operations. As shown in Fig. 1, the MPCU and the DDR memory are packaged in the same chip using 3D packaging technology; that is, the MPCU corresponds to the logic-circuit layer (logic die) of an HMC.
Fig. 2 shows the overall design idea of the architecture integrating computing and storage. The whole architecture consists of a non-computational processor core (computation-light core), a massively parallel computing unit (MPCU), a memory controller and memory (DRAM). The MPCU can be a SIMD machine, a GPU-like multithreaded machine, a configurable array, or a multi-core system. The MPCU is integrated inside the memory and can access it directly, whereas the non-computational processor core accesses the memory through the memory controller. The non-computational processor core is responsible for program-flow control and can also run an operating system or a virtual machine, while the MPCU runs the computation-heavy or memory-access-intensive parts of a program.
The non-computational processor core and the massively parallel computing unit share one physical memory. In the DRAM, one region is carved out as a non-cacheable part; this region is used by the MPCU to run its programs, while the other memory regions are read-only for the MPCU. The non-computational processor core can access the entire memory. The whole system provides an API for programmers and a driver that lets users control the MPCU. Application developers program the distributed computer integrating computing and storage proposed by the invention through the provided API.
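As a purely illustrative sketch of this layout (the patent fixes no concrete addresses or sizes, so every name and number below is an assumption), the shared physical memory of one DDR chip could be described as follows:

```c
/* Hypothetical memory map of one DDR chip; all addresses and sizes are assumed. */
#define DRAM_BASE       0x00000000UL
#define DRAM_SIZE       0x40000000UL   /* 1 GiB of stacked DRAM, assumed            */

#define MPCU_PROG_BASE  0x38000000UL   /* non-cacheable region carved out for the   */
#define MPCU_PROG_SIZE  0x08000000UL   /* MPCU to run its program in (128 MiB, assumed) */

/* Access rules implied by the design:
 *   MPCU: read/write inside [MPCU_PROG_BASE, MPCU_PROG_BASE + MPCU_PROG_SIZE),
 *         read-only in the rest of the DRAM.
 *   CPU : access to the entire DRAM through its cache hierarchy, except that
 *         the MPCU program region is never cached. */
```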
The non-computational processor core can be connected to multiple MPCU-equipped memories, and these units complete the main computational tasks. This greatly reduces the load on the central processing unit and converts the centralized computing model into a distributed one.
Fig. 3 shows a concrete implementation of the distributed computer integrating computing and storage. The whole system is divided into two parts: the main processor in the system is called the CPU, and the computing unit integrated in the memory is called the MPCU. The chip holding the CPU is called the host chip and also contains the CPU cache, the CPU TLB, the internal system bus and the DDR physical-layer protocol. The chip holding the MPCU is the DDR chip, which also contains the MPCU TLB and a DMA (Direct Memory Access) engine. The DMA is used to exchange blocks of data quickly between CPU and MPCU. The CPU and the MPCU each reserve a program-buffer region in the DDR for running their own programs. The whole system exposes an API and a driver, through which application programs control the system. The CPU can run an operating system. An application program can be distributed appropriately between CPU and MPCU so as to maximize its performance.
Fig. 4 shows how data is exchanged between CPU and MPCU. In the whole system, the CPU and the MPCU share the same physical memory. The CPU obtains data through its cache, whereas the MPCU obtains data directly from memory. When the CPU wants to pass data to the MPCU, it can also realize a fast block transfer directly through the DMA, bypassing the CPU cache. The CPU first translates the virtual address into a physical address through the CPU TLB, and the DMA then copies the data to be exchanged from the CPU's program buffer to the MPCU's program buffer. Passing data from the MPCU to the CPU works the same way. The DMA is integrated in the DDR and does not follow the cache-coherence protocol; therefore, after a data transfer has modified memory, the CPU must invalidate its cache so that subsequent operations read the correct data.
The CPU distributes binary executable programs to the MPCU. Communication between CPU and MPCU is realized through system calls. To implement this communication, eleven system-call functions are designed (a sketch of possible C declarations follows the list):
(1) settargetmemory: when the CPU/MPCU executes this system call, the system establishes a memory region through which the CPU/MPCU passes data to the MPCU/CPU.
(2) switch2mpcu: suspends the CPU and switches from CPU work to MPCU work.
(3) switch2cpu: suspends the MPCU and switches from MPCU work to CPU work.
(4) cpu2mpcu: while switching from CPU to MPCU, transfers data from the CPU to the MPCU.
(5) mpcu2cpu: while switching from MPCU to CPU, transfers data from the MPCU to the CPU.
(6) freecpu: releases the CPU.
(7) freempcu: releases the MPCU.
(8) cacheflush: writes the data in the cache back to memory.
(9) invalidate: after the CPU has modified memory, invalidates the relevant part of the cache.
(10) suspendmpcu: suspends the MPCU.
(11) suspendcpu: suspends the CPU.
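For illustration only, these eleven system calls might be exposed to application code as C wrappers along the following lines. The patent names the calls but fixes neither argument types nor return values, so every signature below is an assumption, collected here in a hypothetical header, mpcu_api.h:

```c
/* mpcu_api.h -- hypothetical user-space wrappers for the eleven system calls.
 * All signatures are assumptions; the patent only names the calls. */
#ifndef MPCU_API_H
#define MPCU_API_H

#include <stddef.h>

int settargetmemory(void *buf, size_t len); /* set up a shared region for CPU<->MPCU data transfer */
int switch2mpcu(void);                      /* suspend the CPU and switch to MPCU work             */
int switch2cpu(void);                       /* suspend the MPCU and switch to CPU work             */
int cpu2mpcu(const void *src, size_t len);  /* switch to the MPCU and DMA data from CPU to MPCU    */
int mpcu2cpu(const void *src, size_t len);  /* switch to the CPU and DMA data from MPCU to CPU     */
int freecpu(void);                          /* release the CPU                                     */
int freempcu(void);                         /* release the MPCU                                    */
int cacheflush(void);                       /* write dirty cache data back to memory               */
int invalidate(void *addr, size_t len);     /* invalidate cached copies of a modified memory range */
int suspendmpcu(void);                      /* suspend the MPCU                                    */
int suspendcpu(void);                       /* suspend the CPU                                     */

#endif /* MPCU_API_H */
```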
These system-call functions are provided to application developers as an API, and the corresponding functionality is implemented in the operating system, which then enables control of the CPU and the MPCU device. As an example, Fig. 5 shows how the Mpeg2decode program runs on the computer integrating computing and storage.
Mpeg2 is a standard for lossy video and audio compression; the Mpeg2decode program decodes compressed audio/video files. As shown in Fig. 5, the CPU and the MPCU are first initialized separately. The MPCU is suspended and waits to be woken up by the CPU. The CPU obtains one frame of data, establishes a data-buffer region and passes the data to the MPCU. The MPCU is then woken up, receives the data, performs the decoding work, and passes the result back to the CPU when it is done. The CPU receives the decoded data, writes it to a file and waits for the next frame. This loop repeats until all data has been decoded. Throughout the process, the data transfers and the communication between CPU and MPCU rely on the system-call functions added by the design.
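A CPU-side sketch of this Fig. 5 flow, written against the hypothetical wrappers in mpcu_api.h above, might look as follows. File names, buffer sizes and error handling are illustrative assumptions, and the MPCU-side decode program is assumed to hand control back with mpcu2cpu() after each frame.

```c
#include <stdio.h>
#include <stdlib.h>
#include "mpcu_api.h"                        /* hypothetical header sketched above */

#define FRAME_BYTES   (64 * 1024)            /* assumed size of one coded frame    */
#define DECODED_BYTES (1024 * 1024)          /* assumed size of one decoded frame  */

int main(void)
{
    unsigned char *coded   = malloc(FRAME_BYTES);
    unsigned char *decoded = malloc(DECODED_BYTES);
    FILE *in  = fopen("clip.m2v", "rb");     /* compressed input, assumed  */
    FILE *out = fopen("clip.yuv", "wb");     /* decoded output, assumed    */
    size_t n;

    settargetmemory(coded, FRAME_BYTES);     /* region the CPU uses to pass a frame to the MPCU    */
    settargetmemory(decoded, DECODED_BYTES); /* region the MPCU writes the decoded frame back into */
    suspendmpcu();                           /* MPCU sits suspended until the CPU hands it work    */

    while ((n = fread(coded, 1, FRAME_BYTES, in)) > 0) {
        cacheflush();                        /* push the frame out of the CPU cache so the DMA sees it */
        cpu2mpcu(coded, n);                  /* DMA the frame over and switch to the MPCU; control     */
                                             /* returns here once the MPCU-side code calls mpcu2cpu()  */
        invalidate(decoded, DECODED_BYTES);  /* DMA changed memory behind the cache: drop stale lines  */
        fwrite(decoded, 1, DECODED_BYTES, out);
    }

    freempcu();                              /* release both units once every frame is decoded */
    freecpu();
    fclose(in); fclose(out);
    free(coded); free(decoded);
    return 0;
}
```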
Fig. 6 shows the distributed computer integrating computing and storage built from this architecture. The same CPU can be connected to multiple DDR chips. Each DDR chip is a 3D-packaged DRAM whose logic layer integrates an MPCU. Compared with the traditional centralized computer architecture, the distributed computer integrating computing and storage has the advantages of high speed and low power consumption.
Claims (7)
1. A distributed computer integrating computing and storage, comprising a central processing unit and one or more computing units, wherein the central processing unit runs an operating system and performs the necessary control operations, and the computing units are responsible for completing the computational tasks, characterized in that the computing units are packaged into DDR chips using 3D packaging technology, each DDR chip packages multiple DRAM layers and one logic-circuit layer, the logic-circuit layer comprises a DMA and one or more computing units, the computing units access the DRAM directly, the central processing unit accesses the DRAM through the layered memory hierarchy via a memory controller, and the central processing unit and the computing units realize fast block data exchange through the DMA.
2. The distributed computer integrating computing and storage according to claim 1, characterized in that the computing unit is a massively parallel computing unit (MPCU) and the central processing unit is a non-computational lightweight processor core.
3. The distributed computer integrating computing and storage according to claim 1, characterized in that the central processing unit and the computing units share one physical memory.
4. The distributed computer integrating computing and storage according to claim 1, characterized in that in the DRAM one region is carved out as a non-cacheable part, this region is used by the computing units to run their programs, the other memory regions are read-only for the computing units, and the central processing unit can access the entire memory.
5. The distributed computer integrating computing and storage according to claim 1, characterized in that the computing units are integrated in the DDR chips and one central processing unit is connected to multiple DDR chips.
6. The distributed computer integrating computing and storage according to claim 1, characterized in that the overall architecture provides an API for application developers and a driver through which users control the computing units, and application developers program the distributed computer integrating computing and storage through the provided API.
7. The distributed computer integrating computing and storage according to claim 6, characterized in that the API implements the following functions to realize the communication between the central processing unit (CPU) and the computing unit (MPCU):
(1) settargetmemory: when the CPU/MPCU executes this system call, the system establishes a memory region through which the CPU/MPCU passes data to the MPCU/CPU;
(2) switch2mpcu: suspends the CPU and switches from CPU work to MPCU work;
(3) switch2cpu: suspends the MPCU and switches from MPCU work to CPU work;
(4) cpu2mpcu: while switching from CPU to MPCU, transfers data from the CPU to the MPCU;
(5) mpcu2cpu: while switching from MPCU to CPU, transfers data from the MPCU to the CPU;
(6) freecpu: releases the CPU;
(7) freempcu: releases the MPCU;
(8) cacheflush: writes the data in the cache back to memory;
(9) invalidate: after the CPU has modified memory, invalidates the relevant part of the cache;
(10) suspendmpcu: suspends the MPCU;
(11) suspendcpu: suspends the CPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610077605.XA CN105573959B (en) | 2016-02-03 | 2016-02-03 | A distributed computer integrating computing and storage
Publications (2)
Publication Number | Publication Date |
---|---|
CN105573959A CN105573959A (en) | 2016-05-11 |
CN105573959B true CN105573959B (en) | 2018-10-19 |
Family
ID=55884114
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610077605.XA Active CN105573959B (en) | 2016-02-03 | 2016-02-03 | A distributed computer integrating computing and storage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105573959B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10732866B2 (en) | 2016-10-27 | 2020-08-04 | Samsung Electronics Co., Ltd. | Scaling out architecture for DRAM-based processing unit (DPU) |
CN109144717A (en) * | 2017-07-07 | 2019-01-04 | 广东网金控股股份有限公司 | One kind is based on double ARM chip master control methods and terminal device |
CN109558370A (en) * | 2017-09-23 | 2019-04-02 | 成都海存艾匹科技有限公司 | Three-dimensional computations encapsulation |
US10884672B2 (en) | 2018-04-02 | 2021-01-05 | Samsung Electronics Co., Ltd. | NDP-server: a data-centric computing architecture based on storage server in data center |
CN111581124A (en) * | 2019-02-19 | 2020-08-25 | 睿宽智能科技有限公司 | Method for shortening text exchange time and semiconductor device thereof |
CN111174805A (en) * | 2019-04-30 | 2020-05-19 | 奥特酷智能科技(南京)有限公司 | Distributed centralized automatic driving system |
CN112804297B (en) * | 2020-12-30 | 2022-08-19 | 之江实验室 | Assembled distributed computing and storage system and construction method thereof |
CN113377293B (en) * | 2021-07-08 | 2022-07-05 | 支付宝(杭州)信息技术有限公司 | Method and device for calculating in storage device and storage device |
CN114912587B (en) * | 2022-06-09 | 2023-05-26 | 上海燧原科技有限公司 | Neural network distributed training system, method, device, computing unit and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4101960A (en) * | 1977-03-29 | 1978-07-18 | Burroughs Corporation | Scientific processor |
CN101751244A (en) * | 2010-01-04 | 2010-06-23 | 清华大学 | Microprocessor |
CN101882302A (en) * | 2010-06-02 | 2010-11-10 | 北京理工大学 | Motion blur image restoration system based on multi-core |
CN102282542A (en) * | 2008-10-14 | 2011-12-14 | 奇托尔·V·斯里尼瓦桑 | TICC-paradigm to build formally verified parallel software for multi-core chips |
CN104820657A (en) * | 2015-05-14 | 2015-08-05 | 西安电子科技大学 | Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor |
Non-Patent Citations (2)
Title |
---|
"Memristor: The Enabler of Computation-in-Memory Architecture for Big-Data"; Said Hamdioui et al.; 2015 International Conference on Memristive Systems (MEMRISYS); 2015-12-31; see abstract, lines 1-3; page 2, right column, paragraph 2; Fig. 1(e) *
"Research on Program Optimization Strategies in a GPU-Based Heterogeneous Parallel Environment" (in Chinese); Liu Xing et al.; Journal of Hubei University of Education; 2010-08-31; Vol. 27, No. 8; entire document *
Also Published As
Publication number | Publication date |
---|---|
CN105573959A (en) | 2016-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105573959B (en) | A distributed computer integrating computing and storage | |
Ke et al. | Near-memory processing in action: Accelerating personalized recommendation with axdimm | |
JP6974270B2 (en) | Intelligent high bandwidth memory system and logic die for it | |
Binnig et al. | The end of slow networks: It's time for a redesign | |
Siegl et al. | Data-centric computing frontiers: A survey on processing-in-memory | |
CN104699631A (en) | Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor) | |
CN106462501A (en) | Hybrid memory cube system interconnect directory-based cache coherence methodology | |
Huang et al. | Active-routing: Compute on the way for near-data processing | |
US11276459B2 (en) | Memory die including local processor and global processor, memory device, and electronic device | |
US11789644B2 (en) | Memory centric system incorporating computational memory | |
Hassan et al. | Near data processing: Impact and optimization of 3D memory system architecture on the uncore | |
CN109791507A (en) | Improve the mechanism of the data locality of distribution GPUS | |
Sun et al. | 3D DRAM design and application to 3D multicore systems | |
CN108256643A (en) | A kind of neural network computing device and method based on HMC | |
KR20230041593A (en) | Scalable address decoding scheme for cxl type-2 devices with programmable interleave granularity | |
US11966330B2 (en) | Link affinitization to reduce transfer latency | |
Wang et al. | Application defined on-chip networks for heterogeneous chiplets: An implementation perspective | |
WO2016078205A1 (en) | Directory structure implementation method and system for host system | |
JP2022151611A (en) | Integrated three-dimensional (3D) DRAM cache | |
CN109491934A (en) | A kind of storage management system control method of integrated computing function | |
CN101404177B (en) | Computation type memory with data processing capability | |
CN112579487A (en) | Techniques for decoupled access-execution near memory processing | |
Chen et al. | GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory | |
Woo et al. | Pragmatic integration of an SRAM row cache in heterogeneous 3-D DRAM architecture using TSV | |
CN100580804C (en) | Dynamic RAM device with data-handling capacity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||