WO2018058430A1 - Chip having extensible memory - Google Patents

Chip having extensible memory

Info

Publication number
WO2018058430A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory module
memory
chip
processor
delay
Prior art date
Application number
PCT/CN2016/100795
Other languages
English (en)
French (fr)
Inventor
戴芬
胡杏
徐君
王元钢
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201680058689.0A priority Critical patent/CN108139971B/zh
Priority to PCT/CN2016/100795 priority patent/WO2018058430A1/zh
Priority to EP16917184.0A priority patent/EP3511837B1/en
Publication of WO2018058430A1 publication Critical patent/WO2018058430A1/zh
Priority to US16/365,677 priority patent/US10678738B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/40Bus structure
    • G06F13/4004Coupling between buses
    • G06F13/4027Coupling between buses using bus bridges
    • G06F13/4045Coupling between buses using bus bridges where the bus bridge performs an extender function
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1668Details of memory controller
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0038System on Chip

Definitions

  • The present invention relates to the field of integrated circuits, and more particularly to a memory-extensible chip.
  • In-memory computation is a computing method in which all data is loaded into memory. Loading all the data into memory avoids importing data from and exporting data to the hard disk, thereby increasing the processing speed of the chip.
  • In-memory computing requires large memory capacity and bandwidth, so a large number of memory modules must be connected to the processor. If every memory module is directly connected to the processor, each memory module can use only 1/N of the bandwidth (assuming N memory modules are directly connected to the processor). If multiple memory modules instead form a memory module set that is directly connected to the processor through one memory module in the set, each memory module set can use more bandwidth, but the average number of hops for the processor to access a memory module increases, which reduces the speed at which the processor accesses the memory modules.
  • In view of this, an embodiment of the present invention provides a memory-extensible chip. By integrating a processor and at least two memory module sets on a substrate and connecting the at least two memory module sets through a substrate network, the chip can integrate many memory modules while ensuring high memory bandwidth and fast access.
  • The memory-extensible chip includes: a substrate, and a processor, a first memory module set, and a second memory module set integrated on the substrate. The processor communicates with at least one memory module in the first memory module set through a first communication interface, and communicates with at least one memory module in the second memory module set through a second communication interface. The memory modules in the first memory module set communicate with the memory modules in the second memory module set via a substrate network, which is a communication network located inside the substrate.
  • The memory-extensible chip provided by this embodiment of the present invention connects multiple memory module sets through the substrate network, so that the processor can access the memory modules in the first memory module set through the second memory module set, thereby avoiding heavily loaded communication interfaces and reducing the delay of the processor accessing the memory modules.
  • Optionally, the processor includes multiple processor cores that communicate through an on-chip network, which is a communication network located outside the substrate; the first memory module set and the second memory module set each include multiple memory modules.
  • A chip configured with a multi-core processor and multiple memory modules can provide more communication paths, which helps avoid heavily loaded communication paths and thus reduces the delay of the processor accessing the memory modules.
  • Optionally, any two memory modules in the first memory module set communicate through the substrate network; any two memory modules in the second memory module set communicate through the substrate network.
  • Any two memory modules in each memory module set can thus be connected to each other through the substrate network, which provides more optional communication paths and helps balance the load of the entire chip.
  • Optionally, any memory module in the first memory module set and any memory module in the second memory module set communicate through the substrate network.
  • Optionally, the first communication interface and the second communication interface are located in different processor cores.
  • Optionally, when a first processor core of the processor needs to access a first memory module in the first memory module set, the first processor core is configured to determine, as the access path, the communication path with the fewest hops among the multiple communication paths from the first processor core to the first memory module.
  • In the memory-extensible chip provided by this embodiment of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the hop counts of its multiple communication paths to the memory module, which avoids complex path-selection computation and reduces the burden on the processor.
  • Optionally, when a second processor core of the processor needs to access a second memory module in the first memory module set, the second processor core is configured to determine, as the access path, the communication path with the smallest access delay among the multiple communication paths from the second processor core to the second memory module.
  • In the memory-extensible chip provided by this embodiment of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the delays of its multiple communication paths to the memory module, so the access path can be adjusted in time as the delays of the communication paths change, which helps balance the load of the entire chip.
  • Optionally, the second processor core is specifically configured to: determine a substrate network delay according to a memory delay and a memory hop count, where the memory delay is the average time required for data transfer between any two adjacent memory modules in the chip, and the memory hop count is the number of memory modules that data passes through on the multiple communication paths from the second processor core to the second memory module; determine an on-chip network delay according to a kernel delay and a kernel hop count, where the kernel delay is the average time required for data transfer between any two adjacent processor cores in the processor, and the kernel hop count is the number of processor cores that data passes through on the multiple communication paths from the second processor core to the second memory module; determine, according to the substrate network delay and the on-chip network delay, the access delays of the multiple communication paths from the second processor core to the second memory module; and select, from the multiple communication paths, the communication path with the smallest access delay as the access path.
  • In the memory-extensible chip provided by this embodiment of the present invention, the processor core that needs to perform a read or write operation determines the access path from the multiple communication paths according to the delays of the different types of communication networks on each path, so the delays of different communication paths can be determined more accurately.
  • Optionally, the second processor core is further configured to determine the substrate network delay according to the memory delay, the memory hop count, and a substrate network load parameter, where the substrate network load parameter is used to indicate the load of the substrate network.
  • The memory-extensible chip provided by this embodiment of the present invention determines the substrate network delay using the substrate network load parameter, so the delay of a communication path can be determined dynamically as the load of the substrate network changes.
  • Optionally, the second processor core is further configured to determine the on-chip network delay according to the kernel delay, the kernel hop count, and an on-chip network load parameter, where the on-chip network load parameter is used to indicate the load of the on-chip network.
  • The memory-extensible chip provided by this embodiment of the present invention determines the on-chip network delay using the on-chip network load parameter, so the delay of a communication path can be determined dynamically as the load of the on-chip network changes.
  • FIG. 1 is a schematic structural diagram of a memory-extensible chip to which an embodiment of the present invention applies;
  • FIG. 2 is a schematic structural diagram of a memory-extensible chip according to an embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of a memory-extensible chip according to another embodiment of the present invention.
  • FIG. 1 shows a schematic structural diagram of a memory-extensible chip 100 to which an embodiment of the present invention applies.
  • As shown in FIG. 1, the chip 100 includes a multi-core processor chip (Chip of Multi Processor, CMP), multiple three-dimensional (3D) dynamic random access memories (DRAMs), and a silicon interposer that integrates the CMP with the multiple DRAMs. The surface of the silicon substrate may be covered with a metal coating, and the DRAMs and the CMP may be flip-mounted on the silicon substrate.
  • Multiple micro-bumps for communication are disposed between the silicon substrate and the CMP, and the bandwidth of the CMP can be calculated from the pitch of the micro-bumps and the perimeter of the CMP.
  • The multiple processor cores in the CMP are connected by a network on chip (NoC), which is a communication network located outside the silicon substrate; two DRAMs, or a DRAM and the CMP, communicate through the substrate network.
  • The substrate network is a communication network located inside the silicon substrate. Because the NoC does not occupy resources inside the substrate, the substrate network can be used to provide abundant communication paths between the DRAMs and between the CMP and the DRAMs.
  • The chip 100 shown in FIG. 1 is only a schematic illustration, and the embodiments of the present invention are not limited thereto; the chip 100 may be a central processing unit (CPU) chip, a graphics processing unit (GPU) chip, or another type of chip.
  • The processor included in the memory-extensible chip of the embodiments of the present invention may be a single-core processor, and the included memory modules may be high bandwidth memories (HBMs). The number of integrated memory modules is not limited to the number shown in FIG. 1, and the relative positions between the memory modules and the processor, and among the memory modules, are not limited to the positions shown in FIG. 1. In addition, the silicon substrate is only an example; the memory-extensible chip provided by the embodiments of the present invention may also use a substrate made of another material, for example, a substrate made of a ceramic material.
  • FIG. 2 is a schematic structural diagram of a memory-extensible chip 200 according to an embodiment of the present invention. As shown in FIG. 2, the chip 200 includes: a substrate 240, and a processor 230, a first memory module set 210, and a second memory module set 220 integrated on the substrate.
  • The processor 230 communicates with at least one memory module in the first memory module set 210 through the first communication interface 250, and the processor 230 communicates with at least one memory module in the second memory module set 220 through the second communication interface 260; the memory modules in the first memory module set 210 communicate with the memory modules in the second memory module set 220 through a substrate network, which is a communication network located inside the substrate 240.
  • In this embodiment of the present invention, the substrate 240 is used to integrate the processor 230, the first memory module set 210, and the second memory module set 220, and can provide abundant substrate resources for constructing the substrate network. As shown in FIG. 2, the connections between the first memory module set 210 and the second memory module set 220 belong to the substrate network and are located inside the substrate 240; the connections between the first memory module set 210 and the processor 230 and between the second memory module set 220 and the processor 230 also belong to the substrate network and are located inside the substrate 240.
  • The first communication interface 250 and the second communication interface 260 may be micro-bumps.
  • When the processor 230 determines to access a memory module in the first memory module set 210, if the load of the first communication interface 250 is not high, the processor 230 can access the memory module directly through the first communication interface 250; in this way, the number of hops for the processor 230 to access the first memory module set 210 is the smallest, so the access delay of the processor 230 is the smallest. If the load of the first communication interface 250 is high and the load of the second communication interface 260 is not high at that moment, the processor can access the first memory module set 210 through the second communication interface 260 and the second memory module set 220, thereby avoiding the heavily loaded communication path and reducing the access delay of the processor 230 to the first memory module set 210.
  • The memory-extensible chip according to the embodiments of the present invention may also include more memory module sets, and each memory module set may include one memory module or multiple memory modules.
  • In the memory-extensible chip according to this embodiment of the present invention, multiple memory module sets are connected through the substrate network, so that the processor can access the memory modules in the first memory module set through the second memory module set, thereby avoiding heavily loaded communication interfaces and reducing the delay of the processor accessing the memory modules.
  • Optionally, the processor 230 includes multiple processor cores, the multiple processor cores communicate through an on-chip network, and the on-chip network is a communication network located outside the substrate 240;
  • the first memory module set 210 and the second memory module set 220 each include multiple memory modules.
  • The processor 230 may be a single-core processor, with the first memory module set 210 and the second memory module set 220 each including one memory module (case 1); or the processor 230 may be a multi-core processor, with the first memory module set 210 and the second memory module set 220 each including multiple memory modules (case 2).
  • Compared with a chip configured according to case 1, a chip configured according to case 2 gives the processor cores more communication paths for accessing the memory modules, so heavily loaded communication paths can be avoided and the delay of the processor accessing the memory modules can be reduced.
  • Optionally, any two memory modules in the first memory module set 210 communicate through the substrate network, and any two memory modules in the second memory module set 220 communicate through the substrate network.
  • When each memory module set includes multiple memory modules, any two memory modules in a set can be connected to each other through the substrate network, which provides more optional communication paths and helps balance the load of the entire chip.
  • Optionally, any memory module in the first memory module set 210 and any memory module in the second memory module set 220 communicate with each other through the substrate network, which provides more communication paths and helps balance the load of the entire chip.
  • Optionally, the first communication interface 250 and the second communication interface 260 are located in different processor cores.
  • If the communication interfaces are located close together on the processor, load balancing of the chip suffers. For example, for a multi-core processor, if the first communication interface and the second communication interface are located in the same processor core, all other processor cores must access the memory modules through that processor core, which places a heavy load on the communication paths through it. Therefore, different communication interfaces should be located in different processor cores, as far apart from each other as possible.
  • In the memory-extensible chip provided by this embodiment of the invention, different communication interfaces are located in different processor cores, which helps balance the load of the different communication paths of the entire chip.
  • Optionally, when a first processor core of the processor 230 needs to access a first memory module in the first memory module set 210, the first processor core is configured to determine, as the access path, the communication path with the fewest hops among the multiple communication paths from the first processor core to the first memory module.
  • When the first processor core needs to access the first memory module, the first processor core may determine, as the access path, the communication path with the fewest hops among the multiple communication paths from the first processor core to the first memory module, and read data stored in the first memory module or write data into the first memory module through that access path.
  • It should be understood that, in the embodiments of the present invention, “first processor core” and “first memory module” are generic terms: the first processor core may be any processor core in the processor 230 that needs to perform a read or write operation, and the first memory module may be any memory module in the first memory module set.
  • In the memory-extensible chip provided by this embodiment of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the hop counts of its multiple communication paths to the memory module, which avoids complex path-selection computation and reduces the burden on the processor.
  • Optionally, when a second processor core of the processor 230 needs to access a second memory module in the first memory module set 210, the second processor core is configured to determine, as the access path, the communication path with the smallest access delay among the multiple communication paths from the second processor core to the second memory module.
  • When the second processor core needs to access the second memory module, the second processor core may determine, as the access path, the communication path with the smallest access delay among the multiple communication paths from the second processor core to the second memory module, and read data stored in the second memory module or write data into the second memory module through that access path. The access delay may be an average access delay over a period of time, or the access delay at the current moment.
  • It should be understood that, in the embodiments of the present invention, “second processor core” and “second memory module” are generic terms: the second processor core may be any processor core in the processor 230 that needs to perform a read or write operation, and the second memory module may be any memory module in the first memory module set.
  • In the memory-extensible chip provided by this embodiment of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the delays of its multiple communication paths to the memory module, so the access path can be adjusted in time as the delays of the communication paths change, which helps balance the load of the entire chip.
  • Optionally, the second processor core is specifically configured to: determine the substrate network delay according to the memory delay and the memory hop count, where the memory delay is the average time required for data transfer between any two adjacent memory modules in the chip, and the memory hop count is the number of memory modules that data passes through on the multiple communication paths from the second processor core to the second memory module; determine the on-chip network delay according to the kernel delay and the kernel hop count, where the kernel delay is the average time required for data transfer between any two adjacent processor cores in the processor, and the kernel hop count is the number of processor cores that data passes through on the multiple communication paths from the second processor core to the second memory module; determine, according to the substrate network delay and the on-chip network delay, the access delays of the multiple communication paths from the second processor core to the second memory module; and select, from the multiple communication paths, the communication path with the smallest access delay as the access path.
  • Because substrate types and manufacturing processes differ, the transmission delays of the substrate network and the on-chip network may be the same or different; therefore, the transmission delay corresponding to each network must be determined separately.
  • For example, if data from the second processor core to the first communication interface 250 passes through 5 processor cores (the second processor core itself is not counted among the processor cores that the data passes through), the kernel hop count is 5. Assuming the average delay of each of the 5 hops is 1 millisecond (that is, the kernel delay is 1 millisecond), the on-chip network delay is 5 milliseconds. If the processor is a single-core processor, that is, the second processor core is the processor's only processor core, the on-chip network delay is 0.
  • For another example, if data transmitted from the first communication interface 250 to the second memory module passes through 5 memory modules (the second memory module is counted among the memory modules that the data passes through), the memory hop count is 5. Assuming the average delay of each of the 5 hops is 1 millisecond (that is, the memory delay is 1 millisecond), the substrate network delay is 5 milliseconds. If the second memory module is the only memory module in the first memory module set, the memory hop count is 1 and the substrate network delay is 1 millisecond.
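Summarizing the two examples in our own notation (the symbols below are not the patent's), the access delay of a path that crosses both networks is:

```latex
% H_core, H_mem: kernel and memory hop counts of the path;
% t_core, t_mem: average per-hop delays of the on-chip network
% and the substrate network, respectively.
T_{\mathrm{path}}
  = \underbrace{H_{\mathrm{core}}\, t_{\mathrm{core}}}_{\text{on-chip network delay}}
  + \underbrace{H_{\mathrm{mem}}\, t_{\mathrm{mem}}}_{\text{substrate network delay}},
\qquad
\text{e.g. } 5 \times 1\,\mathrm{ms} + 5 \times 1\,\mathrm{ms} = 10\,\mathrm{ms}.
```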
  • The second processor core may determine, according to the substrate network delay and the on-chip network delay, the access delays of the multiple communication paths from the second processor core to the second memory module, and then select, from the multiple communication paths, the communication path with the smallest access delay as the access path.
  • Therefore, in the memory-extensible chip provided by this embodiment, the processor core that needs to perform a read or write operation determines the access path from the multiple communication paths according to the delays of the different types of communication networks on each path, so the delays of different communication paths can be determined more accurately.
  • Optionally, the second processor core is further configured to: determine the substrate network delay according to the memory delay, the memory hop count, and the substrate network load parameter, where the substrate network load parameter is used to indicate the load of the substrate network.
  • Optionally, the second processor core is further configured to: determine the on-chip network delay according to the kernel delay, the kernel hop count, and the on-chip network load parameter, where the on-chip network load parameter is used to indicate the load of the on-chip network.
  • The greater the load of a network, the greater its transmission delay; therefore, heavily loaded networks should be avoided for communication as much as possible. In the memory-extensible chip provided by this embodiment, when determining the substrate network delay, the processor core may determine it according to the memory delay, the memory hop count, and the substrate network load parameter, where the substrate network load parameter is positively correlated with the load of the substrate network. The second processor core can obtain the load parameter through a period of learning; for example, the second processor core obtains the substrate network load parameter by analyzing the relationship between the load of the substrate network and the delay of the substrate network over a period of time, and then determines the substrate network delay by multiplying the memory delay, the memory hop count, and the substrate network load parameter.
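The patent does not fix how this learning is done; one simple possibility, shown purely as an assumption, is to take the load parameter as the ratio of the mean per-hop delay observed under load to the nominal per-hop delay:

```python
# Hypothetical estimation of the substrate network load parameter from
# per-hop delays observed over a period of time. The averaging scheme is
# an assumption; the text only requires that the parameter grow with the
# load of the substrate network.

def estimate_load_parameter(observed_hop_delays_ms, nominal_hop_delay_ms):
    """Mean observed per-hop delay divided by the nominal (unloaded)
    per-hop delay; >= 1.0 whenever the network is congested."""
    mean_observed = sum(observed_hop_delays_ms) / len(observed_hop_delays_ms)
    return mean_observed / nominal_hop_delay_ms

def substrate_network_delay(memory_delay_ms, memory_hops, load_parameter):
    # Memory delay x memory hop count x substrate network load parameter,
    # the multiplication described above.
    return memory_delay_ms * memory_hops * load_parameter

samples = [1.0, 1.4, 1.2, 1.8]  # per-hop delays seen while loaded (invented)
alpha = estimate_load_parameter(samples, nominal_hop_delay_ms=1.0)
print(substrate_network_delay(1.0, 5, alpha))  # 5 hops -> 6.75 ms
```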
  • The on-chip network load parameter can be obtained in a similar way, and the on-chip network delay can be determined accordingly; details are not repeated here.
  • Therefore, the memory-extensible chip determines the substrate network delay using the substrate network load parameter and the on-chip network delay using the on-chip network load parameter, so the delay of a communication path can be determined dynamically as the loads of the substrate network and the on-chip network change.
  • As shown in FIG. 3, the chip 300 includes a silicon substrate and a 16-core processor whose processor cores are numbered C1 to C16. The chip 300 further includes four memory module sets: the first memory module set includes four memory modules numbered M1 to M4, the second memory module set includes four memory modules numbered M5 to M8, the third memory module set includes four memory modules numbered M9 to M12, and the fourth memory module set includes four memory modules numbered M13 to M16. M1 is connected to C1 through the first communication interface, M5 is connected to C13 through the second communication interface, M9 is connected to C16 through the third communication interface, and M13 is connected to C4 through the fourth communication interface. The remaining lines between the memory modules indicate that the memory modules are connected to one another through the substrate network.
  • Thus, through the communication connections between different memory module sets, the chip 300 provided by this embodiment of the present invention can provide richer communication paths for the processor, which helps balance the load of the chip 300.
  • Assuming that only adjacent processor cores can communicate directly through the on-chip network, when C3 needs to access M5, C3 may first determine the communication paths with the fewest hops among the multiple communication paths from C3 to M5, for example the first communication path C3-C4-M13-M1-M4-M6-M5 and the second communication path C3-C2-C1-C5-C9-C13-M5, each with a hop count of 6. C3 may select either of the two communication paths as the access path without considering the load of each path, which avoids complex path-selection computation and reduces the burden on the processor.
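As a toy illustration of this hop-count rule (the two path lists are copied from the example above; the list-of-nodes representation is our own):

```python
# Minimal-hop selection for the C3 -> M5 example: a path is a list of
# nodes, and its hop count is the number of edges it traverses.

paths = [
    ["C3", "C4", "M13", "M1", "M4", "M6", "M5"],  # first communication path
    ["C3", "C2", "C1", "C5", "C9", "C13", "M5"],  # second communication path
]
hop_counts = [len(path) - 1 for path in paths]    # both paths have 6 hops
access_path = paths[hop_counts.index(min(hop_counts))]
print(min(hop_counts), access_path)  # any minimal-hop path may be chosen
```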
  • C3 may also determine the access path for accessing M5 according to the delay of each communication path; for example, the access path may be determined according to the Choose Faster Path (CFP) algorithm. The CFP algorithm is parsed as follows (a reconstructed sketch follows the list):
  • 1. current node denotes the core node (that is, the processor core) that currently initiates the access request; destination node denotes the target memory node (that is, the memory module to be accessed);
  • 2. close_pillar means first routing to the other core node closest to the current core node; far_pillar means first routing to the core node closest to the target memory node, that is, a core node farther away from the core node that currently initiates the access request;
  • 3. total_close denotes the total delay, obtained by adding the substrate network delay and the on-chip network delay, when the communication path is selected in the close_pillar manner, where dest_close_NoC denotes the hop count on the on-chip network, NoC_latency the average delay of each hop of the on-chip network, dest_close_NiSI the hop count on the substrate network, and NiSI_latency the average delay of each hop of the substrate network; total_far denotes the total delay, obtained by adding the substrate network delay and the on-chip network delay, when the communication path is selected in the far_pillar manner, where dest_far_NoC denotes the hop count on the on-chip network and dest_far_NiSI the hop count on the substrate network;
  • 4. By comparing total_close with total_far, the communication path with the smaller total delay is selected as the access path.
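Since the published CFP pseudocode survives only as images, the following is a reconstruction sketched from the parse above. Only the variable names (total_close, total_far, dest_close_NoC, dest_close_NiSI, dest_far_NoC, dest_far_NiSI, NoC_latency, NiSI_latency) come from the text; the function shape and the tie-breaking rule are assumptions.

```python
# Reconstructed sketch of the CFP comparison described above: compute the
# total delay of the close_pillar and far_pillar routes and keep the
# smaller one (ties assumed to go to close_pillar).

def choose_faster_path(dest_close_NoC, dest_close_NiSI,
                       dest_far_NoC, dest_far_NiSI,
                       NoC_latency, NiSI_latency):
    total_close = dest_close_NoC * NoC_latency + dest_close_NiSI * NiSI_latency
    total_far = dest_far_NoC * NoC_latency + dest_far_NiSI * NiSI_latency
    return "close_pillar" if total_close <= total_far else "far_pillar"

# C3 -> M5 from FIG. 3: via the close pillar C4 the data crosses 1 core
# and 5 memory modules; via the far pillar C13 it crosses 5 cores and
# 1 memory module (hop counts taken from the example above).
print(choose_faster_path(dest_close_NoC=1, dest_close_NiSI=5,
                         dest_far_NoC=5, dest_far_NiSI=1,
                         NoC_latency=1.0, NiSI_latency=1.0))
# Both totals are 6 ms here, so the assumed tie-break picks close_pillar.
```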
  • The above communication path selection method does not consider the load condition of the network, which simplifies the path-selection steps and reduces the burden on the processor.
  • In practice, the greater the load of a network, the greater the transmission delay; to reflect the delays of different communication paths more accurately, the network load must be considered. For example, the on-chip network delay can be calculated as dest_close_NoC × NoC_latency × on-chip network load parameter, and the substrate network delay as dest_close_NiSI × NiSI_latency × substrate network load parameter.
  • In practical applications, the processor can determine the on-chip network load parameter by collecting the load and delay of the on-chip network over a period of time, and can also determine the substrate network load parameter by collecting the load and delay of the substrate network over a period of time, so the total delay of a communication path can be calculated more accurately. How the processor specifically determines the load parameters can follow related methods in the prior art and is not described here again.
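Combining the load parameters with the CFP comparison gives a load-aware variant; again only a sketch, since the patent does not spell out the combined form beyond the multiplications above:

```python
# Load-aware variant of the CFP comparison: each network's delay term is
# additionally scaled by its learned load parameter (assumed combination).

def choose_faster_path_loaded(dest_close_NoC, dest_close_NiSI,
                              dest_far_NoC, dest_far_NiSI,
                              NoC_latency, NiSI_latency,
                              noc_load, nisi_load):
    total_close = (dest_close_NoC * NoC_latency * noc_load
                   + dest_close_NiSI * NiSI_latency * nisi_load)
    total_far = (dest_far_NoC * NoC_latency * noc_load
                 + dest_far_NiSI * NiSI_latency * nisi_load)
    return "close_pillar" if total_close <= total_far else "far_pillar"

# With a congested substrate network (nisi_load = 2.0), the far pillar,
# which crosses fewer memory modules, now wins the C3 -> M5 example.
print(choose_faster_path_loaded(1, 5, 5, 1, 1.0, 1.0,
                                noc_load=1.0, nisi_load=2.0))  # far_pillar
```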
  • It can be understood that the chip provided by the embodiments of the present invention can be applied to computing devices with computing and storage capabilities, such as computers and servers. Those skilled in the art will know that, in addition to the chip described in the foregoing embodiments of the present invention, such a computing device may also include other components such as a hard disk and a network card. For example, the computing device can receive data through a communication interface such as a network card, and compute and store the received data using the chip. Details are not repeated here.
  • The systems, devices, and methods disclosed in the embodiments provided herein may be implemented in other manners. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Environmental & Geological Engineering (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Multi Processors (AREA)

Abstract

A memory-extensible chip (200), the chip (200) including: a substrate (240), and a processor (230), a first memory module set (210), and a second memory module set (220) integrated on the substrate (240). The processor (230) communicates with at least one memory module in the first memory module set (210) through a first communication interface (250), and the processor (230) communicates with at least one memory module in the second memory module set (220) through a second communication interface (260). The memory modules in the first memory module set (210) communicate with the memory modules in the second memory module set (220) through a substrate network, which is a communication network located inside the substrate (240). This enables the processor (230) to access the memory modules in the first memory module set (210) through the second memory module set (220), thereby reducing the delay of the processor (230) accessing the memory modules while ensuring high memory bandwidth.

Description

Chip having extensible memory
Technical field
The present invention relates to the field of integrated circuits, and in particular to a memory-extensible chip.
Background
In-memory computation is a computing method in which all data is loaded into memory. Loading all the data into memory avoids importing data from and exporting data to the hard disk, thereby increasing the processing speed of the chip.
In-memory computing requires large memory capacity and bandwidth, so a large number of memory modules must be connected to the processor. If every memory module is directly connected to the processor, each memory module can use only 1/N of the bandwidth (assuming N memory modules are directly connected to the processor). If multiple memory modules instead form a memory module set that is directly connected to the processor through one memory module in the set, each memory module set can use more bandwidth, but the average number of hops for the processor to access a memory module increases, which reduces the speed at which the processor accesses the memory modules.
Therefore, how to integrate more memory modules on a chip while ensuring high memory bandwidth and low access latency is a problem that urgently needs to be solved.
Summary
In view of this, embodiments of the present invention provide a memory-extensible chip. By integrating a processor and at least two memory module sets on a substrate and connecting the at least two memory module sets through a substrate network, the chip can integrate many memory modules while ensuring high memory bandwidth and fast access.
The memory-extensible chip includes: a substrate, and a processor, a first memory module set, and a second memory module set integrated on the substrate. The processor communicates with at least one memory module in the first memory module set through a first communication interface, and communicates with at least one memory module in the second memory module set through a second communication interface. The memory modules in the first memory module set communicate with the memory modules in the second memory module set through a substrate network, which is a communication network located inside the substrate.
In the memory-extensible chip provided by the embodiments of the present invention, multiple memory module sets are connected through the substrate network, so that the processor can access the memory modules in the first memory module set through the second memory module set, thereby avoiding heavily loaded communication interfaces and reducing the delay of the processor accessing the memory modules.
Optionally, the processor includes multiple processor cores, the multiple processor cores communicate through an on-chip network, and the on-chip network is a communication network located outside the substrate; the first memory module set and the second memory module set each include multiple memory modules.
A chip configured with a multi-core processor and multiple memory modules can provide more communication paths, which helps avoid heavily loaded communication paths and thus reduces the delay of the processor accessing the memory modules.
Optionally, any two memory modules in the first memory module set communicate through the substrate network, and any two memory modules in the second memory module set communicate through the substrate network.
When the first memory module set and the second memory module set each include multiple memory modules, any two memory modules in each memory module set can be connected to each other through the substrate network, which provides more optional communication paths and helps balance the load of the entire chip.
Optionally, any memory module in the first memory module set and any memory module in the second memory module set communicate through the substrate network.
This provides more optional communication paths and helps balance the load of the entire chip.
Optionally, the first communication interface and the second communication interface are located in different processor cores.
When the load cannot be predicted, placing different communication interfaces in different processor cores prevents the communication interfaces from being concentrated in one processor core, which would place a heavy load on the communication paths through that core.
Optionally, when a first processor core of the processor needs to access a first memory module in the first memory module set, the first processor core is configured to determine, as the access path, the communication path with the fewest hops among the multiple communication paths from the first processor core to the first memory module.
In the memory-extensible chip provided by the embodiments of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the hop counts of its multiple communication paths to the memory module, which avoids complex path-selection computation and reduces the burden on the processor.
Optionally, when a second processor core of the processor needs to access a second memory module in the first memory module set, the second processor core is configured to determine, as the access path, the communication path with the smallest access delay among the multiple communication paths from the second processor core to the second memory module.
In the memory-extensible chip provided by the embodiments of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the delays of its multiple communication paths to the memory module, so the access path can be adjusted in time as the delays of the communication paths change, which helps balance the load of the entire chip.
Optionally, the second processor core is specifically configured to: determine a substrate network delay according to a memory delay and a memory hop count, where the memory delay is the average time required for data transfer between any two adjacent memory modules in the chip, and the memory hop count is the number of memory modules that data passes through on the multiple communication paths from the second processor core to the second memory module; determine an on-chip network delay according to a kernel delay and a kernel hop count, where the kernel delay is the average time required for data transfer between any two adjacent processor cores in the processor, and the kernel hop count is the number of processor cores that data passes through on the multiple communication paths from the second processor core to the second memory module; determine, according to the substrate network delay and the on-chip network delay, the access delays of the multiple communication paths from the second processor core to the second memory module; and select, from the multiple communication paths, the communication path with the smallest access delay as the access path.
In the memory-extensible chip provided by the embodiments of the present invention, the processor core that needs to perform a read or write operation determines the access path from the multiple communication paths according to the delays of the different types of communication networks on each path, so the delays of different communication paths can be determined more accurately.
Optionally, the second processor core is further configured to: determine the substrate network delay according to the memory delay, the memory hop count, and a substrate network load parameter, where the substrate network load parameter is used to indicate the load of the substrate network.
In the memory-extensible chip provided by the embodiments of the present invention, the substrate network delay is determined using the substrate network load parameter, so the delay of a communication path can be determined dynamically as the load of the substrate network changes.
Optionally, the second processor core is further configured to: determine the on-chip network delay according to the kernel delay, the kernel hop count, and an on-chip network load parameter, where the on-chip network load parameter is used to indicate the load of the on-chip network.
In the memory-extensible chip provided by the embodiments of the present invention, the on-chip network delay is determined using the on-chip network load parameter, so the delay of a communication path can be determined dynamically as the load of the on-chip network changes.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed in the embodiments of the present invention are briefly introduced below. Evidently, the drawings described below are only drawings of some embodiments of the present invention.
FIG. 1 is a schematic structural diagram of a memory-extensible chip to which an embodiment of the present invention applies;
FIG. 2 is a schematic structural diagram of a memory-extensible chip according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a memory-extensible chip according to another embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Evidently, the described embodiments are only some rather than all of the embodiments of the present invention.
FIG. 1 shows a schematic structural diagram of a memory-extensible chip 100 to which an embodiment of the present invention applies. As shown in FIG. 1, the chip 100 includes a multi-core processor chip (Chip of Multi Processor, CMP), multiple three-dimensional (3D) dynamic random access memories (Dynamic Random Access Memory, DRAM), and a silicon interposer that integrates the CMP with the multiple DRAMs. The surface of the silicon substrate may be covered with a metal coating, and the DRAMs and the CMP may be flip-mounted on the silicon substrate. Multiple micro-bumps for communication are disposed between the silicon substrate and the CMP, and the bandwidth of the CMP can be calculated from the pitch of the micro-bumps and the perimeter of the CMP.
The multiple processor cores in the CMP are connected through a network on chip (Network on Chip, NoC), which is a communication network located outside the silicon substrate. Two DRAMs, or a DRAM and the CMP, communicate through the substrate network, which is a communication network located inside the silicon substrate. Because the NoC does not occupy resources inside the substrate, the substrate network can be used to provide abundant communication paths between the DRAMs and between the CMP and the DRAMs.
The chip 100 shown in FIG. 1 is only a schematic illustration, and the embodiments of the present invention are not limited thereto. The chip 100 may be a central processing unit (Central Processing Unit, CPU) chip, a graphics processing unit (Graphics Processing Unit, GPU) chip, or another type of chip. The processor included in the memory-extensible chip provided by the embodiments of the present invention may be a single-core processor, and the included memory modules may be high bandwidth memories (High Bandwidth Memory, HBM). The number of integrated memory modules is not limited to the number shown in FIG. 1, and the relative positions between the memory modules and the processor and among the memory modules are not limited to the positions shown in FIG. 1. In addition, the silicon substrate is only an example; the memory-extensible chip provided by the embodiments of the present invention may also use a substrate made of another material, for example, a substrate made of a ceramic material.
FIG. 2 shows a schematic structural diagram of a memory-extensible chip 200 according to an embodiment of the present invention. As shown in FIG. 2, the chip 200 includes:
a substrate 240, and a processor 230, a first memory module set 210, and a second memory module set 220 integrated on the substrate;
the processor 230 communicates with at least one memory module in the first memory module set 210 through a first communication interface 250, and the processor 230 communicates with at least one memory module in the second memory module set 220 through a second communication interface 260;
the memory modules in the first memory module set 210 communicate with the memory modules in the second memory module set 220 through a substrate network, and the substrate network is a communication network located inside the substrate 240.
It should be understood that the terms “first” and “second” in the embodiments of the present invention are used only to distinguish different items and do not otherwise limit the embodiments of the present invention.
In this embodiment of the present invention, the substrate 240 is used to integrate the processor 230, the first memory module set 210, and the second memory module set 220, and can provide abundant substrate resources for constructing the substrate network. As shown in FIG. 2, the connections between the first memory module set 210 and the second memory module set 220 belong to the substrate network and are located inside the substrate 240; the connections between the first memory module set 210 and the processor 230 and between the second memory module set 220 and the processor 230 also belong to the substrate network and are located inside the substrate 240.
The first communication interface 250 and the second communication interface 260 may be micro-bumps.
When the processor 230 determines to access a memory module in the first memory module set 210, if the load of the first communication interface 250 is not high, the processor 230 can access the memory module directly through the first communication interface 250; in this way, the number of hops for the processor 230 to access the first memory module set 210 is the smallest, so the access delay of the processor 230 is the smallest. If the load of the first communication interface 250 is high and the load of the second communication interface 260 is not high at that moment, the processor can access the first memory module set 210 through the second communication interface 260 and the second memory module set 220, thereby avoiding the heavily loaded communication path and reducing the access delay of the processor 230 to the first memory module set 210.
The above embodiment is only an example, and the embodiments of the present invention are not limited thereto. The memory-extensible chip according to the embodiments of the present invention may include more memory module sets, and each memory module set may include one memory module or multiple memory modules.
According to the memory-extensible chip of the embodiments of the present invention, multiple memory module sets are connected through the substrate network, so that the processor can access the memory modules in the first memory module set through the second memory module set, thereby avoiding heavily loaded communication interfaces and reducing the delay of the processor accessing the memory modules.
Optionally, the processor 230 includes multiple processor cores, the multiple processor cores communicate through an on-chip network, and the on-chip network is a communication network located outside the substrate 240;
the first memory module set 210 and the second memory module set 220 each include multiple memory modules.
The processor 230 may be a single-core processor, with the first memory module set 210 and the second memory module set 220 each including one memory module (case 1); or the processor 230 may be a multi-core processor, with the first memory module set 210 and the second memory module set 220 each including multiple memory modules (case 2). Compared with a chip configured according to case 1, a chip configured according to case 2 gives the processor cores more communication paths for accessing the memory modules, so heavily loaded communication paths can be avoided and the delay of the processor accessing the memory modules can be reduced.
Optionally, any two memory modules in the first memory module set 210 communicate through the substrate network;
any two memory modules in the second memory module set 220 communicate through the substrate network.
When the first memory module set 210 and the second memory module set 220 include multiple memory modules, any two memory modules in each memory module set can be connected to each other through the substrate network, which provides more optional communication paths and helps balance the load of the entire chip.
Optionally, any memory module in the first memory module set 210 and any memory module in the second memory module set 220 communicate through the substrate network. This provides more communication paths and helps balance the load of the entire chip.
Optionally, the first communication interface 250 and the second communication interface 260 are located in different processor cores.
If the communication interfaces are located close together on the processor, load balancing of the chip suffers. For example, for a multi-core processor, if the first communication interface and the second communication interface are located in the same processor core, all other processor cores must access the memory modules through that processor core, which places a heavy load on the communication paths through it. Therefore, different communication interfaces should be located in different processor cores, as far apart from each other as possible. In the memory-extensible chip provided by the embodiments of the present invention, different communication interfaces are located in different processor cores, which helps balance the load of the different communication paths of the entire chip.
Optionally, when a first processor core of the processor 230 needs to access a first memory module in the first memory module set 210, the first processor core is configured to determine, as the access path, the communication path with the fewest hops among the multiple communication paths from the first processor core to the first memory module.
When the first processor core needs to access the first memory module, the first processor core may determine, as the access path, the communication path with the fewest hops among the multiple communication paths from the first processor core to the first memory module, and read data stored in the first memory module or write data into the first memory module through that access path. It should be understood that, in the embodiments of the present invention, “first processor core” and “first memory module” are generic terms: the first processor core may be any processor core in the processor 230 that needs to perform a read or write operation, and the first memory module may be any memory module in the first memory module set.
In the memory-extensible chip provided by the embodiments of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the hop counts of its multiple communication paths to the memory module, which avoids complex path-selection computation and reduces the burden on the processor.
Optionally, when a second processor core of the processor 230 needs to access a second memory module in the first memory module set 210,
the second processor core is configured to determine, as the access path, the communication path with the smallest access delay among the multiple communication paths from the second processor core to the second memory module.
When the second processor core needs to access the second memory module, the second processor core may determine, as the access path, the communication path with the smallest access delay among the multiple communication paths from the second processor core to the second memory module, and read data stored in the second memory module or write data into the second memory module through that access path. The access delay may be an average access delay over a period of time, or the access delay at the current moment. It should be understood that, in the embodiments of the present invention, “second processor core” and “second memory module” are generic terms: the second processor core may be any processor core in the processor 230 that needs to perform a read or write operation, and the second memory module may be any memory module in the first memory module set.
In the memory-extensible chip provided by the embodiments of the present invention, the processor core that needs to perform a read or write operation determines the access path according to the delays of its multiple communication paths to the memory module, so the access path can be adjusted in time as the delays of the communication paths change, which helps balance the load of the entire chip.
Optionally, the second processor core is specifically configured to:
determine a substrate network delay according to a memory delay and a memory hop count, where the memory delay is the average time required for data transfer between any two adjacent memory modules in the chip, and the memory hop count is the number of memory modules that data passes through on the multiple communication paths from the second processor core to the second memory module;
determine an on-chip network delay according to a kernel delay and a kernel hop count, where the kernel delay is the average time required for data transfer between any two adjacent processor cores in the processor, and the kernel hop count is the number of processor cores that data passes through on the multiple communication paths from the second processor core to the second memory module;
determine, according to the substrate network delay and the on-chip network delay, the access delays of the multiple communication paths from the second processor core to the second memory module;
select, from the multiple communication paths, the communication path with the smallest access delay as the access path.
Because substrate types and manufacturing processes differ, the transmission delays of the substrate network and the on-chip network may be the same or different; therefore, the transmission delay corresponding to each network must be determined separately.
For example, if data from the second processor core to the first communication interface 250 passes through 5 processor cores (the second processor core itself is not counted among the processor cores that the data passes through), the kernel hop count is 5. Assuming the average delay of each of the 5 hops is 1 millisecond (that is, the kernel delay is 1 millisecond), the on-chip network delay is 5 milliseconds. If the processor is a single-core processor, that is, the second processor core is the processor's only processor core, the on-chip network delay is 0.
For another example, if data transmitted from the first communication interface 250 to the second memory module passes through 5 memory modules (the second memory module is counted among the memory modules that the data passes through), the memory hop count is 5. Assuming the average delay of each of the 5 hops is 1 millisecond (that is, the memory delay is 1 millisecond), the substrate network delay is 5 milliseconds. If the second memory module is the only memory module in the first memory module set, the memory hop count is 1 and the substrate network delay is 1 millisecond.
The second processor core may determine, according to the substrate network delay and the on-chip network delay, the access delays of the multiple communication paths from the second processor core to the second memory module, and then select, from the multiple communication paths, the communication path with the smallest access delay as the access path.
Therefore, in the memory-extensible chip provided by the embodiments of the present invention, the processor core that needs to perform a read or write operation determines the access path from the multiple communication paths according to the delays of the different types of communication networks on each path, so the delays of different communication paths can be determined more accurately.
Optionally, the second processor core is further configured to:
determine the substrate network delay according to the memory delay, the memory hop count, and a substrate network load parameter, where the substrate network load parameter is used to indicate the load of the substrate network.
Optionally, the second processor core is further configured to:
determine the on-chip network delay according to the kernel delay, the kernel hop count, and an on-chip network load parameter, where the on-chip network load parameter is used to indicate the load of the on-chip network.
The greater the load of a network, the greater the transmission delay; therefore, heavily loaded networks should be avoided for communication as much as possible. In the memory-extensible chip provided by the embodiments of the present invention, when determining the substrate network delay, the processor core may determine it according to the memory delay, the memory hop count, and the substrate network load parameter, where the substrate network load parameter is positively correlated with the load of the substrate network. The second processor core can obtain the load parameter through a period of learning; for example, the second processor core obtains the substrate network load parameter by analyzing the relationship between the load of the substrate network and the delay of the substrate network over a period of time, and then determines the substrate network delay by multiplying the memory delay, the memory hop count, and the substrate network load parameter.
The on-chip network load parameter can be obtained in a similar way, and the on-chip network delay can be determined accordingly; details are not repeated here.
Therefore, the memory-extensible chip provided by the embodiments of the present invention determines the substrate network delay using the substrate network load parameter and the on-chip network delay using the on-chip network load parameter, so the delay of a communication path can be determined dynamically as the loads of the substrate network and the on-chip network change.
A memory-extensible chip provided by an embodiment of the present invention and the method for selecting an access path on the chip are described in detail below.
As shown in FIG. 3, the chip 300 includes a silicon substrate and a 16-core processor whose processor cores are numbered C1 to C16. The chip 300 further includes four memory module sets: the first memory module set includes four memory modules numbered M1 to M4, the second memory module set includes four memory modules numbered M5 to M8, the third memory module set includes four memory modules numbered M9 to M12, and the fourth memory module set includes four memory modules numbered M13 to M16. M1 is connected to C1 through the first communication interface, M5 is connected to C13 through the second communication interface, M9 is connected to C16 through the third communication interface, and M13 is connected to C4 through the fourth communication interface. The remaining lines between the memory modules indicate that the memory modules are connected to one another through the substrate network. Thus, through the communication connections between different memory module sets, the chip 300 provided by this embodiment of the present invention can provide richer communication paths for the processor, which helps balance the load of the chip 300.
Assuming that only adjacent processor cores in the processor can communicate directly through the on-chip network, when C3 needs to access M5, C3 may first determine the communication paths with the fewest hops among the multiple communication paths from C3 to M5, for example the first communication path C3-C4-M13-M1-M4-M6-M5 and the second communication path C3-C2-C1-C5-C9-C13-M5, each with a hop count of 6. C3 may select either of the two communication paths as the access path without considering the load of each path, which avoids complex path-selection computation and reduces the burden on the processor.
C3 may also determine the access path for accessing M5 according to the delay of each communication path; for example, the access path may be determined according to the Choose Faster Path (CFP) algorithm.
The CFP algorithm is as follows:
[The CFP pseudocode is published as two images in the original document (PCTCN2016100795-appb-000001 and PCTCN2016100795-appb-000002) and is not reproduced here.]
The CFP algorithm is parsed as follows:
1. current node denotes the core node (that is, the processor core) that currently initiates the access request; destination node denotes the target memory node (that is, the memory module to be accessed);
2. close_pillar means first routing to the other core node closest to the current core node; far_pillar means first routing to the core node closest to the target memory node, that is, a core node farther away from the core node that currently initiates the access request;
3. total_close denotes the total delay, obtained by adding the substrate network delay and the on-chip network delay, when the communication path is selected in the close_pillar manner, where dest_close_NoC denotes the hop count on the on-chip network, NoC_latency the average delay of each hop of the on-chip network, dest_close_NiSI the hop count on the substrate network, and NiSI_latency the average delay of each hop of the substrate network; total_far denotes the total delay, obtained by adding the substrate network delay and the on-chip network delay, when the communication path is selected in the far_pillar manner, where dest_far_NoC denotes the hop count on the on-chip network, NoC_latency the average delay of each hop of the on-chip network, dest_far_NiSI the hop count on the substrate network, and NiSI_latency the average delay of each hop of the substrate network.
4. By comparing total_close with total_far, the communication path with the smaller total delay is selected as the access path.
The above communication path selection method does not consider the load condition of the network, which simplifies the path-selection steps and reduces the burden on the processor.
In practice, the greater the load of a network, the greater the transmission delay; to reflect the delays of different communication paths more accurately, the load condition of the network must be considered.
For example, the on-chip network delay can be calculated as dest_close_NoC × NoC_latency × on-chip network load parameter, and the substrate network delay as dest_close_NiSI × NiSI_latency × substrate network load parameter. In practical applications, the processor can determine the on-chip network load parameter by collecting the load and delay of the on-chip network over a period of time, and can also determine the substrate network load parameter by collecting the load and delay of the substrate network over a period of time, so the total delay of a communication path can be calculated more accurately. How the processor specifically determines the load parameters can follow related methods in the prior art and is not described here again.
It can be understood that the chip provided by the embodiments of the present invention can be applied to computing devices with computing and storage capabilities, such as computers and servers. Those skilled in the art will know that, in addition to the chip described in the foregoing embodiments of the present invention, such a computing device may also include other components such as a hard disk and a network card. For example, the computing device can receive data through a communication interface such as a network card, and compute and store the received data using the chip. Details are not repeated here.
Those of ordinary skill in the art will appreciate that the units and steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
The systems, devices, and methods disclosed in the embodiments provided in this application may be implemented in other manners. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
The foregoing is only a specific implementation of the present invention, but the protection scope of the present invention is not limited thereto.

Claims (10)

  1. A memory-extensible chip, characterized by comprising: a substrate, and a processor, a first memory module set, and a second memory module set integrated on the substrate;
    wherein the processor communicates with at least one memory module in the first memory module set through a first communication interface, and the processor communicates with at least one memory module in the second memory module set through a second communication interface;
    and the memory modules in the first memory module set communicate with the memory modules in the second memory module set through a substrate network, the substrate network being a communication network located inside the substrate.
  2. The chip according to claim 1, characterized in that:
    the processor comprises multiple processor cores, the multiple processor cores communicate through an on-chip network, and the on-chip network is a communication network located outside the substrate;
    the first memory module set and the second memory module set each comprise multiple memory modules.
  3. The chip according to claim 2, characterized in that:
    any two memory modules in the first memory module set communicate through the substrate network;
    any two memory modules in the second memory module set communicate through the substrate network.
  4. The chip according to claim 2 or 3, characterized in that any memory module in the first memory module set communicates with any memory module in the second memory module set through the substrate network.
  5. The chip according to any one of claims 2 to 4, characterized in that the first communication interface and the second communication interface are located in different processor cores.
  6. The chip according to any one of claims 1 to 5, characterized in that:
    when a first processor core of the processor needs to access a first memory module in the first memory module set, the first processor core is configured to determine, as an access path, the communication path with the fewest hops among the multiple communication paths from the first processor core to the first memory module.
  7. The chip according to any one of claims 1 to 5, characterized in that:
    when a second processor core of the processor needs to access a second memory module in the first memory module set,
    the second processor core is configured to determine, as an access path, the communication path with the smallest access delay among the multiple communication paths from the second processor core to the second memory module.
  8. The chip according to claim 7, characterized in that the second processor core is specifically configured to:
    determine a substrate network delay according to a memory delay and a memory hop count, wherein the memory delay is the average time required for data transfer between any two adjacent memory modules in the chip, and the memory hop count is the number of memory modules that data passes through on the multiple communication paths from the second processor core to the second memory module;
    determine an on-chip network delay according to a kernel delay and a kernel hop count, wherein the kernel delay is the average time required for data transfer between any two adjacent processor cores in the processor, and the kernel hop count is the number of processor cores that data passes through on the multiple communication paths from the second processor core to the second memory module;
    determine, according to the substrate network delay and the on-chip network delay, the access delays of the multiple communication paths from the second processor core to the second memory module;
    select, from the multiple communication paths, the communication path with the smallest access delay as the access path.
  9. The chip according to claim 8, characterized in that the second processor core is further configured to:
    determine the substrate network delay according to the memory delay, the memory hop count, and a substrate network load parameter, wherein the substrate network load parameter is used to indicate the load of the substrate network.
  10. The chip according to claim 8 or 9, characterized in that the second processor core is further configured to:
    determine the on-chip network delay according to the kernel delay, the kernel hop count, and an on-chip network load parameter, wherein the on-chip network load parameter is used to indicate the load of the on-chip network.
PCT/CN2016/100795 2016-09-29 2016-09-29 Chip having extensible memory WO2018058430A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201680058689.0A 2016-09-29 2016-09-29 Chip having extensible memory
PCT/CN2016/100795 WO2018058430A1 (zh) 2016-09-29 2016-09-29 一种可扩展内存的芯片
EP16917184.0A EP3511837B1 (en) 2016-09-29 2016-09-29 Chip having extensible memory
US16/365,677 US10678738B2 (en) 2016-09-29 2019-03-27 Memory extensible chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/100795 WO2018058430A1 (zh) 2016-09-29 2016-09-29 一种可扩展内存的芯片

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/365,677 Continuation US10678738B2 (en) 2016-09-29 2019-03-27 Memory extensible chip

Publications (1)

Publication Number Publication Date
WO2018058430A1 true WO2018058430A1 (zh) 2018-04-05

Family

ID=61763090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/100795 WO2018058430A1 (zh) 2016-09-29 2016-09-29 一种可扩展内存的芯片

Country Status (4)

Country Link
US (1) US10678738B2 (zh)
EP (1) EP3511837B1 (zh)
CN (1) CN108139971B (zh)
WO (1) WO2018058430A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021822A1 (zh) * 2020-07-30 2022-02-03 西安紫光国芯半导体有限公司 Near-memory computing module and method, near-memory computing network and construction method
CN116484391A (zh) * 2023-06-25 2023-07-25 四川华鲲振宇智能科技有限责任公司 BMC firmware dynamic storage method and system based on ad hoc network

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11171115B2 (en) 2019-03-18 2021-11-09 Kepler Computing Inc. Artificial intelligence processor with three-dimensional stacked memory
US11836102B1 (en) 2019-03-20 2023-12-05 Kepler Computing Inc. Low latency and high bandwidth artificial intelligence processor
US11152343B1 (en) * 2019-05-31 2021-10-19 Kepler Computing, Inc. 3D integrated ultra high-bandwidth multi-stacked memory
US11844223B1 (en) 2019-05-31 2023-12-12 Kepler Computing Inc. Ferroelectric memory chiplet as unified memory in a multi-dimensional packaging
CN111177069A (zh) * 2020-02-27 2020-05-19 浙江亿邦通信科技有限公司 Internally powered computing-power unit chip
JP2023543466A (ja) * 2020-09-30 2023-10-16 ホアウェイ・テクノロジーズ・カンパニー・リミテッド Circuit, chip, and electronic device
US11855043B1 (en) 2021-05-06 2023-12-26 Eliyan Corporation Complex system-in-package architectures leveraging high-bandwidth long-reach die-to-die connectivity over package substrates
US11791233B1 (en) 2021-08-06 2023-10-17 Kepler Computing Inc. Ferroelectric or paraelectric memory and logic chiplet with thermal management in a multi-dimensional packaging
US11842986B1 (en) * 2021-11-25 2023-12-12 Eliyan Corporation Multi-chip module (MCM) with interface adapter circuitry
US11841815B1 (en) 2021-12-31 2023-12-12 Eliyan Corporation Chiplet gearbox for low-cost multi-chip module applications
US12001386B2 (en) * 2022-07-22 2024-06-04 Dell Products L.P. Disabling processor cores for best latency in a multiple core processor

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8108647B2 (en) * 2009-01-29 2012-01-31 International Business Machines Corporation Digital data architecture employing redundant links in a daisy chain of component modules
CN103081434A (zh) * 2010-08-24 2013-05-01 华为技术有限公司 Smart memory
CN103927274A (zh) * 2013-01-10 2014-07-16 株式会社东芝 Storage device
WO2014178856A1 (en) * 2013-04-30 2014-11-06 Hewlett-Packard Development Company, L.P. Memory network
CN105718380A (zh) * 2015-07-29 2016-06-29 上海磁宇信息科技有限公司 Cell array computing system

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69509717T2 (de) * 1994-08-31 1999-11-11 Motorola Inc Modular chip-select control circuit
US7669027B2 (en) 2004-08-19 2010-02-23 Micron Technology, Inc. Memory command delay balancing in a daisy-chained memory topology
US20060282599A1 (en) * 2005-06-10 2006-12-14 Yung-Cheng Chiu SLI adaptor card and method for mounting the same to motherboard
US7636835B1 (en) * 2006-04-14 2009-12-22 Tilera Corporation Coupling data in a parallel processing environment
US8045546B1 (en) * 2008-07-08 2011-10-25 Tilera Corporation Configuring routing in mesh networks
CN102184139A (zh) * 2010-06-22 2011-09-14 上海盈方微电子有限公司 Hardware dynamic memory pool management method and system
US8705368B1 (en) * 2010-12-03 2014-04-22 Google Inc. Probabilistic distance-based arbitration
US9448940B2 (en) * 2011-10-28 2016-09-20 The Regents Of The University Of California Multiple core computer processor with globally-accessible local memories
US9111151B2 (en) * 2012-02-17 2015-08-18 National Taiwan University Network on chip processor with multiple cores and routing method thereof
US20150103822A1 (en) * 2013-10-15 2015-04-16 Netspeed Systems Noc interface protocol adaptive to varied host interface protocols
US8850108B1 (en) * 2014-06-04 2014-09-30 Pure Storage, Inc. Storage cluster
CN104484021A (zh) * 2014-12-23 2015-04-01 浪潮电子信息产业股份有限公司 Server system with extensible memory
CN105632545B (zh) 2015-03-27 2018-04-06 上海磁宇信息科技有限公司 3D memory chip
KR102419647B1 (ko) * 2015-09-10 2022-07-11 삼성전자주식회사 Apparatus and method for transmitting packets
US9837391B2 (en) * 2015-12-11 2017-12-05 Intel Corporation Scalable polylithic on-package integratable apparatus and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8108647B2 (en) * 2009-01-29 2012-01-31 International Business Machines Corporation Digital data architecture employing redundant links in a daisy chain of component modules
CN103081434A (zh) * 2010-08-24 2013-05-01 华为技术有限公司 Smart memory
CN103927274A (zh) * 2013-01-10 2014-07-16 株式会社东芝 Storage device
WO2014178856A1 (en) * 2013-04-30 2014-11-06 Hewlett-Packard Development Company, L.P. Memory network
CN105718380A (zh) * 2015-07-29 2016-06-29 上海磁宇信息科技有限公司 Cell array computing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3511837A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021822A1 (zh) * 2020-07-30 2022-02-03 西安紫光国芯半导体有限公司 Near-memory computing module and method, near-memory computing network and construction method
CN116484391A (zh) * 2023-06-25 2023-07-25 四川华鲲振宇智能科技有限责任公司 BMC firmware dynamic storage method and system based on ad hoc network
CN116484391B (zh) * 2023-06-25 2023-08-22 四川华鲲振宇智能科技有限责任公司 BMC firmware dynamic storage method and system based on ad hoc network

Also Published As

Publication number Publication date
EP3511837A4 (en) 2019-09-18
CN108139971B (zh) 2020-10-16
EP3511837A1 (en) 2019-07-17
CN108139971A (zh) 2018-06-08
US20190220434A1 (en) 2019-07-18
EP3511837B1 (en) 2023-01-18
US10678738B2 (en) 2020-06-09

Similar Documents

Publication Publication Date Title
WO2018058430A1 (zh) Chip having extensible memory
JP7478229B2 (ja) Active bridge chiplet with integrated cache
US20190146788A1 (en) Memory device performing parallel arithmetic processing and memory module including the same
US9009648B2 (en) Automatic deadlock detection and avoidance in a system interconnect by capturing internal dependencies of IP cores using high level specification
US20150261698A1 (en) Memory system, memory module, memory module access method, and computer system
JP2017500810A (ja) Automatic pipelining of NoC channels to meet timing and/or performance
US10922258B2 (en) Centralized-distributed mixed organization of shared memory for neural network processing
Lant et al. Toward FPGA-based HPC: Advancing interconnect technologies
Schmidt et al. Exploring time and energy for complex accesses to a hybrid memory cube
US11966330B2 (en) Link affinitization to reduce transfer latency
JP2020087413A (ja) Memory system
CN116610630B (zh) Multi-core system based on network-on-chip and data transmission method
US20230222062A1 (en) Apparatus and method for cache-coherence
Bakhoda et al. Designing on-chip networks for throughput accelerators
Kim et al. Memory Network: Enabling Technology for Scalable Near-Data Computing
Kang et al. A high-throughput and low-latency interconnection network for multi-core clusters with 3-D stacked L2 tightly-coupled data memory
Vivet et al. Interconnect challenges for 3D multi-cores: From 3D network-on-chip to cache interconnects
US20240231912A1 (en) Resource-capability-and-connectivity-based workload performance improvement system
US20240231936A1 (en) Resource-capability-and-connectivity-based workload performance system
CN117915670B (zh) Chip structure integrating storage and computation
US20230315334A1 (en) Providing fine grain access to package memory
US20230369171A1 (en) Computing device and electronic device guaranteeing bandwidth per computational performance
US20230317561A1 (en) Scalable architecture for multi-die semiconductor packages
Geyer et al. Near to Far: An Evaluation of Disaggregated Memory for In-Memory Data Processing
Pandey et al. Performance investigation of packet-based communication in 3D-memories

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16917184

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2016917184

Country of ref document: EP

Effective date: 20190408