CN108897714B - Multi-core or many-core processor chip with autonomous region - Google Patents

Multi-core or many-core processor chip with autonomous region

Info

Publication number
CN108897714B
Authority
CN
China
Prior art keywords
chip
core
region
interfaces
regions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810719107.XA
Other languages
Chinese (zh)
Other versions
CN108897714A (en)
Inventor
王永文
徐炜遐
邓让钰
周宏伟
赵振宇
潘国腾
隋兵才
黄立波
孙彩霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201810719107.XA
Publication of CN108897714A
Application granted
Publication of CN108897714B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163: Interprocessor communication
    • G06F15/17: Interprocessor communication using an input/output type connection, e.g. channel, I/O port
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-core or many-core processor chip with autonomous regions. The chip comprises a chip body consisting of n processor cores and divided into m regions. Each region comprises i processor cores, j memory interfaces, k I/O interfaces, and an inter-region interconnection interface; within each region a complete shared storage system is formed based on a shared cache or an on-chip interconnection network, and the m regions form the whole chip through the inter-region interconnection interfaces. The invention offers low latency for local memory access and high bandwidth for global memory access. Design and implementation are carried out for an instance of one region only, and instances of the other regions are obtained by copying, rotating, or mirroring, which avoids design and implementation at full-chip scope. The invention thereby balances bandwidth and latency while keeping the processor scalable, and reduces the complexity of the processor hardware design.

Description

Multi-core or many-core processor chip with autonomous region
Technical Field
The invention relates to the field of microprocessors, and in particular to an improved architecture for a multi-core or many-core processor chip with autonomous regions.
Background
With the development of integrated circuit and processor design technologies, it has become difficult to keep improving the performance of a single CPU, while it has become possible to integrate multiple CPU cores on one chip. IBM integrated 2 POWER processor cores on one chip and launched the earliest high-performance multi-core processor product. Multi-core then became the mainstream microprocessor technology, and as the number of cores integrated on a chip grew, processor chips became many-core. Today almost all processor chips, from high-performance to embedded, are multi-core or many-core. Besides a larger number of CPU cores, more and more functions such as memory controllers, I/O interfaces, and interconnect interfaces can be integrated on a chip, so that a single processor chip can implement a multiprocessor system. This presents new challenges for processor architecture design.
Each CPU core of a multi-core or many-core processor needs to access memory and share data, so the processor hardware must implement some interconnect communication mechanism. There are currently two main mechanisms. The first is a shared-cache structure, shown in FIG. 1, in which all processor cores share a cache and access memory through a bus or crossbar. It is efficient when the number of cores is small, but it scales poorly and its bandwidth is limited. The second is a structure based on an on-chip interconnection network, shown in FIG. 2, in which nodes formed by one or more processor cores are connected to memory controllers and I/O controllers through an on-chip interconnection network whose topology may be a ring, a mesh, or the like. This mechanism scales well, but its hardware structure is complex, and as the number of processor cores increases, the larger network lengthens the distance from a processor to memory, so memory access latency suffers. As the number of processor cores keeps growing, balancing the bandwidth and latency of multi-core or many-core processors while reducing hardware design complexity has become a design challenge.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a multi-core or many-core processor chip with autonomous regions that has low latency for local memory access and high bandwidth for global memory access. Design and implementation are carried out for an instance of one region, and instances of the other regions are obtained by copying, rotating, or mirroring, which avoids design and implementation at full-chip scope. Bandwidth and latency are thereby balanced while the processor remains scalable, and the complexity of the processor hardware design is reduced.
In order to solve the above technical problem, the invention adopts the following technical solution:
A multi-core or many-core processor chip with autonomous regions comprises a chip body formed by n processor cores. The chip body is divided into m regions; each region comprises i processor cores, j memory interfaces, k I/O interfaces, and an inter-region interconnection interface; a complete shared storage system is formed in each region based on a shared cache or an on-chip interconnection network; the m regions form the whole chip through the inter-region interconnection interfaces; and m, i, j, and k are integers greater than or equal to 1.
Preferably, at least one memory interface in the m regions is brought out of the chip through chip pins, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided and j is the number of memory interfaces in each region.
Preferably, at least one I/O interface in the m regions is brought out of the chip through chip pins, so that the number of full-chip I/O interfaces is between 1 and m × k, where m is the total number of regions into which the chip body is divided and k is the number of I/O interfaces in each region.
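The parameterization above can be captured in a small data model. The following Python sketch is illustrative only: the class name `ChipConfig` and its field names are assumptions introduced here, not terms from the patent. It encodes the requirement that m, i, j, and k are integers of at least 1 and derives the full-chip totals, using the 64-core chip of the first embodiment as sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipConfig:
    """Hypothetical model of a chip divided into identical autonomous regions."""
    regions: int             # m: number of regions
    cores_per_region: int    # i: processor cores in each region
    mem_per_region: int      # j: memory interfaces in each region
    io_per_region: int       # k: I/O interfaces in each region

    def __post_init__(self):
        # The patent requires m, i, j, and k to be integers >= 1.
        for name in ("regions", "cores_per_region", "mem_per_region", "io_per_region"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be an integer >= 1, got {value!r}")

    @property
    def total_cores(self) -> int:
        # n = m * i
        return self.regions * self.cores_per_region

    @property
    def max_memory_interfaces(self) -> int:
        # Upper bound m * j on memory interfaces brought out to pins.
        return self.regions * self.mem_per_region

    @property
    def max_io_interfaces(self) -> int:
        # Upper bound m * k on I/O interfaces brought out to pins.
        return self.regions * self.io_per_region


# Sample values from the first embodiment: 8 regions of 8 cores,
# each region with 1 memory interface and 1 I/O interface.
chip64 = ChipConfig(regions=8, cores_per_region=8, mem_per_region=1, io_per_region=1)
print(chip64.total_cores, chip64.max_memory_interfaces, chip64.max_io_interfaces)
```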
Compared with the prior art, the invention has the following beneficial effects. The chip comprises a chip body consisting of n processor cores; the chip body is divided into m regions, each region comprises i processor cores and an inter-region interconnection interface, and the m regions form the whole chip through the inter-region interconnection interfaces. Each region either has its own memory interfaces and I/O interfaces or shares memory interfaces and I/O interfaces with other regions. The invention differs from a conventional multi-core or many-core processor chip in that it introduces the level of regions: the processor scale within each region is small, so the latency of local memory access is low, and the regions form the whole chip through inter-region interconnection, so the bandwidth of global memory access is high. Moreover, design and implementation can be carried out for an instance of one region, with instances of the other regions obtained by copying, rotating, or mirroring. This avoids design and implementation at full-chip scope, balances bandwidth and latency while keeping the processor scalable, and reduces the complexity of the processor hardware design.
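The reuse of one region design by copying, rotating, or mirroring can be illustrated on a toy floorplan. In this sketch a region is a small grid of labels; the grid contents, the function names, and the 2 × 2 placement are all assumptions made for illustration and do not come from the patent.

```python
def rotate_cw(grid):
    """Rotate a rectangular grid of labels 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror_h(grid):
    """Mirror a grid left-to-right."""
    return [list(row[::-1]) for row in grid]


def mirror_v(grid):
    """Mirror a grid top-to-bottom."""
    return [list(row) for row in grid[::-1]]


# Hypothetical floorplan of one region: cores (C), a shared cache ($$),
# a memory interface (MEM), and an inter-region interface (NI) placed
# in the lower-right corner.
region = [
    ["MEM", "C", "C"],
    ["C", "$$", "C"],
    ["C", "C", "NI"],
]

# Assemble a 2 x 2 chip from mirrored instances so that every NI faces
# the chip center, where an inter-region interconnect could sit.
top_left = region
top_right = mirror_h(region)
bottom_left = mirror_v(region)
bottom_right = mirror_h(mirror_v(region))

# Rotating by 180 degrees yields the same instance as mirroring on both axes.
assert rotate_cw(rotate_cw(region)) == bottom_right

chip = [left + right for left, right in zip(top_left, top_right)]
chip += [left + right for left, right in zip(bottom_left, bottom_right)]

for row in chip:
    print(" ".join(f"{cell:>3}" for cell in row))
```

Only `region` is designed by hand in this sketch; the other three instances are derived from it mechanically, which is the design saving the text describes.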
Drawings
FIG. 1 is a schematic diagram of a conventional shared cache based multi-core or many-core processor.
FIG. 2 is a schematic diagram of a multi-core or many-core processor based on an on-chip interconnection network.
FIG. 3 is a schematic architecture diagram of a first embodiment of the present invention.
FIG. 4 is a schematic diagram of a chip architecture of a 64-core processor in the prior art.
FIG. 5 is a schematic diagram of a 64-core processor chip according to a first embodiment of the present invention.
FIG. 6 is a schematic diagram of a chip architecture of a 1024-core processor in the prior art.
FIG. 7 is a schematic diagram of a 1024-core processor chip according to a second embodiment of the present invention.
Detailed Description
The first embodiment is as follows:
As shown in FIG. 3, the multi-core or many-core processor chip with autonomous regions of this embodiment includes a chip body formed by n processor cores. The chip body is divided into m regions; each region includes i processor cores, j memory interfaces, k I/O interfaces, and an inter-region interconnection interface; a complete shared storage system is formed in each region based on a shared cache or an on-chip interconnection network; and the m regions form the full chip through the inter-region interconnection interfaces. In this embodiment, the number of full-chip processor cores is n = m × i, that is, n is the product of the number of regions m and the number of processor cores per region i.
In this embodiment, at least one memory interface in the m regions is brought out of the chip through chip pins, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided and j is the number of memory interfaces in each region. If all memory interfaces of every region are brought out through chip pins, the number of full-chip memory interfaces is m × j. If only one memory interface of one region is brought out, the number is 1. Depending on the number and layout of chip pins, a chip may bring out some or all of its memory interfaces, so the number of full-chip memory interfaces lies between 1 and m × j.
In this embodiment, at least one I/O interface in the m regions is brought out of the chip through chip pins, so that the number of full-chip I/O interfaces is between 1 and m × k, where m is the total number of regions into which the chip body is divided and k is the number of I/O interfaces in each region. If all I/O interfaces of every region are brought out through chip pins, the number of full-chip I/O interfaces is m × k. If only one I/O interface of one region is brought out, the number is 1. Depending on how the chip pins are budgeted, a chip may bring out some or all of its I/O interfaces, so the number of full-chip I/O interfaces lies between 1 and m × k.
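The bounds on externally visible interfaces follow from how many interfaces each region brings out to pins. The sketch below is a hypothetical helper (the function name and the list-based pin-out plan are assumptions) that totals a per-region plan and checks it against the range from 1 to m × j for memory interfaces, or m × k for I/O interfaces. The sample values mirror the 64-core chip of this embodiment; which two regions expose their I/O interface is chosen arbitrarily.

```python
def count_exposed(per_region_exposed, per_region_available):
    """Total interfaces brought out to chip pins.

    per_region_exposed: one entry per region, the number of that region's
        interfaces routed to pins (a hypothetical pin-out plan).
    per_region_available: j (memory) or k (I/O), the interfaces built
        into each region.
    """
    for region_index, exposed in enumerate(per_region_exposed):
        if not 0 <= exposed <= per_region_available:
            raise ValueError(
                f"region {region_index} exposes {exposed} interfaces, "
                f"but each region has {per_region_available}"
            )
    total = sum(per_region_exposed)
    # The per-region check already caps the total at m * j (or m * k);
    # the remaining requirement is that at least one interface reaches the pins.
    if total < 1:
        raise ValueError("at least one interface must be brought out to pins")
    return total


# 8 regions, 1 memory interface and 1 I/O interface per region.
# All memory interfaces are brought out; only two regions bring out their I/O
# (the choice of regions 0 and 4 is an assumption).
memory_total = count_exposed([1] * 8, per_region_available=1)
io_total = count_exposed([1, 0, 0, 0, 1, 0, 0, 0], per_region_available=1)
print(memory_total, io_total)
```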
The multi-core or many-core processor chip with autonomous regions in this embodiment is specifically a 64-core processor chip. For comparison, FIG. 4 shows the architecture of a conventional 64-core processor chip, in which every 4 cores (denoted by C in the figure, with $ denoting a core's private cache) form 1 node through a shared cache (denoted by Cache in the figure), and the 16 nodes are connected by a mesh on-chip network to which the memory interfaces (denoted by MEM in the figure) and I/O interfaces (denoted by I/O in the figure) are attached. In this structure, a processor must traverse the on-chip interconnection network to access any memory, and reaching a distant memory interface takes multiple hops, so the latency is large.
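To make the hop-count concern concrete, the sketch below computes hop distances in a square mesh with minimal routing, where the distance between two nodes is their Manhattan distance. Arranging the 16 nodes as a 4 × 4 mesh and placing a memory interface at a corner node are assumptions for illustration; the figure's actual layout is not reproduced here.

```python
def mesh_hops(src, dst):
    """Hops between two (row, col) nodes in a mesh with minimal routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hop_stats(side, target):
    """Worst-case and mean hops from every node of a side x side mesh to target."""
    distances = [
        mesh_hops((row, col), target)
        for row in range(side)
        for col in range(side)
    ]
    return max(distances), sum(distances) / len(distances)


# 16 nodes assumed to form a 4 x 4 mesh, memory interface assumed at corner (0, 0).
worst, mean = hop_stats(4, (0, 0))
print(worst, mean)
```

Under these assumptions the farthest node is 6 hops from the interface and the mean distance is 3 hops. Both figures grow with the mesh side length, which is the scaling problem the region-based structure targets.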
As shown in FIG. 5, the 64-core processor chip of this embodiment is divided into 8 regions. The 8 cores of each region (denoted by C in the figure, together with their private caches) share a cache (denoted by Cache in the figure) through a crossbar switch. Each region includes 1 memory interface (denoted by MEM in the figure), 1 I/O interface (denoted by I/O in the figure), and 1 inter-region interconnect interface (denoted by NI in the figure). The 8 regions are connected through the inter-region interconnect interfaces NI to form the full chip. All memory interfaces are brought out of the chip, but only the I/O interfaces of two regions are brought out. With this region-autonomous structure, a processor does not need to traverse an on-chip interconnection network to access local memory, so the latency is small; and hardware implementation only needs to consider a region of 8 cores, so the complexity is low.
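The latency benefit can be sketched by classifying memory accesses as local or remote. The mapping below, in which consecutive core numbers belong to the same region and each region owns one memory interface, is an assumption made for illustration; the patent does not specify core numbering, address mapping, or the exact components on a remote path.

```python
CORES_PER_REGION = 8   # i in this embodiment
NUM_REGIONS = 8        # m in this embodiment


def region_of_core(core_id):
    """Region a core belongs to, assuming contiguous core numbering."""
    if not 0 <= core_id < CORES_PER_REGION * NUM_REGIONS:
        raise ValueError(f"core_id {core_id} out of range")
    return core_id // CORES_PER_REGION


def access_path(core_id, memory_region):
    """Components traversed when a core accesses memory attached to a region."""
    if not 0 <= memory_region < NUM_REGIONS:
        raise ValueError(f"memory_region {memory_region} out of range")
    home = region_of_core(core_id)
    path = [f"core {core_id}", f"crossbar and shared cache of region {home}"]
    if home != memory_region:
        # Remote access: leave through the local NI and enter through the remote NI.
        path += [f"NI of region {home}", f"NI of region {memory_region}"]
    path.append(f"MEM of region {memory_region}")
    return path


print(access_path(5, 0))   # local access: stays inside region 0
print(access_path(5, 3))   # remote access: crosses the inter-region interconnect
```

In this model a local access never touches an NI, which corresponds to the statement that local memory is reached without passing through an on-chip interconnection network.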
The second embodiment is as follows:
This embodiment is basically the same as the first embodiment; the difference is that the chip in this embodiment is specifically a 1024-core processor chip.
For comparison, FIG. 6 shows a conventional 1024-core processor chip, in which every 4 cores form 1 node through a shared cache, and the 256 nodes are connected by a mesh on-chip network to which memory interfaces and I/O interfaces are attached. In this structure, a processor must traverse the on-chip interconnection network to access any memory, and reaching a distant memory interface takes multiple hops, so the latency is large.
As shown in FIG. 7, the 1024-core processor chip of this embodiment is divided into 4 regions. The 64 cores in each region (denoted by C in the figure, together with their private caches) are connected through a mesh on-chip network. Each region includes 2 memory interfaces (denoted by MEM in the figure), 1 I/O interface (denoted by I/O in the figure), and 1 inter-region interconnect interface (denoted by NI in the figure). The 4 regions are fully cross-connected through the inter-region interconnect interfaces to form the chip. All memory interfaces are brought out of the chip, but only the I/O interfaces of two regions are brought out. With this region-autonomous structure, the latency of a processor accessing local memory is small, the bandwidth is comparable, and the complexity of hardware implementation is low.
The above description is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiments, and all technical solutions falling under the idea of the present invention belong to its protection scope. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention should also be considered within the protection scope of the present invention.
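Connecting the regions "in a full cross" is read here as full interconnection, in which every pair of regions has a direct link. A minimal sketch under that reading, assuming point-to-point links and using hypothetical names, enumerates the links for the four regions of this embodiment:

```python
from itertools import combinations


def full_interconnect(num_regions):
    """All unordered region pairs, one direct link per pair."""
    return list(combinations(range(num_regions), 2))


links = full_interconnect(4)
print(links)        # every pair of the four regions
print(len(links))   # m * (m - 1) / 2 links

# Each region connects directly to every other region, so any remote
# region is reachable in a single inter-region hop.
peers = {
    region: sorted(a if b == region else b for a, b in links if region in (a, b))
    for region in range(4)
}
print(peers)
```

With four regions this gives six links and three direct peers per region. The link count grows quadratically with the number of regions, so this topology suits a small m such as the one used here.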

Claims (3)

1. A multi-core or many-core processor chip with autonomous regions, comprising a chip body consisting of n processor cores, characterized in that: the chip body is divided into m regions; an instance of one region is designed and implemented, and instances of the other regions are obtained by copying, rotating, or mirroring; each region comprises i processor cores, j memory interfaces, k I/O interfaces, and one inter-region interconnection interface; a complete shared storage system is formed in each region based on a shared cache or an on-chip interconnection network; the m regions form the whole chip through the inter-region interconnection interfaces; and m, i, j, and k are integers greater than or equal to 1.
2. The multi-core or many-core processor chip with autonomous regions of claim 1, wherein: at least one memory interface in the m regions is brought out through the chip pins, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided and j is the number of memory interfaces in each region.
3. The multi-core or many-core processor chip with autonomous regions of claim 1, wherein: at least one I/O interface in the m regions is brought out through the chip pins, so that the number of full-chip I/O interfaces is between 1 and m × k, where m is the total number of regions into which the chip body is divided and k is the number of I/O interfaces in each region.
CN201810719107.XA 2018-07-03 2018-07-03 Multi-core or many-core processor chip with autonomous region Active CN108897714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810719107.XA CN108897714B (en) 2018-07-03 2018-07-03 Multi-core or many-core processor chip with autonomous region

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810719107.XA CN108897714B (en) 2018-07-03 2018-07-03 Multi-core or many-core processor chip with autonomous region

Publications (2)

Publication Number Publication Date
CN108897714A CN108897714A (en) 2018-11-27
CN108897714B CN108897714B (en) 2022-05-24

Family

ID=64347241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810719107.XA Active CN108897714B (en) 2018-07-03 2018-07-03 Multi-core or many-core processor chip with autonomous region

Country Status (1)

Country Link
CN (1) CN108897714B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114116167B (en) * 2021-11-25 2024-03-19 中国人民解放军国防科技大学 High-performance computing-oriented regional autonomous heterogeneous many-core processor
CN116028418B (en) * 2023-02-13 2023-06-20 中国人民解放军国防科技大学 GPDSP-based extensible multi-core processor, acceleration card and computer

Citations (4)

Publication number Priority date Publication date Assignee Title
CN105608490A (en) * 2015-07-29 2016-05-25 上海磁宇信息科技有限公司 Cellular array computing system and communication method thereof
CN106293736A (en) * 2016-08-08 2017-01-04 合肥工业大学 Two-stage programming model and the programmed method thereof of system is calculated for coarseness multinuclear
CN107003949A (en) * 2015-02-04 2017-08-01 华为技术有限公司 The system and method synchronous for the internal memory of multiple nucleus system
EP3327573A1 (en) * 2016-11-28 2018-05-30 Renesas Electronics Corporation Multi-processor and multi-processor system

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20170220520A1 (en) * 2016-01-29 2017-08-03 Knuedge Incorporated Determining an operation state within a computing system with multi-core processing devices

Non-Patent Citations (1)

Title
多核处理器片上光互连的研究;高凯;《中国优秀硕士学位论文全文数据库 信息科技辑》;20150115;第4章 *

Also Published As

Publication number Publication date
CN108897714A (en) 2018-11-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant