CN108897714B - Multi-core or many-core processor chip with autonomous region - Google Patents
- Publication number
- CN108897714B (application CN201810719107.XA)
- Authority
- CN
- China
- Prior art keywords
- chip
- core
- region
- interfaces
- regions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/17—Interprocessor communication using an input/output type connection, e.g. channel, I/O port
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-core or many-core processor chip with autonomous regions. The chip comprises a chip body consisting of n processor cores, and the chip body is divided into m regions. Each region comprises i processor cores, j memory interfaces, k I/O interfaces and an inter-region interconnection interface. A complete shared-memory system is formed within each region based on a shared cache or an on-chip interconnection network, and the m regions form the full chip through the inter-region interconnection interfaces. The invention offers low latency for local memory access and high bandwidth for global memory access. Design and implementation are carried out for one region instance, and the other region instances are obtained by copying, rotating or mirroring it. This avoids design and implementation at full-chip scope, reduces the complexity of the processor hardware design, and balances bandwidth and latency while keeping the processor scalable.
Description
Technical Field
The invention relates to the field of microprocessors, and in particular to an improved architecture for a multi-core or many-core processor chip with autonomous regions.
Background
With the development of integrated circuit and processor design technologies, it has become difficult to keep improving the performance of a single CPU, while integrating multiple CPU cores on one chip has become feasible. IBM integrated two POWER processor cores on one chip and launched the earliest high-performance multi-core processor product. Multi-core then became the mainstream microprocessor technology, and as the number of cores integrated on a processor chip kept increasing, processor chips became many-core. Today almost all processor chips, from high-performance to embedded, are multi-core or many-core. Besides a larger number of CPU cores, more and more functions, such as memory controllers, I/O interfaces and interconnect interfaces, can be integrated on a chip, so that a single processor chip can implement a multiprocessor system. This presents new challenges for processor architecture design.
Each CPU core of a multi-core or many-core processor needs to access memory and share data, so the processor hardware must implement some interconnect communication mechanism. There are currently two main mechanisms. The first is a shared-cache structure, shown in FIG. 1, in which all processor cores share a cache and access memory through a bus or crossbar. It is efficient when the number of cores is small, but it scales poorly and its bandwidth is limited. The second is a structure based on an on-chip interconnection network, shown in FIG. 2, in which nodes, each formed by one or more processor cores, are connected to the memory controllers and I/O controllers through the on-chip interconnection network, which may use a topology such as a ring or a mesh. This mechanism scales well, but its hardware structure is complex, and as the number of processor cores increases the network grows, the distance from a processor to memory increases, and memory access latency suffers. As the number of processor cores keeps growing, balancing the bandwidth and latency of multi-core or many-core processors while reducing hardware design complexity has become a design challenge.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a multi-core or many-core processor chip with autonomous regions that offers low latency for local memory access and high bandwidth for global memory access. Design and implementation are carried out for one region instance, and the other region instances are obtained by copying, rotating or mirroring it. This avoids design and implementation at full-chip scope, reduces the complexity of the processor hardware design, and balances bandwidth and latency while keeping the processor scalable.
In order to solve the technical problems, the invention adopts the technical scheme that:
A multi-core or many-core processor chip with autonomous regions comprises a chip body formed by n processor cores, wherein the chip body is divided into m regions, each region comprises i processor cores, j memory interfaces, k I/O interfaces and an inter-region interconnection interface, a complete shared-memory system is formed within each region based on a shared cache or an on-chip interconnection network, the m regions form the full chip through the inter-region interconnection interfaces, and m, i, j and k are integers greater than or equal to 1.
Preferably, at least one memory interface in the m regions is led out externally through a chip pin, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided, and j is the number of memory interfaces included in each region.
Preferably, at least one I/O interface in the m regions is externally led out through a chip pin, so that the number of the full-chip I/O interfaces is between 1 and m × k, where m is the total number of the regions into which the chip body is divided, and k is the number of the I/O interfaces included in each region.
Compared with the prior art, the invention has the following beneficial effects. The chip comprises a chip body consisting of n processor cores; the chip body is divided into m regions, each region comprises i processor cores and an inter-region interconnection interface, and the m regions form the full chip through the inter-region interconnection interfaces. Each region either has its own memory interfaces and I/O interfaces or shares memory interfaces and I/O interfaces with other regions. The invention differs from a conventional multi-core or many-core processor chip in that it introduces the region as an additional level of hierarchy. Within each region the processor scale is small, so the latency of accessing local memory is low; the regions form the full chip through the inter-region interconnect, so the bandwidth for accessing global memory is high. Moreover, design and implementation can be carried out for one region instance, and the other region instances can be obtained by copying, rotating or mirroring it. This avoids design and implementation at full-chip scope, reduces the complexity of the processor hardware design, and balances bandwidth and latency while keeping the processor scalable.
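The reuse of a single region design by copying, rotating or mirroring can be illustrated with a small sketch. The 3 × 3 floorplan and its block labels below are purely hypothetical and are not taken from the figures; the sketch only shows that the transformed instances contain exactly the same blocks as the original, so a region needs to be designed once.

```python
# Hypothetical region floorplan (not from the patent figures).
# "C" = core, "$" = shared cache, "M" = memory interface,
# "N" = inter-region interconnection interface.
REGION = [
    ["M", "C", "C"],
    ["C", "$", "C"],
    ["C", "C", "N"],
]


def rotate90(grid):
    """Return the floorplan rotated 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror(grid):
    """Return the floorplan mirrored left to right."""
    return [row[::-1] for row in grid]


def block_counts(grid):
    """Count how many blocks of each kind the floorplan contains."""
    counts = {}
    for row in grid:
        for cell in row:
            counts[cell] = counts.get(cell, 0) + 1
    return counts


# Four region instances derived from one design: a copy, a mirror image,
# and two rotations.
instances = [
    [row[:] for row in REGION],
    mirror(REGION),
    rotate90(REGION),
    rotate90(rotate90(REGION)),
]

# Every instance has the same block inventory as the original design.
same_inventory = all(block_counts(g) == block_counts(REGION) for g in instances)
```

Rotation and mirroring change only where the memory and inter-region interfaces sit relative to the chip edge, which is presumably why they are useful for placing interfaces near the pins.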
Drawings
FIG. 1 is a schematic diagram of a conventional shared cache based multi-core or many-core processor.
FIG. 2 is a schematic diagram of a multi-core or many-core processor based on an on-chip interconnection network.
FIG. 3 is a schematic architecture diagram of a first embodiment of the present invention.
FIG. 4 is a schematic diagram of a chip architecture of a 64-core processor in the prior art.
FIG. 5 is a schematic diagram of a 64-core processor chip according to a first embodiment of the present invention.
FIG. 6 is a schematic diagram of a chip architecture of a 1024-core processor in the prior art.
FIG. 7 is a schematic diagram of a 1024-core processor chip according to a second embodiment of the present invention.
Detailed Description
The first embodiment is as follows:
As shown in FIG. 3, the multi-core or many-core processor chip with autonomous regions of this embodiment includes a chip body formed by n processor cores, where the chip body is divided into m regions, each region includes i processor cores, j memory interfaces, k I/O interfaces and an inter-region interconnection interface, a complete shared-memory system is formed within each region based on a shared cache or an on-chip interconnection network, and the m regions form the full chip through the inter-region interconnection interfaces. In this embodiment, the number of full-chip processor cores is n = m × i, that is, n is the product of the number of regions m and the number of processor cores per region i.
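The parameters m, i, j and k and the relation n = m × i can be captured in a small model. This is a minimal sketch; the class and field names are illustrative and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionConfig:
    """One autonomous region (names are hypothetical)."""
    cores: int              # i: processor cores per region
    memory_interfaces: int  # j: memory interfaces per region
    io_interfaces: int      # k: I/O interfaces per region

    def __post_init__(self):
        if min(self.cores, self.memory_interfaces, self.io_interfaces) < 1:
            raise ValueError("i, j and k must be integers >= 1")


@dataclass(frozen=True)
class ChipConfig:
    """A full chip built from m identical regions."""
    regions: int            # m: number of regions
    region: RegionConfig

    def __post_init__(self):
        if self.regions < 1:
            raise ValueError("m must be an integer >= 1")

    @property
    def total_cores(self) -> int:
        """n = m * i, as stated for this embodiment."""
        return self.regions * self.region.cores


# Values from the 64-core example of this embodiment: 8 regions of 8 cores,
# each with 1 memory interface and 1 I/O interface.
chip = ChipConfig(regions=8, region=RegionConfig(8, 1, 1))
```

With these values, `chip.total_cores` evaluates to 64.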
In this embodiment, at least one memory interface in the m regions is led out externally through a chip pin, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided, and j is the number of memory interfaces included in each region. If all the memory interfaces of every region are led out through chip pins, the number of full-chip memory interfaces is m × j. If only one memory interface of one region is led out through a chip pin, the number of full-chip memory interfaces is 1. Depending on the number and layout of the chip pins, a chip can lead out some or all of the memory interfaces, so the number of full-chip memory interfaces is between 1 and m × j.
In this embodiment, at least one I/O interface in the m regions is led out externally through a chip pin, so that the number of full-chip I/O interfaces is between 1 and m × k, where m is the total number of regions into which the chip body is divided, and k is the number of I/O interfaces included in each region. If all the I/O interfaces of every region are led out through chip pins, the number of full-chip I/O interfaces is m × k. If only one I/O interface of one region is led out through a chip pin, the number of full-chip I/O interfaces is 1. Depending on how the chip pins are planned, a chip can lead out some or all of the I/O interfaces, so the number of full-chip I/O interfaces is between 1 and m × k.
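The bounds described in the two paragraphs above apply identically to memory interfaces (j per region) and I/O interfaces (k per region). A minimal sketch follows; the function names are hypothetical.

```python
def pinned_out_range(m, per_region):
    """Return (minimum, maximum) number of interfaces led out of the chip.

    m is the number of regions; per_region is j for memory interfaces
    or k for I/O interfaces.
    """
    if m < 1 or per_region < 1:
        raise ValueError("m and the per-region count must be integers >= 1")
    return 1, m * per_region


def is_valid_pin_out(m, per_region, pinned_out):
    """Check that a chosen number of led-out interfaces lies within the bounds."""
    low, high = pinned_out_range(m, per_region)
    return low <= pinned_out <= high


# Example with m = 8 regions and k = 1 I/O interface per region:
# leading out the I/O interfaces of two regions is within bounds.
two_regions_ok = is_valid_pin_out(8, 1, 2)
```

The lower bound of 1 reflects the requirement that at least one interface is led out; the upper bound is reached when every region exposes all of its interfaces.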
The multi-core or many-core processor chip with autonomous regions in this embodiment is specifically a 64-core processor chip. For comparison, the architecture of a conventional 64-core processor chip is shown in FIG. 4: every 4 cores (denoted by C in the figure, with $ denoting the private cache of a core) form 1 node through a shared cache (denoted by Cache in the figure), and the 16 nodes are connected through a mesh on-chip network, to which the memory interfaces (denoted by MEM in the figure) and I/O interfaces (denoted by I/O in the figure) are attached. With this structure, a processor must traverse the on-chip interconnection network to access any memory, and reaching a remote memory interface takes multiple hops, so the latency is large.
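The multi-hop cost of the conventional mesh can be made concrete with a rough sketch. It assumes that the 16 nodes form a 4 × 4 mesh, that a message travels the Manhattan distance between nodes, and that a memory interface is attached at the corner node (0, 0). None of these details are specified in the text, so the numbers are only indicative.

```python
SIDE = 4  # assumed 4 x 4 arrangement of the 16 nodes

# Assumed attachment point of one memory interface (hypothetical placement).
MEM_NODE = (0, 0)


def mesh_hops(src, dst):
    """Hop count between two mesh nodes, assuming Manhattan-distance routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


nodes = [(x, y) for x in range(SIDE) for y in range(SIDE)]
hops_to_mem = [mesh_hops(node, MEM_NODE) for node in nodes]

worst_case_hops = max(hops_to_mem)
average_hops = sum(hops_to_mem) / len(hops_to_mem)
```

Under these assumptions the farthest node is 6 hops from the memory interface and the average is 3 hops. Each hop adds router and link delay, which is the latency that the region-based structure avoids for local accesses.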
As shown in FIG. 5, the 64-core processor chip of this embodiment is divided into 8 regions. The 8 cores of each region (each denoted by C with its private cache in the figure) share a cache (denoted by Cache in the figure) through a crossbar switch. Each region includes 1 memory interface (denoted by MEM in the figure), 1 I/O interface (denoted by I/O in the figure) and 1 inter-region interconnection interface (denoted by NI in the figure). The 8 regions are connected through the inter-region interconnection interfaces NI to form the full chip. All memory interfaces are led out of the chip, but only the I/O interfaces of two regions are led out of the chip. With this region-autonomous structure, a processor does not need to traverse an on-chip interconnection network to access local memory, so the latency is small; in addition, the hardware implementation only needs to consider a region of 8 cores, so the complexity is low.
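The difference between a local and a remote memory access in this structure can be sketched as follows. The core numbering (cores 0 to 7 in region 0, 8 to 15 in region 1, and so on) and the exact order of blocks on a remote path are assumptions made for illustration; the text only states that local accesses avoid the interconnection network and that regions are joined through NI.

```python
CORES_PER_REGION = 8  # from this embodiment
NUM_REGIONS = 8       # from this embodiment


def home_region(core_id):
    """Region that a core belongs to, under the assumed contiguous numbering."""
    if not 0 <= core_id < CORES_PER_REGION * NUM_REGIONS:
        raise ValueError("core_id out of range")
    return core_id // CORES_PER_REGION


def access_path(core_id, memory_region):
    """Blocks traversed to reach the MEM interface of memory_region.

    Labels follow FIG. 5 (C, Cache, NI, MEM); the ordering is illustrative.
    """
    if not 0 <= memory_region < NUM_REGIONS:
        raise ValueError("memory_region out of range")
    path = ["C", "crossbar", "Cache"]
    if home_region(core_id) != memory_region:
        # Remote access: leave through the local NI, enter through the remote NI.
        path += ["NI", "NI"]
    path.append("MEM")
    return path


local_path = access_path(3, 0)   # core 3 lives in region 0
remote_path = access_path(3, 5)  # same core, memory owned by region 5
```

Here `local_path` contains no NI step, while `remote_path` crosses two inter-region interfaces before reaching MEM.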
The second embodiment is as follows:
This embodiment is basically the same as the first embodiment; the difference is that this embodiment is a specific example of a 1024-core processor chip.
For comparison with the 1024-core processor chip of this embodiment, FIG. 6 shows a conventional 1024-core processor chip, in which every 4 cores form 1 node through a shared cache, and the 256 nodes are connected through a mesh on-chip network to which the memory interfaces and I/O interfaces are attached. With this structure, a processor must traverse the on-chip interconnection network to access any memory, and reaching a remote memory interface takes multiple hops, so the latency is large.
As shown in FIG. 7, the 1024-core processor chip of this embodiment is divided into 4 regions. The 64 cores in each region (each denoted by C with its private cache in the figure) are connected through a mesh on-chip network. Each region includes 2 memory interfaces (denoted by MEM in the figure), 1 I/O interface (denoted by I/O in the figure) and 1 inter-region interconnection interface (denoted by NI in the figure). The 4 regions are fully cross-connected through the inter-region interconnection interfaces to form the chip. All memory interfaces are led out of the chip, but only the I/O interfaces of two regions are led out of the chip. With this region-autonomous structure, the latency of a processor accessing local memory is small, the bandwidth is comparable, and the complexity of the hardware implementation is low.
The above description is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiments, and all technical solutions that follow the idea of the present invention fall within its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principles of the present invention should also be considered within the protection scope of the present invention.
Claims (3)
1. A multi-core or many-core processor chip with autonomous regions, comprising a chip body consisting of n processor cores, characterized in that: the chip body is divided into m regions; design and implementation are carried out for one region instance, and the other region instances are obtained by copying, rotating or mirroring; each region comprises i processor cores, j memory interfaces, k I/O interfaces and one inter-region interconnection interface; a complete shared-memory system is formed within each region based on a shared cache or an on-chip interconnection network; the m regions form the full chip through the inter-region interconnection interfaces; and m, i, j and k are integers greater than or equal to 1.
2. The multi-core or many-core processor chip with autonomous regions of claim 1, wherein: at least one memory interface in the m regions is led out through the chip pins, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided, and j is the number of memory interfaces contained in each region.
3. The multi-core or many-core processor chip with autonomous regions of claim 1, wherein: at least one I/O interface in the m regions is led out through the chip pins, so that the number of full-chip I/O interfaces is between 1 and m × k, where m is the total number of regions into which the chip body is divided, and k is the number of I/O interfaces contained in each region.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810719107.XA CN108897714B (en) | 2018-07-03 | 2018-07-03 | Multi-core or many-core processor chip with autonomous region |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810719107.XA CN108897714B (en) | 2018-07-03 | 2018-07-03 | Multi-core or many-core processor chip with autonomous region |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108897714A (en) | 2018-11-27 |
CN108897714B (en) | 2022-05-24 |
Family
ID=64347241
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810719107.XA Active CN108897714B (en) | 2018-07-03 | 2018-07-03 | Multi-core or many-core processor chip with autonomous region |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108897714B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114116167B (en) * | 2021-11-25 | 2024-03-19 | 中国人民解放军国防科技大学 | High-performance computing-oriented regional autonomous heterogeneous many-core processor |
CN116028418B (en) * | 2023-02-13 | 2023-06-20 | 中国人民解放军国防科技大学 | GPDSP-based extensible multi-core processor, acceleration card and computer |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105608490A (en) * | 2015-07-29 | 2016-05-25 | 上海磁宇信息科技有限公司 | Cellular array computing system and communication method thereof |
CN106293736A (en) * | 2016-08-08 | 2017-01-04 | 合肥工业大学 | Two-stage programming model and the programmed method thereof of system is calculated for coarseness multinuclear |
CN107003949A (en) * | 2015-02-04 | 2017-08-01 | 华为技术有限公司 | The system and method synchronous for the internal memory of multiple nucleus system |
EP3327573A1 (en) * | 2016-11-28 | 2018-05-30 | Renesas Electronics Corporation | Multi-processor and multi-processor system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170220520A1 (en) * | 2016-01-29 | 2017-08-03 | Knuedge Incorporated | Determining an operation state within a computing system with multi-core processing devices |
-
2018
- 2018-07-03 CN CN201810719107.XA patent/CN108897714B/en active Active
Non-Patent Citations (1)
Title |
---|
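Research on On-Chip Optical Interconnects for Multi-Core Processors; Gao Kai; China Master's Theses Full-text Database, Information Science and Technology; 2015-01-15; Chapter 4 *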
Also Published As
Publication number | Publication date |
---|---|
CN108897714A (en) | 2018-11-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kim et al. | Memory-centric system interconnect design with hybrid memory cubes | |
CN104798008B (en) | The configurable peak performance limit of control processor | |
Sewell et al. | Swizzle-switch networks for many-core systems | |
JP5273045B2 (en) | Barrier synchronization method, apparatus, and processor | |
Arimilli et al. | The PERCS high-performance interconnect | |
US9998401B2 (en) | Architecture for on-die interconnect | |
CN104049715A (en) | Platform agnostic power management | |
CN108897714B (en) | Multi-core or many-core processor chip with autonomous region | |
CN103744644A (en) | Quad-core processor system built in quad-core structure and data switching method thereof | |
US20210333860A1 (en) | System-wide low power management | |
US9892042B2 (en) | Method and system for implementing directory structure of host system | |
Sato et al. | Co-design and system for the supercomputer “fugaku” | |
EP4162366A1 (en) | Link affinitization to reduce transfer latency | |
CN106951390B (en) | NUMA system construction method capable of reducing cross-node memory access delay | |
US11461234B2 (en) | Coherent node controller | |
CN109150717B (en) | Combined routing method for optimizing network-on-chip power consumption | |
US10592358B2 (en) | Functional interconnect redundancy in cache coherent systems | |
US20170255558A1 (en) | Isolation mode in a cache coherent system | |
Lotfi-Kamran et al. | Dark silicon and the history of computing | |
Camacho et al. | Pc-mesh: A dynamic parallel concentrated mesh | |
Vivet et al. | Interconnect challenges for 3D multi-cores: From 3D network-on-chip to cache interconnects | |
Al Maruf et al. | Memory Disaggregation: Open Challenges in the Era of CXL | |
Azimi et al. | On-chip interconnect trade-offs for tera-scale many-core processors | |
Wang et al. | A diffusional schedule for traffic reducing on network-on-chip | |
Ausavarungnirun et al. | Energy-Efficient Deflection-based On-chip Networks: Topology, Routing, Flow Control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||