CN108897714B

CN108897714B - Multi-core or many-core processor chip with autonomous region

Info

Publication number: CN108897714B
Application number: CN201810719107.XA
Authority: CN
Inventors: 王永文; 徐炜遐; 邓让钰; 周宏伟; 赵振宇; 潘国腾; 隋兵才; 黄立波; 孙彩霞
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-07-03
Filing date: 2018-07-03
Publication date: 2022-05-24
Anticipated expiration: 2038-07-03
Also published as: CN108897714A

Abstract

The invention discloses a multi-core or many-core processor chip with autonomous regions. The chip comprises a chip body consisting of n processor cores and divided into m regions. Each region comprises i processor cores, j memory interfaces, k I/O interfaces and an inter-region interconnection interface, and forms a complete shared storage system based on a shared cache or an on-chip interconnection network; the m regions form the whole chip through the inter-region interconnection interfaces. The invention offers low latency for local memory access and high bandwidth for global memory access. Only one region instance needs to be designed and implemented, and the other region instances can be obtained by copying, rotating or mirroring it, which avoids design and implementation at full-chip scale. The complexity of the processor hardware design is thereby reduced, and both bandwidth and latency are taken into account while the processor remains scalable.

Description

Multi-core or many-core processor chip with autonomous region

Technical Field

The invention relates to the field of microprocessors, and in particular to an improved architecture for a multi-core or many-core processor chip with autonomous regions.

Background

With the development of integrated circuit and processor design technologies, it has become difficult to keep improving the performance of a single CPU, while it has become possible to integrate multiple CPU cores on one chip. IBM integrated 2 POWER processor cores on one chip and introduced the earliest high-performance multi-core processor product. Multi-core then became the mainstream microprocessor technology, and as the number of cores integrated on a processor chip kept increasing, processor chips became many-core. By now, almost all processor chips, from high-performance to embedded, are multi-core or many-core. In addition to a larger number of CPU cores, more and more functions such as memory controllers, I/O interfaces and interconnect interfaces are integrated on the chip, so that a single processor chip can implement a multiprocessor system. This presents new challenges to processor architecture design.

Each CPU core of a multi-core or many-core processor requires memory access and data sharing, so the processor hardware must implement some interconnect communication mechanism. There are currently two major mechanisms. The first is based on a shared cache architecture, as shown in FIG. 1, in which all processor cores share a cache and access memory through a bus or crossbar. It is efficient when the number of cores is small; its drawbacks are poor scalability and limited bandwidth. The second is based on an on-chip interconnection network, as shown in FIG. 2, in which nodes, each formed by one or more processor cores, are connected to the memory controllers and I/O controllers through the on-chip interconnection network, whose topology may be a ring, a mesh, or the like. This mechanism scales well, but its hardware structure is complex, and as the number of processor cores grows, the larger network increases the distance between a processor and the memory it accesses, so access latency rises. With the number of processor cores steadily increasing, balancing the bandwidth and latency of multi-core or many-core processors while reducing the complexity of hardware design has become a design challenge.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a multi-core or many-core processor chip with autonomous regions, which offers low latency for local memory access and high bandwidth for global memory access. Only one region instance needs to be designed and implemented, and the other region instances can be obtained by copying, rotating or mirroring it. Design and implementation at full-chip scale are thereby avoided, the complexity of the processor hardware design is reduced, and both bandwidth and latency are taken into account while the processor remains scalable.

In order to solve the technical problems, the invention adopts the technical scheme that:

A multi-core or many-core processor chip with autonomous regions comprises a chip body formed by n processor cores, wherein the chip body is divided into m regions, each region comprises i processor cores, j memory interfaces, k I/O interfaces and an inter-region interconnection interface, a complete shared storage system is formed in each region based on a shared cache or an on-chip interconnection network, the m regions form the whole chip through the inter-region interconnection interfaces, and m, i, j and k are integers greater than or equal to 1.

Preferably, at least one memory interface in the m regions is led out of the chip through chip pins, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided, and j is the number of memory interfaces included in each region.

Preferably, at least one I/O interface in the m regions is led out of the chip through chip pins, so that the number of full-chip I/O interfaces is between 1 and m × k, where m is the total number of regions into which the chip body is divided, and k is the number of I/O interfaces included in each region.

Compared with the prior art, the invention has the following beneficial effects. The chip comprises a chip body consisting of n processor cores, wherein the chip body is divided into m regions, each region comprises i processor cores and an inter-region interconnection interface, and the m regions form the whole chip through the inter-region interconnection interfaces. Each region has its own memory interfaces and I/O interfaces, or the regions share memory interfaces and I/O interfaces. The difference between the invention and a traditional multi-core or many-core processor chip is that a region level is introduced: the processor scale within each region is small, so the latency of local memory access is low, while the multiple regions form the whole chip through inter-region interconnection, so the bandwidth of global memory access is high. Moreover, only one region instance needs to be designed and implemented, and the other region instances can be obtained by copying, rotating or mirroring it, so that design and implementation at full-chip scale are avoided, the complexity of the processor hardware design is reduced, and both bandwidth and latency are taken into account while the processor remains scalable.
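The reuse of a single region instance can be illustrated with a small sketch. The grid representation, the cell labels and the function names below are hypothetical and are not taken from the patent; the sketch only shows how one designed region tile could yield further instances by copying, rotating or mirroring.

```python
from typing import List

# Hypothetical floorplan of one region: rows of cell labels.
Tile = List[List[str]]


def copy_tile(tile: Tile) -> Tile:
    """Return an independent copy of the region tile."""
    return [row[:] for row in tile]


def rotate_cw(tile: Tile) -> Tile:
    """Return the tile rotated 90 degrees clockwise."""
    return [list(row) for row in zip(*tile[::-1])]


def mirror_h(tile: Tile) -> Tile:
    """Return the tile mirrored left-to-right."""
    return [row[::-1] for row in tile]


# Assumed example region: four cores (C), one memory interface (MEM)
# and one inter-region interconnection interface (NI).
REGION: Tile = [
    ["C", "C", "MEM"],
    ["C", "C", "NI"],
]

if __name__ == "__main__":
    for name, inst in [("copy", copy_tile(REGION)),
                       ("rotated", rotate_cw(REGION)),
                       ("mirrored", mirror_h(REGION))]:
        print(name, inst)
```

Each derived instance contains exactly the same cells as the original tile, which is the property that lets a single implemented region stand in for all m regions.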

Drawings

FIG. 1 is a schematic diagram of a conventional shared cache based multi-core or many-core processor.

FIG. 2 is a schematic diagram of a multi-core or many-core processor based on an on-chip interconnection network.

FIG. 3 is a schematic architecture diagram of a first embodiment of the present invention.

FIG. 4 is a schematic diagram of a chip architecture of a 64-core processor in the prior art.

FIG. 5 is a schematic diagram of a 64-core processor chip according to a first embodiment of the present invention.

FIG. 6 is a schematic diagram of a chip architecture of a 1024-core processor in the prior art.

FIG. 7 is a schematic diagram of a 1024-core processor chip according to a second embodiment of the present invention.

Detailed Description

The first embodiment is as follows:

As shown in FIG. 3, the multi-core or many-core processor chip with autonomous regions of this embodiment includes a chip body formed by n processor cores, where the chip body is divided into m regions, each region includes i processor cores, j memory interfaces, k I/O interfaces, and an inter-region interconnection interface, a complete shared storage system is formed in each region based on a shared cache or an on-chip interconnection network, and the m regions form the full chip through the inter-region interconnection interfaces. In this embodiment, the number of full-chip processor cores is n = m × i, that is, n is the product of the number m of regions and the number i of processor cores in a single region.
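The parameterization above can be captured in a minimal sketch. The class and field names are hypothetical; only the symbols m, i, j, k, the relation n = m × i and the constraint that m, i, j and k are at least 1 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionConfig:
    """One autonomous region (hypothetical model)."""
    cores: int              # i: processor cores per region
    memory_interfaces: int  # j: memory interfaces per region
    io_interfaces: int      # k: I/O interfaces per region

    def __post_init__(self) -> None:
        # i, j and k are integers greater than or equal to 1.
        for name in ("cores", "memory_interfaces", "io_interfaces"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")


@dataclass(frozen=True)
class ChipConfig:
    """A chip built from m identical regions (hypothetical model)."""
    regions: int            # m: number of regions
    region: RegionConfig

    def __post_init__(self) -> None:
        if self.regions < 1:
            raise ValueError("regions must be >= 1")

    @property
    def total_cores(self) -> int:
        """n = m * i."""
        return self.regions * self.region.cores


if __name__ == "__main__":
    chip = ChipConfig(regions=8, region=RegionConfig(8, 1, 1))
    print(chip.total_cores)
```

With 8 regions of 8 cores each, as in the 64-core chip of this embodiment, `total_cores` evaluates to 64.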

In this embodiment, at least one memory interface in the m regions is led out of the chip through chip pins, so that the number of full-chip memory interfaces is between 1 and m × j, where m is the total number of regions into which the chip body is divided, and j is the number of memory interfaces included in each region. If all the memory interfaces of every region are led out through chip pins, the number of full-chip memory interfaces is m × j. If only one memory interface of one region is led out through chip pins, the number of full-chip memory interfaces is 1. Depending on the number and layout of the chip pins, a chip can lead out some or all of its memory interfaces, so the number of full-chip memory interfaces lies between 1 and m × j.

In this embodiment, at least one I/O interface in the m regions is led out of the chip through chip pins, so that the number of full-chip I/O interfaces is between 1 and m × k, where m is the total number of regions into which the chip body is divided, and k is the number of I/O interfaces included in each region. If all the I/O interfaces of every region are led out through chip pins, the number of full-chip I/O interfaces is m × k. If only one I/O interface of one region is led out through chip pins, the number of full-chip I/O interfaces is 1. Depending on how the chip pins are planned, a chip can lead out some or all of its I/O interfaces, so the number of full-chip I/O interfaces lies between 1 and m × k.
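The bounds on the number of interfaces led out through chip pins can be checked with a short sketch. The function names are hypothetical; the range 1 to m × j (or 1 to m × k for I/O interfaces) is the one stated above.

```python
def led_out_range(m: int, per_region: int) -> range:
    """Allowed numbers of full-chip interfaces: 1 .. m * per_region.

    `per_region` stands for j (memory interfaces) or k (I/O interfaces).
    """
    if m < 1 or per_region < 1:
        raise ValueError("m and per_region must be >= 1")
    return range(1, m * per_region + 1)


def is_valid_led_out(m: int, per_region: int, led_out: int) -> bool:
    """True if `led_out` interfaces can be brought out through chip pins."""
    return led_out in led_out_range(m, per_region)


if __name__ == "__main__":
    # 8 regions with 1 interface each: any count from 1 to 8 is allowed.
    print(list(led_out_range(8, 1)))
    print(is_valid_led_out(8, 1, 2))
```

The same check applies to memory interfaces and to I/O interfaces; only the per-region count passed in differs.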

The multi-core or many-core processor chip with autonomous regions in this embodiment is specifically a 64-core processor chip. For comparison, the architecture of a conventional 64-core processor chip is shown in FIG. 4, where every 4 cores (denoted by C in the figure, with $ denoting a core's private cache) form 1 node through a shared cache (denoted by Cache in the figure), and the 16 nodes are connected through a mesh on-chip network, to which the memory interfaces (denoted by MEM in the figure) and the I/O interfaces (denoted by I/O in the figure) are attached. With this structure, a processor must traverse the on-chip interconnection network to access any memory, and accessing a remote memory interface requires multiple hops, so the latency is large.
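The multi-hop cost of the conventional structure can be illustrated with a rough sketch. It assumes, beyond what the text states, that the 16 nodes are arranged as a 4 × 4 two-dimensional mesh, that routing is minimal (hop count equals Manhattan distance), and that the memory interface of interest is attached to the corner node (0, 0). All names are hypothetical.

```python
from itertools import product
from typing import Tuple

Node = Tuple[int, int]


def mesh_hops(src: Node, dst: Node) -> int:
    """Hop count between two mesh nodes under minimal routing (assumed)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hop_stats(width: int, height: int, target: Node) -> Tuple[float, int]:
    """Average and maximum hop count from every node to `target`."""
    hops = [mesh_hops((x, y), target)
            for x, y in product(range(width), range(height))]
    return sum(hops) / len(hops), max(hops)


if __name__ == "__main__":
    # Assumed 4 x 4 mesh with a memory interface at corner node (0, 0).
    average, worst = hop_stats(4, 4, (0, 0))
    print(average, worst)
```

Under these assumptions a node needs 3 hops on average and 6 hops in the worst case to reach that memory interface, whereas an access that stays inside one region involves no on-chip network hops at all.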

As shown in FIG. 5, the 64-core processor chip of this embodiment is divided into 8 regions. The 8 cores of each region (denoted by C in the figure, together with their private caches) share a cache (denoted by Cache in the figure) through a crossbar switch. Each region includes 1 memory interface (denoted by MEM in the figure), 1 I/O interface (denoted by I/O in the figure), and 1 inter-region interconnection interface (denoted by NI in the figure), and the 8 regions are connected through the inter-region interconnection interfaces NI to form the full chip. All the memory interfaces are led out of the chip, but only the I/O interfaces of two regions are led out of the chip. With this region-autonomous structure, a processor does not need to traverse an on-chip interconnection network when accessing local memory, so the latency is small; hardware implementation only needs to consider a region of 8 cores, so the complexity is low.

The second embodiment is as follows:

This embodiment is basically the same as the first embodiment; the difference is that the chip in this embodiment is specifically a 1024-core processor chip.

For comparison, as shown in FIG. 6, every 4 cores of a conventional 1024-core processor chip form 1 node through a shared cache, and the 256 nodes are connected through a mesh on-chip network, to which the memory interfaces and the I/O interfaces are attached. With this structure, a processor must traverse the on-chip interconnection network to access any memory, and accessing a remote memory interface requires multiple hops, so the latency is large.

As shown in FIG. 7, the 1024-core processor chip of this embodiment is divided into 4 regions. The 64 cores in each region (denoted by C in the figure, together with their private caches) are connected through a mesh on-chip network. Each region includes 2 memory interfaces (denoted by MEM in the figure), 1 I/O interface (denoted by I/O in the figure) and 1 inter-region interconnection interface (denoted by NI in the figure), and the 4 regions are fully interconnected through the inter-region interconnection interfaces to form the chip. All the memory interfaces are led out of the chip, but only the I/O interfaces of two regions are led out of the chip. With this region-autonomous structure, the latency of a processor accessing local memory is small, the bandwidth is comparable, and the complexity of hardware implementation is low.
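The full interconnection of the regions can be sketched as follows. The sketch assumes that full interconnection means one direct link between every pair of regions; the function name and the region numbering are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def full_interconnect_links(m: int) -> List[Tuple[int, int]]:
    """Direct links between every pair of m regions (assumed topology)."""
    if m < 1:
        raise ValueError("m must be >= 1")
    return list(combinations(range(m), 2))


if __name__ == "__main__":
    links = full_interconnect_links(4)
    print(len(links), links)
```

For 4 regions this yields m × (m − 1) / 2 = 6 links, and any region reaches any other region over a single inter-region link.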

The above description is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiments, and all technical solutions falling within the idea of the present invention belong to its protection scope. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention should also be considered to be within the protection scope of the present invention.

Claims

1. A multi-core or many-core processor chip with autonomous regions, comprising a chip body consisting of n processor cores, characterized in that: the chip body is divided into m regions, the instance of one region is designed and implemented, the instances of the other regions are obtained by copying, rotating or mirroring, each region comprises i processor cores, j memory interfaces, k I/O interfaces and one inter-region interconnection interface, a complete shared storage system is formed in each region based on a shared cache or an on-chip interconnection network, the m regions form the whole chip through the inter-region interconnection interfaces, and m, i, j and k are integers greater than or equal to 1.

2. The multi-core or many-core processor chip with autonomous regions of claim 1, wherein: at least one memory interface in the m regions is led out of the chip through chip pins, so that the number of full-chip memory interfaces is between 1 and m × j, wherein m is the total number of regions into which the chip body is divided, and j is the number of memory interfaces contained in each region.

3. The multi-core or many-core processor chip with autonomous regions of claim 1, wherein: at least one I/O interface in the m regions is led out of the chip through chip pins, so that the number of full-chip I/O interfaces is between 1 and m × k, wherein m is the total number of regions into which the chip body is divided, and k is the number of I/O interfaces contained in each region.