CN115658594A

CN115658594A - Heterogeneous multi-core processor architecture based on NIC-400 cross matrix

Info

Publication number: CN115658594A
Application number: CN202211361457.6A
Authority: CN
Inventors: 赵春林; 周昱; 邵健; 胡鹏
Original assignee: Cetc Shentai Information Technology Co ltd
Current assignee: Cetc Shentai Information Technology Co ltd
Priority date: 2022-11-02
Filing date: 2022-11-02
Publication date: 2023-01-31

Abstract

The invention relates to a heterogeneous multi-core processor architecture based on a NIC-400 cross matrix. The heterogeneous processor cores are uniformly addressed and communicate asynchronously with the NIC-400 to meet the requirements of bandwidth, physical-layer implementation and multi-node parallel computation. The heterogeneous cores, which include a display engine DPU, have different pipeline structures, can each act as a host and complete operations independently, and communicate with the interconnect through a detachable asynchronous bridge. The invention provides a communication-centric CoreLink NIC-400 cross matrix architecture that solves, at the level of system architecture, the problems brought by a bus architecture, and helps shorten the development cycle of heterogeneous multi-core processors.

Description

Heterogeneous multi-core processor architecture based on NIC-400 cross matrix

Technical Field

The invention relates to the technical field of heterogeneous multi-core processors, in particular to a heterogeneous multi-core processor architecture based on a NIC-400 cross matrix.

Background

With the rapid development of emerging mobile-terminal multimedia industries such as multimedia applications, image processing, virtual reality and computer vision, more stringent requirements are being placed on system-level multimedia solutions. Parallel transaction processing makes the interconnection of heterogeneous cores more complicated, real-time high-performance parallel computing is increasingly sensitive to communication bandwidth and latency, and timing closure of a heterogeneous multi-core processor at advanced process nodes is increasingly difficult. Architectural optimization, parallel computing and physical implementation of system-level multimedia chips are therefore current research hot spots.

A system-level multimedia chip integrates various special-purpose processor cores, such as a CPU for general-purpose complex logic operations, a GPU (a rendering architecture for large-scale parallel computing), a video encoding and decoding processor VPU, a multi-format display processor DPU, a data-driven neural network accelerator NPU, and a dedicated mass-data processing unit (also abbreviated DPU). A suitable bus architecture can release the performance of each heterogeneous processor core and address problems of parallel computing, communication bandwidth, system power consumption, access latency and the like.

Disclosure of Invention

Therefore, the technical problem to be solved by the invention is to overcome the inefficient parallel transaction processing, the high power consumption of globally synchronous transmission, the low communication bandwidth caused by bus arbitration and similar problems of current system-level heterogeneous multi-core processors, and to provide a parallel-transmission interconnect architecture based on an asynchronous cross matrix, so as to release the maximum efficiency of each heterogeneous core, raise the parallel transmission rate and reduce interaction latency.

In order to solve this technical problem, the heterogeneous multi-core processor architecture based on the NIC-400 cross matrix comprises heterogeneous processor cores that are uniformly addressed; the heterogeneous cores and the NIC-400 communicate asynchronously to meet the requirements of bandwidth, physical-layer implementation and multi-node parallel computation. The heterogeneous processor consists of a plurality of heterogeneous cores, namely a CPU core for complex computation, a high-throughput parallel computing unit GPU, a video decoding unit VPU supporting multiple video formats, and a display engine DPU supporting multiple formats. The heterogeneous cores have different pipeline structures, can each act as a host and complete operations independently, and communicate with the interconnect through a detachable asynchronous bridge. In addition, properties such as full configurability, non-blocking operation, low latency and low power consumption allow the PPA (power, performance, area) requirements of a mobile-terminal multimedia SoC to be met at the architectural level.

In an embodiment of the present invention, the NIC-400 includes routing nodes in one-to-one correspondence with the network interfaces, a connection line for cutting off data feedback is connected between two adjacent nodes, and the bit width and frequency of each heterogeneous core interface on the NIC-400 cross matrix can be configured arbitrarily. This allows the target hosts to achieve high-performance parallel communication and computation at lower power consumption and area.

In an embodiment of the present invention, each heterogeneous core interface in the NIC-400 cross matrix uses a fully asynchronous clock design, and the cross matrix has a plurality of configurable parameters to meet the PPA requirements of the system. According to the number of outstanding transactions (OT) that each host can issue, the global ID bit width can be adjusted arbitrarily to reduce the average access latency. According to the requirements of the application scenario, the interface type, clock domain and interface bit width of each heterogeneous core can be configured arbitrarily; in addition, a register slice can be inserted independently into each interface channel to meet timing and latency requirements.

The NIC-400 can maximize performance for high-throughput applications, minimize the power consumption of mobile terminal devices, and ensure quality of service for system communication. The advanced QoS-400 can dynamically regulate the communication transactions of the whole network according to different configurations, the QVN-400 QoS virtual network can effectively prevent arbitration blocking, and the TLX-400 can reduce routing congestion and ease timing closure on long paths.

In one embodiment of the present invention, the asynchronous bridge is a detachable, handshake-based structure. Both the host side and the slave side of the asynchronous bridge have FIFOs in their respective clock domains, each of the five channels of the AXI interface has its own independent FIFO, and the width of each FIFO is the sum of the bit widths of all signals of that channel. The FIFOs provide reliable transfer of large amounts of data, meeting the high-bandwidth parallel computing requirements of the heterogeneous cores.
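The FIFO sizing rule above (one FIFO per AXI channel, width equal to the sum of that channel's signal widths) can be illustrated with a short Python sketch. The signal widths below are assumptions for a hypothetical 64-bit AXI4 interface with 32-bit addresses and 4-bit IDs, with optional signals such as REGION and USER omitted; they are not taken from the patent.

```python
# Hypothetical per-signal widths (bits) for a 64-bit AXI4 interface with
# 32-bit addresses and 4-bit IDs. Optional signals (REGION, USER) are omitted.
AXI_CHANNELS = {
    "AW": {"awid": 4, "awaddr": 32, "awlen": 8, "awsize": 3, "awburst": 2,
           "awlock": 1, "awcache": 4, "awprot": 3, "awqos": 4},
    "W":  {"wdata": 64, "wstrb": 8, "wlast": 1},
    "B":  {"bid": 4, "bresp": 2},
    "AR": {"arid": 4, "araddr": 32, "arlen": 8, "arsize": 3, "arburst": 2,
           "arlock": 1, "arcache": 4, "arprot": 3, "arqos": 4},
    "R":  {"rid": 4, "rdata": 64, "rresp": 2, "rlast": 1},
}


def fifo_widths(channels):
    """Return the FIFO width of each channel as the sum of its signal widths."""
    return {name: sum(signals.values()) for name, signals in channels.items()}


widths = fifo_widths(AXI_CHANNELS)
for channel, width in widths.items():
    print(f"{channel}: {width} bits")
```

Under these assumed widths, the two address channels and the two data channels need much wider FIFOs than the write-response channel, which is why sizing each channel separately saves area.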

In one embodiment of the invention, a memory sharing mode is adopted between the CPU and the GPU in the heterogeneous cores, so that the communication overhead between the heterogeneous cores can be effectively reduced; frequent accesses by the two cores may cause blockages at the system memory controller, and NIC-400 with QVN-400 and QoS-400 functionality may effectively address these problems based on the differences in latency and bandwidth requirements of the two cores.

In one embodiment of the invention, any interface in the heterogeneous processor is provided with a NIC-400 with scalable bandwidth. The GPU performs parallel computation over large amounts of similar data, while the CPU handles more general transactions and high-performance computation with more complex logic. The scalable NIC-400 improves the parallel computing performance and concurrent transaction processing capacity of the heterogeneous multi-core system, and can be flexibly configured as a multi-level bus architecture to reduce access latency.

In an embodiment of the present invention, Switches and Bridges with different clock domains and configurable data widths are used in the NIC-400 cross matrix to meet the performance requirements of the hosts, and hierarchical clock gating effectively reduces the power consumption of hosts in idle clock domains. Furthermore, a Switch connecting multiple hosts may be split into smaller Switches to raise the frequency and reduce the latency of critical paths.

In one embodiment of the invention, the NIC-400 has system-level features: it can automatically optimize the system architecture, and its parameters can be configured dynamically through the GPV. While meeting the actual bandwidth or performance requirements, Thin Links can relieve routing congestion in the physical implementation.

In one embodiment of the present invention, the NIC-400 also has an IP suite that can evaluate and optimize the area of the subsystem or sub-module and the latency of the host accessing the slave through built-in algorithms.

Compared with the prior art, the technical scheme of the invention has the following advantages: the heterogeneous multi-core processor architecture based on the NIC-400 cross matrix releases the performance of each heterogeneous processor core through a suitable bus architecture, raises the parallel transmission rate and reduces interaction latency, simplifies the interfaces between the heterogeneous cores and the cross matrix, and, with simple and efficient routing, makes physical timing closure easier to achieve.

Drawings

In order that the present disclosure may be more readily and clearly understood, reference will now be made in detail to the present disclosure, examples of which are illustrated in the accompanying drawings.

FIG. 1 is a schematic diagram of a heterogeneous multi-core processor architecture based on a NIC architecture of the present invention;

FIG. 2 is a schematic diagram of a processor bus topology based on a NIC-400 crossbar according to the present invention;

FIG. 3 is a schematic diagram of QoS policies for various types of hosts according to the present invention;

FIG. 4 is a schematic diagram of QVN policies for various types of hosts according to the present invention.

Detailed Description

As shown in fig. 1, this embodiment provides a heterogeneous multi-core processor architecture based on a NIC-400 cross matrix. In this architecture the heterogeneous processor cores are uniformly addressed, and the heterogeneous cores and the NIC-400 communicate asynchronously to meet the requirements of bandwidth, physical-layer implementation and multi-node parallel computation. The heterogeneous processor is composed of multiple heterogeneous cores, namely a CPU core for complex computation, a high-throughput parallel computing unit GPU, a video decoding unit VPU supporting multiple video formats, and a display engine DPU supporting multiple formats. The heterogeneous cores have different pipeline structures, can each act as a host and complete operations independently, and communicate with the interconnect through a detachable asynchronous bridge.

In the system-level heterogeneous multi-core processor architecture based on the asynchronous CoreLink NIC-400 cross matrix, all host interface parameters of the cross matrix are configured fully independently, and the physical implementation uses asynchronous communication. To meet the bandwidth and memory access latency requirements of each heterogeneous core, the CPU (central processing unit) has exclusive use of the processor bus, the GPU (graphics processing unit) and the VPU (video processing unit) share the multimedia bus, and the real-time host DPU has exclusive use of the display bus.

Furthermore, as shown in fig. 1, the CPU has a 64-bit AXI3 interface running at 667 MHz. The ROM attached to the processor bus holds the first-stage boot code of the CPU, and the SRAM serves as variable storage or stack space for CPU applications, which speeds up CPU start-up. Both the ROM and the SRAM have AXI3 interfaces supporting multiple outstanding transactions (OT), which increases instruction and data parallelism.

The GPU and the VPU have 128-bit AXI4 interfaces running at 500 MHz and 250 MHz respectively. These bandwidth-greedy hosts share the multimedia (audio and video) bus, and only one DDR is placed on the slave side; this strategy meets the communication bandwidth and low access latency requirements of the GPU and the VPU.

The DPU has a 64-bit AXI4 interface running at 250 MHz. To guarantee its real-time behavior, the DPU has exclusive use of the display bus, and only one DDR is placed on the slave side, which meets the real-time and bandwidth requirements of the DPU.
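To put the interface figures above in perspective, the following Python sketch computes the theoretical peak bandwidth of each host interface as width times frequency. It assumes one data beat per clock cycle on a single data channel and ignores protocol overhead; it also assumes that both the GPU and the VPU interfaces are 128 bits wide, as the description suggests. The function name is hypothetical.

```python
# (data width in bits, clock frequency in MHz) as read from the description.
HOST_INTERFACES = {
    "CPU": (64, 667),
    "GPU": (128, 500),
    "VPU": (128, 250),
    "DPU": (64, 250),
}


def peak_bandwidth_mb_s(width_bits, freq_mhz):
    """Theoretical peak in MB/s (10^6 bytes/s), one data beat per cycle."""
    return (width_bits // 8) * freq_mhz


peaks = {name: peak_bandwidth_mb_s(w, f) for name, (w, f) in HOST_INTERFACES.items()}
for name, mb_s in peaks.items():
    print(f"{name}: {mb_s} MB/s per data channel")
```

On these assumptions the GPU and VPU together can demand far more than the DPU, which is consistent with placing the real-time DPU on a bus of its own.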

As shown in fig. 2, in the topology of the heterogeneous multi-core processor bus architecture each AXI interface is configurable, and the processor bus is a synchronous design with a single global clock. The host interface is configured for 10 read and 22 write transactions respectively, the QoS priority types are host control and non-locking transmission, and the QoS policy uses the TRR and LR arbitration strategies. Register slices may be inserted to resolve timing issues; whether they are configured depends on the situation. In addition, the host interface that accesses the DDR uses a Thin Links connection, and the interface type of the Switch is AXI.

The multimedia bus and the display bus of the heterogeneous multi-core processor can be configured in the same way as the processor bus. For the multimedia bus, the QoS priority types of the host interfaces are host control and non-locking transmission, and the QoS policy uses the OTT strategy; the structure of the Switch is shown in fig. 1. For the display bus, the QoS priority types of the host interfaces are host control and non-locking transmission, and the QoS policy uses the LR arbitration strategy.
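The bus assignment and QoS strategy per bus described above can be summarized as a small configuration table. The Python sketch below is only an illustrative data model; the class and function names are hypothetical and do not correspond to any NIC-400 configuration tool or file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bus:
    """One bus of the cross matrix, as described in the text (illustrative)."""
    name: str
    hosts: tuple          # hosts that occupy this bus
    qos_strategy: tuple   # QoS strategies named in the description


BUSES = (
    Bus("processor", ("CPU",), ("TRR", "LR")),
    Bus("multimedia", ("GPU", "VPU"), ("OTT",)),
    Bus("display", ("DPU",), ("LR",)),
)


def bus_of(host):
    """Return the bus that a given host is attached to."""
    for bus in BUSES:
        if host in bus.hosts:
            return bus
    raise KeyError(f"unknown host: {host}")


print(bus_of("VPU").name, bus_of("VPU").qos_strategy)
```

Keeping the mapping in one table makes it easy to check that every host sits on exactly one bus before the interconnect is generated.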

The global ID and the optimization options are then configured to obtain an optimized bus architecture. Finally, RTL validation is run to check the correctness of the RTL code and of the configuration parameters.

Each heterogeneous processor core is connected to a host interface of the cross matrix bus through an asynchronous bridge with adjustable FIFO depth. The FIFO on the core side packs all signals of a channel into one word and sends them to the host interface together, and the FIFO at the host interface unpacks the word back into individual signals, completing reliable transfer of the channel information. Clock gating and dynamic voltage and frequency scaling can be applied as required to reduce power consumption effectively.
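The packing and unpacking of channel signals through the bridge FIFO can be sketched in Python as follows. The field layout is a hypothetical write-data channel (64-bit data, 8-bit strobe, 1-bit last); the ordering of fields inside the FIFO word is an assumption made for illustration only.

```python
# Hypothetical layout of an AXI write-data channel, most significant field first.
W_CHANNEL = [("wdata", 64), ("wstrb", 8), ("wlast", 1)]


def pack(fields, values):
    """Concatenate all channel signals into a single FIFO word."""
    word = 0
    for name, width in fields:
        value = values[name]
        if value < 0 or value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Split a FIFO word back into the individual channel signals."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


beat = {"wdata": 0xDEADBEEFCAFEF00D, "wstrb": 0xFF, "wlast": 1}
word = pack(W_CHANNEL, beat)
print(hex(word), unpack(W_CHANNEL, word) == beat)
```

Because every signal of a beat travels in the same FIFO word, the receiving clock domain sees the signals change together, which is what makes the transfer reliable across the clock boundary.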

An asynchronous bridge with interlocked bidirectional handshaking is used. The pulse width of the handshake signals adjusts automatically to the transfer conditions, the host side and the slave side share no common clock reference, and no strict timing relationship is required between a heterogeneous core and the cross matrix. The asynchronous bridge effectively matches the bandwidths of the heterogeneous cores and the cross matrix bus; because it is detachable, the interface between a heterogeneous core and the cross matrix is simpler, and the simple, efficient routing helps achieve physical timing closure.

The NIC-400 cross matrix bus is compatible with masters and slaves using various AMBA protocol bus interfaces. Master interfaces are connected to slave interfaces through multi-stage Switches, a strategy that effectively reduces the number of wires between masters and slaves. Fewer interconnect wires allow a higher bus frequency and relieve the bandwidth problem, so the overall performance better meets the requirements of heterogeneous multi-core applications.

Fig. 3 lists the priority policy and QoS value of each type of host; the arbitration policy of each host interface is configured according to this table to meet the bandwidth, latency and other requirements of each host.

As shown in fig. 4, the QVN features of the host interfaces are configured with one virtual network per interface and a token pre-allocation policy, which ensures that every transaction will be accepted by the slave. Configuring QVN prevents transactions from blocking at a shared Switch.

The advanced quality of service QoS-400 and the virtual network QVN-400 are configured to manage data transfers efficiently within the bandwidth and latency constraints acceptable to each host. The CPU is a latency-sensitive host, the GPU is a transaction-type host, and display-related hosts such as the DPU are real-time hosts. The QoS regulation strategy therefore first minimizes the latency of the real-time hosts, then minimizes the latency of the latency-sensitive hosts, and finally allocates bandwidth to the transaction-type hosts.

The CPU interface uses the TRR and LR strategies to guarantee the number of outstanding transactions (OT) and reduce transaction latency. The GPU interface uses the OTT strategy to guarantee the number of simultaneous outstanding transactions and to prevent transaction blocking. The real-time DPU interface uses the LR strategy to guarantee real-time transactions. The QoS priority value is derived from the sideband information of the host transaction. In general, real-time display-related hosts such as the DPU have the highest QoS value, hosts with many outstanding transactions such as the GPU have the lowest QoS value, and the QoS value of the CPU lies in between.
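The priority ordering described above (DPU highest, CPU in the middle, GPU lowest) can be illustrated with a minimal priority arbiter in Python. The numeric QoS values are hypothetical; the real values are listed in fig. 3 and are not reproduced here. The sketch models only the selection by QoS value, not the TRR, LR or OTT regulators.

```python
# Hypothetical 4-bit QoS values that respect the ordering given in the text:
# real-time DPU > latency-sensitive CPU > transaction-type GPU.
QOS_VALUE = {"DPU": 15, "CPU": 8, "GPU": 2}


def arbitrate(pending):
    """Grant the pending host with the highest QoS value, or None if idle."""
    if not pending:
        return None
    return max(pending, key=lambda host: QOS_VALUE[host])


def grant_order(pending):
    """Order in which a set of simultaneous requests would be granted."""
    remaining = list(pending)
    order = []
    while remaining:
        winner = arbitrate(remaining)
        order.append(winner)
        remaining.remove(winner)
    return order


print(grant_order(["GPU", "CPU", "DPU"]))
```

A pure priority scheme like this can starve the lowest-priority host, which is one reason a real interconnect combines priority with regulators that bound rates and outstanding transactions.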

The QVN virtual network prevents blocking at the interface. To ensure that every transaction can be accepted, each transaction sent by a master must first obtain a token from the corresponding slave. Each host is configured with pre-allocated tokens.
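The token rule above can be sketched as simple bookkeeping: a host may issue a transaction to a slave only while it holds a token for that slave. The Python model below is a deliberate simplification (tokens are returned when a transaction completes) and does not reproduce the actual QVN-400 signaling; the class and method names are hypothetical.

```python
class SlaveTokens:
    """Tokens one host holds for one slave (simplified, illustrative model)."""

    def __init__(self, preallocated):
        if preallocated < 0:
            raise ValueError("token count must be non-negative")
        self.capacity = preallocated   # tokens pre-allocated to this host
        self.available = preallocated  # tokens not currently in use

    def try_issue(self):
        """Consume a token and issue a transaction; refuse if none is left."""
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def complete(self):
        """Return a token once the slave has finished a transaction."""
        if self.available == self.capacity:
            raise RuntimeError("no transaction in flight")
        self.available += 1


ddr = SlaveTokens(preallocated=2)
issued = [ddr.try_issue() for _ in range(3)]
print(issued)  # the third attempt is held back at the host
```

Because a transaction without a token never enters the shared Switch, it cannot sit in front of traffic from other hosts while waiting for the slave.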

Furthermore, the general-purpose processor uses a symmetric multi-core architecture with a multimedia acceleration unit, which improves parallel transaction processing capability and performance while effectively reducing power consumption. To get the best performance from the general-purpose processor, the split L1 caches each use a 32 KB set-associative structure, and a 1 MB L2 cache is shared among the cores. The accelerator coherency port is configured to maintain cache coherency for each core. In addition, this port allows cache contents to be shared with other modules and supports all standard reads and writes, so no additional coherency module is needed, and the overall performance of the multi-core processor improves without increasing power consumption.
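As a worked example of what a 32 KB set-associative L1 cache implies for address decoding, the Python sketch below derives the number of sets and the offset, index and tag widths. The associativity (4 ways), line size (64 bytes) and address width (32 bits) are assumptions chosen for illustration; the description specifies only the 32 KB capacity.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits):
    """Derive set count and address field widths for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = (line_bytes - 1).bit_length()  # byte offset within a line
    index_bits = (sets - 1).bit_length()         # selects the set
    tag_bits = addr_bits - index_bits - offset_bits
    return {"sets": sets, "offset_bits": offset_bits,
            "index_bits": index_bits, "tag_bits": tag_bits}


# Assumed parameters: 4-way, 64-byte lines, 32-bit physical addresses.
l1 = cache_geometry(size_bytes=32 * 1024, ways=4, line_bytes=64, addr_bits=32)
print(l1)
```

With these assumed parameters the cache has 128 sets, so an address splits into a 6-bit offset, a 7-bit index and a 19-bit tag.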

The NEON can assist the VPU to complete the fast decoding operation of various coding standards, and the NEON can improve the performance of a complex video codec by 60-150%.

The clock and power supply of the NEON can be gated separately, thereby saving a large amount of dynamic and static power consumption as a whole.

Deep task partitioning and scheduling are employed to hide the communication overhead of the general purpose processor and the heterogeneous cores. The sub-threads are further split, a part of the subtasks are executed by a general processor CPU, and a part of the subtasks are executed by other heterogeneous cores. The CPU is responsible for task scheduling, and other heterogeneous cores are responsible for accelerating operation, so that the GPU, the VPU and the data acquisition equipment with high data correlation can be arranged on the same bus to reduce mutual access delay.

As the number of master and slave interfaces grows, fan-out, capacitance and routing all increase, and registers must be inserted to keep the bus frequency up. If the number of outstanding transactions (OT) per host is not reduced, the access latency can then increase dramatically. The NIC-400 cross matrix is therefore configured as a 2×4 cross matrix with a single-stage Switch in series.

The NIC-400 cross matrix itself runs on a synchronous clock, and the master and slave interfaces are decoupled from it through asynchronous bridges; the DDR controller and the cross matrix are designed synchronously to improve communication bandwidth and reduce access latency. The interface bit width is configured as 128 bits, and the CPU core bus runs at 667 MHz to meet the frequency requirement of the DDR.

It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived from the above remain within the scope of protection of the invention.

Claims

1. A heterogeneous multi-core processor architecture based on a NIC-400 cross matrix, characterized by comprising heterogeneous processor cores that are uniformly addressed, the heterogeneous cores and the NIC-400 communicating asynchronously to meet the requirements of bandwidth, physical-layer implementation and multi-node parallel computing, wherein the heterogeneous processor consists of a plurality of heterogeneous cores, namely a CPU (central processing unit) core for complex computing, a high-throughput parallel computing unit GPU (graphics processing unit), a video decoding unit VPU supporting multiple video formats, and a display engine DPU supporting multiple formats; the heterogeneous cores have different pipeline structures, can each act as a host and complete operations independently, and communicate with the interconnect through a detachable asynchronous bridge.

2. The NIC-400 crossbar based heterogeneous multi-core processor architecture of claim 1, wherein: the NIC-400 comprises routing nodes corresponding to the network interfaces one to one, connecting lines for stopping data feedback are connected between two adjacent nodes, and bit width and frequency of each heterogeneous core interface based on the NIC-400 cross matrix can be configured at will.

3. The NIC-400 crossbar based heterogeneous multi-core processor architecture of claim 2, wherein: each heterogeneous core interface in the NIC-400 cross matrix adopts a fully asynchronous clock design, the cross matrix has a plurality of configurable parameters to meet the PPA requirement of a system, and the global ID bit width can be adjusted at will to reduce the average memory access delay according to the OT requirement which can be executed by each host.

4. The NIC-400 crossbar based heterogeneous multi-core processor architecture of claim 1, wherein: the asynchronous bridge is a detachable structure integration based on a handshake form, a host side and a slave side of the asynchronous bridge are both provided with FIFOs of various clock domains, five channels of an AXI interface are respectively provided with independent FIFOs, and the width of each FIFO is the sum of all signal bit widths of the current channel.

5. The NIC-400 crossbar based heterogeneous multi-core processor architecture of claim 1, wherein: the CPU and the GPU in the heterogeneous cores adopt a memory sharing mode, so that the communication overhead between the heterogeneous cores can be effectively reduced.

6. The NIC-400 crossbar based heterogeneous multi-core processor architecture of claim 1, wherein: any interface in the heterogeneous processor is provided with a NIC-400 with scalable bandwidth.

7. The NIC-400 crossbar-based heterogeneous multi-core processor architecture of claim 3, wherein: the NIC-400 cross matrix employs Switch and Bridge with different clock domains and configurable data width, and employs hierarchical clock gating technology.

8. The NIC-400 crossbar based heterogeneous multi-core processor architecture of claim 1, wherein: the NIC-400 has system-level characteristics, and can complete automatic optimization of system architecture and dynamically realize parameter configuration of the NIC-400 through GPV.

9. The NIC-400 crossbar based heterogeneous multi-core processor architecture of claim 1, wherein: the NIC-400 also has an IP suite that can evaluate and optimize the area of subsystems or sub-modules and the latency of the host to access the slaves through built-in algorithms.