WO2024002175A1

WO2024002175A1 - Instruction execution method, system controller and related product

Info

Publication number: WO2024002175A1
Application number: PCT/CN2023/103271
Authority: WO
Inventors: 张振兴; 刘少礼
Original assignee: 上海寒武纪信息科技有限公司
Priority date: 2022-06-29
Filing date: 2023-06-28
Publication date: 2024-01-04
Also published as: CN117348929A

Abstract

An instruction execution method, a system controller and a related product. The method comprises: during transmission of mixed-scale instructions, checking whether a mixed-scale instruction to be transmitted may be discarded; and when it is determined that said mixed-scale instruction may be discarded, blocking transmission of a particular mixed-scale instruction to be transmitted subsequent to said mixed-scale instruction.

Description

Instruction execution methods, system controllers and related products

Cross-references to related applications

This application claims priority to the Chinese patent application filed on June 29, 2022, with application number 202210764229.7 and titled "Instruction Execution Method, System Controller and Related Products".

Technical field

This disclosure relates generally to the field of instruction sets. More specifically, the present disclosure relates to an instruction execution method, a system controller, a system on a chip, a board card, and a machine-readable storage medium.

Background technique

Due to the end of Moore's Law and Dennard Scaling, the performance gain of general-purpose CPUs (Central Processing Units) continues to decline. Domain-specific architecture (DSA) has become the most promising and feasible way to continue to improve the performance and efficiency of the entire computing system. DSA ushered in a big explosion, which was regarded as opening a new golden age of computer architecture. Various DSAs have been proposed to accelerate specific applications, such as various xPUs, including DPU (Data Processing Unit) for data stream processing, GPU (Graphics Processing Unit) for image processing ), NPU (Neural network Processing Unit, neural network processor) for neural networks, TPU (Tensor Processing Unit, tensor processor) for tensor processing, etc. As more and more DSAs, especially DSAs used for computing purposes (also known as IP, Intellectual Property, intellectual property), are integrated into systems on chips (System on Chip, SoC) to achieve high efficiency, current computing The heterogeneity of hardware in systems also continues to grow, moving from standardization to customization.

Currently, IP typically only exposes IP-related hardware interfaces, forcing SoCs to manage the IP as a standalone device using code running on the host CPU. Since it is extremely difficult to manage hardware heterogeneity directly for application developers, significant efforts are often made to build programming frameworks to help application developers manage this hardware heterogeneity. For example, popular programming frameworks for deep learning include PyTorch, TensorFlow, MXNet, etc., all of which provide application developers with high-level, easy-to-use Python interfaces.

Unfortunately, this heterogeneity of software management in CPU-centric SoCs prevents user applications from running efficiently on different SoCs due to lower productivity and hardware utilization. Low productivity stems from both the programming framework and the application. For programming framework developers, in order to support different SoCs, the programming framework must use different IP to implement their respective high-level abstract interfaces, which requires a lot of development work. For application developers, the differences in different IPs in the SoC require different implementations of the same application, resulting in a heavy programming burden. And, this can become even worse for IP that is not supported by the programming framework, since hardware heterogeneity needs to be managed manually. Low hardware utilization is related to CPU-centric SoCs and IP with some generality. In current SoCs, the host CPU must treat IPs as independent devices and utilize code running on the host CPU (i.e., CPU-centric) to manage coordination between different IPs, resulting in both control and data exchange. There are costs that cannot be ignored in all aspects. Furthermore, with the integration of many IPs that have some commonality, domain-specific programming frameworks may not be able to leverage available IP from other domains to perform the same function. For example, using DLA (Deep Learning Accelerator, deep learning accelerator) requires explicit programming in Nivdia Tegra Xavier.

However, there are currently few studies investigating the programming productivity issues caused by growing hardware heterogeneity, and most research still focuses on improving the performance and energy efficiency of individual IPs. Some work develops SoC performance by scheduling IP by chain for flow-based applications in certain scenarios or adding shortcuts in hardware. Others have proposed a fractal approach to solving programming productivity problems, but on machine learning accelerators of varying sizes. As a result, growing hardware heterogeneity has completely changed the paradigm for building future SoC systems and raised key questions on how to build SoC systems with high productivity and high hardware utilization.

Contents of the invention

In order to at least partially solve one or more technical problems mentioned in the background art, the present disclosure provides solutions from multiple aspects. On the one hand, it provides a new unified system-on-chip architecture framework (which can be called system-on-a-chip, Soc-as-a-Processor, SaaP for short), which eliminates hardware heterogeneity from a software perspective and Improve programming productivity and hardware utilization. On the other hand, an architecture-free mixed-scale instruction set is provided to support high productivity and new components of SaaS, including storage vesicles for on-chip management and on-chip interconnects for data paths, thereby Build an efficient SaaS architecture. In yet another aspect, a compilation method is provided for compiling program codes of various high-level programming languages into mixed-scale instructions. Other aspects of this disclosure also provide solutions for branch prediction, exceptions and interrupts in instructions.

In a first aspect, the present disclosure discloses an instruction execution method, including: when issuing a Mixed-Scale (MS) instruction, checking whether the MS instruction may be discarded; and when it is determined that the MS instruction may be discarded Discard, blocking the issuance of a specific MS command following the MS command.

In a second aspect, the present disclosure discloses a system controller configured to perform the instruction execution method according to the aforementioned first aspect.

In a third aspect, the present disclosure discloses a machine-readable storage medium including code that, when executed, causes a machine to perform the method of the aforementioned first aspect.

In a fourth aspect, the present disclosure discloses a system on a chip (SoC), including the system controller of the aforementioned second aspect, and a plurality of heterogeneous IP cores, the plurality of heterogeneous IP cores serve as the execution of the MS instructions unit.

In a fifth aspect, the present disclosure discloses a board card including the system-on-chip of the fourth aspect.

According to the instruction execution method, system controller, machine-readable storage medium, system-on-chip and board provided as above, MS instructions that may cause high revocation costs can be blocked until all instructions that may be discarded before the instruction have been executed, that is, The status is determined. This instruction execution scheme can greatly improve the processing efficiency of the MS instruction pipeline in exception and interrupt handling.

Description of drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of illustration and not limitation, and like or corresponding reference numerals designate like or corresponding parts, wherein:

Figure 1 schematically shows a typical architecture of a SoC;

Figure 2 shows the hardware heterogeneity on the SoC;

Figure 3 shows a typical timeline for a traditional SoC;

Figure 4a schematically illustrates a SaaP architecture according to an embodiment of the present disclosure in a simplified diagram;

Figure 4b shows the traditional SoC architecture for comparison;

Figure 5 shows the overall architecture of a SaaP according to an embodiment of the present disclosure;

Figure 6 schematically shows an example process of performing tasks on the MISC architecture;

Figure 7 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure;

FIG. 8 shows an exemplary flowchart of an instruction execution method for a branch instruction according to an embodiment of the present disclosure;

Figure 9 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure;

Figure 10 shows an instruction execution example according to an embodiment of the present disclosure;

Figure 11 shows several different data path designs;

Figure 12 shows an exemplary flowchart of a compilation method according to an embodiment of the present disclosure;

Figure 13 shows an example program;

Figure 14 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.

Detailed ways

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, rather than all of the embodiments. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative efforts fall within the scope of protection of this disclosure.

It should be understood that the terms “first”, “second”, “third” and “fourth” that may appear in the claims, description and drawings of this disclosure are used to distinguish different objects, rather than to describe them. specific order. The terms "comprise" and "include" used in the description and claims of this disclosure indicate the presence of described features, integers, steps, operations, elements and/or components but do not exclude one or more other features, integers , the presence or addition of steps, operations, elements, components and/or collections thereof.

It should also be understood that the terminology used in the specification of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used in this disclosure and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms unless the context clearly dictates otherwise. It will be further understood that the term "and/or" as used in this specification and the claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As used in this specification and claims, the term "if" may be interpreted as "when" or "once" or "in response to determining" or "in response to detecting" depending on the context.

A system on a chip (SoC) is an integrated circuit chip that integrates all the key components of the system on the same chip. SoC is the most common integration solution in today's mobile/edge scenarios. Its high level of integration improves system performance, reduces overall power consumption and provides significantly smaller area costs compared to motherboard-based solutions.

Figure 1 schematically shows a typical architecture of a SoC.

Due to performance requirements under limited area/power budget, SoC usually integrates a lot of dedicated hardware IP, usually domain-specific architecture for computing purposes, especially to accelerate domain-specific applications or specific applications. Some of these hardware IPs are customized by SoC designers, such as neural network processing IP (Neural Engine (NE) in Apple A15, deep learning accelerator (DLA) in NVIDIA Jetson Xavier, neural processing unit in HiSilicon Kirin (NPU) and Samsung Exynos), some are standardized by IP suppliers, such as Arm or Imagination's CPU and GPU, Synopsys or Cadence's DSP (Digital Signal Processor, digital signal processor), Intel or Xilinx's FPGA (Field-Programmable Gate Array, field programmable gate array), etc.

In the example of Figure 1, CPU 101, GPU 102, NPU (Neural-Network Processing Unit, neural network processing unit) 103, on-chip RAM (Random Access Memory, random access memory) 104, DRAM (Dynamic Random Access Memory, dynamic random access memory) controller 105, arbiter (Arbiter) 106, decoder 107, external storage interface 108, bus bridge 109, UART (Universal Asynchronous Receiver/Transmitter, unified asynchronous receiver and transmitter) 110, GPIO ( General Purpose Input Output, general input and output) 111, ROM (Read Only Memory, read-only memory) interface 112, etc.

Traditional SoC designs utilize a shared data bus or Network on Chip (NoC) to link various components together. A common bus used for SoC on-chip interconnection is ARM’s open standard Advanced Microcontroller Bus Architecture (AMBA).

In the example of Figure 1, the SoC uses shared buses to connect and manage various functional blocks in the SoC. These shared buses include the Advanced High Performance Bus (AHB) for high-speed connections, and the Advanced High Performance Bus (AHB) for low-bandwidth low-speed connections. Advanced Peripheral Bus (APB). Other network-type topologies, namely NoC, can also be introduced to use router-based packet interaction networks to manage more components.

Integrating multiple different IPs results in hardware heterogeneity on the SoC. Hardware heterogeneity includes the heterogeneity of IP within SoC and the heterogeneity of IP between SoCs.

Figure 2 illustrates the hardware heterogeneity on the SoC. The figure shows several IP integrated on the SoC. For example, a certain model A integrates a CPU and a GPU on the SoC; a certain model B integrates a CPU, a GPU and a neural network module on the SoC. Neural Engine (NE) for processing; a certain model C integrates CPU, GPU and neural processing unit (NPU) for neural network processing in SoC; a certain model D integrates CPU, GPU, deep learning unit for deep learning in SoC Learning Accelerator (DLA) and Programmable Vision Accelerator (PVA).

As can be seen from the figure, the IPs on the same SoC are different, for example, used for different purposes. Regarding the heterogeneity of IP within SoC, this is due to the fact that more and more different types of IP (especially for computing purposes) are integrated into SoC to achieve high efficiency. New IP will continue to be introduced into SoC. For example, a new type of neural network processing IP has been widely introduced into recent mobile SoCs. Moreover, the number of processing units in an SoC continues to grow. For example, the SoC of a certain model A mainly includes 10 processing units (2 large cores, 2 small cores, and a 6-core GPU); while in a certain model B, the number of processing units increases to 30 (2 large cores) General-purpose cores, 4 small general-purpose cores, a 16-core neural engine, and a 5-core GPU).

Regarding the heterogeneity of IP between SoCs, the IP that implements the same function on different SoCs may vary greatly because one's own IP is always preferred for business reasons. For example, as shown in (b), (c) and (d) in Figure 2, the same functionality (such as neural network processing) is directed to different IPs. In a certain model B SoC is the neural engine (NE); in a certain model D is the deep learning accelerator (DLA); in a certain model C is the neural processing unit (NPU). In addition, many computing-purpose IPs are specific to a certain field (e.g., deep learning) or have certain generality for certain types of operations (e.g., GPUs with tensor operations).

Programming IP such as GPUs and NPUs for computing purposes can be achieved based on support from programming frameworks and vendors. For example, in order to accelerate neural network processing, application developers can directly use deep learning programming frameworks, such as PyTorch, TensorFlow, MXNet, etc., instead of direct manual management. These programming frameworks provide high-level programming interfaces (C++/Python) to customize IP, which are implemented using the IP vendor's low-level interfaces. IP suppliers provide different programming interfaces, such as PTX (Parallel Thread Execution, parallel thread execution), CUDA (Compute Unified Device Architecture, computing unified device architecture), cuDNN (CUDA Deep Neural Network library, CUDA deep neural network library) and NCCL (NVIDIA Collective Communications Library), etc., to make their hardware drivers suitable for these programming frameworks.

However, programming frameworks require extremely large development efforts because they are required to bridge the gap between software diversity and hardware diversity. Programming frameworks provide application developers with high-level interfaces to improve programming productivity, and these interfaces are carefully implemented to improve hardware performance and efficiency. For example, Tensorflow was initially developed by about 100 developers and is currently maintained by 3,000+ contributors to support dozens of SoC platforms. For thousands of Tensorflow operators, optimizing one operator on a certain IP may take a skilled developer several months. Even if a programming framework is utilized, application developers may be required to have different implementations for different SoCs. For example, a program written for a certain model D cannot be run directly on the server-side DGX-1 of TensorCore of the GPU.

It is difficult for programming frameworks to achieve high efficiency because the SoC is managed through the host CPU. Since the programming framework running on the host CPU controls the entire execution process, the interaction of control and data is inevitable. It uses only CPU-IP interaction for control, and only memory-IP interaction for data exchange.

Figure 3 shows a typical timeline for a traditional SoC. As shown in the figure, the host CPU runs the programming framework for runtime management, where each call to the IP will be started/ended by the host CPU, which brings non-negligible runtime overhead. The data is stored in off-chip main memory, and the IP reads/writes data from the main memory, which brings additional data access. For example, when running the neural network YOLO on a certain model D, control will be returned from the GPU to the programming framework 39 times, occupying 56.75M of DRAM space, 95.06% of which is unnecessary. According to Amdahl's law, the efficiency of a system is limited, especially for programs composed of fragmented operations.

Inventive idea

Considering that exposing hardware heterogeneity to management software can lead to low productivity and low hardware utilization, this disclosure proposes a solution that lets the SoC hardware manage heterogeneity by itself. The inventors noticed that in classic CPUs heterogeneous arithmetic The logic unit (Arithmetic Logic Unit, ALU) and the floating point unit (Float Point Unit, FPU) are regarded as execution units in the pipeline and are managed by hardware. Inspired by this, intuitively, IP can also be regarded as an execution unit in the IP-level pipeline, that is, a unified system-on-a-chip (SoC-as-a-Processor, SaaP).

Figure 4a schematically illustrates a SaaP architecture according to an embodiment of the present disclosure in a simplified diagram. For comparison, Figure 4b shows a traditional SoC architecture, where single lines represent control flow and wide lines represent data flow.

As shown in Figure 4a, the SaaP of the embodiment of the present disclosure reconstructs the Soc into a processor, including a system controller 410 (equivalent to the controller in the processor, that is, the pipeline manager), which is used to manage the hardware pipeline, including from System memory (e.g., DRAM 440 in the figure) retrieves instructions, decodes them, dispatches them, undoes them, commits them, etc. Multiple heterogeneous IP cores, including CPU cores, are integrated into the SoC as execution units in the hardware pipeline 420 (equivalent to the arithmetic units in the processor) for executing instructions assigned by the system controller 410 . Therefore, SaaS can utilize hardware pipelines rather than programming frameworks to manage heterogeneous IP cores.

Similar to the multiscalar paradigm, programs are divided into tasks, which can be as small as a single scalar instruction or as large as an entire program. A task can be implemented on various types of IP cores and will be dispatched to a specific IP core during execution. These tasks are called instructions in SaaP. Since tasks have different sizes, the embodiment of the present disclosure proposes a mixed-scale (Mixed-Scale, MS) instruction to work with SaaP with an IP-level pipeline. MS instruction is a unified instruction that can be applied to various heterogeneous IP cores. Therefore, hardware heterogeneity is transparent under MS instructions. MS instructions are fetched, decoded, dispatched, revoked, committed, etc. by the system controller 410. The adoption of MS instructions can fully exploit mixed-level parallelism.

Furthermore, on-chip memory 430 can also be provided for SaaP, such as on-chip SRAM (Static Random Access Memory) or registers, which are used to cache data related to the execution of the execution unit (IP core), such as input data and output. data. Therefore, after the data on the system memory is transferred to the on-chip memory, the IP core can interact with the on-chip memory to access the data. On-chip memory 430 is similar to registers in a processor, whereby on-chip IP coordination can also be implemented implicitly in a manner similar to register forwarding in a multi-scalar pipeline.

In the SaaS hardware pipeline, you can make full use of mixed-level parallelism by using MS instructions, and use on-chip memory to realize data exchange between IP cores, thereby achieving high hardware performance. Moreover, SaaS allows any type of IP core to be integrated as an execution unit, and high-level code from application developers can be compiled to the new IP core with only slight adjustments, thus enabling improved programming productivity.

In contrast, the traditional SoC shown in Figure 4b is CPU-centric, with the programming framework running on the host CPU. Various IP cores are attached to the system bus as isolated devices and managed by software running on the host CPU. As can be seen from the figure, in traditional SoC, for control flow, there is only CPU-IP interaction; for data flow, there is only system memory (DRAM)-IP interaction.

In SaaS, SoC is built with an IP-level pipeline, and the IP core is managed as an execution unit. In this way, the control flow can naturally be managed by the pipeline manager, and no programming framework is required at runtime. Moreover, using a mechanism similar to pipeline forwarding, data exchange can be performed directly between different IP cores.

Extending the CPU scalar pipeline to an IP-level pipeline inevitably faces many challenges. One challenge is consistency. Since heterogeneous IP cores such as DL accelerators access data (e.g., tensors and vectors) in chunks of various sizes, rather than as scalar data, data dependencies are checked as chunks of data flow concurrently through the pipeline. and maintaining data consistency becomes extremely complex. Therefore, the register file, cache hierarchy, and datapath all need to be fundamentally redesigned. Another challenge is scalability. According to Amdahl's law, the overhead of IP coordination (usually at the μs level) inadvertently limits the scalability of traditional SoCs. This overhead also prevents sub-μs-level cores from leveraging IP, as this overhead can exceed execution time. Furthermore, for scalability, SaaS should not favor time/area expensive designs such as chained squashing and crossbar interconnects.

Although there are challenges from many aspects, the inventor's research found that the root of the problem is simply that the ownership of shared data is unclear in traditional design concepts. In a traditional SoC, data can be accessed and modified by different IP cores at any time, and multiple copies of the data can exist. Therefore, in order to execute the program correctly, it is necessary to introduce a large Complex mechanisms with a certain amount of overhead, such as bus snooping, atomic operations, transactional memory, and address resolution buffers, are used to maintain data consistency and IP coordination consistency.

To avoid defects caused by unclear ownership of shared data, SaaS SoCs follow the principles of Pure eXclusive Ownership (PXO) architecture in their design. The principle is that data-related resources in the system, including on-chip buffers, data paths, data caches, memory and I/O (Input/Output, input/output) devices, are exclusively occupied by one IP core at a certain time. The SaaP architecture and related designs provided by the embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

SaaS overall architecture

Figure 5 shows the overall architecture of a SaaP according to an embodiment of the present disclosure in more detail. Similar to the Tomasulo pipeline, SaaP can contain an out-of-order five-stage pipeline.

As shown in the figure, in SaaP, the system controller as the pipeline manager can include multiple functional components to implement different functions in the pipeline management process. For example, the instruction decoder 511 can decode the MS instruction proposed in the embodiment of the present disclosure. Instruction dispatcher 512 may dispatch MS instructions. The instruction exit circuit 513 is used to complete the instruction submission and exit the completed MS instructions in order. MS instruction cache 514 is used to cache MS instructions. The renaming circuit 515 is used to rename the storage elements involved in the instruction, for example, to solve possible data hazards. The system controller may utilize the renaming mechanism to implement any one or more of the following processes: resolving data hazards on storage elements; MS command revocation, MS command submission, etc. The exception handling circuit 516 is used to respond to exceptions thrown by the IP core and perform corresponding processing. The functions of each component will be described in the relevant sections below.

Integrated heterogeneous IP cores (the figure illustrates various IP cores such as CPU cores, GPU cores, and DLA cores) act as execution units for performing actual operations. These heterogeneous IP cores and related components (such as reservation station 521, IP instruction cache 522, etc.) may be collectively referred to as IP core complex 520.

On-chip memory is also provided in SaaP. In some implementations, on-chip memory can be implemented as a bank of scratchpads (also called a set of memory bubbles) that buffer input and output data. Storage bubbles act as registers in the processor. The storage bubble can include multiple temporary registers with different storage capacities, which are used to cache data related to the execution of multiple heterogeneous IP cores. For example, the capacity of storage bubbles can range from 64B, 128B, 256B,...256KB, to 512KB. Preferably, the number of small-capacity storage bubbles is greater than the number of large-capacity storage bubbles, so as to better support task requirements of different scales. This group of storage vesicles may be collectively referred to as storage vesicle complex 530.

Between storage vesicle complex 530 and IP core complex 520, an on-chip interconnect 540 is provided to provide non-blocking data path connectivity between multiple heterogeneous IP cores and a set of storage vesicles. The on-chip interconnect acts as a shared data bus. In some embodiments, the on-chip interconnect 540 can be implemented based on an ordering network, thereby providing a non-blocking data path with only a small hardware cost and acceptable latency. In this article, the on-chip interconnect 540 may also be referred to as Golgi.

As mentioned earlier, SaaS SoCs are designed to follow the principles of Pure Exclusive Ownership (PXO) architecture. To this end, in some embodiments, one IP core among the above-mentioned multiple heterogeneous IP cores can be designated as the mother core, responsible for managing the entire system. For example, the mother core exclusively manages the exchange of data between system memory and storage bubbles. The mother core also exclusively manages I/O operations with external devices. The mother core can also control the operating system (OS) and runtime, and is responsible for at least one or more of the following processes: process management, page management, exception handling, interrupt handling, etc. For example, in branching and predictive execution, branching and predictive execution are implemented through exception handling, in which impossible branches are treated as unlikely branch exceptions (UBE). Static prediction can be used to implement branching and speculative execution. Considering the role and function of the mother core, the CPU core with general processing functions is usually determined as the mother core. In some embodiments, it is preferable to enhance the I/O capabilities of the mother core, such as introducing a DMA (Direct Memory Access) module to reduce the pressure of continuous data copying.

In some embodiments, non-parent IP cores may be divided into different IP lanes based on their functionality and/or type. The mother core itself belongs to a separate IP lane. Figure 5 shows the mother core lane, CPU lane, CPU lane, DLA car Tao, wait. Then, when scheduling the MS instruction, the MS instruction can be dispatched to the appropriate IP lane based at least in part on the task type of the MS instruction.

Basically, SaaS uses MS instructions to execute the entire program. Initially, when the system controller retrieves an MS instruction, it decodes it to prepare the data for execution. Data is loaded from system memory to storage bubbles or forwarded quickly from other storage bubbles. If there is no conflict, the MS instruction is sent to the MS instruction dispatcher and then to the appropriate IP core (eg, DLA core) for actual execution. This IP core will load the actual precompiled IP specific code (eg, DLA instructions) based on the MS instruction issued. The IP core will then execute that actual code, much like execution on a regular accelerator. After execution is complete, the MS instruction exits the pipeline and commits its results.

The above provides a general description of the overall architecture and task execution process of the SaaP SoC in the embodiment of this application. The specific implementation of each part will be described in detail below. It can be understood that although the implementation of each part is described in the environment of SaaP SoC, these parts can also be separated from SaaP SoC and applied to other similar environments, such as non-heterogeneous systems. The embodiments of the present disclosure have no limitations in this regard .

MS (Mixed Scale) Instructions

The heterogeneity of hardware is reflected in the different instruction formats in the software and hardware interfaces, and the number of execution cycles of each instruction is also very different. Table 1 shows a comparison between different instruction sets. For scalar systems, there are usually two instruction sets: CISC (Complex Instruction Set Computer, complex instruction set computer) and RISC (Reduced Instruction Set Computer, reduced instruction set computer). As shown in the table, the length of each CISC instruction is uncertain. Some instructions have complex functions and a large number of beats, while some instructions have simple functions and a small number of beats. Depending on the execution complexity of a single instruction, the number of instruction cycles ranges from 2 to 15. The instruction length of RISC is fixed, and the number of instruction cycles for a single instruction is relatively uniform, about 1 to 1.5 cycles.

Due to the heterogeneous nature of SaaP SoC, the instruction sets required by various IP cores (including CPUs and various xPUs) on it are different, such as in terms of scale or granularity. In order to hide this heterogeneity (ie, instruction format, number of execution cycles, etc.), in some embodiments of the present disclosure, a mixed-scale (Mixed-Scale, MS) instruction set (Mixed-scale Instruction Set Computer) is provided. MISC), its form is similar to RISC and can be suitable for various IP cores. Most of these IP cores (various accelerators mainly for computing purposes) need to efficiently process some large-grained complex work, so the number of execution cycles (Cycle Per Instruction, CPI) of a single MS instruction is longer than that of RISC, ranging from 10 to 10,000 +shoot, which belongs to a relatively large range. The MISC provided by embodiments of the present disclosure is also shown in Table 1.

Table 1 Comparison of different instruction sets

Each instance of SaaP is a mixed-scale instruction set computer (MISC). The MISC instruction set consists of MS instructions. Different from RISC and CISC, MISC has its own unique design style.

First, MS instructions have mixed load sizes, which can be relatively small loads, such as only needing to execute 10 beats, or relatively large loads, such as needing to execute more than 10,000 beats. Therefore, the payload carried by each MS instruction may Containers of different sizes are required to facilitate retrieving data from the container and saving the calculated result data into the container. In the embodiment of the present disclosure, the aforementioned set of storage bubbles of various sizes (eg, from 64B to 512KB) are used to store the input and/or output data required by the MS instructions, thereby supporting this mixed load of the MS instructions size.

Secondly, MS instructions are IP-independent, that is, MS instructions are not aware of IP. Specifically, instructions specific to each IP core (that is, heterogeneous instructions) are encapsulated in MS instructions, and the encapsulated MS instruction format is not related to which IP core is specifically encapsulated.

In some embodiments, the MS instruction may include a sub-instruction field that indicates sub-instruction information specific to one or more IP cores capable of executing the MS instruction. It can be understood that the MS instruction needs to be run on a certain IP core in the future, which means that there is a piece of code that the IP core can recognize (that is, IP core-specific code). These codes are also composed of one or more pieces of code specific to the IP core. Composed of instructions, these instructions are encapsulated in MS instructions, so they are called sub-instructions. Therefore, the system controller can assign the MS instruction to the corresponding IP core according to the sub-instruction domain. The subinstruction information may include the type of the subinstruction (ie, the type of IP core or the type of IP lane) and/or the address of the subinstruction. There can be multiple implementations to represent subcommand information.

In one implementation, the addresses of one or more IP core-specific subinstructions may be placed into the subinstruction field. This method can directly determine the sub-instruction type and address in the MS instruction. However, in this implementation, since the same MS instruction may be able to run on multiple heterogeneous IP cores, the length of the MS instruction will vary with the number of IP core types that can run the MS instruction.

In another implementation, a bit sequence can be used to indicate whether the MS instruction has a corresponding type of sub-instruction, and a first address can be used to indicate the first sub-instruction address. The length of the bit sequence may be the number of IP core types or IP lane types, so that each bit in the bit sequence may be used to indicate whether there is a sub-instruction of the corresponding type. The first sub-instruction address is obtained directly from the first address. The sub-instruction addresses corresponding to subsequent IP lanes can be indexed in a fixed way (for example, separated by a certain address distance), or by directly jumping to the MS instruction. The embodiments of this disclosure have no restrictions on the specific format implementation of MS instructions.

MS instructions are defined to perform complex functions. Therefore, each MS instruction performs a complex function, such as convolution, and the instruction will be broken down into fine-grained IP-specific code (i.e., sub-instructions) for actual execution, such as RISC instructions. The IP-specific code can be code compiled against the standard library (e.g., std::inner_product from Libstdc++ for inner products) or code generated from a vendor-specific library (e.g., from cuBLAS also for inner products) operating CublasSdot). This makes it possible for SaaS to integrate different types of IP because the same MS command can be flexibly issued to different types of IP cores. Therefore, heterogeneity is hidden from application developers, which also increases the robustness of SaaS.

As can be seen from the above, no matter which IP core the sub-instruction is used for, such as CPU, GPU, DLA, NPU, etc., it will not change the format of the MS instruction. Therefore, from this perspective, the MS instruction is IP-independent.

Again, MS instructions have a limited arity. For data management, each MS instruction will access up to three storage bubbles: two source storage bubbles and one destination storage bubble. That is, for data management, each MS instruction has at most two input data fields and one output data field, which are used to indicate data information related to the execution of the MS instruction. In some implementations, these data fields may be represented by numbers of associated storage bubbles, such as indicating two input storage bubbles and one output storage bubble, respectively. Limited metadata reduces the complexity of conflict resolution, renaming, datapath design, and compiler toolchains. For example, if the number of MS instructions is not limited, the decoding time difference between different MS instructions will be very large, resulting in irregular hardware pipelines and some inefficiency problems. Functions or functions with high arity (for example, more than 3 arity) can be implemented through Currying. Currying is a technique that converts multi-variable functions into sequences of single-variable functions, such as through nesting, chaining, etc. Thereby, it is possible to support the conversion of functions/functions with any number of inputs and outputs into functions/functions that satisfy the finite element number of MS instructions.

Finally, MS instructions have no side effects. "No side effects" here means that the execution status of the current instruction will not affect the execution of subsequent instructions. In other words, if the current instruction is to be revoked, it can be revoked without its residual status affecting subsequent instructions. In addition to modifying the data in the output storage bubble, the execution of the MS instruction No observable side effects will be left on the SaaP architecture. The only exception is MS instructions that execute on the mother core, since the mother core can operate on system memory and external devices. This constraint is important for implementing Mixed Level Parallelism (MLP), as it enables simple rollback of effects when MS instructions need to be undoed, for example due to speculative execution requirements. In other words, the data field of the MS instruction executed on the non-mother core IP core can only point to the storage bubble, but not to the system memory. Furthermore, the storage bubble corresponding to the output data is exclusively assigned to the IP core that executes the MS instruction.

It can be seen that by providing a new MS instruction set and making a unified abstraction on the software and hardware interface, the heterogeneity between different hardware or different instructions can be hidden, so that it can be seen at the hardware level It is a unified MS command format. These MS instructions can be distributed to different IP cores for actual execution.

Figure 6 schematically shows an example process of executing tasks on the MISC architecture to better understand the implementation of MS instructions. The illustrated MISC architecture has, for example, a mother core and an IP core. The tasks to be performed are to make sandwiches (materials: bread and meat) and vegetable salads (materials: vegetables). Among them, for the convenience of drawing, in Figure 6, the bread is named A, the meat is named B, the vegetables are named C, the sandwich is named D, and the salad is named E. The mother core manages the system memory, so first the mother core loads the materials to be processed from the system memory to the storage bubbles, and then the IP core can process the materials on the storage bubbles. The above tasks can be expressed as the following MS instruction flow:

1) "Load Bread" v1,void,void

2) "Load Meat" v2,void,void

3) "Make Sandwich" v1, v1, v2

4) "Store Sandwich"void,v1,void

5) "Load Green" v1,void,void

6) "Make Salad" v1,v1,void

7) "Store Salad"void,v1,void

It can be understood that when MS instructions are executed on the mother core and IP core, each core should provide a specific code form, that is, core-specific sub-instructions, so that each core can know how to perform the corresponding task. For the sake of simplicity, these sub-instructions only briefly show their processing tasks or functions in the above MS instruction flow, and different forms are not distinguished. The storage bubbles (v1, v2) used in the MS instruction are logical numbers. In actual implementation, the storage bubbles are renamed to different physical numbers to resolve WAW (Write After Write) dependencies and support out-of-order predictive execution. Void in the instruction indicates that the corresponding domain does not need to store bubbles, such as when system memory is involved.

In Figure 6, ① is the initial state; ② executes the "Load Bread" MS instruction for the mother core. The Load instruction involves access to system memory and is therefore assigned to the mother core for execution. The mother core takes out the data from the system memory and stores it into the storage bubble v1. The specific memory access address information of the system memory may be placed in an additional instruction field, and the embodiments of the present disclosure have no limitation in this regard. ③ Execute the "Load Meat" instruction for the mother core. Similar to the "Load Bread" instruction, the mother core takes out the data from the system memory and stores it in the v2 storage bubble.

Next, ④ To execute the "Make Sandwich" instruction, this MS instruction is assigned to the IP core for processing because it takes more processing time. According to the original instructions, the IP core needs to take out the bread from v1, take out the meat from v2, and put it into v1 after making it. Here, since the v1 to be written and the v1 to be read are the same, there is a Write After Read (WAR) correlation, that is, the data in v1 must be completely read before writing. But this method is not very realistic, because the MS instructions may be very large, for example, tens of thousands of shots are required, and the sandwiches made in the middle need to be stored somewhere. In order to solve this data hazard, a storage bubble renaming mechanism can be used. For example, before the MS instruction is dispatched, the logical name of the storage bubble in the MS instruction is renamed and mapped to a physical name through the storage bubble renaming circuit 515 shown in FIG. 5 to eliminate data risks. At the same time, the storage bubble renaming circuit 515 saves the mapping relationship between the physical name and the logical name. In the example of Figure 6, the storage bubble v1 corresponding to the output data of the "Make Sandwich" instruction is renamed to storage bubble v3, so the prepared sandwich is placed in v3. The ellipsis in v3 in Figure 6 indicates that this writing process will take a while and will not be completed quickly.

In ⑤, since making the sandwich takes a long time, the "Store" that follows cannot be executed yet. Sandwich" instruction, but the subsequent "Load Green" instruction has no dependency on the previous instruction, so it can be executed in parallel. Similarly, the storage bubble v1 involved in the "Load Green" instruction also involves a read-after-write correlation, so The storage bubble renaming mechanism can be used to rename the corresponding storage bubble v1 and map it to storage bubble v4. Similarly, the mother core executes the "Load Green" instruction to take out the data in the system memory and write it into the storage bubble v4. In the bubble.

In ⑥, since the IP core has been occupied to make a sandwich, in order to improve efficiency, according to the scheduling policy, the "Make Salad" instruction can be assigned to the currently idle mother core for execution. The status of each core can be marked, for example, by a bit sequence to facilitate the instruction dispatcher to dispatch instructions. Again, the renaming mechanism is applied here as well. The mother core takes out the vegetables from the storage bubble v4, makes them into salads and puts them into the storage bubble v5.

In ⑦, after the sandwich is made, the previously blocked "Store Sandwich" instruction can be executed. Store instructions involve access to system memory and are therefore assigned to the mother core for execution. The mother core takes out the data stored in vesicle v3 and stores it in the system memory.

In ⑧, after the salad is made, you can execute the "Store Salad" command. The mother core takes out the data stored in vesicle v5 and stores it in the system memory.

It should be noted that in ⑦ and ⑧, even if the salad sandwich is made first, the "Store Salad" instruction needs to be executed after the "Store Sandwich" instruction to ensure sequential submission, so that no any instructions will be generated when the instruction is revoked. side effect.

As can be seen from the above example process, when the data is ready, the IP core can start processing. "Make Sandwich" takes more time, so "Make Salad" is executed on the mother core and completed in advance, so that mixed-level parallelism (MLP) can be fully exploited. Therefore, the execution of different IP cores does not interfere with each other, that is, they can be executed out of order, but they are submitted in order.

system controller

The processing between MS instructions or the instructions themselves is uniformly managed by the system controller (which may also be called an instruction processing device). The functions of each component in the system controller are described in detail below. SaaP SoCs employ out-of-order pipelines to mine mixed-level parallelism between IP cores. The pipeline can contain 5 levels: value & decoding, conflict resolution, dispatch, execution and exit.

FIG. 7 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure. The following description can be understood with reference to the SaaP architecture shown in Figure 5. In addition, for the convenience of description and understanding, Figure 7 shows the instruction execution process including a complete pipeline, but those skilled in the art can understand that some steps may only occur under specific circumstances and are therefore not necessary in all cases. The necessity can be discerned according to the specific situation.

First, in step 710, instruction fetch & decode is performed. At this level, the MS instruction is retrieved from the MS instruction register 514 based on the MS program counter (PC), and the instruction decoder 511 decodes the retrieved MS instruction to prepare operands. The decoded MS instructions can be placed in the instruction queue of the instruction decoder 511 .

As described previously, the MS instruction includes a sub-instruction field that indicates sub-instruction information specific to one or more IP cores capable of executing the MS instruction. The subinstruction information may, for example, indicate the type of the subinstruction (ie, the type of IP core or the type of IP lane) and/or the address of the subinstruction.

In some embodiments, when the MS instruction is retrieved and decoded, the corresponding sub-instruction can be retrieved in advance and stored in a designated location according to the decoding result, such as the sub-instruction cache 522 (also referred to as sub-instruction buffer 522 in Figure 5 for the IP command cache). Therefore, when the MS instruction is issued to the corresponding IP core for execution, the IP core can fetch the corresponding sub-instruction from the sub-instruction buffer 522 for execution.

In some cases, the MS instruction may be a branch instruction. In the embodiment of the present disclosure, static prediction is used to determine the direction of the branch instruction, that is, to determine the PC value of the next MS. The inventor analyzed the branch behavior in the benchmark program and found that 80.6% to 99.8% of the large-scale instruction branches can be predicted correctly at compile time. Since large-scale instructions determine the overall execution time, in the embodiment of the present disclosure, static prediction is used to perform branch prediction, thereby eliminating the need for a hardware branch predictor. From this, whenever a branch is encountered, it is always assumed that the next MS instruction is a statically predicted possible branch. branch direction.

When a branch is mispredicted, an Unlikely Branch Exception (UBE) will be triggered. When an exception occurs, the erroneous MS instruction needs to be canceled, the next MS instruction count is set to the impossible branch of UBE, or an exception trap occurs under other circumstances. The processing scheme of branching and speculative execution will be described in detail later.

Then, the pipeline proceeds to step 720, where possible conflicts are resolved. At this level, the retrieved MS instructions are queued to resolve conflicts. Possible conflicts include (1) data hazards; (2) structural conflicts (e.g., no space available in the exit unit); and (3) exception violations (e.g., blocking an MS instruction that cannot be easily undone until it is acknowledged to be taken ).

In some embodiments, data hazards such as Read After Write (RAW) and Write After Write (WAW) can be solved through a storage bubble renaming mechanism. The storage bubble renaming circuit 515 is used to rename the logical name of the storage bubble to a physical name and save the storage bubble before dispatching the MS instruction when there is a data hazard on the storage bubble involved in the MS instruction. The mapping relationship between physical names and logical names. Through the storage bubble renaming mechanism, SaaP can support faster MS instruction revocation (achieved by simply discarding the rename mapping to the output data storage bubble) and out-of-order execution without WAW risks.

After possible conflicts are resolved, the pipeline proceeds to step 730 where the MS instructions are dispatched by the instruction dispatcher 512 .

As described previously, an MS instruction has a sub-instruction field that indicates the IP core capable of executing the MS instruction. Therefore, the instruction dispatcher 512 can dispatch the MS instruction to the corresponding IP core according to the information of the sub-instruction field. Specifically, first dispatch the MS instruction to the reservation station to which the IP core belongs for subsequent transmission to the appropriate IP core.

In some embodiments, IP cores may be divided into different IP lanes according to their functions and/or types, with each lane corresponding to a specific IP core model. Correspondingly, reservation stations can also be grouped according to lanes, for example, each lane can correspond to a reservation station. For example, Figure 5 shows a mother core lane, a CPU lane, a CPU lane, a DLA lane, and so on. Different lanes can be suitable for performing different types of tasks. Therefore, when scheduling and dispatching the MS instruction, the MS instruction can be dispatched to the reservation station corresponding to the appropriate lane based at least in part on the task type of the MS instruction for subsequent transmission to the appropriate IP core.

In some embodiments, in addition to considering the task type, scheduling can also be performed among multiple IP lanes capable of executing the MS instruction according to the processing status in each IP lane, thereby improving processing efficiency. Since the same MS instruction may have multiple different implementations executed on multiple IP cores, the processing pressure of the bottleneck lane can be alleviated by selecting the assigned lane according to an appropriate scheduling policy. For example, MS instructions involving convolution operations can be dispatched to the GPU lane or the DLA lane. Both can be performed effectively, and one can be selected based on the pressure of the two lanes, thereby speeding up the processing progress. The scheduling policy may include various rules, such as selecting the IP core with the largest throughput, or selecting the IP core with the shortest number of sub-instructions, etc. The embodiments of the present disclosure are not limited in this regard.

In some embodiments, some specific types of MS instructions must be dispatched to specific IP cores. For example, as mentioned earlier, among multiple heterogeneous IP cores, one IP core can be designated as the mother core, responsible for managing the entire system. Therefore, some MS instructions involving system management types must be dispatched to the mother core for execution.

Specifically, the mother core exclusively manages the exchange of data between system memory and storage vesicles. Therefore, MS instructions of the memory access type that access system memory are dispatched to the mother core. The mother core also exclusively manages I/O operations with external devices. Therefore, I/O type MS instructions such as display output are also dispatched to the mother core. The mother core can also control the operating system (OS) and runtime, and is responsible for at least one or more of the following processes: process management, page management, exception handling, interrupt handling, etc. Therefore, the interrupt circuit 517 handles the MS instruction that interrupts and is dispatched to the parent core. In addition, when some MS instructions cannot be processed by other IP cores, for example because other IP cores are busy, they can be assigned to the mother core for processing. In addition, according to the MS instruction scheduling policy, some MS instructions may be assigned to the mother core for processing. No more enumeration here.

Next, the pipeline proceeds to step 740, at which stage MS instructions may be executed out of order by the IP core.

Specifically, the IP core assigned to the instruction may utilize the actual IP-specific code to perform the functionality of the MS instruction. For example, the IP core retrieves the corresponding sub-instruction from the sub-instruction register/IP instruction register 522 according to the assigned instruction and executes it. The Tomasulo algorithm can be implemented at this stage to organize these IP cores to support mixed-level parallelism (MLP). Once the dependencies on the storage bubble are resolved, MS instructions can be continuously dispatched to the IP core complex, allowing them to be executed out of order.

Note that in the SaaP SoC provided by the embodiments of this disclosure, since intrusive modifications to the IP core are prohibited, the IP core does not know the SaaP architecture. In order to adapt to SaaP, an adapter is used to encapsulate the IP core. The adapter directs access to the program to the IP instruction cache 522 and directs access to the data to the storage bubble. The program can be an interface signal of the accelerator, such as the CSB (Configuration Space Bus, Configuration Space Bus) control signal for DLA, or a piece of IP-specific code that implements MS instructions (for example, for programmable processors such as CPU/GPU terms). Operational MS instructions perform operations on data stored in a set of storage bubbles. These storage bubbles can be multiple temporary registers with different storage capacities. Each IP core has two data read ports and one data write port. During execution, the physical storage bubble is exclusively connected to the port, so from the perspective of the IP core, the storage bubble works like main memory in a traditional architecture.

Finally, the pipeline advances to step 750, the exit phase. At this level, MS instructions exit the pipeline and commit the results. The instruction exit circuit 513 in Figure 5 is used to exit completed MS instructions in order, and when the MS instruction exits, submit the execution result by confirming the renaming mapping of the storage bubble corresponding to the MS instruction's output data. That is, the commit is accomplished by permanently acknowledging the rename map of the storage bubble of the output data in the rename circuit 515 . Since only the rename mapping is recognized, no data is actually buffered or copied, thus avoiding the additional overhead of copying data when the amount of data is large (which is common in various computing-purpose IP cores).

It should be understood that although the execution process of MS instructions is described in the environment of SaaP SoC, the MS instruction system can also be applied in other environments, and is not limited to environments with heterogeneous IP cores. For example, it can also be used in homogeneous environments. As long as the execution unit of the MS instruction can independently parse and execute the sub-instructions. Therefore, in the above description, the IP core can be directly replaced by the execution unit, and the mother core can be replaced as the main execution unit. The above method is still applicable.

Branching and speculative execution

Branch instructions may also appear in the MS instruction stream, and branch instructions cause control dependencies. The control correlation is actually related to the program counter PC of the MS instruction, and the PC value is used when fetching instructions. If the branch instruction is not handled well, the fetching of the next instruction will be affected, causing the pipeline to be blocked and affecting the pipeline efficiency. Therefore, it is necessary to provide effective branch prediction support for MS instructions, that is, effective for both large-scale instructions and small-scale instructions.

In the traditional processing method of the CPU, the branch condition is calculated during decoding, and then the correct branch target is determined, so that the next instruction can be retrieved from the address of the branch jump position when fetching the next instruction. This branch condition calculation and setting the next PC value to the value of the correct branch target usually only takes up a few beats of overhead. This overhead is very small and can be completely offset by the pipeline in the conventional CPU instruction pipeline. However, in the MS instruction stream, if a branch MS instruction is mispredicted, it means that at a certain point in time during the entire execution of the MS instruction stream, it is discovered that the branch MS instruction is mispredicted. At this time The position of the point may be hundreds of beats or thousands of beats or longer away from the time when the branch MS instruction starts executing. Therefore, in the MS instruction pipeline, it is impossible to determine the PC value of the next MS instruction until it really knows when to jump. In this case, the prediction overhead will be very high.

The inventor analyzed the branch behavior in five benchmark programs and found that 80.6% to 99.8% of the large-scale instruction branches can be predicted correctly at compile time, that is, they can be predicted in a static manner. Since large-scale instructions occupy most of the total execution time and determine the overall execution time, in the embodiment of the present disclosure, static prediction is used to perform branch prediction, thereby eliminating the need for a hardware branch predictor.

FIG. 8 shows an exemplary flowchart of an instruction execution method for a branch instruction according to an embodiment of the present disclosure. This method is executed by the system controller.

As shown, in step 810, the MS instruction is decoded. MS instructions have varying per-instruction execution cycles period (CPI). As mentioned earlier, the CPI range of MS instructions may range from 10 beats to more than 10,000 beats. This changing CPI nature of MS instructions also makes it difficult to use dynamic prediction.

Next, in step 820, in response to the MS instruction being a branch instruction, the next MS instruction is obtained according to the branch indication information, and the branch indication information indicates a possible branch target and/or an impossible branch target.

The static prediction mechanism can use compiler hints to perform static predictions. Specifically, during instruction compilation, the branch instruction information can be determined based on the static branch prediction method and inserted into the MS instruction stream.

Depending on the different static branch prediction methods, the branch indication information may contain different contents. For example, static prediction always takes the possible branch target as the next MS instruction address. In some cases, in order to ensure the temporal locality of the instruction cache, it is possible that the branch target can usually be immediately adjacent to the current MS instruction. Therefore, in these cases, the branch indication information may only need to indicate the impossible branch target. In other cases, the branch indication information may also indicate possible branch targets and impossible branch targets at the same time. Therefore, when the next MS instruction is obtained according to the branch instruction information, the possible branch target indicated by the branch instruction information can be determined as the next MS instruction.

Since it is a prediction, there may be errors. When a branch prediction error occurs, all instructions following the branch instruction need to be canceled. The longer the pipeline level, the more instructions that need to be canceled due to branch prediction errors, and the greater the efficiency loss of the instruction pipeline. Since MS instructions use static prediction, before the branch condition is determined, the next MS instruction is fetched in an inherent way. These instructions can be executed out of order, but they must be submitted in the order described above. Therefore, when the predicted direction of a branch instruction is found to be wrong, it is necessary to revert to the correct next MS instruction. At this time, an exception mechanism needs to be implemented to correct wrong predictions.

Alternatively or additionally, in step 830, when a prediction error occurs, the system controller may receive an Impossible Branch Exception (UBE) event. The UBE event is triggered by an execution unit (such as an IP core) that executes a conditional calculation instruction associated with a branch instruction. This UBE event indicates that according to conditional calculation, the branch direction should be an impossible branch target, that is, an error occurred in the previous branch prediction.

At this time, in step 840, the system controller needs to perform a series of operations in response to the UBE event to resolve the branch prediction error. These operations include: canceling the MS instruction after the branch instruction; committing the MS instruction before the branch instruction; and determining the impossible branch target indicated by the branch indication information as the next MS instruction. This kind of processing corresponds to a precise exception, that is, when an exception occurs, all instructions before the instruction interrupted by the exception are executed, and all instructions after the instruction are as if they were not executed. Since the UBE event is an exception caused by a branch prediction error, the above-mentioned instruction interrupted by the exception is the branch MS instruction.

Depending on the different states of the MS instruction that needs to be revoked, different operations can be taken to achieve the revocation. The MS instruction that needs to be revoked may usually be in three states: being executed in the execution unit; execution has ended; or has not yet been executed. Different states may have effects on different software and hardware, so these effects need to be eliminated. For example, if the instruction is being executed in the execution unit, the execution unit that is executing the MS instructions that need to be revoked needs to be terminated; if the instruction has written to the temporary register (such as a storage bubble) during or after execution, Then you need to discard the temporary registers written by the MS instructions to be canceled; if the instructions have not been executed, you only need to cancel them from the instruction queue. Of course, since the instruction queue will record all unexited/submitted instructions, instructions that are in the executing or completed execution state also need to be canceled from the instruction queue.

Therefore, in some embodiments, undoing the MS instructions following the branch instruction includes: canceling the undoed MS instructions in the instruction queue; terminating the execution units executing the undoed MS instructions; and discarding the temporary files written by the undoed MS instructions. memory.

It can be seen from the instruction exit process described above that when the MS instruction exits, the execution result is submitted by confirming the rename mapping of the storage bubble corresponding to the MS instruction's output data. Therefore, when discarding the temporary registers written by these revoked MS instructions, you only need to delete the corresponding mapping relationship from the record that saves the rename mapping relationship between the physical names and logical names of these temporary registers. Can. As mentioned earlier, through this storage bubble renaming mechanism, faster MS instruction revocation can be supported by simply discarding the rename mapping to the output data storage bubble.

Therefore, in the MS instruction pipeline, processing branch MS instructions through static prediction can save hardware software resources, while adapting to the CPI characteristics of MS instructions with a wide range of changes, and improving pipeline efficiency. Furthermore, handling branch prediction errors through exception mechanisms can further save hardware resources and simplify processing.

Exception and interrupt handling

As can be seen from the previous branch prediction processing, the cost of undoing large-scale MS instructions can be high. Therefore, in the embodiment of the present disclosure, an instruction execution scheme is proposed, which can block MS instructions that may cause high revocation costs until all instructions that may be discarded before the instruction have been executed, that is, the status has been determined. This instruction execution scheme can greatly improve the processing efficiency of the MS instruction pipeline in exception and interrupt handling.

FIG. 9 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure. This method is executed by the system controller.

As shown in the figure, in step 910, when the MS command is issued, it is checked whether the MS command may be discarded.

In some embodiments, checking whether the MS instruction may be discarded includes checking whether the MS instruction has a possible discard tag. Possible discard tags can be inserted at compile time by the compiler based on the type of MS directive. For example, the compiler can insert a possible discard label when it discovers that the MS instruction is a conditional branch instruction or that other exceptions may occur.

Next, in step 920, when it is determined that the MS instruction may be discarded, the issuance of the specific MS instruction following the MS instruction is blocked.

Specific MS instructions can be those large-scale MS instructions, or MS instructions that generally have a relatively high cost of revocation. Specifically, a specific MS instruction can be judged based on one or more of the following conditions: the size of the temporary register (storage bubble) corresponding to the output data of the MS instruction is greater than the set threshold; the MS instruction that performs a write operation on the system memory; MS instructions whose execution time exceeds a predetermined value; or MS instructions executed by a specific execution unit. When the storage bubble size (capacity size) of the output data exceeds the set threshold, it means that the output data volume of the MS instruction will be relatively large, and the corresponding cancellation overhead will also be high. Blocking MS instructions that write system memory is mainly to ensure storage consistency.

After blocking these specific MS instructions, the MS instructions before them are still issued and executed normally. Depending on the possible situations of these normally launched and executed MS instructions, they can be processed separately.

On the one hand, in step 930, when all potentially discarded MS instructions that caused the blocking of the specific MS instruction have been normally executed, in response to this event, the blocked specific MS instruction may be issued for execution by the execution unit. It can be understood that at this time it can be determined that the specific MS instruction will not be canceled due to the previous instruction, so the normal issuance and execution of the instruction pipeline can be continued.

On the other hand, in step 940, when an exception occurs in the execution of any MS instruction that blocks the specific MS instruction and may be discarded, exception handling is performed in response to the exception event. Similarly, this kind of exception handling corresponds to a precise exception, which requires canceling the MS instruction that caused the exception and the MS instructions executed after it, submitting the MS instruction before the MS instruction; and using the MS instruction of the corresponding exception handler as the next MS instructions.

Similar to the previous description in branch prediction processing, canceling the MS instruction that caused the exception and subsequent MS instructions includes: canceling these canceled MS instructions in the instruction queue; terminating the execution units that execute these canceled MS instructions; and discarding these canceled MS instructions. The scratchpad written by the canceled MS instruction. Likewise, discarding the scratchpads written by these revoked MS instructions includes deleting the corresponding mappings from the record holding the rename mappings between the physical names and logical names of these scratchpads.

When the type of the exception event is an impossible branch exception UBE event triggered by a branch type MS instruction in the branch prediction processing, in addition to the above exception processing, it is also necessary to process the impossible branch instruction information attached to the MS instruction. The branch target is determined to be the next MS instruction after the exception is eliminated. Therefore, after the exception handling is completed, the instruction pipeline can normally jump to the correct branch direction to continue execution.

Figure 10 shows an instruction execution example according to an embodiment of the present disclosure.

As shown in the figure, (a) shows the initial state of the MS instruction flow in the instruction queue, which includes 5 MS instructions to be executed, the #1 MS instruction has a possible discard label, and the different widths occupied by the instructions can represent different scales. , the #3MS command is a large-scale MS command, and the rest are small-scale MS commands. The different backgrounds of the instructions represent that they are in different Status, such as waiting, blocking, launching, executing, exiting, exception, cancellation, etc., please see the legend for specific representation.

(b) shows the instruction issuance step. Small-scale instructions will be issued as soon as possible, while large-scale instructions will be blocked by instructions that may be discarded by the previous issue. Specifically, the #0 instruction is issued first, and then the #1 instruction is issued. When issuing command #1, it was discovered that the command might be discarded. At this time, subsequent large-scale instructions will be blocked. In this example, instruction #2 can still be issued normally because it is a small-scale instruction; instruction #3 is blocked because it is a large-scale instruction, and subsequent instructions are also in a waiting state.

(c) shows the instruction execution process. In this example, the #2 instruction may be executed first. Since the instructions before it have not yet been executed, it needs to wait to ensure sequential submission.

(d1)-(h1) show the processing process when no exception occurs during the execution of the above instruction; (d2)-(g2) show the processing process when the above instruction throws an exception.

Specifically, (d1) shows that the #1 instruction has also been executed normally and has not been discarded. At this time, the large-scale instruction #3 that was blocked because of the #1 instruction can be issued, and the subsequent #4 instruction can also be issued normally. (e1) shows that instructions #0, #1, #2 and #4 have all been executed due to their small size, and instruction #3 is still being executed. (f1) shows that instructions #0, #1, and #2 are submitted sequentially, while instruction #4 must wait for instruction #3 to be executed before it can be submitted. (g1) shows that instruction #3 has also been executed. (h1) shows instructions #3 and #4 committing sequentially.

On the other hand, when an exception occurs during the execution of the #1 instruction, as shown in (d2), an exception program will be processed. The process of exception handling usually includes exception handling preparation, determining the source of the exception, saving the execution state, handling the exception, restoring the execution state and returning, etc. For example, in the exception processing circuit 516 shown in FIG. 5 , it is possible to record whether an exception occurs, and adjust the next MS instruction address according to the processing result.

When handling exceptions, perform precise exception handling. As shown in (e2) and (f2), the #0 instruction before the #1 instruction that triggered the exception continues to execute and completes the commit. Although the #2 command issued after the #1 command that triggered the exception has been executed, it will also be canceled, as shown in (g2). At this time, #3 and #4, which have not been transmitted due to being blocked, are in a waiting state, thereby avoiding the overhead caused by cancellation.

If the exception triggered by the #1 instruction is the UBE event described above, that is, the #1 instruction is a branch instruction, the impossible branch target indicated by it can be determined as the impossible branch target after the exception is eliminated based on the branch instruction information attached to the branch instruction. Next MS command. That is, after the exception is handled, the pipeline will jump to the MS instruction corresponding to the impossible branch target.

If the exception is another type of exception, for example, the denominator in division is zero, it will jump to an exception handler, which may modify the denominator value to a very small non-zero value, and then re-execute #1 after the exception is handled. instruction, and normal instruction pipeline processing continues.

In contrast to abnormal events, interrupt events come from outside the SoC and are therefore unpredictable. However, SaaP does not need to stop exactly at the point where the interrupt signal is raised. When an interrupt occurs, SaaP blocks all MS instructions waiting to be issued and waits for all issued MS instructions to complete and exit.

In SaaS, most system management exceptions, such as memory allocation exceptions (Bad Allocation), page faults (Page Fault), segmentation faults (Segment Fault), etc., can only be raised from the mother core, so they are also within the mother core. Capture and process. Other components in the SaaS architecture as well as other IP cores are neither affected by nor aware of these anomalies.

storage vesicles

In SaaS, storage vesicles are used as an alternative to registers for mixed-scale data access. The storage bubbles can be some independent, mixed-sized single-port scratchpads (Scratchpad), whose capacity can range from 64B to 512KB, for example. In SaaP, storage bubbles can be similar to registers with mixed capacities for use by MS instructions. In this article, "storage vesicle complex" refers to a physical "register" file composed of storage vesicles, rather than a fixed-size register. Preferably, the number of small-capacity (for example, 64B) storage bubbles is greater than the number of large-capacity (for example, 512KB) storage bubbles, so as to better match program requirements and support tasks of different sizes. service. Physically, each memory bubble can be a single SRAM or register with two read ports and one write port. These storage bubbles are designed to better match mixed-scale data access patterns, which can be used as the basic unit of data management in SaaS.

Two IP cores cannot access the same storage bubble at the same time. Therefore, data dependencies can still be managed as simply as a sequential scalar processor, and on-chip IP coordination can be managed via the hardware MS instruction pipeline.

data path

To be able to access any storage vesicle from any IP core, complete connectivity between the IP core complex and the storage vesicle complex is required. Common solutions include data buses (such as those in CPUs) or cross matrices (such as those in multi-core systems). However, none of these connections meet high efficiency requirements. For example, data buses can cause contention, and crossbars are very area intensive, even with only a few dozen cores. In order to achieve non-blocking data transmission at an acceptable cost, embodiments of the present disclosure construct an on-chip interconnection data path based on a sorting network, called Golgi.

Figure 11 shows several different data path designs, where (a) shows the data bus, (b) shows the crossbar matrix, and (c) shows the Golgi provided by embodiments of the present disclosure.

As can be seen from (a), the data bus cannot provide non-blocking access and requires a bus arbiter to resolve access conflicts. As can be seen from (b), the cross matrix can provide non-blocking access and has lower latency, but it requires O(mn) switches, where m is the number of ports of the IP core and n is the number of ports for storage bubbles.

In Golgi shown in (c), the connection problem is treated as a Top-K sorting network, where storage vesicle ports are sorted based on the destination IP port number. The on-chip interconnect consists of a bimodal sequencing network consisting of multiple comparators and switches. When m IP core ports need to access n storage bubble ports, the bimodal sorting network is used to sort the relevant storage bubble ports based on the index of the destination IP core port to construct m IP core ports and n A data path between storage bubble ports.

For the example in (c), when storage vesicles {a, c, d} need to be mapped to IP cores {#3, #1, #2} respectively, Golgi treats the mapping as a mapping of all storage vesicles { An ordering of a, b, c, d} with values {#3, #+∞, #1, #2} respectively, where unused ports are assigned destination numbers +∞.

Specifically, as shown in (c), starting from the storage bubble {a, b, c, d}, the even-numbered columns can be compared with each other first, and the odd-numbered columns can be compared with each other. For example, if the stored bubbles a and c are compared and it is found that the value #3 of a is greater than the value #1 of c, then the two are exchanged. The light hatched line in the figure indicates that the switch is turned on and data can flow laterally. Comparing storage bubbles b and d, it is found that the value #+∞ of b is greater than the value #2 of d, then the two are also exchanged, the switch is turned on, and the data path flows horizontally. At this time, the sorting positions are c, d, a, and b. Next, a comparison is made between two adjacent storage vesicles. For example, if the storage bubbles c and d are compared and it is found that the value #1 of c is less than the value #2 of d, it remains unchanged, the switch is not turned on, and the data path can only flow vertically. Similarly, after comparing storage bubbles d and a, the switch is not turned on; after comparing storage bubbles a and b, the switch is not turned on.

Finally, it can be seen that each IP core exactly corresponds to the storage bubble it wants to access. For example, for IP#1, go straight down from the passage below it, move laterally to the gray dot, and then go straight down to the storage bubble c. The data paths of other IP cores are similar. Thus, based on the sorting network, a non-blocking data path is constructed between the IP core and the storage bubble.

Using a bitonic sorting network, the Golgi can be implemented using O(n(logk) ² ) comparators and switches, which is much smaller than the O(nk) switches required for the cross matrix. Data delivered through the Golgi is subject to several cycles of latency (e.g., 8 cycles), so the preferred practice is to place as little local cache as possible (1KB is enough) in the IP core, as it relies on a large number of random accesses.

In summary, in order to execute an MS instruction, SaaP establishes an exclusive data path between the IP core and its storage bubble. This exclusive data path in SaaS follows the PXO architecture and provides non-blocking data access with minimal hardware cost.

Data can be shared between IP cores by passing memory bubbles between MS instructions. Since the mother core manages system memory, input data is brought together in an MS instruction by the mother core and correctly placed in a storage bubble in for use by another MS command. After being processed by the IP core, the output data is similarly dispersed back to system memory by the mother core. Specifically, the complete data path from system memory to IP core includes: [(Loading MS instructions) ① Memory ② L3/L2 cache ③ Mother core ④ Golgi W0 ⑤ Storage vesicle, (consuming MS instructions) ⑤ The same storage vesicle ⑥ Golgi R0 /1⑦IP core. ]

From a logical perspective, system memory is a resource exclusively owned by the mother core, which greatly reduces system complexity in the following aspects:

1) Page faults can only be initiated by the mother core and are handled within the mother core, so other MS instructions can always be executed safely while ensuring no page faults;

2) The L2/L3 cache is owned exclusively by the mother core, so cache inconsistency/contention/pseudo-sharing never occurs;

3) Interrupts are always handled by the mother core, so other IP cores (literally) are not interrupted.

programming

SaaS can adapt to various general-purpose programming languages (C, C++, Python, etc.) as well as domain-specific languages. Since any task performed on a SaaP is an MS instruction, the key technology is to extract mixed-scale operations to form MS instructions.

Figure 12 shows an exemplary flowchart of a compilation method according to an embodiment of the present disclosure.

As shown, in step 1210, mixed scale (MS) operations are extracted from the program to be compiled, and these MS operations may have a variable number of execution cycles. Next, in step 1220, the extracted mixed-scale operations are packaged to form MS instructions.

Low-level operations can be extracted from the base instruction block, while high-level operations can be extracted in various ways, including but not limited to: 1) calling the map directly from the library, 2) reconstructing from the low-level program structure, and 3) manually setting the compiler directives . Therefore, existing programs, such as deep learning applications written in Python using PyTorch, can be compiled onto the SaaP architecture in a manner similar to the multi-scalar pipeline.

In some embodiments, the following five LLVM compilation passes (Pass) can be optionally added to extend the traditional compiler.

a) Call-Map: This is a simple worklist-driven compilation process that converts known library calls into MS instructions. Specific implementations of MS instructions are precompiled from vendor-specific code and referenced as libraries in the process.

Specifically, in one implementation, the call to the library function function can be extracted from the program to be compiled as an MS operation; then according to the mapping list of the library function function to the MS template library, the extracted call to the library function function is converted into Corresponding MS instructions. The MS template library is pre-compiled based on execution unit-specific code capable of executing the library's functional functions.

b) Reconstruct: This is another worklist-driven compilation process that attempts to recover high-level structures from low-level code, so high-level MS instructions can be discovered.

Specifically, in one implementation, the specified program structure in the program to be compiled is identified as an MS operation through template matching; and the identified specified program structure is converted into a predetermined MS instruction. Among them, the template can be predefined based on high-level functional structural characteristics. For example, the template can define a nested loop structure and set some parameters of the nested loop structure, such as how many nested loops there are, the size of each loop, what operations are in the innermost loop, etc. The template can be defined based on some typical high-level structures, such as convolution operation structure, Fast Fourier Transform (FFT), etc. The specific definition content and definition method are not limited in the embodiments of this disclosure.

For example, a user-implemented Fast Fourier Transform FFT (as a nested ring) can be captured via template matching and then replaced using the FFT MS instructions of a vendor-specific library used in Call-Map. The restored FFT MS instructions can be executed more efficiently on the DSP IP core (if available) and can be converted back into a nested loop in the worst case scenario where only the CPU is available. This is a best-effort effort, as it is inherently difficult to accurately reconstruct all high-level structures, but it provides an opportunity for older programs that are not aware of DSA to take advantage of the new DSP IP core.

c) CDFG (Control Data Flow Graph, control data flow graph)-analysis process: different from multi-scalar technology, The program is analyzed on the CDFG graph, not on the CFG (Control Flow Graph, control flow graph) graph. This is because SaaS removes register masking and address resolution mechanisms and organizes data into storage bubbles. After the previous two compilation processes, the operations to be performed on the heterogeneous IP cores can be identified. All remaining code is executed on the CPU as multi-scalar tasks. At this point, the problem is to find the optimal partitioning of the remaining code into MS instructions. A global CDFG is constructed for subsequent use to model the costs of different MS instruction partitions.

Specifically, in one implementation, the operations that have not yet been extracted in the program to be compiled can be divided into one or more operation sets according to multiple division schemes on the control data flow graph of the program to be compiled; and then the optimal division cost is determined partitioning plan. In each partitioning scheme, each operation belongs to and only belongs to one operation set.

There are many ways to divide. Basically, the segmentation scheme can be executed subject to one or more of the following constraints.

For example, the number of input data and output data of an operation set does not exceed the specified value. As specified by the MS instruction, the arity of the input data does not exceed 2, and the arity of the output data does not exceed 1. Therefore, the operation can be divided based on this constraint.

For another example, the size of any input data or output data of an operation set does not exceed a specified threshold. Since the storage element corresponding to the MS instruction is a storage bubble, and the storage bubble has a capacity limit, it is necessary to limit the amount of data processed by the MS instruction to not exceed the capacity limit of the storage bubble.

For another example, during segmentation, the segmentation plans related to conditional operations can include:

1. Prioritize dividing the conditional operation and its two branch operations into an operation set. At this time, the MS instructions corresponding to this operation set are ordinary computing instructions.

2. The conditional operation and its two branch operations are not in the same operation set. Possible reasons for this splitting scheme include: it will cause the operation set to be too large; or it violates the input and output constraints; or the branch operation has been identified as an MS instruction in the previous step, etc. In this case, a branch type MS instruction containing a conditional operation will be generated. In general, placing conditional operations in a short set of operations results in faster branch results during execution. For example, you can control that the same operation set does not contain both conditional operations and non-conditional operations that exceed the execution time threshold.

The splitting cost of the splitting solution can be determined based on a variety of factors, including but not limited to, the number of operation sets; the amount of data interaction required between the operation sets; the number of operation sets that bear the branch function; and the expected execution time of each operation set. Distribution uniformity. These considerations affect the execution efficiency of the instruction pipeline from many aspects, and therefore can be used as a measurement factor to determine the partitioning scheme. For example, the number of operation sets directly corresponds to the number of MS instructions; the amount of data interaction required between operation sets determines the amount of data IO required; the more branch type instructions, the greater the probability of triggering exceptions, which consumes the pipeline The distribution uniformity of the expected execution time will affect the overall operation of the pipeline and avoid pipeline interruption due to excessive time consumption at a certain level.

In some embodiments, the above-mentioned CDFG analysis process is performed after invoking the mapping process and the reconstruction process. Therefore, it can be executed only for the MS operations that were not recognized in the first two compilation passes, that is, for the remaining operations.

d) MS-Cluster (MS set conversion process): This is a transformation compilation process that is used to gather nodes in CDFG to build a complete division into MS instructions.

Specifically, in one implementation, each operation set is converted into an MS instruction separately according to the segmentation scheme determined during the CDFG analysis process. Limited by the capacity of the storage bubble, the algorithm minimizes the total cost of cutting edges across MS instruction boundaries. In particular, MS instructions and system calls including load/store operations are assigned to the mother core.

e) Fractal-Decompose (fractal-decomposition process): This is also a transformation compilation process, used to decompose the MS instructions that violate the storage bubble capacity limit extracted from the call mapping process and reconstruction process, thereby storing the bubble capacity SaaS functionality is no longer limited.

Specifically, in one implementation, the decomposition process includes: checking whether the converted MS instruction complies with the storage capacity constraint of the MS instruction; when the MS instruction does not comply with the storage capacity constraint, split the MS instruction into multiple MS instructions for implementation Same functionality.

MS instructions can be decomposed using various currently known or future developed instruction decomposition methods. due to previous The extracted MS instructions are to be allocated for execution on a certain IP core. Therefore, the multiple operations that constitute the MS instructions are of the same type, that is, isomorphic, and only need to adapt to the physical hardware size. Therefore, in some embodiments, this decomposition process of MS instructions may simply follow a fractal execution model. For example, you can refer to the paper by Y. Zhao, Z. Du, Q. Guo, S. Liu, L. Li, Z. Xu, T. Chen, and Y. Chen et al., "Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp.787–800. In general, MS instructions can be decomposed into several smaller, similar operations in an iterative manner. Since the invention of the embodiment of the present disclosure does not lie in the specific instruction decomposition method, the description will not be carried out here.

Encapsulating MS operations into MS instructions simply means filling in one or more instruction fields of the MS instruction. As mentioned earlier, MS instructions include sub-instruction fields, input and output storage bubble information fields, and may also include system memory address information fields, branch information fields, exception flag fields, etc. Some of these instruction fields are required, such as sub-instruction fields, exception flag fields, etc.; some are filled on demand, such as input and output storage bubble information fields, system memory address information fields, branch information fields, etc.

When populating the sub-instruction field, the MS operation may be identified in the sub-instruction field of the MS instruction; and the sub-instruction field may be associated with one or more execution unit-specific sub-instructions for implementing the MS operation.

In some embodiments, for the conditional calculation MS instruction associated with the branch MS instruction, a possible discard tag may be inserted in the exception tag field for use in subsequent execution of the MS instruction.

In still other embodiments, for branch type MS instructions, a branch indicator may be inserted in the branch information field to indicate possible branch targets and/or impossible branch targets.

Figure 13 shows an example program, where (a) shows the original program to be compiled; the compiled program is divided into two parts, where (b) shows the compiled MS instruction flow, and (c) shows IP-specific MS instruction implementation, that is, the sub-instructions described previously.

In this example, the original program involves the calculation of the Relu layer and the Softmax layer of the neural network in a deep learning application, which is written in the Python language using PyTorch, for example. The calculations of the Relu layer and Softmax layer adopt the method of calling the Torch library. Therefore, according to the call mapping process described earlier, these function calls to the Torch library can be mapped into MS instructions, such as "Matmul (matrix multiplication)", "Eltwadd (element-wise addition)", " Relu" and so on. The increment of the variable Epoch and the conditional branch are packaged and mapped into a conditional branch instruction "Ifcond", and a branch indicator is inserted to indicate possible branch targets and impossible branch targets. The Print statement is mapped to another MS command ("Print").

(c) shows several MS instructions with IP specific codes. As shown, Matmul provides two IP-specific code implementations, one for GPU and one for DLA, such that "Matmul" MS instructions can be scheduled by the instruction dispatcher between the GPU lane and the DLA lane. Ifcond only provides CPU-specific code that involves reading the value Epoch from the first input storage bubble (vil), incrementing it by 1, and then storing it in the output storage bubble (vo). Then calculate the new Epoch value modulo 10 to get the result 10 and make a judgment based on it. If it is determined that the "Then" branch is to be taken (this branch is compared to an impossible branch), a "UBE" event is initiated. Therefore, the Ifcond MS instruction also inserts a "possible discard tag", and any subsequent large-scale MS instructions will be blocked until the Ifcond instruction has been executed. The Print MS instruction is dispatched only to the mother core because this instruction requires system calls and I/O to external devices.

Thus, an exemplary scheme for compiling program code into MS instructions is described above. The program code to be compiled can be in various general programming languages or domain-specific languages. By compiling these program codes into MS instructions, various new IP cores can be added to the SaaP SoC very easily without a lot of programming/compilation work, so the scalability of the SoC can be well supported. In addition, the same MS instruction can use multiple versions of sub-instructions, which also provides more options for scheduling during instruction execution and facilitates improvement of pipeline execution efficiency.

In summary, SaaP provides an excellent design choice for the traditional understanding of heterogeneous SoC. In SaaS, since there are no shared resources under the PXO principle, there is no competition. MS instructions can be executed predictably and undo on error without any overhead because there is nothing in the executing IP core to leave an observable due to an erroneous instruction. side effect. The cache does not need to be consistent since there are no duplicate cache lines, and the Snoop Filter/MESI protocol is saved since there is no bus to snoop. Although additional constraints are imposed on SaaP, it can be seen from this description that these constraints are reasonable from an analytical and empirical perspective.

Figure 14 shows a schematic structural diagram of a board card 1400 according to an embodiment of the present disclosure. As shown in the figure, the board 1400 includes a chip 1401, which may be a SaaP SoC according to an embodiment of the present disclosure, integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit to support various depths. Learning and machine learning algorithms meet the needs of intelligent processing in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely used in the field of cloud intelligence. A significant feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform. The board 1400 of this embodiment is suitable for use in cloud intelligence applications. application, with huge off-chip storage, on-chip storage and powerful computing capabilities.

The chip 1401 is connected to an external device 1403 through an external interface device 1402 . The external device 1403 is, for example, a server, computer, camera, monitor, mouse, keyboard, network card or Wifi interface. The data to be processed can be transferred to the chip 1401 from the external device 1403 through the external interface device 1402. The calculation results of the chip 1401 can be transmitted back to the external device 1403 via the external interface device 1402 . According to different application scenarios, the external interface device 1402 may have different interface forms, such as PCIe (Peripheral Component Interconnect express, high-speed peripheral component interconnection) interface, etc.

Board 1400 also includes a memory device 1404 for storing data, which includes one or more memory cells 1405 . The storage device 1404 connects and transmits data with the control device 1406 and the chip 1401 through the bus. The control device 1406 in the board card 1400 is configured to control the status of the chip 1401. To this end, in an application scenario, the control device 1406 may include a microcontroller, also known as a microcontroller unit (Micro Controller Unit, MCU).

The SoC chip in the board card provided by the embodiment of the present disclosure may include the corresponding features described above, which will not be repeated here. Embodiments of the present disclosure also provide a corresponding compilation device, which includes a processor configured to execute compiled program code; and a memory configured to store the compiled program code. When the compiled program code is generated by the processor When loaded and executed, the compilation device is caused to execute the compilation method described in any of the previous embodiments. Embodiments of the present disclosure also provide a machine-readable storage medium. The machine-readable storage medium includes compiler code. When executed, the compiler code causes the machine to perform the compilation method described in any of the previous embodiments.

According to different application scenarios, the electronic equipment or devices of the present disclosure may include servers, cloud servers, server computing clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet of Things terminals, Mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or Medical equipment. The means of transportation include airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance machines, B-ultrasound and/or electrocardiograph. The electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical and other fields. Furthermore, the electronic equipment or device of the present disclosure can also be used in cloud, edge, terminal and other application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, electronic equipment or devices with high computing power according to the solution of the present disclosure can be applied to cloud equipment (such as cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal equipment and/or Edge devices (such as smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that the hardware resources of the cloud device can be obtained based on the hardware information of the terminal device and/or the edge device. Match appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices to complete unified management, scheduling and collaborative work of end-cloud integration or cloud-edge-end integration.

It should be noted that, for the purpose of simplicity, this disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of this disclosure are not limited by the order of the described actions. . Therefore, based on the disclosure or teachings of the present disclosure, those skilled in the art can understand that certain steps may be implemented in other ways. be executed sequentially or simultaneously. Furthermore, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved are not necessarily necessary for the implementation of one or some solutions of the present disclosure. In addition, depending on the solution, the description of some embodiments in this disclosure also has different emphasis. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure, and can also refer to the relevant descriptions of other embodiments.

In terms of specific implementation, based on the disclosure and teachings of this disclosure, those skilled in the art can understand that several embodiments disclosed in this disclosure can also be implemented in other ways not disclosed herein. For example, as for each unit in the electronic equipment or device embodiment described above, this article splits them based on the logical function, but there may be other splitting methods in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in units or components may be selectively disabled. In terms of connection relationships between different units or components, the connections discussed above in connection with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.

In this disclosure, units illustrated as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.

In other implementation scenarios, the above-mentioned integrated unit can also be implemented in the form of hardware, that is, a specific hardware circuit, which can include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but is not limited to, devices such as transistors or memristors. In view of this, various devices (such as computing devices or other processing devices) described in this article can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs and ASICs (Application Specific Integrated Circuits). )wait. Furthermore, the aforementioned storage unit or storage device can be any appropriate storage medium (including magnetic storage media or magneto-optical storage media, etc.), which can be, for example, a variable resistive memory (Resistive Random Access Memory, RRAM), dynamic memory, etc. Random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory) , HBM), Hybrid Memory Cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.

The embodiments of the present disclosure have been introduced in detail above. Specific examples are used in this article to illustrate the principles and implementation methods of the present disclosure. The description of the above embodiments is only used to help understand the methods and core ideas of the present disclosure; at the same time, for Those of ordinary skill in the art may make changes in the specific implementation methods and application scope based on the ideas disclosed in this disclosure. In summary, the contents of this description should not be construed as limiting this disclosure.

Claims

An instruction execution method, including:

When issuing a mixed scale (MS) instruction, check whether the MS instruction may be discarded; and

When it is determined that the MS command may be discarded, the issuance of specific MS commands following the MS command is blocked.
The method of claim 1, wherein checking whether the MS instruction may be discarded includes:

Check if the MS instruction has a possible discard tag.
The method of claim 2, wherein the possible discard tag is inserted at compile time based on the type of the MS instruction.
The method of claim 3, wherein when the type of the MS instruction is a conditional branch instruction, the MS instruction is inserted into the possible discard tag.
The method according to any one of claims 1-4, further comprising:

In response to the possible discarded MS instructions blocking the specific MS instruction having been normally executed, the blocked specific MS instruction is issued.
The method according to any one of claims 1-5, further comprising:

In response to an exception occurring in the execution of the MS instruction, cancel the MS instruction and subsequent MS instructions executed;

Submit an MS instruction that precedes said MS instruction; and

Use the MS instruction of the corresponding exception handler as the next MS instruction.
The method of claim 6, wherein revoking the MS instruction and subsequently executed MS instructions includes:

Cancel the canceled MS command in the command queue;

Terminate the execution unit executing the revoked MS instruction;

as well as

The scratchpad written by the canceled MS instruction is discarded.
The method of claim 7, wherein discarding the scratchpad written by the revoked MS instruction includes:

Delete the corresponding mapping relationship from the record that stores the renaming mapping relationship between the physical name and the logical name of the temporary register.
The method according to any one of claims 6-8, further comprising:

When the exception is an impossible branch exception (UBE) triggered by a branch-type MS instruction, the impossible branch target indicated by the branch instruction information attached to the MS instruction is determined as the next MS instruction after the exception is eliminated.
The method according to any one of claims 1-9, wherein the specific MS instruction is an MS instruction that meets one or more of the following conditions:

The size of the register corresponding to the output data of the MS instruction exceeds the set threshold;

MS instructions that perform write operations on system memory;

Execute an MS instruction that takes longer than a predetermined value; or

MS instructions executed by a specific execution unit.
A system controller configured to execute the instruction execution method according to any one of claims 1-10.
A machine-readable storage medium including code that, when executed, causes a machine to perform the method of any one of claims 1-10.
A system on a chip (Soc) includes the system controller according to claim 11, and a plurality of heterogeneous IP cores, the plurality of heterogeneous IP cores serving as execution units of the MS instructions.
A board card includes the system on chip according to claim 13.