WO2024002176A1 - Instruction processing device, instruction execution method, system-on-chip and board card - Google Patents
Instruction processing device, instruction execution method, system-on-chip and board card
- Publication number: WO2024002176A1
- Application number: PCT/CN2023/103276
- Authority: WIPO (PCT)
- Prior art keywords: instruction, instructions, execution, execution unit, data
Classifications
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F9/3017—Runtime instruction translation, e.g. macros
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3818—Decoding for concurrent execution
- G06F9/3822—Parallel decoding, e.g. parallel decode units
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
Definitions
- This disclosure relates generally to the field of instruction sets. More specifically, the present disclosure relates to an instruction processing device, an instruction execution method, a system on a chip, and a board card.
- DSA: Domain-Specific Architecture, typically used for computing purposes
- IP: Intellectual Property (core)
- SoC: System on Chip
- An IP typically exposes only its own hardware interfaces, forcing the SoC to manage the IP as a standalone device using code running on the host CPU. Because it is extremely difficult for application developers to manage hardware heterogeneity directly, significant effort is often invested in building programming frameworks that manage this heterogeneity on their behalf.
- Popular programming frameworks for deep learning include PyTorch, TensorFlow, and MXNet, all of which provide application developers with high-level, easy-to-use Python interfaces.
- In current SoCs, the host CPU must treat the IPs as independent devices and use code running on the host CPU (i.e., a CPU-centric approach) to coordinate the different IPs, which incurs non-negligible costs in both control and data exchange. Furthermore, even though many of the integrated IPs share some commonality, a domain-specific programming framework may be unable to leverage available IPs from other domains to perform the same function. For example, using the DLA (Deep Learning Accelerator) in the NVIDIA Tegra Xavier requires explicit programming.
- To this end, the present disclosure provides solutions from multiple aspects.
- It provides a new unified system-on-chip architecture framework (which may be called SoC-as-a-Processor, or SaaP for short) that eliminates hardware heterogeneity from the software perspective and improves programming productivity and hardware utilization.
- An architecture-agnostic mixed-scale instruction set is provided to support the high productivity of SaaP, together with new SaaP components, including storage bubbles for on-chip management and an on-chip interconnect for data paths, so as to build an efficient SaaP architecture.
- A compilation method is provided for compiling program code written in various high-level programming languages into mixed-scale instructions.
- Other aspects of this disclosure provide solutions for branch prediction, exceptions, and interrupts.
- In a first aspect, the present disclosure discloses an instruction processing device, including: an instruction decoder for decoding mixed-scale (MS) instructions, where an MS instruction includes a sub-instruction field indicating sub-instruction information specific to one or more execution units capable of executing the MS instruction; and an instruction dispatcher for dispatching the MS instruction to the corresponding execution unit according to the sub-instruction field.
- In a second aspect, the present disclosure discloses an instruction execution method, including: decoding a mixed-scale (MS) instruction, the MS instruction including a sub-instruction field indicating sub-instruction information specific to one or more execution units capable of executing the MS instruction; and dispatching the MS instruction to the corresponding execution unit according to the sub-instruction field.
- In a third aspect, the present disclosure discloses a system on chip (SoC), including the instruction processing device of the first aspect and a plurality of heterogeneous IP cores serving as the execution units.
- In a fourth aspect, the present disclosure discloses a board card including the system-on-chip of the third aspect.
- With the instruction processing device, instruction execution method, system-on-chip, and board card provided above, the new MS instruction set forms a unified abstraction over the software/hardware interface, hiding the heterogeneity between different hardware or different instructions, so that a unified MS instruction format is seen at the hardware level.
- These MS instructions can then be distributed to different execution units for actual execution, as sketched below.
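- As a purely illustrative sketch (not the patent's actual encoding; all type and field names here are hypothetical), an MS instruction record and the dispatch decision could look as follows in C:

```c
#include <stdint.h>

/* Hypothetical MS instruction record; the disclosure does not fix a layout. */
typedef struct {
    uint8_t  lane_mask;  /* bit i set => a sub-instruction exists for lane i */
    uint32_t sub_addr;   /* address of the first sub-instruction block       */
    uint8_t  src0, src1; /* logical storage-bubble numbers (two inputs max)  */
    uint8_t  dst;        /* logical storage-bubble number (one output)       */
} ms_instr_t;

/* Dispatch: choose any execution-unit lane whose bit is set in the
 * sub-instruction field; heterogeneity stays hidden behind the format. */
int pick_lane(const ms_instr_t *mi) {
    for (int lane = 0; lane < 8; lane++)
        if (mi->lane_mask & (1u << lane))
            return lane;   /* first lane able to execute this MS instruction */
    return -1;             /* no execution unit can run it */
}
```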
- Figure 1 schematically shows a typical architecture of a SoC.
- Figure 2 shows the hardware heterogeneity on a SoC.
- Figure 3 shows a typical timeline for a traditional SoC.
- Figure 4a schematically illustrates, in a simplified diagram, a SaaP architecture according to an embodiment of the present disclosure.
- Figure 4b shows the traditional SoC architecture for comparison.
- Figure 5 shows the overall architecture of a SaaP according to an embodiment of the present disclosure.
- Figure 6 schematically shows an example process of performing tasks on the MISC architecture.
- Figure 7 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure.
- Figure 8 shows an exemplary flowchart of an instruction execution method for a branch instruction according to an embodiment of the present disclosure.
- Figure 9 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure.
- Figure 10 shows an instruction execution example according to an embodiment of the present disclosure.
- Figure 11 shows several different data path designs.
- Figure 12 shows an exemplary flowchart of a compilation method according to an embodiment of the present disclosure.
- Figure 13 shows an example program.
- Figure 14 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
- Depending on the context, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting".
- A SoC is an integrated circuit chip that integrates all the key components of a system on the same chip. The SoC is the most common integration solution in today's mobile/edge scenarios: its high level of integration improves system performance, reduces overall power consumption, and yields a significantly smaller area cost compared to motherboard-based solutions.
- Figure 1 schematically shows a typical architecture of a SoC.
- Due to performance requirements under a limited area/power budget, a SoC usually integrates many dedicated hardware IPs, typically domain-specific architectures for computing purposes, especially for accelerating domain-specific or other specific applications.
- Some of these hardware IPs are customized by SoC designers, such as neural network processing IP (the Neural Engine (NE) in the Apple A15, the deep learning accelerator (DLA) in the NVIDIA Jetson Xavier, and the neural processing units (NPU) in HiSilicon Kirin and Samsung Exynos); others are standardized by IP suppliers, such as CPUs and GPUs from Arm or Imagination, DSPs (Digital Signal Processors) from Synopsys or Cadence, and FPGAs (Field-Programmable Gate Arrays) from Intel or Xilinx.
- In Figure 1, a CPU 101, a GPU 102, an NPU 103, an on-chip RAM (Random Access Memory) 104, and a DRAM (Dynamic Random Access Memory) controller 105 are shown.
- A common bus used for SoC on-chip interconnection is ARM's open standard Advanced Microcontroller Bus Architecture (AMBA).
- The SoC uses shared buses to connect and manage the various functional blocks in the SoC.
- These shared buses include the Advanced High-performance Bus (AHB) for high-speed connections and the Advanced Peripheral Bus (APB) for low-bandwidth, low-speed peripheral connections.
- Hardware heterogeneity includes the heterogeneity of IPs within a SoC and the heterogeneity of IPs between SoCs.
- Figure 2 illustrates the hardware heterogeneity on SoCs.
- The figure shows several IPs integrated on each SoC.
- For example, a certain model A integrates a CPU and a GPU on its SoC;
- a certain model B integrates a CPU, a GPU, and a neural engine (NE) for neural network processing on its SoC;
- a certain model C integrates a CPU, a GPU, and a neural processing unit (NPU) for neural network processing on its SoC;
- and a certain model D integrates a CPU, a GPU, a deep learning accelerator (DLA) for deep learning, and a programmable vision accelerator (PVA) in its SoC.
- On one hand, the IPs on the same SoC are different, for example, serving different purposes.
- This is because more and more different types of IP (especially for computing purposes) are integrated into SoCs to achieve high efficiency.
- New IP will continue to be introduced into SoCs.
- For example, neural network processing IP has been widely introduced into recent mobile SoCs.
- Meanwhile, the number of processing units in a SoC continues to grow.
- For example, the SoC of a certain model A mainly includes 10 processing units (2 large cores, 2 small cores, and a 6-core GPU), while in a certain model B the number of processing units grows to 30 (2 large general-purpose cores, 4 small general-purpose cores, a 16-core neural engine, and a 5-core GPU).
- On the other hand, the IP that implements the same function on different SoCs may vary greatly, because vendors prefer their own IP for business reasons.
- As a result, the same functionality (such as neural network processing) maps to different IPs on different SoCs:
- on one model's SoC it is the neural engine (NE); on a certain model D it is the deep learning accelerator (DLA); and on a certain model C it is the neural processing unit (NPU).
- Many computing-purpose IPs are specific to a certain field (e.g., deep learning) or have some generality for certain types of operations (e.g., GPUs for tensor operations).
- Programming IPs such as GPUs and NPUs for computing purposes relies on support from programming frameworks and vendors.
- Such programming frameworks include PyTorch, TensorFlow, MXNet, etc.
- These programming frameworks provide high-level programming interfaces (C++/Python) for customizing the IP, and the interfaces are implemented using the IP vendor's low-level interfaces.
- IP suppliers provide different programming interfaces, such as PTX (Parallel Thread Execution), CUDA (Compute Unified Device Architecture), cuDNN (CUDA Deep Neural Network library), and NCCL (NVIDIA Collective Communications Library), to make their hardware drivers fit these programming frameworks.
- However, programming frameworks require extremely large development efforts because they must bridge the gap between software diversity and hardware diversity.
- Programming frameworks provide application developers with high-level interfaces to improve programming productivity, and these interfaces are carefully implemented to deliver hardware performance and efficiency.
- For example, TensorFlow was initially developed by about 100 developers and is currently maintained by 3,000+ contributors to support dozens of SoC platforms.
- Optimizing one operator on a certain IP may take a skilled developer several months.
- Moreover, application developers may be required to provide different implementations for different SoCs; for example, a program written for a certain model D cannot be run directly on a server-side DGX-1 with GPU TensorCores.
- FIG. 3 shows a typical timeline for a traditional SoC.
- In a traditional SoC, the host CPU runs the programming framework for runtime management, and each call to an IP is started/ended by the host CPU, which brings non-negligible runtime overhead.
- In addition, data is stored in off-chip main memory, and each IP reads/writes data from/to the main memory, which brings additional data accesses.
- In the example shown, control is returned from the GPU to the programming framework 39 times, occupying 56.75M of DRAM space, 95.06% of which is unnecessary.
- According to Amdahl's law, the efficiency of such a system is limited, especially for programs composed of fragmented operations.
- To this end, this disclosure proposes a solution that lets the SoC hardware manage heterogeneity on its own.
- The inventors noticed that in classic CPUs, heterogeneous arithmetic logic units (ALUs) and floating-point units (FPUs) are regarded as execution units in the pipeline and are managed by hardware.
- Similarly, an IP can also be regarded as an execution unit in an IP-level pipeline, which yields a unified SoC-as-a-Processor (SaaP).
- Figure 4a schematically illustrates a SaaP architecture according to an embodiment of the present disclosure in a simplified diagram.
- Figure 4b shows a traditional SoC architecture, where single lines represent control flow and wide lines represent data flow.
- The SaaP of the embodiments of the present disclosure reconstructs the SoC into a processor. It includes a system controller 410 (equivalent to the controller, i.e., the pipeline manager, in a processor), which manages the hardware pipeline, including fetching instructions from system memory (e.g., the DRAM 440 in the figure), decoding them, dispatching them, undoing them, committing them, etc.
- Multiple heterogeneous IP cores, including CPU cores, are integrated into the SoC as execution units in the hardware pipeline 420 (equivalent to the arithmetic units in a processor) for executing instructions assigned by the system controller 410. SaaP can therefore utilize the hardware pipeline, rather than a programming framework, to manage the heterogeneous IP cores.
- An MS instruction is a unified instruction that can be applied to the various heterogeneous IP cores; hardware heterogeneity is therefore transparent under MS instructions.
- MS instructions are fetched, decoded, dispatched, revoked, committed, etc. by the system controller 410. The adoption of MS instructions can fully exploit mixed-level parallelism.
- On-chip memory 430, such as on-chip SRAM (Static Random Access Memory) or registers, can also be provided for the SaaP to cache data related to execution on the execution units (IP cores), such as input data and output data. After data in system memory is transferred to the on-chip memory, an IP core can interact with the on-chip memory to access the data.
- On-chip memory 430 is similar to registers in a processor, whereby on-chip IP coordination can also be implemented implicitly in a manner similar to register forwarding in a multi-scalar pipeline.
- In the SaaP hardware pipeline, mixed-level parallelism can be fully exploited through the MS instructions, and the on-chip memory realizes data exchange between IP cores, thereby achieving high hardware performance. Moreover, SaaP allows any type of IP core to be integrated as an execution unit, and high-level code from application developers can be compiled for a new IP core with only slight adjustments, thus improving programming productivity.
- The traditional SoC shown in Figure 4b, by comparison, is CPU-centric, with the programming framework running on the host CPU.
- The various IP cores are attached to the system bus as isolated devices and managed by software running on the host CPU.
- In contrast, the SaaP SoC is built with an IP-level pipeline, and each IP core is managed as an execution unit.
- The control flow can thus naturally be managed by the pipeline manager, and no programming framework is required at runtime.
- Moreover, data exchange can be performed directly between different IP cores.
- SaaP SoCs follow the principle of the Pure eXclusive Ownership (PXO) architecture in their design.
- The principle is that each data-related resource in the system, including on-chip buffers, data paths, data caches, memory, and I/O (Input/Output) devices, is exclusively occupied by one IP core at any given time.
- FIG. 5 shows the overall architecture of a SaaP according to an embodiment of the present disclosure in more detail. Similar to the Tomasulo pipeline, SaaP can contain an out-of-order five-stage pipeline.
- As the pipeline manager, the system controller can include multiple functional components to implement the different functions of pipeline management.
- The instruction decoder 511 decodes the MS instructions proposed in the embodiments of the present disclosure.
- The instruction dispatcher 512 dispatches the MS instructions.
- The instruction exit circuit 513 completes instruction commitment and retires completed MS instructions in order.
- The MS instruction cache 514 caches MS instructions.
- The renaming circuit 515 renames the storage elements involved in an instruction, for example, to resolve possible data hazards.
- The system controller may utilize the renaming mechanism to implement any one or more of the following processes: resolving data hazards on storage elements, MS instruction revocation, MS instruction commitment, etc.
- The exception handling circuit 516 responds to exceptions thrown by the IP cores and performs corresponding processing. The functions of each component are described in the relevant sections below.
- IP cores (the figure illustrates various IP cores such as CPU cores, GPU cores, and DLA cores) act as execution units for performing actual operations.
- The IP cores and related components, such as the reservation stations 521 and the IP instruction caches 522, may be collectively referred to as the IP core complex 520.
- On-chip memory is also provided in the SaaP.
- The on-chip memory can be implemented as a bank of scratchpads (also called a set of storage bubbles) that buffer input and output data.
- Storage bubbles act as the registers of the processor.
- The set of storage bubbles can include multiple scratchpads with different storage capacities, which cache data related to execution on the multiple heterogeneous IP cores.
- For example, the capacities of the storage bubbles can range from 64B, 128B, 256B, ..., 256KB, up to 512KB.
- The number of small-capacity storage bubbles can be greater than the number of large-capacity storage bubbles, so as to better support task requirements of different scales.
- This group of storage bubbles may be collectively referred to as the storage bubble complex 530.
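- As a hedged illustration of such a size-graded pool (the per-size counts below are assumptions; only the 64B-512KB range and the "more small than large" property come from the text), allocation could pick the smallest bubble that fits:

```c
#include <stddef.h>

/* Size-graded storage-bubble pool; counts are illustrative only. */
typedef struct { size_t capacity; int count; } bubble_class_t;

static const bubble_class_t bubble_pool[] = {
    {       64, 32 },  /*  64 B: many small bubbles   */
    {      128, 32 },
    {      256, 16 },
    /* ... intermediate power-of-two sizes ...        */
    { 256*1024,  4 },  /* 256 KB                      */
    { 512*1024,  2 },  /* 512 KB: few large bubbles   */
};

/* Pick the smallest bubble class whose capacity fits the payload. */
int pick_bubble_class(size_t bytes) {
    for (size_t i = 0; i < sizeof bubble_pool / sizeof bubble_pool[0]; i++)
        if (bubble_pool[i].capacity >= bytes)
            return (int)i;
    return -1;  /* payload larger than the largest bubble */
}
```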
- Furthermore, an on-chip interconnect 540 is provided to supply non-blocking data path connectivity between the multiple heterogeneous IP cores and the set of storage bubbles.
- Functionally, the on-chip interconnect acts as a shared data bus.
- The on-chip interconnect 540 can be implemented based on a sorting network, thereby providing a non-blocking data path at only a small hardware cost and with acceptable latency.
- The on-chip interconnect 540 may also be referred to as the Golgi.
- One IP core among the above-mentioned multiple heterogeneous IP cores can be designated as the mother core, responsible for managing the entire system.
- The mother core exclusively manages the exchange of data between system memory and the storage bubbles.
- The mother core also exclusively manages I/O operations with external devices.
- The mother core can also run the operating system (OS) and runtime, and is responsible for at least one or more of the following processes: process management, page management, exception handling, interrupt handling, etc.
- Branching and speculative execution are implemented through exception handling, in which unlikely branches are treated as Unlikely Branch Exceptions (UBE).
- Static prediction can be used to implement branching and speculative execution.
- Usually, a CPU core with general processing functions is designated as the mother core.
- Non-mother IP cores may be divided into different IP lanes based on their functionality and/or type.
- The mother core itself belongs to a separate IP lane.
- For example, a mother core lane, a CPU lane, a GPU lane, a DLA lane, etc. are shown in Figure 5. When scheduling an MS instruction, the MS instruction can then be dispatched to an appropriate IP lane based at least in part on the task type of the MS instruction.
- SaaP uses MS instructions to execute the entire program. When the system controller fetches an MS instruction, it decodes it to prepare the data for execution. Data is loaded from system memory into storage bubbles or forwarded quickly from other storage bubbles. If there is no conflict, the MS instruction is sent to the instruction dispatcher and then to an appropriate IP core (e.g., a DLA core) for actual execution. That IP core loads the actual precompiled IP-specific code (e.g., DLA instructions) indicated by the issued MS instruction and then executes that code, much like execution on a regular accelerator. After execution completes, the MS instruction exits the pipeline and commits its results.
- Table 1 shows a comparison between different instruction sets.
- CISC: Complex Instruction Set Computer; RISC: Reduced Instruction Set Computer.
- The length of a CISC instruction is variable; some instructions implement complex functions and take many beats, while others implement simple functions and take few.
- The cycles per CISC instruction typically range from 2 to 15.
- The instruction length of RISC is fixed, and the cycles per instruction are relatively uniform, about 1 to 1.5 cycles.
- The present disclosure proposes a mixed-scale (MS) instruction set (Mixed-scale Instruction Set Computer, MISC), whose form is similar to RISC and which is suitable for various IP cores.
- IP cores (various accelerators, mainly for computing purposes) need to efficiently process large-grained, complex work, so the cycles per instruction (CPI) of a single MS instruction are longer than those of RISC, ranging from about 10 to more than 10,000 beats, a relatively wide range.
- MS instructions thus have mixed load sizes: a load can be relatively small, e.g., taking only about 10 beats to execute, or relatively large, e.g., taking more than 10,000 beats. Therefore, the payloads carried by MS instructions require containers of different sizes, from which input data is retrieved and into which computed result data is saved.
- The aforementioned set of storage bubbles of various sizes (e.g., from 64B to 512KB) stores the input and/or output data required by MS instructions, thereby supporting these mixed load sizes.
- MS instructions are IP-independent; that is, MS instructions are not aware of the IP. Specifically, the instructions specific to each IP core (i.e., the heterogeneous instructions) are encapsulated in MS instructions, and the encapsulating MS instruction format does not depend on which IP core's instructions are encapsulated.
- An MS instruction may include a sub-instruction field that indicates sub-instruction information specific to one or more IP cores capable of executing the MS instruction. It can be understood that for an MS instruction to run on a certain IP core, there must be a piece of code that the IP core can recognize (i.e., IP core-specific code). This code is itself composed of one or more instructions specific to that IP core; since these instructions are encapsulated in the MS instruction, they are called sub-instructions. The system controller can therefore dispatch the MS instruction to a corresponding IP core according to the sub-instruction field.
- The sub-instruction information may include the type of the sub-instruction (i.e., the type of IP core or IP lane) and/or the address of the sub-instruction. The sub-instruction information can be represented in multiple ways.
- In some implementations, the addresses of the one or more IP core-specific sub-instructions may be placed directly in the sub-instruction field. The sub-instruction types and addresses can then be determined directly from the MS instruction. However, since the same MS instruction may be able to run on multiple heterogeneous IP cores, the length of the MS instruction would vary with the number of IP core types that can run it.
- In other implementations, a bit sequence can be used to indicate whether the MS instruction has a sub-instruction of each corresponding type, and a first address can be used to indicate the address of the first sub-instruction.
- The length of the bit sequence may equal the number of IP core types or IP lane types, so that each bit in the bit sequence indicates whether a sub-instruction of the corresponding type exists.
- The address of the first sub-instruction is obtained directly from the first address.
- The sub-instruction addresses for subsequent IP lanes can be indexed in a fixed way (for example, separated by a fixed address distance) or reached by a jump embedded in the MS instruction, as sketched below.
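- To make the second implementation concrete, the following is a minimal sketch assuming sub-instruction blocks for the present lanes are laid out consecutively from the first address at a fixed stride; the stride value and all names are assumptions, not the patent's format:

```c
#include <stdint.h>

#define SUB_BLOCK_STRIDE 0x100u  /* assumed fixed address distance per block */

/* Given the bit sequence and first address from the sub-instruction field,
 * return the sub-instruction address for `lane`, or 0 if this MS
 * instruction carries no sub-instruction of that type. */
uint32_t sub_instr_addr(uint8_t bitseq, uint32_t first_addr, int lane) {
    if (!(bitseq & (1u << lane)))
        return 0;                      /* no sub-instruction for this lane */
    /* Count the lower-numbered lanes that are present; their blocks
     * precede this lane's block at the fixed stride. */
    int preceding = 0;
    for (int i = 0; i < lane; i++)
        if (bitseq & (1u << i))
            preceding++;
    return first_addr + (uint32_t)preceding * SUB_BLOCK_STRIDE;
}
```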
- The embodiments of this disclosure place no restriction on the specific format of MS instructions.
- MS instructions are defined to perform complex functions. Each MS instruction performs a complex function, such as a convolution, and the instruction is broken down into fine-grained IP-specific code (i.e., sub-instructions), such as RISC instructions, for actual execution.
- The IP-specific code can be code compiled against a standard library (e.g., std::inner_product from libstdc++ for inner products) or code generated from a vendor-specific library (e.g., cublasSdot from cuBLAS, also for inner products). This makes it possible for SaaP to integrate different types of IP, because the same MS instruction can be flexibly issued to different types of IP cores. Heterogeneity is therefore hidden from application developers, which also increases the robustness of SaaP.
- MS instructions have a limited arity.
- Each MS instruction accesses at most three storage bubbles: two source storage bubbles and one destination storage bubble. That is, for data management, each MS instruction has at most two input data fields and one output data field, which indicate the data involved in executing the MS instruction.
- These data fields may be represented by the numbers of the associated storage bubbles, e.g., indicating two input storage bubbles and one output storage bubble, respectively.
- The limited arity reduces the complexity of conflict resolution, renaming, datapath design, and the compiler toolchain. For example, if the arity of MS instructions were unlimited, the decoding time would vary greatly between MS instructions, resulting in irregular hardware pipelines and inefficiencies.
- Functions with more inputs and outputs can be supported through currying. Currying is a technique that converts a multi-argument function into a sequence of functions of fewer arguments, e.g., through nesting or chaining. Thereby, functions with any number of inputs and outputs can be converted into forms that satisfy the limited arity of MS instructions, as illustrated below.
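- For example (an illustrative sketch only; the fused multiply-add operator is not taken from the disclosure), a three-input computation exceeds the two-source limit but can be chained into two limited-arity steps through an intermediate container:

```c
/* d = a*b + c has three inputs, exceeding the two-source/one-destination
 * limit of an MS instruction. Curried into a chain of two operations,
 * each respects the limited arity, with t as an intermediate container:
 *     t = mul(a, b)   // sources: a, b   destination: t
 *     d = add(t, c)   // sources: t, c   destination: d
 */
static float mul_op(float a, float b) { return a * b; }
static float add_op(float t, float c) { return t + c; }

float fused_curried(float a, float b, float c) {
    float t = mul_op(a, b);  /* intermediate result gets its own container */
    return add_op(t, c);
}
```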
- MS instructions have no side effects. "No side effects" here means that the execution state of the current instruction does not affect the execution of subsequent instructions; in other words, if the current instruction is revoked, it can be revoked without any residual state affecting subsequent instructions.
- The execution of an MS instruction leaves no observable side effect on the SaaP architecture.
- An exception is the MS instructions executed on the mother core, since the mother core can operate on system memory and external devices. This constraint is important for implementing mixed-level parallelism (MLP), as it enables simple rollback of effects when MS instructions need to be undone, for example due to speculative execution.
- Accordingly, the data fields of an MS instruction executed on a non-mother IP core can only point to storage bubbles, not to system memory.
- In addition, the storage bubble corresponding to the output data is exclusively assigned to the IP core that executes the MS instruction.
- Figure 6 schematically shows an example process of executing tasks on the MISC architecture, to better understand the implementation of MS instructions.
- The illustrated MISC architecture has, for example, a mother core and one IP core.
- The tasks to be performed are to make a sandwich (materials: bread and meat) and a vegetable salad (material: vegetables).
- The bread is named A, the meat B, the vegetables C, the sandwich D, and the salad E.
- The mother core manages the system memory, so the mother core first loads the materials to be processed from system memory into storage bubbles; the IP core can then process the materials in the storage bubbles.
- The above tasks can be expressed as the following MS instruction flow (sketched below; the flow is depicted in Figure 6).
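- The concrete listing appears in Figure 6; the following is a hedged reconstruction from the description below (instruction names, lane assignments, and bubble numbers are inferred, not quoted from the figure):

```c
#include <stdio.h>

/* Hedged reconstruction of the Figure 6 flow. -1 stands for "void",
 * i.e., a field that needs no storage bubble (system memory involved). */
typedef struct { const char *op; int src0, src1, dst; } ms_flow_t;

static const ms_flow_t flow[] = {
    { "Load Bread (A)",      -1, -1, 1 },  /* mother core: memory -> v1   */
    { "Load Meat (B)",       -1, -1, 2 },  /* mother core: memory -> v2   */
    { "Make Sandwich (D)",    1,  2, 1 },  /* IP core: v1 renamed to v3   */
    { "Load Vegetables (C)", -1, -1, 4 },  /* mother core: memory -> v4   */
    { "Make Salad (E)",       4, -1, 5 },  /* runs on the idle mother core */
};

int main(void) {
    for (unsigned i = 0; i < sizeof flow / sizeof flow[0]; i++)
        printf("%-20s src=(%2d,%2d) dst=%2d\n",
               flow[i].op, flow[i].src0, flow[i].src1, flow[i].dst);
    return 0;
}
```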
- For each MS instruction, every core that can execute it provides code in its own specific form, i.e., core-specific sub-instructions, so that the core knows how to perform the corresponding task.
- In the MS instruction flow above, the sub-instructions are indicated only by their tasks or functions; their different concrete forms are not distinguished.
- The storage bubble numbers (v1, v2) used in the MS instructions are logical numbers. In an actual implementation, the storage bubbles are renamed to different physical numbers to resolve WAW (Write After Write) dependencies and support out-of-order speculative execution. A void in an instruction indicates that the corresponding field needs no storage bubble, for example when system memory is involved.
- State 1 is the initial state; in state 2, the mother core executes the "Load Bread" MS instruction.
- The Load instruction involves access to system memory and is therefore assigned to the mother core for execution.
- The mother core takes the data out of system memory and stores it into storage bubble v1.
- The specific memory access address information for the system memory may be placed in an additional instruction field; the embodiments of the present disclosure place no limitation in this regard.
- In state 3, the mother core executes the "Load Meat" instruction. As with the "Load Bread" instruction, the mother core takes the data out of system memory and stores it into storage bubble v2.
- The "Make Sandwich" MS instruction is assigned to the IP core for processing because it takes more processing time.
- The IP core needs to take the bread out of v1 and the meat out of v2, and put the finished sandwich into v1.
- Writing the output back into an input bubble, however, raises a WAR (Write After Read) hazard. This direct approach is impractical because an MS instruction may be very large, e.g., requiring tens of thousands of beats, and the partially made sandwich must be stored somewhere in the meantime. To resolve this data hazard, a storage bubble renaming mechanism can be used.
- The storage bubble renaming circuit 515 maintains the mapping between physical names and logical names.
- The storage bubble v1 corresponding to the output data of the "Make Sandwich" instruction is renamed to storage bubble v3, so the finished sandwich is placed in v3.
- The ellipsis in v3 in Figure 6 indicates that this writing process takes a while to complete.
- the "Make Salad" instruction can be assigned to the currently idle mother core for execution.
- the status of each core can be marked, for example, by a bit sequence to facilitate the instruction dispatcher to dispatch instructions. Again, the renaming mechanism is applied here as well.
- the mother core takes out the vegetables from the storage bubble v4, makes them into salads and puts them into the storage bubble v5.
- the IP core can start processing.
- “Make Sandwich” takes more time, so “Make Salad” is executed on the mother core and completed in advance, so that mixed-level parallelism (MLP) can be fully exploited. Therefore, the execution of different IP cores does not interfere with each other, that is, they can be executed out of order, but they are submitted in order.
- MLP mixed-level parallelism
- SaaP SoCs employ an out-of-order pipeline to mine the mixed-level parallelism between IP cores.
- The pipeline can contain five stages: fetch & decode, conflict resolution, dispatch, execution, and exit.
- FIG. 7 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure.
- The following description can be understood with reference to the SaaP architecture shown in Figure 5.
- Figure 7 shows an instruction execution process covering the complete pipeline; those skilled in the art will understand that some steps may only occur under specific circumstances and are therefore not always necessary, as can be discerned from the specific situation.
- In step 710, instruction fetch & decode is performed.
- The MS instruction is retrieved from the MS instruction cache 514 based on the MS program counter (PC), and the instruction decoder 511 decodes the retrieved MS instruction to prepare the operands.
- The decoded MS instructions can be placed in the instruction queue of the instruction decoder 511.
- As mentioned above, an MS instruction includes a sub-instruction field that indicates sub-instruction information specific to one or more IP cores capable of executing the MS instruction.
- The sub-instruction information may, for example, indicate the type of the sub-instruction (i.e., the type of IP core or IP lane) and/or the address of the sub-instruction.
- When the MS instruction is fetched and decoded, the corresponding sub-instructions can be fetched in advance according to the decoding result and stored in a designated location, such as the sub-instruction cache 522 (also labeled the IP instruction cache 522 in Figure 5). When the MS instruction is later issued to the corresponding IP core for execution, the IP core can fetch the corresponding sub-instructions from the sub-instruction cache 522 and execute them.
- The MS instruction may be a branch instruction.
- In embodiments of the present disclosure, static prediction is used to determine the direction of a branch instruction, that is, to determine the PC value of the next MS instruction.
- The inventors analyzed the branch behavior in benchmark programs and found that 80.6% to 99.8% of large-scale instruction branches can be predicted correctly at compile time. Since the large-scale instructions determine the overall execution time, the embodiments of the present disclosure use static prediction for branch prediction, eliminating the need for a hardware branch predictor. Thus, whenever a branch is encountered, the next MS instruction is always assumed to lie in the statically predicted likely branch direction.
- The pipeline then proceeds to step 720, where possible conflicts are resolved.
- The fetched MS instructions are queued to resolve conflicts.
- Possible conflicts include (1) data hazards; (2) structural conflicts (e.g., no space available in the exit unit); and (3) exception violations (e.g., blocking an MS instruction that cannot easily be undone until it is confirmed to be on the taken path).
- Data hazards such as Read After Write (RAW) and Write After Write (WAW) can be resolved through the storage bubble renaming mechanism.
- When there is a data hazard on a storage bubble involved in an MS instruction, the storage bubble renaming circuit 515 renames the logical name of the storage bubble to a physical name before the MS instruction is dispatched, and saves the mapping between the physical and logical names.
- Through the storage bubble renaming mechanism, SaaP can support fast MS instruction revocation (achieved by simply discarding the rename mapping of the output data storage bubble) and out-of-order execution free of WAW hazards, as sketched below.
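- A minimal sketch of such a rename table (table sizes and names are assumptions): renaming allocates a fresh physical bubble for each output, revocation merely discards the speculative mapping, and commit makes it permanent without copying data:

```c
#define N_LOGICAL  16
#define N_PHYSICAL 32

typedef struct {
    int  committed[N_LOGICAL];    /* architectural logical -> physical map  */
    int  speculative[N_LOGICAL];  /* in-flight map; -1 if no pending rename */
    char busy[N_PHYSICAL];        /* physical-bubble allocation flags       */
} rename_table_t;

/* Rename an instruction's output bubble: a fresh physical bubble keeps
 * the old value intact, removing WAW/WAR hazards. */
int rename_output(rename_table_t *t, int logical) {
    for (int p = 0; p < N_PHYSICAL; p++) {
        if (!t->busy[p]) {
            t->busy[p] = 1;
            t->speculative[logical] = p;
            return p;
        }
    }
    return -1;  /* structural conflict: stall until a bubble frees up */
}

/* Revoke: drop the speculative mapping; committed data is untouched. */
void revoke(rename_table_t *t, int logical) {
    int p = t->speculative[logical];
    if (p >= 0) { t->busy[p] = 0; t->speculative[logical] = -1; }
}

/* Commit: permanently acknowledge the mapping; no data is copied. */
void commit(rename_table_t *t, int logical) {
    int p = t->speculative[logical];
    if (p >= 0) {
        int old = t->committed[logical];
        if (old >= 0) t->busy[old] = 0;  /* free the superseded bubble */
        t->committed[logical] = p;
        t->speculative[logical] = -1;
    }
}
```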
- In step 730, the MS instruction is dispatched by the instruction dispatcher 512.
- As mentioned above, an MS instruction has a sub-instruction field that indicates the IP cores capable of executing it. The instruction dispatcher 512 can therefore dispatch the MS instruction to a corresponding IP core according to the sub-instruction field. Specifically, the MS instruction is first dispatched to the reservation station of the corresponding IP lane, for subsequent issue to an appropriate IP core.
- As mentioned above, IP cores may be divided into different IP lanes according to their functions and/or types, with each lane corresponding to a specific IP core model.
- The reservation stations can also be grouped by lane; for example, each lane can have a corresponding reservation station.
- Figure 5 shows a mother core lane, a CPU lane, a GPU lane, a DLA lane, and so on.
- Different lanes can be suited to different types of tasks. Therefore, when scheduling an MS instruction, it can be dispatched, based at least in part on its task type, to the reservation station of an appropriate lane for subsequent issue to a suitable IP core.
- Scheduling can also be performed among the multiple IP lanes capable of executing the MS instruction according to the processing status of each IP lane, thereby improving processing efficiency. Since the same MS instruction may have multiple implementations executable on multiple IP cores, the pressure on a bottleneck lane can be relieved by selecting the dispatch lane according to an appropriate scheduling policy. For example, an MS instruction involving a convolution operation can be dispatched to either the GPU lane or the DLA lane; both can execute it effectively, and one can be selected based on the load of the two lanes, thereby speeding up overall progress.
- The scheduling policy may include various rules, such as selecting the IP core with the highest throughput or the IP core with the shortest sub-instruction sequence (one possibility is sketched below); the embodiments of the present disclosure are not limited in this regard.
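- One possible rule, sketched below under the assumption that reservation-station occupancy is the load metric (the metric and names are not specified by the disclosure): among the lanes able to execute the instruction, pick the one with the fewest pending entries:

```c
/* Pick the least-loaded lane among those able to execute the MS
 * instruction. pending[i] is the assumed reservation-station occupancy
 * of lane i; lane_mask comes from the sub-instruction field. */
int pick_lane_by_load(unsigned lane_mask, const int *pending, int n_lanes) {
    int best = -1;
    for (int i = 0; i < n_lanes; i++) {
        if (!(lane_mask & (1u << i)))
            continue;                 /* lane cannot run this instruction */
        if (best < 0 || pending[i] < pending[best])
            best = i;                 /* prefer the shorter queue */
    }
    return best;  /* -1 if no capable lane exists */
}
```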
- Some specific types of MS instructions must be dispatched to specific IP cores.
- As mentioned above, one IP core can be designated as the mother core, responsible for managing the entire system; MS instructions of system-management types must therefore be dispatched to the mother core for execution.
- The mother core exclusively manages the exchange of data between system memory and the storage bubbles, so memory-access-type MS instructions that access system memory are dispatched to the mother core.
- The mother core also exclusively manages I/O operations with external devices, so I/O-type MS instructions, such as display output, are likewise dispatched to the mother core.
- The mother core can also run the operating system (OS) and runtime, and is responsible for at least one or more of the following processes: process management, page management, exception handling, interrupt handling, etc. MS instructions involving interrupt handling by the interrupt circuit 517 are accordingly dispatched to the mother core.
- In addition, MS instructions that cannot be processed in time by other IP cores, for example because those cores are busy, can be assigned to the mother core for processing.
- Other cases in which MS instructions may be assigned to the mother core are not enumerated here.
- Execution happens in step 740, at which stage MS instructions may be executed out of order by the IP cores.
- The IP core to which an instruction is dispatched uses the actual IP-specific code to perform the functionality of the MS instruction. For example, the IP core fetches the corresponding sub-instructions from the sub-instruction cache/IP instruction cache 522 according to the dispatched instruction and executes them.
- The Tomasulo algorithm can be implemented at this stage to organize the IP cores and support mixed-level parallelism (MLP). Once the dependencies on storage bubbles are resolved, MS instructions can be continuously dispatched to the IP core complex and executed out of order.
- An adapter is used to encapsulate each IP core.
- The adapter directs program accesses to the IP instruction cache 522 and data accesses to the storage bubbles.
- Here, the "program" can be the interface signals of an accelerator (for example, CSB (Configuration Space Bus) control signals for a DLA), or a piece of IP-specific code implementing the MS instruction (for example, for programmable processors such as CPUs/GPUs).
- Operational MS instructions perform operations on data stored in the set of storage bubbles, which can be multiple scratchpads with different storage capacities.
- Each IP core has two data read ports and one data write port. During execution, the physical storage bubbles are exclusively connected to these ports, so from the perspective of the IP core, a storage bubble behaves like main memory in a traditional architecture.
- Step 750 is the exit stage.
- MS instructions exit the pipeline and commit their results.
- The instruction exit circuit 513 in Figure 5 retires completed MS instructions in order; when an MS instruction exits, its execution result is committed by confirming the rename mapping of the storage bubble corresponding to the MS instruction's output data. That is, the commit is accomplished by permanently acknowledging, in the renaming circuit 515, the rename mapping of the output data's storage bubble. Since only the rename mapping is acknowledged, no data is actually moved or copied, which avoids the overhead of copying data when data volumes are large (as is common for computing-purpose IP cores).
- The MS instruction scheme can also be applied in other environments and is not limited to environments with heterogeneous IP cores; for example, it can also be used in homogeneous environments.
- As long as the execution units can independently parse and execute the sub-instructions, "IP core" in the above description can be replaced directly by "execution unit", and the mother core by a main execution unit; the method described above still applies.
- Branch instructions may also appear in the MS instruction stream, and branch instructions introduce control dependencies.
- A control dependency relates to the program counter (PC) of the MS instructions, whose value is used when fetching instructions. If a branch instruction is not handled well, the fetch of the next instruction is delayed, blocking the pipeline and hurting pipeline efficiency. Effective branch prediction support must therefore be provided for MS instructions, i.e., support that is effective for both large-scale and small-scale instructions.
- In a conventional CPU pipeline, the branch condition is computed during decoding and the correct branch target is then determined, so that the next instruction can be fetched from the branch target address.
- This computation of the branch condition and setting of the next PC to the correct branch target usually costs only a few beats, an overhead small enough to be completely hidden by a conventional CPU instruction pipeline.
- In the MS instruction stream, however, a misprediction of a branch MS instruction may be discovered at a point in time hundreds or thousands of beats (or more) after the branch MS instruction started executing. In the MS instruction pipeline it is thus impossible to determine the PC value of the next MS instruction until the jump is actually resolved, and in that case the misprediction overhead would be very high.
- The inventors analyzed the branch behavior in five benchmark programs and found that 80.6% to 99.8% of large-scale instruction branches can be predicted correctly at compile time, i.e., in a static manner. Since large-scale instructions occupy most of the total execution time and determine the overall execution time, embodiments of the present disclosure use static prediction for branch prediction, eliminating the need for a hardware branch predictor.
- FIG. 8 shows an exemplary flowchart of an instruction execution method for a branch instruction according to an embodiment of the present disclosure. This method is executed by the system controller.
- First, in step 810, the MS instruction is decoded.
- As mentioned above, MS instructions have widely varying cycles per instruction (CPI).
- For example, the CPI of an MS instruction may range from about 10 beats to more than 10,000 beats. This varying-CPI nature of MS instructions also makes dynamic prediction difficult to apply.
- In step 820, in response to the MS instruction being a branch instruction, the next MS instruction is fetched according to branch indication information, where the branch indication information indicates a likely branch target and/or an unlikely branch target.
- The static prediction mechanism can use compiler hints: during compilation, the branch indication information is determined based on a static branch prediction method and inserted into the MS instruction stream.
- The branch indication information may contain different contents. Static prediction always takes the likely branch target as the next MS instruction address. In some cases, to preserve the locality of the instruction cache, the likely branch target can be placed immediately after the current MS instruction; in these cases, the branch indication information may only need to indicate the unlikely branch target. In other cases, the branch indication information may indicate both the likely and the unlikely branch targets. When the next MS instruction is fetched according to the branch indication information, the likely branch target indicated by the branch indication information is taken as the next MS instruction, as sketched below.
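- A sketch of how the fetch stage might consume the branch indication information under static prediction (field names are assumptions): the next PC always follows the likely direction, and when only the unlikely target is encoded, the likely target is simply the adjacent MS instruction:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     is_branch;
    bool     has_likely_target;  /* some encodings carry only the unlikely one */
    uint32_t likely_target;      /* used when present                          */
    uint32_t unlikely_target;    /* kept for recovery on a UBE event           */
} branch_hint_t;

/* Fetch-stage next-PC selection under static prediction. */
uint32_t next_pc(uint32_t pc, uint32_t instr_len, const branch_hint_t *h) {
    if (h->is_branch && h->has_likely_target)
        return h->likely_target;  /* follow the compiler's likely direction  */
    return pc + instr_len;        /* fall through: likely target is adjacent */
}
```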
- Since this is a prediction, it can go wrong. Optionally or additionally, in step 830, when a misprediction occurs, the system controller receives an Unlikely Branch Exception (UBE) event.
- The UBE event is triggered by the execution unit (for example, an IP core) that executes the condition-computing instruction associated with the branch instruction. The UBE event indicates that, according to the condition computation, the branch direction should be the unlikely branch target, that is, the earlier branch prediction was wrong.
- Then, in step 840, in response to the UBE event, the system controller performs a series of operations to resolve the branch misprediction. These operations include: squashing the MS instructions after the branch instruction; committing the MS instructions before the branch instruction; and determining the unlikely branch target indicated by the branch indication information as the next MS instruction.
- This processing corresponds to a precise exception: when the exception occurs, all instructions before the instruction interrupted by the exception have completed, while all instructions after it appear never to have executed. Since the UBE event is an exception caused by a branch misprediction, the instruction interrupted by the exception is the branch MS instruction itself.
- Different operations can be taken depending on the state of the MS instructions to be squashed. An MS instruction to be squashed is usually in one of three states: executing in an execution unit; finished executing; or not yet executed. Each state may have affected different software and hardware, and these effects must be eliminated. For example, if the instruction is executing, the execution unit running it must be terminated; if the instruction has written to a scratchpad (such as a storage vesicle) during or after execution, the scratchpads written by the squashed MS instructions must be discarded; if the instruction has not yet executed, it only needs to be cancelled from the instruction queue. Of course, since the instruction queue records all instructions that have not yet retired/committed, instructions that are executing or have finished executing must also be cancelled from the instruction queue.
- Therefore, in some embodiments, squashing the MS instructions after the branch instruction includes: cancelling these squashed MS instructions in the instruction queue; terminating the execution units executing them; and discarding the scratchpads written by them.
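- For illustration, a minimal Python sketch of these three squash actions is given below, assuming a simple instruction queue and a logical-to-physical rename table for storage vesicles; all structures are hypothetical, not the patent's implementation.

```python
# Sketch of the three squash actions: cancel queued instructions, stop their
# execution units, and drop the rename mappings of the vesicles they wrote.
class Unit:
    def terminate(self):
        print("execution unit terminated")

def squash_after(queue, rename_table, units, branch_idx):
    for inst in queue[branch_idx + 1:]:
        if inst["state"] == "executing":
            units[inst["unit"]].terminate()     # stop the IP core mid-flight
        # Discarding a written vesicle only requires dropping its rename
        # mapping; nothing is copied back, so squashing stays cheap.
        rename_table.pop(inst.get("out"), None)
        inst["state"] = "squashed"
    del queue[branch_idx + 1:]                  # remove from instruction queue

queue = [{"state": "done"},
         {"state": "executing", "unit": 0, "out": "v1"},
         {"state": "waiting", "out": "v2"}]
rt = {"v1": "p7", "v2": "p9"}
squash_after(queue, rt, [Unit()], branch_idx=0)
print(queue, rt)  # only the pre-branch instruction remains; mappings dropped
```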
- Thus, in the MS instruction pipeline, handling branch MS instructions through static prediction saves hardware resources while accommodating the widely varying CPI of MS instructions and improving pipeline efficiency. Furthermore, handling branch mispredictions through the exception mechanism further saves hardware resources and simplifies processing.
- As can be seen from the branch prediction handling above, squashing a large-scale MS instruction can be very costly. Embodiments of the present disclosure therefore propose an instruction execution scheme that blocks MS instructions with potentially high squash cost until all possibly-discarded instructions before them have finished executing, that is, until their state is determined.
- This instruction execution scheme can greatly improve the efficiency of the MS instruction pipeline in exception and interrupt handling.
- FIG. 9 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure. This method is executed by the system controller.
- As shown, in step 910, when an MS instruction is issued, it is checked whether the MS instruction may be discarded.
- In some embodiments, this check consists of checking whether the MS instruction carries a possible-discard tag. The possible-discard tag can be inserted at compile time by the compiler according to the type of the MS instruction. For example, the compiler can insert a possible-discard tag when it finds that the MS instruction is a conditional branch instruction or may raise some other exception.
- Next, in step 920, when it is determined that the MS instruction may be discarded, the issue of specific MS instructions after it is blocked.
- The specific MS instructions can be large-scale MS instructions, or more generally MS instructions whose squash cost is relatively high.
- Specifically, a specific MS instruction can be identified by one or more of the following conditions: the scratchpad (storage vesicle) corresponding to its output data is larger than a set threshold; it performs a write to system memory; its execution time exceeds a predetermined value; or it is executed by a specific execution unit.
- When the vesicle capacity of the output data exceeds the set threshold, the output data volume of the MS instruction is large and the corresponding squash overhead is high. Blocking MS instructions that write system memory mainly serves to guarantee storage consistency.
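- For illustration only, a minimal Python sketch of this test is given below; the threshold values and unit set are assumptions, not values from the patent.

```python
# Hedged sketch of the "specific MS instruction" test described above.
VESICLE_SIZE_THRESHOLD = 64 * 1024   # bytes, assumed
EXEC_TIME_THRESHOLD = 10_000         # beats, assumed
BLOCKING_UNITS = {"DLA"}             # assumed units with costly squash

def must_block(inst) -> bool:
    return (inst.get("out_size", 0) > VESICLE_SIZE_THRESHOLD
            or inst.get("writes_memory", False)
            or inst.get("est_beats", 0) > EXEC_TIME_THRESHOLD
            or inst.get("unit") in BLOCKING_UNITS)

print(must_block({"out_size": 512 * 1024}))          # True: large output vesicle
print(must_block({"est_beats": 50, "unit": "CPU"}))  # False: cheap to squash
```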
- After such specific MS instructions are blocked, the MS instructions before them are still issued and executed normally. The possible outcomes of these normally issued and executed MS instructions can be handled separately.
- On the one hand, in step 930, when all possibly-discarded MS instructions that caused a specific MS instruction to be blocked have finished executing normally, the blocked specific MS instruction can, in response to this event, be issued for execution by an execution unit. It can be understood that at this point the specific MS instruction can no longer be squashed on account of earlier instructions, so normal issue and execution of the instruction pipeline can resume.
- On the other hand, in step 940, when the execution of any possibly-discarded MS instruction that caused the specific MS instruction to be blocked raises an exception, exception handling is performed in response. Again, this corresponds to a precise exception: the MS instruction that raised the exception and the MS instructions executed after it are squashed, the MS instructions before it are committed, and the first MS instruction of the corresponding exception handler becomes the next MS instruction.
- As in the branch prediction handling described earlier, squashing the MS instruction that raised the exception and the subsequent MS instructions includes: cancelling these squashed MS instructions in the instruction queue; terminating the execution units executing them; and discarding the scratchpads written by them. Likewise, discarding the scratchpads written by these squashed MS instructions consists of deleting the corresponding mappings from the record that holds the rename mappings between the physical and logical names of these scratchpads.
- When the exception event is an Unlikely Branch Exception (UBE) triggered by a branch-type MS instruction during branch prediction handling, then in addition to the exception handling above, the unlikely branch target indicated by the branch indication information attached to that MS instruction is determined as the next MS instruction once the exception is resolved. Thus, after exception handling completes, the instruction pipeline can jump to the correct branch direction and continue execution.
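- Purely as an illustration (hypothetical structures, not the patent's implementation), the following Python sketch ties steps 920-940 together: a specific MS instruction is held back while any possibly-discarded predecessor is in flight, released when all of them complete, and left waiting, squash-free, when one of them raises an exception.

```python
# Event-driven sketch of the blocking scheme in steps 920-940.
class IssueGate:
    def __init__(self):
        self.in_flight = set()   # possibly-discarded instructions, by id
        self.blocked = []        # specific MS instructions held back

    def issue(self, inst_id, may_discard, specific):
        if specific and self.in_flight:
            self.blocked.append(inst_id)   # step 920: hold it back
            return "blocked"
        if may_discard:
            self.in_flight.add(inst_id)
        return "issued"

    def on_complete(self, inst_id):
        self.in_flight.discard(inst_id)
        if not self.in_flight:             # step 930: release the queue
            released, self.blocked = self.blocked, []
            return released
        return []

    def on_exception(self, inst_id):
        self.in_flight.discard(inst_id)
        # Blocked instructions were never issued, so they carry no squash
        # cost: they simply keep waiting (cf. #3 and #4 in Figure 10 (d2)).
        return "squash issued younger instructions"

gate = IssueGate()
print(gate.issue(1, may_discard=True, specific=False))   # issued
print(gate.issue(3, may_discard=False, specific=True))   # blocked
print(gate.on_complete(1))                                # [3] released
```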
- Figure 10 shows an instruction execution example according to an embodiment of the present disclosure.
- As shown, (a) shows the initial state of the MS instruction stream in the instruction queue, which contains five MS instructions to be executed. The #1 MS instruction carries a possible-discard tag. The different widths occupied by the instructions represent different scales: the #3 MS instruction is a large-scale MS instruction, while the rest are small-scale MS instructions. The different backgrounds of the instructions represent their different states, such as waiting, blocked, issued, executing, retired, exception, squashed, etc.; see the legend for the exact representation.
- (b) shows the instruction issue step. Small-scale instructions are issued as soon as possible, while a large-scale instruction is blocked by any previously issued possibly-discarded instruction. Specifically, the #0 instruction is issued first, then the #1 instruction. When the #1 instruction is issued, it is found that it may be discarded, so subsequent large-scale instructions are blocked. In this example, instruction #2 can still be issued normally because it is small-scale; instruction #3 is blocked because it is large-scale, and the instructions after it wait as well.
- (c) shows the instruction execution process. In this example, the #2 instruction may finish first; since the instructions before it have not yet finished, it must wait, to guarantee in-order commit.
- (d1)-(h1) show the processing when no exception occurs. (d1) shows that the #1 instruction has also finished normally and was not discarded; the large-scale instruction #3 that was blocked on account of #1 can now be issued, and the subsequent #4 instruction can also be issued normally. (e1) shows that instructions #0, #1, #2 and #4, being small, have all finished, while instruction #3 is still executing. (f1) shows instructions #0, #1 and #2 committing in order, while instruction #4 must wait for #3 to finish before it can commit. (g1) shows that instruction #3 has also finished. (h1) shows instructions #3 and #4 committing in order.
- On the other hand, (d2)-(g2) show the processing when an exception is thrown. When an exception occurs during the execution of the #1 instruction, as shown in (d2), an exception routine is processed. The exception handling flow usually includes preparing for exception handling, determining the exception source, saving the execution state, handling the exception, and restoring the execution state and returning. For example, the exception handling circuit 516 shown in FIG. 5 can record whether an exception has occurred and adjust the next MS instruction address according to the processing result. Precise exception handling is performed: as shown in (e2) and (f2), instruction #0, which precedes the exception-triggering #1 instruction, continues executing and commits, while instruction #2, issued after #1, is squashed even though it has finished, as shown in (g2). Instructions #3 and #4, which were blocked and never issued, simply keep waiting, avoiding the overhead of squashing them.
- If the exception triggered by the #1 instruction is the UBE event described above, that is, the #1 instruction is a branch instruction, then the unlikely branch target indicated by the branch indication information attached to it can be determined as the next MS instruction after the exception is resolved. That is, after exception handling completes, the pipeline jumps to the MS instruction corresponding to the unlikely branch target.
- If the exception is of another type, for example a zero denominator in a division, the pipeline jumps to an exception handler, which may, for instance, change the denominator to a very small non-zero value; after the exception is handled, the #1 instruction is re-executed and normal pipeline processing continues.
- In embodiments of the present disclosure, storage vesicles are used in place of registers to support mixed-scale data access.
- The storage vesicles can be a set of independent, mixed-capacity single-port scratchpads, whose capacities can range, for example, from 64B to 512KB. In other words, storage vesicles act as mixed-capacity registers for use by MS instructions.
- "storage vesicle complex" refers to a physical "register” file composed of storage vesicles, rather than a fixed-size register.
- In some embodiments, the number of small-capacity (for example, 64B) storage vesicles is greater than the number of large-capacity (for example, 512KB) ones, so as to better match program demands and support tasks of different scales. Each storage vesicle can be a single SRAM or register file with two read ports and one write port. Storage vesicles are designed to better match mixed-scale data access patterns, and they serve as the basic unit of data management in SaaP.
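- As an illustration of the mixed-capacity idea (the capacities follow the example range above; the counts and the allocation policy are assumptions), a vesicle pool might be modeled as follows.

```python
# Toy vesicle pool: more small vesicles than large ones; allocation picks
# the smallest free vesicle that fits the requested data size.
import bisect

class VesiclePool:
    def __init__(self):
        # (capacity_bytes, count): small capacities are more numerous
        self.free = []
        for cap, count in [(64, 32), (4096, 16), (64 * 1024, 8), (512 * 1024, 2)]:
            self.free += [cap] * count
        self.free.sort()

    def alloc(self, size):
        i = bisect.bisect_left(self.free, size)  # smallest vesicle that fits
        if i == len(self.free):
            raise MemoryError("no vesicle large enough; instruction must be split")
        return self.free.pop(i)

pool = VesiclePool()
print(pool.alloc(100))      # 4096: smallest capacity >= 100 bytes
print(pool.alloc(300000))   # 524288
```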
- Figure 11 shows several different data path designs, where (a) shows a data bus, (b) shows a crossbar, and (c) shows the Golgi provided by embodiments of the present disclosure.
- The data bus cannot provide non-blocking access and requires a bus arbiter to resolve access conflicts. The crossbar can provide non-blocking access with lower latency, but it requires O(mn) switches, where m is the number of IP core ports and n is the number of storage vesicle ports.
- In the Golgi design, the connection problem is treated as a Top-K sorting problem, in which storage vesicle ports are sorted by the index of the destination IP core port. In some embodiments, the on-chip interconnect consists of a bitonic sorting network built from multiple comparators and switches. The bitonic sorting network sorts the relevant storage vesicle ports by destination IP core port index, thereby constructing data paths between the m IP core ports and the n storage vesicle ports.
- For example, in the first stage, the even-numbered columns are compared with each other, and the odd-numbered columns are compared with each other. Storage vesicles a and c are compared: the value #3 of a is greater than the value #1 of c, so the two are exchanged. The light hatched lines in the figure indicate switches that are turned on, allowing data to flow laterally. Storage vesicles b and d are compared: the value #4 of b is greater than the value #2 of d, so these two are also exchanged, the switch is turned on, and the data path shifts laterally. The resulting order is c, d, a, b.
- In the next stage, adjacent storage vesicles are compared. Comparing storage vesicles c and d, the value #1 of c is less than the value #2 of d, so they stay in place, the switch stays off, and the data path can only flow vertically. Similarly, after comparing d with a, and then a with b, those switches also stay off.
- After sorting, each IP core lines up exactly with the storage vesicle it wants to access. For example, IP#1 goes straight down the lane below it, shifts laterally at the gray dot, and then continues straight down to storage vesicle c. The data paths of the other IP cores are similar. In this way, a non-blocking data path is constructed between the IP cores and the storage vesicles.
- The Golgi can be implemented using O(n(log k)^2) comparators and switches, which is far fewer than the O(nk) switches required by the crossbar.
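- As a toy illustration only (not from the patent), the following Python sketch models routing-by-sorting, with a simpler odd-even transposition network standing in for the bitonic network described above; each comparator orders two vesicle ports by destination IP core index, and a swap corresponds to a switch being turned on. The a/b/c/d values reuse the example above.

```python
# Toy comparator-network routing: after sorting by destination IP index,
# vesicle port i lines up with IP port i, giving conflict-free paths.
def route(vesicle_ports):
    ports = list(vesicle_ports)    # (destination_ip_index, vesicle_name)
    n = len(ports)
    for stage in range(n):         # alternating even/odd comparator columns
        start = stage % 2
        for i in range(start, n - 1, 2):
            if ports[i][0] > ports[i + 1][0]:
                ports[i], ports[i + 1] = ports[i + 1], ports[i]  # switch "on"
    return ports

# the example above: a->IP#3, b->IP#4, c->IP#1, d->IP#2
print(route([(3, "a"), (4, "b"), (1, "c"), (2, "d")]))
# [(1, 'c'), (2, 'd'), (3, 'a'), (4, 'b')] -- the order c, d, a, b
```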
- Data delivered through the Golgi incurs several cycles of latency (for example, 8 cycles), so the preferred practice is to place a small local cache (1KB is enough) in an IP core that relies on a large number of random accesses.
- To execute an MS instruction, SaaP establishes an exclusive data path between the IP core and its storage vesicles. This exclusive data path follows the PXO architecture and provides non-blocking data access at minimal hardware cost.
- Data can be shared between IP cores by passing storage vesicles between MS instructions. Since the mother core manages system memory, input data is gathered by the mother core in one MS instruction and placed into a storage vesicle for use by another MS instruction. After processing by an IP core, output data is likewise scattered back to system memory by the mother core.
- The complete data path from system memory to an IP core is: (loading MS instruction) (1) memory -> (2) L3/L2 cache -> (3) mother core -> (4) Golgi W0 -> (5) storage vesicle; (consuming MS instruction) (5) the same storage vesicle -> (6) Golgi R0 -> (7) IP core.
- Likewise, system memory is a resource exclusively owned by the mother core, which greatly reduces system complexity in the following respects:
- Page faults can only be raised by the mother core and are handled inside it, so other MS instructions can always execute safely under a no-page-fault guarantee;
- The L2/L3 cache is owned exclusively by the mother core, so cache inconsistency/contention/false sharing never occurs.
- SaaP can accommodate various general-purpose programming languages (C, C++, Python, etc.) as well as domain-specific languages. Since any task executed on SaaP is an MS instruction, the key technique is to extract mixed-scale operations and package them into MS instructions.
- Figure 12 shows an exemplary flowchart of a compilation method according to an embodiment of the present disclosure.
- As shown, in step 1210, mixed-scale (MS) operations are extracted from the program to be compiled; these MS operations may have varying numbers of execution cycles.
- In step 1220, the extracted mixed-scale operations are packaged to form MS instructions.
- Low-level operations can be extracted from basic instruction blocks, while high-level operations can be extracted in several ways, including but not limited to: 1) directly mapping library calls, 2) reconstructing them from low-level program structures, and 3) manually set compiler directives. Existing programs, such as deep learning applications written in Python with PyTorch, can therefore be compiled onto the SaaP architecture in a manner similar to a multiscalar pipeline.
- In some embodiments, the following five LLVM compilation passes can optionally be added to extend a traditional compiler.
- In the call mapping (Call-Map) pass, calls to library functions are extracted from the program to be compiled as MS operations; then, according to a mapping list from library functions to the MS template library, the extracted library calls are converted into the corresponding MS instructions. The MS template library is precompiled from execution-unit-specific code capable of implementing those library functions.
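- For illustration only, the following Python sketch shows the shape of such a call-mapping pass; the mapping table and function names are hypothetical.

```python
# Hedged sketch of the Call-Map pass: recognized library calls are replaced
# by MS instructions from a precompiled template library.
MS_TEMPLATES = {
    "torch.matmul":  "Matmul",  # has GPU and DLA sub-instruction versions
    "torch.relu":    "Relu",
    "numpy.fft.fft": "FFT",     # DSP version if available, CPU fallback
}

def call_map_pass(ir_calls):
    ms_stream, remaining = [], []
    for call in ir_calls:
        if call in MS_TEMPLATES:
            ms_stream.append(MS_TEMPLATES[call])  # extract as an MS operation
        else:
            remaining.append(call)                # left for later passes
    return ms_stream, remaining

print(call_map_pass(["torch.matmul", "my_helper", "torch.relu"]))
# (['Matmul', 'Relu'], ['my_helper'])
```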
- In the reconstruction pass, specified program structures in the program to be compiled are recognized as MS operations through template matching, and each recognized structure is converted into a predetermined MS instruction. The templates can be predefined according to the structural characteristics of high-level functions. For example, a template can define a nested loop structure and fix some of its parameters, such as the nesting depth, the size of each loop, and the operations in the innermost loop. Templates can be defined for typical high-level structures, such as the convolution structure, the Fast Fourier Transform (FFT), and so on.
- For example, a user-implemented FFT (written as nested loops) can be captured via template matching and then replaced with the FFT MS instruction of a vendor-specific library, as used in Call-Map. The reconstructed FFT MS instruction can execute more efficiently on a DSP IP core (if one is available) and, in the worst case where only a CPU is available, can be converted back into a nested loop. This is a best-effort approach, since accurately reconstructing all high-level structures is inherently difficult, but it gives legacy programs that are unaware of DSAs an opportunity to exploit a new DSP IP core.
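- For illustration, a highly simplified Python sketch of such template matching follows; real matching would inspect the loop nest structure in compiler IR, and the criteria here are assumptions.

```python
# Hedged sketch of the reconstruction pass: a nested loop whose shape matches
# a predefined template is replaced by the corresponding MS instruction.
FFT_TEMPLATE = {"depth": 2, "innermost_op": "butterfly"}

def try_reconstruct(loop_nest):
    if (loop_nest["depth"] == FFT_TEMPLATE["depth"]
            and loop_nest["innermost_op"] == FFT_TEMPLATE["innermost_op"]):
        return "FFT"   # vendor-library MS instruction, as in Call-Map
    return None        # worst case: stays a nested loop on the CPU

print(try_reconstruct({"depth": 2, "innermost_op": "butterfly"}))  # FFT
```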
- In the control data flow graph (CDFG) analysis pass, the program is analyzed on its CDFG rather than on its control flow graph (CFG). This is because SaaP removes the register masking and address resolution mechanisms and organizes data into storage vesicles.
- Through the two passes above, the operations to be executed on heterogeneous IP cores have been identified; all remaining code executes on the CPU as multiscalar tasks. At this point, the problem is to find the optimal partitioning of the remaining code into MS instructions. To this end, a global CDFG is constructed and subsequently used to model the costs of different MS instruction partitionings.
- Specifically, on the control data flow graph of the program to be compiled, the operations not yet extracted can be divided into one or more operation sets according to multiple partitioning schemes, and the partitioning scheme with the optimal partitioning cost is then determined. In each partitioning scheme, every operation belongs to one and only one operation set.
- The partitioning can be performed subject to one or more of the following constraints.
- In some embodiments, the number of input data and output data of an operation set must not exceed specified values. For example, matching the limited arity of MS instructions, the arity of the input data must not exceed 2 and the arity of the output data must not exceed 1, and operations can be partitioned under this constraint.
- In other embodiments, the size of any input or output data of an operation set must not exceed a specified threshold. Since the storage element backing an MS instruction is a storage vesicle, and a storage vesicle has limited capacity, the amount of data processed by one MS instruction must be limited so as not to exceed that capacity.
- In still other embodiments, the partitioning rules related to conditional operations can include: a conditional operation and its two branch operations are not placed in the same operation set. Possible reasons for such a split include: the operation set would become too large; the input/output constraints would be violated; or the branch operation was already identified as an MS instruction in a previous pass. In this case, a branch-type MS instruction containing the conditional operation is generated. In general, placing conditional operations in short operation sets yields the branch result sooner during execution. For example, it can be enforced that the same operation set does not contain both a conditional operation and non-conditional operations whose execution time exceeds a threshold.
- The partitioning cost of a scheme can be determined from several factors, including but not limited to: the number of operation sets; the amount of data interaction required between operation sets; the number of operation sets carrying branch functions; and the uniformity of the distribution of the expected execution times of the operation sets. These factors affect the execution efficiency of the instruction pipeline in different ways and therefore serve as metrics for choosing the partitioning scheme. For example, the number of operation sets directly corresponds to the number of MS instructions; the data interaction between operation sets determines the required data IO; the more branch-type instructions there are, the higher the probability of triggering exceptions that waste pipeline resources; and the uniformity of expected execution times affects overall pipeline behavior, avoiding stalls caused by one stage taking excessively long.
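- For illustration only, the following Python sketch scores one partitioning scheme using the factors above; the weights are assumptions, not values from the patent.

```python
# Hedged sketch: score a partitioning scheme from the four listed factors.
from statistics import pstdev

def partition_cost(op_sets):
    n_sets = len(op_sets)                                # number of MS instructions
    io_volume = sum(s["cut_edges"] for s in op_sets)     # cross-set data traffic
    n_branchy = sum(1 for s in op_sets if s["has_branch"])
    times = [s["est_beats"] for s in op_sets]
    imbalance = pstdev(times) if len(times) > 1 else 0.0 # uneven stage times
    return 1.0 * n_sets + 2.0 * io_volume + 4.0 * n_branchy + 0.01 * imbalance

plan = [{"cut_edges": 3, "has_branch": False, "est_beats": 800},
        {"cut_edges": 3, "has_branch": True,  "est_beats": 120}]
print(partition_cost(plan))  # lower is better when comparing schemes
```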
- In some embodiments, the CDFG analysis pass described above is performed after the call mapping and reconstruction passes. It can therefore be applied only to the MS operations not recognized by the first two passes, that is, to the remaining operations.
- MS-Cluster: this is a transformation pass that clusters nodes in the CDFG to build a complete partitioning into MS instructions. In some embodiments, each operation set is converted into one MS instruction according to the partitioning scheme determined in the CDFG analysis pass. Subject to the storage vesicle capacity, the algorithm minimizes the total cost of the edges cut across MS instruction boundaries. MS instructions containing load/store operations and system calls are assigned to the mother core.
- Fractal-Decompose: this is also a transformation pass, used to decompose MS instructions, extracted by the call mapping and reconstruction passes, that violate the storage vesicle capacity limit, so that SaaP functionality is no longer constrained by the vesicle capacity.
- In some embodiments, the decomposition includes: checking whether a converted MS instruction satisfies the storage capacity constraint of MS instructions; and, when it does not, splitting it into multiple MS instructions that implement the same functionality. MS instructions can be decomposed using various instruction decomposition methods, currently known or developed in the future. Since the previously extracted MS instructions are each to be allocated to a particular IP core for execution, the multiple operations constituting such an MS instruction are of the same type, that is, homogeneous, and only need to be adapted to the physical hardware size. Therefore, in some embodiments, this decomposition can simply follow a fractal execution model; see, for example, the paper by Y. Zhao, Z. Du, Q. Guo, S. Liu, L. Li, Z. Xu, T. Chen, and Y.
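- For illustration, the following Python sketch decomposes an oversized MS instruction by tiling; treating the operation as element-wise is a simplifying assumption, and a real pass would follow the cited fractal execution model.

```python
# Hedged sketch of Fractal-Decompose: an MS instruction whose data exceeds
# the vesicle capacity is split into homogeneous smaller MS instructions.
def decompose(opcode, total_bytes, vesicle_cap):
    if total_bytes <= vesicle_cap:
        return [(opcode, total_bytes)]        # already fits, keep as-is
    parts, off = [], 0
    while off < total_bytes:
        chunk = min(vesicle_cap, total_bytes - off)
        parts.append((opcode, chunk))         # same type, smaller footprint
        off += chunk
    return parts

print(decompose("Eltwadd", total_bytes=1_200_000, vesicle_cap=512 * 1024))
# [('Eltwadd', 524288), ('Eltwadd', 524288), ('Eltwadd', 151424)]
```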
- As mentioned above, MS instructions include a sub-instruction field and input/output storage vesicle information fields, and may further include a system memory address information field, a branch information field, an exception flag field, and so on. Some of these fields are mandatory, such as the sub-instruction field and the exception flag field; others are filled on demand, such as the input/output storage vesicle information fields, the system memory address information field, and the branch information field.
- When the sub-instruction field is populated, the MS operation can be identified in the sub-instruction field of the MS instruction, and the field can be associated with one or more execution-unit-specific sub-instructions that implement the MS operation. Depending on the MS operation, a possible-discard tag may be inserted in the exception flag field for use during subsequent execution of the MS instruction, and a branch indicator may be inserted in the branch information field to indicate the likely and/or unlikely branch targets.
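- For illustration only, the fields above might be modeled as follows; the field names and types are assumptions.

```python
# Hedged sketch of the MS instruction fields listed above.
from dataclasses import dataclass
from typing import Optional, Dict

@dataclass
class MSInstr:
    sub_instr: Dict[str, int]          # required: lane type -> sub-instruction address
    exception_flags: int = 0           # required: e.g. possible-discard tag bit
    in_vesicles: tuple = ()            # on demand: at most two inputs
    out_vesicle: Optional[str] = None  # on demand: at most one output
    mem_addr: Optional[int] = None     # on demand: mother-core memory access
    branch_info: Optional[dict] = None # on demand: likely/unlikely targets

ifcond = MSInstr(sub_instr={"CPU": 0x4000}, exception_flags=0b1,  # may discard
                 in_vesicles=("vi1",), out_vesicle="vo",
                 branch_info={"likely": 12, "unlikely": 30})
print(ifcond)
```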
- FIG. 13 shows an example program, where (a) shows the original program to be compiled. The compiled program is divided into two parts: (b) shows the compiled MS instruction stream, and (c) shows the IP-specific MS instruction implementations, that is, the sub-instructions described above.
- The original program computes the Relu layer and the Softmax layer of a neural network in a deep learning application and is written, for example, in Python using PyTorch.
- The Relu and Softmax computations call into the Torch library. According to the call mapping pass described earlier, these Torch library calls can therefore be mapped into MS instructions such as "Matmul" (matrix multiplication), "Eltwadd" (element-wise addition), "Relu", and so on. The increment of the variable Epoch and the conditional branch are packaged and mapped into a conditional branch instruction "Ifcond", with a branch indicator inserted to indicate the likely and unlikely branch targets. The Print statement is mapped to another MS instruction ("Print").
- (c) shows several MS instructions with their IP-specific code. Matmul has two IP-specific implementations, one for the GPU and one for the DLA, so "Matmul" MS instructions can be scheduled by the instruction dispatcher between the GPU lane and the DLA lane. Ifcond has only CPU-specific code, which reads the value of Epoch from the first input storage vesicle (vi1), increments it by 1, and stores it into the output storage vesicle (vo). It then computes the new Epoch value modulo 10 and makes a judgment on the result. If it is determined that the "Then" branch is to be taken (this branch is the unlikely branch), a "UBE" event is raised.
- The Ifcond MS instruction also carries a possible-discard tag, so any subsequent large-scale MS instruction is blocked until the Ifcond instruction has finished executing.
- The Print MS instruction is dispatched only to the mother core, because this instruction requires system calls and I/O to external devices.
- As noted earlier, the program code to be compiled can be written in various general-purpose programming languages or domain-specific languages.
- With the above compilation scheme, new IP cores of various kinds can be added to the SaaP SoC easily, without substantial programming/compilation work, so the scalability of the SoC is well supported.
- Moreover, the same MS instruction can carry multiple versions of sub-instructions, which gives the scheduler more choices during instruction execution and helps improve pipeline efficiency.
- SaaP offers an excellent design alternative to the traditional conception of a heterogeneous SoC.
- MS instructions can be executed speculatively and squashed on error without overhead, because an erroneous instruction leaves no observable side effects in the executing IP core.
- Caches need no coherence since there are no duplicate cache lines, and the snoop filter/MESI protocol is dispensed with since there is no bus to snoop.
- FIG. 14 shows a schematic structural diagram of a board card 1400 according to an embodiment of the present disclosure. As shown, the board card 1400 includes a chip 1401, which may be a SaaP SoC according to an embodiment of the present disclosure, integrating one or more combined processing devices.
- The combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence, a notable characteristic of which is the large volume of input data and the correspondingly high demands on the storage and computing capabilities of the platform. The board card 1400 of this embodiment is suited to cloud intelligence applications, featuring large off-chip storage, large on-chip storage, and strong computing power.
- The chip 1401 is connected to an external device 1403 through an external interface device 1402. The external device 1403 is, for example, a server, computer, camera, monitor, mouse, keyboard, network card, or Wi-Fi interface. Data to be processed can be transferred from the external device 1403 to the chip 1401 through the external interface device 1402, and the computation results of the chip 1401 can be transmitted back to the external device 1403 via the external interface device 1402. Depending on the application scenario, the external interface device 1402 may take different forms, such as a PCIe (Peripheral Component Interconnect Express) interface.
- The board card 1400 also includes a storage device 1404 for storing data, which includes one or more storage units 1405. The storage device 1404 connects to and exchanges data with the control device 1406 and the chip 1401 through a bus. The control device 1406 on the board card 1400 is configured to manage the state of the chip 1401. To this end, in some embodiments, the control device 1406 may include a microcontroller unit (MCU).
- Embodiments of the present disclosure also provide a corresponding compilation apparatus, which includes a processor configured to execute compilation program code, and a memory configured to store that code; when the compilation program code is loaded and executed by the processor, the compilation apparatus carries out the compilation method described in any of the preceding embodiments.
- Embodiments of the present disclosure also provide a machine-readable storage medium containing compilation program code which, when executed, causes a machine to perform the compilation method described in any of the preceding embodiments.
- The electronic equipment or devices of the present disclosure may include servers, cloud servers, server computing clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transportation, household appliances, and/or medical equipment.
- The means of transportation include airplanes, ships, and/or vehicles;
- The household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
- The medical equipment includes nuclear magnetic resonance machines, B-mode ultrasound scanners, and/or electrocardiographs.
- The electronic equipment or devices of the present disclosure can also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and other fields. Furthermore, they can be used in cloud, edge, terminal, and other application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic equipment or devices with high computing power according to the present disclosure can be applied to cloud devices (such as cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
- In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that suitable hardware resources of the cloud device can be matched according to the hardware information of the terminal device and/or edge device.
- Finally, it should be noted that although this disclosure presents some methods and embodiments as series of actions and combinations thereof, those skilled in the art will understand that the solutions of this disclosure are not limited by the order of the described actions. Accordingly, based on the disclosure or teachings herein, those skilled in the art will appreciate that certain steps may be performed in other orders or in parallel. Furthermore, the embodiments described in the present disclosure can be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for implementing one or another solution of this disclosure. In addition, depending on the solution, the descriptions of some embodiments place emphasis on different aspects. In view of this, for parts not described in detail in one embodiment of the present disclosure, reference may be made to the relevant descriptions of other embodiments.
- units illustrated as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units.
- the aforementioned components or units may be co-located or distributed over multiple network units.
- some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
- multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
- the above-mentioned integrated unit can also be implemented in the form of hardware, that is, a specific hardware circuit, which can include digital circuits and/or analog circuits, etc.
- The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
- Various devices described herein, such as the computing devices or other processing devices, can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs (Application-Specific Integrated Circuits), and the like.
- The aforementioned storage unit or storage device can be any appropriate storage medium (including magnetic storage media, magneto-optical storage media, etc.), and can be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, and so on.
Abstract
本披露公开了一种指令处理装置、指令执行方法、片上系统和板卡。本披露的方案通过提出一种统一的混合规模指令,可以隐藏执行单元的异构性,从而提高编程效率和硬件的利用率。
Description
相关申请的交叉引用
本申请要求于2022年6月29日申请的,申请号为202210764246.0,名称为“指令处理装置、指令执行方法、片上系统和板卡”的中国专利申请的优先权。
本披露一般地涉及指令集领域。更具体地,本披露涉及一种指令处理装置、指令执行方法、片上系统和板卡。
由于摩尔定律和登纳德缩放定律(Dennard Scaling)的结束,通用CPU(Central Processing Unit,中央处理器)的性能增益持续下跌。领域特定的架构(Domain-Specific Architecture,DSA)成为继续提高整个计算系统的性能和效率的最具前景和可行的方式。DSA迎来了大爆发,其被视为打开了计算机架构的新黄金时代。提出了各种各样的DSA以加速特定应用,例如各种xPU,包括用于数据流处理的DPU(Data Processing Unit,数据处理单元)、用于图像处理的GPU(Graphics Processing Unit,图形处理器)、用于神经网络的NPU(Neural network Processing Unit,神经网络处理器)、用于张量处理的TPU(Tensor Processing Unit,张量处理器)等。随着越来越多的DSA,尤其是用于计算目的的DSA(也称为IP,Intellectual Property,知识产权),被集成到片上系统(System on Chip,SoC)中以获得高效率,当前计算系统中硬件的异构性也在持续增长,从标准化变为定制化。
当前,IP通常仅暴露IP相关的硬件接口,这迫使SoC利用在主机CPU上运行的代码、将IP当做独立的设备进行管理。由于极难直接为应用开发者管理硬件异构性,通常花费大力气构建编程框架以帮助应用开发者管理这种硬件异构性。例如,用于深度学习的流行编程框架包括PyTorch、TensorFlow、MXNet等,它们都为应用开发者提供了高级、易用的Python接口。
然而不幸的是,由于较低的生产率和硬件利用率,这种在以CPU为中心的SoC中软件管理的异构性妨碍了用户应用在不同的SoC上高效地运行。低生产率源于编程框架和应用二者。对于编程框架开发者,为了支持不同的SoC,编程框架必须利用不同的IP实现它们各自的高级抽象接口,这需要大量的开发工作。对于应用开发者,SoC中不同IP的差异要求同一应用具有不同的实现,导致沉重的编程负担。并且,这对于编程框架不支持的IP而言,情况可能变得甚至更差,因为需要手动管理硬件异构性。低硬件利用率与CPU为中心的SoC和具有某些通用性的IP有关。在当前的SoC中,主机CPU必须将IP视为独立的设备,并利用在主机CPU上运行的代码(也即,以CPU为中心)来管理不同IP之间的协同,导致控制和数据交换两方面上均存在不可忽略的开销。此外,随着集成很多具有某些通用性的IP,领域特定的编程框架可能无法利用其他领域的可用IP来执行同一功能。例如,使用DLA(Deep Learning Accelerator,深度学习加速器)需要在Nivdia Tegra Xavier中显式编程。
然而,目前很少有研究调查关于增长的硬件异构性导致的编程生产率问题,大部分研究仍然专注于提高单个IP的性能和能源效率。有些工作通过在某些场景中针对基于流的应用按链来调度IP或者在硬件中添加捷径(Shortcut)来开发SoC性能。还有的提议了一种分形方法来解决编程生产率问题,不过是在不同规模的机器学习加速器上。结果,日益增长的硬件异构性彻底改变了构建未来SoC系统的范式,并提出了如何构建具有高生产率和高硬件利用率的SoC系统的关键问题。
发明内容
为了至少部分地解决背景技术中提到的一个或多个技术问题,本披露从多个方面提供了解决方案。一方面,提供了一种新的统一的片上系统架构框架(其可以称为片上系统即处理器,Soc-as-a-Processor,简称SaaP),其消除了软件角度的硬件异构性,以提高编程生产率和硬件利用率。另一方面,提供了一种架构自由的混合规模指令集,以支持高生产率和SaaP的新部件,包括用于片上管理的存储小泡(Vesicle),以及用于数据通路的片上互连,从而构建高效的SaaP架构。再一方面,提供了一种编译方法,用于将各种高级编程语言的程序代码编译成混合规模指令。本披露其他方面还针对指令中的分支预测、异常和中断等方面提供了解决方案。
在第一方面中,本披露公开一种指令处理装置,包括:指令译码器,用于对混合规模(Mixed-Scale,MS)指令进行译码,所述MS指令包括子指令域,所述子指令域指示能够执行所述MS指令的一个或多个执行单元特定的子指令信息;以及指令分派器,用于根据所述子指令域,将所述MS指令分派给对应的执行单元。
在第二方面中,本披露公开一种指令执行方法,包括:对混合规模(MS)指令进行译码,所述MS指令包括子指令域,所述子指令域指示能够执行所述MS指令的一个或多个执行单元特定的子指令信息;以及根据所述子指令域,将所述MS指令分派给对应的执行单元。
在第三方面中,本披露公开一种片上系统(SoC),包括第一方面的指令处理装置,以及多个异构IP核,所述多个异构IP核作为所述执行单元。
在第四方面中,本披露公开一种板卡,包括前述第三方面的片上系统。
根据如上提供的指令处理装置、指令执行方法、片上系统和板卡,通过提供一种新的MS指令集,在软硬件接口上做一个统一的抽象,从而可以隐藏不同的硬件或者说不同的指令之间的异构性,使得在硬件层面上看到的是统一的MS指令格式。这些MS指令可以分发到不同的执行单元上去实际执行。
通过参考附图阅读下文的详细描述,本披露示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本披露的若干实施方式,并且相同或对应的标号表示相同或对应的部分,其中:
图1示例性示出了SoC的典型架构;
图2示出了SoC上的硬件异构性;
图3示出了传统SoC的典型时间线;
图4a用简化图示意性示出了根据本披露实施例的SaaP架构;
图4b示出了传统的SoC架构作为对比;
图5示出了根据本披露实施例的SaaP的整体架构;
图6示例性示出在MISC架构上执行任务的一个示例过程;
图7示出了根据本披露实施例的指令执行方法的示例性流程图;
图8示出了根据本披露实施例的针对分支指令的指令执行方法的示例性流程图;
图9示出了根据本披露实施例的一种指令执行方法的示例性流程图;
图10示出了根据本披露实施例的一个指令执行示例;
图11示出了几种不同的数据通路设计;
图12示出根据本披露实施例的编译方法的示例性流程图;
图13示出了一个示例程序;以及
图14示出本披露实施例的一种板卡的结构示意图。
下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中可能出现的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。
片上系统(SoC)是一种集成电路芯片,其在同一芯片上集成了系统所有关键部件。SoC是当今移动/边缘场景中最常见的集成方案。相比于基于主板的解决方案,它的高度集成改善了系统性能,减少了总体功耗并提供了小得多的面积成本。
图1示例性示出了SoC的典型架构。
由于有限面积/功率预算下的性能要求,SoC通常集成很多专用的硬件IP,通常是出于计算目的的领域特定的架构,尤其是为了加速特定领域的应用或特定应用。这些硬件IP有些是SoC设计方定制的,例如神经网络处理IP(苹果A15中的神经引擎(Neuron Engine,NE)、英伟达Jetson Xavier中的深度学习加速器(DLA)、海思麒麟中的神经处理单元(NPU)和三星Exynos),有些是IP供应商标准化的,例如Arm或Imagination的CPU和GPU,Synopsys或Cadence的DSP(Digital Signal Processor,数字信号处理器),Intel或Xilinx的FPGA(Field-Programmable Gate Array,现场可编程门阵列),等等。
在图1的示例中,示出了CPU 101、GPU 102、NPU 103、片上RAM(Random Access Memory,随机存取存储器)104、DRAM(Dynamic Random Access Memory,动态随机存取存储器)控制器105、仲裁器(Arbiter)106、译码器107、外部存储接口108、总线桥109、UART(Universal Asynchronous Receiver/Transmitter,统一异步收发器)110、GPIO(General Purpose Input Output,通用输入输出)111、ROM(Read Only Memory,只读存储器)接口112等。
传统的SoC设计利用共享数据总线或片上网络(Network on Chip,NoC)将各个部件链接在一起。用于SoC片上互联的一种常见总线是ARM的开放标准的高级微控制器总线架构(Advanced Microcontroller Bus Architecture,AMBA)。
在图1的示例中,SoC使用共享总线连接和管理SoC中的各个功能块,这些共享总线包括用于高速连接的高级高性能总线(Advanced High Performance Bus,AHB),以及用于低带宽低速连接的高级外围总线(Advanced Peripheral Bus,APB)。也可以引入其他网络类拓扑,也即NoC,利用基于路由器的分组交互网络来管理更多部件。
集成多个不同的IP导致了SoC上的硬件异构性。硬件异构性包括SoC内IP的异构性和SoC间IP的异构性。
图2示出了SoC上的硬件异构性。图中示出了几个SoC上集成的IP。例如,某型号
A在SoC上集成了CPU和GPU;某型号B在SoC上集成了CPU、GPU和用于神经网络处理的神经引擎(NE);某型号C在SoC中集成了CPU、GPU和用于神经网络处理的神经处理单元(NPU);某型号D在SoC集成了CPU、GPU、用于深度学习的深度学习加速器(DLA)和可编程视觉加速器(Programmable Vision Accelerator,PVA)。
从图中可以看出,同一SoC上,IP各有不同,例如用于不同的目的。关于SoC内IP的异构性,这是由于越来越多的不同类型(尤其是用于计算目的)的IP被集成到SoC中以获得高效率。新的IP会持续引入SoC。例如,一种新型的、神经网络处理类的IP被广泛地引入到近来的移动SoC中。而且,SoC中的处理单元的数量也在持续增长。例如,某型号A的SoC主要包括10个处理单元(2个大核,2个小核,以及一个6核GPU);而在某型号B中,处理单元的数量增加到30个(2个大通用核,4个小通用核,一个16核神经引擎,以及一个5核GPU)。
关于SoC间IP的异构性,不同SoC上实现相同功能的IP可能差异很大,因为出于商业原因总是优选自己的IP。例如,如图2中的(b)、(c)和(d)所示,相同的功能(例如神经网络处理)指向不同的IP。在某型号B的SoC中是神经引擎(NE);在某型号D中是深度学习加速器(DLA);在某型号C中是神经处理单元(NPU)。此外,很多计算目的的IP针对某个领域(例如深度学习)或某些操作类型具有一定的通用性(例如,具有张量操作的GPU)。
对诸如计算目的的GPU和NPU之类的IP进行编程可以基于来自编程框架和供应商的支持来实现。例如,为了加速神经网络处理,应用开发者可以直接使用深度学习编程框架,例如PyTorch、TensorFlow、MXNet等,以代替直接地手动管理。这些编程框架提供高级编程接口(C++/Python)以定制化IP,这些使用IP供应商的低级接口来实现。IP供应商通过提供不同的编程接口,例如PTX(Parallel Thread Execution,并行线程执行)、CUDA(Compute Unified Device Architecture,计算统一设备架构)、cuDNN(CUDA Deep Neural Network library,CUDA深度神经网络库)和NCCL(NVIDIA Collective Communications Library,英伟达集体通信库)等,来使得他们的硬件驱动器适合于这些编程框架。
然而,编程框架要求极其巨大的开发工作,因为要求它们能够弥补软件多样性和硬件多样性之间的差距。编程框架为应用开发者提供高级接口以提高编程生产率,而这些接口是精心实现的,以提高硬件性能和效率。例如,Tensorflow初始由大约100个开发者开发,目前已有3000+贡献者进行维护以支持数十种SoC平台。针对上千的Tensorflow算子,在某个IP上优化一个算子可能要耗费一个熟练开发者几个月的时间。即使利用编程框架,对于不同的SoC,可能也要求应用开发者具有不同的实现。例如,针对某型号D编写的程序不能直接运行在GPU的TensorCore的服务器端DGX-1上。
编程框架很难实现高效率,根源在于SoC是通过主机CPU进行管理。由于运行在主机CPU上的编程框架控制整个执行过程,控制和数据的交互不可避免。其针对控制使用仅CPU-IP交互的方式进行处理,针对数据交换使用仅存储器-IP交互的方式进行处理。
图3示出了传统SoC的典型时间线。如图所示,主机CPU运行编程框架以进行运行时管理,其中对IP的每次调用都将由主机CPU来启动/结束,这带来不可忽略的运行时开销。数据保存在片外主存储器中,IP从主存储器读取/写入数据,这带来额外的数据访存。例如,当在某型号D上运行神经网络YOLO时,控制将从GPU返回给编程框架39次,占据DRAM空间量56.75M,其中95.06%是不必要的。根据Amdahl定律,系统的效率是有限的,特别是对于由碎片化操作组成的程序。
发明构思
考虑到向管理软件暴露硬件异构性会导致低生产率和低硬件利用率,本披露提出了一
种让SoC硬件自己管理异构性的解决方案。发明人注意到,经典的CPU中将异构的算术逻辑单元(Arithmetic Logic Unit,ALU)和浮点运算单元(Float Point Unit,FPU)视为流水线中的执行单元并通过硬件进行管理。受此启发,直观上,IP也可以视为IP级别的流水线中的执行单元,也即一种统一的片上系统即处理器(SoC-as-a-Processor,SaaP)。
图4a用简化图示意性示出了根据本披露实施例的SaaP架构。作为对比,图4b示出了传统的SoC架构,其中单线表示控制流,宽线表示数据流。
如图4a所示,本披露实施例的SaaP将Soc重构为处理器,包括系统控制器410(相当于处理器中的控制器,也即流水线管理器),用于管理硬件流水线,包括从系统内存(例如,图中的DRAM 440)取回指令、对指令进行译码、分派指令、撤销指令、提交指令等。多个异构IP核,包括CPU核,均作为硬件流水线420中的执行单元集成于SoC中(相当于处理器中的运算单元),用于执行系统控制器410分派的指令。因此,SaaP可以利用硬件流水线而不是编程框架来管理异构IP核。
与多标量范式类似,程序被划分为任务,任务可以小到单个标量指令,或者大到整个程序。一个任务可以在各种类型的IP核上实现,在执行时会分派到特定的IP核上。这些任务在SaaP中称为指令。由于任务有不同的大小,因此本披露实施例提出了一种混合规模(Mixed-Scale,MS)指令以配合具有IP级别流水线的SaaP来工作。MS指令是一种统一的指令,能够适用各种异构IP核。因此,硬件异构性在MS指令下是透明的。MS指令由系统控制器410进行取指、译码、分派、撤销、提交等操作。MS指令的采用可以充分利用混合级并行性。
进一步地,还可以为SaaP提供片上存储器430,例如片上SRAM(Static Random Access Memory,静态随机存取存储器)或寄存器,用于缓存执行单元(IP核)执行时相关的数据,例如输入数据和输出数据。由此,系统内存上的数据搬运到片上存储器中之后,IP核可以与片上存储器交互,以进行数据访存。片上存储器430类似于处理器中的寄存器,由此,片上IP协同也可以以类似于多标量流水线中的寄存器转发的方式隐式实现。
在SaaP的硬件流水线中,可以通过利用MS指令来充分利用混合级别的并行,利用片上存储器来实现IP核之间的数据交换,从而获得高硬件性能。而且,SaaP允许任意类型的IP核集成为执行单元,而来自应用开发者的高级代码只需稍微调整就可以编译给新IP核,由此使得能够提高编程生产率。
与之相比,图4b所示的传统SoC以CPU为中心,在主机CPU上运行编程框架。各种IP核作为孤立的设备附着在系统总线上,通过在主机CPU上运行的软件来管理。从图中可以看出,在传统SoC中,对于控制流,仅存在CPU-IP交互;对于数据流,仅存在系统内存(DRAM)-IP交互。
在SaaP中,以IP级别的流水线来构建SoC,IP核作为执行单元进行管理。这样,控制流自然可以由流水线管理器来管理,在运行时则不需要编程框架。而且,利用类似于流水线转发的机制,数据交换可以直接在不同的IP核之间进行。
将CPU标量流水线扩展到IP级别流水线不可避免地面临很多挑战。一个挑战在于一致性。由于诸如DL加速器一类的异构IP核按各种大小的块来访问数据(例如,张量和矢量),而不是按标量数据,随着数据块在流水线中并发地流动,检查数据依赖性和维护数据一致性变得极其复杂。因此,寄存器文件、缓存层级和数据通路都需要从根本上重新设计。另一个挑战在于可伸缩性。根据Amdahl定律,IP协同的开销(通常在μs级别)无意中限制了传统SoC的可伸缩性。这种开销还会阻止子μs级内核利用IP,因为这种开销可能会超过执行时间。而且,对于可伸缩性而言,SaaP不应青睐时间/面积昂贵的设计,诸如链式撤销(Chained Squashing)和交叉矩阵互连(Crossbar)。
尽管存在来自多个方面的挑战,但是发明人研究发现,问题的根源仅仅在于传统设计理念中共享数据的所有权不明确。在传统的SoC中,数据可以在任何时候被不同的IP核
访问和修改,并且可以存在多个数据副本。因此,为了正确地执行程序,需要引入具有大量开销的复杂机制,诸如总线侦听、原子操作、事务内存和地址解析缓冲区,以保持数据一致性和IP协同的一致性。
为了避免由于共享数据的所有权不明确而导致的缺陷,SaaP SoC在设计中遵循纯粹独占所有权(Pure eXclusive Ownership,PXO)架构的原则。其原理是系统中与数据相关的资源,包括片上缓冲区、数据通路、数据缓存、内存和I/O(Input/Output,输入/输出)设备,在某一时间被一个IP核独占。下面将结合附图详细描述本披露实施例提供的SaaP架构及其相关设计。
SaaP整体架构
图5更详细地示出了根据本披露实施例的SaaP的整体架构。类似于Tomasulo流水线,SaaP可以包含一个乱序五级流水线。
如图所示,在SaaP中,作为流水线管理器的系统控制器可以包括多个功能部件,以在流水线管理过程中实现不同的功能。例如,指令译码器511可以对本披露实施例提出的MS指令进行译码。指令分派器512可以对MS指令进行分派。指令退出电路513,用于完成指令提交,并按顺序退出已完成的MS指令。MS指令缓存器514,用于缓存MS指令。重命名电路515,用于对指令所涉及的存储元件进行重命名,以例如解决可能的数据冒险。系统控制器可以利用重命名机制来实现以下任一或多项处理:解决存储元件上的数据冒险;MS指令撤销,MS指令提交,等等。异常处理电路516,用于响应于IP核抛出的异常,进行相应的处理。各个部件的功能将在后文相关部分的描述中展开。
集成的异构IP核(图中示例性示出了CPU核、GPU核,DLA核等各种IP核)充当执行单元的角色,用于执行实际运算。这些异构IP核及相关部件(例如保留站521、IP指令缓存器522等)可以统称为IP核复合体520。
SaaP中还提供了片上存储器。在一些实现中,片上存储器可以实现为一堆暂存器(也称为一组存储小泡),用于缓冲输入和输出数据。存储小泡充当处理器中的寄存器的角色。存储小泡可以包括多个存储容量大小不等的暂存器,用于缓存多个异构IP核执行时相关的数据。例如,存储小泡的容量大小范围可以从64B、128B、256B、…256KB,直到512KB不等。优选地,小容量的存储小泡的数量多于大容量的存储小泡的数量,从而便于更好地支持不同规模的任务需求。这一组存储小泡可以统称为存储小泡复合体530。
在存储小泡复合体530与IP核复合体520之间,提供了片上互连540,用于在多个异构IP核与一组存储小泡之间提供无阻塞的数据通路连接。片上互连充当共享数据总线的角色。在一些实施例中,片上互连540可以基于排序网络实现,从而可以在只需要少量的硬件成本和可接受的延迟的情况下,提供无阻塞的数据通路。在本文中,片上互连540也可以称为高尔基(Golgi)。
如前面所提到,SaaP SoC在设计中遵循纯粹独占所有权(PXO)架构的原则。为此,在一些实施例中,上述多个异构IP核中可以指定一个IP核为母核,负责管理整个系统。例如,母核独占式地管理系统内存与存储小泡之间的数据交换。母核也独占式地管理与外部设备的I/O操作。母核还可以主控操作系统(Operating System,OS)和运行时,至少负责以下任一或多项处理:进程管理、页面管理、异常处理和中断处理等。例如,在分支和预测执行中,通过异常处理来实现分支和预测执行,其中将不可能的分支当做不可能分支异常(Unlikely Branch Exception,UBE)进行处理。可以采用静态预测来实现分支和预测执行。考虑到母核的作用和功能,通常将具有通用处理功能的CPU核确定为母核。在一些实施例中,优选增强母核的I/O能力,例如引入DMA(Direct Memory Access,直接存储器访问)模块以减轻连续的数据复制压力。
在一些实施例中,非母核的IP核可以根据其功能和/或类型划分成不同的IP车道。母
核本身属于一个单独的IP车道。图5中示出了母核车道、CPU车道、CPU车道、DLA车道、等等。继而,在调度MS指令时,可以至少部分基于MS指令的任务类型,将MS指令分派到适合的IP车道。
大体上,SaaP利用MS指令执行整个程序。初始时,当系统控制器取回一条MS指令时,会对其进行译码以预备数据以供执行。数据从系统内存加载到存储小泡或从其他存储小泡快速转发。如果不存在冲突,则该MS指令会被发送到MS指令分派器,并且之后发射到合适的IP核(例如,DLA核)以实际执行。此IP核将基于所发射的MS指令,加载预编译的实际的IP特定的代码(例如,DLA指令)。之后,IP核将执行该实际的代码,这非常类似于在常规加速器上的执行。在执行完成后,MS指令将从流水线中退出并提交其结果。
上面对本申请实施例的SaaP SoC的整体架构和任务执行过程进行了概括描述。下面将详细描述各个部分的具体实现。可以理解,虽然在SaaP SoC的环境中描述各个部分的实现,但是这些部分也可以脱离SaaP SoC,应用到其他类似环境中使用,例如非异构的系统中,本披露实施例在此方面没有限制。
MS(混合规模)指令
硬件上的异构性在软硬件接口上体现为指令格式的不同,并且每条指令的执行周期数相差也很大。表1示出了不同指令集之间的比较。对于标量系统,通常包括CISC(Complex Instruction Set Computer,复杂指令系统计算机)和RISC(Reduced Instruction Set Computer,精简指令集计算机)两种指令集。如表所示,CISC每条指令的长度不确定,有些指令功能复杂,拍数较多,有些指令功能简单,拍数较少。根据单条指令的执行复杂程度,指令周期数在2~15拍。RISC的指令长度是一定的,单条指令的指令周期数比较统一,大约在1~1.5拍。
由于SaaP SoC的异构性,其上的各种IP核(包括CPU、各种xPU)所需要的指令集是不一样的,诸如在规模或粒度等方面。为了隐藏这种异构性(也即指令格式、执行周期数等),在本披露一些实施例中,提供了一种混合规模(Mixed-Scale,MS)指令集(Mixed-scale Instruction Set Computer,MISC),其形式类似于RISC,可适合于各种IP核。这些IP核(主要为计算目的的各种加速器)大部分需要高效地处理一些大粒度的复杂工作,因此单MS指令的执行周期数(Cycle Per Instruction,CPI)比RISC要长,在10~10000+拍,属于一个比较大的范围。表1中也示出了本披露实施例提供的MISC。
表1不同指令集的比较
SaaP的每个实例即是混合规模指令集计算机(MISC)。MISC指令集由MS指令组成。不同于RISC和CISC,MISC具有其独特的设计风格。
首先,MS指令具有混合的负载大小,可以是比较小的负载,例如只需要执行10拍,也可以是比较大的负载,例如需要执行10000多拍。因此,每条MS指令携带的负载可能
要求不同尺寸大小的容器,以方便从容器中取数据,以及将计算出的结果数据存出到容器中。在本披露实施例中,利用前文提到的一组尺寸多样(例如从64B到512KB)的存储小泡来存储MS指令所需的输入和/或输出数据,从而支持MS指令的这种混合负载大小。
其次,MS指令是IP无关的,也即MS指令对IP是不感知的。具体地,针对每个IP核特定的指令(也即异构的指令)被封装在MS指令中,封装后的MS指令格式与具体封装了哪个IP核并不相关。
在一些实施例中,MS指令可以包括一个子指令域,该子指令域指示能够执行MS指令的一个或多个IP核特定的子指令信息。可以理解,该MS指令将来需要运行在某个IP核上,也即意味着有一段该IP核能识别的代码(也即IP核特定的代码),这些代码也是由一条或多条该IP核特定的指令构成,这些指令被封装在MS指令中,因此称为子指令。由此,系统控制器可以根据子指令域,将MS指令分派给对应的IP核。子指令信息可以包含子指令的类型(也即IP核的类型或IP车道的类型)和/或子指令的地址。可以有多种实现方式来表示子指令信息。
在一种实现中,可以将一个或多个IP核特定的子指令的地址放入子指令域中。这种方式可以直接确定MS指令中的子指令类型和地址。不过在这种实现中,由于同一MS指令可能能够在多个异构IP核上运行,因此MS指令的长度会随着能够运行该MS指令的IP核类型数量而变动。
在另一种实现中,可以用一个比特序列来表示该MS指令是否存在相应类型的子指令,同时可以使用一个首地址来表示第一段子指令地址。比特序列的长度可以是IP核类型或IP车道类型的数量,从而比特序列中每一比特可以用于指示是否存在对应类型的子指令。第一段子指令地址直接根据首地址获得,后续IP车道对应的子指令地址可以通过固定的方式索引(例如间隔一段地址距离),或者通过直接跳转MS指令实现。本披露实施例对于MS指令的具体格式实现没有限制。
MS指令被定义成执行复杂功能。因此,每条MS指令执行一个复杂功能,例如卷积,该指令将被分解成细粒度IP特定的代码(也即子指令)以供实际执行,例如RISC指令。IP特定的代码可以是根据标准库编译的代码(例如,来自Libstdc++用于内积的std::inner_product),也可以是根据供应商特定的库生成的代码(例如,来自cuBLAS同样用于内积操作的CublasSdot)。这使得SaaP有可能集成不同类型的IP,因为同一MS指令可以灵活地发射给不同类型的IP核。因此,异构性对于应用开发者而言是隐藏的,这也增加了SaaP的鲁棒性。
从上文可以看出,无论子指令是用于哪个IP核,例如CPU、GPU、DLA、NPU等,都不会改变MS指令的格式,因此从这个角度,MS指令是IP无关的。
再次,MS指令具有有限的元数。针对数据管理,每个MS指令将访问最多三个存储小泡:两个源存储小泡和一个目的地存储小泡。也即,针对数据管理,每条MS指令至多具有两个输入数据域和一个输出数据域,用于指示执行该MS指令相关的数据信息。在一些实现中,这些数据域可以通过关联的存储小泡的编号来表示,例如分别指示两个输入存储小泡和一个输出存储小泡。有限的元数减少了冲突解决、重命名、数据通路设计和编译器工具链的复杂性。例如,若MS指令的元数不受限,则不同的MS指令的译码时间差异就会非常大,从而导致硬件流水线不规整,出现一些低效的问题。针对高元数的函数或功能(例如超过3个元数),可以通过柯里化(Currying)来实现。柯里化是一种将多变量函数转换为单变量函数序列的技术,例如通过嵌套、链式等方式。由此,可以支持将具有任意数量的输入和输出的功能/函数转换为满足MS指令的有限元数的功能/函数。
最后,MS指令无副作用。此处“无副作用”是指当前指令的执行状态不会影响后续指令的执行。换言之,如果当前指令要被撤销的话,能够实现将其撤销掉,而不会让它的残留状态影响到后续指令的指令。除了修改输出存储小泡中的数据之外,MS指令的执行不
会在SaaP架构上留下任何可观测到的副作用。唯一的例外是在母核上执行的MS指令,因为母核可以对系统内存和外部设备进行操作。这一约束对于实现混合级并行(Mixed Level Parallelism,MLP)非常重要,因为这使得当例如根据预测执行的要求,需要撤销MS指令时,能够简单地回滚影响。换言之,在非母核的IP核上执行的MS指令的数据域只能指向存储小泡,而不能指向系统内存。并且,对应输出数据的存储小泡被独占式地指派给执行该MS指令的IP核。
由此可见,通过提供一种新的MS指令集,在软硬件接口上做一个统一的抽象,从而可以隐藏不同的硬件或者说不同的指令之间的异构性,使得在硬件层面上看到的是统一的MS指令格式。这些MS指令可以分发到不同的IP核上去实际执行。
图6示例性示出在MISC架构上执行任务的一个示例过程,以更好地理解MS指令的实现方案。图示的MISC架构例如具有一个母核和一个IP核。待执行的任务是制作三明治(材料:面包和肉)和蔬菜沙拉(材料:蔬菜)。其中,为方便绘图,图6中,面包命名为A,肉命名为B,蔬菜命名为C,三明治命名为D,沙拉命名为E。母核管理系统内存,因此首先由母核将待加工的材料从系统内存加载到存储小泡上,接着IP核可以对存储小泡上的材料进行处理。上述任务可以表示为下列MS指令流:
1)“Load Bread”v1,void,void
2)“Load Meat”v2,void,void
3)“Make Sandwich”v1,v1,v2
4)“Store Sandwich”void,v1,void
5)“Load Green”v1,void,void
6)“Make Salad”v1,v1,void
7)“Store Salad”void,v1,void
可以理解,MS指令在母核和IP核上执行时都应当提供各个核特定的代码形式,也即核特定的子指令,从而各个核能够知道如何执行相应的任务。为了简单起见,这些子指令在上面的MS指令流中仅简单示出其处理任务或功能,未区分不同形式。MS指令中使用的存储小泡(v1,v2)是逻辑编号。在实际执行中,存储小泡被重命名为不同的物理编号,以解决WAW(Write After Write,写后写)依赖性以及支持乱序预测执行。指令中的Void表示对应域不需要存储小泡,例如涉及系统内存时。
在图6中,①为初始状态;②为母核执行“Load Bread”这条MS指令。Load指令涉及对系统内存的访问,因此被分派给母核来执行。母核将系统内存的数据取出存入v1这个存储小泡中。具体涉及的系统内存的访存地址信息可以额外放在一个指令域中,本披露实施例在此方面没有限制。③为母核执行“Load Meat”指令,与“Load Bread”指令类似,母核将系统内存的数据取出存入v2这个存储小泡中。
接着,④为执行“Make Sandwich”指令,这条MS指令被分派给IP核进行处理,因为需要花费较多的处理时间。按照原始指令,IP核需要从v1取出面包,从v2取出肉,制作后放入v1中。此处由于要写入的v1和要读出的v1是同一个,存在一个读后写(Write After Read,WAR)相关,也即必须等v1中的数据完全读出了才能写入。但这种方式不太现实,因为MS指令可能非常庞大,例如需要几万拍,而中间制作出来的三明治需要有地方存储。为了解决这一数据冒险,可以采用存储小泡重命名机制。例如,MS指令在分派前,通过图5所示的存储小泡重命名电路515将MS指令中存储小泡的逻辑名称重命名映射为物理名称,以消除数据冒险。同时,存储小泡重命名电路515保存该物理名称与逻辑名称之间的映射关系。在图6的示例中,“Make Sandwich”指令的输出数据对应的存储小泡v1被重命名为存储小泡v3,因此制作好的三明治被放入v3中。图6的v3中的省略号表示此写入过程需要持续一段时间,不会很快写完。
在⑤中,由于制作三明治需要花费较多时间,因此还不能执行紧随其后的“Store
Sandwich”指令,但是其后的“Load Green”指令与前面的指令没有依赖关系,因此可以并行执行。类似地,“Load Green”指令所涉及的存储小泡v1也涉及一个读后写相关,因此可以采用存储小泡重命名机制,将相应的存储小泡v1重命名映射为存储小泡v4。同样,由母核执行“Load Green”指令,将系统内存中的数据取出写入v4这个存储小泡中。
在⑥中,由于IP核已被占用制作三明治,因此为了提高效率,根据调度策略,“Make Salad”指令可以分派给当前空闲的母核来执行。各个核的状态例如可以通过一个比特序列来标记,以方便指令分派器分派指令。同样,此处也应用了重命名机制。母核从存储小泡v4中取出蔬菜,制作成沙拉后放入存储小泡v5中。
在⑦中,当三明治制作好之后,此时可以执行之前被阻塞的“Store Sandwich”指令。Store指令涉及对系统内存的访问,因此被分派给母核来执行。母核将存储小泡v3的数据取出存入系统内存中。
在⑧中,当沙拉制作好之后,可以执行“Store Salad”指令。母核将存储小泡v5的数据取出存入系统内存中。
需要注意,在⑦和⑧中,即使沙拉比三明治先制作好,“Store Salad”指令也需要放在“Store Sandwich”指令之后执行,从而确保顺序提交,由此在发生指令撤销时不会产生任何副作用。
从上述示例过程可以看出,当数据准备好之后,IP核就可以开始执行处理。“Make Sandwich”需要花费较多的时间,因此,“Make Salad”在母核上执行并提前完成,从而可以充分挖掘混合级别的并行性(MLP)。由此不同IP核之间的执行是互不干扰的,也即可以乱序执行,但是提交时是顺序提交的。
系统控制器
对于MS指令间或者指令本身的处理,统一由系统控制器(也可以称为指令处理装置)来管理。下面详细描述系统控制器中各个部件的功能。SaaP SoC采用乱序流水线来挖掘IP核之间的混合级别并行性。流水线可以包含5级:取值&译码、冲突解决、分派、执行和退出。
图7示出了根据本披露实施例的指令执行方法的示例性流程图。以下的描述可以同时参考图5所示的SaaP架构进行理解。此外,为了便于描述和理解,图7示出了包含完整流水线的指令执行过程,但是本领域技术人员可以理解,有些步骤可能只在特定情况下发生,因此不是在所有情况下都是必需的,根据具体情况可以辨别其必要性。
首先,在步骤710中进行取指&译码。在这一级,基于MS程序计数器(Program Counter,PC)从MS指令缓存器514中取回MS指令,指令译码器511对取回的MS指令进行译码以预备操作数。译码后的MS指令可以放在指令译码器511的指令队列中。
如前面所描述的,MS指令包括一个子指令域,其指示能够执行该MS指令的一个或多个IP核特定的子指令信息。子指令信息例如可以指示子指令的类型(也即IP核的类型或IP车道的类型)和/或子指令的地址。
在一些实施例中,当取回MS指令并译码时,此时可以根据译码结果,将对应的子指令预先取回并存储在指定位置,例如子指令缓存器522(图5中也称为IP指令缓存器)中。由此,当该MS指令被发射到对应的IP核上执行时,该IP核可以从子指令缓存器522中取出相应的子指令以便执行。
在一些情况下,MS指令可能是分支指令。在本披露实施例中,采用静态预测来确定分支指令的方向,也即确定下一MS的PC值。发明人分析了基准测试程序中的分支行为,发现大规模指令分支中80.6%~99.8%的分支可以在编译时正确预测。由于大规模指令决定了整体执行时间,因此在本披露实施例中,采用静态预测来执行分支预测,从而可以省去硬件分支预测器。由此,无论何时遇到分支,总是假设下一MS指令是静态预测的可能分
支方向。
当分支误预测时,将触发不可能分支异常(Unlikely Branch Exception,UBE)。当发生异常时,错误的MS指令需要被撤销,下一MS指令计数设置成UBE的不可能分支,或者其他情况下发生异常陷阱(Exception Trap)。后文将详细描述分支和预测执行的处理方案。
接着,流水线前进到步骤720,在此对可能存在的冲突进行解决。在这一级,取回的MS指令进行排队以解决冲突。可能的冲突包括(1)数据冒险;(2)结构冲突(例如,在退出单元中没有可用空间);以及(3)异常违例(例如,阻塞不能被轻易撤销的MS指令,直到它被确认采取)。
在一些实施例中,可以通过存储小泡重命名机制来解决写后读(Read After Write,RAW)和写后写(WAW)这一类的数据冒险。存储小泡重命名电路515用于在MS指令中涉及的存储小泡上存在数据冒险时,在分派该MS指令之前,将存储小泡的逻辑名称重命名映射为物理名称,以及保存存储小泡的物理名称与逻辑名称之间的映射关系。通过存储小泡重命名机制,SaaP可以支持更快的MS指令撤销(通过简单地丢弃对输出数据存储小泡的重命名映射来实现)和无WAW冒险的乱序执行。
在解决了可能存在的冲突之后,流水线前进到步骤730,在此由指令分派器512对MS指令进行分派。
如前面所描述的,MS指令具有子指令域,其指示了能够执行该MS指令的IP核。因此,指令分派器512可以根据该子指令域的信息,将MS指令分派给对应的IP核,具体地,先分派到该IP核所属的保留站中以供后续发射到合适的IP核。
在一些实施例中,IP核可以按其功能和/或类型划分为不同的IP车道,每条车道对应于一个特定的IP核模型。相应地,保留站也可以根据车道进行分组,例如每个车道可以对应一个保留站。例如图5中示出了母核车道、CPU车道、CPU车道、DLA车道、等等。不同的车道可以适于执行不同类型的任务。因此,在调度分派MS指令时,可以至少部分基于MS指令的任务类型,将MS指令分派到适合车道所对应的保留站,以供后续发射到适合的IP核。
在一些实施例中,除了考虑任务类型之外,还可以根据各个IP车道中的处理状态,在能够执行该MS指令的多个IP车道之间调度,从而提高处理效率。由于同一MS指令可能具有在多个IP核上有执行的多个不同实现,因此根据适当的调度策略,通过选择分派的车道可以缓解瓶颈车道的处理压力。例如,涉及卷积运算的MS指令可以被分派到GPU车道或DLA车道。二者都可以有效地执行,此时可以根据两个车道的压力来选择其一,从而加速处理进度。调度策略可以包括各种规则,例如选择吞吐量最大的IP核,或者选择子指令条数最短的IP核,等等,本披露实施例在此方面没有限制。
在一些实施例中,一些指定类型的MS指令必须分派给指定的IP核。例如,前文提及多个异构IP核中可以指定一个IP核为母核,负责管理整个系统。因此,一些涉及系统管理类型的MS指令必须分派给母核来执行。
具体地,母核独占式地管理系统内存与存储小泡之间的数据交换。因此,对系统内存进行访存的访存类型的MS指令被分派给母核。母核也独占式地管理与外部设备的I/O操作。因此,诸如显示输出一类的I/O类型的MS指令也被分派给母核。母核还可以主控操作系统(OS)和运行时,至少负责以下任一或多项处理:进程管理、页面管理、异常处理和中断处理等。因此,中断电路517处理中断的MS指令被分派给母核。此外,当某些MS指令无法由其他IP核处理时,例如由于其他IP核处于繁忙状态,则可以分派给母核来处理。另外,根据MS指令调度策略,某些MS指令可能会被分配到母核上进行处理。此处不再一一列举。
接着,流水线前进到步骤740,在此阶段,MS指令可以由IP核乱序执行。
具体地,被分派到指令的IP核可以利用实际的IP特定的代码来执行MS指令的功能。例如,IP核根据分派的指令,从子指令缓存器/IP指令缓存器522中取回对应的子指令并执行。在此阶段可以实施Tomasulo算法来组织这些IP核,从而支持混合级的并行性(MLP)。一旦解决了存储小泡上的相关性,MS指令可以持续分派到IP核复合体中,从而可以乱序执行这些指令。
注意,在本披露实施例提供的SaaP SoC中,由于禁止对IP核的侵入性修改,因此IP核并不知道SaaP架构。为了适应SaaP,利用适配器对IP核进行封装,适配器将对程序的访问引导到IP指令缓存器522,并将对数据的访问引导到存储小泡。程序可以是加速器的接口信号,例如,用于DLA的CSB(Configuration Space Bus,配置空间总线)控制信号,或者是一段实现MS指令的IP特定代码(例如对于诸如CPU/GPU一类的可编程处理器而言)。运算类的MS指令是针对存储在一组存储小泡中的数据执行运算。这些存储小泡可以是多个存储容量大小不等的暂存器。每个IP核具有两个数据读出端口和一个数据写入端口。在执行期间,物理存储小泡排他性地连接到端口,因此从IP核的视角,存储小泡就像传统架构中的主内存一样工作。
最后,流水线前进到步骤750,也即退出阶段。在这一级,MS指令从流水线退出并提交结果。图5中的指令退出电路513,用于按顺序退出已完成的MS指令,并在MS指令退出时,通过确认该MS指令对其输出数据对应的存储小泡的重命名映射来提交执行结果。也即,通过在重命名电路515中永久地承认输出数据的存储小泡的重命名映射,来完成提交。由于仅是承认重命名映射,因此实际上没有任何数据被缓冲或复制,这样避免了数据量大时(这在各种以运算为目的的IP核中很常见)复制数据带来的额外开销。
应当理解,尽管在SaaP SoC的环境中描述了MS指令的执行过程,但是MS指令系统也可以应用于其他环境中,不限于具有异构IP核的环境,例如在同构环境中也可以使用,只要MS指令的执行单元能够独立解析并执行子指令即可。因此,上文的描述中,可以将IP核直接替换为执行单元,母核替换为主执行单元,上述方法依然适用。
分支和预测执行
MS指令流中也可能出现分支指令,分支指令引起控制相关。控制相关实际上是与MS指令的程序计数器PC的相关,PC值在取指时就要使用。如果分支指令处理得不好,下一条指令的取指就受影响,从而引起流水线的阻塞,影响流水线效率。因此,需要为MS指令提供有效的分支预测支持,也即,既对大规模指令有效,也对小规模指令有效。
在CPU的传统处理方式中,会在译码的时候计算分支条件,然后确定正确的分支目标,从而在下一拍取指时从分支跳转位置的地址取回下一条指令。这种分支条件计算以及将下一PC值设置为正确分支目标的值,通常只需要占据几拍的开销,这部分开销非常小,在常规CPU指令流水线中完全可以被流水线抵消。然而在MS指令流中,如果某条分支MS指令预测错误,则意味着在这个MS指令流的整个执行过程中的某个时间点,才发现该分支MS指令是预测错的,此时这个时间点的位置有可能与分支MS指令开始执行的时间相距几百拍或者几千拍或者更长。因此,在MS指令流水线中,不可能等到真正知道何时该跳转时才确定出下一MS指令的PC值。这样的话,预测的开销就会非常大。
发明人分析了5个基准测试程序中的分支行为,发现大规模指令分支中80.6%~99.8%的分支可以在编译时正确预测,也即可以通过静态方式进行预测。由于大规模指令占据总执行时间的大部分,决定了整体执行时间,因此在本披露实施例中,采用静态预测来执行分支预测,从而可以省去硬件分支预测器。
图8示出了根据本披露实施例的针对分支指令的指令执行方法的示例性流程图。该方法由系统控制器执行。
如图所示,在步骤810中,对MS指令进行译码。MS指令具有变化的每指令执行周
期(Cycle Per Instruction,CPI)。如前面所提到,MS指令的CPI范围可能在10拍~1万拍以上。MS指令这种变化的CPI特性也使得难以使用动态预测。
接着,在步骤820中,响应于MS指令为分支指令,根据分支指示信息获取下一MS指令,分支指示信息指示可能分支目标和/或不可能分支目标。
采用静态预测机制可以利用编译器的提示进行静态预测。具体地,在指令编译时,可以基于静态分支预测方式确定分支指示信息,并插入到MS指令流中。
取决于不同的静态分支预测方式,分支指示信息可以包含不同内容。例如,静态预测总是取可能分支目标作为下一MS指令地址。在一些情况下,为了保证指令缓存器的时间局部性,可能分支目标通常可以紧邻当前MS指令。因此,在这些情况下,分支指示信息可以只需要指示不可能分支目标。在另一些情况下,分支指示信息也可以同时指示可能分支目标和不可能分支目标。因此,在根据分支指示信息获取下一MS指令时可以将分支指示信息所指示的可能分支目标确定为下一MS指令。
既然是预测,就可能存在错误。发生分支预测错误时,需要取消该分支指令后面的所有指令,流水级越长,分支预测错误需要取消的指令数越多,指令流水线的效率损失也越大。由于MS指令采用静态预测方式,在分支条件确定之前,按照固有的方式取下一MS指令,这些指令可以乱序执行,但是要遵从前面描述的顺序提交。因此,当发现分支指令的预测方向错误时,需要恢复到正确的下一MS指令。此时,需要通过异常机制来实现,以修正错误的预测。
Optionally or additionally, in step 830, when a misprediction occurs, the system controller receives an Unlikely Branch Exception (UBE) event. The UBE event is triggered by the execution unit (for example, some IP core) executing the condition-computing instruction associated with the branch instruction. It signals that, according to the condition computation, the branch direction should be the unlikely branch target, that is, the earlier branch prediction was wrong.
Then, in step 840, in response to the UBE event, the system controller must perform a series of operations to resolve the misprediction. These operations include: revoking the MS instructions after the branch instruction; committing the MS instructions before the branch instruction; and taking the unlikely branch target indicated by the branch indication information as the next MS instruction. This treatment corresponds to a precise exception: when the exception occurs, all instructions before the interrupted instruction have completed, while all instructions after it appear never to have executed. Since a UBE event is an exception caused by a branch misprediction, the interrupted instruction here is the branch MS instruction itself.
Different operations can be used to revoke an MS instruction depending on its state. An MS instruction to be revoked is typically in one of three states: currently executing in an execution unit; already finished; or not yet executed. Each state may have affected different hardware or software, and those effects must be undone. If the instruction is currently executing, the execution unit running it must be terminated. If the instruction wrote to a scratchpad (for example, a vesicle) during or after execution, the scratchpads written by the revoked instruction must be discarded. If the instruction has not yet executed, it merely needs to be cancelled from the instruction queue. Of course, since the instruction queue records all instructions that have not yet retired/committed, instructions that are executing or have finished must also be cancelled from the instruction queue.
Accordingly, in some embodiments, revoking the MS instructions after the branch instruction includes: cancelling the revoked MS instructions from the instruction queue; terminating the execution units executing them; and discarding the scratchpads they have written.
As is clear from the retirement process described earlier, an MS instruction commits its result at retirement by confirming the rename mapping of the vesicle corresponding to its output data. Discarding the scratchpads written by revoked MS instructions therefore only requires deleting the corresponding mappings from the record that stores the rename mappings between the physical and logical names of those scratchpads. As mentioned above, this vesicle renaming mechanism supports faster MS instruction revocation: it suffices to simply discard the rename mapping of the output-data vesicle.
Thus, in the MS instruction pipeline, handling branch MS instructions by static prediction saves hardware resources while accommodating the widely varying CPI of MS instructions and improving pipeline efficiency. Further, handling branch mispredictions through the exception mechanism saves additional hardware resources and simplifies processing.
Exception and interrupt handling
As the preceding branch prediction discussion shows, revoking a large-scale MS instruction can be very costly. The embodiments of this disclosure therefore propose an instruction execution scheme that blocks MS instructions whose revocation would be expensive until all potentially discarded instructions before them have executed, that is, until their state is determined. This scheme substantially improves the efficiency of the MS instruction pipeline in exception and interrupt handling.
Figure 9 shows an exemplary flowchart of an instruction execution method according to an embodiment of this disclosure. The method is performed by the system controller.
As shown, in step 910, when an MS instruction is issued, it is checked whether the instruction may be discarded.
In some embodiments, this check consists of checking whether the MS instruction carries a may-discard tag. The may-discard tag can be inserted at compile time by the compiler according to the type of the MS instruction; for example, the compiler can insert it when it finds that the MS instruction is a conditional branch or may raise other exceptions.
Next, in step 920, when it is determined that the MS instruction may be discarded, the issue of specific MS instructions after it is blocked.
The specific MS instructions can be the large-scale MS instructions, or more generally those whose revocation cost would be high. Concretely, a specific MS instruction can be identified by one or more of the following conditions: the scratchpad (vesicle) corresponding to the instruction's output data exceeds a set size threshold; the instruction writes to system memory; its execution time exceeds a predetermined value; or it is executed by a particular execution unit. When the vesicle holding the output data exceeds the size threshold, the instruction's output volume is large and its revocation correspondingly expensive. Blocking MS instructions that write system memory mainly serves to guarantee storage consistency.
While these specific MS instructions are blocked, the MS instructions before them continue to issue and execute normally. Depending on what happens to those normally issued and executing instructions, two cases are handled separately.
On the one hand, in step 930, when all the potentially discarded MS instructions that caused the blocking have completed normally, the blocked specific MS instruction can, in response to this event, be issued for execution by an execution unit. At this point it is certain that the specific MS instruction will not be revoked on account of the preceding instructions, so normal pipeline issue and execution can continue.
On the other hand, in step 940, when any of the potentially discarded MS instructions that caused the blocking raises an exception during execution, exception handling is performed in response to the exception event. Again, this handling corresponds to a precise exception: the MS instruction that raised the exception and the MS instructions executed after it must be revoked; the MS instructions before it are committed; and the MS instruction of the corresponding exception handler becomes the next MS instruction.
As in the branch prediction discussion above, revoking the exception-raising MS instruction and the MS instructions after it includes: cancelling the revoked MS instructions from the instruction queue; terminating the execution units executing them; and discarding the scratchpads they have written. Likewise, discarding those scratchpads consists of deleting the corresponding mappings from the record of rename mappings between their physical and logical names.
When the type of the exception event is the Unlikely Branch Exception (UBE) triggered by a branch-type MS instruction, as described in the branch prediction handling above, then in addition to the handling just described, the unlikely branch target indicated by the instruction's branch indication information must be taken as the next MS instruction after the exception is cleared. Once handling completes, the pipeline can jump to the correct branch direction and continue execution.
Figure 10 shows an instruction execution example according to an embodiment of this disclosure.
As shown, (a) depicts the initial state of the MS instruction stream in the instruction queue, with five MS instructions pending. Instruction #1 carries a may-discard tag, and the different widths of the instructions represent different scales: #3 is a large-scale MS instruction and the rest are small-scale. The different backgrounds of the instructions represent their states, such as waiting, blocked, issued, executing, retired, exception, and revoked; see the legend for the specific notation.
(b) shows the issue step: small-scale instructions are issued as soon as possible, whereas a large-scale instruction is blocked by any previously issued, potentially discarded instruction. Specifically, instruction #0 is issued first, then #1. When #1 is issued, it is found that it may be discarded, so the large-scale instructions after it are blocked. In this example, #2, being small-scale, can still issue normally; #3, being large-scale, is blocked, and the instructions after it wait.
(c) shows the execution process. In this example #2 may finish first; since the instructions before it have not yet completed, it must wait, so that commits stay in order.
(d1)-(h1) show the flow when the preceding execution raises no exception; (d2)-(g2) show the flow when an exception is thrown.
Specifically, (d1) shows #1 also completing normally without being discarded. The large-scale instruction #3 that was blocked because of #1 can now issue, and #4 after it also issues normally. (e1) shows #0, #1, #2, and #4 all finished thanks to their small scale, with #3 still executing. (f1) shows #0, #1, and #2 committing in order, while #4 must wait for #3 to finish before committing. (g1) shows #3 finishing as well. (h1) shows #3 and #4 committing in order.
On the other hand, when #1 raises an exception during execution, as shown in (d2), an exception routine is processed. The exception handling flow typically includes preparing for handling, identifying the exception source, saving the execution state, handling the exception, and restoring the execution state and returning. For example, the exception handling circuit 516 shown in Figure 5 can record whether an exception occurred and adjust the next MS instruction address according to the handling outcome.
During handling, precise exception handling is enforced. As shown in (e2) and (f2), instruction #0, which precedes the exception-raising #1, continues to execute and commits. Instruction #2, issued after the exception-raising #1, must be revoked even though it has already finished, as shown in (g2). Meanwhile #3 and #4, never issued because they were blocked, simply remain waiting, which avoids the cost of revoking them.
If the exception triggered by #1 is the UBE event described above, that is, #1 is a branch instruction, then the unlikely branch target indicated by its branch indication information can be taken as the next MS instruction after the exception is cleared. In other words, once handling completes, the pipeline jumps to the MS instruction at the unlikely branch target.
If the exception is of another type, for example a division with a zero denominator, execution jumps to an exception handler, which might change the denominator to a small non-zero value; after the exception is handled, #1 is re-executed and normal pipeline processing continues.
In contrast to exception events, interrupt events come from outside the SoC and are therefore unpredictable. However, SaaP does not need to stop precisely at the point where the interrupt signal is raised: when an interrupt occurs, SaaP blocks all MS instructions waiting to issue and waits for all issued MS instructions to complete and retire.
In SaaP, most system management exceptions, such as bad allocation, page fault, and segmentation fault, can only be raised from the mother core, and are therefore also caught and handled within it. The other components of the SaaP architecture and the other IP cores are neither affected by nor even aware of these exceptions.
Storage vesicles
In SaaP, for mixed-scale data accesses, storage vesicles (Vesicles) are used as a replacement for registers. Vesicles can be independent, mixed-size, single-port scratchpads whose capacities range, for example, from 64 B to 512 KB. In SaaP, vesicles act like registers of mixed capacities for use by MS instructions. In this text, the "vesicle complex" refers to the physical "register" file composed of vesicles rather than of fixed-size registers. Preferably, small-capacity vesicles (e.g., 64 B) outnumber large-capacity ones (e.g., 512 KB), which better matches program demands and supports tasks of different scales. Physically, each vesicle can be a single SRAM or register with two read ports and one write port. The vesicles are designed to better match mixed-scale data access patterns and can serve as the basic unit of data management in SaaP.
Two IP cores cannot access the same vesicle at the same time. Data dependences can therefore still be managed as simply as in a sequential scalar processor, and on-chip IP cooperation can be managed by the hardware MS instruction pipeline.
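As an illustration of how a runtime or compiler might draw on such a mixed-size pool, the sketch below picks the smallest free size class that fits a request; the concrete sizes and counts are assumptions, chosen only to reflect that small vesicles outnumber large ones.

```python
# Hypothetical vesicle pool: many small scratchpads, few large ones.
VESICLE_POOL = {64: 128, 4096: 32, 512 * 1024: 4}  # size in bytes -> count

def grab_vesicle(requested_bytes, pool=VESICLE_POOL):
    # Choose the smallest free size class that can hold the request.
    for size in sorted(pool):
        if size >= requested_bytes and pool[size] > 0:
            pool[size] -= 1
            return size
    raise MemoryError("no vesicle large enough; instruction must be split")

print(grab_vesicle(100))  # -> 4096 (smallest class that fits)
```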
Data path
To allow any IP core to access any vesicle, a complete connection between the IP core complex and the vesicle complex is required. The usual solutions are a data bus (as in CPUs) or a crossbar (as in multi-core systems), but neither connection meets the efficiency requirement: a data bus leads to contention, while a crossbar consumes a great deal of area even with only a few dozen cores. To achieve non-blocking data transfer at an acceptable cost, the embodiments of this disclosure construct an on-chip interconnect data path based on a sorting network, called the Golgi.
Figure 11 shows several data path designs, where (a) shows a data bus, (b) shows a crossbar, and (c) shows the Golgi provided by an embodiment of this disclosure.
As (a) shows, a data bus cannot provide non-blocking access and needs a bus arbiter to resolve access conflicts. As (b) shows, a crossbar can provide non-blocking access with low latency, but it requires O(mn) switches, where m is the number of IP core ports and n the number of vesicle ports.
In the Golgi shown in (c), the connection problem is treated as a Top-K sorting network in which the vesicle ports are sorted by destination IP port number. The on-chip interconnect comprises a bitonic sorting network built from comparators and switches. When m IP core ports need to access n vesicle ports, the bitonic sorting network sorts the relevant vesicle ports by the indices of their destination IP core ports, thereby constructing the data paths between the m IP core ports and the n vesicle ports.
In the example of (c), when vesicles {a, c, d} are to be mapped to IP cores {#3, #1, #2} respectively, the Golgi treats the mapping as a sort over all vesicles {a, b, c, d} with the respective values {#3, #+∞, #1, #2}, where unused ports are assigned the destination number +∞.
Specifically, as shown in (c), starting from vesicles {a, b, c, d}, the even columns are first compared with each other, as are the odd columns. For example, vesicles a and c are compared: since a's value #3 is greater than c's value #1, the two are swapped. Light hatching in the figure marks a closed switch through which data can flow horizontally. Vesicles b and d are compared: b's value #+∞ exceeds d's value #2, so they are also swapped, their switch closes, and the data path runs horizontally. The order is now c, d, a, b. Next, adjacent vesicles are compared. Vesicles c and d are compared: c's value #1 is less than d's value #2, so nothing changes, the switch stays open, and data can only flow vertically. Similarly, comparing d with a, and a with b, closes no switches.
In the end, each IP core lines up exactly with the vesicle it needs to access. For IP #1, for example, the path runs vertically downward beneath it, moves horizontally at the grey dot, and then continues vertically down to vesicle c. The data paths of the other IP cores are similar. Thus, based on a sorting network, non-blocking data paths are built between the IP cores and the vesicles.
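The following behavioral sketch reproduces this routing as a bitonic sort over destination indices; it is a software model of the network, not RTL, and on the example above it yields exactly the order c, d, a, b.

```python
import math

INF = math.inf  # destination number for unused ports

def bitonic_route(keys, ports):
    # Standard iterative bitonic sorting network over n = 2^p elements.
    # Each compare-exchange models one comparator/switch pair in the Golgi:
    # a swap corresponds to a closed (horizontal) switch.
    n = len(keys)
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (keys[i] > keys[partner]) == ascending:
                        keys[i], keys[partner] = keys[partner], keys[i]
                        ports[i], ports[partner] = ports[partner], ports[i]
            j //= 2
        k *= 2
    return ports

# Vesicles a..d with destinations {a -> #3, b -> unused, c -> #1, d -> #2}:
print(bitonic_route([3, INF, 1, 2], ["a", "b", "c", "d"]))  # -> ['c', 'd', 'a', 'b']
```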
Using a bitonic sorting network, the Golgi can be implemented with O(n(log k)²) comparators and switches, far fewer than the O(nk) switches a crossbar needs. Data delivered through the Golgi incurs several cycles of latency (for example, 8 cycles), so the preferred practice is to keep only a small local buffer in an IP core (1 KB is enough) for code that depends on massive random accesses.
In summary, to execute one MS instruction, SaaP establishes an exclusive data path between an IP core and its vesicles. This exclusive data path in SaaP follows the PXO architecture and provides non-blocking data access at minimal hardware cost.
By passing vesicles between MS instructions, data can be shared between IP cores. Since the mother core manages system memory, input data is gathered together by the mother core in one MS instruction and placed, correctly laid out, into a vesicle for use by another MS instruction. After processing by an IP core, the output data is likewise scattered back to system memory by the mother core. Concretely, the complete data path from system memory to an IP core is: [(load MS instruction) ① memory ② L3/L2 cache ③ mother core ④ Golgi W0 ⑤ vesicle; (consuming MS instruction) ⑤ the same vesicle ⑥ Golgi R0/1 ⑦ IP core.]
Logically, system memory is a resource owned exclusively by the mother core, which greatly reduces system complexity in the following respects:
1) page faults can only be raised by the mother core and are handled inside it, so the other MS instructions can always execute safely under the guarantee of no page faults;
2) the L2/L3 caches are owned exclusively by the mother core, so cache incoherence/contention/false sharing can never occur;
3) interrupts are always handled by the mother core, so the other IP cores are (literally) never interrupted.
Programming
SaaP can accommodate the common general-purpose programming languages (C, C++, Python, etc.) as well as domain-specific languages. Since any task executed on SaaP is an MS instruction, the key technique is to extract mixed-scale operations to form MS instructions.
Figure 12 shows an exemplary flowchart of a compilation method according to an embodiment of this disclosure.
As shown, in step 1210, mixed-scale (MS) operations, which can have variable execution cycle counts, are extracted from the program to be compiled. Then, in step 1220, the extracted mixed-scale operations are encapsulated to form MS instructions.
Low-level operations can be extracted from basic instruction blocks, while high-level operations can be extracted in various ways, including but not limited to: 1) direct mapping from library calls, 2) reconstruction from low-level program structures, and 3) manually set compiler directives. Existing programs, such as deep learning applications written in Python with PyTorch, can thus be compiled onto the SaaP architecture in a manner similar to the multiscalar pipeline.
In some embodiments, the following five LLVM compiler passes can optionally be added to extend a conventional compiler.
a) Call-Map pass: a simple worklist-driven pass that converts known library calls into MS instructions. The specific implementations of the MS instructions are precompiled from vendor-specific code and referenced as a library during this pass.
Specifically, in one implementation, calls to library functions are extracted from the program to be compiled as MS operations; then, according to a mapping list from library functions to the MS template library, the extracted library calls are converted into the corresponding MS instructions. The MS template library is precompiled from execution-unit-specific code capable of executing those library functions.
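A toy version of such a worklist-driven pass is sketched below, assuming a much-simplified IR of (opcode, callee) pairs; the mapping table stands in for the vendor-precompiled MS template library, and all names are illustrative.

```python
# Assumed mapping from known library calls to MS instruction templates.
CALL_TO_MS = {
    "torch.matmul": "Matmul",
    "torch.relu":   "Relu",
    "torch.add":    "Eltwadd",
}

def call_map(ir_nodes):
    # Worklist-driven rewrite: replace recognized calls, keep the rest.
    worklist = list(ir_nodes)
    out = []
    while worklist:
        node = worklist.pop(0)
        if node[0] == "call" and node[1] in CALL_TO_MS:
            out.append(("ms_insn", CALL_TO_MS[node[1]]))
        else:
            out.append(node)  # left for later passes (Reconstruct, CDFG)
    return out

print(call_map([("call", "torch.matmul"), ("add", None)]))
```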
b) Reconstruct pass: another worklist-driven pass that tries to recover high-level structures from low-level code, so that high-level MS instructions can be discovered.
Specifically, in one implementation, designated program structures in the program to be compiled are recognized as MS operations by template matching, and the recognized structures are converted into predetermined MS instructions. The templates can be predefined according to the characteristics of high-level functional structures. For example, a template can define a nested loop structure and set its parameters, such as the nesting depth, the size of each loop level, and the operations in the innermost loop. Templates can be defined for typical high-level structures such as convolution structures or the Fast Fourier Transform (FFT); the embodiments of this disclosure place no restriction on the specific contents or manner of their definition.
For example, a user-implemented FFT (as nested loops) can be captured by template matching and then replaced with the FFT MS instruction from the vendor-specific library used in Call-Map. The recovered FFT MS instruction can execute more efficiently on a DSP IP core (if present), and in the worst case where only a CPU is available, it can be converted back into nested loops. This is best-effort, since precisely reconstructing all high-level structures is inherently difficult, but it gives legacy programs that know nothing of DSAs the opportunity to exploit new DSP IP cores.
c) CDFG (Control Data Flow Graph) analysis pass: unlike the multiscalar approach, the program is analyzed on the CDFG rather than on the CFG (Control Flow Graph), because SaaP removes the register mask and address resolution mechanisms and organizes data into vesicles. After the two preceding passes, the operations to be executed on heterogeneous IP cores have been identified; all remaining code is to run on the CPU as multiscalar tasks. The problem then is to find the optimal partition of the remaining code into MS instructions. A global CDFG is built for subsequent use in modeling the cost of different MS instruction partitions.
Specifically, in one implementation, the not-yet-extracted operations of the program can be partitioned, on its control data flow graph, into one or more operation sets according to several partitioning schemes, and the scheme with the best partitioning cost is then determined. Within each scheme, every operation belongs to one and only one operation set.
Partitioning can be done in many ways. Basically, a partitioning scheme can observe one or more of the following constraints.
For example, the arities of the input and output data of an operation set must not exceed specified values. As stipulated for MS instructions, the input arity is at most 2 and the output arity at most 1, so operations can be partitioned under this constraint.
As another example, the size of any input or output datum of an operation set must not exceed a specified threshold. Since the storage elements behind MS instructions are vesicles, which have limited capacity, the amount of data an MS instruction handles must be kept within that capacity limit.
As yet another example, the partitioning schemes involving conditional operations can be:
1. Preferably place a conditional operation and its two branch operations in the same operation set; the MS instruction for that set is then an ordinary compute-type instruction.
2. The conditional operation and its two branch operations are not in the same operation set. Possible reasons for this scheme include: keeping them together would make the set too large; it would violate the input/output constraints; or the branch operations were already recognized as MS instructions in an earlier step. In this case, a branch-type MS instruction containing the conditional operation is produced. Generally, placing conditional operations in short operation sets yields the branch outcome sooner at execution time; for example, one can prevent a single operation set from containing both a conditional operation and a non-conditional operation whose execution time exceeds a threshold.
The partitioning cost of a scheme can be determined from several factors, including but not limited to: the number of operation sets; the volume of data interaction required between operation sets; the number of operation sets carrying branch functions; and the uniformity of the distribution of the sets' estimated execution times. These factors affect pipeline efficiency from several angles and thus serve as criteria for choosing a partitioning scheme: the number of operation sets directly corresponds to the number of MS instructions; the inter-set data interaction determines the required data I/O; the more branch-type instructions there are, the higher the probability of triggering exceptions and the greater the drain on the pipeline; and the uniformity of estimated execution times affects the overall flow of the pipeline, avoiding stalls caused by one overly long stage. A sketch of such a cost model follows.
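Below is an illustrative cost function combining these four factors; the weights and the exact terms (for example, using the standard deviation of estimated runtimes as the uniformity measure) are assumptions, not the disclosed model.

```python
import statistics

def partition_cost(op_sets, cut_edges, w=(1.0, 1.0, 1.0, 1.0)):
    n_sets = len(op_sets)                              # -> number of MS instructions
    io_volume = sum(e["bytes"] for e in cut_edges)     # data crossing set boundaries
    n_branches = sum(1 for s in op_sets if s["has_branch"])
    runtimes = [s["est_cycles"] for s in op_sets]
    imbalance = statistics.pstdev(runtimes) if len(runtimes) > 1 else 0.0
    return (w[0] * n_sets + w[1] * io_volume
            + w[2] * n_branches + w[3] * imbalance)

print(partition_cost(
    [{"has_branch": True, "est_cycles": 100},
     {"has_branch": False, "est_cycles": 5000}],
    [{"bytes": 4096}]))
```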
In some embodiments, this CDFG analysis pass is executed after the Call-Map and Reconstruct passes. It can therefore operate only on the MS operations not recognized by those two passes, that is, on the remaining operations.
d) MS-Cluster pass: a transformation pass that clusters CDFG nodes to build a complete partition into MS instructions.
Specifically, in one implementation, each operation set of the partitioning scheme determined during CDFG analysis is converted into one MS instruction. Constrained by vesicle capacity, the algorithm minimizes the total cost of the cut edges crossing MS instruction boundaries. In particular, MS instructions containing load/store operations, as well as system calls, are assigned to the mother core.
e) Fractal-Decompose pass: another transformation pass that decomposes MS instructions extracted by the Call-Map and Reconstruct passes that violate the vesicle capacity limit, so that vesicle capacity no longer limits SaaP's functionality.
Specifically, in one implementation, the decomposition checks whether each converted MS instruction satisfies the MS instruction storage capacity constraint; when an MS instruction does not, it is split into multiple MS instructions that together realize the same function.
Any existing or future instruction decomposition technique can be used to decompose MS instructions. Since an already-extracted MS instruction is to be executed on a single IP core, the operations composing it are of the same type, that is, homogeneous; they merely need to be fitted to the physical hardware size. In some embodiments, this decomposition can therefore simply follow the fractal execution model; see, for example, Y. Zhao, Z. Du, Q. Guo, S. Liu, L. Li, Z. Xu, T. Chen, and Y. Chen, "Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture," in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 787-800. Broadly, an MS instruction can be decomposed iteratively into several smaller, similar operations. As the inventive contribution of these embodiments does not lie in the specific decomposition technique, it is not elaborated here.
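As a rough illustration of fractal-style decomposition, the sketch below halves an operation until its operands fit the vesicle capacity; the field names and the simple halving rule are assumptions, and Cambricon-F describes the full model.

```python
VESICLE_CAP = 512 * 1024  # bytes; largest vesicle size assumed above

def decompose(ms_op):
    # Fits the capacity constraint: keep as a single MS instruction.
    if max(ms_op["in_bytes"], ms_op["out_bytes"]) <= VESICLE_CAP:
        return [ms_op]
    # Otherwise split into two smaller, similar operations and recurse.
    half = {**ms_op,
            "in_bytes":  ms_op["in_bytes"] // 2,
            "out_bytes": ms_op["out_bytes"] // 2}
    return decompose(half) + decompose(half)

print(len(decompose({"in_bytes": 4 * 512 * 1024,
                     "out_bytes": 512 * 1024})))  # -> 4 sub-instructions
```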
Encapsulating an MS operation into an MS instruction means, simply put, filling in one or more fields of the MS instruction. As mentioned above, an MS instruction includes a sub-instruction field and input/output vesicle information fields, and possibly also a system memory address field, a branch information field, an exception tag field, and so on. Some of these fields are mandatory, such as the sub-instruction field and the exception tag field; others are filled as needed, such as the input/output vesicle information fields, the system memory address field, and the branch information field.
When filling the sub-instruction field, the MS operation can be identified in the sub-instruction field of the MS instruction, and the field is associated with the one or more execution-unit-specific sub-instructions implementing that MS operation.
In some embodiments, for condition-computing MS instructions associated with branch MS instructions, a may-discard tag can be inserted in the exception tag field for use during later execution of the MS instruction.
In still other embodiments, for branch-type MS instructions, a branch indicator can be inserted in the branch information field to indicate the likely branch target and/or the unlikely branch target. A hypothetical encoding of these fields is sketched below.
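In this sketch the field names mirror the text, while the concrete Python types and defaults are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MSInstruction:
    sub_insn: List[str]                   # mandatory: per-IP sub-instruction refs
    may_discard: bool = False             # mandatory exception-tag field
    in_vesicles: List[int] = field(default_factory=list)  # at most two inputs
    out_vesicle: Optional[int] = None     # at most one output
    sysmem_addr: Optional[int] = None     # only for load/store MS instructions
    likely_target: Optional[int] = None   # branch information, branch type only
    unlikely_target: Optional[int] = None

# A conditional-branch MS instruction like the Ifcond example below:
ifcond = MSInstruction(sub_insn=["cpu/ifcond"], may_discard=True,
                       in_vesicles=[1], out_vesicle=2, unlikely_target=0x40)
```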
Figure 13 shows an example program, where (a) shows the original program to be compiled; the compiled program has two parts, with (b) showing the compiled MS instruction stream and (c) showing the IP-specific MS instruction implementations, that is, the sub-instructions described above.
In this example, the original program involves computing the ReLU and Softmax layers of a neural network in a deep learning application, written for example in Python with PyTorch. Both layers are computed by calling the Torch library. Following the Call-Map pass described above, these Torch library calls can be mapped to MS instructions such as "Matmul" (matrix multiply), "Eltwadd" (element-wise add), and "Relu" shown in (b). The increment of the variable Epoch and the conditional branch are packed and mapped into one conditional branch instruction, "Ifcond", into which a branch indicator is inserted to indicate the likely and unlikely branch targets. The Print statement is mapped to another MS instruction ("Print").
(c) shows several MS instructions with IP-specific code. As shown, Matmul provides two IP-specific code implementations, one for the GPU and one for the DLA, so the "Matmul" MS instruction can be scheduled by the instruction dispatcher between the GPU lane and the DLA lane. Ifcond provides only CPU-specific code: it reads the value Epoch from the first input vesicle (vi1), increments it by 1, and stores it to the output vesicle (vo); it then computes the new Epoch value modulo 10 and makes the decision based on the result. If the "Then" branch is to be taken (this branch is marked as the unlikely branch), a "UBE" event is raised. The Ifcond MS instruction therefore also carries a may-discard tag, so any large-scale MS instruction after it is blocked until Ifcond has executed. The Print MS instruction is dispatched only to the mother core, because it requires system calls and I/O with external devices.
The above thus describes an exemplary scheme for compiling program code into MS instructions. The code to be compiled can be in any of the common general-purpose programming languages, or in a domain-specific language. Compiling such code into MS instructions makes it very convenient to add new IP cores to a SaaP SoC without extensive programming/compilation work, thereby supporting SoC extensibility well. Moreover, the same MS instruction can carry multiple versions of sub-instructions, which gives the scheduler more choices at execution time and helps raise pipeline efficiency.
In summary, SaaP offers a superior design alternative to the conventional wisdom on heterogeneous SoCs. In SaaP, since no resources are shared under the PXO principle, there is no contention. MS instructions can execute speculatively and be revoked on a misprediction without any overhead, because nothing in an executing IP core leaves observable side effects of a wrong-path instruction. Caches need no coherence, because no cache line is duplicated, and the snoop filter/MESI protocol is saved, because there is no bus to snoop. Although SaaP imposes additional constraints, the description herein shows that these constraints are reasonable both analytically and empirically.
Figure 14 is a structural diagram of a board 1400 according to an embodiment of this disclosure. As shown, the board 1400 includes a chip 1401, which may be the SaaP SoC of the embodiments of this disclosure, integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing demands of complex scenarios in computer vision, speech, natural language processing, data mining, and other fields. Deep learning technology in particular is widely applied in the cloud intelligence field, a notable characteristic of which is the large volume of input data, placing high demands on the platform's storage and computing capacity. The board 1400 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage, and powerful computing capability.
The chip 1401 is connected to an external device 1403 through an external interface apparatus 1402. The external device 1403 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed can be transferred from the external device 1403 to the chip 1401 through the external interface apparatus 1402, and the computation results of the chip 1401 can be sent back to the external device 1403 via the external interface apparatus 1402. Depending on the application scenario, the external interface apparatus 1402 may take different interface forms, such as a PCIe (Peripheral Component Interconnect Express) interface.
The board 1400 also includes a storage device 1404 for storing data, comprising one or more storage units 1405. The storage device 1404 connects to and exchanges data with the control device 1406 and the chip 1401 over a bus. The control device 1406 on the board 1400 is configured to regulate the state of the chip 1401; to this end, in one application scenario, the control device 1406 may include a microcontroller unit (MCU).
The SoC chip on the board provided by the embodiments of this disclosure can include the corresponding features described above, which are not repeated here. The embodiments of this disclosure also provide a corresponding compilation apparatus, comprising a processor configured to execute compilation program code, and a memory configured to store the compilation program code; when the compilation program code is loaded and executed by the processor, the compilation apparatus performs the compilation method of any of the preceding embodiments. The embodiments of this disclosure further provide a machine-readable storage medium containing compilation program code which, when executed, causes a machine to perform the compilation method of any of the preceding embodiments.
Depending on the application scenario, the electronic device or apparatus of this disclosure may include servers, cloud servers, server computing clusters, data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, dashcams, navigators, sensors, webcams, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical devices. The vehicles include aircraft, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include MRI scanners, B-mode ultrasound scanners, and/or electrocardiographs. The electronic device or apparatus of this disclosure can also be applied in fields such as the internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Further, the electronic device or apparatus of this disclosure can be used in the cloud, at the edge, or on terminals in application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the solutions of this disclosure can be applied to cloud devices (for example, cloud servers), while electronic devices or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (for example, smartphones or webcams). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge devices are mutually compatible, so that suitable hardware resources can be matched from the cloud device's hardware resources, according to the hardware information of the terminal and/or edge devices, to simulate their hardware resources and achieve unified management, scheduling, and cooperative work in device-cloud integration or cloud-edge-device integration.
It should be noted that, for the sake of brevity, this disclosure presents some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of this disclosure are not limited by the described order of actions. Accordingly, based on the disclosure or teaching herein, those skilled in the art will understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described in this disclosure may be regarded as optional embodiments, that is, the actions or modules involved are not necessarily required for realizing one or more solutions of this disclosure. In addition, depending on the solution, the descriptions of some embodiments have different emphases. In view of this, those skilled in the art will understand that parts not detailed in one embodiment of this disclosure may also be found in the relevant descriptions of other embodiments.
Regarding specific implementation, based on the disclosure and teaching of this disclosure, those skilled in the art will understand that several of the disclosed embodiments can also be realized in other ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are split here on the basis of logical function, but other splits are possible in actual implementation. As another example, multiple units or components can be combined or integrated into another system, or some features or functions of a unit or component can be selectively disabled. As for the connections between different units or components, the connections discussed above in conjunction with the drawings can be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections using interfaces, where the communication interfaces can support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In this disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units can be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units can be selected to achieve the purposes of the solutions described in the embodiments of this disclosure. Furthermore, in some scenarios, multiple units in the embodiments of this disclosure can be integrated into one unit, or each unit can exist physically on its own.
In other implementation scenarios, the above integrated units can also be realized in hardware, that is, as concrete hardware circuits, which can include digital circuits and/or analog circuits. The physical realization of a circuit's hardware structure can include, but is not limited to, physical devices, and the physical devices can include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, computing apparatuses or other processing apparatuses) can be realized by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs (Application-Specific Integrated Circuits). Further, the aforementioned storage unit or storage apparatus can be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), such as resistive random-access memory (RRAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random-access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM, and RAM.
The embodiments of this disclosure have been introduced in detail above. Specific examples are applied herein to elaborate the principles and implementations of this disclosure, and the descriptions of the above embodiments serve only to aid understanding of the methods and core ideas of this disclosure. Meanwhile, those of ordinary skill in the art, following the ideas of this disclosure, will find changes in both the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting this disclosure.
Claims (25)
- An instruction processing apparatus, comprising: an instruction decoder for decoding mixed-scale (MS) instructions, an MS instruction comprising a sub-instruction field, the sub-instruction field indicating sub-instruction information specific to one or more execution units capable of executing the MS instruction; and an instruction dispatcher for dispatching the MS instruction to a corresponding execution unit according to the sub-instruction field.
- The apparatus of claim 1, wherein the multiple execution units are divided into different lanes by function, and the instruction dispatcher is further configured to: dispatch the MS instruction, based at least in part on its task type, to the reservation station corresponding to a suitable lane, for later issue to a suitable execution unit.
- The apparatus of claim 2, wherein the instruction dispatcher is further configured to: schedule among the lanes to which the multiple execution units capable of executing the MS instruction belong, according to the processing state in each lane.
- The apparatus of any of claims 1-3, wherein the instruction dispatcher is further configured to: dispatch designated types of MS instructions to a main execution unit, among the execution units, that is responsible for management.
- The apparatus of claim 4, wherein the designated types of MS instructions include any of: MS instructions that access system memory; MS instructions that handle interrupts; MS instructions that cannot be handled by the other execution units; and MS instructions assigned to the main execution unit according to the MS instruction scheduling policy.
- The apparatus of any of claims 4-5, wherein the one or more execution-unit-specific sub-instructions are fetched in advance and stored in a sub-instruction cache, so that when the MS instruction is issued to the corresponding execution unit, the execution unit fetches the corresponding sub-instructions from the sub-instruction cache.
- The apparatus of any of claims 1-6, wherein: compute-type MS instructions perform operations on data stored in a set of storage vesicles, the set of storage vesicles being a plurality of scratchpads of different storage capacities.
- The apparatus of claim 7, further comprising: a vesicle renaming circuit for, when a data hazard exists on a storage vesicle involved in the MS instruction, rename-mapping the vesicle's logical name to a physical name before the MS instruction is dispatched, and saving the mapping between the physical name and the logical name.
- The apparatus of claim 8, further comprising: an instruction retirement circuit for retiring completed MS instructions in order and, upon retirement of an MS instruction, committing the execution result by confirming the instruction's rename mapping for the storage vesicle corresponding to its output data.
- The apparatus of any of claims 1-9, wherein each MS instruction has at most two input data fields and one output data field.
- The apparatus of any of claims 1-10, wherein the execution units comprise multiple heterogeneous IP cores integrated on a system-on-chip (SoC).
- An instruction execution method, comprising: decoding a mixed-scale (MS) instruction, the MS instruction comprising a sub-instruction field, the sub-instruction field indicating sub-instruction information specific to one or more execution units capable of executing the MS instruction; and dispatching the MS instruction to a corresponding execution unit according to the sub-instruction field.
- The method of claim 12, wherein the multiple execution units are divided into different lanes by function, and dispatching the MS instruction to a corresponding execution unit further comprises: dispatching the MS instruction, based at least in part on its task type, to the reservation station corresponding to a suitable lane, for later issue to a suitable execution unit.
- The method of claim 13, wherein dispatching the MS instruction to a corresponding execution unit further comprises: scheduling among the lanes to which the multiple execution units capable of executing the MS instruction belong, according to the processing state in each lane.
- The method of any of claims 12-14, wherein dispatching the MS instruction to a corresponding execution unit further comprises: dispatching designated types of MS instructions to a main execution unit, among the execution units, that is responsible for management.
- The method of claim 15, wherein the designated types of MS instructions include any of: MS instructions that access system memory; MS instructions that handle interrupts; MS instructions that cannot be handled by the other execution units; and MS instructions assigned to the main execution unit according to the MS instruction scheduling policy.
- The method of any of claims 15-16, further comprising: fetching the one or more execution-unit-specific sub-instructions in advance and storing them in a sub-instruction cache; and when the MS instruction is issued to the corresponding execution unit, the execution unit fetching the corresponding sub-instructions from the sub-instruction cache.
- The method of any of claims 12-17, wherein: compute-type MS instructions perform operations on data stored in a set of storage vesicles, the set of storage vesicles being a plurality of scratchpads of different storage capacities.
- The method of claim 18, further comprising: in a conflict resolution stage before dispatching the MS instruction, when a data hazard exists on a storage vesicle involved in the MS instruction, rename-mapping the vesicle's logical name to a physical name; and saving the mapping between the physical name and the logical name.
- The method of claim 19, further comprising: upon retirement of the MS instruction, committing the execution result by confirming the instruction's rename mapping for the storage vesicle corresponding to its output data.
- The method of claim 20, wherein the decoding, conflict resolution, dispatch, execution, and retirement of the MS instructions are performed in parallel in an out-of-order pipeline.
- The method of any of claims 12-21, wherein each MS instruction has at most two input data fields and one output data field.
- The method of any of claims 12-22, wherein the execution units comprise multiple heterogeneous IP cores integrated on a system-on-chip (SoC).
- A system-on-chip (SoC), comprising the instruction processing apparatus of any of claims 1-11, and multiple heterogeneous IP cores serving as the execution units.
- A board, comprising the system-on-chip of claim 24.
Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210764246.0A | 2022-06-29 | 2022-06-29 | 指令处理装置、指令执行方法、片上系统和板卡 (Instruction processing apparatus, instruction execution method, system-on-chip, and board)
CN202210764246.0 | | |
Publications (1)

Publication Number | Publication Date
---|---
WO2024002176A1 | 2024-01-04
Family
ID=89358139
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/CN2023/103276 (WO2024002176A1) | 指令处理装置、指令执行方法、片上系统和板卡 (Instruction processing apparatus, instruction execution method, system-on-chip, and board) | 2022-06-29 | 2023-06-28
Country Status (2)

Country | Link
---|---
CN | CN117348930A
WO | WO2024002176A1
Citations (5)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US20100122105A1 | 2005-04-28 | 2010-05-13 | The University Court Of The University Of Edinburgh | Reconfigurable instruction cell array
CN101739235A | 2008-11-26 | 2010-06-16 | Institute of Microelectronics, Chinese Academy of Sciences | 将32位DSP与通用RISC CPU无缝混链的处理器装置 (Processor apparatus seamlessly mix-linking a 32-bit DSP with a general-purpose RISC CPU)
US20150277975A1 | 2014-03-28 | 2015-10-01 | John H. Kelm | Instruction and Logic for a Memory Ordering Buffer
CN110121698A | 2016-12-31 | 2019-08-13 | Intel Corporation | 用于异构计算的系统、方法和装置 (Systems, methods and apparatuses for heterogeneous computing)
CN114168197A | 2021-12-09 | 2022-03-11 | Hygon Information Technology Co., Ltd. | 指令执行方法、处理器以及电子装置 (Instruction execution method, processor, and electronic apparatus)
2022-06-29: CN application CN202210764246.0A filed; publication CN117348930A, status active, pending.
2023-06-28: PCT application PCT/CN2023/103276 filed; published as WO2024002176A1.
Also Published As

Publication number | Publication date
---|---
CN117348930A | 2024-01-05
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23830347; Country of ref document: EP; Kind code of ref document: A1