WO2024002175A1 - Instruction execution method, system controller and related product
- Publication number
- WO2024002175A1 (PCT/CN2023/103271)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- instruction
- instructions
- execution
- branch
- core
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
Definitions
- This disclosure relates generally to the field of instruction sets. More specifically, the present disclosure relates to an instruction execution method, a system controller, a system on a chip, a board card, and a machine-readable storage medium.
- Various domain-specific architectures (DSAs) have been proposed to accelerate specific applications, such as various xPUs, including the DPU (Data Processing Unit) for data stream processing, the GPU (Graphics Processing Unit) for image processing, the NPU (Neural Network Processing Unit) for neural networks, the TPU (Tensor Processing Unit) for tensor processing, etc.
- DSAs used for computing purposes are typically integrated into a System on Chip (SoC) as Intellectual Property (IP) blocks.
- IP typically only exposes IP-related hardware interfaces, forcing SoCs to manage the IP as a standalone device using code running on the host CPU. Since it is extremely difficult to manage hardware heterogeneity directly for application developers, significant efforts are often made to build programming frameworks to help application developers manage this hardware heterogeneity.
- Popular programming frameworks for deep learning include PyTorch, TensorFlow, MXNet, etc., all of which provide application developers with high-level, easy-to-use Python interfaces.
- In current SoCs, the host CPU must treat IPs as independent devices and use code running on the host CPU (i.e., a CPU-centric approach) to manage coordination between different IPs, resulting in non-negligible costs in both control and data exchange. Furthermore, even though many of the integrated IPs share some commonality, a domain-specific programming framework may not be able to leverage available IP from other domains to perform the same function. For example, using the DLA (Deep Learning Accelerator) in the NVIDIA Tegra Xavier requires explicit programming.
- The present disclosure provides solutions from multiple aspects.
- In one aspect, it provides a new unified system-on-chip architecture framework (which may be called SoC-as-a-Processor, SaaP for short) that eliminates hardware heterogeneity from the software perspective and improves programming productivity and hardware utilization.
- In another aspect, an architecture-agnostic mixed-scale instruction set is provided to support high productivity, together with new components of SaaP, including storage bubbles for on-chip memory management and an on-chip interconnect for data paths, thereby building an efficient SaaP architecture.
- In another aspect, a compilation method is provided for compiling program code written in various high-level programming languages into mixed-scale instructions.
- Other aspects of this disclosure also provide solutions for branch prediction, exceptions and interrupts.
- The present disclosure discloses an instruction execution method, including: when issuing a Mixed-Scale (MS) instruction, checking whether the MS instruction may be discarded; and when it is determined that the MS instruction may be discarded, blocking the issuance of specific MS instructions following the MS instruction.
- the present disclosure discloses a system controller configured to perform the instruction execution method according to the aforementioned first aspect.
- the present disclosure discloses a machine-readable storage medium including code that, when executed, causes a machine to perform the method of the aforementioned first aspect.
- The present disclosure discloses a system on a chip (SoC), including the system controller of the aforementioned second aspect and a plurality of heterogeneous IP cores, where the plurality of heterogeneous IP cores serve as the execution units of the MS instructions.
- the present disclosure discloses a board card including the system-on-chip of the fourth aspect.
- According to the above instruction execution scheme, MS instructions that would incur high revocation costs can be blocked until all preceding instructions that may be discarded have finished executing, that is, until their status is determined.
- This instruction execution scheme can greatly improve the processing efficiency of the MS instruction pipeline in exception and interrupt handling.
- Figure 1 schematically shows a typical architecture of a SoC
- FIG. 2 shows the hardware heterogeneity on the SoC
- Figure 3 shows a typical timeline for a traditional SoC
- FIG. 4a schematically illustrates a SaaP architecture according to an embodiment of the present disclosure in a simplified diagram
- Figure 4b shows the traditional SoC architecture for comparison
- Figure 5 shows the overall architecture of a SaaP according to an embodiment of the present disclosure
- Figure 6 schematically shows an example process of performing tasks on the MISC architecture
- Figure 7 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure
- FIG. 8 shows an exemplary flowchart of an instruction execution method for a branch instruction according to an embodiment of the present disclosure
- Figure 9 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure
- Figure 10 shows an instruction execution example according to an embodiment of the present disclosure
- FIG. 11 shows several different data path designs
- Figure 12 shows an exemplary flowchart of a compilation method according to an embodiment of the present disclosure
- Figure 13 shows an example program
- Figure 14 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
- the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
- An SoC is an integrated circuit chip that integrates all the key components of a system on the same chip. The SoC is the most common integration solution in today's mobile/edge scenarios: its high level of integration improves system performance, reduces overall power consumption, and incurs significantly smaller area cost compared with motherboard-based solutions.
- Figure 1 schematically shows a typical architecture of a SoC.
- Due to performance requirements under a limited area/power budget, an SoC usually integrates a large amount of dedicated hardware IP, usually domain-specific architectures for computing purposes, especially for accelerating domain-specific applications or specific applications.
- Some of these hardware IPs are customized by SoC designers, such as neural network processing IP (the Neural Engine (NE) in the Apple A15, the deep learning accelerator (DLA) in the NVIDIA Jetson Xavier, and the neural processing units (NPUs) in HiSilicon Kirin and Samsung Exynos), and some are standardized by IP suppliers, such as CPUs and GPUs from Arm or Imagination, DSPs (Digital Signal Processors) from Synopsys or Cadence, and FPGAs (Field-Programmable Gate Arrays) from Intel or Xilinx.
- As shown in Figure 1, the SoC may integrate a CPU 101, a GPU 102, an NPU (Neural-Network Processing Unit) 103, on-chip RAM (Random Access Memory) 104, a DRAM (Dynamic Random Access Memory) controller 105, an arbiter 106, a decoder 107, an external storage interface 108, a bus bridge 109, a UART (Universal Asynchronous Receiver/Transmitter) 110, GPIO (General Purpose Input/Output) 111, a ROM (Read-Only Memory) interface 112, etc.
- On-chip interconnection in an SoC may be implemented with a shared bus or a network on chip (NoC). A common bus used for SoC on-chip interconnection is ARM's open-standard Advanced Microcontroller Bus Architecture (AMBA).
- The SoC uses shared buses to connect and manage its various functional blocks. These shared buses include the Advanced High-performance Bus (AHB) for high-speed connections and the Advanced Peripheral Bus (APB) for low-bandwidth, low-speed peripheral connections.
- Hardware heterogeneity includes the heterogeneity of IP within SoC and the heterogeneity of IP between SoCs.
- Figure 2 illustrates the hardware heterogeneity on the SoC.
- The figure shows several IPs integrated on different SoCs.
- A certain model A integrates a CPU and a GPU on its SoC; a certain model B integrates a CPU, a GPU and a Neural Engine (NE) for neural network processing on its SoC; a certain model C integrates a CPU, a GPU and a neural processing unit (NPU) for neural network processing in its SoC; and a certain model D integrates a CPU, a GPU, a Deep Learning Accelerator (DLA) for deep learning, and a Programmable Vision Accelerator (PVA) in its SoC.
- the IPs on the same SoC are different, for example, used for different purposes.
- This is due to the fact that more and more different types of IP (especially for computing purposes) are integrated into SoCs to achieve high efficiency.
- New IP will continue to be introduced into SoC.
- a new type of neural network processing IP has been widely introduced into recent mobile SoCs.
- the number of processing units in an SoC continues to grow.
- For example, the SoC of a certain model A mainly includes 10 processing units (2 large cores, 2 small cores, and a 6-core GPU), while in a certain model B the number of processing units increases to 30 (2 large general-purpose cores, 4 small general-purpose cores, a 16-core neural engine, and a 5-core GPU).
- Meanwhile, the IP that implements the same function on different SoCs may vary greatly, because one's own IP is always preferred for business reasons.
- As a result, the same functionality (such as neural network processing) is directed to different IPs: in a certain model B it is the neural engine (NE); in a certain model D it is the deep learning accelerator (DLA); and in a certain model C it is the neural processing unit (NPU).
- many computing-purpose IPs are specific to a certain field (e.g., deep learning) or have certain generality for certain types of operations (e.g., GPUs with tensor operations).
- Programming IP such as GPUs and NPUs for computing purposes can be achieved based on support from programming frameworks and vendors.
- programming frameworks such as PyTorch, TensorFlow, MXNet, etc.
- These programming frameworks provide high-level programming interfaces (C++/Python) to customize IP, which are implemented using the IP vendor's low-level interfaces.
- For this purpose, IP suppliers provide different programming interfaces, such as PTX (Parallel Thread Execution), CUDA (Compute Unified Device Architecture), cuDNN (CUDA Deep Neural Network library) and NCCL (NVIDIA Collective Communications Library), to make their hardware drivers suitable for these programming frameworks.
- programming frameworks require extremely large development efforts because they are required to bridge the gap between software diversity and hardware diversity.
- Programming frameworks provide application developers with high-level interfaces to improve programming productivity, and these interfaces are carefully implemented to improve hardware performance and efficiency.
- For example, TensorFlow was initially developed by about 100 developers and is currently maintained by 3,000+ contributors to support dozens of SoC platforms.
- optimizing one operator on a certain IP may take a skilled developer several months.
- Even so, application developers may be required to provide different implementations for different SoCs. For example, a program written for a certain model D cannot run directly on the server-side DGX-1, despite the Tensor Cores of its GPUs.
- FIG. 3 shows a typical timeline for a traditional SoC.
- The host CPU runs the programming framework for runtime management, where each call to an IP is started/ended by the host CPU, bringing non-negligible runtime overhead.
- the data is stored in off-chip main memory, and the IP reads/writes data from the main memory, which brings additional data access.
- Control is returned from the GPU to the programming framework 39 times, occupying 56.75MB of DRAM space, 95.06% of which is unnecessary.
- According to Amdahl's law, the efficiency of such a system is limited, especially for programs composed of fragmented operations.
- this disclosure proposes a solution that lets the SoC hardware manage heterogeneity by itself.
- Just as the arithmetic logic unit (Arithmetic Logic Unit, ALU) and the floating point unit (Float Point Unit, FPU) serve as execution units in a CPU's instruction pipeline, an IP can also be regarded as an execution unit in an IP-level pipeline, so that the whole SoC behaves as a single processor, that is, a unified system-on-a-chip (SoC-as-a-Processor, SaaP).
- Figure 4a schematically illustrates a SaaP architecture according to an embodiment of the present disclosure in a simplified diagram.
- Figure 4b shows a traditional SoC architecture, where single lines represent control flow and wide lines represent data flow.
- As shown in Figure 4a, the SaaP of the embodiment of the present disclosure reconstructs the SoC into a processor, including a system controller 410 (equivalent to the controller in a processor, that is, the pipeline manager), which is used to manage the hardware pipeline, including fetching instructions from system memory (e.g., DRAM 440 in the figure), decoding them, dispatching them, revoking them, committing them, etc.
- Multiple heterogeneous IP cores, including CPU cores, are integrated into the SoC as execution units in the hardware pipeline 420 (equivalent to the arithmetic units in a processor) for executing instructions assigned by the system controller 410. Therefore, SaaP can utilize hardware pipelines rather than programming frameworks to manage heterogeneous IP cores.
- MS instruction is a unified instruction that can be applied to various heterogeneous IP cores. Therefore, hardware heterogeneity is transparent under MS instructions.
- MS instructions are fetched, decoded, dispatched, revoked, committed, etc. by the system controller 410. The adoption of MS instructions can fully exploit mixed-level parallelism.
- On-chip memory 430 can also be provided for SaaP, such as on-chip SRAM (Static Random Access Memory) or registers, which is used to cache data related to the execution of the execution units (IP cores), such as input data and output data. Therefore, after data in system memory is transferred to the on-chip memory, the IP cores can interact with the on-chip memory to access the data.
- On-chip memory 430 is similar to registers in a processor, whereby on-chip IP coordination can also be implemented implicitly in a manner similar to register forwarding in a multi-scalar pipeline.
- In the SaaP hardware pipeline, mixed-level parallelism can be fully exploited by using MS instructions, and the on-chip memory is used to realize data exchange between IP cores, thereby achieving high hardware performance. Moreover, SaaP allows any type of IP core to be integrated as an execution unit, and high-level code from application developers can be compiled for a new IP core with only slight adjustments, thus improving programming productivity.
- the traditional SoC shown in Figure 4b is CPU-centric, with the programming framework running on the host CPU.
- Various IP cores are attached to the system bus as isolated devices and managed by software running on the host CPU, and data exchange between IP cores must go through the DRAM system memory.
- In contrast, the SaaP SoC is built with an IP-level pipeline, and the IP cores are managed as execution units.
- the control flow can naturally be managed by the pipeline manager, and no programming framework is required at runtime.
- data exchange can be performed directly between different IP cores.
- SaaP SoCs follow the principle of the Pure eXclusive Ownership (PXO) architecture in their design.
- The principle is that data-related resources in the system, including on-chip buffers, data paths, data caches, memory and I/O (Input/Output) devices, are exclusively occupied by one IP core at any given time.
- FIG. 5 shows the overall architecture of a SaaP according to an embodiment of the present disclosure in more detail. Similar to the Tomasulo pipeline, SaaP can contain an out-of-order five-stage pipeline.
- the system controller as the pipeline manager can include multiple functional components to implement different functions in the pipeline management process.
- the instruction decoder 511 can decode the MS instruction proposed in the embodiment of the present disclosure.
- Instruction dispatcher 512 may dispatch MS instructions.
- The instruction exit circuit 513 is used to commit instructions and exit completed MS instructions in order.
- MS instruction cache 514 is used to cache MS instructions.
- the renaming circuit 515 is used to rename the storage elements involved in the instruction, for example, to solve possible data hazards.
- The system controller may utilize the renaming mechanism to implement any one or more of the following processes: resolving data hazards on storage elements, MS instruction revocation, MS instruction commitment, etc.
- the exception handling circuit 516 is used to respond to exceptions thrown by the IP core and perform corresponding processing. The functions of each component will be described in the relevant sections below.
- IP cores (the figure illustrates various IP cores such as CPU cores, GPU cores, and DLA cores) act as execution units for performing actual operations.
- The IP cores and related components, such as the reservation stations 521 and the IP instruction cache 522, may be collectively referred to as the IP core complex 520.
- On-chip memory is also provided in SaaP.
- The on-chip memory can be implemented as a bank of scratchpads (also called a set of storage bubbles) that buffer input and output data.
- Storage bubbles act as the registers of the processor.
- Specifically, the storage bubble bank can include multiple scratchpads with different storage capacities, which are used to cache data related to the execution of the multiple heterogeneous IP cores.
- For example, the capacities of storage bubbles can range from 64B, 128B, 256B, ..., up to 256KB and 512KB.
- In some embodiments, the number of small-capacity storage bubbles is greater than the number of large-capacity storage bubbles, so as to better support task requirements of different scales.
- This group of storage bubbles may be collectively referred to as the storage bubble complex 530.
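- For illustration, the following minimal sketch shows one way such a mixed-size bubble bank could be allocated (smallest sufficient free bubble first); the counts per capacity and the policy itself are assumptions, not details from the disclosure:

```cpp
#include <cstdint>
#include <iostream>
#include <map>

// Free-bubble count per capacity; more small bubbles than large ones,
// following the size range given above. Counts are illustrative.
std::map<uint32_t, int> free_bubbles = {
    {64, 16}, {128, 12}, {256, 8}, {256 * 1024, 2}, {512 * 1024, 1}};

int64_t allocate(uint32_t bytes) {
    auto it = free_bubbles.lower_bound(bytes);  // smallest capacity >= bytes
    while (it != free_bubbles.end() && it->second == 0) ++it;
    if (it == free_bubbles.end()) return -1;    // caller must stall or split
    --it->second;
    return it->first;                           // capacity actually granted
}

int main() {
    std::cout << allocate(100) << "\n";         // 128: smallest fitting bubble
    std::cout << allocate(300000) << "\n";      // 524288 (512KB)
}
```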
- An on-chip interconnect 540 is provided to supply non-blocking data path connections between the multiple heterogeneous IP cores and the set of storage bubbles.
- the on-chip interconnect acts as a shared data bus.
- In some embodiments, the on-chip interconnect 540 can be implemented based on a sorting network, thereby providing a non-blocking data path at only a small hardware cost and with acceptable latency.
- the on-chip interconnect 540 may also be referred to as Golgi.
- one IP core among the above-mentioned multiple heterogeneous IP cores can be designated as the mother core, responsible for managing the entire system.
- the mother core exclusively manages the exchange of data between system memory and storage bubbles.
- the mother core also exclusively manages I/O operations with external devices.
- the mother core can also control the operating system (OS) and runtime, and is responsible for at least one or more of the following processes: process management, page management, exception handling, interrupt handling, etc.
- In some embodiments, branching and speculative execution are implemented through exception handling, in which unlikely branches are treated as Unlikely Branch Exceptions (UBE).
- Static prediction can be used to implement branching and speculative execution.
- A CPU core with general processing functions is usually designated as the mother core. The mother core may exchange data between system memory and the storage bubbles via, for example, DMA (Direct Memory Access).
- non-parent IP cores may be divided into different IP lanes based on their functionality and/or type.
- the mother core itself belongs to a separate IP lane.
- For example, Figure 5 shows the mother core lane, a CPU lane, a GPU lane, a DLA lane, and so on. When scheduling an MS instruction, the MS instruction can then be dispatched to the appropriate IP lane based at least in part on the task type of the MS instruction.
- SaaP uses MS instructions to execute the entire program. Initially, when the system controller retrieves an MS instruction, it decodes it to prepare the data for execution. Data is loaded from system memory to storage bubbles, or forwarded quickly from other storage bubbles. If there is no conflict, the MS instruction is sent to the MS instruction dispatcher and then to an appropriate IP core (e.g., a DLA core) for actual execution. This IP core loads the actual precompiled IP-specific code (e.g., DLA instructions) according to the issued MS instruction, and then executes that actual code, much like execution on a regular accelerator. After execution is complete, the MS instruction exits the pipeline and commits its results.
- Table 1 shows a comparison between different instruction sets.
- CISC: Complex Instruction Set Computer.
- RISC: Reduced Instruction Set Computer.
- The length of each CISC instruction is variable; some instructions have complex functions and take many cycles, while others have simple functions and take few cycles.
- The number of instruction cycles ranges from about 2 to 15.
- The instruction length of RISC is fixed, and the cycle count of a single instruction is relatively uniform, about 1 to 1.5 cycles.
- To this end, in embodiments of the present disclosure, a Mixed-Scale (MS) instruction set is provided, and the corresponding architecture may be called a Mixed-scale Instruction Set Computer (MISC).
- In the MISC architecture, the IP cores (mainly various accelerators for computing purposes) need to efficiently process large-grained, complex work, so the number of execution cycles (Cycles Per Instruction, CPI) of a single MS instruction is longer than that of RISC, ranging from about 10 to more than 10,000 cycles, which is a relatively wide range.
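- Table 1 (reconstructed from the surrounding description; layout approximate):
  Instruction set | Instruction length | Typical CPI
  CISC | variable | about 2 to 15 cycles
  RISC | fixed | about 1 to 1.5 cycles
  MISC (MS instructions) | mixed-scale | about 10 to 10,000+ cycles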
- MS instructions have mixed load sizes: a load can be relatively small, such as requiring only 10 cycles to execute, or relatively large, such as requiring more than 10,000 cycles. Therefore, the payload carried by each MS instruction may require containers of different sizes, to facilitate retrieving data from the container and saving the computed result data into the container.
- In the SaaP architecture, the aforementioned set of storage bubbles of various sizes (e.g., from 64B to 512KB) is used to store the input and/or output data required by MS instructions, thereby supporting the mixed load sizes of MS instructions.
- MS instructions are IP-independent, that is, MS instructions are not aware of IP. Specifically, instructions specific to each IP core (that is, heterogeneous instructions) are encapsulated in MS instructions, and the MS instruction format does not depend on which IP core's instructions are encapsulated.
- In some embodiments, the MS instruction may include a sub-instruction field that indicates sub-instruction information specific to one or more IP cores capable of executing the MS instruction. It can be understood that for an MS instruction to run on a certain IP core, there must be a piece of code that the IP core can recognize (that is, IP core-specific code); this code is in turn composed of one or more instructions specific to that IP core, which are encapsulated in the MS instruction and are therefore called sub-instructions. The system controller can thus assign the MS instruction to a corresponding IP core according to the sub-instruction field.
- In some embodiments, the sub-instruction information may include the type of the sub-instruction (i.e., the type of IP core or the type of IP lane) and/or the address of the sub-instruction. There can be multiple implementations for representing the sub-instruction information.
- In one implementation, the addresses of one or more IP core-specific sub-instructions may be placed in the sub-instruction field. This method can determine the sub-instruction type and address directly from the MS instruction. However, in this implementation, since the same MS instruction may be able to run on multiple heterogeneous IP cores, the length of the MS instruction will vary with the number of IP core types that can run it.
- In another implementation, a bit sequence can be used to indicate whether the MS instruction has a sub-instruction of each corresponding type, and a first address can be used to indicate the address of the first sub-instruction.
- In some embodiments, the length of the bit sequence may equal the number of IP core types or IP lane types, so that each bit in the bit sequence indicates whether a sub-instruction of the corresponding type exists.
- The address of the first sub-instruction is obtained directly from the first address.
- The sub-instruction addresses corresponding to subsequent IP lanes can then be indexed in a fixed way (for example, separated by a fixed address stride) or by direct jumps.
- the embodiments of this disclosure have no restrictions on the specific format implementation of MS instructions.
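- For illustration only, the following minimal sketch shows one possible encoding of the sub-instruction field using the bit-sequence scheme described above; the lane count, field widths and fixed-stride indexing are assumptions, not the patent's actual binary format:

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>

constexpr int kNumLanes = 8;                  // assumed number of IP lane types

struct MSInstruction {
    std::bitset<kNumLanes> lane_mask;         // bit i set: a sub-instruction
                                              // body exists for lane type i
    uint64_t first_subinsn_addr;              // the "first address" above
    uint8_t src_bubble[2];                    // two source storage bubbles
    uint8_t dst_bubble;                       // one destination storage bubble
};

// Fixed-stride indexing variant: the body for `lane` sits a fixed distance
// after the bodies of all lower-numbered lanes that are present.
uint64_t subinsn_addr(const MSInstruction& mi, int lane, uint64_t stride) {
    uint64_t addr = mi.first_subinsn_addr;
    for (int i = 0; i < lane; ++i)
        if (mi.lane_mask.test(i)) addr += stride;
    return addr;
}

int main() {
    MSInstruction conv{};
    conv.lane_mask.set(2);                    // e.g., a GPU-lane implementation
    conv.lane_mask.set(3);                    // e.g., a DLA-lane implementation
    conv.first_subinsn_addr = 0x4000;
    std::cout << std::hex << subinsn_addr(conv, 3, 0x800) << "\n";  // 4800
}
```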
- In the embodiments of the present disclosure, MS instructions are defined to perform complex functions. Each MS instruction performs a complex function, such as a convolution, and the instruction is broken down into fine-grained IP-specific code (i.e., sub-instructions), such as RISC instructions, for actual execution.
- The IP-specific code can be code compiled against a standard library (e.g., std::inner_product from libstdc++ for inner products) or code generated from a vendor-specific library (e.g., cublasSdot from cuBLAS, also for inner products). This makes it possible for SaaP to integrate different types of IP, because the same MS instruction can be flexibly issued to different types of IP cores. Heterogeneity is therefore hidden from application developers, which also increases the robustness of SaaP.
- MS instructions have a limited arity.
- In some embodiments, each MS instruction accesses at most three storage bubbles: two source storage bubbles and one destination storage bubble. That is, for data management, each MS instruction has at most two input data fields and one output data field, which are used to indicate data information related to the execution of the MS instruction.
- In some implementations, these data fields may be represented by the numbers of the associated storage bubbles, for example indicating the two input storage bubbles and the one output storage bubble, respectively.
- The limited arity reduces the complexity of conflict resolution, renaming, datapath design, and the compiler toolchain. For example, if the arity of MS instructions were not limited, the decoding times of different MS instructions would differ greatly, resulting in irregular hardware pipelines and inefficiency.
- To satisfy the limited arity, currying can be used. Currying is a technique that converts a multi-variable function into a sequence of functions each taking fewer variables, for example through nesting or chaining. Thereby, functions with any number of inputs and outputs can be converted into forms that satisfy the limited arity of MS instructions.
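- As a minimal illustration of the currying idea (the names and element-wise operations are illustrative, not from the disclosure), a three-input computation can be decomposed into chained operations that each satisfy the two-source/one-destination limit:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

using Bubble = std::vector<float>;        // stand-in for a storage bubble

// d = (a + b) * c has three inputs, exceeding the two-source limit, so it
// is split ("curried") into two chained operations that each read at most
// two bubbles and write one.
Bubble vadd(const Bubble& a, const Bubble& b) {   // MS insn 1: v1, v2 -> v3
    Bubble r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}
Bubble vmul(const Bubble& s, const Bubble& c) {   // MS insn 2: v3, v4 -> v5
    Bubble r(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) r[i] = s[i] * c[i];
    return r;
}

int main() {
    Bubble a{1, 2}, b{3, 4}, c{5, 6};
    Bubble d = vmul(vadd(a, b), c);       // chain of limited-arity steps
    std::cout << d[0] << " " << d[1] << "\n";   // prints: 20 36
}
```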
- MS instructions have no side effects. "No side effects" here means that the execution status of the current instruction will not affect the execution of subsequent instructions; in other words, if the current instruction is to be revoked, it can be revoked without its residual state affecting subsequent instructions.
- In the embodiments of the present disclosure, the execution of an MS instruction leaves no observable side effects on the SaaP architecture. The only exception is MS instructions that execute on the mother core, since the mother core can operate on system memory and external devices. This constraint is important for exploiting Mixed-Level Parallelism (MLP), as it enables simple rollback of effects when MS instructions need to be undone, for example due to speculative execution.
- To guarantee this property, in some embodiments, the data fields of an MS instruction executed on a non-mother-core IP core can only point to storage bubbles, not to system memory.
- the storage bubble corresponding to the output data is exclusively assigned to the IP core that executes the MS instruction.
- FIG. 6 schematically shows an example process of executing tasks on the MISC architecture to better understand the implementation of MS instructions.
- the illustrated MISC architecture has, for example, a mother core and an IP core.
- the tasks to be performed are to make sandwiches (materials: bread and meat) and vegetable salads (materials: vegetables).
- the bread is named A
- the meat is named B
- the vegetables are named C
- the sandwich is named D
- the salad is named E.
- the mother core manages the system memory, so first the mother core loads the materials to be processed from the system memory to the storage bubbles, and then the IP core can process the materials on the storage bubbles.
- the above tasks can be expressed as the following MS instruction flow:
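- Reconstructed from the step-by-step description below (mnemonics and operand order are illustrative), the flow is approximately: Load Bread (void → v1) on the mother core; Load Meat (void → v2) on the mother core; Make Sandwich (v1, v2 → v1) on the IP core; Load Vegetables (void → v4) on the mother core; Make Salad (v4, void → v5) on the mother core.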
- each core should provide a specific code form, that is, core-specific sub-instructions, so that each core can know how to perform the corresponding task.
- these sub-instructions only briefly show their processing tasks or functions in the above MS instruction flow, and different forms are not distinguished.
- The storage bubbles (v1, v2, ...) used in the MS instructions are logical numbers. In actual implementation, the storage bubbles are renamed to different physical numbers to resolve WAW (Write After Write) dependencies and support out-of-order speculative execution. "Void" in an instruction indicates that the corresponding field needs no storage bubble, for example when system memory is involved.
- In Figure 6, panel 1 is the initial state; panel 2 executes the "Load Bread" MS instruction on the mother core.
- The Load instruction involves access to system memory and is therefore assigned to the mother core for execution.
- According to the instruction, the mother core takes the data out of system memory and stores it into storage bubble v1.
- The specific memory access address information of the system memory may be placed in an additional instruction field, and the embodiments of the present disclosure have no limitation in this regard.
- Panel 3 executes the "Load Meat" instruction on the mother core. Similar to the "Load Bread" instruction, the mother core takes the data out of system memory and stores it into storage bubble v2.
- Next, the "Make Sandwich" MS instruction is executed. This MS instruction is assigned to the IP core for processing because it requires more processing time.
- According to the instruction, the IP core needs to take the bread out of v1, take the meat out of v2, and put the finished sandwich into v1.
- Writing the result back into the source storage bubble v1 would create a WAR (Write After Read) hazard.
- Simply waiting until v1 is no longer being read is not very realistic, because MS instructions may be very large, for example requiring tens of thousands of cycles, and the sandwiches made in the meantime need to be stored somewhere. To resolve this data hazard, a storage bubble renaming mechanism can be used.
- the storage bubble renaming circuit 515 saves the mapping relationship between the physical name and the logical name.
- the storage bubble v1 corresponding to the output data of the "Make Sandwich" instruction is renamed to storage bubble v3, so the prepared sandwich is placed in v3.
- the ellipsis in v3 in Figure 6 indicates that this writing process will take a while and will not be completed quickly.
- the "Make Salad" instruction can be assigned to the currently idle mother core for execution.
- The status of each core can be marked, for example by a bit sequence, to facilitate instruction dispatch by the instruction dispatcher. The renaming mechanism is applied here as well.
- the mother core takes out the vegetables from the storage bubble v4, makes them into salads and puts them into the storage bubble v5.
- the IP core can start processing.
- "Make Sandwich" takes more time, so "Make Salad", executed on the mother core, completes earlier; in this way, mixed-level parallelism (MLP) can be fully exploited. The executions of different IP cores do not interfere with each other, that is, instructions can be executed out of order, but they are committed in order.
- SaaP SoCs employ out-of-order pipelines to mine mixed-level parallelism between IP cores.
- The pipeline can contain 5 stages: fetch & decode, conflict resolution, dispatch, execution, and exit.
- FIG. 7 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure.
- the following description can be understood with reference to the SaaP architecture shown in Figure 5.
- Figure 7 shows the instruction execution process including a complete pipeline, but those skilled in the art can understand that some steps may only occur under specific circumstances and are therefore not necessary in all cases. The necessity can be discerned according to the specific situation.
- First, at step 710, instruction fetch & decode is performed.
- In this step, the MS instruction is retrieved from the MS instruction cache 514 based on the MS program counter (PC), and the instruction decoder 511 decodes the retrieved MS instruction to prepare operands.
- the decoded MS instructions can be placed in the instruction queue of the instruction decoder 511 .
- the MS instruction includes a sub-instruction field that indicates sub-instruction information specific to one or more IP cores capable of executing the MS instruction.
- the subinstruction information may, for example, indicate the type of the subinstruction (ie, the type of IP core or the type of IP lane) and/or the address of the subinstruction.
- In some embodiments, when the MS instruction is fetched and decoded, the corresponding sub-instructions can be prefetched according to the decoding result and stored in a designated location, such as the sub-instruction cache 522 (labeled IP instruction cache 522 in Figure 5). Therefore, when the MS instruction is issued to the corresponding IP core for execution, the IP core can fetch the corresponding sub-instructions from the sub-instruction cache 522 and execute them.
- the MS instruction may be a branch instruction.
- In some embodiments, static prediction is used to determine the direction of the branch instruction, that is, to determine the PC value of the next MS instruction.
- The inventors analyzed the branch behavior in benchmark programs and found that 80.6% to 99.8% of large-scale instruction branches can be predicted correctly at compile time. Since large-scale instructions determine the overall execution time, in the embodiments of the present disclosure, static prediction is used to perform branch prediction, thereby eliminating the need for a hardware branch predictor. Thus, whenever a branch is encountered, it is always assumed that the next MS instruction lies in the statically predicted likely branch direction.
- Next, the pipeline proceeds to step 720, where possible conflicts are resolved.
- the retrieved MS instructions are queued to resolve conflicts.
- Possible conflicts include: (1) data hazards; (2) structural conflicts (e.g., no space available in the exit unit); and (3) exception violations (e.g., blocking an MS instruction that cannot be easily undone until it is confirmed to be on the taken path).
- data hazards such as Read After Write (RAW) and Write After Write (WAW) can be solved through a storage bubble renaming mechanism.
- Specifically, when there is a data hazard on a storage bubble involved in an MS instruction, the storage bubble renaming circuit 515 renames the logical name of the storage bubble to a physical name before the MS instruction is dispatched, and saves the mapping relationship between the physical name and the logical name.
- Through the storage bubble renaming mechanism, SaaP can support fast MS instruction revocation (achieved by simply discarding the rename mapping of the output data storage bubble) and out-of-order execution without WAW hazards.
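- The following minimal sketch illustrates how such renaming can make revocation a matter of discarding a mapping and commitment a matter of confirming one; the data structures and the single-speculative-mapping simplification are assumptions for exposition:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Minimal sketch of storage bubble renaming. One speculative mapping per
// logical bubble is assumed for brevity; a real design would track one
// mapping per in-flight instruction.
struct RenameTable {
    std::unordered_map<int, int> committed;    // logical -> physical, confirmed
    std::unordered_map<int, int> speculative;  // logical -> physical, in flight
    std::vector<int> free_list{3, 4, 5, 6, 7}; // free physical bubbles

    int write(int logical) {         // dispatch of an insn that writes `logical`
        assert(!free_list.empty());
        int phys = free_list.back();
        free_list.pop_back();
        speculative[logical] = phys; // old committed mapping stays untouched,
        return phys;                 // so WAW/WAR hazards disappear
    }
    void commit(int logical) {       // retirement: acknowledge the mapping
        committed[logical] = speculative.at(logical);
        speculative.erase(logical);  // no data is moved, only the map entry
    }
    void revoke(int logical) {       // misprediction/exception: drop the mapping
        free_list.push_back(speculative.at(logical));
        speculative.erase(logical);
    }
};

int main() {
    RenameTable rt;
    rt.committed = {{1, 1}, {2, 2}};       // v1, v2 already hold loaded data
    int p = rt.write(1);                   // "Make Sandwich" output v1 -> p (v3)
    rt.commit(1);                          // retire: v1 now names physical p
    return rt.committed[1] == p ? 0 : 1;
}
```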
- step 730 the MS instructions are dispatched by the instruction dispatcher 512 .
- As mentioned above, an MS instruction has a sub-instruction field that indicates the IP cores capable of executing it. Therefore, the instruction dispatcher 512 can dispatch the MS instruction to a corresponding IP core according to the information in the sub-instruction field. Specifically, the MS instruction is first dispatched to the reservation station to which the IP core belongs, for subsequent issue to an appropriate IP core.
- IP cores may be divided into different IP lanes according to their functions and/or types, with each lane corresponding to a specific IP core model.
- reservation stations can also be grouped according to lanes, for example, each lane can correspond to a reservation station.
- For example, Figure 5 shows a mother core lane, a CPU lane, a GPU lane, a DLA lane, and so on.
- Different lanes can be suitable for performing different types of tasks. Therefore, when scheduling and dispatching the MS instruction, the MS instruction can be dispatched to the reservation station corresponding to the appropriate lane based at least in part on the task type of the MS instruction for subsequent transmission to the appropriate IP core.
- In some embodiments, scheduling can also be performed among multiple IP lanes capable of executing the MS instruction according to the processing status of each IP lane, thereby improving processing efficiency. Since the same MS instruction may have multiple different implementations executable on multiple IP cores, the processing pressure on a bottleneck lane can be alleviated by selecting the assigned lane according to an appropriate scheduling policy. For example, an MS instruction involving a convolution operation can be dispatched to either the GPU lane or the DLA lane; both can execute it effectively, and one can be selected based on the load of the two lanes, thereby speeding up processing.
- The scheduling policy may include various rules, such as selecting the IP core with the largest throughput or selecting the IP core requiring the fewest sub-instructions; the embodiments of the present disclosure are not limited in this regard.
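- For illustration, a minimal sketch of such a scheduling policy follows, assuming reservation-station occupancy as the load metric (an assumption, not a detail from the disclosure):

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Among the lanes whose IP cores can execute an MS instruction,
// pick the least-loaded one.
struct Lane { const char* name; int pending; };

int pick_lane(const std::vector<Lane>& lanes, const std::vector<int>& capable) {
    return *std::min_element(capable.begin(), capable.end(),
        [&](int a, int b) { return lanes[a].pending < lanes[b].pending; });
}

int main() {
    std::vector<Lane> lanes{{"mother", 1}, {"CPU", 4}, {"GPU", 6}, {"DLA", 2}};
    std::vector<int> conv_lanes{2, 3};     // convolution: GPU lane or DLA lane
    std::cout << lanes[pick_lane(lanes, conv_lanes)].name << "\n";  // DLA
}
```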
- some specific types of MS instructions must be dispatched to specific IP cores.
- one IP core can be designated as the mother core, responsible for managing the entire system. Therefore, some MS instructions involving system management types must be dispatched to the mother core for execution.
- the mother core exclusively manages the exchange of data between system memory and storage vesicles. Therefore, MS instructions of the memory access type that access system memory are dispatched to the mother core.
- the mother core also exclusively manages I/O operations with external devices. Therefore, I/O type MS instructions such as display output are also dispatched to the mother core.
- As mentioned earlier, the mother core can also run the operating system (OS) and runtime, and is responsible for at least one or more of the following processes: process management, page management, exception handling, interrupt handling, etc. Accordingly, interrupt-related MS instructions handled by the interrupt circuit 517 are dispatched to the mother core.
- In addition, some MS instructions that cannot be processed by other IP cores in time, for example because those IP cores are busy, may also be assigned to the mother core for processing. Further cases are not enumerated here.
- The pipeline then proceeds to the execution step 740, at which stage MS instructions may be executed out of order by the IP cores.
- Specifically, the IP core to which an instruction is assigned may utilize the actual IP-specific code to perform the functionality of the MS instruction. For example, the IP core retrieves the corresponding sub-instructions from the sub-instruction cache/IP instruction cache 522 according to the assigned instruction and executes them.
- the Tomasulo algorithm can be implemented at this stage to organize these IP cores to support mixed-level parallelism (MLP). Once the dependencies on the storage bubble are resolved, MS instructions can be continuously dispatched to the IP core complex, allowing them to be executed out of order.
- an adapter is used to encapsulate the IP core.
- the adapter directs access to the program to the IP instruction cache 522 and directs access to the data to the storage bubble.
- Depending on the IP core, "the program" can be interface signals of an accelerator, such as CSB (Configuration Space Bus) control signals for the DLA, or a piece of IP-specific code that implements the MS instruction (for example, for programmable processors such as CPUs/GPUs).
- Operational MS instructions perform operations on data stored in a set of storage bubbles.
- Each IP core has two data read ports and one data write port. During execution, physical storage bubbles are exclusively connected to these ports, so from the perspective of the IP core, the storage bubbles work like main memory in a traditional architecture.
- Finally, at step 750, the pipeline enters the exit phase.
- MS instructions exit the pipeline and commit the results.
- The instruction exit circuit 513 in Figure 5 is used to exit completed MS instructions in order; when an MS instruction exits, its execution result is committed by confirming the rename mapping of the storage bubble corresponding to the MS instruction's output data. That is, the commit is accomplished by permanently acknowledging, in the renaming circuit 515, the rename mapping of the storage bubble of the output data. Since only the rename mapping is acknowledged, no data is actually buffered or copied, thus avoiding the extra overhead of copying data when the amount of data is large (which is common in various computing-purpose IP cores).
- the MS instruction system can also be applied in other environments, and is not limited to environments with heterogeneous IP cores. For example, it can also be used in homogeneous environments.
- It suffices that the execution units of the MS instructions can independently parse and execute the sub-instructions. Therefore, in the above description, the IP core can be directly replaced by a generic execution unit, and the mother core by a main execution unit; the above method is still applicable.
- Branch instructions may also appear in the MS instruction stream, and branch instructions cause control dependencies.
- Control dependencies are in fact related to the program counter (PC) of the MS instruction, and the PC value is used when fetching instructions. If branch instructions are not handled well, the fetching of the next instruction is affected, causing the pipeline to stall and reducing pipeline efficiency. Therefore, it is necessary to provide branch prediction support that is effective for MS instructions, that is, effective for both large-scale and small-scale instructions.
- In a conventional CPU instruction pipeline, the branch condition is calculated during decoding and the correct branch target is determined, so that the next instruction can be fetched from the branch target address.
- This calculation of the branch condition and setting of the next PC value to the correct branch target usually costs only a few cycles, an overhead that is very small and can be completely hidden by a conventional CPU instruction pipeline.
- In the MS instruction stream, by contrast, if a branch MS instruction is mispredicted, the misprediction is only discovered at some point during the overall execution of the MS instruction stream, and that point may lie hundreds or thousands of cycles (or more) after the branch MS instruction started executing. In the MS instruction pipeline, the PC value of the next MS instruction therefore cannot be determined until the branch direction is really known, and in this case the misprediction overhead would be very high.
- The inventors analyzed the branch behavior in five benchmark programs and found that 80.6% to 99.8% of large-scale instruction branches can be predicted correctly at compile time, that is, they can be predicted in a static manner. Since large-scale instructions occupy most of the total execution time and determine the overall execution time, the embodiments of the present disclosure use static prediction to perform branch prediction, thereby eliminating the need for a hardware branch predictor.
- FIG. 8 shows an exemplary flowchart of an instruction execution method for a branch instruction according to an embodiment of the present disclosure. This method is executed by the system controller.
- the MS instruction is decoded.
- As mentioned earlier, MS instructions have widely varying cycles per instruction (CPI).
- For example, the CPI of an MS instruction may range from about 10 cycles to more than 10,000 cycles. This varying-CPI nature of MS instructions also makes dynamic prediction difficult to apply.
- At step 820, in response to the MS instruction being a branch instruction, the next MS instruction is fetched according to branch indication information, where the branch indication information indicates a likely branch target and/or an unlikely branch target.
- In some embodiments, the static prediction mechanism can use compiler hints to perform static prediction. Specifically, during instruction compilation, the branch indication information can be determined based on a static branch prediction method and inserted into the MS instruction stream.
- In different implementations, the branch indication information may contain different contents. Static prediction always takes the likely branch target as the next MS instruction address. In some cases, in order to preserve the temporal locality of the instruction cache, the likely branch target can usually be placed immediately adjacent to the current MS instruction; in these cases, the branch indication information may only need to indicate the unlikely branch target. In other cases, the branch indication information may indicate both the likely branch target and the unlikely branch target. Thus, when the next MS instruction is fetched according to the branch indication information, the likely branch target indicated by the branch indication information is determined as the next MS instruction.
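- A minimal sketch of this fetch-side behavior follows; the hint layout (an optional likely target that defaults to the adjacent instruction, plus an unlikely target consulted only on a UBE event) is an assumption for illustration:

```cpp
#include <cstdint>
#include <optional>

struct BranchHint {
    std::optional<uint64_t> likely_target;  // absent: fall through
    uint64_t unlikely_target;               // used only when a UBE is raised
};

uint64_t next_pc(uint64_t pc, uint64_t insn_size,
                 const BranchHint& h, bool ube_raised) {
    if (ube_raised) return h.unlikely_target;   // prediction proved wrong
    if (h.likely_target) return *h.likely_target;
    return pc + insn_size;                      // adjacent likely target
}

int main() {
    BranchHint h{std::nullopt, 0x9000};
    // Fetch always follows the likely direction; only a later UBE event
    // redirects it to the unlikely target.
    return next_pc(0x100, 16, h, false) == 0x110 &&
           next_pc(0x100, 16, h, true)  == 0x9000 ? 0 : 1;
}
```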
- the system controller may receive an Unlikely Branch Exception (UBE) event.
- the UBE event is triggered by the execution unit (such as an IP core) that executes the condition-calculation instruction associated with a branch instruction. The event indicates that, according to the condition calculation, the branch should go to the unlikely branch target, that is, the earlier branch prediction was wrong.
- in step 840, the system controller performs a series of operations in response to the UBE event to recover from the branch misprediction. These operations include: canceling the MS instructions after the branch instruction; committing the MS instructions before the branch instruction; and determining the unlikely branch target indicated by the branch indication information as the next MS instruction.
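- A minimal sketch of this recovery sequence, assuming hypothetical queue, rename-table, and execution-unit structures:

```python
# Hypothetical sketch of UBE recovery: commit the MS instructions older
# than the branch, revoke the younger ones, and redirect fetch to the
# unlikely branch target. Names and states are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MSInstr:
    seq: int                      # program order
    state: str = "waiting"        # waiting | executing | done
    dest: Optional[str] = None    # storage bubble written, if any

class Units:
    def terminate(self, seq):
        print(f"terminate execution unit running MS#{seq}")

def handle_ube(queue, branch_seq, unlikely_target, rename_table, units):
    """Precise-exception recovery for an Unlikely Branch Exception."""
    survivors = []
    for ins in queue:
        if ins.seq <= branch_seq:
            survivors.append(ins)             # older side: commits in order
        else:
            if ins.state == "executing":
                units.terminate(ins.seq)      # stop the executing IP core
            if ins.dest is not None:
                rename_table.pop(ins.dest, None)  # discard written bubble
            # younger instructions are dropped from the instruction queue
    queue[:] = survivors
    return unlikely_target                    # fetch continues here

q = [MSInstr(0, "done"), MSInstr(1, "done"), MSInstr(2, "executing", "vb7")]
print(hex(handle_ube(q, branch_seq=1, unlikely_target=0x200,
                     rename_table={"vb7": "phys3"}, units=Units())))
```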
- This processing corresponds to a precise exception: when the exception occurs, all instructions before the instruction interrupted by the exception have completed, and all instructions after it appear never to have executed. Since the UBE event is an exception caused by a branch misprediction, the interrupted instruction is the branch MS instruction itself.
- an MS instruction that needs to be revoked may be in one of three states: executing in an execution unit; finished executing; or not yet issued for execution. Each state can affect different software and hardware, and these effects must be eliminated. If the instruction is executing, the execution unit running it must be terminated; if the instruction has written a temporary register (such as a storage bubble) during or after execution, the temporary registers it wrote must be discarded; if the instruction has not yet executed, it only needs to be canceled from the instruction queue. Since the instruction queue records all instructions that have not yet exited/committed, instructions in the executing or finished state must also be canceled from the instruction queue.
- in other words, revoking the MS instructions after the branch instruction includes: canceling the revoked MS instructions in the instruction queue; terminating the execution units executing the revoked MS instructions; and discarding the temporary registers (storage bubbles) written by the revoked MS instructions.
- handling branch MS instructions through static prediction thus saves hardware and software resources, adapts to the widely varying CPI of MS instructions, and improves pipeline efficiency. Furthermore, handling branch mispredictions through the exception mechanism further saves hardware resources and simplifies processing.
- embodiments of the present disclosure also provide an instruction execution scheme that blocks MS instructions whose revocation would be costly until all potentially discarded instructions before them have finished executing, that is, until their status is determined.
- This instruction execution scheme can greatly improve the processing efficiency of the MS instruction pipeline in exception and interrupt handling.
- FIG. 9 shows an exemplary flowchart of an instruction execution method according to an embodiment of the present disclosure. This method is executed by the system controller.
- in step 910, when an MS instruction is issued, it is checked whether the MS instruction may be discarded.
- checking whether the MS instruction may be discarded includes checking whether the MS instruction carries a possible discard tag.
- the possible discard tag can be inserted at compile time by the compiler based on the type of the MS instruction. For example, the compiler can insert a possible discard tag when it finds that the MS instruction is a conditional branch instruction or may otherwise raise an exception.
- in step 920, when it is determined that the MS instruction may be discarded, the issuance of specific MS instructions following it is blocked.
- a specific MS instruction can be a large-scale MS instruction or, more generally, an MS instruction with a relatively high revocation cost.
- a specific MS instruction can be identified based on one or more of the following conditions: the size of the temporary register (storage bubble) corresponding to the output data of the MS instruction exceeds a set threshold; the MS instruction performs a write operation on system memory; the execution time of the MS instruction exceeds a predetermined value; or the MS instruction is executed by a specific execution unit.
- when the storage bubble size (capacity) of the output data exceeds the set threshold, the output data volume of the MS instruction is relatively large, and the corresponding revocation overhead is correspondingly high. Blocking MS instructions that write system memory mainly ensures storage consistency.
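- A hedged sketch of such a blocking decision; the threshold values and the example execution-unit set are invented for illustration:

```python
# Illustrative predicate for deciding whether an MS instruction is
# "specific" (i.e., costly to revoke) and should be blocked behind a
# possibly-discarded instruction. Thresholds are made up for the example.

BUBBLE_SIZE_THRESHOLD = 64 * 1024   # bytes; assumed, not from the disclosure
EXEC_TIME_THRESHOLD = 1000          # beats; assumed

def must_block(instr) -> bool:
    return (instr.get("out_bubble_bytes", 0) > BUBBLE_SIZE_THRESHOLD
            or instr.get("writes_system_memory", False)
            or instr.get("expected_beats", 0) > EXEC_TIME_THRESHOLD
            or instr.get("unit") in {"GPU", "DLA"})   # example unit set

print(must_block({"out_bubble_bytes": 512 * 1024}))   # True: large output
print(must_block({"expected_beats": 12}))             # False: cheap to undo
```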
- MS instructions before the blocked instruction are still issued and executed normally. These normally issued and executed MS instructions are then handled according to how their execution turns out.
- in step 930, when all potentially discarded MS instructions that caused the blocking have finished executing normally, the blocked specific MS instruction may be issued for execution by an execution unit in response to this event. At this point it is certain that the specific MS instruction will not be revoked because of a preceding instruction, so normal issuance and execution of the instruction pipeline can continue.
- in step 940, when an exception occurs during the execution of any potentially discarded MS instruction that blocked the specific MS instruction, exception handling is performed in response to the exception event.
- this exception handling corresponds to a precise exception: the MS instruction that caused the exception and the MS instructions after it are canceled, the MS instructions before it are committed, and the first MS instruction of the corresponding exception handler is taken as the next MS instruction.
- canceling the MS instruction that caused the exception and the MS instructions after it includes: canceling these revoked MS instructions in the instruction queue; terminating the execution units executing them; and discarding the scratchpads (storage bubbles) written by them.
- discarding the scratchpads written by these revoked MS instructions includes deleting the corresponding mappings from the record that holds the rename mappings between the physical and logical names of these scratchpads.
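- For illustration, deleting the rename mappings might look like the following; the table layout and names are assumed:

```python
# Sketch of discarding scratchpad (storage bubble) writes by deleting the
# logical->physical rename mappings. The table layout is hypothetical.

rename = {"v0": "bubble17", "v1": "bubble03", "v2": "bubble41"}

def discard_writes(rename_table, revoked_logical_names):
    for name in revoked_logical_names:
        phys = rename_table.pop(name, None)   # unmap; physical bubble freed
        if phys is not None:
            print(f"freed {phys} (was {name})")

discard_writes(rename, ["v1", "v2"])
print(rename)   # {'v0': 'bubble17'}: only the surviving mapping remains
```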
- when the exception event is an Unlikely Branch Exception (UBE) triggered by a branch-type MS instruction during branch prediction processing, the unlikely branch target is determined as the next MS instruction once the exception is resolved. After exception handling completes, the instruction pipeline can therefore jump to the correct branch direction and continue execution.
- Figure 10 shows an instruction execution example according to an embodiment of the present disclosure.
- (a) shows the initial state of the MS instruction stream in the instruction queue, which contains 5 MS instructions to be executed.
- the #1 MS instruction carries a possible discard tag.
- the different widths occupied by the instructions represent their different scales: the #3 MS instruction is a large-scale MS instruction, while the rest are small-scale MS instructions.
- the different backgrounds of the instructions represent their different statuses, such as waiting, blocked, issued, executing, exited, exception, or canceled; see the legend for the specific representation.
- (b) shows the instruction issuance step.
- small-scale instructions are issued as soon as possible, while large-scale instructions are blocked by previously issued instructions that may be discarded. Specifically, the #0 instruction is issued first, followed by the #1 instruction. When issuing instruction #1, it is found that this instruction may be discarded, so subsequent large-scale instructions are blocked.
- instruction #2 can still be issued normally because it is a small-scale instruction; instruction #3 is blocked because it is a large-scale instruction, and the instructions after it remain in the waiting state.
- (c) shows the instruction execution process.
- the #2 instruction may finish executing first. Since the instructions before it have not yet finished, it must wait in order to guarantee in-order commit.
- (d1) shows that the #1 instruction has also finished executing normally and has not been discarded. The large-scale instruction #3 that was blocked because of instruction #1 can now be issued, and the subsequent instruction #4 can also be issued normally.
- (e1) shows that instructions #0, #1, #2 and #4 have all finished executing due to their small scale, while instruction #3 is still executing.
- (f1) shows that instructions #0, #1, and #2 commit in order, while instruction #4 must wait for instruction #3 to finish executing before it can commit.
- (g1) shows that instruction #3 has also finished executing.
- (h1) shows instructions #3 and #4 committing in order.
- if instead an exception occurs during the execution of the #1 instruction, as shown in (d2), exception handling is performed.
- the process of exception handling usually includes exception handling preparation, determining the source of the exception, saving the execution state, handling the exception, restoring the execution state and returning, etc.
- using the exception processing circuit 516 shown in FIG. 5, it is possible to record whether an exception occurs and to adjust the next MS instruction address according to the processing result.
- based on the branch indication information attached to the branch instruction, the unlikely branch target indicated by it can be determined as the next MS instruction once the exception is resolved. That is, after the exception is handled, the pipeline jumps to the MS instruction corresponding to the unlikely branch target.
- in another example, if the denominator of a division is zero, the pipeline jumps to an exception handler, which may modify the denominator to a very small non-zero value; after the exception is handled, instruction #1 is re-executed and normal instruction pipeline processing continues.
- storage bubbles are used as an alternative to registers for mixed-scale data access.
- the storage bubbles can be independent, mixed-size, single-port scratchpads whose capacity may range, for example, from 64B to 512KB.
- storage bubbles can thus serve as mixed-capacity registers for use by MS instructions.
- "storage vesicle complex" refers to a physical "register” file composed of storage vesicles, rather than a fixed-size register.
- in some embodiments, the number of small-capacity (for example, 64B) storage bubbles is greater than the number of large-capacity (for example, 512KB) storage bubbles, so as to better match program requirements and serve tasks of different sizes.
- each storage bubble can be a single SRAM or register with two read ports and one write port.
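- The following toy allocator illustrates such a mixed-capacity bubble pool; the size classes span the 64B to 512KB range mentioned above, but the counts are invented:

```python
# Illustrative storage-bubble pool: more small bubbles than large ones,
# with allocation picking the smallest bubble that fits. Counts invented.

from bisect import bisect_left

SIZES = [64, 4096, 64 * 1024, 512 * 1024]   # bytes, 64B .. 512KB
COUNTS = [128, 32, 8, 2]                    # small bubbles outnumber large
free = {s: c for s, c in zip(SIZES, COUNTS)}

def alloc(nbytes):
    i = bisect_left(SIZES, nbytes)          # smallest size class that fits
    for s in SIZES[i:]:
        if free[s] > 0:
            free[s] -= 1
            return s
    raise MemoryError("no storage bubble large enough")

print(alloc(100))       # 4096: smallest class that can hold 100 bytes
print(alloc(300_000))   # 524288
```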
- Figure 11 shows several different data path designs, where (a) shows the data bus, (b) shows the crossbar matrix, and (c) shows the Golgi provided by embodiments of the present disclosure.
- the data bus cannot provide non-blocking access and requires a bus arbiter to resolve access conflicts.
- the crossbar can provide non-blocking access with lower latency, but it requires O(mn) switches, where m is the number of IP core ports and n is the number of storage bubble ports.
- in the Golgi, the connection problem is treated as a Top-K sorting problem, in which storage bubble ports are sorted by the destination IP core port number.
- the on-chip interconnect consists of a bitonic sorting network made up of multiple comparators and switches.
- the bitonic sorting network sorts the relevant storage bubble ports by the index of the destination IP core port, thereby constructing data paths between the m IP core ports and the n storage bubble ports.
- in the first stage, the even-numbered columns are compared with each other, and the odd-numbered columns are compared with each other.
- for example, storage bubbles a and c are compared; since the value #3 of a is greater than the value #1 of c, the two are exchanged.
- the light hatched line in the figure indicates that the switch is turned on and data can flow laterally. Storage bubbles b and d are compared; since the value #+∞ of b is greater than the value #2 of d, the two are also exchanged, the switch is turned on, and the data path flows laterally.
- after this stage, the sorted positions are c, d, a, and b.
- in the next stage, adjacent storage bubbles are compared. For example, storage bubbles c and d are compared; since the value #1 of c is less than the value #2 of d, they remain unchanged, the switch stays off, and the data path can only flow vertically. Similarly, the switch stays off after comparing storage bubbles d and a, and after comparing storage bubbles a and b.
- at this point, each IP core corresponds exactly to the storage bubble it wants to access. For example, IP#1 goes straight down from the channel below it, moves laterally at the gray dot, and then goes straight down to storage bubble c.
- the data paths of other IP cores are similar.
- a non-blocking data path is constructed between the IP core and the storage bubble.
- in this way, the Golgi can be implemented with O(n(log k)^2) comparators and switches, far fewer than the O(nk) switches required by the crossbar.
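- The routing-by-sorting idea can be illustrated with a toy compare-exchange model; for brevity this sketch uses an odd-even transposition network rather than the bitonic network described above:

```python
# Toy model of routing by sorting: each storage-bubble port carries the
# index of the IP-core port it must reach; compare-exchange stages
# (comparators + switches) sort the columns so that position i ends up
# holding the bubble destined for IP core i. The Golgi described above is
# a bitonic sorting network with O(n(log k)^2) comparators, not modeled here.

INF = float("inf")   # an idle port sorts past every real destination

def route(dest):                      # dest[j] = IP port wanted by bubble j
    order = list(range(len(dest)))    # which bubble sits in each column
    for step in range(len(dest)):     # alternate even/odd comparison columns
        for i in range(step % 2, len(dest) - 1, 2):
            a, b = order[i], order[i + 1]
            if dest[a] > dest[b]:     # comparator fires: switch turns on
                order[i], order[i + 1] = b, a
    return order                      # order[i] = bubble feeding IP core i

# Bubbles a..d request IP ports #3, +inf (idle), #1, #2, as in the example.
print(route([3, INF, 1, 2]))          # [2, 3, 0, 1] -> c, d, a, b
```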
- data delivered through the Golgi is subject to several cycles of latency (e.g., 8 cycles), so a preferred practice is to place a small local cache (1KB is sufficient) in an IP core that relies on a large number of random accesses.
- to execute an MS instruction, SaaP establishes an exclusive data path between the IP core and its storage bubbles.
- this exclusive data path in SaaP follows the PXO architecture and provides non-blocking data access with minimal hardware cost.
- data can be shared between IP cores by passing storage bubbles between MS instructions. Since the mother core manages system memory, the input data is gathered in one MS instruction by the mother core and placed in a storage bubble for use by another MS instruction. After processing by the IP core, the output data is likewise scattered back to system memory by the mother core.
- the complete data path from system memory to an IP core includes: (loading MS instruction) ① memory → ② L3/L2 cache → ③ mother core → ④ Golgi W0 → ⑤ storage bubble; (consuming MS instruction) ⑤ the same storage bubble → ⑥ Golgi R0 → ⑦ IP core.
- system memory is a resource exclusively owned by the mother core, which greatly reduces system complexity in the following aspects:
- Page faults can only be initiated by the mother core and are handled within the mother core, so other MS instructions can always be executed safely while ensuring no page faults;
- the L2/L3 cache is owned exclusively by the mother core, so cache inconsistency, contention, and false sharing never occur;
- SaaP can support various general-purpose programming languages (C, C++, Python, etc.) as well as domain-specific languages. Since any task executed on a SaaP is an MS instruction, the key technique is extracting mixed-scale operations to form MS instructions.
- Figure 12 shows an exemplary flowchart of a compilation method according to an embodiment of the present disclosure.
- in step 1210, mixed-scale (MS) operations, which may have a variable number of execution cycles, are extracted from the program to be compiled.
- in step 1220, the extracted mixed-scale operations are packaged to form MS instructions.
- low-level operations can be extracted from basic instruction blocks, while high-level operations can be extracted in various ways, including but not limited to: 1) direct mapping of library calls, 2) reconstruction from low-level program structures, and 3) manually set compiler directives. Existing programs, such as deep learning applications written in Python using PyTorch, can therefore be compiled onto the SaaP architecture in a manner similar to a multi-scalar pipeline.
- the following five LLVM compilation passes can optionally be added to extend a traditional compiler.
- in the call mapping process (Call-Map), calls to library functions can be extracted from the program to be compiled as MS operations; then, according to a list mapping library functions to the MS template library, each extracted library call is converted into the corresponding MS instruction.
- the MS template library is pre-compiled from execution-unit-specific code capable of executing the library functions.
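- A toy illustration of the call mapping idea; the table contents are invented, though the MS instruction names match the Figure 13 example discussed below:

```python
# Hypothetical Call-Map table: library calls recognized in the program are
# rewritten to pre-compiled MS-template-library entries.

CALL_MAP = {
    "torch.matmul":  "Matmul",    # may have several IP-specific versions
    "torch.add":     "Eltwadd",
    "torch.relu":    "Relu",
    "torch.softmax": "Softmax",
}

def lower_call(callee, args):
    """Return an MS instruction for a mapped library call, else None."""
    template = CALL_MAP.get(callee)
    if template is None:
        return None                   # left for later compilation passes
    return {"op": template, "inputs": args[:2], "outputs": 1}

print(lower_call("torch.matmul", ["vb0", "vb1"]))
print(lower_call("numpy.fft.fft", ["vb2"]))   # None: not in this toy table
```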
- in the reconstruction process, specified program structures in the program to be compiled are identified as MS operations through template matching, and each identified program structure is converted into a predetermined MS instruction.
- the templates can be predefined based on the structural characteristics of high-level functions.
- for example, a template can define a nested loop structure and set some parameters of that structure, such as how many nesting levels there are, the size of each loop, and what operations appear in the innermost loop.
- templates can be defined for typical high-level structures, such as the convolution operation structure or the Fast Fourier Transform (FFT).
- for example, a user-implemented Fast Fourier Transform (as a nested loop) can be captured via template matching and then replaced with the FFT MS instruction of a vendor-specific library used in Call-Map.
- the reconstructed FFT MS instruction can be executed more efficiently on a DSP IP core (if available) and, in the worst case where only the CPU is available, can be converted back into a nested loop. This is a best-effort approach, since it is inherently difficult to accurately reconstruct all high-level structures, but it gives legacy programs that are unaware of DSAs an opportunity to take advantage of a new DSP IP core.
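- A highly simplified stand-in for such template matching; real matching operates on the compiler IR, and the loop-nest summaries and templates here are invented:

```python
# Best-effort reconstruction sketch: a loop nest summarized as (depth,
# innermost ops) is matched against coarse structural templates. This
# string-level stand-in only illustrates the control flow of the pass.

TEMPLATES = [
    # (predicate over a loop-nest summary, MS instruction to emit)
    (lambda n: n["depth"] == 1 and "butterfly" in n["body_ops"], "FFT"),
    (lambda n: n["depth"] == 4 and "mac" in n["body_ops"], "Conv2D"),
]

def reconstruct(nest):
    for matches, ms_op in TEMPLATES:
        if matches(nest):
            return ms_op        # replaced by a library MS instruction
    return None                 # kept as ordinary multi-scalar code

print(reconstruct({"depth": 1, "body_ops": {"butterfly", "twiddle"}}))  # FFT
print(reconstruct({"depth": 2, "body_ops": {"add"}}))                   # None
```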
- in the CDFG analysis process, the program is analyzed on the control data flow graph (CDFG) rather than on the control flow graph (CFG). This is because SaaP removes the register masking and address resolution mechanisms and organizes data into storage bubbles.
- through the preceding passes, the operations to be executed on heterogeneous IP cores have been identified, and all remaining code is executed on the CPU as multi-scalar tasks. The problem then becomes finding the optimal division of the remaining code into MS instructions.
- a global CDFG is constructed for subsequent use to model the costs of different MS instruction partitions.
- the operations not yet extracted from the program to be compiled can be divided into one or more operation sets according to multiple division schemes on the control data flow graph of the program; the division cost of each scheme is then evaluated, and the scheme with the optimal cost is selected.
- in each division scheme, every operation belongs to one and only one operation set.
- the division can be performed subject to one or more of the following constraints.
- in some embodiments, the number of input data and output data of an operation set must not exceed specified values. For example, the arity of the input data does not exceed 2 and the arity of the output data does not exceed 1, and operations are divided under this constraint.
- in some embodiments, the size of any input or output datum of an operation set must not exceed a specified threshold. Since the storage element backing an MS instruction is a storage bubble with limited capacity, the amount of data processed by an MS instruction must not exceed that capacity.
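- A sketch of checking these constraints on a candidate operation set; the externally visible inputs/outputs are approximated set-theoretically, whereas a real pass would use liveness information on the CDFG:

```python
# Sketch of the constraints above: external input arity <= 2, external
# output arity <= 1, and each datum within an assumed bubble capacity.

MAX_IN, MAX_OUT = 2, 1
BUBBLE_CAPACITY = 512 * 1024   # bytes; assumed upper bubble size

def set_is_legal(op_set):
    produced = {d for op in op_set for d in op["out"]}
    consumed = {d for op in op_set for d in op["in"]}
    ext_in = consumed - produced        # data entering the operation set
    ext_out = produced - consumed       # data leaving it (toy liveness)
    sizes_ok = all(op.get("bytes", 0) <= BUBBLE_CAPACITY for op in op_set)
    return len(ext_in) <= MAX_IN and len(ext_out) <= MAX_OUT and sizes_ok

ops = [{"in": ["a", "b"], "out": ["t"], "bytes": 4096},
       {"in": ["t"], "out": ["c"], "bytes": 4096}]
print(set_is_legal(ops))   # True: two external inputs, one external output
```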
- the division schemes involving conditional operations can include the following.
- a conditional operation and its two branch operations are not placed in the same operation set. Possible reasons for this include: doing so would make the operation set too large; it would violate the input/output constraints; or a branch operation has already been identified as an MS instruction in an earlier pass. In this case, a branch-type MS instruction containing the conditional operation is generated. In general, placing conditional operations in short operation sets yields branch results sooner during execution; for example, the same operation set can be prevented from containing both a conditional operation and non-conditional operations whose execution time exceeds a threshold.
- the division cost of a scheme can be determined based on various factors, including but not limited to: the number of operation sets; the amount of data interaction required between operation sets; the number of operation sets carrying branch functions; and the uniformity of the distribution of the expected execution times of the operation sets. These factors affect the execution efficiency of the instruction pipeline in different ways and can therefore serve as metrics for choosing the division scheme. For example, the number of operation sets directly corresponds to the number of MS instructions; the amount of data interaction between operation sets determines the amount of data IO required; the more branch-type instructions there are, the higher the probability of triggering exceptions that consume the pipeline; and a uniform distribution of expected execution times keeps the pipeline flowing, avoiding stalls caused by one stage taking excessively long.
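- A toy cost model combining these factors; the weights are arbitrary and only illustrate how the factors might compose:

```python
# Sketch of a division-cost metric over a candidate set of operation sets.

from statistics import pstdev

def division_cost(op_sets, w=(1.0, 0.5, 4.0, 2.0)):
    n_sets = len(op_sets)                                  # MS instr count
    traffic = sum(s["io_bytes"] for s in op_sets)          # inter-set IO
    branches = sum(1 for s in op_sets if s["has_branch"])  # exception risk
    skew = pstdev([s["beats"] for s in op_sets])           # time imbalance
    return w[0]*n_sets + w[1]*traffic/1024 + w[2]*branches + w[3]*skew

plan = [{"io_bytes": 4096, "has_branch": False, "beats": 900},
        {"io_bytes": 2048, "has_branch": True,  "beats": 1100}]
print(round(division_cost(plan), 2))   # lower is better across candidates
```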
- the above-mentioned CDFG analysis process is performed after the call mapping process and the reconstruction process; it can therefore be executed only for the operations not recognized in the first two passes, that is, for the remaining operations.
- MS-Cluster (clustering process): this is a transformation pass that clusters nodes in the CDFG to build a complete division into MS instructions.
- each operation set is converted into an MS instruction according to the division scheme determined in the CDFG analysis process. Constrained by the storage bubble capacity, the algorithm minimizes the total cost of the edges cut across MS instruction boundaries.
- MS instructions containing load/store operations, as well as system calls, are assigned to the mother core.
- Fractal-Decompose (fractal decomposition process): this is also a transformation pass, used to decompose MS instructions extracted by the call mapping and reconstruction processes that violate the storage bubble capacity limit, so that SaaP functionality is no longer limited by the storage bubble capacity.
- the decomposition process includes: checking whether a converted MS instruction complies with the storage capacity constraint; and, when it does not, splitting it into multiple MS instructions that implement the same functionality.
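- A sketch of such a capacity-driven split, assuming a halving (fractal-style) decomposition and a simplified byte-count model of each instruction's data footprint:

```python
# Sketch of the capacity check and fractal-style split: an oversized MS
# instruction is divided into homogeneous halves until each part's data
# fits a storage bubble.

BUBBLE_CAPACITY = 512 * 1024   # bytes; assumed upper bubble size

def decompose(instr):
    """Yield MS instructions whose data footprint fits a storage bubble."""
    if instr["bytes"] <= BUBBLE_CAPACITY:
        yield instr                  # already satisfies the constraint
        return
    half = instr["bytes"] // 2       # same operation on half the data
    for part in (dict(instr, bytes=half),
                 dict(instr, bytes=instr["bytes"] - half)):
        yield from decompose(part)

big = {"op": "Eltwadd", "bytes": 3 * 512 * 1024}
print([p["bytes"] for p in decompose(big)])   # four pieces, each <= 512KB
```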
- MS instructions can be decomposed using various instruction decomposition methods, currently known or developed in the future. Since the previously extracted MS instructions are each to be allocated to a certain IP core for execution, the multiple operations constituting such an MS instruction are of the same type, that is, isomorphic, and only need to be adapted to the physical hardware size. Therefore, in some embodiments, this decomposition may simply follow a fractal execution model. For example, reference may be made to the paper by Y. Zhao, Z. Du, Q. Guo, S. Liu, L. Li, Z. Xu, T. Chen, and Y.
- MS instructions include a sub-instruction field and input/output storage bubble information fields, and may further include a system memory address information field, a branch information field, an exception flag field, and so on. Some of these fields are required, such as the sub-instruction field and the exception flag field; others are filled on demand, such as the input/output storage bubble information fields, the system memory address information field, and the branch information field.
- when populating the sub-instruction field, the MS operation may be identified in the sub-instruction field of the MS instruction, and the field may be associated with one or more execution-unit-specific sub-instructions that implement the MS operation.
- a possible discard tag may be inserted in the exception flag field for use during subsequent execution of the MS instruction.
- a branch indicator may be inserted in the branch information field to indicate the likely branch target and/or the unlikely branch target.
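- An illustrative encoding of these fields; the names are descriptive stand-ins, not the disclosure's binary format:

```python
# Illustrative record of the MS instruction fields listed above.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MSInstruction:
    sub_instr: str                                  # required: names the op
    may_discard: bool = False                       # required exception flag
    in_bubbles: list = field(default_factory=list)  # on demand
    out_bubbles: list = field(default_factory=list) # on demand
    mem_addr: Optional[int] = None                  # on demand (mother core)
    likely_target: Optional[int] = None             # on demand (branches)
    unlikely_target: Optional[int] = None           # on demand (branches)

ifcond = MSInstruction("Ifcond", may_discard=True,
                       in_bubbles=["vi1"], out_bubbles=["vo"],
                       unlikely_target=0x200)
print(ifcond)
```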
- Figure 13 shows an example program, where (a) shows the original program to be compiled; the compiled program is divided into two parts, where (b) shows the compiled MS instruction stream and (c) shows the IP-specific MS instruction implementations, i.e., the sub-instructions described previously.
- the original program involves the computation of the Relu layer and the Softmax layer of a neural network in a deep learning application, written, for example, in Python using PyTorch.
- the Relu and Softmax layer computations are expressed as calls to the Torch library. According to the call mapping process described earlier, these function calls can therefore be mapped into MS instructions such as "Matmul" (matrix multiplication), "Eltwadd" (element-wise addition), "Relu", and so on.
- the increment of the variable Epoch and the conditional branch are packaged and mapped into a conditional branch instruction "Ifcond", and a branch indicator is inserted to indicate the likely and unlikely branch targets.
- the Print statement is mapped to another MS command ("Print").
- (c) shows several MS instructions with their IP-specific code.
- "Matmul" provides two IP-specific code implementations, one for the GPU and one for the DLA, so that "Matmul" MS instructions can be scheduled by the instruction dispatcher between the GPU lane and the DLA lane. "Ifcond" provides only CPU-specific code, which reads the value of Epoch from the first input storage bubble (vi1), increments it by 1, and stores it in the output storage bubble (vo). The new Epoch value is then taken modulo 10 and a judgment is made on the result. If it is determined that the "Then" branch is to be taken (this branch corresponds to the unlikely branch), a "UBE" event is initiated.
- the "Ifcond" MS instruction also carries a possible discard tag, so any subsequent large-scale MS instructions are blocked until the "Ifcond" instruction has finished executing.
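- A behavioral sketch of this "Ifcond" example, assuming (as an illustration) that the condition tests whether the new Epoch value modulo 10 equals zero:

```python
# Behavioral sketch of the CPU sub-instruction for "Ifcond": increment
# Epoch, take it modulo 10, and raise a UBE event when the statically
# unlikely "Then" branch turns out to be taken.

class UBE(Exception):
    """Unlikely Branch Exception: static prediction was wrong."""

def ifcond(bubbles):
    epoch = bubbles["vi1"] + 1      # read input bubble, increment
    bubbles["vo"] = epoch           # write output bubble
    if epoch % 10 == 0:             # condition selects the unlikely branch
        raise UBE("redirect to unlikely branch target")

bubbles = {"vi1": 8, "vo": None}
ifcond(bubbles)                     # epoch 9: likely branch, no event
try:
    ifcond({"vi1": 9, "vo": None})  # epoch 10: UBE event is initiated
except UBE as e:
    print("UBE:", e)
```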
- the Print MS instruction is dispatched only to the mother core because this instruction requires system calls and I/O to external devices.
- the program code to be compiled can be in various general programming languages or domain-specific languages.
- various new IP cores can thus be added to a SaaP SoC easily, without much programming/compilation work, so the scalability of the SoC is well supported.
- in addition, the same MS instruction can have multiple versions of sub-instructions, which provides more options for scheduling during instruction execution and helps improve pipeline execution efficiency.
- SaaP thus offers an excellent design alternative to the traditional conception of the heterogeneous SoC.
- MS instructions can be executed speculatively and undone on error without any overhead, because an erroneous instruction leaves no observable side effects in the executing IP core.
- caches need not be kept coherent since there are no duplicate cache lines, and the snoop filter/MESI protocol is eliminated since there is no bus to snoop.
- FIG 14 shows a schematic structural diagram of a board card 1400 according to an embodiment of the present disclosure.
- the board 1400 includes a chip 1401, which may be a SaaP SoC according to an embodiment of the present disclosure, integrating one or more combined processing devices.
- the combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology is widely used in cloud intelligence, and a notable feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage and computing capabilities of the platform.
- the board 1400 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage, and powerful computing capabilities.
- the chip 1401 is connected to an external device 1403 through an external interface device 1402 .
- the external device 1403 is, for example, a server, computer, camera, monitor, mouse, keyboard, network card or Wifi interface.
- the data to be processed can be transferred to the chip 1401 from the external device 1403 through the external interface device 1402.
- the calculation results of the chip 1401 can be transmitted back to the external device 1403 via the external interface device 1402 .
- the external interface device 1402 may take different interface forms, such as a PCIe (Peripheral Component Interconnect Express) interface.
- Board 1400 also includes a memory device 1404 for storing data, which includes one or more memory cells 1405 .
- the storage device 1404 is connected to the control device 1406 and the chip 1401 via a bus and transfers data with them.
- the control device 1406 in the board card 1400 is configured to control the status of the chip 1401.
- the control device 1406 may include a microcontroller, also known as a microcontroller unit (MCU).
- Embodiments of the present disclosure also provide a corresponding compilation device, which includes a processor configured to execute compilation program code and a memory configured to store the compilation program code. When the compilation program code is loaded and executed by the processor, the compilation device carries out the compilation method described in any of the previous embodiments.
- Embodiments of the present disclosure also provide a machine-readable storage medium containing compilation program code which, when executed, causes a machine to perform the compilation method described in any of the previous embodiments.
- the electronic equipment or devices of the present disclosure may include servers, cloud servers, server computing clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
- the means of transportation include airplanes, ships and/or vehicles;
- the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
- the medical equipment includes nuclear magnetic resonance machines, B-mode ultrasound machines, and/or electrocardiographs.
- the electronic equipment or device of the present disclosure can also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and other fields. Furthermore, it can be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic equipment or devices with high computing power according to the present disclosure can be applied to cloud devices (such as cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
- in one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that suitable hardware resources of the cloud device can be matched based on the hardware information of the terminal device and/or the edge device.
- although this disclosure presents some methods and their embodiments as a series of actions and combinations thereof, those skilled in the art will understand that the solutions of this disclosure are not limited by the order of the described actions; based on the disclosure or teachings herein, certain steps may be executed in other orders or simultaneously. Furthermore, the embodiments described in the present disclosure can be regarded as optional embodiments, in that the actions or modules involved are not necessarily required to implement one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in this disclosure have different emphases; parts not described in detail in one embodiment may be understood by reference to the relevant descriptions of other embodiments.
- units illustrated as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units.
- the aforementioned components or units may be co-located or distributed over multiple network units.
- some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
- multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
- the above-mentioned integrated unit can also be implemented in the form of hardware, that is, a specific hardware circuit, which can include digital circuits and/or analog circuits, etc.
- the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, transistors, memristors, and the like.
- the various devices described herein, such as computing devices or other processing devices, can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs (Application-Specific Integrated Circuits), and the like.
- the aforementioned storage unit or storage device can be any appropriate storage medium (including magnetic storage media, magneto-optical storage media, etc.), and can be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, and so on.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Debugging And Monitoring (AREA)
- Advance Control (AREA)
Abstract
An instruction execution method, a system controller, and a related product are provided. The method comprises: during the issuance of mixed-scale instructions, checking whether a mixed-scale instruction to be issued may be discarded; and, when it is determined that said mixed-scale instruction may be discarded, blocking the issuance of a particular mixed-scale instruction to be issued after said mixed-scale instruction.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210764229.7A CN117348929A (zh) | 2022-06-29 | 2022-06-29 | 指令执行方法、系统控制器及相关产品 |
CN202210764229.7 | 2022-06-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024002175A1 true WO2024002175A1 (fr) | 2024-01-04 |
Family
ID=89354529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/103271 WO2024002175A1 (fr) | 2022-06-29 | 2023-06-28 | Procédé d'exécution d'instructions, contrôleur de système et produit associé |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117348929A (fr) |
WO (1) | WO2024002175A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118426841A (zh) * | 2024-06-25 | 2024-08-02 | 飞腾信息技术有限公司 | 指令处理方法、处理器核、处理器、电子设备及存储介质 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117687957B (zh) * | 2024-02-04 | 2024-04-23 | 中国人民解放军海军航空大学 | 一种基于FPGA的Top-k信息处理引擎及其排序方法 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104298552A (zh) * | 2013-07-15 | 2015-01-21 | 华为技术有限公司 | 多线程处理器的线程取指调度方法、系统和多线程处理器 |
CN104424129A (zh) * | 2013-08-19 | 2015-03-18 | 上海芯豪微电子有限公司 | 基于指令读缓冲的缓存系统和方法 |
CN105446773A (zh) * | 2015-11-18 | 2016-03-30 | 上海兆芯集成电路有限公司 | 高速缓存行的非对齐加载指令的推测并行执行系统和方法 |
CN106415515A (zh) * | 2014-06-26 | 2017-02-15 | 英特尔公司 | 使用不具有sfence的优化的pio写入序列来发送分组 |
2022
- 2022-06-29 CN CN202210764229.7A patent/CN117348929A/zh active Pending
2023
- 2023-06-28 WO PCT/CN2023/103271 patent/WO2024002175A1/fr unknown
Also Published As
Publication number | Publication date |
---|---|
CN117348929A (zh) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11893424B2 (en) | Training a neural network using a non-homogenous set of reconfigurable processors | |
WO2024002175A1 (fr) | Procédé d'exécution d'instructions, contrôleur de système et produit associé | |
US11847395B2 (en) | Executing a neural network graph using a non-homogenous set of reconfigurable processors | |
US11625283B2 (en) | Inter-processor execution of configuration files on reconfigurable processors using smart network interface controller (SmartNIC) buffers | |
CN109074260A (zh) | 乱序的基于块的处理器和指令调度器 | |
US10997102B2 (en) | Multidimensional address generation for direct memory access | |
US11182264B1 (en) | Intra-node buffer-based streaming for reconfigurable processor-as-a-service (RPaaS) | |
US11934308B2 (en) | Processor cluster address generation | |
US12079157B2 (en) | Reconfigurable data processor with fast argument load using a runtime program on a host processor | |
US20240231903A1 (en) | Data transfer in dataflow computing systems using an intelligent dynamic transfer engine | |
US12056012B2 (en) | Force quit of reconfigurable processor | |
WO2024002176A1 (fr) | Appareil de traitement d'instructions, procédé d'exécution d'instructions, système sur puce, et carte | |
WO2024002178A1 (fr) | Procédé d'exécution d'instruction, et dispositif de commande de système et produit associé | |
WO2024002172A1 (fr) | Système sur puce, système d'instruction, système de compilation et produit associé | |
WO2023018477A1 (fr) | Architecture de traitement parallèle faisant appel à des fichiers de registres distribués | |
CN117348881A (zh) | 编译方法、编译装置和机器可读存储介质 | |
Meakin | Multicore system design with xum: The extensible utah multicore project | |
US20230385125A1 (en) | Graph partitioning and implementation of large models on tensor streaming processors | |
US20220308872A1 (en) | Parallel processing architecture using distributed register files | |
Volz et al. | IPEC: Open-Source Design Automation for Inter-Processing Element Communication | |
US20220291957A1 (en) | Parallel processing architecture with distributed register files | |
US20230273818A1 (en) | Highly parallel processing architecture with out-of-order resolution | |
KHALILI MAYBODI | A Data-Flow Threads Co-processor for MPSoC FPGA Clusters | |
WO2022251272A1 (fr) | Architecture de traitement parallèle à fichiers de registres distribués | |
WO2023172660A1 (fr) | Architecture de traitement hautement parallèle à résolution dans le désordre |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23830346 Country of ref document: EP Kind code of ref document: A1 |