WO2023123453A1 - Processing method for operation acceleration, method for using an operation accelerator, and operation accelerator - Google Patents


Info

Publication number
WO2023123453A1
Related identifiers: WO2023123453A1, PCT/CN2021/143921, CN2021143921W
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
scalar
instructions
unit
Prior art date
Application number
PCT/CN2021/143921
Other languages
English (en)
French (fr)
Inventor
王雅洁
毕舒展
张迪
官惠泽
梁晓峣
景乃锋
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202180029995.2A (CN116685964A)
Priority to PCT/CN2021/143921 (WO2023123453A1)
Publication of WO2023123453A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • The present application relates to the fields of computer architecture and artificial intelligence technology in intelligent vehicles, in particular to artificial intelligence chip technology, and more specifically to a processing method for computation acceleration, a method for using a computation accelerator, and a computation accelerator.
  • Deep learning (DL) and neural networks (NN) in the field of artificial intelligence (AI) demand ever-increasing computing power from computing resources.
  • General-purpose computing hardware such as the central processing unit (CPU) can no longer meet the large-scale, high-throughput computing requirements of some large neural networks.
  • AI chips or AI computing devices equipped with multiple AI processors are therefore usually chosen to meet the computing power requirements of the large-scale computation needed for neural network training or inference.
  • Operator-code auto-generation software, such as the Compute Architecture for Neural Networks (CANN), is typically used to generate or design neural network algorithms.
  • Automatic compilation tools or computation scheduling templates, such as the Tensor Virtual Machine (TVM), are also commonly used.
  • However, the fixed-width SIMD pipeline of the AI processor may be too wide for the data to be processed, which adversely affects the execution of compute-intensive AI tasks on the target AI chip or AI computing device.
  • The application provides an operation accelerator, a processing method for operation acceleration, a method for using the operation accelerator, an artificial intelligence processor, an artificial intelligence processing device, and an electronic device, which can improve the utilization of the vector calculation unit in the operation accelerator and its single-execution performance, thereby improving the performance of the operation accelerator as a whole.
  • The first aspect of the present application provides an operation accelerator, including: a storage unit configured with at least one vector instruction queue, each vector instruction queue being used to cache one or more vector instructions; at least two scalar calculation units, each of which is used to obtain instructions and decode them to obtain decoded instructions, where the decoded instructions include vector instructions, and each scalar calculation unit is also used to cache the vector instructions into at least one vector instruction queue; and a vector calculation unit, used to execute the vector instructions in the vector instruction queue.
  • At least two scalar calculation units read and decode the vector instructions, the decoded vector instructions are cached in the vector instruction queue configured in the storage unit, and the decoded vector instructions are executed by the vector calculation unit.
  • After the number of scalar calculation units is increased, more vector instructions can be read and more vector instructions can be issued to the vector calculation unit, thereby improving the utilization of the vector calculation unit.
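  • As a minimal illustration of this organization (the class and member names below are hypothetical and not taken from the application), the following C++ sketch models several scalar units that each cache decoded vector instructions into their own queue in the storage unit, from which a single vector unit drains and executes work:

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical decoded vector instruction: an opcode, a lane mask, operand
// addresses, and an execution latency, as configured by a scalar unit after decode.
struct VectorInstr {
    uint32_t opcode;
    uint32_t mask;       // number of SIMD lanes this instruction occupies
    uint64_t src_addr;
    uint64_t dst_addr;
    uint32_t latency;    // execution latency in clock cycles
};

// Each scalar unit caches its decoded vector instructions in a vector instruction queue.
struct ScalarUnit {
    std::deque<VectorInstr> vector_queue;

    // Fetch and decode are abstracted away; the decoded vector instruction is
    // simply cached into this unit's vector instruction queue.
    void decode_and_cache(const VectorInstr& vi) { vector_queue.push_back(vi); }
};

// The vector unit drains instructions from the queues of all scalar units.
struct VectorUnit {
    void execute(const VectorInstr& vi) {
        (void)vi;  // issuing vi to the SIMD pipeline is omitted in this sketch
    }

    void drain(std::vector<ScalarUnit>& scalars) {
        for (auto& s : scalars) {
            while (!s.vector_queue.empty()) {
                execute(s.vector_queue.front());
                s.vector_queue.pop_front();
            }
        }
    }
};
```

  • With M scalar units decoding in parallel, the vector unit is fed M instruction streams instead of one, which is the source of the utilization gain described above.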
  • the vector computing unit is configured to execute at least two vector instructions in the vector instruction queue within one execution cycle.
  • the vector computing unit executes at least two decoded vector instructions in one execution cycle. In this way, the processing efficiency of the vector instruction is improved, and the utilization rate of the vector calculation unit is improved.
  • At least two decoded vector instructions executed in one execution cycle have the same execution delay.
  • In this way, divergence between vector instructions at the write-back stage, caused by differing execution latencies, is avoided, so the control logic of the vector calculation unit's write-back stage remains unchanged.
  • an assembling unit is further included, configured to assemble at least two vector instructions in the vector instruction queue to obtain assembled vector instructions, and provide the assembled vector instructions to the vector calculation unit for execution.
  • At least two decoded vector instructions are assembled by the assembly unit, and the assembled vector instruction is executed by the vector calculation unit.
  • the vector computing unit executes the assembled vector instruction, which improves the processing efficiency of the vector instruction and improves the utilization rate of the vector computing unit.
  • The assembly unit includes a logic storage module and an assembly module; the logic storage module is configured with at least one assembly queue, which is used to cache decoded vector instructions extracted from at least two vector instruction queues, and the assembly module is used to extract at least two decoded vector instructions from the assembly queue according to execution latency and assemble them.
  • The vector instructions decoded by at least two scalar calculation units are respectively buffered in at least two corresponding vector instruction queues, the assembly queue configured in the assembly unit caches the vector instructions extracted from the at least two vector instruction queues, and the assembly module extracts any at least two vector instructions from the assembly queue according to execution latency and assembles them.
  • each scalar computing unit and its corresponding vector instruction queue are taken as a whole, and other scalar computing units and their corresponding vector instruction queues are taken as other wholes. Each whole is independent of each other and executed in parallel, which improves the overall performance of each scalar computing unit.
  • the assembly module of the assembly unit uses the logical storage module configured with the assembly queue to assemble the vector instructions extracted from the vector instruction queue, and can assemble the vector instructions flexibly, reliably and efficiently. Assembling the vector instructions according to the execution delay is beneficial to improve the processing efficiency when the assembled vector instructions are executed by the vector computing unit.
  • the first aspect also includes: a data handling unit; the storage unit is further configured with at least one data handling instruction queue, and each data handling instruction queue is respectively used to cache one or more data handling instructions;
  • the decoded instructions also include data transfer instructions, and each scalar calculation unit caches the data transfer instructions in at least one data transfer instruction queue; the data transfer unit is used to execute the decoded data transfer instructions.
  • At least two scalar calculation units read the data transfer instruction and decode the data transfer instruction.
  • The decoded data transfer instruction is cached in the data transfer instruction queue configured in the storage unit, and the data transfer unit executes the decoded data transfer instructions. After the number of scalar calculation units is increased, more data transfer instructions can be read and more data transfer instructions can be provided to the data transfer unit, thereby improving the utilization of the data transfer unit.
  • the decoded instructions further include scalar instructions, and each scalar computing unit is further configured to execute the scalar instructions.
  • At least two scalar computing units read scalar instructions, decode the scalar instructions, and execute the decoded scalar instructions.
  • Each scalar calculation unit respectively executes the decoded scalar instruction, and realizes the program flow control of the operation accelerator in a coordinated manner as a whole.
  • the number of read scalar instructions can be increased, thereby improving the processing efficiency of scalar instructions.
  • At least two scalar computing units include a master scalar computing unit and at least one slave scalar computing unit, the master scalar computing unit is used to control the start or stop of each slave scalar computing unit, or Controls the synchronization between scalar computing units.
  • the at least two scalar computing units constitute a cluster, and in the cluster, a master scalar computing unit and at least one slave scalar computing unit are pre-designated.
  • the designated master scalar computing unit controls the start or stop of the slave scalar computing unit, or controls the synchronization of at least two scalar computing units including the master scalar computing unit, so that each scalar computing unit in the cluster starts or stops according to the specified role, And it executes the synchronization function according to the specified role, with clear logic and reliable operation, and realizes the coordinated operation of the computing accelerator.
  • The second aspect of the present application provides an operation acceleration processing method, including: each of at least two scalar calculation units acquires an instruction and decodes it to obtain a decoded instruction, where the decoded instruction includes a vector instruction; each scalar calculation unit caches the vector instruction into at least one vector instruction queue, the vector instruction queue being configured in the storage unit and each vector instruction queue being used to cache one or more vector instructions; and the vector instructions in the vector instruction queue are executed by the vector calculation unit.
  • the vector calculation unit executes at least two vector instructions in the vector instruction queue within one execution cycle.
  • At least two decoded vector instructions executed in one execution cycle have the same execution delay.
  • the method further includes: using the assembly unit to assemble at least two vector instructions in the vector instruction queue to obtain an assembled vector instruction, and providing the assembled vector instruction to the vector calculation unit for execution.
  • The assembly unit includes a logic storage module and an assembly module; the assembly queue configured in the logic storage module caches the decoded vector instructions extracted from at least two vector instruction queues, and the assembly module extracts at least two decoded vector instructions from the assembly queue according to execution latency and assembles them.
  • The decoded instruction also includes a data transfer instruction, and each scalar calculation unit caches the data transfer instruction into at least one data transfer instruction queue; the data transfer instruction queue is configured in the storage unit, and each data transfer instruction queue is used to buffer data transfer instructions; the decoded data transfer instruction is also executed by the data transfer unit.
  • the instructions further include scalar instructions, and each scalar computing unit executes the scalar instructions.
  • The at least two scalar calculation units include a master scalar calculation unit and at least one slave scalar calculation unit; the start or stop of each scalar calculation unit is controlled through the master scalar calculation unit, or synchronization between the scalar calculation units is controlled through it.
  • The master scalar calculation unit is used to control the startup of each scalar calculation unit, including: when the master scalar calculation unit runs to a startup instruction that identifies the multi-work-unit mode, it controls the activation of the corresponding number of slave scalar calculation units according to the quantity specified in the startup instruction.
  • The third aspect of the present application provides a method for using a computing accelerator, including: generating identifiers pointing to M work units according to the specified number M of work units; and processing the data using at least one vector operation function supported by the computing accelerator described in the first aspect, where at least one parameter of the vector operation function references an identifier.
  • When the computing accelerator is used for computation acceleration, identifiers pointing to M work units are generated according to the specified number M of work units, and the data is processed using at least one vector operation function supported by the computing accelerator; for example, at least one parameter of the vector operation function references an identifier as the way to allocate the data that each work unit processes individually. The multi-work-unit mode of the computing accelerator is thereby specified and the data processed by each work unit is allocated separately, realizing parallel processing of the data.
  • This method of using a computing accelerator for computation acceleration facilitates program code with a complete structure and clear logic, is highly user-friendly and easy to use, and is conducive to promoting wide application of the computing accelerator in industry.
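  • A hedged sketch of what such a multi-work-unit program might look like is given below. The function names (set_worker_num, get_worker_id, vec_add) are illustrative placeholders rather than the accelerator's actual API; the point is only that at least one parameter of the vector operation function depends on the work-unit identifier, so each work unit processes its own slice of the data.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the accelerator's programming interface; in a real
// program these would be provided by the accelerator's toolchain.
void set_worker_num(int /*m*/) { /* would start M work units on the device */ }
int  get_worker_id()           { return 0; /* each work unit sees its own id, 0..M-1 */ }

// Illustrative "vector operation function": element-wise add over one slice.
void vec_add(float* dst, const float* a, const float* b, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) dst[i] = a[i] + b[i];
}

// Multi-work-unit kernel: each work unit processes the slice selected by its identifier.
void kernel(const float* src0, const float* src1, float* dst, std::size_t total_len) {
    constexpr int M = 4;              // specified number of work units
    set_worker_num(M);                // generate identifiers pointing to M work units
    const int worker_id = get_worker_id();

    const std::size_t chunk  = total_len / M;
    const std::size_t offset = static_cast<std::size_t>(worker_id) * chunk;

    // At least one parameter of the vector operation function references the
    // identifier (here, through the per-worker offset into the data).
    vec_add(dst + offset, src0 + offset, src1 + offset, chunk);
}
```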
  • The third aspect also includes: processing the data using at least one data transfer function supported by the computing accelerator, where at least one parameter of the data transfer function references an identifier; or using a synchronization wait function supported by the computing accelerator to specify synchronization of the M work units.
  • The data processed by each work unit is allocated by having at least one parameter of the data transfer function reference an identifier, and the M work units are synchronized when the data allocated for processing has dependencies.
  • The multi-work-unit mode of the computing accelerator is specified and the data processed by each work unit is allocated separately, realizing parallel processing of the data, and each work unit is made to wait synchronously when the data has dependencies.
  • This method of using a computing accelerator for computation acceleration facilitates program code with a complete structure and clear logic, is highly user-friendly and easy to use, and is conducive to promoting wide application of the computing accelerator in industry.
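  • Continuing the previous sketch (again with hypothetical placeholder names, not the accelerator's actual API), the data transfer function below selects its source address through the work-unit identifier, and a synchronization wait function is called before a step that depends on all M tiles being in place.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-ins; real programs would use the accelerator's own
// data-transfer and synchronization-wait functions.
int  get_worker_id()          { return 0; /* identifier of the current work unit */ }
void sync_wait_all(int /*m*/) { /* barrier among the M work units */ }

// Illustrative "data transfer function": move one tile into a local buffer.
void data_copy(float* local_buf, const float* global_src, std::size_t len) {
    std::memcpy(local_buf, global_src, len * sizeof(float));
}

void stage(const float* global_src, std::size_t tile_len, int M) {
    const int worker_id = get_worker_id();
    std::vector<float> local_buf(tile_len);

    // A parameter of the data transfer function references the identifier:
    // each work unit copies the tile selected by its worker_id.
    data_copy(local_buf.data(),
              global_src + static_cast<std::size_t>(worker_id) * tile_len,
              tile_len);

    // The tiles produced by the M work units are consumed together afterwards,
    // so the work units synchronize before the dependent computation.
    sync_wait_all(M);
}
```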
  • The fourth aspect of the present application provides an artificial intelligence processor, including: at least one computing accelerator described in the first aspect; a processor; and a memory on which data and programs are stored. When a program is executed by the processor, at least one operation accelerator performs the operation specified in the program on the data and returns the result of the operation to the processor or the memory.
  • the artificial intelligence processor is integrated with at least one computing accelerator.
  • When an artificial-intelligence-based application program is executed to process data, the at least one computing accelerator performs the operation specified in the program on the data and returns the result of the operation to the processor or memory; it is easy to integrate and flexible to use.
  • the fifth aspect of the present application provides an artificial intelligence processing device, including: a processor, and an interface circuit, wherein the processor accesses the memory through the interface circuit, and the memory stores programs and data.
  • The processor accesses, through the interface circuit, at least one operation accelerator described in the first aspect, so that the at least one operation accelerator performs the operation specified in the program on the data and returns the result of the operation to the processor or the memory.
  • The artificial intelligence processing device is provided with computing accelerators accessed through the bus.
  • When its processor executes an artificial-intelligence-based application program to process data, at least one of these computing accelerators performs the calculation specified in the application program on the data and returns the result to the processor or memory; it is easy to integrate and flexible to use.
  • the sixth aspect of the present application provides an electronic device, including: a processor, and a memory, on which a program is stored, and when the program is executed by the processor, the method described in the third aspect is executed.
  • FIG. 1A is a schematic diagram of the architecture of a computing accelerator with a domain-specific architecture;
  • FIG. 1B is a schematic diagram of execution of the computing accelerator shown in FIG. 1A;
  • FIG. 1C is a schematic diagram of the utilization rate of the Vector unit of the computing accelerator shown in FIG. 1A;
  • FIG. 2A is a schematic diagram of the composition of the computing accelerator of the embodiment of the present application.
  • FIG. 2B is another schematic diagram of the composition of the computing accelerator of the embodiment of the present application.
  • FIG. 2C is another schematic diagram of the composition of the computing accelerator of the embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for using a computing accelerator according to an embodiment of the present application
  • FIG. 4A is a schematic diagram of the composition of another computing accelerator according to the embodiment of the present application.
  • FIG. 4B is a schematic diagram of execution of the computing accelerator in FIG. 4A;
  • FIG. 4C is a schematic diagram of the composition of the Vector instruction assembly unit of the computing accelerator of FIG. 4A;
  • FIG. 4D is a schematic diagram of the composition of the MTE instruction merging unit of the computing accelerator of FIG. 4A;
  • FIG. 5A is a schematic diagram of the operation accelerator of the embodiment of the present application performing vector instruction assembly to merge and process data;
  • FIG. 5B is a schematic diagram of the operation accelerator of the embodiment of the present application performing vector instruction assembly to increase the utilization of the Vector unit;
  • FIG. 5C is a schematic diagram of parameter meanings of vector instructions supported by the computing accelerator of the embodiment of the present application.
  • FIG. 6A is a schematic diagram of the code when the multi-work unit mode is executed outside the multi-core mode supported by the computing accelerator of the embodiment of the present application;
  • FIG. 6B is a schematic diagram of the code when executing the multi-work unit mode in the multi-core mode supported by the computing accelerator provided by the embodiment of the present application;
  • FIG. 7A is a schematic connection diagram when the computing accelerator provided by the embodiment of the present application is used as a terminal device
  • FIG. 7B is a schematic diagram of the connection of the computing accelerator provided by the embodiment of the present application as a single board working in RC mode;
  • FIG. 8 is a schematic diagram of the composition of an artificial intelligence processing device provided by an embodiment of the present application.
  • Computing acceleration solutions include computing accelerators, computing acceleration processing methods, artificial intelligence processing systems, electronic devices, computing equipment, computer-readable storage media, and computer program products. Since the principles by which these technical solutions solve the problem are the same or similar, some repeated content is not described again in the following specific embodiments, and these specific embodiments should be considered to reference each other and to be combinable with each other.
  • FIG 1A shows an example of the architecture of an AI computing accelerator of a certain domain specific architecture (Domain Specific Architecture, DSA), that is, an AI core (AI Core).
  • the AI core may include a computing unit, a storage system, and a system control module.
  • The computing unit provides a variety of basic computing resources for performing matrix, vector, and scalar computation, and includes a matrix calculation unit (Cube Unit, hereinafter referred to as the Cube unit), a vector calculation unit 300 (Vector Unit, hereinafter referred to as the Vector unit), and a scalar calculation unit 200 (Scalar Unit, hereinafter referred to as the Scalar unit).
  • The Cube unit executes matrix operation instructions (hereinafter referred to as Cube instructions) to complete matrix multiplication and matrix addition operations;
  • the Vector unit executes vector operation instructions (hereinafter referred to as Vector instructions or vector instructions) to complete vector-type operations, for example vector addition and multiplication;
  • the Scalar unit is responsible for various types of scalar calculation and program flow control, and uses a multi-stage pipeline to complete program flow control such as loop control, branch judgment, and synchronization, address and parameter calculation for Cube or Vector instructions, and basic arithmetic operations.
  • each computing unit forms its own independent execution pipeline, and cooperates with each other to achieve optimized computing efficiency under the unified scheduling of the system control module 13 (System Control) running in the device operating system (Device OS).
  • the storage system includes the internal storage (In Core Buffer) of the AI core and the corresponding data path, where the data path may include the circulation path of the data in which the AI core completes a computing task, such as the bus interface unit 14 (Bus Interface Unit, BIU).
  • The AI core internal storage (In Core Buffer), as the internal storage resource of the AI core, is connected with the bus interface unit 14, so that external data, such as data in the level-2 buffer (L2 Buffer, L2), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM, DDR for short), High Bandwidth Memory (HBM), or Global Memory (GM), is obtained through the BIU and the bus (Bus) and transferred to the AI core internal storage.
  • the internal storage of the AI core includes a primary buffer 121 (L1 Buffer), a unified buffer 122 (Unified Buffer, UB), a general-purpose register 123 (General-Purpose Register, GPR), a special-purpose register 124 (Special-Purpose Register , SPR) etc.
  • The instruction cache 125 (Instruction Cache, I Cache) is the instruction cache area inside the AI Core; the instruction sequence generated, after computation scheduling, from the neural network algorithm designed by the user for the AI task is stored in the instruction cache. In storage access, the level-1 buffer has higher priority than the level-2 buffer.
  • The internal storage of the AI core also includes high-speed register units that provide temporary variables. These high-speed register units can be located in the aforementioned calculation units, such as the source register workspace (Src Reg Workspace) and target register workspace (Dst Reg Workspace) that provide operands to the Vector unit in FIG. 4C.
  • a data transfer unit 600 (Memory Transfer Engine, hereinafter referred to as the MTE unit) can also be set in the AI Core.
  • The MTE unit executes data transfer instructions (hereinafter referred to as MTE instructions), which can realize data format conversion with extremely high efficiency, and can also complete data transfer from AI core external storage to AI core internal storage, from AI core internal storage to AI core external storage, and between different buffers of the AI core internal storage.
  • the data transfer unit 600 shown in FIG. 1A may include multiple MTE units during specific implementation, which are respectively used to implement different types of data transfer.
  • the working principle of the AI Core is shown in Figure 1B.
  • the instruction sequence is fetched, decoded, and executed in a scheduling manner of "reading instructions sequentially and executing instructions in parallel".
  • the sequence of instructions is sequentially read and decoded by the Scalar unit.
  • the decoded instructions include Scalar instructions, Cube instructions, Vector instructions, and MTE instructions.
  • Among the Scalar instructions, instructions related to program flow control are sent to the Processing Scalar Queue (PSQ) for decoding and subsequent execution, while instructions related to scalar calculation are executed directly by the Scalar unit.
  • After processing by the Scalar unit, the addresses of the operands and the parameters involved in the operation have been configured.
  • the instruction issuing unit 15 distributes each instruction to the corresponding instruction queue, and schedules the corresponding computing units to execute in parallel.
  • When the instructions cached in the vector instruction queue are executed by the Vector unit, they may have different execution latencies.
  • When the instructions cached in the MTE instruction queue are executed by the MTE unit, they may have different execution latencies.
  • Cube instructions are distributed to the matrix instruction queue 161 (Cube instruction queue, hereinafter referred to as the Cube Queue), which caches each decoded matrix operation instruction;
  • Vector instructions are distributed to the vector instruction queue 162 (vector instruction queue, hereinafter referred to as the Vector Queue), which caches each decoded vector operation instruction;
  • data transfer instructions are distributed to the MTE instruction queue, which caches each decoded data transfer instruction. The instructions cached in each queue are subsequently executed by the corresponding computing unit.
  • instruction pipelines of different queues are executed in parallel, which improves instruction execution efficiency.
  • the execution order of instructions between different instruction queues can be sequential or out-of-order, but the internal instructions of the queue must be executed sequentially.
  • an event synchronization module 17 (Event Synchronized) can also be set in the AI Core. During the execution of the instructions in the same instruction queue, when there is a dependency relationship or a mandatory time sequence, the event synchronization module 17 is used to adjust and maintain the execution sequence requirements of the instructions.
  • the event synchronization module can be controlled by software, for example, by inserting synchronization functions or synchronization instructions to specify the execution timing of each pipeline, so as to achieve the purpose of adjusting the order of instruction execution.
  • The Vector unit can be a Single Instruction Multiple Data (SIMD) computing unit that executes the vector operation requests submitted by the SIMD instruction model. It provides an instruction pipeline width of 64 lanes in 32-bit floating-point (FP32) mode, or 128 lanes in 16-bit floating-point (FP16) mode, and the width can be adaptively adjusted according to the precision of the data to be processed. For example, it can process up to 64 FP32 values in one execution; in FP64 mode it can process up to 32 FP64 values in one execution, giving powerful and flexible vector computing capability.
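  • The lane counts quoted above are consistent with a SIMD datapath of one fixed total width; the relation below is an inference from the numbers given rather than a figure stated in the application:

$$\text{lanes}(b) = \frac{W}{b}, \qquad W = 128 \times 16\,\text{bit} = 64 \times 32\,\text{bit} = 32 \times 64\,\text{bit} = 2048\,\text{bit},$$

  so FP16 yields 128 lanes, FP32 yields 64 lanes, and FP64 yields 32 lanes.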
  • the width of the SIMD pipeline occupied by this instruction execution can be flexibly adjusted to adapt to the tail block size or data volume of the data to be processed.
  • The Vector unit runs on a fixed clock with an independent execution pipeline. If no valid vector operation request is received, the Vector unit performs a no operation (No Operation, NOP); while performing a no operation, the utilization of the Vector unit is zero.
  • When the Vector unit executes a vector operation request with 16-bit floating-point precision (Float Point 16 bit, i.e., 2 bytes represent one floating-point number, denoted FP16), the maximum SIMD pipeline width provided by the Vector unit is 128 lanes. If the mask parameter in the vector operation request submitted by the SIMD instruction model is set to 128, the Vector unit performs the specified operation (OP) on the source or target operands across all 128 instruction lanes in one execution cycle, and the utilization of the Vector unit is 100%.
  • In some cases, the value of the mask parameter is determined to be 16 during software development. For example, when the input data is arranged in the memory format "N, C/16, H, W, 16" and nearest-neighbor interpolation must be calculated for the input data, in some cases the user can only generate a 1*16 SIMD instruction model filled with destination data. As shown in FIG. 1C, when the mask parameter is set to 16, only the first 16 lanes participate in the calculation, the remaining 112 lanes are idle, and the utilization of the Vector unit is 1/8. Similarly, when the mask parameter is set to 32 or 64, the utilization of the Vector unit is 1/4 or 1/2 respectively.
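  • Equivalently, the single-execution utilization of the Vector unit under a given precision is the ratio of active lanes (the mask value) to the maximum lane width, matching the figures above:

$$\text{utilization} = \frac{\text{mask}}{\text{lanes}_{\max}}, \qquad \frac{16}{128} = \frac{1}{8}, \quad \frac{32}{128} = \frac{1}{4}, \quad \frac{64}{128} = \frac{1}{2}.$$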
  • The embodiments of the present application provide a method that can flexibly use the maximum possible width of the SIMD pipeline according to the data to be processed.
  • the value of the mask parameter in the SIMD instruction model is related to factors such as the amount of data and the tail block size of the data.
  • For situations with sufficient data volume, the solution provided by the embodiments of the present application can improve the utilization of the Vector unit and its single-execution efficiency through instruction assembly (including operation assembly and operand merging), thereby improving the performance of the operation accelerator as a whole and enhancing its applicability to operator-code auto-generation software and computation scheduling templates.
  • The computing accelerator is provided with a master Scalar unit and at least one slave Scalar unit, each of which fetches and decodes instructions to obtain decoded Vector instructions.
  • The operation accelerator assembles at least two decoded Vector instructions into a wide instruction, in which two or more vector operations with the same execution latency are specified, together with the source operand addresses or destination operand addresses of the at least two Vector instructions.
  • The Vector unit set in the computing accelerator executes the wide instruction in one execution cycle; within that execution cycle, the Vector unit processes more data, has a higher utilization, and has higher single-execution efficiency.
  • the master Scalar unit and at least one slave Scalar unit provided by the computing accelerator fetch and decode instructions respectively to obtain decoded MTE instructions.
  • the computing accelerator also assembles at least two decoded MTE instructions into a wide instruction, specifying two or more MTE operations in the wide instruction, and specifying source operands of the at least two decoded MTE instructions or the destination operand.
  • The MTE unit set in the computing accelerator executes the wide instruction in only one execution cycle; the MTE unit thus processes more data per execution cycle, and its single-execution efficiency is higher.
  • the solution provided by the embodiment of the present application also provides a multi-worker mode programming model to use the master Scalar unit and at least one slave Scalar unit provided by the target computing accelerator.
  • a method of specifying at least one parameter of the vector operation function, the data transfer function and the synchronization waiting function by using the identifier (worker_id) of the work unit is respectively provided.
  • the code for processing data written above is compiled to obtain the instruction sequence.
  • The utilization of the Vector unit in the AI Core is higher and its single-execution performance is higher; the MTE unit processes more data per execution cycle and its single-execution performance is higher, which in turn improves the performance of the target computing accelerator as a whole.
  • The computing accelerator of an embodiment of the present application includes: a storage unit 100 configured with at least one vector instruction queue 100A, each vector instruction queue being used to cache one or more vector instructions; at least two scalar calculation units 200, each of which is used to acquire instructions and decode them to obtain decoded instructions, where the decoded instructions include vector instructions, and each scalar calculation unit 200 is also used to cache vector instructions into at least one vector instruction queue; and a vector calculation unit 300 configured to execute the vector instructions in the vector instruction queue.
  • the computing accelerator can be an AI chip or an AI single board equipped with at least one aforementioned AI Core, and the AI chip or AI single board includes but is not limited to a neural network processor (Neural-network Processing Unit, NPU) , Graphics Processing Unit (GPU), Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (Field Programmable Gate Array, FPGA), etc.
  • The scalar calculation unit may be a CPU on an operation accelerator. In some embodiments, the scalar calculation unit 200 may be the aforementioned Scalar unit. The scalar calculation unit performs instruction fetch, decode, execute, memory access, and write-back operations.
  • the above scalar computing units 200 are respectively used to acquire instructions and decode the instructions.
  • the instructions include vector instructions.
  • Each scalar calculation unit fetches instructions corresponding to the identifier (worker_id) of its work unit and decodes the fetched instructions.
  • With the specified number M of work units as the total number of iterations, each instruction designates a different portion of the data as its source operand in each iteration by referencing a worker_id whose value is incremented. Subsequently, the same instruction, or different instructions, referencing a worker_id with the same value are assembled into a wide instruction.
  • For the specific assembly method please refer to the description below.
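  • A minimal sketch of the per-work-unit operand partitioning described above (the structure names and address arithmetic are illustrative assumptions, not the application's concrete layout): each value of worker_id selects a disjoint base address and lane mask for the corresponding instruction.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: derive each work unit's source-operand base address from its
// worker_id, so that every worker_id value selects a different portion of the data.
struct OperandSlice {
    uint64_t base;   // source operand base address for this worker_id
    uint32_t mask;   // lanes occupied by this work unit's portion
};

std::vector<OperandSlice> partition(uint64_t data_base, uint32_t elem_bytes,
                                    uint32_t lanes_per_worker, int M) {
    std::vector<OperandSlice> slices;
    for (int worker_id = 0; worker_id < M; ++worker_id) {  // worker_id is incremented
        OperandSlice s;
        s.base = data_base +
                 static_cast<uint64_t>(worker_id) * lanes_per_worker * elem_bytes;
        s.mask = lanes_per_worker;
        slices.push_back(s);
    }
    return slices;
}
```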
  • When the M scalar calculation units provided in the computing accelerator are used to accelerate computation, the application program must specify in advance that the M scalar calculation units are to be started, by invoking the identifiers corresponding to the M work units (workers).
  • This determines the working mode of the cluster composed of scalar calculation units. Therefore, when the specified number of work units is M, the number of scalar calculation units used to acquire and decode instructions is M; disregarding time occupied by other processes such as synchronization waits between scalar calculation units, the efficiency of instruction-sequence fetching and decoding per unit time increases to M times the original, which provides a sufficient data volume and number of instructions for the subsequent, more efficient use of the vector calculation unit or data transfer unit.
  • The above vector instruction queue configured in the storage unit can be a vector instruction queue 100A jointly configured for all or several scalar calculation units, that is, the number of scalar calculation units is M and the number of vector instruction queues is 1.
  • the instructions decoded by each scalar calculation unit are sequentially cached in the common vector instruction queue 100A according to the sequence of decoding time.
  • each scalar computing unit 200 may be configured with a vector instruction queue 100A in one-to-one correspondence, that is, the number of scalar computing units is M, and the number of vector instruction queues is also M.
  • the instructions decoded by each scalar calculation unit are independently cached in the corresponding vector instruction queue 100A.
  • The storage unit uses the caching mechanism so that fetching instructions from the vector instruction queue for execution is a first transaction and adding newly decoded vector instructions to the vector instruction queue is a second transaction; the first and second transactions can be performed simultaneously or at intervals, which ensures that the number and content of instructions in the vector instruction queue are continuously updated, supports each scalar calculation unit in fetching, decoding, and storing decoded instructions in parallel, and improves the execution efficiency of the computing accelerator as a whole.
  • the cached instructions in the vector instruction queue 100A come from at least two scalar computing units 200, so it is necessary to implement more complex execution logic and business scheduling.
  • The decoded vector instruction executed by the vector calculation unit 300 in a single execution may be a decoded instruction from any one scalar calculation unit, or a wide instruction assembled from the decoded instructions of any multiple scalar calculation units.
  • The computing accelerator of this embodiment uses at least two scalar calculation units to acquire and decode instructions respectively, the storage unit caches the decoded vector instructions of each scalar calculation unit, and the vector calculation unit executes the decoded vector instructions, forming a stable and efficient processing pipeline with high overall execution efficiency; after the number of scalar calculation units increases, more scalar calculation units can fetch and decode more vector instructions, thereby increasing the number of vector instructions provided to the vector calculation unit and improving its utilization.
  • The vector calculation unit, that is, the aforementioned Vector unit, is used to execute the decoded vector instructions; in particular, the vector calculation unit 300 is used to execute at least two decoded vector instructions in one execution cycle.
  • the decoded vector instructions may adopt a SIMD instruction model.
  • the execution cycle here refers to the execution delay required when the Vector unit executes the operation specified in the current SIMD instruction model.
  • Execution latency refers to the number of clock cycles required for a SIMD instruction model to be executed by the Vector unit. For example, if the execution latency of the Vector unit when performing floating-point division is 5 clock cycles, then the Vector unit is occupied for 5 consecutive clock cycles performing the operations corresponding to the floating-point division.
  • The vector calculation unit 300 executes at least two decoded vector instructions in one execution cycle T; compared with executing a single decoded vector instruction in the same cycle T, this improves the processing efficiency of vector instructions and the utilization of the vector calculation unit.
  • When the at least two vector instructions have different execution latencies, the execution cycle of this execution of the vector calculation unit is the maximum of those execution latencies.
  • When at least two vector instructions with different execution latencies are submitted to the Vector unit for execution at the same time, the utilization of the Vector unit is improved, but the longer execution latency may reduce the overall single-execution performance of the Vector unit.
  • At least two decoded vector instructions executed in one execution cycle have the same execution latency.
  • the Vector unit simultaneously executes at least two vector instructions with the same execution delay within one execution cycle, and the corresponding execution delay is the same as the execution delay when each instruction is executed separately, but within one execution cycle More data is processed, the utilization rate of the Vector unit is higher, and the performance of each execution is higher, which in turn improves the performance of the AI Core and the performance of the computing accelerator as a whole.
  • At least two decoded vector instructions executed in one execution cycle have the same execution latency, which avoids divergence between the vector instructions at the write-back stage caused by differing execution latencies, thereby keeping the control logic of the vector calculation unit's write-back stage unchanged.
  • the computing accelerator of the embodiment of the present application may further include a vector instruction assembly unit 400, configured to assemble at least two decoded vector instructions; the vector calculation unit is also used to execute the assembled vector instructions.
  • The vector instruction assembly unit 400 can be set in the master Scalar unit or in each slave Scalar unit. As described above and shown in FIG. 1B, after the vector instructions in the instruction sequence are processed by the Scalar unit, the addresses of the source and/or target operands and the parameters (such as the mask, the number of repetitions repeat, and the stride) have all been configured, and the instructions are sent directly to the Vector unit for execution when the Vector unit is idle. On the basis that the master Scalar unit or the slave Scalar unit has the functions of the Scalar unit in FIG. 1B, only an instruction assembly function needs to be added to realize the function of the aforementioned assembly unit.
  • In this case, each Scalar unit has the instruction assembly function, assembles the vector instructions fetched and decoded by itself, and sends the assembled wide instructions to the Vector unit for execution in turn.
  • Each Scalar unit assembles instructions independently, and the wide instructions come only from the vector instructions fetched and decoded by that Scalar unit.
  • As a result, the assembly efficiency is low; moreover, each Scalar unit having its own instruction assembly function makes the system structure more complicated, which is not conducive to improving overall execution efficiency.
  • a common vector instruction assembly unit 400 can be set for all or more master Scalar units and slave Scalar units, and at least two decoded vectors can be assembled by the vector instruction assembly unit 400 instruction, and provide the assembled vector instruction, that is, the aforementioned wide instruction, to the vector computing unit, so that the vector computing unit, that is, the aforementioned Vector unit, executes the assembled vector instruction.
  • The vector instruction assembly unit 400 can extract the vector instructions decoded by all or several of the master and slave Scalar units; the sources of the instructions are more diverse and the data operated on by the instructions are more dispersed, and the assembly unit assembles the vector instructions that each Scalar unit decodes independently, which is more conducive to efficiently selecting instructions suitable for assembly and improving the utilization of the Vector unit.
  • The vector instruction assembly unit 400 may also include a logic storage module 410 and a vector instruction assembly module 420.
  • The logic storage module 410 is configured with a vector instruction assembly queue 410A, which is used to buffer the decoded vector instructions extracted from at least two vector instruction queues.
  • The vector instruction assembly module 420 is used to extract at least two decoded vector instructions from the assembly queue and assemble them.
  • the vector instruction assembly unit 400 may be a hardware module separately provided on the AI Core, which is provided with a logic storage module 410.
  • the logical storage module 410 may be the instruction slot described below, or any other storage block with a similar logical structure, such as a queue, a linked list, a table, a stack, and the like.
  • the instruction slot is configured to store decoded instructions extracted by the assembly unit from multiple vector instruction queues.
  • The vector instruction assembly module 420 (Vec Instruction Coalescer) included in the vector instruction assembly unit 400 is used to extract instructions from each vector instruction queue and cache them in the instruction slot. The vector instruction assembly module 420 may extract decoded vector instructions from at least two vector instruction queues 100A according to preset rules, for example: extracting one decoded vector instruction at a time on a first-in-first-out basis; extracting at least two decoded vector instructions from each vector instruction queue 100A at the same time; extracting decoded vector instructions from the queues in descending order of the total number of instructions in each vector instruction queue 100A; or extracting decoded vector instructions from a corresponding number of vector instruction queues according to the number of vector instructions that the current instruction slot can still receive.
  • the decoded instructions stored in the instruction slots may come from the same vector instruction queue 100A, or different vector instruction queues 100A. Also, these decoded instructions may have the same or different execution latencies.
  • the vector instruction assembly module 420 is further configured to extract and assemble at least two decoded vector instructions from the vector instruction assembly queue 410A according to the execution delay.
  • the vector instruction assembly module 420 extracts and assembles the instructions according to the value of the mask of each instruction in the instruction slot.
  • the fetched instruction is determined according to the maximum value of the Lanes width supported by the Vector unit of the target computing accelerator under the current computing precision. For example, multiple instructions are fetched incrementally until the cumulative sum of mask values of the fetched instructions is closest to the maximum value of the Lanes width supported by the Vector unit.
  • the vector instruction assembling module 420 extracts and assembles the instructions to be assembled according to the execution delay of each instruction in the instruction slot. For example, the instructions are divided into multiple groups according to the execution delays corresponding to the instructions in the current instruction slot. A group of instructions with the smallest execution delay is preferentially selected for assembly, or a group of instructions with the largest execution delay is preferentially selected for assembly.
  • the aforementioned method of accumulating the value of the mask can also be used to ensure the maximum utilization of the Vector unit.
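  • One plausible way to realize the selection policy described above (grouping the instruction slot by execution latency, then accumulating mask values up to the maximum lane width) is sketched below; the structures and the choice of lane limit are assumptions for illustration, not the application's concrete implementation.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct SlotInstr {
    uint32_t id;       // position in the instruction slot
    uint32_t mask;     // lanes this instruction needs
    uint32_t latency;  // execution latency in clock cycles
};

// Pick instructions that share one execution latency and whose accumulated mask
// values come as close as possible to the Vector unit's maximum lane width.
std::vector<SlotInstr> select_for_assembly(const std::vector<SlotInstr>& slot,
                                           uint32_t max_lanes /* e.g. 128 for FP16 */) {
    // 1. Group the instruction slot by execution latency.
    std::map<uint32_t, std::vector<SlotInstr>> by_latency;
    for (const auto& in : slot) by_latency[in.latency].push_back(in);

    std::vector<SlotInstr> best;
    uint32_t best_lanes = 0;

    // 2. Within each latency group, greedily accumulate masks without exceeding
    //    max_lanes, and keep the group whose accumulated width is largest.
    for (const auto& kv : by_latency) {
        const std::vector<SlotInstr>& group = kv.second;
        std::vector<SlotInstr> picked;
        uint32_t lanes = 0;
        for (const auto& in : group) {
            if (lanes + in.mask <= max_lanes) {
                picked.push_back(in);
                lanes += in.mask;
            }
        }
        if (lanes > best_lanes) {
            best_lanes = lanes;
            best = picked;
        }
    }
    return best;  // these instructions share a latency and fill up to best_lanes lanes
}
```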
  • At least one operation is specified in the assembled wide instruction, and the execution delays of these specified operations are the same when the Vector units are executed separately.
  • At least M source operands are specified in the assembled wide instruction, and the widths of the M specified source operands are the same or different (because the values of masks can be different).
  • Extracting instructions into the instruction slot by the vector instruction assembly unit 400 is a third transaction, and assembling the instructions in the instruction slot is a fourth transaction; under the coordination of the system control module of the AI Core or of the vector instruction assembly unit 400, the third and fourth transactions can be processed simultaneously, at intervals, or in parallel.
  • the vector instruction assembly module 420 is also configured to send the assembled wide instruction to the vector instruction execution queue (Vector Execution Queue) configured by the AI Core for the Vector unit.
  • the vector instruction assembly unit 400 further includes a Vec instruction emission module 430, which is used to extract the specified source operand from the UB to the operand buffer for reading by the vector calculation unit.
  • The assembled at least two decoded vector instructions each specify their respective source operands and specify the addresses of the target operands in the operand buffer. For example, the source operands are extracted into the operand buffer configured by the AI Core for the Vector unit, such as the source register workspace, and the addresses of the target operands are specified in the operand buffer configured by the AI Core for the Vector unit, such as the target register workspace.
  • When the amount of data that can be buffered in the operand buffer is greater than the number of instructions that can be buffered in the vector instruction execution queue, the Vec instruction emission module 430 first extracts instructions from the vector instruction execution queue and then prepares the source and target operands as described above. When the amount of data that can be buffered in the operand buffer is less than the number of instructions that can be buffered in the vector instruction execution queue, the Vec instruction emission module 430 first prepares the source and target operands as described above, and after the data is ready, the instructions fetched from the vector instruction execution queue are sent to the Vector unit for execution.
  • the Vec instruction issuing module 430 transmits the vector instruction with the data prepared to the Vector unit 300 for execution.
  • the Vec instruction issue module 430 is also used to write back operands after the Vector unit executes the wide instruction. For example, write the target operand back into the unified buffer UB from the target register work space.
  • Writing the target operand back into the UB is another transaction. Under the coordination of the system control module 13 of the AI Core or the vector instruction assembly unit 400, these multiple transactions are processed in parallel.
  • the vector instructions decoded by the at least two scalar computing units are cached in their respective corresponding vector instruction queues, and the vector instruction assembly queue configured by the vector instruction assembly unit caches the vector instructions extracted from the at least two vector instruction queues , and the vector instruction assembly module extracts any at least two vector instructions from the vector instruction assembly queue according to the execution delay, and assembles them.
  • Each scalar calculation unit and its corresponding vector instruction queue, taken as a whole, are independent of the other scalar calculation units and their corresponding vector instruction queues and execute in parallel, which improves the overall execution efficiency of each scalar calculation unit. The assembly module of the assembly unit uses the logic storage module configured with the assembly queue to assemble the vector instructions extracted from the vector instruction queues, realizing flexible, reliable, and efficient assembly of vector instructions; assembling the vector instructions according to execution latency helps improve the processing efficiency when the assembled vector instructions are executed.
  • the instruction fetched and decoded by each scalar calculation unit 200 also includes a data transfer instruction, correspondingly, as shown in FIG. 2C
  • the storage unit 100 is also configured with a data transfer instruction queue 100B, and the data transfer instruction queue 100B is used for buffering the decoded data transfer instructions of at least two scalar calculation units; the data transfer unit 600 is also used for executing the decoded Data movement instructions.
  • the above storage unit is configured with a data transfer instruction queue corresponding to each scalar calculation unit, and the data transfer instruction queue is used to cache the decoded data transfer instructions of each scalar calculation unit.
  • The instructions decoded by each scalar calculation unit are sequentially sent to the data transfer instruction queue 100B configured in the storage unit for each scalar calculation unit.
  • all or more scalar computing units may be configured with one data transfer instruction queue, that is, the number of scalar computing units is M, and the number of data transfer instruction queues is 1. At this time, the instructions decoded by each scalar calculation unit are sequentially buffered in the common data transfer instruction queue according to the sequence of decoding time.
  • Each scalar calculation unit may be configured with a corresponding data transfer instruction queue, that is, the number of scalar calculation units is M and the number of data transfer instruction queues is also M. At this time, the instructions decoded by each scalar calculation unit are independently buffered in the corresponding data transfer instruction queue.
  • As above, the caching mechanism allows the transaction in which instructions are taken from the queue and executed and the transaction in which newly decoded instructions are added to the queue to be performed simultaneously or at intervals, so that the number and content of instructions in the queue are continuously updated; this supports each scalar calculation unit in fetching, decoding, and storing decoded instructions in a parallel pipeline, improving overall execution efficiency.
  • The above data transfer unit is used to execute the decoded data transfer instructions.
  • The decoded data transfer instruction executed in a single execution may be a decoded instruction from any one scalar calculation unit, or a wide instruction assembled from the decoded instructions of any multiple scalar calculation units.
  • The computing accelerator of this embodiment uses at least two scalar calculation units to acquire and decode instructions respectively; the storage unit caches the decoded data transfer instructions of each scalar calculation unit, and the data transfer unit executes the decoded data transfer instructions, forming a stable and efficient processing pipeline with high overall execution efficiency.
  • the decoded instruction obtained is a decoded vector instruction, a decoded MTE instruction, a decoded Cube instruction, or a Scalar instruction.
  • the Vector unit and the MTE unit are implemented independently.
  • the merging unit for assembling MTE instructions and the assembly unit for assembling vector instructions also operate independently. This can be achieved by eliminating, at the programming stage of the instruction sequence, possible data dependencies between the two types of instructions.
  • the data handling instructions include multiple types of data handling instructions; correspondingly, the data handling instruction queues corresponding to each scalar computing unit form a group containing multiple types, for example n types. That is, if the number of scalar calculation units is M, the number of data transfer instruction queues is n*M.
  • the AI Core can set up, for each type, an assembly unit for assembling at least two decoded data handling instructions of the same type and emitting the assembled data movement instructions. That is, if the data transfer instructions include n types of data transfer instructions, the number of MTE merging units is n.
  • the number of MTE merging units may also be 1, matching the number of MTE units.
  • the data transfer unit is further configured to execute at least two decoded data transfer instructions within one execution cycle.
  • the execution cycle here refers to the required execution delay when the MTE unit executes the operation specified in the current MTE instruction.
  • Execution latency refers to the number of clock cycles required for an MTE instruction to be executed by the MTE unit.
  • the address of the operand pointed to by at least two decoded data movement instructions executed in one execution cycle satisfies the assembly condition.
  • the instructions fetched and decoded by the scalar computing units further include scalar instructions, and at least two scalar computing units are also used to execute the decoded scalar instructions.
  • These scalar instructions are mostly flow control instructions or mathematical operations and do not involve references to worker_ids.
  • Each scalar calculation unit respectively executes the decoded scalar instruction, and realizes the process control of the operation accelerator in a coordinated manner as a whole. After the number of scalar computing units increases, the number of read scalar instructions can be increased, thereby improving the processing efficiency of scalar instructions.
  • the at least two scalar calculation units include a master scalar calculation unit and at least one slave scalar calculation unit; the master scalar calculation unit is used to control the start or stop of the slave scalar calculation units, or to control the synchronization of the at least two scalar computing units.
  • the at least two scalar computing units V2 constitute a cluster, in which a master scalar computing unit is designated and at least one slave scalar computing unit is designated.
  • the designated master scalar computing unit controls the start or stop of the slave scalar computing unit, or controls the synchronization of at least two scalar computing units including the master scalar computing unit, so that each scalar computing unit in the cluster starts or stops according to the specified role, And it executes the synchronization function according to the specified role, with clear logic and reliable operation, and realizes the coordinated operation of the computing accelerator.
  • when the target computing accelerator is used, it can be in a single operation mode with only the main scalar computing unit, or in a multi-working unit mode in which the main scalar computing unit and at least one slave scalar computing unit form a cluster.
  • the single operation mode and the multi-work unit mode are mutually exclusive at any given time. Under the coordination of the AI Core's system control module or assembly unit, the two modes can be interleaved over the course of execution, as shown in FIG. 6A and FIG. 6B.
  • when the at least two scalar computing units are in multi-working unit mode, each scalar computing unit "virtually" corresponds to one working unit.
  • the multi-work unit mode it may involve the control of any target computing accelerator, that is, the synchronization of each scalar computing unit in the AI Core, that is, the synchronization of each "virtual" work unit. Please refer to the following description for the specific implementation method.
  • when the main scalar calculation unit controls the slave scalar calculation units to stop, the main scalar calculation unit is in the single operation mode.
  • the single operation mode does not involve synchronizing the scalar computing units within any single AI Core, but it may involve multiple target computing accelerators, that is, synchronization between multiple AI Cores.
  • the specific implementation is not repeated here.
  • the master Scalar controlling the start of the slave Scalars includes: when executing the instruction that identifies the start of the multi-work unit mode, the master scalar calculation unit controls the startup of the corresponding number of slave scalar calculation units according to the number recorded in the start instruction.
  • for example, when the main scalar computing unit executes the for_range(worker_num) instruction that marks the start of the multi-work unit mode, it controls the startup of worker_num-1 slave scalar computing units according to the value of worker_num recorded in the instruction.
  • for the specific start, stop or synchronization methods, see the description below.
  • after the addresses and parameters of the source operands and/or target operands have been configured, the decoded vector instructions of the scalar calculation units cached in the vector instruction queues configured in the storage unit are assembled by the assembly unit into wider vector instructions, which are sent directly to the Vector unit for execution when the Vector unit is idle.
  • data21, data31, data41 and data51 are used as operands, and each occupies the same share of the Vector unit, namely 16 lanes.
  • the utilization rate of the Vector unit is then roughly 4 times that obtained when each instruction is executed separately.
  • for example, when the mask value of 8 decoded vector instructions is 16, and it is known that another target AI accelerated processor (provided with 8 Scalar units, that is, scalar calculation units) has the same execution delay for the 8 operations including op2, op3 and op4, the 8 decoded vector instructions can be executed in one execution cycle of the Vector unit through instruction assembly, and the utilization rate of the Vector unit is roughly 8 times that obtained when each instruction is executed independently (a simple estimate of this effect is sketched below).
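  • as a rough illustration of the utilization figures above, the following minimal Python sketch estimates lane utilization before and after assembly; the 128-lane width and the mask and instruction counts are assumptions used only for this example, not values mandated by the accelerator.

      # Hypothetical lane-utilization estimate for a SIMD Vector unit.
      VECTOR_LANES = 128          # assumed width of the Vector unit

      def utilization(mask_per_instr: int, instrs_per_cycle: int) -> float:
          """Fraction of lanes busy in one execution cycle."""
          return min(mask_per_instr * instrs_per_cycle, VECTOR_LANES) / VECTOR_LANES

      separate = utilization(16, 1)   # one narrow instruction per cycle -> 0.125
      assembled = utilization(16, 8)  # eight narrow instructions assembled -> 1.0
      print(separate, assembled, assembled / separate)  # 0.125 1.0 8.0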
  • FIG. 5C also shows the meanings of the repetition count (repeat) and the jump amount (stride) in the SIMD vector instruction model.
  • the method of using these two parameters when assembling instructions please refer to the description below.
  • when the computing accelerator runs in the multi-work unit mode, the operation of assembling vector instructions or MTE instructions is mapped onto the user-facing programming model or programming method, and the hardware-level master Scalar and slave Scalars are transparent to users.
  • when the user uses the target computing accelerator through programming, it is as if multiple "virtual" Vector subunits or "virtual" MTE subunits were being used through the identifiers of the work units.
  • the method for using the computing accelerator in one embodiment of the present application includes:
  • S130: use at least one vector operation function supported by the operation accelerator to process data, where at least one parameter of the vector operation function references the identifiers.
  • the method of use also includes:
  • S150: use at least one data transfer function supported by the computing accelerator to process data, where at least one parameter of the data transfer function references the identifiers;
  • S170: designate the M work units to wait synchronously by using a synchronous wait function supported by the computing accelerator.
  • when using the computing accelerator for computation acceleration, identifiers pointing to M work units are generated according to the specified number M of work units, and at least one vector computing function supported by the computing accelerator is used to process data; for example, at least one parameter of the vector computing function references the identifiers so that the data to be processed by each work unit is allocated separately.
  • in this way, the multi-work unit mode of the computing accelerator is specified and the data processed by each work unit is allocated separately, so as to realize parallel processing of the data.
  • this method of using a computing accelerator for computation acceleration makes it easy to write program code with a complete structure and clear logic, is user-friendly and easy to use, and is conducive to promoting wide application of the computing accelerator in the industry.
  • at least one parameter of the data transfer function references the identifiers so as to allocate the data processed by each work unit, and the M work units wait synchronously where dependencies exist between the allocated data.
  • in this way, the multi-work unit mode of the computing accelerator is specified, the data processed by each work unit is allocated separately so as to realize parallel processing of the data, and the work units are made to wait synchronously where the data is dependent.
  • this method of using a computing accelerator for computation acceleration makes it easy to write program code with a complete structure and clear logic, is user-friendly and easy to use, and is conducive to promoting wide application of the computing accelerator in the industry.
  • the V1 chip is provided with two Scalar units and one SIMD-type Vector unit for performing vector operations.
  • the Vector unit has 128Lanes for vector operations.
  • the SIMD unit has 64Lanes for vector operations.
  • the user-designed program can specify that one or two Scalar units process data, and when the assembled vector operation instructions finally submitted to the Vector unit are executed, they can occupy any number of lanes from 16 to 128, taking any value starting from 16 and increasing in steps of 16.
  • the combined vector operation instructions finally submitted to the Vector unit are executed, 32Lanes or 64Lanes may be occupied.
  • the target operation accelerator V2 is provided with four Scalar units and one SIMD-type Vector unit for performing vector operations.
  • some instructions in the instruction sequence corresponding to the user-designed program can be processed by one, two, three or four Scalar units, and when the assembled vector operation instructions finally submitted to the Vector unit are executed, they can occupy any number of lanes from 4 to 128, taking any value starting from 16 and increasing in steps of 16.
  • the situation that the target computing accelerator is provided with other numbers of Scalar units can refer to the foregoing, and will not be repeated here.
  • using the aforementioned computing accelerator, AI processors and AI processing devices can be constructed, such as AI acceleration modules, AI accelerator cards (which may be training cards or inference cards), AI edge computing servers or AI edge computing terminal devices, AI cloud servers (providing training services or inference services, for example an AI computing server with 2048 nodes), and AI clusters.
  • the AI processor may distinguish its own working mode according to its PCIe (Peripheral Component Interconnect Express) working mode. If its PCIe works in the master mode, the hardware device where the AI processor is located can expand peripherals; at this time, the AI processor works in the RC (Root Controller) mode.
  • the hardware device where the AI processor is located can also be connected to other peripherals acting as slave devices, such as a network camera, an I2C sensor and an SPI display.
  • a certain type of AI processor includes multiple AI computing accelerators AI Core, AI CPU, control CPU and task scheduling CPU.
  • the PCIe of the AI processor works in master mode.
  • the CPU of the AI processor directly runs the AI service program specified by the user, processes the AI service, and dispatches the AI operation acceleration processor to perform the NN calculation specified in the AI service.
  • the AI processing device has an operating system, an input and output device, a file processing system, and the like.
  • the CPU is responsible for efficiently allocating and scheduling computing tasks on the multiple AI Cores and AI CPUs; the multiple AI Cores execute the neural network computations separately, and the AI CPU executes operators that are not suitable for processing by the AI Core (for example, non-matrix complex calculations).
  • the AI processor works in EP (End Point) mode.
  • in EP mode, the Host side is usually used as the PCIe master and the Device side as the PCIe slave.
  • the AI business program specified by the user runs in the Host system, and the hardware device where the AI processor is located is used as the Device system to connect to the Host system through the PCIe slave device to provide NN computing capabilities for the servers in the Host system.
  • the host system interacts with the device system through the PCIe channel, and loads AI tasks to the AI processor on the device side to run.
  • the Host system includes an X86 server or an ARM server, and uses the NN (Neural Network) computing power provided by the AI computing accelerator to complete the service.
  • the hardware device where the AI processor is located acts as the Device side, receives the AI tasks distributed by the Host side, and executes the NN calculations specified in the AI tasks; after an AI task is completed, the NN calculation results are returned to the Host side.
  • the embodiment of the present application improves the architecture of the AI Core of the AI processor, and accordingly designs a relatively independent multi-working unit mode within each AI Core in the multi-core mode.
  • AI chips equipped with at least one AI Core after the aforementioned improved design include but are not limited to neural network processors (Neural-network Processing Unit, NPU), graphics processors (Graphics Processing Unit, GPU), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field Programmable Gate Array (Field Programmable Gate Array, FPGA), etc.
  • the energy efficiency of the vector calculation unit and the MTE unit can be improved by using the processing method for accelerating the operation provided by the embodiment of the present application.
  • an AI processor is usually provided with more than one computing accelerator AI Core, that is, at least two.
  • the AI processor equipped with at least one AI Core after the aforementioned improved design can be applied to application scenarios related to image, video, voice, and word processing, such as smart parks, automatic driving, etc.
  • when the computing accelerator of the embodiment of the present application is running, its operating system calls the underlying compiler provided in the form of a driver.
  • software stacks such as the APIs or operators supported by the computing accelerator are provided together with the computing accelerator firmware, for example recorded in a storage medium, or downloaded locally through a network server before use.
  • the following embodiments describe the work of the user in the stage of writing program codes when designing a neural network operator for a target computing accelerator.
  • CANN provides a tensor boost engine (Tensor Boost Engine, TBE) operator design framework based on the tensor virtual machine framework. Based on this TBE framework, users can use the Python language to design custom operators through the API and programming design interface provided by TBE, and complete the operator design required for neural network operations in AI tasks.
  • users use a domain-specific language (DSL, Domain Specific Language) to declare the calculation process, and use the Auto Schedule mechanism to generate program code. After the target computing accelerator is specified, the generated program code is compiled into an instruction sequence that can run on the target computing accelerator.
  • TIK (Tensor Iterator Kernel) is a dynamic programming framework based on the Python language and appears as a dedicated functional module running under Python. Users can call the APIs provided by TIK to write custom operators in Python, and the compiler provided by TIK compiles the TIK program code into an instruction sequence that can run on the target computing accelerator.
  • Table 1 below lists some of the statements or variables used by the user when virtualizing the Vector unit.
  • Table 2 lists some instructions executed by AI Core.
  • when a user uses the computing accelerator of the embodiment of the present application to design an AI business program, there are two possible scenarios, shown in FIG. 6A and FIG. 6B.
  • in the scenario of FIG. 6A, the multi-work unit (multi-worker) mode is used outside the multi-core mode for a certain AI Core to virtualize its Vector unit or MTE unit.
  • in the scenario of FIG. 6B, the multi-worker mode is used within the multi-core mode for each AI Core to virtualize its Vector unit or MTE unit.
  • the method of processing data in the multi-worker mode outside the multi-core mode (that is, outside the multi-core loop body) shown in FIG. 6A is described.
  • when designing a neural network operator for multiple target computing accelerators including AI Core_j and AI Core_j+1, the virtual Vector unit corresponding to the specified AI Core_j can be implemented by the j-th multi-worker loop, and the virtual Vector unit corresponding to the specified AI Core_j+1 can be implemented by the (j+1)-th multi-worker loop.
  • any AI Core can use multiple independent multi-worker loops to implement virtual Vector units respectively.
  • within each AI Core, the physical identifier of each Scalar unit is determined.
  • the program code uses the worker_id of the same name, incremented between loop iterations, as the logical identifier for referencing each Scalar unit, and the logical identifiers, ordered from small to large, correspond one-to-one with the physical identifiers of the Scalar units.
  • each "virtual" Vector subunit is individually referenced by the same identifier worker_id incremented on the loop body.
  • the program control part is roughly shown in the code sketch following this description.
  • the code illustrates how to specify the multi-worker mode, how to specify the number of Scalar units participating in the multi-worker mode, and how to implicitly specify, by means of a multiple loop, the behavior corresponding to each Scalar unit, such as fetching and decoding instructions and sending the decoded instructions to the instruction queues.
  • the AI Core enters the multi-worker mode after the instruction corresponding to the statement TIK_instance.for_range is fetched and decoded by the designated main Scalar unit.
  • the code corresponding to the statements that reference worker_id is executed in worker_num independent loop iterations, and each Scalar unit fetches instructions, decodes them, and sends the decoded instructions to its instruction queues.
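  • the original program listing is not reproduced in this text; the following minimal Python sketch (not the real TIK API) only models the structure described above, namely a multi-worker loop in which each worker references its own slice of the data through worker_id; the tensor contents, slice sizes and the element-wise operation are assumptions for illustration.

      # Structural sketch (not the real TIK API): models how a multi-worker loop
      # assigns per-worker data slices via worker_id, as described above.
      worker_num = 4                      # number of Scalar units / "virtual" workers
      data = list(range(64))              # hypothetical tensor to be processed
      slice_len = len(data) // worker_num

      # corresponds to "TIK_instance.for_range(worker_num)": entering multi-worker mode
      for worker_id in range(worker_num):
          # each worker references its own slice of the same tensor via worker_id
          src = data[worker_id * slice_len:(worker_id + 1) * slice_len]
          # corresponds to a vector operation such as SIMDX(mask, data(worker_id));
          # here it is modeled as a plain element-wise operation
          result = [x * 2 for x in src]
          print(worker_id, result[:4])
      # an implicit synchronization (".VVECSYNC") occurs when the loop ends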
  • the user uses the Python language and the TIK functions to write the program code of the operator for the neural network operation. In this way, the behavior corresponding to each Scalar unit can be explicitly specified, thereby realizing the virtual Vector unit.
  • a target computing accelerator is provided with N Scalar units and one Vector unit.
  • consider a vector operation function of the form SIMDX(mask, data1).
  • if the source data data1 (and likewise the target data specified in the SIMDX function) is too small relative to the number of lanes in the Vector unit, the utilization rate of the Vector unit will be low.
  • if M (M being less than or equal to N) instructions corresponding to this single-purpose vector operation function are assembled so that the source data or target data of the M instructions are "merged", the assembled instruction as a whole, in the form of, for example, M*SIMDX(mask*M, data1, ..., dataM), is sent to the Vector unit for execution, and the utilization rate of the Vector unit can then be increased to M times the original.
  • the user can use the statement "SIMDX(mask, data(worker_id))" to specify, for each loop iteration, the index position within the data tensor to be processed of the source data that the vector operation function SIMDX is to process.
  • in this way the user specifies that the M Scalar units of the target computing accelerator cooperate to virtualize its Vector unit into M Vector subunits, "formally" specifies the correspondence by referencing the "worker_id" variable, and has each of the M Vector subunits execute the instructions corresponding to the SIMDX(mask, data(worker_id)) function, thereby improving the utilization rate of the Vector unit.
  • the method for processing data in the multi-worker mode within the multi-core mode shown in FIG. 6B (that is, within the multi-core loop body) will be described below.
  • multi-core mode is often used.
  • the multi-worker mode can also be used to process data.
  • the following takes the implementation of two computing accelerators set for an AI processor, that is, the code segment of the AI Core to implement the virtual Vector unit as an example, to illustrate the method of using the multi-worker mode to process data in the multi-core mode.
  • 8 Scalar units and one Vector unit are set on each AI Core.
  • any AI Core will not process the tensor in the UB of other AI Cores.
  • the tensor used in each single-core loop in the multi-core mode must be declared within the single-core loop and be private to each AI Core.
  • each worker does not need and cannot declare them again in their respective worker loops when referencing these tensors.
  • multiple workers belonging to the same AI Core use "worker_id" to refer to different data in the same tensor.
  • each AI Core in the target AI processor can enter the multi-worker mode respectively, and each AI Core can enter the multi-worker mode multiple times.
  • the number of Scalar units involved may be different.
  • the reduction is implemented with a worker loop and synchronized.
  • the format of the data to be processed, that is, of the input data, is [batch_size, channel_cnt, spatial_size].
  • a maximum value is reduced under each batch (an illustrative splitting of this reduction across workers is sketched below).
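  • the following minimal Python sketch illustrates one possible way such a per-batch maximum reduction could be split across workers; the channel-wise splitting strategy, data sizes and random data are assumptions for illustration only, not the document's exact scheme.

      # Minimal sketch (assumed splitting strategy): each worker reduces a maximum
      # over its share of channels, then a final pass combines the per-worker
      # results for every batch.
      import random

      batch_size, channel_cnt, spatial_size, worker_num = 2, 8, 4, 4
      x = [[[random.random() for _ in range(spatial_size)]
            for _ in range(channel_cnt)] for _ in range(batch_size)]

      chunk = channel_cnt // worker_num
      partial = [[max(v for ch in x[b][w * chunk:(w + 1) * chunk] for v in ch)
                  for b in range(batch_size)] for w in range(worker_num)]   # per-worker maxima
      result = [max(partial[w][b] for w in range(worker_num)) for b in range(batch_size)]
      print(result)   # one maximum per batch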
  • the logical level of worker is added so as to "formally" allow users to control the calculation or scheduling logic of the Vector subunits obtained after virtualizing the Vector unit in each AI Core, or of the MTE subunits obtained after virtualizing the MTE unit.
  • the Vector unit is split into multiple "virtual" Vector subunits and the calculation tasks are assigned to each Vector subunit, so that users can design programs more conveniently and flexibly according to the characteristics of the data to be processed, improve the utilization rate of the instruction sequences executed by the Vector unit, and improve the computing efficiency of the AI Core as a whole.
  • when each AI Core in the computing accelerator of the embodiment of the present application runs, it obtains the instruction sequence compiled for the target computing accelerator from DDR or HBM and transfers it to the instruction cache, and obtains the data to be processed from DDR or HBM and moves it to internal storage, such as the L1 Buffer or UB.
  • These instruction sequences are neural network algorithms designed by the user for the data to be processed.
  • the data to be processed may be training data or inference data.
  • the neural network algorithm may be used to train the neural network to obtain a target neural network that meets the accuracy requirements; it may also be used to use the trained neural network for inference to obtain inference results.
  • Each AI Core executes the instruction sequence separately, and processes the data in multi-worker mode outside the specified multi-core mode, or in multi-worker mode within the specified multi-core mode.
  • an accelerated processor is configured with a master Scalar unit and two slave Scalar units, and each Scalar unit is configured with a vector instruction queue and a set of MTE instruction queues.
  • the Scalar unit Scalar_0 with a specified identifier is configured as the master Scalar unit and started to perform the function of the master Scalar unit.
  • the other Scalar units, Scalar_1 and Scalar_2 are configured as slave Scalar units and remain in standby state.
  • after being configured as the main Scalar unit, as shown in FIG. 1B, the main Scalar unit reads instructions from the instruction cache ICache, decodes them and executes the decoded instructions.
  • the master Scalar unit can also perform operations such as starting the slave Scalar unit, shutting down the slave Scalar unit, and controlling master-slave Scalar synchronization (including operand synchronization/data synchronization).
  • the system control module starts the specified number of slave Scalars, as well as the Vector assembly unit 400 and the MTE merging unit 500.
  • the master Scalar and each started slave Scalar fetch and decode instructions from the instruction cache according to the worker_id recorded in the instruction.
  • the decoded instructions are buffered into the respective corresponding vector instruction queues and MTE instruction queues.
  • when the master Scalar or any slave Scalar fetches an instruction of the user-specified multi-worker mode computing task that requires synchronization, it stops fetching and decoding instructions after processing the current instruction and enters the synchronization waiting state; after the master Scalar sends out the synchronization release message, the master Scalar and each slave Scalar unit continue to fetch and decode instructions from the instruction cache according to the worker_id recorded in the instructions.
  • the system control module in the AI Core shuts down the started slave Scalar units, the Vector assembly unit and the MTE merging unit.
  • the total number of cached instructions in Vector instruction queue 0 corresponding to the master Scalar and in Vector instruction queue 1 corresponding to slave Scalar_1 is approximately the same.
  • the total number of cached instructions in MTE instruction queue 0 corresponding to the master Scalar and in MTE instruction queue 1 corresponding to the slave Scalar is approximately the same.
  • when a decoded instruction is a vector instruction and the value of the mask recorded in the instruction is smaller than the preset value corresponding to the data precision, the instruction is called a narrow instruction.
  • the activated Vector assembly unit takes the decoded instructions out of each Vector instruction queue and buffers them in the instruction slots, assembles multiple narrow instructions according to the mask values recorded in the instructions and the execution delays of the operations in the instructions, and issues the assembled wide instructions to the Vector unit for execution.
  • the assembled instruction to be executed needs to have the value of its assembled mask set.
  • the number of clock cycles consumed by the Vector unit to execute the wide instruction is the same as the execution delay of each narrow instruction.
  • the utilization rate of the Vector unit is improved, and the computing performance of the AI Core is improved as a whole.
  • the Vector unit is configured with an instruction execution queue (Vector Execution Queue), and the Scalar-decoded instructions or assembled wide instructions are buffered in the instruction execution queue.
  • the purpose of buffering is to wait for the operand to be ready, and to wait for the Vector unit to complete the current operation.
  • the Vector unit executes the vector instructions sent by the Vec instruction sending module 430 (Vec Dispatch) one by one.
  • the operand to be operated by the launched vector instruction has been moved from the UB to the operand buffer of the Vector Unit.
  • the Vector assembly unit includes a Vec instruction assembly module 420 (Vec Instruction Coalescer) and a Vec instruction emission module 430 (Vector Dispatch).
  • the Vec instruction assembly module is used to collect the vector instructions decoded by the master and slave Scalar units and merge instructions as far as possible to obtain wide instructions containing at most worker_num narrow instructions, and to cache them in the instruction execution queue; the Vec instruction emission module 430 issues the wide instruction to the Vector unit for execution once the operands are ready.
  • the Vec instruction emission module 430 can also merge the memory access information and, according to the merged memory access information, fetch operands from the Vec related memory (Vector Related Mem, such as UB) into the source register workspace for use by the vector computing unit, or write operands out from the target register workspace to the Vec related memory. That is, before the Vector unit executes the wide instruction, the emission module provides the operand fetch function; after the Vector unit executes the wide instruction, the emission module provides the operand write-back function.
  • the vector instruction assembly module 420 configures K instruction slots according to the value of worker_num passed in from the main Scalar unit, where K is a positive integer not less than worker_num.
  • Each instruction slot includes multiple fields: source operand address 0 (src_addr), source operand address 1 (src1_addr), destination operand address (dst_addr), the mask value, the operation specified by the instruction, the repetition count (repeat) specified by the instruction, and the jump amount (stride) specified by the instruction.
  • the instruction slot is used to store decoded instructions extracted by the vector instruction assembly module.
  • the vector instruction assembly module 420 checks the status of each instruction slot in a polling manner; any instruction slot that is not empty is an active instruction slot. When any instruction slot is detected to be empty, a decoded instruction is taken from a Vector instruction queue according to preset rules (such as polling or longest wait), filled into the empty instruction slot, and the number of active instruction slots is incremented by 1.
  • the vector instruction assembly module 420 counts the number of active instruction slots in a polling manner. When the number of active instruction slots is not zero, multiple instructions with the same execution delay are sequentially taken out from each active instruction slot and merged, and the merged instructions are buffered in the Vector instruction execution queue.
  • the vector instruction issue module 430 collects operands and issues instructions to be executed. Active instruction slots are locked until pending instructions are issued.
  • when the vector instruction assembly module assembles instructions, if the number of active instruction slots is greater than 1 and the operations of multiple active instruction slots belong to instructions with the same delay, the instructions recorded in those active instruction slots are spliced into one instruction to be executed.
  • the instruction to be executed corresponds to a plurality of narrow instructions, and each narrow instruction correspondingly includes an operation specified by the instruction in an instruction slot and an operand specified by the instruction. Subsequently, the instruction to be executed is issued as a whole to the vector computing unit for execution as a wide instruction.
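  • the following minimal Python sketch models the slot-based assembly described above, grouping narrow instructions with the same execution delay into one wide instruction whose mask is the combination of the narrow masks; the field names, operations and latency table are assumptions for illustration, not the hardware implementation.

      # Minimal model (not RTL) of the slot-based assembly described above.
      from dataclasses import dataclass
      from typing import List

      LATENCY = {"vadd": 2, "vmul": 3}     # assumed per-op execution delays (cycles)

      @dataclass
      class NarrowInstr:
          op: str
          mask: int
          src_addr: int
          dst_addr: int

      def assemble(slots: List[NarrowInstr]) -> List[List[NarrowInstr]]:
          """Group active slots by execution delay; each group becomes one wide instruction."""
          groups = {}
          for ins in slots:
              groups.setdefault(LATENCY[ins.op], []).append(ins)
          return list(groups.values())

      slots = [NarrowInstr("vadd", 16, 0x000, 0x400),
               NarrowInstr("vadd", 16, 0x100, 0x500),
               NarrowInstr("vmul", 16, 0x200, 0x600)]
      for wide in assemble(slots):
          total_mask = sum(i.mask for i in wide)   # combined mask of the wide instruction
          print([i.op for i in wide], "combined mask =", total_mask)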
  • the master Scalar and the slave Scalar are respectively configured with their own program counters (Program Counter, PC), and each PC is a register, which is configured to point to The address of the next instruction to be read from the master Scalar or from the Scalar.
  • when the master Scalar fetches an instruction that starts a specified number of slave Scalars (such as .VVECST.n), the master Scalar controls the slave Scalars to start and shares the current content of its PC, that is, the address of the next instruction to be read.
  • the master Scalar and the slave Scalar independently read instructions from the instruction cache, decode and execute them.
  • in the computing accelerator of the embodiment of the present application, the master and slave Scalars have different values of worker_id at runtime, so the instructions fetched and decoded by the master and slave Scalars, and the types of those instructions, may differ.
  • consequently, the behavior of the master and slave Scalars may differ. For example, after the master and slave Scalars read the instructions associated with their worker_id registers, their behavior is controlled by the instructions read, and a fork may occur: after reading code related to worker_id, the master and slave Scalars may enter different program branches or program segments and read, decode and execute different instructions respectively.
  • the data handling instruction merging unit, that is, the MTE merging unit 500, includes an MTE instruction merging module 520 (MTE Instruction Coalescer) and an MTE instruction emission module 530 (MTE Dispatch).
  • the MTE instruction merging module 520 obtains the decoded MTE instructions from a plurality of MTE Queues, and after the instructions are merged, generates an MTE execution instruction, and sends it to the MTE instruction execution queue (MTE Execution Queue) for buffering.
  • when detecting that the MTE unit is idle, the MTE instruction emission module 530 transmits the MTE execution instruction in the MTE instruction execution queue to the direct memory access (DMA) units, that is, the DMAs in the figure; for example, data is transferred through the DMAs to each core cache, that is, the AI Core internal cache (In Core Buffer) such as UB, or to the core external cache (Out Memory), such as GB.
  • the MTE instruction merging module 520 maintains a Miss Status and Handling Register (MSHR) table.
  • the MTE instruction merging module extracts any decoded original instruction from the MTE instruction queue corresponding to each Scalar unit and stores it in an entry in the MSHR table.
  • the number of table entries S in the MSHR is not related to the number of master and slave Scalar units and is typically a larger value; for example, when the number of master and slave Scalar units is 8, the number of table entries S may be 16, 32 or 64.
  • each entry in the MSHR records the state of an MTE instruction so as to identify whether the MTE instruction can be merged. For example, when the content of the status column is "Unused", it identifies that the instruction slot where the entry is located is empty; when the content of the status column is "Used", it identifies that the MTE instruction in the instruction slot where the entry is located has not yet been merged and can be merged; when the content of the status column is "Issued", it identifies that the MTE instruction in the instruction slot where the entry is located has been merged.
  • other columns can also be set in the entry, for example for recording the source operand address (src_addr), the destination operand address (dst_addr), the jump amount (stride) specified by the instruction, the rectified linear unit (Relu) parameter specified by the instruction, and the padding parameter (pad) specified by the instruction.
  • the MTE instruction merging module 520 clears the MSHR table, that is, configures the status columns of all entries as "Unused".
  • when detecting that any instruction slot is empty (at this moment the content of its status column is "Unused"), the MTE instruction merging module 520 reads an MTE instruction from one of the MTE instruction queues according to certain rules (such as polling or longest wait), fills it into the free entry, and updates the status column of the entry to "Used";
  • when the MTE instruction merging module 520 performs instruction merging, the status columns of the entries holding the instructions that participate in the merge are updated to "Issued" (a simplified model of these entry states is sketched below).
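  • the following minimal Python sketch models the MSHR entry states described above (Unused, Used, Issued) and the clearing of entries after a merge; the entry fields and the table size of 16 are assumptions for illustration, not the hardware implementation.

      # Minimal Python model (not the hardware) of the MSHR entry states.
      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class MSHREntry:
          status: str = "Unused"            # "Unused" -> "Used" -> "Issued" -> "Unused"
          src_addr: Optional[int] = None
          dst_addr: Optional[int] = None
          stride: Optional[int] = None

      table = [MSHREntry() for _ in range(16)]

      def fill(entry: MSHREntry, src: int, dst: int, stride: int) -> None:
          """An empty entry receives a decoded MTE instruction and becomes mergeable."""
          assert entry.status == "Unused"
          entry.src_addr, entry.dst_addr, entry.stride = src, dst, stride
          entry.status = "Used"

      def issue(entries) -> None:
          """Entries participating in a merge are marked Issued, then cleared."""
          for e in entries:
              e.status = "Issued"
          for e in entries:
              e.status, e.src_addr, e.dst_addr, e.stride = "Unused", None, None, None

      fill(table[0], src=0x0000, dst=0x8000, stride=0)
      fill(table[1], src=0x0100, dst=0x8100, stride=0)
      issue([table[0], table[1]])
      print(table[0].status, table[1].status)   # Unused Unused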
  • MTE instructions can include a variety of different categories.
  • the categories of MTE instructions remain the same before and after the merger.
  • Multiple MTE instructions of the same type can be merged when specific rules are met, that is, multiple MTE instructions of the same type are merged into a wider MTE instruction and sent to the MTE for execution at one time. For example, after being combined into a wider hash read MTE instruction, it is sent to the MTE unit at one time, and the execution is completed in one execution cycle (cycle) of the MTE unit.
  • the instruction set U_I refers to any subset of the set U composed of all instructions to be assembled, and there are at least 2 MTE instructions in the instruction set U_I.
  • the MTE instruction merging module judges whether each instruction set U_I satisfies specific rules, and executes instruction merging when it judges that the rules are met.
  • Execution instruction merging means that when two or more instructions in the MSHR table meet certain rules, the two or more instructions are merged into a wider MTE instruction.
  • Specific rules such as but not limited to:
  • the source addresses and destination addresses involved in the instructions of the instruction set U_I do not overlap each other, the source address segments are completely contiguous after being combined, and the destination address segments are completely contiguous after being combined; such an instruction set U_I can be completed with one MTE instruction and can therefore be merged.
  • each source address segment presents the same hash pattern (that is, the value of the stride parameter is the same in each instruction), the hash pattern is maintained after the source address segments are combined, and the destination address segments are completely contiguous after being combined; such an instruction set U_I can be completed by one hash read MTE instruction and can therefore be merged.
  • after the instruction merging is completed, the MTE instruction merging module clears the two or more entries related to the merged instructions, that is, configures the status columns of these entries to the "Unused" state (a simplified check of the first merging rule is sketched below).
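  • the following minimal Python sketch checks the first rule above for two data-movement instructions, namely non-overlapping addresses whose source segments and destination segments are completely contiguous after combination; the instruction representation and the byte lengths are assumptions for illustration only.

      # Minimal sketch of merge-rule checking for two MTE (data-movement) instructions.
      from typing import NamedTuple

      class MteInstr(NamedTuple):
          src: int     # source start address
          dst: int     # destination start address
          length: int  # number of bytes moved

      def can_merge(a: MteInstr, b: MteInstr) -> bool:
          first, second = (a, b) if a.src <= b.src else (b, a)
          src_contiguous = first.src + first.length == second.src
          dst_contiguous = first.dst + first.length == second.dst
          return src_contiguous and dst_contiguous

      def merge(a: MteInstr, b: MteInstr) -> MteInstr:
          first, second = (a, b) if a.src <= b.src else (b, a)
          return MteInstr(first.src, first.dst, first.length + second.length)

      i0 = MteInstr(src=0x1000, dst=0x9000, length=256)
      i1 = MteInstr(src=0x1100, dst=0x9100, length=256)
      if can_merge(i0, i1):
          print(merge(i0, i1))   # one wider move covering both address ranges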
  • when any slave Scalar fetches the first ".VVECSYNC" instruction (for example, decodes the instruction corresponding to sync_workers()), the slave Scalar pauses, sends a first synchronization arrival signal to the master Scalar, and waits for the master Scalar to send a second synchronization arrival signal. After obtaining the second synchronization arrival signal sent by the master Scalar, the slave Scalar enters the next instruction.
  • when the master Scalar receives a first synchronization arrival signal sent by a slave Scalar, or the master Scalar itself fetches the first ".VVECSYNC" instruction, the master Scalar starts counting the number of slave Scalars that have sent the first synchronization arrival signal. Once it determines that all currently active slave Scalars have sent the first synchronization arrival signal, it determines that all slave Scalars are ready for synchronization, and the master Scalar sends the second synchronization arrival signal to all slave Scalars. Afterwards, the master Scalar and each slave Scalar each enter the next instruction.
  • entering the next instruction refers to fetching, decoding and executing the next instruction; if the master Scalar and the slave Scalars support out-of-order execution, after decoding, they skip the instruction that has dependencies and needs to wait for synchronization, and continue to fetch, decode and execute the next instruction.
  • for this purpose, any logical storage structure can be used; for example, a synchronization table or the aforementioned instruction slots can be established.
  • the synchronization table is used to identify whether each slave Scalar unit has sent the first synchronization arrival signal.
  • when the established synchronization table has recorded the identifiers (Scalar_id) of all slave Scalar units, or the state quantities corresponding to all slave Scalar units are marked as sent, it is determined that all slave Scalars have sent the first synchronization arrival signal.
  • if the synchronization table is created in update mode, it is initialized at creation as a fixed-length or fixed-format synchronization table, with all slave Scalar identifiers recorded and/or the state quantity recorded for each slave Scalar identifier set to not-sent; afterwards, whenever a first synchronization arrival signal is received, the identifier of the slave Scalar unit recorded in the signal is extracted and the state quantity for that identifier is updated to sent. If the synchronization table is created in generation mode, it is initialized at creation as an empty synchronization table; afterwards, each time a first synchronization arrival signal is received, the slave Scalar identifier recorded in the signal is extracted and recorded, and the state quantity for that identifier is updated to sent (a simplified model of this counting is sketched below).
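  • the following minimal Python sketch models the two-phase synchronization described above: each slave reports a first synchronization arrival signal, the master waits until all active slaves have reported, and then broadcasts the second signal; the set-based synchronization table is an assumed, simplified representation.

      # Minimal model of the two-phase master/slave synchronization described above.
      active_slaves = {"Scalar_1", "Scalar_2", "Scalar_3"}

      def run_sync(arrival_order):
          sync_table = set()                       # "generation mode" synchronization table
          for scalar_id in arrival_order:
              sync_table.add(scalar_id)            # first synchronization arrival signal
              if sync_table == active_slaves:
                  # master sends the second synchronization arrival signal to all slaves
                  return f"release broadcast after {scalar_id} reported"
          return "still waiting"

      print(run_sync(["Scalar_2", "Scalar_1", "Scalar_3"]))  # release broadcast after Scalar_3 reported
      print(run_sync(["Scalar_2"]))                          # still waiting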
  • when the user designs a neural network operator for the target AI accelerated processor and implements a virtual Vector unit for any specified AI Core, the user references the "worker_num" variable in the for_range statement and specifies its value to set the number of Scalar units participating in the multi-worker mode, forming a multi-worker loop whose loop count is worker_num.
  • the instruction ".VVECST.n" is generated to identify the start of the multi-worker loop segment and instruct the Scalar cluster to start, and the instruction ".VVECED" is generated to mark the end of the multi-worker loop segment and instruct the Scalar cluster to exit.
  • “.VVECST.n” means to start implementing the virtual Vector unit or the virtual MTE unit.
  • the “n” in “.VVECST.n” indicates that all instructions below the statement will be executed in the form of n "virtual” Vector subunits or n “virtual” MTE subunits.
  • the program switches from multi-core mode to multi-worker mode.
  • “.VVECED” indicates that the multi-worker mode ends, and the program returns to the multi-core mode.
  • after the target acceleration processor starts to execute the AI processing task specified by the user, before entering the multi-worker mode only the main Scalar unit in each AI Core is active, and each slave Scalar is in a started but no-operation (NOP) state.
  • the main Scalar unit in each AI Core fetches, decodes, and executes the decoded instructions, and controls the Vector unit or MTE unit to perform corresponding calculations or operations.
  • when the ".VVECST.n" instruction is executed, the accelerated processor enters the multi-worker mode. At this time, the master Scalar unit sends a message to the designated n-1 slave Scalar units, and the content of the message includes the state of the program registers of the master Scalar unit. Each slave Scalar unit receives the message, initializes its own program register state according to the program register state of the master Scalar, and thus completes the replication of the master Scalar state. After all the slave Scalars have obtained the state of the master Scalar, execution of the instruction sequence proceeds to the next instruction.
  • the main Scalar unit also sends a message to the Vector assembly unit or the MTE merging unit, and the content of the message includes the number of workers, that is, the n in ".VVECST.n".
  • the Vector assembly unit receives the message and initializes the instruction slots according to the extracted number of workers; at this time, the number of instruction slots is not less than the number of workers, so that decoded instructions from all Scalar units can be cached at the same time.
  • the MTE merging unit clears the MSHR table.
  • in the multi-worker mode, each Scalar unit in the Scalar cluster independently fetches, decodes and executes the decoded instructions, and distributes them to the corresponding vector instruction queues; after the instructions are assembled by the Vector assembly unit and merged by the MTE merging unit respectively, they are sent to the Vector unit or the MTE unit to perform the corresponding calculations or operations.
  • when the ".VVECED" instruction is executed, the master Scalar creates an end table, and each slave Scalar sends an end signal to the master Scalar; when all slave Scalars in the end table have sent end signals, the master Scalar executes the next instruction. That is, at the end of the multi-worker loop, the compiler controls the master Scalar unit and each slave Scalar unit to implicitly execute the aforementioned ".VVECSYNC" instruction once to synchronize the master and slave Scalars (the overall framing of a multi-worker loop is sketched below).
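  • the following minimal Python sketch summarizes how a multi-worker loop might be framed by the markers described above; the operator body, worker count and the exact placement of the implicit ".VVECSYNC" are assumptions for illustration only.

      # Minimal sketch of how a compiler might frame a multi-worker loop with the
      # markers named above; instruction strings are placeholders, not a real ISA.
      def frame_multi_worker(body_instrs, worker_num):
          return ([f".VVECST.{worker_num}"]          # enter multi-worker mode, start slaves
                  + body_instrs                       # per-worker vector / MTE instructions
                  + [".VVECSYNC",                     # implicit master/slave synchronization
                     ".VVECED"])                      # exit multi-worker mode, back to multi-core

      print(frame_multi_worker(["vadd ...", "mte_move ..."], worker_num=4))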
  • FIG. 8 is a schematic structural diagram of an artificial intelligence processing device 900 provided by an embodiment of the present application.
  • the artificial intelligence processing device 900 includes: a processor 910 and a memory 920 .
  • the artificial intelligence processing device 900 shown in FIG. 8 may further include a communication interface 930, which may be used for communicating with other devices.
  • the processor 910 may be connected to the memory 920 .
  • the memory 920 can be used to store program codes and data. The memory 920 may be a storage unit inside the processor 910, an external storage unit independent of the processor 910, or a component including both a storage unit inside the processor 910 and an external storage unit independent of the processor 910.
  • the artificial intelligence processing device 900 may also include a bus.
  • the memory 920 and the communication interface 930 may be connected to the processor 910 through a bus.
  • the bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (Extended Industry Standard Architecture, EISA) bus or the like.
  • the bus can be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus.
  • the processor 910 may be a central processing unit (central processing unit, CPU).
  • the processor can also be other general-purpose processors, digital signal processors (digital signal processors, DSPs), application specific integrated circuits (application specific integrated circuits, ASICs), off-the-shelf programmable gate arrays (field programmable gate arrays, FPGAs) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the processor 910 uses one or more integrated circuits for executing related programs.
  • the processor accesses at least one of the aforementioned computing accelerators through the interface circuit, so that the at least one computing accelerator performs the operation specified in the program code on the data, and the result of the operation is returned to the processor 910 or the memory 920.
  • the memory 920 may include read-only memory and random-access memory, and provides instructions and data to the processor 910 .
  • a portion of processor 910 may also include non-volatile random access memory.
  • processor 910 may also store device type information.
  • the processor 910 executes the computer-executed instructions in the memory 920 to execute the operation steps of the above method.
  • the artificial intelligence processing device 900 may correspond to the respective subjects performing the methods according to the embodiments of the present application, and the above-mentioned and other operations and/or functions of the modules in the artificial intelligence processing device 900 are respectively intended to implement the corresponding processes of the methods in these embodiments; for the sake of brevity, details are not repeated here.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may physically exist separately, or at least two units may be integrated into one unit.
  • if the functions are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to make a computing device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include media that can store program codes, such as a USB flash drive (U disk), a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
  • the embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program is used to execute the method of the present application, and the method includes at least one of the solutions described in the above embodiments.
  • the computer storage medium in the embodiments of the present application may use any combination of one or more computer-readable media.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof. More specific examples (non-exhaustive list) of computer readable storage media include: electrical connections with one or more leads, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), Erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a data signal carrying computer readable program code in baseband or as part of a carrier wave. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

本申请实施方式涉及车辆智能化中的人工智能芯片技术领域,特别涉及运算加速的处理方法、运算加速器的使用方法及运算加速器。该运算加速器,包括:存储单元,被配置有至少一个向量指令队列,每个向量指令队列分别用于缓存一个或多个向量指令;至少两个标量计算单元,每个标量计算单元分别用于获取指令并对指令译码得到译码后的指令,译码后的指令包括向量指令,每个标量计算单元还分别用于将向量指令缓存至至少一个向量指令队列;向量计算单元,用于执行向量指令队列中的向量指令。由上,标量计算单元的数量增加之后,可以使得读取的向量指令的数量增加,增加向向量计算单元发射的向量指令的数量,由此提高向量计算单元的利用率,进而整体上提高运算加速器的性能。

Description

运算加速的处理方法、运算加速器的使用方法及运算加速器 技术领域
本申请涉及车辆智能化中的计算机体系结构及人工智能技术领域,尤其涉及人工智能芯片技术领域,特别涉及运算加速的处理方法、运算加速器的使用方法及运算加速器。
背景技术
目前人工智能(Artificial Intelligent,AI)领域的深度学习(Deep Learning,DL)及神经网络(Neural Network,NN)对计算资源的算力需求越来越高。但目前的通用计算(General-Purpose Computing,GP)硬件,如中央处理器(Central Processing Unit,CPU)已经无法完成一部分大型神经网络所需要的大规模及高吞吐量的计算要求。针对高吞吐量场景中的神经网络应用,通常选用设置有多个AI处理器的AI芯片或AI计算设备来满足神经网络训练或神经网络推理所需要的大规模计算生成的算力需求。
在应用AI芯片或AI计算设备之前,会利用算子代码自动生成软件,如神经网络计算结构CANN(Compute Architecture for Neural Networks,CANN)来生成或设计神经网络算法。而在实施神经网络算法时,会通过自动编译工具或计算调度模板,如张量虚拟机(Tensor Virtual Machine,TVM)实现针对目标AI芯片或AI计算设备的计算编排。相对于多种多样的待处理数据和应用场景,在计算编排之后,可能会出现AI处理器设置的宽度固定的SIMD管线与待处理数据相比,管线过宽的情形。这时,将对使用目标AI芯片或AI计算设备执行计算密集型AI任务产生不利影响。
因此,有必要针对AI处理器中宽度固定的SIMD管线,设计可以根据待处理数据灵活使用SIMD管线中尽可能大的宽度的方法,以提高SIMD管线的利用率,消除在AI处理器执行计算密集型AI任务的不利影响。
发明内容
本申请提供运算加速器、运算加速的处理方法、运算加速器的使用方法、人工智能处理器、人工智能处理设备及电子装置,能够提高运算加速器中向量计算单元的利用率,提高向量计算单元的单次执行效能,进而整体上提高运算加速器的性能。
本申请第一方面提供一种运算加速器,包括:存储单元,被配置有至少一个向量指令队列,每个向量指令队列分别用于缓存一个或多个向量指令;至少两个标量计算单元,每个标量计算单元分别用于获取指令并对指令译码得到译码后的指令,译码后的指令包括向量指令,每个标量计算单元还分别用于将向量指令缓存至至少一个向量指令队列;向量计算单元,用于执行向量指令队列中的向量指令。
由上,由至少两个标量计算单元读取向量指令并对向量指令译码,译码后的向量指令缓存在存储单元配置的向量指令队列,并由向量计算单元执行译码后的向量指令。 如此,标量计算单元的数量增加之后,可以使得读取的向量指令的数量增加,增加向向量计算单元发射的向量指令的数量,由此提高向量计算单元的利用率。
作为第一方面的一种可能的实现方式,向量计算单元,用于在一个执行周期内执行向量指令队列中至少两个向量指令。
由上,与向量计算单元在一个执行周期内执行单个译码后的向量指令相比,向量计算单元在一个执行周期内执行至少两个译码后的向量指令。如此,提高了向量指令的处理效率,及提高了向量计算单元的利用率。
作为第一方面的一种可能的实现方式,在一个执行周期内执行的至少两个译码后的向量指令,具有相同的执行延时。
由上,在一个执行周期内执行的至少两个译码后的向量指令,具有相同的执行延时。如此,可以避免因向量指令具有不同的执行延时而导致在向量计算单元执行到写回过程时向量指令之间出现分叉,进而保持向量计算单元执行写回过程时的控制逻辑不变。
作为第一方面的一种可能的实现方式,还包括组装单元,用于组装向量指令队列中至少两个向量指令得到组装后的向量指令,并将组装后的向量指令提供给向量计算单元执行。
由上,由组装单元组装至少两个译码后的向量指令,并由向量计算单元执行组装后的向量指令。与执行单个译码后的向量指令相比,向量计算单元执行组装后的向量指令,提高了向量指令的处理效率,及提高了向量计算单元的利用率。
作为第一方面的一种可能的实现方式,向量指令队列为至少两个,与至少两个标量计算单元分别一一对应;组装单元包括逻辑存储模块和组装模块,逻辑存储模块被配置有至少一个组装队列,组装队列用于缓存从至少两个向量指令队列提取的译码后的向量指令;组装模块用于按照执行延时从组装队列提取至少两个译码后的向量指令并组装。
由上,至少两个标量计算单元译码后的向量指令分别缓存在各自对应的至少两个向量指令队列中,组装单元配置的组装队列缓存从这至少两个向量指令队列提取的向量指令,并由组装模块按照执行延时对从组装队列提取任意至少两个向量指令,并组装。这时,每一个标量计算单元及其对应的向量指令队列作为一个整体,其他标量计算单元及其对应的向量指令队列作为其他整体,各整体彼此独立,并行执行,提高了各标量计算单元整体上的执行效率;组装单元的组装模块利用配置有组装队列的逻辑存储模块,组装从向量指令队列提取的向量指令,能够灵活、可靠、高效地组装向量指令。而按照执行延时组装向量指令,则有利于提高组装后的向量指令由向量计算单元执行时的处理效率。
作为第一方面的一种可能的实现方式,还包括:数据搬运单元;存储单元还被配置有至少一个数据搬运指令队列,每个数据搬运指令队列分别用于缓存一个或多个数据搬运指令;译码后的指令还包括数据搬运指令,每个标量计算单元将数据搬运指令缓存至至少一个数据搬运指令队列;数据搬运单元用于执行译码后的数据搬运指令。
由上,由至少两个标量计算单元读取数据搬运指令并对数据搬运指令译码,译码后的数据搬运指令缓存在存储单元配置的数据搬运指令队列,并由数据搬运单元执行译码后的数据搬运指令。标量计算单元的数量增加之后,可以使得读取到的数据搬运指令的数量增加,可以增加向数据搬运单元提供的数据搬运指令的数量,由此提高向量计算单元的利用率。
作为第一方面的一种可能的实现方式,译码后的指令还包括标量指令,每个标量计算单元还用于执行标量指令。
由上,由至少两个标量计算单元读取标量指令、对标量指令译码并执行译码后的标量指令。各标量计算单元分别执行译码后的标量指令,整体上协调地实现运算加速器的程序流程控制。标量计算单元的数量增加之后,可以使得读取的标量指令的数量增加,由此提高标量指令的处理效率。
作为第一方面的一种可能的实现方式,至少两个标量计算单元,包括主标量计算单元与至少一个从标量计算单元,主标量计算单元用于控制各从标量计算单元的启动或停止,或控制各标量计算单元之间的同步。
由上,这至少两个标量计算单元构成集群,在该集群中,预先指定有主标量计算单元、至少一个从标量计算单元。指定的主标量计算单元控制从标量计算单元的启动或停止,或控制包括主标量计算单元在内的至少两个标量计算单元同步,从而实现集群内各标量计算单元按照指定的角色启动或停止,并按照指定的角色执行同步功能,逻辑清晰、运行可靠,实现了运算加速器的协调运行。
本申请第二方面提供一种运算加速的处理方法,包括:通过至少两个标量计算单元的每个标量计算单元分别获取指令并对指令译码得到译码后的指令,译码后的指令包括向量指令;通过每个标量计算单元将向量指令缓存至至少一个向量指令队列,向量指令队列配置在存储单元中,每个向量指令队列分别用于缓存一个或多个向量指令;通过向量计算单元执行向量指令队列中的向量指令。
作为第二方面的一种可能的实现方式,通过向量计算单元在一个执行周期内执行向量指令队列中至少两个向量指令。
作为第二方面的一种可能的实现方式,在一个执行周期内执行的至少两个译码后的向量指令,具有相同的执行延时。
作为第二方面的一种可能的实现方式,还包括:通过组装单元组装向量指令队列中至少两个向量指令得到组装后的向量指令,并将组装后的向量指令提供给向量计算单元执行。
作为第二方面的一种可能的实现方式,向量指令队列为至少两个,与至少两个标量计算单元分别一一对应;组装单元包括逻辑存储模块和组装模块;通过逻辑存储模块配置的组装队列缓存从至少两个向量指令队列提取的译码后的向量指令;通过组装模块按照执行延时从组装队列提取至少两个译码后的向量指令并组装。
作为第二方面的一种可能的实现方式,译码后的指令还包括数据搬运指令,还通过每个标量计算单元将数据搬运指令缓存至至少一个数据搬运指令队列;数据搬运指 令队列配置在存储单元中,每个数据搬运指令队列用于缓存数据搬运指令;还通过数据搬运单元执行译码后的数据搬运指令。
作为第二方面的一种可能的实现方式,指令还包括标量指令,还通过每个标量计算单元执行标量指令。
作为第二方面的一种可能的实现方式,至少两个标量计算单元,包括主标量计算单元与至少一个从标量计算单元,还通过主标量计算单元控制各标量计算单元的启动或停止,或控制各标量计算单元之间的同步。
作为第二方面的一种可能的实现方式,通过主标量计算单元控制各标量计算单元的启动,包括:主标量计算单元在运行到标识多工作单元模式的启动指令时,根据启动指令中记载的数量,控制对应数量的从标量计算单元的启动。
本申请第三方面提供一种运算加速器的使用方法,包括:根据指定的工作单元的数量M,生成指向M个工作单元的标识;使用在第一方面说明的运算加速器支持的至少一个向量运算函数来处理数据,向量运算函数的至少一个参数引用标识。
由上,在使用运算加速器进行运算加速时,根据指定的工作单元的数量M,生成指向M个工作单元的标识,并使用运算加速器支持的至少一个向量运算函数处理数据,如,通过向量运算函数的至少一个参数引用该标识的方式来分配各工作单元分别处理的数据。这时,指定使用该运算加速器的多工作单元模式,并分配各工作单元分别处理的数据,以实现并行处理数据。该使用运算加速器进行运算加速的方法,便于实现结构完整、逻辑清晰的程序代码,人机友好程度高,易于使用,有利于推广该运算加速器在产业内广泛应用。
作为第三方面的一种可能的实现方式,还包括:使用运算加速器支持的至少一个数据搬运函数来处理数据,数据搬运函数的至少一个参数引用标识;或使用运算加速器支持的同步等待函数指定M个工作单元同步。
由上,在使用运算加速器进行运算加速时,通过数据搬运函数的至少一个参数引用标识的方式来分配各工作单元分别处理的数据并执行M个工作单元因所分配处理的数据存在依赖而同步等待。这时,指定使用该运算加速器的多工作单元模式,并分配各工作单元分别处理的数据,以实现并行处理数据并在数据存在依赖时指定各工作单元同步等待。该使用运算加速器进行运算加速的方法,便于实现结构完整、逻辑清晰的程序代码,人机友好程度高,易于使用,有利于推广该运算加速器在产业内广泛应用。
本申请第四方面提供一种人工智能处理器,包括:至少一个在第一方面说明的运算加速器;处理器,以及存储器,其上存储有数据和程序,程序当被处理器执行时,使得至少一个运算加速器针对数据执行程序中指定的运算,并将运算的结果返回处理器或存储器。
如上,该人工智能处理器集成有至少一个运算加速器,在执行基于人工智能的应用程序来处理数据时,这些至少一个运算加速器针对数据执行程序中指定的运算,并将运算的结果返回处理器或存储器,集成方便,使用灵活。
本申请第五方面提供一种人工智能处理设备,包括:处理器,以及接口电路,其 中,处理器通过接口电路访问存储器,存储器存储有程序和数据,程序当被处理器执行时,处理器通过接口电路访问至少一个在第一方面说明的运算加速器,使得至少一个运算加速器针对数据执行程序中指定的运算,并将运算的结果返回处理器或存储器。
如上,该人工智能处理设备设置通过总线访问的运算加速器,在其处理器执行基于人工智能的应用程序来处理数据时,这些至少一个运算加速器针对数据执行应用程序中指定的运算,并将运算的结果返回处理器或存储器,集成方便,使用灵活。
本申请第六方面提供一种电子装置,包括:处理器,以及存储器,其上存储有程序,程序当被处理器执行时,执行在第三方面说明的方法。
本申请的这些和其它方面在以下(多个)实施例的描述中会更加简明易懂。
附图说明
以下参照附图来进一步说明本申请的各个特征和各个特征之间的联系。附图均为示例性的,一些特征并不以实际比例示出,并且一些附图中可能省略了本申请所涉及领域的惯常的且对于本申请非必要的特征,或是额外示出了对于本申请非必要的特征,附图所示的各个特征的组合并不用以限制本申请。另外,在本说明书全文中,相同的附图标记所指代的内容也是相同的。具体的附图说明如下:
图1A为某型特定域架构的运算加速器的架构示意图;
图1B为图1A所示的运算加速器的执行示意图;
图1C为图1A所示的运算加速器的Vector单元的利用率示意图;
图2A为本申请实施例的运算加速器的组成示意图;
图2B为本申请实施例的运算加速器的另一组成示意图;
图2C为本申请实施例的运算加速器的又一组成示意图;
图3为本申请实施例的运算加速器的使用方法的流程示意图;
图4A为本申请实施例的又一运算加速器的组成示意图;
图4B为图4A的运算加速器的执行示意图;
图4C为图4A的运算加速器的Vector指令组装单元的组成示意图;
图4D为图4A的运算加速器的MTE指令合并单元的组成示意图;
图5A为本申请实施例的运算加速器进行向量指令组装以合并处理数据的示意图;
图5B为本申请实施例的运算加速器进行向量指令组装以增加Vector单元利用率的示意图;
图5C为本申请实施例的运算加速器支持的向量指令的参数含义示意图;
图6A为本申请实施例的运算加速器支持的在多核模式之外执行多工作单元模式时的代码示意图;
图6B为本申请实施例提供的运算加速器支持的多核模式之内执行多工作单元模式时的代码示意图;
图7A为本申请实施例提供的运算加速器作为端设备使用时的连接示意图;
图7B为本申请实施例提供的运算加速器作为单板工作在RC模式时的连接示意图;
图8为本申请实施例提供的人工智能处理设备的组成示意图。
具体实施方式
下面结合附图并举实施例,对本申请提供的技术方案作进一步说明。应理解,本申请实施例中提供的系统结构和业务场景主要是为了说明本申请的技术方案的可能的实施方式,不应被解读为对本申请的技术方案的唯一限定。本领域普通技术人员可知,随着系统结构的演进和新业务场景的出现,本申请提供的技术方案对类似技术问题同样适用。
应理解,本申请实施例提供的运算加速方案,包括运算加速器、运算加速的处理方法及人工智能处理系统、电子装置、计算设备、计算机可读存储介质、计算机程序产品。由于这些技术方案解决问题的原理相同或相似,在如下具体实施例的介绍中,某些重复之处可能不再赘述,但应视为这些具体实施例之间已有相互引用,可以相互结合。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。如有不一致,以本说明书中所说明的含义或者根据本说明书中记载的内容得出的含义为准。另外,本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
为了更加方便理解本方案,在对本申请实施例提供的运算加速器进行介绍之前,先对本申请实施例中出现的如下名词进行介绍:
图1A给出了某型特定域架构(Domain Specific Architecture,DSA)的AI运算加速器,也即AI核(AI Core)的一种架构示例。AI核可以包括计算单元、存储系统以及系统控制模块。其中,计算单元可以提供多种基础计算资源,用于执行矩阵计算、向量计算以及标量计算等,例如,可以包括矩阵计算单元800(Cube Unit,以下称Cube单元)、向量计算单元300(Vector Unit,以下称Vector单元)和标量计算单元200(Scalar Unit,以下称Scalar单元)。Cube单元执行矩阵运算指令(以下称Cube指令),完成矩阵乘运算、矩阵加运算;Vector单元执行向量运算指令(以下称Vector指令或向量指令),完成向量类型的运算,例如,向量的相加、相乘等;Scalar单元负责各类型的标量计算和程序流程控制,以多级流水线完成循环控制、分支判断、同步等程序流程控制,以及Cube指令或Vector指令的地址和参数计算以及基本的算术运算等。
如图1B所示,各计算单元分别形成各自独立的执行流水线,在运行在设备操作系统(Device OS)中的系统控制模块13(System Control)的统一调度下,互相配合达到优化的计算效率。
存储系统包括AI核内部存储(In Core Buffer)和相应的数据通路,其中数据通路可以包括AI核完成一次计算任务中的数据的流通路径,如总线接口单元14(Bus Interface Unit,BIU)。AI核内部存储(In Core Buffer)作为AI核内部的存储资源,与总线接口单元14相连,从而经BIU和总线(Bus)获得外部的数据,如,将二级缓冲区(L2Buffer,L2)、双倍速率同步动态随机存储器(Double Data Rate Synchronous Dynamic Random Access Memory,DDR SDRAM,简称DDR)、高带宽存储器(High  Bandwidth Memory,HBM)、全局存储器(Global Memory,GM)等内的数据传输到AI核内部存储。图1A中,AI核内部存储包括一级缓冲区121(L1 Buffer),统一缓冲区122(Unified Buffer,UB),通用寄存器123(General-Purpose Register,GPR),专用寄存器124(Special-Purpose Register,SPR)等。如图1A所示,指令缓存125(Instruction Cache,I Cache)是AI Core内部的指令缓存区,用户针对AI任务设计的神经网络算法经过计算编排后生成的指令序列存放在指令缓存内。在存储访问时,一级缓冲区的优先级高于二级缓冲区。AI核内部存储还包括提供临时变量的高速寄存器单元,这些高速寄存器单元可以位于前述的各计算单元中,如图4C中向Vector单元提供操作数的源寄存器工作空间(Src Reg Workspace)和目标寄存器工作空间(Dst Reg Workspace)。
为了配合数据搬运和存储转换,AI Core中还可以设置数据搬运单元600(Memory Transfer Engine,以下简称MTE单元)。MTE单元执行数据搬运指令(以下称MTE指令),可以以极高的效率实现数据格式的转换,还可以完成AI核外部存储到AI核内部存储、AI核内部存储到AI核外部存储及AI核内部存储的不同缓冲区(Buffer)之间的数据搬运。图1A中示出的数据搬运单元600,具体实施时,可以包括多个MTE单元,分别用于实现不同类型的数据搬运。
该AI Core的工作原理如图1B所示,指令序列以“顺序读指令,并行执行指令”的调度方式,被取指、译码及执行。指令序列被Scalar单元顺序读取并译码。译码后的指令包括Scalar指令、Cube指令、Vector指令、和MTE指令。Scalar指令中,与程序流程控制有关的指令会被发射至标量指令处理队列(Processing Scalar Queue,PSQ)译码及后续执行,与标量计算有关的指令由Scalar单元直接执行。Cube指令、Vector指令、MTE指令经过Scalar单元处理之后,操作数的地址、操作涉及的各参数等已经配置完成。之后,指令发射单元15(Instruction Dispatch)将各指令分别分发到对应的指令队列,并调度相应的计算单元来并行执行。通常,向量指令队列内缓存的指令由Vector单元执行时,具有不同的执行延时。通常,MTE指令队列内缓存的指令由MTE单元执行时,具有不同的执行延时。
如图1B所示,Cube指令被分发到矩阵指令队列161(Cube指令队列,以下称Cube Queue),由其缓存各译码后的矩阵运算指令;Vector指令被分发到向量指令队列162(向量指令队列,以下称Vector Queue),由其缓存各译码后的向量运算指令;MTE指令被分发到存储转换指令队列163(MTE指令队列,以下称MTE Queue),由其缓存各译码后的数据搬运指令。各队列中缓存的指令后续由对应的计算单元来执行。
如图1B所示,对于标量指令队列、向量指令队列、矩阵指令队列、数据搬运指令队列,不同队列的指令流水线是并行执行,提高了指令执行效率。具体执行时,除了标量指令处理队列,不同指令队列间指令执行顺序可以是顺序或乱序的,但是队列内部指令必须为顺序执行。
如图1A所示,AI Core中还可以设置事件同步模块17(Event Synchronized)。在相同指令队列内的指令的执行过程中,出现依赖关系或者有强制的时间先后顺序时,通过事件同步模块17来调整和维护指令的执行顺序要求。事件同步模块可以通过软 件控制,例如通过插入同步函数或同步指令的方式来指定每一条流水线的执行时序,从而达到调整指令执行顺序的目的。
其中,Vector单元可以为单指令多数据(Single Instruction Multi Data,SIMD)计算单元,执行以SIMD指令模型提交的向量运算请求,在32位浮点(Float Point,FP)FP32模式实现64指令管线(Lanes)的宽度,或者在16位浮点FP16模式实现128指令Lanes的宽度,可以根据待处理数据的精度自适应地调整Vector单元执行一次可以处理的数据的宽度,如,在FP32模式,一次执行最多可以处理64个FP32精度的数据;如,在FP64模式,一次执行最多可以处理32个FP64精度的数据,具有强大、灵活的向量运算能力。
另外,通过设置SIMD指令模型中的掩码(mask)参数的值可以灵活调整本次指令执行所占用的SIMD管线的宽度,以适应待处理数据的尾块大小或数据量。mask参数的值与当前运算精度时,Vector单元提供的最大SIMD管线的宽度之比,也即本次指令执行Vector单元的利用率。
如前述,AI Core初始化之后,Vector单元以独立的执行流水线,按照固定时钟运行。如果没有接收到有效的向量运算请求,Vector单元执行空操作(No Operation,NOP),执行空操作时,Vector单元的利用率为零。
该Vector单元在执行浮点16位精度(Float Point 16bit,也即2Byte用来表示一个浮点数,记为FP16)的向量运算请求时,Vector单元提供的最大SIMD管线的宽度为128Lanes。如果以SIMD指令模型提交的向量运算请求中,mask参数的值设定为128,则Vector单元在一个执行周期中在128个指令Lanes内针对源操作数或目标操作数执行指定的操作(Operation,OP),Vector单元的利用率为100%。
将mask参数的值设定为16,是在软件编写的过程中确定并实施的。例如,在输入数据的内存格式为“N,C/16,H,W,16”的排布下,需要针对输入数据进行最近邻插值计算时,在某些情况下,用户只能生成以源数据1*16填充目的数据的SIMD指令模型。如图1C所示,在mask参数的值设定为16时,只有前16个Lanes参与运算,剩余112个Lanes闲置,Vector单元的利用率为1/8。类似的,在mask参数的值分别设定为32或64时,Vector单元的利用率为分别为1/4或1/2。
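作为一个最简化的示意,上述mask参数的值与Vector单元利用率之间的关系可以用如下Python草图表示(该代码仅为说明用途的假设性示例,并非专利原文或CANN/TIK的实际接口;其中FP16对应128Lanes、FP32对应64Lanes取自上文描述):

    # 示意:按当前精度下的最大SIMD管线宽度估算本次指令执行时Vector单元的利用率
    MAX_LANES = {"FP16": 128, "FP32": 64}

    def vector_utilization(mask: int, precision: str = "FP16") -> float:
        max_lanes = MAX_LANES[precision]
        if not 0 < mask <= max_lanes:
            raise ValueError("mask应在(0, 最大Lanes]范围内")
        return mask / max_lanes

    # 与上文示例一致:FP16精度下mask=16时利用率为1/8,mask=32为1/4,mask=64为1/2,mask=128为100%
    for m in (16, 32, 64, 128):
        print(m, vector_utilization(m, "FP16"))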
因此,针对以上Vector单元的管线宽度相对于用户当前使用的SIMD指令模型中指定的数据过大的情形,本申请实施例的方案提供可以根据待处理数据灵活使用SIMD管线中尽可能大的宽度的方法。
如前述,SIMD指令模型中mask参数的值与数据量、数据的尾块大小等因素相关。本申请实施例提供的方案,可针对数据量充裕的情形,通过指令组装(包括操作组装和操作数合并),提高Vector单元的利用率,提高其单次执行效能,进而整体上提高运算加速器的性能,及增强对算子代码自动生成软件或计算调度模板的适用性。
本申请实施例提供的方案中,如图4A所示,运算加速器中设置有主Scalar单元和至少一个从Scalar单元,二者分别取指、译码,得到译码后的Vector指令。该运算加速器将至少两个译码后的Vector指令组装为一个宽指令,该宽指令中指定两个或多个执行延时相同的向量运算对应的操作,还指定了这至少两个Vector指令的源操作数地址或目标操作数地址。该运算加速器中设置的Vector单元在一个执行周期中执行该宽指令,在一个执行周期中Vector单元处理的数据更多,利用率更高,单次执行效能更高。
本申请实施例提供的方案中,如图4A所示,运算加速器设置的主Scalar单元和至少一个从Scalar单元还分别取指、译码,得到译码后的MTE指令。该运算加速器还将至少两个译码后的MTE指令组装为一个宽指令,该宽指令中指定两个或多个MTE操作,还指定了这至少两个译码后的MTE指令的源操作数或目标操作数。该运算加速器中设置的MTE单元在一个执行周期中执行该宽指令,在一个执行周期中MTE单元处理的数据更多,单次执行效能更高。
本申请实施例提供的方案,还提供了多工作单元(worker)模式编程模型来使用目标运算加速器提供的主Scalar单元和至少一个从Scalar单元。该多工作单元(worker)模式编程模型中,分别提供了利用工作单元的标识(worker_id)来指定向量运算函数、数据搬运函数和同步等待函数中的至少一个参数的方法。当根据处理的数据,确定使用目标运算加速器提供的多工作单元(worker)模式时,根据确定的工作单元的数量M,创建指向M个工作单元的标识(worker_id);引用工作单元的标识作为函数中的参数,使用向量运算函数或数据搬运函数编写处理数据的代码。在引用工作单元的标识来分别指定其处理数据中的不同部分时,针对存在的数据依赖,还可以设置同步等待函数。
以上编写的处理数据的代码,经过编译,得到指令序列。在使用安装有驱动程序的目标运算加速器运行该指令序列时,AI Core设置的Vector单元的利用率更高,单次执行效能更高,以及,MTE单元在一个执行周期中处理的数据更多,单次执行效能更高,进而整体上提高了目标运算加速器的性能。
下面,将参见附图,对本申请实施例进一步详细说明:
如图2A所示,本申请一个实施例的运算加速器,包括:存储单元100,被配置有至少一个向量指令队列100A,每个向量指令队列分别用于缓存一个或多个向量指令;至少两个标量计算单元200,每个标量计算单元分别用于获取指令并对指令译码得到译码后的指令,译码后的指令包括向量指令,每个标量计算单元200还分别用于将向量指令缓存至至少一个向量指令队列;向量计算单元300,用于执行向量指令队列中的向量指令。
在一些实施例里,该运算加速器可以为搭载有至少一个前述AI Core的AI芯片或AI单板,该AI芯片或AI单板包括但不限于神经网络处理器(Neural-network Processing Unit,NPU)、图形处理器(Graphics Processing Unit,GPU)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程逻辑门阵列(Field Programmable Gate Array,FPGA)等。
在一些实施例里,标量计算单元可以为运算加速器上的CPU,在一些实施例里,标量计算单元200可以为前述的Scalar单元,该标量计算单元以流水线的方式,进行取指、译码、执行、访存和写回等操作。
在一些实施例里,以上各标量计算单元200分别用于获取指令并对指令译码,指令包括向量指令,具体而言,是指各标量计算单元根据预先生成的指令序列中各指令 引用的工作单元的标识(worker_id)对应地获取指令,并对获取的指令译码。在以指定的工作单元的数量M为总循环次数的每一个循环内,各指令通过引用数值递增的worker_id,来指定不同的数据部分作为其源操作数。随后,引用数值相同的worker_id的同一个指令或不同的指令,被组装为一个宽指令,具体组装的方法参见后述说明。
以上,参考前述说明,运算加速器中设置的M个标量计算单元用于加速运算时,需要预先在应用程序中通过分别调用M个工作单元(worker)对应的标识的方式来指定启动由M个标量计算单元组成的集群的工作模式。因此,在指定的工作单元的数量为M时,至少两个标量计算单元为M个,分别用于获取指令并对指令译码,在不考虑各个标量计算单元之间的同步等待等其他占用处理时间的情形时,实现了单位时间内对指令序列取指和译码的效率提高至原来的M倍,为后续更高效地利用向量计算单元或数据搬运单元的处理能力提供了充裕的数据量和指令数。
以上在存储单元内配置的向量指令队列,在一些实施例中,可以是全部或多个标量计算单元共同地配置一个向量指令队列100A,也即,标量计算单元的数量为M,而向量指令队列的数量为1。这时,各标量计算单元分别译码后的指令,按照译码时刻的先后顺序,依次缓存在共同的向量指令队列中100A。在另一些实施例中,可以是各标量计算单元200分别一一对应地配置有一个向量指令队列100A,也即,标量计算单元的数量为M,向量指令队列的数量也为M。这时,各标量计算单元分别译码后的指令,分别独立地缓存在对应的向量指令队列100A中。
以上存储单元利用缓存机制,实现了取走向量指令队列内的指令去执行这个第一事务,与向量指令队列内增加新的译码后的向量指令这个第二事务,使得第一事务和第二事务可以同时或间隔地进行,保证了向量指令队列内的指令数量和指令的内容能够持续更新,支持各标量计算单元并行地取码、译码及存放译码后的指令,整体上提高了该运算加速器的执行效率。与之前在向量指令队列内的缓存的指令只来自一个标量计算单元200不同,这时,向量指令队列100A内的缓存的指令来自至少两个标量计算单元200,因此需要实现更复杂的执行逻辑和业务调度。
在一些实施例里,向量计算单元300单次执行的译码后的向量指令,可以是任一个标量计算单元译码后的指令,也可以是任意多个标量计算单元译码后的指令经组装后的一条宽指令。
该实施例的运算加速器,利用至少两个标量计算单元,分别获取指令并对指令译码,由存储单元缓存各标量计算单元的译码后的向量指令,并由向量计算单元执行译码后的向量指令,从而形成了稳定高效的处理流水线,整体上具有高的执行效率,并且,标量计算单元的数量增加之后,更多的标量计算单元可以取码并译码更多的向量指令,从而增加向向量计算单元提供的向量指令的数量,由此可以提高向量计算单元的利用率。
在一些实施例中,向量计算单元,也即前述的Vector单元,用于执行译码后的向量指令,包括:向量计算单元300用于在一个执行周期内执行至少两个译码后的向量指令。
在一些实施例中,译码后的向量指令可以采用SIMD指令模型。SIMD指令模型多种多样,分别对应于各自的执行周期。这里的执行周期,是指Vector单元执行当前SIMD指令模型中指定的操作时,所需的执行延时。执行延时,是指某SIMD指令模型由Vetcor单元执行时所需要的时钟周期数。如,Vetcor单元执行浮点除法时的执行延时为5个时钟周期,那么,在5个连续的时钟周期内,Vector单元被占用,来执行浮点除法对应的操作。
由上,向量计算单元300在一个执行周期内T执行至少两个译码后的向量指令,与向量计算单元在一个执行周期内T执行单个译码后的向量指令相比,提高了向量指令的处理效率,及提高了向量计算单元的利用率。
在一些实施例中,若至少两个译码后的向量指令分别在不同的时刻提交给Vector单元单独执行时,分别具有确定的执行延时。这些执行延时可能相同,也可能不同。
若在一个执行时刻同时开始执行的至少两个译码后的向量指令具有至少两个不同的执行延时,则向量计算单元本次执行时的执行周期为这至少两个不同的执行延时中的最大值。如上,这至少两个具有不同的执行延时的向量指令被同时提交给Vector单元去执行,尽管提高了Vector单元的利用率,但因为具有更长的执行延时,从而整体上可能会降低Vector单元的单次执行效能。
在一些实施例中,在一个执行周期内执行的至少两个译码后的向量指令,具有相同的执行延时。这时,Vector单元在一个执行周期内,同时执行具有相同的执行延时的至少两个向量指令,对应的执行延时与分别单独执行各指令时的执行延时相同,但在一个执行周期内处理的数据更多,Vector单元的利用率更高,单次执行效能更高,进而整体上提高了AI Core的性能,及运算加速器的性能。
由上,在一个执行周期内执行的至少两个译码后的向量指令,具有相同的执行延时,可以避免因向量指令具有不同的执行延时而导致在向量计算单元执行到写回过程时向量指令之间出现分叉,进而保持向量计算单元执行写回过程时的控制逻辑不变。
在一些实施例中,如图2B所示,本申请实施例的运算加速器还可以包括向量指令组装单元400,用于组装至少两个译码后的向量指令;向量计算单元还用于执行组装后的向量指令。
在一些实施例中,向量指令组装单元400可以分别设置在主Scalar单元或从Scalar单元中。如前述说明及图1B所示,指令序列内的向量指令经过Scalar单元处理之后,源操作数和/或目标操作数的地址、参数(如掩码mask,反复次数repeat,跳转量stride)等都已经配置好,并在Vector单元空闲时直接发射给Vector单元去执行。在主Scalar或从Scalar继承地具有图1B中Scalar单元的功能的基础之上,只需要增加指令组装功能即可实现前述组装单元的功能。这时,主Scalar单元或每一个从Scalar单元分别具有指令组装功能,并组装由自身的Scalar单元取指及译码后的向量指令,并分别轮流地将组装后的宽指令发射给Vector单元去执行。这时,各Scalar单元分别独立地组装指令,宽指令均来自自身的Scalar单元取指及译码后的向量指令,指令的来源单一,操作的数据的集中性强,增加了数据依赖程度,导致组装效率偏低;各Scalar单元分别独立地具有指令组装功能,系统构成更复杂,不利于整体上提高执行效率。
在一些实施例中,如图2B所示,可针对全部或多个主Scalar单元和从Scalar单元设置共同的向量指令组装单元400,并由向量指令组装单元400组装至少两个译码后的向量指令,并提供经组装后的向量指令,也即前述的宽指令,给向量计算单元,以由向量计算单元,也即,前述的Vector单元执行组装后的向量指令。
这时,向量指令组装单元400可以提取全部或多个主Scalar单元和从Scalar单元译码后的向量指令,指令来源更多样化,指令操作的数据的分散性更大,Packer集中地组装各Scalar单元分别独立地译码后的向量指令,更有利于高效地筛选适合组装的指令,提高Vector单元的利用率。
在一些实施中,如图2B所示,向量指令队列100A为至少两个,与至少两个标量计算单元200分别对应;向量指令组装单元400还可以包括逻辑存储模块410、向量指令组装模块420,逻辑存储模块410被配置有向量指令组装队列410A,向量指令组装队列410A用于缓存从至少两个向量指令队列提取的译码后的向量指令;向量指令组装模块420用于按照执行延时从组装队列提取至少两个译码后的向量指令并组装。
这时,向量指令组装单元400可以是AI Core上单独设置的一个硬件模块,其设置有逻辑存储模块410。这个逻辑存储模块410可以是下述的指令槽,还可以是任一具有相似逻辑结构的其他存储块,如,队列,链表,表格,堆栈等。这时,指令槽被配置为存放组装单元从多个向量指令队列提取的译码后的指令。
向量指令组装单元400包括的向量指令组装模块420(Vec Instruction Coalescer)用于从各向量指令队列中提取指令并放入到指令槽内缓存。这时,向量指令组装模块420,按照预先设定的规则,如先进先出,分别从至少两个向量指令队列100A中各提取一个译码后的向量指令;或从每一个向量指令队列100A中按照预先设定的规则,如先进先出,同时提取至少两个译码后的向量指令;或按照各向量指令队列100A中的指令总数量,自多到少地,提取译码后的向量指令;或按照当前指令槽内可以接收的向量指令的数量,从对应数量的向量指令队列中提取译码后的向量指令。
如上,指令槽内存放的译码后的指令可以来自同一个向量指令队列100A,或不同的向量指令队列100A。并且,这些译码后的指令可以具有相同或不同的执行延时。
在一些实施例中,向量指令组装模块420还用于按照执行延时从向量指令组装队列410A中提取至少两个译码后的向量指令并组装。
在一些实施例中,向量指令组装模块420按照指令槽内各指令的mask的值,提取指令并组装。如,根据当前计算精度下,目标运算加速器的Vector单元支持的Lanes宽度的最大值,确定提取的指令。如,增量式地提取多条指令,直到提取到的指令的mask的值的累加和最接近于Vector单元支持的Lanes宽度的最大值。
在一些实施例中,向量指令组装模块420按照指令槽内各指令的执行延时,提取待组装的指令进行组装。如,按照当前指令槽内各指令对应的执行时延,将指令分为多组。优先选择执行时延最小的一组指令进行组装,或优先选择执行时延最大的一组指令进行组装。从每一组指令中提取指令时,还可以采用前述的mask的值的累加和的方法,来保证最大化Vector单元的利用率。
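作为对上述两种提取策略的一个组合示意,下面的Python草图先按执行延时对指令槽内的指令分组,再在组内按mask的值的累加和逼近Lanes宽度最大值的方式贪心选取待组装的指令(该代码为假设性示例,并非专利原文;指令以含op、mask、latency字段的字典表示,字段名仅为示意):

    from collections import defaultdict

    def select_for_assembly(slots, max_lanes=128):
        # slots:当前指令槽内缓存的译码后的向量指令
        groups = defaultdict(list)
        for inst in slots:
            groups[inst["latency"]].append(inst)
        # 此处示意性地优先选择执行延时最小的一组,也可以按上文所述优先选择执行延时最大的一组
        latency = min(groups)
        chosen, lanes_used = [], 0
        for inst in groups[latency]:
            if lanes_used + inst["mask"] <= max_lanes:
                chosen.append(inst)
                lanes_used += inst["mask"]
        return latency, chosen, lanes_used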
在一些实施例中,组装后的宽指令中指定至少一个操作,这些指定的操作由Vector单元分别单独执行时的执行延时相同。组装后的宽指令中指定至少M个源操作数,这M个指定的源操作数的宽度相同或者不同(因为mask的值可以不同)。
在一些实施例中,向量指令组装单元400向指令槽内提取指令是第三事务,将指令槽内的指令组装是第四事务,在AI Core的系统控制模块或向量指令组装单元400的协调下,第三事务和第四事务可以同时或间隔处理或并行处理。
在一些实施例中,如图4C所示,向量指令组装模块420还用于将组装后的宽指令发送到AI Core为Vector单元配置的向量指令执行队列(Vector Execution Queue)。
在Vector单元300执行宽指令之前,需要准备操作数。在一些实施例中,如图4C所示,向量指令组装单元400还包括Vec指令发射模块430,其用于将指定的源操作数从UB内提取到操作数缓冲区,以由向量计算单元读取,这时,组装后的至少两个译码后的向量指令分别指定各自的源操作数;并指定目标操作数在操作数缓冲区内的地址。如,将源操作数提取到AI Core为Vector单元配置的操作数缓冲区,如源寄存器工作空间;如,在AI Core为Vector单元配置的操作数缓冲区,如,目标寄存器工作空间内指定目标操作数的地址。
在操作数缓冲区可缓存的数据量大于向量指令执行队列内可缓存的指令数量时,Vec指令发射模块430先从向量指令执行队列中提取指令,随后参照前述方法准备源操作数和目标操作数。而在操作数缓冲区可缓存的数据量小于向量指令执行队列内可缓存的指令数量时,Vec指令发射模块430先参照前述方法准备源操作数和目标操作数,并在数据准备好之后,从向量指令执行队列中提取指令发射到Vector单元执行。
如前述,在检测到Vector单元空闲时,Vec指令发射模块430将数据准备好的向量指令发射到Vector单元300去执行。
在一些实施例中,在Vector单元执行宽指令之后,Vec指令发射模块430还用于写回操作数。如,从目标寄存器工作空间内将目标操作数写回到统一缓冲区UB内。
在一些实施例中,Vec指令发射模块430准备操作数是一个事务,从向量指令执行队列中提取指令是另一个事务,将数据准备好的向量指令发射到Vector单元去执行是再一个事务,将目标操作数写回到UB内是又一个事务,在AI Core的系统控制模块13或向量指令组装单元400的协调下,这些多个事务被并行处理。
由上,这至少两个标量计算单元译码后的向量指令分别缓存在各自对应的向量指令队列中,向量指令组装单元配置的向量指令组装队列缓存从这至少两个向量指令队列提取的向量指令,并由向量指令组装模块按照执行延时对从向量指令组装队列提取任意至少两个向量指令,并组装。这时,标量计算单元及其对应的向量指令队列作为一个整体,与其他标量计算单元及其对应的向量指令队列彼此独立,并行执行,提高了各标量计算单元整体上的执行效率;组装单元的组装模块利用配置有组装队列的逻辑存储模块,组装从向量指令队列提取的向量指令,实现了灵活、可靠、高效地组装向量指令,而按照执行延时组装向量指令,则有利于提高组装后的向量指令由向量计算单元执行时的处理效率。
在一些实施例中,参考前述结合图1A的描述,在运算加速器中设置数据搬运单元600时,各标量计算单元200取指并译码的指令还包括数据搬运指令,相应地,如图2C所示,存储单元100还被配置有数据搬运指令队列100B,数据搬运指令队列100B用于缓存至少两个标量计算单元的译码后的数据搬运指令;数据搬运单元600还用于执行译码后的数据搬运指令。
以上存储单元,被配置有对应各标量计算单元的数据搬运指令队列,数据搬运指令队列用于缓存各标量计算单元的译码后的数据搬运指令,在一些实施例中,各标量计算单元的译码后的指令被依次送到配置在存储单元中的各标量计算单元的数据搬运指令队列100B。
在一些实施例中,可以是全部的或多个标量计算单元共同地配置有一个数据搬运指令队列,也即,标量计算单元的数量为M,而数据搬运指令队列的数量为1。这时,各标量计算单元分别译码后的指令,按照译码时刻的先后顺序,依次缓存在共同的数据搬运指令队列中。在另一些实施例中,可以是各标量计算单元分别对应地配置有一个数据搬运指令队列,也即,标量计算单元的数量为M,数据搬运指令队列的数量也为M。这时,各标量计算单元分别译码后的指令,分别独立地缓存在对应的数据搬运指令队列中。
以上利用缓存机制,实现了队列内的指令被取走执行这一事务,及队列内增加新的译码后的指令这一事务可以同时或间隔地进行,队列内的指令数量和指令内容持续更新,从而支持各标量计算单元能够分别并行地实现取码、译码及存放译码后的指令这一流水线,整体上提高了执行效率。
以上数据搬运单元,用于执行译码后的数据搬运指令。在一些实施例里,数据搬运单元单次执行的译码后的数据搬运指令,可以是任一个标量计算单元译码后的指令,也可以是任意多个标量计算单元译码后的指令经组装后的宽指令。
综上,该实施例的运算加速器,利用至少两个标量计算单元,分别获取指令并对指令译码,由存储单元缓存各标量计算单元的译码后的数据搬运指令,并由数据搬运单元执行译码后的数据搬运指令,从而形成了稳定高效的处理流水线,整体上具有高的执行效率。
在一些实施例中,参照前述说明,这至少两个标量计算单元从指令序列中取指及译码后,得到的译码后的指令或者是译码后的向量指令,或者是译码后的MTE指令,或者是译码后的Cube指令,或者是Scalar指令。
在一些实施例中,Vector单元和MTE单元分别独立执行。并且,用于组装MTE指令的合并单元和用于组装向量指令的组装单元也是分别独立执行。这可以由指令序列中的指令在编程阶段预先消除这两类指令之间可能存在的数据依赖关系来实现。
在一些实施例中,数据搬运指令包括多类数据搬运指令,对应于各标量计算单元的数据搬运指令队列是一组多类,如n类,则对应于各标量计算单元的数据搬运指令队列具有多个类别,如n类。也即,如果标量计算单元的数量为M,则数据搬运指令队列的数量为n*M。这时,针对与各标量计算单元分别对应的同一类数据搬运指令队 列,AI Core可以分别设置一个组装单元,用于组装至少两个译码后的同一类数据搬运指令,并发射经组装后的数据搬运指令。也即,如果数据搬运指令包括n类数据搬运指令,则MTE合并单元的数量为n。
在一些实施例中,数据搬运单元执行MTE指令时,并不区分MTE指令的种类。因此,在一些实施例中,MTE合并单元的数量也可以是1,与MTE单元的数量相适配。
在一些实施例中,数据搬运单元,还用于在一个执行周期内执行至少两个译码后的数据搬运指令。
这里的执行周期,是指MTE单元执行当前MTE指令中指定的操作时,所需的执行延时。执行延时,是指某MTE指令由MTE单元执行时所需要的时钟周期数。
在一些实施例中,在一个执行周期内执行的至少两个译码后的数据搬运指令,其指向的操作数的地址满足组装条件。
在一些实施例中,标量计算单元取指并译码的指令还包括标量指令,至少两个标量计算单元还用于执行译码后的标量指令。这些标量指令,大多为流程控制指令或数学运算,不涉及针对worker_id的引用。各标量计算单元分别执行译码后的标量指令,整体上协调地实现运算加速器的流程控制。标量计算单元的数量增加之后,可以使得读取的标量指令的数量增加,由此提高标量指令的处理效率。
为了具有使用灵活、数据适应性好的特点,在一些实施例中,至少两个标量计算单元,包括主标量计算单元与至少一个从标量计算单元,主标量计算单元用于控制从标量计算单元的启动或停止,或控制至少两个标量计算单元的同步。
由上,这至少两个标量计算单元V2构成集群,在该集群中,指定有主标量计算单元、并指定有至少一个从标量计算单元。指定的主标量计算单元控制从标量计算单元的启动或停止,或控制包括主标量计算单元在内的至少两个标量计算单元同步,从而实现集群内各标量计算单元按照指定的角色启动或停止,并按照指定的角色执行同步功能,逻辑清晰、运行可靠,实现了运算加速器的协调运行。
该目标运算加速器使用时,可以是主标量计算单元单独运行模式,或主标量计算单元与至少一个从标量计算单元组成集群的多工作单元模式。单独运行模式与多工作单元模式是互斥的,在AI Core的系统控制模块或组装单元的协调下,这两个模式可以混合编排,如图6A和6B所示。
具体而言,主标量计算单元控制从标量计算单元启动之后,及在停止从标量计算单元之前,这至少两个标量计算单元为多工作单元模式,每一个标量计算单元“虚拟”地对应一个工作单元。在多工作单元模式期间,可能会涉及到控制任一目标运算加速器,也即AI Core内各标量计算单元的同步,也即各“虚拟”的工作单元的同步,具体实现方法参加下述说明。
具体而言,主标量计算单元控制从标量计算单元停止之后,主标量计算单元为单独运行模式。在单独运行模式期间,不会涉及到控制任一AI Core内各标量计算单元的同步,但可能会涉及到多个目标运算加速器,也即多个AI Core之间的同步,具体实现方法不再赘述。
在一些实施例中,主Scalar用于控制从Scalar的启动,包括:主标量计算单元用于在执行到标识多工作单元模式启动的指令时,根据启动指令中记载的数量,控制对应数量的从标量计算单元的启动。如,主标量计算单元执行到如下的标识多工作单元模式启动的指令for range(worker_num)时,根据指令中记载的worker_num的数值,控制worker_num-1个从标量计算单元的启动。更详细的启动、停止或同步方法,参见下述说明。
参照前述说明,指令序列内的向量指令经过Scalar单元处理之后,源操作数和/或目标操作数的地址、参数等要素都已经配置好,存储单元中配置的向量指令队列中缓存的各标量计算单元的译码后的向量指令,由组装单元组装后得到更宽的向量指令,在Vector单元空闲时直接发射给Vector单元去执行。
以某型加速处理器执行具有4个工作单元的多工作单元模式为例,对组装的指令为相同的指令时的组装过程及效果进行说明。以4个Scalar单元在当前时刻分别取指并译码一个示例性的向量运算指令模型Simd1(OP1,worker_id*block)为例,该向量运算指令模型指定的操作为OP1,在每个工作单元循环中,指定的源操作数的地址由其引用的数值递增的worker_id来确定。记这4个Scalar单元依次为Scalar_0、Scalar_1、Scalar_2及Scalar_3,分别取指并译码一个Simd1指令后,如图5A所示,UB内一段连续的数据被依次指定为各Scalar单元译码后的向量指令中指定的源操作数。
针对图5A的情形,将4条引用worker_id且指定同一操作的指令组装为一个宽指令Simd11(OP1,St,OP1,St+block,OP1,St+2*block,OP1,St+3*block)。这时,该宽指令中指定的操作是同一个op,因此执行延时相同。并且,整体上该宽指令处理的数据为这些源操作数的并集。执行该宽指令,Vector单元整体上处理的数据是组装前单独处理每一个译码后的指令的4倍,因此Vector单元的利用率是合并前的4倍。
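上述组装过程可以用如下Python草图近似表示(假设性示例,并非专利原文代码;St、block、mask等取值仅用于对应图5A的说明):

    def make_simd1(worker_id, op="OP1", st=0, block=16, mask=16):
        # 每个worker循环中,源操作数地址由St + worker_id * block确定
        return {"op": op, "src_addr": st + worker_id * block, "mask": mask}

    def assemble_wide(insts):
        # 将指定同一操作、执行延时相同的多条窄指令拼接为一条宽指令,处理的数据为各源操作数的并集
        assert len({i["op"] for i in insts}) == 1
        return {"ops": [(i["op"], i["src_addr"]) for i in insts],
                "mask": sum(i["mask"] for i in insts)}

    narrow = [make_simd1(w) for w in range(4)]   # Scalar_0至Scalar_3各译码一条Simd1
    wide = assemble_wide(narrow)                 # 即文中的Simd11,Vector单元的利用率提高为组装前的4倍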
下面以组装的指令为不同的向量指令但执行时延相同时的组装过程及效果进行说明。如图5B所示,记4条译码后的向量指令分别为Simd21(op2,data21),mask参数的值为16;Simd31(op2,data31),mask参数的值为16;Simd41(op3,data41),mask参数的值为16;Simd51(op4,data51),mask参数的值为16。已知由某目标AI加速处理器(其设置有4个Scalar单元,也即标量计算单元)执行该op2、op3、op4的执行延时是相同的。并且data21、data31、data41、data51作为操作数,针对Vector单元的利用率相同,均为16Lanes。这时,可以通过指令组装,尝试在Vector单元的一个执行周期内同时执行这4条译码后的向量指令,则Vector单元的利用率大致是各指令分别单独执行时的4倍。
进一步地,利用8条译码后的向量指令的mask参数值为16且已知由另一目标AI加速处理器(其设置有8个Scalar单元,也即标量计算单元)执行包括op2、op3、op4在内的8个操作时的执行延时相同,则这时,可以尝试通过指令组装在Vector单元的一个执行周期内执行这8条译码后的向量指令,则Vector单元的利用率大致是各指令分别单独执行时的8倍。
图5C还示出了SIMD向量指令模型中的反复次数repeat和跳转量stride的含义,针对这两个参数在组装指令时采用的方法,可参考后述说明。
以上,运算加速器运行在多工作单元模式时,对向量指令或MTE指令进行组装的操作,在映射到面向用户的编程模型或编程方法时,这些硬件意义上的主Scalar或从Scalar对用户是透明的。相应地,用户通过编程来使用目标运算加速器时,近似于通过工作单元的标识来使用多个“虚拟”的Vector子单元或“虚拟”的MTE子单元。
在图1至图2所对应的实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关方法。
如图3所示,本申请一个实施例的运算加速器的使用方法,包括:
S110:根据指定的工作单元的数量M,生成指向M个工作单元的标识;
S130:使用运算加速器支持的至少一个向量运算函数处理数据,向量运算函数的至少一个参数引用标识。
在一些实施例里,该使用方法,还包括:
S150:使用运算加速器支持的至少一个数据搬运函数处理数据,数据搬运函数的至少一个参数引用标识;
S170:使用运算加速器支持的同步等待函数指定M个工作单元同步。
由上,在使用运算加速器进行运算加速时,根据指定的工作单元的数量M,生成指向M个工作单元的标识,并使用运算加速器支持的至少一个向量运算函数处理数据,如,通过向量运算函数的至少一个参数引用标识的方式来分配各工作单元分别处理的数据。这时,指定使用该运算加速器的多工作单元模式,并分配各工作单元分别处理的数据,以实现并行处理数据。该使用运算加速器进行运算加速的方法,便于实现结构完整、逻辑清晰的程序代码,人机友好程度高,易于使用,有利于推广该运算加速器在产业内广泛应用。
在使用运算加速器进行运算加速时,通过数据搬运函数的至少一个参数引用标识的方式来分配各工作单元分别处理的数据并执行M个工作单元因所分配处理的数据存在依赖而同步等待。这时,指定使用该运算加速器的多工作单元模式,并分配各工作单元分别处理的数据,以实现并行处理数据并在数据存在依赖时指定各工作单元同步等待。该使用运算加速器进行运算加速的方法,便于实现结构完整、逻辑清晰的程序代码,人机友好程度高,易于使用,有利于推广该运算加速器在产业内广泛应用。
用户使用本申请实施例的运算加速器设计AI业务程序时,根据任务需要,可以灵活使用各AI Core内提供的Scalar单元的数量。以目标运算加速器为V1芯片为例,该V1芯片设置有两个Scalar单元,一个SIMD类型的执行向量运算的Vector单元。在FP16精度下,该Vector单元具有128Lanes用于向量运算。在FP32精度下,该SIMD单元具有64Lanes可以用于向量运算。依据前述的运算加速器的使用方法,在FP16精度下,用户设计的程序可以指定由一个或两个Scalar单元处理数据,最终提交至该Vector单元的合并后的向量运算指令执行时,可以占用16至128Lanes中的任意多个,并且为从16开始,以16为增量的任意数值。相应地,在FP32精度下,最终提交至该Vector单元的合并后的向量运算指令执行时,可以占用32Lanes或64Lanes。
若目标运算加速器V2设置有4个Scalar单元,以及一个SIMD类型的执行向量运算的Vector单元。在FP16精度,用户设计的程序对应的指令序列中的某些指令可以由一个、两个、三个、或四个Scalar单元处理,最终提交至该Vector单元的合并后的向量运算指令执行时,可以占用4至128Lanes中的任意多个,并且从16开始,以16为增量的任意数值。目标运算加速器设置有其他数目的Scalar单元的情形可参照前述,不再赘述。
基于一个或多个运算加速器可以实现多种形式的AI芯片或AI单板,并构建多种形式的AI处理器和AI处理设备,如,AI加速模块,AI加速卡(可以是训练卡,也可以是推理卡),AI边缘计算服务器或AI边缘计算端设备,AI云端服务器(可以是训练服务,也可以是推理服务,如,拥有2048个节点的AI运算服务器),及AI集群等。这些AI处理器和AI处理设备还可以是车载设备,或用于实现通信及通讯的的智能终端、移动数据中心(Mobile Data Center)等。
如,AI处理器可以按照其PCIe(Peripheral Component Interconnect Express,PCIe)的工作模式区分其自身的工作模式。如果其PCIe工作在主模式,则AI处理器所在硬件设备可以扩展外设,这时,AI处理器工作在RC(Root Controller)模式。AI处理器所在硬件设备还可以接入网络摄像头、I2C传感器、SPI显示器等其他外设作为从设备。
图7B示出的某AI处理设备中,某型AI处理器包括多个AI运算加速器AI Core、AI CPU、控制CPU及任务调度CPU。该AI处理器的PCIe工作在主模式。该AI处理器的CPU直接运行用户指定的AI业务程序,处理AI业务,并调度AI运算加速处理器执行AI业务中指定的NN计算。该AI处理设备具有操作系统、输入输出设备、文件处理系统等。控制CPU控制整个AI处理器整体运行,从DDR或HBM获取数据或指令序列,任务调度CPU负责将计算任务在多个AI Core及AI CPU上的高效分配和调度,多个AI Core分别执行神经网络中的各算子,AI CPU则执行不适合由AI Core处理的算子(如,非矩阵类复杂计算)。
如果其PCIe工作在从模式,则AI处理器工作在EP(End Point)模式。EP模式通常由Host侧作为PCIe主端,Device侧作为PCIe从端。用户指定的AI业务程序运行在Host系统中,AI处理器所在硬件设备作为Device系统以PCIe从设备接入Host系统,为Host系统中的服务器提供NN计算能力。Host系统通过PCIe通道与Device系统交互,将AI任务加载到Device侧的AI处理器中运行。这时,Host系统包括X86服务器或ARM服务器,利用AI运算加速器提供的NN(Neural-Network)计算能力完成业务。
如图7A示出的某AI处理设备中,AI处理器所在硬件设备作为Device侧,接收Host侧分发的AI任务,并执行AI任务中指定的NN计算,在AI任务完成之后,将NN计算结果返回给Host侧。
如上,本申请实施例针对AI处理器的AI Core的架构进行了改进,并相应地设计了多核模式下各AI Core内部相对独立的多工作单元模式。搭载有至少一个前述改进设计后的AI Core的AI芯片包括但不限于神经网络处理器(Neural-network Processing Unit,NPU)、图形处理器(Graphics Processing Unit,GPU)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程逻辑门阵列(Field Programmable Gate Array, FPGA)等。在前述各种芯片的设计过程中,均能通过本申请实施例提供的运算加速的处理方法,以提升其向量计算单元和MTE单元的能效。
作为片上系统(System on Chip,SoC),AI处理器通常设置有至少一个,也即至少两个运算加速器AI Core。搭载有至少一个前述改进设计后的AI Core的AI处理器可以应用在与图像、视频、语音、文字处理相关的应用场景,如,智能园区、自动驾驶等。
用户使用本申请实施例的运算加速器设计AI业务程序时,调用该运算加速器支持的应用程序接口(Application Interface,API)或算子,实现AI业务对应的神经网络拓扑图。本申请实施例的运算加速器运行时,其所在的操作系统会调用以驱动形式提供的底层编译器。该运算加速器支持的API或算子等软件栈会随同该运算加速器固件一同被提供,如记载在存储介质中,如,通过网络服务区下载到本地后再使用。
以下实施例描述了用户在针对目标运算加速器设计神经网络算子时编写程序代码阶段的工作。
CANN提供了基于张量虚拟机框架的张量加速引擎(Tensor Boost Engine,TBE)算子设计框架。用户可以基于此TBE框架,通过TBE提供的API和编程设计界面,使用Python语言设计自定义算子,完成AI任务中神经网络运算所需的算子设计。
算子可以基于特定域语言(Domain Specific Language,DSL)设计。用户利用特定域语言声明计算流程,并使用自动调度(Auto Schedule)机制生成程序代码。在指定目标运算加速器后,生成的程序代码被编译为可以运行在目标运算加速器上的指令序列。
算子还可以基于张量迭代器内核(Tensor Iterator Kernel,TIK)设计。TIK作为基于Python语言的动态编程框架,呈现为运行在Python下的专用功能模块。用户可以调用TIK提供的API基于Python语言编写自定义算子。TIK提供的编译器会将TIK程序代码编译为可以运行在目标运算加速器上的指令序列。
在具体实施时,为实现多个“虚拟”的Vector子单元或“虚拟”的MTE子单元,以与多个Scalar单元的硬件设计相适应,CANN提供的上层编程方法,及其底层编译器也适应性地进行了改进,如在TIK API层面对与实现多个“虚拟”的Vector子单元与“虚拟”的MTE子单元有关的函数或程序控制方法进行了改进设计。
以下表1列出了虚拟Vector单元时用户使用的部分语句或变量。表2列出了由AI Core执行的部分指令。
表1虚拟Vector单元时用户使用的语句或变量
[表1的具体内容以图像PCTCN2021143921-appb-000001的形式收录于原公布文本]
表2由AI Core执行的指令
[表2的具体内容以图像PCTCN2021143921-appb-000002的形式收录于原公布文本]
用户使用本申请实施例的运算加速器设计AI业务程序时,具有图6A和6B所示的两种可能的场景。图6A中,在多核循环之外,针对某一个AI Core使用多工作单元模式来虚拟Vector单元或虚拟MTE单元。图6B中,在任一个多核循环之内,分别针对每一个AI Core使用多worker模式来虚拟Vector单元或虚拟MTE单元。
首先说明图6A所示的在多核模式之外(也即,多核循环体之外)利用多worker模式处理数据的方法。在针对多个目标运算加速器(包括AI Core_j,AI Core_j+1)设计神经网络算子时,对应指定的AI Core_j实现虚拟Vector单元时,可以由第j个多worker循环实现,对应指定的AI Core_j+1实现虚拟Vector单元时,可以由第j+1个多worker循环实现.
具体而言,针对任一个AI Core,根据数据处理的需要,其可以使用多个相互独立的多worker循环分别实现虚拟Vector单元。
以对指定的AI Core_j由第j个多worker循环实现虚拟Vector单元为例,如,在for_range语句中引用“worker_num”变量并指定worker_num的数值来指定参与多worker模式的Scalar单元的数量,并形成循环次数为worker_num的第j个多worker循环。在第j个多worker循环的循环体中,针对每一个循环,在向量运算函数或数据搬运函数中,引用“worker_id”这个变量来指定该循环中需要操作的数据在待处理数据张量或处理后数据张量中的索引位置或在UB内的地址或地址指针。也即,通过定义“worker_num”变量来指定执行虚拟Vector单元时,在Scalar集群中需要启动的Scalar单元的数量,并通过引用“worker_id”来指定由相应逻辑标识的Scalar单元取指及译码的指令。
每个AI Core中,各Scalar单元的物理标识是确定的,程序代码在每个循环中,引用循环间递增的同名标识worker_id作为引用各Scalar单元的逻辑标识,并且,逻辑标识自小到大地与各Scalar单元的物理标识一一地对应。
以上,用户将该向量计算单元等价为worker_num个"虚拟"的Vector子单元,并通过引用"worker_id"来"形式上"地使用各"虚拟"的Vector子单元。这时,每个"虚拟"的Vector子单元由在循环体上递增的相同标识worker_id来分别引用。
TIK编程模型对指定的一个AI Core实现虚拟Vector单元场景时,其程序控制部分大致如下面的代码所示。以下代码展示了指定多worker模式的方法,指定多worker模式涉及的Scalar单元的数量的方法,及以多次循环的方式分别隐含地指定了各Scalar单元所对应的行为,如取指、译码及将译码后的指令发送到指令队列的方法。
以下的语句片段“for_range(0,8,worker_num=8)as worker_id”中,引用“worker_num”变量并指定worker_num的数值为8,来指定参与多worker模式的Scalar单元的数量为8,并引用“worker_id”这个变量作为循环变量,指定在其之后的代码段 中形成一个当前的循环序数从1递增为8的多worker循环。
[该代码段以图像PCTCN2021143921-appb-000003的形式收录于原公布文本]
以上的语句片段“vec_add(16,dst_Scalar[row_idx*2,worker_id*2,0],src_Scalar[row_idx,worker_id,0],...)”则在当前的循环序数为worker_id的循环中,在向量加法函数中引用“worker_id”这个变量来指定当前循环中需要操作的源数据,如源数据在待处理数据张量和目的数据在处理后数据张量中的索引位置。
从以上代码可以看出,属于同一个AI Core内的多个worker分别利用“worker_id”引用同一个tensor内的不同数据。并且,这个共用tensor需要在多worker模式启动之前的语句中定义。
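结合上文引用的语句片段,该场景下的TIK程序骨架大致如下(此为根据上述描述复原的假设性草图,并非原公布图像中的完整代码;导入方式、张量形状以及vec_add中补全的第二源操作数和repeat/stride等参数均为假设,实际以TIK文档及原公布文本为准):

    from tbe import tik                      # 假设采用TBE提供的TIK编程框架

    tik_instance = tik.Tik()
    src_scalar = tik_instance.Tensor("float16", (16, 8, 16), name="src_scalar", scope=tik.scope_ubuf)
    dst_scalar = tik_instance.Tensor("float16", (32, 16, 16), name="dst_scalar", scope=tik.scope_ubuf)

    # 指定参与多worker模式的Scalar单元的数量为8,形成循环次数为8的多worker循环
    with tik_instance.for_range(0, 8, worker_num=8) as worker_id:
        with tik_instance.for_range(0, 16) as row_idx:
            # 引用worker_id指定当前循环中需要操作的源数据与目的数据在张量中的索引位置
            tik_instance.vec_add(16,
                                 dst_scalar[row_idx * 2, worker_id * 2, 0],
                                 src_scalar[row_idx, worker_id, 0],
                                 src_scalar[row_idx, worker_id, 0],
                                 1, 8, 8, 8)  # 第二源操作数及repeat/stride参数为补全的假设值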
在目标运算加速器运行由该神经网络算子定义的AI运算任务时,语句TIK_instance.for_range对应的指令被指定的主Scalar单元取指并译码后,该AI Core进入多worker模式。相应地,在worker_num个独立的循环中分别引用worker_id的语句所对应的代码,由各Scalar单元分别取指、译码及将译码后的指令发送到指令队列。
以上,用户通过Python语言引用TIK函数来编写实现神经网络运算的算子的程序代码,在代码中可以显式地指定每个Scalar单元所对应的行为,从而实现了虚拟Vector单元。
如某目标运算加速器设置有N个Scalar单元和一个Vector单元。用户当前设计的AI任务中某个单目向量运算函数,形式如SIMDX(mask,data1)对应的指令,被发送至Vector单元执行时,SIMDX函数中指定的源数据data1(同时也是目标数据)的大小相对于Vector单元的Lanes数而言过小,将导致该Vector单元的利用率偏低。这时,若将M个(M小于等于N)该单目向量运算函数对应的指令组装,以使得这M个指令中的源数据或目标数据“合并”,则可以使得整体上组装后的指令,形式上例如M*SIMDX(mask*M,data1,…,dataM)被发送至Vector单元执行时,则可以提高Vector单元的利用率为原来的M倍。
这时,用户可以利用语句片段“for_range(0,M,worker_num=M)as worker_id”来指定参与多worker模式的Scalar单元的数量为M,并指定在其之后的代码段中形成M次循环的多worker循环。
通过提取出data1,…,dataM在待处理数据张量中的位置索引与取值范围为从1开始递增至M的worker_id之间的函数关系data(worker_id),用户可以利用语句“SIMDX(mask,data(worker_id))”来分别指定向量运算函数SIMDX在每一个循环中,需要操作的源数据源数据在待处理数据张量中的索引位置。
通过以上2个语句片段,用户实现了指定该目标运算加速器的M个Scalar单元协同地将其上的Vector单元虚拟为M个Vector子单元,并通过引用“worker_id”变量来“形式上”指定对应的M个Vector子单元分别执行SIMDX(mask,data(worker_id))函数对应的指令,从而提高了Vector单元的利用率。
下面说明图6B所示的多核模式之内(也即,多核循环体之内)利用多worker模式处理数据的方法。在针对目标运算加速器设计神经网络算子时,通常会使用多核模式。这时,在多核模式之内的每一个循环中(如,分别引用取值递增的同名标识Core_idx来指定AI Core_1到AI Core_n),还可以利用多worker模式处理数据。这时,在每一个AI Core对应的循环体内,分别在第i个多worker循环(如对应指定的AI Core_i)或第i+1个多worker循环(如对应指定的AI Core_i+1)中,针对该AI Core实现虚拟Vector单元,不再赘述。
下面以对某AI处理器设置的2个运算加速器(也即AI Core)实现虚拟Vector单元的代码段为例,说明在多核模式之内利用多worker模式处理数据的方法。这时,每一个AI Core上设置有8个Scalar单元和一个Vector单元。
[该代码段以图像PCTCN2021143921-appb-000004的形式收录于原公布文本]
从以上代码可以看出,任一个AI Core并不会去处理其他AI Core的UB内的tensor。这时,多核模式之内的每一个单核循环下所使用的tensor必须在该单核循环之内声明,并且为每个AI Core所私有。并且,针对在单核循环之内声明的tensor,每个worker在引用这些tensor时,不需要也不能在各自的worker循环内再次声明。这时,属于同一个AI Core内的多个worker分别利用“worker_id”引用同一个tensor内的不同数据。
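与上一示例对照,多核循环之内嵌套多worker循环的程序骨架大致如下(同样为假设性草图,并非原公布图像中的完整代码;block_num用于指定多核模式中的核数,张量形状及vec_add的补全参数为假设):

    from tbe import tik                      # 假设采用TBE提供的TIK编程框架

    tik_instance = tik.Tik()
    with tik_instance.for_range(0, 2, block_num=2) as core_idx:        # 多核模式:2个AI Core
        # 单核循环之内声明的tensor为该AI Core私有
        src_core = tik_instance.Tensor("float16", (16, 8, 16), name="src_core", scope=tik.scope_ubuf)
        dst_core = tik_instance.Tensor("float16", (32, 16, 16), name="dst_core", scope=tik.scope_ubuf)
        with tik_instance.for_range(0, 8, worker_num=8) as worker_id:  # 该AI Core内的多worker循环
            with tik_instance.for_range(0, 16) as row_idx:
                # 同一AI Core内的各worker利用worker_id引用同一tensor内的不同数据
                tik_instance.vec_add(16,
                                     dst_core[row_idx * 2, worker_id * 2, 0],
                                     src_core[row_idx, worker_id, 0],
                                     src_core[row_idx, worker_id, 0],
                                     1, 8, 8, 8)  # 补全参数为假设值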
如图6A和图6B所示,整个程序代码中只能有一组TIK多核循环。本申请实施例针对多工作单元模式提供的编程模型允许一个TIK虚拟机内有多段worker循环(一段程序可能存在多个同步点)。并且,当程序段不在worker循环内时,虚拟化TIK编程模型维持原本的TIK编程模型模式。同时硬件上,目标AI处理器内各AI Core内的Scalar集群中只有主Scalar工作,各从Scalar处于非激活状态。另外,在计算任务开始执行后,也即程序启动之后,目标AI处理器内各AI Core可以分别进入多worker模式,并且,各AI Core可以多次进入多worker模式。并且,每次进入多worker模式时,涉及到的Scalar单元的数量可以不同。
用户在针对目标AI处理器设计神经网络算子时,有时需要在程序代码内利用语句来指定目标AI处理器中某一个AI Core内的所有worker同步,以保证该目标AI处理 器运行该算子时,AI Core内各Scalar单元的行为同步。
以针对图像数据进行归约为例。规约用worker循环实现,并且进行同步。待处理的数据,也即输入数据的格式为[batch_size,channel_cnt,spatial_size],在代码中,将batch_size赋值给core_idx,将channel_cnt赋值给worker_idx,以多核多worker模式实现对图像数据归约,每个batch下规约出一个最大值。这时,每个worker循环一次,对一个channel下的元素进行规约,并存入到中间tensor中;所有channel都规约完成后,再进行最后一次规约;在所谓的“所有channel都规约完成后”,即意味着进行了一次同步。
[该代码段以图像PCTCN2021143921-appb-000005的形式收录于原公布文本]
以上的语句片段“sync_workers()”出现在每一个worker循环内,用于指定当前的AI Core对应的多worker循环中所有worker执行至此指令处要相互等待以同步,相应地,该目标运算加速器对应的AI Core在执行到在该语句对应的指令时,需要使得AI Core内各Scalar单元在行为上同步,如在译码到依赖到的指令时会停顿。
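按照上述描述,该归约场景的程序骨架大致如下(假设性草图,延续前一示例中创建的tik_instance;并非原公布图像中的完整代码,其中reduce_max_1d为示意用的假设性归约辅助函数,sync_workers的具体调用形式以原文为准):

    batch_size, channel_cnt, spatial_size = 2, 8, 1024
    with tik_instance.for_range(0, batch_size, block_num=batch_size) as core_idx:
        data_in = tik_instance.Tensor("float16", (channel_cnt, spatial_size), name="data_in", scope=tik.scope_ubuf)
        partial = tik_instance.Tensor("float16", (channel_cnt,), name="partial", scope=tik.scope_ubuf)
        result = tik_instance.Tensor("float16", (1,), name="result", scope=tik.scope_ubuf)
        with tik_instance.for_range(0, channel_cnt, worker_num=channel_cnt) as worker_idx:
            # 每个worker循环一次,对一个channel下的元素归约出最大值,并存入中间tensor
            reduce_max_1d(tik_instance, partial[worker_idx], data_in[worker_idx, 0], spatial_size)
            tik_instance.sync_workers()      # 所有channel都归约完成后相互等待,再进行最后一次归约
        reduce_max_1d(tik_instance, result[0], partial[0], channel_cnt)   # 每个batch下归约出一个最大值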
一些实施例中,语句片段“sync_workers()”被编译器编译后,生成成对出现的2个“.VVECSYNC”指令,其中第一个“.VVECSYNC”指令(在前出现的)表示同步开始,第二个“.VVECSYNC”指令(在后出现的)表示同步结束。相应地,该指令在目标运算加速器上运行时,针对Scalar集群分别调用该AI Core中已有的set/wait flag功能/标识来实现,操作set/wait flag的方法参加下文描述。
如前述,在CANN开放给用户的编程API或TIK中,增加了worker这一逻辑层级来“形式上”地让用户控制每个AI Core内虚拟Vector单元之后的Vector子单元的计算或调度逻辑或虚拟MTE单元之后的MTE子单元的计算或调度逻辑。从“形式上”,将Vector单元拆分成多个“虚拟的”Vector子单元,并分别为各Vector子单元指定计算任务,从而便于用户更方便、灵活地针对待处理数据的特点进行程序设计,提高指令序列由Vector单元执行时的利用率,整体上提升AI Core的计算效率。
如图1A、图4A和4B所示,本申请实施例的运算加速器内的各AI Core运行时,从DDR或HBM获取针对目标运算加速器编译后的指令序列并搬运到指令缓存,并从DDR或HBM获取待处理的数据并搬运到内部存储,如L1Buffer或UB。
这些指令序列是用户针对待处理数据设计的神经网络算法。待处理数据可能是训 练用数据,也可能是推理用数据。神经网络算法可能用于训练神经网络,得到满足精度要求的目标神经网络;也可能用于利用训练好的神经网络进行推理,得到推理结果。
各AI Core分别执行指令序列,以指定的多核模式之外的多worker模式处理该数据,或以指定的多核之内的多worker模式处理该数据。
如图4A所示,本申请一个实施例的加速处理器配置有一个主Scalar单元和两个从Scalar单元,每个Scalar单元配置有一个向量指令队列和一组MTE指令队列。
在一些实施中,初始化时,指定标识的Scalar单元Scalar_0被配置为主Scalar单元并启动,执行主Scalar单元功能。其他的Scalar单元,Scalar_1和Scalar_2被配置为从Scalar单元,并保持为待机状态。
被配置为主Scalar单元之后,如图1B所示,主Scalar单元从指令缓存I Cache读指令、译码并执行译码后的指令。主Scalar单元还可以执行启动从Scalar单元、关闭从Scalar单元,控制主从Scalar同步(包括操作数同步/数据同步)等操作。
在执行到用户指定的多worker模式计算任务时,主Scalar取指到启动指定数量的从Scalar的指令(参照下述的.VVECST.n)之后,系统控制模块启动指定数量的从Scalar、Vector组装单元400和MTE合并单元500。指定数量的从Scalar启动之后,主Scalar和每一个启动的从Scalar按照指令中记载的worker_id分别从指令缓存中取指并译码。译码后的指令被缓存到各自对应的向量指令队列和MTE指令队列。
在主Scalar或任一从Scalar取指到用户指定的多worker模式计算任务中指定的同步等待指令时,主Scalar或任一从Scalar在处理过当前的指令后,停止取指及译码,进入同步等待状态,直到主Scalar发出同步解除消息之后,主Scalar和每一个从Scalar单元继续按照指令中记载的worker_id分别从指令缓存中取指并译码。
在用户指定的多worker模式计算任务结束时,在主Scalar或任一从Scalar取指到停止/关闭指定数量的从Scalar的指令(如VVECED)之后,AI Core内的系统控制模块关闭启动的从Scalar单元、Vector组装单元和MTE合并单元。
在主Scalar单元和从Scalar单元取指及译码的效率大致相同时,主Scalar对应的Vector指令队列0和从Scalar_1对应的Vector指令队列1内缓存的指令的总数目大致相同。
以及,在主Scalar和从Scalar取指及译码的效率大致相同时,主Scalar对应的MTE指令队列0和从Scalar对应的MTE指令队列1内缓存的指令的总数目大致相同。
若译码后的指令属于向量指令,且指令内记载的mask的数值小于与数据精度对应的预设值,则称其为窄指令。启动的Vector组装单元从各Vector指令队列中取出译码后的指令到指令槽内缓冲,并根据指令内记载的mask的值和指令中操作对应的执行延时,对多个窄指令组装,并将组装后得到的宽指令发射给Vector单元执行。组装后的执行指令,需要设置组装后的mask的值。而Vector单元执行该宽指令所消耗的时钟周期数与各窄指令的执行延时相同。以上,因为Vector单元在相同的时间内,执行了多条向量指令,因此提高了Vector单元的利用率,整体上提升了AI Core的计算性能。
Vector单元配置有指令执行队列(Vector Execution Queue),Scalar译码后的指令或组装后的宽指令缓冲在指令执行队列。缓冲的目的,其一是等待操作数准备就绪, 其二是等待Vector单元完成当前的运算。Vector单元则逐条执行由Vec指令发射模块430(Vec Dispatch)发射的向量指令。发射的向量指令待操作的操作数已经从UB内搬运到Vector Unit的操作数缓冲区。
如图4A和4C所示,Vector组装单元包含Vec指令组装模块420(Vec Instruction Coalescer)和Vec指令发射模块430(Vector Dispatch)。Vec指令组装模块用于收集所有主从Scalar单元译码后的向量指令,并尽可能地合并指令,得到最多包括worker_num个窄指令的宽指令,并缓存在指令执行队列中;由Vec指令发射模块430在操作数准备好之后,将宽指令发射给Vector单元执行。
Vec指令发射模块430可以用于合并访存信息,并根据合并后的访存信息将操作数从Vec有关缓存(Vector Related Mem,如UB)取到源寄存器工作空间,以由向量计算单元取用;或将操作数从目标寄存器工作空间转移、搬运、或写出到Vec有关缓存(Vector Related Mem,如UB)。也即,在Vector单元执行宽指令之前,发射模块提供取操作数的功能;在Vector单元执行宽指令之后,发射模块提供写回操作数的功能。
Vector组装单元启动时,如图4C所示,向量指令组装模块420根据从主Scalar单元传入的workernum的值,配置K个指令槽,K为不小于worker-num的正整数。每个指令槽包括多项内容:源操作数地址0(src_addr)、源操作数地址1(src1_addr)、目标操作数地址(dst_addr)、掩码(mask)的值、指令指定的操作、指令指定的反复次数(repeat)、或指令指定的跳转量(stride)。配置之后,该指令槽用于存储向量指令组装模块提取的译码后的指令。
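上述指令槽的内容可以用如下Python草图示意(假设性示例,并非专利原文;字段名取自上文列举的各项内容):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class InstructionSlot:
        src_addr: Optional[int] = None    # 源操作数地址0
        src1_addr: Optional[int] = None   # 源操作数地址1
        dst_addr: Optional[int] = None    # 目标操作数地址
        mask: int = 0                     # 掩码的值
        op: Optional[str] = None          # 指令指定的操作
        repeat: int = 1                   # 指令指定的反复次数
        stride: int = 0                   # 指令指定的跳转量

        def is_empty(self) -> bool:       # 为空的指令槽可被填充新的译码后的指令
            return self.op is None

    def init_slots(worker_num: int, k: Optional[int] = None):
        # 配置K个指令槽,K为不小于worker_num的正整数
        k = worker_num if k is None else max(k, worker_num)
        return [InstructionSlot() for _ in range(k)]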
Vector组装单元启动之后,向量指令组装模块420轮询地检测各指令槽的状态。如,当任一指令槽不为空时,其就为活跃指令槽。当检测到任一指令槽为空时,按照预先设定的规则(如,轮询,如,最久等待)从Vector指令队列中取出一个译码后的指令,并填充至该空的指令槽中,并对活跃指令槽的数量加1。
Vector组装单元启动之后,向量指令组装模块420轮询地计数活跃的指令槽的数量。当活跃指令槽的数量不为零时,依次从各活跃的指令槽中取出执行延时相同的多条指令并合并,将合并后的指令缓冲在Vector指令执行队列中。向量指令发射模块430收集操作数,并发射待执行指令。在待执行指令发射之前,各活跃指令槽锁定。
向量指令组装模块组装指令时,当活跃指令槽的数量大于1,且多个活跃指令槽的指令op属于相同延时指令,则将多个活跃指令槽内记载的多个指令拼接为一个待执行指令。这个待执行指令,对应于多个窄指令,每个窄指令对应地包括一个指令槽内的指令指定的操作、及指令指定的操作数。随后,这个待执行指令作为一条宽指令整体地被发射给向量计算单元执行。
以上,来自不同指令槽内的指令指定的操作必须属于相同延时指令才可以合并的限制,其原因在于,将不具有相同执行延时的指令合并在一个待执行指令并统一地进行发射给向量计算单元执行之后,将导致在写回操作数的过程中出现分叉,而使控制逻辑十分复杂,并降低执行的可靠性。因此,如当vdiv指令和vzero指令同时出现在当前活跃的多个指令槽中时,因这两者不具有相同的执行延时,会导致之后的写回过程出现分叉,因此不会将这两者合并在一个待执行指令内。
以上,按照指令中记载的worker_id分别从指令缓存中取指并译码时,主Scalar和从Scalar分别配置有各自的程序计数器(Program Counter,PC),各PC为一个寄存器,被配置为分别指向主Scalar或从Scalar的下一个待读的指令的地址。
如前,主Scalar取指到启动指定数量的从Scalar的指令(如.VVECST.n)之后,主Scalar控制从Scalar启动,并共享主Scalar的PC内当前的内容,以共享待读的指令所在的地址。主Scalar和从Scalar的PC内容同步后,主Scalar和从Scalar分别独立地向指令缓存读指令、译码并执行。
与图1B所示的主Scalar单独直接从指令缓存中读指令不同,如图4B所示,本申请实施例的运算加速器,因为主从Scalar在运行(runtime)时对应的worker_id的值各不相同,因此主从Scalar各自取指及译码的指令及指令的类型都可能不同。相应地,主从Scalar的行为可能不同。如,主从Scalar读到与其关联的worker_id寄存器的指令之后,主从Scalar的行为受控于读的指令,可能会产生分叉,如,读到涉及到worker_id的代码后,主从Scalar可能进入不同的程序分支或程序段落,分别读、译码及执行不同的指令。
如图4A、图4B和4D所示,针对MTE单元600,实现处理加速时,数据搬运指令合并单元,也即,MTE合并单元500包括MTE指令合并模块520(MTE Instruction Coalescer)、MTE指令发射模块530(MTE Dispatch)。MTE指令合并模块520从多个MTE Queue中获取译码后的MTE指令,进行指令合并后,生成MTE执行指令,并发送到MTE指令执行队列(MTE Execution Queue)中缓存。MTE指令发射模块530在检测到MTE单元空闲时,将MTE指令执行队列中的MTE执行指令发射给直接内存访问(Direct Memory Access,DMA)单元,也即图中的DMAs,如,将数据经DMAs搬运到各核内缓存,也即AI核内部缓存(In Core Buffer)如UB、或核外缓存,也即AI核外部缓存(Out Memory),如GB等。
如图4D所示,MTE指令合并模块520内维持一个缺失状态保持寄存器表(Miss Status and Handling Register,MSHR)。MTE指令合并模块从与各个Scalar单元对应的MTE指令队列中抽取任一个译码后的原始指令并存放至MSHR表中的一个表项内。MSHR内的表项数S不与主从Scalar单元的数量相关,典型情况下是一个较大的数值(如,在主从Scalar单元的数量为8时,表项数S为16、32或64)。
MSHR中每个表项的作用在于记录MTE指令的状态,以标识MTE指令合并的可行性,如,状态栏的内容为“未定义(Unused)”时,用于标识该表项所在的指令槽为空。状态栏的内容为“使用中(Used)”时,用于标识该表项所在的指令槽内的MTE指令为尚未合并、可以合并的状态。状态栏的内容为“已发射(Issued)”时,用于标识该表项所在的指令槽内的MTE指令为已经合并过的状态。表项内还可以设置其他表项栏,如,用于记录源操作数地址(src_addr)、目标操作数地址(dst_addr),指令指定的跳转量(stride),指令指定的修正线性单元参数(Rectified Linear Unit,Relu)、指令指定的填充参数(pad)。
多worker模式启动时,MTE指令合并模块520将MSHR表清空,即将所有表项的状态栏都配置为“未定义”状态。
当检测到任一指令槽为空(这时,其状态栏的内容为“Unused”)时,MTE指令合并模块520以某种规则(如,轮询/最久等待等)向某个MTE指令队列中抽读指令,填充至该空闲的表项中,并把该表项的状态栏更新为“Used”;
当检测到任一指令槽为可合并的“Used”状态(其状态栏的内容为“Used”)时,MTE指令合并模块520执行指令合并,并将参与指令合并的多个执行所在的表项的状态栏更新为“Issued”。
MTE指令可以包括多种不同的类别。合并前后,MTE指令的类别保持不变。同一类别的多个MTE指令在满足特定规则时可以合并,也即多个同类型的MTE指令合并为一个更宽的MTE指令并一次被发射到MTE执行。如合并为更宽的散列读的MTE指令后,一次性地被发射到MTE单元,并在MTE单元的一个执行周期(cycle)中完成执行。
具体的,指令集合U_I是指全部待组装的指令所组成的集合U的任一个子集,指令集合U_I内至少有2条MTE指令。MTE指令合并模块对每一指令集合U_I是否满足特定规则进行判断,并在判断满足规则时,执行指令合并。执行指令合并是指,当MSHR表中的两个或多个指令符合特定规则时,这两个或多个指令合并为一个更宽的MTE指令。特定规则例如但不限于:
指令集合U_I内的每条指令所涉及的源地址和目的地址互不重叠,并且将各个源地址进行组合后的源地址段落完全连续,将各个目的地址段落进行组合后的目的地址段落完全连续。那么这样的指令集合U_I可以用一次MTE指令完成,即可进行组合。
指令集合U_I内的每条指令所涉及的源地址和目的地址互不重叠,并且各个源地址段落呈现相同的散列模式(也即各指令中的跳转量stride参数的值相同),且将各个源地址段落进行组合后亦保持散列模式,并且将各个目的地址段落进行组合后完全连续,那么这样的指令集合U_I可以用一次散列读取的MTE指令完成,即可进行组合。
指令集合U_I内的每条指令所涉及的源地址和目的地址互不重叠,并且源地址段落进行组合后完全连续,并且各个目的地址段落呈现相同的散列模式(也即各指令中的跳转量stride参数的值相同),且进行组合后亦保持散列模式,那么这样的指令集合U_I可以用一次散列写入MTE指令完成,即可进行组合。
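以第一条规则为例,其判断逻辑可以用如下Python草图示意(假设性实现,并非专利原文;指令以含src_addr、dst_addr、length字段的字典表示,字段名仅为示意):

    def disjoint_and_contiguous(ranges):
        # ranges: [(起始地址, 长度), ...];要求各地址段互不重叠,且组合后完全连续
        ordered = sorted(ranges)
        for (s0, l0), (s1, _) in zip(ordered, ordered[1:]):
            if s0 + l0 != s1:             # 出现空洞或重叠均不满足
                return False
        return True

    def can_merge_linear(insts):
        src = [(i["src_addr"], i["length"]) for i in insts]
        dst = [(i["dst_addr"], i["length"]) for i in insts]
        return len(insts) >= 2 and disjoint_and_contiguous(src) and disjoint_and_contiguous(dst)

    # 第二、三条规则可在此基础上再检查各源地址段或目的地址段的stride是否相同(即散列模式保持一致)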
当MTE单元执行完成了该条组装后的宽指令后,MTE指令合并模块将与这条合并后指令相关的两个或多个表项清空,即,将这些表项的状态栏都配置为“Unused”状态。
以下结合前述的硬件方面的描述和程序设计方面的描述,说明Scalar集群开始、结束及同步等待的方法及过程。
当任一从Scalar读指令到第一个“.VVECSYNC”指令时(如译码到sync_workers()对应的指令时),该从Scalar会停顿,并向主Scalar发送第一次同步抵达信号,并等待主Scalar发送第二次同步抵达信号。之后,在获取到主Scalar发送的第二次同步抵达信号之后,该从Scalar进入下一个指令。
当主Scalar接收到某一个从Scalar发送的第一次同步抵达信号,或主Scalar读指令 到第一个“.VVECSYNC”指令时,主Scalar开始计数已发送过第一次同步抵达信号的从Scalar的数量。当确定目前活跃的所有从Scalar都已发送过第一次同步抵达信号之后,判断所有从Scalar已做好同步准备。这时,主Scalar向所有从Scalar发送第二次同步抵达信号。之后,主Scalar和各从Scalar分别进入下一个指令。
这里“进入下一个指令”,是指取指、译码及执行下一个指令。如果主Scalar和各从Scalar支持乱序执行,则译码之后,先跳过这条具有依赖关系而需要同步等待的指令,继续去取指、译码及执行下一个指令。
计数已发送过第一次同步抵达信号的从Scalar单元时,可以采用任一逻辑存储体,如,建立同步表或前述的指令槽。该同步表用于标识每个从Scalar单元是否发送过第一次同步抵达信号。当建立的同步表中记载了所有从Scalar单元的标识(Scalar_id)或所有从Scalar单元对应的状态量为已发送时,确定所有从Scalar都已发送过第一次同步抵达信号。如,若修改方式建立同步表,则建立时初始化为固定长度或固定格式的同步表,记载全部的从Scalar的标识,和/或记载从Scalar的标识的状态量为未发送。之后,每接收到一个第一次同步抵达信号时,提取该第一次同步抵达信号中记载的从Scalar单元的标识,并更新该从Scalar的标识的状态量为已发送。如生成方式建立同步表,建立时初始化为空的同步表。之后,每接收到一个第一次同步抵达信号,提取该第一次同步抵达信号中记载的从Scalar的标识,并更新该从Scalar的标识的状态量为已发送。
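主Scalar统计第一次同步抵达信号的过程可以用如下Python草图示意(假设性实现,并非专利原文;以"修改方式"建立的同步表为例):

    class SyncTable:
        def __init__(self, slave_ids):
            # 建立时记载全部从Scalar的标识,状态量初始化为"未发送"
            self.arrived = {sid: False for sid in slave_ids}

        def on_first_arrival(self, scalar_id):
            # 每接收到一个第一次同步抵达信号,更新对应从Scalar的状态量为"已发送"
            self.arrived[scalar_id] = True
            return self.all_arrived()

        def all_arrived(self) -> bool:
            return all(self.arrived.values())

    table = SyncTable(slave_ids=[1, 2, 3])
    for sid in (2, 1, 3):                  # 各从Scalar先后发送第一次同步抵达信号
        if table.on_first_arrival(sid):
            print("所有从Scalar已做好同步准备,主Scalar向所有从Scalar发送第二次同步抵达信号")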
以上同步的过程,对于从Scalar单元来说,可以认为是“发送状态->等待回执”;对于主Scalar单元来说,可以认为是“接收状态->发送回执”。
如前述,用户在针对目标AI加速处理器设计神经网络算子时,对指定的任一个AI Core实现虚拟Vector单元时,在for_range语句中引用“worker_num”变量并指定worker_num的数值来指定参与多worker模式的Scalar单元的数量,并形成一个循环次数为worker_num的多worker循环。相应地,引用“worker_num”变量并指定worker_num的数值的for_range语句及其循环体被编译器编译后,生成指令“.VVECST.n”来标识多worker循环段落的起始,以及指示Scalar集群启动,并生成指令“.VVECED”来标识多worker循环段落的结束,以及指示Scalar集群退出。
其中,“.VVECST.n”表示开始实现虚拟Vector单元或虚拟MTE单元。“.VVECST.n”中的“n”则表示该语句下方的所有指令会以n个“虚拟的”Vector子单元的方式或n个“虚拟的”MTE子单元来执行。程序从多核模式下切换到多worker模式。“.VVECED”表示多worker模式结束,程序回到多核模式下。
目标加速处理器在开始执行用户指定的AI处理任务之后,在没有进入多worker模式时,每一个AI Core内只有主Scalar单元处于活跃状态,各从Scalar处于启动但空操作(No Operation,NOP)状态。这时,如图1B所示,每一个AI Core内的主Scalar单元取指、译码及执行译码后的指令,并控制Vector单元或MTE单元执行对应的运算或操作。
当执行至“.VVECST.n”指令后,进入加速处理器多worker模式。这时,主Scalar单元向指定的n-1个从Scalar单元发送消息,消息内容包括:主Scalar单元的程序寄存器的状态。各从Scalar单元接收到消息,并根据主Scalar的程序寄存器状态,初始化各自的程序寄存器状态,完成主Scalar状态复制。并且,当所有从Scalar获得主Scalar的状态后, 针对指令序列的执行进入下一个指令。
这时,主Scalar单元还向Vector组装单元或MTE合并单元发送消息,消息内容包括:worker的数量,也即".VVECST.n"中的n。Vector组装单元接收到消息,根据提取到的worker的数量n,初始化指令槽。这时,指令槽中的槽数不小于worker的数量n,以保证可以同时缓存来自全部的Scalar单元的译码后的指令。MTE合并单元接收到消息,将MSHR表清空。
之后,针对指令“.VVECST.x”与指令“.VVECED”之间的指令,如图4B所示,Scalar集群中的各Scalar单元分别独立地取指、译码及执行译码后的指令,并分发到对应的向量指令队列,并分别经Vector组装单元、MTE合并单元组装指令后,发射至Vector单元或MTE单元执行对应的运算或操作。
当执行至“.VVECED”指令后,主Scalar建立结束表,从Scalar向主Scalar发送结束信号。当结束表中所有从Scalar都发送过结束信号后,主Scalar执行下一个指令。也即,多worker循环在结束时,编译器控制主Scalar单元或各从Scalar单元隐式地执行一次前述的“.VVECSYNC”指令,以在主从Scalar之间同步。
图8是本申请实施例提供的一种人工智能处理设备900的结构性示意性图。该人工智能处理设备900包括:处理器910、存储器920。
应理解,图8所示的人工智能处理设备900中还可包括通信接口930,可以用于与其他设备之间进行通信。
其中,该处理器910可以与存储器920连接。该存储器920可以用于存储程序代码和数据。因此,该存储器920可以是处理器910内部的存储单元,也可以是与处理器910独立的外部存储单元,还可以是包括处理器910内部的存储单元和与处理器910独立的外部存储单元的部件。
可选的,人工智能处理设备900还可以包括总线。其中,存储器920、通信接口930可以通过总线与处理器910连接。总线可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图8中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。
应理解,在本申请实施例中,该处理器910可以采用中央处理单元(central processing unit,CPU)。该处理器还可以是其它通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific Integrated circuit,ASIC)、现成可编程门阵列(field programmable gate Array,FPGA)或者其它可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。或者该处理器910采用一个或多个集成电路,用于执行相关程序,程序代码当被处理器910执行时,处理器通过接口电路访问至少一个前述的运算加速器,使得至少一个运算加速器针对数据执行程序代码中指定的运算,并将运算的结果返回处理器910或存储器920。
该存储器920可以包括只读存储器和随机存取存储器,并向处理器910提供指令和数据。处理器910的一部分还可以包括非易失性随机存取存储器。例如,处理器910 还可以存储设备类型的信息。
在人工智能处理设备900运行时,处理器910执行存储器920中的计算机执行指令执行上述方法的操作步骤。
应理解,根据本申请实施例的人工智能处理设备900可以对应于执行根据本申请各实施例的方法中的相应主体,并且人工智能处理设备900中的各个模块的上述和其它操作和/或功能分别为了实现本实施例各方法的相应流程,为了简洁,在此不再赘述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以至少两个单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时用于执行本申请的方法,该方法包括上述各个实施例所描述的方案中的至少之一。
本申请实施例的计算机存储介质,可以采用一个或多个计算机可读的介质的任意组合。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机 可读存储介质例如可以是,但不限于,电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括、但不限于无线、电线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言或其组合来编写用于执行本申请操作的计算机程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
另外,说明书和权利要求书中的词语“第一、第二、第三等”或模块A、模块B、模块C等类似用语,仅用于区别类似的对象,不代表针对对象的特定排序,可以理解地,在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
在上述的描述中,所涉及的表示步骤的标号,如S501、S502……等,并不表示一定会按此步骤执行,在允许的情况下可以互换前后步骤的顺序,或同时执行。
说明书和权利要求书中使用的术语“包括”不应解释为限制于其后列出的内容;它不排除其它的元件或步骤。因此,其应当诠释为指定所提到的所述特征、整体、步骤或部件的存在,但并不排除存在或添加一个或更多其它特征、整体、步骤或部件及其组群。因此,表述“包括装置A和B的设备”不应局限为仅由部件A和B组成的设备。
本说明书中提到的“一个实施例”或“实施例”意味着与该实施例结合描述的特定特征、结构或特性包括在本申请的至少一个实施例中。因此,在本说明书各处出现的用语“在一个实施例中”或“在实施例中”并不一定都指同一实施例,但可以指同一实施例。此外,在一个或多个实施例中,能够以任何适当的方式组合各特定特征、结构或特性,如从本公开对本领域的普通技术人员显而易见的那样。
注意,上述仅为本申请的较佳实施例及所运用技术原理。本领域技术人员会理解, 本申请不限于这里所述的特定实施例,对本领域技术人员来说能够进行各种明显的变化、重新调整和替代而不会脱离本申请的保护范围。因此,虽然通过以上实施例对本申请进行了较为详细的说明,但是本申请不仅仅限于以上实施例,在不脱离本申请构思的情况下,还可以包括更多其他等效实施例,均属于本申请保护范畴。

Claims (22)

  1. 一种运算加速器,其特征在于,包括:
    存储单元,被配置有至少一个向量指令队列,每个所述向量指令队列分别用于缓存一个或多个向量指令;
    至少两个标量计算单元,每个标量计算单元分别用于获取指令并对所述指令译码得到译码后的指令,所述译码后的指令包括向量指令,每个所述标量计算单元还分别用于将所述向量指令缓存至所述至少一个向量指令队列;
    向量计算单元,用于执行所述向量指令队列中的向量指令。
  2. 根据权利要求1所述的运算加速器,其特征在于,所述向量计算单元,用于在一个执行周期内执行所述向量指令队列中至少两个向量指令。
  3. 根据权利要求2所述的运算加速器,其特征在于,
    在一个执行周期内执行的所述至少两个所述译码后的向量指令,具有相同的执行延时。
  4. 根据权利要求1所述的运算加速器,其特征在于,
    还包括组装单元,用于组装所述向量指令队列中至少两个向量指令得到组装后的向量指令,并将所述组装后的向量指令提供给所述向量计算单元执行。
  5. 根据权利要求4所述的运算加速器,其特征在于,
    所述向量指令队列为至少两个,与所述至少两个标量计算单元分别一一对应;
    所述组装单元包括逻辑存储模块和组装模块,所述逻辑存储模块被配置有至少一个组装队列,所述组装队列用于缓存从所述至少两个向量指令队列提取的译码后的向量指令;所述组装模块用于按照执行延时从所述组装队列提取至少两个所述译码后的向量指令并组装。
  6. 根据权利要求1所述的运算加速器,其特征在于,还包括:
    数据搬运单元;
    所述存储单元还被配置有至少一个数据搬运指令队列,每个所述数据搬运指令队列分别用于缓存一个或多个数据搬运指令;
    所述译码后的指令还包括数据搬运指令,每个所述标量计算单元将所述数据搬运指令缓存至所述至少一个数据搬运指令队列;
    所述数据搬运单元用于执行所述译码后的数据搬运指令。
  7. 根据权利要求1所述的运算加速器,其特征在于,
    所述译码后的指令还包括标量指令,
    每个所述标量计算单元还用于执行所述标量指令。
  8. 根据权利要求1所述的运算加速器,其特征在于,
    所述至少两个标量计算单元,包括主标量计算单元与至少一个从标量计算单元,
    所述主标量计算单元用于控制各从标量计算单元的启动或停止,或控制所述各从标量计算单元之间的同步。
  9. 一种运算加速的处理方法,其特征在于,包括:
    通过至少两个标量计算单元的每个标量计算单元分别获取指令并对所述指令译码得到译码后的指令,所述译码后的指令包括向量指令;
    通过每个所述标量计算单元将所述向量指令缓存至至少一个向量指令队列,所述向量指令队列配置在存储单元中,每个所述向量指令队列分别用于缓存一个或多个向量指令;
    通过向量计算单元执行所述向量指令队列中的向量指令。
  10. 根据权利要求9所述的方法,其特征在于,
    通过所述向量计算单元在一个执行周期内执行所述向量指令队列中至少两个向量指令。
  11. 根据权利要求10所述的方法,其特征在于,
    在一个执行周期内执行的所述至少两个所述译码后的向量指令,具有相同的执行延时。
  12. 根据权利要求9所述的方法,其特征在于,还包括:
    通过组装单元组装所述向量指令队列中至少两个向量指令得到组装后的向量指令,并将所述组装后的向量指令提供给所述向量计算单元执行。
  13. 根据权利要求12所述的方法,其特征在于,
    所述向量指令队列为至少两个,与所述至少两个标量计算单元分别一一对应;
    所述组装单元包括逻辑存储模块和组装模块;
    通过所述逻辑存储模块配置的组装队列缓存从所述至少两个向量指令队列提取的译码后的向量指令;
    通过所述组装模块按照执行延时从所述组装队列提取至少两个所述译码后的向量指令并组装。
  14. 根据权利要求9所述的方法,其特征在于,
    所述译码后的指令还包括数据搬运指令,还通过每个所述标量计算单元将所述数据搬运指令缓存至所述至少一个数据搬运指令队列;所述数据搬运指令队列配置在所述存储单元中,每个所述数据搬运指令队列用于缓存数据搬运指令;
    还通过数据搬运单元执行所述译码后的数据搬运指令。
  15. 根据权利要求9所述的方法,其特征在于,
    所述指令还包括标量指令,还通过每个所述标量计算单元执行所述标量指令。
  16. 根据权利要求9所述的方法,其特征在于,
    所述至少两个标量计算单元,包括主标量计算单元与至少一个从标量计算单元,
    还通过所述主标量计算单元控制各从标量计算单元的启动或停止,或控制所述各从标量计算单元之间的同步。
  17. 根据权利要求16所述的方法,其特征在于,所述通过所述主标量计算单元控制各标量计算单元的启动,包括:
    所述主标量计算单元在运行到标识多工作单元模式的启动指令时,根据启动指令中记载的数量,控制对应数量的从标量计算单元的启动。
  18. 一种运算加速器的使用方法,其特征在于,包括:
    根据指定的工作单元的数量M,生成指向M个工作单元的标识;
    使用权利要求1-8任一项所述的运算加速器支持的至少一个向量运算函数来处理数据,所述向量运算函数的至少一个参数引用所述标识。
  19. 根据权利要求18所述的方法,其特征在于,还包括:
    使用所述运算加速器支持的至少一个数据搬运函数来处理数据,所述数据搬运函数的至少一个参数引用所述标识;或
    使用所述运算加速器支持的同步等待函数指定所述M个工作单元同步。
  20. 一种人工智能处理器,其特征在于,包括:
    至少一个权利要求1至8任一项所述的运算加速器;
    处理器,以及存储器,其上存储有数据和程序,
    所述程序当被所述处理器执行时,使得所述至少一个运算加速器针对所述数据执行所述程序中指定的运算,并将运算的结果返回所述处理器或存储器。
  21. 一种人工智能处理设备,其特征在于,包括:
    处理器,以及接口电路,其中,所述处理器通过所述接口电路访问存储器,所述存储器存储有程序和数据,
    所述程序当被所述处理器执行时,所述处理器通过所述接口电路访问所述至少一 个权利要求1至8任一项所述的运算加速器,使得所述至少一个运算加速器针对所述数据执行所述程序中指定的运算,并将运算的结果返回所述处理器或存储器。
  22. 一种电子装置,其特征在于,包括:
    处理器,以及
    存储器,其上存储有程序,所述程序当被所述处理器执行时,执行权利要求18或19所述的方法。
PCT/CN2021/143921 2021-12-31 2021-12-31 运算加速的处理方法、运算加速器的使用方法及运算加速器 WO2023123453A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180029995.2A CN116685964A (zh) 2021-12-31 2021-12-31 运算加速的处理方法、运算加速器的使用方法及运算加速器
PCT/CN2021/143921 WO2023123453A1 (zh) 2021-12-31 2021-12-31 运算加速的处理方法、运算加速器的使用方法及运算加速器

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/143921 WO2023123453A1 (zh) 2021-12-31 2021-12-31 运算加速的处理方法、运算加速器的使用方法及运算加速器

Publications (1)

Publication Number Publication Date
WO2023123453A1 true WO2023123453A1 (zh) 2023-07-06

Family

ID=86997206

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/143921 WO2023123453A1 (zh) 2021-12-31 2021-12-31 运算加速的处理方法、运算加速器的使用方法及运算加速器

Country Status (2)

Country Link
CN (1) CN116685964A (zh)
WO (1) WO2023123453A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156637A (zh) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 向量交叉多线程处理方法及向量交叉多线程微处理器
US20130067196A1 (en) * 2011-09-13 2013-03-14 Qualcomm Incorporated Vectorization of machine level scalar instructions in a computer program during execution of the computer program
CN107315574A (zh) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 一种用于执行矩阵乘运算的装置和方法
CN107315715A (zh) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 一种用于执行矩阵加/减运算的装置和方法
CN107329936A (zh) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 一种用于执行神经网络运算以及矩阵/向量运算的装置和方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156637A (zh) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 向量交叉多线程处理方法及向量交叉多线程微处理器
US20130067196A1 (en) * 2011-09-13 2013-03-14 Qualcomm Incorporated Vectorization of machine level scalar instructions in a computer program during execution of the computer program
CN107315574A (zh) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 一种用于执行矩阵乘运算的装置和方法
CN107315715A (zh) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 一种用于执行矩阵加/减运算的装置和方法
CN107329936A (zh) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 一种用于执行神经网络运算以及矩阵/向量运算的装置和方法

Also Published As

Publication number Publication date
CN116685964A (zh) 2023-09-01

Similar Documents

Publication Publication Date Title
TWI628594B (zh) 用戶等級分叉及會合處理器、方法、系統及指令
CN108388528B (zh) 基于硬件的虚拟机通信
US10768989B2 (en) Virtual vector processing
US10515049B1 (en) Memory circuits and methods for distributed memory hazard detection and error recovery
US10445234B2 (en) Processors, methods, and systems for a configurable spatial accelerator with transactional and replay features
US10572376B2 (en) Memory ordering in acceleration hardware
US10474375B2 (en) Runtime address disambiguation in acceleration hardware
US20190004878A1 (en) Processors, methods, and systems for a configurable spatial accelerator with security, power reduction, and performace features
KR101748506B1 (ko) 실시간 명령어 추적 프로세서들, 방법들 및 시스템들
JPH07501163A (ja) マルチプルスレッド用同期コプロセッサ付データ処理システム
US7444639B2 (en) Load balanced interrupt handling in an embedded symmetric multiprocessor system
US9250916B2 (en) Chaining between exposed vector pipelines
JP2021090188A (ja) ストリーミングファブリックインタフェース
US20240211429A1 (en) Remote promise and remote future for downstream components to update upstream states
US20210089305A1 (en) Instruction executing method and apparatus
KR20160046223A (ko) 멀티 쓰레딩 기반 멀티 코어 에뮬레이션 장치 및 방법
KR20240025019A (ko) 니어 메모리 컴퓨팅을 사용한 복합 연산에 대한 원자성 제공
KR20160113677A (ko) 다수의 스트랜드들로부터 명령어들을 디스패칭하기 위한 프로세서 로직 및 방법
WO2021243490A1 (zh) 一种处理器、处理方法及相关设备
US11366690B2 (en) Scheduling commands in a virtual computing environment
WO2023123453A1 (zh) 运算加速的处理方法、运算加速器的使用方法及运算加速器
US11960898B2 (en) Enabling asynchronous operations in synchronous processors
CN116710891A (zh) 子图的编译、执行方法及相关设备
KR100809294B1 (ko) 가상 머신에서 스레드 스케줄링을 수행하는 장치 및 그방법
WO2024087039A1 (zh) 一种块指令的处理方法和块指令处理器

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180029995.2

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21969798

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE