WO2024055168A1 - Resource allocation method, processor, and computing platform - Google Patents

Resource allocation method, processor, and computing platform

Info

Publication number
WO2024055168A1
WO2024055168A1 (PCT/CN2022/118522)
Authority
WO
WIPO (PCT)
Prior art keywords
computing
application
computing unit
units
allocated
Prior art date
Application number
PCT/CN2022/118522
Other languages
French (fr)
Chinese (zh)
Inventor
陈清龙
毕舒展
项能武
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2022/118522 priority Critical patent/WO2024055168A1/en
Publication of WO2024055168A1 publication Critical patent/WO2024055168A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L27/00Modulated-carrier systems

Definitions

  • the present application relates to the field of computer technology, and in particular, to a resource allocation method, processor and computing platform.
  • GPU graphics processing unit
  • NPU neural network processor
  • Figure 1 is a schematic diagram of the system on chip (SoC) of the Xavier chip, including a GPU, CPU, vision accelerator, deep learning accelerator (DLA), video encoder, video decoder, camera ingest, image signal processor (ISP), etc.
  • the GPU includes multiple independent stream multiprocessors (SM); each SM contains compute unified device architecture (CUDA) cores, and each CUDA core contains a variety of computing units.
  • SM stream multiprocessor
  • CUDA compute unified device architecture
  • FIG. 2 is a schematic diagram of the SoC of Huawei's Ascend chip, including an NPU, CPU, task scheduler, network card, universal serial bus (USB) interface, external memory interface, peripheral component interconnect express (PCIe) interface, general-purpose input/output (GPIO) interface, etc.
  • the NPU includes multiple AI Cores, and each AI Core stacks identical computing resources, such as the same types and numbers of tensor computing units, vector computing units, etc.
  • the present application provides a resource allocation method, a processor, and a computing platform to improve the resource utilization of chips including GPUs, NPUs, etc., and reduce chip area and static power consumption losses.
  • In a first aspect, a processor is provided that includes a computing unit pool, and the computing unit pool includes multiple computing units; each idle computing unit among the multiple computing units can be called on demand, and the number of computing units of each type among the multiple computing units is positively related to the number of times that type of computing unit is called.
  • the processor may be a processing chip used to implement parallel computing, such as an NPU or a GPU or other processing chips used to implement parallel computing.
  • This application pools the computing units in processing chips used for large-scale parallel computing, including NPUs and GPUs, so that each computing unit can be called on demand by upper-layer applications when it is idle, which can improve resource utilization, reduce the static power loss of the computing units, and reduce the chip area.
  • all computing units in the processor are pooled, that is, all computing units in the processor are included in the computing unit pool.
  • All computing units of the processor are pooled, which can minimize the static power loss of the processor and reduce the chip area.
  • When only some computing units of the processor are pooled, static power loss and chip area can still be reduced; at the same time, keeping a small number of computing units out of the pool (for example, in the form of an AI Core) gives the processor good forward compatibility.
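As an illustrative sketch of such a pool (not the patent's implementation; all class and method names here are hypothetical), idle units of each type could be tracked per type and handed out on demand, then returned when the caller is done:

```python
from collections import defaultdict

class ComputingUnitPool:
    """Hypothetical sketch of a computing unit pool: idle units of each
    type can be acquired on demand and released back when no longer used."""

    def __init__(self, unit_counts):
        # unit_counts: e.g. {"tensor": 8, "vector": 16, "scalar": 4}
        self.idle = {t: list(range(n)) for t, n in unit_counts.items()}
        self.busy = defaultdict(set)

    def acquire(self, unit_type, count=1):
        """Hand out `count` idle units of `unit_type`, or None if unavailable."""
        free = self.idle.get(unit_type, [])
        if len(free) < count:
            return None  # caller may retry, wait, or adjust its request
        units = [free.pop() for _ in range(count)]
        self.busy[unit_type].update(units)
        return units

    def release(self, unit_type, units):
        """Return units to the idle list so other applications can call them."""
        for u in units:
            self.busy[unit_type].discard(u)
            self.idle[unit_type].append(u)

pool = ComputingUnitPool({"tensor": 2, "vector": 4, "scalar": 2})
got = pool.acquire("vector", 3)   # three idle vector units
pool.release("vector", got)       # back to idle for other callers
```

A partially pooled design would simply construct the pool over a subset of the chip's units and leave the rest wired into fixed AI Cores.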
  • the types correspond to the algorithms executed by the computing units.
  • the types of computing units can be divided according to the algorithms executed by the computing units.
  • the type of the computing unit used to perform the addition (add) algorithm is the add computing unit
  • the type of the computing unit used to perform the logarithm (log) algorithm is the log computing unit, and so on.
  • each of the multiple computing units corresponds to a low-level computing instruction.
  • the add calculation unit corresponds to the addition instruction
  • the log computing unit corresponds to the log instruction. It is understandable that, in actual applications, different computing units can correspond to the same underlying operation instruction; the scale of the underlying operations varies with the capabilities of the processor and is specifically designed according to its chip specifications. The same underlying operation instruction may be executed in parallel by multiple computing units at the bottom layer.
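To illustrate that last point with a toy sketch (the chunking scheme and function names are assumptions, not the chip's actual dispatch logic): one logical add instruction over a large array can be split across several add computing units working in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def add_unit(a_chunk, b_chunk):
    # One hypothetical "add computing unit": element-wise addition on its chunk.
    return [x + y for x, y in zip(a_chunk, b_chunk)]

def dispatch_add(a, b, num_units):
    """Split one logical add instruction across `num_units` add units."""
    size = (len(a) + num_units - 1) // num_units  # ceil(len / num_units)
    chunks = [(a[i:i + size], b[i:i + size]) for i in range(0, len(a), size)]
    with ThreadPoolExecutor(max_workers=num_units) as ex:
        parts = list(ex.map(lambda ab: add_unit(*ab), chunks))
    return [v for part in parts for v in part]

result = dispatch_add(list(range(8)), [1] * 8, num_units=4)
# Each of the 4 units handled a 2-element chunk of the same add instruction.
```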
  • the computing unit pool includes one or more of a tensor computing unit pool, a vector computing unit pool, and a scalar computing unit pool.
  • the processor is configured to: receive a request to execute the first application; determine the computing resource requirements of the first application, and allocate computing units to the first application from the computing unit pool according to the computing resource requirements to implement the first application; the computing resource requirements include one or more of tensor computing units, vector computing units, and scalar computing units, as well as the quantity of each type of computing unit.
  • resources can be dynamically scheduled for upper-layer applications according to their requests, which can improve the flexibility of resource allocation and improve resource utilization.
  • the processor is also used to: determine multiple processes corresponding to the first application; determine the computing resource requirements of each of the multiple processes, and allocate computing units to each process from the computing unit pool according to the computing resource requirements of each process to implement the first application; wherein the computing units allocated to each process are different from the computing units allocated to other processes.
  • the processor is further configured to: determine whether the computing units allocated to the first application meet the needs of the first application; when they cannot meet those needs, adjust the computing resource allocation result of the first application, and allocate computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
  • the processor is further configured to establish a correspondence between the first application and the computing unit allocated to the first application.
  • a resource allocation method is provided, which is applied to a processor.
  • the processor includes a computing unit pool, and the computing unit pool includes multiple computing units; each idle computing unit among the multiple computing units can be called on demand, and the number of computing units of each type among the multiple computing units is positively correlated with the number of times that type of computing unit is called; the method includes: receiving a request to execute the first application; determining the computing resource requirements of the first application, and allocating computing units to the first application from the computing unit pool according to the computing resource requirements to implement the first application.
  • the computing resource requirements include one or more of tensor computing units, vector computing units, and scalar computing units, as well as the number of each type of computing unit.
  • determining the computing resource requirements of the first application and allocating computing units to the first application from the computing unit pool according to the computing resource requirements to implement the first application includes: determining multiple processes corresponding to the first application; determining the computing resource requirements of each process in the multiple processes, and allocating computing units to each process from the computing unit pool according to the computing resource requirements of each process to implement the first application; wherein the computing units allocated to each process are different from those allocated to other processes.
  • the method further includes: determining whether the computing units allocated to the first application meet the needs of the first application; when they cannot meet those needs, adjusting the computing resource allocation result of the first application, and allocating computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
  • the method further includes: establishing a correspondence between the first application and the computing unit allocated to the first application.
  • In a third aspect, a processing device is provided, which includes modules/units/technical means for executing the method described in the second aspect or any possible design of the second aspect.
  • the device may include: a transceiver module, configured to receive a request to execute the first application; and a processing module, configured to determine the computing resource requirements of the first application and allocate computing units to the first application from the computing unit pool of the processor according to the computing resource requirements to implement the first application.
  • the computing resource requirements include one or more of tensor computing units, vector computing units, and scalar computing units, and the number of each type of computing unit; wherein the computing unit pool includes multiple computing units; each idle computing unit among the multiple computing units can be called on demand, and the number of computing units of each type among the multiple computing units is positively related to the number of times that type of computing unit is called.
  • the processing module can also be used to: determine multiple processes corresponding to the first application; determine the computing resource requirements of each of the multiple processes, and allocate computing units to each process from the computing unit pool according to the computing resource requirements of each process to implement the first application; wherein the computing units allocated to each process are different from the computing units allocated to other processes.
  • the processing module may also be used to: determine whether the computing unit allocated to the first application meets the needs of the first application; when the computing unit allocated to the first application cannot meet the needs of the first application, Adjust the computing resource allocation result of the first application, and allocate computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
  • the processing module may also be used to establish a correspondence between the first application and the computing unit allocated to the first application.
  • a fourth aspect provides a computing platform, including the processor described in the first aspect and the processing device described in the third aspect.
  • a computer-readable storage medium stores computer-executable instructions.
  • when called by a computer, the computer-executable instructions cause the method described in the second aspect, or any possible design of the second aspect, to be implemented.
  • Figure 1 is a schematic diagram of a SoC
  • FIG. 2 is a schematic diagram of another SoC
  • Figure 3 is a schematic diagram of a processor provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram showing that all computing units in the NPU are pooled
  • Figure 5 is a schematic diagram showing that some computing units in the NPU are pooled
  • Figure 6 is a statistical chart of the usage frequency of operators
  • Figure 7 is a flow chart of a resource allocation method provided by an embodiment of the present application.
  • Figure 8A is a schematic diagram of a possible resource allocation provided by an embodiment of the present application.
  • Figure 8B is a schematic diagram of another possible resource allocation provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a possible computing unit pool 11 provided by the embodiment of the present application.
  • Figure 10 is a schematic diagram of a processing device provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of a computing platform provided by an embodiment of the present application.
  • the processor 01 includes a computing unit pool 11 , and the computing unit pool 11 includes a plurality of computing units 111 .
  • the computing units in the computing unit pool 11 are pooled computing units, that is, computing units that can be shared and called by different upper-layer applications.
  • when a computing unit in the computing unit pool 11 is not called (or not running), it is an idle computing unit and can be called by upper-layer applications.
  • each idle computing unit among the plurality of computing units 111 can be called on demand, that is, according to the needs of upper-layer applications: if any upper-layer application needs a computing unit and there are idle computing units in the computing unit pool 11, an idle computing unit can be called by that upper-layer application.
  • the processor 01 may be a processing chip used to implement parallel computing, such as an NPU or a GPU or other processing chips used to implement parallel computing.
  • the multiple computing units 111 in the computing unit pool 11 may be computing units in a chip used to implement parallel computing, such as computing units in an NPU or a GPU.
  • the computing unit pool 11 can be pooled by computing units in an NPU or a GPU or other processing chips used to implement parallel computing.
  • the above solution pools the computing units in processing chips used for large-scale parallel computing, including NPUs and GPUs, so that each computing unit can be called on demand by upper-layer applications when it is idle.
  • in this way, resource utilization can be improved; under the same resource usage, the static power consumption loss of the computing units and the chip area of the processor 01 can be reduced.
  • the following mainly uses the NPU as an example, but the same implementation can be applied to other chips such as GPUs.
  • all computing units in the NPU are pooled, that is, all computing units in the NPU are included in the computing unit pool 11 .
  • all computing units 111 in the NPU are in the computing unit pool 11 .
  • every idle computing unit among all computing units of the NPU can be called on demand.
  • only some of the computing units in the NPU are pooled, that is, only some of the computing units are included in the computing unit pool 11 .
  • some computing units 111 in the NPU are in the computing unit pool 11, and other computing units 111 are not in the computing unit pool 11, but are configured in the AI Core.
  • the computing unit pool 11 may include different types of computing units, where the computing unit type corresponds to the algorithm executed by the computing unit.
  • the number of computing units 111 of each type among the plurality of computing units 111 is positively related to the number of times that type of computing unit 111 is called; in other words, the more times a type of computing unit 111 is called, the more units of that type are placed in the computing unit pool 11.
  • each computing unit 111 corresponds to a low-level operation instruction of the processor 01.
  • each computing unit 111 among the plurality of computing units 111 corresponds to a low-level operation instruction.
  • different computing units 111 can correspond to the same underlying operation instructions, and the scale of the underlying operations varies according to the capabilities of the processor 01, and is specifically designed according to the chip specifications of the processor 01.
  • for an add instruction, when the amount of data to be processed is large, multiple add computing units can be used at the bottom layer to compute different data in parallel in order to improve the processing capability (that is, the degree of parallelism) of the processor 01; thus, for the same add instruction, the bottom layer may perform multiple additions in parallel.
  • the number of each type of computing unit in the computing unit pool 11 can be configured to meet actual requirements by analyzing and counting the proportions of the various computing units used by the operators in the models (such as neural network models) that the processor 01 runs in actual applications. It can be understood that one operator can correspond to one or more computing units 111.
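A toy sketch of that sizing idea (the trace, numbers, and headroom percentage are hypothetical; the patent only states the principle): count how often each unit type is used in a representative operator trace, then provision the pool proportionally, with a small margin for bursts:

```python
from collections import Counter

def size_pool(operator_trace, total_units, headroom_pct=10):
    """Provision unit counts proportional to observed call frequency,
    with a configurable headroom margin so the pool can absorb bursts."""
    calls = Counter(operator_trace)        # unit_type -> number of calls
    total_calls = sum(calls.values())
    scale = 100 + headroom_pct
    return {
        # Integer ceiling division avoids floating-point rounding surprises.
        t: -(-(total_units * n * scale) // (total_calls * 100))
        for t, n in calls.items()
    }

# Hypothetical trace: which unit type each operator invocation used.
trace = ["add"] * 50 + ["mul"] * 30 + ["log"] * 15 + ["sqrt"] * 5
plan = size_pool(trace, total_units=100)
# add is called most often, so the pool holds the most add units.
```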
  • Figure 6 is a statistical chart of the frequency of use of various operators in the model. The functions of each operator in Figure 6 are explained as follows:
  • Slice slice
  • BatchNormalization batch normalization
  • ConstantOfShape generates a tensor with a given value and shape
  • LeakyRelu leaky rectified linear unit
  • GlobalAveragePool global average pooling
  • ConvTranspose transposed convolution, also known as Deconvolution
  • Equal determines whether elements are equal
  • ReduceMean reduction layer in convolutional neural network
  • the number of each type of computing unit 111 may be slightly greater than the number corresponding to the calls of that type of computing unit 111, to ensure that the processor 01 has sufficient performance to cope with bursts.
  • in this way, the static power loss of the processor 01 can be reduced while ensuring its running performance.
  • the computing unit 111 is classified into types.
  • the multiple computing units 111 may include one or more types such as a tensor computing unit (or matrix computing unit) 111A, a vector computing unit 111B, and a scalar computing unit 111C.
  • the tensor calculation unit 111A is used to perform matrix calculations
  • the vector calculation unit 111B is used to perform vector calculations
  • the scalar calculation unit 111C is used to perform scalar calculations.
  • the multiple computing units 111 may include an eight-bit integer (int8) computing unit and a 16-bit floating point (fp16) computing unit.
  • Int8 an eight-bit integer
  • fp16 16-bit floating point
  • the multiple computing units 111 may also include one or more types such as a 32-bit floating point (fp32) computing unit and a 4-bit integer (int4) computing unit.
  • all computing units 111 are first divided into tensor computing units 111A, vector computing units 111B, and scalar computing units 111C, and the tensor computing units 111A are then further subdivided according to mathematical operation types and/or data types.
  • the processor 01 can schedule resources for the upper-layer application from the computing unit pool 11 when receiving a request from the upper-layer application.
  • FIG. 7 is a flow chart of a resource allocation method provided by an embodiment of the present application.
  • the method may be executed by the processor 01 in FIG. 3, or by a processing device other than the processor 01.
  • the method includes:
  • S701. Receive a request to execute the first application.
  • S702. Determine the computing resource requirements of the first application, and allocate computing units to the first application from the computing unit pool according to the computing resource requirements to implement the first application.
  • the computing resource requirements include one or more types of computing units among tensor computing units, vector computing units, and scalar computing units, and the quantity of each type of computing unit.
  • the NPU in the processor 01 or other specially configured processing unit for resource allocation may be used to execute the above method, which is not limited in this application.
  • the computing unit allocated to each process is different from the computing units allocated to other processes.
  • different processes among the plurality of processes are allocated different computing units.
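A minimal sketch of that per-process allocation (function and structure names are hypothetical): each process states its requirements, and units handed to one process are removed from the idle set so no other process receives them:

```python
def allocate_per_process(idle_units, process_requirements):
    """idle_units: {unit_type: [unit ids]}; process_requirements:
    {process: {unit_type: count}}. Returns disjoint per-process allocations."""
    allocations = {}
    for proc, req in process_requirements.items():
        grant = {}
        for unit_type, count in req.items():
            free = idle_units.get(unit_type, [])
            if len(free) < count:
                raise RuntimeError(f"not enough idle {unit_type} units for {proc}")
            # Popping from the shared idle list guarantees disjointness.
            grant[unit_type] = [free.pop() for _ in range(count)]
        allocations[proc] = grant
    return allocations

idle = {"tensor": [0, 1, 2, 3], "vector": [0, 1]}
allocs = allocate_per_process(idle, {
    "proc0": {"tensor": 2, "vector": 1},
    "proc1": {"tensor": 1, "vector": 1},
})
# proc0 and proc1 never share a unit id of the same type.
```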
  • FIG. 8A is a schematic diagram of a possible resource allocation.
  • resources such as add, sub, mul, and exp are each placed once in the processor 01, and matrix multiplication is placed in multiple copies.
  • application 0 and application n are to be executed at the same time.
  • the computing resources required by application 0 include matrix multiplication, add, and sub, and the computing resources required by application n include matrix multiplication, mul, and exp.
  • the resources allocated to application 0 can be shown as resource group 0, including matrix multiplication, add, and sub.
  • the resources allocated to application n can be shown as resource group n, including matrix multiplication, mul, and exp. In this way, different types of computing resources can be allocated to different applications, which reduces chip area, cost, and static power consumption without reducing the processing performance of the chip.
  • FIG. 8B is a schematic diagram of another possible resource allocation.
  • resources such as add, sub, mul, and exp are each placed in two copies in the processor 01, and matrix multiplication is placed in multiple copies.
  • application 0 and application n are to be executed at the same time.
  • Application 0 has a high demand for add, while application n has a low demand for add.
  • the resources allocated to application 0 can be as shown in resource group 0.
  • the resources allocated to application n can be as shown in resource group n, including matrix multiplication, sub, mul, and exp. In this way, different quantities of computing resources can be allocated to different applications, which reduces chip area, cost, and static power consumption without reducing the processing performance of the chip.
  • resources can be dynamically scheduled for upper-layer applications based on their requests, improving the flexibility of resource allocation.
  • after computing units are allocated to the first application, it may also be determined whether they meet the needs of the first application; when the allocated computing units cannot meet the needs of the first application, the computing resource allocation result of the first application is adjusted, and computing units are allocated to the first application from the computing unit pool 11 according to the adjusted computing resource allocation result.
  • for example, the computing resource allocation result of the first application is adjusted to add more computing units, such as additionally assigning the computing units 111b and 111c to the first application to meet the indicators required by the first application.
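That adjust-and-reallocate step might be sketched like this (purely illustrative; the requirement check, unit granularity, and round limit are assumptions):

```python
def allocate_with_adjustment(idle_units, needed, meets_requirement, max_rounds=3):
    """Allocate `needed` units, then grow the allocation one unit at a time
    until the application's requirement check passes or the pool runs dry."""
    granted = [idle_units.pop() for _ in range(min(needed, len(idle_units)))]
    for _ in range(max_rounds):
        if meets_requirement(len(granted)):
            return granted
        if not idle_units:
            break  # pool exhausted; return the best allocation we have
        granted.append(idle_units.pop())  # adjust: assign one more unit
    return granted

idle = list(range(8))
# Hypothetical check: the first application needs at least 5 units to hit its metric.
got = allocate_with_adjustment(idle, needed=3, meets_requirement=lambda n: n >= 5)
# Starts at 3 units, grows to 5 once the check reports the need is unmet.
```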
  • a corresponding relationship between the first application and the computing unit allocated to the first application can also be established.
  • the computing units allocated to the first application are fixed (that is, changed from idle computing units to non-idle computing units) and are used only by the first application; while the first application is using them, other applications cannot call this part of the resources.
  • afterwards, this part of the computing units may be released (for example, by deleting the corresponding relationship), so that they may be called by other applications.
  • resource utilization can be further improved.
  • alternatively, the first application may not release this part of the computing units (for example, by retaining the corresponding relationship), so that they are not called by other applications but remain reserved for the first application; in this way, when the first application runs next time, it can directly use this part of the computing units without allocating resources again.
  • the performance of the first application can be guaranteed first.
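A sketch of the correspondence table and the release-versus-retain choice described above (class and method names are hypothetical):

```python
class AllocationTable:
    """Tracks which computing units each application currently owns."""

    def __init__(self, idle_units):
        self.idle = list(idle_units)
        self.owned = {}  # app -> [unit ids]: the correspondence relationship

    def allocate(self, app, count):
        # Retained units are reused first: no re-allocation on the next run.
        if app in self.owned:
            return self.owned[app]
        units = [self.idle.pop() for _ in range(count)]
        self.owned[app] = units  # establish the correspondence
        return units

    def release(self, app):
        # Delete the correspondence so other applications can call the units.
        self.idle.extend(self.owned.pop(app, []))

table = AllocationTable(range(4))
first = table.allocate("app0", 2)
again = table.allocate("app0", 2)   # retained: same units, no new allocation
table.release("app0")               # now idle again for other applications
```

Releasing favors overall resource utilization; retaining favors the first application's own performance on its next run, which is the trade-off the two options above describe.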
  • the computing unit pool 11 includes one or more of a tensor computing unit pool, a vector computing unit pool, and a scalar computing unit pool.
  • the computing unit pool 11 includes a tensor computing unit pool 11A, a vector computing unit pool 11B, and a scalar computing unit pool 11C.
  • the tensor computing unit pool 11A includes one or more tensor computing units 111A.
  • the vector calculation unit pool 11B includes one or more vector calculation units 111B, and the scalar calculation unit pool 11C includes one or more scalar calculation units 111C.
  • when allocating a computing unit, the corresponding computing unit pool may be found first according to the algorithm dimension of the computation, and then the computing unit may be determined from that pool.
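That two-step lookup could be sketched as follows (the algorithm-to-dimension mapping and unit names are hypothetical):

```python
# Hypothetical mapping: algorithm -> which typed sub-pool serves it.
ALGO_TO_POOL = {"matmul": "tensor", "add": "vector", "loop_counter": "scalar"}

sub_pools = {
    "tensor": ["t0", "t1"],
    "vector": ["v0", "v1", "v2"],
    "scalar": ["s0"],
}

def find_unit(algorithm):
    """Step 1: pick the sub-pool by the algorithm's dimension;
    step 2: take an idle unit from that sub-pool."""
    pool = sub_pools[ALGO_TO_POOL[algorithm]]
    return pool.pop() if pool else None

unit = find_unit("matmul")   # served from the tensor sub-pool
```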
  • the embodiment of the present application also provides a processing device 100.
  • the processing device 100 includes modules/units/technical means for executing the method shown in Figure 7.
  • the processing device 100 may include:
  • Transceiver module 1001 configured to receive a request to execute the first application
  • the processing module 1002 is used to determine the computing resource requirements of the first application, and allocate computing units to the first application from the computing unit pool of the processor according to the computing resource requirements to implement the first application.
  • the computing resource requirements include one or more of tensor computing units, vector computing units, and scalar computing units, and the number of each type of computing unit; wherein the computing unit pool includes multiple computing units; each idle computing unit among the multiple computing units can be called on demand, and the number of computing units of each type among the multiple computing units is positively correlated with the number of times that type of computing unit is called.
  • an embodiment of the present application also provides a computing platform 1100 , including the above-mentioned processor 01 and the processing device 100 .
  • embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores computer-executable instructions; when called by a computer, the computer-executable instructions cause the method shown in Figure 7 to be executed.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multi Processors (AREA)

Abstract

Disclosed in the present application are a resource allocation method, a processor, and a computing platform, which are used for improving the resource utilization rate of a chip, such as a GPU and an NPU, and reducing the area of the chip and static power consumption loss. The processor comprises a computing unit pool, which comprises a plurality of computing units, wherein each idle computing unit among the plurality of computing units can be called as required, and the numbers of computing units of different types among the plurality of computing units are positively correlated with the numbers of instances of the computing units of the types being called. In the present application, by means of performing pooling on computing units in a processing chip, such as an NPU and a GPU, each computing unit can be called by means of an upper-layer application as required when the computing unit is idle, thereby improving the resource utilization rate, reducing static power consumption loss of the computing units, and reducing the area of the chip.

Description

A resource allocation method, processor and computing platform

Technical field
The present application relates to the field of computer technology, and in particular, to a resource allocation method, processor and computing platform.
Background
With the rise of machine learning, artificial intelligence (AI), autonomous driving, industrial simulation and other fields, general-purpose processors (such as the central processing unit (CPU)) encounter more and more performance bottlenecks when processing massive computations and massive data/images, such as low parallelism, insufficient bandwidth, and high latency. To cope with the demand for diversified computing, more and more scenarios introduce chips such as the graphics processing unit (GPU) or neural network processing unit (NPU), which together with general-purpose processors form heterogeneous computing platforms. There are many kinds of computing resources in NPU, GPU and other chips, such as tensor computing units, vector computing units, and scalar computing units. In the NPUs and GPUs produced by existing chip manufacturers, the proportion of each type of computing resource is fixed, and the same computing resources are replicated many times.
For example, Figure 1 is a schematic diagram of the system on chip (SoC) of the Xavier chip, which includes a GPU, a CPU, a vision accelerator, a deep learning accelerator (DLA), a video encoder, a video decoder, camera ingest, an image signal processor (ISP), etc. The GPU includes multiple independent streaming multiprocessors (SM); each SM contains compute unified device architecture (CUDA) cores, and each CUDA core contains a variety of computing units. Moreover, the various mathematical operations (such as exponent (exp), reciprocal (1/x), logarithm (log), square root (sqrt), addition (add), subtraction (sub), multiplication (mul), etc.) each correspond to a separate physical computing unit; computing units of multiple specifications are even placed to support different data types (such as 8-bit integer (int8), 16-bit floating point (fp16), 32-bit floating point (fp32), 4-bit integer (int4), etc.). All of this requires stacking multiple kinds of resources in each CUDA core.
For example, Figure 2 is a schematic diagram of the SoC of Huawei's Ascend chip, which includes an NPU, a CPU, a task scheduler, a network card, a universal serial bus (USB) interface, an external memory interface, a peripheral component interconnect express (PCIe) interface, a general-purpose input/output (GPIO) interface, etc. The NPU includes multiple AI Cores, and each AI Core internally stacks the same computing resources, such as the same types and the same numbers of tensor computing units, vector computing units, etc.
However, in actual applications, not all computing resources of chips such as NPUs and GPUs are used at the same time while they run. Therefore, the existing design of chips such as NPUs and GPUs not only wastes chip area but also suffers from static power consumption loss.
How to improve the resource utilization of chips, including NPUs and GPUs, is the problem to be solved by the present application.
Summary
The present application provides a resource allocation method, a processor and a computing platform, so as to improve the resource utilization of chips including GPUs and NPUs, and to reduce the chip area and static power consumption loss.
According to a first aspect, a processor is provided. The processor includes a computing unit pool, and the computing unit pool includes a plurality of computing units. Each idle computing unit among the plurality of computing units can be called on demand, and the number of computing units of each type among the plurality of computing units is positively correlated with the number of times computing units of that type are called.
It can be understood that the processor may be a processing chip used to implement parallel computing, for example an NPU, a GPU, or another processing chip used to implement parallel computing.
In the present application, by pooling the computing units in processing chips used to implement massively parallel computing, including NPUs and GPUs, each computing unit can be called on demand by an upper-layer application when it is idle, which can improve resource utilization, reduce the static power consumption loss of the computing units, and reduce the chip area.
In one possible design, all computing units in the processor are pooled, that is, all computing units in the processor are included in the computing unit pool.
When all computing units of the processor are pooled, the static power consumption loss of the processor can be minimized and the chip area reduced.
In another possible design, only some of the computing units in the processor are pooled, that is, only some of the computing units are included in the computing unit pool.
When only some computing units of the processor are pooled, static power consumption loss can still be reduced and the chip area of the processor decreased. At the same time, keeping a small number of computing units outside the pool (for example, in the form of AI Cores) gives the processor good forward compatibility.
In one possible design, the types correspond to the algorithms executed by the plurality of computing units. In other words, the computing units can be typed according to the algorithms they execute. For example, a computing unit used to perform the addition (add) algorithm is of the add type, a computing unit used to perform the logarithm (log) algorithm is of the log type, and so on.
In one possible design, each of the plurality of computing units corresponds to one low-level operation instruction. For example, the add computing unit corresponds to the add instruction, and the log computing unit corresponds to the log instruction. It can be understood that, in actual applications, different computing units may correspond to the same low-level operation instruction; the scale of the low-level operations varies with the capability of the processor and is designed according to the processor's chip specifications, and the same low-level operation instruction may be executed in parallel by multiple computing units at the bottom layer.
In one possible design, the computing unit pool includes one or more of a tensor computing unit pool, a vector computing unit pool, and a scalar computing unit pool.
Pooling tensor computing units, vector computing units and scalar computing units separately can improve the efficiency of resource management and the efficiency with which upper-layer applications schedule resources.
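As a non-authoritative illustration of the pooling idea described above, a computing unit pool grouped by type, with on-demand acquisition and release of idle units, might be sketched as follows (the class name, unit-id scheme and interface are assumptions made for this sketch, not taken from the present application):

```python
from collections import defaultdict

class ComputeUnitPool:
    """Units are grouped by type (e.g. "tensor", "vector", "scalar");
    any idle unit of the requested type can be handed to an upper-layer
    application, and returns to the idle set when released."""

    def __init__(self, counts):
        # counts: mapping such as {"tensor": 8, "vector": 16, "scalar": 4}
        self.idle = defaultdict(list)
        for unit_type, n in counts.items():
            self.idle[unit_type] = [f"{unit_type}-{i}" for i in range(n)]
        self.busy = defaultdict(list)

    def acquire(self, unit_type):
        """Return an idle unit of the given type, or None if all are busy."""
        if not self.idle[unit_type]:
            return None
        unit = self.idle[unit_type].pop()
        self.busy[unit_type].append(unit)
        return unit

    def release(self, unit):
        """Mark a previously acquired unit as idle again."""
        unit_type = unit.rsplit("-", 1)[0]
        self.busy[unit_type].remove(unit)
        self.idle[unit_type].append(unit)

pool = ComputeUnitPool({"tensor": 2, "vector": 4, "scalar": 1})
u = pool.acquire("tensor")  # a tensor unit is now busy
pool.release(u)             # and idle again after use
```

Keeping one sub-pool per type means an allocation request never has to scan unrelated units, which is one way the per-type pooling can improve scheduling efficiency.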
In one possible design, the processor is configured to: receive a request to execute a first application; determine the computing resource requirements of the first application; and allocate computing units to the first application from the computing unit pool according to the computing resource requirements so as to implement the first application, where the computing resource requirements include one or more of tensor computing units, vector computing units and scalar computing units, as well as the quantity of each kind of computing unit.
In this way, resources can be dynamically scheduled for upper-layer applications according to their requests, which improves the flexibility of resource allocation and improves resource utilization.
In one possible design, the processor is further configured to: determine a plurality of processes corresponding to the first application; determine the computing resource requirements of each of the plurality of processes; and allocate computing units to each process from the computing unit pool according to the computing resource requirements of that process so as to implement the first application, where the computing units allocated to each process are different from the computing units allocated to the other processes.
In this way, computing units can be allocated to each process from the computing unit pool according to the computing resource requirements of each process of the first application, which improves resource utilization.
In one possible design, the processor is further configured to: determine whether the computing units allocated to the first application meet the needs of the first application; and when the computing units allocated to the first application cannot meet the needs of the first application, adjust the computing resource allocation result of the first application, and allocate computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
In one possible design, the processor is further configured to establish a correspondence between the first application and the computing units allocated to the first application.
In this way, it can be ensured that the resources allocated to the first application meet its needs, which improves the reliability of resource allocation.
According to a second aspect, a resource allocation method is provided, applied to a processor. The processor includes a computing unit pool, and the computing unit pool includes a plurality of computing units; each idle computing unit among the plurality of computing units can be called on demand, and the number of computing units of each type among the plurality of computing units is positively correlated with the number of times computing units of that type are called. The method includes: receiving a request to execute a first application; determining the computing resource requirements of the first application; and allocating computing units to the first application from the computing unit pool according to the computing resource requirements so as to implement the first application, where the computing resource requirements include one or more of tensor computing units, vector computing units and scalar computing units, as well as the quantity of each kind of computing unit.
In one possible design, determining the computing resource requirements of the first application and allocating computing units to the first application from the computing unit pool according to the computing resource requirements so as to implement the first application includes: determining a plurality of processes corresponding to the first application; determining the computing resource requirements of each of the plurality of processes; and allocating computing units to each process from the computing unit pool according to the computing resource requirements of that process so as to implement the first application, where the computing units allocated to each process are different from the computing units allocated to the other processes.
In one possible design, the method further includes: determining whether the computing units allocated to the first application meet the needs of the first application; and when the computing units allocated to the first application cannot meet the needs of the first application, adjusting the computing resource allocation result of the first application, and allocating computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
In one possible design, the method further includes: establishing a correspondence between the first application and the computing units allocated to the first application.
According to a third aspect, a processing apparatus is provided, which includes modules/units/technical means for executing the method described in the second aspect or any possible design of the second aspect.
Exemplarily, the apparatus may include: a transceiver module, configured to receive a request to execute a first application; and a processing module, configured to determine the computing resource requirements of the first application and allocate computing units to the first application from the computing unit pool of a processor according to the computing resource requirements so as to implement the first application, where the computing resource requirements include one or more of tensor computing units, vector computing units and scalar computing units, as well as the quantity of each kind of computing unit. The computing unit pool includes a plurality of computing units; each idle computing unit among the plurality of computing units can be called on demand, and the number of computing units of each type among the plurality of computing units is positively correlated with the number of times computing units of that type are called.
In one possible design, the processing module may be further configured to: determine a plurality of processes corresponding to the first application; determine the computing resource requirements of each of the plurality of processes; and allocate computing units to each process from the computing unit pool according to the computing resource requirements of that process so as to implement the first application, where the computing units allocated to each process are different from the computing units allocated to the other processes.
In one possible design, the processing module may be further configured to: determine whether the computing units allocated to the first application meet the needs of the first application; and when the computing units allocated to the first application cannot meet the needs of the first application, adjust the computing resource allocation result of the first application, and allocate computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
In one possible design, the processing module may be further configured to establish a correspondence between the first application and the computing units allocated to the first application.
According to a fourth aspect, a computing platform is provided, including the processor described in the first aspect and the processing apparatus described in the third aspect.
According to a fifth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-executable instructions which, when called by a computer, cause the method described in the second aspect or any possible design of the second aspect to be executed.
Brief Description of Drawings
Figure 1 is a schematic diagram of an SoC;
Figure 2 is a schematic diagram of another SoC;
Figure 3 is a schematic diagram of a processor provided by an embodiment of the present application;
Figure 4 is a schematic diagram in which all computing units in an NPU are pooled;
Figure 5 is a schematic diagram in which some computing units in an NPU are pooled;
Figure 6 is a statistical chart of the usage frequency of operators;
Figure 7 is a flow chart of a resource allocation method provided by an embodiment of the present application;
Figure 8A is a schematic diagram of a possible resource allocation provided by an embodiment of the present application;
Figure 8B is a schematic diagram of another possible resource allocation provided by an embodiment of the present application;
Figure 9 is a schematic diagram of a possible computing unit pool 11 provided by an embodiment of the present application;
Figure 10 is a schematic diagram of a processing apparatus provided by an embodiment of the present application;
Figure 11 is a schematic diagram of a computing platform provided by an embodiment of the present application.
Detailed Description of Embodiments
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to Figure 3, which is a schematic diagram of a processor 01 provided by an embodiment of the present application, the processor 01 includes a computing unit pool 11, and the computing unit pool 11 includes a plurality of computing units 111.
It can be understood that the computing units in the computing unit pool 11 are pooled computing units, or in other words shareable computing units that can, for example, be called by different upper-layer applications. In a specific implementation, each computing unit in the computing unit pool 11 is an idle computing unit when it is not called (that is, not running), and can then be called by an upper-layer application. Each idle computing unit among the plurality of computing units 111 can be called on demand, for example according to the needs of upper-layer applications; that is, if any upper-layer application needs a computing unit and an idle computing unit exists in the computing unit pool 11, that idle computing unit can be called by the upper-layer application.
The processor 01 may be a processing chip used to implement parallel computing, for example an NPU or a GPU or another processing chip used to implement parallel computing. The plurality of computing units 111 in the computing unit pool 11 may be computing units in a chip used to implement parallel computing, for example computing units in an NPU or a GPU. In other words, the computing unit pool 11 may be obtained by pooling the computing units in an NPU, a GPU, or another processing chip used to implement parallel computing.
In the above solution, by pooling the computing units in processing chips used to implement massively parallel computing, including NPUs and GPUs, each computing unit can be called on demand by an upper-layer application when it is idle. Compared with the prior-art practice of stacking multiple copies of the same computing resources in designs such as SMs or AI Cores, this can improve resource utilization. For the same amount of resource usage, the static power consumption loss of the computing units can be reduced, and the chip area of the processor 01 can be reduced.
For ease of description, the NPU is mainly used as an example below, but the same implementations can be applied to other chips such as GPUs.
In one possible design, all computing units in the NPU are pooled, that is, all computing units in the NPU are included in the computing unit pool 11. For example, as shown in Figure 4, all computing units 111 in the NPU are in the computing unit pool 11. In other words, each idle computing unit among all the computing units of the NPU can be called on demand.
In this way, all computing units of the NPU are pooled, which can minimize the static power consumption loss of the processor 01 and reduce the chip area of the processor 01.
In another possible design, only some of the computing units in the NPU are pooled, that is, only some of the computing units are included in the computing unit pool 11. For example, as shown in Figure 5, some computing units 111 in the NPU are in the computing unit pool 11, while the other computing units 111 are not in the computing unit pool 11 but are configured in AI Cores.
In this way, some computing units of the NPU are pooled, which can reduce static power consumption loss and reduce the chip area of the processor 01. At the same time, keeping a small number of computing units in the form of AI Cores gives the processor 01 good forward compatibility.
In one possible design, the computing unit pool 11 may include different types of computing units, where the computing unit type corresponds to the algorithm executed by the computing unit. The number of computing units 111 of each type among the plurality of computing units 111 is positively correlated with the number of times computing units 111 of that type are called. In other words, the more often a type of computing unit 111 is called, the more units of that type exist in the computing unit pool 11; or, the more often an algorithm is called, the more computing units 111 corresponding to that algorithm exist in the computing unit pool 11. For example, if add computing units are called frequently, the computing unit pool 11 may contain more add computing units; if log computing units are called rarely, the computing unit pool 11 may contain fewer log computing units.
In a specific implementation, each computing unit 111 corresponds to one low-level operation instruction of the processor 01. In other words, each computing unit 111 among the plurality of computing units 111 corresponds to one low-level operation instruction. It can be understood that, in actual applications, different computing units 111 may correspond to the same low-level operation instruction; the scale of the low-level operations varies with the capability of the processor 01 and is designed according to the chip specifications of the processor 01. For example, for one add instruction, when the amount of data to be processed by the instruction is large, multiple add computing units may perform parallel computation on different parts of the data at the bottom layer in order to improve the processing capability (that is, the degree of parallelism) of the processor 01 at a given moment; therefore, for the same add instruction, the bottom layer may perform multiple additions.
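As a hedged sketch of the parallelism just described (the chunking scheme, chunk count and function name are assumptions made for illustration, not the processor's actual dispatch logic), one logical add instruction over a long vector can be split into chunks, each of which would be handled by a separate add computing unit:

```python
def parallel_add(a, b, units=4):
    """Simulate one logical "add" instruction executed by several add units.

    The data is split into `units` roughly equal chunks; on real hardware
    each chunk would be dispatched to a different idle add computing unit,
    so one instruction triggers multiple low-level additions in parallel.
    (Here the chunks are processed sequentially for illustration.)
    """
    chunk = -(-len(a) // units)  # ceiling division: elements per unit
    out = []
    for i in range(0, len(a), chunk):
        # each slice stands for the work of one add computing unit
        out.extend(x + y for x, y in zip(a[i:i + chunk], b[i:i + chunk]))
    return out

result = parallel_add([1, 2, 3, 4, 5, 6, 7, 8],
                      [10, 20, 30, 40, 50, 60, 70, 80])
```

The number of chunks, and hence the achievable parallelism, is bounded by how many add units the chip specification provides.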
As an example, the proportions of the various computing units used by the various operators in the models (such as neural network models) run by the processor 01 in actual applications can be analyzed statistically, and the numbers of the various computing units in the computing unit pool 11 can be configured according to the resource levels that meet actual requirements. It can be understood that one operator may correspond to one or more computing units 111.
For example, Figure 6 is a statistical chart of the usage frequency of various operators in a model. The functions of the operators in Figure 6 are explained as follows:
Conv: convolution;
Relu: rectified linear unit;
Dequant: dequantization;
Mul: multiplication;
Slice: slicing;
Gather: gathering (collects entries by index);
Shape: returns the shape of a tensor;
BatchNormalization: batch normalization;
Resize: scaling;
Pad: data padding;
ConstantOfShape: generates a tensor with a given value and shape;
Tanh: activation function;
LeakyRelu: leaky rectified linear unit;
GlobalAveragePool: global average pooling;
Floor: rounding down;
ConvTranspose: transposed convolution, also known as deconvolution;
Softmax: normalization;
Exp: exponent;
Flatten: dimensionality reduction;
Equal: determines whether sequences are equal;
Expand: expansion;
MatMul: matrix multiplication;
And: AND operation;
ReduceMean: mean reduction;
ReduceMax: takes the maximum value.
It can be understood that Figure 6 shows only some operator types; other operator types may exist in actual applications, which is not limited by the present application.
As can be seen from Figure 6, operators such as Conv, Relu and AscendDequant are used frequently, so more computing units 111 can be configured for them in the computing unit pool 11, while operators such as Softmax, Exp, Flatten, Equal, Expand, MatMul, And, ReduceMean and ReduceMax are used less frequently, so fewer computing units 111 can be configured for them in the computing unit pool 11.
It can be understood that, in a specific implementation, the actual number of computing units 111 of each type may be slightly greater than the number corresponding to the number of times that type of computing unit 111 is called, so as to ensure that the processor 01 has sufficient performance to cope with bursts.
In this design, by configuring the number of computing units 111 of each type in the computing unit pool 11 to be positively correlated with the number of times that type of computing unit 111 is called, the static power consumption loss of the processor 01 can be reduced while the running performance of the processor 01 is guaranteed.
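The sizing rule described above can be illustrated with a short sketch (the proportional formula, headroom factor and example call counts are assumptions for illustration, not the patent's own formula): each unit type receives a share of the pool proportional to how often it is called, with a small headroom margin and at least one unit per type.

```python
def size_pool(call_counts, total_units, headroom=1.1):
    """Illustrative sizing rule: allocate unit counts proportionally to
    call frequency, multiplied by a headroom factor to absorb bursts,
    with a minimum of one unit per type."""
    total_calls = sum(call_counts.values())
    return {
        unit_type: max(1, round(calls / total_calls * total_units * headroom))
        for unit_type, calls in call_counts.items()
    }

# Frequently used operators (e.g. Conv, Relu) end up with many more units
# than rarely used ones (e.g. Exp, MatMul), mirroring the Figure 6 trend.
sizes = size_pool({"conv": 900, "relu": 700, "exp": 30, "matmul": 20},
                  total_units=100)
```

The headroom factor plays the role of the "slightly greater than" margin mentioned above; its exact value would be a chip-specification decision.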
It should be noted that the above distinguishes the types of computing units 111 by low-level operation instruction (that is, by the type of algorithm, or the type of mathematical operation, executed by the computing unit 111). In actual applications, the computing units 111 can also be typed along other dimensions.
In one possible example, divided along the algorithm dimension, the plurality of computing units 111 may include one or more of the following types: tensor computing units (also called matrix computing units) 111A, vector computing units 111B, scalar computing units 111C, etc. The tensor computing unit 111A is used to perform matrix computation, the vector computing unit 111B is used to perform vector computation, and the scalar computing unit 111C is used to perform scalar computation.
In another possible example, divided by the data type that the algorithm operates on, the plurality of computing units 111 may include one or more of the following types: 8-bit integer (int8) computing units, 16-bit floating point (fp16) computing units, 32-bit floating point (fp32) computing units, 4-bit integer (int4) computing units, etc.
It can be understood that the above ways of dividing computing unit types are only examples rather than limitations; other ways of dividing exist in practice.
In actual applications, the above type divisions can be combined with one another. For example, all computing units 111 may first be divided into tensor computing units 111A, vector computing units 111B and scalar computing units 111C, and the tensor computing units 111A may then be further subdivided by mathematical operation type and/or data type.
In one possible design, upon receiving a request from an upper-layer application, the processor 01 can schedule resources for that application from the computing unit pool 11.
For example, see Figure 7, a flowchart of a resource allocation method provided by an embodiment of the present application. The method may be executed by the processor 01 in Figure 3, or by a processing device other than the processor 01. The method includes:
S701. Receive a request to execute a first application.
S702. Determine the computing resource requirements of the first application, and allocate computing units to the first application from the computing unit pool according to those requirements to implement the first application. The computing resource requirements include one or more of tensor computing units, vector computing units, and scalar computing units, as well as the quantity of each type of computing unit.
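As a rough illustration, steps S701 and S702 can be sketched as a pool keyed by unit type. The class and method names below are illustrative assumptions, not part of the embodiment:

```python
class ComputingUnitPool:
    """Hypothetical pool of typed computing units (tensor/vector/scalar)."""

    def __init__(self, units):
        # units: dict mapping unit type -> list of free unit ids, e.g.
        # {"tensor": ["t0", "t1"], "vector": ["v0"], "scalar": ["s0"]}
        self.idle = {k: list(v) for k, v in units.items()}

    def allocate(self, requirements):
        """S702: grant units per {type: count}, or fail if the pool cannot."""
        granted = {}
        for unit_type, count in requirements.items():
            free = self.idle.get(unit_type, [])
            if len(free) < count:
                raise RuntimeError(f"not enough idle {unit_type} units")
            granted[unit_type] = [free.pop() for _ in range(count)]
        return granted


def execute_first_application(pool, requirements):
    """S701 + S702: receive the request, then allocate from the pool."""
    return pool.allocate(requirements)
```

For instance, a request asking for one tensor unit and two vector units would return a grant drawn from the idle lists, leaving the remaining units free for other applications.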
When the above method is executed by the processor 01, it may specifically be executed by the NPU in the processor 01 or by another processing unit specially configured for resource allocation; this application does not limit this.
Optionally, multiple processes corresponding to the first application may first be determined; then the computing resource requirements of each of these processes are determined, and computing units are allocated to each process from the computing unit pool according to that process's requirements, so as to implement the first application. The computing units allocated to each process differ from those allocated to the other processes; in other words, different processes among the multiple processes are allocated different computing units.
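The per-process variant can be sketched as follows. The dictionary shapes are assumptions made for illustration; popping from shared idle lists is what guarantees that no two processes receive the same unit:

```python
def allocate_per_process(idle, process_requirements):
    """Give each process of an application a disjoint set of units.

    idle: dict mapping unit type -> list of free unit ids (mutated in place).
    process_requirements: dict mapping process id -> {unit type: count}.
    Because each grant removes units from the shared idle lists, no unit
    can be handed to two processes.
    """
    allocations = {}
    for pid, reqs in process_requirements.items():
        grant = {}
        for unit_type, count in reqs.items():
            free = idle.get(unit_type, [])
            if len(free) < count:
                raise RuntimeError(f"not enough idle {unit_type} units for {pid}")
            grant[unit_type] = [free.pop() for _ in range(count)]
        allocations[pid] = grant
    return allocations
```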
As an example, see Figure 8A, a schematic diagram of one possible resource allocation. At the chip design stage, one copy each of resources such as add, sub, mul, and exp is placed in the processor 01, while multiple copies of the matrix multiplication unit are placed. Suppose application 0 and application n are to be executed simultaneously, where application 0 requires matrix multiplication, add, and sub, and application n requires matrix multiplication, mul, and exp. When allocating resources, the resources allocated to application 0 may be as shown in resource group 0 (matrix multiplication, add, sub), and the resources allocated to application n may be as shown in resource group n (matrix multiplication, mul, exp). In this way, different types of computing resources can be allocated to different applications, reducing chip area, cost, and static power consumption without degrading the chip's processing performance.
As an example, see Figure 8B, a schematic diagram of another possible resource allocation. At the chip design stage, two copies each of resources such as add, sub, mul, and exp are placed in the processor 01, while multiple copies of the matrix multiplication unit are placed. Suppose application 0 and application n are to be executed simultaneously, where application 0 has a high demand for add and application n does not. When allocating resources, the resources allocated to application 0 may be as shown in resource group 0 (matrix multiplication, add, sub, mul, exp, with two copies of add), and the resources allocated to application n may be as shown in resource group n (matrix multiplication, sub, mul, exp). In this way, different quantities of computing resources can be allocated to different applications, reducing chip area, cost, and static power consumption without degrading the chip's processing performance.
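The Figure 8B scenario can be made concrete with unit counts. The numbers below are illustrative assumptions consistent with the description (two copies of each elementary unit, several matrix-multiply units), not taken from the figure:

```python
# Hypothetical pool sized at chip-design time (Figure 8B): two copies of
# each elementary unit, several matrix-multiply units.
idle = {"matmul": 4, "add": 2, "sub": 2, "mul": 2, "exp": 2}

def build_resource_group(idle, demand):
    """Carve a resource group out of the idle counts; demand is {unit: count}."""
    group = {}
    for unit, count in demand.items():
        if idle.get(unit, 0) < count:
            raise RuntimeError(f"not enough idle {unit} units")
        idle[unit] -= count
        group[unit] = count
    return group

# Application 0 has a high demand for add, so its group takes both copies;
# application n's group takes no add unit at all.
group0 = build_resource_group(idle, {"matmul": 1, "add": 2, "sub": 1, "mul": 1, "exp": 1})
groupn = build_resource_group(idle, {"matmul": 1, "sub": 1, "mul": 1, "exp": 1})
```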
Through this design, resources can be dynamically scheduled for upper-layer applications according to their requests, improving the flexibility of resource allocation.
In one possible design, after computing units are allocated to the first application from the computing unit pool according to its computing resource requirements, it may further be determined whether the allocated computing units meet the needs of the first application. When they do not, the computing resource allocation result of the first application is adjusted, and computing units are allocated to the first application from the computing unit pool 11 according to the adjusted allocation result.
For example, if the computing unit 111a allocated to the first application cannot, during computation, meet an indicator required by the first application (such as latency), the allocation result is adjusted and more computing units, such as computing units 111b and 111c, are allocated to the first application to meet the required indicator.
In this way, it can be ensured that the resources allocated to the first application meet its needs, improving the reliability of resource allocation.
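This adjust-and-retry design can be read as a feedback loop: allocate, check the indicator, and grow the allocation until it is met. A hedged sketch, where the latency measurement callable is a stand-in for real profiling of the workload:

```python
def allocate_until_satisfied(idle_units, needed, measure_latency, target_latency, step=1):
    """Grow an allocation until the measured latency meets the target.

    idle_units: list of free unit ids (mutated in place).
    needed: initial unit count from the first allocation result.
    measure_latency: callable taking the allocation and returning a latency
    figure; in a real system this would come from running the application.
    """
    if len(idle_units) < needed:
        raise RuntimeError("not enough idle units")
    allocation = [idle_units.pop() for _ in range(needed)]
    # Adjust the allocation result while the required indicator is not met.
    while measure_latency(allocation) > target_latency:
        if not idle_units:
            raise RuntimeError("pool exhausted before meeting the target")
        allocation.extend(idle_units.pop() for _ in range(min(step, len(idle_units))))
    return allocation
```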
In one possible design, a correspondence between the first application and the computing units allocated to it can also be established. In other words, after the computing units are allocated to the first application, they are pinned to it (that is, they change from idle computing units to non-idle computing units) and serve only the first application; while the first application is using them, other applications cannot call this portion of resources.
In this way, the problem of the first application being interrupted because its allocated resources are reassigned to other applications can be avoided, ensuring stable operation of the first application.
Optionally, after the first application has finished using its allocated computing units, this portion of computing units may be released (for example, by deleting the correspondence), so that they can be called by other applications. In this way, resource utilization can be further improved.
Optionally, after the first application has finished using its allocated computing units, this portion of computing units may instead not be released (for example, the correspondence is retained), so that they are not called by other applications but remain reserved for the first application. The next time the first application runs, it can use these computing units directly without allocating resources again. In this way, the performance of the first application is guaranteed with priority.
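The bind/release correspondence described above can be sketched as a small registry. The class name and structure are illustrative assumptions; retaining an entry in `bound` models the "do not release" option, and `release` models deleting the correspondence:

```python
class AllocationRegistry:
    """Tracks the correspondence between applications and their pinned units."""

    def __init__(self, idle):
        self.idle = list(idle)   # free unit ids
        self.bound = {}          # application -> list of pinned unit ids

    def bind(self, app, count):
        """Pin `count` idle units to `app`; they become unavailable to others."""
        if app in self.bound:
            return self.bound[app]       # reuse a retained allocation
        if len(self.idle) < count:
            raise RuntimeError("not enough idle units")
        self.bound[app] = [self.idle.pop() for _ in range(count)]
        return self.bound[app]

    def release(self, app):
        """Delete the correspondence so the units may be called by other apps."""
        self.idle.extend(self.bound.pop(app, []))
```

A second `bind` for the same application returns the retained units directly, matching the "no need to allocate resources again" behavior.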
In one possible design, the computing unit pool 11 includes one or more of a tensor computing unit pool, a vector computing unit pool, and a scalar computing unit pool.
For example, as shown in Figure 9, the computing unit pool 11 includes a tensor computing unit pool 11A, a vector computing unit pool 11B, and a scalar computing unit pool 11C. The tensor computing unit pool 11A includes one or more tensor computing units 111A, the vector computing unit pool 11B includes one or more vector computing units 111B, and the scalar computing unit pool 11C includes one or more scalar computing units 111C.
When allocating computing units to the first application, the corresponding computing unit pool may first be found according to the algorithm dimension involved, and the computing units are then selected from that pool.
Pooling tensor, vector, and scalar computing units separately can improve resource management efficiency as well as the efficiency with which upper-layer applications schedule resources.
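The lookup-then-allocate order can be sketched as follows, with one pool per algorithm dimension as in Figure 9 (the unit ids are illustrative):

```python
# Hypothetical per-dimension pools (Figure 9): algorithm dimension -> free units.
pools = {
    "tensor": ["111A-0", "111A-1"],
    "vector": ["111B-0"],
    "scalar": ["111C-0", "111C-1"],
}

def allocate_by_dimension(pools, dimension, count=1):
    """Find the pool for the requested algorithm dimension, then take units."""
    pool = pools.get(dimension)
    if pool is None or len(pool) < count:
        raise RuntimeError(f"cannot satisfy {count} {dimension} unit(s)")
    return [pool.pop() for _ in range(count)]
```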
It can be understood that the above designs can be implemented individually or in combination with one another.
Based on the same technical concept, an embodiment of the present application further provides a processing device 100, which includes modules/units/technical means for executing the method shown in Figure 7.
For example, referring to Figure 10, the processing device 100 may include:
a transceiver module 1001, configured to receive a request to execute a first application; and
a processing module 1002, configured to determine the computing resource requirements of the first application and allocate computing units to the first application from the computing unit pool of the processor according to those requirements to implement the first application, the computing resource requirements including one or more of tensor computing units, vector computing units, and scalar computing units, as well as the quantity of each type of computing unit. The computing unit pool includes multiple computing units; each idle computing unit among them can be called on demand, and the number of computing units of a given type is positively correlated with the number of times that type of computing unit is called.
It should be understood that, for all relevant details of the steps involved in the above method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, which are not repeated here.
Based on the same technical concept, referring to Figure 11, an embodiment of the present application further provides a computing platform 1100, including the processor 01 and the processing device 100 described above.
Based on the same technical concept, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions that, when called by a computer, cause the method shown in Figure 7 to be performed.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its scope of protection. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to include them as well.

Claims (17)

  1. A processor, characterized in that the processor comprises a computing unit pool, the computing unit pool comprising a plurality of computing units;
    each idle computing unit among the plurality of computing units can be called on demand, and the number of computing units of a given type among the plurality of computing units is positively correlated with the number of times that type of computing unit is called.
  2. The processor according to claim 1, characterized in that each computing unit among the plurality of computing units corresponds to one low-level operation instruction.
  3. The processor according to claim 1 or 2, characterized in that the computing unit pool comprises one or more of a tensor computing unit pool, a vector computing unit pool, and a scalar computing unit pool.
  4. The processor according to any one of claims 1-3, characterized in that the processor is configured to:
    receive a request to execute a first application; and
    determine the computing resource requirements of the first application, and allocate computing units to the first application from the computing unit pool according to the computing resource requirements to implement the first application, the computing resource requirements comprising one or more of tensor computing units, vector computing units, and scalar computing units, as well as the quantity of each type of computing unit.
  5. The processor according to claim 4, characterized in that the processor is further configured to:
    determine a plurality of processes corresponding to the first application; and
    determine the computing resource requirements of each process among the plurality of processes, and allocate computing units to each process from the computing unit pool according to that process's computing resource requirements to implement the first application, wherein the computing units allocated to each process differ from the computing units allocated to the other processes.
  6. The processor according to claim 4 or 5, characterized in that the processor is further configured to:
    determine whether the computing units allocated to the first application meet the needs of the first application; and
    when the computing units allocated to the first application cannot meet the needs of the first application, adjust the computing resource allocation result of the first application, and allocate computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
  7. The processor according to claim 4 or 5, characterized in that the processor is further configured to:
    establish a correspondence between the first application and the computing units allocated to the first application.
  8. The processor according to any one of claims 1-7, characterized in that the type corresponds to an algorithm executed by the plurality of computing units.
  9. A resource allocation method, characterized in that it is applied to a processor, the processor comprising a computing unit pool, the computing unit pool comprising a plurality of computing units, wherein each idle computing unit among the plurality of computing units can be called on demand, and the number of computing units of a given type among the plurality of computing units is positively correlated with the number of times that type of computing unit is called;
    the method comprising:
    receiving a request to execute a first application; and
    determining the computing resource requirements of the first application, and allocating computing units to the first application from the computing unit pool according to the computing resource requirements to implement the first application, the computing resource requirements comprising one or more of tensor computing units, vector computing units, and scalar computing units, as well as the quantity of each type of computing unit.
  10. The method according to claim 9, characterized in that determining the computing resource requirements of the first application and allocating computing units to the first application from the computing unit pool according to the computing resource requirements to implement the first application comprises:
    determining a plurality of processes corresponding to the first application; and
    determining the computing resource requirements of each process among the plurality of processes, and allocating computing units to each process from the computing unit pool according to that process's computing resource requirements to implement the first application, wherein the computing units allocated to each process differ from the computing units allocated to the other processes.
  11. The method according to claim 9 or 10, characterized in that the method further comprises:
    determining whether the computing units allocated to the first application meet the needs of the first application; and
    when the computing units allocated to the first application cannot meet the needs of the first application, adjusting the computing resource allocation result of the first application, and allocating computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
  12. The method according to claim 9 or 10, characterized in that the method further comprises:
    establishing a correspondence between the first application and the computing units allocated to the first application.
  13. A processing device, characterized in that it comprises:
    a transceiver module, configured to receive a request to execute a first application; and
    a processing module, configured to determine the computing resource requirements of the first application, and allocate computing units to the first application from a computing unit pool of a processor according to the computing resource requirements to implement the first application, the computing resource requirements comprising one or more of tensor computing units, vector computing units, and scalar computing units, as well as the quantity of each type of computing unit;
    wherein the computing unit pool comprises a plurality of computing units, each idle computing unit among the plurality of computing units can be called on demand, and the number of computing units of a given type among the plurality of computing units is positively correlated with the number of times that type of computing unit is called.
  14. The device according to claim 13, characterized in that the processing module is further configured to:
    determine a plurality of processes corresponding to the first application; and
    determine the computing resource requirements of each process among the plurality of processes, and allocate computing units to each process from the computing unit pool according to that process's computing resource requirements to implement the first application, wherein the computing units allocated to each process differ from the computing units allocated to the other processes.
  15. The device according to claim 13 or 14, characterized in that the processing module is further configured to:
    determine whether the computing units allocated to the first application meet the needs of the first application; and
    when the computing units allocated to the first application cannot meet the needs of the first application, adjust the computing resource allocation result of the first application, and allocate computing units to the first application from the computing unit pool according to the adjusted computing resource allocation result.
  16. The device according to claim 13 or 14, characterized in that the processing module is further configured to:
    establish a correspondence between the first application and the computing units allocated to the first application.
  17. A computing platform, characterized by comprising the processor according to any one of claims 1-3 and the device according to any one of claims 13-16.
Publications (1)

Publication Number Publication Date
WO2024055168A1 (en) 2024-03-21

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111176852A (en) * 2020-01-15 2020-05-19 上海依图网络科技有限公司 Resource allocation method, device, chip and computer readable storage medium
US20200342292A1 (en) * 2019-04-24 2020-10-29 Baidu Usa Llc Hardware-software co-design for accelerating deep learning inference
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN114661482A (en) * 2022-05-25 2022-06-24 成都索贝数码科技股份有限公司 GPU computing power management method, medium, equipment and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200342292A1 (en) * 2019-04-24 2020-10-29 Baidu Usa Llc Hardware-software co-design for accelerating deep learning inference
CN111176852A (en) * 2020-01-15 2020-05-19 上海依图网络科技有限公司 Resource allocation method, device, chip and computer readable storage medium
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN114661482A (en) * 2022-05-25 2022-06-24 成都索贝数码科技股份有限公司 GPU computing power management method, medium, equipment and system
