WO2024131170A1 - Operator processing method and apparatus, and chip, computing device and storage medium - Google Patents

Info

Publication number: WO2024131170A1
Application number: PCT/CN2023/119946
Authority: WIPO (PCT)
Prior art keywords: operator, accelerator, input data, operators, segmentation
Other languages: French (fr), Chinese (zh)
Inventors: 仇悦, 徐晓忻, 周建伟, 周卿
Original assignee / applicant: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024131170A1

Description

  • the present application relates to the field of artificial intelligence technology, and in particular to an operator processing method, apparatus, chip, computing device and storage medium.
  • With the rapid development of artificial intelligence (AI) technology, a variety of AI accelerators have emerged to provide computing power for matrix and vector operations, accelerating the computation of AI models.
  • AI accelerators are equipped with dedicated operator programming interfaces for users to write operators to run on AI accelerators.
  • Operators include fixed-shape operators and dynamic-shape operators, where fixed-shape operators refer to operators whose input data size is fixed, and dynamic-shape operators refer to operators whose input data size is not fixed.
  • When the AI accelerator obtains the input data of a dynamic-shape operator, it selects the operator implementation file corresponding to the shape range of the input data from multiple preset operator implementation files, and calls that file to implement the operator.
  • The above method requires users to generate different operator implementation files for different shape ranges in advance, which has poor usability. Moreover, if the shape of the input data does not fall within a preset shape range, just-in-time (JIT) compilation is triggered at runtime, introducing compilation time overhead and greatly affecting the operating performance of the AI accelerator.
  • the embodiment of the present application provides an operator processing method, device, chip, computing device and storage medium, which can effectively improve the operating performance of the AI accelerator.
  • the technical solution is as follows:
  • In a first aspect, a method for processing an operator is provided, executed by an artificial intelligence (AI) accelerator. The method includes: obtaining input data of multiple base operators corresponding to the operator, the input data of the base operators being obtained by segmenting the input data of the operator based on a preset shape set; and calling the multiple base operators to process their respective input data to obtain output data of the multiple base operators.
  • The output data of the multiple base operators constitutes the output data of the operator; in other words, the output data of the operator is obtained by splicing together the output data of the multiple base operators.
  • the above method uses a preset shape set to dynamically segment the input data of operators of any shape, thereby calling multiple base operators corresponding to the operator to collaboratively realize the execution of the dynamic shape operator, effectively improving the operating performance of the AI accelerator.
  • the step of obtaining input data of multiple base operators corresponding to the operator includes:
  • generating a segmentation scheme for the operator based on the preset shape set and the shape of the input data of the operator, the segmentation scheme including a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;
  • segmenting the input data of the operator according to the segmentation scheme to obtain the input data of the multiple base operators, as sketched below.
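  • To make the segmentation step concrete, the following is a minimal Python sketch, assuming a greedy decomposition strategy (the function, the example shapes, and the strategy itself are illustrative assumptions, not the scheme-generation method claimed by the present application):

```python
# Preset shape set taken from the embodiments below (two arithmetic sequences).
SHAPE_SET = [16, 32, 48, 64, 80, 96, 112, 128,
             256, 384, 512, 640, 768, 896, 1024, 1152, 1280]

def segment_dim(dim, shape_set=SHAPE_SET):
    """Greedily split one dimension value into values from the set.
    Assumes dim is a multiple of the smallest value (16); otherwise a
    real system would pad or fall back to another strategy."""
    parts, remaining = [], dim
    for v in sorted(shape_set, reverse=True):
        while remaining >= v:
            parts.append(v)
            remaining -= v
    if remaining:
        raise ValueError(f"{dim} cannot be spliced from the shape set")
    return parts

# Each (m, n) combination maps to one pre-compiled fixed-shape base operator.
m_parts = segment_dim(320)   # -> [256, 64]
n_parts = segment_dim(1296)  # -> [1280, 16]
print([(m, n) for m in m_parts for n in n_parts])
# [(256, 1280), (256, 16), (64, 1280), (64, 16)]
```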
  • the segmentation scheme further includes a computing power resource allocation method for calling the multiple base operators, wherein the computing power resource allocation method is associated with at least one of the following:
  • a computing power resource allocation method based on the number of logic blocks in the AI accelerator.
  • Since the computing power resource allocation method for calling the multiple base operators is taken into account when generating the segmentation scheme, the computing power consumption caused by calling the multiple base operators can be effectively balanced, thereby making full use of the computing power resources of the AI accelerator.
  • generating a segmentation scheme of the operator based on the preset shape set and the shape of the input data of the operator includes:
  • the segmentation scheme that meets the target condition is determined from the multiple candidate segmentation schemes.
  • determining the segmentation scheme that meets the target condition from the multiple candidate segmentation schemes includes:
  • a candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes is determined as the segmentation scheme.
  • the candidate segmentation scheme further includes a computing power resource allocation method for calling the multiple candidate base operators, and determining the cost of each candidate segmentation scheme includes:
  • the cost of the candidate segmentation scheme is determined based on the predicted time consumption of multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of computing power resource allocation based on the computing power resource allocation method.
  • a method for determining a segmentation scheme with automatic load balancing is provided.
  • the segmentation scheme with the lowest cost is selected, maximizing the operating performance of the AI accelerator.
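  • As a minimal sketch of this minimum-cost selection (the callables are hypothetical placeholders; one plausible form of the cost itself is sketched in the development-stage section below):

```python
def select_scheme(candidates, estimate_cost):
    """Return the candidate segmentation scheme whose predicted time
    consumption (its 'cost') is smallest."""
    return min(candidates, key=estimate_cost)
```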
  • the method further comprises: receiving, from the host, the segmentation scheme of the operator, the segmentation scheme including a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method.
  • the method further includes: sending output data of the plurality of base operators to the host.
  • In this way, the host generates the operator segmentation scheme, which offloads computation from the AI accelerator and saves the computing power resources of the AI accelerator.
  • the preset shape set includes at least two arithmetic sequences of numerical values, wherein the difference between the value at the tail of a target sequence and the value at the head of the sequence adjacent to the target sequence is greater than the tolerance (common difference) of the target sequence.
  • the number of numerical values corresponding to the shapes of the input data of the multiple base operators is associated with the number of sequences in the preset shape set.
  • the size of the numerical value in the preset shape set is associated with at least one of the following:
  • the data range corresponding to the instruction executed by the AI accelerator
  • the size of the cache space at each level in the AI accelerator.
  • the preset shape set set in the above manner can balance operator performance and the number of base operators, thereby effectively improving the operating performance of the AI accelerator.
  • a uniform granularity is used to allocate cache space at various levels in the AI accelerator when reading and writing data.
  • a ping-pong cache mechanism is used to achieve seamless splicing of scheduling between multiple base operators.
  • a device for processing an operator is provided.
  • the device is configured in an AI accelerator and includes at least one functional module for executing the method for processing an operator provided in the first aspect or any possible implementation of the first aspect.
  • a chip configured as an AI accelerator, the AI accelerator comprising a communication interface and at least one AI processing core, the communication interface being used to provide program instructions and/or data to the at least one processing core, the at least one AI processing core being used to implement a processing method for an operator provided in the first aspect or any possible implementation of the first aspect.
  • a computing device including a host and an AI accelerator, wherein the host is used to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator is used to execute an operator processing method provided in the first aspect or any possible implementation of the first aspect.
  • a computing device cluster comprising multiple computing devices, wherein the computing devices include a host and an AI accelerator, wherein the host is used to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator is used to execute an operator processing method provided in the aforementioned first aspect or any possible implementation of the first aspect.
  • a computer-readable storage medium is provided, wherein the computer-readable storage medium is used to store at least one program code, and the at least one program code is used to execute the processing method of the operator provided in the first aspect or any possible implementation of the first aspect.
  • the storage medium includes, but is not limited to, volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • a computer program product which, when running on an AI accelerator, enables the AI accelerator to execute the operator processing method provided in the first aspect or any possible implementation of the first aspect.
  • the computer program product may be a software installation package, and when the functions of the aforementioned AI accelerator need to be implemented, the computer program product may be downloaded and executed on the AI accelerator.
  • FIG1 is a schematic diagram of an implementation environment provided by the present application.
  • FIG2 is a schematic diagram of the architecture of an AI accelerator 200 provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the hardware structure of a chip provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of the structure of an AI processing core provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of a segmentation of a matrix multiplication operator.
  • FIG6 is a schematic diagram of another segmentation of a matrix multiplication operator.
  • FIG7 is a schematic diagram of a matrix multiplication operator segmentation method provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of a base operator scheduling mechanism provided in an embodiment of the present application.
  • FIG9 is a schematic diagram of a cache space allocation method provided in an embodiment of the present application.
  • FIG10 is a schematic diagram of a segmentation method of a base operator provided in an embodiment of the present application.
  • FIG11 is a schematic diagram of a segmentation scheme provided in an embodiment of the present application.
  • FIG12 is a schematic diagram of the proportion of the most time-consuming stage provided in an embodiment of the present application.
  • FIG13 is a schematic diagram of a flow chart of an operator development phase provided in an embodiment of the present application.
  • FIG14 is a flowchart of an operator processing method provided in an embodiment of the present application.
  • FIG15 is a flowchart of another operator processing method provided in an embodiment of the present application.
  • FIG16 is a schematic diagram of the structure of an operator processing device provided in an embodiment of the present application.
  • AI model is a type of mathematical algorithm model that uses machine learning ideas to solve practical problems.
  • the AI model includes a large number of parameters and calculation formulas (or calculation rules).
  • AI accelerators are a type of specialized hardware accelerator or computer system designed to accelerate AI applications, especially neural networks, machine vision, and machine learning. For example, they are used to provide computing power for calculating matrices and vectors to accelerate the calculation of AI models.
  • AI accelerators are, for example, graphics processing units (GPUs), intelligent processing units (IPUs), tensor processing units (TPUs), domain specific architecture (DSA) chips, and so on.
  • Deep learning is a branch of machine learning (ML). Deep learning is the study of the inherent laws and representation levels of sample data. The information obtained in the learning process is very helpful for interpreting data such as text, images, and sounds. Schematically, deep learning is a complex machine learning algorithm, and the typical AI model that uses deep learning ideas is the neural network model.
  • An operator refers to a computing unit or computing function running on a computing device.
  • neural network layers and even the entire model are composed of operators, which correspond to the computing logic in the neural network layer.
  • a convolution layer is an operator
  • the weighted summation process in a fully-connected layer (FC layer) is an operator.
  • operators include fixed-shape operators and dynamic-shape operators. The following introduces these two types of operators respectively: (1) Fixed-shape operators refer to operators whose input data size is fixed, so the operator's internal input data segmentation and execution scheduling can be fixed. (2) Dynamic-shape operators refer to operators whose input data size is not fixed; in other words, the determination of the input data size is deferred until the operator obtains the actual input data at runtime.
  • Tensor is the data in the operator, including input data and output data.
  • Shape refers to the shape of a tensor, or the dimension value of each dimension of a tensor, usually expressed in the form of (D0, D1, ..., Dn-1), where n is a positive integer.
  • Ping-pong buffering is a data buffering mechanism that uses two data buffers simultaneously to achieve continuous data transmission, thereby increasing the data transmission rate. It should be understood that since the data in a single buffer is easily overwritten during transmission and processing, ping-pong buffering always keeps the data in one buffer in use while the other buffer receives new data. In other words, ping-pong buffering means that two identical objects are read from and written to alternately as buffers.
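  • The alternation pattern can be sketched as follows (buffer size and stage callables are hypothetical; on real hardware the load and compute steps below overlap in time, which sequential Python can only suggest):

```python
# Two equal buffers: while one is being filled ("ping"), the other is
# being consumed ("pong"), so data transfer and compute can overlap.
BUFS = [bytearray(32 * 1024), bytearray(32 * 1024)]

def run_pipeline(tiles, load, compute):
    for i, tile in enumerate(tiles):
        load(tile, BUFS[i % 2])         # fill one buffer
        if i > 0:
            compute(BUFS[(i - 1) % 2])  # consume the previously filled one
    if tiles:
        compute(BUFS[(len(tiles) - 1) % 2])  # drain the final tile
```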
  • the technical solution provided in this application is applied to the scenario of running dynamic shape operators based on AI accelerators.
  • For example, consider matrix C as two 10×15-dimensional matrices spliced together; local data of matrix A and matrix B can then be taken to calculate the two 10×15-dimensional matrices, and splicing these two matrices together yields the output data of the operator, that is, matrix C, as illustrated below.
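  • The following numpy sketch illustrates this splicing for hypothetical shapes (assuming matrix C is 10×30, spliced from two 10×15 blocks along the columns, with an arbitrary inner dimension K = 8):

```python
import numpy as np

A = np.random.rand(10, 8)   # left matrix
B = np.random.rand(8, 30)   # right matrix

# Each 10x15 block of C needs all of A but only half of B's columns.
C1 = A @ B[:, :15]
C2 = A @ B[:, 15:]
C = np.hstack([C1, C2])     # splice the two partial results

assert np.allclose(C, A @ B)  # identical to computing C directly
```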
  • Since the input data of each base operator is local data of the input data of the operator, the input data of the operator needs to be divided to obtain the input data of each base operator, finally realizing the function of the operator.
  • Matrix C can also be spliced in many other ways, and accordingly the operator's segmentation and scheduling strategy includes many other options. For this dynamic-shape operator, since the shape of its input data is not fixed, the optimization space for the segmentation and scheduling strategy is very large. Because the AI accelerator only obtains the shape of the input data when running the operator, it is difficult for the AI accelerator to implement the operator function with the best, or even a better, segmentation and scheduling strategy, so performance and hardware utilization are usually difficult to guarantee.
  • the present application provides a technical solution for implementing a dynamic shape operator based on a preset shape set (also called a shape set), which can effectively improve the operating performance of the AI accelerator.
  • the input data is segmented according to the preset shape set, so that the multiple preset base operators corresponding to the operator process the segmented input data respectively, finally obtaining the output data of the operator (the specific principle and implementation process are introduced in subsequent embodiments and are not repeated here).
  • FIG1 is a schematic diagram of an implementation environment provided by the present application. As shown in FIG1 , the implementation environment includes a host 100 and an AI accelerator 200 , and the host 100 and the AI accelerator 200 are communicatively connected.
  • the host 100 refers to a device used to run an AI model and provide AI services to users.
  • the host 100 can implement development functions and execution functions for operators in the AI model, wherein the development function refers to the host 100 providing users with various development functions for operators, such as writing operators, setting preset shape sets, generating operator codes, etc., and the present application is not limited thereto.
  • the execution function means that the host 100 can control the AI accelerator 200 to run the operator developed by the user and realize the function of the corresponding operator. This process can also be understood as loading the AI task into the AI accelerator 200 for operation.
  • the number of hosts 100 can be one or more, and the present application does not limit this.
  • the AI accelerator 200 is used to provide computing power for the running AI model to accelerate the computing process of the AI model, that is, to execute the processing method of the operator provided in this application.
  • the AI accelerator 200 is a DSA chip, a GPU, etc., but this application is not limited thereto.
  • the AI accelerator 200 calls the corresponding operator based on the input data of the operator sent by the host 100, processes the input data, obtains the output data, and returns the output data to the host 100 to implement the function of the corresponding operator.
  • the functions of the AI accelerator 200 are introduced in detail in the contents shown in FIG2 and are not described again here.
  • the number of the AI accelerator 200 may be one or more, and this application does not limit this.
  • the host 100 and the AI accelerator 200 are integrated in a computing device, and the host 100 and the AI accelerator 200 are connected to each other through a peripheral component interconnect express (PCIe) link.
  • the host 100 exchanges data with the AI accelerator 200 through the PCIe link and controls the AI accelerator 200 to run the corresponding operator.
  • the computing device can be an independent physical server, or a server cluster or distributed file system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
  • the computing device can also be called a cloud platform (short for cloud computing platform), which refers to services based on hardware and software resources, providing computing, network, and storage capabilities.
  • the cloud platform can achieve the rapid release and publication of configurable computing resources with a small management cost or low interaction complexity between users and service providers.
  • The above takes the case where the host 100 and the AI accelerator 200 realize the execution function of the operator through interaction as an example.
  • the AI accelerator 200 has the function of running the AI model, that is, the AI accelerator 200 can directly run the AI model specified by the user to realize the execution function of the operator. This application does not limit this.
  • the networks involved above include, but are not limited to, data center networks, storage area networks (SAN), local area networks (LAN), metropolitan area networks (MAN), wide area networks (WAN), mobile, wired or wireless networks, dedicated networks or any combination of virtual private networks.
  • technologies and/or formats including hypertext markup language (HTML), extensible markup language (XML) are used to represent data exchanged through the network.
  • conventional encryption technologies such as secure sockets layer (SSL), transport layer security (TLS), virtual private network (VPN), and Internet protocol security (IPsec) can also be used to encrypt all or part of the links.
  • customized and/or dedicated data communication technologies can also be used to replace or supplement the above data communication technologies.
  • FIG2 is a schematic diagram of the architecture of an AI accelerator 200 provided in an embodiment of the present application. It should be understood that FIG2 is only an exemplary structural diagram of the AI accelerator 200, and the present application does not limit the division of the functions of the AI accelerator 200. Schematically, as shown in FIG2, the functions of the AI accelerator 200 include but are not limited to: data acquisition function 201 and operator call function 202. In some embodiments, the functions of the AI accelerator 200 also include a segmentation function 203 and a storage function 204, etc., and the present application is not limited thereto.
  • the data acquisition function 201 is used to obtain the input data of multiple base operators corresponding to the operator, wherein the operator refers to a dynamic shape operator, and multiple base operators are used to collaboratively realize the function of the operator, that is, the functions of these multiple base operators combined are equivalent to the function of the operator, and it can also be understood that these multiple base operators are a series of basic operators obtained by segmenting the operator.
  • the input data of the base operator is obtained by segmenting the input data of the operator based on a preset shape set, and the preset shape set is pre-set, including multiple numerical values for segmenting data, and the numerical values are used to indicate the dimensional value of the data.
  • the base operator refers to a pre-developed and compiled operator used to process fixed-shape input data, where the fixed shape corresponds to the numerical value in the preset shape set.
  • the operator calling function 202 is used to call multiple base operators, process their respective input data, and obtain the output data of multiple base operators. This process is to use a series of base operators of pre-developed and compiled operators to process the input data after the operator is segmented, so as to divide the computing task of a complete operator into multiple sub-computing tasks, and hand over these sub-computing tasks to the base operators corresponding to the operators for collaborative implementation, thereby improving the operating performance of the AI accelerator. It should be understood that after obtaining the output data of multiple base operators, the output data of multiple base operators are spliced together based on the way of segmenting the input data of the operator, that is, the output data of the operator is obtained, thereby realizing the function of the operator.
  • the segmentation function 203 is used to generate a segmentation scheme for the operator based on the preset shape set and the shape of the input data of the operator, and segment the input data of the operator according to the segmentation scheme to obtain input data of multiple base operators.
  • the segmentation scheme includes a segmentation method for the input data of the operator and multiple base operators corresponding to the segmentation method.
  • the storage function 204 is used to store pre-developed and compiled base operators, preset shape sets, AI models, etc., but the present application is not limited thereto.
  • the functions of the AI accelerator 200 are not limited to the above 201 to 204; in actual applications, more functions can be configured according to user needs. Through the above functions, the AI accelerator 200 can realize the execution function of the operators in the AI model and improve the operating performance of the AI accelerator 200.
  • FIG. 3 is a schematic diagram of the hardware structure of a chip provided in an embodiment of the present application.
  • the chip 300 includes a communication interface 301, at least one AI processing core 302, a processor 303, a memory 304, and a bus 305.
  • the communication interface 301, at least one AI processing core 302, the processor 303, and the memory 304 are connected to each other through the bus 305.
  • the communication interface 301 is used to provide program instructions and/or data to at least one AI processing core 302.
  • the communication interface 301 includes a PCIe communication interface, other general peripheral interfaces, etc., which are not limited in this application.
  • when the chip 300 is used as an accelerator card of the host 100, data is exchanged with the host 100 through the PCIe communication interface.
  • the chip 300 realizes communication between the chip 300 and other devices or communication networks through the peripheral interface.
  • the AI processing core 302 is used to implement the functions of the AI accelerator shown in FIG2 above, that is, to execute the processing method of the operator provided in the embodiment of the present application.
  • the AI processing core adopts the Da Vinci architecture, which realizes high throughput, high computing power and low power consumption, and is suitable for processing common calculations required for neural networks in deep learning, such as matrix multiplication, etc.
  • the specific architecture is introduced in FIG4 below and will not be repeated here.
  • the processor 303 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or an integrated circuit for controlling the execution of the program of the present application.
  • the processor 303 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • the number of processors 303 may be one or more. Taking a multi-core processor as an example, the multiple cores may be divided into a control CPU dedicated to controlling the overall operation of the chip 300 and an AI CPU dedicated to non-matrix complex calculations according to their functions.
  • the number of CPU cores occupied by the two types of tasks may be dynamically allocated by the software according to the actual operation of the system, and the present application does not limit this.
  • the memory 304 can be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited to this.
  • the bus 305 may include a path for transmitting information between various components of the chip 300 (eg, the communication interface 301 , the at least one AI processing core 302 , the processor 303 , and the memory 304 ).
  • FIG. 3 is only a hardware structure diagram of a chip that can be configured as the above-mentioned AI accelerator 200 provided by the present application.
  • the chip may also include other components to achieve more functions.
  • the chip also includes a task scheduler (TS) for achieving efficient allocation and scheduling of computing tasks on the AI processing core, etc.
  • Fig. 4 is a schematic diagram of the structure of an AI processing core provided by an embodiment of the present application. As shown in Fig. 4, taking the AI processing core 302 using the Da Vinci architecture as an example, the AI processing core 302 includes a computing unit, a storage unit, and a control unit.
  • the computing units include: cube unit, vector unit and scalar unit. These three computing units perform their respective functions, forming three independent execution pipelines. Under the unified scheduling of the system software, they cooperate with each other to achieve optimized computing efficiency and complete different types of data calculations in the AI processing core.
  • the storage units include: L1 buffer, L0 buffer, unified buffer, general-purpose register (GPR), special-purpose register (SPR) and scalar buffer.
  • the AI processing core needs to load the data in the external storage into the internal storage to complete the corresponding calculation.
  • the AI processing core also includes a bus interface unit (BIU), a memory transfer engine (MTE1), MTE2, and MTE3.
  • BIU is the interface for the AI processing core to interact with the bus
  • MTE is a data handling unit used to complete data handling between different buffers.
  • the control unit includes: a system control module (system control), an instruction dispatch module (instr. dispatch), a matrix operation queue (cube queue), a vector operation queue (vector queue), and a storage conversion queue (MTE queue).
  • system control module is responsible for commanding and coordinating the overall operation mode of the AI processing core, configuring parameters, and implementing power consumption control.
  • When instructions are issued in sequence through the instruction dispatch module, they are sent to the matrix operation queue, the vector operation queue, or the storage conversion queue according to their type.
  • the storage unit provides transposed and required data to each computing unit.
  • the computing unit returns the result of the operation to the storage unit.
  • the control unit provides instruction control to the computing unit and the storage unit. The three coordinate and cooperate with each other to complete the computing task.
  • FIG. 4 is only a structural diagram provided by the present application that can realize the above-mentioned AI processing core function.
  • the AI processing core can also adopt other architectures, and the present application does not limit this.
  • the present application provides a technical solution for implementing dynamic-shape operators based on a preset shape set (also called a shape set), which can effectively improve the operating performance of the AI accelerator.
  • the input data is segmented according to the preset shape set, so that the multiple preset base operators corresponding to the operator process the segmented input data respectively, finally obtaining the output data of the operator.
  • For ease of understanding, reference is made to FIG. 5 to FIG. 7 below; taking a matrix multiplication operator as an example, the principle of the technical solution provided in this application is introduced from the perspective of the output data of the matrix multiplication operator.
  • Figure 5 is a schematic diagram of the segmentation of a matrix multiplication operator.
  • the shape of the left matrix A is (M, K), that is, an M×K-dimensional matrix;
  • the shape of the right matrix B is (K, N), that is, a K×N-dimensional matrix, where M, K, and N are all positive integers representing dimension values in the row or column direction of a matrix;
  • the matrix C = A × B;
  • the shape of the matrix C is (M, N), that is, an M×N-dimensional matrix.
  • the segmentation perspective is generally from the output data, that is, the matrix C.
  • the local matrix C′ can be calculated by multiplying the local left matrix A′ and the local right matrix B′. Therefore, the segmentation of the matrix multiplication operator usually focuses on the segmentation of the matrix C. That is, by segmenting in the M and N directions, the solution of the local matrix C′ can be obtained.
  • FIG6 is a schematic diagram of another segmentation of the matrix multiplication operator.
  • that is, by multiplying the local left matrix A″ and the local right matrix B″, a partial solution of the local matrix C′ can be obtained; all K-direction blocks of the local left matrix A′ and all K-direction blocks of the local right matrix B′ are multiplied pairwise and accumulated to obtain the solution of the local matrix C′.
  • FIG 7 is a segmentation schematic diagram of a matrix multiplication operator provided by an embodiment of the present application.
  • the present application presets a shape set and a series of base operators related to the numerical values in the preset shape set; therefore, for the matrix multiplication operator shown in Figures 5 and 6 above, a segmentation scheme of the matrix multiplication operator is generated based on the preset shape set, the shape of matrix A, and the shape of matrix B, and matrix A and matrix B are segmented according to the segmentation scheme to obtain the input data of multiple base operators.
  • a segmentation scheme is generated, and the input data of the matrix multiplication operator is segmented according to the segmentation scheme to obtain input data of multiple base operators.
  • the values involved in the segmentation process include, for example, 16, 32, and 64.
  • the above segmentation scheme corresponds to 12 base operators; that is, as shown in FIG7, matrix A is segmented to obtain matrices A1 to A6, and matrix B is segmented to obtain matrices B1 to B4; accordingly, the local solutions of matrix C include matrices C1 to C6.
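  • A hedged reading of the FIG7 example as a sketch (the exact per-dimension splits are not stated above; splitting M into 3 parts and K and N into 2 parts each is an assumption that reproduces the stated counts):

```python
m_parts, k_parts, n_parts = 3, 2, 2
a_tiles = m_parts * k_parts             # matrices A1..A6 -> 6
b_tiles = k_parts * n_parts             # matrices B1..B4 -> 4
c_tiles = m_parts * n_parts             # local solutions C1..C6 -> 6
base_ops = m_parts * k_parts * n_parts  # 12 base operator invocations
print(a_tiles, b_tiles, c_tiles, base_ops)  # 6 4 6 12
```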
  • the specific implementation method of the technical solution of the present application is introduced below.
  • the preset shape set and the base operator involved in the operator processing method need to be pre-set, and the host can provide various development functions for the operator. Therefore, for ease of understanding, the technical solution of the present application is introduced below based on the two stages of the operator development stage and the operator execution stage.
  • the host provides the user with various development functions for operators, such as setting a preset shape set, writing operators, generating operator codes, writing base operators, etc.
  • the host displays the operator development interface, and the user triggers the host to implement various operator development functions by performing various operator development operations on the operator development interface, but the present application is not limited thereto.
  • this stage involves the following steps.
  • Step 1: Set the preset shape set.
  • the preset shape set includes multiple values for segmenting data, which are used to indicate the dimension value of the data, wherein the size and number of the values in the preset shape set can be set according to the needs of the user. It should be noted that the preset shape set needs to be "complete”, that is, any shape can be spliced based on the values in the preset shape set, or in other words, input data of any shape can be segmented according to the values in the preset shape set, for example, any shape can be spliced using the shape with the smallest granularity.
  • the size of the values in the preset shape set needs to consider whether the hardware capabilities are fully utilized (such as the computing power of the computing unit in the AI processing core, the bandwidth and size of the storage unit, etc.), so as to ensure the operating performance of the operator.
  • as the number of values in the preset shape set increases, the number of base operators corresponding to the preset shape set often increases exponentially (for example, if the shape of the operator's input data corresponds to three values, one per dimension, the number of base operators is cubic in the number of values; for details, refer to the aforementioned Figure 7, not repeated here).
  • the present application provides a specific setting method of a preset shape set, which can balance the operator performance and the number of base operators, thereby effectively improving the operating performance of the AI accelerator, or improving the operating performance of the computing device where the AI accelerator is located.
  • the size of the value in the preset shape set is associated with at least one of the following:
  • the data type corresponding to the instruction executed by the AI accelerator is, for example, FP16, FP32, etc., but the present application is not limited thereto.
  • for example, under a minimum transfer granularity of 32B, the FP16 data type (2 bytes per number) corresponds to 16 numbers, so in this case the minimum value in the preset shape set can be set to 16.
  • the data range corresponding to the instruction executed by the AI accelerator can also be understood as the input parameter limit of the instruction, for example, the length cannot exceed 1024 characters, the number cannot exceed 100 numbers, and so on. This application does not limit this.
  • for example, an instruction used for data transfer has a value range of [1, 65535] in units of 32B; the instruction can therefore only transfer data with a minimum granularity of 32B, which for FP16 data corresponds to 16 numbers, so in this case the minimum value in the preset shape set can be set to 16.
  • the AI accelerator usually includes multiple levels of cache space, for example, UB (256KB), L1 buffer (1MB), L0A/L0B buffers (64KB each), L0C buffer (256KB), etc.
  • the size of the numerical value in the preset shape set needs to consider the size of the cache space at each level.
  • the M and K dimensions of the matrix on L0A determine the size of the cache space occupied in the L0A buffer
  • the K and N dimensions of the matrix on L0B determine the size of the cache space occupied in the L0B buffer (the left matrix data is stored in the L0A buffer, and the right matrix data is stored in the L0B buffer). Therefore, taking a left matrix of data type FP16 as an example, if the ping-pong cache mechanism is used to implement base operator scheduling, the matrix stored in L0A must satisfy M × K ≤ 16384 elements (corresponding to 32KB, that is, half of 64KB); that is, the upper limit of the cache space occupied in L0A is 32KB, as checked in the sketch below.
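  • The two constraints above reduce to simple arithmetic, sketched here for FP16 (2 bytes per number):

```python
BYTES_FP16 = 2

# 32B minimum transfer granularity -> smallest useful shape value.
min_value = 32 // BYTES_FP16                # = 16 numbers
assert min_value == 16

# L0A is 64KB; ping-pong caching halves it to 32KB per buffer, so an
# FP16 left-matrix tile must satisfy M * K <= 32KB / 2B = 16384 elements.
max_elems = (64 * 1024 // 2) // BYTES_FP16  # = 16384
assert max_elems == 16384

def fits_l0a(m, k):
    return m * k <= max_elems

print(fits_l0a(128, 128))  # True  (exactly 16384 elements)
print(fits_l0a(128, 256))  # False (32768 elements)
```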
  • the present application provides a method for dividing the dimensional space based on arithmetic sequences of numerical values; that is, the numerical values in the preset shape set include at least two arithmetic sequences, wherein the difference between the value at the tail of a target sequence and the value at the head of the sequence adjacent to the target sequence is greater than the tolerance (common difference) of the target sequence.
  • for example, in the preset shape set {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, {16, 32, 48, 64, 80, 96, 112, 128} is the target sequence, whose values increase with a tolerance of 16, and {256, 384, 512, 640, 768, 896, 1024, 1152, 1280} is the adjacent sequence of the target sequence, whose values increase with a tolerance of 128.
  • the above method can also be understood as a method of dividing the dimensional space based on multi-level spans, that is, the values in the preset shape set increase according to the multi-level spans.
  • for the preset shape set {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, the first-level span is 16 and the second-level span is 128.
  • the value starts from 16 and increases according to the current span of 16 to obtain 32, 48, 64, 80, 96, 112, and 128.
  • when 128 is reached, the second-level span takes effect, and the value continues to increase from 128 with the span of 128 to obtain 256, 384, and so on, as sketched below.
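  • A sketch that reproduces the example set from its multi-level spans (the helper and its signature are assumptions for illustration):

```python
def build_shape_set(start, levels):
    """levels: list of (span, last_value) pairs. After a level's last
    value is reached, the next level resumes from it with a larger span."""
    values, v = [], start
    for span, last in levels:
        if values:                  # resume from the previous level's tail
            v = values[-1] + span
        while v <= last:
            values.append(v)
            v += span
    return values

s1 = build_shape_set(16, [(16, 128), (128, 1280)])
assert len(s1) == 17 and s1[8] == 256 and s1[-1] == 1280

s2 = build_shape_set(16, [(16, 256), (256, 2048)])
assert len(s2) == 23 and s2[16] == 512  # second level: 512, 768, ..., 2048
```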
  • the number of values corresponding to the shapes of the input data of the multiple base operators is associated with the number of sequences in the preset shape set (or the number of levels of the multi-level spans; this is not limited here); that is, the number of sequences in the preset shape set can be used to constrain the number of values used when segmenting the input data.
  • for example, suppose the shape of the input data of a certain operator is (112); the value 112 in the preset shape set matches it directly.
  • without such a constraint, the corresponding number of base operators also increases exponentially, affecting the operating performance of the AI accelerator; moreover, the increase in the number of base operators leads to excessive base operator scheduling overhead. If a constraint on the number of values is applied, for example, the number of values corresponding to the shapes of the input data of the multiple base operators equals the number of sequences in the preset shape set plus 1 (or the number of levels of the multi-level span plus 1; this is only an example, can be set according to user needs, and does not constitute a limitation on the technical solution of the present application), then the number of values involved in the segmentation scheme can be effectively controlled.
  • the number of sequences in the preset shape set is 1, the number of values corresponding to the shapes of the input data of multiple base operators is at most 2.
  • such a segmentation scheme involves only one value, which effectively reduces the number of base operators; in this way, computing resources are saved and the operating performance of the AI accelerator is improved.
  • the preset shape set includes two arithmetic progression value sequences.
  • the maximum number of values involved in the segmentation scheme is 3.
  • for example, the preset shape set is {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, with 17 values in total, which can be adjusted as needed.
  • the preset shape set includes two arithmetic progression value sequences.
  • the maximum number of values involved in the segmentation scheme is 3.
  • for example, the preset shape set is {16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 512, 768, 1024, 1280, 1536, 1792, 2048}, with 23 values in total, which can be adjusted as needed.
  • the AI processing core contains multiple levels of cache space, and the size of each level of cache space varies greatly, so the corresponding constraints are also different. Therefore, the setting of the preset shape set can comprehensively consider the association relationship between data transfer between multiple levels of cache and the constraints of multiple levels of cache space, and the present application is not limited to this.
  • for example, the maximum amount of data transferred between each level of cache space is limited to at most the block size, and different transfer and calculation orders are designed for different internal segmentation methods of the base operators, so as to reuse data as much as possible, reduce transfers, and form a pipeline. This process is further introduced in the second part below and is not detailed here.
  • the user can also adjust the preset shape set according to the quality evaluation information of the preset shape set (such as automatic iterative adjustment or manual iterative adjustment) so that the adjusted preset shape set meets the target conditions and ensures the rationality of the preset shape set.
  • the target condition refers to the condition set by the user for evaluating the quality of the preset shape set, which can be adjusted according to actual needs.
  • the target condition includes that the total number of base operators corresponding to the preset shape set meets the requirements, the number of values involved in the segmentation scheme meets the requirements, the utilization rate of the hardware capacity meets the requirements, etc.
  • the preset shape set is iteratively adjusted by evaluating the quality of the preset shape set. The following introduces several situations involving the above-mentioned target conditions:
  • the evaluation basis for fully utilizing the hardware capabilities may include: fully utilizing the cache space of L0A and L0B; fully utilizing the bandwidth of cache space at all levels; and fully utilizing the computing capability of the computing unit as much as possible under the constraints of the cache space of L0A and L0B.
  • the above evaluation basis for fully utilizing hardware capabilities is only illustrative.
  • Hardware capabilities can also be evaluated through other factors, such as the theoretical upper limit model of compute bound and memory bound, cache hit rate of multi-level cache space, scheduling overhead, etc.
  • Step 2: Set the base operators.
  • a base operator refers to a pre-developed and compiled operator used to process input data of a fixed shape.
  • the fixed shape here corresponds to the numerical value in the preset shape set.
  • the base operator can also be understood as a basic operator for segmenting operators.
  • a solution for automatic implementation of a base operator is provided, that is, after the base operator is developed and compiled, a unified base operator scheduling mechanism is used during the actual operation of the base operator to achieve seamless splicing of scheduling between multiple base operators.
  • when generating a base operator, a unified base operator scheduling mechanism needs to be adopted to achieve seamless splicing of scheduling between multiple base operators; therefore, the cache space allocation implemented inside the base operator must be constrained by a unified internal cache space mechanism of the AI accelerator. In this way, when the AI accelerator subsequently calls multiple base operators to process their respective input data, a unified granularity can be used to allocate cache space at all levels in the AI accelerator when reading and writing data, thereby achieving seamless splicing of scheduling between the multiple base operators.
  • a ping-pong cache mechanism is used to achieve seamless splicing of scheduling between multiple base operators; that is, since an AI accelerator usually includes multiple levels of cache space, opening up ping-pong cache space at each level allows the computing units and the data transfer units to execute concurrently in a pipeline, thereby reducing processing latency.
  • FIG8 is a schematic diagram of a base operator scheduling mechanism provided by an embodiment of the present application.
  • the AI processing core includes a multi-level cache space.
  • Each base operator generally includes three stages during execution: input data transfer, calculation, and output transfer. If one base operator must finish executing before the next one starts, the hardware resources of some computing units and data transfer units sit idle.
  • the present application provides a base operator scheduling mechanism, which adopts a ping-pong cache mechanism to achieve seamless splicing of scheduling between multiple base operators.
  • FIG9 is a schematic diagram of a cache space allocation method provided by an embodiment of the present application.
  • the size of the cache space corresponding to the L1 buffer is 1MB
  • the size of the cache space corresponding to the L0A and L0B buffers is 64KB each
  • the size of the cache space corresponding to the L0C and UB buffers is 256KB.
  • a block of 32KB is used to uniformly divide the cache space.
  • the data can be read and written at a unified granularity.
  • the cache space at each level in the AI accelerator can be allocated, thereby realizing seamless splicing in scheduling between multiple base operators.
  • the combination of different values in the preset shape set can determine the size of the cache space occupied. Therefore, the above unified division of the cache space can also be used to constrain the size of the values in the preset shape set.
  • the ping-pong cache implementation method for concurrent pipelining requires that two cache spaces of the same size exist in the cache space for implementing the ping-pong cache. Taking the L0A buffer as an example, the size of the cache space corresponding to the L0A buffer is 64KB. Therefore, the upper limit of the ping-pong cache space is 32KB, so that the data size obtained by the combination of different values in the preset shape set needs to be less than or equal to 32KB.
  • the examples given here are only for schematic illustration, and the size of the values in the preset shape set can be adjusted according to actual needs.
  • since the AI accelerator includes multiple levels of cache space, and the sizes of the cache spaces at each level differ, after the AI accelerator obtains data from the outside, the data still needs to be segmented internally before the calculations are completed piece by piece.
  • the base operator can also include an internal segmentation process, and different internal segmentation schemes have different effects on the performance of the base operator. Since the shape of the input data of a base operator is fixed, a corresponding segmentation scheme can be generated for each base operator under the constraints shown in point (1) above. In this process, the developer can first write, based on experience, internal implementation templates of the base operator with different segmentation methods, and then select the most suitable implementation for each base operator through performance testing.
  • the performance test optimization process after the base operator is implemented by different templates can be automatically completed by tools such as scripts, and the specific implementation method of the base operator is not limited in this application.
  • the shape of the data processed by the base operator can also be used as a function input parameter, and the function of the base operator is realized by calling a set template; that is, there is no need to generate a base operator code file, which greatly reduces the amount of base operator code, further shrinks the compiled binary file, and saves resource usage.
  • Figure 10 is a schematic diagram of a segmentation method of a base operator provided in an embodiment of the present application.
  • taking the internal segmentation method M2_K2_N2 of the matrix multiplication operator as an example, M2_K2_N2 means that the data is segmented into two equal parts in each of the three dimensions M, K, and N.
  • Base operators that process data of different shapes are suitable for different internal segmentation implementation methods. By performing performance tests on a single base operator, the most suitable implementation scheme for each base operator can be selected, and the present application is not limited to this.
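  • That per-base-operator template selection can be sketched as follows (the names and the timing harness are hypothetical; a real test would run on the accelerator and average repeated runs):

```python
import time

def pick_best_template(shape, templates, run):
    """templates: dict mapping names like 'M2_K2_N2' to implementations;
    run(impl, shape) executes one implementation on fixed-shape test data."""
    timings = {}
    for name, impl in templates.items():
        t0 = time.perf_counter()
        run(impl, shape)
        timings[name] = time.perf_counter() - t0
    return min(timings, key=timings.get)  # fastest template wins
```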
  • Step 3: Set the method for generating the segmentation scheme.
  • the segmentation scheme is generated based on the preset shape set and the shape of the input data of the operator, and includes a segmentation method for the input data of the operator and the multiple base operators corresponding to the operator. After the input data of the operator is segmented according to the segmentation scheme, the input data of the multiple base operators corresponding to the operator is obtained.
  • the present application provides a segmentation scheme determination method for automatic load balancing: by determining the costs of different segmentation schemes, the scheme with the minimum cost is selected, so as to improve the operating performance of the AI accelerator, where the cost of a segmentation scheme indicates the predicted time consumption of calling the multiple base operators for data processing according to the scheme to obtain the output data of the operator.
  • the segmentation scheme also includes a computing power resource allocation method for calling multiple base operators, and the computing power resource allocation method is associated with at least one of the following: a computing power resource allocation method based on the number of cores in the AI accelerator; a computing power resource allocation method based on the number of threads in the AI accelerator; a computing power resource allocation method based on the number of thread bundles in the AI accelerator; a computing power resource allocation method based on the number of logic blocks in the AI accelerator.
  • the AI accelerator usually includes multiple AI processing cores (as shown in Figure 3 above), for example, including 32 AI processing cores.
  • FIG. 11 is a schematic diagram of a segmentation scheme provided in an embodiment of the present application.
  • the AI accelerator includes 6 AI processing cores; the input data of the operator is evenly divided according to the number of cores, each AI processing core processes a small 3×3 matrix, and the data is further segmented within each AI processing core.
  • each base operator processes its respective input data to obtain its output data, and together these constitute the output data of the complete operator.
  • it can also be divided according to the number of threads, the number of thread bundles, or the number of logic blocks, etc., and the present application is not limited to this.
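  • A sketch of core-count-based allocation in the spirit of FIG. 11 (round-robin is an assumed policy; the actual allocation method is itself part of the segmentation scheme):

```python
def allocate_to_cores(base_ops, num_cores=6):
    """Distribute base-operator sub-tasks evenly over AI processing cores."""
    cores = [[] for _ in range(num_cores)]
    for i, op in enumerate(base_ops):
        cores[i % num_cores].append(op)
    return cores

# e.g. 36 output tiles spread over 6 cores -> 6 tiles per core
tiles = [(r, c) for r in range(6) for c in range(6)]
assert all(len(core) == 6 for core in allocate_to_cores(tiles))
```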
  • the following example uses the allocation of computing resources based on the number of cores in an AI accelerator to illustrate the method of determining the cost of a segmentation scheme.
  • the present application provides a cost model, through which the costs of different segmentation schemes can be determined, providing technical support for determining the final segmentation scheme of the operator.
  • The cost model is shown in the following formulas (7) and (8):
  • t_estimation = t_split-core scheduling overhead × N_split cores + t_multiple base operators  (7)
  • t_multiple base operators = Σ_i ( t_base operator,i × p_most time-consuming stage,i )  (8)
  • t estimation refers to the predicted time consumption of obtaining the output data of the operator by calling multiple base operators for data processing according to the partitioning scheme.
  • The split-core scheduling overhead, t_split-core scheduling overhead, is an estimated value obtained through testing. For example, 1 core, 2 cores, ..., 32 cores are used in sequence to run calculations, with a computing task of the same size placed on each core. In theory, if there were no additional overhead, the total time consumption should remain unchanged as the number of cores increases; however, testing shows a linear increase, and the split-core scheduling overhead can be estimated from the amount of that linear increase.
  • The term t_split-core scheduling overhead × N_split cores is therefore the predicted time consumption of allocating computing power resources according to the computing power resource allocation method.
  • t base operator is an estimated value obtained through testing.
  • the data shape corresponding to the calculation task is set to the size of a certain base operator to be tested, and it is constrained to run only on a single core to obtain the predicted time consumption of running the base operator.
  • The proportion of the most time-consuming stage is obtained through testing: it is the proportion of the longest pipeline stage in the base operator's execution, or in other words, the contribution of that stage to the time consumption of the entire execution process, calculated from the time consumption of the longest single pipeline stage.
  • Figure 12 is a schematic diagram of the proportion of the most time-consuming stage provided in an embodiment of the present application.
  • As shown in Figure 12, the most time-consuming stage is the input transfer stage (it should be noted that this is only an example and does not constitute a limitation of the present application). Accordingly, the predicted time consumption of the multiple base operators is obtained by accumulating, for each base operator, its predicted single-core time consumption weighted by the proportion of its most time-consuming stage, as expressed in formula (8).
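  • For illustration only, the cost model of formulas (7) and (8) can be sketched as follows; the function and field names (scheme_cost, t_base, stage_proportion) are hypothetical placeholders, and the constants would come from the offline tests described above:

```python
# Minimal sketch of the cost model in formulas (7) and (8).
# All names and values are illustrative assumptions.

def scheme_cost(num_split_cores: int,
                base_ops: list,
                t_split_core_overhead: float) -> float:
    """Predicted time of one candidate segmentation scheme.

    base_ops: one dict per base operator in the scheme, e.g.
        {"t_base": 1.8e-5, "stage_proportion": 0.6}
    where t_base is the measured single-core time of that base
    operator and stage_proportion is the share of its most
    time-consuming pipeline stage, both obtained by offline tests.
    """
    # Formula (8): predicted time of the multiple base operators.
    t_bases = sum(op["t_base"] * op["stage_proportion"] for op in base_ops)
    # Formula (7): add the split-core scheduling overhead term.
    return t_split_core_overhead * num_split_cores + t_bases
```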
  • FIG13 is a flow chart of an operator development stage provided in an embodiment of the present application. As shown in FIG13 , the operator development stage is executed by the host and includes the following steps 1301 to 1306 .
  • the host generates a preset shape set of operators based on constraint conditions, where the preset shape set includes multiple values for segmenting data.
  • the operator is a dynamic shape operator
  • the constraint condition refers to the constraint condition set by the user, which can be adjusted according to actual needs.
  • the constraint conditions include the data type corresponding to the instruction executed by the AI accelerator, the data range corresponding to the instruction executed by the AI accelerator, the size of the cache space at each level in the AI accelerator, the correlation relationship between data transfer between multi-level caches, the granularity of unified division of cache space, etc. Please refer to the first and second parts above for details, which will not be repeated here.
  • the host obtains quality evaluation information of the preset shape set, and adjusts the preset shape set based on the quality evaluation information, so that the adjusted preset shape set meets the target condition.
  • the target condition refers to the condition set by the user for evaluating the quality of the preset shape set, which can be adjusted according to actual needs.
  • For example, the target conditions include: the total number of base operators corresponding to the preset shape set meets the requirements; the number of numerical values involved in the segmentation scheme meets the requirements; the utilization rate of hardware capabilities meets the requirements; and so on.
  • For the specific content of the quality evaluation information, refer to the first part above, which will not be repeated here.
  • the host generates a code file of the base operator based on the segmentation method of the base operator corresponding to the operator.
  • the division method of the base operator is the internal implementation template of the base operator shown in the second part above.
  • the host can automatically generate the code file of the base operator according to the division method of the base operator provided by the user. Please refer to the second part above for details, which will not be repeated here.
  • the host constructs a cost model of the operator based on the code file of the base operator.
  • the host constructs a segmentation function of the operator based on the preset shape set, and the segmentation function is used to generate a segmentation scheme of the operator.
  • the segmentation function refers to a function code for generating a segmentation scheme of an operator, and the segmentation function is used to output segmentation parameters based on a preset shape set and the shape of the input data of the operator, that is, to generate a segmentation scheme of the operator.
  • The segmentation parameters are a set of parameters indicating how the input data is segmented and computed, including the source data address, destination data address, address offset, base operator type, base operator count, the post-segmentation count of each segmented dimension, the number of loops, and so on; the present application is not limited thereto. A sketch of such a parameter set follows.
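  • For illustration, the segmentation parameters listed above might be grouped into a structure like the following; the field names are assumptions, not the actual interface of the segmentation function:

```python
from dataclasses import dataclass

# Hypothetical container for the segmentation parameters described
# above; field names are illustrative assumptions only.

@dataclass
class SplitParams:
    src_addr: int        # source data address
    dst_addr: int        # destination data address
    addr_offset: int     # address offset of this slice
    base_op_type: str    # which base operator to call, e.g. "M2_K2_N2"
    base_op_count: int   # how many base operator instances are needed
    dim_splits: tuple    # post-segmentation count of each dimension
    loop_count: int      # number of loops over the base operator
```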
  • the host generates a code file for the operator based on the code file of the base operator, the cost model of the operator, and the segmentation function of the operator.
  • the code file of the operator includes all code files involved in running the operator, such as the code file of the base operator, the cost model of the operator, the splitting function of the operator, and the logic code file for unified scheduling of the base operator, etc., but the present application is not limited to this.
  • FIG14 is a flowchart of an operator processing method provided in an embodiment of the present application. As shown in FIG14 , the method is executed by an AI accelerator and includes the following steps 1401 to 1404 .
  • the AI accelerator obtains input data of an operator, which is a dynamic shape operator.
  • The operator refers to any dynamic shape operator of the AI model, and its input data can be sent by the host, or it can be the output data of other operators associated with this operator while the AI accelerator runs the AI model.
  • the present application is not limited to this.
  • the AI accelerator generates a segmentation scheme for the operator based on a preset shape set and a shape of the input data of the operator.
  • the segmentation scheme includes a segmentation method for the input data of the operator and multiple base operators corresponding to the segmentation method.
  • In some embodiments, the preset shape set includes at least two arithmetic value sequences, where the difference between the value at the tail of a target sequence and the value at the head of the sequence adjacent to the target sequence is greater than the common difference of the target sequence.
  • the values corresponding to the shapes of the input data of the plurality of base operators are associated with the number of sequences in the preset shape set.
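  • As an illustration of how values drawn from such a preset shape set can tile an arbitrary dimension of the input data, the following is a minimal sketch; the concrete sequences and the greedy strategy are assumptions:

```python
# Minimal sketch: a preset shape set built from two arithmetic value
# sequences (the numbers are assumptions for illustration), and a
# greedy tiling of one dimension value with values from the set.

SHAPE_SET = sorted([16, 32, 48, 64] + [128, 192, 256], reverse=True)

def tile_dimension(dim: int) -> list:
    """Greedily cover a dimension value with values from the set."""
    tiles = []
    while dim > 0:
        # Pick the largest value that still fits; dimensions smaller
        # than the smallest value are padded up to it.
        v = next((v for v in SHAPE_SET if v <= dim), SHAPE_SET[-1])
        tiles.append(v)
        dim -= v
    return tiles

print(tile_dimension(300))   # e.g. [256, 32, 16] -> base operator shapes
```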
  • the segmentation scheme also includes a computing power resource allocation method that calls multiple base operators, and the computing power resource allocation method is associated with at least one of the following: a method for allocating computing power resources based on the number of cores in the AI accelerator; a method for allocating computing power resources based on the number of threads in the AI accelerator; a method for allocating computing power resources based on the number of thread bundles in the AI accelerator; a method for allocating computing power resources based on the number of logic blocks in the AI accelerator.
  • this step 1402 includes the following two steps:
  • Step A: based on the preset shape set and the shape of the operator's input data, generate multiple candidate segmentation schemes for the operator, each candidate segmentation scheme including a candidate segmentation method for the operator's input data and multiple candidate base operators corresponding to that candidate segmentation method.
  • Step B: determine, from the multiple candidate segmentation schemes, a segmentation scheme that meets the target conditions.
  • the AI accelerator determines the cost of each candidate segmentation scheme, which indicates the predicted time consumption of calling multiple candidate base operators for data processing according to the candidate segmentation scheme to obtain the output data of the operator; and determines the candidate segmentation scheme with the smallest cost among multiple candidate segmentation schemes as the segmentation scheme.
  • In some embodiments, the candidate segmentation scheme also includes a computing power resource allocation method for calling the multiple candidate base operators.
  • the AI accelerator determines the cost of each candidate splitting scheme, including: determining the cost of the candidate splitting scheme based on the predicted time consumption of multiple candidate base operators indicated by the candidate splitting scheme and the predicted time consumption of computing power resource allocation based on the computing power resource allocation method.
  • the above process can refer to the third part of the aforementioned operator development stage, and specifically refer to the cost model shown in formula (7) and formula (8), which will not be repeated here.
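  • A minimal sketch of the minimum-cost selection in step B follows, reusing the hypothetical scheme_cost() from the cost-model sketch above; the candidate enumeration is a stand-in assumption:

```python
# Step B sketch: pick the candidate scheme with the smallest cost.

def choose_scheme(candidates, t_split_core_overhead):
    """candidates: iterable of (num_split_cores, base_ops) tuples."""
    return min(
        candidates,
        key=lambda c: scheme_cost(c[0], c[1], t_split_core_overhead),
    )
```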
  • the AI accelerator divides the input data of the operator according to the division scheme to obtain input data of multiple base operators.
  • the AI accelerator calls multiple base operators, processes their respective input data, and obtains output data of the multiple base operators.
  • the AI accelerator calls multiple base operators, and in the process of processing their respective input data, a unified granularity is used to allocate cache space at all levels in the AI accelerator when reading and writing data. For example, based on the ping-pong buffer mechanism, a unified granularity (such as a 32KB block) is used to allocate cache space at all levels in the AI accelerator. It should be understood that since multiple base operators can collaboratively realize the functions of the operator, the output data of multiple base operators is also the output data of the target operator, or in other words, the output data of multiple base operators is spliced together to obtain the output data of the operator.
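  • The following is a minimal sketch of such fixed-granularity allocation with ping-pong (double) buffering; the 32 KB block size matches the example above, while the buffer layout and loop are assumptions:

```python
# Illustrative sketch: carving one level of on-chip buffer into
# fixed 32 KB blocks and cycling two of them as a ping-pong pair,
# so consecutive base operators can overlap data movement with
# computation. Sizes and the scheduling loop are assumptions.

BLOCK = 32 * 1024

def ping_pong_blocks(buffer_size: int):
    """Yield alternating block offsets: 0, BLOCK, 0, BLOCK, ..."""
    assert buffer_size >= 2 * BLOCK
    i = 0
    while True:
        yield (i % 2) * BLOCK    # block to use in this round
        i += 1

slots = ping_pong_blocks(256 * 1024)
offsets = [next(slots) for _ in range(4)]   # [0, 32768, 0, 32768]
```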
  • the AI accelerator calls multiple base operators, and in the process of processing their respective input data, each base operator corresponds to its own destination video memory address. After the execution of multiple base operators is completed, the corresponding output data is written to each destination video memory address on the output memory, that is, the output data of the operator is obtained.
  • In some embodiments, the segmentation scheme also indicates the computing power resource allocation method for calling the multiple base operators; for example, with even allocation according to the number of processing cores, each AI processing core outputs one portion of the data, and the output data of the multiple AI processing cores is spliced together to obtain the output data of the operator. Please refer to Figure 11 above for details, which will not be repeated here.
  • a preset shape set is used to dynamically segment the input data of operators of arbitrary shapes, thereby realizing the execution of dynamic shape operators by calling multiple base operators corresponding to the operators, thereby effectively improving the operating performance of the AI accelerator.
  • FIG15 is a flowchart of another operator processing method provided in an embodiment of the present application. As shown in FIG15 , the interaction between a host and an AI accelerator is taken as an example for introduction, and includes the following steps 1501 to 1506 .
  • the host generates a segmentation scheme for the operator based on a preset shape set and the shape of the input data of the operator.
  • the preset shape set includes multiple numerical values for segmenting the data, and the numerical values are used to indicate the dimension value of the data.
  • the segmentation scheme includes a segmentation method for the input data of the operator and multiple base operators corresponding to the segmentation method.
  • the process of the host generating an operator segmentation plan is the same as the process of the AI accelerator generating a segmentation plan shown in Figure 14 above, so it will not be repeated here.
  • the host sends the operator input data and the operator segmentation plan to the AI accelerator.
  • the AI accelerator obtains the input data of the operator and the segmentation scheme of the operator.
  • the AI accelerator divides the input data of the operator according to the division scheme of the operator to obtain input data of multiple base operators.
  • Step 1504 is an optional step.
  • In other embodiments, the host segments the input data of the operator according to the operator's segmentation scheme to obtain the input data of the multiple base operators, and sends the input data of the multiple base operators to the AI accelerator.
  • the AI accelerator calls multiple base operators, processes their respective input data, and obtains output data of the multiple base operators.
  • the AI accelerator sends the output data of the multiple base operators to the host.
  • the AI accelerator calls multiple base operators to process their respective input data. Each base operator corresponds to its own destination memory address. After the execution of multiple base operators is completed, the corresponding output data is written to each destination memory address on the output memory.
  • the AI accelerator sends the output data of the multiple base operators to the host, and the host obtains the output data of the operator.
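  • The end-to-end flow of steps 1501 to 1506 can be sketched as the following toy simulation, in which the host-accelerator transport is replaced by direct function calls and the base operator is a stand-in; all names and the doubling operation are illustrative assumptions:

```python
import numpy as np

# Toy simulation of the Figure 15 flow; purely illustrative.

def host_generate_scheme(shape, num_parts=2):        # step 1501
    # Split the first dimension into num_parts near-equal slices.
    return np.array_split(np.arange(shape[0]), num_parts)

def accelerator_run(op_input, scheme):               # steps 1503-1505
    parts = [op_input[idx] for idx in scheme]        # step 1504: split
    return [p * 2.0 for p in parts]                  # step 1505: stand-in
                                                     # "base operators"

x = np.ones((10, 4))
scheme = host_generate_scheme(x.shape)               # host side
outs = accelerator_run(x, scheme)                    # accelerator side
result = np.concatenate(outs, axis=0)                # step 1506: splice
assert result.shape == x.shape
```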
  • a preset shape set is used to dynamically segment the input data of operators of arbitrary shapes, thereby realizing the execution of dynamic shape operators by calling multiple base operators corresponding to the operators, thereby effectively improving the operating performance of the AI accelerator.
  • It should be noted that the above description takes as an example the segmentation scheme of the operator being generated when the AI accelerator (or the host) runs the operator.
  • In other embodiments, the segmentation scheme can also be pre-set.
  • For example, in an offline model compilation scenario, the compilation tool automatically performs tasks such as model operator generation, operator fusion, computation graph optimization, computation quantization, and model partitioning offline according to the model description given by the user, and finally generates a high-performance model file for inference deployment in production.
  • the technical solution provided in this application can also be used in this scenario, that is, an AI model with a segmentation scheme that has been set can be generated according to the AI model description given by the user, and this application is not limited to this.
  • FIG16 is a schematic diagram of the structure of an operator processing device provided in an embodiment of the present application.
  • the device can realize the function of the aforementioned AI accelerator through software, hardware, or a combination of both.
  • the device is configured in the AI accelerator, including an acquisition module 1601 and a call module 1602.
  • Acquisition module 1601 is used to acquire input data of multiple base operators corresponding to an operator.
  • the operator is a dynamic shape operator.
  • The input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set, and the multiple base operators are used to collaboratively implement the functions of the operator.
  • the preset shape set includes multiple numerical values for segmenting data, and the numerical values are used to indicate the dimensional value of the data.
  • The calling module 1602 is used to call the multiple base operators to process their respective input data and obtain the output data of the multiple base operators.
  • the acquisition module 1601 includes:
  • a generating unit configured to generate a segmentation scheme of the operator based on a preset shape set and a shape of the input data of the operator, wherein the segmentation scheme includes a segmentation method for the input data of the operator and a plurality of base operators corresponding to the segmentation method;
  • the segmentation unit is used to segment the input data of the operator according to the segmentation scheme to obtain the input data of multiple base operators.
  • In some embodiments, the segmentation scheme further includes a computing power resource allocation method for calling the multiple base operators, where the computing power resource allocation method is associated with at least one of the following:
  • a method of allocating computing power resources based on the number of cores in the AI accelerator; a method of allocating computing power resources based on the number of threads in the AI accelerator; a method of allocating computing power resources based on the number of thread bundles in the AI accelerator; a method of allocating computing power resources based on the number of logic blocks in the AI accelerator.
  • In some embodiments, the generating unit is used to: based on the preset shape set and the shape of the input data of the operator, generate a plurality of candidate segmentation schemes of the operator, wherein each candidate segmentation scheme includes a candidate segmentation method for the input data of the operator and a plurality of candidate base operators corresponding to the candidate segmentation method;
  • determine the cost of each candidate segmentation scheme, the cost indicating the predicted time consumption of calling the multiple candidate base operators for data processing according to the candidate segmentation scheme to obtain the output data of the operator; and
  • determine the candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes as the segmentation scheme.
  • In some embodiments, the candidate segmentation scheme further includes a computing power resource allocation method for calling the multiple candidate base operators, and the generating unit is used to:
  • the cost of the candidate segmentation scheme is determined based on the predicted time consumption of multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of computing resource allocation based on the computing resource allocation method.
  • In some embodiments, the acquisition module 1601 is further used to: obtain the input data of the operator and the segmentation scheme of the operator sent by the host, the segmentation scheme including a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method.
  • the device also includes a sending module, which is used to send output data of multiple base operators to a host.
  • In some embodiments, the preset shape set includes at least two arithmetic numerical sequences, wherein the difference between the numerical value at the tail of a target sequence and the numerical value at the head of an adjacent sequence of the target sequence is greater than the common difference of the target sequence.
  • In some embodiments, the values corresponding to the shapes of the input data of the plurality of base operators are associated with the number of sequences in the preset shape set.
  • the size of the numerical value in the preset shape set is associated with at least one of the following:
  • the data type corresponding to the instructions executed by the AI accelerator;
  • the data range corresponding to the instructions executed by the AI accelerator;
  • the size of cache space at each level in the AI accelerator.
  • the calling module 1602 calls multiple base operators to process their respective input data, and uses a unified granularity to allocate cache space at various levels in the AI accelerator when reading and writing data.
  • a preset shape set is used to dynamically segment the input data of a target operator of any shape, thereby realizing the execution of a dynamic shape operator by calling multiple base operators corresponding to the target operator, effectively improving the operating performance of the AI accelerator.
  • It should be noted that when the operator processing device provided in the above embodiment processes an operator, the division into the above functional modules is merely used as an example for illustration.
  • the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the operator processing device provided in the above embodiment and the operator processing method embodiment belong to the same concept. The specific implementation process is detailed in the method embodiment and will not be repeated here.
  • The terms "first", "second", etc. are used to distinguish between identical or similar items having substantially the same effects and functions. It should be understood that there is no logical or temporal dependency between "first", "second", and "nth", nor do they limit quantity or execution order. It should also be understood that although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms; the terms are only used to distinguish one element from another.
  • the first processing core can be referred to as the second processing core, and similarly, the second processing core can be referred to as the first processing core. Both the first processing core and the second processing core can be processing cores, and in some cases, can be separate and different processing cores.
  • the term "at least one” means one or more, and the term “plurality” means two or more.
  • a plurality of processing cores means two or more processing cores.
  • all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • all or part of the embodiments may be implemented in the form of program structure information.
  • the program structure information includes one or more program instructions.

Abstract

The present application relates to the technical field of artificial intelligence (AI). Disclosed are an operator processing method and apparatus, and a chip, a computing device and a storage medium. The method is executed by means of an AI accelerator, and comprises: acquiring input data of a plurality of basis operators corresponding to an operator; and calling the plurality of basis operators to process the respective input data, so as to obtain output data of the plurality of basis operators, i.e., obtaining output data of the operator. The plurality of basis operators are used for collaboratively implementing functions of the operator, and the input data of each basis operator is obtained by means of segmenting input data of the operator on the basis of a preset shape set. That is, in the method, input data, which is of any shape, of an operator is dynamically segmented by using a preset shape set, so that a plurality of basis operators corresponding to the operator are called to collaboratively implement the execution of a dynamic shape operator, thereby effectively improving the operation performance of an AI accelerator.

Description

Operator processing method, device, chip, computing device and storage medium
This application claims the priority of the Chinese patent application with application number 202211669360.1, filed on December 24, 2022 and entitled "A method for accelerating operator generation", and the priority of the Chinese patent application with application number 202310379731.0, filed on March 31, 2023 and entitled "Operator processing method, device, chip, computing device and storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to an operator processing method, apparatus, chip, computing device and storage medium.
Background
With the rapid development of artificial intelligence (AI) technology, a series of AI accelerators have emerged to provide computing power for matrices and vectors to accelerate the calculation of AI models. Usually, AI accelerators are equipped with dedicated operator programming interfaces for users to write operators to run on AI accelerators. Operators include fixed shape operators and dynamic shape operators, among which fixed shape operators refer to operators whose input data size is fixed, and dynamic shape operators refer to operators whose input data size is not fixed.
In the related technology, when the AI accelerator obtains the input data of the dynamic shape operator, it selects the operator implementation file corresponding to the shape range from multiple preset operator implementation files according to the shape range of the input data, and calls the operator implementation file to realize the dynamic shape operation of the operator.
However, the above method requires users to generate different operator implementation files according to different shape ranges in advance, which has poor usability. Moreover, if the shape of the input data does not fall within the preset shape range, it will trigger runtime compilation (just-in-time, JIT), introduce compilation time overhead, and greatly affect the operating performance of the AI accelerator.
Summary of the Invention
The embodiment of the present application provides an operator processing method, device, chip, computing device and storage medium, which can effectively improve the operating performance of the AI accelerator. The technical solution is as follows:
In a first aspect, a method for processing an operator is provided, which is executed by an artificial intelligence (AI) accelerator, and the method includes:
obtaining input data of multiple base operators corresponding to an operator, where the operator is a dynamic shape operator, the input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set, the multiple base operators are used to collaboratively implement the function of the operator, and the preset shape set includes multiple numerical values for segmenting data, the numerical values being used to indicate the dimension value of the data;
calling the multiple base operators to process the respective input data to obtain output data of the multiple base operators.
Since multiple base operators can collaboratively realize the functions of the operator, the output data of the multiple base operators is also the output data of the target operator; in other words, the output data of the operator is obtained by splicing the output data of the multiple base operators. That is, the above method uses a preset shape set to dynamically segment the input data of operators of any shape, thereby calling multiple base operators corresponding to the operator to collaboratively realize the execution of the dynamic shape operator, effectively improving the operating performance of the AI accelerator.
In some embodiments, obtaining the input data of the multiple base operators corresponding to the operator includes:
based on the preset shape set and the shape of the input data of the operator, generating a segmentation scheme of the operator, the segmentation scheme including a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;
according to the segmentation scheme, segmenting the input data of the operator to obtain the input data of the multiple base operators.
In some embodiments, the segmentation scheme further includes a computing power resource allocation method for calling the multiple base operators, wherein the computing power resource allocation method is associated with at least one of the following:
a method for allocating computing power resources based on the number of cores in the AI accelerator;
a method for allocating computing power resources based on the number of threads in the AI accelerator;
a method for allocating computing power resources based on the number of thread warps in the AI accelerator;
a method for allocating computing power resources based on the number of logic blocks in the AI accelerator.
In the above manner, since the computing power resource allocation method for calling the multiple base operators is taken into account when generating the segmentation scheme, the computing power consumption incurred by calling the multiple base operators can be effectively balanced, thereby making full use of the computing power resources of the AI accelerator.
In some embodiments, generating the segmentation scheme of the operator based on the preset shape set and the shape of the input data of the operator includes:
based on the preset shape set and the shape of the input data of the operator, generating a plurality of candidate segmentation schemes of the operator, the candidate segmentation schemes including candidate segmentation methods for the input data of the operator and a plurality of candidate base operators corresponding to the candidate segmentation methods;
determining, from the multiple candidate segmentation schemes, the segmentation scheme that meets the target condition.
In some embodiments, determining the segmentation scheme that meets the target condition from the multiple candidate segmentation schemes includes:
determining the cost of each candidate segmentation scheme, where the cost indicates the predicted time consumption of calling the plurality of candidate base operators to perform data processing according to the candidate segmentation scheme to obtain the output data of the operator;
determining the candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes as the segmentation scheme.
In some embodiments, the candidate segmentation scheme further includes a computing power resource allocation method for calling the plurality of candidate base operators, and determining the cost of each candidate segmentation scheme includes:
determining the cost of the candidate segmentation scheme based on the predicted time consumption of the multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of allocating computing power resources according to the computing power resource allocation method.
In the above manner, a segmentation scheme determination method with automatic load balancing is provided: by determining the costs of different candidate segmentation schemes, the segmentation scheme with the lowest cost is selected, so as to maximize the operating performance of the AI accelerator.
In some embodiments, the method further includes:
acquiring the input data of the operator and the segmentation scheme of the operator sent by the host, wherein the segmentation scheme includes a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;
the method further includes: sending the output data of the plurality of base operators to the host.
In the above manner, the host generates the segmentation scheme of the operator, which can offload work from the AI accelerator and save the computing power resources of the AI accelerator.
In some embodiments, the preset shape set includes at least two arithmetic numerical sequences, wherein the difference between the numerical value at the tail of a target sequence and the numerical value at the head of an adjacent sequence of the target sequence is greater than the common difference of the target sequence.
In some embodiments, the numerical values corresponding to the shapes of the input data of the plurality of base operators are associated with the number of sequences in the preset shape set.
In some embodiments, the size of the numerical values in the preset shape set is associated with at least one of the following:
the data type corresponding to the instructions executed by the AI accelerator;
the data range corresponding to the instructions executed by the AI accelerator;
the size of cache space at each level in the AI accelerator.
A preset shape set configured in the above manner can balance operator performance against the number of base operators, thereby effectively improving the operating performance of the AI accelerator.
In some embodiments, in the process of calling the multiple base operators to process their respective input data, a unified granularity is used to allocate cache space at each level in the AI accelerator when reading and writing data.
In the above manner, seamless splicing of the scheduling between multiple base operators can be achieved; for example, a ping-pong buffering mechanism is used to achieve seamless splicing of the scheduling between multiple base operators.
In a second aspect, an operator processing device is provided. The device is configured in an AI accelerator and includes at least one functional module for executing the operator processing method provided in the first aspect or any possible implementation of the first aspect.
In a third aspect, a chip is provided, configured as an AI accelerator, the AI accelerator including a communication interface and at least one AI processing core, the communication interface being used to provide program instructions and/or data to the at least one AI processing core, and the at least one AI processing core being used to implement the operator processing method provided in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, a computing device is provided, including a host and an AI accelerator, the host being used to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator being used to execute the operator processing method provided in the first aspect or any possible implementation of the first aspect.
In a fifth aspect, a computing device cluster is provided, including multiple computing devices, each computing device including a host and an AI accelerator, the host being used to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator being used to execute the operator processing method provided in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, a computer-readable storage medium is provided, the computer-readable storage medium being used to store at least one piece of program code, the at least one piece of program code being used to execute the operator processing method provided in the first aspect or any possible implementation of the first aspect. The storage medium includes but is not limited to volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
In a seventh aspect, a computer program product is provided, which, when run on an AI accelerator, enables the AI accelerator to execute the operator processing method provided in the first aspect or any possible implementation of the first aspect. The computer program product may be a software installation package; when the functions of the aforementioned AI accelerator need to be implemented, the computer program product may be downloaded and executed on the AI accelerator.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an implementation environment provided by the present application;
FIG. 2 is a schematic diagram of the architecture of an AI accelerator 200 provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of the hardware structure of a chip provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of the structure of an AI processing core provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of one segmentation of a matrix multiplication operator;
FIG. 6 is a schematic diagram of another segmentation of a matrix multiplication operator;
FIG. 7 is a schematic diagram of a segmentation of a matrix multiplication operator provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a base operator scheduling mechanism provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a cache space allocation method provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a segmentation method of a base operator provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of a segmentation scheme provided in an embodiment of the present application;
FIG. 12 is a schematic diagram of the proportion of the most time-consuming stage provided in an embodiment of the present application;
FIG. 13 is a schematic flow chart of an operator development stage provided in an embodiment of the present application;
FIG. 14 is a flowchart of an operator processing method provided in an embodiment of the present application;
FIG. 15 is a flowchart of another operator processing method provided in an embodiment of the present application;
FIG. 16 is a schematic diagram of the structure of an operator processing device provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the implementations of the present application are further described in detail below with reference to the accompanying drawings.
For ease of understanding, the key terms and key concepts involved in this application are explained first.
An artificial intelligence (AI) model is a type of mathematical algorithm model that uses machine learning ideas to solve practical problems; an AI model includes a large number of parameters and calculation formulas (or calculation rules).
An AI accelerator is a type of specialized hardware accelerator or computer system designed to accelerate AI applications, especially neural networks, machine vision, and machine learning, for example by providing computing power for matrix and vector calculations to accelerate the computation of AI models. Schematically, an AI accelerator is, for example, a graphics processing unit (GPU), an intelligent processing unit (IPU), a tensor processing unit (TPU), a domain specific architecture (DSA) chip, and so on.
Deep learning (DL) is a branch of machine learning (ML). Deep learning studies the inherent laws and representation levels of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images, and sound. Schematically, deep learning is a complex machine learning algorithm, and a typical AI model that uses deep learning ideas is the neural network model.
An operator (OP) refers to a computing unit or computing function running on a computing device. In the field of deep learning, neural network layers and even the entire model are composed of operators, which correspond to the computing logic in the neural network layers. For example, a convolution layer is an operator, and the weight summation process in a fully-connected (FC) layer is an operator. Schematically, operators include fixed shape operators and dynamic shape operators, which are introduced below. (1) A fixed shape operator is an operator whose input data size is fixed, so both the internal segmentation of the input data and the scheduling of execution can be fixed. (2) A dynamic shape operator is an operator whose input data size is not fixed; in other words, the specification of the input data size is deferred, and the operator does not obtain the actual input data size until runtime.
A tensor is the data in an operator, including input data and output data.
Shape refers to the shape of a tensor, that is, the dimension values of the tensor's dimensions, usually expressed in the form (D0, D1, ..., Dn-1), where n is a positive integer. For example, an input data shape of (100, 100) for an operator's matrix tensor indicates that the dimension values of the matrix in the row and column directions are each 100, so the matrix contains 100×100 = 10000 elements in total.
Ping-pong buffering is a data buffering mechanism that uses two data buffers simultaneously to achieve continuous data transmission, thereby improving the data transmission rate. It should be understood that data in a single buffer is easily overwritten during transmission and processing, whereas ping-pong buffering always keeps one buffer being consumed while the other stores incoming data; in other words, two identical objects are alternately read and written as buffers.
The application scenarios and implementation environment involved in this application are introduced below.
The technical solution provided in this application is applied to scenarios where dynamic shape operators are run on an AI accelerator. At present, the implementation of dynamic shape operators often involves segmentation and scheduling strategies. For example, taking a matrix multiplication operator as the dynamic shape operator: when the AI accelerator runs the operator, the input data it obtains includes a 10×20 matrix A and a 20×30 matrix B, so the final output data of the operator is the matrix C = A×B, a 10×30 matrix. The computing task C = A×B usually cannot be completed in a single computation by the computing unit in the AI accelerator, so the task often needs to be segmented. For example, if matrix C is viewed as two 10×15 matrices spliced together, then local data of matrix A and matrix B can be taken to compute two 10×15 matrices, and splicing these two 10×15 matrices together yields the output data of the operator, that is, matrix C. This process divides the computing task of the operator C = A×B into two sub-tasks and calls the two base operators corresponding to the operator (each base operator computes one 10×15 matrix) to implement the two sub-tasks respectively; that is, the function of the operator is implemented collaboratively by its two base operators. Accordingly, since the input data of each base operator is local data of the operator's input data, the operator's input data needs to be segmented to obtain the input data of each base operator and finally realize the function of the operator. However, it should be understood that matrix C can also be spliced in many other ways, and accordingly the operator's segmentation and scheduling strategies include many other strategies. It can be seen that for such a dynamic shape operator, since the shape of its input data is not fixed, the optimization space for segmentation and scheduling strategies is very large. This large optimization space means that, because the AI accelerator only obtains the shape of the input data when running the operator, it is difficult for the AI accelerator to implement the operator with the best or even a better segmentation and scheduling strategy, so its performance and hardware utilization are usually difficult to guarantee.
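As a concrete check of the splitting example above, the following sketch computes C = A×B as two 10×15 base-operator tasks and splices the results; NumPy stands in for the accelerator's computing units and is used purely for illustration:

```python
import numpy as np

# Worked example: C = A x B with A 10x20 and B 20x30, computed as
# two 10x15 sub-tasks whose outputs are spliced column-wise.

A = np.random.rand(10, 20)
B = np.random.rand(20, 30)

C_left  = A @ B[:, :15]     # base operator task 1 -> 10x15
C_right = A @ B[:, 15:]     # base operator task 2 -> 10x15
C = np.concatenate([C_left, C_right], axis=1)

assert np.allclose(C, A @ B)
```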
Based on this, the present application provides a technical solution for implementing dynamic shape operators based on a preset shape set (also called a shape set), which can effectively improve the operating performance of the AI accelerator. For input data of any shape of a certain operator, the input data is segmented according to the preset shape set, and the preset multiple base operators corresponding to the operator are used to respectively process the segmented input data, finally obtaining the output data of the operator (the specific principles and implementation process are introduced in subsequent embodiments and are not repeated here).
The implementation environment of the present application is introduced below with reference to FIG. 1. FIG. 1 is a schematic diagram of an implementation environment provided by the present application. As shown in FIG. 1, the implementation environment includes a host 100 and an AI accelerator 200, which are communicatively connected.
The host 100 refers to a device used to run an AI model and provide AI services to users. In an embodiment of the present application, the host 100 can implement development functions and execution functions for operators in the AI model. The development functions mean that the host 100 provides users with various development capabilities for operators, such as writing operators, setting preset shape sets, generating operator code, and so on; the present application is not limited thereto. The execution function means that the host 100 can control the AI accelerator 200 to run an operator developed by the user to realize the function of the corresponding operator; this process can also be understood as loading an AI task into the AI accelerator 200 for execution. In addition, the number of hosts 100 can be one or more, which is not limited in the present application.
The AI accelerator 200 is used to provide computing power for the running AI model to accelerate the computing process of the AI model, that is, to execute the operator processing method provided in this application. For example, the AI accelerator 200 is a DSA chip, a GPU, etc.; the present application is not limited thereto. Schematically, the AI accelerator 200 calls the corresponding operator based on the input data of the operator sent by the host 100, processes the input data to obtain output data, and returns the output data to the host 100, realizing the function of the corresponding operator. It should be noted that the functions of the AI accelerator 200 are described in detail below with reference to FIG. 2 and will not be repeated here. In addition, the number of AI accelerators 200 can be one or more, which is not limited in the present application.
示意性地,主机100与AI加速器200集成在一个计算设备中,主机100与AI加速器200通过外围组件互连总线(peripheral component interconnect express,PCIe)链路通信连接,主机100通过PCIe链路与AI加速器200进行数据交互,控制AI加速器200来运行相应算子。例如,计算设备可以是独立的物理服务器,或者是多个物理服务器构成的服务器集群或者分布式文件系统,又或者是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(content delivery network,CDN)以及大数据和人工智能平台等基础云计算服务的云服务器。以计算设备为云服务器为例,计算设备也可以称为是一种云平台(即云计算平台的简称),是指基于硬件资源和软件资源的服务,提供计算、网络和存储能力。通过网络“云”将庞大的数据计算处理在远端进行处理和分析后返回给用户,具有大规模、分布式、虚拟化、高可用性、扩展性、按需服务以及安全性等特点。云平台可以以较小的管理代价,或者用户与业务提供者较低的交互复杂度,实现可配置计算资源的快速发放与发布。Schematically, the host 100 and the AI accelerator 200 are integrated in a computing device, and the host 100 and the AI accelerator 200 are connected to each other through a peripheral component interconnect bus (PCIe) link. The host 100 exchanges data with the AI accelerator 200 through the PCIe link and controls the AI accelerator 200 to run the corresponding operator. For example, the computing device can be an independent physical server, or a server cluster or distributed file system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. Taking the computing device as a cloud server as an example, the computing device can also be called a cloud platform (i.e., the abbreviation of cloud computing platform), which refers to services based on hardware resources and software resources, providing computing, network and storage capabilities. Through the network "cloud", huge data computing is processed and analyzed remotely and then returned to the user, with the characteristics of large-scale, distributed, virtualized, high availability, scalability, on-demand service and security. The cloud platform can achieve the rapid release and publication of configurable computing resources with a small management cost or low interaction complexity between users and service providers.
需要说明的是,在上述图1所示实施环境中,是以主机100和AI加速器200之间通过交互来实现针对算子的执行功能为例进行介绍的,在另一些实施例中,AI加速器200具有运行AI模型的功能,即,AI加速器200可以直接运行用户指定的AI模型来实现算子的执行功能,本申请对此不作限定。It should be noted that in the implementation environment shown in Figure 1 above, the host 100 and the AI accelerator 200 are introduced as an example to realize the execution function of the operator through interaction. In other embodiments, the AI accelerator 200 has the function of running the AI model, that is, the AI accelerator 200 can directly run the AI model specified by the user to realize the execution function of the operator. This application does not limit this.
另外,上述涉及的网络包括但不限于数据中心网络(data center network)、存储区域网(storage area network,SAN)、局域网(local area network,LAN)、城域网(metropolitan area network,MAN)、广域网(wide area network,WAN)、移动、有线或者无线网络、专用网络或者虚拟专用网络的任何组合。在一些实现方式中,使用包括超级文本标记语言(hyper text markup language,HTML)、可扩展标记语言(extensible markup language,XML)等技术和/或格式来代表通过网络交换的数据。此外还能够使用诸如安全套接字层(secure sockets layer,SSL)、传输层安全(transport layer security,TLS)、虚拟专用网络(virtual private network,VPN)、网际协议安全(internet protocol security,IPsec)等常规加密技术来加密所有或者部分链路。在另一些实施例中,还能够使用定制和/或专用数据通信技术取代或者补充上述数据通信技术。In addition, the networks involved above include, but are not limited to, data center networks, storage area networks (SAN), local area networks (LAN), metropolitan area networks (MAN), wide area networks (WAN), mobile, wired or wireless networks, dedicated networks or any combination of virtual private networks. In some implementations, technologies and/or formats including hypertext markup language (HTML), extensible markup language (XML) are used to represent data exchanged through the network. In addition, conventional encryption technologies such as secure sockets layer (SSL), transport layer security (TLS), virtual private network (VPN), and Internet protocol security (IPsec) can also be used to encrypt all or part of the links. In other embodiments, customized and/or dedicated data communication technologies can also be used to replace or supplement the above data communication technologies.
下面对上述实施环境中AI加速器200的功能进行详细介绍。The functions of the AI accelerator 200 in the above implementation environment are introduced in detail below.
FIG. 2 is a schematic architecture diagram of an AI accelerator 200 provided in an embodiment of this application. It should be understood that FIG. 2 is only an exemplary structural diagram of the AI accelerator 200, and this application does not limit how the functions of the AI accelerator 200 are divided. Schematically, as shown in FIG. 2, the functions of the AI accelerator 200 include but are not limited to: a data acquisition function 201 and an operator calling function 202. In some embodiments, the functions of the AI accelerator 200 further include a segmentation function 203, a storage function 204, and so on; this application is not limited thereto.
The data acquisition function 201 is used to obtain the input data of multiple base operators corresponding to an operator, where the operator refers to a dynamic shape operator, and the multiple base operators are used to collaboratively implement the function of the operator. That is, the functions of these base operators combined are equivalent to the function of the operator; it can also be understood that these base operators are a series of basic operators obtained by segmenting the operator. For any base operator corresponding to the operator, the input data of the base operator is obtained by segmenting the input data of the operator based on a preset shape set. The preset shape set is set in advance and includes multiple values for segmenting data, where a value indicates a dimension value of the data. It should be understood that the values in the preset shape set can be freely combined to splice into any shape. A base operator refers to a pre-developed and pre-compiled operator used to process input data of a fixed shape, where the fixed shape corresponds to the values in the preset shape set.
The operator calling function 202 is used to call the multiple base operators to process their respective input data and obtain the output data of the multiple base operators. In this process, a series of base operators compiled in advance are used to separately process the segmented input data of the operator, so that the computing task of one complete operator is divided into multiple sub-tasks, and these sub-tasks are handed over to the base operators corresponding to the operator for collaborative execution, thereby improving the operating performance of the AI accelerator. It should be understood that after the output data of the multiple base operators is obtained, the output data of the multiple base operators is spliced together according to the way the input data of the operator was segmented, which yields the output data of the operator, thereby implementing the function of the operator.
The segmentation function 203 is used to generate a segmentation scheme for the operator based on the preset shape set and the shape of the input data of the operator, and to segment the input data of the operator according to the segmentation scheme to obtain the input data of the multiple base operators. The segmentation scheme includes a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method.
The storage function 204 is used to store the pre-developed and pre-compiled base operators, the preset shape set, AI models, and so on; this application is not limited thereto.
In addition, the functions of the AI accelerator 200 are not limited to the above 201 to 204. In practical applications, more functions can be configured according to user needs. Through the above functions, the AI accelerator 200 can implement the execution of the operators in an AI model and improve the operating performance of the AI accelerator 200.
The hardware structure of the above AI accelerator 200 is introduced below.
An embodiment of this application provides a chip that can be configured as the AI accelerator 200 shown in the above implementation environment. Referring to FIG. 3, FIG. 3 is a schematic diagram of the hardware structure of a chip provided in an embodiment of this application. As shown in FIG. 3, the chip 300 includes a communication interface 301, at least one AI processing core 302, a processor 303, a memory 304, and a bus 305. The communication interface 301, the at least one AI processing core 302, the processor 303, and the memory 304 are communicatively connected to each other through the bus 305.
The communication interface 301 is used to provide program instructions and/or data to the at least one AI processing core 302. The communication interface 301 includes a PCIe communication interface, other general peripheral interfaces, and the like, which is not limited in this application. For example, when the chip 300 is used as an accelerator card of the host 100, it exchanges data with the host 100 through the PCIe communication interface. For another example, the chip 300 communicates with other devices or communication networks through a peripheral interface.
The AI processing core 302 is used to implement the functions of the AI accelerator shown in FIG. 2 above, that is, to execute the operator processing method provided in the embodiments of this application. Schematically, the AI processing core adopts the Da Vinci architecture, which achieves high throughput, high computing power, and low power consumption, and is suitable for the common computations required by neural networks in deep learning, such as matrix multiplication. The specific architecture is introduced in FIG. 4 below and is not described here.
The processor 303 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or an integrated circuit for controlling the execution of the program of this application. The processor 303 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. There may be one or more processors 303. Taking a multi-core processor as an example, the multiple cores may be divided by function into a control CPU dedicated to controlling the overall operation of the chip 300 and an AI CPU dedicated to complex non-matrix computations. The number of CPU cores occupied by the two types of tasks can be dynamically allocated by software according to the actual operation of the system, which is not limited in this application.
The memory 304 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The bus 305 may include a path for transmitting information between the components of the chip 300 (for example, the communication interface 301, the at least one AI processing core 302, the processor 303, and the memory 304).
It should be noted that FIG. 3 above is only a hardware structure diagram of one chip provided by this application that can be configured as the above AI accelerator 200. In some embodiments, the chip may also include other components to implement more functions. For example, the chip further includes a task scheduler (TS) for efficient allocation and scheduling of computing tasks on the AI processing cores; this application is not limited thereto.
The architecture of the above AI processing core 302 is introduced below.
FIG. 4 is a schematic structural diagram of an AI processing core provided in an embodiment of this application. As shown in FIG. 4, taking the AI processing core 302 adopting the Da Vinci architecture as an example, the AI processing core 302 includes computing units, storage units, and a control unit.
The computing units include a matrix computing unit (cube unit), a vector computing unit (vector unit), and a scalar computing unit (scalar unit). These three computing units each perform their own duties, forming three independent execution pipelines that cooperate with each other under the unified scheduling of system software to achieve optimized computing efficiency and complete the different types of data computation in the AI processing core.
The storage units include an L1 buffer, L0 buffers, a unified buffer, general-purpose registers (GPR), special-purpose registers (SPR), and a scalar buffer. It should be understood that the above storage units refer to the internal storage of the AI processing core; the AI processing core needs to load data from external storage into internal storage before it can complete the corresponding computation. To support data transmission and movement within the AI processing core, the AI processing core also contains a bus interface unit (BIU) and memory transfer engines (MTE) MTE1, MTE2, and MTE3. The BIU is the interface through which the AI processing core interacts with the bus; the MTEs are data movement units used to move data between different buffers.
The control unit includes a system control module (system control), an instruction dispatch module (instr. dispatch), a matrix operation queue (cube queue), a vector operation queue (vector queue), and a memory transfer queue (MTE queue). The system control module is responsible for commanding and coordinating the overall operating mode of the AI processing core, configuring parameters, and implementing power consumption control. After instructions are issued in sequence through the instruction dispatch module, they are sent to the matrix operation queue, the vector operation queue, or the memory transfer queue according to their type.
In the AI processing core, the storage units provide each computing unit with data that has been transposed and meets the requirements, the computing units return the results of their operations to the storage units, and the control unit provides instruction control for the computing units and storage units; the three coordinate and cooperate with each other to complete computing tasks.
It should be noted that FIG. 4 above is only a schematic structural diagram provided by this application that can implement the functions of the above AI processing core. The AI processing core can also adopt other architectures, which is not limited in this application.
The operator processing method provided in this application is introduced below.
Based on the foregoing, this application provides a technical solution for implementing dynamic shape operators based on a preset shape set (also called a shape set), which can effectively improve the operating performance of the AI accelerator. For input data of any shape of a certain operator, the input data is segmented according to the preset shape set, so that the preset multiple base operators corresponding to the operator are used to separately process the segmented input data, and the output data of the operator is finally obtained.
For ease of understanding, the principle of the technical solution provided in this application is first introduced below with reference to FIG. 5 to FIG. 7, taking a matrix multiplication operator as an example and from the perspective of the output data of the matrix multiplication operator.
FIG. 5 is a schematic diagram of the segmentation of a matrix multiplication operator. As shown in FIG. 5, the shape of the left matrix A is (M, K), that is, an M×K matrix, and the shape of the right matrix B is (K, N), that is, a K×N matrix, where M, K, and N are all positive integers representing dimension values in the row or column direction of a matrix. The matrix C = A×B, and the shape of the matrix C is (M, N), that is, an M×N matrix. It should be understood that the computing task C = A×B usually cannot be completed in a single computation by the computing units of an AI processing core in the AI accelerator, so this computing task often needs to be segmented. The segmentation is generally viewed from the output data, that is, the matrix C. As shown by the local matrix C′ of the matrix C in FIG. 5, the local matrix C′ can be computed by multiplying the local left matrix A′ and the local right matrix B′. Therefore, the segmentation of a matrix multiplication operator usually only needs to focus on the segmentation of the matrix C. That is, by segmenting in the M and N directions, the solution of the local matrix C′ can be obtained.
FIG. 6 is a schematic diagram of another segmentation of the matrix multiplication operator. As shown in FIG. 6, on the basis of FIG. 5 above, by further segmenting in the row and column directions of the matrices, a partial solution of the local matrix C′ can be obtained. That is, multiplying the local left matrix A″ and the local right matrix B″ yields a partial solution of the local matrix C′. Multiplying each K-direction sub-block in the local left matrix A′ by the corresponding K-direction sub-block in the local right matrix B′ and accumulating the products yields the solution of the local matrix C′.
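As a concrete illustration of the tiling in FIG. 5 and FIG. 6, the following Python sketch computes one local block C′ by accumulating the products of the K-direction sub-blocks. It is illustrative only; the function name, the use of numpy, and the tile boundaries are assumptions for demonstration, not part of this application:

```python
import numpy as np

def local_block_matmul(A, B, m0, m1, n0, n1, k_tile):
    """Compute the local block C' = A[m0:m1, :] @ B[:, n0:n1] by
    accumulating over K-direction sub-blocks, mirroring FIG. 6."""
    K = A.shape[1]
    C_local = np.zeros((m1 - m0, n1 - n0), dtype=np.float32)
    for k0 in range(0, K, k_tile):
        k1 = min(k0 + k_tile, K)
        # A'' x B'' yields a partial solution of C'; summing over all
        # K-direction sub-blocks yields the full solution of C'.
        C_local += A[m0:m1, k0:k1] @ B[k0:k1, n0:n1]
    return C_local

# Usage: with A of shape (112, 80) and B of shape (80, 48),
# compute the top-left 64x32 block of C.
A = np.random.rand(112, 80).astype(np.float32)
B = np.random.rand(80, 48).astype(np.float32)
C_prime = local_block_matmul(A, B, 0, 64, 0, 32, k_tile=64)
```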
FIG. 5 and FIG. 6 above introduced the conventional segmentation of a matrix multiplication operator; the technical solution provided in this application is introduced below through FIG. 7. FIG. 7 is a schematic diagram of the segmentation of a matrix multiplication operator provided in an embodiment of this application. As shown in FIG. 7, since this application presets a shape set and presets a series of base operators related to the values in the preset shape set, for the matrix multiplication operator shown in FIG. 5 and FIG. 6 above, a segmentation scheme for the matrix multiplication operator is generated based on the preset shape set, the shape of matrix A, and the shape of matrix B. According to the segmentation scheme, matrix A and matrix B are segmented to obtain the input data of multiple base operators. Then, the multiple base operators are called to process their respective input data to obtain the output data of each base operator. It should be understood that splicing the output data of the multiple base operators together yields the output data of the matrix multiplication operator; that is, the function of the operator is implemented collaboratively through multiple base operators. Schematically, take M = 112, K = 80, N = 48 as an example, that is, the shape of the input data of the matrix multiplication operator is expressed as (M, K, N) = (112, 80, 48). From the perspective of the output data of the matrix multiplication operator, the shape of the output data, that is, the matrix C, is (112, 48). Based on the preset shape set and the shape of the input data, a segmentation scheme is generated, and the input data of the matrix multiplication operator is segmented according to the segmentation scheme to obtain the input data of multiple base operators. If the values involved in the segmentation process include, for example, 16, 32, and 64, each dimension of the input data of the matrix multiplication operator can be segmented according to the following segmentation scheme:

M(112)=64×1+32×1+16×1; K(80)=64×1+16×1; N(48)=32×1+16×1;
Accordingly, the above segmentation scheme corresponds to 12 base operators. That is, as shown in FIG. 7, matrix A is segmented to obtain matrices A1 to A6, and matrix B is segmented to obtain matrices B1 to B4. Accordingly, the local solutions of matrix C include matrices C1 to C6. The data processing process of the specific base operators is shown in the following formulas (1) to (6), where "pp" refers to a preset base operator and is used only for illustration, and does not constitute a limitation of this application:

C1=A1×B1+A2×B3=pp(64,64,32)+pp(64,16,32) (1)

C2=A1×B2+A2×B4=pp(64,64,16)+pp(64,16,16) (2)

C3=A3×B1+A4×B3=pp(32,64,32)+pp(32,16,32) (3)

C4=A3×B2+A4×B4=pp(32,64,16)+pp(32,16,16) (4)

C5=A5×B1+A6×B3=pp(16,64,32)+pp(16,16,32) (5)

C6=A5×B2+A6×B4=pp(16,64,16)+pp(16,16,16) (6)
Through the above formulas (1) to (6), the multiple base operators are called to process their respective input data, and the output data of the multiple base operators is obtained. After the output data of the multiple base operators is spliced, the output data of the matrix multiplication operator, the matrix C, is obtained.
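The assembly of the matrix C from formulas (1) to (6) can be emulated as in the following Python sketch. It is illustrative only: pp is stood in for by a numpy matrix multiplication, whereas on the AI accelerator each (m, k, n) combination would be a separate precompiled base operator, and the function names are assumptions:

```python
import numpy as np
from itertools import accumulate

def offsets(parts):
    """Turn a per-dimension split such as [64, 32, 16] into (offset, size) pairs."""
    starts = [0] + list(accumulate(parts))[:-1]
    return list(zip(starts, parts))

def run_operator(A, B, m_parts, k_parts, n_parts):
    """Splice the output of fixed-shape base operators pp(m, k, n) into C = A @ B."""
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    for mo, m in offsets(m_parts):
        for no, n in offsets(n_parts):
            for ko, k in offsets(k_parts):
                # One pp(m, k, n) base operator call, accumulated into C
                # exactly as in formulas (1) to (6).
                C[mo:mo+m, no:no+n] += A[mo:mo+m, ko:ko+k] @ B[ko:ko+k, no:no+n]
    return C

# The (M, K, N) = (112, 80, 48) example: 3 x 2 x 2 = 12 base operator calls.
A = np.random.rand(112, 80).astype(np.float32)
B = np.random.rand(80, 48).astype(np.float32)
C = run_operator(A, B, [64, 32, 16], [64, 16], [32, 16])
assert np.allclose(C, A @ B, atol=1e-3)
```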
It can be seen that this application provides a solution that uses statically compiled base operators, through dynamic combination, to collaboratively implement dynamic shape computing tasks. It should be understood that FIG. 5 to FIG. 7 above only took the matrix multiplication operator as an example; the technical solution provided in this application is applicable to all types of operators, which are not described here one by one.
Based on the above introduction to the principle of the technical solution of this application, its specific implementations are introduced below. Combining the implementation environment shown in FIG. 1 above with the principle of the technical solution, it can be seen that the preset shape set and the base operators involved in the operator processing method both need to be set in advance, and the host can provide various development functions for operators. Therefore, for ease of understanding, the technical solution of this application is introduced below in two stages: the operator development stage and the operator execution stage.
Operator development stage
In this stage, the host provides users with various development functions for operators, such as setting the preset shape set, writing operators, generating operator code, and writing base operators. For example, the host displays an operator development interface, and by performing various operator development operations on the operator development interface, the user triggers the host to implement various operator development functions; this application is not limited thereto. Schematically, taking the development of any dynamic shape operator as an example, this stage involves the following steps.
Step 1. Set the preset shape set.
The preset shape set includes multiple values for segmenting data, where a value indicates a dimension value of the data. The size and number of the values in the preset shape set can be set according to user needs. It should be noted that the preset shape set needs to be "complete", that is, any shape can be spliced from the values in the preset shape set; in other words, input data of any shape can be segmented according to the values in the preset shape set (for example, any shape can be spliced using the shape of the smallest granularity). In addition, the sizes of the values in the preset shape set need to take into account whether the hardware capabilities are fully utilized (for example, the computing power of the computing units in the AI processing core and the bandwidth and size of the storage units), which ensures the operating performance of the operator. At the same time, as the number of values in the preset shape set increases, the number of base operators corresponding to the preset shape set often rises exponentially (for example, if the shape of the input data of an operator corresponds to three values, the number of base operators is cubic in the number of values; see FIG. 7 above for details, which are not repeated here). Based on this, this application provides a specific way of setting the preset shape set that balances operator performance against the number of base operators, thereby effectively improving the operating performance of the AI accelerator, or in other words, the operating performance of the computing device where the AI accelerator is located. Schematically, the sizes of the values in the preset shape set are associated with at least one of the following:
(1) The data type corresponding to the instructions executed by the AI accelerator. The data type is, for example, FP16 or FP32; this application is not limited thereto. Taking the AI processing core shown in FIG. 4 as an example, since both the computing unit instructions and the cache transfer instructions require 32B alignment, and 32B corresponds to 16 FP16 values, the minimum value in the preset shape set can be set to 16 in this case.
(2) The data range corresponding to the instructions executed by the AI accelerator. The data range corresponding to an instruction can also be understood as the input parameter limits of the instruction, for example, the length cannot exceed 1024 characters, the count cannot exceed 100, and so on, which is not limited in this application. Schematically, take a data transfer instruction as an example: the instruction is used for data movement, its value range is [1, 65535], and its unit is 32B. The instruction can therefore only move data with a minimum granularity of 32B; that is, for data of the FP16 type, this corresponds to 16 values, so the minimum value in the preset shape set can be set to 16 in this case.
(3) The sizes of the cache spaces at each level in the AI accelerator. It should be understood that the AI accelerator usually includes multiple levels of cache space, for example, the UB (256KB), the L1 buffer (1MB), the L0A/L0B buffers (64KB), the L0C buffer (256KB), and so on; see FIG. 4 above for details. Since the dimension values and the number of dimensions of a piece of data determine how much cache space it occupies, the sizes of the values in the preset shape set need to take into account the sizes of the cache spaces at each level. For example, for a matrix multiplication operator, the M and K dimensions of the matrix on L0A determine how much of the L0A buffer is occupied, and the K and N dimensions of the matrix on L0B determine how much of the L0B buffer is occupied (the left matrix data is stored in the L0A buffer, and the right matrix data is stored in the L0B buffer). Therefore, taking a left matrix of data type FP16 as an example, if a ping-pong cache mechanism is used to implement base operator scheduling, the size of the matrix stored in L0A must satisfy M×K≤16384 (corresponding to 32KB, that is, half of 64KB); that is, the upper limit of the cache space occupied in L0A is 32KB.
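The L0A constraint above reduces to simple arithmetic; the following sketch (buffer sizes taken from FIG. 4; the constant and function names are illustrative assumptions) checks whether an FP16 left-matrix tile fits within one ping/pong half of the L0A buffer:

```python
L0A_BYTES = 64 * 1024            # L0A buffer size, from FIG. 4
PING_PONG_HALF = L0A_BYTES // 2  # 32KB per ping/pong half
FP16_BYTES = 2

def fits_in_l0a(m, k):
    """Check the M x K <= 16384 constraint for an FP16 left-matrix tile."""
    return m * k * FP16_BYTES <= PING_PONG_HALF

assert fits_in_l0a(128, 128)       # 128 * 128 = 16384, exactly 32KB
assert not fits_in_l0a(1024, 32)   # 1024 * 32  = 32768, too large
```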
An optional implementation of the preset shape set provided in this application is introduced below.
This application provides a way of dividing the dimension space based on arithmetic value sequences. That is, the values in the preset shape set include at least two arithmetic value sequences, where the difference between the value at the tail of a target sequence and the value at the head of the sequence adjacent to the target sequence is greater than the common difference of the target sequence. For example, the preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, where {16, 32, 48, 64, 80, 96, 112, 128} is the target sequence, in which the values increase with a common difference of 16, and {256, 384, 512, 640, 768, 896, 1024, 1152, 1280} is the sequence adjacent to the target sequence, in which the values increase with a common difference of 128. Schematically, the above can also be understood as a way of dividing the dimension space based on multi-level spans, that is, the values in the preset shape set increase according to multi-level spans. For example, for the preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, the first-level span is 16 and the second-level span is 128. For the first-level span, the values start from 16 and increase by the span of 16, yielding 32, 48, 64, 80, 96, 112, and 128; at this point the second-level span is reached, and the values continue to increase from 128 by the second-level span of 128, yielding 256, 384, and so on.
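The multi-level span construction can be written down directly. The following sketch reproduces the example set above; the function name and the (span, bound) representation of a level are illustrative assumptions:

```python
def build_shape_set(levels):
    """Generate a preset shape set from multi-level spans.
    levels: list of (span, bound) pairs; values increase by the current
    span until the bound of that level is reached."""
    values, v = [], 0
    for span, bound in levels:
        while v + span <= bound:
            v += span
            values.append(v)
    return values

# First-level span 16 up to 128, then second-level span 128 up to 1280:
shape_set = build_shape_set([(16, 128), (128, 1280)])
# -> [16, 32, 48, 64, 80, 96, 112, 128, 256, 384, ..., 1280]
```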
In some embodiments, the number of values corresponding to the shapes of the input data of the multiple base operators is associated with the number of sequences in the preset shape set (or, without limitation, with the number of levels of the multi-level spans). That is, the number of sequences in the preset shape set can be used to constrain the number of values used when segmenting input data. For example, suppose the shape of the input data of a certain operator is (112). Without a constraint on the number of values, this input data admits multiple segmentation schemes, for example, 112 = 64×1 + 32×1 + 16×1, which involves 3 values, or 112 = 64×1 + 48×1, which involves 2 values. As introduced above, when a segmentation scheme involves too many values, the number of corresponding base operators grows exponentially, which affects the operating performance of the AI accelerator; moreover, an increase in the number of base operators also makes the overhead of base operator scheduling excessive. If a constraint on the number of values is considered, for example, the number of values corresponding to the shapes of the input data of the multiple base operators is equal to the number of sequences in the preset shape set plus 1 (or the number of levels of the multi-level spans plus 1; it should be understood that this is only an example that can be set according to user needs and does not constitute a limitation of the technical solution of this application), the number of values involved in a segmentation scheme can be effectively controlled. For example, if the number of sequences in the preset shape set is 1, the number of values corresponding to the shapes of the input data of the multiple base operators is at most 2. Taking input data of shape (112) as an example, its segmentation scheme then involves at most 2 values, for example 112 = 112×1, which involves 1 value. This effectively reduces the number of base operators, saving computing resources and thereby improving the operating performance of the AI accelerator.
Based on the above, two examples of the preset shape set are given below:
The first type
The preset shape set includes two arithmetic value sequences, and the maximum number of values involved in a segmentation scheme is 3. The preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, where the total number of values is 17, which can be adjusted according to needs.
The second type
The preset shape set includes two arithmetic value sequences, and the maximum number of values involved in a segmentation scheme is 3. The preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 512, 768, 1024, 1280, 1536, 1792, 2048}, where the total number of values is 23, which can be adjusted according to needs.
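A segmentation of one dimension under the value-count constraint described above can be searched for as in the following sketch. It is a brute-force illustration under assumed names; a real implementation would more likely rank candidates with the cost model described in Step 3 below:

```python
from itertools import combinations

def split_dimension(dim, shape_set, max_distinct):
    """Express dim as a sum of shape-set values using at most max_distinct
    distinct values; larger values are tried first, giving fewer pieces."""
    vals = sorted(shape_set, reverse=True)
    for r in range(1, max_distinct + 1):
        for subset in combinations(vals, r):
            counts = _solve(dim, subset)
            if counts:
                return dict(zip(subset, counts))
    return None  # dim cannot be spliced under this constraint

def _solve(dim, subset):
    """Depth-first search for positive counts of each value in subset."""
    if len(subset) == 1:
        q, rem = divmod(dim, subset[0])
        return (q,) if rem == 0 and q > 0 else None
    v = subset[0]
    for n in range(dim // v, 0, -1):
        rest = _solve(dim - n * v, subset[1:])
        if rest:
            return (n,) + rest
    return None

# With the first example set, 112 is itself in the set, so the result is {112: 1}.
print(split_dimension(112, [16, 32, 48, 64, 80, 96, 112, 128], max_distinct=3))
```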
It should be noted that the host can automatically generate the preset shape set based on constraints set by the user (as described above). For example, taking the AI processing core shown in FIG. 4 as an example, since the matrix in the L0A buffer must satisfy M×K≤16384 (corresponding to 32KB), when M_L0A = 1024, K_L0A ≤ 16; and when M_L0A = 128, K_L0A ≤ 128. It should be understood that this example describes the consequence of the L0A buffer constraint; the M_L0A and K_L0A described here are the dimensions of the matrix in the L0A buffer, not the M and K dimensions of the actual matrix multiplication operator. Usually, the AI processing core contains multiple levels of cache space whose sizes differ greatly between levels, so the corresponding constraints also differ. Therefore, the setting of the preset shape set can comprehensively consider the relationships of data movement between the multi-level caches and the constraints of the multi-level cache spaces; this application is not limited thereto. For example, the maximum amount of data transferred between cache levels can be limited to be no larger than the block size, and for different internal segmentation methods of base operators, different movement and computation orders can be designed to reuse data as much as possible, reduce movement, and form a pipeline. This process is further introduced in the second part below and is not described here.
Further, the user can also adjust the preset shape set according to quality evaluation information of the preset shape set (for example, through automatic or manual iterative adjustment), so that the adjusted preset shape set satisfies a target condition, ensuring the reasonableness of the preset shape set. The target condition refers to a condition set by the user for evaluating the quality of the preset shape set and can be adjusted according to actual needs. For example, the target condition includes: the total number of base operators corresponding to the preset shape set meets the requirement, the number of values involved in the segmentation scheme meets the requirement, the utilization of the hardware capabilities meets the requirement, and so on. Schematically, after the preset shape set is initially set, the preset shape set is iteratively adjusted by evaluating its quality. Several aspects involved in the above target condition are introduced below:
(1) The total number of base operators corresponding to the preset shape set (this affects the binary file size: the more base operators, the more binary files, and thus the more space occupied).
(2) The number of values involved in the segmentation scheme (this affects the complexity of operator scheduling, for example by causing instruction cache misses, thereby affecting operating performance).
(3) The utilization of the hardware capabilities, for example, whether the base operators corresponding to the preset shape set include base operators that fully use the capabilities of the computing units and storage units (this affects the performance upper limit of the spliced operator: the higher the hardware utilization, the higher the performance upper limit). For example, referring to the AI processing core shown in FIG. 4 above, the computing capability of the matrix multiplication computing unit (cube) is one matrix multiplication of shape (M, K, N) = (16, 16, 16) per cycle, while the input constraints of the matrix multiplication computing instruction are limited by the L0A and L0B cache spaces. Therefore, for the AI processing core shown in FIG. 4, the criteria for fully using the hardware capabilities may include: fully utilizing the L0A and L0B cache spaces; fully utilizing the bandwidth of the cache spaces at each level; and, under the constraints of the L0A and L0B cache spaces, using as much of the computing capability of the computing unit as possible. Of course, the above introduction to fully utilizing hardware capabilities is only schematic; hardware capability can also be evaluated through other factors, for example, theoretical upper-bound models of compute bound and memory bound, the cache hit rates of the multi-level cache spaces, scheduling overhead, and so on.
Step 2. Set the base operators.
A base operator refers to a pre-developed and pre-compiled operator used to process input data of a fixed shape, where the fixed shape corresponds to the values in the preset shape set; a base operator can also be understood as a basic operator used for segmenting operators. An embodiment of this application provides a solution for the automatic implementation of base operators: after the base operators are developed and compiled, a unified base operator scheduling mechanism is adopted during their actual operation to achieve seamless splicing of the scheduling of multiple base operators. This automatic implementation solution is introduced below in the following two points:
(1) Base operator scheduling mechanism
In the embodiments of this application, when generating base operators, it is taken into account that a unified base operator scheduling mechanism is needed to achieve seamless splicing of the scheduling of multiple base operators. Therefore, the cache space allocation method implemented inside each base operator needs to be constrained based on a unified internal cache space mechanism of the AI accelerator. In this way, when the AI accelerator subsequently calls multiple base operators to process their respective input data, the cache spaces at each level in the AI accelerator can be allocated at a unified granularity during data reads and writes, thereby achieving seamless splicing of the scheduling of multiple base operators. For example, a ping-pong cache mechanism is used to achieve seamless splicing of the scheduling of multiple base operators: the AI accelerator usually includes multiple levels of cache space, and by opening up ping-pong cache spaces at each cache level, the computing units and the data movement units execute concurrently in a pipeline, reducing processing latency.
The principle of the above process is introduced below with reference to FIG. 8, which is a schematic diagram of a base operator scheduling mechanism provided in an embodiment of this application. As shown in FIG. 8, taking the AI processing core shown in FIG. 4 above as an example, the AI processing core internally includes multiple levels of cache space, and the implementation of each base operator usually includes three stages: input movement of data, computation, and output movement. If the next base operator is executed only after the previous base operator has finished, the hardware resources of some computing units and data movement units are left idle. Based on this, this application provides a base operator scheduling mechanism that uses a ping-pong cache mechanism to achieve seamless splicing of the scheduling of multiple base operators. However, the pipelined concurrency of computation and data movement depends on a unified division of the cache space: while a computing unit is reading and processing the data in the ping cache space, the data movement unit may only write data into the pong cache space; otherwise, a data synchronization hazard would occur. Therefore, the cache spaces at each level in the AI accelerator are allocated at a unified granularity, so that the cache space allocation method implemented inside every base operator is unified, thereby achieving seamless splicing of the scheduling between base operators. For example, referring to FIG. 9, which is a schematic diagram of a cache space allocation method provided in an embodiment of this application, and taking the AI processing core shown in FIG. 4 above as an example, the cache space of the L1 buffer is 1MB, the cache spaces of the L0A and L0B buffers are 64KB each, and the cache spaces of the L0C and UB buffers are 256KB each. Based on this, the cache space is uniformly divided into blocks of 32KB. Whenever a base operator needs to occupy cache space internally, the space is allocated in units of blocks, and unified scheduling then prevents different base operators from reading and writing the same block at the same moment. This avoids data read/write conflicts and achieves pipelined parallelization of computation and data movement throughout the execution process; that is, when the AI accelerator subsequently calls multiple base operators to process their respective input data, the cache spaces at each level in the AI accelerator can be allocated at a unified granularity during data reads and writes, thereby achieving seamless splicing of the scheduling of multiple base operators.
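A minimal sketch of the block-granularity allocation described above could look as follows (buffer sizes from FIG. 9; the class and method names are illustrative assumptions). Allocating two equal block sets and alternating between them gives the ping-pong behavior:

```python
BLOCK = 32 * 1024  # unified allocation granularity: 32KB blocks

class BufferPool:
    """Block-granularity allocator over one on-chip buffer,
    e.g. the 1MB L1 buffer, which holds 32 blocks of 32KB."""
    def __init__(self, size_bytes):
        self.free = list(range(size_bytes // BLOCK))

    def alloc(self, nbytes):
        nblocks = -(-nbytes // BLOCK)  # ceiling division
        if len(self.free) < nblocks:
            raise MemoryError("buffer exhausted")
        return [self.free.pop() for _ in range(nblocks)]

    def release(self, blocks):
        self.free.extend(blocks)

l1 = BufferPool(1024 * 1024)
ping = l1.alloc(64 * 1024)  # written by the data movement unit while...
pong = l1.alloc(64 * 1024)  # ...the computing unit reads the other half
```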
In some embodiments, based on the above introduction to the preset shape set, combinations of different values in the preset shape set determine how much cache space is occupied; therefore, the above unified division of the cache space can also be used to constrain the sizes of the values in the preset shape set. Schematically, the ping-pong cache implementation used for concurrent pipelining requires that the cache space implementing the ping-pong cache contain two cache areas of equal size. Taking the L0A buffer as an example, the cache space of the L0A buffer is 64KB, so the upper limit of each ping/pong cache area is 32KB; hence, the data size obtained from a combination of values in the preset shape set needs to be less than or equal to 32KB. Of course, this example is only schematic, and the sizes of the values in the preset shape set can be adjusted according to actual needs.
(2) Developing internal implementation templates for base operators
In the embodiments of this application, since the AI accelerator internally includes multiple levels of cache space of different sizes, after the AI accelerator obtains data from outside, the data still has to be segmented internally before the computation is completed step by step. Based on this, a base operator may also include an internal segmentation process, and different internal segmentation schemes affect the performance of the base operator differently. Since the shape of the input data of a base operator is fixed, a corresponding segmentation scheme can be generated for each base operator under the constraints described in point (1) above. In this process, developers can first write, based on experience, internal implementation templates of base operators with different segmentation methods, and then select the most suitable implementation for each base operator through performance testing. The performance-test optimization process after the base operators are implemented from different templates can be completed automatically by tools such as scripts; this application does not limit the specific implementation of the base operators. Of course, the shape of the data processed by a base operator can also be passed as a function parameter, and the function of the base operator can be implemented by calling an already configured template; that is, there is no need to generate a base operator code file, which greatly reduces the amount of base operator code, further shrinks the compiled binary files, and saves resource usage.
Schematically, referring to FIG. 10, which is a schematic diagram of a segmentation method of a base operator provided in an embodiment of this application, take the base operator M2_K2_N2 corresponding to the multiplication operator as an example: M2_K2_N2 means that the M, K, and N dimensions are each segmented into two equal parts; the same applies to the other base operators and is not repeated here. Base operators that process data of different shapes are suited to different internal segmentation implementations. By performance-testing individual base operators, the most suitable implementation for each base operator can be selected; this application is not limited thereto.
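The script-driven selection of the best internal template can be sketched as follows. This is illustrative only: the template names follow the M2_K2_N2 convention above, and each template is assumed to be represented as a zero-argument callable that runs the fixed-shape base operator once:

```python
import time

def pick_best_template(templates, repeats=10):
    """templates: mapping from template name (e.g. 'M2_K2_N2') to a
    zero-argument callable running the fixed-shape base operator with
    that internal segmentation. Returns the fastest template's name."""
    timings = {}
    for name, run in templates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            run()
        timings[name] = (time.perf_counter() - start) / repeats
    return min(timings, key=timings.get)
```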
Step 3. Set the generation method of the segmentation scheme.
In the embodiments of this application, the segmentation scheme is generated based on the preset shape set and the shape of the input data of the operator, and includes the segmentation method for the input data of the operator and the multiple base operators corresponding to the operator. After the input data of the operator is segmented according to the segmentation scheme, the input data of the multiple base operators corresponding to the operator can be obtained. Considering that, for any shape, segmenting according to the preset shape set usually yields multiple segmentation schemes, this application provides a segmentation scheme determination method with automatic load balancing: the costs of different segmentation schemes are determined, and the segmentation scheme with the minimum cost is selected, so as to improve the operating performance of the AI accelerator. The cost of a segmentation scheme indicates the predicted time required to obtain the output data of the operator by calling the multiple base operators for data processing according to the segmentation scheme.
In some embodiments, the segmentation scheme further includes a computing power resource allocation method for calling the multiple base operators, and the computing power resource allocation method is associated with at least one of the following: allocation based on the number of cores in the AI accelerator; allocation based on the number of threads in the AI accelerator; allocation based on the number of warps in the AI accelerator; and allocation based on the number of logical blocks in the AI accelerator. It should be understood that, taking allocation based on the number of cores in the AI accelerator as an example, the AI accelerator usually includes multiple AI processing cores (as shown in FIG. 3 above), for example 32 AI processing cores. When the segmentation scheme is generated, a complete operator is divided evenly by the number of cores and distributed to the AI processing cores to run, so that the computing power resources can be fully utilized. Of course, other allocation methods can also be used, which is not limited in this application. Schematically, referring to FIG. 11, which is a schematic diagram of a segmentation scheme provided in an embodiment of this application and is illustrated from the perspective of the output data of the operator: the AI accelerator includes 6 AI processing cores, so the input data of the operator is evenly segmented by the number of cores, each AI processing core processes one 3×3 small matrix, further segmentation is performed within each AI processing core, and by calling multiple base operators to process their respective input data, the output data of each base operator, and thus the output data of the complete operator, is obtained. Similarly, segmentation can also be performed according to the number of threads, the number of warps, the number of logical blocks, and so on; this application is not limited thereto.
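An even, core-count-based distribution in the spirit of FIG. 11 can be sketched as follows. The sketch rests on assumptions: output tiles are identified by block coordinates and dealt out round-robin, whereas other assignment orders (such as the contiguous 3×3 regions of FIG. 11) are equally possible:

```python
def split_across_cores(m_blocks, n_blocks, num_cores):
    """Evenly assign output tiles (block-row, block-col) to AI processing cores."""
    tiles = [(i, j) for i in range(m_blocks) for j in range(n_blocks)]
    return [tiles[c::num_cores] for c in range(num_cores)]

# A 6x6 grid of output tiles on 6 cores -> 6 tiles per core; each core
# then segments its tiles further into base operator calls.
assignment = split_across_cores(6, 6, 6)
assert all(len(t) == 6 for t in assignment)
```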
Taking computing power resource allocation based on the number of cores in the AI accelerator as an example, the way of determining the cost of a segmentation scheme is illustrated below. Schematically, this application provides a cost model through which the costs of different segmentation schemes can be determined, providing technical support for determining the final segmentation scheme of the operator. The cost model is shown in the following formulas (7) and (8):

t_estimate = t_split-core scheduling overhead × number of cores + t_base operators (7)

t_base operators = Σ(t_base operator × most time-consuming stage ratio) (8)

In the above formulas, t_estimate refers to the predicted time required to obtain the output data of the operator by calling the multiple base operators for data processing according to the segmentation scheme. t_split-core scheduling overhead is an estimate obtained through testing: for example, computations are run using 1 core, 2 cores, ..., 32 cores in turn, with a computing task of the same size set on each core; in principle, if there were no extra overhead, the total time should remain unchanged as the number of cores increases, but testing reveals a linear increase, from which the extra overhead of split-core scheduling can be estimated. Accordingly, t_split-core scheduling overhead × number of cores is the predicted time spent on computing power resource allocation under the chosen allocation method. t_base operator is an estimate obtained through testing: for example, the data shape of a computing task is set to the size of a base operator under test, and the task is constrained to run on a single core, yielding the predicted time of running that base operator. The most time-consuming stage ratio is the share, obtained through testing, of the longest stage during the operation of the base operator; in other words, the contribution of the base operator to the total execution time is calculated according to the time of its longest pipeline stage. For example, referring to FIG. 12, which is a schematic diagram of the most time-consuming stage ratio provided in an embodiment of this application, the most time-consuming stage is the input movement stage (it should be noted that this is only an example and does not constitute a limitation of this application). Accordingly, Σ(t_base operator × most time-consuming stage ratio) is the predicted time of the multiple base operators.
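Evaluating formulas (7) and (8) over candidate schemes and keeping the minimum can be sketched as follows. The sketch rests on assumptions: t_base and stage_ratio are dictionaries filled from the single-core measurements described above, and each candidate scheme records its core count and the base operators it schedules:

```python
def predicted_cost(num_cores, base_ops, t_overhead, t_base, stage_ratio):
    """Formulas (7)/(8): split-core scheduling overhead plus the
    predicted time of the scheduled base operators."""
    t_ops = sum(t_base[op] * stage_ratio[op] for op in base_ops)
    return t_overhead * num_cores + t_ops

def pick_min_cost_scheme(schemes, t_overhead, t_base, stage_ratio):
    """Automatic load balancing: return the scheme with the minimum predicted time."""
    return min(schemes, key=lambda s: predicted_cost(
        s["num_cores"], s["base_ops"], t_overhead, t_base, stage_ratio))
```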
The operator development stage described above is summarized below with reference to Figure 13, a flow chart of an operator development stage provided in an embodiment of this application. As shown in Figure 13, the operator development stage is executed by the host and includes the following steps 1301 to 1306.
1301. The host generates a preset shape set for the operator based on constraint conditions, the preset shape set including multiple numerical values for segmenting data.

Here, the operator is a dynamic shape operator, and the constraint conditions are user-defined and can be adjusted according to actual needs. For example, they include the data types corresponding to the instructions executed by the AI accelerator, the data ranges corresponding to those instructions, the sizes of the cache spaces at each level of the AI accelerator, the data-transfer relationships between the cache levels, the granularity at which cache space is uniformly partitioned, and so on; for details refer to the first and second parts above, which are not repeated here.
1302. The host obtains quality evaluation information for the preset shape set and adjusts the preset shape set based on this information, so that the adjusted preset shape set meets the target condition.

Here, the target condition is a user-defined condition for evaluating the quality of the preset shape set and can be adjusted according to actual needs. For example, the target condition may require that the total number of base operators corresponding to the preset shape set meets requirements, that the number of numerical values involved in segmentation schemes meets requirements, that the utilization of hardware capabilities meets requirements, and so on. For the specific content of the quality evaluation information, refer to the first part above, which is not repeated here.
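Steps 1301 and 1302 amount to a generate-evaluate-adjust loop. The sketch below illustrates one possible shape of that loop; all of the criteria functions (`evaluate_quality`, `meets_target`, `refine`) are hypothetical stand-ins for the constraint and target conditions described above:

```python
# Hypothetical generate/evaluate/adjust loop for the preset shape set (steps 1301-1302).

def generate_shape_set(constraints: dict) -> list[int]:
    """Produce an initial candidate set of segmentation values from the constraints."""
    step = constraints["granularity"]  # e.g. the unified cache-partition granularity
    return list(range(step, constraints["max_dim"] + 1, step))

def evaluate_quality(shape_set: list[int]) -> dict:
    """Stand-in quality metrics: base-operator count, hardware utilization, ..."""
    return {"num_base_ops": len(shape_set), "utilization": 0.9}  # placeholder values

def meets_target(quality: dict, target: dict) -> bool:
    return (quality["num_base_ops"] <= target["max_base_ops"]
            and quality["utilization"] >= target["min_utilization"])

def refine(shape_set: list[int]) -> list[int]:
    """Stand-in adjustment: e.g. drop every other value to shrink the set."""
    return shape_set[::2]

def build_preset_shape_set(constraints: dict, target: dict) -> list[int]:
    shapes = generate_shape_set(constraints)
    while not meets_target(evaluate_quality(shapes), target):
        shapes = refine(shapes)
    return shapes
```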
1303. The host generates the code files of the base operators based on the segmentation modes of the base operators corresponding to the operator.

Here, a base operator's segmentation mode is the internal implementation template of the base operator described in the second part above. The host can automatically generate the base operator's code file according to the base-operator segmentation mode provided by the user; for details refer to the second part above, which is not repeated here.
1304. The host constructs the operator's cost model based on the code files of the base operators.

For the specific implementation of the cost model, refer to the third part above, which is not repeated here.
1305. The host constructs the operator's segmentation function based on the preset shape set, the segmentation function being used to generate the operator's segmentation scheme.

Here, the segmentation function is the function code used to generate the operator's segmentation scheme: based on the preset shape set and the shape of the operator's input data, it outputs segmentation parameters, that is, it generates the segmentation scheme for that operator. Schematically, the segmentation parameters are a group of parameters indicating how the input data is to be segmented and computed, including source data addresses, destination data addresses, address offsets, base operator types and counts, the counts after segmentation along each dimension, loop counts, and so on; this application is not limited thereto.
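The sketch below shows one possible signature for such a segmentation function; the parameter record is a simplification, and the greedy largest-value-first tiling is an assumption, not necessarily the strategy used in this application:

```python
# Hypothetical segmentation function (step 1305): cover one dimension of the input
# with values from the preset shape set, producing per-tile segmentation parameters.
from dataclasses import dataclass

@dataclass
class TileParams:
    src_offset: int    # address offset into the operator's input data
    dst_offset: int    # address offset into the operator's output data
    base_op_size: int  # which base operator (by tile size) handles this tile

def segmentation_function(dim: int, shape_set: list[int]) -> list[TileParams]:
    """Greedily cover a dimension of length `dim` with preset values, largest first."""
    tiles, offset = [], 0
    for size in sorted(shape_set, reverse=True):
        while dim - offset >= size:
            tiles.append(TileParams(src_offset=offset, dst_offset=offset, base_op_size=size))
            offset += size
    assert offset == dim, "shape set must be able to cover any supported dimension"
    return tiles

# Example: cover a length-100 dimension with preset values {64, 32, 4}:
print(segmentation_function(100, [4, 32, 64]))  # -> tiles of size 64, 32, 4
```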
1306. The host generates the operator's code file based on the code files of the base operators, the operator's cost model, and the operator's segmentation function.

Here, the operator's code file includes all the code files involved in running the operator, for example the code files of the base operators, the operator's cost model, the operator's segmentation function, and the logic code file that schedules the base operators in a unified way; this application is not limited thereto.
Operator execution phase
Building on the above introduction to the operator development stage, the flow of the operator execution stage, that is, the operator processing method provided by this application, is introduced below. From the implementation environment shown in Figure 1 above, the operator execution function can be implemented by the AI accelerator 200 itself, or through interaction between the host 100 and the AI accelerator 200; the two implementations are introduced below with reference to Figures 14 and 15, respectively.
Figure 14 is a flowchart of an operator processing method provided in an embodiment of this application. As shown in Figure 14, the method is executed by an AI accelerator and includes the following steps 1401 to 1404.
1401. The AI accelerator obtains the input data of an operator, the operator being a dynamic shape operator.

Here, the operator is any dynamic shape operator of the AI model, and the input data may be sent by the host or may be the output data of other operators associated with this operator while the AI accelerator runs the AI model; this application is not limited thereto.
1402. The AI accelerator generates the operator's segmentation scheme based on the preset shape set and the shape of the operator's input data, the segmentation scheme including the segmentation mode for the operator's input data and the multiple base operators corresponding to that mode.

Here, the multiple base operators are used to collaboratively implement the operator's function. The preset shape set includes at least two arithmetic value sequences, where the difference between the value at the tail of a target sequence and the value at the head of the sequence adjacent to the target sequence is greater than the common difference of the target sequence. In some embodiments, the numerical values corresponding to the shapes of the input data of the multiple base operators are associated with the number of sequences in the preset shape set.
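As a purely illustrative example of this structure (the concrete numbers are hypothetical), a preset shape set might consist of two arithmetic sequences whose inter-sequence gap exceeds the first sequence's common difference:

```python
# Hypothetical preset shape set built from two arithmetic value sequences.
seq_a = list(range(4, 33, 4))     # 4, 8, ..., 32 with common difference 4
seq_b = list(range(64, 257, 32))  # 64, 96, ..., 256 with common difference 32

# The gap between the tail of seq_a (32) and the head of seq_b (64) is 32,
# which is greater than seq_a's common difference of 4, as required.
assert seq_b[0] - seq_a[-1] > 4

preset_shape_set = seq_a + seq_b
```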
In some embodiments, the segmentation scheme further includes a computing power resource allocation mode for calling the multiple base operators, the allocation mode being associated with at least one of: allocation of computing resources based on the number of cores in the AI accelerator; allocation based on the number of threads in the AI accelerator; allocation based on the number of warps in the AI accelerator; allocation based on the number of logic blocks in the AI accelerator.
In some embodiments, as described in the operator development stage above, this application provides a way of determining a segmentation scheme with automatic load balancing: by determining the costs of different segmentation schemes, the scheme with the lowest cost is selected, so as to improve the operating performance of the AI accelerator. Schematically, step 1402 includes the following two steps:
Step A. Based on the preset shape set and the shape of the operator's input data, generate multiple candidate segmentation schemes for the operator, each candidate scheme including a candidate segmentation mode for the operator's input data and the multiple candidate base operators corresponding to that mode.

Step B. Determine, from the multiple candidate segmentation schemes, the segmentation scheme that meets the target condition.
Here, the AI accelerator determines the cost of each candidate segmentation scheme, the cost indicating the predicted time consumption of obtaining the operator's output data by calling the multiple candidate base operators for data processing according to that candidate scheme; the candidate scheme with the smallest cost among the candidates is determined as the segmentation scheme.

In some embodiments, a candidate segmentation scheme further includes a computing power resource allocation mode for calling the multiple candidate base operators. Schematically, the AI accelerator determining the cost of each candidate scheme includes: determining the cost of the candidate scheme based on the predicted time consumption of the multiple candidate base operators indicated by the scheme and the predicted time consumption of allocating computing resources under the allocation mode.

For the above process, refer to the third part of the operator development stage above, specifically the cost model shown in formulas (7) and (8), which is not repeated here.
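A minimal sketch of steps A and B, reusing the hypothetical `predicted_cost_us` from the cost-model sketch above, might look as follows; the candidate list is an illustrative placeholder:

```python
# Hypothetical selection of the minimum-cost candidate scheme (steps A and B).
# Each candidate is (base_op_shapes, num_cores); predicted_cost_us is the
# cost-model sketch defined earlier.

candidates = [
    ([(64, 64)], 1),        # one large base operator on a single core
    ([(32, 32)] * 4, 4),    # four medium base operators on 4 cores
    ([(16, 16)] * 16, 8),   # many small base operators on 8 cores
]

best = min(candidates, key=lambda c: predicted_cost_us(c[0], c[1]))
print("selected scheme:", best)
```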
1403. The AI accelerator segments the operator's input data according to the segmentation scheme to obtain the input data of the multiple base operators.

1404. The AI accelerator calls the multiple base operators to process their respective input data, obtaining the output data of the multiple base operators.

Here, while the AI accelerator calls the multiple base operators to process their respective input data, a unified granularity is used to allocate the cache spaces at each level of the AI accelerator when reading and writing data. For example, based on a ping-pong buffering mechanism, a unified granularity (such as a 32 KB block) is used to allocate the cache spaces at each level of the AI accelerator. It should be understood that, since the multiple base operators can collaboratively implement the operator's function, their output data is also the output data of the target operator; in other words, splicing the output data of the multiple base operators together yields the operator's output data. In some embodiments, while the AI accelerator calls the multiple base operators to process their respective input data, each base operator has its own destination video memory address; after the multiple base operators finish executing, the corresponding output data has been written to each destination address in the output memory, that is, the operator's output data has been obtained.
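The following sketch illustrates the ping-pong double-buffering idea with a unified block granularity; the 32 KB block size comes from the example above, while the buffer-pool class itself is a hypothetical simplification:

```python
# Hypothetical ping-pong buffering with a unified allocation granularity.
BLOCK_BYTES = 32 * 1024  # unified 32 KB granularity, as in the example above

class PingPongBuffers:
    """Two fixed-size blocks: while one is computed on, the other is being loaded."""
    def __init__(self):
        self.blocks = [bytearray(BLOCK_BYTES), bytearray(BLOCK_BYTES)]
        self.active = 0

    def load_target(self) -> bytearray:
        return self.blocks[1 - self.active]  # block being filled by the next transfer

    def compute_source(self) -> bytearray:
        return self.blocks[self.active]      # block the base operator reads from

    def swap(self) -> None:
        self.active = 1 - self.active        # transfer and compute roles alternate
```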
It should be noted that the specific implementations of steps 1403 and 1404 have been described in detail in Figures 5 to 7 above and in the operator development stage, and are not repeated here. It should be understood that, when the AI accelerator includes multiple AI processing cores, if the segmentation scheme also indicates a computing power resource allocation mode for the multiple base operators, such as even allocation by the number of processing cores, each AI processing core outputs one piece of data, and splicing the outputs of the multiple AI processing cores together yields the operator's output data; refer to Figure 11 above for details, which are not repeated here.
In summary, the operator processing method provided by this application uses a preset shape set to dynamically segment the input data of an operator of arbitrary shape, so that by calling the multiple base operators corresponding to the operator, the execution of the dynamic shape operator is realized, effectively improving the operating performance of the AI accelerator.
Figure 15 is a flowchart of another operator processing method provided in an embodiment of this application. As shown in Figure 15, it is introduced taking the interaction between the host and the AI accelerator as an example, and includes the following steps 1501 to 1506.
1501. The host generates the operator's segmentation scheme based on the preset shape set and the shape of the operator's input data, the preset shape set including multiple numerical values for segmenting data, the values indicating dimension values of the data, and the segmentation scheme including the segmentation mode for the operator's input data and the multiple base operators corresponding to that mode.

Here, the process by which the host generates the operator's segmentation scheme is the same as the process by which the AI accelerator generates the segmentation scheme shown in Figure 14 above, and is therefore not repeated.
1502. The host sends the operator's input data and the operator's segmentation scheme to the AI accelerator.

1503. The AI accelerator obtains the operator's input data and the operator's segmentation scheme.

1504. The AI accelerator segments the operator's input data according to the operator's segmentation scheme to obtain the input data of the multiple base operators.

It should be noted that step 1504 is optional: in some embodiments, the host segments the operator's input data according to the operator's segmentation scheme to obtain the input data of the multiple base operators, and sends the input data of the multiple base operators to the AI accelerator.
1505. The AI accelerator calls the multiple base operators to process their respective input data, obtaining the output data of the multiple base operators.

1506. The AI accelerator sends the output data of the multiple base operators to the host.

It should be understood that, since the multiple base operators can collaboratively implement the operator's function, splicing their output data together yields the operator's output data. In some embodiments, while the AI accelerator calls the multiple base operators to process their respective input data, each base operator has its own destination video memory address; after the multiple base operators finish executing, the corresponding output data has been written to each destination address in the output memory, and once the AI accelerator sends the output data of the multiple base operators to the host, the host has obtained the operator's output data.
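As an illustration, stitching per-base-operator outputs back together by destination offset might be expressed as follows; the offsets, sizes, and contents are hypothetical:

```python
# Hypothetical stitching of base-operator outputs into the operator's output buffer.
operator_output = bytearray(96)  # assumed total output size in bytes

# Each base operator writes to its own destination offset in the output memory.
base_op_results = [(0, b"\x01" * 32), (32, b"\x02" * 32), (64, b"\x03" * 32)]

for dst_offset, data in base_op_results:
    operator_output[dst_offset:dst_offset + len(data)] = data
# operator_output now holds the complete operator output.
```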
In summary, the operator processing method provided by this application uses a preset shape set to dynamically segment the input data of an operator of arbitrary shape, so that by calling the multiple base operators corresponding to the operator, the execution of the dynamic shape operator is realized, effectively improving the operating performance of the AI accelerator.
In addition, the embodiments shown in Figures 14 and 15 above are introduced taking as an example the case where the operator's segmentation scheme is generated when the AI accelerator runs the operator. In some embodiments, the segmentation scheme may also be preset. For example, in the AI inference engine scenario, the engine's function is to automatically perform, offline and according to a user-given model description, model operator generation, operator fusion, computation graph optimization, computation quantization, model pruning, and similar work, finally generating a high-performance model file for inference deployment in production. The technical solution provided by this application can also be used in this scenario; that is, an AI model with an already-set segmentation scheme can be generated according to the user-given AI model description, and this application is not limited thereto.
Figure 16 is a schematic structural diagram of an operator processing apparatus provided in an embodiment of this application. The apparatus can implement the functions of the aforementioned AI accelerator through software, hardware, or a combination of the two. As shown in Figure 16, the apparatus is configured in an AI accelerator and includes an acquisition module 1601 and a calling module 1602.

The acquisition module 1601 is configured to obtain the input data of multiple base operators corresponding to an operator, the operator being a dynamic shape operator, the input data of each base operator being obtained by segmenting the operator's input data based on a preset shape set, the multiple base operators being used to collaboratively implement the operator's function, and the preset shape set including multiple numerical values for segmenting data, the values indicating dimension values of the data.

The calling module 1602 is configured to call the multiple base operators to process their respective input data, obtaining the output data of the multiple base operators.
In some embodiments, the acquisition module 1601 includes:

a generating unit, configured to generate the operator's segmentation scheme based on the preset shape set and the shape of the operator's input data, the segmentation scheme including the segmentation mode for the operator's input data and the multiple base operators corresponding to that mode; and

a segmentation unit, configured to segment the operator's input data according to the segmentation scheme, obtaining the input data of the multiple base operators.

In some embodiments, the segmentation scheme further includes a computing power resource allocation mode for calling the multiple base operators, the allocation mode being associated with at least one of the following:

allocation of computing resources based on the number of cores in the AI accelerator;

allocation of computing resources based on the number of threads in the AI accelerator;

allocation of computing resources based on the number of warps in the AI accelerator;

allocation of computing resources based on the number of logic blocks in the AI accelerator.
In some embodiments, the generating unit is configured to:

generate multiple candidate segmentation schemes for the operator based on the preset shape set and the shape of the operator's input data, each candidate scheme including a candidate segmentation mode for the operator's input data and the multiple candidate base operators corresponding to that mode; and

determine, from the multiple candidate segmentation schemes, the segmentation scheme that meets the target condition.

In some embodiments, the generating unit is configured to:

determine the cost of each candidate segmentation scheme, the cost indicating the predicted time consumption of obtaining the operator's output data by calling the multiple candidate base operators for data processing according to that candidate scheme; and

determine the candidate scheme with the smallest cost among the multiple candidate schemes as the segmentation scheme.
In some embodiments, a candidate segmentation scheme further includes a computing power resource allocation mode for calling the multiple candidate base operators, and the generating unit is configured to:

determine the cost of the candidate scheme based on the predicted time consumption of the multiple candidate base operators indicated by the scheme and the predicted time consumption of allocating computing resources under the allocation mode.
In some embodiments, the acquisition module 1601 is further configured to:

obtain the operator's input data and the operator's segmentation scheme sent by the host, the segmentation scheme including the segmentation mode for the operator's input data and the multiple base operators corresponding to that mode.

The apparatus further includes a sending module, configured to send the output data of the multiple base operators to the host.
In some embodiments, the preset shape set includes at least two arithmetic value sequences, where the difference between the value at the tail of a target sequence and the value at the head of the sequence adjacent to the target sequence is greater than the common difference of the target sequence.

In some embodiments, the numerical values corresponding to the shapes of the input data of the multiple base operators are associated with the number of sequences in the preset shape set.
In some embodiments, the magnitudes of the values in the preset shape set are associated with at least one of the following:

the data types corresponding to the instructions executed by the AI accelerator;

the data ranges corresponding to the instructions executed by the AI accelerator;

the sizes of the cache spaces at each level of the AI accelerator.
In some embodiments, while the calling module 1602 calls the multiple base operators to process their respective input data, a unified granularity is used to allocate the cache spaces at each level of the AI accelerator when reading and writing data.
Through the above apparatus, a preset shape set is used to dynamically segment the input data of a target operator of arbitrary shape, so that by calling the multiple base operators corresponding to the target operator, the execution of the dynamic shape operator is realized, effectively improving the operating performance of the AI accelerator.
It should be noted that, when the operator processing apparatus provided by the above embodiment processes an operator, the division into the functional modules above is only an example; in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the operator processing apparatus provided by the above embodiment belongs to the same concept as the operator processing method embodiments; for the specific implementation process, see the method embodiments, which are not repeated here.
In this application, terms such as "first" and "second" are used to distinguish identical or similar items whose roles and functions are substantially the same. It should be understood that there is no logical or temporal dependency among "first", "second", and "nth", and that neither quantity nor execution order is limited. It should also be understood that, although the following description uses the terms first, second, and so on to describe various elements, these elements should not be limited by the terms; the terms are only used to distinguish one element from another. For example, without departing from the scope of the various described examples, a first processing core could be called a second processing core and, similarly, a second processing core could be called a first processing core; both can be processing cores and, in some cases, separate and distinct processing cores.

In this application, the term "at least one" means one or more, and the term "multiple" means two or more; for example, multiple processing cores means two or more processing cores.
The above description is only a specific implementation of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall all fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be wholly or partly in the form of program structure information. The program structure information includes one or more program instructions; when the program instructions are loaded and executed on a computing device, the processes or functions according to the embodiments of this application are wholly or partly produced.
A person of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments can be completed by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
As stated above, the above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

Claims (27)

  1. A method for processing an operator, characterized in that it is executed by an artificial intelligence (AI) accelerator, the method comprising:
    obtaining input data of multiple base operators corresponding to an operator, wherein the operator is a dynamic shape operator, the input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set, the multiple base operators are used to collaboratively implement the function of the operator, and the preset shape set comprises multiple numerical values for segmenting data, the numerical values indicating dimension values of the data;
    calling the multiple base operators to process their respective input data to obtain output data of the multiple base operators.
  2. The method according to claim 1, characterized in that the obtaining input data of multiple base operators corresponding to the operator comprises:
    generating a segmentation scheme of the operator based on the preset shape set and the shape of the input data of the operator, the segmentation scheme comprising a segmentation mode for the input data of the operator and the multiple base operators corresponding to the segmentation mode;
    segmenting the input data of the operator according to the segmentation scheme to obtain the input data of the multiple base operators.
  3. The method according to claim 2, characterized in that the segmentation scheme further comprises a computing power resource allocation mode for calling the multiple base operators, the allocation mode being associated with at least one of the following:
    allocating computing resources based on the number of cores in the AI accelerator;
    allocating computing resources based on the number of threads in the AI accelerator;
    allocating computing resources based on the number of warps in the AI accelerator;
    allocating computing resources based on the number of logic blocks in the AI accelerator.
  4. The method according to claim 2 or 3, characterized in that the generating a segmentation scheme of the operator based on the preset shape set and the shape of the input data of the operator comprises:
    generating multiple candidate segmentation schemes of the operator based on the preset shape set and the shape of the input data of the operator, the candidate segmentation schemes comprising candidate segmentation modes for the input data of the operator and multiple candidate base operators corresponding to the candidate segmentation modes;
    determining, from the multiple candidate segmentation schemes, the segmentation scheme that meets a target condition.
  5. The method according to claim 4, characterized in that the determining, from the multiple candidate segmentation schemes, the segmentation scheme that meets a target condition comprises:
    determining a cost of each candidate segmentation scheme, the cost indicating a predicted time consumption of obtaining the output data of the operator by calling the multiple candidate base operators for data processing according to the candidate segmentation scheme;
    determining the candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes as the segmentation scheme.
  6. The method according to claim 4 or 5, characterized in that the candidate segmentation scheme further comprises a computing power resource allocation mode for calling the multiple candidate base operators, and the determining a cost of each candidate segmentation scheme comprises:
    determining the cost of the candidate segmentation scheme based on the predicted time consumption of the multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of allocating computing resources under the allocation mode.
  7. The method according to claim 1, characterized in that the method further comprises:
    obtaining the input data of the operator and a segmentation scheme of the operator sent by a host, the segmentation scheme comprising a segmentation mode for the input data of the operator and the multiple base operators corresponding to the segmentation mode;
    the method further comprising: sending the output data of the multiple base operators to the host.
  8. The method according to any one of claims 1 to 7, characterized in that the preset shape set comprises at least two arithmetic value sequences, wherein the difference between the value at the tail of a target sequence and the value at the head of a sequence adjacent to the target sequence is greater than the common difference of the target sequence.
  9. The method according to claim 8, characterized in that the numerical values corresponding to the shapes of the input data of the multiple base operators are associated with the number of sequences in the preset shape set.
  10. The method according to any one of claims 1 to 9, characterized in that the magnitudes of the numerical values in the preset shape set are associated with at least one of the following:
    the data types corresponding to the instructions executed by the AI accelerator;
    the data ranges corresponding to the instructions executed by the AI accelerator;
    the sizes of the cache spaces at each level of the AI accelerator.
  11. The method according to any one of claims 1 to 10, characterized in that, in the process of calling the multiple base operators to process their respective input data, a unified granularity is used to allocate the cache spaces at each level of the AI accelerator when reading and writing data.
  12. An operator processing apparatus, characterized in that it is configured in an AI accelerator, the apparatus comprising:
    an acquisition module, configured to obtain input data of multiple base operators corresponding to an operator, wherein the operator is a dynamic shape operator, the input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set, the multiple base operators are used to collaboratively implement the function of the operator, and the preset shape set comprises multiple numerical values for segmenting data, the numerical values indicating dimension values of the data;
    a calling module, configured to call the multiple base operators to process their respective input data to obtain output data of the multiple base operators.
  13. The apparatus according to claim 12, characterized in that the acquisition module comprises:
    a generating unit, configured to generate a segmentation scheme of the operator based on the preset shape set and the shape of the input data of the operator, the segmentation scheme comprising a segmentation mode for the input data of the operator and the multiple base operators corresponding to the segmentation mode;
    a segmentation unit, configured to segment the input data of the operator according to the segmentation scheme to obtain the input data of the multiple base operators.
  14. The apparatus according to claim 13, characterized in that the segmentation scheme further comprises a computing power resource allocation mode for calling the multiple base operators, the allocation mode being associated with at least one of the following:
    allocating computing resources based on the number of cores in the AI accelerator;
    allocating computing resources based on the number of threads in the AI accelerator;
    allocating computing resources based on the number of warps in the AI accelerator;
    allocating computing resources based on the number of logic blocks in the AI accelerator.
  15. The apparatus according to claim 13 or 14, characterized in that the generating unit is configured to:
    generate multiple candidate segmentation schemes of the operator based on the preset shape set and the shape of the input data of the operator, the candidate segmentation schemes comprising candidate segmentation modes for the input data of the operator and multiple candidate base operators corresponding to the candidate segmentation modes;
    determine, from the multiple candidate segmentation schemes, the segmentation scheme that meets a target condition.
  16. The apparatus according to claim 15, characterized in that the generating unit is configured to:
    determine a cost of each candidate segmentation scheme, the cost indicating a predicted time consumption of obtaining the output data of the operator by calling the multiple candidate base operators for data processing according to the candidate segmentation scheme;
    determine the candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes as the segmentation scheme.
  17. The apparatus according to claim 15 or 16, characterized in that the candidate segmentation scheme further comprises a computing power resource allocation mode for calling the multiple candidate base operators, and the generating unit is configured to:
    determine the cost of the candidate segmentation scheme based on the predicted time consumption of the multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of allocating computing resources under the allocation mode.
  18. The apparatus according to claim 12, characterized in that the acquisition module is further configured to:
    obtain the input data of the operator and a segmentation scheme of the operator sent by a host, the segmentation scheme comprising a segmentation mode for the input data of the operator and the multiple base operators corresponding to the segmentation mode;
    the apparatus further comprising a sending module, configured to send the output data of the multiple base operators to the host.
  19. The apparatus according to any one of claims 12 to 18, characterized in that the preset shape set comprises at least two arithmetic value sequences, wherein the difference between the value at the tail of a target sequence and the value at the head of a sequence adjacent to the target sequence is greater than the common difference of the target sequence.
  20. The apparatus according to claim 19, characterized in that the numerical values corresponding to the shapes of the input data of the multiple base operators are associated with the number of sequences in the preset shape set.
  21. The apparatus according to any one of claims 12 to 20, characterized in that the magnitudes of the numerical values in the preset shape set are associated with at least one of the following:
    the data types corresponding to the instructions executed by the AI accelerator;
    the data ranges corresponding to the instructions executed by the AI accelerator;
    the sizes of the cache spaces at each level of the AI accelerator.
  22. The apparatus according to any one of claims 12 to 21, characterized in that, in the process of the calling module calling the multiple base operators to process their respective input data, a unified granularity is used to allocate the cache spaces at each level of the AI accelerator when reading and writing data.
  23. A chip, characterized in that it is configured as an AI accelerator, the AI accelerator comprising a communication interface and at least one AI processing core, the communication interface being configured to provide program instructions and/or data for the at least one AI processing core, and the at least one AI processing core being configured to implement the operator processing method according to any one of claims 1 to 11.
  24. A computing device, characterized in that it comprises a host and an AI accelerator, the host being configured to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator being configured to execute the operator processing method according to any one of claims 1 to 11.
  25. A computing device cluster, characterized in that it comprises multiple computing devices, each computing device comprising a host and an AI accelerator, the host being configured to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator being configured to execute the operator processing method according to any one of claims 1 to 11.
  26. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store at least one piece of program code, the at least one piece of program code being used to execute the operator processing method according to any one of claims 1 to 11.
  27. A computer program product, characterized in that, when the computer program product runs on an AI accelerator, it causes the AI accelerator to execute the operator processing method according to any one of claims 1 to 11.
PCT/CN2023/119946, priority date 2022-12-24, filing date 2023-09-20: Operator processing method and apparatus, and chip, computing device and storage medium, WO2024131170A1 (en)

Applications Claiming Priority (2)

Application Number    Priority Date
CN202211669360.1      2022-12-24
CN202310379731.0      2023-03-31

Publications (1)

Publication Number
WO2024131170A1
