WO2024131170A1

WO2024131170A1 - Operator processing method and apparatus, and chip, computing device and storage medium

Info

Publication number: WO2024131170A1
Application number: PCT/CN2023/119946
Authority: WO
Inventors: 仇悦; 徐晓忻; 周建伟; 周卿
Original assignee: 华为技术有限公司
Priority date: 2022-12-24
Filing date: 2023-09-20
Publication date: 2024-06-27

Abstract

The present application relates to the technical field of artificial intelligence (AI). Disclosed are an operator processing method and apparatus, and a chip, a computing device and a storage medium. The method is executed by means of an AI accelerator, and comprises: acquiring input data of a plurality of basis operators corresponding to an operator; and calling the plurality of basis operators to process the respective input data, so as to obtain output data of the plurality of basis operators, i.e., obtaining output data of the operator. The plurality of basis operators are used for collaboratively implementing functions of the operator, and the input data of each basis operator is obtained by means of segmenting input data of the operator on the basis of a preset shape set. That is, in the method, input data, which is of any shape, of an operator is dynamically segmented by using a preset shape set, so that a plurality of basis operators corresponding to the operator are called to collaboratively implement the execution of a dynamic shape operator, thereby effectively improving the operation performance of an AI accelerator.

Description

Operator processing method, device, chip, computing device and storage medium

This application claims the priority of Chinese patent application with application number 202211669360.1 filed on December 24, 2022, and invention name “A method for accelerating operator generation”, and the priority of Chinese patent application with application number 202310379731.0 filed on March 31, 2023, and invention name “Operator processing method, device, chip, computing device and storage medium”, all contents of which are incorporated by reference into this application.

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to an operator processing method, apparatus, chip, computing device and storage medium.

Background technique

With the rapid development of artificial intelligence (AI) technology, a series of AI accelerators have emerged to provide computing power for matrices and vectors to accelerate the calculation of AI models. Usually, AI accelerators are equipped with dedicated operator programming interfaces for users to write operators to run on AI accelerators. Operators include fixed shape operators and dynamic shape operators, among which fixed shape operators refer to operators whose input data size is fixed, and dynamic shape operators refer to operators whose input data size is not fixed.

In the related technology, when the AI accelerator obtains the input data of the dynamic shape operator, it selects the operator implementation file corresponding to the shape range from multiple preset operator implementation files according to the shape range of the input data, and calls the operator implementation file to realize the dynamic shape operation of the operator.

However, the above method requires users to generate different operator implementation files according to different shape ranges in advance, which has poor usability. Moreover, if the shape of the input data does not fall within the preset shape range, it will trigger runtime compilation (just-in-time, JIT), introduce compilation time overhead, and greatly affect the operating performance of the AI accelerator.

Summary of the invention

The embodiment of the present application provides an operator processing method, device, chip, computing device and storage medium, which can effectively improve the operating performance of the AI accelerator. The technical solution is as follows:

In a first aspect, a method for processing an operator is provided, which is executed by an artificial intelligence (AI) accelerator, and the method includes:

Obtain input data of multiple base operators corresponding to an operator, where the operator is a dynamic shape operator, and the input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set, and the multiple base operators are used to collaboratively implement the function of the operator, and the preset shape set includes multiple numerical values for segmenting data, and the numerical values are used to indicate the dimension value of the data;

The multiple base operators are called to process the respective input data to obtain output data of the multiple base operators.

Among them, since multiple base operators can collaboratively realize the functions of the operator, the output data of multiple base operators is also the output data of the target operator, or in other words, the output data of the operator is obtained by splicing the output data of multiple base operators. In other words, the above method uses a preset shape set to dynamically segment the input data of operators of any shape, thereby calling multiple base operators corresponding to the operator to collaboratively realize the execution of the dynamic shape operator, effectively improving the operating performance of the AI accelerator.

In some embodiments, the step of obtaining input data of multiple base operators corresponding to the operator includes:

Based on the preset shape set and the shape of the input data of the operator, generating a segmentation scheme of the operator, the segmentation scheme including a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;

According to the segmentation scheme, the input data of the operator is segmented to obtain the input data of the multiple base operators.

In some embodiments, the slicing scheme further includes calling a computing power resource allocation method of the plurality of base operators, wherein the computing power resource allocation method is associated with at least one of the following:

A method for allocating computing resources based on the number of cores in the AI accelerator;

A method for allocating computing resources based on the number of threads in the AI accelerator;

A method for allocating computing resources based on the number of thread warps in the AI accelerator;

A method for allocating computing resources based on the number of logic blocks in the AI accelerator.

In the above method, since the computing resource allocation method of calling multiple base operators is taken into account when generating the segmentation plan, it can effectively balance The computing power consumption caused by calling multiple base operators is reduced, thereby making full use of the computing power resources of the AI accelerator.

In some embodiments, generating a segmentation scheme of the operator based on the preset shape set and the shape of the input data of the operator includes:

Based on the preset shape set and the shape of the input data of the operator, generating a plurality of candidate segmentation schemes of the operator, the candidate segmentation schemes including candidate segmentation methods for the input data of the operator and a plurality of candidate base operators corresponding to the candidate segmentation methods;

The segmentation scheme that meets the target condition is determined from the multiple candidate segmentation schemes.

In some embodiments, determining the segmentation scheme that meets the target condition from the multiple candidate segmentation schemes includes:

Determine a cost of each candidate segmentation scheme, where the cost indicates a predicted time consumption of calling the plurality of candidate base operators to perform data processing according to the candidate segmentation scheme to obtain output data of the operators;

A candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes is determined as the segmentation scheme.

In some embodiments, the candidate segmentation scheme further includes calling a computing resource allocation method of the plurality of candidate base operators, and determining the cost of each candidate segmentation scheme includes:

The cost of the candidate segmentation scheme is determined based on the predicted time consumption of multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of computing power resource allocation based on the computing power resource allocation method.

Through the above method, a method for determining a slicing scheme for automatic load balancing is provided. By determining the costs of different candidate slicing schemes, the slicing scheme with the lowest cost is selected to maximize the operating performance of the AI accelerator.

In some embodiments, the method further comprises:

Acquire the input data of the operator and the segmentation scheme of the operator sent by the host, wherein the segmentation scheme includes a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;

The method further includes: sending output data of the plurality of base operators to the host.

Through the above method, the host generates the operator segmentation plan, which can offload the computing power of the AI accelerator and save the computing power resources of the AI accelerator.

In some embodiments, the preset shape set includes at least two arbitrarily-differentiated numerical sequences, wherein a difference between a numerical value at the tail of a target sequence and a numerical value at the head of an adjacent sequence of the target sequence is greater than a tolerance of the target sequence.

In some embodiments, the numerical values corresponding to the shapes of the input data of the plurality of basis operators are associated with the number of sequences in the preset shape set.

In some embodiments, the size of the numerical value in the preset shape set is associated with at least one of the following:

The data type corresponding to the instruction executed by the AI accelerator;

The data range corresponding to the instruction executed by the AI accelerator;

The size of cache space at each level in the AI accelerator.

The preset shape set set in the above manner can balance operator performance and the number of base operators, thereby effectively improving the operating performance of the AI accelerator.

In some embodiments, in the process of calling the multiple base operators to process the respective input data, a uniform granularity is used to allocate cache space at various levels in the AI accelerator when reading and writing data.

Through the above method, it is possible to achieve seamless splicing of scheduling between multiple base operators. For example, a ping-pong cache mechanism is used to achieve seamless splicing of scheduling between multiple base operators.

In a second aspect, a device for processing an operator is provided. The device is configured in an AI accelerator and includes at least one functional module for executing the method for processing an operator provided in the first aspect or any possible implementation of the first aspect.

In a third aspect, a chip is provided, configured as an AI accelerator, the AI accelerator comprising a communication interface and at least one AI processing core, the communication interface being used to provide program instructions and/or data to the at least one processing core, the at least one AI processing core being used to implement a processing method for an operator provided in the first aspect or any possible implementation of the first aspect.

In a fourth aspect, a computing device is provided, including a host and an AI accelerator, wherein the host is used to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator is used to execute an operator processing method provided in the first aspect or any possible implementation of the first aspect.

In a fifth aspect, a computing device cluster is provided, comprising multiple computing devices, wherein the computing devices include a host and an AI accelerator, wherein the host is used to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator is used to execute an operator processing method provided in the aforementioned first aspect or any possible implementation of the first aspect.

In a sixth aspect, a computer-readable storage medium is provided, wherein the computer-readable storage medium is used to store at least one program code, and the at least one program code is used to execute the processing method of the operator provided in the first aspect or any possible implementation of the first aspect. The storage medium includes but is not limited to a volatile memory, such as a random access memory, a non-volatile memory, such as a flash memory, a hard disk drive (HDD), and a solid state drive (SSD).

In a seventh aspect, a computer program product is provided, which, when running on an AI accelerator, enables the AI accelerator to execute the operator processing method provided in the first aspect or any possible implementation of the first aspect. The computer program product may be a software installation package, and when the functions of the aforementioned AI accelerator need to be implemented, the computer program product may be downloaded and executed on the AI accelerator.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG1 is a schematic diagram of an implementation environment provided by the present application;

FIG2 is a schematic diagram of the architecture of an AI accelerator 200 provided in an embodiment of the present application;

FIG3 is a schematic diagram of the hardware structure of a chip provided in an embodiment of the present application;

FIG4 is a schematic diagram of the structure of an AI processing core provided in an embodiment of the present application;

FIG5 is a schematic diagram of the segmentation of a matrix multiplication operator;

FIG6 is a schematic diagram of another segmentation of a matrix multiplication operator;

FIG7 is a schematic diagram of a matrix multiplication operator segmentation method provided in an embodiment of the present application;

FIG8 is a schematic diagram of a base operator scheduling mechanism provided in an embodiment of the present application;

FIG9 is a schematic diagram of a cache space allocation method provided in an embodiment of the present application;

FIG10 is a schematic diagram of a segmentation method of a base operator provided in an embodiment of the present application;

FIG11 is a schematic diagram of a segmentation scheme provided in an embodiment of the present application;

FIG12 is a schematic diagram of the proportion of the most time-consuming stage provided in an embodiment of the present application;

FIG13 is a schematic diagram of a flow chart of an operator development phase provided in an embodiment of the present application;

FIG14 is a flowchart of an operator processing method provided in an embodiment of the present application;

FIG15 is a flowchart of another operator processing method provided in an embodiment of the present application;

FIG16 is a schematic diagram of the structure of an operator processing device provided in an embodiment of the present application.

Detailed ways

In order to make the objectives, technical solutions and advantages of the present application more clear, the implementation methods of the present application will be further described in detail below with reference to the accompanying drawings.

For ease of understanding, the key terms and key concepts involved in this application are explained below.

Artificial intelligence (AI) model is a type of mathematical algorithm model that uses machine learning ideas to solve practical problems. The AI model includes a large number of parameters and calculation formulas (or calculation rules).

AI accelerators are a type of specialized hardware accelerator or computer system designed to accelerate AI applications, especially neural networks, machine vision, and machine learning. For example, they are used to provide computing power for calculating matrices and vectors to accelerate the calculation of AI models. Schematically, AI accelerators are, for example, graphics processing units (GPUs), intelligent processing units (IPUs), tensor processing units (TPUs), domain specific architecture (DSA) chips, and so on.

Deep learning (DL) is a branch of machine learning (ML). Deep learning is the study of the inherent laws and representation levels of sample data. The information obtained in the learning process is very helpful for interpreting data such as text, images, and sounds. Schematically, deep learning is a complex machine learning algorithm, and the typical AI model that uses deep learning ideas is the neural network model.

An operator (OP) refers to a computing unit or computing function running on a computing device. In the field of deep learning, neural network layers and even the entire model are composed of operators, which correspond to the computing logic in the neural network layer. For example, a convolution layer is an operator; the weight summation process in a fully-connected layer (FC layer) is an operator. Schematically, operators include fixed-shape operators and dynamic-shape operators. The following introduces these two types of operators respectively: (1) Fixed-shape operators refer to operators whose input data size is fixed, so the operator's internal input data segmentation and execution scheduling can be fixed. (2) Dynamic-shape operators refer to operators whose input data size is not fixed, or in other words, the input data size is delayed until the operator obtains the actual input data size at runtime.

Tensor is the data in the operator, including input data and output data.

Shape refers to the shape of a tensor, or the dimension value of each dimension of a tensor, usually expressed in the form of (D0, D1, ..., Dn-1), where n is a positive integer. For example, the input data shape of the matrix tensor of an operator is (100, 100), indicating that the dimension value of the matrix in the row and column directions is 100 each, and contains a total of 100×100=10000 elements.

Ping-pong buffering is a data buffering mechanism that can use two data buffers at the same time to achieve the purpose of continuous data transmission, thereby increasing the data transmission rate. It should be understood that since the data processed by a single buffer is easily overwritten during transmission and processing, the ping-pong buffering method can always keep the data in one buffer being used while the other buffer is used to store data. In other words, ping-pong buffering means that two identical objects are read and written alternately as buffers.

The following is an introduction to the application scenarios and implementation environment involved in this application.

The technical solution provided in this application is applied to the scenario of running dynamic shape operators based on AI accelerators. At present, the implementation of dynamic shape operators often involves segmentation and scheduling strategies. For example, taking the dynamic shape operator as a matrix multiplication operator, when the AI accelerator runs the operator, the input data of the operator obtained includes a 10×20-dimensional matrix A and a 20×30-dimensional matrix B. Then the final output data of the operator is matrix C=A×B, which is a 10×30-dimensional matrix. The calculation task of C=A×B cannot usually be completed by a single calculation of the computing unit in the AI accelerator, so it is often necessary to segment this calculation task. For example, consider matrix C as two 10×15-dimensional matrices spliced together, then the local data of matrix A and matrix B can be taken respectively to calculate two 10×15-dimensional matrices. These two 10×15-dimensional matrices are spliced together to obtain the output data of the operator, that is, matrix C. This process is to divide the computing task of an operator of C=A×B into two sub-computing tasks, and call the two base operators corresponding to the operator (a base operator is used to calculate a 10×15-dimensional matrix) to respectively implement the two sub-computing tasks, that is, to collaboratively implement the function of the operator through the two base operators corresponding to the operator. Accordingly, since the input data of each base operator is the local data of the input data of the operator, it is necessary to divide the input data of the operator to obtain the input data of each base operator, and finally realize the function of the operator. However, it should be understood that the matrix C can also have many other splicing methods, and accordingly, the operator's segmentation and scheduling strategy also includes many other strategies. It can be seen that for this dynamic shape operator, since the shape of its input data is not fixed, the optimization space for the segmentation and scheduling strategy of this operator is very large, and the large optimization space means that since the AI accelerator obtains the shape of the input data when running the operator, it is difficult for the AI accelerator to implement the operator function with the best or even better segmentation and scheduling strategy, so its performance and hardware utilization are usually difficult to guarantee.

Based on this, the present application provides a technical solution for implementing a dynamic shape operator based on a preset shape set (also called a shape set), which can effectively improve the operating performance of the AI accelerator. Among them, for input data of any shape of a certain operator, the input data is segmented according to the preset shape set, so that the multiple base operators corresponding to the preset operator are used to process the segmented input data respectively, and finally the output data of the operator is obtained (the specific principle and implementation process will be introduced in the subsequent embodiments, and will not be repeated here).

The implementation environment of the present application is introduced below with reference to FIG1 . FIG1 is a schematic diagram of an implementation environment provided by the present application. As shown in FIG1 , the implementation environment includes a host 100 and an AI accelerator 200 , and the host 100 and the AI accelerator 200 are communicatively connected.

The host 100 refers to a device used to run an AI model and provide AI services to users. In an embodiment of the present application, the host 100 can implement development functions and execution functions for operators in the AI model, wherein the development function refers to the host 100 providing users with various development functions for operators, such as writing operators, setting preset shape sets, generating operator codes, etc., and the present application is not limited thereto. The execution function means that the host 100 can control the AI accelerator 200 to run the operator developed by the user and realize the function of the corresponding operator. This process can also be understood as loading the AI task into the AI accelerator 200 for operation. In addition, the number of hosts 100 can be one or more, and the present application does not limit this.

The AI accelerator 200 is used to provide computing power for the running AI model to accelerate the computing process of the AI model, that is, to execute the processing method of the operator provided in this application. For example, the AI accelerator 200 is a DSA chip, a GPU, etc., but this application is not limited thereto. Schematically, the AI accelerator 200 calls the corresponding operator based on the input data of the operator sent by the host 100, processes the input data, obtains the output data, and returns the output data to the host 100 to implement the function of the corresponding operator. It should be noted that the functions of the AI accelerator 200 are as follows: The contents shown in FIG2 are introduced in detail and will not be described again here. In addition, the number of the AI accelerator 200 may be one or more, and this application does not limit this.

Schematically, the host 100 and the AI accelerator 200 are integrated in a computing device, and the host 100 and the AI accelerator 200 are connected to each other through a peripheral component interconnect bus (PCIe) link. The host 100 exchanges data with the AI accelerator 200 through the PCIe link and controls the AI accelerator 200 to run the corresponding operator. For example, the computing device can be an independent physical server, or a server cluster or distributed file system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. Taking the computing device as a cloud server as an example, the computing device can also be called a cloud platform (i.e., the abbreviation of cloud computing platform), which refers to services based on hardware resources and software resources, providing computing, network and storage capabilities. Through the network "cloud", huge data computing is processed and analyzed remotely and then returned to the user, with the characteristics of large-scale, distributed, virtualized, high availability, scalability, on-demand service and security. The cloud platform can achieve the rapid release and publication of configurable computing resources with a small management cost or low interaction complexity between users and service providers.

It should be noted that in the implementation environment shown in Figure 1 above, the host 100 and the AI accelerator 200 are introduced as an example to realize the execution function of the operator through interaction. In other embodiments, the AI accelerator 200 has the function of running the AI model, that is, the AI accelerator 200 can directly run the AI model specified by the user to realize the execution function of the operator. This application does not limit this.

In addition, the networks involved above include, but are not limited to, data center networks, storage area networks (SAN), local area networks (LAN), metropolitan area networks (MAN), wide area networks (WAN), mobile, wired or wireless networks, dedicated networks or any combination of virtual private networks. In some implementations, technologies and/or formats including hypertext markup language (HTML), extensible markup language (XML) are used to represent data exchanged through the network. In addition, conventional encryption technologies such as secure sockets layer (SSL), transport layer security (TLS), virtual private network (VPN), and Internet protocol security (IPsec) can also be used to encrypt all or part of the links. In other embodiments, customized and/or dedicated data communication technologies can also be used to replace or supplement the above data communication technologies.

The functions of the AI accelerator 200 in the above implementation environment are introduced in detail below.

FIG2 is a schematic diagram of the architecture of an AI accelerator 200 provided in an embodiment of the present application. It should be understood that FIG2 is only an exemplary structural diagram of the AI accelerator 200, and the present application does not limit the division of the functions of the AI accelerator 200. Schematically, as shown in FIG2, the functions of the AI accelerator 200 include but are not limited to: data acquisition function 201 and operator call function 202. In some embodiments, the functions of the AI accelerator 200 also include a segmentation function 203 and a storage function 204, etc., and the present application is not limited thereto.

The data acquisition function 201 is used to obtain the input data of multiple base operators corresponding to the operator, wherein the operator refers to a dynamic shape operator, and multiple base operators are used to collaboratively realize the function of the operator, that is, the functions of these multiple base operators combined are equivalent to the function of the operator, and it can also be understood that these multiple base operators are a series of basic operators obtained by segmenting the operator. For any base operator corresponding to the operator, the input data of the base operator is obtained by segmenting the input data of the operator based on a preset shape set, and the preset shape set is pre-set, including multiple numerical values for segmenting data, and the numerical values are used to indicate the dimensional value of the data. It should be understood that the numerical values in the preset shape set can be freely combined to splice into any shape. The base operator refers to a pre-developed and compiled operator used to process fixed-shape input data, where the fixed shape corresponds to the numerical value in the preset shape set.

The operator calling function 202 is used to call multiple base operators, process their respective input data, and obtain the output data of multiple base operators. This process is to use a series of base operators of pre-developed and compiled operators to process the input data after the operator is segmented, so as to divide the computing task of a complete operator into multiple sub-computing tasks, and hand over these sub-computing tasks to the base operators corresponding to the operators for collaborative implementation, thereby improving the operating performance of the AI accelerator. It should be understood that after obtaining the output data of multiple base operators, the output data of multiple base operators are spliced together based on the way of segmenting the input data of the operator, that is, the output data of the operator is obtained, thereby realizing the function of the operator.

The segmentation function 203 is used to generate a segmentation scheme for the operator based on the preset shape set and the shape of the input data of the operator, and segment the input data of the operator according to the segmentation scheme to obtain input data of multiple base operators. The segmentation scheme includes a segmentation method for the input data of the operator and multiple base operators corresponding to the segmentation method.

The storage function 204 is used to store pre-developed and compiled base operators, preset shape sets, AI models, etc., but the present application is not limited thereto.

In addition, the functions of the AI accelerator 200 are not limited to the above 201 to 204. In actual applications, more functions can be set according to user needs. Multifunctional: Through the above functions, the AI accelerator 200 can realize the execution function of the operators in the AI model and improve the operating performance of the AI accelerator 200.

The hardware structure of the above-mentioned AI accelerator 200 is introduced below.

The embodiment of the present application provides a chip that can be configured as the AI accelerator 200 shown in the above implementation environment. Referring to Figure 3, Figure 3 is a schematic diagram of the hardware structure of a chip provided in an embodiment of the present application. As shown in Figure 3, the chip 300 includes a communication interface 301, at least one AI processing core 302, a processor 303, a memory 304, and a bus 305. Among them, the communication interface 301, at least one AI processing core 302, the processor 303, and the memory 304 are connected to each other through the bus 305.

The communication interface 301 is used to provide program instructions and/or data to at least one AI processing core 302. The communication interface 301 includes a PCIe communication interface, other general peripheral interfaces, etc., which are not limited in this application. For example, when the chip 300 is used as an accelerator card of the host 100, data exchange is achieved through the PCIe communication interface and the host 100. For another example, the chip 300 realizes communication between the chip 300 and other devices or communication networks through the peripheral interface.

The AI processing core 302 is used to implement the functions of the AI accelerator shown in FIG2 above, that is, to execute the processing method of the operator provided in the embodiment of the present application. Schematically, the AI processing core adopts the Da Vinci architecture, which realizes high throughput, high computing power and low power consumption, and is suitable for processing common calculations required for neural networks in deep learning, such as matrix multiplication, etc. The specific architecture is introduced in FIG4 below and will not be repeated here.

The processor 303 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or an integrated circuit for controlling the execution of the program of the present application. The processor 303 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. The number of processors 303 may be one or more. Taking a multi-core processor as an example, the multiple cores may be divided into a control CPU dedicated to controlling the overall operation of the chip 300 and an AI CPU dedicated to non-matrix complex calculations according to their functions. The number of CPU cores occupied by the two types of tasks may be dynamically allocated by the software according to the actual operation of the system, and the present application does not limit this.

The memory 304 can be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited to this.

The bus 305 may include a path for transmitting information between various components of the chip 300 (eg, the communication interface 301 , the at least one AI processing core 302 , the processor 303 , and the memory 304 ).

It should be noted that the above FIG. 3 is only a hardware structure diagram of a chip that can be configured as the above-mentioned AI accelerator 200 provided by the present application. In some embodiments, the chip may also include other components to achieve more functions. For example, the chip also includes a task scheduler (TS) for achieving efficient allocation and scheduling of computing tasks on the AI processing core, etc. The present application is not limited to this.

The architecture of the above-mentioned AI processing core 302 is introduced below.

Fig. 4 is a schematic diagram of the structure of an AI processing core provided by an embodiment of the present application. As shown in Fig. 4, taking the AI processing core 302 using the Da Vinci architecture as an example, the AI processing core 302 includes a computing unit, a storage unit, and a control unit.

The computing units include: cube unit, vector unit and scalar unit. These three computing units perform their respective functions, forming three independent execution pipelines. Under the unified scheduling of the system software, they cooperate with each other to achieve optimized computing efficiency and complete different types of data calculations in the AI processing core.

The storage units include: L1 buffer, L0 buffer, unified buffer, general-purpose register (GPR), special-purpose register (SPR) and scalar buffer. It should be understood that the above storage units refer to the internal storage of the AI processing core. The AI processing core needs to load the data in the external storage into the internal storage to complete the corresponding calculation. In order to cooperate with the data transmission and handling in the AI processing core, the AI processing core also includes a bus interface unit (BIU), a memory transfer engine (MTE1), MTE2, and MTE3. Among them, BIU is the interface for the AI processing core to interact with the bus; MTE is a data handling unit used to complete data handling between different buffers.

The control unit includes: system control module (system control), instruction dispatch module (instr.dispatch), matrix operation queue (cube The system control module is responsible for commanding and coordinating the overall operation mode of the AI processing core, configuring parameters, and implementing power consumption control. When instructions are sent out in sequence through the instruction emission module, they will be sent to the matrix operation queue, vector operation queue, and storage conversion queue respectively according to the different types of instructions.

In the AI processing core, the storage unit provides transposed and required data to each computing unit. The computing unit returns the result of the operation to the storage unit. The control unit provides instruction control to the computing unit and the storage unit. The three coordinate and cooperate with each other to complete the computing task.

It should be noted that the above FIG. 4 is only a structural diagram provided by the present application that can realize the above-mentioned AI processing core function. The AI processing core can also adopt other architectures, and the present application does not limit this.

The following is an introduction to the processing method of the operator provided in this application.

Based on the above content, it can be seen that the present application provides a technical solution for obtaining dynamic shape operators based on a preset shape set (also called a shape set), which can effectively improve the operating performance of the AI accelerator. Among them, for input data of any shape of a certain operator, the input data is segmented according to the preset shape set, so that the multiple base operators corresponding to the preset operator are used to process the segmented input data respectively, and finally the output data of the operator is obtained.

For ease of understanding, reference is made to FIG. 5 to FIG. 7 below, and taking the operator as a matrix multiplication operator as an example, the principle of the technical solution provided in this application is introduced from the perspective of the output data of the matrix multiplication operator.

Figure 5 is a schematic diagram of the segmentation of a matrix multiplication operator. As shown in Figure 5, the shape of the left matrix A is (M, K), that is, an M×K-dimensional matrix, and the shape of the right matrix B is (K, N), that is, a K×N-dimensional matrix, where M, K, and N are all positive integers, used to represent the dimension values in the row direction or column direction of the matrix, and the matrix C=A×B, and the shape of the matrix C is (M, N). That is, it is an M×N-dimensional matrix. It should be understood that the calculation task of C=A×B cannot usually be completed by a single calculation of the computing unit of the AI processing core in the AI accelerator, so it is often necessary to segment this calculation task. The segmentation perspective is generally from the output data, that is, the matrix C. As shown in the local matrix C′ of the matrix C in Figure 5, the local matrix C′ can be calculated by multiplying the local left matrix A′ and the local right matrix B′. Therefore, the segmentation of the matrix multiplication operator usually focuses on the segmentation of the matrix C. That is, by segmenting in the M and N directions, the solution of the local matrix C′ can be obtained.

FIG6 is a schematic diagram of the partitioning of another matrix multiplication operator. As shown in FIG6, based on the content shown in FIG5 above, by partitioning the matrix in the row direction and the column direction, a local solution of the local matrix C′ can be obtained. That is, by multiplying the local left matrix A″ and the local right matrix B″, a local solution of the local matrix C′ can be obtained. All K-direction partition blocks in the local left matrix A′ and all K-direction partition blocks in the local right matrix B′ are multiplied and added to obtain the solution of the local matrix C′.

Through the contents shown in Figures 5 and 6 above, the conventional segmentation method of the matrix multiplication operator is introduced. The technical solution provided by the present application is introduced below through Figure 7. Figure 7 is a segmentation schematic diagram of a matrix multiplication operator provided by an embodiment of the present application. As shown in Figure 7, since the present application pre-sets a preset shape set and pre-sets a series of base operators related to the numerical values in the preset shape set, therefore, for the matrix multiplication operator shown in Figures 5 and 6 above, based on the preset shape set, the shape of matrix A and the shape of matrix B, a segmentation scheme of the matrix multiplication operator is generated. According to the segmentation scheme, matrix A and matrix B are segmented to obtain input data of multiple base operators. Then, multiple base operators are called to process their respective input data to obtain output data of each base operator. It should be understood that the output data of multiple base operators are spliced together, that is, the output data of the matrix multiplication operator is obtained, that is, the function of the operator is collaboratively realized through multiple base operators. Schematically, taking M=112, K=80, N=48 as an example, that is, the shape of the input data of the matrix multiplication operator is expressed as (M, K, N)=(112, 80, 48). From the perspective of the output data of the matrix multiplication operator, the shape of the output data, i.e., the matrix C, is (112, 48). Based on the preset shape set and the shape of the input data, a segmentation scheme is generated, and the input data of the matrix multiplication operator is segmented according to the segmentation scheme to obtain input data of multiple base operators. The values involved in the segmentation process include, for example, 16, 32, and 64. Then, each dimension of the input data of the matrix multiplication operator can be segmented according to the following segmentation scheme:
M(112)=64×1+32×1+16×1; K(80)=64×1+32×1; N(48)=32×1+16×1;

Accordingly, the above-mentioned partitioning scheme corresponds to 12 basis operators, that is, as shown in FIG7 , matrix A is partitioned to obtain matrices A1 to A6, and matrix B is partitioned to obtain matrices B1 to B4. Accordingly, the local solution of matrix C includes matrices C1 to C6. The data processing process of the specific basis operators is referred to the following formulas (1) to (6), wherein “pp” is used to refer to the preset basis operators, which is only for illustration and does not constitute a limitation of the present application:
C1＝A1×B1+A2×B3＝pp(64，64，32)+pp(64，16，32) (1)
C2＝A1×B2+A2×B4＝pp(64，64，16)+pp(64，16，16) (2)
C3＝A3×B1+A4×B3＝pp(32，64，32)+pp(32，16，32) (3)
C4＝A3×B2+A4×B4＝pp(32，64，16)+pp(32，16，16) (4)
C5＝A5×B1+A6×B3＝pp(16，64，32)+pp(16，16，32) (5)
C6＝A5×B2+A6×B4＝pp(16，64，16)+pp(16，16，16) (6)

Through the above formulas (1) to (6), multiple base operators are called to process their respective input data to obtain output data of multiple base operators. After concatenating the output data of multiple base operators, the output data matrix C of the matrix multiplication operator can be obtained.

It can be seen that the present application provides a solution for collaboratively implementing dynamic shape calculation tasks by using statically compiled base operators through dynamic combination. It should be understood that the above Figures 5 to 7 are only introduced using the matrix multiplication operator as an example. The technical solution provided by the present application is applicable to all types of operators and will not be described in detail here.

Based on the above introduction to the principle of the technical solution of the present application, the specific implementation method of the technical solution of the present application is introduced below. In combination with the implementation environment and the principle of the technical solution shown in the above FIG1, it can be seen that the preset shape set and the base operator involved in the operator processing method need to be pre-set, and the host can provide various development functions for the operator. Therefore, for ease of understanding, the technical solution of the present application is introduced below based on the two stages of the operator development stage and the operator execution stage.

Operator development stage

In this stage, the host provides the user with various development functions for operators, such as setting a preset shape set, writing operators, generating operator codes, writing base operators, etc. For example, the host displays the operator development interface, and the user triggers the host to implement various operator development functions by performing various operator development operations on the operator development interface, but the present application is not limited thereto. Schematically, taking the development of any dynamic shape operator as an example, this stage involves the following steps.

Step 1. Set the preset shape set.

The preset shape set includes multiple values for segmenting data, which are used to indicate the dimension value of the data, wherein the size and number of the values in the preset shape set can be set according to the needs of the user. It should be noted that the preset shape set needs to be "complete", that is, any shape can be spliced based on the values in the preset shape set, or in other words, input data of any shape can be segmented according to the values in the preset shape set, for example, any shape can be spliced using the shape with the smallest granularity. In addition, the size of the values in the preset shape set needs to consider whether the hardware capabilities are fully utilized (such as the computing power of the computing unit in the AI processing core, the bandwidth and size of the storage unit, etc.), so as to ensure the operating performance of the operator. At the same time, as the number of values in the preset shape set increases, the number of base operators corresponding to the preset shape set often increases exponentially (for example, if the shape of the input data of the operator corresponds to three values, the number of base operators and the number of values are cubic, for details, please refer to the aforementioned Figure 7, which will not be repeated here). Based on this, the present application provides a specific setting method of a preset shape set, which can balance the operator performance and the number of base operators, thereby effectively improving the operating performance of the AI accelerator, or improving the operating performance of the computing device where the AI accelerator is located. Schematically, the size of the value in the preset shape set is associated with at least one of the following:

(1) The data type corresponding to the instruction executed by the AI accelerator. The data type is, for example, FP16, FP32, etc., but the present application is not limited thereto. Taking the AI processing core shown in FIG. 4 as an example, since both the computing unit instructions and the cache transfer instructions require 32B alignment, the FP16 data type corresponds to 16 numbers, so in this case the minimum value in the preset shape set can be set to 16.

(2) The data range corresponding to the instruction executed by the AI accelerator. The data range corresponding to the instruction can also be understood as the input parameter limit of the instruction, for example, the length cannot exceed 1024 characters, the number cannot exceed 100 numbers, and so on. This application does not limit this. Schematically, taking the data transfer instruction as an example, this instruction is used for data transfer, and the value range is [1, 65535], and the unit is 32B. Therefore, this instruction can only transfer data with a minimum granularity of 32B, that is, for data of the FP16 data type, it corresponds to 16 numbers, so in this case the minimum value in the preset shape set can be set to 16.

(3) The size of cache space at each level in the AI accelerator. It should be understood that the AI accelerator usually includes multiple levels of cache space, for example, UB (256KB), L1 buffer (1MB), L0A/L0B buffer (64KB), L0C buffer (256KB), etc. For details, please refer to the aforementioned Figure 4. Since the size of the dimension value of each dimension of a certain data and the number of dimensions determine the size of the cache space occupied, the size of the numerical value in the preset shape set needs to consider the size of the cache space at each level. For example, for the matrix multiplication operator, the M and K dimensions of the matrix on L0A determine the size of the cache space occupied in the L0A buffer, and the K and N dimensions of the matrix on L0B determine the size of the cache space occupied in the L0B buffer (the left matrix data is stored in the L0A buffer, and the right matrix data is stored in the L0B buffer). Therefore, taking the left matrix of data type FP16 as an example, if the ping-pong cache mechanism is used to implement basic operator scheduling, the size of the matrix stored in L0A is M×K≤16384 (corresponding to 32KB, that is, half of 64KB), that is, the upper limit of the cache space occupied in L0A is 32KB.

The following introduces an optional implementation method of a preset shape set provided by this application.

The present application provides a method for dividing a dimensional space based on an arithmetic sequence of numerical values, that is, the numerical values in the preset shape set include at least two arithmetic sequences of numerical values, wherein the difference between the numerical value at the tail of the target sequence and the numerical value at the head of the adjacent sequence of the target sequence is greater than the difference between the numerical value at the tail of the target sequence and the numerical value at the head of the adjacent sequence of the target sequence. For example, the preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, where {16, 32, 48, 64, 80, 96, 112, 128} is the target sequence, and the values in the target sequence increase according to the tolerance of 16, and {256, 384, 512, 640, 768, 896, 1024, 1152, 1280} is the adjacent sequence of the target sequence, and the values in the adjacent sequence increase according to the tolerance of 128. Schematically, the above method can also be understood as a method of dividing the dimensional space based on multi-level spans, that is, the values in the preset shape set increase according to the multi-level spans. For example, the preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, where the first-level span is 16 and the second-level span is 128. For the first-level span, the value starts from 16 and increases according to the current span of 16 to obtain 32, 48, 64, 80, 96, 112, and 128. At this time, the second-level span is reached, and the value continues to increase from 128 according to the second-level span of 128 to obtain 256, 384,... and so on.

In some embodiments, the values corresponding to the shapes of the input data of multiple base operators are associated with the number of sequences in a preset shape set (or the number of levels of multi-level spans, which is not limited to this). That is, the number of sequences in the preset shape set can be used to constrain the number of values used when splitting the input data. For example, the shape of the input data of a certain operator is (112). Without considering the constraint on the number of values, the input data can have multiple splitting schemes, for example, 112 = 64 × 1 + 32 × 1 + 16 × 1, this splitting scheme involves 3 values, and for example, 112 = 64 × 1 + 48 × 1, this splitting scheme involves 2 values. Based on the above introduction, when the splitting scheme involves too many values, the corresponding number of base operators will also increase exponentially, thereby affecting the operating performance of the AI accelerator. Moreover, the increase in the number of base operators will also lead to excessive overhead in the scheduling of base operators. If the constraint on the number of values is considered, for example, the number of values corresponding to the shapes of the input data of multiple base operators is equal to the number of sequences in the preset shape set plus 1 (or the number of levels of the multi-level span plus 1. It should be understood that this is only an example and can be set according to user needs, and does not constitute a limitation on the technical solution of the present application). Then, the number of values involved in the segmentation scheme can be effectively controlled. For example, if the number of sequences in the preset shape set is 1, the number of values corresponding to the shapes of the input data of multiple base operators is at most 2. Taking the shape of the input data of a certain operator as (112) as an example, the segmentation scheme of the input data involves at most 2 values, for example, 112=112×1. This segmentation scheme involves 1 value, which effectively reduces the number of base operators. In this way, computing resources are saved, which improves the operating performance of the AI accelerator.

Based on the above, two examples of preset shape sets are given below:

The first

The preset shape set includes two arithmetic progression value sequences. The maximum number of values involved in the segmentation scheme is 3. The preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280}, where the total number of values is 17, which can be adjusted according to needs.

The second

The preset shape set includes two arithmetic progression value sequences. The maximum number of values involved in the segmentation scheme is 3. The preset shape set = {16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 512, 768, 1024, 1280, 1536, 1792, 2048}, where the total number of values is 23, which can be adjusted according to needs.

It should be noted that the host can automatically generate a preset shape set based on the constraints set by the user (as shown in the above content). For example, taking the AI processing core shown in Figure 4 as an example, since the matrix on the L0A buffer needs to satisfy M×K≤16384 (corresponding to 32KB), when M _L0A =1024, K _L0A ≤16; and when M _L0A =128, K _L0A ≤128. It should be understood that the example here describes the result brought about by the L0A buffer constraint. The described M _L0A and K _L0A are both the size of the matrix on the L0A buffer, rather than the M and K dimensions of the actual matrix multiplication operator. Generally, the AI processing core contains multiple levels of cache space, and the size of each level of cache space varies greatly, so the corresponding constraints are also different. Therefore, the setting of the preset shape set can comprehensively consider the association relationship between data transfer between multiple levels of cache and the constraints of multiple levels of cache space, and the present application is not limited to this. For example, the maximum amount of data transferred between each level of cache space is limited to ≤ the block size, and different transfer and calculation orders are designed for different internal segmentation methods of base operators to reuse data as much as possible, reduce transfer, and form a pipeline. This process will be further introduced in the second part later, so I will not go into details here.

Furthermore, the user can also adjust the preset shape set according to the quality evaluation information of the preset shape set (such as automatic iterative adjustment or manual iterative adjustment) so that the adjusted preset shape set meets the target conditions and ensures the rationality of the preset shape set. Among them, the target condition refers to the condition set by the user for evaluating the quality of the preset shape set, which can be adjusted according to actual needs. For example, the target condition includes that the total number of base operators corresponding to the preset shape set meets the requirements, the number of values involved in the segmentation scheme meets the requirements, the utilization rate of the hardware capacity meets the requirements, etc. Schematically, after the preset shape set is initially set, the preset shape set is iteratively adjusted by evaluating the quality of the preset shape set. The following introduces several situations involving the above-mentioned target conditions:

(1) The total number of base operators corresponding to the preset shape set (which can affect the size of the binary file; the more base operators there are, the more binary files there are, and thus the more space they occupy).

(2) The number of values involved in the segmentation scheme (which can affect the complexity of operator scheduling, such as affecting instruction cache misses, thereby affecting operating performance).

(3) Utilization of hardware capabilities, such as whether the base operators corresponding to the preset shape set include base operators that fully utilize the capabilities of the computing units and storage units (which can affect the performance upper limit of the operators obtained by splicing. The higher the hardware capability utilization, the higher the performance upper limit). For example, referring to the AI processing core shown in FIG4 above, the computing capability of the matrix multiplication computing unit (cube) is that each cycle can execute a matrix multiplication operator of shape (M, K, N) = (16, 16, 16), and the input constraint of the matrix multiplication computing instruction is limited by the cache space of L0A and L0B. Therefore, for the AI processing core shown in FIG4, the evaluation basis for fully utilizing the hardware capabilities may include: fully utilizing the cache space of L0A and L0B; fully utilizing the bandwidth of cache space at all levels; and fully utilizing the computing capability of the computing unit as much as possible under the constraints of the cache space of L0A and L0B. Of course, the above introduction to fully utilizing hardware capabilities is only for illustrative purposes. Hardware capabilities can also be evaluated through other factors, such as the theoretical upper limit model of compute bound and memory bound, cache hit rate of multi-level cache space, scheduling overhead, etc.

Step 2: Set the base operator.

A base operator refers to a pre-developed and compiled operator used to process input data of a fixed shape. The fixed shape here corresponds to the numerical value in the preset shape set. The base operator can also be understood as a basic operator for segmenting operators. In an embodiment of the present application, a solution for automatic implementation of a base operator is provided, that is, after the base operator is developed and compiled, a unified base operator scheduling mechanism is used during the actual operation of the base operator to achieve seamless splicing of scheduling between multiple base operators. The following is an introduction to this base operator automatic implementation solution, including the following two points:

(1) Base operator scheduling mechanism

In the embodiment of the present application, when generating a base operator, it is necessary to adopt a unified base operator scheduling mechanism to achieve seamless splicing of scheduling between multiple base operators. Therefore, it is necessary to constrain the cache space allocation method implemented inside the base operator based on a unified AI accelerator internal cache space mechanism. In this way, in the subsequent AI accelerator calling multiple base operators to process their respective input data, a unified granularity can be used to allocate cache space at all levels in the AI accelerator when reading and writing data, thereby achieving seamless splicing of scheduling between multiple base operators. For example, a ping-pong cache mechanism is used to achieve seamless splicing of scheduling between multiple base operators, that is, an AI accelerator usually includes multiple levels of cache space. By opening up ping-pong cache space at each level of cache space, the computing unit and the data handling unit are executed concurrently in a pipeline, thereby reducing processing latency.

The principle of the above process is introduced below with reference to FIG8 , which is a schematic diagram of a base operator scheduling mechanism provided by an embodiment of the present application. As shown in FIG8 , taking the AI processing core shown in FIG4 as an example, the AI processing core includes a multi-level cache space. Each base operator generally includes three stages of data input handling, calculation, and output handling during the implementation process. If a base operator is executed and then the next base operator is executed, the hardware resources of some computing units and data handling units are idle. Based on this, the present application provides a base operator scheduling mechanism, which adopts a ping-pong cache mechanism to achieve seamless splicing of scheduling between multiple base operators. However, the pipeline concurrency of calculation and data handling depends on the unified division of cache space, that is, when the computing unit reads and processes the data in the ping cache space, the data handling unit can only write the data in the pong cache space, otherwise there will be a problem of data synchronization risk. Therefore, a unified granularity is used to allocate cache space at all levels in the AI accelerator, so that the cache space allocation method implemented inside each base operator is unified, thereby achieving seamless splicing of scheduling between base operators. For example, referring to FIG9, FIG9 is a schematic diagram of a cache space allocation method provided by an embodiment of the present application. As shown in FIG9, taking the AI processing core shown in FIG4 as an example, the size of the cache space corresponding to the L1 buffer is 1MB, the size of the cache space corresponding to the L0A and L0B buffers is 64KB each, and the size of the cache space corresponding to the L0C and UB buffers is 256KB. Based on this, a block of 32KB is used to uniformly divide the cache space. Each time the cache space needs to be occupied within the base operator, it is allocated in units of blocks. Then, through unified scheduling, the base operators are prevented from reading and writing the same block at the same time, which also avoids data read and write conflicts, and realizes the pipeline parallelization of calculation and data handling during the entire execution process. That is, in the subsequent AI accelerator calling multiple base operators and processing their respective input data, the data can be read and written at a unified granularity. The cache space at each level in the AI accelerator can be allocated, thereby realizing seamless splicing in scheduling between multiple base operators.

In some embodiments, based on the above introduction to the preset shape set, it can be known that the combination of different values in the preset shape set can determine the size of the cache space occupied. Therefore, the above unified division of the cache space can also be used to constrain the size of the values in the preset shape set. Schematically, the ping-pong cache implementation method for concurrent pipelining requires that two cache spaces of the same size exist in the cache space for implementing the ping-pong cache. Taking the L0A buffer as an example, the size of the cache space corresponding to the L0A buffer is 64KB. Therefore, the upper limit of the ping-pong cache space is 32KB, so that the data size obtained by the combination of different values in the preset shape set needs to be less than or equal to 32KB. Of course, the examples given here are only for schematic illustration, and the size of the values in the preset shape set can be adjusted according to actual needs.

(2) Developing the internal implementation template of the base operator

In the embodiment of the present application, since the AI accelerator includes multiple levels of cache space, and the sizes of the cache spaces at each level are different, after the AI accelerator obtains data from the outside, it still needs to be segmented internally and then the calculation is completed one by one. Based on this, the base operator can also include a segmentation process, and different internal segmentation schemes have different effects on the performance of the base operator. Since the shape of the input data of the base operator is fixed, a corresponding segmentation scheme can be generated for each base operator under the constraints shown in the above point (1). This process can first be written by the developer based on experience to write the internal implementation template of the base operator with different segmentation methods, and then the most suitable implementation scheme for each base operator is selected through performance testing. The performance test optimization process after the base operator is implemented by different templates can be automatically completed by tools such as scripts, and the specific implementation method of the base operator is not limited in this application. Of course, the shape of the data processed by the base operator can also be used as a function input parameter, and the function of the base operator is realized by calling the set template, that is, there is no need to generate the base operator code file, thereby greatly reducing the amount of base operator code, making the compiled binary file further reduced, and saving resource usage.

Schematically, refer to Figure 10, which is a schematic diagram of a segmentation method of a base operator provided in an embodiment of the present application. As shown in Figure 10, taking the base operator M2_K2_N2 corresponding to the multiplication operator as an example, M2_K2_N2 means that it is segmented in two equal parts in the three dimensions of M, K, and N. The same is true for other base operators, which will not be repeated. Base operators that process data of different shapes are suitable for different internal segmentation implementation methods. By performing performance tests on a single base operator, the most suitable implementation scheme for each base operator can be selected, and the present application is not limited to this.

Step 3: Set the method for generating the segmentation plan.

In an embodiment of the present application, a segmentation scheme is generated based on a preset shape set and the shape of the input data of the operator, and is used to include a segmentation method for the input data of the operator and a plurality of base operators corresponding to the operator. After the input data of the operator is segmented according to the segmentation scheme, the input data of the plurality of base operators corresponding to the operator can be obtained. Taking into account that for any shape, segmentation according to a preset shape set usually results in a plurality of segmentation schemes, the present application provides a segmentation scheme determination method for automatic load balancing, by determining the cost of different segmentation schemes, thereby selecting the segmentation scheme with the minimum cost, so as to improve the operating performance of the AI accelerator, wherein the cost of the segmentation scheme can indicate the predicted time consuming of calling a plurality of base operators for data processing according to the segmentation scheme to obtain the output data of the operator.

In some embodiments, the segmentation scheme also includes a computing power resource allocation method for calling multiple base operators, and the computing power resource allocation method is associated with at least one of the following: a computing power resource allocation method based on the number of cores in the AI accelerator; a computing power resource allocation method based on the number of threads in the AI accelerator; a computing power resource allocation method based on the number of thread bundles in the AI accelerator; a computing power resource allocation method based on the number of logic blocks in the AI accelerator. It should be understood that, taking the method of allocating computing power resources based on the number of cores in the AI accelerator as an example, the AI accelerator usually includes multiple AI processing cores (as shown in Figure 3 above), for example, including 32 AI processing cores. When generating a segmentation scheme, a complete operator is divided into each AI processing core in an evenly distributed manner according to the number of cores to run, so that the computing power resources can be fully utilized. Of course, other allocation methods can also be used, and this application is not limited to this. Schematically, refer to Figure 11, Figure 11 is a schematic diagram of a segmentation scheme provided in an embodiment of the present application. As shown in Figure 11, taking the output data of the operator as an example, the AI accelerator includes 6 AI processing cores, and the input data of the operator is evenly divided according to the number of cores, and each AI processing core processes a small 3×3 matrix, which is further divided in each AI processing core. By calling multiple base operators, the respective input data is processed to obtain the output data of each base operator, that is, the output data of the complete operator. Similarly, it can also be divided according to the number of threads, the number of thread bundles, or the number of logic blocks, etc., and the present application is not limited to this.

The following example uses the allocation of computing resources based on the number of cores in an AI accelerator to illustrate the method of determining the cost of a segmentation scheme. Schematically, the present application provides a cost model, through which the costs of different segmentation schemes can be determined, providing technical support for determining the final segmentation scheme of the operator. The cost model is shown in the following formulas (7) and (8):

In the above formula, t _estimation refers to the predicted time consumption of obtaining the output data of the operator by calling multiple base operators for data processing according to the partitioning scheme. _{The additional overhead of t split-core scheduling} is an estimated value obtained through testing. For example, 1 core, 2 cores, ... 32 cores are used to run calculations in sequence, and the same size of computing tasks are set on each core. In theory, if there is no additional overhead, then the total time consumption should remain unchanged as the number of cores used increases. However, through testing, it is found that there will be a linear increase. Based on the amount of linear increase, the additional overhead of split-core scheduling can be estimated. Correspondingly, t _{split-core scheduling additional overhead} × the number of split cores is the predicted time consumption for computing power resource allocation based on the computing power resource allocation method. _{t base operator} is an estimated value obtained through testing. For example, The data shape corresponding to the calculation task is set to the size of a certain base operator to be tested, and it is constrained to run only on a single core to obtain the predicted time consumption of running the base operator. The proportion of the most time-consuming stage is the proportion of the longest time-consuming stage in the operation of the base operator obtained through testing, or in other words, its contribution to the time consumption of the entire execution process is calculated according to the time consumption of the longest first-level pipeline. For example, referring to Figure 12, Figure 12 is a schematic diagram of the proportion of the most time-consuming stage provided in an embodiment of the present application. As shown in Figure 12, the most time-consuming stage is the input transfer stage (it should be noted that this is only an example and does not constitute a limitation of the present application). Accordingly, That is, the prediction time of multiple base operators is time-consuming.

The above-mentioned operator development stage is summarized below with reference to FIG13 . FIG13 is a flow chart of an operator development stage provided in an embodiment of the present application. As shown in FIG13 , the operator development stage is executed by the host and includes the following steps 1301 to 1306 .

1301. The host generates a preset shape set of operators based on constraint conditions, where the preset shape set includes multiple values for segmenting data.

Among them, the operator is a dynamic shape operator, and the constraint condition refers to the constraint condition set by the user, which can be adjusted according to actual needs. For example, the constraint conditions include the data type corresponding to the instruction executed by the AI accelerator, the data range corresponding to the instruction executed by the AI accelerator, the size of the cache space at each level in the AI accelerator, the correlation relationship between data transfer between multi-level caches, the granularity of unified division of cache space, etc. Please refer to the first and second parts above for details, which will not be repeated here.

1302. The host obtains quality evaluation information of the preset shape set, and adjusts the preset shape set based on the quality evaluation information, so that the adjusted preset shape set meets the target condition.

Among them, the target condition refers to the condition set by the user for evaluating the quality of the preset shape set, which can be adjusted according to actual needs. For example, the target condition includes that the total number of basic operators corresponding to the preset shape set meets the requirements, the number of numerical values involved in the segmentation scheme meets the requirements, the utilization rate of hardware capabilities meets the requirements, etc. The specific content of the quality assessment information refers to the first part mentioned above and will not be repeated here.

1303. The host generates a code file of the base operator based on the segmentation method of the base operator corresponding to the operator.

Among them, the division method of the base operator is the internal implementation template of the base operator shown in the second part above. The host can automatically generate the code file of the base operator according to the division method of the base operator provided by the user. Please refer to the second part above for details, which will not be repeated here.

1304. The host constructs a cost model of the operator based on the code file of the base operator.

The specific implementation of the cost model is described in the third part above and will not be repeated here.

1305. The host constructs a segmentation function of the operator based on the preset shape set, and the segmentation function is used to generate a segmentation scheme of the operator.

The segmentation function refers to a function code for generating a segmentation scheme of an operator, and the segmentation function is used to output segmentation parameters based on a preset shape set and the shape of the input data of the operator, that is, to generate a segmentation scheme of the operator. Schematically, the segmentation parameters are a set of parameters used to indicate how the input data is segmented and calculated, including source data address, destination data address, address offset, base operator type, quantity, quantity after segmentation of each segmentation dimension, number of loops, etc., and the present application is not limited thereto.

1306. The host generates a code file for the operator based on the code file of the base operator, the cost model of the operator, and the segmentation function of the operator.

Among them, the code file of the operator includes all code files involved in running the operator, such as the code file of the base operator, the cost model of the operator, the splitting function of the operator, and the logic code file for unified scheduling of the base operator, etc., but the present application is not limited to this.

Operator execution phase

Based on the above introduction to the operator development stage, the following is an introduction to the process of the operator execution stage, which is the operator processing method provided by this application. Based on the implementation environment shown in Figure 1 above, it can be seen that the execution function of the operator can be implemented by the AI accelerator 200 itself, or it can be implemented through the interaction between the host 100 and the AI accelerator 200. The following two implementation methods are introduced through Figures 14 and 15 respectively.

FIG14 is a flowchart of an operator processing method provided in an embodiment of the present application. As shown in FIG14 , the method is executed by an AI accelerator and includes the following steps 1401 to 1404 .

1401. The AI accelerator obtains input data of an operator, which is a dynamic shape operator.

Among them, the operator refers to any dynamic shape operator of the AI model, and the input data can be sent by the host, or it can be the output data of other operators associated with the operator during the AI accelerator running the AI model. The present application is not limited to this.

1402. The AI accelerator generates a segmentation scheme for the operator based on a preset shape set and a shape of the input data of the operator. The segmentation scheme includes a segmentation method for the input data of the operator and multiple base operators corresponding to the segmentation method.

Among them, multiple base operators are used to collaboratively realize the functions of the operators. The preset shape set includes at least two arithmetic progression value sequences, among which the target The difference between the value at the end of the sequence and the value at the head of the adjacent sequence of the target sequence is greater than the tolerance of the target sequence. In some embodiments, the values corresponding to the shapes of the input data of the plurality of base operators are associated with the number of sequences in the preset shape set.

In some embodiments, the segmentation scheme also includes a computing power resource allocation method that calls multiple base operators, and the computing power resource allocation method is associated with at least one of the following: a method for allocating computing power resources based on the number of cores in the AI accelerator; a method for allocating computing power resources based on the number of threads in the AI accelerator; a method for allocating computing power resources based on the number of thread bundles in the AI accelerator; a method for allocating computing power resources based on the number of logic blocks in the AI accelerator.

In some embodiments, based on the contents of the aforementioned operator development stage, the present application provides a method for determining a slicing scheme for automatic load balancing, by determining the costs of different slicing schemes, thereby selecting the slicing scheme with the lowest cost, so as to improve the operating performance of the AI accelerator. Schematically, this step 1402 includes the following two steps:

Step A: Based on a preset shape set and the shape of the operator's input data, generate multiple candidate segmentation schemes for the operator, the candidate segmentation schemes including candidate segmentation methods for the operator's input data and multiple candidate base operators corresponding to the candidate segmentation methods.

Step B: Determine a segmentation scheme that meets the target conditions from multiple candidate segmentation schemes.

Among them, the AI accelerator determines the cost of each candidate segmentation scheme, which indicates the predicted time consumption of calling multiple candidate base operators for data processing according to the candidate segmentation scheme to obtain the output data of the operator; and determines the candidate segmentation scheme with the smallest cost among multiple candidate segmentation schemes as the segmentation scheme.

In some embodiments, the candidate splitting scheme also includes calling a computing power resource allocation method for multiple candidate base operators. Schematically, the AI accelerator determines the cost of each candidate splitting scheme, including: determining the cost of the candidate splitting scheme based on the predicted time consumption of multiple candidate base operators indicated by the candidate splitting scheme and the predicted time consumption of computing power resource allocation based on the computing power resource allocation method.

The above process can refer to the third part of the aforementioned operator development stage, and specifically refer to the cost model shown in formula (7) and formula (8), which will not be repeated here.

1403. The AI accelerator divides the input data of the operator according to the division scheme to obtain input data of multiple base operators.

1404. The AI accelerator calls multiple base operators, processes their respective input data, and obtains output data of the multiple base operators.

Among them, the AI accelerator calls multiple base operators, and in the process of processing their respective input data, a unified granularity is used to allocate cache space at all levels in the AI accelerator when reading and writing data. For example, based on the ping-pong buffer mechanism, a unified granularity (such as a 32KB block) is used to allocate cache space at all levels in the AI accelerator. It should be understood that since multiple base operators can collaboratively realize the functions of the operator, the output data of multiple base operators is also the output data of the target operator, or in other words, the output data of multiple base operators is spliced together to obtain the output data of the operator. In some embodiments, the AI accelerator calls multiple base operators, and in the process of processing their respective input data, each base operator corresponds to its own destination video memory address. After the execution of multiple base operators is completed, the corresponding output data is written to each destination video memory address on the output memory, that is, the output data of the operator is obtained.

It should be noted that the specific implementation of the above steps 1403 to 1404 has been described in detail in the above-mentioned Figures 5 to 7 and the operator development stage, and will not be repeated here. It should be understood that in the case where the AI accelerator includes multiple AI processing cores, if the partitioning scheme also indicates a computing resource allocation method including multiple base operators, such as an average allocation according to the number of processing cores, each AI processing core will output a copy of the data, and the output data of multiple AI processing cores will be spliced together to obtain the output data of the operator. Please refer to Figure 11 above for details, and will not be repeated here.

In summary, in the operator processing method provided in this application, a preset shape set is used to dynamically segment the input data of operators of arbitrary shapes, thereby realizing the execution of dynamic shape operators by calling multiple base operators corresponding to the operators, thereby effectively improving the operating performance of the AI accelerator.

FIG15 is a flowchart of another operator processing method provided in an embodiment of the present application. As shown in FIG15 , the interaction between a host and an AI accelerator is taken as an example for introduction, and includes the following steps 1501 to 1506 .

1501. The host generates a segmentation scheme for the operator based on a preset shape set and the shape of the input data of the operator. The preset shape set includes multiple numerical values for segmenting the data, and the numerical values are used to indicate the dimension value of the data. The segmentation scheme includes a segmentation method for the input data of the operator and multiple base operators corresponding to the segmentation method.

Among them, the process of the host generating an operator segmentation plan is the same as the process of the AI accelerator generating a segmentation plan shown in Figure 14 above, so it will not be repeated here.

1502. The host sends the operator input data and the operator segmentation plan to the AI accelerator.

1503. The AI accelerator obtains the input data of the operator and the segmentation scheme of the operator.

1504. The AI accelerator divides the input data of the operator according to the division scheme of the operator to obtain input data of multiple base operators.

It should be noted that step 1504 is an optional step. In some embodiments, the host divides the input data of the operator according to the operator's division scheme to obtain input data of multiple base operators, and sends the input data of the multiple base operators to the AI accelerator.

1505. The AI accelerator calls multiple base operators, processes their respective input data, and obtains output data of the multiple base operators.

1506. The AI accelerator sends the output data of the multiple base operators to the host.

It should be understood that since multiple base operators can collaboratively implement the functions of the operator, the output data of the multiple base operators can be concatenated to obtain the output data of the operator. In some embodiments, the AI accelerator calls multiple base operators to process their respective input data. Each base operator corresponds to its own destination memory address. After the execution of multiple base operators is completed, the corresponding output data is written to each destination memory address on the output memory. The AI accelerator sends the output data of the multiple base operators to the host, and the host obtains the output data of the operator.

In addition, in the embodiments shown in Figures 14 and 15 above, the segmentation scheme of the operator generated when the AI accelerator runs the operator is introduced as an example. In some embodiments, the segmentation scheme can also be pre-set. For example, in the scenario of the AI reasoning engine, its function is to automatically perform model operator generation, operator fusion, calculation graph optimization, calculation quantization, model cutting and other tasks offline according to the model description given by the user, and finally generate a high-performance model file for reasoning and deployment in production. The technical solution provided in this application can also be used in this scenario, that is, an AI model with a segmentation scheme that has been set can be generated according to the AI model description given by the user, and this application is not limited to this.

FIG16 is a schematic diagram of the structure of an operator processing device provided in an embodiment of the present application. The device can realize the function of the aforementioned AI accelerator through software, hardware, or a combination of both. As shown in FIG16 , the device is configured in the AI accelerator, including an acquisition module 1601 and a call module 1602.

Acquisition module 1601 is used to acquire input data of multiple base operators corresponding to an operator. The operator is a dynamic shape operator. The input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set. Multiple operators are used to collaboratively implement the functions of the operator. The preset shape set includes multiple numerical values for segmenting data, and the numerical values are used to indicate the dimensional value of the data.

The calling module 1602 is used to call multiple basic operators, process the respective input data, and obtain the output data of the multiple basic operators.

In some embodiments, the acquisition module 1601 includes:

A generating unit, configured to generate a segmentation scheme of the operator based on a preset shape set and a shape of the input data of the operator, wherein the segmentation scheme includes a segmentation method for the input data of the operator and a plurality of base operators corresponding to the segmentation method;

The segmentation unit is used to segment the input data of the operator according to the segmentation scheme to obtain the input data of multiple base operators.

In some embodiments, the slicing scheme further includes calling a computing resource allocation method of multiple base operators, where the computing resource allocation method is associated with at least one of the following:

A method for allocating computing resources based on the number of thread bundles in the AI accelerator;

A method of allocating computing resources based on the number of logic blocks in the AI accelerator.

In some embodiments, the generating unit is used to:

Based on the preset shape set and the shape of the input data of the operator, a plurality of candidate segmentation schemes of the operator are generated, wherein the candidate segmentation schemes include candidate segmentation methods for the input data of the operator and a plurality of candidate base operators corresponding to the candidate segmentation methods;

Determine a segmentation scheme that meets the target conditions from multiple candidate segmentation schemes.

In some embodiments, the generating unit is used to:

Determine the cost of each candidate segmentation scheme, where the cost indicates the predicted time consumption of calling multiple candidate base operators to process data and obtain output data of the operators according to the candidate segmentation scheme;

The candidate segmentation scheme with the smallest cost among multiple candidate segmentation schemes is determined as the segmentation scheme.

In some embodiments, the candidate segmentation scheme further includes calling a computing resource allocation method of multiple candidate base operators, and the generating unit is used to:

The cost of the candidate segmentation scheme is determined based on the predicted time consumption of multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of computing resource allocation based on the computing resource allocation method.

In some embodiments, the acquisition module 1601 is further used to:

Get the input data of the operator sent by the host and the operator's segmentation plan, which includes the segmentation method for the operator's input data And multiple basis operators corresponding to the segmentation method;

The device also includes a sending module, which is used to send output data of multiple base operators to a host.

In some embodiments, the preset shape set includes at least two arbitrarily-differentiated numerical sequences, wherein the difference between the numerical value at the tail of the target sequence and the numerical value at the head of an adjacent sequence of the target sequence is greater than the tolerance of the target sequence.

In some embodiments, the values corresponding to the shapes of the input data of the plurality of basis operators are associated with the number of sequences in the preset shape set.

The data type corresponding to the instructions executed by the AI accelerator;

The data range corresponding to the instructions executed by the AI accelerator;

The size of cache space at each level in the AI accelerator.

In some embodiments, the calling module 1602 calls multiple base operators to process their respective input data, and uses a unified granularity to allocate cache space at various levels in the AI accelerator when reading and writing data.

Through the above device, a preset shape set is used to dynamically segment the input data of a target operator of any shape, thereby realizing the execution of a dynamic shape operator by calling multiple base operators corresponding to the target operator, effectively improving the operating performance of the AI accelerator.

It should be noted that: when the operator processing device provided in the above embodiment processes the operator, only the division of the above functional modules is used as an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the operator processing device provided in the above embodiment and the operator processing method embodiment belong to the same concept. The specific implementation process is detailed in the method embodiment and will not be repeated here.

In this application, the terms "first", "second", etc. are used to distinguish between identical or similar items with substantially the same effects and functions. It should be understood that there is no logical or temporal dependency between "first", "second", and "nth", nor is the quantity and execution order limited. It should also be understood that although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of the various described examples, the first processing core can be referred to as the second processing core, and similarly, the second processing core can be referred to as the first processing core. Both the first processing core and the second processing core can be processing cores, and in some cases, can be separate and different processing cores.

In this application, the term "at least one" means one or more, and the term "plurality" means two or more. For example, a plurality of processing cores means two or more processing cores.

The above description is only a specific implementation of the present application, but the protection scope of the present application is not limited thereto. Any technician familiar with the technical field can easily think of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements should be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be based on the protection scope of the claims.

In the above embodiments, all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented by software, all or part of the embodiments may be implemented in the form of program structure information. The program structure information includes one or more program instructions. When the program instructions are loaded and executed on a computing device, all or part of the processes or functions in the embodiments of the present application are generated.

A person skilled in the art will understand that all or part of the steps to implement the above embodiments may be accomplished by hardware or by instructing related hardware through a program, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a disk or an optical disk, etc.

As described above, the above embodiments are only used to illustrate the technical solutions of the present application, rather than to limit them. Although the present application has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present application.

Claims

A method for processing an operator, characterized in that it is executed by an artificial intelligence (AI) accelerator, and the method comprises:

Obtain input data of multiple base operators corresponding to an operator, where the operator is a dynamic shape operator, and the input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set, and the multiple base operators are used to collaboratively implement the function of the operator, and the preset shape set includes multiple numerical values for segmenting data, and the numerical values are used to indicate the dimension value of the data;

The multiple base operators are called to process the respective input data to obtain output data of the multiple base operators.
The method according to claim 1, characterized in that the step of obtaining input data of multiple base operators corresponding to the operator comprises:

Based on the preset shape set and the shape of the input data of the operator, generating a segmentation scheme of the operator, the segmentation scheme including a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;

According to the segmentation scheme, the input data of the operator is segmented to obtain the input data of the multiple base operators.
The method according to claim 2, characterized in that the segmentation scheme further comprises calling a computing power resource allocation method of the multiple base operators, and the computing power resource allocation method is associated with at least one of the following:

A method for allocating computing resources based on the number of cores in the AI accelerator;

A method for allocating computing resources based on the number of threads in the AI accelerator;

A method for allocating computing resources based on the number of thread warps in the AI accelerator;

A method for allocating computing resources based on the number of logic blocks in the AI accelerator.
The method according to claim 2 or 3, characterized in that the generating of the segmentation scheme of the operator based on the preset shape set and the shape of the input data of the operator comprises:

Based on the preset shape set and the shape of the input data of the operator, generating a plurality of candidate segmentation schemes of the operator, the candidate segmentation schemes including candidate segmentation methods for the input data of the operator and a plurality of candidate base operators corresponding to the candidate segmentation methods;

The segmentation scheme that meets the target condition is determined from the multiple candidate segmentation schemes.
The method according to claim 4, characterized in that the step of determining the segmentation scheme that meets the target condition from the plurality of candidate segmentation schemes comprises:

Determine a cost of each candidate segmentation scheme, where the cost indicates a predicted time consumption of calling the plurality of candidate base operators to perform data processing according to the candidate segmentation scheme to obtain output data of the operators;

A candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes is determined as the segmentation scheme.
The method according to claim 4 or 5 is characterized in that the candidate segmentation scheme further includes calling the computing resource allocation mode of the multiple candidate base operators, and the determining the cost of each candidate segmentation scheme includes:

The cost of the candidate segmentation scheme is determined based on the predicted time consumption of multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of computing power resource allocation based on the computing power resource allocation method.
The method according to claim 1, characterized in that the method further comprises:

Acquire the input data of the operator and the segmentation scheme of the operator sent by the host, wherein the segmentation scheme includes a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;

The method further includes: sending output data of the plurality of base operators to the host.
The method according to any one of claims 1 to 7 is characterized in that the preset shape set includes at least two arithmetic progression value sequences, wherein the difference between the value at the tail of the target sequence and the value at the head of an adjacent sequence of the target sequence is greater than the tolerance of the target sequence.
The method according to claim 8 is characterized in that the numerical values corresponding to the shapes of the input data of the multiple basis operators are associated with the number of sequences in the preset shape set.
The method according to any one of claims 1 to 9, characterized in that the magnitude of the numerical value in the preset shape set is associated with at least one of the following:

The data type corresponding to the instruction executed by the AI accelerator;

The data range corresponding to the instruction executed by the AI accelerator;

The size of cache space at each level in the AI accelerator.
The method according to any one of claims 1 to 10 is characterized in that, in the process of calling the multiple base operators to process the respective input data, a uniform granularity is used to allocate cache space at various levels in the AI accelerator when reading and writing data.
An operator processing device, characterized in that it is configured in an AI accelerator, and the device comprises:

An acquisition module, used to acquire input data of multiple base operators corresponding to an operator, wherein the operator is a dynamic shape operator, and the input data of each base operator is obtained by segmenting the input data of the operator based on a preset shape set, and the multiple base operators are used to collaboratively implement the function of the operator, and the preset shape set includes multiple numerical values for segmenting data, and the numerical values are used to indicate the dimension value of the data;

The calling module is used to call the multiple basic operators, process the respective input data, and obtain the output data of the multiple basic operators.
The device according to claim 12, characterized in that the acquisition module comprises:

A generating unit, configured to generate a segmentation scheme for the operator based on the preset shape set and the shape of the input data of the operator, wherein the segmentation scheme includes a segmentation method for the input data of the operator and the plurality of base operators corresponding to the segmentation method;

A segmentation unit is used to segment the input data of the operator according to the segmentation scheme to obtain the input data of the multiple base operators.
The device according to claim 13, characterized in that the segmentation scheme further includes calling a computing power resource allocation method of the multiple base operators, and the computing power resource allocation method is associated with at least one of the following:

A method for allocating computing resources based on the number of cores in the AI accelerator;

A method for allocating computing resources based on the number of threads in the AI accelerator;

A method for allocating computing resources based on the number of thread warps in the AI accelerator;

A method for allocating computing resources based on the number of logic blocks in the AI accelerator.
The device according to claim 13 or 14, characterized in that the generating unit is used to:

Based on the preset shape set and the shape of the input data of the operator, generating a plurality of candidate segmentation schemes of the operator, the candidate segmentation schemes including candidate segmentation methods for the input data of the operator and a plurality of candidate base operators corresponding to the candidate segmentation methods;

The segmentation scheme that meets the target condition is determined from the multiple candidate segmentation schemes.
The device according to claim 15, characterized in that the generating unit is used to:

Determine a cost of each candidate segmentation scheme, where the cost indicates a predicted time consumption of calling the plurality of candidate base operators to perform data processing according to the candidate segmentation scheme to obtain output data of the operators;

A candidate segmentation scheme with the smallest cost among the multiple candidate segmentation schemes is determined as the segmentation scheme.
The device according to claim 15 or 16, characterized in that the candidate segmentation scheme further includes a computing resource allocation method for calling the multiple candidate base operators, and the generating unit is used to:

The cost of the candidate segmentation scheme is determined based on the predicted time consumption of multiple candidate base operators indicated by the candidate segmentation scheme and the predicted time consumption of computing power resource allocation based on the computing power resource allocation method.
The device according to claim 12, characterized in that the acquisition module is further used to:

Acquire the input data of the operator and the segmentation scheme of the operator sent by the host, wherein the segmentation scheme includes a segmentation method for the input data of the operator and the multiple base operators corresponding to the segmentation method;

The device further includes a sending module, and the sending module is used to send output data of the multiple base operators to the host.
The device according to any one of claims 12 to 18 is characterized in that the preset shape set includes at least two arithmetically differing numerical sequences, wherein the difference between the numerical value at the tail of the target sequence and the numerical value at the head of an adjacent sequence of the target sequence is greater than the tolerance of the target sequence.
The device according to claim 19 is characterized in that the numerical values corresponding to the shapes of the input data of the multiple basis operators are associated with the number of sequences in the preset shape set.
The device according to any one of claims 12 to 20, characterized in that the magnitude of the numerical value in the preset shape set is associated with at least one of the following:

The data type corresponding to the instruction executed by the AI accelerator;

The data range corresponding to the instruction executed by the AI accelerator;

The size of cache space at each level in the AI accelerator.
The device according to any one of claims 12 to 21 is characterized in that, in the process of the calling module calling the multiple base operators to process the respective input data, a uniform granularity is used to allocate cache space at various levels in the AI accelerator when reading and writing data.
A chip, characterized in that it is configured as an AI accelerator, the AI accelerator comprising a communication interface and at least one AI processing core, the communication interface is used to provide program instructions and/or data to the at least one processing core, and the at least one AI processing core is used to implement the processing method of the operator as described in any one of claims 1 to 11 above.
A computing device, characterized in that it includes a host and an AI accelerator, the host is used to send data to the AI accelerator and receive data sent by the AI accelerator, and the AI accelerator is used to execute the processing method of the operator as described in any one of claims 1 to 11 above.
A computing device cluster, characterized in that it includes multiple computing devices, the computing devices include a host and an AI accelerator, the host is used to send data to the AI accelerator, receive data sent by the AI accelerator, and the AI accelerator is used to execute the processing method of the operator as described in any one of claims 1 to 11 above.
A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store at least one section of program code, and the at least one section of program code is used to execute the operator processing method according to any one of claims 1 to claim 11.
A computer program product, characterized in that when the computer program product runs on an AI accelerator, the AI accelerator executes the operator processing method according to any one of claims 1 to 11.