CN115587922A - Tensor blocking method and device and storage medium - Google Patents

Tensor blocking method and device and storage medium

Info

Publication number
CN115587922A
Authority
CN
China
Prior art keywords
tensor
blocking
model
operator information
operator
Prior art date
Legal status
Pending
Application number
CN202110760579.1A
Other languages
Chinese (zh)
Inventor
张松飞
姚棋中
吴辉阳
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110760579.1A
Publication of CN115587922A

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application relates to a tensor blocking method, a tensor blocking device and a storage medium. The tensor blocking method includes: inputting training data into a first model, where the first model is used for determining a blocking strategy of a tensor on each cache module of a processor, the processor is used for executing an operator operation of the tensor, and the training data includes operator information corresponding to at least one operator type; outputting, by the first model, a first prediction result that satisfies the constraint conditions corresponding to each cache module; evaluating the combination of the first prediction results of the cache modules to obtain a first evaluation result; and performing iterative training on the first model according to the combination of the first prediction results and the first evaluation result until a training convergence condition is met, to obtain a trained first model. According to the embodiments of the application, the generalization performance of the model can be improved, a better tensor blocking strategy can be obtained online, and the computing performance of the processor can be improved.

Description

Tensor blocking method and device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a tensor blocking method, device, and storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. That is, artificial intelligence studies the design principle and implementation method of various intelligent machines, so that the machine has the functions of perception, reasoning and decision making.
As an important research direction of AI, a Deep Learning (DL) model is composed of individual computing units, which may be called Operators (OPs). Since the memory of a computing device is limited, in order to meet the huge computing-power requirement of complex operators, it is inevitably necessary to read, compute and output an input tensor in blocks (tiling). However, when processing image-type tensors, for example, current tensor blocking methods are time-consuming and generalize poorly, so a method capable of improving the computing performance of the processor is urgently needed.
Disclosure of Invention
In view of the above, a tensor blocking method, apparatus and storage medium are provided.
In a first aspect, an embodiment of the present application provides a tensor blocking method, including: inputting training data into a first model, where the first model is used for determining a blocking strategy of a tensor on each cache module of a processor, the processor is used for executing an operator operation of the tensor, and the training data includes operator information corresponding to at least one operator type; outputting, by the first model, first prediction results, where the first prediction results satisfy constraint conditions corresponding to each cache module, each group of first prediction results corresponds to one cache module of the processor, and the constraint conditions include a first constraint condition of the hardware specification of a cache module and/or a second constraint condition between cache modules; evaluating the combination of the first prediction results of the cache modules to obtain a first evaluation result; and performing iterative training on the first model according to the combination of the first prediction results and the first evaluation result until a training convergence condition is met, to obtain a trained first model.
According to this embodiment of the application, the first model is used for determining the blocking strategy of the tensor on each cache module of the processor, the processor is used for executing the operator operation of the tensor, and each group of first prediction results corresponds to one cache module of the processor. Because the training data includes operator information corresponding to at least one operator type, the model can be applied to multiple operator types, the generalization performance of the model is improved, and a better blocking strategy can be obtained, thereby improving the computing performance of the processor. Because the first prediction results satisfy the constraint conditions corresponding to each cache module, where the constraint conditions include the first constraint condition of the hardware specification and/or the second constraint condition between cache modules, the search range of the model during training can be reduced and the training speed of the model improved, while the blocking result remains compatible with the relevant constraints of the processor.
In a first possible implementation of the tensor blocking method according to the first aspect, the first model outputting a first prediction result includes: determining, for each dimension of the tensor on each cache module, one second prediction result from at least one second prediction result that satisfies the constraint conditions; and determining a group of first prediction results, the group of first prediction results including a combination of the one second prediction result corresponding to each dimension of the tensor on the corresponding cache module.
According to this embodiment of the application, one second prediction result is determined from at least one second prediction result that satisfies the constraint conditions for each dimension of the tensor on each cache module, and a group of first prediction results is determined, the group of first prediction results including a combination of the second prediction results corresponding to the dimensions of the tensor on the corresponding cache module. In this way, a legal blocking strategy can be selected for evaluation during training, which improves the training efficiency of the model.
According to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner of the tensor blocking method, the first constraint condition is that the size of the tensor after blocking is not larger than the size specified by the hardware specification of the corresponding cache module; the second constraint condition is that the size of the tensor after the blocking is not larger than the size specified by the hardware specification of the next cache module under the condition that the next cache module exists.
According to the embodiment of the application, the search range in the model training process can be reduced, the prediction result is suitable for the corresponding constraint of the processor, and the model training speed is improved.
In a third possible implementation of the tensor blocking method according to the first aspect, the method further includes: inputting tuning data into the trained first model to obtain a third prediction result, where the tuning data includes operator information corresponding to a single operator type; evaluating the third prediction result to obtain a second evaluation result; performing iterative training on the trained first model again according to the third prediction result and the second evaluation result until a training convergence condition is met; and determining a first blocking strategy from the third prediction results according to the third prediction result output in each iteration and the corresponding second evaluation result, and recording the first blocking strategy and the corresponding first operator information in the tuning data.
According to this embodiment of the application, the trained first model is tuned, so a unified tuning and training scheme can be used and the maintenance cost of training and tuning is reduced. Because the tuning data includes operator information corresponding to a single operator type, the tuned model can better adapt to the single operator, the model converges faster, and the operator runs more efficiently on the processor, improving the computing performance of the operator on the processor. By recording the first blocking strategy and the corresponding first operator information in the tuning data, online inference by the model is not needed subsequently, which improves efficiency.
In a fourth possible implementation manner of the tensor blocking method according to the first aspect as such or the first or the second or the third possible implementation manner of the first aspect, the method further includes: inputting second operator information into the trained first model to obtain a blocking strategy of the second operator information, wherein the blocking strategy of the second operator information comprises blocking strategies of tensors corresponding to the second operator information on each cache module of the processor.
According to the embodiment of the application, the blocking strategy of the second operator information is obtained by inputting the second operator information into the trained first model, the blocking strategy of the tensor on the cache module of the processor can be generated through online inference, the compiling time is short, the generalization performance of the model is better, and the operational efficiency of the operator is improved.
In a fifth possible implementation manner of the tensor blocking method according to the first aspect as such or the first or the second or the third or the fourth possible implementation manner of the first aspect, the method further includes: determining whether operator information matched with the second operator information exists in a knowledge base or not according to the second operator information, wherein the knowledge base comprises the operator information and a corresponding blocking strategy, and the operator information in the knowledge base comprises the first operator information; determining a blocking strategy corresponding to the operator information matched with the second operator information as the blocking strategy of the second operator information under the condition that the operator information matched with the second operator information exists; otherwise, inputting the second operator information into the trained first model to obtain the blocking strategy of the second operator information.
According to this embodiment of the application, whether operator information matching the second operator information exists in the knowledge base is determined according to the second operator information. When matching operator information exists, the blocking strategy corresponding to the matching operator information is determined as the blocking strategy of the second operator information, so that a better blocking strategy can be produced for a single operator, the processor runs the single operator with better performance, and the processing speed is improved. When no matching operator information exists, the second operator information is input into the trained first model to obtain the blocking strategy of the second operator information, so that inference can be performed for multiple types of operators, the generalization performance during inference is better, and the operation performance on the processor is improved.
According to a fourth possible implementation manner or a fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the tensor blocking method, the second operator information is operator information corresponding to a convolution operator, the tensor is image data and/or weight data, and a blocking policy of the second operator information is used for indicating that the image data and/or the weight data are blocked on each cache module of the processor.
According to this embodiment of the application, the processor's operation performance for the convolution operator can be optimized, and the blocking strategies of the corresponding image data and weight data on each cache module of the processor are obtained.
In a second aspect, an embodiment of the present application provides a tensor blocking apparatus, which includes: an input module, configured to input training data into a first model, where the first model is used to determine a blocking strategy of a tensor on each cache module of a processor, the processor is used to perform an operator operation of the tensor, and the training data includes operator information corresponding to at least one operator type; the output module is used for outputting a first prediction result by the first model, the first prediction result meets constraint conditions corresponding to each cache module, each group of first prediction results corresponds to one cache module of the processor, and the constraint conditions comprise first constraint conditions of hardware specifications of the cache modules and/or second constraint conditions among the cache modules; the first evaluation module is used for evaluating the combination of the first prediction results of the cache modules to obtain a first evaluation result; and the first iterative training module is used for performing iterative training on the first model according to the combination of the first prediction results and the first evaluation result until a training convergence condition is met to obtain a trained first model.
In a first possible implementation of the tensor blocking apparatus according to the second aspect, the outputting of a first prediction result by the first model includes: determining a second prediction result from at least one second prediction result that satisfies the constraint conditions, for each dimension of the tensor on each cache module; and determining a group of first prediction results, the group of first prediction results including a combination of the one second prediction result corresponding to each dimension of the tensor on the corresponding cache module.
In a second possible implementation manner of the tensor blocking apparatus according to the second aspect as such or the first possible implementation manner of the second aspect, the first constraint condition is that a size of the tensor after blocking is not larger than a size specified by a hardware specification of the corresponding cache module; the second constraint condition is that the size of the tensor after the blocking is not larger than the size specified by the hardware specification of the next cache module under the condition that the next cache module exists.
In a third possible implementation of the tensor blocking apparatus according to the second aspect, the apparatus further includes: a first determining module, configured to input tuning data into the trained first model to obtain a third prediction result, where the tuning data includes operator information corresponding to a single operator type; a second evaluation module, configured to evaluate the third prediction result to obtain a second evaluation result; a second iterative training module, configured to perform iterative training again on the trained first model according to the third prediction result and the second evaluation result until a training convergence condition is met; and a second determining module, configured to determine a first blocking strategy from the third prediction results according to the third prediction result output in each iteration and the corresponding second evaluation result, and record the first blocking strategy and the corresponding first operator information in the tuning data.
In a fourth possible implementation manner of the tensor blocking apparatus according to the second aspect as such or the first, second or third possible implementation manner of the second aspect, the apparatus further includes: and the third determining module is used for inputting second operator information into the trained first model to obtain a blocking strategy of the second operator information, wherein the blocking strategy of the second operator information comprises blocking strategies of tensors corresponding to the second operator information on each cache module of the processor.
In a fifth possible implementation manner of the tensor blocking apparatus according to the second aspect as such or the first or the second or the third or the fourth possible implementation manner of the second aspect, the apparatus further includes: a fourth determining module, configured to determine, according to second operator information, whether operator information matched with the second operator information exists in a knowledge base, where the knowledge base includes the operator information and a corresponding blocking policy, and the operator information in the knowledge base includes the first operator information; a fifth determining module, configured to determine, when there is operator information that matches the second operator information, that a blocking policy corresponding to the operator information that matches the second operator information is the blocking policy of the second operator information; otherwise, inputting the second operator information into the trained first model to obtain the blocking strategy of the second operator information.
According to a fourth possible implementation manner or a fifth possible implementation manner of the second aspect, in a sixth possible implementation manner of the tensor blocking apparatus, the second operator information is operator information corresponding to a convolution operator, the tensor is image data and/or weight data, and a blocking policy of the second operator information is used for indicating that the image data and/or the weight data are blocked on each cache module of the processor.
In a third aspect, an embodiment of the present application provides a tensor blocking apparatus, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to execute the instructions to implement the tensor blocking method of the first aspect or of one or more of the possible implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement a tensor blocking method of the above first aspect or one or more of many possible implementations of the first aspect.
In a fifth aspect, an embodiment of the present application provides a terminal device, where the terminal device may perform the tensor blocking method of the first aspect or one or more of the multiple possible implementations of the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, which includes computer readable code or a non-transitory computer readable storage medium carrying computer readable code, when the computer readable code is run in an electronic device, a processor in the electronic device performs a tensor blocking method of the first aspect or one or more of many possible implementations of the first aspect.
These and other aspects of the present application will be more readily apparent from the following description of the embodiment(s).
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating data flow when a convolution operator operates according to an embodiment of the present application.
Fig. 3 shows a schematic diagram of a system architecture according to an embodiment of the present application.
Fig. 4 shows a block diagram of a tensor blocking apparatus according to an embodiment of the present application.
FIG. 5 illustrates an interaction diagram of a generalized model training phase according to an embodiment of the present application.
FIG. 6 illustrates a flow diagram of a tensor blocking method of a generalized model training phase according to an embodiment of the present application.
FIG. 7 is a diagram illustrating generalized model training based on a reinforcement learning framework according to an embodiment of the present application.
FIG. 8 shows an interaction diagram of a model tuning phase according to an embodiment of the application.
FIG. 9 illustrates an interaction diagram of an inference phase according to an embodiment of the application.
Figure 10 illustrates a flow diagram of a tensor blocking method of an inference phase according to an embodiment of the present application.
FIG. 11 shows a flow diagram of a tensor blocking method according to an embodiment of the present application.
FIG. 12 shows a flow diagram of a tensor blocking method according to an embodiment of the present application.
FIG. 13 shows a flow diagram of a tensor blocking method according to an embodiment of the present application.
FIG. 14 shows a flow diagram of a tensor blocking method according to an embodiment of the present application.
Fig. 15 is a block diagram of a tensor blocking apparatus according to an embodiment of the present application.
Fig. 16 is a block diagram of a tensor blocking apparatus according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application. The tensor blocking method in the embodiments of the application can be used in scenarios where blocked tensors are operated on by processors such as a neural Network Processing Unit (NPU), a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a Tensor Processing Unit (TPU). The NPU/CPU/GPU/TPU and the like can be implemented by chips with different architectures. As shown in Fig. 1, the computation description (compute) phase may describe the operation of an operator with a mathematical computation expression, and the blocking (tiling) and computation scheduling (schedule) phases may represent operations of blocking the Intermediate Representation (IR) produced in the computation description phase and scheduling operator operations on a processor according to the tensor blocking method of the embodiments of the present application.
Taking a convolution operator as an example, in the computation scheduling stage, the tensor corresponding to A may be the input image data, the tensor corresponding to B may represent the weights, and the tensor corresponding to C may represent the result of performing a matrix multiplication operation on A and B.
In the blocking and computation scheduling stage, taking a convolution operator operation on an NPU under the DaVinci architecture as an example, L1, L0A, L0B, L0C, and UB may represent cache modules (buffers) on the NPU. The closer a cache module is to the computation core (cube), the smaller its capacity and the faster its read-write speed; for example, among these cache modules, L0C may be the cache module closest to the computation core. The arrows may represent the flow direction of the data stream, and the shaded regions on each cache module and within the blocks A, B, and C may represent the size of a block of the tensor. The tensors A and B may be stored in a memory, such as the Double Data Rate (DDR) memory in Fig. 1. For A and B, the data flow when performing the convolution operator operation on the NPU can be seen in Fig. 2 below.
Fig. 2 is a schematic diagram illustrating the data flow when a convolution operator performs an operation according to an embodiment of the present application. As shown in Fig. 2, on an NPU under the DaVinci architecture, feature_map_X may represent tensor A in Fig. 1, i.e., the input image data, kernel_w may represent tensor B in Fig. 1, i.e., the weights, mov may represent a data transfer instruction, load_3d and load_2d may represent format conversion instructions, and mad may represent an instruction of a matrix multiplication operation.
First, feature_map_X and kernel_w may be moved into the cache module L1, represented by X_L1 and kernel_L1, respectively. Then, feature_map_X and kernel_w may be moved into the cache modules L0A and L0B, respectively, and format conversion may be performed on them using the load_3d and load_2d instructions. Next, a mad instruction may be executed on the data in L0A and L0B to obtain a result (such as tensor C in Fig. 1), which is moved to the cache module L0C through a mov instruction and may be represented by X_L0C. Then, the data in L0C may be transferred to the cache module UB, where a vector operation may also be performed on the data, for example, adding a bias term (bias); the data in UB may be represented by X_UB. Finally, the data may be transferred to the DDR memory, and the result in the memory may be represented by X_DDR.
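For reference, this data flow can be summarized as an ordered list of (instruction, source, destination) steps. The following Python sketch is illustrative only: the instruction and buffer names follow the description of Fig. 2, while the intermediate names (e.g., X_L0A, kernel_L0B) and the list representation itself are assumptions, not an actual NPU instruction stream.

```python
# Illustrative sketch only: not a real NPU instruction stream.
CONV_PIPELINE = [
    ("mov",     "feature_map_X (DDR)", "X_L1 (L1)"),
    ("mov",     "kernel_w (DDR)",      "kernel_L1 (L1)"),
    ("load_3d", "X_L1 (L1)",           "X_L0A (L0A)"),
    ("load_2d", "kernel_L1 (L1)",      "kernel_L0B (L0B)"),
    ("mad",     "X_L0A x kernel_L0B",  "X_L0C (L0C)"),
    ("mov",     "X_L0C (L0C)",         "X_UB (UB), optional vector ops such as adding bias"),
    ("mov",     "X_UB (UB)",           "X_DDR (DDR)"),
]

for instr, src, dst in CONV_PIPELINE:
    print(f"{instr:<8} {src:<24} -> {dst}")
```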
It should be noted that, the present application only takes the operator operation performed in the NPU as an example, and the process of the operator operation may also be performed on the CPU/GPU/TPU, which is not limited in the present application.
When an operator operation is performed on a processor, the input data cannot be read into each cache module at one time when the tensor is large, and the cache modules need to support concurrent pipeline computation. Therefore, the data in each cache module may be blocked by an operator compiler so that the processor can compute on the blocked tensors; the operator compiler may also run on the processor. The data in the cache modules in Fig. 2 may represent the block units obtained after blocking the tensors (refer to the shaded regions of blocks A, B, and C and of each cache module in Fig. 1). Different tensor blocking schemes lead to different operator performance on the processor; therefore, providing a better blocking strategy for the processor can greatly improve the computing performance of the processor, save its computing resources, and improve its computing efficiency.
Fig. 3 shows a schematic diagram of a system architecture according to an embodiment of the present application. As shown in Fig. 3, the tensor blocking apparatus may be deployed in a tensor acceleration engine (TBE) or a Tensor Virtual Machine (TVM) and may be used to generate a tensor blocking strategy during the compilation of an operator; the TBE/TVM, as part of an operator compiler, may call the tensor blocking apparatus to obtain the tensor blocking strategy and block the tensor during operator compilation; the upper layer of the operator compiler may be an open-source deep learning framework (such as MindSpore or TensorFlow), used to implement the corresponding neural network tasks; the lower layer of the operator compiler may be a computing unit, which may be deployed in a processor, for example, and an evaluation board (EVB), where the computing unit may be configured to perform operator operations according to the blocked tensors and the EVB may be configured to run performance tests on the blocked tensors to obtain performance evaluation results; the operator compiler may also interact with a storage unit, which may be deployed, for example, on DDR memory and may be used to store data needed by the tensor blocking apparatus while generating the tensor blocking strategy.
Through the better tensor blocking strategy generated by the tensor blocking apparatus, the TBE/TVM can block tensors better during operator compilation, so that, while the hardware parameter specification of each cache module in the computing unit is met, the requirement of the corresponding operator for pipelined concurrent computation on the computing unit is satisfied, the capability of the processor can be exploited to the maximum extent, and the computing performance of the processor is improved. Meanwhile, the computing unit performs computation according to the blocked tensors, and the computation result can be used by the deep learning model to perform the corresponding task; for example, in an image classification task, the output of the deep learning model can be the classification result of the image.
Fig. 4 shows a block diagram of a tensor blocking apparatus according to an embodiment of the present application. As shown in Fig. 4, the tensor blocking apparatus may include a generalized model training module, a model tuning module, and an inference module. The generalized model training module can be used to train an initialized transformer model to obtain a trained generalized model that is applicable to multiple operators and used to generate tensor blocking strategies; the model tuning module can be used to tune the trained generalized model obtained by the generalized model training module, so as to obtain the optimal blocking result for a tensor of a specific shape under a single specific operator; the inference module can be used to perform inference on the input operator information according to the trained generalized model, so as to obtain the blocking result of the corresponding tensor.
The tensor blocking method according to the embodiments of the present application is exemplarily described below with reference to the modules in Figs. 3 to 4 and to Figs. 5 to 10, taking a convolution operator operation on image data and weight data as an example. It should be noted that the tensor blocking method according to the embodiments of the present application may also be used for other tensors and other operators, which is not limited by the present application. The tensor blocking method in the embodiments of the application may include three stages: a generalized model training stage, a model tuning stage, and an inference stage, where the generalized model training stage and the model tuning stage may be performed offline and the inference stage may be performed online.
FIG. 5 illustrates an interaction diagram of the generalized model training phase according to an embodiment of the present application. The generalized model training stage may be performed in the generalized model training module. As shown in Fig. 5, in this stage a data set may be used to store training data, and the training data may include training data corresponding to multiple operators and multiple tensor shapes, where the operators may include conv2d operators, matmul operators, and other operator types, and the tensors may include image data or other types of data. The initialized model may be trained for generalization using the training data; for example, the model may be a transformer model or another model, which is not limited in this application.
FIG. 6 illustrates a flow diagram of a tensor blocking method of a generalized model training phase according to an embodiment of the present application. As shown in fig. 6, the tensor blocking method of the generalized model training phase may include:
Step S601, inputting training data into the model.
The data set of training data may be stored in a database, which may be stored in the storage unit described above. The training data may include operator information, and the operator information may include, for example, an operator type, the data type of the tensor input to the operator, the shape of the tensor, and the like, and may represent the information or parameters that affect the computation scale when the operator operation is performed. The operator type may include, for example, activation operators (relu, sigmoid, etc.), convolution operators (conv2d, conv3d, etc.), and the like; the data type of the tensor may refer to the data type of the elements in the tensor, such as integer (int), floating point (float), or character (char); the shape of the tensor represents the parameters or information describing shape features such as tensor size, dimension, and number of channels, for example a 0-dimensional tensor, 1-dimensional tensor, or 2-dimensional tensor, and may further include other shape information. For example, for a convolution operator, the training data may include operator information of the convolution operator corresponding to a color image and weight data, where the shape of the tensor in the operator information may include the batch size (e.g., number of images), number of channels (e.g., the three channels red, green and blue), and height and width (e.g., the size of a photograph).
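As an illustration, one piece of operator information could be represented by a simple record such as the following Python sketch; the field names are assumptions used for illustration only, since the embodiment merely specifies that the operator type, the data type of the tensor, and the shape of the tensor are included.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one training sample of operator information.
@dataclass
class OperatorInfo:
    op_type: str          # e.g. "conv2d", "matmul", "relu"
    dtype: str            # data type of the tensor elements, e.g. "float16", "int8"
    shape: List[int]      # e.g. [N, C, H, W]: batch size, channels, height, width

# Example: a conv2d operator fed with a batch of 8 RGB images of size 224x224.
sample = OperatorInfo(op_type="conv2d", dtype="float16", shape=[8, 3, 224, 224])
```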
The model may be, for example, an initialized transformer model.
Step S602, for each cache module in the processor, the checker judges which blocking strategies among at least one candidate blocking strategy are legal, and the model determines the final complete blocking strategy.
The cache modules in the processor may be designed according to a multi-level cache architecture in which, from far to near relative to the processor's computation core, capacity goes from large to small and read-write speed goes from slow to fast; refer to the cache modules in Fig. 1, for example L0 (L0A, L0C), L1, UB, etc. For example, for a certain cache module L0C under the DaVinci architecture, according to the hardware specification constraint, a configurable dimension range of [nc, mc] may be provided, where nc and mc may respectively represent the configurable values of a two-dimensional tensor blocked in its two dimensions; that is, a tensor blocked into pieces whose length is less than or equal to nc and whose width is less than or equal to mc can be stored on the cache module L0C. Other cache modules also have such configurable dimension ranges, and the configurable dimension ranges between cache modules have predetermined mutual constraints; these constraints allow the pipelined computation corresponding to a certain operator, such as the pipelined computation corresponding to a convolution operator, to work normally on the cache modules. Judging whether a blocking strategy of a tensor on a cache module is legal means judging whether the block sizes it produces stay within the ranges allowed by the hardware specification constraints. The checker can judge the legality of each possible blocking strategy, corresponding to each dimension, on a certain cache module according to the hardware specification constraint of each cache module and the constraints between cache modules, and a blocking strategy can be considered legal when it satisfies the hardware specification constraint of each cache module and the constraints between cache modules.
For each dimension of the tensor, the checker may judge whether all possible blocking strategies of the tensor in that dimension are legal; a blocking strategy may include, for example, the dimension value of each block in that dimension after the tensor is blocked along it. The model selects one of the legal blocking strategies as the blocking strategy of the tensor in that dimension on the corresponding cache module. The checker may then determine the possible blocking strategies in the next dimension according to the selected blocking strategy in the current dimension, judge whether each of them is legal, and the model continues to select, until the model has selected blocking strategies in all dimensions of the tensor on that cache module, and these blocking strategies are combined. The checker may then continue to judge the legality of the blocking strategies in each dimension of the tensor on the next cache module, and the model continues to select, until the model has selected a group of blocking strategies on all cache modules. Combining the blocking strategies selected by the model on all cache modules yields a complete blocking strategy for the corresponding pipelined computation.
For each cache module, the condition by which the model selects one blocking strategy from the at least one blocking strategy that the checker has determined to be legal may be preset, and this is not limited here. For example, for a certain dimension of the tensor, the model may score the legal values corresponding to that dimension among the at least one blocking strategy and select one blocking strategy in that dimension according to the scoring result (for example, select the blocking strategy corresponding to the highest score); after the blocking strategies in all dimensions have been selected, they may be combined into a group of blocking strategies on that cache module. The specific manner in which the model scores is not limited in this application.
For example, in a certain dimension (e.g., length or width) of the tensor corresponding to a certain cache module (e.g., L0C), the judgment result of the checker may be tiling_mask: [0, 1, ...], where each value in the tiling_mask corresponds to one possible blocking strategy of the tensor in that dimension on that cache module (e.g., L0C) and represents whether that blocking strategy is legal. For example, a value of 0 may represent that the corresponding blocking strategy is illegal (i.e., the image data cannot be used on L0C after being blocked in that dimension according to that blocking strategy), and a value of 1 may represent that the corresponding blocking strategy is legal (i.e., the image data can be used on L0C after being blocked in that dimension according to that blocking strategy). The inference result of the model may be tiling_ids: [id1, id2, ...], where the tiling_ids may represent the blocking strategy of the image data on a certain cache module (e.g., L0C). In this case, each value of tiling_ids (one of id1, id2, ...) may correspond to a different dimension of the image data on L0C, and each value of tiling_ids may represent the blocking strategy on L0C in the corresponding dimension of the image data. For example, when determining id1, the model may select, from the range of selectable legal values (i.e., the values of 1) in the tiling_mask determined by the checker for the image data in the dimension corresponding to id1 on L0C, a corresponding blocking strategy as the blocking strategy (i.e., id1) in that dimension of the tensor. After the tiling_ids corresponding to L0C are determined, the checker may continue to perform the legality judgment on the blocking strategies in each dimension of the image data for the next cache module, and the model selects one of the corresponding selectable legal values for each dimension of the tensor, until the model has determined the values of tiling_ids on all cache modules. After that, the tiling_ids on all modules may be combined, finally determining a complete group of blocking strategies.
In the process of judging the possible blocking strategies on L0C, the checker may judge whether each blocking strategy is legal according to the hardware specification constraint of L0C and the constraint between L0C and other cache modules. For example, when the configurable dimension range of L0C is [nc, mc], the condition under which the checker judges a blocking strategy to be legal may include that the dimensions of each block do not exceed the range [nc, mc], where nc and mc respectively represent the upper limits of the two dimensions (e.g., length and width). Meanwhile, according to the hardware specification constraint of the next cache module on the computation pipeline corresponding to L0C (for example, UB), when the configurable dimension range of UB is [ac, bc], the condition under which the checker judges a blocking strategy for the current cache module L0C to be legal may further include that the dimensions of each block do not exceed the range [ac, bc], so that after the processor performs blocking according to the blocking strategy of the current cache module, the blocked tensor can be transported to the next cache module on the pipeline; a corresponding judgment result is thereby obtained, where ac and bc respectively represent the upper limits of the two dimensions.
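A minimal sketch of this legality judgment for one dimension is given below, assuming a list of candidate block sizes and upper limits taken from the hardware specifications of the current cache module and of the next cache module on the pipeline; the function and parameter names are illustrative and not part of the embodiment.

```python
from typing import List, Optional

# Hedged sketch of the checker's per-dimension legality test: a candidate block size is legal
# only if it fits the current cache module and, when a next module exists on the pipeline,
# also fits that module's configurable range.
def tiling_mask_for_dim(candidate_sizes: List[int],
                        cur_limit: int,
                        next_limit: Optional[int] = None) -> List[int]:
    mask = []
    for size in candidate_sizes:
        legal = size <= cur_limit and (next_limit is None or size <= next_limit)
        mask.append(1 if legal else 0)
    return mask

# Example: candidate block lengths for one dimension on L0C with limit nc=16,
# where the next module UB allows ac=32 in that dimension.
print(tiling_mask_for_dim([4, 8, 16, 32, 64], cur_limit=16, next_limit=32))  # [1, 1, 1, 0, 0]
```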
Since there is a certain size ordering among the hardware specifications of the cache modules, the order in which the model performs blocking-strategy inference for the cache modules may also be preset, which is not limited in this application, so that the checker's legality judgment on each cache module follows a predetermined order that matches the hardware specification constraints of the cache modules. For example, the pipelined computation of a convolution operator needs to pass through the cache modules L1, L0A/L0B, L0C, and UB in order; because the configurable range allowed by a cache module's hardware specification constraint gradually increases with its distance from the computation core, the model may first perform blocking-strategy inference for the cache modules whose configurable range is small and then for those whose configurable range is large, with the checker performing the legality judgment on the blocking strategies for each cache module in the same order, as shown in the sketch below.
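The following hedged sketch shows how tiling_ids could be assembled module by module in a predetermined order: for every cache module and every tensor dimension, only the candidate block sizes the checker marks as legal are offered to the model, which picks one of them. The candidate list, the example limits, the successor map, and the random stand-in for the model's scored selection are all illustrative assumptions.

```python
import random
from typing import Dict, List, Optional

CANDIDATES = [4, 8, 16, 32, 64]                    # assumed candidate block sizes per dimension

def legal(size: int, cur_limit: int, next_limit: Optional[int]) -> bool:
    # First constraint: fit the current module; second constraint: fit the next module, if any.
    return size <= cur_limit and (next_limit is None or size <= next_limit)

def choose_tiling(order: List[str],
                  limits: Dict[str, List[int]],
                  next_module: Dict[str, Optional[str]]) -> Dict[str, List[int]]:
    strategy = {}
    for buf in order:                                # predetermined inference order
        nxt = next_module.get(buf)
        tiling_ids = []
        for dim, cur_limit in enumerate(limits[buf]):
            next_limit = limits[nxt][dim] if nxt else None
            choices = [s for s in CANDIDATES if legal(s, cur_limit, next_limit)]
            tiling_ids.append(random.choice(choices))   # stand-in for the model's selection
        strategy[buf] = tiling_ids
    return strategy

print(choose_tiling(order=["L0A/L0B", "L0C", "UB", "L1"],
                    limits={"L1": [64, 64], "L0A/L0B": [16, 16], "L0C": [16, 16], "UB": [32, 32]},
                    next_module={"L1": "L0A/L0B", "L0A/L0B": "L0C", "L0C": "UB", "UB": None}))
```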
Step S603, the evaluation module performs evaluation according to a group of complete blocking strategies determined by the model, to obtain an evaluation result.
The evaluation module may compile, through an operator compiler (e.g., through a schedule template in the operator compiler), a group of blocking strategies determined by the model together with the operator information from which the blocking strategies were obtained, obtain executable binary code, and run it on the EVB to simulate the operator operation performed on the processor, thereby obtaining the run result of the blocking strategies. The evaluation module may evaluate the blocking strategies according to the run result on the EVB to obtain an evaluation result, which is, for example, an evaluation of the run duration or an evaluation of the storage consumption; this is not limited in this application.
Step S604, storing the data pairs corresponding to the evaluation results in a memory module, calculating the loss from the data stored in the memory module, and updating the model parameters.
The data pairs may include the evaluation results, the operator information corresponding to the evaluation results, and the blocking strategies, where the operator information and the blocking strategies may be obtained from the checker. The loss can be calculated from the data pairs stored in the memory module after inference, judgment, and evaluation have been performed on all the training data in the data set.
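A possible layout of such a data pair is sketched below; the record layout is an assumption, since the embodiment only states that the evaluation result and the corresponding operator information and blocking strategy are stored together.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical "data pair" record kept in the memory module of step S604.
@dataclass
class MemoryRecord:
    operator_info: dict                       # e.g. {"op_type": "conv2d", "dtype": "float16", ...}
    blocking_strategy: Dict[str, List[int]]   # tiling_ids per cache module
    evaluation: float                         # e.g. measured run duration on the EVB

memory_module: List[MemoryRecord] = []
```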
Step S605, performing iterative training on the model until the loss converges to a predetermined threshold, to obtain a trained generalized model.
The training process of the iterative training may be implemented based on the prior art, such as a Reinforcement Learning (RL) framework.
FIG. 7 illustrates a diagram of generalized model training based on a reinforcement learning framework according to an embodiment of the present application. The principle of RL is that an agent makes a decision (action) based on the current environment (state); the environment then gives the agent a reward based on that action and forms a new state, from which the agent makes a new decision. Through such continuous iteration, the agent can search in a better direction under the guidance of the reward. As shown in Fig. 7, the environment may be the EVB, the agent may be the initialized transformer model, the state may include the operator information and the multiple possible blocking strategies judged by the checker, and the action may be one of the blocking strategies selected by the model. Given the operator information and the selected blocking strategy, the blocking strategy is evaluated according to the run result on the EVB, and the reward may be calculated from the obtained evaluation result. The model may then be updated and iterated, the iterated model makes a new decision, the environment gives a new reward, and the above process may be repeated until the loss converges to a predetermined threshold, thereby obtaining a trained generalized model.
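The loop of Fig. 7 could be sketched as follows, under stated assumptions: model, checker, and evaluate_on_evb are placeholders for the transformer model, the checker, and the EVB-based evaluation, and the reward (negative run time) and the way the loss is computed from the stored data pairs are illustrative choices, since the embodiment does not fix a particular loss function.

```python
# Minimal RL-style training sketch; all helper names are assumptions.
def train_generalized_model(model, checker, evaluate_on_evb, dataset, loss_threshold, max_epochs=100):
    memory = []                                              # the memory module of step S604
    for _ in range(max_epochs):
        for op_info in dataset:                              # state: operator info + legal candidates
            strategy = model.select_blocking(op_info, checker)   # action: one legal blocking strategy
            reward = -evaluate_on_evb(op_info, strategy)         # reward from the evaluation result
            memory.append((op_info, strategy, reward))
        loss = model.update(memory)                          # compute loss and update parameters
        if loss <= loss_threshold:                           # training convergence condition
            break
    return model
```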
In a possible implementation, after the trained generalized model is obtained, the generalized model may be tuned to obtain the optimal blocking strategy for a tensor of a specific shape under a single specific operator; for example, the generalized model may be tuned to fit the optimal blocking strategy for image data and weight data under a convolution operator at a certain scale.
FIG. 8 shows an interaction diagram of the model tuning phase according to an embodiment of the application. The model tuning stage may be performed in the model tuning module. As shown in Fig. 8, in the model tuning stage, the data set may be used to store the tuning data, which may include operator information for a single operator and multiple tensor shapes, and the generalized model may be tuned using this data. The checker may be used to check the possible blocking strategies during tuning and judge whether they satisfy the hardware parameter specification of each cache module on the processor and the constraints between cache modules. The evaluation module may be used to evaluate a blocking strategy and record the evaluation result; the evaluation result may also serve as feedback to update the parameters of the model, which then performs inference again to obtain another blocking strategy. The above process is repeated until the loss of the model converges to a predetermined threshold, and the tuned generalized model may be obtained. At this time, the blocking strategy with the best performance (e.g., the best evaluation obtained by the evaluation module) among the multiple blocking strategies obtained for this operator information during tuning may be output, and the optimal blocking strategy and the corresponding operator information, e.g., the related information of one piece of image data of a convolution operator and the blocking strategies of that image data on multiple cache modules, may be stored in a knowledge base (which may be stored in the storage unit), ready for use in the inference stage. The condition for determining the blocking strategy with the best performance may be preset, for example the shortest predicted run time on the processor among all the blocking strategies, which is not limited in this application.
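A hedged sketch of this tuning-and-recording loop is given below; all helper names (select_blocking, update, evaluate_on_evb, make_key) are assumptions used only for illustration.

```python
# Tuning sketch: propose strategies for a single operator/tensor shape, evaluate and feed back,
# and record the best-performing strategy in the knowledge base with its operator information.
def tune_and_record(model, checker, evaluate_on_evb, make_key, op_info, knowledge_base, steps=100):
    best_strategy, best_score = None, float("inf")
    for _ in range(steps):
        strategy = model.select_blocking(op_info, checker)
        score = evaluate_on_evb(op_info, strategy)        # e.g. run duration, lower is better
        if score < best_score:
            best_strategy, best_score = strategy, score
        model.update([(op_info, strategy, -score)])       # feedback used to keep tuning the model
    knowledge_base[make_key(op_info)] = best_strategy     # make_key: hypothetical lookup-key builder
    return best_strategy
```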
In a possible implementation manner, model tuning can be performed on multiple operators and multiple tensor shapes respectively to obtain corresponding optimal blocking strategies and operator information, so as to store the optimal blocking strategies and the operator information in a knowledge base.
In the model tuning stage, the manner of interacting between the checker and the model may refer to step S602 in fig. 6, and the manner of evaluating by the evaluation module may refer to step S603 in fig. 6, which is not described herein again.
After the trained generalized model and the knowledge base storing multiple pieces of operator information and the corresponding optimal blocking strategies are obtained, in the inference stage, the blocking strategy corresponding to the input operator information can be obtained either by querying the knowledge base or by performing inference with the trained generalized model. The knowledge base may also store better-performing blocking strategies, and the corresponding operator information, obtained in ways other than tuning (such as manual experience).
FIG. 9 illustrates an interaction diagram of the inference phase according to an embodiment of the application. As shown in Fig. 9, the knowledge base may be obtained in the model tuning phase, and the trained generalized model may be obtained in the generalized model training phase. In the inference stage, in combination with the knowledge base, the input to the generalized model trained in the generalized model training stage may be operator information, and the output may be the blocking strategy corresponding to that operator information.
Figure 10 illustrates a flow diagram of a tensor blocking method of an inference phase according to an embodiment of the present application. The inference phase can be performed in the inference module, and as shown in fig. 10, the flow of the inference phase can include:
and step S1001, inputting the operator information into the generalization model.
For example, the operator information corresponding to the convolution operator may include an operator type (e.g., convolution operator), a tensor shape (e.g., a shape of a color image and a shape of weight data, see S601 in fig. 6), and a data type (e.g., floating point type) of tensor.
Step S1002, determine whether a knowledge base exists.
Step S1003, when the knowledge base is determined to exist, judging whether a corresponding blocking strategy exists in the knowledge base; otherwise, step S1005 is performed.
Here, it may be determined that a corresponding blocking strategy exists when operator information matching the input operator information exists in the knowledge base, or it may be determined that a corresponding blocking strategy exists when operator information matching part of the input operator information exists in the knowledge base (for example, both are convolution operators and the tensor shape, that is, the shape of the color image, is consistent); this is not limited in the present application.
Step S1004, when the corresponding blocking strategy exists in the knowledge base, outputting the corresponding blocking strategy from the knowledge base; otherwise, step S1005 is performed.
The blocking strategies stored in the knowledge base include the better-performing blocking strategies obtained in the model tuning stage; compared with a blocking strategy inferred by the generalized model, they have better performance for the corresponding operator. By pre-storing the better-performing blocking strategies in the knowledge base, time can be saved in the inference stage, and inference by the generalized model is not required.
Step S1005, performing inference with the generalized model according to the input operator information, and outputting the corresponding blocking strategy.
By performing inference on the operator information through the generalized model, the corresponding blocking strategy can be obtained when no knowledge base exists or when no corresponding blocking strategy is found in the knowledge base, so that tensor blocking strategies can be generated for various operators, meeting the requirements of practical applications and improving efficiency.
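The flow of steps S1001 to S1005 can be summarized by the following sketch, assuming the knowledge base is a plain dictionary and that make_key and infer_blocking are hypothetical helpers standing in for the lookup key and the generalized-model inference.

```python
# Sketch of the inference flow; helper names are assumptions, not the embodiment's API.
def get_blocking_strategy(op_info, knowledge_base, model, make_key):
    if knowledge_base is not None:                  # S1002: does a knowledge base exist?
        key = make_key(op_info)
        if key in knowledge_base:                   # S1003: matching operator information found?
            return knowledge_base[key]              # S1004: return the pre-stored blocking strategy
    return model.infer_blocking(op_info)            # S1005: fall back to generalized-model inference
```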
After obtaining the blocking strategies of the convolution operator on each cache module of the processor, the operator compiler may block the corresponding image data and weight data to perform the convolution operation on the processor, and the result of the convolution operation may be used to output the corresponding result of a related deep learning task, for example, an image classification result. In this process, the tensor blocking method can improve the computing efficiency of the processor and the execution efficiency of the related deep learning task.
FIG. 11 shows a flow diagram of a tensor blocking method according to an embodiment of the present application. The method can be used for the operator compiler, as shown in fig. 11, and includes:
Step S1101, inputting training data into a first model, wherein the first model is used to determine a blocking strategy of a tensor on each cache module of a processor, the processor is used to execute an operator operation of the tensor, and the training data includes operator information corresponding to at least one operator type;
Step S1102, outputting first prediction results by the first model, wherein the first prediction results meet constraint conditions corresponding to each cache module, each group of first prediction results corresponds to one cache module of the processor, and the constraint conditions include a first constraint condition of the hardware specification of a cache module and/or a second constraint condition between cache modules;
Step S1103, evaluating the combination of the first prediction results of each cache module to obtain a first evaluation result;
Step S1104, performing iterative training on the first model according to the combination of the first prediction results and the first evaluation result until a training convergence condition is met, to obtain a trained first model.
According to the embodiment of the application, the first model is used to determine the blocking strategy of the tensor on each cache module of the processor, the processor is used to execute the operator operation of the tensor, each group of first prediction results corresponds to one cache module of the processor, and the training data includes operator information corresponding to at least one operator type. The model can therefore be applied to multiple operator types, which improves the generalization performance of the model and allows a better blocking strategy to be obtained, thereby improving the computing performance of the processor. Moreover, the first prediction results satisfy the constraint conditions corresponding to each cache module, where the constraint conditions include first constraint conditions on hardware specifications and/or second constraint conditions between the cache modules; this reduces the search range of the model during training, increases the training speed of the model, and makes the blocking results conform to the relevant constraints of the processor.
The first model may be the generalization model described above, for example a Transformer model. A set of first prediction results may be, for example, the tiling_ids corresponding to one cache module described above, and a combination of the first prediction results of the cache modules may be, for example, a combination of the tiling_ids corresponding to the cache modules. The contents of the first constraint condition and the second constraint condition are not limited in the present application, as long as the constraint conditions can reduce the search range in the model training process; the hardware specification of each cache module on the processor may be set when the processor leaves the factory. The above process may be performed online.
Examples of steps S1101 to S1104 can be seen in steps S601 to S605 in fig. 6.
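As a hedged, minimal sketch of steps S1101 to S1104, the following fragment replaces the generalization model (for example a Transformer) with a trivial stochastic stand-in: it proposes one prediction per cache module under per-dimension capacity constraints, scores the combination with a toy cost, and iterates. The cache names, capacities, cost function and helper names are assumptions, not the embodiment's actual implementation.

```python
import random

CACHE_SPECS = {"L1": 1024 * 1024, "L0C": 256 * 1024, "UB": 192 * 1024}  # assumed byte capacities

def legal_tiles(dim_extent, capacity_elems):
    # candidate tile sizes for one dimension that do not exceed the cache capacity on their own
    return [t for t in range(1, dim_extent + 1) if t <= capacity_elems]

def propose(op_info, rng):
    """'First model' stand-in: one prediction (per-dimension tile sizes) per cache module."""
    prediction = {}
    for cache, cap_bytes in CACHE_SPECS.items():
        cap_elems = cap_bytes // op_info["dtype_bytes"]
        prediction[cache] = {d: rng.choice(legal_tiles(e, cap_elems))
                             for d, e in op_info["shape"].items()}
    return prediction

def evaluate(prediction):
    """Toy cost for the combination of predictions: penalize tiny tiles; lower is better."""
    return sum(1.0 / max(1, min(tiles.values())) for tiles in prediction.values())

def train(op_info, iterations=200, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(iterations):                      # S1104: iterate until budget/convergence
        pred = propose(op_info, rng)                 # S1101/S1102: constrained predictions
        cost = evaluate(pred)                        # S1103: evaluate the combination
        if cost < best_cost:
            best, best_cost = pred, cost
    return best, best_cost

op_info = {"op_type": "conv2d", "shape": {"C": 64, "H": 56, "W": 56}, "dtype_bytes": 2}
print(train(op_info)[1])
```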
FIG. 12 shows a flow diagram of a tensor blocking method according to an embodiment of the present application. As shown in fig. 12, the first model outputs a first prediction result, including:
step S1201, for each dimension of the tensor on each cache module, determining one second prediction result from at least one second prediction result meeting the constraint condition;
step S1202, determining a set of first prediction results, where the set of first prediction results includes a combination of the one second prediction result corresponding to each dimension of the tensor on the corresponding cache module.
According to the embodiment of the application, one second prediction result is determined from at least one second prediction result meeting the constraint condition for each dimension of the tensor on each cache module, and a set of first prediction results is determined, where the set of first prediction results includes a combination of the second prediction results corresponding to the dimensions of the tensor on the corresponding cache module. In this way, a valid blocking strategy can be selected for evaluation during training, which improves the training efficiency of the model.
Each second prediction result may correspond to, for example, an element of the tiling_mask described above in one dimension of the tensor on one cache module, and the at least one second prediction result satisfying the constraint condition may correspond to, for example, the elements of the tiling_mask whose values are 1.
Examples of steps S1201 to S1202 may refer to the relevant description in step S602 in fig. 6; a simplified illustration follows.
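A hypothetical illustration of steps S1201 and S1202: for each tensor dimension on a cache module, a tiling_mask marks which candidate tile sizes satisfy the constraints, one masked-in candidate is chosen per dimension, and the per-dimension choices form one set of first prediction results. The candidate list, the simplified per-dimension capacity check and the selection rule are assumptions.

```python
def tiling_mask(candidates, dim_extent, cache_capacity_bytes, elem_bytes):
    """1 where a candidate tile size is valid for this dimension, else 0."""
    return [1 if (c <= dim_extent and c * elem_bytes <= cache_capacity_bytes) else 0
            for c in candidates]

def pick_per_dimension(shape, candidates, cache_capacity_bytes, elem_bytes):
    choice = {}
    for dim, extent in shape.items():
        mask = tiling_mask(candidates, extent, cache_capacity_bytes, elem_bytes)
        legal = [c for c, m in zip(candidates, mask) if m == 1]
        choice[dim] = max(legal)          # any selection rule could be used; here: largest legal tile
    return choice                         # one "set of first prediction results" for this cache module

shape = {"C": 64, "H": 56, "W": 56}
print(pick_per_dimension(shape, candidates=[8, 16, 32, 64, 128],
                         cache_capacity_bytes=256, elem_bytes=2))
```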
In a possible implementation manner, the first constraint condition is that the size of the tensor after the blocking is not larger than the size specified by the hardware specification of the corresponding cache module; the second constraint condition is that the size of the tensor after the blocking is not larger than the size specified by the hardware specification of the next cache module under the condition that the next cache module exists.
According to the embodiment of the application, the search range in the model training process can be reduced, the prediction result is suitable for the corresponding constraint of the processor, and the model training speed is improved.
For example, for the first prediction result corresponding to the cache module L0C, the blocked tensor size in that prediction result is not larger than the size specified by the hardware specification of the next cache module, UB.
The above process can refer to the related description in step S602 in fig. 6.
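The following sketch checks the two constraint conditions described above under an assumed linear cache hierarchy in which, as in the example, UB is the next cache module after L0C; the capacities and the remainder of the hierarchy order are illustrative assumptions.

```python
from math import prod

HIERARCHY = ["L1", "L0C", "UB"]          # assumed order: each cache module feeds the next one
CAPACITY = {"L1": 1024 * 1024, "L0C": 256 * 1024, "UB": 192 * 1024}   # assumed byte capacities

def tile_bytes(tile_sizes, elem_bytes):
    return prod(tile_sizes.values()) * elem_bytes

def satisfies_constraints(cache, tile_sizes, elem_bytes=2):
    size = tile_bytes(tile_sizes, elem_bytes)
    if size > CAPACITY[cache]:                        # first constraint: fits its own cache module
        return False
    idx = HIERARCHY.index(cache)
    if idx + 1 < len(HIERARCHY):                      # second constraint: also fits the next module
        return size <= CAPACITY[HIERARCHY[idx + 1]]
    return True

print(satisfies_constraints("L0C", {"C": 32, "H": 28, "W": 56}))   # True for these tile sizes
```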
FIG. 13 shows a flow diagram of a tensor blocking method according to an embodiment of the present application. As shown in fig. 13, the method further includes:
step S1301, inputting tuning data into the trained first model to obtain a third prediction result, wherein the tuning data comprises operator information corresponding to a single operator type;
step S1302, evaluating the third prediction result to obtain a second evaluation result;
step S1303, performing iterative training again on the trained first model according to the third prediction result and the second evaluation result until a training convergence condition is met;
step S1304, determining a first blocking strategy from the third prediction result according to the third prediction result output in each iteration process and the corresponding second evaluation result, and recording the first blocking strategy and the first operator information in the corresponding tuning data.
According to the embodiment of the application, tuning the trained first model enables a unified tuning and training scheme, which reduces the maintenance cost of training and tuning. Because the tuning data includes operator information corresponding to a single operator type, the tuned model adapts better to that single operator, the model converges faster, and the operator runs more efficiently, which improves the computing performance of the operator on the processor. By recording the first blocking strategy and the first operator information in the corresponding tuning data, online inference by the model is not required in subsequent use, which improves efficiency.
The tuning data may be data for tuning stored in the data set. The single operator type may be a convolution operator type or another operator type, which is not limited in this application. The method for determining the first blocking strategy from the third prediction results is also not limited in this application; for example, the determined first blocking strategy may be, among all candidate blocking strategies, the one expected to have the shortest running time on the processor. The recorded first blocking strategy and the first operator information in the corresponding tuning data may be stored in the above knowledge base.
An example of steps S1301 to S1304 may refer to the related description in fig. 8.
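A hedged sketch of steps S1301 to S1304 is given below: the search is re-run on tuning data for a single operator type, the third prediction result with the best second evaluation result is kept, and the resulting first blocking strategy is recorded together with the first operator information as a knowledge-base entry. The estimate_runtime placeholder and the JSON-lines storage format are assumptions; in the embodiment the evaluation would measure the operator on the processor.

```python
import json
import random

def estimate_runtime(policy):
    # placeholder cost model standing in for the real on-processor evaluation:
    # prefer larger tiles (fewer data movements); lower is better
    return sum(1000.0 / max(1, min(tiles.values())) for tiles in policy.values())

def tune(op_info, propose_fn, iterations=100, seed=0):
    rng = random.Random(seed)
    best_policy, best_eval = None, float("inf")
    for _ in range(iterations):
        policy = propose_fn(op_info, rng)        # a third prediction result
        score = estimate_runtime(policy)         # its second evaluation result
        if score < best_eval:
            best_policy, best_eval = policy, score
    return best_policy, best_eval

def record_to_knowledge_base(path, op_info, policy):
    """Append (first operator information, first blocking strategy) as one JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps({"operator": op_info, "blocking": policy}) + "\n")

op_info = {"op_type": "conv2d", "shape": {"C": 64, "H": 56, "W": 56}, "dtype_bytes": 2}
simple_propose = lambda info, rng: {"UB": {d: rng.choice([8, 16, 32, 64]) for d in info["shape"]}}
best, _ = tune(op_info, simple_propose)
record_to_knowledge_base("knowledge_base.jsonl", op_info, best)
```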
In one possible implementation, the method further includes: inputting second operator information into the trained first model to obtain a blocking strategy of the second operator information, where the blocking strategy of the second operator information includes blocking strategies of the tensors corresponding to the second operator information on each cache module of the processor.
According to the embodiment of the application, the blocking strategy of the second operator information is obtained by inputting the second operator information into the trained first model. In this way, the blocking strategy of the tensor on each cache module of the processor can be generated through online inference, the compilation time is short, the generalization performance of the model is better, and the running efficiency of the operator is improved.
An example of the above process may refer to S1005 in fig. 10.
In a possible implementation manner, the second operator information is operator information corresponding to a convolution operator, the tensor is image data and/or weight data, and a blocking policy of the second operator information is used for indicating blocking of the image data and/or the weight data on each cache module of the processor.
According to the embodiment of the application, the operation performance optimization of the convolution operator by the processor can be realized, and the blocking strategy of the corresponding image data and the corresponding weight data in each cache module of the processor is obtained.
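Purely as an illustration of the data involved, the following fragment shows what convolution operator information and a per-cache-module blocking strategy for its image and weight tensors might look like; the field names, cache levels and numbers are assumptions.

```python
conv_op_info = {
    "op_type": "conv2d",
    "image_shape":  {"N": 1, "C": 64, "H": 56, "W": 56},      # input feature map
    "weight_shape": {"Co": 128, "Ci": 64, "Kh": 3, "Kw": 3},  # convolution kernels
    "dtype": "float16",
}

blocking_strategy = {
    "L1":  {"image":  {"N": 1, "C": 64, "H": 28, "W": 56},
            "weight": {"Co": 64, "Ci": 64, "Kh": 3, "Kw": 3}},
    "L0C": {"image":  {"N": 1, "C": 32, "H": 14, "W": 56},
            "weight": {"Co": 32, "Ci": 32, "Kh": 3, "Kw": 3}},
}
```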
FIG. 14 shows a flow diagram of a tensor blocking method according to an embodiment of the present application. As shown in fig. 14, the method further includes:
step S1401, determining whether operator information matched with second operator information exists in a knowledge base according to the second operator information, wherein the knowledge base comprises the operator information and a corresponding blocking strategy, and the operator information in the knowledge base comprises the first operator information;
step S1402, determining, when there is operator information matched with the second operator information, a blocking policy corresponding to the operator information matched with the second operator information as the blocking policy of the second operator information; otherwise, inputting the second operator information into the trained first model to obtain the blocking strategy of the second operator information.
According to the embodiment of the application, whether operator information matching the second operator information exists in the knowledge base is determined according to the second operator information. When matching operator information exists, the blocking strategy corresponding to that operator information is determined to be the blocking strategy of the second operator information; in this way, a better blocking strategy can be produced for a single operator, the processor achieves better running performance on that operator, and the processing speed is improved. When no matching operator information exists, the second operator information is input into the trained first model to obtain the blocking strategy of the second operator information; in this way, inference can be performed for multiple types of operators, the generalization performance of the inference process is better, and the running performance of operators on the processor is improved.
In the knowledge base, the first blocking strategy and the first operator information in the tuning data recorded in step S1304 may be stored, and other blocking strategies and corresponding operator information may also be stored. The operator information matched with the second operator information may be that partial information in the operator information is consistent with the second operator information, or that all information in the operator information is consistent with the second operator information, which is not limited in this application.
Examples of steps S1401 to S1402 can be seen in steps S1001 to S1005 in fig. 10.
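A simplified, non-authoritative sketch of the lookup flow of steps S1401 and S1402 (cf. steps S1001 to S1005): try a full match of the operator information in the knowledge base, then a partial match (same operator type and tensor shape), and otherwise fall back to inference with the trained first model. The matching rules and the model_infer placeholder are assumptions.

```python
from typing import Callable

def lookup_blocking(op_info: dict, knowledge_base: list,
                    model_infer: Callable[[dict], dict]) -> dict:
    for entry in knowledge_base:
        if entry["operator"] == op_info:                       # full match of operator information
            return entry["blocking"]
    for entry in knowledge_base:                               # partial match: type and shape agree
        if (entry["operator"].get("op_type") == op_info.get("op_type")
                and entry["operator"].get("image_shape") == op_info.get("image_shape")):
            return entry["blocking"]
    return model_infer(op_info)                                # otherwise: online model inference

# usage sketch
kb = [{"operator": {"op_type": "conv2d", "image_shape": {"N": 1, "C": 64, "H": 56, "W": 56}},
       "blocking": {"UB": {"C": 32, "H": 28, "W": 56}}}]
policy = lookup_blocking({"op_type": "conv2d",
                          "image_shape": {"N": 1, "C": 64, "H": 56, "W": 56}},
                         kb, model_infer=lambda info: {"UB": {"C": 16, "H": 14, "W": 28}})
print(policy)
```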
Fig. 15 is a block diagram of a tensor blocking apparatus according to an embodiment of the present application. As shown in fig. 15, the apparatus 1500 includes:
an input module 1501, configured to input training data into a first model, where the first model is used to determine a blocking strategy of a tensor on each cache module of a processor, the processor is used to perform an operator operation of the tensor, and the training data includes operator information corresponding to at least one operator type;
an output module 1502, configured to output a first prediction result by the first model, where the first prediction result meets constraint conditions corresponding to each cache module, each group of the first prediction results corresponds to one cache module of the processor, and the constraint conditions include first constraint conditions of hardware specifications of the cache modules and/or second constraint conditions between the cache modules;
the first evaluation module 1503 is used for evaluating a combination of the first prediction results of the cache modules to obtain a first evaluation result;
and the first iterative training module 1504 is configured to perform iterative training on the first model according to the combination of the first prediction results and the first evaluation result until a training convergence condition is met, so as to obtain a trained first model.
According to the embodiment of the application, the first model is used to determine the blocking strategy of the tensor on each cache module of the processor, the processor is used to execute the operator operation of the tensor, each group of first prediction results corresponds to one cache module of the processor, and the training data includes operator information corresponding to at least one operator type. The model can therefore be applied to multiple operator types, which improves the generalization performance of the model and allows a better blocking strategy to be obtained, thereby improving the computing performance of the processor. Moreover, the first prediction results satisfy the constraint conditions corresponding to each cache module, where the constraint conditions include first constraint conditions on hardware specifications and/or second constraint conditions between the cache modules; this reduces the search range of the model during training, increases the training speed of the model, and makes the blocking results conform to the relevant constraints of the processor.
In one possible implementation, the first model outputs the first prediction result, including: determining one second prediction result from at least one second prediction result meeting the constraint condition for each dimension of the tensor on each cache module; and determining a set of first prediction results, where the set of first prediction results includes a combination of the one second prediction result corresponding to each dimension of the tensor on the corresponding cache module.
According to the embodiment of the application, one second prediction result is determined from at least one second prediction result meeting the constraint condition for each dimension of the tensor on each cache module, and a set of first prediction results is determined, where the set of first prediction results includes a combination of the second prediction results corresponding to the dimensions of the tensor on the corresponding cache module. In this way, a valid blocking strategy can be selected for evaluation during training, which improves the training efficiency of the model.
In a possible implementation manner, the first constraint condition is that the size of the tensor after the blocking is not larger than the size specified by the hardware specification of the corresponding cache module; the second constraint condition is that the size of the tensor after the blocking is not larger than the size specified by the hardware specification of the next cache module under the condition that the next cache module exists.
According to the embodiment of the application, the search range in the model training process can be reduced, the prediction result is suitable for the corresponding constraint of the processor, and the model training speed is improved.
In one possible implementation, the apparatus further includes: the first determining module is used for inputting tuning data into the trained first model to obtain a third prediction result, wherein the tuning data comprises operator information corresponding to a single operator type; the second evaluation module is used for evaluating the third prediction result to obtain a second evaluation result; the second iterative training module is used for carrying out iterative training again on the trained first model according to the third prediction result and the second evaluation result until a training convergence condition is met; and the second determining module is used for determining a first partitioning strategy from the third prediction result according to the third prediction result output in each iteration process and the corresponding second evaluation result, and recording the first partitioning strategy and the corresponding first operator information in the tuning data.
According to the embodiment of the application, tuning the trained first model enables a unified tuning and training scheme, which reduces the maintenance cost of training and tuning. Because the tuning data includes operator information corresponding to a single operator type, the tuned model adapts better to that single operator, the model converges faster, and the operator runs more efficiently, which improves the computing performance of the operator on the processor. By recording the first partitioning strategy and the first operator information in the corresponding tuning data, online inference by the model is not required in subsequent use, which improves efficiency.
In one possible implementation, the apparatus further includes: and the third determining module is used for inputting second operator information into the trained first model to obtain a blocking strategy of the second operator information, wherein the blocking strategy of the second operator information comprises blocking strategies of tensors corresponding to the second operator information on each cache module of the processor.
According to the embodiment of the application, the blocking strategy of the second operator information is obtained by inputting the second operator information into the trained first model. In this way, the blocking strategy of the tensor on each cache module of the processor can be generated through online inference, the compilation time is short, the generalization performance of the model is better, and the running efficiency of the operator is improved.
In a possible implementation manner, the second operator information is operator information corresponding to a convolution operator, the tensor is image data and/or weight data, and a blocking policy of the second operator information is used for indicating that the image data and/or the weight data are blocked on each cache module of the processor.
According to the embodiment of the application, the operation performance optimization of the convolution operator by the processor can be realized, and the blocking strategy of the corresponding image data and the corresponding weight data in each cache module of the processor is obtained.
In one possible implementation, the apparatus further includes: a fourth determining module, configured to determine, according to second operator information, whether operator information matched with the second operator information exists in a knowledge base, where the knowledge base includes the operator information and a corresponding blocking policy, and the operator information in the knowledge base includes the first operator information; a fifth determining module, configured to determine, when there is operator information that matches the second operator information, that a blocking policy corresponding to the operator information that matches the second operator information is the blocking policy of the second operator information; otherwise, inputting the second operator information into the trained first model to obtain the blocking strategy of the second operator information.
According to the embodiment of the application, whether operator information matching the second operator information exists in the knowledge base is determined according to the second operator information. When matching operator information exists, the blocking strategy corresponding to that operator information is determined to be the blocking strategy of the second operator information; in this way, a better blocking strategy can be produced for a single operator, the processor achieves better running performance on that operator, and the processing speed is improved. When no matching operator information exists, the second operator information is input into the trained first model to obtain the blocking strategy of the second operator information; in this way, inference can be performed for multiple types of operators, the generalization performance of the inference process is better, and the running performance of operators on the processor is improved.
Fig. 16 is a block diagram showing a tensor blocking apparatus according to an embodiment of the present application. The tensor blocking apparatus is, for example, the tensor blocking apparatus shown in fig. 3 or the tensor blocking apparatus 1500 shown in fig. 15, and is applicable to the operator compiler shown in fig. 3 to perform the function in the tensor blocking method shown in any one of fig. 5 to 14. As shown in fig. 16, tensor blocking apparatus 700 may comprise a processor 701 and a transceiver 702. Optionally, the tensor blocking apparatus 700 may include a memory 703. The processor 701 is coupled to a transceiver 702 and a memory 703, such as may be connected by a communication bus.
The components of the tensor blocking apparatus 700 will be described in detail with reference to fig. 16.
The processor 701 is a control center of the tensor blocking apparatus 700, and may be a single processor or a collective term for a plurality of processing elements. For example, the processor 701 is one or more CPUs/NPUs/GPUs/TPUs, and may also be an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, such as one or more microprocessors (digital signal processors, DSPs) or one or more field programmable gate arrays (FPGAs).
Alternatively, the processor 701 may perform various functions of the tensor blocking apparatus 700 by running or executing a software program stored in the memory 703 and invoking data stored in the memory 703.
In a specific implementation, as an embodiment, the processor 701 may include one or more CPUs, for example, CPU0 and CPU1 shown in fig. 16, the processor 701 may further include one or more NPUs, for example, NPU0 and NPU1 shown in fig. 16, the processor 701 may further include one or more GPUs, for example, GPU0 and GPU1 shown in fig. 16, and the processor 701 may further include one or more TPUs, for example, TPU0 and TPU1 shown in fig. 16.
In one possible implementation, the tensor blocking apparatus 700 may also include a plurality of processors, such as the processor 701, the processor 704, and the processor 705 shown in fig. 16. Each of these processors may be a single-core processor (single-CPU/single-NPU) or a multi-core processor (multi-CPU/multi-NPU). The tensor blocking apparatus 700 may also include other processors, each of which may likewise be a single-core processor (single-GPU/single-TPU) or a multi-core processor (multi-GPU/multi-TPU) (not separately shown in fig. 16). Here, a processor may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
Optionally, the transceiver 702 may include a receiver and a transmitter (not separately shown in fig. 16). Wherein the receiver is configured to perform a receiving function and the transmitter is configured to perform a transmitting function.
Alternatively, the transceiver 702 may be integrated with the processor 701, or may exist independently, and is coupled to the processor 701 through an input/output port (not shown in fig. 16) of the tensor blocking apparatus 700, which is not limited in this embodiment.
The memory 703 may be used to store a software program for executing the scheme of the present application, and the processor 701 controls the execution of the software program.
The memory 703 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. It should be noted that the memory 703 may be integrated with the processor 701, or may exist independently and be coupled to the processor 701 through an input/output port (not shown in fig. 16) of the tensor blocking apparatus 700, which is not limited in this embodiment.
It should be noted that the structure of the tensor blocking apparatus 700 shown in fig. 16 does not constitute a limitation to the implementation of the tensor blocking apparatus, and an actual tensor blocking apparatus may include more or less components than those shown, or combine some components, or arrange different components.
An embodiment of the present application provides a tensor blocking apparatus, including: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above method when executing the instructions.
Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
The embodiment of the application provides a terminal device, and the terminal device can execute the method.
Embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove protrusion structure having instructions stored thereon, and any suitable combination of the foregoing.
The computer readable program instructions or code described herein may be downloaded to the respective computing/processing device from a computer readable storage medium, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present application may be assembler instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs) can execute the computer-readable program instructions, utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, so as to implement aspects of the present application.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It is also noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by hardware (e.g., a Circuit or an ASIC) for performing the corresponding function or action, or by combinations of hardware and software, such as firmware.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A tensor blocking method, the method comprising:
inputting training data into a first model, wherein the first model is used for determining a blocking strategy of a tensor on each cache module of a processor, the processor is used for executing operator operation of the tensor, and the training data comprises operator information corresponding to at least one operator type;
the first model outputs first prediction results, the first prediction results meet constraint conditions corresponding to each cache module, each group of first prediction results correspond to one cache module of the processor, and the constraint conditions comprise first constraint conditions of hardware specifications of the cache modules and/or second constraint conditions among the cache modules;
evaluating the combination of the first prediction results of the cache modules to obtain a first evaluation result;
and performing iterative training on the first model according to the combination of the first prediction results and the first evaluation result until a training convergence condition is met to obtain a trained first model.
2. The method of claim 1, wherein the first model outputs a first predicted result, comprising:
determining a second prediction result from at least one second prediction result meeting the constraint condition for each dimension of the tensor on each cache module;
determining a set of first prediction results, wherein the set of first prediction results comprises a combination of the one second prediction result corresponding to each dimension of the tensor on the corresponding cache module.
3. The method according to claim 1 or 2, wherein the first constraint condition is that the size of the tensor after blocking is not larger than the size specified by the hardware specification of the corresponding cache module;
the second constraint condition is that the size of the tensor after the blocking is not larger than the size specified by the hardware specification of the next cache module under the condition that the next cache module exists.
4. The method of claim 1, further comprising:
inputting tuning data into the trained first model to obtain a third prediction result, wherein the tuning data comprises operator information corresponding to a single operator type;
evaluating the third prediction result to obtain a second evaluation result;
performing iterative training on the trained first model again according to the third prediction result and the second evaluation result until a training convergence condition is met;
and determining a first partitioning strategy from the third prediction result according to the third prediction result output in each iteration process and the corresponding second evaluation result, and recording the first partitioning strategy and the corresponding first operator information in the tuning data.
5. The method according to any one of claims 1-4, further comprising:
inputting second operator information into the trained first model to obtain a blocking strategy of the second operator information, wherein the blocking strategy of the second operator information comprises blocking strategies of tensors corresponding to the second operator information on each cache module of the processor.
6. The method according to any one of claims 1-5, further comprising:
determining whether operator information matched with the second operator information exists in a knowledge base or not according to the second operator information, wherein the knowledge base comprises the operator information and a corresponding blocking strategy, and the operator information in the knowledge base comprises the first operator information;
determining a blocking strategy corresponding to the operator information matched with the second operator information as the blocking strategy of the second operator information under the condition that the operator information matched with the second operator information exists; if not, then,
and inputting the second operator information into the trained first model to obtain a blocking strategy of the second operator information.
7. The method according to claim 5 or 6, wherein the second operator information is operator information corresponding to a convolution operator, the tensor is image data and/or weight data, and a blocking policy of the second operator information is used for indicating blocking of the image data and/or the weight data on each cache module of the processor.
8. A tensor blocking apparatus, the apparatus comprising:
an input module, configured to input training data into a first model, where the first model is used to determine a blocking strategy of a tensor on each cache module of a processor, the processor is used to perform an operator operation of the tensor, and the training data includes operator information corresponding to at least one operator type;
the output module is used for outputting a first prediction result by the first model, the first prediction result meets constraint conditions corresponding to each cache module, each group of first prediction results corresponds to one cache module of the processor, and the constraint conditions comprise first constraint conditions of hardware specifications of the cache modules and/or second constraint conditions among the cache modules;
the first evaluation module is used for evaluating the combination of the first prediction results of the cache modules to obtain a first evaluation result;
and the first iterative training module is used for performing iterative training on the first model according to the combination of the first prediction results and the first evaluation result until a training convergence condition is met to obtain a trained first model.
9. A tensor blocking apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1-7 when executing the instructions.
10. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1-7.
11. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in an electronic device, a processor in the electronic device performs the method of any of claims 1-7.
CN202110760579.1A 2021-07-06 2021-07-06 Tensor blocking method and device and storage medium Pending CN115587922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110760579.1A CN115587922A (en) 2021-07-06 2021-07-06 Tensor blocking method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110760579.1A CN115587922A (en) 2021-07-06 2021-07-06 Tensor blocking method and device and storage medium

Publications (1)

Publication Number Publication Date
CN115587922A true CN115587922A (en) 2023-01-10

Family

ID=84772006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110760579.1A Pending CN115587922A (en) 2021-07-06 2021-07-06 Tensor blocking method and device and storage medium

Country Status (1)

Country Link
CN (1) CN115587922A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116862019A (en) * 2023-07-06 2023-10-10 清华大学 Model training method and device based on data parallel paradigm
CN116862019B (en) * 2023-07-06 2024-03-19 清华大学 Model training method and device based on data parallel paradigm

Legal Events

Date Code Title Description
PB01 Publication