WO2023284770A1 - 张量程序优化方法及装置 - Google Patents

张量程序优化方法及装置 Download PDF

Info

Publication number
WO2023284770A1
Authority
WO
WIPO (PCT)
Prior art keywords
program
tensor
subroutine
mutation
operator
Prior art date
Application number
PCT/CN2022/105400
Other languages
English (en)
French (fr)
Inventor
翟季冬
王豪杰
高鸣宇
马子轩
唐适之
郑立言
王拓为
融凯源
陈源涌
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 filed Critical 清华大学
Publication of WO2023284770A1 publication Critical patent/WO2023284770A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation

Definitions

  • This application relates to data processing technology, specifically a tensor program optimization method and device.
  • Existing deep learning frameworks represent deep learning programs as tensor programs composed of tensors and operators, and the tensor programs are often expressed in the form of computational graphs.
  • the optimization schemes for tensor programs can be divided into three types: graph-level optimization, which performs equivalent replacement of operators on the computational graph; data layout optimization, which replaces the memory layout of tensors; and operator-level optimization, which generates more efficient operators.
  • current techniques for optimizing memory layout mainly have the following limitations: 1. the data layouts considered by the underlying operator libraries are only simple reorderings of a tensor's dimensions, such as row-major and column-major for two-dimensional tensors, or NCHW and NHWC for four-dimensional tensors; 2. tensor program optimization work based on memory layout optimization only considers the relatively limited layouts supported by the underlying operator libraries.
  • for the optimization of computational graphs, many existing works use rule-based optimization methods: a computational subgraph composed of several operators is replaced with another subgraph according to pre-designed rules.
  • rule-based optimization methods require developers to understand the possible transformations of computational subgraphs well enough to design the corresponding transformation rules.
  • since all rules must be specified manually, the search space is often limited, and unknown transformation rules cannot be explored.
  • this application aims to use a new optimization method to expand the optimization space of the tensor program.
  • the embodiment of the first aspect of the present application provides a tensor program optimization method, including:
  • the tensor program to be optimized is divided to generate linear tensor subprograms;
  • mutation programs of the subprograms are generated according to a preset operator set;
  • error correction is performed on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram;
  • the optimal subprograms are selected from the error-corrected mutation programs and spliced to generate an optimized tensor program.
  • the embodiment of the second aspect of the present application provides a tensor program optimization device, including:
  • a program division module, configured to divide the tensor program to be optimized to generate linear tensor subprograms;
  • a mutation generation module, configured to generate mutation programs of the subprograms according to a preset operator set;
  • a mutation error correction module, configured to perform error correction on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram;
  • an optimized program generation module, configured to select optimal subprograms from the error-corrected mutation programs and splice them to generate an optimized tensor program.
  • the embodiment of the third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and operable on the processor, and the above method is implemented when the processor executes the computer program.
  • the embodiment of the fourth aspect of the present application provides a computer-readable storage medium, and the computer-readable storage medium stores a computer program for executing the above method.
  • This application divides the tensor program to be optimized into linear tensor subprograms, which reduces the search space to be explored; it explores all possible locally equivalent, memory-rearrangement-based transformations of each subprogram and generates all possible mutations of the subprogram. To guarantee the end-to-end correctness of the optimized tensor program, the outputs of the tensor subprograms before and after optimization are compared, the non-equivalent positions are found and corrected, and the corrected mutations are further optimized. The mutations of the subprograms are combined into a complete tensor program in an optimal way, which makes the execution of the tensor program more efficient.
  • Fig. 1 is a flow chart of the tensor program optimization method provided by the present application;
  • Fig. 2 is a schematic diagram in an embodiment of the present application;
  • Fig. 3 is a schematic diagram in an embodiment of the present application;
  • Fig. 4 is a schematic diagram in an embodiment of the present application;
  • Fig. 5 is a schematic diagram in an embodiment of the present application;
  • Fig. 6 is a schematic diagram in an embodiment of the present application;
  • Fig. 7 is a schematic diagram in an embodiment of the present application;
  • Fig. 8 is a schematic diagram in an embodiment of the present application;
  • Fig. 9 is a schematic diagram in an embodiment of the present application;
  • Fig. 10 is a schematic diagram in an embodiment of the present application;
  • Fig. 11 is a schematic diagram in an embodiment of the present application;
  • Fig. 12 is a block diagram of the tensor program optimization device provided by the present application;
  • Fig. 13 is a schematic diagram of an electronic device in an embodiment of the present application.
  • the present application provides a tensor program optimization method, as shown in Figure 1, the method of the present application includes:
  • Step S101 dividing the tensor program to be optimized to generate a linear tensor subprogram
  • Step S102 generating a mutation program of the subroutine according to a preset operator set
  • Step S103, performing error correction on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram;
  • Step S104 selecting the optimal subroutine from the mutation program after the error correction processing, and splicing to generate an optimized tensor program.
  • in step S101, dividing the tensor program to be optimized to generate linear tensor subprograms includes:
  • determining the nonlinear activation function operators in the tensor program to be optimized;
  • dividing the tensor program to be optimized according to the nonlinear activation function operators to generate linear tensor subprograms.
  • nonlinear operators are used as split points in the tensor program to be optimized.
  • nonlinear operators, such as the activation functions in deep neural networks, are widely used in tensor programs.
  • usually, each linear operator, or every few linear operators, is followed by a nonlinear activation function operator (such as ReLU or sigmoid); these operators can effectively split the tensor program into subprograms of the desired size.
  • for state-of-the-art equivalent graph optimization algorithms, few nonlinear operators are included in their replacement patterns (except operator fusion, which is handled in code generation optimization), so leaving nonlinear operators outside the subprograms does not significantly affect the optimization effect.
  • after the division, the sizes of the subprograms can be further adjusted.
  • to seek more optimization opportunities, several subprograms that do not depend on each other are composed into a larger subprogram, such as the multiple branches in the Inception network.
  • if a subprogram is too large, only a subset of its operators is mutated and optimized at a time.
  • for each subprogram, all possible locally equivalent transformations based on memory rearrangement are explored, and all possible mutations of the subprogram are generated; each mutation has the same input tensors and output tensors of the same shapes as the original subprogram.
  • generating the mutation programs of the subprograms according to the preset operator set includes:
  • Step 1: enumerating the tensors in the subprogram as inputs of each operator in the preset operator set;
  • Step 2: adding the output tensor of the operator to the subprogram;
  • Step 3: judging whether the size of the subprogram exceeds a preset threshold; if the threshold has not been reached, executing steps 1 to 3 again, and terminating once it is reached.
  • the preset operator set includes: calculation-intensive operators, element-by-element operators, and tensor operation operators.
  • the operator set covers the operators most commonly used in DNNs (Deep Neural Networks) and other tensor programs, including compute-intensive operators (such as conv and matmul), element-wise operators (such as add and mul), and tensor manipulation operators (such as split and transpose).
  • the outputs of the tensor subprogram before and after optimization are compared to find the non-equivalent positions and correct them.
  • the mutation error correction relies on a rigorous linear algebra foundation to simplify this extremely complex task.
  • Theorem 1: For two MLTPs $\mathcal{P}$ and $\mathcal{P}'$ with $m$-dimensional output tensors, let $\mathcal{E} = \{e_1, \dots, e_m\}$ be the set of $m$-dimensional basis vectors, i.e., $e_i$ is a tuple of length $m$ whose $i$-th element is 1 and whose other elements are 0. Let $R$ be a reduction domain of $\mathcal{P}$ and $\mathcal{P}'$, let $p_0$ be an arbitrary position in $R$, and let $p_j = p_0 + e_j$ for $1 \le j \le m$. If $\mathcal{P}(p_i) = \mathcal{P}'(p_i)$ for $0 \le i \le m$, then $\mathcal{P}(p) = \mathcal{P}'(p)$ for every $p \in R$.
  • Theorem 2: For two MLTPs $\mathcal{P}$ and $\mathcal{P}'$, let $p$ be a non-equivalent position of $\mathcal{P}$ and $\mathcal{P}'$, i.e., there exists an input on which the outputs of $\mathcal{P}$ and $\mathcal{P}'$ differ at $p$. Let $I$ be a randomly generated input; then the probability that $\mathcal{P}(I)$ and $\mathcal{P}'(I)$ agree at $p$ is at most $1/d$, where $d$ is the number of all possible values of the input variable $I$.
  • Theorem 1 shows that if $\mathcal{P}$ and $\mathcal{P}'$ are equivalent at $m+1$ specific positions in a reduction domain, then all other positions in that reduction domain are also equivalent.
  • the theorem adopted in this embodiment greatly reduces the verification workload: instead of checking all positions in the output tensor, only $m+1$ specific positions in each reduction domain need to be tested.
  • Theorem 2 shows that if two subprograms are not equal at some position, then under a random test with 32-bit integers the probability of computing the same result at that position on a random input is at most $2^{-32}$.
  • by combining the two theorems, checking all combinations of input values that all output positions depend on is reduced to a lightweight task that only requires testing a few representative positions with a few randomly generated inputs.
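  • as a quantitative sketch (an inference from the two theorems, assuming the $t$ random tests use independently generated inputs): a genuinely non-equivalent position passes all $t$ tests with probability at most $d^{-t}$, i.e., at most $2^{-32t}$ for 32-bit integer inputs, so two independent tests already drive the per-position false-accept probability below $2^{-64}$.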
  • performing error correction on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram, includes:
  • Step 1: Reduction domain propagation. The reduction domains of a given subprogram are computed by reduction domain propagation. The concept is similar to forward and backward propagation in deep learning: the reduction domains of an operator's output tensor are computed from the reduction domains of its input tensors by analyzing the dependencies of the operator's computation. A set of split points is maintained for each dimension of a tensor to mark the boundaries of its reduction domains; for a linear operator, the split points of its output tensor can be inferred from the split points of its input tensors together with the operator type and hyperparameters.
  • Step 2: Random testing of each reduction domain. For each overlapping region of a pair of reduction domains, the equivalence of the two programs is checked at the $m+1$ positions identified by Theorem 1 with randomly generated inputs, where $m$ is the number of dimensions of the output tensor.
  • Step 3: Error-correction kernel. For each reduction domain that fails the random test, an error-correction kernel is generated to correct its output, guaranteeing mathematical equivalence between the original subprogram and its mutation. To correct the output, the error-correction kernel executes the same set of operators as the original subprogram on the reduction domains that are not equivalent to it, correcting those positions.
  • the generated error-correction kernels are optimized with kernel fusion optimization technology.
  • when an error-correction kernel uses the same operator as the optimized subprogram, the two can be merged into one computation kernel by kernel fusion.
  • since the operators of the optimized subprogram are obtained by transforming the operators of the original subprogram, the operators of the error-correction kernel can always be transformed into the same operators as the optimized subprogram, enabling further fusion optimization.
  • selecting the optimal subprograms from the error-corrected mutation programs and splicing them to generate an optimized tensor program includes:
  • using a greedy algorithm to select K candidate subprograms for each subprogram from the error-corrected mutation programs, where K is a preset value;
  • selecting, from the K candidates, the candidate with the smallest conversion cost at the seams with the preceding and following subprograms as the optimal subprogram;
  • splicing the determined optimal subprograms to generate the optimized tensor program.
  • after program division, the input program is converted into a list of subprograms.
  • the global optimizer in code generation optimization traverses each subprogram, queries the mutation generator to generate its mutation list, and greedily keeps the top K candidate whole programs (Cands) with the best performance so far.
  • K candidates are saved for each subprogram (saving all mutations would take up too much space), and the most suitable one is then selected in the subsequent splicing process; the most suitable subprogram is the one with the smallest conversion cost at the "seams" with the preceding and following subprograms after splicing.
  • the global optimizer carefully designs the mutation process through several key hyperparameters. First, if the subroutine is too large, it is divided into smaller operator subsets, mutation generation is performed on each subset only, and the remaining operators are kept unchanged while mutation generation is performed on one operator subset. Second, by allowing iterative mutations on subroutines to take at most r rounds (r is a preset tunable parameter value), the search space can be greatly expanded and more complex and possibly more optimal mutation generation can be achieved.
  • Reversible operator elimination: any set of R/T operators that cancel each other out (i.e., running them on a tensor is equivalent to a no-op) is called a reversible transformation; reversible transformations can clearly be deleted from the program without affecting correctness.
  • for statically known tensors, the post-optimizer completes their memory rearrangement in the preprocessing stage: for the convolution weight tensors w1 and w2 in Figure 6(b), the operators R/T-B and R/T-I corresponding to the memory rearrangement are executed during weight preprocessing rather than at runtime.
  • weight tensors store weights and can be determined statically. Preprocessing means that, before inference, the weight tensors are first transformed according to the optimized computational graph and then stored in their transformed form, so that these transformation operators need not be executed again when inference is actually performed.
  • the tensor program optimization method provided by this application is based on memory rearrangement and locally equivalent transformation, and is not currently used by other frameworks. Although most of the transformations can be assembled from combinations of classical operators, their benefit cannot be realized without dedicated code generation optimization for these operators. The system described in this application is therefore a complete system, and existing frameworks cannot replace most of its work. In addition, by combining with different backends (such as cuDNN/cuBLAS, TVM, Ansor), this application can achieve speedups of more than 2x.
  • An embodiment of the present application also provides a tensor program optimization system based on memory rearrangement and local equivalence transformation, including several main modules: program division, mutation generator, mutation error corrector and code generation optimization.
  • the input of the tensor program optimization system is a tensor program to be optimized.
  • the input program is first divided into smaller subroutines to reduce the search space to be explored and to ensure that all subroutines are linear programs (to apply subsequent verification theory).
  • the mutation generator explores all possible local equivalent deformations based on memory rearrangement, and generates all possible mutations of the subprogram.
  • each mutation has the same input tensors and output tensors of the same shapes as the original subprogram.
  • the mutation error corrector compares the output of the tensor subprogram before and after optimization, finds out the unequal positions and corrects them.
  • the mutation error corrector relies on a rigorous linear algebra foundation to simplify this extremely complex task.
  • the corrected mutations are then passed to the program optimizer, which combines the mutations of the subprograms into a complete tensor program in an optimal way and tunes it further to obtain the optimized tensor program.
  • a non-linear operator is used as the segmentation point in the input program.
  • non-linear operators such as activation functions in deep neural networks are widely used in tensor programs. Usually, each or a few linear operators are followed by nonlinear activation function operators (such as ReLU or sigmoid), through which tensor programs can be effectively split into subroutines of desired size.
  • second, since the theoretical basis of the mutation error corrector only applies to multi-linear tensor programs, all nonlinear operators must be excluded from the subprograms, which makes nonlinear operators a natural choice of split point.
  • third, for state-of-the-art equivalent graph optimization algorithms, few nonlinear operators are included in their replacement patterns (except operator fusion, which is handled in code generation optimization), so keeping them outside the subprograms does not significantly affect the optimization effect.
  • the subroutines can be further resized.
  • several subprograms that do not depend on each other are composed into a larger subprogram, such as the multiple branches in the Inception network.
  • the subroutine is too large, only a subset of operators of the subroutine will be mutated and optimized at a time.
  • the mutation generator generates possible mutations of a subprogram on the basis of a given operator set, which covers the operators most commonly used in DNNs and other tensor programs, including compute-intensive operators (such as conv and matmul), element-wise operators (such as add and mul), and tensor manipulation operators (such as split and transpose). This set can easily be extended according to the user's needs to cover many different types of tensor programs.
  • the mutation generator starts from an empty program that contains no operators but includes the original input tensors of the subprogram; it enumerates each operator in the set and, at the same time, all tensors available in the current program as inputs of that operator, adding the operator's output tensor to the current program until the size of the current program reaches a certain threshold (called the mutation depth). If a generated mutation has the same number and shapes of inputs and outputs as the original program, the mutation generator considers it a valid mutant.
  • the mutation generator can produce higher-performance mutations of a subprogram, but there is no guarantee that the mutations are computationally equivalent to the original program.
  • the mutation error corrector takes the original multi-linear tensor program and one of its mutations as input, automatically finds the non-equivalent positions in their outputs, and generates the corresponding error-correction kernels to ensure that the two are functionally equivalent.
  • Theorem 1: For two subprograms $\mathcal{P}$ and $\mathcal{P}'$ with $m$-dimensional output tensors, let $\mathcal{E} = \{e_1, \dots, e_m\}$ be the set of $m$-dimensional basis vectors, i.e., $e_i$ is a tuple of length $m$ whose $i$-th element is 1 and whose other elements are 0. Let $R$ be a reduction domain of $\mathcal{P}$ and $\mathcal{P}'$, let $p_0$ be an arbitrary position in $R$, and let $p_j = p_0 + e_j$ for $1 \le j \le m$. If $\mathcal{P}(p_i) = \mathcal{P}'(p_i)$ for $0 \le i \le m$, then $\mathcal{P}(p) = \mathcal{P}'(p)$ for every $p \in R$.
  • Theorem 2: For two subprograms $\mathcal{P}$ and $\mathcal{P}'$, let $p$ be a non-equivalent position of $\mathcal{P}$ and $\mathcal{P}'$, i.e., there exists an input on which the outputs of $\mathcal{P}$ and $\mathcal{P}'$ differ at $p$. Let $I$ be a randomly generated input; then the probability that $\mathcal{P}(I)$ and $\mathcal{P}'(I)$ agree at $p$ is at most $1/d$, where $d$ is the number of all possible values of the input variable $I$.
  • Theorem 1 shows that if $\mathcal{P}$ and $\mathcal{P}'$ are equivalent at $m+1$ specific positions in a reduction domain, then all other positions in that reduction domain are also equivalent. This greatly reduces the verification effort: instead of checking all positions in the output tensor, the mutation error corrector only needs to test $m+1$ specific positions in each reduction domain.
  • Theorem 2 shows that if two subprograms are not equal at some position, then under a random test with 32-bit integers the probability of computing the same result at that position on a random input is at most $2^{-32}$.
  • by combining the two theorems, the mutation error corrector reduces checking all combinations of input values that all output positions depend on to a lightweight task that only requires testing a few representative positions with a few randomly generated inputs.
  • the verification algorithm of the mutation error corrector is as follows:
  • Step 1: Reduction domain propagation.
  • the mutation error corrector computes the reduction domains of a given subprogram by reduction domain propagation.
  • the concept of reduction domain propagation is similar to forward and backward propagation in deep learning: the reduction domains of an operator's output tensor are computed from the reduction domains of its input tensors by analyzing the dependencies of the operator's computation.
  • the mutation error corrector maintains a set of split points for each dimension of a tensor to mark the boundaries of its reduction domains. For a linear operator, the split points of its output tensor can be inferred from the split points of its input tensors together with the operator type and hyperparameters.
  • Figure 4 shows the reduction field propagation process.
  • Step 2: Random testing of each reduction domain. After obtaining all reduction domains in the output tensors of the original program and its mutation, the mutation error corrector checks the intersection of each pair of reduction domains from the two programs using the theorems above. If two reduction domains have no overlapping region, they can be skipped. For each overlapping region, the equivalence of the two programs is checked at the set of $m+1$ positions identified by Theorem 1, where $m$ is the number of dimensions of the output tensor. At these $m+1$ positions, the mutation error corrector performs random tests with a set of randomly generated inputs.
  • Step 3 Error correction kernel.
  • for each reduction domain that fails the random test, the tensor program optimization system generates an error-correction kernel to correct its output, guaranteeing mathematical equivalence between the original subprogram and its mutation. To correct the output, the error-correction kernel executes the same set of operators as the original subprogram on the reduction domains that are not equivalent to it, correcting those positions.
  • fusion optimization of the error-correction kernel: to reduce the overhead introduced by the error-correction kernel, this work optimizes the generated kernel with kernel fusion. As shown in the process from (b) to (c) of Figure 5, when the error-correction kernel uses the same operator as the optimized subprogram (i.e., Conv-1 and Conv-2, sharing the weight W1), they can be merged into one computation kernel by kernel fusion (i.e., Conv-1-2 in Figure 5(c)).
  • when the operator in the error-correction kernel differs from that in the optimized subprogram, note that the error-correction kernel uses the same operators as the original subprogram before optimization, while the operators of the optimized subprogram are obtained by transforming those of the original subprogram; the operators of the error-correction kernel can therefore always be transformed into the same operators as the optimized subprogram, enabling further fusion optimization.
  • the input program is converted into a list of subroutines.
  • the global optimizer in code generation optimization traverses each subprogram to generate its mutation list by querying the mutation generator, and keeps the top K candidates of the whole program with the best performance so far in a greedy manner.
  • K candidates are selected and saved for each subprogram (saving all mutations would take up too much space), and the most suitable subprogram is then selected for splicing in the subsequent splicing process.
  • the most suitable subprogram is the one with the smallest conversion cost at the "seams" with the preceding and following subprograms after splicing.
  • the global optimizer carefully designs the mutation process through several key hyperparameters. First, if the subroutine is too large, it is divided into smaller operator subsets, mutation generation is performed on each subset only, and the remaining operators are kept unchanged while mutation generation is performed on one operator subset. Second, by allowing iterative mutations on subroutines to take at most r rounds, the search space can be greatly expanded and more complex and potentially more optimal mutation generation can be achieved.
  • two key steps are involved in the program splicing process of this application: first, how to search each subprogram separately (i.e., the process of finding the most suitable mutation); second, the post-optimization during splicing, which makes the generated programs more efficient.
  • Figure 6 shows the results of two subroutines after mutation optimization. Firstly, the R/T operator and the nonlinear ReLU operator are reordered, as shown in Figure 6(b), so that all the R/T operators at the connection of the two subroutines are continuous. Obviously, since the calculation of nonlinear activation function operators including ReLU is element-wise, the correctness of this reordering operation can be guaranteed. Next, we will continue to carry out 3 post-optimization processes:
  • Reversible operator elimination: any set of R/T operators that cancel each other out (i.e., running them on a tensor is equivalent to a no-op) is called a reversible transformation, and reversible transformations can clearly be deleted from the program. In the example of Figure 6(b), R/T-E and R/T-G are reversible transformations that can be eliminated.
  • for statically known tensors, the post-optimizer completes their memory rearrangement in the preprocessing stage: for the convolution weight tensors w1 and w2 in Figure 6(b), the operators R/T-B and R/T-I corresponding to the memory rearrangement are executed during weight preprocessing rather than at runtime.
  • some tensors are used to save "weights" and can be determined statically (more precisely, they are determined during training; the work of this application targets inference, and the weights are fixed before inference). Preprocessing means that, before inference, the weight tensors are first transformed according to the optimized computational graph (using operators such as R/T-B in Figure 6) and then stored in their transformed form, so that these transformation operators need not be executed again during inference.
  • Case 1: As shown in Figure 7, two separate images are concatenated along the width direction into one larger image, i.e., data is moved from the N dimension to the W dimension. For particular dimension sizes, this transformation provides greater parallelism and improves the locality of the computation, thereby improving performance.
  • the idea of memory rearrangement provides new opportunities for tensor program optimization. However, after this transformation, the sub-region of the output tensor along the merge boundary (the diagonally hatched positions in Figure 7(b)) contains elements that differ from the original result, so the optimization is a locally equivalent optimization rather than a fully equivalent one.
  • Case 2: As shown in Figure 8, a mutation is given that turns the computation of a dilated convolution into that of a standard convolution through memory rearrangement. The diagonally hatched parts of the figure mark the non-equivalent positions in the mutation, which the tensor program optimization system further corrects with the mutation error corrector. This optimization converts inefficient dilated convolutions into standard convolutions that are highly optimized by existing operator libraries, enabling efficient algorithms such as Winograd and FFT.
  • Figure 9 shows two graph transformation strategies for optimizing the Inception module.
  • Figure 9(a) shows a non-equivalent transformation based on memory rearrangement that pads W2 with zeros so that it has the same shape as W1, allowing two conv operators to be fused into one group conv operator.
  • during error correction, the results computed from the zeros padded into W2 must be removed (the tensor marked as the zeros part; it can be deduced from the computation that all of its elements are 0).
  • Figure 9(b) is the found equivalent conversion.
  • the basic principle of this conversion is also memory rearrangement, but in the process, redundant copying of some tensors is required.
  • this conversion copies the input tensor I2 and merges the input tensors and weights through concat operators, thereby fusing the two conv operators into one group conv operator.
  • the main body of this application is a tensor program optimization system based on memory rearrangement and local equivalent transformation.
  • the system provides corresponding interfaces for users to build tensor program computational graphs, also supports importing models in onnx format, and outputs an executable tensor program.
  • This application can make the execution of tensor programs more efficient.
  • the server used in the experiments is equipped with two 28-core Intel Xeon E5-2680 v4 CPUs (with hyper-threading enabled), 256 GB of DRAM, and an NVIDIA Tesla V100 GPU. All experiments use CUDA 10.2 and cuDNN 7.6.5, except the TVM- and Ansor-related experiments, which directly use the best kernels generated by these two tensor compilers.
  • Resnet-18, a widely used convolutional neural network for image classification;
  • CSRNet, a dilated convolution network for semantic segmentation whose sampling rate can be adjusted arbitrarily to enlarge the receptive field and obtain more accurate predictions;
  • Inception-v3, an improved version of GoogleNet composed of carefully designed Inception modules to improve accuracy and reduce computational complexity;
  • BERT, a network structure for natural language processing with very high accuracy;
  • Resnet18-3D, a neural network for video processing.
  • by combining with different backends (cuDNN/cuBLAS, TVM, Ansor), this application can achieve speedups of more than 2x, as shown in Figure 11.
  • the present application also provides a tensor program optimization device, including:
  • the program division module 201 is used to divide the tensor program to be optimized to generate a linear tensor subprogram
  • a mutation generating module 202 configured to generate a mutation program of the subroutine according to a preset set of operators
  • the mutation error correction module 203 is used to correct the unequal mutation program in the mutation program of the subroutine so that each mutation program is equivalent to the corresponding subroutine;
  • the optimized program generation module 204 selects the optimal subroutine from the mutation program after the error correction processing to splice and generate an optimized tensor program.
  • This embodiment also provides an electronic device, which may be a desktop computer, a tablet computer, a mobile terminal, etc., and this embodiment is not limited thereto.
  • the electronic device may refer to the foregoing embodiments of the method and apparatus, the contents of which are incorporated herein, and repeated descriptions will not be repeated.
  • FIG. 13 is a schematic block diagram of a system configuration of an electronic device 600 according to an embodiment of the present application.
  • the electronic device 600 may include a central processing unit 100 and a memory 140 ; the memory 140 is coupled to the central processing unit 100 .
  • this figure is exemplary; other types of structures may also be used in addition to or instead of this structure to implement telecommunications functions or other functions.
  • the tensor program optimization function can be integrated into the CPU 100 .
  • the central processing unit 100 may be configured to perform the following control:
  • the tensor program to be optimized is divided to generate linear tensor subprograms;
  • mutation programs of the subprograms are generated according to a preset operator set;
  • error correction is performed on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram;
  • the optimal subprograms are selected from the error-corrected mutation programs and spliced to generate an optimized tensor program.
  • the tensor program optimization device can also be configured separately from the central processing unit 100; for example, it can be configured as a chip connected to the central processing unit 100, and the tensor program optimization function is realized under the control of the central processing unit.
  • the electronic device 600 may further include: a communication module 110 , an input unit 120 , an audio processing unit 130 , a display 160 , and a power supply 170 . It should be noted that the electronic device 600 does not necessarily include all the components shown in FIG. 13; in addition, the electronic device 600 may also include components not shown in FIG. 13, and reference may be made to the prior art.
  • the central processing unit 100 is sometimes also referred to as a controller or an operating control, and may include a microprocessor or other processor devices and/or logic devices.
  • the central processing unit 100 receives input and controls the operation of the various components of the electronic device 600.
  • the memory 140 may be, for example, one or more of a cache, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable devices.
  • the memory may store the above-mentioned information related to failures, as well as the programs for processing such information.
  • the central processing unit 100 can execute the program stored in the memory 140 to implement information storage or processing.
  • the input unit 120 provides input to the CPU 100 .
  • the input unit 120 is, for example, a button or a touch input device.
  • the power supply 170 is used to provide power to the electronic device 600 .
  • the display 160 is used to display display objects such as images and characters.
  • the display can be, for example, an LCD display, but is not limited thereto.
  • the memory 140 may be a solid-state memory, for example, a read-only memory (ROM), a random access memory (RAM), a SIM card, or the like; it may also be a memory that holds information even when powered off, that can be selectively erased and provided with more data, an example of which is sometimes called an EPROM or the like. Memory 140 may also be some other type of device. Memory 140 includes a buffer memory 141 (sometimes referred to as a buffer), and may include an application/function storage part 142 for storing application programs and function programs or procedures for executing the operations of the electronic device 600 through the CPU 100.
  • the memory 140 may also include a data storage 143 for storing data such as contacts, numerical data, pictures, sounds and/or any other data used by the electronic device.
  • the driver storage part 144 of the memory 140 may include various drivers of the electronic device for communication functions and/or for executing other functions of the electronic device (such as messaging applications, address book applications, etc.).
  • the communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111 .
  • a communication module (transmitter/receiver) 110 is coupled to the central processing unit 100 to provide input signals and receive output signals, which may be the same as a conventional mobile communication terminal.
  • multiple communication modules 110 may be provided in the same electronic device, such as a cellular network module, a Bluetooth module and/or a wireless local area network module, and the like.
  • the communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide an audio output via the speaker 131 and receive an audio input from the microphone 132 for general telecommunication functions.
  • Audio processor 130 may include any suitable buffers, decoders, amplifiers, and the like.
  • the audio processor 130 is also coupled to the central processing unit 100, so that the microphone 132 can be used to record on the machine, and the speaker 131 can be used to play the sound stored on the machine.
  • the embodiment of the present application also provides a computer-readable program, wherein when the program is executed in the electronic device, the program causes the computer to execute the tensor program optimization method as described in the above embodiments in the electronic device.
  • An embodiment of the present application also provides a storage medium storing a computer-readable program, wherein the computer-readable program enables a computer to execute the tensor program optimization described in the above-mentioned embodiments in an electronic device.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, where the instruction means implement the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This application provides a tensor program optimization method and device. The method includes: dividing the tensor program to be optimized to generate linear tensor subprograms; generating mutation programs of the subprograms according to a preset operator set; performing error correction on the non-equivalent mutation programs so that each mutation program is equivalent to its corresponding subprogram; and selecting optimal subprograms from the error-corrected mutation programs and splicing them to generate an optimized tensor program. By dividing the tensor program to be optimized into linear tensor subprograms, this application reduces the search space that needs to be explored, and by combining the mutations of the subprograms into a complete tensor program in an optimal way, it makes the execution of the tensor program more efficient.

Description

Tensor program optimization method and device
Related Application
This application claims priority to Chinese invention patent application No. 202110788296.8, filed on July 13, 2021, the disclosure of which is incorporated herein as a part of this application.
Technical Field
This application relates to data processing technology, and in particular to a tensor program optimization method and device.
Background
Existing deep learning frameworks represent deep learning programs as tensor programs composed of tensors and operators, and such tensor programs are usually expressed as computational graphs. In the prior art, optimization schemes for tensor programs fall into three types: graph-level optimization, which performs equivalent replacement of operators on the computational graph; data layout optimization, which replaces the memory layout of tensors; and operator-level optimization, which generates more efficient operators.
Existing operator libraries such as cuDNN, cuBLAS and Intel MKL support different data layouts, and applying the most suitable memory layout to each operator can improve its computational efficiency. Current techniques for optimizing memory layout mainly have the following limitations: 1. the data layouts considered by the underlying operator libraries are only simple reorderings of a tensor's dimensions, such as row-major and column-major for two-dimensional tensors, or NCHW and NHWC for four-dimensional tensors; 2. tensor program optimization work based on memory layout optimization only considers the relatively limited layouts supported by the underlying operator libraries. For the optimization of computational graphs, many existing works use rule-based optimization methods: a computational subgraph composed of several operators is replaced with another subgraph according to pre-designed rules. Rule-based methods, however, require developers to understand the possible transformations of computational subgraphs well enough to design the corresponding rules. Moreover, since all rules must be specified manually, the search space is often quite limited, and unknown transformation rules cannot be explored.
The prior art also includes automatic search algorithms that can optimize a computational graph automatically and guarantee its mathematical correctness through formal verification. However, current automatic search algorithms only support fully equivalent transformations, their formal verification must be derived from a set of basic rules, and the required verification time is relatively long.
Summary of the Invention
In view of the defects in existing tensor program optimization, this application aims to expand the optimization space of tensor programs with a new optimization method.
An embodiment of the first aspect of this application provides a tensor program optimization method, including:
dividing the tensor program to be optimized to generate linear tensor subprograms;
generating mutation programs of the subprograms according to a preset operator set;
performing error correction on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram;
selecting optimal subprograms from the error-corrected mutation programs and splicing them to generate an optimized tensor program.
An embodiment of the second aspect of this application provides a tensor program optimization device, including:
a program division module, configured to divide the tensor program to be optimized to generate linear tensor subprograms;
a mutation generation module, configured to generate mutation programs of the subprograms according to a preset operator set;
a mutation error correction module, configured to perform error correction on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram;
an optimized program generation module, configured to select optimal subprograms from the error-corrected mutation programs and splice them to generate an optimized tensor program.
An embodiment of the third aspect of this application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above method when executing the computer program.
An embodiment of the fourth aspect of this application provides a computer-readable storage medium storing a computer program for executing the above method.
This application divides the tensor program to be optimized into linear tensor subprograms, which reduces the search space to be explored; it explores all possible locally equivalent, memory-rearrangement-based transformations of each subprogram and generates all possible mutations of the subprogram. To guarantee the end-to-end correctness of the optimized tensor program, the outputs of the tensor subprograms before and after optimization are compared, and the non-equivalent positions are found and corrected. The corrected mutations are then further optimized, and the mutations of the subprograms are combined into a complete tensor program in an optimal way, which makes the execution of the tensor program more efficient.
To make the above and other objects, features and advantages of this application more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flow chart of the tensor program optimization method provided by this application;
Figure 2 is a schematic diagram in an embodiment of this application;
Figure 3 is a schematic diagram in an embodiment of this application;
Figure 4 is a schematic diagram in an embodiment of this application;
Figure 5 is a schematic diagram in an embodiment of this application;
Figure 6 is a schematic diagram in an embodiment of this application;
Figure 7 is a schematic diagram in an embodiment of this application;
Figure 8 is a schematic diagram in an embodiment of this application;
Figure 9 is a schematic diagram in an embodiment of this application;
Figure 10 is a schematic diagram in an embodiment of this application;
Figure 11 is a schematic diagram in an embodiment of this application;
Figure 12 is a block diagram of the tensor program optimization device provided by this application;
Figure 13 is a schematic diagram of an electronic device in an embodiment of this application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
To expand the optimization space of tensor programs and make their execution more efficient, this application provides a tensor program optimization method. As shown in Figure 1, the method of this application includes:
Step S101: dividing the tensor program to be optimized to generate linear tensor subprograms;
Step S102: generating mutation programs of the subprograms according to a preset operator set;
Step S103: performing error correction on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram;
Step S104: selecting optimal subprograms from the error-corrected mutation programs and splicing them to generate an optimized tensor program. A minimal sketch of this four-step flow is given below.
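The following Python-style sketch summarizes the four steps only; all helper names (partition_at_nonlinear, generate_mutations, correct_mutation, measure_latency, splice) are illustrative assumptions, not an API defined by this application.

    def optimize(tensor_program, operator_set, top_k):
        # S101: split at nonlinear activation operators so that every
        # subprogram is a linear tensor program.
        subprograms = partition_at_nonlinear(tensor_program)
        candidates = []
        for sub in subprograms:
            # S102: enumerate memory-rearrangement-based mutations of sub.
            mutations = generate_mutations(sub, operator_set)
            # S103: attach error-correction kernels at the non-equivalent
            # output positions so each mutation becomes equivalent to sub.
            corrected = [correct_mutation(sub, m) for m in mutations]
            # Keep only the K fastest candidates per subprogram.
            candidates.append(sorted(corrected, key=measure_latency)[:top_k])
        # S104: pick per-subprogram candidates with the cheapest layout
        # conversion at the seams and splice them into one program.
        return splice(candidates)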
In the embodiments of this application, in step S101, dividing the tensor program to be optimized to generate linear tensor subprograms includes:
determining the nonlinear activation function operators in the tensor program to be optimized;
dividing the tensor program to be optimized according to the nonlinear activation function operators to generate linear tensor subprograms.
In the embodiments of this application, nonlinear operators are used as split points in the tensor program to be optimized. Nonlinear operators, such as the activation functions in deep neural networks, are widely used in tensor programs. Usually, each linear operator, or every few linear operators, is followed by a nonlinear activation function operator (such as ReLU or sigmoid); in the embodiments of this application, these operators can effectively split the tensor program into subprograms of the desired size. Moreover, for state-of-the-art equivalent graph optimization algorithms, almost no nonlinear operators appear in their replacement patterns (except operator fusion, which is handled in code generation optimization), so leaving nonlinear operators outside the subprograms does not significantly affect the optimization effect.
For two multi-linear tensor programs $\mathcal{P}$ and $\mathcal{P}'$: if $\mathcal{P}'$ has the same number and shapes of inputs and outputs as $\mathcal{P}$, then $\mathcal{P}'$ is called a mutation of $\mathcal{P}$ (its computation need not be equivalent). If $\mathcal{P}$ is a subprogram of some tensor program, then replacing $\mathcal{P}$ with $\mathcal{P}'$ still yields a legal tensor program, although it may produce different results; replacing a subprogram with one of its mutations therefore preserves the legality of the program, but not necessarily its correctness or mathematical equivalence.
After the tensor program has been divided at the nonlinear operators, the sizes of the subprograms can be further adjusted. To combine multiple independent operators into larger groups or batches in search of more optimization opportunities, several mutually independent subprograms are composed into a larger subprogram, such as the multiple branches in the Inception network. On the other hand, if a subprogram is too large, only one subset of its operators is mutated and optimized at a time.
For each subprogram, all possible locally equivalent transformations based on memory rearrangement are explored, and all possible mutations of the subprogram are generated. Each mutation has the same input tensors and output tensors of the same shapes as the original subprogram.
Specifically, in the embodiments of this application, generating the mutation programs of the subprograms according to the preset operator set includes:
Step 1: enumerating the tensors in the subprogram as inputs of each operator in the preset operator set;
Step 2: adding the output tensor of the operator to the subprogram;
Step 3: judging whether the size of the subprogram exceeds a preset threshold; if the threshold has not been reached, executing steps 1 to 3 again, and terminating once it is reached.
In the embodiments of this application, the preset operator set includes compute-intensive operators, element-wise operators and tensor manipulation operators.
In the embodiments of this application, for any given subprogram $\mathcal{P}$, its possible mutations are generated on the basis of a given operator set $\mathcal{O}$, where $\mathcal{O}$ covers the operators most commonly used in DNNs (DNN, Deep Neural Network) and other tensor programs, including compute-intensive operators (such as conv and matmul), element-wise operators (such as add and mul), and tensor manipulation operators (such as split and transpose). The set $\mathcal{O}$ can easily be extended according to the user's needs to cover many different types of tensor programs. To generate potential mutations, the process starts from an empty program that contains no operators but includes the original input tensors of $\mathcal{P}$; it enumerates each operator in $\mathcal{O}$ and, at the same time, all tensors available in the current program as inputs of that operator, adding the operator's output tensor to the current program, until the size of the current program reaches a certain threshold (called the mutation depth). If a generated mutation $\mathcal{P}'$ has the same number and shapes of inputs and outputs as the original program $\mathcal{P}$, it is considered a valid mutant of $\mathcal{P}$.
To guarantee the end-to-end correctness of the optimized tensor program, the outputs of the tensor subprograms before and after optimization are compared, and the non-equivalent positions are found and corrected. In this embodiment, the mutation error correction relies on a rigorous linear algebra foundation to simplify this extremely complex task.
Mutation generation can produce higher-performance mutations of a subprogram, but it cannot guarantee that a mutation is computationally equivalent to the original program. To maintain the end-to-end correctness of the tensor program, the positions where a mutation is not equivalent to the original subprogram must be found and corrected. In this embodiment, the original multi-linear tensor program $\mathcal{P}$ and one of its mutations $\mathcal{P}'$ are taken as input, the non-equivalent positions in their outputs are found automatically, and the corresponding error-correction kernels are generated to guarantee that $\mathcal{P}$ and $\mathcal{P}'$ are functionally equivalent.
To simplify the analysis, it is assumed that each input subprogram $\mathcal{P}$ and its mutation $\mathcal{P}'$ have only one output; by analyzing each output in turn, the results generalize easily to subprograms with multiple outputs. The computation of an output tensor usually involves reduction along certain dimensions. For example, the convolution operator conv accumulates, over the channel dimension, the products of the input tensor and the weight tensor over the height and width dimensions.
Output positions within the same reduction domain share the same reduction boundaries and similar mathematical properties, and the mutation error corrector performs its equivalence checks on this basis. The embodiments of this application rely on the following two linear algebra theorems:
Theorem 1: For two MLTPs $\mathcal{P}$ and $\mathcal{P}'$ with $m$-dimensional output tensors, let $\mathcal{E} = \{e_1, \dots, e_m\}$ be the set of $m$-dimensional basis vectors, i.e., $e_i$ is a tuple of length $m$ whose $i$-th element is 1 and whose other elements are 0. Let $R$ be a reduction domain of $\mathcal{P}$ and $\mathcal{P}'$, and let $p_0$ be an arbitrary position in $R$. Let $p_j = p_0 + e_j$ for $1 \le j \le m$. If $\mathcal{P}(p_i) = \mathcal{P}'(p_i)$ for $0 \le i \le m$, then $\mathcal{P}(p) = \mathcal{P}'(p)$ for every position $p \in R$.
Theorem 2: For two MLTPs $\mathcal{P}$ and $\mathcal{P}'$, let $p$ be a non-equivalent position of $\mathcal{P}$ and $\mathcal{P}'$, i.e., there exists an input on which the outputs of $\mathcal{P}$ and $\mathcal{P}'$ differ at $p$. Let $I$ be a randomly generated input; then the probability that $\mathcal{P}(I)$ and $\mathcal{P}'(I)$ agree at $p$ is at most $1/d$, where $d$ is the number of all possible values of the input variable $I$.
Theorem 1 shows that if $\mathcal{P}$ and $\mathcal{P}'$ are equivalent at $m+1$ specific positions in a reduction domain, then all other positions in that reduction domain are also equivalent. The theorem adopted in this embodiment greatly reduces the verification workload: instead of checking every position in the output tensor, only $m+1$ specific positions in each reduction domain need to be tested.
Theorem 2 shows that if two subprograms are not equal at some position, then under a random test with 32-bit integers the probability that they compute the same result at that position on a random input is at most $2^{-32}$.
By combining Theorems 1 and 2, checking all combinations of input values that all output positions depend on is reduced to a lightweight task that only requires testing a few representative positions with a few randomly generated inputs.
In the embodiments of this application, performing error correction on the non-equivalent mutation programs among the mutation programs of the subprograms, so that each mutation program is equivalent to its corresponding subprogram, includes:
determining the reduction domains of the subprogram and of its mutation programs by reduction domain propagation;
for the overlapping region of any two reduction domains, performing random tests at the $m+1$ identified positions to determine the non-equivalent mutation programs, where $m$ is the number of dimensions of the subprogram's output tensor;
generating error-correction kernels according to the subprograms corresponding to the non-equivalent mutation programs;
correcting the non-equivalent mutation programs with the generated error-correction kernels.
In the embodiments of this application, the verification algorithm for mutation error correction is as follows:
Step 1: Reduction domain propagation. First, the reduction domains of a given subprogram are computed by reduction domain propagation. The concept of reduction domain propagation is similar to forward and backward propagation in deep learning: the reduction domains of an operator's output tensor are computed from the reduction domains of its input tensors by analyzing the dependencies of the operator's computation. A set of split points is maintained for each dimension of a tensor to mark the boundaries of its reduction domains; for a linear operator, the split points of its output tensor can be inferred from the split points of its input tensors together with the operator type and hyperparameters.
Step 2: Random testing of each reduction domain. After all reduction domains in the output tensors of the original program $\mathcal{P}$ and its mutation $\mathcal{P}'$ have been obtained, the intersection of each pair of reduction domains from the two programs is checked using the theorems above. If two reduction domains have no overlapping region, they can be skipped. For each overlapping region, the equivalence of the two programs is checked at the set of $m+1$ positions identified by Theorem 1, where $m$ is the number of dimensions of the output tensor (for example, $m = 4$ for the subprogram above, because the output of conv is four-dimensional). At these $m+1$ positions, random tests are performed with a set of randomly generated inputs.
Step 3: Error-correction kernel. For each reduction domain that fails the random test, an error-correction kernel is generated to correct its output, guaranteeing mathematical equivalence between the original subprogram and its mutation. To correct the output, the error-correction kernel executes the same set of operators as the original subprogram on the reduction domains that are not equivalent to it, correcting those positions.
To reduce the extra overhead introduced by error-correction kernels, the generated kernels are optimized with kernel fusion. When an error-correction kernel uses the same operator as the optimized subprogram, they can be merged into a single computation kernel by kernel fusion. When the operators differ, note that the error-correction kernel uses the same operators as the original subprogram before optimization, while the operators of the optimized subprogram are obtained by transforming the operators of the original subprogram; the operators of the error-correction kernel can therefore always be transformed into the same operators as the optimized subprogram, after which fusion optimization can be applied.
In the embodiments of this application, selecting optimal subprograms from the error-corrected mutation programs and splicing them to generate an optimized tensor program includes:
using a greedy algorithm to select, from the error-corrected mutation programs, K candidate subprograms for each subprogram, where K is a preset value;
selecting, from the K candidates of each subprogram, the candidate with the smallest conversion cost at the seams with the preceding and following subprograms after splicing, as the optimal subprogram;
splicing the determined optimal subprograms to generate the optimized tensor program.
In the embodiments of this application, splicing the determined optimal subprograms to generate an optimized tensor program includes:
splicing the determined optimal subprograms to generate a spliced tensor program;
performing reversible-operator elimination on the spliced tensor program;
performing operator fusion on the spliced tensor program after reversible-operator elimination according to the preprocessed weight tensors, to generate the optimized tensor program; where the preprocessed weight tensors are determined by transforming the preset weight tensors according to the computational graph of the spliced tensor program.
After program division, the input program is converted into a list of subprograms. The global optimizer in code generation optimization traverses each subprogram, queries the mutation generator to generate its mutation list, and greedily keeps the top K candidate whole programs (Cands) with the best performance so far. For each subprogram, K candidates are saved (saving all mutations would take up too much space), and the most suitable one is then selected in the subsequent splicing process; in this embodiment, the most suitable subprogram is the one with the smallest conversion cost at the "seams" with the preceding and following subprograms after splicing.
To explore a sufficiently large mutation space for each subprogram at reasonable time and space cost, the global optimizer carefully designs the mutation process with several key hyperparameters. First, if a subprogram is too large, it is divided into smaller operator subsets; mutation generation is performed on one subset at a time while the remaining operators are kept unchanged. Second, allowing iterative mutation of a subprogram for at most r rounds (r is a preset tunable parameter) greatly expands the search space and enables more complex and possibly better mutations.
After the subprograms have been optimized, the optimization results of all subprograms must be connected together. Besides connecting their input and output tensors, some post-optimization across subprogram boundaries is needed to further improve overall performance. In one embodiment of this application, the R/T operators are first reordered with the nonlinear ReLU operators so that all R/T operators at the junction of two subprograms become contiguous; since the computation of nonlinear activation function operators such as ReLU is element-wise, the correctness of this reordering is guaranteed. Three post-optimization steps then follow:
1. Reversible-operator elimination. Any set of R/T operators that cancel each other out (i.e., running them on a tensor is equivalent to a no-op) is called a reversible transformation; reversible transformations can clearly be deleted from the program without affecting correctness.
2. Operator fusion. The remaining contiguous memory-rearrangement operators R/T are fused into a single kernel to reduce kernel launch overhead (R/T-DH). Meanwhile, activation operators such as ReLU are fused into their adjacent computation operators, as in classical optimizations (Conv-Relu-CF).
3. Preprocessing. If a tensor is statically known (such as a weight tensor), the post-optimizer completes its memory rearrangement in the preprocessing stage; for the convolution weight tensors w1 and w2 in Figure 6(b), the operators R/T-B and R/T-I corresponding to the memory rearrangement are executed during weight preprocessing rather than at runtime.
In deep learning applications, some tensors are used to store weights, and these weight tensors can be determined statically. Preprocessing means that, before inference, the weight tensors are first transformed according to the optimized computational graph, and the transformed results are stored as the weight tensors, so that these transformation operators need not be executed again when inference is actually performed.
The tensor program optimization method provided by this application is an optimization method based on memory rearrangement and locally equivalent transformation, which is not currently used by other frameworks. Although most of the transformations can be assembled from combinations of classical operators, their benefit cannot be realized without dedicated code generation optimization for these operators. The system described in this application is therefore a complete system, and existing frameworks cannot replace most of its work. In addition, by combining with different backends (such as cuDNN/cuBLAS, TVM and Ansor), this application can achieve speedups of more than 2x.
An embodiment of this application also provides a tensor program optimization system based on memory rearrangement and locally equivalent transformation, comprising several main modules: program division, a mutation generator, a mutation error corrector, and code generation optimization.
The overall architecture of the tensor program optimization system provided by this embodiment is shown in Figure 2.
In the embodiments of this application, the input of the tensor program optimization system is a tensor program to be optimized. The input program is first divided into smaller subprograms to reduce the search space to be explored and to guarantee that all subprograms are linear programs (so that the subsequent verification theory applies).
In this embodiment, for each subprogram, the mutation generator explores all possible locally equivalent, memory-rearrangement-based transformations and generates all possible mutations of the subprogram. Each mutation has the same input tensors and output tensors of the same shapes as the original subprogram.
To guarantee the end-to-end correctness of the optimized tensor program, the mutation error corrector compares the outputs of the tensor subprograms before and after optimization, and finds and corrects the non-equivalent positions. The mutation error corrector relies on a rigorous linear algebra foundation to simplify this extremely complex task.
The corrected mutations are then passed to the program optimizer, which combines the mutations of the subprograms into a complete tensor program in an optimal way and tunes it further to obtain the optimized tensor program.
The program division, mutation generator, mutation error corrector and code generation optimization modules are introduced separately below, together with several typical cases illustrating the optimization effects the system can achieve.
(1) Program division:
In this embodiment, nonlinear operators are used as split points in the input program. First, nonlinear operators such as the activation functions in deep neural networks are widely used in tensor programs; usually each linear operator, or every few linear operators, is followed by a nonlinear activation function operator (such as ReLU or sigmoid), and these operators can effectively split the tensor program into subprograms of the desired size. Second, since the theoretical basis of the mutation error corrector only applies to multi-linear tensor programs, all nonlinear operators must be excluded from the subprograms, which makes nonlinear operators a natural choice of split point. Third, for state-of-the-art equivalent graph optimization algorithms, almost no nonlinear operators appear in their replacement patterns (except operator fusion, which is handled in code generation optimization), so keeping them outside the subprograms does not significantly affect the optimization effect.
After the tensor program has been divided at the nonlinear operators, the sizes of the subprograms can be further adjusted. To combine multiple independent operators into larger groups or batches in search of more optimization opportunities, several mutually independent subprograms are composed into a larger subprogram, such as the multiple branches in the Inception network. On the other hand, if a subprogram is too large, only one subset of its operators is mutated and optimized at a time.
(2) Mutation generator:
For two multi-linear tensor programs $\mathcal{P}$ and $\mathcal{P}'$: if $\mathcal{P}'$ has the same number and shapes of inputs and outputs as $\mathcal{P}$, then $\mathcal{P}'$ is called a mutation of $\mathcal{P}$ (its computation need not be equivalent). If $\mathcal{P}$ is a subprogram of some tensor program, then replacing $\mathcal{P}$ with $\mathcal{P}'$ still yields a legal tensor program, although it may produce different results; replacing a subprogram with one of its mutations therefore preserves the legality of the program, but not necessarily its correctness or mathematical equivalence.
For any given $\mathcal{P}$, the mutation generator generates its possible mutations on the basis of a given operator set $\mathcal{O}$, which covers the operators most commonly used in DNNs and other tensor programs, including compute-intensive operators (such as conv and matmul), element-wise operators (such as add and mul), and tensor manipulation operators (such as split and transpose). The set can easily be extended according to the user's needs to cover many different types of tensor programs. To generate potential mutations, the mutation generator starts from an empty program that contains no operators but includes the original input tensors of $\mathcal{P}$; it enumerates each operator in $\mathcal{O}$ and, at the same time, all tensors available in the current program as inputs of that operator, adding the operator's output tensor to the current program, until the size of the current program reaches a certain threshold (called the mutation depth; the threshold in the algorithm below). If a generated mutation $\mathcal{P}'$ has the same number and shapes of inputs and outputs as the original program $\mathcal{P}$, the mutation generator considers $\mathcal{P}'$ a valid mutant of $\mathcal{P}$.
The code of the depth-first search algorithm used to generate the possible mutants of $\mathcal{P}$ appears as a figure in the original publication and is not reproduced here.
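Since the original listing survives only as a figure, the following is a Python reconstruction sketch of such a depth-first enumeration; matches_io_signature, op.num_inputs and op.try_apply are assumed interfaces, and this illustrates the described search rather than reproducing the patent's verbatim code.

    from itertools import product

    def build_mutants(inputs, operator_set, threshold, matches_io_signature):
        # inputs: the original input tensors of P; threshold: mutation depth.
        mutants = []

        def dfs(tensors, depth):
            # Record the current program if its inputs/outputs match the
            # number and shapes of the original subprogram P (a valid mutant).
            if matches_io_signature(tensors):
                mutants.append(list(tensors))
            if depth >= threshold:        # mutation depth reached
                return
            for op in operator_set:
                # Enumerate every ordered choice of available tensors as
                # inputs of this operator.
                for args in product(tensors, repeat=op.num_inputs):
                    out = op.try_apply(*args)   # None if shapes incompatible
                    if out is None:
                        continue
                    tensors.append(out)
                    dfs(tensors, depth + 1)
                    tensors.pop()

        dfs(list(inputs), 0)
        return mutants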
(3) Mutation error corrector:
The mutation generator can produce higher-performance mutations of a subprogram, but it cannot guarantee that a mutation is computationally equivalent to the original program. To maintain the end-to-end correctness of the tensor program, the positions where a mutation is not equivalent to the original subprogram must be found and corrected. The mutation error corrector takes the original multi-linear tensor program $\mathcal{P}$ and one of its mutations $\mathcal{P}'$ as input, automatically finds the non-equivalent positions in their outputs, and generates the corresponding error-correction kernels to guarantee that $\mathcal{P}$ and $\mathcal{P}'$ are functionally equivalent.
To simplify the analysis, it is assumed that each input subprogram $\mathcal{P}$ and its mutation $\mathcal{P}'$ have only one output; by analyzing each output in turn, the results generalize easily to subprograms with multiple outputs. The computation of an output tensor usually involves reduction along certain dimensions; for example, the convolution operator conv accumulates, over the channel dimension, the products of the input tensor and the weight tensor over the height and width dimensions.
However, different positions in the output tensor may involve different reduction boundaries. For example, for a convolution with a 3x3 kernel, computing the top-left output position involves only a 2x2 region of the kernel, because the computation at that position exceeds the boundary of the input tensor in the leftmost column and the topmost row (in the case of same padding), as shown in Figure 3. Output positions with the same reduction boundaries can be grouped into the same reduction domain: if several positions in the output tensor share exactly the same reduction boundaries, they belong to the same reduction domain. Figure 3 shows the nine reduction domains of a conv with a 3x3 kernel and their corresponding reduction boundaries.
Output positions within the same reduction domain share the same reduction boundaries and similar mathematical properties, and the mutation error corrector performs its equivalence checks on this basis. The mutation error corrector relies on the following two linear algebra theorems:
Theorem 1: For two subprograms $\mathcal{P}$ and $\mathcal{P}'$ with $m$-dimensional output tensors, let $\mathcal{E} = \{e_1, \dots, e_m\}$ be the set of $m$-dimensional basis vectors, i.e., $e_i$ is a tuple of length $m$ whose $i$-th element is 1 and whose other elements are 0. Let $R$ be a reduction domain of $\mathcal{P}$ and $\mathcal{P}'$, and let $p_0$ be an arbitrary position in $R$. Let $p_j = p_0 + e_j$ for $1 \le j \le m$. If $\mathcal{P}(p_i) = \mathcal{P}'(p_i)$ for $0 \le i \le m$, then $\mathcal{P}(p) = \mathcal{P}'(p)$ for every position $p \in R$.
Theorem 2: For two subprograms $\mathcal{P}$ and $\mathcal{P}'$, let $p$ be a non-equivalent position of $\mathcal{P}$ and $\mathcal{P}'$, i.e., there exists an input on which the outputs of $\mathcal{P}$ and $\mathcal{P}'$ differ at $p$. Let $I$ be a randomly generated input; then the probability that $\mathcal{P}(I)$ and $\mathcal{P}'(I)$ agree at $p$ is at most $1/d$, where $d$ is the number of all possible values of the input variable $I$.
Theorem 1 shows that if $\mathcal{P}$ and $\mathcal{P}'$ are equivalent at $m+1$ specific positions in a reduction domain, then all other positions in that reduction domain are also equivalent. This theorem greatly reduces the verification workload: the mutation error corrector does not need to check every position in the output tensor, but only $m+1$ specific positions in each reduction domain.
Theorem 2 shows that if two subprograms are not equal at some position, then under a random test with 32-bit integers the probability that they compute the same result at that position on a random input is at most $2^{-32}$.
By combining Theorems 1 and 2, the mutation error corrector reduces checking all combinations of input values that all output positions depend on to a lightweight task that only requires testing a few representative positions with a few randomly generated inputs.
In this embodiment, the verification algorithm of the mutation error corrector is as follows:
Step 1: Reduction domain propagation. First, the mutation error corrector computes the reduction domains of a given subprogram by reduction domain propagation. The concept of reduction domain propagation is similar to forward and backward propagation in deep learning: the reduction domains of an operator's output tensor are computed from the reduction domains of its input tensors by analyzing the dependencies of the operator's computation. The mutation error corrector maintains a set of split points for each dimension of a tensor to mark the boundaries of its reduction domains. For a linear operator, the split points of its output tensor can be inferred from the split points of its input tensors together with the operator type and hyperparameters. Figure 4 shows the reduction domain propagation process; a concrete illustration follows.
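As a concrete illustration (a sketch of ours, not the patent's implementation), the following Python function propagates split points through a stride-1, same-padding convolution: every split point of an input spatial dimension induces split points at all output positions whose reduction window touches it.

    def propagate_splits_conv2d_same(in_splits, kernel=(3, 3)):
        # in_splits maps each spatial dimension ('H', 'W') to a sorted list
        # of split points; the last point is the dimension size.
        kh, kw = kernel
        out_splits = {}
        for dim, k in (("H", kh), ("W", kw)):
            splits = in_splits[dim]
            size, pad = splits[-1], (k - 1) // 2
            pts = set()
            for s in splits:
                # An input boundary at s affects every output position whose
                # k-wide reduction window overlaps it.
                pts.update(p for p in range(s - pad, s + pad + 1) if 0 <= p <= size)
            out_splits[dim] = sorted(pts)
        return out_splits

    # A 6x6 input with splits {0, 6} per dimension yields {0, 1, 5, 6}:
    # border rows/columns form their own reduction domains, giving the
    # 3 x 3 = 9 domains of a 3x3 same-padding conv (cf. Figure 3).
    print(propagate_splits_conv2d_same({"H": [0, 6], "W": [0, 6]}))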
Step 2: Random testing of each reduction domain. After all reduction domains in the output tensors of the original program $\mathcal{P}$ and its mutation $\mathcal{P}'$ have been obtained, the mutation error corrector checks the intersection of each pair of reduction domains from the two programs using the theorems above. If two reduction domains have no overlapping region, they can be skipped. For each overlapping region, the equivalence of the two programs is checked at the set of $m+1$ positions identified by Theorem 1, where $m$ is the number of dimensions of the output tensor. At these $m+1$ positions, the mutation error corrector performs random tests with a set of randomly generated inputs.
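A minimal sketch of this random test, assuming run_P and run_P_mut evaluate the two subprograms on a dict of named inputs and input_shapes lists those inputs (illustrative names, not the patent's API):

    import numpy as np

    def domain_probably_equivalent(run_P, run_P_mut, input_shapes, box, trials=2):
        lo, hi = box                      # corners of the overlapping domain
        m = len(lo)                       # output tensor dimensionality
        p0 = tuple(lo)
        # Theorem 1: the m + 1 positions p0, p0 + e_1, ..., p0 + e_m suffice.
        positions = [p0] + [
            tuple(p0[j] + (1 if j == i else 0) for j in range(m))
            for i in range(m) if p0[i] + 1 < hi[i]
        ]
        for _ in range(trials):
            # Theorem 2: a non-equivalent position survives one random
            # 32-bit integer test with probability at most 2**-32.
            I = {name: np.random.randint(-2**31, 2**31 - 1, size=shape, dtype=np.int64)
                 for name, shape in input_shapes.items()}
            out, out_mut = run_P(I), run_P_mut(I)
            if any(out[p] != out_mut[p] for p in positions):
                return False              # domain needs a correction kernel
        return True                       # equivalent on this domain w.h.p.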
Step 3: Error-correction kernel. For each reduction domain that fails the random test, the tensor program optimization system generates an error-correction kernel to correct its output, guaranteeing mathematical equivalence between the original subprogram and its mutation. To correct the output, the error-correction kernel executes the same set of operators as the original subprogram on the reduction domains that are not equivalent to it, correcting those positions.
The process from (a) to (b) in Figure 5 shows the generation of the error-correction kernel, and the process from (b) to (c) illustrates the fusion of the error-correction kernel.
Fusion optimization of the error-correction kernel: to reduce the extra overhead introduced by error-correction kernels, this work optimizes the generated kernels with kernel fusion. As shown in the process from (b) to (c) of Figure 5, when the error-correction kernel uses the same operator as the optimized subprogram (i.e., Conv-1 and Conv-2, sharing the weight W1), they can be merged into one computation kernel by kernel fusion (i.e., Conv-1-2 in Figure 5(c)). When the operator in the error-correction kernel differs from that in the optimized subprogram, note that the error-correction kernel uses the same operators as the original subprogram before optimization, while the operators of the optimized subprogram are obtained by transforming those of the original subprogram; the operators of the error-correction kernel can therefore always be transformed into the same operators as the optimized subprogram, after which further fusion optimization can be applied.
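The division of labour between the mutant and the correction can be pictured as a masked recomputation (a simplified sketch; a real system emits a fused kernel that evaluates only the failing regions instead of calling the original subprogram in full, and run_original_on is an assumed helper that runs the original operators restricted to one region):

    def run_with_correction(run_mutant, run_original_on, failing_domains, I):
        out = run_mutant(I)               # fast, memory-rearranged computation
        for lo, hi in failing_domains:    # domains that failed the random test
            region = tuple(slice(l, h) for l, h in zip(lo, hi))
            # Re-run the original subprogram's operators on just this
            # reduction domain and overwrite the non-equivalent positions.
            out[region] = run_original_on(I, region)
        return out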
(4) Code generation optimization:
After program division, the input program is converted into a list of subprograms. The global optimizer in code generation optimization traverses each subprogram, queries the mutation generator to generate its mutation list, and greedily keeps the top K candidate whole programs with the best performance so far. For each subprogram, K candidates are saved (saving all mutations would take up too much space), and the most suitable one is then selected for splicing in the subsequent splicing process; the most suitable subprogram is the one with the smallest conversion cost at the "seams" with the preceding and following subprograms after splicing.
To explore a sufficiently large mutation space for each subprogram at reasonable time and space cost, the global optimizer carefully designs the mutation process with several key hyperparameters. First, if a subprogram is too large, it is divided into smaller operator subsets; mutation generation is performed on one subset at a time while the remaining operators are kept unchanged. Second, allowing iterative mutation of a subprogram for at most r rounds greatly expands the search space and enables more complex and possibly better mutations.
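A sketch of this greedy search (our illustration; mutations_of and cost are assumed helpers that return the corrected mutations of a subprogram and estimate a candidate's running time):

    import heapq

    def global_optimize(subprograms, K):
        cands = [([], 0.0)]                          # (partial program, cost)
        for sub in subprograms:
            extended = [
                (partial + [mut], c + cost(mut))
                for partial, c in cands
                for mut in mutations_of(sub)         # corrected mutants + sub itself
            ]
            # Keep only the K best whole-program prefixes seen so far.
            cands = heapq.nsmallest(K, extended, key=lambda t: t[1])
        # The final choice additionally weighs the layout-conversion cost at
        # the "seams" between adjacent subprograms before splicing.
        return min(cands, key=lambda t: t[1])[0]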
Two key steps are involved in the program splicing process of this application: first, how to search each subprogram separately (i.e., the process of finding the most suitable mutation); second, the post-optimization during splicing, which makes the generated program more efficient.
After the subprograms have been optimized, the optimization results of all subprograms must be connected together. The concrete process of connecting subprograms, i.e., splicing several computational graphs, is a basic operation in the field of deep learning optimization and is not described further here.
Besides connecting their input and output tensors, some post-optimization across subprogram boundaries is needed to further improve overall performance.
Figure 6 shows the result of mutation optimization of two subprograms. First, the R/T operators are reordered with the nonlinear ReLU operators, as shown in Figure 6(b), so that all R/T operators at the junction of the two subprograms become contiguous. Since the computation of nonlinear activation function operators such as ReLU is element-wise, the correctness of this reordering is guaranteed. Three post-optimization steps then follow:
1. Reversible-operator elimination. Any set of R/T operators that cancel each other out (i.e., running them on a tensor is equivalent to a no-op) is called a reversible transformation; reversible transformations can clearly be deleted from the program without affecting correctness. In the example of Figure 6(b), R/T-E and R/T-G are reversible transformations that can be eliminated.
2. Operator fusion. As shown in Figure 6(c), the post-optimizer fuses the remaining contiguous memory-rearrangement operators R/T into a single kernel to reduce kernel launch overhead (R/T-DH). Meanwhile, activation operators such as ReLU are fused into their adjacent computation operators, as in classical optimizations (Conv-Relu-CF).
3. Preprocessing. If a tensor is statically known (such as a weight tensor), the post-optimizer completes its memory rearrangement in the preprocessing stage; for the convolution weight tensors w1 and w2 in Figure 6(b), the operators R/T-B and R/T-I corresponding to the memory rearrangement are executed during weight preprocessing rather than at runtime.
In deep learning applications, some tensors are used to store "weights", and these weight tensors can be determined statically (more precisely, they are determined during training; the work of this application targets inference, and the weights are fixed before inference). Preprocessing means that, before inference, the weight tensors are first transformed according to the optimized computational graph (using operators such as R/T-B in Figure 6), and the transformed results are stored as the weight tensors, so that these transformation operators need not be executed again during inference. A minimal sketch of this step follows.
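The following sketch assumes a hypothetical graph object whose nodes expose is_rearrangement, a statically known weight input, and an apply method performing the rearrangement; it illustrates the step rather than the patent's implementation.

    def preprocess_static_weights(graph):
        for node in list(graph.nodes):
            # R/T operators acting on statically known weights (fixed after
            # training) are executed once before inference, like R/T-B and
            # R/T-I on w1 and w2 in Figure 6(b).
            if node.is_rearrangement and node.input.is_static_weight:
                node.input.value = node.apply(node.input.value)
                graph.replace_with_input(node)   # drop the runtime R/T node
        return graph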
典型优化案例:
案例1:如图7所示,沿宽度方向将两个单独的图像连成了一个较大的图像,即将N维度的数据转移到了W维度。而在特定的维度大小下,该变形可以提供更大的并行度并改善计算的局部性,从而提升性能。内存重排的思路为张量程序的优化提供了新的机会。然而,在进行完该变迁后,在输出张量沿着合并边界的子区域上(图(b)上的斜线阴影位置)产生了与原结果不相等的元素,因此该优化为一个局部等价的优化,而非完全等价的优化。
案例2:如图8给出了一种突变,该突变将通过内存重排把空洞卷积的计算变为标准卷积的计算。(图中斜线阴影部分)给出的是突变中存在的不等价位置,张量程序优化系统将通过突变纠错器对其进行进一步修正。该优化将低效的空洞卷积转换为了被现有算子库高度优 化的标准卷积,可以使用如Winograd,FFT等高效算法。
案例3:图9展示了用于优化Inception模块的两种图变换策略。对于具有不同的输出通道的两个并行的conv算子,图9的(a)展示了一个基于内存重排的非等价变换,该变换将W2用0进行填充,使其具有与W1相同的形状,从而可以将两个conv算子融合为一个group conv算子。在其纠错过程中,需要再删掉填充到W2中的0计算出的结果(标记为zeros部分的张量,由其计算过程可推导出该张量所有元素均为0)。
Figure 9(b) shows a discovered equivalent transformation. Its underlying principle is also memory rearrangement, but it requires redundantly copying part of a tensor. The transformation copies the input tensor I2 and merges the input tensors and weights with concat operators, thereby fusing the two conv operators into one group conv operator.
The core of this application is a tensor program optimization system based on memory rearrangement and partially equivalent transformations. The system provides interfaces for users to build tensor program computation graphs, also supports importing models in the onnx format, and outputs an executable tensor program.
This application makes the execution of tensor programs more efficient.
In the experiments of this embodiment, the server was equipped with two 28-core Intel Xeon E5-2680 v4 processors (hyper-threading enabled), 256 GB of DRAM, and one NVIDIA Tesla V100 GPU. Except for the experiments involving TVM and Ansor, all experiments used CUDA 10.2 and cuDNN 7.6.5; the TVM- and Ansor-related experiments directly used the best kernels generated by these two tensor compilers.
The benchmark experiments used five real DNN models:
·Resnet-18, a widely used convolutional neural network for image classification;
·CSRNet, a dilated convolutional network for semantic segmentation whose sampling rate can be adjusted arbitrarily to enlarge the receptive field for more accurate predictions;
·Inception-v3, an improved version of GoogleNet composed of carefully designed Inception modules to improve accuracy and reduce computational complexity;
·BERT, a network architecture for natural language processing with very high accuracy;
·Resnet18-3D, a neural network for video processing.
In end-to-end experiments, this application achieves up to a 2.51x speedup over existing work, as shown in Figure 10.
In operator-level experiments, combined with different backends (cuDNN/cuBLAS, TVM, Ansor), this application achieves speedups of more than 2x, as shown in Figure 11.
The program optimization method of this application, based on memory rearrangement and partially equivalent transformations, is not currently used by other frameworks. Although most of the transformations can be composed from classical operators, their benefit cannot be realized without dedicated code generation optimization for those operators. The system described in this application is therefore a complete system; existing frameworks cannot substitute for most of its work.
As shown in Figure 12, this application also provides a tensor program optimization device, comprising:
a program partitioning module 201, configured to partition a tensor program to be optimized into linear tensor subprograms;
a mutation generation module 202, configured to generate mutant programs of the subprograms according to a preset operator set;
a mutation correction module 203, configured to correct the non-equivalent mutant programs among the subprograms' mutant programs so that every mutant program is equivalent to its corresponding subprogram;
an optimized program generation module 204, configured to select optimal subprograms from the corrected mutant programs and stitch them together to generate the optimized tensor program.
From the description of the foregoing embodiments, those skilled in the art can clearly understand how to implement the tensor program optimization device provided by this application, which is not repeated here.
This embodiment also provides an electronic device, which may be a desktop computer, a tablet computer, a mobile terminal, or the like; this embodiment is not limited thereto. In this embodiment, the electronic device may refer to the foregoing method and device embodiments, the contents of which are incorporated herein; repeated parts are not described again.
Figure 13 is a schematic block diagram of the system configuration of an electronic device 600 according to an embodiment of this application. As shown in Figure 13, the electronic device 600 may include a central processing unit 100 and a memory 140, the memory 140 being coupled to the central processing unit 100. Notably, this figure is exemplary; other types of structures may be used to supplement or replace it in order to implement telecommunication or other functions.
In one embodiment, the tensor program optimization function may be integrated into the central processing unit 100, where the central processing unit 100 may be configured to perform the following control:
partitioning a tensor program to be optimized into linear tensor subprograms;
generating mutant programs of the subprograms according to a preset operator set;
correcting the non-equivalent mutant programs among the subprograms' mutant programs so that every mutant program is equivalent to its corresponding subprogram;
selecting optimal subprograms from the corrected mutant programs and stitching them together to generate the optimized tensor program.
In another embodiment, the tensor program optimization device may be configured separately from the central processing unit 100; for example, the device may be configured as a chip connected to the central processing unit 100, with the tensor program optimization function realized under the control of the central processing unit.
As shown in Figure 13, the electronic device 600 may further include a communication module 110, an input unit 120, an audio processing unit 130, a display 160, and a power supply 170. Notably, the electronic device 600 need not include all the components shown in Figure 13; furthermore, it may also include components not shown in Figure 13, for which reference may be made to the prior art.
As shown in Figure 13, the central processing unit 100, sometimes also called a controller or an operational control, may include a microprocessor or other processor device and/or logic device; the central processing unit 100 receives input and controls the operation of every component of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It may store the above-mentioned failure-related information, and may additionally store programs for executing related information. The central processing unit 100 may execute the program stored in the memory 140 to realize information storage, processing, and the like.
The input unit 120 provides input to the central processing unit 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 supplies power to the electronic device 600. The display 160 displays objects such as images and text. The display may be, for example, an LCD display, but is not limited thereto.
The memory 140 may be a solid-state memory, for example a read-only memory (ROM), a random access memory (RAM), or a SIM card. It may also be a memory that retains information even when powered off, that can be selectively erased and provided with more data, an example of which is sometimes called an EPROM or the like. The memory 140 may also be some other type of device. The memory 140 includes a buffer memory 141 (sometimes called a buffer). The memory 140 may include an application/function storage section 142 for storing application programs and function programs, or procedures for performing the operations of the electronic device 600 through the central processing unit 100.
The memory 140 may further include a data storage section 143 for storing data such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. A driver storage section 144 of the memory 140 may include various drivers of the electronic device for the communication function and/or for performing other functions of the electronic device (such as a messaging application or an address book application).
The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processing unit 100 to provide input signals and receive output signals, which may be the same as in a conventional mobile communication terminal.
Based on different communication technologies, multiple communication modules 110, such as a cellular network module, a Bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled, via the audio processing unit 130, to a speaker 131 and a microphone 132 to provide audio output via the speaker 131 and to receive audio input from the microphone 132, thereby realizing usual telecommunication functions. The audio processing unit 130 may include any suitable buffers, decoders, amplifiers, and the like. In addition, the audio processing unit 130 is also coupled to the central processing unit 100, making it possible to record sound locally through the microphone 132 and to play locally stored sound through the speaker 131.
An embodiment of this application further provides a computer-readable program, wherein, when the program is executed in an electronic device, the program causes a computer to execute, in the electronic device, the tensor program optimization method described in the above embodiments.
An embodiment of this application further provides a storage medium storing a computer-readable program, wherein the computer-readable program causes a computer to execute, in an electronic device, the tensor program optimization method described in the above embodiments.
Preferred embodiments of this application have been described above with reference to the accompanying drawings. Many features and advantages of these embodiments are clear from this detailed description, and the appended claims are therefore intended to cover all such features and advantages that fall within their true spirit and scope. Moreover, since many modifications and changes will readily occur to those skilled in the art, the embodiments of this application are not to be limited to the exact structures and operations illustrated and described; rather, all suitable modifications and equivalents falling within their scope are covered.
Those skilled in the art will understand that embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Specific embodiments have been applied in this application to explain its principles and implementations; the descriptions of the above embodiments are only intended to help understand the method of this application and its core ideas. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the ideas of this application. In summary, the contents of this specification should not be construed as limiting this application.

Claims (15)

  1. A tensor program optimization method, wherein the method comprises:
    partitioning a tensor program to be optimized into linear tensor subprograms;
    generating mutant programs of the subprograms according to a preset operator set;
    correcting non-equivalent mutant programs among the subprograms' mutant programs so that every mutant program is equivalent to its corresponding subprogram;
    selecting optimal subprograms from the corrected mutant programs and stitching them together to generate an optimized tensor program.
  2. The tensor program optimization method according to claim 1, wherein partitioning the tensor program to be optimized into linear tensor subprograms comprises:
    determining the nonlinear activation function operators in the tensor program to be optimized;
    partitioning the tensor program to be optimized into linear tensor subprograms according to the nonlinear activation function operators.
  3. The tensor program optimization method according to claim 1, wherein generating mutant programs of the subprograms according to the preset operator set comprises:
    step 1, enumerating the tensors in the subprogram as inputs to each operator in the preset operator set;
    step 2, adding the operator's output tensor to the subprogram;
    step 3, judging whether the size of the subprogram exceeds a preset threshold; if the threshold has not been reached, executing steps 1 to 3 again; if the threshold is determined to have been reached, terminating.
  4. The tensor program optimization method according to claim 1, wherein the preset operator set comprises compute-intensive operators, element-wise operators, and tensor manipulation operators.
  5. The tensor program optimization method according to claim 1, wherein correcting the non-equivalent mutant programs among the subprograms' mutant programs so that every mutant program is equivalent to its corresponding subprogram comprises:
    determining the reduction domains of the subprogram and its mutant programs by reduction domain propagation;
    for the overlapping region of any two reduction domains, identifying m+1 positions at which random testing is performed to determine the non-equivalent mutant programs, where m is the number of dimensions of the subprogram's output tensor;
    generating a correction kernel from the subprogram corresponding to a non-equivalent mutant program;
    correcting the non-equivalent mutant program with the generated correction kernel.
  6. The tensor program optimization method according to claim 1, wherein selecting optimal subprograms from the corrected mutant programs and stitching them together to generate the optimized tensor program comprises:
    selecting K candidate subprograms for each subprogram from the corrected mutant programs by a greedy algorithm, K being a preset value;
    selecting, from the K candidate subprograms of each subprogram, the candidate whose conversion overhead at the seams with the preceding and following subprograms after stitching is smallest, as the optimal subprogram of that subprogram;
    stitching the determined optimal subprograms together to generate the optimized tensor program.
  7. The tensor program optimization method according to claim 6, wherein stitching the determined optimal subprograms together to generate the optimized tensor program comprises:
    stitching the determined optimal subprograms together to generate a stitched tensor program;
    performing invertible operator elimination on the stitched tensor program;
    performing operator fusion, according to preprocessed weight tensors, on the stitched tensor program after invertible operator elimination, to generate the optimized tensor program, wherein the preprocessed weight tensors are determined by transforming preset weight tensors according to the computation graph of the stitched tensor program.
  8. A tensor program optimization device, wherein the device comprises:
    a program partitioning module, configured to partition a tensor program to be optimized into linear tensor subprograms;
    a mutation generation module, configured to generate mutant programs of the subprograms according to a preset operator set;
    a mutation correction module, configured to correct non-equivalent mutant programs among the subprograms' mutant programs so that every mutant program is equivalent to its corresponding subprogram;
    an optimized program generation module, configured to select optimal subprograms from the corrected mutant programs and stitch them together to generate an optimized tensor program.
  9. The tensor program optimization device according to claim 8, wherein the program partitioning module comprises:
    an operator determination unit, configured to determine the nonlinear activation function operators in the tensor program to be optimized;
    a partitioning unit, configured to partition the tensor program to be optimized into linear tensor subprograms according to the nonlinear activation function operators.
  10. The tensor program optimization device according to claim 8, wherein the steps by which the mutation generation module generates mutant programs of the subprograms according to the preset operator set comprise:
    step 1, enumerating the tensors in the subprogram as inputs to each operator in the preset operator set;
    step 2, adding the operator's output tensor to the subprogram;
    step 3, judging whether the size of the subprogram exceeds a preset threshold; if the threshold has not been reached, executing steps 1 to 3 again; if the threshold is determined to have been reached, terminating.
  11. The tensor program optimization device according to claim 8, wherein the mutation correction module comprises:
    a reduction domain determination unit, configured to determine the reduction domains of the subprogram and its mutant programs by reduction domain propagation;
    an identification unit, configured to identify, for the overlapping region of any two reduction domains, m+1 positions at which random testing is performed to determine the non-equivalent mutant programs, where m is the number of dimensions of the subprogram's output tensor;
    a kernel generation unit, configured to generate a correction kernel from the subprogram corresponding to a non-equivalent mutant program;
    a correction unit, configured to correct the non-equivalent mutant program with the generated correction kernel.
  12. The tensor program optimization device according to claim 8, wherein the optimized program generation module comprises:
    a candidate program determination unit, configured to select K candidate subprograms for each subprogram from the corrected mutant programs by a greedy algorithm, K being a preset value;
    a selection unit, configured to select, from the K candidate subprograms of each subprogram, the candidate whose conversion overhead at the seams with the preceding and following subprograms after stitching is smallest, as the optimal subprogram of that subprogram;
    an optimized program stitching unit, configured to stitch the determined optimal subprograms together to generate the optimized tensor program.
  13. The tensor program optimization device according to claim 12, wherein the optimized program stitching unit comprises:
    a stitching unit, configured to stitch the determined optimal subprograms together to generate a stitched tensor program;
    an invertible operator elimination unit, configured to perform invertible operator elimination on the stitched tensor program;
    an optimization unit, configured to perform operator fusion, according to preprocessed weight tensors, on the stitched tensor program after invertible operator elimination, to generate the optimized tensor program, wherein the preprocessed weight tensors are determined by transforming preset weight tensors according to the computation graph of the stitched tensor program.
  14. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for executing the method according to any one of claims 1 to 7.
PCT/CN2022/105400 2021-07-13 2022-07-13 Tensor program optimization method and device WO2023284770A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110788296.8A CN113703768A (zh) 2021-07-13 2021-07-13 Tensor program optimization method and device
CN202110788296.8 2021-07-13

Publications (1)

Publication Number Publication Date
WO2023284770A1 true WO2023284770A1 (zh) 2023-01-19

Family

ID=78648499

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/105400 WO2023284770A1 (zh) 2021-07-13 2022-07-13 Tensor program optimization method and device

Country Status (2)

Country Link
CN (1) CN113703768A (zh)
WO (1) WO2023284770A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113703768A (zh) 2021-07-13 2021-11-26 清华大学 Tensor program optimization method and device
CN115130675B * 2022-09-02 2023-01-24 之江实验室 Multi-amplitude simulation method and device for quantum random circuits
CN116107669B * 2023-04-14 2023-08-18 北京大学 Operator registration method, apparatus, device and storage medium for a deep learning framework

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689121A * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Method for implementing neural network model splitting with a multi-core processor and related products
CN110968321A * 2019-10-25 2020-04-07 浙江省北大信息技术高等研究院 Tensor computation code optimization method, device, equipment and medium
CN111078395A * 2019-11-12 2020-04-28 华中科技大学 Tensor-based deep learning GPU memory management optimization method and system
CN111401537A * 2019-09-24 2020-07-10 上海寒武纪信息科技有限公司 Data processing method, device, computer equipment and storage medium
CN112579063A * 2021-03-01 2021-03-30 之江实验室 Acceleration method for exploring the optimization space in deep learning compilers
US11055639B1 * 2020-04-28 2021-07-06 Sas Institute Inc. Optimizing manufacturing processes using one or more machine learning models
CN113703768A (zh) 2021-07-13 2021-11-26 清华大学 Tensor program optimization method and device

Also Published As

Publication number Publication date
CN113703768A (zh) 2021-11-26

Similar Documents

Publication Publication Date Title
WO2023284770A1 (zh) Tensor program optimization method and device
US20210256390A1 (en) Computationally efficient neural network architecture search
US11144831B2 (en) Regularized neural network architecture search
US11544536B2 (en) Hybrid neural architecture search
US11803758B2 (en) Adversarial pretraining of machine learning models
US20200160212A1 (en) Method and system for transfer learning to random target dataset and model structure based on meta learning
US11468324B2 (en) Method and apparatus with model training and/or sequence recognition
US20160342888A1 (en) Memory efficiency for convolutional neural networks operating on graphics processing units
CN111488137B (zh) 一种基于共同注意力表征学习的代码搜索方法
WO2019143661A2 (en) Machine-learning circuit optimization using quantized prediction functions
US11347995B2 (en) Neural architecture search with weight sharing
Groh et al. Ggnn: Graph-based gpu nearest neighbor search
Salesi et al. TAGA: Tabu Asexual Genetic Algorithm embedded in a filter/filter feature selection approach for high-dimensional data
EP3828776A1 (en) Program, learning method, and learning apparatus
JP7457125B2 (ja) 翻訳方法、装置、電子機器及びコンピュータプログラム
EP3942406B1 (en) Reshape and broadcast optimizations to avoid unnecessary data movement
CN117591547A Database query method, apparatus, terminal device, and storage medium
US11410036B2 (en) Arithmetic processing apparatus, control method, and non-transitory computer-readable recording medium having stored therein control program
Lakhmiri et al. Use of static surrogates in hyperparameter optimization
JP2023544560A (ja) 文字認識における制約条件を強制するためのシステムおよび方法
US20200184328A1 (en) Accelerating artificial neural network computations by skipping input values
Dustin et al. Predictive stability criteria for penalty selection in linear models
US20230334315A1 (en) Information processing apparatus, control method of information processing apparatus, and storage medium
US20240232588A9 (en) Data processing device, data processing method, and computer-readable recording medium storing data processing program
US20240135151A1 (en) Data processing device, data processing method, and computer-readable recording medium storing data processing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22841403

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE