WO2023150912A1 - Method and device for comparing operator scheduling running time, and storage medium - Google Patents


Info

Publication number
WO2023150912A1
Authority
WO
WIPO (PCT)
Prior art keywords
cost comparison
candidate
model
schedules
scheduling
Prior art date
Application number
PCT/CN2022/075526
Other languages
English (en)
Chinese (zh)
Inventor
胡以璇
陈金林
伍文龙
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2022/075526 (WO2023150912A1)
Priority to CN202280006829.5A (CN116897356A)
Publication of WO2023150912A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks

Definitions

  • the present application relates to the field of data processing, and in particular to a method, device and storage medium for comparing scheduled running time of operators.
  • An operator is used to indicate a data processing operation.
  • a neural network usually includes a convolution operator and a pooling operator.
  • the convolution operator is used to indicate a convolution operation
  • the pooling operator is used to indicate a pooling operation.
  • the generation process of the operator's executable code is divided into two steps: calculation expression and scheduling.
  • Computational expression refers to describing the computational logic of an operator through a specific language, that is, describing the tasks that the operator needs to complete, as well as the input and output of the operator, and then converting the language that describes the computational logic of the operator into an intermediate language.
  • the operator's intermediate representation information (also called a template) can be obtained.
  • Scheduling refers to scheduling and optimizing the intermediate representation information of operators according to the hardware characteristics of the target hardware platform. Afterwards, the scheduling-optimized intermediate representation information can be converted into executable code recognizable by the target hardware platform.
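As a toy illustration of the two-step split described above (hypothetical Python stand-ins, not this application's actual intermediate representation): the compute expression fixes what is computed, while a schedule fixes how the loops execute it, and every valid schedule must produce identical output.

```python
# Toy illustration only: the same compute expression (an elementwise
# vector add) realized under two different schedules. Both produce
# identical output; only the loop structure differs.

def compute_expression(a, b):
    """Naive schedule: a single flat loop, taken directly from the
    operator's computational logic."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def tiled_schedule(a, b, tile=4):
    """An alternative schedule: the same logic with the axis split into
    an outer/inner loop pair (loop tiling), as a scheduling optimization
    for a target hardware platform might do."""
    n = len(a)
    out = [0] * n
    for io in range(0, n, tile):                   # outer axis
        for ii in range(io, min(io + tile, n)):    # inner axis
            out[ii] = a[ii] + b[ii]
    return out
```

Both functions are logically equivalent implementations of the same compute expression; a real compiler would pick among many such variants by their measured or predicted cost.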
  • Automatic operator optimization is an important function of optimization tools and compilers.
  • the difficulty of automatic operator optimization is that the optimal scheduling implementation for a specific hardware platform must be found in the scheduling space formed by a massive number of schedules; how to evaluate the execution time of different schedules of the operators in a neural network on the hardware platform is the key to the success of optimization.
  • a pre-trained cost model can be used to evaluate the absolute execution time of scheduling, so as to realize the evaluation of scheduling running cost.
  • the error between the predicted absolute execution time and the real execution time is relatively large, and professionals are required to build a dedicated cost model for a specific hardware platform, which often requires a large amount of training data and a complex model structure.
  • moreover, due to the relatively large prediction error of this method, the uncertainty of cost comparison between schedules with similar predicted values cannot be eliminated.
  • in view of this, a method, device and storage medium for comparing the scheduled running time of operators are proposed.
  • the embodiment of the present application provides a scheduling running time comparison method, device, and storage medium of an operator.
  • the relative execution times of different schedules are compared directly, so as to realize the automatic tuning function of the compiler/automatic optimizer and greatly improve the evaluation speed and accuracy of scheduling running cost.
  • the embodiment of the present application provides a method for comparing the scheduled running time of operators, the method including:
  • the target computing expression is used to describe the computing logic of the operator, and a candidate schedule is executable code of the operator on the target hardware platform, generated based on the target computing expression;
  • the cost comparison model is a model obtained by training a neural network using multiple sample schedules
  • the cost comparison model is invoked to output a cost comparison result, and the cost comparison result is used to indicate the ordering of the execution durations of the at least two candidate schedules on the target hardware platform.
  • the cost comparison result indicates the ordering of the execution durations, which realizes the automatic tuning function of the compiler/automatic optimizer and greatly improves the evaluation speed and accuracy of scheduling running cost.
  • calling the cost comparison model to output the cost comparison result includes:
  • the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.
  • the cost comparison model is trained according to at least one set of sample data sets, the high accuracy of the cost comparison model is guaranteed, and the accuracy of the cost comparison result obtained through the output of the cost comparison model is further guaranteed.
  • the preprocessing the at least two candidate schedules to obtain the preprocessed at least two candidate schedules includes:
  • the feature matrix is used to indicate at least one of loop information, input data shape information, calculation code, axis type code, and data access type code
  • the loop information includes information related to the calculation logic of the candidate schedule's loops
  • the input data shape information is used to describe the input data of the operator
  • the calculation code includes the code of the calculation instruction used in the current loop of the candidate schedule
  • the axis type code includes the type code of operations performed on an axis
  • the data access type code includes the type code of data accesses.
  • the candidate schedule is converted into a feature matrix, which may include at least one of five types of information: loop information, input data shape information, calculation code, axis type code, and data access type code, so that a feature matrix with this data structure serves as the input data of the cost comparison model, further improving the accuracy of the cost comparison result output by the model.
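The five information types above could be encoded, for illustration, roughly as follows. The field layout and all code tables here are assumptions made for the sketch, not the encoding defined by this application.

```python
# Illustrative sketch only: one possible per-loop feature row combining
# the five information types named above. The specific code values and
# field layout are hypothetical, not this application's actual encoding.

# Assumed (hypothetical) code tables.
CALC_CODES = {"add": 1, "mul": 2, "conv": 3}
AXIS_CODES = {"none": 0, "split": 1, "reorder": 2, "vectorize": 3}
ACCESS_CODES = {"read": 1, "write": 2, "read_write": 3}

def loop_to_feature_row(extent, input_shape, calc, axis_op, access):
    """Encode one loop of a candidate schedule as a fixed-length row:
    [loop extent, input dims padded to 4, calc code, axis code, access code]."""
    shape = (list(input_shape) + [0] * 4)[:4]   # pad/truncate shape to 4 dims
    return [extent] + shape + [CALC_CODES[calc],
                               AXIS_CODES[axis_op],
                               ACCESS_CODES[access]]

def schedule_to_feature_matrix(loops):
    """Stack one row per loop; the matrix is the model's input data."""
    return [loop_to_feature_row(*loop) for loop in loops]
```

For example, a convolution loop of extent 128 over a 1x3x224x224 input with a split axis and read access would become one 8-element row of the matrix.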
  • acquiring the cost comparison model further includes:
  • the training sample set includes at least one sample data group
  • at least two sample schedules are preprocessed to obtain the preprocessed at least two sample schedules
  • the original parameter model is a neural network model
  • the cost comparison model is obtained through training with an error back propagation algorithm.
  • a training sample set is obtained, and the training sample set includes at least one sample data group; for each sample data group, at least two sample schedules are preprocessed to obtain at least two preprocessed sample schedules; the at least two preprocessed sample schedules are input into the original parameter model to obtain a training result, where the original parameter model is a neural network model; the training result is compared with the correct cost comparison result to obtain a calculation loss, where the calculation loss is used to indicate the error between the training result and the correct cost comparison result; and according to the calculation losses corresponding to the at least one sample data group, the error back propagation algorithm is used to train the cost comparison model, so as to obtain the pre-trained cost comparison model for evaluating the scheduling running cost of operators, which ensures the feasibility of subsequently calling the model to realize the method for comparing the scheduled running time of operators.
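The training loop described above can be sketched minimally as follows. This is an illustrative reduction, not this application's embodiment: the "network" is a single logistic layer over the difference of two schedules' feature vectors, so the weight update below is the one-layer special case of error back-propagation, and the labels play the role of the pre-marked correct cost comparison results.

```python
import math

# Minimal sketch of the described training process (illustrative only):
# the model scores the difference of two schedules' feature vectors
# through a logistic layer trained with binary cross-entropy.

def train_cost_comparator(samples, dim, lr=0.1, epochs=200):
    """samples: (features_a, features_b, label) triples, where label = 1
    means sample schedule A was measured to run faster than schedule B."""
    w = [0.0] * dim
    for _ in range(epochs):
        for fa, fb, label in samples:
            diff = [a - b for a, b in zip(fa, fb)]               # preprocessing
            z = sum(wi * di for wi, di in zip(w, diff))          # forward pass
            p = 1.0 / (1.0 + math.exp(-z))                       # predicted P(A faster)
            grad = p - label                                     # dLoss/dz for BCE loss
            w = [wi - lr * grad * di for wi, di in zip(w, diff)]  # backward pass
    return w

def compare(w, fa, fb):
    """Cost comparison result: 1 if A is predicted faster than B, else 0."""
    z = sum(wi * (a - b) for wi, a, b in zip(w, fa, fb))
    return 1 if z > 0 else 0
```

A real embodiment would replace the single layer with a deeper network (e.g. a multi-layer perceptron over feature matrices), but the loss comparison and back-propagation structure is the same.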
  • the method further includes:
  • the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.
  • an updated training sample set is obtained by adding the at least two candidate schedules and the cost comparison result to the training sample set; the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model, so as to update the cost comparison model in time and continuously improve its accuracy.
  • the embodiment of the present application provides an operator scheduling runtime comparison device, the device includes:
  • a first acquisition unit configured to acquire at least two candidate schedules corresponding to a target calculation expression, where the target calculation expression is used to describe the calculation logic of an operator, and a candidate schedule is executable code of the operator generated based on the target calculation expression;
  • the second acquisition unit is used to acquire a cost comparison model, and the cost comparison model is a model obtained by training a neural network using multiple sample schedules;
  • the calling unit is configured to call the output of the cost comparison model according to the at least two candidate schedules to obtain a cost comparison result, and the cost comparison result is used to indicate the order of the execution durations of the at least two candidate schedules.
  • the calling unit is also used for:
  • the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.
  • the calling unit is also used for:
  • the feature matrix is used to indicate at least one of loop information, input data shape information, calculation code, axis type code, and data access type code
  • the loop information includes information related to the calculation logic of the candidate schedule's loops
  • the input data shape information is used to describe the input data of the operator
  • the calculation code includes the code of the calculation instruction used in the current loop of the candidate schedule
  • the axis type code includes the type code of operations performed on an axis
  • the data access type code includes the type code of data accesses.
  • the device further includes a training unit; the training unit is used for:
  • the training sample set includes at least one sample data group
  • at least two sample schedules are preprocessed to obtain the preprocessed at least two sample schedules
  • the original parameter model is a neural network model
  • the cost comparison model is obtained through training with an error back propagation algorithm.
  • the device further includes an update unit; the update unit is configured to:
  • the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.
  • the embodiment of the present application provides an operator scheduling runtime comparison device, the device includes:
  • a memory for storing processor-executable instructions
  • the processor is configured to implement the above method when executing the instructions.
  • the embodiment of the present application provides a non-volatile computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.
  • an embodiment of the present application provides a computer program product, and when the computer program product is run on a computer, the computer executes the above-mentioned method.
  • Fig. 1 shows a schematic diagram of a generation process of a scheduling space in the related art.
  • Fig. 2 shows a schematic diagram of the principles of the actual measurement method and the cost model method in the related art.
  • Fig. 3 shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.
  • Fig. 4 shows a flowchart of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application.
  • Fig. 5 shows a schematic diagram of the principle of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application.
  • Fig. 6 shows a flow chart of the training process of the cost comparison model provided by an exemplary embodiment of the present application.
  • Fig. 7 shows a schematic diagram of a training process of a cost comparison model provided by an exemplary embodiment of the present application.
  • Fig. 8 shows a schematic diagram of an input-output curve of a normalization function provided by an exemplary embodiment of the present application.
  • Fig. 9 shows a schematic diagram of a network structure of a multi-layer perceptron architecture provided by an exemplary embodiment of the present application.
  • Fig. 10 shows a flowchart of a method for comparing scheduled running time of operators provided by another exemplary embodiment of the present application.
  • Fig. 11 shows a schematic diagram of a data structure of a feature matrix provided by an exemplary embodiment of the present application.
  • Fig. 12 shows a schematic diagram of the application process of the cost comparison model provided by another exemplary embodiment of the present application.
  • Fig. 13 shows a block diagram of an apparatus for comparing scheduled runtimes of operators provided by an exemplary embodiment of the present application.
  • Deep learning technology establishes a deep learning model and iteratively fits a large amount of historical data (model training), so that the model can establish a mapping relationship between input and output, thereby realizing the prediction of new input data results (model reasoning).
  • the deep learning model contains a large number of operators, such as: convolution operator, fully connected operator, pooling operator, etc.
  • the whole formed by the stacking and connection of different operators constitutes a deep learning model, also known as a neural network model.
  • the topology of the neural network is called the neural network architecture; the parameters of the operators contained in the neural network are model parameters.
  • the specific hardware platform may be a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), or a neural network processing unit (Neural network Processing Unit, NPU).
  • Scheduling: there are many ways to implement the calculation expression of an operator; each such implementation is called a schedule.
  • the performance difference of different scheduling on a specific hardware platform can be very large. From a large number of scheduling implementations, using a compiler/automatic optimizer to automatically search for the optimal scheduling for specific hardware can optimize deep learning applications, thereby reducing computing power requirements and increasing system throughput.
  • there may be an intermediate expression between calculation expression and scheduling which is called a template.
  • Computation expressions can form multiple templates, and each template can generate multiple schedules.
  • Automatic optimization of operators is an important function of optimization tools and compilers.
  • the performance of operators after automatic optimization determines whether deep learning models can be efficiently applied and meet product requirements.
  • the difficulty of automatic operator optimization is that the optimal scheduling implementation for a specific hardware platform must be found in the scheduling space formed by a massive number of schedules; how to evaluate the execution time of different schedules of the operators in a neural network on the hardware platform is the key to the success of optimization.
  • the "cost" mentioned in this article refers to the execution time of a schedule on the hardware platform. To evaluate the execution time of a schedule on a hardware platform, there are currently two main methods: the actual measurement method and the cost model method.
  • the actual measurement method refers to generating code for each schedule, compiling the code, and then running it on the hardware.
  • the specific execution time is obtained by measuring the running time, which requires a complete compilation process. Its disadvantage is that evaluating a single schedule takes a long time (on the order of seconds or more), which is far too long in practical scenarios with hundreds of thousands or millions of schedules in the space; limited by search time, it is difficult to explore a larger scheduling space.
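The measurement step of the actual measurement method can be sketched as below. The preceding code generation and compilation step is toolchain-specific and is represented here by an already-callable function; this is an illustration of the method's structure, not this application's implementation.

```python
import time

# Sketch of the actual-measurement method's measurement step: run a
# compiled schedule several times and keep the best wall-clock time.
# The codegen/compile step that would precede this is elided; each
# "schedule" below is simply an already-callable function.

def measure_schedule(fn, args, repeats=10):
    """Return the best-of-N execution time of fn(*args) in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def pick_fastest(schedules, args):
    """Exhaustive actual-measurement search: measure every candidate and
    return the index of the one with the shortest execution time."""
    times = [measure_schedule(fn, args) for fn in schedules]
    return times.index(min(times))
```

The exhaustive loop in `pick_fastest` is exactly what becomes infeasible at the scale of hundreds of thousands of schedules, which motivates the model-based comparison below.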
  • the cost model method refers to evaluating the execution time of scheduling by establishing a cost model. Because this method does not need to go through the process of compiling, running and measuring, it has a very obvious advantage in terms of time-consuming evaluation.
  • in the related art, cost-model-based methods all predict the absolute execution time of a schedule in order to evaluate its running cost.
  • the error between the predicted absolute execution time and the real execution time is relatively large, and professionals are required to build a dedicated cost model for a specific hardware platform, which often requires a large amount of training data and a complex model structure.
  • the uncertainty of cost comparison between schedules with similar predicted values cannot be eliminated.
  • the above shortcomings limit the application of the cost model method in the related art in the actual optimization process.
  • the embodiment of the present application provides a scheduling running time comparison method, device, and storage medium of an operator.
  • the relative execution times of different schedules are compared directly, so as to realize the automatic tuning function of the compiler/automatic optimizer and greatly improve the evaluation speed and accuracy of scheduling running cost.
  • the operator scheduling runtime comparison method provided by the embodiment of this application has strong advantages in speed and accuracy, improves the performance of the operator optimizer and significantly reduces the evaluation time.
  • Computation expression refers to the whole composed of operator input data, output data and calculation logic.
  • a calculation expression is an instance that describes a specific calculation process.
  • the calculation expression can be user-defined, and the calculation expression is used to complete all the information of the calculation logic functions required by the user.
  • the form of a computational expression is usually pseudocode or a structured flowchart, which is easy to write but not optimized.
  • Template: computational expressions can be transformed into templates through a series of equivalent transformations.
  • the template is the intermediate representation information between the calculation expression and the scheduling during the optimization process of the calculation expression structure. Generally speaking, the template determines the order of calculation execution in the calculation expression logic and the mode of data access.
  • the template changes the calculation execution order and data access mode of the calculation expression, but does not restrict how the input data of the calculation expression is divided. For example, after a loop is transformed by axis division, a single loop can be divided into several sub-loops, and divisions into different numbers of sub-loops are different templates. In each template, the upper and lower bounds of the sub-loops only need to be jointly equivalent to the calculation expression, but the concrete values of each sub-loop's bounds are undetermined.
  • Schedule: according to the hardware characteristics of the target hardware platform, the intermediate representation information of the operator is scheduled and optimized. A schedule determines the specific expression of all variable parameters in the template and can be transformed into a software implementation of the calculation expression. For the same input data, a schedule's output data is exactly the same as the output data of the calculation expression, but the calculation process can differ.
  • Feature embedding: the intermediate output produced from the input data after passing through a neural network module.
  • Feature embedding is the mapping of the neural network module to the input data in another space, including the extraction, enhancement and encoding of the input data.
  • Multi-layer perceptron: a basic neural network unit composed of fully connected layers, activation layers, etc. Multi-layer perceptrons can form an entire neural network architecture, or appear as modules within part of an overall architecture.
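As an illustrative sketch of the definition above (layer sizes and the ReLU activation are arbitrary choices for the example, not taken from this application), a multi-layer perceptron module is just alternating fully connected and activation layers:

```python
import random

# Minimal multi-layer perceptron sketch matching the definition above:
# fully connected layers with a nonlinear activation between them.
# Sizes, initialization and activation are illustrative choices only.

class MLP:
    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # One (weight matrix, bias vector) pair per fully connected layer.
        self.layers = [
            ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])
        ]

    def forward(self, x):
        for i, (w, b) in enumerate(self.layers):
            # Fully connected layer: x -> W x + b.
            x = [sum(wij * xj for wij, xj in zip(row, x)) + bi
                 for row, bi in zip(w, b)]
            if i < len(self.layers) - 1:
                # Activation layer (ReLU) between hidden layers.
                x = [max(0.0, v) for v in x]
        return x
```

Such a module could, for instance, map a schedule's feature row to a score or embedding, which is the sense in which a perceptron "appears as a module within part of an overall architecture".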
  • the generation process of the scheduling space is shown in Figure 1.
  • the computer equipment (such as an automatic optimizer) obtains the calculation expression input by the user and transforms it to generate a template space; the templates in the template space can be transformed into schedule implementations whose logic is equivalent to the calculation expression.
  • the set of valid schedules forms the schedule space.
  • the computer device searches in the scheduling space and outputs the optimal scheduling realization.
  • the calculation is expressed as a user-defined loop calculation, and the calculation logic in the loop body is represented by a statement (stmt).
  • the upper and lower bounds of the loop are from 0 to 546756.
  • the single loop of the calculation expression can be equivalently transformed into templates of a double nested loop, a triple nested loop, up to an N-fold nested loop.
  • the upper and lower bounds of each nested loop are not yet determined; axis splits allow different plans for data access patterns.
  • the resulting code whose logic is equivalent to the calculation expression is a schedule.
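The loop example above (bounds 0 to 546756, statement `stmt`) can be rendered as a toy sketch: the nesting structure fixes a template, and choosing concrete sub-loop bounds fixes a schedule; every choice must remain equivalent to the original single loop.

```python
# Toy rendering of the example above. The nesting depth fixes a template;
# each concrete choice of sub-loop bounds (here, the tile size) fixes a
# schedule. All variants must compute the same result as the original.

N = 546756  # upper bound of the example loop

def original(stmt):
    """The calculation expression: a single loop over 0..N-1."""
    acc = 0
    for i in range(N):
        acc = stmt(acc, i)
    return acc

def double_nested(stmt, tile):
    """One template (double nested loop); each `tile` value is a schedule.
    The sub-loop bounds are chosen so the whole is equivalent to 0..N-1."""
    acc = 0
    for io in range(0, N, tile):
        for ii in range(io, min(io + tile, N)):
            acc = stmt(acc, ii)
    return acc
```

With `stmt = lambda acc, i: acc + i`, `double_nested(stmt, t)` equals `original(stmt)` for any tile size `t`, while its data access pattern (and hence hardware performance) differs per schedule.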
  • the calculation expression composed of complex calculation logic can usually derive tens of millions of schedule implementations, and the execution time of different schedules on the target hardware platform can vary by hundreds or thousands of times.
  • the automatic optimization system uses a series of operations to search, in the search space formed by massive numbers of schedules, for the schedule that executes optimally on the hardware, so as to optimize operators.
  • the actual measurement method generates legal code according to the definition of scheduling, compiles it with a compiler, executes and measures it on the hardware, and obtains performance evaluation results.
  • the result is usually the execution time of the schedule, and can also be a count of hardware clock cycles required to run.
  • the scheduling with the shortest execution time (smallest hardware clock cycle count) is finally selected.
  • the execution time obtained in this way is the most accurate and true.
  • the disadvantage is that the code generation and compilation process usually takes several seconds to several minutes to complete; the time consumed by running and measurement depends on the computation amount and complexity of the operator, so the overall optimization process is very slow.
  • search and selection based on a machine learning model can greatly accelerate the above process, shortening the seconds-long code generation, compilation and running to a milliseconds-long neural network inference. At the same time, limited by the accuracy of the model's predictions, the quality of the selected optimum may decline.
  • the current cost model is as described above. After feature extraction is performed on the schedule, the absolute execution time or number of running cycles of the schedule is predicted by calling the cost model.
  • the accuracy of the cost model in this method is low: the cost model in the related art has an average error of 16% in predicting the scheduling execution time, while the actual execution time difference between many schedules is less than this 16% error value. Besides the error, the cost of obtaining the cost model is high: training requires 1.8 million training samples, the network architecture is highly complex, and training takes a long time to converge. Therefore, this method cannot well achieve the purpose of fast and accurate operator search optimization.
  • the embodiment of the present application provides a new cost model: a cost comparison model.
  • the cost comparison model avoids directly predicting the scheduling execution time, transforming the regression problem into a classification problem that is easy for a neural network to learn.
  • the cost comparison model takes at least two candidate schedules as input, and the output result is a cost comparison result, and the cost comparison result is used to indicate the order of the execution time of the at least two candidate schedules on the target hardware platform.
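For illustration, a comparator-style output of this kind can drive search without any absolute-time prediction: the model is used purely as an ordering oracle. In the sketch below, `faster(a, b)` stands in for an invocation of a trained cost comparison model; it is a hypothetical placeholder, not this application's API.

```python
import functools

# Sketch: using a pairwise cost-comparison result to select and rank
# schedules. `faster(a, b)` stands in for a trained cost-comparison
# model call returning True if schedule a runs faster than schedule b.

def select_best(schedules, faster):
    """Single-pass tournament: keep the winner of each pairwise comparison."""
    best = schedules[0]
    for cand in schedules[1:]:
        if faster(cand, best):
            best = cand
    return best

def rank(schedules, faster):
    """A full ordering by execution time, induced purely by pairwise calls."""
    cmp = lambda a, b: -1 if faster(a, b) else 1
    return sorted(schedules, key=functools.cmp_to_key(cmp))
```

Note that `select_best` needs only n-1 model invocations for n candidates, which is what makes large-scale search over the scheduling space tractable.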
  • the method provided by the embodiment of the present application has the advantages of high accuracy, fast inference speed, and lower training cost than existing methods. In the operator optimization process, the cost comparison model provided by the embodiment of the present application can quickly compare the execution time of different schedules, thereby realizing large-scale search optimization of operators.
  • the operator scheduling runtime comparison method provided in the embodiment of the present application can be applied to the optimization process of the automatic operator optimization system.
  • the core content of the embodiment of the present application is the cost comparison model, including the model architecture, model training process and model application process of the cost comparison model.
  • the operator scheduling runtime comparison method provided in the embodiment of the present application can be applied to a specific computer device (such as a CPU, GPU or NPU), and performs large-scale comparison and search over multiple candidate schedule implementations of the target computing expression, so as to obtain the optimal schedule and achieve the purpose of optimizing the target computing expression on the specific computer device.
  • the execution subject of the method for comparing the scheduled running time of operators provided in the embodiment of the present application is a computer device, which may be a general-purpose computer device or a special-purpose computing device. Please refer to FIG. 3 , which shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.
  • the computer device may be a terminal or a server.
  • Terminals include tablet computers, laptop computers, and desktop computers, among others.
  • the server can be one server, or a server cluster composed of several servers, or a cloud computing service center.
  • the computer device includes a processor 10 , a memory 20 and a communication interface 30 .
  • the structure shown in FIG. 3 does not constitute a limitation on the computer device, which may include more or fewer components than illustrated, combine some components, or adopt a different arrangement of components. Specifically:
  • the processor 10 is the control center of the computer device, and uses various interfaces and lines to connect the various parts of the entire device; by running or executing software programs and/or modules stored in the memory 20 and calling data stored in the memory 20, it performs the various functions of the device and processes data, thereby controlling the device as a whole.
  • the processor 10 may be implemented by a CPU, or may be implemented by a GPU.
  • the memory 20 can be used to store software programs as well as modules.
  • the processor 10 executes various functional applications and data processing by executing software programs and modules stored in the memory 20 .
  • the memory 20 can mainly include a program storage area and a data storage area, where the program storage area can store an operating system 21, a first acquisition unit 22, a second acquisition unit 23, a calling unit 24 and at least one application program 25 required by a function (such as neural network training); the data storage area can store data created according to the use of the computer device, etc.
  • the memory 20 can be realized by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
  • the memory 20 may also include a memory controller to provide the processor 10 with access to the memory 20 .
  • the processor 10 executes the following function by running the first acquisition unit 22: acquiring at least two candidate schedules corresponding to the target calculation expression, where the target calculation expression is used to describe the calculation logic of the operator, and a candidate schedule is executable code of the operator generated based on the target calculation expression. The processor 10 performs the following function through the second acquisition unit 23: acquiring a cost comparison model, which is a model obtained by training a neural network using multiple sample schedules. The processor 10 performs the following function through the calling unit 24: according to the at least two candidate schedules, calling the cost comparison model to output a cost comparison result, where the cost comparison result is used to indicate the ordering of the execution durations of the at least two candidate schedules.
• the computer device obtains the calculation expression code input by the user, that is, the target calculation expression, analyzes the target calculation expression through the operator optimization system, generates a template space based on optimization rules or polyhedron models, and generates a large number of legal candidate schedules; the generated candidate schedules form a scheduling space.
• an instance in the scheduling space represents one legal schedule.
• the cost comparison model provided by the embodiments of this application serves as an evaluation module that compares at least two input candidate schedules and outputs a cost comparison result, so as to achieve the goal of searching the scheduling space for the optimal schedule.
  • FIG. 4 shows a flowchart of a method for comparing scheduled running time of operators provided by an exemplary embodiment of the present application. This embodiment is described by taking the method for comparing the scheduling running time of the operator applied to the computer device shown in FIG. 3 as an example.
• the method for comparing the scheduling running time of an operator includes:
• Step 401: acquire at least two candidate schedules corresponding to the target computing expression.
  • the target computing expression is used to describe the computing logic of the operator.
  • the candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.
  • the computer device acquires at least two candidate schedules from the schedule space corresponding to the target computation expression.
  • the computer device obtains the input target calculation expression, analyzes the target calculation expression, generates a template space according to a preset method, generates multiple candidate schedules by instantiating the template, and the generated multiple candidate schedules constitute a scheduling space.
  • the computer device obtains at least two candidate schedules from the schedule space.
• the preset method is a dynamic programming method, an optimization rule method, or a polyhedron model method.
• in different scenarios, the preset methods may also differ.
• the embodiment of the present application does not limit the algorithm that generates the scheduling space; it only requires that the scheduling space contain at least two candidate schedules to be compared.
  • the target calculation expression is a specific calculation expression, for example, the target calculation expression is an input calculation expression.
• the candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.
• the target hardware platform is a CPU, a GPU, or an NPU, which is not limited in this embodiment of the present application.
• when the computer device receives a preset acquisition instruction, it acquires at least two candidate schedules corresponding to the target computing expression. Alternatively, the computer device acquires at least two candidate schedules corresponding to the target computing expression at every preset time interval. Alternatively, the computer device acquires at least two candidate schedules corresponding to the target computing expression in real time.
  • the preset time interval is a default setting or a custom setting, which is not limited in this embodiment.
• Step 402: acquire a cost comparison model, which is a model obtained by training a neural network using multiple sample schedules.
  • the computer device obtains the trained cost comparison model.
  • the terminal obtains a trained cost comparison model stored by itself, or obtains a trained cost comparison model from a server.
  • the server obtains a trained cost comparison model stored in itself.
• the cost comparison model is a model obtained by training the neural network using at least two sample schedules and the correct cost comparison result; that is, the cost comparison model is determined according to at least two sample schedules and the correct cost comparison result.
• the correct cost comparison result is a pre-labeled correct cost comparison result corresponding to the at least two sample schedules.
• the neural network of the cost comparison model can adopt an end-to-end stacked multi-layer perceptron architecture.
• other reasonable variant architectures can also achieve the fitting function of the cost comparison model, although different architectures will affect the final accuracy of the model.
• any network architecture formed by deformation, derivation, or layer replacement of this architecture should be regarded as equivalent to the neural network described in the embodiments of this application.
  • the neural network is a deep neural network (Deep Neural Network, DNN).
  • the neural network is a Convolutional Neural Network (CNN).
  • the neural network is a Recurrent Neural Network (RNN). This embodiment of the present application does not limit it.
  • the cost comparison model is a neural network model that identifies relative execution times of at least two candidate schedules on the target hardware platform.
  • the cost comparison model is used to convert the input at least two candidate schedules into cost comparison results.
  • the cost comparison result is used to indicate the ranking of the execution durations of the at least two candidate schedules on the target hardware platform.
  • the cost comparison model is used to represent the correlation between at least two candidate schedules and the cost comparison results.
  • the cost comparison model is a preset mathematical model, and the cost comparison model includes model coefficients between at least two candidate schedules and cost comparison results.
  • the model coefficient can be a fixed value, or a value that is dynamically modified over time, or a value that is dynamically modified according to a usage scenario.
• Step 403: according to the at least two candidate schedules, invoke the cost comparison model to output a cost comparison result, which is used to indicate the ranking of the execution durations of the at least two candidate schedules on the target hardware platform.
• the computer device preprocesses the at least two candidate schedules to obtain at least two preprocessed candidate schedules, inputs the preprocessed candidate schedules into the cost comparison model, and outputs the cost comparison result.
  • the cost comparison result is used to indicate the order of the execution durations of at least two candidate schedules on the target hardware platform. That is, the cost comparison result does not indicate the absolute execution time of the at least two candidate schedules on the target hardware platform, but indicates the relative size of the execution time of the at least two candidate schedules on the target hardware platform.
  • the cost comparison result is coding information of a comparison result of the predicted execution durations of at least two candidate schedules.
  • the computer device decodes the coded information output by the cost comparison model, and obtains the order of the execution durations of at least two candidate schedules, that is, the comparison result.
• the cost comparison result includes encoded information, and the values of the encoded information correspond one-to-one with the execution duration comparison results of the at least two candidate schedules.
• when the encoded information is a first value, it indicates that the execution duration of the first candidate schedule is less than that of the second candidate schedule; when the encoded information is a second value, it indicates that the execution duration of the first candidate schedule is equal to that of the second candidate schedule; when the encoded information is a third value, it indicates that the execution duration of the first candidate schedule is greater than that of the second candidate schedule, where the first, second, and third values are all different.
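The three-valued encoding described above can be sketched as a small decoder. The concrete values 0, 1, and 2 below are purely illustrative assumptions; the embodiment only requires the three values to be distinct.

```python
# Hypothetical sketch of decoding the cost comparison model's encoded output
# into a relative-duration verdict. The value-to-meaning mapping mirrors the
# first/second/third values described in the text; the integers are assumed.
def decode_cost_comparison(code):
    """Decode encoded information into an execution-duration comparison."""
    mapping = {
        0: "first < second",   # first value: first schedule runs faster
        1: "first == second",  # second value: equal execution durations
        2: "first > second",   # third value: first schedule runs slower
    }
    return mapping[code]
```

The optimizer would then keep or discard schedules according to the decoded verdict.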
  • the computer device selects the candidate schedule with the shortest execution time among the at least two candidate schedules as the target schedule according to the cost comparison results of the at least two candidate schedules, retains the target schedule, and discards other candidate schedules except the target schedule.
• the computer device takes any one of the at least two candidate schedules as the target schedule, retains the target schedule, and discards the other candidate schedules.
  • the embodiment of the present application does not limit the method of retaining and discarding the scheduling.
  • the computer device obtains the input target computing expression, analyzes the target computing expression, generates a template space according to a preset method, generates multiple candidate schedules by instantiating the template, and generates A plurality of candidate schedules constitute a scheduling space.
  • Preprocess the schedule A and schedule B to obtain the preprocessed schedule A and schedule B; input the preprocessed schedule A and schedule B into the cost comparison model to output encoded information, and decode the encoded information to obtain schedule A Compare the result with the cost of scheduling B.
• when the encoded information is 001, it indicates that the execution duration of schedule A is less than that of schedule B, so schedule A is retained and schedule B is discarded; when the encoded information is 010, it indicates that the execution duration of schedule A equals that of schedule B, so either schedule A or schedule B is retained; when the encoded information is 100, it indicates that the execution duration of schedule A is greater than that of schedule B, so schedule B is retained and schedule A is discarded.
• in summary, the embodiment of the present application obtains at least two candidate schedules corresponding to the target computing expression and, by invoking the cost comparison model on the at least two candidate schedules, directly compares their relative execution durations on the target hardware platform and outputs a cost comparison result indicating the ranking of the execution durations. This enables the automatic tuning function of a compiler/automatic optimizer and greatly improves the speed and accuracy of evaluating scheduling running costs.
• before the computer device acquires the cost comparison model, the cost comparison model needs to be obtained by training on a training sample set.
  • the training process of the cost comparison model is introduced below.
  • the training process of the cost comparison model includes the following steps:
• Step 601: obtain a training sample set, which includes at least one set of sample data groups.
  • the cost comparison model is trained according to at least one set of sample data sets, and each set of sample data sets includes: at least two sample schedules corresponding to sample calculation expressions and pre-marked correct cost comparison results.
• Step 602: for each set of sample data groups, preprocess the at least two sample schedules to obtain at least two preprocessed sample schedules.
• for each sample data group, the computer device performs feature extraction on each of the at least two sample schedules to obtain a feature matrix, and normalizes the feature matrix corresponding to each sample schedule to obtain a preprocessed sample schedule.
  • feature extraction is the process of extracting features from sample schedules and converting features into structured data.
• Step 603: input the preprocessed at least two sample schedules into the original parameter model to obtain a training result, where the original parameter model is a neural network model.
  • the original parameter model is established according to the neural network model, for example: the original parameter model is established according to the DNN model.
• the computer device creates an input-output pair corresponding to the set of sample data groups, where the input parameters of the input-output pair are the at least two sample schedules in the set and the target parameter is the correct cost comparison result in the set; the computer device inputs the input parameters into the original parameter model to obtain the training result.
  • input-output pairs are represented by feature vectors.
• Step 604: compare the training result with the correct cost comparison result to obtain a calculation loss, which is used to indicate the error between the training result and the correct cost comparison result.
  • the training result is the coding information output by the original parameter model, and the correct cost comparison result is the pre-marked coding information.
  • the encoded information is information encoded by one-hot code (One-Hot).
  • the calculation loss is represented by cross entropy.
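The cross-entropy loss described above can be sketched as follows. This is an illustrative stand-in, not the patent's exact implementation: the training result is taken as a probability vector over the three comparison classes and the correct result as a one-hot label.

```python
import numpy as np

# Illustrative sketch: cross entropy between the model's predicted class
# probabilities and the pre-labeled one-hot correct cost comparison result.
def cross_entropy(predicted, one_hot_target, eps=1e-12):
    predicted = np.clip(predicted, eps, 1.0)  # avoid log(0)
    return float(-np.sum(one_hot_target * np.log(predicted)))

# A confident correct prediction yields a small loss; a wrong one is larger.
good = cross_entropy(np.array([0.9, 0.05, 0.05]), np.array([1.0, 0.0, 0.0]))
bad = cross_entropy(np.array([0.05, 0.05, 0.9]), np.array([1.0, 0.0, 0.0]))
```

The loss drives the backpropagation update described in Step 605.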
• Step 605: according to the calculation losses corresponding to the at least one set of sample data groups, train with the error backpropagation algorithm to obtain the cost comparison model.
  • the computer device determines the gradient direction of the cost comparison model according to the calculation loss through the back propagation algorithm, and updates the model parameters in the cost comparison model layer by layer from the output layer of the cost comparison model.
• illustratively, the computer device extracts two schedules, A and B, from the scheduling space as input data for cost comparison model training, compares the relative execution durations of the two schedules on the target hardware platform, and uses one-hot encoding to generate the encoded information for the input (A, B) (that is, the correct cost comparison result) as the target parameter of the backpropagation algorithm.
• the encoded information is shown in Table 1.
• when the encoded information is the first value, it indicates that the execution duration of schedule A is less than that of schedule B; when it is the second value, it indicates that the execution duration of schedule A equals that of schedule B; when it is the third value, it indicates that the execution duration of schedule A is greater than that of schedule B.
  • the computer equipment performs feature extraction on schedule A and schedule B to obtain respective corresponding feature matrices.
  • the feature matrices of scheduling A and scheduling B are two 250x57-dimensional matrices.
  • Part of the column data in the feature matrix is normalized to limit its dynamic range.
  • the formula of the normalization function is as follows:
  • v is the input data
  • v* is the output data.
• schematically, the input-output curve of the normalization function is shown in FIG. 8.
  • the abscissa is the above-mentioned input data
  • the ordinate is the above-mentioned output data.
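The specific normalization formula is given in the original disclosure and not reproduced here. Purely as an illustration, the sketch below uses a symmetric logarithmic mapping, which matches the described behavior (monotone, sign-preserving, limiting the dynamic range of large inputs) but is an assumption, not the patent's actual function.

```python
import numpy as np

# Hypothetical stand-in for the normalization function applied to part of
# the feature matrix columns. The symmetric-log form is an assumption chosen
# only because it limits dynamic range as described; the real formula differs.
def normalize(v):
    v = np.asarray(v, dtype=float)
    return np.sign(v) * np.log1p(np.abs(v))

out = normalize([0.0, 10.0, -10.0, 1000.0])
```

Large magnitudes are compressed while zero maps to zero, giving the bounded-slope curve shape described for FIG. 8.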
• the computer device inputs the normalized schedule A into feature embedding module A composed of a multi-layer perceptron, that is, DNN_A, which outputs a 1x512-dimensional schedule embedding A, and inputs the normalized schedule B into feature embedding module B composed of a multi-layer perceptron, that is, DNN_B, which outputs a 1x512-dimensional schedule embedding B.
  • the two schedule embeddings are bitwise subtracted, that is, the schedule embedding A is subtracted from the schedule embedding B to obtain a schedule difference embedding.
  • the mean square error loss function (or minimum square error function) is used as the loss function to calculate the calculation loss of the model for the current input.
• the calculation loss is backpropagated through gradient descent, and the parameters of the neural-network-based modules DNN_A, DNN_B, and DNN_CLS are updated. The above steps are repeated to train on the training sample set for multiple epochs (for example, 30 epochs) until the model converges.
• the network structure of DNN_A, DNN_B, and DNN_CLS may be an end-to-end stacked multi-layer perceptron architecture, as shown in FIG. 9, where each number represents the number of neurons in a layer and the ReLU function is used as the activation function between fully connected layers.
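The two-branch forward pass described above (embed each schedule, subtract the embeddings bitwise, classify the difference) can be sketched in miniature. The single-matrix "modules" below stand in for DNN_A, DNN_B, and DNN_CLS; the weights are random placeholders, not trained parameters, and only the shapes (250x57 feature matrix, 1x512 embedding, 3 output classes) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder weights; real DNN_A/DNN_B/DNN_CLS are stacked MLPs.
W_embed_a = rng.standard_normal((250 * 57, 512)) * 0.01  # DNN_A stand-in
W_embed_b = rng.standard_normal((250 * 57, 512)) * 0.01  # DNN_B stand-in
W_cls = rng.standard_normal((512, 3)) * 0.01             # DNN_CLS stand-in

def forward(feat_a, feat_b):
    # Embed each 250x57 feature matrix into a 1x512 schedule embedding (ReLU).
    emb_a = np.maximum(feat_a.reshape(1, -1) @ W_embed_a, 0)
    emb_b = np.maximum(feat_b.reshape(1, -1) @ W_embed_b, 0)
    diff = emb_a - emb_b            # bitwise-subtracted schedule difference embedding
    logits = diff @ W_cls           # 1x3 comparison scores
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()          # probabilities over the 3 comparison classes

probs = forward(rng.standard_normal((250, 57)), rng.standard_normal((250, 57)))
```

The argmax of the 3-way output then selects the predicted encoded comparison result.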
  • FIG. 10 shows a flowchart of a method for comparing scheduling runtimes of operators provided by another exemplary embodiment of the present application. This embodiment is described by taking the method for comparing the scheduling running time of the operator applied to the computer device shown in FIG. 3 as an example.
• the method for comparing the scheduling running time of an operator includes:
• Step 1001: obtain at least two candidate schedules from the scheduling space corresponding to the target computing expression.
  • the target computing expression is used to describe the computing logic of the operator
  • the candidate schedule is the executable code of the operator generated based on the target computing expression on the target hardware platform.
  • the computer device acquires an input target computing expression, analyzes the target computing expression, generates a template according to a preset method, and determines a scheduling space, where the scheduling space includes at least two candidate schedulings generated by instantiating the template.
  • the computer device acquires at least two candidate schedules from the schedule space corresponding to the target computation expression.
  • the scheduling space includes n candidate schedulings.
• in one approach, pairwise comparison with keep-the-winner is adopted, so n-1 comparisons yield the optimal target schedule.
• alternatively, a binary grouping method can be chosen for comparison. For example, when n is 8, the 8 schedules are divided in pairs into 4 groups, and the cost comparison model selects the 4 fastest candidate schedules from the 4 groups; these are regrouped into 2 pairs, requiring 2 comparisons, after which the 2 optimal candidate schedules are retained for a final comparison, thereby obtaining the optimal target schedule among the 8 candidate schedules.
  • the embodiment of the present application does not limit the scheduling group comparison method.
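The keep-the-winner elimination described above can be sketched as a simple reduction: with a comparator that returns the faster of two schedules, n-1 comparisons leave only the optimal one. Here `compare` is a stand-in for a call into the cost comparison model, and the (name, time) tuples are illustrative.

```python
# Sketch of pairwise elimination over n candidate schedules: n-1 calls to
# the comparator, keeping the winner and discarding the loser each time.
def select_best(schedules, compare):
    best = schedules[0]
    for candidate in schedules[1:]:
        best = compare(best, candidate)  # keep winner, discard loser
    return best

# Toy usage: schedules represented by (name, execution_time) pairs, with a
# ground-truth comparator standing in for the cost comparison model.
faster = lambda a, b: a if a[1] <= b[1] else b
best = select_best([("s1", 17), ("s2", 15), ("s3", 21)], faster)
```

The binary-grouping variant differs only in the order of comparisons, not in their total count.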
• Step 1002: for each of the at least two candidate schedules, perform feature extraction on the candidate schedule to obtain a feature matrix.
• the computer device extracts multiple types of information from each of the m loops of the candidate schedule and combines them into a vector, forming the feature matrix corresponding to the candidate schedule, where m is a positive integer.
  • the combined vector size is 1x57.
  • a maximum of 250 loops of information are supported, and finally assembled into a 250x57 two-dimensional feature matrix.
  • the number of supported loops can vary according to actual needs, which is not limited in this embodiment of the present application.
• the feature matrix is used to indicate at least one of loop information, input data shape information, calculation encoding, axis type encoding, and data access type encoding.
• the loop information includes information related to the loop calculation logic of the candidate schedule.
• the loop information is the loop information of one level in the schedule; for example, the size of the loop information is 1x6.
• the loop information includes at least one of: loop depth, nesting level, block number, a flag indicating whether it is the last loop, the quotient of the loop depth, and the remainder of the loop depth, where the loop depth and the quotient of the loop depth need to be normalized.
  • the input data shape information is used to describe the input data of the operator.
  • the size of the input data shape information is 1x10.
  • the operator is a single-input operator, a double-input operator, or a multi-input operator.
  • the shape information of the input data includes: shape information corresponding to k input data, k is a positive integer, and the shape information includes at least one of batch size, number of channels, height, width, and minimum number of channels.
  • the computation encoding includes the encoding of the computation instruction used in the current cycle of the candidate schedule.
  • the size of the calculation code is 1x6.
  • the computing code includes: at least one of memory access types, program instructions, data types, storage units, and identifiers for indicating whether to use double buffering.
  • Axis type encodings include encodings for the types of operations on the axes.
  • the size of the axis type code is 1x15.
• the axis type encoding is used to indicate at least one operation on the axis, such as extension or normalization.
  • the data access type encoding includes the type encoding of the access to the data.
• the size of the data access type encoding is 1x20.
  • the data access type code is used to indicate at least one access among write data, read data, allocation, and pragma.
• feature extraction is performed on the candidate schedules to obtain feature matrices; the data structure of the feature matrix is shown in FIG. 11. Multiple types of information are extracted from each loop of a candidate schedule and combined into a vector of size 1x57, and information for up to 250 loops is supported.
  • the feature matrix is used to indicate Loop information, input data shape information, calculation encoding, axis type encoding and data access type encoding
  • the size of loop information is 1x6
  • the size of input data shape information is 1x10
  • the size of calculation encoding is 1x6
  • the size of axis type encoding is 1x15
  • the size of the data access type encoding is 1x20.
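The per-loop layout above (6 + 10 + 6 + 15 + 20 = 57 columns, up to 250 loops) can be sketched as the assembly of the feature matrix. Zero-padding the unused rows is an assumption for illustration; the patent does not specify how rows beyond the actual loop count are filled.

```python
import numpy as np

# Sketch: assemble the 250x57 feature matrix from per-loop 1x57 vectors
# (6 loop-info + 10 shape + 6 calculation + 15 axis-type + 20 data-access
# columns). Rows beyond the actual loop count are zero-padded (assumption).
MAX_LOOPS, VECTOR_SIZE = 250, 57

def build_feature_matrix(per_loop_vectors):
    mat = np.zeros((MAX_LOOPS, VECTOR_SIZE))
    for i, vec in enumerate(per_loop_vectors[:MAX_LOOPS]):  # cap at 250 loops
        mat[i, :] = vec
    return mat

# Toy usage: a schedule with 3 loops, each contributing a dummy 1x57 vector.
feat = build_feature_matrix([np.ones(57) for _ in range(3)])
```

The resulting matrix is what the normalization step and the embedding modules consume.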
• Step 1003: for each of the at least two candidate schedules, normalize the feature matrix corresponding to the candidate schedule to obtain a preprocessed candidate schedule.
• Step 1004: input the preprocessed at least two candidate schedules into the trained cost comparison model and output the cost comparison result, which is used to indicate the ranking of the execution durations of the at least two candidate schedules on the target hardware platform.
• the computer device acquires a trained cost comparison model, which is a model obtained by training a neural network using multiple sample schedules.
• the computer device inputs the preprocessed at least two candidate schedules into the trained cost comparison model and outputs a cost comparison result, which is used to indicate the ranking of the execution durations of the at least two candidate schedules on the target hardware platform.
• the computer device adds the at least two candidate schedules and the cost comparison result to the training sample set to obtain an updated training sample set, and trains the cost comparison model according to the updated training sample set to obtain an updated cost comparison model.
  • the computer equipment extracts two schedules A and B from the schedule space, and extracts features from the schedule A and schedule B to obtain their corresponding feature matrices.
  • the feature matrices of scheduling A and scheduling B are two 250x57-dimensional matrices. Part of the column data in the feature matrix is normalized to limit its dynamic range. The way of normalization can be compared with the description of normalization in the above model training process, and will not be repeated here.
• the computer device inputs the normalized schedule A into feature embedding module A composed of a multi-layer perceptron, that is, DNN_A, which outputs a 1x512-dimensional schedule embedding A, and inputs the normalized schedule B into feature embedding module B, that is, DNN_B, which outputs a 1x512-dimensional schedule embedding B.
  • the two scheduling embeddings are bitwise subtracted, that is, the scheduling embedding A is subtracted from the scheduling embedding B to obtain the scheduling difference embedding.
  • DNN_A, DNN_B, and DNN_CLS can refer to the relevant description in the above model training process by analogy, and will not be repeated here.
  • the computer equipment converts the outputted three-digit encoded information into a one-hot coded label format.
• the embodiment of the present application also performs feature extraction on the at least two candidate schedules, mapping each schedule to its unique corresponding matrix expression to obtain the feature matrix expressions of the at least two candidate schedules, and normalizes the two feature matrix expressions. The deep-neural-network-based cost comparison model takes the at least two preprocessed feature matrices as input and outputs the encoded information of the predicted execution duration comparison result of the at least two candidate schedules. The encoded information output by the cost comparison model is then decoded to obtain the execution duration comparison result of the at least two candidate schedules. That is, the execution durations of different schedule implementations of the same calculation expression on a specific hardware platform are compared through a deep learning network model, replacing the process of compiling each schedule implementation and then running and measuring it on hardware, which solves the problem of slow large-scale search in operator automatic optimization systems such as automatic optimizers/compilers.
• the cost comparison model is implemented with the goal of predicting which operator schedule executes faster.
• illustratively, the training sample set includes 20792 schedules from 32 operators, and each operator contains different schedules. Schedules belonging to the same operator are paired to form a training instance set; after pairing, the execution durations of the two schedules are compared, and the target of each paired training instance is generated according to the related method described above.
• for example, schedule A and schedule B are extracted; the actual execution time of schedule A is 15 seconds and that of schedule B is 17 seconds, so (A, B) is a training instance, and since 15 seconds is less than 17 seconds, the target encoding of the training instance (A, B) is 001.
• the pairwise combination of schedules belonging to the same operator may include the combination of a schedule with itself in the training sample set, and the target encoding of such a training instance is 010. The combination is order-sensitive: for example, the (A, B) combination differs from the (B, A) combination, and if the execution durations of A and B differ, the target encodings of (A, B) and (B, A) also differ. If an operator contains N (N>2) schedules, pairwise combination can form N-squared training instances. With this combination approach, a relatively large training data set can be built even when the amount of training data is relatively limited.
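The ordered, self-inclusive pairing described above can be sketched directly. The (name, time) inputs and the 001/010/100 string labels follow the example in the text; the helper itself is illustrative.

```python
from itertools import product

# Sketch: build ordered training pairs for one operator's schedules from
# their measured execution times. Self-pairs are included, giving N^2
# instances, with one-hot targets 001/010/100 for less/equal/greater.
def make_training_pairs(times):
    pairs = []
    for (na, ta), (nb, tb) in product(times.items(), repeat=2):
        if ta < tb:
            target = "001"
        elif ta == tb:
            target = "010"
        else:
            target = "100"
        pairs.append(((na, nb), target))
    return pairs

# Toy usage with the 15s/17s example: 2 schedules yield 2^2 = 4 instances.
pairs = make_training_pairs({"A": 15, "B": 17})
```

Because the pairing is order-sensitive, (A, B) and (B, A) carry different targets whenever A and B differ in duration.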
• the neural network model adopts batch training: 5000 training instances are input per iteration, the learning rate is set to 10e-8, and momentum stochastic gradient descent is used to train on the complete training instance set for multiple epochs (for example, 30 epochs).
  • the test set includes 46022 test instances, and each test instance is composed of two schedules belonging to the same operator. Any schedule used to generate test instances is not included in the schedule set for generating training instances.
  • the test target code is generated by the above-mentioned related method for the test instance.
• if the prediction result output by the network, after the argmax function, completely matches the test target encoding, the test instance is recorded as correctly predicted by the network.
• accuracy is defined as the number of test instances correctly predicted by the network divided by the total number of test instances. Tested on 46022 test instances, the method correctly predicts 41242, an accuracy rate of 89.61%.
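The accuracy metric above can be sketched as an argmax match between the network's 3-way outputs and the one-hot test targets. The prediction and target arrays below are toy values for illustration.

```python
import numpy as np

# Sketch: a test instance counts as correct when the argmax of the network's
# 3-class output equals the argmax of its one-hot target encoding; accuracy
# is the fraction of correct instances.
def accuracy(predictions, targets):
    pred_cls = np.argmax(predictions, axis=1)
    true_cls = np.argmax(targets, axis=1)
    return float(np.mean(pred_cls == true_cls))

# Toy usage: 4 test instances, 3 predicted correctly.
acc = accuracy(
    np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]),
    np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]]),
)
```

Applied to the reported figures, 41242 correct out of 46022 gives the stated 89.61%.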
• the embodiment of the present application provides a method for comparing the scheduling running time of operators, which adopts the idea of cost comparison to determine the comparison result of the relative execution durations of at least two schedules and applies the cost comparison model to the optimization process of an operator automatic optimization system. It also involves a modeling method of the cost comparison model applicable to the operator automatic optimization system, covering model architecture design, model training, and the model inference application process. In model training and model inference, a schedule can be converted into a special data structure through feature extraction; together with the normalization of data and the design of the output format, this yields high accuracy, fast inference, and a lower required training cost than existing methods.
• on one hand, the high accuracy of the cost comparison model is guaranteed; on another, the inference speed of the cost comparison model is improved, taking only 3 milliseconds to compare a set of instances; furthermore, the amount of data and computing power required to train the cost comparison model are relatively small: 30 epochs of training on more than 49 million training instances are completed in 70 hours on a single GPU card. With the cost comparison model, automatic tuning in a code optimizer/compiler only needs to consider how to improve the accuracy of the cost comparison model.
• in contrast, compared with cost models in the related art that predict the absolute execution time of a schedule, in addition to the accuracy of model prediction, such cost models also need to consider how to handle boundary problems caused by errors. For example, if the difference between the predicted running times of two schedules is smaller than the model's prediction error, the absolute-value model cannot give a high-confidence prediction.
  • FIG. 13 shows a block diagram of an apparatus for comparing scheduled runtimes of operators provided by an exemplary embodiment of the present application.
  • the apparatus can be implemented as all or a part of the computer equipment provided in FIG. 3 through software, hardware or a combination of the two.
  • the apparatus may include: a first obtaining unit 1310 , a second obtaining unit 1320 and a calling unit 1330 .
  • the first acquisition unit 1310 is configured to acquire at least two candidate schedules corresponding to the target calculation expression, the target calculation expression is used to describe the calculation logic of the operator, and the candidate schedule is the executable code of the operator generated based on the target calculation expression;
• the second acquisition unit 1320 is configured to acquire a cost comparison model, where the cost comparison model is a model obtained by training a neural network using multiple sample schedules;
  • the calling unit 1330 is configured to call the cost comparison model according to the at least two candidate schedules to output a cost comparison result, where the cost comparison result is used to indicate the ordering of the execution durations of the at least two candidate schedules.
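The three units above can be sketched as a pairwise comparator: given the feature vectors of two candidate schedules, a scoring model is called and the sign of the score gap indicates the predicted ordering of execution durations. This is only an illustrative sketch under assumed names; the application's actual model is a trained neural network, which is stood in for here by a simple linear scorer.

```python
# Hedged sketch of the calling unit: a pairwise cost comparison.
# `score` is a stand-in for the neural network; weights and feature
# layouts are illustrative assumptions, not the patented model.

def score(weights, features):
    # Linear scorer standing in for the cost comparison network.
    return sum(w * f for w, f in zip(weights, features))

def compare_schedules(weights, feat_a, feat_b):
    """Return -1 if schedule A is predicted faster, 1 if B is, 0 if tied."""
    sa, sb = score(weights, feat_a), score(weights, feat_b)
    if sa < sb:
        return -1
    if sa > sb:
        return 1
    return 0

# Two candidate schedules of the same operator, described by toy
# features (e.g. loop extent, tile size); lower score = shorter runtime.
weights = [0.5, 1.2]
fast = [1.0, 0.5]
slow = [4.0, 2.0]
print(compare_schedules(weights, fast, slow))  # -1: first schedule predicted faster
```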
  • the calling unit 1330 is also used to:
  • the cost comparison model is trained according to at least one set of sample data sets, where each set of sample data sets includes: at least two sample schedules corresponding to a sample calculation expression, and a pre-labeled correct cost comparison result.
  • the calling unit 1330 is also used to:
  • the feature matrix corresponding to the candidate schedule is normalized to obtain the preprocessed candidate schedule.
  • the feature matrix is used to indicate at least one of loop information, input data shape information, calculation encoding, axis type encoding, and data access type encoding;
  • the loop information includes the loop calculation logic of the candidate schedule;
  • the input data shape information is used to describe the input data of the operator;
  • the calculation encoding includes the encoding of the calculation instructions used in the current loop of the candidate schedule;
  • the axis type encoding includes the type encoding for operations on an axis;
  • the data access type encoding includes the type encoding for accessing data.
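The preprocessing described above, assembling a feature matrix from loop, shape, and categorical encodings and then normalizing it, can be sketched as follows. The field names and the choice of column-wise min-max normalization are assumptions for illustration; the application does not fix a specific normalization scheme here.

```python
import numpy as np

# Illustrative preprocessing of candidate schedules: build the feature
# matrix (loop info, input shape, calculation/axis/access encodings)
# and normalize it column-wise. Field names are assumptions.

def build_feature_row(loop_extent, input_shape, calc_code, axis_code, access_code):
    # Flatten one loop's features into a single numeric row.
    return [loop_extent, *input_shape, calc_code, axis_code, access_code]

def normalize(matrix):
    """Min-max normalize each column into [0, 1]; constant columns map to 0."""
    m = np.asarray(matrix, dtype=float)
    lo = m.min(axis=0)
    span = m.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (m - lo) / span

rows = [
    build_feature_row(256, (1, 64), calc_code=3, axis_code=1, access_code=2),
    build_feature_row(1024, (1, 64), calc_code=3, axis_code=0, access_code=1),
]
print(normalize(rows))
```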
  • the device further includes a training unit; the training unit is used for:
  • the training sample set includes at least one set of sample data sets;
  • the at least two sample schedules are preprocessed to obtain at least two preprocessed sample schedules;
  • the original parameter model is a neural network model
  • an error backpropagation algorithm is used to train the cost comparison model.
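The training unit's procedure, fitting a parameterized model on labeled sample-schedule pairs with error backpropagation, can be sketched with a minimal pairwise setup. This is a hand-rolled sketch under assumptions: the "network" is a linear scorer and the loss is a pairwise logistic (RankNet-style) loss; the application's actual neural network architecture and loss are not specified here.

```python
import math
import random

# Hedged sketch of the training unit: gradient descent on a pairwise
# logistic loss over labeled pairs of (preprocessed) sample schedules.
# label = 1 means schedule A runs longer than schedule B.

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train(pairs, dim, lr=0.1, epochs=200):
    """pairs: list of (feat_a, feat_b, label); returns learned weights."""
    w = [0.0] * dim
    for _ in range(epochs):
        for xa, xb, y in pairs:
            # Probability that A is the slower schedule.
            p = 1.0 / (1.0 + math.exp(-(score(w, xa) - score(w, xb))))
            g = p - y  # gradient of the logistic loss w.r.t. the score gap
            for i in range(dim):
                w[i] -= lr * g * (xa[i] - xb[i])
    return w

# Toy data: the schedule with the larger first feature runs longer.
random.seed(0)
pairs = []
for _ in range(100):
    a, b = random.random(), random.random()
    pairs.append(([a, 1.0], [b, 1.0], 1 if a > b else 0))

w = train(pairs, dim=2)
print(w[0] > 0)  # True: the learned weight recovers the ordering rule
```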
  • the device further includes an update unit; the update unit is used for:
  • the cost comparison model is trained according to the updated training sample set to obtain an updated cost comparison model.
  • The division of the above functional modules is used only as an example for illustration. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the device provided by the above embodiment and the method embodiment belong to the same concept; for details of its specific implementation process, refer to the method embodiment, which will not be repeated here.
  • An embodiment of the present application provides an operator scheduling runtime comparison device, which includes: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the methods executed by the computer device in the above embodiments.
  • An embodiment of the present application provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code runs in a processor, the processor executes the methods performed by the computer device in the foregoing embodiments.
  • An embodiment of the present application provides a non-volatile computer-readable storage medium on which computer program instructions are stored. When the computer program instructions are executed by a processor, the methods performed by the computer device in the foregoing embodiments are implemented.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), compact disc read-only memory (CD-ROM), digital video disc (DVD), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • Computer readable program instructions or codes described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, local area network, wide area network, and/or wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
  • Computer program instructions for performing the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
  • Electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), can execute computer-readable program instructions, thereby realizing various aspects of the present application.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for realizing the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium. These instructions cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
  • Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with hardware (such as circuits or an application-specific integrated circuit (ASIC)), or can be realized by a combination of hardware and software, such as firmware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This application relates to the field of data processing, and in particular to a method and device for comparing the scheduled running times of operators, and a storage medium. The method includes: acquiring at least two candidate schedules corresponding to a target calculation expression, the target calculation expression being used to describe the calculation logic of an operator; acquiring a cost comparison model, the cost comparison model being a model obtained by training a neural network using a plurality of sample schedules; and, according to the at least two candidate schedules, calling the cost comparison model to output a cost comparison result, the cost comparison result being used to indicate an ordering of the magnitudes of the execution durations of the at least two candidate schedules on a target hardware platform. According to this application, the relative magnitudes of the execution durations of different schedules are compared directly without predicting the absolute execution durations of the schedules, which makes it possible to achieve the automatic tuning function of a compiler/auto-tuner and to considerably improve the speed and accuracy of evaluating the running cost of a schedule.
PCT/CN2022/075526 2022-02-08 2022-02-08 Procédé et dispositif de comparaison de temps d'opération de planification d'opérateur, et support de stockage WO2023150912A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/075526 WO2023150912A1 (fr) 2022-02-08 2022-02-08 Procédé et dispositif de comparaison de temps d'opération de planification d'opérateur, et support de stockage
CN202280006829.5A CN116897356A (zh) 2022-02-08 2022-02-08 算子的调度运行时间比较方法、装置及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/075526 WO2023150912A1 (fr) 2022-02-08 2022-02-08 Procédé et dispositif de comparaison de temps d'opération de planification d'opérateur, et support de stockage

Publications (1)

Publication Number Publication Date
WO2023150912A1 true WO2023150912A1 (fr) 2023-08-17

Family

ID=87563395

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075526 WO2023150912A1 (fr) 2022-02-08 2022-02-08 Procédé et dispositif de comparaison de temps d'opération de planification d'opérateur, et support de stockage

Country Status (2)

Country Link
CN (1) CN116897356A (fr)
WO (1) WO2023150912A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116755779A (zh) * 2023-08-18 2023-09-15 腾讯科技(深圳)有限公司 循环间隔的确定方法、装置、设备、存储介质及芯片
CN117032936A (zh) * 2023-09-28 2023-11-10 之江实验室 一种数据调度方法、装置和计算机设备
CN117171577A (zh) * 2023-11-02 2023-12-05 之江实验室 一种高性能算子选择的动态决策方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150783A1 (en) * 2016-08-24 2018-05-31 Clari Inc. Method and system for predicting task completion of a time period based on task completion rates and data trend of prior time periods in view of attributes of tasks using machine learning models
US20180373564A1 (en) * 2017-06-22 2018-12-27 Banuba Limited Computer Systems And Computer-Implemented Methods For Dynamically Adaptive Distribution Of Workload Between Central Processing Unit(s) and Graphics Processing Unit(s)
CN112668701A (zh) * 2020-12-31 2021-04-16 上海商汤智能科技有限公司 神经网络运行方法、装置、电子设备及存储介质
CN113128702A (zh) * 2021-04-15 2021-07-16 杭州电子科技大学 一种基于强化学习的神经网络自适应分布式并行训练方法
CN113342631A (zh) * 2021-07-02 2021-09-03 厦门美图之家科技有限公司 分发管理优化方法、装置和电子设备
CN113946412A (zh) * 2020-07-17 2022-01-18 阿里巴巴集团控股有限公司 调度搜索方法和装置、云服务提供方法、电子设备以及计算机可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150783A1 (en) * 2016-08-24 2018-05-31 Clari Inc. Method and system for predicting task completion of a time period based on task completion rates and data trend of prior time periods in view of attributes of tasks using machine learning models
US20180373564A1 (en) * 2017-06-22 2018-12-27 Banuba Limited Computer Systems And Computer-Implemented Methods For Dynamically Adaptive Distribution Of Workload Between Central Processing Unit(s) and Graphics Processing Unit(s)
CN113946412A (zh) * 2020-07-17 2022-01-18 阿里巴巴集团控股有限公司 调度搜索方法和装置、云服务提供方法、电子设备以及计算机可读存储介质
CN112668701A (zh) * 2020-12-31 2021-04-16 上海商汤智能科技有限公司 神经网络运行方法、装置、电子设备及存储介质
CN113128702A (zh) * 2021-04-15 2021-07-16 杭州电子科技大学 一种基于强化学习的神经网络自适应分布式并行训练方法
CN113342631A (zh) * 2021-07-02 2021-09-03 厦门美图之家科技有限公司 分发管理优化方法、装置和电子设备

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116755779A (zh) * 2023-08-18 2023-09-15 腾讯科技(深圳)有限公司 循环间隔的确定方法、装置、设备、存储介质及芯片
CN116755779B (zh) * 2023-08-18 2023-12-05 腾讯科技(深圳)有限公司 循环间隔的确定方法、装置、设备、存储介质及芯片
CN117032936A (zh) * 2023-09-28 2023-11-10 之江实验室 一种数据调度方法、装置和计算机设备
CN117032936B (zh) * 2023-09-28 2024-02-06 之江实验室 一种数据调度方法、装置和计算机设备
CN117171577A (zh) * 2023-11-02 2023-12-05 之江实验室 一种高性能算子选择的动态决策方法及装置
CN117171577B (zh) * 2023-11-02 2024-03-22 之江实验室 一种高性能算子选择的动态决策方法及装置

Also Published As

Publication number Publication date
CN116897356A (zh) 2023-10-17

Similar Documents

Publication Publication Date Title
WO2023150912A1 (fr) Procédé et dispositif de comparaison de temps d'opération de planification d'opérateur, et support de stockage
US11410044B2 (en) Application development platform and software development kits that provide comprehensive machine learning services
US11790212B2 (en) Quantization-aware neural architecture search
US20200265301A1 (en) Incremental training of machine learning tools
CN110852438B (zh) 模型生成方法和装置
US20190138887A1 (en) Systems, methods, and media for gated recurrent neural networks with reduced parameter gating signals and/or memory-cell units
US11392829B1 (en) Managing data sparsity for neural networks
US20220230048A1 (en) Neural Architecture Scaling For Hardware Accelerators
WO2019216938A1 (fr) Plateforme de développement d'applications et kits de développement de logiciels fournissant des services d'apprentissage machine complets
JP2018533153A (ja) 機械学習に基づくネットワークモデル構築方法及び装置
US20230196202A1 (en) System and method for automatic building of learning machines using learning machines
Jin et al. Rc-darts: Resource constrained differentiable architecture search
KR20200086581A (ko) 뉴럴 네트워크 양자화를 위한 방법 및 장치
CN112149809A (zh) 模型超参数的确定方法及设备、计算设备和介质
WO2023160290A1 (fr) Procédé d'accélération d'inférence de réseau neuronal, procédé de détection de cible, dispositif et support de stockage
KR20200063041A (ko) 아키텍처 변이 기반 비지도 학습 및 선택적 오류 전파 기반 지도 학습을 이용한 신경망 학습 방법 및 장치
CN116097281A (zh) 经由无限宽度神经网络的理论的超参数传递
CN115699041A (zh) 利用专家模型的可扩展迁移学习
EP4217928A1 (fr) Mise à l'échelle d'architecture neuronale pour accélérateurs matériels
US20220076095A1 (en) Multi-level sparse neural networks with dynamic rerouting
US20240046099A1 (en) Method and system for jointly pruning and hardware acceleration of pre-trained deep learning models
CN116976461A (zh) 联邦学习方法、装置、设备及介质
KR20210035702A (ko) 인공 신경망의 양자화 방법 및 인공 신경망을 이용한 연산 방법
Oh et al. Application of Deep Learning Model Inference with Batch Size Adjustment
US11704562B1 (en) Architecture for virtual instructions

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280006829.5

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22925277

Country of ref document: EP

Kind code of ref document: A1