Disclosure of Invention
The purpose of the present application is to provide a tensor computation code optimization method and apparatus, an electronic device, and a computer-readable medium.
The first aspect of the present application provides a tensor computation code optimization method, comprising:
analyzing the loop features and computation graph features of the tensor computation code to obtain corresponding loop information and computation graph information;
generating an optimization space according to the loop information, the computation graph information, and a preset optimization method; the optimization space consists of a plurality of space points, and each space point represents a combination of preset optimization methods and a parameter selection;
searching for and determining a target space point in the optimization space based on a simulated annealing algorithm and a reinforcement learning algorithm;
and optimizing the tensor computation code according to the preset optimization method combination and parameter selection corresponding to the target space point.
In some embodiments of the present application, the preset optimization methods include: loop partitioning, loop reorganization, loop reordering, loop unrolling, vectorization, and parallelization.
In some embodiments of the present application, the analyzing the loop features and computation graph features of the tensor computation code to obtain corresponding loop information and computation graph information comprises:
parsing the tensor computation code, and converting the generated abstract syntax tree into a computation graph;
traversing the computation graph, and collecting the connection information and loop information of each computation node; the loop information includes the number of loops, the memory space occupied by each loop, the loop order, and loop data dependencies; the connection information includes the number of inputs and the number of outputs of each computation node, and the connection information of all computation nodes constitutes the computation graph information.
In some embodiments of the present application, the generating an optimization space according to the loop information, the computation graph information, and the preset optimization method comprises:
enumerating, according to the loop information and the computation graph information, combinations of the preset optimization methods and their parameter selections to form a basic optimization space;
selecting, based on pruning techniques, a number of preset optimization method combinations from the basic optimization space according to preset conditions, setting parameter selection ranges, and generating the optimization space.
A second aspect of the present application provides a tensor computation code optimization apparatus, comprising:
a static analysis module, configured to analyze the loop features and computation graph features of the tensor computation code to obtain corresponding loop information and computation graph information;
an optimization space generation module, configured to generate an optimization space according to the loop information, the computation graph information, and a preset optimization method; the optimization space consists of a plurality of space points, and each space point represents a combination of preset optimization methods and a parameter selection;
an optimization space search module, configured to search for and determine a target space point in the optimization space based on a simulated annealing algorithm and a reinforcement learning algorithm;
and an optimization implementation module, configured to optimize the tensor computation code according to the preset optimization method combination and parameter selection corresponding to the target space point.
In some embodiments of the present application, the preset optimization methods include: loop partitioning, loop reorganization, loop reordering, loop unrolling, vectorization, and parallelization.
In some embodiments of the present application, the static analysis module is specifically configured to:
parse the tensor computation code, and convert the generated abstract syntax tree into a computation graph;
traverse the computation graph, and collect the connection information and loop information of each computation node; the loop information includes the number of loops, the memory space occupied by each loop, the loop order, and loop data dependencies; the connection information includes the number of inputs and the number of outputs of each computation node, and the connection information of all computation nodes constitutes the computation graph information.
In some embodiments of the present application, the optimization space generation module is specifically configured to:
enumerate, according to the loop information and the computation graph information, combinations of the preset optimization methods and their parameter selections to form a basic optimization space;
select, based on pruning techniques, a number of preset optimization method combinations from the basic optimization space according to preset conditions, set parameter selection ranges, and generate the optimization space.
A third aspect of the present application provides an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the method of the first aspect of the present application.
A fourth aspect of the present application provides a computer-readable medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement the method of the first aspect of the present application.
Compared with the prior art, the tensor computation code optimization method, apparatus, device, and medium provided by the present application analyze the loop features and computation graph features of the tensor computation code to obtain corresponding loop information and computation graph information, and generate an optimization space according to the loop information, the computation graph information, and a preset optimization method, where the optimization space consists of a plurality of space points and each space point represents a combination of preset optimization methods and a parameter selection. A target space point is then searched for and determined in the optimization space based on a simulated annealing algorithm and a reinforcement learning algorithm, and the tensor computation code is optimized according to the preset optimization method combination and parameter selection corresponding to the target space point. In this way, automatic optimization of tensor computation code can be completed quickly and the running efficiency of the code improved; developers are spared the manual effort of hand-tuning operators while still obtaining relatively good performance, which reduces cost and improves development efficiency.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
In addition, the terms "first" and "second" etc. are used to distinguish different objects and are not used to describe a particular order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Embodiments of the present application provide a tensor computation code optimization method and apparatus, an electronic device, and a computer-readable medium, which are described below with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a tensor computation code optimization method according to some embodiments of the present application is shown. The tensor computation code optimization method may include the following steps:
Step S101: analyzing the loop features and computation graph features of the tensor computation code to obtain corresponding loop information and computation graph information.
The loop information specifically includes: the number of loops, the memory space occupied by each loop, the loop order, and loop data dependencies. The computation graph information specifically includes: the connection pattern of the computation nodes in the computation graph, and the number of inputs and the number of outputs of each computation node.
In some embodiments of the present application, the step S101 may be specifically implemented as:
parsing the tensor computation code, and converting the generated abstract syntax tree into a computation graph;
traversing the computation graph, and collecting the connection information and loop information of each computation node; the loop information includes the number of loops, the memory space occupied by each loop, the loop order, and loop data dependencies; the connection information includes the number of inputs and the number of outputs of each computation node, and the connection information of all computation nodes constitutes the computation graph information.
Specifically, the computation graph is traversed to collect the loop information of each node, including the number of loops, the size (memory footprint) of each loop, the position (order) of each loop, and whether it carries a data dependency. The data dependency is determined as follows: only loops that appear solely on the right-hand side of the Einstein-notation tensor expression carry data dependencies, while all other loops do not; loops without data dependencies can be parallelized directly. The node connection pattern of the computation graph and the number of inputs and outputs of each computation node are used to judge whether nodes can be merged. Finally, the whole computation graph is topologically sorted, and the nodes are optimized one by one in topological order.
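As an illustration of the traversal and topological ordering described above, the following sketch builds a toy computation graph whose nodes carry loop information and orders it so that every node is optimized after its inputs. The `Node` class and `toposort` helper are hypothetical names, not part of the patent's implementation.

```python
# Minimal sketch: computation-graph nodes with loop info, topologically sorted.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)   # names of upstream nodes
    loops: list = field(default_factory=list)    # (extent, has_data_dependency)

def toposort(nodes):
    """Order nodes so every node appears after all of its inputs (Kahn's algorithm)."""
    by_name = {n.name: n for n in nodes}
    indeg = {n.name: len(n.inputs) for n in nodes}
    out = {n.name: [] for n in nodes}
    for n in nodes:
        for src in n.inputs:
            out[src].append(n.name)
    ready = deque(name for name, d in indeg.items() if d == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for dst in out[name]:
            indeg[dst] -= 1
            if indeg[dst] == 0:
                ready.append(dst)
    return [by_name[name] for name in order]

# Matrix multiply C[i, j] = sum_k A[i, k] * B[k, j]: the reduction loop k
# appears only on the right-hand side, so it carries a data dependency,
# while i and j do not and can be parallelized directly.
a = Node("A")
b = Node("B")
c = Node("C", inputs=["A", "B"],
         loops=[(1024, False), (1024, False), (512, True)])
order = toposort([c, a, b])
print([n.name for n in order])                      # ['A', 'B', 'C']
parallelizable = [ext for ext, dep in c.loops if not dep]
print(parallelizable)                               # [1024, 1024]
```

In a real implementation the graph would come from parsing the tensor program's abstract syntax tree rather than being constructed by hand.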
Step S102: generating an optimization space according to the loop information, the computation graph information, and a preset optimization method; the optimization space consists of a plurality of space points, and each space point represents a combination of preset optimization methods and a parameter selection.
In some embodiments of the present application, the preset optimization methods may include: loop partitioning, loop reorganization, loop reordering, loop unrolling, vectorization, and parallelization.
In some embodiments of the present application, the step S102 may be specifically implemented as:
enumerating, according to the loop information and the computation graph information, combinations of the preset optimization methods and their parameter selections to form a basic optimization space;
selecting, based on pruning techniques, a number of preset optimization method combinations from the basic optimization space according to preset conditions, setting parameter selection ranges, and generating the optimization space.
Specifically, according to the loop information and computation graph information acquired in step S101, different optimization method combinations and parameter selections are enumerated to form a basic optimization space. The candidate optimization methods include loop partitioning, loop reorganization, loop reordering, loop unrolling, vectorization, and parallelization, as described above; these methods improve the parallelism and locality of the original computation. Each optimization method admits different combinations and parameter settings, and all of these choices together constitute the overall optimization space, also called the basic optimization space.
Pruning is then performed within the basic optimization space: parts that need not be searched are removed according to optimization knowledge, reducing the search workload. The basic optimization space is extremely large, and an ordinary exhaustive search within it is infeasible. The pruning technique fixes the combination patterns of some optimization methods and limits parameter ranges; specifically, loop partitioning considers only tiling by multiples, and parallelization considers only the outermost loop. Parameters are enumerated only within a limited range, which can be set according to the actual situation.
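A minimal sketch of such a pruned enumeration is given below. The specific pruning rules used here (tile sizes restricted to powers of two that divide the loop extent, and parallelization only of the outermost loop) are illustrative assumptions, not the patent's actual settings.

```python
# Enumerate a pruned optimization space for a list of loop extents.
from itertools import product

def pruned_space(loop_extents, max_tile=64):
    tile_choices = []
    for ext in loop_extents:
        # Pruning rule 1 (assumed): only power-of-two tile sizes that
        # evenly divide the loop extent.
        tile_choices.append([t for t in (1, 2, 4, 8, 16, 32, 64)
                             if t <= max_tile and ext % t == 0])
    points = []
    for tiles in product(*tile_choices):
        # Pruning rule 2 (assumed): parallelize only the outermost loop,
        # or not at all.
        for par_outer in (False, True):
            points.append({"tiles": tiles, "parallel_outer": par_outer})
    return points

space = pruned_space([1024, 512])
print(len(space))   # 7 tile choices per loop -> 7 * 7 * 2 = 98 space points
```

Even this tiny two-loop example shows how quickly the space grows, which is why the unpruned basic space cannot be searched exhaustively.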
Step S103: searching for and determining a target space point in the optimization space based on a simulated annealing algorithm and a reinforcement learning algorithm.
Specifically, the simulated annealing algorithm determines which space point to explore at each step of the search. A large number of space points must be explored during the search; for each explored point, its neighbouring points are evaluated and the better ones are retained. The heuristic of simulated annealing is used to select which space points to explore.
For each space point to be explored, the basic exploration method is to examine all possible directions, obtain the neighbouring point in each direction, and use an evaluator to judge whether its performance is better. Since exploring too many directions makes the search task heavy, a reinforcement learning method from machine learning is adopted to predict the correct next direction from the current search state, so that only one direction needs to be explored, greatly reducing the workload. The Q-learning method of reinforcement learning is used to predict the direction of each search step.
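The interplay of the two algorithms can be sketched as follows: simulated annealing decides whether to accept a neighbouring space point, and a small tabular Q-learning policy picks which single direction to explore next. The quadratic `cost` function is a stand-in for the real performance evaluator, and all names and constants here are hypothetical.

```python
# Toy search loop: Q-learning picks a direction, simulated annealing
# decides whether to accept the resulting neighbour.
import math
import random

random.seed(0)

def cost(point):
    # Hypothetical evaluator: pretend the best point is (4, 3).
    return (point[0] - 4) ** 2 + (point[1] - 3) ** 2

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]
q = {}  # (state, move_index) -> estimated value

def pick_move(state, eps=0.3):
    """Epsilon-greedy choice of the next exploration direction."""
    if random.random() < eps:
        return random.randrange(len(MOVES))
    return max(range(len(MOVES)), key=lambda m: q.get((state, m), 0.0))

point, temp = (0, 0), 10.0
for _ in range(200):
    m = pick_move(point)
    dx, dy = MOVES[m]
    cand = (max(0, point[0] + dx), max(0, point[1] + dy))
    delta = cost(cand) - cost(point)
    # Q-learning update: reward is the cost improvement of the chosen direction.
    old = q.get((point, m), 0.0)
    q[(point, m)] = old + 0.5 * (-delta - old)
    # Simulated-annealing acceptance rule: always take improvements,
    # accept worsenings with probability exp(-delta / temp).
    if delta < 0 or random.random() < math.exp(-delta / temp):
        point = cand
    temp *= 0.97  # cooling schedule

print(point, cost(point))
```

In the real system the evaluator would measure (or model) the runtime of the code variant denoted by each space point, and the state/action encoding would cover the full set of optimization primitives rather than two tile exponents.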
Step S104: optimizing the tensor computation code according to the preset optimization method combination and parameter selection corresponding to the target space point.
Specifically, after the search result is obtained, the optimization scheme represented by the space point needs to be interpreted in reverse. The specific method is to keep records when the optimization space is generated; after the target space point (the point of optimal performance) is found, these records are looked up to determine the optimization method combination and parameter selection corresponding to the target space point.
According to the resolved optimization method combination and parameter selection, the original tensor computation program is rewritten into a new program: optimization primitives and their corresponding parameter settings are added to the new program, completing the modification of the program's abstract syntax tree. Compilation and code generation are then carried out on the modified abstract syntax tree using the deep learning compiler TVM.
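Purely for illustration, the sketch below renders a recorded optimization choice back into TVM-style schedule-primitive source text. The record keys and the `render_schedule` helper are hypothetical; a real implementation would apply the primitives directly through TVM's scheduling API rather than emitting text.

```python
# Translate a record kept for the target space point into
# TVM-style schedule primitives (as source text, for illustration).
def render_schedule(record, tensor="C"):
    lines = []
    if "tile" in record:
        axis, factor = record["tile"]
        lines.append(f"{axis}o, {axis}i = s[{tensor}].split({axis}, factor={factor})")
    if record.get("vectorize"):
        lines.append(f"s[{tensor}].vectorize({record['vectorize']}i)")
    if record.get("parallel"):
        lines.append(f"s[{tensor}].parallel({record['parallel']}o)")
    if record.get("unroll"):
        lines.append(f"s[{tensor}].unroll({record['unroll']})")
    return "\n".join(lines)

# Hypothetical best point: tile loop i by 32, vectorize the inner
# part, parallelize the outer part.
best = {"tile": ("i", 32), "vectorize": "i", "parallel": "i"}
print(render_schedule(best))
# io, ii = s[C].split(i, factor=32)
# s[C].vectorize(ii)
# s[C].parallel(io)
```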
The tensor computation code optimization method can be used on a client. In the embodiments of the present application, the client may comprise hardware or software. When the client comprises hardware, it may be any of a variety of electronic devices that have a display screen and support information interaction, including but not limited to smartphones, tablets, laptop computers, and desktop computers. When the client comprises software, it may be installed in the above-described electronic devices, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No particular limitation is imposed herein.
Compared with the prior art, the tensor computation code optimization method provided by the embodiments of the present application analyzes the loop features and computation graph features of the tensor computation code to obtain corresponding loop information and computation graph information, and generates an optimization space according to the loop information, the computation graph information, and a preset optimization method, where the optimization space consists of a plurality of space points and each space point represents a combination of preset optimization methods and a parameter selection. A target space point is then searched for and determined in the optimization space based on a simulated annealing algorithm and a reinforcement learning algorithm, and the tensor computation code is optimized according to the preset optimization method combination and parameter selection corresponding to the target space point. In this way, automatic optimization of tensor computation code can be completed quickly and the running efficiency of the code improved; developers are spared the manual effort of hand-tuning operators while still obtaining relatively good performance, which reduces cost and improves development efficiency.
In the above embodiments, a tensor computation code optimization method is provided; correspondingly, the present application also provides a tensor computation code optimization apparatus. The tensor computation code optimization apparatus provided in the embodiments of the present application can implement the above tensor computation code optimization method, and may be implemented in software, hardware, or a combination of software and hardware. For example, the apparatus may comprise integrated or separate functional modules or units to perform the corresponding steps of the above method. Referring to fig. 2, a schematic diagram of a tensor computation code optimization apparatus according to some embodiments of the present application is shown. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively brief; for relevant details, refer to the description of the method embodiments. The apparatus embodiments described below are merely illustrative.
As shown in fig. 2, the tensor computation code optimization apparatus 10 may include:
a static analysis module 101, configured to analyze the loop features and computation graph features of the tensor computation code to obtain corresponding loop information and computation graph information;
an optimization space generation module 102, configured to generate an optimization space according to the loop information, the computation graph information, and a preset optimization method; the optimization space consists of a plurality of space points, and each space point represents a combination of preset optimization methods and a parameter selection;
an optimization space search module 103, configured to search for and determine a target space point in the optimization space based on a simulated annealing algorithm and a reinforcement learning algorithm;
and an optimization implementation module 104, configured to optimize the tensor computation code according to the preset optimization method combination and parameter selection corresponding to the target space point.
In some implementations of the embodiments of the present application, the preset optimization methods include: loop partitioning, loop reorganization, loop reordering, loop unrolling, vectorization, and parallelization.
In some implementations of the embodiments of the present application, the static analysis module 101 is specifically configured to:
parse the tensor computation code, and convert the generated abstract syntax tree into a computation graph;
traverse the computation graph, and collect the connection information and loop information of each computation node; the loop information includes the number of loops, the memory space occupied by each loop, the loop order, and loop data dependencies; the connection information includes the number of inputs and the number of outputs of each computation node, and the connection information of all computation nodes constitutes the computation graph information.
In some implementations of the embodiments of the present application, the optimization space generation module 102 is specifically configured to:
enumerate, according to the loop information and the computation graph information, combinations of the preset optimization methods and their parameter selections to form a basic optimization space;
select, based on pruning techniques, a number of preset optimization method combinations from the basic optimization space according to preset conditions, set parameter selection ranges, and generate the optimization space.
Since it is based on the same inventive concept, the tensor computation code optimization apparatus 10 provided in the embodiments of the present application has the same beneficial effects as the tensor computation code optimization method provided in the foregoing embodiments of the present application.
The embodiments of the present application also provide an electronic device corresponding to the tensor computation code optimization method provided by the foregoing embodiments. The electronic device may be an electronic device for a client, for example, a mobile phone, a laptop computer, a tablet computer, or a desktop computer, so as to execute the tensor computation code optimization method.
Referring to fig. 3, a schematic diagram of an electronic device according to some embodiments of the present application is shown. As shown in fig. 3, the electronic device 20 includes: a processor 200, a memory 201, a bus 202, and a communication interface 203, where the processor 200, the communication interface 203, and the memory 201 are connected by the bus 202. The memory 201 stores a computer program executable on the processor 200, and when executing the computer program, the processor 200 performs the tensor computation code optimization method provided in any of the foregoing embodiments of the present application.
The memory 201 may include a high-speed random access memory (RAM), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between this system network element and at least one other network element is implemented via at least one communication interface 203 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, or the like may be used.
The bus 202 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and so on. The memory 201 is configured to store a program, and the processor 200 executes the program after receiving an execution instruction; the tensor computation code optimization method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 200 or implemented by the processor 200.
The processor 200 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and completes the steps of the above method in combination with its hardware.
The electronic device provided by the embodiments of the present application is based on the same inventive concept as the tensor computation code optimization method provided by the embodiments of the present application, and has the same beneficial effects as the method it adopts, runs, or implements.
The embodiments of the present application also provide a computer-readable medium corresponding to the tensor computation code optimization method provided by the foregoing embodiments. Referring to fig. 4, the computer-readable storage medium is shown as an optical disc 30 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the tensor computation code optimization method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer-readable storage medium provided by the above embodiments of the present application is based on the same inventive concept as the tensor computation code optimization method provided by the embodiments of the present application, and has the same beneficial effects as the method adopted, run, or implemented by the application program stored thereon.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, and shall all fall within the scope of the claims and description of the present application.