WO2022135028A1 - Method for interfacing with a TVM and related device - Google Patents

Method for interfacing with a TVM and related device

Info

Publication number
WO2022135028A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph
tvm
calculation
calculation graph
computational
Prior art date
Application number
PCT/CN2021/133512
Other languages
English (en)
French (fr)
Inventor
张丹
黎立煌
王和国
Original Assignee
深圳云天励飞技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳云天励飞技术股份有限公司
Publication of WO2022135028A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/30: Creation or generation of source code
    • G06F 8/37: Compiler construction; Parser generation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/30: Creation or generation of source code
    • G06F 8/34: Graphical or visual programming
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the technical field of data processing, and in particular to a method and related device for interfacing with a TVM.
  • TVM: Tensor Virtual Machine
  • GPU: graphics processing unit
  • CPU: central processing unit
  • FPGA: field-programmable gate array
  • TVM is currently an open-source project that mainly serves as a compiler stack for artificial intelligence and deep learning systems; in other words, TVM is an open deep learning compiler stack for CPUs, GPUs, and specialized accelerators.
  • The most distinctive feature of TVM is that it optimizes instruction generation based on the graph and operator structure in order to maximize hardware execution efficiency.
  • TVM integrates quantization, which can improve efficiency in deep learning inference.
  • Upwards, TVM can connect to deep learning frameworks such as TensorFlow, PyTorch, and Caffe (Convolutional Architecture for Fast Feature Embedding), where Caffe is a deep learning framework noted for its expressiveness, speed, and modularity; downwards, TVM is compatible with hardware devices such as GPUs, CPUs, ARM processors, and tensor processors (Tensor Processing Unit, TPU).
  • TVM cannot be applied directly to a chip, but some TVM functions can be connected to the chip development environment to speed up the chip development process.
  • TVM uses Relay to build a deep learning model into a computation graph (data flow); the chip implements the node functions in the computation graph and completes the initial hardware deployment. Relay is a versatile programming language that serves as an intermediate representation for expressing machine learning systems.
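As an illustration of the kind of data-flow listing Relay produces, the sketch below extracts operator names from a simplified, hypothetical Relay-style text graph. The syntax shown is a stand-in for illustration only, not TVM's exact Relay text format.

```python
import re

# A simplified, hypothetical stand-in for the text form of a TVM Relay
# computation graph (real Relay text syntax is richer than this sketch).
relay_text = """
%0 = nn.conv2d(%data, %weight, strides=[1, 1], padding=[1, 1]);
%1 = nn.relu(%0);
%2 = nn.max_pool2d(%1, pool_size=[2, 2]);
"""

# Each line defines one node of the data-flow graph; the callee name is
# the operator, and the %N references are the directed edges between nodes.
ops = re.findall(r"=\s*([\w.]+)\(", relay_text)
print(ops)  # ['nn.conv2d', 'nn.relu', 'nn.max_pool2d']
```

Each extracted operator corresponds to one node of the computation graph that the chip must implement.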
  • However, introducing TVM into the chip development environment makes the running rate very slow and slows down the progress of chip development.
  • The embodiments of the present application disclose a method and related device for interfacing with a TVM, which can greatly reduce the computing resource requirements that the TVM introduces into the chip development environment, increase the running rate, and reduce the running time of the chip development environment.
  • A first aspect of the embodiments of the present application discloses an apparatus for interfacing with a TVM, applied to an electronic device.
  • The apparatus includes: a TVM correction module, configured to use the TVM to generate a first calculation graph according to a target model, where the target model is used for chip development; and a calculation graph generation module, configured to generate a second calculation graph according to the first calculation graph, where the structure of the second calculation graph is the calculation graph structure used for the chip development and the second calculation graph is the input of the chip development environment.
  • In this way, the TVM is used to generate the first calculation graph from the target model used for chip development; that is, the TVM turns the target model into a first calculation graph whose structure is the calculation graph structure used by the TVM. A second calculation graph is then generated from the first calculation graph, and its structure is the calculation graph structure used for chip development, so the second calculation graph can serve as the input of the chip development environment; the TVM environment is thereby introduced into the chip development environment.
  • Because the structure of the second calculation graph is the calculation graph structure used for chip development, the second calculation graph requires fewer computing resources and runs faster in the chip development environment than the first calculation graph. Converting the first calculation graph into the second calculation graph and then running the second calculation graph in the chip development environment can therefore greatly reduce the computing resource requirements that the TVM introduces, increase the running rate, and reduce the running time of the chip development environment.
  • The calculation graph generation module includes a TVM operator parameter template list and a calculation graph parsing unit, where the TVM operator parameter template list is obtained from the operators used by the TVM. The calculation graph parsing unit is configured to: parse the first calculation graph according to the TVM operator parameter template list to obtain the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node in the first calculation graph; and generate the second calculation graph according to the operator name, operator parameters, input data dimensions, output data dimensions, and node label corresponding to each node.
  • Since the TVM operator parameter template list is obtained from the operators used by the TVM, it may include the information of all operators used by the TVM. An operator appears in the calculation graph as a node, so parsing the first calculation graph against the TVM operator parameter template list yields the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node in the first calculation graph. The calculation graph is then reorganized according to this per-node information to generate the second calculation graph. The calculation graph structure used by the TVM is thereby transformed into the calculation graph structure used for chip development, which helps reduce the computing resource requirements that the TVM introduces into the chip development environment.
  • The calculation graph parsing unit includes: an operator name extraction subunit, configured to search the first calculation graph according to the TVM operator parameter template list to obtain the operator name corresponding to each node; an operator parameter extraction subunit, configured to extract the operator parameters corresponding to each node from the TVM operator parameter template list according to that node's operator name;
  • an input and output data dimension extraction subunit, configured to extract the dimensions of the input data and the dimensions of the output data corresponding to each node from the TVM operator parameter template list according to that node's operator name; and
  • a node label extraction subunit, configured to determine the node label corresponding to each node according to the connection relationship of the nodes in the first calculation graph.
  • The first calculation graph is searched according to the TVM operator parameter template list to obtain the operator name corresponding to each node; according to each node's operator name, the operator parameters and the dimensions of the input and output data corresponding to that node are extracted from the TVM operator parameter template list; the node label corresponding to each node is then determined from the connection relationship of the nodes in the first calculation graph. The operator name, operator parameters, input data dimensions, output data dimensions, and node label obtained for each node can then be combined to produce the second calculation graph.
  • The TVM correction module is specifically configured to: use the TVM to generate a third calculation graph according to the target model; and use the calculation graph optimization part and the calculation graph quantization part of the TVM to process the third calculation graph to obtain the first calculation graph, where the rate at which the first calculation graph is run by hardware is greater than the rate at which the third calculation graph is run by the hardware.
  • The TVM is first used to generate a third calculation graph from the target model; the calculation graph optimization part and calculation graph quantization part of the TVM then process the third calculation graph to obtain the first calculation graph. Because the first calculation graph has been optimized and quantized, the computation of invalid and redundant nodes and the conversion of data types have been removed, so the rate at which the first calculation graph is run by hardware is greater than the rate at which the third calculation graph is run. Generating the second calculation graph from the optimized and quantized first calculation graph helps improve the running rate of the second calculation graph in the chip development environment.
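The effect of the optimization and quantization steps can be pictured in miniature. The example below is not TVM's actual pass implementation; it is a hypothetical stand-in in which "optimization" drops a redundant node and "quantization" maps float weights into an int8-style range.

```python
# A minimal sketch, not TVM's actual passes: "optimization" removes a
# redundant node and "quantization" maps float32 weights into the int8
# range, keeping the scale needed to dequantize later.
graph = [
    {"op": "conv2d",   "weights": [0.4, -1.0, 0.25]},
    {"op": "identity", "weights": None},   # redundant node: does no work
    {"op": "relu",     "weights": None},
]

def optimize(nodes):
    # Drop nodes that contribute nothing to the computation.
    return [n for n in nodes if n["op"] != "identity"]

def quantize(nodes, scale=127.0):
    # Replace float weights with rounded int8-range values.
    for n in nodes:
        if n["weights"] is not None:
            n["scale"] = scale
            n["weights"] = [round(w * scale) for w in n["weights"]]
    return nodes

first_graph = quantize(optimize(graph))
print([n["op"] for n in first_graph])  # ['conv2d', 'relu']
print(first_graph[0]["weights"])       # [51, -127, 32]
```

With fewer nodes and integer arithmetic, the resulting graph is cheaper for hardware to run, which is the effect the patent attributes to the optimized and quantized first calculation graph.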
  • The TVM correction module is further configured to modify the calculation graph optimization part and the calculation graph quantization part according to the chip architecture, so that both parts are adapted to the chip development.
  • Modifying the calculation graph optimization part and calculation graph quantization part of the TVM according to the chip architecture makes them suitable for chip development, which helps the first calculation graph produced by these two parts to be well adapted to running in the chip development environment.
  • The apparatus further includes a calculation graph processing module, configured to perform optimization and/or quantization processing on the second calculation graph to obtain a fourth calculation graph, where the fourth calculation graph is the input of the chip development environment and the rate at which the fourth calculation graph is run by hardware is greater than the rate at which the second calculation graph is run by the hardware.
  • The second calculation graph that would be input to run in the chip development environment is optimized and/or quantized to obtain the fourth calculation graph, so that the rate at which the fourth calculation graph is run by hardware is greater than the rate at which the second calculation graph would be run; this helps increase the running rate and reduce the running time of the chip development environment.
  • The apparatus further includes a calculation graph statistics module, configured to perform information statistics on the second calculation graph and/or the fourth calculation graph to obtain calculation graph information, where the calculation graph information is an input of the chip development environment and is used to increase the rate at which the second calculation graph and/or the fourth calculation graph are run by hardware.
  • Information statistics are performed on the calculation graph to be input into the chip development environment, and the resulting calculation graph information is also input into the chip development environment; this information can increase the running rate of the calculation graph in the chip development environment, thereby reducing the running time of the chip development environment.
  • The first calculation graph and the third calculation graph are saved in the form of text, and the second calculation graph and the fourth calculation graph are saved in the form of a python DataFrame.
  • Saving the first calculation graph and the third calculation graph in the form of text decouples the TVM environment from the chip development environment; saving the second calculation graph and the fourth calculation graph in the form of a python DataFrame decouples the TVM interfacing environment from the chip development environment, thereby speeding up the running rate of the chip development environment. Saving the second and fourth calculation graphs as python DataFrames also makes the calculation graphs easy to visualize.
  • A second aspect of the embodiments of the present application discloses a method for interfacing with a TVM, applied to an electronic device.
  • The method includes: using the TVM to generate a first calculation graph according to a target model, where the target model is used for chip development; and generating a second calculation graph according to the first calculation graph, where the structure of the second calculation graph is the calculation graph structure used for the chip development and the second calculation graph is the input of the chip development environment.
  • The electronic device stores a TVM operator parameter template list obtained from the operators used by the TVM. Generating the second calculation graph according to the first calculation graph includes: parsing the first calculation graph according to the TVM operator parameter template list to obtain the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node in the first calculation graph; and generating the second calculation graph according to the operator name, operator parameters, input data dimensions, output data dimensions, and node label corresponding to each node.
  • Parsing the first calculation graph according to the TVM operator parameter template list to obtain the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node includes: searching the first calculation graph according to the TVM operator parameter template list to obtain the operator name corresponding to each node; extracting the operator parameters corresponding to each node from the TVM operator parameter template list according to that node's operator name; extracting the dimensions of the input data and the dimensions of the output data corresponding to each node from the TVM operator parameter template list according to that node's operator name; and determining the node label corresponding to each node according to the connection relationship of the nodes in the first calculation graph.
  • Using the TVM to generate the first calculation graph according to the target model includes: using the TVM to generate a third calculation graph according to the target model; and using the calculation graph optimization part and calculation graph quantization part of the TVM to process the third calculation graph to obtain the first calculation graph, where the rate at which the first calculation graph is run by hardware is greater than the rate at which the third calculation graph is run by the hardware.
  • Before the third calculation graph is processed by the calculation graph optimization part and calculation graph quantization part of the TVM, the method further includes: modifying the calculation graph optimization part and the calculation graph quantization part according to a chip architecture so that both parts are adapted to the chip development.
  • The method further includes: performing optimization and/or quantization processing on the second calculation graph to obtain a fourth calculation graph, where the fourth calculation graph is the input of the chip development environment and the rate at which the fourth calculation graph is run by hardware is greater than the rate at which the second calculation graph is run by the hardware.
  • The method further includes: performing information statistics on the second calculation graph and/or the fourth calculation graph to obtain calculation graph information, where the calculation graph information is an input of the chip development environment and is used to increase the rate at which the second calculation graph and/or the fourth calculation graph are run by hardware.
  • The first calculation graph and the third calculation graph are saved in the form of text, and the second calculation graph and the fourth calculation graph are saved in the form of a python DataFrame.
  • A third aspect of the embodiments of the present application discloses an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for executing the steps of the method in any implementation of the second aspect of the embodiments of the present application.
  • A fourth aspect of the embodiments of the present application discloses a chip, including a processor configured to call and run a computer program from a memory, so that a device in which the chip is installed executes the method of any implementation of the second aspect of the embodiments of the present application.
  • A fifth aspect of the embodiments of the present application discloses a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to execute the method of any implementation of the second aspect of the embodiments of the present application.
  • A sixth aspect of the embodiments of the present application discloses a computer program product that causes a computer to execute the method of any implementation of the second aspect of the embodiments of the present application.
  • FIG. 1 is a schematic structural diagram of a system for chip development provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a calculation graph generation module provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a computational graph parsing unit provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the internal logic of a calculation graph generation module provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a method for interfacing with a TVM provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of the architecture of a system for chip development provided by an embodiment of the present application.
  • The system is applied to an electronic device.
  • The system includes a TVM (Tensor Virtual Machine), a device for interfacing with the TVM, and a chip development environment.
  • The TVM environment may be the TVM environment of a historical project or the original TVM environment.
  • The device for interfacing with the TVM is connected to the TVM environment, and the device includes:
  • a TVM correction module, configured to use the TVM to generate a first calculation graph according to a target model, where the target model is used for chip development; and
  • a calculation graph generation module, configured to generate a second calculation graph according to the first calculation graph, where the structure of the second calculation graph is the calculation graph structure used for the chip development and the second calculation graph is the input of the chip development environment.
  • the target model is a deep learning model that needs to be supported for chip development.
  • A computation graph is defined as a directed graph consisting of nodes and directed edges, where the nodes correspond to mathematical operations, i.e., operators (ops); a computation graph is a way of expressing and evaluating mathematical expressions.
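This definition can be made concrete with a toy example. The code below is an illustration of the general idea, not TVM code: nodes are operators, directed edges carry data between them, and evaluating the graph evaluates the expression it encodes.

```python
# A toy illustration (not TVM code) of a computation graph: a directed
# graph whose nodes are operators and whose edges carry data between them.
# This graph expresses and evaluates the expression (a + b) * c.
graph = {
    "add": {"op": lambda x, y: x + y, "inputs": ["a", "b"]},
    "mul": {"op": lambda x, y: x * y, "inputs": ["add", "c"]},
}

def evaluate(node, values):
    """Recursively evaluate a node; leaf inputs are looked up in `values`."""
    if node in values:
        return values[node]
    spec = graph[node]
    args = [evaluate(inp, values) for inp in spec["inputs"]]
    return spec["op"](*args)

print(evaluate("mul", {"a": 2, "b": 3, "c": 4}))  # (2 + 3) * 4 = 20
```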
  • the computation graph structure of the first computation graph is different from the computation graph structure of the second computation graph, and the programming language for generating the first computation graph and the programming language for generating the second computation graph are also different.
  • the programming language used by the TVM correction module is the same as the programming language used by the TVM, but it is different from the programming language used by the calculation graph generation module.
  • The first calculation graph may use the calculation graph structure of a TVM Relay calculation graph.
  • The TVM correction module uses TVM Relay to generate the TVM Relay calculation graph for a model that chip development needs to support. If chip development needs to support multiple models, the TVM correction module can use TVM Relay to generate a calculation graph file list for the model list; since the model list includes multiple models, the calculation graph file list includes multiple TVM Relay calculation graphs corresponding to those models. The calculation graph file list exists in the form of text; in practical applications it can be a txt file or a log file, which makes the list easy to inspect visually.
  • The calculation graph generation module can convert a TVM Relay calculation graph into a second calculation graph whose structure is the calculation graph structure used for chip development; if there are multiple TVM Relay calculation graphs, they are all converted into second calculation graphs of that structure. The second calculation graph output by the device interfacing with the TVM is used as the input of the chip development environment.
  • The calculation graph generation module can parse the first calculation graph, extract information such as the operator names, operator parameters, input data dimensions, output data dimensions, and node labels corresponding to the nodes in the first calculation graph, and save this information in the second calculation graph; the second calculation graph can exist in the form of a python DataFrame or in other data forms.
  • python is a computer programming language; DataFrame is a tabular data structure defined in the python pandas library.
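A hedged sketch of what storing a calculation graph as a pandas DataFrame might look like, one row per node; the column names and dimension strings here are illustrative assumptions, not the patent's actual schema.

```python
import pandas as pd

# Illustrative per-node records of a second calculation graph; one row per
# node makes the graph tabular and easy to inspect.
nodes = [
    {"label": 0, "op": "conv2d", "in_dims": "(1, 3, 224, 224)", "out_dims": "(1, 64, 112, 112)"},
    {"label": 1, "op": "relu",   "in_dims": "(1, 64, 112, 112)", "out_dims": "(1, 64, 112, 112)"},
]
graph_df = pd.DataFrame(nodes)
print(graph_df)  # printing the table already gives a simple visualization
```

Because the DataFrame is a plain table, it can also be saved (for example with `to_csv`) independently of the TVM environment, which matches the decoupling described above.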
  • TVM is a very large environment. If the TVM Relay calculation graph is directly input into the chip development environment to run, the running rate will be very slow and the progress of chip development will be held back. The reason is that the first calculation graph has a large structure that is not the calculation graph structure used for chip development, so the computing resources required to run it are relatively large. If the first calculation graph is instead converted into a second calculation graph with the calculation graph structure used for chip development, and the second calculation graph is then input into the chip development environment to run, the computing resources required for operation can be significantly reduced, thereby increasing the running rate.
  • A target model may correspond to one calculation graph or to multiple calculation graphs; that is, the first calculation graph may comprise multiple TVM Relay calculation graphs. When the target model corresponds to multiple TVM Relay calculation graphs, those graphs may be converted into one second calculation graph or into multiple second calculation graphs; this is not specifically limited in this application.
  • In this way, the TVM is used to generate the first calculation graph from the target model used for chip development; that is, the TVM turns the target model into a first calculation graph whose structure is the calculation graph structure used by the TVM. A second calculation graph is then generated from the first calculation graph, and its structure is the calculation graph structure used for chip development, so the second calculation graph can serve as the input of the chip development environment; the TVM environment is thereby introduced into the chip development environment.
  • Because the structure of the second calculation graph is the calculation graph structure used for chip development, the second calculation graph requires fewer computing resources and runs faster in the chip development environment than the first calculation graph. Converting the first calculation graph into the second calculation graph and then running the second calculation graph in the chip development environment can therefore greatly reduce the computing resource requirements that the TVM introduces, increase the running rate, and reduce the running time of the chip development environment.
  • The calculation graph generation module includes a TVM operator parameter template list and a calculation graph parsing unit, where the TVM operator parameter template list is obtained from the operators used by the TVM. The calculation graph parsing unit is configured to: parse the first calculation graph according to the TVM operator parameter template list to obtain the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node in the first calculation graph; and generate the second calculation graph according to the operator name, operator parameters, input data dimensions, output data dimensions, and node label corresponding to each node.
  • FIG. 2 is a schematic structural diagram of a calculation graph generation module provided by an embodiment of the present application.
  • the input of the calculation graph generation module is a TVM Relay calculation graph, and the output is operator information required by chip development.
  • the calculation graph generation module includes a TVM operator parameter template list and a calculation graph parsing unit;
  • The TVM operator parameter template list is a list maintained according to the TVM Relay operator definitions and includes the operator parameter definitions; the TVM Relay operators corresponding to the models that chip development needs to support can be added to the template list.
  • The calculation graph parsing unit parses out the operator name, operator parameters, (vector) dimensions of the input data, (vector) dimensions of the output data, node labels, and other information.
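The parsing step just described can be sketched as follows. The template list, field names, and dimensions below are illustrative assumptions, not TVM's actual data model: the list maps each operator name to its parameters and data dimensions, and the parser fills one record per node of the first calculation graph.

```python
# Hypothetical operator parameter template list: maps a TVM Relay operator
# name to its parameter names and (illustrative) input/output dimensions.
template_list = {
    "nn.conv2d": {"params": ["strides", "padding"],
                  "in_dims": (1, 3, 8, 8), "out_dims": (1, 4, 8, 8)},
    "nn.relu":   {"params": [],
                  "in_dims": (1, 4, 8, 8), "out_dims": (1, 4, 8, 8)},
}

# The first calculation graph, reduced here to one operator name per node in
# data-flow order; node labels follow the connection order of the graph.
first_graph = ["nn.conv2d", "nn.relu"]

second_graph = []
for label, op_name in enumerate(first_graph):
    tpl = template_list[op_name]   # look the node's operator up in the list
    second_graph.append({
        "label": label,            # node label from the graph's connection order
        "op": op_name,
        "params": tpl["params"],
        "in_dims": tpl["in_dims"],
        "out_dims": tpl["out_dims"],
    })

print([(n["label"], n["op"]) for n in second_graph])  # [(0, 'nn.conv2d'), (1, 'nn.relu')]
```

The resulting list of per-node records is exactly the kind of tabular data that can be stored as a DataFrame and handed to the chip development environment.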
  • Since the TVM operator parameter template list is obtained from the operators used by the TVM, it may include the information of all operators used by the TVM. An operator appears in the calculation graph as a node, so parsing the first calculation graph against the TVM operator parameter template list yields the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node in the first calculation graph. The calculation graph is then reorganized according to this per-node information to generate the second calculation graph. The calculation graph structure used by the TVM is thereby transformed into the calculation graph structure used for chip development, which helps reduce the computing resource requirements that the TVM introduces into the chip development environment.
  • the computation graph parsing unit includes: an operator name extraction subunit, configured to search the first computation graph according to the TVM operator parameter template list to obtain the operator name corresponding to each node; an operator parameter extraction subunit, configured to extract the operator parameters corresponding to each node from the TVM operator parameter template list according to the operator name corresponding to each node.
  • the input/output data dimension extraction subunit is used to extract the dimensions of the input data and output data corresponding to each node from the TVM operator parameter template list according to the operator name corresponding to each node;
  • the node label extraction subunit is configured to determine the node label corresponding to each node according to the connection relationship of the nodes in the first computation graph.
  • FIG. 3 is a schematic structural diagram of a calculation graph parsing unit provided by an embodiment of the present application.
  • the calculation graph parsing unit includes an operator name extraction subunit, an operator parameter extraction subunit, an input/output data dimension extraction subunit, and a node label extraction subunit.
  • the operator name extraction subunit searches the first calculation graph according to the TVM operator parameter template list to obtain the operator name corresponding to each node; that is, each parameter template in the TVM operator parameter template list is searched and matched in the TVM Relay calculation graph of the target model, and a matched operator is used as the operator corresponding to a node in the second calculation graph used for chip development.
  • the operator parameter extraction subunit also incorporates the operator parameters corresponding to the operators in the TVM operator parameter template list into the information of the corresponding nodes in the second calculation graph; the operator parameters in the TVM operator parameter template list are optional.
  • the input and output data dimension extraction subunit also adds the dimensions of the operator's input data and output data to the second calculation graph.
  • the node label extraction subunit generates the node labels of the second computation graph according to the connection relationships of the nodes in the TVM Relay computation graph; as shown in Figure 4, the connection relationships of the second computation graphs output for different models differ, and the connection relationships in the first computation graph and the second computation graph are represented by node labels.
  • the node labels include the node label of the input node, the node label of the output node, and the node label of the current node; for example, for three connected nodes node 1, node 2, and node 3, the node labels of node 2 include the node label of node 1 (the node label of the input node), the node label of node 3 (the node label of the output node), and the node label of node 2 itself (the node label of the current node).
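The node-label scheme above can be sketched as follows. This is an illustrative assumption of how labels might be recorded for a chain node 1 → node 2 → node 3; the graph representation and label format are not the patent's actual data structures.

```python
# Hypothetical sketch: for each node, record the labels of its input node(s),
# itself (the current node), and its output node(s), derived from the edges.

def build_node_labels(edges, nodes):
    """Map each node to its input-node, current-node, and output-node labels."""
    labels = {}
    for node in nodes:
        inputs = [src for src, dst in edges if dst == node]
        outputs = [dst for src, dst in edges if src == node]
        labels[node] = {"inputs": inputs, "current": node, "outputs": outputs}
    return labels

graph_edges = [(1, 2), (2, 3)]  # node 1 -> node 2 -> node 3
labels = build_node_labels(graph_edges, [1, 2, 3])
print(labels[2])  # {'inputs': [1], 'current': 2, 'outputs': [3]}
```

The connection relationships of the graph are thus fully captured by the per-node labels, which is what allows the second computation graph to be reassembled later.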
  • the first calculation graph is searched according to the TVM operator parameter template list, and the operator name corresponding to each node in the first calculation graph can be obtained; then, according to the operator name corresponding to each node, the operator parameters, the dimensions of the input data, and the dimensions of the output data corresponding to each node can be extracted from the TVM operator parameter template list; the node label corresponding to each node is then determined according to the connection relationships of the nodes in the first calculation graph. The operator name, operator parameters, input data dimensions, output data dimensions, and node label corresponding to each node in the first calculation graph are thus obtained, which facilitates combining them into the second calculation graph.
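The parsing flow above can be sketched in miniature. The template-list format, the toy "first computation graph", and all field names below are illustrative assumptions, not TVM's or the patent's actual structures; a real TVM Relay graph is far richer.

```python
# Hypothetical operator parameter template list (operator name -> parameters).
OP_TEMPLATES = {
    "nn.conv2d": {"params": {"kernel_size": (3, 3), "strides": (1, 1)}},
    "nn.relu":   {"params": {}},
}

# Toy stand-in for the first (TVM Relay) computation graph: one dict per node.
first_graph = [
    {"op": "nn.conv2d", "in_shape": (1, 3, 224, 224), "out_shape": (1, 16, 222, 222)},
    {"op": "nn.relu",   "in_shape": (1, 16, 222, 222), "out_shape": (1, 16, 222, 222)},
]

def parse_first_graph(graph, templates):
    """Build the second graph: name, params, dims, and node labels per node."""
    second_graph = []
    for idx, node in enumerate(graph):
        name = node["op"]
        if name not in templates:  # search/match against the template list
            continue
        second_graph.append({
            "node_label": idx,                       # label from connection order
            "op_name": name,                         # operator name subunit
            "op_params": templates[name]["params"],  # operator parameter subunit
            "in_dims": node["in_shape"],             # dimension subunit (input)
            "out_dims": node["out_shape"],           # dimension subunit (output)
            "input_nodes": [idx - 1] if idx > 0 else [],
            "output_nodes": [idx + 1] if idx < len(graph) - 1 else [],
        })
    return second_graph

second = parse_first_graph(first_graph, OP_TEMPLATES)
```

Each entry of `second` bundles exactly the five pieces of information the parsing unit extracts, so reorganizing them into the chip-development graph structure is a straightforward traversal.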
  • the TVM correction module is specifically configured to: use TVM to generate a third calculation graph according to the target model; and process the third calculation graph using the calculation graph optimization part and calculation graph quantization part of TVM to obtain the first calculation graph, wherein the rate at which the first calculation graph is run by hardware is greater than the rate at which the third calculation graph is run by the hardware.
  • the TVM correction module uses TVM Relay to generate a third calculation graph according to the target model that the chip development needs to support, where the third calculation graph also has the TVM Relay calculation graph structure; the calculation graph optimization part and calculation graph quantization part of TVM then perform optimization and quantization on the third calculation graph to obtain the first calculation graph.
  • the first calculation graph and the third calculation graph may exist in the form of text, for example txt files or log files in practical applications, so as to realize the decoupling of the TVM environment and the chip development environment.
  • the above optimization part optimizes the structure of the computational graph. For example, op1-op2-op3 forms a computational graph; if op2 is redundant, it can be deleted, and after optimization the graph becomes op1-op3. The purpose is to speed up the processing rate of the model on hardware by optimizing the computational graph structure.
  • the above quantization part does not change the structure of the calculation graph but mainly transforms the data type of the model; converting the data type of the model from floating point to fixed point also speeds up the processing rate of the model on the hardware.
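The two steps just described can be illustrated with a toy sketch. These are not the actual TVM passes; the functions below merely demonstrate (1) deleting a redundant node so op1-op2-op3 becomes op1-op3, and (2) converting floating-point values to a simple int8 fixed-point form without touching the graph structure.

```python
def optimize(graph, redundant):
    """Graph optimization: drop redundant nodes, keeping the remaining chain."""
    return [op for op in graph if op not in redundant]

def quantize(weights, scale=127.0):
    """Quantization: map floats in [-1, 1] to int8 fixed-point values."""
    return [int(round(w * scale)) for w in weights]

graph = ["op1", "op2", "op3"]
print(optimize(graph, {"op2"}))     # ['op1', 'op3']
print(quantize([0.5, -0.25, 1.0]))  # [64, -32, 127]
```

Both transformations serve the same goal stated above: fewer nodes to execute and cheaper arithmetic, hence a faster processing rate on hardware.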
  • TVM is first used to generate a third calculation graph according to the target model; the calculation graph optimization part and calculation graph quantization part of TVM then process the third calculation graph to obtain the first calculation graph. Because the first calculation graph has been optimized and quantized, the computation of invalid and redundant nodes has been removed and the data types have been converted, so the rate at which the first calculation graph is run by hardware is greater than the rate at which the third calculation graph is run by the hardware. Generating the second calculation graph from the optimized and quantized first calculation graph helps improve the running speed of the second calculation graph in the chip development environment.
  • the TVM correction module is further configured to: modify the computational graph optimization part and the computational graph quantization part according to the chip architecture, so that the computational graph optimization part and the computational graph quantization part are adapted to the chip development.
  • the calculation graph optimization part and the calculation graph quantization part are modified according to the chip architecture, that is, the calculation graph optimization part and the calculation graph quantization part are modified according to the architectural characteristics of the chip;
  • the chip architecture refers to the description of chip object classes and attributes; for each object class, the architecture defines the attributes that the object class must have, the object class may also have additional attributes, and an object may have a parent object. Mainstream chip architectures include ARM, MIPS, x86, etc.
  • the TVM correction module can first modify the calculation graph optimization part and calculation graph quantization part of TVM according to the architectural characteristics of the chip, and then use TVM Relay to generate the first calculation graph for the target model that the chip development needs to support; alternatively, the TVM correction module can first use TVM Relay to generate the third calculation graph for the target model that the chip development needs to support, then modify the calculation graph optimization part and calculation graph quantization part of TVM according to the architectural characteristics of the chip, and then use them to optimize and quantize the third calculation graph to obtain the first calculation graph.
  • the calculation graph optimization part and calculation graph quantization part of TVM are modified according to the chip architecture so that they are suitable for chip development, which helps the first calculation graph processed by them to be adapted to and run by the chip development environment.
  • the apparatus further includes: a computational graph processing module, configured to perform optimization and/or quantization processing on the second computational graph to obtain a fourth computational graph, wherein the fourth computational graph is the input of the chip development environment, and the rate at which the fourth computational graph is run by the hardware is greater than the rate at which the second computational graph is run by the hardware.
  • the second calculation graph output by the calculation graph generation module is optimized and/or quantified by the calculation graph processing module to obtain a fourth calculation graph, which is sent to the chip development environment to run.
  • the computation graph structure of the fourth computation graph input to the chip development environment has a fast simulation speed, and retains key information of nodes in the computation graph required by the chip development environment.
  • the computational graph processing module can extract computational graph information from the second computational graph and process it, thereby generating a new computational graph, that is, the fourth computational graph; processing the computation graph information extracted from the second computation graph includes performing operator fusion, adding new operator parameters, and the like according to the architectural characteristics of the chip.
  • the second calculation graph to be input and run in the chip development environment undergoes optimization and/or quantization processing to obtain the fourth calculation graph, so that the rate at which the fourth calculation graph is run by hardware is greater than the rate at which the second calculation graph is run by the hardware, which helps improve the running speed and reduce the running time of the chip development environment.
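One of the processing steps named above, operator fusion, can be sketched as follows. It is a hedged illustration only: it assumes, for the sake of example, a chip that can execute a conv2d followed by a relu as one fused hardware node; the operator names and fusion table are hypothetical.

```python
# Hypothetical fusion table: adjacent operator pairs the chip can run as one op.
FUSABLE = {("conv2d", "relu"): "conv2d_relu"}

def fuse_operators(graph):
    """Produce a fourth graph by fusing adjacent fusable operator pairs."""
    fused, i = [], 0
    while i < len(graph):
        pair = tuple(graph[i:i + 2])
        if pair in FUSABLE:
            fused.append(FUSABLE[pair])  # one hardware op instead of two
            i += 2
        else:
            fused.append(graph[i])
            i += 1
    return fused

second_graph = ["conv2d", "relu", "pool", "conv2d", "relu"]
print(fuse_operators(second_graph))  # ['conv2d_relu', 'pool', 'conv2d_relu']
```

The fused graph has fewer nodes to simulate, which is the mechanism by which the fourth graph runs faster than the second in the chip development environment.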
  • the apparatus further includes: a computational graph statistics module, configured to perform information statistics on the second computational graph and/or the fourth computational graph to obtain computational graph information, wherein the computation graph information is an input of the chip development environment and is used to improve the rate at which the second computational graph and/or the fourth computational graph are run by hardware.
  • when the second calculation graph is directly input into the chip development environment, the calculation graph statistics module performs information statistics on the second calculation graph to obtain its calculation graph information and outputs that information to the chip development environment; when the fourth calculation graph is input into the chip development environment, the calculation graph statistics module performs information statistics on the fourth calculation graph to obtain its calculation graph information and outputs that information to the chip development environment.
  • the second calculation graph or the fourth calculation graph contains the calculation graph information of the target model, for example the TVM Relay calculation graph information; the chip development environment implements the function of each node and assembles the node implementations, so that the hardware deployment of the deep learning model (that is, the target model) can be completed.
  • the calculation graph information mainly consists of statistics on the operator parameter information corresponding to each node, and can be input into the chip development environment to guide chip hardware development.
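The statistics step can be illustrated with a small sketch: per operator name, count how many nodes use it and collect their parameter info, yielding the "computation graph information" handed to the chip development environment. The graph and field formats below are assumptions for illustration.

```python
from collections import defaultdict

def graph_statistics(graph):
    """Count, per operator name, node occurrences and gather parameter info."""
    stats = defaultdict(lambda: {"count": 0, "params": []})
    for node in graph:
        entry = stats[node["op_name"]]
        entry["count"] += 1
        entry["params"].append(node["op_params"])
    return dict(stats)

graph = [
    {"op_name": "conv2d", "op_params": {"kernel": (3, 3)}},
    {"op_name": "relu",   "op_params": {}},
    {"op_name": "conv2d", "op_params": {"kernel": (1, 1)}},
]
info = graph_statistics(graph)
print(info["conv2d"]["count"])  # 2
```

Such per-operator tallies tell the hardware team which operators and parameter configurations the chip must actually support, which is how the statistics guide hardware development.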
  • information statistics are performed on the calculation graph to be input and run in the chip development environment to obtain its calculation graph information, and the calculation graph information is input into the chip development environment, which can improve the running rate of the calculation graph in the chip development environment and thereby reduce the running time of the chip development environment.
  • the first computational graph and the third computational graph are saved in the form of text;
  • the second computational graph and the fourth computational graph are saved in the form of a python DataFrame.
  • the device for docking with TVM saves the first calculation graph and the third calculation graph in the form of text, that is, the TVM Relay calculation graph is saved as text, which realizes the decoupling of the TVM environment and the chip development environment, greatly reduces the computing resource requirements that TVM introduces into the chip development environment, and speeds up the running rate of the chip development environment.
  • the device for docking with TVM saves the second calculation graph and the fourth calculation graph in the form of a python DataFrame, and the python DataFrame can be output as an excel sheet; saving in this file form realizes the decoupling of the TVM docking environment (that is, the environment of the device for docking with TVM) and the chip development environment. The chip development environment only needs the excel sheet output by the device for docking with TVM, without integrating the environment of that device, which also speeds up the running rate of the chip development environment.
  • the second calculation graph and the fourth calculation graph exist in the form of python DataFrame, which can be output in the form of tabular text for visualization, as shown in Table 1 to Table 3.
  • the first calculation graph and the third calculation graph are saved in the form of text, which realizes the decoupling of the TVM environment and the chip development environment; the second calculation graph and the fourth calculation graph are saved in the form of a python DataFrame, which realizes the decoupling of the TVM docking environment and the chip development environment, thereby speeding up the running rate of the chip development environment. Saving the second calculation graph and the fourth calculation graph in the form of a python DataFrame also enables visualization of the calculation graphs.
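A sketch of the DataFrame-based saving described above follows. The column names are illustrative, not the patent's actual schema; the point is that once the graph is a pandas DataFrame, it can be exported to a spreadsheet or text file, so the chip development environment only needs the file, not the TVM toolchain.

```python
import pandas as pd

# Toy second computation graph: one row per node (illustrative columns).
second_graph = [
    {"node_label": 0, "op_name": "conv2d", "in_dims": "(1,3,224,224)", "out_dims": "(1,16,222,222)"},
    {"node_label": 1, "op_name": "relu",   "in_dims": "(1,16,222,222)", "out_dims": "(1,16,222,222)"},
]

df = pd.DataFrame(second_graph)
# df.to_excel("second_graph.xlsx", index=False)  # spreadsheet for the chip env
df.to_csv("second_graph.csv", index=False)       # text form also decouples envs
print(df.shape)  # (2, 4)
```

Printing or plotting the DataFrame gives the tabular visualization referred to in Tables 1 to 3.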
  • the embodiments of the present application provide a device for docking with TVM that efficiently connects TVM to a chip development environment, so that a deep learning model can be implemented in the chip development environment.
  • the TVM Relay calculation graph is introduced into the chip development environment while the decoupling of the TVM environment and the chip development environment is realized, which greatly reduces the computing resource requirements that TVM introduces into the chip development environment and speeds up the running rate of the chip development environment.
  • the calculation graph structure that the device for docking with TVM inputs into the chip development environment has a fast simulation rate, retains the key information of the nodes in the calculation graph required by the chip development environment, and can also be visualized.
  • FIG. 5 is a schematic flowchart of a method for docking a TVM provided by an embodiment of the present application.
  • the method for docking a TVM can be applied to an electronic device, and the method for docking a TVM includes but is not limited to the following steps.
  • the electronic device stores a TVM operator parameter template list, and the TVM operator parameter template list is obtained according to the operators used by TVM; generating the second calculation graph according to the first calculation graph includes: parsing the first calculation graph according to the TVM operator parameter template list to obtain the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node in the first calculation graph; and generating the second calculation graph according to the operator name, operator parameters, input data dimensions, output data dimensions, and node label corresponding to each node.
  • parsing the first computation graph according to the TVM operator parameter template list to obtain the operator name, operator parameters, dimensions of the input data, dimensions of the output data, and node label corresponding to each node in the first computation graph includes: searching the first calculation graph according to the TVM operator parameter template list to obtain the operator name corresponding to each node; extracting the operator parameters corresponding to each node from the TVM operator parameter template list according to the operator name corresponding to each node; extracting the dimensions of the input data and output data corresponding to each node from the TVM operator parameter template list according to the operator name corresponding to each node; and determining the node label corresponding to each node according to the connection relationships of the nodes in the first calculation graph.
  • generating the first calculation graph according to the target model using TVM includes: using TVM to generate a third calculation graph according to the target model; and processing the third calculation graph using the calculation graph optimization part and calculation graph quantization part of TVM to obtain the first calculation graph, wherein the rate at which the first calculation graph is run by hardware is greater than the rate at which the third calculation graph is run by the hardware.
  • before the third computation graph is processed by the computation graph optimization part and computation graph quantization part of TVM, the method further includes: modifying the computation graph optimization part and the computation graph quantization part according to a chip architecture, so as to adapt the computation graph optimization part and the computation graph quantization part to the chip development.
  • the method further includes: performing optimization and/or quantization processing on the second calculation graph to obtain a fourth calculation graph, wherein the fourth calculation graph is the input of the chip development environment, and the rate at which the fourth calculation graph is run by the hardware is greater than the rate at which the second calculation graph is run by the hardware.
  • the method further includes: performing information statistics on the second calculation graph and/or the fourth calculation graph to obtain calculation graph information, wherein the calculation graph information is the input of the chip development environment, and the calculation graph information is used to improve the rate at which the second calculation graph and/or the fourth calculation graph are run by hardware.
  • TVM is used to generate a first calculation graph according to a target model for chip development; that is, TVM turns the target model for chip development into a first calculation graph, whose structure is the calculation graph structure used by TVM. The first calculation graph is then used to generate a second calculation graph whose structure is the calculation graph structure used for chip development, so that the second calculation graph can serve as the input of the chip development environment, thereby introducing the TVM environment into the chip development environment. Since the structure of the second calculation graph is the calculation graph structure used for chip development, compared with the first calculation graph, the second calculation graph requires fewer computing resources and runs faster in the chip development environment; therefore, converting the first calculation graph into the second calculation graph and then inputting the second calculation graph into the chip development environment greatly reduces the computing resource requirements that TVM introduces into the chip development environment, improves the running speed, and reduces the running time of the chip development environment.
  • FIG. 6 is a schematic structural diagram of an electronic device 610 provided by an embodiment of the present application.
  • the electronic device 610 includes a processor 611, a memory 612, and a communication interface 613; the processor 611, the memory 612, and the communication interface 613 are connected to each other through a bus 614.
  • the memory 612 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM); the memory 612 is used to store related computer programs and data.
  • the communication interface 613 is used to receive and transmit data.
  • the processor 611 may be one or more central processing units (central processing units, CPUs). In the case where the processor 611 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 611 in the electronic device 610 is configured to read the computer program code stored in the above memory 612 and perform the following steps: using TVM to generate a first calculation graph according to a target model, wherein the target model is used for chip development; and generating a second calculation graph according to the first calculation graph, wherein the structure of the second calculation graph is the calculation graph structure used for the chip development, and the second calculation graph is the input of the chip development environment.
  • each operation may also correspond to the corresponding descriptions of the embodiments shown in FIG. 1 to FIG. 5 , which will not be repeated here.
  • TVM is used to generate a first calculation graph according to a target model for chip development; that is, TVM turns the target model for chip development into a first calculation graph, whose structure is the calculation graph structure used by TVM. The first calculation graph is then used to generate a second calculation graph whose structure is the calculation graph structure used for chip development, so that the second calculation graph can serve as the input of the chip development environment, thereby introducing the TVM environment into the chip development environment. Since the structure of the second calculation graph is the calculation graph structure used for chip development, compared with the first calculation graph, the second calculation graph requires fewer computing resources and runs faster in the chip development environment; therefore, converting the first calculation graph into the second calculation graph and then inputting the second calculation graph into the chip development environment greatly reduces the computing resource requirements that TVM introduces into the chip development environment, improves the running speed, and reduces the running time of the chip development environment.
  • An embodiment of the present application further provides a chip, the chip including at least one processor, a memory and an interface circuit, the memory, the interface circuit and the at least one processor being interconnected through a line, with a computer program stored in the at least one memory; when the computer program is executed by the above processor, the method flow shown in FIG. 5 is realized.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is run on a computer, the method flow shown in FIG. 5 is implemented.
  • the embodiment of the present application further provides a computer program product, when the above computer program product runs on a computer, the method flow shown in FIG. 5 is realized.
  • processors mentioned in the embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), and may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory mentioned in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
  • Volatile memory may be Random Access Memory (RAM), which acts as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as static RAM (Static RAM, SRAM), dynamic RAM (Dynamic RAM, DRAM), synchronous DRAM (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced SDRAM (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synchlink DRAM, SLDRAM), and direct Rambus RAM (Direct Rambus RAM, DR RAM).
  • when the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) may be integrated into the processor. The memory described herein is intended to include, but not be limited to, these and any other suitable types of memory.
  • the size of the sequence numbers of the above processes does not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the above units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • if the above functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods shown in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or other media that can store program code.
  • the modules in the apparatus of the embodiment of the present application may be combined, divided and deleted according to actual needs.


Abstract

Embodiments of the present application provide a method for docking with TVM and related devices. The method includes: using TVM to generate a first computation graph according to a target model, wherein the target model is used for chip development; and generating a second computation graph according to the first computation graph, wherein the structure of the second computation graph is the computation graph structure used for the chip development, and the second computation graph is the input of the chip development environment. The embodiments of the present application can reduce the computing resource requirements that TVM introduces into the chip development environment, increase the running speed, and reduce the running time of the chip development environment.

Description

对接TVM的方法及相关设备
本申请要求于2020年12月25日提交中国专利局,申请号为202011565749.2、发明名称为“对接TVM的方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理技术领域,尤其涉及一种对接TVM的方法及相关设备。
背景技术
TVM(Tensor Virtual Machine,矢量虚拟机)是一个支持图形处理器(GPU)、中央处理器(CPU)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)指令生成的开源编译器框架,其是目前的一项开源项目,主要作用于人工智能深度学习系统的编译器堆栈,或者说TVM是一个用于CPU、GPU和专用加速器的开放式深度学习编译器堆栈。TVM最大的特点是基于图和算符结构来优化指令生成,最大化硬件执行效率。TVM集成了量化,在深度学习推理时可以提升效率。TVM向上可以对接Tensorflow、Pytorch、Caffe(Convolutional Architecture for Fast Feature Embedding)等深度学习框架,其中,Caffe是一个兼具表达性、速率和思维模块化的深度学习框架;TVM向下可以兼容GPU、CPU、ARM处理器、张量处理器(Tensor Processing Unit,TPU)等硬件设备。当前,TVM无法直接应用在芯片上,但可以将TVM部分功能对接到芯片开发环境中来加速芯片开发流程。TVM使用Relay将深度学习模型构建为计算图(数据流),芯片针对计算图中的节点功能进行实现,完成初步硬件部署;其中Relay是一种功能多样的编程语言,用于机器学习系统表达的中间表示。然而,在芯片开发环境引入TVM,会造成运行速率很慢,减缓芯片开发的进度。
发明内容
本申请实施例公开了一种对接TVM的方法及相关设备,能够极大减少TVM引入对芯片开发环境的运算资源需求,提升运行速率,减少芯片开发环境的运行时间。
本申请实施例第一方面公开了一种对接TVM的装置,应用于电子设备,所述装置包括:TVM修正模块,用于采用TVM根据目标模型生成第一计算图,其中,所述目标模型用于芯片开发;计算图产生模块,用于根据所述第一计算图生成第二计算图,其中,所述第二计算图的结构为所述芯片开发使用的计算图结构,所述第二计算图为芯片开发环境的输入。
在本申请实施例中,采用TVM根据用于芯片开发的目标模型生成第一计算图,也即采用TVM把用于芯片开发的目标模型变成第一计算图,该第一计算图的结构是TVM使用的计算图结构;然后根据第一计算图生成第二计算图,该第二计算图的结构为芯片开发使用的计算图结构,从而第二计算图可以作为芯片开发环境的输入,实现将TVM环境引入到芯片开发环境中。由于第二计算图的结构为芯片开发使用的计算图结构,相比于第一计算图,第二计算图在芯片开发环境运行所需运算资源需求较小、运行速率较快;因此,将第一计算图转变为第二计算图,再将第二计算图输入到芯片开发环境运行,能够极大减少TVM引入对芯片开发环境的运算资源需求,提升运行速率,减少芯片开发环境的运行时间。
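上述"先生成TVM结构的第一计算图、再转换为芯片开发结构的第二计算图"的两步流程,可以用下面极简的Python代码示意(其中的函数名与节点字段均为本文虚构的示例,并非TVM或本申请装置的真实实现):

```python
# 示意性伪实现:演示"目标模型 -> 第一计算图(TVM风格) -> 第二计算图(芯片开发风格)"的两步流程。
# 函数名与数据结构均为假设性示例。

def model_to_first_graph(model_ops):
    """模拟采用TVM把模型变成第一计算图:用节点字典列表示意TVM风格的计算图结构。"""
    return [{"op": op, "id": i} for i, op in enumerate(model_ops)]

def first_to_second_graph(first_graph):
    """模拟结构转换:只保留芯片开发环境需要的字段,得到结构更紧凑的第二计算图。"""
    return {node["id"]: node["op"] for node in first_graph}

first = model_to_first_graph(["conv2d", "relu", "dense"])
second = first_to_second_graph(first)
```

第二计算图只保留芯片开发环境需要的信息,结构更小,这正是文中所述"减少运算资源需求、提升运行速率"的直观体现。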
在一种可能的实施方式中,所述计算图产生模块包括TVM算子参数模板列表和计算图解析单元,所述TVM算子参数模板列表根据所述TVM使用的算子得到;所述计算图解析单元,用于:根据所述TVM算子参数模板列表对所述第一计算图进行解析,以得到所述第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号;根据所述每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号生成所述第二计算图。
在本申请实施例中,TVM算子参数模板列表是根据TVM使用的算子得到,故TVM算子参数模板列表中可以包括TVM使用的所有算子的信息,而算子在计算图中的位置表现为计算图中的节点,根据TVM算子参数模板列表对第一计算图进行解析,可以得到第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号;然后根据解析得到的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号进行计算图结构重组,可以生成第二计算图;从而将TVM使用的计算图结构转变成芯片开发使用的计算图结构,有利于减少TVM引入对芯片开发环境的运算资源需求。
在一种可能的实施方式中,所述计算图解析单元包括:算子名称提取子单元,用于根据所述TVM算子参数模板列表在所述第一计算图中进行搜索,以得到所述每个节点对应的算子名称;算子参数提取子单元,用于根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的算子参数;输入输出数据维度提取子单元,用于根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的输入数据的维度、输出数据的维度;节点标号提取子单元,用于根据所述第一计算图中的节点的连接关系确定所述每个节点对应的节点标号。
在本申请实施例中,根据TVM算子参数模板列表对第一计算图进行搜索,可以得到第一计算图中每个节点对应的算子名称;然后根据第一计算图中每个节点对应的算子名称可以从TVM算子参数模板列表中提取到每个节点对应的算子参数,以及根据第一计算图中每个节点对应的算子名称可以从TVM算子参数模板列表中提取到每个节点对应的输入数据的维度、输出数据的维度;再根据第一计算图中的节点的连接关系确定第一计算图中每个节点对应的节点标号;从而得到第一计算图中每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号,有利于组合得到第二计算图。
在一种可能的实施方式中,所述TVM修正模块,具体用于:采用所述TVM根据所述目标模型生成第三计算图;采用所述TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理,得到所述第一计算图,其中,所述第一计算图被硬件运行的速率大于所述第三计算图被所述硬件运行的速率。
在本申请实施例中,先采用TVM根据所述目标模型生成第三计算图;再采用TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理,从而得到第一计算图;由于第一计算图是优化和量化后的计算图,去除了计算图中的无效节点、冗余节点的计算以及经过了数据类型的转换,故第一计算图被硬件运行的速率大于所述第三计算图被硬件运行的速率;依据优化和量化后得到的第一计算图生成第二计算图,有利于提升第二计算图在芯片开发环境中的运行速率。
在一种可能的实施方式中,所述TVM修正模块,还用于:根据芯片架构对所述计算图优化部分和所述计算图量化部分进行修改,以使所述计算图优化部分和所述计算图量化部分适配所述芯片开发。
在本申请实施例中,根据芯片架构对TVM的计算图优化部分和计算图量化部分进行修改,使其适配芯片开发,从而有利于TVM的计算图优化部分和计算图量化部分处理得到的第一计算图适配被芯片开发环境运行。
在一种可能的实施方式中,所述装置还包括:计算图处理模块,用于对所述第二计算图进行优化和/或量化处理,以得到第四计算图,其中,所述第四计算图为所述芯片开发环境的输入,所述第四计算图被硬件运行的速率大于所述第二计算图被所述硬件运行的速率。
在本申请实施例中,将需要输入到芯片开发环境中运行的第二计算图进行优化和/或量化处理,优化和/或量化处理后得到第四计算图,从而第四计算图被硬件运行的速率大于第二计算图被硬件运行的速率,有利于提升运行速率,减少芯片开发环境的运行时间。
在一种可能的实施方式中,所述装置还包括:计算图统计模块,用于对所述第二计算图和/或所述第四计算图进行信息统计,以得到计算图信息,其中,所述计算图信息为所述芯片开发环境的输入,所述计算图信息用于提升所述第二计算图和/或所述第四计算图被硬件运行的速率。
在本申请实施例中,对将要输入到芯片开发环境中运行的计算图进行信息统计,得到该计算图的计算图信息,将该计算图信息输入到芯片开发环境中,可以提升该计算图在芯片开发环境中的运行速率,从而减少芯片开发环境的运行时间。
在一种可能的实施方式中,所述第一计算图和所述第三计算图以文本的形式保存,所述第二计算图和所述第四计算图以python DataFrame的形式保存。
在本申请实施例中,将第一计算图和第三计算图以文本的形式保存,可以实现TVM环境与芯片开发环境的解耦;将第二计算图和第四计算图以python DataFrame的形式保存,可以实现TVM对接环境和芯片开发环境解耦,从而可以加快芯片开发环境的运行速率。此外,第二计算图和第四计算图以python DataFrame的形式保存,还可以实现计算图可视化。
本申请实施例第二方面公开了一种对接TVM的方法,应用于电子设备,所述方法包括:采用TVM根据目标模型生成第一计算图,其中,所述目标模型用于芯片开发;根据所述第一计算图生成第二计算图,其中,所述第二计算图的结构为所述芯片开发使用的计算图结构,所述第二计算图为芯片开发环境的输入。
在一种可能的实施方式中,所述电子设备存储有TVM算子参数模板列表,所述TVM算子参数模板列表根据所述TVM使用的算子得到;所述根据所述第一计算图生成第二计算图,包括:根据所述TVM算子参数模板列表对所述第一计算图进行解析,以得到所述第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号;根据所述每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号生成所述第二计算图。
在一种可能的实施方式中,所述根据所述TVM算子参数模板列表对所述第一计算图进行解析,以得到所述第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号,包括:根据所述TVM算子参数模板列表在所述第一计算图中进行搜索,以得到所述每个节点对应的算子名称;根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的算子参数;根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的输入数据的维度、输出数据的维度;根据所述第一计算图中的节点的连接关系确定所述每个节点对应的节点标号。
在一种可能的实施方式中,所述采用TVM根据目标模型生成第一计算图,包括:采用所述TVM根据所述目标模型生成第三计算图;采用所述TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理,得到所述第一计算图,其中,所述第一计算图被硬件运行的速率大于所述第三计算图被所述硬件运行的速率。
在一种可能的实施方式中,在所述采用所述TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理之前,所述方法还包括:根据芯片架构对所述计算图优化部分和所述计算图量化部分进行修改,以使所述计算图优化部分和所述计算图量化部分适配所述芯片开发。
在一种可能的实施方式中,所述方法还包括:对所述第二计算图进行优化和/或量化处理,以得到第四计算图,其中,所述第四计算图为所述芯片开发环境的输入,所述第四计算图被硬件运行的速率大于所述第二计算图被所述硬件运行的速率。
在一种可能的实施方式中,所述方法还包括:对所述第二计算图和/或所述第四计算图进行信息统计,以得到计算图信息,其中,所述计算图信息为所述芯片开发环境的输入,所述计算图信息用于提升所述第二计算图和/或所述第四计算图被硬件运行的速率。
在一种可能的实施方式中,所述第一计算图和所述第三计算图以文本的形式保存,所述第二计算图和所述第四计算图以python DataFrame的形式保存。
本申请实施例第三方面公开了一种电子设备,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置由所述处理器执行,所述程序包括用于执行如本申请实施例第二方面中任一项所述的方法中的步骤的指令。
本申请实施例第四方面公开了一种芯片,其特征在于,包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有所述芯片的设备执行如本申请实施例第二方面中任一项所述的方法。
本申请实施例第五方面公开了一种计算机可读存储介质,其特征在于,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如本申请实施例第二方面中任一项所述的方法。
本申请实施例第六方面公开了一种计算机程序产品,所述计算机程序产品使得计算机执行如本申请实施例第二方面中任一项所述的方法。
附图说明
图1是本申请实施例提供的一种用于芯片开发的系统的架构示意图。
图2是本申请实施例提供的一种计算图产生模块的结构示意图。
图3是本申请实施例提供的一种计算图解析单元的结构示意图。
图4是本申请实施例提供的一种计算图产生模块的内部逻辑示意图。
图5是本申请实施例提供的一种对接TVM的方法的流程示意图。
图6是本申请实施例提供的一种电子设备的结构示意图。
具体实施方式
请参阅图1,图1是本申请实施例提供的一种用于芯片开发的系统的架构示意图,该系统应用于电子设备,该系统包括TVM(Tensor Virtual Machine,张量虚拟机)、对接TVM的装置和芯片开发环境。其中,TVM中的TVM环境可以是历史项目的TVM环境,也可以是TVM原始环境。其中,该对接TVM的装置与TVM环境连接,该对接TVM的装置包括:
TVM修正模块,用于采用TVM根据目标模型生成第一计算图,其中,所述目标模型用于芯片开发;
计算图产生模块,用于根据所述第一计算图生成第二计算图,其中,所述第二计算图的结构为所述芯片开发使用的计算图结构,所述第二计算图为芯片开发环境的输入。
其中,目标模型为芯片开发需要支持的深度学习模型。
其中,计算图被定义为有向图,包括节点和有向边;其中节点对应于数学运算,也即其中节点对应算子或操作符(op),是表达和评估数学表达式的一种方式。第一计算图的计算图结构和第二计算图的计算图结构不同,生成第一计算图的编程语言和生成第二计算图的编程语言也不同。如图1所示,TVM修正模块采用的编程语言和TVM采用的编程语言是一样的,但其和计算图产生模块采用的编程语言是不同的。
具体地,第一计算图可以是TVM Relay计算图的计算图结构,TVM修正模块针对芯片开发需要支持的模型,采用TVM Relay生成TVM Relay计算图;如果芯片开发需要支持的模型有多个,那么TVM修正模块可以针对芯片开发需要支持的模型列表,采用TVM Relay产生计算图文件列表,由于该模型列表中包括多个模型,故该计算图文件列表包括这多个模型对应的多个TVM Relay计算图;其中,计算图文件列表以文本的形式存在,在实际应用中,可以为txt文件或log文件,从而,计算图文件列表具备可视化。计算图产生模块可以将TVM Relay计算图转换成第二计算图,而第二计算图的结构为芯片开发使用的计算图结构;若有多个TVM Relay计算图,则将这多个TVM Relay计算图都转换成芯片开发使用的计算图结构的第二计算图。其中,对接TVM的装置输出的第二计算图将作为芯片开发环境的输入。
其中,计算图产生模块可以对第一计算图进行解析,提取出第一计算图中节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号等信息,并将这些信息保存在第二计算图里面,该第二计算图可以以python DataFrame形式或者其它数据形式存在。其中,python是计算机程序设计语言;DataFrame是一个表格型的数据结构,是python pandas库中定义的。
应理解,由于TVM是个很大的环境,如果将TVM生成的第一计算图输入芯片开发环境中运行,例如直接将TVM Relay计算图输入芯片开发环境中运行,会造成运行速率很慢,减缓芯片开发的进度;其原因是:第一计算图的结构大,其不是芯片开发使用的计算图结构,因此其运行所需的运算资源需求较大。而如果将第一计算图经过结构转换,将其转换成芯片开发使用的计算图结构的第二计算图,再将第二计算图输入到芯片开发环境中运行,可以明显减少运行所需的运算资源需求,从而提升运行速率。
需要说明的是,一个目标模型可能对应一张计算图,也可能对应多张计算图,也即第一计算图包括多张TVM Relay计算图。当目标模型对应多张TVM Relay计算图时,也需要将这多张TVM Relay计算图转换成第二计算图,其中多张TVM Relay计算图可能转换成一张第二计算图,也可能转换成多张第二计算图,本申请对其不作具体限定。当多张TVM Relay计算图转换成多张第二计算图时,这多张第二计算图均是芯片开发环境的输入。
在本申请实施例中,采用TVM根据用于芯片开发的目标模型生成第一计算图,也即采用TVM把用于芯片开发的目标模型变成第一计算图,该第一计算图的结构是TVM使用的计算图结构;然后根据第一计算图生成第二计算图,该第二计算图的结构为芯片开发使用的计算图结构,从而第二计算图可以作为芯片开发环境的输入,实现将TVM环境引入到芯片开发环境中。由于第二计算图的结构为芯片开发使用的计算图结构,相比于第一计算图,第二计算图在芯片开发环境运行所需运算资源需求较小、运行速率较快;因此,将第一计算图转变为第二计算图,再将第二计算图输入到芯片开发环境运行,能够极大减少TVM引入对芯片开发环境的运算资源需求,提升运行速率,减少芯片开发环境的运行时间。
在一种可能的实施方式中,所述计算图产生模块包括TVM算子参数模板列表和计算图解析单元,所述TVM算子参数模板列表根据所述TVM使用的算子得到;所述计算图解析单元,用于:根据所述TVM算子参数模板列表对所述第一计算图进行解析,以得到所述第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号;根据所述每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号生成所述第二计算图。
具体地,请参阅图2,图2是本申请实施例提供的一种计算图产生模块的结构示意图,该计算图产生模块的输入为TVM Relay计算图,输出为根据芯片开发需要的算子信息组成的芯片开发使用的计算图结构,也即第二计算图。该计算图产生模块包括TVM算子参数模板列表和计算图解析单元;TVM算子参数模板列表是根据TVM Relay算子定义维护的列表,包含算子的参数定义;TVM算子参数模板列表可以根据芯片开发需要支持的模型来添加对应的TVM Relay算子;计算图解析单元根据TVM算子参数模板列表解析出TVM Relay计算图中的节点对应的算子名称、算子参数、输入数据的(向量)维度、输出数据的(向量)维度、节点标号等。
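"用算子参数模板列表在文本形式的计算图中搜索匹配"这一解析思路,可以用如下Python代码示意(模板内容与计算图文本均为虚构示例,并非真实的TVM Relay输出格式或模板定义):

```python
# 示意:用一个简化的"TVM算子参数模板列表"在文本形式的计算图中逐行搜索算子。
# OP_TEMPLATES 的键为算子名称,值为该算子的参数名列表(均为虚构示例)。

OP_TEMPLATES = {
    "nn.conv2d": ["kernel_size", "strides", "padding"],
    "nn.relu": [],
    "nn.dense": ["units"],
}

def parse_graph_text(graph_text):
    """逐行搜索模板中的算子名称,返回 (节点序号, 算子名称, 参数名列表) 列表。"""
    nodes = []
    for idx, line in enumerate(graph_text.strip().splitlines()):
        for op_name, params in OP_TEMPLATES.items():
            if op_name in line:
                nodes.append((idx, op_name, params))
    return nodes

text = """%0 = nn.conv2d(%data, %w)
%1 = nn.relu(%0)"""
nodes = parse_graph_text(text)
```

匹配到的每个算子连同其参数模板,即可作为第二计算图中对应节点的基础信息。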
在本申请实施例中,TVM算子参数模板列表是根据TVM使用的算子得到,故TVM算子参数模板列表中可以包括TVM使用的所有算子的信息,而算子在计算图中的位置表现为计算图中的节点,根据TVM算子参数模板列表对第一计算图进行解析,可以得到第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号;然后根据解析得到的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号进行计算图结构重组,可以生成第二计算图;从而将TVM使用的计算图结构转变成芯片开发使用的计算图结构,有利于减少TVM引入对芯片开发环境的运算资源需求。
在一种可能的实施方式中,所述计算图解析单元包括:算子名称提取子单元,用于根据所述TVM算子参数模板列表在所述第一计算图中进行搜索,以得到所述每个节点对应的算子名称;算子参数提取子单元,用于根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的算子参数;输入输出数据维度提取子单元,用于根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的输入数据的维度、输出数据的维度;节点标号提取子单元,用于根据所述第一计算图中的节点的连接关系确定所述每个节点对应的节点标号。
具体地,请参阅图3,图3是本申请实施例提供的一种计算图解析单元的结构示意图,该计算图解析单元包括算子名称提取子单元、算子参数提取子单元、输入输出数据维度提取子单元和节点标号提取子单元。算子名称提取子单元根据TVM算子参数模板列表在第一计算图中进行搜索,以得到每个节点对应的算子名称;也即将TVM算子参数模板列表中的参数模板分别在目标模型对应的TVM Relay计算图中进行搜索匹配,匹配到的算子作为芯片开发使用的第二计算图中的节点对应的算子。算子参数提取子单元还将TVM算子参数模板列表中的算子对应的算子参数合入第二计算图对应节点的信息中,TVM算子参数模板列表中加入第二计算图中的算子参数是可以选择的。输入输出数据维度提取子单元将算子的输入数据和输出数据的维度也加入到第二计算图中。节点标号提取子单元根据TVM Relay计算图中节点的连接关系生成第二计算图的节点标号;其中,如图4所示,不同模型的输出第二计算图的连接关系不同,该连接关系在第二计算图中用节点标号来表示,节点标号包括输入节点的节点标号、输出节点的节点标号以及当前节点的节点标号;举例来说,节点1、节点2、节点3依次相连,则节点2的节点标号包括节点1的节点标号(输入节点的节点标号)、节点3的节点标号(输出节点的节点标号)以及节点2的节点标号(当前节点的节点标号)。
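上段所述"节点标号包括输入节点标号、当前节点标号、输出节点标号"的生成方式,可以用下面的Python代码示意(数据结构为本文虚构的示例,仅说明根据连接关系推导标号的原理):

```python
# 示意:根据节点连接关系为每个节点生成"输入节点标号 / 当前节点标号 / 输出节点标号"。
# edges 中的 (a, b) 表示节点 a 的输出连到节点 b。

from collections import defaultdict

def label_nodes(node_ids, edges):
    inputs, outputs = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        outputs[src].append(dst)   # src 的输出节点
        inputs[dst].append(src)    # dst 的输入节点
    return {n: {"inputs": inputs[n], "id": n, "outputs": outputs[n]} for n in node_ids}

# 对应文中"节点1、节点2、节点3依次相连"的例子
labels = label_nodes([1, 2, 3], [(1, 2), (2, 3)])
```

如文中例子所示,节点2的标号信息同时记录了其输入节点1与输出节点3。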
在本申请实施例中,根据TVM算子参数模板列表对第一计算图进行搜索,可以得到第一计算图中每个节点对应的算子名称;然后根据第一计算图中每个节点对应的算子名称可以从TVM算子参数模板列表中提取到每个节点对应的算子参数,以及根据第一计算图中每个节点对应的算子名称可以从TVM算子参数模板列表中提取到每个节点对应的输入数据的维度、输出数据的维度;再根据第一计算图中的节点的连接关系确定第一计算图中每个节点对应的节点标号;从而得到第一计算图中每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号,有利于组合得到第二计算图。
在一种可能的实施方式中,所述TVM修正模块,具体用于:采用所述TVM根据所述目标模型生成第三计算图;采用所述TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理,得到所述第一计算图,其中,所述第一计算图被硬件运行的速率大于所述第三计算图被所述硬件运行的速率。
具体地,TVM修正模块根据芯片开发需要支持的目标模型采用TVM Relay产生第三计算图,其中第三计算图也是TVM Relay计算图的计算图结构,采用TVM的计算图优化部分和计算图量化部分对第三计算图进行优化处理和量化处理,从而得到第一计算图。其中,第一计算图、第三计算图可以以文本的形式存在,在实际应用中,可以为txt文件或log文件,从而实现TVM环境与芯片开发环境的解耦。
其中,上述优化部分是对计算图结构进行优化,比如op1-op2-op3组成一个计算图,如果op2是冗余,可以删掉,优化后就变成op1-op3,这个就是优化部分,优化的目的是通过优化计算图结构来加快模型在硬件上的处理速率。而上述量化部分不涉及对计算图结构的改变,主要是对模型数据类型的变换,将模型的数据类型由浮点转成定点,也是为了加快模型在硬件上的处理速率。
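上段的两类处理(删除冗余节点的结构优化、浮点转定点的量化)可以用如下Python代码作原理性示意(冗余判定规则与量化方式均为简化的假设性示例,并非TVM实际的优化与量化实现):

```python
# 示意文中两类处理:
# 1) 优化:op1-op2-op3 中 op2 冗余时将其删除,得到 op1-op3;
# 2) 量化:把浮点数据按最大绝对值缩放为 8 位定点整数(仅为原理示意)。

def remove_redundant(chain, redundant_ops=("identity",)):
    """结构优化:从算子链中删除冗余算子,不改变其余节点的相对顺序。"""
    return [op for op in chain if op not in redundant_ops]

def quantize_int8(values):
    """量化:浮点 -> int8 定点,返回定点值列表与缩放因子 scale。"""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

optimized = remove_redundant(["op1", "identity", "op3"])
q, scale = quantize_int8([0.5, -1.0, 0.25])
```

优化改变计算图结构,量化只改变数据类型,两者的目的都是加快模型在硬件上的处理速率。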
在本申请实施例中,先采用TVM根据所述目标模型生成第三计算图;再采用TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理,从而得到第一计算图;由于第一计算图是优化和量化后的计算图,去除了计算图中的无效节点、冗余节点的计算以及经过了数据类型的转换,故第一计算图被硬件运行的速率大于所述第三计算图被硬件运行的速率;依据优化和量化后得到的第一计算图生成第二计算图,有利于提升第二计算图在芯片开发环境中的运行速率。
在一种可能的实施方式中,所述TVM修正模块,还用于:根据芯片架构对所述计算图优化部分和所述计算图量化部分进行修改,以使所述计算图优化部分和所述计算图量化部分适配所述芯片开发。
其中,根据芯片架构对所述计算图优化部分和所述计算图量化部分进行修改,也即根据芯片的架构特性对所述计算图优化部分和所述计算图量化部分进行修改;芯片架构是指对芯片对象类别和属性的描述,对于每一个对象类别来说,该架构定义了对象类必须具有的属性,它也可以有附加的属性,并且该对象可以是它的父对象;主流的芯片架构有ARM、MIPS、x86等。
具体地,TVM修正模块可以先根据芯片的架构特性对TVM的计算图优化部分、计算图量化部分进行修改,然后针对芯片开发需要支持的目标模型采用TVM Relay产生第一计算图;或者TVM修正模块可以先针对芯片开发需要支持的目标模型采用TVM Relay产生第三计算图,然后根据芯片的架构特性对TVM的计算图优化部分、计算图量化部分进行修改,再采用计算图优化部分、计算图量化部分对第三计算图进行优化和量化,从而得到第一计算图。
在本申请实施例中,根据芯片架构对TVM的计算图优化部分和计算图量化部分进行修改,使其适配芯片开发,从而有利于TVM的计算图优化部分和计算图量化部分处理得到的第一计算图适配被芯片开发环境运行。
在一种可能的实施方式中,所述装置还包括:计算图处理模块,用于对所述第二计算图进行优化和/或量化处理,以得到第四计算图,其中,所述第四计算图为所述芯片开发环境的输入,所述第四计算图被硬件运行的速率大于所述第二计算图被所述硬件运行的速率。
其中,计算图产生模块输出的第二计算图,经过计算图处理模块进行优化和/或量化处理后,得到第四计算图,将第四计算图送入芯片开发环境中运行。输入到芯片开发环境的该第四计算图的计算图结构仿真速率快,而且保留了芯片开发环境需要的计算图中节点的关键信息。
具体地,计算图处理模块可以从第二计算图中提取计算图信息,并对从第二计算图中提取到的计算图信息进行处理,从而产生新的计算图,也即产生第四计算图;其中,对从第二计算图中提取到的计算图信息进行处理包括根据芯片的架构特性进行算子融合、添加新的算子参数等。
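上段提到的"根据芯片的架构特性进行算子融合、添加新的算子参数",可以用下面的Python代码示意(融合规则与参数名均为虚构示例,实际的融合策略取决于具体芯片架构):

```python
# 示意:把相邻的 conv2d + relu 融合成一个节点,并为融合节点添加新的算子参数。
# 融合规则与 "fused_activation" 参数名均为假设性示例。

def fuse_conv_relu(nodes):
    fused, i = [], 0
    while i < len(nodes):
        if (i + 1 < len(nodes) and nodes[i]["op"] == "conv2d"
                and nodes[i + 1]["op"] == "relu"):
            # 融合:conv2d 吸收其后的 relu,并记录新的算子参数
            fused.append({"op": "conv2d", "fused_activation": "relu"})
            i += 2  # 跳过被融合的 relu 节点
        else:
            fused.append(nodes[i])
            i += 1
    return fused

result = fuse_conv_relu([{"op": "conv2d"}, {"op": "relu"}, {"op": "dense"}])
```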
在本申请实施例中,将需要输入到芯片开发环境中运行的第二计算图进行优化和/或量化处理,优化和/或量化处理后得到第四计算图,从而第四计算图被硬件运行的速率大于第二计算图被硬件运行的速率,有利于提升运行速率,减少芯片开发环境的运行时间。
在一种可能的实施方式中,所述装置还包括:计算图统计模块,用于对所述第二计算图和/或所述第四计算图进行信息统计,以得到计算图信息,其中,所述计算图信息为所述芯片开发环境的输入,所述计算图信息用于提升所述第二计算图和/或所述第四计算图被硬件运行的速率。
具体地,当直接将第二计算图输入到芯片开发环境中时,计算图统计模块对第二计算图进行信息统计,以得到第二计算图的计算图信息,并将第二计算图的计算图信息输出到芯片开发环境中;当将第四计算图输入到芯片开发环境中时,计算图统计模块对第四计算图进行信息统计,以得到第四计算图的计算图信息,并将第四计算图的计算图信息输出到芯片开发环境中。
需要说明的是,根据芯片开发环境的需要可以选择将第二计算图、第二计算图的计算图信息中的两个或者任意一个送入到芯片开发环境中;或者将第四计算图、第四计算图的计算图信息中的两个或者任意一个送入到芯片开发环境中。第二计算图或第四计算图包含着目标模型的计算图信息,例如,第二计算图或第四计算图包含着TVM Relay计算图信息;芯片通过对第二计算图或第四计算图中的各个节点功能进行实现,对各个节点分别进行汇编实现,可以完成深度学习模型(也即目标模型)的硬件部署。计算图信息主要统计着某个节点对应的算子参数信息等内容,输入到芯片开发环境中可以指导芯片硬件开发。
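对计算图做信息统计、得到可指导硬件开发的"计算图信息",可以用如下Python代码示意(统计维度为虚构示例,实际统计内容由芯片开发环境的需要决定):

```python
# 示意:对计算图做信息统计,得到每种算子的出现次数等"计算图信息"。

from collections import Counter

def graph_stats(nodes):
    """统计节点总数与每种算子的出现次数。"""
    ops = Counter(n["op"] for n in nodes)
    return {"num_nodes": len(nodes), "op_count": dict(ops)}

info = graph_stats([{"op": "conv2d"}, {"op": "relu"}, {"op": "conv2d"}])
```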
在本申请实施例中,对将要输入到芯片开发环境中运行的计算图进行信息统计,得到该计算图的计算图信息,将该计算图信息输入到芯片开发环境中,可以提升该计算图在芯片开发环境中运行速率,从而减少芯片开发环境的运行时间。
在一种可能的实施方式中,所述第一计算图和所述第三计算图以文本的形式保存,所述第二计算图和所述第四计算图以python DataFrame的形式保存。
具体地,对接TVM的装置将第一计算图和第三计算图以文本的形式保存,也即将TVM Relay计算图以文本的形式保存,实现TVM环境与芯片开发环境的解耦,极大减少TVM引入对芯片开发环境的运算资源需求,加快芯片开发环境的运行速率。对接TVM的装置将第二计算图和第四计算图以python DataFrame的形式保存,python DataFrame可以输出成excel表格,以这样的文件的形式保存,可以实现TVM对接环境和芯片开发环境解耦,TVM对接环境也即对接TVM的装置的环境。芯片开发环境只需要输入对接TVM的装置输出的excel表格,不用集成对接TVM的装置的环境,也可以加快芯片开发环境的运行速率。此外,第二计算图和第四计算图以python DataFrame形式存在,可以用表格文本的形式输出,实现可视化,如表1至表3所示。
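"计算图以python DataFrame形式保存并导出表格"可以用如下代码示意(列名与节点数据均为虚构示例,仅说明"计算图 -> 表格"的保存与可视化形式):

```python
# 示意:把第二计算图的节点信息保存为 pandas DataFrame,便于导出表格文件并可视化。
# 列名(node_id/op_name/in_shape/out_shape)为假设性示例。

import pandas as pd

nodes = [
    {"node_id": 0, "op_name": "conv2d", "in_shape": "(1,3,224,224)", "out_shape": "(1,64,112,112)"},
    {"node_id": 1, "op_name": "relu", "in_shape": "(1,64,112,112)", "out_shape": "(1,64,112,112)"},
]
df = pd.DataFrame(nodes)
# 导出为表格文本;也可用 df.to_excel(...) 导出 excel(需安装相应写入依赖)
csv_text = df.to_csv(index=False)
```

芯片开发环境只需读取导出的表格文件,即可获得计算图节点信息,而无需集成TVM对接环境。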
表1(表格内容在公开文本中以图像形式给出,此处无法提取)
表2(表格内容在公开文本中以图像形式给出,此处无法提取)
表3(表格内容在公开文本中以图像形式给出,此处无法提取)
其中,上述表1至表3中的参数的中文释义如表4所示。
表4 参数含义表(表格内容在公开文本中以图像形式给出,此处无法提取)
在本申请实施例中,将第一计算图和第三计算图以文本的形式保存,可以实现TVM环境与芯片开发环境的解耦;将第二计算图和第四计算图以python DataFrame的形式保存,可以实现TVM对接环境和芯片开发环境解耦,从而可以加快芯片开发环境的运行速率。此外,第二计算图和第四计算图以python DataFrame的形式保存,还可以实现计算图可视化。
综上,本申请实施例提供一种高效地将TVM对接到芯片开发环境中的对接TVM的装置,通过该对接TVM的装置可以实现将深度学习模型落地到芯片开发环境中,也即将深度学习模型的TVM Relay计算图引入到芯片开发环境中,并实现TVM环境与芯片开发环境的解耦,极大减少TVM引入对芯片开发环境的运算资源需求,加快芯片开发环境的运行速率。该对接TVM的装置输入到芯片开发环境的计算图结构仿真速率快,而且保留了芯片开发环境需要的计算图中节点的关键信息,还可以实现可视化。
请参阅图5,图5是本申请实施例提供的一种对接TVM的方法的流程示意图,该对接TVM的方法可应用于电子设备,该对接TVM的方法包括但不限于以下步骤。
501、采用TVM根据目标模型生成第一计算图,其中,所述目标模型用于芯片开发;
502、根据所述第一计算图生成第二计算图,其中,所述第二计算图的结构为所述芯片开发使用的计算图结构,所述第二计算图为芯片开发环境的输入。
在一种可能的实施方式中,所述电子设备存储有TVM算子参数模板列表,所述TVM算子参数模板列表根据所述TVM使用的算子得到;所述根据所述第一计算图生成第二计算图,包括:根据所述TVM算子参数模板列表对所述第一计算图进行解析,以得到所述第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号;根据所述每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号生成所述第二计算图。
在一种可能的实施方式中,所述根据所述TVM算子参数模板列表对所述第一计算图进行解析,以得到所述第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号,包括:根据所述TVM算子参数模板列表在所述第一计算图中进行搜索,以得到所述每个节点对应的算子名称;根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的算子参数;根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的输入数据的维度、输出数据的维度;根据所述第一计算图中的节点的连接关系确定所述每个节点对应的节点标号。
在一种可能的实施方式中,所述采用TVM根据目标模型生成第一计算图,包括:采用所述TVM根据所述目标模型生成第三计算图;采用所述TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理,得到所述第一计算图,其中,所述第一计算图被硬件运行的速率大于所述第三计算图被所述硬件运行的速率。
在一种可能的实施方式中,在所述采用所述TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理之前,所述方法还包括:根据芯片架构对所述计算图优化部分和所述计算图量化部分进行修改,以使所述计算图优化部分和所述计算图量化部分适配所述芯片开发。
在一种可能的实施方式中,所述方法还包括:对所述第二计算图进行优化和/或量化处理,以得到第四计算图,其中,所述第四计算图为所述芯片开发环境的输入,所述第四计算图被硬件运行的速率大于所述第二计算图被所述硬件运行的速率。
在一种可能的实施方式中,所述方法还包括:对所述第二计算图和/或所述第四计算图进行信息统计,以得到计算图信息,其中,所述计算图信息为所述芯片开发环境的输入,所述计算图信息用于提升所述第二计算图和/或所述第四计算图被硬件运行的速率。
需要说明的是,本申请实施例中所描述的对接TVM的方法的具体流程,可参见上述图1至图4中所示的实施例中的相关描述,此处不再赘述。
在图5所描述的对接TVM的方法中,采用TVM根据用于芯片开发的目标模型生成第一计算图,也即采用TVM把用于芯片开发的目标模型变成第一计算图,该第一计算图的结构是TVM使用的计算图结构;然后根据第一计算图生成第二计算图,该第二计算图的结构为芯片开发使用的计算图结构,从而第二计算图可以作为芯片开发环境的输入,实现将TVM环境引入到芯片开发环境中。由于第二计算图的结构为芯片开发使用的计算图结构,相比于第一计算图,第二计算图在芯片开发环境运行所需运算资源需求较小、运行速率较快;因此,将第一计算图转变为第二计算图,再将第二计算图输入到芯片开发环境运行,能够极大减少TVM引入对芯片开发环境的运算资源需求,提升运行速率,减少芯片开发环境的运行时间。
请参见图6,图6是本申请实施例提供的一种电子设备610的结构示意图,该电子设备610包括处理器611、存储器612和通信接口613,上述处理器611、存储器612和通信接口613通过总线614相互连接。
存储器612包括但不限于随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM),该存储器612用于存储相关计算机程序及数据。通信接口613用于接收和发送数据。
处理器611可以是一个或多个中央处理器(central processing unit,CPU),在处理器611是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。
该电子设备610中的处理器611用于读取上述存储器612中存储的计算机程序代码,执行以下步骤:采用TVM根据目标模型生成第一计算图,其中,所述目标模型用于芯片开发;根据所述第一计算图生成第二计算图,其中,所述第二计算图的结构为所述芯片开发使用的计算图结构,所述第二计算图为芯片开发环境的输入。
需要说明的是,各个操作的实现还可以对应参照图1至图5所示的实施例的相应描述,此处不再赘述。
在图6所描述的电子设备610中,采用TVM根据用于芯片开发的目标模型生成第一计算图,也即采用TVM把用于芯片开发的目标模型变成第一计算图,该第一计算图的结构是TVM使用的计算图结构;然后根据第一计算图生成第二计算图,该第二计算图的结构为芯片开发使用的计算图结构,从而第二计算图可以作为芯片开发环境的输入,实现将TVM环境引入到芯片开发环境中。由于第二计算图的结构为芯片开发使用的计算图结构,相比于第一计算图,第二计算图在芯片开发环境运行所需运算资源需求较小、运行速率较快;因此,将第一计算图转变为第二计算图,再将第二计算图输入到芯片开发环境运行,能够极大减少TVM引入对芯片开发环境的运算资源需求,提升运行速率,减少芯片开发环境的运行时间。
本申请实施例还提供一种芯片,上述芯片包括至少一个处理器、存储器和接口电路,上述存储器、上述接口电路和上述至少一个处理器通过线路互联,上述存储器中存储有计算机程序;上述计算机程序被上述处理器执行时,图5所示的方法流程得以实现。
本申请实施例还提供一种计算机可读存储介质,上述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,图5所示的方法流程得以实现。
本申请实施例还提供一种计算机程序产品,当上述计算机程序产品在计算机上运行时,图5所示的方法流程得以实现。
应理解,本申请实施例中提及的处理器可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中提及的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
需要说明的是,当处理器为通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件时,存储器(存储模块)集成在处理器中。
应注意,本文描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
还应理解,本文中涉及的第一、第二、第三、第四以及各种数字编号仅为描述方便进行的区分,并不用来限制本申请的范围。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
上述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所示方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减。
本申请实施例装置中的模块可以根据实际需要进行合并、划分和删减。

Claims (10)

  1. 一种对接TVM的装置,其特征在于,应用于电子设备,所述装置包括:
    TVM修正模块,用于采用TVM根据目标模型生成第一计算图,其中,所述目标模型用于芯片开发;
    计算图产生模块,用于根据所述第一计算图生成第二计算图,其中,所述第二计算图的结构为所述芯片开发使用的计算图结构,所述第二计算图为芯片开发环境的输入。
  2. 根据权利要求1所述的装置,其特征在于,所述计算图产生模块包括TVM算子参数模板列表和计算图解析单元,所述TVM算子参数模板列表根据所述TVM使用的算子得到;所述计算图解析单元,用于:
    根据所述TVM算子参数模板列表对所述第一计算图进行解析,以得到所述第一计算图中的每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号;
    根据所述每个节点对应的算子名称、算子参数、输入数据的维度、输出数据的维度、节点标号生成所述第二计算图。
  3. 根据权利要求2所述的装置,其特征在于,所述计算图解析单元包括:
    算子名称提取子单元,用于根据所述TVM算子参数模板列表在所述第一计算图中进行搜索,以得到所述每个节点对应的算子名称;
    算子参数提取子单元,用于根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的算子参数;
    输入输出数据维度提取子单元,用于根据所述每个节点对应的算子名称从所述TVM算子参数模板列表中提取所述每个节点对应的输入数据的维度、输出数据的维度;
    节点标号提取子单元,用于根据所述第一计算图中的节点的连接关系确定所述每个节点对应的节点标号。
  4. 根据权利要求1-3任一项所述的装置,其特征在于,所述TVM修正模块,具体用于:
    采用所述TVM根据所述目标模型生成第三计算图;
    采用所述TVM的计算图优化部分和计算图量化部分对所述第三计算图进行处理,得到所述第一计算图,其中,所述第一计算图被硬件运行的速率大于所述第三计算图被所述硬件运行的速率。
  5. 根据权利要求4所述的装置,其特征在于,所述TVM修正模块,还用于:
    根据芯片架构对所述计算图优化部分和所述计算图量化部分进行修改,以使所述计算图优化部分和所述计算图量化部分适配所述芯片开发。
  6. 根据权利要求1-5中任一项所述的装置,其特征在于,所述装置还包括:
    计算图处理模块,用于对所述第二计算图进行优化和/或量化处理,以得到第四计算图,其中,所述第四计算图为所述芯片开发环境的输入,所述第四计算图被硬件运行的速率大于所述第二计算图被所述硬件运行的速率。
  7. 根据权利要求1-6中任一项所述的装置,其特征在于,所述装置还包括:
    计算图统计模块,用于对所述第二计算图和/或所述第四计算图进行信息统计,以得到计算图信息,其中,所述计算图信息为所述芯片开发环境的输入,所述计算图信息用于提升所述第二计算图和/或所述第四计算图被硬件运行的速率。
  8. 一种对接TVM的方法,其特征在于,应用于电子设备,所述方法包括:
    采用TVM根据目标模型生成第一计算图,其中,所述目标模型用于芯片开发;
    根据所述第一计算图生成第二计算图,其中,所述第二计算图的结构为所述芯片开发使用的计算图结构,所述第二计算图为芯片开发环境的输入。
  9. 一种电子设备,其特征在于,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置由所述处理器执行,所述程序包括用于执行如权利要求8所述的方法中的步骤的指令。
  10. 一种计算机可读存储介质,其特征在于,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如权利要求8所述的方法。
PCT/CN2021/133512 2020-12-25 2021-11-26 对接tvm的方法及相关设备 WO2022135028A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011565749.2A CN112527272B (zh) 2020-12-25 2020-12-25 对接tvm的方法及相关设备
CN202011565749.2 2020-12-25

Publications (1)

Publication Number Publication Date
WO2022135028A1 true WO2022135028A1 (zh) 2022-06-30

Family

ID=74976468

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133512 WO2022135028A1 (zh) 2020-12-25 2021-11-26 对接tvm的方法及相关设备

Country Status (2)

Country Link
CN (1) CN112527272B (zh)
WO (1) WO2022135028A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629330A (zh) * 2023-04-24 2023-08-22 北京大学 一种算子检测方法、装置以及计算机设备

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527272B (zh) * 2020-12-25 2023-11-17 深圳云天励飞技术股份有限公司 对接tvm的方法及相关设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110766147A (zh) * 2018-07-25 2020-02-07 赛灵思公司 神经网络编译器架构及编译方法
CN110764744A (zh) * 2018-07-25 2020-02-07 赛灵思公司 用于神经网络计算的中间表示生成方法和装置
CN110968321A (zh) * 2019-10-25 2020-04-07 浙江省北大信息技术高等研究院 张量计算代码优化方法、装置、设备及介质
CN111338635A (zh) * 2020-02-20 2020-06-26 腾讯科技(深圳)有限公司 计算图的图编译方法、装置、设备及存储介质
CN112527272A (zh) * 2020-12-25 2021-03-19 深圳云天励飞技术股份有限公司 对接tvm的方法及相关设备

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111124656B (zh) * 2018-10-31 2023-09-15 伊姆西Ip控股有限责任公司 用于向专用计算资源分配任务的方法、设备和计算机可读存储介质
CN110929851A (zh) * 2019-11-27 2020-03-27 探智立方(北京)科技有限公司 基于计算图子图的ai模型自动生成的方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110766147A (zh) * 2018-07-25 2020-02-07 赛灵思公司 神经网络编译器架构及编译方法
CN110764744A (zh) * 2018-07-25 2020-02-07 赛灵思公司 用于神经网络计算的中间表示生成方法和装置
CN110968321A (zh) * 2019-10-25 2020-04-07 浙江省北大信息技术高等研究院 张量计算代码优化方法、装置、设备及介质
CN111338635A (zh) * 2020-02-20 2020-06-26 腾讯科技(深圳)有限公司 计算图的图编译方法、装置、设备及存储介质
CN112527272A (zh) * 2020-12-25 2021-03-19 深圳云天励飞技术股份有限公司 对接tvm的方法及相关设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIANQI CHEN; THIERRY MOREAU; ZIHENG JIANG; LIANMIN ZHENG; EDDIE YAN; MEGHAN COWAN; HAICHEN SHEN; LEYUAN WANG; YUWEI HU; LUIS CEZE;: "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 February 2018 (2018-02-12), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081061540 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629330A (zh) * 2023-04-24 2023-08-22 北京大学 一种算子检测方法、装置以及计算机设备
CN116629330B (zh) * 2023-04-24 2024-04-16 北京大学 一种算子检测方法、装置以及计算机设备

Also Published As

Publication number Publication date
CN112527272B (zh) 2023-11-17
CN112527272A (zh) 2021-03-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21909027

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21909027

Country of ref document: EP

Kind code of ref document: A1