WO2022087788A1 - A neural network compilation and optimization method and related apparatus - Google Patents

A neural network compilation and optimization method and related apparatus

Info

Publication number
WO2022087788A1
Authority
WO
WIPO (PCT)
Prior art keywords
subgraph
optimization
divided
grained
feature vector
Prior art date
Application number
PCT/CN2020/123708
Other languages
English (en)
French (fr)
Inventor
范礼
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202080106362.2A priority Critical patent/CN116368494A/zh
Priority to PCT/CN2020/123708 priority patent/WO2022087788A1/zh
Publication of WO2022087788A1 publication Critical patent/WO2022087788A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the field of neural network compilation, and in particular, to a neural network compilation and optimization method and related devices.
  • artificial neural network is one of the common computing models in the field of artificial intelligence.
  • neural networks are represented by specialized models that describe neural network algorithms. Compiling a neural network model means converting the neural network algorithm into a general computational graph, optimizing the computational graph, and then mapping the optimized computational graph into executable instructions and machine code for the back-end hardware platform, thereby converting the neural network model into target code executable on the computing platform.
  • Neural network compilation and optimization is a complex and time-consuming computational process.
  • a typical neural network compilation and optimization process includes graph transformation, subgraph partitioning, constant folding, equivalent subgraph transformation, operator fusion, L1/L2 memory data reuse, DDR memory allocation and reuse, etc.
  • Embodiments of the present application provide a neural network compilation and optimization method and a related device, which are used to solve at least one of the above shortcomings in the art.
  • a method for compiling and optimizing a neural network, comprising: dividing a computation graph of a neural network into several subgraphs; querying a topology feature library to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one of the divided subgraphs; and, if so, extracting from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph to optimize the at least one subgraph.
  • the topology feature library includes topological feature vectors of known subgraphs and corresponding optimization strategies, and the step of querying includes querying whether there is an optimization strategy of a known subgraph whose topological feature vector is consistent with that of the at least one divided subgraph.
  • the method further comprises: if there is no optimization strategy of a known subgraph consistent with the topological feature vector of the at least one divided subgraph, performing a compilation optimization calculation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generating an optimization strategy of the at least one divided subgraph; and adding the topological feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
  • performing the compilation optimization calculation on the at least one divided subgraph includes determining an equivalent subgraph of the at least one divided subgraph through the optimization calculation and replacing the at least one divided subgraph with the equivalent subgraph; the step of adding includes storing the topological feature vector of the at least one divided subgraph and the topological feature vector of the equivalent subgraph in the topology feature library in association.
  • performing the compilation optimization calculation on the at least one divided subgraph includes determining that multiple parts of the at least one divided subgraph can be merged to reduce the computation amount of the subgraph or to improve the computation speed of the subgraph, and merging the multiple parts; the step of adding includes adding the topological feature vector of the at least one divided subgraph and the topological feature vector of the merged subgraph to the topology feature library in association.
  • performing the compilation optimization calculation on the at least one divided subgraph includes determining an equivalent constant of the at least one divided subgraph and replacing the at least one divided subgraph with the equivalent constant; the step of adding includes adding the topological feature vector of the at least one divided subgraph and the equivalent constant to the topology feature library in association with each other.
  • the step of dividing includes dividing the computational graph into several coarse-grained subgraphs using a clustering technique.
  • the method further comprises: if there is no optimization strategy of a known subgraph consistent with the topological feature vector of the at least one divided coarse-grained subgraph, dividing the at least one coarse-grained subgraph into several fine-grained subgraphs and querying the topology feature library for an optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph; if such an optimization strategy exists, extracting from the topology feature library the optimization strategy of the known subgraph consistent with the topological feature vector of the at least one fine-grained subgraph to optimize the at least one fine-grained subgraph.
  • according to the first aspect of the present application, if there is no optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph, then: performing a compilation optimization calculation on the at least one fine-grained subgraph; generating an optimization strategy of the at least one fine-grained subgraph; and adding the topological feature vector of the at least one fine-grained subgraph and its optimization strategy to the topology feature library.
  • an apparatus for compiling and optimizing a neural network, including a processor and a memory, where the processor is configured to execute program instructions stored in the memory so that the apparatus implements any one of the above methods.
  • a computer-readable storage medium wherein a program code is stored in the computer-readable storage medium, and when the program code is executed by a computer, any one of the above methods is implemented.
  • a computer program product wherein when the program code included in the computer program product is executed by a computer, any of the above methods is implemented.
  • an apparatus for compiling and optimizing a neural network, comprising: a division unit configured to divide a computation graph of a neural network into several subgraphs; a query unit configured to query a topology feature library for an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided subgraph; and an optimization unit configured to, if such a strategy exists, extract from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph, to optimize the at least one subgraph.
  • the topology feature library includes topological feature vectors of known subgraphs and corresponding optimization strategies, and the query unit is further configured to query whether there is an optimization strategy of a known subgraph whose topological feature vector is consistent with that of the at least one divided subgraph.
  • if there is no such optimization strategy, the optimization unit is further configured to: perform a compilation optimization calculation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generate an optimization strategy of the at least one divided subgraph; and add the topological feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
  • when performing the compilation optimization calculation on the at least one divided subgraph, the optimization unit is configured to determine an equivalent subgraph of the at least one divided subgraph through the optimization calculation and replace the at least one divided subgraph with the equivalent subgraph; when performing the adding, the optimization unit is configured to store the topological feature vector of the at least one divided subgraph and the topological feature vector of the equivalent subgraph in the topology feature library in association.
  • the optimization unit is configured to: when performing the compilation optimization calculation on the at least one divided subgraph, determine through the optimization calculation that multiple parts of the subgraph can be merged to reduce the computation amount of the subgraph or to improve the computation speed of the subgraph, and merge the multiple parts; when performing the adding, add the topological feature vector of the subgraph and the topological feature vector of the merged subgraph to the topology feature library in association.
  • the optimization unit is configured to: when performing the compilation optimization calculation on the at least one divided subgraph, determine an equivalent constant of the subgraph through the optimization calculation and replace the subgraph with the constant; when performing the adding, add the topological feature vector of the subgraph and the constant to the topology feature library in association.
  • the division unit is further configured to divide the computation graph into coarse-grained subgraphs using a clustering technique.
  • if there is no optimization strategy of a known subgraph consistent with the topological feature vector of the at least one divided coarse-grained subgraph, the division unit is configured to divide the at least one coarse-grained subgraph into several fine-grained subgraphs; the query unit is configured to query the topology feature library for an optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph; and, if such a strategy exists, the optimization unit is configured to extract from the topology feature library the optimization strategy of the known subgraph consistent with the topological feature vector of the at least one fine-grained subgraph, to optimize the at least one fine-grained subgraph.
  • if no such strategy exists, the optimization unit is configured to: perform a compilation optimization calculation on the at least one fine-grained subgraph; generate an optimization strategy of the at least one fine-grained subgraph; and add the topological feature vector of the at least one fine-grained subgraph and the optimization strategy to the topology feature library.
  • the embodiments of the present application speed up the execution of the compilation process by introducing a topology feature library.
  • the compilation optimization strategy of the known subgraphs of the neural network is stored in the feature library.
  • when a neural network is compiled, its computational graph is divided into several subgraphs, the optimization strategy corresponding to at least one subgraph is found in the topology feature library, and that strategy is applied directly to the subgraph without repeating the optimization calculation for it, thereby speeding up the compilation of the neural network model.
  • FIG. 1 is a schematic structural diagram of a neural network;
  • FIG. 2 is a flowchart of a neural network compilation and optimization process according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of segmentation of the computation graph of the neural network in the embodiment of FIG. 2;
  • FIG. 4 is a schematic diagram of node attribute information in the embodiment of FIG. 2;
  • FIG. 5 is a schematic diagram of an equivalent subgraph in the neural network in the embodiment of FIG. 2;
  • FIG. 6 is a flowchart of a method for compiling and optimizing a neural network according to another embodiment of the present application.
  • FIG. 7 is a schematic diagram of fine-grained segmentation of the computation graph in the embodiment of FIG. 6;
  • FIG. 8 is a schematic diagram of a neural network compilation server implementing the neural network compilation and optimization method provided by an embodiment of the present application
  • FIG. 9 is a schematic diagram of a computing device implementing the method for compiling and optimizing a neural network provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an apparatus for implementing the method for compiling and optimizing a neural network provided by an embodiment of the present application.
  • at least one (item) of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c each can be single or multiple.
  • the size of the sequence numbers of the steps does not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the application.
  • the neural network mentioned in this application, i.e., an artificial neural network (ANN), may be any of various artificial neural networks known in the art, such as a deep neural network (DNN), a recurrent neural network (RNN), or a convolutional neural network (CNN).
  • Neural networks are generally represented by specialized models that describe neural network algorithms. Compilation of a neural network refers to converting the neural network algorithm into a general computational graph, optimizing and reconstructing the computational graph, and then mapping the optimized computational graph into executable instructions and machine code on the back-end hardware platform. Thus, the compilation of the neural network algorithm for the hardware platform is completed.
  • the computation graph represents the computation of the neural network model graphically, for example, by multiple nodes and directed edges (Edges) between the nodes.
  • the nodes include at least one of variable nodes, operator nodes, and sample nodes.
  • a directed edge between two nodes is used to represent a dependency between two nodes.
  • directed edges between nodes have attributes, such as weights; a weight indicates that a signal (or value) input through the directed edge to the next node is multiplied by the weight value as the input.
  • the neural network compiler can be any of various computing platforms or compilers for neural network computing, for example, a deep neural network compiler DNNC (Deep Neural Network Compiler), which can compile neural network algorithms into an instruction stream for, e.g., a DPU (Deep Learning Processor Unit, DLPU) platform.
  • the neural network compiler can also be other various types of compilers known in the art.
  • the embodiments of the present application relate to a method for compiling and optimizing a neural network.
  • the methods of the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
  • Figure 1 provides a schematic diagram of a neural network structure.
  • the neural network structure includes a series of layers that run in order, such as a convolutional (Conv) layer, a batch normalization (BatchNorm, or BN) layer, a scaling (Scale) layer, an element-wise addition (eltwise) layer, a ReLU (Rectified Linear Unit) layer, and so on. These layers are merely examples; other various types of layers may also be included.
  • FIG. 1 provided in this application is only for illustration; the neural network in this application is not limited to the structure shown in FIG. 1 and can be any of various types of neural network structures in the art.
  • the computational output of the previous layer is the input of the latter layer.
  • the input data required by the Conv layer is first loaded from the off-chip memory into the on-chip cache, and the Conv layer computes on the loaded data, storing its result in the off-chip memory.
  • the BN layer then loads the Conv layer's output from the off-chip memory as its input data, together with other necessary parameters, performs its calculation, and stores the result in the off-chip memory; the Scale layer then loads the previous BN layer's output from the off-chip memory, performs its calculation, and stores the result in the off-chip memory, and so on, until all computing layers have been traversed.
  • the calculation process of the above-mentioned neural network is only exemplary.
  • the calculation process (including the data access process) of the blocks containing successive Conv, BN and Scale may also employ other operations known in the art.
  • a calculation graph of the neural network is generated.
  • Developers can use various deep learning frameworks in the field, such as MxNet, TensorFlow, etc., to design the required neural network models.
  • various neural network model parsers and neural network model builders known in the art can be used to convert the neural network model into a general computation graph corresponding to the neural network processor.
  • the neural network model parser can parse the input neural network model, for example, analyzing its grammatical structure or syntax to generate model information; further, the neural network model builder can generate a computation graph based on the model information, for example, a graph that includes multiple compute nodes.
  • step S120 the computation graph of the neural network is divided into several subgraphs.
  • the computational graph can be divided into several subgraphs using the adjacency matrix-based clustering graph cut algorithm.
  • a clustering algorithm is applied on the adjacency matrix to generate clustering results, thereby determining the boundaries of the sub-graphs to be divided.
  • FIG. 3 shows the subgraph structures of the segmented neural network, eg, subgraph structure A and subgraph structure B, and the corresponding adjacency matrix (shown in the right half of FIG. 3 ).
  • sub-graph division is only an example, and other known sub-graph division techniques in the art may be adopted.
  • the above clustering algorithm may be, but is not limited to, K-Means, Graph Community Detection, and the like; details are not repeated here.
  • the topology information of at least one divided subgraph is generated, for example, a hash digest of the attribute information of each node after traversal in a certain order.
  • the traversal of the nodes can be: determining the head node of the subgraph, traversing each node breadth-first, and generating a node processing queue. Attribute information is generated for each node in turn from the node queue, for example, the node attribute information (Attr Info of Op) shown in FIG. 4, which includes, but is not limited to, the operation type (Op Type), the number of inputs (Input Num), the shape of each input (Input[0] Shape, Input[1] Shape, etc.) and the operation types of the input nodes (e.g., Input[0] Op Type, Input[1] Op Type, etc.), the number of outputs (Output Num), the shapes of the outputs (Output[0] Shape, etc.), and the operation types of the output nodes (e.g., Output[0] Op Type).
  • the attribute information of the above nodes can be encoded into storable information; the encoding here can use a hash algorithm (e.g., MD5).
  • a topology feature vector may also be used to represent the topology information of the subgraph.
  • among the subgraphs of a neural network, there are subgraphs whose node orderings differ but whose structures are equivalent; for example, in FIG. 5, the top Conv layer has three downward outputs: Output:0, Output:1, and Output:2.
  • the two subgraphs shown in FIG. 5 belong to subgraphs with different topological orderings but equivalent structures.
  • the order of traversing the nodes is different and, therefore, different digests of node attribute information may be generated.
  • the same optimization strategy can generally be used.
  • the topology information of the structurally equivalent subgraphs needs to be uniquely represented, so as to avoid using different topological representations for some structurally equivalent subgraphs in the topology feature library.
  • a graph neural network GraphSAGE and a feed-forward NN network are employed to generate topological feature vectors of subgraphs, which can uniquely represent the topology of structurally equivalent subgraphs.
  • the adjacency matrix of the subgraph and the topological attribute information of each node of the subgraph (e.g., the hash digest of node attributes described above) are input into the graph neural network GraphSAGE.
  • GraphSAGE is an algorithm known in the art for generating embeddings for each node.
  • the embeddings of the nodes of the subgraph are concatenated and then input into the feedforward neural network to generate the topological feature vector of the subgraph.
  • the above manner of generating the topological feature vector of the subgraph is exemplary, and other various manners known in the art can also be used to generate the topological feature vector of the subgraph to uniquely represent the topology of the subgraph with equivalent structure.
  • a topology feature library is set up, in which the topology information of known subgraphs and their corresponding optimization strategies are stored. In the library, each subgraph's optimization strategy is stored in association with the topology information that represents the subgraph; for example, an entry can represent the mapping between a subgraph's topology information and its optimization strategy. At initialization, the topology information of some known subgraphs and the corresponding optimization strategies are stored in the topology feature library.
  • the topology information may be the per-node attribute information of the subgraph described above, or the topological feature vector of the subgraph described above; either may be stored in the topology feature library to represent the topology of the subgraph.
  • the topology feature database is queried to determine whether there is an optimization strategy of a known subgraph that is topologically consistent with at least one of the divided subgraphs. If there is, the corresponding optimization strategy is extracted from the topology feature library and applied directly to the at least one divided subgraph, without the need to perform a compilation optimization calculation on it. Specifically, the library is queried for an optimization strategy of a known subgraph whose topology information (e.g., a topological feature vector) is consistent with that of at least one divided subgraph.
  • for example, the topological feature vector of the subgraph generated in step S130 is compared with the topological feature vectors of the known subgraphs in the topology feature library to find a known subgraph whose topology information (e.g., topological feature vector) is consistent with that of at least one divided subgraph.
  • the compilation optimization strategy may be any of various compilation optimization strategies for neural network structures known in the art. For example, it can be common subexpression elimination or constant folding for a certain subgraph; it can be an equivalent subgraph transformation, i.e., a graph replacement operation that replaces the subgraph with another computational graph; it can be a graph merging operation that merges several parts of the computation graph; or it can be operator fusion that fuses the operators of some nodes together.
  • the optimization strategy is not limited to those listed above, and can also be other various types of optimization strategies known in the art.
  • a compilation optimization calculation is performed on the at least one divided subgraph as shown in step S150.
  • the compilation optimization calculation here refers to optimizing and reconstructing the sub-graph to reduce the amount of calculation of the sub-graph or improve the calculation speed of the sub-graph.
  • an equivalent subgraph of the subgraph can be generated by optimizing the calculation to reduce the amount of calculation or improve the calculation speed, and then replace the subgraph with the equivalent subgraph, that is, graph replacement.
  • at least a part of the subgraph is determined to be an equivalent constant through optimization calculation, and at least a part of the subgraph is replaced with the constant, that is, constant folding.
  • it may also be determined through the optimization calculation that the operators of several consecutive layers in the subgraph can be fused to reduce the bandwidth pressure of accessing external memory, and the operators of those consecutive layers are then fused.
  • an optimization strategy corresponding to the at least one new subgraph is generated, and then, as shown in step S170, the at least one new subgraph and its optimization strategy are added to the topology feature library .
  • the topology of the at least one new subgraph and the topology of the replacement graph are added to the topology feature library in association.
  • the topology of the at least one new subgraph and the topology of the optimized graph are added to the topology feature library in association.
  • the topology of the at least one new subgraph and the equivalent constant may be added to the topology feature library in association.
  • the at least one subgraph may be divided into several smaller subgraphs, and the compilation optimization calculation is performed on the at least one smaller subgraph.
  • after optimization, an optimization strategy of the at least one smaller subgraph is generated, and the optimized smaller subgraphs and their corresponding optimization strategies are stored in the topology feature library. Alternatively, an optimization strategy of the original subgraph before division may be generated, and the original subgraph and its corresponding optimization strategy stored in the topology feature library.
  • the topology information of several known subgraphs and corresponding optimization strategies are stored in the topology feature library.
  • when a new subgraph is encountered during subsequent compilation optimization of the computation graphs of various neural networks, the optimization calculation is performed on the new subgraph, and the new subgraph and its optimization strategy can then be added to the topology feature library, so that when the subgraph is encountered again, the corresponding optimization strategy can be retrieved directly from the library and applied to the subgraph without repeating the optimization calculation.
  • a neural network topology feature library is set up, in which known subgraphs of the neural network and optimization strategies thereof are stored.
  • at compile time, for a subgraph of a neural network encountered for the first time, a corresponding entry is created and the corresponding optimization strategy is stored.
  • when the same subgraph of the same neural network is compiled again later, only the topology feature database needs to be queried; for an existing subgraph of the neural network, no optimization calculation is needed, and the corresponding compilation optimization strategy is obtained directly, avoiding repeated optimization calculations for the same subgraph and improving compilation speed.
  • the neural network compilation and optimization method in the embodiment of FIG. 6 differs from the embodiment of FIG. 2 in that the computation graph is first divided into several coarse-grained subgraphs, and the topology feature library is then queried to determine whether it contains an optimization strategy of a known subgraph topologically consistent with at least one coarse-grained subgraph; if not, the coarse-grained subgraph is further divided into fine-grained subgraphs, and the topology feature library is queried again to determine whether it contains an optimization strategy of a known subgraph topologically consistent with at least one fine-grained subgraph.
  • after the computation graph of the neural network is generated in step S210, the computation graph is divided into coarse-grained subgraphs, as shown in step S220.
  • the coarse-grained subgraph segmentation mentioned here may be the same technique as the segmentation of the computation graph in the embodiment of FIG. 2 .
  • a cluster graph cut algorithm based on adjacency matrix is used to partition the computational graph into several coarse-grained subgraphs.
  • Various other subgraph segmentation techniques known in the art may be used, which may be, but are not limited to, K-Means, Graph Community Detection, and the like.
  • then, the topology information of the coarse-grained subgraph, such as the topological feature vector of the subgraph, is generated using the GraphSAGE graph neural network and feed-forward NN network approach described in the embodiment of FIG. 2.
  • the topology feature database is queried to determine whether there is an optimization strategy of a known subgraph that is topologically consistent with at least one divided coarse-grained subgraph. If there is, as shown in steps S282 and S284, the corresponding optimization strategy is extracted from the topology feature library and applied to the at least one divided coarse-grained subgraph. If not, as shown in step S250, the coarse-grained subgraph is further divided into several fine-grained subgraphs.
  • the coarse-grained subgraph can be cut into fine-grained subgraphs by way of max-flow min-cut.
  • the specific steps are as follows: set a weight on each edge of the subgraph, for example, according to the optimization items of the software stack that performs neural network compilation (such as operator buffer data reuse, data-flow template matching, etc.); then cut the edges of the subgraph in order of weight, from small to large.
  • the above manner of setting the weight of the edge is exemplary, and other standards known in the art can also be used to set the weight of the edge.
  • the edge with the smallest weight on the subgraph is cut first, e.g., an edge with weight 1; next, edges with larger weight values are cut, e.g., edges with weight 2; the subgraph shown at the far right of FIG. 7 thus obtained includes only two nodes.
  • the fine-grained division of FIG. 7 is only exemplary, and it is not necessary to divide down to a subgraph including only two nodes as shown at the far right of FIG. 7.
  • the granularity of fine-grained segmentation can be set as desired.
  • the topology feature database is further queried to determine whether there is an optimization strategy for a known subgraph that is topologically consistent with at least one fine-grained subgraph. If yes, as shown in step S282 and step S284, a corresponding optimization strategy is extracted from the topology feature library, and the optimization strategy is applied to the at least one fine-grained subgraph.
  • if not, as shown in step S270, a compilation optimization calculation is performed on the at least one fine-grained subgraph; after optimization, a corresponding optimization strategy is generated, and then, as shown in step S280, the topology information of the at least one new fine-grained subgraph and its corresponding optimization strategy are added to the topology feature library. Alternatively, the compilation optimization calculation may be performed on at least one new coarse-grained subgraph before division; after optimization, a corresponding optimization strategy is generated, and the topology information of the at least one new coarse-grained subgraph before division and its corresponding optimization strategy are added to the topology feature library.
  • topology information of several known subgraphs and corresponding optimization strategies are stored in the topology feature library.
  • the optimization calculation is performed on the new subgraph, and the new subgraph and its optimization strategy can then be added to the topology feature library, so that when the subgraph is encountered again in the future, the corresponding optimization strategy can be retrieved directly from the library and applied to the subgraph, avoiding repeated optimization calculations for subgraphs of known neural networks and improving compilation speed.
  • the above-mentioned embodiments of the present application may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • when implemented in software, it can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer program codes or computer program instructions.
  • when the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are produced.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer program code or computer program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media.
  • the usable medium may be a magnetic medium, such as a floppy disk, a hard disk, and a magnetic tape; an optical medium, such as a DVD; or a semiconductor medium, such as a solid state disk (Solid State Disk, SSD), and the like.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
  • a neural network compilation server 100 that executes the method of the embodiment of the present application will be described in detail below with reference to FIG. 8 .
  • the neural network compilation server 100 may be various computational graph compilation servers known in the art, or other various types of servers known in the art for implementing neural network compilation and optimization.
  • the neural network compilation server 100 includes a processor 110 , a neural network processor 120 , a storage device 130 and an encryption chip 140 .
  • the processor 110 may be a general-purpose processor or a processor specially designed for a specific technical field.
  • the processor may be a central processing unit (center processing unit, CPU), or may be various types of processors such as a digital signal processor (digital signal processor, DSP), a microcontroller (micro control unit, MCU).
  • the processor 110 may be various types of single-core or multi-core CPU processors such as Intel or ARM, such as Intel Xeon CPU.
  • the neural network processor 120 may be any of various types of special-purpose processors for neural networks, such as processors specially designed for artificial intelligence (AI) applications, including but not limited to a neural network processing unit (NPU), a GPU, a tensor processing unit (TPU), etc.
  • the neural network special-purpose processor used in this application may also be abbreviated as "neural network processor" or "NN processor".
  • the neural network specific processor may be implemented as a deep learning specific processor or a deep learning processor.
  • the storage device 130 may store the topology feature library 132 of the embodiment of the present application.
  • the topology feature library (ie, knowledge base) 132 is used to record topology information of the neural network and its corresponding compilation and optimization strategy.
  • the topology feature library (i.e., knowledge base) 132 may use a high-speed in-memory database, such as a Mango database, or other various types of databases.
  • Storage 130 may include various types of storage units, such as system memory, read only memory (ROM), and persistent storage. ROM can store static data or instructions needed by the processor.
  • Persistent storage devices may be readable and writable storage devices. Permanent storage may be a non-volatile storage device that does not lose stored instructions and data even if the computer is powered off.
  • a mass storage device (e.g., a magnetic or optical disk, or flash memory) may be employed as the persistent storage device.
  • the architecture of the above-mentioned neural network compilation server 100 is exemplary, and other various types of hardware architectures may also be included to implement the methods of the embodiments of the present application.
  • the storage device 130 also stores computer program codes or computer program instructions, which, when processed by the neural network compilation server 100, can enable the neural network compilation server 100 to execute the above-mentioned neural network compilation and optimization method of the present application.
  • the processor 110 When executing the methods of the above embodiments of the present application, the processor 110 is configured to execute the main steps of the methods of the embodiments of the present application.
  • in step S130, the step of running a hash algorithm (e.g., MD5) to encode the node attribute information is performed by a dedicated encryption chip 140 (e.g., a hardware acceleration chip conforming to the national cryptographic standard).
  • the step of using the graph neural network GraphSAGE and the feedforward NN network to generate the topological feature vector of the subgraph is performed by the neural network processor 120 .
  • FIG. 9 shows a schematic structural diagram of a computing device 200 that can be used to implement the above method for compiling and optimizing a neural network according to an embodiment of the present application.
  • computing device 200 includes processor 210 and memory 220 .
  • the computing devices of this application may be various types of computing devices, including but not limited to, various cloud computing devices, edge computing devices of various networks (such as wired or wireless communication networks, smart power networks, Internet of Things, etc.), user terminal equipment, etc.
  • the processor 210 may be a multi-core processor, or may include multiple processors.
  • the processor 210 may comprise a general-purpose main processor and one or more special-purpose co-processors, such as a graphics processing unit (GPU), a digital signal processor (DSP), a neural network processing unit (NPU), a tensor processing unit (TPU), etc.
  • Memory 220 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 210. Persistent storage devices may be readable and writable storage devices.
  • Permanent storage may be a non-volatile storage device that does not lose stored instructions and data even if the computer is powered off.
  • a mass storage device (e.g., a magnetic or optical disk, or flash memory) may be employed as the persistent storage device.
  • persistent storage may be a removable storage device (eg, a floppy disk, optical drive).
  • System memory can be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. System memory can store some or all of the instructions and data that the processor needs at runtime.
  • memory 220 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read only memory), and magnetic and/or optical disks may also be employed.
  • memory 220 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, miniSD card, Micro-SD card, etc.), a magnetic floppy disk, etc.
  • the memory 220 stores computer program code or computer program instructions which, when executed, cause the processor 210 to perform the above-described neural network compilation and optimization method of the present application.
  • the method of the present application can also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above steps defined in the above method of the present application.
  • the present application can also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code is executed by the processor of an electronic device (or computing device, server, etc.), the processor is caused to perform the steps of the above method according to the present application.
  • FIG. 10 shows a neural network compilation and optimization apparatus 300 that can be used to implement the above neural network compilation and optimization method according to an embodiment of the present application.
  • the neural network compilation and optimization apparatus 300 includes a division unit 310, a query unit 320, and an optimization unit 330.
  • the division unit 310 is configured to divide the computational graph of the neural network into several subgraphs; the query unit 320 is configured to query whether the topology feature library contains an optimization strategy of a known subgraph whose topology is consistent with that of the at least one divided subgraph; and the optimization unit 330 is configured to, if such a strategy exists, extract from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph, to optimize the at least one subgraph.
  • the topology feature library includes topological feature vectors of known subgraphs and corresponding optimization strategies, and the query unit 320 is further configured to query whether there is an optimization strategy of a known subgraph whose topological feature vector is consistent with that of the at least one divided subgraph.
  • if there is no such strategy, the optimization unit 330 is further configured to: perform a compilation optimization calculation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generate an optimization strategy of the at least one divided subgraph; and add the topological feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
  • the optimization unit 330 is configured to: when performing the compilation optimization calculation on the at least one divided subgraph, determine an equivalent subgraph of the at least one divided subgraph through the optimization calculation and replace the at least one divided subgraph with the equivalent subgraph; when performing the adding, store the topological feature vector of the at least one divided subgraph and the topological feature vector of the equivalent subgraph in the topology feature library in association.
  • the optimization unit 330 is configured to: when performing the compilation optimization calculation on the at least one divided subgraph, determine through the optimization calculation that multiple parts of the subgraph can be merged to reduce the computation amount of the subgraph or to improve its computation speed, and merge the multiple parts; when performing the adding, add the topological feature vector of the subgraph and the topological feature vector of the merged subgraph to the topology feature library in association.
  • the optimization unit 330 is configured to: when performing the compilation optimization calculation on the at least one divided subgraph, determine an equivalent constant of the subgraph through the optimization calculation and replace the subgraph with the constant; when performing the adding, add the topological feature vector of the subgraph and the constant to the topology feature library in association.
  • the division unit 310 is further configured to divide the computation graph into coarse-grained subgraphs using a clustering technique.
  • if there is no optimization strategy of a known subgraph consistent with the topological feature vector of the at least one divided coarse-grained subgraph, the division unit 310 is configured to divide the at least one divided coarse-grained subgraph into several fine-grained subgraphs; the query unit 320 is configured to query the topology feature library for an optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph; and, if such a strategy exists, the optimization unit 330 is configured to extract from the topology feature library the optimization strategy of the known subgraph consistent with the topological feature vector of the at least one fine-grained subgraph, to optimize the at least one fine-grained subgraph.
  • if no such strategy exists, the optimization unit 330 is configured to: perform a compilation optimization calculation on the at least one fine-grained subgraph; generate an optimization strategy of the at least one fine-grained subgraph; and add the topological feature vector of the at least one fine-grained subgraph and the optimization strategy to the topology feature library.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A neural network compilation and optimization method, comprising: dividing the computation graph of a neural network into several subgraphs; querying a topology feature library to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided subgraph; and, if one exists, extracting from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph to optimize the at least one subgraph. The compilation optimization strategies of known subgraphs of neural networks are stored in the feature library. When a neural network is compiled, its computation graph is divided into several subgraphs, the optimization strategy corresponding to at least one divided subgraph is found in the topology feature library, and that strategy is applied directly to the subgraph without repeating the optimization calculation for it, thereby accelerating the compilation of the neural network model.

Description

A neural network compilation and optimization method and related apparatus

Technical Field
The present application relates to the field of neural network compilation, and in particular to a neural network compilation and optimization method and related apparatus.
Background
At present, the artificial neural network is one of the common computing models in the field of artificial intelligence. Unlike source code written in high-level programming languages such as C/C++, a neural network is represented by a specialized model that describes the neural network algorithm. Compiling the neural network model means converting the neural network algorithm into a general computational graph, optimizing the computational graph, and then mapping the optimized computational graph into executable instructions and machine code for the back-end hardware platform, thereby converting the neural network model into target code executable on the computing platform. Neural network compilation and optimization is a complex and time-consuming computational process. A typical neural network compilation and optimization process includes graph transformation, subgraph partitioning, constant folding, equivalent subgraph transformation, operator fusion, L1/L2 memory data reuse, DDR memory allocation and reuse, and so on. Neural network models often contain many identical structures, and repeatedly compiling these identical structures severely degrades the efficiency of online compilation or of compiling very large models. Moreover, if a user modifies only a small part of a neural network model, the entire complex and lengthy compilation process above must be repeated, harming the user experience.
Summary
Embodiments of the present application provide a neural network compilation and optimization method and related apparatus, which are used to solve at least one of the above shortcomings in the art.
According to a first aspect of the present application, a neural network compilation and optimization method is provided, comprising: dividing the computation graph of a neural network into several subgraphs; querying a topology feature library to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided subgraph; and, if so, extracting from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph to optimize the at least one subgraph.
According to the first aspect of the present application, the topology feature library includes topological feature vectors of known subgraphs and corresponding optimization strategies, and the querying step includes querying whether there is an optimization strategy of a known subgraph whose topological feature vector is consistent with that of the at least one divided subgraph.
According to the first aspect of the present application, the method further comprises: if there is no optimization strategy of a known subgraph consistent with the topological feature vector of the at least one divided subgraph, performing a compilation optimization calculation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generating an optimization strategy of the at least one divided subgraph; and adding the topological feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
According to the first aspect of the present application, performing the compilation optimization calculation on the at least one divided subgraph includes determining an equivalent subgraph of the at least one divided subgraph through the optimization calculation and replacing the at least one divided subgraph with the equivalent subgraph; the adding step includes storing the topological feature vector of the at least one divided subgraph and the topological feature vector of the equivalent subgraph in the topology feature library in association.
According to the first aspect of the present application, performing the compilation optimization calculation on the at least one divided subgraph includes determining that multiple parts of the at least one divided subgraph can be merged to reduce the computation amount of the subgraph or to improve the computation speed of the subgraph, and merging the multiple parts; the adding step includes adding the topological feature vector of the at least one divided subgraph and the topological feature vector of the merged subgraph to the topology feature library in association.
According to the first aspect of the present application, performing the compilation optimization calculation on the at least one divided subgraph includes determining an equivalent constant of the at least one divided subgraph and replacing the at least one divided subgraph with the equivalent constant; the adding step includes adding the topological feature vector of the at least one divided subgraph and the equivalent constant to the topology feature library in association.
According to the first aspect of the present application, the dividing step includes dividing the computation graph into several coarse-grained subgraphs using a clustering technique.
According to the first aspect of the present application, the method further comprises: if there is no optimization strategy of a known subgraph consistent with the topological feature vector of at least one divided coarse-grained subgraph, dividing the at least one divided coarse-grained subgraph into several fine-grained subgraphs and querying the topology feature library to determine whether it contains an optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph; if such an optimization strategy exists, extracting from the topology feature library the optimization strategy of the known subgraph consistent with the topological feature vector of the at least one fine-grained subgraph to optimize the at least one fine-grained subgraph.
According to the first aspect of the present application, if there is no optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph: performing a compilation optimization calculation on the at least one fine-grained subgraph; generating an optimization strategy of the at least one fine-grained subgraph; and adding the topological feature vector of the at least one fine-grained subgraph and its optimization strategy to the topology feature library.
According to a second aspect of the present application, an apparatus for neural network compilation and optimization is provided, comprising a processor and a memory, where the processor is configured to execute program instructions stored in the memory so that the apparatus implements any one of the above methods.
According to a third aspect of the present application, a computer-readable storage medium is provided, in which program code is stored; when the program code is executed by a computer, any one of the above methods is implemented.
According to a fourth aspect of the present application, a computer program product is provided; when the program code contained in the computer program product is executed by a computer, any one of the above methods is implemented.
According to a fifth aspect of the present application, an apparatus for neural network compilation and optimization is provided, comprising: a division unit configured to divide the computation graph of a neural network into several subgraphs; a query unit configured to query a topology feature library for an optimization strategy of a known subgraph whose topology is consistent with that of the at least one divided subgraph; and an optimization unit configured to, if such a strategy exists, extract from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph, to optimize the at least one subgraph.
According to the fifth aspect of the present application, the topology feature library includes topological feature vectors of known subgraphs and corresponding optimization strategies, and the query unit is further configured to query whether there is an optimization strategy of a known subgraph whose topological feature vector is consistent with that of the at least one divided subgraph.
According to the fifth aspect of the present application, if there is no optimization strategy of a known subgraph consistent with the topological feature vector of the at least one divided subgraph, the optimization unit is further configured to: perform a compilation optimization calculation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generate an optimization strategy of the at least one divided subgraph; and add the topological feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
According to the fifth aspect of the present application, when performing the compilation optimization calculation on the at least one divided subgraph, the optimization unit is configured to determine an equivalent subgraph of the at least one divided subgraph through the optimization calculation and replace the at least one divided subgraph with the equivalent subgraph; when performing the adding, the optimization unit is configured to store the topological feature vector of the at least one divided subgraph and the topological feature vector of the equivalent subgraph in the topology feature library in association.
According to the fifth aspect of the present application, the optimization unit is configured to: when performing the compilation optimization calculation on the at least one divided subgraph, determine through the optimization calculation that multiple parts of the subgraph can be merged to reduce the computation amount of the subgraph or to improve its computation speed, and merge the multiple parts; when performing the adding, add the topological feature vector of the subgraph and the topological feature vector of the merged subgraph to the topology feature library in association.
According to the fifth aspect of the present application, the optimization unit is configured to: when performing the compilation optimization calculation on the at least one divided subgraph, determine an equivalent constant of the subgraph through the optimization calculation and replace the subgraph with the constant; when performing the adding, add the topological feature vector of the subgraph and the constant to the topology feature library in association.
According to the fifth aspect of the present application, the division unit is further configured to divide the computation graph into coarse-grained subgraphs using a clustering technique.
According to the fifth aspect of the present application, if there is no optimization strategy of a known subgraph consistent with the topological feature vector of the at least one divided coarse-grained subgraph, the division unit is configured to divide the at least one divided coarse-grained subgraph into several fine-grained subgraphs; the query unit is configured to query the topology feature library for an optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph; and, if such a strategy exists, the optimization unit is configured to extract from the topology feature library the optimization strategy of the known subgraph consistent with the topological feature vector of the at least one fine-grained subgraph, to optimize the at least one fine-grained subgraph.
According to the fifth aspect of the present application, if there is no optimization strategy of a known subgraph consistent with the topological feature vector of at least one fine-grained subgraph, the optimization unit is configured to: perform a compilation optimization calculation on the at least one fine-grained subgraph; generate an optimization strategy of the at least one fine-grained subgraph; and add the topological feature vector of the at least one fine-grained subgraph and its optimization strategy to the topology feature library.
Embodiments of the present application accelerate the compilation process by introducing a topology feature library. Specifically, the compilation optimization strategies of known subgraphs of neural networks are stored in the feature library. When a neural network is compiled, its computation graph is divided into several subgraphs, the optimization strategy corresponding to at least one divided subgraph is found in the topology feature library, and that strategy is applied directly to the subgraph without repeating the optimization calculation for it, thereby accelerating the compilation of the neural network model.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a neural network;
FIG. 2 is a flowchart of a neural network compilation and optimization process according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the division of the computation graph of the neural network in the embodiment of FIG. 2;
FIG. 4 is a schematic diagram of node attribute information in the embodiment of FIG. 2;
FIG. 5 is a schematic diagram of equivalent subgraphs in the neural network in the embodiment of FIG. 2;
FIG. 6 is a flowchart of a neural network compilation and optimization method according to another embodiment of the present application;
FIG. 7 is a schematic diagram of the fine-grained division of the computation graph in the embodiment of FIG. 6;
FIG. 8 is a schematic diagram of a neural network compilation server implementing the neural network compilation and optimization method provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a computing device implementing the neural network compilation and optimization method provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of an apparatus implementing the neural network compilation and optimization method provided by an embodiment of the present application.
Detailed Description
The technical solutions provided in the present application are further described below with reference to the accompanying drawings and embodiments. It should be understood that the systems and scenarios provided in the embodiments of the present application are mainly intended to explain some possible implementations of the technical solutions of the present application, and should not be construed as the only limitations on those technical solutions.
The terms "first", "second", and "third" in the embodiments and drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. Furthermore, the terms "include" and "have" and any variations thereof are intended to denote non-exclusive inclusion. A method, system, product, or device is not necessarily limited to the steps or units literally listed, but may include other steps or units not literally listed or inherent to the process, method, product, or device.
It should be understood that in the present application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one (item) of the following" or similar expressions refers to any combination of these items, including any combination of a single item or multiple items. For example, at least one (item) of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
It should be understood that in the present application, the size of the sequence numbers of the steps does not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
The meanings of several technical terms involved in the present application are introduced below.
The neural network mentioned in the present application, i.e., an artificial neural network (ANN), may be any of various artificial neural networks known in the art, such as a deep neural network (DNN), a recurrent neural network (RNN), or a convolutional neural network (CNN).
A neural network is generally represented by a specialized model that describes the neural network algorithm. Compilation of a neural network refers to converting the neural network algorithm into a general computational graph, optimizing and reconstructing the computational graph, and then mapping the optimized computational graph into executable instructions and machine code for the back-end hardware platform, thus completing the compilation of the neural network algorithm for the hardware platform.
A computation graph represents the computation of a neural network model graphically, for example, by multiple nodes and directed edges (Edges) between the nodes. A node includes at least one of a variable node, an operator node, and a sample node. A directed edge between two nodes represents a dependency between the two nodes. Directed edges between nodes have attributes, such as weights; a weight indicates that a signal (or value) input through the directed edge to the next node is multiplied by the weight value as the input.
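By way of illustration only, the following minimal Python sketch shows one possible in-memory representation of such a computation graph with nodes and weighted directed edges; the class and field names are hypothetical and are not part of the original disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str          # e.g. "conv1"
    op_type: str       # e.g. "Conv", "BatchNorm", "Scale"
    inputs: list = field(default_factory=list)   # names of upstream nodes

@dataclass
class Edge:
    src: str
    dst: str
    weight: float = 1.0  # a value travelling along the edge is scaled by this weight

# A tiny computation graph: Conv -> BN -> Scale
nodes = {
    "conv1":  Node("conv1", "Conv"),
    "bn1":    Node("bn1", "BatchNorm", inputs=["conv1"]),
    "scale1": Node("scale1", "Scale", inputs=["bn1"]),
}
edges = [Edge("conv1", "bn1"), Edge("bn1", "scale1")]
```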
The neural network compiler can be any of various computing platforms or compilers for neural network computing, for example, a deep neural network compiler DNNC (Deep Neural Network Compiler), which can compile a neural network algorithm into an instruction stream for, e.g., a DPU (Deep Learning Processor Unit, DLPU) platform. The neural network compiler may also be other various types of compilers known in the art.
Embodiments of the present application relate to a neural network compilation and optimization method. The methods of the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 provides a schematic diagram of a neural network structure. As shown in FIG. 1, the neural network structure includes a series of layers that run in order, such as a convolutional (Conv) layer, a batch normalization (BatchNorm, or BN) layer, a scaling (Scale) layer, an element-wise addition (eltwise) layer, a ReLU (Rectified Linear Unit) layer, and so on. These layers are merely examples, and other various types of layers may also be included. In addition, FIG. 1 is provided in this application only for illustration; the neural network in this application is not limited to the structure shown in FIG. 1 and can be any of various types of neural network structures in the art.
As shown in FIG. 1, in a neural network, the computational output of the previous layer is the input of the next layer. For example, for the block in FIG. 1 containing consecutive Conv, BN, and Scale layers, during computation the input data required by the Conv layer is first loaded from the off-chip memory into the on-chip cache, the Conv layer computes on the loaded data, and the result is stored in the off-chip memory; the BN layer then loads the Conv layer's output from the off-chip memory as its input data, together with other necessary parameters, computes, and stores the result in the off-chip memory; the Scale layer then loads the previous BN layer's output from the off-chip memory, computes, and stores the result in the off-chip memory, and so on, until all computing layers have been traversed. The above computation process of the neural network is only exemplary. The computation process (including data access) of the block containing consecutive Conv, BN, and Scale layers may also employ other operations known in the art.
The flow of a neural network compilation and optimization method according to an embodiment of the present application is described in detail below with reference to FIGS. 2 to 5.
As shown in FIG. 2, in step S110, the computation graph of the neural network is generated. Developers can use various deep learning frameworks in the field, such as MxNet, TensorFlow, etc., to design the required neural network models. Further, various neural network model parsers and builders known in the art can be used to convert the neural network model into a general computation graph corresponding to the neural network processor. For example, the neural network model parser can parse the input neural network model, e.g., analyze its grammatical structure or syntax to generate model information; further, the neural network model builder can generate a computation graph based on the model information, for example, a graph including multiple compute nodes.
Next, in step S120, the computation graph of the neural network is divided into several subgraphs. An adjacency-matrix-based clustering graph cut algorithm may be used to divide the computation graph into several subgraphs. As shown in FIG. 3, a clustering algorithm is applied to the adjacency matrix to produce clustering results, thereby determining the boundaries of the subgraphs to be divided. FIG. 3 shows the subgraph structures of the divided neural network, e.g., subgraph structure A and subgraph structure B, and the corresponding adjacency matrix (shown in the right half of FIG. 3).
The above manner of subgraph division is only an example; other various subgraph division techniques known in the art may be adopted. The above clustering algorithm may be, but is not limited to, K-Means, Graph Community Detection, and the like. Details are not repeated here.
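As a hedged illustration of an adjacency-matrix-based clustering graph cut, the sketch below uses spectral clustering from scikit-learn on a toy adjacency matrix; the choice of algorithm and library is an assumption for demonstration, not the embodiment's prescribed implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Adjacency matrix of a 6-node computation graph (symmetrized for clustering):
# nodes 0-2 form one densely connected region, nodes 3-5 another.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

# Cluster directly on the precomputed affinity (adjacency) matrix.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)  # e.g. [0 0 0 1 1 1]: each label marks one subgraph
```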
Next, as shown in step S130, the topology information of at least one divided subgraph is generated, for example, a hash digest of the attribute information of each node after traversal in a certain order.
The node traversal may be: determine the head node of the subgraph, traverse each node breadth-first, and generate a node processing queue. Attribute information is generated for each node in turn from the node queue, for example, the node attribute information (Attr Info of Op) shown in FIG. 4, which includes, but is not limited to, the operation type (Op Type), the number of inputs (Input Num), the shape of each input (Input[0] Shape, Input[1] Shape, etc.) and the operation types of the input nodes (e.g., Input[0] Op Type, Input[1] Op Type, etc.), the number of outputs (Output Num), the shapes of the outputs (Output[0] Shape, etc.), and the operation types of the output nodes (e.g., Output[0] Op Type). The above node attribute information can be encoded into storable information; the encoding may use a hash algorithm (e.g., MD5).
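A minimal sketch of this step might look as follows: breadth-first traversal from the head node, followed by an MD5 digest over the concatenated per-node attribute records. The record layout is a hypothetical stand-in for the Attr Info of Op of FIG. 4.

```python
import hashlib
from collections import deque

def subgraph_digest(head, succ, attrs):
    """BFS from the head node, then hash the concatenated attribute records.

    head  -- name of the subgraph's head node
    succ  -- dict: node name -> list of successor node names
    attrs -- dict: node name -> attribute dict (op type, input/output info, ...)
    """
    order, seen, queue = [], {head}, deque([head])
    while queue:                      # breadth-first node processing queue
        node = queue.popleft()
        order.append(node)
        for nxt in succ.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    # Serialize each node's attribute info in traversal order, then digest.
    records = "|".join(f"{n}:{sorted(attrs[n].items())}" for n in order)
    return hashlib.md5(records.encode()).hexdigest()

succ = {"conv1": ["bn1"], "bn1": ["scale1"], "scale1": []}
attrs = {
    "conv1":  {"op_type": "Conv", "input_num": 1, "output_num": 1},
    "bn1":    {"op_type": "BatchNorm", "input_num": 1, "output_num": 1},
    "scale1": {"op_type": "Scale", "input_num": 1, "output_num": 1},
}
print(subgraph_digest("conv1", succ, attrs))  # stable hex digest for this topology
```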
In addition to the above subgraph topology information, a topological feature vector may be used to represent the topology information of a subgraph. Among the subgraphs of a neural network, there are subgraphs whose node orderings differ but whose structures are equivalent. For example, in the subgraph of FIG. 5, the top Conv layer has outputs in three downward directions: Output:0, Output:1, and Output:2. The two subgraphs shown in FIG. 5 are subgraphs with different topological orderings but equivalent structures.
For subgraphs with different node orderings but equivalent structures, the order of traversing the nodes differs, and therefore different digests of node attribute information may be generated. Structurally equivalent subgraphs can generally use the same optimization strategy. Preferably, the topology information of structurally equivalent subgraphs should be represented uniquely, to avoid using different topological representations in the topology feature library for subgraphs that are structurally equivalent. For example, a GraphSAGE graph neural network and a feed-forward NN network are employed to generate the topological feature vector of a subgraph, which can uniquely represent the topology of structurally equivalent subgraphs. Specifically, the adjacency matrix of the subgraph and the topological attribute information of each node of the subgraph (e.g., the hash digest of node attributes described above) are input into the GraphSAGE graph neural network. GraphSAGE is an algorithm known in the art for generating an embedding for each node. The embeddings of the nodes of the subgraph are then concatenated and input into the feed-forward neural network to generate the topological feature vector of the subgraph. The above manner of generating the topological feature vector of a subgraph is exemplary; other various manners known in the art can also be used to generate a topological feature vector that uniquely represents the topology of structurally equivalent subgraphs.
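The toy NumPy sketch below illustrates the shape of this computation, mean-aggregation GraphSAGE-style layers followed by a feed-forward projection of the concatenated node embeddings. Real GraphSAGE uses trained weights (and, for true uniqueness across equivalent orderings, an order-invariant readout); the random weights and all array sizes here are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sage_layer(H, A, W):
    """One GraphSAGE-style layer with mean aggregation.
    H: (n, d) node features; A: (n, n) adjacency; W: (2d, d) weights."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    neigh = (A @ H) / deg                     # mean of neighbor features
    return np.tanh(np.concatenate([H, neigh], axis=1) @ W)

def topology_feature_vector(A, node_feats, d_out=16):
    """Two SAGE layers, then concatenate node embeddings and project
    with a feed-forward layer to get the subgraph's feature vector."""
    n, d = node_feats.shape
    W1 = rng.standard_normal((2 * d, d))      # stand-ins for trained weights
    W2 = rng.standard_normal((2 * d, d))
    H = sage_layer(node_feats, A, W1)
    H = sage_layer(H, A, W2)
    W_ffn = rng.standard_normal((n * d, d_out))
    return np.tanh(H.reshape(-1) @ W_ffn)

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
feats = rng.standard_normal((3, 8))           # e.g. encodings of per-node digests
print(topology_feature_vector(A, feats).shape)  # (16,)
```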
In this embodiment, a topology feature library is provided, which stores the topology information of known subgraphs and the corresponding optimization strategies. In the topology feature library, the optimization strategy of a subgraph is stored in association with the topology information of that subgraph; at initialization, the topology information and corresponding optimization strategies of some known subgraphs are stored in the library. For example, entries may be used to represent the mapping between the topology information of subgraphs and their optimization strategies. Here, the topology information may be the per-node attribute information of the subgraph described above, or the topology feature vector of the subgraph described above: either may be stored in the topology feature library to represent the topology of the subgraph. These are merely exemplary; other information known in the art that reflects the topology of a subgraph of a neural network may also be used.
Next, as shown in step S140, the topology feature library is queried to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided subgraph. If so, the corresponding optimization strategy is extracted from the topology feature library and applied directly to the at least one divided subgraph, without performing compilation optimization computation on that subgraph again. Specifically, the topology feature library is queried for an optimization strategy of a known subgraph whose topology information (e.g., topology feature vector) is consistent with that of the at least one divided subgraph; for example, the topology feature vector of the subgraph generated in step S130 is compared with the topology feature vectors of the known subgraphs in the library to find a known subgraph whose topology information (e.g., topology feature vector) is consistent with that of the at least one divided subgraph.
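A minimal sketch of such a library and its consistency check, assuming exact matching on a rounded feature vector (the matching tolerance, the in-memory dict backend, and all names are illustrative assumptions):

    import hashlib
    import numpy as np

    class TopologyFeatureLibrary:
        """Maps a subgraph topology key to a stored optimization strategy."""
        def __init__(self):
            self._entries = {}

        @staticmethod
        def _key(feature_vector, decimals=6):
            # Round before hashing so tiny float noise cannot break matching.
            rounded = np.round(np.asarray(feature_vector, float), decimals)
            return hashlib.md5(rounded.tobytes()).hexdigest()

        def lookup(self, feature_vector):
            return self._entries.get(self._key(feature_vector))  # None if new

        def add(self, feature_vector, strategy):
            self._entries[self._key(feature_vector)] = strategy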
The compilation optimization strategy may be any of the various compilation optimization strategies known in the art for neural network structures. For example, it may be common subexpression elimination or constant folding for a certain subgraph; it may be an equivalent subgraph transformation, i.e., a graph replacement operation in which the subgraph is replaced with another computation graph; it may be a graph merging operation in which several parts of the computation graph are merged; or it may be operator fusion in which the operators of certain nodes are fused together. The optimization strategies are not limited to those listed above and may be any other type of optimization strategy known in the art.
If the topology feature library contains no optimization strategy of a known subgraph whose topology is consistent with that of the at least one divided subgraph, compilation optimization computation is performed on the at least one divided subgraph, as shown in step S150.
Compilation optimization computation here refers to optimizing and restructuring the subgraph so as to reduce its computation amount or increase its computation speed. For example, an equivalent subgraph may be generated through optimization computation, reducing the computation amount or increasing the computation speed, and the subgraph is then replaced with the equivalent subgraph, i.e., graph replacement. Alternatively, the optimization computation may determine that multiple parts of the subgraph can be merged to reduce the computation amount or increase the computation speed, and those parts are then merged, i.e., graph merging. Alternatively, the optimization computation may determine that at least part of the subgraph is equivalent to a constant, and that part of the subgraph is replaced with the constant, i.e., constant folding. Alternatively, the optimization computation may determine that the operators of several consecutive layers in the subgraph can be fused to relieve the bandwidth bottleneck of accessing external memory, and the operators of those consecutive layers are fused.
The above compilation optimization computations are merely exemplary; any compilation optimization computation that restructures the subgraph to reduce its computation amount or increase its computation speed falls within the scope of the present application.
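As a worked example of one such computation, the sketch below constant-folds the toy graph encoding introduced earlier; the set of foldable operators is a hypothetical placeholder:

    import operator

    FOLDABLE = {'add': operator.add, 'mul': operator.mul}  # hypothetical op set

    def fold_constants(graph):
        """Repeatedly replace operators whose inputs are all constants."""
        changed = True
        while changed:
            changed = False
            for name, node in list(graph.items()):
                if node['op'] in FOLDABLE and node['inputs'] and all(
                    graph[i]['op'] == 'const' for i in node['inputs']
                ):
                    vals = [graph[i]['value'] for i in node['inputs']]
                    graph[name] = {'op': 'const', 'inputs': [],
                                   'value': FOLDABLE[node['op']](*vals)}
                    changed = True
        return graph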
After optimization, as shown in step S160, an optimization strategy corresponding to the at least one new subgraph is generated; then, as shown in step S170, the at least one new subgraph and its optimization strategy are added to the topology feature library. For example, for the graph replacement operation described above, the topology of the at least one new subgraph and the topology of the replacement graph are added to the topology feature library in association. For a graph merging operation, the topology of the at least one new subgraph and the topology of the optimized graph are added to the library in association. For constant folding, the topology of the at least one new subgraph and the equivalent constant may be added to the library in association. These are all exemplary and are not limited to the optimization strategies listed above.
In the compilation optimization computation of step S150, the at least one subgraph may be divided into several smaller subgraphs, and compilation optimization computation may be performed on at least one of the smaller subgraphs. After optimization, as shown in step S160, optimization strategies of the optimized smaller subgraphs are generated, and each optimized smaller subgraph and its corresponding optimization strategy are stored in the topology feature library. Alternatively, an optimization strategy of the original subgraph before division may be generated, and the original subgraph and its corresponding optimization strategy may be stored in the topology feature library.
As described above, at initialization the topology information and corresponding optimization strategies of several known subgraphs are stored in the topology feature library. During the subsequent compilation optimization of the computation graphs of various neural networks, whenever a new subgraph is encountered, optimization computation is performed on it; after optimization, the new subgraph and its optimization strategy can be added to the topology feature library, so that when the same subgraph is encountered again, the corresponding optimization strategy can be retrieved directly from the library and applied, without repeating the optimization computation.
Based on the above technical solution of the embodiments of the present application, a neural network topology feature library is provided, in which known subgraphs of neural networks and their optimization strategies are stored. During compilation, for a neural network subgraph encountered for the first time, a corresponding entry is created and the corresponding optimization strategy is stored. When the same neural network subgraph is compiled again later, only the topology feature library needs to be queried: for an existing subgraph, no further optimization computation is required, and the corresponding compilation optimization strategy is obtained directly. This avoids repeated optimization computation on identical neural network subgraphs and increases compilation speed.
A neural network compilation optimization method according to another embodiment of the present application is described in detail below with reference to FIG. 6.
The neural network compilation optimization method of the embodiment of FIG. 6 differs from that of the embodiment of FIG. 2 in the following respects. The computation graph is first divided into several coarse-grained subgraphs, and the topology feature library is queried to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one coarse-grained subgraph. If not, the coarse-grained subgraph is further divided into fine-grained subgraphs, and the topology feature library is queried again to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one fine-grained subgraph. If not, optimization computation is performed on the at least one new fine-grained subgraph, or on the at least one new coarse-grained subgraph before division; after optimization, the at least one new coarse-grained subgraph or the at least one new fine-grained subgraph and the corresponding optimization strategy are added to the topology feature library. The specific flow is as follows.
Referring to FIG. 6, after the computation graph of the neural network is generated in step S210, the computation graph is divided into coarse-grained subgraphs, as shown in step S220. The coarse-grained subgraph division mentioned here may use the same technique as the division of the computation graph in the embodiment of FIG. 2; for example, a clustering-based graph-cut algorithm operating on the adjacency matrix may be used to divide the computation graph into several coarse-grained subgraphs. Various other subgraph division techniques known in the art may also be used, including but not limited to K-Means, Graph Community Detection, and the like.
Then, as shown in step S230 and similarly to step S130 of FIG. 2, topology information of the coarse-grained subgraph is generated, for example the topology feature vector of the subgraph. The topology feature vector may be generated using the GraphSAGE-and-feedforward-NN approach described in the embodiment of FIG. 2.
Then, as shown in step S240, the topology feature library is queried to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided coarse-grained subgraph. If so, as shown in steps S282 and S284, the corresponding optimization strategy is extracted from the topology feature library and applied to the at least one divided coarse-grained subgraph. If not, the coarse-grained subgraph is further divided into several fine-grained subgraphs, as shown in step S250.
As shown in FIG. 7, a max-flow/min-cut approach may be used to cut a coarse-grained subgraph into fine-grained subgraphs. The specific steps are as follows: set a weight on each edge of the subgraph, for example according to the optimization items of the software stack performing the neural network compilation (such as operator buffer data reuse, data-flow template matching, etc.); then cut the edges of the subgraph in ascending order of their weights. This manner of setting edge weights is exemplary; other criteria known in the art may also be used.
As shown in FIG. 7, the edge with the smallest weight on the subgraph is cut first, e.g., the edge with weight 1 in FIG. 7; edges with larger weight values, e.g., the edge with weight 2, are cut next, thereby obtaining the subgraphs containing only two nodes shown on the far right of FIG. 7. The fine-grained division of FIG. 7 is merely exemplary; it is not necessary to divide down to two-node subgraphs as shown on the far right of FIG. 7, and the granularity of the fine-grained division may be set as needed.
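A minimal sketch of this weight-ordered cutting follows, assuming the coarse-grained subgraph is held as an undirected networkx graph whose edges already carry the optimization-derived 'weight' attribute; the stopping criterion shown (a maximum component size) is an assumption made for illustration:

    import networkx as nx

    def fine_grained_split(g, max_nodes=2):
        """Cut edges from the smallest weight upward until components are small."""
        g = g.copy()
        for u, v, data in sorted(g.edges(data=True),
                                 key=lambda e: e[2]['weight']):
            if max(len(c) for c in nx.connected_components(g)) <= max_nodes:
                break
            g.remove_edge(u, v)
        return [g.subgraph(c).copy() for c in nx.connected_components(g)]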
After the fine-grained subgraphs have been generated through the above division, as shown in step S260, the topology feature library is further queried to determine whether it contains an optimization strategy of a known subgraph whose topology is consistent with that of at least one fine-grained subgraph. If so, as shown in steps S282 and S284, the corresponding optimization strategy is extracted from the topology feature library and applied to the at least one fine-grained subgraph.
If not, as shown in step S270, compilation optimization computation is performed on the at least one fine-grained subgraph and, after optimization, a corresponding optimization strategy is generated; then, as shown in step S280, the topology information of the at least one new fine-grained subgraph and its corresponding optimization strategy are added to the topology feature library. Alternatively, the compilation optimization computation may be performed on the at least one new coarse-grained subgraph before division; after optimization, a corresponding optimization strategy is generated, and the topology information of the at least one new coarse-grained subgraph before division and its corresponding optimization strategy are added to the topology feature library.
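Putting the steps of FIG. 6 together, the following sketch shows the coarse-to-fine lookup flow, reusing the hypothetical helpers above; feature_vector, apply_strategy, and run_optimization_computation are assumed stand-ins for the GraphSAGE+FFN step, the application of a strategy, and the compilation optimization computation, respectively:

    def optimize_coarse_subgraph(coarse, library):
        strategy = library.lookup(feature_vector(coarse))     # step S240
        if strategy is not None:
            return [apply_strategy(coarse, strategy)]         # steps S282/S284
        results = []
        for fine in fine_grained_split(coarse):               # step S250
            s = library.lookup(feature_vector(fine))          # step S260
            if s is None:
                s = run_optimization_computation(fine)        # step S270
                library.add(feature_vector(fine), s)          # step S280
            results.append(apply_strategy(fine, s))
        return results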
Similarly to the embodiment of FIG. 2, at initialization the topology information and corresponding optimization strategies of several known subgraphs are stored in the topology feature library. During the subsequent compilation optimization of the computation graphs of various neural networks, whenever a new subgraph is encountered, optimization computation is performed on it, after which the new subgraph and its optimization strategy can be added to the topology feature library, so that when the same subgraph is encountered again, the corresponding optimization strategy can be retrieved directly from the library and applied to it. This avoids repeated optimization computation on subgraphs of known neural networks and increases compilation speed.
The above embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more pieces of computer program code or computer program instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The computer program code or computer program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program code or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available medium may be a magnetic medium, e.g., a floppy disk, hard disk, or magnetic tape; an optical medium, e.g., a DVD; or a semiconductor medium, e.g., a solid state drive (Solid State Disk, SSD).
The flowcharts and block diagrams in the drawings of the present application illustrate possible architectures, functions, and operations of systems and methods according to various embodiments of the present application. Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions of the logic involved. Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
A neural network compilation server 100 that executes the methods of the embodiments of the present application is described in detail below with reference to FIG. 8.
As shown in FIG. 8, the neural network compilation server 100 may be any of the computation-graph compilation servers known in the art, or any other type of server known in the art that implements neural network compilation optimization.
The neural network compilation server 100 includes a processor 110, a neural network processor 120, a storage device 130, and an encryption chip 140.
The processor 110 may be a general-purpose processor or a processor specially designed for a specific technical field. For example, it may be a central processing unit (CPU), a digital signal processor (DSP), a microcontroller (micro control unit, MCU), or another type of processor. For example, the processor 110 may be any of various types of single-core or multi-core CPU processors such as Intel or ARM processors, e.g., an Intel Xeon CPU.
The neural network processor 120 may be any of various types of dedicated neural network processors, for example a processor specially designed for artificial intelligence (AI) applications, including but not limited to a neural network processing unit (NPU), a GPU, and a tensor processing unit (TPU).
The term "dedicated neural network processor" used in the present application may also be abbreviated as "neural network processor" or "NN processor". The dedicated neural network processor may be implemented as a dedicated deep learning processor or a deep learning processor.
The storage device 130 may store the topology feature library 132 of the embodiments of the present application. The topology feature library (i.e., knowledge base) 132 is used to record the topology information of neural networks and the corresponding compilation optimization strategies. The topology feature library 132 may adopt a high-speed in-memory database, for example a Mongo database, or various other types of databases. The storage device 130 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor. The permanent storage may be a readable and writable storage device, i.e., a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage. The above architecture of the neural network compilation server 100 is exemplary; other types of hardware architecture may also be included to implement the methods of the embodiments of the present application.
The storage device 130 further stores computer program code or computer program instructions which, when processed by the neural network compilation server 100, cause the neural network compilation server 100 to execute the neural network compilation optimization methods of the present application described above.
When executing the methods of the above embodiments of the present application, the processor 110 is configured to execute the main steps of the methods. When executing, e.g., step S130, the step of running the hash algorithm (e.g., MD5) to encode the node attribute information is performed by the dedicated encryption chip 140 (e.g., a hardware acceleration chip 140 compliant with national cryptographic standards). The step of generating the topology feature vector of a subgraph using the graph neural network GraphSAGE and the feedforward NN in the methods of the above embodiments is executed by the neural network processor 120.
FIG. 9 shows a schematic structural diagram of a computing device 200 that can be used to implement the above neural network compilation optimization method according to an embodiment of the present application.
Referring to FIG. 9, the computing device 200 includes a processor 210 and a memory 220. The computing device of the present application may be any of various types of computing devices, including but not limited to cloud computing devices, edge computing devices of various networks (e.g., wired or wireless communication networks, smart power grids, the Internet of Things, etc.), and user terminal devices.
The processor 210 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 210 may include a general-purpose main processor and one or more special coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), a neural network processing unit (NPU), or a tensor processing unit (TPU). The memory 220 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 210. The permanent storage may be a readable and writable storage device, i.e., a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk drive or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory, and may store some or all of the instructions and data the processor needs at runtime. In addition, the memory 220 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used. In some embodiments, the memory 220 may include readable and/or writable removable storage devices, such as compact discs (CDs), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, ultra-density discs, flash memory cards (e.g., SD, miniSD, and Micro-SD cards), and magnetic floppy disks.
The memory 220 stores computer program code or computer program instructions which, when processed by the processor 210, cause the processor 210 to execute the neural network compilation optimization methods of the present application described above.
In addition, the methods of the present application may also be implemented as a computer program or computer program product including computer program code instructions for executing the steps defined in the above methods of the present application.
Alternatively, the present application may also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to execute the steps of the above methods according to the present application.
FIG. 10 shows a neural network compilation optimization apparatus 300 that can be used to implement the above neural network compilation optimization method according to an embodiment of the present application. The neural network compilation optimization apparatus 300 includes a division unit 310, a query unit 320, and an optimization unit 330.
When the method of the embodiments of the present application is executed, the division unit 310 is configured to divide the computation graph of a neural network into several subgraphs; the query unit 320 is configured to query the topology feature library for an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided subgraph; and the optimization unit 330 is configured to, if such a strategy exists, extract from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph, to optimize the at least one subgraph.
Further, the topology feature library includes the topology feature vectors of known subgraphs and the corresponding optimization strategies, and the query unit 320 is further configured to query whether there is an optimization strategy of a known subgraph whose topology feature vector is consistent with that of the at least one divided subgraph.
Further, if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of the at least one divided subgraph, the optimization unit 330 is further configured to: perform compilation optimization computation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generate an optimization strategy of the at least one divided subgraph; and add the topology feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
Further, the optimization unit 330 is configured to, when performing compilation optimization computation on the at least one divided subgraph, determine an equivalent subgraph of the at least one divided subgraph through optimization computation and replace the at least one divided subgraph with the equivalent subgraph; and, when performing the adding, store the topology feature vector of the at least one divided subgraph and the topology feature vector of the equivalent subgraph in the topology feature library in association.
Further, the optimization unit 330 is configured to, when performing compilation optimization computation on the at least one divided subgraph, determine through optimization computation that multiple parts of the subgraph can be merged to reduce the computation amount of the subgraph or increase its computation speed, and merge the multiple parts; and, when performing the adding, add the topology feature vector of the subgraph and the topology feature vector of the merged subgraph to the topology feature library in association.
Further, the optimization unit 330 is configured to, when performing compilation optimization computation on the at least one divided subgraph, determine an equivalent constant of the subgraph through optimization computation and replace the subgraph with the constant; and, when performing the adding, add the topology feature vector of the subgraph and the constant to the topology feature library in association.
Further, the division unit 310 is further configured to divide the computation graph into coarse-grained subgraphs using a clustering technique.
Further, if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one divided coarse-grained subgraph, the division unit 310 is configured to divide the at least one divided coarse-grained subgraph into several fine-grained subgraphs; the query unit 320 is configured to query the topology feature library for an optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one fine-grained subgraph; and, if such an optimization strategy exists, the optimization unit 330 is configured to extract from the topology feature library the optimization strategy of the known subgraph whose topology feature vector is consistent with that of the at least one fine-grained subgraph, to optimize the at least one fine-grained subgraph.
Further, if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one fine-grained subgraph, the optimization unit 330 is configured to: perform compilation optimization computation on the at least one fine-grained subgraph; generate an optimization strategy of the at least one fine-grained subgraph; and add the topology feature vector of the at least one fine-grained subgraph and its optimization strategy to the topology feature library.
The above are merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (21)

  1. A neural network compilation optimization method, comprising:
    dividing a computation graph of a neural network into several subgraphs;
    querying a topology feature library for an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided subgraph;
    if such a strategy exists, extracting from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph, to optimize the at least one subgraph.
  2. The method according to claim 1, wherein the topology feature library comprises topology feature vectors of known subgraphs and corresponding optimization strategies, and the querying step comprises querying whether there is an optimization strategy of a known subgraph whose topology feature vector is consistent with that of the at least one divided subgraph.
  3. The method according to claim 2, wherein the method further comprises: if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of the at least one divided subgraph, performing compilation optimization computation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generating an optimization strategy of the at least one divided subgraph; and adding the topology feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
  4. The method according to claim 3, wherein performing compilation optimization computation on the at least one divided subgraph comprises determining an equivalent subgraph of the at least one divided subgraph through optimization computation and replacing the at least one divided subgraph with the equivalent subgraph; and the adding step comprises storing the topology feature vector of the at least one divided subgraph and the topology feature vector of the equivalent subgraph in the topology feature library in association.
  5. The method according to claim 3, wherein performing compilation optimization computation on the at least one divided subgraph comprises determining that multiple parts of the at least one divided subgraph can be merged to reduce the computation amount of the subgraph or increase the computation speed of the subgraph, and merging the multiple parts; and the adding step comprises adding the topology feature vector of the at least one divided subgraph and the topology feature vector of the merged subgraph to the topology feature library in association.
  6. The method according to claim 3, wherein performing compilation optimization computation on the at least one divided subgraph comprises determining an equivalent constant of the at least one divided subgraph and replacing the at least one divided subgraph with the equivalent constant; and the adding step comprises adding the topology feature vector of the at least one divided subgraph and the equivalent constant to the topology feature library in association.
  7. The method according to any one of claims 1-3, wherein the dividing step comprises dividing the computation graph into several coarse-grained subgraphs using a clustering technique.
  8. The method according to claim 7, wherein the method further comprises: if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one divided coarse-grained subgraph, dividing the at least one divided coarse-grained subgraph into several fine-grained subgraphs, and querying the topology feature library for an optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one fine-grained subgraph; and, if such an optimization strategy exists, extracting from the topology feature library the optimization strategy of the known subgraph whose topology feature vector is consistent with that of the at least one fine-grained subgraph, to optimize the at least one fine-grained subgraph.
  9. The method according to claim 8, wherein, if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one fine-grained subgraph: compilation optimization computation is performed on the at least one fine-grained subgraph; an optimization strategy of the at least one fine-grained subgraph is generated; and the topology feature vector of the at least one fine-grained subgraph and its optimization strategy are added to the topology feature library.
  10. An apparatus for neural network compilation optimization, comprising:
    a processor and a memory, the processor being configured to execute program instructions stored in the memory, so that the apparatus implements the method according to any one of claims 1 to 9.
  11. A computer-readable storage medium, wherein the computer-readable storage medium stores program code which, when executed by a computer, implements the method according to any one of claims 1 to 9.
  12. A computer program product, wherein, when the program code contained in the computer program product is executed by a computer, the method according to any one of claims 1 to 9 is implemented.
  13. An apparatus for neural network compilation optimization, comprising:
    a division unit configured to divide a computation graph of a neural network into several subgraphs;
    a query unit configured to query a topology feature library for an optimization strategy of a known subgraph whose topology is consistent with that of at least one divided subgraph;
    an optimization unit configured to, if such a strategy exists, extract from the topology feature library the optimization strategy of the known subgraph whose topology is consistent with that of the at least one divided subgraph, to optimize the at least one subgraph.
  14. The apparatus according to claim 13, wherein the topology feature library comprises topology feature vectors of known subgraphs and corresponding optimization strategies, and the query unit is further configured to: generate a topology feature vector of the at least one divided subgraph, and query whether there is an optimization strategy of a known subgraph whose topology feature vector is consistent with that of the at least one divided subgraph.
  15. The apparatus according to claim 14, wherein, if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of the at least one divided subgraph, the optimization unit is further configured to: perform compilation optimization computation on the at least one divided subgraph; after the compilation optimization of the at least one divided subgraph, generate an optimization strategy of the at least one divided subgraph; and add the topology feature vector of the at least one divided subgraph and its optimization strategy to the topology feature library in association.
  16. The apparatus according to claim 15, wherein the optimization unit is configured to, when performing compilation optimization computation on the at least one divided subgraph, determine an equivalent subgraph of the at least one divided subgraph through optimization computation and replace the at least one divided subgraph with the equivalent subgraph;
    the optimization unit is configured to, when performing the adding, store the topology feature vector of the at least one divided subgraph and the topology feature vector of the equivalent subgraph in the topology feature library in association.
  17. The apparatus according to claim 15, wherein the optimization unit is configured to, when performing compilation optimization computation on the at least one divided subgraph, determine that multiple parts of the subgraph can be merged to reduce the computation amount of the subgraph or increase the computation speed of the subgraph, and merge the multiple parts;
    the optimization unit is configured to, when performing the adding, add the topology feature vector of the subgraph and the topology feature vector of the merged subgraph to the topology feature library in association.
  18. The apparatus according to claim 15, wherein the optimization unit is configured to, when performing compilation optimization computation on the at least one divided subgraph, determine an equivalent constant of the subgraph and replace the subgraph with the constant;
    the optimization unit is configured to, when performing the adding step, add the topology feature vector of the subgraph and the constant to the topology feature library in association.
  19. The apparatus according to any one of claims 13-15, wherein the division unit is further configured to divide the computation graph into coarse-grained subgraphs using a clustering technique.
  20. The apparatus according to claim 19, wherein, if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one divided coarse-grained subgraph, the division unit is configured to divide the at least one divided coarse-grained subgraph into several fine-grained subgraphs, and the query unit is configured to query the topology feature library for an optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one fine-grained subgraph;
    if such an optimization strategy exists, the optimization unit is configured to extract from the topology feature library the optimization strategy of the known subgraph whose topology feature vector is consistent with that of the at least one fine-grained subgraph, to optimize the at least one fine-grained subgraph.
  21. The apparatus according to claim 20, wherein, if there is no optimization strategy of a known subgraph whose topology feature vector is consistent with that of at least one fine-grained subgraph, the optimization unit is configured to: perform compilation optimization computation on the at least one fine-grained subgraph; generate an optimization strategy of the at least one fine-grained subgraph; and add the topology feature vector of the at least one fine-grained subgraph and its optimization strategy to the topology feature library.
PCT/CN2020/123708 2020-10-26 2020-10-26 Neural network compilation optimization method and related apparatus WO2022087788A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080106362.2A 2020-10-26 2020-10-26 Neural network compilation optimization method and related apparatus
PCT/CN2020/123708 WO2022087788A1 (zh) 2020-10-26 2020-10-26 一种神经网络编译优化方法和相关装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/123708 WO2022087788A1 (zh) 2020-10-26 2020-10-26 一种神经网络编译优化方法和相关装置

Publications (1)

Publication Number Publication Date
WO2022087788A1 true WO2022087788A1 (zh) 2022-05-05

Family

Family ID: 81381557

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123708 WO2022087788A1 (zh) Neural network compilation optimization method and related apparatus 2020-10-26 2020-10-26

Country Status (2)

Country Link
CN (1) CN116368494A (zh)
WO (1) WO2022087788A1 (zh)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321999A (zh) * 2018-03-30 2019-10-11 Beijing Deephi Intelligent Technology Co., Ltd. Neural network computation graph optimization method
CN110766147A (zh) * 2018-07-25 2020-02-07 Xilinx, Inc. Neural network compiler architecture and compilation method
CN110764744A (zh) * 2018-07-25 2020-02-07 Xilinx, Inc. Intermediate representation generation method and apparatus for neural network computation
CN111160539A (zh) * 2018-11-08 2020-05-15 Samsung Electronics Co., Ltd. System and method for running a computational processing graph of an artificial neural network
US20200310955A1 (en) * 2019-03-25 2020-10-01 International Business Machines Corporation Reduced memory neural network training
CN111813949A (zh) * 2020-05-18 2020-10-23 National University of Defense Technology Cyberspace knowledge graph reasoning method and apparatus for joint queries

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115268936A (zh) * 2022-09-27 2022-11-01 Zhejiang Lab Optimization method and apparatus for computation graph compilation
CN116126346A (zh) * 2023-04-04 2023-05-16 Shanghai Enflame Technology Co., Ltd. Code compilation method and apparatus for an AI model, computer device, and storage medium
CN116126346B (zh) * 2023-04-04 2023-06-16 Shanghai Enflame Technology Co., Ltd. Code compilation method and apparatus for an AI model, computer device, and storage medium
CN117170686A (zh) * 2023-11-03 2023-12-05 Shenzhen Corerain Technologies Co., Ltd. Method and computing device for neural network compilation optimization
CN117170686B (zh) * 2023-11-03 2024-03-12 Shenzhen Corerain Technologies Co., Ltd. Method and computing device for neural network compilation optimization

Also Published As

Publication number Publication date
CN116368494A (zh) 2023-06-30


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20958953; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20958953; Country of ref document: EP; Kind code of ref document: A1)