WO2023029589A1 - Neural network compilation method and apparatus, device, and storage medium

Info

Publication number: WO2023029589A1
Application number: PCT/CN2022/093058 (CN2022093058W)
Authority: WO (WIPO/PCT)
Prior art keywords: target, neural network, sequence, topological, data
Other languages: French (fr), Chinese (zh)
Inventors: 勾志宏, 胡英俊, 徐宁仪, 曹雨
Original assignee: 上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Publication of WO2023029589A1 (en)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

Embodiments of the present disclosure provide a neural network compilation method and apparatus, a device, and a storage medium. In an example of the method, after the computation graph of a neural network to be compiled is determined, a target topological sequence is selected from among the multiple topological sequences of the computation graph, and the neural network is then compiled based on the target topological sequence to obtain machine instructions executable by a target chip. By first screening out the target topological sequence that the target chip executes most efficiently and then compiling the neural network, the computing power of the target chip can be exploited to the greatest extent.

Description

Method, apparatus, device, and storage medium for neural network compilation

Cross-reference statement

This application claims priority to Chinese patent application No. 202111013533.X, filed with the China Patent Office on August 31, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for compilation.

Background

After a neural network has been trained, it needs to be deployed on a target terminal to serve various requirements in different application scenarios. In the related art, when a neural network is deployed on a target terminal, a deep learning compiler generally parses the neural network into a fixed topological sequence according to preset parsing rules, so that during inference the target terminal operates on the network's input data in the operator execution order represented by that fixed topological sequence. Although a deep learning compiler can, during back-end optimization, fuse operators to some extent or adjust the execution order of some of them, these optimizations are still based on the topological sequence originally fed to the compiler. Consequently, when a neural network is deployed on different chips, the execution order of its operators is essentially fixed. However, the computation graph of a neural network usually contains many topological sequences, and for different types of chips, the topological sequence that the network executes most efficiently may differ. As a result, parsing a neural network into one fixed topological sequence according to preset rules may fail to fully exploit a chip's computing power, wasting resources and reducing the network's inference efficiency.

Summary

The present disclosure provides a method, apparatus, device, and storage medium for compilation.

According to a first aspect of the embodiments of the present disclosure, a compilation method is provided. The method includes: determining a computation graph corresponding to a neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent data flows in the neural network; determining a target topological sequence from among multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network; and generating, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.
In some embodiments, the target topological sequence is determined based on the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence.

In some embodiments, the target chip includes at least two kinds of computing units, and the at least two kinds of computing units are capable of performing different types of operations on input data in parallel.

In some embodiments, determining the target topological sequence from the multiple topological sequences of the computation graph includes: dividing the computation graph into multiple subgraphs, where each subgraph includes at least two sub-topological sequences; for each subgraph, determining a target sub-topological sequence from the at least two sub-topological sequences of the subgraph, where the target sub-topological sequence is determined based on the operation duration for which the target chip operates on the input data in the operator execution orders represented by the at least two sub-topological sequences; and obtaining the target topological sequence based on the target sub-topological sequence of each subgraph.

In some embodiments, dividing the computation graph into multiple subgraphs includes: determining multiple key nodes from among the nodes of the computation graph, where each key node is a convergence point of at least two paths in the computation graph; and dividing the computation graph into multiple subgraphs based on the multiple key nodes.

In some embodiments, splitting the computation graph into multiple subgraphs based on the multiple key nodes includes: forming one subgraph from at least two adjacent key nodes together with the nodes and edges located between the at least two key nodes.

In some embodiments, after the computation graph is split into multiple subgraphs, the method further includes: determining a target subgraph whose number of nodes is less than a preset number; and merging the target subgraph with a neighboring subgraph of the target subgraph.

In some embodiments, determining the machine instructions corresponding to the neural network based on the target topological sequence includes: determining the machine instructions corresponding to each target sub-topological sequence; and linking the machine instructions corresponding to each target sub-topological sequence according to the data flow in the computation graph to obtain the machine instructions corresponding to the neural network.

In some embodiments, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence is determined in one of the following ways: for each topological sequence, determining the machine instructions with which the target chip operates on the input data in the operator execution order represented by the topological sequence, and determining the operation duration based on the time the target chip takes to execute the machine instructions; or, for each topological sequence, determining the operation duration using a preset cost model, where the cost model is used to estimate the operation duration corresponding to the topological sequence from the hardware parameters of the target chip and the operator execution order represented by the topological sequence.

In some embodiments, using a preset cost model to determine the operation duration for which the target chip operates on the input data in the operator execution order represented by the topological sequence includes: determining the machine instructions with which the target chip operates on the input data in the operator execution order represented by each topological sequence; and determining the operation duration based on the preset cost model and the machine instructions.

In some embodiments, determining the computation graph corresponding to the neural network to be compiled includes: parsing the neural network to obtain an original computation graph corresponding to the neural network; and adjusting the operators in the original computation graph according to the memory size of the target chip and the amount of operation data corresponding to each operator in the original computation graph, so as to update the computation graph.

In some embodiments, adjusting the operators in the original computation graph according to the memory size of the target chip and the amount of operation data corresponding to each operator in the original computation graph includes: for each target operator in the original computation graph whose corresponding operation data exceeds a preset threshold, adding to the original computation graph at least one additional operator of the same type as the target operator, so that the operation data is split into multiple pieces that are then processed separately by the target operator and the newly added additional operator(s), where the preset threshold is determined based on the memory size of the target chip; and adjusting the original computation graph based on the newly added additional operator(s) so as to update the computation graph.

In some embodiments, splitting the operation data into multiple pieces includes: determining, based on the type of the target operator and the hardware performance parameters of the target chip, a split dimension along which the operation data is to be split; and splitting the data along the split dimension to obtain multiple pieces of data.

In some embodiments, the operation data includes image data, and the split dimension includes one or more of the following: a frame-number dimension of the image data, a channel dimension of the image data, a width dimension of the image data, and a height dimension of the image data.
According to a second aspect of the embodiments of the present disclosure, a compilation apparatus is provided. The apparatus includes: a computation graph determination module configured to determine a computation graph corresponding to a neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent data flows in the neural network; a screening module configured to determine a target topological sequence from among multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network; and a compilation module configured to determine, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device includes a processor, a memory, and computer instructions stored in the memory and executable by the processor; when executing the computer instructions, the processor can implement the method mentioned in the first aspect above.

According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. Computer instructions are stored on the storage medium, and the method mentioned in the first aspect above is implemented when the computer instructions are executed.

In the embodiments of the present disclosure, the computation graph of a neural network to be compiled can be determined; a target topological sequence that the target chip executes more efficiently, i.e., with a shorter operation duration, is screened out from among the multiple topological sequences of the computation graph; and the neural network is then compiled based on the target topological sequence into machine instructions for the target chip to execute. By first screening out a target topological sequence that the target chip executes efficiently and then compiling the neural network, the computing power of the target chip can be exploited to the greatest extent, improving the target chip's processing efficiency during inference based on the neural network.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.

Brief description of the drawings

The accompanying drawings here illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
FIG. 1 is a schematic diagram of a computation graph according to an embodiment of the present disclosure.

FIG. 2 is a flowchart of a neural network compilation method according to an embodiment of the present disclosure.

FIG. 3 is a schematic diagram of a computation graph according to an embodiment of the present disclosure.

FIG. 4 is a schematic diagram of the key nodes of a computation graph and of dividing the computation graph according to an embodiment of the present disclosure.

FIG. 5 is a schematic diagram of adding operators in a computation graph to adjust the computation graph according to an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of the logical structure of a neural network compilation apparatus according to an embodiment of the present disclosure.

FIG. 7 is a schematic diagram of the logical structure of an electronic device according to an embodiment of the present disclosure.
Detailed description

Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of multiple items, or any combination of at least two of multiple items.

It should be understood that although the terms first, second, third, and so on may be used in the present disclosure to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".

In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present disclosure, and to make the above objects, features, and advantages of the embodiments of the present disclosure more apparent and comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.

After a neural network has been trained, it needs to be deployed on a specific target terminal so that it can be put to use. Compared with the training stage, inference at the application stage places higher demands on performance, which requires the deep learning compiler to exploit the hardware's computing power as fully as possible. When a neural network is deployed to an inference chip on a target terminal, the neural network is usually converted into a computation graph by a deep learning inference framework; the computation graph is then optimized, the neural network is compiled into binary instructions based on the optimized computation graph, and the binary instructions are executed on the target terminal to complete the neural network's inference process. Through compilation and optimization, complex neural networks can be applied even on target terminals with limited computing power, such as mobile terminals.
FIG. 1 is a schematic diagram of a computation graph; a computation graph is a directed acyclic graph. Nodes in the graph represent operators in the neural network: for example, Relu denotes activation, Conv denotes convolution, MatMul denotes matrix multiplication, Mean denotes averaging, Add denotes addition, Apxy denotes vector summation, Sub denotes subtraction, and Softmax denotes activation. The edges in the graph indicate the flow of the neural network's input data. Usually, a computation graph may include multiple topological sequences, each of which represents a specific execution order of the operators in the graph. For example, in the computation graph shown in FIG. 1, Relu-Conv1-MatMul-Mean-Add1-Conv2-Apxy-Sub-Add2-Softmax is one topological sequence, while Relu-Conv2-Apxy-Sub-Add2-Conv1-MatMul-Mean-Add1-Softmax is another.
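As an illustration of how one DAG admits many operator orders, the following Python sketch (illustrative only, not part of the disclosure; the graph contents merely echo the FIG. 1 example) enumerates every topological sequence of a small computation graph by backtracking over zero-in-degree nodes:

```python
def all_topological_orders(graph):
    """Enumerate every topological order of a DAG given as {node: [successors]}."""
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1

    order, results = [], []

    def backtrack():
        if len(order) == len(graph):
            results.append(list(order))
            return
        for n in graph:
            # A node is schedulable once all of its predecessors have run.
            if indegree[n] == 0 and n not in order:
                for s in graph[n]:
                    indegree[s] -= 1
                order.append(n)
                backtrack()
                order.pop()
                for s in graph[n]:
                    indegree[s] += 1

    backtrack()
    return results

# Simplified fragment of the FIG. 1 graph: two independent branches after Relu.
g = {
    "Relu": ["Conv1", "Conv2"],
    "Conv1": ["MatMul"], "MatMul": [],
    "Conv2": ["Apxy"], "Apxy": [],
}
for seq in all_topological_orders(g):
    print(seq)  # six valid interleavings of the Conv1 and Conv2 branches
```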
In the related art, when a neural network is deployed to a target terminal, it is usually parsed into a fixed topological sequence according to predetermined parsing rules, and the neural network is then compiled into binary instructions based on that fixed sequence for the chip of the target terminal to execute. That is, during inference, the chip on the target terminal operates on the input data in the operator execution order represented by the fixed topological sequence: the execution order of the network's operators on the chip is fixed. However, the computation graph of a neural network may include multiple topological sequences, and for different chips, different operator execution orders may execute with different efficiency. For example, some chips include multiple kinds of computing units, and different kinds of computing units may be used to execute different types of operators; convolution and matrix multiplication, say, may be executed by a dedicated computing unit, while other types of operators are executed by another computing unit. Different operator execution orders (i.e., different topological sequences) may affect the degree of parallelism among the computing units on the chip. As a result, a fixed topological sequence may leave some computing units on the chip idle for a long time (for example, waiting for the computation results of other operators), which in turn degrades the chip's overall inference efficiency.

To address the above problems, the embodiments of the present disclosure provide a compilation method that can, based on the hardware performance of the target chip that will run the neural network, screen out from the topological sequences included in the neural network's computation graph a target topological sequence that executes relatively efficiently on the target chip, obtain the machine instructions corresponding to the neural network based on that target topological sequence, and output them to the target chip for execution, thereby exploiting the target chip's computing power to the greatest extent and improving the efficiency with which the target chip performs inference with the neural network.

The compilation method provided in the embodiments of the present disclosure can be used in deep learning compilation and optimization tools such as deep learning compilers or the toolchains of AI chips. A deep learning compilation and optimization tool can optimize a neural network and compile it into machine-recognizable binary instructions for output to the chip on the target terminal for execution.
Specifically, as shown in FIG. 2, the neural network compilation method provided by the embodiments of the present disclosure may include the following steps:

S202: Determine a computation graph corresponding to the neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent data flows in the neural network.

S204: Determine a target topological sequence from among multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network.

S206: Determine, based on the target topological sequence, machine instructions corresponding to the neural network, so that the target chip executes the machine instructions.
In step S202, the neural network to be compiled may be parsed to obtain its computation graph. For example, the Caffe model file corresponding to the neural network may be parsed to determine the computation graph. The neural network to be compiled may be of various kinds, such as a convolutional neural network. The computation graph of a neural network represents the entire computation flow of data from the network's input to its output and is a directed acyclic graph. A node in the computation graph may represent one kind of operator in the neural network, such as convolution, matrix multiplication, activation, addition, division, or averaging, and the direction of the arrows on the edges may represent the data flow in the neural network.

In step S204, after the computation graph of the neural network has been determined, a target topological sequence may be determined from among the multiple topological sequences of the graph. Generally speaking, a computation graph may include multiple topological sequences, each of which represents a specific execution order of the operators in the neural network. Taking the relatively simple computation graph shown in FIG. 3 as an example, the topological sequences it includes are Conv-MatMul-Mean-Softmax and Conv-Mean-MatMul-Softmax; the operators execute in different orders in the different sequences. Different topological sequences may execute with different efficiency on the target chip, so a target topological sequence that the target chip executes more efficiently can be selected from the multiple sequences. For example, the hardware performance parameters of the target chip, such as the kinds and number of computing units in the target chip, the computing power of the computing units, or the memory size, may be used to determine a topological sequence whose execution efficiency is relatively high (e.g., above a certain threshold) as the target topological sequence. Alternatively, based on the time the target chip takes to operate on the neural network's input data in a topological sequence's order and obtain the final output, a topological sequence whose operation duration is less than a preset duration may be determined as the target topological sequence. Any approach that can select a target topological sequence with relatively high execution efficiency from multiple topological sequences is applicable; the embodiments of the present disclosure impose no limitation.

The target chip may be any chip that performs inference with a neural network, such as a CPU, a GPU, various AI chips, or other chips with neural-network inference capability; the embodiments of the present disclosure impose no limitation on this. In step S206, after the target topological sequence has been determined, the neural network may be compiled based on the target topological sequence to obtain machine instructions, and the machine instructions are then input to the target chip so that the target chip can complete the neural network's inference process by executing them.

By determining the computation graph of the neural network to be compiled, screening out from the graph's multiple topological sequences a target topological sequence that the target chip executes efficiently, and then compiling the neural network based on the target topological sequence into machine instructions for the target chip to execute, that is, by first selecting a better target topological sequence according to the target chip's hardware capability and then compiling, the target chip's computing power can be exploited to the greatest extent, improving the target chip's processing efficiency during inference based on the neural network.
In some embodiments, the target topological sequence may be determined from the multiple topological sequences according to the time the target chip takes to operate on the neural network's input data in the operator execution order represented by each sequence. The operation duration reflects the sequence's execution efficiency on the target chip: the shorter the duration, the higher the efficiency. A topological sequence whose operation duration satisfies a certain condition may therefore be selected from the multiple sequences as the target topological sequence; for example, the sequence with the shortest operation duration may be selected. Of course, in some scenarios, traversing every topological sequence to find the one with the shortest operation duration may cost too much time and computing resources. It may therefore suffice to screen out one topological sequence whose operation duration is less than a preset duration, ensuring that the target chip's execution efficiency is reasonably good.
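A minimal sketch of this selection step is given below (illustrative only; the function and parameter names are hypothetical). It supports both strategies described above: exhaustively keeping the fastest sequence, or returning the first sequence that beats a preset duration to avoid traversing everything:

```python
from typing import Callable, Iterable, List, Optional

def pick_target_sequence(
    sequences: Iterable[List[str]],
    measure: Callable[[List[str]], float],  # runtime on the target chip, or a cost-model estimate
    time_budget: Optional[float] = None,    # preset duration; None means exhaustive search
) -> Optional[List[str]]:
    best, best_t = None, float("inf")
    for seq in sequences:
        t = measure(seq)
        if time_budget is not None and t < time_budget:
            return seq  # "good enough" sequence found; stop early
        if t < best_t:
            best, best_t = seq, t
    return best
```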
In some embodiments, the target chip includes at least two kinds of computing units that can perform different types of operations on the neural network's input data in parallel, i.e., that can execute different types of operators in the network in parallel. General-purpose chips such as CPUs and GPUs have only a single computing unit, so there is no situation in which multiple computing units execute different types of operators in parallel; for such chips, the operation durations of different topological sequences may differ little. For a target chip that includes at least two kinds of computing units, however, since those units can execute different types of operators in parallel, different operator execution orders (i.e., different topological sequences) may significantly affect the parallelism among the computing units. Some topological sequences may leave certain computing units idle for long stretches, seriously wasting computing resources and reducing the target chip's processing efficiency. The compilation method provided by the embodiments of the present disclosure is therefore mainly intended to improve the processing efficiency of target chips with at least two kinds of computing units; CPUs, GPUs, and the like have their own corresponding optimization means.

When determining the target topological sequence from the multiple topological sequences of the computation graph, one approach is to directly traverse all topological sequences in the graph, determine the time the target chip takes to operate on the input data in the operator execution order represented by each sequence, and then screen out the target topological sequence based on the determined durations. This approach may be suitable for computation graphs with a simple structure and few topological sequences; for graphs with a more complex structure, traversing all of their topological sequences is cumbersome.

Therefore, in some embodiments, the computation graph may first be divided into multiple subgraphs; a better or optimal target sub-topological sequence is determined for each subgraph, and the target sub-topological sequences are then linked together to obtain the target topological sequence of the entire computation graph. When partitioning, each resulting subgraph must include at least two sub-topological sequences; then, for each subgraph, a target sub-topological sequence can be determined from its at least two sequences, for example based on the time the target chip takes to operate on the input data in the operator execution orders represented by the subgraph's sub-topological sequences. The target sub-topological sequence may be the sub-topological sequence with the shortest operation duration among the subgraph's sequences, or any sub-topological sequence whose operation duration is less than a preset duration, as long as the target chip processes data with relatively high efficiency when executing operators in the order of that target sub-topological sequence. After the target sub-topological sequence of each subgraph is obtained, the sequences can be linked according to the data flow in the computation graph to yield the target topological sequence of the entire graph. By dividing the computation graph into multiple subgraphs, screening out each subgraph's better or optimal target sub-topological sequence, and then linking them into a better or optimal target topological sequence for the whole graph, a complex computation graph can be simplified, easing the screening of better or optimal topological sequences.

Of course, when a subgraph has too many combinations of sub-topological sequences, random selection may be used: a specified number of candidate sub-topological sequences are drawn at random, and a better or optimal target sub-topological sequence is selected from among the candidates. Alternatively, features of the subgraph itself may be summarized and extracted, and methods such as machine learning or Monte Carlo search may be used to optimize the process of finding the optimal sub-topological sequence, improving the processing efficiency of screening the target sub-topological sequence.
In some embodiments, when dividing the computation graph into multiple subgraphs, key nodes may first be determined from the graph's nodes, and the graph may then be divided into subgraphs based on those key nodes. A key node is a convergence point of at least two paths in the computation graph, i.e., a node in the graph at which branches of two or more paths exist. For a node in the computation graph, having two or more branches means that multiple execution orders are possible when passing through it. For example, in FIG. 4, the operator nodes with a gray background are key nodes; as can be seen from FIG. 4, for each key node, at least two paths lead out of the node, or at least two paths converge on it. After the key nodes are determined, the computation graph can be divided into multiple subgraphs based on them; for example, in some embodiments, at least two adjacent key nodes together with the nodes and edges between them may form one subgraph.
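Under the assumption that the computation graph is stored as an adjacency map {node: [successors]}, key-node detection reduces to a degree check, as in the following sketch (illustrative only):

```python
def key_nodes(graph):
    """A node is a key node when at least two paths leave it (fan-out >= 2)
    or at least two paths converge on it (fan-in >= 2)."""
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1
    return {n for n in graph if len(graph[n]) >= 2 or indegree[n] >= 2}
```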
For example, in the portion enclosed by dashed box 401 in FIG. 4, two adjacent key nodes and the nodes and edges between them form one subgraph. When the nodes and edges between each pair of adjacent key nodes form a subgraph, each subgraph contains only two key nodes (its start and end), so each subgraph also contains relatively few topological sequences. Of course, several neighboring key nodes together with the nodes and edges between them may also form one subgraph, as in the portion enclosed by dashed box 402 in FIG. 4. In that case each subgraph may contain more than two key nodes, so each subgraph has more topological sequences, but there are fewer subgraphs.
In some embodiments, when the computation graph is divided by treating the portion between each pair of adjacent key nodes as one subgraph, the number of resulting subgraphs may be relatively large, and subsequently compiling and linking the subgraphs takes longer. A target subgraph whose node count is less than a preset number may be determined from the resulting subgraphs and then merged with a neighboring subgraph. For example, the target subgraph may be merged with the subgraph immediately before or after it, reducing the number of subgraphs and saving subsequent compilation and linking time.
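A sketch of this merging pass is shown below (illustrative only; it assumes subgraphs are ordered along the data flow, expose a `nodes` collection, and that a `merge` helper exists; the disclosure equally allows merging with the following subgraph):

```python
def merge_small_subgraphs(subgraphs, min_nodes, merge):
    out = []
    for sg in subgraphs:
        if out and len(sg.nodes) < min_nodes:
            out[-1] = merge(out[-1], sg)  # fold the undersized subgraph into its predecessor
        else:
            out.append(sg)
    return out
```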
In some embodiments, when determining the machine instructions corresponding to the neural network according to the target topological sequence, the machine instructions corresponding to each subgraph's target sub-topological sequence may be determined first, and the machine instructions corresponding to each target sub-topological sequence are then linked according to the data flow indicated in the computation graph to obtain the machine instructions corresponding to the whole neural network. For example, the binary machine code corresponding to each subgraph's target sub-topological sequence may be determined, and a linker may then combine the binary machine code of the different subgraphs according to the data flow indicated in the computation graph, assembling the separately compiled neural-network subgraphs into one complete neural network.

In some embodiments, when determining the time the target chip takes to operate on the input data in the operator execution order represented by each topological sequence, the machine instructions with which the target chip operates on the input data in the order represented by that sequence may be determined for each sequence; the target chip then executes those machine instructions, and the time it takes to execute them is the operation duration for which the target chip operates on the input data in that sequence's operator execution order. This approach requires actually executing each topological sequence's machine instructions once on the target chip; the resulting operation durations are accurate, but the approach requires the target chip's participation and is cumbersome and time-consuming.
In some embodiments, the operation duration corresponding to each topological sequence may instead be estimated with a cost model. Specifically, a cost model may be built in advance that estimates, from the target chip's hardware parameters and the topological sequence, the time the target chip takes to operate on the input data in the operator execution order represented by each sequence. The time the target chip spends executing machine instructions consists mainly of the time to read instructions from the storage device, the time the chip's computing units spend computing on the input data, and some waiting time during the operation (which can generally be neglected). The instruction-read time can be determined from the transmission bandwidth of the target chip's port and the amount of data to be transferred; the compute time can be determined from the amount of data to be computed and the computing power of the target chip's computing units. The cost model's logic is therefore as follows: from the target chip's hardware performance parameters (e.g., which kinds of computing units the chip includes, how many of each kind, each unit's computing power, and the data transmission bandwidth of the interface through which the chip reads data), the size of the input data, and the operator execution order indicated by the topological sequence (e.g., convolve first and then add, or add first and then convolve), it estimates the operation duration. By using the cost model to simulate how each sequence's machine instructions would run on the real target chip, the operation duration corresponding to each topological sequence can be estimated without running those machine instructions on real hardware.
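The following sketch illustrates the shape of such a cost model (illustrative only; all parameters and lookups are assumptions, not figures from the disclosure). It adds an instruction/data-transfer term to a compute term in which each kind of computing unit works through its own operators; a real model would also charge for dependency stalls between units, which is precisely where the operator order matters:

```python
from dataclasses import dataclass

@dataclass
class ChipSpec:
    bandwidth_bytes_per_s: float  # transfer bandwidth of the chip's data port
    unit_flops: dict              # per-unit throughput, e.g. {"MPU": 2e12, "VPU": 5e11}

def estimate_duration(seq, op_bytes, op_flops, op_unit, chip: ChipSpec):
    """seq: operator names in execution order; op_*: per-operator lookup tables."""
    # Time to move instructions and data through the chip's port.
    transfer = sum(op_bytes[op] for op in seq) / chip.bandwidth_bytes_per_s
    # Idealized parallelism: the busiest computing unit dominates compute time.
    busy = {}
    for op in seq:
        unit = op_unit[op]  # which computing unit executes this operator
        busy[unit] = busy.get(unit, 0.0) + op_flops[op] / chip.unit_flops[unit]
    return transfer + max(busy.values(), default=0.0)
```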
In some embodiments, when determining, based on a preset cost model, the time the target chip takes to operate on the input data in the operator execution order represented by each topological sequence, the machine instructions with which the chip operates on the input data in that order may first be determined for each sequence, and the operation duration is then determined from the preset cost model and those machine instructions. Of course, since compiling the neural network into binary instructions from a topological sequence is itself time-consuming, in some embodiments the constructed cost model may be optimized and improved so that it can estimate the operation duration directly from the topological sequence, eliminating the step of compiling the neural network based on the topological sequence and saving the time that step would cost.

In some embodiments, when determining the computation graph corresponding to the neural network to be compiled, the neural network may be parsed to obtain an original computation graph corresponding to the neural network. For some operators in the original graph, the target chip's memory may be too small to hold the operation data corresponding to the operator, so the operator's computation cannot be completed in one pass. An operator's operation data includes the various tensors associated with it, such as its input data, model parameters, and output data. In this case, the operators in the original computation graph may be adjusted according to the target chip's memory size and the amount of operation data corresponding to each operator in the original graph, yielding the neural network's final computation graph.
In some embodiments, when adjusting the operators in the original computation graph according to the target chip's memory size and the amount of operation data corresponding to each operator so as to obtain the neural network's final computation graph, the following may be done for each operator in the original graph. First, the amount of operation data corresponding to the operator is determined; if that amount exceeds a preset threshold then, as shown in FIG. 5, one or more operators of the same type as that operator may be added to the original graph, so that the operation data can be split into multiple pieces, each processed by one of the resulting operators, ensuring that no operator's operation data exceeds the target chip's memory. The preset threshold is determined based on the target chip's memory size; for example, it may be the memory size itself, or the memory size minus a certain buffer amount. The above operation is repeated for every operator in the original computation graph, and the original graph is then adjusted based on the newly added operators to obtain the final computation graph.
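A sketch of this memory-driven rewrite is given below (illustrative only; `graph.nodes`, `graph.attach`, `clone_op`, and `split_tensor` are hypothetical helpers standing in for whatever graph API the compiler uses):

```python
import math

def split_oversized_ops(graph, data_bytes, threshold_bytes, clone_op, split_tensor):
    for op in list(graph.nodes):            # copy: the loop mutates the graph
        size = data_bytes(op)               # input tensors + parameters + output tensors
        if size <= threshold_bytes:
            continue
        n_parts = math.ceil(size / threshold_bytes)
        pieces = split_tensor(op, n_parts)  # split along a chosen dimension (see below)
        graph.attach(op, pieces[0])
        for piece in pieces[1:]:
            clone = clone_op(op)            # new node of the same operator type
            graph.attach(clone, piece)      # rewire edges so the clone handles this piece
    return graph
```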
In some embodiments, when splitting each operator's operation data into multiple pieces, the split dimension along which to split the operation data may be determined according to the operator's type and the target chip's hardware performance parameters, and the operation data is then split along the determined dimension to obtain multiple pieces of data. For example, suppose the input data is 10 frames of 100-channel images; the 10 frames of 100-channel images may be split into 2 pieces of data, each being 5 frames of 100-channel images, or the data may instead be split along the channel dimension into 2 pieces of data, each being 10 frames of 50-channel images. Which split to use can be determined by the operator's type and the target chip's hardware performance parameters: different operators compute differently, so the splits applicable to them differ; and for different target-chip memory sizes, an adapted split must likewise be chosen so that the split data satisfies the target chip's memory limit. For example, for operators such as conv, fc (full connection), and depthwise (depthwise separable convolution), the data is split preferentially along the frame-number dimension of the image data; if that dimension cannot be split, or the result after splitting still exceeds the target chip's memory limit, splitting can continue along the channel dimension.
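The frame-then-channel preference can be captured in a small lookup, as in this sketch (illustrative only; the operator names come from the text above, while the NCHW shape convention and 4-byte elements are assumptions):

```python
PREFERRED_DIMS = {
    "conv": ["N", "C"],       # frame-number dimension first, then channels
    "fc": ["N", "C"],
    "depthwise": ["N", "C"],
}

def choose_split_dim(op_type, shape, threshold_bytes, elem_bytes=4):
    """shape: {"N": frames, "C": channels, "H": height, "W": width}."""
    total = elem_bytes
    for extent in shape.values():
        total *= extent
    for dim in PREFERRED_DIMS.get(op_type, ["N", "C", "H", "W"]):
        # Splitting fully along `dim` yields pieces of size total / shape[dim];
        # accept the first preferred dimension whose finest split fits memory.
        if shape[dim] > 1 and total / shape[dim] <= threshold_bytes:
            return dim
    return None  # no single dimension suffices; split along several in practice
```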
In some embodiments, the operation data may be image data, and the split dimension for splitting the image data includes one or more of the following: the frame dimension of the image data, the channel dimension of the image data, the width dimension of the image data, or the height dimension of the image data.
To further explain the neural network compilation method provided by the embodiments of the present disclosure, it is described below with reference to a specific embodiment.
In the related art, when a neural network is deployed to a target terminal, the neural network is usually parsed into a fixed topological sequence according to preset parsing rules, the neural network is compiled into binary instructions based on that fixed topological sequence, and the compiled binary instructions are output to the target chip of the target terminal for execution. For general-purpose processors such as CPUs or GPUs, which have only a single compute unit, the main concern is handling thread-level and instruction-level parallelism, so this approach has little impact on processing efficiency. However, some AI chips or domain-specific accelerators generally include multiple kinds of compute units, and different types of operators may execute on different types of compute units. For example, some AI chips include a DAU unit for memory access, an MPU unit dedicated to convolution and similar operations, and a VPU unit for vector computation, and these compute units can execute concurrently. In this case, the execution order of the operators in the neural network (corresponding to the topological sequence) affects the degree of parallelism across the compute units: directly using a fixed topological sequence may leave some compute units idle for long stretches (because of data dependencies, for example), degrading overall inference efficiency.
Based on this, the present embodiment provides a neural network compilation method that specifically includes the following steps:
1. Updating the computation graph of the neural network based on the data volume of the operation data corresponding to each operator in the neural network and the memory size of the target chip.
First, the Caffe file of the neural network can be converted into a computation graph. Given the limited memory of the target chip, for each operator of the Caffe model, if the space occupied by the operator's operation data (i.e., its input and output tensors) exceeds a set size (which may be configured based on the memory size of the target chip), the operation data of the operator is split into multiple pieces of data, and one or more identical operators are added to the computation graph, so that each piece of split data corresponds to one operator and the operation data of each operator can be computed on the target chip in a single pass. The same operation is performed for every operator in the computation graph until no operator in the graph requires more memory than the preset size to run on its own, yielding the updated computation graph.
When splitting an operator's operation data, the split dimension can be determined according to the operator type and the memory size of the target chip, and the operation data is split along that dimension. For example, for operators such as conv, fc, and depthwise, splitting is preferentially performed along the image frame dimension; if the frame dimension cannot be split, or the result still exceeds the memory limit of the target chip after splitting, splitting can continue along the image channel dimension.
For the remaining operator types, splitting should avoid, where possible, the dimensions involved in the operator's reduction, choosing instead the dimension with the lowest implementation complexity after splitting. For example, for the resize and pooling operators, which involve reduction over the image height and width dimensions, it is preferable to split the operation data along the image frame dimension or the image channel dimension; for the transpose operator, which operates on the image channel and width dimensions, the operation data can be split along the image frame and height dimensions.
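A minimal sketch of such a preference rule, assuming NCHW dimension names; the mapping below merely encodes the rules of thumb above and is not taken verbatim from the disclosure:

```python
# Dimensions each operator type reduces over or rearranges (NCHW names assumed);
# splitting avoids these and prefers the remaining dimensions in order.
REDUCTION_DIMS = {
    "resize":    {"H", "W"},
    "pooling":   {"H", "W"},
    "transpose": {"C", "W"},
}

def candidate_split_dims(op_type: str) -> list:
    """Preference order for split dimensions: frames, then channels, height,
    width, skipping any dimension the operator reduces over."""
    avoid = REDUCTION_DIMS.get(op_type, set())
    return [d for d in ("N", "C", "H", "W") if d not in avoid]

print(candidate_split_dims("pooling"))    # ['N', 'C']
print(candidate_split_dims("transpose"))  # ['N', 'H']
```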
2. Dividing the updated computation graph into multiple subgraphs
All paths from the input node to the output node of the updated computation graph are traversed and recorded, and the intersection of all paths yields the key nodes of the updated computation graph; a key node is a convergence point of two paths in the graph. Two adjacent key nodes, together with the nodes and edges located between them, form a subgraph. Subgraphs consisting of fewer than a preset number of nodes are then merged into their preceding subgraphs, splitting the entire computation graph of the neural network into multiple subgraphs. The main purpose of merging subgraphs is to reduce the number of subgraphs and save subsequent compilation and linking time.
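The key-node computation can be sketched as follows; the adjacency-dict representation and the brute-force path enumeration are illustrative assumptions (a production compiler would likely use something closer to dominator analysis), and the merging of small subgraphs is omitted:

```python
from itertools import product

def key_nodes(graph: dict, inputs: list, outputs: list) -> set:
    """Return the nodes shared by every input->output path (the key nodes).

    `graph` maps each node to a list of successors; a DAG is assumed. The
    exponential path enumeration is only viable for small graphs -- this is
    a sketch, not the production algorithm.
    """
    def all_paths(src, dst, seen=()):
        seen = seen + (src,)
        if src == dst:
            yield set(seen)
        for nxt in graph.get(src, []):
            yield from all_paths(nxt, dst, seen)

    paths = [p for s, t in product(inputs, outputs) for p in all_paths(s, t)]
    return set.intersection(*paths) if paths else set()

# Example: a diamond a->{b,c}->d followed by d->e.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(key_nodes(g, ["a"], ["e"]))  # {'a', 'd', 'e'} -- the convergence points
```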
3. Optimizing the topological sequence of each subgraph
Because the subgraphs execute sequentially, once multiple subgraphs have been obtained by splitting, it is only necessary to optimize the execution time of each subgraph to minimize the overall running time of the neural network.
To simulate how long the machine instructions corresponding to each topological sequence of a subgraph would take to execute on the target chip, this embodiment constructs a cost model. The cost model reads the binary instruction stream compiled by the toolchain's compiler for each topological sequence and produces an estimated execution time by simulating the execution process of the target chip. The execution logic of the cost model is briefly described below.
The time the target chip spends executing a binary instruction stream consists mainly of the following parts (a toy cost model combining them is sketched after this list):
(1) The time T1 for reading the binary instructions from memory through the data-reading interface of the target chip.
T1 depends mainly on the data transmission bandwidth of the chip's data-reading interface and on the amount of data to be read. The cost model can therefore estimate T1 from performance parameters of the data-reading interface (e.g., its data transmission bandwidth) and the amount of data to be read (e.g., the input data of the neural network).
(2) The time T2 for the compute units to operate on the data that has been read.
T2 depends mainly on the performance parameters of the compute units in the target chip (e.g., the types of compute units, their number, and their computing power) and on the topological sequence of the neural network (i.e., the execution order of the operators). The cost model can therefore estimate T2 from the performance parameters of the chip's compute units and the neural network's topological sequence.
(3) Some waiting time during actual execution. Because this part is usually short, it can be neglected.
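A toy version of such a cost model, assuming a simple `(unit_type, op_count)` schedule encoding and treating concurrent units as fully overlapping; none of these modeling choices are taken from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class ChipParams:
    read_bandwidth: float   # bytes/s of the chip's data-reading interface
    unit_throughput: dict   # ops/s per compute-unit type, e.g. {"MPU": 2e12}

def estimate_time(chip: ChipParams, input_bytes: int, schedule: list) -> float:
    """Toy cost model: T ~= T1 + T2, with waiting time neglected.

    `schedule` encodes the operator execution order as (unit_type, op_count)
    pairs -- an illustrative encoding, not the patent's instruction format.
    """
    t1 = input_bytes / chip.read_bandwidth          # data/instruction read time
    # T2: accumulate ops per unit type; concurrent units overlap, so take
    # the maximum busy time across unit types as a crude parallelism model.
    busy: dict = {}
    for unit, ops in schedule:
        busy[unit] = busy.get(unit, 0.0) + ops / chip.unit_throughput[unit]
    t2 = max(busy.values(), default=0.0)
    return t1 + t2
```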
Based on the above cost model, obtaining the optimal topological sequence of a subgraph under the given cost model mainly includes the following steps (a search sketch follows this list):
(1) Traversing all of the subgraph's possible topological sequences (there may be one or more);
(2) Compiling each topological sequence with the toolchain to generate the corresponding binary instruction file, and obtaining the subgraph's execution time under that topological sequence from the cost model;
(3) Selecting the topological sequence with the shortest execution time as the subgraph's optimal topological sequence under the given cost model and toolchain version.
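A hedged sketch of this enumerate-compile-evaluate loop; `compile_fn` stands in for the toolchain compiler and `cost_fn` for the cost model, and both are assumed interfaces rather than real APIs:

```python
import itertools

def all_topological_sequences(succ: dict):
    """Yield every topological order of a DAG given as node -> successor list."""
    indeg = {n: 0 for n in succ}
    for outs in succ.values():
        for m in outs:
            indeg[m] += 1

    def backtrack(order):
        ready = [n for n, d in indeg.items() if d == 0 and n not in order]
        if not ready:
            if len(order) == len(succ):
                yield list(order)
            return
        for n in ready:
            for m in succ[n]:
                indeg[m] -= 1
            order.append(n)
            yield from backtrack(order)
            order.pop()
            for m in succ[n]:
                indeg[m] += 1

    yield from backtrack([])

def best_sequence(succ: dict, compile_fn, cost_fn, limit=None):
    """Pick the topological order with the lowest estimated execution time;
    `limit` optionally caps how many candidate sequences are evaluated."""
    seqs = all_topological_sequences(succ)
    if limit is not None:
        seqs = itertools.islice(seqs, limit)
    return min(seqs, key=lambda s: cost_fn(compile_fn(s)))
```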
Of course, since the compilation process is relatively time-consuming, the running time of the machine instructions corresponding to a topological sequence on the target chip can also be simulated directly from the topological sequence itself. In addition, when a subgraph has too many possible topological sequences, a random-selection method can be used to pick a specified number of candidate sequences from the topological sequences for optimization; alternatively, features of the subgraph itself can be summarized and extracted, and methods such as machine learning or Monte Carlo techniques can be used to speed up the search.
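One way such random selection might look, reusing `all_topological_sequences` from the previous sketch; the reservoir-sampling strategy and the enumeration cap are assumptions, not the disclosed method:

```python
import random

def sample_candidate_sequences(succ: dict, k: int, max_enumerate: int = 10_000):
    """Keep k randomly chosen candidate orders when the sequence space
    is too large to search exhaustively."""
    reservoir = []
    for i, seq in enumerate(all_topological_sequences(succ)):
        if i >= max_enumerate:
            break  # cap enumeration so sampling itself stays cheap
        if len(reservoir) < k:
            reservoir.append(seq)
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = seq
    return reservoir
```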
4. After the optimal topological sequences of the different subgraphs have been obtained, a linker can combine the binary machine code corresponding to the different subgraphs according to the original data flow, assembling the separately compiled neural network subgraphs into a complete neural network model.
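At its simplest, this linking step amounts to ordered concatenation; the sketch below shows only the ordering, and a real linker would also patch addresses and inter-subgraph buffers:

```python
def link_subgraphs(compiled: dict, dataflow_order: list) -> bytes:
    """Concatenate per-subgraph machine code following the original data flow.

    `compiled` maps subgraph id -> machine-code bytes (a hypothetical layout).
    """
    return b"".join(compiled[sg] for sg in dataflow_order)

# e.g. link_subgraphs({"sg0": b"\x01", "sg1": b"\x02"}, ["sg0", "sg1"])
```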
The neural network compilation method provided by this embodiment takes into account the limited memory space of the target chip: the operation data of each operator in the computation graph is partitioned at a finer granularity, and the number of operators in the graph is updated accordingly, so that each operator can execute on the target chip in a single pass. The updated computation graph is split at the key nodes into multiple subgraphs, the optimal topological sequence of each subgraph is determined, and the optimal sequences of the different subgraphs are linked to obtain the optimal topological sequence of the entire computation graph. Compiling the neural network based on this optimal topological sequence yields machine instructions that are output to the target chip for execution, which, for different target chips, exploits the parallelism between different compute units to a greater extent and improves the processing efficiency of the target chip. In addition, the present application can also estimate the target chip's execution time through the constructed cost model, completing the topological-sequence search without running on real hardware.
Corresponding to the method provided by the embodiments of the present disclosure, an embodiment of the present disclosure further provides a neural network compilation apparatus. As shown in FIG. 6, the apparatus 60 includes:
a computation graph determination module 61, configured to determine a computation graph corresponding to the neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent the flow of the neural network's input data;
a screening module 62, configured to determine a target topological sequence from multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network; and
a compilation module 63, configured to determine machine instructions corresponding to the neural network based on the target topological sequence, so that a target chip executes the machine instructions.
In some embodiments, the target topological sequence is determined from the multiple topological sequences based on the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence.
In some embodiments, the target chip includes at least two kinds of compute units, and the at least two kinds of compute units can perform different types of operations on input data in parallel.
In some embodiments, when determining the target topological sequence from the multiple topological sequences of the computation graph, the screening module is specifically configured to: divide the computation graph into multiple subgraphs, each subgraph including at least two sub-topological sequences; for each subgraph, determine a target sub-topological sequence from the at least two sub-topological sequences of the subgraph, the target sub-topological sequence being determined based on the operation duration for which the target chip operates on the input data in the operator execution orders represented by the at least two sub-topological sequences; and obtain the target topological sequence based on the target sub-topological sequence of each subgraph.
In some embodiments, when dividing the computation graph into multiple subgraphs, the screening module is specifically configured to: determine multiple key nodes from the nodes of the computation graph, each key node being a convergence point of at least two paths in the computation graph; and divide the computation graph into multiple subgraphs based on the multiple key nodes.
In some embodiments, splitting the computation graph into multiple subgraphs based on the multiple key nodes includes: forming a subgraph from at least two adjacent key nodes together with the nodes and edges located between the at least two key nodes.
In some embodiments, after the computation graph is split into multiple subgraphs, the screening module is further configured to: determine target subgraphs whose number of nodes is less than a preset number; and merge each target subgraph with a neighboring subgraph of that target subgraph.
In some embodiments, when determining the machine instructions corresponding to the neural network based on the target topological sequence, the compilation module is specifically configured to: determine the machine instructions corresponding to each target sub-topological sequence; and link the machine instructions corresponding to each target sub-topological sequence according to the data flow in the computation graph to obtain the machine instructions corresponding to the neural network.
In some embodiments, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence is determined as follows: for each topological sequence, determining the machine instructions corresponding to the target chip operating on the input data in the operator execution order represented by the topological sequence, and determining the operation duration based on the time the target chip takes to execute the machine instructions; or, for each topological sequence, determining, with a preset cost model, the operation duration for which the target chip operates on the input data in the operator execution order represented by the topological sequence, where the cost model is used to estimate the operation duration corresponding to the topological sequence from the hardware parameters of the target chip and the operator execution order represented by the topological sequence.
In some embodiments, determining, with the preset cost model, the operation duration for which the target chip operates on the input data in the operator execution order represented by the topological sequence includes: determining the machine instructions corresponding to the target chip operating on the input data in the operator execution order represented by each topological sequence; and determining the operation duration based on the preset cost model and the machine instructions.
In some embodiments, when determining the computation graph corresponding to the neural network to be compiled, the computation graph determination module is specifically configured to: parse the neural network to obtain the original computation graph corresponding to the neural network; and adjust the operators in the original computation graph according to the memory size of the target chip and the data volume of the operation data corresponding to each operator in the original computation graph, so as to update the computation graph.
In some embodiments, when adjusting the operators in the original computation graph according to the memory size of the target chip and the data volume of the operation data corresponding to each operator in the original computation graph so as to update the computation graph, the computation graph determination module is specifically configured to:
for each target operator in the original computation graph whose corresponding operation data has a data volume greater than a preset threshold, add to the original computation graph at least one additional operator of the same type as the target operator, so that the operation data is split into multiple pieces of data that are operated on separately by the target operator and the added additional operator(s), where the preset threshold is determined based on the memory size of the target chip; and
adjust the original computation graph based on the added additional operator(s), so as to update the computation graph.
In some embodiments, when splitting the operation data into multiple pieces of data, the computation graph determination module is specifically configured to: determine a split dimension for splitting the operation data based on the type of the target operator and the hardware performance parameters of the target chip; and split the data along the split dimension to obtain multiple pieces of data.
In some embodiments, the operation data includes image data, and the split dimension includes one or more of the following: the frame dimension of the image data, the channel dimension of the image data, the width dimension of the image data, and the height dimension of the image data.
In addition, an embodiment of the present disclosure further provides an electronic device. As shown in FIG. 7, the electronic device includes a processor 71, a memory 72, and computer instructions stored in the memory 72 and executable by the processor 71; when the processor 71 executes the computer instructions, the methods described in the foregoing embodiments can be implemented.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method described in any of the foregoing embodiments is implemented.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media exclude transitory media, such as modulated data signals and carrier waves.
From the description of the foregoing implementations, those skilled in the art can clearly understand that the embodiments of the present disclosure can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the parts contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present disclosure or in certain parts thereof.
The systems, apparatuses, modules, or units described in the foregoing embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments of the present disclosure are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment. The apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separate, and when implementing the solutions of the embodiments of the present disclosure, the functions of the modules may be realized in the same piece of software and/or hardware or distributed across multiple pieces. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The foregoing is only a specific implementation of the embodiments of the present disclosure. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the embodiments of the present disclosure, and these improvements and refinements shall also fall within the protection scope of the embodiments of the present disclosure.

Claims (18)

  1. A compilation method, characterized in that the method comprises:
    determining a computation graph corresponding to a neural network to be compiled, wherein nodes in the computation graph represent operators in the neural network, and edges in the computation graph represent data flows in the neural network;
    determining a target topological sequence from a plurality of topological sequences of the computation graph, wherein each of the topological sequences represents a specific execution order of the operators in the neural network; and
    generating, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.
  2. The method according to claim 1, wherein determining the target topological sequence from the plurality of topological sequences of the computation graph comprises:
    determining the target topological sequence from the plurality of topological sequences based on an operation duration for which the target chip operates on input data of the neural network in the operator execution order represented by each of the topological sequences.
  3. The method according to claim 2, wherein the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each of the topological sequences is determined in the following manner:
    for each of the topological sequences,
    determining machine instructions corresponding to the target chip operating on the input data of the neural network in the operator execution order represented by the topological sequence; and
    determining the operation duration based on a duration for which the target chip executes the machine instructions.
  4. The method according to claim 2, wherein the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each of the topological sequences is determined in the following manner:
    for each of the topological sequences,
    determining, by using a preset cost model, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by the topological sequence,
    wherein the cost model is used to estimate the operation duration corresponding to the topological sequence according to hardware parameters of the target chip and the operator execution order represented by the topological sequence.
  5. The method according to claim 4, wherein determining, by using the preset cost model, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by the topological sequence comprises:
    determining machine instructions corresponding to the target chip operating on the input data of the neural network in the operator execution order represented by the topological sequence; and
    determining the operation duration based on the cost model and the machine instructions.
  6. The method according to any one of claims 1-5, wherein
    the target chip comprises at least two kinds of compute units, and
    the at least two kinds of compute units are capable of performing different types of operations on input data in parallel.
  7. The method according to any one of claims 1-6, wherein determining the target topological sequence from the plurality of topological sequences of the computation graph comprises:
    dividing the computation graph into a plurality of subgraphs, wherein each subgraph comprises at least two sub-topological sequences;
    for each subgraph, determining a target sub-topological sequence from the at least two sub-topological sequences of the subgraph; and
    obtaining the target topological sequence based on the target sub-topological sequence of each of the subgraphs.
  8. The method according to claim 7, wherein dividing the computation graph into the plurality of subgraphs comprises:
    determining a plurality of key nodes from a plurality of nodes of the computation graph, wherein each of the key nodes is a convergence point of at least two paths in the computation graph; and
    dividing the computation graph into the plurality of subgraphs based on the plurality of key nodes.
  9. The method according to claim 8, wherein splitting the computation graph into the plurality of subgraphs based on the plurality of key nodes comprises:
    forming a subgraph from at least two adjacent key nodes together with the nodes and edges located between the at least two key nodes.
  10. The method according to claim 9, further comprising, after splitting the computation graph into the plurality of subgraphs:
    determining a target subgraph whose number of nodes is less than a preset number; and
    merging the target subgraph with a neighboring subgraph of the target subgraph.
  11. The method according to any one of claims 7-10, wherein determining, based on the target topological sequence, the machine instructions corresponding to the neural network comprises:
    determining machine instructions corresponding to each of the target sub-topological sequences; and
    linking the machine instructions corresponding to each of the target sub-topological sequences according to the data flow in the computation graph to obtain the machine instructions corresponding to the neural network.
  12. The method according to any one of claims 1-11, wherein determining the computation graph corresponding to the neural network to be compiled comprises:
    parsing the neural network to obtain an original computation graph corresponding to the neural network; and
    adjusting operators in the original computation graph according to a memory size of the target chip and a data volume of operation data corresponding to each operator in the original computation graph, so as to update the computation graph.
  13. The method according to claim 12, wherein adjusting the operators in the original computation graph according to the memory size of the target chip and the data volume of the operation data corresponding to each operator in the original computation graph, so as to update the computation graph, comprises:
    for each target operator in the original computation graph whose corresponding operation data has a data volume greater than a preset threshold, adding to the original computation graph at least one additional operator of the same type as the target operator, so that the operation data is split into multiple pieces of data that are operated on separately by the target operator and the added additional operator, wherein the preset threshold is determined based on the memory size of the target chip; and
    adjusting the original computation graph based on the added additional operator, so as to update the computation graph.
  14. The method according to claim 13, wherein splitting the operation data into multiple pieces of data comprises:
    determining, based on a type of the target operator and hardware performance parameters of the target chip, a split dimension for splitting the operation data; and
    splitting the data along the split dimension to obtain multiple pieces of data.
  15. The method according to claim 14, wherein
    the operation data comprises image data, and
    the split dimension comprises one or more of the following: a frame dimension of the image data, a channel dimension of the image data, a width dimension of the image data, and a height dimension of the image data.
  16. A compilation apparatus, characterized in that the apparatus comprises:
    a computation graph determination module, configured to determine a computation graph corresponding to a neural network to be compiled, wherein nodes in the computation graph represent operators in the neural network, and edges in the computation graph represent data flows in the neural network;
    a screening module, configured to determine a target topological sequence from a plurality of topological sequences of the computation graph, wherein each of the topological sequences represents a specific execution order of the operators in the neural network; and
    a compilation module, configured to determine, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.
  17. An electronic device, characterized in that the electronic device comprises a processor, a memory, and computer instructions stored in the memory and executable by the processor, wherein when the processor executes the computer instructions, the compilation method according to any one of claims 1-15 is implemented.
  18. A computer-readable storage medium, characterized in that computer instructions are stored on the storage medium, and when the computer instructions are executed, the compilation method according to any one of claims 1-15 is implemented.
PCT/CN2022/093058 2021-08-31 2022-05-16 Neural network compilation method and apparatus, device, and storage medium WO2023029589A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111013533.XA CN113703775B (en) 2021-08-31 2021-08-31 Compiling method, compiling device, compiling equipment and storage medium
CN202111013533.X 2021-08-31

Publications (1)

Publication Number Publication Date
WO2023029589A1 true WO2023029589A1 (en) 2023-03-09

Family

ID=78658087

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093058 WO2023029589A1 (en) 2021-08-31 2022-05-16 Neural network compilation method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN113703775B (en)
WO (1) WO2023029589A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126346A (en) * 2023-04-04 2023-05-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116415103A (en) * 2023-06-09 2023-07-11 之江实验室 Data processing method, device, storage medium and electronic equipment
CN116991564A (en) * 2023-09-28 2023-11-03 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113703775B (en) * 2021-08-31 2023-11-28 上海阵量智能科技有限公司 Compiling method, compiling device, compiling equipment and storage medium
CN114139684A (en) * 2021-12-02 2022-03-04 脸萌有限公司 Graph neural network generation method, device, system, medium, and electronic apparatus
CN114996008B (en) * 2022-05-30 2024-05-03 上海壁仞科技股份有限公司 AI calculation graph multi-back-end cooperative calculation method and device
CN115081598B (en) * 2022-08-23 2022-12-06 北京灵汐科技有限公司 Operator processing method and device, electronic equipment and computer readable storage medium
CN117170686B (en) * 2023-11-03 2024-03-12 深圳鲲云信息科技有限公司 Method and computing device for neural network compilation optimization
CN117492766A (en) * 2023-12-27 2024-02-02 深圳市九天睿芯科技有限公司 Compiling method, compiler, neural network accelerator, chip and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190391796A1 (en) * 2019-06-28 2019-12-26 Intel Corporation Control of scheduling dependencies by a neural network compiler
CN110764744A (en) * 2018-07-25 2020-02-07 赛灵思公司 Intermediate representation generation method and device for neural network computation
CN111338635A (en) * 2020-02-20 2020-06-26 腾讯科技(深圳)有限公司 Graph compiling method, device and equipment for calculation graph and storage medium
CN112711422A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Optimization method and system for neural network compiling
CN113703775A (en) * 2021-08-31 2021-11-26 上海阵量智能科技有限公司 Compiling method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10586173B2 (en) * 2016-01-27 2020-03-10 Bonsai AI, Inc. Searchable database of trained artificial intelligence objects that can be reused, reconfigured, and recomposed, into one or more subsequent artificial intelligence models
US11544539B2 (en) * 2016-09-29 2023-01-03 Tsinghua University Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
JP7074777B2 (en) * 2017-11-20 2022-05-24 シャンハイ カンブリコン インフォメーション テクノロジー カンパニー リミテッド Tasks Parallel processing methods, appliances, systems, storage media and computer equipment
CN111860816A (en) * 2020-07-08 2020-10-30 Oppo广东移动通信有限公司 Compiling method, device, equipment and storage medium of neural network model
CN112598121A (en) * 2020-12-21 2021-04-02 北京时代民芯科技有限公司 Efficient operator optimization method for deep learning compiler

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110764744A (en) * 2018-07-25 2020-02-07 赛灵思公司 Intermediate representation generation method and device for neural network computation
US20190391796A1 (en) * 2019-06-28 2019-12-26 Intel Corporation Control of scheduling dependencies by a neural network compiler
CN111338635A (en) * 2020-02-20 2020-06-26 腾讯科技(深圳)有限公司 Graph compiling method, device and equipment for calculation graph and storage medium
CN112711422A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Optimization method and system for neural network compiling
CN113703775A (en) * 2021-08-31 2021-11-26 上海阵量智能科技有限公司 Compiling method, device, equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126346A (en) * 2023-04-04 2023-05-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116126346B (en) * 2023-04-04 2023-06-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116415103A (en) * 2023-06-09 2023-07-11 之江实验室 Data processing method, device, storage medium and electronic equipment
CN116415103B (en) * 2023-06-09 2023-09-05 之江实验室 Data processing method, device, storage medium and electronic equipment
CN116991564A (en) * 2023-09-28 2023-11-03 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU
CN116991564B (en) * 2023-09-28 2024-01-09 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU

Also Published As

Publication number Publication date
CN113703775A (en) 2021-11-26
CN113703775B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
WO2023029589A1 (en) Neural network compilation method and apparatus, device, and storage medium
US10372429B2 (en) Method and system for generating accelerator program
Heo et al. Real-time object detection system with multi-path neural networks
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
AU2014203218B2 (en) Memory configuration for inter-processor communication in an MPSoC
CN110689116B (en) Neural network pruning method and device, computer equipment and storage medium
CN110147236A (en) Code compiling method and device
US20160139901A1 (en) Systems, methods, and computer programs for performing runtime auto parallelization of application code
US20230334292A1 (en) Node fusion method for computational graph and device
Zhang et al. Flow faster: Efficient decision algorithms for probabilistic simulations
CN112817730A (en) Deep neural network service batch processing scheduling method and system and GPU
CN115423082A (en) Automatic optimization method for depth model calculation graph related to hardware characteristics
CN115794393A (en) Method, device, server and storage medium for executing business model
US20200118027A1 (en) Learning method, learning apparatus, and recording medium having stored therein learning program
CN110929850A (en) Deep learning operator automatic optimization system and method based on Shenwei processor
CN114398080A (en) Data processing method, device and equipment and computer storage medium
Ara et al. Scalable analysis for multi-scale dataflow models
Kress et al. Comparing time-to-solution for in situ visualization paradigms at scale
CN116974868A (en) Chip power consumption estimation device, method, electronic equipment and storage medium
US11514218B1 (en) System and method for performing static timing analysis of electronic circuit designs using a tag-based approach
US11327733B2 (en) Method of using multidimensional blockification to optimize computer program and device thereof
Feng et al. Cutting down training memory by re-fowarding
Pang et al. Toward the Predictability of Dynamic Real-Time DNN Inference
Yin et al. Exact memory-and communication-aware scheduling of dnns on pipelined edge tpus
CN108564135B (en) Method for constructing framework program and realizing high-performance computing program running time prediction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862731

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE