CN111160551B - Computation graph execution method, computer device, and storage medium


Info

Publication number: CN111160551B
Application number: CN201911230228.9A
Authority: CN (China)
Prior art keywords: operator, subgraph, original, graph
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111160551A
Inventor: name withheld at the inventor's request
Current assignee / Original assignee: Shanghai Cambricon Information Technology Co Ltd
Application filed by Shanghai Cambricon Information Technology Co Ltd; priority to CN201911230228.9A; application granted; publication of CN111160551A and CN111160551B.

Classifications

    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/045 Combinations of networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiments of the present application disclose a computation graph execution method, a computer device, and a storage medium. The computation graph execution method comprises: when a general-purpose processor compiles a computation graph containing a fusion operator, the binary instructions to be executed by the artificial intelligence processor for that computation graph are obtained according to the operation instructions of the fusion operator.

Description

Computation graph execution method, computer device, and storage medium
Technical Field
The present application relates to the field of information processing technology, and in particular to a computation graph execution method, a computer device, and a storage medium.
Background
The deep learning framework is the layer of the deep learning ecosystem that directly faces applications. In the early framework Caffe, the Layer was regarded as the basic element for building a convolutional neural network; later frameworks such as TensorFlow and MXNet use a different name, the Operator, but the core idea is similar to Caffe's: neural network computation is further decomposed into common operators that act on tensor data, and the user builds a neural network model by combining operators and data.
The above is the interface design that the deep learning framework presents to upper-layer applications, and it also shapes how the framework interfaces downward. Before the rise of neural network processors and dedicated accelerators, the mainstream computing devices in the deep learning field were CPUs and GPUs, and the deep learning framework had to turn the deep learning tasks that upper-layer applications express as graph structures into instructions and data that can be executed on a CPU or GPU. In this process the deep learning framework uses operators as the concrete elements that carry out the computation: it provides a kernel function (Kernel) executed on the CPU or GPU for each operator used as a building block of the network, and, according to the abstract computation graph given by the upper-layer application, it dispatches and executes the kernel function corresponding to each operator in the graph structure to complete the computation of the whole neural network.
The simplest way to map an operator to a concrete implementation on a device is through a programming language. The advantage of this approach is its flexibility: on a CPU, or on a GPU with the CUDA programming architecture, an algorithm developer can implement a specific operator directly in the programming language. On the other hand, such operator implementations may not fully exploit the performance of the hardware device, so maximum efficiency may not be achieved.
On this basis, a further approach is to use a high-performance computing library to map the abstract operators in the computation graph to concrete kernel functions; a typical example is cuDNN, the operator computing library developed by Nvidia for GPUs. cuDNN directly provides, at operator granularity, kernel functions executed on the GPU, so when lowering the computation graph the deep learning framework only needs to map each abstract operator in the graph to the corresponding kernel function implementation in the computing library, and no longer needs to implement the operator itself with CUDA. In this case the deep learning framework interfaces directly with the computing library provided by the underlying device manufacturer, and cuDNN usually performs better than kernel functions implemented in CUDA by ordinary developers. In fact, when the computing library provides ready-made operators, the deep learning framework also tends to call the library directly; this relieves the framework of part of its burden and does yield better performance, since a high-performance computing library developed by the underlying device manufacturer can, in theory, push an individual operator to its performance limit on a particular device. There is, however, an upper bound to optimizing individual operators. Optimizing an operator amounts to adjusting the relationship between computation and memory access inside its implementation, reducing unnecessary memory accesses as much as possible while keeping the arithmetic units from idling. If the arithmetic units stayed fully loaded throughout the execution of the whole task, the hardware efficiency of deep learning computation would reach 100%, which is the ideal target of software and hardware optimization. Yet even if the computation inside every operator is optimized to the greatest extent, the arithmetic units cannot be kept fully loaded at all times, because the optimization is ultimately limited by the gaps between operators. The kernel function implementation of an operator on a CPU or GPU, whether written by hand or called from a computing library, follows the pattern of "off-chip storage → on-chip computation → off-chip storage": the operator's input and output data are kept in global memory, and the kernel function must read the input data from global memory, complete the computation, and store the result back to global memory. This causes two problems: first, each operator's accesses to its input and output data cannot be eliminated by optimization inside the operator; second, every operator incurs a startup overhead, especially on heterogeneous computing devices other than the general-purpose processor.
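The following minimal Python sketch (not part of the patent text; GlobalMemory, the operator callables, and fused_kernel are illustrative assumptions) contrasts per-operator execution, where every kernel launch reads and writes off-chip global memory, with execution of a single fused kernel that keeps intermediate results on chip.

```python
# Illustrative sketch only: per-operator kernels versus one fused kernel.
class GlobalMemory:
    """Stands in for off-chip (global) storage."""
    def read(self, tensor):
        return tensor            # off-chip -> on-chip transfer

    def write(self, tensor):
        return tensor            # on-chip -> off-chip transfer


def run_per_operator(ops, x, mem):
    # "off-chip storage -> on-chip computation -> off-chip storage",
    # repeated for every operator: one kernel launch, one load and one
    # store per operator.
    for op in ops:
        data = mem.read(x)       # input access that cannot be optimized away
        data = op(data)          # on-chip computation
        x = mem.write(data)      # output access that cannot be optimized away
    return x


def run_fused(fused_kernel, x, mem):
    # One fused kernel: a single launch, load and store; the intermediate
    # results between the original operators never leave the chip.
    return mem.write(fused_kernel(mem.read(x)))
```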
Disclosure of Invention
The embodiments of the present application provide a computation graph execution method, a computer device, and a storage medium, which can alleviate the IO bottleneck when an artificial intelligence processor executes a learning task.
In order to solve the above problems, the present application provides a computation graph execution method, comprising:
when compiling an original computation graph containing a fusion operator, a general-purpose processor obtains, according to the operation instructions of the fusion operator, the binary instructions to be executed by the artificial intelligence processor for the original computation graph; the operation instructions of the fusion operator are obtained by the following steps:
the general-purpose processor divides the operators for the first time according to the execution device of each operator in the original computation graph, to obtain original subgraphs; wherein the execution devices comprise the general-purpose processor and the artificial intelligence processor;
the general-purpose processor checks the operators in the original subgraphs against the rules of the operators in the learning library of the artificial intelligence processor, and divides the original subgraphs a second time according to the check results, to obtain target subgraphs;
and the general-purpose processor compiles the target subgraphs to obtain the operation instructions corresponding to the fusion operator.
Optionally, the step of obtaining the original subgraphs includes:
acquiring the original computation graph and determining first-type operators in the original computation graph; wherein the operation instructions corresponding to the first-type operators can run on the artificial intelligence processor;
obtaining the computation graph formed by the first-type operators according to the directed edges between operators in the original computation graph, and extracting original subgraphs from the computation graph formed by the first-type operators; wherein an original subgraph comprises a plurality of input operators and/or a plurality of output operators, and all the original subgraphs constitute an original subgraph set.
Optionally, the step of obtaining the target subgraphs includes:
the general-purpose processor checks the operators in the original subgraph against the rules of the operators in the learning library of the artificial intelligence processor, to obtain a check result;
deleting the operators that fail the check from the original subgraph according to the check result, and pruning the computation graph formed by the remaining operators in the original subgraph, to obtain the corresponding target subgraphs; wherein each target subgraph comprises one input operator and one output operator.
Optionally, the step of pruning the computation graph formed by the remaining operators in the original subgraph includes:
in the case that the computation graph formed by the remaining operators in the original subgraph contains at least one of: one input operator and a plurality of output operators; a plurality of input operators and one output operator; or a plurality of input operators and a plurality of output operators, iteratively pruning the computation graph formed by the remaining operators in the original subgraph, to obtain target subgraphs.
Optionally, in the case that the computation graph formed by the remaining operators in the original subgraph contains one input operator and a plurality of output operators, pruning the computation graph formed by the remaining operators in the original subgraph includes:
according to the directed edges between operators in the computation graph formed by the remaining operators in the original subgraph, within the computation graph formed by the remaining operators in the same original subgraph, taking an output operator of that computation graph as the starting point, traversing the corresponding computation graph in the reverse direction, and terminating the traversal when another output operator is reached; and stopping the iterative pruning when the computation graph formed by the operators reached by the reverse traversal is a target subgraph.
Optionally, in the case that the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and one output operator, pruning the computation graph formed by the remaining operators in the original subgraph includes:
according to the directed edges between operators in the computation graph formed by the remaining operators in the original subgraph, within the computation graph formed by the remaining operators in the same original subgraph, taking an input operator of that computation graph as the starting point, traversing the corresponding computation graph in the forward direction, and terminating the traversal when another input operator is reached; and stopping the iterative pruning when the computation graph formed by the operators reached by the forward traversal is a target subgraph.
Optionally, in the case that the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators, pruning the computation graph formed by the remaining operators in the original subgraph includes:
according to the directed edges between operators in the computation graph formed by the remaining operators in the original subgraph, within the computation graph formed by the remaining operators in the same original subgraph, taking an input operator of that computation graph as the starting point, traversing each original subgraph in the original subgraph set in the forward direction, and terminating the traversal when another input operator is reached; stopping the iterative pruning when the computation graph formed by the operators reached by the forward traversal is a target subgraph;
according to the directed edges between operators in the computation graph formed by the remaining operators in the original subgraph, within the computation graph formed by the remaining operators in the same original subgraph, taking an output operator of that computation graph as the starting point, traversing in the reverse direction those original subgraphs in the original subgraph set for which no target subgraph has been obtained, and terminating the traversal when another output operator is reached; and stopping the iterative pruning when the computation graph formed by the operators reached by the reverse traversal is a target subgraph.
Optionally, in the case that the computation graph formed by the remaining operators in the original subgraph contains one input operator and a plurality of output operators, pruning the computation graph formed by the remaining operators in the original subgraph includes:
according to the directed edges between operators in the computation graph formed by the remaining operators in the original subgraph, within the computation graph formed by the remaining operators in the same original subgraph, taking an output operator of that computation graph as the starting point, traversing each original subgraph in the original subgraph set in the reverse direction, and terminating the traversal when another output operator is reached; stopping the iterative pruning when the computation graph formed by the operators reached by the reverse traversal is a target subgraph;
according to the directed edges between operators in the computation graph formed by the remaining operators in the original subgraph, within the computation graph formed by the remaining operators in the same original subgraph, taking an input operator of that computation graph as the starting point, traversing in the forward direction those original subgraphs in the original subgraph set for which no target subgraph has been obtained, and terminating the traversal when another input operator is reached; and stopping the iterative pruning when the computation graph formed by the operators reached by the forward traversal is a target subgraph.
Optionally, the step of obtaining the original subgraphs further comprises:
determining second-type operators in the original computation graph; wherein the operation instructions corresponding to the second-type operators run on the general-purpose processor;
and obtaining the computation graph formed by the second-type operators according to the directed edges between operators in the original computation graph.
To solve the above problems, the present application proposes a computer device comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method described above.
To solve the above problems, the present application proposes a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method described above.
With the above technical solution, the general-purpose processor fuses the operators in the original computation graph that can run on the artificial intelligence processor into a single operator, namely a fusion operator, and stores the operation instructions of the fusion operator. When the corresponding machine learning task is executed later, the stored operation instructions of the fusion operator are reused directly, which reduces the number of kernel launches and memory accesses of the artificial intelligence processor, avoids repeated compilation, and greatly accelerates inference. At the same time, the computer device prunes the computation graph that runs on the artificial intelligence processor so that each pruned subgraph contains one input operator and one output operator; this avoids dependencies between the subgraphs running on the artificial intelligence processor and the subgraphs running on the general-purpose processor, and improves the efficiency with which the heterogeneous system executes neural network computation tasks in parallel.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of a software stack of an artificial intelligence processor according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of the present application;
FIG. 3A is a flowchart of a computation graph execution method according to an embodiment of the present application;
FIG. 3B is a flowchart of obtaining the operation instructions of a fusion operator in the computation graph execution method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a computation graph of a neural network model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the structures of a convex graph and a non-convex graph according to an embodiment of the present application;
FIG. 6A is a schematic diagram of an original subgraph extracted from the computation graph corresponding to the first-type operators according to an embodiment of the present application;
FIG. 6B is a schematic diagram of another original subgraph extracted from the computation graph corresponding to the first-type operators according to an embodiment of the present application;
FIG. 6C is a schematic structural diagram of a subgraph after pruning according to an embodiment of the present application;
FIG. 6D is a schematic structural diagram of another subgraph after pruning according to an embodiment of the present application;
FIG. 6E is a schematic structural diagram of the subgraphs obtained after pruning the computation graph of a neural network model according to an embodiment of the present application;
FIG. 6F is a schematic structural diagram of the subgraphs obtained without pruning the computation graph of the neural network model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computation graph execution apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
In order to facilitate better understanding of the technical solution described in the present application, technical terms related to the embodiments of the present application are explained below:
(1) Computation graph
A computation graph is a way of describing a computation process using a graph structure. If the computation is clearly modular and there are obvious temporal and logical dependencies between the modules, it can generally be described with a directed graph structure. In practical applications, a graph structure has two basic elements: nodes and directed edges. A neural network is abstracted into a directed graph structure composed of tensor data and operators; the nodes are also called operators.
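As a concrete illustration, a computation graph can be represented with operator nodes and directed edges, as in the following minimal Python sketch (the class names and fields are assumptions made for illustration, not the data structures of any particular framework):

```python
# Minimal sketch of a computation graph: nodes are operators, directed edges
# carry tensor data from a producer operator to a consumer operator.
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality keeps operator nodes hashable
class Operator:
    name: str                                        # e.g. "conv1", "relu1"
    op_type: str                                     # e.g. "Convolution", "ReLU"
    inputs: List["Operator"] = field(default_factory=list)
    outputs: List["Operator"] = field(default_factory=list)


class ComputationGraph:
    def __init__(self) -> None:
        self.operators: List[Operator] = []

    def add_operator(self, op: Operator) -> Operator:
        self.operators.append(op)
        return op

    def add_edge(self, src: Operator, dst: Operator) -> None:
        # Directed edge: the output tensor data of src is input tensor data of dst.
        src.outputs.append(dst)
        dst.inputs.append(src)


# Tiny usage example: a two-operator graph conv1 -> relu1.
g = ComputationGraph()
conv = g.add_operator(Operator("conv1", "Convolution"))
relu = g.add_operator(Operator("relu1", "ReLU"))
g.add_edge(conv, relu)
```

Later sketches in this description reuse this illustrative Operator structure.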
Generally, describing a neural network model as a computation graph helps to grasp the computation task of the whole neural network as a whole, and the computation graph representation is also convenient for scheduling and for parallel execution of the computation task.
(2) Software stack of artificial intelligence processor:
referring to FIG. 1, the software stack structure 10 includes an artificial intelligence application 100, an artificial intelligence framework 102, an artificial intelligence learning library 104, an artificial intelligence runtime library 106, and a driver 108. This is specifically explained below:
The artificial intelligence application 100 provides the artificial intelligence algorithm models corresponding to different application scenarios. The algorithm model can be parsed directly by the programming interface of the artificial intelligence framework 102. In one possible implementation, the artificial intelligence learning library 104 converts the neural network model into binary instructions, the artificial intelligence runtime library 106 is invoked to convert the binary instructions into artificial intelligence learning tasks and place them in a task queue, and the driver 108 schedules the learning tasks in the task queue to be executed by the underlying artificial intelligence processor.
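This flow can be pictured with the following hedged sketch; every function and attribute name in it is an assumed placeholder rather than an actual API of the components shown in FIG. 1.

```python
# Hedged sketch of the data flow in FIG. 1 (placeholder names throughout).
def run_on_ai_processor(model, framework, learning_lib, runtime_lib, driver):
    graph = framework.parse(model)               # artificial intelligence framework 102
    binary = learning_lib.compile(graph)         # artificial intelligence learning library 104
    task = runtime_lib.to_learning_task(binary)  # artificial intelligence runtime library 106
    runtime_lib.task_queue.append(task)          # place the learning task in a task queue
    driver.schedule(runtime_lib.task_queue)      # driver 108 dispatches the queued tasks
                                                 # to the underlying AI processor
```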
(3) Subgraph extraction
A heterogeneous system comprises an artificial intelligence processor and a general-purpose processor. In practical applications, the general-purpose processor can compile a neural network model to generate the binary instructions of the corresponding machine learning task, and these binary instructions run on the artificial intelligence processor. Therefore, a deep learning framework (e.g., Caffe) first needs to extract specific subgraphs from the complete neural network computation graph; the operators in these subgraphs will all be placed on the artificial intelligence processor for execution. The software stack of the artificial intelligence processor then compiles and optimizes each subgraph to obtain a fused kernel function corresponding to the whole subgraph.
In the embodiments of the present application, during subgraph extraction it must first be ensured that, after a subgraph of the computation graph is fused into one node, no cycle is introduced into the original computation graph. The reason is that cycles would make the operators in the computation graph depend on each other topologically.
(4) Dependency relationship
In the embodiments of the present application, operator A depending on operator B means that operator A must wait until the kernel function corresponding to operator B has finished executing before starting its own computation task. If operator B is included in a subgraph S as a result of subgraph fusion, then operator A must wait until the computation tasks of all operators in S have been completely executed before starting to execute its own kernel function.
(5) Deep learning framework
As the name implies, the deep learning framework is a framework for deep learning. Specifically, as can be seen from FIG. 1, the deep learning framework is the first layer in the software stack of the artificial intelligence processor and is used to connect deep learning applications in various formats with the deep learning computing platform.
In the prior art, a deep learning framework generally adopts the computation graph as the main data structure for describing a neural network model, and on this basis completes the mapping from the computation graph to the underlying kernel functions at operator granularity or across operators. Meanwhile, the deep learning framework may implement specific kernel functions either by using a programming language directly or by invoking an underlying computing library.
In the embodiments of the present application, the deep learning framework may include, but is not limited to: Google's TensorFlow, the convolutional neural network framework Caffe (Convolutional Architecture for Fast Feature Embedding), MXNet, Torch, and the like.
Taking Caffe as an example, Caffe supports multiple types of deep learning architectures, image classification and image segmentation, and can also support convolutional neural networks (Convolutional Neural Networks, CNN), convolutional neural networks for object detection (Region-CNN, RCNN), Long Short-Term Memory (LSTM) networks, and fully connected neural network designs.
In the embodiments of the present application, the Caffe framework can support multiple types of basic operators; specifically, these may include common neural network operators, for example: convolution/deconvolution operators, pooling operators, activation operators, softmax (classifier) operators, and fully connected operators. The activation operators may include, but are not limited to, ReLU, Sigmoid, Tanh, and other operators that can be implemented by interpolation.
In the embodiments of the present application, the functions under the Caffe framework may include the Caffe Blob function, the Caffe Layer function, and the Caffe Net function. A Blob is used to store, exchange, and process the data and derivative information of the forward and backward iterations in the network; a Layer is used to perform computation, which may include nonlinear operations such as convolution (convolve), pooling (pool), inner product (inner product), rectified-linear, and sigmoid, as well as element-wise data transformations, normalization (normalize), data loading (load data), classification (softmax), and loss computations (losses).
In a specific implementation, each Layer defines three important operations: initialization setup (setup), forward propagation (forward), and backward propagation (backward). setup resets the layer and its connections at model initialization; forward accepts input data from the bottom layer, computes, and outputs to the top layer; backward takes the output gradient from the top layer, computes the gradient of its input, and passes it to the bottom layer. For example, the Layers may include Data Layers, Convolution Layers, Pooling Layers, InnerProduct Layers, ReLU Layers, Sigmoid Layers, LRN Layers, Dropout Layers, SoftmaxWithLoss Layers, Softmax Layers, Accuracy Layers, and so on. A Net starts with a data layer, which loads data from disk, and ends with a loss layer, which computes the objective function of tasks such as classification and reconstruction. Specifically, a Net is a directed acyclic computation graph composed of a series of layers; Caffe retains all intermediate values in the computation graph to ensure the accuracy of the forward and backward iterations.
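A hedged Python sketch of this three-operation contract is given below (modeled loosely on the idea of a Python layer; the class, the blob representation as a dict with "data" and "diff" entries, and the method bodies are illustrative assumptions, not Caffe source code).

```python
# Illustrative sketch of the setup/forward/backward contract of a Layer.
import numpy as np


class ReLULayer:
    def setup(self, bottom, top):
        # Model initialization: prepare this layer and its connection to the
        # bottom and top blobs.
        top[0]["data"] = np.zeros_like(bottom[0]["data"])

    def forward(self, bottom, top):
        # Accept input data from the bottom blob, compute, output to the top blob.
        top[0]["data"] = np.maximum(bottom[0]["data"], 0.0)

    def backward(self, top, bottom):
        # Given the output gradient of the top blob, compute the gradient of
        # the input and pass it down to the bottom blob.
        bottom[0]["diff"] = top[0]["diff"] * (bottom[0]["data"] > 0.0)
```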
(6) Artificial intelligence processor
An artificial intelligence processor, also referred to as a dedicated processor, is, in the embodiments of the present application, a processor aimed at a specific application or domain. For example, the graphics processing unit (GPU, Graphics Processing Unit), also known as the display core, vision processor, or display chip, is a dedicated processor for image computation on personal computers, workstations, game consoles, and some mobile devices (e.g., tablet computers, smartphones, etc.). Another example is the neural network processor (NPU, Neural Processing Unit), a dedicated processor for matrix multiplication operations in artificial intelligence applications; it adopts a data-driven parallel computing architecture and is particularly good at processing massive multimedia data such as video and images.
Referring to FIG. 2, which is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in FIG. 2, the computer device 20 may include a general-purpose processor 201, a memory 202, a communication bus 203, a communication interface 204, and at least one artificial intelligence processor 205, where the general-purpose processor 201 and the artificial intelligence processor 205 are coupled to the memory 202 and the communication interface 204 via the communication bus 203.
The general-purpose processor 201 may be a central processing unit (Central Processing Unit, CPU); it may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor 201 may be a microprocessor or any conventional processor.
The general-purpose processor 201 may also be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the method of the present application may be completed by integrated logic circuits of the hardware in the general-purpose processor 201 or by instructions in the form of software.
The memory 202 may be read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), or another memory. In the embodiments of the present application, the memory 202 is used to store data and the software programs corresponding to the methods shown in FIG. 3A and FIG. 3B, for example a program that prunes the original subgraphs satisfying the pruning conditions so that each pruned subgraph contains one input operator and one output operator.
Alternatively, in the embodiments of the present application, the memory may comprise a physical device for storing information, which typically digitizes the information and then stores it in a medium using electrical, magnetic, or optical means. The memory in this embodiment may further include: devices that store information by means of electrical energy, such as RAM and ROM; devices that store information by means of magnetic energy, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, bubble memories, and USB flash drives; and devices that store information optically, such as CDs or DVDs. Of course, there are other storage media, such as quantum memories and graphene memories.
The communication interface 204 enables communication between the computer device 20 and other devices or communication networks using a transceiver means such as, but not limited to, a transceiver. For example, model files sent by other devices may be received through the communication interface 204.
The artificial intelligence processor 205 may be mounted as a coprocessor on a host CPU (Host CPU), which assigns tasks to it. In practice, the artificial intelligence processor 205 may implement one or more kinds of operations. For example, taking a neural network processor (NPU) as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to fetch matrix data from the memory 202 and perform multiply-add operations.
Alternatively, the artificial intelligence processor 205 may include 8 clusters, each of which contains 4 artificial intelligence processor cores.
Alternatively, the artificial intelligence processor 205 may be an artificial intelligence processor with a reconfigurable architecture. Here, a reconfigurable architecture means that, if an artificial intelligence processor can use reusable hardware resources to flexibly change its own architecture according to different application requirements, so as to provide a matching architecture for each specific application requirement, then it is called a reconfigurable computing system and its architecture is called a reconfigurable architecture.
It should be understood that the computer device 20 is only one example provided by the embodiments of the present application, and that the computer device 20 may have more or fewer components than shown, may combine two or more components, or may have a different configuration of components.
In practice, because the learning library of the artificial intelligence processor does not support all types of operators, the frequent input/output interactions between the general-purpose processor and the artificial intelligence processor, together with the kernel launches of the artificial intelligence processor, consume a great deal of time in the usual operating mode. Under the previous dynamic fusion strategy, many problems with dynamic fusion were found during debugging, for example: the same neural network is split into a different number of segments when it runs on different devices; and the same network produces different fusion nodes on each execution, which leads to repeated compilation and degrades performance when the artificial intelligence processor executes a learning task.
On this basis, the following describes, with reference to the flowchart of the computation graph execution method provided by the embodiment of the present application and shown in FIG. 3A, how the embodiments of the present application relieve the IO bottleneck when the artificial intelligence processor executes a learning task. The method may include, but is not limited to, the following steps:
Step a: when the general-purpose processor compiles an original computation graph containing a fusion operator, the binary instructions to be executed by the artificial intelligence processor for the original computation graph are obtained according to the operation instructions of the fusion operator.
The general-purpose processor fuses the operators in the original computation graph that can run on the artificial intelligence processor into a single operator, namely a fusion operator, and stores the operation instructions of the fusion operator. When the corresponding machine learning task is executed later, the stored operation instructions of the fusion operator are reused directly, which reduces the number of kernel launches and memory accesses of the artificial intelligence processor, avoids repeated compilation, and greatly accelerates inference.
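A minimal sketch of this compile-once, reuse-later behaviour is given below; the cache key, the compile_fusion_operator function and the dictionary cache are assumed names standing in for the mechanism described here and detailed in Step 1) and Step 2) below, not the patent's concrete implementation.

```python
# Hedged sketch: reuse the stored operation instructions of a fusion operator.
_fusion_cache = {}   # fusion-operator identifier -> compiled operation instructions


def get_operation_instructions(graph, fusion_op, compile_fusion_operator):
    key = fusion_op.name                   # assumes a stable identifier per fusion op
    if key not in _fusion_cache:
        # Compile once: produce the operation instructions of the fusion operator.
        _fusion_cache[key] = compile_fusion_operator(graph, fusion_op)
    # Later executions of the same machine learning task reuse the cached
    # instructions instead of recompiling, which cuts kernel launches and I/O.
    return _fusion_cache[key]
```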
As shown in FIG. 3B, the operation instructions of the fusion operator are obtained by the following steps:
Step 1): the general-purpose processor divides the operators for the first time according to the execution device of each operator in the original computation graph, to obtain original subgraphs; the execution devices include the general-purpose processor and the artificial intelligence processor.
In this step, obtaining the original subgraphs includes: acquiring the original computation graph and determining the first-type operators in the original computation graph, where the operation instructions corresponding to the first-type operators can run on the artificial intelligence processor; then obtaining the computation graph formed by the first-type operators according to the directed edges between operators in the original computation graph, and extracting original subgraphs from the computation graph formed by the first-type operators. An original subgraph may comprise a plurality of input operators and/or a plurality of output operators, and all the original subgraphs constitute an original subgraph set.
In the embodiments of the present application, the first-type operators are operators that can run on the artificial intelligence processor. For example, the first-type operators may include the meta operators supported by the artificial intelligence processor. Specifically, in the embodiments of the present application, the meta operators may include, but are not limited to: convolution/deconvolution operators, pooling operators, activation operators, local response normalization (LRN, Local Response Normalization)/batch normalization operators, classifier (Softmax) operators, fully connected operators, and the like. The activation operators may include, but are not limited to, ReLU, Sigmoid, Tanh, and other operators that can be implemented by interpolation.
In the embodiments of the present application, the second-type operators are operators that run on the general-purpose processor. For example, the second-type operators may include newly developed operators, which run on the general-purpose processor for the following reasons: in practical applications, the artificial intelligence learning library in the software stack of the artificial intelligence processor may not yet support such an operator, so the artificial intelligence processor cannot obtain the binary instructions corresponding to the operator; or the operator itself contains no computational logic that can be accelerated in parallel, but many conditional jumps and other logic that suits the characteristics of the general-purpose processor, in which case such an operator is run on the general-purpose processor. It can be understood that running the second-type operators on the general-purpose processor can improve the running speed of the neural network model.
In the embodiments of the present application, the computer device can acquire a model file of the neural network model, where the model file contains the operators and the connection relationships between them; the computer device can then construct the original computation graph of the neural network model from the model file. In practical applications, in one possible implementation, the neural network model contains both first-type operators and second-type operators, and the constructed original computation graph contains the computation graph corresponding to the first-type operators and the computation graph corresponding to the second-type operators. In another possible implementation, the neural network model contains only first-type operators, and the constructed original computation graph contains only the computation graph corresponding to the first-type operators. For example, the original computation graph of a neural network model obtained by the computer device may be as shown in FIG. 4, where NNP1-NNP7 (NNP: Neural Network Processor) denote operators running on the artificial intelligence processor, and CPU (Central Processing Unit) denotes an operator running on the general-purpose processor.
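The first division can be sketched as follows; runs_on_ai_processor is an assumed predicate standing for the check whether an operator's operation instructions can run on the artificial intelligence processor, and grouping connected first-type operators is only the starting point for the extraction rules (such as convexity) discussed below.

```python
# Hedged sketch of the first division, reusing the illustrative Operator
# structure sketched earlier: split the original computation graph by
# execution device and group connected first-type operators into candidate
# original subgraphs.
from collections import deque


def first_division(graph, runs_on_ai_processor):
    first_type = [op for op in graph.operators if runs_on_ai_processor(op)]
    second_type = [op for op in graph.operators if not runs_on_ai_processor(op)]

    remaining, original_subgraphs = set(first_type), []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, deque([seed])
        while frontier:
            op = frontier.popleft()
            # Directed edges treated as undirected for grouping purposes.
            for neighbour in op.inputs + op.outputs:
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    frontier.append(neighbour)
        original_subgraphs.append(component)
    # Original subgraph set for the AI processor + operators left on the CPU.
    return original_subgraphs, second_type
```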
In the embodiments of the present application, the number of original subgraphs extracted by the general-purpose processor of the computer device from the computation graph corresponding to the first-type operators may be 1, or may be larger, for example 4.
In one possible implementation, extracting M original subgraphs from the computation graph corresponding to the first-type operators includes:
extracting the M original subgraphs from the computation graph corresponding to the first-type operators according to subgraph extraction rules.
As described above, during subgraph extraction it must first be ensured that fusing a subgraph of the computation graph into one node does not introduce a cycle into the original computation graph. The reason is that cycles would make the operators in the computation graph depend on each other topologically.
Specifically, for deep learning frameworks such as MXNet, a graph structure with cycles may cause the scheduling engine at the back end of the framework to deadlock when scheduling operators for execution, because before the scheduling engine launches the kernel function corresponding to an operator, it must ensure that the operators that this operator depends on in the computation graph have already finished executing.
In practical applications, convexity (Convex) can be used as an equivalent constraint to guarantee freedom from deadlock. As shown in FIG. 5, a subgraph S of a directed graph G is said to be convex if and only if no path between any two nodes of S passes through a node outside S. Any subgraph that breaks convexity necessarily has some outside nodes that depend on some nodes inside the subgraph, while some other nodes inside the subgraph depend on those outside nodes, which causes a scheduling deadlock.
Further, besides convexity, a subgraph should also guarantee connectivity. A subgraph S of a directed graph G is connected, i.e. S is a connected graph, if and only if S is connected when its directed edges are regarded as undirected edges.
In practical applications, the extracted subgraphs should also be as large as possible. This principle is based on two intuitive judgments: a maximal subgraph gives the lower software stack as much search and optimization space as possible, and a maximal subgraph minimizes the startup overhead of kernel functions.
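A hedged sketch of these two structural checks on a candidate subgraph (a set of the illustrative Operator nodes sketched earlier) follows; it is an illustration of the constraints, not the extraction algorithm of the patent.

```python
# Hedged sketch: convexity and connectivity checks for a candidate subgraph S.
from collections import deque


def is_convex(subgraph):
    # S is convex iff no path between two nodes of S passes through a node
    # outside S; equivalently, following directed edges that leave S, we must
    # never be able to re-enter S.
    seen_outside = set()
    frontier = deque(n for op in subgraph for n in op.outputs if n not in subgraph)
    while frontier:
        node = frontier.popleft()
        if node in subgraph:
            return False            # a path left S and came back: not convex
        if node in seen_outside:
            continue
        seen_outside.add(node)
        frontier.extend(node.outputs)
    return True


def is_connected(subgraph):
    # S is connected iff it is connected when its directed edges are treated
    # as undirected edges.
    start = next(iter(subgraph))
    seen, frontier = {start}, deque([start])
    while frontier:
        op = frontier.popleft()
        for neighbour in op.inputs + op.outputs:
            if neighbour in subgraph and neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return len(seen) == len(subgraph)
```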
Step 2): the general-purpose processor checks the operators in the original subgraphs against the rules of the operators in the learning library of the artificial intelligence processor, and divides the original subgraphs a second time according to the check results, to obtain the target subgraphs.
In practice, according to the rules of the operators in the learning library of the artificial intelligence processor, the framework also needs to perform an operator boundary check on the operators in each original subgraph obtained in Step 1), divide the operators that can be executed consecutively into one subgraph, and compile them to form a fusion op, i.e. the set of operators that can really be fused into a single operator. The fusion op is stored in a cache. When the optimized computation graph containing the fusion operator is executed, the graph is no longer executed layer by layer when the fusion operator is reached; instead, the already compiled fusion op is fetched directly from the cache.
In this step, obtaining the target subgraphs includes:
the general-purpose processor checks the operators in the original subgraph against the rules of the operators in the learning library of the artificial intelligence processor, to obtain a check result;
and deleting the operators that fail the check from the original subgraph according to the check result, and pruning the computation graph formed by the remaining operators in the original subgraph, to obtain the corresponding target subgraphs; each target subgraph contains one input operator and one output operator.
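A hedged sketch of this second division is given below; passes_learning_library_rules and prune_to_target_subgraphs are assumed placeholders for the operator rules of the learning library and for the traversal-based pruning described in the following paragraphs.

```python
# Hedged sketch of the second division: check, delete, then prune.
def second_division(original_subgraph, passes_learning_library_rules,
                    prune_to_target_subgraphs):
    # Operator boundary check against the rules of the learning library.
    check_result = {op: passes_learning_library_rules(op) for op in original_subgraph}

    # Delete the operators that failed the check.
    remaining = {op for op, passed in check_result.items() if passed}

    # Prune the computation graph formed by the remaining operators until every
    # target subgraph has one input operator and one output operator.
    return prune_to_target_subgraphs(remaining)
```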
In the embodiments of the present application, after the operator boundary check has been performed on an original subgraph extracted from the computation graph corresponding to the first-type operators, a computation graph formed by the remaining operators in the original subgraph is obtained. This computation graph may fall into the following cases:
First case: the computation graph contains one input operator and a plurality of output operators. For example, as shown in FIG. 6A, the computation graph formed by the remaining operators in the extracted original subgraph contains one input operator and two output operators.
Second case: the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and one output operator. For example, as shown in FIG. 6B, the computation graph formed by the remaining operators in the extracted original subgraph contains two input operators and one output operator.
Third case: the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators.
Fourth case: the computation graph formed by the remaining operators in the original subgraph contains one input operator.
Fifth case: the computation graph formed by the remaining operators in the original subgraph contains one output operator.
It should be noted that the above are only examples and should not be construed as limiting, since an original subgraph can take many forms. In the embodiments of the present application, taking the computation graph of the neural network model shown in FIG. 4 as an example, the subgraphs extracted by the computer device from the computation graph corresponding to the first-type operators include the original subgraph shown in FIG. 6A and the original subgraph shown in FIG. 6B.
In a specific implementation, the step of pruning the computation graph formed by the remaining operators in the original subgraph includes:
in the case that the computation graph formed by the remaining operators in the original subgraph contains at least one of: one input operator and a plurality of output operators; a plurality of input operators and one output operator; or a plurality of input operators and a plurality of output operators, iteratively pruning the computation graph formed by the remaining operators in the original subgraph, to obtain target subgraphs.
In the embodiments of the present application, for convenience of description, the three cases in which the computation graph formed by the remaining operators in the original subgraph contains one input operator and a plurality of output operators, a plurality of input operators and one output operator, or a plurality of input operators and a plurality of output operators are defined as different pruning conditions. The general-purpose processor then prunes the computation graph formed by the remaining operators in the original subgraph when it determines that this computation graph satisfies at least one of the pruning conditions.
In the embodiments of the present application, when a plurality of original subgraphs are extracted from the computation graph corresponding to the first-type operators, in one possible implementation the computer device may prune one of the original subgraphs; in another possible implementation the computer device may prune each of the original subgraphs; the embodiments of the present application impose no particular limitation.
In the embodiments of the present application, the process in which the general-purpose processor prunes the computation graph formed by the remaining operators in the original subgraph is an iterative pruning process. Specifically, an iteration is the repetition of a group of instructions (or of a certain step) in a computer program; the term can be used generally (as a synonym of "repetition") or to describe a specific form of repetition with a mutable state.
In the embodiments of the present application, the iterative pruning of the computation graph formed by the remaining operators in the original subgraph can be illustrated with the case in which that computation graph contains one input operator and a plurality of output operators, say output operator 1, output operator 2, and output operator 3. The general-purpose processor first traverses the corresponding computation graph in the reverse direction with output operator 1 as the starting point, terminating the traversal when another output operator (for example, output operator 2) is reached; this yields a subgraph associated with output operator 1. It then traverses the computation graph in the reverse direction with output operator 2 as the starting point, terminating when another output operator (for example, output operator 3) is reached, which yields a subgraph associated with output operator 2; the same is done for output operator 3. For each subgraph obtained in this way, if it still satisfies at least one of the pruning conditions, the iterative pruning continues on that subgraph according to the corresponding pruning condition; the iteration stops when the traversed subgraphs are target subgraphs.
The cases in which the pruning conditions are satisfied are described in detail below:
First case: the computation graph formed by the remaining operators in the original subgraph contains one input operator and a plurality of output operators.
In a specific implementation, in the case that the computation graph formed by the remaining operators in the original subgraph contains one input operator and a plurality of output operators, pruning the original subgraph includes:
according to the directed edges between operators in the original subgraph, within the computation graph formed by the remaining operators in the same original subgraph, taking an output operator of that computation graph as the starting point, traversing the corresponding computation graph in the reverse direction, and terminating the traversal when another output operator is reached; and stopping the iterative pruning when the subgraph formed by the operators reached by the reverse traversal is a target subgraph.
In the embodiments of the present application, the directed edges can be used to characterize the connection relationships (e.g., dependency relationships) between operators, and can also be used to characterize the execution order in which the artificial intelligence processor executes the computation graph.
In the embodiments of the present application, according to the directed edges between the operators of the computation graph formed by the remaining operators in the original subgraph, within that computation graph each of its output operators is taken in turn as a starting point, the corresponding computation graph is traversed in the reverse direction, and the traversal terminates when another output operator is reached; in this way a subgraph is obtained for each output operator. Each such subgraph is a part of the computation graph formed by the remaining operators in the original subgraph, so it can be understood that superimposing the subgraphs obtained for the individual output operators recovers the computation graph formed by the remaining operators in the original subgraph. After the subgraph for each output operator has been obtained by traversal, it is determined whether that subgraph satisfies a pruning condition; if it does, the general-purpose processor continues pruning according to the specific pruning condition until the traversed subgraph is a target subgraph.
In practical applications, performing reverse traversal on the plurality of output operators may include:
obtaining a target output operator; wherein the target output operator is any one of the plurality of output operators;
and performing reverse traversal from the target output operator according to the directed edges between operators, stopping the traversal when the traversed operator is an output operator, and obtaining through the reverse traversal a subgraph formed by the target output operator.
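A minimal sketch of this reverse traversal, reusing the illustrative Graph helpers above; the cut-off rule (another output operator terminates the traversal and is not taken into the sub-graph) follows the description, while the function name and signature are assumptions:

def reverse_traverse(graph, nodes, start, outputs):
    # Reverse traversal from one target output operator over the operator
    # set `nodes`, following directed edges backwards; any other output
    # operator terminates the traversal and is not taken into the sub-graph.
    visited = {start}
    stack = [start]
    while stack:
        op = stack.pop()
        for pred in graph.preds[op]:
            if pred not in nodes or pred in visited:
                continue
            if pred in outputs:      # reached another output operator: cut off
                continue
            visited.add(pred)
            stack.append(pred)
    return visited                   # sub-graph formed by the target output operator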
It should be noted that, when the reverse traversal is performed on the plurality of output operators, it can be ensured that only one output operator is included in the sub-graph formed by each output operator, but it cannot be ensured that only one input operator is included. In one case, the sub-graph formed by an output operator contains one input operator; in another case, it contains a plurality of input operators.
In the embodiment of the application, the process of traversing the computation graph formed by the remaining operators in the original subgraph can be regarded as a process of disassembling that computation graph into a plurality of partial subgraphs. Each new subgraph obtained during disassembly may again exhibit any of the five forms of input and output operators described in the present application.
In the embodiment of the application, the dependency relationship between operators is used for describing the pointing relationship between operators, and can be expressed as a directed edge in a directed graph. For example, taking NNP1 and NNP2 shown in fig. 6A as an example, NNP1 points to NNP2, that is, output tensor data of NNP1 is taken as input tensor data of NNP 2.
As described above, the computational graph formed by the remaining operators in the original subgraph shown in fig. 6A includes one input operator and two output operators. In the case that the computer device determines that the calculation graph composed of the remaining operators in the original subgraph shown in fig. 6A satisfies the pruning condition, the computer device prunes the calculation graph. Because the calculation graph comprises two output operators, when the computer equipment prunes the calculation graph, reverse traversal is needed to be sequentially carried out along different output operators so as to obtain the subgraph formed by each output operator through reverse traversal.
In one case, the sub-graph formed by the traversed operators is pruned if it satisfies at least one of the pruning conditions. The reason for pruning is that the sub-graph obtained by reversely traversing a given output operator may contain one output operator and a plurality of input operators; in that situation, forward traversal is then performed on the plurality of input operators to obtain the sub-graphs formed by them, and iterative pruning stops when the sub-graph formed by the traversed operators is a pruning result sub-graph.
In another case, if the sub-graph formed by the traversed operators does not satisfy any of the pruning conditions, pruning is not performed.
For example, in the first reverse traversal, the computer device selects NNP4 as the target output operator from the two output operators (NNP1 and NNP4), and then reversely traverses from NNP4 (i.e., with NNP4 as the reverse traversal starting point) while preserving the dependency between operators: the first traversed operator is NNP3, the second is NNP2, and the third is NNP1; since NNP1 is an output operator, it cannot be taken into this traversal, and the traversal is cut off at that point. In this case, with NNP4 as the reverse traversal starting point, the three traversed operators (NNP4, NNP3 and NNP2) constitute the sub-graph corresponding to NNP4. In the second reverse traversal, the computer device selects NNP1 as the target output operator. When NNP1 is reversely traversed while preserving the dependency between operators, only one operator is left in the remaining part of the calculation graph formed by the remaining operators in the original subgraph shown in FIG. 6A; in this case, that operator alone forms the subgraph corresponding to NNP1.
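The description of fig. 6A (the figure itself is not reproduced here) is consistent with the following assumed edge list, in which "CPU" is merely a placeholder for the operator outside the remaining-operator set that consumes the output of NNP1; applying the illustrative helpers above reproduces the two traversals just described:

# Assumed structure for the fig. 6A-like example (illustrative only).
g = Graph([("NNP1", "NNP2"), ("NNP2", "NNP3"),
           ("NNP3", "NNP4"), ("NNP1", "CPU")])
remaining = {"NNP1", "NNP2", "NNP3", "NNP4"}
outs = output_ops(g, remaining)                           # {"NNP1", "NNP4"}
sub_nnp4 = reverse_traverse(g, remaining, "NNP4", outs)   # {"NNP2", "NNP3", "NNP4"}
sub_nnp1 = reverse_traverse(g, remaining, "NNP1", outs)   # {"NNP1"}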
In practical application, since the subgraph obtained by reverse traversal of NNP4 includes one input operator and one output operator, and the subgraph obtained by reverse traversal of NNP1 includes only one output operator, neither subgraph satisfies any pruning condition, and in this case no pruning operation is performed on them.
It can be appreciated from the above description that, when the computer device prunes the calculation graph composed of the remaining operators in the original subgraph shown in fig. 6A, the resulting subgraph can be shown in fig. 6C.
It will be appreciated that, for the first traversal, the computer device may also select NNP1 as the target output operator from the two output operators (NNP1 and NNP4); in the second traversal, the computer device then selects NNP4 as the target output operator for traversal. For the specific implementation, reference is made to the foregoing description, and details are omitted here.
Second case: the computational graph formed by the remaining operators in the original subgraph comprises a plurality of input operators and one output operator.
In a specific implementation, when a computation graph formed by remaining operators in the original subgraph contains a plurality of input operators and one output operator, pruning the computation graph includes:
According to the directed edges between operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, each input operator of the computation graph is used as a starting point, the corresponding computation graph is traversed in the forward direction, and reaching another input operator serves as the termination condition of the traversal; iterative pruning is stopped when the subgraph formed by the forwardly traversed operators is a target subgraph.
In the embodiment of the application, according to the directed edges between operators in the original subgraph, each of the plurality of input operators contained in the computation graph formed by the remaining operators in the same original subgraph is used in turn as a starting point, the computation graph is traversed in the forward direction, and reaching another input operator is the termination condition of the traversal. The traversal yields a subgraph formed by each input operator, and each such subgraph is a part of the computation graph. After the subgraph formed by each input operator is obtained by traversal, it is judged whether that subgraph satisfies the pruning condition; if it does, the general processor prunes it according to the specific pruning condition that is satisfied, until the traversed subgraph is a target subgraph.
In practical applications, performing forward traversal on multiple input operators may include:
acquiring a target input operator; wherein the target input operator is any one of the plurality of input operators;
and performing forward traversal from the target input operator according to the directed edges between operators, stopping the traversal when the traversed operator is an input operator, and obtaining through the forward traversal a subgraph formed by the target input operator.
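The forward traversal is the mirror image of the reverse traversal sketched earlier: it follows the directed edges in the forward direction and is cut off before any other input operator. Again, the function name and signature below are assumptions:

def forward_traverse(graph, nodes, start, inputs):
    # Forward traversal from one target input operator over the operator
    # set `nodes`; any other input operator terminates the traversal and
    # is not taken into the sub-graph.
    visited = {start}
    stack = [start]
    while stack:
        op = stack.pop()
        for succ in graph.succs[op]:
            if succ not in nodes or succ in visited:
                continue
            if succ in inputs:       # reached another input operator: cut off
                continue
            visited.add(succ)
            stack.append(succ)
    return visited                   # sub-graph formed by the target input operator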
It should be noted that, when the forward traversal is performed on the plurality of input operators, it can be ensured that only one input operator is included in the sub-graph formed by each input operator, but it cannot be ensured that only one output operator is included. In one case, the sub-graph formed by an input operator contains one output operator; in another case, it contains a plurality of output operators.
As mentioned above, the process of traversing the computational graph formed by the remaining operators in the original subgraph can be understood as a process of disassembling that computational graph into a plurality of partial subgraphs. Each new subgraph obtained during disassembly may again exhibit any of the five forms of input and output operators described in the present application.
As described above, the computational graph formed by the remaining operators in the original subgraph shown in fig. 6B includes two input operators and one output operator. In the case that the computer device determines that this calculation graph meets the pruning condition, the computer device prunes it. Because the calculation graph formed by the remaining operators in the original subgraph comprises two input operators, when the computer device prunes it, forward traversal needs to be carried out sequentially along the different input operators, and the subgraph formed by each input operator is obtained through the forward traversal.
In one case, the sub-graph formed by the traversed operators is pruned if it satisfies at least one of the pruning conditions. The reason for pruning is that the sub-graph obtained by forwardly traversing a given input operator may contain one input operator and a plurality of output operators; in that situation, the plurality of output operators are reversely traversed to obtain the sub-graphs formed by them, and iterative pruning stops when the sub-graph formed by the traversed operators is the target sub-graph.
In another case, if the sub-graph formed by the traversed operators does not satisfy any of the pruning conditions, pruning is not performed.
For example, in the first forward traversal, the computer device selects NNP5 as the target input operator from the two input operators (NNP5 and NNP7), and then forwardly traverses from NNP5 (i.e., with NNP5 as the forward traversal starting point) while preserving the dependency between operators: the first traversed operator is NNP6, and the second is NNP7; since NNP7 is an input operator, it cannot be taken into this traversal, and the traversal is stopped at that point. In this case, with NNP5 as the forward traversal starting point, the two traversed operators (NNP5 and NNP6) constitute the sub-graph corresponding to NNP5. In the second forward traversal, the computer device selects NNP7 as the target input operator. When NNP7 is forwardly traversed while preserving the dependency between operators, only one operator is left in the remaining part of the calculation graph formed by the remaining operators in the original subgraph shown in fig. 6B; in this case, that operator alone forms the subgraph corresponding to NNP7.
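The description of fig. 6B is likewise consistent with the following assumed edge list ("CPU" again stands for an operator outside the remaining-operator set; it is what makes NNP7 an input operator). Running the illustrative forward traversal reproduces the two sub-graphs just described:

# Assumed structure for the fig. 6B-like example (illustrative only).
g2 = Graph([("NNP5", "NNP6"), ("NNP6", "NNP7"), ("CPU", "NNP7")])
remaining2 = {"NNP5", "NNP6", "NNP7"}
ins = input_ops(g2, remaining2)                            # {"NNP5", "NNP7"}
sub_nnp5 = forward_traverse(g2, remaining2, "NNP5", ins)   # {"NNP5", "NNP6"}
sub_nnp7 = forward_traverse(g2, remaining2, "NNP7", ins)   # {"NNP7"}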
In practical application, since the subgraph obtained by the forward traversal from NNP5 includes one input operator and one output operator, and the subgraph obtained by the forward traversal from NNP7 includes only one input operator, neither subgraph satisfies any pruning condition, and in this case no pruning operation is performed on them.
It can be appreciated from the above description that, when the computer device prunes the calculation graph composed of the remaining operators in the original subgraph shown in fig. 6B, the resulting subgraph can be shown in fig. 6D.
It will be appreciated that, for the first traversal, the computer device may also select NNP7 as the target input operator from the two input operators (NNP5 and NNP7); in the second traversal, the computer device then selects NNP5 as the target input operator for traversal. For the specific implementation, reference is made to the foregoing description, and details are omitted here.
Third case: the computational graph formed by the remaining operators in the original subgraph comprises a plurality of input operators and a plurality of output operators.
In the embodiment of the present application, when the computation graph formed by the remaining operators in the original subgraph includes a plurality of input operators and a plurality of output operators, the general processor may perform forward traversal on the plurality of input operators first and then perform reverse traversal on the plurality of output operators, or perform reverse traversal on the plurality of output operators first and then perform forward traversal on the plurality of input operators. These two cases are explained below:
In one possible implementation, according to the directed edges between operators in the original subgraph, in the computational graph formed by the remaining operators in the same original subgraph, each input operator of that computational graph is used as a starting point, each original subgraph in the original subgraph set is traversed in the forward direction, and reaching another input operator serves as the termination condition of the traversal; iterative pruning is stopped when the subgraph formed by the forwardly traversed operators is a pruning result subgraph;
according to the directed edges between operators in the original subgraph, in the calculation graph formed by the remaining operators in the same original subgraph, each output operator of the calculation graph is used as a starting point, the original subgraphs in the original subgraph set for which no target subgraph has been obtained are traversed in reverse, and reaching another output operator serves as the termination condition of the traversal; iterative pruning is stopped when the subgraph formed by the reversely traversed operators is a target subgraph.
In this case, the general processor performs forward traversal on the input operators first and then performs reverse traversal on the output operators, so as to implement iterative pruning of the computation graph formed by the remaining operators in the original subgraph.
In another possible implementation, in the case that the computational graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators, pruning the computational graph formed by the remaining operators in the original subgraph includes:
the general processor, according to the directed edges between the operators in the computation graph formed by the remaining operators in the original subgraph, uses each output operator of that computation graph as a starting point, reversely traverses each original subgraph in the original subgraph set, and takes reaching another output operator as the termination condition of the traversal; iterative pruning is stopped when the subgraph formed by the reversely traversed operators is a pruning result subgraph;
according to the directed edges between operators in the computational graph formed by the remaining operators in the original subgraph, each input operator of that computational graph is used as a starting point, the original subgraphs in the original subgraph set for which no pruning result subgraph has been obtained are traversed in the forward direction, and reaching another input operator serves as the termination condition of the traversal; iterative pruning is stopped when the subgraph formed by the forwardly traversed operators is a target subgraph.
In this case, the general processor performs reverse traversal on the plurality of output operators first and then performs forward traversal on the plurality of input operators, so that iterative pruning of the computation graph formed by the remaining operators in the original subgraph can be realized.
In the embodiment of the present application, for the implementation of the forward traversal of the multiple input operators and the reverse traversal of the multiple output operators, reference is made to the foregoing description, and redundant description is omitted here.
For example, suppose the original subgraph contains a plurality of input operators and a plurality of output operators. When the computer device determines that the original subgraph meets a pruning condition, the computer device prunes the original subgraph.
In one possible implementation, the computer device performs reverse traversal along the different output operators to obtain the sub-graph traversed from each of the plurality of output operators; then determines, from the original subgraph and those sub-graphs, the input operators that have not been traversed; and then performs forward traversal along those different input operators to obtain the sub-graph traversed from each of them. If a traversed sub-graph satisfies at least one of the pruning conditions, iterative pruning is performed on it; if it does not satisfy any pruning condition, it is not pruned.
It can be appreciated that the sub-graph traversed from each of the plurality of output operators and the sub-graph traversed from each of the non-traversed input operators are both parts of the original sub-graph. Further, superposing the sub-graphs traversed from the plurality of output operators and the sub-graphs traversed from the non-traversed input operators recovers the calculation graph formed by the remaining operators in the original subgraph.
In another possible implementation, the computer device performs forward traversal along the different input operators to obtain the sub-graph traversed from each of the plurality of input operators; then determines, from the original subgraph and those sub-graphs, the output operators that have not been traversed; and then performs reverse traversal along those different output operators to obtain the sub-graph traversed from each of them. If a sub-graph satisfies at least one of the pruning conditions, iterative pruning is performed on it; if it does not satisfy any pruning condition, it is not pruned.
It can be appreciated that the sub-graph traversed from each of the plurality of input operators and the sub-graph traversed from each of the non-traversed output operators are both parts of the original sub-graph. Further, superposing the sub-graphs traversed from the plurality of input operators and the sub-graphs traversed from the non-traversed output operators recovers the original subgraph.
In practical application, when an output operator is reversely traversed and the traversed operator is also an output operator, the traversal is cut off, yielding a subgraph with that output operator as the reverse traversal starting point. Likewise, when an input operator is forwardly traversed and the traversed operator is also an input operator, the traversal is stopped, yielding a subgraph with that input operator as the forward traversal starting point.
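For the third case, the two traversal passes can be combined into one iterative procedure. The sketch below reuses the illustrative helpers above; the function iterative_prune, its recursion and its termination test are assumptions made for explanation, not the claimed method. It traverses reversely along the output operators when there are several of them, then forwardly along any input operators that were not reached, and repeats the process on every resulting partial sub-graph until each part has at most one input operator and one output operator:

def iterative_prune(graph, remaining):
    ins = input_ops(graph, remaining)
    outs = output_ops(graph, remaining)
    if len(ins) <= 1 and len(outs) <= 1:
        return [remaining]                 # already in target sub-graph form
    parts, visited = [], set()
    if len(outs) > 1:                      # reverse traversal along each output operator
        for out in outs:
            sub = reverse_traverse(graph, remaining, out, outs)
            parts.append(sub)
            visited |= sub
    if len(ins) > 1:                       # forward traversal along input operators not yet reached
        for inp in ins - visited:
            sub = forward_traverse(graph, remaining, inp, ins)
            parts.append(sub)
            visited |= sub
    result = []
    for sub in parts:                      # iterate on every partial sub-graph
        result.extend(iterative_prune(graph, sub))
    return result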
It can be understood that, when the computer device prunes the calculation graph formed by the remaining operators in the original subgraph, a pruned subgraph set can be obtained. For example, when the computation graph shown in FIG. 4 is pruned by the computer device, a subgraph set as shown in FIG. 6E may be obtained. Here, each subgraph in the subgraph set contains one input operator and one output operator. Specifically, subgraph 1 (NNP1), subgraph 2 (NNP2, NNP3 and NNP4), subgraph 3 (NNP5 and NNP6), and subgraph 4 (NNP7) included in the subgraph set are used to construct the target subgraphs.
In the embodiment of the application, the original calculation graph contains the first type of operators and the second type of operators; in this case, the operation instructions corresponding to the calculation graph formed by the second type of operators run on the general processor, and the operation instructions corresponding to the pruning result subgraph are sent to the artificial intelligence processor.
In an embodiment of the present application, when the computer device does not prune the original calculation graph (fig. 4), the specific implementation may be as shown in fig. 6F. When the artificial intelligence processor in the heterogeneous system runs the operation instructions corresponding to the first type of operators and the general processor runs the operation instructions corresponding to the second type of operators, the computation graph corresponding to the first type of operators is not optimized, and the parallelism is poor from the viewpoint of the scheduling layer.
In the embodiment of the present application, as shown in fig. 6E, after the computer device prunes the original calculation graph (fig. 4), three partial subgraphs may be obtained: the first partial subgraph includes NNP1; the second partial subgraph includes NNP2, NNP3, NNP4, NNP5, NNP6 and the CPU operator; and the third partial subgraph includes NNP7. From the viewpoint of scheduling, compared with the case of not pruning, this has better parallelism, and the parts of the heterogeneous system can compute at the same time.
Step 3): and compiling the target subgraph by the general processor to obtain an operation instruction corresponding to the fusion operator.
With this technical scheme, the general processor fuses the operators in the original calculation graph that can run on the artificial intelligence processor into one operator, namely a fusion operator, and stores the operation instructions of the fusion operator. When the corresponding machine learning task is executed later, the stored operation instructions of the fusion operator are directly reused, which reduces the number of kernel launches and memory accesses of the artificial intelligence processor, avoids repeated compilation, and greatly accelerates inference. Meanwhile, the computer device prunes the calculation graph corresponding to the operators running on the artificial intelligence processor, so that each pruned subgraph contains one input operator and one output operator; this avoids dependency between a subgraph running on the artificial intelligence processor and a subgraph running on the general processor, and can improve the efficiency with which the heterogeneous system executes neural network computation tasks in parallel.
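As a purely illustrative sketch of the reuse idea (the cache, the key and the compile callback below are assumptions and not the patented interface), the operation instructions obtained by compiling a fused target subgraph once can be stored and looked up on later executions instead of being recompiled:

# Illustrative only: reuse stored operation instructions of a fusion operator.
_instruction_cache = {}

def get_fusion_instructions(subgraph_key, compile_fn):
    # subgraph_key: any hashable fingerprint of the fused target subgraph
    # compile_fn: callable that compiles the subgraph into operation instructions
    if subgraph_key not in _instruction_cache:
        _instruction_cache[subgraph_key] = compile_fn()   # compile only once
    return _instruction_cache[subgraph_key]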
It should be noted that, for simplicity of description, the foregoing method embodiments are all depicted as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
It should be further noted that, although the respective steps in the flowcharts of fig. 3A and 3B are sequentially shown as indicated by arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 3A and 3B may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes the methods of the embodiments of the present application in detail. To facilitate better implementation of the above aspects of the embodiments of the present application, related devices for implementing the above aspects are correspondingly provided below.
Fig. 7 is a schematic diagram of a computer device according to an embodiment of the present application. The computer device includes a processor and a memory that are connected to each other, the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the methods shown in fig. 3A and 3B.
It should be further noted that the present application also provides a computer storage medium for storing computer software instructions used by the above computer device, which include a program for executing the method embodiments shown in fig. 3A and fig. 3B. By executing the stored program, pruning of the calculation graph of the neural network model can be realized, and the efficiency with which the heterogeneous system executes neural network computation tasks in parallel is improved.
From the above, the neural network pruning method, device, computer equipment and storage medium provided by the embodiment of the application can avoid the dependency relationship between the subgraph running on the artificial intelligent processor and the subgraph running on the general processor, and can improve the efficiency of parallel execution of the neural network calculation task by the heterogeneous system.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is merely intended to help understand the method of the present disclosure and its core ideas. Meanwhile, a person skilled in the art may, based on the ideas of the present disclosure, make changes to the specific implementations and the application scope, and such changes all fall within the scope of protection of the present disclosure. In view of the foregoing, this description should not be construed as limiting the disclosure.

Claims (9)

1. A method for executing a computational graph, comprising:
when the general processor compiles a calculation graph containing a fusion operator, a binary instruction to be executed by the artificial intelligence processor corresponding to the calculation graph is obtained according to the operation instruction of the fusion operator; the fusion operator is an operator obtained by the general processor fusing operators in the original calculation graph that can run on the artificial intelligence processor; the step of obtaining the operation instruction of the fusion operator comprises the following steps:
the general processor divides an operator for the first time according to the execution equipment of the operator in the original calculation graph to obtain an original subgraph; wherein the execution device comprises a general purpose processor and an artificial intelligence processor;
the general processor checks operators in the original subgraph according to rules of the operators in a learning library of the artificial intelligence processor, and divides the original subgraph for the second time according to checking results to obtain a target subgraph;
the general processor compiles the target subgraph to obtain an operation instruction corresponding to the fusion operator; wherein the step of obtaining the original subgraph comprises:
acquiring the original calculation graph, and determining a first type operator from the original calculation graph; wherein, the operation instruction corresponding to the first type operator can run on the artificial intelligent processor;
Obtaining a calculation graph formed by the first type of operators according to directed edges between operators in the original calculation graph, and extracting an original subgraph from the calculation graph formed by the first type of operators; wherein the original subgraph comprises a plurality of input operators and/or a plurality of output operators; all the original subgraphs form an original subgraph set;
the step of obtaining the target subgraph comprises the following steps:
the general processor checks operators in the original subgraph according to rules of the operators in a learning library of the artificial intelligence processor to obtain an inspection result;
deleting operators which do not pass the inspection in the original subgraph by utilizing the inspection result, pruning a calculation graph formed by the rest operators in the original subgraph to obtain a corresponding target subgraph; wherein the target subgraph comprises an input operator and an output operator.
2. The method of claim 1, wherein pruning the computational graph of remaining operators in the original subgraph comprises:
and in the case that the computational graph formed by the remaining operators in the original subgraph satisfies at least one of: containing one input operator and a plurality of output operators; containing a plurality of input operators and one output operator; or containing a plurality of input operators and a plurality of output operators, carrying out iterative pruning on the computational graph formed by the remaining operators in the original subgraph, so as to obtain a target subgraph.
3. The method according to claim 2, wherein pruning the computational graph of remaining operators in the original subgraph in case that the computational graph of remaining operators in the original subgraph contains one input operator and a plurality of output operators, comprises:
according to the directed edges between operators in the computational graph formed by the residual operators in the original subgraph, in the computational graph formed by the residual operators in the same original subgraph, the output operators of the computational graph formed by the residual operators in the original subgraph are used as starting points, the computational graph formed by the residual operators in the corresponding original subgraph is traversed reversely, and other output operators are traversed to serve as traversing termination conditions; and stopping iterative pruning under the condition that the computational graph formed by the operators traversed in the reverse direction is a target subgraph.
4. The method according to claim 2, wherein pruning the computational graph of remaining operators in the original subgraph in case that the computational graph of remaining operators in the original subgraph contains a plurality of input operators and one output operator, comprises:
according to the directed edges between operators in the computation graph formed by the residual operators in the original subgraph, in the computation graph formed by the residual operators in the same original subgraph, the input operators of the computation graph formed by the residual operators in the original subgraph are taken as starting points, the computation graph formed by the residual operators in the corresponding original subgraph is traversed in the forward direction, and other input operators are traversed to be taken as traversing termination conditions; and stopping iterative pruning under the condition that a calculation graph formed by operators traversed in the forward direction is a target subgraph.
5. The method according to claim 2, wherein, in the case that the computational graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators, pruning the computational graph formed by the remaining operators in the original subgraph comprises:
according to directed edges between operators in a computational graph formed by the residual operators in the original subgraph, in the computational graph formed by the residual operators in the same original subgraph, taking an input operator of the computational graph formed by the residual operators in the original subgraph as a starting point, traversing each original subgraph in the original subgraph set in a forward direction, and traversing to other input operators as traversing termination conditions; stopping iterative pruning under the condition that a calculation graph formed by operators traversed in the forward direction is a target subgraph;
according to directed edges between operators in a computational graph formed by the residual operators in the original subgraph, in the computational graph formed by the residual operators in the same original subgraph, taking an output operator of the computational graph formed by the residual operators in the original subgraph as a starting point, traversing the original subgraph which does not obtain a target subgraph in the original subgraph set reversely, and traversing to other output operators as traversing termination conditions; and stopping iterative pruning under the condition that the computational graph formed by the operators traversed in the reverse direction is a target subgraph.
6. The method according to claim 2, wherein, in the case that the computational graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators, pruning the computational graph formed by the remaining operators in the original subgraph comprises:
according to directed edges between operators in a computational graph formed by the residual operators in the original subgraph, in the computational graph formed by the residual operators in the same original subgraph, taking an output operator of the computational graph formed by the residual operators in the original subgraph as a starting point, traversing each original subgraph in the original subgraph set reversely, and traversing to other output operators as traversing termination conditions; stopping iterative pruning under the condition that a calculation graph formed by operators traversed in the reverse direction is a target subgraph;
according to directed edges between operators in a computational graph formed by the residual operators in the original subgraph, in the computational graph formed by the residual operators in the same original subgraph, using an input operator of the computational graph formed by the residual operators in the original subgraph as a starting point, traversing the original subgraph without obtaining a target subgraph in the original subgraph set in the forward direction, and traversing to other input operators as traversing termination conditions; and stopping iterative pruning under the condition that a calculation graph formed by operators traversed in the forward direction is a target subgraph.
7. The method of claim 1, wherein the step of obtaining the original subgraph further comprises:
determining a second type of operator from the original computational graph; the operation instruction corresponding to the second type operator can run on the general processor;
and obtaining the computation graph formed by the second type of operators according to the directed edges between the operators in the original computation graph.
8. A computer device comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is adapted to store a computer program, the computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-7.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-7.
CN201911230228.9A 2019-12-04 2019-12-04 Calculation map execution method, computer device, and storage medium Active CN111160551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911230228.9A CN111160551B (en) 2019-12-04 2019-12-04 Calculation map execution method, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN111160551A CN111160551A (en) 2020-05-15
CN111160551B true CN111160551B (en) 2023-09-29

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894051A (en) * 2010-07-29 2010-11-24 中国科学技术大学 Primary and secondary data structure-based CPU-GPU cooperative computing method
CN106250563A (en) * 2016-08-30 2016-12-21 江苏名通信息科技有限公司 K bisimulation computational algorithm based on GPS platform
CN108292374A (en) * 2015-11-09 2018-07-17 谷歌有限责任公司 Training is expressed as the neural network of calculating figure
CN110490309A (en) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 A kind of Operator Fusion method and its Related product for neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9164735B2 (en) * 2012-09-27 2015-10-20 Intel Corporation Enabling polymorphic objects across devices in a heterogeneous platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chao Wang et al. Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges. arXiv:1712.04771, 2017, pp. 1-25. *
Yang Jiangping. Research on computation and deployment optimization of DNN models based on the BWDSP platform. China Master's Theses Full-text Database, Information Science and Technology, 2019, I138-1128. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant