CN111160551A - Computation graph execution method, computer device, and storage medium


Info

Publication number
CN111160551A
Authority
CN
China
Prior art keywords
operators
subgraph
original
operator
graph
Prior art date
Legal status
Granted
Application number
CN201911230228.9A
Other languages
Chinese (zh)
Other versions
CN111160551B (en)
Inventor
Inventor not disclosed
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN201911230228.9A
Publication of CN111160551A
Application granted
Publication of CN111160551B
Legal status: Active
Anticipated expiration

Classifications

    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 3/045: Combinations of networks
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

Embodiments of the present application disclose a computation graph execution method, a computer device, and a storage medium. The computation graph execution method includes: when a general-purpose processor compiles a computation graph containing a fusion operator, binary instructions to be executed by an artificial intelligence processor and corresponding to the computation graph are obtained according to the operation instructions of the fusion operator.

Description

Computation graph execution method, computer device, and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a computation graph execution method, a computer device, and a storage medium.
Background
In the deep learning ecosystem, the deep learning framework is the layer that directly faces applications. Caffe takes the Layer as the basic element for constructing a convolutional neural network; later deep learning frameworks such as TensorFlow and MXNet use a different name, the Operator, but its core concept is similar to Caffe's Layer: neural network computation is further split into various common operators oriented to tensor data, and users build neural network models by combining operators and data.
The above is the interface design idea with which the deep learning framework faces upper-layer applications; it is also the idea with which the framework interfaces downward. Before neural network processors and dedicated accelerators emerged, the mainstream computing devices in the deep learning field were the CPU and the GPU, and the deep learning framework had to turn the deep learning task, expressed by the upper-layer application as a graph structure, into instructions and data that can be executed on the CPU or GPU. In this process, the deep learning framework uses operators as the concrete elements for carrying out the computational task: it provides a kernel function (Kernel), executed on the CPU or GPU, for each operator used as a building block of the network, and, according to the abstract computation graph given by the upper-layer application, schedules the kernel function corresponding to each operator in the graph, thereby completing the computation of the entire neural network.
The simplest way to map operators to a concrete implementation on a device is through a programming language. This approach offers high flexibility: for a CPU, or for a GPU with the CUDA programming architecture, the algorithm developer can implement a specific operator directly in a programming language. On the other hand, such an implementation may not fully exploit the performance of the hardware device, so efficiency is not maximized.
On this basis, a further approach is to use a high-performance computation library to map the abstract operators in the computation graph to concrete kernel functions; a typical example is cuDNN, the operator computation library for GPUs developed by Nvidia. cuDNN directly provides users with kernel functions executed on the GPU at operator granularity; when realizing the computation graph, the deep learning framework maps each abstract operator in the graph to the corresponding kernel function in the computation library instead of developing kernel functions with CUDA. In this case, the deep learning framework interfaces directly with the computation library provided by the underlying device manufacturer, and cuDNN tends to outperform kernel functions implemented by ordinary developers using CUDA. In fact, for operators already implemented in a computation library, the deep learning framework tends to call the library directly: on one hand this relieves the burden on the framework itself, and on the other hand it does yield better performance.

A high-performance computation library developed by the underlying device manufacturer can, in theory, push a single operator to its performance limit on a specific device, but single-operator optimization has an upper bound. Optimizing an operator implementation mainly adjusts the relationship between computation and memory access, so as to reduce unnecessary memory accesses as far as possible and avoid leaving compute units idle. If the compute units were fully loaded throughout the execution of the whole task, the hardware efficiency of the deep learning computation would reach 100%, which is the ideal target of software and hardware optimization. However, even if the computation inside an operator is optimized to the utmost, the compute units cannot stay fully loaded all the time, because the optimization is ultimately limited by the gaps between operators. Whether an operator's kernel function on the CPU or GPU is implemented by hand or by calling a computation library, it follows the pattern "off-chip storage → on-chip computation → off-chip storage": the operator's input and output data reside in global storage, the kernel function must read its inputs from global storage to perform the computation, and the result is stored back to global storage. This brings two problems: first, each operator's accesses to its input and output data cannot be removed by optimization inside the operator; second, each operator incurs startup overhead, which is especially significant for heterogeneous computing devices other than general-purpose processors.
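To make the contrast concrete, the following minimal Python sketch (not any framework's actual scheduler; the Op class and both run functions are purely illustrative assumptions) shows why per-operator execution pays one launch and one global-memory round trip per operator, while a fused execution conceptually pays a single launch and keeps intermediate results on chip.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Op:
    name: str
    inputs: List[str]
    output: str
    fn: Callable  # stands in for the operator's kernel function

def run_op_by_op(ops: List[Op], global_mem: Dict[str, float]) -> None:
    """Per-operator execution: every operator pays its own launch and moves its
    inputs/outputs through global (off-chip) memory."""
    for op in ops:                                  # one "kernel launch" per operator
        args = [global_mem[x] for x in op.inputs]   # off-chip -> on-chip load
        global_mem[op.output] = op.fn(*args)        # on-chip compute, store back off-chip

def run_fused(ops: List[Op], global_mem: Dict[str, float]) -> None:
    """Fused execution: conceptually a single launch; intermediate results stay
    'on chip' and only the final output is written back to global memory."""
    local: Dict[str, float] = {}
    for op in ops:
        args = [local[x] if x in local else global_mem[x] for x in op.inputs]
        local[op.output] = op.fn(*args)
    global_mem[ops[-1].output] = local[ops[-1].output]

# Usage: y = relu(x * w) expressed as two operators
mem = {"x": 2.0, "w": 3.0}
ops = [Op("mul", ["x", "w"], "t", lambda a, b: a * b),
       Op("relu", ["t"], "y", lambda a: max(a, 0.0))]
run_op_by_op(ops, dict(mem))   # two launches, intermediate "t" goes through global memory
run_fused(ops, mem)            # one launch, "t" never leaves the chip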
Disclosure of Invention
The embodiment of the application provides a computation graph execution method, computer equipment and a storage medium, which can solve the IO bottleneck when an artificial intelligence processor executes a learning task.
In order to solve the above problem, the present application provides a computation graph executing method, including:
when the general processor compiles an original calculation graph with a fusion operator, a binary instruction executed by an artificial intelligence processor corresponding to the original calculation graph is obtained according to an operation instruction of the fusion operator; wherein the operation instruction obtaining step of the fusion operator comprises:
the general processor divides an operator for the first time according to execution equipment of the operator in the original calculation graph to obtain an original subgraph; wherein the execution device comprises a general purpose processor and an artificial intelligence processor;
the general processor checks the operators in the original subgraph according to the rules of the operators in the learning library of the artificial intelligence processor, and divides the original subgraph for the second time according to the check result to obtain a target subgraph;
and compiling the target subgraph by the general processor to obtain an operation instruction corresponding to the fusion operator.
Optionally, the step of obtaining the original subgraph comprises:
acquiring the original computation graph, and determining first-type operators from the original computation graph; wherein operation instructions corresponding to the first-type operators can run on the artificial intelligence processor;
acquiring a calculation graph formed by the first type of operators according to directed edges among operators in the original calculation graph, and extracting an original subgraph from the calculation graph formed by the first type of operators; wherein the original subgraph contains a plurality of input operators and/or a plurality of output operators; all original subgraphs constitute an original subgraph set.
Optionally, the step of obtaining the target subgraph includes:
the general processor checks the operator in the original subgraph according to the rule of the operator in the learning library of the artificial intelligence processor to obtain a check result;
deleting the operators which do not pass the inspection in the original subgraph by using the inspection result, and pruning the calculation graph formed by the rest operators in the original subgraph to obtain a corresponding target subgraph; wherein the target subgraph contains an input operator and an output operator.
Optionally, the pruning the computation graph formed by the remaining operators in the original subgraph includes:
in a case that the computation graph formed by the remaining operators in the original subgraph includes at least one of the following: one input operator and a plurality of output operators; a plurality of input operators and one output operator; or a plurality of input operators and a plurality of output operators, performing iterative pruning on the computation graph formed by the remaining operators in the original subgraph to obtain the target subgraph.
Optionally, in a case that a computation graph formed by remaining operators in the original subgraph includes an input operator and a plurality of output operators, pruning the computation graph formed by the remaining operators in the original subgraph, including:
according to directed edges among operators in a computational graph formed by the rest operators in the original subgraph, in the computational graph formed by the rest operators in the same original subgraph, taking the output operator of the computational graph formed by the rest operators in the original subgraph as a starting point, reversely traversing the computational graph formed by the rest operators in the corresponding original subgraph, and traversing to other output operators as traversal termination conditions; and stopping iterative pruning under the condition that the calculation graph formed by the reversely traversed operators is a target subgraph.
Optionally, in a case that a computation graph formed by remaining operators in the original subgraph includes a plurality of input operators and one output operator, pruning the computation graph formed by the remaining operators in the original subgraph, including:
according to directed edges among operators in a calculation graph formed by the rest operators in the original subgraph, in the calculation graph formed by the rest operators in the same original subgraph, taking an input operator of the calculation graph formed by the rest operators in the original subgraph as a starting point, traversing the calculation graph formed by the rest operators in the corresponding original subgraph in a forward direction, and traversing to other input operators as traversal termination conditions; and stopping iterative pruning under the condition that the calculation graph formed by the forward traversed operators is a target subgraph.
Optionally, in a case that a computation graph formed by remaining operators in the original subgraph includes a plurality of input operators and a plurality of output operators, pruning the computation graph formed by the remaining operators in the original subgraph, including:
according to directed edges among operators in a computational graph formed by the rest operators in the original subgraph, in the computational graph formed by the rest operators in the same original subgraph, taking an input operator of the computational graph formed by the rest operators in the original subgraph as a starting point, traversing each original subgraph in the original subgraph set in a forward direction, and traversing to other input operators as traversal termination conditions; stopping iterative pruning under the condition that a calculation graph formed by operators traversed in the forward direction is a target subgraph;
according to directed edges among operators in a computational graph formed by the rest operators in the original subgraph, in the computational graph formed by the rest operators in the same original subgraph, taking an output operator of the computational graph formed by the rest operators in the original subgraph as a starting point, reversely traversing the original subgraph which does not obtain a target subgraph in the original subgraph set, and traversing to other output operators as traversal termination conditions; and stopping iterative pruning under the condition that the calculation graph formed by the reversely traversed operators is a target subgraph.
Optionally, in a case that a computation graph formed by remaining operators in the original subgraph includes a plurality of input operators and a plurality of output operators, pruning the computation graph formed by the remaining operators in the original subgraph, including:
according to directed edges among operators in a computational graph formed by the rest operators in the original subgraph, in the computational graph formed by the rest operators in the same original subgraph, taking an output operator of the computational graph formed by the rest operators in the original subgraph as a starting point, reversely traversing each original subgraph in the original subgraph set, and traversing to other output operators as traversal termination conditions; stopping iterative pruning under the condition that a computational graph formed by operators traversed reversely is a target subgraph;
according to directed edges among operators in a computational graph formed by the rest operators in the original subgraph, in the computational graph formed by the rest operators in the same original subgraph, taking an input operator of the computational graph formed by the rest operators in the original subgraph as a starting point, traversing the original subgraph which does not obtain a target subgraph in the original subgraph set in a forward direction, and traversing to other input operators as traversal termination conditions; and stopping iterative pruning under the condition that the calculation graph formed by the forward traversed operators is a target subgraph.
Optionally, the step of obtaining the original subgraph further comprises:
determining a second type of operator from the original computational graph; the operation instruction corresponding to the second type of operator can be operated on the general-purpose processor;
and acquiring the computational graph formed by the second type of operators according to directed edges among the operators in the original computational graph.
In order to solve the above problem, the present application provides a computer device, including a processor and a memory, the processor and the memory being connected to each other, wherein the memory is used for storing a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the above method.
To solve the above problem, the present application proposes a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the above-mentioned method.
With this technical solution, the general-purpose processor fuses the operators in the original computation graph that can run on the artificial intelligence processor into a single operator, referred to as a fusion operator, and stores the fusion operator's operation instructions. When the corresponding machine learning task is executed later, the stored operation instructions of the fusion operator are reused directly, which reduces the number of startups and memory accesses of the artificial intelligence processor's compute cores, avoids repeated compilation, and greatly speeds up inference. Meanwhile, the computer device prunes the computation graph running on the artificial intelligence processor so that each pruned subgraph contains one input operator and one output operator; this avoids dependencies between the subgraphs running on the artificial intelligence processor and the subgraphs running on the general-purpose processor, and improves the efficiency with which the heterogeneous system executes neural network computation tasks in parallel.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below.
FIG. 1 is a diagram illustrating a software stack of an artificial intelligence processor according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure;
fig. 3A is a schematic flowchart of a method for executing a computation graph according to an embodiment of the present application;
fig. 3B is a flowchart for acquiring an operation instruction of a fusion operator in the computational graph execution method according to the embodiment of the present application;
fig. 4 is a schematic structural diagram of a computational graph of a neural network model provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a convex sub-diagram and a non-convex sub-diagram provided in an embodiment of the present application;
fig. 6A is a schematic structural diagram of an original subgraph extracted from a computation graph corresponding to a second class of operators according to the embodiment of the present application;
fig. 6B is a schematic structural diagram of another original subgraph extracted from the computation graph corresponding to the second class of operators according to the embodiment of the present application;
fig. 6C is a schematic structural diagram of a pruned subgraph provided in an embodiment of the present application;
fig. 6D is a schematic structural diagram of another pruned subgraph provided in the embodiment of the present application;
fig. 6E is a schematic structural diagram of a sub-graph after pruning a computation graph of a neural network model according to an embodiment of the present application;
fig. 6F is a schematic structural diagram of a sub-graph in which a computation graph of a neural network model is not pruned according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computation graph execution apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In order to better understand the technical solutions described in the present application, the following first explains the technical terms related to the embodiments of the present application:
(1) calculation chart
A computation graph is a way to describe a computation process using a graph structure. If the computation is significantly modular and there are clear temporal and logical dependencies between modules, it can generally be described with a directed graph structure. A graph structure has two basic elements: nodes and directed edges. In practical applications, a neural network is abstracted into a directed graph structure composed of tensor data and operators, where the nodes are also called operators.
Generally, describing the neural network model with a computation graph helps in grasping the whole neural network computation task as a whole, and this representation is also convenient for scheduling and parallel execution of the computation tasks.
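As an illustration of this representation, the sketch below models a computation graph as operator nodes joined by directed edges; the class and field names are our own illustrative assumptions, not taken from the patent or from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class OperatorNode:
    name: str                                          # e.g. "conv1", "relu1"
    op_type: str                                       # e.g. "Convolution", "ReLU"
    inputs: List[str] = field(default_factory=list)    # names of input tensors
    outputs: List[str] = field(default_factory=list)   # names of output tensors

@dataclass
class ComputationGraph:
    nodes: Dict[str, OperatorNode] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (src, dst): a tensor flows src -> dst

    def add_node(self, node: OperatorNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

# Usage: y = relu(conv(x, w)) as a two-node directed graph
g = ComputationGraph()
g.add_node(OperatorNode("conv1", "Convolution", ["x", "w"], ["t"]))
g.add_node(OperatorNode("relu1", "ReLU", ["t"], ["y"]))
g.add_edge("conv1", "relu1")
```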
(2) Software stack of artificial intelligence processor:
referring to FIG. 1, the software stack 10 includes an artificial intelligence application 100, an artificial intelligence framework 102, an artificial intelligence learning library 104, an artificial intelligence runtime library 106, and a driver 108. This is explained in detail below:
the artificial intelligence application 100 provides corresponding artificial intelligence algorithm models corresponding to different application scenarios. The algorithm model can be directly analyzed by a programming interface of the artificial intelligence framework 102, in one possible implementation manner, the neural network model is converted into a binary instruction through the artificial intelligence learning library 104, the binary instruction is converted into an artificial intelligence learning task by calling the artificial intelligence runtime library 106, the artificial intelligence learning task is placed in a task queue, and the artificial intelligence learning task in the task queue is scheduled by the driver 108 to be executed by a bottom artificial intelligence processor.
(3) Subgraph extraction
The heterogeneous system includes an artificial intelligence processor and a general-purpose processor. In practical applications, the general-purpose processor can compile the neural network model to generate the corresponding machine learning task binary instructions, and these binary instructions run on the artificial intelligence processor. Thus, the deep learning framework (e.g., Caffe) first needs to extract, from the complete neural network computation graph, the specific subgraphs whose operators are all placed on the artificial intelligence processor for execution. Such a subgraph is further compiled and optimized by the software stack of the artificial intelligence processor to obtain a fused kernel function corresponding to the whole subgraph.
In the embodiment of the present application, in the process of extracting subgraphs, it must first be ensured that no cycle is introduced into the original computation graph after a subgraph in the computation graph is fused into a single node. The reason is that cycles cause the operators in the computation graph to depend on each other topologically.
(4) Dependency relationships
In the embodiment of the present application, operator A depending on operator B means that operator A must wait for the kernel function corresponding to operator B to finish executing before starting its own computing task. Further, if operator B is contained in a subgraph S due to subgraph fusion, operator A must wait until the computation tasks of all operators in S have finished executing before it can start executing its own kernel function.
(5) Deep learning framework
As the name implies, the deep learning framework is a framework for deep learning. Specifically, as shown in fig. 1, the deep learning framework is the first layer in the software stack of the artificial intelligence processor and is used to connect deep learning applications with deep learning computing platforms of various underlying forms.
In the prior art, a deep learning framework generally adopts the computation graph as the main data structure for describing a neural network model, and on that basis completes the mapping from the computation graph to the underlying kernel functions at the granularity of a single operator or across operators. Meanwhile, the deep learning framework may implement the concrete kernel functions either by using a programming language directly or by calling an underlying computation library.
In the embodiment of the present application, the deep learning framework may include, but is not limited to: Google's TensorFlow, the convolutional neural network framework Caffe (Convolutional Architecture for Fast Feature Embedding), MXNet, Torch, and so on.
Taking Caffe as an example, Caffe supports various types of deep learning architectures oriented toward image classification and image segmentation, and can support Convolutional Neural Networks (CNNs), Region-based CNNs (R-CNNs) for object detection, Long Short-Term Memory networks (LSTMs), and fully connected neural network designs.
In the embodiment of the present application, the Caffe framework may support multiple types of basic operators. Specifically, the multiple types of basic operators referred to herein may include common neural network operators, for example: convolution/deconvolution operators, pooling operators, activation operators, softmax (classifier) operators, and fully connected operators. The activation operators may include, but are not limited to, ReLU, Sigmoid, Tanh, and other operators that can be implemented in an interpolated manner.
In the embodiment of the present application, the functions under the Caffe framework may include: the Caffe Blob function, the Caffe Layer function, and the Caffe Net function. Blob is used to store, exchange and process the data and derivative information of forward and backward iterations in the network; Layer is used to perform computation, which may include non-linear operations such as convolution (convolve), pooling (pool), inner product (inner product), rectified-linear (ReLU) and sigmoid, as well as element-wise data transformations, normalization (normalize), data loading (load data), and loss computations (losses) such as softmax and hinge.
In a specific implementation, each Layer defines three important operations: initialization setup (setup), forward propagation (forward), and backward propagation (backward). Setup resets the layer and its connections during model initialization; forward receives input data from the bottom layer and, after computation, outputs to the top layer; backward takes the output gradient of the top layer, computes the gradient of its input, and passes it to the bottom layer. For example, the Layers may include Data Layers, Convolution Layers, Pooling Layers, InnerProduct Layers, ReLU Layers, Sigmoid Layers, LRN Layers, Dropout Layers, SoftmaxWithLoss Layers, Softmax Layers, Accuracy Layers, and so on. A Net starts with a data layer, which loads data from disk, and ends with a loss layer, which computes the objective function for tasks such as classification and reconstruction. Specifically, a Net is a directed acyclic computation graph composed of a series of Layers, and Caffe preserves all intermediate values in the computation graph to ensure the accuracy of forward and backward iterations.
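As a rough illustration of these three operations, the following Python skeleton mirrors setup/forward/backward for an inner-product layer. Real Caffe layers are C++ classes, so this is only a conceptual sketch and none of the names below come from Caffe's actual API.

```python
class InnerProductLayerSketch:
    def setup(self, bottom_size: int, num_output: int) -> None:
        """Initialize the layer and its connections during model initialization."""
        self.num_output = num_output
        self.weights = [[0.01] * bottom_size for _ in range(num_output)]

    def forward(self, bottom: list) -> list:
        """Receive input from the bottom blob, compute, and output to the top blob."""
        self.bottom = bottom
        return [sum(w * x for w, x in zip(row, bottom)) for row in self.weights]

    def backward(self, top_grad: list) -> list:
        """Given the gradient w.r.t. the top blob, compute the gradient w.r.t. the bottom."""
        return [sum(top_grad[i] * self.weights[i][j] for i in range(self.num_output))
                for j in range(len(self.bottom))]

layer = InnerProductLayerSketch()
layer.setup(bottom_size=3, num_output=2)
top = layer.forward([1.0, 2.0, 3.0])
bottom_grad = layer.backward([1.0, 1.0])
```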
(6) Artificial intelligence processor
An artificial intelligence processor, also referred to as a special purpose processor, in the embodiments of the present application refers to a processor that is specific to a particular application or domain. For example: a Graphics Processing Unit (GPU), also called a display core, a visual processor, and a display chip, is a special processor dedicated to image operation on a personal computer, a workstation, a game machine, and some mobile devices (such as a tablet computer and a smart phone). Another example is: a Neural Network Processor (NPU), which is a special processor for matrix multiplication in the field of artificial intelligence, adopts a structure of data-driven parallel computation, and is particularly good at Processing massive multimedia data such as video and images.
Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 2, the computer device 20 may comprise a general-purpose processor 201, a memory 202, a communication bus 203, a communication interface 204, and at least one artificial intelligence processor 205, where the general-purpose processor 201 and the artificial intelligence processor 205 are connected to the memory 202 and the communication interface 204 via the communication bus 203.
The general-purpose Processor 201 may be a Central Processing Unit (CPU), and the general-purpose Processor 201 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field-Programmable Gate arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. The general purpose processor 201 may be a microprocessor or the general purpose processor 201 may be any conventional processor or the like.
The general purpose processor 201 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the neural network pruning method of the present application may be implemented by integrated logic circuits of hardware in the general-purpose processor 201 or instructions in the form of software.
The Memory 202 may be a Read-Only Memory (ROM), a Random Access Memory (RAM), or other Memory. In this embodiment, the memory 202 is used for storing data and executing a software program corresponding to the method shown in fig. 3A and fig. 3B, for example, a program for pruning original subgraphs meeting pruning conditions in this embodiment so that each pruned subgraph includes an input operator and an output operator, and the like.
Alternatively, in embodiments of the present application, the memory may include a physical device for storing information, typically a medium that digitizes the information and stores it in an electrical, magnetic, or optical manner. The memory according to this embodiment may further include: devices that store information using electrical energy, such as RAM, ROM, etc.; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, bubble memories, usb disks; devices for storing information optically, such as CDs or DVDs. Of course, there are other ways of memory, such as quantum memory, graphene memory, and so forth.
Communication interface 204 enables communication between computer device 20 and other devices or communication networks using transceiver means, such as, but not limited to, transceivers. For example, model files sent by other devices may be received via communication interface 204.
The artificial intelligence processor 205 may be mounted as a coprocessor to a main CPU (host CPU), which assigns tasks to it. In practical applications, the artificial intelligence processor 205 may implement one or more operations. For example, taking a Neural Network Processing Unit (NPU) as an example, the core portion of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract matrix data from the memory 202 and perform multiply-add operations.
Optionally, the artificial intelligence processor 205 may include 8 clusters (clusters), each cluster including 4 artificial intelligence processor cores.
Alternatively, artificial intelligence processor 205 may be a reconfigurable architecture artificial intelligence processor. Here, the reconfigurable architecture means that if a certain artificial intelligent processor can flexibly change its own architecture according to different application requirements by using reusable hardware resources, so as to provide an architecture matching with each specific application requirement, then the artificial intelligent processor is called a reconfigurable computing system, and its architecture is called a reconfigurable architecture.
It should be understood that computer device 20 is only one example provided for the embodiments of the present application and that computer device 20 may have more or fewer components than shown, may combine two or more components, or may have a different configuration implementation of components.
In practice, because the learning library of the artificial intelligence processor does not support all types of operators, in the usual mode of operation the frequent input/output interactions between the general-purpose processor and the artificial intelligence processor, as well as the repeated startups of the artificial intelligence processor's compute cores, consume a lot of time. Moreover, under the conventional dynamic fusion strategy, a number of problems with dynamic fusion show up during debugging. For example: the same neural network is split into a different number of segments when run on different devices; and the same network produces different fusion nodes on each execution, which causes repeated compilation and poor performance when the artificial intelligence processor executes the learning task.
Based on this, with reference to the flowchart of the computation graph execution method shown in fig. 3A, the following specifically describes how the embodiment of the present application solves the IO bottleneck when the artificial intelligence processor executes a learning task. The method may include, but is not limited to, the following steps:
step a: when the general processor compiles the original calculation graph with the fusion operator, the binary instructions executed by the artificial intelligence processor corresponding to the original calculation graph are obtained according to the operation instructions of the fusion operator.
The general processor fuses operators which can run on the artificial intelligence processor in the original calculation graph into one operator, the operator is called as a fusion operator, the operation instruction of the fusion operator is stored, and when a corresponding machine learning task is executed later, the stored operation instruction of the fusion operator is directly reused, so that the starting times and the memory access times of an operation core of the artificial intelligence processor are reduced, repeated compiling is avoided, and the reasoning speed is greatly accelerated.
As shown in fig. 3B, the operation instruction obtaining step of the fusion operator includes:
step 1): the general processor divides an operator for the first time according to execution equipment of the operator in the original calculation graph to obtain an original subgraph; wherein the execution device comprises a general purpose processor and an artificial intelligence processor.
In this step, the step of obtaining the original subgraph includes: acquiring the original computation graph, and determining first-type operators from the original computation graph, where the operation instructions corresponding to the first-type operators can run on the artificial intelligence processor; and acquiring the computation graph formed by the first-type operators according to the directed edges among operators in the original computation graph, and extracting original subgraphs from the computation graph formed by the first-type operators, where an original subgraph may contain a plurality of input operators and/or a plurality of output operators, and all original subgraphs constitute an original subgraph set.
In the embodiment of the present application, a first-type operator is an operator that can run on the artificial intelligence processor. For example, the first-type operators may include the meta-operators supported by the artificial intelligence processor. Specifically, in the embodiment of the present application, the meta-operators may include, but are not limited to: convolution/deconvolution operators, pooling operators, activation operators, Local Response Normalization (LRN)/batch normalization operators, classifier (Softmax) operators, fully connected operators, and the like. The activation operators may include, but are not limited to, ReLU, Sigmoid, Tanh, and other operators that can be implemented in an interpolated manner.
In the embodiments of the present application, a second-type operator is an operator that runs on the general-purpose processor. For example, the second-type operators may include newly developed operators that run on the general-purpose processor because, in practical applications, the artificial intelligence learning library in the software stack of the artificial intelligence processor may not yet support the operator, so the artificial intelligence processor cannot obtain the binary instructions corresponding to that operator. As another example, some operators contain no computing logic that can be accelerated in parallel but contain many conditional jumps and other computing logic suited to the characteristics of a general-purpose processor; such operators are run on the general-purpose processor. It can be understood that running the second-type operators on the general-purpose processor can improve the running speed of the neural network model.
In the embodiment of the application, the computer device can obtain a model file of the neural network model, where the model file includes a plurality of operators and the connection relationships among the operators; the computer device may then build the original computation graph of the neural network model from the model file. In practical applications, in one possible implementation, the neural network model includes both first-type and second-type operators, and the constructed original computation graph then includes a computation graph corresponding to the first-type operators and a computation graph corresponding to the second-type operators. In another possible implementation, the neural network model includes only first-type operators, and the constructed original computation graph then includes only the computation graph corresponding to the first-type operators. For example, the original computation graph of the neural network model obtained by the computer device may be as shown in FIG. 4, in which NNP1-NNP7 (NNP: Neural Network Processor) denote operators running on the artificial intelligence processor and CPU (Central Processing Unit) denotes an operator running on the general-purpose processor.
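A minimal sketch of this first partition is given below: operators labeled with their execution device are grouped, and the operators that can run on the artificial intelligence processor are collected into connected original subgraphs by following the directed edges between them. The function and variable names are illustrative assumptions, not the patent's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def first_partition(device_of: Dict[str, str],
                    edges: List[Tuple[str, str]]) -> Tuple[List[Set[str]], Set[str]]:
    """device_of maps operator name -> "NNP" (runs on the AI processor) or "CPU".
    Returns the connected original subgraphs of NNP operators, plus the CPU operators."""
    nnp_ops = {op for op, dev in device_of.items() if dev == "NNP"}
    adj = defaultdict(set)                    # adjacency among NNP operators only,
    for src, dst in edges:                    # treated as undirected for grouping
        if src in nnp_ops and dst in nnp_ops:
            adj[src].add(dst)
            adj[dst].add(src)
    seen: Set[str] = set()
    subgraphs: List[Set[str]] = []
    for op in nnp_ops:
        if op in seen:
            continue
        component, stack = set(), [op]
        while stack:
            cur = stack.pop()
            if cur in component:
                continue
            component.add(cur)
            stack.extend(adj[cur] - component)
        seen |= component
        subgraphs.append(component)
    return subgraphs, {op for op, dev in device_of.items() if dev == "CPU"}

# Usage on a tiny graph in the spirit of Fig. 4 (names are illustrative):
devices = {"NNP1": "NNP", "NNP2": "NNP", "CPU1": "CPU", "NNP3": "NNP"}
edges = [("NNP1", "NNP2"), ("NNP2", "CPU1"), ("CPU1", "NNP3")]
original_subgraphs, cpu_ops = first_partition(devices, edges)
# original_subgraphs contains {"NNP1", "NNP2"} and {"NNP3"}; cpu_ops == {"CPU1"}
```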
In this embodiment of the present application, the number of original subgraphs extracted by a general-purpose processor in a computer device in a computation graph corresponding to a first class operator may be 1, or may be multiple, for example, 4, and so on.
In one possible implementation manner, extracting M original subgraphs from the computation graph corresponding to the first-type operators includes:
extracting M original subgraphs from the computation graph corresponding to the first-type operators according to a subgraph extraction rule.
As described above, in the process of extracting subgraphs, it must first be ensured that after a subgraph in the computation graph is fused into a single node, no cycle is introduced into the original computation graph. The reason is that cycles cause the operators in the computation graph to depend on each other topologically.
In particular, for a deep learning framework such as MXNet, a graph structure with cycles may cause the scheduling engine at the back end of the framework to deadlock when scheduling operators for execution, because before the scheduling engine launches the kernel function corresponding to an operator, it must ensure that all operators this operator depends on in the computation graph have already finished executing.
In practical applications, convexity (convex) may be used as an equivalent constraint to guarantee freedom from deadlock. As shown in fig. 5, a subgraph S in a directed graph G is said to be convex if and only if there is no path between any two nodes of S that passes through nodes outside S. In any subgraph that breaks convexity, some nodes outside the subgraph inevitably depend on nodes inside it while other nodes inside the subgraph depend on those external nodes, which causes scheduling deadlock.
Further, besides convexity, subgraphs should also guarantee connectivity. A subgraph S in the directed graph G is said to be connected if and only if S is a connected graph when the directed edges in S are regarded as undirected edges.
In practical application, during subgraph extraction it should also be ensured that the extracted subgraph is as large as possible. This principle is based on two intuitive judgments: a maximal subgraph provides as large a search and optimization space as possible for the lower software stack, and a maximal subgraph reduces kernel-function launch overhead to the greatest extent.
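The convexity constraint can be checked mechanically; the sketch below is one possible formulation (ours, not the patent's): a subgraph is convex exactly when no node outside it is both reachable from it and able to reach back into it.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

def _reachable(starts: Iterable[str], adj) -> Set[str]:
    seen, stack = set(starts), list(starts)
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_convex(subgraph: Set[str], edges: List[Tuple[str, str]]) -> bool:
    fwd, rev = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        fwd[src].add(dst)
        rev[dst].add(src)
    reachable_from_s = _reachable(subgraph, fwd) - subgraph   # external nodes reachable from S
    reaching_s = _reachable(subgraph, rev) - subgraph         # external nodes that can reach S
    # an external node in both sets lies on a path that leaves S and re-enters it
    return not (reachable_from_s & reaching_s)

# {A, C} is not convex in A -> B -> C: the path between A and C passes through B, which is outside.
edges = [("A", "B"), ("B", "C")]
assert is_convex({"A", "B", "C"}, edges)
assert not is_convex({"A", "C"}, edges)
```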
Step 2): and the general processor checks the operators in the original subgraph according to the rules of the operators in the learning library of the artificial intelligence processor, and divides the original subgraph for the second time according to the check result to obtain a target subgraph.
In practice, according to the rules of operators in the learning library of the artificial intelligence processor, the framework further needs to perform an operator boundary check on the operators in each original subgraph obtained in step 1), divide the operators that can be executed consecutively into one subgraph, and compile it to form a fusion Op, i.e., the set of operators that can really be fused into one operator. The fusion Op is stored in a cache. When the optimized computation graph containing the fusion operator is executed, the fusion operator is not executed layer by layer; instead, the already compiled fusion Op is fetched directly from the cache.
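The caching behaviour described here can be sketched as follows; the compile function is a trivial stand-in for the learning library, and the tuple-of-operator-names cache key is an assumption made only for illustration.

```python
from typing import Dict, Tuple

_fusion_op_cache: Dict[Tuple[str, ...], str] = {}

def compile_fusion_op(target_subgraph: Tuple[str, ...]) -> str:
    # trivial stand-in for compiling the target subgraph into operation instructions
    return "instructions_for(" + "+".join(target_subgraph) + ")"

def get_fusion_op(target_subgraph: Tuple[str, ...]) -> str:
    if target_subgraph not in _fusion_op_cache:              # compile only the first time
        _fusion_op_cache[target_subgraph] = compile_fusion_op(target_subgraph)
    return _fusion_op_cache[target_subgraph]                 # later executions reuse the cached fusion Op

first = get_fusion_op(("conv1", "relu1", "conv2"))   # compiled and cached
second = get_fusion_op(("conv1", "relu1", "conv2"))  # fetched directly from the cache
assert first is second
```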
In this step, the step of obtaining the target subgraph includes:
the general processor checks the operator in the original subgraph according to the rule of the operator in the learning library of the artificial intelligence processor to obtain a check result;
deleting the operators which do not pass the inspection in the original subgraph by using the inspection result, and pruning the calculation graph formed by the rest operators in the original subgraph to obtain a corresponding target subgraph; wherein the target subgraph contains an input operator and an output operator.
In this embodiment of the present application, after the original subgraph extracted from the computation graph corresponding to the first-type operators undergoes the operator boundary check, a computation graph formed by the remaining operators in the original subgraph is obtained, and this computation graph may fall into the following cases:
the first case: the computation graph comprises an input operator and a plurality of output operators. For example, as shown in fig. 6A, the extracted computation graph formed by the remaining operators in the original subgraph includes one input operator and two output operators.
The second case: and the calculation graph formed by the rest operators in the original subgraph comprises a plurality of input operators and an output operator. For example, as shown in fig. 6B, the extracted computation graph formed by the remaining operators in the original subgraph includes two input operators and one output operator.
The third situation: and the calculation graph formed by the rest operators in the original subgraph comprises a plurality of input operators and a plurality of output operators.
A fourth scenario: and the calculation graph formed by the residual operators in the original subgraph comprises an input operator.
The fifth case: and the calculation graph formed by the residual operators in the original subgraph comprises an output operator.
It should be noted that, because the representation form of the original subgraph has diversity, the above example is only an example, and should not be construed as a limitation. In the embodiment of the present application, taking the computation graph of the neural network model shown in fig. 4 as an example, the subgraphs extracted by the computer device in the computation graph corresponding to the first class operator include the original subgraph shown in fig. 6A and the original subgraph shown in fig. 6B.
In a specific implementation, the step of pruning the computation graph formed by the remaining operators in the original subgraph comprises:
in a case that the computation graph formed by the remaining operators in the original subgraph includes at least one of the following: one input operator and a plurality of output operators; a plurality of input operators and one output operator; or a plurality of input operators and a plurality of output operators, performing iterative pruning on the computation graph formed by the remaining operators in the original subgraph to obtain the target subgraph.
In the embodiment of the present application, for convenience of description, the three cases in which the computation graph formed by the remaining operators in the original subgraph contains one input operator and a plurality of output operators, a plurality of input operators and one output operator, or a plurality of input operators and a plurality of output operators are defined as different pruning conditions. When the general-purpose processor determines that the computation graph formed by the remaining operators in the original subgraph satisfies at least one of the pruning conditions, it prunes that computation graph.
In this embodiment of the present application, when the computer device extracts a plurality of original subgraphs from the computation graph corresponding to the first class operator, in a possible implementation manner, the computer device may prune one of the plurality of original subgraphs; in a possible implementation manner, the computer device may also prune each original subgraph in the plurality of original subgraphs, and the embodiment of the present application is not particularly limited.
In the embodiment of the present application, the process of pruning the computation graph formed by the remaining operators in the original subgraph by the general processor is an iterative pruning process. In particular, an iteration is a repetition of a set of instructions (or a certain step) in a computer program. It can be used as a generic term (synonymous with "repeat") or to describe a specific form of repeat having a variable state.
In this embodiment of the present application, the process of iteratively pruning the computation graph formed by the remaining operators in the original subgraph is illustrated with the case where that computation graph contains one input operator and a plurality of output operators, say output operator 1, output operator 2 and output operator 3. According to the directed edges among the operators in the computation graph formed by the remaining operators in the original subgraph, the general-purpose processor, within that same computation graph, reversely traverses the corresponding original subgraph starting from output operator 1, with reaching another output operator (for example, output operator 2) as the traversal termination condition; this yields the subgraph formed by output operator 1. Similarly, the general-purpose processor reversely traverses the corresponding computation graph starting from output operator 2, terminating when another output operator (for example, output operator 3) is reached, which yields the subgraph formed by output operator 2; and a reverse traversal starting from output operator 3 yields the subgraph formed by output operator 3. If the subgraphs corresponding to these three output operators still satisfy at least one of the pruning conditions, iterative pruning continues until the target subgraphs are obtained.
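The reverse traversal used in this iterative pruning can be sketched as below; the graph in the usage example only approximates Fig. 6A as a simple chain, and all names are illustrative assumptions rather than the patent's code.

```python
from collections import defaultdict
from typing import List, Set, Tuple

def reverse_traverse(start_output: str, all_outputs: Set[str],
                     edges: List[Tuple[str, str]]) -> Set[str]:
    """Walk backwards along directed edges from one output operator, terminating
    whenever another output operator is reached."""
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)
    subgraph, stack = {start_output}, [start_output]
    while stack:
        node = stack.pop()
        for p in preds[node]:
            if p in all_outputs:          # traversal terminates at other output operators
                continue
            if p not in subgraph:
                subgraph.add(p)
                stack.append(p)
    return subgraph

# Rough approximation of Fig. 6A: a chain NNP1 -> NNP2 -> NNP3 -> NNP4 in which
# NNP1 and NNP4 are both output operators of the subgraph.
edges = [("NNP1", "NNP2"), ("NNP2", "NNP3"), ("NNP3", "NNP4")]
outputs = {"NNP1", "NNP4"}
assert reverse_traverse("NNP4", outputs, edges) == {"NNP2", "NNP3", "NNP4"}
assert reverse_traverse("NNP1", outputs, edges) == {"NNP1"}
```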
The following is a detailed description of the case where the pruning condition is satisfied:
the first case: and the calculation graph formed by the rest operators in the original subgraph comprises an input operator and a plurality of output operators.
In a specific implementation, pruning the original subgraph under the condition that a computation graph formed by the remaining operators in the original subgraph contains one input operator and a plurality of output operators comprises:
according to directed edges among operators in the original subgraph, in a calculation graph formed by the rest operators in the same original subgraph, taking an output operator of the calculation graph as a starting point, reversely traversing the corresponding calculation graph, and traversing to other output operators as traversal termination conditions; and stopping iterative pruning under the condition that the subgraph formed by the reversely traversed operators is a target subgraph.
In the embodiment of the present application, the directed edge may be used to characterize operators and connection relationships (e.g., dependencies) between the operators, and may also be used to characterize an execution order of the artificial intelligence processor executing the computation graph.
In this embodiment of the present application, according to the directed edges among the operators of the computation graph formed by the remaining operators in the original subgraph, each of the output operators contained in that computation graph is used in turn as a starting point for a reverse traversal of the corresponding computation graph, with reaching another output operator as the traversal termination condition; this yields the subgraph formed by each output operator. Each such subgraph is a part of the computation graph formed by the remaining operators in the original subgraph, and it can be understood that superimposing the subgraphs formed by all the output operators recovers that computation graph. After the subgraph formed by each output operator has been obtained by traversal, it is determined whether that subgraph satisfies a pruning condition; if it does, the general-purpose processor prunes it according to the specific case until the traversed subgraph is a target subgraph.
In practical applications, the reverse traversal of the plurality of output operators may include the following steps (a sketch of this procedure is given after the steps below):
acquiring a target output operator, wherein the target output operator is any one of the plurality of output operators;
and performing a reverse traversal starting from the target output operator according to the directed edges between the operators, stopping the traversal when the traversed operator is an output operator, so as to obtain the subgraph formed by the target output operator.
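A minimal sketch of this reverse traversal, using the illustrative `Graph` representation introduced above, is given below; `reverse_traverse` is a hypothetical helper name introduced here and is not a function named in the original disclosure.

```python
def reverse_traverse(graph: Graph, target_output: str) -> Set[str]:
    """Reverse traversal from one target output operator.

    Walks backwards along the directed edges (from an operator to its
    predecessors) and stops whenever the operator about to be visited is
    itself an output operator, so the resulting subgraph contains exactly
    one output operator.
    """
    subgraph = {target_output}
    stack = list(graph.ops[target_output].preds)
    while stack:
        name = stack.pop()
        if name in subgraph:
            continue
        if graph.ops[name].is_output:
            # Reaching another output operator terminates this pass; it is not included.
            continue
        subgraph.add(name)
        stack.extend(graph.ops[name].preds)
    return subgraph
```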
It should be noted that the reverse traversal of the plurality of output operators guarantees that the subgraph formed by each output operator contains only one output operator, but it does not guarantee that the subgraph contains only one input operator. In one case, the subgraph formed by each output operator contains one input operator; in another case, it contains a plurality of input operators.
In the embodiment of the present application, the process of traversing the computation graph formed by the remaining operators in the original subgraph can be understood as a process of decomposing that computation graph into a plurality of partial subgraphs. During this decomposition, new subgraphs are obtained, and the combinations of input operators and output operators in a new subgraph again fall into the five situations described in this application.
In the embodiment of the present application, the dependency relationship between operators describes a directional relationship between operators and may be represented as a directed edge in a directed graph. For example, taking NNP1 and NNP2 shown in FIG. 6A as an example, NNP1 points to NNP2, i.e., the output tensor data of NNP1 is used as the input tensor data of NNP2.
As mentioned above, the computation graph formed by the remaining operators in the original subgraph shown in FIG. 6A includes one input operator and two output operators. When the computer device determines that this computation graph satisfies the pruning condition, the computer device prunes it. Because the computation graph contains two output operators, the computer device performs reverse traversals along the different output operators in turn, and each reverse traversal yields the subgraph formed by the corresponding output operator.
In one case, when a subgraph formed by the traversed operators satisfies at least one of the pruning conditions, the subgraph is pruned. The reason the subgraph is pruned is that the subgraph obtained by reverse traversal from an output operator contains one output operator and a plurality of input operators. In that case, forward traversals are performed from the plurality of input operators to obtain the subgraph formed by each input operator, and the iterative pruning stops when the subgraph formed by the traversed operators is a target subgraph.
In another case, under the condition that the subgraph formed by the traversed operators does not satisfy any pruning condition, pruning is not carried out.
For example, during the first reverse traversal, the computer device selects NNP4 as the target output operator from the two output operators (NNP1 and NNP4), and then traverses in reverse from NNP4 (i.e., with NNP4 as the reverse traversal starting point) while respecting the dependencies between operators: the first traversed operator is NNP3, the second is NNP2, and the third would be NNP1; since NNP1 is an output operator, it cannot be visited in this pass, and the traversal is cut off. In this case, with NNP4 as the starting point, the three traversed operators (NNP4, NNP3, and NNP2) form the subgraph corresponding to NNP4. During the second reverse traversal, the computer device selects NNP1 as the target output operator. When NNP1 is traversed in reverse while respecting the dependencies between operators, only one operator remains in the remaining part of the computation graph formed by the remaining operators in the original subgraph shown in FIG. 6A; in this case, that operator alone is taken as the subgraph corresponding to NNP1.
In practical applications, because the subgraph obtained by the reverse traversal from NNP4 contains one input operator and one output operator, and the subgraph obtained by the reverse traversal from NNP1 contains one output operator, neither subgraph satisfies any pruning condition, and in this case no pruning operation is performed on them.
As can be seen from the above description, when the computer device prunes the computation graph formed by the remaining operators in the original subgraph shown in FIG. 6A, the resulting subgraphs can be as shown in FIG. 6C.
It will be appreciated that, in the first traversal, the computer device may equally select NNP1 as the target output operator from the two output operators (NNP1 and NNP4), and select NNP4 as the target output operator in the second traversal. For the specific implementation, reference is made to the foregoing description, which is not repeated here.
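Purely as an illustration of the two passes just described, the snippet below builds the chain NNP1→NNP2→NNP3→NNP4 with NNP1 and NNP4 marked as output operators, as in the description of FIG. 6A, and runs the hypothetical `reverse_traverse` helper from each output operator. The construction code, including the assumption that NNP1 is also the input operator of the subgraph, is introduced for illustration only and is not part of the original disclosure.

```python
# Build the chain NNP1 -> NNP2 -> NNP3 -> NNP4 described for FIG. 6A.
g = Graph()
for name in ("NNP1", "NNP2", "NNP3", "NNP4"):
    g.ops[name] = Operator(name)
g.ops["NNP1"].is_input = True    # assumption: NNP1 also receives the external input
g.ops["NNP1"].is_output = True   # its result is also consumed outside the subgraph
g.ops["NNP4"].is_output = True
for src, dst in (("NNP1", "NNP2"), ("NNP2", "NNP3"), ("NNP3", "NNP4")):
    g.add_edge(src, dst)

print(reverse_traverse(g, "NNP4"))   # {'NNP4', 'NNP3', 'NNP2'} -- the pass is cut off at NNP1
print(reverse_traverse(g, "NNP1"))   # {'NNP1'}
```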
The second case: the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and one output operator.
In a specific implementation, when the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and one output operator, pruning the computation graph includes:
according to the directed edges between the operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, traversing the corresponding computation graph in a forward direction with an input operator of the computation graph as the starting point, and terminating the traversal upon reaching another input operator; and stopping the iterative pruning when the subgraph formed by the forward-traversed operators is a target subgraph.
In this embodiment of the present application, according to the directed edges between the operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, the corresponding computation graph is traversed in a forward direction with each of the plurality of input operators of the computation graph as a starting point, terminating each traversal upon reaching another input operator, so that a subgraph formed by each input operator is obtained. The subgraph formed by each input operator is a part of the computation graph; it can be understood that superimposing the subgraphs formed by the respective input operators reproduces the computation graph. After the subgraph formed by each input operator is obtained by traversal, whether that subgraph satisfies the pruning conditions is judged; if it does, the general-purpose processor prunes it according to the specific situation until the traversed subgraph is a target subgraph.
In practical applications, the forward traversal of the plurality of input operators may include the following steps (a sketch of this procedure is given after the steps below):
acquiring a target input operator, wherein the target input operator is any one of the plurality of input operators;
and performing a forward traversal starting from the target input operator according to the directed edges between the operators, stopping the traversal when the traversed operator is an input operator, so as to obtain the subgraph formed by the target input operator.
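The forward pass mirrors the reverse pass. Under the same illustrative representation, the hypothetical `forward_traverse` helper below walks along outgoing edges and stops at any other input operator; the helper name is again an assumption introduced for this sketch.

```python
def forward_traverse(graph: Graph, target_input: str) -> Set[str]:
    """Forward traversal from one target input operator.

    Walks forwards along the directed edges (from an operator to its
    successors) and stops whenever the operator about to be visited is
    itself an input operator, so the resulting subgraph contains exactly
    one input operator.
    """
    subgraph = {target_input}
    stack = list(graph.ops[target_input].succs)
    while stack:
        name = stack.pop()
        if name in subgraph:
            continue
        if graph.ops[name].is_input:
            # Reaching another input operator terminates this pass; it is not included.
            continue
        subgraph.add(name)
        stack.extend(graph.ops[name].succs)
    return subgraph
```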
It should be noted that the forward traversal of the plurality of input operators guarantees that the subgraph formed by each input operator contains only one input operator, but it does not guarantee that the subgraph contains only one output operator. In one case, the subgraph formed by each input operator contains one output operator; in another case, it contains a plurality of output operators.
As described above, the process of traversing the computation graph formed by the remaining operators in the original subgraph can be visualized as decomposing that computation graph into a plurality of partial subgraphs. During this decomposition, new subgraphs are obtained, and the combinations of input operators and output operators in a new subgraph again fall into the five situations described in this application.
As mentioned above, the computation graph formed by the remaining operators in the original subgraph shown in FIG. 6B includes two input operators and one output operator. When the computer device determines that this computation graph satisfies the pruning condition, the computer device prunes it. Because the computation graph contains two input operators, the computer device performs forward traversals along the different input operators in turn, and each forward traversal yields the subgraph formed by the corresponding input operator.
In one case, when a subgraph formed by the traversed operators satisfies at least one of the pruning conditions, the subgraph is pruned. The reason the subgraph is pruned is that the subgraph obtained by forward traversal from an input operator contains one input operator and a plurality of output operators. In that case, reverse traversals are performed from the plurality of output operators to obtain the subgraph formed by each output operator, and the iterative pruning stops when the subgraph formed by the traversed operators is a target subgraph.
In another case, under the condition that the subgraph formed by the traversed operators does not satisfy any pruning condition, pruning is not carried out.
For example, during the first forward traversal, the computer device selects NNP5 as the target input operator from the two input operators (NNP5 and NNP7), and then traverses in a forward direction from NNP5 (i.e., with NNP5 as the forward traversal starting point) while respecting the dependencies between operators: the first traversed operator is NNP6, and the second would be NNP7; since NNP7 is an input operator, it cannot be visited in this pass, and the traversal is cut off. In this case, with NNP5 as the starting point, the two traversed operators (NNP5 and NNP6) form the subgraph corresponding to NNP5. During the second forward traversal, the computer device selects NNP7 as the target input operator. When NNP7 is traversed in a forward direction while respecting the dependencies between operators, only one operator remains in the remaining part of the computation graph formed by the remaining operators in the original subgraph shown in FIG. 6B; in this case, that operator alone is taken as the subgraph corresponding to NNP7.
In practical applications, because the subgraph obtained by the forward traversal from NNP5 contains one input operator and one output operator, and the subgraph obtained by the forward traversal from NNP7 contains one input operator, neither subgraph satisfies any pruning condition, and in this case no pruning operation is performed on them.
As can be seen from the above description, when the computer device prunes the computation graph formed by the remaining operators in the original subgraph shown in FIG. 6B, the resulting subgraphs can be as shown in FIG. 6D.
It will be appreciated that, in the first traversal, the computer device may equally select NNP7 as the target input operator from the two input operators (NNP5 and NNP7), and select NNP5 as the target input operator in the second traversal. For the specific implementation, reference is made to the foregoing description, which is not repeated here.
The third case: the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators.
In this embodiment of the application, when the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators, the general-purpose processor may either first perform forward traversals from the plurality of input operators and then perform reverse traversals from the plurality of output operators, or first perform reverse traversals from the plurality of output operators and then perform forward traversals from the plurality of input operators. These two cases are explained in detail below:
In one possible implementation, according to the directed edges between the operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, each original subgraph in the original subgraph set is traversed in a forward direction with an input operator of the computation graph formed by the remaining operators in the original subgraph as the starting point, terminating the traversal upon reaching another input operator; the iterative pruning stops when the subgraph formed by the forward-traversed operators is a target subgraph;
according to the directed edges between the operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, the original subgraphs in the original subgraph set for which no target subgraph has yet been obtained are traversed in reverse with an output operator of the computation graph as the starting point, terminating the traversal upon reaching another output operator; and the iterative pruning stops when the subgraph formed by the reversely traversed operators is a target subgraph.
In this case, the general-purpose processor first performs forward traversals from the input operators and then performs reverse traversals from the output operators, thereby implementing the iterative pruning of the computation graph formed by the remaining operators in the original subgraph.
In another possible implementation, when the computation graph formed by the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators, pruning the computation graph includes:
the general-purpose processor, according to the directed edges between the operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, traverses each original subgraph in the original subgraph set in reverse with an output operator of the computation graph as the starting point, terminating the traversal upon reaching another output operator; the iterative pruning stops when the subgraph formed by the reversely traversed operators is a target subgraph;
according to the directed edges between the operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, the original subgraphs in the original subgraph set for which no target subgraph has yet been obtained are traversed in a forward direction with an input operator of the computation graph as the starting point, terminating the traversal upon reaching another input operator; and the iterative pruning stops when the subgraph formed by the forward-traversed operators is a target subgraph.
In this case, the general-purpose processor first performs reverse traversals from the plurality of output operators and then performs forward traversals from the plurality of input operators, thereby implementing the iterative pruning of the computation graph formed by the remaining operators in the original subgraph.
In the embodiment of the present application, please refer to the foregoing description for the implementation process of performing forward traversal on a plurality of input operators and performing reverse traversal on a plurality of output operators, which is not described herein repeatedly.
For example, suppose the original subgraph contains a plurality of input operators and a plurality of output operators. When the computer device determines that the original subgraph satisfies the pruning condition, the computer device prunes it.
In one possible implementation, the computer device performs reverse traversals along the different output operators to obtain the subgraph traversed from each of the plurality of output operators; it then determines, from the original subgraph and those subgraphs, which input operators have not been traversed; the computer device then performs forward traversals along those different input operators to obtain the subgraph traversed from each input operator that had not been traversed. When a traversed subgraph satisfies at least one pruning condition, it is iteratively pruned; when a subgraph does not satisfy any pruning condition, it is not pruned.
It can be understood that the subgraphs traversed from the plurality of output operators and the subgraphs traversed from the input operators that had not been traversed are all part of the original subgraph. Furthermore, superimposing these subgraphs reproduces the computation graph formed by the remaining operators in the original subgraph.
In another possible implementation, the computer device performs forward traversals along the different input operators to obtain the subgraph traversed from each of the plurality of input operators; it then determines, from the original subgraph and those subgraphs, which output operators have not been traversed; the computer device then performs reverse traversals along those different output operators to obtain the subgraph traversed from each output operator that had not been traversed. When a subgraph satisfies at least one pruning condition, it is iteratively pruned; when it does not satisfy any pruning condition, it is not pruned.
It can be understood that the subgraphs traversed from the plurality of input operators and the subgraphs traversed from the output operators that had not been traversed are all part of the original subgraph. Furthermore, superimposing these subgraphs reproduces the original subgraph.
In practical applications, during the reverse traversal from an output operator, the traversal is cut off as soon as the traversed operator is also an output operator, yielding a subgraph with that output operator as the reverse traversal starting point. During the forward traversal from an input operator, the traversal is cut off as soon as the traversed operator is also an input operator, yielding a subgraph with that input operator as the forward traversal starting point.
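Combining the two illustrative helpers above, one round of the third case could be sketched as follows: reverse traversals from all output operators first, then forward traversals from the input operators not yet covered (the other ordering is symmetric). The helper names `needs_pruning` and `decompose` are assumptions introduced for this sketch; `needs_pruning` encodes the condition that a subgraph still has more than one input operator or more than one output operator.

```python
def needs_pruning(graph: Graph, subgraph: Set[str]) -> bool:
    # Pruning condition: the subgraph still has several input or several output operators.
    inputs = [n for n in subgraph if graph.ops[n].is_input]
    outputs = [n for n in subgraph if graph.ops[n].is_output]
    return len(inputs) > 1 or len(outputs) > 1


def decompose(graph: Graph) -> List[Set[str]]:
    """One round of decomposition for the third case: reverse traversals from
    every output operator, then forward traversals from the input operators
    that were not covered by any reverse pass."""
    pieces: List[Set[str]] = []
    covered: Set[str] = set()
    for out in graph.output_ops():
        piece = reverse_traverse(graph, out)
        pieces.append(piece)
        covered |= piece
    for inp in graph.input_ops():
        if inp not in covered:
            # The description applies the forward passes to the remaining part of
            # the graph; for brevity this sketch runs them over the whole graph.
            pieces.append(forward_traverse(graph, inp))
    # A piece for which needs_pruning(...) is True would be decomposed again
    # in the next iteration, until every piece is a target subgraph.
    return pieces
```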
It can be understood that, when the computer device prunes the computation graph formed by the remaining operators in the original subgraph, a pruned subgraph set can be obtained. For example, when the computer device prunes the computation graph shown in FIG. 4, the subgraph set shown in FIG. 6E can be obtained. Each subgraph in the subgraph set contains one input operator and one output operator. Specifically, subgraph 1 (NNP1), subgraph 2 (NNP2, NNP3, and NNP4), subgraph 3 (NNP5 and NNP6), and subgraph 4 (NNP7) in the subgraph set are used to construct the target subgraphs.
In the embodiment of the present application, the original computation graph is considered to include a first type of operators and a second type of operators. In this case, the operation instructions corresponding to the computation graph formed by the second type of operators run on the general-purpose processor, and the operation instructions corresponding to the target subgraphs are sent to the artificial intelligence processor.
In the embodiment of the present application, when the computer device does not prune the original computation graph (FIG. 4), the specific implementation may be as shown in FIG. 6F. When the artificial intelligence processor in the heterogeneous system runs the operation instructions corresponding to the first type of operators and the general-purpose processor runs the operation instructions corresponding to the second type of operators, the parallelism is poor from a scheduling point of view, because the computation graph corresponding to the first type of operators has not been optimized.
In the embodiment of the present application, as shown in FIG. 6E, after the computer device prunes the original computation graph (FIG. 4), three partial subgraphs can be obtained: the first partial subgraph includes NNP1; the second partial subgraph includes NNP2, NNP3, NNP4, NNP5, NNP6, and a CPU operator; and the third partial subgraph includes NNP7. From a scheduling point of view, this provides better parallelism than the un-pruned case and allows the processors in the heterogeneous system to compute at the same time.
Step 3): and compiling the target subgraph by the general processor to obtain an operation instruction corresponding to the fusion operator.
With this technical solution, the general-purpose processor fuses the operators in the original computation graph that can run on the artificial intelligence processor into one operator, referred to as a fusion operator, and stores the operation instruction of the fusion operator. When the corresponding machine learning task is executed later, the stored operation instruction of the fusion operator is reused directly, which reduces the number of times the computation cores of the artificial intelligence processor are started and the number of memory accesses, avoids repeated compilation, and greatly increases the inference speed. At the same time, the computer device prunes the computation graph corresponding to the operators that run on the artificial intelligence processor so that each pruned subgraph contains one input operator and one output operator, which avoids creating dependencies between a subgraph running on the artificial intelligence processor and a subgraph running on the general-purpose processor and thus improves the efficiency with which the heterogeneous system executes neural network computation tasks in parallel.
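Purely as an illustration of the reuse described above, the sketch below caches a compiled operation instruction under a key derived from the fused subgraph so that a later run of the same machine learning task skips compilation. The cache class, the key scheme, and the `compile_subgraph` callable are all assumptions introduced for this example and are not APIs from the original disclosure.

```python
from typing import Callable, Dict, FrozenSet


class FusionOpCache:
    """Caches the operation instruction compiled for a fusion operator, keyed by
    the set of operator names in the target subgraph (an illustrative key scheme)."""

    def __init__(self, compile_subgraph: Callable[[FrozenSet[str]], bytes]):
        self._compile = compile_subgraph          # e.g. invokes the AI-processor compiler
        self._instructions: Dict[FrozenSet[str], bytes] = {}

    def get_instruction(self, subgraph_ops: FrozenSet[str]) -> bytes:
        # Reuse the stored operation instruction if this fusion operator was compiled
        # before, so the subgraph is not compiled repeatedly across runs of the task.
        if subgraph_ops not in self._instructions:
            self._instructions[subgraph_ops] = self._compile(subgraph_ops)
        return self._instructions[subgraph_ops]
```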
It is noted that, for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or a combination of acts, but those skilled in the art will appreciate that the present disclosure is not limited by the order of the acts described, as some steps may, in accordance with the present disclosure, be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments, and that the acts and modules referred to are not necessarily required by the disclosure.
It is further noted that, although the steps in the flowcharts of FIG. 3A and FIG. 3B are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least some of the steps in FIG. 3A and FIG. 3B may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes the method of the embodiments of the present application in detail. To better implement the above aspects of the embodiments of the present application, a corresponding apparatus for implementing the above aspects cooperatively is provided below.
FIG. 7 is a schematic diagram of a computer device according to an embodiment of the present application. The computer device includes a processor and a memory that are connected to each other, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the methods shown in FIG. 3A and FIG. 3B.
Furthermore, it should be noted that the present application also provides a computer storage medium for storing computer software instructions for the computer device shown in FIG. 7, which include a program for executing the foregoing method embodiments. By executing the stored program, pruning of the computation graph of the neural network model can be realized, and the efficiency with which the heterogeneous system executes neural network computation tasks in parallel is improved.
Therefore, the neural network pruning method and apparatus, the computer device, and the storage medium provided by the embodiments of the present application avoid creating dependencies between a subgraph running on the artificial intelligence processor and a subgraph running on the general-purpose processor, and can improve the efficiency with which the heterogeneous system executes neural network computation tasks in parallel.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. A person skilled in the art may make changes or modifications to the specific embodiments and the application scope according to the idea of the present disclosure. In view of the above, this description should not be construed as limiting the present disclosure.

Claims (11)

1. A computational graph execution method, comprising:
when a general processor compiles a computation graph containing a fusion operator, a binary instruction, corresponding to the computation graph, to be executed by an artificial intelligence processor is obtained according to an operation instruction of the fusion operator; wherein the step of obtaining the operation instruction of the fusion operator comprises:
the general processor divides the operators for the first time according to execution devices of the operators in an original computation graph to obtain an original subgraph; wherein the execution devices comprise a general processor and an artificial intelligence processor;
the general processor checks the operators in the original subgraph according to the rules of the operators in the learning library of the artificial intelligence processor, and divides the original subgraph for the second time according to the check result to obtain a target subgraph;
and compiling the target subgraph by the general processor to obtain an operation instruction corresponding to the fusion operator.
2. The method of claim 1, wherein the step of obtaining an original subgraph comprises:
acquiring the original computation graph, and determining a first type of operators from the original computation graph; wherein operation instructions corresponding to the first type of operators can run on the artificial intelligence processor;
acquiring a computation graph formed by the first type of operators according to directed edges among operators in the original computation graph, and extracting an original subgraph from the computation graph formed by the first type of operators; wherein the original subgraph contains a plurality of input operators and/or a plurality of output operators; all original subgraphs constitute an original subgraph set.
3. The method of claim 1 or 2, wherein the step of obtaining a target subgraph comprises:
the general processor checks the operators in the original subgraph according to the rules of the operators in the learning library of the artificial intelligence processor to obtain a check result;
deleting, by using the check result, the operators in the original subgraph that do not pass the check, and pruning the computation graph formed by the remaining operators in the original subgraph to obtain a corresponding target subgraph; wherein the target subgraph contains one input operator and one output operator.
4. The method of claim 3, wherein pruning the computation graph of the remaining operators in the original subgraph comprises:
in the case that the computation graph formed by the remaining operators in the original subgraph contains at least one of: one input operator and a plurality of output operators, a plurality of input operators and one output operator, or a plurality of input operators and a plurality of output operators, performing iterative pruning on the computation graph formed by the remaining operators in the original subgraph to obtain the target subgraph.
5. The method of claim 4, wherein pruning the computation graph of the remaining operators in the original subgraph if the computation graph of the remaining operators in the original subgraph contains one input operator and a plurality of output operators comprises:
according to directed edges among operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, reversely traversing the corresponding computation graph with an output operator of that computation graph as a starting point, and taking reaching another output operator as the traversal termination condition; and stopping the iterative pruning under the condition that the computation graph formed by the reversely traversed operators is a target subgraph.
6. The method of claim 4, wherein pruning the computation graph of the remaining operators in the original subgraph if the computation graph of the remaining operators in the original subgraph contains a plurality of input operators and one output operator comprises:
according to directed edges among operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, traversing the corresponding computation graph in a forward direction with an input operator of that computation graph as a starting point, and taking reaching another input operator as the traversal termination condition; and stopping the iterative pruning under the condition that the computation graph formed by the forward-traversed operators is a target subgraph.
7. The method of claim 4, wherein pruning the computation graph of the remaining operators in the original subgraph if the computation graph of the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators comprises:
according to directed edges among operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, traversing each original subgraph in the original subgraph set in a forward direction with an input operator of that computation graph as a starting point, and taking reaching another input operator as the traversal termination condition; stopping the iterative pruning under the condition that the computation graph formed by the forward-traversed operators is a target subgraph;
according to directed edges among operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, reversely traversing the original subgraphs in the original subgraph set for which no target subgraph has been obtained, with an output operator of that computation graph as a starting point, and taking reaching another output operator as the traversal termination condition; and stopping the iterative pruning under the condition that the computation graph formed by the reversely traversed operators is a target subgraph.
8. The method of claim 4, wherein pruning the computation graph of the remaining operators in the original subgraph if the computation graph of the remaining operators in the original subgraph contains a plurality of input operators and a plurality of output operators comprises:
according to directed edges among operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, reversely traversing each original subgraph in the original subgraph set with an output operator of that computation graph as a starting point, and taking reaching another output operator as the traversal termination condition; stopping the iterative pruning under the condition that the computation graph formed by the reversely traversed operators is a target subgraph;
according to directed edges among operators in the computation graph formed by the remaining operators in the original subgraph, in the computation graph formed by the remaining operators in the same original subgraph, traversing in a forward direction the original subgraphs in the original subgraph set for which no target subgraph has been obtained, with an input operator of that computation graph as a starting point, and taking reaching another input operator as the traversal termination condition; and stopping the iterative pruning under the condition that the computation graph formed by the forward-traversed operators is a target subgraph.
9. The method of claim 2, wherein the step of obtaining the original subgraph further comprises:
determining a second type of operators from the original computational graph; wherein operation instructions corresponding to the second type of operators can run on the general-purpose processor;
and acquiring the computational graph formed by the second type of operators according to directed edges among the operators in the original computational graph.
10. A computer device comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any one of claims 1-9.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-9.
CN201911230228.9A 2019-12-04 2019-12-04 Computation graph execution method, computer device, and storage medium Active CN111160551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911230228.9A CN111160551B (en) Computation graph execution method, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN111160551A true CN111160551A (en) 2020-05-15
CN111160551B CN111160551B (en) 2023-09-29

Family

ID=70556353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911230228.9A Active CN111160551B (en) Computation graph execution method, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN111160551B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894051A (en) * 2010-07-29 2010-11-24 中国科学技术大学 Primary and secondary data structure-based CPU-GPU cooperative computing method
US20140089905A1 (en) * 2012-09-27 2014-03-27 William Allen Hux Enabling polymorphic objects across devices in a heterogeneous platform
CN108292374A (en) * 2015-11-09 2018-07-17 谷歌有限责任公司 Training is expressed as the neural network of calculating figure
CN106250563A (en) * 2016-08-30 2016-12-21 江苏名通信息科技有限公司 K bisimulation computational algorithm based on GPS platform
CN110490309A (en) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 A kind of Operator Fusion method and its Related product for neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHAO WANG ET AL.: "Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges" *
YANG JIANGPING: "Research on DNN Model Computation and Deployment Optimization Based on the BWDSP Platform" *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112269992A (en) * 2020-06-01 2021-01-26 中国科学院信息工程研究所 Real-time malicious sample detection method based on artificial intelligence processor and electronic device
CN112269992B (en) * 2020-06-01 2023-10-20 中国科学院信息工程研究所 Real-time malicious sample detection method based on artificial intelligent processor and electronic device
CN112149828A (en) * 2020-09-29 2020-12-29 北京百度网讯科技有限公司 Operator precision detection method and device based on deep learning framework
CN112328227A (en) * 2020-11-03 2021-02-05 清华大学 Compiling method, compiling apparatus, computing device and medium
WO2022148276A1 (en) * 2021-01-08 2022-07-14 华为技术有限公司 Method and device for optimizing computational graph
CN112947933A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Operator execution method and device, computer equipment and storage medium
CN113065639A (en) * 2021-03-08 2021-07-02 深圳云天励飞技术股份有限公司 Operator fusion method, system, device and storage medium
CN113065639B (en) * 2021-03-08 2023-06-13 深圳云天励飞技术股份有限公司 Operator fusion method, system, equipment and storage medium
WO2023123266A1 (en) * 2021-12-30 2023-07-06 华为技术有限公司 Subgraph compilation method, subgraph execution method and related device
WO2024051377A1 (en) * 2022-09-07 2024-03-14 华为云计算技术有限公司 Model optimization method and apparatus and computing device
CN117648091A (en) * 2023-12-12 2024-03-05 上海寒武纪信息科技有限公司 Compiling method of calculation graph and related product
CN117492766A (en) * 2023-12-27 2024-02-02 深圳市九天睿芯科技有限公司 Compiling method, compiler, neural network accelerator, chip and electronic equipment

Also Published As

Publication number Publication date
CN111160551B (en) 2023-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant