CN113705798A - Processing unit, computing device and computation graph optimization method of deep learning model - Google Patents
- Publication number
- CN113705798A (application CN202010435236.3A)
- Authority
- CN
- China
- Prior art keywords
- operator
- transposition
- path
- graph
- paths
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention discloses a processing unit, a computing device and a computation graph optimization method for a deep learning model. The computation graph optimization method comprises the following steps: determining a plurality of paths that extend from an input operator to an output operator of the computation graph to be optimized and contain at least one transpose operator, wherein each of the plurality of paths has at least one transpose operator different from those of the other paths; determining the merging direction of each transpose operator in each path; and determining, according to the merging direction of each transpose operator in each path, the transpose operators that can be cancelled out in one or more paths, and removing the cancellable transpose operators from the one or more paths to obtain the optimized computation graph. The optimized computation graph eliminates unnecessary transpose operators, thereby reducing resource overhead.
Description
Technical Field
The disclosure relates to the field of chips, in particular to a processing unit, a computing device and a computation graph optimization method of a deep learning model.
Background
In recent years, with the rise of artificial intelligence, deep learning models have become widely used prediction models, and they are increasingly applied in scenarios such as speech recognition and face recognition. Cloud technology is a hosting technology that unifies hardware, software, network and other resources in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Using cloud technology, a customer can provide the computation graph of a deep learning model to a cloud service provider; the provider processes the computation graph and deploys it to a server in a data center to run, and the customer's application system then obtains prediction results from the deep learning model running in the data center. Because the data center runs the computation graph on an acceleration unit dedicated to deep learning models, deploying the model to the data center helps improve its inference performance.
While optimizing the computation graph of a deep learning model to be used by a customer, the inventors discovered that parts of the computation graph contain unnecessary transpose (trans) operators. A transpose operator converts the data format of a tensor (also called dimension conversion). A transpose operator is executed either by the acceleration unit or by the processor: by the acceleration unit when the acceleration unit is capable of executing it, and by the processor otherwise. The former occupies resources and time of the acceleration unit, while the latter requires switching between the processor and the acceleration unit and thus incurs greater delay. Unnecessary transpose operators therefore cause unnecessary resource overhead. To reduce this overhead, the computation graph of a deep learning model to be run on the acceleration unit should contain no unnecessary transpose operators.
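For illustration only, the dimension conversion performed by a transpose operator, and the way a pair of opposite transposes cancels out, can be sketched with NumPy (a hedged example, not the operator implementation used by any acceleration unit):

```python
import numpy as np

# A batch of 2 feature maps with 3 channels and a 4x5 spatial size, stored as NCHW.
x_nchw = np.random.rand(2, 3, 4, 5)

# Transpose operator performing the dimension conversion NCHW -> NHWC.
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))   # shape (2, 4, 5, 3)

# An adjacent transpose operator converting NHWC back to NCHW.
x_back = np.transpose(x_nhwc, (0, 3, 1, 2))   # shape (2, 3, 4, 5)

# The two opposite transposes cancel out: the data is unchanged, so keeping
# both in the computation graph only wastes accelerator time and memory traffic.
assert np.array_equal(x_nchw, x_back)
```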
Disclosure of Invention
Based on this, the present disclosure provides a processing unit, a computing device and a computation graph optimization method of a deep learning model, so as to eliminate unnecessary transpose operators in the computation graph.
In a first aspect, an embodiment of the present disclosure provides a processing unit, including:
an instruction fetch unit to retrieve computer instructions from a memory external to the processing unit;
an instruction decode unit to decode the retrieved computer instructions;
an instruction execution unit, configured to execute the decoded computer instructions to implement:
determining a plurality of paths that extend from an input operator to an output operator of a computation graph to be optimized and contain at least one transpose operator, wherein each of the plurality of paths has at least one transpose operator different from those of the other paths;
determining the merging direction of each transpose operator in each path; and
determining, according to the merging direction of each transpose operator in each path, the transpose operators that can be cancelled out in one or more paths, and removing the cancellable transpose operators from the one or more paths to obtain the optimized computation graph.
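As an aid to the description that follows, a computation graph of this kind can be represented with a minimal toy structure such as the one below. The names are illustrative assumptions, not the disclosure's actual data structures; the later sketches in this description reuse the same structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The two tensor data formats discussed in this disclosure.
NCHW, NHWC = "NCHW", "NHWC"

@dataclass
class Operator:
    name: str
    op_type: str                                   # e.g. "Input", "Conv", "Transpose", "Output"
    inputs: List["Operator"] = field(default_factory=list)
    outputs: List["Operator"] = field(default_factory=list)
    # For a transpose operator: data format of its input and output tensors.
    in_format: Optional[str] = None
    out_format: Optional[str] = None

    def is_transpose(self) -> bool:
        return self.op_type == "Transpose"

def connect(src: Operator, dst: Operator) -> None:
    """Add a directed edge src -> dst to the graph."""
    src.outputs.append(dst)
    dst.inputs.append(src)

@dataclass
class Graph:
    input_op: Operator
    output_op: Operator
    operators: List[Operator] = field(default_factory=list)
```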
Optionally, the instruction execution unit further implements:
acquiring an initial computation graph;
determining whether the data format of the initial computation graph is the same as the data format selected by the specific acceleration unit; if the two formats are the same, taking the initial computation graph as the computation graph to be optimized; if they are different, sequentially inserting a first transpose operator and a second transpose operator before the input operator of the initial computation graph, sequentially inserting a third transpose operator and a fourth transpose operator after the output operator of the initial computation graph, and taking the initial computation graph with the inserted transpose operators as the computation graph to be optimized, wherein the first transpose operator and the third transpose operator convert the data format of the computation graph to be optimized into the data format selected by the specific acceleration unit, and the second transpose operator and the fourth transpose operator convert the data format selected by the specific acceleration unit into the data format of the computation graph to be optimized.
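A minimal sketch of this preparation step, continuing the toy Operator/Graph structure above (make_transpose and the wiring below are illustrative assumptions, not the disclosure's implementation):

```python
def make_transpose(name: str, in_format: str, out_format: str) -> Operator:
    # Hypothetical helper building a transpose operator that converts
    # tensors from in_format to out_format.
    return Operator(name, "Transpose", in_format=in_format, out_format=out_format)

def prepare_graph(graph: Graph, graph_format: str, accel_format: str) -> Graph:
    """If the graph's data format already matches the format selected by the
    acceleration unit, use it as-is; otherwise insert paired transpose
    operators: a first/second pair before the input operator and a
    third/fourth pair after the output operator."""
    if graph_format == accel_format:
        return graph

    t1 = make_transpose("t1", graph_format, accel_format)  # graph format -> accelerator format
    t2 = make_transpose("t2", accel_format, graph_format)  # accelerator format -> graph format
    t3 = make_transpose("t3", graph_format, accel_format)
    t4 = make_transpose("t4", accel_format, graph_format)

    connect(t1, t2)
    connect(t2, graph.input_op)
    connect(graph.output_op, t3)
    connect(t3, t4)

    graph.operators = [t1, t2] + graph.operators + [t3, t4]
    # Path searches below start at t1 and end at t4, so the inserted
    # transpose operators are considered for cancellation as well.
    graph.input_op, graph.output_op = t1, t4
    return graph
```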
Optionally, the determining the merging direction of each transpose operator in each path includes:
for each transpose operator in each path of the path set, first determining whether the data format of the transpose operator's output tensor is the same as the data format selected by the specific acceleration unit; if so, the merging direction of the transpose operator is upward; if not, further determining whether the data format of the transpose operator's input tensor is the same as the data format selected by the specific acceleration unit, and if so, the merging direction of the transpose operator is downward;
and the determining of the transpose operators that can be cancelled out in one or more paths according to the merging direction of each transpose operator in each path, and the removing of those transpose operators, then include:
traversing each path in turn and, for a first transpose operator and a second transpose operator that are adjacent on a path and whose merging directions are respectively downward and upward, determining whether the first transpose operator and the second transpose operator always exist in the same paths; if so, determining that the first transpose operator and the second transpose operator cancel each other out, and removing both;
if not: when the first transpose operator is determined to exist alone in at least one path and the merging direction of the next transpose operator adjacent to the first transpose operator in all paths containing the first transpose operator is upward, determining that the first transpose operator and that adjacent next transpose operator cancel each other out, and removing them from all paths containing the first transpose operator; when the second transpose operator is determined to exist alone in a path and the merging direction of the previous transpose operator adjacent to the second transpose operator in all paths containing the second transpose operator is downward, determining that the second transpose operator and that adjacent previous transpose operator cancel each other out, and removing them from all paths containing the second transpose operator.
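The two rules above can be sketched as follows, again using the toy structure introduced earlier. This is a heavily simplified, hedged illustration: it only shows the direction rule and the single-path case of cancelling an adjacent downward/upward pair, and omits the cross-path checks described above.

```python
from typing import List, Optional

def merge_direction(t: "Operator", accel_format: str) -> Optional[str]:
    """Direction rule: if the transpose's output format equals the format
    selected by the acceleration unit, it merges upward; otherwise, if its
    input format equals that format, it merges downward."""
    if t.out_format == accel_format:
        return "up"
    if t.in_format == accel_format:
        return "down"
    return None

def cancel_in_path(path: List["Operator"], accel_format: str) -> List["Operator"]:
    """Single-path simplification: an adjacent (downward, upward) pair of
    transpose operators cancels out and is removed from the path."""
    transposes = [op for op in path if op.is_transpose()]
    removed = set()
    for a, b in zip(transposes, transposes[1:]):
        if (id(a) not in removed and id(b) not in removed
                and merge_direction(a, accel_format) == "down"
                and merge_direction(b, accel_format) == "up"):
            removed.update({id(a), id(b)})
    return [op for op in path if id(op) not in removed]
```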
Optionally, the data format of the computation graph to be optimized and the data format selected by the specific acceleration unit are one of the following data formats: NHWC and NCHW.
Optionally, the instruction execution unit further implements: performing graph cutting on the complete computation graph of a specific deep learning model to obtain a plurality of subgraphs, and taking one of the subgraphs as the computation graph to be optimized.
Optionally, the instruction execution unit further implements: converting the complete computation graph of a specific deep learning model into an intermediate representation conforming to the specific acceleration unit, performing at least one of operator merging, quantization and model pruning on the intermediate representation, performing graph cutting on the processed intermediate representation to obtain a plurality of subgraphs, and taking one of the subgraphs as the computation graph to be optimized.
Optionally, the computational graph to be optimized is a complete computational graph of a specific deep learning model.
Optionally, the step of determining a plurality of paths comprises:
searching for all paths that extend from the input operator to the output operator of the computation graph to be optimized and contain at least one transpose operator, and forming a path set from all such paths;
and, for each path, determining one by one whether the transpose operators on the path are the same as the transpose operators on the other paths, and keeping the path in the path set if a different transpose operator exists.
Optionally, the step of determining a plurality of paths comprises:
searching for each path extending from the input operator to the output operator of the computation graph to be optimized, then determining whether a transpose operator exists on the path and whether the path has a transpose operator different from those of the paths already in the path set, and storing the path into the path set only if both conditions hold.
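One plausible reading of the second option is sketched below, continuing the toy Operator/Graph structure: paths are enumerated depth-first from the input operator to the output operator, and a path is kept only if it contains a transpose operator and contributes at least one transpose operator not already covered by the paths kept so far. The selection criterion here is an illustrative assumption, not the disclosure's exact procedure.

```python
from typing import List

def find_transpose_paths(graph: "Graph") -> List[List["Operator"]]:
    """Enumerate input-to-output paths containing at least one transpose
    operator, keeping a path only if it adds a transpose operator not present
    in the paths already kept."""
    kept: List[List["Operator"]] = []
    covered: set = set()

    def dfs(op: "Operator", path: List["Operator"]) -> None:
        path.append(op)
        if op is graph.output_op:
            transposes = {id(t) for t in path if t.is_transpose()}
            if transposes and not transposes <= covered:
                kept.append(list(path))
                covered.update(transposes)
        else:
            for nxt in op.outputs:
                dfs(nxt, path)
        path.pop()

    dfs(graph.input_op, [])
    return kept
```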
Optionally, the merging direction of each transpose operator is determined according to tensor dimension conversion performed by the transpose operator.
In a second aspect, an embodiment of the present disclosure provides a computing device, including an acceleration unit, a memory, and the processing unit described in any one of the above.
In a third aspect, an embodiment of the present disclosure provides a computation graph optimization method for a deep learning model, including:
determining a plurality of paths that extend from an input operator to an output operator of a computation graph to be optimized and contain at least one transpose operator, wherein each of the plurality of paths has at least one transpose operator different from those of the other paths;
determining the merging direction of each transpose operator in each path; and
determining, according to the merging direction of each transpose operator in each path, the transpose operators that can be cancelled out in one or more paths, and removing the cancellable transpose operators from the one or more paths to obtain the optimized computation graph.
Optionally, the method further comprises:
acquiring an initial computation graph;
determining whether the data format of the initial computation graph is the same as the data format selected by the specific acceleration unit; if the two formats are the same, taking the initial computation graph as the computation graph to be optimized; if they are different, sequentially inserting a first transpose operator and a second transpose operator before the input operator of the initial computation graph, sequentially inserting a third transpose operator and a fourth transpose operator after the output operator of the initial computation graph, and taking the initial computation graph with the inserted transpose operators as the computation graph to be optimized, wherein the first transpose operator and the third transpose operator convert the data format of the computation graph to be optimized into the data format selected by the specific acceleration unit, and the second transpose operator and the fourth transpose operator convert the data format selected by the specific acceleration unit into the data format of the computation graph to be optimized.
Optionally, the determining the merging direction of each transpose operator in each path includes:
for each transpose operator in each path of the path set, first determining whether the data format of the transpose operator's output tensor is the same as the data format selected by the specific acceleration unit; if so, the merging direction of the transpose operator is upward; if not, further determining whether the data format of the transpose operator's input tensor is the same as the data format selected by the specific acceleration unit, and if so, the merging direction of the transpose operator is downward;
and the determining of the transpose operators that can be cancelled out in one or more paths according to the merging direction of each transpose operator in each path, and the removing of those transpose operators, then include:
traversing each path in turn and, for a first transpose operator and a second transpose operator that are adjacent on a path and whose merging directions are respectively downward and upward, determining whether the first transpose operator and the second transpose operator always exist in the same paths; if so, determining that the first transpose operator and the second transpose operator cancel each other out, and removing both;
if not: when the first transpose operator is determined to exist alone in at least one path and the merging direction of the next transpose operator adjacent to the first transpose operator in all paths containing the first transpose operator is upward, determining that the first transpose operator and that adjacent next transpose operator cancel each other out, and removing them from all paths containing the first transpose operator; when the second transpose operator is determined to exist alone in a path and the merging direction of the previous transpose operator adjacent to the second transpose operator in all paths containing the second transpose operator is downward, determining that the second transpose operator and that adjacent previous transpose operator cancel each other out, and removing them from all paths containing the second transpose operator.
Optionally, the data format of the computation graph to be optimized and the data format selected by the specific acceleration unit are one of the following data formats: NHWC and NCHW.
Optionally, the method further comprises: before the step of determining whether the data format of the computation graph to be optimized is the same as the data format selected by the specific acceleration unit,
performing graph cutting on the complete computation graph of a specific deep learning model to obtain a plurality of subgraphs, and taking one of the subgraphs as the computation graph to be optimized.
Optionally, the method further comprises: before the step of determining whether the data format of the computation graph to be optimized is the same as the data format selected by the specific acceleration unit,
converting the complete computation graph of a specific deep learning model into an intermediate representation conforming to the specific acceleration unit, performing at least one of operator merging, quantization and model pruning on the intermediate representation, performing graph cutting on the processed intermediate representation to obtain a plurality of subgraphs, and taking one of the subgraphs as the computation graph to be optimized.
Optionally, the computational graph to be optimized is a complete computational graph of a specific deep learning model.
Optionally, the step of determining a plurality of paths comprises:
searching for all paths that extend from the input operator to the output operator of the computation graph to be optimized and contain at least one transpose operator, and forming a path set from all such paths;
and, for each path, determining one by one whether the transpose operators on the path are the same as the transpose operators on the other paths, and keeping the path in the path set if a different transpose operator exists.
Optionally, the step of determining a plurality of paths comprises:
searching for each path extending from the input operator to the output operator of the computation graph to be optimized, then determining whether a transpose operator exists on the path and whether the path has a transpose operator different from those of the paths already in the path set, and storing the path into the path set only if both conditions hold.
In a fourth aspect, an embodiment of the present disclosure provides a data center including any of the computing devices described above.
The computation graph optimization method of the embodiments of the present disclosure determines the merging direction of each transpose operator in each path of the computation graph, determines according to these merging directions whether adjacent transpose operators in each path can cancel each other out, and removes the cancellable transpose operators accordingly, thereby obtaining the optimized computation graph. Because unnecessary transpose operators are removed, the optimized computation graph reduces resource overhead.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which refers to the accompanying drawings in which:
FIG. 1 illustrates a hierarchical structure diagram of a data center to which one embodiment of the present disclosure is applied;
FIG. 2 is a block diagram of a data center to which one embodiment of the present disclosure is applied;
FIG. 3 is a block diagram of the internal structure of a server in a data center according to an embodiment of the present disclosure;
FIG. 4 is a diagram of a control relationship of a Central Processing Unit (CPU) and an acceleration unit within a server according to one embodiment of the present disclosure;
FIG. 5 is an internal block diagram of an acceleration unit core according to one embodiment of the present disclosure;
FIG. 6 is a software architecture diagram of a hierarchical design;
FIG. 7 is an example diagram of computational graph conversion;
FIGS. 8a and 8b are flow diagrams of computational graph optimization methods provided by one embodiment and another embodiment of the present disclosure;
FIGS. 9a and 9b are flowcharts of one embodiment of the combination of steps S802 and S803 of FIG. 8a and of the combination of steps S815 and S816 of FIG. 8b;
FIGS. 10a-10c are examples according to embodiments of the present disclosure.
Detailed Description
The present disclosure is described below based on examples, but it is not limited to only these examples. In the following detailed description of the present disclosure, some specific details are set forth. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods and procedures have not been described in detail so as not to obscure the present disclosure. The figures are not necessarily drawn to scale.
The following terms are used herein.
An acceleration unit: a processing unit designed to improve data processing speed in special-purpose fields. It is often used together with a general-purpose processor (CPU), operates under the control of the general-purpose processor, performs processing for a special purpose or a special field, and improves computer processing efficiency for that purpose or field. It may also be referred to as an AI processing unit, and may include a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) and dedicated AI acceleration hardware (e.g., the acceleration unit described herein).
On-chip memory: memory that is used independently within the primary core or the secondary core and cannot be shared.
A command processor: a command interface between the acceleration unit and a central processing unit that drives the acceleration unit in operation. The command processor receives instructions that the central processing unit makes the acceleration unit execute, and distributes the instructions to each core in the acceleration unit for execution. In addition, it is also responsible for the synchronization of the various cores in the acceleration unit.
Life cycle: the portion of an instruction sequence between the first occurrence of an operand and its last use is the operand's life cycle. After its life cycle, the operand is no longer used and does not have to remain in on-chip memory.
A neural network: generally refers to an artificial neural network (ANN), an algorithmic network that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. A classical neural network, which is also the simplest neural network structure, comprises three levels: an input layer, an output layer and an intermediate layer (also called a hidden layer), each of which in turn includes a plurality of nodes.
A neural network model: in a neural network, each node is expressed mathematically to produce a mathematical model of the node; the mathematical models of the large number of nodes in the neural network together constitute the neural network model.
Deep learning model: the concept of deep learning stems from the study of neural networks; neural networks with many intermediate layers are referred to as deep learning networks, so in this sense a deep learning model is also a neural network model. Both deep learning models and neural network models must be generated through training: sample data is input into a designed network structure (i.e., the structure is fixed), feature information is extracted through a plurality of intermediate layers, and the weight parameters of the neurons are continuously corrected based on the output of the output layer so that the output approaches the expected result more and more closely, until the final weight parameters are determined. The trained deep learning model can then be applied in real scenarios; data on its use in those scenarios can also be collected and used in turn to optimize the model.
Node: the minimum unit of independent operation in a deep learning model. A node receives input and generates output after the input is operated on by the node's weight parameters or by other parameters of the model (such as hyperparameters). A deep learning model may include various specific operations such as convolution and pooling, and correspondingly various operation nodes such as convolution nodes and pooling nodes. The model has a plurality of layers, each layer has a plurality of nodes, and the output of each node is the input of nodes in the next layer. Specifically, a node includes the program and related data for its specific operation; for example, a convolution node includes the program code used for the convolution operation and some data used for convolution.
Operator: a set of operations built into a deep learning model to implement a particular function. Each layer of the deep learning model may contain a plurality of such operators. An operator may be called an operation in the TensorFlow framework and a layer in the Caffe framework. Operators can be regarded as a further abstraction on the basis of nodes, and one operator may correspond to one or more nodes; thus, operators and nodes sometimes correspond to the same program code.
Instruction set: the set of instructions supported inside the chip for performing operations, mainly the operations of deep learning operators such as convolution, pooling, ROI and the like.
Quantization: converting the inputs of the operation nodes in a deep learning model, as well as the weight parameters and other parameters of those nodes, from a high-precision data type to a low-precision data type, thereby reducing the demands on data throughput and storage space.
Inverse quantization: the inverse process of quantization, converting the inputs of the operation nodes in the deep learning model, and the weight parameters and other parameters of those nodes, from the low-precision data type back to the high-precision data type.
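As a hedged illustration of these two terms (a symmetric per-tensor int8 scheme is assumed here; it is only one of many possible quantization schemes):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Quantization: map float32 values to int8 with a per-tensor scale."""
    scale = max(float(np.max(np.abs(weights))), 1e-8) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Inverse quantization: map the int8 values back to float32."""
    return q.astype(np.float32) * scale

w = np.random.randn(3, 3).astype(np.float32)
q, s = quantize_int8(w)
w_approx = dequantize_int8(q, s)   # close to w, but stored in a quarter of the space
```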
Intermediate representation (IR): deep learning models built on different model frameworks (such as TensorFlow, PyTorch and MXNet) have different formats, and their code representations also differ, which makes general processing of deep learning models, such as quantization, difficult. An intermediate representation is a representation into which deep learning model code of various formats is converted so as to conform to one or more acceleration units. The meaning of each code statement of the deep learning model is analyzed, and the statements are translated into a universal form of expression according to their meaning, so that code statements with the same meaning in different deep learning models have the same intermediate representation. Tool products already exist for converting the representations of different deep learning models into an intermediate representation.
Computation graph: current deep learning frameworks mainly adopt two programming styles, declarative programming and imperative programming. In declarative programming, the program code defines a neural network model structure that describes the computation logic but is not executed immediately; it is executed only when the program code that invokes the network model structure runs. The structure comprises a plurality of operators (or their symbolic expressions) and the connection relations between the operators, and it can be shown graphically, so it is called a static computation graph. In imperative programming, the program code returns the operation result directly, and the neural network model structure is defined and executed at the same time, yielding a dynamic graph. Generally speaking, a static graph makes it easier to compile and optimize the overall neural network model, which is more beneficial to performance, while a dynamic graph is very convenient for the user when debugging a specific program.
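A toy sketch of the difference, in plain Python rather than any particular framework's API: in the declarative style, building an expression only records graph nodes and the computation is executed later when the graph is run; in the imperative style the result is returned immediately.

```python
class Node:
    """Declarative style: arithmetic on nodes builds a graph instead of computing."""
    def __init__(self, op, *inputs, value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __add__(self, other):
        return Node("add", self, other)

    def __mul__(self, other):
        return Node("mul", self, other)

def run(node):
    """Executing the static graph: only now are the recorded operators evaluated."""
    if node.op == "const":
        return node.value
    a, b = (run(i) for i in node.inputs)
    return a + b if node.op == "add" else a * b

# Declarative / static graph: y is a graph object, nothing has been computed yet.
x = Node("const", value=2.0)
y = x * Node("const", value=3.0) + Node("const", value=1.0)
print(run(y))              # 7.0 -- computed only when the graph is run

# Imperative / dynamic style: the result is returned immediately.
print(2.0 * 3.0 + 1.0)     # 7.0
```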
Fig. 1 shows a hierarchical structure diagram of a data center as one scenario to which an embodiment of the present disclosure is applied.
A data center is a globally collaborative network of devices that is used to communicate, accelerate, present, compute, store data information over an internet network infrastructure. In future development, the data center will become an asset for enterprise competition. With the popularization of data center applications, artificial intelligence and the like are increasingly applied to data centers. The neural network is an important technology of artificial intelligence, and is widely applied to big data analysis and operation of a data center.
In a conventional large data center, the network structure is generally a three-layer structure shown in fig. 1, i.e., a hierarchical interconnection network model (hierarchical inter-networking model). This model contains the following three layers:
Access layer (Access Layer) 103: sometimes referred to as the edge layer, it includes the access switches 130 and the servers 140 connected to them. Each server 140 is a processing and storage entity of the data center, in which large amounts of data are processed and stored. An access switch 130 is a switch used to connect these servers to the data center; one access switch 130 connects multiple servers 140. Access switches 130 are typically located at the top of the rack, so they are also called Top of Rack switches, and they physically connect the servers.
Aggregation Layer (Aggregation Layer) 102: sometimes referred to as the distribution layer, includes aggregation switches 120. Each aggregation switch 120 connects multiple access switches while providing other services such as firewalls, intrusion detection, network analysis, and the like.
Core Layer (Core Layer) 101: including core switches 110. Core switches 110 provide high-speed forwarding of packets to and from the data center and connectivity for multiple aggregation layers. The entire data center network is divided into an L3 layer routing network and an L2 layer routing network, and the core switch 110 provides a flexible L3 layer routing network for the entire data center network.
Typically, the aggregation switch 120 is the demarcation point between the L2 and L3 routing networks: below the aggregation switch 120 is L2, and above it is L3. Each group of aggregation switches manages a point of delivery (POD), and each POD is an independent VLAN network. Server migration within a POD does not require modifying IP addresses and default gateways, because one POD corresponds to one L2 broadcast domain.
A Spanning Tree Protocol (STP) is typically used between aggregation switch 120 and access switch 130. STP makes only one aggregation layer switch 120 available for a VLAN network and the other aggregation layer switches 120 are used in the event of a failure (dashed lines in the upper figure). That is, at the aggregation level, no horizontal scaling is done, since only one is still working even if multiple aggregation switches 120 are added.
FIG. 2 illustrates the physical connections of the components in the hierarchical data center of FIG. 1. As shown in fig. 2, one core switch 110 connects to multiple aggregation switches 120, one aggregation switch 120 connects to multiple access switches 130, and one access switch 130 accesses multiple servers 140.
Server
Since the server 140 is the actual device of the data center, FIG. 3 shows a block diagram of the internal structure of the server 140. The server 140 includes a memory 210, a central processing unit (CPU) 220 and various acceleration units, all of which are connected by a bus. These acceleration units include the acceleration unit 230, a data transmission unit (DTU) 260, a graphics processing unit (GPU, not shown), an application-specific integrated circuit (ASIC, not shown) and a field programmable gate array (FPGA, not shown).
In a traditional processor architecture, the control unit and the storage unit occupy a large part of the architecture while the space occupied by the computing units is insufficient, so the architecture is very effective at logic control but not efficient at large-scale parallel computing. Therefore, various dedicated acceleration units have been developed to perform more efficient processing and increase operation speed for computations with different functions and in different fields. The acceleration unit proposed by the present disclosure may be any one of them; these acceleration units are described below.
The acceleration unit 230: a processing unit that adopts a data-driven parallel computing architecture and is used to process the large number of operations (such as convolution and pooling) of each neural network node. Because the data and the intermediate results of these operations are closely tied together and frequently used throughout the computation, using the existing CPU architecture, whose in-core memory capacity is small, would require frequent access to external storage outside the core and would therefore be inefficient. With the acceleration unit, each core has an on-chip memory whose storage capacity is suited to neural network computation, so frequent access to memory outside the core is avoided, processing efficiency can be greatly improved and computation performance is increased.
The data transmission unit (DTU) 260 is a wireless terminal device dedicated to converting serial data into IP data, or IP data into serial data, for transmission over a wireless communication network. The main function of the DTU is to transmit data from the remote device wirelessly back to the back-office center. At the front end, the DTU interfaces with the customer's equipment. After the DTU is powered on, it first registers with the mobile GPRS network and then establishes a socket connection with the background center configured in the DTU. The background center is the server side of the socket connection, and the DTU is the client side. The DTU is therefore used together with the background software; once the connection is established, the front-end device and the background center can transmit data wirelessly through the DTU.
Graphics processing unit (GPU): a microprocessor dedicated to image- and graphics-related operations. The GPU makes up for the shortage of computing-unit space in the CPU by employing a large number of computing units dedicated to graphics computation, allowing the graphics card to reduce its dependence on the CPU and take over some of the computation-intensive graphics image processing work originally handled by the CPU.
Application Specific Integrated Circuit (ASIC): refers to integrated circuits designed and manufactured to meet the needs of a particular user and the needs of a particular electronic system. Since such integrated circuits are customized to the user's requirements, their structure is often tailored to the specific user's requirements.
Field programmable gate array (FPGA): a product developed further on the basis of programmable devices such as PAL and GAL. As a semi-custom circuit in the field of application-specific integrated circuits, it not only remedies the shortcomings of fully custom circuits but also overcomes the limitation on the number of gate circuits of earlier programmable devices.
The acceleration unit, although having the advantage of performing significantly more efficiently than a normal processor for a particular application or domain, is also under the control of the processing unit 220. Taking an acceleration unit dedicated to deep learning models as an example, the memory 210 stores various deep learning models including neurons of these models and weight data of the neurons, and the like. These deep learning models are deployed by a processing unit 220 to an acceleration unit 230 in fig. 3 when needed. Specifically, the processing unit 220 may inform the acceleration unit 230 of the storage location of the deep learning model of the acceleration unit 230 in the memory 210 in the form of instructions. The acceleration unit 230 may then address the locations and store the instructions to be executed in its on-chip memory. The processing unit 220 may also send an instruction to be executed by the acceleration unit 230 to the acceleration unit 230 in the form of an instruction, and the acceleration unit 230 receives the instruction and stores the instruction in the on-chip memory. Similarly, the acceleration unit 230 may also acquire input data in the above manner. The acceleration unit 230 acquires instructions to be executed and input data to perform inferential computations. The weight parameters of the nodes may be included in the instruction sequence of the deep learning model and retrieved from the memory 210 by the acceleration unit 230. Of course, the weight parameters of the nodes may also be stored separately and retrieved from the memory 210 by the acceleration unit 230 when needed. The processing unit 220 may be understood as a hardware unit with scheduling and control capability, and may be a Central Processing Unit (CPU), a microcontroller, a microprocessor, or the like.
Internal structure of the processing unit and the acceleration unit 230
How the processing unit controls the operation of the acceleration unit will be described in conjunction with the internal structure diagram of the processing unit and the acceleration unit 230 of fig. 4.
As shown in FIG. 4, the processing unit 220 includes a plurality of processor cores 222 and a cache 221 shared by the plurality of processor cores 222. Each processor core 222 includes an instruction fetch unit 223, an instruction decode unit 224, an instruction issue unit 225 and an instruction execution unit 226.
The instruction fetch unit 223 is configured to move an instruction to be executed from the memory 210 into an instruction register (which may be a register used for storing instructions in the register file 229 shown in FIG. 4) and to receive or compute the next instruction fetch address according to an instruction fetch algorithm, which includes, for example, incrementing or decrementing the address according to the instruction length.
After an instruction is fetched, the processing unit 220 enters an instruction decode stage, in which the instruction decode unit 224 decodes the fetched instruction according to a predetermined instruction format to obtain the operand fetch information required by the fetched instruction, in preparation for operation by the instruction execution unit 226. The operand fetch information points, for example, to an immediate, a register, or other software/hardware capable of providing a source operand.
An instruction issue unit 225 is located between the instruction decode unit 224 and the instruction execution unit 226 for scheduling and control of instructions to efficiently allocate individual instructions to different instruction execution units 226, enabling parallel operation of multiple instructions.
After instruction issue unit 225 issues an instruction to instruction execution unit 226, instruction execution unit 226 begins executing the instruction. But if the instruction execution unit 226 determines that the instruction should be executed by an acceleration unit, it is forwarded to the corresponding acceleration unit for execution. For example, if the instruction is a neural network inference (inference) instruction, instruction execution unit 226 no longer executes the instruction, but rather sends the instruction over the bus to acceleration unit 230 for execution by acceleration unit 230.
The acceleration unit 230 internally includes a plurality of cores 236 (four cores are shown in FIG. 4, but those skilled in the art will understand that other numbers of cores are possible), a command processor 237, a direct memory access mechanism 235 and a bus channel 231.
Bus channel 231 is a channel for instructions to pass from the bus to and from acceleration unit 230. According to different mechanisms, bus channels 231 may include PCIE channel 232, I2C channel 233, JTAG channel 234.
PCIE, i.e. PCI-Express, is a high-speed serial computer expansion bus standard proposed by Intel in 2001 and intended to replace the older PCI, PCI-X and AGP bus standards. PCIE provides high-speed serial point-to-point dual-channel high-bandwidth transmission; connected devices are allocated dedicated channel bandwidth and do not share bus bandwidth, and PCIE mainly supports functions such as active power management, error reporting, end-to-end reliable transmission, hot plugging and quality of service. Its main advantages are high data transmission rate and considerable development potential. Currently, most PCIE buses are PCIE GEN3, but embodiments of the present disclosure may also adopt PCIE GEN4, i.e. a bus channel conforming to the PCI-Express 4.0 standard.
The I2C channel 233 is a simple, bi-directional two-wire synchronous serial bus channel developed by Philips corporation. It requires only two wires to transfer information between devices connected to the bus.
JTAG is an abbreviation of Joint Test Action Group (Joint Test Action Group) and is a common name in standard 1149.1 of IEEE entitled standard Test access port and boundary scan architecture. This standard is used to verify the functionality of the printed circuit board as designed and tested. JTAG was formally standardized by IEEE documents 1149.1-1990, and supplementary documents were added to describe the Boundary Scan Description Language (BSDL) in 1994. Since then, this standard has been widely adopted by electronic enterprises worldwide. Boundary scan is almost synonymous with JTAG. JTAG channel 234 is a bus channel conforming to this standard.
Direct Memory Access (DMA) mechanism 235 is a function provided by some computer bus architectures that enables data to be written directly from an attached device (e.g., external storage) to the on-chip Memory of acceleration unit 230. This greatly increases the efficiency of data access compared to the way all data transfers between devices are through the command handler 237. Due to such a mechanism, the core of the acceleration unit 230 can directly access the memory 210, read parameters (e.g., weight parameters of each node) in the deep learning model, and the like, and greatly improve data access efficiency. Although the direct memory access mechanism 235 is shown between the processor 237 and the bus channel 231, the design of the acceleration unit 230 is not so limited. In some hardware designs, each accelerator unit core 236 may include a direct memory access mechanism 235 so that the accelerator unit core 236 reads data from an attached device and writes to the on-chip memory of the accelerator unit 230 directly, without going through the command processor 237.
The command processor 237 distributes the instructions provided by the processing unit 220 to the acceleration unit 230 for execution by the cores 236. The instruction execution unit 226 either sends the instructions that require execution by the acceleration unit 230 to the acceleration unit 230, or informs the acceleration unit 230 of the storage location of those instructions in the memory 210. After the instruction sequence to be executed enters through the bus channel 231, it is buffered in the command processor 237, and the command processor 237 selects a core 236 and allocates the instruction sequence to it for execution. The instructions to be executed come from a compiled deep learning model. It should be understood that the instruction sequence to be executed may include instructions to be executed in the processing unit 220 and instructions to be executed in the acceleration unit 230.
Accelerating unit core
FIG. 5 is an internal block diagram of a core of an acceleration unit according to one embodiment of the present disclosure.
In one embodiment, as shown in FIG. 5, the core 236 includes a tensor engine 310, a pooling engine 320, convolution processing 330, an activate operation 380, a sequencer 350, an instruction buffer 340, on-chip memory 360, and a constant buffer 370. Tensor engine 310, pooling engine 320, convolution process 330, and activate operation 380 are all hardware execution units. The hardware execution unit is a hardware module for actually executing various operations. Still other hardware execution units are not shown in the figure.
Instruction sequences assigned to core 236 by command processor 237 are first buffered in instruction buffer 340. The sequencer 350 then fetches instructions from the instruction buffer 340 in a first-in, first-out order, and allocates the instructions to the various hardware execution units for execution based on their properties. The tensor engine 310 is responsible for handling tensor dependent operations in the deep learning model. The pooling engine 320 is responsible for handling pooling operations in the deep learning model. The convolution process 330 is responsible for convolution operations in the deep learning model. Activation operation 380 is used to perform the operations corresponding to the activation functions in the deep learning model. The sequencer 350 determines whether the fetched instruction is to be allocated to each hardware execution unit for execution, based on the operation properties such as convolution, matrix multiplication, pooling, and the like.
The on-chip memory 360 is an in-core memory that stores the weight parameters in the deep learning model, as well as the inputs and various intermediate results when the deep learning model is actually used. The constant buffer 370 is a buffer that stores constant parameters other than the weight parameters in the deep learning model (e.g., hyper-parameters in the deep learning model). As described above, in the process of the processing unit 220 pre-configuring the deep learning model in the acceleration unit 230, the processing unit 220 may send the position of the parameters in the model in the memory 210 to the acceleration unit 230 in the form of instructions. These parameters include the weight of the node and other parameters (e.g., hyper-parameters). With respect to the weights, the acceleration unit 230 is fetched from the corresponding location of the storage 210 and placed in the on-chip memory 360, if necessary. For other parameters, the acceleration unit 230 is fetched from the corresponding location of the memory 210 and placed in the constant buffer 370 if necessary. In addition, when executable instructions are allocated by the command processor 237 for execution by the cores 236, the input parameters in the instructions (inputs to the deep learning model) are also stored in the on-chip memory 360. In addition, after the tensor engine 310 and the pooling engine 320 perform convolution or pooling operation, various intermediate results obtained are also stored in the on-chip memory 360.
Software architecture diagram
The improvement of the deep learning model requires not only the support of the above hardware layer, but also continuous improvement of the software layer and the algorithm layer. Only the combination of the underlying hardware support and the above deep learning algorithm structure can deliver a powerful computational engine.
FIG. 6 is a software architecture diagram of a layered design. Hierarchical software design is the predominant design approach for large software projects. The method can reduce the dependency relationship between layers, so that developers can only pay attention to one layer in the whole structure, and easily replace the original layer with new program codes.
As shown in the figure, the software architecture includes, from top to bottom, an application layer 401, a framework layer 402 and a functional layer.
The application layer 401 contains applications of deep learning models in specific scenarios, such as vision 405, natural language 406 and recommendation 407. These applications are built using this architecture and can call the architecture's runtime interface so that the application obtains inference capability.
The framework layer 402 integrates various deep learning frameworks such as TensorFlow 408, MXNet 409 and Caffe 410, and provides an operator library and tools so that various algorithms can continue to be optimized and improved. TensorFlow 408 is a symbolic mathematics system based on dataflow programming and is widely used to implement various machine learning algorithms. MXNet 409 is the deep learning library chosen by Amazon. Caffe 410, whose full name is Convolutional Architecture for Fast Feature Embedding, is a deep learning framework featuring expressiveness, speed and modularity.
The functional layer includes a compilation stack 403 and a run stack 404. The compilation stack 403 is used to convert 411, quantize 412, optimize 413, and compile 414 the various models. Conversion 411 transforms a model internally into an intermediate representation (IR) format. Quantization 412 converts parameters such as the weights of the deep learning model and its inputs from high-precision data types to low-precision data types. Optimization 413 performs operations such as fusing operators inside the model and linking the optimization of multiple models. Compilation 414 optimizes the model for the target hardware and generates a binary model that the hardware can recognize. The run stack 404 includes a run API 415, an execution manager 416, a user-mode driver 417, and a kernel-mode driver 418. The execution manager 416 performs resource allocation and batch scheduling. The run API 415 provides interfaces that various runtimes can call. The user-mode driver 417 provides hardware commands and resource scheduling in user mode. The kernel-mode driver 418 provides task scheduling, hardware control, and the like in kernel mode.
Multiple mainstream deep learning models can be integrated into one open-source platform, so that a developer can develop, compile, and run them on a single platform without having to deploy and maintain multiple model frameworks. The open-source platform can also be extended to support more deep learning models.
Based on the software architecture described above, the following describes how a computation graph is processed in an actual development flow.
Referring to FIG. 7: first, the various deep learning frameworks support models A to M, and a first computation graph 701 from one specific deep learning framework is converted into an intermediate representation 702. The intermediate representation 702 predefines a number of operators and their attributes. Converting the first computation graph 701 into the intermediate representation 702 therefore means converting each operator and its attributes in the first computation graph 701 into the corresponding operator and attributes defined by the intermediate representation 702. This conversion may be achieved by means of the conversion 411, which defines mapping functions that implement the operator conversion. A mapping function converts between operators that are functionally identical but have different attributes. Developers, who know the specific functions and attribute definitions of the operators, define these mapping functions in advance for operators with the same function but different attributes. In the operator mapping table shown in Table 1, the left column is the identifier of the first operator and the right column is the name of the corresponding mapping function.
Table 1
Operator and attribute conversion is not limited to mapping functions; it can also be accomplished in other ways.
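By way of illustration only, the following Python sketch shows one possible shape of such a mapping table and its mapping functions. The operator names, attribute names, and graph representation are assumptions made for the example; they are not the actual contents of Table 1 or of the conversion 411.

```python
# Hypothetical operator mapping table in the spirit of Table 1. The left column of the
# table corresponds to the dictionary keys; the right column corresponds to the functions.

def map_conv2d(op):
    """Convert a framework-specific Conv2D node into an assumed IR convolution operator."""
    return {
        "type": "ir.conv",
        "attrs": {
            # Attribute names are renamed to match the (assumed) IR definition.
            "kernel": op["attrs"]["kernel_size"],
            "stride": op["attrs"].get("strides", (1, 1)),
            "pad": op["attrs"].get("padding", "SAME"),
        },
        "inputs": op["inputs"],
    }

def map_identity(op):
    """Operators whose IR definition matches the framework definition pass through."""
    return dict(op, type="ir." + op["type"].lower())

OPERATOR_MAPPING_TABLE = {
    "Conv2D": map_conv2d,
    "Relu": map_identity,
    "Add": map_identity,
}

def convert_to_ir(first_graph):
    """Convert every operator of the first computation graph into IR operators."""
    return [OPERATOR_MAPPING_TABLE[op["type"]](op) for op in first_graph]

# Toy usage with an assumed two-operator graph.
graph = [{"type": "Conv2D", "attrs": {"kernel_size": (3, 3)}, "inputs": ["x"]},
         {"type": "Relu", "attrs": {}, "inputs": ["conv_out"]}]
print(convert_to_ir(graph))
```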
Then, model processing such as operator merging, model pruning, quantization, and graph cutting is performed on the intermediate representation 702, and a second computation graph 703 is output. Quantization inserts quantization nodes and dequantization nodes into the intermediate representation 702. Operator merging combines two or more operators into one operator according to the hardware. Graph cutting divides the intermediate representation into several subgraphs to facilitate reading and processing by the acceleration unit. Model pruning is a model compression method: it introduces sparsity into the dense connections of the deep learning model and reduces the number of non-zero weights by directly setting 'unimportant' weights to zero. The second computation graph 703 can then be deployed for execution on a designated acceleration unit.
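To make the pruning step concrete, the following is a minimal magnitude-based sketch in numpy; the threshold, the array values, and the criterion itself are illustrative assumptions rather than the pruning strategy prescribed by the toolchain.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Set 'unimportant' (small-magnitude) weights to zero, introducing sparsity."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

w = np.array([[0.8, -0.02, 0.4], [0.01, -0.6, 0.03]])
print(magnitude_prune(w, threshold=0.05))
# [[ 0.8  0.   0.4]
#  [ 0.  -0.6  0. ]]
```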
Finally, if necessary, the second computation graph 703 is converted back into a computation graph supported by a specific deep learning framework; that is, the operators and their attributes in the second computation graph 703 are converted into operators and attributes supported by that framework. This conversion can also be implemented with mapping functions, although other approaches are possible. In addition, since a computation graph contains both operators and their connection relationships, after the operator conversion it is also necessary to verify that the connections between operators are still correct, or to rebuild those connections directly.
Data format and transposition operator
Tensors are generalizations of scalars, vectors, and matrices; in a broad sense, tensors include scalars, vectors, two-dimensional matrices, and matrices of three or more dimensions (i.e., high-dimensional matrices). The data format characterizes how the raw data is arranged within the tensor. Take image data as an example. Generally, the raw image data input to a deep learning model is a batch of images represented by a four-dimensional matrix, and the data format of this four-dimensional matrix is expressed by a one-dimensional array that specifies how the image data is organized and stored; the image data can only be read correctly if the data format is known. Typically each deep learning framework supports one or more data formats. For example, some deep learning frameworks use the four characters 'NHWC' to denote a supported data format: N is the number of images in the batch, H is the number of pixels of each image in the vertical direction, W is the number of pixels of each image in the horizontal direction, and C is the number of channels (for example, an RGB image has 3 channels). When the data format is specified as 'NCHW', the channel dimension C sits at an outer layer of the tensor and the pixels within each channel are stored contiguously, so the layout resembles 'RRRRRRGGGGGGBBBBBB'. When the data format is specified as 'NHWC', C is the innermost dimension and the pixels of the different channels at the same spatial position are stored next to each other, so the layout resembles 'RGBRGBRGBRGBRGBRGB'.
The transposition operator performs data format conversion of a tensor (tensor dimension conversion). It takes two input parameters: one is the input tensor and the other is a control parameter that specifies how the dimensions of the tensor are to be permuted. A hypothetical example illustrates how the transposition operator works. Assume the input tensor of the transposition operator is a = [[[0,1,2,3], [4,5,6,7], [8,9,10,11]], [[12,13,14,15], [16,17,18,19], [20,21,22,23]]], and the control parameter perm is [0,2,1], which indicates that the tensor a is to be converted from dimensions 2 × 3 × 4 to dimensions 2 × 4 × 3. The input tensor a can first be viewed as two 3 × 4 matrices a1 and a2, as shown below.
Transforming the tensor a from dimensions 2 × 3 × 4 to dimensions 2 × 4 × 3 means transposing a1 from 3 × 4 to 4 × 3 to obtain C1, and transposing a2 from 3 × 4 to 4 × 3 to obtain C2, as shown below.
Finally, C1 and C2 are combined to obtain the 2 × 4 × 3 tensor C = [[[0,4,8], [1,5,9], [2,6,10], [3,7,11]], [[12,16,20], [13,17,21], [14,18,22], [15,19,23]]].
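The worked example above can be checked with numpy's transpose, which likewise takes an input tensor and a permutation of its axes:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)   # the tensor a from the example, shape 2 x 3 x 4
perm = (0, 2, 1)                     # the control parameter perm
c = np.transpose(a, perm)            # shape becomes 2 x 4 x 3

print(c.shape)   # (2, 4, 3)
print(c[0])      # rows: [0 4 8], [1 5 9], [2 6 10], [3 7 11]
print(c[1])      # rows: [12 16 20], [13 17 21], [14 18 22], [15 19 23]
```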
Returning to the image data described earlier: suppose a tensor T1 with data format 'NCHW' is T1 = [[R, R], [G, G], [B, B]], and a tensor T2 with data format 'NHWC' is T2 = [[R, G, B], [R, G, B]]. Swapping the rows and columns of T1 with a transposition operator yields the tensor T2. The transposition operator thus realizes the conversion of a tensor's data format.
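Similarly, the NCHW-to-NHWC conversion described here corresponds to the axis permutation (0, 2, 3, 1). A small numpy sketch with assumed toy shapes:

```python
import numpy as np

# Assumed toy batch: 1 image, 3 channels, 2 x 2 pixels, laid out as NCHW.
t_nchw = np.arange(12).reshape(1, 3, 2, 2)

# perm (0, 2, 3, 1) moves the channel axis to the innermost position: NCHW -> NHWC.
t_nhwc = np.transpose(t_nchw, (0, 2, 3, 1))
print(t_nchw.shape, "->", t_nhwc.shape)   # (1, 3, 2, 2) -> (1, 2, 2, 3)

# The inverse permutation (0, 3, 1, 2) converts NHWC back to NCHW.
assert np.array_equal(np.transpose(t_nhwc, (0, 3, 1, 2)), t_nchw)
```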
Technical scheme of the embodiment of the disclosure
As described in the background, a customer provides the computation graph of a deep learning model to be run in the data center to the cloud service provider. Before doing so, however, the customer has usually already applied various processing steps to the computation graph in order to adapt it to the acceleration unit it previously ran on, such as adjusting the data format of the computation graph. Those skilled in the art will appreciate that different acceleration units may use different data formats: for example, a graphics processing unit that mostly processes image data prefers the pixels of the same channel to be contiguous and therefore uses 'NCHW', while another acceleration unit may use 'NHWC'. Therefore, when a computation graph is deployed to a specific acceleration unit, a developer checks whether the data format of the computation graph (i.e., the data format of its input tensor) is the same as the data format selected by that acceleration unit. If they differ, transposition operators are inserted before and after the computation graph: the transposition operator in front converts the input tensor into the data format selected by the acceleration unit, and the transposition operator behind converts the output tensor back into the original data format. Similarly, when a developer needs to test a subgraph on the processor of a terminal, transposition operators are also inserted so that the subgraph can execute normally, and so on. All of these operations cause the computation graph that the customer finally provides to the cloud service provider to contain unnecessary transposition operators.
Therefore, the embodiments of the present disclosure provide a computation graph optimization method that eliminates unnecessary transposition operators in a computation graph, thereby reducing unnecessary resource overhead and helping to improve the inference performance of the deep learning model. The method is shown in FIG. 8a and includes the following steps.
In step S801, a plurality of paths is obtained from the computation graph G1 to be optimized. Each of these paths extends from an input operator to an output operator of the computation graph to be optimized and includes at least one transpose operator.
Referring to FIG. 7, the computation graph G1 to be optimized may be the first computation graph 701 or the second computation graph 703; that is, it may be the complete static computation graph of the deep learning model under a specific framework, or the complete static computation graph after model processing. The computation graph G1 to be optimized may also be a partial computation graph (also called a subgraph) of the first computation graph 701 or of the second computation graph 703. A subgraph results from a graph cutting operation on the complete computation graph, and any cutting strategy may be used. For example, operators that cannot be executed by the specific acceleration unit may be used as separators for cutting, yielding a plurality of subgraphs, one of which is taken as the computation graph G1 to be optimized. The data format of the computation graph G1 to be optimized is determined by the data format of its input tensor. For a complete computation graph, the data format of the input tensor is specified in the code that generates the graph, so the data format is fixed once the graph has been generated. The data format of a subgraph is likewise the data format of its input tensor; it may be the same as that of the complete computation graph, but it may also differ, because the complete graph contains transposition operators and the position of the cut affects the subgraph's data format. Of course, the cut positions can be chosen so that the subgraphs keep the same data format as the complete computation graph. As for the data formats of computation graphs, the current mainstream frameworks support at least one of the two data formats NHWC and NCHW. For example, the TensorFlow framework supports both NHWC and NCHW, so a computation graph generated by TensorFlow uses one of them. An acceleration unit's data format is chosen when it is designed and manufactured, and most acceleration units select one of these two formats.
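As a side illustration of the cut strategy mentioned above (using operators the acceleration unit cannot execute as separators), the following sketch models the complete graph as a topologically ordered operator list; the operator names and the set of unsupported operators are assumptions.

```python
# Assumed operators the acceleration unit cannot run; real sets depend on the hardware.
UNSUPPORTED = {"nonzero", "topk"}

def cut_graph(ops):
    """Return the subgraphs obtained by using unsupported operators as separators."""
    subgraphs, current = [], []
    for op in ops:
        if op in UNSUPPORTED:
            if current:
                subgraphs.append(current)
            current = []
        else:
            current.append(op)
    if current:
        subgraphs.append(current)
    return subgraphs

print(cut_graph(["conv", "relu", "topk", "conv", "add"]))
# [['conv', 'relu'], ['conv', 'add']]
```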
Step S801 can be implemented in the following ways. In a first implementation, all paths extending from an input operator to an output operator and containing at least one transpose operator are first found, and all such paths form a path set; then, for each path in the set, it is judged whether its transpose operators are the same as those of the remaining paths, and if two paths contain exactly the same transpose operators they are treated as the same path and only one of them is kept in the path set. This step repeatedly removes unsatisfactory paths from the path set to ensure that any two paths in the set contain at least one different transpose operator. In a second implementation, an empty path set is constructed first; then each path extending from an input operator to an output operator is searched, and it is judged whether its transpose operators are the same as those of a path already in the set; if so, the path is not added to the path set, otherwise it is added.
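A minimal sketch of the second implementation is given below, assuming the computation graph is provided as an adjacency dictionary and that transpose operators can be recognized by name; these representations and the toy graph are illustrative assumptions.

```python
def enumerate_paths(graph, inputs, outputs, is_transpose):
    """Keep input->output paths so that any two kept paths differ in at least one transpose."""
    kept = []          # list of (path, frozenset of transpose operators on it)

    def dfs(node, path):
        path.append(node)
        if node in outputs:
            trans = frozenset(op for op in path if is_transpose(op))
            # keep only paths with at least one transpose whose transpose set is new
            if trans and all(trans != t for _, t in kept):
                kept.append((list(path), trans))
        else:
            for nxt in graph.get(node, []):
                dfs(nxt, path)
        path.pop()

    for src in inputs:
        dfs(src, [])
    return [p for p, _ in kept]

# Assumed toy graph corresponding to a two-branch subgraph.
g = {"conv1": ["trans2"], "trans2": ["relu3"], "relu3": ["conv4", "conv7"],
     "conv4": ["trans5"], "trans5": ["add6"],
     "conv7": ["trans8"], "trans8": ["mul9"]}
paths = enumerate_paths(g, inputs={"conv1"}, outputs={"add6", "mul9"},
                        is_transpose=lambda op: op.startswith("trans"))
print(paths)   # two paths: conv1..trans5..add6 and conv1..trans8..mul9
```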
In step S802, the merging direction of each transpose operator on each path is determined.
The transpose operator is used to convert one data format of the tensor into another data format. This step therefore determines the merging direction of each transpose operator according to the format conversion direction of each transpose operator for the tensor.
Taking the data formats NHWC and NCHW as an example, converting NHWC to NCHW is defined as a first direction and converting NCHW to NHWC as a second direction; the merging direction of a transpose operator that converts NHWC to NCHW can thus be defined as the first direction, and the merging direction of a transpose operator that converts NCHW to NHWC as the second direction. Of course, transpose operators can also convert among three or more data formats, and the merging direction of each transpose operator is likewise determined by the direction of its format conversion.
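A trivial sketch of this direction assignment, using the first/second naming above; the representation of an operator as a pair of formats is an assumption made for the example.

```python
FIRST, SECOND = "first (NHWC -> NCHW)", "second (NCHW -> NHWC)"

def merging_direction(in_format: str, out_format: str) -> str:
    """Label a transpose operator's merging direction from its format conversion."""
    if (in_format, out_format) == ("NHWC", "NCHW"):
        return FIRST
    if (in_format, out_format) == ("NCHW", "NHWC"):
        return SECOND
    raise ValueError("unsupported format conversion")

print(merging_direction("NHWC", "NCHW"))   # first (NHWC -> NCHW)
```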
In step S803, a cancelable transpose operator in one or more paths is determined according to the merging direction of each transpose operator in each path, and the cancelable transpose operator in one or more paths is cleared, so as to obtain the optimized computation graph G2.
Take two merging directions as an example: the first direction converts the data format from NHWC to NCHW and the second direction converts it from NCHW to NHWC, so for the same tensor these two conversions can cancel each other out. In step S803, however, for transpose operators that could cancel, it is necessary not only to judge from the merging directions whether their operations cancel, but also to judge from the paths they lie on whether clearing them would affect the operators on another path. An optional implementation of this step is to construct conditions on the paths and the merging directions and to check the transpose operators one by one against these conditions to determine which ones are cancelable. A group of cancelable transpose operators comprises two or more transpose operators; concrete examples are given later with reference to FIGS. 10a to 10c.
In this embodiment, the merging direction of each transpose operator in each path is determined first, it is then judged from the merging directions whether transpose operators can cancel each other, and if so, the cancelable transpose operators are removed. Eliminating unnecessary transpose operators from the computation graph reduces resource overhead and improves the inference performance of the computation graph.
Fig. 8b illustrates a computational graph optimization method according to another embodiment of the present disclosure. The optimization method includes steps S811 to S816.
In step S811, the data format of the initial computation graph G0 is compared with the data format selected by the specific acceleration unit; if they are the same, step S813 is performed, and if not, step S812 is performed.
In step S812, transpose operators are inserted before the input operator and after the output operator of the initial computation graph G0 to obtain the computation graph G1 to be optimized.
In step S813, the initial computation graph G0 is taken as the computation graph G1 to be optimized.
In the above steps, it is judged whether the data format of the initial computation graph G0 is the same as the data format selected by the specific acceleration unit. If so, G0 is used directly as the computation graph G1. If not, a first transposition operator and a second transposition operator (the first before the second) are inserted in sequence before the input operator of G0, and a third transposition operator and a fourth transposition operator (the third before the fourth) are inserted in sequence after the output operator of G0, to obtain the computation graph G1 to be optimized. The first and third transposition operators convert the data format of the initial computation graph G0 into the data format selected by the specific acceleration unit, while the second and fourth transposition operators convert the data format selected by the specific acceleration unit back into the data format of G0. If the initial computation graph G0 has several input operators, a first and a second transposition operator are inserted in sequence before each of them; if it has several output operators, a third and a fourth transposition operator are inserted in sequence after each of them. It should be appreciated that after this processing, the data format of the computation graph G1 to be optimized is still the same as that of the initial computation graph G0.
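A minimal sketch of the insertion performed in step S812, under the assumption that the graph is described by its lists of input and output operators and that the inserted transposes can be represented by name only; the helper and the naming scheme are hypothetical.

```python
def wrap_with_transposes(graph_format: str, accel_format: str, inputs, outputs):
    """Return the transpose operators to insert around the graph when formats differ."""
    if graph_format == accel_format:
        return {}                                   # use the initial graph unchanged
    to_accel = f"trans_{graph_format}_to_{accel_format}"
    to_graph = f"trans_{accel_format}_to_{graph_format}"
    inserts = {}
    for op in inputs:                               # first then second transpose before each input
        inserts[("before", op)] = [to_accel, to_graph]
    for op in outputs:                              # third then fourth transpose after each output
        inserts[("after", op)] = [to_accel, to_graph]
    return inserts

print(wrap_with_transposes("NHWC", "NCHW", inputs=["conv1"], outputs=["add6"]))
```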
In step S814, a plurality of paths is obtained from the computation graph G1 to be optimized.
In step S815, the merging direction of each transpose operator on each path is determined.
In step S816, a cancelable transpose operator in one or more paths is determined according to the merging direction of each transpose operator in each path, and the cancelable transpose operator in one or more paths is cleared, so as to obtain the optimized computation graph G2.
The operations of steps S814 to S816 are the same as those of steps S801 to S803 and are not repeated here. It should be emphasized, however, that when the input of step S814 is a computation graph into which the first to fourth transposition operators have been inserted, the first transposition operator becomes the input operator of the computation graph G1 to be optimized and the fourth transposition operator becomes its output operator; step S814 therefore actually determines a plurality of paths extending from the first transposition operator to the fourth transposition operator, each path having at least one transposition operator different from the other paths, and a path containing only the first to fourth transposition operators does not count as a path determined by step S814.
FIGS. 9a and 9b are a flowchart of one embodiment that combines steps S802 and S803 of FIG. 8a (and, equivalently, steps S815 and S816 of FIG. 8b). The method specifically comprises the following steps.
In step S901, each transpose operator on each path of the path set is read.
In step S902, it is determined whether the data format of the output tensor of the current transpose operator is the same as the data format selected by the specific acceleration unit.
In step S903, if they are the same, the merging direction of the current transpose operator is determined to be upward.
In step S904, if they are different, it is determined whether the data format of the input tensor of the current transpose operator is the same as the data format selected by the specific acceleration unit; if so, step S905 is performed.
In step S905, the merging direction of the current transpose operator is determined to be downward.
In step S906, it is determined whether the merging directions of all transpose operators on all paths have been determined; if not, the process jumps back to step S901.
In step S907, each transpose operator of each path is traversed in turn.
In step S908, for adjacent transposition operators a and b on the same path whose merging directions are downward and upward respectively, it is determined whether the transposition operator a and the transposition operator b exist only in exactly the same paths. If so, step S909 is executed; if not, step S910 is executed.
In step S909, the transposition operator a and the transposition operator b are determined to be cancelable transposition operators, and the transposition operator a and the transposition operator b are cleared.
In step S910, it is determined whether the transposition operator a or the transposition operator b exists alone in at least one path. Step S911 is performed if the transposition operator a exists independently in at least one path, and step S912 is performed if the transposition operator b exists independently in at least one path.
In step S911, if it is determined that the merging direction of the next transposing operator adjacent to the transposing operator a in all paths including the transposing operator a is upward, it is determined that the transposing operator a and the next transposing operator adjacent to the transposing operator a are cancelable transposing operators, and a clearing operation is performed. If the merging direction of the next transposing operator adjacent to the transposing operator a in all paths containing the transposing operator a is not upward, the transposing operator a and the next transposing operator adjacent to the transposing operator a are non-cancelable transposing operators, and therefore the clearing operation is not executed.
In step S912, if it is determined that the merging direction of the previous transposing operator adjacent to the transposing operator b in all paths including the transposing operator b is downward, it is determined that the transposing operator b and the previous transposing operator adjacent to the transposing operator b are cancelable transposing operators, and a clearing operation is performed. If the merging direction of the previous transposing operator adjacent to the transposing operator b in all paths containing the transposing operator b is not downward, the transposing operator b and the previous transposing operator adjacent to the transposing operator b are non-cancelable transposing operators, and therefore no clearing operation is performed.
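The following is a condensed Python sketch of steps S901 to S912 under simplifying assumptions: each path is a list of operator ids, a dictionary maps each transpose operator to its (input format, output format) pair, and the check of step S910 is folded into the two elif branches. It is an illustrative approximation, not the patented implementation.

```python
def directions(paths, formats, accel_format):
    """Steps S901-S906: assign each transpose operator a merging direction ('up'/'down')."""
    result = {}
    for path in paths:
        for op in path:
            if op not in formats:
                continue                      # not a transpose operator
            in_fmt, out_fmt = formats[op]
            if out_fmt == accel_format:
                result[op] = "up"             # S902/S903
            elif in_fmt == accel_format:
                result[op] = "down"           # S904/S905
            else:
                result[op] = None
    return result

def neighbor_trans(path, op, formats, offset):
    """Transpose operator `offset` positions away from `op` among the transposes on `path`."""
    trans = [x for x in path if x in formats]
    i = trans.index(op) + offset
    return trans[i] if 0 <= i < len(trans) else None

def cancelable_transposes(paths, formats, accel_format):
    """Steps S907-S912 (condensed): transpose operators that can be cleared."""
    direction = directions(paths, formats, accel_format)

    def paths_of(op):
        return [tuple(p) for p in paths if op in p]

    removable = set()
    for path in paths:
        trans = [op for op in path if op in formats]
        for a, b in zip(trans, trans[1:]):            # adjacent transposes on this path
            if direction.get(a) != "down" or direction.get(b) != "up":
                continue
            if paths_of(a) == paths_of(b):            # S908/S909: exactly the same paths
                removable.update({a, b})
            elif all(direction.get(neighbor_trans(p, a, formats, +1)) == "up"
                     for p in paths_of(a)):           # S911
                removable.update({a} | {neighbor_trans(p, a, formats, +1) for p in paths_of(a)})
            elif all(direction.get(neighbor_trans(p, b, formats, -1)) == "down"
                     for p in paths_of(b)):           # S912
                removable.update({b} | {neighbor_trans(p, b, formats, -1) for p in paths_of(b)})
    return removable

# Assumed toy input: two paths sharing one downward transpose t2, each branch ending
# with an upward transpose (t5, t8); the acceleration unit format is taken to be NHWC.
paths = [["conv1", "t2", "relu3", "conv4", "t5", "add6"],
         ["conv1", "t2", "relu3", "conv7", "t8", "mul9"]]
formats = {"t2": ("NHWC", "NCHW"), "t5": ("NCHW", "NHWC"), "t8": ("NCHW", "NHWC")}
print(cancelable_transposes(paths, formats, "NHWC"))  # {'t2', 't5', 't8'} (order may vary)
```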
In this embodiment, the format conversion directions of the transposing operators are labelled downward and upward; in each path, two adjacent transposing operators whose merging directions are downward and upward are searched for, the path conditions under which these two transposing operators may exist are examined, the cancelable transposing operators are determined, and the cancelable transposing operators are cleared.
This embodiment shows how to determine the merging direction of each transpose operator under two data formats, and how to determine cancelable transpose operators from those merging directions; by removing the cancelable transpose operators, resource overhead is reduced and the inference performance of the computation graph is improved.
It should be emphasized that although the above embodiments mostly take two data formats as examples, the embodiments of the present disclosure can also be applied to three or more data formats.
FIGS. 10a to 10c are examples derived according to embodiments of the present disclosure.
As shown in FIG. 10a, 11 denotes the computation graph to be optimized, a subgraph comprising path 1, composed of conv(1), trans(2), relu(3), conv(4), trans(5) and add(6), and path 2, composed of conv(1), trans(2), relu(3), conv(7), trans(8) and mul(9). The input is NHWC and the output is NHWC. Assuming the data format of the specific acceleration unit is NHWC, there is no need to insert a transposition operator trans before the input operator conv(1) or after the output operators add(6) and mul(9) of the computation graph. The transposition operators on path 1 and path 2 are then analyzed. As shown above, trans(2) converts NHWC to NCHW, so its direction is downward; trans(5) converts NCHW to NHWC, so its direction is upward; and trans(8) converts NCHW to NHWC, so its direction is upward. This example is a multi-branch case: trans(2) exists in both path 1 and path 2, so when analyzing whether trans(2) can be cancelled, it is analyzed whether trans(2) can be cancelled with trans(5) and with trans(8), respectively. According to the above embodiment, trans(5) and trans(8) on the two branches of this example can both be cancelled with trans(2), so trans(2), trans(5) and trans(8) can all be deleted from the computation graph 11, yielding the optimized computation graph.
As shown in FIG. 10b, 12 denotes the computation graph to be optimized, a subgraph comprising path 1, composed of conv(1), trans(2), trans(3), relu(4), conv(5), trans(6) and add(7), and path 2, composed of conv(1), trans(2), trans(3), relu(4), conv(8), trans(9) and mul(10). The input is NCHW, output1 is NHWC, and output2 is NHWC. Assuming the data format of the specific acceleration unit is NCHW, there is no need to insert any transposition operator before the input operator conv(1) or after the output operators add(7) and mul(10) of the computation graph. The transposition operators on path 1 and path 2 are then analyzed. As shown above, trans(2) converts NCHW to NHWC, with the direction up; trans(3) converts NHWC to NCHW, with the direction down; trans(6) converts NCHW to NHWC, with the direction up; and trans(9) converts NCHW to NHWC, with the direction up. This example is also a multi-branch case: when the transposition operators on paths 1 and 2 are analyzed, the analysis starts from the transposition operator in the first direction, namely trans(3), and specifically considers whether trans(3) can be cancelled with trans(6) and with trans(9). In accordance with the embodiments of the present disclosure, trans(6) and trans(9) on the two branches of this example can both be cancelled with trans(3), so trans(3), trans(6) and trans(9) can be deleted together from the computation graph 12, yielding the optimized computation graph.
As shown in FIG. 10c, 13 denotes the computation graph to be optimized, a subgraph comprising path 1, composed of conv(1), trans(2), relu(3), conv(4) and add(5), and path 2, composed of conv(1), trans(2), relu(3), conv(6), trans(7) and mul(8). The input of the computation graph 13 is NHWC, output1 is NHWC, output2 is NHWC, and the data format of the specific acceleration unit is NHWC, so there is no need to insert a transposition operator trans before the input operator conv(1) or after the output operators add(5) and mul(8) of the computation graph. The transposition operators on path 1 and path 2 are then analyzed. As shown above, trans(2) converts NHWC to NCHW, with the direction down, and trans(7) converts NCHW to NHWC, with the direction up. This example is again a multi-branch case: trans(2) exists in both path 1 and path 2, so when analyzing whether trans(2) can cancel with trans(7), it must be analyzed whether both path 1 and path 2 contain a transposition operator that can cancel with trans(2); since path 1 contains no such transposition operator, trans(2) cannot be cancelled with trans(7). That is, the computation graph 13 has no transposition operators that can be deleted.
In summary, the computation graph optimization method according to the embodiments of the present disclosure yields an optimized computation graph. Because unnecessary transposition operators are removed, the optimized computation graph reduces the resource overhead during its execution.
Further, although a server of the data center has been described above as the execution subject of the embodiments of the present disclosure, the disclosure is not limited thereto. In principle, the execution subject may be any computing device, including the servers and terminal devices described above; for a terminal device, as long as its processor, memory, network throughput, and the like meet the operational requirements of the deep learning model, the deep learning model can be deployed on it and the various computation graph processes (including the computation graph processing scheme provided by the embodiments of the present disclosure) can be performed on it.
Commercial value of the disclosed embodiments
The technical solutions provided by the embodiments of the present disclosure have practical value in combining deep learning models with cloud computing. They can be used to optimize deep learning models that customers intend to deploy to the data center, improving the inference performance of those models and hence the processing efficiency of the applications equipped with them, which in turn increases customers' willingness to host their deep learning models in the data center. The disclosed embodiments therefore have market prospects and commercial value.
As will be appreciated by one skilled in the art, the present disclosure may be embodied as systems, methods and computer program products. Accordingly, the present disclosure may be embodied in the form of entirely hardware, entirely software (including firmware, resident software, micro-code), or in the form of a combination of software and hardware. Furthermore, in some embodiments, the present disclosure may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium is, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical memory, a magnetic memory, or any suitable combination of the foregoing. In this context, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a processing unit, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., and any suitable combination of the foregoing.
Computer program code for carrying out embodiments of the present disclosure may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java and C++, and may also include conventional procedural programming languages such as C. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Claims (21)
1. A processing unit, comprising:
an instruction fetch unit to retrieve computer instructions from a memory external to the processing unit;
an instruction decode unit to decode the retrieved computer instructions;
an instruction execution unit, configured to execute the decoded computer instructions to implement:
determining a plurality of paths extending from an input operator to an output operator of a computational graph to be optimized and including at least one transpose operator, wherein each path of the plurality of paths has at least one different transpose operator from the other paths;
determining the merging direction of each transposition operator in each path; and
determining a transpose operator which can be counteracted in one or more paths according to the merging direction of each transpose operator in each path, and clearing the transpose operator which can be counteracted in the one or more paths to obtain the optimized calculation graph.
2. The processing unit of claim 1, wherein the instruction execution unit further implements:
acquiring an initial calculation chart;
judging whether the data format of the initial calculation graph is the same as the data format selected by the specific acceleration unit, and if the data format of the initial calculation graph is the same as the data format selected by the specific acceleration unit, taking the initial calculation graph as the calculation graph to be optimized; if the data format of the initial calculation graph is different from the data format selected by the specific acceleration unit, sequentially inserting a first transposition operator and a second transposition operator before an input operator of the initial calculation graph, sequentially inserting a third transposition operator and a fourth transposition operator after an output operator of the initial calculation graph, and taking the initial calculation graph after the transposition operator is inserted as the calculation graph to be optimized, wherein the first transposition operator and the third transposition operator are used for converting the data format of the calculation graph to be optimized into the data format selected by the specific acceleration unit, and the second transposition operator and the fourth transposition operator are used for converting the data format selected by the specific acceleration unit into the data format of the calculation graph to be optimized.
3. The processing unit of claim 1 or 2, wherein the determining a merging direction for each transpose operator in each path comprises:
for each transposition operator in each path in the path set, firstly, judging whether the data format of the output tensor of the transposition operator is the same as the data format selected by the specific acceleration unit, if so, the merging direction of the transposition operator is upward, if not, continuously judging whether the data format of the input tensor of the transposition operator is the same as the data format selected by the specific acceleration unit, and if so, the merging direction of the transposition operator is downward;
then the determining a cancelable transpose operator in one or more paths according to the combining direction of each transpose operator in each path, and clearing the cancelable transpose operator in the one or more paths includes:
sequentially traversing each path, judging whether the first transposition operator and the second transposition operator exist in the same path or not for the first transposition operator and the second transposition operator which are adjacent to each other and have the downward and upward merging directions on each path respectively, if so, determining that the first transposition operator and the second transposition operator are counteracting transposition operators, and clearing the first transposition operator and the second transposition operator;
if not, if the first transposition operator is determined to exist in at least one path independently, and if the merging direction of a next transposition operator adjacent to the first transposition operator in all paths containing the first transposition operator is determined to be upward, determining that the first transposition operator and the next transposition operator adjacent to the first transposition operator are cancelable transposition operators, and clearing the first transposition operator and the next transposition operator adjacent to the first transposition operator in all paths containing the first transposition operator; if the second transposition operator is determined to exist in one path alone, and the merging direction of the previous transposition operator adjacent to the second transposition operator in all paths containing the second transposition operator is determined to be downward, the second transposition operator and the previous transposition operator adjacent to the second transposition operator are determined to be cancelable transposition operators, and the second transposition operator and the previous transposition operator adjacent to the second transposition operator are eliminated in all paths containing the second transposition operator.
4. The processing unit according to any one of claims 1 to 3, wherein the data format of the computation graph to be optimized and the data format selected by the specific acceleration unit are one of: NHWC and NCHW.
5. The processing unit of any of claims 1 to 3, wherein the instruction execution unit is further to implement: and cutting the complete calculation graph of the specific deep learning model to obtain a plurality of sub-graphs, and taking one sub-graph as the calculation graph to be optimized.
6. The processing unit of any of claims 1 to 3, wherein the instruction execution unit is further to implement: converting a complete calculation graph of a specific deep learning model into an intermediate expression conforming to the specific accelerating unit, performing at least one of operator combination, quantification and model pruning on the intermediate expression, performing graph cutting on the processed intermediate expression to obtain a plurality of sub-graphs, and taking one of the sub-graphs as the calculation graph to be optimized.
7. The processing unit of any of claims 1 to 3, wherein the computational graph to be optimized is a complete computational graph of a particular deep learning model.
8. The processing unit of claim 1, wherein determining a plurality of paths comprises:
searching all paths which extend from an input operator to an output operator of the calculation graph to be optimized and comprise at least one transposition operator, and forming a path set by all the paths;
and for each path, judging whether the transposition operator on the path is the same as the transposition operators on the other paths one by one, and if different transposition operators exist, reserving the path in the path set.
9. The processing unit of claim 1, wherein determining a plurality of paths comprises:
searching each path extending from the input operator to the output operator of the calculation graph to be optimized, then judging whether a transposition operator exists on the path and judging whether the path and the existing path in the path set have different transposition operators, and if both the paths and the existing paths in the path set have different transposition operators, storing the path into the path set.
10. The processing unit of claim 1, wherein the merging direction of each transpose operator is determined from tensor dimension conversion performed by the transpose operator.
11. A computing device comprising an acceleration unit, a memory, and the processing unit of any of claims 1 to 10.
12. A computational graph optimization method of a deep learning model comprises the following steps:
determining a plurality of paths extending from an input operator to an output operator of a computational graph to be optimized and including at least one transpose operator, wherein each path of the plurality of paths has at least one different transpose operator from the other paths;
determining the merging direction of each transposition operator in each path; and
determining a transpose operator which can be counteracted in one or more paths according to the merging direction of each transpose operator in each path, and clearing the transpose operator which can be counteracted in the one or more paths to obtain the optimized calculation graph.
13. The computational graph optimization method of claim 12, further comprising:
acquiring an initial calculation chart;
judging whether the data format of the initial calculation graph is the same as the data format selected by the specific acceleration unit, and if the data format of the initial calculation graph is the same as the data format selected by the specific acceleration unit, taking the initial calculation graph as the calculation graph to be optimized; if the data format of the initial calculation graph is different from the data format selected by the specific acceleration unit, sequentially inserting a first transposition operator and a second transposition operator before an input operator of the initial calculation graph, sequentially inserting a third transposition operator and a fourth transposition operator after an output operator of the initial calculation graph, and taking the initial calculation graph after the transposition operator is inserted as the calculation graph to be optimized, wherein the first transposition operator and the third transposition operator are used for converting the data format of the calculation graph to be optimized into the data format selected by the specific acceleration unit, and the second transposition operator and the fourth transposition operator are used for converting the data format selected by the specific acceleration unit into the data format of the calculation graph to be optimized.
14. The computational graph optimization method of claim 12 or 13, wherein said determining a merging direction for each transpose operator in each path comprises:
for each transposition operator in each path in the path set, firstly, judging whether the data format of the output tensor of the transposition operator is the same as the data format selected by the specific acceleration unit, if so, the merging direction of the transposition operator is upward, if not, continuously judging whether the data format of the input tensor of the transposition operator is the same as the data format selected by the specific acceleration unit, and if so, the merging direction of the transposition operator is downward;
then the determining a cancelable transpose operator in one or more paths according to the combining direction of each transpose operator in each path, and clearing the cancelable transpose operator in the one or more paths includes:
sequentially traversing each path, judging whether the first transposition operator and the second transposition operator exist in the same path or not for the first transposition operator and the second transposition operator which are adjacent to each other and have the downward and upward merging directions on each path respectively, if so, determining that the first transposition operator and the second transposition operator are counteracting transposition operators, and clearing the first transposition operator and the second transposition operator;
if not, if the first transposition operator is determined to exist in at least one path independently, and if the merging direction of a next transposition operator adjacent to the first transposition operator in all paths containing the first transposition operator is determined to be upward, determining that the first transposition operator and the next transposition operator adjacent to the first transposition operator are cancelable transposition operators, and clearing the first transposition operator and the next transposition operator adjacent to the first transposition operator in all paths containing the first transposition operator; if the second transposition operator is determined to exist in one path alone, and the merging direction of the previous transposition operator adjacent to the second transposition operator in all paths containing the second transposition operator is determined to be downward, the second transposition operator and the previous transposition operator adjacent to the second transposition operator are determined to be cancelable transposition operators, and the second transposition operator and the previous transposition operator adjacent to the second transposition operator are eliminated in all paths containing the second transposition operator.
15. The computation graph optimization method according to any one of claims 12 to 14, wherein the data format of the computation graph to be optimized and the data format selected by the specific acceleration unit are one of the following data formats: NHWC and NCHW.
16. The computational graph optimization method according to any one of claims 12 to 14, further comprising: before the step of determining whether the data format of the computation graph to be optimized is the same as the data format selected by the specific acceleration unit,
and cutting the complete calculation graph of the specific deep learning model to obtain a plurality of sub-graphs, and taking one sub-graph as the calculation graph to be optimized.
17. The computational graph optimization method according to any one of claims 12 to 14, further comprising: before the step of determining whether the data format of the computation graph to be optimized is the same as the data format selected by the specific acceleration unit,
converting a complete calculation graph of a specific deep learning model into an intermediate expression conforming to the specific accelerating unit, performing at least one of operator combination, quantification and model pruning on the intermediate expression, performing graph cutting on the processed intermediate expression to obtain a plurality of sub-graphs, and taking one of the sub-graphs as the calculation graph to be optimized.
18. The computational graph optimization method according to any one of claims 12 to 14, wherein the computational graph to be optimized is a complete computational graph of a particular deep learning model.
19. The computational graph optimization method of claim 12, wherein the step of determining a plurality of paths comprises:
searching all paths which extend from an input operator to an output operator of the calculation graph to be optimized and comprise at least one transposition operator, and forming a path set by all the paths;
and for each path, judging whether the transposition operator on the path is the same as the transposition operators on the other paths one by one, and if different transposition operators exist, reserving the path in the path set.
20. The computational graph optimization method of claim 12, wherein the step of determining a plurality of paths comprises:
searching each path extending from the input operator to the output operator of the calculation graph to be optimized, then judging whether a transposition operator exists on the path and judging whether the path and the existing path in the path set have different transposition operators, and if both the paths and the existing paths in the path set have different transposition operators, storing the path into the path set.
21. A data center comprising the computing device of claim 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010435236.3A CN113705798A (en) | 2020-05-21 | 2020-05-21 | Processing unit, computing device and computation graph optimization method of deep learning model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010435236.3A CN113705798A (en) | 2020-05-21 | 2020-05-21 | Processing unit, computing device and computation graph optimization method of deep learning model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113705798A true CN113705798A (en) | 2021-11-26 |
Family
ID=78645415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010435236.3A Pending CN113705798A (en) | 2020-05-21 | 2020-05-21 | Processing unit, computing device and computation graph optimization method of deep learning model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113705798A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115293340A (en) * | 2022-08-09 | 2022-11-04 | 上海壁仞智能科技有限公司 | Data synchronization processing method and device, computing equipment and storage medium |
WO2023098446A1 (en) * | 2021-12-02 | 2023-06-08 | 中科寒武纪科技股份有限公司 | Neural network computation method and related device |
CN117764122A (en) * | 2023-12-29 | 2024-03-26 | 苏州亿铸智能科技有限公司 | Calculation map processing method and device, electronic equipment and storage medium |
JP7549153B2 (en) | 2022-05-19 | 2024-09-10 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | DEEP LEARNING FRAMEWORK OPERATOR PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, STORAGE MEDIUM, AND COMPUTER PROGRAM |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023098446A1 (en) * | 2021-12-02 | 2023-06-08 | 中科寒武纪科技股份有限公司 | Neural network computation method and related device |
JP7549153B2 (en) | 2022-05-19 | 2024-09-10 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | DEEP LEARNING FRAMEWORK OPERATOR PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, STORAGE MEDIUM, AND COMPUTER PROGRAM |
CN115293340A (en) * | 2022-08-09 | 2022-11-04 | 上海壁仞智能科技有限公司 | Data synchronization processing method and device, computing equipment and storage medium |
CN117764122A (en) * | 2023-12-29 | 2024-03-26 | 苏州亿铸智能科技有限公司 | Calculation map processing method and device, electronic equipment and storage medium |
CN117764122B (en) * | 2023-12-29 | 2024-06-25 | 苏州亿铸智能科技有限公司 | Calculation map processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321999B (en) | Neural network computational graph optimization method | |
CN113705798A (en) | Processing unit, computing device and computation graph optimization method of deep learning model | |
EP4123515A1 (en) | Data processing method and data processing device | |
CN112579063B (en) | Acceleration method for exploring optimization space in deep learning compiler | |
WO2021190597A1 (en) | Processing method for neural network model, and related device | |
AU2014203218B2 (en) | Memory configuration for inter-processor communication in an MPSoC | |
US20240161474A1 (en) | Neural Network Inference Acceleration Method, Target Detection Method, Device, and Storage Medium | |
CN113313241A (en) | Method and computing device for determining tensor information of deep learning model | |
CN105677812A (en) | Method and device for querying data | |
CN112199086A (en) | Automatic programming control system, method, device, electronic device and storage medium | |
US11630986B2 (en) | Graph conversion method | |
CN114168154B (en) | Model data processing method and device, electronic equipment and storage medium | |
CN111435352A (en) | Distributed real-time computing method, device and system and storage medium thereof | |
CN112070213A (en) | Neural network model optimization method, device, equipment and storage medium | |
CN115240048A (en) | Deep learning operator positioning fusion method and device for image classification | |
CN114691148A (en) | Model reasoning acceleration method and device, electronic equipment and storage medium | |
CN115222015A (en) | Instruction processing apparatus, acceleration unit, and server | |
CN113269319A (en) | Deep learning model tuning method, deep learning model compiling method and computing device | |
CN113139650B (en) | Optimization method and computing device of deep learning model | |
CN113570060A (en) | Model reasoning optimization method and device | |
CN113688982A (en) | Processing unit, related device and method | |
CN115906993A (en) | Method and device for processing neural network model and electronic equipment | |
CN115525436A (en) | Model deployment and operation method and device, offline analysis tool and electronic equipment | |
CN111221860A (en) | Mixed query optimization method and device based on big data | |
CN115222014A (en) | Acceleration unit and server for neural network model execution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |