WO2021000638A1 - Method and device for compiling a deep learning algorithm, and related product

Info

Publication number
WO2021000638A1
Authority
WO
WIPO (PCT)
Prior art keywords
operation instruction
data
deep learning
instruction
static
Prior art date
Application number
PCT/CN2020/085882
Other languages
English (en)
Chinese (zh)
Inventor
陈黎明
吴林阳
王子毅
Original Assignee
上海寒武纪信息科技有限公司
Priority date
Filing date
Publication date
Priority claimed from CN201910596220.8A (published as CN112183735A)
Priority claimed from CN201910596132.8A (published as CN112183712A)
Application filed by 上海寒武纪信息科技有限公司
Publication of WO2021000638A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Definitions

  • the present disclosure relates to the field of deep learning, and in particular to a method, device and related products for compiling deep learning algorithms.
  • neural network algorithms have recently become very popular machine learning algorithms and have achieved very good results in various fields, such as image recognition, speech recognition, and natural language processing.
  • the complexity of the algorithms is getting higher and higher, and the scale of the models is gradually increasing.
  • the present disclosure proposes a deep learning algorithm compilation method, device and related products, which can improve the performance optimization effect of the deep learning algorithm for the corresponding hardware platform.
  • a method for compiling a deep learning algorithm, comprising: receiving operation data transmitted by a deep learning programming library interface; obtaining the operation instructions included in the operation data; and determining the instruction type of the operation instructions and executing, according to the determination result, the compilation operation corresponding to the instruction type to obtain the binary code of the deep learning algorithm.
  • a device for compiling a deep learning algorithm, including: an operation data receiving module for receiving operation data transmitted by a deep learning programming library interface; an operation instruction acquiring module for acquiring the operation instructions included in the operation data; and a compilation module for determining the instruction type of the operation instructions and executing, according to the determination result, the compilation operation corresponding to the instruction type to obtain the binary code of the deep learning algorithm.
  • a deep learning computing device, comprising the deep learning algorithm compiling device as described in the above second aspect, wherein the deep learning computing device is used to complete a specified deep learning operation.
  • a combined computing device, comprising the deep learning computing device as described in the third aspect, a universal interconnection interface and other processing devices; the deep learning computing device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
  • a deep learning chip, comprising: the deep learning algorithm compiling device according to the second aspect; or the deep learning computing device according to the third aspect; or the combined computing device as described in the fourth aspect.
  • an electronic device, comprising: the deep learning algorithm compiling device as described in the second aspect above; or the deep learning computing device as described in the third aspect above; or the combined computing device as described in the fourth aspect above; or the deep learning chip as described in the fifth aspect above.
  • a method for generating operation data, comprising: receiving a user instruction; and, according to the user instruction, triggering a deep learning programming library interface to create or call operation data, wherein the operation data includes at least one of tensor data and operation instructions.
  • a device for generating operation data, including: a user instruction receiving module for receiving user instructions; and a triggering module for triggering, according to the user instructions, the deep learning programming library interface to create or call operation data, where the operation data includes at least one of tensor data and operation instructions.
  • a deep learning computing device, comprising the operation data generating device as described in the eighth aspect, wherein the deep learning computing device is used to complete a specified deep learning operation.
  • a combined computing device, comprising the deep learning computing device as described in the ninth aspect, a universal interconnection interface and other processing devices; the deep learning computing device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
  • a deep learning chip, comprising: the operation data generating device according to the eighth aspect; or the deep learning computing device according to the ninth aspect; or the combined computing device as described in the tenth aspect.
  • an electronic device, comprising: the operation data generating device according to the eighth aspect; or the deep learning computing device according to the ninth aspect; or the combined computing device as described in the tenth aspect; or the deep learning chip as described in the eleventh aspect.
  • by determining the instruction type of the operation instruction and executing the compilation operation corresponding to that instruction type, the binary code of the deep learning algorithm is obtained. The deep learning algorithm compilation methods, devices and related products according to various aspects of the embodiments of the present disclosure allow the compilation process to adapt to different types of operation instructions, thereby greatly improving compilation flexibility and compilation efficiency, effectively improving the performance optimization effect of deep learning algorithms on the corresponding hardware platform, and in turn improving the processing performance of the deep learning processor.
  • a universal programming interface can thus be provided for users, achieving effective conversion between user instructions and machine instructions.
  • Fig. 1 shows a flowchart of a method for compiling a deep learning algorithm according to an embodiment of the present disclosure.
  • Fig. 2 shows a schematic diagram of the overall architecture of a neural algorithm programming library interface according to an embodiment of the present disclosure.
  • Fig. 3 shows a corresponding relationship diagram between attributes, classifications, and meanings of tensor data according to an embodiment of the present disclosure.
  • Fig. 4 shows a corresponding relationship diagram between tensor data and shape symbols according to an embodiment of the present disclosure.
  • Fig. 5 shows a build-in operation instruction diagram supported by NCLAPI according to an embodiment of the present disclosure.
  • Fig. 6 shows a schematic diagram of an implementation of automatic accuracy selection according to an embodiment of the present disclosure.
  • Fig. 7 shows a schematic diagram of a calculation model according to an embodiment of the present disclosure.
  • Fig. 8 shows a schematic diagram of generating a customized operation instruction according to an embodiment of the present disclosure.
  • FIG. 9 shows a schematic diagram of operation fusion results according to an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of operation fusion results according to an embodiment of the present disclosure.
  • Fig. 11 shows a schematic diagram of a related programming interface for operation fusion according to an embodiment of the present disclosure.
  • Fig. 12 shows a schematic diagram of a process of creating a fusion operation according to an embodiment of the present disclosure.
  • Fig. 13 shows a schematic diagram of a data flow of a three-layer calculation model according to an embodiment of the present disclosure.
  • Fig. 14 shows an implementation block diagram of a hybrid programming model according to an embodiment of the present disclosure.
  • FIG. 15 shows a schematic diagram of the difference between an offline mode and an online mode according to an embodiment of the present disclosure.
  • Fig. 16 shows a schematic diagram of an offline interface according to an embodiment of the present disclosure.
  • Fig. 17 shows an architecture diagram of TensorFlow according to an embodiment of the present disclosure.
  • FIG. 18 shows a schematic diagram of comparison between NCLAPI and mainstream deep learning programming library interfaces according to an embodiment of the present disclosure.
  • FIG. 19 shows a schematic diagram of the overall architecture of NCLA according to an embodiment of the present disclosure.
  • Fig. 20 shows a schematic diagram of an implementation of a static operation pool according to an embodiment of the present disclosure.
  • FIG. 21 shows a schematic diagram of the architecture of CDUCA according to an embodiment of the present disclosure.
  • FIG. 22 shows the form of the original calculation graph according to an embodiment of the present disclosure.
  • Fig. 23 shows a working flowchart of a computational graph engine according to an embodiment of the present disclosure.
  • Fig. 24 shows a schematic structural diagram of a substructure contained in an image classification network according to an embodiment of the present disclosure.
  • FIG. 25 shows a flowchart of a method for compiling a deep learning algorithm according to an embodiment of the present disclosure.
  • FIG. 26 shows a schematic diagram of instruction flow according to an embodiment of the present disclosure.
  • Fig. 27 shows a diagram of an implementation manner of optimizing model data according to an embodiment of the present disclosure.
  • FIG. 28 shows a schematic diagram of optimization after data splitting according to an embodiment of the present disclosure.
  • FIG. 29 shows a schematic diagram of modules and functions of a runtime system according to an embodiment of the present disclosure.
  • Fig. 30 shows a block diagram of a device for compiling a deep learning algorithm according to an embodiment of the present disclosure.
  • Fig. 31 shows a block diagram of a combined processing device according to an embodiment of the present disclosure.
  • FIG. 32 shows a flowchart of a method for generating operation data according to an embodiment of the present disclosure.
  • Fig. 33 shows a block diagram of a device for compiling a deep learning algorithm according to an embodiment of the present disclosure.
  • in order to alleviate the worsening memory wall problem, deep learning processors usually design on-chip memory near the computing units. The latency of accessing on-chip storage is much lower than that of accessing off-chip storage, so making reasonable use of on-chip storage is the key to the performance of deep learning processors.
  • the on-chip storage does not have the functions of the cache such as data prefetching, data replacement, and processing of data conflicts. These tasks must be completed by programming instructions.
  • the on-chip storage capacity is very limited, and programmers must split operations and data at the same time, which leads to tight coupling between operations and data.
  • on-chip storage: explicitly managed, with limited capacity;
  • deep learning algorithms: complex and variable processing layers, diverse computation data types, processing of high-dimensional tensors, and so on;
  • these characteristics make program optimization extremely sensitive to changes in the algorithm and the hardware.
  • for example, a 2D convolution operation contains at least 9 shape parameters (N, CI, HI, WI, CO, Kh, Kw, Sh, Sw), 7 nested loops, and multiple calculation precisions (half, fxm.b, Qts.b, intx, etc.); a change in the combination of these parameters or in the on-chip storage capacity will affect the optimal split strategy for data and operations, resulting in different generated binary code.
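  • as an illustration of this sensitivity (not part of the patent text), the following C++ sketch shows how a simple row-tiling decision for such a convolution depends on the shape parameters, the precision, and the on-chip capacity; all names and the fitting heuristic are hypothetical:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical illustration only: shows how a data-split (tiling) decision for a
// 2D convolution depends on the shape parameters, the precision, and the capacity.
struct Conv2DShape {            // the nine shape parameters named in the text
    int n, ci, hi, wi;          // batch, input channels, input height/width
    int co, kh, kw, sh, sw;     // output channels, kernel size, strides
};

// Does one output row of the convolution, plus the weights, fit on chip?
bool rowTileFits(const Conv2DShape& s, int elemBytes, std::size_t onChipBytes) {
    int wo = (s.wi - s.kw) / s.sw + 1;                                   // output width
    std::size_t inRows  = std::size_t(s.ci) * s.kh * s.wi * elemBytes;   // kh input rows
    std::size_t weights = std::size_t(s.co) * s.ci * s.kh * s.kw * elemBytes;
    std::size_t outRow  = std::size_t(s.co) * wo * elemBytes;
    return inRows + weights + outRow <= onChipBytes;
}

int main() {
    Conv2DShape s{1, 64, 56, 56, 64, 3, 3, 1, 1};
    // The same layer fits under half precision but not under float for a 128 KB
    // buffer, so the chosen split strategy (and the generated binary) changes.
    std::printf("half:  %d\n", rowTileFits(s, 2, 128 * 1024));
    std::printf("float: %d\n", rowTileFits(s, 4, 128 * 1024));
}
```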
  • Fig. 1 shows a flowchart of a method for compiling a deep learning algorithm according to an embodiment of the present disclosure. As shown in the figure, the method may include:
  • step S11: receive the operation data transmitted by the deep learning programming library interface.
  • step S12: obtain the operation instructions included in the operation data.
  • step S13: determine the instruction type of the operation instruction, and execute the compilation operation corresponding to the instruction type according to the determination result to obtain the binary code of the deep learning algorithm.
  • the binary code is the hardware instruction used to direct a hardware device to execute the deep learning algorithm; which hardware device is directed and the specific content of the hardware instruction are not limited in the embodiments of the present disclosure and can be selected flexibly according to actual conditions.
  • by determining the instruction type of the operation instruction and executing the compilation operation corresponding to that instruction type, the binary code of the deep learning algorithm is obtained. The deep learning algorithm compilation methods, devices and related products according to various aspects of the embodiments of the present disclosure allow the compilation process to adapt to different types of operation instructions, thereby greatly improving compilation flexibility and compilation efficiency, effectively improving the performance optimization effect of deep learning algorithms on the corresponding hardware platform, and in turn improving the processing performance of the deep learning processor.
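  • a minimal C++ sketch of step S13 only, dispatching to a type-specific compilation routine; the instruction categories reuse the types introduced later in the text (build-in, specialized, fused, customized), and the routines themselves are illustrative stand-ins, not the patent's implementation:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using BinaryCode = std::vector<std::uint8_t>;

enum class OpKind { BuildIn, Specialized, Fused, Customized };  // types named later in the text

struct OperationInstruction {
    OpKind kind;
    std::string name;
};

// Hypothetical per-type compilation routines (stand-ins only).
BinaryCode compileBuildIn(const OperationInstruction&)     { return {}; }
BinaryCode compileSpecialized(const OperationInstruction&) { return {}; }
BinaryCode compileFused(const OperationInstruction&)       { return {}; }
BinaryCode compileCustomized(const OperationInstruction&)  { return {}; }

// Step S13: judge the instruction type and run the matching compilation path.
BinaryCode compile(const OperationInstruction& op) {
    switch (op.kind) {
        case OpKind::BuildIn:     return compileBuildIn(op);
        case OpKind::Specialized: return compileSpecialized(op);
        case OpKind::Fused:       return compileFused(op);
        case OpKind::Customized:  return compileCustomized(op);
    }
    throw std::runtime_error("unknown instruction type");
}
```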
  • the specific source of the operation data transmitted by the deep learning programming library interface and received in step S11 is not limited. In a possible implementation manner, the operation data can be created or called according to user instructions received by the deep learning programming library interface.
  • a universal programming interface can thus be provided for users, achieving effective conversion between user instructions and machine instructions.
  • the implementation of the deep learning programming library interface is not limited, and can be flexibly selected according to actual conditions.
  • in one example, the deep learning programming library interface can be a neural calculus programming library interface (NCLAPI, Neural Calculus Library API), and the specific implementation of the interface can be determined according to the actual situation and is not limited to the implementation examples disclosed below.
  • Figure 2 shows a schematic diagram of the overall architecture of a neural calculation programming library interface according to an embodiment of the present disclosure.
  • in one example, the implementation idea of the NCLAPI interface may be: simulating neural calculus so that the NCLAPI interface has good deep learning modeling capabilities; flexibly supporting various performance optimizations by designing plastic operations and the corresponding operation rules; improving the flexibility of the programming model by designing a hybrid programming model; and simplifying the data structures and interfaces so that hardware details are hidden inside the data structures and interfaces.
  • neural calculus is a functional deep learning modeling method.
  • neural calculus can use tensors to represent the input data, output data and model parameters of the processing layers, and use functions to represent the deep learning processing layers.
  • functions can be combined according to certain rules to construct various deep learning calculation models. Since functions themselves are composable and reusable, neural calculus can well express the composability and reusability of deep learning algorithms.
  • the neural calculus designed according to the above-mentioned ideas has powerful deep learning modeling capabilities.
  • known deep learning frameworks such as TensorFlow and MXNet all use directed graphs to model the deep learning calculation model.
  • any directed graph can be mapped into a combination of functions, and any combination of functions can be mapped into a directed acyclic graph, so neural calculus has the same deep learning modeling capabilities as directed graphs.
  • the NCLAPI interface can be equipped with good deep learning modeling capabilities by simulating neural calculus; therefore, with an appropriate simulation method, NCLAPI can have the same deep learning modeling capabilities as directed graphs.
  • in a possible implementation, NCLAPI can have two data structures, namely the tensor (nclTensor) and the plastic operation (nclOperator); nclTensor is used to describe the input data, output data and model parameters of the deep learning processing layer, and nclOperator is used to describe the deep learning processing layer itself.
  • the method for compiling the deep learning algorithm proposed in the embodiments of the present disclosure first needs to receive the operation data transmitted by the deep learning programming library interface, and, as proposed in the above disclosed embodiments, in one example the deep learning programming library interface can be NCLAPI.
  • correspondingly, the operation data passed by the deep learning programming library interface can be tensor data corresponding to the nclTensor data structure, an operation instruction corresponding to the nclOperator data structure, or both tensor data and an operation instruction.
  • the tensor data in NCLAPI is an abstract representation of multi-dimensional data, which can be used to represent the input data, output data and model parameters of the deep learning processing layer.
  • for example, the input, output and weight data of a convolutional layer can all be expressed as tensor data. Therefore, in a possible implementation, tensor data can have the following characteristics: it contains multiple attributes; it can describe multidimensional data such as scalars, vectors, matrices, and tensors; it follows certain naming rules; and it can describe multiple data types.
  • tensor data follow a certain naming rule.
  • This naming rule can be flexibly set according to actual conditions and is not limited to the following disclosed embodiments.
  • in one example, the naming rules for tensor data can be: the name is composed only of letters, numbers, and underscores; the first character must be an English letter; and punctuation marks and type specifiers cannot be included.
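  • a small helper (illustrative only, not part of NCLAPI) that checks the naming rules stated above:

```cpp
#include <cctype>
#include <string>

// Checks the stated rules: only letters, digits and underscores,
// and the first character must be a letter.
bool isValidTensorName(const std::string& name) {
    if (name.empty() || !std::isalpha(static_cast<unsigned char>(name[0])))
        return false;
    for (char c : name) {
        unsigned char uc = static_cast<unsigned char>(c);
        if (!std::isalnum(uc) && c != '_')
            return false;                 // punctuation and type specifiers are rejected
    }
    return true;
}
// isValidTensorName("conv1_weight") == true, isValidTensorName("1st_input") == false
```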
  • the deep learning calculation model usually processes fixed-size data; take the image classification model AlexNet as an example.
  • the shape of the input and output data of each processing layer is fixed, while the value of the data changes frequently with the input, so data attributes and data values have completely different update frequencies. From the perspective of data-structure reuse and programming flexibility, the attributes of the data should be decoupled from the value of the data. Therefore, in a possible implementation, the tensor data in the embodiments of the present disclosure is only used to describe the attributes of the data, the value of the data is described by a pointer to a memory area, and the tensor of neural calculus can be completely mapped through the combination of tensor data and a pointer.
  • the characteristic of tensor data is that it can contain multiple attributes, and which attributes are specifically contained can be flexibly set and selected according to actual conditions.
  • the tensor data may include a shape attribute (shape), a logical data type attribute (dtype), a physical data type attribute (pdtype), and a physical layout attribute (layout).
  • Fig. 3 shows a corresponding relationship diagram between attributes, classifications, and meanings of tensor data according to an embodiment of the present disclosure.
  • the above four attributes can be divided into two categories: visible attributes and invisible attributes. Among them, shape attributes and logical data type attributes can be classified as visible attributes.
  • the visible attributes can be set through the tensor assignment interface; the physical data type attribute and the physical layout attribute can be classified as invisible attributes, which are maintained, modified and used inside the programming library; this shields hardware details from the outside world and reduces programming complexity.
  • the physical data type attribute can be used to indicate the accuracy of the data stored in the hardware device memory
  • the logical data type attribute can be used to indicate the precision of the data stored in the host memory. The precisions represented by the physical data type attribute and the logical data type attribute can therefore be the same or different; in a possible implementation, the physical data type attribute and the logical data type attribute are different.
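  • the patent does not disclose the layout of nclTensor; a hypothetical descriptor holding the four attributes, with the visible/invisible split noted above, might look like the following sketch (all type names are assumptions):

```cpp
// Hypothetical sketch only: illustrates the four attributes of tensor data and
// their visible/invisible classification; not the actual nclTensor definition.
enum class DataType { Float32, Float16, Int8 /* ... */ };       // logical (host-side) precision
enum class PhysType { Float16, Fixed8, Quantized8 /* ... */ };  // physical (device-side) precision
enum class Layout   { NCHW, NHWC /* ... */ };

struct TensorDescriptor {
    // Visible attributes: set by the user through the tensor assignment interface.
    int      shape[8];   // dimension sizes
    int      ndim;
    DataType dtype;      // precision of the data as stored in host memory

    // Invisible attributes: maintained inside the programming library,
    // shielding hardware details from the user.
    PhysType pdtype;     // precision of the data as stored in device memory
    Layout   layout;     // physical data layout chosen by the library
};
```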
  • in this case, the compilation process can be set up to realize an automatic precision selection function, that is, during compilation the data type with the fastest execution can be selected automatically for the calculation, and the process can be transparent to users.
  • the specific implementation process of the automatic accuracy selection function can be determined according to actual conditions, and will be specifically described in the subsequent disclosed embodiments.
  • tensor data can describe multiple data types, and which data types can be described specifically can be flexibly determined according to the actual situation.
  • in one example, tensor data can describe data types such as low bit-width and quantized data.
  • the embodiments of the present disclosure design different data types (including logical data types and physical data types) for tensor data, including:
  • double-precision floating point: double; single-precision floating point: float; half-precision floating point: half; fixed point: fxm.b (m represents the number of integer bits, b the total number of bits); quantization: Qts.b (s represents the scaling factor (scale) of the tensor, b represents the bias of the tensor); integer: intx; unsigned integer: uintx.
  • deep learning algorithms can support channel-wise quantization of images, where the scale and bias of each channel can be different. Although quantization by channel cannot be described by Qts.b, the scale and add operations provided by NCLAPI can be used instead, so NCLAPI's ability to express quantization is still complete. In addition, considering that other data types may appear in the future, NCLAPI can also support the extension of the data types described by tensor data.
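  • the exact encoding of the Qts.b type is not given in the text; the following worked sketch assumes the common convention real ≈ scale × q + bias purely for illustration:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Assumed convention for a Qts.b-style quantized type: real ≈ scale * q + bias,
// with q an 8-bit code. The patent names the scale and bias fields but does not
// fix the formula, so this is illustrative only.
struct QuantParams { float scale; float bias; };

std::int8_t quantize(float x, const QuantParams& p) {
    float q = std::round((x - p.bias) / p.scale);
    q = std::max(-128.0f, std::min(127.0f, q));   // clamp to the 8-bit range
    return static_cast<std::int8_t>(q);
}

float dequantize(std::int8_t q, const QuantParams& p) {
    return p.scale * static_cast<float>(q) + p.bias;
}

// Channel-wise quantization (a different scale/bias per channel) can then be
// expressed with per-channel QuantParams plus the scale and add operations the
// text mentions, rather than a single Qts.b descriptor.
```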
  • nclTensor is used to describe the input data, output data and model parameters of the deep learning processing layer, and tensor data corresponds to nclTensor; the most common deep learning processing layers, such as convolution, pooling, and RNN, all have high-dimensional input data, output data, and model parameters.
  • Figure 4 shows a diagram of the correspondence between tensor data and shape symbols according to an embodiment of the present disclosure. As shown in the figure, in one possible implementation, the shapes of the tensor data operated on in deep learning can be agreed according to the correspondence shown in the figure.
  • as described above, the operation data can be created or called according to user instructions received by the deep learning programming library interface; tensor data, as a possible implementation of the operation data, can likewise be created or called, and must be created before use.
  • the specific creation and calling process can be flexibly set according to the actual situation.
  • the calling process can assign values to tensor data.
  • the creation process can initialize the visible properties of the tensor data when it is created.
  • in one example, the creation process can first create an uninitialized tensor and then call the tensor assignment interface (nclSetTensorAttr) for attribute assignment.
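  • nclCreateTensor and nclSetTensorAttr are named in the present disclosure, but their signatures are not; the following usage sketch therefore declares assumed signatures purely for illustration:

```cpp
// Usage sketch only: the declarations below are assumptions, not the real NCLAPI.
struct nclTensor;                                    // opaque handle (assumed)
extern "C" int nclCreateTensor(nclTensor** t);       // assumed signature
extern "C" int nclSetTensorAttr(nclTensor* t,        // assumed signature
                                const int* shape, int ndim,
                                int dtype, const char* name);

int describeConvInput() {
    nclTensor* input = nullptr;
    int err = nclCreateTensor(&input);               // 1) create an uninitialized tensor
    if (err != 0) return err;

    const int shape[4] = {1, 3, 224, 224};           // NCHW input of an image model
    return nclSetTensorAttr(input, shape, 4,         // 2) assign the visible attributes
                            /*dtype=*/0, "input0");
}
```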
  • the operation instruction in NCLAPI is an abstract representation of a transformation. It can be used to represent a deep learning processing layer or a general calculation. In one example, the operation instruction can be used to represent a deep learning processing layer such as convolution, pooling, or a fully connected layer. In the embodiments of the present disclosure, the operations performed by operation instructions may be collectively referred to as plastic operations.
  • the operation instruction can be composed of three parts, which are input parameters (input params), output parameters (output params), and operation types (OpType).
  • the input parameters correspond to the input tensor set of the transformation, that is, the input parameters can be the nclTensors and pointers corresponding to all the input data.
  • the output parameter corresponds to the output tensor set of the transformation, that is, the output parameter can be nclTensor and pointers corresponding to all output data.
  • an operation instruction can be specified to allow zero or more (tensor data, pointer) pairs as input parameters, and one or more (tensor data, pointer) pairs as output parameters.
  • the operation type can be used to specify what kind of data transformation the operation instruction performs. Users can specify different operation types when creating operation instructions. These operation types can express three kinds of data transformation: value transformation, attribute transformation, and null transformation. Therefore, operation instructions can not only describe the deep learning processing layers, but also describe general calculations such as data segmentation, data splicing, and size scaling.
  • the deep learning programming library interface itself can provide a series of pre-defined operation instructions.
  • these operation instructions can be called build-in operation instructions (build-in operators).
  • Figure 5 shows the build-in operation instructions supported by NCLAPI according to an embodiment of the present disclosure, including which of these operation instructions support in-place operation.
  • the nature of the operation instruction can be flexibly set according to the actual situation.
  • in one example, the rest of the operation instructions can have the properties of unidirectionality, disjointness and idempotence, where unidirectionality means that the operation instruction does not change its input parameters (including tensor data and pointer data), disjointness means that an input parameter and an output parameter of the operation instruction must not have the same name, and idempotence means that the result of an operation instruction call depends only on the input parameters and is not affected by the number of calls.
  • as described above, operation data can be created or invoked according to user instructions received by the deep learning programming library interface; operation instructions, as a possible implementation of operation data, can likewise be created or invoked, and the specific creation and calling process can be set flexibly according to the actual situation.
  • the operation instruction needs to be created and then invoked.
  • the operation instruction invocation refers to mapping an operation instruction to the deep learning processor for execution.
  • by default, the operation instruction supports variable operation parameters at runtime. Specifically, the operation instruction only needs to be created once, the operation instruction call can be repeated, and different input parameters and output parameters can be specified each time the operation instruction is called.
  • in order to optimize program performance, the NCLAPI interface can support two important functions when executing operation instruction calls, namely asynchronous execution and automatic precision selection.
  • asynchronous execution means that the operation instruction calling function will return immediately after being called by the host side.
  • the CPU can perform other operations while the deep learning processor is performing calculations, thereby improving the overall utilization of the system and program performance.
  • NCLAPI provides the device synchronization interface nclSyncDevice, which will block the execution of the CPU until the device finishes the operation.
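  • a usage sketch of asynchronous invocation followed by device synchronization; nclInvokeOperator and nclSyncDevice are named in the text, while the signatures declared below are assumptions:

```cpp
// Sketch of asynchronous execution: the declarations are assumed, not the real NCLAPI.
struct nclOperator;                                   // opaque handles (assumed)
struct nclTensor;

extern "C" int nclInvokeOperator(nclOperator* op,     // assumed: returns immediately
                                 nclTensor* const* inputs,  void* const* inPtrs,
                                 nclTensor* const* outputs, void* const* outPtrs);
extern "C" int nclSyncDevice();                       // blocks the CPU until the device is done

void runAsync(nclOperator* conv,
              nclTensor** ins, void** inPtrs,
              nclTensor** outs, void** outPtrs) {
    nclInvokeOperator(conv, ins, inPtrs, outs, outPtrs);  // host side returns immediately
    // ... the CPU is free to do other work here while the device computes ...
    nclSyncDevice();                                       // wait before reading the results
}
```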
  • FIG. 6 shows a schematic diagram of an implementation of automatic precision selection according to an embodiment of the present disclosure.
  • automatic precision selection means that, before an operation instruction is executed on the device, the programming library automatically selects the data type with the shortest execution time and converts the original data into the optimal format before performing the operation.
  • the automatic precision selection will consider the time overhead of data format conversion and operation instruction execution together to ensure the shortest overall execution time of the operation.
  • the programming library will apply for temporary space to complete the data format conversion to ensure that the original input data will not be overwritten.
  • the operation instruction can also support operation connection, which means that the output parameter of an operation instruction A is used as the input parameter of another operation instruction B; after A completes its calculation, B then processes the output data of A.
  • the necessary and sufficient condition for two operation instructions A and B to be connectable is that A has at least one output tensor T1, B has at least one input tensor T2, and the attributes of T1 and T2 are exactly the same.
  • Operational connection is directional, and the connection direction is from the operation of providing data to the operation of using data.
  • the deep learning calculation model can be expressed as a function combination.
  • a function combination is a function sequence with only one-way function connection (in function combination, the direction of function connection can only be from left to right) . Since operation instructions can be obtained by function mapping, a deep learning calculation model can be expressed as an operation instruction sequence with only one-way operation connection (in the operation instruction sequence, the direction of operation instruction connection can only be from left to right). In the embodiments of the present disclosure, this kind of operation instruction sequence is called a one-way operation sequence.
  • the directed graph can be converted into a one-way operation sequence according to a certain algorithm.
  • the tensor alias technique can be used to eliminate the in-place operation.
  • the algorithm for converting the directed graph into a one-way operation instruction sequence may be: first convert the directed graph g into a directed acyclic graph g', and then perform topological sorting on the directed acyclic graph g' to obtain the one-way operation instruction sequence.
  • the implementation form of the tensor alias technique can be:
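  • the conversion algorithm above is only named in the text (build a directed acyclic graph g', then topologically sort it); the following generic topological-sort sketch illustrates the sorting step and is not the patent's specific algorithm:

```cpp
#include <queue>
#include <vector>

// Generic topological sort (Kahn's algorithm) for the second step of the
// conversion named above; the DAG-construction and tensor-alias steps are
// not detailed in the text and are omitted here.
std::vector<int> topologicalOrder(const std::vector<std::vector<int>>& adj) {
    std::vector<int> indegree(adj.size(), 0);
    for (const auto& edges : adj)
        for (int v : edges) ++indegree[v];

    std::queue<int> ready;
    for (int v = 0; v < static_cast<int>(adj.size()); ++v)
        if (indegree[v] == 0) ready.push(v);

    std::vector<int> order;
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        order.push_back(u);
        for (int v : adj[u])
            if (--indegree[v] == 0) ready.push(v);
    }
    return order;   // operation connections now only point "left to right"
}
```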
  • FIG. 7 shows a schematic diagram of a calculation model according to an embodiment of the present disclosure.
  • as shown in the figure, the calculation model can be expressed as two one-way operation instruction sequences, (conv, pool, bn, relu, add) and (conv, pool, relu, bn, add); calling the operation instructions in the order in which they appear in a one-way operation instruction sequence completes one execution of the calculation model. It should be noted that, because an operation connection from relu to add would run from right to left, the above calculation model cannot be expressed as the operation instruction sequence (conv, pool, add, bn, relu).
  • the embodiments of the present disclosure design parameter binding and operator specialization functions for operating instructions.
  • Parameter binding refers to fixing part of the input parameters or all input parameters of an operation instruction;
  • operation specialization refers to converting a parameter-bound operation instruction into a new operation instruction; the new operation instruction can be called a specialized operation instruction.
  • Specialized operating instructions still meet the definition and nature of operating instructions, and support all functions of operating instructions (support for specializing, fusion, etc.).
  • the classification of specialized operation instructions can be flexibly set according to the actual situation.
  • in one example, the specialized operation instructions can be divided according to the number of bound parameters.
  • in this case, the specialized instructions include fully specialized operation instructions, partially specialized operation instructions, and pseudo-specialized operation instructions; among them,
  • fully specialized operation instructions include operation instructions obtained by binding all input parameters of an operation instruction;
  • partially specialized operation instructions include operation instructions obtained by binding N input parameters of an operation instruction, where N is a positive integer less than the number of input parameters of the operation instruction;
  • pseudo-specialized operation instructions include operation instructions obtained by direct conversion without binding any input parameters.
  • in other words, specialized operation instructions fall into three categories: binding all input parameters of an operation yields a fully specialized operation instruction, binding some of the input parameters yields a partially specialized operation instruction, and binding no input parameters yields a pseudo-specialized operation instruction.
  • the bound parameters can be deleted from the input parameters of the operation instruction, and the user does not need to specify the bound parameters when calling an operation instruction whose parameters have been bound. Therefore, a fully specialized operation instruction may not need to be given any input parameters.
  • through parameter binding and specialization of operation instructions, the program can be partially evaluated and optimized during compilation, thereby reducing the running time of the operation instructions.
  • the specific implementation of parameter binding and specialization operation instructions is not limited.
  • in one example, the specialization of an operation instruction can be implemented through the nclSpecializeOperator interface, which compiles and optimizes the operation on the fly and returns a specialized operation instruction with a shorter execution time on the hardware device.
  • Parameter binding and specialization operation instructions can be widely used in real calculation models.
  • for example, the deep learning calculation model usually deals with fixed-size data, so the shape of a tensor can be parameter-bound and the operation then specialized in order to optimize the performance of the program.
  • as another example, the weights are constants trained in advance, so the weight data of an operation can be parameter-bound and the operation then specialized to obtain a specialized operation instruction, thereby optimizing the performance of the program.
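  • a usage sketch of the weight-binding example above; nclCreateOperator and nclSpecializeOperator are named in the text, while their signatures and the nclBindParam helper are assumptions made for illustration:

```cpp
// Sketch of binding the constant weights of a convolution and then specializing it.
// All declarations below are assumptions, not the real NCLAPI.
struct nclOperator;
struct nclTensor;

extern "C" int nclCreateOperator(nclOperator** op, const char* op_type);   // assumed
extern "C" int nclBindParam(nclOperator* op, int index,                    // hypothetical
                            nclTensor* t, const void* data);
extern "C" int nclSpecializeOperator(nclOperator* in, nclOperator** out);  // assumed

int buildSpecializedConv(nclTensor* weightDesc, const void* weightData,
                         nclOperator** convOut) {
    nclOperator* conv = nullptr;
    int err = nclCreateOperator(&conv, "conv");
    if (err != 0) return err;

    // Weights are constants trained in advance, so they can be parameter-bound.
    err = nclBindParam(conv, /*index=*/1, weightDesc, weightData);
    if (err != 0) return err;

    // Partial specialization: compile-time partial evaluation can now fold the
    // bound weights into the generated code, shortening run time.
    return nclSpecializeOperator(conv, convOut);
}
```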
  • the deep learning programming library usually only supports high-frequency and time-consuming processing layers (such as convolution, fully connected, RNN, pooling, and activation layers), resulting in the programming library not being able to support end-to-end execution well.
  • operation customization refers to writing an operation in a domain-specific programming language, and then inserting it into a programming library in the form of binary code. In the embodiments of the present disclosure, this operation is called a customized operation instruction (customized operator). Customized operating instructions still meet the definition and nature of operating instructions, and support all functions of operating instructions (support for specialization, fusion, etc.).
  • FIG. 8 shows a schematic diagram of generating a customized operation instruction according to an embodiment of the present disclosure. As shown in the figure, in a possible implementation manner, the generation process of the binary code corresponding to the customized operation instruction may include :
  • the compilation result is inserted into the static operation pool in a dynamic link or static link mode to obtain the binary code corresponding to the customized operation instruction.
  • the static operation pool in the above disclosed embodiment is a storage area in the deep learning programming library, and its specific implementation is described in detail in the subsequent disclosed embodiment.
  • the operation that needs to be executed can be realized by directly calling the customized operation instruction, avoiding repeated and useless instruction editing; and since the compilation result of the customized operation instruction has been saved in the deep learning programming library, the binary code corresponding to the customized operation instruction can be called directly during compilation without being recompiled multiple times, which effectively improves compilation efficiency and shortens compilation time.
  • the specific process of implementing operation customization may be: implement the customized transformation in a programming language to obtain the code to be inserted (insert code);
  • the code to be inserted is encapsulated, and the data format conversion is completed;
  • the code to be inserted is compiled and inserted into the deep learning programming library by dynamic link or static link to complete the operation customization; the customized operation instruction is then used in the same way as an ordinary operation instruction. It should be noted that the name of the customized operation instruction is specified by the user and cannot conflict with the operation name of any build-in operation.
  • the operation instructions proposed in the embodiments of the present disclosure may also support the operation fusion function.
  • Operator fusion refers to combining multiple plastic operations in the calling order into a new plastic operation.
  • this new operation can be referred to as a fusion operation instruction (fusion operator).
  • the fusion operation instruction still meets the definition and nature of the operation instruction, and supports all the functions of the operation instruction (support for specialization, fusion, etc.).
  • the formal expression of operation fusion is as follows:
  • op_fused = Fuse(op_1, op_2, ..., op_n)
  • the operation fusion satisfies the equivalence of weak transformation:
  • the calculation result of the fusion operation instruction and the calculation result of the original operation instruction sequence can be regarded as equal within an allowable error range; the formal expression is error ≤ epsilon, where epsilon is determined by the sensitivity of the application to accuracy.
  • the fusion operation instruction can participate in operation fusion again, which is called high-order fusion, and the output obtained by high-order fusion is still a plastic operation, which is expressed as follows:
  • op_fused2 = Fuse(op_1, op_2, ..., op_fused, ..., op_n)
  • Operational fusion can bring two benefits: optimized performance and simplified programming.
  • regarding optimized performance, the programming library can perform compilation optimization at the computational-graph level within the fusion operation (for example, optimization techniques such as linear transformation and constant folding can be used to reduce the overall amount of calculation and memory access), thereby reducing the execution time of operations on the device;
  • regarding simplified programming, a single fusion operation instruction can be used to represent commonly used functional blocks in deep learning algorithms (such as the residual blocks in ResNet) or even the entire calculation model.
  • operation fusion needs to meet certain conditions.
  • this condition may be: the operation instructions to be fused may be expressed as continuous subsequences in a one-way operation instruction sequence.
  • for example, the calculation model can be expressed as the following two one-way operation instruction sequences, seq1: (conv, pool, bn, relu, add) and seq2: (conv, pool, relu, bn, add); any continuous subsequence of these two operation instruction sequences can undergo operation fusion.
  • FIG. 9 shows a schematic diagram of operation fusion results according to an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of an operation fusion result according to an embodiment of the present disclosure; as shown in the figure, in an example, fusing the two operation instructions bn and add in seq2 yields the calculation model (conv, pool, relu, fusion).
  • the three operation instructions pool, relu, and add cannot be fused, because they form a continuous subsequence of neither seq1 nor seq2.
  • the creation process of the fusion operation instruction may include:
  • the operation connection relationship between the fusion operation sub-instructions is determined.
  • according to the connection relationship, the fusion operation sub-instructions are connected to obtain a connection result.
  • FIG. 11 shows a schematic diagram of the related programming interface of operation fusion according to an embodiment of the present disclosure.
  • based on these programming interfaces, a process for creating a fusion operation can be obtained.
  • FIG. 12 shows a schematic diagram of a process of creating a fusion operation according to an embodiment of the present disclosure.
  • the steps of creating a fusion operation can include:
  • the purpose of connecting the sub-operations is to build a calculation graph.
  • the nclFuseOperator interface will compile and optimize the calculation graph just in time to speed up the execution of the operation.
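  • a usage sketch of the fusion-creation steps above; only nclCreateOperator and nclFuseOperator are named in the text, and nclAddSubOperator and nclSetOperatorConnection are hypothetical stand-ins for the interfaces shown in Fig. 11:

```cpp
// Sketch of creating a fused operation from conv -> pool sub-operations.
// All declarations below are assumptions, not the real NCLAPI.
struct nclOperator;

extern "C" int nclCreateOperator(nclOperator** op, const char* op_type);      // assumed
extern "C" int nclAddSubOperator(nclOperator* fusion, nclOperator* sub);      // hypothetical
extern "C" int nclSetOperatorConnection(nclOperator* from, nclOperator* to);  // hypothetical
extern "C" int nclFuseOperator(nclOperator* fusion);                          // assumed

int buildFusedConvPool(nclOperator* conv, nclOperator* pool, nclOperator** fused) {
    int err = nclCreateOperator(fused, "fusion");     // 1) create the (empty) fusion operation
    if (err != 0) return err;

    nclAddSubOperator(*fused, conv);                  // 2) add the sub-operations to fuse
    nclAddSubOperator(*fused, pool);
    nclSetOperatorConnection(conv, pool);             // 3) connection: conv's output feeds pool

    // 4) fuse: the library builds the internal calculation graph and can apply
    //    graph-level optimizations (e.g. linear transformation, constant folding).
    return nclFuseOperator(*fused);
}
```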
  • the operation instructions proposed in the embodiments of the present disclosure may include fusion operation instructions, or other types of operation instructions such as build-in operation instructions and specialized operation instructions; the programming models implemented for different types of operation instructions also differ.
  • NCLAPI adopts a hybrid programming model, that is, it supports both imperative programming and declarative programming.
  • in one example, a hybrid programming model can be designed based on operation fusion, that is, the programming model that does not use fusion operations is an imperative programming model, the programming model that uses fusion operations is a declarative programming model, and the two programming models can be mixed.
  • the implementation of the programming model can be flexibly set according to the actual situation.
  • the programming model can be designed based on three factors, namely: data flow, execution flow and control flow.
  • regarding data flow: in order to complete the data transfer between the host and the device, the embodiments of the present disclosure design a data copy interface nclMemcpy for NCLAPI.
  • regarding control flow: in order to control the execution of the device and perform synchronization between the host and the device, the embodiments of the present disclosure design an operation call interface nclInvokeOperator and a device synchronization interface nclSyncDevice for NCLAPI.
  • the embodiments of the present disclosure divide the execution modes of the calculation model into three categories, namely: layer-by-layer call: call all operations in the calculation model one by one; fusion call: perform operation fusion on the entire calculation model, and then call the fusion operation; segmented fusion call: perform operation fusion on segments of the calculation model, and then call the fusion operations segment by segment.
  • the NCLAPI programming model is distinguished by these three types of execution methods: the execution method of layer-by-layer call corresponds to the imperative programming model; the execution method of fusion call corresponds to the declarative programming model; the execution method of segmented fusion call corresponds to hybrid programming.
  • the operating data is created or invoked according to the user instruction received by the deep learning programming library interface.
  • based on the above disclosed embodiments, the three execution modes of the calculation model are triggered according to user instructions, which can include: according to the user instruction, calling all corresponding operation instructions one by one; or, according to the user instruction, fusing all corresponding operation instructions to obtain a fusion operation instruction and then calling the fusion operation instruction; or, according to the user instruction, segmenting all corresponding operation instructions to obtain segmentation results, fusing each segment separately to obtain the corresponding segmented fusion operation instructions, and then calling the segmented fusion operation instructions segment by segment.
  • these three calling methods can be used to distinguish the NCLAPI programming model.
  • the execution method of layer-by-layer calling can correspond to the imperative programming model; the execution method of fusion calling can correspond to the declarative programming model; and the execution method of segmented fusion calling can correspond to hybrid programming.
  • FIG. 13 shows a schematic diagram of the data flow of a three-tier computing model according to an embodiment of the present disclosure.
  • as shown in the figure, the process of invoking operation instructions through the different calling methods may be: layer-by-layer call: call the nclInvokeOperator interface three times to perform the conv, pool, and fc operations; fusion call: first fuse the three operations conv, pool, and fc into a single operation, and then call the nclInvokeOperator interface once to perform the fused operation; segmented fusion call: fuse the two operations conv and pool, and then call the nclInvokeOperator interface twice to perform the fused operation and the fc operation respectively.
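  • a sketch of the three calling methods for the conv, pool, fc model of Fig. 13; the simplified nclInvokeOperator signature is an assumption, and parameter preparation (tensors, device memory, copies) is omitted:

```cpp
// Sketch of the three execution modes for the conv -> pool -> fc model.
// The simplified declaration below is an assumption, not the real NCLAPI.
struct nclOperator;
extern "C" int nclInvokeOperator(nclOperator* op);        // assumed, simplified

// 1) Layer-by-layer call (imperative): three separate invocations.
void callLayerByLayer(nclOperator* conv, nclOperator* pool, nclOperator* fc) {
    nclInvokeOperator(conv);
    nclInvokeOperator(pool);
    nclInvokeOperator(fc);
}

// 2) Fusion call (declarative): the whole model was fused into one operation.
void callFused(nclOperator* fusedConvPoolFc) {
    nclInvokeOperator(fusedConvPoolFc);
}

// 3) Segmented fusion call (hybrid): conv+pool were fused, fc stays separate.
void callSegmented(nclOperator* fusedConvPool, nclOperator* fc) {
    nclInvokeOperator(fusedConvPool);
    nclInvokeOperator(fc);
}
```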
  • FIG. 14 shows a block diagram of an implementation of a hybrid programming model according to an embodiment of the present disclosure.
  • the hybrid programming model is composed of an imperative programming model and a declarative programming model.
  • the complete programming process of the imperative programming model can be: initialization: initialize the device and operating environment; create operation: create a single operation, optionally bind parameters, and specialize the operation; call operation: prepare the operation parameters (including creating tensors and allocating device addresses), copy host-side input data to device memory, make the operation call, synchronize the device, and read the output results from device memory; release operation resources: destroy resources that are no longer used in the first two steps, including tensors, operations, memory, etc.; repeat the creation, call, and release steps until all operations in the calculation model are completed; exit: close the device and destroy the operating environment.
  • the complete programming process of the declarative programming model can be: initialization: initialize the device and operating environment; create sub-operations: create all sub-operations that need to participate in the fusion, and optionally bind the parameters of the sub-operations; create the fusion operation: create a fusion operation, add the sub-operations to be fused, specify the operation connection relationships between the sub-operations, set the input and output parameters of the fusion operation, perform operation fusion, and optionally specialize the fused operation; call the fusion operation: prepare the operation parameters (including creating tensors and allocating device addresses), copy host-side input data to device memory, make the operation call, synchronize the device, and read the output results from device memory; release operation resources: release the sub-operation resources and the fusion operation resources; exit: close the device and destroy the operating environment.
  • the embodiment of the present disclosure proposes an offline mode for NCLAPI to eliminate the secondary compilation overhead of operation specialization and operation fusion on the host side.
  • the implementation of the offline mode can be set flexibly according to the actual situation. In one example, operation specialization or operation fusion can be used in a separate program to optimize the operation instructions in advance, and the optimized instructions can then be used directly in another program.
  • the operation instruction optimized in advance may be called an offline operator.
  • the offline mode corresponds to the online mode.
  • FIG. 15 shows a schematic diagram of the difference between the offline mode and the online mode according to an embodiment of the present disclosure.
  • in contrast, in the online mode, operation specialization or operation fusion is performed first and the specialized operation instruction or fusion operation instruction is then called, all within the same program.
  • An offline cache is also designed for NCLAPI.
  • the implementation of offline caching can be flexibly determined according to the actual situation.
  • the offline cache includes an offline file and an index table; the offline file is used to save the pre-compiled results of offline operation instructions, and the index table is used to indicate the location of the pre-compiled result of an offline operation instruction within the offline file.
  • the specific implementation of the offline file and the index table can also be flexibly selected according to actual conditions.
  • offline operations are saved in the offline file, and the position in the offline file is indexed through the index table.
  • the index table is implemented using key-value pairs (Key, Value), where Key is the name of the offline operation, and Value is a pointer that points to the binary code corresponding to the offline operation in the offline file.
  • the specific interface of the offline operation instruction can be set according to the actual situation.
  • FIG. 16 shows a schematic diagram of the offline interface according to an embodiment of the present disclosure.
  • the interface implementation of the offline operation instruction may be: nclSaveOperator
  • the interface saves the specified operation instruction to the offline cache, and uses the string specified by op_type as the name and index key of the operation.
  • the use of offline operation instructions is exactly the same as the build-in operation instructions, but the operation type is different.
  • NCLAPI can first try to match a build-in operation instruction according to the given operation name; if there is no match, it then searches the offline cache to find out whether there is a corresponding offline operation instruction.
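  • a sketch of this lookup order (build-in first, then the offline cache's key-value index table); the container choices and the use of an offset in place of a pointer are illustrative, not the patent's implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Illustrative model of the offline cache: an offline file of pre-compiled
// binaries plus an index table of (Key = operation name, Value = location).
struct OfflineCache {
    std::vector<std::uint8_t> offlineFile;                     // pre-compiled binary code
    std::unordered_map<std::string, std::size_t> indexTable;   // offset stands in for a pointer
};

enum class OpSource { BuildIn, Offline, NotFound };

OpSource resolveOperator(const std::string& opName,
                         const std::unordered_set<std::string>& buildInOps,
                         const OfflineCache& cache) {
    if (buildInOps.count(opName)) return OpSource::BuildIn;        // 1) build-in first
    if (cache.indexTable.count(opName)) return OpSource::Offline;  // 2) then offline cache
    return OpSource::NotFound;
}
```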
  • NCLAPI can transfer operation data and has good adaptability to deep learning algorithms. Therefore, in a possible implementation manner, NCLAPI can be integrated into a deep learning framework.
  • in one example, the deep learning framework can be extended so that tensor data and operation instructions are encapsulated in the data structures of the deep learning framework, thereby realizing the integration of the deep learning framework with the deep learning programming library interface.
  • the implementation of integrating NCLAPI into the deep learning framework may also change accordingly.
  • in one example, the deep learning framework can be Caffe. Caffe contains three key data structures: Blob, Layer, and Net. Blob is mainly used to store data, complete data copies between the host and devices, and provide data access interfaces. Layer is used to represent operations (such as convolution, pooling, etc.) and takes Blobs as input and output. Caffe designed an inheritance system for Layer, and different operations can be implemented by writing subclasses of Layer. Layer has three key methods: Setup, Forward and Backward, which are respectively responsible for operation initialization, forward calculation and backward calculation. In order to support different devices, the same Layer subclass can contain multiple Forward and Backward methods.
  • Net saves all Blobs and Layers. It uses a directed acyclic graph composed of Layers to express the complete calculation model. Net has three key methods: Init, Forward and Backward.
  • the Init method converts the calculation model defined by NetParameter (converted from prototxt) into Blob and Layer, and calls the Setup method to initialize all layers.
  • the Forward method performs forward inference on the entire calculation model, and the Backward method performs reverse training on the calculation model.
  • Caffe uses prototxt to model the deep learning calculation model. The user describes the processing layer, data, and connection relationship of the processing layer according to the syntax of prototxt. Caffe receives the prototxt file, converts it into Blob, Layer, and Net and executes it.
  • based on this composition of Caffe, in an example, the embodiments of the present disclosure integrate NCLAPI into Caffe in the following manner:
  • expanding the Blob class can consist of encapsulating the nclTensor data structure and related interfaces (such as nclCreateTensor, nclSetTensorAttr, nclMemcpy) into Blob;
  • extending the Layer class can consist of encapsulating the nclOperator data structure and related interfaces of NCLAPI into Layer. Specifically, nclCreateOperator, nclSpecializeOperator, nclBindOutputTensor and other interfaces are encapsulated into Layer's Setup method, and nclInvokeOperator is encapsulated into Layer's Forward and Backward methods.
  • the embodiment of the present disclosure adopts an implementation method of adding a subclass to the Layer;
  • Expanding the Net class can be to encapsulate the operation fusion interface of NCLAPI into Net. Since all layers can be obtained in Net, Net is most suitable as a carrier for operation fusion.
  • in one example, the embodiments of the present disclosure add an operation fusion module to Net, so that the calculation model can be fused in segments or fused as a whole.
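  • a sketch of the Layer extension described above, wrapping NCLAPI calls in a Caffe-style Layer subclass; the Layer and Blob classes are simplified stand-ins rather than the real Caffe classes, and the NCLAPI signatures are assumptions as before:

```cpp
#include <vector>

// Simplified stand-ins and assumed declarations; not the real Caffe or NCLAPI.
struct nclOperator;
struct Blob {};
extern "C" int nclCreateOperator(nclOperator** op, const char* op_type);   // assumed
extern "C" int nclSpecializeOperator(nclOperator* in, nclOperator** out);  // assumed
extern "C" int nclInvokeOperator(nclOperator* op);                         // assumed, simplified

class Layer {                                                              // stand-in
public:
    virtual ~Layer() = default;
    virtual void Setup(const std::vector<Blob*>& bottom, const std::vector<Blob*>& top) = 0;
    virtual void Forward(const std::vector<Blob*>& bottom, const std::vector<Blob*>& top) = 0;
};

class NCLConvolutionLayer : public Layer {
public:
    void Setup(const std::vector<Blob*>&, const std::vector<Blob*>&) override {
        nclCreateOperator(&op_, "conv");                 // creation (and optionally
        nclOperator* spec = nullptr;                     // specialization) live in Setup
        if (nclSpecializeOperator(op_, &spec) == 0) op_ = spec;
    }
    void Forward(const std::vector<Blob*>&, const std::vector<Blob*>&) override {
        nclInvokeOperator(op_);                          // the invocation lives in Forward
    }
private:
    nclOperator* op_ = nullptr;
};
```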
  • the deep learning framework may be TensorFlow.
  • FIG. 17 shows an architecture diagram of TensorFlow according to an embodiment of the present disclosure.
  • The architecture of TensorFlow is designed with extensibility in mind: it reserves an operator addition and device registration mechanism and provides detailed official documentation, so TensorFlow itself is relatively easy to integrate with third-party deep learning programming libraries and deep learning processors. Since the distributed master of TensorFlow is responsible for dividing the calculation subgraphs and allocating tasks, in one example, the operation fusion of NCLAPI can be integrated into the distributed master module to perform operation fusion on the subgraphs.
  • the embodiment of the present disclosure integrates NCLAPI into TensorFlow in the following manner:
  • NCLAPI uses tensors to represent multi-dimensional data such as scalars, vectors, and matrices, and uses operation instructions to represent the deep learning processing layer.
  • Operation instructions support operation fusion, operation customization, operation specialization, variable operation parameters, offline optimization, and hybrid programming model (imperative + declarative), so as to solve the problems of performance optimization and programming flexibility.
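  • A minimal usage sketch of these tensors and operation instructions is shown below, contrasting the imperative path (create, specialize, invoke one operation) with the declarative path (fuse several operations, then invoke the fused result). The interface names appear in the disclosure, but the prototypes here are assumptions for illustration.

```cpp
#include <vector>

struct nclTensor;
struct nclOperator;

// Assumed NCLAPI declarations (illustrative prototypes only).
nclTensor*   nclCreateTensor();
nclOperator* nclCreateOperator(const char* op_name);
nclOperator* nclFuseOperator(const std::vector<nclOperator*>& ops);
void         nclSpecializeOperator(nclOperator* op, const std::vector<nclTensor*>& inputs);
void         nclInvokeOperator(nclOperator* op, const std::vector<nclTensor*>& inputs,
                               const std::vector<nclTensor*>& outputs);

void Example() {
  nclTensor* x = nclCreateTensor();
  nclTensor* y = nclCreateTensor();

  // Imperative style: build and run a single operation instruction immediately.
  nclOperator* conv = nclCreateOperator("conv");
  nclSpecializeOperator(conv, {x});   // bind known input parameters (specialization)
  nclInvokeOperator(conv, {x}, {y});

  // Declarative style: combine operations by calling order, then invoke the fused op.
  nclOperator* relu  = nclCreateOperator("relu");
  nclOperator* fused = nclFuseOperator({conv, relu});
  nclInvokeOperator(fused, {x}, {y});
}
```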
  • a specific implementation of NCLAPI can be deployed on the DaDianNao deep learning processor platform, and it is also integrated into mainstream deep learning frameworks such as Caffe and TensorFlow. Practice has proved that NCLAPI can run mainstream deep learning algorithms including image classification, target detection, and natural language processing. It has strong versatility and flexibility.
  • NCLAPI simulates neural calculus in the design of data structure and interface, thus proving that neural calculus can be used as a theoretical basis for guiding the design of deep learning programming library.
  • Figure 18 shows a schematic diagram of the comparison between NCLAPI and mainstream deep learning programming library interfaces according to an embodiment of the present disclosure. It can be seen from the figure that, compared with mainstream deep learning programming library interfaces, NCLAPI can support a hybrid programming model and can simultaneously meet the needs of performance optimization and programming flexibility; it can support operation customization, has strong operation scalability, and can better support end-to-end execution performance optimization; and it can support operation fusion, operation specialization, and offline mode, which can optimize the performance of the program from various angles.
  • Tensorflow integrated with NCLAPI can support a large number of deep learning calculation models end-to-end (without using any CPU operations).
  • the compilation method of the deep learning algorithm proposed in the embodiments of the present disclosure can be implemented based on the NCLAPI interface in a possible implementation manner.
  • In the embodiments of the present disclosure, after the operation data transmitted by NCLAPI is received, the instruction type of the operation instructions included in the operation data is judged, and the compilation operation corresponding to the instruction type is executed according to the judgment result to obtain the binary code of the deep learning algorithm. Therefore, based on this compilation method, the embodiments of the present disclosure also propose a deep learning programming library architecture adapted to the compilation method: the neural calculus library architecture (NCLA).
  • FIG. 19 shows a schematic diagram of the overall architecture of NCLA according to an embodiment of the present disclosure.
  • In a possible implementation manner, NCLA can be implemented by a just-in-time compilation system (NCLCS), a static operation pool (static operator pool, NCLSOPP) and a runtime system (runtime system, NCLRT).
  • NCLA can be integrated with NCLAPI and compiled according to the operating data passed by NCLAPI.
  • the just-in-time compilation system can perform calculations and data collaborative compilation optimization on any operation instruction at runtime, and generate efficient binary code;
  • the static operation pool can be used to save the optimized binary code, thereby eliminating the overhead of secondary compilation;
  • the runtime system can provide basic functions such as device management, memory management, operation execution, and device synchronization, so that operation instructions can be deployed end-to-end to the deep learning processor for execution.
  • the embodiment of the present disclosure may adopt just-in-time compilation optimization (JIT) to design the NCLA.
  • Just-in-time compilation and optimization can dynamically adjust optimization strategies for different algorithms and different hardware platforms at runtime to achieve universal performance optimization.
  • just-in-time compilation introduces additional runtime overhead.
  • The common method to alleviate this problem is to introduce a just-in-time compilation cache. Deep learning algorithms are highly reusable, so the cache can play a significant role.
  • In a possible implementation manner, the static operation pool is used as the cache of the just-in-time compilation system. Due to the extremely high degree of coupling between calculations and data, the performance of the deep learning processor cannot be fully exploited by optimizing calculations or data alone. Therefore, in a possible implementation manner, a compilation framework for collaborative optimization of calculations and data can be used to design the just-in-time compilation.
  • the operating instructions contained in the operating data passed by NCLAPI can be shunted.
  • the specific shunting method can be flexibly selected according to the actual situation.
  • In a possible implementation manner, the operation instructions can be divided into static operation instructions and dynamic operation instructions. Dynamic operation instructions can trigger NCLA to perform just-in-time compilation, and static operation instructions can trigger NCLA to perform search operations.
  • the specific instructions contained in the static operation instructions and the dynamic operation instructions can be determined according to the actual situation of the operation instructions transmitted by the deep learning programming library interface, and are not limited to the following disclosed embodiments.
  • the static operation instruction may include one or more of a custom operation instruction, a build-in operation instruction, and an offline operation instruction; among them,
  • Customized operating instructions include operating instructions with custom functions implemented according to the encoding method and packaging form of the operating instructions;
  • the build-in operating instructions include the own operating instructions included in the deep learning programming library interface;
  • Offline operation instructions include pre-compiled dynamic operation instructions, wherein the pre-compiled result is stored in the offline cache.
  • the dynamic operation instructions may include specialized operation instructions and/or fusion operation instructions; among them,
  • Specialized operating instructions include operating instructions converted after binding input parameters to the operating instructions
  • the fusion operation instruction includes an operation instruction obtained by combining multiple operation instructions according to the calling sequence.
  • step S13 may include:
  • Step S131 Determine the instruction type of the operation instruction.
  • Step S132 when the instruction type is a static operation instruction, according to the name of the static operation instruction, the corresponding binary code is searched in the static operation pool as the binary code of the deep learning algorithm.
  • When the instruction type is a static operation instruction, the corresponding binary code is searched in the static operation pool as the binary code of the deep learning algorithm. In this way, the corresponding binary code can be looked up directly for reusable operation instructions, which avoids repeated compilations, eliminates secondary compilation overhead, and improves compilation efficiency.
  • step S132 may include:
  • the binary code corresponding to the name is searched in the static operation pool.
  • the binary code is returned as the binary code of the deep learning algorithm.
  • step S132 may further include:
  • when the search result is a failure, the static operation instruction is used as a dynamic operation instruction for real-time compilation.
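  • A minimal sketch of this dispatch is shown below. Only the behavior is taken from the description (static instructions trigger a search of the static operation pool, and a miss or a dynamic instruction triggers just-in-time compilation); the types and helper names are assumptions.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using BinaryCode = std::vector<unsigned char>;
enum class InstrType { Static, Dynamic };

struct OperationInstruction {
  std::string name;
  InstrType   type;
};

// The static operation pool is modeled here as a simple name -> binary map.
std::unordered_map<std::string, BinaryCode> g_static_pool;

std::optional<BinaryCode> LookupStaticPool(const std::string& name) {
  auto it = g_static_pool.find(name);
  if (it == g_static_pool.end()) return std::nullopt;
  return it->second;
}

BinaryCode JitCompile(const OperationInstruction& /*instr*/) {
  // Placeholder for the just-in-time compilation system described later (CDUCA).
  return BinaryCode{};
}

// Judge the instruction type, search the static operation pool for static
// instructions, and fall back to real-time compilation when the search misses.
BinaryCode Compile(const OperationInstruction& instr) {
  if (instr.type == InstrType::Static) {
    if (auto hit = LookupStaticPool(instr.name)) return *hit;  // reuse, no recompilation
  }
  return JitCompile(instr);
}
```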
  • the customized operation instruction can be written by the user using a deep learning domain programming language; it is first compiled in advance to generate binary code, and then inserted into the static operation pool by dynamic linking or static linking.
  • the binary code of the offline operation instruction can be generated by the instant compilation system, and the user inserts it into the static operation pool by calling the nclSaveOperator interface.
  • the build-in operation instruction is the operation instruction that comes with NCLAPI. Considering that the program optimization in the deep learning field is extremely sensitive to algorithms and hardware, in order to reduce development costs, the embodiments of the present disclosure do not use manual optimization to implement the build-in operation. Instead, use the just-in-time compilation system to pseudo-specialize the operation instruction (specialization without binding any input parameters) to generate the binary code corresponding to the build-in operation instruction, and then insert it into the static operation pool.
  • the static operation pool can be used to store binary codes corresponding to static operation instructions, and its specific implementation can be flexibly set according to actual conditions, and is not limited to the following disclosed embodiments.
  • In a possible implementation manner, the static operation pool may include a static operation source file, and the static operation source file includes static code segments, static data segments, dynamic code segments, and dynamic data segments; among them, the static code segment is used to save the binary code corresponding to the build-in operation instruction; the dynamic code segment is used to save the binary code corresponding to the customized operation instruction; the static data segment is used to save the tensor data corresponding to the build-in operation instruction; and the dynamic data segment is used to save the tensor data corresponding to the customized operation instruction.
  • the binary code corresponding to the offline operation instruction can be kept in the offline cache. Therefore, in a possible implementation manner, the static operation pool may also include the offline cache.
  • FIG. 20 shows a schematic diagram of the implementation of a static operation pool according to an embodiment of the present disclosure.
  • four types of segments can be added to the source file (.so) of the deep learning programming library: a static code segment, a static data segment, a dynamic code segment, and a dynamic data segment.
  • the static code segment and the static data segment are used to store the binary code corresponding to the build-in operation instruction and the corresponding constant data
  • the dynamic code segment and the dynamic data segment are used to store the binary code and corresponding constant data corresponding to the customized operation instruction.
  • The reason for distinguishing between dynamic segments and static segments is as follows: customized operation instructions are written by users and will continue to grow, so the embodiments of the present disclosure design dynamic segments with variable sizes for them; the build-in operation instructions come with the deep learning programming library and will not change, so the embodiments of the present disclosure design static segments with a constant size for them.
  • In one example, the embodiment of the present disclosure does not embed the binary code corresponding to the offline operation instruction into the source file (.so) of the deep learning programming library, because an offline operation instruction usually corresponds to a heavyweight calculation model (AlexNet, ResNet, etc.) and the storage space it occupies is huge; therefore, the embodiment of the present disclosure separately designs an offline cache for offline operations, and uses a file system to store offline operation instructions.
  • the offline cache is composed of an index table and offline files.
  • the index table is implemented by key-value pairs (Key, Value). Key is the type name of the offline operation instruction, and Value is the binary code reference pointer corresponding to the offline operation instruction.
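  • A minimal sketch of such a key-value index is shown below. The entry layout (operation name as Key, a reference to the pre-compiled binary in an offline file as Value) follows the description above; the container and field names are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// One index entry: where the pre-compiled binary code of an offline operation
// instruction is located inside the offline file.
struct OfflineEntry {
  std::string file;    // offline file that holds the binary code
  std::size_t offset;  // byte offset of the binary code inside the file
  std::size_t size;    // length of the binary code in bytes
};

class OfflineCache {
 public:
  // Key = type name of the offline operation instruction, Value = binary reference.
  void Insert(const std::string& op_name, const OfflineEntry& entry) {
    index_[op_name] = entry;
  }
  const OfflineEntry* Find(const std::string& op_name) const {
    auto it = index_.find(op_name);
    return it == index_.end() ? nullptr : &it->second;
  }
 private:
  std::unordered_map<std::string, OfflineEntry> index_;
};
```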
  • searching for the binary code corresponding to the name in the static operation pool may include:
  • the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation are sequentially searched to obtain the binary code corresponding to the name.
  • the user can specify the name of the operation when calling the operation creation interface (nclCreateOperator).
  • NCLA uses the name of the operation as an index to find the corresponding binary code in the static operation pool.
  • the order of search is: customized operation instructions, build-in operation instructions, offline operation instructions. If the search hits, the corresponding binary code reference is returned; otherwise, just-in-time compilation is triggered.
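  • The ordered lookup can be sketched as follows; the three sub-pools are modeled as simple name-to-binary maps, and only the search order (customized, then build-in, then offline) is taken from the description above.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using BinaryCode = std::vector<unsigned char>;
using Pool       = std::unordered_map<std::string, BinaryCode>;

// Search order: customized -> build-in -> offline. A hit returns the cached
// binary code; a miss means the caller must trigger just-in-time compilation.
std::optional<BinaryCode> SearchStaticOperationPool(const std::string& op_name,
                                                    const Pool& custom_ops,
                                                    const Pool& buildin_ops,
                                                    const Pool& offline_ops) {
  for (const Pool* pool : {&custom_ops, &buildin_ops, &offline_ops}) {
    auto it = pool->find(op_name);
    if (it != pool->end()) return it->second;
  }
  return std::nullopt;
}
```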
  • step S13 may further include step S133: when the instruction type is a dynamic operation instruction, the dynamic operation instruction is compiled in real time to obtain the real-time compilation result as the binary code of the deep learning algorithm.
  • In one example, for dynamic operation instructions produced by the operation fusion or operation specialization interfaces (nclFuseOperator, nclSpecializeOperator), real-time compilation, also known as just-in-time compilation, is performed: the just-in-time compilation system generates highly optimized binary code for the operation, which is handed over to the runtime system for execution when the operation is called.
  • In one example, the user can also call the nclSaveOperator interface to save optimized operation instructions (such as fusion operation instructions); the binary code corresponding to the operation instruction is then saved in the offline cache with its operation name as the search index. Otherwise, in order to ensure that the volume of the programming library does not expand rapidly, the binary code corresponding to an unsaved operation instruction is discarded after the program exits.
  • In a possible implementation manner, the just-in-time compilation system can adopt a computation and data unified compilation architecture (CDUCA).
  • FIG. 21 shows a schematic diagram of the architecture of CDUCA according to an embodiment of the present disclosure.
  • In one example, CDUCA includes three components: a computational graph engine, a code generator, and a data optimizer, which coordinate operations and data at multiple levels to generate efficient binary code.
  • the specific methods of the three components are not unique.
  • the functions that the three components can implement are:
  • Computational graph engine: uses optimization techniques such as linear transformation and constant folding to perform algorithm-oriented high-level optimization on the original computational graph and constant data, and generates an optimized computational graph and constant data;
  • Code generator: adopts a cost-model-based heuristic search strategy to perform calculation and data collaborative compilation optimization on the computational graph and constant data, and generates efficient target platform code and data descriptors;
  • Data optimizer: parses the data descriptors, performs target-platform-oriented optimization of the constant data such as splitting, reordering, and precision conversion, and then packages the optimized constant data and target platform code (address relocation, etc.) to generate the final binary code.
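  • The three-stage flow described above can be sketched as a simple pipeline. The structure (graph engine, then code generator, then data optimizer and packaging) follows the component descriptions; all type and function names are assumptions, and the stage bodies are left as stubs.

```cpp
#include <cstdint>
#include <vector>

struct ComputationGraph {};                      // placeholder for the computational graph
using ConstantData  = std::vector<float>;        // constant (model) data
using HardwareCode  = std::vector<std::uint8_t>; // target platform code
struct DataDescriptor {};                        // tells the optimizer how to lay out data
using BinaryCode    = std::vector<std::uint8_t>;

// Stage 1: algorithm-oriented optimization (linear transformation, constant folding).
void GraphEngine(ComputationGraph&, ConstantData&) { /* stub */ }

// Stage 2: cost-model-driven code generation, emitting code and a data descriptor.
void CodeGenerator(const ComputationGraph&, const ConstantData&,
                   HardwareCode&, DataDescriptor&) { /* stub */ }

// Stage 3: platform-oriented data optimization (split/reorder/precision) plus packaging.
BinaryCode DataOptimizer(const HardwareCode& code, const DataDescriptor&,
                         const ConstantData&) { return code; /* stub */ }

BinaryCode CompileWithCduca(ComputationGraph graph, ConstantData constants) {
  GraphEngine(graph, constants);                // co-optimize graph and constant data
  HardwareCode   code;
  DataDescriptor desc;
  CodeGenerator(graph, constants, code, desc);  // generate platform code and descriptor
  return DataOptimizer(code, desc, constants);  // optimize data, relocate, package
}
```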
  • step S133 may include:
  • Step S1331 According to the dynamic operation instruction, the original calculation graph and the original model data corresponding to the dynamic operation instruction are obtained.
  • Step S1332 according to the original calculation graph and the original model data, perform collaborative processing for the deep learning algorithm to obtain the first calculation graph and the first model data.
  • Step S1333 Generate hardware instructions and data descriptors according to the first calculation graph.
  • Step S1334 Perform hardware platform-oriented processing on the first model data according to the data descriptor to obtain second model data.
  • Step S1335 Obtain the binary code of the deep learning algorithm according to the hardware instruction and the second model data.
  • step S1331 may include:
  • the original calculation graph corresponding to the dynamic operation instruction is obtained.
  • the original model data is obtained.
  • In one example, the method of obtaining the original calculation graph and the original model data corresponding to the dynamic operation instruction can be: the original model data can be obtained directly according to the parameters of the dynamic operation instruction, and the original calculation graph can be generated by NCLA parsing the dynamic operation instruction.
  • the specific analysis process can be flexibly determined according to the actual situation. It is not limited here.
  • FIG. 22 shows the form of the original calculation graph according to an embodiment of the present disclosure. As shown in the figure, in an example, the original calculation graph contains two kinds of graph nodes, Tensor and Operator, respectively corresponding to the input and output data and the data transformation of the operation instruction.
  • step S1332 can correspond to the calculation graph engine component in CDUCA, and the implementation of the calculation graph engine is not limited and can be flexibly determined according to actual conditions.
  • the nodes in the calculation graph can be used to represent the operations performed in the deep learning process.
  • Common operations in deep learning algorithms can include convolution, fully connected, activation, pooling, batch normalization, scaling, and so on. These operations can be divided into linear transformation operations and non-linear transformation operations according to their specific implementation forms. All linear transformation operations can be expressed as multiplications and additions of vectors or matrices.
  • The general expression of a linear transformation operation can therefore be written (one common operand order is assumed here) as: Y = X * W + B, where X and Y are variables, and W and B are model data constants.
  • Any linear transformation operation can be expressed in the above general expression form; an operation that cannot be expressed in the above general expression form is a nonlinear transformation operation.
  • convolution, full connection, batch normalization, and scaling operations are linear transformation operations
  • pooling and activation are nonlinear transformation operations.
  • X and Y represent the fully connected input neuron matrix and the output neuron matrix, respectively
  • W represents the weight matrix
  • B represents the bias matrix.
  • Other linear transformation operations are specific instances of the general expression and are not repeated here.
  • the original linear transformation operation can be optimized through linear transformation optimization means and constant folding optimization means, and finally simplified into a one-step linear transformation operation.
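  • As a worked example of this simplification, using the general form above (with the operand order Y = X·W + B assumed), two consecutive linear transformation operations can be folded into a single one, and the new constants can be computed at compile time by constant folding:

```latex
Y_1 = X W_1 + B_1, \qquad Y_2 = Y_1 W_2 + B_2
\;\Rightarrow\;
Y_2 = (X W_1 + B_1) W_2 + B_2
    = X \underbrace{(W_1 W_2)}_{W'} + \underbrace{(B_1 W_2 + B_2)}_{B'}
    = X W' + B'
```

Since W' and B' are constants, the two operations reduce to one linear transformation with pre-computed model data.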
  • Through the above optimization, the amount of calculation is reduced while the model data is compressed: on the one hand, the storage overhead of the model data can be reduced, and on the other hand, the amount of memory access at runtime can be reduced.
  • the specific implementation form of collaborative processing for deep learning algorithms may be linear transformation and constant folding.
  • FIG. 23 shows a working flow chart of a calculation graph engine according to an embodiment of the present disclosure. As shown in the figure, in a possible implementation manner, step S1332 may include:
  • Step S13321 Read the original calculation graph and original model data.
  • Step S13322 Identify the continuous linear transformation operation node in the original calculation graph.
  • step S13323 the continuous linear transformation operation node is processed through linear transformation and constant folding to obtain the first calculation graph and the first model data.
  • the continuous linear transformation operation node may include: at least 2 continuous linear transformation operation nodes.
  • the continuous linear transformation operation node can correspond to at least 2 continuous linear transformation operations, and the number of continuous linear transformation operation nodes is not limited and can be determined according to the actual situation of the calculation graph.
  • step S13323 may include:
  • the continuous linear transformation operation nodes in the original calculation graph can be node 1, node 2, node 3, and node 4 that are connected in sequence, and the corresponding model data combination can be model data group 1; through linear transformation and constant folding, node 1, node 2, node 3, and node 4 can be merged into one node, finally obtaining node 5.
  • since model data group 1 undergoes constant folding, the model data contained in it may be merged, finally yielding a merged model data group, which can be called model data group 2.
  • the continuous linear transformation operation nodes in the original calculation graph can be the sequentially connected node 1, node 2, node 3, and Node 4, the corresponding model data combination can be model data group 1.
  • node 1, node 2, and node 3 can be merged into one node 6, instead of merging node 6 and node 4.
  • node 4 and node 6 can be finally obtained.
  • the corresponding model data in model data group 1 may be merged, finally yielding a merged model data group, which can be called model data group 3. Since the model data corresponding to the original node 4 has not undergone constant folding, model data group 3 and the model data corresponding to node 4 are combined to obtain model data group 4 corresponding to the current overall linear transformation operation.
  • the neural network corresponding to the deep learning algorithm may be the classic image classification network Resnet.
  • FIG. 24 shows a schematic diagram of a structure including substructures in the image classification network according to an embodiment of the present disclosure. As can be seen from the figure, in the Resnet network, there can be a substructure of convolution + batch normalization + scaling.
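  • For the convolution + batch normalization + scaling substructure mentioned above, the fold can be carried out per output channel, since batch normalization and scaling together form a per-channel affine transform. The sketch below shows the standard folding arithmetic for one output channel; it is an illustration consistent with the linear-transformation folding described above, not code taken from the disclosure.

```cpp
#include <cmath>
#include <vector>

// Per-output-channel parameters of a conv + batch-norm + scale substructure.
struct FoldableChannel {
  std::vector<float> conv_weights;  // convolution weights for this output channel
  float conv_bias;
  float bn_mean, bn_var, bn_eps;    // batch-normalization statistics
  float scale_gamma, scale_beta;    // scaling layer parameters
};

// Batch norm + scale apply y' = gamma * (y - mean) / sqrt(var + eps) + beta = a*y + b,
// so the folded convolution uses weights a*w and bias a*bias + b.
void FoldBatchNormAndScale(FoldableChannel& ch) {
  const float a = ch.scale_gamma / std::sqrt(ch.bn_var + ch.bn_eps);
  const float b = ch.scale_beta - a * ch.bn_mean;
  for (float& w : ch.conv_weights) w *= a;  // new constant weights
  ch.conv_bias = a * ch.conv_bias + b;      // new constant bias
}
```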
  • Through the above process, the calculation process of the deep learning algorithm can be optimized, and the first calculation graph and the first model data after collaborative processing for the deep learning algorithm can be obtained, thereby reducing the amount of memory access when running the deep learning algorithm and also reducing the storage overhead of the model data during storage.
  • step S1333 can be used to generate hardware instructions and data descriptors.
  • the specific implementation form of step S1333 is not limited, and any process that can generate hardware instructions based on a calculation graph can be used as an implementation form of step S1333.
  • The main purpose of step S1333 is to generate hardware instructions readable by the corresponding hardware platform based on the first calculation graph optimized in step S1332.
  • an on-chip memory (on-chip memory) is often designed in a part of the hardware platform close to the computing position. In one example, this part may be close to the deep learning processor computing unit.
  • the speed of accessing the on-chip cache is often faster than that of accessing other locations.
  • the other locations can be, for example, off-chip double data rate synchronous dynamic random access memory (DDR, Double Data Rate Synchronous Dynamic Random Access Memory).
  • step S1333 may include: processing the first calculation graph according to the cost model, and combining the heuristic search strategy to obtain hardware instructions and data descriptors.
  • step S1333 may include:
  • Step S13331 Model the first calculation graph through the cost model to generate a search space and an objective function.
  • step S13332 a search is performed in the search space through a heuristic search strategy, and when the target function reaches the threshold, hardware instructions and data descriptors for the hardware platform are generated.
  • In the above process, the calculation graph is used as input and passed through a model that can generate hardware instructions, and the output results of the model are optimized, so that the final hardware instructions oriented to the hardware platform can be obtained. The specific model applied to generate hardware instructions is not limited.
  • a cost model can be used to model the first calculation graph.
  • the cost model estimates the total time of the operation performed on the deep learning processor (including overheads such as data format conversion). The main factors it considers are the operation memory access time, the operation time, and the overlap rate of the two; these three factors directly determine the performance of the program.
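  • A toy version of such an estimate is sketched below; the three factors (memory access time, operation time, and their overlap rate) come from the description above, while the concrete formula is only an assumption for illustration.

```cpp
#include <algorithm>

// Rough runtime estimate for one candidate instruction combination.
// compute_time and memory_time are estimated durations; overlap in [0, 1] is the
// fraction of the shorter phase that the pipeline can hide behind the other phase.
double EstimateRuntime(double compute_time, double memory_time, double overlap) {
  const double hidden = std::min(compute_time, memory_time) * overlap;  // overlapped part
  return compute_time + memory_time - hidden;                            // remaining time
}
```

During the search, the candidate whose estimate is lowest (or below the threshold of the objective function) would be kept as the final instruction combination.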
  • the deep learning processor includes an independent arithmetic unit and a memory access unit.
  • FIG. 26 shows a schematic diagram of the instruction pipeline according to an embodiment of the present disclosure.
  • the generated hardware instructions are actually a search space composed of multiple possible hardware instructions.
  • The hardware instructions contained in the search space can indicate the behavior of the hardware platform. In one example, a hardware instruction can instruct the hardware platform to load a complete piece of data at one time and place it in the on-chip cache; in one example, a hardware instruction can instruct the hardware platform to divide a complete piece of data into several loads and swap it in and out on demand between the on-chip cache and the off-chip DDR; in one example, a hardware instruction can indicate how much data one vector operation should process. Since there are many kinds of hardware instructions in the search space, it is necessary to search the search space to find the optimal combination of instructions as the final hardware instructions.
  • the cost model can give the estimated running time of the hardware instruction. Therefore, after passing the first calculation graph through the cost model, the corresponding objective function can also be generated.
  • the specific form of the objective function is not limited here. It can be flexibly set according to the actual situation.
  • The objective function can indicate how much time the hardware instructions searched out of the search space and finally generated will take at runtime. Therefore, when the objective function reaches the threshold, it indicates that the generated hardware instructions meet the runtime requirements, that is, the generated hardware instructions can deliver good performance when run on the hardware platform. Since the specific implementation form of the objective function is not limited, the threshold of the objective function is also not limited and can be flexibly set according to the actual situation.
  • How to search in the generated search space is also not limited.
  • a heuristic search strategy can be used to search in the search space.
  • the purpose of the heuristic search strategy is to improve the search efficiency in the search space, so it is not limited to a specific search method, and can be flexibly selected according to the actual situation.
  • the heuristic search strategy may include: a search strategy that improves the utilization of the on-chip cache in the hardware platform; or, on the basis of ensuring the utilization of computing units and access units in the hardware platform, reducing Search strategy for computing granularity and memory access granularity.
  • the purpose of the heuristic search strategy can be to use up the on-chip cache as much as possible.
  • Therefore, the search strategy can be set as a search strategy that improves the utilization of the on-chip cache in the hardware platform. In one example, the purpose of the heuristic search strategy can be to select, under the premise of ensuring the utilization of the computing unit and the memory access unit in the hardware platform, a combination of instructions with relatively small computation and memory access granularity, so as to increase the overlap of computation and memory access; in this case, the search strategy is defined as a search strategy that reduces the computation granularity and memory access granularity on the basis of ensuring the utilization of the computing units and memory access units in the hardware platform. In one example, the heuristic search strategy can also be a balance of the above two strategies, that is, a search strategy that makes the above two strategies reach a comprehensively optimal situation.
  • step S13332 can also generate data descriptors to guide how to further optimize the model data.
  • In one example, step S1334 may include: splitting the first model data according to the data descriptor and the calculation requirements in the hardware platform; or, aligning the first model data according to the data descriptor and the calculation requirements in the hardware platform; or, performing dimension transformation on the first model data according to the data descriptor and the calculation requirements in the hardware platform; or, selecting the accuracy of the first model data according to the data descriptor and the calculation requirements in the hardware platform.
  • FIG. 28 shows a schematic diagram of optimizing data after splitting according to an embodiment of the present disclosure.
  • In one example, the number of channels of the data can be 2.
  • the data descriptor requires the following transformation of the data: height and width are divided into 2 at the same time; the arrangement of the data is adjusted from HWC to CHW; and the accuracy of data operation is adjusted from float32 to half.
  • the data optimizer implements physical transformations such as splitting, rearranging, precision conversion, and alignment on the data from the DDR perspective, and finally obtains the optimized data.
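  • A sketch of two of these physical transformations is shown below: rearranging interleaved HWC data into planar CHW order, and converting float32 values to a lower precision (modeled here as simple symmetric int8 quantization; the actual format and layout depend on the data descriptor and the hardware platform).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Reorder interleaved HWC data into planar CHW order.
std::vector<float> HwcToChw(const std::vector<float>& hwc, int h, int w, int c) {
  std::vector<float> chw(static_cast<std::size_t>(h) * w * c);
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x)
      for (int ch = 0; ch < c; ++ch)
        chw[(static_cast<std::size_t>(ch) * h + y) * w + x] =
            hwc[(static_cast<std::size_t>(y) * w + x) * c + ch];
  return chw;
}

// Simplified symmetric quantization of float32 data to int8.
std::vector<std::int8_t> QuantizeToInt8(const std::vector<float>& src, float& scale_out) {
  float max_abs = 0.f;
  for (float v : src) max_abs = std::max(max_abs, std::fabs(v));
  const float scale = max_abs > 0.f ? 127.f / max_abs : 1.f;
  std::vector<std::int8_t> dst(src.size());
  for (std::size_t i = 0; i < src.size(); ++i)
    dst[i] = static_cast<std::int8_t>(std::lround(std::clamp(src[i] * scale, -127.f, 127.f)));
  scale_out = scale;
  return dst;
}
```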
  • In one example, the hardware platform may have certain requirements for data alignment during the process of computing according to hardware instructions; if alignment is performed in the hardware platform, it will greatly reduce the performance of the hardware platform at runtime. Therefore, according to the data descriptor, the data to be aligned can be aligned in advance during the compilation process, thereby speeding up the memory access speed when the hardware platform is working.
  • The specific alignment standards and methods are not limited here and can be determined according to the actual requirements of the hardware platform. In an example, in the process of computing according to hardware instructions on the hardware platform, some algorithms, such as convolution algorithms, may need to interpret data as a multi-dimensional array, and the arrangement order of the multi-dimensional array will affect the number of memory access jumps of the hardware platform and thus the runtime performance of the hardware platform.
  • Therefore, the model data can be dimensionally transformed in advance according to the calculation requirements during the compilation process, to improve memory access locality and minimize the memory access jumps of the hardware platform; the specific dimension transformation and the method of transformation are not limited here and can be determined according to the actual requirements of the hardware platform. In one example, because different operation precisions correspond to different operation speeds, the hardware platform may support different calculation accuracy requirements. For example, the hardware platform may support low-precision calculations, and if high-precision calculations are used, the operating speed of the hardware platform may be reduced; therefore, the accuracy of the data can be selected in advance during compilation according to the data descriptor. The specific accuracy selected is again not limited and can be determined according to the requirements of the hardware platform.
  • In one example, the preferred accuracy can be 16-bit quantization accuracy; in another example, the preferred accuracy may be 8-bit quantization accuracy.
  • The above process of processing and optimizing the first model data into the second model data can be any one of the above four methods, or any combination of the above four methods, and it can also include other ways that are more conducive to improving the speed of the hardware platform, which are not listed here one by one.
  • the performance of the hardware platform at runtime can be greatly improved, such as improving memory access speed and memory access efficiency.
  • In a possible implementation manner, step S1335 can include: packaging the hardware instructions and the second model data to obtain the binary code of the deep learning algorithm.
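  • Packaging can be as simple as laying the instruction segment and the data segment out in one buffer behind a small size header, as in the sketch below. The container format is an assumption; the disclosure only states that the hardware instructions and the second model data are packaged (with steps such as address relocation) into the final binary code.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal container: [two 64-bit sizes][hardware instructions][second model data].
std::vector<std::uint8_t> PackageBinary(const std::vector<std::uint8_t>& instructions,
                                        const std::vector<std::uint8_t>& model_data) {
  const std::uint64_t sizes[2] = {instructions.size(), model_data.size()};
  std::vector<std::uint8_t> out(sizeof(sizes) + instructions.size() + model_data.size());
  std::memcpy(out.data(), sizes, sizeof(sizes));
  if (!instructions.empty())
    std::memcpy(out.data() + sizeof(sizes), instructions.data(), instructions.size());
  if (!model_data.empty())
    std::memcpy(out.data() + sizeof(sizes) + instructions.size(),
                model_data.data(), model_data.size());
  return out;
}
```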
  • In one example, the hardware instructions may be hardware instructions generated according to the first calculation graph, and the second model data may be model data obtained from the original model data through collaborative processing for deep learning algorithms and processing for the hardware platform; that is, the finally obtained binary code of the deep learning algorithm is obtained by sequentially going through step S1332, step S1333, and step S1334.
  • In one example, the hardware instructions may be hardware instructions generated directly from the original calculation graph, and the second model data may be model data obtained from the original model data having only undergone hardware-oriented processing; that is, the finally obtained executable file of the deep learning algorithm is obtained only through step S1333 and step S1334 in sequence.
  • In one example, the hardware instructions may be hardware instructions generated according to the first calculation graph, and the second model data may be the first model data; that is, the finally obtained executable file of the deep learning algorithm is obtained only through step S1332 and step S1333 in sequence.
  • In one example, the hardware instructions may be hardware instructions generated directly from the original calculation graph, and the second model data may be the original model data; that is, the finally obtained executable file of the deep learning algorithm may be obtained only through step S1333. It can be seen from the above examples that step S1332, step S1333, and step S1334 need not all be present at the same time and can be combined flexibly according to actual conditions.
  • the CDUCA implemented by the above disclosed embodiments can integrate the aforementioned compilation optimization techniques such as memory multiplexing, operation fusion, delay concealment, linear algebra transformation, common sub-expression elimination, constant propagation, dead code elimination, and data parallelism.
  • the hierarchical design structure of CDUCA has strong scalability, and developers can integrate various compilation optimization techniques in each module of CDUCA.
  • operation aggregation technology can be integrated in the computational graph engine module
  • polyhedral compilation optimization technology can be integrated in the code generator module. Real-time compilation of dynamic operating instructions through CDUCA can effectively improve compilation efficiency, thereby increasing the operating speed of hardware devices.
  • the binary code of the deep learning algorithm can be generated through two ways of just-in-time compilation and static search.
  • the overall architecture of NCLA also includes a runtime system. Therefore, in a possible implementation manner, the method proposed in the embodiment of the present disclosure further includes: executing the binary code of the deep learning algorithm in a deep learning processor through a runtime system.
  • FIG. 29 shows a schematic diagram of the modules and functions of the runtime system according to an embodiment of the present disclosure.
  • the runtime system can be responsible for the interaction between the host and the deep learning processor. It can encapsulate the device driver interface and provide the upper layer with functions such as device management, memory management, operation execution, and device synchronization.
  • In one example, a specific implementation of NCLA can be deployed on the deep learning processor platform and can support deep learning algorithms such as image classification and target detection.
  • the embodiments of the present disclosure use TensorFlow to conduct experiments on several types of commonly used deep learning applications.
  • the performance of the binary code generated by the NCLA just-in-time compilation system can reach 83.24% of the performance of the manually optimized code.
  • NCLA can exploit at least 72.61% of the peak hardware performance.
  • the successful practice of NCLA further confirms the versatility of the neural calculus and NCLAPI mentioned in the above disclosed embodiments.
  • FIG. 30 shows a block diagram of a device for compiling a deep learning algorithm according to an embodiment of the present disclosure.
  • the device 20 includes: an operating data receiving module 21 for receiving operating data transmitted by a deep learning programming library interface;
  • the instruction acquisition module 22 is used to acquire the operation instructions included in the operation data;
  • the compilation module 23 is used to determine the instruction type of the operation instruction, and execute the compilation operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm.
  • operation instructions are created or invoked according to user instructions received by the deep learning programming library interface.
  • the compiling module includes: a judgment unit for judging the instruction type of the operation instruction; and a static search unit for when the instruction type is a static operation instruction, according to the static operation instruction The corresponding binary code is searched in the static operation pool as the binary code of the deep learning algorithm.
  • the static search unit is configured to: according to the name of the static operation instruction, search for the binary code corresponding to the name in the static operation pool; when the search result is successful, return the The binary code is used as the binary code of the deep learning algorithm.
  • the static search unit is further configured to: when the search result is a failure, use the static operation instruction as a dynamic operation instruction for real-time compilation.
  • the static operation instruction includes one or more of a customized operation instruction, a build-in operation instruction, and an offline operation instruction; wherein, the customized operation instruction includes, according to the encoding mode of the operation instruction And packaging form, the realization of the operation instructions with custom functions; the build-in operation instructions include the own operation instructions included in the deep learning programming library interface; the offline operation instructions include the pre-compiled dynamic The operation instruction, wherein the pre-compiled result is stored in an offline cache.
  • the process of generating binary code corresponding to the customized operation instruction includes: encapsulating the user instruction corresponding to the operation instruction according to the interface and data structure definition of the operation instruction to obtain the encapsulated User instructions; compile the encapsulated user instructions to obtain a compiled result; insert the compiled result into the static operation pool in a dynamic link or static link mode to obtain the binary code corresponding to the customized operation instruction .
  • In a possible implementation manner, the static operation pool includes static operation source files, and the static operation source files include static code segments, static data segments, dynamic code segments, and dynamic data segments; wherein, the static code segment is used to save the binary code corresponding to the build-in operation instruction; the dynamic code segment is used to save the binary code corresponding to the customized operation instruction; the static data segment is used to save the tensor data corresponding to the build-in operation instruction; and the dynamic data segment is used to store the tensor data corresponding to the custom operation instruction.
  • the static operation pool further includes an offline cache
  • the offline cache includes an offline file and an index table; wherein, the offline file is used to store the pre-compiled result of the offline operation instruction; the index table , Used to indicate the location of the pre-compiled result of the offline operation instruction in the offline file.
  • In a possible implementation manner, the static search unit is further configured to: according to the name specified when the static operation instruction is created, sequentially search the static operation pool for the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation, to obtain the binary code corresponding to the name.
  • In a possible implementation manner, the compilation module further includes a dynamic compilation unit, configured to: when the instruction type is a dynamic operation instruction, perform real-time compilation on the dynamic operation instruction to obtain a real-time compilation result as the binary code of the deep learning algorithm.
  • In a possible implementation manner, the dynamic compilation unit includes: an original data acquisition subunit, configured to obtain, according to the dynamic operation instruction, the original calculation graph and original model data corresponding to the dynamic operation instruction; a collaborative processing subunit, used to perform collaborative processing for the deep learning algorithm according to the original calculation graph and original model data to obtain a first calculation graph and first model data; a hardware instruction generation subunit, used to generate hardware instructions and data descriptors according to the first calculation graph; a model data processing subunit, used to perform hardware-platform-oriented processing on the first model data according to the data descriptors to obtain second model data; and a binary code generation subunit, used to obtain the binary code of the deep learning algorithm according to the hardware instructions and the second model data.
  • In a possible implementation manner, the original data acquisition subunit is used to: obtain the original calculation graph corresponding to the dynamic operation instruction by analyzing the dynamic operation instruction; and obtain the original model data according to the parameters of the dynamic operation instruction.
  • the cooperative processing subunit is used to: read the original calculation graph and the original model data; identify the continuous linear transformation operation node in the original calculation graph; pass linear transformation and constant folding , Processing the continuous linear transformation operation node to obtain the first calculation graph and the first model data.
  • the hardware instruction generation subunit is used to process the first calculation graph according to the cost model, and combine the heuristic search strategy to obtain hardware instructions and data descriptors.
  • the model data processing subunit is configured to: perform data alignment on the first model data according to the data descriptor and the operation requirements in the hardware platform; or, according to the data description According to the operation requirements in the hardware platform, the dimensional transformation of the first model data is performed; or, according to the data descriptor, the accuracy of the first model data is selected according to the operation requirements in the hardware platform.
  • the binary code generation subunit is used to package the hardware instructions and the second model data to obtain the binary code of the deep learning algorithm.
  • the dynamic operation instruction includes a specialized operation instruction and/or a fusion operation instruction; wherein, the specialized operation instruction includes an operation instruction obtained by binding input parameters to the operation instruction;
  • the fusion operation instruction includes an operation instruction obtained by combining a plurality of the operation instructions according to a calling sequence.
  • In a possible implementation manner, the specialized operation instructions include fully-specialized operation instructions, partially-specialized operation instructions, and pseudo-specialized operation instructions; wherein, the fully-specialized operation instruction includes an operation instruction converted after binding all input parameters to the operation instruction; the partially-specialized operation instruction includes an operation instruction converted after binding N input parameters to the operation instruction, where N is a positive integer less than the number of input parameters of the operation instruction; and the pseudo-specialized operation instruction includes an operation instruction obtained by directly converting the operation instruction without binding any input parameters.
  • the process of creating the fusion operation instruction includes: creating the name of the fusion operation instruction; determining the fusion operation sub-instruction according to the operation instruction to be fused; and according to the calling sequence of the operation instruction to be fused, Determine the operation connection relationship between the fusion operation sub-instructions; connect the fusion operation sub-instructions according to the operation connection relationship to obtain the connection result; set the fusion operation instruction according to the user instruction corresponding to the fusion operation instruction The input parameters and output parameters of the; package the name, connection result, input parameters and output parameters to obtain the fusion operation instruction.
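  • The creation flow above can be sketched as follows; the structure mirrors the listed steps (name, fusion sub-instructions, connection relationship derived from the calling sequence, input/output parameters, packaging), while all types and field names are assumptions rather than NCLAPI definitions.

```cpp
#include <string>
#include <utility>
#include <vector>

struct OperationInstruction { std::string name; };

// A fused operation instruction: the sub-instructions to fuse, the connections
// recording the calling order between them, and the external inputs/outputs.
struct FusionOperation {
  std::string name;
  std::vector<OperationInstruction> sub_ops;
  std::vector<std::pair<int, int>> connections;  // output of sub_ops[i] feeds sub_ops[j]
  std::vector<std::string> inputs, outputs;
};

FusionOperation CreateFusionOperation(std::string name,
                                      std::vector<OperationInstruction> ops_in_call_order,
                                      std::vector<std::string> inputs,
                                      std::vector<std::string> outputs) {
  FusionOperation fused;
  fused.name    = std::move(name);
  fused.sub_ops = std::move(ops_in_call_order);
  // Connect consecutive sub-instructions according to the calling sequence.
  for (int i = 0; i + 1 < static_cast<int>(fused.sub_ops.size()); ++i)
    fused.connections.push_back({i, i + 1});
  fused.inputs  = std::move(inputs);
  fused.outputs = std::move(outputs);
  return fused;  // packaged: name + connection result + input/output parameters
}
```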
  • the device further includes an execution module, which is configured to execute the binary code of the deep learning algorithm in the deep learning processor through the runtime system.
  • the operation data further includes tensor data, where the tensor data includes shape attributes, logical data type attributes, physical data type attributes, and physical layout attributes.
  • the device is further configured to: extend the class of a deep learning framework, and encapsulate the tensor data and the operation instructions in the data of the deep learning framework to implement the deep learning framework Integration with the deep learning programming library interface.
  • the present disclosure also proposes a deep learning computing device, which includes any one of the above possible deep learning algorithm compilation devices, and the deep learning computing device is used to complete the set deep learning computation.
  • FIG. 31 shows a block diagram of a combined processing device according to an embodiment of the present disclosure.
  • the combined processing device includes the aforementioned deep learning computing device, a universal interconnection interface, and other processing devices.
  • the deep learning computing device interacts with other processing devices to jointly complete the operation specified by the user.
  • Other processing devices include one or more types of general/special processors such as central processing unit CPU, graphics processing unit GPU, neural network processor, etc.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the deep learning computing device and external data and control, performing functions including data handling and basic control of the deep learning computing device such as starting and stopping; other processing devices can also cooperate with the deep learning computing device to complete computing tasks.
  • the universal interconnection interface is used to transmit data and control commands between the deep learning computing device and other processing devices.
  • the deep learning computing device obtains the required input data from other processing devices and writes it to the storage device on the deep learning computing device chip; it can obtain control instructions from other processing devices and write it to the control buffer on the deep learning computing device chip; also The data in the storage module of the deep learning computing device can be read and transmitted to other processing devices.
  • the combined processing device may further include a storage device, and the storage device is respectively connected to the deep learning computing device and the other processing device.
  • the storage device is used to store data in the deep learning computing device and the other processing devices, and is particularly suitable for data that needs to be computed and cannot be fully stored in the internal storage of the deep learning computing device or other processing devices.
  • the combined processing device can be used as an SOC system-on-chip for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • In one example, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a monitor, a mouse, a keyboard, a network card, or a wifi interface.
  • the present disclosure also provides a deep learning chip, which includes the above-mentioned deep learning computing device or combined processing device.
  • the present disclosure also provides a chip packaging structure, which includes the aforementioned chip.
  • the present disclosure also provides a board card, which includes the chip packaging structure described above.
  • the present disclosure also provides an electronic device, which includes the above board.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships, and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance, B-ultrasound, and/or electrocardiograph.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or may be Integrate into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be realized in the form of hardware or software program module.
  • the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer readable memory.
  • the technical solution of the present disclosure essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory, A number of instructions are included to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • the program can be stored in a computer-readable memory, and the memory can include: a flash disk, read-only memory (English: Read-Only Memory, abbreviation: ROM), random access memory (English: Random Access Memory, abbreviation: RAM), a magnetic disk, an optical disc, etc.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, and the module, program segment, or part of an instruction contains one or more functions for implementing the specified logical function.
  • Executable instructions may also occur in a different order from the order marked in the drawings. For example, two consecutive blocks can actually be executed in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and the combination of the blocks in the block diagram and/or flowchart can be implemented by a dedicated hardware-based system that performs the specified functions or actions Or it can be realized by a combination of dedicated hardware and computer instructions.
  • a method for compiling a deep learning algorithm comprising:
  • the instruction type of the operation instruction is judged, and the compilation operation corresponding to the instruction type is executed according to the judgment result to obtain the binary code of the deep learning algorithm.
  • Clause A2 the method according to clause A1, the operation data is created or invoked according to a user instruction received by the deep learning programming library interface.
  • the instruction type is a static operation instruction
  • a corresponding binary code is searched in the static operation pool as the binary code of the deep learning algorithm.
  • the binary code is returned as the binary code of the deep learning algorithm.
  • the static operation instruction is used as a dynamic operation instruction for real-time compilation.
  • the customized operation instruction includes an operation instruction with a self-defined function implemented according to the encoding method and packaging form of the operation instruction;
  • the build-in operation instruction includes the own operation instruction included in the deep learning programming library interface
  • the offline operation instruction includes a pre-compiled dynamic operation instruction, wherein the pre-compiled result is stored in an offline cache.
  • the process of generating the binary code corresponding to the customized operation instruction includes:
  • the compilation result is inserted into the static operation pool in a dynamic link or static link manner to obtain the binary code corresponding to the customized operation instruction.
  • the static code segment is used to save the binary code corresponding to the build-in operation instruction
  • the dynamic code segment is used to save the binary code corresponding to the customized operation instruction
  • the static data segment is used to save tensor data corresponding to the build-in operation instruction
  • the dynamic data segment is used to store tensor data corresponding to the customized operation instruction.
  • the offline file is used to save pre-compiled results of offline operation instructions
  • the index table is used to indicate the position of the pre-compiled result of the offline operation instruction in the offline file.
  • the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation are sequentially searched to obtain the binary code corresponding to the name.
  • the judging the instruction type of the operation instruction, and executing the compilation operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm further includes:
  • the dynamic operation instruction is compiled in real time to obtain a real-time compilation result as the binary code of the deep learning algorithm.
  • Clause A12 The method according to any one of clauses A1 to A11, wherein, when the operation instruction includes a dynamic operation instruction, performing real-time compilation of the dynamic operation instruction to obtain a real-time compilation result as the binary code of the deep learning algorithm includes:
  • the binary code of the deep learning algorithm is obtained.
  • the original model data is obtained.
  • the collaborative processing for the deep learning algorithm is performed according to the original calculation graph and original model data of the deep learning algorithm to obtain the first calculation graph and the first model data, including:
  • the continuous linear transformation operation node is processed to obtain the first calculation graph and the first model data.
  • the first calculation graph is processed according to the cost model, and the heuristic search strategy is combined to obtain hardware instructions and data descriptors.
  • the dimensional transformation of the first model data is performed according to the operation requirements in the hardware platform; or,
  • the accuracy of the first model data is selected according to the operation requirements in the hardware platform.
  • the special operation instruction includes an operation instruction obtained by binding input parameters to the operation instruction;
  • the fusion operation instruction includes an operation instruction obtained by combining a plurality of the operation instructions according to a calling sequence.
  • the fully-specialized operation instruction includes an operation instruction obtained by binding all input parameters to the operation instruction;
  • the partially-specialized operation instruction includes an operation instruction obtained by binding N input parameters to the operation instruction, where N is a positive integer smaller than the number of input parameters of the operation instruction;
  • the pseudo-specialized operation instruction includes an operation instruction obtained by directly converting the operation instruction without binding input parameters.
  • the process of creating the fusion operation instruction includes:
  • Clause A21 The method according to clause A1, further comprising: executing the binary code of the deep learning algorithm in a deep learning processor through a runtime system.
  • Clause A22 The method according to clause A1, wherein the operation data further includes tensor data, wherein the tensor data includes shape attributes, logical data type attributes, physical data type attributes, and physical layout attributes.
  • Class extension is performed on the deep learning framework, and the tensor data and the operation instructions are encapsulated in the data of the deep learning framework to realize the integration of the deep learning framework and the deep learning programming library interface.
  • a compiling device for deep learning algorithms including:
  • the operating data receiving module is used to receive the operating data transmitted by the deep learning programming library interface
  • Operation instruction acquisition module used to acquire operation instructions included in the operation data
  • the compiling module is used to judge the instruction type of the operation instruction, and execute the compiling operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm.
  • Clause A25 The device according to clause A24, wherein the operation instruction is created or invoked according to a user instruction received by the deep learning programming library interface.
  • the judgment unit is used to judge the instruction type of the operation instruction
  • the static search unit is configured to search for the corresponding binary code in the static operation pool according to the name of the static operation instruction when the instruction type is a static operation instruction, as the binary code of the deep learning algorithm.
  • the binary code is returned as the binary code of the deep learning algorithm.
  • the static operation instruction is used as a dynamic operation instruction for real-time compilation.
  • the customized operation instruction includes an operation instruction with a self-defined function implemented according to the encoding method and packaging form of the operation instruction;
  • the build-in operation instruction includes the own operation instruction included in the deep learning programming library interface
  • the offline operation instruction includes a pre-compiled dynamic operation instruction, wherein the pre-compiled result is stored in an offline cache.
  • the process of generating the binary code corresponding to the customized operation instruction includes:
  • the compilation result is inserted into the static operation pool in a dynamic link or static link manner to obtain the binary code corresponding to the customized operation instruction.
  • the static code segment is used to save the binary code corresponding to the build-in operation instruction
  • the dynamic code segment is used to save the binary code corresponding to the customized operation instruction
  • the static data segment is used to save tensor data corresponding to the build-in operation instruction
  • the dynamic data segment is used to store tensor data corresponding to the customized operation instruction.
  • the offline file is used to save pre-compiled results of offline operation instructions
  • the index table is used to indicate the position of the pre-compiled result of the offline operation instruction in the offline file.
  • the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation are sequentially searched to obtain the binary code corresponding to the name.
  • the dynamic operation instruction is compiled in real time to obtain a real-time compilation result as the binary code of the deep learning algorithm.
  • the original data acquisition subunit is used to obtain the original calculation graph and original model data corresponding to the dynamic operation instruction according to the dynamic operation instruction;
  • the collaborative processing subunit is configured to perform collaborative processing for the deep learning algorithm according to the original calculation graph and original model data to obtain a first calculation graph and first model data;
  • a hardware instruction generation subunit configured to generate hardware instructions and data descriptors according to the first calculation graph
  • the model data processing subunit is configured to perform hardware platform-oriented processing on the first model data according to the data descriptor to obtain second model data;
  • the binary code generation subunit is used to obtain the binary code of the deep learning algorithm according to the hardware instruction and the second model data.
  • the original model data is obtained.
  • the continuous linear transformation operation node is processed to obtain the first calculation graph and the first model data.
  • the first calculation graph is processed according to the cost model, and the heuristic search strategy is combined to obtain hardware instructions and data descriptors.
  • the dimensional transformation of the first model data is performed according to the operation requirements in the hardware platform; or,
  • the accuracy of the first model data is selected according to the operation requirements in the hardware platform.
  • the special operation instruction includes an operation instruction obtained by binding input parameters to the operation instruction;
  • the fusion operation instruction includes an operation instruction obtained by combining a plurality of the operation instructions according to a calling sequence.
  • Clause A42 The device according to Clause A41, wherein the specialized operation instructions include fully specialized operation instructions, partially specialized operation instructions, and pseudo-specialized operation instructions; wherein,
  • the fully-specialized operation instruction includes an operation instruction obtained by binding all input parameters to the operation instruction;
  • the partially-specialized operation instruction includes an operation instruction obtained by binding N input parameters to the operation instruction, where N is a positive integer smaller than the number of input parameters of the operation instruction;
  • the pseudo-specialized operation instruction includes an operation instruction obtained by directly converting the operation instruction without binding input parameters.
  • Clause A44 The device according to clause A24, wherein the device further includes an execution module configured to execute the binary code of the deep learning algorithm in a deep learning processor through a runtime system.
  • Clause A45 The device according to clause A24, wherein the operation data further includes tensor data, wherein the tensor data includes shape attributes, logical data type attributes, physical data type attributes, and physical layout attributes.
  • Class extension is performed on the deep learning framework, and the tensor data and the operation instructions are encapsulated in the data of the deep learning framework to realize the integration of the deep learning framework and the deep learning programming library interface.
  • Clause A47 A deep learning computing device, comprising one or more deep learning algorithm compilation devices as described in any one of clauses A24-A46, the deep learning computing device being used to complete specified deep learning operations.
  • Clause A48 A combined computing device, which includes one or more deep learning computing devices as described in any one of Clause A47, a universal interconnection interface, and other processing devices;
  • the deep learning computing device interacts with the other processing device to jointly complete the computing operation specified by the user.
  • a deep learning chip including:
  • An electronic device which includes:
  • the deep learning programming library has the advantages of high efficiency and ease of use, and it has been widely used in actual production.
  • Existing deep learning programming libraries, however, do not strike a good balance between performance optimization and programming flexibility in their design, and lack a universal theoretical basis as design guidance.
  • In terms of performance optimization, some programming libraries do not support operation fusion, end-to-end execution, operation specialization, or offline optimization, so performance cannot be maximized; in terms of programming flexibility, some programming libraries do not support layer-by-layer execution or operation parameters that are variable at runtime,
  • which limits the application range of the programming library. Since existing deep learning programming libraries do not take both performance optimization and programming flexibility into account in the programming model design, a universal programming library interface that provides programming flexibility is urgently needed.
  • FIG. 32 shows a flowchart of a method for generating operation data according to an embodiment of the present disclosure. As shown in the figure, the method may include:
  • Step S31 receiving user instructions.
  • Step S32 According to the user instruction, trigger the deep learning programming library interface to create or call the operation data, where the operation data includes at least one of tensor data and operation instructions.
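To make steps S31 and S32 concrete, the following is a minimal sketch of how a deep learning programming library interface might create operation data (tensor data and operation instructions) in response to user instructions. The names `ProgrammingLibraryInterface`, `create_tensor`, and `create_op` are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tensor:                      # operation data: tensor data
    shape: tuple
    dtype: str = "float32"

@dataclass
class Operator:                    # operation data: operation instruction
    name: str
    op_type: str
    inputs: List[Tensor] = field(default_factory=list)
    outputs: List[Tensor] = field(default_factory=list)

class ProgrammingLibraryInterface:
    """Receives user instructions and creates or calls operation data (step S32)."""
    def create_tensor(self, shape, dtype="float32") -> Tensor:
        return Tensor(shape=shape, dtype=dtype)

    def create_op(self, name, op_type, inputs, outputs) -> Operator:
        return Operator(name, op_type, list(inputs), list(outputs))

# A user instruction such as "run a convolution" (step S31) would translate into:
lib = ProgrammingLibraryInterface()
x = lib.create_tensor((1, 3, 224, 224))
y = lib.create_tensor((1, 64, 112, 112))
conv = lib.create_op("conv1", "Conv2D", inputs=[x], outputs=[y])
```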
  • a universal programming interface can be provided for users to achieve effective conversion between user instructions and machine instructions.
  • step S32 may include:
  • The dynamic operation instructions may include specialized operation instructions and/or fusion operation instructions; therefore, in a possible implementation manner, triggering the deep learning programming library interface to create or call dynamic operation instructions according to user instructions can include:
  • according to a user instruction, triggering the deep learning programming library interface to create or call a specialized operation instruction, where the specialized operation instruction is obtained by binding input parameters to the operation instruction and then converting it; and/or,
  • the deep learning programming library interface is triggered to create or call fusion operation instructions, where the fusion operation instructions are obtained by combining multiple operation instructions according to the calling sequence.
  • The specialized operation instruction may include a fully specialized operation instruction, a partially specialized operation instruction, and a pseudo-specialized operation instruction; therefore, in an example, triggering the deep learning programming library interface to create the specialized operation instruction according to a user instruction can include the following (a sketch follows this list):
  • triggering the deep learning programming library interface to bind all input parameters to the operation instruction and then convert it to obtain a fully specialized operation instruction; and/or,
  • triggering the deep learning programming library interface to bind N input parameters to the operation instruction and then convert it to obtain a partially specialized operation instruction, where N is a positive integer less than the number of input parameters of the operation instruction; and/or,
  • triggering the deep learning programming library interface to directly convert the operation instruction without binding input parameters to obtain a pseudo-specialized operation instruction.
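The three specialization forms listed above can be illustrated with a short sketch. It assumes an operation instruction can be modeled as a Python callable whose input parameters are bound with `functools.partial`; the names below are illustrative only.

```python
from functools import partial

def conv2d(input, weight, stride, padding):
    """Stand-in for an operation instruction with four input parameters."""
    return ("conv2d", input, weight, stride, padding)

# Fully specialized: all input parameters are bound before conversion.
fully_specialized = partial(conv2d, input="img0", weight="w0", stride=1, padding=0)

# Partially specialized: N parameters are bound, N < number of input parameters.
partially_specialized = partial(conv2d, stride=1, padding=0)   # N = 2 of 4

# Pseudo-specialized: converted directly, no input parameters bound.
pseudo_specialized = conv2d

print(fully_specialized())                       # every argument fixed at creation time
print(partially_specialized("img1", "w1"))       # remaining arguments supplied at call time
print(pseudo_specialized("img2", "w2", 2, 1))    # all arguments supplied at call time
```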
  • triggering the creation of a fusion operation instruction according to a user instruction may include:
  • according to the calling sequence of the operation instructions to be fused, the operation connection relationship between the fusion operation sub-instructions is determined;
  • according to the operation connection relationship, the fusion operation sub-instructions are connected to obtain the connection result (see the sketch below).
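A minimal sketch of the fusion-creation process described above: it creates the fusion name, takes the sub-instructions in calling order, derives the operation connection relationship from which tensors each sub-instruction produces and consumes, and packages the result. `OpInst`, `FusionOp`, and `create_fusion` are illustrative names, not an existing API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OpInst:
    name: str
    inputs: List[str]
    outputs: List[str]

@dataclass
class FusionOp:
    name: str
    sub_ops: List[OpInst]
    edges: List[Tuple[str, str]]          # operation connection relationship
    inputs: List[str]                     # external input parameters
    outputs: List[str]                    # external output parameters

def create_fusion(name: str, ops_in_call_order: List[OpInst]) -> FusionOp:
    # An edge is added when a later op consumes a tensor produced by an earlier op.
    edges = [(a.name, b.name)
             for i, a in enumerate(ops_in_call_order)
             for b in ops_in_call_order[i + 1:]
             if set(a.outputs) & set(b.inputs)]
    produced = {t for op in ops_in_call_order for t in op.outputs}
    consumed = {t for op in ops_in_call_order for t in op.inputs}
    # Tensors that cross the fusion boundary become its inputs/outputs.
    return FusionOp(name, ops_in_call_order, edges,
                    sorted(consumed - produced), sorted(produced - consumed))

conv = OpInst("conv", ["x", "w"], ["t0"])
relu = OpInst("relu", ["t0"], ["y"])
fused = create_fusion("conv_relu", [conv, relu])
print(fused.edges)    # [('conv', 'relu')]
```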
  • Static operation instructions may include one or more of custom operation instructions, build-in operation instructions, and offline operation instructions; therefore, in a possible implementation manner, triggering the deep learning programming library interface to create or call static operation instructions can include:
  • the deep learning programming library interface to create or call a custom operation instruction, where the custom operation instruction is obtained by encapsulating the user instruction corresponding to the operation instruction according to the interface and data structure definition of the operation instruction;
  • the deep learning programming library interface to create or call a build-in operation instruction, where the build-in operation instruction is a self-owned operation instruction included in the deep learning programming library interface;
  • the deep learning programming library interface is triggered to create or call an offline operation instruction, where the offline operation instruction is a pre-compiled dynamic operation instruction, and the pre-compiled result is stored in the offline cache.
  • In a possible implementation manner, the static operation pool may include a static operation source file, and the static operation source file includes a static code segment, a static data segment, a dynamic code segment, and a dynamic data segment. Among them, the static code segment is used to save the binary code corresponding to the build-in operation instructions; the dynamic code segment is used to save the binary code corresponding to the customized operation instructions; the static data segment is used to save the tensor data corresponding to the build-in operation instructions; and the dynamic data segment is used to save the tensor data corresponding to the customized operation instructions.
  • the offline cache includes an offline file and an index table; among them, the offline file is used to save the pre-compiled results of offline operation instructions, and the index table is used to indicate the location of the pre-compiled result of an offline operation instruction in the offline file.
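The following data-structure sketch restates the static operation pool and offline cache layout described above. The field names are illustrative assumptions; an actual static operation source file would be an on-disk binary format rather than Python dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Optional

@dataclass
class StaticOperationSource:
    static_code: Dict[str, bytes] = field(default_factory=dict)   # build-in op binaries
    dynamic_code: Dict[str, bytes] = field(default_factory=dict)  # customized op binaries
    static_data: Dict[str, bytes] = field(default_factory=dict)   # build-in op tensor data
    dynamic_data: Dict[str, bytes] = field(default_factory=dict)  # customized op tensor data

@dataclass
class OfflineCache:
    offline_file: bytes = b""                                           # concatenated pre-compiled results
    index_table: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # name -> (offset, length)

    def lookup(self, name: str) -> Optional[bytes]:
        if name not in self.index_table:
            return None
        offset, length = self.index_table[name]
        return self.offline_file[offset:offset + length]

@dataclass
class StaticOperationPool:
    source: StaticOperationSource = field(default_factory=StaticOperationSource)
    offline_cache: OfflineCache = field(default_factory=OfflineCache)
```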
  • triggering the deep learning programming library interface to create or call custom operation instructions according to user instructions may also include:
  • the compilation result is inserted into the static operation pool in a dynamic link or static link mode to obtain the binary code corresponding to the customized operation instruction.
  • In a possible implementation manner, triggering the invocation of operation instructions in the deep learning programming library interface according to user instructions may include: invoking all corresponding operation instructions one by one according to the user instructions; or, according to the user instructions, fusing all corresponding operation instructions to obtain a fusion operation instruction and invoking the fusion operation instruction; or, according to the user instructions, segmenting all corresponding operation instructions to obtain segmentation results, fusing each segmentation result separately to obtain corresponding segment fusion operation instructions, and invoking the segment fusion operation instructions in turn.
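A minimal sketch of the three invocation modes. `fuse` and `run` are trivial stand-ins: `fuse` collapses a segment of instructions into one (see the fusion sketch above for a fuller version), and `run` stands in for compiling and submitting one instruction to the device.

```python
def fuse(name, ops):
    """Trivial stand-in for fusing a segment of operation instructions."""
    return (name, list(ops))

def run(op):
    """Stand-in for compiling and submitting one instruction to the device."""
    print("executing", op)

def invoke_one_by_one(ops):
    for op in ops:
        run(op)

def invoke_fused(ops):
    run(fuse("fused_all", ops))

def invoke_segment_fused(ops, segment_size):
    # Segment the instruction sequence, fuse each segment, and invoke the
    # segment fusion instructions in turn.
    for i in range(0, len(ops), segment_size):
        run(fuse(f"fused_{i // segment_size}", ops[i:i + segment_size]))

invoke_one_by_one(["conv", "bn", "relu", "pool"])
invoke_fused(["conv", "bn", "relu", "pool"])
invoke_segment_fused(["conv", "bn", "relu", "pool"], segment_size=2)
```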
  • the tensor data may include a shape attribute (shape), a logical data type attribute (dtype), a physical data type attribute (pdtype), and a physical layout attribute (layout).
  • shape: shape attribute
  • dtype: logical data type attribute
  • pdtype: physical data type attribute
  • layout: physical layout attribute
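A tensor descriptor carrying these four attributes might look like the following sketch; the example values are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class TensorDescriptor:
    shape: tuple          # shape attribute, e.g. (1, 3, 224, 224)
    dtype: str            # logical data type seen by the user, e.g. "float32"
    pdtype: str           # physical data type used on the device, e.g. "float16"
    layout: str           # physical layout on the device, e.g. "NHWC"

t = TensorDescriptor(shape=(1, 3, 224, 224), dtype="float32",
                     pdtype="float16", layout="NHWC")
```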
  • the operation data generation method proposed in the embodiment of the present disclosure may further include:
  • Step S33: the deep learning programming library architecture reads the operation instructions transmitted by the deep learning programming library interface.
  • step S34 the deep learning programming library architecture judges the instruction type of the operation instruction, and executes the compilation operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm.
  • step S34 may include:
  • Step S341 Determine the instruction type of the operation instruction.
  • Step S342 When the instruction type is a static operation instruction, according to the name of the static operation instruction, the corresponding binary code is searched in the static operation pool as the binary code of the deep learning algorithm.
  • step S342 may include:
  • the binary code corresponding to the name is searched in the static operation pool.
  • the binary code is returned as the binary code of the deep learning algorithm.
  • step S342 may further include:
  • the static operation instruction is used as the dynamic operation instruction for real-time compilation.
  • searching for the binary code corresponding to the name in the static operation pool may include:
  • the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation are sequentially searched to obtain the binary code corresponding to the name.
  • step S34 may further include step S343: when the instruction type is a dynamic operation instruction, the dynamic operation instruction is compiled in real time to obtain the real-time compilation result as the binary code of the deep learning algorithm.
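Putting steps S341-S343 together, the dispatch can be sketched as follows: static instructions are looked up by name in the static operation pool (custom, then build-in, then offline); on a lookup failure, or for dynamic instructions, real-time compilation is performed. The sketch assumes the illustrative `StaticOperationPool` from the data-structure sketch above, `op.kind` and `op.name` are illustrative attributes, and `jit_compile` is a placeholder for step S343.

```python
def compile_operation(op, pool):
    """Dispatch one operation instruction to lookup or real-time compilation."""
    if op.kind == "static":
        binary = (pool.source.dynamic_code.get(op.name)      # custom operations first
                  or pool.source.static_code.get(op.name)    # then build-in operations
                  or pool.offline_cache.lookup(op.name))     # then offline operations
        if binary is not None:
            return binary
        # Lookup failed: fall back to treating it as a dynamic operation.
    return jit_compile(op)                                    # real-time compilation (step S343)

def jit_compile(op):
    # Placeholder for steps S3431-S3435; see the pipeline sketch further below.
    return b"jit:" + op.name.encode()
```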
  • step S343 may include:
  • Step S3431 According to the dynamic operation instruction, the original calculation graph and the original model data corresponding to the dynamic operation instruction are obtained.
  • Step S3432 according to the original calculation graph and the original model data, perform collaborative processing for the deep learning algorithm to obtain the first calculation graph and the first model data.
  • Step S3433 Generate hardware instructions and data descriptors according to the first calculation graph.
  • Step S3434 Perform hardware platform-oriented processing on the first model data according to the data descriptor to obtain second model data.
  • Step S3435 Obtain the binary code of the deep learning algorithm according to the hardware instruction and the second model data.
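The five steps S3431-S3435 can be summarized as the following pipeline sketch, in which every helper is a placeholder standing in for a stage described in the text rather than an existing compiler API.

```python
def jit_compile_pipeline(dynamic_op):
    graph, model_data = parse_instruction(dynamic_op)            # S3431: original graph + model data
    graph, model_data = cooperative_optimize(graph, model_data)  # S3432: algorithm-oriented co-processing
    hw_insts, descriptors = generate_instructions(graph)         # S3433: hardware instructions + descriptors
    model_data = process_for_hardware(model_data, descriptors)   # S3434: hardware-oriented model data
    return pack_binary(hw_insts, model_data)                     # S3435: final binary code

# Placeholder stages; each stands in for the processing described in the text.
def parse_instruction(op):        return {"nodes": [op]}, {"weights": []}
def cooperative_optimize(g, d):   return g, d            # e.g. linear-transform folding
def generate_instructions(g):     return [("nop",)], {}  # cost model + heuristic search
def process_for_hardware(d, _):   return d               # splitting/alignment/layout/precision
def pack_binary(insts, data):     return repr((insts, data)).encode()

print(jit_compile_pipeline("conv2d"))
```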
  • step S3431 may include:
  • by parsing the dynamic operation instruction, the original calculation graph corresponding to the dynamic operation instruction is obtained;
  • according to the parameters of the dynamic operation instruction, the original model data is obtained.
  • step S3432 may include:
  • Step S34321 Read the original calculation graph and original model data.
  • Step S34322 Identify the continuous linear transformation operation node in the original calculation graph.
  • step S34323 the continuous linear transformation operation node is processed through linear transformation and constant folding to obtain the first calculation graph and the first model data.
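As an illustration of step S34323, two consecutive linear-transformation nodes y = W2 (W1 x + b1) + b2 can be rewritten as a single node y = (W2 W1) x + (W2 b1 + b2), with the combined weight and bias constant-folded at compile time. The sketch below checks this identity on random data; the shapes are illustrative.

```python
import numpy as np

def fold_linear(W1, b1, W2, b2):
    """Fold two consecutive linear transformations into one node."""
    W = W2 @ W1              # constant folding of the combined weight
    b = W2 @ b1 + b2         # constant folding of the combined bias
    return W, b

W1, b1 = np.random.rand(4, 3), np.random.rand(4)
W2, b2 = np.random.rand(2, 4), np.random.rand(2)
W, b = fold_linear(W1, b1, W2, b2)

x = np.random.rand(3)
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b)
```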
  • step S3433 may include: processing the first calculation graph according to the cost model, and combining the heuristic search strategy to obtain hardware instructions and data descriptors.
  • In a possible implementation manner, step S3434 may include: performing data splitting on the first model data according to the data descriptor and the operation requirements of the hardware platform; or performing data alignment on the first model data according to the data descriptor and the operation requirements of the hardware platform; or performing dimension transformation on the first model data according to the data descriptor and the operation requirements of the hardware platform; or selecting the precision of the first model data according to the data descriptor and the operation requirements of the hardware platform.
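The sketch below illustrates two of these hardware-oriented transformations: aligning the channel dimension to a hardware-friendly multiple and converting NCHW data to NHWC, followed by precision selection. The alignment width of 16 and the layouts are assumptions about a hypothetical target platform, not requirements of the method.

```python
import numpy as np

def align_channels(data_nchw, multiple=16):
    """Pad the channel dimension with zeros up to the next multiple."""
    n, c, h, w = data_nchw.shape
    pad = (-c) % multiple
    return np.pad(data_nchw, ((0, 0), (0, pad), (0, 0), (0, 0)))

def nchw_to_nhwc(data_nchw):
    """Dimension transformation from NCHW to NHWC layout."""
    return np.ascontiguousarray(data_nchw.transpose(0, 2, 3, 1))

def select_precision(data, pdtype="float16"):
    """Precision selection: convert to the physical data type of the device."""
    return data.astype(pdtype)

x = np.random.rand(1, 3, 8, 8).astype("float32")
y = select_precision(nchw_to_nhwc(align_channels(x)))
print(y.shape, y.dtype)                        # (1, 8, 8, 16) float16
```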
  • step S3435 may include: packing the hardware instructions and the second model data to obtain the binary code of the deep learning algorithm.
  • the method proposed in the embodiment of the present disclosure further includes: executing the binary code of the deep learning algorithm in a deep learning processor through a runtime system.
  • the deep learning framework can be extended, tensor data and operation instructions are encapsulated in the data of the deep learning framework, and the integration of the deep learning framework and the deep learning programming library interface can be realized.
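A minimal sketch of this class-extension idea: a framework-level operator base class is subclassed so that the subclass carries the programming library's operation instruction, compiles it lazily, and executes it through the library's runtime. `FrameworkOp` and `StubLibrary` are stand-ins for whatever base class and library the host framework actually provides.

```python
class FrameworkOp:
    """Stand-in for the operator base class of the host deep learning framework."""
    def forward(self, *tensors):
        raise NotImplementedError

class LibraryBackedConv(FrameworkOp):
    """Framework operator whose body is compiled and run by the programming library."""
    def __init__(self, library, op_instruction):
        self.library = library                  # deep learning programming library interface
        self.op_instruction = op_instruction    # encapsulated operation instruction
        self.binary = None

    def forward(self, *tensors):
        if self.binary is None:                            # compile lazily on first use
            self.binary = self.library.compile(self.op_instruction)
        return self.library.run(self.binary, tensors)      # execute via the runtime system

# Usage with illustrative stand-ins for the library:
class StubLibrary:
    def compile(self, op):            return b"binary:" + op.encode()
    def run(self, binary, tensors):   return ("ran", binary, tensors)

op = LibraryBackedConv(StubLibrary(), "conv2d")
print(op.forward((1, 3, 224, 224)))
```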
  • For the implementation manner of each embodiment of the operation data generation method proposed in the embodiments of the present disclosure, reference may be made to the embodiments of the above-mentioned deep learning algorithm compilation method, which are not repeated here.
  • FIG. 33 shows a block diagram of a device for generating operation data according to an embodiment of the present disclosure.
  • As shown in the figure, the device 40 includes: a user instruction receiving module 41 for receiving user instructions; and a triggering module 42 for triggering the deep learning programming library interface to create or call operation data according to the user instructions, where the operation data includes at least one of tensor data and operation instructions.
  • the trigger module includes: a dynamic operation instruction trigger unit, which is used to trigger the deep learning programming library interface to create or call a dynamic operation instruction according to the user instruction; and/or a static operation instruction trigger unit, It is used to trigger the deep learning programming library interface to create or call a static operation instruction according to the user instruction.
  • the dynamic operation instruction triggering unit includes: a specialized operation instruction triggering subunit, which is used to trigger the deep learning programming library interface to create or call a specialized operation instruction according to the user instruction, wherein the Specialized operation instructions are obtained by binding input parameters to the operation instructions and then converted; and/or, the fusion operation instruction triggering subunit is used to trigger the deep learning programming library interface to create or call the fusion operation instruction according to the user instruction, Wherein, the fusion operation instruction is obtained by combining a plurality of the operation instructions according to the calling sequence.
  • the specialized operation instruction triggering subunit is used to: according to the user instruction, trigger the deep learning programming library interface to bind all input parameters to the operation instruction and convert to obtain a fully specialized operation instruction And/or, according to the user instruction, trigger the deep learning programming library interface to bind N input parameters to the operation instruction and then convert to obtain a part of the special operation instruction, where N is an input parameter smaller than the operation instruction A number of positive integers; and/or, according to the user instruction, trigger the deep learning programming library interface to directly convert the operation instruction without binding input parameters to obtain a pseudo-specialized operation instruction.
  • the fusion operation instruction triggering subunit is used to: create the name of the fusion operation instruction; determine the fusion operation sub-instructions according to the operation instructions to be fused; determine the operation connection relationship between the fusion operation sub-instructions according to the calling sequence of the operation instructions to be fused; connect the fusion operation sub-instructions according to the operation connection relationship to obtain the connection result; set the input parameters and output parameters of the fusion operation instruction according to the user instruction corresponding to the fusion operation instruction; and package the name, connection result, input parameters, and output parameters to obtain the fusion operation instruction.
  • In a possible implementation manner, the static operation instruction trigger unit includes: a custom operation instruction trigger subunit, used to trigger the deep learning programming library interface to create or call a custom operation instruction according to the user instruction, where the custom operation instruction is obtained by encapsulating the user instruction corresponding to the operation instruction according to the interface and data structure definition of the operation instruction; and/or, a build-in operation instruction trigger subunit, used to trigger the deep learning programming library interface to create or call a build-in operation instruction according to the user instruction, where the build-in operation instruction is a self-owned operation instruction included in the deep learning programming library interface; and/or, an offline operation instruction trigger subunit, used to trigger the deep learning programming library interface to create or call an offline operation instruction according to the user instruction, where the offline operation instruction is a pre-compiled dynamic operation instruction, and the pre-compiled result is stored in the offline cache.
  • the binary code corresponding to the static operation instruction is stored in the static operation pool of the deep learning programming library, the static operation pool includes static operation source files, and the static operation source files include static code segments , Static data segment, dynamic code segment and dynamic data segment; wherein, the static code segment is used to store the binary code corresponding to the build-in operation instruction; the dynamic code segment is used to store the binary code corresponding to the customized operation instruction Binary code; the static data segment is used to store tensor data corresponding to the build-in operation instruction; the dynamic data segment is used to store tensor data corresponding to the custom operation instruction.
  • the static operation pool further includes an offline cache
  • the offline cache includes an offline file and an index table; wherein, the offline file is used to store the pre-compiled result of the offline operation instruction; the index table , Used to indicate the location of the pre-compiled result of the offline operation instruction in the offline file.
  • In a possible implementation manner, the custom operation instruction triggering subunit is further used to: compile the custom operation instruction to obtain the compilation result; and insert the compilation result into the static operation pool in a dynamic link or static link manner to obtain the binary code corresponding to the customized operation instruction.
  • the trigger module is further configured to: call all corresponding operation instructions one by one according to the user instruction; or, according to the user instruction, merge all corresponding operation instructions to obtain a merged operation instruction , Call the fusion operation instruction; or, according to the user instruction, segment all the corresponding operation instructions to obtain the segmentation result, and fuse each segmentation result separately to obtain the corresponding segmentation fusion operation instruction,
  • the segment fusion operation instructions are sequentially invoked.
  • tensor data includes shape attributes, logical data type attributes, physical data type attributes, and physical layout attributes.
  • the device is also connected to a deep learning programming library architecture
  • the deep learning programming library architecture includes: an operation instruction reading module, configured to read the operation instructions delivered by the deep learning programming library interface;
  • the compiling module is used to judge the instruction type of the operation instruction, and execute the compiling operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm.
  • the compilation module includes: an operation instruction type judgment unit for judging the instruction type of the operation instruction; a static search unit for when the instruction type is a static operation instruction, according to the The name of the static operation instruction is searched for the corresponding binary code in the static operation pool as the binary code of the deep learning algorithm.
  • the static search unit is configured to: according to the name of the static operation instruction, search for the binary code corresponding to the name in the static operation pool; when the search result is successful, return the The binary code is used as the binary code of the deep learning algorithm.
  • the static search unit is further configured to: when the search result is a failure, use the static operation instruction as a dynamic operation instruction for real-time compilation.
  • In a possible implementation manner, the static search unit is further configured to: according to the name specified when the static operation instruction is created, sequentially search the static operation pool for the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation, to obtain the binary code corresponding to the name.
  • In a possible implementation manner, the compilation module further includes a dynamic compilation unit, configured to: when the instruction type is a dynamic operation instruction, perform real-time compilation on the dynamic operation instruction to obtain a real-time compilation result as the binary code of the deep learning algorithm.
  • In a possible implementation manner, the dynamic compilation unit includes: an original data acquisition subunit, configured to obtain, according to the dynamic operation instruction, the original calculation graph and original model data corresponding to the dynamic operation instruction; a collaborative processing subunit, configured to perform collaborative processing for the deep learning algorithm according to the original calculation graph and the original model data to obtain the first calculation graph and the first model data; a hardware instruction generation subunit, configured to generate hardware instructions and data descriptors according to the first calculation graph; a model data processing subunit, configured to perform hardware platform-oriented processing on the first model data according to the data descriptors to obtain the second model data; and a binary code generation subunit, configured to obtain the binary code of the deep learning algorithm according to the hardware instructions and the second model data.
  • In a possible implementation manner, the original data acquisition subunit is used to: obtain the original calculation graph corresponding to the dynamic operation instruction by parsing the dynamic operation instruction; and obtain the original model data according to the parameters of the dynamic operation instruction.
  • In a possible implementation manner, the collaborative processing subunit is used to: read the original calculation graph and the original model data; identify the continuous linear transformation operation nodes in the original calculation graph; and process the continuous linear transformation operation nodes through linear transformation and constant folding to obtain the first calculation graph and the first model data.
  • the hardware instruction generation subunit is used to process the first calculation graph according to the cost model, and combine the heuristic search strategy to obtain hardware instructions and data descriptors.
  • In a possible implementation manner, the model data processing subunit is configured to: perform data alignment on the first model data according to the data descriptor and the operation requirements of the hardware platform; or perform dimension transformation on the first model data according to the data descriptor and the operation requirements of the hardware platform; or select the precision of the first model data according to the data descriptor and the operation requirements of the hardware platform.
  • the binary code generation subunit is used to package the hardware instructions and the second model data to obtain the binary code of the deep learning algorithm.
  • the device is further used to execute the binary code of the deep learning algorithm in a deep learning processor through a runtime system.
  • the device is further configured to: extend the class of a deep learning framework, and encapsulate the tensor data and the operation instructions in the data of the deep learning framework to implement the deep learning framework Integration with the deep learning programming library interface.
  • the present disclosure also provides a combined processing device, including the aforementioned deep learning computing device, a universal interconnection interface, and other processing devices.
  • the deep learning computing device interacts with other processing devices to jointly complete the operation specified by the user.
  • Other processing devices include one or more types of general/special processors such as central processing unit CPU, graphics processing unit GPU, neural network processor, etc.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the deep learning computing device and external data and control, performing data transfer and basic control such as starting and stopping the deep learning computing device; other processing devices can also cooperate with the deep learning computing device to complete computing tasks.
  • the universal interconnection interface is used to transmit data and control commands between the deep learning computing device and other processing devices.
  • The deep learning computing device obtains the required input data from other processing devices and writes it to the on-chip storage device of the deep learning computing device; it can obtain control instructions from other processing devices and write them to the on-chip control buffer of the deep learning computing device; it can also read the data in the storage module of the deep learning computing device and transmit it to other processing devices.
  • the combined processing device may further include a storage device, and the storage device is respectively connected to the deep learning computing device and the other processing device.
  • the storage device is used to store data in the deep learning computing device and the other processing devices, and is particularly suitable for data that needs to be computed and cannot be fully stored in the internal storage of the deep learning computing device or other processing devices.
  • the combined processing device can be used as an SOC system-on-chip for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a monitor, a mouse, a keyboard, a network card, or a Wi-Fi interface.
  • the present disclosure also provides a deep learning chip, which includes the above-mentioned deep learning computing device or combined processing device.
  • the present disclosure also provides a chip packaging structure, which includes the aforementioned chip.
  • the present disclosure also provides a board card, which includes the chip packaging structure described above.
  • the present disclosure also provides an electronic device, which includes the above board.
  • Clause B1 A method for generating operation data, comprising: receiving a user instruction; and
  • according to the user instruction, triggering the deep learning programming library interface to create or call operation data, where the operation data includes at least one of tensor data and operation instructions.
  • Clause B2 according to the method described in clause B1, triggering the creation or invocation of the operation instruction of the deep learning programming library interface according to the user instruction, including:
  • the deep learning programming library interface to create or call a dynamic operation instruction
  • the deep learning programming library interface is triggered to create or call a static operation instruction.
  • Clause B3 The method according to clause B2, wherein the triggering of the deep learning programming library interface to create or call dynamic operation instructions according to the user instruction includes:
  • the deep learning programming library interface to create or call a special instantiation operation instruction, wherein the special instantiation operation instruction is obtained by binding input parameters to the operation instruction and then converted; and/or,
  • the deep learning programming library interface is triggered to create or call a fusion operation instruction, wherein the fusion operation instruction is obtained by combining a plurality of the operation instructions according to the calling sequence.
  • triggering the deep learning programming library interface to bind all input parameters to the operation instruction and then convert it to obtain a fully specialized operation instruction; and/or,
  • triggering the deep learning programming library interface to bind N input parameters to the operation instruction and then convert it to obtain a partially specialized operation instruction, where N is a positive integer less than the number of input parameters of the operation instruction; and/or,
  • triggering the deep learning programming library interface to directly convert the operation instruction without binding input parameters to obtain a pseudo-specialized operation instruction.
  • triggering the creation of the fusion operation instruction according to the user instruction includes:
  • Clause B6 The method according to clause B2, wherein the triggering of the deep learning programming library interface to create or call a static operation instruction according to the user instruction includes:
  • the deep learning programming library interface to create or call a customized operation instruction, wherein the customized operation instruction is obtained by encapsulating the user instruction corresponding to the operation instruction according to the interface and data structure definition of the operation instruction ;and / or,
  • the deep learning programming library interface to create or call a build-in operation instruction, where the build-in operation instruction is a self-owned operation instruction included in the deep learning programming library interface; and/or,
  • a deep learning programming library interface is triggered to create or call an offline operation instruction, where the offline operation instruction is a pre-compiled dynamic operation instruction, and the pre-compiled result is stored in an offline cache.
  • the binary code corresponding to the static operation instruction is stored in the static operation pool of the deep learning programming library, the static operation pool includes a static operation source file, the static operation source file, Including static code segment, static data segment, dynamic code segment and dynamic data segment; among them,
  • the static code segment is used to save the binary code corresponding to the build-in operation instruction
  • the dynamic code segment is used to save the binary code corresponding to the customized operation instruction
  • the static data segment is used to save tensor data corresponding to the build-in operation instruction
  • the dynamic data segment is used to store tensor data corresponding to the customized operation instruction.
  • Clause B8 The method according to clause B7, wherein the static operation pool further includes an offline cache, and the offline cache includes offline files and an index table; wherein,
  • the offline file is used to save pre-compiled results of offline operation instructions
  • the index table is used to indicate the position of the pre-compiled result of the offline operation instruction in the offline file.
  • the compilation result is inserted into the static operation pool in a dynamic link or static link manner to obtain the binary code corresponding to the customized operation instruction.
  • all corresponding operation instructions are segmented to obtain segmentation results, each segmentation result is separately merged to obtain corresponding segmentation fusion operation instructions, and the segmentation fusion operation instructions are sequentially invoked.
  • Clause B11 The method according to clause B1, wherein the tensor data includes shape attributes, logical data type attributes, physical data type attributes, and physical layout attributes.
  • Clause B12 the method according to clause B1, the method further comprising:
  • the deep learning programming library architecture reads the operation instructions delivered by the deep learning programming library interface
  • the deep learning programming library architecture judges the instruction type of the operation instruction, and executes the compilation operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm.
  • Clause B13 According to the method described in clause B12, the judging the instruction type of the operation instruction, and executing the compilation operation corresponding to the instruction type according to the judgment result, to obtain the binary code of the deep learning algorithm, including:
  • the instruction type is a static operation instruction
  • a corresponding binary code is searched in the static operation pool as the binary code of the deep learning algorithm.
  • Clause B14 The method according to clause B13, wherein, when the instruction type is a static operation instruction, searching for the corresponding binary code in the static operation pool according to the name of the static operation instruction as the binary code of the deep learning algorithm includes:
  • the binary code is returned as the binary code of the deep learning algorithm.
  • the static operation instruction is used as a dynamic operation instruction for real-time compilation.
  • Clause B16 The method according to clause B14, wherein the search for the binary code corresponding to the name in the static operation pool according to the name of the static operation instruction includes:
  • the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation are sequentially searched to obtain the binary code corresponding to the name.
  • the judging the instruction type of the operation instruction, and executing the compilation operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm further includes:
  • the dynamic operation instruction is compiled in real time to obtain a real-time compilation result as the binary code of the deep learning algorithm.
  • Clause B18 The method according to clause B17, wherein when the instruction type is a dynamic operation instruction, compiling the dynamic operation instruction in real time to obtain a real-time compilation result as the binary code of the deep learning algorithm, including :
  • the binary code of the deep learning algorithm is obtained.
  • Clause B19 The method according to clause B18, wherein the obtaining the original calculation graph and original model data corresponding to the dynamic operation instruction according to the dynamic operation instruction includes:
  • the original model data is obtained.
  • Clause B20 According to the method described in clause B18, the collaborative processing for the deep learning algorithm is performed according to the original calculation graph and original model data of the deep learning algorithm to obtain the first calculation graph and the first model data, including:
  • the continuous linear transformation operation node is processed to obtain the first calculation graph and the first model data.
  • the first calculation graph is processed according to the cost model, and the heuristic search strategy is combined to obtain hardware instructions and data descriptors.
  • the dimensional transformation of the first model data is performed according to the operation requirements in the hardware platform; or,
  • the accuracy of the first model data is selected according to the operation requirements in the hardware platform.
  • obtaining the binary code of the deep learning algorithm according to the hardware instructions and the second model data includes:
  • Clause B24 The method according to clause B12, further comprising: executing the binary code of the deep learning algorithm in a deep learning processor through a runtime system.
  • Clause B25 The method according to Clause B1, the method further comprising:
  • Class extension is performed on the deep learning framework, and the tensor data and the operation instructions are encapsulated in the data of the deep learning framework to realize the integration of the deep learning framework and the deep learning programming library interface.
  • a device for generating operation data including:
  • User instruction receiving module for receiving user instructions
  • the trigger module is configured to trigger the deep learning programming library interface to create or call operation data according to the user instruction, where the operation data includes at least one of tensor data and operation instructions.
  • the dynamic operation instruction trigger unit is used to trigger the creation or call of the dynamic operation instruction by the deep learning programming library interface according to the user instruction; and/or,
  • the static operation instruction trigger unit is used to trigger the creation or call of the static operation instruction by the deep learning programming library interface according to the user instruction.
  • the specialized operation instruction triggering subunit is used to trigger the deep learning programming library interface to create or call a specialized operation instruction according to the user instruction, wherein the specialized operation instruction is obtained by binding input parameters to the operation instruction and then converting it; and/or,
  • the fusion operation instruction triggering subunit is used to trigger the deep learning programming library interface to create or call a fusion operation instruction according to the user instruction, wherein the fusion operation instruction is obtained by combining a plurality of the operation instructions according to the calling sequence.
  • according to the user instruction, triggering the deep learning programming library interface to bind all input parameters to the operation instruction and then convert it to obtain a fully specialized operation instruction; and/or,
  • according to the user instruction, triggering the deep learning programming library interface to bind N input parameters to the operation instruction and then convert it to obtain a partially specialized operation instruction, where N is a positive integer less than the number of input parameters of the operation instruction; and/or,
  • according to the user instruction, triggering the deep learning programming library interface to directly convert the operation instruction without binding input parameters to obtain a pseudo-specialized operation instruction.
  • the custom operation instruction trigger subunit is used to trigger the deep learning programming library interface to create or call a custom operation instruction according to the user instruction, wherein the custom operation instruction is obtained by encapsulating the user instruction corresponding to the operation instruction according to the interface and data structure definition of the operation instruction; and/or,
  • the build-in operation instruction triggering subunit is used to trigger the deep learning programming library interface to create or call a build-in operation instruction according to the user instruction, wherein the build-in operation instruction is included in the deep learning programming library interface Own operating instructions; and/or,
  • the offline operation instruction triggering subunit is used to trigger the deep learning programming library interface to create or call an offline operation instruction according to the user instruction, wherein the offline operation instruction is a pre-compiled dynamic operation instruction, and the pre-compiled result Save in offline cache.
  • Clause B32 The device according to clause B31, wherein the binary code corresponding to the static operation instruction is stored in a static operation pool of a deep learning programming library, the static operation pool includes a static operation source file, the static operation source file, Including static code segment, static data segment, dynamic code segment and dynamic data segment; among them,
  • the static code segment is used to save the binary code corresponding to the build-in operation instruction
  • the dynamic code segment is used to save the binary code corresponding to the customized operation instruction
  • the static data segment is used to save tensor data corresponding to the build-in operation instruction
  • the dynamic data segment is used to store tensor data corresponding to the customized operation instruction.
  • the offline file is used to save pre-compiled results of offline operation instructions
  • the index table is used to indicate the position of the pre-compiled result of the offline operation instruction in the offline file.
  • the compilation result is inserted into the static operation pool in a dynamic link or static link manner to obtain the binary code corresponding to the customized operation instruction.
  • all corresponding operation instructions are segmented to obtain segmentation results, each segmentation result is separately merged to obtain corresponding segmentation fusion operation instructions, and the segmentation fusion operation instructions are sequentially invoked.
  • Clause B36 The device according to clause B26, wherein the tensor data includes shape attributes, logical data type attributes, physical data type attributes, and physical layout attributes.
  • Clause B37 The device according to clause B26, wherein the device is further connected to a deep learning programming library architecture, and the deep learning programming library architecture includes:
  • the operation instruction reading module is used to read the operation instruction transmitted by the deep learning programming library interface
  • the compiling module is used to judge the instruction type of the operation instruction, and execute the compiling operation corresponding to the instruction type according to the judgment result to obtain the binary code of the deep learning algorithm.
  • the operation instruction type judgment unit is used to judge the instruction type of the operation instruction
  • the static search unit is configured to search for the corresponding binary code in the static operation pool according to the name of the static operation instruction when the instruction type is a static operation instruction, as the binary code of the deep learning algorithm.
  • the binary code is returned as the binary code of the deep learning algorithm.
  • the static operation instruction is used as a dynamic operation instruction for real-time compilation.
  • the binary code corresponding to the custom operation, the binary code corresponding to the build-in operation, and the binary code corresponding to the offline operation are sequentially searched to obtain the binary code corresponding to the name.
  • the dynamic operation instruction is compiled in real time to obtain a real-time compilation result as the binary code of the deep learning algorithm.
  • the original data acquisition subunit is used to obtain the original calculation graph and original model data corresponding to the dynamic operation instruction according to the dynamic operation instruction;
  • the collaborative processing subunit is used to perform collaborative processing for the deep learning algorithm according to the original calculation graph and original model data to obtain a first calculation graph and first model data;
  • the model data processing subunit is used to perform hardware-platform-oriented processing on the first model data according to the data descriptor to obtain second model data;
  • the binary code generation subunit is used to obtain the binary code of the deep learning algorithm according to the hardware instruction and the second model data (a pipeline sketch covering these subunits follows the clause list below).
  • the original model data is obtained.
  • the continuous linear transformation operation node is processed to obtain the first calculation graph and the first model data.
  • the first calculation graph is processed according to the cost model, combined with a heuristic search strategy, to obtain hardware instructions and data descriptors.
  • the dimensional transformation of the first model data is performed according to the operation requirements in the hardware platform; or,
  • the accuracy of the first model data is selected according to the operation requirements in the hardware platform.
  • Clause B49 The device according to clause B37, wherein the device is further configured to execute the binary code of the deep learning algorithm in a deep learning processor through a runtime system.
  • Class extension is performed on the deep learning framework, and the tensor data and the operation instructions are encapsulated in the data of the deep learning framework, so as to realize the integration of the deep learning framework and the deep learning programming library interface (a wrapper-class sketch follows the clause list below).
  • a deep learning computing device comprising one or more operation data generating devices as described in any one of clauses B26-B50, the deep learning computing device being used to complete a set deep learning calculation.
  • Clause B52 A combined computing device, which includes one or more deep learning computing devices as described in Clause B51, a universal interconnection interface, and other processing devices;
  • the deep learning computing device interacts with the other processing device to jointly complete the computing operation specified by the user.
  • a deep learning chip which includes:
  • An electronic device including:
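
To make the layout in clause B32, and the offline file with its index table mentioned just below it, easier to picture, here is a minimal C++ sketch. It is an illustration only, not the patent's implementation: the type names (StaticOperationSourceFile, OfflineCache), the choice of std::map, and the byte-vector representation of binary code and tensor data are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of the static operation source file from clause B32: four segments keeping
// the binary code and tensor data of built-in and customized operations apart.
struct StaticOperationSourceFile {
    std::map<std::string, std::vector<uint8_t>> static_code;   // binary code of built-in operation instructions
    std::map<std::string, std::vector<uint8_t>> dynamic_code;  // binary code of customized operation instructions
    std::map<std::string, std::vector<uint8_t>> static_data;   // tensor data of built-in operation instructions
    std::map<std::string, std::vector<uint8_t>> dynamic_data;  // tensor data of customized operation instructions
};

// Sketch of the offline file plus index table: the index table records where the
// pre-compiled result of each offline operation instruction sits inside the offline file.
struct OfflineCache {
    std::vector<uint8_t> offline_file;                        // concatenated pre-compiled results
    std::map<std::string, std::pair<size_t, size_t>> index;   // name -> (offset, length)

    std::vector<uint8_t> Lookup(const std::string& name) const {
        auto it = index.find(name);
        if (it == index.end()) return {};
        size_t offset = it->second.first;
        size_t length = it->second.second;
        return std::vector<uint8_t>(offline_file.begin() + offset,
                                    offline_file.begin() + offset + length);
    }
};

int main() {
    OfflineCache cache;
    cache.offline_file = {0x01, 0x02, 0x03, 0x04};
    cache.index["conv_fused"] = {0, 4};  // hypothetical offline operation name
    return cache.Lookup("conv_fused").size() == 4 ? 0 : 1;
}
```

A real programming library would more likely memory-map the offline file than copy bytes out of it, but the shape of the name-to-position lookup is the same.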
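Clause B36 names four attributes of tensor data. A minimal descriptor carrying them might look like the sketch below; the enumerators and field names are assumptions for illustration (for example, a logical data type seen by the user versus a physical data type actually stored for the hardware, and a physical layout such as NCHW or NHWC).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical enumerations; the clause only names the four attribute kinds.
enum class LogicalDType   { kFloat32, kFloat16, kInt8 };   // type as seen by the user
enum class PhysicalDType  { kFloat32, kFloat16, kInt8 };   // type actually stored for the hardware
enum class PhysicalLayout { kNCHW, kNHWC };                 // placement of dimensions in memory

// A minimal tensor descriptor carrying the four attributes from clause B36.
struct TensorDescriptor {
    std::vector<int64_t> shape;    // shape attribute
    LogicalDType logical_dtype;    // logical data type attribute
    PhysicalDType physical_dtype;  // physical data type attribute
    PhysicalLayout layout;         // physical layout attribute
};

int main() {
    TensorDescriptor t{{1, 3, 224, 224}, LogicalDType::kFloat32,
                       PhysicalDType::kFloat16, PhysicalLayout::kNHWC};
    return t.shape.size() == 4 ? 0 : 1;
}
```

Separating the logical from the physical data type is what lets the library change precision for the hardware platform (as in the precision-selection fragment above) without changing what the user's program declares.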
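The fragments about the operation instruction type judgment unit, the static search unit, the customized/built-in/offline search order, and the fall-back to real-time compilation can be read together as one dispatch routine. The sketch below is non-authoritative: the function names (FindCustomized, FindBuiltIn, FindOffline, JitCompile) are invented for the example, and the stub bodies only stand in for real lookups into the static operation pool.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using Binary = std::vector<uint8_t>;

enum class InstructionType { kStatic, kDynamic };

// Hypothetical lookups into the static operation pool; the clauses say a static
// operation is searched for by name in the customized, built-in and offline
// regions, in that order. These stubs simply report "not found".
std::optional<Binary> FindCustomized(const std::string&) { return std::nullopt; }
std::optional<Binary> FindBuiltIn(const std::string&)    { return std::nullopt; }
std::optional<Binary> FindOffline(const std::string&)    { return std::nullopt; }

// Placeholder for real-time compilation of a dynamic operation instruction.
Binary JitCompile(const std::string& name) { return Binary(name.size(), 0); }

Binary GetBinaryCode(const std::string& name, InstructionType type) {
    if (type == InstructionType::kStatic) {
        if (auto b = FindCustomized(name)) return *b;
        if (auto b = FindBuiltIn(name))    return *b;
        if (auto b = FindOffline(name))    return *b;
        // Not found anywhere in the static operation pool: per the clause list above,
        // the static operation instruction is then used as a dynamic operation
        // instruction and compiled in real time.
    }
    return JitCompile(name);
}

int main() {
    Binary code = GetBinaryCode("my_fused_conv", InstructionType::kStatic);
    return code.empty() ? 1 : 0;
}
```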
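The subunits listed for real-time compilation of a dynamic operation instruction (original data acquisition, collaborative processing, cost-model planning, hardware-platform-oriented model data processing, and binary code generation) describe a linear pipeline. The sketch below only strings those stages together; every type and function name is hypothetical, and the stub bodies merely mark where the cost-model/heuristic-search step and the dimension-transformation or precision-selection step would sit.

```cpp
#include <string>
#include <utility>
#include <vector>

// Placeholder types standing in for the concepts named in the clauses.
struct CalculationGraph { std::vector<std::string> nodes; };
struct ModelData { std::vector<float> weights; };
struct DataDescriptor { std::string layout = "unspecified"; };
struct HardwareInstruction { std::string text; };
using Binary = std::vector<unsigned char>;

// Stage 1: obtain the original calculation graph and original model data
// corresponding to the dynamic operation instruction.
std::pair<CalculationGraph, ModelData> AcquireOriginal(const std::string& op) {
    return {CalculationGraph{{op}}, ModelData{{0.0f}}};
}

// Stage 2: collaborative processing for the deep learning algorithm, e.g. merging
// consecutive linear-transformation nodes, yielding the first graph and model data.
std::pair<CalculationGraph, ModelData> CollaborativeProcess(CalculationGraph g, ModelData m) {
    return {std::move(g), std::move(m)};
}

// Stage 3: process the first calculation graph with a cost model combined with a
// heuristic search strategy to obtain hardware instructions and data descriptors.
std::pair<std::vector<HardwareInstruction>, DataDescriptor>
PlanForHardware(const CalculationGraph&) {
    return {{HardwareInstruction{"nop"}}, DataDescriptor{}};
}

// Stage 4: hardware-platform-oriented processing of the first model data according
// to the data descriptor, e.g. a dimension transformation or a precision selection.
ModelData AdaptModelData(ModelData m, const DataDescriptor&) { return m; }

// Stage 5: generate the binary code of the deep learning algorithm from the
// hardware instructions and the second model data.
Binary Assemble(const std::vector<HardwareInstruction>& instrs, const ModelData&) {
    return Binary(instrs.size(), 0u);
}

// The real-time compilation path for a dynamic operation instruction.
Binary CompileDynamic(const std::string& op) {
    auto [graph0, model0] = AcquireOriginal(op);
    auto [graph1, model1] = CollaborativeProcess(graph0, model0);
    auto [instrs, desc]   = PlanForHardware(graph1);
    ModelData model2      = AdaptModelData(model1, desc);
    return Assemble(instrs, model2);
}

int main() { return CompileDynamic("conv_relu").empty() ? 1 : 0; }
```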
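The class-extension fragment (encapsulating the programming library's tensor data and operation instructions inside the deep learning framework's own data objects) can be pictured as a thin wrapper type. Everything below is hypothetical: FrameworkBlob stands in for whatever data class the host framework uses, and the embedded members stand in for the library-side handles.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Stand-ins for the host framework's data class and the programming library's objects.
struct FrameworkBlob { std::vector<float> values; };   // hypothetical framework data type
struct LibraryTensor { std::vector<int64_t> shape; };  // library-side tensor data
struct OperationInstruction { std::string name; };     // library-side operation instruction

// Class extension: the framework's blob is extended so that it also carries the
// programming library's tensor data and the operation instruction that produces it,
// letting the framework and the programming library interface share one object.
class ExtendedBlob : public FrameworkBlob {
public:
    ExtendedBlob(LibraryTensor t, OperationInstruction op)
        : tensor_(std::move(t)), op_(std::move(op)) {}

    const LibraryTensor& tensor() const { return tensor_; }
    const OperationInstruction& op() const { return op_; }

private:
    LibraryTensor tensor_;
    OperationInstruction op_;
};

int main() {
    ExtendedBlob blob(LibraryTensor{{1, 16, 8, 8}}, OperationInstruction{"conv"});
    return blob.tensor().shape.empty() ? 1 : 0;
}
```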

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a compilation method and device for a deep learning algorithm, and a related product, which improve the computational efficiency of the related product when executing operations of a neural network model. The product comprises a control unit, and the control unit comprises an instruction cache unit, an instruction processing unit and a storage queue unit. The instruction cache unit is used to store computation instructions associated with artificial neural network operations; the instruction processing unit is used to parse the computation instructions to obtain a plurality of operation instructions; the storage queue unit is used to store an instruction queue, the instruction queue comprising a plurality of operation instructions or computation instructions to be executed in the sequential order of the queue.
PCT/CN2020/085882 2019-07-03 2020-04-21 Procédé et dispositif de compilation pour algorithme d'apprentissage profond, et produit associé WO2021000638A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201910596220.8A CN112183735A (zh) 2019-07-03 2019-07-03 操作数据的生成方法、装置及相关产品
CN201910596132.8A CN112183712A (zh) 2019-07-03 2019-07-03 深度学习算法的编译方法、装置及相关产品
CN201910596132.8 2019-07-03
CN201910596220.8 2019-07-03

Publications (1)

Publication Number Publication Date
WO2021000638A1 true WO2021000638A1 (fr) 2021-01-07

Family

ID=74100855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085882 WO2021000638A1 (fr) 2019-07-03 2020-04-21 Procédé et dispositif de compilation pour algorithme d'apprentissage profond, et produit associé

Country Status (1)

Country Link
WO (1) WO2021000638A1 (fr)

Citations (6)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1518693A (zh) * 2000-10-05 2004-08-04 皇家菲利浦电子有限公司 可重定目标的编译系统和方法
US20090031111A1 (en) * 2007-07-26 2009-01-29 Chou Deanna J Method, apparatus and computer program product for dynamically selecting compiled instructions
CN106325967A (zh) * 2015-06-30 2017-01-11 华为技术有限公司 一种硬件加速方法、编译器以及设备
US10157045B2 (en) * 2016-11-17 2018-12-18 The Mathworks, Inc. Systems and methods for automatically generating code for deep learning systems
CN109496294A (zh) * 2018-01-15 2019-03-19 深圳鲲云信息科技有限公司 人工智能处理装置的编译方法及系统、存储介质及终端
CN109754011A (zh) * 2018-12-29 2019-05-14 北京中科寒武纪科技有限公司 基于Caffe的数据处理方法、装置和相关产品

Similar Documents

Publication Publication Date Title
WO2021000970A1 (fr) Procédé, dispositif de compilation d'algorithme d'apprentissage profond, et produit associé
AU2019204395B2 (en) Visually specifying subsets of components in graph-based programs through user interactions
WO2021000971A1 (fr) Procédé et dispositif de génération de données de fonctionnement et produit associé
CN106663010B (zh) 执行基于图的程序规范
Kotsifakou et al. Hpvm: Heterogeneous parallel virtual machine
CN106687918B (zh) 编译基于图的程序规范
US11144283B2 (en) Visual program specification and compilation of graph-based computation
CN110383247B (zh) 由计算机执行的方法、计算机可读介质与异构计算系统
CN106687920B (zh) 管理任务的调用
CN106687919B (zh) 用于控制多个组件的执行的方法、系统和计算机可读介质
CN103858099A (zh) 用于在异构计算机上编译和运行高级程序的技术
EP2805232B1 (fr) Prédication d'instructions de flux de contrôle ayant des instructions de chargement de texture associées pour un processeur graphique
CN105493030A (zh) 着色器函数链接图表
US9268537B1 (en) Automatic generation of domain-aware phase ordering for effective optimization of code for a model
Brunhaver II Design and optimization of a stencil engine
WO2023030507A1 (fr) Procédé et appareil d'optimisation de compilation, dispositif informatique et support de stockage
WO2021000638A1 (fr) Procédé et dispositif de compilation pour algorithme d'apprentissage profond, et produit associé
JP7495028B2 (ja) コンピュータ実装方法及びコンピュータプログラム
US20240069511A1 (en) Instruction generation and programming model for a data processing array and microcontroller
Pan et al. A Deep Learning Compiler for Vector Processor
Sujeeth Productivity and Performance with Embedded Domain-Specific Languages
Mogers Guided rewriting and constraint satisfaction for parallel GPU code generation
Coco Homogeneous programming, scheduling and execution on heterogeneous platforms
CN117407169A (zh) 一种用于OpenMP Offload的性能优化方法、装置及电子设备
CN117648091A (zh) 计算图的编译方法及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20835595

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20835595

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 05/07/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20835595

Country of ref document: EP

Kind code of ref document: A1