CN114186678A - Hardware adaptation device and method based on deep learning - Google Patents

Hardware adaptation device and method based on deep learning

Info

Publication number
CN114186678A
Authority
CN
China
Prior art keywords
hardware
interface
operator
model
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111504826.8A
Other languages
Chinese (zh)
Other versions
CN114186678B (en)
Inventor
洪明
朱鹏阳
严春伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111504826.8A
Publication of CN114186678A
Application granted
Publication of CN114186678B
Active legal status (current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G06F 8/31 Programming languages or programming paradigms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a hardware adaptation device and method based on deep learning, and relates to the technical fields of deep learning and graph optimization. One embodiment of the apparatus comprises: a deep learning inference framework module, configured to obtain an intermediate representation with a graph structure from an input target model file; a subgraph engine module, configured to fuse a configuration file with the intermediate representation to obtain a subgraph operator, where the configuration file is used to define the data attributes and hardware type of the subgraph operator; and a hardware adaptation module, configured to convert the subgraph operator into instruction code executed on target hardware corresponding to the hardware type.

Description

Hardware adaptation device and method based on deep learning
Technical Field
The present disclosure relates to the field of computers, and in particular, to deep learning and graph optimization, and more particularly, to a hardware adaptation apparatus and method based on deep learning.
Background
With the wide application of deep learning technology in various fields, a large number of Artificial Intelligence (AI) chips that are more efficient than traditional architectures such as the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU) have been developed. A good software ecosystem is key to the success of AI hardware. It depends on the maturity of the hardware vendor's software stack and on whether broad support from deep learning inference frameworks can be obtained; the latter helps users simplify the service deployment process and reduce the migration cost caused by hardware differences, so that higher performance and energy efficiency benefits can be obtained quickly.
At present, the major deep learning inference frameworks generally adopt the following schemes. (1) Delegation (Delegate) approach: TensorFlow Lite and MindSpore Lite commonly use a Delegate mechanism to run some of the operators in a model on accelerators such as a GPU, a Digital Signal Processor (DSP) or a neural network processing unit (NPU). During adaptation, a vendor only needs to implement a Delegate subclass and its interfaces for each piece of hardware; during model optimization, the operators supported by the hardware are fused into a subgraph operator; and during model execution, the subgraph is converted into a hardware model when the subgraph operator runs, with the result returned to the framework after execution. (2) Android NNAPI (Android Neural Networks API) and Android NN Runtime: to decouple the inference framework from hardware adaptation, Google standardized the framework-layer interfaces for device management, model networking, execution and so on to establish the Android NNAPI interface system, and then adapts different AI hardware through the Android NN Runtime. In theory, after adapting the Android NN Runtime, a hardware vendor can be accessed by different deep learning inference frameworks such as TensorFlow Lite and PyTorch Mobile through the unified Android NNAPI.
Disclosure of Invention
The embodiment of the disclosure provides a hardware adaptation device and method based on deep learning.
In a first aspect, an embodiment of the present disclosure provides a hardware adaptation device based on deep learning, including: a deep learning inference framework module, configured to obtain an intermediate representation with a graph structure based on an input target model file; a subgraph engine module, configured to fuse a configuration file with the intermediate representation to obtain a subgraph operator, where the configuration file is used to define the data attributes and hardware type of the subgraph operator; and a hardware adaptation module, configured to convert the subgraph operator into instruction code executed on target hardware corresponding to the hardware type.
In a second aspect, an embodiment of the present disclosure provides a hardware adaptation method based on deep learning, including: acquiring a configuration file and a target model file; obtaining an intermediate representation of the graph structure by using a deep learning inference framework; fusing the intermediate representation by using the configuration file and a subgraph engine to obtain a subgraph operator, where the configuration file is used to define the data attributes and hardware type of the subgraph operator; and converting the subgraph operator into instruction code executed on target hardware corresponding to the hardware type according to a hardware adaptation framework.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the second aspect.
In a fourth aspect, the disclosed embodiments propose a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the second aspect.
In a fifth aspect, the disclosed embodiments propose a computer program product comprising a computer program that, when executed by a processor, implements the method as described in the second aspect.
According to the hardware adaptation device and method based on deep learning provided by the embodiments of the present disclosure, an intermediate representation of the graph structure corresponding to an input target model file is obtained through a deep learning inference framework module; the configuration file and the intermediate representation are fused through a subgraph engine module to obtain a subgraph operator, where the configuration file is used to define the data attributes and hardware type of the subgraph operator; and the hardware adaptation module converts the subgraph operator into instruction code executed on target hardware corresponding to the hardware type. A hardware adaptation module can thus be established between the deep learning inference framework module and the target hardware, which decouples hardware adaptation from the deep learning inference framework and lowers the learning threshold of the framework. Because the data attributes and hardware type can be defined through the configuration file, the usage scenarios of the data attributes can be customized and the precision of operator execution can be improved. Meanwhile, any change of the deep learning inference framework is absorbed by the hardware adaptation layer, so the instruction code has stronger robustness and maintainability.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of one embodiment of a deep learning based hardware adaptation apparatus according to the present disclosure;
FIG. 2 is a schematic diagram of a sub-graph engine module;
FIG. 3 is a schematic diagram of a hardware adaptation module;
FIG. 4 is a flow diagram of one embodiment of a deep learning based hardware adaptation method according to the present disclosure;
FIG. 5 is a block diagram of an electronic device used to implement an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a schematic diagram of an embodiment of a deep learning based hardware adaptation apparatus to which the present disclosure may be applied.
As shown in fig. 1, the hardware adaptation apparatus based on deep learning may include a deep learning inference framework module 101, a sub-graph engine module 102 and a hardware adaptation module 103, wherein:
the deep learning inference framework module 101 is used for obtaining the intermediate representation of the graph structure based on the input target model file; the sub-graph engine module 102 is configured to fuse the configuration file and the intermediate representation to obtain a sub-graph operator, where the configuration file is used to define the data attributes and hardware type of the sub-graph operator; and the hardware adaptation module 103 is configured to convert the sub-graph operator into instruction code executed on target hardware corresponding to the hardware type.
In this embodiment, the deep learning inference framework module 101 is specifically configured to: parse the input target model file to acquire the topological structure information of the neural network; and use the feature map information and computation operation information in the topological structure information as nodes and edges, respectively, to generate an intermediate representation of the graph structure.
In one example, the deep learning inference framework module 101 may include parsing sub-modules, each corresponding to one type of model file and used to parse model files of that type. For example, a PaddlePaddle parser and a TensorFlow parser are used to parse model files produced by the PaddlePaddle and TensorFlow deep learning frameworks, respectively. Supported frameworks include PaddlePaddle, Caffe, TensorFlow, MXNet, PyTorch and the like. The deep learning inference framework module can build different computational graph models using the framework's DSL and API to implement specific tasks, such as face recognition, image detection and speech recognition.
It should be noted that, for a new deep learning inference framework, a parser for its model files may be added accordingly.
Here, the generated intermediate representation is a computation graph in which nodes represent feature maps and edges represent computation operations, and both nodes and edges carry attributes. The attributes of a node may include the dimension information and/or height-width-channel information of the feature map. The computation operation represented by an edge includes at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, non-linearity, batch normalization (BatchNorm) and scaling (scale). The attributes of an edge include the parameters of the computation operation, for example at least one of: convolution kernel size, padding (pad), stride, grouping and dilation.
In one example, the intermediate representation is an IR in the form of a graph. Each node of the graph represents either an Op (operator), including but not limited to convolution, pooling, dimension transformation, element-wise addition, deconvolution, normalization and non-linearity, or a tensor, which represents the input or output data of an operator. Each edge of the graph represents a dependency between an operator and a tensor, and an operator node is adjacent only to tensor nodes.
In this embodiment, the deep learning inference framework module can parse neural network models (i.e., target model files) developed on different deep learning frameworks into a framework-independent Intermediate Representation (IR), thereby decoupling the deep learning framework from the hardware adaptation module and converting the graph structures of various deep learning inference frameworks, which have different granularities, into the fixed-granularity intermediate representation of the present disclosure.
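The following C++ sketch illustrates one possible shape of such a graph-form IR, with operator nodes, tensor nodes and the edges between them; the type and field names are illustrative assumptions and are not part of the disclosure.

#include <deque>
#include <map>
#include <string>
#include <vector>

struct TensorNode {                          // node representing an operator's input or output data
  std::string name;
  std::vector<int> dims;                     // dimension / height-width-channel information of the feature map
};

struct OperatorNode {                        // node representing a computation operation (Op)
  std::string type;                          // e.g. "conv2d", "pool2d", "softmax"
  std::map<std::string, std::string> attrs;  // e.g. kernel size, pad, stride, group, dilation
  std::vector<TensorNode*> inputs;           // edges: the operator depends on these tensors
  std::vector<TensorNode*> outputs;          // edges: these tensors are produced by the operator
};

struct GraphIR {                             // framework-independent intermediate representation
  std::deque<TensorNode> tensors;            // deque keeps pointers to tensors stable as nodes are added
  std::vector<OperatorNode> operators;       // kept in the topological order of the target model file
};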
In this embodiment, the sub-graph engine module 102 is specifically configured to merge the computation operations, keeping the feature maps as nodes and using the merged computation operations as the sub-graph operators on the edges.
Merging the computation operations includes at least one of: removing operations that are unnecessary or have no influence on the computation result; fusing several adjacent computation operations; and decomposing a computation operation so that the decomposed operations can be fused with, or processed together with, the preceding or subsequent computation operations.
Here, the merging refers to merging a plurality of adjacent computation operations.
In one example, the fusion may fuse a computation operation into the access operations and/or computation operations that precede and follow it.
Correspondingly, in this example, obtaining the operator subgraph comprises the following steps:
Step 1: operator marking.
Each operator in the graph is traversed in turn according to the topological order of the target model file, and the operators that can be converted into hardware IR are marked according to the registered Paddle operator -> hardware IR conversion table.
For example, the graph includes 10 operators Op1 to Op10. Assuming that Op1, Op3 and Op10 cannot be converted to hardware IR, Op1, Op3 and Op10 are labeled as one class of operators (e.g., labeled yellow), while Op2, Op4, Op5, Op6, Op7, Op8 and Op9 are labeled as another class of operators (e.g., labeled red), indicating that the Op2, Op4, Op5, Op6, Op7, Op8 and Op9 operators can be converted into hardware IR.
Step 2: subgraph detection.
A reverse Depth-First Search (DFS) algorithm is used to group the operators marked in the first step into subgraphs, marking connected operators as belonging to the same subgraph.
For example, Op2 is divided separately into subgraph 1, while Op4, Op5, Op6, Op7, Op8 and Op9 are divided into subgraph 2. That is, there is no dependency between Op2 and Op4, Op5, Op6, Op7, Op8 and Op9; the operators that do have dependencies on each other are marked as the same subgraph.
Step 3: subgraph fusion.
The subgraphs obtained in the second step are fused; for example, subgraphs containing fewer operators than a preset number are deleted.
In one example, if a subgraph has too few operators, the subgraph is deleted and the remaining subgraphs then undergo operator fusion.
Specifically, a subgraph operator is used to represent all the operators contained in a subgraph; all the operators contained in the subgraph are stored in the program (ProgramDesc) as a new block (BlockDesc), and the block index is stored in the subgraph operator as an attribute.
It should be noted that the preset number of operators may be 1. It can be set according to the required inference precision and inference speed, or by the relevant personnel. Whether an operator is supported by the hardware is determined by the hardware platform.
In this embodiment, subgraphs with few operators are deleted, which reduces the overhead caused by excessive data copying between the hardware and the host. A code sketch of these three steps is given below.
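As a rough illustration of the three steps above, the following C++ sketch marks supported operators, groups connected supported operators into subgraphs with a reverse depth-first search, and drops subgraphs smaller than the preset size. It builds on the GraphIR sketch shown earlier; the helper names and the use of a set of supported operator types (in place of the registered conversion table) are assumptions for illustration only.

#include <set>
#include <string>
#include <vector>

// Two operators are connected if one consumes a tensor the other produces.
static bool SharesTensor(const OperatorNode& a, const OperatorNode& b) {
  for (const TensorNode* t : a.outputs)
    for (const TensorNode* u : b.inputs) if (t == u) return true;
  for (const TensorNode* t : a.inputs)
    for (const TensorNode* u : b.outputs) if (t == u) return true;
  return false;
}

// Returns the subgraph index of every operator; -1 marks operators left outside any hardware subgraph.
std::vector<int> DetectSubgraphs(const GraphIR& graph,
                                 const std::set<std::string>& supported_op_types,
                                 size_t min_ops) {
  const size_t n = graph.operators.size();
  std::vector<bool> supported(n);
  for (size_t i = 0; i < n; ++i)                                  // step 1: operator marking
    supported[i] = supported_op_types.count(graph.operators[i].type) > 0;

  std::vector<int> subgraph_id(n, -1);
  int current = 0;
  std::vector<size_t> stack;
  for (size_t s = n; s-- > 0; ) {                                 // step 2: reverse DFS subgraph detection
    if (!supported[s] || subgraph_id[s] >= 0) continue;
    stack.push_back(s);
    while (!stack.empty()) {
      size_t i = stack.back(); stack.pop_back();
      if (subgraph_id[i] >= 0) continue;
      subgraph_id[i] = current;
      for (size_t j = 0; j < n; ++j)
        if (supported[j] && subgraph_id[j] < 0 && SharesTensor(graph.operators[i], graph.operators[j]))
          stack.push_back(j);
    }
    ++current;
  }

  for (int s = 0; s < current; ++s) {                             // step 3: subgraph fusion / pruning
    size_t count = 0;
    for (int id : subgraph_id) if (id == s) ++count;
    if (count < min_ops)
      for (int& id : subgraph_id) if (id == s) id = -1;           // too few operators: keep them on the host
  }
  return subgraph_id;
}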
In this embodiment, the hardware adaptation module 103 is specifically configured to: the sub-graph operator is converted into instruction code that is executed on target hardware corresponding to the hardware type.
In one example, the subgraph operator can be mapped to the instruction code of the target hardware. The instruction code may consist of instructions specific to the target hardware. The target hardware may be at least one of: target hardware based on an FPGA or ASIC; target hardware based on a GPU; target hardware based on a CPU.
In this embodiment, in order to deploy the target model file, the sub-graph operator needs to be compiled into binary instruction code that can be executed by the target hardware. For example, the sub-graph operator can be compiled through a DNNC (Deep Neural Network Compiler) into the instruction code of a DPU (Deep Learning Processor Unit) platform (i.e., the target hardware), or, similarly, into a serialized model generated by Nvidia TensorRT, so as to perform the inference computation of the corresponding target model file.
According to the hardware adaptation device based on deep learning provided by this embodiment of the present disclosure, the intermediate representation of the graph structure corresponding to the input target model file is obtained through the deep learning inference framework module; the configuration file and the intermediate representation are fused through the sub-graph engine module to obtain a sub-graph operator, where the configuration file is used to define the data attributes and hardware type of the sub-graph operator; and the sub-graph operator is converted into instruction code executed on the target hardware by the hardware adaptation module. A hardware adaptation module can thus be established between the deep learning inference framework module and the target hardware, which decouples hardware adaptation from the deep learning inference framework, lowers the learning threshold of the framework, and allows the data attributes and hardware type to be defined through the configuration file, so that the usage scenarios of the data attributes can be customized and the precision of operator execution can be improved. Meanwhile, any change of the deep learning inference framework is absorbed by the hardware adaptation layer, so the instruction code has stronger robustness and maintainability.
In some optional implementations of this embodiment, the hardware adaptation module 103 is specifically configured to: convert the sub-graph operator, through the interface corresponding to the sub-graph operator, into instruction code executed on the target hardware corresponding to that interface.
In this implementation, the hardware adaptation module 103 may convert the sub-graph operator into an instruction code executed on a target hardware corresponding to the interface through the interface corresponding to the sub-graph operator.
In some optional implementations of this embodiment, the interface includes at least one of: the system comprises a hardware management interface, a multi-hardware unified context interface, a model networking interface, a model compiling interface and a model executing interface.
In this implementation, the hardware adaptation framework module includes an inference framework adaptation layer interface, a runtime, a hardware abstraction layer standard interface, and a standard operator definition (as shown in fig. 2).
In one example, a framework adaptation layer standard interface, which adapts different deep learning inference frameworks to achieve complete decoupling from the deep learning inference framework, may include at least one of: a hardware management interface, a multi-hardware unified context interface, a model networking interface, a model compilation interface and a model execution interface.
The hardware management interface is used for querying basic hardware information, including the hardware name, vendor name, accelerator card type and hardware abstraction layer library version, and for hardware acquisition and initialization.
The multi-hardware unified context interface is used for creating unified contexts for multiple pieces of hardware. Preferably, the multi-hardware unified context interface provides parameters such as hardware operation, model compilation and execution for each hardware configuration by means of key-value strings.
The model networking interface is used for decoupling from the model expression of the deep learning inference framework, so as to establish a unified, hardware-independent intermediate representation and transform the operator and tensor objects in the inference framework's model into an internal unified expression.
The model compilation interface and the model execution interface are used for converting the intermediate representation of the model into target hardware code (i.e., instruction code) by calling the hardware vendor's software stack in the hardware abstraction layer library, and for returning the result to the deep learning inference framework after execution.
It should be noted that, in order to reduce the overhead caused by online networking (i.e., building the above graph structure) and model compilation, the built hardware model (i.e., the instruction code) may be read from a cache through the model compilation interface. The model compilation interface is, for example, of the following form:
create_program(Model* online_model, void* cache_model, ...);  // compile the incoming model intermediate representation online_model or the model cache cache_model to generate the hardware model
In this implementation, in order to create an intermediate representation that is independent of the deep learning inference framework, independent of the hardware, and shared by the runtime and the hardware abstraction layer, the operator types and parameter lists are standardized in addition to defining the data structures of the model and the operators and tensors it contains.
In this implementation, the runtime acts as a bridge between the framework adaptation layer and the hardware abstraction layer: it not only translates calls to the framework adaptation layer interface into the intermediate representation of models, operators and tensors and into calls to the hardware abstraction layer interface, but is also responsible for registering the hardware abstraction layer library and for serializing and deserializing the model cache.
It should be noted that, when the hardware model is run, the hardware model cache format and the serialization and deserialization processes need to be unified: the model cache data set through the framework adaptation layer interface is deserialized and then passed to the hardware abstraction layer, and the instruction code is restored by each hardware abstraction layer library. In this way, a unified model cache parsing and restoring process can be established for different hardware (i.e., target hardware).
In one example, a hardware abstraction layer standard interface is used to shield hardware details and provide a uniform device access interface to a runtime.
It should be noted that a hardware abstraction layer is established between the runtime and the vendor software stack. The hardware abstraction layer is composed of data structures such as the unified device interface description and the intermediate representations of the model, operators and tensors, implemented as C structs, for example as follows:
(The original publication shows the C struct definition of the hardware abstraction layer device interface here.)
The above interface definition mainly involves basic hardware-related information, such as the hardware name, vendor name, accelerator type and hardware abstraction layer interface version, and, more importantly, the definition of the hardware function call interfaces.
In one example, the open_device interface is called when the hardware is initialized at runtime, create_context is called when the hardware context is created, the create_program interface is called when a model intermediate representation or model cache data needs to be compiled to generate a hardware model, and finally execute_program is called when the hardware model is executed. Hardware vendors only need to implement these interfaces to complete the adaptation of their hardware; they do not need to understand the implementation principles of the deep learning inference framework, which can greatly reduce learning and development costs.
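A hypothetical sketch of such a device description struct is shown below. The four function pointers correspond to the open_device, create_context, create_program and execute_program calls named above; all field names, parameter lists and types are assumptions made for illustration and are not the actual interface of the disclosure or of any vendor library.

#include <stddef.h>
#include <stdint.h>

typedef struct Device {
  // Basic hardware-related information
  const char* name;        // hardware name
  const char* vendor;      // vendor name
  const char* type;        // accelerator type
  int32_t version;         // hardware abstraction layer interface version

  // Hardware function call interfaces implemented by each vendor's HAL library
  int (*open_device)(void** device_ctx);                                   // acquire and initialize the hardware
  int (*create_context)(void* device_ctx, const char* key_value_params,    // per-hardware context configured
                        void** context);                                   // with key-value strings
  int (*create_program)(void* context, const void* model_ir,               // compile a model IR, or a deserialized
                        const void* cache_buffer, size_t cache_length,     // model cache, into a hardware model
                        void** program);
  int (*execute_program)(void* program, void* input_tensors,               // run the hardware model and fill
                         void* output_tensors);                            // the output tensors
} Device;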
In some optional implementations of this embodiment, the data attribute includes: data type and/or data structure.
In this implementation, the data attributes may include: data type and/or data structure.
In one example, fp32, int8 or int16 data types can be set on a per-operator basis in the target model file.
In one example, the data structure is translated according to standard types, e.g., to normalize operator types and parameter lists.
In some optional implementations of this embodiment, the data type includes at least one of: fp32, int8, int16.
In this implementation, the data type includes at least one of: fp32, int8, int16.
In one example, the subgraph engine module includes mixed precision processing, subgraph detection/fusion, and subgraph transformation/execution to transform the intermediate representation into subgraph operators (as shown in fig. 3).
The mixed precision processing is used for setting the quantization data type, such as fp32, int8 or int16, on a per-operator basis in the target model file according to user-defined mixed precision configuration information, with the following formats as examples:
op_type:input_name_list:output_name_list:precision_type  // operator type : list of input tensor names : list of output tensor names : precision type
op_type:input_name_list::precision_type
op_type::output_name_list:precision_type
op_type:::precision_type
For example:
conv2d:in_var0,in_var1:out_var0:int8  // tensor data types of the listed conv2d inputs and output are int8
softmax:::fp32  // meaning that all softmax operators run at fp32 precision
In this implementation, quantization or inverse quantization operators are automatically inserted between operators of different precision in the intermediate representation.
In one example, for already quantized parameters, such as the filter of conv2d, if conv2d is forced to run on fp32 precision calculations, the filter is also restored from the quantized type to the floating point type.
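The following C++ sketch, reusing the GraphIR types from the earlier sketch, shows one way such a pass could insert quantize/dequantize operators wherever a producer and a consumer are configured with different precisions; the operator names, the per-operator precision map and the fp32 default are assumptions for illustration, not the framework's actual implementation.

#include <map>
#include <string>
#include <vector>

static const OperatorNode* ProducerOf(const GraphIR& graph, const TensorNode* t) {
  for (const OperatorNode& op : graph.operators)
    for (const TensorNode* out : op.outputs) if (out == t) return &op;
  return nullptr;                                            // t is a model input with no producer
}

void InsertQuantDequant(GraphIR& graph, const std::map<std::string, std::string>& precision_by_op) {
  auto precision_of = [&](const std::string& op_type) {
    auto it = precision_by_op.find(op_type);
    return it != precision_by_op.end() ? it->second : std::string("fp32");
  };
  std::vector<OperatorNode> inserted;
  for (OperatorNode& consumer : graph.operators) {
    const std::string to = precision_of(consumer.type);
    for (TensorNode*& input : consumer.inputs) {
      const OperatorNode* producer = ProducerOf(graph, input);
      if (producer == nullptr || precision_of(producer->type) == to) continue;
      // Producer and consumer run at different precisions: bridge them with a conversion operator.
      graph.tensors.push_back(TensorNode{input->name + "/" + to, input->dims});
      TensorNode* converted = &graph.tensors.back();         // deque keeps this pointer stable
      OperatorNode convert;
      convert.type = (to == "fp32") ? "dequantize" : "quantize";
      convert.inputs = {input};
      convert.outputs = {converted};
      inserted.push_back(convert);
      input = converted;                                     // rewire the consumer to the converted tensor
    }
  }
  for (OperatorNode& op : inserted) graph.operators.push_back(op);
}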
In one example, subgraph detection and fusion are used for setting the device type on a per-operator basis in the target model file, according to user-defined subgraph detection configuration information.
For example, the device type configuration formats are:
op_type:input_name_list:output_name_list:device_type  // operator type : list of input tensor names : list of output tensor names : device type
op_type:input_name_list::device_type
op_type::output_name_list:device_type
op_type:::device_type
Several cases are supported, for example:
transpose:in_var0:out_var0:cpu  // means that all the transpose operators run on the CPU
softmax:::cpu  // means that all softmax operators run on the CPU
In the sub-graph detection process, this makes it possible to prevent specific operators from being drawn into a hardware sub-graph, which is often used in detection-type target model files: the backbone structure in the first half of the target model file runs on the hardware accelerator, while the post-processing operators run on the CPU. A parsing sketch for this configuration format follows.
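As an illustration of how such colon-separated configuration lines might be read, the C++ sketch below parses one rule of the form op_type:input_name_list:output_name_list:device_type (the precision rules above share the same shape); the struct and function names are assumptions, not part of the disclosure.

#include <sstream>
#include <string>
#include <vector>

struct ConfigRule {
  std::string op_type;                 // operator type, e.g. "transpose" or "softmax"
  std::vector<std::string> inputs;     // input tensor names; empty means "match any"
  std::vector<std::string> outputs;    // output tensor names; empty means "match any"
  std::string target;                  // device type here, or precision type in the mixed precision file
};

static std::vector<std::string> SplitNames(const std::string& s) {
  std::vector<std::string> names;
  std::stringstream ss(s);
  std::string item;
  while (std::getline(ss, item, ',')) if (!item.empty()) names.push_back(item);
  return names;
}

// Parses a line such as "transpose:in_var0:out_var0:cpu" or "softmax:::cpu".
ConfigRule ParseRule(const std::string& line) {
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ':')) fields.push_back(field);
  fields.resize(4);                    // missing trailing fields are treated as empty
  return ConfigRule{fields[0], SplitNames(fields[1]), SplitNames(fields[2]), fields[3]};
}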
In one example, sub-graph transformation and execution convert each operator in the sub-graph into calls to the hardware networking API.
With further reference to fig. 4, fig. 4 illustrates a flow 400 of one embodiment of a deep learning based hardware adaptation method according to the present disclosure. The hardware adaptation method based on deep learning can comprise the following steps:
step 401, obtaining a configuration file and a target model file.
In this embodiment, the execution subject (e.g., server) of the deep learning based hardware adaptation method may obtain the configuration file and the target model file locally or externally. The configuration file may be used to define the data type of the sub-graph operator and the target hardware on which the sub-graph operator is run. The target model file may be a model file to be deployed on the target hardware.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not specifically limited herein.
Step 402, obtaining an intermediate representation of the graph structure by using a deep learning inference framework.
In this embodiment, the execution subject may parse the target model file through parsing sub-modules, each corresponding to one type of model file and configured to parse model files of that type. For example, a PaddlePaddle parser and a TensorFlow parser are used to parse model files produced by the PaddlePaddle and TensorFlow deep learning frameworks, respectively. Supported frameworks include PaddlePaddle, Caffe, TensorFlow, MXNet, PyTorch and the like. The deep learning inference framework module can build different computational graph models using the framework's DSL and Application Programming Interface (API) to implement specific tasks, such as face recognition, image detection and speech recognition.
It should be noted that, for a new deep learning inference framework, a parser for its model files may be added accordingly.
Here, the generated intermediate representation is a computation graph in which nodes represent feature maps and edges represent computation operations, and both nodes and edges carry attributes. The attributes of a node may include the dimension information and/or height-width-channel information of the feature map. The computation operation represented by an edge includes at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, non-linearity, batch normalization (BatchNorm) and scaling (scale). The attributes of an edge include the parameters of the computation operation, for example at least one of: convolution kernel size, padding (pad), stride, grouping and dilation.
In one example, the intermediate representation is an IR in the form of a graph. Each node of the graph represents either an Op (operator), including but not limited to convolution, pooling, dimension transformation, element-wise addition, deconvolution, normalization and non-linearity, or a tensor, which represents the input or output data of an operator. Each edge of the graph represents a dependency between an operator and a tensor, and an operator node is adjacent only to tensor nodes.
In this embodiment, the deep learning inference framework can parse neural network models (i.e., target model files) developed on different deep learning frameworks into a framework-independent Intermediate Representation (IR), thereby decoupling the deep learning framework from the hardware adaptation module and converting the graph structures of various deep learning inference frameworks, which have different granularities, into the fixed-granularity intermediate representation of the present disclosure.
Step 403, fusing the intermediate representation by using the configuration file and the subgraph engine to obtain a subgraph operator, where the configuration file is used for defining the data type and the hardware type of the subgraph operator.
In this embodiment, the execution subject may merge the computation operations, keeping the feature maps as nodes and using the merged computation operations as the sub-graph operators on the edges.
Merging the computation operations includes at least one of: removing operations that are unnecessary or have no influence on the computation result; fusing several adjacent computation operations; and decomposing a computation operation so that the decomposed operations can be fused with, or processed together with, the preceding or subsequent computation operations.
Here, the merging refers to merging a plurality of adjacent computation operations.
In one example, the fusion may fuse a computation operation into the access operations and/or computation operations that precede and follow it.
Correspondingly, in this example, obtaining the operator subgraph comprises the following steps:
Step 1: operator marking.
Each operator in the graph is traversed in turn according to the topological order of the target model file, and the operators that can be converted into hardware IR are marked according to the registered Paddle operator -> hardware IR conversion table.
For example, the graph includes 10 operators Op1 to Op10. Assuming that Op1, Op3 and Op10 cannot be converted to hardware IR, Op1, Op3 and Op10 are labeled as one class of operators (e.g., labeled yellow), while Op2, Op4, Op5, Op6, Op7, Op8 and Op9 are labeled as another class of operators (e.g., labeled red), indicating that the Op2, Op4, Op5, Op6, Op7, Op8 and Op9 operators can be converted into hardware IR.
Step 2: subgraph detection.
A reverse Depth-First Search (DFS) algorithm is used to group the operators marked in the first step into subgraphs, marking connected operators as belonging to the same subgraph.
For example, Op2 is divided separately into subgraph 1, while Op4, Op5, Op6, Op7, Op8 and Op9 are divided into subgraph 2. That is, there is no dependency between Op2 and Op4, Op5, Op6, Op7, Op8 and Op9; the operators that do have dependencies on each other are marked as the same subgraph.
Step 3: subgraph fusion.
The subgraphs obtained in the second step are fused; for example, subgraphs containing fewer operators than a preset number are deleted.
In one example, if a subgraph has too few operators, the subgraph is deleted and the remaining subgraphs then undergo operator fusion.
Specifically, a subgraph operator is used to represent all the operators contained in a subgraph; all the operators contained in the subgraph are stored in the program (ProgramDesc) as a new block (BlockDesc), and the block index is stored in the subgraph operator as an attribute.
It should be noted that the preset number of operators may be 1. It can be set according to the required inference precision and inference speed, or by the relevant personnel. Whether an operator is supported by the hardware is determined by the hardware platform.
In this embodiment, subgraphs with few operators are deleted, which reduces the overhead caused by excessive data copying between the hardware and the host.
Step 404, converting the sub-graph operator into an instruction code executed on the target hardware according to a hardware adaptation framework.
In this embodiment, the execution body may convert the sub-graph operator into an instruction code executed on the target hardware.
In one example, the subgraph operator can be mapped to the instruction code of the target hardware. The instruction code may consist of instructions specific to the target hardware. The target hardware may be at least one of: target hardware based on an FPGA or ASIC; target hardware based on a GPU; target hardware based on a CPU.
In this embodiment, in order to deploy the target model file, the sub-graph operator needs to be compiled into binary instruction code that can be executed by the target hardware. For example, the sub-graph operator can be compiled through a DNNC (Deep Neural Network Compiler) into the instruction code of a DPU (Deep Learning Processor Unit) platform (i.e., the target hardware), or, similarly, into a serialized model generated by Nvidia TensorRT, so as to perform the inference computation of the corresponding target model file.
According to the hardware adaptation method based on deep learning provided by this embodiment of the present disclosure, a configuration file and a target model file are first obtained; an intermediate representation of the graph structure is then obtained by using the deep learning inference framework; the intermediate representation is then fused by using the configuration file and the subgraph engine to obtain a subgraph operator, where the configuration file is used to define the data type and hardware type of the subgraph operator; and the subgraph operator is finally converted into instruction code executed on the target hardware according to the hardware adaptation framework. A hardware adaptation framework can thus be established between the deep learning inference framework and the target hardware, which decouples the hardware adaptation framework from the deep learning inference framework, lowers the learning threshold of the framework, and allows the data attributes and hardware type to be defined through the configuration file, so that the usage scenarios of the data attributes can be customized and the precision of operator execution can be improved. Meanwhile, any change of the deep learning inference framework is absorbed by the hardware adaptation framework, so the instruction code has stronger robustness and maintainability. A structural sketch of this flow is given below.
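Purely as a structural outline of steps 401 to 404 (declarations only, not runnable end to end), the C++ sketch below strings the stages together, reusing the GraphIR, InsertQuantDequant and DetectSubgraphs sketches from the device embodiment; every type and function name is an assumption introduced for illustration.

#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Config {                                           // contents of the configuration file
  std::string device_type;                                // hardware type of the target hardware
  std::set<std::string> supported_op_types;               // operators the target hardware supports
  std::map<std::string, std::string> precision_by_op;     // per-operator data types (fp32/int8/int16)
  size_t min_subgraph_ops;                                // subgraphs smaller than this are dropped
};

Config LoadConfig(const std::string& config_file);                          // step 401
GraphIR ParseModel(const std::string& model_file);                          // step 402: framework parser -> IR
void InsertQuantDequant(GraphIR& graph,                                     // mixed precision pass (see above)
                        const std::map<std::string, std::string>& precision_by_op);
std::vector<int> DetectSubgraphs(const GraphIR& graph,                      // step 403: subgraph engine
                                 const std::set<std::string>& supported_op_types,
                                 size_t min_ops);
std::vector<uint8_t> CompileForTarget(const GraphIR& graph,                 // step 404: hardware adaptation
                                      const std::vector<int>& subgraph_id,  // framework emits instruction code
                                      const std::string& device_type);      // for the target hardware

std::vector<uint8_t> Deploy(const std::string& config_file, const std::string& model_file) {
  Config cfg = LoadConfig(config_file);
  GraphIR ir = ParseModel(model_file);
  InsertQuantDequant(ir, cfg.precision_by_op);
  std::vector<int> subgraph_id = DetectSubgraphs(ir, cfg.supported_op_types, cfg.min_subgraph_ops);
  return CompileForTarget(ir, subgraph_id, cfg.device_type);
}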
In some optional implementations of this embodiment, converting the sub-graph operator into instruction code executed on the target hardware corresponding to the hardware type according to the hardware adaptation framework includes: converting the sub-graph operator, through the interface corresponding to the sub-graph operator in the hardware adaptation framework, into instruction code executed on the target hardware corresponding to that interface.
In this implementation, the execution subject may convert the sub-graph operator, through the interface corresponding to the sub-graph operator in the hardware adaptation framework, into instruction code executed on the target hardware corresponding to that interface.
In some optional implementations of this embodiment, the interface includes at least one of: the system comprises a hardware management interface, a multi-hardware unified context interface, a model networking interface, a model compiling interface and a model executing interface.
In this implementation, the hardware adaptation framework module includes an inference framework adaptation layer interface, a runtime, a hardware abstraction layer standard interface, and a standard operator definition (as shown in fig. 2).
In one example, a framework adaptation layer standard interface, which adapts different deep learning inference frameworks to achieve complete decoupling from the deep learning inference framework, may include at least one of: a hardware management interface, a multi-hardware unified context interface, a model networking interface, a model compilation interface and a model execution interface.
The hardware management interface is used for querying basic hardware information, including the hardware name, vendor name, accelerator card type and hardware abstraction layer library version, and for hardware acquisition and initialization.
The multi-hardware unified context interface is used for creating unified contexts for multiple pieces of hardware. Preferably, the multi-hardware unified context interface provides parameters such as hardware operation, model compilation and execution for each hardware configuration by means of key-value strings.
The model networking interface is used for decoupling from the model expression of the deep learning inference framework, so as to establish a unified, hardware-independent intermediate representation and transform the operator and tensor objects in the inference framework's model into an internal unified expression.
The model compilation interface and the model execution interface are used for converting the intermediate representation of the model into target hardware code (i.e., instruction code) by calling the hardware vendor's software stack in the hardware abstraction layer library, and for returning the result to the deep learning inference framework after execution.
It should be noted that, in order to reduce the overhead caused by online networking (i.e., building the above graph structure) and model compilation, the built hardware model (i.e., the instruction code) may be read from a cache through the model compilation interface. The model compilation interface is, for example, of the following form:
create_program(Model* online_model, void* cache_model, ...);  // compile the incoming model intermediate representation online_model or the model cache cache_model to generate the hardware model
In this implementation, in order to create an intermediate representation that is independent of the deep learning inference framework, independent of the hardware, and shared by the runtime and the hardware abstraction layer, the operator types and parameter lists are standardized in addition to defining the data structures of the model and the operators and tensors it contains.
In this implementation, the runtime acts as a bridge between the framework adaptation layer and the hardware abstraction layer: it not only translates calls to the framework adaptation layer interface into the intermediate representation of models, operators and tensors and into calls to the hardware abstraction layer interface, but is also responsible for registering the hardware abstraction layer library and for serializing and deserializing the model cache.
It should be noted that, when the hardware model is run, the hardware model cache format and the serialization and deserialization processes need to be unified: the model cache data set through the framework adaptation layer interface is deserialized and then passed to the hardware abstraction layer, and the instruction code is restored by each hardware abstraction layer library. In this way, a unified model cache parsing and restoring process can be established for different hardware (i.e., target hardware).
In one example, a hardware abstraction layer standard interface is used to shield hardware details and provide a uniform device access interface to a runtime.
It should be noted that a hardware abstraction layer is established between the runtime and the vendor software stack. The hardware abstraction layer is composed of data structures such as the unified device interface description and the intermediate representations of the model, operators and tensors, implemented as C structs, for example as follows:
(The original publication shows the C struct definition of the hardware abstraction layer device interface here.)
The above interface definition mainly involves basic hardware-related information, such as the hardware name, vendor name, accelerator type and hardware abstraction layer interface version, and, more importantly, the definition of the hardware function call interfaces.
In one example, the open_device interface is called when the hardware is initialized at runtime, create_context is called when the hardware context is created, the create_program interface is called when a model intermediate representation or model cache data needs to be compiled to generate a hardware model, and finally execute_program is called when the hardware model is executed. Hardware vendors only need to implement these interfaces to complete the adaptation of their hardware; they do not need to understand the implementation principles of the deep learning inference framework, which can greatly reduce learning and development costs.
In some optional implementations of this embodiment, the data attribute includes: data type and/or data structure.
In this implementation, the data attributes may include: data type and/or data structure.
In one example, fp32, int8 or int16 data types can be set on a per-operator basis in the target model file.
In one example, the data structure is translated according to standard types, e.g., to normalize operator types and parameter lists.
It should be noted that the configuration file is also used to define the cache format of the target model file.
In some optional implementations of this embodiment, the data type includes at least one of: fp32, int8, int16.
In this implementation, the data type includes at least one of: fp32, int8, int16.
In one example, the subgraph engine module includes mixed precision processing, subgraph detection/fusion, and subgraph transformation/execution to transform the intermediate representation into subgraph operators.
The mixed precision processing is used for setting the quantization data type, such as fp32, int8 or int16, on a per-operator basis in the target model file according to user-defined mixed precision configuration information, with the following formats as examples:
op_type:input_name_list:output_name_list:precision_type  // operator type : list of input tensor names : list of output tensor names : precision type
op_type:input_name_list::precision_type
op_type::output_name_list:precision_type
op_type:::precision_type
For example:
conv2d:in_var0,in_var1:out_var0:int8
softmax:::fp32  // meaning that all softmax operators run at fp32 precision
In this implementation, quantization or inverse quantization operators are automatically inserted between operators of different precision in the intermediate representation.
In one example, for already quantized parameters, such as the filter of conv2d, if conv2d is forced to run on fp32 precision calculations, the filter is also restored from the quantized type to the floating point type.
In one example, subgraph detection and fusion are used for setting the device type on a per-operator basis in the target model file, according to user-defined subgraph detection configuration information.
For example, the device type configuration formats are:
op_type:input_name_list:output_name_list:device_type
op_type:input_name_list::device_type
op_type::output_name_list:device_type
op_type:::device_type
Several cases are supported, for example:
transpose:in_var0:out_var0:cpu
softmax:::cpu  // means that all softmax operators run on the CPU
In the sub-graph detection process, this makes it possible to prevent specific operators from being drawn into a hardware sub-graph, which is often used in detection-type target model files: the backbone structure in the first half of the target model file runs on the hardware accelerator, while the post-processing operators run on the CPU.
In one example, sub-graph transformation and execution convert each operator in the sub-graph into calls to the hardware networking API.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes in accordance with a computer program stored in a Read-Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502 and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 501 performs the respective methods and processes described above, such as a hardware adaptation method based on deep learning. For example, in some embodiments, the deep learning based hardware adaptation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the deep learning based hardware adaptation method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the deep learning based hardware adaptation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A hardware adaptation apparatus based on deep learning, comprising:
a deep learning inference framework module configured to obtain an intermediate representation of a graph structure based on an input target model file;
a sub-graph engine module configured to fuse a configuration file with the intermediate representation to obtain a sub-graph operator, wherein the configuration file defines a data attribute and a hardware type of the sub-graph operator;
and a hardware adaptation module configured to convert the sub-graph operator into instruction code executed on target hardware corresponding to the hardware type.
2. The apparatus according to claim 1, wherein the hardware adaptation module is specifically configured to:
and converting the sub-graph operator into instruction codes executed on target hardware corresponding to the interface through the interface corresponding to the sub-graph operator.
3. The apparatus of claim 1 or 2, wherein the interface comprises at least one of: a hardware management interface, a multi-hardware unified context interface, a model networking interface, a model compiling interface, and a model executing interface.
4. The apparatus of any of claims 1-3, wherein the data attribute comprises: a data type and/or a data structure.
5. The apparatus of claim 4, wherein the data type comprises at least one of: fp32, int8, int16.
6. A hardware adaptation method based on deep learning, comprising the following steps:
acquiring a configuration file and a target model file;
obtaining an intermediate representation of a graph structure based on the target model file by using a deep learning inference framework;
fusing the intermediate representation by using the configuration file and a sub-graph engine to obtain a sub-graph operator, wherein the configuration file defines a data attribute and a hardware type of the sub-graph operator;
and converting the sub-graph operator into instruction code executed on target hardware corresponding to the hardware type according to a hardware adaptation framework.
7. The method of claim 6, wherein said converting the sub-graph operator into instruction code executed on target hardware corresponding to the hardware type according to a hardware adaptation framework comprises:
converting the sub-graph operator, through an interface corresponding to the sub-graph operator in the hardware adaptation framework, into instruction code executed on target hardware corresponding to the interface.
8. The method of claim 6 or 7, wherein the interface comprises at least one of: a hardware management interface, a multi-hardware unified context interface, a model networking interface, a model compiling interface, and a model executing interface.
9. The method of any of claims 6-8, wherein the data attribute comprises: a data type and/or a data structure.
10. The method of claim 9, wherein the data type comprises at least one of: fp32, int8, int16.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 6-10.
12. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 6-10.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 6-10.
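
For orientation, the following is a minimal, hypothetical Python sketch of the flow recited in claims 6-10: a target model file is parsed into a graph-structured intermediate representation, a configuration file drives sub-graph fusion into a sub-graph operator, and a hardware adaptation layer converts the operator into instruction code for the configured hardware type through a per-hardware interface. All names (GraphIR, SubGraphOperator, HardwareAdapter, the "npu"/"gpu" targets, and the pseudo-instruction strings) are illustrative assumptions and are not defined by the patent.

# Minimal sketch of the claimed flow (claims 6-10); all names are assumptions.
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GraphIR:
    """Graph-structured intermediate representation of the target model."""
    nodes: List[str] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class SubGraphOperator:
    """Fused sub-graph; data attribute and hardware type come from the config file."""
    nodes: List[str]
    data_type: str      # e.g. "fp32", "int8", "int16"
    hardware_type: str  # e.g. "npu", "gpu"


def load_model_ir(model_file: str) -> GraphIR:
    """Stand-in for the deep learning inference framework: parse the target
    model file into a graph-structured intermediate representation."""
    with open(model_file) as f:
        ops = json.load(f)["ops"]
    names = [op["name"] for op in ops]
    return GraphIR(nodes=names, edges=list(zip(names, names[1:])))


def fuse_subgraph(ir: GraphIR, config: Dict) -> SubGraphOperator:
    """Sub-graph engine: fuse the IR under the configuration file, which defines
    the data attribute (data type) and hardware type of the sub-graph operator."""
    return SubGraphOperator(nodes=list(ir.nodes),
                            data_type=config.get("data_type", "fp32"),
                            hardware_type=config["hardware_type"])


class HardwareAdapter:
    """Hardware adaptation layer: one conversion interface per hardware type."""

    def __init__(self) -> None:
        self._interfaces = {"npu": self._to_npu, "gpu": self._to_gpu}

    def convert(self, op: SubGraphOperator) -> List[str]:
        # Select the interface corresponding to the sub-graph operator's
        # hardware type and lower it to instruction code for that hardware.
        return self._interfaces[op.hardware_type](op)

    def _to_npu(self, op: SubGraphOperator) -> List[str]:
        return [f"NPU_EXEC {node} {op.data_type}" for node in op.nodes]

    def _to_gpu(self, op: SubGraphOperator) -> List[str]:
        return [f"GPU_KERNEL {node} {op.data_type}" for node in op.nodes]


if __name__ == "__main__":
    # Acquire a (toy) target model file and configuration file, then run the
    # obtain-IR / fuse / convert steps of claim 6.
    with open("model.json", "w") as f:
        json.dump({"ops": [{"name": "conv2d"}, {"name": "relu"}, {"name": "fc"}]}, f)
    config = {"hardware_type": "npu", "data_type": "int8"}
    op = fuse_subgraph(load_model_ir("model.json"), config)
    print(HardwareAdapter().convert(op))

Running this sketch prints pseudo-instructions such as "NPU_EXEC conv2d int8", mirroring the acquire-files / obtain-IR / fuse / convert sequence of claim 6; a production implementation would instead dispatch through the hardware management, context, model networking, model compiling, and model executing interfaces enumerated in claims 3 and 8.
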
CN202111504826.8A 2021-12-10 2021-12-10 Hardware adaptation device and method based on deep learning Active CN114186678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111504826.8A CN114186678B (en) 2021-12-10 2021-12-10 Hardware adaptation device and method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111504826.8A CN114186678B (en) 2021-12-10 2021-12-10 Hardware adaptation device and method based on deep learning

Publications (2)

Publication Number Publication Date
CN114186678A true CN114186678A (en) 2022-03-15
CN114186678B CN114186678B (en) 2023-04-07

Family

ID=80604284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111504826.8A Active CN114186678B (en) 2021-12-10 2021-12-10 Hardware adaptation device and method based on deep learning

Country Status (1)

Country Link
CN (1) CN114186678B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166942A1 (en) * 2011-12-22 2013-06-27 International Business Machines Corporation Unfusing a failing part of an operator graph
CN106155755A (en) * 2015-06-03 2016-11-23 上海红神信息技术有限公司 Program compiling method and compiler
CN105912330A (en) * 2016-04-07 2016-08-31 北京北方微电子基地设备工艺研究中心有限责任公司 Hardware device control method and device
CN106650922A (en) * 2016-09-29 2017-05-10 清华大学 Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
CN110321999A (en) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Neural computing figure optimization method
US20190377819A1 (en) * 2018-06-12 2019-12-12 Bank Of America Corporation Machine learning system to detect, label, and spread heat in a graph structure
CN110766147A (en) * 2018-07-25 2020-02-07 赛灵思公司 Neural network compiler architecture and compiling method
US20200034710A1 (en) * 2018-07-26 2020-01-30 DeepScale, Inc. Optimizing neural network structures for embedded systems
CN111104120A (en) * 2018-10-29 2020-05-05 赛灵思公司 Neural network compiling method and system and corresponding heterogeneous computing platform
CN112149812A (en) * 2019-06-28 2020-12-29 英特尔公司 Hardware-independent deep neural network compiler
CN112182635A (en) * 2019-07-03 2021-01-05 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing joint modeling
WO2021093654A1 (en) * 2019-11-15 2021-05-20 中兴通讯股份有限公司 Daughter card initialization method, electronic apparatus, and storage medium
CN111427845A (en) * 2020-02-28 2020-07-17 中国电子科技集团公司第十五研究所 Interactive modeling analysis operator data exchange method
CN113449858A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Processing method of neural network model and related equipment
CN111782181A (en) * 2020-06-28 2020-10-16 北京百度网讯科技有限公司 Code generation method and device, electronic equipment and storage medium
CN112633502A (en) * 2020-12-29 2021-04-09 北京百度网讯科技有限公司 Cross-platform execution method and device of deep learning model and electronic equipment
CN113010469A (en) * 2021-03-18 2021-06-22 恒睿(重庆)人工智能技术研究院有限公司 Image feature extraction method, device and computer-readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MINGZHEN LI et al.: "The Deep Learning Compiler: A Comprehensive Survey" *
TIANQI CHEN et al.: "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning", Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation *
JIAO Run: "A Transformer Architecture Search Algorithm for Optimizing Hardware Latency", China Master's Theses Full-text Database (Information Science and Technology) *
JIAO Yuming et al.: "Design and Implementation of a Compiler Based on a Dedicated Convolutional Neural Network Accelerator", Computer Applications *
HU Yong et al.: "Design and Implementation of a Geometric Algebra GIS Computing Engine" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149856A (en) * 2023-01-09 2023-05-23 中科驭数(北京)科技有限公司 Operator computing method, device, equipment and medium
CN116579400A (en) * 2023-05-19 2023-08-11 北京百度网讯科技有限公司 Quantization method, data processing method and device of deep learning model
CN116579400B (en) * 2023-05-19 2024-02-23 北京百度网讯科技有限公司 Quantization method, data processing method and device of deep learning model

Also Published As

Publication number Publication date
CN114186678B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN114186678B (en) Hardware adaptation device and method based on deep learning
Jin et al. Compiling onnx neural network models using mlir
US10269087B2 (en) Language translation using preprocessor macros
CN113449858A (en) Processing method of neural network model and related equipment
CN110262783B (en) Interface generation method and device and terminal equipment
CN113656590B (en) Industry map construction method and device, electronic equipment and storage medium
CN112527262B (en) Automatic vector optimization method for non-uniform width of deep learning framework compiler
CN112540767B (en) Program code generation method and device, electronic equipment and storage medium
CN115240048A (en) Deep learning operator positioning fusion method and device for image classification
Koranne et al. Boost c++ libraries
Hoffmann et al. Defining models-meta models versus graph grammars
CN111309332A (en) File content on-demand loading method and device, electronic equipment and storage medium
CN113705798A (en) Processing unit, computing device and computation graph optimization method of deep learning model
CN112559760B (en) CPS (cyber physical system) resource capacity knowledge graph construction method for text description
CN112633502B (en) Cross-platform execution method and device of deep learning model and electronic equipment
CN115469931B (en) Instruction optimization method, device, system, equipment and medium of loop program
CN109597611B (en) Front-end data flow control component development system, method, device and storage medium
CN116560666A (en) AI front end unified computing method, device and medium based on multi-level code generation
CN113360490B (en) Data processing method, device, apparatus, medium and program product
CN114168151A (en) Container-based program compiling method and device, electronic equipment and storage medium
CN111221532A (en) Method and device for generating dynamic link library
Maan et al. Parsing C and Fortran code to SymPy Expressions
Ammar et al. Visualizing a hierarchy of performance models for software systems
Park et al. Interworking technology of neural network and data among deep learning frameworks
Fayzrakhmanov WPPS: A novel and comprehensive framework for web page understanding and information extraction

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant