CN113220457B - Model deployment method, model deployment device, terminal equipment and readable storage medium

Model deployment method, model deployment device, terminal equipment and readable storage medium

Info

Publication number
CN113220457B
CN113220457B (application CN202110567899.5A)
Authority
CN
China
Prior art keywords
node
model
operator
scheme
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110567899.5A
Other languages
Chinese (zh)
Other versions
CN113220457A (en)
Inventor
李发兵
林伟伟
李想
毛兴中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixin Huaxi Information Technology Co ltd
Original Assignee
Shenzhen Zhixin Huaxi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhixin Huaxi Information Technology Co ltd
Priority to CN202110567899.5A
Publication of CN113220457A
Application granted
Publication of CN113220457B
Legal status: Active (current)

Classifications

    • G06F 9/5072 Allocation of resources: partitioning or combining of resources; grid computing
    • G06F 9/5044 Allocation of resources to service a request, the resource being a machine, considering hardware capabilities
    • G06F 9/505 Allocation of resources to service a request, the resource being a machine, considering the load
    • G06N 3/045 Neural network architectures: combinations of networks
    • G06N 3/048 Neural network architectures: activation functions
    • G06N 3/08 Neural networks: learning methods
    • G06N 3/105 Neural network interfaces, programming languages or SDKs: shells for specifying net layout
    • G06F 2209/5017 Indexing scheme relating to resource allocation: task decomposition

Abstract

The invention discloses a model deployment method, a model deployment device, terminal equipment and a readable storage medium. The method comprises the following steps: acquiring an operator model set of a deep neural network model to be deployed; performing operator fusion or operator segmentation on the operator models in the set that meet preset conditions, to obtain a processed operator model set; acquiring the running time of each operator model in the processed set on each device in the device set on which the model is to be deployed, to obtain a running time set; based on the running time set, combining the operator models in the processed set using a preset search method to obtain a sub-model set; and, based on the sub-model set, deploying the deep neural network model to be deployed on the device set. The invention is fully compatible with devices of different computing power and improves operation efficiency and overall throughput.

Description

Model deployment method, model deployment device, terminal equipment and readable storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to the field of model deployment, and in particular relates to a model deployment method, a model deployment device, terminal equipment and a readable storage medium.
Background
Machine Learning (ML) is one of the fastest-growing areas of computer science today. A typical machine learning technique trains a statistical model for a specific application on a large pre-collected data set, updating the model parameters (also called "weights") until convergence; the trained model is then used for inference, i.e., predicting results on new data. Deep learning based on neural networks is the most widely used ML algorithm because of its excellent results. Neural network models are multi-layer directed acyclic graphs (DAGs); a model typically consists of operations such as convolution, matrix multiplication, pooling, and batch normalization, connected in linear chains or in more complex patterns (e.g., branches and residual connections). The generalization ability and accuracy of neural network models generally improve with deeper topologies and larger network layers, as in ResNet and VGG, but this also incurs higher execution latency.
Researchers continually introduce complex network models to achieve better generalization ability and accuracy; as the number of model parameters grows, the computational complexity becomes ever higher. For example, the GPT-3 model proposed by OpenAI has a staggering 175 billion parameters, and the model occupies more than 700 GB of storage space. Researchers have proposed various system-level optimization schemes to address the performance challenges posed by larger and more complex models. On the hardware side, GPUs and dedicated accelerators (e.g., Google TPUs) support faster computation; on the software side, many frameworks bridge the gap between productivity-centric high-level interfaces and performance-oriented low-level implementations, including TensorFlow, PyTorch, MXNet, and TVM.
However, the above studies do not solve the deployment problem of large-parameter models well. Specifically, most prior work assumes by default that the machine running the deep learning model has sufficient memory and disk, so that the model can run directly on a single machine to produce its output. On cloud server clusters, distributed storage is generally used to obtain better file read/write speed and throughput, but more time is wasted on transmission, and the computing capability of all devices cannot be fully exploited. With the development of deep learning, researchers have found that computational resources are always limited; in addition, the device responsible for computation generally needs strong computing power, while the device responsible for reading and writing leaves its computing power idle most of the time, which both causes huge waste and makes the throughput rate difficult to improve.
In view of the foregoing, in order to enable deep learning models to run in resource-limited scenarios, a new method, system, device and medium for deploying deep learning models under limited resources are needed.
Disclosure of Invention
The invention aims to provide a model deployment method, a model deployment device, terminal equipment and a readable storage medium, so as to solve one or more of the above technical problems. The invention is fully compatible with devices of different computing power and improves operation efficiency and overall throughput.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention discloses a model deployment method, which comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models, and acquiring a running time set;
based on the running time set, combining operator models in the processed operator model set using a preset search method to obtain a sub-model set;
and based on the sub-model set, deploying the deep neural network model to be deployed on the equipment set to complete model deployment.
The invention further improves that the step of acquiring the operator model set of the deep neural network model to be deployed specifically comprises the following steps:
and selecting a single-layer neural network as a basic granularity, and dividing a deep neural network model to be deployed to obtain an operator model set.
The invention further improves that the operator model meeting the preset condition in the operator model set is subjected to operator fusion or operator segmentation treatment, and the step of obtaining the treated operator model set specifically comprises the following steps:
comparing the parameter size of each operator model in the operator model set with the memory of the device with the smallest memory in the device set; performing operator segmentation on any operator model whose parameter size is larger than that memory, until its parameter size is smaller than 1/2 of the memory; and performing operator fusion on any operator model whose parameter size is less than 1/10 of the memory, until the parameter size is greater than 1/10 and less than 1/2 of the memory.
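For illustration, a minimal Python sketch of this preprocessing rule is given below; it is a sketch under the thresholds stated above, not the patented implementation, and all names (OperatorModel, split_operator, preprocess) are hypothetical.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class OperatorModel:
    name: str
    param_bytes: int          # size of the operator's parameters

def split_operator(op: OperatorModel, upper: int) -> List[OperatorModel]:
    """Split an oversized operator into parallel shards, each below `upper`."""
    n = -(-op.param_bytes // (upper - 1))     # ceil division: number of shards
    base, rem = divmod(op.param_bytes, n)
    return [OperatorModel(f"{op.name}_part{i}", base + (1 if i < rem else 0))
            for i in range(n)]

def preprocess(ops: List[OperatorModel], min_mem: int) -> List[OperatorModel]:
    """Apply the 1/2 and 1/10 thresholds of the smallest device memory."""
    upper, lower = min_mem // 2, min_mem // 10
    out: List[OperatorModel] = []
    for op in ops:
        if op.param_bytes > min_mem:          # too big: segment into shards < 1/2
            out.extend(split_operator(op, upper))
        elif (op.param_bytes < lower and out
              and out[-1].param_bytes + op.param_bytes < upper):
            prev = out.pop()                  # tiny: fuse into the previous operator
            out.append(OperatorModel(prev.name + "+" + op.name,
                                     prev.param_bytes + op.param_bytes))
        else:
            out.append(op)
    return out

print([(o.name, o.param_bytes) for o in
       preprocess([OperatorModel("conv", 3_000), OperatorModel("bn", 40),
                   OperatorModel("fc", 900)], min_mem=2_000)])
```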
The invention is further improved in that the preset search method is a backtracking search method;
when combining operator models in the processed operator model set with the backtracking search method, a high-throughput-priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high-throughput-priority scheme; the high-throughput-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, the topology graph is queried for each Node_i to obtain the branch nodes connected to Node_i, yielding a search tree, and a DFS traversal of the search tree produces the segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branches converge is found, and schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, giving x final schemes; the x schemes are traversed, and any scheme requiring more devices than the number D of available devices is removed, giving the surviving scheme set;
the surviving scheme set is traversed: for each scheme, the cost of the operator models corresponding to each sub-model on each device is calculated; the costs on the different devices are combined, the maximum device cost is multiplied by the number of consumed devices to obtain the total cost, and the combination with the smallest total cost is found as the optimal combination and recorded as the division scheme; the minimum costs of all division schemes are compared, and the division scheme with the smallest minimum cost is taken as the final division scheme.
The invention is further improved in that, when combining operator models in the processed operator model set with the backtracking search method, a low-service-delay-priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high-throughput-priority scheme; the low-service-delay-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, the topology graph is queried for each Node_i to obtain the branch nodes connected to Node_i, yielding a search tree, and a DFS traversal of the search tree produces the segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branches converge is found, and schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, giving x final schemes; the x schemes are traversed, and any scheme requiring more devices than the number D of available devices is removed, giving the surviving scheme set;
the surviving scheme set is traversed: for each scheme, the cost of the operator models corresponding to each sub-model on each device is calculated; the costs on the different devices are combined and accumulated to obtain the total cost, and the combination with the smallest total cost is found as the optimal combination and recorded as the division scheme; the minimum costs of all division schemes are compared, and the division scheme with the smallest minimum cost is taken as the final division scheme.
A further improvement of the invention is that all devices in the device set for deploying the model are identical, and the preset search method is a dynamic programming search method;
when combining operator models in the processed operator model set with the dynamic programming search method, a low-service-delay-priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high-throughput-priority scheme; the low-service-delay-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, with their connectivity represented by an n×n matrix M; each branching node has several outgoing nodes, and each converging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}} to find the optimal combination that minimizes the delay of this branch; the optimization scheme is solved according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and the sub-model combination of Low_Cost_j is recorded; after the traversal, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the record; Number_{i..j} denotes the number of machines consumed between Node_i and Node_j;
wherein, low_cost is adopted i The combination mode with the lowest total delay in all sub-model combination modes from the 0 th operator to the i th operator is represented as a state transition equation:
Low_Cost n =min{Low_Cost 0 +L compute (0..n)+L communicate (0),Low_Cost 1 +L compute (1..n)+L communicate (1),...,Low_Cost n-1 +L compute (n-1..n)+L communicate (n-1)}
Low_Cost 0 =0,
L communicate (0)=0,
L communicate =Data_Size*Coefficient,
wherein L is compute (i..j) calculating delay, cost, for a sub-model formed by combining the ith operator to the jth operator i For the calculation delay of the ith operator, L communicate Referring to the delay of Data transmission, data_size refers to the Size of Data to be transmitted, and Coefficient is a constant that varies according to the bandwidth of the network.
The invention is further improved in that, when combining operator models in the processed operator model set with the dynamic programming search method, the high-throughput-priority scheme is adopted when the actual running time is greater than or equal to its theoretical delay; the high-throughput-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, with their connectivity represented by an n×n matrix M; each branching node has several outgoing nodes, and each converging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}} to find the optimal combination that minimizes the delay of this branch; the optimization scheme is solved according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and the sub-model combination corresponding to Low_Cost_j is recorded; after Node_n is traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the recorded sub-model combinations.
The invention provides a model deployment device, comprising:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
the running time set acquisition module is used for acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to acquire a running time set;
the sub-model set acquisition module is used for combining the operator models in the processed operator model set by adopting a preset search method according to the running time set to obtain a sub-model set;
The deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set, so as to complete model deployment.
An electronic apparatus of the present invention includes: a processor; and a memory for storing computer program instructions; when loaded and executed by the processor, the computer program instructions perform the model deployment method according to any of the above aspects of the present invention.
A readable storage medium of the present invention stores computer program instructions that, when loaded and executed by a processor, perform the model deployment method of any of the above-described aspects of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
in the model deployment method provided by the invention, the large-parameter model to be deployed is divided into a number of small operator models, and sub-models are obtained from these operator models with a preset planning method, so that the sub-models fit the memory capacities of computing devices of different computing power, non-compute overhead is reduced, and the capability of existing computing devices is fully exploited. The model deployment method can deploy a model with a large parameter count on devices of limited computing power. It should be noted that deploying the model on different devices naturally introduces communication overhead, but the introduced communication overhead is far smaller than the computation overhead of the model itself; for a cloud computing platform, the communication overhead is likewise far smaller than the loading overhead of block-loading the parameters from a hard disk.
In an embodiment of the further improved model deployment method, a concrete model segmentation algorithm is provided that can generate configuration schemes for both high-load and low-load conditions, called the low-service-delay-priority scheme and the high-throughput-priority scheme respectively; it also supports presetting device parameters in advance and adjusting the scheme to the devices. Once a scheme is generated, the service provider can flexibly adjust it as required.
In an embodiment of the further improved model deployment method, for the common scenario where the computing power of all computing devices is almost the same, a dynamic programming algorithm of lower complexity is provided, and a globally optimal device segmentation scheme can be obtained more efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it will be apparent to those of ordinary skill in the art that the following drawings show some embodiments of the invention, and that other drawings may be derived from them without undue effort.
FIG. 1 is a schematic block flow diagram of a model deployment method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an operator fusion in an embodiment of the invention;
FIG. 3 is a flow chart of low service latency priority in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart illustrating high throughput prioritization in an embodiment of the present invention;
FIG. 5 is a schematic diagram of deployment results in an embodiment of the present invention.
Detailed Description
In order to make the purposes, technical effects and technical solutions of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it will be apparent that the described embodiments are some of the embodiments of the present invention. Other embodiments, which may be made by those of ordinary skill in the art based on the disclosed embodiments without undue burden, are within the scope of the present invention.
The model deployment method of the embodiment of the invention comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
Acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models, and acquiring a running time set;
based on the running time set, combining operator models in the processed operator model set by adopting a preset searching method to obtain a sub model set;
and based on the sub-model set, deploying the deep neural network model to be deployed on the equipment set to complete model deployment.
Machine learning has become the most important and popular technique in modern data-driven applications, and recent advances in deep learning have achieved unprecedented success in challenging tasks such as image/video classification and natural language processing. In recent years, researchers have continually introduced complex network models to achieve better generalization ability and accuracy; with this, however, the number of model parameters has grown and the computational complexity has become ever higher. The GPT-3 model proposed by OpenAI has a staggering 175 billion parameters, and the model occupies more than 700 GB of storage space. With the development of deep learning, researchers have found that computational resources are always limited. In order to enable deep learning models to run in resource-limited scenarios, the embodiments of the present invention provide a new model deployment method, which divides the model to be deployed into small units and deploys them on devices with weak computing capability, bringing the computing power of those devices into play while reducing read/write overhead. Specifically, the embodiments of the invention disclose a flexible deep learning model deployment method that makes a deep learning model fully compatible with devices of different computing power and fully exploits the devices' computing power, so as to obtain optimal operation efficiency and overall throughput; the invention can fully account for the differences in computing power among powerful devices in cloud computing, weaker devices in edge computing, and even embedded devices, and can obtain the globally optimal throughput.
In the embodiment of the invention, the overhead mainly comes from two parts, namely the execution time of the model and the communication time between devices; wherein typically the execution time is much longer than the communication time.
In the embodiment of the present invention, two optimization scenarios are set, including:
(1) In the low-frequency request scene, the model deployment method should provide lower prediction delay for a single request of a user;
(2) In the high-frequency request scene, the model deployment method should optimize throughput rate as a whole and provide services for more users in unit time under the condition of ensuring that single prediction delay is acceptable.
Notably, when optimizing for throughput, all devices form a relatively complex pipeline, and throughput depends on the slowest stage of pipeline; thus, optimization schemes in high throughput scenarios will also generally tend to make the load of the individual devices more balanced.
In the embodiment of the invention, once the available resources and the model to be deployed are determined, a low-service-delay-priority scheme and a high-throughput-priority scheme can be generated, and the corresponding theoretical delays Time_latency and Time_throughput are computed at the same time. In actual operation, the low-service-delay-priority scheme is adopted first; when the actual running time of a user request becomes longer than the theoretical delay Time_throughput of the high-throughput-priority scheme, the system switches to the high-throughput-priority scheme and records the current request count as Number_flag. Definition: when the number of user requests Number_request > Number_flag, the environment is high-load and the high-throughput (throughput-first) scheme is adopted; when Number_request < Number_flag, the environment is low-load and the low-service-delay (latency-first) scheme is adopted.
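This switching logic can be sketched as follows; it is a simplified illustration of the rule described above, with hypothetical names (SchemeSelector, choose) and the simplifying assumption that the request counter only grows.
```python
class SchemeSelector:
    """Switch between the latency-first and throughput-first schemes.

    time_throughput is the theoretical delay Time_throughput of the
    high-throughput-priority scheme, computed offline once the model
    and the available resources are fixed.
    """
    def __init__(self, time_throughput: float):
        self.time_throughput = time_throughput
        self.number_flag = None            # request count when the switch occurred
        self.request_count = 0

    def choose(self, observed_latency: float) -> str:
        self.request_count += 1
        if self.number_flag is None and observed_latency >= self.time_throughput:
            self.number_flag = self.request_count     # record Number_flag
        if self.number_flag is not None and self.request_count >= self.number_flag:
            return "throughput-first"      # high-load environment
        return "latency-first"             # low-load environment

sel = SchemeSelector(time_throughput=0.200)
for latency in [0.120, 0.150, 0.210, 0.190]:
    print(sel.choose(latency))
```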
In the embodiment of the invention, there are n operator models in total; the computation delay of the i-th operator model is L_comp,i, and the communication delay between the i-th and (i+1)-th operator models is L_comm,i.
The optimization function under the low-delay scheme is min{ Σ_i (L_comp,i + L_comm,i) }, meaning that the operator models are combined into different sub-models so as to find the combination with the lowest overall delay; since under the high-throughput-priority scheme the devices form a relatively complex pipeline and the throughput depends on the slowest pipeline stage, the optimization function in that case is min{ n × max_i{L_comp,i, L_comm,i} }, i.e., among all operator-model combinations, take the one that minimizes the product of the number of sub-models and the delay of the slowest sub-model.
Referring to fig. 1, a model deployment method in an embodiment of the present invention is used for deploying a deep neural network model to be deployed on a set of devices defined by resources for deploying the model, and specifically includes the following steps:
screening to obtain the operator model set of the deep neural network model to be deployed, and post-processing it to obtain the processed operator model set, wherein the post-processing comprises operator fusion or operator segmentation of operator models meeting a preset condition (based on the operator model's parameter size and the memory of the smallest-memory device); running each operator model on each device in the device set and collecting delay statistics to obtain the running time set; based on the running time set, combining operator models in the processed operator model set with a preset search method to obtain the sub-model set; and, based on the sub-model set, deploying the deep neural network model to be deployed on the device set to complete model deployment.
Deep learning models have an obvious characteristic: easily distinguishable operators form the overall model according to a specific topology. If each operator is regarded as a node and the connections between them as edges, the topology of a deep learning model can be viewed as a typical directed graph, where nodes and edges each carry values representing the overhead of computation and communication. Because of differences in memory capacity, different devices can carry models of different sizes; for devices supporting virtual memory, a larger model can be run through virtual memory, but this generates a large number of "page fault interrupts", which brings huge overhead. So if a sub-model contains too many nodes, the overhead eventually rises because of page faults.
In the method provided by the embodiment of the invention, the nodes and the edges are divided into the sub-models, so that the total cost is minimum.
In the embodiment of the application, the specific steps of granularity selection include: when partitioning a deep learning model, too coarse a granularity may miss potentially optimal partitions and may make a partition too large to fit in a device's memory, while too fine a granularity slows down the search for the optimal strategy. In a deep learning model it is natural to choose a single-layer neural network as the basic granularity; each resulting sub-model partition then contains one or more network layers. In some cases, however, special adjustments are required.
In the embodiment of the application, the specific steps of operator fusion include: in addition to computation-intensive network layers such as convolution and matrix multiplication, neural networks use small network layers with little or no weight, for example merge layers, batch normalization, and ReLU activation; these layers contribute little to memory usage, so they are fused with their neighboring convolution or matrix-multiplication layers to reduce the number of basic units used for partitioning. Such layer fusion is a performance optimization commonly used in many deep learning frameworks, such as TVM and TensorFlow. As shown in fig. 2, Conv denotes a convolution layer, BN a batch-normalization layer, and ReLU the rectified linear unit.
In the embodiment of the application, the steps for handling large-parameter operators specifically include: an operator with too many parameters is split into several parallel operators to reduce the parameter count of each single operator, and the outputs of these operators can be merged so as to be equivalent to the output of the original operator. Based on empirical preference, letting SM denote the memory size of the smallest-memory device among all devices to be deployed on, operators whose parameter size exceeds 0.5 SM are generally split until each split operator is smaller than 0.5 SM.
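The equivalence between a split operator and the original can be illustrated with a small, self-contained sketch, assuming a fully connected operator; split_linear is a hypothetical helper, not the patent's code. Splitting the weight matrix column-wise yields parallel operators whose concatenated outputs match the original exactly.
```python
import numpy as np

def split_linear(weight: np.ndarray, parts: int):
    """Split a fully connected operator's weight matrix column-wise.

    Each shard is an independent parallel operator computing a slice of
    the output; concatenating the slices reproduces the original output.
    """
    return np.array_split(weight, parts, axis=1)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024))       # an operator too large for one device
x = rng.standard_normal(256)

shards = split_linear(W, parts=4)                    # four parallel operators
y_split = np.concatenate([x @ w for w in shards])    # merge the shard outputs
assert np.allclose(y_split, x @ W)                   # equivalent to the original
print(y_split.shape)
```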
In the embodiment of the application, the specific steps of branch processing include: deep learning models such as ResNet, Inception and DenseNet employ complex topologies that are not simple linear sequences but contain branches. These branches are marked individually in the embodiments of the invention. In terms of computing performance, such parallel branch units are considered the same pipeline stage, and their delay depends on the slowest one.
In the embodiment of the invention, in order to obtain the optimal partitioning scheme, the operator computation delay L_compute and communication delay L_communicate must be obtained accurately, so as to learn which combinations of operators (i.e., sub-models) may run on the same device. There is communication overhead between sub-models, but none within a sub-model. Typically an operator running on a device has two delays: a delay under low page-fault pressure and a delay under high page-fault pressure.
In the embodiment of the invention, for operator delay calculation, if the total memory usage of a sub-model exceeds the available memory, the operators in that sub-model will suffer severe page-fault interrupts. Therefore, the embodiments of the invention search for the optimal operator combination scheme using operator delay statistics together with sub-model delay prediction.
In the embodiment of the invention, the operator delay statistics step comprises: running each operator model on a device to obtain its delay under low page-fault pressure and its delay under high page-fault pressure. The performance of each operator is compiled and measured separately in both cases, using a simple stand-alone memory expansion module that occupies a large portion of the memory space: when the memory expansion module is running, the measured operator running time corresponds to the delay under high page-fault pressure; when the memory expansion module is closed, the running time with no page faults corresponds to the delay under low page-fault pressure.
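A minimal sketch of this measurement procedure is shown below; the memory expansion module is emulated here simply by holding a large byte buffer, and in practice the buffer would have to approach the physical memory size to actually force paging. All function names are hypothetical.
```python
import time

def measure_latency(run_operator, warmup: int = 3, repeats: int = 10) -> float:
    """Median wall-clock latency of one operator invocation."""
    for _ in range(warmup):
        run_operator()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_operator()
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[len(samples) // 2]

def profile_operator(run_operator, hog_bytes: int):
    """Measure delay under low and high page-fault pressure.

    The memory expansion module is emulated by holding a large buffer
    while the operator runs; to really force paging, hog_bytes must
    approach the physical memory size of the device.
    """
    low = measure_latency(run_operator)          # expansion module off
    hog = bytearray(hog_bytes)                   # expansion module on
    hog[::4096] = b"\x01" * len(hog[::4096])     # touch every page
    high = measure_latency(run_operator)
    del hog
    return low, high

demo_op = lambda: sum(i * i for i in range(200_000))  # stands in for an operator
print(profile_operator(demo_op, hog_bytes=64 << 20))
```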
In the embodiment of the invention, the sub-model delay estimation step specifically comprises: starting from the start node, generate a sub-model and calculate the memory usage of each sub-model in the search space. Deep learning models have very regular program semantics: the memory layout consists mainly of model weights, intermediate results and input/output, with the model weights taking the largest share. For a given model the memory occupied by the weights is fixed, and the memory occupied by intermediate results can be derived from the inputs and outputs of each operator. Sub-model delay estimation then proceeds as follows: for any given combination of operators, predict its memory usage, compare it with the device memory size to determine whether a large number of page faults would occur, accumulate the corresponding operator delays to obtain the sub-model delay, and record it. Repeat until the search space is traversed; finally, sum the sub-model delays into total delays and select the scheme that meets the optimization target.
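The estimation rule (compare predicted memory usage with device memory, then pick the low- or high-page-fault delay) can be sketched as follows, assuming each operator has already been profiled as described above; the dictionary fields are hypothetical.
```python
def estimate_submodel_delay(ops, device_mem: int) -> float:
    """Estimate the delay of a sub-model given per-operator profiles.

    Each op dict holds 'mem' (weights + intermediates, in bytes) and the
    two measured delays 'lat_low' / 'lat_high' (low / high page-fault
    pressure). If the accumulated memory exceeds the device memory, the
    sub-model is assumed to run under heavy page faulting.
    """
    total_mem = sum(op["mem"] for op in ops)
    key = "lat_high" if total_mem > device_mem else "lat_low"
    return sum(op[key] for op in ops)

# a toy sub-model of three profiled operators on a 1 GiB device
profiled = [
    {"mem": 300 << 20, "lat_low": 0.010, "lat_high": 0.045},
    {"mem": 500 << 20, "lat_low": 0.020, "lat_high": 0.090},
    {"mem": 400 << 20, "lat_low": 0.015, "lat_high": 0.060},
]
print(estimate_submodel_delay(profiled, 1 << 30))  # 1.2 GiB > 1 GiB -> 0.195
```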
In the embodiment of the invention, based on the measured normal running cost of each operator on each device and its running cost when a large number of page faults occur, a scheme that fits the requirements is selected from the search space. The topology of current deep learning models is a directed acyclic graph (DAG); each operator is hereinafter referred to as a node, with Node_i denoting the i-th operator. To obtain all possible sub-model combinations, the method searches by backtracking, representing the search space as a search tree.
Referring to fig. 3, in the embodiment of the invention, assume the optimization target of the current scheme is minimum delay; then, under the current device resources, only the total delay of the model division needs to be minimized. The steps are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n. Each branching node has several outgoing nodes and each converging node has several incoming nodes, shown in the topology graph as two or more edges;
Step two: from Node_1 up to Node_n, query the topology graph for each Node_i to obtain the branch nodes connected to it; for each branch node Node_{i+1}, generate two branches, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models;
Step three: perform a DFS traversal of the search tree generated in step two to obtain the segmentation schemes;
Step four: for Node_i, if it has multiple branch nodes, find the node Node_j at which all branches converge; schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, finally yielding x schemes.
Step five: traverse the schemes; for the a-th scheme Plan_a, if the number of devices it requires D_a > D (D being the number of available devices), remove the scheme.
Step six: traverse the surviving schemes; for scheme Plan_a, for each sub-model G_a in the scheme, calculate the cost Cost_ab of its corresponding operators on each device Device_b. Combine the costs Cost_ab of Plan_a on the different devices and accumulate them to obtain the total cost Total_Cost; find the combination that minimizes Total_Cost as the optimal combination of Plan_a, denote it Opti_Plan_a as the final partitioning scheme of Plan_a, and denote its minimum cost Low_Cost_a.
Step seven: compare the Low_Cost values of all x Opti_Plan_a, find the candidate partitioning scheme Opti_Plan with the smallest Low_Cost, and output it as the optimal partitioning scheme.
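A compact sketch of steps one to seven for a purely linear operator chain is given below; it enumerates all cut patterns (the two search-tree branches of steps two and three), filters by device count (step five), and evaluates either the latency objective or the throughput objective of the fig. 4 variant (step six). Device heterogeneity, branch merging (step four) and memory constraints are omitted for brevity, and all names (enumerate_partitions, best_plan, comm) are hypothetical.
```python
from itertools import product

def enumerate_partitions(n_ops: int):
    """All ways to cut a linear operator chain into consecutive sub-models.

    cut[i] == True means operator i and operator i+1 fall into different
    sub-models (the two search-tree branches of steps two and three).
    """
    for cuts in product([False, True], repeat=n_ops - 1):
        parts, part = [], [0]
        for i, cut in enumerate(cuts):
            if cut:
                parts.append(part)
                part = []
            part.append(i + 1)
        parts.append(part)
        yield parts

def best_plan(op_costs, n_devices: int, comm: float = 0.5, objective: str = "latency"):
    """Steps five to seven on identical devices.

    op_costs[i] is the delay of operator i; comm is a uniform transfer
    delay added per cut. The throughput objective multiplies the slowest
    stage by the number of stages, as in the fig. 4 variant.
    """
    best = None
    for parts in enumerate_partitions(len(op_costs)):
        if len(parts) > n_devices:                      # step five: too many devices
            continue
        stage = [sum(op_costs[i] for i in p) for p in parts]
        if objective == "latency":
            total = sum(stage) + comm * (len(parts) - 1)
        else:
            total = max(stage) * len(parts)             # slowest stage x devices
        if best is None or total < best[0]:
            best = (total, parts)
    return best

print(best_plan([3.0, 1.0, 4.0, 1.0, 5.0], n_devices=3, objective="throughput"))
```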
The preferred minimum-delay scheme for identical devices in the embodiment of the invention comprises: when the device configurations in the application scenario are completely the same, an optimal deep learning model segmentation scheme can be obtained with a dynamic programming algorithm of lower complexity. Let Low_Cost_i denote the combination with the lowest total delay among all sub-model combinations of operators 0 through i. The state transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) }
For the n-th operator, the optimum is always derived recursively from the optimal cases of the first n-1 operator combinations. Define Low_Cost_0 = 0, L_communicate(0) = 0, and L_communicate = Data_Size × Coefficient, where L_compute(i..j) is the computation delay of the sub-model formed by combining operators i through j, Cost_i is the computation delay of the i-th operator, L_communicate is the data-transmission delay, Data_Size is the size (in MB) of the data to be transmitted, and Coefficient is a constant that varies with the network bandwidth (1 Gbps Ethernet peaks at about 8 ms/MB). A Low_Cost table stores the computed Low_Cost_i, and a Position table records, for each Node_j, the position i selected for Low_Cost_j.
In the above embodiment of the invention, the steps for obtaining the optimal deep learning model segmentation scheme with the dynamic programming algorithm are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n. Each branching node has several outgoing nodes and each converging node has several incoming nodes.
Step two: starting from Node_1, for each Node_i, if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{max{Low_Cost_{i..j}}} (the minimum delay depends on the branch with the largest delay), finding the optimal combination that minimizes the delay of this branch.
Step three: for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination of Low_Cost_j.
Step four: after the nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the record.
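A minimal sketch of this dynamic program for a linear chain of identical devices is given below, implementing Low_Cost_j = min_i{ Low_Cost_i + L_compute(i..j) + L_communicate(i) } together with the Low_Cost and Position tables; the function name and toy inputs are hypothetical, and branch recursion is omitted.
```python
def dp_min_delay(op_delay, comm_delay):
    """DP for the minimum-total-delay partition of a linear chain.

    op_delay[k] is the compute delay of operator k; comm_delay[i] is the
    transfer delay of the data crossing a cut placed after operator i,
    with comm_delay[0] == 0.
    """
    n = len(op_delay)
    prefix = [0.0]
    for d in op_delay:
        prefix.append(prefix[-1] + d)            # prefix sums give L_compute(i..j)
    low_cost = [0.0] * (n + 1)                   # the Low_Cost table
    position = [0] * (n + 1)                     # the Position table
    for j in range(1, n + 1):
        best, arg = float("inf"), 0
        for i in range(j):
            cost = low_cost[i] + (prefix[j] - prefix[i]) + comm_delay[i]
            if cost < best:
                best, arg = cost, i
        low_cost[j], position[j] = best, arg
    cuts, j = [], n                              # walk Position back to the cuts
    while j > 0:
        cuts.append((position[j] + 1, j))        # sub-model = operators i+1..j
        j = position[j]
    return low_cost[n], cuts[::-1]

print(dp_min_delay([3.0, 1.0, 4.0, 1.0, 5.0], [0.0, 0.5, 0.5, 0.5, 0.5]))
```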
Referring to fig. 4, in the embodiment of the invention, assume the current scheme's optimization target is the highest throughput; the delay then depends on the device with the longest running time, so the delays of the sub-models on the individual devices should be balanced as far as possible. The steps are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n. Each branching node has several outgoing nodes and each converging node has several incoming nodes, shown in the topology graph as two or more edges;
Step two: from Node_1 up to Node_n, query the topology graph for each Node_i to obtain the branch nodes connected to it; for each branch node Node_{i+1}, generate two branches, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models;
Step three: perform a DFS traversal of the search tree generated in step two to obtain the segmentation schemes;
Step four: for Node_i, if it has multiple branch nodes, find the node Node_j at which all branches converge; schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, finally yielding x schemes.
Step five: traverse the schemes; for the a-th scheme Plan_a, if the number of devices it requires D_a > D (D being the number of available devices), remove the scheme.
Step six: traverse the surviving schemes; for scheme Plan_a, for each sub-model G_a in the scheme, calculate the cost Cost_ab of its corresponding operators on each device Device_b. Combine the costs Cost_ab of Plan_a on the different devices, multiply the maximum device cost Cost_ab by the number D of consumed devices to obtain the total cost Total_Cost, find the combination that minimizes Total_Cost as the optimal combination of Plan_a, denote it Opti_Plan_a as the final partitioning scheme of Plan_a, and denote its minimum cost Low_Cost_a.
Step seven: compare the Low_Cost values of all x Opti_Plan_a, find the candidate partitioning scheme Opti_Plan with the smallest Low_Cost, and output it as the optimal partitioning scheme.
In the embodiment of the invention, the identical-device optimization scheme for the highest throughput comprises: when the device configurations in the application scenario are completely the same, a throughput-optimal deep learning model segmentation scheme can be obtained with a dynamic programming algorithm of lower complexity. The core idea is to minimize the delay of the highest-latency sub-model, i.e., load balancing.
The state transition equation is expressed as:
Low_Cost_n = min{ max{Low_Cost_0, L_compute(0..n) × (d_0 + 1)}, max{Low_Cost_1, L_compute(1..n) × (d_1 + 1)}, ..., max{Low_Cost_{n-1}, L_compute(n-1..n) × (d_{n-1} + 1)} }
In general L_communicate ≪ L_compute, so the L_communicate delay is not considered. The n-th operator forms a sub-model together with some preceding operators, and the delay is determined by the largest sub-model delay; the optimum over the first n operator combinations is always derived recursively from Low_Cost. Define Low_Cost_0 = 0 and L_communicate(0) = 0, where L_compute(i..j) is the computation delay of the sub-model formed by combining operators i through j, Cost_i is the computation delay of the i-th operator, L_communicate is the data-transmission delay, and d_n is the number of devices occupied by the optimal solution for operators 1 through n. A Low_Cost table stores the computed Low_Cost_i, a Position table records, for each Node_j, the position i selected for Low_Cost_j, and a Number table records the number of devices required by the corresponding scheme.
In the above embodiment of the invention, the steps for obtaining the throughput-optimal deep learning model segmentation scheme with the dynamic programming algorithm are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n, with their connectivity represented by an n×n matrix M. Each branching node has several outgoing nodes and each converging node has several incoming nodes.
Step two: starting from Node_1, for each Node_i, if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}} (the minimum delay depends on the branch with the largest delay), finding the optimal combination that minimizes the delay of this branch.
Step three: for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and record the sub-model combination corresponding to Low_Cost_j.
Step four: after the nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the recorded combinations.
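The throughput-first recurrence can be sketched analogously for a linear chain; the sketch below follows the state transition equation above, maintaining the Low_Cost, Position and Number (device-count) tables, and ignores L_communicate since L_communicate ≪ L_compute. It is an illustration under these simplifications, with hypothetical names, not the patented implementation.
```python
def dp_max_throughput(op_delay):
    """DP for the throughput-first objective on identical devices.

    Follows Low_Cost_n = min_i max{ Low_Cost_i, L_compute(i..n) * (d_i + 1) },
    where d_i (the Number table) is the device count of the best solution
    for operators 1..i.
    """
    n = len(op_delay)
    prefix = [0.0]
    for d in op_delay:
        prefix.append(prefix[-1] + d)
    low_cost = [0.0] * (n + 1)
    devices = [0] * (n + 1)                      # the Number table
    position = [0] * (n + 1)                     # the Position table
    for j in range(1, n + 1):
        best, arg = float("inf"), 0
        for i in range(j):
            stage = prefix[j] - prefix[i]        # L_compute(i..j)
            cost = max(low_cost[i], stage * (devices[i] + 1))
            if cost < best:
                best, arg = cost, i
        low_cost[j], position[j] = best, arg
        devices[j] = devices[arg] + 1
    cuts, j = [], n
    while j > 0:
        cuts.append((position[j] + 1, j))
        j = position[j]
    return low_cost[n], devices[n], cuts[::-1]

print(dp_max_throughput([3.0, 1.0, 4.0, 1.0, 5.0]))
```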
Referring to fig. 5, the invention successfully partitions a large-parameter model into small sub-models, so that the sub-models fit the memory capacity of the computing devices, non-compute overhead is reduced, and the capability of the existing computing devices is fully exploited.
The invention preferably stores the deep learning model with the ONNX model protocol, which can be loaded into various deep learning frameworks and, through their APIs, exported to various hardware back ends. The TCP protocol is used for communication among the computing devices, so no complex data format conversion is needed, giving good compatibility and universality. The invention deploys the model on different devices, which naturally introduces communication overhead, but the introduced communication overhead is far smaller than the computation overhead of the model itself; for cloud computing platforms the communication overhead is hardly worth mentioning, and it is also much smaller than the loading overhead of block-loading parameters from a hard disk.
The invention provides a flexible model segmentation algorithm that can generate configuration schemes for both high-load and low-load conditions, and also supports presetting device parameters in advance and adjusting the scheme to the devices. Once a scheme is generated, the service provider can flexibly adjust it as required. For the common scenario where the computing power of all computing devices is almost the same, the invention provides a dynamic programming algorithm of lower complexity, which obtains a globally optimal device segmentation scheme more efficiently. The invention also provides a delay estimation model that can predict the running time of a sub-model on different devices instead of running the sub-model directly for statistics, greatly reducing preprocessing overhead.
The model deployment device of the embodiment of the invention comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
the running time set acquisition module is used for acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to acquire a running time set;
the sub-model set acquisition module is used for combining the operator models in the processed operator model set by adopting a preset search method according to the running time set to obtain a sub-model set;
the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set, so as to complete model deployment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, one skilled in the art may make modifications and equivalents to the specific embodiments of the present invention, and any modifications and equivalents not departing from the spirit and scope of the present invention are within the scope of the claims of the present invention.

Claims (9)

1. A method of model deployment, comprising the steps of:
acquiring an operator model set of a deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the model, to obtain a running time set;
based on the running time set, combining operator models in the processed operator model set by a preset search method to obtain a sub-model set;
based on the sub-model set, deploying the deep neural network model to be deployed on the device set to complete model deployment;
the preset search method is a backtracking search method;
when combining operator models in the processed operator model set by the backtracking search method, a high throughput rate priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme; the high throughput rate priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, querying the topological graph to obtain the branch nodes connected to Node_i, thereby obtaining a search tree, and performing DFS traversal on the search tree to obtain segmentation schemes; wherein, for each branch node Node_(i+1) of Node_i, two branches are generated, representing that Node_i and Node_(i+1) are in the same sub-model or in different sub-models; when Node_i has a plurality of branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation is identical everywhere except between Node_i and Node_j are regarded as the same scheme and merged, so that x final schemes are obtained; traversing the x schemes and removing any scheme whose number of required devices is greater than the number of available devices, to obtain the remaining scheme set;
traversing the remaining scheme set: for each scheme, calculating the cost of the operator models of each sub-model on each device; combining the costs on different devices by multiplying the maximum per-device cost by the number of devices consumed to obtain the total cost, finding the combination with the minimum total cost as the optimal combination, and recording it as the division scheme; comparing the minimum costs of all division schemes, and taking the division scheme corresponding to the minimum cost as the final division scheme.
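By way of illustration only (not part of the claimed subject matter): for the simplest case of a linear operator chain, the backtracking search above reduces to a depth-first enumeration of the gaps between adjacent operators, and each surviving scheme is scored by the high throughput rate priority objective, i.e. the maximum per-device cost multiplied by the number of devices consumed. The sketch below uses invented running times and omits the branch-merging step needed for multi-branch topologies.

```python
from itertools import permutations

# Hypothetical "running time set": run_time[op][dev] is the measured running
# time of operator op on device dev (numbers invented for illustration).
run_time = [
    [2.0, 3.0],   # operator 0 on device 0 / device 1
    [4.0, 2.5],   # operator 1
    [1.0, 1.5],   # operator 2
]
n_ops, n_devices = len(run_time), len(run_time[0])

def schemes(n):
    """Backtracking over the gaps between adjacent operators: each gap either
    keeps Node_i and Node_(i+1) in the same sub-model or starts a new one
    (the two branches of the search tree), traversed depth-first."""
    def dfs(i, cuts):
        if i == n - 1:
            yield list(cuts)
            return
        yield from dfs(i + 1, cuts)             # same sub-model
        yield from dfs(i + 1, cuts + [i + 1])   # different sub-models
    yield from dfs(0, [])

def to_sub_models(cuts, n):
    bounds = [0] + cuts + [n]
    return [range(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]

best = None
for cuts in schemes(n_ops):
    subs = to_sub_models(cuts, n_ops)
    if len(subs) > n_devices:      # needs more devices than are available
        continue                   # -> the scheme is removed
    # every assignment of sub-models to distinct devices is a "combination"
    for assign in permutations(range(n_devices), len(subs)):
        costs = [sum(run_time[op][dev] for op in sub)
                 for sub, dev in zip(subs, assign)]
        total = max(costs) * len(subs)   # high throughput rate priority
        if best is None or total < best[0]:
            best = (total, [list(s) for s in subs], assign)

print(best)   # minimum total cost, its sub-models, and its device assignment
```

The prune corresponds to removing schemes that require more devices than are available, and the assignment loop plays the role of combining the costs on different devices before taking the minimum over all division schemes.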
2. The method for deploying models according to claim 1, wherein the step of obtaining the operator model set of the deep neural network model to be deployed specifically comprises:
selecting a single-layer neural network as the basic granularity, and dividing the deep neural network model to be deployed to obtain the operator model set.
3. A model deployment method according to claim 1, wherein,
the step of performing operator fusion or operator segmentation processing on the operator models meeting the preset conditions in the operator model set to obtain the processed operator model set specifically comprises the following steps:
comparing the parameter size of each operator model in the operator model set with the memory of the device having the smallest memory in the device set; performing operator segmentation on any operator model whose parameter size is larger than that memory, until the parameter size is smaller than 1/2 of the memory; and performing operator fusion on any operator model whose parameter size is less than 1/10 of the memory, until the parameter size is greater than 1/10 and less than 1/2 of the memory.
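A minimal sketch of this preprocessing rule, assuming each operator model can be summarized by its parameter size in bytes; OpModel, split_in_half, and fuse are hypothetical stand-ins for real operator segmentation and fusion:

```python
from dataclasses import dataclass

@dataclass
class OpModel:
    name: str
    param_bytes: int

def split_in_half(op):
    # hypothetical stand-in for real operator segmentation
    half = op.param_bytes // 2
    return (OpModel(op.name + ".a", half),
            OpModel(op.name + ".b", op.param_bytes - half))

def fuse(a, b):
    # hypothetical stand-in for real operator fusion
    return OpModel(a.name + "+" + b.name, a.param_bytes + b.param_bytes)

def segment(op, mem):
    """Split an oversized operator until every piece is below mem / 2."""
    if op.param_bytes < mem / 2:
        return [op]
    a, b = split_in_half(op)
    return segment(a, mem) + segment(b, mem)

def preprocess(ops, mem):
    """mem: memory of the device with the smallest memory in the device set."""
    segmented = []
    for op in ops:
        if op.param_bytes > mem:                # larger than the smallest memory
            segmented.extend(segment(op, mem))  # -> pieces below mem / 2
        else:
            segmented.append(op)
    fused = []
    for op in segmented:
        if fused and fused[-1].param_bytes < mem / 10:
            merged = fuse(fused[-1], op)        # fuse a tiny operator forward
            if merged.param_bytes < mem / 2:    # keep result in (mem/10, mem/2)
                fused[-1] = merged
                continue
        fused.append(op)
    return fused

ops = [OpModel("conv1", 900), OpModel("relu1", 4), OpModel("fc1", 150)]
print(preprocess(ops, mem=400))
```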
4. The model deployment method according to claim 1, wherein, when the backtracking search method is used to combine operator models in the processed operator model set, a low service delay priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high throughput rate priority scheme; the low service delay priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, querying the topological graph to obtain the branch nodes connected to Node_i, thereby obtaining a search tree, and performing DFS traversal on the search tree to obtain segmentation schemes; wherein, for each branch node Node_(i+1) of Node_i, two branches are generated, representing that Node_i and Node_(i+1) are in the same sub-model or in different sub-models; when Node_i has a plurality of branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation is identical everywhere except between Node_i and Node_j are regarded as the same scheme and merged, so that x final schemes are obtained; traversing the x schemes and removing any scheme whose number of required devices is greater than the number of available devices, to obtain the remaining scheme set;
traversing the remaining scheme set: for each scheme, calculating the cost of the operator models of each sub-model on each device; combining the costs on different devices by accumulating them to obtain the total cost, finding the combination with the minimum total cost as the optimal combination, and recording it as the division scheme; comparing the minimum costs of all division schemes, and taking the division scheme corresponding to the minimum cost as the final division scheme.
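To make the contrast with claim 1 concrete: for one and the same candidate combination, the two priority schemes aggregate the per-device costs differently. A toy comparison with invented numbers:

```python
# Per-device cost of one candidate combination (invented numbers).
costs = [3.0, 5.0, 2.0]

# High throughput rate priority (claim 1): the pipeline is paced by its
# slowest device, and every extra device counts against the scheme.
total_throughput_first = max(costs) * len(costs)   # 15.0

# Low service delay priority (claim 4): a request traverses every device
# once, so the per-device costs simply accumulate.
total_delay_first = sum(costs)                     # 10.0

print(total_throughput_first, total_delay_first)
```

Under the first objective an uneven split is penalized even when its sum is small; under the second, only the end-to-end total matters.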
5. The model deployment method according to claim 1, wherein all devices in the device set for deploying the model are identical; the preset search method is a dynamic programming search method;
when the operator models in the processed operator model set are combined by the dynamic programming search method, a low service delay priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high throughput rate priority scheme; the low service delay priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and representing their connectivity with an n × n matrix M; each branch node has a plurality of outgoing nodes, and each convergence node has a plurality of incoming nodes;
traversing from Node_1 to Node_n; wherein, if Node_i is the parent node of branch nodes, finding the convergence node Node_j of those branches and recursively invoking, for the nodes between Node_i and Node_j, the algorithm with optimization target min{Number_(i..j) × max{Low_Cost_(i..j)}} to find the optimal combination that minimizes the delay of this branch; solving the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and recording the sub-model combination mode of Low_Cost_j; after the traversal, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the records;
wherein Low_Cost_i denotes the sub-model combination mode with the lowest total delay among all combination modes of the 0th through i-th operators; the state transition equation is:

Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_(n-1) + L_compute(n-1..n) + L_communicate(n-1) },

Low_Cost_0 = 0,
L_communicate(0) = 0,
L_communicate = Data_Size × Coefficient,

wherein L_compute(i..j) is the computation delay of the sub-model formed by combining the i-th through j-th operators; Cost_i is the computation delay of the i-th operator; L_communicate is the communication delay; Data_Size is the size of the data to be transmitted; Coefficient is a constant that varies with the network bandwidth; and Number_(i..j) denotes the number of machines consumed between Node_i and Node_j.
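An illustrative rendering of this dynamic program for a linear operator chain on identical devices. The per-operator delays, data sizes, Coefficient value, and the assumption that fusing operators adds a small per-operator overhead are all invented for the example; in the method, L_compute would come from the measured running time set:

```python
COEFFICIENT = 0.01                    # per-byte transfer delay (bandwidth-dependent)
cost = [2.0, 4.0, 1.0, 3.0]           # Cost_i: compute delay of operator i (invented)
data_size = [0.0, 80.0, 40.0, 120.0]  # bytes entering a sub-model starting at operator i

def L_compute(i, j):
    """Delay of the fused sub-model covering operators i..j-1 on one device.
    Invented model: fusing adds 10% overhead per extra operator; in the
    method this value comes from the measured running time set."""
    return sum(cost[i:j]) * (1 + 0.1 * (j - i - 1))

def L_communicate(i):
    """L_communicate = Data_Size * Coefficient; zero at the model input."""
    return data_size[i] * COEFFICIENT if i > 0 else 0.0

n = len(cost)
low_cost = [0.0] * (n + 1)   # low_cost[j]: lowest total delay for operators 0..j-1
start = [0] * (n + 1)        # start[j]: where the last sub-model begins
for j in range(1, n + 1):
    best = None
    for i in range(j):       # last sub-model covers operators i..j-1
        c = low_cost[i] + L_compute(i, j) + L_communicate(i)
        if best is None or c < best:
            best, start[j] = c, i
    low_cost[j] = best       # the state transition equation of claim 5

# walk the recorded choices backwards to recover the partition
parts, j = [], n
while j > 0:
    parts.append((start[j], j - 1))
    j = start[j]
print(low_cost[n], list(reversed(parts)))   # -> roughly 11.4 with split [(0, 1), (2, 3)]
```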
6. The model deployment method according to claim 5, wherein when the operator models in the processed operator model set are combined by adopting a dynamic programming search method, a high throughput rate priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme; the high throughput rate priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and representing their connectivity with an n × n matrix M; each branch node has a plurality of outgoing nodes, and each convergence node has a plurality of incoming nodes;
traversing from Node_1 to Node_n; wherein, if Node_i is the parent node of branch nodes, finding the convergence node Node_j of those branches and recursively invoking, for the nodes between Node_i and Node_j, the algorithm with optimization target min{Number_(i..j) × max{Low_Cost_(i..j)}} to find the optimal combination that minimizes the delay of this branch; solving the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost × device_(0..j), and recording the sub-model combination mode corresponding to Low_Cost_j; after Node_n is traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the recorded sub-model combination modes.
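The corresponding sketch for this variant (self-contained, reusing the invented toy chain from the example under claim 5): only the quantity being minimized changes, a combination now being scored by its slowest sub-model multiplied by the number of devices it consumes, cf. Low_Cost_j = Low_Cost × device_(0..j). For brevity it enumerates the combinations exhaustively instead of through the claim's recursion:

```python
from itertools import combinations

COEFFICIENT = 0.01
cost = [2.0, 4.0, 1.0, 3.0]           # invented per-operator compute delays
data_size = [0.0, 80.0, 40.0, 120.0]  # invented transfer sizes
n = len(cost)

def L_compute(i, j):
    # same invented fused-delay model as in the claim-5 sketch
    return sum(cost[i:j]) * (1 + 0.1 * (j - i - 1))

def L_communicate(i):
    return data_size[i] * COEFFICIENT if i > 0 else 0.0

def throughput_score(bounds):
    """bounds = [0, c1, ..., n]; sub-model k covers operators bounds[k]..bounds[k+1]-1."""
    stages = [L_compute(bounds[k], bounds[k + 1]) + L_communicate(bounds[k])
              for k in range(len(bounds) - 1)]
    return max(stages) * len(stages)   # slowest sub-model x devices consumed

best = min(
    ([0] + list(c) + [n] for r in range(n) for c in combinations(range(1, n), r)),
    key=throughput_score,
)
print(best, throughput_score(best))   # the estimated-best division scheme
```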
7. A model deployment apparatus, comprising:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed, and performing operator fusion or operator segmentation on the operator models that meet preset conditions in the operator model set to obtain a processed operator model set;
the running time set acquisition module is used for acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to acquire a running time set;
the sub-model set acquisition module is used for combining the operator models in the processed operator model set by adopting a preset search method according to the running time set to obtain a sub-model set;
the deployment module is used for deploying the deep neural network model to be deployed on the device set according to the sub-model set, so as to complete model deployment;
the preset search method is a backtracking search method;
when combining operator models in the processed operator model set by the backtracking search method, a high throughput rate priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme; the high throughput rate priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, querying the topological graph to obtain the branch nodes connected to Node_i, thereby obtaining a search tree, and performing DFS traversal on the search tree to obtain segmentation schemes; wherein, for each branch node Node_(i+1) of Node_i, two branches are generated, representing that Node_i and Node_(i+1) are in the same sub-model or in different sub-models; when Node_i has a plurality of branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation is identical everywhere except between Node_i and Node_j are regarded as the same scheme and merged, so that x final schemes are obtained; traversing the x schemes and removing any scheme whose number of required devices is greater than the number of available devices, to obtain the remaining scheme set;
traversing the remaining scheme set: for each scheme, calculating the cost of the operator models of each sub-model on each device; combining the costs on different devices by multiplying the maximum per-device cost by the number of devices consumed to obtain the total cost, finding the combination with the minimum total cost as the optimal combination, and recording it as the division scheme; comparing the minimum costs of all division schemes, and taking the division scheme corresponding to the minimum cost as the final division scheme.
8. An electronic device, comprising: a processor; and a memory for storing computer program instructions; characterized in that,
the computer program instructions, when loaded and executed by the processor, perform the model deployment method of any one of claims 1 to 6.
9. A readable storage medium storing computer program instructions, wherein the computer program instructions, when loaded and executed by a processor, perform the model deployment method of any one of claims 1 to 6.
CN202110567899.5A 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium Active CN113220457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567899.5A CN113220457B (en) 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113220457A (en) 2021-08-06
CN113220457B (en) 2024-03-22

Family

ID=77098247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567899.5A Active CN113220457B (en) 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113220457B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118592B (en) * 2022-06-15 2023-08-08 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator feature analysis
CN115115062B (en) * 2022-06-29 2023-09-19 北京百度网讯科技有限公司 Machine learning model building method, related device and computer program product
WO2024014728A1 (en) * 2022-07-11 2024-01-18 Samsung Electronics Co., Ltd. Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device
CN116070675B (en) * 2023-03-06 2023-06-09 西南交通大学 Side slope neural network model selection method, device, equipment and readable storage medium
CN115981870B (en) * 2023-03-10 2023-06-13 之江实验室 Data processing method and device, storage medium and electronic equipment
CN116050499B (en) * 2023-04-03 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Self-adaptive model partitioning method, system and equipment in model parallel training
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing
CN116306856B (en) * 2023-05-17 2023-09-05 之江实验室 Deep learning model deployment method and device based on search
CN116630632B (en) * 2023-07-25 2023-11-03 腾讯科技(深圳)有限公司 Image segmentation model quantization method, device and equipment and computer storage medium
CN117155791B (en) * 2023-10-31 2024-02-13 浪潮电子信息产业股份有限公司 Model deployment method, system, equipment and medium based on cluster topology structure
CN117311998B (en) * 2023-11-30 2024-03-05 卓世未来(天津)科技有限公司 Large model deployment method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN110490322A (en) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 Method for splitting and device, the electronic equipment and storage medium of operation node
CN110674936A (en) * 2019-09-24 2020-01-10 上海寒武纪信息科技有限公司 Neural network processing method and device, computer equipment and storage medium
CN111340237A (en) * 2020-03-05 2020-06-26 腾讯科技(深圳)有限公司 Data processing and model operation method, device and computer equipment
CN111738434A (en) * 2020-06-03 2020-10-02 中国科学院计算技术研究所 Method for executing deep neural network on heterogeneous processing unit
CN112270399A (en) * 2020-09-29 2021-01-26 北京百度网讯科技有限公司 Operator registration processing method and device based on deep learning and electronic equipment
WO2021057465A1 (en) * 2019-09-26 2021-04-01 中兴通讯股份有限公司 Method and apparatus for performing parallel processing on deep learning model
CN112686378A (en) * 2020-12-23 2021-04-20 展讯通信(上海)有限公司 Calculation deployment method and device of neural network, storage medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390387B (en) * 2018-04-20 2023-07-18 伊姆西Ip控股有限责任公司 Assessment of resources used by deep learning applications

Also Published As

Publication number Publication date
CN113220457A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113220457B (en) Model deployment method, model deployment device, terminal equipment and readable storage medium
US20220129302A1 (en) Data processing system and method for heterogeneous architecture
CN114338504B (en) Micro-service deployment and routing method based on network edge system
CN113708972B (en) Service function chain deployment method and device, electronic equipment and storage medium
CN108572873B (en) Load balancing method and device for solving Spark data inclination problem
CN110084363B (en) Deep learning model acceleration method based on FPGA platform
KR20200113744A (en) Method and apparatus for partitioning deep neural networks
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN111371616B (en) Virtual network function chain deployment method and system for NUMA (non Uniform memory Access) architecture server
CN112463189B (en) Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN116662010B (en) Dynamic resource allocation method and system based on distributed system environment
Tanaka et al. Automatic graph partitioning for very large-scale deep learning
CN111314235A (en) Network delay optimization method based on virtual network function resource demand prediction
CN112015765B (en) Spark cache elimination method and system based on cache value
CN113794748B (en) Performance-aware service function chain intelligent deployment method and device
WO2020164644A2 (en) Neural network model splitting method, apparatus, computer device and storage medium
CN111506431A (en) Method for optimizing perception load performance of cloud server under energy consumption constraint
CN112862083B (en) Deep neural network inference method and device in edge environment
CN104683480A (en) Distribution type calculation method based on applications
WO2015055502A2 (en) Method of partitioning storage in a distributed data storage system and corresponding device
CN115392467B (en) Cloud edge cooperative self-adaptive depth reasoning method for real-time processing of mass data
CN116841710A (en) Task scheduling method, task scheduling system and computer storage medium
CN113824650B (en) Parameter transmission scheduling algorithm and system in distributed deep learning system
CN114356738A (en) Method for predicting time required for executing neural network model and related product
CN116501828B (en) Non-perception vector query method and system for server based on unstructured data set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220615

Address after: 518031 room 1410, building 1, Changfu Jinmao building, south side of Shihua Road, Fubao community, Fubao street, Futian District, Shenzhen, Guangdong

Applicant after: Shenzhen Zhixin Huaxi Information Technology Co.,Ltd.

Address before: 710077 11th floor, building B2, yunhuigu 156, software new town, Tiangu 8th Road, high tech Zone, Xi'an City, Shaanxi Province

Applicant before: Cross Information Core Technology Research Institute (Xi'an) Co.,Ltd.

GR01 Patent grant