CN113220457A - Model deployment method, model deployment device, terminal device and readable storage medium - Google Patents
Model deployment method, model deployment device, terminal device and readable storage medium
- Publication number
- CN113220457A (application CN202110567899.5A)
- Authority
- CN
- China
- Prior art keywords
- node
- model
- operator
- scheme
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 73
- 238000003860 storage Methods 0.000 title claims abstract description 13
- 230000011218 segmentation Effects 0.000 claims abstract description 46
- 238000003062 neural network model Methods 0.000 claims abstract description 23
- 230000004927 fusion Effects 0.000 claims abstract description 15
- 238000012545 processing Methods 0.000 claims abstract description 15
- 238000013136 deep learning model Methods 0.000 claims description 25
- 238000005457 optimization Methods 0.000 claims description 20
- 238000004364 calculation method Methods 0.000 claims description 16
- 238000004422 calculation algorithm Methods 0.000 claims description 15
- 238000005192 partition Methods 0.000 claims description 15
- 238000004590 computer program Methods 0.000 claims description 14
- 230000007704 transition Effects 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 8
- 238000013528 artificial neural network Methods 0.000 claims description 5
- 239000002356 single layer Substances 0.000 claims description 3
- 238000002360 preparation method Methods 0.000 claims 1
- 238000004891 communication Methods 0.000 description 15
- 238000010586 diagram Methods 0.000 description 11
- 239000010410 layer Substances 0.000 description 9
- 238000000638 solvent extraction Methods 0.000 description 8
- 238000013135 deep learning Methods 0.000 description 7
- 230000006870 function Effects 0.000 description 7
- 230000001934 delay Effects 0.000 description 6
- 230000006872 improvement Effects 0.000 description 6
- 238000010801 machine learning Methods 0.000 description 5
- 230000005540 biological transmission Effects 0.000 description 3
- 238000011161 development Methods 0.000 description 2
- 230000018109 developmental process Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000012805 post-processing Methods 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 239000002699 waste material Substances 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 125000002015 acyclic group Chemical group 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000013179 statistical model Methods 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
- G06N3/105—Shells for specifying net layout
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5017—Task decomposition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a model deployment method, a model deployment device, a terminal device and a readable storage medium. The method comprises the following steps: acquiring an operator model set of a deep neural network model to be deployed; performing operator fusion or operator segmentation on the operator models in the set that satisfy preset conditions, to obtain a processed operator model set; acquiring the running time of each operator model in the processed set on each device in the device set used for deployment, to obtain a running-time set; combining the operator models in the processed set with a preset search method based on the running-time set, to obtain a sub-model set; and deploying the deep neural network model on the device set according to the sub-model set. The invention is fully compatible with devices of different computing power and improves both running efficiency and overall throughput.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to the field of model deployment, and particularly relates to a model deployment method, a model deployment device, terminal equipment and a readable storage medium.
Background
Machine Learning (ML) is one of the fastest-growing fields in computer science. A typical machine-learning workflow trains a statistical model for a specific application on a large pre-collected dataset, updating the model parameters (also called "weights") until convergence; the trained model is then used for inference, i.e., predicting results on new data. Deep learning based on neural networks is the most widely used ML technique owing to its excellent results. Neural network models are multi-layered Directed Acyclic Graphs (DAGs), typically composed of convolution, matrix multiplication, pooling, batch normalization and similar operations, connected in linear chains or more complex patterns (e.g., branches and residual connections). The generalization ability and accuracy of neural network models generally improve with deeper topologies and larger layers, as in ResNet and VGG, but this also leads to higher execution latency.
Researchers are continuously introducing complex network models to obtain better generalization ability and accuracy, but the parameter counts and computational complexity of the models keep increasing; for example, the GPT-3 model proposed by OpenAI has a striking 175 billion parameters, and the model occupies more than 700 GB of disk storage. Researchers have proposed various system-level optimization schemes to address the performance challenges posed by larger and more complex models. On the hardware side, faster computation is supported by GPUs and special accelerators (e.g., Google TPUs); on the software side, many frameworks bridge the gap between productivity-centric high-level interfaces and performance-oriented low-level implementations, including TensorFlow, PyTorch, MXNet and TVM.
However, the above studies do not solve the deployment problem of large-parameter models well. Most current work assumes by default that the machine running the deep learning model has sufficient memory and disk, so the model can be run directly on a single machine to obtain its output. On cloud server clusters, distributed storage is generally used to obtain better file read/write rates and hence better throughput, but this spends more time on transmission and cannot fully exploit the computing capability of all devices. With the development of deep learning, researchers have found that computational resources are always limited; moreover, the device responsible for computation usually needs strong computing power while the devices responsible for reading and writing sit idle most of the time, which wastes resources on one hand and makes throughput hard to improve on the other.
In summary, in order to allow the deep learning model to operate in a restricted resource scenario, a new method, system, device and medium for deploying the deep learning model in the restricted resource scenario are urgently needed.
Disclosure of Invention
The present invention is directed to a model deployment method, a model deployment apparatus, a terminal device and a readable storage medium, so as to solve one or more of the above technical problems. The invention can be fully compatible with devices with different computing powers, and can improve the operation efficiency and the overall throughput rate.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a model deployment method, which comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
The further improvement of the present invention is that the step of obtaining the operator model set of the deep neural network model to be deployed specifically includes:
and selecting a single-layer neural network as basic granularity, dividing the deep neural network model to be deployed, and obtaining an operator model set.
The further improvement of the present invention lies in that the step of performing operator fusion or operator segmentation processing on the operator models satisfying the preset conditions in the operator model set to obtain the processed operator model set specifically comprises:
comparing the parameter size of each operator model in the operator model set with the memory of the device with the smallest memory in the device set; performing operator segmentation on any operator model whose parameter size is larger than that memory, until each resulting operator's parameter size is smaller than 1/2 of the memory; and performing operator fusion on operator models whose parameter size is smaller than 1/10 of the memory, until the fused parameter size is larger than 1/10 and smaller than 1/2 of the memory.
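By way of illustration only (not part of the claimed method), the split/fuse pre-processing rule above can be sketched in Python. All names and numbers are illustrative assumptions; oversized operators are split into equal pieces here, whereas a real implementation would split along layer boundaries.

```python
import math

def preprocess_operators(param_sizes, min_mem):
    """Sketch of the pre-processing rule: operators larger than the
    smallest device memory are split until each piece is below 1/2 of
    that memory; runs of operators below 1/10 of the memory are fused
    until the fused size lands between 1/10 and 1/2."""
    upper = min_mem / 2      # segmentation target: below half the memory
    lower = min_mem / 10     # fusion threshold: a tenth of the memory

    # Operator segmentation: divide oversized operators evenly.
    split = []
    for size in param_sizes:
        if size > min_mem:
            pieces = math.floor(size / upper) + 1   # guarantees size/pieces < upper
            split.extend([size / pieces] * pieces)
        else:
            split.append(size)

    # Operator fusion: greedily merge consecutive tiny operators.
    fused, acc = [], 0
    for size in split:
        if size < lower:
            acc += size
            if acc >= lower:          # fused group is now big enough
                fused.append(acc)
                acc = 0
        else:
            if acc:                   # flush a pending (still small) group
                fused.append(acc)
                acc = 0
            fused.append(size)
    if acc:
        fused.append(acc)
    return fused
```

For example, with a smallest device memory of 100 units, an operator of 120 units is split into three pieces of 40, and two operators of 5 units are fused into one of 10.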
The invention has the further improvement that the preset search method is a backtracking search method;
when operator models in the processed operator model set are combined by the backtracking search method and the actual running time is greater than or equal to the theoretical delay of the high-throughput-first scheme, the high-throughput-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, for each Node_i the topology graph is queried to obtain the branch nodes connected to Node_i, yielding a search tree, and DFS traversal of the search tree produces segmentation schemes. For each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being placed in the same sub-model and in different sub-models, respectively. When Node_i has multiple branch nodes, the merge node Node_j at which all branches converge is found, and schemes that are identical outside the segment from Node_i to Node_j are merged as the same scheme, leaving x final schemes. The x schemes are then traversed, and any scheme that requires more devices than the available number D is removed, yielding the surviving scheme set;
traverse the surviving scheme set: for each scheme, calculate the cost of the operator models of each sub-model on each device; combine the costs on the different devices, multiply the maximum device cost by the number of devices consumed to obtain the total cost, and take the combination minimizing the total cost as the optimal combination, recorded as that scheme's partition plan; finally, compare the minimum costs of all partition plans and take the partition plan with the smallest cost as the final partition plan.
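As an illustrative sketch of the search just described (the branch-node merging of the full backtracking search is omitted, so only a linear operator chain is handled; all names and numbers are assumptions), the enumeration and the throughput-first cost rule — maximum device cost times the number of devices consumed — can be written as:

```python
from itertools import product, permutations

def throughput_first_chain(op_costs, op_params, devices):
    """Enumerate every way to cut a linear operator chain into
    consecutive sub-models and pick the scheme minimising
    max(device cost) * number of devices.
    devices: list of (slowdown, memory) pairs."""
    n, D = len(op_costs), len(devices)
    best = None
    # Each bit of `cuts` says whether a new sub-model starts before
    # operator i + 1 -- the "same sub-model / different sub-model"
    # branching of the search tree.
    for cuts in product([0, 1], repeat=n - 1):
        groups, params = [], []
        gc, gp = op_costs[0], op_params[0]
        for i, cut in enumerate(cuts):
            if cut:
                groups.append(gc); params.append(gp)
                gc, gp = op_costs[i + 1], op_params[i + 1]
            else:
                gc += op_costs[i + 1]; gp += op_params[i + 1]
        groups.append(gc); params.append(gp)
        if len(groups) > D:
            continue                       # needs more devices than available
        # Try every assignment of sub-models to distinct devices.
        for perm in permutations(range(D), len(groups)):
            if any(p > devices[d][1] for p, d in zip(params, perm)):
                continue                   # sub-model does not fit in memory
            bottleneck = max(g * devices[d][0] for g, d in zip(groups, perm))
            total = bottleneck * len(groups)   # max device cost * device count
            if best is None or total < best[0]:
                best = (total, groups, perm)
    return best
```

Note the memory-fit check is what makes splitting worthwhile: a model whose parameters exceed every single device's memory can only be deployed by cutting it.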
The invention has the further improvement that when operator models in the processed operator model set are combined by the backtracking search method and the actual running time is less than the theoretical delay of the high-throughput-first scheme, the low-service-delay-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, for each Node_i the topology graph is queried to obtain the branch nodes connected to Node_i, yielding a search tree, and DFS traversal of the search tree produces segmentation schemes. For each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being placed in the same sub-model and in different sub-models. When Node_i has multiple branch nodes, the merge node Node_j at which all branches converge is found, and schemes that are identical outside the segment from Node_i to Node_j are merged as the same scheme, leaving x final schemes. The x schemes are traversed, and any scheme requiring more devices than the available number D is removed, yielding the surviving scheme set;
traverse the surviving scheme set: for each scheme, calculate the cost of the operator models of each sub-model on each device; accumulate the costs on the different devices to obtain the total cost, and take the combination minimizing the total cost as the optimal combination, recorded as that scheme's partition plan; compare the minimum costs of all partition plans and take the partition plan with the smallest cost as the final partition plan.
A further improvement of the invention is that all devices in the set of devices for deploying the model are identical; the preset search method is a dynamic programming search method;
when operator models in the processed operator model set are combined by the dynamic programming search method and the actual running time is less than the theoretical delay of the high-throughput-first scheme, the low-service-delay-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents their connection relationship; each branching node has several outgoing nodes, and each merging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the merge node Node_j of those branches, and for the nodes between Node_i and Node_j recursively invoke the algorithm whose optimization target is min Number_{i..j} × max{Low_Cost_{i..j}} to find the optimal combination minimizing the branch delay; solve the optimization according to the state-transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination of Low_Cost_j; after the traversal, Low_Cost_n is the estimated minimum delay, and querying the records yields the deep learning model partition deployment scheme; Number_{i..j} denotes the number of machines consumed between Node_i and Node_j;
wherein Low_Cost_i denotes the lowest total delay over all sub-model combinations of operators 0 through i; the state-transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) },
Low_Cost_0 = 0,
L_communicate(0) = 0,
L_communicate = Data_Size × Coefficient,
where L_compute(i..j) is the computation delay of the sub-model formed by combining operators i through j, Cost_i is the computation delay of the i-th operator, L_communicate is the data-transmission delay, Data_Size is the size of the data to be transmitted, and Coefficient is a constant determined by the network bandwidth.
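By way of illustration only, the recurrence above can be sketched in Python for a linear operator chain. The per-operator latencies, parameter sizes and the memory-fit check are illustrative assumptions (the patent profiles real running times), L_compute(i..j) is taken as the sum of the fused operators' latencies, and comm_latency[0] is fixed to 0 as in the definitions above.

```python
def dp_latency_first(op_latency, op_params, comm_latency, device_mem):
    """Implements Low_Cost_j = min_i { Low_Cost_i + L_compute(i..j)
    + L_communicate(i) }, returning the minimum total delay and the
    positions at which the chain is cut into sub-models."""
    n = len(op_latency)
    pre = [0]                      # prefix sums: L_compute(i..j) = pre[j]-pre[i]
    for t in op_latency:
        pre.append(pre[-1] + t)
    psum = [0]                     # prefix sums of parameter sizes
    for p in op_params:
        psum.append(psum[-1] + p)

    low = [0.0] + [float("inf")] * n     # Low_Cost_0 = 0
    choice = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            if psum[j] - psum[i] > device_mem:
                continue           # fused sub-model would not fit on one device
            cand = low[i] + (pre[j] - pre[i]) + comm_latency[i]
            if cand < low[j]:
                low[j], choice[j] = cand, i

    cuts, j = [], n                # recover where the model is cut
    while j > 0:
        cuts.append(choice[j])
        j = choice[j]
    return low[n], sorted(c for c in cuts if c)
```

With three operators of latency 2, 3 and 4, parameter sizes of 50 each and a device memory of 100, all three operators cannot be fused, so the DP cuts once and pays one unit of communication delay.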
The invention has the further improvement that when operator models in the processed operator model set are combined by the dynamic programming search method and the actual running time is greater than or equal to the theoretical delay of the high-throughput-first scheme, the high-throughput-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents their connection relationship; each branching node has several outgoing nodes, and each merging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the merge node Node_j of those branches, and for the nodes between Node_i and Node_j recursively invoke the algorithm whose optimization target is min Number_{i..j} × max{Low_Cost_{i..j}} to find the optimal combination minimizing the branch delay; solve the optimization according to the state-transition equation Low_Cost_j = Low_Cost × device_{0..j}, and record the corresponding sub-model combination of Low_Cost_j; after Node_n is traversed, Low_Cost_n is the estimated minimum delay, and querying the recorded sub-model combinations yields the deep learning model partition deployment scheme.
The invention relates to a model deployment device, which comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
An electronic device of the present invention includes: a processor; a memory for storing computer program instructions; the computer program instructions, when loaded and executed by the processor, cause the processor to perform any of the above-described model deployment methods of the invention.
A readable storage medium of the present invention stores computer program instructions which, when loaded and executed by a processor, cause the processor to perform any of the above-described model deployment methods of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
in the model deployment method provided by the invention, a large-parameter model to be deployed is divided into many small operator models, and sub-models are obtained from these operator models with a preset planning method, so that the sub-models fit the memory capacity of computing devices of different computing power, non-computational overhead is reduced, and the capability of existing computing devices is fully exploited. The method can thus deploy models with large parameter counts on devices with limited computing capacity. It should be noted that deploying the model across different devices naturally introduces communication overhead, but this overhead is far smaller than the computation overhead of the model itself; on a cloud computing platform the communication overhead is negligible, and compared with work that loads parameters from disk in blocks, the communication overhead is far smaller than the loading overhead.
In an embodiment of the further improved model deployment method, a specific model segmentation algorithm is provided that can generate configuration schemes for high-load and low-load conditions, called the low-service-delay-first (latency-first) scheme and the high-throughput-first (throughput-first) scheme respectively; the schemes can be generated in advance and adjusted per device, and once generated, the service provider can flexibly adjust them as required.
In an embodiment of the further improved model deployment method, for the common scenario in which the computing devices have nearly identical computing power, a lower-complexity dynamic programming algorithm is provided, which obtains the globally optimal device segmentation scheme more efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic block flow diagram of a model deployment method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of operator fusion according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating low service latency precedence in an embodiment of the present invention;
FIG. 4 is a flow chart illustrating high throughput first in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a deployment result according to an embodiment of the present invention.
Detailed Description
In order to make the purpose, technical effect and technical solution of the embodiments of the present invention clearer, the following clearly and completely describes the technical solution of the embodiments of the present invention with reference to the drawings in the embodiments of the present invention; it is to be understood that the described embodiments are only some of the embodiments of the present invention. Other embodiments, which can be derived by one of ordinary skill in the art from the disclosed embodiments without inventive faculty, are intended to be within the scope of the invention.
The model deployment method of the embodiment of the invention comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
Machine learning has become the most important and prevalent technique in modern data-driven applications, and recent advances in deep learning have yielded unprecedented results in a variety of challenging tasks, such as image/video classification and natural language processing. In recent years, researchers have continually introduced sophisticated network models to achieve better generalization capability and accuracy, but with them come ever-larger parameter counts and greater computational complexity: the GPT-3 model proposed by OpenAI has a striking 175 billion parameters, and the model occupies more than 700 GB of disk storage. With the development of deep learning, researchers have found that computational resources are always limited. In order to enable deep learning models to operate in limited-resource scenarios, the embodiment of the invention provides a novel model deployment method that divides the model to be deployed into small units and deploys them on devices with weak computing power, so that the computing power of the devices can be better exploited and read/write overhead reduced. Specifically, the embodiment of the invention discloses a flexible deep learning model deployment method, so that a deep learning model can be fully compatible with devices of different computing power, making full use of their computing power to obtain optimal running efficiency and overall throughput; the invention can fully account for the computing-power differences among powerful cloud devices, weaker edge devices and even embedded devices, and obtains the optimal overall throughput.
In the embodiment of the invention, the overhead mainly comes from two parts, namely the execution time of the model and the communication time between devices; wherein the execution time is typically much larger than the communication time.
In the embodiment of the present invention, two optimization scenarios are set, including:
(1) in a low-frequency request scene, a model deployment method should provide lower prediction delay for a single request of a user;
(2) in a high-frequency request scenario, the model deployment method should optimize the throughput rate as a whole and provide services to more users in a unit time under the condition of ensuring that single prediction delay is acceptable.
It is worth noting that when the throughput rate is optimized, all the devices form a relatively complex pipeline, and the throughput rate depends on the slowest stage in the pipeline; therefore, optimization schemes in high-throughput scenarios generally also tend to balance the load across the devices.
In the embodiment of the invention, once the available resources and the model to be deployed are determined, a low-service-delay-first (latency-first) scheme and a high-throughput-first (throughput-first) scheme can be generated, and the corresponding theoretical delays Time_latency and Time_throughput are calculated at the same time. In actual operation, the latency-first scheme is adopted initially; when the actual running time of a user request exceeds the theoretical delay Time_throughput of the throughput-first scheme, the system switches to the throughput-first scheme and records the current request number as Number_flag. Definition: when the user request number Number_request > Number_flag, the environment is a high-load environment and the high-throughput-first (throughput-first) scheme is adopted; when Number_request < Number_flag, the environment is a low-load environment and the low-service-delay-first (latency-first) scheme is adopted.
In the embodiment of the invention, suppose there are n operator models in total; the calculation delay of the i-th operator model is L_comp,i, and the communication delay between the i-th operator model and the (i+1)-th operator model is L_comm,i.
The optimization function under the low-delay scheme is min{Σ_i (L_comp,i + L_comm,i)}: the operator models are combined into different sub-models, and the operator-model combination with the minimum total delay is obtained, so that the overall delay is lowest. Because the devices under the throughput-first scheme form a more complex pipeline, and the throughput depends on the slowest stage of the pipeline, the optimization function in this case is min{n × max_i{L_comp,i, L_comm,i}}, which represents, among all operator-model combinations, the combination that minimizes the product of the number of sub-models and the delay of the sub-model with the largest delay.
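As a hedged illustration of the two objectives, the following sketch evaluates both for one hypothetical partition of a model into three sub-models; all delay values (milliseconds) are invented for illustration and are not measurements from the invention:

```python
def latency_objective(stage_compute, stage_comm):
    """Latency-first: minimize the total of compute plus communication delays."""
    return sum(stage_compute) + sum(stage_comm)

def throughput_objective(stage_compute, stage_comm):
    """Throughput-first: the pipeline is limited by its slowest stage, so the
    cost is (number of stages) x (delay of the slowest stage)."""
    slowest = max(max(stage_compute), max(stage_comm))
    return len(stage_compute) * slowest

stage_compute = [12.0, 9.0, 15.0]   # per-sub-model compute delay (ms)
stage_comm = [1.5, 0.8, 0.0]        # delay of sending output to the next stage

print(latency_objective(stage_compute, stage_comm))     # 38.3
print(throughput_objective(stage_compute, stage_comm))  # 45.0
```

A search over all partitions would evaluate one of these objectives per candidate, depending on which scheme is active.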
Referring to fig. 1, a model deployment method according to an embodiment of the present invention is used to deploy a deep neural network model to be deployed on a resource-defined device set for deploying the model, and specifically includes the following steps:
screening to obtain an operator model set of the deep neural network model to be deployed, and performing parallel post-processing to obtain a processed operator model set; wherein the post-processing comprises: performing operator fusion or operator segmentation on the operator model meeting the preset condition (comparing the parameter number of the operator model with the memory of the minimum memory device); each operator model carries out delay statistics on each device in the device set to obtain a running time set; combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set; and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
The deep learning model has an obvious characteristic: various easily distinguishable operators form the overall model according to a specific topological structure. If the operators are regarded as nodes and the connections between operators are regarded as edges, the topological structure of a deep learning model can be regarded as a typical directed graph, in which nodes and edges carry respective values representing the overhead of computation and communication. Because memory capacities differ, different devices have different capacities for hosting models; for a device supporting virtual memory, although a model with a larger parameter count can be run through the virtual memory, a large number of page faults are generated, which brings huge overhead. So if too many nodes are included in a sub-model, the overhead will eventually be exacerbated by page faults.
In the method provided by the embodiment of the invention, the nodes and the edges are divided into the sub-models, so that the total overhead is minimum.
In the embodiment of the present invention, the specific step of selecting the granularity includes: when a deep learning model is partitioned, too coarse a granularity may miss potentially optimal partitions and may prevent the model from fitting entirely in the memory of a device, while too fine a granularity slows the search for the optimal strategy. In a deep learning model, selecting a single-layer neural network as the basic granularity is a natural choice; each partitioned sub-model will then contain one or more network layers. However, in some cases, special adjustments are required.
In the embodiment of the present invention, the operator fusion specifically includes the following steps: in addition to computationally intensive network layers such as convolution and matrix multiplication, neural networks also use small network layers with little or no weight, for example merge, batch normalization, and ReLU activation. These layers, which contribute little to memory usage, are fused with their neighboring convolution or matrix-multiplication layers to reduce the number of elementary units for partitioning. Such layer fusion is a commonly used performance optimization in many deep learning frameworks, for example TVM and TensorFlow. As shown in fig. 2, Conv denotes a convolution layer, BN denotes a batch normalization layer, and Relu denotes a linear rectification function.
In the embodiment of the present invention, the step of processing large-parameter operators specifically includes: an operator with an excessive number of parameters is split into several parallel operators to reduce the parameter count of any single operator; the outputs of these operators can be merged and are equal to the output of the original operator. Based on empirical tuning, with SM denoting the memory size of the device with the smallest memory among all devices to be deployed, an operator whose parameter size exceeds 0.5 SM is generally split until each split operator is smaller than 0.5 SM.
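A hedged sketch of this splitting rule for a fully connected operator: the weight matrix is split column-wise into parallel operators whose concatenated outputs reproduce the original output exactly. The sizes, the helper name `split_dense_operator`, and the parameter budget below are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def split_dense_operator(W, max_params):
    """Split weight matrix W (in_dim x out_dim) into column blocks so that
    each block holds at most max_params parameters."""
    in_dim, out_dim = W.shape
    cols_per_part = max(1, max_params // in_dim)
    return [W[:, i:i + cols_per_part] for i in range(0, out_dim, cols_per_part)]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal(64)

parts = split_dense_operator(W, max_params=64 * 8)   # each part <= 512 params
merged = np.concatenate([x @ p for p in parts])      # run parts in parallel, merge

assert np.allclose(merged, x @ W)                    # equals the original output
print(len(parts))                                    # 4 parallel operators
```

In a real deployment each block would live on a separate device, and the merge step would be a concatenation at the convergence node.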
In the embodiment of the present invention, the branch processing specifically includes: deep learning models such as ResNet, Inception, and DenseNet employ complex topologies that are not simple linear sequences but contain branches. These branches are individually labeled in embodiments of the present invention. In computing performance, such parallel branch units are treated as the same pipeline stage, and their delay depends on the slowest one.
In the embodiment of the invention, in order to obtain the optimal division scheme, the operator calculation delay L_compute and the communication delay L_communicate must be obtained accurately, and from them the combinations of operators (i.e., sub-models) that may run on the same device. There is communication overhead between sub-models, but none within a sub-model. Usually, an operator running on a device has two delays: the delay under low page-fault pressure and the delay under high page-fault pressure.
In the embodiment of the invention, when calculating the operator delay, if the total memory usage of a sub-model exceeds the available memory size, the operators in that sub-model suffer severe page faults. Therefore, the embodiment of the invention searches for the optimal operator combination scheme by the following method, which comprises operator delay statistics and sub-model delay estimation.
In the embodiment of the invention, the operator delay statistics step comprises the following: each operator model is run on the device to obtain its delay under low page-fault pressure and its delay under high page-fault pressure. The performance of each operator is compiled and measured separately in both cases with the help of a simple memory-expansion module that occupies most of the memory space. When the memory-expansion module is running, starting the operator yields the running time with page faults, corresponding to the delay under high page-fault pressure; when the memory-expansion module is closed, the running time without page faults corresponds to the delay under low page-fault pressure.
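A hedged sketch of this two-mode measurement harness: the "memory expansion module" is modeled as a large ballast allocation held alive while the operator is timed. On a genuinely memory-constrained device this ballast would induce page faults; the ballast size, the `measure` helper, and the toy operator are illustrative assumptions only:

```python
import time

def measure(op, *, ballast_mb=0, repeats=5):
    """Return the best-of-N wall-clock time of op(), optionally while a
    ballast buffer occupies memory (the high page-fault mode)."""
    ballast = bytearray(ballast_mb * 1024 * 1024)  # held alive during timing
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        op()
        best = min(best, time.perf_counter() - t0)
    del ballast
    return best

toy_operator = lambda: sum(i * i for i in range(100_000))
low_pf_delay = measure(toy_operator)                  # memory module off
high_pf_delay = measure(toy_operator, ballast_mb=64)  # memory module on
print(low_pf_delay > 0 and high_pf_delay > 0)         # True
```

The two measured values would populate the per-operator delay table consulted during sub-model delay estimation.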
In the embodiment of the invention, the sub-model delay estimation step specifically comprises the following: starting from the start node, a sub-model is generated, and the memory usage of each sub-model in the search space is calculated. Deep learning models have very regular program semantics: the memory layout mainly consists of model weights, intermediate results, and input/output, with the model weights accounting for the largest share. For a given model, the memory occupied by the weights is a fixed quantity, and the memory occupied by intermediate results can be obtained from the inputs and outputs of the operators. Sub-model delay estimation: for any given operator combination, the memory usage of its operators is estimated and compared with the device memory size to determine whether a large number of page faults would occur; the delay of the sub-model is obtained by accumulation and recorded. This is repeated until the search space is traversed. The corresponding sub-model delays are then summed into a total delay, and the scheme meeting the optimization target is selected.
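The estimate above can be sketched as follows: memory use is approximated as weights plus the largest intermediate activation, and the per-operator delay table supplies the high page-fault column when the estimate exceeds device memory. All numbers (parameter sizes, delays, device memory) are invented for illustration:

```python
def estimate_submodel_delay(ops, device_mem_mb):
    """ops: list of dicts with weight_mb, act_mb, delay_low_ms, delay_high_ms.
    Picks the high page-fault delay column if the sub-model cannot fit."""
    mem = sum(o["weight_mb"] for o in ops) + max(o["act_mb"] for o in ops)
    key = "delay_high_ms" if mem > device_mem_mb else "delay_low_ms"
    return sum(o[key] for o in ops)

ops = [
    {"weight_mb": 40, "act_mb": 8, "delay_low_ms": 5.0, "delay_high_ms": 60.0},
    {"weight_mb": 90, "act_mb": 4, "delay_low_ms": 7.0, "delay_high_ms": 95.0},
]
print(estimate_submodel_delay(ops, device_mem_mb=256))  # fits: 12.0
print(estimate_submodel_delay(ops, device_mem_mb=64))   # thrashes: 155.0
```

The point of estimating rather than running each candidate sub-model is precisely the preprocessing saving claimed later in the description.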
In the embodiment of the invention, based on the obtained normal running cost of each operator on different devices and its running cost when a large number of page faults occur, a scheme meeting the requirement is selected from the search space. The topological structure of current deep learning models is a directed acyclic graph (DAG); in the following, operators are regarded as nodes, and Node_i denotes the i-th operator. In order to obtain all sub-model combinations, a backtracking method is used for the search, and the search space is represented as a search-tree structure.
Referring to fig. 3, in the embodiment of the present invention, assuming that the optimization target of the current scheme is the minimum delay, only the synthesis of the divided delays of the model needs to be minimized under the current device resource, and the steps are as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; each node with branches has multiple egress nodes, each convergence node has multiple ingress nodes, and these appear in the topology graph as nodes with more than two edges;
step two, from Node_1 to Node_n: for Node_i, query the topology graph to obtain the branch nodes connected to it, and for each branch node Node_{i+1}, generate two branches representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models;
step three, DFS traversal is carried out on the search tree generated in the step two, and a segmentation scheme is obtained;
step four, for Node_i, if it has multiple branch nodes, find the ingress node Node_j at which all branch nodes converge; schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, finally obtaining x schemes.
Step five, traverse the schemes; for the a-th scheme Plan_a, if the number of devices D_a required by the scheme satisfies D_a > D (D being the number of available devices), eliminate the scheme.
Step six, traverse the remaining schemes; for each scheme Plan_a and each sub-model G_a in the scheme, calculate the overhead Cost_ab of the corresponding operators on each device device_b. For Plan_a, combine the Cost_ab values over the different devices and accumulate them to obtain the total overhead Total_Cost; the combination minimizing Total_Cost is the optimal combination of Plan_a, recorded as the final partition scheme Opti_Plan_a, and its minimum overhead is recorded as Low_Cost_a.
And step seven, compare the Low_Cost_a values of all x Opti_Plan_a schemes, find the candidate partition scheme Opti_Plan with the minimum Low_Cost, and output it as the optimal partition scheme.
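A hedged sketch of this search, simplified to a linear (branch-free) operator chain: each boundary between adjacent operators either cuts (different sub-models, paying communication) or does not, which enumerates the leaves of the search tree. The toy ×10 page-fault penalty for an oversized sub-model, the delay numbers, and the device limit are illustrative assumptions:

```python
from itertools import product

def best_latency_partition(ops, comm_ms, device_mem_mb, max_devices):
    """ops[i] = (weight_mb, delay_ms). A cut after operator i costs comm_ms.
    A sub-model whose weights exceed device memory thrashes (x10 delay)."""
    n = len(ops)
    best = (float("inf"), None)
    for cuts in product([0, 1], repeat=n - 1):       # search-tree leaves
        if sum(cuts) + 1 > max_devices:              # step five: too many devices
            continue
        total, start = 0.0, 0
        for i, cut in enumerate(list(cuts) + [1]):   # final boundary always cuts
            if cut:
                seg = ops[start:i + 1]
                w = sum(x[0] for x in seg)
                d = sum(x[1] for x in seg)
                total += d * 10.0 if w > device_mem_mb else d
                start = i + 1
        total += sum(cuts) * comm_ms
        best = min(best, (total, cuts))
    return best

ops = [(60, 4.0), (60, 6.0), (60, 3.0), (60, 5.0)]   # (weight MB, delay ms)
print(best_latency_partition(ops, comm_ms=2.0, device_mem_mb=128, max_devices=3))
# (20.0, (0, 1, 0)): cut once after the second operator
```

Here no cut thrashes (180 ms) while one well-placed cut pays 2 ms of communication to keep both sub-models in memory, which is exactly the trade-off the search explores.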
Preferably, in the embodiment of the present invention, the minimum-delay optimization scheme for identical devices includes: when the device configurations in the application scenario are completely the same, an optimal deep learning model segmentation scheme can be obtained with a lower-complexity dynamic programming algorithm. Low_Cost_i denotes the combination with the lowest total delay among all sub-model combinations of operators 0 through i. The state transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) }
For the n-th operator, the recurrence always builds on the best cases of the first n-1 operator combinations. Define Low_Cost_0 = 0 and L_communicate(0) = 0, with L_communicate = Data_Size × Coefficient, where L_compute(i..j) is the calculation delay of the sub-model formed by combining operators i through j, Cost_i is the calculation delay of the i-th operator, L_communicate is the data-transmission delay, Data_Size is the size of the data to be transmitted in MB, and Coefficient is a constant that varies with network bandwidth (for 1 Gbps Ethernet the peak is about 8 ms/MB). A Low_Cost table saves each calculated Low_Cost_i, and a Position table records, for Node_j, the position i selected for Low_Cost_i.
In the above embodiment of the present invention, the step of obtaining the optimal deep learning model segmentation scheme by using the dynamic programming algorithm is as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes.
Step two, starting from Node_1: for Node_i, if Node_i is the parent node of branch nodes, find the convergence node Node_j of the branches; the algorithm is called recursively on the nodes between Node_i and Node_j, with the optimization target min{max{Low_Cost_i..j}} (the minimum delay depends on the branch with the largest delay, so the best combination minimizes the delay of that branch).
Step three, for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination corresponding to Low_Cost_j.
Step four, after all nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition and deployment scheme is obtained by querying the records.
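A hedged sketch of this dynamic program for a linear operator chain on identical devices. `L_compute(i, j)` is assumed to be the measured delay of the sub-model holding operators i..j-1; here it is modeled with a toy ×10 page-fault penalty when the sub-model's weights exceed device memory, which is an illustrative assumption rather than the patent's measurement procedure:

```python
def dp_min_latency(n, L_compute, L_communicate):
    """Low_Cost[j] = min_i { Low_Cost[i] + L_compute(i..j) + L_communicate(i) }."""
    low_cost = [0.0] * (n + 1)          # Low_Cost table
    position = [0] * (n + 1)            # Position table: chosen split point i
    for j in range(1, n + 1):
        cands = [(low_cost[i] + L_compute(i, j) + L_communicate(i), i)
                 for i in range(j)]
        low_cost[j], position[j] = min(cands)
    return low_cost[n], position

weights, delays = [60, 60, 60, 60], [4.0, 6.0, 3.0, 5.0]

def L_compute(i, j):                    # x10 thrashing penalty over 128 MB
    penalty = 10.0 if sum(weights[i:j]) > 128 else 1.0
    return sum(delays[i:j]) * penalty

L_communicate = lambda i: 0.0 if i == 0 else 2.0   # cut before operator i

print(dp_min_latency(4, L_compute, L_communicate))  # (20.0, [0, 0, 0, 1, 2])
```

This reaches the same 20.0 ms optimum as exhaustive enumeration would on this chain, but in O(n²) evaluations of the recurrence rather than O(2^n) leaves.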
Referring to fig. 4, in the embodiment of the present invention, assuming that the optimization target of the current scheme is the highest throughput rate, at this time, the delay depends on the device with the longest operation time, and the delays of the submodels on the devices need to be equalized as much as possible, the steps are as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; each node with branches has multiple egress nodes, each convergence node has multiple ingress nodes, and these appear in the topology graph as nodes with more than two edges;
step two, from Node_1 to Node_n: for Node_i, query the topology graph to obtain the branch nodes connected to it, and for each branch node Node_{i+1}, generate two branches representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models;
step three, DFS traversal is carried out on the search tree generated in the step two, and a segmentation scheme is obtained;
step four, for Node_i, if it has multiple branch nodes, find the ingress node Node_j at which all branch nodes converge; schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, finally obtaining x schemes.
Step five, traverse the schemes; for the a-th scheme Plan_a, if the number of devices D_a required by the scheme satisfies D_a > D (D being the number of available devices), eliminate the scheme.
Step six, traverse the remaining schemes; for each scheme Plan_a and each sub-model G_a in the scheme, calculate the overhead Cost_ab of the corresponding operators on each device device_b. For Plan_a, combine the Cost_ab values over the different devices and multiply the maximum device overhead Cost_ab by the number D of consumed devices to obtain the total overhead Total_Cost; the combination minimizing Total_Cost is the optimal combination of Plan_a, recorded as the final partition scheme Opti_Plan_a, and its minimum overhead is recorded as Low_Cost_a.
And step seven, compare the Low_Cost_a values of all x Opti_Plan_a schemes, find the candidate partition scheme Opti_Plan with the minimum Low_Cost, and output it as the optimal partition scheme.
In the embodiment of the invention, the same equipment optimization scheme with the highest throughput rate comprises the following steps: when the equipment configuration in the application scene is completely the same, a dynamic programming algorithm with lower complexity can be adopted to obtain a deep learning model segmentation scheme with the optimal throughput rate. The core idea is to minimize the delay of the submodel with the highest delay, i.e. load balancing.
The state transition equation is expressed as:
Low_Cost_n = min{ max{Low_Cost_0, L_compute(0..n) × (d_0 + 1)}, max{Low_Cost_1, L_compute(1..n) × (d_1 + 1)}, ..., max{Low_Cost_{n-1}, L_compute(n-1..n) × (d_{n-1} + 1)} }
In general, L_communicate ≪ L_compute, so the delay of L_communicate is not considered. The n-th operator forms a sub-model with some preceding operators, and the delay is determined by the maximum sub-model delay; Low_Cost is always recurred from the best cases of the first n operator combinations. Define Low_Cost_0 = 0 and L_communicate(0) = 0, where L_compute(i..j) is the calculation delay of the sub-model formed by combining operators i through j, Cost_i is the calculation delay of the i-th operator, L_communicate is the data-transmission delay, and d_n is the number of devices occupied by the optimal scheme for operators 1 through n. A Low_Cost table saves each calculated Low_Cost_i, a Position table records, for Node_j, the position i selected for Low_Cost_i, and a Number table records the number of devices required by the corresponding scheme.
In the above embodiment of the present invention, the step of obtaining the deep learning model segmentation scheme with the optimal throughput rate by using the dynamic programming algorithm is as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; their connectivity is represented by an n × n matrix M. Each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes.
Step two, starting from Node_1: for Node_i, if Node_i is the parent node of branch nodes, find the convergence node Node_j of the branches; the algorithm is called recursively on the nodes between Node_i and Node_j, with the optimization target min{Number_i..j × max{Low_Cost_i..j}} (the minimum delay depends on the branch with the largest delay, so the best combination minimizes the delay of that branch).
Step three, for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost × device_0..j, and record the sub-model combination corresponding to Low_Cost_j.
Step four, after all nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition and deployment scheme can be obtained by querying the recorded combinations.
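A hedged sketch transcribing the throughput state transition equation above for a linear chain, ignoring L_communicate as the text suggests. The Low_Cost, Position, and Number tables are maintained as described; the toy memory-penalty model supplying `L_compute(i, j)` is an illustrative assumption:

```python
def dp_max_throughput(n, L_compute):
    """Returns (Low_Cost_n, devices used, Position table), where
    Low_Cost[j] = min_i { max(Low_Cost[i], L_compute(i..j) * (d_i + 1)) }."""
    low_cost = [0.0] * (n + 1)   # Low_Cost table
    devices = [0] * (n + 1)      # Number table: devices used by best scheme
    position = [0] * (n + 1)     # Position table
    for j in range(1, n + 1):
        best, arg = float("inf"), 0
        for i in range(j):
            cand = max(low_cost[i], L_compute(i, j) * (devices[i] + 1))
            if cand < best:
                best, arg = cand, i
        low_cost[j], position[j] = best, arg
        devices[j] = devices[arg] + 1
    return low_cost[n], devices[n], position

weights, delays = [60, 60, 60, 60], [4.0, 6.0, 3.0, 5.0]

def L_compute(i, j):   # sub-model thrashes (x10) if its weights exceed 128 MB
    penalty = 10.0 if sum(weights[i:j]) > 128 else 1.0
    return sum(delays[i:j]) * penalty

print(dp_max_throughput(4, L_compute))  # (15.0, 3, [0, 0, 0, 2, 3])
```

Compared with the latency-first result, this solution spends an extra device to shrink the slowest stage, reflecting the load-balancing goal of the throughput scheme.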
Referring to fig. 5, the present invention successfully partitions a large-parameter model into small sub-models, so that the sub-models fit the memory capacity of the computing devices, thereby reducing non-computation overhead and fully exploiting the capabilities of the existing computing devices.
The invention preferably adopts the ONNX model protocol to store the deep learning model, which can be loaded into various deep learning frameworks, interfaces with each framework's API, and is exported to various hardware back ends. Communication among the computing devices uses the TCP protocol, requires no complex data-format conversion, and has good compatibility and universality. Deploying the model on different devices naturally introduces communication overhead, but the introduced communication overhead is far smaller than the computation overhead of the model; for a cloud computing platform, the communication overhead is negligible. Compared with work that loads parameters from the hard disk in blocks, the communication overhead is far smaller than the loading overhead.
The invention provides a more flexible model segmentation algorithm that can generate configuration schemes under both high-load and low-load conditions, supports presetting device parameters in advance, and adjusts the schemes according to the devices. Once a scheme is generated, the service provider can flexibly adjust it according to demand. For the common scenario in which the computing power of the devices is nearly identical, the invention provides a lower-complexity dynamic programming algorithm that obtains a globally optimal device segmentation scheme more efficiently. The invention further provides a delay estimation model that estimates the running time of a sub-model on different devices rather than running the sub-model directly to collect statistics, greatly reducing preprocessing cost.
The model deployment device of the embodiment of the invention comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art can make modifications and equivalents to the embodiments of the present invention without departing from the spirit and scope of the present invention, which is set forth in the claims of the present application.
Claims (10)
1. A method of model deployment, comprising the steps of:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
2. The model deployment method according to claim 1, wherein the step of obtaining the operator model set of the deep neural network model to be deployed specifically comprises:
and selecting a single-layer neural network as basic granularity, dividing the deep neural network model to be deployed, and obtaining an operator model set.
3. A model deployment method according to claim 1,
the method for processing the operator model meeting the preset conditions in the operator model set through operator fusion or operator segmentation comprises the following steps of:
comparing the parameter quantity of each operator model in the operator model set with the memory of the device with the smallest memory in the device set; performing operator segmentation on operator models whose parameter quantity is larger than 1/2 of the memory until the parameter quantity is smaller than 1/2 of the memory; and performing operator fusion on operator models whose parameter quantity is smaller than 1/10 of the memory until the parameter quantity is greater than 1/10 and less than 1/2 of the memory.
4. The model deployment method according to claim 1, wherein the preset search method is a backtracking search method;
when an operator model in the processed operator model set is combined by adopting a backtracking method searching method, when the actual running time is more than or equal to the theoretical delay of the high throughput rate priority scheme, adopting the high throughput rate priority scheme; the high throughput rate priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topological structure and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, a topology graph is queried to obtain the branch nodes connected to Node_i, a search tree is obtained, and DFS traversal is performed on the search tree to obtain segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, obtaining the final x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the number of available devices is removed, obtaining the existing scheme set;
traverse the set of existing solutions: for each scheme, calculating the cost of the operator model corresponding to each sub-model on each device; combining the expenses on different devices, multiplying the maximum device expenses by the number of consumed devices to obtain total expenses, finding out a combination which enables the total expenses to be minimum as an optimal combination, and recording the optimal combination as a dividing and segmenting scheme; and comparing the minimum expenses of all the partition and segmentation schemes, and taking the partition and segmentation scheme corresponding to the minimum expenses as a final partition and segmentation scheme.
5. The model deployment method according to claim 4, wherein when a backtracking search method is used to combine the operator models in the processed operator model set, when the actual running time is less than the theoretical delay of the high throughput rate priority scheme, a low service delay priority scheme is used; the low service delay priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topological structure and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, a topology graph is queried to obtain the branch nodes connected to Node_i, a search tree is obtained, and DFS traversal is performed on the search tree to obtain segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, obtaining the final x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the number of available devices is removed, obtaining the existing scheme set;
traverse the set of candidate schemes: for each scheme, calculate the overhead of the operator model corresponding to each sub-model on each device; combine the overheads on the different devices by accumulating them to obtain the total overhead, find the combination that minimizes the total overhead as the optimal combination, and record it as the partition-and-segmentation scheme; then compare the minimum overheads of all partition-and-segmentation schemes and take the scheme with the smallest overhead as the final partition-and-segmentation scheme.
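The DFS enumeration of this claim can be sketched for the simple case of a chain-shaped model. This is a simplified illustration under stated assumptions: no branch nodes, so the convergence-node merging step is omitted, and the function name `enumerate_segmentations` is hypothetical. Each edge (Node_i, Node_{i+1}) branches on "same sub-model" versus "different sub-models":

```python
def enumerate_segmentations(n):
    """DFS over the search tree: for each edge (Node_i, Node_i+1), branch on
    'same sub-model' vs 'different sub-models'.  Returns each scheme as a
    list of sub-models, each sub-model a list of node indices 1..n."""
    schemes = []

    def dfs(i, current, done):
        if i == n:                                # all edges decided
            schemes.append(done + [current])
            return
        dfs(i + 1, current + [i + 1], done)       # Node_i+1 joins current sub-model
        dfs(i + 1, [i + 1], done + [current])     # cut: Node_i+1 starts a new sub-model

    dfs(1, [1], [])
    return schemes
```

A chain of n nodes yields 2^(n-1) schemes, from which those exceeding the available device count would then be pruned as the claim describes.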
6. The model deployment method according to claim 1, wherein all devices in the device set used for deploying the model are identical, and the preset search method is a dynamic-programming search method;
when the operator models in the processed operator model set are combined by the dynamic-programming search method and the actual running time is less than the theoretical delay of the high-throughput-priority scheme, a low-service-delay-priority scheme is used; the low-service-delay-priority scheme comprises the following steps:
the nodes are numbered in sequence according to the input-output topology and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents the connection relations among them; each branching node has several outgoing nodes, and each convergence node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the convergence node Node_j of those branches and, on the nodes between Node_i and Node_j, recursively invoke the algorithm whose optimization objective is min{Number_{i..j} × max{Low_Cost_{i..j}}}, finding the optimal combination that minimizes the delay of the branch; solve the optimization according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination that yields Low_Cost_j; after the traversal, Low_Cost_n is the estimated minimum delay, and the records are queried to obtain the partition and deployment scheme of the deep learning model;
wherein Low_Cost_i denotes the lowest total delay over all combinations of the sub-models formed from the 0th operator to the ith operator; the state transition equation of the combination is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) },
Low_Cost_0 = 0,
L_communicate(0) = 0,
L_communicate = Data_Size × Coefficient,
wherein L_compute(i..j) is the computation delay of the sub-model formed by combining the ith to jth operators, Cost_i is the computation delay of the ith operator, L_communicate is the communication delay, Data_Size is the size of the data to be transmitted, Coefficient is a constant that varies with the network bandwidth, and Number_{i..j} denotes the number of machines consumed between Node_i and Node_j.
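The state transition equation above, restricted to a chain of n operators, can be sketched as a short dynamic program. This is an illustrative sketch, not the patented implementation; the names are hypothetical, with `compute(i, j)` standing in for L_compute(i..j) and `comm[i]` for L_communicate(i):

```python
def min_delay_partition(n, compute, comm):
    """Low_Cost dynamic program for a chain of n operators.
    compute(i, j): delay of the sub-model holding operators i..j-1.
    comm[i]: communication delay of shipping operator i's input
    between devices, with comm[0] == 0.
    Returns (Low_Cost_n, list of recorded sub-model start indices)."""
    INF = float("inf")
    low_cost = [0.0] + [INF] * n           # Low_Cost_0 = 0
    choice = [0] * (n + 1)                 # recorded combination (argmin i)
    for j in range(1, n + 1):
        for i in range(j):                 # state transition equation
            c = low_cost[i] + compute(i, j) + comm[i]
            if c < low_cost[j]:
                low_cost[j], choice[j] = c, i
    starts, j = [], n                      # query the records backwards
    while j > 0:
        starts.append(choice[j])
        j = choice[j]
    return low_cost[n], starts[::-1]
```

Walking `choice` backwards from n reproduces the recorded sub-model combination, mirroring the "querying records" step of the claim.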
7. The model deployment method according to claim 6, wherein, when the operator models in the processed operator model set are combined by the dynamic-programming search method and the actual running time is greater than or equal to the theoretical delay of the high-throughput-priority scheme, the high-throughput-priority scheme is used; the high-throughput-priority scheme comprises the following steps:
the nodes are numbered in sequence according to the input-output topology and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents the connection relations among them; each branching node has several outgoing nodes, and each convergence node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the convergence node Node_j of those branches and, on the nodes between Node_i and Node_j, recursively invoke the algorithm whose optimization objective is min{Number_{i..j} × max{Low_Cost_{i..j}}}, finding the optimal combination that minimizes the delay of the branch; solve the optimization according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and record the sub-model combination corresponding to Low_Cost_j; after Node_n has been traversed, Low_Cost_n is the estimated minimum delay, and the recorded sub-model combinations are queried to obtain the partition and deployment scheme of the deep learning model.
8. A model deployment apparatus, comprising:
an operator model set acquisition module, configured to acquire an operator model set of a deep neural network model to be deployed, and to perform operator fusion or operator segmentation on the operator models in the set that meet preset conditions, obtaining a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
9. An electronic device, comprising: a processor; and a memory for storing computer program instructions; wherein
the computer program instructions, when loaded and executed by the processor, cause the processor to perform the model deployment method of any one of claims 1 to 7.
10. A readable storage medium storing computer program instructions, wherein the computer program instructions, when loaded and executed by a processor, cause the processor to perform the model deployment method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110567899.5A CN113220457B (en) | 2021-05-24 | 2021-05-24 | Model deployment method, model deployment device, terminal equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110567899.5A CN113220457B (en) | 2021-05-24 | 2021-05-24 | Model deployment method, model deployment device, terminal equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113220457A true CN113220457A (en) | 2021-08-06 |
CN113220457B CN113220457B (en) | 2024-03-22 |
Family
ID=77098247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110567899.5A Active CN113220457B (en) | 2021-05-24 | 2021-05-24 | Model deployment method, model deployment device, terminal equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220457B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115118592A (en) * | 2022-06-15 | 2022-09-27 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis |
CN115115062A (en) * | 2022-06-29 | 2022-09-27 | 北京百度网讯科技有限公司 | Machine learning model establishing method, related device and computer program product |
CN115981870A (en) * | 2023-03-10 | 2023-04-18 | 之江实验室 | Data processing method and device, storage medium and electronic equipment |
CN116050499A (en) * | 2023-04-03 | 2023-05-02 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Self-adaptive model partitioning method, system and equipment in model parallel training |
CN116070675A (en) * | 2023-03-06 | 2023-05-05 | 西南交通大学 | Side slope neural network model selection method, device, equipment and readable storage medium |
CN116167463A (en) * | 2023-04-26 | 2023-05-26 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
CN116306856A (en) * | 2023-05-17 | 2023-06-23 | 之江实验室 | Deep learning model deployment method and device based on search |
CN116630632A (en) * | 2023-07-25 | 2023-08-22 | 腾讯科技(深圳)有限公司 | Image segmentation model quantization method, device and equipment and computer storage medium |
CN117155791A (en) * | 2023-10-31 | 2023-12-01 | 浪潮电子信息产业股份有限公司 | Model deployment method, system, equipment and medium based on cluster topology structure |
CN117311998A (en) * | 2023-11-30 | 2023-12-29 | 卓世未来(天津)科技有限公司 | Large model deployment method and system |
WO2024014728A1 (en) * | 2022-07-11 | 2024-01-18 | Samsung Electronics Co., Ltd. | Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298437A (en) * | 2019-06-28 | 2019-10-01 | Oppo广东移动通信有限公司 | Separation calculation method, apparatus, storage medium and the mobile terminal of neural network |
US20190325307A1 (en) * | 2018-04-20 | 2019-10-24 | EMC IP Holding Company LLC | Estimation of resources utilized by deep learning applications |
CN110490322A (en) * | 2019-08-14 | 2019-11-22 | 北京中科寒武纪科技有限公司 | Method for splitting and device, the electronic equipment and storage medium of operation node |
CN110674936A (en) * | 2019-09-24 | 2020-01-10 | 上海寒武纪信息科技有限公司 | Neural network processing method and device, computer equipment and storage medium |
CN111340237A (en) * | 2020-03-05 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Data processing and model operation method, device and computer equipment |
CN111738434A (en) * | 2020-06-03 | 2020-10-02 | 中国科学院计算技术研究所 | Method for executing deep neural network on heterogeneous processing unit |
CN112270399A (en) * | 2020-09-29 | 2021-01-26 | 北京百度网讯科技有限公司 | Operator registration processing method and device based on deep learning and electronic equipment |
WO2021057465A1 (en) * | 2019-09-26 | 2021-04-01 | 中兴通讯股份有限公司 | Method and apparatus for performing parallel processing on deep learning model |
CN112686378A (en) * | 2020-12-23 | 2021-04-20 | 展讯通信(上海)有限公司 | Calculation deployment method and device of neural network, storage medium and computer equipment |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190325307A1 (en) * | 2018-04-20 | 2019-10-24 | EMC IP Holding Company LLC | Estimation of resources utilized by deep learning applications |
CN110298437A (en) * | 2019-06-28 | 2019-10-01 | Oppo广东移动通信有限公司 | Separation calculation method, apparatus, storage medium and the mobile terminal of neural network |
CN110490322A (en) * | 2019-08-14 | 2019-11-22 | 北京中科寒武纪科技有限公司 | Method for splitting and device, the electronic equipment and storage medium of operation node |
CN110674936A (en) * | 2019-09-24 | 2020-01-10 | 上海寒武纪信息科技有限公司 | Neural network processing method and device, computer equipment and storage medium |
WO2021057465A1 (en) * | 2019-09-26 | 2021-04-01 | 中兴通讯股份有限公司 | Method and apparatus for performing parallel processing on deep learning model |
CN111340237A (en) * | 2020-03-05 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Data processing and model operation method, device and computer equipment |
CN111738434A (en) * | 2020-06-03 | 2020-10-02 | 中国科学院计算技术研究所 | Method for executing deep neural network on heterogeneous processing unit |
CN112270399A (en) * | 2020-09-29 | 2021-01-26 | 北京百度网讯科技有限公司 | Operator registration processing method and device based on deep learning and electronic equipment |
CN112686378A (en) * | 2020-12-23 | 2021-04-20 | 展讯通信(上海)有限公司 | Calculation deployment method and device of neural network, storage medium and computer equipment |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115118592B (en) * | 2022-06-15 | 2023-08-08 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator feature analysis |
CN115118592A (en) * | 2022-06-15 | 2022-09-27 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis |
CN115115062A (en) * | 2022-06-29 | 2022-09-27 | 北京百度网讯科技有限公司 | Machine learning model establishing method, related device and computer program product |
CN115115062B (en) * | 2022-06-29 | 2023-09-19 | 北京百度网讯科技有限公司 | Machine learning model building method, related device and computer program product |
WO2024014728A1 (en) * | 2022-07-11 | 2024-01-18 | Samsung Electronics Co., Ltd. | Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device |
CN116070675A (en) * | 2023-03-06 | 2023-05-05 | 西南交通大学 | Side slope neural network model selection method, device, equipment and readable storage medium |
CN116070675B (en) * | 2023-03-06 | 2023-06-09 | 西南交通大学 | Side slope neural network model selection method, device, equipment and readable storage medium |
CN115981870A (en) * | 2023-03-10 | 2023-04-18 | 之江实验室 | Data processing method and device, storage medium and electronic equipment |
CN116050499A (en) * | 2023-04-03 | 2023-05-02 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Self-adaptive model partitioning method, system and equipment in model parallel training |
CN116167463A (en) * | 2023-04-26 | 2023-05-26 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
CN116167463B (en) * | 2023-04-26 | 2023-07-07 | 之江实验室 | Distributed model training container scheduling method and device for intelligent computing |
CN116306856B (en) * | 2023-05-17 | 2023-09-05 | 之江实验室 | Deep learning model deployment method and device based on search |
CN116306856A (en) * | 2023-05-17 | 2023-06-23 | 之江实验室 | Deep learning model deployment method and device based on search |
CN116630632A (en) * | 2023-07-25 | 2023-08-22 | 腾讯科技(深圳)有限公司 | Image segmentation model quantization method, device and equipment and computer storage medium |
CN116630632B (en) * | 2023-07-25 | 2023-11-03 | 腾讯科技(深圳)有限公司 | Image segmentation model quantization method, device and equipment and computer storage medium |
CN117155791A (en) * | 2023-10-31 | 2023-12-01 | 浪潮电子信息产业股份有限公司 | Model deployment method, system, equipment and medium based on cluster topology structure |
CN117155791B (en) * | 2023-10-31 | 2024-02-13 | 浪潮电子信息产业股份有限公司 | Model deployment method, system, equipment and medium based on cluster topology structure |
CN117311998A (en) * | 2023-11-30 | 2023-12-29 | 卓世未来(天津)科技有限公司 | Large model deployment method and system |
CN117311998B (en) * | 2023-11-30 | 2024-03-05 | 卓世未来(天津)科技有限公司 | Large model deployment method and system |
Also Published As
Publication number | Publication date |
---|---|
CN113220457B (en) | 2024-03-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113220457B (en) | Model deployment method, model deployment device, terminal equipment and readable storage medium | |
CN114186633B (en) | Distributed training method, device, equipment and storage medium of model | |
US20220129302A1 (en) | Data processing system and method for heterogeneous architecture | |
KR20200113744A (en) | Method and apparatus for partitioning deep neural networks | |
CN108304925B (en) | Pooling computing device and method | |
CN113315669B (en) | Cloud edge cooperation-based throughput optimization machine learning inference task deployment method | |
CN115237580B (en) | Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method | |
Tanaka et al. | Automatic graph partitioning for very large-scale deep learning | |
CN112015765B (en) | Spark cache elimination method and system based on cache value | |
CN112862083B (en) | Deep neural network inference method and device in edge environment | |
CN107528731B (en) | Network segmentation optimization algorithm applied to NS3 parallel simulation | |
CN113986485A (en) | Cross-data center data transmission energy-saving optimization method and system | |
CN115392467B (en) | Cloud edge cooperative self-adaptive depth reasoning method for real-time processing of mass data | |
Henna et al. | Distributed and collaborative high-speed inference deep learning for mobile edge with topological dependencies | |
CN113824650B (en) | Parameter transmission scheduling algorithm and system in distributed deep learning system | |
CN116418808A (en) | Combined computing unloading and resource allocation method and device for MEC | |
CN112601232B (en) | Load balancing multi-service migration method and system based on minimum cost and maximum flow | |
CN114816742A (en) | Request processing method and device, electronic equipment and storage medium | |
CN114035906A (en) | Virtual machine migration method and device, electronic equipment and storage medium | |
CN116306943B (en) | AIoT-oriented multi-task local collaborative reasoning method and system | |
CN117573379B (en) | Micro-service deployment method based on symmetrical scaling merging | |
US20230111791A1 (en) | Artificial intelligence planning method and artificial intelligence planning device | |
CN116980423B (en) | Model scheduling method, device, computing system, equipment and readable storage medium | |
CN110099003B (en) | Parallel routing optimization method under elastic optical network | |
Myna | Heterogeneous adaptive heuristics for graph processing in Geo distributed Data Centre |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20220615 Address after: 518031 room 1410, building 1, Changfu Jinmao building, south side of Shihua Road, Fubao community, Fubao street, Futian District, Shenzhen, Guangdong Applicant after: Shenzhen Zhixin Huaxi Information Technology Co.,Ltd. Address before: 710077 11th floor, building B2, yunhuigu 156, software new town, Tiangu 8th Road, high tech Zone, Xi'an City, Shaanxi Province Applicant before: Cross Information Core Technology Research Institute (Xi'an) Co.,Ltd. |
|
GR01 | Patent grant | ||