CN113220457A - Model deployment method, model deployment device, terminal device and readable storage medium - Google Patents

Model deployment method, model deployment device, terminal device and readable storage medium Download PDF

Info

Publication number
CN113220457A
CN113220457A (application CN202110567899.5A)
Authority
CN
China
Prior art keywords
node
model
operator
scheme
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110567899.5A
Other languages
Chinese (zh)
Other versions
CN113220457B (en)
Inventor
李发兵
林伟伟
李想
毛兴中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixin Huaxi Information Technology Co ltd
Original Assignee
Cross Information Core Technology Research Institute Xi'an Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cross Information Core Technology Research Institute Xi'an Co ltd filed Critical Cross Information Core Technology Research Institute Xi'an Co ltd
Priority to CN202110567899.5A priority Critical patent/CN113220457B/en
Publication of CN113220457A publication Critical patent/CN113220457A/en
Application granted granted Critical
Publication of CN113220457B publication Critical patent/CN113220457B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06N3/105Shells for specifying net layout
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5017Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a model deployment method, a model deployment device, a terminal device and a readable storage medium, wherein the method comprises the following steps: acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set; acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set; combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set; and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set. The invention can be fully compatible with devices with different computing powers, and can improve the operation efficiency and the overall throughput rate.

Description

Model deployment method, model deployment device, terminal device and readable storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to the field of model deployment, and particularly relates to a model deployment method, a model deployment device, terminal equipment and a readable storage medium.
Background
Machine Learning (ML) is one of the fastest growing fields in today's computer science. Typical machine learning techniques train statistical models for specific applications on large sets of pre-collected data, updating the model parameters (also called "weights") until convergence; the trained model is then used for inference, i.e. predicting results on new data. Deep learning based on neural networks is the most widely used ML algorithm due to its excellent results. Neural network models are multi-layered Directed Acyclic Graphs (DAGs), typically consisting of operations such as convolution, matrix multiplication, pooling and batch normalization, connected in linear chains or more complex patterns (e.g., branches and residual connections). The generalization capability and accuracy of neural network models generally improve with deeper topologies and larger network layers, as in ResNet and VGG, but this also results in higher execution delays.
Researchers continuously introduce more complex network models to obtain better generalization ability and accuracy, which increases both the parameter count and the computational complexity of the models; for example, the GPT-3 model proposed by OpenAI has a striking 175 billion parameters, and the hard disk storage space it occupies exceeds 700 GB. Researchers have proposed various system-level optimization schemes to address the performance challenges posed by larger and more complex models: on the hardware side, GPUs and special accelerators (e.g., Google TPUs) are used to support faster computation; on the software side, many frameworks bridge the gap between productivity-centric high-level interfaces and performance-oriented low-level implementations, including TensorFlow, PyTorch, MXNet and TVM.
However, the above studies do not solve the deployment problem of large-parameter models well. Specifically, most current work assumes that the memory and hard disk of the machine running the deep learning model are sufficient, so that the model can be run directly on a single machine to obtain its output; on a cloud server cluster, distributed storage is generally used to obtain better file read/write rates and thus better throughput, but this wastes more time on transmission and cannot fully exploit the computing capability of all devices. With the development of deep learning, researchers find that computational resources are always limited; moreover, the device responsible for computation usually needs strong computing power, while the devices responsible for reading and writing are idle most of the time, which on the one hand causes huge waste and on the other hand makes the throughput rate difficult to improve.
In summary, in order to allow the deep learning model to operate in a restricted resource scenario, a new method, system, device and medium for deploying the deep learning model in the restricted resource scenario are urgently needed.
Disclosure of Invention
The present invention is directed to a model deployment method, a model deployment apparatus, a terminal device and a readable storage medium, so as to solve one or more of the above technical problems. The invention can be fully compatible with devices with different computing powers, and can improve the operation efficiency and the overall throughput rate.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a model deployment method, which comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
The further improvement of the present invention is that the step of obtaining the operator model set of the deep neural network model to be deployed specifically includes:
and selecting a single-layer neural network as basic granularity, dividing the deep neural network model to be deployed, and obtaining an operator model set.
The further improvement of the present invention lies in that the step of performing operator fusion or operator segmentation processing on the operator models satisfying the preset conditions in the operator model set to obtain the processed operator model set specifically comprises:
comparing the parameter quantity of each operator model in the operator model set with the memory of the device with the smallest memory in the device set; performing operator segmentation on operator models whose parameter quantity is larger than the memory, until the parameter quantity is smaller than 1/2 of the memory; and performing operator fusion on operator models whose parameter quantity is smaller than 1/10 of the memory, until the parameter quantity is greater than 1/10 and less than 1/2 of the memory.
The invention has the further improvement that the preset searching method is a backtracking searching method;
when the operator models in the processed operator model set are combined using the backtracking search method, the high throughput rate priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme; the high throughput rate priority scheme specifically comprises the following steps:
the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, for Node_i the topology map is queried to obtain the branch nodes connected to Node_i, yielding a search tree, and DFS traversal of the search tree yields the segmentation schemes; for each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model or in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation outside the Node_i-to-Node_j segment is identical are treated as the same scheme and merged, finally obtaining x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the available number of devices D is removed, yielding the existing scheme set;
traverse the existing set of solutions: for each scheme, calculating the cost of the operator model corresponding to each sub-model on each device; combining the expenses on different devices, multiplying the maximum device expenses by the number of consumed devices to obtain total expenses, finding out a combination which enables the total expenses to be minimum as an optimal combination, and recording the optimal combination as a dividing and segmenting scheme; and comparing the minimum expenses of all the partition and segmentation schemes, and taking the partition and segmentation scheme corresponding to the minimum expenses as a final partition and segmentation scheme.
The invention has the further improvement that, when the operator models in the processed operator model set are combined using the backtracking search method, the low service delay priority scheme is adopted when the actual running time is less than the theoretical delay of the high throughput rate priority scheme; the low service delay priority scheme specifically comprises the following steps:
the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, for Node_i the topology map is queried to obtain the branch nodes connected to Node_i, yielding a search tree, and DFS traversal of the search tree yields the segmentation schemes; for each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model or in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation outside the Node_i-to-Node_j segment is identical are treated as the same scheme and merged, finally obtaining x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the available number of devices D is removed, yielding the existing scheme set;
traverse the existing set of solutions: for each scheme, calculating the cost of the operator model corresponding to each sub-model on each device; combining the expenses on different devices, accumulating to obtain the total expenses, finding out a combination which minimizes the total expenses as an optimal combination, and recording as a division and segmentation scheme; and comparing the minimum expenses of all the partition and segmentation schemes, and taking the partition and segmentation scheme corresponding to the minimum expenses as a final partition and segmentation scheme.
A further improvement of the invention is that all devices in the set of devices for deploying the model are identical; the preset search method is a dynamic programming search method;
when the operator models in the processed operator model set are combined by adopting a dynamic programming search method, when the actual running time is less than the theoretical delay of the high-throughput-rate priority scheme, adopting a low-service delay priority scheme; the low service delay priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents their connection relationships; each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes;
traversal starts from Node_1 and proceeds to Node_n; if Node_i is the parent node of branch nodes, the convergence node Node_j of the branches is found, and the algorithm is recursively invoked on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}}, finding the optimal combination that minimizes the delay of the branch; the optimization is solved according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and the sub-model combination corresponding to Low_Cost_j is recorded; after the traversal, Low_Cost_n is the estimated minimum delay, and the records are queried to obtain the deep learning model partition deployment scheme; Number_{i..j} denotes the number of machines consumed between Node_i and Node_j;
wherein Low_Cost_i denotes the combination with the lowest total delay among all sub-model combinations of the 0th through ith operators, and the state transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) }
Low_Cost_0 = 0,
L_communicate(0) = 0,
L_compute(i..j) = Cost_i + Cost_{i+1} + ... + Cost_j,
L_communicate = Data_Size × Coefficient,
wherein L_compute(i..j) is the computation delay of the sub-model formed by combining the ith through jth operators, Cost_i is the computation delay of the ith operator, L_communicate is the data-transmission delay, Data_Size is the size of the data to be transmitted, and Coefficient is a constant that varies with the network bandwidth.
The invention has the further improvement that when the operator models in the processed operator model set are combined by adopting a dynamic programming search method, when the actual running time is more than or equal to the theoretical delay of the high throughput rate priority scheme, the high throughput rate priority scheme is adopted; the high throughput rate priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents their connection relationships; each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes;
traversal starts from Node_1 and proceeds to Node_n; if Node_i is the parent node of branch nodes, the convergence node Node_j of the branches is found, and the algorithm is recursively invoked on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}}, finding the optimal combination that minimizes the delay of the branch; the optimization is solved according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and the sub-model combination corresponding to Low_Cost_j is recorded; after Node_n has been traversed, Low_Cost_n is the estimated minimum delay, and the recorded sub-model combinations are queried to obtain the deep learning model partition deployment scheme.
The invention relates to a model deployment device, which comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
An electronic device of the present invention includes: a processor; a memory for storing computer program instructions; the computer program instructions, when loaded and executed by the processor, cause the processor to perform any of the above-described model deployment methods of the invention.
A readable storage medium of the present invention stores computer program instructions which, when loaded and executed by a processor, cause the processor to perform any of the above-described model deployment methods of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
in the model deployment method provided by the invention, a large-parameter model to be deployed is divided into a number of small operator models, and sub-models are obtained from these operator models using a preset search method, so that the sub-models can be perfectly adapted to the memory capacity of computing devices with different computing power, reducing non-computation overhead and fully exploiting the capability of the existing computing devices. The model deployment method can deploy models with larger parameter counts on devices with limited computing capacity. It should be noted that deploying the model on different devices naturally introduces communication overhead, but the introduced communication overhead is far smaller than the computation overhead of the model itself; for a cloud computing platform the communication overhead is negligible, and compared with approaches that load parameters from the hard disk in blocks, the communication overhead is far smaller than the loading overhead.
In embodiments of the further improved model deployment method, a specific model segmentation algorithm is provided that can generate configuration schemes for high-load and low-load conditions, called the low service delay-first (latency-first) scheme and the high throughput-first (throughput-first) scheme respectively; presetting device parameters in advance is also supported, and the schemes are adjusted according to the devices. Once a scheme is generated, the service provider can flexibly adjust it according to requirements.
In embodiments of the further improved model deployment method, a lower-complexity dynamic programming algorithm is provided for the common scenario in which the computing devices have almost identical computing power, so that a globally optimal device segmentation scheme can be obtained more efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic block flow diagram of a model deployment method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of operator fusion according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating low service latency precedence in an embodiment of the present invention;
FIG. 4 is a flow chart illustrating high throughput first in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a deployment result according to an embodiment of the present invention.
Detailed Description
In order to make the purpose, technical effect and technical solution of the embodiments of the present invention clearer, the following clearly and completely describes the technical solution of the embodiments of the present invention with reference to the drawings in the embodiments of the present invention; it is to be understood that the described embodiments are only some of the embodiments of the present invention. Other embodiments, which can be derived by one of ordinary skill in the art from the disclosed embodiments without inventive faculty, are intended to be within the scope of the invention.
The model deployment method of the embodiment of the invention comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
Machine learning has become the most important and prevalent technique in modern data-driven applications, and recent advances in deep learning have yielded unprecedented results in a variety of challenging tasks, such as image/video classification and natural language processing. In recent years, researchers have continually introduced sophisticated network models to achieve better generalization capability and accuracy, but with them the number of parameters and the computational complexity of the models keep increasing. The GPT-3 model proposed by OpenAI has a remarkable 175 billion parameters, and the hard disk storage space it occupies exceeds 700 GB. With the development of deep learning, researchers find that computational resources are always limited. In order to enable the deep learning model to operate in a limited-resource scenario, the embodiment of the invention provides a novel model deployment method which can divide the model to be deployed into small units and deploy them on devices with weaker computing power, so that the computing power of the devices can be better exploited and the read/write overhead can be reduced. Specifically, the embodiment of the invention discloses a flexible deep learning model deployment method, so that a deep learning model can be fully compatible with devices of different computing power, fully utilizing the computing power of the devices and obtaining optimal operating efficiency and overall throughput; the invention can fully account for the computing-power differences among powerful devices in cloud computing, weaker devices in edge computing and even embedded devices, and obtains the optimal overall throughput rate.
In the embodiment of the invention, the overhead mainly comes from two parts, namely the execution time of the model and the communication time between devices; wherein the execution time is typically much larger than the communication time.
In the embodiment of the present invention, two optimization scenarios are set, including:
(1) in a low-frequency request scene, a model deployment method should provide lower prediction delay for a single request of a user;
(2) in a high-frequency request scenario, the model deployment method should optimize the throughput rate as a whole and provide services to more users in a unit time under the condition of ensuring that single prediction delay is acceptable.
It is worth noting that when the throughput rate is optimized, all the devices form a relatively complex pipeline, and the throughput rate depends on the slowest pipeline stage; therefore, optimization schemes in high-throughput scenarios also generally tend to balance the load of the various devices.
In the embodiment of the invention, once the available resources and the model to be deployed are determined, a low service delay-first (latency-first) scheme and a high throughput-first (throughput-first) scheme can be generated, and the corresponding theoretical delays Time_latency and Time_throughput are calculated at the same time. In actual operation the low service delay priority scheme is adopted, and when the actual running time of a user request exceeds the theoretical delay Time_throughput of the high throughput rate priority scheme, the system switches to the high throughput rate priority scheme and records the current request number Number_flag. Definition: when the user request number Number_request > Number_flag, the environment is a high-load environment and the high throughput-first (throughput-first) scheme is adopted; when Number_request < Number_flag, the environment is a low-load environment and the low service delay-first (latency-first) scheme is adopted.
In the embodiment of the invention, suppose there are n operator models in total, the computation delay of the ith operator model is L_comp,i, and the communication delay between the ith operator model and the (i+1)th operator model is L_comm,i.
Under the low-delay scheme the optimization function is min{ Σ_i (L_comp,i + L_comm,i) }: the operator models are combined into different sub-models, and the operator-model combination with the minimum delay is obtained so that the overall delay is lowest. Because the devices under the high throughput rate priority scheme form a more complex pipeline and the throughput rate depends on the slowest pipeline stage, the optimization function in that case is min{ n × max_i{L_comp,i, L_comm,i} }, denoting the combination, among all operator-model combinations, that minimizes the product of the delay of the slowest sub-model and the number of sub-models.
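For illustration only, the two optimization functions can be sketched in Python for a linear chain of operator models; the function names, the cut-point representation and the treatment of each transfer as its own pipeline stage are assumptions made for the example, not part of the claimed method:

```python
from typing import List

def latency_first_cost(comp_delay: List[float], comm_delay: List[float],
                       cut_points: List[int]) -> float:
    """Latency-first objective: sum of all operator compute delays plus the
    transfer delay at every cut between consecutive sub-models."""
    return sum(comp_delay) + sum(comm_delay[i] for i in cut_points)

def throughput_first_cost(comp_delay: List[float], comm_delay: List[float],
                          cut_points: List[int]) -> float:
    """Throughput-first objective: the pipeline is limited by its slowest stage,
    so minimize (number of sub-models) x (delay of the slowest stage)."""
    bounds = [0] + [c + 1 for c in cut_points] + [len(comp_delay)]
    stage_delays = [sum(comp_delay[s:e]) for s, e in zip(bounds, bounds[1:])]
    stage_delays += [comm_delay[c] for c in cut_points]   # transfers are stages too
    return (len(bounds) - 1) * max(stage_delays)

comp = [3.0, 5.0, 2.0, 4.0]          # per-operator compute delays (ms), illustrative
comm = [1.0, 1.0, 1.0]               # transfer delay if a cut follows operator i
print(latency_first_cost(comp, comm, cut_points=[1]))     # 14.0 + 1.0 = 15.0
print(throughput_first_cost(comp, comm, cut_points=[1]))  # 2 sub-models x max(8, 6, 1) = 16.0
```

In operation, the latency-first score would drive the low-load scheme and the throughput-first score the high-load scheme, consistent with the switching rule described above.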
Referring to fig. 1, a model deployment method according to an embodiment of the present invention is used to deploy a deep neural network model to be deployed on a resource-defined device set for deploying the model, and specifically includes the following steps:
screening to obtain an operator model set of the deep neural network model to be deployed, and performing parallel post-processing to obtain a processed operator model set; wherein the post-processing comprises: performing operator fusion or operator segmentation on the operator model meeting the preset condition (comparing the parameter number of the operator model with the memory of the minimum memory device); each operator model carries out delay statistics on each device in the device set to obtain a running time set; combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set; and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
The deep learning model has an obvious characteristic: various easily distinguished operators form the overall model according to a specific topological structure. If the operators are regarded as nodes and the connections between operators as edges, the topological structure of the deep learning model can be regarded as a typical directed graph, where nodes and edges have respective values representing the overhead of computation and communication. Because memory capacities differ, the ability of different devices to host models differs; for a device supporting virtual memory, although a model with a larger parameter count can be run through virtual memory, a large number of page faults are generated accordingly, which brings huge overhead. So if too many nodes are included in one sub-model, the overhead will eventually be exacerbated by page faults.
In the method provided by the embodiment of the invention, the nodes and the edges are divided into the sub-models, so that the total overhead is minimum.
In the embodiment of the present invention, the specific step of selecting the granularity includes: when partitioning a deep learning model, too coarse a granularity may miss potentially optimal partitions and may not allow the model to fit entirely in device memory, while too fine a granularity slows the search for the optimal strategy. In a deep learning model, it is natural to select a single-layer neural network as the basic granularity; each sub-model partition will then contain one or more network layers. However, in some cases special adjustments are required.
In the embodiment of the present invention, operator fusion specifically includes the following steps: in addition to computationally intensive network layers such as convolution and matrix multiplication, neural networks also use small network layers with little or no weight, for example merge, batch normalization and ReLU activation; these layers, which do not contribute significantly to memory usage, are fused with their neighboring convolution or matrix-multiplication network layers to reduce the number of elementary units for partitioning. Such layer fusion is a commonly used performance optimization in many deep learning frameworks, for example TVM and TensorFlow. As shown in fig. 2, Conv denotes a convolution layer, BN denotes a batch normalization layer, and Relu denotes a linear rectification function.
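Purely as an illustration of this fusion pass on a linear list of layers (the layer records and the FUSABLE set below are assumptions for the example, not a prescription of the invention):

```python
FUSABLE = {"BatchNorm", "ReLU", "Add"}   # weight-light layers with little or no parameters

def fuse_layers(layers):
    """Merge weight-light layers into the preceding compute-heavy layer, so that
    e.g. Conv+BN+ReLU becomes a single partition unit."""
    fused = []
    for layer in layers:
        if fused and layer["type"] in FUSABLE:
            fused[-1]["fused"].append(layer["name"])   # absorb into the previous unit
        else:
            fused.append({"name": layer["name"], "type": layer["type"],
                          "fused": [], "params": layer.get("params", 0)})
    return fused

units = fuse_layers([
    {"type": "Conv", "name": "conv1", "params": 9408},
    {"type": "BatchNorm", "name": "bn1"},
    {"type": "ReLU", "name": "relu1"},
    {"type": "Conv", "name": "conv2", "params": 36864},
])
print(units)   # two partition units: conv1 (+bn1, +relu1) and conv2
```

After fusion, each remaining unit is what the partitioning stage treats as a single node.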
In the embodiment of the present invention, the step of processing large-parameter operators specifically includes: an operator with an excessive parameter count is split into a plurality of parallel operators to reduce the parameter count of a single operator, and the outputs of these operators can be combined to equal the output of the original operator. Based on empirical tuning, let SM be the memory size of the device with the smallest memory among all devices to be deployed; an operator whose parameter size exceeds 0.5 SM is generally split until each resulting operator is smaller than 0.5 SM.
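A minimal sketch of such splitting for a matrix-multiplication operator is given below, assuming NumPy and an illustrative 8 MB device memory; the weight is split column-wise into parallel operators whose concatenated outputs equal the original operator's output:

```python
import numpy as np

def split_operator(weight: np.ndarray, device_mem_bytes: float):
    """Split a matmul weight (in_features x out_features) column-wise until each
    part's parameter size is below half of the smallest device memory."""
    n_parts = 1
    while weight.nbytes / n_parts > 0.5 * device_mem_bytes:
        n_parts += 1
    return np.array_split(weight, n_parts, axis=1)

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 4096)).astype(np.float32)    # ~16 MB of weights
x = rng.standard_normal((1, 1024)).astype(np.float32)
parts = split_operator(w, device_mem_bytes=8 * 1024 * 1024)  # illustrative 8 MB device
y_split = np.concatenate([x @ p for p in parts], axis=1)     # outputs of the parallel operators
assert np.allclose(y_split, x @ w, atol=1e-4)                # equals the original operator's output
print(len(parts), [p.shape for p in parts])
```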
In the embodiment of the present invention, branch processing specifically includes: deep learning models such as ResNet, Inception and DenseNet employ complex topologies that are not simple linear sequences but contain branches. These branches are individually labeled in embodiments of the present invention. When computing performance, such parallel branch units are treated as the same pipeline stage, and their delay depends on the slowest branch.
In the embodiment of the invention, in order to obtain the optimal partition scheme, the operator computation delay L_compute and the communication delay L_communicate must be obtained accurately, and from them the combinations of operators (i.e., sub-models) that may run on the same device. There is communication overhead between sub-models and no communication overhead within a sub-model. An operator running on one device usually has two delays: the delay under a low number of page faults and the delay under a high number of page faults.
In the embodiment of the invention, when calculating operator delay, if the total memory usage of a sub-model exceeds the available memory size, the operators in the sub-model suffer from severe page faults. Therefore, the embodiment of the invention searches for the optimal operator combination scheme using the following method, which comprises operator delay statistics and sub-model delay estimation.
In the embodiment of the invention, the operator delay statistics step comprises: the operator model is run on the device to obtain its delay under low page-fault pressure and its delay under high page-fault pressure. In both cases, the performance of each operator is compiled and measured separately, using a simple memory-expansion module that by itself occupies most of the memory space. When the memory-expansion module is running, launching the operator yields the running time with page faults, corresponding to the delay under high page-fault pressure; when the memory-expansion module is off, the running time without page faults corresponds to the delay under low page-fault pressure.
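A rough sketch of this measurement is shown below; the ballast size, the operator being timed and the use of a NumPy array as the memory-expansion module are illustrative assumptions, and a real measurement would be tuned to the device:

```python
import time
import numpy as np

def time_operator(run, repeats: int = 10) -> float:
    """Median wall-clock time of one operator invocation, in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

def profile_operator(run, ballast_bytes: int):
    """Measure the operator once normally (low page-fault pressure) and once while
    a large ballast array is resident (approximating high page-fault pressure)."""
    low = time_operator(run)
    ballast = np.ones(ballast_bytes // 8)     # the "memory expansion module"
    ballast[::4096] += 1.0                    # touch pages so they are really allocated
    high = time_operator(run)
    del ballast
    return low, high

w = np.random.rand(2048, 2048).astype(np.float32)
x = np.random.rand(1, 2048).astype(np.float32)
low_delay, high_delay = profile_operator(lambda: x @ w, ballast_bytes=512 * 1024 * 1024)
print(f"low page-fault delay: {low_delay:.6f} s, high page-fault delay: {high_delay:.6f} s")
```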
In the embodiment of the invention, the sub-model delay estimation step specifically comprises: starting from the start node, a sub-model is generated, and the memory usage of the sub-model in the search space is calculated. Deep learning models have very regular program semantics: the memory layout consists mainly of model weights, intermediate results and input/output, of which the model weights account for the largest share. For a given model the memory occupied by the weights is a fixed quantity, and the memory occupied by intermediate results can be obtained by calculating the inputs and outputs of the operators. Sub-model delay estimation: for any given operator combination, the memory usage of the operators is predicted and compared with the device memory size to determine whether a large number of page faults will be generated, and the delay of the sub-model is obtained by accumulation and recorded. The process returns to the sub-model generation step until the search space has been traversed. The corresponding sub-model delays are then summed into a total delay, and the scheme meeting the optimization target is selected.
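A simplified sketch of this estimation follows, assuming each operator record already carries its weight size, activation size and the two measured delays; the field names and numbers are illustrative:

```python
def estimate_submodel(ops, device_mem_bytes):
    """ops: operator records with weight size, activation size and the two measured
    delays.  Returns (estimated memory use, estimated sub-model delay)."""
    weight_mem = sum(op["weight_bytes"] for op in ops)   # weights take the largest share
    act_mem = max(op["act_bytes"] for op in ops)         # intermediate results / input-output
    mem_use = weight_mem + act_mem
    # If the sub-model does not fit in device memory, its operators are assumed to run
    # in the high page-fault regime; otherwise the low page-fault delays are used.
    key = "high_delay" if mem_use > device_mem_bytes else "low_delay"
    return mem_use, sum(op[key] for op in ops)

ops = [
    {"weight_bytes": 300e6, "act_bytes": 20e6, "low_delay": 0.012, "high_delay": 0.090},
    {"weight_bytes": 500e6, "act_bytes": 30e6, "low_delay": 0.020, "high_delay": 0.150},
]
print(estimate_submodel(ops, device_mem_bytes=1e9))    # fits: low page-fault delays are summed
print(estimate_submodel(ops, device_mem_bytes=0.5e9))  # does not fit: high page-fault delays
```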
In the embodiment of the invention, based on the obtained normal running cost of each operator on different devices and its running cost when a large number of page faults occur, a scheme suitable for the requirements is selected from the search space. The topological structure of current deep learning models is a Directed Acyclic Graph (DAG); operators are hereafter referred to as nodes, with Node_i denoting the ith operator. In order to obtain all sub-model combinations, a backtracking search is adopted, and the search space is represented as a search tree.
Referring to fig. 3, in the embodiment of the present invention, assuming that the optimization target of the current scheme is the minimum delay, only the synthesis of the divided delays of the model needs to be minimized under the current device resource, and the steps are as follows:
Step one: the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_n. Each node with branches has multiple egress nodes, each convergence node has multiple ingress nodes, and in the topology graph this appears as more than two edges;
Step two: starting from Node_1 and proceeding to Node_n, for Node_i the topology graph is queried to obtain the branch nodes connected to it, and for every branch node Node_{i+1} two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model or in different sub-models;
step three, DFS traversal is carried out on the search tree generated in the step two, and a segmentation scheme is obtained;
Step four: for Node_i, if it has multiple branch nodes, the ingress node Node_j at which all branch nodes converge is found; schemes whose segmentation outside the Node_i-to-Node_j segment is identical are treated as the same scheme and merged, finally obtaining x schemes.
Step five: the schemes are traversed; for the a-th scheme Plan_a, if its required number of devices D_a > D (where D is the number of available devices), the scheme is eliminated.
Step six: the remaining schemes are traversed; for scheme Plan_a, for each sub-model G_a in the scheme, the overhead Cost_ab of its corresponding operators on each device device_b is calculated. For Plan_a, the Cost_ab values on different devices are combined and accumulated to obtain the total cost Total_Cost, and the combination that minimizes Total_Cost is taken as the optimal combination of Plan_a, recorded as the final partition scheme Opti_Plan_a, with its minimum overhead denoted Low_Cost_a.
Step seven: the Low_Cost values of all x Opti_Plan schemes are compared, the candidate partition scheme Opti_Plan with the smallest Low_Cost is found, and it is output as the optimal partition scheme.
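A compact sketch of steps one through seven for a purely linear operator chain (no branches) is given below; the cost model, the exhaustive device assignment in step six and all names are illustrative assumptions rather than the claimed procedure:

```python
from itertools import permutations, product

def enumerate_schemes(n_ops):
    """Every boundary between consecutive operators is either a cut (different
    sub-models) or not (same sub-model): the two-way branching of the search tree
    for a linear chain of operators."""
    for cuts in product([False, True], repeat=n_ops - 1):
        bounds = [0] + [i + 1 for i, c in enumerate(cuts) if c] + [n_ops]
        yield [tuple(range(s, e)) for s, e in zip(bounds, bounds[1:])]

def best_latency_plan(op_cost, comm_cost, n_devices):
    """op_cost[d][i]: delay of operator i on device d; comm_cost: per-cut transfer
    delay.  Returns (minimum total delay, partition scheme)."""
    best_cost, best_scheme = float("inf"), None
    for scheme in enumerate_schemes(len(op_cost[0])):
        if len(scheme) > n_devices:                    # step five: too many devices needed
            continue
        # step six: try every assignment of distinct devices to sub-models and
        # accumulate compute costs plus one transfer per cut
        assign = min(sum(sum(op_cost[d][i] for i in sub)
                         for d, sub in zip(devs, scheme))
                     for devs in permutations(range(len(op_cost)), len(scheme)))
        total = assign + comm_cost * (len(scheme) - 1)
        if total < best_cost:                          # step seven: keep the global minimum
            best_cost, best_scheme = total, scheme
    return best_cost, best_scheme

op_cost = [[0.03, 0.05, 0.02, 0.04],                   # device 0
           [0.02, 0.09, 0.03, 0.03]]                   # device 1
print(best_latency_plan(op_cost, comm_cost=0.005, n_devices=2))
```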
Preferably, in the embodiment of the present invention, the identical-device optimization scheme with minimum delay includes: when the device configurations in the application scenario are exactly the same, an optimal deep learning model segmentation scheme can be obtained using a lower-complexity dynamic programming algorithm. Low_Cost_i denotes the combination with the lowest total delay among all sub-model combinations of the 0th through ith operators. The state transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) }
For the nth operator, the recurrence always relies on the best cases of the first n-1 operator combinations. Define Low_Cost_0 = 0, L_communicate(0) = 0,
L_compute(i..j) = Cost_i + Cost_{i+1} + ... + Cost_j,
L_communicate = Data_Size × Coefficient,
where L_compute(i..j) is the computation delay of the sub-model formed by combining the ith through jth operators, Cost_i is the computation delay of the ith operator, L_communicate is the data-transmission delay, Data_Size is the size of the data to be transmitted in MB, and Coefficient is a constant that varies with the network bandwidth (for 1 Gbps Ethernet the peak is about 8 ms/MB). The computed Low_Cost_i values are saved in a Low_Cost table, and a Position table records the position i of the Low_Cost_i selected for Node_j.
In the above embodiment of the present invention, the step of obtaining the optimal deep learning model segmentation scheme by using the dynamic programming algorithm is as follows:
Step one: the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_n. Each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes.
Step two: starting from Node_1, for Node_i, if Node_i is the parent node of branch nodes, the convergence node Node_j of the branches is found, and the algorithm is recursively invoked on the nodes between Node_i and Node_j with optimization target min{max{Low_Cost_{i..j}}}; the minimum delay depends on the branch with the largest delay among the branches, so the optimal combination is found to minimize the delay of this branch.
Step three: for Node_j, the optimization is solved according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and the sub-model combination corresponding to Low_Cost_j is recorded.
Step four: after Node_n has been traversed, Low_Cost_n is the estimated minimum delay, and the records are queried to obtain the deep learning model partition deployment scheme.
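A minimal sketch of this recurrence for a linear operator chain on identical devices follows; the page-fault penalty inside l_compute, the constant comm_delay and all numeric inputs are illustrative assumptions:

```python
def dp_latency_partition(op_low, op_high, op_mem, device_mem, comm_delay):
    """Low_Cost[j]: minimum total delay for the first j operators; Position[j]
    records the start index i of the last sub-model so the plan can be
    reconstructed.  op_low/op_high: per-operator delays in the low / high
    page-fault regimes; op_mem: per-operator weight memory."""
    n = len(op_low)

    def l_compute(i, j):                               # delay of sub-model [i, j)
        key = op_high if sum(op_mem[i:j]) > device_mem else op_low
        return sum(key[i:j])

    low_cost = [0.0] + [float("inf")] * n
    position = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            comm = comm_delay if i > 0 else 0.0        # L_communicate(0) = 0
            cand = low_cost[i] + l_compute(i, j) + comm
            if cand < low_cost[j]:
                low_cost[j], position[j] = cand, i
    # reconstruct the sub-model boundaries from the Position table
    cuts, j = [], n
    while j > 0:
        cuts.append((position[j], j))
        j = position[j]
    return low_cost[n], list(reversed(cuts))

print(dp_latency_partition(op_low=[0.02, 0.03, 0.05, 0.02],
                           op_high=[0.10, 0.15, 0.30, 0.12],
                           op_mem=[400, 300, 600, 200],
                           device_mem=800, comm_delay=0.004))
```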
Referring to fig. 4, in the embodiment of the present invention, assuming that the optimization target of the current scheme is the highest throughput rate, at this time, the delay depends on the device with the longest operation time, and the delays of the submodels on the devices need to be equalized as much as possible, the steps are as follows:
Step one: the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_n. Each node with branches has multiple egress nodes, each convergence node has multiple ingress nodes, and in the topology graph this appears as more than two edges;
Step two: starting from Node_1 and proceeding to Node_n, for Node_i the topology graph is queried to obtain the branch nodes connected to it, and for every branch node Node_{i+1} two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model or in different sub-models;
step three, DFS traversal is carried out on the search tree generated in the step two, and a segmentation scheme is obtained;
Step four: for Node_i, if it has multiple branch nodes, the ingress node Node_j at which all branch nodes converge must be found; schemes whose segmentation outside the Node_i-to-Node_j segment is identical are treated as the same scheme and merged, finally obtaining x schemes.
Step five, traversing the scheme, and aiming at the a-th scheme PlanaNumber of devices D if required by the schemea> D (D is the number of available devices), this scheme is eliminated.
Step six, starting to traverse the existing scheme and the scheme PlanaFor each overlay/submodel G in the schemeaCalculating the corresponding operator at each equipment devicebOverhead Cost onab. For PlanaFor which we Cost at different devicesabCombining, with maximum overhead Cost of the deviceabMultiplying the consumed equipment number D to obtain Total Cost Total _ Cost, and finding a combination which enables the Total _ Cost to be minimum as PlaniaIs optimally combinedDenoted as the final partition scheme Opti _ PlaniaIts minimum overhead is noted as Low _ Costa
And step seven, comparing the Low _ Cost values of all the x Opti _ Plans, finding out the minimum alternative partitioning scheme Opti _ Plan with the minimum Low _ Cost, and outputting the minimum alternative partitioning scheme Opti _ Plan as the optimal partitioning scheme.
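The only difference from the minimum-delay variant is the scoring in step six; that scoring is sketched below, reusing the illustrative op_cost layout from the earlier sketch:

```python
from itertools import permutations

def throughput_score(scheme, op_cost):
    """Step six for the highest-throughput objective: the largest per-device overhead
    in the scheme is multiplied by the number of consumed devices."""
    best = float("inf")
    for devs in permutations(range(len(op_cost)), len(scheme)):   # distinct device per sub-model
        stage_costs = [sum(op_cost[d][i] for i in sub) for d, sub in zip(devs, scheme)]
        best = min(best, max(stage_costs) * len(scheme))
    return best

op_cost = [[0.03, 0.05, 0.02, 0.04],   # device 0
           [0.02, 0.09, 0.03, 0.03]]   # device 1
print(throughput_score([(0, 1), (2, 3)], op_cost))   # two sub-models on two devices
```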
In the embodiment of the invention, the same equipment optimization scheme with the highest throughput rate comprises the following steps: when the equipment configuration in the application scene is completely the same, a dynamic programming algorithm with lower complexity can be adopted to obtain a deep learning model segmentation scheme with the optimal throughput rate. The core idea is to minimize the delay of the submodel with the highest delay, i.e. load balancing.
The state transition equation is expressed as:
Low_Cost_n = min{ max{Low_Cost_0, L_compute(0..n) × (d_0 + 1)}, max{Low_Cost_1, L_compute(1..n) × (d_1 + 1)}, ..., max{Low_Cost_{n-1}, L_compute(n-1..n) × (d_{n-1} + 1)} }
In general, L_communicate << L_compute, so the delay of L_communicate is not considered. The nth operator and some preceding operators form a sub-model, and the delay is determined by the maximum sub-model delay. Low_Cost always recurses on the best cases of the preceding operator combinations. Define Low_Cost_0 = 0, L_communicate(0) = 0,
L_compute(i..j) = Cost_i + Cost_{i+1} + ... + Cost_j,
where L_compute(i..j) is the computation delay of the sub-model formed by combining the ith through jth operators, Cost_i is the computation delay of the ith operator, L_communicate is the data-transmission delay, and d_n denotes the number of devices occupied by the optimal scheme for the 1st through nth operators. The computed Low_Cost_i values are saved in a Low_Cost table, a Position table records the position i of the Low_Cost_i selected for Node_j, and a Number table records the number of devices required by the corresponding scheme.
In the above embodiment of the present invention, the step of obtaining the deep learning model segmentation scheme with the optimal throughput rate by using the dynamic programming algorithm is as follows:
Step one: the nodes are numbered in sequence according to the input-output topology and denoted Node_1, Node_2, ..., Node_n, and their connectivity is represented by an n × n matrix M. Each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes.
Step two: starting from Node_1, for Node_i, if Node_i is the parent node of branch nodes, the convergence node Node_j of the branches is found, and the algorithm is recursively invoked on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}}; the minimum delay depends on the branch with the largest delay among the branches, so the optimal combination is found to minimize the delay of this branch.
Step three: for Node_j, the optimization is solved according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and the sub-model combination corresponding to Low_Cost_j is recorded.
Step four: after Node_n has been traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the recorded combinations.
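A sketch of this load-balancing dynamic program for a linear operator chain on identical devices is given below; it neglects communication delay as stated above, enumerates the number of pipeline stages explicitly rather than transcribing the recurrence literally, and charges a stage the high page-fault delays whenever its weights exceed device memory; all inputs are illustrative assumptions:

```python
def dp_throughput_partition(op_low, op_high, op_mem, device_mem, max_devices):
    """best[k][j]: minimum achievable slowest-stage delay when the first j operators
    form exactly k pipeline stages on identical devices.  The throughput objective
    described above, min{ number of stages x slowest stage delay }, is then taken
    over all feasible stage counts."""
    n = len(op_low)

    def stage(i, j):                       # L_compute(i..j) with the page-fault penalty
        key = op_high if sum(op_mem[i:j]) > device_mem else op_low
        return sum(key[i:j])

    INF = float("inf")
    best = [[INF] * (n + 1) for _ in range(max_devices + 1)]
    best[0][0] = 0.0
    for k in range(1, max_devices + 1):
        for j in range(1, n + 1):
            best[k][j] = min(max(best[k - 1][i], stage(i, j)) for i in range(j))
    return min((k * best[k][n], k) for k in range(1, max_devices + 1) if best[k][n] < INF)

plan = dp_throughput_partition(op_low=[0.02, 0.03, 0.05, 0.02, 0.04],
                               op_high=[0.10, 0.15, 0.30, 0.12, 0.20],
                               op_mem=[400, 300, 600, 200, 300],
                               device_mem=900, max_devices=3)
print(plan)   # (objective value, number of devices in the best plan)
```

In this toy example a single device cannot hold all the weights and falls into the high page-fault regime, so splitting across several devices lowers the objective, which is the effect the scheme is designed to exploit.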
Referring to fig. 5, the present invention successfully partitions the large-parameter model into small sub-models, so that the sub-models can be perfectly adapted to the memory capacity of the computing devices, thereby reducing non-computation overhead and fully exploiting the capabilities of the existing computing devices.
The invention preferably uses the ONNX model format to store the deep learning model, which can be loaded into various deep learning frameworks, interfaces with the APIs of those frameworks, and can be output to various hardware back ends. Communication among the various computing devices uses the TCP protocol, requires no complex data-format conversion, and has good compatibility and universality. Because the invention deploys the model on different devices, communication overhead is naturally introduced, but the introduced communication overhead is far smaller than the computation overhead of the model. For a cloud computing platform the communication overhead is negligible, and compared with approaches that load parameters from the hard disk in blocks, the communication overhead is far smaller than the loading overhead.
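Purely as an illustration of this style of deployment, the following sketch loads one partitioned sub-model with onnxruntime and forwards its intermediate output to the next device over a TCP socket; the file name, address and length-prefixed framing are assumptions, not the patented implementation:

```python
import pickle
import socket
import numpy as np
import onnxruntime as ort   # pip install onnxruntime

SUBMODEL_PATH = "submodel_0.onnx"        # hypothetical partitioned sub-model file
NEXT_DEVICE = ("192.168.1.12", 9000)     # hypothetical address of the next device

def run_and_forward(input_tensor: np.ndarray) -> None:
    session = ort.InferenceSession(SUBMODEL_PATH)
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_tensor})
    payload = pickle.dumps(outputs[0])   # intermediate result for the next sub-model
    with socket.create_connection(NEXT_DEVICE) as conn:
        conn.sendall(len(payload).to_bytes(8, "big"))   # simple length-prefixed framing
        conn.sendall(payload)

run_and_forward(np.random.rand(1, 3, 224, 224).astype(np.float32))
```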
The invention provides a more flexible model segmentation algorithm, which can generate configuration schemes under high load and low load conditions, also supports presetting equipment parameters in advance and adjusts the schemes according to the equipment. Once the solution is generated, the service provider can flexibly adjust the solution according to the requirements. The invention provides a dynamic programming algorithm with lower complexity aiming at the more common scene that the computing power of each computing device is almost the same, and can more efficiently obtain a globally optimal device segmentation scheme. The invention provides a delay estimation model which can estimate the operation time of a sub-model on different equipment, rather than directly operating the sub-model for statistics, thereby greatly reducing the cost of preprocessing.
The model deployment device of the embodiment of the invention comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art can make modifications and equivalents to the embodiments of the present invention without departing from the spirit and scope of the present invention, which is set forth in the claims of the present application.

Claims (10)

1. A method of model deployment, comprising the steps of:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
2. The model deployment method according to claim 1, wherein the step of obtaining the operator model set of the deep neural network model to be deployed specifically comprises:
selecting a single neural network layer as the basic granularity, and dividing the deep neural network model to be deployed to obtain the operator model set.
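As a minimal sketch of this layer-granularity split, assuming a hypothetical PyTorch model (the layer choices below are purely illustrative and not part of the application), each child layer of the network can be treated as one operator model:

    import torch.nn as nn

    # Hypothetical model; a single network layer is the basic granularity,
    # so every child layer becomes one operator model.
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                          nn.Flatten(), nn.Linear(16 * 30 * 30, 10))
    operator_models = list(model.children())
    print([type(m).__name__ for m in operator_models])
    # ['Conv2d', 'ReLU', 'Flatten', 'Linear']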
3. The model deployment method according to claim 1,
wherein the step of performing operator fusion or operator segmentation processing on the operator models meeting the preset conditions in the operator model set specifically comprises:
comparing the parameter quantity of each operator model in the operator model set with the memory of the device having the smallest memory in the device set; performing operator segmentation on operator models whose parameter quantity is larger than the memory until the parameter quantity is smaller than 1/2 of the memory; and performing operator fusion on operator models whose parameter quantity is smaller than 1/10 of the memory until the parameter quantity is larger than 1/10 and smaller than 1/2 of the memory.
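A minimal, hedged sketch of this preprocessing step follows, assuming parameter sizes and device memory are expressed in the same units; the function name, the halving-based segmentation and the greedy fusion of adjacent small operators are illustrative assumptions, not the claimed implementation:

    def preprocess_operators(param_sizes, min_mem):
        upper, lower = min_mem / 2, min_mem / 10

        # Operator segmentation: operators larger than the memory are halved
        # repeatedly until every piece is smaller than 1/2 of the memory.
        segmented = []
        for size in param_sizes:
            pieces = [size]
            if size > min_mem:
                while max(pieces) >= upper:
                    pieces = [p for piece in pieces for p in (piece / 2, piece / 2)]
            segmented.extend(pieces)

        # Operator fusion: adjacent operators smaller than 1/10 of the memory are
        # merged until the fused block lies between 1/10 and 1/2 of the memory.
        # (A trailing block may stay below 1/10 if no small neighbour remains.)
        fused, acc = [], 0.0
        for size in segmented:
            if size < lower:
                acc += size
                if acc > lower:          # fused block is now large enough
                    fused.append(acc)
                    acc = 0.0
            else:
                if acc > 0.0:
                    fused.append(acc)    # flush a pending small block
                    acc = 0.0
                fused.append(size)
        if acc > 0.0:
            fused.append(acc)
        return fused

    print(preprocess_operators([120, 3, 4, 2, 60, 5], min_mem=100))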
4. The model deployment method according to claim 1, wherein the preset search method is a backtracking search method;
when the operator models in the processed operator model set are combined by the backtracking search method, if the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme, the high throughput rate priority scheme is adopted; the high throughput rate priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topology and are denoted as Node_1, Node_2, ..., Node_i, ..., Node_n;
starting from Node_1 and proceeding to Node_n, the topology map is queried for Node_i to obtain the branch nodes connected to Node_i, a search tree is obtained, and DFS traversal is performed on the search tree to obtain segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated, representing that Node_i and Node_{i+1} are in the same sub-model and in different sub-models, respectively; when Node_i has a plurality of branch nodes, the convergence node Node_j into which all the branch nodes merge is found, schemes whose segmentation outside the range from Node_i to Node_j is identical are treated as the same scheme and merged, obtaining the final x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the number of available devices is removed, yielding the existing scheme set;
the existing scheme set is traversed: for each scheme, the cost of the operator models corresponding to each sub-model on each device is calculated; the costs on different devices are combined, the maximum device cost is multiplied by the number of consumed devices to obtain the total cost, and the combination that minimizes the total cost is found as the optimal combination and recorded as the partition and segmentation scheme; the minimum costs of all the partition and segmentation schemes are compared, and the partition and segmentation scheme corresponding to the minimum cost is taken as the final partition and segmentation scheme.
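A minimal, hedged sketch of this high-throughput evaluation is given below for a purely sequential model, where the DFS branch choices reduce to cut/no-cut decisions between adjacent operators; the function name, the assumption that every sub-model may independently pick its cheapest device, and the example runtimes are illustrative only:

    from itertools import product

    def best_high_throughput_partition(runtimes, num_devices):
        # runtimes[i][d] is the measured running time of operator i on device d.
        n = len(runtimes)
        best = None
        for cuts in product([False, True], repeat=n - 1):  # True = cut after that operator
            bounds, start = [], 0
            for i, cut in enumerate(cuts, start=1):
                if cut:
                    bounds.append((start, i))
                    start = i
            bounds.append((start, n))
            if len(bounds) > num_devices:                  # needs more devices than available
                continue
            # Assumption: each sub-model picks its cheapest device independently.
            stage_costs = [min(sum(runtimes[k][d] for k in range(a, b))
                               for d in range(num_devices))
                           for a, b in bounds]
            # High throughput rate priority: maximum device cost times consumed devices.
            total = max(stage_costs) * len(bounds)
            if best is None or total < best[0]:
                best = (total, bounds)
        return best

    # Hypothetical example: 4 operators profiled on 2 devices.
    rt = [[1.0, 2.0], [3.0, 1.5], [2.0, 2.5], [0.5, 0.4]]
    print(best_high_throughput_partition(rt, num_devices=2))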
5. The model deployment method according to claim 4, wherein, when the backtracking search method is used to combine the operator models in the processed operator model set, if the actual running time is less than the theoretical delay of the high throughput rate priority scheme, a low service delay priority scheme is adopted; the low service delay priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topology and are denoted as Node_1, Node_2, ..., Node_i, ..., Node_n;
starting from Node_1 and proceeding to Node_n, the topology map is queried for Node_i to obtain the branch nodes connected to Node_i, a search tree is obtained, and DFS traversal is performed on the search tree to obtain segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated, representing that Node_i and Node_{i+1} are in the same sub-model and in different sub-models, respectively; when Node_i has a plurality of branch nodes, the convergence node Node_j into which all the branch nodes merge is found, schemes whose segmentation outside the range from Node_i to Node_j is identical are treated as the same scheme and merged, obtaining the final x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the number of available devices is removed, yielding the existing scheme set;
the existing scheme set is traversed: for each scheme, the cost of the operator models corresponding to each sub-model on each device is calculated; the costs on different devices are combined and accumulated to obtain the total cost, and the combination that minimizes the total cost is found as the optimal combination and recorded as the partition and segmentation scheme; the minimum costs of all the partition and segmentation schemes are compared, and the partition and segmentation scheme corresponding to the minimum cost is taken as the final partition and segmentation scheme.
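Only the cost aggregation differs between the schemes of claims 4 and 5; a one-function illustration, assuming a hypothetical name and argument layout, makes the contrast explicit:

    def total_cost(stage_costs, high_throughput):
        # stage_costs: per-sub-model cost of one candidate device combination.
        if high_throughput:
            # Claim 4: maximum device cost multiplied by the number of consumed devices.
            return max(stage_costs) * len(stage_costs)
        # Claim 5: costs on the different devices are accumulated.
        return sum(stage_costs)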
6. The model deployment method according to claim 1, wherein all the devices in the device set used for deploying the model are identical; the preset search method is a dynamic programming search method;
when the operator models in the processed operator model set are combined by the dynamic programming search method, if the actual running time is less than the theoretical delay of the high throughput rate priority scheme, a low service delay priority scheme is adopted; the low service delay priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topology and are denoted as Node_1, Node_2, ..., Node_i, ..., Node_n, and an n x n matrix M is used to represent their connection relations; each branching node has a plurality of successor nodes, and each convergence node has a plurality of predecessor nodes;
traversal is performed from Node_1 to Node_n; wherein, if Node_i is the parent node of branch nodes, the convergence node Node_j of the branch is found, and the algorithm is recursively invoked on the nodes between Node_i and Node_j with the optimization objective min{Number_{i..j} × max{Low_Cost_{i..j}}} to find the optimal combination that minimizes the delay of the branch; according to the state transition equation, the optimal scheme is solved as Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and the sub-model combination corresponding to Low_Cost_j is recorded; after the traversal, Low_Cost_n is the estimated minimum delay, and the records are queried to obtain the partition and deployment scheme of the deep learning model;
wherein Low_Cost_i denotes the lowest total delay over all sub-model combinations of the 0th to the i-th operators; the state transition equation of this combination is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) },
Low_Cost_0 = 0,
L_communicate(0) = 0,
L_compute(i..j) = Cost_i + Cost_{i+1} + ... + Cost_j,
L_communicate = Data_Size * Coefficient,
wherein L_compute(i..j) is the computation delay of the sub-model formed by combining the i-th to the j-th operators, Cost_i is the computation delay of the i-th operator, L_communicate is the communication delay, Data_Size is the size of the data to be transmitted, and Coefficient is a constant that varies with the network bandwidth; Number_{i..j} denotes the number of machines consumed between Node_i and Node_j.
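A minimal, hedged sketch of this state transition for a purely sequential chain of operators on identical devices is shown below; the function and variable names are assumptions, and communication is modelled, as in the claim, as Data_Size multiplied by a bandwidth-dependent Coefficient:

    def dp_low_delay_partition(op_costs, data_sizes, coefficient):
        # op_costs[k] is Cost_k (compute delay of the k-th operator);
        # data_sizes[k] is the output size of operator k.
        n = len(op_costs)
        low_cost = [float("inf")] * (n + 1)
        low_cost[0] = 0.0
        choice = [0] * (n + 1)          # start index of the last sub-model

        def l_compute(i, j):            # compute delay of the fused sub-model
            return sum(op_costs[i:j])

        def l_communicate(i):           # no transfer before the first operator
            return 0.0 if i == 0 else data_sizes[i - 1] * coefficient

        for j in range(1, n + 1):
            for i in range(j):          # last sub-model covers operators i..j-1
                cand = low_cost[i] + l_compute(i, j) + l_communicate(i)
                if cand < low_cost[j]:
                    low_cost[j], choice[j] = cand, i

        # Recover the partition by walking the recorded choices backwards.
        cuts, j = [], n
        while j > 0:
            cuts.append((choice[j], j))
            j = choice[j]
        return low_cost[n], list(reversed(cuts))

    # Hypothetical example: 5 operators.
    print(dp_low_delay_partition([2., 1., 3., 2., 1.], [4., 4., 2., 2., 1.], 0.5))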
7. The model deployment method according to claim 6, characterized in that, when the operator models in the processed operator model set are combined by the dynamic programming search method, if the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme, the high throughput rate priority scheme is adopted; the high throughput rate priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topology and are denoted as Node_1, Node_2, ..., Node_i, ..., Node_n, and an n x n matrix M is used to represent their connection relations; each branching node has a plurality of successor nodes, and each convergence node has a plurality of predecessor nodes;
slave Node1Begin traversing to Noden(ii) a Therein, for exampleFruit NodeiIs a father Node of the branch Node, finds a convergent Node of the branch NodejTo NodeiTo NodejThe nodes among the nodes are recursively called to optimize the target to be min Numberi..j×max{Low_Costi..jThe algorithm of finding the optimal combination to minimize the delay of the branch; solving the optimization scheme as Low _ Cost according to the state transition equationj=Low_Cost×device0..jRecord Low _ CostjCorresponding sub-model combination modes; node is traversednAfter that, Low _ CostnAnd inquiring the recorded submodel combination mode to obtain a deep learning model division deployment scheme for the estimated minimum delay.
8. A model deployment apparatus, comprising:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
9. An electronic device, comprising: a processor; and a memory for storing computer program instructions; characterized in that,
the computer program instructions, when loaded and executed by the processor, cause the processor to perform the model deployment method of any one of claims 1 to 7.
10. A readable storage medium storing computer program instructions, wherein the computer program instructions, when loaded and executed by a processor, cause the processor to perform the model deployment method of any one of claims 1 to 7.
CN202110567899.5A 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium Active CN113220457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567899.5A CN113220457B (en) 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110567899.5A CN113220457B (en) 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113220457A true CN113220457A (en) 2021-08-06
CN113220457B CN113220457B (en) 2024-03-22

Family

ID=77098247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567899.5A Active CN113220457B (en) 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113220457B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118592A (en) * 2022-06-15 2022-09-27 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis
CN115115062A (en) * 2022-06-29 2022-09-27 北京百度网讯科技有限公司 Machine learning model establishing method, related device and computer program product
CN115981870A (en) * 2023-03-10 2023-04-18 之江实验室 Data processing method and device, storage medium and electronic equipment
CN116050499A (en) * 2023-04-03 2023-05-02 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Self-adaptive model partitioning method, system and equipment in model parallel training
CN116070675A (en) * 2023-03-06 2023-05-05 西南交通大学 Side slope neural network model selection method, device, equipment and readable storage medium
CN116167463A (en) * 2023-04-26 2023-05-26 之江实验室 Model training method and device, storage medium and electronic equipment
CN116306856A (en) * 2023-05-17 2023-06-23 之江实验室 Deep learning model deployment method and device based on search
CN116630632A (en) * 2023-07-25 2023-08-22 腾讯科技(深圳)有限公司 Image segmentation model quantization method, device and equipment and computer storage medium
CN117155791A (en) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 Model deployment method, system, equipment and medium based on cluster topology structure
CN117311998A (en) * 2023-11-30 2023-12-29 卓世未来(天津)科技有限公司 Large model deployment method and system
WO2024014728A1 (en) * 2022-07-11 2024-01-18 Samsung Electronics Co., Ltd. Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190325307A1 (en) * 2018-04-20 2019-10-24 EMC IP Holding Company LLC Estimation of resources utilized by deep learning applications
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN110490322A (en) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 Method for splitting and device, the electronic equipment and storage medium of operation node
CN110674936A (en) * 2019-09-24 2020-01-10 上海寒武纪信息科技有限公司 Neural network processing method and device, computer equipment and storage medium
WO2021057465A1 (en) * 2019-09-26 2021-04-01 中兴通讯股份有限公司 Method and apparatus for performing parallel processing on deep learning model
CN111340237A (en) * 2020-03-05 2020-06-26 腾讯科技(深圳)有限公司 Data processing and model operation method, device and computer equipment
CN111738434A (en) * 2020-06-03 2020-10-02 中国科学院计算技术研究所 Method for executing deep neural network on heterogeneous processing unit
CN112270399A (en) * 2020-09-29 2021-01-26 北京百度网讯科技有限公司 Operator registration processing method and device based on deep learning and electronic equipment
CN112686378A (en) * 2020-12-23 2021-04-20 展讯通信(上海)有限公司 Calculation deployment method and device of neural network, storage medium and computer equipment

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118592B (en) * 2022-06-15 2023-08-08 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator feature analysis
CN115118592A (en) * 2022-06-15 2022-09-27 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis
CN115115062A (en) * 2022-06-29 2022-09-27 北京百度网讯科技有限公司 Machine learning model establishing method, related device and computer program product
CN115115062B (en) * 2022-06-29 2023-09-19 北京百度网讯科技有限公司 Machine learning model building method, related device and computer program product
WO2024014728A1 (en) * 2022-07-11 2024-01-18 Samsung Electronics Co., Ltd. Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device
CN116070675A (en) * 2023-03-06 2023-05-05 西南交通大学 Side slope neural network model selection method, device, equipment and readable storage medium
CN116070675B (en) * 2023-03-06 2023-06-09 西南交通大学 Side slope neural network model selection method, device, equipment and readable storage medium
CN115981870A (en) * 2023-03-10 2023-04-18 之江实验室 Data processing method and device, storage medium and electronic equipment
CN116050499A (en) * 2023-04-03 2023-05-02 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Self-adaptive model partitioning method, system and equipment in model parallel training
CN116167463A (en) * 2023-04-26 2023-05-26 之江实验室 Model training method and device, storage medium and electronic equipment
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing
CN116306856B (en) * 2023-05-17 2023-09-05 之江实验室 Deep learning model deployment method and device based on search
CN116306856A (en) * 2023-05-17 2023-06-23 之江实验室 Deep learning model deployment method and device based on search
CN116630632A (en) * 2023-07-25 2023-08-22 腾讯科技(深圳)有限公司 Image segmentation model quantization method, device and equipment and computer storage medium
CN116630632B (en) * 2023-07-25 2023-11-03 腾讯科技(深圳)有限公司 Image segmentation model quantization method, device and equipment and computer storage medium
CN117155791A (en) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 Model deployment method, system, equipment and medium based on cluster topology structure
CN117155791B (en) * 2023-10-31 2024-02-13 浪潮电子信息产业股份有限公司 Model deployment method, system, equipment and medium based on cluster topology structure
CN117311998A (en) * 2023-11-30 2023-12-29 卓世未来(天津)科技有限公司 Large model deployment method and system
CN117311998B (en) * 2023-11-30 2024-03-05 卓世未来(天津)科技有限公司 Large model deployment method and system

Also Published As

Publication number Publication date
CN113220457B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN113220457B (en) Model deployment method, model deployment device, terminal equipment and readable storage medium
CN114186633B (en) Distributed training method, device, equipment and storage medium of model
US20220129302A1 (en) Data processing system and method for heterogeneous architecture
KR20200113744A (en) Method and apparatus for partitioning deep neural networks
CN108304925B (en) Pooling computing device and method
CN113315669B (en) Cloud edge cooperation-based throughput optimization machine learning inference task deployment method
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
Tanaka et al. Automatic graph partitioning for very large-scale deep learning
CN112015765B (en) Spark cache elimination method and system based on cache value
CN112862083B (en) Deep neural network inference method and device in edge environment
CN107528731B (en) Network segmentation optimization algorithm applied to NS3 parallel simulation
CN113986485A (en) Cross-data center data transmission energy-saving optimization method and system
CN115392467B (en) Cloud edge cooperative self-adaptive depth reasoning method for real-time processing of mass data
Henna et al. Distributed and collaborative high-speed inference deep learning for mobile edge with topological dependencies
CN113824650B (en) Parameter transmission scheduling algorithm and system in distributed deep learning system
CN116418808A (en) Combined computing unloading and resource allocation method and device for MEC
CN112601232B (en) Load balancing multi-service migration method and system based on minimum cost and maximum flow
CN114816742A (en) Request processing method and device, electronic equipment and storage medium
CN114035906A (en) Virtual machine migration method and device, electronic equipment and storage medium
CN116306943B (en) AIoT-oriented multi-task local collaborative reasoning method and system
CN117573379B (en) Micro-service deployment method based on symmetrical scaling merging
US20230111791A1 (en) Artificial intelligence planning method and artificial intelligence planning device
CN116980423B (en) Model scheduling method, device, computing system, equipment and readable storage medium
CN110099003B (en) Parallel routing optimization method under elastic optical network
Myna Heterogeneous adaptive heuristics for graph processing in Geo distributed Data Centre

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20220615
Address after: 518031 room 1410, building 1, Changfu Jinmao building, south side of Shihua Road, Fubao community, Fubao street, Futian District, Shenzhen, Guangdong
Applicant after: Shenzhen Zhixin Huaxi Information Technology Co.,Ltd.
Address before: 710077 11th floor, building B2, yunhuigu 156, software new town, Tiangu 8th Road, high tech Zone, Xi'an City, Shaanxi Province
Applicant before: Cross Information Core Technology Research Institute (Xi'an) Co.,Ltd.
GR01 Patent grant