CN113220457A - Model deployment method, model deployment device, terminal device and readable storage medium - Google Patents
Model deployment method, model deployment device, terminal device and readable storage medium
- Publication number
- CN113220457A (application CN202110567899.5A)
- Authority
- CN
- China
- Prior art keywords
- node
- model
- operator
- scheme
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 73
- 238000003860 storage Methods 0.000 title claims abstract description 13
- 230000011218 segmentation Effects 0.000 claims abstract description 46
- 238000003062 neural network model Methods 0.000 claims abstract description 23
- 230000004927 fusion Effects 0.000 claims abstract description 15
- 238000012545 processing Methods 0.000 claims abstract description 15
- 238000013136 deep learning model Methods 0.000 claims description 25
- 238000005457 optimization Methods 0.000 claims description 20
- 238000004364 calculation method Methods 0.000 claims description 16
- 238000004422 calculation algorithm Methods 0.000 claims description 15
- 238000005192 partition Methods 0.000 claims description 15
- 238000004590 computer program Methods 0.000 claims description 14
- 230000007704 transition Effects 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 8
- 238000013528 artificial neural network Methods 0.000 claims description 5
- 239000002356 single layer Substances 0.000 claims description 3
- 238000002360 preparation method Methods 0.000 claims 1
- 238000004891 communication Methods 0.000 description 15
- 238000010586 diagram Methods 0.000 description 11
- 239000010410 layer Substances 0.000 description 9
- 238000000638 solvent extraction Methods 0.000 description 8
- 238000013135 deep learning Methods 0.000 description 7
- 230000006870 function Effects 0.000 description 7
- 230000001934 delay Effects 0.000 description 6
- 230000006872 improvement Effects 0.000 description 6
- 238000010801 machine learning Methods 0.000 description 5
- 230000005540 biological transmission Effects 0.000 description 3
- 238000011161 development Methods 0.000 description 2
- 230000018109 developmental process Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000012805 post-processing Methods 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 239000002699 waste material Substances 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 125000002015 acyclic group Chemical group 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000013179 statistical model Methods 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
- G06N3/105—Shells for specifying net layout
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5017—Task decomposition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a model deployment method, a model deployment device, a terminal device and a readable storage medium. The method comprises the following steps: acquiring an operator model set of a deep neural network model to be deployed; performing operator fusion or operator segmentation on the operator models in the set that satisfy preset conditions, to obtain a processed operator model set; acquiring the running time of each operator model in the processed set on each device in the device set used for deployment, to obtain a running-time set; combining the operator models in the processed set with a preset search method based on the running-time set, to obtain a sub-model set; and deploying the deep neural network model on the device set according to the sub-model set. The invention is fully compatible with devices of different computing power and improves both running efficiency and overall throughput.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to the field of model deployment, and particularly relates to a model deployment method, a model deployment device, terminal equipment and a readable storage medium.
Background
Machine Learning (ML) is one of the fastest-growing fields in computer science. A typical machine-learning workflow trains a statistical model for a specific application on a large pre-collected dataset, updating the model parameters (also called "weights") until convergence; the trained model is then used for inference, i.e., predicting results on new data. Deep learning based on neural networks is the most widely used ML technique owing to its excellent results. Neural network models are multi-layered Directed Acyclic Graphs (DAGs), typically composed of convolution, matrix multiplication, pooling, batch normalization and similar operations, connected in linear chains or more complex patterns (e.g., branches and residual connections). The generalization ability and accuracy of neural network models generally improve with deeper topologies and larger layers, as in ResNet and VGG, but this also leads to higher execution latency.
Researchers are continuously introducing complex network models to obtain better generalization ability and accuracy, but the parameter counts and computational complexity of the models keep increasing; for example, the GPT-3 model proposed by OpenAI has a striking 175 billion parameters, and the model occupies more than 700 GB of disk storage. Researchers have proposed various system-level optimization schemes to address the performance challenges posed by larger and more complex models. On the hardware side, faster computation is supported by GPUs and special accelerators (e.g., Google TPUs); on the software side, many frameworks bridge the gap between productivity-centric high-level interfaces and performance-oriented low-level implementations, including TensorFlow, PyTorch, MXNet and TVM.
However, the above studies do not solve the deployment problem of large-parameter models well. Most current work assumes by default that the machine running the deep learning model has sufficient memory and disk, so the model can be run directly on a single machine to obtain its output. On cloud server clusters, distributed storage is generally used to obtain better file read/write rates and hence better throughput, but this spends more time on transmission and cannot fully exploit the computing capability of all devices. With the development of deep learning, researchers have found that computational resources are always limited; moreover, the device responsible for computation usually needs strong computing power while the devices responsible for reading and writing sit idle most of the time, which wastes resources on one hand and makes throughput hard to improve on the other.
In summary, in order to allow the deep learning model to operate in a restricted resource scenario, a new method, system, device and medium for deploying the deep learning model in the restricted resource scenario are urgently needed.
Disclosure of Invention
The present invention is directed to a model deployment method, a model deployment apparatus, a terminal device and a readable storage medium, so as to solve one or more of the above technical problems. The invention can be fully compatible with devices with different computing powers, and can improve the operation efficiency and the overall throughput rate.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a model deployment method, which comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
The further improvement of the present invention is that the step of obtaining the operator model set of the deep neural network model to be deployed specifically includes:
and selecting a single-layer neural network as basic granularity, dividing the deep neural network model to be deployed, and obtaining an operator model set.
The further improvement of the present invention lies in that the step of performing operator fusion or operator segmentation processing on the operator models satisfying the preset conditions in the operator model set to obtain the processed operator model set specifically comprises:
comparing the parameter size of each operator model in the operator model set with the memory of the device with the smallest memory in the device set; performing operator segmentation on any operator model whose parameter size is larger than that memory, until each resulting operator's parameter size is smaller than 1/2 of the memory; and performing operator fusion on operator models whose parameter size is smaller than 1/10 of the memory, until the fused parameter size is larger than 1/10 and smaller than 1/2 of the memory.
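By way of illustration only (not part of the claimed method), the split/fuse pre-processing rule above can be sketched in Python. All names and numbers are illustrative assumptions; oversized operators are split into equal pieces here, whereas a real implementation would split along layer boundaries.

```python
import math

def preprocess_operators(param_sizes, min_mem):
    """Sketch of the pre-processing rule: operators larger than the
    smallest device memory are split until each piece is below 1/2 of
    that memory; runs of operators below 1/10 of the memory are fused
    until the fused size lands between 1/10 and 1/2."""
    upper = min_mem / 2      # segmentation target: below half the memory
    lower = min_mem / 10     # fusion threshold: a tenth of the memory

    # Operator segmentation: divide oversized operators evenly.
    split = []
    for size in param_sizes:
        if size > min_mem:
            pieces = math.floor(size / upper) + 1   # guarantees size/pieces < upper
            split.extend([size / pieces] * pieces)
        else:
            split.append(size)

    # Operator fusion: greedily merge consecutive tiny operators.
    fused, acc = [], 0
    for size in split:
        if size < lower:
            acc += size
            if acc >= lower:          # fused group is now big enough
                fused.append(acc)
                acc = 0
        else:
            if acc:                   # flush a pending (still small) group
                fused.append(acc)
                acc = 0
            fused.append(size)
    if acc:
        fused.append(acc)
    return fused
```

For example, with a smallest device memory of 100 units, an operator of 120 units is split into three pieces of 40, and two operators of 5 units are fused into one of 10.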
The invention has the further improvement that the preset search method is a backtracking search method;
when operator models in the processed operator model set are combined by the backtracking search method and the actual running time is greater than or equal to the theoretical delay of the high-throughput-first scheme, the high-throughput-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, for each Node_i the topology graph is queried to obtain the branch nodes connected to Node_i, yielding a search tree, and DFS traversal of the search tree produces segmentation schemes. For each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being placed in the same sub-model and in different sub-models, respectively. When Node_i has multiple branch nodes, the merge node Node_j at which all branches converge is found, and schemes that are identical outside the segment from Node_i to Node_j are merged as the same scheme, leaving x final schemes. The x schemes are then traversed, and any scheme that requires more devices than the available number D is removed, yielding the surviving scheme set;
traverse the surviving scheme set: for each scheme, calculate the cost of the operator models of each sub-model on each device; combine the costs on the different devices, multiply the maximum device cost by the number of devices consumed to obtain the total cost, and take the combination minimizing the total cost as the optimal combination, recorded as that scheme's partition plan; finally, compare the minimum costs of all partition plans and take the partition plan with the smallest cost as the final partition plan.
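As an illustrative sketch of the search just described (the branch-node merging of the full backtracking search is omitted, so only a linear operator chain is handled; all names and numbers are assumptions), the enumeration and the throughput-first cost rule — maximum device cost times the number of devices consumed — can be written as:

```python
from itertools import product, permutations

def throughput_first_chain(op_costs, op_params, devices):
    """Enumerate every way to cut a linear operator chain into
    consecutive sub-models and pick the scheme minimising
    max(device cost) * number of devices.
    devices: list of (slowdown, memory) pairs."""
    n, D = len(op_costs), len(devices)
    best = None
    # Each bit of `cuts` says whether a new sub-model starts before
    # operator i + 1 -- the "same sub-model / different sub-model"
    # branching of the search tree.
    for cuts in product([0, 1], repeat=n - 1):
        groups, params = [], []
        gc, gp = op_costs[0], op_params[0]
        for i, cut in enumerate(cuts):
            if cut:
                groups.append(gc); params.append(gp)
                gc, gp = op_costs[i + 1], op_params[i + 1]
            else:
                gc += op_costs[i + 1]; gp += op_params[i + 1]
        groups.append(gc); params.append(gp)
        if len(groups) > D:
            continue                       # needs more devices than available
        # Try every assignment of sub-models to distinct devices.
        for perm in permutations(range(D), len(groups)):
            if any(p > devices[d][1] for p, d in zip(params, perm)):
                continue                   # sub-model does not fit in memory
            bottleneck = max(g * devices[d][0] for g, d in zip(groups, perm))
            total = bottleneck * len(groups)   # max device cost * device count
            if best is None or total < best[0]:
                best = (total, groups, perm)
    return best
```

Note the memory-fit check is what makes splitting worthwhile: a model whose parameters exceed every single device's memory can only be deployed by cutting it.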
The invention has the further improvement that when operator models in the processed operator model set are combined by the backtracking search method and the actual running time is less than the theoretical delay of the high-throughput-first scheme, the low-service-delay-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, for each Node_i the topology graph is queried to obtain the branch nodes connected to Node_i, yielding a search tree, and DFS traversal of the search tree produces segmentation schemes. For each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being placed in the same sub-model and in different sub-models. When Node_i has multiple branch nodes, the merge node Node_j at which all branches converge is found, and schemes that are identical outside the segment from Node_i to Node_j are merged as the same scheme, leaving x final schemes. The x schemes are traversed, and any scheme requiring more devices than the available number D is removed, yielding the surviving scheme set;
traverse the surviving scheme set: for each scheme, calculate the cost of the operator models of each sub-model on each device; accumulate the costs on the different devices to obtain the total cost, and take the combination minimizing the total cost as the optimal combination, recorded as that scheme's partition plan; compare the minimum costs of all partition plans and take the partition plan with the smallest cost as the final partition plan.
A further improvement of the invention is that all devices in the set of devices for deploying the model are identical; the preset search method is a dynamic programming search method;
when operator models in the processed operator model set are combined by the dynamic programming search method and the actual running time is less than the theoretical delay of the high-throughput-first scheme, the low-service-delay-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents their connection relationship; each branching node has several outgoing nodes, and each merging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the merge node Node_j of those branches, and for the nodes between Node_i and Node_j recursively invoke the algorithm whose optimization target is min Number_{i..j} × max{Low_Cost_{i..j}} to find the optimal combination minimizing the branch delay; solve the optimization according to the state-transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination of Low_Cost_j; after the traversal, Low_Cost_n is the estimated minimum delay, and querying the records yields the deep learning model partition deployment scheme; Number_{i..j} denotes the number of machines consumed between Node_i and Node_j;
wherein Low_Cost_i denotes the lowest total delay over all sub-model combinations of operators 0 through i; the state-transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) },
Low_Cost_0 = 0,
L_communicate(0) = 0,
L_communicate = Data_Size × Coefficient,
where L_compute(i..j) is the computation delay of the sub-model formed by combining operators i through j, Cost_i is the computation delay of the i-th operator, L_communicate is the data-transmission delay, Data_Size is the size of the data to be transmitted, and Coefficient is a constant determined by the network bandwidth.
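By way of illustration only, the recurrence above can be sketched in Python for a linear operator chain. The per-operator latencies, parameter sizes and the memory-fit check are illustrative assumptions (the patent profiles real running times), L_compute(i..j) is taken as the sum of the fused operators' latencies, and comm_latency[0] is fixed to 0 as in the definitions above.

```python
def dp_latency_first(op_latency, op_params, comm_latency, device_mem):
    """Implements Low_Cost_j = min_i { Low_Cost_i + L_compute(i..j)
    + L_communicate(i) }, returning the minimum total delay and the
    positions at which the chain is cut into sub-models."""
    n = len(op_latency)
    pre = [0]                      # prefix sums: L_compute(i..j) = pre[j]-pre[i]
    for t in op_latency:
        pre.append(pre[-1] + t)
    psum = [0]                     # prefix sums of parameter sizes
    for p in op_params:
        psum.append(psum[-1] + p)

    low = [0.0] + [float("inf")] * n     # Low_Cost_0 = 0
    choice = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            if psum[j] - psum[i] > device_mem:
                continue           # fused sub-model would not fit on one device
            cand = low[i] + (pre[j] - pre[i]) + comm_latency[i]
            if cand < low[j]:
                low[j], choice[j] = cand, i

    cuts, j = [], n                # recover where the model is cut
    while j > 0:
        cuts.append(choice[j])
        j = choice[j]
    return low[n], sorted(c for c in cuts if c)
```

With three operators of latency 2, 3 and 4, parameter sizes of 50 each and a device memory of 100, all three operators cannot be fused, so the DP cuts once and pays one unit of communication delay.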
The invention has the further improvement that when operator models in the processed operator model set are combined by the dynamic programming search method and the actual running time is greater than or equal to the theoretical delay of the high-throughput-first scheme, the high-throughput-first scheme is adopted; the scheme proceeds as follows:
the nodes are numbered sequentially according to the input-output topology and denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents their connection relationship; each branching node has several outgoing nodes, and each merging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the merge node Node_j of those branches, and for the nodes between Node_i and Node_j recursively invoke the algorithm whose optimization target is min Number_{i..j} × max{Low_Cost_{i..j}} to find the optimal combination minimizing the branch delay; solve the optimization according to the state-transition equation Low_Cost_j = Low_Cost × device_{0..j}, and record the corresponding sub-model combination of Low_Cost_j; after Node_n is traversed, Low_Cost_n is the estimated minimum delay, and querying the recorded sub-model combinations yields the deep learning model partition deployment scheme.
The invention relates to a model deployment device, which comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
An electronic device of the present invention includes: a processor; a memory for storing computer program instructions; the computer program instructions, when loaded and executed by the processor, cause the processor to perform any of the above-described model deployment methods of the invention.
A readable storage medium of the present invention stores computer program instructions which, when loaded and executed by a processor, cause the processor to perform any of the above-described model deployment methods of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
in the model deployment method provided by the invention, a large-parameter model to be deployed is divided into many small operator models, and sub-models are obtained from these operator models with a preset planning method, so that the sub-models fit the memory capacity of computing devices of different computing power, non-computational overhead is reduced, and the capability of existing computing devices is fully exploited. The method can thus deploy models with large parameter counts on devices with limited computing capacity. It should be noted that deploying the model across different devices naturally introduces communication overhead, but this overhead is far smaller than the computation overhead of the model itself; on a cloud computing platform the communication overhead is negligible, and compared with work that loads parameters from disk in blocks, the communication overhead is far smaller than the loading overhead.
In an embodiment of the further improved model deployment method, a specific model segmentation algorithm is provided that can generate configuration schemes for high-load and low-load conditions, called the low-service-delay-first (latency-first) scheme and the high-throughput-first (throughput-first) scheme respectively; the schemes can be generated in advance and adjusted per device, and once generated, the service provider can flexibly adjust them as required.
In an embodiment of the further improved model deployment method, for the common scenario in which the computing devices have nearly identical computing power, a lower-complexity dynamic programming algorithm is provided, which obtains the globally optimal device segmentation scheme more efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic block flow diagram of a model deployment method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of operator fusion according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating low service latency precedence in an embodiment of the present invention;
FIG. 4 is a flow chart illustrating high throughput first in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a deployment result according to an embodiment of the present invention.
Detailed Description
In order to make the purpose, technical effect and technical solution of the embodiments of the present invention clearer, the following clearly and completely describes the technical solution of the embodiments of the present invention with reference to the drawings in the embodiments of the present invention; it is to be understood that the described embodiments are only some of the embodiments of the present invention. Other embodiments, which can be derived by one of ordinary skill in the art from the disclosed embodiments without inventive faculty, are intended to be within the scope of the invention.
The model deployment method of the embodiment of the invention comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
Machine learning has become the most important and prevalent technique in modern data-driven applications, and recent advances in deep learning have yielded unprecedented results in a variety of challenging tasks, such as image/video classification and natural language processing. In recent years, researchers have continually introduced sophisticated network models to achieve better generalization capability and accuracy, but with them come ever-larger parameter counts and greater computational complexity: the GPT-3 model proposed by OpenAI has a striking 175 billion parameters, and the model occupies more than 700 GB of disk storage. With the development of deep learning, researchers have found that computational resources are always limited. In order to enable deep learning models to operate in limited-resource scenarios, the embodiment of the invention provides a novel model deployment method that divides the model to be deployed into small units and deploys them on devices with weak computing power, so that the computing power of the devices can be better exploited and read/write overhead reduced. Specifically, the embodiment of the invention discloses a flexible deep learning model deployment method, so that a deep learning model can be fully compatible with devices of different computing power, making full use of their computing power to obtain optimal running efficiency and overall throughput; the invention can fully account for the computing-power differences among powerful cloud devices, weaker edge devices and even embedded devices, and obtains the optimal overall throughput.
In the embodiment of the invention, the overhead mainly comes from two parts, namely the execution time of the model and the communication time between devices; wherein the execution time is typically much larger than the communication time.
In the embodiment of the present invention, two optimization scenarios are set, including:
(1) in a low-frequency request scene, a model deployment method should provide lower prediction delay for a single request of a user;
(2) in a high-frequency request scenario, the model deployment method should optimize the throughput rate as a whole and provide services to more users in a unit time under the condition of ensuring that single prediction delay is acceptable.
It is worth noting that when the throughput rate is optimized, all the devices form a relatively complex pipeline, and the throughput rate depends on the slowest stage in the pipeline; therefore, optimization schemes in high-throughput scenarios generally also tend to balance the load across the devices.
In the embodiment of the invention, once the available resources and the model to be deployed are determined, a low-service-delay-first (latency-first) scheme and a high-throughput-first (throughput-first) scheme can be generated, and the corresponding theoretical delays Time_latency and Time_throughput are calculated at the same time. In actual operation, the latency-first scheme is adopted initially; when the actual running time of a user request exceeds the theoretical delay Time_throughput of the throughput-first scheme, the system switches to the throughput-first scheme and records the current request number as Number_flag. Definition: when the user request number Number_request > Number_flag, the environment is a high-load environment and the high-throughput-first (throughput-first) scheme is adopted; when Number_request < Number_flag, the environment is a low-load environment and the low-service-delay-first (latency-first) scheme is adopted.
In the embodiment of the invention, suppose there are n operator models in total; the calculation delay of the i-th operator model is L_comp,i, and the communication delay between the i-th operator model and the (i+1)-th operator model is L_comm,i.
The optimization function under the low-delay scheme is min{Σ_i (L_comp,i + L_comm,i)}: the operator models are combined into different sub-models, and the operator-model combination with the minimum total delay is obtained, so that the overall delay is lowest. Because the devices under the throughput-first scheme form a more complex pipeline, and the throughput depends on the slowest stage of the pipeline, the optimization function in this case is min{n × max_i{L_comp,i, L_comm,i}}, which represents, among all operator-model combinations, the combination that minimizes the product of the number of sub-models and the delay of the sub-model with the largest delay.
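As a hedged illustration of the two objectives, the following sketch evaluates both for one hypothetical partition of a model into three sub-models; all delay values (milliseconds) are invented for illustration and are not measurements from the invention:

```python
def latency_objective(stage_compute, stage_comm):
    """Latency-first: minimize the total of compute plus communication delays."""
    return sum(stage_compute) + sum(stage_comm)

def throughput_objective(stage_compute, stage_comm):
    """Throughput-first: the pipeline is limited by its slowest stage, so the
    cost is (number of stages) x (delay of the slowest stage)."""
    slowest = max(max(stage_compute), max(stage_comm))
    return len(stage_compute) * slowest

stage_compute = [12.0, 9.0, 15.0]   # per-sub-model compute delay (ms)
stage_comm = [1.5, 0.8, 0.0]        # delay of sending output to the next stage

print(latency_objective(stage_compute, stage_comm))     # 38.3
print(throughput_objective(stage_compute, stage_comm))  # 45.0
```

A search over all partitions would evaluate one of these objectives per candidate, depending on which scheme is active.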
Referring to fig. 1, a model deployment method according to an embodiment of the present invention is used to deploy a deep neural network model to be deployed on a resource-defined device set for deploying the model, and specifically includes the following steps:
screening to obtain an operator model set of the deep neural network model to be deployed, and performing parallel post-processing to obtain a processed operator model set; wherein the post-processing comprises: performing operator fusion or operator segmentation on the operator model meeting the preset condition (comparing the parameter number of the operator model with the memory of the minimum memory device); each operator model carries out delay statistics on each device in the device set to obtain a running time set; combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set; and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
The deep learning model has an obvious characteristic: various easily distinguishable operators form the overall model according to a specific topological structure. If the operators are regarded as nodes and the connections between operators are regarded as edges, the topological structure of a deep learning model can be regarded as a typical directed graph, in which nodes and edges carry respective values representing the overhead of computation and communication. Because memory capacities differ, different devices have different capacities for hosting models; for a device supporting virtual memory, although a model with a larger parameter count can be run through the virtual memory, a large number of page faults are generated, which brings huge overhead. So if too many nodes are included in a sub-model, the overhead will eventually be exacerbated by page faults.
In the method provided by the embodiment of the invention, the nodes and the edges are divided into the sub-models, so that the total overhead is minimum.
In the embodiment of the present invention, the specific step of selecting the granularity includes: when a deep learning model is partitioned, too coarse a granularity may miss potentially optimal partitions and may prevent the model from fitting entirely in the memory of a device, while too fine a granularity slows the search for the optimal strategy. In a deep learning model, selecting a single-layer neural network as the basic granularity is a natural choice; each partitioned sub-model will then contain one or more network layers. However, in some cases, special adjustments are required.
In the embodiment of the present invention, the operator fusion specifically includes the following steps: in addition to computationally intensive network layers such as convolution and matrix multiplication, neural networks also use small network layers with little or no weight, for example merge, batch normalization, and ReLU activation. These layers, which contribute little to memory usage, are fused with their neighboring convolution or matrix-multiplication layers to reduce the number of elementary units for partitioning. Such layer fusion is a commonly used performance optimization in many deep learning frameworks, for example TVM and TensorFlow. As shown in fig. 2, Conv denotes a convolution layer, BN denotes a batch normalization layer, and Relu denotes a linear rectification function.
In the embodiment of the present invention, the step of processing large-parameter operators specifically includes: an operator with an excessive number of parameters is split into several parallel operators to reduce the parameter count of any single operator; the outputs of these operators can be merged and are equal to the output of the original operator. Based on empirical tuning, with SM denoting the memory size of the device with the smallest memory among all devices to be deployed, an operator whose parameter size exceeds 0.5 SM is generally split until each split operator is smaller than 0.5 SM.
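A hedged sketch of this splitting rule for a fully connected operator: the weight matrix is split column-wise into parallel operators whose concatenated outputs reproduce the original output exactly. The sizes, the helper name `split_dense_operator`, and the parameter budget below are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def split_dense_operator(W, max_params):
    """Split weight matrix W (in_dim x out_dim) into column blocks so that
    each block holds at most max_params parameters."""
    in_dim, out_dim = W.shape
    cols_per_part = max(1, max_params // in_dim)
    return [W[:, i:i + cols_per_part] for i in range(0, out_dim, cols_per_part)]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal(64)

parts = split_dense_operator(W, max_params=64 * 8)   # each part <= 512 params
merged = np.concatenate([x @ p for p in parts])      # run parts in parallel, merge

assert np.allclose(merged, x @ W)                    # equals the original output
print(len(parts))                                    # 4 parallel operators
```

In a real deployment each block would live on a separate device, and the merge step would be a concatenation at the convergence node.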
In the embodiment of the present invention, the branch processing specifically includes: deep learning models such as ResNet, Inception, and DenseNet employ complex topologies that are not simple linear sequences but contain branches. These branches are individually labeled in embodiments of the present invention. In computing performance, such parallel branch units are treated as the same pipeline stage, and their delay depends on the slowest one.
In the embodiment of the invention, in order to obtain the optimal division scheme, the operator calculation delay L_compute and the communication delay L_communicate must be obtained accurately, and from them the combinations of operators (i.e., sub-models) that may run on the same device. There is communication overhead between sub-models, but none within a sub-model. Usually, an operator running on a device has two delays: the delay under low page-fault pressure and the delay under high page-fault pressure.
In the embodiment of the invention, when calculating the operator delay, if the total memory usage of a sub-model exceeds the available memory size, the operators in that sub-model suffer severe page faults. Therefore, the embodiment of the invention searches for the optimal operator combination scheme by the following method, which comprises operator delay statistics and sub-model delay estimation.
In the embodiment of the invention, the operator delay statistics step comprises the following: each operator model is run on the device to obtain its delay under low page-fault pressure and its delay under high page-fault pressure. The performance of each operator is compiled and measured separately in both cases with the help of a simple memory-expansion module that occupies most of the memory space. When the memory-expansion module is running, starting the operator yields the running time with page faults, corresponding to the delay under high page-fault pressure; when the memory-expansion module is closed, the running time without page faults corresponds to the delay under low page-fault pressure.
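A hedged sketch of this two-mode measurement harness: the "memory expansion module" is modeled as a large ballast allocation held alive while the operator is timed. On a genuinely memory-constrained device this ballast would induce page faults; the ballast size, the `measure` helper, and the toy operator are illustrative assumptions only:

```python
import time

def measure(op, *, ballast_mb=0, repeats=5):
    """Return the best-of-N wall-clock time of op(), optionally while a
    ballast buffer occupies memory (the high page-fault mode)."""
    ballast = bytearray(ballast_mb * 1024 * 1024)  # held alive during timing
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        op()
        best = min(best, time.perf_counter() - t0)
    del ballast
    return best

toy_operator = lambda: sum(i * i for i in range(100_000))
low_pf_delay = measure(toy_operator)                  # memory module off
high_pf_delay = measure(toy_operator, ballast_mb=64)  # memory module on
print(low_pf_delay > 0 and high_pf_delay > 0)         # True
```

The two measured values would populate the per-operator delay table consulted during sub-model delay estimation.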
In the embodiment of the invention, the sub-model delay estimation step specifically comprises the following: starting from the start node, a sub-model is generated, and the memory usage of each sub-model in the search space is calculated. Deep learning models have very regular program semantics: the memory layout mainly consists of model weights, intermediate results, and input/output, with the model weights accounting for the largest share. For a given model, the memory occupied by the weights is a fixed quantity, and the memory occupied by intermediate results can be obtained from the inputs and outputs of the operators. Sub-model delay estimation: for any given operator combination, the memory usage of its operators is estimated and compared with the device memory size to determine whether a large number of page faults would occur; the delay of the sub-model is obtained by accumulation and recorded. This is repeated until the search space is traversed. The corresponding sub-model delays are then summed into a total delay, and the scheme meeting the optimization target is selected.
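The estimate above can be sketched as follows: memory use is approximated as weights plus the largest intermediate activation, and the per-operator delay table supplies the high page-fault column when the estimate exceeds device memory. All numbers (parameter sizes, delays, device memory) are invented for illustration:

```python
def estimate_submodel_delay(ops, device_mem_mb):
    """ops: list of dicts with weight_mb, act_mb, delay_low_ms, delay_high_ms.
    Picks the high page-fault delay column if the sub-model cannot fit."""
    mem = sum(o["weight_mb"] for o in ops) + max(o["act_mb"] for o in ops)
    key = "delay_high_ms" if mem > device_mem_mb else "delay_low_ms"
    return sum(o[key] for o in ops)

ops = [
    {"weight_mb": 40, "act_mb": 8, "delay_low_ms": 5.0, "delay_high_ms": 60.0},
    {"weight_mb": 90, "act_mb": 4, "delay_low_ms": 7.0, "delay_high_ms": 95.0},
]
print(estimate_submodel_delay(ops, device_mem_mb=256))  # fits: 12.0
print(estimate_submodel_delay(ops, device_mem_mb=64))   # thrashes: 155.0
```

The point of estimating rather than running each candidate sub-model is precisely the preprocessing saving claimed later in the description.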
In the embodiment of the invention, based on the obtained normal running cost of each operator on different devices and its running cost when a large number of page faults occur, a scheme meeting the requirement is selected from the search space. The topological structure of current deep learning models is a directed acyclic graph (DAG); in the following, operators are regarded as nodes, and Node_i denotes the i-th operator. In order to obtain all sub-model combinations, a backtracking method is used for the search, and the search space is represented as a search-tree structure.
Referring to fig. 3, in the embodiment of the present invention, assuming that the optimization target of the current scheme is the minimum delay, only the synthesis of the divided delays of the model needs to be minimized under the current device resource, and the steps are as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; each node with branches has multiple egress nodes, each convergence node has multiple ingress nodes, and these appear in the topology graph as nodes with more than two edges;
step two, from Node_1 to Node_n: for Node_i, query the topology graph to obtain the branch nodes connected to it, and for each branch node Node_{i+1}, generate two branches representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models;
step three, DFS traversal is carried out on the search tree generated in the step two, and a segmentation scheme is obtained;
step four, for Node_i, if it has multiple branch nodes, find the ingress node Node_j at which all branch nodes converge; schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, finally obtaining x schemes.
Step five, traverse the schemes; for the a-th scheme Plan_a, if the number of devices D_a required by the scheme satisfies D_a > D (D being the number of available devices), eliminate the scheme.
Step six, traverse the remaining schemes; for each scheme Plan_a and each sub-model G_a in the scheme, calculate the overhead Cost_ab of the corresponding operators on each device device_b. For Plan_a, combine the Cost_ab values over the different devices and accumulate them to obtain the total overhead Total_Cost; the combination minimizing Total_Cost is the optimal combination of Plan_a, recorded as the final partition scheme Opti_Plan_a, and its minimum overhead is recorded as Low_Cost_a.
And step seven, compare the Low_Cost_a values of all x Opti_Plan_a schemes, find the candidate partition scheme Opti_Plan with the minimum Low_Cost, and output it as the optimal partition scheme.
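A hedged sketch of this search, simplified to a linear (branch-free) operator chain: each boundary between adjacent operators either cuts (different sub-models, paying communication) or does not, which enumerates the leaves of the search tree. The toy ×10 page-fault penalty for an oversized sub-model, the delay numbers, and the device limit are illustrative assumptions:

```python
from itertools import product

def best_latency_partition(ops, comm_ms, device_mem_mb, max_devices):
    """ops[i] = (weight_mb, delay_ms). A cut after operator i costs comm_ms.
    A sub-model whose weights exceed device memory thrashes (x10 delay)."""
    n = len(ops)
    best = (float("inf"), None)
    for cuts in product([0, 1], repeat=n - 1):       # search-tree leaves
        if sum(cuts) + 1 > max_devices:              # step five: too many devices
            continue
        total, start = 0.0, 0
        for i, cut in enumerate(list(cuts) + [1]):   # final boundary always cuts
            if cut:
                seg = ops[start:i + 1]
                w = sum(x[0] for x in seg)
                d = sum(x[1] for x in seg)
                total += d * 10.0 if w > device_mem_mb else d
                start = i + 1
        total += sum(cuts) * comm_ms
        best = min(best, (total, cuts))
    return best

ops = [(60, 4.0), (60, 6.0), (60, 3.0), (60, 5.0)]   # (weight MB, delay ms)
print(best_latency_partition(ops, comm_ms=2.0, device_mem_mb=128, max_devices=3))
# (20.0, (0, 1, 0)): cut once after the second operator
```

Here no cut thrashes (180 ms) while one well-placed cut pays 2 ms of communication to keep both sub-models in memory, which is exactly the trade-off the search explores.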
Preferably, in the embodiment of the present invention, the minimum-delay optimization scheme for identical devices includes: when the device configurations in the application scenario are completely the same, an optimal deep learning model segmentation scheme can be obtained with a lower-complexity dynamic programming algorithm. Low_Cost_i denotes the combination with the lowest total delay among all sub-model combinations of operators 0 through i. The state transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) }
For the n-th operator, the recurrence always builds on the best cases of the first n-1 operator combinations. Define Low_Cost_0 = 0 and L_communicate(0) = 0, with L_communicate = Data_Size × Coefficient, where L_compute(i..j) is the calculation delay of the sub-model formed by combining operators i through j, Cost_i is the calculation delay of the i-th operator, L_communicate is the data-transmission delay, Data_Size is the size of the data to be transmitted in MB, and Coefficient is a constant that varies with network bandwidth (for 1 Gbps Ethernet the peak is about 8 ms/MB). A Low_Cost table saves each calculated Low_Cost_i, and a Position table records, for Node_j, the position i selected for Low_Cost_i.
In the above embodiment of the present invention, the step of obtaining the optimal deep learning model segmentation scheme by using the dynamic programming algorithm is as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes.
Step two, starting from Node_1: for Node_i, if Node_i is the parent node of branch nodes, find the convergence node Node_j of the branches; the algorithm is called recursively on the nodes between Node_i and Node_j, with the optimization target min{max{Low_Cost_i..j}} (the minimum delay depends on the branch with the largest delay, so the best combination minimizes the delay of that branch).
Step three, for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination corresponding to Low_Cost_j.
Step four, after all nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition and deployment scheme is obtained by querying the records.
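A hedged sketch of this dynamic program for a linear operator chain on identical devices. `L_compute(i, j)` is assumed to be the measured delay of the sub-model holding operators i..j-1; here it is modeled with a toy ×10 page-fault penalty when the sub-model's weights exceed device memory, which is an illustrative assumption rather than the patent's measurement procedure:

```python
def dp_min_latency(n, L_compute, L_communicate):
    """Low_Cost[j] = min_i { Low_Cost[i] + L_compute(i..j) + L_communicate(i) }."""
    low_cost = [0.0] * (n + 1)          # Low_Cost table
    position = [0] * (n + 1)            # Position table: chosen split point i
    for j in range(1, n + 1):
        cands = [(low_cost[i] + L_compute(i, j) + L_communicate(i), i)
                 for i in range(j)]
        low_cost[j], position[j] = min(cands)
    return low_cost[n], position

weights, delays = [60, 60, 60, 60], [4.0, 6.0, 3.0, 5.0]

def L_compute(i, j):                    # x10 thrashing penalty over 128 MB
    penalty = 10.0 if sum(weights[i:j]) > 128 else 1.0
    return sum(delays[i:j]) * penalty

L_communicate = lambda i: 0.0 if i == 0 else 2.0   # cut before operator i

print(dp_min_latency(4, L_compute, L_communicate))  # (20.0, [0, 0, 0, 1, 2])
```

This reaches the same 20.0 ms optimum as exhaustive enumeration would on this chain, but in O(n²) evaluations of the recurrence rather than O(2^n) leaves.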
Referring to fig. 4, in the embodiment of the present invention, assuming that the optimization target of the current scheme is the highest throughput rate, at this time, the delay depends on the device with the longest operation time, and the delays of the submodels on the devices need to be equalized as much as possible, the steps are as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; each node with branches has multiple egress nodes, each convergence node has multiple ingress nodes, and these appear in the topology graph as nodes with more than two edges;
step two, from Node_1 to Node_n: for Node_i, query the topology graph to obtain the branch nodes connected to it, and for each branch node Node_{i+1}, generate two branches representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models;
step three, DFS traversal is carried out on the search tree generated in the step two, and a segmentation scheme is obtained;
step four, for Node_i, if it has multiple branch nodes, find the ingress node Node_j at which all branch nodes converge; schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, finally obtaining x schemes.
Step five, traverse the schemes; for the a-th scheme Plan_a, if the number of devices D_a required by the scheme satisfies D_a > D (D being the number of available devices), eliminate the scheme.
Step six, traverse the remaining schemes; for each scheme Plan_a and each sub-model G_a in the scheme, calculate the overhead Cost_ab of the corresponding operators on each device device_b. For Plan_a, combine the Cost_ab values over the different devices and multiply the maximum device overhead Cost_ab by the number D of consumed devices to obtain the total overhead Total_Cost; the combination minimizing Total_Cost is the optimal combination of Plan_a, recorded as the final partition scheme Opti_Plan_a, and its minimum overhead is recorded as Low_Cost_a.
And step seven, compare the Low_Cost_a values of all x Opti_Plan_a schemes, find the candidate partition scheme Opti_Plan with the minimum Low_Cost, and output it as the optimal partition scheme.
In the embodiment of the invention, the same equipment optimization scheme with the highest throughput rate comprises the following steps: when the equipment configuration in the application scene is completely the same, a dynamic programming algorithm with lower complexity can be adopted to obtain a deep learning model segmentation scheme with the optimal throughput rate. The core idea is to minimize the delay of the submodel with the highest delay, i.e. load balancing.
The state transition equation is expressed as:
Low_Cost_n = min{ max{Low_Cost_0, L_compute(0..n) × (d_0 + 1)}, max{Low_Cost_1, L_compute(1..n) × (d_1 + 1)}, ..., max{Low_Cost_{n-1}, L_compute(n-1..n) × (d_{n-1} + 1)} }
In general, L_communicate ≪ L_compute, so the delay of L_communicate is not considered. The n-th operator forms a sub-model with some preceding operators, and the delay is determined by the maximum sub-model delay; Low_Cost is always recurred from the best cases of the first n operator combinations. Define Low_Cost_0 = 0 and L_communicate(0) = 0, where L_compute(i..j) is the calculation delay of the sub-model formed by combining operators i through j, Cost_i is the calculation delay of the i-th operator, L_communicate is the data-transmission delay, and d_n is the number of devices occupied by the optimal scheme for operators 1 through n. A Low_Cost table saves each calculated Low_Cost_i, a Position table records, for Node_j, the position i selected for Low_Cost_i, and a Number table records the number of devices required by the corresponding scheme.
In the above embodiment of the present invention, the step of obtaining the deep learning model segmentation scheme with the optimal throughput rate by using the dynamic programming algorithm is as follows:
step one, numbering the nodes in sequence according to the input-output topological structure, denoted Node_1, Node_2, ..., Node_n; their connectivity is represented by an n × n matrix M. Each node with branches has multiple egress nodes, and each convergence node has multiple ingress nodes.
Step two, starting from Node_1: for Node_i, if Node_i is the parent node of branch nodes, find the convergence node Node_j of the branches; the algorithm is called recursively on the nodes between Node_i and Node_j, with the optimization target min{Number_i..j × max{Low_Cost_i..j}} (the minimum delay depends on the branch with the largest delay, so the best combination minimizes the delay of that branch).
Step three, for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost × device_0..j, and record the sub-model combination corresponding to Low_Cost_j.
Step four, after all nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition and deployment scheme can be obtained by querying the recorded combinations.
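A hedged sketch transcribing the throughput state transition equation above for a linear chain, ignoring L_communicate as the text suggests. The Low_Cost, Position, and Number tables are maintained as described; the toy memory-penalty model supplying `L_compute(i, j)` is an illustrative assumption:

```python
def dp_max_throughput(n, L_compute):
    """Returns (Low_Cost_n, devices used, Position table), where
    Low_Cost[j] = min_i { max(Low_Cost[i], L_compute(i..j) * (d_i + 1)) }."""
    low_cost = [0.0] * (n + 1)   # Low_Cost table
    devices = [0] * (n + 1)      # Number table: devices used by best scheme
    position = [0] * (n + 1)     # Position table
    for j in range(1, n + 1):
        best, arg = float("inf"), 0
        for i in range(j):
            cand = max(low_cost[i], L_compute(i, j) * (devices[i] + 1))
            if cand < best:
                best, arg = cand, i
        low_cost[j], position[j] = best, arg
        devices[j] = devices[arg] + 1
    return low_cost[n], devices[n], position

weights, delays = [60, 60, 60, 60], [4.0, 6.0, 3.0, 5.0]

def L_compute(i, j):   # sub-model thrashes (x10) if its weights exceed 128 MB
    penalty = 10.0 if sum(weights[i:j]) > 128 else 1.0
    return sum(delays[i:j]) * penalty

print(dp_max_throughput(4, L_compute))  # (15.0, 3, [0, 0, 0, 2, 3])
```

Compared with the latency-first result, this solution spends an extra device to shrink the slowest stage, reflecting the load-balancing goal of the throughput scheme.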
Referring to fig. 5, the present invention successfully partitions a large-parameter model into small sub-models, so that the sub-models fit the memory capacity of the computing devices, thereby reducing non-computation overhead and fully exploiting the capabilities of the existing computing devices.
The invention preferably adopts the ONNX model protocol to store the deep learning model, which can be loaded into various deep learning frameworks, interfaces with each framework's API, and is exported to various hardware back ends. Communication among the computing devices uses the TCP protocol, requires no complex data-format conversion, and has good compatibility and universality. Deploying the model on different devices naturally introduces communication overhead, but the introduced communication overhead is far smaller than the computation overhead of the model; for a cloud computing platform, the communication overhead is negligible. Compared with work that loads parameters from the hard disk in blocks, the communication overhead is far smaller than the loading overhead.
The invention provides a more flexible model segmentation algorithm that can generate configuration schemes under both high-load and low-load conditions, supports presetting device parameters in advance, and adjusts the schemes according to the devices. Once a scheme is generated, the service provider can flexibly adjust it according to demand. For the common scenario in which the computing power of the devices is nearly identical, the invention provides a lower-complexity dynamic programming algorithm that obtains a globally optimal device segmentation scheme more efficiently. The invention further provides a delay estimation model that estimates the running time of a sub-model on different devices rather than running the sub-model directly to collect statistics, greatly reducing preprocessing cost.
The model deployment device of the embodiment of the invention comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art can make modifications and equivalents to the embodiments of the present invention without departing from the spirit and scope of the present invention, which is set forth in the claims of the present application.
Claims (10)
1. A method of model deployment, comprising the steps of:
acquiring an operator model set of a deep neural network model to be deployed; carrying out operator fusion or operator segmentation processing on operator models meeting preset conditions in the operator model set to obtain a processed operator model set;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to obtain a running time set;
combining operator models in the processed operator model set by adopting a preset search method based on the running time set to obtain a sub-model set;
and deploying the deep neural network model to be deployed on the equipment set based on the sub-model set to complete model deployment.
2. The model deployment method according to claim 1, wherein the step of obtaining the operator model set of the deep neural network model to be deployed specifically comprises:
and selecting a single-layer neural network as basic granularity, dividing the deep neural network model to be deployed, and obtaining an operator model set.
3. A model deployment method according to claim 1,
the method for processing the operator model meeting the preset conditions in the operator model set through operator fusion or operator segmentation comprises the following steps of:
comparing the parameter quantity of each operator model in the operator model set with the memory of the device with the smallest memory in the device set; performing operator segmentation on operator models whose parameter quantity is larger than 1/2 of the memory until the parameter quantity is smaller than 1/2 of the memory; and performing operator fusion on operator models whose parameter quantity is smaller than 1/10 of the memory until the parameter quantity is greater than 1/10 and less than 1/2 of the memory.
4. The model deployment method according to claim 1, wherein the preset search method is a backtracking search method;
when an operator model in the processed operator model set is combined by adopting a backtracking method searching method, when the actual running time is more than or equal to the theoretical delay of the high throughput rate priority scheme, adopting the high throughput rate priority scheme; the high throughput rate priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topological structure and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, a topology graph is queried to obtain the branch nodes connected to Node_i, a search tree is obtained, and DFS traversal is performed on the search tree to obtain segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, obtaining the final x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the number of available devices is removed, obtaining the existing scheme set;
traverse the set of existing solutions: for each scheme, calculating the cost of the operator model corresponding to each sub-model on each device; combining the expenses on different devices, multiplying the maximum device expenses by the number of consumed devices to obtain total expenses, finding out a combination which enables the total expenses to be minimum as an optimal combination, and recording the optimal combination as a dividing and segmenting scheme; and comparing the minimum expenses of all the partition and segmentation schemes, and taking the partition and segmentation scheme corresponding to the minimum expenses as a final partition and segmentation scheme.
5. The model deployment method according to claim 4, wherein when a backtracking search method is used to combine the operator models in the processed operator model set, when the actual running time is less than the theoretical delay of the high throughput rate priority scheme, a low service delay priority scheme is used; the low service delay priority scheme comprises the following specific steps:
the nodes are numbered in sequence according to the input-output topological structure and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, a topology graph is queried to obtain the branch nodes connected to Node_i, a search tree is obtained, and DFS traversal is performed on the search tree to obtain segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated representing the cases where Node_i and Node_{i+1} are in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation outside the span from Node_i to Node_j is identical are treated as the same scheme and merged, obtaining the final x schemes; the x schemes are traversed, and any scheme whose required number of devices exceeds the number of available devices is removed, obtaining the existing scheme set;
traverse the set of candidate schemes: for each scheme, calculate the overhead of the operator model corresponding to each sub-model on each device; combine the overheads on the different devices by accumulating them to obtain the total overhead, find the combination that minimizes the total overhead as the optimal combination, and record it as the partition-and-segmentation scheme; then compare the minimum overheads of all partition-and-segmentation schemes and take the scheme with the smallest overhead as the final partition-and-segmentation scheme.
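The DFS enumeration of this claim can be sketched for the simple case of a chain-shaped model. This is a simplified illustration under stated assumptions: no branch nodes, so the convergence-node merging step is omitted, and the function name `enumerate_segmentations` is hypothetical. Each edge (Node_i, Node_{i+1}) branches on "same sub-model" versus "different sub-models":

```python
def enumerate_segmentations(n):
    """DFS over the search tree: for each edge (Node_i, Node_i+1), branch on
    'same sub-model' vs 'different sub-models'.  Returns each scheme as a
    list of sub-models, each sub-model a list of node indices 1..n."""
    schemes = []

    def dfs(i, current, done):
        if i == n:                                # all edges decided
            schemes.append(done + [current])
            return
        dfs(i + 1, current + [i + 1], done)       # Node_i+1 joins current sub-model
        dfs(i + 1, [i + 1], done + [current])     # cut: Node_i+1 starts a new sub-model

    dfs(1, [1], [])
    return schemes
```

A chain of n nodes yields 2^(n-1) schemes, from which those exceeding the available device count would then be pruned as the claim describes.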
6. The model deployment method according to claim 1, wherein all devices in the device set used for deploying the model are identical, and the preset search method is a dynamic-programming search method;
when the operator models in the processed operator model set are combined by the dynamic-programming search method and the actual running time is less than the theoretical delay of the high-throughput-priority scheme, a low-service-delay-priority scheme is used; the low-service-delay-priority scheme comprises the following steps:
the nodes are numbered in sequence according to the input-output topology and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents the connection relations among them; each branching node has several outgoing nodes, and each convergence node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the convergence node Node_j of those branches and, on the nodes between Node_i and Node_j, recursively invoke the algorithm whose optimization objective is min{Number_{i..j} × max{Low_Cost_{i..j}}}, finding the optimal combination that minimizes the delay of the branch; solve the optimization according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination that yields Low_Cost_j; after the traversal, Low_Cost_n is the estimated minimum delay, and the records are queried to obtain the partition and deployment scheme of the deep learning model;
wherein Low_Cost_i denotes the lowest total delay over all combinations of the sub-models formed from the 0th operator to the ith operator; the state transition equation of the combination is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) },
Low_Cost_0 = 0,
L_communicate(0) = 0,
L_communicate = Data_Size × Coefficient,
wherein L_compute(i..j) is the computation delay of the sub-model formed by combining the ith to jth operators, Cost_i is the computation delay of the ith operator, L_communicate is the communication delay, Data_Size is the size of the data to be transmitted, Coefficient is a constant that varies with the network bandwidth, and Number_{i..j} denotes the number of machines consumed between Node_i and Node_j.
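The state transition equation above, restricted to a chain of n operators, can be sketched as a short dynamic program. This is an illustrative sketch, not the patented implementation; the names are hypothetical, with `compute(i, j)` standing in for L_compute(i..j) and `comm[i]` for L_communicate(i):

```python
def min_delay_partition(n, compute, comm):
    """Low_Cost dynamic program for a chain of n operators.
    compute(i, j): delay of the sub-model holding operators i..j-1.
    comm[i]: communication delay of shipping operator i's input
    between devices, with comm[0] == 0.
    Returns (Low_Cost_n, list of recorded sub-model start indices)."""
    INF = float("inf")
    low_cost = [0.0] + [INF] * n           # Low_Cost_0 = 0
    choice = [0] * (n + 1)                 # recorded combination (argmin i)
    for j in range(1, n + 1):
        for i in range(j):                 # state transition equation
            c = low_cost[i] + compute(i, j) + comm[i]
            if c < low_cost[j]:
                low_cost[j], choice[j] = c, i
    starts, j = [], n                      # query the records backwards
    while j > 0:
        starts.append(choice[j])
        j = choice[j]
    return low_cost[n], starts[::-1]
```

Walking `choice` backwards from n reproduces the recorded sub-model combination, mirroring the "querying records" step of the claim.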
7. The model deployment method according to claim 6, wherein, when the operator models in the processed operator model set are combined by the dynamic-programming search method and the actual running time is greater than or equal to the theoretical delay of the high-throughput-priority scheme, the high-throughput-priority scheme is used; the high-throughput-priority scheme comprises the following steps:
the nodes are numbered in sequence according to the input-output topology and are denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and an n × n matrix M represents the connection relations among them; each branching node has several outgoing nodes, and each convergence node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the convergence node Node_j of those branches and, on the nodes between Node_i and Node_j, recursively invoke the algorithm whose optimization objective is min{Number_{i..j} × max{Low_Cost_{i..j}}}, finding the optimal combination that minimizes the delay of the branch; solve the optimization according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and record the sub-model combination corresponding to Low_Cost_j; after Node_n has been traversed, Low_Cost_n is the estimated minimum delay, and the recorded sub-model combinations are queried to obtain the partition and deployment scheme of the deep learning model.
8. A model deployment apparatus, comprising:
an operator model set acquisition module, configured to acquire an operator model set of a deep neural network model to be deployed, and to perform operator fusion or operator segmentation on the operator models in the set that meet preset conditions, obtaining a processed operator model set;
a running time set obtaining module, configured to obtain a running time of each operator model in the processed operator model set on each device in a device set used for deploying the model, and obtain a running time set;
a sub-model set obtaining module, configured to combine operator models in the processed operator model set by using a preset search method according to the running time set, so as to obtain a sub-model set;
and the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set to complete model deployment.
9. An electronic device, comprising: a processor; and a memory for storing computer program instructions; wherein
the computer program instructions, when loaded and executed by the processor, cause the processor to perform the model deployment method of any one of claims 1 to 7.
10. A readable storage medium storing computer program instructions, wherein the computer program instructions, when loaded and executed by a processor, cause the processor to perform the model deployment method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110567899.5A CN113220457B (en) | 2021-05-24 | 2021-05-24 | Model deployment method, model deployment device, terminal equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110567899.5A CN113220457B (en) | 2021-05-24 | 2021-05-24 | Model deployment method, model deployment device, terminal equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113220457A true CN113220457A (en) | 2021-08-06 |
CN113220457B CN113220457B (en) | 2024-03-22 |
Family
ID=77098247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110567899.5A Active CN113220457B (en) | 2021-05-24 | 2021-05-24 | Model deployment method, model deployment device, terminal equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220457B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115118592A (en) * | 2022-06-15 | 2022-09-27 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis |
CN115115062A (en) * | 2022-06-29 | 2022-09-27 | 北京百度网讯科技有限公司 | Machine learning model establishing method, related device and computer program product |
CN115981870A (en) * | 2023-03-10 | 2023-04-18 | 之江实验室 | Data processing method and device, storage medium and electronic equipment |
CN116050499A (en) * | 2023-04-03 | 2023-05-02 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Self-adaptive model partitioning method, system and equipment in model parallel training |
CN116070675A (en) * | 2023-03-06 | 2023-05-05 | 西南交通大学 | Side slope neural network model selection method, device, equipment and readable storage medium |
CN116167463A (en) * | 2023-04-26 | 2023-05-26 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
CN116306856A (en) * | 2023-05-17 | 2023-06-23 | 之江实验室 | Deep learning model deployment method and device based on search |
CN116630632A (en) * | 2023-07-25 | 2023-08-22 | 腾讯科技(深圳)有限公司 | Image segmentation model quantization method, device and equipment and computer storage medium |
CN117155791A (en) * | 2023-10-31 | 2023-12-01 | 浪潮电子信息产业股份有限公司 | Model deployment method, system, equipment and medium based on cluster topology structure |
CN117311998A (en) * | 2023-11-30 | 2023-12-29 | 卓世未来(天津)科技有限公司 | Large model deployment method and system |
WO2024014728A1 (en) * | 2022-07-11 | 2024-01-18 | Samsung Electronics Co., Ltd. | Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298437A (en) * | 2019-06-28 | 2019-10-01 | Oppo广东移动通信有限公司 | Separation calculation method, apparatus, storage medium and the mobile terminal of neural network |
US20190325307A1 (en) * | 2018-04-20 | 2019-10-24 | EMC IP Holding Company LLC | Estimation of resources utilized by deep learning applications |
CN110490322A (en) * | 2019-08-14 | 2019-11-22 | 北京中科寒武纪科技有限公司 | Method for splitting and device, the electronic equipment and storage medium of operation node |
CN110674936A (en) * | 2019-09-24 | 2020-01-10 | 上海寒武纪信息科技有限公司 | Neural network processing method and device, computer equipment and storage medium |
CN111340237A (en) * | 2020-03-05 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Data processing and model operation method, device and computer equipment |
CN111738434A (en) * | 2020-06-03 | 2020-10-02 | 中国科学院计算技术研究所 | Method for executing deep neural network on heterogeneous processing unit |
CN112270399A (en) * | 2020-09-29 | 2021-01-26 | 北京百度网讯科技有限公司 | Operator registration processing method and device based on deep learning and electronic equipment |
WO2021057465A1 (en) * | 2019-09-26 | 2021-04-01 | 中兴通讯股份有限公司 | Method and apparatus for performing parallel processing on deep learning model |
CN112686378A (en) * | 2020-12-23 | 2021-04-20 | 展讯通信(上海)有限公司 | Calculation deployment method and device of neural network, storage medium and computer equipment |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190325307A1 (en) * | 2018-04-20 | 2019-10-24 | EMC IP Holding Company LLC | Estimation of resources utilized by deep learning applications |
CN110298437A (en) * | 2019-06-28 | 2019-10-01 | Oppo广东移动通信有限公司 | Separation calculation method, apparatus, storage medium and the mobile terminal of neural network |
CN110490322A (en) * | 2019-08-14 | 2019-11-22 | 北京中科寒武纪科技有限公司 | Method for splitting and device, the electronic equipment and storage medium of operation node |
CN110674936A (en) * | 2019-09-24 | 2020-01-10 | 上海寒武纪信息科技有限公司 | Neural network processing method and device, computer equipment and storage medium |
WO2021057465A1 (en) * | 2019-09-26 | 2021-04-01 | 中兴通讯股份有限公司 | Method and apparatus for performing parallel processing on deep learning model |
CN111340237A (en) * | 2020-03-05 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Data processing and model operation method, device and computer equipment |
CN111738434A (en) * | 2020-06-03 | 2020-10-02 | 中国科学院计算技术研究所 | Method for executing deep neural network on heterogeneous processing unit |
CN112270399A (en) * | 2020-09-29 | 2021-01-26 | 北京百度网讯科技有限公司 | Operator registration processing method and device based on deep learning and electronic equipment |
CN112686378A (en) * | 2020-12-23 | 2021-04-20 | 展讯通信(上海)有限公司 | Calculation deployment method and device of neural network, storage medium and computer equipment |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115118592B (en) * | 2022-06-15 | 2023-08-08 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator feature analysis |
CN115118592A (en) * | 2022-06-15 | 2022-09-27 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis |
CN115115062A (en) * | 2022-06-29 | 2022-09-27 | 北京百度网讯科技有限公司 | Machine learning model establishing method, related device and computer program product |
CN115115062B (en) * | 2022-06-29 | 2023-09-19 | 北京百度网讯科技有限公司 | Machine learning model building method, related device and computer program product |
WO2024014728A1 (en) * | 2022-07-11 | 2024-01-18 | Samsung Electronics Co., Ltd. | Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device |
CN116070675A (en) * | 2023-03-06 | 2023-05-05 | 西南交通大学 | Side slope neural network model selection method, device, equipment and readable storage medium |
CN116070675B (en) * | 2023-03-06 | 2023-06-09 | 西南交通大学 | Side slope neural network model selection method, device, equipment and readable storage medium |
CN115981870A (en) * | 2023-03-10 | 2023-04-18 | 之江实验室 | Data processing method and device, storage medium and electronic equipment |
CN116050499A (en) * | 2023-04-03 | 2023-05-02 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Self-adaptive model partitioning method, system and equipment in model parallel training |
CN116167463A (en) * | 2023-04-26 | 2023-05-26 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
CN116167463B (en) * | 2023-04-26 | 2023-07-07 | 之江实验室 | Distributed model training container scheduling method and device for intelligent computing |
CN116306856B (en) * | 2023-05-17 | 2023-09-05 | 之江实验室 | Deep learning model deployment method and device based on search |
CN116306856A (en) * | 2023-05-17 | 2023-06-23 | 之江实验室 | Deep learning model deployment method and device based on search |
CN116630632A (en) * | 2023-07-25 | 2023-08-22 | 腾讯科技(深圳)有限公司 | Image segmentation model quantization method, device and equipment and computer storage medium |
CN116630632B (en) * | 2023-07-25 | 2023-11-03 | 腾讯科技(深圳)有限公司 | Image segmentation model quantization method, device and equipment and computer storage medium |
CN117155791A (en) * | 2023-10-31 | 2023-12-01 | 浪潮电子信息产业股份有限公司 | Model deployment method, system, equipment and medium based on cluster topology structure |
CN117155791B (en) * | 2023-10-31 | 2024-02-13 | 浪潮电子信息产业股份有限公司 | Model deployment method, system, equipment and medium based on cluster topology structure |
CN117311998A (en) * | 2023-11-30 | 2023-12-29 | 卓世未来(天津)科技有限公司 | Large model deployment method and system |
CN117311998B (en) * | 2023-11-30 | 2024-03-05 | 卓世未来(天津)科技有限公司 | Large model deployment method and system |
Also Published As
Publication number | Publication date |
---|---|
CN113220457B (en) | 2024-03-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113220457B (en) | Model deployment method, model deployment device, terminal equipment and readable storage medium | |
CN114186633B (en) | Distributed training method, device, equipment and storage medium of model | |
US20220129302A1 (en) | Data processing system and method for heterogeneous architecture | |
KR20200113744A (en) | Method and apparatus for partitioning deep neural networks | |
CN108304925B (en) | Pooling computing device and method | |
CN113315669B (en) | Cloud edge cooperation-based throughput optimization machine learning inference task deployment method | |
CN115237580B (en) | Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method | |
Tanaka et al. | Automatic graph partitioning for very large-scale deep learning | |
CN112015765B (en) | Spark cache elimination method and system based on cache value | |
CN112862083B (en) | Deep neural network inference method and device in edge environment | |
CN107528731B (en) | Network segmentation optimization algorithm applied to NS3 parallel simulation | |
CN113986485A (en) | Cross-data center data transmission energy-saving optimization method and system | |
CN115392467B (en) | Cloud edge cooperative self-adaptive depth reasoning method for real-time processing of mass data | |
Henna et al. | Distributed and collaborative high-speed inference deep learning for mobile edge with topological dependencies | |
CN113824650B (en) | Parameter transmission scheduling algorithm and system in distributed deep learning system | |
CN116418808A (en) | Combined computing unloading and resource allocation method and device for MEC | |
CN112601232B (en) | Load balancing multi-service migration method and system based on minimum cost and maximum flow | |
CN114816742A (en) | Request processing method and device, electronic equipment and storage medium | |
CN114035906A (en) | Virtual machine migration method and device, electronic equipment and storage medium | |
CN116306943B (en) | AIoT-oriented multi-task local collaborative reasoning method and system | |
CN117573379B (en) | Micro-service deployment method based on symmetrical scaling merging | |
US20230111791A1 (en) | Artificial intelligence planning method and artificial intelligence planning device | |
CN116980423B (en) | Model scheduling method, device, computing system, equipment and readable storage medium | |
CN110099003B (en) | Parallel routing optimization method under elastic optical network | |
Myna | Heterogeneous adaptive heuristics for graph processing in Geo distributed Data Centre |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20220615 Address after: 518031 room 1410, building 1, Changfu Jinmao building, south side of Shihua Road, Fubao community, Fubao street, Futian District, Shenzhen, Guangdong Applicant after: Shenzhen Zhixin Huaxi Information Technology Co.,Ltd. Address before: 710077 11th floor, building B2, yunhuigu 156, software new town, Tiangu 8th Road, high tech Zone, Xi'an City, Shaanxi Province Applicant before: Cross Information Core Technology Research Institute (Xi'an) Co.,Ltd. |
|
GR01 | Patent grant | ||