CN113220457B - Model deployment method, model deployment device, terminal equipment and readable storage medium

Model deployment method, model deployment device, terminal equipment and readable storage medium

Info

Publication number
CN113220457B
CN113220457B (application CN202110567899.5A)
Authority
CN
China
Prior art keywords
node
model
operator
scheme
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110567899.5A
Other languages
Chinese (zh)
Other versions
CN113220457A (en)
Inventor
李发兵
林伟伟
李想
毛兴中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixin Huaxi Information Technology Co ltd
Original Assignee
Shenzhen Zhixin Huaxi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhixin Huaxi Information Technology Co ltd
Priority to CN202110567899.5A
Publication of CN113220457A
Application granted
Publication of CN113220457B
Legal status: Active (current)

Classifications

    • G06F 9/5072 Allocation of resources: partitioning or combining of resources; grid computing
    • G06F 9/5044 Allocation of resources to service a request, the resource being a machine, considering hardware capabilities
    • G06F 9/505 Allocation of resources to service a request, the resource being a machine, considering the load
    • G06N 3/045 Neural network architectures: combinations of networks
    • G06N 3/048 Neural network architectures: activation functions
    • G06N 3/08 Neural networks: learning methods
    • G06N 3/105 Neural network interfaces, programming languages or SDKs: shells for specifying net layout
    • G06F 2209/5017 Indexing scheme relating to resource allocation: task decomposition

Abstract

The invention discloses a model deployment method, a model deployment device, terminal equipment and a readable storage medium. The method comprises the following steps: acquiring an operator model set of a deep neural network model to be deployed; performing operator fusion or operator segmentation on the operator models in the set that meet preset conditions, to obtain a processed operator model set; acquiring the running time of each operator model in the processed set on each device in the device set on which the model is to be deployed, to obtain a running time set; based on the running time set, combining the operator models in the processed set using a preset search method to obtain a sub-model set; and, based on the sub-model set, deploying the deep neural network model to be deployed on the device set. The invention is fully compatible with devices of different computing power and improves operation efficiency and overall throughput.

Description

Model deployment method, model deployment device, terminal equipment and readable storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to the field of model deployment, and in particular relates to a model deployment method, a model deployment device, terminal equipment and a readable storage medium.
Background
Machine Learning (ML) is one of the fastest-growing areas of computer science today. A typical machine learning technique trains a statistical model for a specific application on a large pre-collected data set, updating the model parameters (also called "weights") until convergence; the trained model is then used for inference, i.e., predicting results on new data. Deep learning based on neural networks is the most widely used ML algorithm because of its excellent results. Neural network models are multi-layer directed acyclic graphs (DAGs); a model typically consists of operations such as convolution, matrix multiplication, pooling, and batch normalization, connected in linear chains or in more complex patterns (e.g., branches and residual connections). The generalization ability and accuracy of neural network models generally improve with deeper topologies and larger network layers, as in ResNet and VGG, but this also incurs higher execution latency.
Researchers continually introduce complex network models to achieve better generalization ability and accuracy; as the number of model parameters grows, the computational complexity becomes ever higher. For example, the GPT-3 model proposed by OpenAI has a staggering 175 billion parameters, and the model occupies more than 700 GB of storage space. Researchers have proposed various system-level optimization schemes to address the performance challenges posed by larger and more complex models. On the hardware side, GPUs and dedicated accelerators (e.g., Google TPUs) support faster computation; on the software side, many frameworks bridge the gap between productivity-centric high-level interfaces and performance-oriented low-level implementations, including TensorFlow, PyTorch, MXNet, and TVM.
However, the above studies do not solve the deployment problem of large-parameter models well. Specifically, most prior work assumes by default that the machine running the deep learning model has sufficient memory and disk, so that the model can run directly on a single machine to produce its output. On cloud server clusters, distributed storage is generally used to obtain better file read/write speed and throughput, but more time is wasted on transmission, and the computing capability of all devices cannot be fully exploited. With the development of deep learning, researchers have found that computational resources are always limited; in addition, the device responsible for computation generally needs strong computing power, while the device responsible for reading and writing leaves its computing power idle most of the time, which both causes huge waste and makes the throughput rate difficult to improve.
In view of the foregoing, in order to enable deep learning models to run in resource-limited scenarios, a new method, system, device and medium for deploying deep learning models under limited resources are needed.
Disclosure of Invention
The invention aims to provide a model deployment method, a model deployment device, terminal equipment and a readable storage medium, so as to solve one or more of the above technical problems. The invention is fully compatible with devices of different computing power and improves operation efficiency and overall throughput.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention discloses a model deployment method, which comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models, and acquiring a running time set;
based on the running time set, combining operator models in the processed operator model set using a preset search method to obtain a sub-model set;
and based on the sub-model set, deploying the deep neural network model to be deployed on the equipment set to complete model deployment.
The invention further improves that the step of acquiring the operator model set of the deep neural network model to be deployed specifically comprises the following steps:
and selecting a single-layer neural network as a basic granularity, and dividing a deep neural network model to be deployed to obtain an operator model set.
The invention further improves that the operator model meeting the preset condition in the operator model set is subjected to operator fusion or operator segmentation treatment, and the step of obtaining the treated operator model set specifically comprises the following steps:
comparing the parameter size of each operator model in the operator model set with the memory of the device with the smallest memory in the device set; performing operator segmentation on any operator model whose parameter size is larger than that memory, until its parameter size is smaller than 1/2 of the memory; and performing operator fusion on any operator model whose parameter size is less than 1/10 of the memory, until the parameter size is greater than 1/10 and less than 1/2 of the memory.
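For illustration, a minimal Python sketch of this preprocessing rule is given below; it is a sketch under the thresholds stated above, not the patented implementation, and all names (OperatorModel, split_operator, preprocess) are hypothetical.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class OperatorModel:
    name: str
    param_bytes: int          # size of the operator's parameters

def split_operator(op: OperatorModel, upper: int) -> List[OperatorModel]:
    """Split an oversized operator into parallel shards, each below `upper`."""
    n = -(-op.param_bytes // (upper - 1))     # ceil division: number of shards
    base, rem = divmod(op.param_bytes, n)
    return [OperatorModel(f"{op.name}_part{i}", base + (1 if i < rem else 0))
            for i in range(n)]

def preprocess(ops: List[OperatorModel], min_mem: int) -> List[OperatorModel]:
    """Apply the 1/2 and 1/10 thresholds of the smallest device memory."""
    upper, lower = min_mem // 2, min_mem // 10
    out: List[OperatorModel] = []
    for op in ops:
        if op.param_bytes > min_mem:          # too big: segment into shards < 1/2
            out.extend(split_operator(op, upper))
        elif (op.param_bytes < lower and out
              and out[-1].param_bytes + op.param_bytes < upper):
            prev = out.pop()                  # tiny: fuse into the previous operator
            out.append(OperatorModel(prev.name + "+" + op.name,
                                     prev.param_bytes + op.param_bytes))
        else:
            out.append(op)
    return out

print([(o.name, o.param_bytes) for o in
       preprocess([OperatorModel("conv", 3_000), OperatorModel("bn", 40),
                   OperatorModel("fc", 900)], min_mem=2_000)])
```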
The invention is further improved in that the preset search method is a backtracking search method;
when combining operator models in the processed operator model set with the backtracking search method, a high-throughput-priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high-throughput-priority scheme; the high-throughput-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, the topology graph is queried for each Node_i to obtain the branch nodes connected to Node_i, yielding a search tree, and a DFS traversal of the search tree produces the segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branches converge is found, and schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, giving x final schemes; the x schemes are traversed, and any scheme requiring more devices than the number D of available devices is removed, giving the surviving scheme set;
the surviving scheme set is traversed: for each scheme, the cost of the operator models corresponding to each sub-model on each device is calculated; the costs on the different devices are combined, the maximum device cost is multiplied by the number of consumed devices to obtain the total cost, and the combination with the smallest total cost is found as the optimal combination and recorded as the division scheme; the minimum costs of all division schemes are compared, and the division scheme with the smallest minimum cost is taken as the final division scheme.
The invention is further improved in that, when combining operator models in the processed operator model set with the backtracking search method, a low-service-delay-priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high-throughput-priority scheme; the low-service-delay-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n; starting from Node_1 and proceeding to Node_n, the topology graph is queried for each Node_i to obtain the branch nodes connected to Node_i, yielding a search tree, and a DFS traversal of the search tree produces the segmentation schemes; wherein, for each branch node Node_{i+1} of Node_i, two branches are generated, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models; when Node_i has multiple branch nodes, the ingress node Node_j at which all branches converge is found, and schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, giving x final schemes; the x schemes are traversed, and any scheme requiring more devices than the number D of available devices is removed, giving the surviving scheme set;
the surviving scheme set is traversed: for each scheme, the cost of the operator models corresponding to each sub-model on each device is calculated; the costs on the different devices are combined and accumulated to obtain the total cost, and the combination with the smallest total cost is found as the optimal combination and recorded as the division scheme; the minimum costs of all division schemes are compared, and the division scheme with the smallest minimum cost is taken as the final division scheme.
A further improvement of the invention is that all devices in the device set for deploying the model are identical, and the preset search method is a dynamic programming search method;
when combining operator models in the processed operator model set with the dynamic programming search method, a low-service-delay-priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high-throughput-priority scheme; the low-service-delay-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, with their connectivity represented by an n×n matrix M; each branching node has several outgoing nodes, and each converging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}} to find the optimal combination that minimizes the delay of this branch; the optimization scheme is solved according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and the sub-model combination of Low_Cost_j is recorded; after the traversal, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the record; Number_{i..j} denotes the number of machines consumed between Node_i and Node_j;
wherein, low_cost is adopted i The combination mode with the lowest total delay in all sub-model combination modes from the 0 th operator to the i th operator is represented as a state transition equation:
Low_Cost n =min{Low_Cost 0 +L compute (0..n)+L communicate (0),Low_Cost 1 +L compute (1..n)+L communicate (1),...,Low_Cost n-1 +L compute (n-1..n)+L communicate (n-1)}
Low_Cost 0 =0,
L communicate (0)=0,
L communicate =Data_Size*Coefficient,
wherein L is compute (i..j) calculating delay, cost, for a sub-model formed by combining the ith operator to the jth operator i For the calculation delay of the ith operator, L communicate Referring to the delay of Data transmission, data_size refers to the Size of Data to be transmitted, and Coefficient is a constant that varies according to the bandwidth of the network.
The invention is further improved in that, when combining operator models in the processed operator model set with the dynamic programming search method, the high-throughput-priority scheme is adopted when the actual running time is greater than or equal to its theoretical delay; the high-throughput-priority scheme specifically comprises the following steps:
the nodes are numbered sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, with their connectivity represented by an n×n matrix M; each branching node has several outgoing nodes, and each converging node has several incoming nodes;
traverse from Node_1 to Node_n; if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}} to find the optimal combination that minimizes the delay of this branch; the optimization scheme is solved according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and the sub-model combination corresponding to Low_Cost_j is recorded; after Node_n is traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the recorded sub-model combinations.
The invention provides a model deployment device, comprising:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
the running time set acquisition module is used for acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to acquire a running time set;
the sub-model set acquisition module is used for combining the operator models in the processed operator model set by adopting a preset search method according to the running time set to obtain a sub-model set;
The deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set, so as to complete model deployment.
An electronic apparatus of the present invention includes: a processor; and a memory for storing computer program instructions; when loaded and executed by the processor, the computer program instructions perform the model deployment method according to any of the above aspects of the present invention.
A readable storage medium of the present invention stores computer program instructions that, when loaded and executed by a processor, perform the model deployment method of any of the above-described aspects of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
in the model deployment method provided by the invention, the large-parameter model to be deployed is divided into a number of small operator models, and sub-models are obtained from these operator models with a preset planning method, so that the sub-models fit the memory capacities of computing devices of different computing power, non-compute overhead is reduced, and the capability of existing computing devices is fully exploited. The model deployment method can deploy a model with a large parameter count on devices of limited computing power. It should be noted that deploying the model on different devices naturally introduces communication overhead, but the introduced communication overhead is far smaller than the computation overhead of the model itself; for a cloud computing platform, the communication overhead is likewise far smaller than the loading overhead of block-loading the parameters from a hard disk.
In an embodiment of the further improved model deployment method, a concrete model segmentation algorithm is provided that can generate configuration schemes for both high-load and low-load conditions, called the low-service-delay-priority scheme and the high-throughput-priority scheme respectively; it also supports presetting device parameters in advance and adjusting the scheme to the devices. Once a scheme is generated, the service provider can flexibly adjust it as required.
In an embodiment of the further improved model deployment method, for the common scenario where the computing power of all computing devices is almost the same, a dynamic programming algorithm of lower complexity is provided, and a globally optimal device segmentation scheme can be obtained more efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it will be apparent to those of ordinary skill in the art that the following drawings show some embodiments of the invention, and that other drawings may be derived from them without undue effort.
FIG. 1 is a schematic block flow diagram of a model deployment method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an operator fusion in an embodiment of the invention;
FIG. 3 is a flow chart of low service latency priority in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart illustrating high throughput prioritization in an embodiment of the present invention;
FIG. 5 is a schematic diagram of deployment results in an embodiment of the present invention.
Detailed Description
In order to make the purposes, technical effects and technical solutions of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it will be apparent that the described embodiments are some of the embodiments of the present invention. Other embodiments, which may be made by those of ordinary skill in the art based on the disclosed embodiments without undue burden, are within the scope of the present invention.
The model deployment method of the embodiment of the invention comprises the following steps:
acquiring an operator model set of a deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
Acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models, and acquiring a running time set;
based on the running time set, combining operator models in the processed operator model set by adopting a preset searching method to obtain a sub model set;
and based on the sub-model set, deploying the deep neural network model to be deployed on the equipment set to complete model deployment.
Machine learning has become the most important and popular technique in modern data-driven applications, and recent advances in deep learning have achieved unprecedented success in challenging tasks such as image/video classification and natural language processing. In recent years, researchers have continually introduced complex network models to achieve better generalization ability and accuracy; with this, however, the number of model parameters has grown and the computational complexity has become ever higher. The GPT-3 model proposed by OpenAI has a staggering 175 billion parameters, and the model occupies more than 700 GB of storage space. With the development of deep learning, researchers have found that computational resources are always limited. In order to enable deep learning models to run in resource-limited scenarios, the embodiments of the present invention provide a new model deployment method, which divides the model to be deployed into small units and deploys them on devices with weak computing capability, bringing the computing power of those devices into play while reducing read/write overhead. Specifically, the embodiments of the invention disclose a flexible deep learning model deployment method that makes a deep learning model fully compatible with devices of different computing power and fully exploits the devices' computing power, so as to obtain optimal operation efficiency and overall throughput; the invention can fully account for the differences in computing power among powerful devices in cloud computing, weaker devices in edge computing, and even embedded devices, and can obtain the globally optimal throughput.
In the embodiment of the invention, the overhead mainly comes from two parts, namely the execution time of the model and the communication time between devices; wherein typically the execution time is much longer than the communication time.
In the embodiment of the present invention, two optimization scenarios are set, including:
(1) In the low-frequency request scene, the model deployment method should provide lower prediction delay for a single request of a user;
(2) In the high-frequency request scene, the model deployment method should optimize throughput rate as a whole and provide services for more users in unit time under the condition of ensuring that single prediction delay is acceptable.
Notably, when optimizing for throughput, all devices form a relatively complex pipeline, and throughput depends on the slowest stage of pipeline; thus, optimization schemes in high throughput scenarios will also generally tend to make the load of the individual devices more balanced.
In the embodiment of the invention, once the available resources and the model to be deployed are determined, a low-service-delay-priority scheme and a high-throughput-priority scheme can be generated, and the corresponding theoretical delays Time_latency and Time_throughput are computed at the same time. In actual operation, the low-service-delay-priority scheme is adopted first; when the actual running time of a user request becomes longer than the theoretical delay Time_throughput of the high-throughput-priority scheme, the system switches to the high-throughput-priority scheme and records the current request count as Number_flag. Definition: when the number of user requests Number_request > Number_flag, the environment is high-load and the high-throughput (throughput-first) scheme is adopted; when Number_request < Number_flag, the environment is low-load and the low-service-delay (latency-first) scheme is adopted.
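This switching logic can be sketched as follows; it is a simplified illustration of the rule described above, with hypothetical names (SchemeSelector, choose) and the simplifying assumption that the request counter only grows.
```python
class SchemeSelector:
    """Switch between the latency-first and throughput-first schemes.

    time_throughput is the theoretical delay Time_throughput of the
    high-throughput-priority scheme, computed offline once the model
    and the available resources are fixed.
    """
    def __init__(self, time_throughput: float):
        self.time_throughput = time_throughput
        self.number_flag = None            # request count when the switch occurred
        self.request_count = 0

    def choose(self, observed_latency: float) -> str:
        self.request_count += 1
        if self.number_flag is None and observed_latency >= self.time_throughput:
            self.number_flag = self.request_count     # record Number_flag
        if self.number_flag is not None and self.request_count >= self.number_flag:
            return "throughput-first"      # high-load environment
        return "latency-first"             # low-load environment

sel = SchemeSelector(time_throughput=0.200)
for latency in [0.120, 0.150, 0.210, 0.190]:
    print(sel.choose(latency))
```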
In the embodiment of the invention, there are n operator models in total; the computation delay of the i-th operator model is L_comp,i, and the communication delay between the i-th and (i+1)-th operator models is L_comm,i.
The optimization function under the low-delay scheme is min{ Σ_i (L_comp,i + L_comm,i) }, meaning that the operator models are combined into different sub-models so as to find the combination with the lowest overall delay; since under the high-throughput-priority scheme the devices form a relatively complex pipeline and the throughput depends on the slowest pipeline stage, the optimization function in that case is min{ n × max_i{L_comp,i, L_comm,i} }, i.e., among all operator-model combinations, take the one that minimizes the product of the number of sub-models and the delay of the slowest sub-model.
Referring to fig. 1, a model deployment method in an embodiment of the present invention is used for deploying a deep neural network model to be deployed on a set of devices defined by resources for deploying the model, and specifically includes the following steps:
screening to obtain the operator model set of the deep neural network model to be deployed, and post-processing it to obtain the processed operator model set, wherein the post-processing comprises operator fusion or operator segmentation of operator models meeting a preset condition (based on the operator model's parameter size and the memory of the smallest-memory device); running each operator model on each device in the device set and collecting delay statistics to obtain the running time set; based on the running time set, combining operator models in the processed operator model set with a preset search method to obtain the sub-model set; and, based on the sub-model set, deploying the deep neural network model to be deployed on the device set to complete model deployment.
Deep learning models have an obvious characteristic: easily distinguishable operators form the overall model according to a specific topology. If each operator is regarded as a node and the connections between them as edges, the topology of a deep learning model can be viewed as a typical directed graph, where nodes and edges each carry values representing the overhead of computation and communication. Because of differences in memory capacity, different devices can carry models of different sizes; for devices supporting virtual memory, a larger model can be run through virtual memory, but this generates a large number of "page fault interrupts", which brings huge overhead. So if a sub-model contains too many nodes, the overhead eventually rises because of page faults.
In the method provided by the embodiment of the invention, the nodes and the edges are divided into the sub-models, so that the total cost is minimum.
In the embodiment of the application, the specific steps of granularity selection include: when partitioning a deep learning model, too coarse a granularity may miss potentially optimal partitions and may make a partition too large to fit in a device's memory, while too fine a granularity slows down the search for the optimal strategy. In a deep learning model it is natural to choose a single-layer neural network as the basic granularity; each resulting sub-model partition then contains one or more network layers. In some cases, however, special adjustments are required.
In the embodiment of the application, the specific steps of operator fusion include: in addition to computation-intensive network layers such as convolution and matrix multiplication, neural networks use small network layers with little or no weight, for example merge layers, batch normalization, and ReLU activation; these layers contribute little to memory usage, so they are fused with their neighboring convolution or matrix-multiplication layers to reduce the number of basic units used for partitioning. Such layer fusion is a performance optimization commonly used in many deep learning frameworks, such as TVM and TensorFlow. As shown in fig. 2, Conv denotes a convolution layer, BN a batch-normalization layer, and ReLU the rectified linear unit.
In the embodiment of the application, the steps for handling large-parameter operators specifically include: an operator with too many parameters is split into several parallel operators to reduce the parameter count of each single operator, and the outputs of these operators can be merged so as to be equivalent to the output of the original operator. Based on empirical preference, letting SM denote the memory size of the smallest-memory device among all devices to be deployed on, operators whose parameter size exceeds 0.5 SM are generally split until each split operator is smaller than 0.5 SM.
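The equivalence between a split operator and the original can be illustrated with a small, self-contained sketch, assuming a fully connected operator; split_linear is a hypothetical helper, not the patent's code. Splitting the weight matrix column-wise yields parallel operators whose concatenated outputs match the original exactly.
```python
import numpy as np

def split_linear(weight: np.ndarray, parts: int):
    """Split a fully connected operator's weight matrix column-wise.

    Each shard is an independent parallel operator computing a slice of
    the output; concatenating the slices reproduces the original output.
    """
    return np.array_split(weight, parts, axis=1)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024))       # an operator too large for one device
x = rng.standard_normal(256)

shards = split_linear(W, parts=4)                    # four parallel operators
y_split = np.concatenate([x @ w for w in shards])    # merge the shard outputs
assert np.allclose(y_split, x @ W)                   # equivalent to the original
print(y_split.shape)
```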
In the embodiment of the application, the specific steps of branch processing include: deep learning models such as ResNet, Inception and DenseNet employ complex topologies that are not simple linear sequences but contain branches. These branches are marked individually in the embodiments of the invention. In terms of computing performance, such parallel branch units are considered the same pipeline stage, and their delay depends on the slowest one.
In the embodiment of the invention, in order to obtain the optimal partitioning scheme, the operator computation delay L_compute and communication delay L_communicate must be obtained accurately, so as to learn which combinations of operators (i.e., sub-models) may run on the same device. There is communication overhead between sub-models, but none within a sub-model. Typically an operator running on a device has two delays: a delay under low page-fault pressure and a delay under high page-fault pressure.
In the embodiment of the invention, for operator delay calculation, if the total memory usage of a sub-model exceeds the available memory, the operators in that sub-model will suffer severe page-fault interrupts. Therefore, the embodiments of the invention search for the optimal operator combination scheme using operator delay statistics together with sub-model delay prediction.
In the embodiment of the invention, the operator delay statistics step comprises: running each operator model on a device to obtain its delay under low page-fault pressure and its delay under high page-fault pressure. The performance of each operator is compiled and measured separately in both cases, using a simple stand-alone memory expansion module that occupies a large portion of the memory space: when the memory expansion module is running, the measured operator running time corresponds to the delay under high page-fault pressure; when the memory expansion module is closed, the running time with no page faults corresponds to the delay under low page-fault pressure.
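A minimal sketch of this measurement procedure is shown below; the memory expansion module is emulated here simply by holding a large byte buffer, and in practice the buffer would have to approach the physical memory size to actually force paging. All function names are hypothetical.
```python
import time

def measure_latency(run_operator, warmup: int = 3, repeats: int = 10) -> float:
    """Median wall-clock latency of one operator invocation."""
    for _ in range(warmup):
        run_operator()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_operator()
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[len(samples) // 2]

def profile_operator(run_operator, hog_bytes: int):
    """Measure delay under low and high page-fault pressure.

    The memory expansion module is emulated by holding a large buffer
    while the operator runs; to really force paging, hog_bytes must
    approach the physical memory size of the device.
    """
    low = measure_latency(run_operator)          # expansion module off
    hog = bytearray(hog_bytes)                   # expansion module on
    hog[::4096] = b"\x01" * len(hog[::4096])     # touch every page
    high = measure_latency(run_operator)
    del hog
    return low, high

demo_op = lambda: sum(i * i for i in range(200_000))  # stands in for an operator
print(profile_operator(demo_op, hog_bytes=64 << 20))
```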
In the embodiment of the invention, the sub-model delay estimation step specifically comprises: starting from the start node, generate a sub-model and calculate the memory usage of each sub-model in the search space. Deep learning models have very regular program semantics: the memory layout consists mainly of model weights, intermediate results and input/output, with the model weights taking the largest share. For a given model the memory occupied by the weights is fixed, and the memory occupied by intermediate results can be derived from the inputs and outputs of each operator. Sub-model delay estimation then proceeds as follows: for any given combination of operators, predict its memory usage, compare it with the device memory size to determine whether a large number of page faults would occur, accumulate the corresponding operator delays to obtain the sub-model delay, and record it. Repeat until the search space is traversed; finally, sum the sub-model delays into total delays and select the scheme that meets the optimization target.
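The estimation rule (compare predicted memory usage with device memory, then pick the low- or high-page-fault delay) can be sketched as follows, assuming each operator has already been profiled as described above; the dictionary fields are hypothetical.
```python
def estimate_submodel_delay(ops, device_mem: int) -> float:
    """Estimate the delay of a sub-model given per-operator profiles.

    Each op dict holds 'mem' (weights + intermediates, in bytes) and the
    two measured delays 'lat_low' / 'lat_high' (low / high page-fault
    pressure). If the accumulated memory exceeds the device memory, the
    sub-model is assumed to run under heavy page faulting.
    """
    total_mem = sum(op["mem"] for op in ops)
    key = "lat_high" if total_mem > device_mem else "lat_low"
    return sum(op[key] for op in ops)

# a toy sub-model of three profiled operators on a 1 GiB device
profiled = [
    {"mem": 300 << 20, "lat_low": 0.010, "lat_high": 0.045},
    {"mem": 500 << 20, "lat_low": 0.020, "lat_high": 0.090},
    {"mem": 400 << 20, "lat_low": 0.015, "lat_high": 0.060},
]
print(estimate_submodel_delay(profiled, 1 << 30))  # 1.2 GiB > 1 GiB -> 0.195
```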
In the embodiment of the invention, based on the measured normal running cost of each operator on each device and its running cost when a large number of page faults occur, a scheme that fits the requirements is selected from the search space. The topology of current deep learning models is a directed acyclic graph (DAG); each operator is hereinafter referred to as a node, with Node_i denoting the i-th operator. To obtain all possible sub-model combinations, the method searches by backtracking, representing the search space as a search tree.
Referring to fig. 3, in the embodiment of the invention, assume the optimization target of the current scheme is minimum delay; then, under the current device resources, only the total delay of the model division needs to be minimized. The steps are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n. Each branching node has several outgoing nodes and each converging node has several incoming nodes, shown in the topology graph as two or more edges;
Step two: from Node_1 up to Node_n, query the topology graph for each Node_i to obtain the branch nodes connected to it; for each branch node Node_{i+1}, generate two branches, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models;
Step three: perform a DFS traversal of the search tree generated in step two to obtain the segmentation schemes;
Step four: for Node_i, if it has multiple branch nodes, find the node Node_j at which all branches converge; schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, finally yielding x schemes.
Step five: traverse the schemes; for the a-th scheme Plan_a, if the number of devices it requires D_a > D (D being the number of available devices), remove the scheme.
Step six: traverse the surviving schemes; for scheme Plan_a, for each sub-model G_a in the scheme, calculate the cost Cost_ab of its corresponding operators on each device Device_b. Combine the costs Cost_ab of Plan_a on the different devices and accumulate them to obtain the total cost Total_Cost; find the combination that minimizes Total_Cost as the optimal combination of Plan_a, denote it Opti_Plan_a as the final partitioning scheme of Plan_a, and denote its minimum cost Low_Cost_a.
Step seven: compare the Low_Cost values of all x Opti_Plan_a, find the candidate partitioning scheme Opti_Plan with the smallest Low_Cost, and output it as the optimal partitioning scheme.
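A compact sketch of steps one to seven for a purely linear operator chain is given below; it enumerates all cut patterns (the two search-tree branches of steps two and three), filters by device count (step five), and evaluates either the latency objective or the throughput objective of the fig. 4 variant (step six). Device heterogeneity, branch merging (step four) and memory constraints are omitted for brevity, and all names (enumerate_partitions, best_plan, comm) are hypothetical.
```python
from itertools import product

def enumerate_partitions(n_ops: int):
    """All ways to cut a linear operator chain into consecutive sub-models.

    cut[i] == True means operator i and operator i+1 fall into different
    sub-models (the two search-tree branches of steps two and three).
    """
    for cuts in product([False, True], repeat=n_ops - 1):
        parts, part = [], [0]
        for i, cut in enumerate(cuts):
            if cut:
                parts.append(part)
                part = []
            part.append(i + 1)
        parts.append(part)
        yield parts

def best_plan(op_costs, n_devices: int, comm: float = 0.5, objective: str = "latency"):
    """Steps five to seven on identical devices.

    op_costs[i] is the delay of operator i; comm is a uniform transfer
    delay added per cut. The throughput objective multiplies the slowest
    stage by the number of stages, as in the fig. 4 variant.
    """
    best = None
    for parts in enumerate_partitions(len(op_costs)):
        if len(parts) > n_devices:                      # step five: too many devices
            continue
        stage = [sum(op_costs[i] for i in p) for p in parts]
        if objective == "latency":
            total = sum(stage) + comm * (len(parts) - 1)
        else:
            total = max(stage) * len(parts)             # slowest stage x devices
        if best is None or total < best[0]:
            best = (total, parts)
    return best

print(best_plan([3.0, 1.0, 4.0, 1.0, 5.0], n_devices=3, objective="throughput"))
```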
The preferred minimum-delay scheme for identical devices in the embodiment of the invention comprises: when the device configurations in the application scenario are completely the same, an optimal deep learning model segmentation scheme can be obtained with a dynamic programming algorithm of lower complexity. Let Low_Cost_i denote the combination with the lowest total delay among all sub-model combinations of operators 0 through i. The state transition equation is expressed as:
Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_{n-1} + L_compute(n-1..n) + L_communicate(n-1) }
For the n-th operator, the optimum is always derived recursively from the optimal cases of the first n-1 operator combinations. Define Low_Cost_0 = 0, L_communicate(0) = 0, and L_communicate = Data_Size × Coefficient, where L_compute(i..j) is the computation delay of the sub-model formed by combining operators i through j, Cost_i is the computation delay of the i-th operator, L_communicate is the data-transmission delay, Data_Size is the size (in MB) of the data to be transmitted, and Coefficient is a constant that varies with the network bandwidth (1 Gbps Ethernet peaks at about 8 ms/MB). A Low_Cost table stores the computed Low_Cost_i, and a Position table records, for each Node_j, the position i selected for Low_Cost_j.
In the above embodiment of the invention, the steps for obtaining the optimal deep learning model segmentation scheme with the dynamic programming algorithm are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n. Each branching node has several outgoing nodes and each converging node has several incoming nodes.
Step two: starting from Node_1, for each Node_i, if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{max{Low_Cost_{i..j}}} (the minimum delay depends on the branch with the largest delay), finding the optimal combination that minimizes the delay of this branch.
Step three: for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and record the sub-model combination of Low_Cost_j.
Step four: after the nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the record.
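A minimal sketch of this dynamic program for a linear chain of identical devices is given below, implementing Low_Cost_j = min_i{ Low_Cost_i + L_compute(i..j) + L_communicate(i) } together with the Low_Cost and Position tables; the function name and toy inputs are hypothetical, and branch recursion is omitted.
```python
def dp_min_delay(op_delay, comm_delay):
    """DP for the minimum-total-delay partition of a linear chain.

    op_delay[k] is the compute delay of operator k; comm_delay[i] is the
    transfer delay of the data crossing a cut placed after operator i,
    with comm_delay[0] == 0.
    """
    n = len(op_delay)
    prefix = [0.0]
    for d in op_delay:
        prefix.append(prefix[-1] + d)            # prefix sums give L_compute(i..j)
    low_cost = [0.0] * (n + 1)                   # the Low_Cost table
    position = [0] * (n + 1)                     # the Position table
    for j in range(1, n + 1):
        best, arg = float("inf"), 0
        for i in range(j):
            cost = low_cost[i] + (prefix[j] - prefix[i]) + comm_delay[i]
            if cost < best:
                best, arg = cost, i
        low_cost[j], position[j] = best, arg
    cuts, j = [], n                              # walk Position back to the cuts
    while j > 0:
        cuts.append((position[j] + 1, j))        # sub-model = operators i+1..j
        j = position[j]
    return low_cost[n], cuts[::-1]

print(dp_min_delay([3.0, 1.0, 4.0, 1.0, 5.0], [0.0, 0.5, 0.5, 0.5, 0.5]))
```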
Referring to fig. 4, in the embodiment of the invention, assume the current scheme's optimization target is the highest throughput; the delay then depends on the device with the longest running time, so the delays of the sub-models on the individual devices should be balanced as far as possible. The steps are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n. Each branching node has several outgoing nodes and each converging node has several incoming nodes, shown in the topology graph as two or more edges;
Step two: from Node_1 up to Node_n, query the topology graph for each Node_i to obtain the branch nodes connected to it; for each branch node Node_{i+1}, generate two branches, representing Node_i and Node_{i+1} being in the same sub-model and in different sub-models;
Step three: perform a DFS traversal of the search tree generated in step two to obtain the segmentation schemes;
Step four: for Node_i, if it has multiple branch nodes, find the node Node_j at which all branches converge; schemes whose segmentation is identical except between Node_i and Node_j are regarded as the same scheme and merged, finally yielding x schemes.
Step five: traverse the schemes; for the a-th scheme Plan_a, if the number of devices it requires D_a > D (D being the number of available devices), remove the scheme.
Step six: traverse the surviving schemes; for scheme Plan_a, for each sub-model G_a in the scheme, calculate the cost Cost_ab of its corresponding operators on each device Device_b. Combine the costs Cost_ab of Plan_a on the different devices, multiply the maximum device cost Cost_ab by the number D of consumed devices to obtain the total cost Total_Cost, find the combination that minimizes Total_Cost as the optimal combination of Plan_a, denote it Opti_Plan_a as the final partitioning scheme of Plan_a, and denote its minimum cost Low_Cost_a.
Step seven: compare the Low_Cost values of all x Opti_Plan_a, find the candidate partitioning scheme Opti_Plan with the smallest Low_Cost, and output it as the optimal partitioning scheme.
In the embodiment of the invention, the identical-device optimization scheme for the highest throughput comprises: when the device configurations in the application scenario are completely the same, a throughput-optimal deep learning model segmentation scheme can be obtained with a dynamic programming algorithm of lower complexity. The core idea is to minimize the delay of the highest-latency sub-model, i.e., load balancing.
The state transition equation is expressed as:
Low_Cost_n = min{ max{Low_Cost_0, L_compute(0..n) × (d_0 + 1)}, max{Low_Cost_1, L_compute(1..n) × (d_1 + 1)}, ..., max{Low_Cost_{n-1}, L_compute(n-1..n) × (d_{n-1} + 1)} }
In general L_communicate ≪ L_compute, so the L_communicate delay is not considered. The n-th operator forms a sub-model together with some preceding operators, and the delay is determined by the largest sub-model delay; the optimum over the first n operator combinations is always derived recursively from Low_Cost. Define Low_Cost_0 = 0 and L_communicate(0) = 0, where L_compute(i..j) is the computation delay of the sub-model formed by combining operators i through j, Cost_i is the computation delay of the i-th operator, L_communicate is the data-transmission delay, and d_n is the number of devices occupied by the optimal solution for operators 1 through n. A Low_Cost table stores the computed Low_Cost_i, a Position table records, for each Node_j, the position i selected for Low_Cost_j, and a Number table records the number of devices required by the corresponding scheme.
In the above embodiment of the invention, the steps for obtaining the throughput-optimal deep learning model segmentation scheme with the dynamic programming algorithm are as follows:
Step one: number the nodes sequentially according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_n, with their connectivity represented by an n×n matrix M. Each branching node has several outgoing nodes and each converging node has several incoming nodes.
Step two: starting from Node_1, for each Node_i, if Node_i is the parent of branch nodes, find the converging node Node_j of those branches, and recursively invoke the algorithm on the nodes between Node_i and Node_j with optimization target min{Number_{i..j} × max{Low_Cost_{i..j}}} (the minimum delay depends on the branch with the largest delay), finding the optimal combination that minimizes the delay of this branch.
Step three: for Node_j, solve the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost × device_{0..j}, and record the sub-model combination corresponding to Low_Cost_j.
Step four: after the nodes are traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the recorded combinations.
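The throughput-first recurrence can be sketched analogously for a linear chain; the sketch below follows the state transition equation above, maintaining the Low_Cost, Position and Number (device-count) tables, and ignores L_communicate since L_communicate ≪ L_compute. It is an illustration under these simplifications, with hypothetical names, not the patented implementation.
```python
def dp_max_throughput(op_delay):
    """DP for the throughput-first objective on identical devices.

    Follows Low_Cost_n = min_i max{ Low_Cost_i, L_compute(i..n) * (d_i + 1) },
    where d_i (the Number table) is the device count of the best solution
    for operators 1..i.
    """
    n = len(op_delay)
    prefix = [0.0]
    for d in op_delay:
        prefix.append(prefix[-1] + d)
    low_cost = [0.0] * (n + 1)
    devices = [0] * (n + 1)                      # the Number table
    position = [0] * (n + 1)                     # the Position table
    for j in range(1, n + 1):
        best, arg = float("inf"), 0
        for i in range(j):
            stage = prefix[j] - prefix[i]        # L_compute(i..j)
            cost = max(low_cost[i], stage * (devices[i] + 1))
            if cost < best:
                best, arg = cost, i
        low_cost[j], position[j] = best, arg
        devices[j] = devices[arg] + 1
    cuts, j = [], n
    while j > 0:
        cuts.append((position[j] + 1, j))
        j = position[j]
    return low_cost[n], devices[n], cuts[::-1]

print(dp_max_throughput([3.0, 1.0, 4.0, 1.0, 5.0]))
```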
Referring to fig. 5, the invention successfully partitions a large-parameter model into small sub-models, so that the sub-models fit the memory capacity of the computing devices, non-compute overhead is reduced, and the capability of the existing computing devices is fully exploited.
The invention preferably stores the deep learning model with the ONNX model protocol, which can be loaded into various deep learning frameworks and, through their APIs, exported to various hardware back ends. The TCP protocol is used for communication among the computing devices, so no complex data format conversion is needed, giving good compatibility and universality. The invention deploys the model on different devices, which naturally introduces communication overhead, but the introduced communication overhead is far smaller than the computation overhead of the model itself; for cloud computing platforms the communication overhead is hardly worth mentioning, and it is also much smaller than the loading overhead of block-loading parameters from a hard disk.
The invention provides a flexible model segmentation algorithm that can generate configuration schemes for both high-load and low-load conditions, and also supports presetting device parameters in advance and adjusting the scheme to the devices. Once a scheme is generated, the service provider can flexibly adjust it as required. For the common scenario where the computing power of all computing devices is almost the same, the invention provides a dynamic programming algorithm of lower complexity, which obtains a globally optimal device segmentation scheme more efficiently. The invention also provides a delay estimation model that can predict the running time of a sub-model on different devices instead of running the sub-model directly for statistics, greatly reducing preprocessing overhead.
The model deployment device of the embodiment of the invention comprises:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
the running time set acquisition module is used for acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to acquire a running time set;
the sub-model set acquisition module is used for combining the operator models in the processed operator model set by adopting a preset search method according to the running time set to obtain a sub-model set;
the deployment module is used for deploying the deep neural network model to be deployed on the equipment set according to the sub-model set, so as to complete model deployment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, one skilled in the art may make modifications and equivalents to the specific embodiments of the present invention, and any modifications and equivalents not departing from the spirit and scope of the present invention are within the scope of the claims of the present invention.

Claims (9)

1. A method of model deployment, comprising the steps of:
acquiring an operator model set of a deep neural network model to be deployed; operator fusion or operator segmentation processing is carried out on the operator models meeting preset conditions in the operator model set, and a processed operator model set is obtained;
acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the model, to obtain a running time set;
based on the running time set, combining operator models in the processed operator model set by a preset search method to obtain a sub-model set;
based on the sub-model set, deploying the deep neural network model to be deployed on the device set to complete model deployment;
the preset search method is a backtracking search method;
when combining operator models in the processed operator model set by the backtracking search method, a high throughput rate priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme; the high throughput rate priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, querying the topological graph to obtain the branch nodes connected to Node_i, thereby obtaining a search tree, and performing DFS traversal on the search tree to obtain segmentation schemes; wherein, for each branch node Node_(i+1) of Node_i, two branches are generated, representing that Node_i and Node_(i+1) are in the same sub-model or in different sub-models; when Node_i has a plurality of branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation is identical everywhere except between Node_i and Node_j are regarded as the same scheme and merged, so that x final schemes are obtained; traversing the x schemes and removing any scheme whose number of required devices is greater than the number of available devices, to obtain the remaining scheme set;
traversing the remaining scheme set: for each scheme, calculating the cost of the operator models of each sub-model on each device; combining the costs on different devices by multiplying the maximum per-device cost by the number of devices consumed to obtain the total cost, finding the combination with the minimum total cost as the optimal combination, and recording it as the division scheme; comparing the minimum costs of all division schemes, and taking the division scheme corresponding to the minimum cost as the final division scheme.
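By way of illustration only (not part of the claimed subject matter): for the simplest case of a linear operator chain, the backtracking search above reduces to a depth-first enumeration of the gaps between adjacent operators, and each surviving scheme is scored by the high throughput rate priority objective, i.e. the maximum per-device cost multiplied by the number of devices consumed. The sketch below uses invented running times and omits the branch-merging step needed for multi-branch topologies.

```python
from itertools import permutations

# Hypothetical "running time set": run_time[op][dev] is the measured running
# time of operator op on device dev (numbers invented for illustration).
run_time = [
    [2.0, 3.0],   # operator 0 on device 0 / device 1
    [4.0, 2.5],   # operator 1
    [1.0, 1.5],   # operator 2
]
n_ops, n_devices = len(run_time), len(run_time[0])

def schemes(n):
    """Backtracking over the gaps between adjacent operators: each gap either
    keeps Node_i and Node_(i+1) in the same sub-model or starts a new one
    (the two branches of the search tree), traversed depth-first."""
    def dfs(i, cuts):
        if i == n - 1:
            yield list(cuts)
            return
        yield from dfs(i + 1, cuts)             # same sub-model
        yield from dfs(i + 1, cuts + [i + 1])   # different sub-models
    yield from dfs(0, [])

def to_sub_models(cuts, n):
    bounds = [0] + cuts + [n]
    return [range(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]

best = None
for cuts in schemes(n_ops):
    subs = to_sub_models(cuts, n_ops)
    if len(subs) > n_devices:      # needs more devices than are available
        continue                   # -> the scheme is removed
    # every assignment of sub-models to distinct devices is a "combination"
    for assign in permutations(range(n_devices), len(subs)):
        costs = [sum(run_time[op][dev] for op in sub)
                 for sub, dev in zip(subs, assign)]
        total = max(costs) * len(subs)   # high throughput rate priority
        if best is None or total < best[0]:
            best = (total, [list(s) for s in subs], assign)

print(best)   # minimum total cost, its sub-models, and its device assignment
```

The prune corresponds to removing schemes that require more devices than are available, and the assignment loop plays the role of combining the costs on different devices before taking the minimum over all division schemes.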
2. The method for deploying models according to claim 1, wherein the step of obtaining the operator model set of the deep neural network model to be deployed specifically comprises:
selecting a single-layer neural network as the basic granularity, and dividing the deep neural network model to be deployed to obtain the operator model set.
3. A model deployment method according to claim 1, wherein,
the step of performing operator fusion or operator segmentation processing on the operator models meeting the preset conditions in the operator model set to obtain the processed operator model set specifically comprises the following steps:
comparing the parameter size of each operator model in the operator model set with the memory of the device having the smallest memory in the device set; performing operator segmentation on any operator model whose parameter size is larger than that memory, until the parameter size is smaller than 1/2 of the memory; and performing operator fusion on any operator model whose parameter size is less than 1/10 of the memory, until the parameter size is greater than 1/10 and less than 1/2 of the memory.
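A minimal sketch of this preprocessing rule, assuming each operator model can be summarized by its parameter size in bytes; OpModel, split_in_half, and fuse are hypothetical stand-ins for real operator segmentation and fusion:

```python
from dataclasses import dataclass

@dataclass
class OpModel:
    name: str
    param_bytes: int

def split_in_half(op):
    # hypothetical stand-in for real operator segmentation
    half = op.param_bytes // 2
    return (OpModel(op.name + ".a", half),
            OpModel(op.name + ".b", op.param_bytes - half))

def fuse(a, b):
    # hypothetical stand-in for real operator fusion
    return OpModel(a.name + "+" + b.name, a.param_bytes + b.param_bytes)

def segment(op, mem):
    """Split an oversized operator until every piece is below mem / 2."""
    if op.param_bytes < mem / 2:
        return [op]
    a, b = split_in_half(op)
    return segment(a, mem) + segment(b, mem)

def preprocess(ops, mem):
    """mem: memory of the device with the smallest memory in the device set."""
    segmented = []
    for op in ops:
        if op.param_bytes > mem:                # larger than the smallest memory
            segmented.extend(segment(op, mem))  # -> pieces below mem / 2
        else:
            segmented.append(op)
    fused = []
    for op in segmented:
        if fused and fused[-1].param_bytes < mem / 10:
            merged = fuse(fused[-1], op)        # fuse a tiny operator forward
            if merged.param_bytes < mem / 2:    # keep result in (mem/10, mem/2)
                fused[-1] = merged
                continue
        fused.append(op)
    return fused

ops = [OpModel("conv1", 900), OpModel("relu1", 4), OpModel("fc1", 150)]
print(preprocess(ops, mem=400))
```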
4. The model deployment method according to claim 1, wherein, when the backtracking search method is used to combine operator models in the processed operator model set, a low service delay priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high throughput rate priority scheme; the low service delay priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, querying the topological graph to obtain the branch nodes connected to Node_i, thereby obtaining a search tree, and performing DFS traversal on the search tree to obtain segmentation schemes; wherein, for each branch node Node_(i+1) of Node_i, two branches are generated, representing that Node_i and Node_(i+1) are in the same sub-model or in different sub-models; when Node_i has a plurality of branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation is identical everywhere except between Node_i and Node_j are regarded as the same scheme and merged, so that x final schemes are obtained; traversing the x schemes and removing any scheme whose number of required devices is greater than the number of available devices, to obtain the remaining scheme set;
traversing the remaining scheme set: for each scheme, calculating the cost of the operator models of each sub-model on each device; combining the costs on different devices by accumulating them to obtain the total cost, finding the combination with the minimum total cost as the optimal combination, and recording it as the division scheme; comparing the minimum costs of all division schemes, and taking the division scheme corresponding to the minimum cost as the final division scheme.
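To make the contrast with claim 1 concrete: for one and the same candidate combination, the two priority schemes aggregate the per-device costs differently. A toy comparison with invented numbers:

```python
# Per-device cost of one candidate combination (invented numbers).
costs = [3.0, 5.0, 2.0]

# High throughput rate priority (claim 1): the pipeline is paced by its
# slowest device, and every extra device counts against the scheme.
total_throughput_first = max(costs) * len(costs)   # 15.0

# Low service delay priority (claim 4): a request traverses every device
# once, so the per-device costs simply accumulate.
total_delay_first = sum(costs)                     # 10.0

print(total_throughput_first, total_delay_first)
```

Under the first objective an uneven split is penalized even when its sum is small; under the second, only the end-to-end total matters.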
5. The model deployment method according to claim 1, wherein all devices in the device set for deploying the model are identical; the preset search method is a dynamic programming search method;
when the operator models in the processed operator model set are combined by the dynamic programming search method, a low service delay priority scheme is adopted when the actual running time is smaller than the theoretical delay of the high throughput rate priority scheme; the low service delay priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and representing their connectivity with an n × n matrix M; each branch node has a plurality of outgoing nodes, and each convergence node has a plurality of incoming nodes;
traversing from Node_1 to Node_n; wherein, if Node_i is the parent node of branch nodes, finding the convergence node Node_j of those branches and recursively invoking, for the nodes between Node_i and Node_j, the algorithm with optimization target min{Number_(i..j) × max{Low_Cost_(i..j)}} to find the optimal combination that minimizes the delay of this branch; solving the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost_i + L_compute(i..j) + L_communicate(i), and recording the sub-model combination mode of Low_Cost_j; after the traversal, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the records;
wherein Low_Cost_i denotes the sub-model combination mode with the lowest total delay among all combination modes of the 0th through i-th operators; the state transition equation is:

Low_Cost_n = min{ Low_Cost_0 + L_compute(0..n) + L_communicate(0), Low_Cost_1 + L_compute(1..n) + L_communicate(1), ..., Low_Cost_(n-1) + L_compute(n-1..n) + L_communicate(n-1) },

Low_Cost_0 = 0,
L_communicate(0) = 0,
L_communicate = Data_Size × Coefficient,

wherein L_compute(i..j) is the computation delay of the sub-model formed by combining the i-th through j-th operators; Cost_i is the computation delay of the i-th operator; L_communicate is the communication delay; Data_Size is the size of the data to be transmitted; Coefficient is a constant that varies with the network bandwidth; and Number_(i..j) denotes the number of machines consumed between Node_i and Node_j.
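An illustrative rendering of this dynamic program for a linear operator chain on identical devices. The per-operator delays, data sizes, Coefficient value, and the assumption that fusing operators adds a small per-operator overhead are all invented for the example; in the method, L_compute would come from the measured running time set:

```python
COEFFICIENT = 0.01                    # per-byte transfer delay (bandwidth-dependent)
cost = [2.0, 4.0, 1.0, 3.0]           # Cost_i: compute delay of operator i (invented)
data_size = [0.0, 80.0, 40.0, 120.0]  # bytes entering a sub-model starting at operator i

def L_compute(i, j):
    """Delay of the fused sub-model covering operators i..j-1 on one device.
    Invented model: fusing adds 10% overhead per extra operator; in the
    method this value comes from the measured running time set."""
    return sum(cost[i:j]) * (1 + 0.1 * (j - i - 1))

def L_communicate(i):
    """L_communicate = Data_Size * Coefficient; zero at the model input."""
    return data_size[i] * COEFFICIENT if i > 0 else 0.0

n = len(cost)
low_cost = [0.0] * (n + 1)   # low_cost[j]: lowest total delay for operators 0..j-1
start = [0] * (n + 1)        # start[j]: where the last sub-model begins
for j in range(1, n + 1):
    best = None
    for i in range(j):       # last sub-model covers operators i..j-1
        c = low_cost[i] + L_compute(i, j) + L_communicate(i)
        if best is None or c < best:
            best, start[j] = c, i
    low_cost[j] = best       # the state transition equation of claim 5

# walk the recorded choices backwards to recover the partition
parts, j = [], n
while j > 0:
    parts.append((start[j], j - 1))
    j = start[j]
print(low_cost[n], list(reversed(parts)))   # -> roughly 11.4 with split [(0, 1), (2, 3)]
```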
6. The model deployment method according to claim 5, wherein when the operator models in the processed operator model set are combined by adopting a dynamic programming search method, a high throughput rate priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme; the high throughput rate priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n, and representing their connectivity with an n × n matrix M; each branch node has a plurality of outgoing nodes, and each convergence node has a plurality of incoming nodes;
traversing from Node_1 to Node_n; wherein, if Node_i is the parent node of branch nodes, finding the convergence node Node_j of those branches and recursively invoking, for the nodes between Node_i and Node_j, the algorithm with optimization target min{Number_(i..j) × max{Low_Cost_(i..j)}} to find the optimal combination that minimizes the delay of this branch; solving the optimization scheme according to the state transition equation Low_Cost_j = Low_Cost × device_(0..j), and recording the sub-model combination mode corresponding to Low_Cost_j; after Node_n is traversed, Low_Cost_n is the estimated minimum delay, and the deep learning model partition deployment scheme is obtained by querying the recorded sub-model combination modes.
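The corresponding sketch for this variant (self-contained, reusing the invented toy chain from the example under claim 5): only the quantity being minimized changes, a combination now being scored by its slowest sub-model multiplied by the number of devices it consumes, cf. Low_Cost_j = Low_Cost × device_(0..j). For brevity it enumerates the combinations exhaustively instead of through the claim's recursion:

```python
from itertools import combinations

COEFFICIENT = 0.01
cost = [2.0, 4.0, 1.0, 3.0]           # invented per-operator compute delays
data_size = [0.0, 80.0, 40.0, 120.0]  # invented transfer sizes
n = len(cost)

def L_compute(i, j):
    # same invented fused-delay model as in the claim-5 sketch
    return sum(cost[i:j]) * (1 + 0.1 * (j - i - 1))

def L_communicate(i):
    return data_size[i] * COEFFICIENT if i > 0 else 0.0

def throughput_score(bounds):
    """bounds = [0, c1, ..., n]; sub-model k covers operators bounds[k]..bounds[k+1]-1."""
    stages = [L_compute(bounds[k], bounds[k + 1]) + L_communicate(bounds[k])
              for k in range(len(bounds) - 1)]
    return max(stages) * len(stages)   # slowest sub-model x devices consumed

best = min(
    ([0] + list(c) + [n] for r in range(n) for c in combinations(range(1, n), r)),
    key=throughput_score,
)
print(best, throughput_score(best))   # the estimated-best division scheme
```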
7. A model deployment apparatus, comprising:
the operator model set acquisition module is used for acquiring an operator model set of the deep neural network model to be deployed, and performing operator fusion or operator segmentation on the operator models that meet preset conditions in the operator model set to obtain a processed operator model set;
the running time set acquisition module is used for acquiring the running time of each operator model in the processed operator model set on each device in the device set for deploying the models to acquire a running time set;
the sub-model set acquisition module is used for combining the operator models in the processed operator model set by adopting a preset search method according to the running time set to obtain a sub-model set;
the deployment module is used for deploying the deep neural network model to be deployed on the device set according to the sub-model set, so as to complete model deployment;
the preset search method is a backtracking search method;
when combining operator models in the processed operator model set by the backtracking search method, a high throughput rate priority scheme is adopted when the actual running time is greater than or equal to the theoretical delay of the high throughput rate priority scheme; the high throughput rate priority scheme specifically comprises the following steps:
numbering the nodes in sequence according to the input-to-output topology, denoted Node_1, Node_2, ..., Node_i, ..., Node_n;
from Node_1 to Node_n, for each Node_i, querying the topological graph to obtain the branch nodes connected to Node_i, thereby obtaining a search tree, and performing DFS traversal on the search tree to obtain segmentation schemes; wherein, for each branch node Node_(i+1) of Node_i, two branches are generated, representing that Node_i and Node_(i+1) are in the same sub-model or in different sub-models; when Node_i has a plurality of branch nodes, the ingress node Node_j at which all branch nodes converge is found, and schemes whose segmentation is identical everywhere except between Node_i and Node_j are regarded as the same scheme and merged, so that x final schemes are obtained; traversing the x schemes and removing any scheme whose number of required devices is greater than the number of available devices, to obtain the remaining scheme set;
traversing the remaining scheme set: for each scheme, calculating the cost of the operator models of each sub-model on each device; combining the costs on different devices by multiplying the maximum per-device cost by the number of devices consumed to obtain the total cost, finding the combination with the minimum total cost as the optimal combination, and recording it as the division scheme; comparing the minimum costs of all division schemes, and taking the division scheme corresponding to the minimum cost as the final division scheme.
8. An electronic device, comprising: a processor; and a memory for storing computer program instructions; characterized in that,
the computer program instructions, when loaded and executed by the processor, perform the model deployment method of any one of claims 1 to 6.
9. A readable storage medium storing computer program instructions, wherein the computer program instructions, when loaded and executed by a processor, perform the model deployment method of any one of claims 1 to 6.
CN202110567899.5A 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium Active CN113220457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567899.5A CN113220457B (en) 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113220457A (en) 2021-08-06
CN113220457B (en) 2024-03-22

Family

ID=77098247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567899.5A Active CN113220457B (en) 2021-05-24 2021-05-24 Model deployment method, model deployment device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113220457B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118592B (en) * 2022-06-15 2023-08-08 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator feature analysis
CN115115062B (en) * 2022-06-29 2023-09-19 北京百度网讯科技有限公司 Machine learning model building method, related device and computer program product
WO2024014728A1 (en) * 2022-07-11 2024-01-18 Samsung Electronics Co., Ltd. Method and system for optimizing neural networks (nn) for on-device deployment in an electronic device
CN116070675B (en) * 2023-03-06 2023-06-09 西南交通大学 Side slope neural network model selection method, device, equipment and readable storage medium
CN115981870B (en) * 2023-03-10 2023-06-13 之江实验室 Data processing method and device, storage medium and electronic equipment
CN116050499B (en) * 2023-04-03 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Self-adaptive model partitioning method, system and equipment in model parallel training
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing
CN116306856B (en) * 2023-05-17 2023-09-05 之江实验室 Deep learning model deployment method and device based on search
CN116630632B (en) * 2023-07-25 2023-11-03 腾讯科技(深圳)有限公司 Image segmentation model quantization method, device and equipment and computer storage medium
CN117155791B (en) * 2023-10-31 2024-02-13 浪潮电子信息产业股份有限公司 Model deployment method, system, equipment and medium based on cluster topology structure
CN117311998B (en) * 2023-11-30 2024-03-05 卓世未来(天津)科技有限公司 Large model deployment method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN110490322A (en) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 Method for splitting and device, the electronic equipment and storage medium of operation node
CN110674936A (en) * 2019-09-24 2020-01-10 上海寒武纪信息科技有限公司 Neural network processing method and device, computer equipment and storage medium
CN111340237A (en) * 2020-03-05 2020-06-26 腾讯科技(深圳)有限公司 Data processing and model operation method, device and computer equipment
CN111738434A (en) * 2020-06-03 2020-10-02 中国科学院计算技术研究所 Method for executing deep neural network on heterogeneous processing unit
CN112270399A (en) * 2020-09-29 2021-01-26 北京百度网讯科技有限公司 Operator registration processing method and device based on deep learning and electronic equipment
WO2021057465A1 (en) * 2019-09-26 2021-04-01 中兴通讯股份有限公司 Method and apparatus for performing parallel processing on deep learning model
CN112686378A (en) * 2020-12-23 2021-04-20 展讯通信(上海)有限公司 Calculation deployment method and device of neural network, storage medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390387B (en) * 2018-04-20 2023-07-18 伊姆西Ip控股有限责任公司 Assessment of resources used by deep learning applications

Also Published As

Publication number Publication date
CN113220457A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113220457B (en) Model deployment method, model deployment device, terminal equipment and readable storage medium
US20220129302A1 (en) Data processing system and method for heterogeneous architecture
CN114338504B (en) Micro-service deployment and routing method based on network edge system
CN113708972B (en) Service function chain deployment method and device, electronic equipment and storage medium
CN108572873B (en) Load balancing method and device for solving Spark data inclination problem
CN110084363B (en) Deep learning model acceleration method based on FPGA platform
KR20200113744A (en) Method and apparatus for partitioning deep neural networks
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN111371616B (en) Virtual network function chain deployment method and system for NUMA (non Uniform memory Access) architecture server
CN112463189B (en) Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN116662010B (en) Dynamic resource allocation method and system based on distributed system environment
Tanaka et al. Automatic graph partitioning for very large-scale deep learning
CN111314235A (en) Network delay optimization method based on virtual network function resource demand prediction
CN112015765B (en) Spark cache elimination method and system based on cache value
CN113794748B (en) Performance-aware service function chain intelligent deployment method and device
WO2020164644A2 (en) Neural network model splitting method, apparatus, computer device and storage medium
CN111506431A (en) Method for optimizing perception load performance of cloud server under energy consumption constraint
CN112862083B (en) Deep neural network inference method and device in edge environment
CN104683480A (en) Distribution type calculation method based on applications
WO2015055502A2 (en) Method of partitioning storage in a distributed data storage system and corresponding device
CN115392467B (en) Cloud edge cooperative self-adaptive depth reasoning method for real-time processing of mass data
CN116841710A (en) Task scheduling method, task scheduling system and computer storage medium
CN113824650B (en) Parameter transmission scheduling algorithm and system in distributed deep learning system
CN114356738A (en) Method for predicting time required for executing neural network model and related product
CN116501828B (en) Non-perception vector query method and system for server based on unstructured data set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220615

Address after: 518031 room 1410, building 1, Changfu Jinmao building, south side of Shihua Road, Fubao community, Fubao street, Futian District, Shenzhen, Guangdong

Applicant after: Shenzhen Zhixin Huaxi Information Technology Co.,Ltd.

Address before: 710077 11th floor, building B2, yunhuigu 156, software new town, Tiangu 8th Road, high tech Zone, Xi'an City, Shaanxi Province

Applicant before: Cross Information Core Technology Research Institute (Xi'an) Co.,Ltd.

GR01 Patent grant