CN114358255B - DNN model parallelization and partial computation offloading method based on fusion layer - Google Patents
DNN model parallelization and partial computation offloading method based on fusion layer
- Publication number
- CN114358255B (application CN202210018356.2A)
- Authority
- CN
- China
- Prior art keywords
- layer
- path
- calculation
- dnn
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a DNN model parallelization and partial computation offloading method based on a fusion layer, which belongs to the field of data processing and comprises the following steps: S1: dividing the DNN model using the fused-layer (FL) technique to obtain computation layers with computational dependencies; S2: performing parallelized inference on the FL-divided DNN model using partial computation offloading to obtain the DNN inference time; S3: determining the path scheduling policy and the number of FL paths using a minimum waiting algorithm; S4: combining a particle swarm optimization algorithm with the minimum waiting algorithm to determine the FL path lengths, the truncated fusion layer sizes and the path offloading policy, the minimum DNN inference time obtained being the optimal solution. The invention dynamically updates the FL path lengths, the truncated fusion layer sizes and the path offloading policy through the Particle Swarm Optimization with Minimizing Waiting (PSOMW) algorithm to explore the optimal solution and avoid falling into local minima.
Description
Technical Field
The invention relates to the field of data processing, and in particular to a DNN model parallelization and partial computation offloading method based on a fusion layer.
Background
With the popularity of mobile devices and advances in wireless access technology, actively developed mobile applications have caused explosive growth in data traffic. According to an International Data Corporation report, global data center traffic will reach 163 zettabytes by 2025, and more than 75% of the data will be processed at the network edge. Deep learning, on the other hand, has been successful in complex tasks, including computer vision, natural language processing, machine translation, and many others. The use of deep learning in Internet of Things (IoT) systems still faces a number of obstacles, one of which is that IoT devices cannot provide results that satisfy both real-time and high-accuracy requirements due to their limited computational resources. However, in many IoT systems, such as traffic monitoring, both higher processing speed and higher accuracy are required.
In order to address the above challenges, mobile edge computing (Mobile Edge Computing, MEC) has recently been proposed. MEC pushes computing, caching and other functions toward the network edge to perform task processing and provide services, avoiding unnecessary transmission delays. However, MEC introduces additional transmission overhead and delay, which is not negligible given the large amount of data (e.g., video) transmitted and the slow transmission speed.
In order to reduce the inference time of deep neural networks (Deep Neural Network, DNN), recent studies have explored three approaches: terminal-device-only computation, full computation offloading, and partial computation offloading. For terminal-device-only computation, existing research accelerates DNN inference mainly by optimizing the DNN structure or using multiple cores; for example, Yu et al. describe an alternating direction method of multipliers that prunes filters in a hierarchical fashion and then accelerates DNN inference (F. Yu, L. Cui, P. Wang, C. Han, R. Huang, and X. Huang, "EasiEdge: A novel global deep neural networks pruning method for efficient edge computing," IEEE Internet of Things Journal, vol. 8, no. 3, pp. 1259–1271, 2021). In full computation offloading, the raw data is offloaded directly to an edge server (Edge Server, ES). In partial computation offloading, the DNN model is decomposed into sub-tasks at the granularity of layers, and intermediate feature layers are offloaded to the ES by following the corresponding computational dependencies. In general, an intermediate feature layer has a smaller transmission data size and thus a shorter transmission time. Duan et al. minimize DNN inference time by jointly optimizing the partitioning and scheduling of multiple DNNs (Y. Duan and J. Wu, "Joint optimization of DNN partition and scheduling for mobile cloud computing," in Proceedings of IEEE ICPP, 2021, pp. 1–10).
However, current DNN inference is not optimized for parallel computation and does not support splitting the intermediate neural layers of a DNN into multiple small layers at a finer granularity. Thus, the advantages of partial computation offloading are not maximized. In the present invention, the DNN is first converted from a single sequence of neural layers into multiple sequences of neural layers using the Fused-Layer (FL) technique. Each sequence of layers is called an FL path; each FL path consists of a sequence of small layers and can be computed independently, and the accuracy of the DNN inference result is unchanged after the computed results are fused and spliced together. Thus, scheduling FL paths creates greater flexibility for the partial computation offloading of DNN inference. However, the FL technique also presents some challenges: (1) determining the optimal FL strategy is challenging because of the trade-off between parallel computation offloading flexibility and model parallelization overhead; (2) the FL technique results in a more complex DNN architecture, which is abstracted into the form of a directed acyclic graph (Directed Acyclic Graph, DAG), so it is important to determine the best path offloading and path scheduling policies.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to provide a DNN model parallelization and partial computation offloading method based on a fusion layer, which provides a minimum waiting (MW) algorithm to heuristically determine the path scheduling order and the number of FL paths. Then, a particle swarm optimization algorithm is combined with the MW algorithm to design the PSOMW algorithm, which dynamically updates the FL path lengths, the truncated fusion layer sizes and the path offloading policy, so as to explore the optimal solution and avoid being trapped in local minima.
In order to achieve the above purpose, the present invention provides a method for parallelization and partial computation offloading of a DNN model based on a fusion layer, comprising the steps of:
S1: dividing the DNN model using the FL technique to obtain computation layers with computational dependencies;
S2: performing parallelized inference on the FL-divided DNN model using partial computation offloading to obtain the DNN inference time; in partial computation offloading, the DNN inference time is obtained from the computation time on the terminal device, the transmission time, and the computation time on the ES;
S3: determining the path scheduling policy and the number of FL paths using the minimum waiting algorithm;
S4: combining the particle swarm optimization algorithm with the minimum waiting algorithm to determine the FL path lengths, the truncated fusion layer sizes and the path offloading policy, the minimum DNN inference time obtained being the optimal solution.
In an embodiment of the present invention, the step S1 includes: establishing a two-dimensional Cartesian coordinate system with the upper left corner of the feature layer as the origin, wherein the convolution stride is set to 1, the convolution kernel size is 3×3, and the sizes of the input feature layer, the intermediate feature layer and the output feature layer are 6×6, 4×4 and 2×2, respectively;
the FL paths are sequences of independently computable small layers obtained by dividing the DNN with the FL technique, forming a set of P FL paths in the DAG, where P denotes the total number of FL paths, and the DNN model is divided, in DAG form, into a plurality of paths and a plurality of layers with computational dependencies;
the FL path length is τ, and the vector S_τ representing the fusion layer size at path length τ consists of the length and the width of the fusion layer; the set of truncated fusion layer size vectors is U, where u_p is the truncated fusion layer size vector of FL path p and consists of the length and the width of the truncated fusion layer; the size of a truncated fusion layer does not exceed the length and the width of the fusion layer.
In an embodiment of the present invention, the step S2 includes:
S201: obtaining the computation offloading policy of the computation layers;
S202: acquiring the computation times on the terminal device and the ES and the transmission time between the terminal device and the ES, and acquiring the FL strategy, the path scheduling policy and the path offloading policy;
S203: obtaining the DNN inference time with computational dependencies.
In one embodiment of the present invention, in the step S201, the set of layers contains V layers, where v denotes a computation layer and V is the total number of computation layers; c_v and d_v represent the transmission data size of layer v and the computation amount of layer v, respectively; e_v′v = (v′, v) ∈ E represents the computational dependency from v′ to v, where E is the set of computational dependencies, and layer v can only be computed after layer v′ has been computed;
in partial computation offloading, the ES starts computing the current offloading layer after the predecessor layers of the current offloading layer have finished computing; the computation offloading policy is the set of indicators h_v, where h_v = 0 if layer v is computed on the terminal device, and h_v = 1 if layer v is offloaded to the ES.
In an embodiment of the present invention, in the step S202, among the computation times of the terminal device and the ES, the computation time of layer v on the terminal device is calculated as d_v / f_end, where f_end is the CPU frequency of the terminal device;
the computation time of layer v on the ES is calculated as d_v / f_es, where f_es is the CPU frequency of the ES;
among the transmission times between the terminal device and the ES, the transmission time of layer v from the terminal device to the ES is calculated as c_v / R, where R is the transmission rate between the terminal device and the ES.
The computation order of the FL paths on the terminal device is the same as the transmission order and the computation order on the ES; the computation order of the FL paths on the terminal device is the path scheduling policy, where s_p is the p-th scheduled path; the path offloading policy is the set of o_p, where o_p is the number of layers between the first computation layer and the offloading layer on path p.
In one embodiment of the present invention, in the step S203, the task completion time T_p(v) of layer v on path p is calculated recursively,
where v′ represents a preceding computation layer that has a computational dependency edge to layer v;
the DNN inference time T with computational dependencies is expressed as the minimization of the overall task completion time subject to constraints C1 and C2,
where constraint C1 represents the computational dependency, i.e., a computation layer can only be computed after all of its predecessor layers have been computed, and constraint C2 is the task completion time T_p(v) of layer v.
In an embodiment of the present invention, in the step S3, the FL policy is obtained by traversing all FL path numbers and FL path lengths;
For the path scheduling policy, the layer v offloaded on path p should have the minimum transmission completion time, i.e., the local computation time of the layers preceding v on path p plus the transmission time of layer v is minimized;
the first scheduled path p is recorded as s_1 and its offloading layer is recorded as o_p; the offloading policy of the p-th scheduled path (p ∈ {2, 3, …, P}) is determined by the transmission completion time of the (p-1)-th scheduled path in order to minimize the waiting time between the two paths; the p-th scheduled path and its offloading layer are recorded as s_p and o_p, and are determined such that the task completion time of the offloading layer on the terminal device is as close as possible to the transmission completion time of the (p-1)-th scheduled path,
where v is the offloading layer of the (p-1)-th scheduled path and v′ is the offloading layer of the p-th scheduled path.
In an embodiment of the present invention, in the step S4, the number of FL paths obtained in step S3 with the minimum DNN inference time is used, and an initial path offloading policy is randomly generated for each solution; for solution k, its own historical minimum DNN inference time and the corresponding FL path length, truncated fusion layer sizes and path offloading policy are recorded, together with the historical minimum DNN inference time of the population in the solution space and the corresponding τ_best, truncated fusion layer sizes and path offloading policy;
for each solution, the FL path length, the truncated fusion layer sizes and the path offloading policy are updated to find the optimal solution, wherein the change amount of the path offloading policy consists of the change amounts of o_p, the change amount of the truncated fusion layer size consists of the change amounts of its length and width, and the change amount of the FL path length is τ*; the change amounts of the FL path length, the truncated fusion layer sizes and the path offloading policy are determined by the solution's own inertia, its own historical optimal solution, and the historical optimal solution of the population in the solution space;
the FL path length, the truncated fusion layer sizes and the path offloading policy are then updated according to these change amounts;
the solution space is explored by continuously and iteratively updating each solution; when the set number of iterations is reached, the historical minimum DNN inference time of all solutions in the solution space and the corresponding τ_best, truncated fusion layer sizes and path offloading policy are taken as the optimal solution.
Compared with the prior art, the DNN model parallelization and partial computation offloading method based on the fusion layer of the present invention has the following advantages:
1. The invention parallelizes the DNN model in partial computation offloading. In particular, the FL technique is used to implement parallelized computation of the DNN model without loss of accuracy.
2. The invention provides a heuristic method, PSOMW, to obtain a near-optimal FL strategy, path scheduling policy and path offloading policy with low time complexity.
3. The method of the invention has been fully simulated to verify its effectiveness on commonly used DNNs. The results show that the DNN inference time with model parallelization using the PSOMW algorithm is reduced by a factor of 12.75 compared with results that do not consider model parallelization.
Drawings
FIG. 1 is a flow chart of a fusion layer-based DNN model parallelization and partial computation offload method in accordance with the present invention;
FIG. 2 is a schematic diagram of DNN reasoning using FL technique according to the present invention;
FIG. 3 is a schematic diagram of a DNN parallelization calculation procedure using FL technique according to the present invention;
FIG. 4 is a schematic diagram of the application of MW algorithm under 5-layer DNN according to the present invention;
FIG. 5-1 is a schematic diagram of DNN inference time for changing only transmission speed under AlexNet neural networks according to the present invention;
FIG. 5-2 is a schematic diagram of DNN inference time for changing only transmission speed under MobileNet neural networks according to the present invention;
Fig. 5-3 are schematic diagrams of DNN inference times for varying transmission speeds only under SqueezeNet neural networks, according to the present invention;
FIGS. 5-4 are schematic diagrams of DNN inference time for changing only transmission speed under a VGG16 neural network according to the present invention;
FIGS. 5-5 are schematic diagrams of DNN inference time for changing only the transmission speed under the YOLOv2 neural network according to the present invention;
Detailed Description
The following detailed description of embodiments of the invention is, therefore, to be taken in conjunction with the accompanying drawings, and it is to be understood that the scope of the invention is not limited to the specific embodiments.
Throughout the specification and claims, unless explicitly stated otherwise, the term "comprise" or variations thereof such as "comprises" or "comprising", etc. will be understood to include the stated element or component without excluding other elements or components.
As shown in FIG. 1 to FIG. 5-5, a fusion-layer-based DNN model parallelization and partial computation offloading method according to a preferred embodiment of the present invention uses the fused-layer (FL) technique in the processing of a DNN (Deep Neural Network) and can implement parallel computation of the DNN.
The key idea of the FL technique is to exploit the locality of DNN operations (such as convolution and pooling). For these operations, each output feature value depends only on the values in a corresponding region of the previous feature layer. Based on this observation, the FL technique splits the input feature layer into a plurality of independent small layers, computes the corresponding regions of the output feature layer, and then fuses and splices the corresponding results to obtain the original output result. Thus, the FL technique creates the opportunity for parallelized computation offloading.
The method comprises the following steps:
S1: and dividing the DNN model by using the FL technology to obtain a calculation layer with calculation correlation.
As shown in FIG. 2, four rectangular areas, represented by four different color depths, illustrate the processing dependencies in a three-layer neural network; the convolution kernel parameters are shown at the bottom right corner of the figure.
The invention establishes a two-dimensional Cartesian coordinate system with the upper left corner of the feature layer as the origin. The convolution stride is set to 1, the convolution kernel size is 3×3, and the sizes of the input feature layer, the intermediate feature layer and the output feature layer are 6×6, 4×4 and 2×2, respectively. A DNN is computed by multiplying the feature values of a feature layer with the convolution kernel element-wise and then summing the products. For example, the value at coordinate (1, 1) in the intermediate feature layer is obtained by multiplying and adding the values in the rectangular region of the input feature layer with vertices {(1, 1), (1, 3), (3, 1), (3, 3)} and the convolution kernel (4 = 2×0 + 0 + 1×0 + 1×1 + 2×0 + 1×1 + 2×0 + 2×1 + 1×0). The four regions of different color depth in FIG. 3 can be computed independently in the input feature layer and the intermediate feature layer, and then fused in the output feature layer; the neural layer where the output feature layer is located is also the fusion layer in the present invention. FIG. 3 illustrates the DNN parallelization computation process using the FL technique. The FL technique can divide the neural layers of a DNN into several sequences of independently computable small layers.
The FL paths are sequences of independently computable small layers obtained by dividing the DNN with the FL technique, forming a set of P FL paths in the DAG, where P denotes the total number of FL paths.
The FL path length is τ, and the vector S_τ representing the fusion layer size at path length τ consists of the length and the width of the fusion layer. The set of truncated fusion layer size vectors is U, where u_p is the truncated fusion layer size vector of FL path p and consists of the length and the width of the truncated fusion layer. The size of a truncated fusion layer cannot exceed the length and the width of the fusion layer.
In FIG. 3, after applying the FL technique, the DNN has 4 FL paths, the FL path length is 2, S_2 = {2, 2}, and u_1 = u_2 = u_3 = u_4 = {1, 1}. Notably, applying the FL technique results in additional computational redundancy. For example, in the input feature layer, the rectangular region with vertices {(2, 1), (2, 5), (5, 2), (5, 5)} is repeatedly computed.
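For illustration, the following NumPy sketch (not part of the patent's disclosure; the input values, kernel values and helper names are arbitrary placeholders) reproduces the logic of FIG. 3: each of the four FL paths computes one output value from its own 5×5 input region, and fusing the four results equals the unpartitioned computation.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D cross-correlation with stride 1, as in the example above."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(6, 6)).astype(float)    # 6x6 input feature layer
k1 = rng.integers(0, 2, size=(3, 3)).astype(float)   # first 3x3 kernel (placeholder values)
k2 = rng.integers(0, 2, size=(3, 3)).astype(float)   # second 3x3 kernel (placeholder values)

full = conv2d(conv2d(x, k1), k2)                      # unpartitioned 2x2 output layer

# Four FL paths: output value (i, j) depends only on the 5x5 input region [i:i+5, j:j+5].
fused = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        tile = x[i:i + 5, j:j + 5]                    # independent small input layer of one path
        fused[i, j] = conv2d(conv2d(tile, k1), k2)[0, 0]

assert np.allclose(full, fused)                       # fusing the 4 paths reproduces the result
```

The overlap of the four 5×5 input regions is exactly the computational redundancy mentioned above.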
Therefore, it is important to find the best FL strategy, i.e., FL path length, FL path number, and the truncated fusion layer size.
S2: carrying out parallelization reasoning on the DNN model divided by the FL technology by using a partial calculation unloading mode to obtain DNN reasoning time; in the partial calculation unloading, the DNN reasoning time is obtained by calculating the calculation time, the transmission time and the ES calculation time of the terminal equipment.
Specifically, step S2 includes the steps of:
S201: Obtaining the computation offloading policy of the computation layers.
The computational dependencies of the layers in the DNN form a DAG. The set of layers contains V layers, where v denotes a computation layer and V is the total number of computation layers. The transmission data size of layer v and the computation amount of layer v are denoted by c_v and d_v, respectively. e_v′v = (v′, v) ∈ E represents the computational dependency from v′ to v, which means that layer v can only be computed after layer v′ has been computed, where E is the set of computational dependencies. In partial computation offloading, the ES can start computing the current offloading layer once the predecessor layers of the current offloading layer have completed their computation. The computation offloading policy is the set of indicators h_v, where h_v = 0 if layer v is computed on the terminal device, and h_v = 1 if layer v is offloaded to the ES.
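As an illustration, the notation above can be represented with a small data model (an assumption for exposition, not part of the patent):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Layer:
    v: int                                           # layer index
    c: float                                         # transmission data size c_v
    d: float                                         # computation amount d_v
    preds: List[int] = field(default_factory=list)   # predecessors v' with (v', v) in E

# Example DAG: layer 3 can only be computed once layers 1 and 2 have been computed.
layers = {
    1: Layer(1, c=4.0, d=2.0),
    2: Layer(2, c=4.0, d=2.0),
    3: Layer(3, c=1.0, d=3.0, preds=[1, 2]),
}
h = {1: 0, 2: 0, 3: 1}                               # offloading policy: layers 1-2 local, layer 3 on the ES
```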
S202: the calculation time of the terminal device and the ES and the transmission time between the terminal device and the ES are acquired.
1) Computation time on the terminal device and the ES. In the present invention, it is assumed that the terminal device can compute only one computation layer at a time. If layer v is computed on the terminal device, its computation time on the terminal device is calculated as d_v / f_end,
where f_end is the CPU frequency of the terminal device.
Similarly, the computation time of layer v on the ES is calculated as d_v / f_es,
where f_es is the CPU frequency of the ES.
2) Transmission time between the terminal device and the ES. The computation layers are transmitted in first-come-first-served order. If layer v is offloaded to the ES, its transmission time from the terminal device to the ES is calculated as c_v / R,
where R is the transmission rate between the terminal device and the ES.
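These three quantities translate directly into code; the helper names below are illustrative only:

```python
def t_end(d_v, f_end):
    """Computation time of layer v on the terminal device: d_v / f_end."""
    return d_v / f_end

def t_es(d_v, f_es):
    """Computation time of layer v on the ES: d_v / f_es."""
    return d_v / f_es

def t_tra(c_v, R):
    """Transmission time of layer v from the terminal device to the ES: c_v / R."""
    return c_v / R
```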
3) Acquiring the FL strategy (i.e., the number of FL paths P, the FL path length τ, and the truncated fusion layer sizes), the path scheduling policy, and the path offloading policy. The computation order of the FL paths on the terminal device is the same as the transmission order and the computation order on the ES. Therefore, the computation order of the FL paths on the terminal device is the path scheduling policy, where s_p is the p-th scheduled path; the path offloading policy is the set of o_p, where o_p is the number of layers between the first computation layer and the offloading layer on path p.
S203: Obtaining the DNN inference time with computational dependencies.
Specifically, T_p(v) is the task completion time of layer v on path p, which can be calculated recursively,
where v′ denotes a preceding computation layer that has a computational dependency edge to layer v.
The aim of the invention is to minimize the DNN inference time in partial computation offloading while taking DNN model parallelization into account. The DNN inference time T with computational dependencies can be expressed as a minimization subject to constraints C1 and C2.
Constraint C1 represents the computational dependency: a computation layer can only be computed after all of its predecessor layers have been computed. Constraint C2 is the task completion time T_p(v) of layer v.
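As a simplified illustration of the recursion (an assumption, not the patent's exact formula: queuing on the transmission link and on the ES, which the text treats as first-come-first-served, is deliberately ignored here), the dependency structure of constraint C1 can be sketched as follows:

```python
from functools import lru_cache

def completion_time(v, c, d, preds, h, f_end, f_es, R):
    """c, d: data size / compute amount per layer; preds: predecessor lists; h: 0=local, 1=ES."""
    @lru_cache(maxsize=None)
    def T(u):
        ready = max((T(p) for p in preds[u]), default=0.0)   # constraint C1: wait for predecessors
        if h[u] == 0:
            return ready + d[u] / f_end                      # computed on the terminal device
        return ready + c[u] / R + d[u] / f_es                # transmitted, then computed on the ES
    return T(v)

# Three layers 1 -> 2 -> 3, with layer 3 offloaded to the ES.
c = {1: 4.0, 2: 4.0, 3: 1.0}; d = {1: 2.0, 2: 2.0, 3: 3.0}
preds = {1: [], 2: [1], 3: [2]}; h = {1: 0, 2: 0, 3: 1}
print(completion_time(3, c, d, preds, h, f_end=1.0, f_es=2.0, R=1.0))  # 2 + 2 + 1 + 1.5 = 6.5
```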
Then, for the DNN model inference in step S2, the invention uses the Particle Swarm Optimization with Minimizing Waiting (PSOMW) algorithm to optimize and obtain a near-optimal solution. The method specifically comprises the following steps:
S3: The path scheduling policy and the number of FL paths are determined using the Minimizing Waiting (MW) algorithm.
In MW, the FL strategy is obtained by traversing all possible numbers of FL paths and FL path lengths. Then, the fusion layer S_τ is truncated into P small layers of the same size, thereby obtaining the size vectors U of the truncated fusion layers. Once the FL strategy is obtained, the DNN inference time can be obtained by determining the path scheduling policy and the path offloading policy. Thus, by continually cycling through and updating the FL strategy, the minimum DNN inference time can be obtained.
The main idea of the path scheduling policy and the path offloading policy is that the local computation completion time of the next path should be as close as possible to the transmission completion time of the current path, so that the next path can start transmission immediately without waiting. The first scheduled path and its offloading layer can be determined using the following criteria. When a path has fewer layers to compute on the terminal device, the next path can start computing on the terminal device sooner. However, computing too few layers on the terminal device results in a huge amount of transmission data, thereby increasing the transmission time for the terminal device to offload to the ES. Thus, it is necessary to find the appropriate computation layer in each path for offloading. The layer v offloaded on path p should have the minimum transmission completion time, as follows:
Then, the first scheduled path p is recorded as s_1 and its offloading layer is recorded as o_p. If the offloaded layers on multiple paths have the same minimum transmission completion time, the path with the smallest local computation time is selected, since in this case the ES can start computing the offloading layer of that path as soon as possible while the next path can start computing on the terminal device as soon as possible, which makes full use of the computing resources of the terminal device and the ES.
The offloading policy of the p-th scheduled path (p ∈ {2, 3, …, P}) is determined by the transmission completion time of the (p-1)-th scheduled path in order to minimize the waiting time between the two paths. In particular, the task completion time on the terminal device of the p-th scheduled path should be as close as possible to the transmission completion time of the (p-1)-th scheduled path; then, after the (p-1)-th scheduled path is completed, the p-th scheduled path can start transmission as soon as possible. The p-th scheduled path and its offloading layer v′ are recorded as s_p and o_p, which can be determined by the following formula:
where v is the offloading layer of the (p-1)-th scheduled path and v′ is the offloading layer of the p-th scheduled path. If multiple paths have equally close task completion times on the terminal device, the path whose offloading layer has more preceding computation layers is selected. In this way, the p-th scheduled path completes more computation on the terminal device, reducing the computation time on the ES and making full use of the computing resources of the terminal device and the ES. In summary, the path scheduling policy and the path offloading policy are obtained.
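Before the worked example below, a brief sketch of the offload-layer selection criterion for a single path (the bookkeeping details and parameter values are assumptions, not the patent's implementation):

```python
def pick_offload_layer(path, c, d, f_end, R, start_time=0.0):
    """path: ordered layer indices of one FL path. Returns the layer whose transmission
    completion time (local computation of the preceding layers on the path plus its own
    transmission time) is smallest, together with that completion time."""
    best_v, best_finish = None, float("inf")
    local_done = start_time                      # local computation finished so far
    for v in path:
        finish = local_done + c[v] / R           # transmit v once its path predecessors are done
        if finish < best_finish:
            best_v, best_finish = v, finish
        local_done += d[v] / f_end               # otherwise v is computed on the terminal device
    return best_v, best_finish

# Toy values resembling path 1 of FIG. 4 (per-layer local compute 2, transmission 2, assumed):
print(pick_offload_layer([1, 2, 3], c={1: 2, 2: 2, 3: 2}, d={1: 2, 2: 2, 3: 2},
                         f_end=1.0, R=1.0))     # -> (1, 2.0): offload the first layer
```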
A five-layer DNN is taken as an example to illustrate the MW algorithm proposed by the present invention. As shown in FIG. 4, the FL path length is 3, the number of FL paths is 3, the CPU frequency of the terminal device is 1, the transmission rate to the ES is 1, and the CPU frequency of the ES is 2. The minimum transmission completion times of the three FL paths are 2, 4 and 4, respectively. Using the MW algorithm, path 1 is the first scheduled path. Since the task completion time of layer 1 on the terminal device is 0 and the transmission time of layer 1 is 2, the transmission completion time of path 1 is 0+2=2 if layer 1 is selected for offloading. By the same logic, if layer 2 or layer 3 is selected for offloading, the transmission completion time is 2+2=4 or 2+2+2=6, respectively. Therefore, o_1 = 1. For the second scheduled path, the task completion time of each layer on the terminal device should be as close to 2 as possible. The task completion times of layer 5 and layer 7 on the terminal device are 1+1=2 and 2, respectively. Thus, path 2 is determined to be the second scheduled path, and layer 6 is offloaded to the ES. That is, o_2 = 3, and the transmission completion time of layer 6 is 2+2=4. Path 3 is the third scheduled path, and the task completion times of layer 7, layer 8 and layer 9 on the terminal device are 2+2=4, 2+2+2=6 and 2+2+2+2=8, respectively. The task completion time of layer 8 on the terminal device is closest to 4, so the offloading layer on path 3 is layer 8; that is, o_3 = 2. After all preceding layers of layer 10 are completed, the outputs of layer 3, layer 6 and layer 9 are fused into layer 10; the fused layer 10 can then start its computation, and the task completion time of layer 10 is 8+3=11.
It is noted that although the MW algorithm may derive the path scheduling policy and the path offloading policy, the solution derived by the MW algorithm is often not an optimal solution, and thus the MW algorithm is only used to determine the path scheduling policy S and the FL path number P.
S4: and combining a Particle Swarm Optimization (PSO) algorithm with a MW algorithm to obtain FL path length, interception fusion layer size and path unloading strategy, wherein the obtained minimum DNN reasoning time is the optimal solution.
The basic idea of particle swarm Optimization algorithm (PARTICLE SWARM Optimization) is to simulate the predation behavior of birds. Birds adjust the search path through their own experience and communication between populations to find where food is most. The particle swarm optimization algorithm is a global optimization algorithm based on probability. The method has strong global searching capability on nonlinear and multimodal problems, and has high probability of obtaining a globally optimal solution.
The invention combines the PSO algorithm with the MW algorithm and proposes the PSOMW algorithm. In PSOMW, the solution space is first initialized using the MW algorithm. The number of FL paths obtained by the MW algorithm with the minimum DNN inference time is taken as the number of FL paths in PSOMW. Each solution then randomly generates an initial path offloading policy and determines its path scheduling policy using the idea of the MW algorithm. The DNN inference time is used as the evaluation index of PSOMW. For solution k, the minimum DNN inference time in its own history and the corresponding FL path length, truncated fusion layer sizes and path offloading policy are recorded. In addition, the historical minimum DNN inference time of the population in the solution space and the corresponding τ_best, truncated fusion layer sizes and path offloading policy are also recorded.
Next, for each solution, its FL path length, truncated fusion layer sizes and path offloading policy are updated to find the optimal solution. The change amount of the path offloading policy is composed of the change amounts of each o_p. The change amount of the truncated fusion layer size is composed of the change amount of its length and the change amount of its width. The change amount of the FL path length is τ*. The change amounts of the FL path length, the truncated fusion layer sizes and the path offloading policy are determined by the solution's own inertia, its own historical optimal solution, and the historical optimal solution of the population in the solution space (τ_best and the corresponding sizes and policy). The FL path length, the truncated fusion layer sizes and the path offloading policy are then updated according to these change amounts,
where γ_1 is the influence factor of inertia, γ_2 is the influence factor of the solution's own historical optimal solution, and γ_3 is the influence factor of the historical optimal solution of the population in the solution space. PSOMW explores the solution space by continuously and iteratively updating each solution; when the set number of iterations is reached, the historical minimum DNN inference time over all solutions in the solution space and the corresponding τ_best, truncated fusion layer sizes and path offloading policy constitute the optimal solution obtained by the algorithm.
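A sketch of the PSOMW update step: since the exact update equations are not reproduced here, the code follows the standard particle swarm form implied by the description (the random factors r1 and r2, and the parameter values in the example, are assumptions):

```python
import random

def psomw_update(x, vel, x_own_best, x_pop_best, g1, g2, g3):
    """Update one scalar decision variable (e.g. the path length tau, one o_p, or one
    truncated-layer dimension) from its current value x and its change amount vel."""
    r1, r2 = random.random(), random.random()    # stochastic factors, standard in PSO
    vel_new = g1 * vel + g2 * r1 * (x_own_best - x) + g3 * r2 * (x_pop_best - x)
    return x + vel_new, vel_new                  # in practice rounded/clipped to a valid value

# Example: an FL path length of 2 pulled toward its own best (3) and the population best (4).
tau, tau_vel = psomw_update(2.0, 0.0, x_own_best=3.0, x_pop_best=4.0, g1=0.7, g2=1.5, g3=1.5)
```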
A specific embodiment of the present invention will now be described with reference to FIGS. 5-1 to 5-5:
The present invention uses a number of simulations on five neural networks, (1) AlexNet, (2) MobileNet, (3) SqueezeNet, (4) VGG16 and (5) YOLOv2, to demonstrate the effectiveness of the proposed method. The proposed method is compared with the following baseline algorithms:
(1) No Fused-Layer (NFL): in this algorithm, partial computation offloading is performed without the FL technique; after the neural network is computed locally up to a certain intermediate feature layer, the whole layer is offloaded to the ES for the remaining computation.
(2) Brute Force (BF): the algorithm uses the FL technique to perform partial computation offloading, divides the DNN into a plurality of paths, and obtains the optimal FL strategy, path scheduling policy and offloading policy by traversing all feasible solutions.
(3) Minimizing Waiting (MW): the algorithm uses the FL technique for partial computation offloading. The MW algorithm is used to determine the path offloading policy and the path scheduling policy, and the FL strategy is determined by traversing all DNN neural layers.
The complexity of BF finding the optimal solution grows exponentially as the total number of DNN layers and FL paths increases. Thus, four FL paths with uniform truncated fusion layer sizes were selected, and the performance of BF (2×2), MW (2×2) and PSOMW (2×2) was compared. Furthermore, MW (k×k) denotes MW with uniform truncated fusion layer sizes, where the number of FL paths (k×k) is determined by the MW algorithm. PSOMW (k²) denotes PSOMW with non-uniform truncated fusion layer sizes, where the number of FL paths k² is determined by the MW algorithm. Due to the time complexity of BF, BF results for non-uniform truncated fusion layer sizes cannot be obtained, so the superiority of PSOMW is verified by comparing the BF (2×2) and PSOMW (2×2) results.
The transmission rate is varied from 1.1 MB/s to 3 MB/s to simulate various real-life scenarios, including common network environments such as 4G (1.3 MB/s) and WiFi (1.8 MB/s) (refer to Kang, J. Hauswald, J. Mars, C. Gao, and A. Rovinski, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," in Proceedings of ASPLOS, 2017, pp. 615–629).
FIGS. 5-1 to 5-5 show the simulation results. In the five different neural networks, when the transmission rate increases from 1.1 MB/s to 3 MB/s, the DNN inference time of MW (k×k) is on average 12.75 times smaller than that of NFL. However, the reduction in DNN inference time depends on the neural network architecture. FIG. 5-1 shows the results on AlexNet, where the DNN inference times of NFL, BF (2×2), MW (2×2), MW (k×k) and PSOMW (k²) decrease from 1470 ms, 260 ms, 335 ms, 40 ms and 32 ms to 1384 ms, 160 ms, 255 ms, 38 ms and 21 ms, respectively, when the transmission rate changes from 1.1 MB/s to 3 MB/s. VGG16 has 18 neural layers, more than AlexNet, and a greater number of neural layers leads to more computation. It can be seen that in VGG16, the DNN inference times of NFL, BF (2×2) and MW (2×2) decrease from 1856 ms, 1300 ms and 1302 ms to 1207 ms, 752 ms and 940 ms, respectively. Therefore, the DNN inference time in AlexNet is shorter than in VGG16.
On the other hand, the number of FL paths is very important. The following observations can be made from the results in FIG. 5-1. When the number of FL paths is 4, the DNN inference time of BF (2×2) is on average 5 times smaller than that of NFL. In MobileNet, when the transmission rate changes from 1.1 MB/s to 3 MB/s, the DNN inference times of MW (k×k) and PSOMW (k²) decrease from 59 ms and 30 ms to 33 ms and 17 ms, respectively. Compared with MW (2×2) and PSOMW (2×2), the DNN inference times of MW (k×k) and PSOMW (k²) are further reduced by an average of 171 ms and 165 ms, respectively. The reason is that the greater the number of FL paths, the greater the flexibility of path scheduling, and thus the better the results.
In addition, the simulation results shown in FIGS. 5-1 to 5-5 show that the DNN inference times of BF (2×2) and PSOMW (2×2) are substantially the same, with average differences of 0, 0.02%, 0 and 0.04% across AlexNet, MobileNet, SqueezeNet, VGG16 and YOLOv2. This demonstrates the accuracy of the results obtained by PSOMW. In contrast, the results of MW and PSOMW differ considerably: the differences between MW (k×k) and PSOMW (k²) in AlexNet, MobileNet, SqueezeNet, VGG16 and YOLOv2 are 51.3%, 90.1%, 4.7%, 12.2% and 13.1%, respectively. Therefore, the path offloading policy derived by MW is not always the optimal solution.
Overall, the method of the present invention reduces DNN inference time well, whether in lightweight DNNs (e.g., AlexNet, MobileNet and YOLOv2) or in heavyweight DNNs (e.g., VGG16 and SqueezeNet). The convolution stride of lightweight DNNs is 1 or 2 in most neural layers, so the FL technique can reduce more of the DNN inference time. In AlexNet, MobileNet and YOLOv2, the average DNN inference times of PSOMW (k²) are 26 ms, 24 ms and 21 ms, respectively, which are 1388 ms, 180 ms and 222 ms smaller than those of NFL. Heavyweight DNNs have a larger convolution stride or more neural layers, for example, a convolution stride of 7 in SqueezeNet, but by performing parallel computation on the terminal device and the ES, the DNN inference time can still be greatly reduced. For example, the average DNN inference times of PSOMW (k²) in SqueezeNet and VGG16 are 42 ms and 270 ms, respectively, which are 532 ms and 1251 ms smaller than those of NFL. Therefore, the advantage of PSOMW has been demonstrated.
The foregoing descriptions of specific exemplary embodiments of the present invention are presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application to thereby enable one skilled in the art to make and utilize the invention in various exemplary embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
Claims (6)
1. A DNN model parallelization and partial computation offloading method based on a fusion layer, characterized by comprising the following steps:
S1: dividing the DNN model using the FL technique to obtain computation layers with computational dependencies;
S2: performing parallelized inference on the FL-divided DNN model using partial computation offloading to obtain the DNN inference time; in partial computation offloading, the DNN inference time is obtained from the computation time on the terminal device, the transmission time, and the computation time on the ES;
S3: determining the path scheduling policy and the number of FL paths using the minimum waiting algorithm;
S4: combining the particle swarm optimization algorithm with the minimum waiting algorithm to determine the FL path lengths, the truncated fusion layer sizes and the path offloading policy, the minimum DNN inference time obtained being the optimal solution;
wherein, in step S3, the FL strategy is obtained by traversing all numbers of FL paths and FL path lengths;
for the path scheduling policy, the layer v offloaded on path p should have the minimum transmission completion time, i.e., the local computation time of the layers preceding v on path p plus the transmission time of layer v is minimized;
the first scheduled path p is recorded as s_1 and its offloading layer is recorded as o_p; the offloading policy of the p-th scheduled path (p ∈ {2, 3, …, P}) is determined by the transmission completion time of the (p-1)-th scheduled path so as to minimize the waiting time between the two paths; the p-th scheduled path and its offloading layer are recorded as s_p and o_p, and are determined such that the task completion time of the offloading layer on the terminal device is as close as possible to the transmission completion time of the (p-1)-th scheduled path,
wherein v is the offloading layer of the (p-1)-th scheduled path, and v′ is the offloading layer of the p-th scheduled path;
in step S4, the number of FL paths minimizing the DNN inference time in step S3 is selected, and an initial path offloading policy is then randomly generated for each solution; for solution k, its own historical minimum DNN inference time and the corresponding FL path length, truncated fusion layer sizes and path offloading policy are recorded, together with the historical minimum DNN inference time of the population in the solution space and the corresponding τ_best, truncated fusion layer sizes and path offloading policy;
for each solution, the FL path length, the truncated fusion layer sizes and the path offloading policy are updated to find the optimal solution, wherein the change amount of the path offloading policy consists of the change amounts of o_p, the change amount of the truncated fusion layer size consists of the change amounts of its length and width, and the change amount of the FL path length is τ*; the change amounts of the FL path length, the truncated fusion layer sizes and the path offloading policy are determined by the solution's own inertia, its own historical optimal solution, and the historical optimal solution of the population in the solution space;
the FL path length, the truncated fusion layer sizes and the path offloading policy are then updated according to these change amounts;
the solution space is explored by continuously and iteratively updating each solution; when the set number of iterations is reached, the historical minimum DNN inference time of all solutions in the solution space and the corresponding τ_best, truncated fusion layer sizes and path offloading policy are taken as the optimal solution.
2. The fusion-layer-based DNN model parallelization and partial computation offloading method of claim 1, wherein step S1 comprises: establishing a two-dimensional Cartesian coordinate system with the upper left corner of the feature layer as the origin, wherein the convolution stride is set to 1, the convolution kernel size is 3×3, and the sizes of the input feature layer, the intermediate feature layer and the output feature layer are 6×6, 4×4 and 2×2, respectively;
the FL paths are sequences of independently computable small layers obtained by dividing the DNN with the FL technique, forming a set of P FL paths in the DAG, where P denotes the total number of FL paths, and the DNN model is divided, in DAG form, into a plurality of paths and a plurality of layers with computational dependencies;
the FL path length is τ, and the vector S_τ representing the fusion layer size at path length τ consists of the length and the width of the fusion layer; the set of truncated fusion layer size vectors is U, where u_p is the truncated fusion layer size vector of FL path p and consists of the length and the width of the truncated fusion layer; the size of a truncated fusion layer does not exceed the length and the width of the fusion layer.
3. The fusion layer-based DNN model parallelization and partial computation offload method of claim 2, wherein step S2 comprises:
S201: obtaining the computation offloading policy of the computation layers;
S202: acquiring the computation times on the terminal device and the ES and the transmission time between the terminal device and the ES, and acquiring the FL strategy, the path scheduling policy and the path offloading policy;
S203: obtaining the DNN inference time with task computation dependencies.
4. The fusion-layer-based DNN model parallelization and partial computation offloading method of claim 3, wherein, in step S201, the set of layers contains V layers, where v denotes a computation layer and V is the total number of computation layers; c_v and d_v represent the transmission data size of layer v and the computation amount of layer v, respectively; e_v′v = (v′, v) ∈ E represents the computational dependency from v′ to v, where E is the set of computational dependencies, and layer v can only be computed after layer v′ has been computed;
in partial computation offloading, the ES starts computing the current offloading layer after the predecessor layers of the current offloading layer have finished computing; the computation offloading policy is the set of indicators h_v, where h_v = 0 if layer v is computed on the terminal device, and h_v = 1 if layer v is offloaded to the ES.
5. The method for parallelization and partial computation offloading of a DNN model based on a fusion layer according to claim 3, wherein, in step S202, among the computation times of the terminal device and the ES, the computation time of layer v on the terminal device is calculated as d_v / f_end, where f_end is the CPU frequency of the terminal device;
the computation time of layer v on the ES is calculated as d_v / f_es, where f_es is the CPU frequency of the ES;
among the transmission times between the terminal device and the ES, the transmission time of layer v from the terminal device to the ES is calculated as c_v / R, where R is the transmission rate between the terminal device and the ES;
the computation order of the FL paths on the terminal device is the same as the transmission order and the computation order on the ES; the computation order of the FL paths on the terminal device is the path scheduling policy, where s_p is the p-th scheduled path; the path offloading policy is the set of o_p, where o_p is the number of layers between the first computation layer and the offloading layer on path p.
6. The method for parallelization and partial computation offloading of a DNN model based on a fusion layer according to claim 3, wherein, in step S203, the task completion time T_p(v) of layer v on path p is calculated recursively,
where v′ represents a preceding computation layer that has a computational dependency edge to layer v;
the DNN inference time with computational dependencies is expressed as the minimization of the overall task completion time subject to constraints C1 and C2,
where constraint C1 represents the computational dependency, i.e., a computation layer can only be computed after all of its predecessor layers have been computed, and constraint C2 is the task completion time T_p(v) of layer v.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210018356.2A CN114358255B (en) | 2022-01-07 | 2022-01-07 | DNN model parallelization and partial computation offloading method based on fusion layer
Publications (2)
Publication Number | Publication Date |
---|---|
CN114358255A CN114358255A (en) | 2022-04-15 |
CN114358255B true CN114358255B (en) | 2024-07-12 |
Family
ID=81108243
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210018356.2A Active CN114358255B (en) | 2022-01-07 | | 2022-01-07 | DNN model parallelization and partial computation offloading method based on fusion layer
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114358255B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111831415A (en) * | 2020-07-10 | 2020-10-27 | 广东石油化工学院 | Multi-queue multi-cluster task scheduling method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9038088B2 (en) * | 2011-03-10 | 2015-05-19 | Nec Laboratories America, Inc. | Load balancing on hetrogenous processing cluster based on exceeded load imbalance factor threshold determined by total completion time of multiple processing phases |
US11574175B2 (en) * | 2020-06-25 | 2023-02-07 | Intel Corporation | Security optimizing compute distribution in a hybrid deep learning environment |
Also Published As
Publication number | Publication date |
---|---|
CN114358255A (en) | 2022-04-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |