CN114970824A - Edge cloud collaborative convolutional neural network reasoning method and system
- Publication number
- CN114970824A (application CN202210611122.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- compression
- given
- scheme
- precision
- Prior art date: 2022-05-31
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
A method and a system for end edge cloud collaborative convolutional neural network (CNN) reasoning include: obtaining, based on a constructed model compression method, the delays of the model under all compression division schemes; determining the performance upper and lower bounds of the joint compression division scheme based on the obtained delays; constructing a model precision upper bound estimation method at a given compression rate on a given CNN division layer; constructing a compression rate decision method at a given precision requirement and CNN division layer; searching for the delay-optimal joint model compression division scheme; and running the system to perform model reasoning. The invention performs hierarchical computation offloading through compression and division of the CNN model and jointly optimizes the communication and computation bottlenecks, achieving fast intelligent analysis of massive terminal data. Through a congruent channel pruning method and a uniform affine quantization method, the communication traffic of the CNN model at any given layer is compressed reliably, controllably, and efficiently, significantly reducing the transmission delay of end edge cloud collaborative CNN reasoning.
Description
Technical Field
The invention belongs to the field of distributed intelligence, and particularly relates to an end edge cloud collaborative convolutional neural network reasoning method and system.
Background
With the development of highly capable deep learning algorithms and the wide application of Internet of Things technology, a large number of intelligent applications (such as traffic monitoring, defect detection, and power grid inspection) rely on inference with deep learning Convolutional Neural Network (CNN) models to perform high-precision, fast intelligent analysis of massive terminal data. Existing methods promote high-precision intelligent analysis by designing and optimizing deep learning CNN inference, and even surpass human performance on some visual tasks. However, high-precision intelligent analysis based on CNN inference is usually accompanied by high computational overhead, and it is difficult to achieve fast intelligent analysis by deploying directly on terminals with limited computing resources, which hinders the deployment of many practical applications. Therefore, how to achieve high-precision, fast intelligent analysis of terminal data under realistic device resource constraints is a key problem in supporting intelligent applications. The high computational overhead of high-precision deep learning CNN inference prevents fast intelligent analysis from being completed on devices with limited general-purpose computing resources. To eliminate the computing bottleneck brought by CNN inference, existing applications often upload data and complete intelligent analysis with computationally scalable cloud computing. However, given the volume of massive terminal data, this approach cannot support large numbers of intelligent applications within practical bandwidth resources. The existing terminal-computing and cloud-computing modes are limited by computation and communication respectively, and neither can support high-precision, fast intelligent analysis of massive terminal data.
Disclosure of Invention
The invention aims to provide an end edge cloud collaborative convolutional neural network reasoning method and system, to solve the problems that the existing modes cannot support large numbers of intelligent applications within practical bandwidth resources, and that the existing terminal-computing and cloud-computing modes, limited by computation and communication respectively, cannot support high-precision, fast intelligent analysis of massive terminal data.
In order to achieve the purpose, the invention adopts the following technical scheme:
An end edge cloud collaborative convolutional neural network reasoning method comprises the following steps:
constructing a communication optimal model compression method, and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
based on the constructed model compression method, information collection is carried out on a given CNN model in a given end edge cloud system, and the time delay of the model under all compression division schemes is obtained;
determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method: if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0, and the optimal end-to-end inference delay T* of the model optimized by the scheme is output;
at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method: if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0, and the optimal inference precision A* of the model optimized by the scheme is output;
optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
Further, constructing the communication-optimal model compression method comprises the following steps:
step 1.1, congruent channel pruning: for a given CNN layer, solve

  min_{β,W} (1/(2S)) Σ_{l=1}^{L} ‖ Y_l − Σ_{k=1}^{K} β_k X_{l,k} W_{l,k} ‖_F² + λ_1 ‖β‖_1,  s.t. ‖β‖_0 ≤ K′,

to prune insignificant convolution kernels, where ‖·‖_F denotes the Frobenius norm; S, L, K and K′ denote, respectively, the number of test samples, the number of branches whose convolution kernels must be deleted simultaneously, the number of convolution kernels subject to deletion, and the number of remaining convolution kernels; Y denotes the output feature map of the current convolution layer; X_k denotes the input feature map of the k-th channel; W_{l,k} denotes the k-th column of the l-th convolution kernel; β is a K-dimensional vector whose entries measure the importance of each convolution kernel; and λ_1 is a penalty coefficient. First, W_{l,k} is fixed and λ_1 is increased while solving for β; the minimum entry of the current β and its corresponding convolution kernel are deleted; the reduced β is then fixed and W_{l,k} is updated by training. The iteration repeats until the number of components of β is less than K′ (a sketch of this loop follows step 1.2);
step 1.2, uniform affine quantization: affinely quantize the output of the given CNN layer compressed in step 1.1 to 8 bits.
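As an illustration of step 1.1, the following is a minimal sketch of the channel-selection loop, assuming the per-channel contributions X_k W_{l,k} over S samples have been precomputed into an array Z; the function name, the scikit-learn LASSO solver, and the closing least-squares refit (a cheap stand-in for updating W_{l,k} by training) are assumptions of this sketch rather than the patent's reference implementation:

```python
import numpy as np
from sklearn.linear_model import Lasso

def congruent_channel_prune(Z, Y, k_keep, lam=1e-4, growth=2.0):
    """Z: (S, K, m) per-channel contributions X_k W_{l,k} for S samples;
    Y: (S, m) original layer output; k_keep: K', the channels to retain."""
    S, K, m = Z.shape
    design = Z.transpose(1, 0, 2).reshape(K, S * m).T  # (S*m, K): one column per channel
    target = Y.reshape(S * m)
    active = np.arange(K)
    while active.size > k_keep:
        # fix W (baked into Z), raise the l1 penalty, and solve for beta
        beta = Lasso(alpha=lam, fit_intercept=False).fit(design[:, active], target).coef_
        # delete the channel whose importance |beta_k| is smallest
        active = np.delete(active, int(np.argmin(np.abs(beta))))
        lam *= growth  # increasing lambda_1, as in step 1.1
    # fix beta's support and refit the kept channels by least squares
    coef, *_ = np.linalg.lstsq(design[:, active], target, rcond=None)
    return active, coef
```

Step 1.2 likewise admits a standard formulation: the sketch below maps a float feature map onto 8-bit unsigned integers with an affine scale and zero point calibrated from the tensor's min/max range (the calibration choice and the names are assumptions of this sketch):

```python
import numpy as np

def uniform_affine_quantize(x, num_bits=8):
    """Quantize float tensor x onto [0, 2^num_bits - 1] with an affine mapping."""
    qmin, qmax = 0, (1 << num_bits) - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard against constant tensors
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def uniform_affine_dequantize(q, scale, zero_point):
    """Recover an approximation of the original float tensor on the receiver."""
    return scale * (q.astype(np.float32) - zero_point)
```

Transmitting the uint8 tensor together with (scale, zero_point) already cuts the traffic of a float32 feature map roughly fourfold, before any channel pruning is counted.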
Further, the specific operation of obtaining the model delays under all compression division schemes is as follows: information collection is performed for the given CNN model in the given end edge cloud system to obtain the delays of all compression division schemes. An N-layer CNN model is deployed in a 3-tier end edge cloud system, with the division layers set to l = (l_1, l_2) and the compression setting to r = (r_1, r_2), where layers 0 to l_1 of the CNN model run on the end device and layer l_1 is compressed at rate r_1; layers l_1 + 1 to l_2 run on the edge device and layer l_2 is compressed at rate r_2; and layers l_2 + 1 to N run on the cloud device. The corresponding compression rates are achieved with the compression model. Under compression division scheme (l, r), the end-to-end delay of CNN inference is T(l, r) = T_c + T_t, where l_0 ≡ 0, l_3 ≡ N, T_c is the sum of the computation delays on all end, edge, and cloud devices, and T_t is the sum of the communication delays between end, edge, and cloud.
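A minimal sketch of this delay model follows; compute_delay and comm_delay are hypothetical hooks that return the latencies measured during information collection, not APIs defined by the patent:

```python
def end_to_end_delay(l, r, N, compute_delay, comm_delay):
    """T(l, r) = T_c + T_t for division layers l = (l1, l2), compression r = (r1, r2)."""
    l1, l2 = l
    r1, r2 = r
    T_c = (compute_delay("end",   1,      l1, r1)     # layers 1..l1 on the end device
         + compute_delay("edge",  l1 + 1, l2, r2)     # layers l1+1..l2 on the edge
         + compute_delay("cloud", l2 + 1, N,  None))  # layers l2+1..N on the cloud
    T_t = (comm_delay("end-edge",   l1, r1)           # layer l1's compressed output
         + comm_delay("edge-cloud", l2, r2))          # layer l2's compressed output
    return T_c + T_t
```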
Further, the specific operation of constructing the model precision upper bound estimation method at a given compression rate on a given CNN division layer is as follows: for a given division layer l_g, the compression rate-precision function A(l_g, r) is monotonically concave; at a given division layer and compression rate (l_g, r_g), based on two existing compression rate-precision data points ((l_g, r_1), A_1) and ((l_g, r_2), A_2) with r_1 ≤ r_2 < r_g or r_g < r_1 ≤ r_2, the precision upper bound of scheme (l_g, r_g) is estimated by extrapolating the line through the two points: Â(l_g, r_g) = A_2 + ((A_2 - A_1)/(r_2 - r_1)) (r_g - r_2), which by concavity lies on or above the true precision. The closer the selected existing data are to (l_g, r_g), the more accurate the estimate, so the two existing data points with the smallest sum of distances to (l_g, r_g) are chosen.
Further, the compression rate decision method at a given precision requirement and CNN division layer is constructed as follows:
using the monotone concavity of the precision-to-compression-rate function R(l_g, A) of the CNN model after compression at a given CNN division layer l_g, the highest compression rate at layer l_g satisfying precision requirement A_g, CRD(A_g | l_g) = R*(l_g, A_g), is determined quickly.
The method comprises the following steps:
step 5.1, based on the two existing data points ((l_g, A_1), r_1) and ((l_g, A_2), r_2) with the smallest sum of distances to (l_g, A_g), compute an estimate r′ of the compression rate;
step 5.2, obtain the data point ((l_g, A′), r′) by actually compressing the model;
step 5.3, repeat steps 5.1 and 5.2 until r′ no longer increases; the maximum compression rate is R*(l_g, A_g); if the estimate r′ falls out of range during the loop iteration, determine a new r′ by bisection within the feasible value range.
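A sketch of the loop in steps 5.1 to 5.3 follows, assuming a measurement hook compress_and_eval(l_g, r) that compresses the model at division layer l_g with rate r and returns the measured precision; the secant estimate and the termination details are assumptions of this sketch:

```python
def compression_rate_decision(l_g, A_g, points, compress_and_eval, r_lo, r_hi):
    """points: at least two measured (A, r) pairs at division layer l_g.
    Returns an approximation of R*(l_g, A_g), the highest rate meeting A_g."""
    r_prev = None
    while True:
        # step 5.1: estimate r' from the two points nearest the target precision
        (A1, r1), (A2, r2) = sorted(points, key=lambda p: abs(p[0] - A_g))[:2]
        if A1 == A2:
            return max(r1, r2)
        r_new = r2 + (r2 - r1) * (A_g - A2) / (A2 - A1)
        if not (r_lo <= r_new <= r_hi):       # step 5.3: bisect when out of range
            r_new = (r_lo + r_hi) / 2.0
        if r_prev is not None and r_new <= r_prev:
            return r_prev                     # r' no longer increases: done
        points.append((compress_and_eval(l_g, r_new), r_new))  # step 5.2
        r_prev = r_new
```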
Further, at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space via the joint optimal model compression division scheme search algorithm, and determining the delay-optimal model optimization scheme (l*, r*) that satisfies the given precision requirement A_0.
Further, the method comprises the following steps:
step 6.1, set the local optimal delay T* = T_max and let l_1 ← 1;
step 6.2, set l_2 ← l_1;
step 6.3, set the scheme l = (l_1, l_2) and, based on the local optimal delay T*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2} to the compression settings whose delay can still improve on T*, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 6.4, based on R, set the candidate compression rate of layer l_1; if the model precision of the corresponding scheme is below A_0, update the candidate compression rate of layer l_1 via the compression rate decision method and update R;
step 6.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the estimated model precision upper bounds of the candidate schemes are all greater than or equal to A_0, update the delay-optimal joint model compression division scheme (l*, r*) and set the optimal delay T* ← T(l*, r*);
step 6.8, update l_2 ← l_2 + 1;
step 6.9, if l_2 ≤ N - 1, repeat steps 6.3 to 6.8;
step 6.10, update l_1 ← l_1 + 1;
step 6.11, if l_1 ≤ N - 1, repeat steps 6.2 to 6.10;
step 6.12, output (l*, r*) and T*.
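Steps 6.1 to 6.12 can be rendered loosely as the nested search below; rates, T, prec_upper, and crd are hypothetical hooks for the profiled rate sets, the delay model, the step-4 style precision bound, and the step-5 style rate decision, and the search-space pruning is simplified relative to the incremental updates of R described above:

```python
def search_delay_optimal(N, A0, T_max, rates, T, prec_upper, crd):
    """Return the delay-optimal scheme ((l1, l2), (r1, r2)) meeting precision A0."""
    best, T_star = None, T_max                 # step 6.1
    for l1 in range(1, N):                     # steps 6.10-6.11
        for l2 in range(l1, N):                # steps 6.2, 6.8-6.9
            l = (l1, l2)
            # step 6.3: keep only settings that can still beat the incumbent
            space = [(a, b) for a in rates[l1] for b in rates[l2]
                     if T(l, (a, b)) < T_star]
            for rr in space:                   # steps 6.4-6.5
                if prec_upper(l, rr) < A0:     # cheap upper-bound screen first
                    continue
                rr = (min(rr[0], crd(A0, l1)), min(rr[1], crd(A0, l2)))
                if T(l, rr) < T_star:
                    best, T_star = (l, rr), T(l, rr)
    return best, T_star                        # step 6.12
```

The precision-constrained and delay-constrained queries are symmetric: the same skeleton yields the step-7 search by pruning on the delay requirement T_0 and keeping the candidate with the highest measured precision.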
Further, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space and determining the precision-optimal model optimization scheme (l*, r*) that satisfies the given delay requirement T_0.
Further, the method comprises the following steps:
step 7.1, set the local optimal precision A* = A_min and let l_1 ← 1;
step 7.2, set l_2 ← l_1;
step 7.3, set the scheme l = (l_1, l_2) and, based on the local optimal precision A*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2}, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 7.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the precisions of the candidate schemes are all greater than A*, set the joint optimal model compression division scheme (l*, r*) and set the optimal precision A* ← A(l*, r*);
step 7.8, update l_2 ← l_2 + 1;
step 7.9, if l_2 ≤ N - 1, repeat steps 7.3 to 7.8;
step 7.10, update l_1 ← l_1 + 1;
step 7.11, if l_1 ≤ N - 1, repeat steps 7.2 to 7.10;
step 7.12, output (l*, r*) and A*.
Further, an edge cloud convolutional neural network inference system based on joint compression division comprises:
the model compression method construction module is used for constructing a communication optimal model compression method and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
the model delay obtaining module is used for carrying out information collection on a given CNN model in a given end edge cloud system based on a constructed model compression method to obtain the delay of the model under all compression division schemes;
a performance upper and lower bound determining module, for determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
the estimation method building module is used for building a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
the decision method construction module is used for constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
an optimal model compression division scheme obtaining module, for: at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0 and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; and, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0 and the optimal inference precision A* of the model optimized by the scheme is output;
an output module, for optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
Compared with the prior art, the invention has the following technical effects:
According to the method, an end-edge-cloud computing architecture is adopted; hierarchical computation offloading is performed through compression and division of the CNN model, and communication and computation bottlenecks are jointly optimized, achieving fast intelligent analysis of massive terminal data while guaranteeing the precision requirement. Through a congruent channel pruning method and a uniform affine quantization method, the communication traffic of the CNN model at any given layer is compressed reliably, controllably, and efficiently, significantly reducing the transmission delay of end edge cloud collaborative CNN reasoning.
Further, by exploiting the monotone concavity of the precision-compression rate and precision-inference delay functions of the CNN model after compression at a given CNN layer, the optimal joint model compression division scheme is determined quickly through the precision upper bound estimation method and the compression rate decision method, significantly reducing the computational overhead of CNN model optimization.
Further, based on the precision upper bound estimation method and the compression rate decision method, under a given inference precision or delay requirement, the joint optimal model compression division scheme search algorithm dynamically prunes the scheme search space and efficiently determines the lowest-delay model optimization scheme meeting the given precision requirement, or the highest-precision model optimization scheme meeting the given delay requirement, supporting low-delay, high-precision collaborative inference of a given CNN model in a given end edge cloud system.
Drawings
FIG. 1 is a schematic diagram of an implementation of the method;
FIG. 2 is a logic flow diagram of the method.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
Referring to fig. 1, the present invention provides an end edge cloud collaborative CNN inference method based on joint compression division, comprising the following steps:
step 1.1, congruent channel pruning: for a given CNN layer, solve

  min_{β,W} (1/(2S)) Σ_{l=1}^{L} ‖ Y_l − Σ_{k=1}^{K} β_k X_{l,k} W_{l,k} ‖_F² + λ_1 ‖β‖_1,  s.t. ‖β‖_0 ≤ K′,

to prune insignificant convolution kernels, where ‖·‖_F denotes the Frobenius norm; S, L, K and K′ denote, respectively, the number of test samples, the number of branches whose convolution kernels must be deleted simultaneously, the number of convolution kernels subject to deletion, and the number of remaining convolution kernels; Y denotes the output feature map of the current convolution layer; X_k denotes the input feature map of the k-th channel; W_{l,k} denotes the k-th column of the l-th convolution kernel; β is a K-dimensional vector whose entries measure the importance of each convolution kernel; and λ_1 is a penalty coefficient. First, W_{l,k} is fixed and λ_1 is increased while solving for β; the minimum entry of the current β and its corresponding convolution kernel are deleted; the reduced β is then fixed and W_{l,k} is updated by training. The iteration repeats until the number of components of β is less than K′;
step 1.2, uniform affine quantization: affinely quantize the output of the given CNN layer compressed in step 1.1 to 8 bits;
step 2, system information collection (System Profiling): information collection is performed for the given CNN model in the given end edge cloud system to obtain the delays of all compression division schemes. An N-layer CNN model is deployed in a 3-tier end edge cloud system, with the division layers set to l = (l_1, l_2) and the compression setting to r = (r_1, r_2), where layers 0 to l_1 of the CNN model run on the end device and layer l_1 is compressed at rate r_1; layers l_1 + 1 to l_2 run on the edge device and layer l_2 is compressed at rate r_2; and layers l_2 + 1 to N run on the cloud device. The compressed models reach the corresponding compression rates based on step 1. Under compression division scheme (l, r), the end-to-end delay of CNN inference is T(l, r) = T_c + T_t, where l_0 ≡ 0, l_3 ≡ N, T_c is the sum of the computation delays on all end, edge, and cloud devices, and T_t is the sum of the communication delays between end, edge, and cloud;
step 3, determine the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is considered;
step 4, model precision upper Bound Estimation (ABE) at a given compression rate on a given CNN division layer: for a given division layer l_g, the compression rate-precision function A(l_g, r) is monotonically concave; at a given division layer and compression rate (l_g, r_g), based on two existing compression rate-precision data points ((l_g, r_1), A_1) and ((l_g, r_2), A_2) with r_1 ≤ r_2 < r_g or r_g < r_1 ≤ r_2, the precision upper bound of scheme (l_g, r_g) is estimated by extrapolating the line through the two points: Â(l_g, r_g) = A_2 + ((A_2 - A_1)/(r_2 - r_1)) (r_g - r_2), which by concavity lies on or above the true precision; the closer the selected existing data are to (l_g, r_g), the more accurate the estimate, so the two existing data points with the smallest sum of distances to (l_g, r_g) are chosen;
step 5, Compression Rate Decision (CRD) at a given precision requirement and CNN division layer: for a given division layer l_g, the precision-to-compression-rate function R(l_g, A) is monotonically concave; using this property, the highest compression rate at division layer l_g satisfying precision requirement A_g, CRD(A_g | l_g) = R*(l_g, A_g), is determined quickly; step 5 comprises the following steps:
step 5.1, based on the two existing data points ((l_g, A_1), r_1) and ((l_g, A_2), r_2) with the smallest sum of distances to (l_g, A_g), compute an estimate r′ of the compression rate based on step 4;
step 5.2, obtain the data point ((l_g, A′), r′) by actually compressing the model;
step 5.3, repeat steps 5.1 and 5.2 until r′ no longer increases; the maximum compression rate is R*(l_g, A_g); if the estimate r′ falls out of range during the loop iteration, determine a new r′ by bisection within the feasible value range;
step 6, at a given precision requirement A_0, search for the delay-optimal joint model compression division scheme: if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0, and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; step 6 comprises the following steps:
step 6.1, set the local optimal delay T* = T_max and let l_1 ← 1;
step 6.2, set l_2 ← l_1;
step 6.3, set the scheme l = (l_1, l_2) and, based on the local optimal delay T*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2} to the compression settings whose delay can still improve on T*, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 6.4, based on R, set the candidate compression rate of layer l_1; if the model precision of the corresponding scheme is below A_0, update, based on step 5, the candidate compression rate of layer l_1 and update R;
step 6.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; based on step 4, if the estimated model precision upper bounds of the candidate schemes are all greater than or equal to A_0, update, based on step 5, the delay-optimal joint model compression division scheme (l*, r*) and set the optimal delay T* ← T(l*, r*);
step 6.8, update l_2 ← l_2 + 1;
step 6.9, if l_2 ≤ N - 1, repeat steps 6.3 to 6.8;
step 6.10, update l_1 ← l_1 + 1;
step 6.11, if l_1 ≤ N - 1, repeat steps 6.2 to 6.10;
step 6.12, output (l*, r*) and T*;
step 7, at a given delay requirement T_0, search for the precision-optimal joint model compression division scheme: if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0, and the optimal inference precision A* of the model optimized by the scheme is output; step 7 comprises the following steps:
step 7.1, set the local optimal precision A* = A_min and let l_1 ← 1;
step 7.2, set l_2 ← l_1;
step 7.3, set the scheme l = (l_1, l_2) and, based on the local optimal precision A*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2}, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 7.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; based on step 4, if the precisions of the candidate schemes are all greater than A*, set the joint optimal model compression division scheme (l*, r*) and set the optimal precision A* ← A(l*, r*);
step 7.8, update l_2 ← l_2 + 1;
step 7.9, if l_2 ≤ N - 1, repeat steps 7.3 to 7.8;
step 7.10, update l_1 ← l_1 + 1;
step 7.11, if l_1 ≤ N - 1, repeat steps 7.2 to 7.10;
step 7.12, output (l*, r*) and A*;
step 8, optimize the model based on the joint optimal model compression division scheme (l*, r*) output in step 6 or step 7, and deploy it in the end edge cloud system;
step 9, run the system to perform model reasoning.
Referring to fig. 2, the invention provides an end edge cloud collaborative inference method based on joint model compression division, whose logical architecture comprises three parts: system information collection, model optimization, and model deployment and inference, with model optimization as the main body. To reduce the transmission delay of end edge cloud collaborative CNN inference, communication-optimal model compression is applied to the given CNN model; to reduce the computational overhead of CNN model optimization, the optimal joint model compression division scheme is determined quickly by the precision upper bound estimation method and the compression rate decision method; and to support low-delay, high-precision collaborative inference of a given CNN model in a given end edge cloud system, under a given inference precision or delay requirement, the joint optimal model compression division scheme search algorithm efficiently determines the lowest-delay model optimization scheme satisfying the given precision requirement, or the highest-precision model optimization scheme satisfying the given delay requirement.
In another embodiment of the present invention, an edge cloud convolutional neural network inference system based on joint compression division is provided, which can be used to implement the above-mentioned edge cloud convolutional neural network inference method based on joint compression division. Specifically, the system comprises:
the model compression method construction module is used for constructing a communication optimal model compression method and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
the model delay obtaining module is used for carrying out information collection on a given CNN model in a given end edge cloud system based on a constructed model compression method to obtain the delay of the model under all compression division schemes;
a performance upper and lower bound determining module, for determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
the estimation method construction module is used for constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
the decision method building module is used for building a compression rate decision method when the accuracy requirement and CNN are given to be layered;
an optimal model compression division scheme obtaining module, for: at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0 and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; and, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0 and the optimal inference precision A* of the model optimized by the scheme is output;
an output module, for optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
The invention solves the problem that the prior art cannot achieve low-delay, high-precision collaborative CNN inference in an end edge cloud system. The method adopts an end-edge-cloud computing architecture, performs hierarchical computation offloading through compression and division of the CNN model, and jointly optimizes communication and computation bottlenecks, achieving fast intelligent analysis of massive terminal data while guaranteeing the precision requirement. The invention reduces the inference delay of the CNN model while ensuring its inference precision.
Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (10)
1. An end edge cloud collaborative convolutional neural network reasoning method, characterized by comprising the following steps:
constructing a communication optimal model compression method, and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
based on the constructed model compression method, information collection is carried out on a given CNN model in a given end edge cloud system, and the time delay of the model under all compression division schemes is obtained;
determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0, and the optimal end-to-end inference delay T* of the model optimized by the scheme is output;
at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0, and the optimal inference precision A* of the model optimized by the scheme is output;
optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
2. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein constructing the communication-optimal model compression method comprises the following steps:
step 1.1, congruent channel pruning: for a given CNN layer, solve

  min_{β,W} (1/(2S)) Σ_{l=1}^{L} ‖ Y_l − Σ_{k=1}^{K} β_k X_{l,k} W_{l,k} ‖_F² + λ_1 ‖β‖_1,  s.t. ‖β‖_0 ≤ K′,

to prune insignificant convolution kernels, where ‖·‖_F denotes the Frobenius norm; S, L, K and K′ denote, respectively, the number of test samples, the number of branches whose convolution kernels must be deleted simultaneously, the number of convolution kernels subject to deletion, and the number of remaining convolution kernels; Y denotes the output feature map of the current convolution layer; X_k denotes the input feature map of the k-th channel; W_{l,k} denotes the k-th column of the l-th convolution kernel; β is a K-dimensional vector whose entries measure the importance of each convolution kernel; and λ_1 is a penalty coefficient; first, W_{l,k} is fixed and λ_1 is increased while solving for β; the minimum entry of the current β and its corresponding convolution kernel are deleted; the reduced β is then fixed and W_{l,k} is updated by training; the iteration repeats until the number of components of β is less than K′;
step 1.2, uniform affine quantization: affinely quantize the output of the given CNN layer compressed in step 1.1 to 8 bits.
3. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein the specific operation of obtaining the model delays under all compression division schemes is: information collection is performed for the given CNN model in the given end edge cloud system to obtain the delays of all compression division schemes; an N-layer CNN model is deployed in a 3-tier end edge cloud system, with the division layers set to l = (l_1, l_2) and the compression setting to r = (r_1, r_2), wherein layers 0 to l_1 of the CNN model run on the end device and layer l_1 is compressed at rate r_1; layers l_1 + 1 to l_2 run on the edge device and layer l_2 is compressed at rate r_2; and layers l_2 + 1 to N run on the cloud device; the corresponding compression rates are achieved with the compression model; under compression division scheme (l, r), the end-to-end delay of CNN inference is T(l, r) = T_c + T_t, wherein l_0 ≡ 0, l_3 ≡ N, T_c is the sum of the computation delays on all end, edge, and cloud devices, and T_t is the sum of the communication delays between end, edge, and cloud.
4. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein the specific operation of constructing the model precision upper bound estimation method at a given compression rate on a given CNN division layer is: for a given division layer l_g, the compression rate-precision function A(l_g, r) is monotonically concave; at a given division layer and compression rate (l_g, r_g), based on two existing compression rate-precision data points ((l_g, r_1), A_1) and ((l_g, r_2), A_2) with r_1 ≤ r_2 < r_g or r_g < r_1 ≤ r_2, the precision upper bound of scheme (l_g, r_g) is estimated by extrapolating the line through the two points, which by concavity lies on or above the true precision; the closer the selected existing data are to (l_g, r_g), the more accurate the estimate, so the two existing data points with the smallest sum of distances to (l_g, r_g) are chosen.
5. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein the compression rate decision method at a given precision requirement and CNN division layer is constructed as follows: using the monotone concavity of the precision-to-compression-rate function R(l_g, A) of the CNN model after compression at CNN division layer l_g, the highest compression rate at layer l_g satisfying precision requirement A_g, CRD(A_g | l_g) = R*(l_g, A_g), is determined quickly;
the method comprises the following steps:
step 5.1, based on the two existing data points ((l_g, A_1), r_1) and ((l_g, A_2), r_2) with the smallest sum of distances to (l_g, A_g), compute an estimate r′ of the compression rate;
step 5.2, obtain the data point ((l_g, A′), r′) by actually compressing the model;
step 5.3, repeat steps 5.1 and 5.2 until r′ no longer increases; the maximum compression rate is R*(l_g, A_g); if the estimate r′ falls out of range during the loop iteration, determine a new r′ by bisection within the feasible value range.
6. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein, at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space via the joint optimal model compression division scheme search algorithm, and determining the delay-optimal model optimization scheme (l*, r*) that satisfies the given precision requirement A_0.
7. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 6, characterized by comprising the following steps:
step 6.1, set the local optimal delay T* = T_max and let l_1 ← 1;
step 6.2, set l_2 ← l_1;
step 6.3, set the scheme l = (l_1, l_2) and, based on the local optimal delay T*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2} to the compression settings whose delay can still improve on T*, wherein R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 6.4, based on R, set the candidate compression rate of layer l_1; if the model precision of the corresponding scheme is below A_0, update the candidate compression rate of layer l_1 via the compression rate decision method and update R;
step 6.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the estimated model precision upper bounds of the candidate schemes are all greater than or equal to A_0, update the delay-optimal joint model compression division scheme (l*, r*) and set the optimal delay T* ← T(l*, r*);
step 6.8, update l_2 ← l_2 + 1;
step 6.9, if l_2 ≤ N - 1, repeat steps 6.3 to 6.8;
step 6.10, update l_1 ← l_1 + 1;
step 6.11, if l_1 ≤ N - 1, repeat steps 6.2 to 6.10;
step 6.12, output (l*, r*) and T*.
8. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space and determining the precision-optimal model optimization scheme (l*, r*) that satisfies the given delay requirement T_0.
9. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 8, characterized by comprising the following steps:
step 7.1, set the local optimal precision A* = A_min and let l_1 ← 1;
step 7.2, set l_2 ← l_1;
step 7.3, set the scheme l = (l_1, l_2) and, based on the local optimal precision A*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2}, wherein R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 7.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the precisions of the candidate schemes are all greater than A*, set the joint optimal model compression division scheme (l*, r*) and set the optimal precision A* ← A(l*, r*);
step 7.8, update l_2 ← l_2 + 1;
step 7.9, if l_2 ≤ N - 1, repeat steps 7.3 to 7.8;
step 7.10, update l_1 ← l_1 + 1;
step 7.11, if l_1 ≤ N - 1, repeat steps 7.2 to 7.10;
step 7.12, output (l*, r*) and A*.
10. An edge cloud convolutional neural network inference system based on joint compression division, characterized by comprising:
the model compression method construction module is used for constructing a communication optimal model compression method and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
the model delay obtaining module is used for carrying out information collection on a given CNN model in a given end edge cloud system based on a constructed model compression method to obtain the delay of the model under all compression division schemes;
a performance upper and lower bound determining module, for determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
the estimation method construction module is used for constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
the decision method construction module is used for constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
an optimal model compression division scheme obtaining module, for: at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0 and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; and, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0 and the optimal inference precision A* of the model optimized by the scheme is output;
an output module, for optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210611122.9A | 2022-05-31 | | Terminal edge cloud collaborative convolutional neural network reasoning method and system
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210611122.9A | 2022-05-31 | | Terminal edge cloud collaborative convolutional neural network reasoning method and system
Publications (2)
Publication Number | Publication Date |
---|---|
CN114970824A | 2022-08-30
CN114970824B | 2024-05-10
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210295165A1 (en) * | 2020-03-18 | 2021-09-23 | Donghua University | Method for constructing efficient product surface defect detection model based on network collaborative pruning |
CN113067873A (en) * | 2021-03-19 | 2021-07-02 | 北京邮电大学 | Edge cloud collaborative optimization method based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
XUE Feng; FANG Weiwei: "EdgeMI: Multi-Device Collaborative Inference for Deep Learning under Resource Constraints", Modern Computer, no. 20, 15 July 2020 (2020-07-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant |