CN114970824A - Edge cloud collaborative convolutional neural network reasoning method and system
- Publication number
- CN114970824A (application CN202210611122.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- compression
- given
- scheme
- precision
- Prior art date: 2022-05-31
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
A method and a system for end edge cloud collaborative convolutional neural network (CNN) reasoning include: obtaining, based on a constructed model compression method, the delays of the model under all compression division schemes; determining the performance upper and lower bounds of the joint compression division scheme based on the obtained delays; constructing a model precision upper bound estimation method at a given compression rate on a given CNN division layer; constructing a compression rate decision method at a given precision requirement and CNN division layer; searching for the delay-optimal joint model compression division scheme; and running the system to perform model reasoning. The invention performs hierarchical computation offloading through compression and division of the CNN model and jointly optimizes the communication and computation bottlenecks, achieving fast intelligent analysis of massive terminal data. Through a congruent channel pruning method and a uniform affine quantization method, the communication traffic of the CNN model at any given layer is compressed reliably, controllably, and efficiently, significantly reducing the transmission delay of end edge cloud collaborative CNN reasoning.
Description
Technical Field
The invention belongs to the field of distributed intelligence, and particularly relates to an end edge cloud collaborative convolutional neural network reasoning method and system.
Background
With the development of highly capable deep learning algorithms and the wide application of Internet of Things technology, a large number of intelligent applications (such as traffic monitoring, defect detection, and power grid inspection) rely on inference with deep learning Convolutional Neural Network (CNN) models to perform high-precision, fast intelligent analysis of massive terminal data. Existing methods promote high-precision intelligent analysis by designing and optimizing deep learning CNN inference, and even surpass human performance on some visual tasks. However, high-precision intelligent analysis based on CNN inference is usually accompanied by high computational overhead, and it is difficult to achieve fast intelligent analysis by deploying directly on terminals with limited computing resources, which hinders the deployment of many practical applications. Therefore, how to achieve high-precision, fast intelligent analysis of terminal data under realistic device resource constraints is a key problem in supporting intelligent applications. The high computational overhead of high-precision deep learning CNN inference prevents fast intelligent analysis from being completed on devices with limited general-purpose computing resources. To eliminate the computing bottleneck brought by CNN inference, existing applications often upload data and complete intelligent analysis with computationally scalable cloud computing. However, given the volume of massive terminal data, this approach cannot support large numbers of intelligent applications within practical bandwidth resources. The existing terminal-computing and cloud-computing modes are limited by computation and communication respectively, and neither can support high-precision, fast intelligent analysis of massive terminal data.
Disclosure of Invention
The invention aims to provide an end edge cloud collaborative convolutional neural network reasoning method and system, to solve the problems that the existing modes cannot support large numbers of intelligent applications within practical bandwidth resources, and that the existing terminal-computing and cloud-computing modes, limited by computation and communication respectively, cannot support high-precision, fast intelligent analysis of massive terminal data.
In order to achieve the purpose, the invention adopts the following technical scheme:
An end edge cloud collaborative convolutional neural network reasoning method comprises the following steps:
constructing a communication optimal model compression method, and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
based on the constructed model compression method, information collection is carried out on a given CNN model in a given end edge cloud system, and the time delay of the model under all compression division schemes is obtained;
determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method: if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0, and the optimal end-to-end inference delay T* of the model optimized by the scheme is output;
at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method: if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0, and the optimal inference precision A* of the model optimized by the scheme is output;
optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
Further, constructing the communication-optimal model compression method comprises the following steps:
step 1.1, congruent channel pruning: for a given CNN layer, solve

  min_{β,W} (1/(2S)) Σ_{l=1}^{L} ‖ Y_l − Σ_{k=1}^{K} β_k X_{l,k} W_{l,k} ‖_F² + λ_1 ‖β‖_1,  s.t. ‖β‖_0 ≤ K′,

to prune insignificant convolution kernels, where ‖·‖_F denotes the Frobenius norm; S, L, K and K′ denote, respectively, the number of test samples, the number of branches whose convolution kernels must be deleted simultaneously, the number of convolution kernels subject to deletion, and the number of remaining convolution kernels; Y denotes the output feature map of the current convolution layer; X_k denotes the input feature map of the k-th channel; W_{l,k} denotes the k-th column of the l-th convolution kernel; β is a K-dimensional vector whose entries measure the importance of each convolution kernel; and λ_1 is a penalty coefficient. First, W_{l,k} is fixed and λ_1 is increased while solving for β; the minimum entry of the current β and its corresponding convolution kernel are deleted; the reduced β is then fixed and W_{l,k} is updated by training. The iteration repeats until the number of components of β is less than K′ (a sketch of this loop follows step 1.2);
step 1.2, uniform affine quantization: affinely quantize the output of the given CNN layer compressed in step 1.1 to 8 bits.
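As an illustration of step 1.1, the following is a minimal sketch of the channel-selection loop, assuming the per-channel contributions X_k W_{l,k} over S samples have been precomputed into an array Z; the function name, the scikit-learn LASSO solver, and the closing least-squares refit (a cheap stand-in for updating W_{l,k} by training) are assumptions of this sketch rather than the patent's reference implementation:

```python
import numpy as np
from sklearn.linear_model import Lasso

def congruent_channel_prune(Z, Y, k_keep, lam=1e-4, growth=2.0):
    """Z: (S, K, m) per-channel contributions X_k W_{l,k} for S samples;
    Y: (S, m) original layer output; k_keep: K', the channels to retain."""
    S, K, m = Z.shape
    design = Z.transpose(1, 0, 2).reshape(K, S * m).T  # (S*m, K): one column per channel
    target = Y.reshape(S * m)
    active = np.arange(K)
    while active.size > k_keep:
        # fix W (baked into Z), raise the l1 penalty, and solve for beta
        beta = Lasso(alpha=lam, fit_intercept=False).fit(design[:, active], target).coef_
        # delete the channel whose importance |beta_k| is smallest
        active = np.delete(active, int(np.argmin(np.abs(beta))))
        lam *= growth  # increasing lambda_1, as in step 1.1
    # fix beta's support and refit the kept channels by least squares
    coef, *_ = np.linalg.lstsq(design[:, active], target, rcond=None)
    return active, coef
```

Step 1.2 likewise admits a standard formulation: the sketch below maps a float feature map onto 8-bit unsigned integers with an affine scale and zero point calibrated from the tensor's min/max range (the calibration choice and the names are assumptions of this sketch):

```python
import numpy as np

def uniform_affine_quantize(x, num_bits=8):
    """Quantize float tensor x onto [0, 2^num_bits - 1] with an affine mapping."""
    qmin, qmax = 0, (1 << num_bits) - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard against constant tensors
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def uniform_affine_dequantize(q, scale, zero_point):
    """Recover an approximation of the original float tensor on the receiver."""
    return scale * (q.astype(np.float32) - zero_point)
```

Transmitting the uint8 tensor together with (scale, zero_point) already cuts the traffic of a float32 feature map roughly fourfold, before any channel pruning is counted.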
Further, the specific operation of obtaining the model delays under all compression division schemes is as follows: information collection is performed for the given CNN model in the given end edge cloud system to obtain the delays of all compression division schemes. An N-layer CNN model is deployed in a 3-tier end edge cloud system, with the division layers set to l = (l_1, l_2) and the compression setting to r = (r_1, r_2), where layers 0 to l_1 of the CNN model run on the end device and layer l_1 is compressed at rate r_1; layers l_1 + 1 to l_2 run on the edge device and layer l_2 is compressed at rate r_2; and layers l_2 + 1 to N run on the cloud device. The corresponding compression rates are achieved with the compression model. Under compression division scheme (l, r), the end-to-end delay of CNN inference is T(l, r) = T_c + T_t, where l_0 ≡ 0, l_3 ≡ N, T_c is the sum of the computation delays on all end, edge, and cloud devices, and T_t is the sum of the communication delays between end, edge, and cloud.
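A minimal sketch of this delay model follows; compute_delay and comm_delay are hypothetical hooks that return the latencies measured during information collection, not APIs defined by the patent:

```python
def end_to_end_delay(l, r, N, compute_delay, comm_delay):
    """T(l, r) = T_c + T_t for division layers l = (l1, l2), compression r = (r1, r2)."""
    l1, l2 = l
    r1, r2 = r
    T_c = (compute_delay("end",   1,      l1, r1)     # layers 1..l1 on the end device
         + compute_delay("edge",  l1 + 1, l2, r2)     # layers l1+1..l2 on the edge
         + compute_delay("cloud", l2 + 1, N,  None))  # layers l2+1..N on the cloud
    T_t = (comm_delay("end-edge",   l1, r1)           # layer l1's compressed output
         + comm_delay("edge-cloud", l2, r2))          # layer l2's compressed output
    return T_c + T_t
```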
Further, the specific operation of constructing the model precision upper bound estimation method at a given compression rate on a given CNN division layer is as follows: for a given division layer l_g, the compression rate-precision function A(l_g, r) is monotonically concave; at a given division layer and compression rate (l_g, r_g), based on two existing compression rate-precision data points ((l_g, r_1), A_1) and ((l_g, r_2), A_2) with r_1 ≤ r_2 < r_g or r_g < r_1 ≤ r_2, the precision upper bound of scheme (l_g, r_g) is estimated by extrapolating the line through the two points: Â(l_g, r_g) = A_2 + ((A_2 - A_1)/(r_2 - r_1)) (r_g - r_2), which by concavity lies on or above the true precision. The closer the selected existing data are to (l_g, r_g), the more accurate the estimate, so the two existing data points with the smallest sum of distances to (l_g, r_g) are chosen.
Further, the compression rate decision method at a given precision requirement and CNN division layer is constructed as follows:
using the monotone concavity of the precision-to-compression-rate function R(l_g, A) of the CNN model after compression at a given CNN division layer l_g, the highest compression rate at layer l_g satisfying precision requirement A_g, CRD(A_g | l_g) = R*(l_g, A_g), is determined quickly.
The method comprises the following steps:
step 5.1, based on the two existing data points ((l_g, A_1), r_1) and ((l_g, A_2), r_2) with the smallest sum of distances to (l_g, A_g), compute an estimate r′ of the compression rate;
step 5.2, obtain the data point ((l_g, A′), r′) by actually compressing the model;
step 5.3, repeat steps 5.1 and 5.2 until r′ no longer increases; the maximum compression rate is R*(l_g, A_g); if the estimate r′ falls out of range during the loop iteration, determine a new r′ by bisection within the feasible value range.
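A sketch of the loop in steps 5.1 to 5.3 follows, assuming a measurement hook compress_and_eval(l_g, r) that compresses the model at division layer l_g with rate r and returns the measured precision; the secant estimate and the termination details are assumptions of this sketch:

```python
def compression_rate_decision(l_g, A_g, points, compress_and_eval, r_lo, r_hi):
    """points: at least two measured (A, r) pairs at division layer l_g.
    Returns an approximation of R*(l_g, A_g), the highest rate meeting A_g."""
    r_prev = None
    while True:
        # step 5.1: estimate r' from the two points nearest the target precision
        (A1, r1), (A2, r2) = sorted(points, key=lambda p: abs(p[0] - A_g))[:2]
        if A1 == A2:
            return max(r1, r2)
        r_new = r2 + (r2 - r1) * (A_g - A2) / (A2 - A1)
        if not (r_lo <= r_new <= r_hi):       # step 5.3: bisect when out of range
            r_new = (r_lo + r_hi) / 2.0
        if r_prev is not None and r_new <= r_prev:
            return r_prev                     # r' no longer increases: done
        points.append((compress_and_eval(l_g, r_new), r_new))  # step 5.2
        r_prev = r_new
```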
Further, at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space via the joint optimal model compression division scheme search algorithm, and determining the delay-optimal model optimization scheme (l*, r*) that satisfies the given precision requirement A_0.
Further, the method comprises the following steps:
step 6.1, set the local optimal delay T* = T_max and let l_1 ← 1;
step 6.2, set l_2 ← l_1;
step 6.3, set the scheme l = (l_1, l_2) and, based on the local optimal delay T*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2} to the compression settings whose delay can still improve on T*, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 6.4, based on R, set the candidate compression rate of layer l_1; if the model precision of the corresponding scheme is below A_0, update the candidate compression rate of layer l_1 via the compression rate decision method and update R;
step 6.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the estimated model precision upper bounds of the candidate schemes are all greater than or equal to A_0, update the delay-optimal joint model compression division scheme (l*, r*) and set the optimal delay T* ← T(l*, r*);
step 6.8, update l_2 ← l_2 + 1;
step 6.9, if l_2 ≤ N - 1, repeat steps 6.3 to 6.8;
step 6.10, update l_1 ← l_1 + 1;
step 6.11, if l_1 ≤ N - 1, repeat steps 6.2 to 6.10;
step 6.12, output (l*, r*) and T*.
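Steps 6.1 to 6.12 can be rendered loosely as the nested search below; rates, T, prec_upper, and crd are hypothetical hooks for the profiled rate sets, the delay model, the step-4 style precision bound, and the step-5 style rate decision, and the search-space pruning is simplified relative to the incremental updates of R described above:

```python
def search_delay_optimal(N, A0, T_max, rates, T, prec_upper, crd):
    """Return the delay-optimal scheme ((l1, l2), (r1, r2)) meeting precision A0."""
    best, T_star = None, T_max                 # step 6.1
    for l1 in range(1, N):                     # steps 6.10-6.11
        for l2 in range(l1, N):                # steps 6.2, 6.8-6.9
            l = (l1, l2)
            # step 6.3: keep only settings that can still beat the incumbent
            space = [(a, b) for a in rates[l1] for b in rates[l2]
                     if T(l, (a, b)) < T_star]
            for rr in space:                   # steps 6.4-6.5
                if prec_upper(l, rr) < A0:     # cheap upper-bound screen first
                    continue
                rr = (min(rr[0], crd(A0, l1)), min(rr[1], crd(A0, l2)))
                if T(l, rr) < T_star:
                    best, T_star = (l, rr), T(l, rr)
    return best, T_star                        # step 6.12
```

The precision-constrained and delay-constrained queries are symmetric: the same skeleton yields the step-7 search by pruning on the delay requirement T_0 and keeping the candidate with the highest measured precision.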
Further, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space and determining the precision-optimal model optimization scheme (l*, r*) that satisfies the given delay requirement T_0.
Further, the method comprises the following steps:
step 7.1, set the local optimal precision A* = A_min and let l_1 ← 1;
step 7.2, set l_2 ← l_1;
step 7.3, set the scheme l = (l_1, l_2) and, based on the local optimal precision A*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2}, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 7.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the precisions of the candidate schemes are all greater than A*, set the joint optimal model compression division scheme (l*, r*) and set the optimal precision A* ← A(l*, r*);
step 7.8, update l_2 ← l_2 + 1;
step 7.9, if l_2 ≤ N - 1, repeat steps 7.3 to 7.8;
step 7.10, update l_1 ← l_1 + 1;
step 7.11, if l_1 ≤ N - 1, repeat steps 7.2 to 7.10;
step 7.12, output (l*, r*) and A*.
Further, an edge cloud convolutional neural network inference system based on joint compression division comprises:
the model compression method construction module is used for constructing a communication optimal model compression method and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
the model delay obtaining module is used for carrying out information collection on a given CNN model in a given end edge cloud system based on a constructed model compression method to obtain the delay of the model under all compression division schemes;
a performance upper and lower bound determining module, for determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
the estimation method building module is used for building a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
the decision method construction module is used for constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
an optimal model compression division scheme obtaining module, for: at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0 and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; and, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0 and the optimal inference precision A* of the model optimized by the scheme is output;
an output module, for optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
Compared with the prior art, the invention has the following technical effects:
According to the method, an end-edge-cloud computing architecture is adopted; hierarchical computation offloading is performed through compression and division of the CNN model, and communication and computation bottlenecks are jointly optimized, achieving fast intelligent analysis of massive terminal data while guaranteeing the precision requirement. Through a congruent channel pruning method and a uniform affine quantization method, the communication traffic of the CNN model at any given layer is compressed reliably, controllably, and efficiently, significantly reducing the transmission delay of end edge cloud collaborative CNN reasoning.
Further, by exploiting the monotone concavity of the precision-compression rate and precision-inference delay functions of the CNN model after compression at a given CNN layer, the optimal joint model compression division scheme is determined quickly through the precision upper bound estimation method and the compression rate decision method, significantly reducing the computational overhead of CNN model optimization.
Further, based on the precision upper bound estimation method and the compression rate decision method, under a given inference precision or delay requirement, the joint optimal model compression division scheme search algorithm dynamically prunes the scheme search space and efficiently determines the lowest-delay model optimization scheme meeting the given precision requirement, or the highest-precision model optimization scheme meeting the given delay requirement, supporting low-delay, high-precision collaborative inference of a given CNN model in a given end edge cloud system.
Drawings
FIG. 1 is a schematic diagram of an implementation of the method;
FIG. 2 is a logic flow diagram of the method.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
Referring to fig. 1, the present invention provides an end edge cloud collaborative CNN inference method based on joint compression division, comprising the following steps:
step 1.1, congruent channel pruning: for a given CNN layer, solve

  min_{β,W} (1/(2S)) Σ_{l=1}^{L} ‖ Y_l − Σ_{k=1}^{K} β_k X_{l,k} W_{l,k} ‖_F² + λ_1 ‖β‖_1,  s.t. ‖β‖_0 ≤ K′,

to prune insignificant convolution kernels, where ‖·‖_F denotes the Frobenius norm; S, L, K and K′ denote, respectively, the number of test samples, the number of branches whose convolution kernels must be deleted simultaneously, the number of convolution kernels subject to deletion, and the number of remaining convolution kernels; Y denotes the output feature map of the current convolution layer; X_k denotes the input feature map of the k-th channel; W_{l,k} denotes the k-th column of the l-th convolution kernel; β is a K-dimensional vector whose entries measure the importance of each convolution kernel; and λ_1 is a penalty coefficient. First, W_{l,k} is fixed and λ_1 is increased while solving for β; the minimum entry of the current β and its corresponding convolution kernel are deleted; the reduced β is then fixed and W_{l,k} is updated by training. The iteration repeats until the number of components of β is less than K′;
step 1.2, uniform affine quantization: affinely quantize the output of the given CNN layer compressed in step 1.1 to 8 bits;
step 2, system information collection (System Profiling): information collection is performed for the given CNN model in the given end edge cloud system to obtain the delays of all compression division schemes. An N-layer CNN model is deployed in a 3-tier end edge cloud system, with the division layers set to l = (l_1, l_2) and the compression setting to r = (r_1, r_2), where layers 0 to l_1 of the CNN model run on the end device and layer l_1 is compressed at rate r_1; layers l_1 + 1 to l_2 run on the edge device and layer l_2 is compressed at rate r_2; and layers l_2 + 1 to N run on the cloud device. The compressed models reach the corresponding compression rates based on step 1. Under compression division scheme (l, r), the end-to-end delay of CNN inference is T(l, r) = T_c + T_t, where l_0 ≡ 0, l_3 ≡ N, T_c is the sum of the computation delays on all end, edge, and cloud devices, and T_t is the sum of the communication delays between end, edge, and cloud;
step 3, determine the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is considered;
step 4, model precision upper Bound Estimation (ABE) at a given compression rate on a given CNN division layer: for a given division layer l_g, the compression rate-precision function A(l_g, r) is monotonically concave; at a given division layer and compression rate (l_g, r_g), based on two existing compression rate-precision data points ((l_g, r_1), A_1) and ((l_g, r_2), A_2) with r_1 ≤ r_2 < r_g or r_g < r_1 ≤ r_2, the precision upper bound of scheme (l_g, r_g) is estimated by extrapolating the line through the two points: Â(l_g, r_g) = A_2 + ((A_2 - A_1)/(r_2 - r_1)) (r_g - r_2), which by concavity lies on or above the true precision; the closer the selected existing data are to (l_g, r_g), the more accurate the estimate, so the two existing data points with the smallest sum of distances to (l_g, r_g) are chosen;
step 5, Compression Rate Decision (CRD) at a given precision requirement and CNN division layer: for a given division layer l_g, the precision-to-compression-rate function R(l_g, A) is monotonically concave; using this property, the highest compression rate at division layer l_g satisfying precision requirement A_g, CRD(A_g | l_g) = R*(l_g, A_g), is determined quickly; step 5 comprises the following steps:
step 5.1, based on the two existing data points ((l_g, A_1), r_1) and ((l_g, A_2), r_2) with the smallest sum of distances to (l_g, A_g), compute an estimate r′ of the compression rate based on step 4;
step 5.2, obtain the data point ((l_g, A′), r′) by actually compressing the model;
step 5.3, repeat steps 5.1 and 5.2 until r′ no longer increases; the maximum compression rate is R*(l_g, A_g); if the estimate r′ falls out of range during the loop iteration, determine a new r′ by bisection within the feasible value range;
step 6, at a given precision requirement A_0, search for the delay-optimal joint model compression division scheme: if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0, and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; step 6 comprises the following steps:
step 6.1, set the local optimal delay T* = T_max and let l_1 ← 1;
step 6.2, set l_2 ← l_1;
step 6.3, set the scheme l = (l_1, l_2) and, based on the local optimal delay T*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2} to the compression settings whose delay can still improve on T*, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 6.4, based on R, set the candidate compression rate of layer l_1; if the model precision of the corresponding scheme is below A_0, update, based on step 5, the candidate compression rate of layer l_1 and update R;
step 6.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; based on step 4, if the estimated model precision upper bounds of the candidate schemes are all greater than or equal to A_0, update, based on step 5, the delay-optimal joint model compression division scheme (l*, r*) and set the optimal delay T* ← T(l*, r*);
step 6.8, update l_2 ← l_2 + 1;
step 6.9, if l_2 ≤ N - 1, repeat steps 6.3 to 6.8;
step 6.10, update l_1 ← l_1 + 1;
step 6.11, if l_1 ≤ N - 1, repeat steps 6.2 to 6.10;
step 6.12, output (l*, r*) and T*;
step 7, at a given delay requirement T_0, search for the precision-optimal joint model compression division scheme: if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0, and the optimal inference precision A* of the model optimized by the scheme is output; step 7 comprises the following steps:
step 7.1, set the local optimal precision A* = A_min and let l_1 ← 1;
step 7.2, set l_2 ← l_1;
step 7.3, set the scheme l = (l_1, l_2) and, based on the local optimal precision A*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2}, where R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 7.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; based on step 4, if the precisions of the candidate schemes are all greater than A*, set the joint optimal model compression division scheme (l*, r*) and set the optimal precision A* ← A(l*, r*);
step 7.8, update l_2 ← l_2 + 1;
step 7.9, if l_2 ≤ N - 1, repeat steps 7.3 to 7.8;
step 7.10, update l_1 ← l_1 + 1;
step 7.11, if l_1 ≤ N - 1, repeat steps 7.2 to 7.10;
step 7.12, output (l*, r*) and A*;
step 8, optimize the model based on the joint optimal model compression division scheme (l*, r*) output in step 6 or step 7, and deploy it in the end edge cloud system;
step 9, run the system to perform model reasoning.
Referring to fig. 2, the invention provides an end edge cloud collaborative inference method based on joint model compression division, whose logical architecture comprises three parts: system information collection, model optimization, and model deployment and inference, with model optimization as the main body. To reduce the transmission delay of end edge cloud collaborative CNN inference, communication-optimal model compression is applied to the given CNN model; to reduce the computational overhead of CNN model optimization, the optimal joint model compression division scheme is determined quickly by the precision upper bound estimation method and the compression rate decision method; and to support low-delay, high-precision collaborative inference of a given CNN model in a given end edge cloud system, under a given inference precision or delay requirement, the joint optimal model compression division scheme search algorithm efficiently determines the lowest-delay model optimization scheme satisfying the given precision requirement, or the highest-precision model optimization scheme satisfying the given delay requirement.
In another embodiment of the present invention, an edge cloud convolutional neural network inference system based on joint compression division is provided, which can be used to implement the above-mentioned edge cloud convolutional neural network inference method based on joint compression division. Specifically, the system comprises:
the model compression method construction module is used for constructing a communication optimal model compression method and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
the model delay obtaining module is used for carrying out information collection on a given CNN model in a given end edge cloud system based on a constructed model compression method to obtain the delay of the model under all compression division schemes;
a performance upper and lower bound determining module, for determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
the estimation method construction module is used for constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
the decision method building module is used for building a compression rate decision method when the accuracy requirement and CNN are given to be layered;
an optimal model compression division scheme obtaining module, for: at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0 and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; and, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0 and the optimal inference precision A* of the model optimized by the scheme is output;
an output module, for optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
The invention solves the problem that the prior art cannot achieve low-delay, high-precision collaborative CNN inference in an end edge cloud system. The method adopts an end-edge-cloud computing architecture, performs hierarchical computation offloading through compression and division of the CNN model, and jointly optimizes communication and computation bottlenecks, achieving fast intelligent analysis of massive terminal data while guaranteeing the precision requirement. The invention reduces the inference delay of the CNN model while ensuring its inference precision.
Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (10)
1. An end edge cloud collaborative convolutional neural network reasoning method, characterized by comprising the following steps:
constructing a communication optimal model compression method, and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
based on the constructed model compression method, information collection is carried out on a given CNN model in a given end edge cloud system, and the time delay of the model under all compression division schemes is obtained;
determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0, and the optimal end-to-end inference delay T* of the model optimized by the scheme is output;
at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0, and the optimal inference precision A* of the model optimized by the scheme is output;
optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
2. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein constructing the communication-optimal model compression method comprises the following steps:
step 1.1, congruent channel pruning: for a given CNN layer, solve

  min_{β,W} (1/(2S)) Σ_{l=1}^{L} ‖ Y_l − Σ_{k=1}^{K} β_k X_{l,k} W_{l,k} ‖_F² + λ_1 ‖β‖_1,  s.t. ‖β‖_0 ≤ K′,

to prune insignificant convolution kernels, where ‖·‖_F denotes the Frobenius norm; S, L, K and K′ denote, respectively, the number of test samples, the number of branches whose convolution kernels must be deleted simultaneously, the number of convolution kernels subject to deletion, and the number of remaining convolution kernels; Y denotes the output feature map of the current convolution layer; X_k denotes the input feature map of the k-th channel; W_{l,k} denotes the k-th column of the l-th convolution kernel; β is a K-dimensional vector whose entries measure the importance of each convolution kernel; and λ_1 is a penalty coefficient; first, W_{l,k} is fixed and λ_1 is increased while solving for β; the minimum entry of the current β and its corresponding convolution kernel are deleted; the reduced β is then fixed and W_{l,k} is updated by training; the iteration repeats until the number of components of β is less than K′;
step 1.2, uniform affine quantization: affinely quantize the output of the given CNN layer compressed in step 1.1 to 8 bits.
3. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein the specific operation of obtaining the model delays under all compression division schemes is: information collection is performed for the given CNN model in the given end edge cloud system to obtain the delays of all compression division schemes; an N-layer CNN model is deployed in a 3-tier end edge cloud system, with the division layers set to l = (l_1, l_2) and the compression setting to r = (r_1, r_2), wherein layers 0 to l_1 of the CNN model run on the end device and layer l_1 is compressed at rate r_1; layers l_1 + 1 to l_2 run on the edge device and layer l_2 is compressed at rate r_2; and layers l_2 + 1 to N run on the cloud device; the corresponding compression rates are achieved with the compression model; under compression division scheme (l, r), the end-to-end delay of CNN inference is T(l, r) = T_c + T_t, wherein l_0 ≡ 0, l_3 ≡ N, T_c is the sum of the computation delays on all end, edge, and cloud devices, and T_t is the sum of the communication delays between end, edge, and cloud.
4. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein the specific operation of constructing the model precision upper bound estimation method at a given compression rate on a given CNN division layer is: for a given division layer l_g, the compression rate-precision function A(l_g, r) is monotonically concave; at a given division layer and compression rate (l_g, r_g), based on two existing compression rate-precision data points ((l_g, r_1), A_1) and ((l_g, r_2), A_2) with r_1 ≤ r_2 < r_g or r_g < r_1 ≤ r_2, the precision upper bound of scheme (l_g, r_g) is estimated by extrapolating the line through the two points, which by concavity lies on or above the true precision; the closer the selected existing data are to (l_g, r_g), the more accurate the estimate, so the two existing data points with the smallest sum of distances to (l_g, r_g) are chosen.
5. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein the compression rate decision method at a given precision requirement and CNN division layer is constructed as follows: using the monotone concavity of the precision-to-compression-rate function R(l_g, A) of the CNN model after compression at CNN division layer l_g, the highest compression rate at layer l_g satisfying precision requirement A_g, CRD(A_g | l_g) = R*(l_g, A_g), is determined quickly;
the method comprises the following steps:
step 5.1, based on the two existing data points ((l_g, A_1), r_1) and ((l_g, A_2), r_2) with the smallest sum of distances to (l_g, A_g), compute an estimate r′ of the compression rate;
step 5.2, obtain the data point ((l_g, A′), r′) by actually compressing the model;
step 5.3, repeat steps 5.1 and 5.2 until r′ no longer increases; the maximum compression rate is R*(l_g, A_g); if the estimate r′ falls out of range during the loop iteration, determine a new r′ by bisection within the feasible value range.
6. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein, at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space via the joint optimal model compression division scheme search algorithm, and determining the delay-optimal model optimization scheme (l*, r*) that satisfies the given precision requirement A_0.
7. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 6, characterized by comprising the following steps:
step 6.1, set the local optimal delay T* = T_max and let l_1 ← 1;
step 6.2, set l_2 ← l_1;
step 6.3, set the scheme l = (l_1, l_2) and, based on the local optimal delay T*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2} to the compression settings whose delay can still improve on T*, wherein R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 6.4, based on R, set the candidate compression rate of layer l_1; if the model precision of the corresponding scheme is below A_0, update the candidate compression rate of layer l_1 via the compression rate decision method and update R;
step 6.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the estimated model precision upper bounds of the candidate schemes are all greater than or equal to A_0, update the delay-optimal joint model compression division scheme (l*, r*) and set the optimal delay T* ← T(l*, r*);
step 6.8, update l_2 ← l_2 + 1;
step 6.9, if l_2 ≤ N - 1, repeat steps 6.3 to 6.8;
step 6.10, update l_1 ← l_1 + 1;
step 6.11, if l_1 ≤ N - 1, repeat steps 6.2 to 6.10;
step 6.12, output (l*, r*) and T*.
8. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 1, wherein, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method specifically comprises:
dynamically pruning the scheme search space and determining the precision-optimal model optimization scheme (l*, r*) that satisfies the given delay requirement T_0.
9. The end edge cloud collaborative convolutional neural network reasoning method as claimed in claim 8, characterized by comprising the following steps:
step 7.1, set the local optimal precision A* = A_min and let l_1 ← 1;
step 7.2, set l_2 ← l_1;
step 7.3, set the scheme l = (l_1, l_2) and, based on the local optimal precision A*, reduce the scheme search space R ⊆ R_{l_1} × R_{l_2}, wherein R_{l_1} and R_{l_2} are the sets of selectable compression rates of layers l_1 and l_2;
step 7.5, if R is nonempty, set the candidate compression rate of layer l_2 and update R; if the precisions of the candidate schemes are all greater than A*, set the joint optimal model compression division scheme (l*, r*) and set the optimal precision A* ← A(l*, r*);
step 7.8, update l_2 ← l_2 + 1;
step 7.9, if l_2 ≤ N - 1, repeat steps 7.3 to 7.8;
step 7.10, update l_1 ← l_1 + 1;
step 7.11, if l_1 ≤ N - 1, repeat steps 7.2 to 7.10;
step 7.12, output (l*, r*) and A*.
10. An edge cloud convolutional neural network inference system based on joint compression division, characterized by comprising:
the model compression method construction module is used for constructing a communication optimal model compression method and compressing the communication traffic of the CNN model on any given layer through congruent channel pruning and uniform affine quantization;
the model delay obtaining module is used for carrying out information collection on a given CNN model in a given end edge cloud system based on a constructed model compression method to obtain the delay of the model under all compression division schemes;
a performance upper and lower bound determining module, for determining the performance upper bound (T_max, A_max) and lower bound (T_min, A_min) of the joint compression division scheme based on the obtained delays of all compression division schemes, wherein T_max and T_min are the upper and lower bounds of the inference delay, A_max and A_min are the upper and lower bounds of the inference precision, (T_max, A_max) is determined by the scheme with minimal delay when no compression is applied, and (T_min, A_min) is determined by the scheme with minimal delay when compression is applied;
the estimation method construction module is used for constructing a model precision upper bound estimation method under a given compression ratio on a given CNN division layer;
the decision method construction module is used for constructing a compression rate decision method when the precision requirement and CNN are given to divide layers;
an optimal model compression division scheme obtaining module, for: at a given precision requirement A_0, searching for the delay-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given precision is greater than the upper bound A_max, the upper-bound scheme is provided directly; if the given precision is less than the lower bound A_min, the lower-bound scheme is provided directly; otherwise, the delay-optimal joint model compression division scheme (l*, r*) is searched based on the given precision requirement A_0 and the optimal end-to-end inference delay T* of the model optimized by the scheme is output; and, at a given delay requirement T_0, searching for the precision-optimal joint model compression division scheme based on the model precision upper bound estimation method and the compression rate decision method, wherein if the given delay is greater than the upper bound T_max, the upper-bound scheme is provided directly; if the given delay is less than the lower bound T_min, the lower-bound scheme is provided directly; otherwise, the precision-optimal joint model compression division scheme (l*, r*) is searched based on the given delay requirement T_0 and the optimal inference precision A* of the model optimized by the scheme is output;
an output module, for optimizing the model based on the output joint optimal model compression division scheme (l*, r*), deploying it in the end edge cloud system, and running the system to perform model reasoning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210611122.9A | 2022-05-31 | | Terminal edge cloud collaborative convolutional neural network reasoning method and system
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210611122.9A | 2022-05-31 | | Terminal edge cloud collaborative convolutional neural network reasoning method and system
Publications (2)
Publication Number | Publication Date |
---|---|
CN114970824A | 2022-08-30
CN114970824B | 2024-05-10
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210295165A1 (en) * | 2020-03-18 | 2021-09-23 | Donghua University | Method for constructing efficient product surface defect detection model based on network collaborative pruning |
CN113067873A (en) * | 2021-03-19 | 2021-07-02 | 北京邮电大学 | Edge cloud collaborative optimization method based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
XUE Feng; FANG Weiwei: "EdgeMI: Multi-Device Collaborative Inference for Deep Learning under Resource Constraints", Modern Computer, no. 20, 15 July 2020 (2020-07-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant |