CN115048218A - End-cloud collaborative inference method and system in an edge heterogeneous scenario

End-cloud collaborative inference method and system in an edge heterogeneous scenario

Info

Publication number
CN115048218A
Authority
CN
China
Prior art keywords: time, neural network, layer, edge, CPU
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210650520.1A
Other languages
Chinese (zh)
Inventor
姬晨晨
于佳耕
侯朋朋
邰阳
苗玉霞
佟晓宇
张丽敏
全雨
武延军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Application filed by Institute of Software of CAS
Priority to CN202210650520.1A
Publication of CN115048218A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models


Abstract

The invention discloses a method and a system for end-cloud collaborative inference in an edge heterogeneous scenario. The method comprises an offline stage and an online stage. In the offline stage, a logistic regression function relating the CPU and memory occupancy of each node to the inference time of each layer of the neural network is obtained. In the online stage, the CPU and memory utilization are acquired dynamically, the execution time of each layer of the neural network under the current CPU and memory pressure is updated in real time, and the optimal model partition point and the optimal node are selected according to these times so as to maximize system throughput. The invention discloses the composition principle of an end-cloud collaborative inference framework in an edge heterogeneous scenario; it can determine the optimal edge device and model partition point of a heterogeneous edge computing system, is applied to an image classification and detection system, and meets the system's high-throughput requirement.

Description

End-cloud collaborative inference method and system in an edge heterogeneous scenario
Technical Field
The invention relates to the fields of edge heterogeneous systems and deep neural network model inference, and in particular to a method and a system for end-cloud collaborative inference in an edge heterogeneous scenario.
Background
In recent years, deep learning has been applied successfully in computer vision. To further improve image classification and object detection accuracy, neural network models keep growing deeper. To satisfy the heavy computational demands of deep learning while avoiding the massive data transfers of pure cloud computing and the limited computing power of pure edge computing, edge-cloud collaborative inference has been proposed in industry. Collaborative inference between the cloud center and edge devices can effectively address real-time deep inference on resource-constrained Internet of Things devices, and model partitioning is a common approach. Model partitioning splits a DNN (Deep Neural Network) model into two parts: one part executes on the edge device and the other on the cloud server. Collaborative inference between the cloud and the edge device avoids transferring large volumes of raw data and makes full use of the computing capability of the edge device.
Current research on model partitioning mainly focuses on partitioning a DNN model for a single edge server, with the optimal partition point obtained by minimizing the total inference latency. Partitioning still faces challenges: 1) a heterogeneous edge system in a real scenario contains multiple edge devices; 2) CPU usage and memory pressure on different edge devices affect the per-layer inference time. A heterogeneous edge computing system may contain diverse devices such as GPU-equipped edge servers or Raspberry Pis. When an inference task arrives, an edge device may already be running other tasks that degrade its performance, which is mainly reflected in CPU and memory occupancy. Traditional model partitioning methods do not consider the influence of CPU and memory occupancy on neural network inference time in edge heterogeneous scenarios; however, the per-layer inference time plays a key role in model partitioning, and inaccurate inference times lead to large errors in the chosen partition layer.
Disclosure of Invention
The invention provides a new method for predicting the per-layer inference time of a neural network model on different devices: before the model is partitioned, the time of each layer is updated according to the CPU and memory occupancy of the device. The invention further provides an intelligent collaborative inference framework for edge heterogeneous scenarios. The method can determine the optimal edge device and model partition point of a heterogeneous edge computing system, can be applied to detection systems such as image recognition, and meets the system's high-throughput requirement.
The technical solution adopted by the invention is as follows:
A method for end-cloud collaborative inference in an edge heterogeneous scenario comprises the following steps:
in an offline stage, obtaining, for each edge device and cloud server, a logistic regression function relating CPU occupancy and memory occupancy to the inference time of each layer of the neural network;
in an online stage, dynamically acquiring the CPU occupancy and memory occupancy of the edge devices and the cloud server, updating the execution time of each layer of the neural network in real time using the logistic regression function, and selecting the optimal model partition point and the optimal edge device of the neural network according to the per-layer execution times so as to maximize system throughput.
Further, the offline stage comprises:
obtaining basic information of the neural network model, including the input and output sizes of each layer;
selecting the real-time CPU and memory usage of different edge devices and cloud servers as independent variables, training a polynomial regression model used for runtime estimation of each layer, and obtaining the logistic regression function relating CPU and memory occupancy to the inference time of each layer of the neural network.
Further, the offline stage obtains the logistic regression function relating CPU and memory occupancy to the per-layer inference time of the neural network through the following steps:
imposing specified CPU occupancy and memory occupancy levels on the edge devices and the cloud server;
running different neural network models under different CPU and memory occupancy levels, and obtaining and recording the execution time of each layer of each model;
fitting, with a mathematical tool, the relationship between the CPU and memory occupancy and the per-layer execution time of the neural network, and predicting the execution time of each layer under different CPU and memory occupancy levels.
Further, the online stage comprises:
acquiring the current network transmission bandwidth, and obtaining the transmission time of the intermediate result of each layer of the neural network from the current bandwidth and the output data size;
acquiring the current CPU and memory occupancy of each available edge device and of the cloud server, and updating, with the logistic regression function, the execution time of each layer of the neural network on the edge side and on the cloud;
obtaining the optimal model partition point and the optimal edge device from the transmission time of the per-layer intermediate results, the per-layer execution time on the edge side and the per-layer execution time on the cloud, so as to maximize system throughput.
Further, the online stage comprises:
deploying Kubernetes in the edge heterogeneous scenario to provide container management, computation configuration and network resource allocation;
deploying Prometheus and Node Exporter for real-time monitoring, so as to obtain the CPU and memory occupancy of each edge device and cloud server in real time, executing a model partitioning algorithm, and distributing the partitioning result to the edge device;
after the edge device receives the partitioning result, it executes the computation of the corresponding network layers to obtain an intermediate result; the intermediate result is sent to the cloud server, and the cloud server executes the computation of the remaining network layers to obtain the final result.
Further, the online stage obtains the optimal model partition point and the optimal edge device through the following steps:
traversing all layers of the neural network model as candidate partition points, and computing, on every device, the times of three stages, the three stages being the total edge-side inference time, the data transmission time and the total cloud inference time;
traversing the three stage times under all partition points, taking the maximum of the three stage times at each partition point, and minimizing this maximum to obtain the optimal model partition point and the optimal edge device. (A formal statement of this objective is given below.)
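To make the objective precise, the partitioning problem can be written as a min-max problem over candidate edge devices and partition points. This is a formalization sketch consistent with the description above; the symbols (per-layer times t, occupancies c and m, output size d_p, bandwidth B) are introduced here for illustration and are not taken from the original text.

T^{\mathrm{edge}}_{e,p} = \sum_{i=1}^{p} t^{\mathrm{edge}}_{e,i}(c_e, m_e), \qquad
T^{\mathrm{trans}}_{p} = \frac{d_p}{B}, \qquad
T^{\mathrm{cloud}}_{p} = \sum_{i=p+1}^{N} t^{\mathrm{cloud}}_{i}(c_0, m_0)

\mathrm{Throughput}(e,p) = \frac{1}{\max\{\, T^{\mathrm{edge}}_{e,p},\ T^{\mathrm{trans}}_{p},\ T^{\mathrm{cloud}}_{p} \,\}}, \qquad
(e^{*}, p^{*}) = \operatorname*{arg\,min}_{e,\,p}\ \max\{\, T^{\mathrm{edge}}_{e,p},\ T^{\mathrm{trans}}_{p},\ T^{\mathrm{cloud}}_{p} \,\}

Here layers 1..p run on edge device e whose current CPU and memory occupancy are (c_e, m_e), layer p's output of size d_p is transmitted over bandwidth B, and layers p+1..N run in the cloud; the steady-state throughput of the pipeline is limited by its slowest stage, which is why minimizing the maximum stage time maximizes throughput.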
An end-cloud collaborative inference system in an edge heterogeneous scenario, comprising:
an offline processing module, configured to obtain, in the offline stage and for each edge device and cloud server, a logistic regression function relating CPU occupancy and memory occupancy to the inference time of each layer of the neural network;
an online processing module, configured to dynamically acquire, in the online stage, the CPU occupancy and memory occupancy of the edge devices and the cloud server, update the execution time of each layer of the neural network in real time using the logistic regression function, and select the optimal model partition point and the optimal edge device of the neural network according to the per-layer execution times so as to maximize system throughput.
Compared with the prior art, the invention has the following beneficial effects:
1) The invention predicts the per-layer inference time of the model according to the CPU and memory pressure on the edge device, and updates it in real time.
2) The invention can select a suitable edge device and the optimal partition point for intelligent collaborative inference in an edge heterogeneous system, thereby meeting the requirement of maximum system throughput.
Drawings
FIG. 1 is a schematic diagram of the end-cloud collaborative inference framework in an edge heterogeneous scenario according to the invention.
FIG. 2 is an example of fitting CPU occupancy, memory occupancy and the per-layer inference time of the neural network model.
FIG. 3 is a schematic diagram of neural network model partitioning in the online stage.
FIG. 4 is an example of the per-layer latency of the NiN model under different CPU and memory occupancy levels in the embodiment.
Detailed Description
To make the aforementioned objects, features and advantages of the invention easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
This patent provides an end-cloud collaborative inference method for edge heterogeneous scenarios. The method can determine the optimal edge device and model partition point for intelligent collaborative inference in a heterogeneous edge computing system so as to obtain the maximum throughput, and can be applied to high-throughput image detection systems, including application scenarios such as face detection, pedestrian detection and license plate detection.
Model partitioning divides the neural network model into two parts: one part is processed on the edge device and the other in the cloud. For example, in a video-frame detection scenario that checks whether subway passengers at a security checkpoint wear masks, video is generated on the edge side and each video frame is fed into the neural network model as input. The computation of each layer of the neural network can be performed either on an edge device or in the cloud. Layers computed on the edge device need not transmit data to the cloud, but their inference time is long because the resources and performance of the edge device are limited; the cloud center is powerful and its inference time is usually short, but a large amount of raw data has to be transmitted from the edge device to the cloud center, which introduces transmission delay. By choosing a suitable model partition point, the computing power of both the edge side and the cloud center is fully utilized and throughput can be maximized. In addition, in a heterogeneous scenario the edge side contains multiple devices of different types, so it is necessary both to find the best partition point and to select a suitable device on which to deploy the model.
FIG. 1 shows the end-cloud collaborative inference framework in an edge heterogeneous scenario, which comprises an offline stage and an online stage. The main tasks of the offline stage are as follows:
1) obtaining basic information of the neural network model, including the input and output sizes of each layer;
2) obtaining a logistic regression function relating CPU and memory occupancy to the per-layer inference time of the neural network: the real-time CPU and memory usage of different edge devices and cloud servers is selected as the independent variables, and a polynomial regression model is trained for runtime estimation of each layer.
In the online stage, the device performance (CPU and memory occupancy) of all edge devices and of the cloud server is obtained, the per-layer execution latency is updated, and the model partitioning is completed. The tasks of the online stage are as follows:
1) acquiring the current network transmission bandwidth, and obtaining the transmission time of the intermediate result of each layer of the neural network from the current bandwidth and the output data size;
2) acquiring the current CPU and memory occupancy of each available edge device and of the cloud server, and updating, with the logistic regression function, the execution time of each layer of the neural network on the edge side and on the cloud;
3) obtaining the optimal model partition point and the optimal edge device from the transmission time of the intermediate results, the per-layer execution time on the edge side and the per-layer execution time on the cloud, so as to maximize system throughput (hereinafter, the model partitioning algorithm).
The optimal model partition point and the optimal edge device in step 3) are obtained as follows:
a) traversing all layers of the neural network model as candidate partition points, and computing, on every device, the times of three stages: the total edge-side inference time, the data transmission time and the total cloud inference time;
b) traversing the three stage times (total edge-side inference time, data transmission time, total cloud inference time) under all partition points, taking the maximum of the three stage times at each partition point, and minimizing this maximum to obtain the optimal partition strategy and the optimal partition point. For example, suppose there are three candidate partition points: the three stage times at the first partition point are (6, 4, 3), with a maximum of 6; the stage times at the second are (5, 3, 4), with a maximum of 5; the stage times at the third are (4, 5, 7), with a maximum of 7. The best partition point in this case is the second one. A Python sketch of this procedure is given below.
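The following Python sketch shows one way steps a) and b) could be realized. It is a minimal sketch under assumptions not stated in the patent: per-layer edge and cloud times and the data sizes at each candidate cut are assumed to be given as lists, and all names (select_partition, edge_times, out_sizes and so on) are illustrative rather than taken from the original implementation.

def select_partition(devices, cloud_times, out_sizes, bandwidth):
    """Pick the edge device and partition point whose slowest pipeline stage
    (edge inference, transmission, cloud inference) is smallest.

    devices     : dict mapping device name -> list of per-layer edge times (s)
    cloud_times : list of per-layer cloud times (s), length N
    out_sizes   : list of data sizes (bits) sent when cutting after layer p;
                  out_sizes[0] is the raw input, out_sizes[p] the output of layer p
    bandwidth   : network bandwidth (bits/s)
    """
    n_layers = len(cloud_times)
    best = None  # (bottleneck_time, device, partition_point)
    for dev, edge_times in devices.items():
        for p in range(n_layers + 1):        # layers 1..p on the edge, p+1..N in the cloud
            t_edge = sum(edge_times[:p])     # total edge-side inference time
            t_trans = out_sizes[p] / bandwidth
            t_cloud = sum(cloud_times[p:])   # total cloud inference time
            bottleneck = max(t_edge, t_trans, t_cloud)
            if best is None or bottleneck < best[0]:
                best = (bottleneck, dev, p)
    return best

# Mirroring the illustration above: if the stage times at three candidate cuts are
# (6, 4, 3), (5, 3, 4) and (4, 5, 7), their bottlenecks are 6, 5 and 7, so the
# second cut is selected.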
Experiments show that CPU and memory occupancy affect the per-layer inference time of the model, as shown in FIG. 2: as the CPU and memory occupancy increase, the inference time of each layer of the model also increases. To obtain the logistic regression function relating the CPU and memory occupancy of the edge device to the inference time of the neural network model, the offline stage proceeds as follows:
1) using the stress tool of the Linux system to impose specified CPU occupancy and memory occupancy levels on the edge devices and the cloud server;
2) running different neural network models with Caffe under different CPU and memory occupancy levels, and obtaining and recording the execution time of each layer of each model;
3) fitting, with Python mathematical tools, the measured relationship among CPU occupancy, memory occupancy and per-layer DNN execution time, and predicting the per-layer execution time of the neural network under different CPU and memory occupancy levels (a minimal fitting sketch is given below).
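The patent does not give the exact form of the fit; the sketch below assumes a second-degree polynomial in CPU and memory occupancy fitted by least squares with scikit-learn, which is one plausible reading of the "polynomial regression model" mentioned above. The data layout and all names are illustrative, and the three sample rows simply echo the conv1 measurements quoted later in the embodiment; in practice many (CPU, memory, time) samples per layer would be collected.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Measurements for one layer on one device: (CPU occupancy, memory occupancy) -> time in ms.
X = np.array([[0.10, 0.38], [0.30, 0.60], [0.45, 0.90]])
y = np.array([77.0, 246.0, 616.0])

# Degree-2 polynomial regression: time ~ f(cpu, mem). With only three samples the
# fit is under-determined and purely illustrative; a real profiling run sweeps a
# grid of occupancy levels.
layer_time_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
layer_time_model.fit(X, y)

# Predict the layer's execution time at the occupancy observed online.
print(layer_time_model.predict(np.array([[0.20, 0.50]])))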
In the online stage, the CPU utilization and memory occupancy of each edge device in the heterogeneous system must be obtained in real time at every moment in order to update the per-layer inference times and complete the partitioning of the model. The architecture is shown in FIG. 3, and the specific steps are as follows:
1) Kubernetes is deployed in the edge heterogeneous scenario and provides container management, computation configuration, network resource allocation and related functions;
2) to obtain the CPU and memory occupancy of each edge device and cloud server in real time, Prometheus and Node Exporter are deployed for real-time monitoring;
3) information exchange among the edge devices, the cloud server and the host is realized through gRPC; this open-source framework provides a flexible remote procedure call interface for communication. The host is the central management device: it obtains the CPU and memory occupancy of the edge devices and the cloud server and completes information exchange and task distribution, and it can be deployed close to the edge side or to the cloud server according to the scenario;
4) a Master is deployed on the host; it obtains real-time information about each device through Prometheus, updates the inference times, executes the model partitioning algorithm, and distributes the partitioning result to the edge device;
5) after receiving the partition-layer information, the edge device executes the computation of the corresponding network layers and obtains an intermediate result; the intermediate result is sent to the cloud server as the input of the cloud center, and the cloud server executes the computation of the remaining network layers to obtain the final result.
In FIG. 3, A denotes the host issuing a CPU-and-memory query to the edge devices and the cloud server, which return their current CPU and memory occupancy to the host; B denotes feeding the current CPU and memory occupancy of each edge device or cloud server into the model partitioning algorithm; C denotes distributing the inference task to an edge device; and D denotes transmitting information such as the intermediate inference result to the cloud server. A sketch of the kind of monitoring query the Master could issue is given below.
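As a concrete illustration of step 4), the sketch below queries Prometheus for the current CPU and memory occupancy of every monitored node. The Prometheus address and the PromQL expressions are assumptions (common choices for Node Exporter metrics; the exact names depend on the deployment), and layer_time_model / select_partition refer to the illustrative helpers sketched earlier, not to the patent's own code.

import requests

PROM_URL = "http://prometheus.example:9090/api/v1/query"  # assumed Prometheus endpoint

# Typical PromQL for Node Exporter; exact metric names and labels depend on the deployment.
CPU_QUERY = '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))'
MEM_QUERY = '1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes'

def prom_query(expr):
    """Run an instant PromQL query and return {instance: value}."""
    resp = requests.get(PROM_URL, params={"query": expr}, timeout=5)
    resp.raise_for_status()
    return {r["metric"].get("instance", "unknown"): float(r["value"][1])
            for r in resp.json()["data"]["result"]}

def current_occupancy():
    """Collect (cpu, mem) occupancy per node for the partitioning step."""
    cpu, mem = prom_query(CPU_QUERY), prom_query(MEM_QUERY)
    return {node: (cpu[node], mem.get(node, 0.0)) for node in cpu}

# Master loop sketch: refresh occupancy, re-predict per-layer times with the fitted
# regression (layer_time_model), then call select_partition and dispatch the result.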
Embodiment:
Using the above method, an edge heterogeneous cluster was built for the experiment. The experiment simulates a real environment and is configured with one Master host, two edge devices and one cloud center server. The specific configuration is shown in Tables 1 to 3.
Table 1. Host
Hardware and environment    Properties
System                      Ubuntu 16.04.6 LTS
CPU                         Intel Core i7-10510U
Memory                      16.0 GB DDR4 2133 MHz
Table 2. Edge devices (table data provided as an image in the original publication)
Table 3. Cloud center (table data provided as an image in the original publication)
The embodiment comprises the following steps:
First, the relationship between the CPU and memory occupancy of each edge device and the per-layer inference time of the neural network is obtained. Using the measurement method described above, the layers of the DNNs were analyzed and the computation time of each layer on the edge and in the cloud was collected empirically. Experiments were performed on 6 DNNs (chain DNN models: AlexNet, CaffeNet, NiN, VGG-16; DAG DNN models: GoogLeNet, ResNet-18), and the polynomial regression model for per-layer runtime estimation was determined by setting different CPU and memory utilization levels. Taking edge device 1 (Intel Core i7-10510U CPU) as an example, the experiment is as follows:
1) CPU utilization of the edge device: three stress levels were considered, with CPU utilization of 10%, 30% and 45%;
2) memory utilization of the edge device: three stress levels were considered, with memory utilization of 38%, 60% and 90%. The CPU and memory loads on the edge device are imposed with Linux tools, which place the corresponding load on the CPU and memory resources.
The NiN model is used as the analysis example. As shown in FIG. 4, it comprises 9 convolutional layers (conv1 to conv9) and three fully connected layers (fc1 to fc3). For example, when the CPU and memory utilization on the edge device are 10% and 38% respectively (left), the latency of conv1 is 77 ms; when they are 30% and 60% (middle), the latency of conv1 is 246 ms; and when they are 45% and 90% (right), the latency of conv1 is 616 ms. The latency of each layer increases as the CPU and memory pressure increase. Detailed data are given in Table 4.
Table 4. Per-layer execution time of the NiN model under different CPU and memory utilization levels (table data provided as an image in the original publication)
Different neural network models are run with Caffe under different CPU and memory occupancy levels, and the execution time of each layer of each model is obtained and recorded.
The measured relationship among CPU occupancy, memory occupancy and per-layer DNN execution time is then fitted with Python mathematical tools, and the per-layer DNN execution time under different CPU and memory occupancy levels is predicted.
Using the above method, the CPU utilization, memory utilization and conv1 data in Table 4 were fitted; the result is shown in FIG. 2.
Second, a suitable device and partition point are selected among the edge devices and the cloud center cluster. WiFi with a bandwidth of 20 Mbps is set up between the edge devices and the cloud center cluster, and Table 5 shows the model partitioning of the NiN model in the edge heterogeneous scenario. The experimental results are as follows: the CPU and memory occupancy of device 1 are 10% and 35% respectively, for which the partition point is the pool0 layer and the throughput is 8; the CPU and memory occupancy of device 2 are 20% and 25%, for which the partition point is the pool0 layer and the throughput is 9.36. Device 2 is therefore chosen as the edge device for inference, and the model partition point is the pool0 layer.
Table 5. Model partitioning and device selection in the edge heterogeneous scenario (table data provided as an image in the original publication)
Another embodiment of the invention provides an end-cloud collaborative inference system in an edge heterogeneous scenario, comprising:
an offline processing module, configured to obtain, in the offline stage and for each edge device and cloud server, a logistic regression function relating CPU occupancy and memory occupancy to the inference time of each layer of the neural network;
an online processing module, configured to dynamically acquire, in the online stage, the CPU occupancy and memory occupancy of the edge devices and the cloud server, update the execution time of each layer of the neural network in real time using the logistic regression function, and select the optimal model partition point and the optimal edge device of the neural network according to the per-layer execution times so as to maximize system throughput.
The specific implementation of the offline processing module and the online processing module is as described above for the method of the invention.
Another embodiment of the invention provides an electronic device (a computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Another embodiment of the invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disc) storing a computer program which, when executed by a computer, performs the steps of the method of the invention.
The particular embodiments of the invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention is not limited to the embodiments disclosed in this specification; its scope is defined by the appended claims.

Claims (10)

1. A method for end-cloud collaborative inference in an edge heterogeneous scenario, characterized by comprising the following steps:
in an offline stage, obtaining, for each edge device and cloud server, a logistic regression function relating CPU occupancy and memory occupancy to the inference time of each layer of a neural network;
in an online stage, dynamically acquiring the CPU occupancy and memory occupancy of the edge devices and the cloud server, updating the execution time of each layer of the neural network in real time using the logistic regression function, and selecting the optimal model partition point and the optimal edge device of the neural network according to the per-layer execution times so as to maximize system throughput.
2. The method of claim 1, wherein the offline stage comprises:
obtaining basic information of the neural network model, including the input and output sizes of each layer;
selecting the real-time CPU and memory usage of different edge devices and cloud servers as independent variables, training a polynomial regression model used for runtime estimation of each layer, and obtaining the logistic regression function relating CPU and memory occupancy to the inference time of each layer of the neural network.
3. The method of claim 1, wherein the offline stage obtains the logistic regression function relating CPU and memory occupancy to the per-layer inference time of the neural network through the following steps:
imposing specified CPU occupancy and memory occupancy levels on the edge devices and the cloud server;
running different neural network models under different CPU and memory occupancy levels, and obtaining and recording the execution time of each layer of each model;
fitting, with a mathematical tool, the relationship between the CPU and memory occupancy and the per-layer execution time of the neural network, and predicting the execution time of each layer under different CPU and memory occupancy levels.
4. The method of claim 1, wherein the online stage comprises:
acquiring the current network transmission bandwidth, and obtaining the transmission time of the intermediate result of each layer of the neural network from the current bandwidth and the output data size;
acquiring the current CPU and memory occupancy of each available edge device and of the cloud server, and updating, with the logistic regression function, the execution time of each layer of the neural network on the edge side and on the cloud;
obtaining the optimal model partition point and the optimal edge device from the transmission time of the per-layer intermediate results, the per-layer execution time on the edge side and the per-layer execution time on the cloud, so as to maximize system throughput.
5. The method of claim 1, wherein the online stage comprises:
deploying Kubernetes in the edge heterogeneous scenario to provide container management, computation configuration and network resource allocation;
deploying Prometheus and Node Exporter for real-time monitoring, so as to obtain the CPU and memory occupancy of each edge device and cloud server in real time, executing a model partitioning algorithm, and distributing the partitioning result to the edge device;
after the edge device receives the partitioning result, executing the computation of the corresponding network layers to obtain an intermediate result, sending the intermediate result to the cloud server, and the cloud server executing the computation of the remaining network layers to obtain the final result.
6. The method of claim 1, wherein the online stage obtains the optimal model partition point and the optimal edge device through the following steps:
traversing all layers of the neural network model as candidate partition points, and computing, on every device, the times of three stages, the three stages being the total edge-side inference time, the data transmission time and the total cloud inference time;
traversing the three stage times under all partition points, taking the maximum of the three stage times at each partition point, and minimizing this maximum to obtain the optimal model partition point and the optimal edge device.
7. An end-cloud collaborative inference system in an edge heterogeneous scenario, characterized by comprising:
an offline processing module, configured to obtain, in an offline stage and for each edge device and cloud server, a logistic regression function relating CPU occupancy and memory occupancy to the inference time of each layer of a neural network;
an online processing module, configured to dynamically acquire, in an online stage, the CPU occupancy and memory occupancy of the edge devices and the cloud server, update the execution time of each layer of the neural network in real time using the logistic regression function, and select the optimal model partition point and the optimal edge device of the neural network according to the per-layer execution times so as to maximize system throughput.
8. The system of claim 7, wherein the online processing module obtains the optimal model partition point and the optimal edge device through the following steps:
traversing all layers of the neural network model as candidate partition points, and computing, on every device, the times of three stages, the three stages being the total edge-side inference time, the data transmission time and the total cloud inference time;
traversing the three stage times under all partition points, taking the maximum of the three stage times at each partition point, and minimizing this maximum to obtain the optimal partition strategy and the optimal partition point.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 6.
CN202210650520.1A 2022-06-09 2022-06-09 End cloud collaborative reasoning method and system in edge heterogeneous scene Pending CN115048218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210650520.1A CN115048218A (en) 2022-06-09 2022-06-09 End cloud collaborative reasoning method and system in edge heterogeneous scene


Publications (1)

Publication Number Publication Date
CN115048218A true CN115048218A (en) 2022-09-13

Family

ID=83161485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210650520.1A Pending CN115048218A (en) 2022-06-09 2022-06-09 End cloud collaborative reasoning method and system in edge heterogeneous scene

Country Status (1)

Country Link
CN (1) CN115048218A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3145812A1 (en) * 2023-02-10 2024-08-16 Renault S.A.S Method for estimating the execution time of a neural network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination