CN111522657A - Distributed equipment collaborative deep learning reasoning method - Google Patents
- Publication number: CN111522657A (application CN202010289197.0A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F9/5072 — Grid computing (under G06F9/50, Allocation of resources, e.g. of the CPU; G06F9/5061, Partitioning or combining of resources)
- G06N3/045 — Combinations of networks (under G06N3/04, Neural network architecture)
- G06N3/063 — Physical realisation of neural networks using electronic means
- G06N3/08 — Learning methods
Abstract
The invention discloses a method for deploying a cache-based deep neural network across distributed edge devices. The method partitions the neural network, applies pruning to the layer immediately before the partition point, computes the first part of the network on the task-initiating device, transmits the small intermediate result to another edge device, and computes the remainder there. In addition to this computation, the edge devices cache and reuse intermediate results of the neural network and share the cache among different devices. This reduces the latency of edge intelligent applications and lowers the performance the network demands of edge devices; in particular, it reduces the amount of repeated computation when the edge side issues intelligent task requests on similar data, lowers the performance requirements deep learning places on devices, and makes full use of the computing resources of edge scenarios.
Description
Technical Field
The invention relates to the fields of artificial intelligence and edge computing within computer science, and in particular to a deep learning inference method that combines cooperative computation across dispersed devices with caching.
Background
Edge computing is a new computing paradigm that aims to use the computing and communication resources of edge devices to meet user requirements for real-time response, privacy, security, and computational autonomy. Driven by rapid advances in algorithms, computing power, and big data, deep learning, the most active area of artificial intelligence, has made great progress in many fields. With the development of the Internet of Things and cyber-physical systems (CPS), emerging applications such as autonomous driving, intelligent UAV formation, and intelligent robot swarms are driving the convergence of edge computing and artificial intelligence, giving rise to and rapidly advancing edge intelligence technology. How to deploy and run deep learning functions on resource-constrained edge devices poses a major technical challenge for edge intelligence.
Deep learning computation on edge devices generally requires acceleration beyond conventional general-purpose algorithms. Common acceleration methods include:
(1) Model early exit: DNN models with high accuracy typically have deep structures, and executing them on terminal devices consumes significant resources. To speed up inference, the model-early-exit method uses the output of early layers to obtain classification results, meaning that inference is completed using only part of the DNN model. Reducing latency is the optimization goal of model early exit.
(2) Input filtering: input filtering is an effective way to accelerate DNN model inference, especially for video analytics. Its key idea is to discard input frames that contain no target objects, avoiding redundant DNN computation, thereby preserving inference accuracy while shortening inference latency and reducing energy consumption.
(3) Model selection: model selection is proposed to optimize DNN inference for latency, accuracy, and energy consumption. The main idea is to first train, offline, a set of DNN models of different sizes for the same task, and then adaptively select among them online at inference time. Model selection resembles model early exit, in that each exit point of the early-exit mechanism can be viewed as a separate DNN model. The key difference is that exit points share part of the DNN layers with the main-branch model, whereas the models in a model-selection mechanism are independent.
Each of these methods can accelerate deep learning models in some respect, but they neither exploit the interconnection between edge devices nor account for the fact that, over a period of time, edge devices are likely to run intelligent tasks whose input data are similar or even identical. The invention therefore proposes a deep learning inference method combining edge cooperative computing and caching, tailored to these characteristics of edge scenarios.
Disclosure of Invention
The invention integrates caching, neural networks, and edge computing. Geographically dispersed computing devices in an edge scenario cooperatively perform deep learning computation: the task-initiating device performs part of the computation and caches the result together with the final label. If the similarity between an intermediate result and cached content reaches a threshold, the cached label is used as the result; otherwise the intermediate result is transmitted to other dispersed devices, which complete the computation and return the final result to the task-initiating device. Intelligent tasks can thus be computed within the edge scenario without exchanging data with a cloud data center. The method comprises a dispersed-device selection step and a cache-based distributed neural network computation step, as follows:
For dispersed-device selection, each device registers its own information, including IP address and port number, in a registry. When a task is initiated, the registry collects information such as network transmission speed to the other devices, memory, load, computing performance, and remaining battery. Each piece of performance information is then weighted and summed according to the specific situation, and the device with the highest weighted score is selected as the partner for distributed neural network computation.
For cache-based collaborative inference in the edge scenario, the neural network is partitioned according to the information of the two devices and of the neural network model. Partitioning exploits the layered structure of neural networks: different layers have different computational costs depending on layer type and position. Convolutional and pooling layers have much lower latency than fully connected layers, especially on GPUs; convolutional and pooling layers usually sit at the front of the network, and fully connected layers at the back. The data size generally shrinks layer by layer, and the outputs of later layers are usually markedly smaller than the original input, so transmitting or caching a later intermediate result takes far less time than transmitting or caching the raw data. After partitioning, structured pruning can be applied to the layer immediately before the partition point, further reducing the size of the transmitted intermediate result; because only the single layer at the partition is pruned, the model's complexity is essentially preserved and its accuracy essentially unaffected. After partitioning, the task-initiating device performs the first part of the computation and uses the resulting features as a key to query the cache. If the similarity is high enough, the corresponding cached value is taken as the final result; otherwise the intermediate result is transmitted to the other device, which performs the remaining computation and returns the final result to the task-initiating device.
The content of the cache can be adjusted over time to fit the available memory; even a cache holding only the most recent entries effectively avoids the extra time and computation of repeatedly initiated computing tasks.
The remaining neural network computation on the other device is invoked remotely through a dynamic proxy, so that different tasks can be invoked remotely using the same remote-invocation module together with dynamically generated proxy classes.
Compared with the prior art, the innovations of the invention are as follows. Exploiting the similarity of intelligent-task data in edge scenarios and the layered structure of neural networks, the neural network is run, in combination with a distributed cache, on geographically dispersed edge devices; this makes effective use of edge computing resources and allows resource-hungry deep learning tasks to be deployed efficiently on edge devices. After the network is partitioned, structured pruning of the layer producing the intermediate result substantially reduces the size of the transmitted intermediate result while leaving the model's complexity and accuracy essentially unaffected. Remote invocation of other devices for the remaining neural network computation uses a dynamic proxy, so multiple computation tasks can share one remote-invocation framework without concern for the implementation details of remote calls. Finally, the cache key is computed by the same first half of the neural network that serves the intelligent task, so no extra computing resources or time are needed.
Drawings
FIG. 1 Schematic diagram of dispersed-device selection
FIG. 2 Schematic diagram of neural network model partitioning
FIG. 3 is a schematic diagram of edge scene collaborative inference based on caching
Detailed Description
The invention mainly comprises two steps: a dispersed-device selection step and a cache-based distributed neural network computation step.
Decentralized device selection
Edge-scenario devices are generally heterogeneous, i.e., they differ in computing, network, and storage performance, so a cooperative device must be chosen sensibly before a cooperative intelligent computing task is initiated. Because edge scenarios are generally centerless, every device can act as either a task initiator or a task collaborator. As shown in fig. 1, to let other devices know of its existence, each device registers its IP address, port number, and other performance information in the registry and establishes a stable connection while ensuring network reliability. The task-initiating device obtains information about the other devices from the registry. It first filters out, from all connected devices, those with insufficient memory, insufficient battery, or excessive load; the remaining devices are ranked by a weighted combination of computing and network performance, and the top-ranked device is selected as the cooperative device. If the two devices together still cannot meet the computational requirements, the second device selects the next cooperative device, in the same way, from the devices with which it has established stable connections, and so on. Device information must be kept up to date: the registry must be notified promptly when a device's load, memory occupancy, or other conditions change.
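The filter-then-rank selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: all field names, thresholds, and weights are assumptions chosen for the example.

```python
# Hypothetical cooperative-device selection: drop devices with too little
# memory or battery, or too much load, then rank the rest by a weighted sum
# of compute and network performance and pick the highest score.
def select_device(devices, min_mem=512, min_battery=20, max_load=0.8,
                  w_compute=0.6, w_network=0.4):
    candidates = [d for d in devices
                  if d["mem_mb"] >= min_mem
                  and d["battery_pct"] >= min_battery
                  and d["load"] <= max_load]
    if not candidates:
        return None
    # Higher weighted score = better cooperative device.
    return max(candidates,
               key=lambda d: w_compute * d["gflops"] + w_network * d["mbps"])

registry = [
    {"id": "dev-a", "mem_mb": 1024, "battery_pct": 80, "load": 0.3,
     "gflops": 10.0, "mbps": 50.0},
    {"id": "dev-b", "mem_mb": 2048, "battery_pct": 60, "load": 0.5,
     "gflops": 30.0, "mbps": 40.0},
    {"id": "dev-c", "mem_mb": 256, "battery_pct": 90, "load": 0.1,
     "gflops": 30.0, "mbps": 90.0},  # filtered out: too little memory
]
best = select_device(registry)
print(best["id"])  # dev-b: 0.6*30 + 0.4*40 = 34 beats dev-a's 0.6*10 + 0.4*50 = 26
```

If the first pair of devices is still insufficient, the same function would be called again on the second device's own connection list, mirroring the chained selection described above.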
To guard against device crashes, network disconnections, and similar failures, each device sends heartbeat messages to the registry at regular intervals; if no heartbeat is received for a sustained period, the device is considered failed and is removed from the registry.
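The heartbeat bookkeeping on the registry side can be sketched in a few lines. The class and timeout value here are illustrative assumptions; a real registry would receive heartbeats over the network.

```python
import time

# Hypothetical registry heartbeat tracking: each device reports periodically;
# devices whose last heartbeat is older than the timeout are treated as
# failed and evicted.
class Registry:
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # device id -> timestamp of last heartbeat

    def heartbeat(self, device_id, now=None):
        self.last_seen[device_id] = time.time() if now is None else now

    def evict_stale(self, now=None):
        now = time.time() if now is None else now
        stale = [d for d, t in self.last_seen.items() if now - t > self.timeout]
        for d in stale:
            del self.last_seen[d]   # device considered failed: remove it
        return stale

reg = Registry(timeout=3.0)
reg.heartbeat("dev-a", now=0.0)
reg.heartbeat("dev-b", now=2.0)
print(reg.evict_stale(now=4.0))  # ['dev-a'] — no heartbeat for more than 3 s
```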
Cache-based edge scene collaborative reasoning
Because of the layered structure of neural networks, different layers have different computational costs depending on layer type and position: convolutional and pooling layers have much lower latency than fully connected layers, especially on GPUs; convolutional and pooling layers usually sit at the front of the network and fully connected layers at the back; and the data size generally shrinks layer by layer, with later outputs markedly smaller than the original input, so transmitting or caching a later intermediate result takes far less time than transmitting or caching the raw data. The network can therefore be divided into parts that run on different devices in a distributed fashion. Distributed neural network inference first requires a sensible partition of the network. Two kinds of factors determine the optimal partition point: static factors, such as the structure of the model, and dynamic factors, such as network transmission speed, the load of the edge server, and the device's remaining battery. For the model structure, the per-layer running time of the inference application can be measured on the edge device and the edge server before deployment; this avoids having to model framework or hardware performance and is more accurate. Dynamic factors must be collected in real time; network transmission speed, for example, can be measured with the iperf tool. As shown in fig. 2, the network model is loaded, the type and parameters of each layer in the DNN model are analyzed, and a prediction model estimates each layer's latency on each mobile device. The partition point is then chosen from the predicted latencies combined with the edge device's current battery level and load; when the device's battery is low, the partition point is moved earlier to save energy.
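The partition-point search described above can be sketched as a simple exhaustive scan: given measured or predicted per-layer latencies on both devices, each layer's output size, and the link bandwidth, pick the cut minimizing total latency. All numbers below are made up for illustration; the function name and signature are assumptions, not the patent's API.

```python
# Hypothetical split-point search. dev_ms / srv_ms: per-layer latency (ms) on
# the initiating device and on the cooperative device; out_kb[i]: output size
# of layer i; input_kb: raw input size; bandwidth_kbps: measured link speed.
def best_split(dev_ms, srv_ms, out_kb, input_kb, bandwidth_kbps):
    n = len(dev_ms)
    best_k, best_cost = None, float("inf")
    for k in range(n + 1):  # run layers [0, k) locally, the rest remotely
        # k == 0 would ship the raw input; k == n ships nothing.
        sent_kb = input_kb if k == 0 else (0.0 if k == n else out_kb[k - 1])
        cost = (sum(dev_ms[:k]) + sum(srv_ms[k:])
                + sent_kb / bandwidth_kbps * 1000.0)  # kb / (kb/s) -> ms
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

split, cost = best_split(dev_ms=[5, 5, 20, 40], srv_ms=[2, 2, 5, 10],
                         out_kb=[600, 100, 20, 1], input_kb=800,
                         bandwidth_kbps=1000)
print(split, cost)  # 3 60.0 — cut after the third layer
```

The battery-aware adjustment mentioned above could be expressed by weighting `dev_ms` upward when the battery is low, which naturally pushes the chosen cut earlier.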
After the neural network is partitioned, structured pruning can be applied to the layer whose intermediate result will be transmitted. Specifically: take the trained neural network model; rank the convolution kernels of that layer by the sum of the absolute values of their parameters; keep the kernels with the largest sums, in a given proportion, and prune (i.e., discard) the rest; finally, fine-tune the network.
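The L1-norm ranking step of this structured pruning can be sketched as follows. For simplicity each kernel is represented as a flat list of weights; in a real framework this would operate on a 4-D weight tensor, and fine-tuning would follow.

```python
# Hypothetical sketch of L1-norm structured pruning: score each convolution
# kernel by the sum of absolute weights, keep the top fraction, drop the rest.
def prune_filters(filters, keep_ratio=0.5):
    """filters: list of flat weight lists, one per convolution kernel."""
    l1 = [sum(abs(w) for w in f) for f in filters]       # one score per kernel
    n_keep = max(1, int(len(filters) * keep_ratio))
    # Indices of the kernels with the largest L1 norms, in original order.
    keep = sorted(sorted(range(len(filters)), key=lambda i: -l1[i])[:n_keep])
    return [filters[i] for i in keep], keep

filters = [[0.1, -0.1], [2.0, 1.0], [0.5, 0.5], [-3.0, 0.2]]
kept_filters, kept_idx = prune_filters(filters, keep_ratio=0.5)
print(kept_idx)  # [1, 3] — the two kernels with the largest L1 norms
```

Because pruning removes whole kernels, the layer's output (the transmitted intermediate result) shrinks proportionally, which is exactly the transfer saving the text describes.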
Once the network is partitioned and the layer producing the intermediate result is pruned, distributed deep learning computation can proceed. The steps are shown in fig. 3. First, the task-initiating device computes the first half of the network. The features extracted by this first half are used, on one hand, to query the cache and, on the other hand, transmitted to the other device for the remaining computation. The cache stores the features of the input data rather than the whole input; because features extracted by a middle layer of the network are far smaller than the original input, cache lookup and comparison are efficient. The intermediate result is compared against the cache using cosine distance as the criterion; when the similarity exceeds a set threshold, the value corresponding to the matched key is returned directly as the result. When no sufficiently similar key is found in the cache, the features are transmitted to the other device, which first looks them up in its own edge cache; on a hit, the corresponding result is returned to the initiating device, otherwise the other device performs the remaining computation and returns the result. On a multi-core processor, two threads can run the cache-query task and the feature-transmission task in parallel. Both devices may then update their caches. Cache replacement uses the least-recently-used (LRU) policy: each time, the entry that has gone longest without being queried is evicted and replaced with the newly computed content.
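The cosine-similarity cache with LRU replacement described above can be sketched as follows. The class name, threshold, and capacity are illustrative assumptions; a production cache would index features for sub-linear lookup rather than scanning.

```python
import math
from collections import OrderedDict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical feature cache: keys are feature vectors from the first half of
# the network, values are final result labels; a cosine-similarity threshold
# decides hits and LRU bounds the size.
class SemanticLRUCache:
    def __init__(self, capacity=100, threshold=0.9):
        self.capacity, self.threshold = capacity, threshold
        self.entries = OrderedDict()  # feature tuple -> label

    def lookup(self, features):
        for key, label in self.entries.items():
            if cosine(features, key) >= self.threshold:
                self.entries.move_to_end(key)  # mark as recently used
                return label
        return None  # cache miss

    def insert(self, features, label):
        self.entries[tuple(features)] = label
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = SemanticLRUCache(threshold=0.95)
cache.insert([1.0, 0.0, 1.0], "cat")
print(cache.lookup([0.99, 0.01, 1.0]))  # cat — nearly identical features hit
print(cache.lookup([0.0, 1.0, 0.0]))    # None — dissimilar features miss
```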
For the cache layout, each device can set aside part of its own memory as a cache, or one device can be designated as a dedicated cache server from which all devices fetch cached entries.
The task-initiating device remotely invokes the other device through a dynamic proxy. Concretely, the task-initiating device calls a proxy class, which serializes the function name and the parameter list (the function name can be the name of the intelligent task; the parameter list contains the intermediate result to be transmitted, the layer index at which the network is divided, and similar information) and sends them to the other device over a socket; the other device completes the function call through a reflection mechanism. In this way, intelligent tasks are decoupled from remote invocation, and a remote-invocation module need not be implemented anew for each task.
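The serialize-then-reflect pattern can be sketched as below. The service class, method name, and payload shape are illustrative assumptions; here the two sides share memory for brevity, whereas a real system would carry `payload` over the socket the text mentions.

```python
import pickle

# Hypothetical service on the cooperative device.
class InferenceService:
    def finish_inference(self, features, split_layer):
        # Placeholder for "run the remaining layers after split_layer".
        return {"label": "cat", "layers_run": len(features) - split_layer}

# Proxy side: serialize function name + arguments (what the socket would carry).
def proxy_call(func_name, *args):
    return pickle.dumps((func_name, args))

# Server side: deserialize and dispatch via reflection (getattr), so one
# dispatcher serves any task name without task-specific wiring.
def dispatch(service, payload):
    func_name, args = pickle.loads(payload)
    return getattr(service, func_name)(*args)

payload = proxy_call("finish_inference", [0.1, 0.2, 0.3, 0.4], 2)
result = dispatch(InferenceService(), payload)
print(result)  # {'label': 'cat', 'layers_run': 2}
```

The reflection-based dispatch is what lets one remote-invocation module serve every intelligent task: only the serialized function name changes, not the transport code.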
Claims (7)
1. A deep learning inference method combining cooperative computation across dispersed devices with caching, in which dispersed computing devices in an edge scenario cooperatively perform deep learning computation: the task-initiating device performs part of the computation and caches that partial result together with the final label; if the similarity between the intermediate result and cached content reaches a threshold, the cached label is used as the result; otherwise the intermediate result is transmitted to other dispersed devices, which complete the computation and return the final result to the task-initiating device; the method comprises a dispersed-device selection step and a cache-based distributed deep learning computation step, and is characterized in that:
the dispersed-device selection step comprises:
1) every device registers its device information (IP address, port number, network transmission speed, computing performance, memory, and load) with a registry;
2) the most suitable device, selected by weighting the performance of each device, joins the cooperative computation;
the cache-based distributed deep learning calculation step comprises the following steps:
1) the network is partitioned according to the neural network model and the performance of the computing devices, and structured pruning is applied to the layer producing the intermediate result;
2) the task-initiating device performs the first half of the computation and queries the cache for a value similar to the intermediate result; if one exists, the cached result is used;
3) when no sufficiently similar value exists in the cache, the intermediate result and the function name are sent to the other device via a dynamically generated proxy class; the other device is remotely invoked to perform the remaining computation, and the obtained result is returned to the task-initiating device.
2. The method according to claim 1, characterized in that there is a registry with which all devices register their own device information, and through which every device can obtain the information of the other devices.
3. The method of claim 1, wherein after partitioning the network layer, performing a structured pruning operation on the network layer where the intermediate result is obtained.
4. The method of claim 1, wherein the edge devices are geographically dispersed and heterogeneous.
5. The method of claim 1, wherein the cached content is a key-value pair mapping the output of a middle layer of the neural network to the final result label.
6. The method of claim 1, wherein the computation of the cache key is shared with the neural network computation performed by the task-initiating device.
7. The method of claim 1, wherein distributed remote invocation is implemented using a dynamic proxy pattern, and the dynamically generated proxy class is responsible for all remote-invocation details.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010289197.0A CN111522657B (en) | 2020-04-14 | 2020-04-14 | Distributed equipment collaborative deep learning reasoning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111522657A true CN111522657A (en) | 2020-08-11 |
CN111522657B CN111522657B (en) | 2022-07-22 |
Family
ID=71902665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010289197.0A Expired - Fee Related CN111522657B (en) | 2020-04-14 | 2020-04-14 | Distributed equipment collaborative deep learning reasoning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111522657B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112615736A (en) * | 2020-12-10 | 2021-04-06 | 南京工业大学 | Delay optimal distributed NNs collaborative optimization method facing linear edge network |
CN112818788A (en) * | 2021-01-25 | 2021-05-18 | 电子科技大学 | Distributed convolutional neural network hierarchical matching method based on unmanned aerial vehicle cluster |
CN112862083A (en) * | 2021-04-06 | 2021-05-28 | 南京大学 | Deep neural network inference method and device under edge environment |
CN114401063A (en) * | 2022-01-10 | 2022-04-26 | 中国人民解放军国防科技大学 | Edge equipment cooperative spectrum intelligent monitoring method and system based on lightweight model |
WO2022138232A1 (en) * | 2020-12-23 | 2022-06-30 | ソニーグループ株式会社 | Communication device, communication method, and communication system |
CN114881227A (en) * | 2022-05-13 | 2022-08-09 | 北京百度网讯科技有限公司 | Model compression method, image processing method, device and electronic equipment |
WO2023197687A1 (en) * | 2022-04-13 | 2023-10-19 | 西安广和通无线通信有限公司 | Collaborative data processing method, system and apparatus, device, and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108268638A (en) * | 2018-01-18 | 2018-07-10 | 浙江工业大学 | A kind of generation confrontation network distribution type implementation method based on Spark frames |
CN108846142A (en) * | 2018-07-12 | 2018-11-20 | 南方电网调峰调频发电有限公司 | A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing |
US20180336467A1 (en) * | 2017-07-31 | 2018-11-22 | Seematics Systems Ltd | System and method for enriching datasets while learning |
CN109997154A (en) * | 2017-10-30 | 2019-07-09 | 上海寒武纪信息科技有限公司 | Information processing method and terminal device |
CN110309914A (en) * | 2019-07-03 | 2019-10-08 | 中山大学 | Deep learning model reasoning accelerated method based on Edge Server Yu mobile terminal equipment collaboration |
CN110795235A (en) * | 2019-09-25 | 2020-02-14 | 北京邮电大学 | Method and system for deep learning and cooperation of mobile web |
WO2020042658A1 (en) * | 2018-08-31 | 2020-03-05 | 华为技术有限公司 | Data processing method, device, apparatus, and system |
US20200082259A1 (en) * | 2018-09-10 | 2020-03-12 | International Business Machines Corporation | System for Measuring Information Leakage of Deep Learning Models |
Non-Patent Citations (1)
Title |
---|
WU Linyang et al.: "A deep learning compilation framework with collaborative optimization of computation and data", High Technology Letters *
Also Published As
Publication number | Publication date |
---|---|
CN111522657B (en) | 2022-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111522657B (en) | Distributed equipment collaborative deep learning reasoning method | |
CN110941667B (en) | Method and system for calculating and unloading in mobile edge calculation network | |
CN110515732B (en) | Task allocation method based on deep learning inference of resource-constrained robot | |
US20220351019A1 (en) | Adaptive Search Method and Apparatus for Neural Network | |
CN108418718B (en) | Data processing delay optimization method and system based on edge calculation | |
CN110069341B (en) | Method for scheduling tasks with dependency relationship configured according to needs by combining functions in edge computing | |
CN110753107B (en) | Resource scheduling system, method and storage medium under space-based cloud computing architecture | |
CN111553213B (en) | Real-time distributed identity-aware pedestrian attribute identification method in mobile edge cloud | |
CN112540845B (en) | Collaboration system and method based on mobile edge calculation | |
CN111049903A (en) | Edge network load distribution algorithm based on application perception prediction | |
Miao et al. | Adaptive DNN partition in edge computing environments | |
Xu et al. | A meta reinforcement learning-based virtual machine placement algorithm in mobile edge computing | |
Cui | Research and application of edge computing based on deep learning | |
CN116760722A (en) | Storage auxiliary MEC task unloading system and resource scheduling method | |
Lu et al. | Dynamic offloading on a hybrid edge–cloud architecture for multiobject tracking | |
US20230305894A1 (en) | Controlling operation of edge computing nodes based on knowledge sharing among groups of the edge computing nodes | |
CN109617960A (en) | A kind of web AR data presentation method based on attributed separation | |
Sung et al. | Use of edge resources for DNN model maintenance in 5G IoT networks | |
Zhang et al. | Vulcan: Automatic Query Planning for Live {ML} Analytics | |
Huang et al. | Field of view aware proactive caching for mobile augmented reality applications | |
Nashaat et al. | DRL-Based Distributed Task Offloading Framework in Edge-Cloud Environment | |
Jia et al. | FedLPS: Heterogeneous Federated Learning for Multiple Tasks with Local Parameter Sharing | |
CN116128046B (en) | Storage method of multi-input neural network model serial block of embedded equipment | |
CN114860345B (en) | Calculation unloading method based on cache assistance in smart home scene | |
CN118409873B (en) | Model memory occupation optimization method, equipment, medium, product and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220722 |