CN114662690B - Mobile device collaborative inference system for deep learning Transformer type model - Google Patents

Info

Publication number
CN114662690B
CN114662690B
Authority
CN
China
Prior art keywords
model
inference
devices
deep learning
slice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210547606.1A
Other languages
Chinese (zh)
Other versions
CN114662690A (en)
Inventor
许封元
吴昊
柯晓鹏
赵鑫
姚荣春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210547606.1A
Publication of CN114662690A
Application granted
Publication of CN114662690B
Legal status: Active (current)

Classifications

    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06F11/3409 Recording or statistical evaluation of computer activity for performance assessment
    • G06F11/3414 Workload generation, e.g. scripts, playback
    • G06F11/3433 Recording or statistical evaluation of computer activity for performance assessment for load management
    • G06F9/5044 Allocation of resources to service a request, the resource being a machine, considering hardware capabilities
    • G06F9/505 Allocation of resources to service a request, the resource being a machine, considering the load
    • G06N3/042 Knowledge-based neural networks; logical representations of neural networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N5/041 Abduction
    • H04L41/042 Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
    • H04L41/044 Network management architectures or arrangements comprising hierarchical management structures
    • H04L41/0889 Techniques to speed up the configuration process
    • H04L41/16 Network maintenance, administration or management using machine learning or artificial intelligence
    • H04W84/18 Self-organising networks, e.g. ad-hoc networks or sensor networks
    • G06F2209/5017 Task decomposition
    • G06F2209/508 Monitor

Abstract

The invention relates to a mobile device collaborative inference system for deep learning Transformer-class models, comprising two phases. Preparation phase: available hardware resources are evaluated, the segmentation granularity is determined, and the model is sliced and the slices are distributed to the devices. Deployment phase: the devices are coordinated through SDN networking to carry out collaborative inference across heterogeneous devices, and each participating device deploys an inference service program. The control node sends the samples to be inferred to the devices; each node passes its intermediate result to the node responsible for the next model slice, and so on until the node holding the last slice finishes inference and sends the final result to the control node for output. This avoids the situation in which the whole Transformer model cannot run on a single device because it occupies too much space.

Description

Mobile device collaborative inference system for deep learning Transformer type model
Technical Field
The invention relates to a mobile device collaborative inference system for deep learning Transformer-class models, and belongs to the technical field of computer deep learning applications.
Background
With the continuous development of deep learning technology, inference services based on deep learning models are widely used in daily life, for example automatic subtitle generation on video websites, web page translation, automatic question answering systems, and speech-text conversion, and bring great convenience. With the popularization of mobile smart devices and the rapid growth of their sensing capability (more and more sensors), mobile devices have become one of the most important deployment scenarios for deep learning services.
In this scenario, deploying the deep learning service on the mobile terminal can exploit the computing resources of the mobile devices themselves and avoid using the cloud, saving computing cost. Deploying the service directly on the mobile device also keeps user data from being transmitted to the cloud, reducing the risk of privacy leakage. Finally, deploying deep learning services directly at the edge allows the whole system to operate in areas where no network is available. For example, in a smart-home scenario, devices use data collected by their own sensors to serve family members, such as fall detection for the elderly or intruder detection; during field exploration, devices analyze sensor data directly on site.
Among the many deep learning models used to provide such services, Transformer networks, by virtue of their strong generality and learning capability, achieve excellent results in most fields and surpass traditional models in many areas such as computer vision and natural language processing; these high-precision Transformer models are very attractive to mobile device users. Running Transformer inference on mobile devices therefore has high demand and application value.
But deploying and running large Transformer-class models on resource-limited mobile terminals is challenging. Taking the pre-trained model T5 from natural language processing as an example, T5-11B has up to 11 billion parameters and occupies at least tens of gigabytes of memory and storage; even the BERT model has on the order of 100 million parameters; for computer vision, the ViT-Base model has tens of millions of parameters and occupies more than 300 MB when stored in ONNX format. Against such large Transformer models, the hardware resources of mobile devices are very weak: a widely used 32-bit Raspberry Pi has less than 4 GB of available memory, an older one may have as little as 1 GB, and for cost reasons the SD cards on these devices often provide only a few GB to a few dozen GB of storage. Considering that the operating system and other services running on the mobile device also occupy some memory and storage, these mobile devices simply cannot run the inference service of a larger Transformer model by themselves: the entire model can neither be loaded into memory nor held in storage for segmented execution.
To address the limited resources of mobile terminals, researchers usually work from two perspectives, "opening sources" (adding resources) and "throttling" (reducing consumption), but neither kind of solution handles the challenges of this new deep learning service scenario well. First, from the "throttling" perspective, existing research such as dynamic network structures, model compression, and network space search reduces computing overhead by dynamically executing only part of a large network or by retraining a smaller network. Dynamically structured networks can limit the loss of inference accuracy, but a mobile device with limited memory and storage still cannot load the entire model. Network space search and model compression amount to redeploying a smaller network; the large reduction in model parameters causes a large drop in model accuracy, so high-quality service that satisfies users who need high-precision models cannot be provided. From the "opening sources" perspective, existing work mostly studies collaborative inference between a mobile device and the cloud, which also does not suit the new service scenarios: first, many deployment scenarios lie in regions without external network access and without reachable cloud computing resources, such as drone swarms or field exploration data collection; second, many scenarios require local computation for user privacy, such as smart homes where sensors detect intruders or other abnormal conditions. Therefore, existing work cannot adequately solve the hardware resource shortage when deploying a high-precision Transformer model on mobile terminals.
Disclosure of Invention
The invention aims to: in view of the problems and shortcomings above, the invention provides a mobile device collaborative inference system for deep learning Transformer-class models, which solves the problem of how to partition a Transformer model, dispatch the pieces to different mobile devices for storage and execution, and invoke heterogeneous mobile devices to cooperate to complete the entire inference service. This avoids the situation in which the whole Transformer model cannot be placed on a single device because it occupies too much space: with such disaggregated deployment, a model that a single device cannot run can run under collaborative inference. On this basis, the service throughput that the whole collaborative inference system can provide is maximized.
The technical scheme is as follows: in order to realize the purpose of the invention, the invention adopts the following technical scheme:
a deep learning Transformer class model-oriented mobile device collaborative inference system comprises the following steps:
step 1: a preparation stage: performing a device performance evaluation test on the devices, and allocating the model slices to be deployed to different mobile terminal devices;
step 1.1: evaluating available hardware resources of the mobile terminal devices: acquiring the hardware information of each device to be used;
step 1.2: determining the segmentation granularity: analyzing the Transformer model structurally and, combined with the hardware information of each device obtained in step 1.1, deriving a suitable model segmentation granularity;
step 1.3: slicing: segmenting the Transformer model according to the model segmentation granularity obtained in step 1.2 to obtain a plurality of model slices;
step 1.4: slice distribution: distributing the model slices of step 1.3 to the devices to be used through a distribution algorithm;
step 2: a deployment phase: controlling the devices through SDN networking to carry out collaborative inference on the heterogeneous devices;
step 2.1: SDN networking: determining, among all the devices, the devices that will take part in deep learning inference; networking them with an SDN (software-defined network) to ensure they are connected to the same network; using the device with the strongest network processing capability as the control node to deploy the SDN control program; all nodes participate in collaborative inference, and each inference device deploys an inference service program;
step 2.2: when deep learning inference is required, a request-initiating program requesting the inference service sends a service request to the control node among the nodes of step 2.1;
step 2.3: the control node communicates with the other nodes to ensure that all the nodes of step 2.1 can normally perform the collaborative inference service;
step 2.4: the request-initiating program sends the sample to be inferred to the control node, and the control node sends the sample to be inferred to the inference devices, namely the other nodes;
step 2.5: the current node transmits its intermediate result to the node responsible for the next model slice according to the given order, and so on, until the node holding the last model slice completes inference and sends the final result to the control node;
step 2.6: the control node outputs the final result obtained in step 2.5.
Further, the hardware information in step 1.1 includes memory, storage space, and CPU clock frequency.
Further, the distribution algorithm in step 1.4 includes the following steps: each model slice is tested on each device to measure the time overhead actually required for inference; for each slice, the device with the smallest current load is found and the slice is assigned to it, where the least-loaded device is the one requiring the least time to run the model slices currently assigned to it.
Further, the distribution algorithm specifically comprises the following steps:
step 1.4.1: initialization: the model to be split (a Transformer model) is taken as input. An array T records the estimated computation time overhead of the tasks pre-allocated to each device, where T[i] is the estimated computation time overhead of the tasks allocated to device i; all entries are initialized to 0 before starting. The model slices allocated to each device are recorded in an array Workload, where Workload[i] is the set of all model slices assigned to device i; the set of all model slices is denoted model_slices;
step 1.4.2: all model slices are distributed; for each model slice in model_slices: find the device i currently carrying the least task burden, i.e. the device whose T[i] is the minimum value in T, add the model slice to the task set assigned to device i, and obtain the updated Workload[i];
step 1.4.3: after Workload[i] is updated, the estimated time overhead of the tasks assigned to device i is updated by looking up the data measured during the test evaluation stage; Estimate_time queries the test records for the measured time overhead of device i running the tasks in Workload[i]; if the device cannot undertake the assigned tasks, T[i] is set to an infinite value;
step 1.4.4: after the allocation is completed, find the device with the smallest T[i] and move a model slice from the device with the largest T[i] onto it so as to reduce the maximum value in the array T; this process is repeated until the maximum value in T can no longer be reduced;
step 1.4.5: detect whether there is a device i that cannot support inference of its assigned model slices, i.e. check whether any T[i] is infinite; if so, the model is too large, the whole system cannot run it even cooperatively, and the program ends.
Further, in step 2, differences between the underlying heterogeneous systems are masked by using an "ONNX + runtime environment" approach, and the underlying differences and network conditions of the heterogeneous devices are abstracted by using software-defined networking technology.
Further, the software-defined networking uses ONOS for the SDN control layer and OVS (Open vSwitch) for the SDN forwarding layer, deploys both on the same control node, and connects the other nodes to the control node through VXLAN tunnels.
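For illustration only, the wiring described above could be scripted roughly as follows from the control node; the bridge name, controller address, tunnel names, and peer IPs are placeholders rather than values given in the patent, and the real system drives ONOS and OVS rather than this helper.

import subprocess

def sh(cmd: str) -> None:
    # Run one shell command, raising if it fails.
    subprocess.run(cmd, shell=True, check=True)

def build_sdn(bridge: str, controller_ip: str, peers: dict) -> None:
    # Create an OVS bridge, point it at the ONOS controller (SDN control
    # layer), and open one VXLAN tunnel per peer device (forwarding layer).
    sh(f"ovs-vsctl --may-exist add-br {bridge}")
    sh(f"ovs-vsctl set-controller {bridge} tcp:{controller_ip}:6653")
    for port, remote_ip in peers.items():
        sh(f"ovs-vsctl --may-exist add-port {bridge} {port} -- "
           f"set interface {port} type=vxlan options:remote_ip={remote_ip}")

# Hypothetical example: the control node opens tunnels to two worker boards.
# build_sdn("br-infer", "192.168.1.10", {"vx1": "192.168.1.11", "vx2": "192.168.1.12"})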
Beneficial effects: compared with the prior art, the invention has the following advantages. In realizing the multi-mobile-device collaborative inference system, the invention solves how multiple mobile devices form an ad hoc crowd-computing system and then complete, through cooperative division of labor, the mobile-scenario inference goal for a large Transformer model, optimizing the computation time without requiring internet access or cloud computing. This avoids the situation in which the whole Transformer model cannot be placed on a single device due to its large footprint, and makes it possible to run a Transformer model that a single mobile device cannot run because of its limited computing resources.
Drawings
FIG. 1 is a schematic flow diagram of the preparation phase of the present invention;
FIG. 2 is a schematic flow diagram of the deployment phase of the present invention;
FIG. 3 is a schematic flow chart of the Transformer model slicing process according to the present invention;
FIG. 4 is a flow diagram of collaborative inference of the present invention;
FIG. 5 is a block diagram of the heterogeneous device collaborative inference module architecture of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.
In order to run a higher-precision but larger Transformer network on resource-limited mobile devices independently of any external network, the method proposes forming a crowd-computing system from mobile smart devices interconnected in an ad hoc network, then optimally segmenting the Transformer model network and deploying the pieces on multiple mobile devices, and realizing Transformer inference service in mobile scenarios by running the devices jointly in a pipelined fashion. This makes the method applicable to niche scenarios such as field operation without internet access, post-disaster search and rescue, and privacy-sensitive applications.
During operation, the proposed system first assigns Transformer model fragments of different sizes to the different mobile devices in the system according to their available resources, including bandwidth, memory size, and storage size. When partitioning, it first ensures that each device has enough storage space to store its assigned weight fragments of the Transformer model, and then takes the network transmission capability, memory, and computing capability of the different devices into account so that the service throughput of the whole cooperating mobile device system is maximized under collaborative inference.
And (3) segmentation and distribution of the model:
In a system formed by multiple heterogeneous mobile devices, the hardware resources of the devices differ greatly: the system may contain devices with strong computing power and large storage, such as personal laptops and high-end smartphones, as well as weaker custom chips and development boards, such as Raspberry Pi boards and programmable network cameras. As mentioned above, these mobile devices cannot individually host the whole high-precision Transformer model, so the large Transformer model must be sliced and the work assigned to different devices for collaborative inference. Because the hardware resources of these devices differ, the model cannot simply be cut into several equal pieces and assigned at random. In order to optimize the inference throughput of the whole collaborative inference system and fully utilize the computing resources of the different mobile terminal devices, the collaborative inference system realized by this method designs and implements a Transformer model segmentation and distribution algorithm. The following details why existing work does not solve model segmentation well for mobile-device collaborative inference, and the challenges encountered in model segmentation in the service scenario presented above.
As described above, existing work does not deploy a high-precision Transformer model on mobile terminals with limited hardware resources. Although considerable work studies how to extend the computing capability of the mobile terminal through mobile-cloud cooperation in order to deploy high-precision Transformer models, the model segmentation methods in those works cannot be applied in the multi-mobile-device collaborative inference scenario of this method. Mobile-cloud collaborative inference involves only one mobile device and one abstracted pool of computing resources, so the model only needs to be divided into two parts, one placed on the mobile terminal and one in the cloud. The computing resources of the cloud far exceed those of a single mobile device and can host the whole model, so the bottleneck of the whole system lies only in the single mobile device; the cloud can keep a complete backup of the model, the worst case is simply that the cloud completes the whole inference by itself, and coordinating the speed of the two ends is easy. In the scenario of this work, by contrast, the hardware resources of every mobile device are insufficient to independently run the high-precision Transformer model, and because of hardware differences each device may become the performance bottleneck of the whole system for a different reason, so coordinating many mobile devices is far more complex and difficult. The model must be divided into many parts and distributed to different devices, not simply into two. Different partitioning schemes impose different burdens on different devices, and an unreasonable partition that leaves some device struggling with its assigned inference task, or makes it the weakest link among all devices, will delay the collaborative inference of the whole system, greatly increasing the time required for inference or even preventing inference from completing.
Specifically, when designing the segmentation and distribution algorithm of the Transformer model, we face the following challenges:
Storage space differences of mobile devices: first, different mobile devices have different storage space, and the storage space of the devices directly determines the upper limit of the model that can be carried: the sum of the available storage space of all mobile devices in the system must be able to hold all the weight files produced by cutting the entire Transformer model. If storage is insufficient, a device with higher computing performance could in principle keep fetching model slices from a device with abundant storage and loading them to continue running, but the extra cost introduced by doing so can far exceed the benefit and is not worthwhile. The storage space differences of mobile devices therefore add considerable challenge and complexity to the design of model segmentation and distribution.
Available memory differences of mobile devices: the available memory of different mobile devices varies widely and is another factor influencing the division and slicing of the Transformer model. For a mobile device with little memory, the model slice that can be loaded and run at any one time is smaller; even with a large storage space, it must repeatedly release memory and load the next model segment in order to run more tasks. Meanwhile, the performance and size of storage also affect device performance when memory is overcommitted: faster storage provides more efficient memory-storage swapping (such as the Linux swap mechanism), and larger storage allows a larger swap space to be configured.
Computing performance differences of mobile devices: differences in computing performance also affect how the Transformer model should be divided. A device with stronger computing performance can certainly complete inference faster, but in order to better utilize the computing resources of the whole system and improve the throughput of the whole collaborative inference system, different mobile devices should be given inference tasks with similar time cost. If one mobile device becomes the weakest link of the overall collaborative inference system, the other mobile devices are forced into idle waiting and a significant amount of computing resources is wasted.
As described above, no related work addresses the complex situation of coordinating multiple resource-constrained terminals for collaborative inference on mobile devices; this work tackles this problem and the various challenges caused by the differences in storage, memory, and computing performance of different mobile devices.
Collaborative inference of heterogeneous devices:
After the model has been sliced and the different model slices stored on different mobile devices, the whole heterogeneous mobile device system must run cooperatively to complete inference over the model input and reach a certain inference throughput. Specifically, the system realized by this method can be deployed on multiple heterogeneous mobile terminal devices, and after receiving a service request the whole system coordinates each mobile terminal device to finish the inference of the part it is responsible for. When actually deploying and coordinating different heterogeneous mobile terminal devices, the following two challenges are encountered:
the heterogeneity of devices presents deployment difficulties:
Running a deep learning service requires a framework and a runtime that support it; a model in a given format can only be inferred on a device by parsing that format with the corresponding framework and calling the corresponding deep learning backend, i.e. the runtime.
However, the best open-source models on the internet, especially newly released ones, come in many formats, such as the popular PyTorch and TensorFlow formats. Installing every corresponding runtime framework for these formats on the mobile terminal is impractical, and resource-limited mobile devices cannot bear it; these formats therefore need to be converted into a common format, and all mobile devices use a uniform runtime framework based on that common format to ease deployment. Besides the variety of model formats, the heterogeneity of mobile terminal devices also challenges deep learning runtime deployment: although popular deep learning frameworks such as PyTorch and TensorFlow publish versions adapted to mobile terminal devices, such as PyTorch Mobile and TensorFlow Lite, those frameworks mainly target smartphones running iOS or Android, while other Internet-of-Things devices (such as various custom chips and development boards) run into problems when installing them owing to the heterogeneity of hardware architectures and system software: missing pre-built versions, dependency adaptation, and the need to recompile when no adapted installation version exists. In short, the heterogeneity of mobile devices makes installing a deep learning runtime framework difficult.
Challenges in networking and control of mobile devices:
In addition to the deployment difficulties caused by device heterogeneity, developers face networking and control challenges for mobile devices: each mobile device takes an assigned inference task of appropriate size and the corresponding model slices, and after a runtime environment capable of supporting inference is installed, the devices must run in coordination to complete inference for each whole sample. Since the input of each model slice is the inference result of the previous slice, and the output of each slice is the input of the next, the whole system must implement data transmission between the mobile devices during collaborative inference. Besides data transmission, the whole system needs to know the condition of each mobile terminal device during inference in order to cope with emergencies: some devices are battery-powered and may drop out when power is low, and some devices may terminate model inference unexpectedly for other reasons; in these situations the whole system must alert the user or redistribute the computing tasks. These networking and control requirements introduce significant implementation complexity into the overall system.
Briefly, the method has two core components:
1. The method provides a model segmentation and distribution algorithm that maximizes the throughput of the whole collaborative inference system by dividing the Transformer model according to the different performance of the mobile terminal devices. By dividing the Transformer model into many pieces, the algorithm assigns the mobile device nodes in the system task loads with similar inference time overhead, so that the whole system, working as a pipeline, can fully utilize the performance of every mobile device. This Transformer model slicing and distribution technique, based on balancing computation time across devices, corresponds to the preparation stage of the system. It comprises three key steps: the device performance evaluation test, model segmentation, and model distribution. Using this technique, the system allocates appropriately sized model slices to different mobile devices, fully utilizes the computing performance of each device during collaborative inference, and thereby maximizes the inference throughput of the system.
2. The method also designs a cross-platform mobile device collaborative inference architecture. The underlying differences and network conditions of heterogeneous devices are abstracted by means of an "ONNX + runtime environment" (available runtime environments include application-layer virtual machines such as V8 or WebAssembly) and software-defined networking (SDN) technology, enabling cross-platform model inference and simple, effective cooperative control of multiple mobile terminal devices. Thanks to the openness of the ONNX format, the method can support running Transformer models in different original formats. This mobile device collaborative inference architecture is the heterogeneous device collaborative inference technique and corresponds to the deployment stage of the system. It covers deployment of the "ONNX + runtime environment" inference environment on heterogeneous devices and the design and implementation of the SDN ad hoc network between the mobile devices.
The following is a specific system design summary:
Fig. 1 and Fig. 2 show the workflow of the system, which is divided into two phases: a preparation phase and a deployment phase. The general workflow is summarized as follows. Before deployment, the collaborative inference system first slices the model to be deployed and allocates the slices to different mobile terminal devices in the preparation phase; because the storage space of a mobile terminal device cannot hold the whole high-precision Transformer model, downloading and splitting the model must be done on a sufficiently powerful device, which is why this is called the preparation phase. After the preparation phase, the method realizes an inference runtime environment that masks the heterogeneity of the mobile devices' underlying platforms via the "ONNX + runtime environment" approach, so that each mobile device is able to run the inference task of the model part it is responsible for; the mobile terminal devices are then connected by SDN networking in the deployment phase so that the collaborative inference service can actually run, achieving the goal of running a high-precision Transformer model that no single mobile device could run.
The following describes the details of the system design involved in these two key technologies:
the method comprises the following steps of calculating time-averaged Transformer model slices and distribution technology based on multiple devices:
This technique corresponds to the preparation phase of the system. FIG. 3 shows the three steps of the Transformer model slicing technique: device performance evaluation and testing, model segmentation, and model distribution. These three steps correspond to the preparation phase of Fig. 1.
The purpose of the device performance evaluation stage is to provide a sound basis for subsequent model segmentation. To fully utilize the computing performance of the mobile terminal devices and maximize the inference service throughput of the whole system, the method needs to allocate inference tasks (model slices) with similar computation time cost to different mobile devices, so the devices must be tested to learn their hardware capabilities, for example how large a model slice a device can store and how much time it takes to infer model slices of different sizes. With this performance evaluation information, the subsequent steps can slice the model with the Transformer model segmentation and distribution algorithm designed by this method and send the slices to different mobile devices for storage.
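As an illustration of this evaluation pass, the sketch below gathers the hardware facts and times one candidate slice with ONNX Runtime in Python; it is only a stand-in for the actual test program (the patent's deployment uses ONNX.js), and the file name and input shape are assumptions.

import time
import numpy as np
import onnxruntime as ort
import psutil

def probe_hardware() -> dict:
    # Hardware facts later used by the allocation step: available memory,
    # free storage, CPU frequency and core count.
    freq = psutil.cpu_freq()
    return {
        "mem_bytes": psutil.virtual_memory().available,
        "disk_bytes": psutil.disk_usage("/").free,
        "cpu_mhz": freq.current if freq else None,
        "cpu_cores": psutil.cpu_count(logical=False),
    }

def time_slice(onnx_path: str, input_shape: tuple, runs: int = 20) -> float:
    # Average wall-clock time of one forward pass through a model slice.
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    x = np.random.randn(*input_shape).astype(np.float32)
    sess.run(None, {name: x})  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs

# e.g. time_slice("vit_encoder_block_0.onnx", (1, 197, 768)) run on each device
# fills the measurement table that Estimate_time later consults.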
The model segmentation step is responsible for segmenting the Transformer model at a relatively fine granularity, and segmenting the Transformer model into a plurality of slices for distribution to each mobile device.
The method designs the segmentation granularity for the Transformer. When segmenting the Transformer model, the model should be split neither too finely nor at too coarse a granularity. If the granularity is too coarse, the memory or storage space of some devices may not hold even one model slice. If the granularity is too fine, then to avoid large network transmission overhead, adjacent model slices must be assigned to the same mobile device as much as possible and can then be merged, which is equivalent to splitting the model at a coarser granularity in the first place.
The method uses the following granularity for the segmentation: first, the Transformer model can be viewed as two parts: the Transformer layer (encoder layer or decoder layer) portion, i.e. the Transformer body of the network, and the non-Transformer-layer portion.
Taking the Vision Transformer from computer vision as an example, its Transformer part consists of 12 Transformer encoder layers, and the rest comprises a preprocessing part consisting mainly of a convolution layer and a linear layer at the end of the model for classification. The non-Transformer-layer part is split layer by layer. For the Transformer-layer part, if the parameter count of a Transformer layer is small enough that a single layer fits in device memory, the layer is the unit of splitting; if a Transformer layer is too large to fit, splitting must continue inside each layer. In the Vision Transformer example, each encoder layer can be split into two parts: the self-attention mechanism and the subsequent feed-forward network. If the self-attention computation is still too large, it can be further split into computing the Q, K, V matrices and the multiplications between them.
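A minimal sketch of this granularity, using a freshly built stand-in encoder layer rather than the pretrained ViT weights: each encoder block (or, at finer granularity, its attention and feed-forward halves) becomes one ONNX file that can be shipped to a device. The dimensions match ViT-Base (768 hidden units, 12 heads, 197 tokens), but the file names and shapes are assumptions rather than details from the patent.

import torch
import torch.nn as nn

def make_encoder_layer() -> nn.Module:
    # Stand-in for one ViT-Base encoder layer; the real system would extract
    # the layers from the pretrained model instead of building fresh ones.
    return nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                      dim_feedforward=3072, batch_first=True)

def export_encoder_slices(num_layers: int = 12) -> list:
    # Export each encoder layer as its own ONNX slice file.
    paths = []
    dummy = torch.randn(1, 197, 768)  # 196 image patches + [CLS] token
    for i in range(num_layers):
        layer = make_encoder_layer().eval()
        path = f"vit_encoder_block_{i}.onnx"
        torch.onnx.export(layer, (dummy,), path,
                          input_names=["hidden"], output_names=["hidden_out"])
        paths.append(path)
    return paths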
Then, the model distribution step distributes the model slices to different mobile terminal devices through the algorithm proposed by the method.
The method designs the distribution of the Transformer model slices. In order to optimize the service throughput of the entire collaborative inference system, different mobile devices are assigned inference tasks (model slices) with computational time overhead as similar as possible. Since the serial execution of multiple mobile devices constitutes a pipeline of model inference, making the inference time overhead of each mobile device similar maximizes the throughput of the entire inference system.
Specifically, since the slices of the model must run in their order in the original model, a mobile terminal device must wait for the device responsible for the previous slice to finish its inference and transmit the intermediate result before it can start inference on the model slice it is responsible for. Thus, when the whole system first receives a service request, the mobile device responsible for the first slice of the model starts inference first, and at that moment only one device of the whole system is doing actual work. After the first mobile device finishes, it transmits its result to the next mobile device; if a new service request arrives at this point, the first device starts inferring the next sample, and two mobile terminal devices are now engaged in the model inference service. As the system continually receives new service requests, all mobile devices in the system perform inference of their respective samples on their assigned slices in a pipelined fashion. Although the mobile devices cannot infer the same sample at the same time, the pipelined form of work still fully utilizes the hardware resources of all mobile terminal devices and maximizes the sample inference throughput of the whole system. It is therefore important to assign inference tasks requiring similar computational overhead to the different mobile devices: if the time the devices need to infer their model slices differs too much, the faster devices simply wait for the slower ones. In that case, the average per-sample inference time of the whole system in pipeline operation is bounded below by the maximum of the times the mobile devices need to complete their model slice inference.
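In symbols (notation introduced here for clarity, not taken from the original text): if device i needs T[i] seconds to infer its assigned slices and C[i] seconds to transmit its intermediate result to the next device, then in the steady state of the pipeline

    throughput ≈ 1 / max_i T[i],   per-sample latency ≈ sum_i (T[i] + C[i]).

Balancing the T[i] across devices therefore raises throughput even though the latency of an individual sample is essentially unchanged.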
The pseudocode of the Transformer model segmentation and distribution algorithm based on balancing computation time across devices is given below, and the algorithm is explained with reference to it. The distribution algorithm is as follows:
Input: the Model to be split
Output: the model slicing result Workload used for device assignment, an array in which Workload[i] is the set of model shards that the device with the i-th strongest computing performance is responsible for
Algorithm transformer_slicing(Model):
    initialize the array T to all 0s, where T[i] is the estimated computation time cost of the tasks allocated to device i;
    model_slices ← slice(Model);
    foreach slice ∈ model_slices do
        find the index i of the device whose current tasks need the least computation time, i.e. T[i] is the minimum value in T;
        Workload[i] ← Workload[i] ∪ {slice};
        T[i] ← Estimate_time(i, Workload[i]);
    end foreach
    while True do
        find the index j of the device whose current tasks need the most computation time, i.e. T[j] is the maximum value in T;
        if there exists an index i such that moving some slice from device j to device i reduces the maximum value of T, Max(T), then
            move the slice and update Workload[i], Workload[j], T[i], T[j];
            continue;
        else
            break;
        end if
    end while
    if there is a T[i] showing that device i cannot complete its assigned inference task then
        return "cannot run the model";
    end if
    return the model slicing result Workload
The slice(Model) call at the start of the algorithm performs the appropriate-granularity segmentation of the Transformer, yielding a set of model slices to be distributed to the different mobile devices. The rest of the algorithm concerns how the segmented model slices are distributed: for every model slice, it finds the device with the smallest current load and assigns the slice to that device. The least-loaded device is the one that needs the least time to run the model slices currently assigned to it, and this running time is estimated from the information obtained by testing the model during the device performance evaluation phase. The Estimate_time function takes two parameters, the device index i and the model slices Workload[i] currently assigned to device i, and returns the estimated time mobile device i needs to run its assigned slices. For a device that cannot carry its assigned model slices (for example, because of insufficient storage space), the estimated time is set directly to infinity. The while loop then adjusts the allocation result: it finds the device with the highest time overhead in the whole system (the weakest link), and if transferring some of its model slices to other devices can reduce that weakest link (the time overhead of the slowest device), the transfer is made. If, in the end, some device in the system still cannot carry its model slices, the model is too large, and the whole system cannot provide deep learning service for this model even by coordinating multiple mobile devices.
Heterogeneous device collaborative inference technology:
the heterogeneous device collaborative inference technology provided by the method corresponds to the deployment stage of fig. 2. The deployment phase of the system proposed by the method relates to a collaborative inference process of the whole heterogeneous system.
In the implementation technology:
To address the challenge of configuring the deep learning inference runtime environment caused by device heterogeneity, and to adapt to Transformer models in various formats, the method deploys the deep learning service through ONNX on the JavaScript interface of node.js. ONNX is a widely supported open specification for artificial intelligence models, and mainstream deep learning model formats support conversion to the ONNX format for deployment via the ONNX runtime. node.js is a network service runtime widely supported by devices of different architectures. By means of the cross-platform nature of the application-layer virtual machine, the method masks the underlying complexity of heterogeneous devices.
In order to efficiently and robustly self-organize the network of the mobile equipment, the method utilizes the SDN networking technology to avoid the underlying complex network programming. Software Defined Networking (SDN) allows developers to program the behavior of a control network in a centralized control manner through flexible APIs. Compared with a mode of realizing a set of control codes on network equipment in a self-defined manner, the method has great simplicity by utilizing an existing SDN framework and is easier to realize. In addition, a control plane abstracted by the SDN is independent of hardware and is completely realized by software, so that the SDN has better expansibility and is easy to maintain.
In the operation process:
The different mobile terminal devices, each with a deep learning inference runtime environment, first form an SDN network, load the model slices they are responsible for into memory, and stand ready to cooperate to provide inference services. A user (possibly itself a mobile device in the system) first sends a request for the deep learning inference service to the control node of the system. After receiving the request, the control node tells the user to send the input sample to the mobile terminal device responsible for the first slice (this device is not necessarily the control node); in Fig. 2, for example, the user sends a picture of the object to be detected. During collaborative inference, the system drives the mobile devices to finish the inference of their responsible parts in the order of the model slices in the original model and to pass the intermediate data to the next mobile device, until the device responsible for the last slice finishes its inference and the final inference result of the model is returned to the user. Fig. 4 shows the workflow of the collaborative inference system when all mobile terminal devices operate normally: the user sends a request through the request-initiating program to the control node running the control program, and the control node checks the state of each node and invokes the model inference service programs on the nodes, which cooperate to complete inference of the whole sample and forward the final result back to the user.
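A per-node service loop in the spirit of Fig. 4 might look like the sketch below; the framing, ports, and pickle serialization are simplifications introduced here (the patent's implementation is a node.js service coordinated over the SDN), and run_slice stands for whatever executes the local ONNX slice.

import pickle
import socket
import struct

def _recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def send_msg(sock, obj):
    data = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_msg(sock):
    (size,) = struct.unpack("!I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, size))

def serve_slice(run_slice, listen_port, next_hop=None):
    # Receive an intermediate result, run the local model slice, and relay the
    # output to the node holding the next slice; the node with the last slice
    # (next_hop is None) answers the sender directly.
    srv = socket.socket()
    srv.bind(("0.0.0.0", listen_port))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        with conn:
            hidden = recv_msg(conn)
            result = run_slice(hidden)  # e.g. an ONNX Runtime session call
            if next_hop is None:
                send_msg(conn, result)
            else:
                with socket.create_connection(next_hop) as nxt:
                    send_msg(nxt, result)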
Fig. 5 is an architecture adopted for implementing the entire collaborative inference system, all mobile devices are required to undertake respective inference tasks, so that inference modules need to be deployed, and management nodes need to install management modules in order to manage the entire SDN. The inference module is respectively a model inference service, an ONNX.js library when a Javascript interface of the ONNX model runs, and a Javascript/Webassignment engine for shielding the bottom layer heterogeneity of the mobile equipment from top to bottom. The control module realizes a controller of the SDN based on onos, the onos is in butt joint with a virtual switch realized by ovs through a southbound interface and cooperates to control network forwarding in the SDN, and other mobile devices are connected with a control node through vxlan to carry out SDN networking of the whole system. By using the SDN technology, the method can enable the mobile equipment to efficiently and robustly self-organize the network, and finally coordinate a plurality of mobile terminals to finish the Trnformer model inference of the whole input data.
At the current stage, the method has explored the feasibility of networking heterogeneous devices through SDN. We use three HiKey960 development boards to represent mobile devices; their SoC is the Huawei Kirin 960 processor with an Arm big.LITTLE architecture, four Arm Cortex-A73 cores at 2.4 GHz and four Arm Cortex-A53 cores at 1.8 GHz, 3 GB of DDR4 memory and 32 GB of UFS 2.0 flash storage; in terms of networking they support 2.4 GHz and 5 GHz dual bands with dual antennas. The SDN control node is a Dell G7-7590 laptop with an Intel i7-8750H processor at 2.2 GHz.
Taking the vit_base model as an example, the intermediate inference result is smaller than 200 KB; at a transmission speed of 4 Mb/s, the background transmission time is much shorter than the time consumed by single-layer model inference, so while a mobile device is inferring the current sample on its model slice, the intermediate result of the previous sample has already finished its network transmission in the background. Intermediate results awaiting transmission therefore do not pile up on any device and do not disturb the operation of the system. Besides taking place in the background, network transmission occupies almost no CPU resources, so the overhead of SDN networking has little influence on the service throughput of the collaborative inference system.
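A quick back-of-the-envelope check of this claim, using only the figures quoted above (a sketch, not a measurement from the system):

```python
# With an intermediate result under 200 KB and a measured link speed of about
# 4 Mb/s, one background transfer takes roughly 0.4 s, which only needs to be
# shorter than a device's compute time per slice for the pipeline to hide it.
intermediate_kb = 200          # upper bound on the vit_base intermediate result size
link_mbps = 4                  # measured average speed between boards and control node
transfer_s = intermediate_kb * 8 / 1024 / link_mbps
print(f"background transfer time ~= {transfer_s:.2f} s")   # ~0.39 s
```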
Three Hikey960 development boards are used to experimentally test the collaborative inference performance of the system on heterogeneous devices. The experiments are divided into two groups, each simulating a different distribution of computing power across heterogeneous devices. In the first group, we used the cgroup mechanism to simulate a system of three heterogeneous mobile devices whose computing power equals 20%, 50% and 100% of a single Hikey960 core respectively (a computing power ratio of 2:5:10). In the second group, we simulated three mobile devices with computing power equal to 50%, 50% and 100% of a single Hikey960 core (a ratio of 1:1:2). We measured the service throughput achieved by the collaborative inference system under two model-slice distribution algorithms: the Transformer model slicing and distribution algorithm adopted by this system, which balances computation time across devices, and a baseline algorithm that balances the amount of inference computation across devices. Six vision Transformers of different sizes were selected for testing, and the influence of network transmission is not counted in this simulation experiment. Table 1 below shows the collaborative inference results measured in experiment one:
[Table 1: measured collaborative inference throughput, experiment one — table image not reproduced]
Table 2 below shows the collaborative inference results measured in experiment two:
[Table 2: measured collaborative inference throughput, experiment two — table image not reproduced]
It can be seen that, when the system operates in a pipelined manner, the algorithm proposed by this system achieves higher service throughput than the baseline algorithm.
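The heterogeneous computing powers in these experiments were emulated with the cgroup mechanism; as a rough sketch of how such a per-device CPU cap can be applied (assuming a cgroup v1 CPU controller mounted at /sys/fs/cgroup/cpu; the group name and fraction are illustrative):

```python
# Hypothetical sketch: cap a process at a fraction of one CPU core via cgroup v1
# CPU bandwidth control, e.g. 20% of a core as in the first experiment group.
import os

def limit_cpu(group_name, pid, fraction, period_us=100_000):
    base = f"/sys/fs/cgroup/cpu/{group_name}"
    os.makedirs(base, exist_ok=True)
    with open(f"{base}/cpu.cfs_period_us", "w") as f:
        f.write(str(period_us))
    with open(f"{base}/cpu.cfs_quota_us", "w") as f:
        f.write(str(int(period_us * fraction)))   # 0.2 -> 20% of one core
    with open(f"{base}/cgroup.procs", "w") as f:
        f.write(str(pid))                         # move the inference process into the group

# limit_cpu("infer", os.getpid(), 0.2)   # would emulate the 20%-of-a-core device
```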
Tests show that the 3 Hikey960 development boards and the laptop can successfully form an SDN, and the average network speed between the development boards and the control node (the laptop) reaches 4 Mb/s. Table 3 below shows the experimentally measured delay added to the system when intermediate inference results of different sizes are transmitted:
[Table 3: measured transmission delay for intermediate results of different sizes — table image not reproduced]
It can be seen that the time required to transmit the intermediate results of several common Transformer models is short.

Claims (6)

1. A deep learning Transformer class model-oriented mobile device collaborative inference system, characterized by comprising the following steps: step 1: a preparation stage: performing a device performance evaluation test on the devices, and allocating the model slices to be deployed to different mobile terminal devices;
step 1.1: evaluating available hardware resources of the mobile terminal equipment: acquiring hardware information of each device to be used;
step 1.2: determining the segmentation granularity: analyzing a Transformer model structurally, and analyzing to obtain adaptive model segmentation granularity by combining the hardware information of each device to be used, which is obtained in the step 1.1;
step 1.3: slicing is carried out: segmenting the Transformer model according to the model segmentation granularity obtained in the step 1.2 to obtain a plurality of model slices;
step 1.4: slice distribution: distributing the model slices in the step 1.3 to each device to be used through a distribution algorithm;
step 2: a deployment phase: controlling the devices to carry out collaborative inference on heterogeneous devices through SDN networking;
step 2.1: SDN networking: determining, from the devices, the inference devices that will participate in deep learning inference; networking them with an SDN (software-defined network) to ensure that the devices are connected to the same network; deploying the SDN control program on the device with the strongest network processing capability, which serves as the control node; letting all nodes participate in collaborative inference, with the inference devices deploying the inference service programs;
step 2.2: when deep learning inference is required, the request-initiating program requesting the inference service sends a service request to the control node among the nodes of step 2.1;
step 2.3: the control node communicates with other nodes to ensure that all nodes in the step 2.1 normally carry out collaborative inference service;
step 2.4: the request-initiating program sends the sample to be inferred to the control node, and the control node sends the sample to the devices to be used for inference, namely the other nodes;
step 2.5: the current node transmits the intermediate result to the node responsible for the next model slice in the given order, and so on, until the node holding the last model slice completes inference and sends the final result to the control node;
step 2.6: and the control node outputs the final result obtained in the step 2.5.
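A minimal sketch of the control-node side of steps 2.2 to 2.6 follows (check_alive, send_sample and recv_final_result are hypothetical helpers standing in for the control program's actual communication with the nodes):

```python
# Illustrative control-node handler: verify that every collaborating node is
# available, hand the sample to the node holding the first slice, and relay the
# final result back to the requester. All helper names are assumptions.
def check_alive(node):
    """Placeholder: ask a node whether its inference service is ready (step 2.3)."""
    return True

def send_sample(node, sample):
    """Placeholder: transmit the sample to be inferred to a node (step 2.4)."""
    pass

def recv_final_result():
    """Placeholder: wait for the last-slice node to return the final result (step 2.5)."""
    return None

def handle_request(sample, nodes_in_slice_order):
    if not all(check_alive(n) for n in nodes_in_slice_order):
        raise RuntimeError("a collaborating node is unavailable")
    send_sample(nodes_in_slice_order[0], sample)
    return recv_final_result()          # returned to the requester (step 2.6)
```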
2. The deep learning Transformer-like model oriented mobile device collaborative inference system of claim 1, wherein: the hardware information in step 1.1 includes memory size, storage space, and CPU clock frequency.
3. The deep learning Transformer-like model oriented mobile device collaborative inference system of claim 1, wherein the distribution algorithm in step 1.4 proceeds as follows: test each model slice on each device to measure the time overhead actually required for inference, then find the device with the smallest current load and assign the model slice to that device, where the device with the smallest load is the device requiring the least time to run the model slices currently assigned to it.
4. The deep learning Transformer-like model oriented mobile device collaborative inference system of claim 3, wherein the distribution algorithm comprises the following specific steps:
step 1.4.1: initialization: input the model to be split, which is a Transformer model; record with an array T the estimated computation time cost of the tasks assigned to each device, where T[i] is the estimated computation time cost of the tasks assigned to device i, all entries being initialized to 0 before starting; record with an array Workload the model slices assigned to each device, where Workload[i] is the set of all model slices assigned to device i; the set of all model slices is denoted model_slices;
step 1.4.2: distribute all model slices: for each model slice in model_slices, find the device i with the smallest current task burden, i.e. the device whose T[i] is the minimum value in T, add the model slice to the task set assigned to device i, and obtain the updated Workload[i];
step 1.4.3: after updating Workload[i], update the estimated time overhead of the tasks assigned to device i by looking up the data measured during the evaluation tests: Estimate_time queries the test records for the measured time overhead of device i running the Workload[i] task; if the device cannot undertake the assigned task, T[i] is assigned the value infinity;
step 1.4.4: after the allocation is completed, find the device with the smallest T[i] and move a model slice from the device with the largest T[i] onto it so as to reduce the maximum value in the array T; repeat this process until the maximum value in T can no longer be reduced;
step 1.4.5: detect whether any device i cannot undertake the inference task of its assigned model slices, i.e. check whether any value of T[i] is infinite; if so, the model is too large, the whole system cannot run the model even in cooperation, and the program terminates.
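Read as code, steps 1.4.1 to 1.4.5 amount to a greedy assignment followed by a rebalancing pass; a minimal sketch is given below, where estimate_time stands in for the lookup table of measured per-device timings from the evaluation tests and all other names are illustrative:

```python
# Sketch of the slice distribution algorithm of steps 1.4.1-1.4.5.
INF = float("inf")

def estimate_time(device, slices):
    """Placeholder: measured time for `device` to run `slices`, or INF if it cannot."""
    raise NotImplementedError

def distribute(model_slices, devices):
    # Step 1.4.1: T[i] is the estimated compute time of device i, workload[i] its slices.
    T = {d: 0.0 for d in devices}
    workload = {d: [] for d in devices}

    # Steps 1.4.2-1.4.3: greedily give each slice to the least-loaded device and
    # refresh that device's estimate from the measured lookup table.
    for s in model_slices:
        i = min(devices, key=lambda d: T[d])
        workload[i].append(s)
        T[i] = estimate_time(i, workload[i])

    # Step 1.4.4: repeatedly try to move a slice from the most-loaded device to the
    # least-loaded one; stop once no move reduces the maximum value in T.
    improved = True
    while improved:
        improved = False
        hi = max(devices, key=lambda d: T[d])
        lo = min(devices, key=lambda d: T[d])
        for s in list(workload[hi]):
            new_hi = estimate_time(hi, [x for x in workload[hi] if x is not s])
            new_lo = estimate_time(lo, workload[lo] + [s])
            if max(new_hi, new_lo) < T[hi]:
                workload[hi].remove(s)
                workload[lo].append(s)
                T[hi], T[lo] = new_hi, new_lo
                improved = True
                break

    # Step 1.4.5: if any device still cannot carry its load, the model is too
    # large for the whole system even in cooperation.
    if any(T[d] == INF for d in devices):
        raise RuntimeError("model too large for the available devices")
    return workload, T
```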
5. The deep learning Transformer-like model oriented mobile device collaborative inference system of claim 1, wherein: in step 2, an ONNX runtime environment is used to shield the differences between the underlying heterogeneous systems, and software-defined networking technology is used to abstract away the underlying differences and network conditions of the heterogeneous devices.
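For illustration only, the following sketch runs one model slice through a generic ONNX runtime so that the same slice file works regardless of the device's underlying hardware; the deployment described above uses the ONNX.js/WebAssembly runtime rather than the Python onnxruntime shown here, and the slice file name and tensor shape are assumptions.

```python
# Hypothetical example: the same exported slice runs unchanged on any device
# providing an ONNX runtime, which is what shields the hardware heterogeneity.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("slice_0.onnx")        # assumed exported model slice
input_name = session.get_inputs()[0].name
x = np.zeros((1, 197, 768), dtype=np.float32)         # assumed vit_base activation shape
(output,) = session.run(None, {input_name: x})
print(output.shape)
```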
6. The deep learning Transformer-like model oriented mobile device collaborative inference system of claim 5, wherein: the software-defined networking technology uses onos to control the SDN control layer and ovs to control the SDN forwarding layer, deploys both on the same node, and connects the other nodes to the control node through vxlan.
CN202210547606.1A 2022-05-20 2022-05-20 Mobile device collaborative inference system for deep learning Transformer type model Active CN114662690B (en)


