CN116033492A

CN116033492A - Method and device for segmenting Transformer model in mobile edge environment

Info

Publication number: CN116033492A
Application number: CN202211624092.1A
Authority: CN
Inventors: 屈志昊; 周文轩; 叶保留; 王博文; 柳泽
Original assignee: Shanghai Industrial Control Safety Innovation Technology Co ltd; Hohai University HHU
Current assignee: Shanghai Industrial Control Safety Innovation Technology Co ltd; Hohai University HHU
Priority date: 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-04-28

Abstract

The invention discloses a method and a device for segmenting a Transformer model in a mobile edge environment. The method comprises the following steps: selecting a number of mobile edge devices not exceeding the number of encoder layers of the Transformer model as a device group, acquiring the computing capability and transmit power of each device, and calculating the D2D communication time between devices from the inter-device distance, the channel bandwidth and the channel noise; solving the modeled optimization problem with a branch-and-bound method to obtain an optimal transmission path; and, based on the obtained optimal transmission path, placing the Transformer model on the corresponding mobile edge devices layer by layer. The invention improves the communication efficiency and overall computing performance of the system and reduces the computational complexity of splitting and arranging the network.

Description

Method and device for segmenting Transformer model in mobile edge environment

Technical Field

The invention relates to the technical field of distributed computing, and in particular to a method and a device for segmenting a Transformer model in a mobile edge environment.

Background

Breakthroughs in deep learning have driven the development of artificial-intelligence fields such as computer vision and natural language processing, and the success of deep neural networks stems from training large-scale models on large-scale data sets. However, growing model and data sizes prevent a single device from completing the training of a deep neural network model, so training with cloud computing is a natural choice. Under the traditional cloud computing framework, data sources must offload the data they generate to the cloud, where a large-scale distributed machine learning scheme (such as data parallelism) is applied uniformly. However, because sensors and Internet of Things devices generate a large amount of data per hour (for example, more than 1 TB), transmitting the data to the cloud causes a large communication overhead; furthermore, offloading data raises privacy and security issues such as malicious servers and hacking. To solve these two problems, researchers have proposed the machine learning paradigm of split learning (also called segmentation learning), which divides a machine learning model into several parts and places them on different devices (a client and a server), each of which carries out its own part of the training. Users therefore do not need to transmit raw data to the cloud, which avoids both the communication overhead and the security risks of offloading.

The Transformer model is a deep learning model that employs a self-attention mechanism, which assigns a weight to each part of the input data and computes weighted sums. Early Transformers were mainly used in natural language processing. Like recurrent neural networks, Transformers are intended to process sequence data (e.g., text translation). Unlike recurrent neural networks, which process data recursively, the Transformer processes all data simultaneously, so it not only overcomes the "forgetting" drawback of recurrent neural networks but can also be parallelized to accelerate model training. Today the Transformer model is applied not only in natural language processing but also in computer vision. As shown in Fig. 1, a general Transformer model consists of two parts, an encoder module and a decoder module. For natural language processing tasks such as translation, the Transformer model uses both the encoder and decoder modules, while for classification tasks (e.g., text classification, image classification) it uses only the encoder module. The invention applies to classification tasks that use only the encoder module. The encoder module of the Transformer model consists of several isomorphic encoders, with the output of each encoder serving as the input of the next; before the first encoder, an embedding layer maps the input to a low-dimensional space and adds positional information. Each encoder mainly comprises two parts, a multi-head attention module and a feedforward neural network, and a residual connection and layer normalization are applied after each module to speed up the convergence of the model.

Conventional segmentation learning splits a complete model into two parts, a client network and a server network. A batch of edge devices and a server cooperatively train the whole model: each edge device in turn updates the client network and the server network using the data it generates, and then sends the updated model to the next edge device. This protects the privacy of each edge device's data and reduces its computation and storage cost, making it an efficient distributed machine learning paradigm. However, such segmentation learning can incur intolerable communication overhead, especially between the edge devices and the server, which requires lengthy communication time, consumes a large amount of device energy, and generates high economic cost. Multi-hop segmentation learning can address this expensive communication overhead. However, edge devices are highly heterogeneous and differ greatly in computing and storage capability, so splitting the network arbitrarily and arranging it onto a group of edge devices can severely reduce the training efficiency of the model. Although the encoders of the Transformer model are isomorphic, finding the splitting and arrangement that trains a Transformer model most efficiently under multi-hop segmentation learning is an NP-hard problem, for which no polynomial-time algorithm is known.

Disclosure of Invention

Purpose of the invention: to overcome the shortcomings of the prior art, the invention provides a method and a device for segmenting a Transformer model in a mobile edge environment which, based on the multi-hop segmentation learning paradigm, improve the communication efficiency and overall computing performance of the system and reduce the computational complexity of splitting and arranging the network, while protecting the privacy of edge device data.

Technical scheme: a method for segmenting a Transformer model in a mobile edge environment comprises the following steps:

selecting a number of mobile edge devices not exceeding the number of encoder layers of the Transformer model as a device group, acquiring the computing capability and transmit power of each device, and calculating the D2D communication time between devices from the inter-device distance, the channel bandwidth and the channel noise;

based on the D2D communication time between devices, establishing an optimization problem model that takes as its segmentation basis that the edge device with the strongest computing capability holds the most Transformer layers and each remaining device holds one Transformer layer, and that takes minimizing the computation time of the Transformer model as its objective, and solving the modeled optimization problem with a branch-and-bound algorithm to obtain an optimal transmission path;

and, based on the obtained optimal transmission path, placing the Transformer model on the corresponding mobile edge devices layer by layer.

Preferably, the D2D communication time is calculated as follows:

$$t_{ij} = \frac{A}{s_{ij}}$$

where $t_{ij}$ is the transmission time from device i to device j, $A$ is the amount of data transmitted, $s_{ij}$ is the data transmission rate from device i to device j given by the Shannon formula, $W$ is the bandwidth of the channel, $\alpha$ is the path loss exponent, $N_0$ is the noise power, $p_i$ is the transmit power of device i, and $d_{ij}$ is the physical distance between device i and device j.
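The calculation above can be sketched in Python. This is a minimal sketch under a stated assumption: the rate is taken to be the common Shannon form with distance-based path loss, $s_{ij} = W \log_2(1 + p_i d_{ij}^{-\alpha} / N_0)$, which is consistent with the symbols listed above but is an assumed reconstruction. The function and parameter names are hypothetical.

```python
import math


def d2d_rate(p_i, d_ij, bandwidth, noise_power, alpha):
    """Data rate s_ij from device i to device j.

    ASSUMPTION: Shannon capacity with distance-based path loss,
    s_ij = W * log2(1 + p_i * d_ij**(-alpha) / N_0).
    """
    snr = p_i * d_ij ** (-alpha) / noise_power
    return bandwidth * math.log2(1.0 + snr)


def d2d_time(data_amount, p_i, d_ij, bandwidth, noise_power, alpha):
    """Transmission time t_ij = A / s_ij for an intermediate result of size A."""
    return data_amount / d2d_rate(p_i, d_ij, bandwidth, noise_power, alpha)


# Example with made-up values: a longer distance yields a longer transmission time.
near = d2d_time(data_amount=20.0, p_i=1.0, d_ij=1.0,
                bandwidth=10.0, noise_power=1.0, alpha=2.0)
far = d2d_time(data_amount=20.0, p_i=1.0, d_ij=2.0,
               bandwidth=10.0, noise_power=1.0, alpha=2.0)
```

Because the rate depends on $p_i$, the resulting times are generally asymmetric: $t_{ij}$ and $t_{ji}$ differ whenever the two devices transmit at different powers.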

Preferably, the modeled optimization problem is as follows:

the constraint conditions are as follows:

where $N$ is the number of mobile edge devices in the device group; $x_{ij}$ is an indicator variable, whose value 1 indicates that device i transmits data to device j and whose value 0 indicates that device i does not transmit data to device j; $w_i$ is the time device i needs to compute one layer of the Transformer model; and $z_i$ is an integer variable that restricts the transmission topology to a single connected component.

Preferably, obtaining the optimal transmission path with the branch-and-bound algorithm comprises: obtaining a suboptimal solution with a genetic algorithm as the upper bound of the search; creating a priority queue and adding the starting node to it; and then, while the queue is not empty and the lower bound of the head-of-queue node is less than the global upper bound, performing the following: taking out the head-of-queue element and traversing the remaining nodes; if the lower bound obtained by adding a traversed node to the path is less than or equal to the current upper bound, adding that node to the priority queue; and if a leaf node is reached and the total cost of its path is less than the current global upper bound, updating the upper bound.

Preferably, placing the Transformer model on the corresponding mobile edge devices layer by layer comprises: deploying the Transformer model onto the edge devices in sequence along the transmission path, wherein the edge device with the strongest computing capability holds the most Transformer layers, namely the number of network layers minus the number of devices plus 1, and every other device holds one layer of the network model.

The invention also provides a device for segmenting a Transformer model in a mobile edge environment, which comprises:

a communication time determining module, configured to select a number of mobile edge devices not exceeding the number of encoder layers of the Transformer model as a device group, acquire the computing capability and transmit power of each device, and calculate the D2D communication time between devices from the inter-device distance, the channel bandwidth and the channel noise;

a path calculation module, configured to establish, based on the D2D communication time between devices, an optimization problem model that takes as its segmentation basis that the edge device with the strongest computing capability holds the most Transformer layers and each remaining device holds one Transformer layer, and that takes minimizing the computation time of the Transformer model as its objective, and to solve the modeled optimization problem with a branch-and-bound algorithm to obtain an optimal transmission path;

and a model segmentation module, configured to place the Transformer model on the corresponding mobile edge devices layer by layer based on the obtained optimal transmission path.

The present invention also provides a computer device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and when executed by the processors implement the steps of the method for segmenting a Transformer model in a mobile edge environment as described above.

The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a method for partitioning a Transformer model in a mobile edge environment as described above.

The invention also provides a mobile edge computing system comprising a plurality of mobile edge devices, wherein the segmented Transformer model is distributed over the mobile edge devices based on an optimal transmission path, the number of mobile edge devices does not exceed the number of Transformer model layers, and the optimal transmission path is obtained by the following method:

selecting a number of mobile edge devices not exceeding the number of encoder layers of the Transformer model as a device group, acquiring the computing capability and transmit power of each device, and calculating the D2D communication time between devices from the inter-device distance, the channel bandwidth and the channel noise; and, based on the D2D communication time between devices, establishing an optimization problem model that takes as its segmentation basis that the edge device with the strongest computing capability holds the most Transformer layers and each remaining device holds one Transformer layer, and that takes minimizing the computation time of the Transformer model as its objective, and solving the modeled optimization problem with a branch-and-bound algorithm to obtain the optimal transmission path.

Compared with the prior art, the invention has the following advantages and beneficial effects. Based on the characteristics of the Transformer model, the invention provides an efficient training method built on the multi-hop segmentation learning paradigm. When the network topology satisfies certain conditions, a strategy for splitting and arranging the network with an approximation ratio of 2 can be obtained in polynomial time; in other cases, a search strategy based on the branch-and-bound method is provided, which greatly reduces the time complexity compared with a brute-force search for the optimal splitting and arrangement. The invention solves the problem of how to partition the network in multi-hop segmentation learning and greatly improves the resource utilization and the speed of Transformer model training.

Drawings

FIG. 1 is a block diagram of a general Transformer model;

FIG. 2 (a) and FIG. 2 (b) are architecture diagrams of the two forms of basic segmentation learning;

FIG. 3 is a basic architecture diagram of multi-hop segmentation learning;

FIG. 4 is a flow chart of the method for segmenting a Transformer model according to the present invention.

Detailed Description

The technical scheme of the invention is further described below with reference to the accompanying drawings.

Basic segmentation learning has two forms, original segmentation learning and U-shaped segmentation learning, whose configurations are shown in Fig. 2 (a) and Fig. 2 (b). In original segmentation learning, the training flow is as follows:

1) The client cuts the model into two parts, placing the former part on the client and the latter part on the server;

2) The client performs forward propagation up to the cut layer and transmits the intermediate data to the server;

3) The server receives the intermediate data and continues forward propagation; after obtaining the result, it performs back propagation down to the cut layer, updates its network parameters, and transmits the intermediate data to the client;

4) The client receives the intermediate data, continues back propagation and updates its network parameters, and after the computation is complete transmits the network parameters to the next client;

5) Repeating 2) to 4) until the model converges.

U-shaped segmentation learning aims to protect label privacy in addition to protecting the raw data features. The network model is divided into three parts, and the last part is placed back on the client for computation, which achieves the purpose of protecting label privacy.

The segmentation learning described above is based on a client-server architecture. During training it incurs communication delay that is difficult to tolerate, wastes device energy, and generates high economic cost. Segmentation learning also has variants, such as multi-hop segmentation learning, which uses a decentralized architecture as shown in Fig. 3. Under this architecture, the network is split into multiple segments that are placed on peer edge devices, and device-to-device (D2D) communication is used among them, which reduces the load on the serving base station. While protecting data privacy, this both reduces the communication cost and saves economic cost. The training process is as follows:

1) The network is arranged in n different devices according to a specified dividing mode;

2) Device 1 to device n perform forward propagation in sequence;

3) Device n to device 1 perform back propagation and update parameters in sequence;

4) Repeating 2) to 3) until the model converges.
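The multi-hop training flow above can be illustrated with a toy Python sketch. It is only an illustration under strong simplifying assumptions: each device holds a single scalar weight instead of Transformer layers, the loss is a squared error, and the value passed between neighbouring devices stands in for the intermediate data sent over a D2D link. All names are hypothetical.

```python
def train_step(weights, x, target, lr):
    """One round of multi-hop training over a chain of devices.

    weights[k] is the (toy, scalar) sub-model held by device k + 1.
    """
    # Forward propagation: device 1 to device n, each passing its output on.
    acts = [x]
    for w in weights:
        acts.append(w * acts[-1])

    err = acts[-1] - target
    loss = 0.5 * err * err

    # Back propagation: device n to device 1, each updating its own parameter.
    grad = err
    for k in range(len(weights) - 1, -1, -1):
        grad_w = grad * acts[k]          # gradient w.r.t. this device's weight
        grad_in = grad * weights[k]      # gradient handed to the previous device
        weights[k] -= lr * grad_w
        grad = grad_in
    return loss


# Three devices; repeat the forward/backward rounds for a fixed number of steps.
weights = [0.5, 0.5, 0.5]
losses = [train_step(weights, x=1.0, target=2.0, lr=0.1) for _ in range(200)]
```

Only activations (forward) and gradients (backward) cross device boundaries in this sketch; the raw input `x` stays on the first device.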

The invention is based on a multi-layer Transformer model and aims to find an optimal segmentation scheme and an optimal placement scheme for the segmented sub-models, so that the training speed of the whole network is the fastest. The invention proposes an algorithm for solving this problem, named the M-SLT algorithm. The algorithm relies on a decentralized architecture in which the Transformer model is divided into several sub-models that are placed on a group of edge devices, with intermediate results exchanged over D2D links. Compared with a client-server architecture, M-SLT improves communication efficiency because D2D communication typically has lower latency and higher data rates than communication with edge servers over cellular links. Furthermore, by fully exploiting the computing power of the edge devices, segmentation learning can be performed in a resource-saving and flexible manner. The core of the algorithm is to find the arrangement of the Transformer sub-networks on the edge devices that yields the best model training speed. Once the optimal arrangement is obtained, the model is placed on the edge devices in sequence: the device with the strongest computing power is assigned the largest number of network layers, and each remaining device is assigned one layer. For example, suppose three devices are arranged in the order 2, 1, 3, the network has five layers, and device 1 has the strongest computing power; then device 2 is assigned layer 1, device 1 is assigned layers 2 to 4, and device 3 is assigned layer 5.
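The placement rule in the example above can be written as a short Python sketch. The function name and the dictionary-based return format are hypothetical choices; only the rule itself (the strongest device holds the number of layers minus the number of devices plus 1, every other device holds one layer) comes from the description.

```python
def assign_layers(path, num_layers, strongest):
    """Map each device on the transmission path to its 1-based layer indices.

    path:       device IDs in transmission order
    num_layers: number of encoder layers in the model
    strongest:  ID of the device with the strongest computing power
    """
    n = len(path)
    if n > num_layers:
        raise ValueError("the number of devices must not exceed the number of layers")
    if strongest not in path:
        raise ValueError("the strongest device must be on the path")

    plan = {}
    next_layer = 1
    for dev in path:
        # The strongest device absorbs all layers the others do not take.
        count = num_layers - n + 1 if dev == strongest else 1
        plan[dev] = list(range(next_layer, next_layer + count))
        next_layer += count
    return plan


# The example from the description: order 2, 1, 3, five layers, device 1 strongest.
plan = assign_layers([2, 1, 3], num_layers=5, strongest=1)
# plan == {2: [1], 1: [2, 3, 4], 3: [5]}
```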

The problem is modeled as follows:

The mobile edge devices within the device group are numbered 1 to $N$, where device 1 holds the original training data. In the mobile edge environment, when device i passes its intermediate result to another node, its transmit power is set to $p_i$. For any two devices i and j, $d_{ij}$ denotes the physical distance between them. The data transmission rate $s_{ij}$ from device i to device j can then be derived using the Shannon formula, in which $W$ is the bandwidth of the channel, $\alpha$ is the path loss exponent, and $N_0$ is the noise power. Owing to the inherent computational properties of the Transformer model, the amount of intermediate result data between every two layers is a constant, denoted by $A$, and thus the transmission time from device i to device j is $t_{ij} = A / s_{ij}$. The core of the problem is therefore to find an arrangement (with device 1 as the starting point) under which the edge devices can train efficiently and, once the arrangement is found, to assign the network layers to the edge devices in that order.

To build this optimization problem, an indicator variable $x_{ij}$ is defined, which is 1 when device i transmits data to device j (and 0 otherwise). Based on the previous assumptions, the indicator variables must satisfy the following constraints:

However, these two constraints do not guarantee that there is only one connected component in the network transmission topology. Therefore, the following constraint needs to be introduced:

Here $z_i$ is an auxiliary variable for device i. This constraint ensures that the path formed by the model contains only one connected component (i.e., no sub-loop), which can be shown by contradiction. (1) If device i transmits to device j, then $x_{ij} = 1$, and the constraint gives $z_j \ge z_i + 1$, so the value of $z$ increases strictly along the path. (2) Suppose a sub-loop satisfying the above constraint exists; then there is a sub-loop that does not include the starting device (number 1), e.g., 2 → 3 → 2, for which $z_2 \ge z_3 + 1 \ge z_2 + 2$, a contradiction. The model therefore contains no sub-loops.
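The role of the $z$ variables can be illustrated with a small Python sketch. It assumes, for illustration only, that the transmission topology is given as a hypothetical `successor` mapping from each device to the device it transmits to. Walking from device 1 and labelling each visited device with an increasing integer mimics the condition $z_j \ge z_i + 1$; any device left unlabelled lies on a sub-loop that excludes device 1.

```python
def order_labels(successor, start=1):
    """Follow the transmission links from `start`, labelling devices 0, 1, 2, ...

    The labels play the role of z: each label exceeds its predecessor's by 1.
    """
    labels = {}
    node, step = start, 0
    while node is not None and node not in labels:
        labels[node] = step
        step += 1
        node = successor.get(node)
    return labels


def is_single_component(successor, start=1):
    """True if every device is reached from `start`, i.e. there is no sub-loop."""
    return len(order_labels(successor, start)) == len(successor)


# One loop through all devices: consistent labels exist.
ok = is_single_component({1: 2, 2: 3, 3: 1})

# Devices 2 and 3 form the sub-loop 2 -> 3 -> 2 that excludes device 1.
bad = is_single_component({1: 4, 4: 1, 2: 3, 3: 2})
```

In the second topology every device still has exactly one outgoing and one incoming link, which is why the degree constraints alone are not sufficient.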

Let $w_i$ denote the time device i needs to compute one layer of the Transformer model; the following optimization problem can then be established:

which satisfies the following constraints:

It can be shown that the traveling salesman problem can be reduced to this optimization problem in polynomial time: a city in the traveling salesman problem maps to an edge device in the optimization problem, and both can be abstracted as finding a shortest Hamiltonian path in a graph. The optimization problem is therefore NP-hard and can only be solved by approximation algorithms or by brute-force search.
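For orientation, one formulation consistent with the variables defined above can be sketched as follows. It is an assumed form, not a confirmed statement of the model: it uses the standard Miller-Tucker-Zemlin sub-loop elimination constraint, which yields $z_j \ge z_i + 1$ whenever $x_{ij} = 1$, and a hypothetical edge cost $c_{ij}$ that combines the communication time $t_{ij}$ with the per-layer computation time.

```latex
\begin{aligned}
\min_{x,\,z}\quad & \sum_{i=1}^{N} \sum_{j \neq i} c_{ij}\, x_{ij} \\
\text{s.t.}\quad  & \sum_{j \neq i} x_{ij} = 1, && i = 1, \dots, N, \\
                  & \sum_{i \neq j} x_{ij} = 1, && j = 1, \dots, N, \\
                  & z_i - z_j + N x_{ij} \le N - 1, && 2 \le i \neq j \le N, \\
                  & x_{ij} \in \{0, 1\}, \quad z_i \in \mathbb{Z}.
\end{aligned}
```

The first two constraint families give every device exactly one outgoing and one incoming link; the third excludes device 1, so it rules out exactly those loops that do not pass through the starting device.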

Referring to Fig. 4, the method of the present invention mainly comprises the following steps. First, the communication time between every two devices in the device group is determined from the signal transmit power of each device, the distance between every two devices, the channel bandwidth, the channel noise and similar indicators. Next, the transmission path is calculated using a branch-and-bound algorithm. Finally, the network model is segmented, distributed and trained according to the solved path.

Specifically:

The device group is abstracted into a complete graph G = (V, E), in which each device is a node in V and the communication time plus the computation time is the weight of the corresponding edge in E. The complete graph is stored using the adjacency matrix C. In the search algorithm, the following definitions are used:
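A minimal Python sketch of building the adjacency matrix C is given below. The exact way communication and computation time are combined on an edge is an assumption here: the edge from device i to device j is given the cost $t_{ij} + w_j$, i.e. the time to send the intermediate result to j plus the time for j to compute one layer. The function and argument names are hypothetical, and devices are indexed from 0.

```python
def build_cost_matrix(comm_time, layer_time):
    """Adjacency matrix C of the complete device graph.

    comm_time[i][j]: D2D transmission time t_ij from device i to device j
    layer_time[j]:   time w_j for device j to compute one layer
    ASSUMPTION: C[i][j] = t_ij + w_j for i != j, and 0 on the diagonal.
    """
    n = len(layer_time)
    return [
        [0.0 if i == j else comm_time[i][j] + layer_time[j] for j in range(n)]
        for i in range(n)
    ]


# Two devices with asymmetric transmission times.
C = build_cost_matrix(comm_time=[[0, 1], [2, 0]], layer_time=[5, 7])
```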

Determined path U: for the device currently being searched, the determined path is $U = (r_1, r_2, \ldots, r_k)$, where U is a sequence and $r_k$ is the k-th device in the sequence;

Path $U_1$ with the starting point removed: the path obtained by removing $r_1$ from U;

Path $U_2$ with the end point removed: the path obtained by removing $r_k$ from U;

Upper bound ub: a transmission sequence is initialized arbitrarily and the total time of the resulting path is taken as the upper bound of the search space; each time a leaf node is reached, the upper bound is updated if the total time of the path to that leaf node is smaller than the current upper bound.

Lower bound lb: for the search path U at the current search position, the lower bound is updated according to the following formula; if the lower bound produced by the current search exceeds the upper bound, the branch is pruned directly. If the lower bounds of all remaining unsearched positions exceed the upper bound, the result corresponding to the upper bound is returned directly.

Here c is the adjacency matrix of transmission times, $c[r_1]$ denotes the $r_1$-th row of c, and $c[:][r_k]$ denotes the $r_k$-th column of c.

The method is implemented according to the following algorithm flow:

a) Obtain an initial upper bound ub using a genetic algorithm, create a priority queue Q, and add the starting point to the queue;

b) While the lower bound of the head-of-queue node of Q is less than the upper bound, perform the following loop:

(1) Take out the head-of-queue element p;

(2) Traverse the remaining nodes: if the lower bound obtained by adding a node to path p is less than or equal to the current upper bound, add that node to the priority queue Q; if a leaf node has been reached and the total cost of its path is less than the current upper bound, update the upper bound.
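The flow in steps a) and b) can be sketched in Python as a best-first branch-and-bound search for a shortest Hamiltonian path. Two parts of this sketch are substitutions rather than the method described above: the initial upper bound comes from a simple greedy nearest-neighbour path instead of a genetic algorithm, and the lower bound (cost so far plus the cheapest way to enter each unvisited device) is an assumed bound chosen because it never overestimates. All function names are hypothetical and devices are indexed from 0.

```python
import heapq


def path_cost(c, path):
    """Total cost of consecutive hops along the path."""
    return sum(c[a][b] for a, b in zip(path, path[1:]))


def greedy_path(c, start=0):
    """Nearest-neighbour path, used here in place of the genetic algorithm."""
    path = [start]
    while len(path) < len(c):
        last = path[-1]
        rest = [j for j in range(len(c)) if j not in path]
        path.append(min(rest, key=lambda j: c[last][j]))
    return path


def lower_bound(c, path):
    """ASSUMED bound: path cost plus the cheapest incoming edge of each unvisited node."""
    rest = [j for j in range(len(c)) if j not in path]
    sources = [path[-1]] + rest
    bound = path_cost(c, path)
    for j in rest:
        bound += min(c[i][j] for i in sources if i != j)
    return bound


def branch_and_bound(c, start=0):
    n = len(c)
    best_path = greedy_path(c, start)
    ub = path_cost(c, best_path)                       # step a): initial upper bound
    queue = [(lower_bound(c, [start]), [start])]       # priority queue keyed by lb
    while queue and queue[0][0] < ub:                  # step b): loop condition
        _, path = heapq.heappop(queue)                 # (1) take the head element
        for j in range(n):                             # (2) traverse remaining nodes
            if j in path:
                continue
            cand = path + [j]
            if len(cand) == n:                         # leaf: complete path
                cost = path_cost(c, cand)
                if cost < ub:
                    ub, best_path = cost, cand
            else:
                lb = lower_bound(c, cand)
                if lb <= ub:
                    heapq.heappush(queue, (lb, cand))
    return best_path, ub


# Greedy picks 0 -> 1 -> 2 (cost 11); the search finds 0 -> 2 -> 1 (cost 3).
costs = [[0, 1, 2],
         [1, 0, 10],
         [2, 1, 0]]
best, total = branch_and_bound(costs)
```

Because the queue is ordered by lower bound, the loop can stop as soon as the most promising open branch can no longer beat the best complete path found so far.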

After the transmission path is determined, the Transformer model is deployed onto the edge devices in sequence along the transmission path, wherein the edge device with the strongest computing capability holds the most Transformer layers (i.e., the number of network layers minus the number of devices plus 1), and every other device holds one layer of the network model.

It should be understood that the device for segmenting a Transformer model in a mobile edge environment in the embodiment of the present invention can implement all the technical solutions of the above method embodiments, that the functions of each functional module can be implemented according to the methods in the above method embodiments, and that for the specific implementation process reference may be made to the relevant descriptions in the above embodiments, which are not repeated here.

The invention also provides a mobile edge computing system comprising a plurality of mobile edge devices, wherein the segmented Transformer model is distributed over the mobile edge devices based on an optimal transmission path, the number of mobile edge devices does not exceed the number of Transformer model layers, and the optimal transmission path is obtained by the following method: selecting a number of mobile edge devices not exceeding the number of encoder layers of the Transformer model as a device group, acquiring the computing capability and transmit power of each device, and calculating the D2D communication time between devices from the inter-device distance, the channel bandwidth and the channel noise; and, based on the D2D communication time between devices, establishing an optimization problem model that takes as its segmentation basis that the edge device with the strongest computing capability holds the most Transformer layers and each remaining device holds one Transformer layer, and that takes minimizing the computation time of the Transformer model as its objective, and solving the modeled optimization problem with a branch-and-bound algorithm to obtain the optimal transmission path.

The invention models the optimization problem of training a Transformer model with multi-hop segmentation learning in a mobile edge environment, and provides a segmentation method (M-SLT) for assisting the training of a Transformer model in a mobile edge environment. Based on a decentralized architecture, the invention divides a Transformer model into several sub-models and places them on a group of edge devices, with intermediate results exchanged over D2D links. Compared with a client-server architecture, M-SLT improves communication efficiency because D2D communication typically has lower latency and higher data rates than communication with edge servers over cellular links.

Claims

1. A method for segmenting a Transformer model in a mobile edge environment, characterized by comprising the following steps:

2. The method of claim 1, wherein the D2D communication time is calculated as follows:

$$t_{ij} = \frac{A}{s_{ij}}$$

where $t_{ij}$ is the transmission time from device i to device j, $A$ is the amount of data transmitted, $s_{ij}$ is the data transmission rate from device i to device j given by the Shannon formula, $W$ is the bandwidth of the channel, $\alpha$ is the path loss exponent, $N_0$ is the noise power, $p_i$ is the transmit power of device i, and $d_{ij}$ is the physical distance between device i and device j.

3. The method of claim 2, wherein the modeled optimization problem is as follows:

the constraint conditions are as follows:

where $N$ is the number of mobile edge devices in the device group; $x_{ij}$ is an indicator variable, whose value 1 indicates that device i transmits data to device j and whose value 0 indicates that device i does not transmit data to device j; $w_i$ is the time device i needs to compute one layer of the Transformer model; and $z_i$ is an integer variable that restricts the transmission topology to a single connected component.

4. The method of claim 1, wherein obtaining the optimal transmission path with the branch-and-bound algorithm comprises: obtaining a suboptimal solution with a genetic algorithm as the upper bound of the search; creating a priority queue and adding the starting node to it; and then, while the queue is not empty and the lower bound of the head-of-queue node is less than the global upper bound, performing the following: taking out the head-of-queue element and traversing the remaining nodes; if the lower bound obtained by adding a traversed node to the path is less than or equal to the current upper bound, adding that node to the priority queue; and if a leaf node is reached and the total cost of its path is less than the current global upper bound, updating the upper bound.

5. The method of claim 1, wherein placing the Transformer model on the corresponding mobile edge devices layer by layer comprises: deploying the Transformer model onto the edge devices in sequence along the transmission path, wherein the edge device with the strongest computing capability holds the most Transformer layers, namely the number of network layers minus the number of devices plus 1, and every other device holds one layer of the network model.

6. A device for segmenting a Transformer model in a mobile edge environment, comprising:

7. A computer device, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, which when executed by the processors implement the steps of the method of Transformer model segmentation in a mobile edge environment according to any one of claims 1-5.

8. A computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for segmenting a Transformer model in a mobile edge environment according to any one of claims 1-5.

9. A mobile edge computing system comprising a plurality of mobile edge devices, wherein the segmented Transformer model is distributed over the mobile edge devices based on an optimal transmission path, the number of mobile edge devices does not exceed the number of Transformer model layers, and the optimal transmission path is obtained by the following method:

10. The system of claim 9, wherein the modeled optimization problem is as follows:

the constraint conditions are as follows:

where $A$ is the amount of data transmitted, $W$ is the bandwidth of the channel, $\alpha$ is the path loss exponent, $N_0$ is the noise power, $p_i$ is the transmit power of device i, $d_{ij}$ is the physical distance between device i and device j, $N$ is the number of edge devices in the device group, $x_{ij}$ is an indicator variable, whose value 1 indicates that device i transmits data to device j and whose value 0 indicates that device i does not transmit data to device j, $w_i$ is the time device i needs to compute one layer of the Transformer model, and $z_i$ is an integer variable that restricts the transmission topology to a single connected component.