CN116050540B - Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling - Google Patents

Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling Download PDF

Info

Publication number
CN116050540B
CN116050540B (application CN202310050202.6A)
Authority
CN
China
Prior art keywords
edge
batch
evaluation efficiency
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310050202.6A
Other languages
Chinese (zh)
Other versions
CN116050540A (en)
Inventor
潘春雨
张九川
李学华
姚媛媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN202310050202.6A priority Critical patent/CN116050540B/en
Publication of CN116050540A publication Critical patent/CN116050540A/en
Application granted granted Critical
Publication of CN116050540B publication Critical patent/CN116050540B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5072 Grid computing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a self-adaptive federal edge learning method based on joint bi-dimensional user scheduling, which comprises the following steps: based on the loss function and the training period, acquiring the evaluation efficiency of model training; acquiring batch data based on the evaluation efficiency, and acquiring a trained initial model based on the batch data; and screening the initial model to obtain a final trained model. The application can further improve the accuracy and efficiency of the federal learning method.

Description

Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a self-adaptive federal edge learning method based on joint bi-dimensional user scheduling.
Background
With the development of mobile communications and the Internet of Things, the volume of data generated by devices such as smartphones and IoT sensors is growing explosively. Machine learning models require large and rich data sets for training. On the one hand, traditional centralized machine learning algorithms need to upload large amounts of data to a central node, and such large-scale data transmission leads to long transmission times and congestion. On the other hand, traditional distributed machine learning algorithms require the training data set to be uploaded centrally, evenly partitioned, and then redistributed to multiple worker nodes, which easily causes privacy leakage.
The proposal of federated edge learning (Federated Edge Learning, FEL) provides a solution to the above problems. In FEL, model training is performed on the edge devices, coordinated by a multi-access edge computing center server. FEL achieves iterative updating in two steps: 1) Local model training: each intelligent edge device trains a local model on its local data set and then uploads the model parameters to the central server. 2) Global model aggregation: the central server aggregates the uploaded local model parameters into a global model, updates the global model, and then sends the updated model back to the intelligent edge devices to start a new iteration. Compared with traditional centralized and distributed machine learning algorithms, the iterative training process of FEL does not require intelligent edge devices to upload local data, so FEL has greater potential for protecting data privacy.
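The two-step iteration can be pictured with a minimal sketch, given below. It assumes a FedAvg-style weighted parameter average for the aggregation step and a toy linear model with plain mini-batch SGD for local training; all names, the toy model, and the aggregation rule are illustrative assumptions, since the document itself only describes the two steps abstractly.

```python
import numpy as np

def local_update(w, X, y, lr=0.05, steps=10, batch=16):
    """Step 1: local training on one edge device (toy linear model, mini-batch SGD)."""
    for _ in range(steps):
        idx = np.random.choice(len(X), size=min(batch, len(X)), replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # MSE gradient
        w = w - lr * grad
    return w

def aggregate(local_ws, sizes):
    """Step 2: global aggregation as a data-size-weighted average (FedAvg-style assumption)."""
    sizes = np.asarray(sizes, dtype=float)
    return np.average(np.stack(local_ws), axis=0, weights=sizes / sizes.sum())

# One toy FEL run over K simulated edge devices.
rng = np.random.default_rng(0)
d, K = 5, 4
datasets = [(rng.normal(size=(50, d)), rng.normal(size=50)) for _ in range(K)]
w_global = np.zeros(d)
for round_ in range(3):                                    # global iterations
    local_ws = [local_update(w_global.copy(), X, y) for X, y in datasets]
    w_global = aggregate(local_ws, [len(X) for X, _ in datasets])
```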
However, the limited computing power of intelligent edge devices and the heterogeneity and imbalance of local data sets pose significant challenges to the convergence speed and accuracy of the global model. In recent years, related work has studied how to optimize the iterative process of FEL. Most prior studies use stochastic gradient descent (Stochastic Gradient Descent, SGD) for local model training. Some of the literature improves the redundancy rate in federated learning through information coding design, in order to reduce the impact of devices with longer local model training times.
However, none of the above studies takes into account the differences in training completion time caused by the computing power of smart devices and the heterogeneity of data sets, and waiting for all edge devices to complete local model training delays the global model aggregation process. Furthermore, because the data collected by a device depends on its local environment and its own properties, local data sets are often large and unevenly distributed across devices, so device scheduling must take the data attributes of different devices into account. Therefore, a method that jointly considers device computing power and data-set distribution characteristics needs to be designed to improve the model training accuracy and convergence speed of the algorithm.
In the (full-batch) gradient descent algorithm, each iteration must compute the gradient over the entire data set; when the data set contains many samples, each iteration consumes a great deal of time and computing resources. The update formula is as follows:
w_{t+1} = w_t - η∇f(w_t)
where w_t denotes the model parameters at the t-th iteration, ∇f(w_t) denotes the gradient of the loss function at w_t, and η denotes the learning rate, i.e., the distance the loss function moves in the negative gradient direction during each descent step.
In stochastic gradient descent, only one sample is selected at a time to compute a stochastic gradient, so the time per gradient update is greatly reduced. The update formula is as follows:
w_{t+1} = w_t - η∇f_i(w_t)
where f_i is the loss on the single sample i drawn at random at the t-th iteration.
however, the random gradient of one sample does not represent the gradient of the entire data set, so that the random gradient descent method does not run in the negative direction of the full gradient for each iteration, and the convergence process is relatively jittery. Since the random gradient of a single sample and the full gradient of all samples differ significantly, the number of iterations required for convergence is greatly increased using the random gradient descent algorithm.
The mini-batch gradient descent algorithm is the trade-off between gradient descent and stochastic gradient descent. It updates the model parameters with the gradient of a subset of samples at each iteration, and the update formula is as follows:
w_{t+1} = w_t - η∇f_{ξ_t}(w_t)
where ξ_t denotes the random batch of samples selected at the t-th iteration; assuming a batch size of m, this gives:
w_{t+1} = w_t - (η/m) Σ_{i∈ξ_t} ∇f_i(w_t)
however, the batch size employed by each iteration of the conventional small batch gradient descent algorithm needs to be configured before training begins and remains unchanged throughout the training process. Along with the progress of the local model training process, the model precision is gradually improved, and the self-adaptive selection of the batch size according to the model precision is beneficial to improving the convergence rate.
On the other hand, in active learning, a model can be trained with less data when the selected samples are diverse and feature-rich. Therefore, in FEL, active learning can be referenced to select diverse data for training; when a device holds non-independent and identically distributed (non-IID) data, selecting data with higher diversity can improve both the convergence speed and the convergence accuracy.
At present, the local model accuracy and local model training time of existing federated edge learning have a great influence on the global model aggregation and model updating process, so the batch size used for gradient descent needs to be adjusted automatically during local model training, which improves model accuracy while accelerating algorithm convergence;
the existing federal edge learning does not consider the difference of training completion time caused by the computing power of intelligent equipment and the isomerism of data sets, and waiting for all edge equipment to complete local model training delays the global model aggregation process; the data collected by a device is typically bulky depending on the local environment and the device's own properties, and the data is not independently co-distributed. Therefore, the application provides a two-dimensional user scheduling strategy based on task completion time and data self attribute aiming at the non-independent and same distribution characteristic of user data, and the application further improves the precision and convergence speed of the global model while reducing the waiting time.
Disclosure of Invention
In order to solve the technical problems, the application provides a self-adaptive federal edge learning method based on joint bi-dimensional user scheduling, which can further improve the accuracy and efficiency of the federal edge learning method.
In order to achieve the above object, the present application provides an adaptive federal edge learning method based on joint bi-dimensional user scheduling, including:
s1, acquiring evaluation efficiency of model training based on a loss function and a training period;
s2, acquiring batch data based on the evaluation efficiency, and acquiring a trained initial model based on the batch data;
s3, screening the initial model, and putting the screened model back into the S1 for repeated iteration until a plurality of iteration processes are completed, so as to obtain a final trained model.
Optionally, obtaining the evaluation efficiency of model training includes:
acquiring, based on the loss function, the loss of the current iteration and the loss variation relative to several iterations earlier;
and acquiring the evaluation efficiency based on the loss variation and the training period.
Optionally, the loss variation is:
Δloss=f(x-n)-f(x)
where Δloss is the loss variation, f(x-n) is the loss value from n iterations earlier, and f(x) is the loss value of the current iteration.
Optionally, the evaluation efficiency is:
e = Δloss / t
where e is the evaluation efficiency, Δloss is the loss variation, and t is the training period.
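As a minimal sketch (the function name, the history window n, and the use of a plain list for the loss history are assumptions for illustration), the evaluation efficiency can be computed from the recorded losses as follows:

```python
def evaluation_efficiency(loss_history, n, t):
    """e = Δloss / t, with Δloss = f(x-n) - f(x): the loss drop over the
    last n iterations, normalized by the constant training period t."""
    if len(loss_history) <= n:
        return 0.0               # not enough history yet
    delta_loss = loss_history[-1 - n] - loss_history[-1]
    return delta_loss / t
```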
Optionally, acquiring the batch data includes:
based on the evaluation efficiency, presetting a triggering condition of batch switching;
and randomly partitioning the local data into batches of different sizes and storing them in a list; selecting the smallest batch in the list for the first iteration; calculating the evaluation efficiency after each iteration ends; and, when the acquired evaluation efficiency meets a trigger condition for batch switching, switching to the batch of the corresponding preset value as the batch data.
Optionally, the triggering condition includes: the first trigger condition, the second trigger condition and the third trigger condition;
the first triggering condition is as follows: the n-th evaluation efficiency is smaller than the (n-1)-th evaluation efficiency;
the second triggering condition is as follows: the current evaluation efficiency is lower than the historical evaluation efficiency;
the third triggering condition is as follows: the current evaluation efficiency is negative.
Optionally, switching to the batch of the preset value as the batch data includes:
when the acquired evaluation efficiency meets the first trigger condition, switching to a batch with a first preset value as the batch data;
when the acquired evaluation efficiency meets the second trigger condition, switching to a batch with a second preset value as the batch data;
when the acquired evaluation efficiency meets the third trigger condition, switching to a batch with a third preset value as the batch data;
the first preset value is larger than the second preset value, and the third preset value is larger than the first preset value.
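A minimal sketch of how these three trigger conditions might map onto the preset batch values follows; the function name, the argument names, and the order in which the conditions are checked are assumptions for illustration, not prescribed by the application.

```python
def choose_preset(e_curr, e_prev, e_hist_max, preset1, preset2, preset3):
    """Return the preset batch value to switch to, or None to keep the
    current batch; preset3 > preset1 > preset2, as required above."""
    if e_curr < 0:                               # third condition: cannot converge
        return preset3
    if e_curr < e_hist_max:                      # second condition: below historical efficiency
        return preset2
    if e_prev is not None and e_curr < e_prev:   # first condition: below previous efficiency
        return preset1
    return None
```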
Optionally, screening the initial model includes:
when the initial model is derived from an edge device whose computing capacity is lower than the preset computing capacity, rejecting the initial model;
carrying out diversity analysis on the models remaining after the rejection, and removing the device corresponding to a model when the model's diversity index is lower than a threshold value; otherwise, retaining it.
Optionally, when the initial model originates from an edge device with a lower computing power than a preset computing power, rejecting the initial model includes:
obtaining the locally trained parameters of the initial model from a subset of edge devices with heterogeneous computing capabilities, and setting a longest-time threshold for each device according to its performance;
comparing the local training time of the initial model for each device i in the subset of edge devices with the longest-time threshold specified for device i; if the local training time is not greater than the longest-time threshold specified for device i, retaining device i in the device subset; if the local training time is greater than the longest-time threshold specified for device i, removing device i from the device subset; the devices in the edge-device subset that meet the threshold requirement are updated into a new subset M_1.
Optionally, the performing diversity analysis on the remaining models after the culling includes:
traversing the new subset M_1, computing the diversity index g of each device, and storing it in a diversity index array G; then arranging the diversity indexes in G from large to small and screening the array G from large to small according to the diversity constraint; if the diversity index g of device i in array G is within the diversity constraint, retaining device i in the new subset M_1; if the diversity index g of device i in array G is outside the diversity constraint, removing device i from the new subset M_1; finally, outputting the updated device subset M_2 and using this user scheduling setting in federated learning for the current iteration.
Compared with the prior art, the application has the following advantages and technical effects:
according to the method, based on the evaluation efficiency, the batch data are acquired, and the batch data are determined more accurately, so that the accuracy and the efficiency of the dynamic balance model are improved, and the accuracy and the efficiency of the federal learning algorithm are further improved.
The method and the device can obtain the final trained model and more accurately determine the models used for federated learning aggregation, further improving the accuracy and efficiency of the federated learning algorithm.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a schematic flow chart of the adaptive federal edge learning method of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in fig. 1, the present application provides an adaptive federal edge learning method based on joint bi-dimensional user scheduling, which includes:
s1, acquiring evaluation efficiency of model training based on a loss function and a training period;
s2, acquiring batch data based on the evaluation efficiency, and acquiring a trained initial model based on the batch data;
s3, screening the initial model, putting the screened model back into the S1 for repeated iteration until a plurality of iteration processes are completed, namely when the success rate of the training result reaches a certain height, wherein the screened model is the model after final training.
Further, obtaining the evaluation efficiency of model training includes:
acquiring, based on the loss function, the loss of the current iteration and the loss variation relative to several iterations earlier;
and acquiring the evaluation efficiency based on the loss variation and the training period.
Further, obtaining the batch data includes:
based on the evaluation efficiency, presetting a triggering condition of batch switching;
and randomly partitioning the local data into batches of different sizes and storing them in a list; selecting the smallest batch in the list for the first iteration; calculating the evaluation efficiency after each iteration ends; and, when the acquired evaluation efficiency meets a trigger condition for batch switching, switching to the batch of the corresponding preset value as the batch data.
Further, the triggering condition includes: the first trigger condition, the second trigger condition and the third trigger condition;
the first triggering condition is as follows: the n-th evaluation efficiency is smaller than the (n-1)-th evaluation efficiency;
the second triggering condition is as follows: the current evaluation efficiency is lower than the historical evaluation efficiency;
the third triggering condition is as follows: the current evaluation efficiency is negative.
Further, switching to a batch of a preset value as the batch data includes:
when the acquired evaluation efficiency meets the first trigger condition, switching to a batch with a first preset value as the batch data;
when the acquired evaluation efficiency meets the second trigger condition, switching to a batch with a second preset value as the batch data;
when the acquired evaluation efficiency meets the third trigger condition, switching to a batch with a third preset value as the batch data;
the first preset value is larger than the second preset value, and the third preset value is larger than the first preset value.
Further, screening the initial model includes:
when the initial model is derived from an edge device whose computing capacity is lower than the preset computing capacity, rejecting the initial model;
carrying out diversity analysis on the models remaining after the rejection, and removing the device corresponding to a model when the model's diversity index is lower than a threshold value; otherwise, retaining it.
Further, when the initial model is derived from an edge device with a lower computing power than a preset computing power, rejecting the initial model includes:
obtaining the locally trained parameters of the initial model from a subset of edge devices with heterogeneous computing capabilities, and setting a longest-time threshold for each device according to its performance;
comparing the local training time of the initial model for each device i in the subset of edge devices with the longest-time threshold specified for device i; if the local training time is not greater than the longest-time threshold specified for device i, retaining device i in the device subset; if the local training time is greater than the longest-time threshold specified for device i, removing device i from the device subset; the devices in the edge-device subset that meet the threshold requirement are updated into a new subset M_1.
Further, the diversity analysis of the models remaining after the rejection comprises:
traversing the new subset M_1, computing the diversity index g of each device, and storing it in a diversity index array G; then arranging the diversity indexes in G from large to small and screening the array G from large to small according to the diversity constraint; if the diversity index g of device i in array G is within the diversity constraint, retaining device i in the new subset M_1; if the diversity index g of device i in array G is outside the diversity constraint, removing device i from the new subset M_1; finally, outputting the updated device subset M_2 and using this user scheduling setting in federated learning for the current iteration.
Examples
Traditional centralized machine learning algorithms require large amounts of data to be uploaded to a central node, and large-scale data transmission results in long transmission times and congestion. In addition, traditional distributed machine learning algorithms require the training data set to be uploaded centrally, evenly partitioned, and then redistributed to multiple worker nodes, which easily causes privacy leakage. The adaptive dynamic-batch gradient descent algorithm combined with the two-dimensional user scheduling strategy provides a solution to these problems. For example, in the context of the industrial internet, traditional factories need to be converted into intelligent factories, but because a factory's production and manufacturing data are commercial secrets, there are high requirements on data privacy and security. Therefore, the algorithm of this embodiment trains the factory's intelligent production model on the factory's local data, keeps the confidential data on the factory's local server, and only transmits the intelligent production model to the cloud server, which greatly reduces the risk of data leakage. In addition, because the data of a single intelligent factory is limited in volume and uniform in structure, the accuracy of the intelligent production model would suffer greatly during training; the algorithm of this embodiment therefore uploads the intelligent production models trained by several intelligent factories of the same type to a cloud server for federated learning aggregation, which greatly improves the accuracy of the intelligent production model in each individual factory. The algorithm of this embodiment can therefore ensure both security and efficiency in practical applications.
As shown in fig. 1, the embodiment provides an adaptive federal edge learning method based on joint bi-dimensional user scheduling, which specifically includes the steps of:
1. adaptive dynamic batch gradient descent algorithm
Edge device: the evaluation efficiency is determined according to the loss function and the running time, where the loss function reflects the accuracy of the model on historical batch data and the model is obtained by fitting samples to labels (if a simplified formula y = Ax + B is assumed, x is the sample, y is the label, and A and B are the model parameters); the batch data (batch size) is determined based on the historical batch data, the evaluation efficiency, and the historical evaluation efficiency, and the batch data is used to determine the dynamically updated model. For example, the local data of an intelligent factory includes sample data and labels, so the factory can train its intelligent production model locally, with the intelligent factory acting as the edge device.
Step (1): loss prediction: based on the sublinear convergence behavior of the gradient descent algorithm, calculate through the loss function the variation Δloss between the loss of the current iteration and the loss from n iterations earlier.
Δloss=f(x-n)-f(x)。
Where Δloss is the loss variation, f() is the loss function, f(x) is the loss value of the current iteration, and f(x-n) is the loss value from n iterations earlier; the formula therefore gives the change in loss over the last n iterations.
Step (2): efficiency evaluation: the training period t is a constant time threshold used for evaluating efficiency and can be set as needed. The algorithm efficiency e = Δloss / t is used to evaluate the model training effect of the previous n iterations under the same batch, and is determined by the loss variation and the training period.
Step (3): the dynamic-fitting gradient descent algorithm determines the trigger conditions for batch switching through the efficiency evaluation parameter. The local data are randomly partitioned into batches of different sizes and stored in a list L, and the algorithm selects the smallest batch to start the first iteration. The algorithm efficiency e is calculated after each iteration ends. When the n-th algorithm efficiency e is smaller than the (n-1)-th algorithm efficiency e, a switch to a larger batch as the batch data is triggered. To avoid local optima, the algorithm allows switching back to a smaller batch as the batch data when the current algorithm efficiency is lower than the historical efficiency. When the current algorithm efficiency is negative, the current batch evidently cannot make the algorithm converge normally, so the algorithm switches to a larger batch as the batch data and also prevents that batch from being visited again in subsequent training, preventing the algorithm from oscillating.
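The loop below is a minimal sketch of this dynamic-batch procedure. The function and variable names, the toy local trainer, the candidate batch sizes, and the exact order in which the three switching conditions are checked are all assumptions for illustration, not the application's prescribed implementation.

```python
def evaluation_efficiency(loss_history, n, t):
    """e = Δloss / t over the last n iterations (0.0 until enough history)."""
    if len(loss_history) <= n:
        return 0.0
    return (loss_history[-1 - n] - loss_history[-1]) / t

def adaptive_batch_training(L, train_one_iter, n=5, t=1.0, rounds=100):
    """L: list of candidate batch sizes (ascending); train_one_iter(b) -> loss."""
    banned = set()                      # batches that produced negative efficiency
    idx = 0                             # start from the smallest batch in L
    losses, best_e, prev_e = [], float("-inf"), None
    for _ in range(rounds):
        losses.append(train_one_iter(L[idx]))
        e = evaluation_efficiency(losses, n, t)
        if e < 0:                       # cannot converge: larger batch, never revisit this one
            banned.add(L[idx])
            idx = next((j for j in range(idx + 1, len(L)) if L[j] not in banned), idx)
        elif e < best_e:                # below historical efficiency: switch back to a smaller batch
            idx = next((j for j in range(idx - 1, -1, -1) if L[j] not in banned), idx)
        elif prev_e is not None and e < prev_e:   # below previous efficiency: larger batch
            idx = next((j for j in range(idx + 1, len(L)) if L[j] not in banned), idx)
        best_e, prev_e = max(best_e, e), e
    return losses

# Toy usage: a fake local trainer whose loss decays with the iteration count.
state = {"k": 0}
def fake_train(batch_size):
    state["k"] += 1
    return 1.0 / (state["k"] * batch_size ** 0.5)

history = adaptive_batch_training([16, 32, 64, 128], fake_train)
```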
2. Two-dimensional user scheduling policy
The central server receives models from multiple edge devices; the central server rejects the models of some edge devices, and the remaining models are used for federated learning aggregation. For example, intelligent factories of the same type upload their respective trained intelligent production models to the cloud, and after federated learning aggregation the models are delivered back to the intelligent factories for a new round of federated edge learning.
The elimination method used by the central server comprises the following two steps:
the method comprises the following steps: the difference of edge equipment is reduced, and the purpose is to improve the speed and reduce the calculation time. When the model is derived from the edge equipment with weaker computing power, rejecting the model; otherwise, the reservation is made.
The main process is as follows: the algorithm first obtains the local-model training parameters from a subset M of edge devices with heterogeneous computing power, and sets the longest-time threshold array T specified by the user scheduling strategy according to the performance of each device. The local model training time of each device i in M is then compared with the longest-time threshold specified for device i. If the local training time is less than or equal to the threshold, device i is retained in the device subset M; if the local training time is greater than the threshold, device i is removed from the device subset M. The devices in subset M that meet the threshold requirement are updated into a new subset M_1.
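A minimal sketch of this first screening step, assuming the per-device training times and thresholds have already been measured and stored in dictionaries (the names and numbers are illustrative):

```python
def filter_by_training_time(M, train_time, T):
    """Keep device i only if its local training time is within the
    longest-time threshold T[i] set by the user scheduling strategy."""
    return [i for i in M if train_time[i] <= T[i]]

# Toy usage with assumed measurements (seconds).
M = ["dev0", "dev1", "dev2"]
train_time = {"dev0": 3.1, "dev1": 9.4, "dev2": 2.0}
T = {"dev0": 5.0, "dev1": 5.0, "dev2": 5.0}
M1 = filter_by_training_time(M, train_time, T)   # -> ["dev0", "dev2"]
```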
The local model and the data set are stored on the device. For example, if device i is rejected in federated edge learning, the local model on device i does not participate in the federated aggregation; rejecting device i therefore means its model need not be considered separately.
Method 2: improve the diversity of the edge-device data sets, in order to improve accuracy. When the diversity index of a model is lower than a threshold, the model is rejected; otherwise, it is retained. The diversity index may optionally be the Gini-Simpson index or the Shannon entropy index; the present scheme is not limited to either.
The main process is to traverse the subset M_1, compute the diversity index g of the data set on each device, and store it in the diversity index array G. The diversity indexes in G are then arranged from large to small, and the array G is screened from large to small according to the diversity constraint. If the diversity index g of device i in array G is within the diversity constraint, device i is retained in the device subset M_1; if the diversity index g of device i in array G is outside the diversity constraint, device i is removed from the device subset M_1. Finally, the updated device subset M_2 is output, and this user scheduling setting is used in federated learning for the current iteration.
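A minimal sketch of this second screening step follows; here the "diversity constraint" is read as keeping the top-k devices by diversity index, which is an assumption on my part since the document does not spell the constraint out:

```python
def filter_by_diversity(devices, diversity_index, k):
    """Keep the k devices with the largest data-set diversity index."""
    ranked = sorted(devices, key=lambda i: diversity_index[i], reverse=True)
    return ranked[:k]

# Toy usage on the subset M1 produced by the first screening step.
M1 = ["dev0", "dev2", "dev3"]
g = {"dev0": 0.61, "dev2": 0.42, "dev3": 0.55}    # assumed diversity values
M2 = filter_by_diversity(M1, g, k=2)               # -> ["dev0", "dev3"]
```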
By performing diversity analysis on the data set of each device, the device corresponding to a model is rejected when the model's diversity index is lower than the threshold; otherwise, it is retained.
The formula of the Gini-Simpson index is:
g = 1 - Σ_{c=1}^{C} p_c²
where C is the total number of categories and p_c is the probability of category c.
The formula of the Shannon entropy index is:
g = - Σ_{c=1}^{C} p_c log p_c
where C is the total number of categories and p_c is the probability of category c.
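As a small illustration, both indices can be computed from the label distribution of a device's local data set; this is only a sketch, and the function names are illustrative:

```python
import math
from collections import Counter

def class_probabilities(labels):
    counts = Counter(labels)
    return [n / len(labels) for n in counts.values()]

def gini_simpson(labels):
    """Gini-Simpson index: 1 - sum_c p_c^2 (larger = more diverse)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def shannon_entropy(labels):
    """Shannon entropy index: -sum_c p_c * log(p_c) (larger = more diverse)."""
    return -sum(p * math.log(p) for p in class_probabilities(labels))

print(gini_simpson([0, 0, 1, 2, 2, 2]))     # ~0.611
print(shannon_entropy([0, 0, 1, 2, 2, 2]))  # ~1.011
```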
In summary, the main technical scheme of this embodiment is as follows:
1. Adaptive dynamic batch gradient descent algorithm: edge device: the evaluation efficiency is determined according to the loss function and the running time, where the loss function reflects the accuracy of the model on historical batch data and the model is obtained by fitting samples to labels; the batch data (batch size) is determined from the historical batch data, the evaluation efficiency, and the historical evaluation efficiency, and the batch data is used to determine the model.
2. Reduce the disparity among edge devices, in order to improve speed and reduce computation time. When a model comes from an edge device with weaker computing power, the model is rejected; otherwise, it is retained.
3. Improve the diversity of the edge-device data sets, in order to improve accuracy. When the diversity index of a model is lower than a threshold, the model is rejected; otherwise, it is retained. Optionally, the diversity index may be the Gini-Simpson index or the Shannon entropy index; this embodiment is not limited to either.
The beneficial effects of this embodiment are:
1. The batch data is determined more accurately, so that model accuracy and training efficiency can be dynamically balanced and further improved.
2. The models used for federated learning aggregation are determined more accurately, and the accuracy and efficiency of the federated learning algorithm can be further improved.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (4)

1. The self-adaptive federal edge learning method based on the joint bi-dimensional user scheduling is characterized by comprising the following steps of:
s1, acquiring evaluation efficiency of model training based on a loss function and a training period;
s2, acquiring batch data based on the evaluation efficiency, and acquiring a trained initial model based on the batch data;
the obtaining of the batch data includes:
based on the evaluation efficiency, presetting a triggering condition of batch switching;
randomly partitioning local data into batches of different sizes and storing them in a list, selecting the smallest batch in the list to start the first iteration, calculating the evaluation efficiency after each iteration ends, and switching to a batch with a preset value as the batch data when the acquired evaluation efficiency meets a triggering condition for batch switching;
the triggering conditions include: the first trigger condition, the second trigger condition and the third trigger condition;
the first triggering condition is as follows: the n-th evaluation efficiency is smaller than the (n-1)-th evaluation efficiency;
the second triggering condition is as follows: the current evaluation efficiency is lower than the historical evaluation efficiency;
the third triggering condition is as follows: the current evaluation efficiency is negative;
the batch switched to the preset value is used as the batch data and comprises the following steps:
when the acquired evaluation efficiency meets the first trigger condition, switching to a batch with a first preset value as the batch data;
when the acquired evaluation efficiency meets the second trigger condition, switching to a batch with a second preset value as the batch data;
when the acquired evaluation efficiency meets the third trigger condition, switching to a batch with a third preset value as the batch data;
the first preset value is larger than the second preset value, and the third preset value is larger than the first preset value;
s3, screening the initial model, and putting the screened model back into the S1 for repeated iteration until a plurality of iteration processes are completed, so as to obtain a final trained model;
screening the initial model includes:
when the initial model is derived from an edge device whose computing capacity is lower than the preset computing capacity, rejecting the initial model;
carrying out diversity analysis on the models remaining after the rejection, and removing the edge device corresponding to a model when the model's diversity index is lower than a threshold value; otherwise, retaining it;
when the initial model is derived from the edge equipment with lower than preset computing power, rejecting the initial model comprises:
obtaining the locally trained parameters of the initial model from a subset of edge devices with heterogeneous computing capabilities, and setting a longest time threshold for each edge device according to its performance;
comparing the local training time of said initial model for each edge device i in the subset of edge devices with the longest time threshold specified for edge device i; if the local training time is not greater than the longest time threshold specified for edge device i, retaining edge device i in the subset of edge devices; if the local training time is greater than the longest time threshold specified for edge device i, removing edge device i from the subset of edge devices; the edge devices in the subset of edge devices that meet the threshold requirement are updated into a new subset M_1;
The diversity analysis of the models remaining after the rejection comprises the following steps:
traversing the new subset M_1, obtaining the diversity index g of each edge device, and storing it in a diversity index array G; then arranging the diversity indexes in G from large to small and screening the array G from large to small according to the diversity constraint; if the diversity index g of edge device i in array G is within the diversity constraint, retaining edge device i in the new subset M_1; if the diversity index g of edge device i in array G is outside the diversity constraint, removing edge device i from the new subset M_1; finally, outputting the updated edge device subset M_2 and using this user scheduling setting in federal learning for the current iteration.
2. The adaptive federal edge learning method based on joint bi-dimensional user scheduling of claim 1, wherein obtaining the evaluation efficiency of model training comprises:
acquiring, based on the loss function, the loss of the current iteration and the loss variation relative to several iterations earlier;
and acquiring the evaluation efficiency based on the loss variation and the training period.
3. The adaptive federal edge learning method based on joint bi-dimensional user scheduling according to claim 2, wherein the loss variation is:
Δloss = f(x-n) - f(x)
wherein Δloss is the loss variation, f(x-n) is the loss value from n iterations earlier, and f(x) is the loss value of the current iteration.
4. The adaptive federal edge learning method based on joint bi-dimensional user scheduling according to claim 1, wherein the evaluation efficiency is:
e = Δloss / t
wherein e is the evaluation efficiency, Δloss is the loss variation, and t is the training period.
CN202310050202.6A 2023-02-01 2023-02-01 Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling Active CN116050540B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310050202.6A CN116050540B (en) 2023-02-01 2023-02-01 Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310050202.6A CN116050540B (en) 2023-02-01 2023-02-01 Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling

Publications (2)

Publication Number Publication Date
CN116050540A CN116050540A (en) 2023-05-02
CN116050540B true CN116050540B (en) 2023-09-22

Family

ID=86113053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310050202.6A Active CN116050540B (en) 2023-02-01 2023-02-01 Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling

Country Status (1)

Country Link
CN (1) CN116050540B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610960B (en) * 2023-07-20 2023-10-13 北京万界数据科技有限责任公司 Monitoring management system for artificial intelligence training parameters
CN116701478B (en) * 2023-08-02 2023-11-24 蘑菇车联信息科技有限公司 Course angle determining method, course angle determining device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401552A (en) * 2020-03-11 2020-07-10 浙江大学 Federal learning method and system based on batch size adjustment and gradient compression rate adjustment
CN112181666A (en) * 2020-10-26 2021-01-05 华侨大学 Method, system, equipment and readable storage medium for equipment evaluation and federal learning importance aggregation based on edge intelligence
CN114327889A (en) * 2021-12-27 2022-04-12 吉林大学 Model training node selection method for layered federated edge learning
CN115221955A (en) * 2022-07-15 2022-10-21 东北大学 Multi-depth neural network parameter fusion system and method based on sample difference analysis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114981795A (en) * 2020-01-14 2022-08-30 Oppo广东移动通信有限公司 Resource scheduling method and device and readable storage medium
CN114217933A (en) * 2021-12-27 2022-03-22 北京百度网讯科技有限公司 Multi-task scheduling method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401552A (en) * 2020-03-11 2020-07-10 浙江大学 Federal learning method and system based on batch size adjustment and gradient compression rate adjustment
CN112181666A (en) * 2020-10-26 2021-01-05 华侨大学 Method, system, equipment and readable storage medium for equipment evaluation and federal learning importance aggregation based on edge intelligence
CN114327889A (en) * 2021-12-27 2022-04-12 吉林大学 Model training node selection method for layered federated edge learning
CN115221955A (en) * 2022-07-15 2022-10-21 东北大学 Multi-depth neural network parameter fusion system and method based on sample difference analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tianyi Zhou et al., "Multi-server federated edge learning for low power consumption wireless resource allocation based on user QoE," Journal of Communications and Networks, vol. 23, no. 6, pp. 463-472 *
Zhou Tianyi et al., "Low-power bandwidth allocation and user scheduling for federated edge learning," Journal of Beijing Information Science & Technology University, vol. 37, no. 1, pp. 27-33 *

Also Published As

Publication number Publication date
CN116050540A (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN116050540B (en) Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling
US20210133534A1 (en) Cloud task scheduling method based on phagocytosis-based hybrid particle swarm optimization and genetic algorithm
Liu et al. A reinforcement learning-based resource allocation scheme for cloud robotics
CN113191503B (en) Decentralized distributed learning method and system for non-shared data
CN108989098B (en) Time delay optimization-oriented scientific workflow data layout method in hybrid cloud environment
CN106250381A (en) The row sequence optimized for input/output in list data
CN115934333A (en) Historical data perception-based cloud computing resource scheduling method and system
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN112329820A (en) Method and device for sampling unbalanced data under federal learning
CN116362329A (en) Cluster federation learning method and device integrating parameter optimization
CN114650228A (en) Federal learning scheduling method based on computation unloading in heterogeneous network
CN113887748B (en) Online federal learning task allocation method and device, and federal learning method and system
CN112149990A (en) Fuzzy supply and demand matching method based on prediction
CN112990420A (en) Pruning method for convolutional neural network model
CN110275868A (en) A kind of multi-modal pretreated method of manufaturing data in intelligent plant
CN112232401A (en) Data classification method based on differential privacy and random gradient descent
CN114401192B (en) Multi-SDN controller cooperative training method
CN105187488A (en) Method for realizing MAS (Multi Agent System) load balancing based on genetic algorithm
CN104507150A (en) Method for clustering virtual resources in baseband pooling
CN111290853B (en) Cloud data center scheduling method based on self-adaptive improved genetic algorithm
CN116257361B (en) Unmanned aerial vehicle-assisted fault-prone mobile edge computing resource scheduling optimization method
CN112286689A (en) Cooperative shunting and storing method suitable for block chain workload certification
CN105765569B (en) A kind of data distributing method, loading machine and storage system
CN112306641B (en) Training method for virtual machine migration model
CN117221122B (en) Asynchronous layered joint learning training method based on bandwidth pre-allocation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant