CN117808129A - Heterogeneous distributed learning method, device, equipment, system and medium - Google Patents


Info

Publication number
CN117808129A
Authority
CN
China
Prior art keywords
compression
model
layer
compressed
machine learning
Prior art date
Legal status
Granted
Application number
CN202410230139.9A
Other languages
Chinese (zh)
Other versions
CN117808129B (en)
Inventor
赵坤
范宝余
张润泽
曹芳
郭振华
李仁刚
赵雅倩
鲁璐
王立
贺蒙
Current Assignee
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202410230139.9A
Publication of CN117808129A
Application granted
Publication of CN117808129B
Status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a heterogeneous distributed learning method, device, equipment, system and medium, which relate to the technical field of computers. The method is applied to an edge device, wherein a federal learning system comprises a plurality of edge devices and the plurality of edge devices are divided into a plurality of device clusters, and the method comprises the following steps: performing iterative training on a local machine learning model by using local training data, wherein the machine learning models in the edge devices correspond to the same model structure and reasoning task; compressing the trained machine learning model, and transmitting the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the edge device is located; and when the edge device itself belongs to the cluster head edge devices, performing cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters, and sending the cluster aggregation model parameters to an edge cloud server for global aggregation. Communication traffic in the federal learning process can be reduced through device clustering and model compression, so that communication efficiency can be improved.

Description

Heterogeneous distributed learning method, device, equipment, system and medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a heterogeneous distributed learning method, apparatus, device, system, and medium.
Background
Federal learning is a common way of training a distributed model. The system involved in this training approach typically includes multiple training devices deployed with the same machine learning model and a single central server. In the actual training process, each training device firstly uses local training data to train a local machine learning model; when the training device completes the local training, the local model parameters can be sent to the central server, the central server averages the received model parameters, updates the global model by using the average parameters, and then sends the updated global model parameters back to each device. This process is repeated in a number of iterations until a predetermined stop condition is reached. In the edge computing scenario, the training device may also be an edge device, and the central server may also be an edge cloud server.
In the related art, since all machine learning model parameters are aggregated at the edge cloud server, good communication efficiency cannot be achieved; when the number of edge devices is large, the edge cloud server may even be unable to aggregate the model parameters at all.
Disclosure of Invention
The invention aims to provide a heterogeneous distributed learning method, a heterogeneous distributed learning device, a heterogeneous distributed learning equipment, a heterogeneous distributed learning system and a heterogeneous distributed learning medium, which can reduce the traffic in the federal learning process through equipment clustering and model compression, so that the communication efficiency can be improved.
In order to solve the above technical problems, the present invention provides a heterogeneous distributed learning method applied to an edge device, wherein a federal learning system includes a plurality of edge devices, and the plurality of edge devices are divided into a plurality of device clusters, the method includes:
performing iterative training on a local machine learning model by using local training data; the machine learning model in each edge device corresponds to the same model structure and reasoning task;
compressing the trained machine learning model, and transmitting the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the edge device is located;
when the edge device itself belongs to the cluster head edge devices, performing cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters, and sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
Optionally, the compressing the trained machine learning model includes:
Each network layer in the trained machine learning model is subjected to layer-by-layer traversal compression and/or all network layers in the trained machine learning model are subjected to overall compression.
Optionally, the performing layer-by-layer traversal compression on each network layer in the trained machine learning model includes:
setting a full-scale model and a compression model by using the trained machine learning model;
setting a first network layer in the compression model as a layer to be compressed, and creating state information for the layer to be compressed; the state information comprises self attribute information and compression state information of the layer to be compressed;
inputting the state information to a hierarchical compression decision device to obtain the compression rate of the layer to be compressed, and compressing the layer to be compressed according to the compression rate;
when the layer to be compressed is compressed, training the compression model by utilizing the local training data to restore the precision of the compression model;
judging whether a compression stop condition is satisfied;
if yes, setting the next network layer in the compression model as the layer to be compressed, and entering a step of creating state information for the layer to be compressed;
If not, updating the compression state information in the state information of the layer to be compressed, and entering a step of inputting the state information into a hierarchical compression decision device based on the updated state information to obtain the compression rate of the layer to be compressed.
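For illustration, the layer-by-layer traversal compression described above can be outlined with the following Python-style sketch; the helper callables (make_state, update_state, compress_layer, fine_tune, stop_condition) are placeholders introduced here for readability and are not part of the claimed method.

def layer_by_layer_compress(comp_model, full_model, decider, train_data, num_layers,
                            make_state, update_state, compress_layer, fine_tune, stop_condition):
    # Traverse the compression model layer by layer; each layer is repeatedly compressed
    # at the rate chosen by the hierarchical compression decision maker and fine-tuned on
    # local training data until the compression stop condition is met.
    for layer_idx in range(num_layers):
        state = make_state(comp_model, full_model, layer_idx)   # self attributes + compression state
        while True:
            rate = decider(state)                               # compression rate for this layer
            compress_layer(comp_model, layer_idx, rate)
            fine_tune(comp_model, train_data)                   # restore accuracy of the compression model
            if stop_condition(comp_model, layer_idx):
                break                                           # move on to the next layer to be compressed
            state = update_state(state, comp_model, full_model, layer_idx)
    return comp_model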
Optionally, the self attribute information includes the number of layers of the layer to be compressed, the type of the network layer, the input scale of the feature map and the output scale of the feature map, the compression state information includes the compression rate of the layer to be compressed, the variation degree of the feature map and the precision of the compression model, and the variation degree of the feature map represents the variation degree between the layer to be compressed and the corresponding network layer in the full-scale model;
the updating the compression state information in the state information of the layer to be compressed includes:
updating the compression rate in the compression state information to the compression rate currently output by the hierarchical compression decision maker;
updating the variation degree of the feature map in the compression state information by utilizing the compressed layer to be compressed and the corresponding network layer in the full-scale model;
and determining the latest precision of the compression model, and updating the precision in the compression state information by using the latest precision.
Optionally, the updating the variability of the feature map in the compression state information by using the compressed layer to be compressed and the corresponding network layer in the full-scale model includes:
setting the characteristic diagram of the compressed layer to be compressed as a first characteristic diagram, and setting the characteristic diagram of the network layer corresponding to the layer to be compressed in the full-scale model as a second characteristic diagram;
determining element difference values between each element in the first feature map and the corresponding element in the second feature map, and summing absolute values of all element difference values to obtain a total element difference value;
and determining the average element difference value of each element by using the total element difference value and the number of elements contained in the first feature map, and updating the feature map variability by using the average element difference value.
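As a concrete illustration of the computation above, a minimal NumPy sketch (array names are hypothetical) could be:

import numpy as np

def feature_map_variability(first_fm, second_fm):
    # first_fm: feature map of the compressed layer to be compressed
    # second_fm: feature map of the corresponding network layer in the full-scale model
    total_diff = np.abs(first_fm - second_fm).sum()   # sum of absolute element differences
    return total_diff / first_fm.size                 # average element difference value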
Optionally, the compressing the layer to be compressed according to the compression rate includes:
and determining the compression grade corresponding to the compression rate, and compressing the layer to be compressed by utilizing a compression mode corresponding to the compression grade.
Optionally, the determining the compression level corresponding to the compression rate, and compressing the layer to be compressed by using a compression mode corresponding to the compression level includes:
When the compression rate is determined to be located in a first preset interval, weight pruning is carried out on the layer to be compressed; the lower limit value of the first preset interval is 0;
when the compression rate is determined to be positioned in a second preset interval, weight pruning and low-rank approximate compression are sequentially carried out on the layer to be compressed; the lower limit value of the second preset interval is the upper limit value of the first preset interval;
when the compression rate is determined to be positioned in a third preset interval, weight pruning, low-rank approximate compression and quantization compression are sequentially carried out on the layer to be compressed; the lower limit value of the third preset interval is the upper limit value of the second preset interval.
Optionally, after training the compression model using the local training data, further comprising:
when the hierarchical compression decision maker has not completed training, determining the loss value of the hierarchical compression decision maker for the current compression through the following formula:
wherein Delta_Acc represents the loss value; the formula further involves the current accuracy of the compression model, the accuracy of the full-scale model, Parameter-Reduction (the number of parameters removed by the compression), Total-Parameter (the total number of parameters of the full-scale model) and a regulatory factor;
and updating a model of the hierarchical compression decision maker by utilizing the loss value.
Optionally, the determining whether the compression stop condition is satisfied includes:
when the hierarchical compression decision maker does not complete training, judging whether the current precision of the compression model is smaller than a preset minimum precision value or not;
if the current precision is smaller than the preset minimum precision value, judging that the compression stopping condition is met, and rolling back the layer to be compressed to the previous version;
if the current precision is not less than the preset minimum precision value, judging whether the current compression rate of the layer to be compressed is less than the preset compression rate or not;
if the current compression rate is smaller than the preset compression rate, adding 1 to the duration of the low compression rate corresponding to the layer to be compressed, and judging whether the duration of the low compression rate is greater than the preset number of times; if the duration of the low compression rate is greater than the preset number of times, judging that the compression stopping condition is met; if the duration of the low compression rate is not greater than the preset number of times, judging that the compression stopping condition is not met; the initial value of the duration of the low compression rate is 0;
and if the current compression rate is not less than the preset compression rate, judging that the compression stopping condition is not met.
Optionally, after determining that the duration of the low compression ratio is greater than the preset number of times, the method further includes:
judging whether the duration times of the low compression rate corresponding to each network layer in the compression model are all larger than the preset times;
if yes, the hierarchical compression decision maker is judged to finish training.
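The stop-condition logic used while the hierarchical compression decision maker is still training can be sketched as follows; this is a simplified illustration, and the variable names are assumptions rather than terms from the patent.

def check_stop(current_acc, min_acc, current_rate, preset_rate, low_rate_count, preset_times):
    # Returns (stop, rollback, low_rate_count) following the judgement above: stop and roll
    # the layer back if accuracy dropped below the preset minimum; otherwise stop only once
    # the compression rate has stayed below the preset rate more than the preset number of times.
    if current_acc < min_acc:
        return True, True, low_rate_count
    if current_rate < preset_rate:
        low_rate_count += 1
        return low_rate_count > preset_times, False, low_rate_count
    return False, False, low_rate_count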
Optionally, the determining whether the compression stop condition is satisfied includes:
when the hierarchical compression decision device has completed training, judging whether the number of times the layer to be compressed has been compressed reaches a preset number of times;
if yes, judging that the compression stopping condition is met;
if not, it is determined that the compression stop condition is not satisfied.
Optionally, the overall compressing of all network layers in the trained machine learning model includes:
setting a full-scale model and a compression model by using the trained machine learning model;
setting each network layer in the compression model as a layer to be compressed, and creating state information for each layer to be compressed to obtain a state information sequence; the state information comprises self attribute information and compression state information of the layer to be compressed, and the compression state information represents the compression degree of the layer to be compressed compared with a corresponding network layer in the full-scale model;
Inputting the state information sequence to a global compression decision device to obtain the compression rate of each layer to be compressed, and compressing each layer to be compressed according to the compression rate; the global compression decision maker comprises a sequence processing unit, wherein the sequence processing unit is used for extracting the dependency relationship in the state information sequence;
when the compression model is compressed, training the compression model by utilizing the local training data to restore the precision of the compression model;
judging whether a compression stop condition is satisfied;
if yes, the compression is exited;
if not, updating the compression state information in the state information of each layer to be compressed, and entering a step of inputting the state information sequence into a global compression decision device based on the updated state information to obtain the compression rate of each layer to be compressed.
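The overall compression procedure above can likewise be sketched in Python; the sequence decision maker and the helper callables are placeholders, and the sequence processing unit (for example an LSTM or GRU, as discussed later) is only indicated in a comment.

def overall_compress(comp_model, full_model, seq_decider, train_data,
                     build_states, apply_rates, fine_tune, stop_condition):
    # build_states returns one state-information entry per layer to be compressed;
    # seq_decider (e.g. built around an LSTM/GRU sequence processing unit) maps the whole
    # state sequence to a compression rate for every layer at once.
    states = build_states(comp_model, full_model)
    while True:
        rates = seq_decider(states)                       # one compression rate per layer
        apply_rates(comp_model, rates)                    # compress every layer to be compressed
        fine_tune(comp_model, train_data)                 # restore the accuracy of the compression model
        if stop_condition(comp_model):
            break
        states = build_states(comp_model, full_model)     # refresh the compression state information
    return comp_model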
Optionally, after training the compression model using the local training data, further comprising:
when the global compression decision maker has not completed training, determining the reward value of the global compression decision maker for the current compression through the following formula:
wherein the reward value is computed from the strategy output by the global compression decision maker in the current compression (the strategy comprises the compression rate of each layer to be compressed in the current compression) and a validation set used to determine the accuracy of the full-scale model and the compression model; the first reward term is determined by a preset accuracy-loss threshold b, the difference between the accuracy of the full-scale model and the current accuracy of the compression model, and a regulatory factor;
the second reward term is determined by the ratio between the floating point operations of the compression model and the floating point operations of the full-scale model;
And updating a model of the global compression decision maker by using the reward value.
Optionally, the determining whether the compression stop condition is satisfied includes:
when the global compression decision device does not complete training, judging whether the current precision of the compression model is smaller than a preset minimum precision value or not;
if the current precision is smaller than the preset minimum precision value, judging that the compression stopping condition is met, and rolling back the compression model to the previous version;
if the current precision is not less than the preset minimum precision value, judging whether the current compression rate of each layer to be compressed is less than the preset compression rate or not;
if the current compression rate is smaller than the preset compression rate, adding 1 to the duration of the low compression rate corresponding to the compression model, and judging whether the duration of the low compression rate is greater than the preset number of times; if the duration of the low compression rate is greater than the preset number of times, judging that the compression stopping condition is met, and judging that the global compression decision device has completed training; if the duration of the low compression rate is not greater than the preset number of times, judging that the compression stopping condition is not met; the initial value of the duration of the low compression rate is 0;
And if the current compression rate is not less than the preset compression rate, judging that the compression stopping condition is not met.
Optionally, the determining whether the compression stop condition is satisfied includes:
when the global compression decision maker has completed training, judging whether the number of overall compressions performed on the compression model reaches a preset number of times;
if yes, judging that the compression stopping condition is met;
if not, it is determined that the compression stop condition is not satisfied.
Optionally, the performing layer-by-layer traversal compression on each network layer in the trained machine learning model and performing overall compression on all network layers in the trained machine learning model includes:
performing layer-by-layer traversal compression on each network layer in the trained machine learning model, and performing overall compression on all network layers in the trained machine learning model after the layer-by-layer traversal compression is finished.
Optionally, in performing iterative training on the local machine learning model by using the local training data, the method further includes:
receiving a test data set issued by an edge cloud server; each edge device corresponds to the same test data set;
inputting the test data set into a local machine learning model to obtain an reasoning result corresponding to the test data set;
The reasoning result is sent to the edge cloud server, and equipment cluster information issued by the edge cloud server is received; the edge cloud server divides each of the edge devices into a plurality of device clusters by using training data similarity determined by an inference result of each of the edge devices.
Optionally, the device cluster information includes a weighted undirected graph of the device cluster, each node in the weighted undirected graph corresponds to each edge device in the device cluster, edges are set between the nodes if and only if the training data similarity between the corresponding edge devices is greater than a first threshold, and the weight of the edges is the training data similarity between the corresponding edge devices;
after iteratively training the local machine learning model using the local training data, further comprising:
determining edge equipment corresponding to a node connected with the node of the weighted undirected graph as neighbor edge equipment corresponding to the node;
and broadcasting local machine learning model parameters to the neighbor edge equipment, and utilizing the machine learning model sent by the neighbor edge equipment to aggregate and update the local machine learning model parameters.
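As an illustration, building the neighbor relation from a training-data similarity matrix and aggregating with neighbors could look like the following sketch; the plain (unweighted) mean used here is an assumption, since the exact neighbor aggregation rule is not fixed above.

import numpy as np

def build_neighbors(similarity, first_threshold):
    # An edge exists between two edge devices if and only if their training-data similarity
    # exceeds the first threshold; the edge weight is that similarity.
    n = similarity.shape[0]
    return {i: [j for j in range(n) if j != i and similarity[i, j] > first_threshold]
            for i in range(n)}

def aggregate_with_neighbors(params, neighbors, i):
    # Update device i's local parameters with the parameters broadcast by its neighbors
    # (simple unweighted mean; a weighted variant could use the edge weights instead).
    group = [params[i]] + [params[j] for j in neighbors[i]]
    return np.mean(np.stack(group), axis=0)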
Optionally, the broadcasting the local machine learning model parameters to the neighboring edge device, and performing aggregation update on the local machine learning model parameters by using the machine learning model sent by the neighboring edge device, including:
when the number of training iterations reaches an integer multiple of a first preset value, broadcasting the local machine learning model parameters to its own neighbor edge devices, and carrying out aggregation updating on the local machine learning model parameters by using the machine learning model parameters sent by the neighbor edge devices;
the compressing the trained machine learning model, and sending the compressed machine learning model parameters to the cluster head edge device of the device cluster where the edge device is located, includes:
when the number of iterations reaches an integer multiple of a second preset value, compressing the trained machine learning model, and sending the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the edge device is located; the second preset value is greater than the first preset value.
Optionally, the sending the cluster aggregation model parameter to an edge cloud server for global aggregation includes:
and when the number of intra-cluster aggregations reaches an integer multiple of a third preset value, sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
Optionally, the performing cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters includes:
transmitting local machine learning model parameters to the cluster head edge device;
the cluster head edge device of the device cluster to which the edge device belongs aggregates the received machine learning model parameters through the following formula to obtain the cluster aggregation model parameters:
wherein the formula involves the cluster aggregation model parameters obtained by the c-th device cluster in the (t+1)-th round of intra-cluster aggregation, the cluster aggregation model parameters obtained by the c-th device cluster in the t-th round of intra-cluster aggregation, the locally compressed machine learning model parameters of the j-th edge device in the c-th device cluster before the (t+1)-th round of intra-cluster aggregation, the number of edge devices in the c-th device cluster, and a hyperparameter.
The invention also provides a heterogeneous distributed learning device applied to edge equipment, wherein a federal learning system comprises a plurality of the edge equipment, the plurality of the edge equipment is divided into a plurality of equipment clusters, and the device comprises:
the local training module is used for carrying out iterative training on a local machine learning model by utilizing local training data; the machine learning model in each edge device corresponds to the same model structure and reasoning task;
the model compression module is used for compressing the trained machine learning model and transmitting the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the edge device is located;
and the intra-cluster aggregation module is used for, when the edge device itself belongs to the cluster head edge devices, carrying out cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters, and sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
The invention also provides an edge device, comprising:
a memory for storing a computer program;
and a processor for implementing the heterogeneous distributed learning method as described above when executing the computer program.
The invention also provides a federal learning system comprising:
the edge cloud server is used for carrying out global aggregation on the received cluster aggregation model parameters;
a plurality of edge devices for performing the heterogeneous distributed learning method as described above.
The present invention also provides a computer readable storage medium having stored therein computer executable instructions that, when loaded and executed by a processor, implement the heterogeneous distributed learning method as described above.
The invention provides a heterogeneous distributed learning method, which is applied to an edge device, wherein a federal learning system comprises a plurality of edge devices and the plurality of edge devices are divided into a plurality of device clusters, and the method comprises the following steps: performing iterative training on a local machine learning model by using local training data, wherein the machine learning model in each edge device corresponds to the same model structure and reasoning task; compressing the trained machine learning model, and transmitting the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the edge device is located; and when the edge device itself belongs to the cluster head edge devices, performing cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters, and sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
Therefore, the plurality of edge devices can first be divided into a plurality of device clusters, so that the edge devices carry out hierarchical federal learning: each edge device first has its local machine learning model parameters aggregated within its device cluster, and the cluster head edge device of each device cluster then sends the cluster aggregation model parameters obtained by the intra-cluster aggregation to the edge cloud server for global aggregation, so that the traffic at the edge cloud server can be greatly reduced and the communication efficiency at the edge cloud server can be improved. Furthermore, in order to ensure the communication efficiency at the cluster head edge device, each edge device needs to compress the local machine learning model before sending it to the cluster head edge device, and only sends the compressed machine learning model parameters to the cluster head edge device, so that the communication traffic at the cluster head edge device can be reduced, and better communication efficiency is achieved at the cluster head edge device. The invention also provides a heterogeneous distributed learning device, an edge device, a federal learning system and a computer readable storage medium, which have the above beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a federal learning system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a heterogeneous distributed learning method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a layer-by-layer traversal compression according to an embodiment of the invention;
FIG. 4 is a schematic diagram of an overall compression provided by an embodiment of the present invention;
FIG. 5 is a diagram of a weighted undirected graph according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a cluster of devices according to an embodiment of the present invention;
fig. 7 is a block diagram of a heterogeneous distributed learning device according to an embodiment of the present invention;
fig. 8 is a block diagram of an edge device according to an embodiment of the present invention;
fig. 9 is a block diagram of a federal learning system according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Federal learning is a common way of training a distributed model. The system involved in this training approach typically includes multiple training devices deployed with the same machine learning model and a single central server. In the actual training process, each training device firstly uses local training data to train a local machine learning model; when the training device completes the local training, the local model parameters can be sent to the central server, the central server averages the received model parameters, updates the global model by using the average parameters, and then sends the updated global model parameters back to each device. This process is repeated in a number of iterations until a predetermined stop condition is reached. In the edge computing scenario, the training device may also be an edge device, and the central server may also be an edge cloud server.
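For reference, one round of the conventional federated averaging process described above can be sketched as follows; this is a minimal illustration in which parameter vectors are NumPy arrays and the local training procedures are passed in as callables.

import numpy as np

def federated_round(global_params, local_trainers):
    # Each training device trains a copy of the global model on its local data and returns
    # its updated parameters; the central server averages them to update the global model
    # and would then redistribute the result to every device.
    local_params = [train(global_params.copy()) for train in local_trainers]
    return np.mean(np.stack(local_params), axis=0)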
In the related art, because all machine learning model parameters are aggregated at the edge cloud server, better communication efficiency cannot be achieved. Even when the machine learning model parameters are uploaded together by facing a plurality of edge devices, the edge cloud server is limited by communication bandwidth and cannot aggregate the model parameters at all. In view of this, the present invention may provide a heterogeneous distributed learning method, in which a plurality of edge devices may be first divided into a plurality of device clusters, so that each edge device performs hierarchical federal learning, that is, each edge device first performs cluster aggregation on its local machine learning model parameters in the device cluster, and then the cluster head edge device of each device cluster sends the intra-cluster machine learning model parameters obtained by the cluster aggregation to the edge cloud server for global aggregation, so that the traffic at the edge cloud server may be greatly reduced, and thus the communication efficiency at the edge cloud server may be improved. In addition, in order to ensure the communication efficiency at the cluster head edge equipment, each edge equipment in the invention can compress the local machine learning model before sending the local machine learning model to the cluster head edge equipment, and only send the compressed machine learning model to the cluster head edge equipment, so that the communication traffic at the cluster head edge equipment can be lightened, and the better communication efficiency can be achieved in the cluster head edge equipment.
For ease of understanding, embodiments of the present invention will first be described with respect to a federal learning system to which the present invention is applicable. Referring to fig. 1, fig. 1 is a schematic diagram of a federal learning system according to an embodiment of the present invention. The system comprises a plurality of device clusters 200 and a single edge cloud server 10, wherein each device cluster 200 can comprise edge devices 20 and cluster head edge devices 21, and all the edge devices 20 and all the cluster head edge devices 21 are provided with machine learning models 1 with the same network structure and the same reasoning task. It should be noted that the cluster head edge device 21 is a special edge device 20, which, in addition to having the basic purpose of the edge device 20 (such as locally training the machine learning model 1), at least needs to upload cluster-integrated model parameters obtained by cluster integration to the edge cloud server 10. It should be noted that, the number of the device clusters 200 is not limited in the embodiment of the present invention, and at least two may be other numbers. The number of edge devices 20 that can be included in the cluster 200 is not limited in this embodiment, and may be at least two, or may be other numbers. The embodiment of the present invention is not limited to the specific manner of dividing the device cluster 200, and may be divided according to the address locations between the edge devices 20, or may be divided according to the similarity of training data in each edge device 20.
It should be further noted that, in the hierarchical federal training provided by the embodiment of the present invention, the machine learning model will at least need to undergo three steps of local iterative training, intra-cluster aggregation, and global aggregation. Each complete global aggregation may be considered as a complete round of federal learning, and each round of federal learning may include multiple rounds of local iterative training and multiple rounds of intra-cluster aggregation, and the number of rounds of local iterative training may be significantly higher than the number of rounds of intra-cluster aggregation. The embodiment of the invention does not limit the specific numbers of rounds of local iterative training and intra-cluster aggregation, which may be set according to actual application requirements. In addition, the number of rounds of local iterative training and the number of rounds of intra-cluster aggregation may have a quantitative relationship, e.g., one intra-cluster aggregation is performed each time a first number of rounds of local iterative training is completed. Meanwhile, the number of rounds of intra-cluster aggregation and the number of rounds of global aggregation may also have a relationship, for example, global aggregation is performed once every time a second number of rounds of intra-cluster aggregation is completed. The embodiment of the invention is not limited to the specific values of the first number and the second number, which can be set according to actual application requirements.
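The relationship between the three levels can be illustrated with a simple scheduling sketch; n1 and n2 stand for the first number and second number of rounds mentioned above, and all names are placeholders.

def hierarchical_schedule(total_local_rounds, n1, n2,
                          local_step, intra_cluster_aggregate, global_aggregate):
    # Every n1 rounds of local iterative training trigger one intra-cluster aggregation,
    # and every n2 intra-cluster aggregations trigger one global aggregation.
    cluster_rounds = 0
    for r in range(1, total_local_rounds + 1):
        local_step()
        if r % n1 == 0:
            intra_cluster_aggregate()
            cluster_rounds += 1
            if cluster_rounds % n2 == 0:
                global_aggregate()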
Further, the embodiment of the invention is not limited to the scene to which the federal learning system is applicable, and for example, the invention can be applied to the fields of network security, smart city, medical health, intelligent manufacturing, and the like. In the field of network security, network devices such as routers and firewalls can be used as edge devices to collect network attack data and train a machine learning model for detecting malware or network attacks based on the federal learning system without sharing sensitive data. In a traffic management system in a smart city, monitoring devices and intelligent vehicles can collect traffic data as edge devices and train a machine learning model for optimizing traffic flow and predicting accidents based on the present federal learning system. In the field of medical health, edge devices can be arranged in each medical institution, medical diagnosis data can be collected by using the edge devices, and a machine learning model for disease diagnosis prediction is trained based on the federal learning system, so that the accuracy of disease diagnosis is improved through federal learning while the privacy of patients is protected. Finally, in terms of intelligent manufacturing and predictive maintenance, the machine and sensors of the plant may be used as edge devices to collect line data and train a machine learning model for machine failure prediction based on the present federal learning system for optimization of failure prediction and maintenance decisions. It is worth pointing out that the method can effectively utilize distributed edge computing resources, and can reduce communication pressure brought by model parameter transmission to each device in the system while enhancing data privacy protection.
Based on the above system structure description, the heterogeneous distributed learning method provided by the embodiment of the invention will be described in detail. Referring to fig. 2, fig. 2 is a flowchart of a heterogeneous distributed learning method according to an embodiment of the present invention, where the method is applied to an edge device, and may include:
s201, performing iterative training on a local machine learning model by using local training data; the machine learning models in the edge devices correspond to the same model structure and reasoning task.
Specifically, the edge device may update the local model using the SGD (Stochastic Gradient Descent) algorithm. In the l-th local iteration after the t-th round of intra-cluster aggregation, the model parameters are updated as:

w_i^{t,l} = w_i^{t,l-1} - η_l ∇f(w_i^{t,l-1}; ξ_i^{t,l-1})

wherein w_i^{t,l} denotes the local model parameters of the i-th edge device after the l-th local iteration following the t-th round of intra-cluster aggregation, w_i^{t,l-1} denotes the corresponding local model parameters before this iteration, η_l denotes the learning rate of this local iteration, ξ_i^{t,l-1} denotes the training data used for this local iteration, f denotes the sample loss function, and ∇ denotes the gradient.
The local empirical loss function at the i-th edge device is:

F_i(w) = (1/|D_i|) Σ_{ξ∈D_i} f(w; ξ)

wherein w denotes the local model parameters, ξ denotes a training sample participating in the iterative training, D_i denotes the local training set at the i-th edge device, and f denotes the sample loss function. It should be noted that the embodiment of the present invention is not limited to a specific sample loss function, which may be set according to actual application requirements.
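A minimal PyTorch-style sketch of this local SGD training step is given below; the loss function and data loader are placeholders, since the patent does not fix a specific sample loss function.

import torch

def local_train(model, data_loader, epochs, lr):
    # Iteratively update the local machine learning model with SGD on local training data.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()           # example sample loss; task-dependent
    for _ in range(epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)  # sample loss on this batch of local data
            loss.backward()                         # gradient of the sample loss
            optimizer.step()                        # w <- w - lr * gradient
    return model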
S202, compressing the trained machine learning model, and sending the compressed machine learning model parameters to the cluster head edge device of the device cluster where the edge device is located.
After performing local iterative training on the machine learning model, each edge device needs to upload the local machine learning model to the cluster head edge device of the device cluster where the edge device is located. However, each device cluster may contain more edge devices, easily increasing communication pressure at the cluster head edge devices; meanwhile, since the cluster head edge device belongs to an edge device, compared with the edge cloud server, the cluster head edge device may not have stronger communication performance, and therefore simply sending the whole quantity of machine learning model parameters to the cluster head edge device can cause that the cluster head edge device cannot obtain higher communication efficiency. Therefore, before the parameter uploading, each edge device can compress the local machine learning model and only send the compressed machine learning model parameters to the cluster head edge device of the device cluster where the edge device is located, so that the communication traffic between each edge device and the cluster head edge device can be obviously reduced, the communication pressure at the cluster head edge device can be further reduced, and the communication efficiency can be improved.
It should be noted that, the embodiment of the present invention is not limited to a specific model compression manner, for example, the trained machine learning model may be compressed one by one, or the trained machine learning model may be integrally compressed. Each network layer in the machine learning model is compressed one by layer compression, and the influence of each layer of compression on the precision of the machine learning model is evaluated after each compression is finished; and the whole compression can be used for carrying out common compression on all network layers in the machine learning model, and the influence of the whole compression on the precision of the machine learning model is evaluated after each compression is finished. The advantage of layer by layer is that: 1. targeted optimization can be performed: allowing personalized processing of the network parameters of each layer. Because different layers may have different importance and redundancy, optimizing for each layer may more effectively reduce redundancy while more retaining critical information; 2. the flexibility is higher: layer-by-layer compression provides the ability to adjust the compression policy individually for each layer, which makes the model more flexible, and the compression policy can be tailored to the characteristics and importance of each layer; 3. the device has higher precision control capability: by means of layer-by-layer compression and fine adjustment, the overall accuracy of the model can be controlled more accurately, and performance degradation caused by compression is reduced. The advantage of overall compression is that: 1. and (3) overall optimization: the overall neural network model compression considers the dependency relationship between the overall structure and layers of the model, and can optimize the model in the overall scope so as to realize more effective compression. 2. Parameter sharing and reduction: global compression may identify and exploit similarities and redundancies between layers of the network, further reducing overall parameters of the model. 3. Long-term dependency capture: elements such as LSTM or GRU that capture long-term dependencies may be used to better understand and optimize the interrelationships between the layers of the network. The two modes are comprehensively used and have the advantages that: 1. comprehensive optimization can be performed: the combination of layer-by-layer compression and overall compression can enjoy the benefits of both targeted and overall optimization. The comprehensive method can balance the performance and the compression ratio of the model on a local and global level; 2. balance of precision and compression: the combined use can more effectively find the optimal balance point between reducing the model size and maintaining the model accuracy. This is because the layer-by-layer compression allows careful control, while the overall compression provides for optimization of the overall structure.
Based on this, compressing the trained machine learning model may include:
step 11: each network layer in the trained machine learning model is subjected to layer-by-layer traversal compression and/or all network layers in the trained machine learning model are subjected to overall compression.
Further, when the layer-by-layer compression mode and the overall compression mode are used in combination, the machine learning model can be compressed layer by layer for achieving better processing efficiency, and then the machine learning model can be compressed overall.
Based on this, performing layer-by-layer traversal compression on each network layer in the trained machine learning model and performing overall compression on all network layers in the trained machine learning model may include:
step 21: each network layer in the trained machine learning model is subjected to layer-by-layer traversal compression, and after the layer-by-layer traversal compression is finished, all network layers in the trained machine learning model are subjected to overall compression.
Further, step S202 may be performed multiple times during each round of federal learning. For example, the edge device may compress the trained machine learning model and send the compressed machine learning model parameters to the cluster head edge device when it is determined that the local iterative training reaches an integer multiple of the preset value at the current corresponding iteration number. It should be noted that, the embodiment of the present invention is not limited to specific values of the preset values, and may be set according to actual application requirements.
S203, when the edge device itself belongs to the cluster head edge devices, performing cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters, and sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
It should be noted that, step S203 is only performed by the cluster head edge device, and the non-cluster head edge device may not perform this step. When the cluster head edge equipment is determined to belong to the cluster head edge equipment, the current edge equipment can receive machine learning model parameters sent by other edge equipment in the same equipment cluster, and cluster aggregation is carried out on the received machine learning model parameters to obtain cluster aggregation model parameters. Considering that each device cluster contains only a small number of edge devices, and each edge device has compressed its respective model parameters, it is possible to reduce traffic and communication pressure brought by communication transmissions to the cluster head edge devices. In addition, since the cluster-integrated model parameters are obtained by polymerizing the compressed model parameters, the volume of the cluster-integrated model parameters can be reduced. After the cluster aggregation is completed, the cluster head edge equipment also needs to send cluster aggregation model parameters to an edge cloud server for global aggregation. Considering that the number of the cluster head edge devices is originally smaller than the total number of the edge devices, and the cluster aggregation model parameters transmitted by the cluster head edge devices also reduce the volume, the communication pressure at the edge cloud server can be obviously reduced, so that the communication pressure at each device in the federal training system can be improved, and the problem of overlarge single-point communication pressure is avoided.
It should be noted that, the embodiment of the present invention is not limited to a specific implementation manner of intra-cluster aggregation, and for example, average aggregation, weighted aggregation, and other manners may be adopted. In order to achieve a better specific effect, the cluster head edge device in the embodiment of the invention can also aggregate the received machine learning model parameters to obtain cluster aggregation model parameters through the following formula:
wherein the formula involves the cluster aggregation model parameters obtained by the c-th device cluster in the (t+1)-th round of intra-cluster aggregation, the cluster aggregation model parameters obtained by the c-th device cluster in the t-th round of intra-cluster aggregation, the locally compressed machine learning model parameters of the j-th edge device in the c-th device cluster before the (t+1)-th round of intra-cluster aggregation, the number of edge devices in the c-th device cluster, and a hyperparameter.
Based on this, performing cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters may include:
step 41: transmitting local machine learning model parameters to cluster head edge equipment;
step 42: aggregating, by the cluster head edge device of the device cluster to which the edge device belongs, the received machine learning model parameters through the following formula to obtain the cluster aggregation model parameters:
wherein the formula involves the cluster aggregation model parameters obtained by the c-th device cluster in the (t+1)-th round of intra-cluster aggregation, the cluster aggregation model parameters obtained by the c-th device cluster in the t-th round of intra-cluster aggregation, the locally compressed machine learning model parameters of the j-th edge device in the c-th device cluster before the (t+1)-th round of intra-cluster aggregation, the number of edge devices in the c-th device cluster, and a hyperparameter.
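The aggregation formula itself is given above only by reference, so the following NumPy sketch merely illustrates one plausible reading in which the new cluster aggregation parameters are a hyperparameter-weighted combination of the previous round's aggregate and the mean of the members' compressed parameters; the weighting scheme is an assumption, not the patented formula.

import numpy as np

def intra_cluster_aggregate(prev_cluster_params, member_params, mu):
    # prev_cluster_params: cluster aggregation parameters of this device cluster from round t
    # member_params: locally compressed parameters of the cluster's edge devices for round t+1
    # mu: hyperparameter balancing the previous aggregate against the new member average (assumed role)
    member_mean = np.mean(np.stack(member_params), axis=0)
    return mu * prev_cluster_params + (1.0 - mu) * member_mean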
Further, after cluster head edge equipment can perform cluster aggregation for a plurality of times, cluster aggregation model parameters are sent to an edge cloud server for global aggregation. Specifically, when cluster aggregation is completed, the cluster head edge equipment can judge whether the current cluster aggregation frequency reaches an integral multiple of a second preset value, and if so, the cluster aggregation model parameters are sent to an edge cloud server for global aggregation. It should be noted that, the embodiment of the present invention is not limited to the specific value of the second preset value, and may be set according to the actual application requirement.
Based on this, the sending the cluster aggregation model parameter to the edge cloud server for global aggregation may include:
step 51: when the number of intra-cluster aggregations reaches an integer multiple of the second preset value, sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
Based on the above embodiment, the plurality of edge devices in the present invention may be divided into a plurality of device clusters at first, so that the edge devices develop hierarchical federal learning, that is, each edge device first performs cluster aggregation on its local machine learning model parameters in the device cluster, and then each cluster head edge device of each device cluster sends the intra-cluster machine learning model parameters obtained by the cluster aggregation to the edge cloud server for global aggregation, so that the traffic at the edge cloud server may be greatly reduced, and thus the communication efficiency at the edge cloud server may be improved. Furthermore, in order to ensure the communication efficiency at the cluster head edge device, each edge device needs to compress the local machine learning model before sending the local machine learning model to the cluster head edge device, and only sends the compressed machine learning model to the cluster head edge device, so that the communication traffic at the cluster head edge device can be reduced, and better communication efficiency is achieved at the cluster head edge device.
Based on the above embodiments, a detailed description of a specific implementation of layer-by-layer compression is provided below. In one possible scenario, performing layer-by-layer traversal compression for each network layer in the trained machine learning model may include:
S301, setting a full-scale model and a compression model by using the trained machine learning model.
In the step, in order to facilitate model compression, a full-scale model and a compression model can be set by using a trained machine learning model, wherein the full-scale model is not compressed and can be used as a reference object of the compression model; and the compression model is the object to be compressed in the embodiment of the invention.
S302, setting a first network layer in a compression model as a layer to be compressed, and creating state information for the layer to be compressed; the state information includes self attribute information of the layer to be compressed and compression state information.
The embodiment of the invention uses a hierarchical compression decision device to determine the compression rate of each layer to be compressed in the compression model one by one, and performs layer by layer compression on each layer to be compressed based on the compression rate. The hierarchical compression decision maker is a neural network model, and the structure of the hierarchical compression decision maker specifically comprises:
input layer: accepting as input a vector of state information;
hidden layer: there may be one or more hidden layers, each hidden layer containing a plurality of neurons;
output layer: a plurality of neurons representing different compression decision options, which output the compression rate decided by the hierarchical compression decision maker for the current step.
That is, in brief, the hierarchical compression decision device takes state information of a layer to be compressed as input, and takes a compression rate of the layer to be compressed as output. The state information may include self attribute information of the layer to be compressed and compression state information, where the self attribute information is used to record self information of the layer to be compressed, for example, may include the number of layers of the layer to be compressed, the type of network layer, the input scale of the feature map, and the output scale of the feature map; the compression state information is used for recording the compression state of the layer to be compressed and the compression state of the compression model, and may include, for example, the compression rate of the layer to be compressed, the variation degree of the feature map, and the precision of the compression model, where the variation degree of the feature map represents the variation degree between the layer to be compressed and the corresponding network layer in the full-scale model. In this step, therefore, initial state information needs to be created for the layer to be compressed first. In one possible scenario, the vector form corresponding to the state information may be:
s_fc = [i,type,[c1*h1*w1],[o1*h2*w2], Compression Ratio, Variability of Compressed Feature Maps,Ac];
where s_fc represents the state information, i indicates that the network layer is the i-th layer of the model, type represents the type of the network layer, [c1*h1*w1] represents the input scale of the feature map of the layer, [o1*h2*w2] represents the output scale of the feature map of the layer, Compression Ratio represents the current compression rate of the network layer, Variability of Compressed Feature Maps represents the degree of variation of the feature map before and after compression, and Ac represents the accuracy of the current compression model. It is understood that the network layer has not yet been compressed when it is set as the layer to be compressed for the first time, and thus the compression state information in its state information may take initial values.
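A small PyTorch sketch of how such a state vector could be fed to the hierarchical compression decision maker is shown below; the hidden width and the discrete compression-rate options are assumed values, not taken from the patent.

import torch

def layer_state_vector(layer_idx, layer_type_id, in_shape, out_shape, rate, variability, acc):
    # Flatten s_fc into a numeric vector: layer index, layer type, input/output feature-map
    # scales, current compression rate, feature-map variability and compression-model accuracy.
    return torch.tensor([layer_idx, layer_type_id, *in_shape, *out_shape, rate, variability, acc],
                        dtype=torch.float32)

rate_options = [0.1, 0.3, 0.5, 0.7, 0.9]          # assumed discrete compression-rate options
decider = torch.nn.Sequential(                    # input layer -> hidden layer -> output layer
    torch.nn.Linear(11, 64),                      # 11 state entries: 2 + 3 + 3 + 3
    torch.nn.ReLU(),
    torch.nn.Linear(64, len(rate_options)),       # one output neuron per compression decision option
)

def decide_rate(state):
    # Pick the compression rate whose output neuron scores highest.
    return rate_options[int(torch.argmax(decider(state)))]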
And S303, inputting the state information into a hierarchical compression decision device to obtain the compression rate of the layer to be compressed, and compressing the layer to be compressed according to the compression rate.
In this step, the compression rate characterizes the compression strength and is proportional to it, i.e. the higher the compression rate, the greater the compression strength. The hierarchical compression decision maker in the embodiment of the invention is therefore used to comprehensively evaluate the state information of the layer to be compressed and to determine the compression strength corresponding to the layer to be compressed according to that state information.
It should be noted that the embodiment of the present invention does not limit how the layer to be compressed is compressed according to the compression rate. For example, a plurality of compression levels may be preset, with a compression mode of matching strength set for each compression level; the compression level corresponding to the compression rate can then be determined, and the layer to be compressed is compressed using the compression mode corresponding to that compression level.
Based on this, compressing the layer to be compressed according to the compression rate may include:
step 61: and determining a compression grade corresponding to the compression rate, and compressing the layer to be compressed by utilizing a compression mode corresponding to the compression grade.
It is understood that each compression level may correspond to a compression rate threshold interval, and thus it is only necessary to determine which threshold interval the compression rate is located in, and it may be determined which compression level the compression rate belongs to. It should be noted that, the embodiment of the present invention does not limit the specific number and upper and lower limit values of the threshold interval, and may be set according to the actual application requirements. The embodiment of the invention also does not limit which compression level corresponds to which compression mode, and only ensures that the higher the compression level is, the larger the compression force is.
In one possible case, three preset intervals may be set, namely a first preset interval, a second preset interval and a third preset interval, corresponding to three compression levels, where the level corresponding to the first preset interval is the lowest and the level corresponding to the third preset interval is the highest. Corresponding compression modes may then be set for the three levels: the first level corresponds to weight pruning, the second level corresponds to the combination of weight pruning and low-rank approximate compression, and the third level corresponds to the combination of weight pruning, low-rank approximate compression and quantization compression. It should be noted that the embodiment of the present invention does not limit the upper and lower limits of the first, second and third preset intervals, which may be set according to practical application requirements; for example, the first preset interval may be 0-30%, the second preset interval may be 30-70%, and the third preset interval may be 70-90%.
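The mapping from compression rate to compression level can be illustrated with a small sketch; the interval bounds simply reuse the example values 0-30%, 30-70% and 70-90% given above, and the function name is hypothetical:

```python
def compression_level(rate, bounds=(0.30, 0.70, 0.90)):
    """Map a compression rate to a compression level (1, 2 or 3).

    bounds are the illustrative upper limits of the first, second and
    third preset intervals.
    """
    low, mid, high = bounds
    if rate <= low:
        return 1  # weight pruning only
    if rate <= mid:
        return 2  # weight pruning + low-rank approximate compression
    if rate <= high:
        return 3  # weight pruning + low-rank approximation + quantization
    raise ValueError("compression rate outside the supported intervals")
```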
Based on this, determining a compression level corresponding to the compression rate, and compressing the layer to be compressed using a compression method corresponding to the compression level may include:
step 71: when the compression rate is determined to be located in a first preset interval, weight pruning is carried out on the layer to be compressed; the lower limit value of the first preset interval is 0.
If the compression rate is determined to be in the first preset interval, only weight pruning is performed on the layer to be compressed. The weight pruning may comprise the following steps:
1. weight importance analysis operation: the absolute value of each weight in the layer to be compressed may be observed to evaluate the importance of each weight. Of course, the evaluation of weight importance can also be performed by means of gradient information, Hessian matrix analysis and other means;
2. pruning strategy selection operation:
global pruning based on weight magnitude: a global threshold may be set according to the compression rate, and the weights in the current layer to be compressed that are smaller than the global threshold are pruned.
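A minimal sketch of the global-threshold pruning described above, assuming the layer weights are available as a NumPy array; deriving the threshold from the compression rate via a magnitude percentile is one straightforward interpretation, not the only possible one:

```python
import numpy as np

def prune_by_global_threshold(weights, compression_rate):
    """Zero out the smallest-magnitude weights so that roughly
    `compression_rate` of the entries in this layer are removed."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * compression_rate)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # global threshold from the rate
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0     # prune weights below the threshold
    return pruned
```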
Step 72: when the compression rate is determined to be positioned in a second preset interval, weight pruning and low-rank approximate compression are sequentially carried out on the layer to be compressed; the lower limit value of the second preset interval is the upper limit value of the first preset interval.
If the compression rate is determined to be in the second preset interval, weight pruning can be performed on the layer to be compressed first, followed by low-rank approximate compression. When performing low-rank approximate compression, a proper rank m may be selected, for example one that retains 90% of the energy of the original matrix, and a low-rank decomposition such as SVD (Singular Value Decomposition) is performed on the weight matrix remaining in the layer to be compressed, so as to further reduce the number of parameters and further increase the compression rate.
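The low-rank step can be sketched as follows, assuming a 2-D weight matrix and a truncated SVD that keeps the smallest rank m retaining, for example, 90% of the matrix energy; this is an illustrative implementation only:

```python
import numpy as np

def low_rank_compress(weight_matrix, energy=0.90):
    """Approximate a 2-D weight matrix with a rank-m SVD factorization,
    where m is the smallest rank retaining `energy` of the matrix energy."""
    u, s, vt = np.linalg.svd(weight_matrix, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    m = int(np.searchsorted(cumulative, energy)) + 1
    a = u[:, :m] * s[:m]   # shape (rows, m)
    b = vt[:m, :]          # shape (m, cols)
    return a, b            # weight_matrix is approximated by a @ b
```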
Step 73: when the compression rate is determined to be positioned in a third preset interval, weight pruning, low-rank approximate compression and quantization compression are sequentially carried out on the layer to be compressed; the lower limit value of the third preset interval is the upper limit value of the second preset interval.
If the compression rate is determined to be in the third preset interval, weight pruning, low-rank approximate compression and quantization compression can be sequentially performed on the layer to be compressed. During quantization compression, the low-rank approximated weight matrix can be quantized, for example from 32-bit floating point numbers to 8-bit integers; the specific quantization bit width is selected according to actual requirements and hardware compatibility.
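A sketch of the quantization step, assuming simple symmetric per-tensor quantization from 32-bit floating point to 8-bit integers; other bit widths or quantization schemes can equally be used depending on hardware compatibility:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of a float32 array to int8."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize with q.astype(np.float32) * scale
```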
And S304, training the compression model by utilizing the local training data when the layer to be compressed is compressed, so as to restore the precision of the compression model.
It will be appreciated that after compression of the layer to be compressed is completed, although the amount of parameters in the layer to be compressed is reduced, the accuracy of the compression model is also reduced. In order to compensate for this loss, the embodiment of the invention can start from the current state of the compression model, adjust hyperparameters such as the learning rate under the same training and validation data sets, and continue training the model to restore the accuracy of the compression model, so as to balance the reduction in the parameter quantity against the loss of model accuracy.
S305, judging whether a compression stop condition is met; if yes, go to step S306; if not, the process advances to step S307.
It should be noted that, the compression stop condition in the embodiment of the present invention is used to determine whether the current layer to be compressed is compressed to a proper extent, and the compression may be exited. Embodiments of the present invention are not limited to specific compression stop conditions, which are relevant to the training situation of the hierarchical compression decision maker. For example, while the hierarchical compression decision maker is still in training, it may give a poor compression rate, resulting in a compression model that still recovers poorly after retraining, at which point it may be determined that the compression stop condition is met and the layer to be compressed is rolled back to the previous version (e.g., the original version or the previous compressed version). For another example, when the hierarchical compression decision maker has completed training, it basically gives a suitable compression rate, and each network layer only needs to compress a fixed number of times, so the compression stop condition may be: and judging whether the number of times of compression of the layer to be compressed reaches the preset number of times, if so, judging that the compression stop condition is met, and otherwise, judging that the compression stop condition is not met.
Based on this, determining whether the compression stop condition is satisfied may include:
Step 81: when the hierarchical compression decision maker has completed training, judging whether the compressed times of the layer to be compressed reach preset times or not; if yes, go to step 82; if not, go to step 83;
step 82: judging that the compression stop condition is satisfied;
step 83: it is determined that the compression stop condition is not satisfied.
It should be noted that, the embodiment of the present invention is not limited to specific values of the preset times, and may be set according to actual application requirements.
S306, setting the next network layer in the compression model as a layer to be compressed, and entering a step of creating state information for the layer to be compressed.
As described above, if it is determined that the compression stop condition is satisfied, compression of the next network layer can be entered.
S307, updating the compression state information in the state information of the layer to be compressed, and entering a step of inputting the state information to the hierarchical compression decision device based on the updated state information to obtain the compression rate of the layer to be compressed.
As described above, if it is determined that the compression stop condition is not satisfied, this means that the current layer to be compressed is still to be compressed. At this time, the embodiment of the invention can update the compression state information in the state information of the layer to be compressed, so as to continuously compress the layer to be compressed based on the updated state information. The updating of the compression state information is described below.
In one possible case, updating compression state information in the state information of the layer to be compressed may include:
step 91: updating the compression rate in the compression state information to the compression rate currently output by the hierarchical compression decision maker;
step 92: updating the variation degree of the feature map in the compression state information by utilizing the compressed layer to be compressed and the corresponding network layer in the full-scale model;
step 93: the latest precision of the compression model is determined, and the precision in the compression state information is updated with the latest precision.
It should be noted that, the embodiment of the present invention is not limited to the calculation manner of the variation degree of the feature map, as long as the variation degree of the feature map can be ensured to characterize the variation degree between the layer to be compressed and the network layer corresponding to the layer in the full-scale model. The following describes a calculation method of the degree of variation of the feature map.
In one possible case, updating the feature map variability in the compression state information by using the compressed layer to be compressed and the corresponding network layer in the full-scale model includes:
step 101: and setting the characteristic diagram of the compressed layer to be compressed as a first characteristic diagram, and setting the characteristic diagram of the network layer corresponding to the layer to be compressed in the full-scale model as a second characteristic diagram.
It should be noted that for both the second feature map before compression and the first feature map after compression, the sizes are (C, H, W), where H represents the height of the feature map, W represents the width of the feature map, and C represents the number of channels.
Step 102: and determining element differences between each element in the first feature map and the corresponding element in the second feature map, and summing absolute values of all the element differences to obtain a total element difference.
In this step, each element in the feature maps may be traversed, where A[i,j] denotes an element of the second feature map and B[i,j] the corresponding element of the first feature map; the absolute value of the element difference is calculated as element_diff = abs(A[i,j] - B[i,j]), where element_diff represents the absolute value of the element difference and abs represents the absolute value operation; the absolute values of all element differences are then summed to obtain the total element difference, namely total_diff = sum(element_diff for all elements), where total_diff represents the total difference and sum represents the summing operation over every element in the feature map.
Step 103: and determining the average element difference value of each element by using the total element difference value and the number of the elements contained in the first feature map, and updating the variation degree of the feature map by using the average element difference value.
In this step, the total number of elements of the feature map may be calculated as total_elements = C*H*W, and the variability of the feature map as Variability of Compressed Feature Maps = total_diff / total_elements.
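The computation of steps 101 to 103 can be summarized in a few lines, assuming both feature maps are available as NumPy arrays of shape (C, H, W):

```python
import numpy as np

def feature_map_variability(full_fm, compressed_fm):
    """Average absolute element difference between the feature map of the
    uncompressed layer (A, second feature map) and the compressed layer
    (B, first feature map), both of shape (C, H, W)."""
    total_diff = np.sum(np.abs(full_fm - compressed_fm))   # sum of |A[i,j] - B[i,j]|
    total_elements = full_fm.size                           # C * H * W
    return total_diff / total_elements
```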
Based on the above embodiments, the above layer-by-layer traversal compression method is described below with reference to a specific schematic diagram. Referring to fig. 3, fig. 3 is a schematic diagram of layer-by-layer traversal compression according to an embodiment of the invention. The network layers belong to a machine learning model, and the numbers (1 to n+2) indicate the positions of the network layers in the machine learning model. Embodiments of the present invention may perform layer-by-layer traversal compression on these network layers; for example, in fig. 3, after the compression of network layer n-1 is completed, the hierarchical compression decision maker moves layer by layer to network layer n and compresses network layer n. The compression is divided into four steps: 1. extracting the state information of the network layer, which may include the layer number of the network layer, the type of the network layer, the input scale of the feature map, the output scale of the feature map, the compression rate of the network layer, the degree of variation of the feature map of the network layer, and the accuracy of the compression model; 2. inputting the state information into the hierarchical compression decision maker for processing to obtain the compression rate. The hierarchical compression decision maker is a neural network model comprising an input layer, a hidden layer and an output layer; these three parts process the input state information to obtain the compression rate of the network layer currently to be compressed in the machine learning model; 3. inputting the compression rate to an executor, so that the executor selects a compression strategy matching the compression rate and compresses the network layer currently to be compressed; 4. retraining the model after compression to restore the model accuracy, and choosing to continue compressing the current network layer or to move to the next network layer for compression according to the compression condition of the current network layer.
Based on the above embodiments, the training manner of the hierarchical compression decision maker is described in detail below. In the embodiment of the invention, the hierarchical compression decision maker can be directly carried out in the federal learning process without special offline training, so that the hierarchical compression decision maker can be better adapted to the training environment in each edge device. In one possible scenario, performing layer-by-layer traversal compression for each network layer in the trained machine learning model may include:
S401, setting a full-scale model and a compression model by using the trained machine learning model.
S402, setting a first network layer in the compression model as a layer to be compressed, and creating state information for the layer to be compressed; the state information includes self attribute information of the layer to be compressed and compression state information.
S403, inputting the state information into a hierarchical compression decision device to obtain the compression rate of the layer to be compressed, and compressing the layer to be compressed according to the compression rate.
S404, when the layer to be compressed is compressed, training the compression model by using the local training data to restore the precision of the compression model.
As described above, since the hierarchical compression decision maker can be directly performed in the federal learning process, it is still possible for the hierarchical compression decision maker that has not completed training to directly compress the machine learning model trained by the edge device. Therefore, the descriptions of steps S401 to S404 are identical to those of the above embodiments, and are not repeated here.
S405, when the hierarchical compression decision maker has not completed training, determining a loss value for the hierarchical compression decision maker in the current compression;
where the loss value Delta_Acc is computed from the current accuracy of the compression model, the accuracy of the full-scale model, the parameter quantity reduced by the compression (Parameter_Reduction), the total parameter quantity of the full-scale model (Total_Parameter), and an adjustment factor.
In the embodiment of the invention, each time the hierarchical compression decision maker completes one compression, a loss value is calculated for it, so that the loss value can be used to guide the update of the hierarchical compression decision maker. The loss value combines two terms: an accuracy term and a parameter-reduction term Parameter_Reduction/Total_Parameter. The accuracy term guides the hierarchical compression decision maker to maintain the accuracy of the compression model, while the parameter-reduction term encourages the hierarchical compression decision maker to reduce the amount of parameters in the model. By setting the loss function in this way, the embodiment of the invention ensures that the hierarchical compression decision maker effectively balances model accuracy against model size, thereby achieving a better compression effect.
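Because the exact loss formula appears only as a figure in the original filing, the sketch below assumes one plausible form consistent with the description above, combining an accuracy term with an adjustment-factor-weighted parameter-reduction term; the sign convention and weighting are assumptions:

```python
def decision_maker_loss(acc_compressed, acc_full,
                        parameter_reduction, total_parameter, adjustment=1.0):
    """Assumed form of Delta_Acc: penalize lost accuracy, reward removed parameters.

    The exact combination is an assumption; the text only states that one
    term maintains accuracy while the other encourages a smaller model.
    """
    accuracy_term = acc_full - acc_compressed               # smaller is better
    reduction_term = parameter_reduction / total_parameter  # larger is better
    return accuracy_term - adjustment * reduction_term
```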
S406, updating the model of the hierarchical compression decision maker by using the loss value.
It should be noted that, the embodiment of the present invention is not limited to how to update the model of the hierarchical compression decision maker by using the loss value, and the decision maker is considered to be a neural network model, so reference may be made to the related technology of the neural network model.
S407, judging whether the current precision of the compression model is smaller than a preset minimum precision value or not when the hierarchical compression decision-making device does not complete training; if the current precision is smaller than the preset minimum precision value, entering S408; if the current precision is not less than the preset minimum precision value, the process goes to S409.
In this step, when the hierarchical compression decision maker does not complete training, it may give a poor compression rate, which in turn results in the compression model not being able to recover to the required accuracy. Therefore, when the current precision of the compression model is determined to be smaller than the preset minimum precision value, the compression of the current layer to be compressed can be directly stopped, and the layer to be compressed is rolled back to the previous version (such as the original version or the previous compression version) so as to avoid serious degradation of the compression model. It should be noted that, the embodiment of the present invention is not limited to a specific value of the preset minimum precision value, and may be set according to the actual application requirement.
S408, judging that the compression stop condition is met, and rolling back the layer to be compressed to the previous version.
S409, judging whether the current compression rate of the layer to be compressed is smaller than a preset compression rate; if the current compression rate is smaller than the preset compression rate, the step S410 is entered; if the current compression rate is not less than the preset compression rate, the process proceeds to step S413.
S410, adding 1 to the duration of the low compression rate corresponding to the layer to be compressed, and judging whether the duration of the low compression rate is greater than a preset number; if the duration of the low compression ratio is greater than the preset number, step S411 is entered; if the duration of the low compression rate is not greater than the preset number, step S412 is entered, wherein the initial value of the duration of the low compression rate is 0;
S411, judging that the compression stop condition is met;
S412, judging that the compression stop condition is not met;
in steps S409 to S412, if it is determined that the current precision of the compression model meets the requirement of the preset minimum precision value, whether to exit the compression of the layer to be compressed may be further determined according to the current compression rate of the layer to be compressed. Specifically, it may be first determined whether the current compression rate of the layer to be compressed is smaller than a preset compression rate. If the current compression rate of the layer to be compressed is determined to be smaller than the preset compression rate, the fact that the current compressible space of the layer to be compressed is smaller is indicated, and the exit condition is basically met. In order to further improve the reliability of the judgment, the embodiment of the invention can also set the corresponding duration of the low compression rate for the layer to be compressed, and add one to the duration of the low compression rate when the current compression rate of the layer to be compressed is determined to be lower than the preset compression rate each time. Furthermore, when the duration of the low compression rate is greater than the preset number, not only means that the layer to be compressed has no compressible space, but also means that the compression of the layer to be compressed by the hierarchical compression decision maker always does not seriously lose the precision of the compression model, so that it can be determined that the layer to be compressed meets the compression stop condition.
Further, if it is determined that compression of each network layer of the compression model by the hierarchical compression decision maker does not severely lose the accuracy of the model, it may be determined that the hierarchical compression decision maker has completed training. Specifically, whether the duration of the low compression rate corresponding to each network layer in the compression model is larger than the preset times can be judged, if yes, the hierarchical compression decision-making device can stably compress each network layer of the compression model, and further it can be judged that the hierarchical compression decision-making device has completed training.
Based on this, after determining that the low compression rate duration is greater than the preset number, it further includes:
step 1001: judging whether the duration times of the low compression rate corresponding to each network layer in the compression model are all larger than preset times or not; if yes, go to step 1002, if no, determine that the hierarchical compression decision maker does not complete training.
Step 1002: determining that the hierarchical compression decision maker has completed training.
S413, it is determined that the compression stop condition is not satisfied.
And S414, if the compression stopping condition is determined to be met, setting the next network layer in the compression model as a layer to be compressed, and entering a step of creating state information for the layer to be compressed.
And S415, if the compression stopping condition is not met, updating the compression state information in the state information of the layer to be compressed, and inputting the state information into the hierarchical compression decision device based on the updated state information to obtain the compression rate of the layer to be compressed.
It should be noted that the descriptions of steps S414 and S415 are identical to those of the above embodiments, and are not repeated here.
Based on the above examples, a detailed description of the overall compression is given below. In one possible scenario, the overall compression of all network layers in the trained machine learning model may include:
S501, setting a full-scale model and a compression model by using the trained machine learning model.
In this step, in order to facilitate model compression, a full-scale model and a compression model can be set by using the trained machine learning model, wherein the full-scale model is not compressed and serves as a reference object for the compression model, while the compression model is the object actually compressed in the embodiment of the invention.
S502, setting each network layer in the compression model as a layer to be compressed, and creating state information for each layer to be compressed to obtain a state information sequence; the state information comprises self attribute information and compression state information of the layer to be compressed, and the compression state information represents the compression degree of the layer to be compressed compared with a corresponding network layer in the full-scale model.
In this step, unlike layer-by-layer compression, in the embodiment of the present invention, each network layer in the compression model needs to be set as a layer to be compressed, and state information needs to be created for each layer to be compressed, so as to obtain a state information sequence, so as to perform overall compression on the compression model. It should be noted that, the relevant description about the status information is consistent with the above embodiment, and will not be repeated here.
S503, inputting the state information sequence into a global compression decision device to obtain the compression rate of each layer to be compressed, and compressing each layer to be compressed according to the compression rate; the global compression decision maker comprises a sequence processing unit which is used for extracting the dependency relationship in the state information sequence.
In this step, the state information sequence is processed by the global compression decision maker to obtain the compression rate of each layer to be compressed, where the global compression decision maker is a neural network model. In order to determine the association relationships among the network layers of the compression model, the embodiment of the invention specifically provides a sequence processing unit in the global compression decision maker for extracting the dependency relationships in the state information sequence. The sequence processing unit may be composed of units capable of capturing long-term dependencies, such as LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) cells, so as to better understand and optimize the interactions between the network layers and thus compress the network layers jointly.
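A minimal sketch of such a sequence-processing decision maker, assuming PyTorch and an LSTM-based sequence processing unit; the state dimension, hidden size and sigmoid output head are illustrative choices, not requirements of the method:

```python
import torch
import torch.nn as nn

class GlobalCompressionDecisionMaker(nn.Module):
    """Reads the per-layer state information as a sequence and emits one
    compression rate per network layer of the compression model."""

    def __init__(self, state_dim=7, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, state_sequence):
        # state_sequence: (batch, layers_to_compress, state_dim)
        hidden, _ = self.lstm(state_sequence)
        # one compression rate in (0, 1) per layer to be compressed
        return torch.sigmoid(self.head(hidden)).squeeze(-1)
```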
It should be noted that, the specific manner of compressing each layer to be compressed according to the compression rate is consistent with the above embodiment, and will not be described herein again.
S504, when the compression model is compressed, training the compression model by using the local training data to restore the precision of the compression model.
It can be understood that after the compression is completed, in order to compensate for the loss of model accuracy caused by the compression, the embodiment of the invention can start from the current state of the compression model, adjust hyperparameters such as the learning rate under the same training and validation data sets, and continue training the model to restore the accuracy of the compression model, thereby balancing the reduction in the parameter quantity against the loss of model accuracy.
S505, judging whether the compression stop condition is satisfied.
It should be noted that the compression stop condition in the embodiment of the present invention is used to determine whether the compression model has been compressed to a proper extent so that compression may be exited. Embodiments of the present invention do not limit the specific compression stop condition, which is related to the training situation of the global compression decision maker. For example, while the global compression decision maker is still in training, it may give a poor compression rate, resulting in a compression model that still cannot recover the required accuracy after retraining; at this point it may be determined that the compression stop condition is met, and the compression model is rolled back to the previous version (e.g., the original version or the previous compressed version). For another example, when the global compression decision maker has completed training, it basically gives a suitable compression rate, and the compression model only needs to be compressed a fixed number of times, so the compression stop condition may be: judging whether the number of times the compression model has been compressed reaches a preset number; if so, it is judged that the compression stop condition is met, and otherwise it is judged that the compression stop condition is not met.
Based on this, determining whether the compression stop condition is satisfied includes:
step 1101: when the global compression decision maker has completed training, judging whether the number of times the compression model has been compressed as a whole reaches the preset number of times; if yes, go to step 1102; if not, go to step 1103;
step 1102: judging that the compression stop condition is satisfied;
step 1103: it is determined that the compression stop condition is not satisfied.
S506, if yes, the compression is exited.
As described above, if it is determined that the compression stop condition is satisfied, compression of the compression model may be exited.
And S507, if not, updating the compression state information in the state information of each layer to be compressed, and entering a step of inputting a state information sequence into a global compression decision device based on the updated state information to obtain the compression rate of each layer to be compressed.
As described above, if it is determined that the compression stop condition is not satisfied, this means that the current compression model is still to be compressed. At this time, the embodiment of the invention can update the compression state information in the state information of each layer to be compressed, so as to continuously compress the layer to be compressed based on the updated state information.
Based on the above embodiment, the above overall compression method is described below with reference to a specific schematic diagram. Referring to fig. 4, fig. 4 is a schematic diagram of overall compression according to an embodiment of the present invention. The network layers belong to a machine learning model, and the numbers (1 to n+1) indicate the positions of the network layers in the machine learning model. The embodiment of the invention can perform overall compression on all network layers in the machine learning model, which comprises four steps: 1. extracting the state information of all network layers, which may include the layer number of each network layer, the type of the network layer, the input scale of the feature map, the output scale of the feature map, the compression rate of the network layer, the degree of variation of the feature map of the network layer, and the accuracy of the compression model; 2. sequentially inputting each piece of state information to the recurrent neural network layers (LSTM units) in the global compression decision maker. Each recurrent neural network layer in the global compression decision maker corresponds one-to-one to a network layer in the machine learning model and is used to receive and process the state information of the corresponding network layer and output the compression rate of that network layer. It is worth pointing out that the recurrent neural network layers are connected in sequence and can process the state information of each network layer in order, so as to mine the dependency relationships among the state information of the network layers; 3. inputting the compression rates to an executor, so that the executor selects compression strategies matching the compression rates and compresses each network layer; 4. retraining the model after compression to restore the model accuracy, and choosing to continue compressing the compression model or to exit compression according to the compression condition of the compression model.
Based on the above embodiments, a specific training manner of the global compression decision maker is described in detail below. Similar to the hierarchical compression decision maker, the global compression decision maker can also be directly performed in the federal learning process without special offline training, so that the training environment in each edge device can be better adapted. In one possible scenario, the overall compression of the trained machine learning model may include:
S601, setting a full-scale model and a compression model by using the trained machine learning model.
S602, setting each network layer in the compression model as a layer to be compressed, and creating state information for each layer to be compressed to obtain a state information sequence; the state information comprises self attribute information and compression state information of the layer to be compressed, and the compression state information represents the compression degree of the layer to be compressed compared with a corresponding network layer in the full-scale model.
S603, inputting the state information sequence into a global compression decision device to obtain the compression rate of each layer to be compressed, and compressing each layer to be compressed according to the compression rate; the global compression decision maker comprises a sequence processing unit which is used for extracting the dependency relationship in the state information sequence.
S604, when the compression model is compressed, training the compression model by using the local training data to restore the precision of the compression model.
As described above, since the global compression decision maker can be directly performed in the federal learning process, for the global compression decision maker that does not complete training, it can still directly compress the machine learning model trained by the edge device. Therefore, the descriptions of steps S601 to S604 are identical to those of the above embodiments, and are not repeated here.
S605, when the global compression decision maker has not completed training, determining a reward value for the global compression decision maker in the current compression;
where the reward value is determined by the strategy output by the global compression decision maker in the current compression (the strategy comprising the compression rate corresponding to each layer to be compressed in the compression model in the current compression) and a validation set Xval, the validation set being used to determine the accuracy of the full-scale model and the compression model. The reward value is composed of a first reward term and a second reward term. The first reward term is determined by a preset accuracy loss threshold b, an adjustment factor, and the accuracy difference Delta_Acc, where Delta_Acc is the accuracy of the full-scale model minus the current accuracy of the compression model. The second reward term is determined by the ratio between the floating point operations of the compression model and the floating point operations of the full-scale model.
Embodiments of the present invention may use the reward value to reward the global compression decision maker, whose objective is to maximize the reward value. As described above, the reward value includes a first reward term and a second reward term. The first reward term ensures that the accuracy loss of the compressed network on the validation set Xval does not exceed b, because the first reward term is piecewise: when Delta_Acc is smaller than or equal to b, the first reward term is in a linear relation with Delta_Acc, and the smaller Delta_Acc is, the larger its contribution to the final reward value; when Delta_Acc is greater than b, the first reward term is in an exponential relation with Delta_Acc, and the larger Delta_Acc is, the smaller the first reward term becomes, which forces the global compression decision maker to keep the accuracy of the compression model within a tolerable range. The second reward term is used to encourage the global compression decision maker to remove as many filters as possible and reduce the floating point computation amount. Therefore, the reward value takes both model accuracy and model computation amount into account, and the computational complexity of the model can be reduced as much as possible on the premise of ensuring accuracy.
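Since the reward formulas likewise appear only as figures in the original filing, the following sketch assumes one plausible piecewise form matching the description above (linear below the threshold b, an exponential penalty above it, plus a term favoring fewer floating point operations); the coefficients and names are assumptions:

```python
import math

def reward_value(delta_acc, flops_ratio, b=0.02, adjustment=1.0):
    """Assumed shape of the reward for the global compression decision maker.

    delta_acc  : accuracy of the full-scale model minus the current accuracy
                 of the compression model on the validation set Xval
    flops_ratio: FLOPs of the compression model / FLOPs of the full-scale model
    b          : preset accuracy loss threshold
    """
    if delta_acc <= b:
        r_acc = adjustment * (b - delta_acc)   # linear: smaller loss, larger reward
    else:
        r_acc = -math.exp(delta_acc - b)       # exponential penalty beyond the threshold
    r_flops = 1.0 - flops_ratio                # fewer FLOPs, larger reward
    return r_acc + r_flops
```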
S606, updating the model of the global compression decision maker by using the reward value.
S607, judging whether the current precision of the compression model is smaller than a preset minimum precision value when the global compression decision-making device does not complete training; if the current precision is smaller than the preset minimum precision value, the step S608 is entered; if the current precision is not less than the preset minimum precision value, the step S609 is entered;
S608, judging that the compression stop condition is met, and rolling back the compression model to the previous version;
S609, judging whether the current compression rate of each layer to be compressed is smaller than a preset compression rate; if the current compression rate is smaller than the preset compression rate, entering S610; if the current compression rate is not less than the preset compression rate, entering S613;
S610, adding 1 to the duration of the low compression rate corresponding to the compression model, and judging whether the duration of the low compression rate is greater than a preset number; if the duration of the low compression rate is greater than the preset number, the process proceeds to S611; if the duration of the low compression rate is not greater than the preset number, the process goes to S612; the initial value of the duration of the low compression rate is 0;
S611, judging that the compression stop condition is met, and judging that the global compression decision maker has completed training;
S612, judging that the compression stop condition is not satisfied.
In steps S607-S612, if it is determined that the current precision of the compression model meets the requirement of the preset minimum precision value, whether to exit compression of the layer to be compressed may be further determined according to the current compression rate of each layer of the compression model. Specifically, it may be first determined whether the current compression rate of each layer of the compression model is smaller than a preset compression rate. If the compression rate of each layer is determined to be smaller than the preset compression rate, the current compressible space of the compression model is smaller, and the exiting condition is basically met. In order to further improve the reliability of judgment, the embodiment of the invention can also set the corresponding low compression rate duration for the compression model, and add one to the low compression rate duration when each time that the current compression rate of each layer of the compression model is determined to be lower than the preset compression rate. Furthermore, when the duration of the low compression rate is greater than the preset number, not only means that the compression model has no compressible space, but also means that the global compression decision maker does not seriously lose the precision of the compression model all the time when compressing the layer to be compressed, so that the layer to be compressed can be judged to meet the compression stop condition, and the global compression decision maker can be judged to complete training.
S613, it is determined that the compression stop condition is not satisfied.
S614, if the compression stopping condition is judged to be met, the compression is stopped;
S615, if the compression stopping condition is not met, updating the compression state information in the state information of each layer to be compressed, and entering the step of inputting the state information sequence into the global compression decision maker based on the updated state information to obtain the compression rate of each layer to be compressed.
Based on the above embodiments, it is noted that in the related art the federated learning mode generally cannot achieve high training efficiency, because the local training data of the edge devices are generally different and are not independent and identically distributed. For example, for a classification task, the data in edge device A may be primarily type 1 data, while the data in edge device B may be primarily type 2 data; that is, the quantity and distribution of the different types of data may differ among the edge devices. As a result, the model parameters trained by different edge devices also differ: for example, the model parameters trained by edge device A have good classification performance for type 1 but poor classification performance for other types, while the model parameters trained by edge device B have good classification performance for type 2 but poor classification performance for other types. Furthermore, if machine learning model parameters with large differences are globally aggregated directly, it is difficult to achieve a high convergence speed for the global aggregation model; and considering that the global aggregation model is a comprehensive model integrating the local data characteristics of all edge devices, simply and uniformly aggregating all the model parameters can cause the aggregated model to exhibit offset errors on different federated computing devices, or even to degrade, which is not conducive to the efficient development of edge learning. Moreover, when there are many edge devices, directly aggregating model parameters at the edge cloud server also makes it difficult to achieve high communication efficiency. In short, the traditional federated learning mode cannot achieve high training efficiency and communication efficiency, and part of the communication between the edge devices and the edge cloud server does not contribute significantly to accelerating model convergence, thereby bringing meaningless communication pressure to the edge cloud server. Therefore, in the embodiment of the invention, in the process of clustering the edge devices, the training data similarity among the edge devices can be determined, and the edge devices can be divided into a plurality of device clusters by using the training data similarity; that is, the characteristics and distribution of the heterogeneous data can be considered in the clustering process, and similar data are placed in the same cluster, so that the similarity of the data within each cluster is improved.
In this way, when hierarchical federated training is performed, the local machine learning model parameters of the edge devices can first be aggregated within each device cluster, so that similar machine learning model parameters are aggregated first; the cluster head edge device of each device cluster then sends the cluster aggregation model parameters obtained by intra-cluster aggregation to the edge cloud server for global aggregation. This can improve the convergence rate of the global aggregation model, improve the value of the model parameters transmitted by the edge devices to the edge cloud server, and reduce the number of communications between the edge devices and the edge cloud server that contribute little to improving the model convergence rate; and by setting up hierarchical federated training, the communication pressure at the edge cloud server can also be reduced, so that communication efficiency can be improved. Specific embodiments of device clustering are described below. In one possible scenario, after iteratively training the local machine learning model using the local training data, the method further comprises:
S301, receiving a test data set issued by an edge cloud server; each edge device corresponds to the same test data set.
S302, inputting the test data set into a local machine learning model to obtain an inference result corresponding to the test data set.
In the embodiment of the invention, considering that the inference capability of a machine learning model is directly related to its training content and training data quantity, the edge cloud server can issue the same test data set to all edge devices and determine how the machine learning model local to each edge device performs inference on the same test data set, so that the distribution of the various kinds of training data in each edge device can be roughly determined, and the training data similarity among the edge devices can then be determined based on the inference results.
S303, sending the reasoning result to an edge cloud server, and receiving equipment cluster information issued by the edge cloud server; the edge cloud server divides each edge device into a plurality of device clusters by using training data similarity determined by the reasoning result of each edge device.
In this step, the edge devices may upload the inference results to the edge cloud server, so that the edge cloud server divides the edge devices into a plurality of device clusters using the training data similarity determined from the inference results of each edge device. Then, when the device clustering is completed, the edge cloud server only needs to send the device cluster information to each edge device, so that the edge devices form device clusters to perform hierarchical federated learning.
It should be noted that, the embodiment of the present invention is not limited to how to divide each edge device into a plurality of device clusters based on the similarity of training data, for example, clustering may be performed by adopting a clustering means, or clustering may be performed by constructing a weighted undirected graph, and may be set according to actual application requirements.
The following briefly describes a process for device clustering by constructing a weighted undirected graph. The edge cloud server can construct a weighted undirected graph according to the training data similarity among the edge devices, wherein the nodes contained in the weighted undirected graph correspond one-to-one to the edge devices, the edges between the nodes carry weights, and the value of a weight is the training data similarity between the corresponding edge devices. Further, in order to facilitate device clustering and to let each edge device know which of its neighboring edge devices have training data more similar to its own, when setting the edges, the embodiment of the invention can require that an edge is set between two nodes if and only if the training data similarity between the corresponding edge devices is greater than a first threshold, the first threshold being a preset value. For ease of understanding, please refer to fig. 5; fig. 5 is a schematic diagram of a weighted undirected graph provided by the embodiment of the present invention, in which device nodes 1 to 6 correspond to 6 edge devices, and the weight of the connection between nodes is the training data similarity of the corresponding edge devices; for example, the training data similarity between device 1 and device 2 is 0.94, the training data similarity between device 2 and device 3 is 0.83, and so on. In addition, the weighted undirected graph can be constructed based on a preset first threshold (such as 0.7), that is, two nodes are connected only when the training data similarity between the corresponding edge devices is greater than 0.7. Assuming that the training data similarity between device 2 and device 6 is 0.68, device 2 and device 6 are not connected in the weighted undirected graph.
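A brief sketch of constructing such a weighted undirected graph from a pairwise training data similarity matrix; representing the graph as a dictionary of edges and the default threshold of 0.7 are illustrative choices:

```python
def build_weighted_undirected_graph(similarity, first_threshold=0.7):
    """similarity[i][j] is the training data similarity between edge devices i and j.
    An edge, weighted by that similarity, is created only when the similarity
    exceeds the first threshold."""
    n = len(similarity)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > first_threshold:
                edges[(i, j)] = similarity[i][j]
    return edges
```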
Further, the edge device may cluster in the weighted undirected graph in a label iteration manner, and the process may be:
step 1201: setting a plurality of initial clusters by using a selected edge with a weight greater than a second threshold in the weighted undirected graph, and setting cluster labels of the initial clusters for nodes in the initial clusters; the nodes in the initial clusters are communicated by the selected edges, the selected edges do not exist among the initial clusters, and the second threshold is larger than the first threshold;
step 1202: initializing the label change quantity to be zero, traversing all nodes which are not set to an initial cluster in the weighted undirected graph, and counting the occurrence times of various cluster labels in neighbor nodes of the target node aiming at each traversed target node;
step 1203: if the target node is not provided with a cluster label or the cluster label of the target node is different from the cluster label with the largest occurrence number, the cluster label of the target node is set to be the cluster label with the largest occurrence number, and the number of label changes is added with 1;
step 1204: judging whether the number of label changes is larger than a preset value or not after node traversal is completed; if yes, go to step 1205, if no, go to step 1206;
step 1205: initializing the label change quantity to zero, and traversing all nodes which are not set to an initial cluster in the weighted undirected graph;
Step 1206: and dividing the edge devices corresponding to the nodes with the same cluster labels into the same device cluster.
Specifically, the embodiment of the invention sets a plurality of initial clusters by using the selected edges whose weights are greater than the second threshold in the weighted undirected graph, and sets the cluster label of the corresponding initial cluster for the nodes in each initial cluster, wherein the nodes in an initial cluster are connected by selected edges, no selected edges exist between initial clusters, and the second threshold is greater than the first threshold. It should be noted that the embodiment of the present invention does not limit the specific value of the second threshold, which may be preset according to the actual situation. For example, in fig. 5, if the second threshold is set to 0.9, devices 1-2 and devices 3-4 may form two initial clusters, so label A may be set for devices 1 and 2 and label B for devices 3 and 4. The embodiment of the invention can then initialize the number of label changes to zero and traverse all nodes in the weighted undirected graph that have not been set to an initial cluster. For each traversed target node, the embodiment of the invention can count the occurrence times of the various cluster labels among the neighbor nodes of the target node. If the target node has no cluster label, or its cluster label differs from the cluster label with the largest occurrence count, the cluster label of the target node is set to the cluster label with the largest occurrence count, and 1 is added to the number of label changes. For example, in fig. 5, since device 6 has not been set to an initial cluster, when it is traversed for the first time its neighbor nodes, device 3 and device 4, may be counted; since the labels of both device 3 and device 4 are label B, label B can be set for device 6. Further, when the node traversal is completed, it may be determined whether the number of label changes is greater than a preset value. If it is greater, this means that label propagation has not yet stopped; at this point the number of label changes can be reinitialized to zero and the next traversal performed. Otherwise, if the number of label changes is not greater than the preset value, label propagation has essentially stopped, and the edge devices corresponding to nodes with the same cluster label can be divided into the same device cluster. It should be noted that the embodiment of the present invention does not limit the specific value of the preset value, which may be set according to actual application requirements. For example, in fig. 5, devices 5 and 6 are each assigned label B, so the finally obtained device clusters are a first cluster and a second cluster, where the first cluster includes device 1 and device 2, and the second cluster includes devices 3 to 6; for the specific clustering situation, please refer to fig. 6, which is a schematic diagram of device clusters provided in the embodiment of the present invention.
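The label-iteration clustering of steps 1201 to 1206 can be sketched as follows; the seeding of initial clusters is simplified (it merges the endpoints of each heavy edge rather than computing full connected components), and the pass limit and naming are illustrative assumptions:

```python
from collections import Counter

def cluster_by_label_propagation(edges, num_nodes, second_threshold=0.9, preset_value=0):
    """Seed initial clusters from edges heavier than the second threshold,
    then propagate cluster labels until few labels change per pass."""
    labels = {}
    # step 1201 (simplified): endpoints of a selected edge share a seed label
    for (i, j), w in edges.items():
        if w > second_threshold:
            seed = labels.get(i) or labels.get(j) or f"cluster_{min(i, j)}"
            labels[i] = labels[j] = seed
    neighbors = {n: [] for n in range(num_nodes)}
    for (i, j) in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    unseeded = [n for n in range(num_nodes) if n not in labels]
    for _ in range(100):                       # bounded number of passes
        changes = 0                            # step 1202
        for node in unseeded:
            counts = Counter(labels[nb] for nb in neighbors[node] if nb in labels)
            if not counts:
                continue
            best = counts.most_common(1)[0][0]
            if labels.get(node) != best:       # step 1203
                labels[node] = best
                changes += 1
        if changes <= preset_value:            # steps 1204-1206
            break
    return labels                              # same label -> same device cluster
```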
It should be noted that the device cluster information sent by the edge cloud server to an edge device may include the weighted undirected graph of its device cluster, such as the portion corresponding to label A and the portion corresponding to label B in fig. 7. Further, each edge device may determine its own neighbor edge devices based on the weighted undirected graph. At this time, in addition to performing intra-cluster aggregation, each edge device may perform neighbor aggregation with its neighbor edge devices, specifically: the edge device broadcasts its local machine learning model parameters to the neighbor edge devices, and aggregates and updates its local machine learning model parameters using the machine learning model parameters transmitted by the neighbor edge devices. By setting up neighbor aggregation, the embodiment of the invention can ensure that similar machine learning model parameters are aggregated first, and that more information is exchanged between similar edge devices through neighbor model aggregation, so that the contribution of the machine learning parameters in each edge device to rapid model convergence can be improved. In this way, the value of the model parameters transmitted by the edge devices to the edge cloud server can be improved, and the number of communications between the edge devices and the edge cloud server that contribute little to improving the model convergence speed can be effectively reduced.
Based on the above, the device cluster information comprises a weighted undirected graph of the device cluster, each node in the weighted undirected graph corresponds to an edge device in the device cluster, an edge is provided between two nodes if and only if the training data similarity between the corresponding edge devices is greater than a first threshold, and the weight of the edge is the training data similarity between the corresponding edge devices;
after iteratively training the local machine learning model using the local training data, further comprising:
step 1301: determining the edge devices corresponding to the nodes connected with its own node in the weighted undirected graph as its own neighbor edge devices;
step 1302: and broadcasting local machine learning model parameters to the neighbor edge devices, and utilizing the machine learning model sent by the neighbor edge devices to aggregate and update the local machine learning model parameters.
It should be noted that the embodiment of the present invention does not limit the specific manner of performing neighbor aggregation on the machine learning model parameters; for example, average aggregation may be performed, and other aggregation manners may also be adopted. For ease of implementation, the embodiment of the invention can average the local machine learning model parameters together with the machine learning model parameters sent by the neighbor edge devices, and update the local machine learning model parameters with the averaged parameters.
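The following sketch illustrates such average aggregation over parameter dictionaries; the use of torch tensors and the function name are assumptions for illustration only.

    import torch

    def neighbor_average(local_params, neighbor_params_list):
        # local_params and each entry of neighbor_params_list: dict mapping
        # parameter name -> torch.Tensor (assumed layout).
        total = 1 + len(neighbor_params_list)
        averaged = {}
        for name, value in local_params.items():
            summed = value.clone()
            for neighbor in neighbor_params_list:
                summed += neighbor[name]
            averaged[name] = summed / total
        return averaged  # used to update the local machine learning model parameters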
Further, in the embodiment of the present invention, the machine learning model may undergo four steps: local training, neighbor aggregation, intra-cluster aggregation, and global aggregation. Each completed global aggregation may be regarded as one complete round of federal learning, and each round of federal learning may include multiple rounds of local training, multiple rounds of neighbor aggregation, and multiple rounds of intra-cluster aggregation. Further, the numbers of rounds of local training, neighbor aggregation, and intra-cluster aggregation may have a quantitative relationship; for example, neighbor aggregation is performed once every time a first preset value of rounds of local training is completed, and intra-cluster aggregation is performed once every time a second preset value of rounds of local training is completed, where the second preset value may be greater than the first preset value and may be a multiple of the first preset value.
Based on this, broadcasting local machine learning model parameters to the neighbor edge device, and using the machine learning model sent by the neighbor edge device to aggregate and update the local machine learning model parameters may include:
step 1401: when the number of training iterations reaches an integer multiple of a first preset value, broadcasting the local machine learning model parameters to its own neighbor edge devices, and aggregating and updating the local machine learning model parameters by using the machine learning model parameters sent by the neighbor edge devices;
Compressing the trained machine learning model and transmitting the compressed machine learning model parameters to the cluster head edge device of the device cluster where the machine learning model is located may include the following step:
step 1402: when the number of training iterations reaches an integer multiple of a second preset value, compressing the trained machine learning model, and sending the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the machine learning model is located; the second preset value is greater than the first preset value.
In addition, the number of rounds of intra-cluster aggregation may also be related to the number of rounds of global aggregation; for example, global aggregation may be performed once every time a third preset value of rounds of intra-cluster aggregation is completed.
Based on this, sending the cluster aggregation model parameter to the edge cloud server for global aggregation may include:
step 1503: and when the aggregation times of the clusters reach the integral multiple of a third preset value, sending the aggregation model parameters of the clusters to an edge cloud server for global aggregation.
It should be noted that, the embodiment of the present invention is not limited to specific values of the first preset value, the second preset value and the third preset value, and may be set according to actual application requirements.
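To make the relationship among the three preset values concrete, the following sketch shows one way the training loop could gate neighbor aggregation, intra-cluster aggregation, and global aggregation; the callback-based structure and all names are assumptions introduced for illustration only.

    def federated_training_loop(total_iterations, first_preset, second_preset, third_preset,
                                train_one_round, neighbor_aggregate,
                                compress_and_send_to_cluster_head, global_aggregate):
        # Counts how many intra-cluster aggregations have been triggered so far.
        cluster_aggregations = 0
        for iteration in range(1, total_iterations + 1):
            train_one_round()                              # one round of local training
            if iteration % first_preset == 0:
                neighbor_aggregate()                       # neighbor aggregation
            if iteration % second_preset == 0:
                compress_and_send_to_cluster_head()        # intra-cluster aggregation
                cluster_aggregations += 1
                if cluster_aggregations % third_preset == 0:
                    global_aggregate()                     # global aggregation at the edge cloud server
        return cluster_aggregations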
The heterogeneous distributed learning device, the edge device, the federal learning system and the computer readable storage medium provided by the embodiments of the present invention are described below, and the heterogeneous distributed learning device, the edge device, the federal learning system and the computer readable storage medium described below and the heterogeneous distributed learning method described above can be correspondingly referred to each other.
Referring to fig. 7, fig. 7 is a block diagram of a heterogeneous distributed learning apparatus according to an embodiment of the present invention, where the apparatus is applied to an edge device, and a federal learning system includes a plurality of edge devices, and the plurality of edge devices are divided into a plurality of device clusters. The apparatus may include:
a local training module 701, configured to iteratively train a local machine learning model using local training data; the machine learning models in the edge devices correspond to the same model structure and reasoning task;
the model compression module 702 is configured to compress the trained machine learning model, and send parameters of the compressed machine learning model to cluster head edge devices of the device cluster where the model is located;
and the intra-cluster aggregation module 703 is configured to perform intra-cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters when belonging to the cluster head edge device, and send the cluster aggregation model parameters to the edge cloud server for global aggregation.
Optionally, the model compression module 702 is specifically configured to:
each network layer in the trained machine learning model is subjected to layer-by-layer traversal compression and/or all network layers in the trained machine learning model are subjected to overall compression.
Alternatively, the model compression module 702 may include:
the first setting submodule is used for setting a full-quantity model and a compression model by utilizing the trained machine learning model;
the first initialization submodule is used for setting a first network layer in the compression model as a layer to be compressed and creating state information for the layer to be compressed; the state information comprises self attribute information and compression state information of the layer to be compressed;
the hierarchical compression sub-module is used for inputting the state information into the hierarchical compression decision-making device to obtain the compression rate of the layer to be compressed, and compressing the layer to be compressed according to the compression rate;
the first precision recovery submodule is used for training the compression model by utilizing the local training data when the layer to be compressed is compressed so as to recover the precision of the compression model;
the first exit judging sub-module is used for judging whether the compression stopping condition is met or not;
the first circulation control sub-module is used for setting the next network layer in the compression model as a layer to be compressed if yes, and entering a step of creating state information for the layer to be compressed;
and the second circulation control sub-module is used for updating the compression state information in the state information of the layer to be compressed if not, and entering a step of inputting the state information into the hierarchical compression decision device based on the updated state information to obtain the compression rate of the layer to be compressed.
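A compact sketch of the layer-by-layer traversal compression flow implemented by these sub-modules is given below; all callables are placeholders standing in for the sub-modules described above, and their names are assumptions.

    def layerwise_traversal_compression(num_layers, build_state, decide_rate,
                                        compress_layer, fine_tune, stop_condition,
                                        update_state):
        layer_index = 0
        state = build_state(layer_index)        # self attribute info + compression state info
        while layer_index < num_layers:
            rate = decide_rate(state)           # hierarchical compression decision maker
            compress_layer(layer_index, rate)   # compress the current layer to be compressed
            fine_tune()                         # train the compression model to recover accuracy
            if stop_condition(layer_index):
                layer_index += 1                # set the next network layer as the layer to be compressed
                if layer_index < num_layers:
                    state = build_state(layer_index)
            else:
                state = update_state(state, layer_index)   # update compression state info, decide again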
Optionally, the self attribute information comprises the number of layers of the layer to be compressed, the type of the network layer, the input scale of the feature map and the output scale of the feature map, and the compression state information comprises the compression rate of the layer to be compressed, the variation degree of the feature map and the precision of a compression model, wherein the variation degree of the feature map represents the variation degree between the layer to be compressed and the corresponding network layer in the full-scale model;
the second cycle control sub-module may include:
the first updating unit is used for updating the compression rate in the compression state information into the compression rate currently output by the hierarchical compression decision device;
the second updating unit is used for updating the variation degree of the feature map in the compression state information by utilizing the compressed layer to be compressed and the corresponding network layer in the full-scale model;
and a third updating unit for determining the latest precision of the compression model and updating the precision in the compression state information with the latest precision.
Optionally, the second updating unit may include:
the setting subunit is used for setting the characteristic diagram of the compressed layer to be a first characteristic diagram and setting the characteristic diagram of the network layer corresponding to the layer to be compressed in the full model to be a second characteristic diagram;
the total element difference value calculation subunit is used for determining element difference values between each element in the first feature diagram and the corresponding element in the second feature diagram, and summing absolute values of all the element difference values to obtain a total element difference value;
And the characteristic diagram variation degree updating subunit is used for determining the average element difference value of each element by utilizing the total element difference value and the number of the elements contained in the first characteristic diagram, and updating the characteristic diagram variation degree by utilizing the average element difference value.
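A minimal sketch of the feature map variability computation performed by these subunits, assuming torch tensors of identical shape, could look as follows.

    import torch

    def feature_map_variability(first_feature_map, second_feature_map):
        # first_feature_map: feature map of the compressed layer to be compressed;
        # second_feature_map: feature map of the corresponding layer in the full-scale model.
        element_diff = first_feature_map - second_feature_map
        total_element_diff = element_diff.abs().sum()                    # sum of absolute element differences
        return (total_element_diff / first_feature_map.numel()).item()  # average element difference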
Optionally, the hierarchical compression sub-module may include:
and the compression unit is used for determining the compression grade corresponding to the compression rate and compressing the layer to be compressed by utilizing the compression mode corresponding to the compression grade.
Alternatively, the compression unit may include:
the first compression subunit is used for carrying out weight pruning on the layer to be compressed when the compression rate is determined to be located in a first preset interval; the lower limit value of the first preset interval is 0;
the second compression subunit is used for sequentially carrying out weight pruning and low-rank approximate compression on the layer to be compressed when the compression rate is determined to be located in a second preset interval; the lower limit value of the second preset interval is the upper limit value of the first preset interval;
the third compression subunit is used for sequentially carrying out weight pruning, low-rank approximate compression and quantization compression on the layer to be compressed when the compression rate is determined to be located in a third preset interval; the lower limit value of the third preset interval is the upper limit value of the second preset interval.
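The interval-to-method mapping applied by these compression subunits can be sketched as follows; the interval bounds and the pruning, low-rank, and quantization helpers are assumed placeholders rather than the patented implementations.

    def compress_by_level(layer, rate, first_upper, second_upper,
                          weight_prune, low_rank_approximate, quantize):
        # first preset interval: (0, first_upper]; second: (first_upper, second_upper];
        # third: above second_upper. The bounds are assumptions for illustration.
        if 0 < rate <= first_upper:
            weight_prune(layer, rate)                 # weight pruning only
        elif rate <= second_upper:
            weight_prune(layer, rate)                 # weight pruning, then
            low_rank_approximate(layer, rate)         # low-rank approximate compression
        else:
            weight_prune(layer, rate)
            low_rank_approximate(layer, rate)
            quantize(layer, rate)                     # quantization compression last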
Optionally, the model compression module 702 may further include:
The loss value calculation sub-module is used for determining, when the hierarchical compression decision maker has not completed training, the loss value of the hierarchical compression decision maker in the current compression according to the following formula:
wherein Delta_Acc represents the loss value, Acc_comp represents the current accuracy of the compression model, Acc_full represents the accuracy of the full-scale model, Parameter-Reduction represents the parameter quantity reduced by compression, Total-Parameter represents the total parameter quantity of the full-scale model, and λ represents the regulation factor; an assumed reconstruction of this formula is given after this sub-module list;
and the hierarchical compression decision maker updating sub-module is used for updating the model of the hierarchical compression decision maker by using the loss value.
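The loss formula referenced above is not reproduced in this text; one plausible form that combines the listed quantities, stated purely as an assumption rather than as the patented formula, is:

    % assumed reconstruction, not taken from the patent text
    \Delta\mathrm{Acc} =
        \left(\mathrm{Acc}_{\mathrm{full}} - \mathrm{Acc}_{\mathrm{comp}}\right)
        - \lambda \cdot \frac{\text{Parameter-Reduction}}{\text{Total-Parameter}}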
Optionally, the first exit determination sub-module may include:
the first judging unit is used for judging whether the current precision of the compression model is smaller than a preset minimum precision value or not when the hierarchical compression decision device does not complete training;
the first rollback unit is used for judging that the compression stopping condition is met if the current precision is smaller than a preset minimum precision value, and rollback the layer to be compressed into the previous version;
the second judging unit is used for judging whether the current compression rate of the layer to be compressed is smaller than the preset compression rate or not if the current precision is not smaller than the preset minimum precision value;
the third judging unit is used for adding 1 to the duration of the low compression rate corresponding to the layer to be compressed if the current compression rate is smaller than the preset compression rate, and judging whether the duration of the low compression rate is larger than the preset time; if the duration of the low compression ratio is greater than the preset times, judging that the compression stop condition is met; if the duration of the low compression rate is not more than the preset times, judging that the compression stop condition is not met; the initial value of the duration of the low compression ratio is 0;
And the first judging unit is used for judging that the compression stopping condition is not met if the current compression rate is not smaller than the preset compression rate.
Optionally, the first exit determination sub-module may further include:
the training completion judging unit is used for judging whether the duration times of the low compression rate corresponding to each network layer in the compression model are all larger than preset times or not;
if yes, it is determined that the hierarchical compression decision maker has completed training.
Optionally, the first exit judging sub-module may include:
the fourth judging unit is used for judging whether the compressed times of the layer to be compressed reach the preset times or not when the hierarchical compression decision device finishes training; if yes, judging that the compression stopping condition is met; if not, it is determined that the compression stop condition is not satisfied.
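The exit-judgment logic described by the units above can be sketched as follows; the dictionary-based state and the argument names are assumptions for illustration only.

    def layer_stop_condition(state, min_accuracy, preset_rate, preset_times,
                             decision_maker_trained, preset_compress_count):
        # state: dict with current_accuracy, current_rate, low_rate_count, compress_count.
        if decision_maker_trained:
            # After training, stop once the layer has been compressed a preset number of times.
            return state["compress_count"] >= preset_compress_count
        if state["current_accuracy"] < min_accuracy:
            return True      # stop; the layer is rolled back to the previous version elsewhere
        if state["current_rate"] < preset_rate:
            state["low_rate_count"] += 1
            return state["low_rate_count"] > preset_times
        return False         # compression rate is not low, keep compressing this layer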
Alternatively, the model compression module 702 may include:
the second setting submodule is used for setting a full-quantity model and a compression model by utilizing the trained machine learning model;
the second initialization submodule is used for setting each network layer in the compression model as a layer to be compressed, and creating state information for each layer to be compressed to obtain a state information sequence; the state information comprises self attribute information and compression state information of the layer to be compressed, and the compression state information represents the compression degree of the layer to be compressed compared with the corresponding network layer in the full model;
The global compression sub-module is used for inputting the state information sequence into the global compression decision-making device to obtain the compression rate of each layer to be compressed, and compressing each layer to be compressed according to the compression rate; the global compression decision maker comprises a sequence processing unit, wherein the sequence processing unit is used for extracting the dependency relationship in the state information sequence;
the second precision recovery sub-module is used for training the compression model by utilizing the local training data when the compression model is compressed so as to recover the precision of the compression model;
the second exit judging sub-module is used for judging whether the compression stopping condition is met or not; if yes, the compression is exited;
and the third circulation control sub-module is used for updating the compression state information in the state information of each layer to be compressed if not, and entering a step of inputting a state information sequence into the global compression decision device based on the updated state information to obtain the compression rate of each layer to be compressed.
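The global compression decision maker only requires a sequence processing unit that extracts dependencies in the state information sequence; a minimal sketch using a GRU is shown below, where the choice of GRU, the layer sizes, and the sigmoid output head are assumptions for illustration rather than the patented design.

    import torch
    import torch.nn as nn

    class GlobalCompressionDecisionMaker(nn.Module):
        def __init__(self, state_dim, hidden_dim=64):
            super().__init__()
            # Sequence processing unit extracting dependencies in the state information sequence.
            self.sequence_unit = nn.GRU(state_dim, hidden_dim, batch_first=True)
            self.rate_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

        def forward(self, state_sequence):
            # state_sequence: tensor of shape (1, num_layers, state_dim), one row per layer to be compressed
            features, _ = self.sequence_unit(state_sequence)
            return self.rate_head(features).squeeze(-1)   # one compression rate per layer in (0, 1)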
Optionally, the model compression module 702 may further include:
the rewarding value calculating submodule is used for determining the rewarding value of the global compression decision maker in the compression when the global compression decision maker does not complete training according to the following formula:
wherein R represents the reward value, π represents the strategy output by the global compression decision maker in the current compression, which includes the compression rate corresponding to each layer to be compressed in the compression model in the current compression, D_val represents a validation set used to determine the accuracy of the full-scale model and the compression model, R_err is the first reward item, b represents a preset accuracy-loss threshold, and R_err is determined by the following formula:
wherein Delta_Acc represents the accuracy difference between the full-scale model and the compression model, Acc_full represents the accuracy of the full-scale model, Acc_comp represents the current accuracy of the compression model, λ represents the regulation factor, R_flops represents the second reward item, and R_flops is determined by the following formula:
wherein FLOPs_ratio represents the ratio between the floating-point operation count of the compression model and that of the full-scale model; assumed reconstructions of these formulas are given after this sub-module list;
and the global compression decision maker updating sub-module is used for updating the model of the global compression decision maker by using the reward value.
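The reward formulas referenced above are not reproduced in this text; one plausible set of forms combining the listed quantities, stated purely as an assumption rather than as the patented formulas, is:

    % assumed reconstruction, not taken from the patent text
    R(\pi, D_{\mathrm{val}}) = R_{\mathrm{err}} + R_{\mathrm{flops}}, \qquad
    R_{\mathrm{err}} = \lambda \cdot \bigl(b - \Delta\mathrm{Acc}\bigr), \quad
    \Delta\mathrm{Acc} = \mathrm{Acc}_{\mathrm{full}} - \mathrm{Acc}_{\mathrm{comp}}, \qquad
    R_{\mathrm{flops}} = 1 - \frac{\mathrm{FLOPs}_{\mathrm{comp}}}{\mathrm{FLOPs}_{\mathrm{full}}}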
Optionally, the second exit determination sub-module may include:
the fifth judging unit is used for judging whether the current precision of the compression model is smaller than a preset minimum precision value or not when the global compression decision-making device does not complete training;
the second rollback unit is used for judging that the compression stopping condition is met if the current precision is smaller than a preset minimum precision value and rolling back the compression model into the previous version;
A sixth judging unit, configured to judge whether the current compression rate of each layer to be compressed is smaller than the preset compression rate if the current precision is not smaller than the preset minimum precision value;
a seventh judging unit, configured to add 1 to the duration of the low compression rate corresponding to the compression model if the current compression rate is less than the preset compression rate, and judge whether the duration of the low compression rate is greater than the preset number; if the duration of the low compression rate is greater than the preset times, judging that the compression stopping condition is met, and judging that the global compression decision device finishes training; if the duration of the low compression rate is not more than the preset times, judging that the compression stop condition is not met; the initial value of the duration of the low compression ratio is 0;
and the second judging unit is used for judging that the compression stop condition is not met if the current compression rate is not smaller than the preset compression rate.
Optionally, the second exit determination sub-module may include:
the eighth judging unit is used for judging whether the overall compression frequency of the compression model reaches the preset frequency or not when the global compression decision device has completed training; if yes, judging that the compression stopping condition is met; if not, it is determined that the compression stop condition is not satisfied.
Optionally, the model compression module 702 is specifically configured to:
Each network layer in the trained machine learning model is subjected to layer-by-layer traversal compression, and after the layer-by-layer traversal compression is finished, all network layers in the trained machine learning model are subjected to overall compression.
Optionally, the apparatus may further include:
the receiving module is used for receiving the test data set issued by the edge cloud server; each edge device corresponds to the same test data set;
the test module is used for inputting the test data set into a local machine learning model to obtain an inference result corresponding to the test data set;
the clustering module is used for sending the reasoning result to the edge cloud server and receiving equipment cluster information issued by the edge cloud server; the edge cloud server divides each edge device into a plurality of device clusters by using training data similarity determined by the reasoning result of each edge device.
Optionally, the device cluster information includes a weighted undirected graph of the device cluster, each node in the weighted undirected graph corresponds to each edge device in the device cluster, edges are arranged between the nodes, and if and only if the similarity of training data between the corresponding edge devices is greater than a first threshold, the weight of the edges is the similarity of training data between the corresponding edge devices;
The apparatus may further include:
the neighbor device determining module is used for determining the edge devices corresponding to the nodes connected with its own node in the weighted undirected graph as its own neighbor edge devices;
and the neighbor aggregation module is used for broadcasting local machine learning model parameters to the neighbor edge equipment and utilizing the machine learning model sent by the neighbor edge equipment to aggregate and update the local machine learning model parameters.
Optionally, the neighbor aggregation module is specifically configured to:
when the number of training iterations reaches an integer multiple of a first preset value, broadcasting the local machine learning model parameters to its own neighbor edge devices, and aggregating and updating the local machine learning model parameters by using the machine learning model parameters sent by the neighbor edge devices;
the model compression module 702 is specifically configured to:
when the number of iterations reaches an integer multiple of a second preset value, compressing the trained machine learning model, and sending the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the machine learning model is located; the second preset value is greater than the first preset value.
Optionally, the intra-cluster aggregation module 703 may include:
and the global aggregation sub-module is used for sending the cluster aggregation model parameters to the edge cloud server for global aggregation when the aggregation times of the clusters reach the integral multiple of the third preset value.
Optionally, the intra-cluster aggregation module 703 may include:
the sending submodule is used for sending the local machine learning model parameters to cluster head edge equipment;
and the intra-cluster aggregation sub-module is used for, when belonging to the cluster head edge device of the device cluster, aggregating the received machine learning model parameters through the following formula to obtain the cluster aggregation model parameters:
wherein w_c^{t+1} represents the cluster aggregation model parameters obtained by the aggregation of the c-th device cluster in the (t+1)-th round of intra-cluster aggregation, w_c^{t} represents the cluster aggregation model parameters obtained by the aggregation of the c-th device cluster in the t-th round of intra-cluster aggregation, w_{c,j}^{t+1} represents the locally compressed machine learning model parameters of the j-th edge device in the c-th device cluster before the (t+1)-th round of intra-cluster aggregation, N_c represents the number of edge devices in the c-th device cluster, and μ represents the hyperparameter.
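The aggregation formula itself is not reproduced in this text; one plausible form consistent with the symbols listed above, stated purely as an assumption rather than as the patented formula, is:

    % assumed reconstruction, not taken from the patent text
    w_c^{t+1} = \mu \, w_c^{t} + \frac{1-\mu}{N_c} \sum_{j=1}^{N_c} w_{c,j}^{t+1}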
Referring to fig. 8, fig. 8 is a block diagram of an edge device according to an embodiment of the present invention, and an edge device 20 according to an embodiment of the present invention includes a processor 21 and a memory 22; wherein the memory 22 is used for storing a computer program; the processor 21 is configured to execute the heterogeneous distributed learning method provided in the foregoing embodiment when executing the computer program.
For the specific process of the heterogeneous distributed learning method, reference may be made to the corresponding content provided in the foregoing embodiment, and no further description is given here.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk or an optical disk, and the storage mode may be transient storage or permanent storage.
In addition, the edge device 20 further includes a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26; wherein the power supply 23 is configured to provide an operating voltage for each hardware device on the edge device 20; the communication interface 24 can create a data transmission channel between the edge device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present invention, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
Referring to fig. 9, fig. 9 is a block diagram of a federal learning system according to an embodiment of the present invention, and the embodiment of the present invention further provides a federal learning system, which may include:
the edge cloud server 10 is used for globally aggregating the received cluster aggregation model parameters;
a plurality of edge devices 20 for performing the heterogeneous distributed learning method as described in the above embodiments.
Of course, the federal learning system can have other forms, such as the form shown in FIG. 1. And, the edge cloud server 10 may also be used to cluster multiple edge devices 20.
Since the embodiments of the federal learning system portion correspond to the embodiments of the heterogeneous distributed learning method portion, the embodiments of the federal learning system portion refer to the descriptions of the embodiments of the heterogeneous distributed learning method portion, and are not repeated herein.
The embodiment of the invention also provides a computer readable storage medium, and a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the heterogeneous distributed learning method applied to the edge cloud server or the heterogeneous distributed learning method applied to the edge device is realized.
Since the embodiments of the computer readable storage medium portion and the embodiments of the heterogeneous distributed learning method portion correspond to each other, the embodiments of the storage medium portion are referred to the description of the embodiments of the heterogeneous distributed learning method portion, and are not repeated herein.
In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The heterogeneous distributed learning method, the heterogeneous distributed learning device, the heterogeneous distributed learning equipment, the heterogeneous distributed learning system and the heterogeneous distributed learning medium provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and practiced without departing from the spirit of the present invention.

Claims (25)

1. A heterogeneous distributed learning method applied to an edge device, a federal learning system comprising a plurality of said edge devices, a plurality of said edge devices having been partitioned into a plurality of clusters of devices, said method comprising:
performing iterative training on a local machine learning model by using local training data; the machine learning model in each edge device corresponds to the same model structure and reasoning task;
compressing the trained machine learning model, and transmitting the parameters of the compressed machine learning model to cluster head edge equipment of the equipment cluster where the machine learning model is located;
When the cluster head edge equipment belongs to the cluster head edge equipment, performing cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters, and sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
2. The heterogeneous distributed learning method of claim 1, wherein the compressing the trained machine learning model comprises:
each network layer in the trained machine learning model is subjected to layer-by-layer traversal compression and/or all network layers in the trained machine learning model are subjected to overall compression.
3. The heterogeneous distributed learning method of claim 2, wherein the performing layer-by-layer traversal compression of each network layer in the trained machine learning model comprises:
setting a full-scale model and a compression model by using the trained machine learning model;
setting a first network layer in the compression model as a layer to be compressed, and creating state information for the layer to be compressed; the state information comprises self attribute information and compression state information of the layer to be compressed;
inputting the state information to a hierarchical compression decision device to obtain the compression rate of the layer to be compressed, and compressing the layer to be compressed according to the compression rate;
When the layer to be compressed is compressed, training the compression model by utilizing the local training data to restore the precision of the compression model;
judging whether a compression stop condition is satisfied;
if yes, setting the next network layer in the compression model as the layer to be compressed, and entering a step of creating state information for the layer to be compressed;
if not, updating the compression state information in the state information of the layer to be compressed, and entering a step of inputting the state information into a hierarchical compression decision device based on the updated state information to obtain the compression rate of the layer to be compressed.
4. The heterogeneous distributed learning method according to claim 3, wherein the self attribute information includes a layer number of the layer to be compressed, a network layer type, a feature map input scale and a feature map output scale, the compression state information includes a compression rate of the layer to be compressed, a feature map variability and an accuracy of the compression model, and the feature map variability characterizes a variation degree between the layer to be compressed and a corresponding network layer in the full-scale model;
the updating the compression state information in the state information of the layer to be compressed includes:
Updating the compression rate in the compression state information to the compression rate currently output by the hierarchical compression decision maker;
updating the variation degree of the feature map in the compression state information by utilizing the compressed layer to be compressed and the corresponding network layer in the full-scale model;
and determining the latest precision of the compression model, and updating the precision in the compression state information by using the latest precision.
5. The heterogeneous distributed learning method according to claim 4, wherein updating the feature map variability in the compression state information by using the compressed layer to be compressed and the corresponding network layer in the full-scale model includes:
setting the characteristic diagram of the compressed layer to be compressed as a first characteristic diagram, and setting the characteristic diagram of the network layer corresponding to the layer to be compressed in the full-scale model as a second characteristic diagram;
determining element difference values between each element in the first feature map and the corresponding element in the second feature map, and summing absolute values of all element difference values to obtain a total element difference value;
and determining the average element difference value of each element by using the total element difference value and the number of elements contained in the first feature map, and updating the feature map variability by using the average element difference value.
6. A heterogeneous distributed learning method according to claim 3, wherein the compressing the layer to be compressed according to the compression rate includes:
and determining the compression grade corresponding to the compression rate, and compressing the layer to be compressed by utilizing a compression mode corresponding to the compression grade.
7. The heterogeneous distributed learning method according to claim 6, wherein the determining the compression level corresponding to the compression rate and compressing the layer to be compressed by a compression method corresponding to the compression level includes:
when the compression rate is determined to be located in a first preset interval, weight pruning is carried out on the layer to be compressed; the lower limit value of the first preset interval is 0;
when the compression rate is determined to be positioned in a second preset interval, weight pruning and low-rank approximate compression are sequentially carried out on the layer to be compressed; the lower limit value of the second preset interval is the upper limit value of the first preset interval;
when the compression rate is determined to be positioned in a third preset interval, weight pruning, low-rank approximate compression and quantitative compression are sequentially carried out on the layer to be compressed; the lower limit value of the third preset interval is the upper limit value of the second preset interval.
8. The heterogeneous distributed learning method of claim 3, further comprising, after training the compression model with the local training data:
when the hierarchical compression decision maker does not complete training, determining a loss value of the hierarchical compression decision maker in the current compression through the following formula:
wherein Delta_Acc represents the loss value, Acc_comp represents the current accuracy of the compression model, Acc_full represents the accuracy of the full-scale model, Parameter-Reduction represents the parameter quantity reduced by compression, Total-Parameter represents the total parameter quantity of the full-scale model, and λ represents the regulation factor;
and updating a model of the hierarchical compression decision maker by utilizing the loss value.
9. The heterogeneous distributed learning method according to claim 8, wherein the determining whether the compression stop condition is satisfied comprises:
when the hierarchical compression decision maker does not complete training, judging whether the current precision of the compression model is smaller than a preset minimum precision value or not;
if the current precision is smaller than the preset minimum precision value, judging that the compression stopping condition is met, and rolling back the layer to be compressed to the previous version;
If the current precision is not less than the preset minimum precision value, judging whether the current compression rate of the layer to be compressed is less than the preset compression rate or not;
if the current compression rate is smaller than the preset compression rate, adding 1 to the duration of the low compression rate corresponding to the layer to be compressed, and judging whether the duration of the low compression rate is larger than the preset time; if the duration of the low compression rate is greater than the preset times, judging that the compression stopping condition is met; if the duration of the low compression rate is not more than the preset times, judging that the compression stopping condition is not met; the initial value of the duration of the low compression ratio is 0;
and if the current compression rate is not less than the preset compression rate, judging that the compression stopping condition is not met.
10. The heterogeneous distributed learning method according to claim 9, further comprising, after determining that the low compression rate duration is greater than the preset number of times:
judging whether the duration times of the low compression rate corresponding to each network layer in the compression model are all larger than the preset times;
if yes, the hierarchical compression decision maker is judged to finish training.
11. The heterogeneous distributed learning method of claim 3, wherein the determining whether the compression stop condition is satisfied comprises:
When the hierarchical compression decision device finishes training, judging whether the compressed times of the layer to be compressed reach preset times or not;
if yes, judging that the compression stopping condition is met;
if not, it is determined that the compression stop condition is not satisfied.
12. The heterogeneous distributed learning method of claim 2, wherein the overall compression of all network layers in the trained machine learning model comprises:
setting a full-scale model and a compression model by using the trained machine learning model;
setting each network layer in the compression model as a layer to be compressed, and creating state information for each layer to be compressed to obtain a state information sequence; the state information comprises self attribute information and compression state information of the layer to be compressed, and the compression state information represents the compression degree of the layer to be compressed compared with a corresponding network layer in the full-scale model;
inputting the state information sequence to a global compression decision device to obtain the compression rate of each layer to be compressed, and compressing each layer to be compressed according to the compression rate; the global compression decision maker comprises a sequence processing unit, wherein the sequence processing unit is used for extracting the dependency relationship in the state information sequence;
When the compression model is compressed, training the compression model by utilizing the local training data to restore the precision of the compression model;
judging whether a compression stop condition is satisfied;
if yes, the compression is exited;
if not, updating the compression state information in the state information of each layer to be compressed, and entering a step of inputting the state information sequence into a global compression decision device based on the updated state information to obtain the compression rate of each layer to be compressed.
13. The heterogeneous distributed learning method of claim 12, further comprising, after training the compression model with the local training data:
when the global compression decision maker does not complete training, determining the rewarding value of the global compression decision maker in the current compression through the following formula:
wherein R represents the reward value, π represents the strategy output by the global compression decision maker in the current compression, which includes the compression rate corresponding to each layer to be compressed in the compression model in the current compression, D_val represents a validation set used to determine the accuracy of the full-scale model and the compression model, R_err is the first reward item, b represents a preset accuracy-loss threshold, and R_err is determined by the following formula:
wherein Delta_Acc represents the accuracy difference between the full-scale model and the compression model, Acc_full represents the accuracy of the full-scale model, Acc_comp represents the current accuracy of the compression model, λ represents the regulation factor, R_flops represents the second reward item, and R_flops is determined by the following formula:
wherein FLOPs_ratio represents the ratio between the floating-point operation count of the compression model and that of the full-scale model;
And updating a model of the global compression decision maker by using the reward value.
14. The heterogeneous distributed learning method according to claim 13, wherein the determining whether the compression stop condition is satisfied comprises:
when the global compression decision device does not complete training, judging whether the current precision of the compression model is smaller than a preset minimum precision value or not;
if the current precision is smaller than the preset minimum precision value, judging that the compression stopping condition is met, and rolling back the compression model to the previous version;
if the current precision is not less than the preset minimum precision value, judging whether the current compression rate of each layer to be compressed is less than the preset compression rate or not;
if the current compression rate is smaller than the preset compression rate, adding 1 to the duration of the low compression rate corresponding to the compression model, and judging whether the duration of the low compression rate is larger than the preset time; if the duration of the low compression rate is greater than the preset times, judging that the compression stopping condition is met, and judging that the global compression decision device finishes training; if the duration of the low compression rate is not more than the preset times, judging that the compression stopping condition is not met; the initial value of the duration of the low compression ratio is 0;
And if the current compression rate is not less than the preset compression rate, judging that the compression stopping condition is not met.
15. The heterogeneous distributed learning method according to claim 12, wherein the determining whether the compression stop condition is satisfied comprises:
when the global compression decision maker has completed training, judging whether the overall compression frequency of the compression model reaches a preset frequency;
if yes, judging that the compression stopping condition is met;
if not, it is determined that the compression stop condition is not satisfied.
16. The heterogeneous distributed learning method of claim 2, wherein the performing layer-by-layer traversal compression on each network layer in the trained machine learning model and the overall compression on all network layers in the trained machine learning model comprises:
performing layer-by-layer traversal compression on each network layer in the trained machine learning model, and performing overall compression on all network layers in the trained machine learning model after the layer-by-layer traversal compression is finished.
17. The heterogeneous distributed learning method of claim 1 wherein iteratively training a local machine learning model using local training data further comprises:
Receiving a test data set issued by an edge cloud server; each edge device corresponds to the same test data set;
inputting the test data set into a local machine learning model to obtain an reasoning result corresponding to the test data set;
the reasoning result is sent to the edge cloud server, and equipment cluster information issued by the edge cloud server is received; the edge cloud server divides each of the edge devices into a plurality of device clusters by using training data similarity determined by an inference result of each of the edge devices.
18. The heterogeneous distributed learning method according to claim 17, wherein the device cluster information includes a weighted undirected graph of the device cluster, each node in the weighted undirected graph corresponds to each of the edge devices in the device cluster, edges are provided between the nodes if and only if a training data similarity between the corresponding edge devices is greater than a first threshold, and weights of the edges are the training data similarity between the corresponding edge devices;
after iteratively training the local machine learning model using the local training data, further comprising:
determining edge equipment corresponding to a node connected with the node of the weighted undirected graph as neighbor edge equipment corresponding to the node;
And broadcasting local machine learning model parameters to the neighbor edge equipment, and utilizing the machine learning model sent by the neighbor edge equipment to aggregate and update the local machine learning model parameters.
19. The heterogeneous distributed learning method of claim 18, wherein broadcasting the local machine learning model parameters to the neighbor edge device and aggregating and updating the local machine learning model parameters with the machine learning model transmitted by the neighbor edge device comprises:
when the number of training iterations reaches an integer multiple of a first preset value, broadcasting the local machine learning model parameters to its own neighbor edge devices, and aggregating and updating the local machine learning model parameters by using the machine learning model parameters sent by the neighbor edge devices;
compressing the trained machine learning model, and sending the compressed machine learning model parameters to cluster head edge equipment of the equipment cluster where the machine learning model is located, wherein the method comprises the following steps:
when the number of iterations reaches an integer multiple of a second preset value, compressing the trained machine learning model, and sending the parameters of the compressed machine learning model to the cluster head edge device of the device cluster where the machine learning model is located; the second preset value is greater than the first preset value.
20. The heterogeneous distributed learning method of claim 19, wherein the sending the clustered aggregate model parameters to an edge cloud server for global aggregation comprises:
and when the number of intra-cluster aggregations reaches an integer multiple of a third preset value, sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
21. The heterogeneous distributed learning method of claim 1, wherein the performing intra-cluster aggregation on the received machine learning model parameters to obtain intra-cluster aggregation model parameters comprises:
transmitting local machine learning model parameters to the cluster head edge device;
when belonging to the cluster head edge device of the device cluster, aggregating the received machine learning model parameters through the following formula to obtain the cluster aggregation model parameters:
wherein w_c^{t+1} represents the cluster aggregation model parameters obtained by the aggregation of the c-th device cluster in the (t+1)-th round of intra-cluster aggregation, w_c^{t} represents the cluster aggregation model parameters obtained by the aggregation of the c-th device cluster in the t-th round of intra-cluster aggregation, w_{c,j}^{t+1} represents the locally compressed machine learning model parameters of the j-th edge device in the c-th device cluster before the (t+1)-th round of intra-cluster aggregation, N_c represents the number of edge devices in the c-th device cluster, and μ represents the hyperparameter.
22. A heterogeneous distributed learning apparatus for use with an edge device, a federal learning system comprising a plurality of said edge devices, a plurality of said edge devices being partitioned into a plurality of clusters of devices, said apparatus comprising:
the local training module is used for carrying out iterative training on a local machine learning model by utilizing local training data; the machine learning model in each edge device corresponds to the same model structure and reasoning task;
the model compression module is used for compressing the trained machine learning model and transmitting the parameters of the compressed machine learning model to cluster head edge equipment of the equipment cluster where the model compression module is located;
and the intra-cluster aggregation module is used for carrying out cluster aggregation on the received machine learning model parameters to obtain cluster aggregation model parameters when the intra-cluster aggregation module belongs to the cluster head edge equipment, and sending the cluster aggregation model parameters to an edge cloud server for global aggregation.
23. An edge device, comprising:
a memory for storing a computer program;
a processor for implementing a heterogeneous distributed learning method according to any of claims 1 to 21 when executing the computer program.
24. A federal learning system, comprising:
the edge cloud server is used for globally aggregating the received cluster aggregation model parameters;
a plurality of edge devices for performing the heterogeneous distributed learning method of any of claims 1 to 21.
25. A computer readable storage medium having stored therein computer executable instructions which when loaded and executed by a processor implement the heterogeneous distributed learning method of any of claims 1 to 21.
CN202410230139.9A 2024-02-29 2024-02-29 Heterogeneous distributed learning method, device, equipment, system and medium Active CN117808129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410230139.9A CN117808129B (en) 2024-02-29 2024-02-29 Heterogeneous distributed learning method, device, equipment, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410230139.9A CN117808129B (en) 2024-02-29 2024-02-29 Heterogeneous distributed learning method, device, equipment, system and medium

Publications (2)

Publication Number Publication Date
CN117808129A true CN117808129A (en) 2024-04-02
CN117808129B CN117808129B (en) 2024-05-24

Family

ID=90423915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410230139.9A Active CN117808129B (en) 2024-02-29 2024-02-29 Heterogeneous distributed learning method, device, equipment, system and medium

Country Status (1)

Country Link
CN (1) CN117808129B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918829A (en) * 2021-10-12 2022-01-11 重庆邮电大学 Content caching and recommending method based on federal learning in fog computing network
WO2022233511A2 (en) * 2021-05-05 2022-11-10 Nokia Technologies Oy Efficient federated-learning model training in wireless communication system
CN115564062A (en) * 2022-09-26 2023-01-03 南京理工大学 Federal learning system and method based on model pruning and transmission compression optimization
CN116579417A (en) * 2023-05-10 2023-08-11 之江实验室 Layered personalized federal learning method, device and medium in edge computing network
CN116828052A (en) * 2023-07-10 2023-09-29 重庆邮电大学 Intelligent data collaborative caching method based on edge calculation
CN116842380A (en) * 2023-06-09 2023-10-03 厦门大学 Model training method for hybrid federal segmentation learning based on client clustering
US20230319750A1 (en) * 2022-03-16 2023-10-05 Qualcomm Incorporated Signal synchronization for over-the-air aggregation in a federated learning framework
WO2024027164A1 (en) * 2022-08-01 2024-02-08 浙江大学 Adaptive personalized federated learning method supporting heterogeneous model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘大千; 刘万军; 费博雯; 曲海成: "Anti-interference matching target tracking method under foreground constraints", Acta Automatica Sinica (自动化学报), no. 06, 15 June 2018 (2018-06-15) *
尹文枫; 梁玲燕; 彭慧民; 曹其春; 赵健; 董刚; 赵雅倩; 赵坤: "Research progress on convolutional neural network compression and acceleration techniques", Computer Systems & Applications (计算机系统应用), no. 09, 15 September 2020 (2020-09-15) *
曹立志; 陈莹: "Energy-balanced clustering algorithm for wireless sensor networks based on learning automata", Chinese Journal of Sensors and Actuators (传感技术学报), no. 11, 15 November 2013 (2013-11-15) *

Also Published As

Publication number Publication date
CN117808129B (en) 2024-05-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant