CN114363911A - Wireless communication system for deploying layered federated learning and resource optimization method - Google Patents


Info

Publication number
CN114363911A
Authority
CN
China
Prior art keywords
node
group
model
working
nodes
Prior art date
Legal status
Granted
Application number
CN202111675427.8A
Other languages
Chinese (zh)
Other versions
CN114363911B (en)
Inventor
朱旭
温正峤
蒋宇飞
王同
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202111675427.8A
Publication of CN114363911A
Application granted
Publication of CN114363911B
Status: Active

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a wireless communication system deploying layered federated learning and a resource optimization method. The system uses a base station equipped with a cloud server as the global parameter server in the layered federated learning architecture, where it performs global parameter aggregation. The working nodes within the service range of the base station serve as federated learning working nodes; they are adaptively grouped, and a head node is elected in each group as the bridge and local aggregator between the working nodes and the cloud server. The Internet-of-Things nodes in each group train iteratively on their local data sets and upload local parameters to the group head node via device-to-device (D2D) communication. The group nodes access the base station by Frequency Division Multiple Access (FDMA) to complete the distribution and reception of federated learning tasks and the updating and exchange of parameters during training. The invention overcomes the shortcomings of the traditional layered federated learning architecture and effectively reduces system overhead.

Description

Wireless communication system for deploying layered federated learning and resource optimization method
Technical Field
The invention relates to the technical field of computer resource optimization, and in particular to a wireless communication system deploying layered federated learning and a resource optimization method.
Background
In the traditional "end-edge-cloud" layered federated learning architecture, after the end nodes execute iterations of the model training task, they upload their local model parameters in parallel to an edge server for local parameter aggregation. After a certain number of rounds of local parameter aggregation, the edge server uploads the edge-level model parameters to a cloud server, which completes global model parameter aggregation. Iterative training and updating are repeated until the model converges and the task ends.
The "end-edge-cloud" layered federated learning architecture requires the participation of edge servers, but in rural areas with incomplete infrastructure, edge servers are not necessarily deployed, so this mode cannot be realized. Second, edge server locations are generally fixed, so when matching working nodes to edge servers, the grouping cannot always jointly optimize model training time and data distribution. In addition, the number of edge servers is generally fixed, so when massive numbers of working nodes participate in model training, the load on a single server may become too large. Finally, in scenarios where working nodes move, the communication link between a working node and the server is unstable and at risk of disconnection, which may cause the working node to switch among different edge servers; this grouping instability can affect the overall delay and the final accuracy of model training.
Moreover, under the existing layered architecture there is little research on the optimization trade-off between communication and learning resources. The influence of the training sample batch size and the parameter aggregation frequency on model learning performance and communication overhead is complex and difficult to quantify.
Disclosure of Invention
The invention aims to provide a wireless communication system deploying layered federated learning and a resource optimization method, so as to solve the problems in the prior art.
To achieve this purpose, the invention adopts the following technical scheme: a wireless communication system deploying layered federated learning, in which a base station equipped with a cloud server serves as the global parameter server in the layered federated learning architecture and executes the aggregation of global parameters;
the working nodes within the service range of the base station serve as federated learning working nodes and are adaptively grouped, and a head node of each group is elected as the bridge and local aggregator between the working nodes and the cloud server; the Internet-of-Things nodes in each group train iteratively on local data sets and upload local parameters to the group head node in D2D mode;
the group nodes access the base station by Frequency Division Multiple Access (FDMA), and each group can be allocated certain bandwidth resources for establishing a communication connection with the base station over a wireless link, so as to complete the distribution and reception of the federated learning task and the updating and exchange of parameters during training.
Preferably, in this technical scheme, each working node completes iterative updates of the model parameters in parallel using local computing resources, based on a serial parameter uploading strategy;
in the serial parameter uploading strategy, at the stage where the working nodes upload model parameters to the group head node, the working nodes within a group upload their local model parameters to the group head node serially, in sequence, each using all of the bandwidth resources pre-allocated to the group; intra-group aggregation of the model parameters is completed on the group head node, and the updated aggregate parameters are broadcast to each working node in the group;
at the stage when the global aggregation time node is reached, the group that first completes the designated number of intra-group aggregation rounds uploads its group model parameters to the global parameter server through its head node using the available system bandwidth resources; if another group completes the designated number of aggregation rounds and the previous group has finished uploading its group model parameters, it directly uses the system bandwidth resources to upload its own group model parameters; otherwise, it starts uploading after the previous group finishes.
Preferably, in the present technical solution, in the communication model, the communication energy consumption of each working node uploading its local model parameters to the head node of the group it belongs to is expressed as:

E_{i,j}^{comm} = p · t_{i,j}^{comm}

where p denotes the node transmit power and t_{i,j}^{comm} denotes the parameter upload communication delay of working node j.
Preferably, in this technical solution, the communication energy consumption of each group's head node uploading the group model parameters to the cloud server is expressed as:

E_i^{comm} = p · t_i^{comm}

where p denotes the head node transmit power and t_i^{comm} denotes the parameter upload communication delay of the head node of group i.
Preferably, in this technical scheme, in the learning model, the local working nodes train the model task with the mini-batch SGD algorithm; with the sample-wise loss function of the model denoted l, the loss function value F_j(w) of local working node j is defined as:

F_j(w) = (1/|D_j|) · Σ_{s∈D_j} l(w; s)

where s denotes a sample in the local dataset D_j of the working node and w denotes the model parameters of the working node.
Preferably, in this technical solution, in the learning model, after a working node completes the designated number of rounds of local model training using its local training samples, it uploads its model parameters to the group head node, which completes the aggregation; the group parameter aggregation operation is expressed as:

w_i^{(t)} = Σ_{j∈V_i} (B_j / Σ_{j′∈V_i} B_{j′}) · w_j^{(t)}

After all working nodes in a group complete the designated number of rounds of model parameter updates using local training samples, the group head node uploads the group model parameters to the global parameter server, which completes the aggregation of the group model parameters; the global parameter aggregation operation is expressed as:

w^{(t)} = Σ_{i=1}^{M} ((Σ_{j∈V_i} B_j) / B) · w_i^{(t)}

After the global parameter server completes global parameter aggregation, it broadcasts the updated global parameters to all group head nodes, which in turn broadcast them to all working nodes within their service range; each working node then continues a new round of model training and parameter updating with the latest global model parameters and its local training sample set, recorded as:

w_j^{(t+1)} = w^{(t)} − η · ∇F_j(w^{(t)}; ξ_j^{(t)})

where B_j denotes the training sample batch size of working node j; Σ_{j∈V_i} B_j is the sum of the training sample batch sizes of all working nodes in group i; t denotes the training time; i and j index the groups and working nodes respectively; η denotes the learning rate in model training; ∇F_j(·; ξ_j^{(t)}) is the gradient of the training loss F_j(w) of the working node evaluated on the batch sample set ξ_j^{(t)}; N denotes the number of nodes; ξ_j^{(t)} denotes the batch sample set of one round of local iterative training of the working node; B denotes the total training sample batch size of all working nodes globally; and V_i denotes the set of working nodes in group i.
Another objective of the present invention is to provide a resource optimization method for the wireless communication system deploying layered federated learning. The optimization method is based on iteratively and jointly optimizing communication factors and learning factors, and comprises the following steps:
s100, performing initialization grouping and parameter initialization on all working nodes, and starting to optimize global parameter aggregation frequency; then, solving the problem of the aggregation frequency of the parameters in the packet by using the global parameter aggregation frequency value obtained in the last step and the initialization values of the other parameters;
s200, allocating bandwidth resources based on the existing parameters;
and S300, replanning the grouping of all the nodes. And (4) performing iterative optimization until the algorithm is converged, and finally obtaining the optimal solution of each variable and the minimum value of the system overhead.
Preferably, in the present technical solution, the optimization objective function is expressed as:

cost = ρ · (cost^{err} / cost_ref^{err}) + (1 − ρ) · (cost^{delay} / cost_ref^{delay})

where ρ denotes a weighting coefficient; cost_ref^{err} and cost_ref^{delay} denote reference values of the model convergence error overhead and the model training time overhead, used to unify two values of different scales; and T denotes the training time.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a layered federated learning architecture with aggregation sinking, which sinks local parameter aggregation to the node side and replaces the edge server with a group head node to complete local parameter aggregation, remedying the shortcomings of edge-server-based local aggregation. Under the assumption of a non-convex loss function, model convergence analysis is carried out and a theoretical upper bound on the model convergence error is derived. With the goal of minimizing the system overhead, the trade-off between the two performance indexes is studied, and an iterative optimization algorithm jointly optimizing communication factors and learning factors under the layered architecture is proposed. The algorithm includes a grouping algorithm based on node exchange; optimal closed-form solutions for bandwidth resource allocation and parameter aggregation frequency are derived, and the optimal training sample batch size is solved with a concave-convex procedure optimization algorithm.
2. The invention provides a low-complexity optimization algorithm for how learning resources and communication resources influence system performance, and reduces system overhead.
3. The invention solves the defects of the traditional layered federated learning architecture, optimizes resources aiming at the novel layered federated learning architecture, and intensively studies how the batch size of the learning resource training samples, the parameter aggregation frequency and the communication resource bandwidth resources affect the performance of the system, thereby further effectively reducing the system overhead.
Drawings
FIG. 1 is an architecture diagram of a hierarchical federated learning system of the present invention;
FIG. 2 is a diagram illustrating the effect of serial parameter upload according to the present invention;
FIG. 3 is a flow chart of a resource optimization method of the present invention;
FIG. 4-1 is a geographic coordinate diagram of the wireless communication system of the three-layer federated learning architecture of the present invention;
FIG. 4-2 is an experimental graph of the effect of the group aggregation frequency on the model loss function value according to the present invention;
FIG. 4-3 is an experimental graph of the effect of the group aggregation frequency on the model convergence accuracy according to the present invention;
FIGS. 4-4 are graphs of the change of model training time overhead and model convergence error overhead with global parameter aggregation frequency for the present invention;
FIGS. 4-5 are graphs showing the variation of the overhead of the present invention with the magnitude of the model parameters;
FIGS. 4-6 are graphs of the variation of overhead with the total number of training sample batches according to the present invention;
FIGS. 4-7 are histograms of the impact of data distribution of the working nodes of the present invention on system overhead;
FIGS. 4-8 are histograms of system overhead variation with the working node computation duration gap threshold in accordance with the present invention;
FIGS. 4-9 are histograms of system overhead as a function of packet training gap threshold for the present invention;
FIGS. 4-10 are graphs of the change of system overhead with system bandwidth resources in accordance with the present invention;
FIGS. 4-11 are experimental graphs of convergence performance of the T-JCL algorithm of the invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them; details of the same or similar concepts or processes may not be repeated in some embodiments. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, the wireless communication system deploying layered federated learning according to the present invention is a layered federated learning system model based on aggregation sinking. A base station equipped with a cloud server serves as the global parameter server in the layered federated learning architecture and performs global parameter aggregation. The working nodes within the service range of the base station serve as federated learning working nodes and are adaptively grouped, and a head node of each group is elected as the bridge and local aggregator between the working nodes and the cloud server. The Internet-of-Things nodes in each group train iteratively on their local data sets and upload local parameters to the group head node in D2D mode. The group nodes access the base station by Frequency Division Multiple Access (FDMA): each group is allocated certain bandwidth resources for establishing a communication connection with the base station over a wireless link, completing the distribution and reception of the federated learning task and the updating and exchange of parameters during training.
The system model comprises at least the following three stages:
First stage: model training at the local working nodes;
Second stage: the working nodes upload their local model parameters to the group head node;
Third stage: the groups upload the group model parameters to the global parameter server.
As shown in FIG. 2, the first stage is the model training phase of the local working nodes.
Specifically, each working node completes iterative updates of the model parameters in parallel using local computing resources. In the stage where the working nodes upload model parameters to the group head node, the working nodes within a group upload their local model parameters to the group head node serially, in sequence, each using all of the bandwidth resources pre-allocated to the group. Intra-group aggregation of the model parameters is completed on the group head node, and the updated aggregate parameters are broadcast to each working node in the group. When the global aggregation time node is reached, the group that first completes the designated number of intra-group aggregation rounds uploads its group model parameters to the global parameter server through its head node using the available system bandwidth resources, so that bandwidth resources can be fully utilized during training. If another group completes the designated number of aggregation rounds and the previous group has finished uploading its group model parameters, it can directly use the system bandwidth resources to upload its own group model parameters; otherwise, it starts uploading after the previous group finishes.
Further, the communication model based on the three-layer federated learning architecture with aggregation sinking is illustrated:
First define the set of working nodes within group i as V_i and the head node of group i as H_i, with the set of head nodes denoted H = {H_1, H_2, …, H_M}. The uplink rate R_j of member working node j in group i can be expressed as:

R_j = α_i · log2(1 + p · h_{i,j} / δ²)    (1)

In formula (1), α_i denotes the bandwidth resources (Hz) allocated to group i, used in full by each uploading working node under the serial strategy; p denotes the transmit power (W) of the working node; δ² denotes the noise power (W); and h_{i,j} denotes the channel coefficient between the working node and the head node.
The uplink rate R_i of head node H_i of group i can be expressed as:

R_i = β · log2(1 + p · h_i / δ²)    (2)

In formula (2), β denotes the bandwidth resource (Hz) used for communication between the group head node and the base station (under the serial uploading strategy, the full system bandwidth), and h_i denotes the channel coefficient between the head node and the base station.
The computation time for each working node to complete one local round of training can be expressed as:

t_{i,j}^{cmp} = B_j · C · V_j / f_j    (3)

In formula (3), B_j denotes the training sample batch size of working node j; C denotes the size (bit) of a training sample of the working node; V_j denotes the number of processing cycles the working node requires per sample bit; and f_j denotes the computing frequency (Hz) of the working node.
The communication delay of each working node uploading its local model parameters to the head node of the group it belongs to can be expressed as:

t_{i,j}^{comm} = ζ / R_j    (4)

In formula (4), ζ denotes the size (bit) of the model parameter vector, and R_j is computed with the bandwidth resources (Hz) allocated to group i as in formula (1).
The communication delay for the group head node to upload the group model parameters to the global parameter server can be expressed as:

t_i^{comm} = ζ / R_i    (5)
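As a numerical illustration of formulas (1) and (4), the short Python sketch below evaluates the uplink rate and the resulting parameter upload delay; every value is an illustrative assumption, not a parameter taken from the patent.

```python
# Hedged numerical sketch of formulas (1) and (4); all values are assumptions.
import math

def uplink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_w):
    # Shannon-style rate: alpha * log2(1 + p*h / delta^2)
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_w)

alpha_i = 1.0e6        # bandwidth allocated to group i (Hz)
p = 0.1                # working node transmit power (W)
delta2 = 1.0e-10       # noise power (W)
h_ij = 1.0e-7          # channel coefficient, node j -> head node H_i
zeta = 1.0e6           # model parameter size (bit)

R_j = uplink_rate(alpha_i, p, h_ij, delta2)
t_comm_ij = zeta / R_j  # formula (4): upload delay of node j
print(f"R_j = {R_j:.3e} bit/s, t_comm = {t_comm_ij:.3f} s")
```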
and a second stage: the working node uploads the local model parameters to a grouping head node stage;
specifically, according to the calculation time of the working nodes in the group
Figure RE-GDA00035278039900000710
The indexes are sorted from small to large to obtain oneOrdered set Vi SAnd sequentially selecting the working nodes which have finished local training from the set in sequence. Because the working nodes are sorted according to the calculation time, the first selected working node is the working node with the minimum calculation time and is also the first node of the ordered set, and the uploading of the local model parameters is started by using the bandwidth resources pre-allocated by the groups and is uploaded to the head node in the group.
Further, continuing with the communication model, the process is made more specific.
For example, define T_{i,j} as the time from when all nodes in group i start synchronous training computation until working node j finishes uploading its local model training parameters to the group head node. Then:

T_{i,j} = max{ T_{i,j′}, I · t_{i,j}^{cmp} } + t_{i,j}^{comm}    (6)

where T_{i,j′} denotes the corresponding time for the node preceding working node j in the set V_i^S ordered by local computation time; the superscript S is merely a label with no physical meaning; i denotes group i and V_i^S the ordered node set of group i; t_{i,j}^{cmp} denotes the computation time of node j in group i (the superscript cmp is merely a label); t_{i,j}^{comm} denotes the communication delay of node j in group i (the superscript comm is merely a label); and I denotes the group parameter aggregation frequency, i.e., the number of local iterations per group aggregation.
Since the working node with the shortest computation time in the group is the first to start uploading its local model parameters, the I-round local completion time of the first node in the set V_i^S can be expressed as:

T_{i,(1)} = I · t_{i,(1)}^{cmp} + t_{i,(1)}^{comm}    (7)

As can be seen from the above formulas, under the precondition T_{i,j′} ≥ I · t_{i,j}^{cmp} (i.e., each node has finished its local computation by the time its predecessor finishes uploading), the time required to complete one round of aggregate updating of the group model parameters within the group can be expressed as:

T_i = max_{j∈V_i} T_{i,j}    (8)

That is to say,

T_i = I · t_{i,(1)}^{cmp} + Σ_{j∈V_i} t_{i,j}^{comm}    (9)
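The recursion of formulas (6)-(9) amounts to a simple scheduling loop. The following Python fragment is a hedged reading of the serial uploading strategy (names and numbers are assumptions); the same routine applies at the global level by substituting (G/I)·T_i for the computation term and t_i^{comm} for the upload term.

```python
# Hedged sketch of the serial upload timing, formulas (6)-(9).

def group_round_time(t_cmp, t_comm, I):
    """Time for one round of group parameter aggregation.

    t_cmp[j]  -- per-iteration local computation time of node j
    t_comm[j] -- upload delay of node j using the full group bandwidth
    I         -- local iterations per group aggregation
    """
    order = sorted(range(len(t_cmp)), key=lambda j: t_cmp[j])  # ordered set V_i^S
    finish = 0.0                                # when the channel becomes free
    for j in order:
        ready = I * t_cmp[j]                    # node j finishes I local rounds
        start = max(finish, ready)              # formula (6): wait for both
        finish = start + t_comm[j]              # serial upload, full bandwidth
    return finish

# Three nodes with heterogeneous compute speeds (illustrative values)
print(group_round_time([0.8, 1.0, 1.3], [0.20, 0.25, 0.20], I=5))
```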
similarly, the mode of uploading the model parameters of each group to the global parameter server is the same as the strategy of uploading the model parameters to the group head node by the working nodes in the group.
Third stage: the groups upload the group model parameters to the global parameter server.
Specifically, the groups are sorted in ascending order of their group training time T_i to obtain an ordered set V^S, and groups are selected from this set in sequence to upload their group model parameter vectors. Each group communicates with the global parameter server using all bandwidth resources in the system. The time at which group i finishes uploading its group model parameters can then be expressed as:

T_i^L = max{ T_{i′}^L, (G/I) · T_i } + t_i^{comm}    (10)

where T_{i′} denotes the corresponding time for the group preceding group i in the ordered set V^S; the italic G is a variable, the global parameter aggregation frequency; and the superscript L is merely a label with no physical meaning.
The time at which the first group in the set V^S completes one round of group parameter uploading can be expressed as:

T_{(1)}^L = (G/I) · T_{(1)} + t_{(1)}^{comm}    (11)

As can be seen from formula (11), under the precondition T_{i′}^L ≥ (G/I) · T_i, the time required to complete one round of aggregate updating of the global model parameters can be expressed as:

T^{glob} = max_{i} T_i^L    (12)

That is to say,

T^{glob} = (G/I) · T_{(1)} + Σ_{i=1}^{M} t_i^{comm}    (13)
as can be seen from equation (13), the time of a round of global parameter aggregation is related to the global parameter aggregation frequency, the packet parameter aggregation frequency, and the allocated bandwidth resource of each packet.
Averaging the sum of the computation time and the communication time over each round of local iterative training, the model training time overhead under the three-layer federated learning architecture is defined as:

cost^{delay} = T^{glob} / G    (14)

Formula (14) characterizes the sum of the computation and communication time required, on average, for a working node to complete one local round of training.
Similarly, the energy consumption limit of the working nodes is also one of the key limiting factors of system performance. Since the computational energy consumption of a working node is proportional to the training sample batch size and to the square of the node's computing frequency, the computational energy consumption of working node j completing one round of local iterative training can be expressed as:

E_j^{cmp} = κ · B_j · C · V_j · f_j²    (15)

where κ denotes the energy consumption coefficient of computation and B_j denotes the training sample batch size of working node j.
The communication energy consumption of each working node uploading its local model parameters to the head node of its group can be expressed as:

E_{i,j}^{comm} = p · t_{i,j}^{comm}    (16)

The communication energy consumption of each group's head node uploading the group model parameters to the cloud server can be expressed as:

E_i^{comm} = p · t_i^{comm}    (17)

where p denotes the transmit power.
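Formulas (15)-(17) combine into a simple per-round energy budget; the sketch below illustrates the computation (κ and all numbers are assumed values, not values from the patent).

```python
# Hedged sketch of the per-round energy model, formulas (15)-(17).

def compute_energy(kappa, B_j, C_bits, V_j, f_j):
    # E_cmp = kappa * B_j * C * V_j * f_j^2 : total cycles times kappa * f^2
    return kappa * B_j * C_bits * V_j * f_j ** 2

def comm_energy(p_watt, t_comm_s):
    # E_comm = p * t_comm
    return p_watt * t_comm_s

E = (compute_energy(kappa=1e-28, B_j=32, C_bits=8e3, V_j=20, f_j=1e9)
     + comm_energy(p_watt=0.1, t_comm_s=0.15))
print(f"energy per round: {E:.4f} J")  # illustrative output
```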
Further, under the three-layer federated learning architecture, the local working nodes train the model task with the mini-batch SGD algorithm; the purpose of model training is to minimize the global loss function value so that the model reaches a convergence state. The specific learning model is as follows:
Define the sample-wise loss function of the model as l; the loss function value F_j(w) of local working node j can then be defined as:

F_j(w) = (1/|D_j|) · Σ_{s∈D_j} l(w; s)    (18)

where s denotes a sample in the local dataset D_j of the working node and w denotes the model parameters of the working node.
In each round of local iterative model training, each working node adopts the mini-batch stochastic gradient descent algorithm: it randomly draws a training sample subset ξ_j^{(t)} of a certain batch size from its local training sample set D_j for the training update of the local model parameters. The model parameter update of working node j can be expressed as:

w_j^{(t+1)} = w_j^{(t)} − η · ∇F_j(w_j^{(t)}; ξ_j^{(t)})    (19)

where η denotes the learning rate in model training; ξ_j^{(t)} denotes the batch sample set of one round of local iterative training of the working node; and ∇F_j(w_j^{(t)}; ξ_j^{(t)}) is the gradient of the training loss F_j(w) of the working node evaluated on the training sample set ξ_j^{(t)}.
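For concreteness, the mini-batch update of formula (19) can be sketched in a few lines of NumPy; the quadratic loss and random data below are placeholders, not the patent's model or dataset.

```python
# Hedged sketch of the local mini-batch SGD step, formula (19).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))         # local dataset D_j (features)
y = rng.normal(size=200)              # local dataset D_j (targets)
w = np.zeros(5)                       # model parameters w_j
eta, B_j = 0.05, 16                   # learning rate, batch size

for _ in range(100):
    idx = rng.choice(len(X), size=B_j, replace=False)  # random batch xi_j^(t)
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / B_j  # gradient of 0.5*MSE on the batch
    w -= eta * grad                   # w_j^(t+1) = w_j^(t) - eta * grad
```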
Under the three-layer federated learning architecture, after the working nodes in a group complete a certain number of rounds of local model training, they upload their local model parameters to the head node of the group they belong to. The loss function of a group is expressed as a weighted sum of the loss function values of all working nodes within its service range (including the head node itself), where the weighting factor relates each working node's training sample batch size to the total training sample batch used within the group.
The loss function value within group i can then be defined as:

F_i(w) = Σ_{j∈V_i} p_j · F_j(w)    (20)

where p_j denotes the loss value weight of working node j.
In general, when defining the group loss function, the loss weight of each working node is defined as the ratio of the local training batch size it uses in model training to the sum of the training sample batch sizes used by all working nodes in the group it belongs to:

p_j = B_j / Σ_{j′∈V_i} B_{j′}    (21)

where B_j denotes the training sample batch size of working node j and Σ_{j′∈V_i} B_{j′} is the sum of the training sample batch sizes of all working nodes in the group.
Under the three-layer federated learning architecture, the global model loss function is defined as a weighted sum of the loss functions of the groups, which can be expressed as:

F(w) = Σ_{i=1}^{M} p_i · F_i(w)    (22)

where p_i denotes the loss function weight of group i.
In general, in the global loss function, the weight of each group's loss function is the proportion of the training sample batch of the group in model training relative to the training sample batch used by all working nodes globally:

p_i = (Σ_{j∈V_i} B_j) / B    (23)

where B denotes the total training sample batch size of all working nodes globally.
After a working node completes the designated number of rounds of local model training using its local training samples, it uploads its model parameters to the group head node, which completes the aggregation. The group parameter aggregation operation at training time t can be expressed as:

w_i^{(t)} = Σ_{j∈V_i} (B_j / Σ_{j′∈V_i} B_{j′}) · w_j^{(t)}    (24)

where B_j denotes the training sample batch size of working node j.
after all working nodes in a group complete model parameter updating of a designated number of rounds by using local training samples, a group head node uploads group model parameters to a global parameter server, the global parameter server completes the aggregation operation of the group model parameters, and the global parameter aggregation operation can be expressed as:
Figure RE-GDA0003527803990000113
after the global parameter server completes global parameter aggregation, the updated global parameters are broadcasted to all the grouping head nodes, the grouping head nodes continue to broadcast to all the working nodes in the service scope, and each working node continues to perform a new round of model training and parameter updating by using the latest updated global model parameters and the local training sample set, as follows:
Figure RE-GDA0003527803990000114
therefore, the unified expression of the model parameters of the working node j at different moments is as follows:
Figure RE-GDA0003527803990000115
and repeating iterative training and model aggregation updating until the model converges in a certain error range, and finishing the model training.
Convergence analysis of the enhanced layered federated learning architecture model of the present system
The analysis rests on the following five premise assumptions:
①: The model loss function is L-smooth, i.e., its gradient is L-Lipschitz continuous:

‖∇F(w) − ∇F(w′)‖ ≤ L · ‖w − w′‖

②: Each working node draws its training samples randomly from its local training sample set, and the sampling error caused by the random draw satisfies a bounded-variance condition (given as an equation image in the original document), where ξ is the sampling error constant coefficient and D_j is the number of training samples of working node j.
③: The difference between each working node's gradient and the global gradient characterizes the node's degree of deviation in global model training, indirectly reflecting the degree of data heterogeneity among working nodes:

‖∇F_j(w) − ∇F(w)‖² ≤ ε_j

④: The gradient of each working node is assumed to be bounded above:

E‖∇F_j(w; ξ_j)‖² ≤ φ²

⑤: The difference between each group's weighted gradient and the global gradient characterizes the group's degree of deviation in global model training, indirectly reflecting the degree of data heterogeneity among groups; mathematically:

‖Σ_{j∈V_i} p_j · ∇F_j(w) − ∇F(w)‖² ≤ ε_i
Provided assumptions ① through ⑤ hold, let w denote the model parameters throughout, and let ηL < 1. Then for any T ≥ 1, the theoretical upper bound on the model convergence error satisfies a closed-form inequality (the bound and its auxiliary quantities are given as equation images in the original document).
It can be seen that the number of working nodes N, the number of groups M, the number of samples B_j randomly drawn by each working node for model training, the degree of data heterogeneity among working nodes and among groups, the group parameter aggregation frequency I, and the global parameter aggregation frequency G are all important factors influencing the upper bound of the theoretical model convergence error. To better represent the precision error contributed by each round of local iterative training, the model training convergence error overhead cost^{err} is defined as this convergence error bound averaged over each round of local iterative training (formula given as an equation image in the original document).
method for optimizing resources by using system
The weighted sum of the model convergence error overhead and the model training time overhead under the layered federated learning architecture is the system overhead. Taking minimization of the system overhead as the optimization goal, an optimization problem is established whose objective function is expressed as:

cost = ρ · (cost^{err} / cost_ref^{err}) + (1 − ρ) · (cost^{delay} / cost_ref^{delay})    (28)

where ρ denotes a weighting coefficient, and cost_ref^{err} and cost_ref^{delay} denote reference values of the model convergence error overhead and the model training time overhead, used to unify two values of different scales.
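In code, this scale-unified objective is a one-line convex combination; the hedged sketch below uses assumed overhead and reference values purely for illustration.

```python
# Hedged sketch of the scale-unified system overhead objective, formula (28).

def system_cost(cost_err, cost_delay, ref_err, ref_delay, rho=0.5):
    # rho * normalized error overhead + (1 - rho) * normalized time overhead
    return rho * cost_err / ref_err + (1.0 - rho) * cost_delay / ref_delay

print(system_cost(cost_err=0.42, cost_delay=1.8, ref_err=0.5, ref_delay=2.0))
```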
Therefore, the joint optimization for minimizing the system overhead is constructed as problem P0:

P0:  min cost  over the training sample batch sizes B_j, the bandwidth allocations α_i, the aggregation frequencies I and G, and the grouping, s.t. (C1)-(C9)

(the full symbolic formulation of P0 and its constraints is given as equation images in the original document), where constraint (C1) limits the number of samples in each working node's training batch; (C2) and (C3) limit the training sample batch sizes of each group and of all groups; (C4) and (C5) limit the bandwidth resources of the system; (C6) requires the aggregation frequencies to be positive integers; (C7) limits the energy consumption of the working nodes; (C8) requires that, with the working nodes in each group ordered by computation duration, the time for the preceding working node to complete local training and upload its model parameters is longer than the local computation time of the next working node; and (C9) requires that, with all groups ordered by total duration, the time for the preceding group to complete group training and upload its model parameters is longer than the group training time of the next group.
Since constraints (C8) and (C9) must be considered on the basis that the working nodes and the groups are ordered by the corresponding indexes, the optimization problem is difficult to transform and solve. To make it more tractable, slack variables ε^{cmp} and ε^{grp} are introduced to represent, respectively, the maximum difference between the computation times of the working nodes in the same group after I rounds of local iterative training, and the maximum difference between the total group times of different groups after completing G/I rounds of group iterative training and parameter aggregation updates. Moreover, once these two thresholds are introduced, the performance gap between the bandwidth resource allocation strategy proposed by the present invention and the traditional bandwidth pre-allocation scheme can be observed by controlling the threshold sizes, because if the computing power of the working nodes in the same group is comparable, the times for the working nodes to complete a fixed number of rounds of local iterative training are similar.
Constraints (C8) and (C9) can then be converted as follows:

|I · t_{i,p}^{cmp} − I · t_{i,q}^{cmp}| ≤ ε^{cmp}  for all working nodes p, q in the same group    (C8.1)

|(G/I) · T_m − (G/I) · T_n| ≤ ε^{grp}  for all groups m, n    (C9.1)

where (C8.1) and (C9.1) are mathematical scalings of the original constraints; p and q index working nodes, and m and n index groups.
The objective function of the P0 optimization problem contains cost^{delay}, which involves min/max timing terms. For convenience of solution, relaxation variables τ and τ_i are introduced, where τ_i upper-bounds the time of one round of group parameter aggregation of group i and τ upper-bounds the time of one round of global parameter aggregation (their defining inequalities are given as equation images in the original document).
Based on the introduction of these time relaxation variables, constraints (C8) and (C9) can be further transformed into the forms (C8.2) and (C9.2) (given as equation images in the original document); these are constraint scalings after introducing the relaxation variables, with p and q simply subscripts indicating node p and node q.
based on the introduction of the above four relaxation variables and the adjustment of the constraints, the P0 optimization problem can be converted into the P1 form.
Figure RE-GDA0003527803990000147
s.t.(C1),(C2),(C3),(C4),(C5),(C6),(C7),(C8.2),(C9.2)
Wherein,
Figure RE-GDA0003527803990000148
obviously, P1 is still a non-convex optimization problem, and based on the above non-convex problem and definition, the invention provides an algorithm for jointly optimizing communication factors and learning factors based on iteration under a three-layer federated learning architecture, namely a T-JCL algorithm. As shown in FIG. 3, the general flow of the T-JCL algorithm is summarized as follows:
s100, performing initialization grouping and parameter initialization on all working nodes, and starting to optimize global parameter aggregation frequency; then, solving the problem of the aggregation frequency of the parameters in the packet by using the global parameter aggregation frequency value obtained in the last step and the initialization values of the other parameters;
s200, allocating bandwidth resources based on the existing parameters;
and S300, replanning the grouping of all the nodes. And (4) performing iterative optimization until the algorithm is converged, and finally obtaining the optimal solution of each variable and the minimum value of the system overhead.
Simulation experiment
The invention provides a layered federated learning architecture with aggregation sinking, which sinks local parameter aggregation to the node side and replaces the edge server with a group head node to complete local parameter aggregation, remedying the shortcomings of the edge-server-based approach. Under the assumption of a non-convex loss function, model convergence analysis is carried out and a theoretical upper bound on the model convergence error is derived. With the goal of minimizing the system overhead, the trade-off between the two performance indexes is studied, and an iterative optimization algorithm jointly optimizing communication factors and learning factors under the layered architecture is proposed. The algorithm includes a grouping algorithm based on node exchange; optimal closed-form solutions for bandwidth resource allocation and parameter aggregation frequency are derived, and the optimal training sample batch size is solved with a concave-convex procedure optimization algorithm. Simulation results show that, compared with optimization algorithms that consider only a single factor, the proposed joint optimization algorithm can effectively reduce the system overhead and converge quickly; see the following simulation experiments.
A wireless communication system is constructed with 1 base station equipped with a server. The service coverage area is a circular region with a radius of 250 m; the base station is located at the center of the circle, and the working nodes are uniformly distributed within the circular region. The geographic position coordinates of the working nodes are shown in FIG. 4-1. The global parameter server in the three-layer federated learning framework is located at the circle center in the figure; the red dots represent group head nodes, which act as local parameter aggregators performing group parameter aggregation while also performing model training of the federated learning task as working nodes. The remaining discrete patterns of different shapes represent common working nodes. The simulation sets the path loss model between a group head node and the global parameter server to PL1 = 128.1 + 37.6·log10(d1), where d1 denotes the distance between the working node and the base station in km. The path loss model between working nodes is PL2 = 148 + 40·log10(d2), where d2 denotes the distance between working nodes in km. The remaining simulation parameter settings are shown in Table 4-1.
(Table 4-1, the remaining simulation parameter settings, is reproduced as images in the original document.)
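The two path-loss models stated above can be evaluated directly; the snippet below reproduces them (distances in km, loss in dB), with sample distances chosen arbitrarily for illustration.

```python
# The simulation's two path-loss models; sample distances are illustrative.
import math

def pl_head_to_bs(d1_km):    # PL1 = 128.1 + 37.6 * log10(d1)
    return 128.1 + 37.6 * math.log10(d1_km)

def pl_node_to_node(d2_km):  # PL2 = 148 + 40 * log10(d2)
    return 148.0 + 40.0 * math.log10(d2_km)

print(pl_head_to_bs(0.20), pl_node_to_node(0.05))
```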
For convenience of comparison with the proposed algorithm, several comparison schemes are defined.
(1) Scheme 1: the scheme that considers only the influence of communication factors on the system overhead, called the T-C algorithm.
(2) Scheme 2: the scheme that considers only the influence of learning factors on the system overhead, called the T-L algorithm.
(3) Scheme 3: the scheme that jointly considers the influence of communication factors and the parameter aggregation frequency on the system overhead, called the T-JCI algorithm; the training sample batch size is set to half of the local training data set.
(4) Scheme 4: the scheme that jointly considers the influence of communication factors and the training sample batch size on the system overhead, called the T-JCB algorithm; the group and global parameter aggregation frequencies are fixed at 5 and 30.
(5) Scheme 5: an exhaustive search algorithm that considers the influence of communication factors and learning factors on the system overhead.
In order to study the influence of different data heterogeneity degrees on the system overhead, different training sample allocation schemes are defined.
For scale unification of the optimization target, the reference values cost_ref^{err} and cost_ref^{delay} are selected as follows.
(1) cost_ref^{err}: 30 working nodes and 5 groups, where each working node selects half of its local training sample batch size for training each time, and the data distribution among working nodes is strongly heterogeneous.
(2) cost_ref^{delay}: 30 working nodes and 5 groups, where each working node selects half of its local training sample batch size for training each time.
(I) Verification of the theoretical model convergence analysis under the three-layer federated learning architecture
The simulation is based on the MNIST dataset and trains a CNN model. The simulation sets 30 nodes and 5 groups, with data heterogeneously distributed according to the C scheme defined above and the same number of training samples per working node. Model training uses mini-batch gradient descent with the batch size set to 10.
FIGS. 4-2 and 4-3 simulate and verify the relationship between the group aggregation frequency and the global loss function and convergence accuracy of the model under the three-layer federated learning architecture with the global parameter aggregation frequency fixed at G = 30. Group parameter aggregation frequencies I ∈ {1, 5, 10} are selected for simulation comparison. As seen in the figures, under a fixed global parameter aggregation frequency, the larger the group parameter aggregation frequency value, i.e., the more local iterations required per round of group aggregation, the larger the loss value of the model at the same number of iteration rounds and the lower the accuracy. This is because the smaller the group aggregation frequency value, the more frequent the aggregation and the parameter interaction between models, and the stronger the generalization ability of model training. Conversely, the larger the group aggregation frequency value, the less frequent the aggregation, so the model easily falls into the local optimum of each working node rather than the global optimum, and the model training precision is poor.
(II) Based on Monte Carlo simulation experiments, simulation verification and comparative analysis of the T-JCL algorithm and its comparison schemes are carried out, studying the system overhead optimization of the three-layer federated learning architecture deployed in the wireless communication system.
FIGS. 4-4 show the relationship and influence of the global parameter aggregation frequency on the model training time overhead and the model convergence error overhead under the three-layer federated learning architecture.
As seen from the right y-axis against the x-axis, as the global parameter aggregation frequency increases, the model training time overhead decreases. This is because, with a fixed group parameter aggregation frequency, an increase in the global parameter aggregation frequency means more group aggregations per global parameter aggregation, so the communication delay of one global round averaged over each local iteration decreases, and the model training time overhead decreases accordingly.
As seen from the left y-axis against the x-axis, as the global parameter aggregation frequency increases, the model convergence error overhead increases. This is because, with a fixed group aggregation frequency, an increase in the global parameter aggregation frequency increases the number of group aggregations required for the model to complete one round of global aggregation in training, i.e., global aggregation becomes relatively less frequent, while global aggregation of the group parameters is an important way to average out the model differences among groups. Relatively fewer global aggregations drive the model trained by each group closer to the optimal solution for that group's own data set rather than to the global optimum. Especially when the data heterogeneity among groups is significant, this markedly increases the model parameter differences among groups, so the group parameters deviate from the globally optimal parameters, further increasing the model convergence error overhead.
FIGS. 4-5 show the relationship between the system overhead and the size of the transmitted model parameters with the number of working nodes in the system fixed. The optimization algorithm T-JCL, which jointly considers communication factors and learning factors, is compared with T-C and T-L, which consider only communication factor optimization or only learning factor optimization, respectively.
It can be seen from the figure that, as the size of the model parameters increases, the system overhead of all four algorithms increases. This is because more communication time is required to transmit the model parameters, so the model training time overhead increases and thus the system overhead increases. The system overhead of the T-JCL algorithm is clearly lower than that of the T-C and T-L algorithms, because T-C and T-L consider only communication factors (bandwidth resources) or only learning factors (aggregation frequency and batch size), respectively, and insufficient optimization factors lead to insufficient performance. The T-C algorithm considers only the allocation of bandwidth resources and omits the optimization of the model convergence error overhead, so its overall performance is worst and its system overhead highest. Meanwhile, the proposed iterative T-JCL algorithm approaches the exhaustive-search-based algorithm in performance while retaining the advantage of low complexity, demonstrating the effectiveness of the algorithm.
FIGS. 4-6 show the relationship between the system overhead and the training sample batch size, comparing the performance of the T-JCB, T-JCI, and T-JCL algorithms. As seen from FIGS. 4-6, as the training sample batch size increases, the system overhead of all four algorithms increases. This is because, with a fixed number of nodes, as the total number of training samples increases, the number of training samples allocated to each working node may increase, so more computation time is needed to train the model parameters, the model training time overhead increases, and the system overhead increases in turn. The system overhead of the proposed T-JCL algorithm is clearly lower than that of the T-JCB and T-JCI algorithms, because the joint-factor optimization algorithm considers more relevant factors and thus has a stronger performance advantage than algorithms considering a single factor. In balancing time complexity and performance, the proposed iteration-based T-JCL algorithm approaches the exhaustive-search-based algorithm in performance while retaining the advantage of low complexity, demonstrating the effectiveness of the algorithm.
FIGS. 4-7 illustrate the impact of different data distributions of the working nodes on the system overhead, comparing the four schemes.
As seen from FIGS. 4-7, under all four schemes the system overhead is lowest under data distribution A, next lowest under data distribution B, and highest under data distribution C. This is because, as the degree of data heterogeneity increases, the model convergence error overhead increases, leading to higher system overhead. Comparing the four schemes, it can be seen that under data distribution A the two batch-optimizing schemes have much lower system overhead than the two fixed-batch schemes, while the difference between random grouping and grouping optimization is not obvious. This is because, under data distribution A, the data among working nodes is homogeneously distributed, so random grouping can already guarantee homogeneous data within the formed groups; random grouping and grouping optimization therefore differ little, and the difference in system overhead is not obvious. Under data distributions B and C, because the data among nodes is heterogeneous and random grouping cannot guarantee that the formed groups have homogeneous data across groups, the system overhead of the random grouping scheme is higher than that of the grouping optimization scheme. Between the fixed-batch-size scheme and the batch-size-optimizing scheme, the batch-optimizing algorithm clearly achieves lower system overhead, demonstrating the impact and benefit of batch size optimization on the system overhead.
FIGS. 4-8 show the variation of the system overhead with the working node computation duration gap threshold, comparing and analyzing the performance differences of schemes adopting different parameter uploading strategies and bandwidth resource allocations.
As seen from FIGS. 4-8, under the same threshold value, the scheme using fixed bandwidth and the parallel upload mode has significantly higher system overhead than the other two schemes that optimize bandwidth resources, indicating that bandwidth resource optimization has a positive impact on the system overhead. Among the schemes that all consider bandwidth resource optimization, the scheme based on the serial upload mode has lower system overhead than the scheme based on the parallel upload mode. This is because, in a scenario with heterogeneous computing power, the parallel upload mode wastes part of the bandwidth resources due to pre-allocation, which increases the model training time overhead and thereby the system overhead. It can also be seen from FIGS. 4-8 that, as the working node computation gap threshold increases, the system overhead of the serial-mode scheme gradually decreases. In the computing-power-heterogeneous scenario, as the gap threshold increases, the control of the computation durations of the working nodes progresses from strict to loose, and the training sample batch size is a key factor affecting the computation duration gap. With a relaxed gap threshold, the extent to which changes in the training sample batch size are limited by the computation duration is loosened, leaving more room to optimize the model convergence error overhead through the training sample batch size, so the model convergence error overhead decreases. Hence the overhead of the serial-mode scheme decreases as the gap threshold increases.
Fig. 4-9 shows how the system overhead varies with the inter-group training time gap threshold, comparing the performance of schemes that adopt different parameter uploading strategies and bandwidth resource allocations.
As can be seen in fig. 4-9, the system overhead decreases as the inter-group training time gap threshold increases, because relaxing the limit on the training time gap between groups gives the parameters more room to optimize the overhead. The serial-mode and parallel-mode schemes show similar system overhead, because the inter-group training time gap threshold is generally much smaller than the group training time itself, so the resulting difference is small.
Fig. 4-10 illustrates the impact of the system bandwidth resources on the system overhead. It can be observed that the T-JCL algorithm proposed in the present application achieves significantly lower system overhead than the algorithm that does not consider learning factor optimization, and its performance approaches that of an exhaustive-search-based algorithm, approximating the system performance at low complexity. In addition, the system overhead decreases as the system bandwidth resources increase: with more bandwidth available, each working node can be allocated more bandwidth, so the model training time overhead of the system decreases and the system overhead falls.
Fig. 4-11 shows the convergence performance of the proposed T-JCL algorithm for different numbers of groups. In the scenarios with 3, 5, and 10 groups, the T-JCL algorithm proposed in the present application converges within a limited number of rounds, which demonstrates that this iteratively operating algorithm has good convergence performance.
In conclusion, the system of the invention has a simple equipment structure, low cost, and a simple resource optimization method, and can improve convergence while reducing overhead.
Structures, connection relationships, operating principles, and the like that are not described in this embodiment are implemented using the prior art and are not repeated here.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A wireless communication system for deploying hierarchical federated learning, characterized in that a base station on which a cloud server is deployed serves as the global parameter server in the hierarchical federated learning architecture and executes the global parameter aggregation operation;
the working nodes within the service range of the base station serve as federated learning working nodes and are adaptively grouped, and a head node is elected for each group to act as the bridge and local aggregator between the working nodes and the cloud server; the Internet of Things nodes in each group perform iterative training using their local data sets and upload their local parameters to the group head node via device-to-device (D2D) communication;
the group head nodes access the base station in frequency division multiple access (FDMA) mode, and each group is allocated a certain amount of bandwidth resources for establishing a wireless communication link with the base station, so as to complete the distribution and reception of federated learning tasks and the updating and exchange of parameters during training.
2. The wireless communication system for deploying hierarchical federated learning according to claim 1, wherein, under a serial parameter uploading strategy, each working node performs iterative updates of the model parameters in parallel using its local computing resources;
in the serial parameter uploading strategy, at the stage where working nodes upload model parameters to the group head node, the working nodes within a group upload their local model parameters to the group head node one after another, each using all of the bandwidth resources pre-allocated to the group; the group head node completes the intra-group aggregation of the model parameters and broadcasts the updated aggregated parameters to every working node in the group;
at the stage where the global aggregation time node is reached, the group that first completes the designated number of intra-group aggregation rounds uploads its group model parameters to the global parameter server through its head node using the available system bandwidth resources; if another group completes its designated number of aggregation rounds after the preceding group has finished uploading its group model parameters, it uploads its group model parameters directly using the system bandwidth resources; otherwise, it starts uploading only after the preceding group has finished.
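A minimal sketch of the claimed gating rule for group-level uploads (the function name and timings are hypothetical): a group whose designated number of intra-group rounds is complete uploads immediately if the system band is free, and otherwise waits for the group ahead of it to finish:

```python
def schedule_group_uploads(ready_times, upload_durations):
    """First-come-first-served use of the system band by group head nodes:
    each group starts uploading at max(its ready time, channel-free time)."""
    t_free, schedule = 0.0, []
    for ready, dur in sorted(zip(ready_times, upload_durations)):
        start = max(ready, t_free)
        t_free = start + dur
        schedule.append((start, t_free))
    return schedule

# Groups finish their designated intra-group rounds at these times (s)
print(schedule_group_uploads([3.0, 3.5, 7.0], [1.0, 1.0, 1.0]))
# -> [(3.0, 4.0), (4.0, 5.0), (7.0, 8.0)]:
#    group 2 waits for group 1, group 3 uploads as soon as it is ready
```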
3. The wireless communication system for deploying hierarchical federated learning according to claim 2, wherein, in the communication model, the communication energy consumption of each working node uploading its local model parameters to the head node of its group is expressed as:
$$E_j^{\mathrm{up}} = p \, \tau_j^{\mathrm{up}}$$
wherein $p$ represents the node transmit power and $\tau_j^{\mathrm{up}}$ represents the parameter upload communication delay of working node $j$.
4. The wireless communication system for deploying hierarchical federated learning according to claim 2, wherein the communication energy consumption of each group head node uploading the group model parameters to the cloud server is expressed as:
$$E_i^{\mathrm{head}} = p \, \tau_i^{\mathrm{head}}$$
wherein $p$ represents the head node transmit power; $\tau_i^{\mathrm{head}}$ represents the parameter upload communication delay of the head node of group $i$; $V_i$ represents the head node of group $i$.
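Since the energy model in claims 3 and 4 is simply transmit power multiplied by upload delay, it reduces to a one-line helper; the Shannon-rate link budget used to produce the delay below is an added assumption for illustration, not part of the claims:

```python
import math

def upload_delay(bits, bandwidth_hz, snr_linear):
    """Shannon-capacity upload delay (an assumed link model, not the claim's)."""
    rate = bandwidth_hz * math.log2(1.0 + snr_linear)   # achievable rate, bit/s
    return bits / rate

def upload_energy(p_watt, delay_s):
    """Claimed energy model: transmit power times upload communication delay."""
    return p_watt * delay_s

tau = upload_delay(bits=5e6, bandwidth_hz=1e6, snr_linear=100.0)
print(f"tau = {tau:.2f} s, E = {upload_energy(0.2, tau):.2f} J")
```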
5. The wireless communication system for deploying hierarchical federated learning of claim 2, wherein, in the learning model, the local working nodes train the model task using a mini-batch stochastic gradient descent algorithm; the per-sample loss function of the model is $l$, and the local loss function $F_j(w)$ of working node $j$ is defined as:
$$F_j(w) = \frac{1}{|\mathcal{D}_j|} \sum_{s \in \mathcal{D}_j} l(w; s)$$
wherein $s$ represents a sample in the local dataset $\mathcal{D}_j$ of working node $j$ and $w$ represents the model parameters of the working node.
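A self-contained sketch of the mini-batch stochastic gradient descent loop this claim describes, with a squared-error loss standing in for the unspecified per-sample loss $l$ (dataset, step size, and batch size are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # local dataset of working node j
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

def local_loss(w, X, y):
    """F_j(w): average per-sample loss over the node's local dataset."""
    return float(np.mean((X @ w - y) ** 2))

def sgd_step(w, X, y, batch_size, eta):
    """One mini-batch SGD update, as in the claimed learning model."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
    return w - eta * grad

w = np.zeros(3)
for _ in range(100):
    w = sgd_step(w, X, y, batch_size=16, eta=0.05)
print(local_loss(w, X, y))   # should approach the noise floor (~0.01)
```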
6. The wireless communication system for deploying hierarchical federated learning of claim 5, wherein, in the learning model, after a working node completes the designated number of rounds of local model training using its local training samples, it uploads its model parameters to the group head node, which completes the aggregation; the group parameter aggregation operation is expressed as:
$$w_i(t) = \sum_{j \in \text{group } i} \frac{B_j}{\sum_{j' \in \text{group } i} B_{j'}} \, w_j(t)$$
after all working nodes in the group complete the designated number of rounds of model parameter updates using their local training samples, the group head node uploads the group model parameters to the global parameter server, which completes the aggregation of the group model parameters; the global parameter aggregation operation is expressed as:
$$w(t) = \sum_{i} \frac{\sum_{j \in \text{group } i} B_j}{B} \, w_i(t)$$
after the global parameter server completes the global parameter aggregation, it broadcasts the updated global parameters to all group head nodes, which continue to broadcast them to the working nodes within their service scope; each working node then continues a new round of model training and parameter updating using the latest global model parameters and its local training sample set, recorded as:
$$w_j(t+1) = w_j(t) - \eta \, \nabla F_j\big(w_j(t); \mathcal{B}_j(t)\big)$$
wherein $B_j$ represents the training sample batch size of working node $j$; $\sum_{j' \in \text{group } i} B_{j'}$ is the sum of the training sample batches of all working nodes in the group; $t$ represents the training round; $i$ and $j$ index the groups and the working nodes, respectively; $\eta$ represents the learning rate in model training; $\nabla F_j(\cdot)$ is the gradient of the training loss $F_j(w)$ of the working node evaluated on its training sample set; $N$ represents the number of nodes; $\mathcal{B}_j(t)$ represents the batch sample set of one round of local iterative training of working node $j$; $B$ represents the total training sample batch size of all working nodes globally; $V_i$ represents the head node of group $i$.
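The two-level, batch-weighted aggregation of this claim can be written compactly; the sketch below follows the formulas above, with names and toy parameters chosen for illustration:

```python
import numpy as np

def group_aggregate(params, batches):
    """Head-node step: batch-weighted average of in-group worker parameters."""
    return np.average(params, axis=0, weights=np.asarray(batches, dtype=float))

def global_aggregate(group_params, group_batches):
    """Cloud step: groups weighted by their total training sample batch."""
    return np.average(group_params, axis=0, weights=group_batches)

# Two groups of workers, model dimension 2
g1 = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
g2 = [np.array([2.0, 2.0])]
w1 = group_aggregate(g1, batches=[16, 48])       # -> [0.25, 0.75]
w2 = group_aggregate(g2, batches=[64])           # -> [2.0, 2.0]
w  = global_aggregate([w1, w2], [16 + 48, 64])   # equal group weights here
print(w1, w2, w)                                 # w == [1.125, 1.375]
```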
7. A resource optimization method for the wireless communication system for deploying hierarchical federated learning of claim 1, wherein the optimization method is an iterative joint computation of communication factors and learning factors, comprising the steps of:
S100, performing initial grouping and parameter initialization of all working nodes, and first optimizing the global parameter aggregation frequency; then solving the intra-group parameter aggregation frequency problem using the global parameter aggregation frequency obtained in the previous step and the initial values of the remaining parameters;
S200, allocating bandwidth resources based on the parameters obtained so far;
S300, re-planning the grouping of all nodes; iterating the above optimization until the algorithm converges, finally obtaining the optimal solution of each variable and the minimum system overhead.
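A skeleton of this iterative block-coordinate procedure (every subproblem solver here is a stub; the structure, stopping rule, and names are assumptions for illustration, not the patent's T-JCL implementation):

```python
def t_jcl(init, solve_global_freq, solve_group_freq, solve_bandwidth,
          solve_grouping, overhead, tol=1e-3, max_iter=50):
    """Iteratively optimize one block of variables at a time, holding the
    others fixed, until the system overhead stops improving."""
    x = dict(init)
    prev = overhead(x)
    for _ in range(max_iter):
        x["global_freq"] = solve_global_freq(x)   # S100, step 1
        x["group_freq"]  = solve_group_freq(x)    # S100, step 2
        x["bandwidth"]   = solve_bandwidth(x)     # S200
        x["grouping"]    = solve_grouping(x)      # S300
        cur = overhead(x)
        if prev - cur < tol:                      # converged
            break
        prev = cur
    return x, cur

# Toy usage with stub solvers that settle immediately:
x, cost = t_jcl(
    {"global_freq": 1, "group_freq": 1, "bandwidth": None, "grouping": None},
    solve_global_freq=lambda x: 4, solve_group_freq=lambda x: 2,
    solve_bandwidth=lambda x: "equal", solve_grouping=lambda x: "greedy",
    overhead=lambda x: 1.0 if x["global_freq"] == 4 else 2.0)
print(x, cost)
```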
8. The resource optimization method according to claim 1, wherein the optimization objective function expression is:
$$\min \; \rho \, \frac{\Gamma}{\Gamma^{\mathrm{ref}}} + (1 - \rho) \, \frac{T}{T^{\mathrm{ref}}}$$
wherein $\rho$ represents a weighting coefficient; $\Gamma^{\mathrm{ref}}$ and $T^{\mathrm{ref}}$ represent the reference values of the model convergence error overhead $\Gamma$ and of the model training time overhead, used to unify the two differently scaled quantities; and $T$ represents the training time.
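Numerically, the objective is a weighted sum of the two normalized overheads; the values below are invented purely to show how the weighting coefficient trades convergence error against training time:

```python
def system_overhead(err, err_ref, time, time_ref, rho=0.5):
    """Weighted sum of normalized convergence-error and training-time overheads."""
    return rho * err / err_ref + (1.0 - rho) * time / time_ref

# rho sweeps the trade-off between accuracy and wall-clock cost
for rho in (0.2, 0.5, 0.8):
    print(rho, system_overhead(err=0.08, err_ref=0.10, time=120.0,
                               time_ref=100.0, rho=rho))
```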
CN202111675427.8A 2021-12-31 2021-12-31 Wireless communication system for deploying hierarchical federal learning and resource optimization method Active CN114363911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111675427.8A CN114363911B (en) 2021-12-31 2021-12-31 Wireless communication system for deploying hierarchical federal learning and resource optimization method

Publications (2)

Publication Number Publication Date
CN114363911A true CN114363911A (en) 2022-04-15
CN114363911B CN114363911B (en) 2023-10-17

Family

ID=81105276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111675427.8A Active CN114363911B (en) 2021-12-31 2021-12-31 Wireless communication system for deploying hierarchical federal learning and resource optimization method

Country Status (1)

Country Link
CN (1) CN114363911B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021169577A1 (en) * 2020-02-27 2021-09-02 山东大学 Wireless service traffic prediction method based on weighted federated learning
CN112817653A (en) * 2021-01-22 2021-05-18 西安交通大学 Cloud-side-based federated learning calculation unloading computing system and method
CN112804107A (en) * 2021-01-28 2021-05-14 南京邮电大学 Layered federal learning method for energy consumption adaptive control of equipment of Internet of things
CN112884163A (en) * 2021-03-18 2021-06-01 中国地质大学(北京) Combined service evaluation method and system based on federated machine learning algorithm and cloud feedback
CN113467952A (en) * 2021-07-15 2021-10-01 北京邮电大学 Distributed federated learning collaborative computing method and system
CN113504999A (en) * 2021-08-05 2021-10-15 重庆大学 Scheduling and resource allocation method for high-performance hierarchical federated edge learning

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023208043A1 (en) * 2022-04-29 2023-11-02 索尼集团公司 Electronic device and method for wireless communication system, and storage medium
CN115115064A (en) * 2022-07-11 2022-09-27 山东大学 Semi-asynchronous federal learning method and system
CN115115064B (en) * 2022-07-11 2023-09-05 山东大学 Semi-asynchronous federal learning method and system
WO2024027676A1 (en) * 2022-08-05 2024-02-08 索尼集团公司 Apparatus and method for handover in hierarchical federated learning network, and medium
WO2024087573A1 (en) * 2022-10-29 2024-05-02 华为技术有限公司 Federated learning method and apparatus
CN115526339A (en) * 2022-11-03 2022-12-27 中国电信股份有限公司 Federal learning method and device, electronic equipment and computer readable storage medium
CN115526339B (en) * 2022-11-03 2024-05-17 中国电信股份有限公司 Federal learning method, federal learning device, electronic apparatus, and computer-readable storage medium

Also Published As

Publication number Publication date
CN114363911B (en) 2023-10-17

Similar Documents

Publication Publication Date Title
CN114363911B (en) Wireless communication system for deploying hierarchical federal learning and resource optimization method
CN111629380B (en) Dynamic resource allocation method for high concurrency multi-service industrial 5G network
CN113873022A (en) Mobile edge network intelligent resource allocation method capable of dividing tasks
CN112598150B (en) Method for improving fire detection effect based on federal learning in intelligent power plant
US20220217792A1 (en) Industrial 5g dynamic multi-priority multi-access method based on deep reinforcement learning
CN112105062B (en) Mobile edge computing network energy consumption minimization strategy method under time-sensitive condition
CN111800828A (en) Mobile edge computing resource allocation method for ultra-dense network
CN111132175A (en) Cooperative computing unloading and resource allocation method and application
Gao et al. Resource allocation for latency-aware federated learning in industrial internet of things
CN104467999B (en) Spectrum sensing algorithm based on quantum leapfrog
CN110856268A (en) Dynamic multichannel access method for wireless network
CN112188503A (en) Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
Sha et al. DRL-based task offloading and resource allocation in multi-UAV-MEC network with SDN
CN116112488A (en) Fine-grained task unloading and resource allocation method for MEC network
Yu et al. Energy-aware device scheduling for joint federated learning in edge-assisted internet of agriculture things
CN113590279A (en) Task scheduling and resource allocation method for multi-core edge computing server
CN115413044A (en) Computing and communication resource joint distribution method for industrial wireless network
CN114925857A (en) Federal learning algorithm for traffic state estimation
CN114599115A (en) Unmanned aerial vehicle self-organizing network channel access method
CN118102392A (en) Task unloading modeling method based on differential privacy and depth deterministic strategy gradient
Han et al. Multi-step reinforcement learning-based offloading for vehicle edge computing
CN114327860A (en) Design and resource allocation method of wireless federal learning system
CN114022731A (en) Federal learning node selection method based on DRL
CN110602718B (en) Heterogeneous cellular network power distribution method and system based on alternative direction multiplier method
CN117076132A (en) Resource allocation and aggregation optimization method and device for hierarchical federal learning system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant