CN114363911B - Wireless communication system for deploying hierarchical federated learning and resource optimization method - Google Patents

Wireless communication system for deploying hierarchical federated learning and resource optimization method

Info

Publication number: CN114363911B
Application number: CN202111675427.8A
Authority: CN (China)
Prior art keywords: node, group, model, working, training
Other languages: Chinese (zh)
Other versions: CN114363911A
Inventors: 朱旭, 温正峤, 蒋宇飞, 王同
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202111675427.8A
Publication of CN114363911A
Application granted
Publication of CN114363911B

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a wireless communication system deploying hierarchical federated learning and a resource optimization method. A base station equipped with a cloud server serves as the global parameter server of the hierarchical federated learning architecture and performs the aggregation of global parameters. The working nodes within the base station's service range act as federated learning working nodes; they group themselves adaptively, and a head node is selected for each group to serve as the bridge and local aggregator between the working nodes and the cloud server. The Internet of Things nodes in each group train iteratively on their local datasets and upload local parameters to the group head node over D2D links. The group head nodes access the base station by frequency-division multiple access (FDMA) to complete the distribution and reception of federated learning tasks and the parameter updates and exchanges during training. The invention overcomes the drawbacks of the traditional hierarchical federated learning architecture and effectively reduces system overhead.

Description

Wireless communication system for deploying hierarchical federated learning and resource optimization method
Technical Field
The invention relates to the technical field of computer resource optimization, and in particular to a wireless communication system deploying hierarchical federated learning and a resource optimization method.
Background
In the traditional end-edge-cloud hierarchical federated learning architecture, after each iteration of the model training task, the edge nodes upload their local model parameters in parallel to an edge server for local parameter aggregation. After a certain number of rounds of local aggregation, the edge server uploads the edge-level model parameters to the cloud server, which completes the aggregation of the global model parameters. This iterative training and updating is repeated until the model converges and the task ends.
The end-edge-cloud hierarchical federated learning architecture requires the participation of edge servers, but in rural areas with limited infrastructure edge servers are not necessarily deployed, so this mode cannot be used. Second, edge server locations are generally fixed, so when working nodes are matched to edge servers the resulting grouping does not necessarily achieve the overall optimum of model training time and data distribution. In addition, the number of edge servers is generally fixed, so when a massive number of working nodes participate in model training, individual servers may be overloaded. Finally, when working nodes are mobile, the communication links between the working nodes and the servers are unstable and may be disconnected, causing working nodes to be handed over between edge servers; such unstable grouping affects the overall latency and the final accuracy of model training.
Under the existing hierarchical architecture, there has been little research on the optimization trade-off between communication and learning resources. The influence of the training-sample batch size and the parameter aggregation frequency on model learning performance and communication overhead is complex and difficult to quantify.
Disclosure of Invention
The invention aims to provide a wireless communication system deploying hierarchical federated learning and a resource optimization method, so as to solve the problems in the prior art.
To achieve the above purpose, the invention adopts the following technical scheme: a wireless communication system deploying hierarchical federated learning, in which a base station equipped with a cloud server serves as the global parameter server of the hierarchical federated learning architecture and performs the aggregation of global parameters;
the working nodes within the base station's service range act as federated learning working nodes; the working nodes group themselves adaptively, and a head node is selected for each group to serve as the bridge and local aggregator between the working nodes and the cloud server; the Internet of Things nodes in each group train iteratively on their local datasets and upload local parameters to the group head node over D2D links;
the group head nodes access the base station by frequency-division multiple access (FDMA); each group is allocated a certain amount of bandwidth for establishing a wireless communication link with the base station, so as to complete the distribution and reception of federated learning tasks and the parameter updates and exchanges during training.
Preferably, in this technical solution, each working node uses its local computing resources to complete iterative model parameter updates in parallel, based on a serial parameter uploading policy;
the serial parameter uploading policy is that, in the stage in which working nodes upload model parameters to the group head node, the working nodes in a group sequentially upload their local model parameters to the group head node in series, each using all of the bandwidth pre-allocated to the group; intra-group aggregation of the model parameters is completed at the group head node, and the updated aggregated parameters are broadcast to every working node in the group;
at the time when global aggregation is reached, the group that first completes the specified number of intra-group aggregation rounds has its head node upload the group model parameters to the global parameter server using the system's available bandwidth; if another group has also completed the specified number of aggregation rounds and the previous group has finished uploading its group model parameters, that group directly uploads its group model parameters using the system bandwidth; otherwise, it waits for the previous group to finish before starting its own upload.
Preferably, in this technical solution, in the communication model, the communication energy consumption of each working node uploading its local model parameters to the head node of its group is expressed as:
where p denotes the node transmit power and the delay variable denotes the parameter-upload communication delay of working node j.
Preferably, in this technical solution, the communication energy consumption of each group head node uploading the group model parameters to the cloud server is expressed as:
where p denotes the head-node transmit power, the delay variable denotes the parameter-upload communication delay of the head node of group i, and V_i denotes the head node of group i.
Preferably, in this technical solution, in the learning model, the local working nodes train the model task with the mini-batch SGD algorithm; the per-sample loss function of the model is denoted l, and the loss function value F_j(w) of local working node j is defined as:
where s denotes a sample in the working node's local dataset and w denotes the model parameters of the working node.
Preferably, in this technical solution, in the learning model, after a working node completes the specified number of rounds of local model training with its local training samples, it uploads its model parameters to the group head node, which completes the aggregation; the group parameter aggregation operation is expressed as:
After all working nodes in a group have completed the specified number of rounds of model parameter updates with their local training samples, the group head node uploads the group model parameters to the global parameter server, which completes the group-model-parameter aggregation; the global parameter aggregation operation is expressed as:
After the global parameter server completes global parameter aggregation, it broadcasts the updated global parameters to all group head nodes; each group head node then broadcasts them to the working nodes in its service range, and every working node continues a new round of model training and parameter updating with the newly updated global model parameters and its local training sample set, denoted as follows:
where B_j denotes the training-sample batch size of working node j; the group batch sum is the sum of the training-sample batches of all working nodes in the group; t denotes the training time; i and j index different working nodes; η denotes the learning rate in model training; the gradient term is the gradient of the loss F_j(w) of the working-node model evaluated on the training batch sample set; N denotes the number of nodes; the sampled set is the batch sample set of one round of local iterative training of a working node; B denotes the total training-sample batch size over all working nodes; and V_i denotes the head node of group i.
Another object of the present invention is to provide a resource optimization method for a wireless communication system deploying hierarchical federated learning. The optimization method is an iterative calculation method that jointly optimizes communication factors and learning factors, and comprises the following steps:
S100: perform initial grouping and parameter initialization for all working nodes, and optimize the global parameter aggregation frequency; then solve for the intra-group parameter aggregation frequency using the global parameter aggregation frequency obtained in the previous step and the remaining initialized parameter values;
S200: allocate bandwidth resources based on the current parameters;
S300: re-plan the grouping of all nodes. Iterate this optimization until the algorithm converges; the optimal solution of each variable and the minimum system overhead are finally obtained.
Preferably, in this technical solution, the optimization objective function is expressed as:
where ρ denotes a weighting coefficient; the reference values of the model convergence error overhead and the model training time overhead are used to bring these two quantities of different scales onto a common scale; and t denotes the training time.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides an aggregation-sinking hierarchical federated learning architecture: local parameter aggregation is sunk to the node side, and a group head node takes the place of an edge server to complete local parameter aggregation, remedying the drawbacks of the conventional hierarchical federated learning architecture. Under a non-convex loss function assumption, model convergence is analyzed and a theoretical upper bound on the model convergence error is derived. With the goal of minimizing the system overhead, the trade-off between the two major performance indicators is studied, and an iterative optimization algorithm that jointly handles communication factors and learning factors under the hierarchical architecture is proposed. The algorithm includes a grouping algorithm based on node exchange; closed-form optimal solutions for bandwidth allocation and the parameter aggregation frequency are derived, and the optimal training-sample batch size is solved with a concave-convex procedure optimization algorithm.
2. The invention analyzes how the two major resource classes, learning resources and communication resources, influence system performance, and provides a low-complexity optimization algorithm that reduces the system overhead.
3. The invention overcomes the drawbacks of the traditional hierarchical federated learning architecture and performs resource optimization for the new hierarchical federated learning architecture, focusing on how the training-sample batch size and the parameter aggregation frequency (learning resources) and the bandwidth (communication resource) affect system performance, thereby further reducing the system overhead.
Drawings
FIG. 1 is a block diagram of the hierarchical federated learning system according to the present invention;
FIG. 2 is a diagram showing the effect of serial parameter uploading according to the present invention;
FIG. 3 is a flow chart of the resource optimization method of the present invention;
FIG. 4-1 is a geographic coordinate diagram of a wireless communication system with a three-layer federated learning architecture according to the present invention;
FIG. 4-2 is a graph showing the effect of the group aggregation frequency on the model loss function value according to the present invention;
FIG. 4-3 is an experimental graph showing the effect of the group aggregation frequency on the model loss function value according to the present invention;
FIG. 4-4 is a graph of the model training time overhead and the model convergence error overhead versus the global parameter aggregation frequency according to the present invention;
FIG. 4-5 is a graph showing the variation of the system overhead with the model parameter size according to the present invention;
FIG. 4-6 is a graph showing the variation of the system overhead with the total training-sample batch size according to the present invention;
FIG. 4-7 is a bar graph illustrating the impact of the data distribution of the working nodes on the system overhead;
FIG. 4-8 is a bar graph of the system overhead versus the working-node computation-time gap threshold according to the present invention;
FIG. 4-9 is a bar graph of the system overhead versus the group training-time gap threshold according to the present invention;
FIG. 4-10 is a graph of the system overhead as a function of the system bandwidth resources according to the present invention;
FIG. 4-11 is an experimental diagram of the convergence performance of the T-JCL algorithm of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention. The same or similar concepts or processes may not be described in detail in some embodiments.
Referring to fig. 1, the wireless communication system deploying hierarchical federated learning of the present invention is a hierarchical federated learning system model based on aggregation sinking. A base station equipped with a cloud server serves as the global parameter server of the hierarchical federated learning architecture and performs the aggregation of global parameters. The working nodes within the base station's service range act as federated learning working nodes; they group themselves adaptively, and a head node is selected for each group to serve as the bridge and local aggregator between the working nodes and the cloud server. The Internet of Things nodes in each group train iteratively on their local datasets and upload local parameters to the group head node over D2D links. The group head nodes access the base station by frequency-division multiple access (FDMA), i.e., each group is allocated a certain amount of bandwidth for establishing a wireless communication link with the base station, completing the distribution and reception of federated learning tasks and the parameter updates and exchanges during training.
The system model comprises at least three stages:
The first stage: local model training at the working nodes;
The second stage: the working nodes upload their local model parameters to the group head node;
The third stage: the groups upload the group model parameters to the global parameter server.
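To make the roles in this system model concrete, the following minimal Python sketch mirrors the entities described above: a base station with a cloud server acting as the global parameter server, adaptively formed groups, and one head node per group acting as local aggregator. All class names, field names and numeric values are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WorkingNode:
    node_id: int
    compute_freq_hz: float      # f_j: local computing frequency
    batch_size: int             # B_j: local training-sample batch size
    samples: int                # number of local training samples

@dataclass
class Group:
    group_id: int
    members: List[WorkingNode]  # working nodes served by this group
    head_index: int = 0         # index (into members) of the selected head node
    bandwidth_hz: float = 0.0   # bandwidth pre-allocated to this group (FDMA)

    @property
    def head(self) -> WorkingNode:
        # the head node bridges the group and the cloud server
        return self.members[self.head_index]

@dataclass
class GlobalParameterServer:
    """Base station equipped with a cloud server; aggregates group parameters."""
    groups: List[Group] = field(default_factory=list)

# Example: two groups of three nodes each, head node chosen as member 0 (illustrative).
nodes = [WorkingNode(i, compute_freq_hz=1e9, batch_size=10, samples=500) for i in range(6)]
server = GlobalParameterServer(groups=[
    Group(0, nodes[:3], head_index=0, bandwidth_hz=1e6),
    Group(1, nodes[3:], head_index=0, bandwidth_hz=1e6),
])
print([g.head.node_id for g in server.groups])   # -> [0, 3]
```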
As shown in fig. 2, the first stage is the local model training phase of the working nodes:
Specifically, each working node uses its local computing resources to complete iterative model parameter updates in parallel. In the stage in which working nodes upload model parameters to the group head node, the working nodes in a group sequentially upload their local model parameters to the group head node in series, each using all of the bandwidth pre-allocated to the group. Intra-group aggregation of the model parameters is completed at the group head node, and the updated aggregated parameters are broadcast to all working nodes in the group. At the time when global aggregation is reached, the group that first completes the specified number of intra-group aggregation rounds has its head node upload the group model parameters to the global parameter server using the system's available bandwidth, which also ensures that the bandwidth is fully utilized during training. If another group has also completed the specified number of aggregation rounds and the previous group has finished uploading its group model parameters, that group can directly upload its group model parameters using the system bandwidth; otherwise, it waits for the previous group to finish before starting its own upload.
The communication model of the aggregation-sinking three-layer federated learning architecture is described next:
First, define the set of working nodes in group i, the group head node H_i, and the set of head nodes. The uplink rate R_j of member working node j in group i can be expressed as:
In formula (1), α_i denotes the bandwidth resource (Hz) allocated to the working node; p denotes the transmit power (W) of the working node; δ² denotes the noise power (W); and h_{i,j} denotes the channel coefficient between the working node and the head node.
The uplink rate R_i of head node H_i in group i can be expressed as:
In formula (2), α_i denotes the bandwidth resource (MHz) for communication between the group head node and the base station, and the channel coefficient in the formula is that between the working node and the base station.
The computation time for each working node to complete one round of local training can be expressed as:
In formula (3), C denotes the size (bit) of a training sample at the working node; V_j denotes the number of processing cycles the working node requires per sample (cycles/sample); and f_j denotes the computing frequency (Hz) of the working node.
The communication delay for each working node to upload its local model parameters to its associated group head node can be expressed as:
In formula (4), the bandwidth term denotes the bandwidth resource (Hz) allocated to group i.
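As a numerical illustration of the quantities named around formulas (1) to (4), the sketch below assumes a Shannon-type uplink rate R = α·log2(1 + p·h/δ²), a per-round computation time that scales as C·V_j·B_j/f_j, and an upload delay equal to the model size divided by the uplink rate. These closed forms are assumptions consistent with the listed variables, not reproductions of the patent's exact equations, and all numeric values are illustrative.

```python
import math

def uplink_rate(bandwidth_hz: float, tx_power_w: float,
                channel_gain: float, noise_power_w: float) -> float:
    """Assumed Shannon-type uplink rate in bit/s (cf. the variables of formulas (1)-(2))."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_power_w)

def local_compute_time(sample_bits: float, cycles_per_sample: float,
                       batch_size: int, cpu_freq_hz: float) -> float:
    """Assumed per-round computation time C * V_j * B_j / f_j (cf. formula (3))."""
    return sample_bits * cycles_per_sample * batch_size / cpu_freq_hz

def upload_delay(model_bits: float, rate_bps: float) -> float:
    """Parameter-upload delay: model size divided by the uplink rate (cf. formula (4))."""
    return model_bits / rate_bps

# Illustrative numbers only.
rate = uplink_rate(bandwidth_hz=1e6, tx_power_w=0.1, channel_gain=1e-7, noise_power_w=1e-10)
print(local_compute_time(8e3, 20.0, 10, 1e9), upload_delay(5e5, rate))
```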
The communication delay for a group head node to upload the group model parameters to the global parameter server can be expressed as:
The second stage: the working nodes upload their local model parameters to the group head node.
Specifically, the working nodes in a group are sorted in ascending order of their computation times to obtain an ordered set V_i^S, and the working nodes that have completed local training are selected from this set in sequence. Because the computation times are already ordered, the first selected working node is the one with the smallest computation time, i.e., the first node of the ordered set; it uploads its local model parameters to the head node of the group using the bandwidth pre-allocated to the group.
The more specific process is illustrated by continuing with the communication model:
For example, define T_{i,j}, for working node j in group i, as the time from the start of the synchronized training computation of all nodes in the group until node j completes uploading its local model training parameters to the group head node:
where T_{i,j'} denotes the corresponding time of the node preceding working node j in the set V_i^S ordered by local computation time; the superscript S is only a label with no physical meaning; i denotes group i and V_i^S denotes the ordered node set of group i; T_{i,j}^{cmp} denotes the computation time of node j in group i, the superscript cmp being only a label; and T_{i,j}^{comm} denotes the communication delay of node j in group i, the superscript comm being only a label.
Since the working node with the shortest computation time in the group is the first to start uploading its local model parameters, the I-round local computation time of the first node in the set V_i^S can be expressed as:
From the above, provided that the condition given below is satisfied, the time required to complete one round of aggregation update of the group model parameters within the group can be expressed as:
that is,
similarly, the mode of uploading the model parameters of each packet to the global parameter server is the same as the strategy of uploading the model parameters to the packet header node by the working node in the packet.
And a third stage: a step of uploading the grouping model parameters to a global parameter server in a grouping way;
specifically, each group is trained according to the intra-group training time T i The values are ordered from small to large to obtain an ordered set V S And sequentially selecting groups from the set, and uploading the group model parameter vector. Each packet communicates with the global parameter server using all bandwidth resources within the system. Then, the time for the ith packet to complete the uploading of the packet model parameters can be expressed as:
wherein T is i′ Represented as an ordered set V S The former group of the ith group, the G italics indicates the variable, which is the global parameter aggregation frequency; l is a positive body, and the description is for identification and has no physical meaning.
Set V S The time for the first packet in a round of packet parameter upload to complete can be expressed as:
as can be seen from the above formula (11), when satisfyingOn the premise that the time required to complete a round of aggregate updates of global model parameters can be expressed as +.>That is to say,
as can be seen from equation (13), the time of a round of global parameter aggregation is closely related to the global parameter aggregation frequency, the packet parameter aggregation frequency, and the allocated bandwidth resources of each packet.
And averaging the sum of calculation and communication time to each round of local iterative training, and defining the model training time cost under the three-layer federal learning architecture as follows:
equation (14) characterizes the sum of the computation and communication time required for the average working node to complete a round of local training.
Moreover, the energy consumption limit of the working nodes is one of the key constraints on system performance. Since the computation energy consumption of a working node is proportional to the training-sample batch size and to the square of the node's computing frequency, the computation energy consumption of working node j for completing one round of local iterative training can be expressed as:
where κ denotes the computation energy-consumption coefficient and B_j denotes the training-sample batch size of working node j.
The communication energy consumption of each working node to upload its local model parameters to the head node of its group can be expressed as:
The communication energy consumption of each group head node to upload the group model parameters to the cloud server can be expressed as:
where p denotes the transmit power.
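A small sketch of the per-round energy terms described above. The computation energy is written as κ·C·V_j·B_j·f_j², matching the stated proportionality to the batch size and to the square of the computing frequency, and the communication energy as transmit power times upload delay; both closed forms are assumptions consistent with the text rather than copies of the patent's formulas, and the numeric values are illustrative.

```python
def compute_energy(kappa: float, sample_bits: float, cycles_per_sample: float,
                   batch_size: int, cpu_freq_hz: float) -> float:
    """Assumed computation energy of one local iteration: kappa * C * V_j * B_j * f_j^2."""
    return kappa * sample_bits * cycles_per_sample * batch_size * cpu_freq_hz ** 2

def comm_energy(tx_power_w: float, upload_delay_s: float) -> float:
    """Communication energy: transmit power times parameter-upload delay (assumed form)."""
    return tx_power_w * upload_delay_s

print(compute_energy(1e-28, 8e3, 20.0, 10, 1e9))   # illustrative values only
print(comm_energy(0.1, 0.25))
```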
Further, under the three-layer federated learning architecture, the local working nodes train the model task with the mini-batch SGD algorithm; the purpose of model training is to minimize the global loss function value so that the model reaches a convergence state. The learning model is as follows:
Define the per-sample loss function of the model as l; the loss function value F_j(w) of local working node j can be defined as:
where s denotes a sample in the working node's local dataset and w denotes the model parameters of the working node.
In each round of local iteration, each working node randomly draws a subset of training samples of a certain batch size from its local training sample set and updates its local model parameters with the mini-batch stochastic gradient descent algorithm; the update of the model parameters of working node j can be expressed as:
where η denotes the learning rate in model training; the sampled set is the batch sample set of one round of local iterative training of the working node; and the gradient term is the gradient of the working-node loss function F_j(w) evaluated on that training batch sample set.
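A minimal NumPy sketch of the local mini-batch SGD update described above: a batch is drawn at random from the node's local training set and the parameters are moved against the gradient scaled by the learning rate η. The quadratic toy loss and all numeric values are illustrative; the patent's loss function l is generic.

```python
import numpy as np

def local_sgd_step(w: np.ndarray, data: np.ndarray, targets: np.ndarray,
                   batch_size: int, lr: float, rng: np.random.Generator) -> np.ndarray:
    """One local mini-batch SGD update: w <- w - eta * grad F_j(w; batch)."""
    idx = rng.choice(len(data), size=batch_size, replace=False)  # random batch of size B_j
    x, y = data[idx], targets[idx]
    # Toy squared loss l(w; (x, y)) = 0.5 * (x @ w - y)^2, so the batch gradient is:
    grad = x.T @ (x @ w - y) / batch_size
    return w - lr * grad

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))
targets = data @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * rng.normal(size=500)
w = np.zeros(4)
for _ in range(200):
    w = local_sgd_step(w, data, targets, batch_size=10, lr=0.05, rng=rng)
print(np.round(w, 2))   # approaches the underlying coefficients
```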
Under the three-layer federated learning architecture, after the working nodes in a group complete a certain number of rounds of local model training, they upload their local model parameters to the head node of their group. The loss function of a group is expressed as the weighted sum of the loss function values of all working nodes within its service range (including the head node itself); the weighting coefficients are determined by each working node's training-sample batch size relative to the total training-sample batch used within the group.
The intra-group loss function value can then be defined as:
where p_j denotes the loss-value weight of working node j.
In general, when defining the group loss function, the loss-value weight of each working node is defined as the ratio of the local training batch size used in model training to the sum of the training-sample batch sizes used by all working nodes in its group, i.e.:
where B_j denotes the training-sample batch size of working node j, and the denominator is the sum of the training-sample batches of all working nodes in the group.
Under the three-layer federated learning architecture, the global model loss function is defined as the weighted sum of the loss functions of the groups and can be expressed as:
where p_i denotes the loss-function weight of group i.
In general, in the global loss function, the weight of each group's loss function is the proportion of the group's training-sample batch in model training relative to the training-sample batch used by all working nodes globally, defined as:
where B denotes the total training-sample batch size of all working nodes.
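The batch-size-proportional weighting described above can be sketched as follows: each node's weight p_j is B_j over its group's batch sum, each group's weight p_i is its batch sum over the global batch B, and the group and global losses are the corresponding weighted sums. Function names and the sample numbers are illustrative.

```python
from typing import List

def node_weights(batch_sizes: List[int]) -> List[float]:
    """p_j = B_j / sum of the batch sizes of all working nodes in the group."""
    total = sum(batch_sizes)
    return [b / total for b in batch_sizes]

def group_loss(node_losses: List[float], batch_sizes: List[int]) -> float:
    """Intra-group loss: batch-weighted sum of the member nodes' loss values."""
    return sum(p * f for p, f in zip(node_weights(batch_sizes), node_losses))

def global_loss(group_losses: List[float], group_batches: List[int]) -> float:
    """Global loss: weighted sum of group losses with p_i = group batch / global batch B."""
    B = sum(group_batches)
    return sum((b / B) * f for b, f in zip(group_batches, group_losses))

losses = {0: [0.9, 1.1, 1.0], 1: [0.7, 0.8]}
batches = {0: [10, 20, 10], 1: [10, 10]}
g_losses = [group_loss(losses[i], batches[i]) for i in (0, 1)]
g_batch = [sum(batches[i]) for i in (0, 1)]
print(g_losses, global_loss(g_losses, g_batch))
```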
After a working node completes the specified number of rounds of local model training with its local training samples, it uploads its model parameters to the group head node, which completes the aggregation; the group parameter aggregation operation at training time t can be expressed as:
where the batch term denotes the training-sample batch size of working node j;
After all working nodes in a group have completed the specified number of rounds of model parameter updates with their local training samples, the group head node uploads the group model parameters to the global parameter server, which completes the aggregation of the group model parameters; the global parameter aggregation operation can be expressed as:
After the global parameter server completes global parameter aggregation, it broadcasts the updated global parameters to all group head nodes; each group head node then broadcasts them to the working nodes in its service range, and each working node continues a new round of model training and parameter updating with the newly updated global model parameters and its local training sample set, as follows:
The unified expression of the model parameters of working node j at different times is:
The iterative training and model aggregation updates are repeated until the model converges within a certain error range, at which point model training ends.
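The overall train / group-aggregate / global-aggregate cycle described above can be sketched as follows: every I local iterations each group head computes a batch-weighted average of its members' parameters and broadcasts it back, and every G local iterations the global server averages the group parameters, weighted by group batch size, and broadcasts the result to all groups. The weighting and the scheduling follow the text; the function signatures and the toy local update are illustrative assumptions.

```python
import numpy as np

def weighted_average(params: list, weights: list) -> np.ndarray:
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * pi for wi, pi in zip(w, params))

def hierarchical_fl(groups, local_step, I: int, G: int, total_iters: int):
    """groups: list of lists of dicts {'w': ndarray, 'B': int}; local_step returns new 'w'."""
    for t in range(1, total_iters + 1):
        for grp in groups:
            for node in grp:
                node['w'] = local_step(node['w'], node['B'])   # one local SGD iteration
        if t % I == 0:                                         # intra-group aggregation
            for grp in groups:
                agg = weighted_average([n['w'] for n in grp], [n['B'] for n in grp])
                for n in grp:
                    n['w'] = agg.copy()
        if t % G == 0:                                         # global aggregation
            g_params = [grp[0]['w'] for grp in groups]         # groups are already synced
            g_weight = [sum(n['B'] for n in grp) for grp in groups]
            glob = weighted_average(g_params, g_weight)
            for grp in groups:
                for n in grp:
                    n['w'] = glob.copy()
    return groups

# Toy run: a shrink-toward-zero step stands in for real local training.
rng = np.random.default_rng(1)
groups = [[{'w': rng.normal(size=3), 'B': 10} for _ in range(3)] for _ in range(2)]
hierarchical_fl(groups, local_step=lambda w, B: w - 0.1 * w, I=5, G=30, total_iters=60)
print(np.round(groups[0][0]['w'], 3))
```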
Convergence analysis of the enhanced hierarchical federated learning architecture model of the present system
The analysis is based on the following preconditions:
(1) The model loss function is L-Lipschitz continuous.
(2) Each working node randomly draws training samples from its local training sample set, and the resulting sampling error satisfies the following formula:
where ξ is the sampling-error constant coefficient and D_j is the number of training samples of working node j;
(3) The difference between each working node's gradient value and the global gradient value characterizes the degree to which that working node deviates in global model training, and indirectly reflects the degree of data heterogeneity among the working nodes.
(4) The gradient values of the working nodes are assumed to be bounded above.
(5) The difference between each group's weighted gradient value and the global gradient value characterizes the degree to which that group deviates in global model training and indirectly reflects the degree of data heterogeneity among the groups; it is defined mathematically as:
Under assumptions (1), (2), (3), (4) and (5), let w denote the model parameters throughout. If ηL < 1, then for any T ≥ 1, under the hierarchical federated learning architecture the theoretical upper bound of the model convergence error satisfies the formula:
where:
It can be seen that the number of working nodes N, the number of groups M, the number of samples B_j randomly drawn by each working node for model training, the degree of data heterogeneity among the working nodes and among the groups, the group parameter aggregation frequency I, and the global parameter aggregation frequency G are all important factors affecting the theoretical upper bound on the model convergence error. To better characterize the precision error introduced by each round of local iterative training, the model training convergence error overhead is defined as:
method for optimizing resources by using system
The weighted sum of model convergence error cost and model training time cost under the hierarchical federal learning architecture is taken as system cost, the minimized system cost is taken as an optimization target, an optimization problem is established, and an optimization problem objective function is expressed as:
Wherein ρ represents a weighting coefficient
And->Reference values representing model convergence error overhead and model training time overhead are used to unify values of two different scales.
Thus, the joint optimization that minimizes the overhead is calculated by constructing as a P0 form, specifically:
/>
wherein, (C1) represents constraint condition limiting model training batch sample size, (C2) and (C3) represents constraint condition limiting training batch sample size of each group and all groups, (C4) and (C5) represent constraint condition limiting system bandwidth resource, (C6) represents constraint condition limiting aggregation frequency as positive integer, (C7) represents constraint condition limiting energy consumption of working node, (C8) represents constraint condition limiting time of previous working node after finishing local training and uploading model parameters after each group is sorted according to calculated time length and local calculating time of next working node, and (C9) represents about time of previous group after finishing training and uploading model parameters is larger than training time of next group after the constraint condition limiting all groups are sorted according to total time length.
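A sketch of the system overhead objective described above as a weighted sum of the normalized model convergence error overhead and the normalized model training time overhead. Splitting the weights as ρ and (1 - ρ) and normalizing by the stated reference values is an assumption about the exact form of the objective; ρ and the reference values themselves come from the text, while the numbers below are illustrative.

```python
def system_overhead(convergence_error: float, training_time: float,
                    rho: float, err_ref: float, time_ref: float) -> float:
    """Assumed objective: rho * normalized error overhead + (1 - rho) * normalized time overhead."""
    return rho * (convergence_error / err_ref) + (1.0 - rho) * (training_time / time_ref)

# Illustrative trade-off: a higher aggregation frequency might lower time but raise error.
print(system_overhead(convergence_error=0.8, training_time=12.0, rho=0.5,
                      err_ref=1.0, time_ref=20.0))
```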
Since constraints (C8) and (C9) have to be handled on the basis of the working nodes and the groups being sorted by their indices, the optimization problem is difficult to transform and solve. To make it more tractable, relaxation variables are introduced to represent the maximum difference between the computation times of the working nodes within the same group after the I rounds of local iterative training, and the maximum difference between the total times of different groups after the G/I rounds of intra-group iterative training and parameter aggregation updating are completed. Moreover, once these two thresholds are introduced, the performance gap between the bandwidth allocation strategy proposed by the invention and the traditional bandwidth pre-allocation scheme can be observed by controlling the size of the thresholds, because if the computing capabilities of the working nodes within the same group are comparable, the times for the working nodes to complete a fixed number of rounds of local iterative training are similar.
Constraints (C8) and (C9) can then be converted into the following form:
where C(8.1) and C(9.1) are mathematical relaxations of the constraints, and p, q, m, n all denote node indices;
Because the objective function of the P0 optimization problem contains min functions in the delay cost terms, the relaxation variables τ and τ_i are introduced for ease of solution.
Based on the introduction of the above time relaxation variables, constraints (C8) and (C9) can be further converted into the following form:
where C(8.2) and C(9.2) are the constraints after scaling with the relaxation variables, and p and q are simply subscripts denoting node p and node q;
Based on the introduction of the four relaxation variables above and the adjustment of the constraints, the P0 optimization problem can be converted into the form P1.
s.t. (C1), (C2), (C3), (C4), (C5), (C6), (C7), (C8.2), (C9.2)
where:
Obviously, P1 is still a non-convex optimization problem. Based on the above non-convex problem and definitions, the invention proposes an iterative algorithm that jointly optimizes communication factors and learning factors under the three-layer federated learning architecture, namely the T-JCL algorithm. As shown in fig. 3, the overall flow of the T-JCL algorithm is:
S100: perform initial grouping and parameter initialization for all working nodes, and optimize the global parameter aggregation frequency; then solve for the intra-group parameter aggregation frequency using the global parameter aggregation frequency obtained in the previous step and the remaining initialized parameter values;
S200: allocate bandwidth resources based on the current parameters;
S300: re-plan the grouping of all nodes. Iterate this optimization until the algorithm converges; the optimal solution of each variable and the minimum system overhead are finally obtained.
Simulation experiment
The invention provides an aggregation-sinking hierarchical federated learning architecture: local parameter aggregation is sunk to the node side, and a group head node takes the place of an edge server to complete local parameter aggregation, remedying the drawbacks of the conventional hierarchical federated learning architecture. Under a non-convex loss function assumption, model convergence is analyzed and a theoretical upper bound on the model convergence error is derived. With the goal of minimizing the system overhead, the trade-off between the two major performance indicators is studied, and an iterative optimization algorithm that jointly handles communication factors and learning factors under the hierarchical architecture is proposed. The algorithm includes a grouping algorithm based on node exchange; closed-form optimal solutions for bandwidth allocation and the parameter aggregation frequency are derived, and the optimal training-sample batch size is solved with a concave-convex procedure optimization algorithm. Simulation results show that, compared with optimization algorithms that consider only a single factor, the proposed joint optimization algorithm effectively reduces the system overhead and converges quickly; the simulation experiments are described below.
A wireless communication system with one base station equipped with a server is constructed. The service coverage of the wireless communication system is a circular area with a radius of 250 m; the base station is located at the center of the circle, and the working nodes are uniformly distributed within the circular area; the geographic coordinates are shown in Fig. 4-1. The global parameter server is located at the circle center. In the three-layer federated learning architecture, red dots denote group head nodes, which act as local parameter aggregators performing group parameter aggregation and also act as working nodes performing model training of the federated learning task. The remaining scattered markers of different shapes denote ordinary working nodes. The simulation sets the path-loss model between a group head node and the global parameter server to PL1 = 128.1 + 37.6·log10(d1), where d1 is the distance between the working node and the base station in km, and the path-loss model between working nodes to PL2 = 148 + 40·log10(d2), where d2 is the distance between working nodes in km. The remaining simulation parameters are listed in Table 4-1.
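The two path-loss models quoted above translate directly into code; converting a path loss in dB into a linear channel gain via 10^(-PL/10) is an added assumption for illustration only.

```python
import math

def pathloss_head_to_bs_db(d_km: float) -> float:
    """PL1 = 128.1 + 37.6 * log10(d1), d1 in km (head node <-> global parameter server)."""
    return 128.1 + 37.6 * math.log10(d_km)

def pathloss_node_to_node_db(d_km: float) -> float:
    """PL2 = 148 + 40 * log10(d2), d2 in km (between working nodes)."""
    return 148.0 + 40.0 * math.log10(d_km)

def linear_gain(pl_db: float) -> float:
    """Assumed conversion from path loss in dB to a linear channel gain."""
    return 10.0 ** (-pl_db / 10.0)

print(pathloss_head_to_bs_db(0.2), linear_gain(pathloss_node_to_node_db(0.05)))
```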
For convenience of description and comparison of the algorithms, several comparison schemes are defined.
(1) Scheme one: a scheme that considers only the influence of the communication factor on the system overhead, called the T-C algorithm.
(2) Scheme two: a scheme that considers only the influence of the learning factors on the system overhead, called the T-L algorithm.
(3) Scheme three: a scheme that jointly considers the influence of the communication factor and the parameter aggregation frequency on the system overhead, called the T-JCI algorithm; the training-sample batch size is set to half of the local training data set.
(4) Scheme four: a scheme that jointly considers the influence of the communication factor and the training-sample batch size on the system overhead, called the T-JCB algorithm; the group and global parameter aggregation frequencies are fixed at 5 and 30.
(5) Scheme five: an exhaustive search algorithm that considers the influence of the communication factors and the learning factors on the system overhead.
To study the impact of different degrees of data heterogeneity on the system overhead, different allocation schemes of the training samples are defined.
On the basis of unifying the dimensions of the optimization objective, the reference values are selected as follows.
(1) 30 working nodes and 5 groups; each working node selects half of its local training-sample batch size for training each time, and the data distribution among the working nodes is heterogeneous.
(2) 30 working nodes and 5 groups; each working node selects half of its local training-sample batch size for training each time.
Verification of the theoretical model convergence analysis under the three-layer federated learning architecture
The simulation is based on the MNIST dataset and trains a CNN regression model. The simulation sets 30 nodes and 5 groups; the data follows the heterogeneous distribution of scheme C defined above, and each working node has the same number of training samples. Model training uses mini-batch gradient descent with the batch size set to 10.
Figs. 4-2 and 4-3 simulate and verify the relation between the group aggregation frequency and the global loss function and model convergence accuracy under a fixed global parameter aggregation frequency G = 30 in the three-layer federated learning architecture. Three group-parameter aggregation frequencies, I ∈ {1, 5, 10}, are selected for simulation comparison. As the figures show, for a fixed global parameter aggregation frequency, the larger the group parameter aggregation frequency value, that is, the more local iterations required before each group aggregation, the larger the loss value of the model at the same iteration round and the lower its accuracy. This is because the smaller the group aggregation frequency value, the more frequent the aggregation and the parameter interaction between the models, and the stronger the generalization ability of model training; the larger the group aggregation frequency value, the less frequent the aggregation, and the model tends to fall into the local optimum of each working node rather than the global optimum, resulting in poorer model training accuracy.
Next, based on Monte Carlo simulation experiments, the T-JCL algorithm and its comparison schemes are verified and analyzed, studying the overhead optimization of the three-layer federated learning architecture deployed in a wireless communication system.
Figs. 4-4 illustrate the relationship and impact of the global parameter aggregation frequency on the model training time overhead and the model convergence error overhead under the three-layer federated learning architecture.
As can be seen from the relationship between the y-axis and the x-axis on the right side of the figure, the model training time overhead decreases with the increase of the global parameter aggregation frequency, because in the case of the fixed packet parameter aggregation frequency, the increase of the global parameter aggregation frequency represents the increase of the packet aggregation number required for each global parameter aggregation, and the overhead of one round of communication delay is averaged to the communication delay on each local iteration calculation, thereby resulting in the decrease of the model training time overhead.
As can be seen from the relationship between the y-axis and the x-axis on the left side of the graph, the model convergence error overhead increases with increasing global parameter aggregation frequency. This is because, in the case of a fixed packet aggregation frequency, an increase in the global parameter aggregation frequency increases the number of packet aggregations required for the model to complete one round of global aggregation in training, and the aggregation operation of the packet parameters is an important way of averaging the model differences between packets. The reduction in the number of packet parameter aggregations allows each packet to more closely approximate the optimal solution for the set of data sets in the trained model than the global optimal solution. Particularly, in the case of obvious inter-packet data isomerism, the reduction of the packet aggregation times can obviously increase the model parameter difference among each packet, so that the parameters of each packet deviate from the globally optimal parameters, and further the model convergence error overhead is increased.
Figs. 4-5 show the relationship between the system overhead and the size of the model parameters to be transmitted, with the number of working nodes in the system fixed. The joint communication-and-learning-factor optimization algorithm T-JCL is compared with T-C and T-L, which consider only communication-factor optimization or only learning-factor optimization, respectively.
As can be seen from the figure, as the size of the model parameters increases, the overhead under the four algorithms increases. This is because more communication time is required for transmission of the model parameters, so that model training time overhead increases, and thus system overhead increases. While the overhead of the T-JCL algorithm is significantly lower than that of the T-C and T-L algorithms. This is because the T-C and T-L algorithms consider only the communication factor (bandwidth resource) or the learning factor (aggregate frequency and batch size), respectively, and the lack of an optimization factor results in insufficient performance. The T-C algorithm only considers the allocation of bandwidth resources, so that optimization of model convergence error overhead is omitted, the overall performance is worst, and the system overhead is highest. However, the iterative T-JCL algorithm provided by the invention approximates the algorithm based on the exhaustive search in performance under the advantage of low complexity, and proves the effectiveness of the algorithm.
FIGS. 4-6 show the relationship between system overhead and training sample batch size, comparing the performance differences of the T-JCB, T-JCI and T-JCL algorithms. As can be seen from fig. 4-6, as the training sample batch size increases, the overhead under the four algorithms increases. This is because, with a fixed number of nodes, there are increased training samples, and thus there is a possibility that the number of training samples allocated to each working node increases, and more communication time is required for training the model parameters, so that the model training time overhead increases, and the system overhead increases. The system overhead of the T-JCL algorithm provided by the invention is obviously lower than that of the T-JCB algorithm and the T-JCI algorithm. Because the joint factor-based optimization algorithm of the invention considers more relevant factors, the algorithm has stronger performance advantages compared with the algorithm considering a single factor. In terms of time complexity and performance trade-off of the algorithm, the iterative T-JCL algorithm provided by the invention approximates the algorithm based on exhaustive search in performance under the advantage of low complexity, and proves the effectiveness of the algorithm.
Fig. 4-7 illustrate the impact of different data distributions of the working nodes on system overhead and compare the four schemes.
As can be seen from Figs. 4-7, under all four schemes the overhead is lowest with data distribution A, second lowest with data distribution B, and highest with data distribution C. This is because, as the degree of data heterogeneity increases, the model convergence error overhead increases, and the system overhead increases with it. Comparing the four schemes, the figure shows that under data distribution A the two schemes with batch-size optimization have lower system overhead than the two schemes with fixed batches, while the difference in system overhead between random grouping and grouping optimization is not obvious. The reason is that under data distribution A the data among the working nodes is homogeneously distributed, so random grouping already guarantees homogeneous data within each group; the grouping-optimization scheme therefore differs little from the random-grouping scheme on the formed groups, and the difference in system overhead is not obvious. Under data distributions B and C, the random-grouping scheme cannot guarantee that the data distributions formed among the groups are homogeneous, so its system overhead is higher than that of the grouping-optimization scheme. Regarding fixed batch size versus batch-size optimization, the schemes using the batch-optimization algorithm have clearly lower system overhead than the fixed-batch schemes, demonstrating the impact and benefit of batch-size optimization on the overhead.
Figs. 4-8 show the variation of the system overhead with the working-node computation-time gap threshold, and compare and analyze the performance differences under different parameter uploading strategies and bandwidth allocation schemes.
As can be seen from Figs. 4-8, under the same threshold the scheme with fixed bandwidth and parallel uploading has obviously higher system overhead than the other two schemes with bandwidth optimization, which shows that bandwidth optimization has a positive influence on the system overhead. Among the schemes that all consider bandwidth optimization, the scheme based on the serial uploading mode has lower overhead than the scheme based on the parallel uploading mode. This is because, in a scenario with heterogeneous computing power, the parallel uploading mode wastes part of the bandwidth due to bandwidth pre-allocation, which increases the model training time overhead and thus the system overhead. Figs. 4-8 also show that the system overhead of the serial-uploading scheme gradually decreases as the working-node computation-time gap threshold increases. In a computationally heterogeneous scenario, as the gap threshold increases, the control over the differences in computation time between the working nodes changes from strict to loose, and the training-sample batch size is also a key factor affecting the computation-time gap. When the gap threshold is relaxed, the extent to which changes in the training-sample batch size are limited by the computation time is relaxed, leaving a larger optimization space for reducing the model convergence error overhead through the training-sample batch size, so the model convergence error overhead decreases. Hence, as the gap threshold increases, the overhead of the serial-mode scheme decreases.
Fig. 4-9 show the variation of system overhead with packet training time gap threshold, and compare and analyze the performance differences under different parameter uploading strategies and bandwidth resource allocation schemes.
As can be seen from fig. 4-9, as the inter-packet training time gap threshold increases, the overhead decreases. This is because as the training time gap constraint between control packets is relaxed, there is more control space for the parameters on the optimization overhead. Based on the scheme of serial and parallel modes, the system overhead is similar. This is because the inter-packet training time gap threshold is typically lower than the packet training time, making the difference small.
Fig. 4-10 illustrate the impact of system bandwidth resources on system overhead. It can be observed from fig. 4-10 that the T-JCL algorithm proposed by the present application is significantly lower in overhead than the algorithm that does not consider learning factor optimization. And the performance of the T-JCL algorithm is similar to an algorithm based on exhaustive search, so that the system performance approximation under low complexity is realized. In addition, as the bandwidth resources of the system increase, the overhead decreases. This is because the bandwidth resources available within the system increase, with each working node within the system having an increased bandwidth resource allocated, and with the model training time overhead of the system decreasing, such that the system overhead decreases.
Figs. 4-11 show the convergence performance of the proposed T-JCL algorithm for different numbers of groups. It can be seen that, for group numbers of 3, 5, and 10, the proposed T-JCL algorithm converges within a limited number of rounds, which shows that the iteratively run T-JCL algorithm has good convergence performance.
In summary, the system of the present application has a simple structure and low cost, the resource optimization method is simple, and it improves convergence while reducing overhead.
The structures, connection relationships, operation principles, and the like, which are not described in the present embodiment, are implemented by using the prior art, and a detailed description thereof will not be repeated here.
Although embodiments of the present application have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the application, the scope of which is defined in the appended claims and their equivalents.

Claims (3)

1. A wireless communication system for deploying hierarchical federation learning is characterized in that the system uses a base station deployed with a cloud server as a global parameter server in a hierarchical federation learning architecture to execute the aggregation operation of global parameters;
taking the working nodes within the service range of the base station as federal learning working nodes, wherein the working nodes perform self-adaptive grouping, and a head node is selected for each group to serve as a bridge and local aggregator between the working nodes and the cloud server; the Internet of things nodes in each group are trained iteratively using their local data sets, and the local parameters are uploaded to the group head node via the D2D mode;
the group nodes access the base station in a frequency division multiple access (FDMA) mode, and each group node is allocated certain bandwidth resources to establish a wireless communication link with the base station, so as to complete the allocation and reception of federal learning tasks as well as the updating and exchange of parameters during training;
each working node utilizes local computing resources to finish iterative updating of model parameters in parallel based on a serial parameter uploading strategy;
the serial parameter uploading strategy means that, in the stage in which the working nodes upload model parameters to the group head node, the working nodes within a group sequentially and serially upload their local model parameters to the group head node using all of the bandwidth resources pre-allocated to the group; the group head node completes the intra-group aggregation of the model parameters and broadcasts the updated aggregated parameters to each working node in the group;
at the stage of reaching a global aggregation time node, the group that first completes the designated number of intra-group aggregation rounds has its group head node upload the group model parameters to the global parameter server using the available bandwidth resources of the system; if another group has also completed the designated number of rounds of parameter aggregation and the previous group has already finished uploading its group model parameters, that group directly uploads its group model parameters using the system bandwidth resources; otherwise, it waits for the previous group to finish uploading its group model parameters before starting to upload its own;
in the communication model, each working node uploads its local model parameters to the head node of the group to which it belongs, and the corresponding communication energy consumption is expressed as

E_j = p · τ_j

wherein p represents the transmit power; τ_j represents the parameter-upload communication delay of working node j;
each group head node uploads the group model parameters to the cloud server, and the corresponding communication energy consumption is expressed as

E_{V_i} = p · τ_{V_i}

wherein p represents the head node transmit power; τ_{V_i} represents the parameter-upload communication delay of the head node of group i; V_i represents the head node of group i;
in the learning model, the local working nodes train the model task using a mini-batch stochastic gradient descent algorithm; the per-sample loss function of the model is denoted l, and the loss function value F_j(w) of local working node j is defined as

F_j(w) = (1 / |D_j|) · Σ_{s ∈ D_j} l(w; s)

where s represents a sample in the local dataset D_j of the working node; w represents the model parameters of the working node; w_j represents the local model parameters of working node j;
in the learning model, after a working node completes a specified number of rounds of local model training using its local training samples, it uploads its model parameters to the group head node, and the group head node completes the aggregation; the group parameter aggregation operation is expressed as

w_{V_i} = Σ_{j ∈ V_i} (B_j / B_{V_i}) · w_j

after all working nodes within a group have completed the specified number of rounds of model parameter updates using their local training samples, the group head node uploads the group model parameters to the global parameter server, and the global parameter server completes the aggregation of the group model parameters; the global parameter aggregation operation is expressed as

w = Σ_i (B_{V_i} / B) · w_{V_i}
after the global parameter server completes the global parameter aggregation, it broadcasts the updated global parameters to all group head nodes; the group head nodes then broadcast them to all working nodes within their service range, and each working node continues a new round of model training and parameter updating using the latest global model parameters and its local training sample set, recorded as

w_j^t = w_j^{t-1} − η · ∇F_j(w_j^{t-1}; ξ_j^t)

wherein B_j represents the training sample batch size of a working node; B_{V_i} represents the sum of the training sample batches of all working nodes in group i; t represents the training time; i and j respectively denote different working nodes; η represents the learning rate in model training; ∇F_j(w; ξ_j^t) is the gradient of the loss function F_j(w) of the working node model evaluated over the training sample set ξ_j^t; N represents the number of nodes; ξ_j^t represents the batch sample set used by a working node in one round of local iterative training; B represents the total training sample batch size of all working nodes; V_i represents the head node of group i.
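By way of illustration only, the following Python sketch shows one way to compute the batch-size-weighted group and global aggregation operations described in claim 1. The function and variable names (aggregate_group, aggregate_global, group_batch_sums and so on) are hypothetical and do not come from the patent; the weighting by training sample batch size follows the definitions of B_j, B_Vi and B given above.

import numpy as np

def aggregate_group(local_params, batch_sizes):
    # Intra-group aggregation at a group head node: the group model is the
    # B_j / B_Vi weighted average of the local model parameters w_j.
    b_vi = float(sum(batch_sizes))                      # B_Vi: batch sum of the group
    return sum((b_j / b_vi) * w_j for w_j, b_j in zip(local_params, batch_sizes))

def aggregate_global(group_params, group_batch_sums):
    # Global aggregation at the cloud parameter server: the global model is the
    # B_Vi / B weighted average of the group model parameters w_Vi.
    b = float(sum(group_batch_sums))                    # B: total batch size of all nodes
    return sum((b_vi / b) * w_vi for w_vi, b_vi in zip(group_params, group_batch_sums))

# Example with two groups of working nodes and 2-dimensional model parameters.
w = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
g1 = aggregate_group([w[0], w[1]], [32, 64])            # group 1: nodes 0 and 1
g2 = aggregate_group([w[2]], [128])                     # group 2: node 2 only
w_global = aggregate_global([g1, g2], [32 + 64, 128])

In this weighted-average form, nodes and groups that train on more samples contribute proportionally more to the aggregated model.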
2. A resource optimization method for the wireless communication system deploying hierarchical federal learning according to claim 1, wherein the optimization method is an iterative computing method based on joint optimization of communication factors and learning factors, comprising the steps of:
S100, performing initial grouping and parameter initialization for all working nodes, and optimizing the global parameter aggregation frequency; then solving the intra-group parameter aggregation frequency problem by using the global parameter aggregation frequency value obtained in the previous step and the initialization values of the remaining parameters;
s200, distributing bandwidth resources based on the existing parameters;
and S300, re-planning the grouping of all nodes and performing iterative optimization until the algorithm converges, finally obtaining the optimal solution of each variable and the minimum system overhead.
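The steps S100 to S300 describe an alternating optimization loop in which the learning factors (aggregation frequencies), the communication factor (bandwidth allocation) and the grouping are updated in turn until the system overhead converges. The Python skeleton below is a minimal sketch of such a loop under the assumption that each subproblem is solved by a caller-supplied routine; the callback names, the dictionary keys and the convergence test are illustrative and are not taken from the patent.

def iterative_joint_optimize(nodes, solvers, max_iters=50, tol=1e-3):
    # solvers is a dict of subproblem routines supplied by the caller.
    groups = solvers["initial_grouping"](nodes)                 # S100: initial grouping
    params = solvers["init_params"](nodes)                      # S100: parameter initialization
    prev_cost = float("inf")
    cost = prev_cost
    for _ in range(max_iters):
        params["global_freq"] = solvers["global_freq"](groups, params)   # S100: global aggregation frequency
        params["group_freq"] = solvers["group_freq"](groups, params)     # S100: intra-group aggregation frequency
        params["bandwidth"] = solvers["bandwidth"](groups, params)       # S200: bandwidth allocation
        groups = solvers["regroup"](nodes, params)                       # S300: re-plan the grouping
        cost = solvers["system_cost"](groups, params)
        if abs(prev_cost - cost) <= tol:                        # stop once the overhead stops improving
            break
        prev_cost = cost
    return groups, params, cost

Each pass fixes all but one block of variables and re-solves that block, which is why the overall procedure needs an explicit convergence check on the system overhead.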
3. The method of claim 2, wherein the optimization objective function is expressed as

min ρ · (ε / ε_ref) + (1 − ρ) · (T / T_ref)

wherein ρ represents a weighting coefficient; ε and T respectively represent the model convergence error overhead and the model training time overhead, with T denoting the training time; ε_ref and T_ref are reference values of the model convergence error overhead and the model training time overhead, used to unify the two quantities of different scales.
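Assuming the normalized weighted-sum form suggested by the definitions in claim 3 (a single weighting coefficient ρ and two reference values that put the convergence error overhead and the training time overhead on a common scale), a minimal Python sketch of the objective could look as follows; the exact expression used in the patent may differ, and the symbol names are placeholders.

def system_overhead(conv_error, train_time, rho, err_ref, time_ref):
    # Weighted sum of the normalized model convergence error overhead and
    # the normalized model training time overhead.
    return rho * (conv_error / err_ref) + (1.0 - rho) * (train_time / time_ref)

# Example: equal weighting of the two normalized overhead terms.
cost = system_overhead(conv_error=0.08, train_time=120.0, rho=0.5, err_ref=0.1, time_ref=100.0)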
CN202111675427.8A 2021-12-31 2021-12-31 Wireless communication system for deploying hierarchical federal learning and resource optimization method Active CN114363911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111675427.8A CN114363911B (en) 2021-12-31 2021-12-31 Wireless communication system for deploying hierarchical federal learning and resource optimization method

Publications (2)

Publication Number Publication Date
CN114363911A CN114363911A (en) 2022-04-15
CN114363911B true CN114363911B (en) 2023-10-17

Family

ID=81105276

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021169577A1 (en) * 2020-02-27 2021-09-02 山东大学 Wireless service traffic prediction method based on weighted federated learning
CN112817653A (en) * 2021-01-22 2021-05-18 西安交通大学 Cloud-side-based federated learning calculation unloading computing system and method
CN112804107A (en) * 2021-01-28 2021-05-14 南京邮电大学 Layered federal learning method for energy consumption adaptive control of equipment of Internet of things
CN112884163A (en) * 2021-03-18 2021-06-01 中国地质大学(北京) Combined service evaluation method and system based on federated machine learning algorithm and cloud feedback
CN113467952A (en) * 2021-07-15 2021-10-01 北京邮电大学 Distributed federated learning collaborative computing method and system
CN113504999A (en) * 2021-08-05 2021-10-15 重庆大学 Scheduling and resource allocation method for high-performance hierarchical federated edge learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant