CN115357402B

CN115357402B - Intelligent edge optimization method and device

Info

Publication number: CN115357402B
Application number: CN202211282973.XA
Authority: CN
Inventors: 詹玉峰; 王家盛; 齐天宇; 翟弟华; 张元�; 吴楚格; 夏元清
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2023-01-24
Anticipated expiration: 2042-10-20
Also published as: CN115357402A

Abstract

The invention relates to an edge intelligent optimization method and device. According to the edge intelligent optimization method provided by the invention, the current round state of the environment is constructed based on model parameters, the number of rounds of training, communication time, idle CPU occupancy rate and training energy consumption, each edge device participates in federal training according to corresponding round number information in the current round state, acquires information such as local model parameters, communication time, idle CPU utilization rate and training energy consumption, and updates the current round state, so that the environment is transferred to the next state. The edge equipment continuously interacts with the environment, a large amount of track information is generated and used for updating the strategy model until the strategy model converges, different federal training rounds are distributed according to the calculation speed, the training energy consumption and the communication time of each equipment, and therefore the purposes of balancing calculation of isomerism and reduction of energy consumption overhead are achieved.

Description

Intelligent edge optimization method and device

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an edge intelligent optimization method and device based on deep reinforcement learning.

Background

Federal learning is a mechanism for model training to be jointly participated in by multiple parties, which is developed along with the development of artificial intelligence technology in the big data era. Because the user does not need to upload local data to the central server, the user only needs to train the models by using the respective local data under the coordination of the central server and uploads the trained models to the central server for aggregation, the control right of the user on the data is also ensured while a data island is broken, and the privacy protection effect is achieved, so that the traditional centralized training method can be replaced, and the wide application is realized.

Federal training also faces a number of practical problems: the first is the computing heterogeneity of the device, and the second is the limited resource budget of the edge device, such as energy consumption. The devices of which the user side participates in the federal training may be edge devices such as smart phones, computers, raspberry groups and even enterprise monitoring cameras, and the like, and the devices have obvious heterogeneity in the calculation speed, and due to the complexity of the actual use scene of the user, other programs may be operated in the foreground of the devices to occupy the calculation resources, so that the calculation power for background federal training is changed. The computational speed of the edge device is closely related to the performance of federal training, and selecting different edge devices to participate in federal training may result in significant differences in training time. According to the traditional method, the participating equipment is randomly selected from the edge end, so that the problem of falling behind is easily caused, the equipment with the slowest calculation speed restricts the aggregation time of each round of the federal model, and the process of federal training is greatly slowed down. Therefore, how to select participants of each round of federal training according to the calculation speed of the equipment and allocate the proper number of training rounds to the participants is the key for solving the calculation heterogeneous problem. Most devices on the edge side participating in federal training have limited network bandwidth and battery power. How to reduce budget expenses such as energy consumption and the like while ensuring the federal training precision is also an important research direction in federal learning. Conventional solutions assume that these devices are distributed near the communication base station and only participate in federal training when the power supply is switched on, which greatly limits the application scenarios of federal training. Therefore, how to give consideration to the training precision and the energy consumption overhead and save the cost of federal training is also the key for optimizing edge intelligence.

The data-driven modeling method is high in accuracy and calculation efficiency, the data-driven idea is applied to the field of edge intelligence, accumulated training data are analyzed by an effective method, relevant knowledge is extracted and used for guiding federal training, and the method is an important direction for researching edge intelligent optimization problems.

Deep reinforcement learning is an effective method for data-driven modeling, automatic interaction is carried out between a computer and the environment, strategies can be learned from past experiences, and the method is suitable for scenes in which mathematical models are difficult to establish. In recent years, thanks to rapidly increasing computing resources, reinforcement learning is sufficiently developed, and the reinforcement learning is successfully applied to the fields of robot walking control, cloud workflow scheduling, intelligent transportation and the like, and even has excellent performance far beyond human level on computer games.

The optimization problem of the edge intelligence is multi-constraint and multi-target, some works have been to apply deep reinforcement learning to the optimization of the edge intelligence at present, and the deep reinforcement learning has great potential. The work can be roughly divided into two categories, one category is optimized from the angle of computing heterogeneity, and equipment with higher computing speed is selected by utilizing reinforcement learning, so that the time of each round of federal training can be shortened, but the method usually needs great energy consumption expense; and in the other type, from the perspective of saving limited resources such as energy consumption, the energy-saving equipment participation scheme is selected by using reinforcement learning, so that the total budget expenditure can be reduced, but the problem of edge intelligence in computing heterogeneous is ignored, and long training time is often needed. At present, only a few leading-edge works comprehensively consider the problems of computing heterogeneity, energy consumption and the like, but a great improvement space is provided on the utilization rate of computing resources. Therefore, the method is designed to take account of both heterogeneous calculation and energy consumption overhead, and meanwhile, the calculation power of the edge equipment can be fully utilized, so that the Federal training performance is improved, and the method has important significance for optimizing the performance of edge intelligence.

Disclosure of Invention

The invention aims to provide an edge intelligent optimization method and device capable of considering both calculation of heterogeneous and energy consumption overhead, and further, the calculation power of edge equipment can be fully utilized, and the performance of federal training is improved.

In order to achieve the purpose, the invention provides the following scheme:

an edge intelligent optimization method comprises the following steps:

step 100: acquiring a central model and a strategy model, and appointing a global training parameter; the central model and the policy model are hosted in a central server; the global training parameters include: total number of edge devices, threshold time, batch size, and training rounds;

step 101: determining edge equipment participating in the current round of training based on the number of the training rounds to obtain a participating equipment set;

step 102: obtaining a local data sample;

step 103: the edge devices in the participating device set receive the central model and the training round number, and update parameters of a local model by the batch size by using the local data samples under the condition that the threshold time is met; the local model is implanted in the edge device;

step 104: collecting local information, and constructing the current round state of the environment based on the local information; the current round of states of the environment include: parameters of a local model, communication time, CPU utilization rate and training energy consumption;

step 105: updating the current round state of the environment, and aggregating the central model based on the parameters of the local model in the current round state of the updated environment and the local data samples to obtain an aggregated central model;

step 106: determining an accuracy of the aggregated central model;

step 107: determining a return value of the strategy model according to the accuracy of the aggregation central model, the communication time in the current state of the updated environment and the training energy consumption in the current state of the updated environment;

step 108: generating a normal distribution for each edge device participating in the training of the current round by using the strategy model according to the updated current round state of the environment;

step 109: sampling the normal distribution to obtain new training round number distribution information, and returning to the step 103 until the threshold time is exceeded, and obtaining decision trajectory information; the decision trajectory information comprises a plurality of decision trajectories; each of the decision trajectories includes: the current round state of the environment, the return value of the strategy model and the number of training rounds;

step 110: and updating the strategy model by using the decision track information, and returning to the execution step 100 to obtain the optimization model of the federal training until the updated strategy model converges to the optimal solution.

Preferably, the determining, based on the number of training rounds, the edge devices participating in the current round of training to obtain a participating device set specifically includes:

distributing corresponding training rounds to the edge equipment based on the training rounds;

when the number of training rounds allocated to the edge device is 0, the edge device does not participate in the training round; when the number of training rounds distributed to the edge equipment is not 0, the edge equipment participates in the training of the current round according to the distributed number of training rounds;

and acquiring edge devices participating in the current round of training to generate the participating device set.

Preferably, after obtaining the central model and the policy model, the method further comprises: and initializing the central model and the strategy model.

Preferably, the determining the accuracy of the aggregated central model specifically includes:

acquiring a test set;

determining the accuracy of the aggregated central model using a test set.

Preferably, the aggregate central model is:

；

in the formula (I), the compound is shown in the specification,

is as followstThe aggregate central model of round +1,

is as followsiThe data samples of the individual edge devices are,

is as followsiThe number of data samples of each edge device, D is the sum of the number of data samples of all edge devices,

n denotes the total number of edge devices,

is the t-th wheeliThe parameters of the local model of the individual edge devices,Q _t the number of edge devices in the participating device set for the t-th round.

Preferably, the return value of the policy model is:

；

in the formula (I), the compound is shown in the specification,

for the return value of the t-th round policy model,

aggregating the accuracy of the central model for the t-th round,

to aggregate the accuracy of the central model for round t-1,

is the t-th wheeliThe communication time of the individual edge devices,

is the t-th wheeliThe training energy consumption of each edge device,

is a first weight coefficient of the first weight coefficient,

is a second weight coefficient, and is,

is a third weight coefficient, and is,Q _t the number of edge devices in the participating device set for the t-th round.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

according to the edge intelligent optimization method provided by the invention, the local round state of the environment is constructed based on model parameters, the number of rounds of training, communication time, idle CPU occupancy rate and training energy consumption, each edge device participates in federal training according to the corresponding round number information in the local round state, acquires the information of local model parameters, communication time, idle CPU utilization rate, training energy consumption and the like, and updates the local round state, so that the environment is transferred to the next state. The edge equipment continuously interacts with the environment, a large amount of track information is generated and used for updating the strategy model until the strategy model converges, different federal training rounds are distributed according to the calculation speed, the training energy consumption and the communication time of each equipment, and therefore the purposes of balancing calculation of isomerism and reduction of energy consumption overhead are achieved.

The invention also provides an edge intelligent optimization device, which comprises: a central server and an edge device;

the central server and the edge equipment perform information interaction;

a central model and a strategy model are implanted into the central server; the central server is used for appointing global training parameters, determining edge equipment participating in the current round of training based on the number of training rounds, and obtaining a participating equipment set; the global training parameters include: the total number of edge devices, threshold time, batch size, and training round number;

a local model is implanted in the edge device; the edge devices in the participating device set receive the central model and the training round number in the central server, and update parameters of a local model in the batch size by using local data samples under the condition that the threshold time is met;

the central server is used for acquiring local information and constructing the current state of the environment based on the local information; the current round of states of the environment include: parameters of a local model, communication time, CPU utilization rate and training energy consumption;

the central server is used for updating the current state of the environment and aggregating the central model based on the parameters of the local model in the current state of the updated environment and the local data samples to obtain an aggregated central model;

the central server is used for acquiring a test set and determining the precision of the aggregation central model by adopting the test set;

the central server is used for determining a return value of the strategy model according to the precision of the aggregation central model, the communication time in the current state of the environment after updating and the training energy consumption in the current state of the environment after updating;

the central server is used for generating a normal distribution for each edge device participating in the current training by utilizing the strategy model according to the updated current state of the environment;

the central server is used for sampling the normal distribution to obtain new training round number distribution information and sending the obtained new training round number distribution information to the edge equipment in the participating equipment set, and after the edge equipment in the participating equipment set receives the central model and the new training round number, parameters of a local model are updated by the local data samples in batch size under the condition of meeting the threshold time until the threshold time is exceeded, and decision trajectory information is obtained; the decision track information comprises a plurality of decision tracks; each of the decision trajectories includes: the current round state of the environment, the return value of the strategy model and the number of training rounds;

and the central server is used for updating the strategy model by using the decision track information, training the updated strategy model as a new strategy model, and obtaining an optimized model of federal training until the updated strategy model converges to an optimal solution.

Preferably, the edge device is a raspberry pi, a smartphone, a computer, or a surveillance camera.

Since the technical effect achieved by the edge intelligent optimization device provided by the invention is the same as that achieved by the edge intelligent optimization method provided by the invention, the details are not repeated herein.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

FIG. 1 is a diagram illustrating the steps of an edge intelligent optimization method provided by the present invention;

fig. 2 is an implementation schematic diagram of the edge intelligent optimization device provided by the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.

As shown in fig. 1, the edge intelligent optimization method provided by the present invention includes:

step 100: and acquiring a central model and a strategy model, and appointing a global training parameter. The central model and the policy model are hosted in a central server. The global training parameters include: total number of edge devices, threshold time, batch size, and training rounds.

Step 101: and determining the edge equipment participating in the current round of training based on the number of the training rounds to obtain a participating equipment set. Specifically, the method comprises the following steps:

and distributing corresponding training round numbers for the edge equipment based on the training round numbers.

When the number of training rounds allocated to an edge device is 0, the edge device does not participate in the training round. When the number of training rounds allocated to the edge device is not 0, the edge device participates in the training round according to the number of the allocated training rounds.

And acquiring the edge devices participating in the current round of training to generate the participating device set.

Step 102: local data samples are obtained.

Step 103: and the edge devices in the participating device set receive the central model and the training round number, and update the parameters of the local model by the batch size by using the local data samples under the condition that the threshold time is met. The local model is implanted in the edge device.

Step 104: local information is collected, and the current round state of the environment is constructed based on the local information. The current round of states of the environment include: parameters of the local model, communication time, CPU utilization, and training energy consumption.

Step 105: and updating the current round state of the environment, and aggregating the central model based on the parameters of the local model in the current round state of the updated environment and the local data samples to obtain an aggregated central model. Wherein the aggregate central model is:

；

in the formula (I), the compound is shown in the specification,

the aggregated central model for round t +1,

is as followsiThe data samples of the individual edge devices are,

and N represents the total number of edge devices,

is the t-th wheeliParameters of a local model of the edge device.

Step 106: determining an accuracy of the aggregated central model. Specifically, the method comprises the following steps:

and acquiring a test set.

Determining the accuracy of the aggregated central model using a test set.

Step 107: determining a return value of the policy model according to the accuracy of the aggregated central model, the updated communication time in the current state of the environment, and the updated training energy consumption in the current state of the environment. Wherein, the return value of the strategy model is as follows:

；

in the formula (I), the compound is shown in the specification,

is the t-th wheelThe value of the return of the policy model,

aggregating the accuracy of the central model for the t-th round,

to aggregate the accuracy of the central model for round t-1,

is the t-th wheeliThe communication time of the individual edge devices,

is the t-th wheeliThe training energy consumption of each edge device,

is a first weight coefficient of the first weight coefficient,

is a second weight coefficient, and is,

Step 108: and generating a normal distribution for each edge device participating in the training of the current round by using the strategy model according to the updated current round state of the environment.

Step 109: and sampling the normal distribution to obtain new training round number distribution information, and returning to the step 103 until the threshold time is exceeded, and obtaining decision trajectory information. The decision trajectory information comprises a plurality of decision trajectories. Each of the decision trajectories includes: the current round state of the environment, the return value of the strategy model and the number of training rounds.

Step 110: and updating the strategy model by using the decision track information, and returning to the step 100 until the updated strategy model converges to the optimal solution, thereby obtaining the optimized model of the federal training.

In order to further improve the training accuracy, after the central model and the strategic model are obtained in step 100, the method for edge intelligent optimization provided by the invention further comprises: and initializing the central model and the strategy model.

The present invention also provides an edge intelligent optimization device, as shown in fig. 2, the device includes: a central server and edge devices.

And the central server performs information interaction with the edge equipment.

And a central model and a strategy model are implanted in the central server. And the central server is used for appointing global training parameters, determining the edge equipment participating in the current round of training based on the number of the training rounds, and obtaining a participating equipment set. The global training parameters include: total number of edge devices, threshold time, batch size, and training rounds.

The edge device has a local model embedded therein. And the edge devices in the participating device set receive the central model and the training round number in the central server, and update the parameters of the local model by the batch size by using the local data samples under the condition of meeting the threshold time.

The central server is used for collecting local information and constructing the current round state of the environment based on the local information. The current round of states of the environment include: parameters of the local model, communication time, CPU utilization, and training energy consumption.

The central server is used for updating the current state of the environment, and aggregating the central model based on the parameters of the local model in the current state of the updated environment and the local data samples to obtain an aggregated central model.

The central server is used for obtaining the test set and determining the accuracy of the aggregation central model by adopting the test set.

The central server is used for determining a return value of the strategy model according to the accuracy of the aggregation central model, the updated communication time in the current state of the environment and the updated training energy consumption in the current state of the environment.

And the central server is used for generating a normal distribution for each edge device participating in the current training by using the strategy model according to the updated current state of the environment.

The central server is configured to sample the normal distribution to obtain new training round number distribution information, and send the obtained new training round number distribution information to the edge device in the participating device set, where after the edge device in the participating device set receives the central model and the new training round number, the edge device updates parameters of a local model in the batch size by using the local data sample under the condition that the threshold time is met, and obtains decision trajectory information until the threshold time is exceeded. The decision trajectory information includes a plurality of decision trajectories. Each of the decision trajectories includes: the current round state of the environment, the return value of the strategy model and the number of training rounds.

The adopted edge equipment can be a raspberry pi, a smart phone, a computer or a monitoring camera.

The following describes a specific implementation process of the above-mentioned edge intelligent optimization method and apparatus by taking a raspberry pi as an edge device as an example.

As shown in fig. 2, the edge intelligent optimization apparatus provided in this embodiment is divided into two parts, one part is a central server located on the left side in fig. 2 and is served by a desktop computer, and the other part is an edge device on the right side and is composed of a plurality of raspberry groups, and the representation meaning of each symbol in fig. 2 is as follows:

Ntotal number of edge devices (e.g., raspberry pies) for federal learning.BLot size used for federal training.

Is the threshold time.EThe vectors formed by training rounds are distributed for different raspberries to meet the requirement

Wherein

Denotes the firstiTraining round number information of each raspberry pie, wherein the value of the training round number information is not more than a threshold valueMIs a natural number of (1).WRepresenting a model parameter matrix, satisfy

In which

Denotes the firstiModel parameters for individual raspberry pies.

Represents a communication time vector, satisfies

In which

Is shown asiThe time taken for each raspberry to communicate, including the sum of the up and down times.UA vector formed by the utilization rate of the CPU in idle time is defined as

Wherein

Is shown asiCPU utilization (idle utilization) of individual raspberry groups when not participating in federal training.PTo train energy consumption vectors, satisfy

Wherein

Denotes the firstiThe training total energy consumption of the raspberry pie comprises calculation energy consumption and communication energy consumption.vRepresenting the test accuracy of the central model on the test set. In addition, in order to represent information between different numbers of rounds, subscripts are introducedtTo distinguish, e.g.

Respectively representtModel parameter matrix of the wheel, firsttWheel firstiEnergy consumption of a raspberry pietAccuracy of the wheel center model.

The basic idea of the embodiment is as follows: and (3) constructing a reinforcement learning model at the central server end, constructing a deep reinforcement learning environment at the edge equipment end, and continuously interacting the model and the environment to learn an optimal training round number distribution scheme. Specifically, the parameters of the model collected in the round of dispatching the raspberry

Number of rounds of training

Communication time

And idle CPU occupancy rate

And training energy consumption

Current round state modeled as an environment

I.e. by

. The number of training rounds allocated to the device is defined as the actions of the raspberry pie

. Precision of two adjacent wheel central model

Local communication time

And communication energy consumption

Is used to construct a merit function (i.e., a return value) that is fed back to the raspberry pie

Satisfy the following requirements

. Strategy model of raspberry pie

State information

As input, output is number of training rounds

. Each raspberry group will be based on

The corresponding round number information in (1) participates in the federal training and collects the local model parameters

Communication time

Idle CPU utilization

And training energyConsumption unit

The information is uploaded to a central server, and model parameters are updated

Communication time

Energy consumption of communication

And idle CPU occupancy

Make the environment shift to the next state

. The raspberry pie continuously interacts with the environment to generate a large amount of track information

For policy models

Until the policy model is updated

And (6) converging.

The optimization method provided by the embodiment specifically includes the following steps:

step 1, initializing a central model

And a policy model

Specifying Total number of Raspberry pies for Federal learning of Global training parametersNBatch size for federal trainingBTime of threshold

Vector formed by training rounds with different raspberry groups

。

Step 2, according to the training round number vector

Allocating corresponding training round number for each raspberry group, and if the allocated training round number is

Then it is firstiThe raspberry pie participates in the training round and carries out

Iteration of rounds, if number of training rounds assigned

Then represents the firstiThe raspberry pies do not participate in the federate training round, so that the participating device sets of the round can be determined

。

Step 3, participating in the equipment set in the training process of the t round

Middle raspberry pi receiving central model

And number of rounds information

At the time of satisfying the threshold

Under the condition of (1), using the local dataSample(s)

In batch sizeBUpdating local models

And collecting local information

And uploading to a central server, and updating the local model by using the formula (1).

（1）

Wherein the content of the first and second substances,

the number of samples, which are sampled samples of the local data set,

are the parameters of the local model and are,

for the value of the loss function for that sample,

for learning rate, B =1,2.

Step 4, the central server receives the information uploaded by the raspberry group and updates

And obtaining an aggregate central model by aggregating the central models by using the formula (2)

Evaluating aggregated central models on a test set

Extract of (1)Degree of rotation

And calculating a return value according to the formula (3)

For evaluating a policy model

Good or bad.

（2）

（3）

Wherein, the first and the second end of the pipe are connected with each other,

is shown asiA data sample is sent to each raspberry

The number of the (c) component(s),

representing the total number of samples on all raspberry legs,

are all weight coefficients.

Step 5, reinforcement learning raspberry group according to the state

Using a policy model

Generating a normal distribution for each device, and generating new round number distribution information by sampling each normal distribution

。

Repeating the step 2~5 for a plurality of times until a time threshold is exceeded

Saving decision trajectories of raspberry pies

。

Step 6, the raspberry pi utilizes a plurality of pieces of track information according to the algorithm in the formula (4)

Updating a policy model

。

（4）

representing updated policy models

Is determined by the parameters of (a) and (b),

representing a policy model

The parameter(s) of (a) is,

respectively representing the length and the number of tracks,l=1,2,...,L，m=1,2,...,n，

a discount factor is indicated in the form of a discount factor,xis shown astThe length of the track of the wheel is,

respectively representjOn the strip tracktThe status, actions and rewards of the wheels,

representing corresponding cumulative discount returns, baseline

Is shown astWhile in turnjThe average discount return for the bar track is,

which represents the operation of the assignment of the value,

is a gradient operator.

And repeating all the steps until the strategy model of the raspberry pi converges to the optimal solution, and obtaining the optimized model of the federal training.

Based on the above description, compared with the prior art, the edge intelligent optimization method and apparatus provided by the present invention further have the following advantages:

1. the invention solves the optimization problem of multiple targets and constraints by using a deep reinforcement learning method. The deep reinforcement learning can automatically interact with the edge intelligence, can automatically learn and generate an optimal scheme, does not need a complex mathematical modeling process, and provides a new idea and a new way for optimizing the federal training process.

2. According to the invention, different training rounds are distributed to the equipment with different calculation speeds, the calculation heterogeneous problems among the equipment are skillfully balanced, the calculation power of the equipment can be fully utilized, the training speed of the global model is improved, and a new attempt is made for federal learning and deployment in a practical environment.

3. The method can save the energy consumption expense of the edge equipment without influencing the training speed and precision of the model, can improve the economic benefit of the federal training and ensure the sustainability of the federal training, thereby further meeting the requirements of intelligent multi-target optimization of the edge.

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. An edge intelligent optimization method is characterized by comprising the following steps:

step 100: acquiring a central model and a strategy model, and specifying a global training parameter; the central model and the policy model are hosted in a central server; the global training parameters include: total number of edge devices, threshold time, batch size, and training rounds;

step 102: obtaining a local data sample;

step 106: determining an accuracy of the aggregated central model;

step 109: sampling the normal distribution to obtain new training round number distribution information, and returning to the step 103 until the threshold time is exceeded, and obtaining decision trajectory information; the decision track information comprises a plurality of decision tracks; each of the decision trajectories includes: the current round state of the environment, the return value of the strategy model and the number of training rounds;

2. The edge intelligent optimization method according to claim 1, wherein the determining the edge devices participating in the current round of training based on the number of training rounds to obtain a participating device set specifically includes:

3. The edge intelligent optimization method according to claim 1, further comprising, after obtaining the central model and the strategic model: and initializing the central model and the strategy model.

4. The edge intelligent optimization method according to claim 1, wherein the determining the accuracy of the aggregated central model specifically comprises:

acquiring a test set;

determining the accuracy of the aggregated central model using a test set.

5. The edge intelligent optimization method according to claim 1, wherein the aggregate central model is:

；

in the formula (I), the compound is shown in the specification,

is a firsttThe aggregate central model of round +1,

is a firstiThe data samples of the individual edge devices are,

is as followsiThe number of data samples of each edge device, D is the data samples of all edge devicesThe sum of the amounts of (a) and (b),

and N represents the total number of edge devices,

6. The edge intelligent optimization method according to claim 1, wherein the return values of the policy model are:

；

in the formula (I), the compound is shown in the specification,

for the return value of the t-th round policy model,

the accuracy of the central model is aggregated for the t-th round,

to aggregate the accuracy of the central model for round t-1,

is the t-th wheeliThe communication time of each of the edge devices,

is the t-th wheeliThe training energy consumption of each edge device,

is a first weight systemThe number of the first and second groups is,

is a second weight coefficient, and is,

7. An edge intelligence optimization device, comprising: a central server and an edge device;

the central server and the edge equipment perform information interaction;

a local model is implanted in the edge device; the edge devices in the participating device set receive the central model and the number of training rounds in the central server, and update parameters of the local model in the batch size by using local data samples under the condition that the threshold time is met;

the central server is used for collecting local information and constructing the current round state of the environment based on the local information; the current round of states of the environment include: parameters of a local model, communication time, CPU utilization rate and training energy consumption;

8. The intelligent edge optimization device of claim 7, wherein the edge device is a raspberry pi, a smartphone, a computer, or a surveillance camera.