CN115358831A - User bidding method and device based on multi-agent reinforcement learning algorithm under federated learning - Google Patents
- Publication number
- CN115358831A (application number CN202211120985.2A)
- Authority
- CN
- China
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/08—Auctions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention discloses a user bidding method and device based on a multi-agent reinforcement learning algorithm under federated learning. The method includes: acquiring a learning task published by a federated learning platform; sample clients uploading bidding information to the federated platform using a reinforcement learning algorithm; the platform selecting sample clients through an algorithm and delivering the global shared model to the selected sample clients; the selected sample clients performing local training and uploading updated parameters; and the platform aggregating the uploaded updated model parameters according to an aggregation algorithm and updating the model parameters in the global model, thereby completing the learning task published by the federated learning platform. The method enables dynamic bidding by federated learning participants while alleviating model overfitting, and solves the problems of existing auction-based incentive mechanisms in which, once a user submits a bidding strategy, the strategy does not change during subsequent training, leading to a lack of fairness in federated learning and to model overfitting.
Description
Technical Field
The present invention relates to the technical field of artificial intelligence, and in particular to a user bidding method and device based on a multi-agent reinforcement learning algorithm under federated learning.
Background
With users' increasing attention to privacy and the introduction of related policies, it is becoming more and more difficult for traditional machine learning to collect data for centralized training. Federated learning has become one of the most promising deep learning paradigms because it protects user privacy by not requiring users to upload raw data. However, participants in federated learning consume large amounts of computing, communication, and other resources during training, which means that self-interested participants will not wholeheartedly take part in learning tasks without sufficient reward. At the same time, because the underlying network structure of federated learning is complex and node resources are limited and heterogeneous, a federation initiator without appropriate incentive and selection measures will incur huge communication overhead. These problems not only waste network resources but also hinder the adoption of federated learning.
In the incentive mechanisms of related technologies, game-theoretic techniques can be used to select participants and distribute profits, for example by incorporating an auction method into federated learning. In one implementation, a lightweight, multi-dimensional incentive scheme selects high-quality participants; in another, an incentive-mechanism framework integrates participants' learning quality into federated learning for quality-aware incentives and model aggregation. However, existing auction-based incentive mechanisms are almost all static: they assume that once a participant has determined its strategy, the participant does not change that strategy in response to changes in the platform's behavior. Such approaches maximize only the utility of the platform or of social welfare, rather than jointly maximizing the utility of the platform and the participants. Concretely, in a federated learning auction, once a participant has determined its bid, the bid does not change in subsequent training; whether selected or not, the participant can only wait to be chosen. Under such a mechanism, even a participant that loses the auction cannot change its strategy, so resource-rich participants are always selected while resource-poor but honest participants are never selected. This not only undermines the fairness of federated learning and dampens participants' enthusiasm, but also, because the same clients are repeatedly selected, reduces data diversity and may cause the model to overfit. Furthermore, existing dynamic bidding methods assume that user information is transparent, i.e., that each user knows the private information of the other users, which is impossible in practical applications.
Summary of the Invention
The present invention provides a user bidding method and device based on a multi-agent reinforcement learning algorithm under federated learning. By introducing multi-agent reinforcement learning into the incentive mechanism of federated learning, it solves the problem in the prior art that auction-based incentive mechanisms lead to a lack of fairness in federated learning because bidding strategies do not change during subsequent training. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a user bidding method based on a multi-agent reinforcement learning algorithm under federated learning, the method comprising:
acquiring a learning task published by a federated learning platform, selecting sample clients from a set of clients participating in federated learning based on the learning task and the bidding information uploaded by the set of clients, and delivering a global shared model to the sample clients;
receiving updated model parameters uploaded by each sample client, the updated model parameters being formed by the sample client using a multi-agent reinforcement learning algorithm, before training starts, to output the bidding information to be submitted by the sample client in the current round and, after being selected, training the global shared model according to the configuration in the bidding information to be submitted;
aggregating the updated model parameters uploaded by the sample clients, and using the aggregated updated model parameters to update the model parameters in the global shared model;
if the updated global shared model reaches a preset model accuracy on a test task, determining that the learning task published by the federated learning platform is completed; otherwise, repeating the step of updating the model parameters in the global shared model for multiple rounds until the updated global shared model reaches the preset model accuracy on the test task.
Optionally, the process in which the sample client uses the multi-agent reinforcement learning algorithm to output the bidding information to be submitted by the sample client in the current round includes:
taking the sample client as an agent, the agent observing its own historical state information in the federated learning environment, and using the historical state information to output the bidding information to be submitted by the sample client in the current round.
Optionally, the multi-agent reinforcement learning algorithm includes a strategist and an experience pool, and the taking the sample client as an agent, the agent observing its own historical state information in the federated learning environment and using the historical state information to output the bidding information to be submitted by the sample client in the current round includes:
taking the sample client as an agent, and using the experience pool in the multi-agent reinforcement learning algorithm to store the historical task state information observed by each agent in the federated learning environment, the historical task state information including at least whether the agent was selected in historical rounds, the historical resource values, the historical amount of data provided, and the historical per-unit resource amount;
inputting the historical task state information observed by the agent in the federated learning environment, as the agent's state information in the current round, into the strategist of the multi-agent reinforcement learning algorithm, and outputting the bidding information to be submitted by the agent in the current round.
Optionally, after the inputting the historical task state information observed by the agent in the federated learning environment, as the agent's state information in the current round, into the strategist of the multi-agent reinforcement learning algorithm and outputting the bidding information to be submitted by the agent in the current round, the method further includes:
calculating the revenue resources fed back by the federated learning environment to the agent in the current round, and using the experience pool in the multi-agent reinforcement learning algorithm to store the historical state of the environment observed by the agent in the current round, the bidding information to be submitted, the state of the environment after the bidding information to be submitted is uploaded, and the revenue resources fed back to the agent by the federated learning environment for the bidding information uploaded in the current round.
Optionally, the calculating the revenue resources fed back by the federated learning environment to the agent in the current round includes:
based on the bidding information to be uploaded by the agent in the current round, respectively obtaining the resource parameters involved in the agent's bidding process;
inputting the resource parameters involved in the agent's bidding process into a pre-built revenue function, to obtain the revenue resources fed back by the federated learning environment to the agent in the current round.
Optionally, each sample client is configured with a strategist, the strategist includes an action network and a value network, and the inputting the historical task state information observed in the federated learning environment, as the agent's state information in the current round, into the strategist of the multi-agent reinforcement learning algorithm and outputting the bidding information to be submitted by the agent in the current round includes:
inputting the historical task state information observed by the agent in the federated learning environment, as the agent's state information in the current round, into the action network of the strategist, and outputting the bidding information to be submitted by the agent in the current round, to obtain the bidding information to be uploaded by the agent in the current training round;
inputting the agent's state information in the current round and the bidding information to be uploaded by the agent in the current round into the value network of the strategist, and evaluating the bidding information to be uploaded to obtain an evaluation score of the bidding information to be uploaded;
wherein the action network is trained using the evaluation score of the bidding information to be uploaded, the network parameters of the action network are updated by gradient ascent, the value network is trained using the evaluation score of the bidding information to be uploaded and the revenue resources actually fed back to the agent, and the network parameters of the value network are updated by the temporal-difference method.
Optionally, the aggregating the updated model parameters uploaded by the sample clients and using the aggregated updated model parameters to update the model parameters in the global shared model includes:
respectively calculating the ratio of each sample client's data volume to the data volume of all sample clients, to obtain the data-volume proportion corresponding to each sample client;
multiplying the data-volume proportion corresponding to each sample client by the updated model parameters uploaded by that sample client, aggregating the updated model parameters corresponding to all sample clients, and updating the model parameters in the global shared model by accumulating the aggregated updated model parameters.
In a second aspect, an embodiment of the present invention provides a user bidding device based on a multi-agent reinforcement learning algorithm under federated learning, the device comprising:
an acquiring unit, configured to acquire a learning task published by a federated learning platform, select sample clients from a set of clients participating in federated learning based on the learning task and the bidding information uploaded by the set of clients, and deliver a global shared model to the sample clients;
a receiving unit, configured to receive updated model parameters uploaded by each sample client, the updated model parameters being formed by the sample client using a multi-agent reinforcement learning algorithm, before training starts, to output the bidding information to be submitted by the sample client in the current round and, after being selected, training the global shared model according to the configuration in the bidding information to be submitted;
an aggregation unit, configured to aggregate the updated model parameters uploaded by the sample clients, and use the aggregated updated model parameters to update the model parameters in the global shared model;
a selection unit, configured to determine that the learning task published by the federated learning platform is completed if the updated global shared model reaches a preset model accuracy on a test task, and otherwise repeat the step of updating the model parameters in the global shared model for multiple rounds until the updated global shared model reaches the preset model accuracy on the test task.
Optionally, the device further includes:
an output unit, used in the process in which the sample client uses the multi-agent reinforcement learning algorithm to output the bidding information to be submitted by the sample client in the current round;
the output unit being specifically configured to take the sample client as an agent, the agent observing its own historical state information in the federated learning environment and using the historical state information to output the bidding information to be submitted by the sample client in the current round.
Optionally, the multi-agent reinforcement learning algorithm includes a strategist and an experience pool, and the output unit includes:
a storage module, configured to take the sample client as an agent and use the experience pool in the multi-agent reinforcement learning algorithm to store the historical task state information observed by each agent in the federated learning environment, the historical task state information including at least whether the agent was selected in historical rounds, the historical resource values, the historical amount of data provided, and the historical per-unit resource amount;
an output module, configured to input the historical task state information observed by the agent in the federated learning environment, as the agent's state information in the current round, into the strategist of the multi-agent reinforcement learning algorithm, and output the bidding information to be submitted by the agent in the current round.
Optionally, the output unit further includes:
a calculation module, configured to, after the historical task state information observed by the agent in the federated learning environment is input, as the agent's state information in the current round, into the strategist of the multi-agent reinforcement learning algorithm and the bidding information to be submitted by the agent in the current round is output, calculate the revenue resources fed back by the federated learning environment to the agent in the current round, and use the experience pool in the multi-agent reinforcement learning algorithm to store the historical state of the environment observed by the agent in the current round, the bidding information to be submitted, the state of the environment after the bidding information to be submitted is uploaded, and the revenue resources fed back to the agent by the federated learning environment for the bidding information uploaded in the current round.
Optionally, the calculation module is specifically configured to, based on the bidding information to be uploaded by the agent in the current round, respectively obtain the resource parameters involved in the agent's bidding process;
the calculation module is further specifically configured to input the resource parameters involved in the agent's bidding process into a pre-built revenue function, to obtain the revenue resources fed back by the federated learning environment to the agent in the current round.
Optionally, each sample client is configured with a strategist, the strategist includes an action network and a value network, and the output module is specifically configured to input the historical task state information observed by the agent in the federated learning environment, as the agent's state information in the current round, into the action network of the strategist and output the bidding information to be submitted by the agent in the current round, to obtain the bidding information to be uploaded by the agent in the current training round;
the output module is further specifically configured to input the agent's state information in the current round and the bidding information to be uploaded by the agent in the current round into the value network of the strategist, and evaluate the bidding information to be uploaded to obtain an evaluation score of the bidding information to be uploaded;
wherein the action network is trained using the evaluation score of the bidding information to be uploaded, the network parameters of the action network are updated by gradient ascent, the value network is trained using the evaluation score of the bidding information to be uploaded and the revenue resources actually fed back to the agent, and the network parameters of the value network are updated by the temporal-difference method.
Optionally, the aggregation unit includes:
a calculation module, configured to respectively calculate the ratio of each sample client's data volume to the data volume of all sample clients, to obtain the data-volume proportion corresponding to each sample client;
an aggregation module, configured to multiply the data-volume proportion corresponding to each sample client by the updated model parameters uploaded by that sample client, aggregate the updated model parameters corresponding to all sample clients, and update the model parameters in the global shared model by accumulating the aggregated updated model parameters.
In a third aspect, an embodiment of the present invention provides a storage medium on which executable instructions are stored, and when the instructions are executed by a processor, the processor implements the method described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a device for user bidding based on a multi-agent reinforcement learning algorithm under federated learning, comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in the first aspect.
As can be seen from the above, in the user bidding method and device based on a multi-agent reinforcement learning algorithm under federated learning provided by the embodiments of the present invention, a learning task published by a federated learning platform is acquired; sample clients are selected from a set of clients participating in federated learning based on the learning task and the bidding information uploaded by the set of clients, and the global shared model is delivered to the sample clients; the updated model parameters uploaded by each sample client are received, the updated model parameters being formed by the sample client using a multi-agent reinforcement learning algorithm, before training starts, to output the bidding information to be submitted in the current round and, after being selected, training the global shared model according to the configuration in that bidding information; the updated model parameters uploaded by the sample clients are then aggregated, and the aggregated updated model parameters are used to update the model parameters in the global shared model; if the updated global shared model reaches a preset model accuracy on a test task, it is determined that the learning task published by the federated learning platform is completed, otherwise the step of updating the model parameters in the global shared model is repeated for multiple rounds until the updated global shared model reaches the preset model accuracy on the test task. Therefore, compared with the auction-based incentive mechanisms of the prior art, the embodiments of the present invention can use a multi-agent learning system to adjust the bidding information uploaded by clients, thereby solving the problem that prior-art auction-based incentive mechanisms lead to a lack of fairness in federated learning because bidding strategies do not change during subsequent training.
In addition, the technical effects that can be achieved by this embodiment include:
(1) The bidding information uploaded by a client is adjusted based on the multi-agent reinforcement learning algorithm, so as to increase the probability of the client being selected, ensure the fairness of participants in federated learning, solve the suboptimality caused by fixed strategies, and achieve the goal of jointly maximizing the utility of the federated learning platform and the participants.
(2) The multi-agent reinforcement learning algorithm adopts centralized training and distributed execution, so that the clients corresponding to the participants can observe more states, improving the stability of the agent training process.
(3) The multi-agent reinforcement learning algorithm adopts an asynchronous deep reinforcement learning training scheme, decoupling the execution of the learning task from the updating of the bidding information so that the two can proceed in parallel, which accelerates model training.
Of course, implementing any product or method of the present invention does not necessarily require achieving all of the advantages described above at the same time.
Brief Description of the Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a user bidding method based on a multi-agent reinforcement learning algorithm under federated learning provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the multi-agent reinforcement learning algorithm outputting the bidding information to be submitted by a sample client in the current round, provided by an embodiment of the present invention;
Fig. 3 is a block diagram of a user bidding device based on a multi-agent reinforcement learning algorithm under federated learning provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that the terms "comprising" and "having" and any variations thereof in the embodiments of the present invention and in the drawings are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
The present invention provides a user bidding method and device based on a multi-agent reinforcement learning algorithm under federated learning, which uses a multi-agent learning system to adjust the bidding information uploaded by clients, thereby solving the problem in the prior art that auction-based incentive mechanisms lead to a lack of fairness in federated learning because bidding strategies do not change during subsequent training. Traditional auction techniques require the private information of participants, and the entire auction process is static; that is, the bids of participants are fixed, and the bidding information uploaded by a client is not adjusted even after a bid fails. As a result, participants cannot dynamically change their bids, federated learning lacks fairness, resource-poor participants are rarely selected by the federated learning platform, and the participants' resources are greatly wasted. Such auction mechanisms merely maximize the utility of the federated learning platform or of social welfare, and fail to jointly maximize the utility of the platform and the participants. The embodiments of the present invention introduce a multi-agent reinforcement learning algorithm into the incentive mechanism of federated learning and adjust the bidding information uploaded by a client based on the multi-agent reinforcement learning algorithm, so as to increase the probability that the client corresponding to a participant is selected, reduce the aggregation time, ensure the fairness of participants in federated learning, solve the suboptimality caused by fixed strategies, and achieve the goal of jointly maximizing the utility of the federated learning platform and the participants.
The embodiments of the present invention are described in detail below.
Fig. 1 is a schematic flowchart of a user bidding method based on a multi-agent reinforcement learning algorithm under federated learning provided by an embodiment of the present invention. The method may include the following steps:
S100: acquire a learning task published by a federated learning platform, select sample clients from a set of clients participating in federated learning based on the learning task and the bidding information uploaded by the set of clients, and deliver a global shared model to the sample clients.
The learning task published by the federated learning platform is issued by the server corresponding to the federation publisher. The learning task is applicable to various application scenarios involving data collection and training, such as object recognition tasks and data classification tasks. Since the federated learning process trains the global shared model on selected clients that hold data, selecting high-quality clients to update the model parameters of the global shared model improves the effect of the learning task. To ensure that the server selects suitable clients for model training, each client uploads bidding information, and the server then selects clients based on the learning task.
Here the bidding information consists of the computing resources, the data volume, and the bid resources. Specifically, when selecting sample clients from the client set based on the learning task and the bidding information uploaded by the set of clients participating in federated learning, after the federated learning platform receives the clients' bidding information, the sample clients and the bid resources corresponding to each sample client can be obtained by modeling and solving the learning task. The modeling and solving process is as follows:
s_n ∈ {0, 1}    (2)
t_n^max ≤ T^max    (3)
where constraint (1) states that the total payment by the federated learning platform to the selected clients must not exceed the platform's budget; constraint (2) states that each client is either selected (s_n = 1) or not selected (s_n = 0); and constraint (3) states that the training time of a selected client must not exceed the maximum time specified by the federated learning platform.
By solving the above expressions, the federated learning platform selects the set of sample clients and the amount of resources it purchases from each sample client.
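A minimal sketch of this selection step is given below, assuming the platform scores each bid by data volume per unit of requested payment and fills its budget greedily; the field names (data_size, bid_price, train_time) and the greedy heuristic are illustrative assumptions, not the actual solver used by the platform.
from dataclasses import dataclass

@dataclass
class Bid:
    client_id: int
    data_size: int     # d_n: amount of local data offered
    compute: float     # m_n: unit computing capability
    bid_price: float   # payment requested by the client
    train_time: float  # estimated local training time t_n^max

def select_clients(bids, budget, t_max):
    """Return (selected_ids, payments) respecting constraints (1)-(3)."""
    # Constraint (3): drop clients whose training time exceeds the platform limit.
    feasible = [b for b in bids if b.train_time <= t_max]
    # Greedy heuristic: prefer more data per unit of requested payment.
    feasible.sort(key=lambda b: b.data_size / max(b.bid_price, 1e-9), reverse=True)
    selected, payments, spent = [], {}, 0.0
    for b in feasible:
        # Constraint (1): the total payment must stay within the platform budget.
        if spent + b.bid_price <= budget:
            selected.append(b.client_id)       # constraint (2): s_n = 1
            payments[b.client_id] = b.bid_price
            spent += b.bid_price
    return selected, payments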
S110: receive the updated model parameters uploaded by each sample client, the updated model parameters being formed by the sample client using the multi-agent reinforcement learning algorithm, before training starts, to output the bidding information to be submitted by the sample client in the current round and, after being selected, training the global shared model according to the configuration in the bidding information to be submitted.
In the embodiment of the present invention, a sample client is a client that wishes to participate in the learning task. Only selected sample clients may download and train the global model, and each sample client runs the multi-agent reinforcement learning algorithm. Specifically, after receiving the global shared model, a selected sample client trains the global model according to the configuration in its bidding information and obtains updated model parameters.
In the process in which a sample client uses the multi-agent reinforcement learning algorithm to output the bidding information to be submitted in the current round, the sample client acts as an agent; the agent observes its own historical state information in the federated learning environment and uses the historical state information to output the bidding information to be submitted by the sample client in the current round.
The above multi-agent reinforcement learning algorithm includes a strategist and an experience pool. Specifically, with the sample client acting as an agent, the experience pool of the multi-agent reinforcement learning algorithm is used to store the historical task state information observed by each agent in the federated learning environment. This historical task state information corresponds to the state of the agent's historically submitted bids and the federated learning feedback on those bids, and includes at least whether the agent was selected in historical rounds, the historical resource values, the historical amount of data provided, and the historical per-unit resource amount. The historical task state information observed by the agent in the federated learning environment is then input, as the agent's state information in the current round, into the strategist of the multi-agent reinforcement learning algorithm, which outputs the bidding information to be submitted by the agent in the current round.
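A minimal sketch of such an experience pool is given below, assuming each transition is stored as a (state, action, reward, next state) tuple; the field names (selected, price, data_size, unit_resource) are illustrative assumptions for the historical task state described above.
import random
from collections import deque, namedtuple

Observation = namedtuple("Observation", ["selected", "price", "data_size", "unit_resource"])
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ExperiencePool:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def store(self, state, action, reward, next_state):
        # One tuple per round: observed state, submitted bid, environment feedback, next state.
        self.buffer.append(Transition(state, action, reward, next_state))
    def sample(self, batch_size):
        # Mini-batch sampling used when the strategist is trained.
        return random.sample(self.buffer, batch_size)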
It can be understood that the above multi-agent reinforcement learning algorithm learns how to map the task state in the federated learning environment to bidding information, so that the clients and the platform simultaneously obtain the maximum resource benefit. The basic model of the system is a Markov game, in which all agents simultaneously select and execute the bidding information to be submitted according to the task state (or observations) of the current federated learning environment. It is defined as a tuple (n, S, A_1, ..., A_n, T, γ, R_1, ..., R_n), where n is the number of agents; S is the task state of the multi-agent reinforcement learning algorithm, i.e., the historical task state information of each agent; A_i is the set of bidding information to be submitted by agent i; T: S × A_1 × A_2 × ... × A_n × S → [0, 1] is the state transition function, i.e., the probability distribution over the next task state given the current task state and the joint action; and R_i is the reward function of agent i, where R_i(s, a_1, ..., a_n, s') is the reward obtained by agent i in task state s_{t+1} after the joint action (a_1, ..., a_n) is taken in task state s. The cumulative reward expected by an agent i is the expected discounted sum of its per-round rewards, with discount factor γ.
The reward function of an agent is expressed in terms of the following quantities: the price offered by agent i; d_i, the data volume of agent i; m_i, the unit computing capability of the agent; c_i, the unit cost; the average profit agent i obtains from serving its own resource demand; and x_i, the amount of resources the agent devotes to its own needs. Owing to the uncertainty of the behavior of the agent's owner (for example, the owner may use the device for other purposes for a long time, leaving almost no resources available for task training), x_i is defined as a random variable over a certain interval and follows a probability distribution function F(x_i).
Further, in order to know in real time the revenue fed back to each sample client in every round, after the bidding information to be submitted by the agent in the current round is output, the revenue resources fed back by the federated learning environment to the agent in the current round can be calculated, and the experience pool of the multi-agent reinforcement learning algorithm can be used to store the historical state of the environment observed by the agent in the current round, the bidding information to be submitted, the state of the environment after the bidding information is uploaded, and the revenue resources fed back to the agent by the federated learning environment for the bidding information uploaded in the current round. Specifically, based on the bidding information to be uploaded by the agent in the current round, the resource parameters involved in the agent's bidding process are obtained, and these resource parameters are input into a pre-built revenue function to obtain the revenue resources fed back by the federated learning environment to the agent in the current round.
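Purely as an illustration of such a revenue function, the sketch below assumes that a selected agent's per-round payoff is its payment minus its training cost, plus the average profit earned on the resources x_i it keeps for its own needs; this particular combination of price, d_i, m_i, c_i, x_i, and the own-use profit is an assumption, and the concrete revenue function of the embodiment may combine these quantities differently.
def agent_revenue(selected, price, d_i, m_i, c_i, x_i, profit_own):
    # Illustrative payoff: resources kept for the owner's own needs earn their average profit.
    own_use_profit = profit_own * x_i
    if not selected:
        return own_use_profit
    # Training cost grows with the data volume and shrinks with the unit computing capability.
    training_cost = c_i * d_i / max(m_i, 1e-9)
    return price + own_use_profit - training_cost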
Each of the above sample clients is configured with a strategist, and the strategist includes an action network and a value network. Specifically, in the process of outputting the bidding information to be submitted by the agent in the current round, the historical task state information observed by the agent in the federated learning environment is input, as the agent's state information in the current round, into the action network of the strategist, which outputs the bidding information to be submitted by the agent in the current round, yielding the bidding information to be uploaded by the agent in the current training round. The agent's state information in the current round and the bidding information to be uploaded in the current round are then input into the value network of the strategist, which evaluates the bidding information to be uploaded and produces an evaluation score. The action network is trained using the evaluation score of the bidding information to be uploaded, and its network parameters are updated by gradient ascent; the value network is trained using the evaluation score of the bidding information to be uploaded and the revenue resources actually fed back to the agent, and its network parameters are updated by the temporal-difference method.
In an actual application scenario, it may be assumed that at a certain time step t there are m sample clients and one task initiator in a region. Here a time step t corresponds to one round of federated learning in which the set of clients submits task bids, the federated learning platform selects clients, and the selected sample clients train locally and upload updated model parameters; a task bid contains the sample client's bidding information (data volume, computing resources) and the payment it hopes to obtain. In federated learning, each sample client acts as an agent and owns a reinforcement learning strategist, which is built from a multi-layer perceptron of deep learning and contains an input layer, hidden layers, and an output layer. The strategist is represented as follows: s_t is the state of the federated learning environment at time t, including the state of each agent and the state of the federated learning platform. In each time slot t, the observation of agent i contains the price offered by the agent in the previous round; the bidding result of the previous round, where 0 indicates a failed bid and 1 a successful bid; the unit computing resources offered by the agent (since an agent does not necessarily allocate all of its computing resources to the training task during the training time, each agent's unit computing resources are related to the agent's own resource demand); and the amount of data offered by the agent in the previous round. Before the current training round starts, agent i observes the previous state information about the current learning task and inputs the state observed from the federated learning environment into the action network; the action network computes and outputs a policy, i.e., the bidding information to be submitted by the agent in the current round, as an array containing the four attributes of the bidding information. After the user takes an action in the current round, the federated learning environment feeds back a reward, and each agent has its own reward function; in the embodiment of the present invention, the agent's reward function is used to compute the agent's revenue resources in the current round.
Exemplarily, Fig. 2 is a flow diagram of the multi-agent reinforcement learning algorithm outputting the bidding information to be submitted by a sample client in the current round. Here each strategist includes an action network and a value network, and the action network and the value network each consist of a main network and a target network. With reference to Fig. 2, the specific algorithm flow is as follows:
for episode = 1 to M:
    initialize the action space;
    for t = 1 to T:
        a) each agent i selects an action a_i;
        b) execute the joint action a = (a_1, ..., a_N), and observe the reward r and the new state s_{t+1};
        c) put (s_t, a_t, r_t, s_{t+1}, t) into the experience pool D;
        d) s_t ← s_{t+1};
        for agent i = 1 to N:
            randomly sample a stored mini-batch (X_j, a_j, r_j, X'_j) from the experience pool;
            main network update: update the main network and the target network by minimizing the loss function;
            update the action network by gradient ascent;
    after all updates are completed, update the target network of each agent i: θ'_i ← τθ_i + (1 - τ)θ'_i
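Under the scheme above, the following is a minimal PyTorch-style sketch of how one strategist's networks might be updated from a sampled mini-batch (temporal-difference update for the value network, gradient ascent for the action network, soft update of the target networks). The class names, network sizes, and hyperparameters are illustrative assumptions rather than the embodiment's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionNet(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Sigmoid())
    def forward(self, obs):
        return self.net(obs)  # bid attributes scaled to (0, 1)

class ValueNet(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def update_agent(actor, critic, target_actor, target_critic,
                 actor_opt, critic_opt, batch, gamma=0.95, tau=0.01):
    # Single-agent simplification: in the centralized-training variant described
    # above, the critic would receive the observations and actions of all agents.
    obs, act, rew, next_obs = batch  # tensors; rew has shape [batch_size, 1]
    # Temporal-difference target and value-network (critic) update.
    with torch.no_grad():
        td_target = rew + gamma * target_critic(next_obs, target_actor(next_obs))
    critic_loss = F.mse_loss(critic(obs, act), td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Action-network (actor) update by gradient ascent on the critic's score
    # (implemented as minimizing the negative score).
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'.
    for target, online in ((target_actor, actor), (target_critic, critic)):
        for tp, p in zip(target.parameters(), online.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)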
S120: aggregate the updated model parameters uploaded by the sample clients, and use the aggregated updated model parameters to update the model parameters in the global shared model.
Specifically, the ratio of each sample client's data volume to the data volume of all sample clients can be calculated to obtain the data-volume proportion corresponding to each sample client; the data-volume proportion corresponding to each sample client is multiplied by the updated model parameters uploaded by that sample client, the updated model parameters corresponding to all sample clients are aggregated, and the model parameters in the global shared model are updated by accumulating the aggregated updated model parameters.
It can be understood that the model parameters of the global shared model delivered to each sample client are identical, whereas the model parameters obtained by each sample client after training the global shared model according to the configuration of its output bidding information differ. After each sample client trains the global shared model, it obtains local model parameters; subtracting the model parameters of the delivered global shared model from the local model parameters yields the updated model parameters, and the federated learning platform then updates the model parameters of the global shared model.
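A minimal sketch of this data-weighted aggregation is given below, assuming each sample client uploads a parameter delta (its local parameters minus the delivered global parameters) together with its data volume; the variable names are illustrative assumptions.
def aggregate_and_update(global_params, client_deltas, client_data_sizes):
    # global_params and each delta are dicts mapping a parameter name to a list of floats.
    total = float(sum(client_data_sizes))
    new_params = {name: list(values) for name, values in global_params.items()}
    for delta, n_k in zip(client_deltas, client_data_sizes):
        weight = n_k / total  # data-volume proportion of this sample client
        for name, values in delta.items():
            for i, v in enumerate(values):
                # Accumulate the weighted update onto the global shared model.
                new_params[name][i] += weight * v
    return new_params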
S130: if the updated global shared model reaches a preset model accuracy on a test task, determine that the learning task published by the federated learning platform is completed; otherwise, repeat the step of updating the model parameters in the global shared model for multiple rounds until the updated global shared model reaches the preset model accuracy on the test task.
在本发明实施例中，考虑智能体的探索空间巨大，这里采用集中式训练、分布式执行的多智能体强化学习算法作为框架。每一个样本客户端作为一个智能体，每个智能体都有一个策略器，策略器由动作网络和价值网络组成，每个动作网络与价值网络都分别由两个网络（主网络和目标网络）组成，以用来训练更新。智能体会观察当前轮次的任务状态，例如，历史轮次中是否被选中，历史价格，历史提供数据量，以及历史单位资源量等，作为策略器中动作网络的输入，动作网络给出当前轮次的动作，即当前轮次待提交的竞标信息；每个智能体的价值网络以各智能体观察到的局部状态和做出的动作作为输入，以此对该智能体输出的动作进行打分。具体而言，在每一个联邦学习训练轮次开始前，每个智能体将观察自己的历史信息s（历史竞标结果，历史计算资源，历史数据量，历史竞价）作为状态输入到策略器中，策略器将输出智能体的动作a，即用户当前训练轮次的竞标信息，用户将竞标信息提交至联邦学习平台（即环境），联邦学习平台会选择合适的样本客户端以最大化自己的利润，联邦学习环境会反馈给每个智能体奖励值r，并转变到下一个状态s'，经验池会对元组(s, a, s', r)进行存储。当经验池无法再收集新的数据时，策略器会开始进行训练，在本发明实施例中具体采用集中式训练、分布式执行的思想来训练策略器，集中式训练可表现为：首先，每个智能体策略器中的动作网络根据当前的状态选择一个动作a，然后，价值网络根据状态-动作对计算一个Q值，作为对动作网络做出动作a的反馈。这里价值网络根据估计的Q值和实际的Q值来进行训练，动作网络根据价值网络的反馈来更新策略。为了获得更准确的Q值，训练过程中策略器中的价值网络具有所有智能体的动作及状态，且价值网络通过时序差分法来更新价值网络参数，而后通过梯度上升更新动作网络的参数。分布式执行可表现为：集中训练完成后，由每个智能体根据自己当前观察到的状态分布式执行。策略器经过足够时间的训练后逐渐收敛，最终达到较优的实时竞标效果。
In this embodiment of the present invention, considering the huge exploration space of the agents, a multi-agent reinforcement learning algorithm with centralized training and distributed execution is adopted as the framework. Each sample client acts as an agent, and each agent has a strategist consisting of an action network and a value network; each action network and each value network is in turn composed of two networks (a main network and a target network) used for training and updating. The agent observes the task state of the current round, for example whether it was selected in historical rounds, historical prices, the historical amount of data provided, and historical unit resource amounts, as the input of the action network in the strategist, and the action network outputs the action of the current round, i.e., the bidding information to be submitted in the current round; the value network of each agent takes the local states observed and the actions made by the agents as input and uses them to score the action output by that agent. Specifically, before each federated learning training round begins, each agent feeds its observed historical information s (historical bidding results, historical computing resources, historical data volume, historical bids) into the strategist as its state, and the strategist outputs the agent's action a, i.e., the user's bid for the current training round. The user submits the bidding information to the federated learning platform (the environment), which selects suitable sample clients to maximize its own profit; the federated learning environment then feeds back a reward r to each agent and transitions to the next state s', and the experience pool stores the tuple (s, a, s', r). When the experience pool can no longer collect new data, the strategist begins training. This embodiment specifically trains the strategist with the idea of centralized training and distributed execution. Centralized training proceeds as follows: first, the action network in each agent's strategist selects an action a according to the current state; then, the value network computes a Q value from the state-action pair as feedback on action a. The value network is trained on the estimated and actual Q values, and the action network updates its policy according to the value network's feedback. To obtain more accurate Q values, during training the value network of the strategist has access to the actions and states of all agents; the value network's parameters are updated by the temporal-difference method, after which the action network's parameters are updated by gradient ascent. Distributed execution means that, after centralized training is completed, each agent executes on its own according to the state it currently observes. After sufficient training the strategist gradually converges and finally achieves effective real-time bidding.
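The centralized-training, distributed-execution scheme described above is close in spirit to actor-critic methods such as MADDPG. The PyTorch sketch below is only an illustrative reading of that description: the layer sizes, tensor layout, and the names Actor, Critic, and maddpg_losses are assumptions of this sketch, not the patent's implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action network: maps an agent's own observed state to its bid for the round."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh(),   # bid components scaled to [-1, 1]
        )
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Value network: scores a joint state-action pair with a single Q value."""
    def __init__(self, joint_obs_dim: int, joint_act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))

def maddpg_losses(i, actors, critics, target_actors, target_critics, batch, gamma=0.95):
    """Centralized-training losses for agent i.

    batch: obs, act, next_obs of shape [B, n_agents, dim]; rew of shape [B, n_agents, 1].
    The critic sees every agent's state and action; the actor only sees agent i's own state.
    """
    obs, act, rew, next_obs = batch
    B, n_agents = obs.shape[0], obs.shape[1]
    joint_obs, joint_act = obs.reshape(B, -1), act.reshape(B, -1)

    # Value network: temporal-difference target built from the target copies.
    with torch.no_grad():
        next_act = torch.cat([target_actors[j](next_obs[:, j]) for j in range(n_agents)], dim=-1)
        td_target = rew[:, i] + gamma * target_critics[i](next_obs.reshape(B, -1), next_act)
    critic_loss = nn.functional.mse_loss(critics[i](joint_obs, joint_act), td_target)

    # Action network: gradient ascent on the critic's score of its own fresh action
    # (implemented as minimizing the negative score).
    fresh_act = act.clone()
    fresh_act[:, i] = actors[i](obs[:, i])
    actor_loss = -critics[i](joint_obs, fresh_act.reshape(B, -1)).mean()
    return critic_loss, actor_loss
```

The main/target pairs mentioned above would be obtained by copying each network (for instance with copy.deepcopy) and letting the copies track the main networks slowly; one conventional soft-update rule is sketched further below in this section.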
本发明实施例提供的基于多智能体强化学习算法在联邦学习下的用户竞价方法，通过获取联邦学习平台发布的学习任务，基于学习任务以及参与联邦学习的客户端集合所上传的竞标信息从客户端集合中选取样本客户端，并向样本客户端下发全局共享模型，接收每个样本客户端上传的更新模型参数，该更新模型参数为样本客户端在训练开始之前使用多智能体强化学习算法输出样本客户端在当前轮次的待提交竞标信息，被选中后按照所述待提交竞标信息中的配置训练全局共享模型所形成的，进一步对各个样本客户端上传的更新模型参数进行聚合，使用聚合后的更新模型参数对全局共享模型中的模型参数进行更新，若更新后的全局共享模型在测试任务中达到预设模型精度，则判定完成联邦学习平台发布的学习任务，否则，重复执行多个轮次对全局共享模型中模型参数进行更新的步骤，以使得更新后的全局共享模型在测试任务中达到预设模型精度。由此可知，与现有技术基于拍卖的激励机制相比，本发明实施例可使用多智能体学习系统来调整客户端上传的竞标信息，以增加客户端被选取的概率，从而解决了现有技术基于拍卖的激励机制由于策略在后续训练过程中不会发生改变而导致联邦学习公平性缺失的问题。In the user bidding method under federated learning based on a multi-agent reinforcement learning algorithm provided by this embodiment of the present invention, the learning task released by the federated learning platform is obtained; sample clients are selected from the set of clients participating in federated learning based on the learning task and the bidding information uploaded by the client set, and the global shared model is delivered to the sample clients; the updated model parameters uploaded by each sample client are received, where the updated model parameters are formed by the sample client using the multi-agent reinforcement learning algorithm before training starts to output the bidding information to be submitted in the current round and, after being selected, training the global shared model according to the configuration in that bidding information; the updated model parameters uploaded by the sample clients are then aggregated, and the aggregated updated model parameters are used to update the model parameters of the global shared model. If the updated global shared model reaches the preset model accuracy on the test task, it is determined that the learning task released by the federated learning platform has been completed; otherwise, the step of updating the model parameters of the global shared model is repeated for further rounds until the updated global shared model reaches the preset accuracy on the test task. It can thus be seen that, compared with the auction-based incentive mechanisms of the prior art, this embodiment of the present invention can use a multi-agent learning system to adjust the bidding information uploaded by clients so as to increase the probability of a client being selected, thereby solving the fairness problem in federated learning that arises because, in prior-art auction-based incentive mechanisms, the strategy does not change during subsequent training.
基于上述实施例，本发明的另一实施例提供了一种基于多智能体强化学习算法在联邦学习下的用户竞价装置，如图3所示，所述装置包括：Based on the above embodiments, another embodiment of the present invention provides a user bidding device based on a multi-agent reinforcement learning algorithm under federated learning. As shown in FIG. 3, the device includes:
获取单元20,可以用于获取联邦学习平台发布的学习任务，基于所述学习任务以及参与联邦学习的客户端集合所上传的竞标信息从所述客户端集合中选取样本客户端，并向样本客户端下发全局共享模型；The obtaining unit 20 may be configured to obtain the learning task released by the federated learning platform, select sample clients from the set of clients participating in federated learning based on the learning task and the bidding information uploaded by the client set, and deliver the global shared model to the sample clients;
接收单元22,可以用于接收每个样本客户端上传的更新模型参数，所述更新模型参数为样本客户端在训练开始之前使用多智能体强化学习算法输出样本客户端在当前轮次的待提交竞标信息，被选中后按照所述待提交竞标信息中的配置训练全局共享模型所形成的；The receiving unit 22 may be configured to receive the updated model parameters uploaded by each sample client, where the updated model parameters are formed by the sample client using the multi-agent reinforcement learning algorithm before training starts to output the bidding information to be submitted by the sample client in the current round and, after being selected, training the global shared model according to the configuration in the bidding information to be submitted;
聚合单元24,可以用于对各个样本客户端上传的更新模型参数进行聚合，使用聚合后的更新模型参数对所述全局共享模型中的模型参数进行更新；The aggregation unit 24 may be configured to aggregate the updated model parameters uploaded by each sample client, and update the model parameters of the global shared model using the aggregated updated model parameters;
选取单元26,可以用于若更新后的全局共享模型在测试任务中达到预设模型精度，则判定完成联邦学习平台发布的学习任务，否则，重复执行多个轮次对全局共享模型中模型参数进行更新的步骤，以使得更新后的全局共享模型在测试任务中达到预设模型精度。The selection unit 26 may be configured to determine that the learning task released by the federated learning platform has been completed if the updated global shared model reaches the preset model accuracy on the test task; otherwise, the step of updating the model parameters of the global shared model is repeated for further rounds until the updated global shared model reaches the preset model accuracy on the test task.
可选的,所述装置还包括:Optionally, the device also includes:
输出单元，可以用于样本客户端使用多智能体强化学习算法输出样本客户端在当前轮次的待提交竞标信息的过程；an output unit, which may be used in the process in which the sample client uses the multi-agent reinforcement learning algorithm to output the bidding information to be submitted by the sample client in the current round;
所述输出单元，具体用于以所述样本客户端作为智能体，所述智能体观察在联邦学习环境中自身的历史状态信息，并利用所述历史状态信息输出所述样本客户端在当前轮次的待提交竞标信息。The output unit is specifically configured to take the sample client as an agent; the agent observes its own historical state information in the federated learning environment and uses the historical state information to output the bidding information to be submitted by the sample client in the current round.
可选的,所述多智能体强化学习算法包括策略器和经验池,所述输出单元包括:Optionally, the multi-agent reinforcement learning algorithm includes a strategist and an experience pool, and the output unit includes:
存储模块，可以用于以所述样本客户端作为智能体，使用所述多智能体强化学习算法中经验池来存储联邦学习环境中各个智能体观察到的历史任务状态信息，所述历史任务状态信息至少包括智能体在历史轮次中是否被选中、历史资源值、历史提供数据量以及历史单位资源量；a storage module, which may be configured to take the sample client as an agent and use the experience pool of the multi-agent reinforcement learning algorithm to store the historical task state information observed by each agent in the federated learning environment, where the historical task state information includes at least whether the agent was selected in historical rounds, historical resource values, the historical amount of data provided, and historical unit resource amounts;
输出模块，可以用于通过将所述智能体在所述联邦学习环境中观察到的历史任务状态信息作为智能体在当前轮次的状态信息输入至所述多智能体强化学习算法中策略器，输出智能体在当前轮次的待提交竞标信息。an output module, which may be configured to input the historical task state information observed by the agent in the federated learning environment into the strategist of the multi-agent reinforcement learning algorithm as the agent's state information for the current round, and to output the bidding information to be submitted by the agent in the current round.
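To make the storage module and experience pool concrete, the fragment below sketches one possible shape for the observed task state and the stored transition tuples; the class names and field names (selected, price, data_volume, unit_resource) are illustrative assumptions, not identifiers from the patent.

```python
import random
from collections import deque
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskState:
    selected: List[int] = field(default_factory=list)       # whether the agent won in past rounds
    price: List[float] = field(default_factory=list)        # historical resource values / prices
    data_volume: List[float] = field(default_factory=list)  # historical amount of data provided
    unit_resource: List[float] = field(default_factory=list)

class ExperiencePool:
    """Stores (state, bid, next_state, reward) tuples observed by one agent."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)
    def store(self, state, bid, next_state, reward):
        self.buffer.append((state, bid, next_state, reward))
    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)
    def full(self) -> bool:
        # training starts once the pool can no longer collect new data
        return len(self.buffer) == self.buffer.maxlen
```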
可选的,所述输出单元还包括:Optionally, the output unit also includes:
计算模块，可以用于在所述通过将所述智能体在所述联邦学习环境中观察到的历史任务状态信息作为智能体在当前轮次的状态信息输入至所述多智能体强化学习算法中策略器，输出智能体在当前轮次的待提交竞标信息之后，计算联邦学习环境针对智能体在当前轮次反馈的收益资源，并使用所述多智能体强化学习算法中经验池存储智能体在当前轮次观察到环境的历史状态、待提交竞标信息、待提交竞标信息上传后的环境状态以及联邦学习环境针对当前轮次上传的待提交竞标信息反馈给智能体的收益资源。a calculation module, which may be configured to, after the historical task state information observed by the agent in the federated learning environment is input into the strategist of the multi-agent reinforcement learning algorithm as the agent's state information for the current round and the bidding information to be submitted by the agent in the current round is output, calculate the revenue resource fed back by the federated learning environment to the agent for the current round, and use the experience pool of the multi-agent reinforcement learning algorithm to store the historical environment state observed by the agent in the current round, the bidding information to be submitted, the environment state after the bidding information to be submitted has been uploaded, and the revenue resource fed back to the agent by the federated learning environment for the bidding information uploaded in the current round.
可选的,所述计算模块,具体可以用于基于智能体在当前轮次上的待上传竞标信息,分别获取智能体在竞标过程中涉及的资源参数;Optionally, the calculation module can be specifically used to obtain the resource parameters involved in the bidding process of the agent based on the bidding information to be uploaded by the agent in the current round;
所述计算模块,具体还可以用于将所述智能体在竞标过程中涉及的资源参数输入至预先构建的收益函数,得到联邦学习环境针对智能体在当前轮次反馈的收益资源。The calculation module can also be specifically configured to input the resource parameters involved in the bidding process of the agent into a pre-built revenue function, so as to obtain the revenue resources fed back by the federated learning environment for the agent in the current round.
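The exact form of the pre-built revenue function is not given in this passage, so the sketch below only assumes one plausible shape: payment, unit_cost and data_volume are placeholder resource parameters introduced for this illustration.

```python
def revenue(selected: bool, payment: float, unit_cost: float, data_volume: float) -> float:
    """Assumed revenue resource fed back to the agent for the current round.

    If the bid wins, the agent receives `payment` and bears `unit_cost * data_volume`
    for local training; if it loses, the round yields nothing.
    """
    if not selected:
        return 0.0
    return payment - unit_cost * data_volume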
可选的，每个样本客户端配置有一个策略器，所述策略器包括动作网络和价值网络，所述输出模块，具体可以用于通过将所述智能体在所述联邦学习环境中观察到的历史任务状态信息作为智能体在当前轮次的状态信息输入至所述策略器中动作网络，输出智能体在当前轮次的待提交竞标信息，得到智能体在当前训练轮次的待上传竞标信息；Optionally, each sample client is configured with a strategist, and the strategist includes an action network and a value network. The output module may be specifically configured to input the historical task state information observed by the agent in the federated learning environment into the action network of the strategist as the agent's state information for the current round, and to output the bidding information to be submitted by the agent in the current round, thereby obtaining the bidding information to be uploaded by the agent in the current training round;
所述输出模块，具体还可以用于通过将所述智能体在当前轮次的状态信息以及智能体在当前轮次的待上传竞标信息输入至所述策略器中价值网络，对所述待上传竞标信息进行评估，得到待上传竞标信息的评估分数；The output module may be further configured to input the agent's state information for the current round and the agent's bidding information to be uploaded in the current round into the value network of the strategist, so as to evaluate the bidding information to be uploaded and obtain an evaluation score for it;
其中，所述动作网络利用所述待上传竞标信息的评估分数进行训练，所述动作网络的网络参数通过梯度上升来更新，所述价值网络利用所述待上传竞标信息的评估分数以及智能体实际反馈的收益资源进行训练，所述价值网络的网络参数通过时序差分法来更新。Here, the action network is trained using the evaluation score of the bidding information to be uploaded, and its network parameters are updated by gradient ascent; the value network is trained using the evaluation score of the bidding information to be uploaded together with the revenue resource actually fed back to the agent, and its network parameters are updated by the temporal-difference method.
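The evaluation score produced by the value network and the main/target bookkeeping implied above could look roughly as follows; value_net stands for a critic of the kind sketched earlier in the description, and tau and the soft-update rule are conventional defaults assumed for this sketch rather than values from the patent.

```python
import torch

@torch.no_grad()
def score_bid(value_net, joint_state, joint_action):
    """Evaluation score of a to-be-uploaded bid: the value network's Q estimate."""
    return value_net(joint_state, joint_action).squeeze(-1)

@torch.no_grad()
def soft_update(target_net, main_net, tau: float = 0.01):
    """Let the target copy track the main network slowly, keeping the TD target stable."""
    for tp, p in zip(target_net.parameters(), main_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```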
可选的，所述聚合单元24包括：Optionally, the aggregation unit 24 includes:
计算模块，可以用于分别计算各个样本客户端的数据量与所有样本客户端的数据量的比值，得到每个样本客户端对应的数据量占比；a calculation module, which may be configured to calculate, for each sample client, the ratio of that client's data volume to the total data volume of all sample clients, obtaining the data-volume proportion corresponding to each sample client;
聚合模块，可以用于将每个样本客户端对应的数据量占比乘以相应样本客户端上传的更新模型参数后，聚合所有样本客户端对应的更新模型参数，通过累加聚合后更新模型参数对全局共享模型中的模型参数进行更新。an aggregation module, which may be configured to multiply the data-volume proportion corresponding to each sample client by the updated model parameters uploaded by that sample client, aggregate the weighted updated model parameters of all sample clients, and update the model parameters of the global shared model by accumulating the aggregated updated model parameters.
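The data-volume-weighted aggregation described above follows the familiar FedAvg pattern; the sketch below is a minimal illustration of it, with model parameters represented as plain dictionaries for simplicity.

```python
from typing import Dict, List

def aggregate(updates: List[Dict[str, float]], data_volumes: List[float]) -> Dict[str, float]:
    """Weight each client's uploaded parameters by its share of the total data volume and sum them."""
    total = sum(data_volumes)
    global_params: Dict[str, float] = {name: 0.0 for name in updates[0]}
    for client_update, n_k in zip(updates, data_volumes):
        weight = n_k / total                        # data-volume proportion of this client
        for name, value in client_update.items():
            global_params[name] += weight * value   # accumulate the weighted parameters
    return global_params
```

For example, aggregate([{"w": 1.0}, {"w": 3.0}], [10, 30]) weights the two uploads by 0.25 and 0.75 and returns {"w": 2.5}.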
基于上述方法实施例,本发明的另一实施例提供了一种存储介质,其上存储有可执行指令,该指令被处理器执行时使处理器实现上述方法。Based on the foregoing method embodiments, another embodiment of the present invention provides a storage medium on which executable instructions are stored, and when the instructions are executed by a processor, the processor implements the foregoing method.
基于上述实施例,本发明的另一实施例提供了一种车辆,包括:Based on the above embodiments, another embodiment of the present invention provides a vehicle, including:
一个或多个处理器;one or more processors;
存储装置,用于存储一个或多个程序,storage means for storing one or more programs,
其中，当所述一个或多个程序被所述一个或多个处理器执行时，使得所述一个或多个处理器实现上述的方法。所述车辆可以为非自动驾驶车辆，也可以为自动驾驶车辆。Wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the above method. The vehicle may be a non-autonomous vehicle or an autonomous vehicle.
上述系统、装置实施例与方法实施例相对应，与该方法实施例具有同样的技术效果，具体说明参见方法实施例。装置实施例是基于方法实施例得到的，具体的说明可以参见方法实施例部分，此处不再赘述。本领域普通技术人员可以理解：附图只是一个实施例的示意图，附图中的模块或流程并不一定是实施本发明所必须的。The above system and device embodiments correspond to the method embodiments and have the same technical effect as the method embodiments; for details, refer to the method embodiments. The device embodiments are derived from the method embodiments; for a specific description, refer to the method embodiment section, which is not repeated here. Those of ordinary skill in the art can understand that the accompanying drawing is only a schematic diagram of one embodiment, and the modules or processes in the accompanying drawing are not necessarily required for implementing the present invention.
本领域普通技术人员可以理解:实施例中的装置中的模块可以按照实施例描述分布于实施例的装置中,也可以进行相应变化位于不同于本实施例的一个或多个装置中。上述实施例的模块可以合并为一个模块,也可以进一步拆分成多个子模块。Those of ordinary skill in the art can understand that: the modules in the device in the embodiment may be distributed in the device in the embodiment according to the description in the embodiment, or may be changed and located in one or more devices different from the embodiment. The modules in the above embodiments can be combined into one module, and can also be further split into multiple sub-modules.
最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2022103096119 | 2022-03-28 | ||
CN202210309611.9A CN114971819A (en) | 2022-03-28 | 2022-03-28 | User bidding method and device based on multi-agent reinforcement learning algorithm under federal learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115358831A true CN115358831A (en) | 2022-11-18 |
Family
ID=82975873
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210309611.9A Withdrawn CN114971819A (en) | 2022-03-28 | 2022-03-28 | User bidding method and device based on multi-agent reinforcement learning algorithm under federal learning |
CN202211120985.2A Pending CN115358831A (en) | 2022-03-28 | 2022-09-15 | User bidding method and device based on multi-agent reinforcement learning algorithm under federated learning |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210309611.9A Withdrawn CN114971819A (en) | 2022-03-28 | 2022-03-28 | User bidding method and device based on multi-agent reinforcement learning algorithm under federal learning |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN114971819A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115544899A (en) * | 2022-11-23 | 2022-12-30 | 南京邮电大学 | Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115130683B (en) * | 2022-07-18 | 2025-02-14 | 山东大学 | Asynchronous federated learning method and system based on multi-agent model |
CN115086399B (en) * | 2022-07-28 | 2022-12-06 | 深圳前海环融联易信息科技服务有限公司 | Federal learning method and device based on hyper network and computer equipment |
CN117076113B (en) * | 2023-08-17 | 2024-09-06 | 重庆理工大学 | Industrial heterogeneous equipment multi-job scheduling method based on federal learning |
Also Published As
Publication number | Publication date |
---|---|
CN114971819A (en) | 2022-08-30 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB03 | Change of inventor or designer information | Inventor after: Zeng Rongfei; An Shuyang; Zeng Chao; Su Mai; Wang Jiaqi. Inventor before: Zeng Rongfei; An Shuyang; Zeng Chao; Han Bo; Su Mai; Wang Jiaqi. |