CN117852745A - Enterprise group distributed decision method, device, equipment and storage medium - Google Patents

Enterprise group distributed decision method, device, equipment and storage medium

Info

Publication number
CN117852745A
CN117852745A
Authority
CN
China
Prior art keywords
decision
production
agent
network
distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311727686.XA
Other languages
Chinese (zh)
Inventor
黄国铨
张涌
林冠宇
董德
许宜诚
汪磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202311727686.XA
Publication of CN117852745A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to an enterprise group distributed decision making method, device, equipment and storage medium. The method comprises the following steps: decomposing the overall production task of the multi-agent system into a plurality of subtasks according to task type, and taking each single agent in the multi-agent system as an independent production decision unit; carrying out distributed decision and centralized training on the task allocation strategy of the multi-agent system by utilizing a multi-agent reinforcement learning algorithm, and training a decision network capable of distributed decision for each production decision unit; and performing federal training and homomorphic encryption optimization on the interaction data in the experience playback buffer through a federal reinforcement learning algorithm, and aggregating the decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system. The embodiment of the application solves the problems of high decision complexity, low decision timeliness and low decision quality of centralized decision in the prior art, and ensures the data security of enterprises.

Description

Enterprise group distributed decision method, device, equipment and storage medium
Technical Field
The application belongs to the technical field of group intelligence, and particularly relates to an enterprise group distributed decision method, device, equipment and storage medium.
Background
In the context of global production, distributed manufacturing has become a trend in modern manufacturing. In this mode, product manufacturing is no longer completed independently by a single enterprise, but jointly by multiple cooperating enterprises. Each enterprise builds its core competitiveness on its own technology, resources and strategies, and long-term coexistence is achieved through cooperation in distributed manufacturing. In the enterprise group cooperation mode, each enterprise can be regarded as an independent production unit that makes efficient decisions according to real-time production tasks and its own resource requirements, improving overall production efficiency and adjusting its strategy so as to realize better overall benefits. Traditional production task allocation and resource scheduling face problems such as low efficiency, high complexity and low accuracy. Multi-agent reinforcement learning provides an effective solution: a Markov sequential decision model can be built to coordinate production task allocation and resource scheduling in enterprise group collaboration. In a multi-agent reinforcement learning system, each enterprise is considered an agent and quickly adjusts its strategy according to real-time task and resource changes, thereby improving overall revenue.
Existing multi-agent reinforcement learning systems generally adopt centralized training (multiple agents share an experience pool) together with a distributed decision strategy. However, the production data generated during enterprise group cooperation is core production data, and centralized training lacks a scheme for protecting data security, so enterprises keep their data and decisions to themselves, which hinders the enterprise group's willingness to cooperate. Meanwhile, centralized deep learning or reinforcement learning model training helps the decision making of a single enterprise and improves that enterprise's production revenue and efficiency, but in practice the production process of a manufacturing industry chain is distributed cooperation among multiple enterprise groups: decisions made in isolation can hardly achieve overall decision optimization and hamper dynamic, efficient decision making, while centralized regulation of production task allocation and resource scheduling still suffers from low real-time performance, high complexity and decision quality that is difficult to meet business requirements.
Disclosure of Invention
The application provides an enterprise group distributed decision-making method, an enterprise group distributed decision-making device, enterprise group distributed decision-making equipment and a storage medium, which aim to solve one of the technical problems in the prior art at least to a certain extent.
In order to solve the above problems, the present application provides the following technical solutions:
An enterprise group distributed decision making method, comprising:
decomposing the whole production task of the multi-agent system into a plurality of subtasks according to the task type, and taking each single agent in the multi-agent system as an independent production decision unit; the multi-agent system is an enterprise group in an industrial chain system, and the single agent is each enterprise in the enterprise group;
carrying out distributed decision and centralized training on task allocation strategies of the multi-agent system by utilizing a multi-agent reinforcement learning algorithm, training a decision network capable of carrying out distributed decision for each production decision unit, and placing interaction data of each single agent in a shared experience playback buffer area in the centralized training process;
and performing federal training and homomorphic encryption optimization on the interaction data in the experience playback buffer through a federal reinforcement learning algorithm, and aggregating the decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system.
The technical scheme adopted by the embodiment of the application further comprises: the multi-agent reinforcement learning algorithm includes distributed task execution objective setting as well as state space and reward function design,
The task execution targets are set as follows: matching each subtask with a production decision unit according to the task type and resource requirement of the subtask, and training a decision network which can perform independent distributed decision and can autonomously cooperate with the multi-agent system for each production decision unit so that each production decision unit can perform independent decision according to the production environment and the self state;
the state space and the reward function are designed as follows: the task allocation and resource scheduling of each production decision unit are regarded as the decision process of a single agent, and the decision process of each single agent is modeled as a partially observable Markov decision process; the state space of a single agent comprises the type and quantity of production tasks, the profit value, the cost value, the production resources and the resource requirement, the action space of a single agent selects whether to accept a production task at the current moment and how many production resources to allocate, and the value of the optimized objective function is used as the reward value of the reinforcement learning agent.
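As an illustration of this state and action design, the following minimal Python sketch lays out one production decision unit's observation and discrete action spaces using the Gymnasium `spaces` API; the dimensions, resource counts and allocation levels are illustrative assumptions rather than values from the application.

```python
import numpy as np
from gymnasium import spaces

NUM_RESOURCES = 6          # assumed number of resource categories per enterprise
NUM_ALLOCATION_LEVELS = 5  # assumed discrete resource-allocation quantities

# Observation: production task type and quantity, profit value, cost value,
# currently held production resources, and the offered subtask's resource demand.
observation_space = spaces.Box(
    low=0.0, high=np.inf,
    shape=(2 + 2 + NUM_RESOURCES + NUM_RESOURCES,),
    dtype=np.float32,
)

# Action: 0 = do not accept the production task now; 1..K = accept it and
# commit one of K predefined resource-allocation quantities (a discrete space).
action_space = spaces.Discrete(1 + NUM_ALLOCATION_LEVELS)
```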
The technical scheme adopted by the embodiment of the application further comprises: the value of the optimized objective function is taken as the rewarding value of the reinforcement learning agent, and the method specifically comprises the following steps:
Each agent shares a reward function with the same benefit value, and the rule of the reward function is set as follows:
the longer a subtask remains unassigned to a production decision unit, the lower the benefit value obtained by the production decision unit that subsequently receives the subtask;
determining whether each subtask is received or completed according to the production capacity and the resource demand matching degree of each production decision unit, wherein the higher the resource demand matching degree is, the more sufficient the production capacity is, and the higher the rewarding value of the corresponding production decision unit is;
the self information of each production decision unit is represented by a metadata tuple ⟨(x_i, y_i), c_i1, c_i2, …, c_in⟩, where (x_i, y_i) is the production position of production decision unit i in the industrial chain system and c_in is the quantity of resource n held by production decision unit i; the spatial position and required resource quantities of a subtask allocated in real time are represented as ⟨(x_j, y_j), c_j1, c_j2, …, c_jn⟩; the position distance between subtask j and production decision unit i is denoted d_ij, and the matching value of production decision unit i and subtask j is a matching coefficient m_ij, where the higher the matching coefficient, the higher the matching degree between production decision unit i and subtask j; the task allocation scheme of each round of decision is a = (a_1, a_2, …, a_N), where a_i is the decision result of whether production decision unit i accepts a subtask, and the reward function of each agent is expressed as rew = rew_d + rew_e.
The technical scheme adopted by the embodiment of the application further comprises: the method for performing distributed decision and centralized training on the task allocation strategy of the multi-agent system by utilizing the multi-agent reinforcement learning algorithm comprises the following steps:
respectively initializing a decision network and an evaluation network for each single agent, wherein the decision network comprises a local decision network and a target decision network, the evaluation network comprises a local evaluation network and a target evaluation network, the decision network is used for generating actions of maximum expected benefits according to the observation information of each single agent, and the evaluation network is used for evaluating the actions generated by the decision network;
inputting the observation information o of each single agent in the production environment into the decision network and outputting a decision action a, where the production environment feeds back a benefit record for each index r = (r_1, r_2, …, r_m); the observation information, benefits, rewards and decision actions of each single agent at the current stage are then stored as an interaction data tuple (s_i, a_i, r_i) in the experience playback buffer; the observation information o comprises each single agent's position information and state information in the industrial chain system, the position distances of other agents relative to it, the position information of the allocated subtasks and the resource requirements;
and randomly extracting small batches of data from the experience playback buffer, calculating the evaluation network gradient of each single agent, calculating the decision network gradient of each single agent by using the extracted data and the value output by the evaluation network, and updating the network parameters with these gradients.
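A minimal sketch of the shared experience playback buffer implied above, storing joint per-step interaction tuples and serving random mini-batches to the centralized training step; the class name, capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class SharedReplayBuffer:
    """Experience pool shared by all production decision units (agents)."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, obs, actions, rewards, next_obs):
        # Each argument is a list indexed by agent id for one environment step.
        self.storage.append((obs, actions, rewards, next_obs))

    def sample(self, batch_size=64):
        # Random small-batch sampling used when computing the network gradients.
        batch = random.sample(self.storage, batch_size)
        obs, actions, rewards, next_obs = zip(*batch)
        return list(obs), list(actions), list(rewards), list(next_obs)
```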
The technical scheme adopted by the embodiment of the application further comprises: the method for performing distributed decision and centralized training on the task allocation strategy of the multi-agent system by utilizing the multi-agent reinforcement learning algorithm further comprises the following steps:
the distributed decision framework comprises an Actor network and a Critic network; in the Actor network, the input of each single agent's decision network is that agent's observation information o, which is processed by a multi-layer neural network into a vector whose dimension is the total number of subtasks, the vector is then passed through a Softmax to obtain the selection probability of each subtask, and the subtask with the maximum selection probability is taken as the single agent's current action selection result; in the Critic network, the input of each single agent's evaluation network is the observation information o = (o_1, o_2, …, o_N) and action selection information a = (a_1, a_2, …, a_N) of all single agents, this input is processed by a multi-layer neural network into a vector of dimension 1, and its value is the estimated reward when the single agents' observation information is o and the action selection result is a.
The technical scheme adopted by the embodiment of the application further comprises: the interaction data stored in the experience playback buffer are subjected to federal training and homomorphic encryption optimization through a federal reinforcement learning algorithm, specifically:
the federal reinforcement learning algorithm adopts a multiparty reinforcement learning training strategy and encrypts the interaction data in the experience playback buffer by using a federal learning privacy-preserving gradient aggregation mechanism and homomorphic encryption technology, so that in each round of training each production decision unit delivers its interaction data to the federal central coordination node of the federal training, and the federal central coordination node performs federated averaging on the interaction data and distributes the result to each production decision unit.
The technical scheme adopted by the embodiment of the application further comprises: the training process of the federal reinforcement learning algorithm comprises the following steps:
initializing a global decision model through the federal central coordination node and broadcasting it to each enterprise computing node of the industrial chain system, where one enterprise computing node corresponds to each production decision unit; the federal central coordination node executes the models in parallel, initializes a buffer pool of set capacity to provide a synchronous-parallel model sharing mechanism, and sets the expected number of decision networks to receive and the maximum waiting time;
Independently observing the environmental states through the enterprise computing nodes, making decisions according to the environmental states, providing rewards and new states after the environments execute actions, optimizing self policies, and updating decision network parameters;
uploading network parameters to the federal central coordination node after the enterprise computing nodes have performed several decision network updates, and applying fully homomorphic encryption to the interaction data with a fully homomorphic encryption algorithm whenever an enterprise computing node exchanges interaction data from the experience playback buffer;
the federal central coordination node uses a specific aggregation algorithm to aggregate network parameters uploaded by each enterprise computing node, a global decision model is generated, and the global decision model is distributed to each enterprise computing node;
the enterprise computing node updates the decision network to a global decision model.
The embodiment of the application adopts another technical scheme that: an enterprise group distributed decision making apparatus, comprising:
the task decomposition module: used for decomposing the overall production task of the multi-agent system into a plurality of subtasks according to task type, and taking each single agent in the multi-agent system as an independent production decision unit; the multi-agent system is an enterprise group in an industrial chain system, and the single agents are the enterprises in the enterprise group;
multi-agent reinforcement learning module: used for performing distributed decision and centralized training on the task allocation strategy of the multi-agent system by utilizing a multi-agent reinforcement learning algorithm, training a decision network capable of distributed decision for each production decision unit, and placing the interaction data of each single agent during centralized training into a shared experience playback buffer;
federal reinforcement learning module: used for performing federal training and fully homomorphic encryption optimization on the interaction data in the experience playback buffer through a federal reinforcement learning algorithm, and aggregating the decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system.
The embodiment of the application adopts the following technical scheme: an apparatus comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the enterprise group distributed decision method;
the processor is configured to execute the program instructions stored in the memory to perform the enterprise group distributed decision method.
The embodiment of the application adopts the following technical scheme: a storage medium storing program instructions executable by a processor for performing the enterprise group distributed decision making method.
Compared with the prior art, the beneficial effects produced by the embodiments of the application are as follows: the enterprise group distributed decision method, device, equipment and storage medium combine multi-agent reinforcement learning with federal reinforcement learning. A multi-agent reinforcement learning algorithm is used for distributed execution and centralized training of the industrial chain system's task allocation strategy, matching appropriate task requirements to the resource distribution of each enterprise, which solves the distributed cooperation problem of the enterprise group and realizes real-time distributed decision making. A federal reinforcement learning algorithm encrypts the sensitive data produced during centralized training by means of federal training and homomorphic encryption, which guarantees the security of the multi-agent interaction data and solves the problem of sensitive data leakage from the experience pool shared by multiple enterprises in centralized training. The embodiments of the application can thus achieve fully distributed decision making in which each enterprise in the enterprise group decides individually while sharing the same overall optimization target, solving the problems of high decision complexity, low decision timeliness and low decision quality of centralized decision making in the prior art while ensuring the data security of enterprises.
Drawings
FIG. 1 is a flow chart of an enterprise group distributed decision making method of an embodiment of the present application;
FIG. 2 is a schematic diagram of an enterprise group distributed collaboration process in an industrial chain system according to an embodiment of the present application;
fig. 3 is a schematic diagram of a distributed decision framework based on the multi-agent reinforcement learning algorithm MADDPG according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an Actor and Critic network architecture according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a federal reinforcement learning algorithm according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an enterprise group distributed decision making device according to an embodiment of the present application;
FIG. 7 is a schematic view of a device structure according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first," "second," "third," and the like in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", and "a third" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise. All directional indications (such as up, down, left, right, front, back … …) in the embodiments of the present application are merely used to explain the relative positional relationship, movement, etc. between the components in a particular gesture (as shown in the drawings), and if the particular gesture changes, the directional indication changes accordingly. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Specifically, please refer to fig. 1, which is a flowchart of an enterprise group distributed decision method according to an embodiment of the present application. The enterprise group distributed decision-making method of the embodiment of the application comprises the following steps:
s100: decomposing the whole production task of the multi-agent system into a plurality of subtasks according to the task type, and taking each single agent in the multi-agent system as an independent production decision unit; the multi-agent system is an enterprise group which has independent decision making capability and can cooperatively complete the whole production task in a group in the industrial chain system, and the single agent is each enterprise in the enterprise group;
in this step, in the industrial chain system, the overall production task must be completed in a distributed, collaborative manner relying on the resources of multiple enterprises; the production task is decomposed into a plurality of subtasks according to task type, each subtask is precisely matched with a suitable enterprise for production, and the enterprise side must decide in real time whether to take on the subtask. Specifically, fig. 2 is a schematic diagram of the distributed collaboration process of an enterprise group in an industrial chain system. Each enterprise in the industrial chain system is located in a different production chain and production cluster, and different enterprises differ in production position and production resources. In the embodiment of the application, each enterprise acts as a production decision unit with a certain resource scheduling capability; owing to factors such as data security and privacy protection, each production decision unit cannot share its own resource information with a unified coordinator for centralized decision making, so each production decision unit must decide independently based on the resource information and other information it can obtain itself.
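The decomposition step above can be pictured with the following minimal Python sketch of the implied data structures: an overall production task split into typed subtasks, and each enterprise modelled as an independent production decision unit; all class and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Subtask:
    task_id: int
    task_type: str                     # e.g. "cell_production", "pack_assembly"
    position: Tuple[float, float]      # (x_j, y_j) location in the industrial chain
    resource_demand: Dict[str, float]  # required quantity per resource type

@dataclass
class ProductionDecisionUnit:
    unit_id: int
    position: Tuple[float, float]      # (x_i, y_i) production position
    resources: Dict[str, float]        # quantity held per resource type

def decompose_by_type(overall_task: List[Subtask]) -> Dict[str, List[Subtask]]:
    """Group the subtasks of the overall production task by task type."""
    groups: Dict[str, List[Subtask]] = {}
    for sub in overall_task:
        groups.setdefault(sub.task_type, []).append(sub)
    return groups
```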
S110: carrying out distributed decision and centralized training on task allocation strategies of a multi-agent system by utilizing a multi-agent reinforcement learning algorithm MADDPG, training a decision network which can carry out independent distributed decision and can autonomously cooperate with the multi-agent system for each production decision unit, and placing interactive data in a shared experience playback buffer zone in the centralized training process;
in this step, according to the interaction data strategy and the profit assessment strategy between the production decision units and the production environment, the embodiment of the application adopts the offline, semi-distributed multi-agent reinforcement learning algorithm MADDPG, in which each independently and autonomously deciding production decision unit is regarded as an agent, and all production decision units together form a multi-agent system for centralized training and distributed decision. The multi-agent reinforcement learning algorithm MADDPG comprises distributed task execution objective setting as well as state space and reward function design, specifically as follows:
task execution target setting: the problem of accurate and efficient matching of each subtask in the industrial chain system with the production decision unit is the key of enterprise group cooperation and division. The completion of each subtask in the whole production task requires a series of support of production resources, and the demands of different types of subtasks on the production resources are different, for example, the production of a lithium battery cell by a part manufacturer requires the support of resources such as anode and cathode materials, a diaphragm, electrolyte and the like, and the support of more resources such as a motor, an electric control and the like by the manufacturer. In the embodiment of the application, the multi-agent task allocation strategy is subjected to distributed decision and centralized training by utilizing the multi-agent reinforcement learning algorithm MADDPG, each subtask is matched with a production decision unit according to the task type and resource requirement of the subtask, and a decision network which can carry out independent distributed decision and can autonomously cooperate with a multi-agent system is trained for each production decision unit, so that each production decision unit can carry out independent decision according to information such as production environment, self state and the like.
State space and reward function design: the task allocation and resource scheduling of each production decision unit are regarded as the decision process of a single agent, and the decision process of each single agent is modeled as a partially observable Markov decision process. The state space of a single agent comprises information such as the type and quantity of production tasks, the corresponding profit value, the cost value, the current production resources and the resource requirements, and the action space of a single agent selects whether to accept a production task at the current moment, how many production resources to allocate, and so on, so the action space of a single agent is discrete. When reinforcement learning is used to solve the decision problem, the value of the optimization objective function can be used as the reward value of the reinforcement learning agent, and the reward function must therefore be designed to follow the process of optimizing the problem's objective function. In the embodiment of the application, all single agents share a reward function with the same benefit value, which can be designed according to the completion degree and completion effect of each production decision unit's tasks; the specific rules are set as follows:
1) If a subtask is not assigned to a production decision unit, the subtask risks being delayed or left incomplete in subsequent production, and the benefit value obtained by the production decision unit that subsequently receives the subtask is correspondingly lower.
2) Whether each subtask is received or completed is determined according to the production capacity and the resource demand matching degree of each production decision unit, the higher the resource demand matching degree is, the shorter the time for completing the subtask is, the higher the income is, the better the effect of completing the subtask is, and the higher the rewarding value of the corresponding production decision unit is.
3) The self information of each production decision unit can be represented by a metadata tuple ⟨(x_i, y_i), c_i1, c_i2, …, c_in⟩, where (x_i, y_i) is the production position of production decision unit i in the industrial chain system (the industrial chain system is a process of multi-chain parallel production) and c_in is the quantity of resource n held by production decision unit i. Each subtask allocated in real time likewise has a spatial position and required resource quantities, represented as ⟨(x_j, y_j), c_j1, c_j2, …, c_jn⟩. The position distance between subtask j and production decision unit i is denoted d_ij, and the matching value of production decision unit i and subtask j is a matching coefficient m_ij; the higher the matching coefficient, the higher the matching degree between production decision unit i and subtask j. The task allocation scheme of each round of decision is a = (a_1, a_2, …, a_N), where a_i is the decision result of whether production decision unit i accepts a subtask. The reward function of each task-executing agent is rew = rew_d + rew_e. In the production task allocation scenario, all production decision units, as one multi-agent system, share the same reward value, and each single agent takes maximizing this reward value as its decision target.
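The following minimal Python sketch illustrates one way the position distance d_ij, the matching coefficient m_ij and the shared reward rew = rew_d + rew_e could be computed; the Euclidean distance, the coverage-ratio matching coefficient and the specific forms of rew_d and rew_e are assumptions for illustration, not the application's exact formulas.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[float, float]

def position_distance(unit_pos: Position, task_pos: Position) -> float:
    """Assumed form of the position distance d_ij: Euclidean distance."""
    return math.sqrt((unit_pos[0] - task_pos[0]) ** 2 + (unit_pos[1] - task_pos[1]) ** 2)

def matching_coefficient(unit_resources: Dict[str, float],
                         task_demand: Dict[str, float]) -> float:
    """Assumed matching coefficient m_ij: the share of the subtask's resource
    demand that the production decision unit can cover with its own resources."""
    total = sum(task_demand.values())
    if total == 0:
        return 0.0
    covered = sum(min(unit_resources.get(r, 0.0), need) for r, need in task_demand.items())
    return covered / total

def shared_reward(decisions: List[Tuple[int, int, int]],
                  units: Dict[int, Tuple[Position, Dict[str, float]]],
                  tasks: Dict[int, Tuple[Position, Dict[str, float]]]) -> float:
    """Shared reward rew = rew_d + rew_e for one allocation round.

    decisions: (unit i, subtask j, a_i) triples with a_i in {0, 1}.
    rew_d is assumed to penalise distance and rew_e to reward matching degree;
    every agent in the multi-agent system receives the same value."""
    rew_d = rew_e = 0.0
    for i, j, accepted in decisions:
        if accepted:
            unit_pos, unit_res = units[i]
            task_pos, task_demand = tasks[j]
            rew_d -= position_distance(unit_pos, task_pos)
            rew_e += matching_coefficient(unit_res, task_demand)
    return rew_d + rew_e
```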
In the embodiment of the present application, a specific training process for performing distributed decision and centralized training on the task allocation policy of the multi-agent system by using the multi-agent reinforcement learning algorithm MADDPG includes:
(1) Initializing a network; a decision network and an evaluation network are initialized for each individual agent, respectively, the decision network comprising a local decision network and a target decision network, the evaluation network comprising a local evaluation network and a target evaluation network. The decision network is used for rapidly generating the action of the maximum expected benefit according to the observation information of each single agent, and the evaluation network is only used in model training and is used for evaluating the action generated by the decision network, so that the parameters of the decision network are assisted to be trained.
(2) Distributed decision: each single agent inputs its observation information o from the production environment into its decision network and outputs a decision action a, and the production environment feeds back a benefit record for each index r = (r_1, r_2, …, r_m); each agent then stores its interaction data for the current stage, such as the observation information o, benefits, rewards and decision actions, as a tuple (s_i, a_i, r_i) in the experience playback buffer to be used for the next model training. Each single agent learns its policy separately while taking the policies of the other agents into account, especially when computing the predicted values of the evaluation network. The observation information o comprises information such as each single agent's position and state in the industrial chain system, the position distances of other agents relative to it, the position information of the allocated subtasks and the resource requirements. Specifically, fig. 3 is a schematic diagram of the distributed decision framework based on the multi-agent reinforcement learning algorithm MADDPG according to the embodiment of the present application. The distributed decision framework includes an Actor network and a Critic network, whose architectures are shown in fig. 4. In the Actor network, the input of each single agent's decision network is that agent's observation information o, which is processed by a multi-layer neural network into a vector whose dimension is the total number of subtasks; the vector is then passed through a Softmax to obtain the selection probability of each subtask, and finally the subtask with the maximum selection probability is taken as the single agent's current action selection result. In the Critic network, the input of each single agent's evaluation network is the observation information o = (o_1, o_2, …, o_N) and the action selection information a = (a_1, a_2, …, a_N) of all single agents; this input is processed by a multi-layer neural network into a vector of dimension 1, whose value is the estimated reward when the single agents' observation information is o and the action selection result is a.
(3) Centralized training; the multi-agent system shares an experience playback buffer zone, and when each single agent takes action in the production environment, the single agent respectively collects interaction data such as respective observation information, benefits, rewards, decision actions and the like, stores all the interaction data in the experience playback buffer zone, shares respective experiences, enables other agents to learn from the experience playback buffer zone, and performs privacy protection on the interaction data in the experience playback buffer zone by combining with a subsequent federal reinforcement learning algorithm.
(4) Experience playback buffer: during model training, small batches of data are randomly extracted from the experience playback buffer, the evaluation network gradient of each single agent is calculated, then the decision network gradient of each single agent is calculated using the extracted data and the value output by the evaluation network, and the network parameters are updated with these gradients; a minimal code sketch of the Actor/Critic networks and this update step is given after step (8) below.
(5) For the evaluation network, the mean square error between the predicted Q value and the target Q value is calculated, and for the decision network, gradient ascent is used to maximize the predicted Q value.
(6) The weights of the local evaluation network and the local decision network are updated with a back-propagation algorithm, and the weights of the local evaluation network and the local decision network are copied to the target evaluation network and the target decision network in small steps (a soft update) to keep the target networks stable.
(7) Adding noise performs distributed exploration to try new policies and state spaces.
(8) And each single agent coordinates the respective actions through proper training and shared experience learning, so as to realize the common production target.
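As referenced in step (4), the following PyTorch sketch shows how the Actor (decision) and Critic (evaluation) networks and one centralized update step could look; the hidden sizes, optimizers, batching interface and soft-update details are illustrative assumptions, not the application's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Decision network: local observation o_i -> selection probabilities over subtasks."""
    def __init__(self, obs_dim, num_subtasks, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_subtasks),
        )

    def forward(self, obs):
        # Softmax over the total number of subtasks, as described for fig. 4.
        return F.softmax(self.net(obs), dim=-1)

class Critic(nn.Module):
    """Evaluation network: joint observations and actions of all agents -> scalar reward estimate."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))

def update_agent(actor, critic, target_critic, actor_opt, critic_opt,
                 own_obs, own_slice, joint_obs, joint_act, reward, next_q,
                 gamma=0.95, tau=0.01):
    """One centralized training step for a single agent on a sampled mini-batch.

    own_obs: this agent's observations; own_slice: the columns of its action
    inside joint_act; next_q: the target networks' estimate for the next step
    (assumed to be precomputed by the caller)."""
    # Evaluation network: mean squared error to the bootstrapped target Q value.
    target_q = reward + gamma * next_q
    critic_loss = F.mse_loss(critic(joint_obs, joint_act), target_q.detach())
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Decision network: gradient ascent on the predicted Q value, with this
    # agent's action replaced by the actor's current output so that the
    # gradient flows back into the decision network.
    new_act = joint_act.clone()
    new_act[:, own_slice] = actor(own_obs)
    actor_loss = -critic(joint_obs, new_act).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (small-step) copy of the local evaluation network into the target
    # evaluation network; the target decision network is updated analogously.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
```

At execution time only the Actor and the agent's local observation are used, which is what makes the deployed decision making fully distributed.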
S120: performing federal training and isomorphic encryption optimization on the interactive data stored in the experience playback buffer region through a federal reinforcement learning algorithm, and aggregating decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system;
in this step, although the distributed collaborative decision problem of the enterprise group in the industry chain system can be solved with the multi-agent reinforcement learning algorithm MADDPG, storing the interaction data of each single agent, such as the observation information o, benefits, rewards and decision actions, in the shared experience playback buffer during centralized training would leak the privacy of the enterprise group's production data. Therefore, the embodiment of the application uses federal reinforcement learning combined with homomorphic encryption to encrypt the data produced in the centralized training process of the multi-agent reinforcement learning algorithm MADDPG, so that the interaction data such as observation information o, benefits, rewards and decision actions generated by each production decision unit in every round of centralized training are protected; this solves the problems of high decision complexity, low decision timeliness and low decision quality of current centralized decision making while ensuring the data security of the enterprise group.
Specifically, the federal reinforcement learning algorithm of the embodiment of the application adopts a multiparty reinforcement learning training strategy and encrypts the enterprise group's interaction data in the experience playback buffer, such as observation information o, benefits, rewards and decision actions, using a federal learning privacy-preserving gradient aggregation mechanism and homomorphic encryption technology, so that in each round of training every production decision unit delivers its interaction data to the federal central coordination node of the federal training; the federal central coordination node performs federated averaging on the interaction data, distributes the result to each production decision unit, and sets a timeout-waiting and buffering time-limit mechanism, thereby protecting the data privacy of each production decision unit and encouraging enterprise groups to participate in training the distributed decision model.
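To make the encrypted exchange concrete, the following minimal sketch uses the additively homomorphic Paillier scheme from the python-paillier (`phe`) package as a stand-in for the fully homomorphic scheme described in the application; key distribution, parameter flattening and the two toy nodes are simplifying assumptions.

```python
from phe import paillier

# Key pair; in practice the private key would be held by a trusted party or
# split among participants, not by the federal central coordination node.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

def encrypt_upload(params):
    """An enterprise computing node encrypts its flattened parameters before upload."""
    return [public_key.encrypt(float(w)) for w in params]

def aggregate_encrypted(uploads):
    """The coordination node averages the uploads without seeing any plaintext:
    Paillier allows adding ciphertexts and multiplying them by plaintext scalars."""
    n = len(uploads)
    summed = uploads[0]
    for upload in uploads[1:]:
        summed = [a + b for a, b in zip(summed, upload)]
    return [c * (1.0 / n) for c in summed]

# Toy run with two enterprise computing nodes and three "parameters" each.
node_a = encrypt_upload([0.2, -0.5, 1.0])
node_b = encrypt_upload([0.4, 0.1, 0.0])
encrypted_avg = aggregate_encrypted([node_a, node_b])
print([round(private_key.decrypt(c), 3) for c in encrypted_avg])  # [0.3, -0.2, 0.5]
```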
Specifically, in the federal reinforcement learning algorithm, including a federal central coordination node and enterprise computing nodes (i.e., each production decision unit), as shown in fig. 5, a schematic diagram of the federal reinforcement learning algorithm according to an embodiment of the present application is shown, and a specific training process includes the following steps:
S121: initializing a global decision model through the federal central coordination node and broadcasting it to all enterprise computing nodes of the industrial chain system; meanwhile, the federal central coordination node executes the models in parallel and initializes a buffer pool of set capacity to provide a synchronous-parallel model sharing mechanism, i.e. it sets the expected number of decision networks to receive and the maximum waiting time;
s122: independently observing the environmental states through enterprise computing nodes, making decisions according to the environmental states, providing rewards and new states after the environments execute actions, optimizing self policies, and updating decision network parameters;
S123: after an enterprise computing node has performed several decision network updates, it uploads its network parameters to the federal central coordination node; to ensure the security of the model interaction data, the interaction data in the MADDPG experience playback buffer are fully homomorphically encrypted with a fully homomorphic encryption algorithm whenever an enterprise computing node exchanges them;
S124: the method comprises the steps that network parameters uploaded by all enterprise computing nodes are aggregated through a federal central coordination node by using a specific aggregation algorithm, and a global decision model is generated; the federal central coordination node does not need to wait for all enterprise computing nodes to upload network parameters, dynamically adjusts the expected parameter quantity received by the model buffer zone according to the received parameter quantity, selects partial network parameters to aggregate, completes the federal reinforcement learning of the round, and then stores aggregated experiences in an experience pool and the buffer zone;
s125: distributing the global decision model to each enterprise computing node through the federal central coordination node so as to ensure that the subsequent enterprise computing nodes update the network on the basis of the global decision model, and training from an experience pool and a buffer area sampling training set in the training process;
s126: the enterprise computing node updates the decision network into a global decision model, or establishes a personalized local decision network according to the global decision model to obtain new network parameters;
s127: and repeating S122 to S126 until the model converges, the maximum iteration number or the maximum training time is reached, and ending the federal reinforcement learning training.
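As mentioned in S124, a minimal sketch of the coordination node's aggregation step is given below, assuming plain (already decrypted) PyTorch state dicts and equal-weight averaging over whichever uploads arrived within the waiting window; the buffer-pool handling and weighting are illustrative assumptions.

```python
import copy
import torch

def federated_average(uploaded_state_dicts, expected=None):
    """Aggregate decision-network parameters uploaded by enterprise computing nodes.

    The coordinator does not wait for every node: it averages whatever subset
    of state dicts arrived within the maximum waiting time (the synchronous-
    parallel buffer pool), optionally trimming to the `expected` count."""
    if expected is not None:
        uploaded_state_dicts = uploaded_state_dicts[:expected]
    global_state = copy.deepcopy(uploaded_state_dicts[0])
    for key in global_state:
        stacked = torch.stack([sd[key].float() for sd in uploaded_state_dicts])
        global_state[key] = stacked.mean(dim=0)
    return global_state

# Each enterprise computing node then either loads the broadcast global model
# directly, e.g. actor.load_state_dict(global_state), or uses it to initialise
# a personalised local decision network (step S126).
```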
Based on the above, the enterprise group distributed decision method of the embodiment of the application combines multi-agent reinforcement learning with federal reinforcement learning. A multi-agent reinforcement learning algorithm is used for distributed execution and centralized training of the industrial chain system's task allocation strategy, matching appropriate task requirements to the resource distribution of each enterprise, which solves the distributed cooperation problem of the enterprise group and achieves a real-time distributed decision effect. A federal reinforcement learning algorithm encrypts the sensitive data produced during centralized training by means of federal training and homomorphic encryption, which guarantees the security of the multi-agent interaction data and solves the problem of sensitive data leakage from the experience pool shared by multiple enterprises in centralized training. The embodiments of the application can thus achieve fully distributed decision making in which each enterprise in the enterprise group decides individually while sharing the same overall optimization target, solving the problems of high decision complexity, low decision timeliness and low decision quality of centralized decision making in the prior art while ensuring the data security of enterprises.
Fig. 6 is a schematic structural diagram of an enterprise group distributed decision making device according to an embodiment of the present application. The enterprise group distributed decision making apparatus 40 of the present embodiment includes:
task decomposition module 41: used for decomposing the overall production task of the multi-agent system into a plurality of subtasks according to task type, and taking each single agent in the multi-agent system as an independent production decision unit; the multi-agent system is an enterprise group in an industrial chain system, and the single agents are the enterprises in the enterprise group;
multi-agent reinforcement learning module 42: used for performing distributed decision and centralized training on the task allocation strategy of the multi-agent system by utilizing a multi-agent reinforcement learning algorithm, training a decision network capable of distributed decision for each production decision unit, and placing the interaction data of each single agent during centralized training into a shared experience playback buffer;
federal reinforcement learning module 43: used for performing federal training and fully homomorphic encryption optimization on the interaction data in the experience playback buffer through a federal reinforcement learning algorithm, and aggregating the decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein again.
The device provided in the embodiment of the present application may be applied to the foregoing method embodiment, and details refer to descriptions of the foregoing method embodiment, which are not repeated herein.
Please refer to fig. 7, which is a schematic diagram of an apparatus structure according to an embodiment of the present application. The apparatus 50 comprises:
a memory 51 storing executable program instructions;
a processor 52 connected to the memory 51;
the processor 52 is configured to call the executable program instructions stored in the memory 51 and perform the following steps: decomposing the overall production task of the multi-agent system into a plurality of subtasks according to task type, and taking each single agent in the multi-agent system as an independent production decision unit, where the multi-agent system is an enterprise group in an industrial chain system and the single agents are the enterprises in the enterprise group; performing distributed decision and centralized training on the task allocation strategy of the multi-agent system by utilizing a multi-agent reinforcement learning algorithm, training a decision network capable of distributed decision for each production decision unit, and placing the interaction data of each single agent during centralized training into a shared experience playback buffer; and performing federal training and homomorphic encryption optimization on the interaction data in the experience playback buffer through a federal reinforcement learning algorithm, and aggregating the decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system.
The processor 52 may also be referred to as a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip with signal processing capability. The processor 52 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Fig. 8 is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of the embodiment of the present application stores program instructions 61 capable of implementing the following steps: decomposing the overall production task of the multi-agent system into a plurality of subtasks according to task type, and taking each single agent in the multi-agent system as an independent production decision unit, where the multi-agent system is an enterprise group in an industrial chain system and the single agents are the enterprises in the enterprise group; performing distributed decision and centralized training on the task allocation strategy of the multi-agent system by utilizing a multi-agent reinforcement learning algorithm, training a decision network capable of distributed decision for each production decision unit, and placing the interaction data of each single agent during centralized training into a shared experience playback buffer; and performing federal training and homomorphic encryption optimization on the interaction data in the experience playback buffer through a federal reinforcement learning algorithm, and aggregating the decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system. The program instructions 61 may be stored in the storage medium as a software product and include several instructions that cause a device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk or other media capable of storing program instructions, or a terminal device such as a computer, a server, a mobile phone or a tablet. The server may be an independent server, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the partitioning of elements is merely a logical functional partitioning, and there may be additional partitioning in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not implemented. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, the functional units in each embodiment of the present application may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or as software functional units. The foregoing is only an embodiment of the present application and does not thereby limit the patent scope of the present application; all equivalent structures or equivalent processes made using the contents of the specification and accompanying drawings of the present application, or applied directly or indirectly in other related technical fields, are likewise included within the patent protection scope of the present application.

Claims (10)

1. A method for distributed decision making of an enterprise group, comprising:
decomposing the whole production task of the multi-agent system into a plurality of subtasks according to the task type, and taking each single agent in the multi-agent system as an independent production decision unit; the multi-agent system is an enterprise group in an industrial chain system, and the single agent is each enterprise in the enterprise group;
carrying out distributed decision and centralized training on task allocation strategies of the multi-agent system by utilizing a multi-agent reinforcement learning algorithm, training a decision network capable of carrying out distributed decision for each production decision unit, and placing interaction data of each single agent in a shared experience playback buffer area in the centralized training process;
and performing federal training and homomorphic encryption optimization on the interaction data in the experience playback buffer through a federal reinforcement learning algorithm, and aggregating the decision networks of all production decision units by utilizing an aggregation algorithm to generate a global decision model of the multi-agent system.
2. The method of claim 1, wherein the multi-agent reinforcement learning algorithm comprises distributed task execution objective setting as well as state space and reward function design,
The task execution targets are set as follows: matching each subtask with a production decision unit according to the task type and resource requirement of the subtask, and training a decision network which can perform independent distributed decision and can autonomously cooperate with the multi-agent system for each production decision unit so that each production decision unit can perform independent decision according to the production environment and the self state;
the state space and the reward function are designed as follows: the task allocation and resource scheduling of each production decision unit are regarded as the decision process of a single agent, and the decision process of each single agent is modeled as a partially observable Markov decision process; the state space of a single agent comprises the type and quantity of production tasks, the profit value, the cost value, the production resources and the resource requirement, the action space of a single agent selects whether to accept a production task at the current moment and how many production resources to allocate, and the value of the optimized objective function is used as the reward value of the reinforcement learning agent.
3. The method for distributed decision-making of an enterprise group according to claim 2, wherein the value of the optimization objective function is taken as a reward value of the reinforcement learning agent, specifically:
each agent shares a reward function with the same benefit value, and the rules of the reward function are set as follows:
if a subtask is not assigned to a production decision unit in the current round, the benefit value obtained by the production decision unit that subsequently receives the subtask is correspondingly lower;
determining whether each subtask is received or completed according to the production capacity and the resource demand matching degree of each production decision unit, wherein the higher the resource demand matching degree and the more sufficient the production capacity, the higher the reward value of the corresponding production decision unit;
the self information of each production decision unit is represented by a metadata tuple <(x_i, y_i), c_i1, c_i2, …, c_in>, wherein (x_i, y_i) represents the production position of production decision unit i in the industrial chain system and c_in represents the quantity of resource n held by production decision unit i; the spatial position of a subtask allocated in real time and the quantities of resources it requires are represented as <(x_j, y_j), c_j1, c_j2, …, c_jn>; the position distance between subtask j and production decision unit i is computed from these two positions, and the matching value of production decision unit i and subtask j is a matching coefficient, where a higher matching coefficient indicates a higher matching degree between production decision unit i and subtask j; the task allocation scheme of each decision round is a = (a_1, a_2, …, a_N), where a_i is the decision result of whether production decision unit i receives a subtask, and the reward function of each agent is expressed as rew = rew_d + rew_e.
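A hedged sketch of how the matching-based reward of claim 3 could be computed. The claim's own distance and matching formulas are not reproduced in this text, so the Euclidean distance and the coverage-based matching coefficient below are assumptions chosen only to illustrate the stated monotonic relations (a closer position and a better resource match yield a higher reward); the function and variable names are hypothetical.

import math

def matching_reward(unit, task):
    # unit = ((x_i, y_i), [c_i1, ..., c_in]); task = ((x_j, y_j), [c_j1, ..., c_jn])
    (xi, yi), unit_res = unit
    (xj, yj), task_res = task
    # Assumed Euclidean position distance between production decision unit i and subtask j
    dist = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2)
    # Assumed matching coefficient: fraction of the subtask's resource demand the unit can cover
    covered = sum(min(c_u, c_t) for c_u, c_t in zip(unit_res, task_res))
    demanded = sum(task_res) or 1.0
    match_coeff = covered / demanded
    # Distance component and matching component of the shared reward,
    # mirroring rew = rew_d + rew_e in claim 3 (the weighting is illustrative)
    rew_d = 1.0 / (1.0 + dist)
    rew_e = match_coeff
    return rew_d + rew_e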
4. The method for distributed decision making of an enterprise group according to claim 3, wherein the performing distributed decision making and centralized training on the task allocation strategy of the multi-agent system by using a multi-agent reinforcement learning algorithm comprises:
respectively initializing a decision network and an evaluation network for each single agent, wherein the decision network comprises a local decision network and a target decision network, and the evaluation network comprises a local evaluation network and a target evaluation network; the decision network is used for generating the action with the maximum expected benefit according to the observation information of each single agent, and the evaluation network is used for evaluating the action generated by the decision network;
inputting the observation information o of each single agent in the production environment into the decision network and outputting a decision action a, the benefit record of each index fed back by the production environment being r = (r_1, r_2, …, r_m), and then storing the observation information, benefit, reward and decision action of each single agent at the current stage as an interaction data set (s_i, a_i, r_i) in the experience replay buffer; the observation information o comprises the position information and state information of each single agent in the industrial chain system, the position distances of the other agents relative to it, the position information of the subtask allocation and the resource requirements;
and randomly extracting a small batch of data from the experience replay buffer, calculating the evaluation network gradient of each single agent, calculating the decision network gradient of each single agent by using the extracted data and the value output by the evaluation network, and updating the network parameters by using these gradients.
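A compact sketch of the centralized-training step of claim 4 in a MADDPG-style setting: sample a mini-batch from the shared experience replay buffer, update each agent's evaluation (critic) network toward a target value, then update its decision (actor) network with the critic's gradient. The Agent objects, their optimizers, the buffer layout and the hyperparameters are assumptions introduced only to make the step concrete.

import random
import torch
import torch.nn.functional as F

def maddpg_train_step(agents, replay_buffer, batch_size=64, gamma=0.99):
    # Each buffer entry is assumed to be (obs, actions, rewards, next_obs), stacked per agent.
    batch = random.sample(replay_buffer, batch_size)
    obs, acts, rews, next_obs = [
        torch.stack([torch.as_tensor(b[k], dtype=torch.float32) for b in batch])
        for k in range(4)]
    n = len(agents)
    for i, ag in enumerate(agents):
        # Evaluation (critic) network update: regress Q(o, a) toward r_i + gamma * target Q(o', a')
        with torch.no_grad():
            next_a = torch.cat([a.target_actor(next_obs[:, j]) for j, a in enumerate(agents)], -1)
            y = rews[:, i:i + 1] + gamma * ag.target_critic(next_obs.flatten(1), next_a)
        q = ag.critic(obs.flatten(1), acts.flatten(1))
        critic_loss = F.mse_loss(q, y)
        ag.critic_opt.zero_grad(); critic_loss.backward(); ag.critic_opt.step()
        # Decision (actor) network update: maximize the critic's value of the agent's own action
        joint = [acts[:, j] for j in range(n)]
        joint[i] = ag.actor(obs[:, i])    # only agent i's action carries a gradient
        actor_loss = -ag.critic(obs.flatten(1), torch.cat(joint, -1)).mean()
        ag.actor_opt.zero_grad(); actor_loss.backward(); ag.actor_opt.step()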
5. The method for distributed decision making for an enterprise group according to claim 4, wherein said utilizing a multi-agent reinforcement learning algorithm to perform distributed decision making and centralized training on task allocation policies of said multi-agent system further comprises:
the distributed decision framework comprises an Actor network and a Critic network; in the Actor network, the input of the decision network of each single agent is the observation information o of that agent, which is processed by a multi-layer neural network to output a vector whose dimension equals the total number of subtasks; the vector is then processed by a Softmax algorithm to obtain the selection probability of each subtask, and the subtask with the maximum selection probability is taken as the current action selection result of the single agent; in the Critic network, the input of the evaluation network of each single agent is the observation information o = (o_1, o_2, …, o_N) and the action selection information a = (a_1, a_2, …, a_N) of all single agents, and this input information is processed by a multi-layer neural network to output a vector of dimension 1, whose value is the estimated reward when the observation information of the single agents is o and the action selection result is a.
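A minimal sketch of the network shapes described in claim 5: the Actor maps one agent's observation to a Softmax distribution over the subtasks, and the centralized Critic maps the concatenated observations and action selections of all agents to a single scalar reward estimate. Hidden-layer widths and depths are illustrative assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, num_subtasks, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_subtasks))

    def forward(self, obs):
        # Softmax over subtasks; the agent acts on the highest-probability entry
        return torch.softmax(self.net(obs), dim=-1)

class CentralCritic(nn.Module):
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))   # single scalar reward estimate

    def forward(self, joint_obs, joint_acts):
        return self.net(torch.cat([joint_obs, joint_acts], dim=-1))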
6. The method according to any one of claims 1 to 5, wherein the federated training and fully homomorphic encryption optimization performed on the interaction data stored in the experience replay buffer through a federated reinforcement learning algorithm is specifically:
the federated reinforcement learning algorithm adopts a multi-party reinforcement learning training strategy, and encrypts the interaction data in the experience replay buffer by using the privacy-preserving gradient aggregation mechanism of federated learning together with fully homomorphic encryption technology, so that in each round of training each production decision unit delivers its interaction data to the federated central coordination node of the federated training, and the federated central coordination node performs federated averaging on the interaction data and distributes the result to each production decision unit.
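A hedged illustration of aggregating encrypted uploads at the central coordination node. Claim 6 calls for fully homomorphic encryption; the sketch below instead uses the additively homomorphic Paillier scheme from the python-paillier (phe) package, purely because federated averaging only needs encrypted addition and scalar multiplication. It is a stand-in for, not a statement of, the encryption scheme used by the patent.

from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

def encrypt_params(params):
    # Each production decision unit encrypts its parameter vector before upload.
    return [public_key.encrypt(float(p)) for p in params]

def aggregate_encrypted(uploads):
    # The coordination node averages ciphertexts without ever seeing plaintexts:
    # addition and multiplication by a plain scalar are performed homomorphically.
    n = len(uploads)
    summed = [sum(column) for column in zip(*uploads)]
    return [s * (1.0 / n) for s in summed]

# Usage: nodes call encrypt_params, the coordinator calls aggregate_encrypted, and the
# holder of the private key recovers the averaged plaintext parameters, for example:
# averaged = [private_key.decrypt(c) for c in aggregate_encrypted(uploads)]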
7. The enterprise group distributed decision making method of claim 6, wherein the training process of the federated reinforcement learning algorithm comprises:
initializing a global decision model through the federated central coordination node and broadcasting it to each enterprise computing node of the industrial chain system, wherein each enterprise computing node corresponds to a production decision unit; performing model-parallel operation through the federated central coordination node, initializing a buffer pool of set capacity to provide a synchronous parallel model sharing mechanism, and setting the expected number of decision networks to be received and the maximum waiting time;
independently observing the environment state through each enterprise computing node and making decisions according to the environment state, the environment providing a reward and a new state after executing the action, with which the node optimizes its own policy and updates its decision network parameters;
uploading the network parameters to the federated central coordination node after the enterprise computing nodes have updated their decision networks a number of times, and fully homomorphically encrypting the interaction data with a fully homomorphic encryption algorithm whenever the enterprise computing nodes exchange the interaction data in the experience replay buffer;
aggregating, by the federated central coordination node and using a specific aggregation algorithm, the network parameters uploaded by each enterprise computing node to generate a global decision model, and distributing the global decision model to each enterprise computing node;
updating, by each enterprise computing node, its decision network to the global decision model.
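To make the coordination flow of claim 7 concrete, a sketch of one federated round using plain (unencrypted) federated averaging; in the claimed scheme the uploads would be homomorphically encrypted as in the previous sketch. The node interface (receive_global, poll_update), the expected-update count and the timeout are hypothetical stand-ins for the claim's buffer pool, expected number of decision networks and maximum waiting time.

import time
import numpy as np

def coordinator_round(nodes, global_params, expected=3, max_wait_s=30.0):
    # Broadcast the current global decision model to every enterprise computing node.
    for node in nodes:
        node.receive_global(global_params)
    # Collect locally trained updates until enough arrive or the waiting time runs out.
    buffer_pool, deadline = [], time.time() + max_wait_s
    while len(buffer_pool) < expected and time.time() < deadline:
        for node in nodes:
            update = node.poll_update()      # local observation, decision and training happen on the node
            if update is not None:
                buffer_pool.append(update)
        time.sleep(0.1)
    if not buffer_pool:                      # no updates arrived in time; keep the old model
        return global_params
    # FedAvg-style aggregation: element-wise mean of the uploaded parameter vectors, per layer.
    new_global = [np.mean(layer, axis=0) for layer in zip(*buffer_pool)]
    # Redistribute the aggregated global decision model to every node.
    for node in nodes:
        node.receive_global(new_global)
    return new_global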
8. An enterprise group distributed decision making apparatus, comprising:
the task decomposition module: used for decomposing the overall production task of the multi-agent system into a plurality of subtasks according to task type, and taking each single agent in the multi-agent system as an independent production decision unit; the multi-agent system is an enterprise group in an industrial chain system, and the single agents are the individual enterprises in the enterprise group;
multi-agent reinforcement learning module: used for carrying out distributed decision making and centralized training on the task allocation strategies of the multi-agent system by means of a multi-agent reinforcement learning algorithm, and placing the interaction data of each single agent during the centralized training process into the shared experience replay buffer;
federated reinforcement learning module: used for performing federated training and fully homomorphic encryption optimization on the interaction data in the experience replay buffer through a federated reinforcement learning algorithm, and aggregating the decision networks of all production decision units by means of an aggregation algorithm to generate a global decision model of the multi-agent system.
9. An apparatus, comprising a processor and a memory coupled to the processor, wherein:
the memory stores program instructions for implementing the enterprise group distributed decision method of any one of claims 1 to 7;
the processor is configured to execute the program instructions stored in the memory to carry out the enterprise group distributed decision method.
10. A storage medium having stored thereon program instructions executable by a processor for performing the enterprise group distributed decision method of any one of claims 1 to 7.

Priority Applications (1)

Application Number: CN202311727686.XA (publication CN117852745A) · Priority Date: 2023-12-14 · Filing Date: 2023-12-14 · Title: Enterprise group distributed decision method, device, equipment and storage medium

Publications (1)

Publication Number: CN117852745A · Publication Date: 2024-04-09

Family

ID=90540970

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination