WO2023082552A1 - Distributed model training method, system and related device - Google Patents

Distributed model training method, system and related device

Info

Publication number
WO2023082552A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
terminal
action
online network
task
Prior art date
Application number
PCT/CN2022/088702
Other languages
English (en)
French (fr)
Inventor
任涛
何航
谷宁波
牛建伟
戴彬
邱源
胡哲源
胡舒程
姚依明
李青锋
Original Assignee
北京航空航天大学杭州创新研究院
Priority date
Filing date
Publication date
Application filed by 北京航空航天大学杭州创新研究院
Publication of WO2023082552A1 publication Critical patent/WO2023082552A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present application relates to the field of control, in particular, to a distributed model training method, system and related devices.
  • the task offloading strategy is used to redistribute terminal tasks in the edge device. For example, terminal tasks are executed locally on the edge device or offloaded to a server in the cloud for execution.
  • the embodiments of the present application provide a distributed model training method, system and related devices, which may include:
  • Some embodiments of the present application provide a distributed model training method, which is applied to a decision-making system deployed with a DDPG (Deep Deterministic Policy Gradient) model, and the decision-making system includes a management device and a plurality of terminal devices,
  • the DDPG model includes a Critic network and an Actor network
  • the Actor network includes a first online network and a second online network
  • each of the terminal devices is deployed with the first online network
  • the management device is deployed with the Critic network and the second online network
  • the method may include:
  • the model training process may include:
  • For each terminal device, the terminal device generates an action corresponding to the first device state through the first online network according to its own first device state;
  • the terminal device stores the policy experience corresponding to the first device state in an experience pool, wherein the policy experience includes the first device state, the action, the second device state after the action is executed, and the immediate reward for the action;
  • the management device samples the experience pool to obtain policy samples
  • the management device adjusts the model parameters of the second online network through the critic network according to the policy sample
  • the management device synchronizes the adjusted second online network to each of the first online networks, so that each synchronized first online network has the same parameters as the adjusted second online network.
  • the action may include executing the terminal task locally or offloading it to a server for execution; before the terminal device stores the first device state, the action, the second device state after executing the action, and the immediate reward of the action in the experience pool, the management device determines the immediate reward of the action through the expression R_t, where R_t is:
  • the expression R_t may also be configured with at least one constraint condition; when any constraint condition is met, a penalty factor is generated, and the penalty factor is used to reduce the immediate reward;
  • the constraints may include:
  • the computing resources required to execute the terminal task cannot exceed the respective upper resource limits of the terminal device and the server;
  • the terminal task is only allowed to be executed on the terminal device or the server;
  • the delay in executing the terminal task cannot exceed a duration threshold
  • the energy consumption for executing the terminal task cannot exceed the respective energy storage upper limits of the terminal device and the server.
  • the decision-making system is deployed with a DDPG model.
  • the decision-making system may include a management device and multiple terminal devices.
  • the DDPG model includes a Critic network and an Actor network, and the Actor
  • the network includes a first online network and a second online network, each terminal device is deployed with the first online network, and the management device is deployed with the critic network and the second online network;
  • the model training process may include:
  • For each terminal device, the terminal device is configured to generate an action corresponding to the first device state through the first online network according to its own first device state;
  • the terminal device is configured to store policy experience corresponding to the first device state in an experience pool, wherein the policy experience includes the first device state, the action, and the second device state after executing the action and an immediate reward for said action;
  • the management device is used to sample the experience pool to obtain policy samples
  • the management device is further configured to adjust the model parameters of the second online network through the critic network according to the policy sample;
  • the management device synchronizes the adjusted second online network to each of the first online networks, so that each synchronized first online network has the same parameters as the adjusted second online network.
  • the action may include executing the terminal task locally or offloading it to a server for execution; before the terminal device stores the first device state, the action, the second device state after executing the action, and the immediate reward of the action in the experience pool, the management device determines the immediate reward of the action through the expression R_t, where R_t is:
  • the expression R_t may also be configured with at least one constraint condition; when any constraint condition is met, a penalty factor is generated, and the penalty factor is used to reduce the immediate reward;
  • the constraints may include:
  • the computing resources required to execute the terminal task cannot exceed the respective upper resource limits of the terminal device and the server;
  • the terminal task is only allowed to be executed on the terminal device or the server;
  • the delay in executing the terminal task cannot exceed a duration threshold
  • the energy consumption for executing the terminal task cannot exceed the respective energy storage upper limits of the terminal device and the server.
  • Still other embodiments of the present application provide a distributed model training method, which is applied to a management device in a decision-making system, where the management device communicates with multiple terminal devices in the decision-making system, and the decision-making system is deployed with a DDPG model; the DDPG model includes a Critic network and an Actor network, the Actor network includes a first online network and a second online network, each of the terminal devices is deployed with the first online network, and the management device is deployed with the Critic network and the second online network. The method may include:
  • the model training process includes:
  • sampling an experience pool to obtain a policy sample, wherein the experience pool is used to store the policy experience generated by each terminal device through the first online network according to its own first device state, and the policy experience includes the first device state, the action, a second device state after performing the action, and an immediate reward for the action;
  • the adjusted second online network is synchronized to each of the first online networks, so that each synchronized first online network has the same model parameters as the adjusted second online network.
  • the action may include executing the terminal task locally or offloading it to a server for execution; before the terminal device stores the first device state, the action, the second device state after executing the action, and the immediate reward of the action in the experience pool, the management device determines the immediate reward of the action through the expression R_t, where R_t is:
  • the expression R_t may also be configured with at least one constraint condition; when any constraint condition is met, a penalty factor is generated, and the penalty factor is used to reduce the immediate reward;
  • the constraints may include:
  • the computing resources required to execute the terminal task cannot exceed the respective upper resource limits of the terminal device and the server;
  • the terminal task is only allowed to be executed on the terminal device or the server;
  • the delay in executing the terminal task cannot exceed a duration threshold
  • the energy consumption for executing the terminal task cannot exceed the respective energy storage upper limits of the terminal device and the server.
  • the DDPG model includes a Critic network and an Actor network, the Actor network includes a first online network and a second online network, the terminal device is deployed with the first online network, and the management device is deployed with the Critic network and the second online network; the method may include:
  • storing the policy experience corresponding to the first device state in an experience pool, where the policy experience includes the first device state, the action, the second device state after the action is executed, and the immediate reward;
  • the management device adjusts the model parameters of the second online network through the Critic network according to the policy samples to obtain the adjusted second online network, and the policy samples are sampled from the experience pool.
  • the action may include executing the terminal task locally or offloading it to a server for execution; before the terminal device stores the first device state, the action, the second device state after executing the action, and the immediate reward of the action in the experience pool, the management device determines the immediate reward of the action through the expression R_t, where R_t is:
  • the expression R_t may also be configured with at least one constraint condition; when any constraint condition is met, a penalty factor is generated, and the penalty factor is used to reduce the immediate reward;
  • the constraints may include:
  • the computing resources required to execute the terminal task cannot exceed the respective upper resource limits of the terminal device and the server;
  • the terminal task is only allowed to be executed on the terminal device or the server;
  • the delay in executing the terminal task cannot exceed a duration threshold
  • the energy consumption for executing the terminal task cannot exceed the respective energy storage upper limits of the terminal device and the server.
  • Yet another embodiment of the present application provides a distributed model training device, which is applied to a management device in a decision-making system; the management device communicates with multiple terminal devices in the decision-making system, the decision-making system is deployed with a DDPG model, and the DDPG model includes a Critic network and an Actor network; the Actor network includes a first online network and a second online network, each terminal device is deployed with the first online network, and the management device is deployed with the Critic network and the second online network.
  • the distributed model training device may include:
  • the model iteration module may be configured to perform at least one model training process until the DDPG model meets a preset convergence condition
  • the distributed model training device also includes:
  • the experience sampling module may be configured to sample an experience pool to obtain policy samples, wherein the experience pool is used to store the policy experience generated by each terminal device through the first online network according to its own first device state, and the policy experience includes the first device state, the action, the second device state after performing the action, and an immediate reward for the action;
  • the model adjustment module may be configured to adjust the model parameters of the second online network through the critic network according to the policy samples;
  • the first synchronization module may be configured to synchronize the adjusted second online network to each of the first online networks when the preset synchronization condition is satisfied, so that each synchronized first online network has the same parameters as the adjusted second online network.
  • the action may include executing the terminal task locally or offloading it to a server for execution; before the terminal device stores the first device state, the action, the second device state after executing the action, and the immediate reward of the action in the experience pool, the management device determines the immediate reward of the action through the expression R_t, where R_t is:
  • the expression R_t may also be configured with at least one constraint condition; when any constraint condition is met, a penalty factor is generated, and the penalty factor is used to reduce the immediate reward;
  • the constraints may include:
  • the computing resources required to execute the terminal task cannot exceed the respective upper resource limits of the terminal device and the server;
  • the terminal task is only allowed to be executed on the terminal device or the server;
  • the delay in executing the terminal task cannot exceed a duration threshold
  • the energy consumption for executing the terminal task cannot exceed the respective energy storage upper limits of the terminal device and the server.
  • Still other embodiments of the present application provide a distributed model training device, which is applied to a terminal device in a decision-making system, where the terminal device communicates with a management device in the decision-making system, and the decision-making system is deployed with a DDPG model; the DDPG model includes a Critic network and an Actor network, the Actor network includes a first online network and a second online network, the terminal device is deployed with the first online network, and the management device is deployed with the Critic network and the second online network. The distributed model training device may include:
  • a status acquisition module, which may be configured to acquire the terminal device's own first device state;
  • a policy generation module configured to generate an action corresponding to the first device state through the first online network according to the first device state
  • the experience generation module may be configured to store the policy experience corresponding to the first device state in an experience pool, wherein the policy experience includes the first device state, the action, the second device state after performing the action, and the immediate reward for the action;
  • the second synchronization module may be configured to receive the adjusted second online network, wherein the management device adjusts the model parameters of the second online network through the Critic network according to the policy samples to obtain the adjusted second online network, and the policy samples are sampled from the experience pool.
  • the second synchronization module may also be configured to synchronize the adjusted second online network to the first online network, so that the synchronized first online network and the adjusted second online network have the same model parameters.
  • the action may include executing the terminal task locally or offloading it to a server for execution; before the terminal device stores the first device state, the action, the second device state after executing the action, and the immediate reward of the action in the experience pool, the management device determines the immediate reward of the action through the expression R_t, where R_t is:
  • the expression R_t may also be configured with at least one constraint condition; when any constraint condition is met, a penalty factor is generated, and the penalty factor is used to reduce the immediate reward;
  • the constraints may include:
  • the computing resources required to execute the terminal task cannot exceed the respective upper resource limits of the terminal device and the server;
  • the terminal task is only allowed to be executed on the terminal device or the server;
  • the delay in executing the terminal task cannot exceed a duration threshold
  • the energy consumption for executing the terminal task cannot exceed the respective energy storage upper limits of the terminal device and the server.
  • Still other embodiments of the present application provide an electronic device; the electronic device includes a processor and a memory, the memory stores a computer program, and when the computer program is executed by the processor, the distributed model training method run by the management device or the terminal device is implemented.
  • Some other embodiments of the present application provide a computer storage medium, where the computer storage medium stores a computer program, and when the computer program is executed by a processor, the distributed model training method run by the management device or the terminal device is implemented.
  • the policy system deployed with the DDPG model provided in this embodiment includes a management device and multiple terminal devices.
  • the DDPG model includes a critic network and an actor network.
  • the actor network includes a first online network and a second online network.
  • Each terminal device is deployed with the first online network
  • the management device is deployed with the critic network and the second online network.
  • The policy samples used to train the second online network are collected from the experience pool and are generated by each terminal device through the first online network deployed on itself; therefore, the state space of the policy samples only involves a single terminal device. As a result, this method not only avoids the time-consuming collection of the global state, but also reduces the dimension of the state space.
  • FIG. 1 is a schematic structural diagram of a policy system provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of the distributed model training method applied to the policy system provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the DDPG model provided by an embodiment of the present application.
  • FIG. 4 shows the distributed model training method applied to a management device provided by an embodiment of the present application.
  • FIG. 5 shows the distributed model training device applied to a management device provided by an embodiment of the present application.
  • FIG. 6 shows the distributed model training method applied to a terminal device provided by an embodiment of the present application.
  • FIG. 7 shows the distributed model training device applied to a terminal device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Reference numerals: 101 - model iteration module; 102 - experience sampling module; 103 - model adjustment module; 104 - first synchronization module; 201 - state acquisition module; 202 - policy generation module; 203 - experience generation module; 204 - second synchronization module; 320 - memory; 330 - processor; 340 - communication unit.
  • Smart devices, as edge devices in the network, can choose to execute terminal tasks locally or offload terminal tasks to a server for remote execution according to task requirements, local computing resources, base station computing resources, and the like, thereby reducing the experienced delay of applications and improving the service quality of the network.
  • MEC (Mobile Edge Computing)
  • an effective task offloading strategy is particularly important to achieve satisfactory service quality in MEC network systems.
  • a technology based on reinforcement learning to obtain an approximately optimal computing offload scheduling strategy for MEC network systems has been proposed.
  • For example, a task offloading strategy determination algorithm based on DDQN (Double Deep Q-Network) maximizes the cumulative MEC utility according to the task queue, the energy queue, and the wireless channel conditions;
  • and a task offloading strategy determination algorithm based on DRL (Deep Reinforcement Learning) can obtain near-optimal task offloading and resource allocation strategies without using traditional numerical algorithms to solve difficult optimization problems.
  • Other approaches use search-based algorithms, such as heuristics, coordinate descent, and genetic algorithms, to solve the problem.
  • For example, a heuristic algorithm iteratively adjusts the binary offloading decisions of smart devices in the MEC network system so that the delay and energy consumption of the entire mobile edge computing system are minimized.
  • this embodiment provides a distributed model training method, which is applied to a policy system deployed with a DDPG model.
  • The policy system is not limited to the above-mentioned MEC network system; it may also be a monitoring system in the security field or a communication system in a scenario with a surge in foot traffic (for example, a football stadium).
  • The DDPG model belongs to reinforcement learning methods implemented in the form of neural networks. Therefore, the DDPG model also involves the state, policy, action, immediate reward, and Q value (also known as the action value: the expected return obtained by taking action a_t in state s_t according to policy π and then continuing to follow policy π). Since the DDPG model is developed from the DQN model, similar to the DQN model, the DDPG model also uses a neural network to fit the Q value of the agent; that is, the current state of the agent and the action taken in the current state are evaluated through the neural network, and the evaluation result is the Q value.
  • the neural network is called the Critic network in the DDPG model.
  • the difference from the DQN model is that the greedy strategy is used in the DQN model to select the action with the largest Q value, while the DDPG model generates the action that should be taken in the current state through the Actor network.
  • actor network is used to fit the policy function, that is, it can generate the actions that the agent should take in the current state according to the current state of the agent.
  • Critic network is used to fit the action value function, that is, it can determine the Q value of the action generated by the Actor network based on the current state.
  • the Actor network is designed to include the Actor online network and the Actor target network;
  • the Critic network is designed to include the Critic online network and the Critic target network.
  • In related technologies, the above-mentioned Actor online network, Actor target network, Critic online network, and Critic target network in the DDPG model are usually deployed on the same device for training.
  • Based on the current state s of the agent, the Actor online network generates the action a that the agent should take in the current state, and the immediate reward r after executing action a in state s is calculated according to the immediate-reward function, as well as the new state s' of the agent after action a.
  • N represents the preset number of policy experiences;
  • Q(s_i, a_i) represents the Q value fitted by the Critic online network when (s_i, a_i) from the i-th group of policy experience (s_i, a_i, r_i, s_i') is input into the Critic online network;
  • y_i represents the Bellman-equation target corresponding to the Q value of (s_i, a_i), where a' represents all possible actions in state s_i', and the maximum Q value over all such actions is obtained by inputting (s_i', a_i') into the Critic target network; a_i' is generated by the Actor target network according to the state s_i' with the introduced noise added.
  • the model parameters of the Actor online network are synchronized to the Actor target network; the model parameters of the Critic online network are synchronized to the Critic target network.
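  • To make the conventional flow above concrete, the following is a minimal sketch of a single-device DDPG critic update, assuming PyTorch and illustrative network sizes, learning rate, discount factor, and noise scale (none of these values come from the patent); the target y_i and the fitted Q(s_i, a_i) follow the description above.
```python
# Hedged sketch of the conventional (single-device) DDPG critic update described above.
# Dimensions, hyper-parameters and the noise scale are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 6, 2            # assumed dimensions
gamma, noise_std = 0.99, 0.1            # assumed discount factor and target-action noise

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

actor_online, actor_target = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
critic_online, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_target.load_state_dict(actor_online.state_dict())        # synchronized as described above
critic_target.load_state_dict(critic_online.state_dict())
critic_opt = torch.optim.Adam(critic_online.parameters(), lr=1e-3)

def critic_update(s, a, r, s_next):
    """One critic step on a batch of N policy experiences (s_i, a_i, r_i, s_i')."""
    with torch.no_grad():
        a_next = actor_target(s_next)                            # action from the Actor target network
        a_next = a_next + noise_std * torch.randn_like(a_next)   # introduced noise
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=1))   # Bellman target y_i
    q = critic_online(torch.cat([s, a], dim=1))                  # Q(s_i, a_i) fitted by the Critic online network
    loss = F.mse_loss(q, y)                                      # mean squared error over the N samples
    critic_opt.zero_grad(); loss.backward(); critic_opt.step()
    return loss.item()

# Example call with random tensors; r must have shape (N, 1) to match the critic output.
critic_update(torch.randn(32, state_dim), torch.randn(32, action_dim),
              torch.randn(32, 1), torch.randn(32, state_dim))
```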
  • the decision-making system includes a management device and multiple terminal devices.
  • the DDPG model includes a Critic network and an Actor network (Actor Network).
  • the Actor network includes a first online network and a second online network.
  • Each terminal device is deployed with a first online network, which serves as an agent in the reinforcement learning model, and the management device is deployed with a Critic network and a second online network. Since the first online network deployed on each terminal device belongs to the Actor network, the terminal device can generate corresponding actions based on its own current state, so as to overcome the defects of the centralized scheduling method.
  • the management device may be a server communicatively connected to a plurality of terminal devices.
  • For example, the server may be a Web (website) server or an FTP (File Transfer Protocol) server.
  • the server can be a single server or a group of servers. Server groups can be centralized or distributed (for example, the servers can be a distributed system).
  • the server 100 may be local or remote relative to the user terminal.
  • the server 100 can be implemented on a cloud platform; only as an example, the cloud platform can include private cloud, public cloud, hybrid cloud, community cloud (Community Cloud), distributed cloud, inter-cloud (Inter-Cloud) , Multi-Cloud, etc., or any combination of them.
  • server 100 may be implemented on an electronic device having one or more components.
  • the management device may serve as a server for offloading terminal tasks from the terminal device.
  • the object for the terminal device to offload the terminal task may be another server different from the management device.
  • the terminal device may be, but not limited to, a mobile terminal, a tablet computer, a laptop computer, or a built-in device in a motor vehicle, etc., or any combination thereof.
  • the mobile terminal may include smart home devices, wearable devices, smart mobile devices, virtual reality devices, or augmented reality devices, etc., or any combination thereof.
  • smart home devices may include smart lighting devices, control devices for smart electrical devices, smart monitoring devices, smart TVs, smart cameras, or walkie-talkies, etc., or any combination thereof.
  • wearable devices may include smart bracelets, smart shoelaces, smart glasses, smart helmets, smart watches, smart clothing, smart backpacks, smart accessories, etc., or any combination thereof.
  • the smart mobile device may include a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a game device, a navigation device, or a point of sale (Point of Sale, POS) device, etc., or any combination thereof.
  • the distributed model training method applied to the policy system will be described in detail below in conjunction with the flow chart shown in FIG. 2 .
  • the method may include:
  • For each terminal device, the terminal device generates an action corresponding to the first device state through the first online network according to its own first device state.
  • In related technologies, the Actor network and the Critic network are deployed on the same device, and a fully centralized scheduling method is then used to generate a scheduling policy for all terminal tasks, which requires obtaining the respective first device states of all terminal devices.
  • In this embodiment, the Actor network includes a first online network and a second online network; the first online network and the second online network have the same network structure and satisfy a synchronization relationship. Moreover, the first online network is deployed on each terminal device, so that when each terminal device generates an action corresponding to its own first device state, it only needs to pay attention to its own state information.
  • the terminal device stores the policy experience corresponding to the first device state in an experience pool.
  • The policy experience includes the first device state, the action, the second device state after the action is executed, and the immediate reward of the action. It should be noted that, in order to distinguish the current state of the terminal device from the state after the action is executed, in this example the current state of the terminal device is called the first device state, and the state after the action is executed is called the second device state.
  • The experience pool (Experience Buffer) may be a preset storage space provided by the management device for storing a preset amount of policy experience. Therefore, the terminal device can send the constructed policy experience to the management device through the network, and the management device stores it in the preset storage space.
  • This embodiment also designs a corresponding immediate-reward function for the usage scenario of the policy system; therefore, the terminal device can send the first device state, the action taken in the first device state, and the second device state to the management device, so that the management device determines the immediate reward of the action through the immediate-reward function according to the first device state, the action, and the second device state.
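  • As a concrete illustration of the terminal-device side of these steps, the following is a minimal sketch in which a terminal observes only its own first device state, generates an action through its local first online network, and appends the resulting policy experience to the pool; the helper names and the placeholder policy are assumptions, not the patent's API.
```python
# Hedged sketch of the terminal-device side of one interaction step.
# The observation, execution and transport logic belong to the decision-making system
# and are stubbed out here with placeholder callables.
import random

experience_pool = []            # stand-in for the experience pool kept by the management device

def first_online_network(state):
    """Placeholder policy: a binary offloading choice (0 = local, 1 = offload) plus a power fraction."""
    return (random.randint(0, 1), random.random())

def terminal_step(observe, execute):
    s1 = observe()                                    # first device state, observed locally only
    action = first_online_network(s1)                 # action generated from the terminal's own state
    s2, reward = execute(action)                      # second device state and immediate reward
    experience_pool.append((s1, action, s2, reward))  # policy experience sent to the pool

# Example usage with dummy environment callbacks.
terminal_step(lambda: (0.0, 0.0, 1.0), lambda a: ((0.1, 0.1, 0.9), -0.5))
```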
  • the actions that the terminal device needs to perform in the first device state include executing the terminal task locally or offloading it to a server for execution.
  • The management device determines the immediate reward of the action through the expression R_t; that is, R_t is the immediate-reward function, and the corresponding expression is:
  • This embodiment also designs constraint conditions for the reward function R_t: if the terminal device satisfies any constraint condition after performing an action, a penalty factor is generated, and this penalty factor is used to reduce the immediate reward (a hedged sketch is given after the constraint list below).
  • Constraints include:
  • the computing resources required to perform terminal tasks cannot exceed the respective resource limits of the terminal device and the server;
  • Terminal tasks are only allowed to be executed on the user terminal or server
  • the delay in executing terminal tasks cannot exceed the duration threshold
  • the energy consumption for executing terminal tasks cannot exceed the respective energy storage limits of the terminal device and the server.
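  • The exact expression R_t is not reproduced in this text; the sketch below only illustrates the mechanism described above, assuming the reward is the negative system cost reduced by an assumed penalty value whenever any of the four constraints is not respected.
```python
# Hedged sketch of an immediate reward with a penalty factor; the cost term and the
# penalty value are assumptions, since the patent's expression R_t is not reproduced here.
def immediate_reward(cost, resource_ok, binary_offload_ok, delay_ok, energy_ok, penalty=10.0):
    r = -cost                     # lower delay/energy cost yields a larger reward
    if not (resource_ok and binary_offload_ok and delay_ok and energy_ok):
        r -= penalty              # the penalty factor reduces the immediate reward
    return r
```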
  • the management device samples the experience pool to obtain a policy sample.
  • the management device adjusts the model parameters of the second online network through the critic network according to the policy sample.
  • the management device synchronizes the adjusted second online network to each first online network, so that each synchronized first online network has the same parameters as the adjusted second online network.
  • The preset synchronization condition is that the number of executions of the model training process reaches a preset iteration period. For example, after 5 iterations of the model parameters of the Actor target network, the Actor target network obtained after the fifth iteration is synchronized to each terminal device. Since the Actor target network has the same network structure as the Actor online network in each terminal device, the management device only needs to deliver the model parameters of the Actor target network after the 5 iterations to each terminal device.
  • The management device judges whether the DDPG model satisfies the preset convergence condition; if so, step S107A is executed to obtain the pre-trained DDPG model; if not, the process returns to S101A for another iteration.
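  • The following is a minimal sketch of this management-device loop (sampling, Critic-driven adjustment of the second online network, periodic synchronization, convergence check); the function names, batch size, and synchronization period are assumptions for illustration only.
```python
# Hedged sketch of the management-device training loop described in the steps above.
import random

def train_management(pool, second_online_net, critic_update_fn, broadcast_fn,
                     batch_size=64, sync_period=5, max_iters=10_000,
                     converged=lambda i: False):
    for it in range(1, max_iters + 1):
        if len(pool) < batch_size:
            continue                                  # wait until enough policy experience exists
        batch = random.sample(pool, batch_size)       # policy samples drawn from the experience pool
        critic_update_fn(second_online_net, batch)    # Critic adjusts the second online network
        if it % sync_period == 0:                     # preset synchronization condition
            broadcast_fn(second_online_net)           # every first online network gets the same parameters
        if converged(it):                             # preset convergence condition
            break
    return second_online_net
```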
  • the policy system deployed with the DDPG model includes a management device and multiple terminal devices.
  • the DDPG model includes a critic network and an actor network.
  • the actor network includes a first online network and a second online network.
  • Each terminal device is deployed with the first online network
  • the management device is deployed with the critic network and the second online network.
  • The policy samples used to train the second online network are collected from the experience pool and are generated by each terminal device through the first online network deployed on itself; therefore, the state space of the policy samples only involves a single terminal device. As a result, this method not only avoids the time-consuming collection of the global state, but also reduces the dimension of the state space.
  • The management device trains the second online network based on the policy samples collected from the experience pool. Since the policy experience generated by all terminal devices is stored in the experience pool, the policy samples sampled by the management device from the experience pool can take the global information of the policy system into account, so that the second online network can tend to converge. Meanwhile, the first online network and the second online network both belong to the Actor network and maintain a synchronous relationship; therefore, the synchronized first online network can generate, based only on the state information of the terminal device itself, an action that minimizes the cost of the entire decision-making system.
  • the task offloading system is an MEC network system
  • the base station in the MEC network system serves as the management device of the task offloading system;
  • multiple user equipments in the MEC network system correspond to multiple terminal devices in the task offloading system.
  • the following takes the MEC network system as an example to illustrate the above-mentioned distributed model training method in detail.
  • The MEC network system includes a base station BS (Base Station) and multiple user devices UDs (User Devices); the multiple user equipments are represented as a set U:
  • the two-dimensional plane of the environment is represented by xy coordinates, and the position of user equipment u in time slice t and the location L_BS of the base station BS are:
  • the energy stored by user equipment u is expressed as E_u; for each time slice t, the terminal task of user equipment u is expressed as:
  • Each user equipment can communicate with the base station BS through a wireless channel, which enables user equipment u to transmit its computation-intensive task to the base station BS for calculation.
  • A binary offloading strategy is adopted; that is, the task is only allowed to be executed on the user equipment u or on the base station BS.
  • The offloading result is expressed as a binary indicator: one value indicates that the terminal task is offloaded to the base station BS for execution, and the other value indicates that the terminal task is executed locally.
  • the wireless data transmission rate between user equipment u and base station BS is:
  • The user equipment u is only allowed to perform wireless data transmission at a power less than the maximum power p_max, namely:
  • the large-scale path loss attenuation function of a wireless channel can be expressed as:
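  • The rate and path-loss expressions themselves are not reproduced in this text; purely as a hedged reference, a common form consistent with the surrounding description is shown below, with assumed symbols B (bandwidth), p_u^t (transmit power), h_u^t (channel gain), σ² (noise power), d_u^t (UD-to-BS distance), and α (path-loss exponent).
```latex
% Hedged sketch only; the patent's own formulas are not reproduced in this extracted text.
r_u^t = B \log_2\!\left(1 + \frac{p_u^t \, h_u^t}{\sigma^2}\right), \qquad
h_u^t \propto \left(d_u^t\right)^{-\alpha}
```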
  • the terminal tasks generated by the user equipment u are only allowed to be executed on the user equipment u, or offloaded to the base station BS for execution.
  • ⁇ u and ⁇ u are calculation coefficients related to the chip architecture of user equipment u.
  • For a terminal task that is selected to be offloaded to the base station BS for calculation, the time required can be expressed as:
  • F_ES represents the maximum available computing resources of the base station BS. Considering that the base station BS has a sufficient energy supply, the energy consumed by computing terminal tasks at the base station is ignored. In addition, this embodiment also stipulates that each user equipment does not consume energy while waiting for the base station BS to process the terminal task.
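  • As a hedged reference for the binary-offloading cost terms, a commonly used form is sketched below with assumed symbols c_u^t (required CPU cycles), d_u^t (input data size), f_u (local CPU frequency), F_ES (computing resources of the BS), r_u^t (uplink rate), p_u^t (transmit power), and κ_u (an assumed placeholder for the chip-related coefficients mentioned above); the patent's exact formulas are not reproduced here.
```latex
% Hedged sketch of typical local vs. offloaded delay and energy terms (assumed symbols).
T_{u,\mathrm{local}}^{t} = \frac{c_u^t}{f_u}, \qquad
E_{u,\mathrm{local}}^{t} = \kappa_u \, f_u^{2} \, c_u^t, \qquad
T_{u,\mathrm{off}}^{t} = \frac{d_u^t}{r_u^t} + \frac{c_u^t}{F_{ES}}, \qquad
E_{u,\mathrm{off}}^{t} = p_u^t \, \frac{d_u^t}{r_u^t}
```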
  • The agreed system cost relates to the task delay of the terminal task and the energy consumption of user equipment u, which are defined as:
  • the two are weighted and summed with the preset weight ⁇ :
  • the average system cost C_sys of all user equipments in the period T can be expressed as:
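  • The weighted-sum and averaging expressions are not reproduced in this text; a hedged sketch consistent with the description (weight ω trading delay against energy, averaged over the user set U and the period T) is:
```latex
% Hedged sketch; the split of the weight between the two terms is an assumption.
C_u^t = \omega \, T_u^t + (1-\omega) \, E_u^t, \qquad
C_{sys} = \frac{1}{|U| \, T} \sum_{t=1}^{T} \sum_{u \in U} C_u^t
```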
  • the purpose of this embodiment is to train the DDPG model so that the model formulates a task offloading strategy for each user equipment
  • the objective function P of the MEC network system can be expressed as:
  • C1 and C2 represent the computing resource constraints of the user equipment and the base station BS respectively;
  • C3 represents the binary offloading constraint; that is, the terminal task is only allowed to be computed locally on the user terminal or offloaded to the base station BS for execution;
  • C5 indicates that the task delay of the terminal task should not be greater than its maximum allowable delay;
  • C6 means that the total energy consumption of user equipment u from the first time slot to the current time slot t' should not exceed the maximum available energy stored by user equipment u.
  • this embodiment designs a distributed model training framework based on the DDPG model for the optimization of the network average system cost of the MEC network system.
  • Similar to conventional reinforcement learning, it is also necessary to determine the state, the action, and the immediate reward under the DDPG model.
  • the user equipment u can only observe limited state information from the environment.
  • The status information may include the current location of user equipment u, the most recently arrived task, and the remaining energy of user equipment u. Therefore, at time slice t, the state information s_t of user equipment u can be expressed as:
  • Action: also because of the distributed training of the DDPG model in this embodiment, user equipment u can only decide its own scheduling action. Therefore, the action a_t of user equipment u in time slice t can be expressed as:
  • Reward: the immediate reward that user equipment u receives from the environment at each time slice is related to the optimization goal of the objective function P, namely minimizing the average system cost C_sys. Therefore, the immediate-reward function is denoted by R_t, and the calculated immediate reward is denoted by r_t, where the expression of R_t is:
  • the training loss function and update strategy of the DDPG model are designed:
  • The agent can learn from its interaction experience with the environment to achieve the goal of maximizing the long-term reward (the accumulation of the rewards of all time slices), so as to obtain the optimal action strategy in each time slice.
  • the long-term reward of performing an action a t in a state s t can be expressed by the Bellman equation of the Q function, namely:
  • s_{t+1} is the state at time slice t+1 following s_t;
  • a' denotes all possible actions in state s_{t+1};
  • γ is the discount factor of future rewards, satisfying 0 ≤ γ ≤ 1.
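  • The equation image is not reproduced in this text; the standard Bellman form consistent with the terms just defined (with γ as the conventional symbol for the discount factor) is:
```latex
Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a')
```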
  • a neural network (Neural Network) is used to approximate the Q function.
  • The neural network is called the Q network; assuming that the parameter of the Q network is θ, the Q network can be expressed as Q_θ(s_t, a_t).
  • The sample data used to train the Q network are generated from the interaction between the agent and the environment. The sample data are called policy samples, expressed as (s_t, a_t, r_t, s_{t+1}).
  • the loss function of this Q network can be expressed as:
  • Q(s_i, a_i) represents the output of the Q network;
  • y_i represents the reference Q value.
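  • A hedged sketch of the Q-network loss implied by this description, over N policy samples (s_i, a_i, r_i, s_i') and with θ and θ' denoting the online and target Q-network parameters, is:
```latex
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - Q_{\theta}(s_i, a_i)\bigr)^2, \qquad
y_i = r_i + \gamma \max_{a'} Q_{\theta'}(s_i', a')
```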
  • The Actor-Critic framework is adopted in the DDPG model, where the Actor is the policy network and the Critic is the Q network.
  • the expression of the policy gradient is:
  • The Actor network includes the Actor online network and the Actor target network; the Critic network includes the Critic online network and the Critic target network.
  • In this embodiment, the Actor online network includes a first online network and a second online network. The first online network is deployed on each user equipment, and the second online network is deployed on the base station BS together with the Actor target network. The second online network cooperates with the Actor target network, adjusts the model parameters of the second online network by means of the policy gradient, and keeps in sync with the first online network.
  • A double Critic network is used to reduce the overestimation of the Q value, and clipped double Q-learning and delayed policy updates are adopted to avoid high variance.
  • The Critic online networks include Critic1 and Critic2; the Critic target networks include Target Critic1 and Target Critic2. Therefore, for the same policy sample, Target Critic1 and Target Critic2 each give a Q value for the sample, and Critic1 and Critic2 likewise each give a Q value for the sample.
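  • The following is a minimal sketch of the clipped double-Q target described above: both target critics score the (noised) target action and the smaller value is used, which curbs Q overestimation; tensor shapes and the noise scale are illustrative assumptions.
```python
# Hedged sketch of a clipped double-Q target computed from the two Critic target networks.
import torch

def clipped_double_q_target(r, s_next, actor_target, target_critic1, target_critic2,
                            gamma=0.99, noise_std=0.1):
    with torch.no_grad():
        a_next = actor_target(s_next)
        a_next = a_next + noise_std * torch.randn_like(a_next)   # smoothing noise on the target action
        sa = torch.cat([s_next, a_next], dim=1)
        q1 = target_critic1(sa)                                  # value from Target Critic1
        q2 = target_critic2(sa)                                  # value from Target Critic2
        return r + gamma * torch.min(q1, q2)                     # clipped (element-wise minimum) target
```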
  • the training process may include:
  • Each user device operates in a distributed manner and receives the task offloading strategy generated by the Actor online network deployed on itself.
  • The user equipment collects its own state observed from the MEC network system, and the Actor online network generates, according to the policy, the current action that should be taken.
  • After user equipment u performs an action, it receives the immediate reward and the next state from the MEC network system. User equipment u then packs the above information into a policy experience and sends it to the experience pool.
  • All user devices download the model parameters θ of the Actor online network. It is worth noting that the period for each user device to download the model parameters of the Actor online network is T_explore time slices; that is, user device u needs to run the Actor online network deployed on itself for T_explore time slices before performing a synchronization operation.
  • T_explore is specified by an annealing algorithm, which makes the training more stable at the beginning and more efficient afterwards.
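  • A hedged sketch of such an annealed synchronization period is shown below; the exact annealing schedule (and even its direction) is not given in this text, so the exponential decay from a long to a short period is purely an assumption.
```python
# Hedged sketch of an annealed T_explore schedule; start, end and decay values are assumptions.
def t_explore(iteration, t_start=200, t_end=20, decay=0.995):
    """Number of time slices a user device runs its local Actor online network
    before downloading the latest model parameters from the base station."""
    t = t_end + (t_start - t_end) * (decay ** iteration)
    return max(t_end, int(t))

# Example: under this assumed schedule, the synchronization period shrinks as training progresses.
print(t_explore(0), t_explore(500), t_explore(5000))
```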
  • The embodiments of the present application also provide a system, single-side methods, and devices corresponding to the method described above, which may include:
  • This embodiment also provides a decision-making system.
  • the decision-making system is deployed with a DDPG model.
  • the decision-making system includes a management device and a plurality of terminal devices.
  • The DDPG model includes a Critic network and an Actor network, and the Actor network includes a first online network and a second online network; each terminal device is deployed with the first online network, and the management device is deployed with the Critic network and the second online network;
  • The model training process includes:
  • For each terminal device, the terminal device is configured to generate an action corresponding to the first device state through the first online network according to its own first device state;
  • the terminal device is used to store the policy experience corresponding to the first device state in the experience pool, wherein the policy experience includes the first device state, the action, the second device state after the action is executed, and the immediate reward of the action;
  • the management device is used to sample the experience pool and obtain policy samples
  • the management device is also used to adjust the model parameters of the second online network through the critic network according to the policy sample;
  • When the preset synchronization condition is met, the management device synchronizes the adjusted second online network to each first online network, so that each synchronized first online network has the same parameters as the adjusted second online network.
  • This embodiment also provides a distributed model training method, which is applied to a management device in a decision-making system.
  • the management device communicates with multiple terminal devices in the decision-making system.
  • the decision-making system is deployed with a DDPG model.
  • the DDPG model includes the Critic network and the Actor network.
  • the Actor network includes the first online network and the second online network.
  • Each terminal device is deployed with the first online network.
  • the management device is deployed with a critic network and a second online network.
  • the method includes:
  • the experience pool is used to store the policy experience generated by each terminal device through the first online network according to its own first device state.
  • the policy experience includes the first device state, the action, the second device state after the action is executed, and the immediate reward of the action;
  • In step S104B, it is judged whether the DDPG model satisfies the preset convergence condition; if so, step S105B is executed to obtain the pre-trained DDPG model; if not, the process returns to S101B for another iteration.
  • this embodiment also provides a distributed model training device applied to the management device.
  • the decision-making system is deployed with a DDPG model.
  • the DDPG model includes the Critic network and the Actor network.
  • the Actor network includes the first online network and the second online network.
  • Each terminal device is deployed with the first online network
  • the management device is deployed with the Critic network and the second online network.
  • the distributed model training device includes at least one functional module that can be stored in the memory in the form of software. As shown in Figure 5, functionally, the distributed model training device may include:
  • the model iteration module 101 may be configured to perform at least one model training process until the DDPG model meets the preset convergence condition;
  • the distributed model training device also includes:
  • the experience sampling module 102 may be configured to sample an experience pool to obtain policy samples, wherein the experience pool is used to store the policy experience generated by each terminal device through the first online network according to its own first device state, and the policy experience includes the first device state, the action, the second device state after performing the action, and the immediate reward of the action;
  • the model adjustment module 103 may be configured to adjust the model parameters of the second online network through the critic network according to the policy sample;
  • the first synchronization module 104 may be configured to synchronize the adjusted second online network to each first online network when the preset synchronization condition is met, so that each synchronized first online network has the same parameters as the adjusted second online network.
  • the distributed model training device may also include other software function modules for implementing other steps or sub-steps of the distributed model training method applied to management equipment; of course, the model iteration module 101, the experience sampling module 102 , the model adjustment module 103 and the first synchronization module 104 can also be used to implement other steps or sub-steps of the distributed model training method applied to the management device.
  • This example does not specifically limit this, and those skilled in the art can make appropriate adjustments according to different software module division standards.
  • This embodiment also provides a distributed model training method, which is applied to terminal devices in a decision-making system.
  • the terminal device communicates with the management device in the decision-making system.
  • the decision-making system is deployed with a DDPG model.
  • the DDPG model includes the Critic network and the Actor network.
  • the Actor network includes the first online network and the second online network.
  • the terminal device is deployed with the first online network.
  • the management device is deployed with a critic network and a second online network.
  • methods may include:
  • Step S101C: acquire the terminal device's own first device state.
  • Step S102C: generate an action corresponding to the first device state through the first online network according to the first device state.
  • the strategy experience includes the first device state, the action, the second device state after the action is executed, and the immediate reward of the action.
  • the management device adjusts the model parameters of the second online network through the critic network according to the policy samples to obtain the adjusted second online network, and the policy samples are sampled from the experience pool.
  • this embodiment also provides a distributed model training device applied to the terminal device.
  • the terminal device communicates with the management device in the decision-making system.
  • the decision-making system is deployed with a DDPG model.
  • the DDPG model includes the Critic network and the Actor network.
  • the Actor network includes the first online network and the second online network.
  • the terminal device is deployed with the first online network.
  • the management device is deployed with a critic network and a second online network.
  • the distributed model training device includes at least one functional module that can be stored in a memory in the form of software. As shown in Figure 7, functionally, the distributed model training device may include:
  • the status acquiring module 201 may be configured to acquire its own first device status
  • the policy generating module 202 may be configured to generate an action corresponding to the first device state through the first online network according to the first device state;
  • the experience generation module 203 may be configured to store the policy experience corresponding to the first device state in the experience pool, wherein the policy experience includes the first device state, actions, second device states after performing actions, and immediate rewards for actions;
  • the second synchronization module 204 may be configured to receive the adjusted second online network, wherein the management device adjusts the model parameters of the second online network through the Critic network according to the policy sample, and obtains the adjusted second online network, the policy Samples are sampled from the experience pool.
  • the second synchronization module 204 may also be configured to synchronize the adjusted second online network to the first online network, so that the synchronized first online network and the adjusted second online network have the same model parameters.
  • The distributed model training device may also include other software function modules for implementing other steps or sub-steps of the distributed model training method applied to terminal devices; of course, the status acquisition module 201, the policy generation module 202, the experience generation module 203, and the second synchronization module 204 can also be used to implement other steps or sub-steps of the distributed model training method applied to terminal devices.
  • This embodiment does not specifically limit this, and those skilled in the art can make appropriate adjustments according to different software module division standards.
  • This embodiment also provides an electronic device, the electronic device includes a processor and a memory, and the memory stores a computer program.
  • the electronic device is a management device
  • the computer program when executed by the processor, the above-mentioned distributed model training method for the operation of the management device is realized.
  • the electronic device is a terminal device
  • the computer program when executed by the processor, the above-mentioned distributed model training method operated by the terminal device is realized.
  • This embodiment also provides a schematic structural diagram of the electronic device. As shown in FIG. 8, the electronic device includes a memory 320, a processor 330, and a communication unit 340.
  • the components of the memory 320 , the processor 330 and the communication unit 340 are directly or indirectly electrically connected to each other to realize data transmission or interaction. For example, these components can be electrically connected to each other through one or more communication buses or signal lines.
  • The memory 320 may be, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), etc.
  • the communication unit 340 is used to send and receive data through the network.
  • The network may include a wired network, a wireless network, an optical fiber network, a telecommunication network, an intranet, the Internet, a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), a wireless local area network (Wireless Local Area Networks, WLAN), a metropolitan area network (Metropolitan Area Network, MAN), a public switched telephone network (Public Switched Telephone Network, PSTN), a Bluetooth network, a ZigBee network, or a near field communication (Near Field Communication, NFC) network, etc., or any combination thereof.
  • a network may include one or more network access points.
  • a network may include wired or wireless network access points, such as base stations and/or network switching nodes, through which one or more components of the service request processing system may connect to the network to exchange data and/or information.
  • the processor 330 may be an integrated circuit chip with signal processing capabilities, and the processor may include one or more processing cores (for example, a single-core processor or a multi-core processor).
  • the above-mentioned processor may include a central processing unit (Central Processing Unit, CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), an application specific instruction set processor (Application Specific Instruction-set Processor, ASIP), graphics processing Unit (Graphics Processing Unit, GPU), Physical Processing Unit (Physics Processing Unit, PPU), Digital Signal Processor (Digital Signal Processor, DSP), Field Programmable Gate Array (Field Programmable Gate Array, FPGA), Programmable Logic Device (Programmable Logic Device, PLD), controller, microcontroller unit, reduced instruction set computer (Reduced Instruction Set Computing, RISC), or microprocessor, etc., or any combination thereof.
  • This embodiment also provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program is executed by a processor, the above-mentioned distributed model training method operated by the management device or the above-mentioned distributed model training method operated by the terminal device is implemented.
  • the policy system deployed with the DDPG model includes a management device and multiple terminal devices.
  • the DDPG model includes a critic network and an actor network.
  • the actor network includes a first online network and a second online network.
  • Each terminal device is deployed with the first online network
  • the management device is deployed with the critic network and the second online network.
  • The policy samples used to train the second online network are collected from the experience pool and are generated by each terminal device through the first online network deployed on itself. Therefore, the state space of a policy sample only involves a single terminal device, so this method not only avoids the time consumed in collecting the global state but also reduces the dimension of the state space.
  • Each block in a flowchart or block diagram may represent a module, program segment, or part of code that includes one or more executable instructions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or action, or by a combination of dedicated hardware and computer instructions.
  • each functional module in each embodiment of the present application may be integrated to form an independent part, each module may exist independently, or two or more modules may be integrated to form an independent part.
  • If the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and other media that can store program codes.
  • the present application provides a distributed model training method, system and related devices.
  • The system includes a management device and multiple terminal devices and is deployed with a DDPG model;
  • the DDPG model includes a Critic network and an Actor network, and the Actor network includes the first online network and the second online network, each terminal device is deployed with the first online network, and the management device is deployed with the critic network and the second online network;
  • The policy samples used to train the second online network are collected from the experience pool and are generated by each terminal device through its self-deployed first online network; therefore, the state space of a policy sample only involves a single terminal device. Consequently, this method not only avoids the time consumed in collecting the global state but also reduces the dimension of the state space.
  • the distributed model training method, system and related devices of the present application are reproducible and can be used in various industrial applications.
  • the distributed model training method, system and related devices of the present application can be used in the field of control.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

本申请提供分布式模型训练方法、系统及相关装置中,该系统部署包括管理设备以及多个终端设备且部署有DDPG模型;DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络;而用于训练第二在线网络的策略样本采集自经验池,由各终端设备通过自身部署的第一在线网络生成,因此,策略样本的状态空间仅涉及单个终端设备,因此,该方法不仅能够避免采集全局状态所需要的耗时,而且还能降低状态空间的维度。

Description

分布式模型训练方法、系统及相关装置
相关申请的交叉引用
本申请要求于2021年11月10日提交中国国家知识产权局的申请号为202111323472.7、名称为“分布式模型训练方法、系统及相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及控制领域,具体而言,涉及一种分布式模型训练方法、系统及相关装置。
背景技术
在基于边缘计算的任务卸载场景中,为了提高服务质量,需要根据各边缘设备的状态,制定任务卸载策略。该任务卸载策略用于将边缘设备中的终端任务进行重新分配。例如,将终端任务在边缘设备本地执行或者卸载至云端的服务器执行。
现有的研究与发明大多是完全集中式的调度方法。即需要获取全部边缘设备的全局状态信息之后,基于该全局状态信息对所有边缘设备的终端任务进行统一调度。然而,发明人研究发现,随着边缘设备的增加,集中式调度方法难以在同一个时间内收集所需的全局信息,并在有限的时间内制定任务卸载策略,极大的影响了算法的收敛效率。
发明内容
为了克服相关技术中的至少一个不足,本申请实施例提供一种分布式模型训练方法、系统及相关装置,可以包括:
本申请的一些实施例提供一种分布式模型训练方法,应用于部署有DDPG(Deep Deterministic Policy Gradient,深度确定性策略梯度)模型的决策系统,所述决策系统包括管理设备以及多个终端设备,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述方法可以包括:
执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
所述模型训练流程,可以包括:
针对每个所述终端设备,所述终端设备根据自身的第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
所述终端设备将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
所述管理设备对所述经验池进行采样,获得策略样本;
所述管理设备根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
当满足预设同步条件,则所述管理设备将调整后的第二在线网络同步至每个所述第一在线网络,以使每个同步后的第一在线网络与所述调整后的第二在线网络具有相同的参数。
在一些可选的实施方式中,所述动作可以包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
Figure PCTCN2022088702-appb-000001
Figure PCTCN2022088702-appb-000002
式中,
Figure PCTCN2022088702-appb-000003
表示终端设备u在时间片t执行所述终端任务的任务延时,
Figure PCTCN2022088702-appb-000004
表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
在一些可选的实施方式中,所述表达式R t还可以配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
所述约束条件可以包括:
执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
所述终端任务只允许在所述终端设备执行或者所述服务器执行;
执行所述终端任务的延时不能超过时长阈值;
执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
本申请的另一些实施例提供一种决策系统,所述决策系统部署有DDPG模型,所述决策系统可以包括管理设备以及多个终端设备,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络;
执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
所述模型训练流程,可以包括:
针对每个所述终端设备,所述终端设备用于根据自身的第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
所述终端设备用于将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
所述管理设备用于对所述经验池进行采样,获得策略样本;
所述管理设备还用于根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
当满足预设同步条件,则所述管理设备将调整后的第二在线网络同步至每个所述第一在线网络,以使每个同步后的第一在线网络与所述调整后的第二在线网络具有相同的参数。
在一些可选的实施方式中,所述动作可以包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
Figure PCTCN2022088702-appb-000005
Figure PCTCN2022088702-appb-000006
式中,
Figure PCTCN2022088702-appb-000007
表示终端设备u在时间片t执行所述终端任务的任务延时,
Figure PCTCN2022088702-appb-000008
表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
在一些可选的实施方式中,所述表达式R t还可以配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
所述约束条件可以包括:
执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
所述终端任务只允许在所述终端设备执行或者所述服务器执行;
执行所述终端任务的延时不能超过时长阈值;
执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
本申请的又一些实施例提供一种分布式模型训练方法,应用于决策系统中的管理设备,所述管理设备与所述决策系统中的多个终端设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述方法可以包括:
执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
所述模型训练流程,包括:
对经验池进行采样,获得策略样本,其中,所述经验池用于存储每个所述终端设备根据自身的第一设备状态,通过所述第一在线网络生成的策略经验,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
当满足预设同步条件,则将调整后的第二在线网络同步至每个所述第一在线网络,以使每个同步后的第一在线网络与所述调整后的第二在线网络具有相同的模型参数。
在一些可选的实施方式中,所述动作可以包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
Figure PCTCN2022088702-appb-000009
Figure PCTCN2022088702-appb-000010
式中,
Figure PCTCN2022088702-appb-000011
表示终端设备u在时间片t执行所述终端任务的任务延时,
Figure PCTCN2022088702-appb-000012
表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
在一些可选的实施方式中,所述表达式R t还可以配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
所述约束条件可以包括:
执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
所述终端任务只允许在所述终端设备执行或者所述服务器执行;
执行所述终端任务的延时不能超过时长阈值;
执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
本申请的再一些实施例提供一种分布式模型训练方法,应用于决策系统中的终端设备,所述终端设备与所述决策系统中的管理设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述方法可以包括:
获取自身的第一设备状态;
根据所述第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
接收调整后的第二在线网络,其中,所述管理设备根据策略样本,通过所述Critic网络调整所述第二在线网络的模型参数,获得所述调整后的第二在线网络,所述策略样本采样自所述经验池。
将所述调整后的第二在线网络同步至所述第一在线网络,以使同步后的第一在线网络与所述调整后的第二在线网络具有相同的模型参数。
在一些可选的实施方式中,所述动作可以包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
Figure PCTCN2022088702-appb-000013
Figure PCTCN2022088702-appb-000014
式中,
Figure PCTCN2022088702-appb-000015
表示终端设备u在时间片t执行所述终端任务的任务延时,
Figure PCTCN2022088702-appb-000016
表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
在一些可选的实施方式中,所述表达式R t还可以配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
所述约束条件可以包括:
执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
所述终端任务只允许在所述终端设备执行或者所述服务器执行;
执行所述终端任务的延时不能超过时长阈值;
执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
本申请的再又一些本实施例提供一种分布式模型训练装置,所述管理设备与所述决策系统中的多个终端设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述分布式模型训练装置可以包括:
所述模型迭代模块,可以配置成用于执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
在所述模型训练流程,所述分布式模型训练装置还包括:
经验采样模块,可以配置成用于对经验池进行采样,获得策略样本,其中,所述经验池用于存储每个所述终端设备根据自身的第一设备状态,通过所述第一在线网络生成的策略经验,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
模型调整模块,可以配置成用于根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
第一同步模块,可以配置成用于当满足预设同步条件,则将调整后的第二在线网络同步至每个所述第一在线网络, 以使每个同步后的第一在线网络与所述调整后的第二在线网络具有相同的参数。
在一些可选的实施方式中,所述动作可以包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
Figure PCTCN2022088702-appb-000017
Figure PCTCN2022088702-appb-000018
式中,
Figure PCTCN2022088702-appb-000019
表示终端设备u在时间片t执行所述终端任务的任务延时,
Figure PCTCN2022088702-appb-000020
表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
在一些可选的实施方式中,所述表达式R t还可以配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
所述约束条件可以包括:
执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
所述终端任务只允许在所述终端设备执行或者所述服务器执行;
执行所述终端任务的延时不能超过时长阈值;
执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
本申请的又再一些实施例提供一种分布式模型训练装置,应用于决策系统中的终端设备,所述终端设备与所述决策系统中的管理设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述分布式模型训练装置可以包括:
状态获取模块,可以配置成用于获取自身的第一设备状态;
策略生成模块,可以配置成用于根据所述第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
经验生成模块,可以配置成用于将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
第二同步模块,可以配置成用于接收调整后的第二在线网络,其中,所述管理设备根据策略样本,通过所述Critic网络调整所述第二在线网络的模型参数,获得所述调整后的第二在线网络,所述策略样本采样自所述经验池。
所述第二同步模块,还可以配置成用于将所述调整后的第二在线网络同步至所述第一在线网络,以使同步后的第一在线网络与所述调整后的第二在线网络具有相同的模型参数。
在一些可选的实施方式中,所述动作可以包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
Figure PCTCN2022088702-appb-000021
Figure PCTCN2022088702-appb-000022
式中,
Figure PCTCN2022088702-appb-000023
表示终端设备u在时间片t执行所述终端任务的任务延时,
Figure PCTCN2022088702-appb-000024
表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
在一些可选的实施方式中,所述表达式R t还可以配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
所述约束条件可以包括:
执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
所述终端任务只允许在所述终端设备执行或者所述服务器执行;
执行所述终端任务的延时不能超过时长阈值;
执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
本申请的又一些实施例提供一种电子设备,所述电子设备包括处理器以及存储器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,实现管理设备或者终端设备运行的分布式模型训练方法。
本申请的其他一些实施例提供一种计算机存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序被处理器执行时,实现管理设备或者终端设备运行的分布式模型训练方法。
相对于相关技术而言,本申请至少具有以下有益效果:
本实施例提供的部署有DDPG模型的策略系统中,包括管理设备以及多个终端设备。DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络。而用于训练第二在线网络的策略样本采集自经验池,由各终端设备通过自身部署的第一在线网络生成,因此,策略样本的状态空间仅涉及单个终端设备,因此,该方法不仅能够避免采集全局状态所需要的耗时,而且还能降低状态空间的维度。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,应当理解,以下附图仅示出了本申请的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。
图1为本申请实施例提供的策略系统结构示意图;
图2为本申请实施例提供的应用于策略系统的分布式模型训练方法;
图3为本申请实施例提供的DDPG模型示意图;
图4为本申请实施例提供的应用于管理设备的分布式模型训练方法;
图5为本申请实施例提供的应用于管理设备的分布式模型训练装置;
图6为本申请实施例提供的应用于终端设备的分布式模型训练方法;
图7为本申请实施例提供的应用于终端设备的分布式模型训练装置;
图8为本申请实施例提供的电子设备结构示意图。
图标:101-模型迭代模块;102-经验采样模块;103-模型调整模块;104-第一同步模块;201-状态获取模块;202-策略生成模块;203-经验生成模块;204-第二同步模块;320-存储器;330-处理器;340-通信单元。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本申请实施例的组件可以以各种不同的配置来布置和设计。
因此,以下对在附图中提供的本申请的实施例的详细描述并非旨在限制要求保护的本申请的范围,而是仅仅表示本申请的选定实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
在本申请的描述中,需要说明的是,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗示相对重要性。此外,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
应该理解,流程图的操作可以不按顺序实现,没有逻辑的上下文关系的步骤可以反转顺序或者同时实施。此外,本领域技术人员在本申请内容的指引下,可以向流程图添加一个或多个其他操作,也可以从流程图中移除一个或多个操作。
在基于边缘计算的任务卸载场景中,为了提高服务质量,需要根据各边缘设备的状态,制定任务卸载策略。
例如,随着智能设备和移动应用的快速增长,越来越多的计算密集型应用要求更低的延迟。在MEC(Mobile Edge Computing,移动边缘计算)网络系统中,智能设备作为该网络中的边缘设备,可以根据任务要求、本地计算资源、基站计算资源等选择本地执行终端任务或者将终端任务卸载到服务器上远程执行,从而减少应用的体验延迟,提高网络的服务质量。
因此,有效的任务卸载策略对于MEC网络系统中实现令人满意的服务质量尤为重要。近年的相关技术中,提出了基于强化学习为MEC网络系统获得近似最优的计算卸载调度策略的技术。例如,基于DDQN(Double Deep Q Network)的任务卸载策略确定算法,该算法根据任务队列、能量队列和无线信道条件最大化累积MEC效用;基于DRL(Deep Reinforcement Learning)的任务卸载策略确定算法,该算法可以获得近似最优的任务卸载和资源分配策略,无需使用传统数值算法解决难以求解的优化问题。
或者,使用基于搜索的算法(如启发式算法、坐标下降法、遗传算法等)进行求解。例如,使用启发式算法在MEC网络系统中不断迭代调整智能设备的二元卸载决策,使得整个移动边缘计算系统的时延和能量消耗最小。
然而,上述方法大多集中在完全集中式的调度方法上,即在获取MEC网络系统中所有智能设备全局状态信息之后,对所有智能设备的任务卸载进行统一调度。由于全局状态信息的维度过大,可能会面临搜索空间规模过大而导致的维数灾难问题;其次,用于制定任务卸载策略的管理设备难以在一个时间片内收集所需的全局信息,并在有限的时间内学习到合适的任务卸载策略。
鉴于此,为了至少部分解决上述问题,本实施例提供一种分布式模型训练方法,应用于部署有DDPG模型的策略系统。该策略系统不仅限于上述MEC网络系统,还可以是安防领域的监控系统、人流激增场景下的通信系统(例如, 足球场)。
由于本实施例涉及到DDPG模型,为便于本领域技术人员实施本方案,下面先对DDPG模型进行介绍。
首先,DDPG模型属于以神经网络方式实现的强化学习方法,因此,DDPG模型同样涉及强化学习领域的状态、策略、动作、即时奖励、Q值(又名动作价值,表示在状态s t下,基于策略μ采取动作a t后,且如果持续执行策略μ的情况下,所获得奖励的期望值)。由于,DDPG模型基于DQN模型发展而来,因此,与DQN模型类似,DDPG模型同样使用一个神经网络拟合智能体的Q值。即针对智能体当前状态以及当前状态采取的动作,通过该神经网络对其进行评价,评价结果即为Q值。
而该神经网络在DDPG模型中被称为Critic网络。与DQN模型不同的是,DQN模型中采用贪婪策略选择Q值最大的动作,而DDPG模型则通过Actor网络生成当前状态应该采取的动作。
Actor网络与Critic网络的关系具体表现为:Actor网络用于拟合策略函数,即能够根据智能体的当前状态,生成当前状态下智能体应该采取的动作。Critic网络用于拟合动作价值函数,即能够确定Actor网络基于当前状态所生成动作的Q值。
为使Actor网络经训练后能够拟合策略函数,Critic网络经训练后能够拟合动Q值函数,Actor网络被设计为包括Actor在线网络以及Actor目标网络;Critic网络被设计为包括Critic目标网络与Critic在线网络。相关技术中,通常将DDPG模型中的上述Actor在线网络、Actor目标网络、Critic目标网络、Critic在线网络部署到同一设备进行训练。
下面提供一个示例,详细阐述在相关技术中,Actor在线网络、Actor目标网络、Critic目标网络、Critic在线网络之间的关系:
1、Actor在线网络基于智能体当前状态s,生成当前状态下智能体应该采取的动作a,并依据即时奖励函数计算在当前状态s下执行动作a后,产生的即时奖励r,以及执行动作a后智能体新的状态s'。
2、将(s,a,r,s')作为一组策略经验,存放至经验池,直到经验池中策略经验的数量达到设定的数量阈值。
3、从经验池中采集预设数量的策略经验,按照Q值函数的贝尔曼方程确定Actor在线网络的训练损失L Q,并依据训练损失L Q通过反向梯度传播算法(Loss Gradient)调整Actor在线网络的模型参数;其中,训练损失L Q的表达式可以表示为:
Figure PCTCN2022088702-appb-000025
Figure PCTCN2022088702-appb-000026
式中,N表示策略经验的预设数量,Q(s i,a i)表示将第i组策略经验(s i,a i,ri,s i')中的(s i,a i)输入Critic在线网络,由Critic在线网络拟合出的Q值;
y i表示(s i,a i)对应Q值的贝尔曼方程表达式,其中,a'表示在状态s i'下所有可能采取的动作,
Figure PCTCN2022088702-appb-000027
表示所有动作中的最大Q值,而这个最大Q值通过将(s i',a i')输入到Critic目标网络获得,a i'由Actor目标网络依据状态s i'+∈生成,其中,∈表示引入的噪声。
4、依据Actor网络确定出的Q值,以策略梯度的方式调整在Actor在线网络的模型参数。
5、当满足预设同步条件时,将Actor在线网络的模型参数同步至Actor目标网络;将Critic在线网络的模型参数同步至Critic目标网络。
6、迭代上述步骤1-5,直到DDPG模型满足收敛条件。
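For readers who want to relate the six steps above to a concrete implementation, the following is a minimal PyTorch-style sketch following the standard DDPG convention (critic regression toward the Bellman target via L_Q, policy-gradient update of the online actor, periodic target synchronisation). The network sizes, state/action dimensions, learning rates and the omission of the exploration/target noise are illustrative assumptions, not values specified by the application.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and hyper-parameters; the application does not fix these.
STATE_DIM, ACTION_DIM, GAMMA = 8, 2, 0.99

class Actor(nn.Module):
    """Online/target actor: maps a state to an action (policy function)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Online/target critic: maps a (state, action) pair to a Q value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, actor_tgt = Actor(), Actor()
critic, critic_tgt = Critic(), Critic()
actor_tgt.load_state_dict(actor.state_dict())
critic_tgt.load_state_dict(critic.state_dict())
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2):
    """One update on a sampled batch of policy experiences (s, a, r, s')."""
    with torch.no_grad():
        # Bellman target y_i; the target actor chooses a' for s'
        # (the target-policy noise mentioned in the description is omitted here).
        y = r + GAMMA * critic_tgt(s2, actor_tgt(s2))
    # Critic regression: mean squared error between Q(s_i, a_i) and y_i.
    loss_q = ((critic(s, a) - y) ** 2).mean()
    opt_critic.zero_grad(); loss_q.backward(); opt_critic.step()
    # Policy-gradient update of the online actor, i.e. ascend Q(s, pi(s)).
    loss_pi = -critic(s, actor(s)).mean()
    opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()

def sync_targets():
    """Periodic synchronisation of online parameters into the target networks."""
    actor_tgt.load_state_dict(actor.state_dict())
    critic_tgt.load_state_dict(critic.state_dict())
```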
不同于上述相关技术中的训练方式,本实施例将DDPG模型部署到策略系统,对其进行分布式训练。如图1所示,决策系统包括管理设备以及多个终端设备,DDPG模型包括Critic网络以及Actor网络(ActorNetwork),Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,将其作为机强化学习模型中的智能体;而管理设备部署有Critic网络以及第二在线网络。由于每个终端设备所部署的第一在线网络属于Actor网络,因此,终端设备可以基于自身的当前状态,生成相应的动作,以克服集中式的调度方法所存在缺陷。
其中,在一些实施方式中,该管理设备可以是与多个终端设备通信连接的服务器。例如,Web(网站)服务器、FTP(File Transfer Protocol,文件传输协议)服务器、数据处理服务器等。此外,该服务器可以是单个服务器,也可以是服务器组。服务器组可以是集中式的,也可以是分布式的(例如,服务器可以是分布式系统)。在一些实施例中,服务器100相对于用户终端,可以是本地的、也可以是远程的。在一些实施例中,服务器100可以在云平台上实现;仅作为示例,云平台可以包括私有云、公有云、混合云、社区云(Community Cloud)、分布式云、跨云(Inter-Cloud)、多云(Multi-Cloud)等,或者它们的任意组合。在一些实施例中,服务器100可以在具有一个或多个组件的电子设备上实现。
需要说明的是,在一些实施方式中,该管理设备可以作为终端设备卸载终端任务的服务器。当然,在一些实施方式中,用于终端设备卸载终端任务的对象可以是与管理设备不同的其他服务器。
该终端设备可以是,但不限于,移动终端、平板计算机、膝上型计算机、或机动车辆中的内置设备等,或其任意组合。在一些实施例中,移动终端可以包括智能家居设备、可穿戴设备、智能移动设备、虚拟现实设备、或增强现实设备等,或其任意组合。在一些实施例中,智能家居设备可以包括智能照明设备、智能电器设备的控制设备、智能监控设备、智能电视、智能摄像机、或对讲机等,或其任意组合。在一些实施例中,可穿戴设备可包括智能手环、智能鞋带、智能玻璃、智能头盔、智能手表、智能服装、智能背包、智能配件等、或其任何组合。在一些实施例中,智能移动设备可以包括智能手机、个人数字助理(Personal Digital Assistant,PDA)、游戏设备、导航设备、或销售点(Point of Sale,POS)设备等,或其任意组合。
基于上述设计,下面结合图2所示的流程图对应用于策略系统的分布式模型训练方法进行详细阐述。如图2所示,该方法可以包括:
S101A,针对每个终端设备,终端设备根据自身的第一设备状态,通过第一在线网络生成与第一设备状态相对应的动作。
相较于一些相关技术中,将Actor网络与Critic网络部署到同一设备,然后,采用全集中式的调度方法为所有的终端任务生成调度策略;需要获取全部终端设备各自的第一设备状态。本实施例中,Actor网络包括第一在线网络以及第二在线网络,而第一在线网络与第二在线网络具有相同的网络结构,且满足同步关系;并且,将第一在线网络部署到每个终端设备,使得而每个终端设备生成与自身第一状态相对应的动作时,只需要关注自身的状态信息。
S102A,终端设备将第一设备状态对应的策略经验存放至经验池。
其中,策略经验包括第一设备状态、动作、执行动作后的第二设备状态以及动作的即时奖励。需要说明的是,为了便于区分终端设备的当前状态,以及执行动作后的状态;本实例中,将终端设备的当前状态称作第一设备状态,将执行动作后的状态称作第二设备状态。
在一些实施方式中,该经验池(Experience Buffer)可以是管理设备提供的预设存储空间,用于存储预设数量的策 略经验。因此,终端设备可以将构建好的策略经验通过网络发送给管理设备,由管理设备将其存储至该预设存储空间。
此外,本实施例还针对策略系统的使用场景设计有相应的即时奖励函数;因此,终端设备可以将第一设备状态、第一设备状态下采取的动作以及第二设备状态发送给管理设备,使得管理设备根据第一设备状态、第一设备状态以及第二设备状态,通过即时奖励函数确定该动作的即时奖励。
在一些实施方式中,当上述决策系统为任务卸载系统,终端设备在第一设备状态需要执行的动作,包括终端任务本地执行或者卸载至服务器执行。
针对该任务卸载系统,终端设备将第一设备状态对应的策略经验存放至经验池之前,管理设备通过表达式R t确定动作的即时奖励。即R t为即时奖励函数,相应的表达式为:
Figure PCTCN2022088702-appb-000028
Figure PCTCN2022088702-appb-000029
式中,
Figure PCTCN2022088702-appb-000030
表示执行终端任务的任务延时,
Figure PCTCN2022088702-appb-000031
表示执行终端任务的任务能耗,λ表示预设权重。
此外,本实施例还针对该时奖励函数R t设计有约束条件,当终端设备在执行动作后,若满足任意一条的约束条件时,则生成惩罚因子。该惩罚因子用于减小即时奖励。
约束条件包括:
执行终端任务需要的计算资源不能超过终端设备与服务器各自的资源上限;
终端任务只允许在用户终端执行或者服务器执行;
执行终端任务的延时不能超过时长阈值;
执行终端任务的能耗不能超过终端设备与服务器各自的储能上限。
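As a rough illustration of how such a reward with penalty terms could be computed, the sketch below assumes a negative λ-weighted sum of delay and energy plus a simple additive penalty; the exact expression R_t and the penalty magnitude are given in the application only as formula images, so this form is an assumption.

```python
def instant_reward(task_delay, task_energy, lam, num_violations, penalty=1.0):
    """Hedged sketch of the instant reward: a negative lambda-weighted sum of
    task delay and task energy, reduced by a penalty term for every violated
    constraint.  The exact expression R_t and the penalty magnitude appear in
    the application only as formula images, so this form is an assumption."""
    reward = -(lam * task_delay + (1.0 - lam) * task_energy)
    return reward - penalty * num_violations
```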
S103A,管理设备对经验池进行采样,获得策略样本。
S104A,管理设备根据策略样本,通过Critic网络调整第二在线网络的模型参数。
S105A,当满足预设同步条件,则管理设备将调整后的第二在线网络同步至每个第一在线网络,以使每个同步后的第一在线网络与调整后的第二在线网络具有相同的参数。
在一些实施方式中,预设同步条件为模型训练流程的执行次数达到预设迭代周期。例如,当Actor目标网络的模型参数经过5轮的迭代后,将第5次迭代后的Actor目标网络同步至每个终端设备。由于Actor目标网络与每个终端设备中的Actor在线网络具有相同的网络结构,因此,管理设备只需将5次迭代后的Actor目标网络的模型参数下发至每个终端设备。
S106A,管理设备判断该DDPG模型是否满足预设收敛条件;若满足,则执行步骤S107A,获得预先训练的DDPG模型,若不满足,则返回S101A再次进行迭代。
如此,本实施例提供的部署有DDPG模型的策略系统中,包括管理设备以及多个终端设备。DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络。而用于训练第二在线网络的策略样本采集自经验池,由各终端设备通过自身部署的第一在线网络生成,因此,策略样本的状态空间仅涉及单个终端设备,因此,该方法不仅能够避免采集全局状态所需要的耗时,而且还能降低状态空间的维度。
此外,管理设备基于策略池采集的策略样本,训练第二在线网络。由于该经验池中存储有全部终端设备产生的策 略经验,因此,管理设备从经验池中采样的策略样本能够兼顾策略系统的全局信息,使得第二在线网络能够趋近于收敛;而第一在线网络与第二在线网络均属于Actor网络,且保持同步关系。因此,同步后的第一在线网络能够基于终端设备自身的状态信息,生成使得整个决策系统成本最小的动作。
假定该任务卸载系统为MEC网络系统,其中,MEC网络系统中的基站为任务卸载系统管理设备,MEC网络系统中的多个用户设备对应任务卸载系统中的多个终端设备。下面以MEC网络系统为例,对上述分布式模型训练方法进行详细示例性说明。
而应该理解的是,在对分布式模型训练方法进行介绍之前,需要先建立MEC网络系统的数学模型,具体包括系统模型、通信模型、计算模型,下面对这些数学模型进行详细说明:
1.系统模型
在MEC网络中,包含一个基站BS(Base Station)和多个用户设备UDs(User Devices),将多个用户设备表示为集合U:
U=(1,2,3...,u);
将系统时间均分为多个时间片,表示为事件片集合T:
T=(1,2,3...,t);
建立基站BS以及多个用户设备所处环境的空间模型,该环境的二维平面以坐标xy表示,用户设备u在时间片t的位置
Figure PCTCN2022088702-appb-000032
和基站BS的位置L BS分别为:
Figure PCTCN2022088702-appb-000033
LBS={xBS,yBS};
并且,假定基站BS的高度为H,用户设备u存储的能量表示为E u;每个时间片t,用户设备u的终端任务表示为
Figure PCTCN2022088702-appb-000034
Figure PCTCN2022088702-appb-000035
其中,
Figure PCTCN2022088702-appb-000036
表示终端任务的大小(单位:bits),
Figure PCTCN2022088702-appb-000037
表示完成该终端任务所需的CPU周期数,
Figure PCTCN2022088702-appb-000038
表示终端任务允许的最大延迟。
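The quantities introduced by this system model can be grouped, for illustration only, into simple data structures such as the following; all field and constant names are assumptions rather than identifiers from the application.

```python
from dataclasses import dataclass

@dataclass
class TerminalTask:
    """Task generated by user device u in time slot t (field names are illustrative)."""
    size_bits: float    # task size in bits
    cpu_cycles: float   # CPU cycles required to finish the task
    max_delay: float    # maximum delay the task tolerates

@dataclass
class UserDevice:
    """Per-device quantities used by the system model (names are illustrative)."""
    x: float            # position on the 2-D plane at time slot t
    y: float
    energy: float       # stored energy E_u

BS_POSITION = (0.0, 0.0)  # base-station position L_BS (illustrative value)
BS_HEIGHT = 10.0          # base-station height H (illustrative value)
```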
每个用户设备可以通过无线信道与基站BS进行通信,这使得用户设备u可以将其计算密集型任务
Figure PCTCN2022088702-appb-000039
传输到基站BS上进行计算。本实施例采用二元卸载策略,即任务
Figure PCTCN2022088702-appb-000040
只允许在在用户设备u上执行,或者在基站BS上执行。
Figure PCTCN2022088702-appb-000041
的卸载结果表示为
Figure PCTCN2022088702-appb-000042
Figure PCTCN2022088702-appb-000043
表示将终端任务卸载到基站BS上执行;当
Figure PCTCN2022088702-appb-000044
则表示终端任务在本地执行。
2.通信模型
在时间片t,用户设备u到基站BS之间无线数据传输速率为:
Figure PCTCN2022088702-appb-000045
其中,
Figure PCTCN2022088702-appb-000046
表示用户设备u在时间片t的无线传输功率,
Figure PCTCN2022088702-appb-000047
表示在时间片t,用户设备u和BS之间的无线信道功率增益,σ 2表示背景噪声功率;并且,假定任意一个时间片,所有用户设备的噪声相同。
本实施例中,还约定在每个时间片t中,用户设备u只允许执行无线数据传输,且
Figure PCTCN2022088702-appb-000048
小于最大功率p max,即:
Figure PCTCN2022088702-appb-000049
无线信道的大规模路径损失衰减函数可以表示为:
Figure PCTCN2022088702-appb-000050
Figure PCTCN2022088702-appb-000051
其中,
Figure PCTCN2022088702-appb-000052
表示在时间片t,用户设备u和BS之间的欧式距离,
Figure PCTCN2022088702-appb-000053
表示路径损失指数。
因此,用户设备u将终端任务
Figure PCTCN2022088702-appb-000054
卸载到基站BS计算,所需要的时间成本
Figure PCTCN2022088702-appb-000055
和能量成本分
Figure PCTCN2022088702-appb-000056
别可以表示为:
Figure PCTCN2022088702-appb-000057
Figure PCTCN2022088702-appb-000058
此外,考虑终端任务的计算结果远小于终端任务的大小,因此,为了简化运算模型,不考虑从基站BS下载终端任务的计算结果所需的时间和能量。
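A hedged sketch of the offloading cost computed from the wireless rate is shown below. Since the rate and cost expressions appear in the application only as formula images, a Shannon-type rate with an assumed bandwidth parameter is used here purely for illustration.

```python
import math

def offload_cost(task_bits, tx_power, channel_gain, noise_power, bandwidth=1e6):
    """Hedged sketch of the communication model.  The rate and cost formulas
    are images in the application, so a Shannon-type rate with an assumed
    bandwidth parameter is used here purely for illustration."""
    rate = bandwidth * math.log2(1.0 + tx_power * channel_gain / noise_power)
    t_offload = task_bits / rate        # time to upload the task to the BS
    e_offload = tx_power * t_offload    # transmission energy spent by device u
    return t_offload, e_offload
```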
3.计算模型
正如前文所约定的,用户设备u产生的终端任务只允许在用户设备u上执行,或者卸载到基站BS上执行。
对于选择在用户设备u上执行任务
Figure PCTCN2022088702-appb-000059
所需要的时间可以表示为:
Figure PCTCN2022088702-appb-000060
其中,
Figure PCTCN2022088702-appb-000061
表示用户设备u在时间片t所分配的计算资源,其中,
Figure PCTCN2022088702-appb-000062
满足以下限制条件:
Figure PCTCN2022088702-appb-000063
其中,
Figure PCTCN2022088702-appb-000064
表示时间片t所能够分配的最大计算资源。
对于选择在用户设备u上执行任务
Figure PCTCN2022088702-appb-000065
所需要的能量可以表示为:
Figure PCTCN2022088702-appb-000066
其中,κ u和ν u是与用户设备u的芯片架构相关的计算系数。
对于选择将终端任务
Figure PCTCN2022088702-appb-000067
卸载到基站BS进行计算,终端任务
Figure PCTCN2022088702-appb-000068
所需要的时间可以表示为:
Figure PCTCN2022088702-appb-000069
其中,
Figure PCTCN2022088702-appb-000070
表示基站BS在时间片t分配给用户设备u的计算资源,应满足以下限制条件:
Figure PCTCN2022088702-appb-000071
其中,F ES表示基站BS最大可用计算资源。考虑到基站BS有充足的能量供应,因此忽略计算终端任务所消耗的能量。此外,本实施例还约定每个用户设备在等待基站BS处理终端任务期间不消耗能量。
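Correspondingly, the local-execution cost can be sketched as follows. The time term (required CPU cycles divided by the allocated frequency) follows the description, while the exact energy expression involving the chip coefficients κ_u and ν_u is given only as an image, so the form used below is an assumption.

```python
def local_cost(cpu_cycles, f_alloc, kappa, nu):
    """Hedged sketch of local execution on device u: execution time equals the
    required CPU cycles divided by the allocated frequency; the chip-dependent
    energy model below (kappa * f_alloc**nu * time) is an assumed form, since
    the exact expression is given in the application only as an image."""
    t_local = cpu_cycles / f_alloc
    e_local = kappa * (f_alloc ** nu) * t_local
    return t_local, e_local
```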
在每个时间片t,约定系统成本为终端任务
Figure PCTCN2022088702-appb-000072
的任务延迟和用户设备u的能量消耗,分别定义为:
Figure PCTCN2022088702-appb-000073
Figure PCTCN2022088702-appb-000074
因此,为了统筹任务延迟和能量消耗两种系统成本,以预设权重λ对两者进行加权求和:
Figure PCTCN2022088702-appb-000075
则基于上述数学模型,所有用户设备在时期T内的平均系统成本C sys可以表示为:
Figure PCTCN2022088702-appb-000076
本实施例的目的则在于,训练DDPG模型,使得该模型为每个用户设备制定任务卸载策略
Figure PCTCN2022088702-appb-000077
使得所有用户设备在时期T内的平均系统成本最小,因此,可以将MEC网络系统的目标函数P表示为:
Figure PCTCN2022088702-appb-000078
需要满足以下约束条件C1-C6:
Figure PCTCN2022088702-appb-000079
Figure PCTCN2022088702-appb-000080
Figure PCTCN2022088702-appb-000081
Figure PCTCN2022088702-appb-000082
Figure PCTCN2022088702-appb-000083
Figure PCTCN2022088702-appb-000084
其中,C1和C2分别表示用户设备和基站BS的计算资源约束;C3表示二元卸载约束,即终端任务
Figure PCTCN2022088702-appb-000085
只允许选择在用户终端本地计算,或者将终端任务卸载到基站BS上运行;C5表示终端任务
Figure PCTCN2022088702-appb-000086
的任务延迟不应大于其最大允许延迟;C6表示用户设备u从第一个时间片到当前时间片t'总的能量消耗不应该超过用户设备u所存储的最大可用能量。
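A feasibility check implementing the constraints as they are described in words might look like the following sketch; parameter names are illustrative, and C4 is not spelled out in the text, so it is omitted here.

```python
def constraints_satisfied(f_dev, f_dev_max, f_bs_alloc, f_bs_max,
                          offload_flag, task_delay, max_delay,
                          energy_used, energy_stored):
    """Checks the constraints as they are described in words (parameter names
    are illustrative; C4 is not spelled out in the text and is omitted)."""
    return all([
        f_dev <= f_dev_max,            # C1: device computing-resource cap
        f_bs_alloc <= f_bs_max,        # C2: base-station computing-resource cap
        offload_flag in (0, 1),        # C3: binary offloading decision
        task_delay <= max_delay,       # C5: delay within the task's allowed maximum
        energy_used <= energy_stored,  # C6: cumulative energy within stored energy
    ])
```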
基于上述的系统模型、通信模型、计算模型,本实施例针对MEC网络系统的网络平均系统成本的优化问题设计基于DDPG模型的分布式模型训练框架。如图3所示,与常规的强化学习类似,同样需要先确定DDPG模型下的状态、动作和即时奖励,相关定义如下:
状态:由于本实施例中的对DDPG模型进行分布式训练,用户设备u只能观察来自环境的有限的状态信息。该状态信息可以包括用户设备u的当前位置,最新到达的任务以及用户设备u的剩余能量。因此,在时间片t,该用户设备u的状态信息s t可以表示为:
Figure PCTCN2022088702-appb-000087
其中,
Figure PCTCN2022088702-appb-000088
的表达式为:
Figure PCTCN2022088702-appb-000089
动作:同样由于本实施例中的对DDPG模型进行分布式训练,因此,用户设备u只能决定自己的调度动作。因此,用户设备u在时间片t的动作a t可以表示为:
Figure PCTCN2022088702-appb-000090
奖励:用户设备u在每个时间片从环境得到的即时奖励与目标函数P的优化目标相关,而目标函数P是为了使平均系统成本C sys最小。因此,将即时奖励函数以R t表示,将计算得到的即时奖励以r t表示,其中,R t的表达式为:
Figure PCTCN2022088702-appb-000091
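For illustration, the local observation described above (the device's current position, the newly arrived task and the remaining energy) can be packed into a flat state vector as in the sketch below; the exact layout of s_t is given in the application as an image, so this ordering is an assumption.

```python
import numpy as np

def observe_state(x, y, size_bits, cpu_cycles, max_delay, energy_left):
    """Hedged sketch of the local observation s_t of user device u: its current
    position, the newly arrived task and its remaining energy.  The exact
    vector layout is given in the application as an image, so this ordering
    is an assumption."""
    return np.array([x, y, size_bits, cpu_cycles, max_delay, energy_left],
                    dtype=np.float32)
```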
然后,针对MEC网络系统平均系统成本的优化问题,设计DDPG模型的训练损失函数和更新策略:
在强化学习中,智能体可以通过学习与环境的交互经验,以达到最大化长期奖励(累计所有时间片的奖励)的目标,从而获得在每个时间片下的最优动作策略。根据强化学习的理论,状态s t下执行动作a t的长期奖励可以用Q函数的贝尔曼方程来表示,即:
Figure PCTCN2022088702-appb-000092
其中s t+1是相较于s t在t+1时间片的状态,a'是在状态s t+1下所有可能的动作。γ是未来奖励的折扣因子,满足0≤γ≤1。
在DDPG模型中,使用神经网络(Neural Network)来近似Q函数。将该神经网络称为Q网络,假定Q网络的参数是θ,则可以将Q网络表示第为Q θ(s t,a t)。为了优化Q网络,用于训练该Q网络的样本数据,产生自智能体和环境的交互。假定将样本数据称为策略样本,表示为(s t,a t,r t,s t+1)。该Q网络的损失函数可以表示为:
Figure PCTCN2022088702-appb-000093
式中,Q(s i,a i)表示Q网络的输出,y i表示参考Q值;其中N是策略样本的数量。
由于动作值的连续性,在DDPG模型中发明采用了Actor-Critic框架,其中,Actor为策略网络,Critic为Q网络。策略网络来近似策略函数a t=π ω(s t),采用基于策略样本的策略梯度(Policy Gradient)进行更新,策略梯度的表达式为:
Figure PCTCN2022088702-appb-000094
正如前文关于DDPG模型的介绍,为使DDPG模型训练过程中保持稳定,以避免出现震荡,Actor网络包括Actor在线网络以及Actor目标网络;Critic中包括Critic在线网络以及Critic目标网络。而本实施例中,为了实现对DDPG模型进行分布式训练,Actor在线网络包括第一在线网络以及第二在线网络。其中,第一在线网络部署在每个用户设备,第二在线网络则与Actor目标网络一起部署在基站BS。第二在线网络与Actor目标网络相互配合,采用策略梯度的方式调整二在线网络的模型参数,并与第一在线网络保持同步。
此外,与常规DDPG模型不同的是,使用了两重Critic网络减小Q值的过度估计,并采取剪辑的双Q学习和延迟策略更新来避免高方差。
相应的实施方式请继续参见图3,Critic在线网络包括Critic1、Critic2;Critic目标网络包括Target Critic1、Target Critic2。因此,针对同一策略样本,Target Critic1与Target Critic2分别会给出该策略样本的Q值,分别为Q1和Q2;Critic1、Critic2分别会给出该策略样本的Q值,分别为Q1和Q2。然后,通过min{Q 1,Q 2}选取其中最小的一个Q值,分别与Q1和Q2进行比较,获得Critic1的训练损失
Figure PCTCN2022088702-appb-000095
Critic2的训练损失
Figure PCTCN2022088702-appb-000096
然后,依据
Figure PCTCN2022088702-appb-000097
更新Critic1的模型参数;依据
Figure PCTCN2022088702-appb-000098
更新Critic2的模型参数;当满足更新周期时,依据Critic1的模型参数对Target Critic1的模型参数进行更新;依据Critic2的模型参数对Target Critic2进行更新。
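The clipped double-Q target described above can be sketched as follows; both target critics score the (noisy) target action and the smaller Q value enters the Bellman target. The noise level and function signatures are illustrative assumptions.

```python
import torch

def clipped_double_q_target(reward, gamma, next_state, target_actor,
                            target_critic1, target_critic2, noise_std=0.1):
    """Hedged sketch of the clipped double-Q target: both target critics score
    the (noisy) target action and the smaller Q value enters the Bellman
    target.  The noise level and signatures are illustrative assumptions."""
    with torch.no_grad():
        a_next = target_actor(next_state)
        a_next = a_next + noise_std * torch.randn_like(a_next)  # target-policy smoothing
        q1 = target_critic1(next_state, a_next)
        q2 = target_critic2(next_state, a_next)
        return reward + gamma * torch.min(q1, q2)
```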
基于上述设计,对应用于该MEC网络系统的DDPG模型进行训练,直至满足预设收敛条件。继续参见图3,其训练过程,可以包括:
1、在每个时间片中,每个用户设备分布式运行,且接收自身所部署Actor在线网络生成的任务卸载策略。用户设备采集从MEC网络系统观察到自身的状态
Figure PCTCN2022088702-appb-000099
Actor在线网络根据策略
Figure PCTCN2022088702-appb-000100
生成当前应该采取的动作
Figure PCTCN2022088702-appb-000101
用户设备u执行动作
Figure PCTCN2022088702-appb-000102
后,用户设备u会从MEC网络系统各中收到即时奖励
Figure PCTCN2022088702-appb-000103
以及下一个状态
Figure PCTCN2022088702-appb-000104
用户设备u将上述信息打包成策略经验
Figure PCTCN2022088702-appb-000105
送入到经验池中。
2、从经验池中采样一小批次的样本策略
Figure PCTCN2022088702-appb-000106
用于训练Critic网络(Critic1、Critic2)和Actor在线网络。而其他三个目标模型(Target Critic1、Target Critic2、Target Actor)无需通过训练来更新,仅通过周期性地同步Critic1、Critic2以及Target Actor的模型参数。
3、所有用户设备下载Actor在线网络的模型参数ω,即
Figure PCTCN2022088702-appb-000107
而值得说明的是,每个用户设备下载Actor在线网络的模型参数的周期为T explore个时间片,即用户设备u需要使用自身所部署Actor在线网络运行T explore个时间片后, 才进行一次同步操作。T explore通过退火算法值来指定,可以使训练在开始时更加稳定,之后更加高效。
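A hedged sketch of the per-device loop in the training procedure is given below: the device acts from its locally deployed first online (Actor) network, pushes each experience to the shared pool on the base station, and downloads the latest actor parameters every T_explore time slots. `env` and `manager` are assumed interfaces standing in for the MEC environment and the base station, and are not part of the application.

```python
def run_terminal_device(local_actor, env, manager, t_explore, num_slots):
    """Hedged sketch of the per-device loop: act from the locally deployed
    first online (Actor) network, push the experience to the shared pool on
    the base station, and download the latest actor parameters every
    t_explore time slots.  `env` and `manager` are assumed interfaces that
    stand in for the MEC environment and the base station."""
    state = env.observe()
    for t in range(num_slots):
        action = local_actor.act(state)            # offloading decision for slot t
        reward, next_state = env.step(action)      # instant reward and next state
        manager.push_experience((state, action, reward, next_state))
        if (t + 1) % t_explore == 0:               # periodic parameter download
            local_actor.load_params(manager.latest_actor_params())
        state = next_state
```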
基于与分布式模型训练方法相同的发明构思,本申请实施例还提供有与该方法相应的系统、单侧方法以及装置,可以包括:
本实施例提供还提供一种决策系统,决策系统部署有DDPG模型,决策系统包括管理设备以及多个终端设备,DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络;
执行至少一次模型训练流程,直到DDPG模型满足预设收敛条件;
模型训练流程,包括:
针对每个终端设备,终端设备用于根据自身的第一设备状态,通过第一在线网络生成与第一设备状态相对应的动作;
终端设备用于将第一设备状态对应的策略经验存放至经验池,其中,策略经验包括第一设备状态、动作、执行动作后的第二设备状态以及动作的即时奖励;
管理设备用于对经验池进行采样,获得策略样本;
管理设备还用于根据策略样本,通过Critic网络调整第二在线网络的模型参数;
当满足预设同步条件,则管理设备将调整后的第二在线网络同步至每个第一在线网络,以使每个同步后的第一在线网络与调整后的第二在线网络具有相同的参数。
本实施例还提供一种分布式模型训练方法,应用于决策系统中的管理设备。
管理设备与决策系统中的多个终端设备通信连接,决策系统部署有DDPG模型,DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络。如图4所示,该方法包括:
S101B,对经验池进行采样,获得策略样本。
其中,经验池用于存储每个终端设备根据自身的第一设备状态,通过第一在线网络生成的策略经验,策略经验包括第一设备状态、动作、执行动作后的第二设备状态以及动作的即时奖励;
S102B,根据策略样本,通过Critic网络调整第二在线网络的模型参数。
S103B,当满足预设同步条件,则将调整后的第二在线网络同步至每个第一在线网络,以使每个同步后的第一在线网络与调整后的第二在线网络具有相同的模型参数。
S104B,判断该DDPG模型是否满足预设收敛条件;若满足,则执行步骤S105B,获得预先训练的DDPG模型,若不满足,则返回S101B再次进行迭代。
基于与应用于管理设备的分布式模型训练方法相同的发明构思,本实施例还提供一种应用于管理设备的分布式模型训练装置。
决策系统部署有DDPG模型,DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络。
分布式模型训练装置包括至少一个可以软件形式存储于存储器中的功能模块。如图5所示,从功能上划分,该分布式模型训练装置可以包括:
模型迭代模块101,可以配置成用于执行至少一次模型训练流程,直到DDPG模型满足预设收敛条件;
在模型训练流程,分布式模型训练装置还包括:
经验采样模块102,可以配置成用于对经验池进行采样,获得策略样本,其中,经验池用于存储每个终端设备根据自身的第一设备状态,通过第一在线网络生成的策略经验,其中,策略经验包括第一设备状态、动作、执行动作后的第二设备状态以及动作的即时奖励;
模型调整模块103,可以配置成用于根据策略样本,通过Critic网络调整第二在线网络的模型参数;
第一同步模块104,可以配置成用于当满足预设同步条件,则将调整后的第二在线网络同步至每个第一在线网络,以使每个同步后的第一在线网络与调整后的第二在线网络具有相同的参数。
需要说明的是,该分布式模型训练装置还可以包括其他软件功能模块,用于实现应用于管理设备的分布式模型训练方法的其他步骤或者子步骤;当然,模型迭代模块101、经验采样模块102、模型调整模块103以及第一同步模块104同样可以用于实现应用于管理设备的分布式模型训练方法的其他步骤或者子步骤。本示例不对此做具体的限定,本领域技术人员可以根据不同的软件模块划分标准进行适当调整。
本实施例还提供一种分布式模型训练方法,应用于决策系统中的终端设备。
终端设备与决策系统中的管理设备通信连接,决策系统部署有DDPG模型,DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络。如图6所示,方法可以包括:
步骤S101C,获取自身的第一设备状态。
S102C,根据第一设备状态,通过第一在线网络生成与第一设备状态相对应的动作。
S103C,将第一设备状态对应的策略经验存放至经验池。
其中,策略经验包括第一设备状态、动作、执行动作后的第二设备状态以及动作的即时奖励。
S104C,接收调整后的第二在线网络。
其中,管理设备根据策略样本,通过Critic网络调整第二在线网络的模型参数,获得调整后的第二在线网络,策略样本采样自经验池。
S105C,将调整后的第二在线网络同步至第一在线网络,以使同步后的第一在线网络与调整后的第二在线网络具有相同的模型参数。
基于与应用于终端设备的分布式模型训练方法相同的发明构思,本实施例还提供一种应用于终端设备的分布式模型训练装置。
终端设备与决策系统中的管理设备通信连接,决策系统部署有DDPG模型,DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络。
该分布式模型训练装置包括至少一个可以软件形式存储于存储器中的功能模块。如图7所示,从功能上划分,该分布式模型训练装置,可以包括:
状态获取模块201,可以配置成用于获取自身的第一设备状态;
策略生成模块202,可以配置成用于根据第一设备状态,通过第一在线网络生成与第一设备状态相对应的动作;
经验生成模块203,可以配置成用于将第一设备状态对应的策略经验存放至经验池,其中,策略经验包括第一设备状态、动作、执行动作后的第二设备状态以及动作的即时奖励;
第二同步模块204,可以配置成用于接收调整后的第二在线网络,其中,管理设备根据策略样本,通过Critic网络调整第二在线网络的模型参数,获得调整后的第二在线网络,策略样本采样自经验池。
第二同步模块204,还可以配置成用于将调整后的第二在线网络同步至第一在线网络,以使同步后的第一在线网络与调整后的第二在线网络具有相同的模型参数。
需要说明的是,该分布式模型训练装置还包括其他软件功能模块,用于实现应用于终端设备的分布式模型训练方法的其他步骤或者子步骤;同理,状态获取模块201、策略生成模块202、经验生成模块203以及第二同步模块204同样可以用于实现用于终端设备的分布式模型训练方法的其他步骤或者子步骤。本实施例不对此做具体的限定,本领域技术人员可以依据不同的软件模块划分标准进行适当调整。
本实施例还提供一种电子设备,电子设备包括处理器以及存储器,存储器存储有计算机程序。
当该电子设备为管理设备时,计算机程序被处理器执行时,实现上述管理设备运行的分布式模型训练方法。
当该电子设备为终端设备时,计算机程序被处理器执行时,实现上述终端设备运行的分布式模型训练方法。
本实施例还提该电子设备的一种结构示意图。如图8所示,该存储器320、处理器330、通信单元340。该存储器320、处理器330以及通信单元340各元件相互之间直接或间接地电性连接,以实现数据的传输或交互。例如,这些元件相互之间可通过一条或多条通讯总线或信号线实现电性连接。
其中,该存储器320可以是,但不限于,随机存取存储器(Random Access Memory,RAM),只读存储器(Read Only Memory,ROM),可编程只读存储器(Programmable Read-Only Memory,PROM),可擦除只读存储器(Erasable Programmable Read-Only Memory,EPROM),电可擦除只读存储器(Electric Erasable Programmable Read-Only Memory,EEPROM)等。其中,存储器320用于存储程序,该处理器330在接收到执行指令后,执行该程序。
该通信单元340用于通过网络收发数据。网络可以包括有线网络、无线网络、光纤网络、远程通信网络、内联网、因特网、局域网(Local Area Network,LAN)、广域网(Wide Area Network,WAN)、无线局域网(Wireless Local Area Networks,WLAN)、城域网(Metropolitan Area Network,MAN)、广域网(Wide Area Network,WAN)、公共电话交换网(Public Switched Telephone Network,PSTN)、蓝牙网络、ZigBee网络、或近场通信(Near Field Communication,NFC)网络等,或其任意组合。在一些实施例中,网络可以包括一个或多个网络接入点。例如,网络可以包括有线或无线网络接入点,例如基站和/或网络交换节点,服务请求处理系统的一个或多个组件可以通过该接入点连接到网络以交换数据和/或信息。
该处理器330可能是一种集成电路芯片,具有信号的处理能力,并且,该处理器可以包括一个或多个处理核(例如,单核处理器或多核处理器)。仅作为举例,上述处理器可以包括中央处理单元(Central Processing Unit,CPU)、专用集成电路(Application Specific Integrated Circuit,ASIC)、专用指令集处理器(Application Specific Instruction-set Processor,ASIP)、图形处理单元(Graphics Processing Unit,GPU)、物理处理单元(Physics Processing Unit,PPU)、数字信号处理器(Digital Signal Processor,DSP)、现场可编程门阵列(Field Programmable Gate Array,FPGA)、可编程逻辑器件(Programmable Logic Device,PLD)、控制器、微控制器单元、简化指令集计算机(Reduced Instruction Set Computing,RISC)、或微处理器等,或其任意组合。
本实施例还提供一种计算机存储介质,计算机存储介质存储有计算机程序,计算机程序被处理器执行时,实现上述管理设备运行的分布式模型训练方法或者上述终端设备运行的分布式模型训练方法。
综上所述,本实施例提供的部署有DDPG模型的策略系统中,包括管理设备以及多个终端设备。DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络。而用于训练第二在线网络的策略样本采集自经验池,由各终端设备通过自身部署的第一在线网络生成,因此,策略样本的状态空间仅涉及单个终端设备,因此,该方法不仅能够避免采集全局状态所需要的耗时,而且还能降低状态空间的维度。
在本申请所提供的实施例中,应该理解到,所揭露的装置和方法,也可以通过其它的方式实现。以上所描述的装置实施例仅仅是示意性的,例如,附图中的流程图和框图显示了根据本申请的多个实施例的装置、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或代码的一部分,所述模块、程序段或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现方式中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
另外,在本申请各个实施例中的各功能模块可以集成在一起形成一个独立的部分,也可以是各个模块单独存在,也可以两个或两个以上模块集成形成一个独立的部分。
所述功能如果以软件功能模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对相关技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的各种实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应所述以权利要求的保护范围为准。
工业实用性
本申请提供了一种分布式模型训练方法、系统及相关装置中,该系统部署包括管理设备以及多个终端设备且部署有DDPG模型;DDPG模型包括Critic网络以及Actor网络,Actor网络包括第一在线网络以及第二在线网络,每个终端设备部署有第一在线网络,管理设备部署有Critic网络以及第二在线网络;而用于训练第二在线网络的策略样本采集自经验池,由各终端设备通过自身部署的第一在线网络生成,因此,策略样本的状态空间仅涉及单个终端设备,因此,该方法不仅能够避免采集全局状态所需要的耗时,而且还能降低状态空间的维度。
此外,可以理解的是,本申请的分布式模型训练方法、系统及相关装置是可以重现的,并且可以用在多种工业应用中。例如,本申请的分布式模型训练方法、系统及相关装置可以用于控制领域。

Claims (20)

  1. 一种分布式模型训练方法,其特征在于,应用于部署有DDPG模型的决策系统,所述决策系统包括管理设备以及多个终端设备,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述方法包括:
    执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
    所述模型训练流程,包括:
    针对每个所述终端设备,所述终端设备根据自身的第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
    所述终端设备将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
    所述管理设备对所述经验池进行采样,获得策略样本;
    所述管理设备根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
    当满足预设同步条件,则所述管理设备将调整后的第二在线网络同步至每个所述第一在线网络,以使每个同步后的第一在线网络与所述调整后的第二在线网络具有相同的参数。
  2. 根据权利要求1所述的分布式模型训练方法,其特征在于,所述动作包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
    Figure PCTCN2022088702-appb-100001
    Figure PCTCN2022088702-appb-100002
    式中,
    Figure PCTCN2022088702-appb-100003
    表示终端设备u在时间片t执行所述终端任务的任务延时,
    Figure PCTCN2022088702-appb-100004
    表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
  3. 根据权利要求2所述的分布式模型训练方法,其特征在于,所述表达式R t还配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
    所述约束条件包括:
    执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
    所述终端任务只允许在所述终端设备执行或者所述服务器执行;
    执行所述终端任务的延时不能超过时长阈值;
    执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
  4. 一种决策系统,其特征在于,所述决策系统部署有DDPG模型,所述决策系统包括管理设备以及多个终端设备,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络;
    执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
    所述模型训练流程,包括:
    针对每个所述终端设备,所述终端设备用于根据自身的第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
    所述终端设备用于将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
    所述管理设备用于对所述经验池进行采样,获得策略样本;
    所述管理设备还用于根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
    当满足预设同步条件,则所述管理设备将调整后的第二在线网络同步至每个所述第一在线网络,以使每个同步后的第一在线网络与所述调整后的第二在线网络具有相同的参数。
  5. 根据权利要求4所述的决策系统,其特征在于,所述动作包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
    Figure PCTCN2022088702-appb-100005
    Figure PCTCN2022088702-appb-100006
    式中,
    Figure PCTCN2022088702-appb-100007
    表示终端设备u在时间片t执行所述终端任务的任务延时,
    Figure PCTCN2022088702-appb-100008
    表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
  6. 根据权利要求5所述的决策系统,其特征在于,所述表达式R t还配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
    所述约束条件包括:
    执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
    所述终端任务只允许在所述终端设备执行或者所述服务器执行;
    执行所述终端任务的延时不能超过时长阈值;
    执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
  7. 一种分布式模型训练方法,其特征在于,应用于决策系统中的管理设备,所述管理设备与所述决策系统中的多个终端设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述方法包括:
    执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
    所述模型训练流程,包括:
    对经验池进行采样,获得策略样本,其中,所述经验池用于存储每个所述终端设备根据自身的第一设备状态,通过所述第一在线网络生成的策略经验,其中,所述策略经验包括所述第一设备状态、动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
    根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
    当满足预设同步条件,则将调整后的第二在线网络同步至每个所述第一在线网络,以使每个同步后的第一在线网 络与所述调整后的第二在线网络具有相同的模型参数。
  8. 根据权利要求7所述的分布式模型训练方法,其特征在于,所述动作包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
    Figure PCTCN2022088702-appb-100009
    Figure PCTCN2022088702-appb-100010
    式中,
    Figure PCTCN2022088702-appb-100011
    表示终端设备u在时间片t执行所述终端任务的任务延时,
    Figure PCTCN2022088702-appb-100012
    表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
  9. 根据权利要求8所述的分布式模型训练方法,其特征在于,所述表达式R t还配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
    所述约束条件包括:
    执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
    所述终端任务只允许在所述终端设备执行或者所述服务器执行;
    执行所述终端任务的延时不能超过时长阈值;
    执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
  10. 一种分布式模型训练方法,其特征在于,应用于决策系统中的终端设备,所述终端设备与所述决策系统中的管理设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述方法包括:
    获取自身的第一设备状态;
    根据所述第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
    将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
    接收调整后的第二在线网络,其中,所述管理设备根据策略样本,通过所述Critic网络调整所述第二在线网络的模型参数,获得所述调整后的第二在线网络,所述策略样本采样自所述经验池;
    将所述调整后的第二在线网络同步至所述第一在线网络,以使同步后的第一在线网络与所述调整后的第二在线网络具有相同的模型参数。
  11. 根据权利要求10所述的分布式模型训练方法,其特征在于,所述动作包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
    Figure PCTCN2022088702-appb-100013
    Figure PCTCN2022088702-appb-100014
    式中,
    Figure PCTCN2022088702-appb-100015
    表示终端设备u在时间片t执行所述终端任务的任务延时,
    Figure PCTCN2022088702-appb-100016
    表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
  12. 根据权利要求11所述的分布式模型训练方法,其特征在于,所述表达式R t还配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
    所述约束条件包括:
    执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
    所述终端任务只允许在所述终端设备执行或者所述服务器执行;
    执行所述终端任务的延时不能超过时长阈值;
    执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
  13. 一种分布式模型训练装置,其特征在于,应用于决策系统中的管理设备,所述管理设备与所述决策系统中的多个终端设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,每个所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述分布式模型训练装置包括:
    模型迭代模块,配置成用于执行至少一次模型训练流程,直到所述DDPG模型满足预设收敛条件;
    在所述模型训练流程,所述分布式模型训练装置还包括:
    经验采样模块,配置成用于对经验池进行采样,获得策略样本,其中,所述经验池用于存储每个所述终端设备根据自身的第一设备状态,通过所述第一在线网络生成的策略经验,其中,所述策略经验包括所述第一设备状态、动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
    模型调整模块,配置成用于根据所述策略样本,通过所述Critic网络调整所述第二在线网络的模型参数;
    第一同步模块,配置成用于当满足预设同步条件,则将调整后的第二在线网络同步至每个所述第一在线网络,以使每个同步后的第一在线网络与所述调整后的第二在线网络具有相同的参数。
  14. 根据权利要求13所述的分布式模型训练装置,其特征在于,所述动作包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
    Figure PCTCN2022088702-appb-100017
    Figure PCTCN2022088702-appb-100018
    式中,
    Figure PCTCN2022088702-appb-100019
    表示终端设备u在时间片t执行所述终端任务的任务延时,
    Figure PCTCN2022088702-appb-100020
    表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
  15. 根据权利要求14所述的分布式模型训练装置,其特征在于,所述表达式R t还配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
    所述约束条件包括:
    执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
    所述终端任务只允许在所述终端设备执行或者所述服务器执行;
    执行所述终端任务的延时不能超过时长阈值;
    执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
  16. 一种分布式模型训练装置,其特征在于,应用于决策系统中的终端设备,所述终端设备与所述决策系统中的管 理设备通信连接,所述决策系统部署有DDPG模型,所述DDPG模型包括Critic网络以及Actor网络,所述Actor网络包括第一在线网络以及第二在线网络,所述终端设备部署有所述第一在线网络,所述管理设备部署有所述Critic网络以及所述第二在线网络,所述分布式模型训练装置包括:
    状态获取模块,配置成用于获取自身的第一设备状态;
    策略生成模块,配置成用于根据所述第一设备状态,通过所述第一在线网络生成与所述第一设备状态相对应的动作;
    经验生成模块,配置成用于将所述第一设备状态对应的策略经验存放至经验池,其中,所述策略经验包括所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励;
    第二同步模块,配置成用于接收调整后的第二在线网络,其中,所述管理设备根据策略样本,通过所述Critic网络调整所述第二在线网络的模型参数,获得所述调整后的第二在线网络,所述策略样本采样自所述经验池;
    所述第二同步模块,还配置成用于将所述调整后的第二在线网络同步至所述第一在线网络,以使同步后的第一在线网络与所述调整后的第二在线网络具有相同的模型参数。
  17. 根据权利要求16所述的分布式模型训练装置,其特征在于,所述动作包括终端任务本地执行或者卸载至服务器执行,所述终端设备将所述第一设备状态、所述动作、执行所述动作后的第二设备状态以及所述动作的即时奖励存放至经验池之前,所述管理设备通过表达式R t确定所述动作的即时奖励,所述表达式R t为:
    Figure PCTCN2022088702-appb-100021
    Figure PCTCN2022088702-appb-100022
    式中,
    Figure PCTCN2022088702-appb-100023
    表示终端设备u在时间片t执行所述终端任务的任务延时,
    Figure PCTCN2022088702-appb-100024
    表示终端设备u在时间片t执行所述终端任务的任务能耗,λ表示预设权重。
  18. 根据权利要求17所述的分布式模型训练装置,其特征在于,所述表达式R t还配置有至少一条约束条件,当满足任意一条所述的约束条件时,则生成惩罚因子,所述惩罚因子用于减小所述即时奖励;
    所述约束条件包括:
    执行所述终端任务需要的计算资源不能超过所述终端设备与所述服务器各自的资源上限;
    所述终端任务只允许在所述终端设备执行或者所述服务器执行;
    执行所述终端任务的延时不能超过时长阈值;
    执行所述终端任务的能耗不能超过终端设备与所述服务器各自的储能上限。
  19. 一种电子设备,其特征在于,所述电子设备包括处理器以及存储器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,实现权利要求7至9或者权利要求10至12所述的分布式模型训练方法。
  20. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有计算机程序,所述计算机程序被处理器执行时,实现权利要求7至9或者权利要求10至12所述的分布式模型训练方法。
PCT/CN2022/088702 2021-11-10 2022-04-24 分布式模型训练方法、系统及相关装置 WO2023082552A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111323472.7 2021-11-10
CN202111323472.7A CN113762512B (zh) 2021-11-10 2021-11-10 分布式模型训练方法、系统及相关装置

Publications (1)

Publication Number Publication Date
WO2023082552A1 true WO2023082552A1 (zh) 2023-05-19

Family

ID=78784910

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/088702 WO2023082552A1 (zh) 2021-11-10 2022-04-24 分布式模型训练方法、系统及相关装置

Country Status (2)

Country Link
CN (1) CN113762512B (zh)
WO (1) WO2023082552A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117477607A (zh) * 2023-12-28 2024-01-30 国网江西综合能源服务有限公司 一种含智能软开关的配电网三相不平衡治理方法及系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762512B (zh) * 2021-11-10 2022-03-18 北京航空航天大学杭州创新研究院 分布式模型训练方法、系统及相关装置
CN114862656B (zh) * 2022-05-18 2023-05-05 北京百度网讯科技有限公司 基于多gpu的分布式深度学习模型训练代价的获取方法
CN117806835B (zh) * 2024-02-29 2024-06-04 浪潮电子信息产业股份有限公司 一种任务分配方法、装置及电子设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111030861A (zh) * 2019-12-11 2020-04-17 中移物联网有限公司 一种边缘计算分布式模型训练方法、终端和网络侧设备
CN111858009A (zh) * 2020-07-30 2020-10-30 航天欧华信息技术有限公司 基于迁移和强化学习的移动边缘计算系统任务调度方法
KR20200126822A (ko) * 2019-04-30 2020-11-09 중앙대학교 산학협력단 심층 강화학습 기반 mmWave 차량 네트워크의 비디오 품질을 고려한 선제적 캐싱정책 학습 기법 및 그의 시스템
CN113392971A (zh) * 2021-06-11 2021-09-14 武汉大学 策略网络训练方法、装置、设备及可读存储介质
CN113762512A (zh) * 2021-11-10 2021-12-07 北京航空航天大学杭州创新研究院 分布式模型训练方法、系统及相关装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109407644A (zh) * 2019-01-07 2019-03-01 齐鲁工业大学 一种用于制造企业多Agent协同控制方法及系统
CN110488759B (zh) * 2019-08-09 2020-08-04 西安交通大学 一种基于Actor-Critic算法的数控机床进给控制补偿方法
CN111786713B (zh) * 2020-06-04 2021-06-08 大连理工大学 一种基于多智能体深度强化学习的无人机网络悬停位置优化方法
CN112261674A (zh) * 2020-09-30 2021-01-22 北京邮电大学 一种基于移动边缘计算及区块链协同赋能的物联网场景的性能优化方法
CN112995950B (zh) * 2021-02-07 2022-03-29 华南理工大学 一种车联网中基于深度强化学习的资源联合分配方法
CN113364854B (zh) * 2021-06-02 2022-07-15 东南大学 移动边缘计算网络中基于分布式强化学习的隐私保护动态边缘缓存设计方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200126822A (ko) * 2019-04-30 2020-11-09 중앙대학교 산학협력단 심층 강화학습 기반 mmWave 차량 네트워크의 비디오 품질을 고려한 선제적 캐싱정책 학습 기법 및 그의 시스템
CN111030861A (zh) * 2019-12-11 2020-04-17 中移物联网有限公司 一种边缘计算分布式模型训练方法、终端和网络侧设备
CN111858009A (zh) * 2020-07-30 2020-10-30 航天欧华信息技术有限公司 基于迁移和强化学习的移动边缘计算系统任务调度方法
CN113392971A (zh) * 2021-06-11 2021-09-14 武汉大学 策略网络训练方法、装置、设备及可读存储介质
CN113762512A (zh) * 2021-11-10 2021-12-07 北京航空航天大学杭州创新研究院 分布式模型训练方法、系统及相关装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117477607A (zh) * 2023-12-28 2024-01-30 国网江西综合能源服务有限公司 一种含智能软开关的配电网三相不平衡治理方法及系统
CN117477607B (zh) * 2023-12-28 2024-04-12 国网江西综合能源服务有限公司 一种含智能软开关的配电网三相不平衡治理方法及系统

Also Published As

Publication number Publication date
CN113762512A (zh) 2021-12-07
CN113762512B (zh) 2022-03-18

Similar Documents

Publication Publication Date Title
WO2023082552A1 (zh) 分布式模型训练方法、系统及相关装置
JP6942397B2 (ja) モバイルエッジコンピューティングのシナリオでシングルタスクオフロード戦略を策定する方法
Liu et al. Cooperative offloading and resource management for UAV-enabled mobile edge computing in power IoT system
Zou et al. A3C-DO: A regional resource scheduling framework based on deep reinforcement learning in edge scenario
Chen et al. Deep reinforcement learning for computation offloading in mobile edge computing environment
CN109829332B (zh) 一种基于能量收集技术的联合计算卸载方法及装置
Liu et al. A reinforcement learning-based resource allocation scheme for cloud robotics
Zhu et al. BLOT: Bandit learning-based offloading of tasks in fog-enabled networks
CN111367657B (zh) 一种基于深度强化学习的计算资源协同合作方法
CN111274036A (zh) 一种基于速度预测的深度学习任务的调度方法
CN111090631B (zh) 分布式环境下的信息共享方法、装置和电子设备
CN115277689B (zh) 一种基于分布式联邦学习的云边网络通信优化方法及系统
CN115175217A (zh) 一种基于多智能体的资源分配和任务卸载优化方法
CN112214301B (zh) 面向智慧城市基于用户偏好的动态计算迁移方法及装置
CN114281718A (zh) 一种工业互联网边缘服务缓存决策方法及系统
Liu et al. Fine-grained offloading for multi-access edge computing with actor-critic federated learning
Kim One‐on‐one contract game–based dynamic virtual machine migration scheme for Mobile Edge Computing
Gao et al. Com-DDPG: A multiagent reinforcement learning-based offloading strategy for mobile edge computing
CN115408072A (zh) 基于深度强化学习的快速适应模型构建方法及相关装置
Henna et al. Distributed and collaborative high-speed inference deep learning for mobile edge with topological dependencies
CN115480882A (zh) 一种分布式边缘云资源调度方法及系统
CN110366210A (zh) 一种针对有状态数据流应用的计算卸载方法
CN113543225A (zh) 一种电力无线专网安全动态资源分配的方法和系统
Ji et al. Downlink scheduler for delay guaranteed services using deep reinforcement learning
CN104601424B (zh) 设备控制网中利用概率模型的主被动数据收集装置及方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891364

Country of ref document: EP

Kind code of ref document: A1