WO2024007499A1 - Reinforcement learning agent training method and apparatus, and modal bandwidth resource scheduling method and apparatus - Google Patents

Reinforcement learning agent training method and apparatus, and modal bandwidth resource scheduling method and apparatus

Info

Publication number
WO2024007499A1
WO2024007499A1 (PCT/CN2022/130998)
Authority
WO
WIPO (PCT)
Prior art keywords
network
action
reinforcement learning
execution
modal
Prior art date
Application number
PCT/CN2022/130998
Other languages
French (fr)
Chinese (zh)
Inventor
沈丛麒
张慧峰
姚少峰
徐琪
张汝云
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to US18/359,862 (published as US20240015079A1)
Publication of WO2024007499A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/70Admission control; Resource allocation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/12Avoiding congestion; Recovering from congestion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/50Overload detection or protection within a single switching element

Definitions

  • the invention belongs to the field of network management and control technology, and in particular relates to a reinforcement learning agent training method, a modal bandwidth resource scheduling method and a device.
  • each technology system is a network mode.
  • Each network mode shares network resources. If not controlled, it will cause each network mode to directly compete for network resources, such as bandwidth, etc., which will directly affect the communication transmission quality of some key modes. Therefore, reasonable management and control of each mode in the network is one of the necessary prerequisites to ensure the stable operation of multi-modal networks.
  • the current mainstream technology is to control the proportion of bandwidth used by switch ports and limit the size of egress traffic to avoid network overload.
  • the purpose of the embodiments of this application is to provide reinforcement learning agent training methods, modal bandwidth resource scheduling methods and devices, so as to solve the technical problem in related technologies that modal resources in multi-modal networks cannot be intelligently controlled.
  • a reinforcement learning agent training method in a multi-modal network, including:
  • S11: Construct the global network feature state, the actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes a new execution network, an old execution network and an action evaluation network;
  • S13: In each step, obtain the global network feature state, input it into the new execution network, control the SDN switch to execute the action output by the new execution network, obtain the state and reward value of the network after the SDN switch executes the action, and store the action, the reward value and the respective states of the two time periods before and after the action into the experience pool;
  • S15: Assign the network parameters of the new execution network to the old execution network, and update the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed;
  • the global network feature state includes the number of packets of each mode, the average packet size of each mode, the average delay of each flow, the number of data packets in each flow, the size of each flow, and the average packet size of each flow.
  • the action is the sum of the mean value of the action vector selected in the corresponding global network feature state and noise.
  • update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action including:
  • calculate the discounted reward of the state before each action based on the expected value, the corresponding reward value and a preset decay discount;
  • update the network parameters for executing the new network based on all actions in the experience pool and the status before executing the action including:
  • a reinforcement learning agent training device in a multi-modal network, including:
  • a building module, used to construct the global network feature state, the actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes a new execution network, an old execution network and an action evaluation network;
  • an execution module, used to obtain the global network feature state in each step, input it into the new execution network, control the SDN switch to execute the action output by the new execution network, obtain the state and reward value of the network after the SDN switch executes the action, and store the action, the reward value and the respective states of the two time periods before and after the action into the experience pool;
  • a first update module, used to update the network parameters of the action evaluation network based on all reward values in the experience pool and the states before the actions were executed;
  • a second update module, used to assign the network parameters of the new execution network to the old execution network, and update the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed;
  • a repeat module, used to repeat the process from the execution module to the second update module until the bandwidth occupied by each mode in the multi-modal network guarantees communication transmission quality while not overloading the network egress.
  • a modal bandwidth resource scheduling method in a multi-modal network including:
  • the resources occupied by each mode are scheduled.
  • a modal bandwidth resource scheduling device in a multi-modal network including:
  • An application module configured to apply the reinforcement learning agent trained according to the reinforcement learning agent training method in the multi-modal network described in the first aspect to the multi-modal network;
  • a scheduling module is used to schedule the resources occupied by each mode according to the scheduling strategy output by the reinforcement learning agent.
  • an electronic device including:
  • one or more processors;
  • a memory, used to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the reinforcement learning agent training method in a multi-modal network or the modal bandwidth resource scheduling method in a multi-modal network.
  • a computer-readable storage medium storing instructions; when the instructions are executed by a processor, the steps of the reinforcement learning agent training method in a multi-modal network or of the modal bandwidth resource scheduling method in a multi-modal network are implemented.
  • this application uses the idea of reinforcement learning algorithms to construct global network feature states, execution actions and reward functions suited to multi-modal networks, allowing the reinforcement learning agent to continuously interact with the network and output the optimal execution action according to changes in the network state and reward value, so that the allocation of multi-modal network resources meets expectations and network operating performance is guaranteed; this has strong practical significance for promoting intelligent management and control of multi-modal networks.
  • Figure 1 is a flow chart of a reinforcement learning agent training method in a multi-modal network according to an exemplary embodiment.
  • FIG. 2 is a flowchart of step S14 according to an exemplary embodiment.
  • Figure 3 is a flow chart of "updating the network parameters of the new network based on all actions in the experience pool and the state before executing the action" according to an exemplary embodiment.
  • Figure 4 is a block diagram of a reinforcement learning agent training device in a multi-modal network according to an exemplary embodiment.
  • Figure 5 is a flow chart of a modal bandwidth resource scheduling method in a multi-modal network according to an exemplary embodiment.
  • Figure 6 is a block diagram of a modal bandwidth resource scheduling device in a multi-modal network according to an exemplary embodiment.
  • FIG. 7 is a schematic diagram of an electronic device according to an exemplary embodiment.
  • first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • first information may also be called second information, and similarly, the second information may also be called first information.
  • the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
  • Figure 1 is a flow chart of a reinforcement learning agent training method in a multi-modal network according to an exemplary embodiment. As shown in Figure 1, this method is applied to reinforcement learning agents and may include the following steps:
  • Step S11 Construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:
  • Step S12 Set the maximum number of steps for a round of training
  • Step S13: In each step, obtain the global network feature state, input it into the new execution network, control the SDN switch to execute the action output by the new execution network, obtain the state and reward value of the network after the SDN switch executes the action, and store the action, the reward value and the respective states of the two time periods before and after the action into the experience pool;
  • Step S14 Update the network parameters of the action evaluation network based on all reward values in the experience pool and the status before executing the action;
  • Step S15 Assign the network parameters of the new network to the old network, and update the network parameters of the new network based on all actions in the experience pool and the status before the action is executed;
  • Step S16 Repeat steps S13-S15 until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality and does not overload the network egress.
  • this application uses the idea of reinforcement learning algorithms to construct global network feature states, execution actions and reward functions suited to multi-modal networks, allowing the reinforcement learning agent to continuously interact with the network and output the optimal execution action according to changes in the network state and reward value, so that the allocation of multi-modal network resources meets expectations and network operating performance is guaranteed; this has strong practical significance for promoting intelligent management and control of multi-modal networks.
  • in step S11, the global network feature state, the actions and the deep neural network model required for training the reinforcement learning agent are constructed, where the deep neural network model includes a new execution network, an old execution network and an action evaluation network.
  • the global network feature state includes the number of packets of each mode, the average packet size of each mode, the average delay of each flow, the number of data packets in each flow, the size of each flow, and the average packet size of each flow.
  • the action is the sum of the mean value of the action vector selected in the corresponding global network feature state and noise.
  • Let a t represent the action of the tth ⁇ t second.
  • the above actions are used to adjust the bandwidth of the stream, and then schedule the resources occupied by each mode to ensure that the network communication quality meets the expected goals.
  • the physical meaning of the action is the proportion of each flow in each mode reaching the exit area.
  • P represent the number of modes running in the network. Since one mode corresponds to a network technology system, it is assumed that the number of modes running in the network is fixed.
  • Let F m represent the maximum number of flows in each mode, then the output action space dimension is P ⁇ F m .
  • F(p, t) represent the number of flows based on the p-th mode within the t-th ⁇ t second, and satisfy F(p, t) ⁇ F m . Therefore, within the tth ⁇ t seconds, only P ⁇ F(p,t) elements have corresponding flows, so their values are 0.1-1, while other elements have values 0 because they have no actual flows.
  • the same architecture can be used for the new execution network, the old execution network and the action evaluation network.
  • deep neural networks, convolutional neural networks, recurrent neural networks and other architectures can be used.
  • the parameters are randomly initialized after the network construction is completed.
  • step S12 set the maximum number of steps for a round of training
  • the maximum number of steps T for each round of training is set.
  • the value of T is related to factors such as the number of modes in the network. It is necessary to try multiple times during the training process to select a more preferred value. For example, assuming that the number of modes in the network is 8, it is more optimal to obtain T of 120 after many attempts.
  • step S13 in each step, the global network characteristic state is obtained, the global network characteristic state is input into the execution of the new network, the SDN switch is controlled to execute the action of the execution of the new network output, and the SDN is obtained
  • the status and reward value of the network after the switch performs the action, and the action, reward value, and respective states in the two time periods before and after the action are executed are stored in the experience pool;
  • the reinforcement learning agent uses the controller to obtain, at a sampling interval of Δt seconds, the global network features of the Δt-second time period. The current network state s_t is input into the new execution network, which outputs the mean μ(s_t|θ_μ) and the variance N of the execution action based on the current parameters θ_μ; the output execution action is expressed as a_t = μ(s_t|θ_μ) + N, where μ(s_t|θ_μ) denotes the mean value of the action vector selected by the reinforcement learning agent in state s_t, θ_μ denotes the parameters of the new execution network, and N denotes the noise, a normally distributed term that decays over time.
  • the SDN controller sets the bandwidth for each flow according to the proportion set in the execution action, converts it into instructions that can be recognized by the SDN switch, and issues the configuration.
  • the SDN switch receives the configuration and forwards each mode according to the configured bandwidth. If a flow requires more bandwidth than the configured bandwidth, part of it will be randomly discarded to meet the allocated bandwidth.
  • the reinforcement learning agent obtains the new state s t+1 and reward value rt of the network after executing the action, and stores ( s t , a t , r t , s t+1 ) into the experience pool.
  • the reinforcement learning agent will perform the process of step S13 T times. During this process, the network parameters are not updated, and the reward value r t is the value of the reward function calculated by the reinforcement learning agent.
  • the reward function is defined in terms of the following quantities:
  • η_p is the weight coefficient of the p-th mode, whose value is set manually according to the network operation quality target;
  • v_p(i, t) is the flow velocity of the i-th flow in the p-th mode during the t-th Δt second, which can be obtained from the global network feature state;
  • β_p(i, t) is the proportion of the i-th flow in the p-th mode arriving at the server in the t-th Δt second, which can be obtained from the execution action;
  • ξ is the upper limit of traffic that the egress area can carry during normal operation.
  • the setting of the above reward function can allocate appropriate bandwidth according to the communication transmission conditions of different modes in the network while preventing each mode from seizing network resources and causing network overload.
  • in terms of bandwidth resource allocation, the ratio of the number of flows of each mode arriving at the server is used to characterize the transmission situation of that mode. If transmission in a mode is congested, even if its weight coefficient is not high or the overall network is not congested for the time being, the reward function will drive subsequent actions to allocate greater bandwidth to that mode. If congestion occurs in multiple modes in the network, the modes with higher weight coefficients obtain greater bandwidth, which is in line with actual needs, that is, priority is given to ensuring the more important communication services.
  • the setting of the above reward function can ensure the normal operation of the network, and at the same time dynamically adjust the bandwidth resource allocation according to the transmission status of each mode in the network.
  • step S14 update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action;
  • this step may include the following sub-steps:
  • Step S21 Input all the states in the experience pool before executing the action into the action evaluation network to obtain the corresponding expected value
  • the expected value represents the evaluation of the network state at time t, that is, the instantaneous value of the current state to achieve the goal set by the reward function.
  • Step S22 Calculate the discount reward of the state before each action based on the expected value, the corresponding reward value and the preset attenuation discount;
  • Step S23 Calculate the difference between the discount reward and the expected value, calculate the mean square error based on all differences, and use the obtained mean square error as the first loss value to update the network parameters of the action evaluation network;
  • This difference represents the gap between immediate value and long-term value. This gap is used to adjust the parameters of the subsequent action evaluation network and optimize the output execution action. The smaller the gap, the closer the action network is to the optimal.
  • in step S15, the network parameters of the new execution network are assigned to the old execution network, and the network parameters of the new execution network are updated based on all actions in the experience pool and the states before the actions were executed;
  • Step S31 Input all the states in the experience pool before executing the action into the old execution network and the new execution network respectively to obtain the old distribution of execution actions and the new distribution of execution actions;
  • the states s_t in the samples stored in the experience pool are input into the old execution network and the new execution network, and the old distribution of execution actions and the new distribution of execution actions are obtained respectively. The old and new execution networks are built on the same neural network architecture, with only their parameters differing. Because the input of these two neural networks is the network state sample s_t and the output is the mean μ(s_t|θ_μ) of a normal distribution, the old probability distribution and the new probability distribution of the action can be determined from the outputs of the two execution networks.
  • Step S32 Calculate the first probability and the second probability that each action in the experience pool appears in the corresponding old distribution of execution actions and the new distribution of execution actions respectively;
  • Step S33 Calculate the ratio of the second probability to the first probability
  • This ratio characterizes the parameter differences between the old and new execution networks. If the parameters between the old and new networks are consistent, it means that the execution network has been updated to the optimum. Because we hope that the parameters of the execution network can be continuously updated and optimized, the calculated ratio will be used to update the network parameters.
  • Step S34 Multiply all the ratios by the corresponding differences and average the value as the second loss value to update the network parameters of the new network;
  • ratio_t is multiplied by R(t) - V(s_t) and the average is taken as the second loss value to update the parameters of the new execution network.
  • Ratio t represents the update direction of the action network
  • R(t) - V(s_t) represents the parameter update direction of the evaluation network. Because the optimization of the output execution actions needs to be combined with changes in the network state, the product of the two is used to update the parameters of the new execution network, so that it learns the latest network state and outputs actions suited to the network state in the next step.
  • steps S13-S15 are repeated until the bandwidth occupied by each mode in the multi-modal network ensures the quality of communication transmission and does not overload the network egress;
  • the process of S13-S15 is a round of training process, and the next round of training continues until each mode reasonably occupies the bandwidth, ensuring the quality of communication transmission while not overloading the network outlet.
  • after sufficient training, the reinforcement learning agent has fully learned the optimal strategy under different network environments, that is, the execution actions that can achieve the set expected goals.
  • this application also provides embodiments of the reinforcement learning agent training device in the multi-modal network.
  • Figure 4 is a block diagram of a reinforcement learning agent training device in a multi-modal network according to an exemplary embodiment.
  • the device is applied to reinforcement learning agents and may include:
  • Building module 21 is used to construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:
  • Setting module 22 is used to set the maximum number of steps for a round of training
  • Execution module 23 is used to obtain the global network feature state in each step, input it into the new execution network, control the SDN switch to execute the action output by the new execution network, obtain the state and reward value of the network after the SDN switch executes the action, and store the action, the reward value and the respective states of the two time periods before and after the action into the experience pool;
  • the first update module 24 is used to update the network parameters of the action evaluation network based on all reward values in the experience pool and the status before executing the action;
  • the second update module 25 is used to assign the network parameters of the new execution network to the old execution network, and update the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed;
  • the repeat module 26 is used to repeat the process from the execution module to the second update module until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality while not overloading the network egress.
  • Figure 5 is a flow chart of a modal bandwidth resource scheduling method in a multi-modal network according to an exemplary embodiment. As shown in Figure 5, the method may include the following steps:
  • Step S41 Apply the reinforcement learning agent trained according to the reinforcement learning agent training method in the multi-modal network described in Embodiment 1 to the multi-modal network;
  • Step S42 Schedule the resources occupied by each mode according to the scheduling strategy output by the reinforcement learning agent.
  • this application applies the trained reinforcement learning agent in the modal bandwidth resource scheduling method, which can adapt to networks with different characteristics, can be used for intelligent management and control of multi-modal networks, and has good adaptability and scheduling performance.
  • this application also provides an embodiment of a modal bandwidth resource scheduling device in a multi-modal network.
  • Figure 6 is a block diagram of a modal bandwidth resource scheduling device in a multi-modal network according to an exemplary embodiment.
  • the device may include:
  • the application module 31 is used to apply the reinforcement learning agent trained by the reinforcement learning agent training method in the multi-modal network according to Embodiment 1 to the multi-modal network;
  • the scheduling module 32 is used to schedule the resources occupied by each mode according to the scheduling policy output by the reinforcement learning agent.
  • since the device embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units, that is, they may be located in one location or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this application. Persons of ordinary skill in the art can understand and implement the solution without creative effort.
  • this application also provides an electronic device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the above-mentioned reinforcement learning agent training method in a multi-modal network or the modal bandwidth resource scheduling method in a multi-modal network.
  • the reinforcement learning agent training method in a multi-modal network or the modal bandwidth resource scheduling method in a multi-modal network provided by the embodiments of the present invention can run on any device with data processing capabilities. In addition to a processor and a memory, the device on which the embodiment runs may also include other hardware according to its actual functions, which will not be described in detail here.
  • this application also provides a computer-readable storage medium on which computer instructions are stored; when the instructions are executed by a processor, the above-mentioned reinforcement learning agent training method in a multi-modal network or the modal bandwidth resource scheduling method in a multi-modal network is implemented.
  • the computer-readable storage medium may be an internal storage unit of any device with data processing capabilities as described in any of the foregoing embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium can also be an external storage device of the device with data processing capabilities, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card equipped on the device.
  • the computer-readable storage medium may also include an internal storage unit of any device with data processing capabilities and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.

Abstract

Disclosed in the present invention are a reinforcement learning agent training method and apparatus, and a modality bandwidth resource scheduling method and apparatus. By means of the reinforcement learning agent training method, in a polymorphic network, the latest global network feature is acquired by using continuous interaction between a reinforcement learning agent and a network environment, and an updated action is output. A bandwidth occupied by a modality is adjusted, and a reward value is set to determine an optimization target for an intelligent agent, such that modality scheduling is realized, and the rational use of polymorphic network resources is ensured. The trained reinforcement learning agent is applied to a modality bandwidth resource scheduling method, can be adaptive to networks having different features, can be used for intelligent management and control of a polymorphic network, and has good adaptability and scheduling performance.

Description

Reinforcement learning agent training method, modal bandwidth resource scheduling method and device
Technical field
The invention belongs to the field of network management and control technology, and in particular relates to a reinforcement learning agent training method, a modal bandwidth resource scheduling method and a device.
Background
In a multi-modal network, multiple network technology systems run at the same time, and each technology system constitutes a network mode. All network modes share the network resources; without control, the modes will directly compete for resources such as bandwidth, which directly affects the communication transmission quality of some key modes. Therefore, reasonable management and control of each mode in the network is one of the necessary prerequisites for ensuring the stable operation of a multi-modal network.
To meet the above need, the current mainstream technique is to control the proportion of bandwidth used by switch ports and to limit the egress traffic so as to avoid network overload.
In the process of realizing the present invention, the inventors found that the existing technology has at least the following problems:
Such static policies (for example, limiting the bandwidth usage ratio to not exceed a certain maximum value) cannot adapt to dynamic changes of the network modes. In an actual network, the traffic of individual modes is very likely to grow because of business changes, and the original static policy then no longer applies.
Summary of the invention
The purpose of the embodiments of this application is to provide a reinforcement learning agent training method, a modal bandwidth resource scheduling method and devices, so as to solve the technical problem in the related art that modal resources in a multi-modal network cannot be intelligently managed and controlled.
According to the first aspect of the embodiments of the present application, a reinforcement learning agent training method in a multi-modal network is provided, including:
S11: Construct the global network feature state, the actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes a new execution network, an old execution network and an action evaluation network;
S12: Set the maximum number of steps for one round of training;
S13: In each step, obtain the global network feature state, input the global network feature state into the new execution network, control the SDN switch to execute the action output by the new execution network, obtain the state and reward value of the network after the SDN switch executes the action, and store the action, the reward value and the respective states of the two time periods before and after the action into the experience pool;
S14: Update the network parameters of the action evaluation network based on all reward values in the experience pool and the states before the actions were executed;
S15: Assign the network parameters of the new execution network to the old execution network, and update the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed;
S16: Repeat steps S13-S15 until the bandwidth occupied by each mode in the multi-modal network guarantees communication transmission quality while not overloading the network egress.
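To make the relationship between steps S11-S16 easier to follow, the sketch below shows the loop structure they describe in Python. It is an illustration only: the environment object and its methods (get_global_state, apply_action_via_sdn, observe, bandwidth_allocation_satisfactory), the select_action method and the update_critic/update_actor callables are hypothetical names, and the actor/critic objects are assumed to be PyTorch-style modules.

```python
# Structural sketch of one training run (steps S11-S16).
# All object and method names here are illustrative, not part of the original disclosure.
def train_agent(env, actor_new, actor_old, critic, update_critic, update_actor,
                max_rounds, T):
    for _round in range(max_rounds):                        # S16: repeat rounds of S13-S15
        pool = []                                           # experience pool for this round
        s_t = env.get_global_state()                        # global network feature state
        for step in range(T):                               # S12: at most T steps per round
            a_t = actor_new.select_action(s_t, step)        # S13: mean action plus decaying noise
            env.apply_action_via_sdn(a_t)                   # SDN controller sets per-flow bandwidth
            s_next, r_t = env.observe()                     # state and reward after the action
            pool.append((s_t, a_t, r_t, s_next))            # store (s_t, a_t, r_t, s_{t+1})
            s_t = s_next
        update_critic(critic, pool)                         # S14: update the action evaluation network
        actor_old.load_state_dict(actor_new.state_dict())   # S15: copy new -> old execution network
        update_actor(actor_new, actor_old, critic, pool)    # S15: update the new execution network
        if env.bandwidth_allocation_satisfactory():         # S16: stop once each mode's bandwidth
            break                                           # preserves quality without egress overload
```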
Further, the global network feature state includes the number of packets of each mode, the average packet size of each mode, the average delay of each flow, the number of data packets in each flow, the size of each flow, and the average packet size of each flow.
Further, the action is the sum of the mean value of the action vector selected in the corresponding global network feature state and a noise term.
Further, updating the network parameters of the action evaluation network based on all reward values in the experience pool and the states before the actions were executed includes:
inputting all the pre-action states in the experience pool into the action evaluation network to obtain the corresponding expected values;
calculating the discounted reward of each pre-action state based on the expected value, the corresponding reward value and a preset decay discount;
calculating the difference between the discounted reward and the expected value, calculating the mean square error over all differences, and using the obtained mean square error as the first loss value to update the network parameters of the action evaluation network.
Further, updating the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed includes:
inputting all the pre-action states in the experience pool into the old execution network and the new execution network respectively, to obtain the old distribution of execution actions and the new distribution of execution actions;
calculating the first probability and the second probability with which each action in the experience pool appears in the corresponding old distribution and new distribution of execution actions, respectively;
calculating the ratio of the second probability to the first probability;
multiplying all the ratios by the corresponding differences and taking the average as the second loss value to update the network parameters of the new execution network.
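The probability ratio and the second loss value described above can be sketched as follows, assuming for illustration that both execution networks output the mean of a normal action distribution with a known standard deviation sigma; the distribution parameters, the sign convention (the value is negated so that gradient descent increases the surrogate objective) and the optimizer are not specified in this application and are assumptions.

```python
import torch
import torch.distributions as D

def second_loss(actor_new, actor_old, states, actions, differences, sigma=0.1):
    """Second loss value: probability ratio between the new and old execution networks,
    multiplied by the corresponding differences R(t) - V(s_t) and averaged.

    states, actions, differences: tensors built from the experience pool.
    sigma: assumed standard deviation of the normal action distribution (illustrative).
    """
    mu_new = actor_new(states)                     # new distribution of execution actions
    with torch.no_grad():
        mu_old = actor_old(states)                 # old distribution of execution actions
    dist_new = D.Normal(mu_new, sigma)
    dist_old = D.Normal(mu_old, sigma)
    logp_new = dist_new.log_prob(actions).sum(dim=-1)   # second probability (log)
    logp_old = dist_old.log_prob(actions).sum(dim=-1)   # first probability (log)
    ratio = torch.exp(logp_new - logp_old)              # ratio of second to first probability
    # Average of ratio * (R(t) - V(s_t)); negated so that minimizing this loss with
    # gradient descent pushes the new execution network in the improving direction.
    return -(ratio * differences).mean()
```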
According to the second aspect of the embodiments of the present application, a reinforcement learning agent training device in a multi-modal network is provided, including:
a building module, used to construct the global network feature state, the actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes a new execution network, an old execution network and an action evaluation network;
a setting module, used to set the maximum number of steps for one round of training;
an execution module, used to obtain the global network feature state in each step, input the global network feature state into the new execution network, control the SDN switch to execute the action output by the new execution network, obtain the state and reward value of the network after the SDN switch executes the action, and store the action, the reward value and the respective states of the two time periods before and after the action into the experience pool;
a first update module, used to update the network parameters of the action evaluation network based on all reward values in the experience pool and the states before the actions were executed;
a second update module, used to assign the network parameters of the new execution network to the old execution network, and update the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed;
a repeat module, used to repeat the process from the execution module to the second update module until the bandwidth occupied by each mode in the multi-modal network guarantees communication transmission quality while not overloading the network egress.
According to the third aspect of the embodiments of the present application, a modal bandwidth resource scheduling method in a multi-modal network is provided, including:
applying the reinforcement learning agent trained by the reinforcement learning agent training method in the multi-modal network according to the first aspect to the multi-modal network;
scheduling the resources occupied by each mode according to the scheduling strategy output by the reinforcement learning agent.
According to the fourth aspect of the embodiments of the present application, a modal bandwidth resource scheduling device in a multi-modal network is provided, including:
an application module, used to apply the reinforcement learning agent trained by the reinforcement learning agent training method in the multi-modal network according to the first aspect to the multi-modal network;
a scheduling module, used to schedule the resources occupied by each mode according to the scheduling strategy output by the reinforcement learning agent.
According to the fifth aspect of the embodiments of the present application, an electronic device is provided, including:
one or more processors;
a memory, used to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the reinforcement learning agent training method in a multi-modal network or the modal bandwidth resource scheduling method in a multi-modal network described above.
According to the sixth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which instructions are stored; when the instructions are executed by a processor, the steps of the reinforcement learning agent training method in a multi-modal network or of the modal bandwidth resource scheduling method in a multi-modal network are implemented.
The technical solutions provided by the embodiments of this application may have the following beneficial effects:
As can be seen from the above embodiments, this application uses the idea of reinforcement learning algorithms to construct global network feature states, execution actions and reward functions suited to multi-modal networks, allowing the reinforcement learning agent to continuously interact with the network and output the optimal execution action according to changes in the network state and reward value, so that the allocation of multi-modal network resources meets expectations and network operating performance is guaranteed. This has strong practical significance for promoting intelligent management and control of multi-modal networks.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
Figure 1 is a flow chart of a reinforcement learning agent training method in a multi-modal network according to an exemplary embodiment.
Figure 2 is a flow chart of step S14 according to an exemplary embodiment.
Figure 3 is a flow chart of "updating the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed" according to an exemplary embodiment.
Figure 4 is a block diagram of a reinforcement learning agent training device in a multi-modal network according to an exemplary embodiment.
Figure 5 is a flow chart of a modal bandwidth resource scheduling method in a multi-modal network according to an exemplary embodiment.
Figure 6 is a block diagram of a modal bandwidth resource scheduling device in a multi-modal network according to an exemplary embodiment.
Figure 7 is a schematic diagram of an electronic device according to an exemplary embodiment.
Detailed description of the embodiments
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the" and "said" are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
Embodiment 1:
Figure 1 is a flow chart of a reinforcement learning agent training method in a multi-modal network according to an exemplary embodiment. As shown in Figure 1, the method is applied to a reinforcement learning agent and may include the following steps:
Step S11: Construct the global network feature state, the actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes a new execution network, an old execution network and an action evaluation network;
Step S12: Set the maximum number of steps for one round of training;
Step S13: In each step, obtain the global network feature state, input the global network feature state into the new execution network, control the SDN switch to execute the action output by the new execution network, obtain the state and reward value of the network after the SDN switch executes the action, and store the action, the reward value and the respective states of the two time periods before and after the action into the experience pool;
Step S14: Update the network parameters of the action evaluation network based on all reward values in the experience pool and the states before the actions were executed;
Step S15: Assign the network parameters of the new execution network to the old execution network, and update the network parameters of the new execution network based on all actions in the experience pool and the states before the actions were executed;
Step S16: Repeat steps S13-S15 until the bandwidth occupied by each mode in the multi-modal network guarantees communication transmission quality while not overloading the network egress.
As can be seen from the above embodiment, this application uses the idea of reinforcement learning algorithms to construct global network feature states, execution actions and reward functions suited to multi-modal networks, allowing the reinforcement learning agent to continuously interact with the network and output the optimal execution action according to changes in the network state and reward value, so that the allocation of multi-modal network resources meets expectations and network operating performance is guaranteed. This has strong practical significance for promoting intelligent management and control of multi-modal networks.
In the specific implementation of step S11, the global network feature state, the actions and the deep neural network model required to train the reinforcement learning agent are constructed, where the deep neural network model includes a new execution network, an old execution network and an action evaluation network.
Specifically, the global network feature state includes the number of packets of each mode, the average packet size of each mode, the average delay of each flow, the number of data packets in each flow, the size of each flow, and the average packet size of each flow. These features constitute the global network state for the current time interval of Δt seconds. Let s_t denote the global network features in the t-th Δt second.
Specifically, the action is the sum of the mean value of the action vector selected in the corresponding global network feature state and a noise term. Let a_t denote the action of the t-th Δt second. The action is used to adjust the bandwidth of each flow and thereby schedule the resources occupied by each mode, so that the network communication quality meets the expected goals. The physical meaning of the action is the proportion of each flow in each mode that reaches the egress area. Let P denote the number of modes running in the network; since one mode corresponds to one network technology system, the number of modes running in the network is assumed to be fixed. Let F_m denote the maximum number of flows in each mode; the dimension of the output action space is then P × F_m. Let F(p, t) denote the number of flows of the p-th mode within the t-th Δt second, satisfying F(p, t) < F_m. Therefore, within the t-th Δt second, only P × F(p, t) elements correspond to actual flows and take values in the range 0.1-1, while the other elements take the value 0 because they have no actual flows.
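The two paragraphs above can be illustrated with the following sketch, which assembles a feature vector from the listed per-mode and per-flow statistics and builds an action vector of dimension P × F_m in which only the entries with actual flows carry a proportion in [0.1, 1]. The dictionary keys, the example values of P and F_m, and the clipping used to keep proportions in range are illustrative assumptions.

```python
import numpy as np

P, F_m = 8, 16      # example values: number of modes and maximum flows per mode

def build_state(mode_stats, flow_stats):
    """Global network feature state s_t for one Δt-second interval.

    mode_stats: per-mode dicts with packet count and average packet size (illustrative keys).
    flow_stats: per-flow dicts with average delay, packet count, flow size, average packet size.
    """
    mode_part = np.array([[m["packets"], m["avg_pkt_size"]] for m in mode_stats], dtype=float)
    flow_part = np.array([[f["avg_delay"], f["packets"], f["size"], f["avg_pkt_size"]]
                          for f in flow_stats], dtype=float)
    return np.concatenate([mode_part.ravel(), flow_part.ravel()])

def mask_action(raw_action, flows_per_mode):
    """Action vector of dimension P*F_m: entries for actual flows take values in [0.1, 1],
    entries without an actual flow stay 0."""
    action = np.zeros(P * F_m)
    for p, n_flows in enumerate(flows_per_mode):        # n_flows = F(p, t) < F_m
        idx = slice(p * F_m, p * F_m + n_flows)
        action[idx] = np.clip(raw_action[idx], 0.1, 1.0)
    return action
```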
In a specific implementation, for convenience the same architecture can be used for the new execution network, the old execution network and the action evaluation network; for example, a deep neural network, a convolutional neural network, a recurrent neural network or another architecture can be used. The parameters are randomly initialized after the networks are constructed.
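As one purely illustrative instantiation, the three networks could be built as small fully connected PyTorch models sharing one architecture, with the old execution network initialized from the new one's random parameters. The layer sizes, the Sigmoid output used to keep action proportions in (0, 1), and the example dimensions are assumptions.

```python
import torch.nn as nn

def make_actor(state_dim, action_dim, hidden=128):
    # outputs the mean μ(s_t | θ_μ) of the action vector; Sigmoid keeps proportions in (0, 1)
    return nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, action_dim), nn.Sigmoid())

def make_critic(state_dim, hidden=128):
    # outputs the expected value V(s_t) of a state
    return nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, 1))

state_dim, action_dim = 200, 8 * 16            # illustrative dimensions (action space = P x F_m)
actor_new = make_actor(state_dim, action_dim)  # new execution network
actor_old = make_actor(state_dim, action_dim)  # old execution network, same architecture
actor_old.load_state_dict(actor_new.state_dict())   # start from the same random parameters
critic = make_critic(state_dim)                # action evaluation network
```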
In the specific implementation of step S12, the maximum number of steps for one round of training is set.
Specifically, the maximum number of steps T for each round of training is set. In practice, the value of T is related to factors such as the number of modes in the network, and several attempts during training are needed to select a preferable value. For example, assuming that the number of modes in the network is 8, T = 120 was found to be preferable after many attempts.
In the specific implementation of step S13, in each step the global network feature state is obtained and input into the new execution network, the SDN switch is controlled to execute the action output by the new execution network, the state and reward value of the network after the SDN switch executes the action are obtained, and the action, the reward value and the respective states of the two time periods before and after the action are stored into the experience pool.
Specifically, in each step, the reinforcement learning agent obtains, through the controller and at a sampling interval of Δt seconds, the global network features of the Δt-second time period. The current network state s_t is input into the new execution network, which outputs, based on the current parameters θ_μ, the mean μ(s_t|θ_μ) and the variance N of the execution action. The output execution action is expressed as
a_t = μ(s_t|θ_μ) + N
where μ(s_t|θ_μ) denotes the mean value of the action vector selected by the reinforcement learning agent in a given state s_t, θ_μ denotes the parameters of the new execution network, and N denotes the noise, a normally distributed term that decays over time.
The SDN controller sets the bandwidth for each flow according to the proportions given in the execution action, converts them into instructions that the SDN switch can recognize, and issues the configuration. The SDN switch receives the configuration and forwards the flows of each mode according to the configured bandwidth; if a flow requires more bandwidth than it has been configured with, part of it is randomly discarded to stay within the allocated bandwidth.
The reinforcement learning agent obtains the new state s_{t+1} and the reward value r_t of the network after the action is executed, and stores (s_t, a_t, r_t, s_{t+1}) into the experience pool.
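One step of this interaction can be sketched as follows: the agent samples the global state, forms a_t = μ(s_t|θ_μ) + N with normally distributed noise whose scale decays over time, hands the per-flow proportions to the SDN controller, and stores (s_t, a_t, r_t, s_{t+1}) in the experience pool. The callables for reading state, applying the configuration and computing the reward, as well as the noise constants and clipping, are illustrative placeholders; the actor is treated as a callable returning the mean action as a NumPy array.

```python
import numpy as np

def exploration_noise(step, action_dim, sigma0=0.3, decay=0.995):
    """Normally distributed noise N whose scale decays as training proceeds."""
    return np.random.normal(0.0, sigma0 * (decay ** step), size=action_dim)

def interaction_step(step, actor_new, get_state, apply_sdn_config, compute_reward, pool):
    s_t = get_state()                                     # global features of the last Δt seconds
    mu = actor_new(s_t)                                   # mean action μ(s_t | θ_μ)
    a_t = np.clip(mu + exploration_noise(step, mu.shape[0]), 0.0, 1.0)
    apply_sdn_config(a_t)                                 # controller sets per-flow bandwidth;
                                                          # switches randomly drop traffic above it
    s_next = get_state()                                  # network state after the action
    r_t = compute_reward(s_next, a_t)                     # reward value for this step
    pool.append((s_t, a_t, r_t, s_next))                  # store (s_t, a_t, r_t, s_{t+1})
    return s_next
```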
对于一轮训练,强化学习智能体会进行T次步骤S13的过程,在这个过程中网络参数不更新,其中奖励值r t为强化学习智能体计算奖励函数的值。所述奖励函数定义如下 For a round of training, the reinforcement learning agent will perform the process of step S13 T times. During this process, the network parameters are not updated, and the reward value r t is the value of the reward function calculated by the reinforcement learning agent. The reward function is defined as follows
[reward function, shown in the original as image PCTCN2022130998-appb-000001]

where η_p is the weight coefficient of the p-th mode, whose value is chosen manually according to the network operation quality targets,

[auxiliary expression, shown in the original as image PCTCN2022130998-appb-000002]

v_p(i,t) is the rate of the i-th flow of the p-th mode during the t-th Δt-second interval, which can be obtained from the global network feature state; β_p(i,t) is the proportion of the i-th flow of the p-th mode that reaches the server during the t-th Δt-second interval, which can be obtained from the execution action; and ξ is the upper limit of traffic that the egress region can carry during normal operation.
A reward function designed in this way allocates appropriate bandwidth according to the transmission conditions of the different modes while preventing the modes from seizing network resources and overloading the network. For bandwidth allocation, the proportion of each mode's flows that reach the server is used to characterize that mode's transmission quality: if a mode becomes congested, the reward function pushes subsequent actions to allocate it more bandwidth, even if its weight coefficient is low or the network as a whole is not yet congested. If several modes are congested at the same time, the mode with the higher weight coefficient receives more bandwidth, which matches the practical requirement of protecting the more important communication services first. For overload avoidance, a penalty value of -1 provides negative feedback on the previous action, reducing the allocated bandwidth so that the network is not overloaded. The reward function therefore keeps the network operating normally while dynamically adjusting the bandwidth allocation according to the transmission conditions of each mode.
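The exact reward formula appears above only as image placeholders, so the following sketch merely illustrates, under stated assumptions, the qualitative behaviour described in this paragraph: a penalty of -1 when the egress region would be overloaded, and otherwise a weighted sum that grows when a larger share of each mode's flows reaches the server. It is not a reproduction of the original equation.

```python
def reward_sketch(v, beta, eta, xi):
    """Illustrative reward consistent with the behaviour described above; NOT the original formula.

    v[p][i]    -- rate of the i-th flow of mode p during the last delta-t window
    beta[p][i] -- fraction of that flow which reached the server (taken from the executed action)
    eta[p]     -- manually chosen weight of mode p
    xi         -- traffic ceiling of the egress region under normal operation
    """
    arriving = sum(v[p][i] * beta[p][i]
                   for p in range(len(v)) for i in range(len(v[p])))
    if arriving > xi:                 # overload: negative feedback on the previous action
        return -1.0
    # otherwise reward each mode by how completely its flows get through, weighted by importance
    return sum(eta[p] * (sum(beta[p]) / len(beta[p])) for p in range(len(beta)))
```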
In the specific implementation of step S14, the network parameters of the action evaluation network are updated according to all reward values in the experience pool and the states before the actions were executed.
Specifically, as shown in Figure 2, this step may include the following sub-steps:
Step S21: input all pre-action states in the experience pool into the action evaluation network to obtain the corresponding expected values.
Specifically, for each sample in the experience pool, the state s_t is input into the action evaluation network to obtain the corresponding expected value V(s_t), t = 1, 2, ..., T. This expected value is an evaluation of the network state at time t, i.e. the instantaneous value of the current state for reaching the goal set by the reward function.
Step S22: calculate the discounted reward of each pre-action state according to the expected value, the corresponding reward value and a preset decay discount.
Specifically, the discounted reward of each s_t is calculated as
R(t) = -V(s_t) + r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{T-1-t} r_{T-1} + γ^{T-t} V(s_T), t = 1, 2, ..., T, where γ is the decay discount, chosen manually. Since each training round runs for T steps, we need to know the long-term value of the current network state for the subsequent evolution of the network state toward the goal set by the reward function.
Step S23: calculate the difference between the discounted reward and the expected value, compute the mean squared error over all differences, and use the result as the first loss value to update the network parameters of the action evaluation network.
Specifically, R(t) - V(s_t), t = 1, 2, ..., T, is computed over the sample distribution, and the mean squared error of these differences is taken as the first loss value for updating the parameters of the action evaluation network. The difference characterizes the gap between the instantaneous value and the long-term value; it is used to adjust the parameters of the action evaluation network and to refine the output execution actions. The smaller the gap, the closer the action network is to optimal.
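Sub-steps S21-S23 may be sketched as follows (PyTorch), following the text above literally, including the definition of R(t) that already contains the -V(s_t) term; the tensor layout, the optimizer and the indexing convention are illustrative assumptions rather than details given in the application.

```python
import torch

def update_critic(critic, optimizer, states, rewards, gamma):
    """S21-S23 sketch. states: (T+1, state_dim) tensor holding the stored pre-action states plus the
    final state; rewards: length-T tensor of the stored reward values (0-indexed here)."""
    values = critic(states).squeeze(-1)                 # V(s) for every stored state          (S21)
    T = rewards.shape[0]
    returns = torch.empty(T)
    acc = values[-1].detach()                           # bootstrap the tail of the sum with V(s_T)
    for t in reversed(range(T)):                        # R(t) = -V(s_t) + r_t + gamma r_{t+1} + ... + gamma^{T-t} V(s_T)
        acc = rewards[t] + gamma * acc
        returns[t] = acc - values[t].detach()           #                                       (S22)
    loss = ((returns - values[:-1]) ** 2).mean()        # first loss: mean squared difference   (S23)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return returns.detach()                             # reused later when updating the execution new network
```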
In the specific implementation of step S15, the network parameters of the execution new network are assigned to the execution old network, and the network parameters of the execution new network are then updated according to all actions in the experience pool and the states before those actions were executed.
Specifically, the parameters of the old and new execution networks are compared continually, and the parameters of the execution network are updated to keep improving the output actions, so that the parameters of the execution new network eventually reach the optimum and output the optimal actions.
Specifically, as shown in Figure 3, "updating the network parameters of the execution new network according to all actions in the experience pool and the pre-action states" may include the following sub-steps:
Step S31: input all pre-action states in the experience pool into the execution old network and the execution new network respectively, obtaining the old distribution and the new distribution of execution actions.
Specifically, the states s_t from the samples stored in the experience pool are input into the execution old network and the execution new network, yielding, respectively, the old and the new normal distributions over execution actions. The two execution networks are built on the same neural network architecture and differ only in their parameters. Both take a network state sample s_t as input and output the mean μ(s_t|θ^μ) and variance N of the currently preferred execution action; assuming, without loss of generality, that actions are normally distributed, the old and new probability distributions of the action can be determined from the outputs of the two execution networks.
Step S32: for each action in the experience pool, calculate the first probability and the second probability with which it appears in the corresponding old and new distributions of execution actions.
Specifically, for each stored action a_t, t = 1, 2, ..., T, the first probability p_old(a_t) and the second probability p_new(a_t) in the corresponding distributions are calculated. These two probabilities represent how likely the stored action is to be selected for execution under the old and the new execution network, respectively.
Step S33: calculate the ratio of the second probability to the first probability.
Specifically, the ratio

ratio_t = p_new(a_t) / p_old(a_t)

is calculated (shown in the original as image PCTCN2022130998-appb-000003). This ratio characterizes the parameter difference between the old and new execution networks: if the parameters of the two networks are identical, the execution network has already been updated to the optimum. Because the parameters of the execution network should keep being updated and optimized, this ratio is used when updating the network parameters.
Step S34: multiply each ratio by the corresponding difference, average the results, and use the resulting value as the second loss value to update the network parameters of the execution new network.
Specifically, for t = 1, 2, ..., T, ratio_t is multiplied by R(t) - V(s_t) and the mean of these products is taken as the second loss value for updating the parameters of the execution new network. ratio_t characterizes the update direction of the action network, while R(t) - V(s_t) characterizes the parameter update direction of the evaluation network. Because improving the output execution action must take the changes in network state into account, the product of the two is used to update the parameters of the execution new network, so that it learns the latest network state and outputs an action suited to it in the next step.
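Sub-steps S31-S34 may be sketched as follows (PyTorch). It is assumed here that both execution networks output only the action mean and that the action is treated as normally distributed with a fixed standard deviation sigma; the negation of the loss is a common gradient-descent convention and is not specified by the application, which only defines the averaged product itself.

```python
import torch
from torch.distributions import Normal

def update_actor(actor_new, actor_old, optimizer, states, actions, advantages, sigma):
    """states: (T, state_dim); actions: (T, action_dim); advantages: the R(t) - V(s_t) values
    produced during the critic update."""
    with torch.no_grad():
        mu_old = actor_old(states)                                    # old action distribution   (S31)
    mu_new = actor_new(states)                                        # new action distribution
    logp_old = Normal(mu_old, sigma).log_prob(actions).sum(dim=-1)    # log p_old(a_t)             (S32)
    logp_new = Normal(mu_new, sigma).log_prob(actions).sum(dim=-1)    # log p_new(a_t)
    ratio = torch.exp(logp_new - logp_old)                            # ratio_t = p_new / p_old    (S33)
    # second loss value: mean of ratio_t * (R(t) - V(s_t)); negated so that gradient descent
    # increases the expected weighted advantage                                                     (S34)
    loss = -(ratio * advantages).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```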
In the specific implementation of step S16, steps S13-S15 are repeated until the bandwidth occupied by each mode in the multi-modal network guarantees the quality of communication transmission without overloading the network egress.
Specifically, S13-S15 constitute one training round; further rounds are run until every mode occupies a reasonable share of the bandwidth, guaranteeing the quality of communication transmission while keeping the network egress from being overloaded. After sufficient training, the reinforcement learning agent has fully learned the optimal policy for different network environments, i.e. the execution actions that achieve the preset goals.
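Putting the sketches above together, one training round (steps S13-S15) could be driven by a loop of the following shape. The number of rounds, the discount γ and the noise decay are illustrative, env.reset() is a hypothetical helper, and the stopping criterion of step S16 is expressed qualitatively in the application rather than as a fixed round count.

```python
import numpy as np
import torch

def train(env, actor_new, actor_old, critic, actor_opt, critic_opt,
          T=120, rounds=500, gamma=0.95, sigma=0.1):
    """Each outer iteration is one round of S13-S15, reusing interaction_step, update_critic and update_actor."""
    s = env.reset()                                            # hypothetical helper returning an initial state
    for _ in range(rounds):
        experience_pool.clear()
        for _ in range(T):                                     # S13: collect T transitions, no parameter updates
            s = interaction_step(s, sigma, env)
        states  = torch.as_tensor(np.stack([e[0] for e in experience_pool] + [s]), dtype=torch.float32)
        actions = torch.as_tensor(np.stack([e[1] for e in experience_pool]), dtype=torch.float32)
        rewards = torch.as_tensor(np.asarray([e[2] for e in experience_pool]), dtype=torch.float32)
        advantages = update_critic(critic, critic_opt, states, rewards, gamma)     # S14
        actor_old.load_state_dict(actor_new.state_dict())                          # S15: copy new -> old
        update_actor(actor_new, actor_old, actor_opt, states[:-1], actions, advantages, sigma)
        sigma *= 0.99                                          # simple decay of the exploration noise (an assumption)
```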
Corresponding to the foregoing embodiments of the method for training a reinforcement learning agent in a multi-modal network, this application also provides embodiments of an apparatus for training a reinforcement learning agent in a multi-modal network.
Figure 4 is a block diagram of an apparatus for training a reinforcement learning agent in a multi-modal network according to an exemplary embodiment. Referring to Figure 4, the apparatus is applied to a reinforcement learning agent and may include:
a building module 21, configured to construct the global network feature state, the actions, and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network;
a setting module 22, configured to set the maximum number of steps of one training round;
an execution module 23, configured to, in each step, obtain the global network feature state, input it into the execution new network, control the SDN switch to carry out the action output by the execution new network, obtain the state of the network and the reward value after the SDN switch has executed the action, and store the action, the reward value, and the states of the two time periods before and after the action into the experience pool;
a first update module 24, configured to update the network parameters of the action evaluation network according to all reward values in the experience pool and the pre-action states;
a second update module 25, configured to assign the network parameters of the execution new network to the execution old network, and to update the network parameters of the execution new network according to all actions in the experience pool and the pre-action states;
a repetition module 26, configured to repeat the processes from the execution module to the second update module until the bandwidth occupied by each mode in the multi-modal network guarantees the quality of communication transmission without overloading the network egress.
Embodiment 2:
Figure 5 is a flow chart of a modal bandwidth resource scheduling method in a multi-modal network according to an exemplary embodiment. As shown in Figure 5, the method may include the following steps:
Step S41: apply a reinforcement learning agent trained by the method for training a reinforcement learning agent in a multi-modal network described in Embodiment 1 to the multi-modal network;
Step S42: schedule the resources occupied by each mode according to the scheduling policy output by the reinforcement learning agent.
It can be seen from the above embodiments that applying the trained reinforcement learning agent in the modal bandwidth resource scheduling method adapts to networks with different characteristics, can be used for intelligent management and control of multi-modal networks, and provides good adaptability and scheduling performance.
Specifically, the above method for training a reinforcement learning agent in a multi-modal network has been described in detail in Embodiment 1, while applying the reinforcement learning agent to the multi-modal network and scheduling according to the policy it outputs are conventional techniques in this field and are not described further here.
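For illustration only, a single scheduling decision made by the trained agent could look like the following sketch; the controller wrapper and its methods are hypothetical names introduced here, not APIs from this application, and the normalisation step is likewise an assumption.

```python
import torch

def schedule_once(actor_new, controller, total_egress_bw):
    """One scheduling decision per delta-t window, using the trained agent with exploration noise dropped.
    `controller` is a hypothetical wrapper around the SDN controller."""
    s = controller.collect_global_state()                   # per-mode packet counts, flow sizes, delays, ...
    with torch.no_grad():
        ratios = actor_new(torch.as_tensor(s, dtype=torch.float32)).numpy()
    ratios = ratios / max(float(ratios.sum()), 1e-6)         # normalise so the modes share the egress capacity
    for mode_id, share in enumerate(ratios):
        # hypothetical call: the controller translates the share into per-flow limits and pushes switch config
        controller.set_mode_bandwidth(mode_id, share * total_egress_bw)
```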
Corresponding to the foregoing embodiment of the modal bandwidth resource scheduling method in a multi-modal network, this application also provides an embodiment of a modal bandwidth resource scheduling apparatus in a multi-modal network.
Figure 6 is a block diagram of a modal bandwidth resource scheduling apparatus in a multi-modal network according to an exemplary embodiment. Referring to Figure 6, the apparatus may include:
an application module 31, configured to apply a reinforcement learning agent trained by the method for training a reinforcement learning agent in a multi-modal network described in Embodiment 1 to the multi-modal network;
a scheduling module 32, configured to schedule the resources occupied by each mode according to the scheduling policy output by the reinforcement learning agent.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the corresponding method and will not be elaborated here.
Since the apparatus embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this application, and persons of ordinary skill in the art can understand and implement them without creative effort.
Embodiment 3:
Accordingly, this application also provides an electronic device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method for training a reinforcement learning agent in a multi-modal network or the modal bandwidth resource scheduling method in a multi-modal network described above. Figure 7 is a hardware structure diagram of an arbitrary device with data processing capabilities on which the method for training a reinforcement learning agent in a multi-modal network or the modal bandwidth resource scheduling method provided by an embodiment of the present invention resides. In addition to the processor, memory and network interface shown in Figure 7, such a device may also include other hardware depending on its actual functions, which is not described further here.
Embodiment 4:
Accordingly, this application also provides a computer-readable storage medium on which computer instructions are stored; when executed by a processor, the instructions implement the method for training a reinforcement learning agent in a multi-modal network or the modal bandwidth resource scheduling method in a multi-modal network described above. The computer-readable storage medium may be an internal storage unit of any device with data processing capabilities described in any of the foregoing embodiments, such as a hard disk or a memory. It may also be an external storage device of such a device, such as a plug-in hard disk, a smart media card (SMC), an SD card or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit of a device with data processing capabilities and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the disclosure herein. This application is intended to cover any variations, uses or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein.
It should be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope.

Claims (10)

  1. A method for training a reinforcement learning agent in a multi-modal network, applied to a reinforcement learning agent, comprising:
    S11: constructing a global network feature state, actions, and a deep neural network model required for training the reinforcement learning agent, wherein the deep neural network model comprises an execution new network, an execution old network and an action evaluation network;
    S12: setting a maximum number of steps of one training round;
    S13: in each step, obtaining the global network feature state, inputting the global network feature state into the execution new network, controlling an SDN switch to execute the action output by the execution new network, obtaining the state of the network and a reward value after the SDN switch has executed the action, and storing the action, the reward value, and the respective states of the two time periods before and after the action into an experience pool;
    S14: updating network parameters of the action evaluation network according to all reward values in the experience pool and the states before the actions were executed;
    S15: assigning the network parameters of the execution new network to the execution old network, and updating the network parameters of the execution new network according to all actions in the experience pool and the states before the actions were executed;
    S16: repeating steps S13-S15 until the bandwidth occupied by each mode in the multi-modal network guarantees the quality of communication transmission without overloading the network egress.
  2. The method according to claim 1, wherein the global network feature state comprises the number of packets of each mode, the average packet size of each mode, the average delay of each flow, the number of packets in each flow, the size of each flow, and the average packet size in each flow.
  3. The method according to claim 1, wherein the action is the sum of the mean of the action vector selected in the corresponding global network feature state and a noise term.
  4. The method according to claim 1, wherein updating the network parameters of the action evaluation network according to all reward values in the experience pool and the states before the actions were executed comprises:
    inputting all pre-action states in the experience pool into the action evaluation network to obtain corresponding expected values;
    calculating a discounted reward of each pre-action state according to the expected value, the corresponding reward value and a preset decay discount;
    calculating differences between the discounted rewards and the expected values, computing a mean squared error from all differences, and using the obtained mean squared error as a first loss value to update the network parameters of the action evaluation network.
  5. The method according to claim 4, wherein updating the network parameters of the execution new network according to all actions in the experience pool and the states before the actions were executed comprises:
    inputting all pre-action states in the experience pool into the execution old network and the execution new network respectively, to obtain an old distribution and a new distribution of execution actions;
    calculating, for each action in the experience pool, a first probability and a second probability with which it appears in the corresponding old distribution and new distribution of execution actions, respectively;
    calculating a ratio of the second probability to the first probability;
    multiplying all the ratios by the corresponding differences and averaging the results, and using the resulting value as a second loss value to update the network parameters of the execution new network.
  6. An apparatus for training a reinforcement learning agent in a multi-modal network, applied to a reinforcement learning agent, comprising:
    a building module, configured to construct a global network feature state, actions, and a deep neural network model required for training the reinforcement learning agent, wherein the deep neural network model comprises an execution new network, an execution old network and an action evaluation network;
    a setting module, configured to set a maximum number of steps of one training round;
    an execution module, configured to, in each step, obtain the global network feature state, input the global network feature state into the execution new network, control an SDN switch to execute the action output by the execution new network, obtain the state of the network and a reward value after the SDN switch has executed the action, and store the action, the reward value, and the respective states of the two time periods before and after the action into an experience pool;
    a first update module, configured to update network parameters of the action evaluation network according to all reward values in the experience pool and the states before the actions were executed;
    a second update module, configured to assign the network parameters of the execution new network to the execution old network, and to update the network parameters of the execution new network according to all actions in the experience pool and the states before the actions were executed;
    a repetition module, configured to repeat the processes from the execution module to the second update module until the bandwidth occupied by each mode in the multi-modal network guarantees the quality of communication transmission without overloading the network egress.
  7. A modal bandwidth resource scheduling method in a multi-modal network, comprising:
    applying a reinforcement learning agent trained by the method for training a reinforcement learning agent in a multi-modal network according to any one of claims 1-5 to the multi-modal network;
    scheduling the resources occupied by each mode according to the scheduling policy output by the reinforcement learning agent.
  8. A modal bandwidth resource scheduling apparatus in a multi-modal network, comprising:
    an application module, configured to apply a reinforcement learning agent trained by the method for training a reinforcement learning agent in a multi-modal network according to any one of claims 1-5 to the multi-modal network;
    a scheduling module, configured to schedule the resources occupied by each mode according to the scheduling policy output by the reinforcement learning agent.
  9. An electronic device, comprising:
    one or more processors;
    a memory for storing one or more programs;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for training a reinforcement learning agent in a multi-modal network according to any one of claims 1-5 or the modal bandwidth resource scheduling method in a multi-modal network according to claim 7.
  10. A computer-readable storage medium having computer instructions stored thereon, wherein, when executed by a processor, the instructions implement the steps of the method for training a reinforcement learning agent in a multi-modal network according to any one of claims 1-5 or of the modal bandwidth resource scheduling method in a multi-modal network according to claim 7.

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114866494B (en) * 2022-07-05 2022-09-20 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
CN116994693B (en) * 2023-09-27 2024-03-01 之江实验室 Modeling method and system for medical insurance overall agent based on stability control

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning
CN113254197A (en) * 2021-04-30 2021-08-13 西安电子科技大学 Network resource scheduling method and system based on deep reinforcement learning
CN113595923A (en) * 2021-08-11 2021-11-02 国网信息通信产业集团有限公司 Network congestion control method and device
US20220210200A1 (en) * 2015-10-28 2022-06-30 Qomplx, Inc. Ai-driven defensive cybersecurity strategy analysis and recommendation system
CN114866494A (en) * 2022-07-05 2022-08-05 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108683614B (en) * 2018-05-15 2021-11-09 国网江苏省电力有限公司苏州供电分公司 Virtual reality equipment cluster bandwidth allocation device based on threshold residual error network
US20200162535A1 (en) * 2018-11-19 2020-05-21 Zhan Ma Methods and Apparatus for Learning Based Adaptive Real-time Streaming
CN111988225B (en) * 2020-08-19 2022-03-04 西安电子科技大学 Multi-path routing method based on reinforcement learning and transfer learning
CN112295237A (en) * 2020-10-19 2021-02-02 深圳大学 Deep reinforcement learning-based decision-making method
CN113328938B (en) * 2021-05-25 2022-02-08 电子科技大学 Network autonomous intelligent management and control method based on deep reinforcement learning
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, equipment and storage medium
CN114626499A (en) * 2022-05-11 2022-06-14 之江实验室 Embedded multi-agent reinforcement learning method using sparse attention to assist decision making

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220210200A1 (en) * 2015-10-28 2022-06-30 Qomplx, Inc. Ai-driven defensive cybersecurity strategy analysis and recommendation system
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning
CN113254197A (en) * 2021-04-30 2021-08-13 西安电子科技大学 Network resource scheduling method and system based on deep reinforcement learning
CN113595923A (en) * 2021-08-11 2021-11-02 国网信息通信产业集团有限公司 Network congestion control method and device
CN114866494A (en) * 2022-07-05 2022-08-05 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device

Kind code of ref document: A1