WO2024007499A1

WO2024007499A1 - Reinforcement learning agent training method and apparatus, and modal bandwidth resource scheduling method and apparatus

Info

Publication number: WO2024007499A1
Application number: PCT/CN2022/130998
Authority: WO
Inventors: 沈丛麒; 张慧峰; 姚少峰; 徐琪; 张汝云
Original assignee: 之江实验室
Priority date: 2022-07-05
Filing date: 2022-11-10
Publication date: 2024-01-11
Also published as: CN114866494A; CN114866494B

Abstract

Disclosed in the present invention are a reinforcement learning agent training method and apparatus, and a modality bandwidth resource scheduling method and apparatus. By means of the reinforcement learning agent training method, in a polymorphic network, the latest global network feature is acquired by using continuous interaction between a reinforcement learning agent and a network environment, and an updated action is output. A bandwidth occupied by a modality is adjusted, and a reward value is set to determine an optimization target for an intelligent agent, such that modality scheduling is realized, and the rational use of polymorphic network resources is ensured. The trained reinforcement learning agent is applied to a modality bandwidth resource scheduling method, can be adaptive to networks having different features, can be used for intelligent management and control of a polymorphic network, and has good adaptability and scheduling performance.

Description

Reinforcement learning agent training method, modal bandwidth resource scheduling method and device

Technical field

The invention belongs to the field of network management and control technology, and in particular relates to a reinforcement learning agent training method, a modal bandwidth resource scheduling method and a device.

Background technique

In a multi-modal network, multiple network technology systems are running at the same time, and each technology system is a network mode. Each network mode shares network resources. If not controlled, it will cause each network mode to directly compete for network resources, such as bandwidth, etc., which will directly affect the communication transmission quality of some key modes. Therefore, reasonable management and control of each mode in the network is one of the necessary prerequisites to ensure the stable operation of multi-modal networks.

To meet the above needs, the current mainstream technology is to control the proportion of bandwidth used by switch ports and limit the size of egress traffic to avoid network overload.

In the process of realizing the present invention, the inventor found that the existing technology has at least the following problems:

The use of such static policies (such as limiting the bandwidth usage ratio not to exceed a certain maximum value) will not be able to adapt to dynamic changes in network modes. In actual networks, it is very likely that individual modal traffic will increase due to business changes. At this time, the original static policy is no longer applicable.

Contents of the invention

The purpose of the embodiments of this application is to provide reinforcement learning agent training methods, modal bandwidth resource scheduling methods and devices, so as to solve the technical problem in related technologies that modal resources in multi-modal networks cannot be intelligently controlled.

According to the first aspect of the embodiments of the present application, a modal bandwidth resource scheduling method in a multi-modal network is provided, including:

S11: Construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:

S12: Set the maximum number of steps for a round of training;

S13: In each step, obtain the global network characteristic state, input the global network characteristic state into the execution of the new network, control the SDN switch to execute the action of executing the new network output, and obtain the result of the SDN switch executing the action. The status and reward value of the network, the action, reward value and the respective status in the two time periods before and after the execution of the action are stored in the experience pool;

S14: Update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action;

S15: Assign the network parameters of the new network to the old network, and update the network parameters of the new network based on all actions in the experience pool and the state before the action is executed;

S16: Repeat steps S13-S15 until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality and prevents the network egress from being overloaded.

Further, the global network characteristic state includes the number of packets in each mode, the average packet size of each mode, the average delay of each flow, the number of data packets in each flow, the size of each flow, Average packet size per flow.

Further, the action is the sum of the mean value of the action vector selected in the corresponding global network feature state and noise.

Further, update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action, including:

Input all the states in the experience pool before executing the action into the action evaluation network to obtain the corresponding expected value;

According to the expected value, the corresponding reward value and the preset attenuation discount, calculate the discount reward of the state before each action;

Calculate the difference between the discount reward and the expected value, calculate the mean square error based on all differences, and use the obtained mean square error as the first loss value to update the network parameters of the action evaluation network.

Further, update the network parameters for executing the new network based on all actions in the experience pool and the status before executing the action, including:

Input all the states before executing the action in the experience pool into the old execution network and the new execution network respectively to obtain the old distribution of execution actions and the new distribution of execution actions;

Calculate the first probability and the second probability that each action in the experience pool appears in the corresponding old distribution of execution actions and the new distribution of execution actions respectively;

Calculate the ratio of the second probability to the first probability;

All the ratios are multiplied by the corresponding difference values and the average value is used as the second loss value to update the network parameters of the new network.

According to the second aspect of the embodiment of the present application, a modal bandwidth resource scheduling device in a multi-modal network is provided, including:

The building module is used to construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:

Setting module, used to set the maximum number of steps for a round of training;

An execution module, configured to obtain the global network characteristic status in each step, input the global network characteristic status into the execution of the new network, control the SDN switch to execute the action of executing the new network output, and obtain the execution of the SDN switch. The state and reward value of the network after the action are described, and the action, reward value and respective states in the two time periods before and after the action are performed are stored in the experience pool;

The first update module is used to update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action;

The second update module is used to assign the network parameters of the new network to the old network, and update the network parameters of the new network based on all actions in the experience pool and the state before the action. ;

The repeat module is used to repeat the process of executing the module to the second update module until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality while not overloading the network outlet.

According to the third aspect of the embodiments of the present application, a modal bandwidth resource scheduling method in a multi-modal network is provided, including:

applying the reinforcement learning agent trained according to the reinforcement learning agent training method in the multimodal network described in the first aspect to the multimodal network;

According to the scheduling strategy output by the reinforcement learning agent, the resources occupied by each mode are scheduled.

According to the third aspect of the embodiment of the present application, a modal bandwidth resource scheduling device in a multi-modal network is provided, including:

An application module, configured to apply the reinforcement learning agent trained according to the reinforcement learning agent training method in the multi-modal network described in the first aspect to the multi-modal network;

A scheduling module is used to schedule the resources occupied by each mode according to the scheduling strategy output by the reinforcement learning agent.

According to a fifth aspect of the embodiment of the present application, an electronic device is provided, including:

one or more processors;

Memory, used to store one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement, for example, a reinforcement learning agent training method in a multimodal network or a model in a multimodal network. Dynamic bandwidth resource scheduling method.

According to a sixth aspect of the embodiments of the present application, a computer-readable storage medium is provided. When the instructions are executed by a processor, the reinforcement learning agent training method in the multi-modal network or the modal bandwidth in the multi-modal network are implemented. Steps of the resource scheduling method.

The technical solutions provided by the embodiments of this application may include the following beneficial effects:

As can be seen from the above embodiments, this application uses the idea of reinforcement learning algorithms to construct global network characteristic states, execution actions, and reward functions that are suitable for multi-modal networks, allowing reinforcement learning agents to continuously interact with the network, and according to the network state and reward value The change outputs the optimal execution action, so that the allocation of multi-modal network resources meets expectations and ensures network operating performance. It has strong practical significance for promoting smart management and control of multi-modal networks.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.

Description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

Figure 1 is a flow chart of a reinforcement learning agent training method in a multi-modal network according to an exemplary embodiment.

FIG. 2 is a flowchart of step S14 according to an exemplary embodiment.

Figure 3 is a flow chart of "updating the network parameters of the new network based on all actions in the experience pool and the state before executing the action" according to an exemplary embodiment.

Figure 4 is a block diagram of a reinforcement learning agent training device in a multi-modal network according to an exemplary embodiment.

Figure 5 is a flow chart of a modal bandwidth resource scheduling method in a multi-modal network according to an exemplary embodiment.

Figure 6 is a block diagram of a modal bandwidth resource scheduling device in a multi-modal network according to an exemplary embodiment.

FIG. 7 is a schematic diagram of an electronic device according to an exemplary embodiment.

Detailed ways

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application.

The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "the" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present application, the first information may also be called second information, and similarly, the second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when" or "when" or "in response to determining."

Example 1:

Figure 1 is a flow chart of a reinforcement learning agent training method in a multi-modal network according to an exemplary embodiment. As shown in Figure 1, this method is applied to reinforcement learning agents and may include the following steps:

Step S11: Construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:

Step S12: Set the maximum number of steps for a round of training;

Step S13: In each step, obtain the global network characteristic state, input the global network characteristic state into the execution of the new network, control the SDN switch to execute the action of executing the new network output, obtain the SDN switch to execute the action After obtaining the status and reward value of the network, the action, reward value and respective status in the two time periods before and after the execution of the action are stored in the experience pool;

Step S14: Update the network parameters of the action evaluation network based on all reward values in the experience pool and the status before executing the action;

Step S15: Assign the network parameters of the new network to the old network, and update the network parameters of the new network based on all actions in the experience pool and the status before the action is executed;

Step S16: Repeat steps S13-S15 until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality and does not overload the network egress.

In the specific implementation of step S11, a global network feature state, actions and a deep neural network model required for training the reinforcement learning agent are constructed, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network. :

Specifically, the global network characteristic state includes the number of packets in each mode, the average packet size of each mode, the average delay of each flow, the number of data packets in each flow, the size of each flow, Average packet size per flow. These features constitute the global network status of the current time interval Δt seconds. Let s _t represent the global network characteristics in the tth Δt second.

Specifically, the action is the sum of the mean value of the action vector selected in the corresponding global network feature state and noise. Let a _t represent the action of the tth Δt second. The above actions are used to adjust the bandwidth of the stream, and then schedule the resources occupied by each mode to ensure that the network communication quality meets the expected goals. The physical meaning of the action is the proportion of each flow in each mode reaching the exit area. Let P represent the number of modes running in the network. Since one mode corresponds to a network technology system, it is assumed that the number of modes running in the network is fixed. Let F ^m represent the maximum number of flows in each mode, then the output action space dimension is P×F ^m . Let F(p, t) represent the number of flows based on the p-th mode within the t-th Δt second, and satisfy F(p, t) < F ^m . Therefore, within the tth Δt seconds, only P×F(p,t) elements have corresponding flows, so their values are 0.1-1, while other elements have values 0 because they have no actual flows.

In specific implementation, to facilitate implementation, the same architecture can be used for the new execution network, the old execution network and the action evaluation network. For example, deep neural networks, convolutional neural networks, recurrent neural networks and other architectures can be used. The parameters are randomly initialized after the network construction is completed.

In the specific implementation of step S12, set the maximum number of steps for a round of training;

Specifically, the maximum number of steps T for each round of training is set. In actual implementation, the value of T is related to factors such as the number of modes in the network. It is necessary to try multiple times during the training process to select a more preferred value. For example, assuming that the number of modes in the network is 8, it is more optimal to obtain T of 120 after many attempts.

In the specific implementation of step S13, in each step, the global network characteristic state is obtained, the global network characteristic state is input into the execution of the new network, the SDN switch is controlled to execute the action of the execution of the new network output, and the SDN is obtained The status and reward value of the network after the switch performs the action, and the action, reward value, and respective states in the two time periods before and after the action are executed are stored in the experience pool;

Specifically, in each step, the reinforcement learning agent uses the controller to obtain the global network characteristics within the Δt second time period at a sampling interval of Δt seconds. Input the current network state s _t to execute the new network, and output the mean μ (s _t |θ ^μ ) and variance N of the execution action based on the current parameter θ ^μ . The output execution action is expressed as

a _t =μ(s _t |θ ^μ )+N

Among them, μ(s _t |θ ^μ ) represents the mean value of the action vector selected by the reinforcement learning agent in a certain state s _t , θ ^μ represents the parameters for executing the new network, and N represents the noise, which is A normal function that decays over time.

The SDN controller sets the bandwidth for each flow according to the proportion set in the execution action, converts it into instructions that can be recognized by the SDN switch, and issues the configuration. The SDN switch receives the configuration and forwards each mode according to the configured bandwidth. If a flow requires more bandwidth than the configured bandwidth, part of it will be randomly discarded to meet the allocated bandwidth.

The reinforcement learning agent obtains the new state s _t+1 and reward value rt of the network after executing the action, and stores ( _{s t} _, a _t , r _t , s _t+1 ) into the experience pool.

For a round of training, the reinforcement learning agent will perform the process of step S13 T times. During this process, the network parameters are not updated, and the reward value r _t is the value of the reward function calculated by the reinforcement learning agent. The reward function is defined as follows

where eta _p is the weight coefficient of the p-th mode, and the value is determined manually according to the network operation quality target.

v _p (i, t) is the flow velocity of the i-th flow in the p-th mode during the t-th Δt second, which can be obtained from the global network characteristic state. β _p (i, t) is the proportion of the i-th flow in the p-th modality arriving at the server in the t-th Δt second, which can be obtained from the execution action. ξ is the upper limit of traffic that the egress area can carry during normal operation.

The setting of the above reward function can allocate appropriate bandwidth according to the communication transmission conditions of different modes in the network while preventing each mode from seizing network resources and causing network overload. In terms of bandwidth resource allocation, we use the ratio of the number of flows arriving at the server in each mode to characterize the transmission situation of this mode. If there is congestion in the transmission of this mode, even if its weight coefficient is not high or the overall network is not congested for the time being, the reward function will promote allocating greater bandwidth to this mode when subsequent actions are executed. If congestion occurs in multiple modes in the network, the mode with a higher weight coefficient will obtain greater bandwidth, which is also in line with actual needs, that is, priority is given to ensuring more important communication services. In terms of avoiding network overload, we use a penalty value of -1 to make negative feedback for the upward step and reduce the allocated bandwidth to avoid network overload. Therefore, the setting of the above reward function can ensure the normal operation of the network, and at the same time dynamically adjust the bandwidth resource allocation according to the transmission status of each mode in the network.

In the specific implementation of step S14, update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action;

Specifically, as shown in Figure 2, this step may include the following sub-steps:

Step S21: Input all the states in the experience pool before executing the action into the action evaluation network to obtain the corresponding expected value;

Specifically, for the sample in the experience pool, input s _t in the sample into the action evaluation network to obtain the corresponding expected value V(s _t ), t=1, 2,...,T. The expected value represents the evaluation of the network state at time t, that is, the instantaneous value of the current state to achieve the goal set by the reward function.

Step S22: Calculate the discount reward of the state before each action based on the expected value, the corresponding reward value and the preset attenuation discount;

Specifically, calculate the discount reward for each s _t

R(t)＝-V(s _t )+r _t +γr _t+1 +γ ² r _t+2 +...+γ ^T-1-t r _T-1 +γ ^Tt V(s _T ), t=1, 2,...,T, where γ is the attenuation discount, which is set artificially. Since each round of training needs to go through T steps, we need to know the long-term value of the current network state for subsequent network state changes to achieve the goal set by the reward function.

Step S23: Calculate the difference between the discount reward and the expected value, calculate the mean square error based on all differences, and use the obtained mean square error as the first loss value to update the network parameters of the action evaluation network;

Specifically, R(t)-V(s _t ) is calculated according to the sample distribution, t=1, 2, ..., T, and the standard deviation is calculated as the first loss value for updating the action evaluation network parameters. This difference represents the gap between immediate value and long-term value. This gap is used to adjust the parameters of the subsequent action evaluation network and optimize the output execution action. The smaller the gap, the closer the action network is to the optimal.

In the specific implementation of step S15, the network parameters of the new network are assigned to the old network, and the network of the new network is updated based on all actions in the experience pool and the status before the action is executed. parameter;

Specifically, we need to constantly compare the parameters of the old and new execution networks, and update the parameters of the execution network to continuously optimize the output actions, and finally optimize the parameters of the new execution network to output the optimal actions.

Specifically, as shown in Figure 3, "updating the network parameters for executing the new network based on all actions in the experience pool and the state before executing the action" may include the following sub-steps:

Step S31: Input all the states in the experience pool before executing the action into the old execution network and the new execution network respectively to obtain the old distribution of execution actions and the new distribution of execution actions;

Specifically, s _t in the samples stored in the experience pool are input into the old execution network and the new execution network, and the old distribution of action normal distribution and the new distribution of execution action are obtained respectively. The old and new networks are also built based on the same neural network architecture. The two architectures are the same, with only different parameters. Because we set the input of these two neural networks to be network state samples s _t , the output is the mean μ (s _t |θ ^μ ) and variance N of the current optimal execution action; at the same time, we assume the probability of the action without losing generality The distribution is a normal distribution, so the old probability distribution and the new probability distribution of the action can be determined based on the outputs of the two execution networks.

Step S32: Calculate the first probability and the second probability that each action in the experience pool appears in the corresponding old distribution of execution actions and the new distribution of execution actions respectively;

Specifically, the first probability p _old ( _at ₎ and the second probability p _new ( _at ) in the corresponding distribution of each stored action at, t=1, 2,..., T are calculated. These two probabilities respectively represent the probability of being selected for execution in the old and new execution networks of the actions stored in the sample pool.

Step S33: Calculate the ratio of the second probability to the first probability;

Specifically, calculate

This ratio characterizes the parameter differences between the old and new execution networks. If the parameters between the old and new networks are consistent, it means that the execution network has been updated to the optimum. Because we hope that the parameters of the execution network can be continuously updated and optimized, the calculated ratio will be used to update the network parameters.

Step S34: Multiply all the ratios by the corresponding differences and average the value as the second loss value to update the network parameters of the new network;

Specifically, for t=1, 2,...,T, ratio _t is multiplied by R(t)-V(s _t ) and averaged as the second loss value to update and execute new network parameters. Ratio _t represents the update direction of the action network, and R(t)-V(s _t ) represents the parameter update direction of the evaluation network. Because the optimization of output execution actions needs to be combined with changes in network status, the product of the two is selected to update the parameters of the new network execution so that it can learn the latest network status and output actions suitable for the network status in the next step.

In the specific implementation of step S16, steps S13-S15 are repeated until the bandwidth occupied by each mode in the multi-modal network ensures the quality of communication transmission and does not overload the network egress;

Specifically, the process of S13-S15 is a round of training process, and the next round of training continues until each mode reasonably occupies the bandwidth, ensuring the quality of communication transmission while not overloading the network outlet. After sufficient training, the reinforcement learning agent has completely learned the optimal strategy under different network environments, that is, the execution action that can achieve the set expected goal.

Corresponding to the foregoing embodiments of the reinforcement learning agent training method in the multi-modal network, this application also provides embodiments of the reinforcement learning agent training device in the multi-modal network.

Figure 4 is a block diagram of a reinforcement learning agent training device in a multi-modal network according to an exemplary embodiment. Referring to Figure 4, the device is applied to reinforcement learning agents and may include:

Building module 21 is used to construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:

Setting module 22 is used to set the maximum number of steps for a round of training;

Execution module 23 is used to obtain the global network feature status in each step, input the global network feature status into the execution of the new network, control the SDN switch to execute the action of executing the new network output, and obtain the execution of the SDN switch. The status and reward value of the network after the action are stored in the experience pool;

The first update module 24 is used to update the network parameters of the action evaluation network based on all reward values in the experience pool and the status before executing the action;

The second update module 25 is used to assign the network parameters of the new network to the old network, and update the network of the new network based on all actions in the experience pool and the status before the action is executed. parameter;

The repeat module 26 is used to repeat the process from the execution module to the second update module until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality while not overloading the network egress.

Example 2:

Figure 5 is a flow chart of a modal bandwidth resource scheduling method in a multi-modal network according to an exemplary embodiment. As shown in Figure 5, the method may include the following steps:

Step S41: Apply the reinforcement learning agent trained according to the reinforcement learning agent training method in the multi-modal network described in Embodiment 1 to the multi-modal network;

Step S42: Schedule the resources occupied by each mode according to the scheduling strategy output by the reinforcement learning agent.

It can be seen from the above embodiments that this application applies the trained reinforcement learning agent in the modal bandwidth resource scheduling method, which can adapt to networks with different characteristics, can be used for intelligent management and control of multi-modal networks, and has good adaptability and scheduling performance.

Specifically, the above reinforcement learning agent training method in the multimodal network has been described in detail in Embodiment 1, and the reinforcement learning agent is applied to the multimodal network and scheduled according to the scheduling strategy output by the reinforcement learning agent. All are conventional technical means in this field and will not be described in detail here.

Corresponding to the foregoing embodiment of the modal bandwidth resource scheduling method in a multi-modal network, this application also provides an embodiment of a modal bandwidth resource scheduling device in a multi-modal network.

Figure 6 is a block diagram of a modal bandwidth resource scheduling device in a multi-modal network according to an exemplary embodiment. Referring to Figure 6, the device may include:

The application module 31 is used to apply the reinforcement learning agent trained by the reinforcement learning agent training method in the multi-modal network according to Embodiment 1 to the multi-modal network;

The scheduling module 32 is used to schedule the resources occupied by each mode according to the scheduling policy output by the reinforcement learning agent.

Regarding the devices in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

As for the device embodiment, since it basically corresponds to the method embodiment, please refer to the partial description of the method embodiment for relevant details. The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated. The components shown as units may or may not be physical units, that is, they may be located in One location, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this application. Persons of ordinary skill in the art can understand and implement the method without any creative effort.

Example 3:

Correspondingly, this application also provides an electronic device, including: one or more processors; a memory for storing one or more programs; when the one or more programs are executed by the one or more processors , so that the one or more processors implement the above-mentioned reinforcement learning agent training method in the multi-modal network or the modal bandwidth resource scheduling method in the multi-modal network. As shown in Figure 7, a reinforcement learning agent training method in a multi-modal network or a modal bandwidth resource scheduling method in a multi-modal network provided by an embodiment of the present invention is located on any device with data processing capabilities. A hardware structure diagram, in addition to the processor, memory and network interface shown in Figure 7, any device with data processing capabilities where the device in the embodiment is located usually can also be based on the actual functions of any device with data processing capabilities. Including other hardware, I won’t go into details about this.

Example 4:

Correspondingly, this application also provides a computer-readable storage medium on which computer instructions are stored. When the instructions are executed by a processor, the above-mentioned reinforcement learning agent training method or multi-modal network in a multi-modal network is implemented. Modal bandwidth resource scheduling method in . The computer-readable storage medium may be an internal storage unit of any device with data processing capabilities as described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium can also be an external storage device of the wind turbine, such as a plug-in hard disk, a smart memory card (SMC), an SD card, or a flash card (Flash Card) equipped on the device. wait. Furthermore, the computer-readable storage medium may also include an internal storage unit of any device with data processing capabilities and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.

Other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary technical means in the technical field that are not disclosed in this application. .

It is to be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof.

Claims

A reinforcement learning agent training method in a multimodal network, which is characterized in that it is applied to reinforcement learning agents and includes:

S11: Construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:

S12: Set the maximum number of steps for a round of training;

S13: In each step, obtain the global network characteristic state, input the global network characteristic state into the execution of the new network, control the SDN switch to execute the action of executing the new network output, and obtain the result of the SDN switch executing the action. The status and reward value of the network, the action, reward value and the respective status in the two time periods before and after the execution of the action are stored in the experience pool;

S14: Update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action;

S15: Assign the network parameters of the new network to the old network, and update the network parameters of the new network based on all actions in the experience pool and the status before the action is executed;

S16: Repeat steps S13-S15 until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality and prevents the network egress from being overloaded.
The method according to claim 1, characterized in that the global network characteristic state includes the number of messages in each mode, the average message size in each mode, the average delay of each flow, the Number of packets, size of each flow, average packet size in each flow.
The method according to claim 1, characterized in that the action is the sum of the mean value of the action vector selected in the corresponding global network feature state and noise.
The method according to claim 1, characterized in that, based on all reward values in the experience pool and the state before executing the action, updating the network parameters of the action evaluation network includes:

Input all the states in the experience pool before executing the action into the action evaluation network to obtain the corresponding expected value;

According to the expected value, the corresponding reward value and the preset attenuation discount, calculate the discount reward of the state before each action;

Calculate the difference between the discount reward and the expected value, calculate the mean square error based on all differences, and use the obtained mean square error as the first loss value to update the network parameters of the action evaluation network.
The method according to claim 4, characterized in that, based on all actions in the experience pool and the status before executing the action, updating the network parameters for executing the new network includes:

Input all the states before executing the action in the experience pool into the old execution network and the new execution network respectively to obtain the old distribution of execution actions and the new distribution of execution actions;

Calculate the first probability and the second probability that each action in the experience pool appears in the corresponding old distribution of execution actions and the new distribution of execution actions respectively;

Calculate the ratio of the second probability to the first probability;

All the ratios are multiplied by the corresponding difference values and the average value is used as the second loss value to update the network parameters of the new network.
A reinforcement learning agent training device in a multi-modal network, which is characterized in that it is applied to reinforcement learning agents and includes:

The building module is used to construct the global network feature state, actions and the deep neural network model required to train the reinforcement learning agent, where the deep neural network model includes an execution new network, an execution old network and an action evaluation network:

Setting module, used to set the maximum number of steps for a round of training;

An execution module, configured to obtain the global network characteristic status in each step, input the global network characteristic status into the execution of the new network, control the SDN switch to execute the action of executing the new network output, and obtain the execution of the SDN switch. The state and reward value of the network after the action are described, and the action, reward value and respective states in the two time periods before and after the action are performed are stored in the experience pool;

The first update module is used to update the network parameters of the action evaluation network based on all reward values in the experience pool and the state before executing the action;

The second update module is used to assign the network parameters of the new network to the old network, and update the network parameters of the new network based on all actions in the experience pool and the state before the action. ;

The repeat module is used to repeat the process of executing the module to the second update module until the bandwidth occupied by each mode in the multi-modal network ensures communication transmission quality while not overloading the network outlet.
A modal bandwidth resource scheduling method in a multi-modal network, which is characterized by including:

Apply the reinforcement learning agent trained by the reinforcement learning agent training method in the multimodal network according to any one of claims 1 to 5 to the multimodal network;

According to the scheduling strategy output by the reinforcement learning agent, the resources occupied by each mode are scheduled.
A modal bandwidth resource scheduling device in a multi-modal network, which is characterized by including:

An application module, configured to apply the reinforcement learning agent trained by the reinforcement learning agent training method in the multimodal network according to any one of claims 1 to 5 to the multimodal network;

A scheduling module is used to schedule the resources occupied by each mode according to the scheduling strategy output by the reinforcement learning agent.
An electronic device, characterized by including:

one or more processors;

Memory, used to store one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement the reinforcement learning intelligence in the multi-modal network according to any one of claims 1-5 The volume training method or the modal bandwidth resource scheduling method in the multi-modal network according to claim 7.
A computer-readable storage medium with computer instructions stored thereon, characterized in that when the instructions are executed by a processor, the reinforcement learning agent training in the multi-modal network according to any one of claims 1-5 is implemented. The method or the steps of the modal bandwidth resource scheduling method in a multi-modal network according to claim 7.