CN116743669A - Deep reinforcement learning packet scheduling method, system, terminal and medium - Google Patents

Deep reinforcement learning packet scheduling method, system, terminal and medium

Info

Publication number
CN116743669A
Authority
CN
China
Prior art keywords
priority
network
service
scheduling
value network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310642053.2A
Other languages
Chinese (zh)
Inventor
祝恩国
张海龙
郑国权
刘岩
阿辽沙•叶
李然
卢继哲
任毅
侯帅
翟梦迪
成倩
郜波
陈昊
郑安刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electric Power Research Institute Co Ltd CEPRI
Original Assignee
China Electric Power Research Institute Co Ltd CEPRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Electric Power Research Institute Co Ltd CEPRI filed Critical China Electric Power Research Institute Co Ltd CEPRI
Priority to CN202310642053.2A (Critical)
Publication of CN116743669A
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2425 Traffic characterised by specific attributes, e.g. priority or QoS for supporting services specification, e.g. SLA
    • H04L47/2433 Allocation of priorities to traffic types
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application discloses a deep reinforcement learning packet scheduling method, together with a system, a terminal and a medium implementing it. The method draws on the idea of integrated communication, sensing and computing and adopts multi-agent deep reinforcement learning to solve the packet scheduling problem for multiple QoS services. Considering the characteristics of the different service types of electric power metering equipment, it establishes a service classification model, divides services into several priority classes, and sets up a dynamic service-priority adjustment mechanism; a service packet scheduling model is built through a deep reinforcement learning algorithm, a delay-oriented packet scheduling method is designed based on a neural network, and the delays of packets of different priorities are optimized through continuous training of the neural network, so that multiple users share link bandwidth fairly, network utilization is improved, the delay of each class of electric power metering service is guaranteed, and high-quality transmission of multiple services is realized.

Description

Deep reinforcement learning packet scheduling method, system, terminal and medium
Technical Field
The application relates to the field of mobile communication, and in particular to a deep reinforcement learning packet scheduling method for the services of massive electric power metering equipment.
Background
Accurate and reliable electric energy metering equipment is the basis for guaranteeing fair and impartial settlement of electric power spot transactions. With the access of a high proportion of new energy sources and power electronic equipment, the operating mode of the power grid becomes more complex and places higher demands on metering accuracy. A wide-area online monitoring system therefore needs to be built to monitor equipment such as gateway electric energy meters and transformers online, to construct a wide-area node model, and to realize collaborative cloud analysis.
With the rapid development of smart grids, existing wireless network resources cannot meet the demands of diversified communication service types. The concept of the cognitive wireless sensor network was introduced to solve problems faced by smart-grid wireless sensor networks such as the coexistence of heterogeneous wireless networks, spectrum scarcity and low spectrum utilization. Because power communication services are strongly heterogeneous and differ widely in QoS requirements, how to use an efficient scheduling algorithm to fully exploit the time-varying characteristics of communication resources while meeting the requirements on resource utilization and service transmission quality has become one of the problems to be solved in power communication networks.
Traditional scheduling algorithms do not consider the dynamic adjustment of spectrum resources and cannot provide reliable quality-of-service guarantees for users when the available transmission resources change in real time. Improved QoS routing algorithms, and the opportunistic scheduling algorithm with packet delay guarantees proposed by Neely, meet the delay and reliability requirements of the network to some extent. Existing cognitive wireless scheduling mechanisms consider only the absolute priority of the primary user, not the relative priorities of the secondary users, and cannot provide differentiated QoS for heterogeneous services in a smart power communication network.
In recent years the field of machine learning has progressed rapidly, and good benefits can be obtained by using machine learning methods for the data transmission scheduling of electric power metering services. First, the short-time high-concurrency characteristics of the data of a million-node acquisition system based on 5G massive-access technology are extracted with a machine learning method; second, the service composition of the short-time high-concurrency traffic is analyzed and a dynamic grouping strategy suited to short-time high-concurrency access is constructed; finally, the priority of the data demand is analyzed and a packet-based scheduling management mechanism is established.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. It therefore provides a deep reinforcement learning packet scheduling method that divides services into priorities according to the QoS requirements and importance of network services and sets a reasonable, efficient packet scheduling strategy for each priority class, so as to improve the transmission performance of the data packets of higher-priority secondary users.
The application also provides a system, a terminal and a medium implementing the deep reinforcement learning packet scheduling method.
An embodiment of the deep reinforcement learning packet scheduling method according to the first aspect of the present application comprises the steps of:
dividing the collected service data into a plurality of priority classes and establishing a system model, wherein each priority class contains services with QoS requirements;
making a corresponding policy according to the system state, that is, scheduling the transmission of the service data to a target position through a certain channel and transmitting the service data that needs to be computed to an edge computing node, wherein the system state comprises the priority of the current service data and the availability of each channel;
inputting the system states of the kth and (k+1)th scheduling periods into a value network to obtain the approximate system costs of the two stages, thereby obtaining the target loss function;
feeding the loss function back into the value network to update the weight parameters of the value network.
The deep reinforcement learning packet scheduling method provided by the embodiment of the application has at least the following beneficial effects:
the method utilizes the general sense calculation integration thought and adopts a multi-agent deep reinforcement learning method to realize the packet scheduling problem of various QoS services. The method is characterized in that characteristics of different service types of the electric power metering equipment are actually considered, a service classification model is established, services are divided into a plurality of priority classes, a service priority dynamic adjustment mechanism is arranged, a service data packet scheduling model is established through a deep reinforcement learning algorithm, a data packet scheduling method facing time delay is designed based on a neural network, and time delay of the data packets with different priorities is optimized through continuous training of the neural network, so that multiple users can share link bandwidths fairly, network utilization rate is improved, time delay of various electric power metering services is guaranteed, and high-quality transmission of multiple services is realized.
According to some embodiments of the present application, the step of dividing the collected service data into a plurality of priority classes and establishing a system model, wherein each priority class contains services with QoS requirements, includes:
classifying, shaping and aggregating service flows according to the heterogeneity of power communication services, converting single flows into aggregate flows;
dividing the users in the communication network into a primary user PU and other users SU, wherein the primary user PU transmits important information for control, protection and management in the smart grid and corresponds to the highest priority, level 0;
giving different priorities to the information sent by the other users SU;
and establishing the system model based on the above conditions.
According to some embodiments of the application, the different priorities of the other users comprise four classes:
services with high real-time requirements, e.g. advanced metering systems, denoted SU1;
services with computing requirements, e.g. average power consumption, denoted SU2;
high-reliability services with moderate real-time and data-rate requirements, such as data acquisition and supervisory control, denoted SU3;
and services with low real-time and rate requirements, such as smart meter reading, denoted SU4.
According to some embodiments of the application, the step of establishing a system model based on the above conditions includes:
assuming that one cognitive frequency band consists of P orthogonal, homogeneous sub-channels shared by P PUs and N SUs; the system can be regarded as a single-hop cognitive communication network in which all users send information to a cognitive communication base station; the SUs in the system are divided into priority classes, so the channel-access capability of each priority class differs, a higher-priority SU having greater access capability to the available channels than a lower-priority user;
the SUs transmit data packets over idle spectrum resources; if a PU reappears while an SU is transmitting, the SU must vacate the channel or switch to another idle channel to continue transmission; during access and switching, a higher-priority SU may preempt the channel of a lower-priority SU, preferring the channel of the lowest-priority SU so as to avoid repeated switching;
each priority class of data packets has its own buffer queue; when all available channels are occupied by PUs or higher-priority SUs, packets of that priority are blocked, and blocked packets re-enter the buffer queue to await the next transmission scheduling;
and when an emergency occurs, raising the priority of the data packets of the affected service to ensure the reliability of the smart grid.
According to some embodiments of the present application, the step of making a corresponding policy according to the system state (that is, scheduling service data transmission to a target position through a certain channel and transmitting the service data that needs to be computed to an edge computing node, where the system state comprises the priority of the current service data and the availability of each channel) includes:
establishing a deep reinforcement learning neural network model;
initializing the value network, the policy network, the experience buffer pool parameters and other parameters;
setting the scheduling period k=1;
obtaining the state function x(k), where the current state comprises the availability of the frequency bands and the priorities of the data packets;
obtaining a policy u(k) through the policy network;
the agent executing the policy u(k) to obtain the cost function R[x(k), u(k)] and the system state function x(k+1) at the next moment;
storing the experience e_k = [x(k), u(k), R, x(k+1)] of scheduling period k in the experience replay pool;
calculating the system cost J[x(k)] from the cost function R[x(k), u(k)];
updating the scheduling period k=k+1;
obtaining the policy u(k+1) at the next moment through the policy network based on the experience samples and the system state function x(k+1) at the next moment;
and updating the policy network parameters using a policy gradient equation and updating the value network parameters according to the loss function.
According to some embodiments of the application, the step of inputting the system states of the kth and (k+1)th scheduling periods into the value network to obtain the approximate system costs of the two stages, thereby obtaining the target loss function, comprises the following steps:
inputting the system states x(k) and x(k+1) of the kth and (k+1)th scheduling periods into the value network;
the value network outputting the approximate system costs of the two stages, Ĵ[x(k)] and Ĵ[x(k+1)], the goal of value network training being to minimize the loss function.
According to some embodiments of the application, the step of inputting the loss function back into the value network for updating the weight parameters of the value network comprises:
feeding the error E_c(k) back into the value network and updating the value network weight parameters, the gradient weight update function of the value network being:
W_c(k+1) = W_c(k) + ΔW_c(k)
where l_c denotes the learning rate of the value network and ΔW_c(k) comprises ΔW_c1(k) and ΔW_c2(k), the updates of the weight matrix W_c1 between the input layer and the hidden layer and of the weight matrix W_c2 between the hidden layer and the output layer, with
ΔW_c2 = -l_c e_c c_h2^T
where c_h1 and c_h2 are the input and output matrices of the hidden layer of the value network. Updating the weight parameters by the above formulas makes the output of the neural network approach the system cost value.
An embodiment of a deep reinforcement learning packet scheduling system according to a second aspect of the present application is characterized by comprising:
a model building module, which divides the collected service data into a plurality of priority classes and establishes a system model;
a policy making module, which makes a corresponding policy according to the system state, that is, schedules the transmission of service data to a target position through a certain channel and transmits the service data that needs to be computed to an edge computing node, the system state comprising the priority of the current service data and the availability of each channel;
a loss calculation module, which inputs the system states of the kth and (k+1)th scheduling periods into a value network to obtain the approximate system costs of the two stages, thereby obtaining the target loss function;
and a weight update module, which feeds the loss function back into the value network to update the weight parameters of the value network.
A terminal according to an embodiment of the third aspect of the present application comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the deep reinforcement learning packet scheduling method described above when executing the computer program.
A computer readable storage medium according to an embodiment of the fourth aspect of the present application is characterized in that the medium stores computer executable instructions for performing the deep reinforcement learning packet scheduling method described above.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram illustrating steps of a deep reinforcement learning packet scheduling method according to an embodiment of the present application;
fig. 2 is a block diagram of a deep reinforcement learning packet scheduling system according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
In recent years, the field of machine learning has rapidly progressed. And the data transmission scheduling of the electric power metering equipment business is performed by using a machine learning method, so that good benefits can be obtained.
In order to solve the problems in the prior art, the application considers the characteristics of the different service types of electric power metering equipment, establishes a service classification model, divides services into a plurality of priority classes, sets a dynamic service-priority adjustment mechanism, establishes a service packet scheduling model through a deep reinforcement learning algorithm, designs a delay-oriented packet scheduling method based on a neural network, and optimizes the delays of packets of different priorities through continuous training of the neural network. The method jointly optimizes network slicing and the sensing and computing resources for the first time, so that multiple users share link bandwidth fairly, network utilization is improved, the delay of each class of electric power metering service is guaranteed, and high-quality transmission of multiple services is realized.
Referring to fig. 1, an embodiment of the present application provides a deep reinforcement learning packet scheduling method, which includes the steps of:
step S100, dividing the collected service data into a plurality of priority classes and establishing a system model, wherein each priority class contains QoS requirement service.
Step S200, making a corresponding policy according to the system state, that is, scheduling service data transmission to a target position through a certain channel and transmitting the service data that needs to be computed to an edge computing node, where the system state includes the priority of the current service data and the availability of each channel.
Step S300, inputting the system states of the kth and (k+1)th scheduling periods into a value network to obtain the approximate system costs of the two stages, thereby obtaining the target loss function.
Step S400, feeding the loss function back into the value network to update the weight parameters of the value network.
In order to describe the present application in more detail, the step S100 specifically includes:
step S101, classifying, shaping and aggregating the service flows according to the isomerism characteristics of the power communication service, and converting the single flow into an aggregated flow.
QoS metrics of the power communication service include data rate, time delay, packet loss rate, delay jitter, and the like. According to the isomerism characteristics of the power communication service, the service flows are classified, shaped and aggregated, and a single flow is converted into an aggregated flow.
Step S102, dividing the users in the communication network into a primary user PU and other users SU, the primary user PU sending important information for control, protection and management in the smart grid and corresponding to the highest priority, level 0.
This highest-priority information is used for crisis notification and has strict real-time and reliability requirements.
Step S103, giving different priorities to the information sent by the other users SU.
Specifically, the priorities can be divided into four classes, corresponding to priorities 1, 2, 3 and 4 (a classification sketch follows the list):
Priority 1: services with high real-time requirements, e.g. advanced metering systems, denoted SU1.
Priority 2: services with computing requirements, e.g. average power consumption, denoted SU2.
Priority 3: high-reliability services with moderate real-time and data-rate requirements, such as data acquisition and supervisory control, denoted SU3.
Priority 4: services with low real-time and rate requirements, such as smart meter reading, denoted SU4.
Step S104, a system model is built based on the conditions.
It is assumed that one cognitive frequency band consists of orthogonal, homogeneous sub-channels shared by the PUs and SUs. The system can be regarded as a single-hop cognitive communication network in which all users send information to the cognitive communication base station. The SUs in the system are prioritized, so the channel-access capability of each priority class differs: higher-priority SUs have greater access capability to the available channels than lower-priority users.
An SU uses idle spectrum resources for packet transmission; if a PU reappears during an SU's transmission, the SU must vacate that channel or switch to another idle channel to continue. During access and switching, a higher-priority SU may preempt the channel of a lower-priority SU, preferring the channel of the lowest-priority SU so as to avoid repeated switching.
Each priority class of packets has its own buffer queue. When all available channels are occupied by PUs or higher-priority SUs, packets of that priority are blocked; blocked packets re-enter the buffer queue to await the next transmission schedule.
When an emergency occurs, such as equipment damage or a periodic hardware inspection, the data packets of that service should be given a high priority to ensure the reliability of the smart grid; for example, a smart meter normally at the lowest priority should have its priority raised when an equipment abnormality is detected, so that the abnormality can be reported.
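The following is a minimal sketch of this access-and-preemption rule, assuming one pending packet per SU per period; the data structures and the function signature are illustrative and not taken from the patent text.

```python
from typing import Optional

def allocate_channel(su_priority: int,
                     free_channels: list[int],
                     occupied: dict[int, int]) -> Optional[int]:
    """Choose a channel for an SU of the given priority (smaller = higher).

    free_channels: channels currently unused by any PU or SU.
    occupied:      channel -> priority of the SU holding it this period.
    Returns the channel to use, or None if the packet is blocked.
    """
    if free_channels:                 # idle spectrum is always preferred
        return free_channels[0]
    if occupied:                      # otherwise preempt the channel of
        victim = max(occupied, key=occupied.get)   # the lowest-priority SU,
        if occupied[victim] > su_priority:         # avoiding repeated switching
            return victim
    return None   # blocked: the packet re-enters its priority buffer queue
```

A blocked packet would then be appended back to the buffer queue of its priority class to await the next scheduling period.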
Step S200 may be divided into the following steps:
step S201, establishing a deep reinforcement learning neural network model;
step S202, initializing a value network, a strategy network, experience buffer pool parameters and other parameters;
step S203, updating the scheduling period k=1;
step S204, a state function x (k) is obtained, wherein the current state comprises the availability of frequency bands and the priority of data packets;
step S205, obtaining a strategy u (k) through a strategy network;
step S206, the agent executes the strategy u (k), and obtains the cost function R [ x (k), u (k) ] and the system state function x (k+1) at the next moment;
step S207, e at time k of scheduling period k =[x(k),u(k),R,x(k+1)]Storing the experience playback pool;
step S208, calculating the system cost J [ x (k) ] through a cost function R [ x (k), u (k) ];
step S209, updating the scheduling period k=k+1;
step S2010, obtaining a policy u (k+1) at a next moment through the policy network based on the experience sample and the system state function x (k+1) at the next moment.
And step 2011, updating the strategy network parameters by using a strategy gradient equation, and updating the value network parameters according to the loss function.
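The loop over steps S203 to S211 could look roughly as follows; the env, policy_net and value_net objects and their reset/step/act/update interfaces are assumptions introduced for the sketch, since the text fixes only the order of the steps.

```python
import random

def train(env, policy_net, value_net, episodes=1000, pool_size=10000, batch=32):
    """Skeleton of the scheduling loop in steps S203 to S211 (assumed API)."""
    replay_pool = []
    for _ in range(episodes):
        k, x = 1, env.reset()                        # S203/S204: k=1, state x(k)
        done = False
        while not done:
            u = policy_net.act(x)                    # S205: policy u(k)
            x_next, cost, done = env.step(u)         # S206: R[x(k),u(k)], x(k+1)
            replay_pool.append((x, u, cost, x_next)) # S207: experience pool
            replay_pool = replay_pool[-pool_size:]
            # S208: the system cost J[x(k)] accumulates the stage costs R
            k, x = k + 1, x_next                     # S209/S210: next period
            if len(replay_pool) >= batch:            # S211: update both networks
                samples = random.sample(replay_pool, batch)
                value_net.update(samples)            # minimise the loss E_c(k)
                policy_net.update(samples, value_net)  # policy-gradient step
    return policy_net, value_net
```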
Further, step S201 specifically includes:
step S201-1, scheduling period k: the duration of each scheduling period is defined as Δτ. The scheduler makes a decision of the frequency band allocation at the beginning of each scheduling period, the PU and SU come at the beginning of each scheduling period, and leave at the end of each scheduling period after ending the service.
Step S201-2, state function x(k): the current state of the system comprises the availability of the frequency bands and the priorities of the data packets. The available frequency-band resources are abstracted as N channels; V_n(k) denotes the availability of channel n in scheduling period k, where V_n(k) = 0 means that channel n is occupied by a PU in scheduling period k and cannot be accessed by an SU, and V_n(k) = 1 means that channel n is available to the SUs in scheduling period k. Assuming M SUs randomly access the N channels, P_m(k) denotes the priority of SU m in scheduling period k; a smaller P_m(k) means a higher priority. The system state at the beginning of scheduling period k is called the state of scheduling period k, denoted x(k):
x(k) = (P_m(k), V_n(k))
The value of x(k) remains unchanged for the duration of each scheduling period. The set of all possible states is called the state space, denoted X.
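For concreteness, the state x(k) = (P_m(k), V_n(k)) can be packed into a single vector before being fed to the networks; the flat concatenation below is an assumed encoding, as the text fixes only the contents of the state, not its layout.

```python
import numpy as np

def make_state(su_priorities: list[int], channel_free: list[int]) -> np.ndarray:
    """Assemble x(k) = (P_m(k), V_n(k)) as one flat feature vector.

    su_priorities: length-M, P_m(k) for each SU (smaller = higher priority).
    channel_free:  length-N, V_n(k) flags (1 = available to SUs, 0 = PU-occupied).
    """
    p = np.asarray(su_priorities, dtype=np.float32)
    v = np.asarray(channel_free, dtype=np.float32)
    return np.concatenate([p, v])

# e.g. M=3 SUs with priorities 1, 3, 4 and N=2 channels, only channel 2 free:
x_k = make_state([1, 3, 4], [0, 1])   # -> array([1., 3., 4., 0., 1.])
```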
Step S201-3, policy function u(k): at each scheduling period k, the scheduler makes a policy u(k) according to the current system state x(k). u(k) is defined as follows:
u(k) = (u_m(k) | m = 1, 2, ..., M)
u_m(k) describes the channel the scheduler allocates to SU m in scheduling period k: u_m(k) = n means the scheduler has allocated channel n to SU m in scheduling period k, and u_m(k) = 0 means no idle channel can be allocated in scheduling period k. The policy space is denoted U, and its subset U[x(k)] contains all possible policies in system state x(k).
Step S201-4, policy space: the policy space is a set of decision functions, Π = [u(1), u(2), ..., u(k), ...]. If u(k) = u for every scheduling period k, the policy function does not change as the scheduling period changes and the scheduling policy is fixed. Only fixed policies are considered here; each policy function u(k): X → U is a mapping from the state space X to the policy space U.
Step S201-5, cost function: the cost function of scheduling period k is determined by the system state x(k) and the policy u(k) and is denoted R[x(k), u(k)]. Packet delay is an important index for evaluating QoS performance, and the cost function is the weighted sum of the packet delays:
R[x(k), u(k)] = Σ_{m=1}^{M} W_m(k) τ_m(k)
where W_m(k) is the weight coefficient of SU m's data packet (the higher the priority, the larger the weight) and τ_m(k) is the packet's transmission delay in scheduling period k, computed as follows: when SU m is blocked or interrupted during a scheduling period it must wait in the queue, and one scheduling period lasts Δτ, so τ_m(k) = Δτ; otherwise SU m accesses a given channel immediately and τ_m(k) = 0. The cost function describes the delay of a single scheduling period; the sum of the delays over the whole scheduling process is called the system (delay) cost, whose minimum is denoted J*:
J[x(k)] = Σ_{j=k}^{∞} R[x(j), u(j)]
The objective of the algorithm is to solve an optimization problem, i.e. to find the optimal policy u* that minimizes the system cost.
Step S300 includes:
Step S301, inputting the system states x(k) and x(k+1) of the kth and (k+1)th scheduling periods into the value network;
Step S302, the value network outputs the approximate system costs of the two stages, Ĵ[x(k)] and Ĵ[x(k+1)]. The goal of value network training is to minimize the loss function
E_c(k) = (1/2) e_c^2(k), with e_c(k) = Ĵ[x(k)] - R[x(k), u(k)] - Ĵ[x(k+1)]
where W_c is the weight parameter of the value network. If E_c(k) = 0 in some scheduling period k, we obtain
Ĵ[x(k)] = R[x(k), u(k)] + Ĵ[x(k+1)]
which is the same as the system cost formula, showing that when the error function value is sufficiently small, the approximate system cost produced by feeding the system state into the value network approaches the true value.
Step S400 includes:
feeding the error E_c(k) back into the value network and updating the value network weight parameters, the gradient weight update function of the value network being
W_c(k+1) = W_c(k) + ΔW_c(k)
where l_c denotes the learning rate of the value network and ΔW_c(k) comprises ΔW_c1(k) and ΔW_c2(k), the updates of the weight matrix W_c1 between the input layer and the hidden layer and of the weight matrix W_c2 between the hidden layer and the output layer, with
ΔW_c2 = -l_c e_c c_h2^T
where c_h1 and c_h2 are the input and output matrices of the hidden layer of the value network. Updating the weight parameters by the above formulas makes the output of the neural network approach the system cost value.
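The update rule can be realised, for example, by a small two-layer network such as the sketch below. The tanh activation, the initialisation and the exact form of the W_c1 step are assumptions filled in for completeness; the text fixes only the gradient form of the W_c2 update.

```python
import numpy as np

class ValueNet:
    """Minimal two-layer value network consistent with the update above."""
    def __init__(self, n_in: int, n_hidden: int, lr: float = 0.01, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))  # input -> hidden
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))     # hidden -> output
        self.lr = lr                                            # learning rate l_c

    def forward(self, x) -> float:
        self.x = np.asarray(x, dtype=float)
        self.c_h1 = self.W1 @ self.x          # hidden-layer input matrix c_h1
        self.c_h2 = np.tanh(self.c_h1)        # hidden-layer output matrix c_h2
        return float(self.W2 @ self.c_h2)     # approximate system cost J^[x]

    def update(self, e_c: float) -> None:
        grad_h = e_c * self.W2[0] * (1.0 - self.c_h2 ** 2)  # backprop to hidden
        self.W2 += -self.lr * e_c * self.c_h2[None, :]      # dW_c2 = -l_c e_c c_h2^T
        self.W1 += -self.lr * np.outer(grad_h, self.x)      # matching dW_c1 step

# One training step on the TD error:
net = ValueNet(n_in=5, n_hidden=8)
x_k, x_next = np.ones(5), np.zeros(5)
j_next = net.forward(x_next)   # J^[x(k+1)] first
j_k = net.forward(x_k)         # J^[x(k)] last, so the cache holds x(k)
net.update(j_k - 0.05 - j_next)   # e_c(k) with stage cost R = 0.05
```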
Yet another embodiment of the present application provides a deep reinforcement learning packet dispatch system, as shown in fig. 2, the system 20 comprising: a model building module 201, a policy making module 202, a loss calculating module 203 and a weight updating module 204.
The model building module 201 is capable of dividing the collected service data into a plurality of priority classes and building a system model;
the policy making module 202 can make a corresponding policy according to a system state, that is, schedule the service data transmission to a target location through a certain channel, and transmit the service data to be calculated to an edge computing node, where the system state includes the priority of the current service data and the availability of each channel;
the loss calculation module 203 inputs the system states of the kth and (k+1)th scheduling periods into the value network to obtain the approximate system costs of the two stages, thereby obtaining the target loss function;
the weight update module 204 feeds the loss function back into the value network to update the weight parameters of the value network.
The embodiment of the application considers the characteristics of the different service types of electric power metering equipment, establishes a service classification model, divides services into a plurality of priority classes, sets a dynamic service-priority adjustment mechanism, establishes a service packet scheduling model through a deep reinforcement learning algorithm, designs a delay-oriented packet scheduling method based on a neural network, and optimizes the delays of packets of different priorities through continuous training of the neural network. The method jointly optimizes network slicing and the sensing and computing resources for the first time, so that multiple users share link bandwidth fairly, network utilization is improved, the delay of each class of electric power metering service is guaranteed, and high-quality transmission of multiple services is realized.
Still another embodiment of the present application provides a terminal, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing the deep reinforcement learning packet scheduling method described above.
In particular, the processor may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and it may implement or execute the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor may also be a combination that performs computing functions, for example one or more microprocessors, or a DSP combined with a microprocessor.
In particular, the processor is coupled to the memory via a bus, which may include a path for communicating information. The bus may be a PCI bus or an EISA bus, etc. The buses may be divided into address buses, data buses, control buses, etc.
The memory may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs and Blu-ray discs), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
Alternatively, the memory is used to store the code of the computer program that executes the scheme of the application, and its execution is controlled by the processor. The processor executes the application code stored in the memory to implement the actions of the deep reinforcement learning packet scheduling system provided by the embodiment shown in fig. 2.
Yet another embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for performing the deep reinforcement learning packet scheduling method shown in fig. 1 described above.
According to the embodiment of the application, the service is divided into a plurality of priority classes through the service classification model, the service priority dynamic adjustment mechanism is set, the service data packet scheduling model is established through the deep reinforcement learning algorithm, the time delay-oriented data packet scheduling method is designed based on the neural network, and the time delay of the data packets with different priorities is optimized through the continuous training of the neural network. The method optimizes the network slice and the sensing and computing resources for the first time, so that multiple users can share the link bandwidth fairly, the network utilization rate is improved, the time delay of various electric power metering services is ensured, and the high-quality transmission of multiple services is realized.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, i.e. they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiment, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A deep reinforcement learning packet scheduling method, comprising the steps of:
dividing the collected service data into a plurality of priority classes and establishing a system model, wherein each priority class contains services with QoS requirements;
making a corresponding policy according to the system state, that is, scheduling the transmission of the service data to a target position through a certain channel and transmitting the service data that needs to be computed to an edge computing node, wherein the system state comprises the priority of the current service data and the availability of each channel;
inputting the system states of the kth and (k+1)th scheduling periods into a value network to obtain the approximate system costs of the two stages, thereby obtaining the target loss function;
and feeding the loss function back into the value network to update the weight parameters of the value network.
2. The method of claim 1, wherein the step of dividing the collected service data into a plurality of priority classes and establishing a system model, wherein each priority class contains services with QoS requirements, comprises:
classifying, shaping and aggregating service flows according to the heterogeneity of power communication services, converting single flows into aggregate flows;
dividing the users in the communication network into a primary user PU and other users SU, wherein the primary user PU transmits important information for control, protection and management in the smart grid and corresponds to the highest priority, level 0;
giving different priorities to the information sent by the other users SU;
and establishing the system model based on the above conditions.
3. The method of claim 2, wherein the different priorities of the other users comprise four classes:
services with high real-time requirements, e.g. advanced metering systems, denoted SU1;
services with computing requirements, e.g. average power consumption, denoted SU2;
high-reliability services with moderate real-time and data-rate requirements, such as data acquisition and supervisory control, denoted SU3;
and services with low real-time and rate requirements, such as smart meter reading, denoted SU4.
4. The method of claim 2, wherein the step of establishing a system model based on the above conditions comprises:
assuming that one cognitive frequency band consists of P orthogonal, homogeneous sub-channels shared by P PUs and N SUs; the system can be regarded as a single-hop cognitive communication network in which all users send information to a cognitive communication base station; the SUs in the system are divided into priority classes, so the channel-access capability of each priority class differs, a higher-priority SU having greater access capability to the available channels than a lower-priority user;
the SUs transmit data packets over idle spectrum resources; if a PU reappears while an SU is transmitting, the SU must vacate the channel or switch to another idle channel to continue transmission; during access and switching, a higher-priority SU may preempt the channel of a lower-priority SU, preferring the channel of the lowest-priority SU so as to avoid repeated switching;
each priority class of data packets has its own buffer queue; when all available channels are occupied by PUs or higher-priority SUs, packets of that priority are blocked, and blocked packets re-enter the buffer queue to await the next transmission scheduling;
and when an emergency occurs, raising the priority of the data packets of the affected service to ensure the reliability of the smart grid.
5. The method according to claim 1, wherein the step of making a corresponding policy according to the system state (that is, scheduling the service data transmission to the target location through a certain channel and transmitting the service data to be computed to the edge computing node, the system state comprising the priority of the current service data and the availability of the respective channels) comprises:
establishing a deep reinforcement learning neural network model;
initializing the value network, the policy network, the experience buffer pool parameters and other parameters;
setting the scheduling period k=1;
obtaining the state function x(k), where the current state comprises the availability of the frequency bands and the priorities of the data packets;
obtaining a policy u(k) through the policy network;
the agent executing the policy u(k) to obtain the cost function R[x(k), u(k)] and the system state function x(k+1) at the next moment;
storing the experience e_k = [x(k), u(k), R, x(k+1)] of scheduling period k in the experience replay pool;
calculating the system cost J[x(k)] from the cost function R[x(k), u(k)];
updating the scheduling period k=k+1;
obtaining the policy u(k+1) at the next moment through the policy network based on the experience samples and the system state function x(k+1) at the next moment;
and updating the policy network parameters using a policy gradient equation and updating the value network parameters according to the loss function.
6. The method of claim 5, wherein the step of inputting the system states of the kth and (k+1)th scheduling periods into the value network to obtain the approximate system costs of the two stages, thereby obtaining the target loss function, comprises:
inputting the system states x(k) and x(k+1) of the kth and (k+1)th scheduling periods into the value network;
the value network outputting the approximate system costs of the two stages, Ĵ[x(k)] and Ĵ[x(k+1)], the goal of value network training being to minimize the loss function.
7. The method of claim 1, wherein the step of feeding the loss function back into the value network to update the weight parameters of the value network comprises:
feeding the error E_c(k) back into the value network and updating the value network weight parameters, the gradient weight update function of the value network being
W_c(k+1) = W_c(k) + ΔW_c(k)
where l_c denotes the learning rate of the value network and ΔW_c(k) comprises ΔW_c1(k) and ΔW_c2(k), the updates of the weight matrix W_c1 between the input layer and the hidden layer and of the weight matrix W_c2 between the hidden layer and the output layer, with
ΔW_c2 = -l_c e_c c_h2^T
where c_h1 and c_h2 are the input and output matrices of the hidden layer of the value network; updating the weight parameters by the above formulas makes the output of the neural network approach the system cost value.
8. A deep reinforcement learning packet dispatch system, comprising:
the model building module can divide the collected service data into a plurality of priority classes and build a system model;
the policy making module, which makes a corresponding policy according to the system state, that is, schedules the transmission of the service data to a target position through a certain channel and transmits the service data that needs to be computed to an edge computing node, the system state comprising the priority of the current service data and the availability of each channel;
the loss calculation module is used for respectively inputting the system states of the k and k+1 scheduling periods into the value network to respectively obtain the approximate system costs of the two stages, so as to obtain a target loss function;
and the weight updating module can reversely input the loss function into the value network and is used for updating the weight parameters of the value network.
9. A terminal, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to implement the method of any one of claims 1 to 7.
10. A computer readable storage medium storing computer executable instructions for performing the method of any one of claims 1 to 7.
CN202310642053.2A 2023-06-01 2023-06-01 Deep reinforcement learning packet scheduling method, system, terminal and medium Pending CN116743669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310642053.2A CN116743669A (en) 2023-06-01 2023-06-01 Deep reinforcement learning packet scheduling method, system, terminal and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310642053.2A CN116743669A (en) 2023-06-01 2023-06-01 Deep reinforcement learning packet scheduling method, system, terminal and medium

Publications (1)

Publication Number Publication Date
CN116743669A true CN116743669A (en) 2023-09-12

Family

ID=87900473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310642053.2A Pending CN116743669A (en) 2023-06-01 2023-06-01 Deep reinforcement learning packet scheduling method, system, terminal and medium

Country Status (1)

Country Link
CN (1) CN116743669A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116996895A (en) * 2023-09-27 2023-11-03 香港中文大学(深圳) Full-network time delay and throughput rate joint optimization method based on deep reinforcement learning
CN116996895B (en) * 2023-09-27 2024-01-02 香港中文大学(深圳) Full-network time delay and throughput rate joint optimization method based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
Bega et al. Network slicing meets artificial intelligence: An AI-based framework for slice management
US10271339B2 (en) Radio base station apparatus and resource allocation method
Moscholios et al. State-dependent bandwidth sharing policies for wireless multirate loss networks
US11632767B2 (en) BWP allocation method, apparatus, electronic device and computer readable storage medium
WO2017092377A1 (en) Dynamic resource allocation method and device in mobile communication system
CN110972150B (en) Network capacity expansion method and device, electronic equipment and computer storage medium
CN112996116B (en) Resource allocation method and system for guaranteeing quality of power time delay sensitive service
CN115794407A (en) Computing resource allocation method and device, electronic equipment and nonvolatile storage medium
CN116743669A (en) Deep reinforcement learning packet scheduling method, system, terminal and medium
CN110677854A (en) Method, apparatus, device and medium for carrier frequency capacity adjustment
CN105517179A (en) Wireless resource scheduling method and scheduler
Yang et al. A novel distributed task scheduling framework for supporting vehicular edge intelligence
CN113747450B (en) Service deployment method and device in mobile network and electronic equipment
CN109963308A (en) Resource regulating method and device in wireless communication system
EP3942756B1 (en) Methods, apparatus and computer programs for configuring a telecommunication network
CN110225594A (en) A kind of method and device adjusting the scheduling request period
Xiao et al. A Novel Dynamic Channel Assembling Strategy in Cognitive Radio Networks with Fine-grained Flow Classification
Khatibi et al. Elastic slice-aware radio resource management with AI-traffic prediction
Zhang et al. Migration-driven resilient disaster response edge-cloud deployments
CN110493873B (en) Wireless private network spectrum allocation optimization method and device suitable for power service
Kong et al. Performance of proactive earliest due date packet scheduling in wireless networks
CN112367275A (en) Multi-service resource allocation method, system and equipment for power grid data acquisition system
Wu et al. Management of a shared-spectrum network in wireless communications
Hirayama et al. Feedback Control for QoS-Aware Radio Resource Allocation in Adaptive RAN
CN110996398A (en) Wireless network resource scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication