CN116896529A

CN116896529A - Differentiated service time delay guarantee transmission method and device

Info

Publication number: CN116896529A
Application number: CN202310710976.7A
Authority: CN
Inventors: 权伟; 刘明远; 张宏科; 刘康; 邓君; 罗延; 王新宇; 郭子琛; 王金法
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2023-10-17

Abstract

The invention provides a differentiated service delay guarantee transmission method and device, belonging to the technical field of network communication. When a differentiated transmission service is issued, initial path planning is performed according to the transmission delay and the network link delay. Along the planned initial path, the transmission state of each link and the accumulated delay of the data stream are sensed to obtain network perception information. A trained policy adjustment model is called to process the network perception information and adjust the transmission policy. When a node forwards a specific data packet, congestion and the timeliness of the data packet are judged, active packet dropping is performed, and data transmission is adjusted according to the adjusted transmission policy. The invention achieves lower RTT with smaller fluctuation and stable transmission control capability under the RTT guarantee, a lower drop rate caused by active packet dropping, and higher throughput with smaller fluctuation; it has stable transmission control capability and delay guarantee capability for deadline-constrained flows.

Description

Differentiated service time delay guarantee transmission method and device

Technical Field

The invention relates to the technical field of network communication, in particular to a differentiated service time delay guarantee transmission method and device applied to multi-data-stream parallel transmission scenes with different deadlines.

Background

In recent years, research on time-sensitive applications in industrial networks, telemedicine, and similar fields has become a trend in future network development, and these applications often require reliable transmission under deadline constraints. Because of the dynamics of time-varying networks and links, the requirements of low latency and high throughput interact with each other. In a resource-limited network, high throughput can easily lead to retransmission, congestion, and long queuing, thereby undermining the low-latency requirement.

To enable routing adjustments based on time-varying network states, a learning-based routing method may be used. Learning-based routing methods adapt well and can effectively reduce transmission delay and queuing delay. For example, Yu et al. propose a dual DQN-based multipath routing method to avoid congestion and reduce end-to-end delay; Xu et al. designed a dynamic flow transmission control framework based on the Deep Deterministic Policy Gradient (DDPG) algorithm that maximizes a widely used utility function by taking into account periodic changes in network state; the teams of Verburg and Tang use dynamic time warping and network telemetry, respectively, for periodic network detection; Oh et al. combine priority with SDN to reduce outage probability, thereby reducing processing delay and queuing delay; Zhang et al. propose an efficient slot allocation algorithm to reduce the probability of stream collisions, which can alleviate congestion and reduce queuing delay.

Because of the dynamics of the network, the available resources and transmission capacity change over time, and packet-granularity transmission control methods can achieve a more accurate tradeoff between traffic and network resources to speed up transmission. For example, Gomez et al. devised an MDP-based AQM mechanism that performs packet dropping based on explicit congestion notification, which can reduce the probability of congestion and the queuing delay. Li et al. propose BPP, which segments packets according to the importance of the data and can make full use of the remaining resources for transmission. Software Defined Networking (SDN) and Programming Protocol-independent Packet Processors (P4) also support flexible packet management in switches and routers. The packet processing procedure in a programmable switch includes a Parser phase, a Pipeline phase, and a Deparser phase. The main forwarding and packet-granularity queue management occur in the Pipeline phase. Researchers can adjust the flow table and the packet drop constraints for routing control and AQM, respectively.

Learning-based solutions can achieve globally optimal policies, but they also introduce additional processing delay in policy generation, so a training framework is needed to accelerate algorithm training. For example, the actor-critic (AC) framework was proposed to accelerate policy generation. In particular, the asynchronous advantage actor-critic (A3C) framework is designed for distributed and asynchronous algorithm training. Labao et al. designed adaptive moment gradient sharing for gradient sharing with A3C, which can effectively improve training efficiency. In addition, some optimization methods combine the actor-critic framework and proximal policy optimization with a policy feedback (PPO-PF) algorithm to accelerate policy updating and policy optimization. The above optimization methods effectively reduce processing delay.

The Nanjing medium network satellite communication stock company discloses a deterministic-delay transmission method based on joint routing and scheduling optimization. The method introduces time-sensitive networking technology into a mobile edge computing network to ensure deterministic-delay transmission, builds a model based on graph theory, and selects the most suitable route for each time-triggered service flow according to the residual bandwidth of the links and the length of the route. Constraint formulas are derived by analyzing the characteristics of the time-sensitive network switch and of the time-triggered service flows in terms of the time-slot independence constraint, the path dependence constraint, the queue independence constraint, and the time-triggering constraint. The optimization target is to minimize the unschedulable rate of time-triggered service flows in the mobile edge computing network, which is optimized using a particle swarm algorithm and a genetic algorithm, respectively. This technology has the following defects in use: the problem model is relatively simple and the control granularity is coarse, so the effect in practical applications may not be ideal; without active queue management, random packet drop control under network congestion cannot be realized and invalid data cannot be actively discarded, so performance degrades severely when the network is congested.

Zhejiang University discloses a TSN scheduling method for real-time application requirements. The method introduces a time-sensitive scheduling problem based on a software defined network (SDN) and proposes an Integer Linear Programming (ILP) formulation that allocates time slots to time-triggered streams and routes them to avoid network queuing, while maximizing the energy utilization of the allocated time slots in the network and calculating the routing and transmission schedules. This technology has the following defects in use. First, as the number of flows increases, the time needed to solve the algorithm increases exponentially. Given the highly dynamic network state, when the algorithm is applied to a large-scale time-varying network, it is difficult to complete the routing policy calculation, let alone policy adjustment, in time. Second, the model has a large number of hyperparameters, the actual performance of the algorithm depends on parameter settings, and flexibility is poor.

Disclosure of Invention

The invention aims to provide a differentiated service delay guarantee transmission method and device. To meet the transmission requirements of service flows with deadline constraints, the invention seeks to complete end-to-end transmission of service flow data within the deadline. Through network sensing and dynamic routing control, it performs path transmission control for service flows of different priorities, so that time-critical service flows are assigned to links with low delay, high bandwidth, and little queuing, and end-to-end delivery is achieved through relay transmission over multiple links, thereby solving at least one technical problem described in the background.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

the invention provides a differentiated service time delay guarantee transmission method, which comprises the following steps:

when the differentiated transmission service is issued, carrying out initial path planning according to the transmission delay and the network link delay;

sensing the transmission state of each link in the path and the accumulated delay of the data stream according to the planned initial path to obtain network perception information; wherein data-plane perception is realized based on an improved INT data packet structure; the improved INT data packet structure comprises an INT header and INT data; the INT header attribute fields include: device ID, identification (execution flag), flow ID, action, and deadline; when the execution flag is "1", the flow ID and action fields are used together to issue the routing policy; when the execution flag is "0", the INT data format is set to store the corresponding perception information; the perception attributes include: device ID, port, link latency, link bandwidth, queue length, and cumulative latency;

calling a trained strategy adjustment model to process the network perception information and adjust the transmission strategy; when the policy adjustment model is trained, network perception information is taken as input, the generated dynamic routing policy is taken as output, and dynamic routing adjustment based on network states is carried out on data streams with different deadlines;

when a node forwards a specific data packet, congestion and the timeliness of the data packet are judged, active packet dropping is performed, and data transmission is adjusted according to the adjusted transmission policy; each node judges the timeliness of the data and actively discards a data packet when its accumulated transmission time exceeds the deadline; and the node judges congestion according to its local queue and actively discards data packets when the queue length exceeds a threshold.
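The first step of the method, initial path planning from link delays, can be illustrated with a minimal sketch. The source does not specify the planning algorithm; the sketch below assumes a plain lowest-delay (Dijkstra) search and treats the flow deadline as a feasibility bound. The function name `plan_initial_path` and the `links` adjacency layout are hypothetical.

```python
import heapq


def plan_initial_path(links, src, dst, deadline):
    """Return (path, total_delay) for the lowest-delay path from src to dst,
    or None when no path exists or the best path already exceeds the deadline.

    links: dict mapping node -> list of (neighbor, link_delay) pairs
           (hypothetical layout chosen for this sketch).
    """
    best = {src: 0.0}          # lowest known delay to each node
    prev = {}                  # predecessor on the lowest-delay path
    heap = [(0.0, src)]
    while heap:
        delay, node = heapq.heappop(heap)
        if node == dst:
            break
        if delay > best.get(node, float("inf")):
            continue           # stale heap entry
        for nxt, link_delay in links.get(node, []):
            cand = delay + link_delay
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    if dst not in best or best[dst] > deadline:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[dst]


if __name__ == "__main__":
    topo = {"A": [("B", 2.0), ("C", 5.0)], "B": [("C", 1.0)], "C": []}
    print(plan_initial_path(topo, "A", "C", deadline=10.0))
```

In the method itself this initial path is only a starting point: the later steps adjust the route dynamically from the sensed network state.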

The invention provides a reliable transmission guarantee mechanism for differentiated delay services. By sensing the network state and the real-time accumulated transmission delay of data streams, a machine learning algorithm is introduced to generate a data stream routing control policy, and the transmission path of each data stream is dynamically adjusted based on its remaining deadline, so that, as far as possible, data completes end-to-end transmission within its deadline.

Optionally, according to the designed local node active queue management algorithm (as shown in fig. 5), each node judges data timeliness and actively discards a data packet when its accumulated transmission time exceeds the deadline. In addition, the node judges congestion from its local queue and actively discards data packets when the queue length exceeds a threshold. Finally, ordinary data packets are forwarded according to the routing policy.

Optionally, according to the designed multi-node cooperative routing control algorithm (as shown in fig. 6), a neural network model is trained and optimized, network perception information is taken as input, a dynamic routing strategy is generated, and dynamic routing adjustment based on network states is performed on data streams with different deadlines.

The invention designs a multi-node cooperative routing transmission framework (shown in figure 1): the whole process of network sensing, data processing, strategy adjustment and strategy issuing is managed, and the cyclic optimization of the dynamic routing strategy is realized.

Optionally, cyclic optimization of the policy is performed mainly in the intelligent policy controller component of the above framework. The cyclic optimization process is shown in fig. 4: a policy update model based on an actor-critic framework is designed, which divides the optimization process into two parts, a data forwarding plane and an intelligent control plane. The intelligent control plane uses the actor-critic framework to adjust the policy parameters through execution feedback on new experience data; the specific process is shown in figs. 4 and 6. The data forwarding plane forwards data according to the policy issued by the intelligent control plane.

The invention designs active queue management logic based on P4, which modifies the internal processing logic of the switch, calculates the accumulated execution time of each data packet, and judges its timeliness. When the accumulated forwarding delay exceeds the deadline, the data packet is judged invalid and is actively dropped. This process incorporates the designed local node active queue management algorithm (as shown in figs. 3 and 5).

The invention designs an improved INT data packet structure and uses the INT framework to integrate sensing and control in the data plane; the details are shown in the INT data packet structure part of fig. 2. The data packet structure mainly comprises two parts: the INT header and the INT data. The INT header attribute fields include: device ID, identification (execution flag), flow ID, action, and deadline. The execution flag takes the value "0" or "1". When the execution flag is "1", the flow ID and action fields are used together to issue the routing policy, that is, to select the downstream link of the node and thereby realize routing transmission control. When the execution flag is "0", the INT data format is set to store the corresponding perception information. The perception attributes include: device ID, port, link latency, link bandwidth, queue length, and cumulative latency. The specific definitions are shown in Table 2.

The invention designs a differentiated time delay service reliable transmission system based on P4, which realizes each component of the framework of FIG. 1, and each component is equivalent to a module to execute corresponding functions. The design and function of each component is shown in detail in fig. 1.

Term interpretation:

(1) PPO proximal policy optimization algorithm: Proximal Policy Optimization is a policy gradient method for reinforcement learning that alternates between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. The method has the stability and reliability of trust region methods, is simple to implement, applies to more general settings with only small changes to the ordinary policy gradient, and maximizes long-term average performance.

(2) In-band network telemetry: In-band Network Telemetry (INT) is a network information collection technology that aims to collect information within a network. As networks grow in size, troubleshooting becomes more difficult, so techniques are needed to analyze and monitor network traffic in real time and to troubleshoot network faults automatically. Network telemetry is a technology for monitoring a network in real time; it enables remote, fine-grained management and can locate network problems accurately and promptly.

(3) P4 programmable data plane: Programming Protocol-independent Packet Processors (P4) is a domain-specific language for network devices that specifies how data plane devices (switches, NICs, routers, filters, etc.) process data packets.

(4) AQM active queue management: Active Queue Management is a queue management policy for routers that deliberately drops a portion of packets before the router buffer is exhausted, which can reduce network congestion or improve end-to-end delay.

(5) Data failure: if the end-to-end transmission delay of the data is greater than the deadline of the data, the data is judged to be invalid.

The invention has the following beneficial effects: the RTT is lower with smaller fluctuation, and stable transmission control capability is achieved under the RTT guarantee; the drop rate caused by active packet dropping is lower, and the throughput is higher with smaller fluctuation; the system has stable transmission control capability and delay guarantee capability for deadline-constrained flows.

The advantages of additional aspects of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of a multi-node cooperative routing transmission framework according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of an INT packet structure according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of an active queue management logic based on P4 according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a policy update model based on an Actor-critic framework according to an embodiment of the present invention.

FIG. 5 is a flowchart of an active queue management algorithm for a local node according to an embodiment of the present invention.

Fig. 6 is a flowchart of a multi-node cooperative routing control algorithm according to an embodiment of the present invention.

Fig. 7 is a schematic flow chart of a method for reliably transmitting differentiated delay services according to an embodiment of the present invention.

Fig. 8 is a schematic diagram of specific steps of a method for reliably transmitting differentiated delay services according to an embodiment of the present invention.

Fig. 9 is a functional framework diagram of a differentiated time delay service reliable transmission system according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements throughout or elements having like or similar functionality. The embodiments described below by way of the drawings are exemplary only and should not be construed as limiting the invention.

It will be appreciated by those skilled in the art that the drawings are merely schematic representations of examples and that the elements of the drawings are not necessarily required to practice the invention.

This embodiment first provides the reliable transmission guarantee mechanism for differentiated delay services according to the present invention, which comprises several parts. First, a multi-node cooperative routing transmission framework is designed to control the processes of network sensing, policy generation and updating, and policy issuing, realizing dynamic routing adjustment based on the remaining deadline. Second, a specific INT data packet structure is designed to process the corresponding sensing and control actions jointly. For policy generation and optimization, a policy training model is designed that combines the Actor-Critic framework with a deterministic policy reinforcement learning algorithm (Twin Delayed Deep Deterministic policy gradient algorithm, TD3); the policy is trained iteratively with new experience values from the perception module, and the policy parameters are adjusted according to the latest changes in network state to generate a dynamic routing scheduling policy (see the algorithm processes of figs. 4 and 6 for details). When a data packet enters the route forwarding device, a scheduler judges network congestion and the timeliness of the data packet (when the accumulated transmission time exceeds the specified transmission delay, the data packet is judged invalid) and performs active packet drop control. Packet forwarding is then performed according to the routing policy (see figs. 3 and 5 for details).

The multi-agent reliable transport framework (MRTF) designed in this embodiment performs network detection, dynamic routing control, and active queue management, as shown in fig. 1. In the data forwarding plane, the network state is detected using the specifically designed INT packet structure (shown in fig. 2). The intelligent control plane has four components: a cross-layer parser, a perception observer, a data analyzer, and an intelligent routing controller. The cross-layer parser parses the INT perception packets and assigns policy packets to specific nodes (the perception packet and policy packet structures are shown in fig. 2).

When forwarding a data packet, the node parses the packet fields in the forwarding device kernel; the "action" field in Table 1 is used to determine whether the packet is a perception packet, a policy packet, or an ordinary data packet, and the corresponding processing code is then executed. The perception observer (implemented by developing the forwarding device kernel code with programmable data plane technology) extracts important data from the INT data and stores it by category. The data analyzer formats the observer's state and serializes it for the intelligent routing controller. The intelligent routing controller outputs policy operations according to its internal algorithm (shown in fig. 6) and sends these operations to the cross-layer API for routing operation allocation.

Fig. 2 shows the designed INT packet structure. An INT header is added after the IP header for specific identification. In the INT header, the device ID marks a particular node. The identification field identifies the function of the packet, either INT data collection (network detection) or INT policy assignment. The action field selects the downstream transmission link of the current node. The deadline is the maximum transmission delay allowed for the current data stream service. The INT data section stores six collected parameters. The device ID marks a particular node. The port marks the egress port through which the packet passes. The link latency is the total delay, including the processing delay in the node and the transmission delay of the forwarding link. The link bandwidth is computed from the ingress and egress timestamps. The queue length records the queue depth when the data packet leaves the node. The cumulative delay is the accumulated transmission delay since the sender. Tables 1 and 2 describe the length and function of each field. When the execution flag is "0", the INT data format is set to store the corresponding perception information. The INT data section fields are described in Table 2.

TABLE 1

TABLE 2
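As an illustration of the two-part INT structure described for fig. 2, the sketch below serializes an INT header and an INT data record. The field names follow the description; the field widths, the byte order, and the helper `classify` are assumptions made for this sketch only, since the field lengths are not available in this text.

```python
import struct
from dataclasses import dataclass, astuple

# Assumed layouts (network byte order, no padding). The widths are
# hypothetical; the actual lengths are defined by Tables 1 and 2.
HEADER_FMT = "!HBHBI"    # device_id, flag, flow_id, action, deadline
DATA_FMT = "!HHIIII"     # device_id, port, link_latency, link_bandwidth,
                         # queue_length, cumulative_latency
HEADER_SIZE = struct.calcsize(HEADER_FMT)
DATA_SIZE = struct.calcsize(DATA_FMT)


@dataclass
class IntHeader:
    device_id: int
    flag: int        # 1: policy issuing, 0: perception data follows
    flow_id: int
    action: int      # downstream link selected for the flow
    deadline: int    # maximum allowed transmission delay

    def pack(self) -> bytes:
        return struct.pack(HEADER_FMT, *astuple(self))

    @classmethod
    def unpack(cls, raw: bytes) -> "IntHeader":
        return cls(*struct.unpack(HEADER_FMT, raw[:HEADER_SIZE]))


@dataclass
class IntData:
    device_id: int
    port: int
    link_latency: int
    link_bandwidth: int
    queue_length: int
    cumulative_latency: int

    def pack(self) -> bytes:
        return struct.pack(DATA_FMT, *astuple(self))

    @classmethod
    def unpack(cls, raw: bytes) -> "IntData":
        return cls(*struct.unpack(DATA_FMT, raw[:DATA_SIZE]))


def classify(header: IntHeader) -> str:
    """Map the execution flag to the packet role described in the text."""
    return "policy" if header.flag == 1 else "perception"


if __name__ == "__main__":
    hdr = IntHeader(device_id=7, flag=0, flow_id=42, action=0, deadline=5000)
    rec = IntData(7, 2, 120, 1000, 5, 340)
    wire = hdr.pack() + rec.pack()
    print(classify(IntHeader.unpack(wire)), IntData.unpack(wire[HEADER_SIZE:]))
```

A perception packet would append one `IntData` record per traversed node, while a policy packet carries only the header with the flow ID and action fields set.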

Fig. 3 shows the active queue management logic based on P4. A scheduler is defined after the ingress pipeline in the P4 framework. When a packet completes ingress processing, it enters the scheduler. The specific implementation process is as follows:

S1: the scheduler will first perform a total delay calculation;

S2: when the total delay exceeds the deadline, the packet is invalid and the scheduler actively discards it.

S3: the scheduler compares the current queue depth to the congestion threshold.

S4: if the current queue depth exceeds the congestion threshold, the scheduler will perform a random discard operation (note that flows occupying more buffers will be discarded with a greater probability).

S5: finally, the data packet is forwarded according to the routing policy.
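Steps S1 to S5 can be sketched in Python as follows. This is an illustrative model of the scheduler logic, not the P4 implementation; the packet dictionary keys, the `flow_occupancy` map, and the rule that the drop probability equals the flow's share of the queue are assumptions chosen to reflect the note that flows occupying more buffer are dropped with greater probability.

```python
import random


def schedule_packet(pkt, now, queue_depth, flow_occupancy,
                    congestion_threshold, rng=random):
    """Return "drop_expired", "drop_congested", or "forward" for one packet.

    pkt: dict with keys "flow_id", "cumulative_delay", "ingress_time",
         "deadline" (hypothetical field names for this sketch).
    flow_occupancy: dict mapping flow_id -> packets of that flow in the queue.
    """
    # S1: total delay = delay accumulated upstream + time spent in this node.
    total_delay = pkt["cumulative_delay"] + (now - pkt["ingress_time"])

    # S2: the packet has missed its deadline, so it is actively discarded.
    if total_delay > pkt["deadline"]:
        return "drop_expired"

    # S3: compare the current queue depth with the congestion threshold.
    if queue_depth > congestion_threshold and queue_depth > 0:
        # S4: random discard; assumed probability = the flow's buffer share.
        share = flow_occupancy.get(pkt["flow_id"], 0) / queue_depth
        if rng.random() < share:
            return "drop_congested"

    # S5: forward according to the routing policy.
    return "forward"


if __name__ == "__main__":
    packet = {"flow_id": 1, "cumulative_delay": 4.0,
              "ingress_time": 10.0, "deadline": 8.0}
    print(schedule_packet(packet, now=11.0, queue_depth=3,
                          flow_occupancy={1: 2}, congestion_threshold=10))
```

Checking the deadline before the congestion test means an expired packet never consumes a random-drop decision, which matches the order of S2 and S3.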

The algorithm design in this embodiment aims to establish a flow scheduling problem with queue stability and end-to-end transmission delay guarantee constraints, and thereby to find a transmission policy $\pi^*$ that performs accurate routing control and discard control to reduce transmission delay and queuing delay.

First, consider a flow scheduling model that aims at balancing transmission requirements with deadline limited available resources. The flow scheduling policy may control the rate of service of the node to different flows based on the current network state and flow transmission requirements (e.g., low latency, high throughput, low drop rate, etc.).

Assume that there are $N$ nodes in the network, forming a fixed undirected network graph $G = \langle N, E \rangle$, where $E$ is the set of all links. $l_{ij} \in \{0, 1\}$, where $l_{ij} = 1$ indicates that node $i$ and node $j$ are connected; otherwise, $l_{ij} = 0$. To simplify the model formulas, the link set is further defined as $\mathcal{L} = \{1, 2, \ldots, l, \ldots, L\}$, where $L = |E|$. To achieve clear transmission control, the flow scheduling process is divided into discrete time slots $t \in \{0, 1, 2, \ldots\}$. Let $\alpha_l(t)$ be the arrival rate of link $l$ at time slot $t$ over all flows, $R_l$ the set of all flows passing through link $l$, and $\alpha_l^r(t)$ the arrival rate of flow $r$ in link $l$ at time slot $t$. Let $t_l(t)$ denote the transmission rate of link $l$ at time slot $t$. Let $d_l(t)$ be the drop rate of link $l$ over all flows, and $d_l^r(t)$ the drop rate of flow $r$ in link $l$ at time slot $t$. Furthermore, because resources are limited, the network service capacity $\Lambda$ is limited. Based on the network state $s(t)$, a flow scheduling policy $\pi^*$ that meets the flow transmission requirements is sought. $\pi^*$ outputs a transmission vector $\mathbf{t}(t) = (t_1(t), \ldots, t_l(t), \ldots, t_L(t))$ and a discard vector $\mathbf{d}(t) = (d_1(t), \ldots, d_l(t), \ldots, d_L(t))$ to direct the transmission and discard control actions, respectively, at each time slot $t$.

The specific flow is as follows:

$d_l(t) \le \max\{0, \alpha_l(t) - t_l(t)\}$ (3)

$t_l(t)$ and $d_l(t)$ are non-negative variables for each time slot $t$. Equation (2) means that the transmission rate $t_l(t)$ should not exceed the minimum of the arrival rate $\alpha_l(t)$ of link $l$ at time slot $t$ and the maximum transmission capacity of link $l$. Equation (3) means that $d_l(t)$ should not exceed the traffic remaining in link $l$ (its queue depth) at time slot $t$. In general, the transmission capacity of link $l$ is also constrained by the connected nodes $i$ and $j$.

For ease of explanation, an alternative notation is used for this capacity.

$f^{ap}$ is a function that estimates the transmission capacity based on link resources (transmission bandwidth) and the resources of the connected nodes (e.g., CPU). $C_i(t)$ and $C_j(t)$ denote the available resources of node $i$ and node $j$, respectively, at time slot $t$, and the real-time available bandwidth of link $l_{i,j}$ is also taken into account. Because of the dynamics and randomness of the network and traffic, for any link $l$ the arrival rate $\alpha_l(t)$, the transmission rate $t_l(t)$, and the drop rate $d_l(t)$ are interdependent, as expressed by Equation (5).

Equation (5) shows that all flows arriving at link l will eventually be forwarded or dropped.
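The per-slot constraints described for Equations (2) and (3) can be checked with a small sketch. The function name `clamp_link_actions` and the idea of clamping a requested action pair into the feasible region are assumptions for illustration; the source only states the bounds themselves.

```python
def clamp_link_actions(arrival, capacity, requested_tx, requested_drop):
    """Project a requested (transmit, drop) pair onto the per-slot bounds.

    arrival:  arrival rate alpha_l(t) of the link in this slot
    capacity: maximum transmission capacity of the link in this slot
    Returns (tx, drop) with
        0 <= tx   <= min(arrival, capacity)        # bound described for Eq. (2)
        0 <= drop <= max(0, arrival - tx)          # bound of Eq. (3)
    """
    tx = max(0.0, min(requested_tx, arrival, capacity))
    drop = max(0.0, min(requested_drop, max(0.0, arrival - tx)))
    return tx, drop


if __name__ == "__main__":
    # 10 units arrive, the link can carry 6, the policy asks for 8 and 5.
    print(clamp_link_actions(10.0, 6.0, 8.0, 5.0))
```

When the clamped pair sums to the arrival rate, everything that arrives in the slot is either forwarded or dropped, which is one reading of the statement about Equation (5).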

At this time, a flow scheduling optimization problem (P1) of maximizing throughput is determined.

$0 \le d_l(t) \le \max\{0, \alpha_l(t) - t_l(t)\}$ (10)

$\pi^*$ and $\Lambda$ strongly hold (11)

where $\omega_l$ is the weighting factor of link $l$. Equations (6) and (7) are the objectives of model P1 and represent the long-term service throughput of link $l$ over time slots $t$ as $t \to \infty$. Equation (8) represents the constraint control of the flow scheduling policy. Equation (9) states that the transmission rate $t_l(t)$ of link $l$ at time slot $t$ cannot exceed the arrival rate $\alpha_l(t)$ or the maximum bandwidth transmission capability of link $l$. Equation (10) states that the packet drop rate $d_l(t)$ of link $l$ at time slot $t$ cannot exceed the amount of data remaining to be transmitted. Equation (11) states that the current model holds under the joint constraint of the ideal control policy $\pi^*$ and the network service capacity $\Lambda$.

Furthermore, congestion and queuing are difficult to eliminate in practical networks. Thus, a queue model is defined to measure the probability of congestion and queuing. And then, analyzing the queuing, and making specific strategy adjustment to reduce queuing delay and enhance the stability of flow scheduling.

Under the scheduling process of $\pi^*$ with $t_l(t) \le \alpha_l(t)$, flows that are neither forwarded nor dropped are queued for a short period of time. Assume that there are $N$ queues to relax the network dynamic scheduling process. Define $q_l(t)$ as the queue depth of link $l$ at time slot $t$. $q_l(t)$ is updated as

where $t_l(t)$, $d_l(t)$, and $\alpha_l(t)$ denote the transmission rate, discard rate, and arrival rate of link $l$ at time slot $t$, respectively. Under policy $\pi^*$ and the constraint of the maximum network capacity $\Lambda$, the long-term queue depth $Q_l(t)$ of any link $l$ is constrained by:

This constraint indicates that policy $\pi^*$ has stable transmission control capability and can reduce the probability of congestion and long-time queuing.
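The queue model can be illustrated with a short simulation. The update rule below, $q_l(t+1) = \max\{0,\ q_l(t) + \alpha_l(t) - t_l(t) - d_l(t)\}$, is an assumed standard queue recursion and not a formula quoted from the source; likewise, serving the backlog up to a fixed capacity and dropping only what exceeds a backlog threshold is a hypothetical policy used to show a bounded queue.

```python
def update_queue(q, arrival, tx, drop):
    """Assumed queue recursion: backlog plus arrivals minus service and drops."""
    return max(0.0, q + arrival - tx - drop)


def simulate_link(arrivals, capacity, threshold):
    """Run a toy policy on one link and return the queue depth after each slot.

    Per slot: transmit as much backlog as the capacity allows, then drop
    whatever would leave the queue deeper than `threshold` (hypothetical rule).
    """
    q = 0.0
    history = []
    for arrival in arrivals:
        backlog = q + arrival
        tx = min(backlog, capacity)
        drop = max(0.0, backlog - tx - threshold)
        q = update_queue(q, arrival, tx, drop)
        history.append(q)
    return history


if __name__ == "__main__":
    depths = simulate_link([5.0, 5.0, 5.0, 5.0], capacity=4.0, threshold=2.0)
    print(depths, sum(depths) / len(depths))
```

Even though arrivals exceed the capacity in every slot, the queue depth stays at or below the threshold, which is the kind of long-term boundedness the stability constraint asks of $\pi^*$.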

During actual transmission, the node has a queue buffer to accommodate fluctuations in network capacity. Therefore, the queue model is used to transform the flow scheduling optimization problem (P1) to obtain a maximized problem:

$0 \le d_l(t) \le \max\{0, \alpha_l(t) + q_l(t) - t_l(t)\}$ (17)

$\pi^*$ and $\Lambda$ strongly hold (19)

This conversion can be understood as follows: when $\pi^*$ and $\Lambda$ strongly hold, $\{\mathbf{t}(t), \mathbf{d}(t)\}$ is obtained at each time slot $t$. All packets are transmitted, dropped, or queued in the node buffer in each time slot $t$. The model objective now takes queuing into account: the throughput maximization objective of Equations (6) and (7) is converted into Equation (14), i.e., the service-rate throughput is converted into the throughput optimization objective. Equations (8)-(11) take the queue variation $q_l(t)$ into account when controlling the transmission rate and packet drop rate, and are converted into Equations (15), (16), (17), and (19). Equation (18) accounts for the stability of the system transmit queues in the model.

In addition, it is also contemplated that during actual queuing, there are many different flows of packets in the queue. All streams have different deadlines for packets. Thus, a delay model is designed to analyze the queuing delay of each flow in detail, which may enhance the flow scheduling policy to reduce the queuing delay.

The longest queuing delay of the first packet (head of line packet) of link l is calculated to measure congestion, and active queue management will be performed for each flow for scheduling fairness. Thus, one stream granularity delay queue is defined for all packets.

Here an indicator variable distinguishes two cases: either the head-of-line packet of flow $r$ keeps waiting in link $l$ from $t$ to $t+1$, or this packet is forwarded or dropped from link $l$.

where $dl_{max2}$ is the queuing delay of the new head-of-line packet of flow $r$ in link $l$. An auxiliary variable records the queuing delay of the new head-of-line packet at time slot $t$ when the old head-of-line packet is forwarded or dropped. In detail, if link $l$ still holds packets of flow $r$ when the old head-of-line packet is forwarded or dropped, the first remaining packet is selected as the new head-of-line packet and the longest queuing delay is updated accordingly. Otherwise, all packets of flow $r$ have been forwarded or dropped. Note that the packets not selected as head-of-line packets also have their queuing delay updated from $t$ to $dl_{max2} + 1$.

The above delay model can be understood as follows. On the one hand, if the transmission system starts from time slot $t = 0$, or all packets of flow $r$ have been forwarded or dropped by time slot $t$ so that $q_l(t) = 0$, the delay queue only needs to be evaluated for new arrivals: if link $l$ receives new packets of flow $r$, the first of them becomes the head-of-line packet and its queuing delay starts to be counted; otherwise, the queue of link $l$ remains empty. On the other hand, if $q_l(t) > 0$, only the queue delay of the head-of-line packet of flow $r$ needs to be updated at time slot $t+1$.

If the same head-of-line packet of flow $r$ remains in the queue of link $l$ from $t$ to $t+1$, its queue delay at time slot $t+1$ grows by one slot. If the head-of-line packet of flow $r$ at time slot $t+1$ differs from the one at time slot $t$, the first remaining packet of flow $r$ is selected as the new head-of-line packet, and the queue delay is updated to that of the new head-of-line packet. Note that the system considers only the longest queuing delay of flow $r$, so only the queuing delay of the head-of-line packet of flow $r$ is tracked in the model. The queuing delay of every packet of flow $r$ in link $l$ is nevertheless kept up to date so that a new head-of-line packet can be selected. In other words, when the head-of-line packet changes at time slot $t$, the queue delay is updated from the internal tag recorded when the packet arrived at link $l$.
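The head-of-line bookkeeping described above can be sketched as follows. This is an illustrative reading, not the patent's implementation: it assumes that the "internal tag" of a packet is simply its arrival slot, and the class and method names (`FlowDelayQueue`, `hol_delay`) are hypothetical.

```python
from collections import deque


class FlowDelayQueue:
    """FIFO of one flow on one link that tracks the head-of-line (HOL) delay."""

    def __init__(self):
        # Arrival slot of every queued packet (assumed form of the internal tag).
        self._arrivals = deque()

    def enqueue(self, slot: int) -> None:
        """Record a packet of this flow arriving at `slot`."""
        self._arrivals.append(slot)

    def dequeue(self) -> None:
        """Forward or drop the current HOL packet; the next packet becomes HOL."""
        self._arrivals.popleft()

    def hol_delay(self, slot: int) -> int:
        """Queuing delay of the HOL packet at `slot`, or 0 if the flow is empty."""
        return slot - self._arrivals[0] if self._arrivals else 0
```

Because every packet keeps its own arrival slot, the delay of a newly selected head-of-line packet is available immediately when the old one leaves, which is the behavior the paragraph above describes.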

In practical applications, any delay-sensitive flow $r$ has a flow-granularity maximum end-to-end delay constraint. Thus, we have

where $R_r$ is the set of all links that flow $r$ traverses in the network. If $\pi^*$ and $\Lambda$ strongly hold, then under $\pi^*$ each flow $r$ is scheduled accurately, segment by segment, at each link $l$. In each link $l$ there is therefore a packet-granularity maximum transmission delay constraint for flow $r$ (including queuing delay and processing delay):

Finally, the goal is to find a transmission strategy $\pi^*$ that performs accurate routing control, to reduce transmission delay, and discard control, to reduce queuing delay. Because network states are diverse and dynamic, drift-plus-penalty theory is introduced for joint optimization. Defining $\Theta(t)$ as the combination of the queue model and the delay model yields:

where $Q(t)$ is the queue vector and $H(t) = \{H_l(t);\ l \in L\}$ is the vector of head-of-line delay values. All queues are assumed to be initially empty, so $\Theta(0) = 0$. Following Lyapunov drift optimization, $L(t)$ is defined to measure the stability of $\Theta(t)$:

The relationship between queue stability and the delay model must now be captured. Since the following algorithm considers flow-granularity queuing, the delay model of $H_l(t)$ is further defined as follows:

$$H_l(t+1) = \beta_l(t)\,\bigl(H_l(t) + 1\bigr) + \bigl(1 - \beta_l(t)\bigr)\,\bigl(1_l(t) + 1\bigr) \tag{28}$$

where

$dl_{max}$ is the queuing delay of the new head-of-line packet in link $l$. $1_l(t) = dl_{max}$ indicates that link $l$ still has packets of flow $r$ at time slot $t+1$; otherwise, $1_l(t) = 0$.
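Formula (28) translates directly into a one-line update. The sketch below assumes that $\beta_l(t)$ takes the value 1 while the head-of-line packet keeps waiting and 0 once it is forwarded or dropped; the function and argument names are hypothetical.

```python
def update_hol_delay(h: int, beta: int, dl_max: int, has_backlog: bool) -> int:
    """One step of H_l(t+1) = beta*(H_l(t)+1) + (1-beta)*(1_l(t)+1).

    h           -- current head-of-line delay H_l(t)
    beta        -- 1 if the HOL packet keeps waiting, 0 if it left (assumption)
    dl_max      -- queuing delay of the new HOL packet, if there is one
    has_backlog -- True if the link still holds packets of the flow at t+1
    """
    indicator = dl_max if has_backlog else 0  # the term 1_l(t)
    return beta * (h + 1) + (1 - beta) * (indicator + 1)
```

With `beta = 1` the delay simply grows by one slot; with `beta = 0` it restarts from the delay already accumulated by the next packet in line.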

Let $t_r$ be the starting slot of the stable queue backlog $\Theta(t)$. Assuming that all queues are empty at the beginning ($t_r = 0$) gives $\Theta(0) = 0$. For a duration $T_r$, $\Delta(t)$ is defined as follows:

where $T_r$ is the cycle time of policy scheduling. Minimizing $\Delta(t)$ yields a short queuing delay together with stable multi-flow scheduling. Further optimizing P2 gives the final optimization problem (P3):

$$0 \le d_l(t) \le \max\{0,\ \alpha_l(t) + q_l(t) - t_l(t)\} \tag{34}$$

$\pi^*$ and $\Lambda$ strongly hold. (35)

where $V$ is a non-negative penalty parameter that controls the trade-off between flow scheduling (throughput) and queuing delay.

To simplify the search for the optimal control strategy, P3 is decomposed into a minimization problem. Lyapunov optimization theory is used to bound $\Delta(t)$ and to handle the constraints stated in the equations above. This gives the following lemma.

Lemma 1: Under the optimal control strategy $\pi^*$, the transmission control actions are obtained during the working time $\tau \in \{t_r, \dots, t_r + T_r - 1\}$. Let $T_r$ follow a geometric distribution with probability $\phi$. Then, given $\Theta(t_r)$, $\Delta(t)$ satisfies:

$$\Delta(t) \le B + E\{G(t_r) \mid \Theta(t_r)\} \tag{36}$$

where $G(t_r)$ is defined as:

and $B$ is a finite constant defined as:

Proof: Let $\tau \in \{t_r, \dots, t_r + T_r - 1\}$ be a time slot. Applying the inequality

$$(\max[Q - b, 0] + A)^2 \le Q^2 + A^2 + b^2 + 2Q(A - b)$$ gives

Under the optimal control strategy $\pi^*$, all packets are forwarded or dropped to avoid congestion and long queuing. Moreover, the deadline constraints do not allow packets to be queued for long periods. Thus, for any time slot $\tau$, there is a maximum queue depth $le$ such that $t_l(\tau) + d_l(\tau) \le le$. Furthermore, all queues are reshaped at the start slot $t_r$ such that $q_l(t_r) \le le$. In addition, we obtain

Summing the above over $\tau \in \{t_r, \dots, t_r + T_r - 1\}$ and dividing by 2 yields

Similarly, squaring equation (39) yields

where $\beta_l(\tau)^2 \le 1$, $(1 - \beta_l(\tau))^2 \le 1$, and the remaining bounding term is the maximum of all propagation delay constraints of the flows in link $l$. Summing equation (40) over $\tau \in \{t_r, \dots, t_r + T_r - 1\}$ and dividing by 2 yields

Summing equations (40) and (41) over $l \in L$ and taking the conditional expectation given $\Theta(t_r)$ yields

Furthermore, because the network and the traffic are dynamic, policy $\pi^*$ needs to adjust the duration $T_r$. Thus, independently of the distribution of $\alpha_l(\tau)$, $T_r$ is assumed to follow a geometric distribution with probability $\phi$, which fixes the first and second moments of $T_r$. Then, according to formula (42), we obtain

Finally, the traffic follows a Poisson distribution, so $\alpha_l(t)$ is bounded. This yields the optimization result stated in Lemma 1. According to Lemma 1, P3 can be reduced to the minimization problem

Equation (44) above remains subject to the joint constraints stated in the preceding equations.
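Two ingredients of the proof of Lemma 1 can be checked numerically: the squared-queue inequality and the moments of a geometrically distributed $T_r$. The sketch below is only a sanity check under stated assumptions: $Q$, $b$ and $A$ are non-negative, and $T_r$ is supported on $\{1, 2, \dots\}$ with $P(T_r = k) = \phi(1-\phi)^{k-1}$, for which the standard closed forms are $E[T_r] = 1/\phi$ and $E[T_r^2] = (2-\phi)/\phi^2$. The helper names are hypothetical.

```python
import itertools


def drift_inequality_holds(q: float, b: float, a: float) -> bool:
    """Check (max[Q-b, 0] + A)^2 <= Q^2 + A^2 + b^2 + 2Q(A-b) for one triple."""
    lhs = (max(q - b, 0.0) + a) ** 2
    rhs = q * q + a * a + b * b + 2.0 * q * (a - b)
    return lhs <= rhs + 1e-9  # small tolerance for floating-point rounding


def geometric_moments(phi: float, terms: int = 10_000):
    """First and second moments of T with P(T=k) = phi*(1-phi)**(k-1), k >= 1,
    computed by truncated summation of the series."""
    m1 = m2 = 0.0
    p = phi  # probability of the current k, starting at k = 1
    for k in range(1, terms + 1):
        m1 += k * p
        m2 += k * k * p
        p *= 1.0 - phi
    return m1, m2


# Evaluate the inequality on a small grid of non-negative values.
grid = [0.0, 0.5, 1.0, 2.0, 7.5]
all_ok = all(drift_inequality_holds(q, b, a)
             for q, b, a in itertools.product(grid, repeat=3))
```

The inequality is what lets the squared backlog at the next slot be bounded by terms that are linear in the current backlog, which is the step that produces a bound of the form (36).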

The following describes how to find the optimal control strategy $\pi^*$ that minimizes this equation. However, because of network dynamics and noisy data, the available capacity of the network varies over time, and $T_r$ fluctuates as well. It is therefore difficult to find the optimal control strategy $\pi^*$ and to maximize long-term scheduling utility. In addition, frequent fluctuations of $T_r$ affect the stability of $\pi^*$. The goal is thus to obtain a sub-optimal control strategy within one scheduling slot (ST):

where $t \in \{0, 1, \dots, ST\}$ and $ST \le T_r$. Based on the continuity of the sub-optimal strategies over successive STs, the long-term optimal control strategy $\pi^*$ is obtained approximately.

In this embodiment, a detailed algorithm is designed to find the sub-optimal control strategy within a scheduling slot. Before the algorithm design, the PPO-PF optimization algorithm is introduced as the basic model. The control strategy is then parameterized by $\theta$, the parameters of the neural network, and $rd(s(t))$ is defined as the reward function, which gives the basic tuple. In addition, an actor-critic (AC) framework is introduced to accelerate policy training. In the actor-critic framework, $Q^c(\theta^c)$ is defined as the critic network and $Q^a(\theta^a)$ as the actor network. The loss functions of the actor network and the critic network are designed, respectively, as follows:

where:

$\pi_\theta$ and the old policy are the control strategies of different time periods. $\gamma$ is the discount coefficient and $\eta$ is the clipping factor. The hyperparameter $\varepsilon$ is used to update $c_t(\theta^a)$ through $\mathrm{clip}(c_t(\theta^a), 1-\varepsilon, 1+\varepsilon)$, which removes the incentive for $c_t(\theta^a)$ to move outside the interval $[1-\varepsilon, 1+\varepsilon]$. The operation of the AC framework with PPO-PF is shown in FIG. 4. First, the critic network calculates the discounted reward and the value function, and updates the critic network parameters $\theta^c$ according to its loss equation. Then, the critic network calculates the advantage function, and the actor network uses the old actor network and the new actor network ($\pi_\theta$) to calculate $c_t(\theta^a)$. Finally, the actor network parameters used for the policy update are updated according to the actor loss equation.
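The clipping step can be illustrated with the standard PPO clipped surrogate objective. The exact loss functions of the patent are not reproduced in the text above, so the sketch below uses the textbook PPO form as a stand-in; the function names are hypothetical, and $c_t(\theta^a)$ is assumed to be the probability ratio between the new and the old actor policy.

```python
import math


def probability_ratio(new_logp: float, old_logp: float) -> float:
    """c_t: ratio of new to old action probability, computed from log-probabilities."""
    return math.exp(new_logp - old_logp)


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_surrogate(ratio: float, advantage: float, eps: float) -> float:
    """Per-sample PPO objective: min(c*A, clip(c, 1-eps, 1+eps)*A)."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def actor_loss(ratios, advantages, eps: float = 0.2) -> float:
    """Negative mean clipped surrogate, so that minimizing it improves the policy."""
    terms = [clipped_surrogate(c, a, eps) for c, a in zip(ratios, advantages)]
    return -sum(terms) / len(terms)


def discounted_returns(rewards, gamma: float):
    """Discounted reward-to-go for each step, computed backwards in time."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]
```

Taking the minimum of the clipped and unclipped terms means a sample contributes no extra gain once its ratio leaves $[1-\varepsilon, 1+\varepsilon]$ in the direction favored by the advantage, which is the incentive-removal effect described above.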

From the above analysis, the mechanism provided by this embodiment (DMCT) comprises two parts: dynamic routing and active queue management. First, a local node active queue management algorithm (LAMA) is proposed to describe the internal transmission procedure of each agent, as shown in fig. 5. LAMA mainly performs local queue management, network state collection, and route control. LAMA first initializes all parameters. Each agent then performs active packet dropping, state collection, and route forwarding in succession. The agent updates its routing policy based on the policy packet. The agent calculates the queue depth and the queuing delay of all packets to analyze congestion and decide on active packet dropping. The agent discards packets whose cumulative delay exceeds the deadline. Then, when the node is congested, LAMA drops packets randomly. The agent detects the local state and inserts it into an INT probe packet for state collection. Finally, the agent performs route forwarding control based on the routing control action $a(t)$.
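The per-packet decision that LAMA makes can be sketched as below. This is a minimal illustration of the two drop rules described above (deadline expiry, then random drop under congestion), not the algorithm of fig. 5; the `Packet` fields, the `drop_probability` parameter and the returned labels are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    flow_id: int
    cumulative_delay: float  # delay accumulated along the path so far
    deadline: float          # deadline of the flow the packet belongs to


def lama_decide(packet: Packet, queue_depth: int, congestion_threshold: int,
                drop_probability: float = 0.5, rng=random) -> str:
    """Return the action for one packet at a node.

    1. Drop the packet if its cumulative delay already exceeds the deadline.
    2. If the queue is deeper than the congestion threshold, drop it at random
       (the drop probability is an assumed tunable, not given in the text).
    3. Otherwise forward it according to the current routing policy.
    """
    if packet.cumulative_delay > packet.deadline:
        return "drop_expired"
    if queue_depth > congestion_threshold and rng.random() < drop_probability:
        return "drop_congestion"
    return "forward"
```

Checking the deadline before the congestion test means that expired packets never consume buffer space or compete with packets that can still arrive in time.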

Then, in order to find the routing control policy $\pi^*$, a multi-node cooperative routing control algorithm (MCRA) is proposed, as shown in fig. 6. In MCRA, the routing policy is defined as $Q^s(\theta^s)$, which has the same neural network structure as the actor network $Q^a(\theta^a)$. Each node is regarded as an intelligent agent that follows LAMA and performs active packet dropping and priority-based flow scheduling according to $a(t)$. Specifically, MCRA comprises two parts: the local agent transmission control thread and the offline dynamic routing policy training process. The local agent transmission control thread invokes LAMA for active queue management, network state collection and route forwarding. The offline policy training process performs dynamic policy adjustment. All agents observe $s(t)$ and send these experiences to $P$ for formatted storage. The algorithm then randomly selects $m$ samples and performs a mini-batch policy gradient update. MCRA first calculates and stores the required variables according to the equations above, and then performs gradient optimization and updates the policy parameters by equations (46) and (47), respectively. After ST steps of policy adjustment, the scheduling policy is adjusted. Finally, the algorithm outputs the resulting control strategy. Regarding overhead, because of the AC framework the cost of MCRA consists of two parts. In the online process, the fitted model is just a polynomial, so the CPU overhead is negligible. In the offline dynamic routing policy training process, the policy requires ST iterations, and in each iteration MCRA selects $P$ experiences for updating the policy parameters. The overhead is therefore $O(ST \times P)$.
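The structure of the offline training loop, and why its cost scales as the number of iterations times the number of sampled experiences, can be sketched as follows. The function and parameter names are hypothetical, and `update` stands in for the gradient step of equations (46) and (47), which are not reproduced here.

```python
import random


def train_offline(pool, st_steps: int, batch_size: int, update, rng) -> int:
    """Run `st_steps` mini-batch updates over the experience pool.

    pool       -- list of stored experiences sent by the agents
    st_steps   -- number of policy iterations in one scheduling slot (ST)
    batch_size -- experiences sampled per iteration
    update     -- callback that applies one policy update to a batch
    rng        -- random.Random instance used for sampling

    Returns the total number of experiences processed, which grows as
    st_steps * batch_size.
    """
    processed = 0
    for _ in range(st_steps):
        batch = rng.sample(pool, min(batch_size, len(pool)))
        update(batch)
        processed += len(batch)
    return processed
```

A call such as `train_offline(pool, 3, 4, update, random.Random(0))` performs three updates on four experiences each, which makes the product form of the overhead explicit.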

Combining the above framework, model and algorithm design, the flow diagram of the method for reliably transmitting differentiated delay services in this embodiment is shown in fig. 7, and a specific implementation flowchart is shown in fig. 8. The functional block diagram of the system modules is shown in fig. 9.

Fig. 8 shows specific steps of a method for reliably transmitting differentiated delay services. The specific steps are as follows:

S1: the perception observer plans the perception path according to the data stream transmission path;

S2: the INT manager sends perception data packets along the perception path to probe the network;

S3: the cross-layer analyzer parses all perception packets, extracts the perception information and sends it to the data analyzer;

S4: the data analyzer formats and combines the perception information and sends it to the intelligent routing controller;

S5: the intelligent routing controller dynamically adjusts the data flows according to the network state and the remaining transmission delay of each data flow, and generates a new routing strategy. In addition, the intelligent routing controller invokes the policy update model (fig. 4) to update the policy;

S6: the cross-layer analyzer generates the corresponding policy packets according to the policy actions and sends them to each node;

finally, the node kernel scheduler (fig. 3) of each node updates the transmission route and performs data forwarding.

Fig. 9 is a functional block diagram of a system module. The specific implementation process is as follows:

S1: when a differentiated transmission service is issued, the path planning module performs initial path planning according to the transmission delay and the network link delay. The planned path is then sent to the perception module for perception path planning; in addition, the planned path is sent to the corresponding nodes to set the initial transmission path.

S2: the perception module senses the transmission state of each link in the path according to the planned path, and senses the cumulative delay of the data stream. The perceived data is sent to the offline policy optimization module and the online policy execution module.

S3: the offline policy optimization module optimizes the routing transmission policy by calling the model of fig. 4 with the perceived data and, after the specified number of policy iterations, sends the updated policy parameters to the online policy execution module for parameter updating.

S4: the online policy execution module adjusts the transmission policy according to the perceived data and issues the corresponding policy to the node transmission link control module.

S5: when a node forwards a specific data packet, the node queue management module first checks for congestion and data invalidation and performs active packet dropping (as shown in fig. 3 and fig. 5), and then performs transmission adjustment.
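Step S1 plans an initial path from the link delays and the delay requirement of the service. The text does not name a path algorithm, so the sketch below uses Dijkstra's shortest-path algorithm over link delays as an assumed stand-in, and rejects the path when its total delay exceeds the deadline. The graph format and function names are hypothetical.

```python
import heapq


def min_delay_path(links, src, dst):
    """Dijkstra over link delays.

    links -- {node: {neighbor: delay}}
    Returns (total_delay, path), or (inf, []) if dst is unreachable.
    """
    best = {src: 0.0}      # smallest known delay to each node
    prev = {}              # predecessor on the best known path
    heap = [(0.0, src)]
    while heap:
        delay, node = heapq.heappop(heap)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return delay, path[::-1]
        if delay > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, d in links.get(node, {}).items():
            cand = delay + d
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    return float("inf"), []


def plan_initial_path(links, src, dst, deadline):
    """Return the minimum-delay path if it meets the deadline, else None."""
    delay, path = min_delay_path(links, src, dst)
    return path if delay <= deadline else None
```

In the described system this initial path is only a starting point: the perception and policy modules of steps S2 to S4 adjust the route as link states change.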

In this embodiment, to make the specific implementation of the differentiated delay service reliable transmission guarantee mechanism clear, all functional implementation processes are described with reference to fig. 1. First, the embodiment of the invention requires a BMv2 virtual switch to be installed in the Linux operating system of the gateway device; protocol conversion, tunnel processing and packet forwarding are implemented by writing a P4 program. P4 (Programming Protocol-Independent Packet Processors) is a high-level programming language for the data plane in which the processing of packets encapsulated by any protocol can be implemented, and BMv2 is a software switch that supports P4 programming. In the present invention, the match-action tables involved in packet processing are designed in advance. The kernel logic modification of the forwarding device shown in fig. 3 is then implemented in the P4 language. The perception observer, the intelligent routing controller, the data analyzer and the cross-layer analyzer may be separate controllers, or may be placed in the same controller to execute their respective functions.

Subsequently, the forwarding execution process starts. S1: the perception observer plans the path according to the differentiated service and calls the INT sender to periodically send perception data packets along the set path (the packet structure is shown in fig. 2). Each forwarding device uses its internal API to obtain the perception attribute values and embeds the data into the corresponding field positions of the data packet according to the structure of fig. 2. When the data packet would exceed the maximum MTU, the forwarding device modifies the IP header of the data packet and sends the perception data packet to the cross-layer analyzer; otherwise, the data packet is forwarded to the cross-layer analyzer when it reaches the far end. In addition, during the S1 perception step, the forwarding device parses each data packet, calculates the cumulative transmission delay from the data carried by the packet (as in the process of fig. 3), puts the cumulative transmission delay into the perception packet as perception data, and sends the perception packet to the cross-layer analyzer. Step S2 then begins: the cross-layer analyzer parses and splits the data packet, extracts the important perception data and sends it to the data analyzer; the data analyzer combines the data into experience pairs for the machine learning model and sends them to the intelligent routing controller. The intelligent routing controller contains an offline policy optimization module and an online policy execution module. Step S3 is then executed: the perception data from every time step is sent to the offline policy optimization module for policy updating (see the execution processes of fig. 4 and fig. 6).
However, the offline policy optimization module sends the trained parameters to the online policy execution module only after the set number of iterations or the set iteration time. S4 is then executed: the online policy execution module dynamically adjusts the transmission route of the corresponding data stream based on the most recently received perception data, at its own separately configured policy update interval, generates a policy packet through the cross-layer analyzer, and sends it to the corresponding forwarding devices. Finally, S5 is executed: when a specific data packet is forwarded, perception, policy routing adjustment, congestion analysis and packet loss control are performed according to the packet type (see the procedure of fig. 5).

The invention relates to a differentiated delay service reliable transmission guarantee mechanism, mainly applied to scenarios in which multiple data streams with different deadline constraints are transmitted in parallel. In such scenarios, because the network state is highly dynamic, existing methods have difficulty sensing and acquiring link delay information effectively. In addition, because forwarding nodes behave as black boxes, the cumulative transmission delay of a data stream cannot be known during transmission, so data continues to be transmitted after it has become invalid, which wastes transmission resources and lowers the effective receiving rate. To address this problem, the invention designs a differentiated delay service reliable transmission guarantee mechanism. It first senses the network state using in-band network telemetry (In-band Network Telemetry, INT). It then performs dynamic end-to-end transmission path planning based on the network state using the ideas of proximal policy optimization and policy feedback (PPO-PF). Finally, it uses the ideas of active queue management (Active Queue Management, AQM) and P4 (Programming Protocol-independent Packet Processors) to improve the forwarding logic inside the node, so that random packet dropping under network congestion and active discarding of invalid data can both be performed. In the above scenario, the mechanism improves transmission efficiency, reduces the resource waste caused by invalid data and additional transmission delay waiting, and achieves a high receiving rate for data streams with deadline constraints.

The invention designs a deadline-constrained multi-agent cooperative transmission mechanism (DMCT) for transmitting deadline-constrained flows in a time-varying network. DMCT introduces an INT framework to achieve accurate network perception. Exploiting the stability of policy optimization, DMCT combines PPO-PF with the AC framework to optimize dynamic routing control, so that the routing policy is continuously updated according to the time-varying network state, long-term stable and accurate routing control is achieved, and transmission delay is reduced. In addition, DMCT applies the ideas of AQM and P4 to packet-granularity active drop control in order to reduce queuing delay. Together, these techniques reduce the extra end-to-end delay of deadline-constrained flows. The invention thus provides a reliable transmission guarantee mechanism for differentiated delay services: it uses a network-telemetry-based perception and transmission control algorithm for flexible network perception, a multi-node cooperative routing control algorithm for long-term stable and accurate routing control, and a local node active queue management algorithm for random packet dropping under network congestion and active discarding of invalid data. In scenarios where multiple data streams with deadline constraints are transmitted in parallel, the mechanism reduces transmission delay and queuing delay, reduces the resource waste caused by invalid data and additional transmission delay waiting, improves transmission efficiency, achieves a high receiving rate for deadline-constrained data streams, and shows stable flow scheduling performance under congestion.

To sum up:

(1) The invention provides a reliable transmission guarantee mechanism for differentiated delay services. By sensing the network state and the real-time cumulative transmission delay of each data stream, and by introducing a machine learning algorithm to generate the data stream routing control strategy, the transmission path of each data stream is dynamically adjusted according to its remaining deadline, so that as much data as possible reaches the far end within the corresponding deadline. With the designed local node active queue management algorithm (shown in fig. 5), each node can judge data timeliness and actively discards a data packet when its cumulative transmission time exceeds the deadline. In addition, the node judges congestion from its local queue, and actively discards data packets when the queue length exceeds a threshold. Remaining ordinary data packets are forwarded according to the routing strategy. With the designed multi-node cooperative routing control algorithm (shown in fig. 6), a neural network model is trained and optimized; it takes the network perception information as input, generates a dynamic routing strategy, and performs network-state-based dynamic routing adjustment for data streams with different deadlines.

(2) The invention designs a multi-node cooperative routing transmission framework (shown in figure 1), manages the whole process of network sensing, data processing, strategy adjustment and strategy issuing, and realizes the cyclic optimization of dynamic routing strategies.

(3) The invention designs P4-based active queue management logic that modifies the internal processing logic of the switch, calculates the cumulative forwarding delay of each data packet, and judges the timeliness of the data packet. When the cumulative forwarding delay exceeds the deadline, the data packet is judged invalid and is actively dropped. This process incorporates the designed local node active queue management algorithm (shown in fig. 3 and fig. 5).

(4) The invention designs an improved INT data packet structure and uses the INT framework to integrate sensing and control in the data layer; the details are shown in the INT data packet structure part of fig. 2. The data packet structure mainly comprises two parts: the INT packet header and the INT data. The attribute fields of the INT packet header are: device ID, execution flag, flow ID, action, and deadline. The execution flag takes the value "0" or "1". When the execution flag is "1", the flow ID and action fields are used together to issue the routing strategy, that is, to select the next link of the node, which realizes routing transmission control. When the execution flag is "0", the INT data part is used to store the corresponding perception information. The perception attributes are: device ID, port, link latency, link bandwidth, queue length, and cumulative latency. The specific definitions are shown in Table 2.
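The two roles of the packet described above can be sketched with plain data classes. The field names follow the attribute list in the text, but the types, the absence of bit widths (which would come from Table 2), and the helper `process_at_node` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IntRecord:
    """Perception attributes written by one forwarding device."""
    device_id: int
    port: int
    link_latency: float
    link_bandwidth: float
    queue_length: int
    cumulative_latency: float


@dataclass
class IntPacket:
    """INT packet header plus the list of collected perception records."""
    device_id: int
    exec_flag: int  # 0 = collect perception data, 1 = deliver a routing policy
    flow_id: int
    action: int     # next link to use for flow_id when exec_flag == 1
    deadline: float
    records: List[IntRecord] = field(default_factory=list)


def process_at_node(packet: IntPacket, local_state: IntRecord,
                    routing_table: Dict[int, int]) -> None:
    """Apply the packet at one node according to its execution flag."""
    if packet.exec_flag == 1:
        # Policy packet: install the next link for this flow.
        routing_table[packet.flow_id] = packet.action
    else:
        # Perception packet: append this node's local state.
        packet.records.append(local_state)
```

Using a single header with an execution flag lets the same packet format carry telemetry toward the controller and routing decisions back to the nodes.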

(5) The invention designs a differentiated time delay service reliable transmission system based on P4, which realizes each component of the framework of FIG. 1, and each component is equivalent to a module to execute corresponding functions. The design and function of each component is shown in detail in fig. 1.

While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it should be understood that various changes and modifications could be made by one skilled in the art without the need for inventive faculty, which would fall within the scope of the invention.

Claims

1. The differential service time delay guarantee transmission method is characterized by comprising the following steps of:

2. The differentiated service time delay guarantee transmission method of claim 1, wherein the total delay is calculated first, and when the total delay exceeds the deadline, the data packet is deemed invalid and discarded; the current queue depth is then compared with the congestion threshold, and if the current queue depth exceeds the congestion threshold, a random discarding operation is executed.

3. The differentiated service time delay guarantee transmission method according to claim 1, wherein the network is assumed to contain N nodes and is represented as an undirected network graph $G = \langle N, E \rangle$; $L$ is the set of all links; $l_{ij} \in \{0, 1\}$, where $l_{ij} = 1$ means that node $i$ and node $j$ are connected, and $l_{ij} = 0$ otherwise; the flow scheduling process is divided into discrete time slots $t \in \{0, 1, 2, \dots\}$; $\alpha_l(t)$ is the arrival rate of link $l$ at time slot $t$ over all flows; $R_l$ is the set of all flows through link $l$; a per-flow arrival rate is defined for flow $r$ in link $l$ at time slot $t$; $t_l(t)$ represents the transmission rate of link $l$ at time slot $t$; $d_l(t)$ is the discard rate of link $l$ over all flows, and a per-flow discard rate is defined for flow $r$ in link $l$ at time slot $t$; based on the network state $s(t)$, a flow scheduling strategy $\pi^*$ meeting the flow transmission requirements is found; $\pi^*$ outputs the transmission vector $t(t) = (t_1(t), \dots, t_l(t), \dots, t_L(t))$ and the discard vector $d(t) = (d_1(t), \dots, d_l(t), \dots, d_L(t))$ to direct the transmission and discard control actions at each time slot $t$, respectively.

4. The differentiated services time delay guarantee transmission method according to claim 3, wherein,

$$d_l(t) \le \max\{0,\ \alpha_l(t) - t_l(t)\}$$

where $t_l(t)$ and $d_l(t)$ are non-negative variables for each time slot $t$;

an estimated capacity is used in place of the exact one;

$f^{ap}$ is a function that estimates the transmission output capacity based on the link resources and the resources of the connected nodes; $C_i(t)$ and $C_j(t)$ represent the available resources of node $i$ and node $j$, respectively, at time slot $t$; a further term represents the real-time available bandwidth of link $l_{i,j}$; for any link $l$, the arrival rate $\alpha_l(t)$, $t_l(t)$ and $d_l(t)$ are interdependent:

the equation indicates that all flows arriving at link l will eventually be forwarded or dropped;

the flow scheduling optimization problem of determining throughput maximization is:

P1:

s.t.:

$$0 \le d_l(t) \le \max\{0,\ \alpha_l(t) - t_l(t)\}$$

$\pi^*$ and $\Lambda$ strongly hold

where $\omega_l$ is the weighting factor of link $l$, and $\Lambda$ represents the maximum network service capacity.

5. The differentiated services time delay guarantee transmission method according to claim 4, wherein:

according to $\pi^*$, in the scheduling process with $t_l(t) \le \alpha_l(t)$, flows that are neither forwarded nor dropped are queued for a short period of time; N queues are assumed in order to relax the dynamic scheduling process of the network; $q_l(t)$ is defined as the queue depth of link $l$ at time slot $t$; $q_l(t)$ is updated as:

$t_l(t)$, $d_l(t)$ and $\alpha_l(t)$ represent the transmission rate, discard rate and arrival rate of link $l$ at time slot $t$, respectively; under policy $\pi^*$ and the maximum network capacity $\Lambda$, the long-term queue depth $Q_l(t)$ of any link $l$ is constrained by:

this constraint means that policy $\pi^*$ has stable transmission control capability and reduces the probability of congestion and long queuing.

6. The differentiated services time delay guarantee transmission method according to claim 5, wherein: calculating the longest queuing delay of the first packet of link l to measure congestion, active queue management for each flow, defining a flow granularity delay queue for all packets:

where an indicator variable distinguishes two cases: either the head-of-line packet of flow $r$ keeps waiting in link $l$ from $t$ to $t+1$, or this packet is forwarded or dropped from link $l$;

$dl_{max2}$ is the queuing delay of the new head-of-line packet of flow $r$ in link $l$, and an auxiliary variable records the queuing delay of the new head-of-line packet at time slot $t$ when the old head-of-line packet is forwarded or discarded at time slot $t$.

7. The differentiated services time delay guarantee transmission method of claim 6, wherein the method comprises the following steps:

because network states are diverse and dynamic, drift-plus-penalty theory is introduced for joint optimization; $\Theta(t)$ is defined as the combination of the queue model and the delay model:

where $Q(t)$ is the queue vector and $H(t) = \{H_l(t);\ l \in L\}$ is the vector of head-of-line delay values;

all queues are assumed to be initially empty, so $\Theta(0) = 0$;

following Lyapunov drift optimization, $L(t)$ is defined to measure the stability of $\Theta(t)$:

the delay model of $H_l(t)$ is defined as follows:

$$H_l(t+1) = \beta_l(t)\,\bigl(H_l(t) + 1\bigr) + \bigl(1 - \beta_l(t)\bigr)\,\bigl(1_l(t) + 1\bigr)$$

where

$dl_{max}$ is the queuing delay of the new head-of-line packet in link $l$; $1_l(t) = dl_{max}$ indicates that link $l$ still has packets of flow $r$ at time slot $t+1$, and $1_l(t) = 0$ otherwise;

let $t_r$ be the starting slot of the stable queue backlog $\Theta(t)$; assuming that all queues are empty at the beginning gives $\Theta(0) = 0$; for a duration $T_r$, $\Delta(t)$ is defined as follows:

where $T_r$ is the cycle time of policy scheduling; minimizing $\Delta(t)$ yields a short queuing delay together with stable multi-flow scheduling.

8. The differentiated services time delay guarantee transmission method of claim 7, wherein the method comprises the following steps:

under the optimal control strategy $\pi^*$, the transmission control actions are obtained during the working time $\tau \in \{t_r, \dots, t_r + T_r - 1\}$; let $T_r$ follow a geometric distribution with probability $\phi$; then, given $\Theta(t_r)$, $\Delta(t)$ satisfies:

where $G(t_r)$ is defined as:

and $B$ is a finite constant defined as:

9. The differentiated services time delay guarantee transmission method of claim 8, wherein the method comprises the following steps:

because of network dynamics and noisy data, the available capacity of the network varies over time and $T_r$ fluctuates as well; the goal is therefore to obtain a sub-optimal control strategy within one scheduling slot (ST):

where $t \in \{0, 1, \dots, ST\}$ and $ST \le T_r$; based on the continuity of the sub-optimal strategies over successive STs, the long-term optimal control strategy $\pi^*$ is obtained approximately.

10. The differentiated services time delay guarantee transmission method of claim 9, wherein the method comprises the following steps:

the control strategy is parameterized by $\theta$, the parameters of the neural network; $rd(s(t))$ is defined as the reward function, which gives the basic tuple; $Q^c(\theta^c)$ is defined as the critic network and $Q^a(\theta^a)$ as the actor network; the loss functions of the actor network and the critic network are designed, respectively, as follows:

where:

$\pi_\theta$ and the old policy are the control strategies of different time periods; $\gamma$ is the discount coefficient; $\eta$ is the clipping factor; the hyperparameter $\varepsilon$ is used to update $c_t(\theta^a)$ through $\mathrm{clip}(c_t(\theta^a), 1-\varepsilon, 1+\varepsilon)$, which removes the incentive for $c_t(\theta^a)$ to move outside the interval $[1-\varepsilon, 1+\varepsilon]$.