CN112867083A - Delay tolerant network routing algorithm based on multi-agent reinforcement learning - Google Patents

Delay tolerant network routing algorithm based on multi-agent reinforcement learning

Info

Publication number
CN112867083A
Authority
CN
China
Prior art keywords
delay tolerant
tolerant network
community
algorithm
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011588326.2A
Other languages
Chinese (zh)
Inventor
姚海鹏 (Yao Haipeng)
韩晨晨 (Han Chenchen)
忻向军 (Xin Xiangjun)
张尼 (Zhang Ni)
童炉 (Tong Lu)
李韵聪 (Li Yuncong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet Gaochi Science And Technology Information Industry Group Co ltd
Beijing University of Posts and Telecommunications
Original Assignee
Tibet Gaochi Science And Technology Information Industry Group Co ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tibet Gaochi Science And Technology Information Industry Group Co ltd, Beijing University of Posts and Telecommunications filed Critical Tibet Gaochi Science And Technology Information Industry Group Co ltd
Priority to CN202011588326.2A
Publication of CN112867083A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 40/00 Communication routing or communication path finding
    • H04W 40/02 Communication route or path selection, e.g. power-based or shortest path routing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 Routing or path finding of packets in data switching networks
    • H04L 45/14 Routing performance; Theoretical aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 Routing or path finding of packets in data switching networks
    • H04L 45/20 Hop count for routing purposes, e.g. TTL
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 84/00 Network topologies
    • H04W 84/18 Self-organising networks, e.g. ad-hoc networks or sensor networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a delay tolerant network routing algorithm based on multi-agent reinforcement learning, characterized by comprising the following steps: first, Louvain clustering is applied to the delay tolerant network nodes, and a centralized-distributed hierarchical architecture is proposed; second, the DTN node's next-hop selection problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) by incorporating positive social characteristics; finally, the model is solved via the cooperative multi-agent reinforcement learning algorithm QMIX. Compared with existing delay tolerant network routing schemes based on social attributes, the technical scheme of this patent provides a hierarchical architecture that conveniently captures the social information of edge devices: routing decisions issued by the computing center are executed in a distributed manner, while the routing algorithm is trained centrally at the computing center from the states uploaded by the service units. The method exploits social characteristics more effectively for route forwarding in delay tolerant networks, improving the delivery rate and reducing the average delay.

Description

Delay tolerant network routing algorithm based on multi-agent reinforcement learning
Technical Field
The invention relates to the technical field of network routing algorithms, in particular to a delay tolerant network routing algorithm based on multi-agent reinforcement learning.
Background
A Delay Tolerant Network (DTN) is a wireless ad hoc network that adopts a store-carry-forward routing strategy in network environments where no end-to-end path exists a priori. Compared with traditional wireless networks, DTNs are more flexible and better suited to environments with high latency and frequently disrupted links.
Currently, many routing protocols have been proposed for delay tolerant networks, most of which form forwarding policies by comparing per-node metrics. However, message delivery efficiency remains poor due to link unreliability. Social-based approaches are more promising than opportunity-based routing protocols, because social attributes are more stable than mobility when predicting and handling routes in DTNs. However, these algorithms transfer large volumes of messages to nodes with high social indicators without restriction, which inflates the queue length of node buffers and degrades overall routing performance.
Disclosure of Invention
To address the poor delivery rate of conventional routing algorithms in delay tolerant networks, the invention provides a delay tolerant network routing algorithm based on multi-agent reinforcement learning: first, Louvain clustering is applied to the DTN nodes and a centralized-distributed hierarchical architecture is proposed; subsequently, the DTN node's next-hop selection problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) by incorporating positive social characteristics; finally, the model is solved via the cooperative multi-agent reinforcement learning algorithm QMIX.
A delay tolerant network routing algorithm based on multi-agent reinforcement learning, characterized by comprising the following steps:
firstly, Louvain clustering is applied to the delay tolerant network nodes, and a centralized-distributed hierarchical architecture is proposed;
secondly, the DTN node's next-hop selection problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) model by incorporating positive social characteristics;
and thirdly, the model is solved via the cooperative multi-agent reinforcement learning algorithm QMIX.
Preferably, the algorithm introduces two different centrality indices at the community level while accounting for a time-based discount factor, where w_local denotes community C_i's local centrality index:

[local centrality equation, shown as an image in the original]

where M is the total number of connections between node s and node d, φ (0 < φ < 1) is the discount coefficient, and t is the time index:

[time index equation, shown as an image in the original]

where T_now is the current simulation time and T_interval is a tunable time-slice parameter. Similarly, community C_i's global centrality index can be obtained:

[global centrality equation, shown as an image in the original]
preferably, the maximum buffer occupancy calculation formula in the community Ci is as follows:
Figure RE-GDA0003007077110000025
thus, for each agent (community), the optimization goal is:
maxα·ΔwC-β·Lbuf
we consider the DTN routing strategy based on social attribute clustering as a fully collaborative multi-agent task scenario, which can be described by Dec-POMDP. The Dec-POMDP can be represented by a primitive, N represents the number of agents, A is the action space, is the observation space, and is the discount factor.
Preferably, the QMIX algorithm uses a mixing network to combine the local value functions of individual agents possessing positive social features (such as community and centrality), extracting decentralized policies consistent with the centralized policy.
Compared with the prior art, and in particular with existing delay tolerant network routing schemes based on social attributes, the technical scheme of this patent provides a hierarchical architecture that conveniently captures the social information of edge devices: routing decisions issued by the computing center are executed in a distributed manner, while the routing algorithm is trained centrally at the computing center from the states uploaded by the service units. Meanwhile, the reward function accounts for each community's bottleneck buffer occupancy at a fine granularity, so social characteristics can be exploited more effectively for route forwarding in the delay tolerant network, improving the delivery rate and reducing the average delay.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a diagram of the centralized-distributed hierarchical DTN architecture of the multi-agent reinforcement learning based delay tolerant network routing algorithm according to the present invention;
FIG. 2 is a diagram of a QMIX algorithm of cooperative multi-agent reinforcement learning according to the present invention;
FIG. 3 is a flow chart of the cooperative MARL routing algorithm of the present invention;
FIG. 4 is a graph of the average reward of MARL and DQN in the present invention;
FIG. 5 is a graph of delivery rates of different routing algorithms under the INFOCOM05 in the present invention;
fig. 6 is a graph of average packet delay for different routing algorithms under INFOCOM05 in the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
It should be noted that in the description of the present invention, the terms of direction or positional relationship indicated by the terms "upper", "lower", "left", "right", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, which are only for convenience of description, and do not indicate or imply that the device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Firstly, Louvain clustering is applied to the delay tolerant network nodes, and a centralized-distributed hierarchical architecture is proposed; subsequently, the DTN node's next-hop selection problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) model by incorporating positive social characteristics. Finally, the model is solved via the cooperative multi-agent reinforcement learning algorithm QMIX.
Referring to fig. 1, the delay tolerant network is first modeled as a directed graph whose edges are weighted by the total connection time between two nodes. Community clustering is then performed with a round-by-round heuristic modularity-optimization algorithm. The service units are user-side units: each integrates and analyzes the social information of its own community and uploads these attributes to a computing center, which performs centralized training and issues routing policies, as sketched below.
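As an illustration of this clustering step, the sketch below builds a small contact graph weighted by total connection time and runs Louvain community detection with networkx (version 3.x provides louvain_communities). The contact trace and node names are invented for the example and are not data from the patent; for brevity an undirected graph is used here, whereas the patent models the DTN as a directed graph.

```python
# A minimal sketch of the community-clustering step, assuming networkx >= 3.0.
import networkx as nx

# Illustrative contact trace: (node, node, total connection time in seconds).
contacts = [
    ("a", "b", 1200.0),
    ("b", "c", 300.0),
    ("a", "c", 950.0),
    ("d", "e", 700.0),
]

G = nx.Graph()
for s, d, total_time in contacts:
    # Accumulate connection time if this pair has met before.
    w = G.get_edge_data(s, d, default={"weight": 0.0})["weight"]
    G.add_edge(s, d, weight=w + total_time)

# Louvain clustering: round-by-round heuristic modularity optimization.
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
for i, c in enumerate(communities):
    print(f"community C{i}: {sorted(c)}")
```

Each resulting community then acts as one agent in the hierarchy, with its service unit reporting community-level state to the computing center.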
Referring to fig. 2, a diagram of the cooperative multi-agent reinforcement learning algorithm QMIX is shown; it adopts a centralized-training, distributed-execution framework for learning routing decisions. The social attribute definitions, problem description, QMIX algorithm, and overall flow introduced by the invention are presented next.
First, social attribute definition
The invention introduces two different centrality indices at the community level and incorporates a discount factor over time. Community C_i's local centrality index is:

[local centrality equation, shown as an image in the original]

where M is the total number of connections between node s and node d, φ (0 < φ < 1) is the discount coefficient, and t is the time index:

[time index equation, shown as an image in the original]

where T_now is the current simulation time and T_interval is a tunable time-slice parameter. Similarly, community C_i's global centrality index can be obtained:

[global centrality equation, shown as an image in the original]
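The exact formulas above are images in the original document, so the following sketch implements only one plausible reading of the surrounding text: each of the M contacts between nodes s and d contributes φ^t, where t counts how many T_interval slices have elapsed since the contact. All parameter values and function names are illustrative assumptions, not taken from the patent.

```python
# A hedged sketch of a time-discounted centrality score (one plausible
# reading of the patent's image-only formulas).
def time_index(t_now: float, t_contact: float, t_interval: float) -> int:
    """Time index t: number of T_interval slices elapsed since the contact."""
    return int((t_now - t_contact) // t_interval)

def discounted_centrality(contact_times, t_now, t_interval, phi=0.9):
    """Sum phi**t over all M contacts between a node pair (0 < phi < 1),
    so recent contacts contribute more than old ones."""
    assert 0.0 < phi < 1.0
    return sum(phi ** time_index(t_now, tc, t_interval) for tc in contact_times)

# Example: four contacts between node s and node d, with T_now = 10000 s.
contacts_sd = [9800.0, 9100.0, 5000.0, 400.0]
print(discounted_centrality(contacts_sd, t_now=10000.0, t_interval=1000.0))
```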
second, description of the problem
If a node transmits a message toward a more popular community, the data is more likely to reach its destination. However, unrestricted transmission can overflow node buffers, a major cause of reduced delivery rate, so we must trade popularity off against link congestion. Specifically, with the proposed DTN hierarchy we can detect and control buffer occupancy at fine granularity at every time step, rather than passively accepting congestion information. Considering the bottleneck effect, the maximum buffer occupancy within community C_i is calculated as follows:
[maximum buffer occupancy equation, shown as an image in the original]
thus, for each agent (community), the optimization goal is:
max α·Δw_C − β·L_buf
we consider the DTN routing strategy based on social attribute clustering as a fully collaborative multi-agent task scenario, which can be described by Dec-POMDP. The Dec-POMDP can be represented by a primitive, N represents the number of agents, A is the action space, is the observation space, and is the discount factor. Specifically, the observation space of each agent refers to which agents have the agent as a relay in a time slice, and the occupancy rate of the bottleneck buffer area: (ii) a
The action space represents which agent the agent is transmitted to before the next action arrives; the reward is designed as follows:
[reward function, Equation (7) in the original, shown as an image]
if the package is transmitted within a community, and it is transmitted to other communities or communities less central than the original community, the reward is obviously negative. Instead, this action facilitates packet transmission. We also consider buffer occupancy as an important factor affecting message transmission.
Three, QMIX algorithm
The QMIX algorithm is a cooperative multi-agent reinforcement learning algorithm that uses a mixing network to combine the local value functions of individual agents, improving overall performance. Nodes with positive social characteristics (such as community and centrality) can thereby extract decentralized policies consistent with the centralized policy. We only need to ensure that a global argmax on Q_tot yields the same result as a set of individual argmax operations on each Q_a:
argmax_u Q_tot(τ, u) = ( argmax_{u^1} Q_1(τ^1, u^1), …, argmax_{u^N} Q_N(τ^N, u^N) )

∂Q_tot / ∂Q_a ≥ 0, ∀a ∈ {1, …, N}
the above two equations indicate that monotonicity can be achieved by constraining the relationship between Qtot and each Qa. For each Qa, agent a may perform a distributed greedy operation. It is very easy to calculate argmaxaQtot. Also, the policy of each agent can be extracted explicitly. To implement the above constraints, the hybrid network takes agent network outputs as inputs and mixes monotonically, producing Qtot values.
The final loss function of QMIX is as follows:

[QMIX loss equation, shown as an image in the original; it is the standard squared TD error between Q_tot and the target y^tot = r + γ·max_{u'} Q_tot(τ', u', s'; θ⁻), summed over a minibatch]
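For concreteness, the following PyTorch sketch shows the standard QMIX mixing network (per Rashid et al., 2018), to which the description above corresponds: hypernetworks conditioned on the global state generate the mixing weights, and taking their absolute value enforces ∂Q_tot/∂Q_a ≥ 0. Layer sizes and example dimensions are illustrative, not taken from the patent.

```python
# A minimal sketch of QMIX's monotonic mixing network, assuming the standard
# construction: state-conditioned hypernetworks emit non-negative mixing weights.
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: map global state -> mixing weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        # abs() keeps all mixing weights non-negative -> Q_tot monotone in each Q_a.
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, -1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)  # Q_tot: (batch,)

# Example: 4 agents (communities), a 16-dim global state, batch of 8.
mixer = QMixer(n_agents=4, state_dim=16)
q_tot = mixer(torch.randn(8, 4), torch.randn(8, 16))
print(q_tot.shape)  # torch.Size([8])
```

Because every mixing weight is non-negative, per-agent greedy action selection at execution time recovers the same joint action as a global argmax on Q_tot, which is exactly the consistency condition stated above.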
please refer to FIG. 3 for the overall flowchart.
Referring to fig. 4, we ran experiments on the real dataset INFOCOM05, which contains 41 DTN nodes with a simulation time of 275000 s. The network card transmission rate is 250 Kbps, the node buffer size is 5 MB, packet arrival times are randomly distributed in (300 s, 400 s), and individual packet sizes are randomly distributed in (0.8 KB, 1 KB).
We first compared the QMIX algorithm with Deep Q-Network (DQN), setting the discount coefficient to 0.95 and using a 3-layer neural network with 64 neurons per layer. On INFOCOM05:
our cooperative MARL routing algorithm has a higher return and more stable performance than the DQN algorithm. The lack of cooperation between agents for distributed training and distributed execution of DQN results in higher rewards for individual agents and less rewards for other agents. In a specific routing scenario where we choose the next relay node, one agent (community) sends its messages to other agents with high centrality without restriction. To increase the second portion of the reward of equation 7, i.e., reduce buffer occupancy, a node will be reluctant to receive messages from other agents. Therefore, the training curve is unstable, and the training process is not as good as our cooperative multi-agent strong chemistry habit. By adopting the cooperative multi-agent reinforcement learning QMIX, good performance can be obtained due to the consideration of the cooperation among the multi-agents.
To evaluate the routing algorithm's effectiveness, we compare the proposed algorithm with the classical algorithms BubbleRap and direct delivery (DirectDelivery). BubbleRap forwards messages greedily: each message seeks the node with the highest global centrality as its relay until it reaches a node belonging to the same community as the destination; from then on, forwarding is determined by local centrality ranking. Because this forwarding policy resembles a rising bubble, it is called the BubbleRap routing algorithm. In DirectDelivery routing, each node carries its own messages and keeps moving until it meets the destination node, meaning no other node ever serves as a relay during the whole communication process. Taking packet lifetime as the independent variable, we simulate and analyze the delivery rates and average delays of the different routing algorithms:
referring to fig. 5, the delivery rate of the cooperative MARL routing algorithm is higher than that of the BubbleRap algorithm, which selects relay nodes according to the rank of "popular" nodes, resulting in buffer overload at high-center nodes. As the time-to-live (TTL) of a packet increases, it may reduce the instances where the packet is deleted in the source node buffer before forwarding. There will be more opportunities for messages to be transmitted between nodes. However, it can cause congestion of the node links, making the increased delivery rate less noticeable. In our proposed algorithm, agent rewards are designed taking into account not only positive social attributes but also the occupancy of the buffer. By proper weight parameters alpha and beta, the advantage of the forwarding capability of the 'popular' community can be fully utilized, and the problem of buffer overflow caused by unlimited relays is avoided.
Referring to fig. 6, the average delay of the three routing algorithms is depicted with the message TTL as the independent variable. DirectDelivery has the highest average delay because messages can only passively wait to encounter their destination node. BubbleRap, meanwhile, spends considerable time selecting the best relay among dozens of nodes, whether forwarding inside or outside a community. In contrast, our proposed routing algorithm uses an improved community detection algorithm, reducing routing over tens of nodes to forwarding over a small, countable set of communities. The average delay is therefore lower than that of BubbleRap.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A delay tolerant network routing algorithm based on multi-agent reinforcement learning, characterized by comprising the following steps:
firstly, Louvain clustering is applied to the delay tolerant network nodes, and a centralized-distributed hierarchical architecture is proposed;
secondly, the DTN node's next-hop selection problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) model by incorporating positive social characteristics;
and thirdly, the model is solved via the cooperative multi-agent reinforcement learning algorithm QMIX.
2. The multi-agent reinforcement learning-based delay tolerant network routing algorithm as claimed in claim 1, wherein the algorithm introduces two different centrality indices at the community level while considering a discount factor over time, w_local representing community C_i's local centrality index:

[local centrality equation, shown as an image in the original]

where M is the total number of connections between node s and node d, φ (0 < φ < 1) is the discount coefficient, and t is the time index:

[time index equation, shown as an image in the original]

where T_now is the current simulation time and T_interval is a tunable time-slice parameter. Similarly, community C_i's global centrality index can be obtained:

[global centrality equation, shown as an image in the original]
3. The multi-agent reinforcement learning-based delay tolerant network routing algorithm according to claim 2, wherein the maximum buffer occupancy within community C_i is calculated as follows:

[maximum buffer occupancy equation, shown as an image in the original]

Thus, for each agent (community), the optimization goal is:

max α·Δw_C − β·L_buf

We treat the DTN routing strategy based on social-attribute clustering as a fully cooperative multi-agent task scenario, which can be described by a Dec-POMDP, represented as a tuple in which N is the number of agents, A is the action space, O is the observation space, and γ is the discount factor.
4. The multi-agent reinforcement learning-based delay tolerant network routing algorithm of claim 1, wherein the QMIX algorithm employs a mixing network to combine the local value functions of individual agents possessing positive social features (such as community and centrality), so as to extract decentralized policies consistent with the centralized policy.
CN202011588326.2A 2020-12-29 2020-12-29 Delay tolerant network routing algorithm based on multi-agent reinforcement learning Pending CN112867083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011588326.2A CN112867083A (en) 2020-12-29 2020-12-29 Delay tolerant network routing algorithm based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011588326.2A CN112867083A (en) 2020-12-29 2020-12-29 Delay tolerant network routing algorithm based on multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN112867083A true CN112867083A (en) 2021-05-28

Family

ID=75998022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011588326.2A Pending CN112867083A (en) 2020-12-29 2020-12-29 Delay tolerant network routing algorithm based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN112867083A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150126192A1 (en) * 2013-11-07 2015-05-07 Transpacific Ip Management Group Ltd. Cell selection or handover in wireless networks
CN111431804A (en) * 2020-03-12 2020-07-17 中国人民解放军陆军炮兵防空兵学院 DTN routing algorithm based on behavior prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG BINBIN et al.: "2020 2nd International Conference on Advanced Computer Control", 29 March 2010 *
孙彧 (Sun Yu) et al.: "A Survey of Multi-Agent Deep Reinforcement Learning" (多智能体深度强化学习研究综述), Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113645589B (en) * 2021-07-09 2024-05-17 北京邮电大学 Unmanned aerial vehicle cluster route calculation method based on inverse fact policy gradient
CN114244767A (en) * 2021-11-01 2022-03-25 北京邮电大学 Load balancing-based link minimum end-to-end delay routing algorithm
CN114244767B (en) * 2021-11-01 2023-09-26 北京邮电大学 Link minimum end-to-end delay routing algorithm based on load balancing
CN118433112A (en) * 2024-07-04 2024-08-02 北京邮电大学 Heterogeneous network dynamic load balancing method and device with interrupt delay tolerance
CN118433112B (en) * 2024-07-04 2024-09-24 北京邮电大学 Heterogeneous network dynamic load balancing method and device with interrupt delay tolerance

Similar Documents

Publication Publication Date Title
Tang et al. Survey on machine learning for intelligent end-to-end communication toward 6G: From network access, routing to traffic control and streaming adaption
Kaur et al. Edge computing in the industrial internet of things environment: Software-defined-networks-based edge-cloud interplay
Kout et al. AODVCS, a new bio-inspired routing protocol based on cuckoo search algorithm for mobile ad hoc networks
Lindgren et al. Probabilistic routing in intermittently connected networks
Grundy et al. Promoting congestion control in opportunistic networks
Hossain et al. Multi-objective Harris hawks optimization algorithm based 2-Hop routing algorithm for CR-VANET
Li et al. QGrid: Q-learning based routing protocol for vehicular ad hoc networks
Han et al. QMIX aided routing in social-based delay-tolerant networks
Ding et al. Intelligent data transportation in smart cities: A spectrum-aware approach
CN112867083A (en) Delay tolerant network routing algorithm based on multi-agent reinforcement learning
Boldrini et al. Modelling social-aware forwarding in opportunistic networks
Zhao et al. An improved ant colony optimization for communication network routing problem
CN105228215A (en) Based on many copies method for routing of decision tree mechanism in vehicular ad hoc network
Vafaei et al. QoS-aware multi-path video streaming for urban VANETs using ACO algorithm
Pirzadi et al. A novel routing method in hybrid DTN–MANET networks in the critical situations
Kaddoura et al. SDODV: A smart and adaptive on-demand distance vector routing protocol for MANETs
Zhang et al. V2V routing in VANET based on fuzzy logic and reinforcement learning
CN117041132A (en) Distributed load balancing satellite routing method based on deep reinforcement learning
CN107995114A (en) Delay Tolerant Network method for routing based on Density Clustering
CN110417572B (en) Method for predicting message transfer node based on target node meeting probability
CN107483560A (en) It is a kind of towards the multimode group-net communication of shared electricity consumption and system of selection
Nigam et al. Bonding based technique for message forwarding in social opportunistic network
CN109769283B (en) Family-based hierarchical opportunistic routing method based on genetic relationship under MSN (multiple spanning tree)
Quy et al. An adaptive on-demand routing protocol with QoS support for urban-MANETs
Toorchi et al. Deep reinforcement learning enhanced skeleton based pipe routing for high-throughput transmission in flying ad-hoc networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210528