CN113660710A - Routing method of mobile ad hoc network based on reinforcement learning - Google Patents

Routing method of mobile ad hoc network based on reinforcement learning

Info

Publication number
CN113660710A
Authority
CN
China
Prior art keywords: node, value, nodes, neighbor, network
Prior art date
Legal status
Granted
Application number
CN202110756598.7A
Other languages
Chinese (zh)
Other versions
CN113660710B (en)
Inventor
Wang Yinghe (王英赫)
Current Assignee
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date
Filing date
Publication date
Application filed by Shanghai Dianji University
Priority to CN202110756598.7A
Publication of CN113660710A
Application granted
Publication of CN113660710B
Status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W40/00: Communication routing or communication path finding
    • H04W40/02: Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/04: Communication route or path selection based on wireless node resources
    • H04W40/10: Communication route or path selection based on available power or energy
    • H04W40/12: Communication route or path selection based on transmission quality or channel quality
    • H04W84/00: Network topologies
    • H04W84/18: Self-organising networks, e.g. ad-hoc networks or sensor networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a reinforcement learning-based mobile ad hoc network routing method that overcomes two shortcomings of existing routing protocols: they are not suited to networks with non-uniformly distributed nodes, and they cannot adequately measure the relationship between nodes and the network. A complex-network method is used as the basis for generating the Q-value table under the reinforcement learning framework, providing a standard for the preliminary evaluation of node quality. The method can effectively establish the network topology, reduce the cost of maintaining the network structure, and exploit the characteristics of a non-uniformly distributed network to achieve efficient data transmission.

Description

Routing method of mobile ad hoc network based on reinforcement learning
Technical Field
The invention relates to a wireless communication technology, in particular to a mobile ad hoc network routing method based on reinforcement learning.
Background
A mobile ad hoc network is a multi-hop wireless communication network formed by self-organizing mobile nodes that participate in data transmission without the management of central nodes such as base stations. This network form is decentralized, flexible to deploy, simple to configure, and highly survivable. In the development of mobile ad hoc networks, techniques combining network topology control with transmission routing strategies have attracted great interest. According to the scope of the routing information involved, routing protocols can be grouped into local-information routing, global-information routing, and mixed-information routing. Local-information routing includes the random-walk routing strategy, the maximum-degree routing strategy, the local-betweenness routing strategy, the preferential routing strategy, and so on. Of particular interest is the preferential routing strategy with a tunable parameter: it introduces an order parameter describing the position of the network phase-transition point in order to measure the critical point of network congestion. Global-information routing includes the shortest-path routing strategy, the efficient-path routing strategy, and the optimized random-walk routing strategy, and it focuses more on the overall transmission capability of the network. Besides local and global routing protocols there is also mixed-information routing, which blends various factors present in the network as the basis for data-transfer decisions.
The routing protocols in the studies above have two shortcomings. First, the networks to which each protocol applies are essentially built on topologies with uniformly distributed nodes; the characteristics of networks with non-uniformly distributed nodes are not considered, so these protocols do not apply to such networks. Second, most routing protocols focus on a single objective, that is, the reward strategy is built around one target, so the relationship between nodes and the network cannot be measured well, leaving room for improvement.
Disclosure of Invention
The object of the invention is to provide a reinforcement learning-based mobile ad hoc network routing method that can effectively establish the network topology, reduce the cost of maintaining the network structure, and exploit the characteristics of a non-uniformly distributed network to achieve efficient data transmission.
The technical purpose of the invention is achieved by the following technical scheme:
A routing method of a mobile ad hoc network based on reinforcement learning comprises the following steps:
S1, calculating the remaining-energy percentage of the opposite node to determine its forwarding willingness, and calculating the Hello packet delivery rate between this node and the opposite node to determine the link quality between them;
S2, determining the neighbor nodes through probabilistic connection according to the residual-energy factor and the Hello packet delivery-rate factor, completing construction of the network topology;
S3, calculating the instantaneous reward value R_s(i) from the residual-energy factor and the Hello packet delivery-rate factor to evaluate the quality of the neighbor nodes, and iterating the update periodically to obtain the Q values of all nodes within the coverage range;
S4, when a node needs to send data, calculating the forwarding reward value R_s(d, i) from the mean betweenness of the nodes on the shortest path from the node to the destination node;
S5, calculating a selection factor Q_s(d, i) from the current node's evaluation value Q_s(i) of each neighbor node and the forwarding reward value R_s(d, i), sorting the neighbor nodes by Q_s(d, i), and selecting the node with the largest Q_s(d, i) as the next-hop node for data transmission.
In conclusion, the invention has the following beneficial effects:
the routing strategy is divided into two stages, wherein the first stage is a network structure establishment stage based on a complex network, and the second stage is a routing stage based on reinforcement learning. In the stage of establishing the network structure, the invention takes a complex network correlation method as a generation basis of a Q value table under a reinforcement learning frame, and provides a standard for preliminary evaluation of the node quality. In the second stage of routing selection, the routing strategy adopts the node betweenness on the whole path as the calculation basis of routing reward, and fully expresses the requirement of the shortest path in the non-uniform network. And integrating the two stages to form a routing strategy based on network topology control, wherein the strategy can effectively reduce the time delay and the congestion probability of the network, improve the survival time of the nodes and further improve the routing capability.
Drawings
FIG. 1 is a schematic flow diagram of the process.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
According to one or more embodiments, a mobile ad hoc network routing method based on reinforcement learning is disclosed, which comprises the following steps.
The nodes maintain and update their Q-value tables periodically: each node regularly broadcasts Hello messages and receives the response messages fed back by other nodes within its coverage range.
S1, calculating the remaining-energy percentage of the opposite node to determine its forwarding willingness, and calculating the Hello packet delivery rate between this node and the opposite node to determine the link quality between them;
S2, determining the neighbor nodes through probabilistic connection according to the residual-energy factor and the Hello packet delivery-rate factor, completing construction of the network topology;
S3, calculating the instantaneous reward value R_s(i) from the residual-energy factor and the Hello packet delivery-rate factor to evaluate the quality of the neighbor nodes, and iterating the update periodically to obtain the Q values of all nodes within the coverage range;
S4, when a node needs to send data, calculating the forwarding reward value R_s(d, i) from the mean betweenness of the nodes on the shortest path from the node to the destination node;
S5, calculating a selection factor Q_s(d, i) from the current node's evaluation value Q_s(i) of each neighbor node and the forwarding reward value R_s(d, i), sorting the neighbor nodes by Q_s(d, i), and selecting the node with the largest Q_s(d, i) as the next-hop node for data transmission.
A mobile ad hoc network with non-uniformly distributed nodes is one in which the nodes are not distributed at random over the network scene, so that different areas have different node densities. This non-uniform topology affects the applicability of mobile ad hoc routing strategies.
The network node refers to a mobile terminal participating in data transmission of the mobile ad hoc network. Connected edges (simply "edges") refer to relationships between network nodes. The edges determine the topology of the network.
The neighbors of a node are the set of all nodes that share a connecting edge with it. In the mobile ad hoc network considered in the present invention, the other nodes within a node's coverage range are not necessarily all its neighbors.
The betweenness of a node x is the number of shortest paths in the network on which node x lies. Nodes with large betweenness do not necessarily have a large degree, nor do they necessarily occupy a central position in the network topology; nevertheless, network betweenness can generally characterize the degree of centralization of a network.
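As a minimal illustration of node betweenness (not part of the patent; the toy topology and the use of the networkx library are assumptions made here):

```python
import networkx as nx

# A small toy topology purely for illustration.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")])

# For each node x, betweenness_centrality gives the fraction of all-pairs
# shortest paths that pass through x (normalized to [0, 1]).
betweenness = nx.betweenness_centrality(G, normalized=True)
print(betweenness)  # "b" lies on the most shortest paths, so it has the largest value
```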
Routing strategy based on a non-uniformly distributed network: the strategy comprises two parts, namely (1) network topology establishment and node evaluation, which is responsible for generating neighbor relations according to node willingness and link quality and completing the quality evaluation of the neighbor nodes; and (2) a data-forwarding selection process, which selects the next-hop node during data forwarding according to the betweenness characteristics of the network.
The routing strategy is divided into two stages, wherein the first stage is a network structure establishment stage based on a complex network, and the second stage is a routing stage based on reinforcement learning.
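Purely as an illustration of these two stages (not the patented implementation; the function names maintenance_round, forwarding_round, connect and reward_fn are assumptions introduced here), the per-node procedure of steps S1-S5 could be organized roughly as follows:

```python
def maintenance_round(q_table, replies, connect, alpha=0.5, eta=0.5, gamma=0.8):
    """Phase 1 (S1-S3): process Hello replies, rebuild the neighbor set, update Q_s(i).

    q_table -- dict mapping neighbor id -> Q_s(i) for the current node s
    replies -- dict mapping candidate id -> (remaining_energy_ratio, delivery_rate, best_q),
               where best_q is the largest Q value reported in that candidate's own table
    connect -- callable deciding probabilistic neighbor admission (step S2)
    """
    for i, (energy, delivery, best_next_q) in replies.items():
        if not connect(energy, delivery):                        # S2: probabilistic connection
            q_table.pop(i, None)                                 # drop nodes no longer kept as neighbors
            continue
        reward = (energy ** alpha) * (delivery ** (1 - alpha))   # S3: instantaneous reward R_s(i)
        old = q_table.get(i, 0.0)                                # a new neighbor starts from Q_s(i) = 0
        q_table[i] = (1 - eta) * old + eta * (reward + gamma * best_next_q)


def forwarding_round(q_table, destination, reward_fn):
    """Phase 2 (S4-S5): pick the neighbor maximizing Q_s(d, i) = Q_s(i) + R_s(d, i)."""
    return max(q_table, key=lambda i: q_table[i] + reward_fn(i, destination))
```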
Establishing a network structure:
In a large-scale ad hoc network with numerous nodes, if a node i establishes links with every node in its coverage area as a neighbor, the node is burdened and much unnecessary signaling is transmitted in the network, adding load to its operation. Therefore, during construction of the network topology, the establishment of node links is restricted, and only nodes that can express willingness to the network are selected to build neighbor relations.
In the routing strategy, a network topology is determined according to node residual energy and a Hello packet receiving ratio.
1) Node residual energy calculation
The node residual energy directly indicates how long the node can survive in the network. Residual energy is generally considered to influence a node's forwarding willingness: when much energy remains, the node is willing to participate in data forwarding, and when little remains, the node refuses unnecessary forwarding in order to prolong its own lifetime. The amount of residual energy therefore reflects the node's forwarding willingness and becomes a factor in establishing neighbor relations.
g(E) is an arbitrary monotonically increasing function of the node residual energy, usually taken as g(E) = E^τ with E ≠ 0. It represents the role of the node residual energy E in selecting the next-hop node, and this role varies somewhat with the form of g(E). In this model, τ = 1.
2) Hello packet delivery rate (reception ratio) between nodes
Besides taking the node residual energy into consideration as the node's forwarding willingness, the characteristics of the link between nodes are also considered: the strategy uses the Hello packet delivery rate (reception ratio) as a reference factor for the link quality between nodes. The Hello packet delivery rate is defined as the ratio of the Hello packets received by a node i within the coverage range to the Hello packets transmitted by this node. This value measures the transmission quality of the link between the nodes well and helps ensure the stability of data forwarding. The Hello packet delivery rate is calculated with the following formula:
H(i) = λ · h_r(i) / h_t(i)
where H(i) is the delivery rate between this node and node i within the coverage range, h_t(i) is the number of Hello packets sent by this node, and h_r(i) is the number of Hello packets received by node i. λ ∈ [0, 1] is an adjusting parameter indicating the importance of the delivery rate. Since too few Hello packets cannot reliably determine the link quality, the strategy stipulates that the delivery rate is 0 when h_t(i) < 20.
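As a sketch only (the function name is invented, and treating λ as a direct weight on the raw ratio is one plausible reading of the formula above), the delivery-rate factor with the h_t(i) < 20 cut-off could be computed as:

```python
def hello_delivery_rate(h_sent: int, h_received: int, lam: float = 1.0) -> float:
    """Hello packet delivery rate H(i) between this node and node i.

    h_sent     -- h_t(i), Hello packets broadcast by this node
    h_received -- h_r(i), Hello packets from this node received by node i
    lam        -- adjusting parameter in [0, 1] weighting the importance of the rate
                  (assumed here to scale the raw ratio; the patent's exact form is unclear)
    """
    # Too few Hello packets cannot characterize link quality, so the policy
    # stipulates a delivery rate of 0 when fewer than 20 packets were sent.
    if h_sent < 20:
        return 0.0
    return lam * (h_received / h_sent)


print(hello_delivery_rate(50, 42, lam=0.8))  # 0.672
```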
3) Calculation of Q value
The nodes periodically broadcast Hello data packets in the network in order to find nodes within their coverage range that are suitable to become neighbors. The packet requires the nodes within the coverage range to return an acknowledgement message (ACK) containing their own remaining-energy ratio. The selection principle for neighbor nodes is to take as neighbors those nodes that meet a certain energy requirement and whose links have good communication quality. The neighbor selection is defined by the following formula.
Suppose the probability that node i is connected to this node is π_i; this probability is constrained by the node residual energy and the Hello packet delivery rate:
π_i = f(g(E_i), H_i) / Σ_{j∈N_s} f(g(E_j), H_j)
where f(g(E), H) = g(E)^α · H^(1-α), g(E) is a monotonic function of the node residual energy, H is the delivery success rate of the Hello packet, and α is an adjustable parameter that balances the energy against the packet reception rate. N_s is the set of neighbors of this node s, and j is a neighbor of node s.
When the determination of the neighbor relations is complete, an instantaneous reward value R_s(i) is defined from the node residual-energy factor and the Hello packet reception-rate factor to evaluate the routing tendency:
R_s(i) = E_{s,i} · H_{s,i} = g(E_i)^α · H_i^(1-α)
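As a small illustration of the probabilistic connection rule and the instantaneous reward defined above (a sketch only, assuming g(E) = E^τ with τ = 1; the function and variable names are invented, and normalizing over the candidate set rather than strictly over N_s is a simplification):

```python
import random

def preference(energy: float, delivery: float, alpha: float = 0.5, tau: float = 1.0) -> float:
    """f(g(E), H) = g(E)^alpha * H^(1 - alpha) with g(E) = E^tau; the same form as R_s(i)."""
    return (energy ** tau) ** alpha * delivery ** (1.0 - alpha)

def connection_probabilities(candidates: dict, alpha: float = 0.5) -> dict:
    """pi_i for every candidate i, where candidates maps id -> (remaining energy, delivery rate)."""
    weights = {i: preference(e, h, alpha) for i, (e, h) in candidates.items()}
    total = sum(weights.values()) or 1.0          # guard against an all-zero denominator
    return {i: w / total for i, w in weights.items()}

def choose_neighbors(candidates: dict, alpha: float = 0.5) -> list:
    """Keep candidate i as a neighbor with probability pi_i (step S2)."""
    probs = connection_probabilities(candidates, alpha)
    return [i for i, p in probs.items() if random.random() < p]

# Example: two candidates described by (remaining energy ratio, Hello delivery rate).
print(connection_probabilities({"A": (0.9, 0.8), "B": (0.4, 0.5)}))
```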
After the instantaneous reward value of the next-hop node is defined, the current node s updates the corresponding entry Q_s(i) in its Q-value table as:
Q_s(i) = (1 - η) · Q_s(i) + η · [R_s(i) + γ · max_{j∈N_i} Q_i(j)]
where η is the learning rate (the larger η, the less of the original Q value is retained), γ is the discount factor, and max_{j∈N_i} Q_i(j) denotes the maximum Q value, attained by some node j, in the Q-value table of the neighbor node i. If the neighbor node i is a node newly added within the coverage range of the current node s, then Q_s(i) = 0 in the Q-value table of node s.
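A sketch of this update (assuming the garbled expression is the standard Q-routing form implied by the surrounding text; the function and parameter names are invented for illustration):

```python
def update_q_entry(q_s: dict, neighbor_q_values, i, reward: float,
                   eta: float = 0.5, gamma: float = 0.8) -> None:
    """One update of Q_s(i) for neighbor i of the current node s.

    q_s               -- Q-value table of node s (neighbor id -> Q value)
    neighbor_q_values -- Q values reported by neighbor i from its own table
    reward            -- instantaneous reward R_s(i) = g(E_i)^alpha * H_i^(1 - alpha)
    eta               -- learning rate; larger eta keeps less of the old Q value
    gamma             -- discount factor
    """
    old = q_s.get(i, 0.0)                            # a newly added neighbor starts at 0
    best_next = max(neighbor_q_values, default=0.0)  # max_j Q_i(j) over i's neighbors j
    q_s[i] = (1.0 - eta) * old + eta * (reward + gamma * best_next)


q = {}
update_q_entry(q, [0.4, 0.7], "node_i", reward=0.6)
print(q)  # {'node_i': 0.58}
```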
Using the residual-energy factor and the Hello packet delivery-rate factor, the node connects probabilistically to the other nodes within its coverage range to form the network topology. From these two factors an initial Q value is calculated according to the reinforcement learning method, and the Q-value table is formed and maintained. The probabilistic connection calculation is performed periodically on the nodes within the coverage range; the result determines whether each neighbor remains connected, and if not, the corresponding neighbor entry is deleted from the Q-value table.
Through the formulas above, the link-establishment strategy is specified from two aspects: the network topology is interpreted both from the network's global capability and from the link level, laying a foundation for the subsequent route establishment.
2. Data forwarding method
The current node must periodically maintain and update the Q-value entries of its neighbor nodes in the Q-value table, evaluating the quality of the neighbor nodes. When data needs to be transmitted, the mean betweenness of the nodes on the shortest path from each neighbor node i to the destination node d is considered, and from this mean a betweenness-based forwarding reward value R_s(d, i) is defined. The larger the value, the larger the forwarding reward:
R_s(d, i) = (1/L) · Σ_{k∈P(i,d)} B(k)
The forwarding reward value R_s(d, i) is the average of the betweenness values B(k) of all nodes on the shortest path P(i, d) from the neighbor node i of the current node s to the destination node d, with R_s(d, i) ∈ (0, 1]; L is the number of nodes on the path. It can be seen that the closer the current node is to the destination node, the larger R_s(d, i), and thus the larger the forwarding reward.
3. Routing policy flow
By calculating the forwarding reward R_s(d, i) and combining it with the Q value of neighbor node i in the current node's Q-value table, the next-hop forwarding node is determined. Q_s(d, i) is defined as the Q value of selecting the neighbor node i as the next-hop node when the current node s forwards data toward the destination node d, and is expressed as:
Q_s(d, i) = Q_s(i) + R_s(d, i)
Assuming the current node s has N neighbor nodes, Q_s(d, i), i = 1, 2, ..., N, is computed in turn for the N neighbors from the Q-value table entries of node s and the forwarding reward based on path betweenness. The neighbor node with the largest Q_s(d, i) is selected as the data-forwarding node.
From the above description, the roles of the two main phases of this routing strategy are summarized as follows. 1) First phase: network topology establishment and node evaluation. A node with no data packets to transmit must periodically broadcast Hello packets to the nodes within its coverage range, maintain the network structure through the received responses, and update its Q-value table. 2) Second phase: data-forwarding selection. If data needs to be sent, the forwarding reward value R_s(d, i) on the shortest path from every neighbor node to the destination node is calculated and combined with the Q-value table entries of the current node; the neighbor node with the largest final value Q_s(d, i) is selected as the next-hop node, and the data is sent to it.
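The second phase just summarized could look roughly like the following sketch (illustrative assumptions: the networkx library supplies shortest paths and normalized betweenness, and all function names are invented; the exact normalization keeping R_s(d, i) within (0, 1] is not spelled out in the text):

```python
import networkx as nx

def forwarding_reward(graph: nx.Graph, neighbor, destination, betweenness) -> float:
    """R_s(d, i): mean betweenness of the L nodes on the shortest path from neighbor i to d."""
    path = nx.shortest_path(graph, source=neighbor, target=destination)
    return sum(betweenness[node] for node in path) / len(path)

def select_next_hop(q_s: dict, graph: nx.Graph, destination):
    """Return the neighbor i maximizing Q_s(d, i) = Q_s(i) + R_s(d, i)."""
    betweenness = nx.betweenness_centrality(graph, normalized=True)
    return max(q_s, key=lambda i: q_s[i] + forwarding_reward(graph, i, destination, betweenness))
```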
Reinforcement learning is an important development direction in the field of artificial intelligence that has attracted growing attention and extensive research in recent years. It comprises four elements: agent, environment, action, and reward. The agent selects a suitable action according to some policy; the environment gives feedback, i.e. a reward, according to the action the agent selects in a given state; the agent then adjusts its policy according to the reward and updates its behavior. Through this repeated adjustment, decisions are progressively optimized. The earliest algorithm to apply reinforcement learning to mobile ad hoc networks was the Q-routing algorithm, which keeps the weights measuring path quality in a Q table maintained by each node and selects the next-hop node according to that table. Other reinforcement learning-based routing algorithms include the following. An algorithm that adjusts the reinforcement learning rate according to node degree in the network topology uses less time to probe the real state of the network; obtaining the Q values of neighbor nodes from their broadcast messages reduces the time needed to explore the network state and the performance loss of the algorithm during learning. Adaptive Q-routing that randomly polls neighbor nodes improves routing stability under high load. A distributed reinforcement learning routing protocol suited to scenarios where vehicle nodes move at high speed estimates the state information of the network topology and uses unicast control packets to check the availability of inter-vehicle paths. A reinforcement learning-based mobile adaptive routing protocol addresses the disorganized and unstable network topology and improves dynamic adaptability to node changes through a distributed Q-learning algorithm. In summary, the reinforcement learning framework can be applied to a routing algorithm for mobile ad hoc networks: by continuously iterating the reward values, routing paths are planned toward a given routing objective, and the data-transmission task is completed well.
In the network-structure establishment stage, the invention uses a complex-network method as the basis for generating the Q-value table under the reinforcement learning framework, providing a standard for the preliminary evaluation of node quality. In the route-selection stage, the routing strategy uses the node betweenness along the whole path as the basis for computing the routing reward, fully expressing the requirement for the shortest path in a non-uniform network. Integrating the two stages yields a routing strategy based on network topology control that can effectively reduce network delay and congestion probability, extend node lifetime, and thereby improve routing capability.
Compared with the prior art, the invention adopts a dual-objective decision technique for constructing the mobile ad hoc network topology, which can comprehensively take the characteristics of the network into account and establish the network structure reasonably. Unlike a network with infrastructure, the multi-hop nature of a mobile ad hoc network means that the capabilities of the nodes and links participating in data transmission determine the transmission efficiency, so a single objective cannot comprehensively measure the network's characteristics as the basis for constructing the topology.
Secondly, the invention not only adopts multi-objective decision to construct the network topology, but also introduces the node betweenness index as an important reference for data forwarding. Node betweenness, as an important measure of network centrality, is well suited to reflecting the structural characteristics of a non-uniformly distributed network. Since most mobile ad hoc networks exhibit non-uniform node distribution, the proposed routing method can plan the routing path from the source node to the destination node more quickly and efficiently, improving the efficiency of data transmission.
Thirdly, the invention adopts a routing strategy combining a complex-network method with reinforcement learning, continuously optimizing the set of nodes participating in transmission according to the transmission reward value during route selection, thereby further ensuring efficient data transmission.
This embodiment only explains the invention and does not limit it; after reading this specification, those skilled in the art may modify the embodiment as needed without inventive contribution, and such modifications are protected by patent law within the scope of the claims of the invention.

Claims (4)

1. A routing method of a mobile ad hoc network based on reinforcement learning is characterized by comprising the following steps:
S1, calculating the remaining-energy percentage of the opposite node to determine its forwarding willingness, and calculating the Hello packet delivery rate between this node and the opposite node to determine the link quality between them;
S2, determining the neighbor nodes through probabilistic connection according to the residual-energy factor and the Hello packet delivery-rate factor, completing construction of the network topology;
S3, calculating the instantaneous reward value R_s(i) from the residual-energy factor and the Hello packet delivery-rate factor to evaluate the quality of the neighbor nodes, and iterating the update periodically to obtain the Q values of all nodes within the coverage range;
S4, when a node needs to send data, calculating the forwarding reward value R_s(d, i) from the mean betweenness of the nodes on the shortest path from the node to the destination node;
S5, calculating a selection factor Q_s(d, i) from the current node's evaluation value Q_s(i) of each neighbor node and the forwarding reward value R_s(d, i), sorting the neighbor nodes by Q_s(d, i), and selecting the node with the largest Q_s(d, i) as the next-hop node for data transmission.
2. The reinforcement learning-based mobile ad hoc network routing method according to claim 1, wherein the determination of the neighbor nodes in step S2 is specifically:
suppose the probability that node i is connected to this node is π_i, and the probability is constrained by the node residual energy and the Hello packet delivery rate:
π_i = f(g(E_i), H_i) / Σ_{j∈N_s} f(g(E_j), H_j)
wherein f(g(E), H) = g(E)^α · H^(1-α), g(E) is a monotonic function of the node residual energy, H is the delivery success rate of the Hello packet, and α is an adjustable parameter that can adjust the relation between the energy and the packet reception rate; N_s is the set of neighbors of the node s; j is a neighbor of the node s.
3. The reinforcement learning-based mobile ad hoc network routing method according to claim 2, wherein the calculation of the instantaneous reward value and the update of the Q-value table are specifically:
defining an instantaneous reward value R_s(i) to evaluate the routing tendency,
R_s(i) = E_{s,i} · H_{s,i} = g(E_i)^α · H_i^(1-α)
after the instantaneous reward value of the next-hop node is defined, the current node s updates the corresponding entry Q_s(i) in its Q-value table as
Q_s(i) = (1 - η) · Q_s(i) + η · [R_s(i) + γ · max_{j∈N_i} Q_i(j)]
wherein η is the learning rate (the larger η, the less of the original Q value is retained), γ is the discount factor, and max_{j∈N_i} Q_i(j) denotes the maximum Q value, attained by node j, in the Q-value table of the neighbor node i;
if the neighbor node i is a node newly added within the coverage range of the current node s, then Q_s(i) = 0 in the Q-value table of node s.
4. The reinforcement learning-based mobile ad hoc network routing method according to claim 3, wherein the data-forwarding routing policy is specifically:
when data needs to be transmitted, considering the mean betweenness of the nodes on the shortest path from the neighbor node i to the destination node d, and defining a betweenness-based forwarding reward value R_s(d, i),
R_s(d, i) = (1/L) · Σ_{k∈P(i,d)} B(k)
where the forwarding reward value R_s(d, i) is the average of the betweenness values B(k) of all nodes on the shortest path P(i, d) from the neighbor node i of the current node s to the destination node d, with R_s(d, i) ∈ (0, 1]; L is the number of nodes on the path;
determining the next-hop forwarding node by combining the Q values of the neighbor nodes i in the Q-value table of the current node; Q_s(d, i) is defined as the Q value of selecting the neighbor node i as the next-hop node when the current node s forwards data to the destination node d, expressed as
Q_s(d, i) = Q_s(i) + R_s(d, i)
assuming the current node s has N neighbor nodes, computing Q_s(d, i), i = 1, 2, ..., N, in turn for the N neighbor nodes from the Q-value table entries of node s and the forwarding reward value based on path betweenness; and
selecting the neighbor node with the largest Q_s(d, i) as the data-forwarding node for data transmission.
CN202110756598.7A 2021-07-05 2021-07-05 Mobile self-organizing network routing method based on reinforcement learning Active CN113660710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756598.7A CN113660710B (en) 2021-07-05 2021-07-05 Mobile self-organizing network routing method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756598.7A CN113660710B (en) 2021-07-05 2021-07-05 Mobile self-organizing network routing method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN113660710A true CN113660710A (en) 2021-11-16
CN113660710B CN113660710B (en) 2023-10-31

Family

ID=78477952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756598.7A Active CN113660710B (en) 2021-07-05 2021-07-05 Mobile self-organizing network routing method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113660710B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449608A (en) * 2022-01-21 2022-05-06 重庆邮电大学 Unmanned aerial vehicle ad hoc network self-adaptive routing method based on Q-Learning
CN114900255A (en) * 2022-05-05 2022-08-12 吉林大学 Near-surface wireless network link gradient field construction method based on link potential energy

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107104899A (en) * 2017-06-09 2017-08-29 中山大学 A kind of method for routing based on ant group algorithm being applied in vehicular ad hoc network
CN111479306A (en) * 2020-04-02 2020-07-31 中国科学院上海微系统与信息技术研究所 Q-learning-based QoS (quality of service) routing method for self-organizing network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107104899A (en) * 2017-06-09 2017-08-29 中山大学 A kind of method for routing based on ant group algorithm being applied in vehicular ad hoc network
CN111479306A (en) * 2020-04-02 2020-07-31 中国科学院上海微系统与信息技术研究所 Q-learning-based QoS (quality of service) routing method for self-organizing network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG YINGHE: "Evolving Network Model with Local-Area Preference for Mobile Ad Hoc Network", NETWORK TECHNOLOGY AND APPLICATION *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449608A (en) * 2022-01-21 2022-05-06 重庆邮电大学 Unmanned aerial vehicle ad hoc network self-adaptive routing method based on Q-Learning
CN114900255A (en) * 2022-05-05 2022-08-12 吉林大学 Near-surface wireless network link gradient field construction method based on link potential energy
CN114900255B (en) * 2022-05-05 2023-03-21 吉林大学 Near-surface wireless network link gradient field construction method based on link potential energy

Also Published As

Publication number Publication date
CN113660710B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Qasim et al. Mobile Ad Hoc Networking Protocols' Evaluation through Simulation for Quality of Service.
CN1886942B (en) Method and system for routing traffic in AD HOC networks
Yadav et al. Performance comparison and analysis of table-driven and on-demand routing protocols for mobile ad-hoc networks
WO2019169874A1 (en) Wireless mesh network opportunistic routing algorithm based on quality of service assurance
CN101945432A (en) Multi-rate opportunistic routing method for wireless mesh network
Deepalakshmi et al. Ant colony based QoS routing algorithm for mobile ad hoc networks
Qasim et al. Mobile Ad hoc Networks simulations using Routing protocols for Performance comparisons
Hendriks et al. Q 2-routing: A Qos-aware Q-routing algorithm for wireless ad hoc networks
CN108449271B (en) Routing method for monitoring path node energy and queue length
CN113660710B (en) Mobile self-organizing network routing method based on reinforcement learning
CN108684063A (en) A kind of on-demand routing protocol improved method based on network topology change
CN110932969B (en) Advanced metering system AMI network anti-interference attack routing algorithm for smart grid
Wannawilai et al. AOMDV with sufficient bandwidth aware
Ferdous et al. Randomized energy-based AODV protocol for wireless ad-Hoc network
Chettibi et al. FEA-OLSR: An adaptive energy aware routing protocol for manets using zero-order sugeno fuzzy system
Zhang et al. LocalTree: An efficient algorithm for mobile peer-to-peer live streaming
Abolhasan et al. LPAR: an adaptive routing strategy for MANETs
CN112533262B (en) Multi-path on-demand routing method of rechargeable wireless sensor network
Lafta et al. Efficient routing protocol in the mobile ad-hoc network (MANET) by using genetic algorithm (GA)
Liu et al. A biologically inspired congestion control routing algorithm for MANETs
Chaudhari et al. Multilayered distributed routing for power efficient MANET performance
Chetret et al. Reinforcement learning and CMAC-based adaptive routing for manets
Bokhari et al. AMIRA: interference-aware routing using ant colony optimization in wireless mesh networks
Ramakrishnan et al. Mathematical modeling of routing protocol selection for optimal performance of MANET
Yi et al. A node-disjoin multipath routing in mobile ad hoc networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant