CN111629415B - Opportunistic routing protocol design method based on Markov decision process model - Google Patents

Opportunistic routing protocol design method based on Markov decision process model

Info

Publication number
CN111629415B
Authority
CN
China
Prior art keywords
node
state
value
packet
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010331293.7A
Other languages
Chinese (zh)
Other versions
CN111629415A (en)
Inventor
黄成
尹政
刘子淇
刘振光
姚文杰
徐志良
王力立
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority claimed from application CN202010331293.7A
Publication of CN111629415A
Application granted
Publication of CN111629415B

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04W — WIRELESS COMMUNICATION NETWORKS
    • H04W 40/00 — Communication routing or communication path finding
    • H04W 40/02 — Communication route or path selection, e.g. power-based or shortest-path routing
    • H04W 40/04 — Communication route or path selection based on wireless node resources
    • H04W 40/10 — Communication route or path selection based on available power or energy
    • H04W 40/12 — Communication route or path selection based on transmission quality or channel quality
    • H04W 4/00 — Services specially adapted for wireless communication networks; facilities therefor
    • H04W 4/30 — Services specially adapted for particular environments, situations or purposes
    • H04W 4/38 — Services for collecting sensor information
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 30/00 — Reducing energy consumption in communication networks
    • Y02D 30/70 — Reducing energy consumption in wireless communication networks

Abstract

The invention discloses an opportunistic routing protocol based on a Markov decision process model, comprising the following steps. First, the environmental link quality is evaluated and the packet reception rate is estimated: packet reception rate data under the same RSSI value, LQI mean values, and packet reception rate data at different communication distances are acquired to establish a sample space, and curve-family regression fitting of the LQI mean value against the packet reception rate data yields an estimation formula for the packet reception rate. Wireless sensor nodes are then deployed to construct a wireless sensor network; each sensor node periodically broadcasts and receives detection packets and establishes a neighbor information table; each sensor node establishes a candidate node set; the node holding a valid data packet broadcasts it, each candidate node that receives the packet recalculates its corresponding state value according to the value iteration formula, and the sender selects the returning node with the largest corresponding state value as the next-hop forwarding node. The invention optimizes and balances the energy use of the wireless sensor network.

Description

Opportunistic routing protocol design method based on Markov decision process model
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a method for designing an opportunistic routing protocol based on a Markov decision process model.
Background
The wireless sensor network is a network formed by a plurality of sensor nodes in a multi-hop self-organizing mode, and has a very wide application prospect. In a great deal of research work on wireless sensor networks, the research on routing protocols is always key content, and reasonable routing design can effectively improve network performance. Because the sensor nodes are scattered in the unmanned area at random, the battery is difficult to replace, and the problems of saving the consumption of the node energy and balancing the use of the network energy become unavoidable.
Conventional wireless sensor network routing protocols select one or more optimized fixed paths before data transmission begins, and data packets are transmitted along the predetermined fixed paths. Unlike conventional routing protocols, each node in the opportunistic routing that receives a data packet may act as a relay node, and the routing path from the source node to the destination node is not fixed. And each node in the network acquires neighbor node information and network parameters through periodically sending and receiving the detection packet, and selects a proper neighbor node as a candidate node to form a candidate node set (CRS). In the forwarding process, the node selects an optimal next-hop forwarding node from candidate nodes which successfully receive the data packet. This process is repeated until the packet is forwarded to the destination node.
The traditional routing protocol has the defects that its routing path is fixed, it cannot adapt well to changes in the network environment, and its performance is strongly affected by changes in network link quality, node residual energy, node position, and other factors. If transmission to the next-hop node fails during some data transmission, the sender retransmits the data packet until the next-hop node receives it successfully, and the packet is not forwarded by other nodes that did receive it, wasting network resources. The disadvantage of opportunistic routing is that the data packet is forwarded hop by hop toward the destination node while the forwarding node has no global information about the network, so routing decisions can rely only on the position information of the neighbor nodes and the destination node to ensure that the packet keeps progressing toward the destination. In this case, if the positioning of the sensor nodes in the network deviates or the node positions change unpredictably, routing protocol performance is severely affected. Besides position-information error, the opportunistic routing protocol considers only the parameters of neighbor nodes and seeks only a single-step optimum, so a global optimum cannot be achieved. These problems limit further improvement of opportunistic routing protocol performance; overcoming the dependence on position information and selecting the best forwarding node from a global perspective remain the main difficulties.
Disclosure of Invention
The invention aims to provide an opportunistic routing protocol based on a Markov decision process model, so as to solve the problems that the routing protocol performance is poor because node position information is relied on and a global optimal solution cannot be realized in the prior art; the invention realizes the design of the opportunistic routing protocol through the Markov decision process of reinforcement learning, so that the energy use of the wireless sensor network is optimized and balanced, and the aim of prolonging the life cycle of the network is fulfilled.
The technical solution for realizing the purpose of the invention is as follows:
An opportunistic routing protocol based on a Markov decision process model comprises the following steps:
step 1, evaluating the environmental link quality and the packet receiving rate:
acquiring packet receiving rate data under the same RSSI value and LQI average value and packet receiving rate data under different communication distances, establishing a sample space, and performing curve family regression fitting on the LQI average value and the packet receiving rate data to obtain an estimation formula of the packet receiving rate;
step 2, deploying wireless sensor nodes and constructing a wireless sensor network: the wireless sensor network comprises a sink node, which is responsible for collecting the data gathered by ordinary nodes in the area and uploading it to the network;
step 3, periodically broadcasting and receiving the detection packet by the sensor node, and establishing a neighbor information table;
step 4, the sensor node establishes a candidate node set;
step 5, solving the forwarding node selection problem in opportunistic routing with a Markov decision process: the node holding a valid data packet broadcasts it, each candidate node that receives the data packet recalculates its corresponding state value according to the value iteration formula, and the sender selects the node with the largest corresponding state value as the next-hop forwarding node.
Step 6, repeating the data packet forwarding process of the step 5 until the data packet is forwarded to the sink node; and finally, obtaining an optimal routing path by continuously carrying out data packet forwarding and state value iteration.
Compared with the prior art, the invention has the remarkable advantages that:
(1) The invention introduces the Markov decision process, a classical reinforcement learning method, into the field of wireless sensor network routing protocol design. When modeling the opportunistic routing optimization problem as a Markov Decision Process (MDP), the probability transition matrix P is derived from the packet reception rates between sensor nodes, and the optimal solution of the MDP model is then found by dynamic programming, making the algorithm more efficient and better-converging.
(2) The invention calculates the packet receiving rate between the sensor nodes in real time by utilizing the Received Signal Strength Indication (RSSI) and the Link Quality Indication (LQI) provided by the network physical layer, so that the algorithm can adapt to the change of the network state.
(3) The invention designs the opportunistic routing protocol with a reinforcement-learning Markov decision process model, does not depend on the position information of the sensor nodes, and after continuous learning can find a suitable forwarding path from a globally optimal perspective.
Drawings
FIG. 1 is a schematic view of the present invention
FIG. 2 is a schematic diagram of a sensor network layout
FIG. 3 is a schematic diagram of state transition matrix computation
FIG. 4 is a plot of the relation between the action reward R and the current node's remaining-energy ratio k_E
FIG. 5 is a schematic diagram of a learned opportunistic routing path
FIG. 6 is a schematic diagram of an end-of-life wireless sensor network
Detailed Description
The invention is further described with reference to the drawings and specific embodiments.
The invention provides an opportunity routing protocol based on a Markov decision process model, which utilizes reinforcement learning to find an energy optimal forwarding path, and comprises the following specific steps:
step 1, evaluating the quality of an environmental link, and providing a packet receiving rate evaluation method:
and selecting a certain area as a data acquisition area, carrying out multiple communication experiments in the area by utilizing two sensor nodes under different communication distances, acquiring packet receiving rate data under the same RSSI value, and establishing a sample space by LQI average values and packet receiving rate data under different communication distances. When the link communication quality is good, the correlation between the RSSI value and the packet receiving rate is the best, so the RSSI value is used for estimating the packet receiving rate. When-70 dBm is less than or equal to RSSI, the packet receiving rate is 100%; when the RSSI is less than or equal to-75 dBm and less than or equal to-70 dBm, the wrapping yield is 99 percent; when the RSSI is less than or equal to-80 dBm and less than or equal to-75 dBm, the wrapping yield is 98 percent; when the RSSI is less than or equal to-85 dBm and less than or equal to-80 dBm, the packet receiving rate is (RSSI+177%; when the RSSI is less than-85 dBm, the packet receiving rate is estimated by using the LQI mean value, and curve family regression fitting is carried out on the LQI mean value and the packet receiving rate data, so as to obtain an estimation formula of the packet receiving rate. The method utilizes the RSSI and LQI value information carried in the data packet transmission to estimate the packet receiving rate in real time, and can adapt the routing protocol to the change of the network.
Step 2, deploying wireless sensor nodes and constructing a wireless sensor network:
as shown in fig. 2, a plurality of wireless sensor nodes are randomly scattered in the selected data acquisition area to form a wireless sensor network, which comprises a sink node responsible for collecting the data gathered by ordinary nodes in the area and uploading it to the network. The ordinary nodes have limited energy, while the sink node has sufficient energy.
Step 3, periodically broadcasting and receiving the detection packet by the sensor node, and establishing a neighbor information table:
Each sensor node in the sensor network periodically broadcasts a detection packet containing the node ID, the node's corresponding state value, the node's sleep/listening duty cycle, and the node's candidate node set. The node's corresponding state value evaluates the node's worth as a data packet forwarding node, and the candidate node set of a node is the set of nodes that can receive and forward data packets from that node. Each sensor node receives detection packets from its neighbor nodes, builds a neighbor information table, records the RSSI and LQI values observed when receiving data packets, and estimates the packet reception rate between itself and each neighbor using the fitting formula obtained in step 1. Taking the sleep/listening period of a node into account, the probability L_ij that node j successfully receives a packet broadcast by node i can be calculated as:

L_ij = p_ij · k_jw

where p_ij is the packet reception rate between node i and node j, and k_jw is the listening-time duty cycle of node j. This step is performed periodically during the life cycle of the sensor network.
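The listening-discounted link probability is a simple product; a one-function Python sketch (names are illustrative, not from the patent):

```python
def link_success_prob(p_ij, k_jw):
    """Probability L_ij that neighbour j actually receives a broadcast from i:
    the raw packet reception rate p_ij discounted by j's listening duty cycle k_jw."""
    return p_ij * k_jw
```

For instance, a 98% link to a node that listens half the time yields L_ij = 0.49.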
Step 4, the sensor node establishes a candidate node set:
The sensor node sorts its neighbor nodes by their corresponding state values and, in descending order, selects the neighbor nodes whose state values are greater than or equal to its own as candidate nodes, until the probability that a data packet broadcast by the sensor node is successfully received by at least one candidate node exceeds 90%, or no selectable neighbor node remains. The candidate nodes constitute the candidate node set. If a data packet broadcast by a sensor node is not received by any candidate node, the node rebroadcasts it. After the neighbor nodes' state values are updated, the node periodically repeats this step and reestablishes the candidate node set according to the new state values.
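The candidate-set construction of step 4 can be sketched as a greedy loop over neighbors ordered by state value, stopping once the chance that at least one candidate hears the broadcast exceeds the 90% threshold. A minimal sketch under assumed data structures (dicts mapping node id to state value and to L_ij), not the patent's implementation:

```python
def build_candidate_set(neighbors, links, own_value, threshold=0.90):
    """Greedy candidate-set construction (step 4 sketch).
    `neighbors` maps node id -> state value; `links` maps node id -> L_ij.
    Picks neighbours with value >= own_value, best first, until the
    probability that at least one receives the broadcast exceeds `threshold`."""
    eligible = sorted((j for j, v in neighbors.items() if v >= own_value),
                      key=lambda j: neighbors[j], reverse=True)
    crs, p_none = [], 1.0
    for j in eligible:
        crs.append(j)
        p_none *= (1.0 - links[j])       # probability every chosen candidate misses
        if 1.0 - p_none > threshold:     # someone receives with probability > 90%
            break
    return crs
```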
Step 5, solving the forwarding node selection problem in opportunistic routing with a Markov decision process: the node holding a valid data packet broadcasts it, each candidate node that receives the data packet recalculates its corresponding state value according to the value iteration formula, and the sender selects the node with the largest corresponding state value as the next-hop forwarding node.
5.1 modeling opportunistic routing problems with a Markov decision process model:
In the opportunistic routing problem, the agent is a valid data packet that needs to be forwarded. The modeling follows the standard formulation. S denotes the state set, with states denoted by s; the different states of the valid data packet correspond to the packet residing at different sensor nodes, so each state s corresponds to a node, and the value of the state corresponding to a node is that node's state value, representing the node's worth as a forwarding node. A denotes the action set, with actions denoted by a; every action in A broadcasts the data packet and selects a next-hop forwarding node according to some rule, and different actions differ in the rule used to select the next-hop node, hence in the state transition probability matrix P they generate. P is the state transition matrix, giving the probability that the valid data packet resides at a given node after taking a given action; different actions produce different, action-dependent state transition probabilities. R is the action reward: taking an action in a given state generates a corresponding reward.
5.2 calculating a state transition probability matrix P:
according to the invention, a state transition probability matrix P is obtained according to the packet receiving rate between nodes, and fig. 3 is a schematic diagram of state transition probability matrix calculation. In the graph, CRS i is a candidate node set of node i, and the node set has m candidate nodes and corresponds to a state value v (j) 1 )>v(j 2 )>v(j 3 )>v(j 4 )>…>v( m), wherein j1 、j 2 、j 3 、j 4 、j m Are candidate nodes for node i. The action taken by the node where the effective data packet is located is broadcasting the data packet, and the node with the largest state value is selected from candidate nodes receiving the data packet as a forwarding node of the next hop according to a greedy strategy, so that the effective data packet is transferred from the node i to the node j y And is composed of j y Probability of forwarding
Figure GDA0004048064370000051
Can be calculated by the following formula:
Figure GDA0004048064370000052
wherein ,
Figure GDA0004048064370000053
representing node j y Probability of successful reception of a node i broadcast packet, < >>
Figure GDA0004048064370000054
Representing node j t The probability of successfully receiving a broadcast packet by node i, t is the amount of change from 1 to y-1, and y is the amount of change from 1 to m. Node j other than the candidate node x (x=m+1, m+2.,. N) will not act as a forwarding node for the node i broadcast packet, and therefore +.>
Figure GDA0004048064370000055
Is the amount varying from m+1 to N. Probability calculated by node i>
Figure GDA0004048064370000056
The ith row and the jth row of the state transition probability matrix P respectively 1 、j 2 、…、j m 、…、j N And calculating the value of the column and the value of the corresponding row of the state transition probability matrix P by each node to obtain the complete state transition probability matrix P.
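One row of P follows directly from the greedy rule: candidate j_y forwards exactly when it receives the packet and every better-valued candidate misses. A Python sketch of that computation (illustrative, with the leftover probability mass interpreted as node i rebroadcasting, as described in step 4):

```python
def transition_row(link_probs):
    """One row of the state transition matrix P (step 5.2 sketch).
    `link_probs` lists L_{i,j_y} for candidates j_1..j_m ordered by
    DECREASING state value. Under the greedy policy, j_y forwards only
    when it receives the packet and every better-valued candidate misses."""
    row, p_all_miss = [], 1.0
    for L in link_probs:
        row.append(p_all_miss * L)   # P(packet moves from i to this candidate)
        p_all_miss *= (1.0 - L)
    return row, p_all_miss           # residue: nobody received; i rebroadcasts
```

With two candidates at L = 0.5 each, the row is [0.5, 0.25] and the residue 0.25, summing to 1.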
5.3 developing a reward function R:
In reinforcement learning, each step taken generates an action reward R, with R^a_s denoting the reward obtained by taking action a in state s. In each state, every action in the action set A broadcasts the data packet and selects a next-hop candidate according to some rule. Broadcasting the data packet is the actual action that incurs the reward, so the rewards of all actions in the same action set are equal, and the action reward depends only on the current state.
A reasonably formulated action reward lets the reinforcement learning algorithm realize the corresponding optimization task. The final purpose of energy-aware opportunistic routing is to optimize network energy use and extend the network life cycle. To achieve this, network energy must on the one hand be saved, forwarding the data packet to the destination node along the shortest possible path, and on the other hand be balanced, preventing some nodes from exhausting their energy prematurely through frequent use. To balance these two concerns, the invention formulates an action reward function R_s = −1 + f(k_E), where R_s denotes the action reward for broadcasting a data packet in state s, k_E is the ratio of the current node's remaining energy to its initial energy, and f(k_E) is a function of k_E. Broadcasting a data packet consumes energy, so the action rewards are all negative. Each transmission carries a base reward of −1, which guarantees that after a certain amount of learning the state-value functions of nodes far from the destination node are smaller. f(k_E) makes the reward energy-aware: the lower the current node's remaining energy, the greater the cost of forwarding the data packet and the smaller the action reward. f(k_E) is designed according to this principle, and the resulting relation between the action reward R and the current node's remaining-energy ratio k_E is shown in FIG. 4.
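The reward shape R_s = −1 + f(k_E) can be sketched as follows. The patent's exact f(k_E) is given only as an equation image not reproduced in this text, so the quadratic penalty below is an ASSUMED stand-in that merely satisfies the stated principle (monotone: less remaining energy means a smaller reward):

```python
def action_reward(k_e, f=lambda k: -(1.0 - k) ** 2):
    """Action reward R_s = -1 + f(k_E) (step 5.3 sketch).
    `f` is an ASSUMED monotone energy penalty, not the patent's exact curve:
    the reward shrinks as the remaining-energy ratio k_e drops."""
    return -1.0 + f(k_e)
```

With this stand-in, a full-energy node pays only the base cost (−1.0), while a depleted node pays −2.0.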
5.4, formulating an action strategy:
The action strategy in the Markov decision process model adopts a greedy strategy: the optimal action is taken in state s, maximizing the post-iteration state value of s. To explore the state space better, most algorithms give the action strategy some randomness, so that the agent has a probability of taking random actions to find possibly better solutions. In the opportunistic routing problem, however, the reward function is negative, so every state that has been visited is assigned a negative state value while unvisited states keep their higher initial values; the greedy choice therefore automatically explores the unknown state space, and previously unused nodes are preferentially used to forward data packets. Through this continuous learning process, the data packet reaches the destination node along the path with the minimum forwarding cost. The algorithm thus uses a pure greedy strategy as the action strategy.
5.5 the candidate nodes iterate corresponding state values and return the state values:
broadcasting data packets by nodes where effective data packets are located, and dynamically planning a value iteration formula of candidate nodes receiving the data packets according to the dynamic programming value
Figure GDA0004048064370000062
The own state value is recalculated, but instead of immediately replacing the original state value, the value is returned to the source node. Wherein k represents the kth iteration, k+1 represents the kth+1 iteration, v represents a state value, s represents a current state, namely, a state corresponding to a candidate node receiving the data packet, and s' representsShowing the next time state, v k+1 (s) is the value of state s at the (k+1) th iteration, v k (s ') is the value of state s' at the kth iteration, a represents an action that can be taken in the current state s, A is the set of actions, gamma is the discount factor, R is the action reward,
Figure GDA0004048064370000063
representing rewards for taking action a in state s, P is a state transition probability matrix,
Figure GDA0004048064370000064
representing the probability that the state becomes s' at the next time after taking action a in state s, a corresponding value can be found in the state transition probability matrix P, max indicating that the action policy is a greedy policy.
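The update above is the standard Bellman optimality backup. A minimal synchronous value-iteration sketch in Python (illustrative only; in the protocol the update is distributed and asynchronous, with each candidate computing one backup per packet):

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Synchronous value iteration (step 5.5 sketch):
    v_{k+1}(s) = max_a [ R[a][s] + gamma * sum_s' P[a][s][s'] * v_k(s') ].
    P: dict action -> NxN row-stochastic matrix; R: dict action -> length-N rewards."""
    n = len(next(iter(R.values())))
    v = [0.0] * n
    while True:
        v_new = [max(R[a][s] + gamma * sum(P[a][s][t] * v[t] for t in range(n))
                     for a in P)
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(v_new, v)) < tol:
            return v_new
        v = v_new
```

On a toy two-state chain (node → sink, with the sink absorbing at reward 0 and a broadcast cost of −1), the node converges to value −1 and the sink to 0.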
And 5.6 selecting a next hop forwarding node.
The data packet sending node receives the state values returned by the candidate nodes, selects the candidate with the largest successfully returned state value as the next-hop forwarding node, and broadcasts this message. The candidate node selected as the forwarding node adopts the state value calculated by the value iteration formula in step 5.5 as its new state value; the other candidate nodes do not update their state values and discard the received data packet.
And 6, repeating the data packet forwarding process in the step 5 until the data packet is forwarded to the sink node. And finally, obtaining an optimal routing path by continuously carrying out data packet forwarding and state value iteration. Fig. 5 is a schematic diagram of a learned routing path. If a certain data packet has no path to reach the sink node, the data packet transmission fails. And when 30% of data packet transmission fails in a certain longer period, the life cycle of the wireless sensor network is ended. Fig. 6 is a schematic diagram of the end of life of the network, where the black filled circles indicate sensor nodes that die from energy depletion.

Claims (1)

1. The opportunistic routing protocol design method based on the Markov decision process model is characterized by comprising the following steps of:
step 1, evaluating the environmental link quality and the packet receiving rate:
carrying out multiple communication experiments in a data acquisition area with two sensor nodes at different communication distances, acquiring packet reception rate data under the same RSSI value, and establishing a sample space from LQI mean values and packet reception rate data at different communication distances; when −70 dBm ≤ RSSI, the packet reception rate is 100%; when −75 dBm ≤ RSSI < −70 dBm, the packet reception rate is 99%; when −80 dBm ≤ RSSI < −75 dBm, the packet reception rate is 98%; when −85 dBm ≤ RSSI < −80 dBm, the packet reception rate is (RSSI + 177)%; when RSSI < −85 dBm, the packet reception rate is estimated from the LQI mean value, and curve-family regression fitting of the LQI mean value against the packet reception rate data yields an estimation formula for the packet reception rate;
step 2, deploying wireless sensor nodes and constructing a wireless sensor network: the wireless sensor network comprises a sink node, which is responsible for collecting the data gathered by ordinary nodes in the area and uploading it to the network;
step 3, each wireless sensor node periodically broadcasts and receives detection packets and establishes a neighbor information table, the neighbor information table comprising the neighbor node IDs, the neighbor nodes' corresponding state values, the neighbor nodes' sleep/listening duty cycles, the candidate node sets of the neighbor nodes, and the packet reception rates to the neighbor nodes;
step 4, the wireless sensor node establishes a candidate node set: the wireless sensor node sorts its neighbor nodes by their corresponding state values and, in descending order, selects the neighbor nodes whose state values are greater than or equal to its own as candidate nodes, until the probability that a data packet it broadcasts is successfully received by at least one candidate node exceeds a set value, or no selectable neighbor node remains; the candidate nodes constitute the candidate node set;
step 5, solving the forwarding node selection problem in opportunistic routing with a Markov decision process: the node holding a valid data packet broadcasts it, each candidate node that receives the data packet recalculates its corresponding state value according to a value iteration formula, and the sender selects the node with the largest corresponding state value as the next-hop forwarding node; solving the forwarding node selection problem in opportunistic routing with the Markov decision process specifically comprises the following steps:
5.1 modeling the opportunistic routing problem by using a Markov decision process model;
5.2, calculating a state transition probability matrix P: a state transition probability matrix P is obtained from the packet reception rates between nodes, and the probability P^a_{i,j_y} that the valid data packet is transferred from node i to candidate node j_y and forwarded by j_y (y = 1, 2, …, m) can be calculated by the following formula:

P^a_{i,j_y} = L_{i,j_y} · ∏_{t=1}^{y−1} (1 − L_{i,j_t})

where L_{i,j_y} denotes the probability that node j_y successfully receives a packet broadcast by node i, L_{i,j_t} denotes the same probability for node j_t, and m is the number of candidate nodes in the candidate node set of node i; the remaining nodes other than the candidates are j_x (x = m+1, m+2, …, N), for which P^a_{i,j_x} = 0, N being the total number of wireless sensor nodes in the network; the probabilities computed by node i fill row i of the state transition probability matrix P in columns j_1, j_2, …, j_m, …, j_N, and with every node computing the values of its corresponding row, the complete state transition probability matrix P can be obtained;
5.3, formulating the action reward function:
The action reward function is formulated as R_s = -1 + f(k_E), wherein R_s represents the reward for the action of broadcasting a data packet in state s, k_E is the ratio of the current node's remaining energy to its initial energy, and f(k_E) is a function of this ratio k_E;
f(k_E) is designed as a monotonically increasing function of k_E [its expression is given as an image in the original];
the less remaining energy the current node has, the greater the cost of forwarding a data packet and the smaller the action reward;
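Since the patent gives the expression for f(k_E) only as an image, the sketch below assumes the simplest monotone choice, f(k_E) = k_E, purely for illustration; any increasing f with the stated property (lower residual energy, smaller reward) would fit the same shape.

```python
def action_reward(remaining_energy, initial_energy):
    """Reward R_s = -1 + f(k_E) for broadcasting a packet in state s.

    Assumption (not from the patent): f(k_E) = k_E, so the reward
    ranges from -1 (node depleted) up toward 0 (node at full energy).
    """
    k_e = remaining_energy / initial_energy  # ratio of remaining to initial energy
    return -1.0 + k_e
```

With this choice, every hop carries a base cost of -1 that is partially offset by the relay's energy level, steering traffic away from nearly depleted nodes.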
5.4, formulating the action policy: a greedy policy is adopted, i.e. the optimal action is taken in state s, so that the state value of state s after iteration is maximized;
5.5 the candidate nodes iterate their corresponding state values and return them:
The node holding the valid data packet broadcasts the packet; each candidate node that receives it recalculates its own state value according to the dynamic-programming value iteration formula

v_{k+1}(s) = \max_{a \in A} \Big( R_s^a + \lambda \sum_{s'} P_{ss'}^a \, v_k(s') \Big)

but does not immediately replace its original state value with this value, instead returning the value to the source node; wherein k denotes the k-th iteration and k+1 the (k+1)-th iteration, v denotes the state value, s denotes the current state, i.e. the state corresponding to the candidate node receiving the data packet, s' denotes the state at the next moment, v_{k+1}(s) is the value of state s at the (k+1)-th iteration, v_k(s') is the value of state s' at the k-th iteration, a denotes an action that can be taken in the current state s, A is the action set, λ is the discount factor, R is the action reward, R_s^a denotes the reward for taking action a in state s, P is the state transition probability matrix, P_{ss'}^a denotes the probability that the state becomes s' at the next moment after taking action a in state s (the corresponding value can be found in the state transition probability matrix P), and the max operator reflects the greedy action policy;
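One synchronous Bellman backup of the value iteration formula in step 5.5 can be written in tabular form as below. This is a generic sketch of the update, not the patent's distributed implementation; the array layouts `P[a][s][s2]` and `R[s][a]` are assumptions for illustration.

```python
def value_iteration_step(v, P, R, actions, gamma):
    """One sweep of v_{k+1}(s) = max_a ( R_s^a + gamma * sum_{s'} P_{ss'}^a v_k(s') ).

    v: list of current state values v_k(s).
    P[a][s][s2]: transition probability P_{ss'}^a.
    R[s][a]: action reward R_s^a.
    gamma: discount factor (lambda in the patent's notation).
    """
    n = len(v)
    v_new = [0.0] * n
    for s in range(n):
        # greedy policy: take the action maximizing the backed-up value
        v_new[s] = max(
            R[s][a] + gamma * sum(P[a][s][s2] * v[s2] for s2 in range(n))
            for a in actions
        )
    return v_new
```

In the protocol, each candidate node performs only its own state's backup and reports v_{k+1}(s) to the sender rather than committing it locally.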
5.6 selecting the next-hop forwarding node:
the data packet sending node receives the state values returned by the candidate nodes, selects from among the candidates that successfully received the packet the node corresponding to the largest state value as the next-hop forwarding node, and broadcasts the message;
step 6, repeating the data packet forwarding process of step 5 until the data packet is forwarded to the sink node; by continuously performing packet forwarding and state value iteration, the optimal routing path is finally obtained.
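The repeated hop selection of steps 5-6 reduces to a greedy walk over returned state values. The toy sketch below assumes the broadcast/reply exchange has already produced, for each candidate, an iterated state value; `candidates_of` and `state_value` are illustrative stand-ins for that exchange, not names from the patent.

```python
def route_to_sink(source, sink, candidates_of, state_value):
    """Greedy next-hop selection repeated until the sink is reached.

    candidates_of[node]: candidate set of `node` (assumed precomputed).
    state_value[node]: the state value the candidate returned after its
    value-iteration update (step 5.5).
    Returns the resulting routing path as a list of node ids.
    """
    path = [source]
    node = source
    while node != sink:
        # the sender picks the candidate that returned the largest state value
        node = max(candidates_of[node], key=lambda j: state_value[j])
        path.append(node)
    return path
```

In the real protocol the state values keep being refined on every forwarding round, so successive packets converge toward the optimal path rather than following a fixed table.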
CN202010331293.7A 2020-04-24 2020-04-24 Opportunistic routing protocol design method based on Markov decision process model Active CN111629415B (en)
