CN111629415A - Opportunistic routing protocol based on Markov decision process model - Google Patents


Info

Publication number: CN111629415A
Authority: CN (China)
Prior art keywords: node, state, value, data packet, packet
Legal status: Granted
Application number: CN202010331293.7A
Other languages: Chinese (zh)
Other versions: CN111629415B (en)
Inventors: 黄成, 尹政, 刘子淇, 刘振光, 姚文杰, 徐志良, 王力立
Current Assignee: Nanjing University of Science and Technology
Original Assignee: Nanjing University of Science and Technology
Application filed by: Nanjing University of Science and Technology
Priority application: CN202010331293.7A
Publication of CN111629415A; application granted; publication of CN111629415B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 40/00 Communication routing or communication path finding
    • H04W 40/02 Communication route or path selection, e.g. power-based or shortest path routing
    • H04W 40/04 Communication route or path selection based on wireless node resources
    • H04W 40/10 Communication route or path selection based on available power or energy
    • H04W 40/12 Communication route or path selection based on transmission quality or channel quality
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/38 Services for collecting sensor information
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an opportunistic routing protocol based on a Markov decision process model. First, the quality of the environmental link is evaluated and the packet reception rate is estimated: packet reception rate data under the same RSSI value and LQI mean, and under different communication distances, are collected to establish a sample space, and curve-family regression fitting of the LQI mean against the packet reception rate data yields an estimation formula for the packet reception rate. Wireless sensor nodes are then scattered to establish a wireless sensor network; each sensor node periodically broadcasts and receives detection packets and establishes a neighbor information table; each sensor node establishes a candidate node set. The node holding a valid data packet broadcasts it, each candidate node that receives the packet recalculates its corresponding state value according to a value iteration formula, and the sender selects the node with the largest corresponding state value as the next-hop forwarding node. The invention optimizes and balances the energy use of the wireless sensor network.

Description

Opportunistic routing protocol based on Markov decision process model
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an opportunistic routing protocol based on a Markov decision process model.
Background
A wireless sensor network is a network formed by many sensor nodes in a multi-hop, self-organizing manner, and has very broad application prospects. Among the large body of research on wireless sensor networks, routing protocols have always been a key topic, since reasonable routing design can effectively improve network performance. Because many sensor nodes are randomly scattered in unattended areas where batteries are difficult to replace, saving node energy and balancing network energy use is an unavoidable problem.
In the conventional wireless sensor network, one or more optimized fixed paths are selected by a protocol before data starts to be transmitted, and data packets are transmitted along the preset fixed paths. Unlike the conventional routing protocol, each node receiving a data packet in the opportunistic routing is likely to serve as a relay node, and the routing path from the source node to the destination node is not fixed. Each node in the network acquires neighbor node information and network parameters through periodically sending and receiving detection packets, and selects proper neighbor nodes as candidate nodes to form a candidate node set (CRS). In the forwarding process, the node selects the optimal next hop forwarding node from the candidate nodes which successfully receive the data packet. This process is repeated until the packet is forwarded to the destination node.
The drawback of traditional routing protocols is that the routing path is fixed and cannot adapt well to changes in the network environment; changes in factors such as link quality, node residual energy, and node position can greatly affect network performance. If a transmission from the sender to the next-hop node fails, the sender retransmits the data packet until the next-hop node receives it successfully, instead of letting other nodes that did receive the packet forward it, which wastes network resources. The drawback of opportunistic routing is that forwarding nodes are selected hop by hop and no global network information is available, so routing decisions can only rely on the position information of neighbor nodes and the destination node to ensure that the data packet keeps moving toward the destination. In this case, if the positioning of the sensor nodes is inaccurate or the node positions are unknown, the performance of the routing protocol is seriously affected. Besides the influence of position errors, such opportunistic routing protocols consider only the parameters of neighbor nodes, find only single-step optimal solutions, and cannot achieve global optimality. These problems limit further improvement of opportunistic routing performance; how to remove the dependence on position information and how to select the optimal forwarding node from a global perspective are the key challenges.
Disclosure of Invention
The invention aims to provide an opportunistic routing protocol based on a Markov decision process model, to solve the prior-art problems that routing performance is poor because it relies on node position information and cannot achieve a globally optimal solution. The invention designs the opportunistic routing protocol through the Markov decision process of reinforcement learning, optimizes and balances the energy use of the wireless sensor network, and thereby prolongs the network lifetime.
The technical solution for realizing the purpose of the invention is as follows:
an opportunistic routing protocol based on a Markov decision process model comprising the steps of:
step 1, evaluating environment link quality, evaluating packet receiving rate:
collecting packet receiving rate data under the same RSSI value and LQI mean value and packet receiving rate data under different communication distances to establish a sample space, and performing curvilinear family regression fitting on the LQI mean value and the packet receiving rate data to obtain an estimation formula of the packet receiving rate;
step 2, scattering wireless sensor nodes, and establishing a wireless sensor network: the wireless sensor network comprises a sink node and is responsible for collecting data collected by common nodes in a region and uploading the data to the network;
step 3, periodically broadcasting and receiving detection packets by the sensor nodes, and establishing a neighbor information table;
step 4, the sensor nodes establish a candidate node set;
step 5, solving the forwarding node selection problem in the opportunistic routing with a Markov decision process: the node holding a valid data packet broadcasts it, each candidate node that receives the packet recalculates its corresponding state value according to a value iteration formula, and the data packet sender selects the node with the largest corresponding state value as the next-hop forwarding node.
Step 6, repeating the data packet forwarding process of the step 5 until the data packet is forwarded to the sink node; and finally, obtaining the optimal routing path by continuously forwarding the data packet and iterating the state value.
Compared with the prior art, the invention has the following remarkable advantages:
(1) The invention combines the Markov decision process, a classical reinforcement learning model, with opportunistic routing, providing a new approach for the design of wireless sensor network routing protocols.
(2) The invention utilizes the Received Signal Strength Indicator (RSSI) and the Link Quality Indicator (LQI) provided by the network physical layer to calculate the packet receiving rate among the sensor nodes in real time, so that the algorithm can adapt to the change of the network state.
(3) The invention designs the opportunistic routing protocol by utilizing a Markov decision process model for reinforcement learning, does not depend on the position information of the sensor nodes, and can search a proper forwarding path from the perspective of global optimum after continuous learning.
Drawings
FIG. 1 is a general schematic view of the present invention
FIG. 2 is a schematic diagram of a sensor network layout
FIG. 3 is a schematic diagram of state transition matrix calculation
FIG. 4 is a diagram of the relation between the action reward R and the ratio k_E of the current node's residual energy
FIG. 5 is a diagram of learned opportunistic routing paths
FIG. 6 is a schematic diagram of the end of life of a wireless sensor network
Detailed Description
The invention is further described with reference to the following figures and embodiments.
The invention provides an opportunistic routing protocol based on a Markov decision process model, which utilizes reinforcement learning to search an energy optimal forwarding path and comprises the following specific steps:
step 1, evaluating the quality of an environment link, and providing a packet receiving rate evaluation method:
A certain area is selected as the data acquisition area, and two sensor nodes in the area perform a number of communication experiments at different communication distances, collecting packet reception rate data under the same RSSI value and LQI mean, and under different communication distances, to establish a sample space. When the link quality is good, the RSSI value correlates best with the packet reception rate, so the RSSI value is used to estimate it: when RSSI ≥ -70 dBm, the packet reception rate is 100%; when -75 dBm ≤ RSSI < -70 dBm, it is 99%; when -80 dBm ≤ RSSI < -75 dBm, it is 98%; when -85 dBm ≤ RSSI < -80 dBm, it is (RSSI + 177)%. When RSSI < -85 dBm, the packet reception rate is instead estimated from the LQI mean, using the estimation formula obtained by curve-family regression fitting of the LQI mean against the packet reception rate data. The RSSI and LQI values carried in data packet transmissions are used to estimate the packet reception rate in real time, so the routing protocol can adapt to changes in the network.
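The piecewise rule above can be sketched as a small estimator. This is a minimal sketch: the LQI regression fit obtained in step 1 is not given numerically in the patent, so it is passed in here as a hypothetical callable.

```python
def packet_reception_rate(rssi_dbm, lqi_mean, lqi_fit=None):
    """Piecewise packet-reception-rate estimate (%) from step 1.

    Above -85 dBm the RSSI rule from the text is applied directly; below
    -85 dBm the LQI regression fit is used. The fit's coefficients are
    not given in the patent, so `lqi_fit` is a hypothetical callable.
    """
    if rssi_dbm >= -70:
        return 100.0
    if rssi_dbm >= -75:
        return 99.0
    if rssi_dbm >= -80:
        return 98.0
    if rssi_dbm >= -85:
        return rssi_dbm + 177.0  # linear segment, e.g. -82 dBm -> 95%
    if lqi_fit is None:
        raise ValueError("LQI regression fit is required below -85 dBm")
    return lqi_fit(lqi_mean)
```

For example, `packet_reception_rate(-82, 0)` falls in the linear segment and returns 95.0.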
Step 2, scattering wireless sensor nodes, and establishing a wireless sensor network:
as shown in fig. 2, a plurality of wireless sensor nodes are randomly scattered in a selected data acquisition area to form a wireless sensor network, wherein the wireless sensor network comprises a sink node which is responsible for collecting data acquired by common nodes in the area and uploading the data to the network. The energy of the common node is limited, and the energy of the sink node is sufficient.
Step 3, the sensor node periodically broadcasts and receives the detection packet, and establishes a neighbor information table:
Each sensor node in the sensor network periodically broadcasts a detection packet containing the node ID, the node's corresponding state value, the node's sleep/listening duty cycle, and the node's candidate node set. The corresponding state value evaluates the value of a node as a data packet forwarding node, and the candidate node set of a node is the set of nodes that can receive and forward data packets from it. Each sensor node receives detection packets from its neighbors, establishes a neighbor information table, obtains the RSSI and LQI values when receiving a packet, and estimates the packet reception rate to each neighbor with the fitting formula obtained in step 1. Taking a node's sleep/listening schedule into account, the probability L_ij that node j actually receives a packet broadcast by node i is calculated as

L_ij = p_ij · k_jw

where p_ij is the packet reception rate between node i and node j, and k_jw is the listening fraction (duty cycle) of node j. This step is performed periodically throughout the sensor network lifecycle.
Step 4, the sensor nodes establish a candidate node set:
The sensor node sorts its neighbor nodes by their corresponding state values, and selects, in decreasing order, the neighbor nodes whose state values are greater than or equal to its own as candidate nodes, stopping once the probability that at least one candidate successfully receives a data packet broadcast by the sensor node exceeds 90%, or when no eligible neighbor node remains. The candidate nodes constitute the candidate node set. If a data packet broadcast by the sensor node is not received by any candidate node, the node rebroadcasts it. After the state values of the neighbor nodes are updated, the node periodically repeats this step and re-establishes the candidate node set according to the new state values.
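The candidate-set construction in this step can be sketched as follows, assuming each neighbor entry carries the effective reception probability L_ij = p_ij · k_jw from step 3; the 90% threshold is the success-probability target stated above.

```python
def build_candidate_set(neighbors, own_value, target=0.90):
    """Build the candidate node set (CRS) as described in step 4.

    neighbors: iterable of (node_id, state_value, l_ij), where l_ij is
    the probability L_ij = p_ij * k_jw that the neighbor hears a
    broadcast. Only neighbors whose state value is >= this node's own
    qualify; they are added in decreasing state-value order until the
    probability that at least one candidate receives the broadcast
    exceeds `target`, or no eligible neighbor remains.
    """
    eligible = sorted((n for n in neighbors if n[1] >= own_value),
                      key=lambda n: n[1], reverse=True)
    crs, p_miss_all = [], 1.0  # p_miss_all: probability no candidate receives
    for node_id, _value, l_ij in eligible:
        crs.append(node_id)
        p_miss_all *= 1.0 - l_ij
        if 1.0 - p_miss_all > target:
            break
    return crs
```

With neighbors ("a", 5, 0.8), ("b", 4, 0.7), ("c", 3, 0.6) and own value 2, the set stops at ["a", "b"], since 1 - 0.2 * 0.3 = 0.94 already exceeds 90%.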
Step 5, solving the forwarding node selection problem in the opportunistic routing by using a Markov decision process: and broadcasting the data packet by the node where the effective data packet is located, recalculating the corresponding state value of the node by the candidate node receiving the data packet according to a value iteration formula, and selecting the node with the maximum corresponding state value as a next skip sending node by the data packet sender.
5.1 modeling the opportunistic routing problem by using a Markov decision process model:
In the opportunistic routing problem, the agent is the valid data packet that needs to be forwarded. The modeling follows the standard Markov decision process formulation. S is the state set, with states denoted s; different states of the valid data packet correspond to the packet being located at different sensor nodes, so each state s corresponds to one node, and the value of that state is the node's corresponding state value, representing the node's worth as a forwarding node. A is the action set, with actions denoted a; each action is to broadcast the data packet and select the next-hop forwarding node according to some rule, and different actions differ in that rule, so they induce different state transition probability matrices P. P is the state transition matrix, giving the probability that the valid data packet is at each node after an action is taken; different actions yield different transition probabilities. R is the action reward: taking an action in a state generates a corresponding reward.
5.2 calculating the state transition probability matrix P:
The invention obtains the state transition probability matrix P from the packet reception rates between nodes; FIG. 3 is a schematic diagram of the calculation. In the figure, CRS_i is the candidate node set of node i, which contains m candidate nodes with corresponding state values v(j_1) > v(j_2) > v(j_3) > v(j_4) > … > v(j_m), where j_1, j_2, j_3, j_4, …, j_m are all candidate nodes of node i. The action taken by the node holding the valid data packet is to broadcast it and, following a greedy strategy, select the node with the largest state value among the candidates that received it as the next-hop forwarding node. The probability P^a_(i,j_y) that the valid data packet moves from node i to node j_y, to be forwarded by j_y, is therefore calculated as

P^a_(i,j_y) = L_(i,j_y) · ∏_(t=1)^(y-1) (1 - L_(i,j_t))

where L_(i,j_y) is the probability that node j_y successfully receives node i's broadcast packet, L_(i,j_t) is the probability that node j_t successfully receives it, t varies from 1 to y-1, and y varies from 1 to m. Nodes other than the candidates, j_x (x = m+1, m+2, …, N), cannot act as forwarding nodes for node i's broadcast packets, so

P^a_(i,j_x) = 0, for x varying from m+1 to N.

The probabilities P^a_(i,j) computed by node i fill row i of the state transition probability matrix P in columns j_1, j_2, …, j_m, …, j_N; each node computes its corresponding row, yielding the complete state transition probability matrix P.
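Under the stated greedy forwarding rule, row i of P can be computed as in this sketch. One assumption beyond the formula itself: the residual case, in which no candidate receives the broadcast and node i rebroadcasts (step 4), is assigned to the diagonal entry so that the row is properly stochastic.

```python
def transition_row(i, candidates, n_nodes):
    """Row i of the transition matrix P under the greedy rule of 5.2.

    candidates: list of (j, l_ij) pairs sorted by decreasing state
    value (j_1 ... j_m), where l_ij is the reception probability L_ij
    from step 3. The packet moves to j_y only if j_y receives it and
    every better-valued candidate j_t (t < y) misses it:
        P(i -> j_y) = L_{i,j_y} * prod_{t<y} (1 - L_{i,j_t})
    The residual mass (no candidate receives) is put on the diagonal:
    node i keeps the packet and rebroadcasts, per step 4.
    """
    row = [0.0] * n_nodes
    p_all_better_miss = 1.0  # probability every better candidate missed
    for j, l_ij in candidates:
        row[j] = l_ij * p_all_better_miss
        p_all_better_miss *= 1.0 - l_ij
    row[i] = p_all_better_miss  # rebroadcast case; row now sums to 1
    return row
```

For instance, with candidates j_1 = 1 (L = 0.8) and j_2 = 2 (L = 0.5), row 0 of a 4-node network is [0.1, 0.8, 0.1, 0.0].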
5.3 formulating a reward function R:
In reinforcement learning, every action taken generates an action reward R, where R^a_s denotes the reward obtained by taking action a in state s. In each state, the available action set A is to broadcast the data packet and select a next-hop candidate node according to some rule. Broadcasting the data packet is the actual action that generates the reward, so the rewards of all actions in the same action set are identical, and the action reward depends only on the current state.

Only with a reasonably formulated action reward can the reinforcement learning algorithm realize the intended optimization task. Energy-aware opportunistic routing aims to optimize network energy use and prolong the network lifetime. To this end, on the one hand network energy must be saved by forwarding data packets to the destination node along the shortest possible path; on the other hand network energy must be balanced, so that some nodes do not exhaust their energy prematurely through frequent use. To balance these two aspects, the invention formulates the action reward function R_s = -1 + f(k_E), where R_s is the action reward for broadcasting a data packet in state s, k_E is the ratio of the current node's remaining energy to its initial energy, and f(k_E) is a function of that ratio. Broadcasting a packet consumes energy, so the action reward is negative: every transmission of the data packet incurs a base reward of -1, so that after some learning the state values of nodes far from the destination node become smaller. The value of f(k_E) is always negative; the less residual energy the current node has, the higher the cost of forwarding the data packet and the smaller the action reward. Based on this principle, f(k_E) is designed as shown in the following formula (given in the original only as an image):

[formula for f(k_E): image in original]

The relation between the action reward R and the ratio k_E of the current node's residual energy is then as shown in FIG. 4.
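A sketch of the reward computation follows. Since the patent gives f(k_E) only as a figure, the default `f` used here is a hypothetical stand-in chosen only to satisfy the stated properties: always non-positive, and falling steeply as the remaining energy shrinks.

```python
def action_reward(k_e, f=lambda k: -(1.0 - k) / max(k, 1e-6)):
    """Action reward R_s = -1 + f(k_E) for broadcasting in state s (5.3).

    k_e is the ratio of the node's remaining energy to its initial
    energy. The default `f` is a HYPOTHETICAL stand-in (the patent
    shows f(k_E) only as a figure): it is <= 0 everywhere, equals 0 at
    full energy, and falls steeply as remaining energy nears zero.
    """
    return -1.0 + f(k_e)
```

With this stand-in, a full-energy node pays only the base cost of -1, while a half-drained node pays -2, so depleted nodes become progressively less attractive as forwarders.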
5.4, an action strategy is established:
The action strategy in the Markov decision process model is a greedy strategy: in state s the optimal action is taken, so that the iterated state value of s is maximal. To explore the state space better, most algorithms add some randomness to the action strategy, so that the agent occasionally takes random actions in search of possibly better solutions. In the opportunistic routing problem, however, the reward function is negative, so every state that has been reached acquires a negative state value while unvisited states retain their higher initial values; the algorithm therefore explores the unknown state space automatically and preferentially forwards data packets through unused nodes. Through this continuous learning process, the data packet comes to reach the destination node along the path with the minimum forwarding cost. The algorithm thus employs a plain greedy strategy as its action strategy.
5.5 the candidate node iterates and returns the corresponding state value:
The node holding the valid data packet broadcasts it, and each candidate node that receives the packet recalculates its own state value according to the dynamic programming value iteration formula

v_(k+1)(s) = max_(a∈A) [ R^a_s + γ · Σ_(s'∈S) P^a_(ss') · v_k(s') ]

but does not immediately replace its original state value with the new one; instead it transmits the value back to the sending node. In the formula, k denotes the k-th iteration and k+1 the (k+1)-th; v denotes the state value; s is the current state, i.e. the state corresponding to the candidate node that received the data packet, and s' is the state at the next moment; v_(k+1)(s) is the value of state s at the (k+1)-th iteration and v_k(s') the value of state s' at the k-th iteration; a is an action that can be taken in the current state s and A is the action set; γ is the discount factor; R^a_s is the reward for taking action a in state s; P is the state transition probability matrix, and P^a_(ss') is the probability that the state becomes s' at the next moment after taking action a in state s, whose value is found in P; max indicates that the action strategy is the greedy strategy.
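One synchronous sweep of the value iteration rule above can be sketched as follows. In this protocol every state has the single action "broadcast", so the max over actions reduces to evaluating each state's one transition row.

```python
def value_update(rewards, P, v, gamma=0.9):
    """One synchronous sweep of the value iteration of step 5.5:

        v_{k+1}(s) = max_a [ R_s^a + gamma * sum_{s'} P_{ss'}^a v_k(s') ]

    Here every state has the single action 'broadcast', so the max over
    actions degenerates: rewards[s] is R_s, P[s] is state s's transition
    row, and v holds the current estimates v_k.
    """
    n = len(v)
    return [rewards[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
            for s in range(n)]
```

On a three-node chain 0 -> 1 -> 2 (sink) with reward -1 per hop, repeated sweeps drive the value of node 0 toward -1 + γ·(-1), so nodes farther from the sink acquire smaller state values, exactly the gradient the forwarding rule descends.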
5.6 selecting the next hop forwarding node.
The data packet sending node receives the state values returned by the candidate nodes, selects the node with the highest corresponding state value as the next-hop forwarding node, and broadcasts this choice. The candidate node selected as forwarding node adopts the state value computed with the value iteration formula of step 5.5 as its new state value; the other candidate nodes do not update their state values and discard the received data packet.
Step 6, the data packet forwarding process of step 5 is repeated until the data packet is forwarded to the sink node. By continuously forwarding data packets and iterating the state values, the optimal routing path is finally obtained. Fig. 5 is a schematic diagram of a learned routing path. If no path of a data packet can reach the sink node, the transmission of that packet fails. When 30% of data packets fail to be delivered within a sufficiently long period, the lifetime of the wireless sensor network is considered ended. Fig. 6 is a schematic diagram of the end of the network lifetime, in which black filled circles indicate sensor nodes that have died of energy depletion.

Claims (4)

1. An opportunistic routing protocol based on a Markov decision process model, comprising the steps of:
step 1, evaluating environment link quality, evaluating packet receiving rate:
collecting packet receiving rate data under the same RSSI value and LQI mean value and packet receiving rate data under different communication distances to establish a sample space, and performing curvilinear family regression fitting on the LQI mean value and the packet receiving rate data to obtain an estimation formula of the packet receiving rate;
step 2, scattering wireless sensor nodes, and establishing a wireless sensor network: the wireless sensor network comprises a sink node and is responsible for collecting data collected by common nodes in a region and uploading the data to the network;
step 3, periodically broadcasting and receiving detection packets by the sensor nodes, and establishing a neighbor information table;
step 4, the sensor nodes establish a candidate node set;
step 5, solving the forwarding node selection problem in the opportunistic routing with a Markov decision process: the node holding a valid data packet broadcasts it, each candidate node that receives the packet recalculates its corresponding state value according to a value iteration formula, and the data packet sender selects the node with the largest corresponding state value as the next-hop forwarding node.
Step 6, repeating the data packet forwarding process of the step 5 until the data packet is forwarded to the sink node; and finally, obtaining the optimal routing path by continuously forwarding the data packet and iterating the state value.
2. The opportunistic routing protocol according to claim 1, wherein the neighbor information table established in step 3 comprises the neighbor node IDs, the neighbor nodes' corresponding state values, the neighbor nodes' sleep/listening duty cycles, the neighbor nodes' candidate node sets, and the packet reception rates between the node itself and its neighbor nodes.
3. The opportunistic routing protocol according to claim 1, wherein the candidate node set is established in step 4 as follows: the sensor node sorts its neighbor nodes by their corresponding state values, and selects, in decreasing order, the neighbor nodes whose state values are greater than or equal to its own as candidate nodes, stopping once the probability that at least one candidate successfully receives a data packet broadcast by the sensor node exceeds a set value, or when no eligible neighbor node remains; the candidate nodes constitute the candidate node set.
4. The opportunistic routing protocol of claim 1 wherein step 5, solving the forwarding node selection problem in opportunistic routing with a markov decision process, specifically comprises the steps of:
5.1 modeling opportunistic routing problems with Markov decision process models
5.2 calculating the state transition probability matrix P: the state transition probability matrix P is obtained from the packet reception rates between nodes; the probability P^a_(i,j_y) that the valid data packet moves from node i to candidate node j_y, to be forwarded by j_y, is calculated as

P^a_(i,j_y) = L_(i,j_y) · ∏_(t=1)^(y-1) (1 - L_(i,j_t))

where L_(i,j_y) is the probability that node j_y successfully receives node i's broadcast data packet, L_(i,j_t) is the probability that node j_t successfully receives it, and m denotes the number of candidate nodes in node i's candidate node set. For the nodes other than the candidates, j_x (x = m+1, m+2, …, N),

P^a_(i,j_x) = 0,

where N denotes that the network has N sensor nodes. The probabilities P^a_(i,j) computed by node i fill row i of the state transition probability matrix P in columns j_1, j_2, …, j_m, …, j_N; each node computes its corresponding row, yielding the complete state transition probability matrix P.
5.3 formulating a reward function:
formulating an action reward function R_s = -1 + f(k_E), where R_s is the action reward for broadcasting a data packet in state s, k_E is the ratio of the current node's remaining energy to its initial energy, and f(k_E) is a function of that ratio;
f(k_E) is designed as shown in the following formula (given in the original only as an image):

[formula for f(k_E): image in original]

the less residual energy the current node has, the higher the cost of forwarding the data packet and the smaller the action reward;
5.4, an action strategy is established: adopting a greedy strategy, namely, taking the optimal action under the state s to maximize the state value of the state s after iteration;
5.5 the candidate nodes iterate and return the corresponding state values:
the node holding the valid data packet broadcasts the packet, and each candidate node receiving the packet recalculates its own state value according to the dynamic-programming value iteration formula

v_{k+1}(s) = max_{a∈A} [ R_s^a + γ · Σ_{s'} P_{ss'}^a · v_k(s') ]

but does not immediately replace its original state value with the new one; instead, it transmits the new state value back to the source node. In the formula, k represents the k-th iteration and k+1 the (k+1)-th iteration; v represents the state value; s represents the current state, i.e. the state corresponding to the candidate node receiving the data packet; s' represents the state at the next moment; v_{k+1}(s) is the value of state s at the (k+1)-th iteration; v_k(s') is the value of state s' at the k-th iteration; a represents an action that can be taken in the current state s, and A is the action set; γ is the discount factor; R is the action reward, with R_s^a representing the reward for taking action a in state s; P is the state transition probability matrix, with P_{ss'}^a representing the probability that the state becomes s' at the next moment after action a is taken in state s, found as the corresponding value in P; max indicates that the action strategy is the greedy strategy;
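A minimal sketch of one value-iteration update as described in step 5.5 (Python; the dict-based layout of P, R and v is an illustrative assumption, not the patent's data structures):

```python
def value_update(s, states, actions, P, R, v, gamma=0.9):
    """One greedy value-iteration step:
    v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s,a,s') * v_k(s') ].

    P: dict keyed by (state, action) -> {next_state: probability}
    R: dict keyed by (state, action) -> action reward
    v: dict mapping each state to its current (k-th iteration) value
    """
    best = float('-inf')
    for a in actions:
        # expected discounted value of the successor states under action a
        q = R[(s, a)] + gamma * sum(P[(s, a)].get(s2, 0.0) * v[s2]
                                    for s2 in states)
        best = max(best, q)  # greedy strategy: keep the best action's value
    return best
```

As in the protocol, the caller would hold this returned value aside and send it back to the source node rather than overwrite v[s] immediately.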
5.6 selecting the next-hop forwarding node:
the data packet sending node receives the state values returned by the candidate nodes, selects the node with the highest state value as the next-hop forwarding node, and broadcasts the data packet.
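The selection in step 5.6 reduces to an argmax over the returned values (a minimal Python sketch; the function name and dict layout are illustrative):

```python
def select_next_hop(returned_values):
    """Pick the next-hop forwarder: among the candidate nodes that
    returned their recomputed state values, choose the one whose
    state value is highest.

    returned_values: dict mapping candidate node id -> returned state value.
    """
    return max(returned_values, key=returned_values.get)
```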
CN202010331293.7A 2020-04-24 2020-04-24 Opportunistic routing protocol design method based on Markov decision process model Active CN111629415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010331293.7A CN111629415B (en) 2020-04-24 2020-04-24 Opportunistic routing protocol design method based on Markov decision process model


Publications (2)

Publication Number Publication Date
CN111629415A true CN111629415A (en) 2020-09-04
CN111629415B CN111629415B (en) 2023-04-28

Family

ID=72260539


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112702710A (en) * 2020-12-22 2021-04-23 杭州电子科技大学 Opportunistic routing optimization method based on link correlation in low duty ratio network
CN112954769A (en) * 2021-01-25 2021-06-11 哈尔滨工程大学 Underwater wireless sensor network routing method based on reinforcement learning
CN113950113A (en) * 2021-10-08 2022-01-18 东北大学 Hidden Markov-based Internet of vehicles switching decision algorithm
CN114125984A (en) * 2021-11-22 2022-03-01 北京邮电大学 Efficient opportunistic routing method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105848247A (en) * 2016-05-17 2016-08-10 中山大学 Vehicular Ad Hoc network self-adaption routing protocol method





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant