CN116170854A - DQN-OLSR routing method based on deep reinforcement learning DQN - Google Patents

DQN-OLSR routing method based on deep reinforcement learning DQN

Info

Publication number
CN116170854A
Authority
CN
China
Prior art keywords
node
mpr
dqn
reinforcement learning
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310137402.5A
Other languages
Chinese (zh)
Inventor
郭剑辉
杨利行
濮存来
陶叔银
董宏林
肖志杰
李伦波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202310137402.5A priority Critical patent/CN116170854A/en
Publication of CN116170854A publication Critical patent/CN116170854A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/04Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/04Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
    • H04W40/10Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources based on available power or energy
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/12Communication route or path selection, e.g. power-based or shortest path routing based on transmission quality or channel quality
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks


Abstract

The application discloses a DQN-OLSR routing method based on deep reinforcement learning (DQN). Each node adds its load capacity, coordinate and velocity information (from which a weighted link lifetime is calculated), and residual energy to its HELLO messages. During route calculation, a Q function is trained through deep reinforcement learning DQN using four parameters: the number of node neighbors, node load capacity, weighted link lifetime, and residual energy. When the MPR set is computed, the weights of the four parameters are obtained from the Q function, the comprehensive quality of each node is calculated, and the node with the highest comprehensive quality is selected as an MPR node. When a node sends a data packet, the routing table is calculated according to the shortest-hop-count principle and the message is forwarded through the optimal MPR nodes. The method improves the stability of the whole network, increases network throughput, and reduces the packet loss rate.

Description

DQN-OLSR routing method based on deep reinforcement learning DQN
Technical Field
The invention belongs to the technical field of mobile ad hoc network OLSR routing, and particularly relates to a DQN-OLSR routing method based on deep reinforcement learning DQN (Deep Q Network).
Background
Optimized Link State Routing (OLSR) is an optimization of the classical link state routing algorithm and is commonly used in mobile ad hoc networks. The most distinctive feature of this protocol compared with traditional link state protocols is the multipoint relay (multipoint relays, MPRs) mechanism. Only MPR nodes forward control information; this significantly reduces the number of control messages in the network, reduces the overhead of the routing protocol, and improves network performance. The quality of the MPR set therefore directly affects the quality of the whole network.
Each node periodically transmits HELLO messages to achieve neighbor discovery and information sharing. Topology discovery and route computation rely on TC messages, and only MPR nodes forward TC messages, so the MPR set must cover all two-hop neighbor nodes. MPR selection has been proven to be an NP-complete problem, and the conventional greedy algorithm has difficulty reaching the optimal solution.
With the development of technology and the reduction of production costs, unmanned aerial vehicles are increasingly used in fields such as rescue and detection. An unmanned aerial vehicle network is characterized by high-speed movement and limited energy, and the original OLSR protocol has difficulty meeting the requirements of such an ad hoc network. First, the original OLSR protocol uses only the number of node neighbors as the factor for MPR selection, so it struggles to cope with a UAV network whose topology changes frequently due to high-speed movement, which degrades the communication stability of such networks. Second, if a UAV, as an energy-constrained network node, is always used as an MPR node to forward TC messages, it may run out of energy and interrupt communication.
Disclosure of Invention
The present invention is directed to solving the problems set forth in the background art by providing a DQN-OLSR routing method based on deep reinforcement learning DQN. The method introduces three parameters, node load capacity, weighted link lifetime and residual energy, and calculates the weights of these parameters with the deep reinforcement learning DQN algorithm, so that an optimal MPR set is obtained and the difficulty traditional methods have in adapting to rapid changes of network topology is overcome. Compared with the traditional OLSR routing method, the method significantly improves routing stability, increases network throughput, and reduces the packet loss rate.
In order to achieve the purpose of the invention, the invention discloses a DQN-OLSR routing method based on deep reinforcement learning DQN, which comprises the following steps:
Step M1: in the OLSR protocol, in addition to the existing node quality metric of the number of node neighbors, add three node quality metrics suited to a network that moves at high speed and is energy-constrained, namely node load capacity, weighted link lifetime and node residual energy; add these metrics to the node's HELLO messages (a data-structure sketch of the extended HELLO message follows these steps);
Step M2: the nodes perform neighbor discovery and information sharing directly through HELLO messages, and compute a neighbor table;
Step M3: using the neighbor table, the node computes the weights of the four node quality metrics with deep reinforcement learning DQN, calculates the comprehensive quality of each node from these weights, and then computes the optimal MPR set in descending order of comprehensive quality;
Step M4: the nodes in the MPR set forward TC messages to discover the network topology and obtain a topology table;
Step M5: based on the topology table and following the shortest-hop-count principle, compute the optimal routing table, completing the route calculation.
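By way of illustration, the extended HELLO payload of step M1 may be sketched as the following data structure; the class and field names are illustrative assumptions and not the protocol's actual message format.

```python
# Illustrative sketch (assumed names, not the protocol's actual wire format) of the
# HELLO payload extended as in step M1: standard neighbor information plus the node
# metrics used for MPR selection.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExtendedHello:
    originator: str                   # address of the node sending the HELLO
    neighbors: List[str]              # one-hop neighbors advertised in the HELLO
    position: Tuple[float, float]     # node coordinates, used to estimate link lifetime
    velocity: Tuple[float, float]     # node velocity vector, used to estimate link lifetime
    load_capacity: float              # node load capacity L
    residual_energy: float            # node residual energy E
```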
Further, the node load capacity L is calculated from the node's current message queue length and maximum message queue length:

[equation image: L expressed as a function of S_load and l]

where S_load is the current message queue length of the node and l is the maximum message queue length.
Further, the weighted link lifetime is obtained by weighting the link lifetime between nodes and the integrated link lifetime. Let nodes O and Q be one-hop neighbor nodes with velocities V_O and V_Q respectively; the velocity V_QO of Q relative to O is:

V_QO = V_Q - V_O

The link lifetime t_OQ between nodes O and Q is:

t_OQ = |QD| / |V_QO|

where |OQ| is the distance between node O and node Q, a circle is drawn with the communication radius R of node O, the vector V_QO is the motion vector of point Q relative to O, point D is the intersection of that motion vector with the circle, |OD| is the distance between point O and point D, β is the acute angle between vector QO and vector QD, and |QD| is the distance point Q must travel before leaving the communication range of node O.

From the link lifetimes t_Qi between node Q and all of its one-hop neighbors, the integrated link lifetime T_avg(Q) is calculated by accumulating and averaging:

T_avg(Q) = ( Σ_{i ∈ N1(Q)} t_Qi ) / num_N1(Q)

where N1(Q) is the symmetric one-hop neighbor set of node Q and num_N1(Q) is the size of N1(Q).

The final weighted link lifetime T_OQ is:

T_OQ = α · t_OQ + β · T_avg(Q)

where α = 0.7 and β = 0.3.
Further, the node residual energy E represents the energy the current node has left for communication; if a node has little energy left, it is not suitable as an MPR node, which avoids routing errors caused by the node dropping out of the network due to insufficient energy.
Define the initial energy of a communication node as E_0. A communication node is in one of four states during wireless communication, respectively: the sleep state Sleep, the idle state Idle, the message-transmitting state Tx, and the message-receiving state Rx; in general, power consumption is higher in the transmitting state Tx and the receiving state Rx, and lower in the sleep state Sleep. Multiplying the power consumption of each of the four states by the operating time of the corresponding state and summing gives the energy E_cost consumed by node communication:

E_cost = P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep

where P_tx, P_rx, P_idle and P_sleep respectively represent the operating powers of the four states, and T_tx, T_rx, T_idle and T_sleep respectively represent the operating durations of the four states.

From the initial energy and the consumed energy, the energy E remaining at the node is calculated:

E = E_0 - E_cost = E_0 - (P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep).
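For concreteness, a minimal sketch of these two scalar metrics follows. The residual-energy formula is taken directly from the text; the load-capacity expression is an assumption (the free fraction of the message queue), since the original gives that formula only as an equation image.

```python
# Sketch of the two scalar metrics above. residual_energy follows the text directly;
# the load-capacity expression is an assumption (free fraction of the message queue),
# since the original gives that formula only as an equation image.
def load_capacity(s_load: int, l_max: int) -> float:
    """Node load capacity L from the current queue length S_load and maximum length l."""
    return 1.0 - s_load / l_max  # assumed form: 1.0 when the queue is empty, 0.0 when full

def residual_energy(e0: float, power: dict, duration: dict) -> float:
    """E = E_0 - (P_tx*T_tx + P_rx*T_rx + P_idle*T_idle + P_sleep*T_sleep)."""
    e_cost = sum(power[s] * duration[s] for s in ("tx", "rx", "idle", "sleep"))
    return e0 - e_cost

# Example: 3 of 10 queue slots used; 100 J initial energy budget.
L = load_capacity(3, 10)  # 0.7
E = residual_energy(100.0,
                    power={"tx": 1.2, "rx": 0.9, "idle": 0.1, "sleep": 0.01},
                    duration={"tx": 10, "rx": 15, "idle": 60, "sleep": 300})  # 65.5 J
```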
Further, calculating the weights using DQN in step M3 specifically includes:
The problem of adjusting the weights of the four metrics (number of node neighbors, weighted link lifetime, node load capacity and node residual energy) is described as a Markov decision process (Markov Decision Process, MDP): each node dynamically interacts with the environment while selecting MPR nodes, and the optimal action in the current state is obtained through reinforcement learning. In the reinforcement learning model, the agent is each node, the environment is the whole communication network, the state is the values of the four metrics, and the action is the weights of the four metrics.
In the reinforcement learning model, the utility function is defined as:

U = α·lb(L) + β·lb(T) - γ·lb(E)

where L is the load capacity of the node, T is the average weighted link lifetime between the computing node and all of its MPR nodes, and E is the average residual energy of all MPR nodes of the computing node; α, β and γ are respectively the reward coefficients of the three parameters.
Analysis of the utility function U shows that it favors selecting, as MPR nodes, neighbors with long link lifetimes and large residual energy while ensuring that nodes have strong load capacity, which optimizes energy consumption and enhances network performance.
The difference of the utility function U between consecutive moments defines the reward function R_t:

R_t = U_t - U_{t-1},  if |U_t - U_{t-1}| > δ
R_t = 0,              otherwise

where δ is a threshold that adjusts the size of the reward; when the difference between U_t at the current moment and the previous moment exceeds δ, the environment gives the difference as the reward, otherwise the reward is zero; when U_t at the later moment is greater than at the earlier moment, the reward is positive, i.e. a reward, otherwise it is negative, i.e. a penalty.
After the problem has been described as an MDP, deep reinforcement learning DQN is used to set the node quality metric weights.
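A minimal sketch of the utility and reward functions defined above is given below; lb is taken as the base-2 logarithm, and the coefficient and threshold values are placeholders rather than values fixed by the method.

```python
# Sketch of the utility and reward defined above; lb is taken as the base-2 logarithm,
# and the coefficient and threshold values are placeholders, not values fixed by the method.
import math

def utility(L: float, T: float, E: float,
            alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """U = alpha*lb(L) + beta*lb(T) - gamma*lb(E)."""
    return alpha * math.log2(L) + beta * math.log2(T) - gamma * math.log2(E)

def reward(u_curr: float, u_prev: float, delta: float = 0.05) -> float:
    """R_t = U_t - U_{t-1} if the change exceeds the threshold delta, otherwise 0."""
    diff = u_curr - u_prev
    return diff if abs(diff) > delta else 0.0
```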
Further, the method for calculating the optimal MPR set in step M3 includes:
Let the node currently computing its MPR set be A; define N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A. The specific calculation steps are as follows (a code sketch of this selection loop is given after the steps):
Step M3-1: add the nodes in N1_A whose message-forwarding willingness is "always willing to forward" to MPR_A; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
Step M3-2: if a node in N1_A is the only one-hop neighbor of some node in N2_A, add that N1_A node to the MPR_A set; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
Step M3-3: if N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, then for each node Y in N1_A calculate and record the number of one-hop neighbors C_Y of Y, the load condition L_Y, the weighted link lifetime T_AY of the link A-Y, and the residual energy E_Y of node Y;
Step M3-4: input the current state information, i.e. the values of the four metrics, into the deep reinforcement learning DQN algorithm; the output action, i.e. the metric weights, gives the weights of the four metrics of step M3-3;
Step M3-5: from the four metrics of step M3-3 and the weights of step M3-4, calculate the comprehensive quality Comp_Y of node Y:

Comp_Y = α·C_Y + β·L_Y + γ·T_AY + δ·E_Y

Step M3-6: select from N1_A the node with the largest comprehensive quality Comp_Y and add it to MPR_A; delete the N1_A node added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set; return to step M3-3 until N2_A is empty, at which point the MPR_A calculation ends.
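The selection loop of steps M3-1 to M3-6 can be sketched as follows, with the DQN call replaced by a precomputed weight vector; the function and parameter names are illustrative assumptions.

```python
# Sketch of the selection loop in steps M3-1 to M3-6, under simplifying assumptions:
# willingness, the per-neighbor metrics and the DQN-provided weights are passed in as
# plain dictionaries, and the DQN call is replaced by a precomputed weight vector.
from typing import Dict, Set

def select_mpr(n1: Set[str],                         # one-hop neighbors of A (N1_A)
               n2: Set[str],                         # two-hop neighbors of A (N2_A)
               covers: Dict[str, Set[str]],          # one-hop neighbor -> two-hop nodes it reaches
               willingness_always: Set[str],         # neighbors always willing to forward
               metrics: Dict[str, Dict[str, float]], # per-neighbor values of C, L, T, E
               weights: Dict[str, float]) -> Set[str]:
    mpr: Set[str] = set()
    uncovered = set(n2)
    candidates = set(n1)

    def add(node: str) -> None:
        mpr.add(node)
        candidates.discard(node)
        uncovered.difference_update(covers.get(node, set()))

    # M3-1: neighbors that are always willing to forward
    for y in list(candidates & willingness_always):
        add(y)
    # M3-2: neighbors that are the only path to some two-hop node
    for z in list(uncovered):
        reachers = [y for y in candidates if z in covers.get(y, set())]
        if len(reachers) == 1:
            add(reachers[0])
    # M3-3..M3-6: repeatedly add the candidate with the highest comprehensive quality
    while uncovered and candidates:
        def comp(y: str) -> float:
            m = metrics[y]
            return (weights["C"] * m["C"] + weights["L"] * m["L"]
                    + weights["T"] * m["T"] + weights["E"] * m["E"])
        add(max(candidates, key=comp))
    return mpr
```

Each call to add() removes the chosen neighbor from the candidate set and marks the two-hop nodes it covers, so the loop terminates once N2_A is fully covered or no candidates remain.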
Compared with the prior art, the invention offers notable advances: 1) on the premise of preserving the normal routing function, the method computes an optimal MPR set suited to frequently changing network topologies, which significantly reduces routing errors, reduces data packet loss, and improves the communication performance of the wireless ad hoc network; 2) the DQN-OLSR routing method based on deep reinforcement learning DQN adds three metrics, node load capacity, weighted link lifetime and current node residual energy, to assist the calculation of the MPR set, overcoming the poor routing stability and low network performance of the traditional MPR set in networks with high-speed movement and energy constraints.
In order to more clearly describe the functional characteristics and structural parameters of the present invention, the following description is made with reference to the accompanying drawings and detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is an overall frame diagram of the method of the present invention;
FIG. 2 is a schematic diagram of a DQN neural network;
fig. 3 is a link lifetime calculation geometry diagram.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention; all other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention will be further described with reference to fig. 1, 2 and 3. The whole flow of the invention is specifically as follows:
step 1, three node quality measurement indexes suitable for a network which moves at a high speed and has energy limitation are provided, and the indexes are added into HELLO information of the nodes;
the three indexes are respectively: node load capacity, weighted link lifetime, and node residual energy. The node load capacity is calculated from the current message queue length and the maximum message queue length of the node. The weighted link survival time can be obtained through calculation of mathematical modeling through node position information and speed information. The node remaining energy may be calculated from the initial energy and the energy consumed.
Step 2, the node performs neighbor discovery and information sharing through HELLO information, and calculates to obtain a neighbor list;
Step 3: obtain the values of the four node quality metrics from the HELLO messages, calculate the metric weights by means of deep reinforcement learning DQN, and calculate the comprehensive quality of the nodes from these weights.
In the reinforcement learning model, each communication node is an agent; the agent continuously communicates with the other nodes over the network and performs the optimal action in each state, so that the optimal MPR set is computed. The state space is the values of the four node quality metrics, and the action space is the weights of the four node quality metrics.
Step 4: begin the MPR calculation. Let the node currently computing its MPR set be A; define N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A. The calculation proceeds as follows: add the nodes in N1_A whose message-forwarding willingness is "always willing to forward" to MPR_A; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set.
Step 5: if a node in N1_A is the only one-hop neighbor of some node in N2_A, add that node of N1_A to the MPR_A set; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set.
Step 6: if N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, select from N1_A the node with the largest comprehensive quality Comp_Y and add it to MPR_A; delete the N1_A node added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set.
Step 7: return to step 6 until N2_A is empty, at which point the MPR_A calculation ends; the resulting MPR set is the optimal MPR set adapted to a high-speed, energy-constrained network.
Step 8: the MPR nodes forward TC messages to discover the network topology structure and obtain a topology table.
Step 9: based on the topology table, calculate the optimal routing table by means of the optimal MPR set according to the shortest-hop-count principle; this completes the DQN-OLSR routing method based on deep reinforcement learning DQN.
The method is realized by the following technical scheme:
In a first aspect, the present invention proposes three parameters, node load capacity, weighted link lifetime and node residual energy, as metric factors for MPR set selection; the three newly proposed metric factors are better suited to fast-moving communication scenarios. Specifically:

The node load capacity L is calculated from the node's current message queue length S_load and the maximum message queue length l:

[equation image: L expressed as a function of S_load and l]
The weighted link lifetime is calculated by weighting the link lifetime and the integrated link lifetime. Let nodes O and Q be one-hop neighbor nodes with velocities V_O and V_Q respectively; the velocity V_QO of Q relative to O is:

V_QO = V_Q - V_O

The link lifetime t_OQ between nodes O and Q is:

t_OQ = |QD| / |V_QO|

where |OQ| is the distance between point O and point Q, point D is the point at which Q crosses the edge of the communication range of point O, |OD| is the distance between point O and point D, and β is the acute angle between vector QO and vector QD.
As shown in fig. 3, which is a geometric schematic of calculating the link lifetime for points O and Q: R is the communication radius of node O, and a circle of radius R is drawn around node O. The vector V_QO is the motion vector of point Q relative to O, and point D is the intersection of this motion vector with the circle. The acute angle between vector QO and vector QD is taken as angle β. The distance |QD| that point Q must travel before leaving the communication range of point O can be calculated from triangle OQD (in which |OD| = R). Finally, dividing the distance |QD| by the speed of point Q relative to point O gives the link lifetime.
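Under the assumption of two-dimensional coordinates, the geometric calculation described above can be sketched as follows: point D is obtained by intersecting the ray from Q along the relative velocity V_QO with the circle of radius R around O, and the link lifetime is |QD| divided by the relative speed.

```python
# Sketch of the geometric calculation above, assuming 2-D coordinates: point D is found
# by intersecting the ray from Q along the relative velocity V_QO with the circle of
# radius R centered on O, and the link lifetime is |QD| divided by the relative speed.
import math

def link_lifetime(pos_o, vel_o, pos_q, vel_q, R: float) -> float:
    vx, vy = vel_q[0] - vel_o[0], vel_q[1] - vel_o[1]   # relative velocity V_QO
    qx, qy = pos_q[0] - pos_o[0], pos_q[1] - pos_o[1]   # position of Q in O's frame
    speed = math.hypot(vx, vy)
    if speed == 0.0:
        return math.inf                                 # no relative motion: link never expires
    # Solve |Q + s*u|^2 = R^2 for s = |QD|, where u is the unit motion direction of Q.
    ux, uy = vx / speed, vy / speed
    b = qx * ux + qy * uy                               # projection of OQ onto the motion direction
    c = qx * qx + qy * qy - R * R                       # negative while Q is inside the circle
    disc = b * b - c
    if disc < 0.0:
        return 0.0                                      # Q is already outside the communication range
    qd = -b + math.sqrt(disc)                           # distance Q travels before crossing the circle
    return qd / speed
```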
After the link lifetime is calculated, the link lifetimes between a node (e.g. node Q) and all of its neighbor nodes are accumulated and averaged to obtain the integrated link lifetime T_avg(Q) of node Q:

T_avg(Q) = ( Σ_{i ∈ N1(Q)} t_Qi ) / num_N1(Q)

where N1(Q) is the symmetric one-hop neighbor set of node Q and num_N1(Q) is the number of nodes in N1(Q).
Then t_OQ and T_avg(Q) are combined with fixed weights to obtain the weighted link lifetime T_OQ between node O and node Q:

T_OQ = α · t_OQ + β · T_avg(Q)

where α = 0.7 and β = 0.3.
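A short sketch of this aggregation step, assuming the per-link lifetimes have already been computed (for example with a geometric routine such as link_lifetime above):

```python
# Sketch of the aggregation step above, assuming the per-link lifetimes have already been
# computed (for example with a geometric routine such as link_lifetime above).
from statistics import mean

def weighted_link_lifetime(t_oq: float, neighbor_lifetimes: list,
                           alpha: float = 0.7, beta: float = 0.3) -> float:
    # Integrated lifetime of Q: average of its link lifetimes with all one-hop neighbors.
    t_avg_q = mean(neighbor_lifetimes) if neighbor_lifetimes else t_oq
    return alpha * t_oq + beta * t_avg_q
```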
The node residual energy E represents the energy the current node has left for communication; if a node has little energy left, it is not suitable as an MPR node, which avoids routing errors caused by the node dropping out of the network due to insufficient energy. Define the initial energy of a communication node as E_0. A communication node is mainly in one of four states during wireless communication, namely: the sleep state (Sleep), the idle state (Idle), the message-transmitting state (Tx), and the message-receiving state (Rx). Multiplying the power consumption of each of the four states by the operating time of the corresponding state and summing gives the energy E_cost consumed by node communication:

E_cost = P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep

From the initial energy and the consumed energy, the remaining energy E of the node can be calculated:

E = E_0 - E_cost = E_0 - (P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep).
In a second aspect, the present invention uses the three node metrics above, combined with deep reinforcement learning DQN, to compute the optimal MPR set, as follows:
Let the node currently computing its MPR set be A; define N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A. The specific calculation steps are:
(1) Add the nodes in N1_A whose message-forwarding willingness is "always willing to forward" to MPR_A. Delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
(2) If a node in N1_A is the only one-hop neighbor of some node in N2_A, add that N1_A node to the MPR_A set. Delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
(3) If N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, then for each node Y in N1_A calculate and record the number of one-hop neighbors C_Y of Y, the load condition L_Y, the weighted link lifetime T_AY of the link A-Y, and the residual energy E_Y of node Y;
(4) Input the current state information, i.e. the values of the four metrics, into the deep reinforcement learning DQN algorithm; the output action, i.e. the metric weights, gives the weights of the four metrics of step (3);
(5) From the four metrics of step (3) and the weights of step (4), calculate the comprehensive quality Comp_Y of node Y:

Comp_Y = α·C_Y + β·L_Y + γ·T_AY + δ·E_Y

(6) Select from N1_A the node with the largest comprehensive quality Comp_Y and add it to MPR_A; delete the N1_A node added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set. Return to (3) until N2_A is empty, at which point the MPR_A calculation ends.
So far, the calculation of the optimal MPR set is complete. This calculation method takes the integrated link lifetime into account and preferentially selects nodes with long link lifetimes as MPRs, ensuring communication stability; it also takes node load capacity and residual energy into account, avoiding the network congestion or network interruption caused by overloaded or energy-depleted nodes.
In a third aspect, the use of deep reinforcement learning DQN to calculate the weights of metric factors is also an important part of the invention.
Reinforcement learning is an important branch of machine learning that describes and solves the problem of an agent learning, through interaction with its environment, a strategy that maximizes the return or achieves a specific goal. The agent adjusts its actions according to the reward obtained by interacting with the environment, and this behavior can be expressed as a Markov decision process (Markov Decision Process, MDP). The most important element of a Markov decision process is the five-tuple <S, A, P, R, γ>, where S is the state set, A is the action set, and P is the state transition probability matrix, satisfying

P_ss'^a = P[ S_{t+1} = s' | S_t = s, A_t = a ]

R is the reward function, and γ is a discount coefficient in [0, 1].
Specifically, the problem of adjusting the four metric weights is described as an MDP: each node dynamically interacts with the environment while selecting MPR nodes, and the optimal action in the current state is obtained through reinforcement learning. In the reinforcement learning model, the agent is each node, the environment is the whole communication network, the state is the values of the four metrics, and the action is the weights of the four metrics.
In the reinforcement learning model, the utility function is defined as:

U = α·lb(L) + β·lb(T) - γ·lb(E)

where L is the load capacity of the node, T is the average weighted link lifetime between the computing node and all of its MPR nodes, and E is the average residual energy of all MPR nodes of the computing node; α, β and γ are respectively the reward coefficients of the three parameters.
Analysis of the utility function U shows that it favors selecting, as MPR nodes, neighbors with long link lifetimes and large residual energy while ensuring that nodes have strong load capacity, which optimizes energy consumption and enhances network performance.
The difference of the utility function U between consecutive moments defines the reward function R_t:

R_t = U_t - U_{t-1},  if |U_t - U_{t-1}| > δ
R_t = 0,              otherwise

where δ is a threshold that adjusts the size of the reward. When the difference between U_t at the current moment and the previous moment exceeds δ, the environment gives the difference as the reward; otherwise the reward is zero. When U_t at the later moment is greater than at the earlier moment, the reward is positive, i.e. a reward; otherwise it is negative, i.e. a penalty.
DQN is a further development of Q-learning: a neural network is added to fit the Q function, which overcomes limitations that Q-learning cannot handle. The core of Q-learning is the Q table. In the present invention the state is the values of the four metrics and the action is the weights of the four metrics, so the Q table would have a very large dimension that Q-learning cannot manage; deep reinforcement learning DQN is therefore used instead. Only the structure and parameters of the neural network need to be stored, rather than a huge Q table; in addition, similar state inputs produce similar action outputs, giving stronger generalization ability. A schematic diagram of the DQN neural network is shown in fig. 2.
However, simply combining Q-learning with a neural network causes two problems: first, the samples used to train a neural network are assumed to be mutually independent, whereas the successive states of reinforcement learning are correlated; second, approximating the Q table with a nonlinear neural network may prevent training from converging. To address these two issues, an experience pool and a fixed Q-target strategy are introduced. The experience pool uses an off-policy strategy: experiences obtained by exploration are stored in the pool, and experiences are then sampled from it at random to update the network, which breaks the correlation between experiences and improves their utilization. The fixed Q-target is used to accelerate training convergence: the DQN contains two networks with the same structure but different parameters, one used to predict the Q estimate (MainNet) and one used to predict the Q target (the target network); the target network keeps older parameters for a period of time, while the latest parameters are used for the estimated value:

targetQ = r + γ · max_{a'} Q(s', a', θ)

The loss is obtained from targetQ and the Q estimate, using the mean squared error loss:

LOSS(θ) = E[ (targetQ - Q(s, a, θ))² ]
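A minimal PyTorch sketch of this training mechanism is given below: a MainNet and a target network of identical structure, an experience pool sampled uniformly at random, the targetQ update, the mean-squared-error loss, and a periodic copy of MainNet's parameters into the target network. The network sizes, buffer size and hyperparameter values are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of the mechanism described above: MainNet and a target network with
# identical structure, an experience pool sampled at random, the targetQ update, and a
# periodic copy of MainNet's parameters into the target network. Sizes and
# hyperparameters are illustrative.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 8, 0.9   # 4 metric values in; each action indexes one candidate weight vector

def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

main_net, target_net = make_net(), make_net()
target_net.load_state_dict(main_net.state_dict())            # start with identical parameters
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                                 # experience pool

def train_step(batch_size: int = 32) -> None:
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)                 # break correlation between samples
    s, a, r, s_next = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)

    q = main_net(s).gather(1, a).squeeze(1)                   # Q(s, a, theta) from MainNet
    with torch.no_grad():                                     # fixed Q-target from the (older) target network
        target_q = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, target_q)                # LOSS = E[(targetQ - Q)^2]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target() -> None:
    """Copy MainNet parameters into the target network every few hundred steps."""
    target_net.load_state_dict(main_net.state_dict())
```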
specifically: initializing MainNet and a target, and updating parameters of the MainNet according to a loss function, wherein the target is fixed; after a plurality of iterations, the parameters of the MainNet are all assigned to the target network, so that a fixed Q-target mechanism is realized. The above process is iterated continuously until the training converges. In this process, targetQ is fixed for a period of time, making the updating of the algorithm more stable.
In a fourth aspect, after the optimal MPR set is obtained through the deep reinforcement learning DQN calculation, the routing table is computed from the topology table information. The topology table is obtained from the TC messages forwarded by the MPR nodes, and the optimal MPR nodes yield the optimal topology table; the routing table is then calculated from the topology table according to the shortest-hop-count principle to obtain the optimal routing path, as sketched below.
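The final routing-table computation under the shortest-hop-count principle can be sketched as a breadth-first search over the topology table; the data-structure names here are illustrative.

```python
# Sketch of the final step: from the topology table (an adjacency list built from TC
# messages), compute next hops toward every destination by breadth-first search, which
# realizes the shortest-hop-count principle. Data-structure names are illustrative.
from collections import deque
from typing import Dict, List

def build_routing_table(topology: Dict[str, List[str]], source: str) -> Dict[str, str]:
    """Return {destination: next_hop} for all nodes reachable from `source`."""
    next_hop: Dict[str, str] = {}
    visited = {source}
    queue = deque()
    for neighbor in topology.get(source, []):                 # one-hop neighbors: next hop is themselves
        next_hop[neighbor] = neighbor
        visited.add(neighbor)
        queue.append(neighbor)
    while queue:
        node = queue.popleft()
        for nxt in topology.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                next_hop[nxt] = next_hop[node]                # inherit the first hop on the shortest path
                queue.append(nxt)
    return next_hop
```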
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A DQN-OLSR routing method based on deep reinforcement learning DQN, characterized by comprising the following steps:
step M1, in the OLSR protocol, in addition to the node quality metric of the number of node neighbors, adding three node quality metrics suited to a network that moves at high speed and is energy-constrained, the added three node quality metrics comprising node load capacity, weighted link lifetime and node residual energy; adding the added three node quality metrics into the HELLO messages of the node;
step M2, the node performing neighbor discovery and information sharing directly through HELLO messages, and computing a neighbor table;
step M3, the node using the neighbor table and deep reinforcement learning DQN to compute the weights of the four node quality metrics, calculating the comprehensive quality of the nodes according to the weights, and then computing the optimal MPR set in descending order of node comprehensive quality;
step M4, the nodes in the MPR set forwarding TC messages to realize network topology discovery and obtain a topology table;
and step M5, based on the topology table and following the shortest-hop-count principle, calculating the optimal routing table, completing the route calculation.
2. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein the node load capacity L is calculated from the node's current message queue length and maximum message queue length:

[equation image: L expressed as a function of S_load and l]

where S_load is the current message queue length of the node and l is the maximum message queue length.
3. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein the weighted link lifetime is calculated by weighting the link lifetime between nodes and the integrated link lifetime; nodes O and Q are set to be one-hop neighbor nodes with velocities V_O and V_Q respectively, and the velocity V_QO of Q relative to O is:

V_QO = V_Q - V_O

the link lifetime t_OQ between nodes O and Q is:

t_OQ = |QD| / |V_QO|

where |OQ| is the distance between node O and node Q, a circle is drawn with the communication radius R of node O, the vector V_QO is the motion vector of point Q relative to O, point D is the intersection of that motion vector with the circle, |OD| is the distance between point O and point D, β is the acute angle between vector QO and vector QD, and |QD| is the distance point Q must travel before leaving the communication range of node O;

from the link lifetimes t_Qi between node Q and all of its one-hop neighbors, the integrated link lifetime T_avg(Q) is calculated by accumulating and averaging:

T_avg(Q) = ( Σ_{i ∈ N1(Q)} t_Qi ) / num_N1(Q)

where N1(Q) is the symmetric one-hop neighbor set of node Q and num_N1(Q) is the size of N1(Q);

the final weighted link lifetime T_OQ is:

T_OQ = α · t_OQ + β · T_avg(Q)

where α = 0.7 and β = 0.3.
4. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein the node residual energy E represents the energy the current node has left for communication;
the initial energy of a communication node is defined as E_0, and a communication node is in one of four states during wireless communication, respectively: the sleep state Sleep, the idle state Idle, the message-transmitting state Tx, and the message-receiving state Rx; multiplying the power consumption of each of the four states by the operating time of the corresponding state and summing gives the energy E_cost consumed by node communication:

E_cost = P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep

where P_tx, P_rx, P_idle and P_sleep respectively represent the operating powers of the four states, and T_tx, T_rx, T_idle and T_sleep respectively represent the operating durations of the four states;

from the initial energy and the consumed energy, the energy E remaining at the node is calculated:

E = E_0 - E_cost = E_0 - (P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep).
5. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein calculating the weights using DQN in step M3 specifically comprises:
describing the problem of adjusting the weights of the four metrics (number of node neighbors, weighted link lifetime, node load capacity and node residual energy) as a Markov decision process MDP: each node dynamically interacts with the environment while selecting MPR nodes, and the optimal action in the current state is obtained through deep reinforcement learning; in the reinforcement learning model, the agent is each node, the environment is the whole communication network, the state is the values of the four metrics, and the action is the weights of the four metrics;
in the reinforcement learning model, the utility function is defined as:

U = α·lb(L) + β·lb(T) - γ·lb(E)

where L is the load capacity of the node, T is the average weighted link lifetime between the computing node and all of its MPR nodes, and E is the average residual energy of all MPR nodes of the computing node; α, β and γ are respectively the reward coefficients of the three parameters;
the difference of the utility function U between consecutive moments defines the reward function R_t:

R_t = U_t - U_{t-1},  if |U_t - U_{t-1}| > δ
R_t = 0,              otherwise

where δ is a threshold that adjusts the size of the reward; when the difference between U_t at the current moment and the previous moment exceeds δ, the environment gives the difference as the reward, otherwise the reward is zero; when U_t at the later moment is greater than at the earlier moment, the reward is positive, i.e. a reward, otherwise it is negative, i.e. a penalty;
after the problem has been described as an MDP, deep reinforcement learning DQN is used to set the node quality metric weights.
6. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 5, wherein the optimal MPR set calculation method in step M3 comprises:
setting the node currently computing its MPR set as A, defining N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A; the specific calculation steps are as follows:
step M3-1, adding the nodes in the set N1_A whose message-forwarding willingness is always willing to forward into MPR_A; deleting the N1_A nodes added to MPR_A, and deleting from the set N2_A the nodes covered by MPR_A;
step M3-2, if a node in N1_A is the only one-hop neighbor node of some node in N2_A, adding that node of the N1_A set into the MPR_A set; deleting the N1_A nodes added to MPR_A, and deleting from the set N2_A the nodes covered by MPR_A;
step M3-3, if N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, then for each node Y in the set N1_A, calculating and recording the number of one-hop neighbors C_Y of Y, the load condition L_Y, the weighted link lifetime T_AY of the link A-Y, and the residual energy E_Y of node Y;
step M3-4, inputting the current state information, i.e. the values of the four metrics, into the deep reinforcement learning DQN algorithm, and taking the output action, i.e. the metric weights, as the weights of the four metrics of step M3-3;
step M3-5, calculating the node comprehensive quality Comp_Y of node Y from the four metrics of step M3-3 and the weights of step M3-4:

Comp_Y = α·C_Y + β·L_Y + γ·T_AY + δ·E_Y

step M3-6, selecting from N1_A the node with the largest node comprehensive quality Comp_Y and adding it to MPR_A, deleting the N1_A node added to MPR_A, and deleting from the set N2_A the nodes covered by the MPR_A set; returning to step M3-3 until N2_A is empty, at which point the MPR_A calculation ends.
CN202310137402.5A 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN Pending CN116170854A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310137402.5A CN116170854A (en) 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310137402.5A CN116170854A (en) 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN

Publications (1)

Publication Number Publication Date
CN116170854A true CN116170854A (en) 2023-05-26

Family

ID=86414406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310137402.5A Pending CN116170854A (en) 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN

Country Status (1)

Country Link
CN (1) CN116170854A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033005A (en) * 2023-10-07 2023-11-10 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment
CN117033005B (en) * 2023-10-07 2024-01-26 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Zhang et al. A multi-path routing protocol based on link lifetime and energy consumption prediction for mobile edge computing
Elhabyan et al. Two-tier particle swarm optimization protocol for clustering and routing in wireless sensor network
CN112954769B (en) Underwater wireless sensor network routing method based on reinforcement learning
CN114499648B (en) Unmanned aerial vehicle cluster network intelligent multi-hop routing method based on multi-agent cooperation
CN110740487B (en) Underwater routing method with effective energy and obstacle avoidance
US20220369200A1 (en) Clustering and routing method and system for wireless sensor networks
CN102510572A (en) Clustering routing control method oriented to heterogeneous wireless sensor network
CN116170854A (en) DQN-OLSR routing method based on deep reinforcement learning DQN
CN111510956A (en) Hybrid routing method based on clustering and reinforcement learning and ocean communication system
Yang et al. V2V routing in VANET based on heuristic Q-learning
Micheletti et al. CER-CH: combining election and routing amongst cluster heads in heterogeneous WSNs
Gu et al. A social-aware routing protocol based on fuzzy logic in vehicular ad hoc networks
Boyineni et al. Mobile sink-based data collection in event-driven wireless sensor networks using a modified ant colony optimization
CN115866735A (en) Cross-layer topology control method based on super-mode game underwater sensor network
CN113660710B (en) Mobile self-organizing network routing method based on reinforcement learning
Peng et al. Real-time transmission optimization for edge computing in industrial cyber-physical systems
Zhang et al. V2V routing in VANET based on fuzzy logic and reinforcement learning
Babu et al. Cuckoo search and M-tree based multicast Ad hoc on-demand distance vector protocol for MANET
Jain et al. An efficient energy aware link stable routing protocol in MANETS
Xu et al. A dyna-Q based multi-path load-balancing routing algorithm in wireless sensor networks
CN115242290B (en) Method and device for optimizing OLSR protocol of emergency unmanned aerial vehicle network
CN114125986B (en) Wireless sensor network clustering routing method based on optimal relay angle
Qiu et al. Coding-Aware Routing for Maximum Throughput and Coding Opportunities by Deep Reinforcement Learning in FANET
Meera et al. ECMST: minimal energy usage competent multicast Steiner tree based route discovery for mobile ad hoc networks
Wang et al. Research on WSN Topology Algorithm Based on Greedy Shortest Paths

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination