CN116170854A - DQN-OLSR routing method based on deep reinforcement learning DQN - Google Patents

DQN-OLSR routing method based on deep reinforcement learning DQN

Info

Publication number
CN116170854A
Authority
CN
China
Prior art keywords
node
mpr
dqn
reinforcement learning
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310137402.5A
Other languages
Chinese (zh)
Inventor
郭剑辉
杨利行
濮存来
陶叔银
董宏林
肖志杰
李伦波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202310137402.5A priority Critical patent/CN116170854A/en
Publication of CN116170854A publication Critical patent/CN116170854A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/04Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/04Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
    • H04W40/10Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources based on available power or energy
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/12Communication route or path selection, e.g. power-based or shortest path routing based on transmission quality or channel quality
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks


Abstract

The application discloses a DQN-OLSR routing method based on deep reinforcement learning (DQN). Each node adds its load capacity, coordinate and velocity information (from which a weighted link lifetime is calculated), and residual energy to its HELLO messages. During route calculation, a Q function is trained through deep reinforcement learning DQN using four parameters: the number of node neighbors, node load capacity, weighted link lifetime, and residual energy. When the MPR set is computed, the weights of the four parameters are obtained from the Q function, the comprehensive quality of each node is calculated, and the node with the highest comprehensive quality is selected as an MPR node. When a node sends a data packet, the routing table is calculated according to the shortest-hop-count principle and the message is forwarded through the optimal MPR nodes. The method improves the stability of the whole network, increases network throughput, and reduces the packet loss rate.

Description

DQN-OLSR routing method based on deep reinforcement learning DQN
Technical Field
The invention belongs to the technical field of mobile ad hoc network OLSR routing, and particularly relates to a DQN-OLSR routing method based on deep reinforcement learning DQN (Deep Q Network).
Background
Optimized Link State Routing (OLSR) is an optimization of the classical link state routing algorithm and is commonly used in mobile ad hoc networks. The most distinctive feature of this protocol compared with traditional link state protocols is the multipoint relay (multipoint relays, MPRs) mechanism. Only MPR nodes forward control information; this significantly reduces the number of control messages in the network, reduces the overhead of the routing protocol, and improves network performance. The quality of the MPR set therefore directly affects the quality of the whole network.
Each node periodically transmits HELLO messages to achieve neighbor discovery and information sharing. Topology discovery and route computation rely on TC messages, and only MPR nodes forward TC messages, so the MPR set must cover all two-hop neighbor nodes. MPR selection has been proven to be an NP-complete problem, and the conventional greedy algorithm has difficulty reaching the optimal solution.
With the development of technology and the reduction of production costs, unmanned aerial vehicles are increasingly used in fields such as rescue and detection. An unmanned aerial vehicle network is characterized by high-speed movement and limited energy, and the original OLSR protocol has difficulty meeting the requirements of such an ad hoc network. First, the original OLSR protocol uses only the number of node neighbors as the factor for MPR selection, so it struggles to cope with a UAV network whose topology changes frequently due to high-speed movement, which degrades the communication stability of such networks. Second, if a UAV, as an energy-constrained network node, is always used as an MPR node to forward TC messages, it may run out of energy and interrupt communication.
Disclosure of Invention
The present invention is directed to solving the problems set forth in the background art by providing a DQN-OLSR routing method based on deep reinforcement learning DQN. The method introduces three parameters, node load capacity, weighted link lifetime and residual energy, and calculates the weights of these parameters with the deep reinforcement learning DQN algorithm, so that an optimal MPR set is obtained and the difficulty traditional methods have in adapting to rapid changes of network topology is overcome. Compared with the traditional OLSR routing method, the method significantly improves routing stability, increases network throughput, and reduces the packet loss rate.
In order to achieve the purpose of the invention, the invention discloses a DQN-OLSR routing method based on deep reinforcement learning DQN, which comprises the following steps:
Step M1: in the OLSR protocol, in addition to the existing node quality metric of the number of node neighbors, add three node quality metrics suited to a network that moves at high speed and is energy-constrained, namely node load capacity, weighted link lifetime and node residual energy; add these metrics to the node's HELLO messages (a data-structure sketch of the extended HELLO message follows these steps);
Step M2: the nodes perform neighbor discovery and information sharing directly through HELLO messages, and compute a neighbor table;
Step M3: using the neighbor table, the node computes the weights of the four node quality metrics with deep reinforcement learning DQN, calculates the comprehensive quality of each node from these weights, and then computes the optimal MPR set in descending order of comprehensive quality;
Step M4: the nodes in the MPR set forward TC messages to discover the network topology and obtain a topology table;
Step M5: based on the topology table and following the shortest-hop-count principle, compute the optimal routing table, completing the route calculation.
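By way of illustration, the extended HELLO payload of step M1 may be sketched as the following data structure; the class and field names are illustrative assumptions and not the protocol's actual message format.

```python
# Illustrative sketch (assumed names, not the protocol's actual wire format) of the
# HELLO payload extended as in step M1: standard neighbor information plus the node
# metrics used for MPR selection.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExtendedHello:
    originator: str                   # address of the node sending the HELLO
    neighbors: List[str]              # one-hop neighbors advertised in the HELLO
    position: Tuple[float, float]     # node coordinates, used to estimate link lifetime
    velocity: Tuple[float, float]     # node velocity vector, used to estimate link lifetime
    load_capacity: float              # node load capacity L
    residual_energy: float            # node residual energy E
```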
Further, the node load capacity L is calculated from the node's current message queue length and maximum message queue length:

[equation image: L expressed as a function of S_load and l]

where S_load is the current message queue length of the node and l is the maximum message queue length.
Further, the weighted link lifetime is obtained by weighting the link lifetime between nodes and the integrated link lifetime. Let nodes O and Q be one-hop neighbor nodes with velocities V_O and V_Q respectively; the velocity V_QO of Q relative to O is:

V_QO = V_Q - V_O

The link lifetime t_OQ between nodes O and Q is:

t_OQ = |QD| / |V_QO|

where |OQ| is the distance between node O and node Q, a circle is drawn with the communication radius R of node O, the vector V_QO is the motion vector of point Q relative to O, point D is the intersection of that motion vector with the circle, |OD| is the distance between point O and point D, β is the acute angle between vector QO and vector QD, and |QD| is the distance point Q must travel before leaving the communication range of node O.

From the link lifetimes t_Qi between node Q and all of its one-hop neighbors, the integrated link lifetime T_avg(Q) is calculated by accumulating and averaging:

T_avg(Q) = ( Σ_{i ∈ N1(Q)} t_Qi ) / num_N1(Q)

where N1(Q) is the symmetric one-hop neighbor set of node Q and num_N1(Q) is the size of N1(Q).

The final weighted link lifetime T_OQ is:

T_OQ = α · t_OQ + β · T_avg(Q)

where α = 0.7 and β = 0.3.
Further, the node residual energy E represents the energy the current node has left for communication; if a node has little energy left, it is not suitable as an MPR node, which avoids routing errors caused by the node dropping out of the network due to insufficient energy.
Define the initial energy of a communication node as E_0. A communication node is in one of four states during wireless communication, respectively: the sleep state Sleep, the idle state Idle, the message-transmitting state Tx, and the message-receiving state Rx; in general, power consumption is higher in the transmitting state Tx and the receiving state Rx, and lower in the sleep state Sleep. Multiplying the power consumption of each of the four states by the operating time of the corresponding state and summing gives the energy E_cost consumed by node communication:

E_cost = P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep

where P_tx, P_rx, P_idle and P_sleep respectively represent the operating powers of the four states, and T_tx, T_rx, T_idle and T_sleep respectively represent the operating durations of the four states.

From the initial energy and the consumed energy, the energy E remaining at the node is calculated:

E = E_0 - E_cost = E_0 - (P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep).
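For concreteness, a minimal sketch of these two scalar metrics follows. The residual-energy formula is taken directly from the text; the load-capacity expression is an assumption (the free fraction of the message queue), since the original gives that formula only as an equation image.

```python
# Sketch of the two scalar metrics above. residual_energy follows the text directly;
# the load-capacity expression is an assumption (free fraction of the message queue),
# since the original gives that formula only as an equation image.
def load_capacity(s_load: int, l_max: int) -> float:
    """Node load capacity L from the current queue length S_load and maximum length l."""
    return 1.0 - s_load / l_max  # assumed form: 1.0 when the queue is empty, 0.0 when full

def residual_energy(e0: float, power: dict, duration: dict) -> float:
    """E = E_0 - (P_tx*T_tx + P_rx*T_rx + P_idle*T_idle + P_sleep*T_sleep)."""
    e_cost = sum(power[s] * duration[s] for s in ("tx", "rx", "idle", "sleep"))
    return e0 - e_cost

# Example: 3 of 10 queue slots used; 100 J initial energy budget.
L = load_capacity(3, 10)  # 0.7
E = residual_energy(100.0,
                    power={"tx": 1.2, "rx": 0.9, "idle": 0.1, "sleep": 0.01},
                    duration={"tx": 10, "rx": 15, "idle": 60, "sleep": 300})  # 65.5 J
```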
Further, calculating the weights using DQN in step M3 specifically includes:
The problem of adjusting the weights of the four metrics (number of node neighbors, weighted link lifetime, node load capacity and node residual energy) is described as a Markov decision process (Markov Decision Process, MDP): each node dynamically interacts with the environment while selecting MPR nodes, and the optimal action in the current state is obtained through reinforcement learning. In the reinforcement learning model, the agent is each node, the environment is the whole communication network, the state is the values of the four metrics, and the action is the weights of the four metrics.
In the reinforcement learning model, the utility function is defined as:

U = α·lb(L) + β·lb(T) - γ·lb(E)

where L is the load capacity of the node, T is the average weighted link lifetime between the computing node and all of its MPR nodes, and E is the average residual energy of all MPR nodes of the computing node; α, β and γ are respectively the reward coefficients of the three parameters.
Analysis of the utility function U shows that it favors selecting, as MPR nodes, neighbors with long link lifetimes and large residual energy while ensuring that nodes have strong load capacity, which optimizes energy consumption and enhances network performance.
The difference of the utility function U between consecutive moments defines the reward function R_t:

R_t = U_t - U_{t-1},  if |U_t - U_{t-1}| > δ
R_t = 0,              otherwise

where δ is a threshold that adjusts the size of the reward; when the difference between U_t at the current moment and the previous moment exceeds δ, the environment gives the difference as the reward, otherwise the reward is zero; when U_t at the later moment is greater than at the earlier moment, the reward is positive, i.e. a reward, otherwise it is negative, i.e. a penalty.
After the problem has been described as an MDP, deep reinforcement learning DQN is used to set the node quality metric weights.
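A minimal sketch of the utility and reward functions defined above is given below; lb is taken as the base-2 logarithm, and the coefficient and threshold values are placeholders rather than values fixed by the method.

```python
# Sketch of the utility and reward defined above; lb is taken as the base-2 logarithm,
# and the coefficient and threshold values are placeholders, not values fixed by the method.
import math

def utility(L: float, T: float, E: float,
            alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """U = alpha*lb(L) + beta*lb(T) - gamma*lb(E)."""
    return alpha * math.log2(L) + beta * math.log2(T) - gamma * math.log2(E)

def reward(u_curr: float, u_prev: float, delta: float = 0.05) -> float:
    """R_t = U_t - U_{t-1} if the change exceeds the threshold delta, otherwise 0."""
    diff = u_curr - u_prev
    return diff if abs(diff) > delta else 0.0
```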
Further, the method for calculating the optimal MPR set in step M3 includes:
Let the node currently computing its MPR set be A; define N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A. The specific calculation steps are as follows (a code sketch of this selection loop is given after the steps):
Step M3-1: add the nodes in N1_A whose message-forwarding willingness is "always willing to forward" to MPR_A; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
Step M3-2: if a node in N1_A is the only one-hop neighbor of some node in N2_A, add that N1_A node to the MPR_A set; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
Step M3-3: if N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, then for each node Y in N1_A calculate and record the number of one-hop neighbors C_Y of Y, the load condition L_Y, the weighted link lifetime T_AY of the link A-Y, and the residual energy E_Y of node Y;
Step M3-4: input the current state information, i.e. the values of the four metrics, into the deep reinforcement learning DQN algorithm; the output action, i.e. the metric weights, gives the weights of the four metrics of step M3-3;
Step M3-5: from the four metrics of step M3-3 and the weights of step M3-4, calculate the comprehensive quality Comp_Y of node Y:

Comp_Y = α·C_Y + β·L_Y + γ·T_AY + δ·E_Y

Step M3-6: select from N1_A the node with the largest comprehensive quality Comp_Y and add it to MPR_A; delete the N1_A node added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set; return to step M3-3 until N2_A is empty, at which point the MPR_A calculation ends.
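The selection loop of steps M3-1 to M3-6 can be sketched as follows, with the DQN call replaced by a precomputed weight vector; the function and parameter names are illustrative assumptions.

```python
# Sketch of the selection loop in steps M3-1 to M3-6, under simplifying assumptions:
# willingness, the per-neighbor metrics and the DQN-provided weights are passed in as
# plain dictionaries, and the DQN call is replaced by a precomputed weight vector.
from typing import Dict, Set

def select_mpr(n1: Set[str],                         # one-hop neighbors of A (N1_A)
               n2: Set[str],                         # two-hop neighbors of A (N2_A)
               covers: Dict[str, Set[str]],          # one-hop neighbor -> two-hop nodes it reaches
               willingness_always: Set[str],         # neighbors always willing to forward
               metrics: Dict[str, Dict[str, float]], # per-neighbor values of C, L, T, E
               weights: Dict[str, float]) -> Set[str]:
    mpr: Set[str] = set()
    uncovered = set(n2)
    candidates = set(n1)

    def add(node: str) -> None:
        mpr.add(node)
        candidates.discard(node)
        uncovered.difference_update(covers.get(node, set()))

    # M3-1: neighbors that are always willing to forward
    for y in list(candidates & willingness_always):
        add(y)
    # M3-2: neighbors that are the only path to some two-hop node
    for z in list(uncovered):
        reachers = [y for y in candidates if z in covers.get(y, set())]
        if len(reachers) == 1:
            add(reachers[0])
    # M3-3..M3-6: repeatedly add the candidate with the highest comprehensive quality
    while uncovered and candidates:
        def comp(y: str) -> float:
            m = metrics[y]
            return (weights["C"] * m["C"] + weights["L"] * m["L"]
                    + weights["T"] * m["T"] + weights["E"] * m["E"])
        add(max(candidates, key=comp))
    return mpr
```

Each call to add() removes the chosen neighbor from the candidate set and marks the two-hop nodes it covers, so the loop terminates once N2_A is fully covered or no candidates remain.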
Compared with the prior art, the invention offers notable advances: 1) on the premise of preserving the normal routing function, the method computes an optimal MPR set suited to frequently changing network topologies, which significantly reduces routing errors, reduces data packet loss, and improves the communication performance of the wireless ad hoc network; 2) the DQN-OLSR routing method based on deep reinforcement learning DQN adds three metrics, node load capacity, weighted link lifetime and current node residual energy, to assist the calculation of the MPR set, overcoming the poor routing stability and low network performance of the traditional MPR set in networks with high-speed movement and energy constraints.
In order to more clearly describe the functional characteristics and structural parameters of the present invention, the following description is made with reference to the accompanying drawings and detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is an overall frame diagram of the method of the present invention;
FIG. 2 is a schematic diagram of a DQN neural network;
fig. 3 is a link lifetime calculation geometry diagram.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention; all other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention will be further described with reference to fig. 1, 2 and 3. The whole flow of the invention is specifically as follows:
step 1, three node quality measurement indexes suitable for a network which moves at a high speed and has energy limitation are provided, and the indexes are added into HELLO information of the nodes;
the three indexes are respectively: node load capacity, weighted link lifetime, and node residual energy. The node load capacity is calculated from the current message queue length and the maximum message queue length of the node. The weighted link survival time can be obtained through calculation of mathematical modeling through node position information and speed information. The node remaining energy may be calculated from the initial energy and the energy consumed.
Step 2, the node performs neighbor discovery and information sharing through HELLO information, and calculates to obtain a neighbor list;
Step 3: obtain the values of the four node quality metrics from the HELLO messages, calculate the metric weights by means of deep reinforcement learning DQN, and calculate the comprehensive quality of the nodes from these weights.
In the reinforcement learning model, each communication node is an agent; the agent continuously communicates with the other nodes over the network and performs the optimal action in each state, so that the optimal MPR set is computed. The state space is the values of the four node quality metrics, and the action space is the weights of the four node quality metrics.
Step 4: begin the MPR calculation. Let the node currently computing its MPR set be A; define N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A. The calculation proceeds as follows: add the nodes in N1_A whose message-forwarding willingness is "always willing to forward" to MPR_A; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set.
Step 5: if a node in N1_A is the only one-hop neighbor of some node in N2_A, add that node of N1_A to the MPR_A set; delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set.
Step 6: if N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, select from N1_A the node with the largest comprehensive quality Comp_Y and add it to MPR_A; delete the N1_A node added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set.
Step 7: return to step 6 until N2_A is empty, at which point the MPR_A calculation ends; the resulting MPR set is the optimal MPR set adapted to a high-speed, energy-constrained network.
Step 8: the MPR nodes forward TC messages to discover the network topology structure and obtain a topology table.
Step 9: based on the topology table, calculate the optimal routing table by means of the optimal MPR set according to the shortest-hop-count principle; this completes the DQN-OLSR routing method based on deep reinforcement learning DQN.
The method is realized by the following technical scheme:
In a first aspect, the present invention proposes three parameters, node load capacity, weighted link lifetime and node residual energy, as metric factors for MPR set selection; the three newly proposed metric factors are better suited to fast-moving communication scenarios. Specifically:

The node load capacity L is calculated from the node's current message queue length S_load and the maximum message queue length l:

[equation image: L expressed as a function of S_load and l]
The weighted link lifetime is calculated by weighting the link lifetime and the integrated link lifetime. Let nodes O and Q be one-hop neighbor nodes with velocities V_O and V_Q respectively; the velocity V_QO of Q relative to O is:

V_QO = V_Q - V_O

The link lifetime t_OQ between nodes O and Q is:

t_OQ = |QD| / |V_QO|

where |OQ| is the distance between point O and point Q, point D is the point at which Q crosses the edge of the communication range of point O, |OD| is the distance between point O and point D, and β is the acute angle between vector QO and vector QD.
As shown in fig. 3, which is a geometric schematic of calculating the link lifetime for points O and Q: R is the communication radius of node O, and a circle of radius R is drawn around node O. The vector V_QO is the motion vector of point Q relative to O, and point D is the intersection of this motion vector with the circle. The acute angle between vector QO and vector QD is taken as angle β. The distance |QD| that point Q must travel before leaving the communication range of point O can be calculated from triangle OQD (in which |OD| = R). Finally, dividing the distance |QD| by the speed of point Q relative to point O gives the link lifetime.
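Under the assumption of two-dimensional coordinates, the geometric calculation described above can be sketched as follows: point D is obtained by intersecting the ray from Q along the relative velocity V_QO with the circle of radius R around O, and the link lifetime is |QD| divided by the relative speed.

```python
# Sketch of the geometric calculation above, assuming 2-D coordinates: point D is found
# by intersecting the ray from Q along the relative velocity V_QO with the circle of
# radius R centered on O, and the link lifetime is |QD| divided by the relative speed.
import math

def link_lifetime(pos_o, vel_o, pos_q, vel_q, R: float) -> float:
    vx, vy = vel_q[0] - vel_o[0], vel_q[1] - vel_o[1]   # relative velocity V_QO
    qx, qy = pos_q[0] - pos_o[0], pos_q[1] - pos_o[1]   # position of Q in O's frame
    speed = math.hypot(vx, vy)
    if speed == 0.0:
        return math.inf                                 # no relative motion: link never expires
    # Solve |Q + s*u|^2 = R^2 for s = |QD|, where u is the unit motion direction of Q.
    ux, uy = vx / speed, vy / speed
    b = qx * ux + qy * uy                               # projection of OQ onto the motion direction
    c = qx * qx + qy * qy - R * R                       # negative while Q is inside the circle
    disc = b * b - c
    if disc < 0.0:
        return 0.0                                      # Q is already outside the communication range
    qd = -b + math.sqrt(disc)                           # distance Q travels before crossing the circle
    return qd / speed
```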
After the link lifetime is calculated, the link lifetimes between a node (e.g. node Q) and all of its neighbor nodes are accumulated and averaged to obtain the integrated link lifetime T_avg(Q) of node Q:

T_avg(Q) = ( Σ_{i ∈ N1(Q)} t_Qi ) / num_N1(Q)

where N1(Q) is the symmetric one-hop neighbor set of node Q and num_N1(Q) is the number of nodes in N1(Q).
Then t_OQ and T_avg(Q) are combined with fixed weights to obtain the weighted link lifetime T_OQ between node O and node Q:

T_OQ = α · t_OQ + β · T_avg(Q)

where α = 0.7 and β = 0.3.
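A short sketch of this aggregation step, assuming the per-link lifetimes have already been computed (for example with a geometric routine such as link_lifetime above):

```python
# Sketch of the aggregation step above, assuming the per-link lifetimes have already been
# computed (for example with a geometric routine such as link_lifetime above).
from statistics import mean

def weighted_link_lifetime(t_oq: float, neighbor_lifetimes: list,
                           alpha: float = 0.7, beta: float = 0.3) -> float:
    # Integrated lifetime of Q: average of its link lifetimes with all one-hop neighbors.
    t_avg_q = mean(neighbor_lifetimes) if neighbor_lifetimes else t_oq
    return alpha * t_oq + beta * t_avg_q
```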
The node residual energy E represents the energy the current node has left for communication; if a node has little energy left, it is not suitable as an MPR node, which avoids routing errors caused by the node dropping out of the network due to insufficient energy. Define the initial energy of a communication node as E_0. A communication node is mainly in one of four states during wireless communication, namely: the sleep state (Sleep), the idle state (Idle), the message-transmitting state (Tx), and the message-receiving state (Rx). Multiplying the power consumption of each of the four states by the operating time of the corresponding state and summing gives the energy E_cost consumed by node communication:

E_cost = P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep

From the initial energy and the consumed energy, the remaining energy E of the node can be calculated:

E = E_0 - E_cost = E_0 - (P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep).
In a second aspect, the present invention uses the three node metrics above, combined with deep reinforcement learning DQN, to compute the optimal MPR set, as follows:
Let the node currently computing its MPR set be A; define N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A. The specific calculation steps are:
(1) Add the nodes in N1_A whose message-forwarding willingness is "always willing to forward" to MPR_A. Delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
(2) If a node in N1_A is the only one-hop neighbor of some node in N2_A, add that N1_A node to the MPR_A set. Delete the N1_A nodes added to MPR_A, and delete from N2_A the nodes covered by MPR_A;
(3) If N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, then for each node Y in N1_A calculate and record the number of one-hop neighbors C_Y of Y, the load condition L_Y, the weighted link lifetime T_AY of the link A-Y, and the residual energy E_Y of node Y;
(4) Input the current state information, i.e. the values of the four metrics, into the deep reinforcement learning DQN algorithm; the output action, i.e. the metric weights, gives the weights of the four metrics of step (3);
(5) From the four metrics of step (3) and the weights of step (4), calculate the comprehensive quality Comp_Y of node Y:

Comp_Y = α·C_Y + β·L_Y + γ·T_AY + δ·E_Y

(6) Select from N1_A the node with the largest comprehensive quality Comp_Y and add it to MPR_A; delete the N1_A node added to MPR_A, and delete from N2_A the nodes covered by the MPR_A set. Return to (3) until N2_A is empty, at which point the MPR_A calculation ends.
So far, the calculation of the optimal MPR set is complete. This calculation method takes the integrated link lifetime into account and preferentially selects nodes with long link lifetimes as MPRs, ensuring communication stability; it also takes node load capacity and residual energy into account, avoiding the network congestion or network interruption caused by overloaded or energy-depleted nodes.
In a third aspect, the use of deep reinforcement learning DQN to calculate the weights of metric factors is also an important part of the invention.
Reinforcement learning is an important branch of machine learning that describes and solves the problem of an agent learning, through interaction with its environment, a strategy that maximizes the return or achieves a specific goal. The agent adjusts its actions according to the reward obtained by interacting with the environment, and this behavior can be expressed as a Markov decision process (Markov Decision Process, MDP). The most important element of a Markov decision process is the five-tuple <S, A, P, R, γ>, where S is the state set, A is the action set, and P is the state transition probability matrix, satisfying

P_ss'^a = P[ S_{t+1} = s' | S_t = s, A_t = a ]

R is the reward function, and γ is a discount coefficient in [0, 1].
Specifically, the problem of adjusting the four metric weights is described as an MDP: each node dynamically interacts with the environment while selecting MPR nodes, and the optimal action in the current state is obtained through reinforcement learning. In the reinforcement learning model, the agent is each node, the environment is the whole communication network, the state is the values of the four metrics, and the action is the weights of the four metrics.
In the reinforcement learning model, the utility function is defined as:

U = α·lb(L) + β·lb(T) - γ·lb(E)

where L is the load capacity of the node, T is the average weighted link lifetime between the computing node and all of its MPR nodes, and E is the average residual energy of all MPR nodes of the computing node; α, β and γ are respectively the reward coefficients of the three parameters.
Analysis of the utility function U shows that it favors selecting, as MPR nodes, neighbors with long link lifetimes and large residual energy while ensuring that nodes have strong load capacity, which optimizes energy consumption and enhances network performance.
The difference of the utility function U between consecutive moments defines the reward function R_t:

R_t = U_t - U_{t-1},  if |U_t - U_{t-1}| > δ
R_t = 0,              otherwise

where δ is a threshold that adjusts the size of the reward. When the difference between U_t at the current moment and the previous moment exceeds δ, the environment gives the difference as the reward; otherwise the reward is zero. When U_t at the later moment is greater than at the earlier moment, the reward is positive, i.e. a reward; otherwise it is negative, i.e. a penalty.
DQN is a further development of Q-learning: a neural network is added to fit the Q function, which overcomes limitations that Q-learning cannot handle. The core of Q-learning is the Q table. In the present invention the state is the values of the four metrics and the action is the weights of the four metrics, so the Q table would have a very large dimension that Q-learning cannot manage; deep reinforcement learning DQN is therefore used instead. Only the structure and parameters of the neural network need to be stored, rather than a huge Q table; in addition, similar state inputs produce similar action outputs, giving stronger generalization ability. A schematic diagram of the DQN neural network is shown in fig. 2.
However, simply combining Q-learning with a neural network causes two problems: first, the samples used to train a neural network are assumed to be mutually independent, whereas the successive states of reinforcement learning are correlated; second, approximating the Q table with a nonlinear neural network may prevent training from converging. To address these two issues, an experience pool and a fixed Q-target strategy are introduced. The experience pool uses an off-policy strategy: experiences obtained by exploration are stored in the pool, and experiences are then sampled from it at random to update the network, which breaks the correlation between experiences and improves their utilization. The fixed Q-target is used to accelerate training convergence: the DQN contains two networks with the same structure but different parameters, one used to predict the Q estimate (MainNet) and one used to predict the Q target (the target network); the target network keeps older parameters for a period of time, while the latest parameters are used for the estimated value:

targetQ = r + γ · max_{a'} Q(s', a', θ)

The loss is obtained from targetQ and the Q estimate, using the mean squared error loss:

LOSS(θ) = E[ (targetQ - Q(s, a, θ))² ]
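A minimal PyTorch sketch of this training mechanism is given below: a MainNet and a target network of identical structure, an experience pool sampled uniformly at random, the targetQ update, the mean-squared-error loss, and a periodic copy of MainNet's parameters into the target network. The network sizes, buffer size and hyperparameter values are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of the mechanism described above: MainNet and a target network with
# identical structure, an experience pool sampled at random, the targetQ update, and a
# periodic copy of MainNet's parameters into the target network. Sizes and
# hyperparameters are illustrative.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 8, 0.9   # 4 metric values in; each action indexes one candidate weight vector

def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

main_net, target_net = make_net(), make_net()
target_net.load_state_dict(main_net.state_dict())            # start with identical parameters
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                                 # experience pool

def train_step(batch_size: int = 32) -> None:
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)                 # break correlation between samples
    s, a, r, s_next = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)

    q = main_net(s).gather(1, a).squeeze(1)                   # Q(s, a, theta) from MainNet
    with torch.no_grad():                                     # fixed Q-target from the (older) target network
        target_q = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, target_q)                # LOSS = E[(targetQ - Q)^2]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target() -> None:
    """Copy MainNet parameters into the target network every few hundred steps."""
    target_net.load_state_dict(main_net.state_dict())
```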
specifically: initializing MainNet and a target, and updating parameters of the MainNet according to a loss function, wherein the target is fixed; after a plurality of iterations, the parameters of the MainNet are all assigned to the target network, so that a fixed Q-target mechanism is realized. The above process is iterated continuously until the training converges. In this process, targetQ is fixed for a period of time, making the updating of the algorithm more stable.
In a fourth aspect, after the optimal MPR set is obtained through the deep reinforcement learning DQN calculation, the routing table is computed from the topology table information. The topology table is obtained from the TC messages forwarded by the MPR nodes, and the optimal MPR nodes yield the optimal topology table; the routing table is then calculated from the topology table according to the shortest-hop-count principle to obtain the optimal routing path, as sketched below.
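The final routing-table computation under the shortest-hop-count principle can be sketched as a breadth-first search over the topology table; the data-structure names here are illustrative.

```python
# Sketch of the final step: from the topology table (an adjacency list built from TC
# messages), compute next hops toward every destination by breadth-first search, which
# realizes the shortest-hop-count principle. Data-structure names are illustrative.
from collections import deque
from typing import Dict, List

def build_routing_table(topology: Dict[str, List[str]], source: str) -> Dict[str, str]:
    """Return {destination: next_hop} for all nodes reachable from `source`."""
    next_hop: Dict[str, str] = {}
    visited = {source}
    queue = deque()
    for neighbor in topology.get(source, []):                 # one-hop neighbors: next hop is themselves
        next_hop[neighbor] = neighbor
        visited.add(neighbor)
        queue.append(neighbor)
    while queue:
        node = queue.popleft()
        for nxt in topology.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                next_hop[nxt] = next_hop[node]                # inherit the first hop on the shortest path
                queue.append(nxt)
    return next_hop
```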
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A DQN-OLSR routing method based on deep reinforcement learning DQN, characterized by comprising the following steps:
step M1, in the OLSR protocol, in addition to the node quality metric of the number of node neighbors, adding three node quality metrics suited to a network that moves at high speed and is energy-constrained, the added three node quality metrics comprising node load capacity, weighted link lifetime and node residual energy; adding the added three node quality metrics into the HELLO messages of the node;
step M2, the node performing neighbor discovery and information sharing directly through HELLO messages, and computing a neighbor table;
step M3, the node using the neighbor table and deep reinforcement learning DQN to compute the weights of the four node quality metrics, calculating the comprehensive quality of the nodes according to the weights, and then computing the optimal MPR set in descending order of node comprehensive quality;
step M4, the nodes in the MPR set forwarding TC messages to realize network topology discovery and obtain a topology table;
and step M5, based on the topology table and following the shortest-hop-count principle, calculating the optimal routing table, completing the route calculation.
2. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein the node load capacity L is calculated from the node's current message queue length and maximum message queue length:

[equation image: L expressed as a function of S_load and l]

where S_load is the current message queue length of the node and l is the maximum message queue length.
3. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein the weighted link lifetime is calculated by weighting the link lifetime between nodes and the integrated link lifetime; nodes O and Q are set to be one-hop neighbor nodes with velocities V_O and V_Q respectively, and the velocity V_QO of Q relative to O is:

V_QO = V_Q - V_O

the link lifetime t_OQ between nodes O and Q is:

t_OQ = |QD| / |V_QO|

where |OQ| is the distance between node O and node Q, a circle is drawn with the communication radius R of node O, the vector V_QO is the motion vector of point Q relative to O, point D is the intersection of that motion vector with the circle, |OD| is the distance between point O and point D, β is the acute angle between vector QO and vector QD, and |QD| is the distance point Q must travel before leaving the communication range of node O;

from the link lifetimes t_Qi between node Q and all of its one-hop neighbors, the integrated link lifetime T_avg(Q) is calculated by accumulating and averaging:

T_avg(Q) = ( Σ_{i ∈ N1(Q)} t_Qi ) / num_N1(Q)

where N1(Q) is the symmetric one-hop neighbor set of node Q and num_N1(Q) is the size of N1(Q);

the final weighted link lifetime T_OQ is:

T_OQ = α · t_OQ + β · T_avg(Q)

where α = 0.7 and β = 0.3.
4. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein the node residual energy E represents the energy the current node has left for communication;
the initial energy of a communication node is defined as E_0, and a communication node is in one of four states during wireless communication, respectively: the sleep state Sleep, the idle state Idle, the message-transmitting state Tx, and the message-receiving state Rx; multiplying the power consumption of each of the four states by the operating time of the corresponding state and summing gives the energy E_cost consumed by node communication:

E_cost = P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep

where P_tx, P_rx, P_idle and P_sleep respectively represent the operating powers of the four states, and T_tx, T_rx, T_idle and T_sleep respectively represent the operating durations of the four states;

from the initial energy and the consumed energy, the energy E remaining at the node is calculated:

E = E_0 - E_cost = E_0 - (P_tx × T_tx + P_rx × T_rx + P_idle × T_idle + P_sleep × T_sleep).
5. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 1, wherein calculating the weights using DQN in step M3 specifically comprises:
describing the problem of adjusting the weights of the four metrics (number of node neighbors, weighted link lifetime, node load capacity and node residual energy) as a Markov decision process MDP: each node dynamically interacts with the environment while selecting MPR nodes, and the optimal action in the current state is obtained through deep reinforcement learning; in the reinforcement learning model, the agent is each node, the environment is the whole communication network, the state is the values of the four metrics, and the action is the weights of the four metrics;
in the reinforcement learning model, the utility function is defined as:

U = α·lb(L) + β·lb(T) - γ·lb(E)

where L is the load capacity of the node, T is the average weighted link lifetime between the computing node and all of its MPR nodes, and E is the average residual energy of all MPR nodes of the computing node; α, β and γ are respectively the reward coefficients of the three parameters;
the difference of the utility function U between consecutive moments defines the reward function R_t:

R_t = U_t - U_{t-1},  if |U_t - U_{t-1}| > δ
R_t = 0,              otherwise

where δ is a threshold that adjusts the size of the reward; when the difference between U_t at the current moment and the previous moment exceeds δ, the environment gives the difference as the reward, otherwise the reward is zero; when U_t at the later moment is greater than at the earlier moment, the reward is positive, i.e. a reward, otherwise it is negative, i.e. a penalty;
after the problem has been described as an MDP, deep reinforcement learning DQN is used to set the node quality metric weights.
6. The DQN-OLSR routing method based on deep reinforcement learning DQN according to claim 5, wherein the optimal MPR set calculation method in step M3 comprises:
setting the node currently computing its MPR set as A, defining N1_A as the one-hop neighbor set of node A, N2_A as the two-hop neighbor set of node A, and MPR_A as the MPR set of node A; the specific calculation steps are as follows:
step M3-1, adding the nodes in the set N1_A whose message-forwarding willingness is always willing to forward into MPR_A; deleting the N1_A nodes added to MPR_A, and deleting from the set N2_A the nodes covered by MPR_A;
step M3-2, if a node in N1_A is the only one-hop neighbor node of some node in N2_A, adding that node of the N1_A set into the MPR_A set; deleting the N1_A nodes added to MPR_A, and deleting from the set N2_A the nodes covered by MPR_A;
step M3-3, if N2_A is not empty, i.e. there are still nodes in N2_A not covered by MPR_A, then for each node Y in the set N1_A, calculating and recording the number of one-hop neighbors C_Y of Y, the load condition L_Y, the weighted link lifetime T_AY of the link A-Y, and the residual energy E_Y of node Y;
step M3-4, inputting the current state information, i.e. the values of the four metrics, into the deep reinforcement learning DQN algorithm, and taking the output action, i.e. the metric weights, as the weights of the four metrics of step M3-3;
step M3-5, calculating the node comprehensive quality Comp_Y of node Y from the four metrics of step M3-3 and the weights of step M3-4:

Comp_Y = α·C_Y + β·L_Y + γ·T_AY + δ·E_Y

step M3-6, selecting from N1_A the node with the largest node comprehensive quality Comp_Y and adding it to MPR_A, deleting the N1_A node added to MPR_A, and deleting from the set N2_A the nodes covered by the MPR_A set; returning to step M3-3 until N2_A is empty, at which point the MPR_A calculation ends.
CN202310137402.5A 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN Pending CN116170854A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310137402.5A CN116170854A (en) 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310137402.5A CN116170854A (en) 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN

Publications (1)

Publication Number Publication Date
CN116170854A true CN116170854A (en) 2023-05-26

Family

ID=86414406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310137402.5A Pending CN116170854A (en) 2023-02-20 2023-02-20 DQN-OLSR routing method based on deep reinforcement learning DQN

Country Status (1)

Country Link
CN (1) CN116170854A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033005A (en) * 2023-10-07 2023-11-10 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment
CN117033005B (en) * 2023-10-07 2024-01-26 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Zhang et al. A multi-path routing protocol based on link lifetime and energy consumption prediction for mobile edge computing
Elhabyan et al. Two-tier particle swarm optimization protocol for clustering and routing in wireless sensor network
CN112954769B (en) Underwater wireless sensor network routing method based on reinforcement learning
CN114499648B (en) Unmanned aerial vehicle cluster network intelligent multi-hop routing method based on multi-agent cooperation
CN110740487B (en) Underwater routing method with effective energy and obstacle avoidance
US20220369200A1 (en) Clustering and routing method and system for wireless sensor networks
CN102510572A (en) Clustering routing control method oriented to heterogeneous wireless sensor network
CN116170854A (en) DQN-OLSR routing method based on deep reinforcement learning DQN
CN111510956A (en) Hybrid routing method based on clustering and reinforcement learning and ocean communication system
Yang et al. V2V routing in VANET based on heuristic Q-learning
Micheletti et al. CER-CH: combining election and routing amongst cluster heads in heterogeneous WSNs
Gu et al. A social-aware routing protocol based on fuzzy logic in vehicular ad hoc networks
Boyineni et al. Mobile sink-based data collection in event-driven wireless sensor networks using a modified ant colony optimization
CN115866735A (en) Cross-layer topology control method based on super-mode game underwater sensor network
CN113660710B (en) Mobile self-organizing network routing method based on reinforcement learning
Peng et al. Real-time transmission optimization for edge computing in industrial cyber-physical systems
Zhang et al. V2V routing in VANET based on fuzzy logic and reinforcement learning
Babu et al. Cuckoo search and M-tree based multicast Ad hoc on-demand distance vector protocol for MANET
Jain et al. An efficient energy aware link stable routing protocol in MANETS
Xu et al. A dyna-Q based multi-path load-balancing routing algorithm in wireless sensor networks
CN115242290B (en) Method and device for optimizing OLSR protocol of emergency unmanned aerial vehicle network
CN114125986B (en) Wireless sensor network clustering routing method based on optimal relay angle
Qiu et al. Coding-Aware Routing for Maximum Throughput and Coding Opportunities by Deep Reinforcement Learning in FANET
Meera et al. ECMST: minimal energy usage competent multicast Steiner tree based route discovery for mobile ad hoc networks
Wang et al. Research on WSN Topology Algorithm Based on Greedy Shortest Paths

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination