CN114025405A - Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning - Google Patents

Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning Download PDF

Info

Publication number
CN114025405A
CN114025405A (application CN202111176454.0A)
Authority
CN
China
Prior art keywords
node
nodes
value
unmanned vehicle
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111176454.0A
Other languages
Chinese (zh)
Other versions
CN114025405B (en)
Inventor
王桐
崔立佳
高山
陈立伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202111176454.0A priority Critical patent/CN114025405B/en
Publication of CN114025405A publication Critical patent/CN114025405A/en
Application granted granted Critical
Publication of CN114025405B publication Critical patent/CN114025405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H  ELECTRICITY
    • H04  ELECTRIC COMMUNICATION TECHNIQUE
    • H04W  WIRELESS COMMUNICATION NETWORKS
    • H04W40/00  Communication routing or communication path finding
    • H04W40/02  Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/12  Communication route or path selection, e.g. power-based or shortest path routing based on transmission quality or channel quality
    • H  ELECTRICITY
    • H04  ELECTRIC COMMUNICATION TECHNIQUE
    • H04L  TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00  Routing or path finding of packets in data switching networks
    • H04L45/02  Topology update or discovery
    • H04L45/08  Learning-based routing, e.g. using neural networks or artificial intelligence
    • Y  GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02  TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D  CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00  Reducing energy consumption in communication networks
    • Y02D30/70  Reducing energy consumption in communication networks in wireless communication networks

Abstract

A reinforcement-learning-based safety opportunity routing method and device for an underwater unmanned vehicle, belonging to the technical field of sensors. Current underwater exploration targets sensor nodes that cannot move autonomously underwater: encountered nodes cannot be selected during movement, and void nodes are easily created. The invention provides a reinforcement-learning-based safety opportunity routing method for an underwater unmanned vehicle, comprising the following steps: the underwater unmanned vehicle performs a primary screening of the nodes within its communication range, and a trust evaluation model is established; the trust evaluation model evaluates the primarily screened nodes; the evaluation elements are input into a fuzzy logic system to obtain the comprehensive trust value of the evaluated node, which is updated into the dynamic table of encountered-node trust values; and, according to the comprehensive trust value output by the fuzzy logic system, reinforcement learning is used to perform routing selection and to set the state-action value update function and the reward function. The method is applied to the field of safety opportunity routing for underwater unmanned vehicles.

Description

Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning
Technical Field
The invention relates to the field of safety opportunity routing of an underwater unmanned vehicle, in particular to the field of safety opportunity routing of the underwater unmanned vehicle based on reinforcement learning.
Background
The existing invention CN112188583A, 'An ocean underwater wireless sensor network opportunistic routing method based on reinforcement learning', proposes combining reinforcement learning with opportunistic routing, but it targets sensor nodes that cannot move autonomously underwater: their topology changes little, and each node only records interaction information with its neighbor nodes.
If such wireless sensor network opportunistic routing is applied directly to underwater unmanned vehicles, the nodes cannot be covered comprehensively or updated automatically, encountered nodes cannot be selected during movement, safe and efficient end-to-end message transmission cannot ultimately be achieved, and void nodes are easily created.
Disclosure of Invention
The invention addresses the problems that sensor nodes which cannot move autonomously underwater exhibit little topological change and only record interaction information with their neighbor nodes; that encountered nodes cannot be selected during movement, preventing the final safe and efficient delivery of messages; and that void nodes are easily created.
A reinforcement learning-based underwater unmanned vehicle safety opportunity routing method, the method comprising:
the underwater unmanned vehicle performs a primary screening of the nodes within its communication range, and a trust evaluation model is established from the primarily screened nodes;
the trust evaluation model evaluates the preliminarily screened nodes, its evaluation elements consisting of a direct trust value DTValue and an indirect trust value ITValue;
the evaluation elements are input into a fuzzy logic system to obtain the comprehensive trust value of the evaluated node, which is updated into the dynamic table of encountered-node trust values;
according to the comprehensive trust value of the evaluated node output by the fuzzy logic system, reinforcement learning is used to perform routing selection, set the state-action value update function, and set the reward function.
Further, the process by which the underwater unmanned vehicle primarily screens the nodes within its communication range and establishes the trust evaluation model from the primarily screened nodes comprises the following steps:
the underwater unmanned vehicle node carrying the message broadcasts to the other nodes within its communication range, requests them to feed back their node information, acquires their data packets, performs a primary screening according to the indirect trust value ITValue in each packet, and selects the nodes whose indirect trust value exceeds a threshold as candidate relay nodes for further evaluation.
Further, the direct trust value DTValue evaluation elements are selected as: 1. inter-node communication quality, estimated from the relative distance between nodes, which is computed from the send/receive time difference of node data packets; 2. node familiarity; 3. node relay ratio.
Further, the path loss estimated from the relative distance between nodes measures the inter-node communication quality; the path loss A(d, f) experienced by any pair of nodes over the underwater acoustic channel is:
A(d, f) = A0 · d^k · α1(f)^d
10·log A(d, f) = 10·log A0 + k·10·log d + d·10·log α1(f)
where f is the signal frequency in kHz, d is the distance in m, A0 is a unit normalization constant, k is the propagation factor characterizing the geometry of the propagation, and α1 is the absorption factor;
further, the node familiarity includes:
each node records its interactions with the previous-hop and next-hop nodes, including the counterpart's node number, the destination node, the start and end times of the transmission, and the number of interactions;
after receiving the message, the destination node broadcasts into the network an acknowledgement data packet containing only a packet header, which carries the successful-transmission-path information together with the destination node information;
a node that receives the header message checks its interaction records; if it appears on the successful transmission path, its previous-hop and next-hop nodes enter its own successful-cooperative-transmission node table, and it is judged whether those nodes already exist in the table:
for nodes already in the table, only the recorded data are updated, namely the start and end times of the transmission and the accumulated transmission count; the nodes in the table can be regarded as friend nodes of the current node;
if an interaction record is not found on any successful transmission path, it is automatically cleared after a certain time;
after the network has operated successfully, each node has its own friend nodes; influenced by node movement speed and transmission radius, the contact interval between friend nodes obeys a negative exponential distribution, so the contact intervals are modeled as negatively exponentially distributed and the contact probability between friend nodes is estimated:
P_A,B(T) = 1 − e^(−T/x̄_A,B)
where B is a friend node of A, P_A,B(T) denotes the probability that nodes A and B come into contact within time T, n is the total number of acquired historical transmission intervals, and x_i is the i-th transmission interval; in a mobile opportunistic network the recorded number of successful interactions with friend nodes is finite, so the value of n differs between nodes; x̄_A,B = (1/n) · Σ_(i=1..n) x_i is the statistical average of the historical transmission intervals.
Further, the node relay ratio is:
P_ret = P_A,B(T) / N_r
where P_ret is the node relay ratio and N_r is the number of messages the node has received.
Further, according to the comprehensive trust value of the evaluated node output by the fuzzy logic system, reinforcement learning is used to perform routing selection and to set the state-action value update function and the reward function, comprising the following steps:
determining the comprehensive trust value of the encountered node by the fuzzy logic method, using the Q-learning strategy of reinforcement learning to find a suitable forwarding path for the message, and defining the update formula of the state-action value Q as:
Q_d(s,x) ← (1 − α)·Q_d(s,x) + α·[ R_d(s,x) + γ_d(s,x)·max_(y∈N_x) Q′_d(x,y) ]
where Q_d(s,x) is the state-action value of selecting node x as the next-hop forwarding node at node s for a data packet destined for node d, i.e. the forwarding utility Q value of node s forwarding a packet destined for d to node x; on each update the corresponding Q value stored in the state-action value table is taken out and substituted into the formula, and the updated value is stored back into the table; α is the learning coefficient, 0 ≤ α ≤ 1; γ_d(s,x) is the dynamic discount factor for forwarding a packet destined for d from node s to node x; N_x denotes the contact-node set of node x, containing all nodes encountered during the movement of node x; R_d(s,x) denotes the immediate return (the reward function defined below); and Q′_d(x,y) is the state-action value weighted by the node comprehensive trust value, introduced to guarantee the security dynamics of the mobile opportunistic network;
the dynamic discount factor γ_d(s,x) is
γ_d(s,x) = γ·e^(CTValue(s,x) − 1)
where γ is a fixed constant, γ ∈ (0, 1).
Further, the reward function is an immediate return value, a function of the node comprehensive trust value, serving as positive feedback to the nodes on a successfully transmitted path:
R_d(s,x) = e^(CTValue(s,x)) − 1  if node x is the destination node d
R_d(s,x) = 0                     otherwise
where CTValue(s,x) denotes the comprehensive trust value of encountered node x as evaluated by node s;
the positive feedback is the feedback issued after the message is successfully delivered to the destination node.
The invention provides an underwater unmanned vehicle safety opportunity routing device based on reinforcement learning, which comprises:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing a reinforcement learning-based underwater unmanned vehicle safety opportunity routing method as described above.
The present invention provides a computer device, characterized in that it comprises a memory in which a computer program is stored and a processor; when the processor executes the computer program stored in the memory, it performs a reinforcement-learning-based underwater unmanned vehicle safety opportunity routing method as described above.
The invention has the advantages that:
the invention solves the problems that the existing sensor node which can not move autonomously underwater has small topological change, and the sensor node only records the interaction information with the neighbor node and can not move autonomously; the encountered nodes can not be selected in the moving process, so that the final safe and efficient transmission of the messages is realized, and the problem of void nodes is easily caused.
The method is designed for the routing protocol of the underwater unmanned vehicle; it realizes safe and efficient information transmission, avoids underwater voids, improves the networking performance of underwater unmanned vehicles, reduces underwater transmission delay, and increases the message delivery rate.
The underwater unmanned vehicle serves as a sensor node with autonomous mobility: it can select encountered nodes while moving, uses opportunistic routing, evaluates the trust values of encountered nodes upon meeting, performs routing selection by combining reinforcement learning with the nodes' comprehensive trust values, dynamically updates and optimizes the overall performance of the underwater network, and avoids void nodes while achieving effective transmission of information.
The method is applied to the field of safety opportunity routing of the underwater unmanned vehicle.
Drawings
FIG. 1 is an overall implementation process of an underwater unmanned vehicle safety opportunity routing;
FIG. 2 is a comprehensive trust value output model.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.
First embodiment: this embodiment is described with reference to FIG. 1. The embodiment provides a reinforcement-learning-based underwater unmanned vehicle safety opportunity routing method, which comprises the following steps:
the underwater unmanned vehicle performs a primary screening of the nodes within its communication range, and a trust evaluation model is established from the primarily screened nodes;
the trust evaluation model evaluates the preliminarily screened nodes, its evaluation elements consisting of a direct trust value DTValue and an indirect trust value ITValue;
the evaluation elements are input into a fuzzy logic system to obtain the comprehensive trust value of the evaluated node, which is updated into the dynamic table of encountered-node trust values;
according to the comprehensive trust value of the evaluated node output by the fuzzy logic system, reinforcement learning is used to perform routing selection, set the state-action value update function, and set the reward function.
In the reinforcement-learning-based safety opportunity routing method for the underwater unmanned vehicle, the vehicle has autonomous mobility and can select encountered nodes during movement; opportunistic routing is used; when nodes meet, the encountered nodes undergo trust value evaluation; routing selection is performed by combining reinforcement learning with the nodes' comprehensive trust values; the overall performance of the underwater network is dynamically updated and optimized; and void nodes are avoided while achieving effective transmission of information.
Second embodiment: this embodiment is described with reference to FIG. 1 and further limits the method of the first embodiment. In this embodiment, the process by which the underwater unmanned vehicle primarily screens the nodes within its communication range and establishes the trust evaluation model from the primarily screened nodes is as follows:
the underwater unmanned vehicle node carrying the message broadcasts to the other nodes within its communication range, requests them to feed back their node information, acquires their data packets, performs a primary screening according to the indirect trust value ITValue in each packet, and selects the nodes whose indirect trust value exceeds a threshold as candidate relay nodes for further evaluation.
In this embodiment, node information is acquired by performing a primary screening within the node's communication range.
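As an illustration of this screening step, a minimal sketch follows (Python is used here purely for illustration); the feedback-packet structure, the NodeInfo fields, and the 0.5 threshold are assumptions, since the patent does not specify the threshold value.

```python
from dataclasses import dataclass

@dataclass
class NodeInfo:
    """Node information fed back in response to the broadcast (assumed fields)."""
    node_id: int
    it_value: float  # indirect trust value ITValue carried in the feedback packet

IT_THRESHOLD = 0.5   # assumed screening threshold; the patent leaves it unspecified

def primary_screen(replies: list[NodeInfo]) -> list[NodeInfo]:
    """Keep only nodes whose ITValue exceeds the threshold as candidate relays."""
    return [n for n in replies if n.it_value > IT_THRESHOLD]

# Example: three nodes answer the broadcast; nodes 2 and 4 pass the screening.
replies = [NodeInfo(2, 0.7), NodeInfo(3, 0.3), NodeInfo(4, 0.9)]
print([n.node_id for n in primary_screen(replies)])  # -> [2, 4]
```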
Third embodiment: this embodiment is described with reference to FIG. 2 and further limits the method of the first embodiment. In this embodiment, the direct trust value DTValue evaluation elements are selected as: 1. inter-node communication quality, estimated from the relative distance between nodes, which is computed from the send/receive time difference of node data packets; 2. node familiarity; 3. node relay ratio.
The indirect trust value ITValue guarantees the objectivity of the evaluation of the current node: each node maintains a dynamic trust value table recording the comprehensive trust values that other nodes have assigned to it, and the average of the data in this table is output as the indirect trust value.
In this embodiment, the evaluation elements consist of the indirect trust value ITValue and the direct trust value DTValue; the comprehensive trust value CTValue of a candidate relay node is obtained through their combined calculation, realizing safe and effective information transmission.
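For illustration, the following is a minimal fuzzy-inference sketch that combines DTValue and ITValue into CTValue. The patent does not publish its membership functions or rule base, so the triangular sets, the Mamdani-style min rule firing, the rule table, and the weighted-average defuzzification below are all illustrative assumptions.

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b on the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Fuzzy sets "low", "medium", "high" on [0, 1] for both inputs (assumed shapes).
SETS = {"low": (-0.5, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.5)}
# Crisp output level per output label, used in the weighted-average defuzzification.
OUT = {"low": 0.0, "medium": 0.5, "high": 1.0}
# Assumed rule base: (DTValue label, ITValue label) -> CTValue label.
RULES = {("low", "low"): "low", ("low", "medium"): "low",
         ("low", "high"): "medium", ("medium", "low"): "low",
         ("medium", "medium"): "medium", ("medium", "high"): "high",
         ("high", "low"): "medium", ("high", "medium"): "high",
         ("high", "high"): "high"}

def ct_value(dt: float, it: float) -> float:
    """Weighted average over all fired rules (min firing strength per rule)."""
    num = den = 0.0
    for (ld, li), lo in RULES.items():
        w = min(tri(dt, *SETS[ld]), tri(it, *SETS[li]))  # rule firing strength
        num += w * OUT[lo]
        den += w
    return num / den if den else 0.0

# Example: a node with fairly high direct trust and moderate indirect trust.
print(round(ct_value(0.8, 0.6), 3))
```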
Fourth embodiment: this embodiment is described with reference to FIG. 2. In this embodiment, the path loss estimated from the relative distance between nodes measures the inter-node communication quality; the path loss A(d, f) experienced by any pair of nodes over the underwater acoustic channel is:
A(d, f) = A0 · d^k · α1(f)^d
10·log A(d, f) = 10·log A0 + k·10·log d + d·10·log α1(f)
where f is the signal frequency in kHz, d is the distance in m, A0 is a unit normalization constant, k is the propagation factor characterizing the geometry of the propagation, and α1 is the absorption factor. The geometric spreading loss depends only on the propagation distance and is independent of frequency.
The term k·10·log d + d·10·log α1(f) represents the attenuation caused by the distance d and the absorption factor α1, where the absorption is given by Thorp's empirical formula (f in kHz, result in dB/km):
10·log α1(f) = 0.11·f²/(1 + f²) + 44·f²/(4100 + f²) + 2.75×10⁻⁴·f² + 0.003
the signal-to-noise ratio is inversely proportional to the distance d and the error rate, and the distance is directly proportional to the error rate, so that the smaller the distance is, the more reliable data transmission can be ensured.
In this embodiment, inter-node communication quality is an important factor in achieving effective message transmission to relay nodes during the communication of the underwater unmanned vehicle.
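A minimal sketch of the path-loss computation follows; the use of Thorp's empirical formula for the absorption factor and the spreading factor value k = 1.5 (practical spreading) are assumptions, since the patent only names the quantities.

```python
import math

def thorp_absorption_db_per_km(f_khz: float) -> float:
    """Thorp's empirical absorption, 10*log10(alpha1(f)), in dB/km (f in kHz)."""
    f2 = f_khz ** 2
    return 0.11 * f2 / (1 + f2) + 44 * f2 / (4100 + f2) + 2.75e-4 * f2 + 0.003

def path_loss_db(d_m: float, f_khz: float, k: float = 1.5, a0_db: float = 0.0) -> float:
    """Path loss 10*log10(A(d, f)) in dB: spreading term plus absorption term."""
    absorption_db = thorp_absorption_db_per_km(f_khz) * (d_m / 1000.0)  # dB/km -> dB
    return a0_db + k * 10 * math.log10(d_m) + absorption_db

# Example: a 20 kHz signal over 500 m; a smaller d gives a smaller loss, matching
# the observation above that shorter links are more reliable.
print(round(path_loss_db(500.0, 20.0), 2), "dB")
```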
Fifth embodiment: this embodiment is described with reference to FIG. 2 and further limits the method of the third embodiment. In this embodiment, the node familiarity is determined as follows:
Each node on the transmission path records its interactions with the previous-hop and next-hop nodes, including the counterpart's node number, the destination node, the start and end times of the transmission, and the number of interactions. After the destination node successfully receives the message, it broadcasts into the network an acknowledgement data packet containing only a packet header, which carries the successful-transmission-path information together with the destination node information. A node that receives this header message checks its interaction records: if it appears on the successful transmission path, it adds its previous-hop and next-hop nodes to its own successful-cooperative-transmission node table; for nodes already present in the table, only the recorded data are updated, namely the start and end times of the transmission and the accumulated transmission count. The nodes in this table can be regarded as friend nodes of the current node. If an interaction record is not found on any successful transmission path, it is automatically cleared after a certain time.
After the network has operated successfully for a period of time, each node has its own friend nodes. Influenced by node movement speed and transmission radius, the contact interval between friend nodes obeys a negative exponential distribution; the contact intervals are therefore assumed to follow a negative exponential distribution, and the contact probability between friend nodes is estimated.
When B is a friend node of A, P_A,B(T) denotes the probability that nodes A and B come into contact within time T, and θ_A,B denotes the mean of the negative exponential distribution of the contact intervals between A and B:
P_A,B(T) = 1 − e^(−T/θ_A,B)
The historical transmission interval records are obtained through successful cooperative transmissions between the nodes, and the value of θ_A,B can be estimated by the maximum likelihood method, which yields
θ̂_A,B = (1/n) · Σ_(i=1..n) x_i
where n is the total number of historical transmission intervals that can be acquired and x_i is the i-th transmission interval; since the recorded number of successful interactions with friend nodes in a mobile opportunistic network is finite, the value of n differs between nodes;
therefore, the mean of the contact-interval exponential distribution can be estimated by the statistical average of the friend nodes' contact intervals, and the contact probability of nodes A and B within time T is given by
P_A,B(T) = 1 − e^(−T/x̄_A,B)
where x̄_A,B is the statistical average of the historical transmission intervals:
x̄_A,B = (1/n) · Σ_(i=1..n) x_i
in the embodiment, the routing is carried out by calculating and measuring the contact probability between the friend nodes, so that the safety of information transmission can be effectively improved, and the effective transmission of data packets can be guaranteed.
Sixth embodiment: this embodiment is described with reference to FIG. 2 and further limits the method of the third embodiment. In this embodiment, the node relay ratio is:
P_ret = P_A,B(T) / N_r
where P_ret is the node relay ratio and N_r is the number of messages the node has received.
The node relay ratio P_ret described in this embodiment well reflects a node's ability to forward messages; introducing this element allows failed underwater nodes to be identified and underwater voids to be avoided, realizing safe and reliable message transmission.
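A minimal sketch of the friend-node contact probability and the node relay ratio follows; the sample interval data, the contact window T, and the guard against division by zero are illustrative assumptions.

```python
import math
from statistics import mean

def contact_probability(intervals: list[float], t: float) -> float:
    """P_A,B(T) = 1 - exp(-T / mean(x_i)): negative exponential contact model,
    with the distribution mean estimated by the statistical average (the MLE)."""
    if not intervals:
        return 0.0
    return 1.0 - math.exp(-t / mean(intervals))

def relay_ratio(intervals: list[float], t: float, n_received: int) -> float:
    """P_ret = P_A,B(T) / N_r, where N_r is the number of messages received."""
    return contact_probability(intervals, t) / max(n_received, 1)  # avoid /0

# Example: five historical transmission intervals (seconds), a 60 s window,
# and a node that has received 12 messages.
history = [30.0, 45.0, 25.0, 60.0, 40.0]
print(round(contact_probability(history, 60.0), 3),
      round(relay_ratio(history, 60.0, 12), 4))
```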
Seventh embodiment: this embodiment is described with reference to FIG. 1. In this embodiment, according to the comprehensive trust value of the evaluated node output by the fuzzy logic system, reinforcement learning is used to perform routing selection and to set the state-action value update function and the reward function, comprising the following steps:
determining the comprehensive trust value of the encountered node by the fuzzy logic method, using the Q-learning strategy of reinforcement learning to find a suitable forwarding path for the message, and defining the update formula of the state-action value Q as:
Q_d(s,x) ← (1 − α)·Q_d(s,x) + α·[ R_d(s,x) + γ_d(s,x)·max_(y∈N_x) Q′_d(x,y) ]
where Q_d(s,x) is the state-action value of selecting node x as the next-hop forwarding node at node s for a data packet destined for node d, i.e. the forwarding utility Q value of node s forwarding a packet destined for d to node x; on each update the corresponding Q value stored in the state-action value table is taken out and substituted into the formula, and the updated value is stored back into the table; α is the learning coefficient, 0 ≤ α ≤ 1; γ_d(s,x) is the dynamic discount factor for forwarding a packet destined for d from node s to node x; N_x denotes the contact-node set of node x, containing all nodes encountered during the movement of node x; R_d(s,x) denotes the immediate return (the reward function defined below); and Q′_d(x,y) is the state-action value weighted by the node comprehensive trust value, introduced to guarantee the security dynamics of the mobile opportunistic network.
Assuming the learning rate α is 1, the state-action value update formula becomes:
Q_d(s,x) = R_d(s,x) + γ_d(s,x)·max_(y∈N_x) Q′_d(x,y)
Q′_d(x,y) = CTValue(x,y)·Q_d(x,y)
The dynamic discount factor γ_d(s,x) is
γ_d(s,x) = γ·e^(CTValue(s,x) − 1)
where γ is a fixed constant, γ ∈ (0, 1).
As the data packets described in this embodiment are transmitted through the network, the reward value and the dynamic discount factor gradually take effect through iterative updates, so that relay nodes with higher comprehensive trust values are selected for the data packets; since the comprehensive trust value reflects both network security and efficiency, network transmission performance is improved.
Eighth embodiment: this embodiment is described with reference to FIG. 1. In this embodiment, the reward function is an immediate return value, a function of the node comprehensive trust value, serving as positive feedback to the nodes on a successfully transmitted path:
R_d(s,x) = e^(CTValue(s,x)) − 1  if node x is the destination node d
R_d(s,x) = 0                     otherwise
where CTValue(s,x) denotes the comprehensive trust value of encountered node x as evaluated by node s, obtained through the fuzzy logic system; if node s forwards the data packet to its destination node d, the immediate return value obtained is e^(CTValue(s,x)) − 1, and otherwise it is 0; the larger the node's comprehensive trust value, the larger the immediate return obtained.
The positive feedback is the feedback issued after the message is successfully delivered to the destination node.
This embodiment uses a greedy strategy to update the state-action values, selecting the maximum trust-weighted state-action value from the action set of node x,
max_(y∈N_x) Q′_d(x,y),
for the iterative update of the state-action values. In underwater unmanned vehicle networking, the encountered-node trust value plays an important role in secure route design, so incorporating the encountered node's comprehensive trust value into the state-action value update guarantees the safety and reliability of the transmission path.
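A minimal sketch of this trust-weighted update follows; the table layout (a dictionary keyed by destination, state, and action), the CTValue lookup, and the example numbers are illustrative assumptions, and the learning rate α is kept general rather than fixed at 1.

```python
import math
from collections import defaultdict

GAMMA = 0.8   # fixed constant gamma in (0, 1); the value is an assumption
ALPHA = 1.0   # learning coefficient, 0 <= alpha <= 1

Q = defaultdict(float)        # Q[(d, s, x)]: forwarding utility Q_d(s, x)
contacts = defaultdict(set)   # contacts[x]: contact-node set N_x of node x

def reward(d, x, ct_sx: float) -> float:
    """Immediate return: e^CTValue(s,x) - 1 if x is the destination d, else 0."""
    return math.exp(ct_sx) - 1.0 if x == d else 0.0

def dynamic_discount(ct_sx: float) -> float:
    """gamma_d(s, x) = gamma * e^(CTValue(s,x) - 1); stays below GAMMA."""
    return GAMMA * math.exp(ct_sx - 1.0)

def update_q(d, s, x, ct) -> None:
    """One update of Q_d(s, x); ct maps node pairs to comprehensive trust values."""
    # Greedy step: maximum trust-weighted value Q'_d(x, y) over the set N_x.
    best = max((ct[(x, y)] * Q[(d, x, y)] for y in contacts[x]), default=0.0)
    target = reward(d, x, ct[(s, x)]) + dynamic_discount(ct[(s, x)]) * best
    Q[(d, s, x)] = (1 - ALPHA) * Q[(d, s, x)] + ALPHA * target

# Example: node s=1 meets x=2 with a packet destined for d=9; node 2 has met {3, 9}.
contacts[2] = {3, 9}
ct = defaultdict(lambda: 0.6)  # assumed uniform comprehensive trust values
Q[(9, 2, 9)] = 1.0             # seed: node 2 already has utility toward node 9
update_q(9, 1, 2, ct)
print(round(Q[(9, 1, 2)], 4))
```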
The data packet containing only the header in the embodiment of the present invention is shown in table 1.
(Table 1, the format of the header-only data packet, is reproduced as an image in the original publication.)
When node x receives the message forwarded by node s: if the selected relay node x is not the final destination of the data packet, node x adds its own related information to the path information in the packet header, the packet sequence number Packet_id being kept unchanged. When node s subsequently receives the forwarded data packet broadcast by node x, that packet acts as an acknowledgement: node s extracts the information related to node x from the packet header to replace its previous information about node x and then calculates node x's comprehensive trust value; node s obtains an immediate return of 0 at this point and updates the state-action value Q_d(s,x) corresponding to node x.
If node x is the final destination of the data packet, it does not need to forward the packet further; it only broadcasts to the other nodes a load-free message packet carrying its own information, with the packet sequence number Packet_id set to −1, indicating that the packet is used to update the other nodes' information.
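The following sketch mirrors the forwarding and implicit-acknowledgement behavior just described; the Packet fields and the handling shown are illustrative assumptions based on this embodiment.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    packet_id: int               # sequence number Packet_id; -1 marks an update packet
    dest: int                    # final destination node of the data packet
    path: list[int] = field(default_factory=list)  # path information in the header

def on_receive(node_id: int, pkt: Packet) -> Packet:
    """Return the packet this node broadcasts after receiving pkt."""
    if node_id == pkt.dest:
        # Destination: broadcast a load-free packet with own info, Packet_id = -1.
        return Packet(packet_id=-1, dest=pkt.dest, path=pkt.path + [node_id])
    # Relay: append own information to the header path; Packet_id is kept, and
    # the previous sender treats the overheard rebroadcast as an acknowledgement.
    return Packet(packet_id=pkt.packet_id, dest=pkt.dest, path=pkt.path + [node_id])

# Example: relay node 2 forwards a packet from node 1 that is destined for node 9.
out = on_receive(2, Packet(packet_id=7, dest=9, path=[1]))
print(out.packet_id, out.path)  # -> 7 [1, 2]
```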
TABLE 2 State-action value update and packet forwarding procedures
(Table 2 is reproduced as images in the original publication.)
Ninth embodiment: the reinforcement-learning-based underwater unmanned vehicle safety opportunity routing apparatus according to this embodiment comprises:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing a reinforcement learning-based underwater unmanned vehicle safety opportunity routing method as in any of the above embodiments.
Tenth embodiment: a computer device according to this embodiment comprises a memory in which a computer program is stored and a processor; when the processor executes the computer program stored in the memory, it performs a reinforcement-learning-based underwater unmanned vehicle safety opportunity routing method as described in any of the above embodiments.

Claims (10)

1. An underwater unmanned vehicle safety opportunity routing method based on reinforcement learning is characterized by comprising the following steps:
primarily screening nodes in a communication range by using an underwater unmanned vehicle, and establishing a trust evaluation model according to the primarily screened nodes;
evaluating the preliminarily screened nodes by using a trust evaluation model, wherein evaluation elements of the evaluation model consist of a direct trust value DTvalue and an indirect trust value ITvalue;
inputting the evaluation elements into a fuzzy logic system to obtain the comprehensive trust value of the evaluated node, and updating it into the dynamic table of encountered-node trust values;
according to the comprehensive trust value of the evaluated node output by the fuzzy logic system, performing routing selection by using reinforcement learning, setting a state-action value update function, and setting a reward function.
2. The reinforcement learning-based underwater unmanned vehicle safety opportunity routing method as claimed in claim 1, wherein the process by which the underwater unmanned vehicle primarily screens the nodes within its communication range and establishes the trust evaluation model from the primarily screened nodes comprises the following steps:
the underwater unmanned vehicle node carrying the message broadcasts to the other nodes within its communication range, requests them to feed back their node information, acquires their data packets, performs a primary screening according to the indirect trust value ITValue in each packet, and selects the nodes whose indirect trust value exceeds a threshold as candidate relay nodes for further evaluation.
3. The reinforcement learning-based underwater unmanned vehicle safety opportunity routing method according to claim 1, wherein the direct trust value DTValue evaluation elements are selected as: 1. inter-node communication quality, estimated from the relative distance between nodes, which is computed from the send/receive time difference of node data packets; 2. node familiarity; 3. node relay ratio;
the indirect trust value ITValue guarantees the objectivity of the evaluation of the current node: each node maintains a dynamic trust value table recording the comprehensive trust values that other nodes have assigned to it, and the average of the data in this table is output as the indirect trust value.
4. The reinforcement learning-based underwater unmanned vehicle safety opportunity routing method according to claim 3, wherein the path loss estimated from the relative distance between nodes measures the inter-node communication quality, the path loss A(d, f) experienced by any pair of nodes over the underwater acoustic channel being:
A(d, f) = A0 · d^k · α1(f)^d
10·log A(d, f) = 10·log A0 + k·10·log d + d·10·log α1(f)
where f is the signal frequency in kHz, d is the distance in m, A0 is a unit normalization constant, k is the propagation factor characterizing the geometry of the propagation, and α1 is the absorption factor.
5. The reinforcement learning-based underwater unmanned vehicle safety opportunity routing method of claim 3, wherein the node familiarity comprises:
each node records its interactions with the previous-hop and next-hop nodes, including the counterpart's node number, the destination node, the start and end times of the transmission, and the number of interactions;
after receiving the message, the destination node broadcasts into the network an acknowledgement data packet containing only a packet header, which carries the successful-transmission-path information together with the destination node information;
a node that receives the header message checks its interaction records; if it appears on the successful transmission path, its previous-hop and next-hop nodes enter its own successful-cooperative-transmission node table, and it is judged whether those nodes already exist in the table:
for nodes already in the table, only the recorded data are updated, namely the start and end times of the transmission and the accumulated transmission count; the nodes in the table can be regarded as friend nodes of the current node;
if an interaction record is not found on any successful transmission path, it is automatically cleared after a certain time;
after the network has operated successfully, each node has its own friend nodes; influenced by node movement speed and transmission radius, the contact interval between friend nodes obeys a negative exponential distribution, so the contact intervals are modeled as negatively exponentially distributed and the contact probability between friend nodes is estimated:
P_A,B(T) = 1 − e^(−T/x̄_A,B)
where B is a friend node of A, P_A,B(T) denotes the probability that nodes A and B come into contact within time T, n is the total number of acquired historical transmission intervals, and x_i is the i-th transmission interval; in a mobile opportunistic network the recorded number of successful interactions with friend nodes is finite, so the value of n differs between nodes; x̄_A,B = (1/n) · Σ_(i=1..n) x_i is the statistical average of the historical transmission intervals.
6. The reinforcement learning-based underwater unmanned vehicle safety opportunity routing method of claim 3, wherein the node relay ratio is:
P_ret = P_A,B(T) / N_r
where P_ret is the node relay ratio and N_r is the number of messages the node has received.
7. The reinforcement learning-based underwater unmanned vehicle safety opportunity routing method of claim 1, wherein, according to the comprehensive trust value of the evaluated node output by the fuzzy logic system, reinforcement learning is used to perform routing selection and to set the state-action value update function and the reward function, comprising the steps of:
determining the comprehensive trust value of the encountered node by the fuzzy logic method, using the Q-learning strategy of reinforcement learning to find a suitable forwarding path for the message, and defining the update formula of the state-action value Q as:
Q_d(s,x) ← (1 − α)·Q_d(s,x) + α·[ R_d(s,x) + γ_d(s,x)·max_(y∈N_x) Q′_d(x,y) ]
where Q_d(s,x) is the state-action value of selecting node x as the next-hop forwarding node at node s for a data packet destined for node d, i.e. the forwarding utility Q value of node s forwarding a packet destined for d to node x; on each update the corresponding Q value stored in the state-action value table is taken out and substituted into the formula, and the updated value is stored back into the table; α is the learning coefficient, 0 ≤ α ≤ 1; γ_d(s,x) is the dynamic discount factor for forwarding a packet destined for d from node s to node x; N_x denotes the contact-node set of node x, containing all nodes encountered during the movement of node x; R_d(s,x) denotes the immediate return (the reward function of claim 8); and Q′_d(x,y) is the state-action value weighted by the node comprehensive trust value, introduced to guarantee the security dynamics of the mobile opportunistic network;
the dynamic discount factor γ_d(s,x) is
γ_d(s,x) = γ·e^(CTValue(s,x) − 1)
where γ is a fixed constant, γ ∈ (0, 1).
8. The reinforcement learning-based underwater unmanned vehicle safety opportunity routing method of claim 1, wherein the reward function is an immediate return value, a function of the node comprehensive trust value, serving as positive feedback to the nodes on a successfully transmitted path:
R_d(s,x) = e^(CTValue(s,x)) − 1  if node x is the destination node d
R_d(s,x) = 0                     otherwise
wherein CTValue(s,x) denotes the comprehensive trust value of encountered node x as evaluated by node s;
the positive feedback is the feedback issued after the message is successfully delivered to the destination node.
9. An underwater unmanned vehicle safety opportunity routing device based on reinforcement learning, comprising:
one or more processors;
a memory; and
one or more programs, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing a reinforcement learning-based underwater unmanned vehicle safety opportunity routing method of any of claims 1-8.
10. A computer device, characterized by: comprising a memory and a processor, the memory having a computer program stored therein, the processor when executing the computer program stored in the memory performing a reinforcement learning-based underwater unmanned vehicle safety opportunity routing method according to any one of claims 1-8.
CN202111176454.0A 2021-10-09 2021-10-09 Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning Active CN114025405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111176454.0A CN114025405B (en) 2021-10-09 2021-10-09 Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111176454.0A CN114025405B (en) 2021-10-09 2021-10-09 Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN114025405A true CN114025405A (en) 2022-02-08
CN114025405B CN114025405B (en) 2023-07-28

Family

ID=80055812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111176454.0A Active CN114025405B (en) 2021-10-09 2021-10-09 Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114025405B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692551A (en) * 2022-03-22 2022-07-01 中国科学院大学 Method for detecting safety key signals of Verilog design files

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180129974A1 (en) * 2016-11-04 2018-05-10 United Technologies Corporation Control systems using deep reinforcement learning
CN109547351A (en) * 2019-01-22 2019-03-29 西安电子科技大学 Method for routing based on Q study and trust model in Ad Hoc network
CN111065145A (en) * 2020-01-13 2020-04-24 清华大学 Q learning ant colony routing method for underwater multi-agent
US20210111988A1 (en) * 2019-10-10 2021-04-15 United States Of America As Represented By The Secretary Of The Navy Reinforcement Learning-Based Intelligent Control of Packet Transmissions Within Ad-Hoc Networks
CN112954769A (en) * 2021-01-25 2021-06-11 哈尔滨工程大学 Underwater wireless sensor network routing method based on reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180129974A1 (en) * 2016-11-04 2018-05-10 United Technologies Corporation Control systems using deep reinforcement learning
CN109547351A (en) * 2019-01-22 2019-03-29 西安电子科技大学 Method for routing based on Q study and trust model in Ad Hoc network
US20210111988A1 (en) * 2019-10-10 2021-04-15 United States Of America As Represented By The Secretary Of The Navy Reinforcement Learning-Based Intelligent Control of Packet Transmissions Within Ad-Hoc Networks
CN111065145A (en) * 2020-01-13 2020-04-24 清华大学 Q learning ant colony routing method for underwater multi-agent
CN112954769A (en) * 2021-01-25 2021-06-11 哈尔滨工程大学 Underwater wireless sensor network routing method based on reinforcement learning

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692551A (en) * 2022-03-22 2022-07-01 中国科学院大学 Method for detecting safety key signals of Verilog design files

Also Published As

Publication number Publication date
CN114025405B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
US6493759B1 (en) Cluster head resignation to improve routing in mobile communication systems
KR101091740B1 (en) Method for adjusting a transmitting power in a wire-less communications network
US20020071395A1 (en) Mechanism for performing energy-based routing in wireless networks
CN103118439B (en) based on the data fusion method of sensor network node universal middleware
Patil et al. Serial data fusion using space-filling curves in wireless sensor networks
CN109936866B (en) Wireless mesh network opportunistic routing method based on service quality guarantee
CN110324805B (en) Unmanned aerial vehicle-assisted wireless sensor network data collection method
US9621411B2 (en) Relaying information for an unreliably heard utility node
CN111049743B (en) Joint optimization underwater sound multi-hop cooperative communication network routing selection method
US9485676B2 (en) Wireless communication device and method for searching for bypass route in wireless network
US10798158B1 (en) Network system and decision method
CN106658539B (en) Mobile path planning method for mobile data collector in wireless sensor network
US10193661B2 (en) Communication device, non-transitory computer readable medium and wireless communication system
CN114025405A (en) Underwater unmanned vehicle safety opportunity routing method and device based on reinforcement learning
CN116261202A (en) Farmland data opportunity transmission method and device, electronic equipment and medium
US10313956B2 (en) Communication method within a dynamic-depth cluster of communicating electronic devices, communicating electronic device implementing said method and associated system
CN114430581B (en) Ant colony strategy-based AC-OLSR routing method, equipment and medium
JP5821467B2 (en) Wireless terminal
US8825104B2 (en) Wireless communication apparatus, wireless communication system and transmitting power control method
CN113347679A (en) Data transmission method and device, storage medium and electronic device
CN113163411B (en) Satellite network clustering method and device, electronic equipment and storage medium
JP2001128231A (en) Variable area adhoc network
CN110831006A (en) Ad hoc network system and data transmission method thereof
US11968252B2 (en) Peer selection for data distribution in a mesh network
KR100874009B1 (en) Repeater selection method in mobile communication system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant