CN112351400B

CN112351400B - Underwater multi-modal network routing strategy generation method based on improved reinforcement learning

Info

Publication number: CN112351400B
Application number: CN202011103398.3A
Authority: CN
Inventors: 刘春凤; 赵昭; 曲雯毓; 广晓芸; 余涛
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2022-03-11
Anticipated expiration: 2040-10-15
Also published as: CN112351400A

Abstract

The invention discloses an underwater multi-modal network routing strategy generation method based on improved reinforcement learning, which comprises the following steps: in the off-line stage of the routing strategy implementation initial stage: preliminarily learning the transmission relationship between the network nodes in an iterative mode from the water surface sink node, so that each node obtains the maximum transmission benefit from each information value quantity grade data to the sink node; in the online phase of network operation: the expected income of the water surface sink node is obtained by combining the relay node and the transmission frequency band for each node through a reinforcement learning model, so that a transmission path suitable for data with different information value levels is constructed; the method reduces the transmission delay of high information value data; reduce and balance network energy consumption, prolong network operation time.

Description

Underwater multi-modal network routing strategy generation method based on improved reinforcement learning

Technical Field

The invention mainly relates to the technical field of underwater wireless sensor networks, in particular to an underwater multi-modal network routing strategy generation method based on improved reinforcement learning.

Background

The underwater wireless sensor network can help people know and understand the sea more conveniently, obtain valuable marine data information, and improve the monitoring and predicting capability of the marine environment and the capability of processing marine emergencies. The system can be widely applied to marine information acquisition, environment monitoring, deep sea detection, disaster prediction, auxiliary navigation, distributed tactical monitoring and the like. Various oceans are increasingly applied, and the requirements on ocean data transmission performance are different due to different application types and different time sensitivity. The underwater wireless sensor network provider needs to consider how to further optimize the network performance on the premise of meeting the data transmission requirements of marine applications, so that the network benefit is improved.

Specifically, subsea data typically includes an event type and an event timeliness, which may be referred to as a data value measure. The more important the event type of a piece of data is, the stronger the event timeliness is, and the higher the data value quantity of the data is; the data needs to be transmitted quickly and conversely can be transmitted slowly in order to improve network performance. Currently, in order to improve the transmission efficiency of ocean data, a multi-mode underwater wireless sensor network is proposed. In the network, the sensor nodes are provided with a plurality of non-interfering underwater communication module combinations which can simultaneously communicate, for example: the underwater acoustic communication and underwater optical communication are combined, or the underwater acoustic communication combination containing a plurality of frequency bands which are mutually orthogonal is combined. In addition, the nodes of the underwater wireless sensor network are usually powered by batteries, the energy of the batteries of the nodes is limited, and underwater energy charging is difficult; therefore, for a multi-modal underwater wireless sensor network, one of the most fundamental problems is: under the premise of meeting the data transmission delay requirements of different marine applications, a network provider needs to design a routing strategy suitable for an underwater dynamic communication environment, so that the network energy consumption is further reduced and balanced, and the service life of the network is prolonged.

However, as far as we know, the existing reinforcement learning-based multi-modal underwater wireless sensor network does not comprehensively consider the data value quantity and the network life of marine applications. For example, a journal article "MarLIN-Q," Multi-modal communications for reliable and low-latency underserver data delivery, "proposes a Multi-modal underwater wireless sensor network routing strategy based on reinforcement learning, which aims at minimizing transmission delay and improving data transmission reliability, and dynamically selects a relay node and a communication frequency band according to information fed back by a current neighbor node. Although the method can effectively reduce transmission delay and improve the success rate of data transmission, the method does not analyze data transmission characteristics, data value quantity and balance network energy consumption; therefore, the problems of high energy consumption of algorithm operation, high transmission delay of part of important data, short service life of a network and the like are caused. The invention provides an underwater multi-mode network transmission strategy generation method based on improved reinforcement learning, aiming at the transmission problem of an underwater wireless sensor network containing multi-type data. The network energy consumption is balanced to prolong the service life of the network while the transmission delay of high-value data is effectively reduced.

Disclosure of Invention

In order to solve the technical problems, the invention provides an underwater multi-mode network transmission strategy generation method based on improved reinforcement learning, which can reduce the transmission delay of high information value data; reduce and balance network energy consumption, prolong network operation time.

Aiming at the problems in the prior art, the invention adopts the following technical scheme:

an underwater multi-mode network routing strategy generation method based on improved reinforcement learning,

in the off-line stage of the routing strategy implementation initial stage: preliminarily learning the transmission relationship between the network nodes in an iterative mode from the water surface sink node, so that each node obtains the maximum transmission benefit from each information value quantity grade data to the sink node;

in the online phase of network operation: the expected income of the water surface sink node is obtained by combining the relay node and the transmission frequency band for each node through a reinforcement learning model, so that a transmission path suitable for data with different information value levels is constructed.

Further, each node obtains the maximum transmission benefit in the off-line stage of the initial stage of the implementation of the routing strategy:

s1, the sink node on the water surface generates an advertisement packet for each transmission frequency band combination; then the packet is sent out in a broadcast mode through the corresponding transmission frequency band combination;

s2, the underwater node calculates the final reward function of the convergent node from the underwater node to the water surface through the reward function, namely the final reward function

Wherein: the reward function is embodied as

Wherein Nr (i) is a node n_iNode set for receiving ADV data packet, g represents node n_iSlave node n_jTransmission of received ADV packetID, G of transmission band combination_ijRepresenting a node n_iSlave node n_jA set of transmission band combinations of one ADV packet received,

representing a node n_iTransmitting data with information value quantity level l to node n by using transmission mode g_jThe cost of the transfer of time;

s3, broadcasting the advertisement packet containing the ID information of the transmission frequency band combination in a broadcasting mode according to the final reward value of each underwater node;

and S4, judging whether all the underwater nodes obtain the final reward value of the sink nodes from the underwater nodes to the water surface.

Further, a transmission path process suitable for data with different information value levels is constructed in the online stage of network operation:

s1, when the underwater nodes transmit data packets, each node calculates the current state S by using a revenue function according to the information value quantity level l of the data_hTimely profit per action a of

S2, calculating the current state S by the underwater node by using the Q value function_hThe final profit Q of each action a^π(s_h,a)；

S3, the underwater node is according to the current state S_hThe final profit Q of each action a^π(s_hA) calculating a profit value for the optimal strategy and the optimal strategy, wherein the optimal strategy is calculated as expressed by

In the formula

Representing a node n_iIn a state s_hOptimal for data with lower transmission information value quantity level lAnd (4) strategy.

Advantageous effects

1. The invention initially learns the transmission relationship between the network nodes in an iterative mode from the water surface sink node, so that each node obtains the maximum transmission benefit from each information value level data to the sink node. Then, in the online stage of network operation, a multi-level link cost function comprehensively considering link communication delay, node residual energy and transmission load is designed through a reinforcement learning model, so that expected income of each node reaching a water surface sink node by adopting different transmission strategies (combination of a relay node and a transmission frequency band) is calculated, and a transmission path suitable for data of different information value quantities is constructed; and the nodes distribute corresponding paths for data transmission according to the information value quantity grades of the collected data. Generally, a path with high transmission efficiency transmits data with high information value quantity, so that the time delay of the data with high information value quantity is reduced; meanwhile, the common goals of balancing network energy consumption and reducing data time delay are achieved, and data with low information value quantity are transmitted through a path with high energy efficiency. Therefore, the network can reduce the data transmission delay and balance the network energy at the same time, and the service life of the network is prolonged.

2. The invention designs a multi-mode underwater wireless sensor network routing strategy suitable for various data transmission by utilizing a reinforcement learning model, can adaptively and dynamically select a transmission path for a data packet, meets the requirement of marine application on data delay and prolongs the service life of the network.

3. The invention utilizes an iterative method to quickly obtain network connection and transmission delay information when the network does not start to operate, thereby accelerating the convergence speed of the reinforcement learning model adopted in the online selection stage and reducing energy consumption.

Drawings

FIG. 1 is a flow chart of an underwater multi-modal network routing strategy generation method based on improved reinforcement learning according to the present invention

The specific implementation mode is as follows:

for a more clear description of the embodiment, it is assumed that there are K levels of information value in the network that need to be transmitted; each node has G combinations of transmission bands. The following detailed description of the specific modes, structures, features and functions of the underwater data routing strategy designed according to the present invention will be provided with reference to fig. 1.

1. Off-line training phase

Step 1: a sink node (sink node) positioned on the water surface generates an advertisement packet (ADV packet) for each transmission frequency band combination; then the packet is sent out in a broadcast form through the corresponding transmission frequency band combination, and the back-off time T is set_bThe timer is started at 0. The ADV contains sink node coordinate information and back-off time T_bEach information value quantity level being the final reward Re of the data_s(l) And currently broadcasting transmission band combination ID information of the ADV packet.

Step 2: suppose a certain node n_iReceives n from a certain node (including a sink node)_jA certain transmission band combination g, node n_iStores the information in the ADV packet and waits for a time T at this moment_wAnd starting timing. When T is_wWhen the predetermined value is reached, the node n_iCalculating the final reward of sending data with the information value quantity level of l to the sink node through a reward function

When it gives way for time T_bArrival deadline, node n_iThe ADV packet is transmitted in a broadcast form through the corresponding transmission frequency band combination, and the node n_iIncluding its ID, coordinates, each information value level

The transmission band combination ID information of the ADV packet is currently broadcast.

Wherein the waiting time T_wIs a fixed value in order to enable the node to collect ADV packets from other nodes more comprehensively.

Wherein the reward function is expressed as formula (1)

Wherein Nr (i) is a node n_iA set of nodes for ADV packets is received. g represents a node n_iSlave node n_jID of the transmission band combination of the received one ADV packet. G_ijRepresenting a node n_iSlave node n_jA set of transmission band combinations of one ADV packet received.

Representing a node n_iTransmitting data with information value quantity level l to node n by using transmission mode g_jThe cost of the transfer of time.

Wherein the transmission costs

Is expressed by formula (2)

Wherein beta (l) is an adjustment coefficient corresponding to the information value quantity level of data as l, and is used for adjusting the transmission efficiency cost

And energy efficiency cost

Weight therebetween, β (l) is ∈ [0,1 ]]。

Wherein the transmission efficiency costs

Is expressed as formula (3)

In the formula

Representing nodesn_iTransmitting data packets with information value quantity level l to node n by using transmission mode g_jThe time of transmission. PT_ijRepresenting data packets under water from node n_iTo node n_jThe propagation time of (c). TR (transmitter-receiver)_maxRepresenting a node n_iCombining the transmission frequency bands with the lowest transmission rate to the node n_jThe transmission time of the transmission packet. PT_maxRepresenting the propagation time of the data packet when it propagates underwater to the maximum communication distance of all the transmission band combinations.

Data packet with information value quantity grade x at node n_jThe queuing time in the transmission queue.

Wherein the cost of energy efficiency

Is expressed as formula (4)

In the formula E₀Representing the node initial energy value. Er_jRepresenting a node n_jThe remaining energy of (c).

Representing a node n_iTransmitting data with information value quantity level l to node n by using transmission mode g_jThe transmission power consumption of (2). e.g. of the type_maxData with information value quantity grade of l is transmitted to node n by using transmission frequency band combination with maximum energy consumption_jThe transmission power consumption of (2).

Wherein the back-off time T_bIs expressed as formula (5)

Representing a node n_iTransmitting data packets with information value quantity level l to node n by using transmission mode g_jThe time of transmission. PT_ijRepresenting data packets under water from node n_iTo node n_jThe propagation time of (c). TW is the waiting time.

And 4, step 4: repeating the step 3; until all nodes obtain their Re for each information value level_ni(l) In that respect The above steps are only run a limited number of times according to the underwater communication environment in the off-line phase performed by the transmission strategy generation method of the present invention.

2. On-line selection phase

At this stage, the invention adopts a data transmission strategy selection method based on reinforcement learning, so that the nodes dynamically select the next hop relay node and the corresponding transmission frequency band combination according to the information value quantity grade of the data packet. The reinforced learning model mainly comprises six components: the system comprises an agent, a state set S, an action set A, a strategy set pi, benefits R and a state transition probability matrix P. In the method, an intelligent agent is an underwater sensor node, and a state set S is composed of retransmission times h, successful transmission suc and data discarding drop. The action set is formed by combining the relay nodes and the corresponding transmission frequency bands. A policy set is composed of a combined mapping of states and actions. The benefit represents the corresponding reward when the node adopts a certain strategy. The state transition probability matrix represents a probability matrix of the current state of the node being transferred to some other state; in our method, the state transition includes 1) a transition from a retransmission number H state to a retransmission number H +1, 2) a transition from a retransmission number H state to a transmission success suc, and 3) a transition from a retransmission number H to a data discard drop state when a maximum retransmission number H is reached.

Step 1: when any node n_iWhen data is required to be transmitted, it is based on the information price of the data to be transmittedValue class l, calculating current state s by using gain function_hEach action a ═ n_jTimely gain of k

Wherein the revenue function is expressed as formula (6)

In the formula

Representing a node n_iUsing the action a ═ n_j,k>Temporal state slave s_hTransition to the probability of transmitting a successful suc state.

Representing a node n_jIn a state s₀Maximum benefit for transmitting data with information value volume level l. It is important to note that at the start of the network operation, each node n_iObtain its own initial

By passing

s₀Indicating that the number of retransmissions is 0. H denotes the maximum number of retransmissions of the data. h represents the current number of retransmissions of the data. ε represents an adjustment coefficient by which the data transmission success rate can be improved, and is usually set to [0,10 ]]。

Representing a node n_iTransmitting data with information value quantity level l to node n by using transmission mode g_jThe transmission cost of time is expressed by the formula (2).

Wherein

Is expressed as formula (7)

Wherein f represents the frequency band of one underwater acoustic communication module in the transmission frequency band combination k. P_ij(f) Representing a node n_iUsing frequency band f to node n_jA data packet transmission success rate when transmitting data; it may be composed of node n in general_jListening to node n_iDividing the number of transmitted data packets by the number of nodes n_iThe number of data packets actually transmitted in total is determined.

Step 2: then, the node n_iCalculating the current state s using a Q-value function_hThe final profit Q of each action a^π(s_h,a)。

Wherein the Q value function is expressed by the formula (8)

In the formula

Representing a node n_iState slave s when action a is taken_hTransition to State s'_hThe probability of (c). Gamma is the discount coefficient, and takes on the value range [0,1 ].

Representing a node n_jAt state s'_hMaximum benefit for transmitting data of rank l, its expression is formula (9)

In the formula

To representNode n_iIn a state s_hThe final benefit of action a is then taken and is calculated from equation (8).

And step 3: then, the node n_iAccording to the current state s_hThe final profit Q of each action a^π(s_hA) calculating a profit value for the optimal strategy, and the optimal strategy

(i.e. the combination of the relay node selected by the node and the corresponding communication frequency band under the current retransmission times).

Wherein the optimal strategy is represented by formula (10)

In the formula

Representing a node n_iIn a state s_hAnd (5) the optimal strategy adopted for transmitting data with the information value quantity level l is adopted.

The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. An underwater multi-mode network routing strategy generation method based on improved reinforcement learning is characterized by comprising the following steps:

in the online phase of network operation: the expected income of the water surface sink node is obtained by combining the relay node and the transmission frequency band for each node through a reinforcement learning model, so that a transmission path suitable for data with different information value levels is constructed; wherein:

the process of constructing transmission paths suitable for data with different information value levels at the online stage of network operation comprises the following steps:

S3, the underwater node is according to the current state S_hThe final profit Q of each action a^π(s_hAnd a) calculating the profit value of the optimal strategy and the optimal strategy, wherein the optimal strategy is calculated and expressed by the following formula:

in the formula

2. The underwater multi-modal network routing strategy generation method based on the improved reinforcement learning of claim 1 is characterized in that: the method comprises the following steps of obtaining the maximum transmission profit of each node in an off-line stage at the initial stage of routing strategy implementation:

Wherein: the reward function is embodied as

Wherein Nr (i) is a node n_iNode set for receiving ADV data packet, g represents node n_iSlave node n_jID, G of a received transmission band combination of an ADV packet_ijRepresenting a node n_iSlave node n_jA set of transmission band combinations of one ADV packet received,