CN114423061B - Wireless route optimization method based on attention mechanism and deep reinforcement learning - Google Patents
- Publication number
- CN114423061B CN114423061B CN202210068572.8A CN202210068572A CN114423061B CN 114423061 B CN114423061 B CN 114423061B CN 202210068572 A CN202210068572 A CN 202210068572A CN 114423061 B CN114423061 B CN 114423061B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
- H04W40/04—Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
- H04W40/10—Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources based on available power or energy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
- H04W40/12—Communication route or path selection, e.g. power-based or shortest path routing based on transmission quality or channel quality
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention relates to a network routing method, in particular to a wireless routing optimization method based on an attention mechanism and deep reinforcement learning. When each node accesses the network, it acquires the current latest decision model parameters from a server and monitors neighbor node information; the node builds a candidate parent node set, models the information of the m candidate parent nodes with the largest energy as a graph vector, and uses it as input; a CNN-based attention mechanism extracts the features of the graph vector, and deep reinforcement learning selects the optimal parent node as the relay node for data transmission. After each data period ends, the node counts the relevant performance indexes of its data transmission; a similarity quantization function maps the performance indexes into the reward value of the node under the corresponding state and action, and the node transmits the experience information acquired in the data period to the server. The method has high scalability and is applicable to scenarios in which the nodes in the network change dynamically.
Description
Technical Field
The invention relates to a network routing method, in particular to a wireless routing optimization method based on an attention mechanism and deep reinforcement learning.
Background
In recent years, Internet of Things technology has continuously produced new achievements. The wireless sensor network, one of the important supporting technologies at the bottom layer of the Internet of Things, is already used in fields such as national defense and military, environmental detection, traffic management, medical care, manufacturing, and disaster prevention and rescue, and has become a research hotspot in academia and industry. The routing protocol is the most important part of a wireless sensor network and one of the current research hotspots at home and abroad. To adapt to different working environments and complete the corresponding tasks, the key is to design a suitable routing protocol so that the network can work in various environments, maintain good delay performance, exhibit a degree of robustness, and not lose too much performance in harsh environments. In addition, the nodes in a wireless sensor network are usually battery-powered, their computing and storage capacities are weak, and sending data packets consumes energy, so wireless sensor networks suffer from network delay, short network survival time, and uneven network energy consumption.
The routing protocol is one of the core technologies of the wireless sensor network. Routing protocols can be divided into planar routing protocols and hierarchical routing protocols according to whether the status and function of each node in the network are the same. In a hierarchical routing protocol, nodes are divided into high and low levels; the high-level nodes are responsible for collecting the information of the low-level nodes and then transmitting it to the base station, which better saves network energy and prolongs the network life cycle compared with a planar routing protocol. Current routing protocols often focus on one aspect of performance: routing algorithms based on minimum hop count realize efficient data transmission but cause key nodes to consume energy too quickly, leading to partial network paralysis and increased maintenance cost. In addition, wireless sensor networks (WSNs) employ wireless communication technology to transmit data, and signals attenuate over wireless channels due to distance variation, multipath, and shadowing effects. To enable WSNs to collect data efficiently, sensor nodes may need to move according to some movement model, yet efficient routing is harder to achieve in a mobile environment. Opportunistic routing algorithms achieve good energy utilization at the cost of longer delay: they exploit the broadcast nature of the wireless network, forwarding data packets to one set of nodes at a time; these nodes determine their priority according to their metric to the destination node, the node with the highest priority forwards the packets to another set of nodes, and this process repeats until the destination node is reached. These algorithms achieve good energy consumption performance and a degree of robustness, but their delay performance struggles to meet requirements, and their performance degrades when the network environment changes. On-demand routing protocols, such as AODV and DSR, create routes only when a source node needs routing information to send data to a destination node; they are less robust and unsuitable for complex network environments.
Routing algorithms in traditional wireless sensor networks have many shortcomings. Unreasonable cluster head selection causes cluster heads far from the sink node to exhaust their own energy prematurely through long-distance data transmission, wasting energy and partitioning the network. Moreover, most algorithms do not take into account the current energy state of the cluster head node; if a node with very low energy is selected as the cluster head, its death is accelerated, affecting the life cycle of the whole network. Some traditional routing decision algorithms adopt fixed routing rules and lack perception of the network state, which easily places higher loads on some equivalent paths; without adaptive traffic offloading, load imbalance readily arises.
Disclosure of Invention
To compensate for the limitations of traditional routing algorithms in complex network scenarios, and to solve the problem that traditional reinforcement learning algorithms cannot be deployed on resource-limited terminal equipment, the invention provides a wireless routing optimization method based on an attention mechanism and deep reinforcement learning, which comprises the following steps:
when each node accesses the network, acquiring the current latest decision model parameters from a server, and monitoring neighbor node information;
The node builds a candidate parent node set according to the overheard neighbor node information, and models the information of the m candidate parent nodes with the largest energy as a graph vector to be used as the input of a local decision model;
Based on a local decision model, the node selects an optimal father node as a relay node for data transmission, and after each data period is finished, the node counts relevant performance indexes of the data transmission node;
A similarity quantization function maps the performance indexes into the reward value of the node under the corresponding state and action, and the node transmits the data acquired in the data period to the server;
the server trains a decision model on the server according to the information collected by the nodes.
Further, the global model on the server includes a CNN-based attention mechanism module and a DDPG network; the attention mechanism module extracts features from the graph vectors constructed from the candidate parent node set and inputs the extracted features into the DDPG network for the routing decision and model optimization processes.
Further, after a node transmits the data collected in one data period to the server, the server stores the data in its experience replay pool; the server samples k samples from the experience pool to train the decision model on the server, and the training process comprises the following steps:
101. Sample k experiences from the experience pool, ej=&lt;sj,aj,rj,s′j&gt;, j=1,2,...,k, where the j-th sample consists of the current state sj, the action aj, the reward value rj obtained for the state-action pair (sj,aj), and the state s′j after the state-action pair (sj,aj) is executed. The states in the sample, i.e. the graph vectors sj and s′j, are obtained, and the CNN-based attention mechanism module extracts their features Fj and F′j;
102. The features Fj and F′j extracted by the CNN-based attention mechanism module are input into the DDPG network, and the target Q value is calculated using the Actor and Critic networks of the Target Net of the DDPG network, expressed as:
Yj=rj+γ·Q(F′j,πθ′(F′j);ω′);
103. The Critic network loss of the Main Net is calculated from the target Q value, expressed as:
J(ω)=(1/k)·Σ_{j=1..k}(Yj−Q(Fj,aj;ω))²;
The Critic network parameters ω of the Main Net are updated by back-propagating the gradient of the Critic network loss;
104. The Actor network loss of the Main Net is calculated, expressed as:
J(θ)=−(1/k)·Σ_{j=1..k}Q(Fj,πθ(Fj);ω);
The CNN-based attention mechanism module and the Actor network parameters θ of the Main Net are updated by back-propagating the gradient of the Actor network loss;
105. After each update of the network parameters in steps 101 to 104, the parameters of the CNN-based attention mechanism module, the Actor network, and the Critic network in the Target Net are updated as follows:
θ′←αθ+(1-α)θ′;
ω′←αω+(1-α)ω′;
106. Nodes in the network periodically acquire the latest policy network parameters θ′ from the Target Net of the server;
Wherein Yj is the target Q value of the corresponding state-action pair (sj,aj); ω is the Critic network parameter of the Main Net; ω′ is the Critic network parameter of the Target Net; θ′ is the parameter of the CNN-based attention mechanism module and the Actor network of the Target Net; Q(F′j,πθ′(F′j);ω′) is the Q value calculated by the Critic network of the Target Net for the state-action pair (F′j,πθ′(F′j)); Q(Fj,aj;ω) is the Q value calculated by the Critic network of the Main Net for the state-action pair (Fj,aj); γ is the reward discount factor; J(ω) is the loss function of the Critic network of the Main Net; J(θ) is the loss function of the Actor network of the Main Net; rj is the reward value of the j-th sample under the corresponding state and action; A represents the set of all actions; α∈[0,1] is the soft-update rate.
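The soft update of step 105 can be sketched as follows. This is a minimal illustration, assuming network parameters are stored as lists of numpy arrays; the function name `soft_update` is illustrative and not part of the patent.

```python
import numpy as np

def soft_update(main_params, target_params, alpha=0.01):
    """Soft (Polyak) update from step 105: theta' <- alpha*theta + (1-alpha)*theta'.

    main_params / target_params are lists of numpy arrays holding the
    Main Net and Target Net weights; alpha in [0, 1] is the update rate."""
    return [alpha * m + (1 - alpha) * t
            for m, t in zip(main_params, target_params)]

# With alpha = 0.5 the target moves halfway toward the main parameters.
main = [np.array([1.0, 2.0])]
target = [np.array([0.0, 0.0])]
target = soft_update(main, target, alpha=0.5)  # -> [array([0.5, 1.0])]
```

A small α keeps the Target Net changing slowly, which stabilizes the target Q values used in step 102.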
Further, the reward value rj of the j-th sample under the corresponding state and action is expressed as:
rj=w1*f(Th)+w2*f(Ce)+w3*f(De);
where f(Th) is the throughput index of the node; f(Ce) is the consumed-energy index of the node; f(De) is the delay index of the node; w1, w2 and w3 are the weights of f(Th), f(Ce) and f(De), respectively, with w1+w2+w3=1.
Further, the calculation of the throughput index of the node includes:
f(x)=α·e^(β(x−E[x])/(max[x]−E[x])), x=Th;
where α and β are the coefficients of the nonlinear scoring function; E[x] denotes the expectation of x; max[x] denotes the maximum value of x; Th denotes the throughput of the node.
Further, the calculation of the consumed energy index of the node and the time delay index of the node includes:
f(y)=α·e^(β(E[y]−y)/(E[y]−min[y])), y∈{Ce,De};
where α and β are the corresponding coefficients: defining f(x)=40 points when an index value reaches its average level, i.e. x=E[x], gives α=40; requiring f(x)=100 points when x=max[x] gives β=ln 2.5 (since 40×2.5=100). E[y] denotes the expectation of y; min[y] denotes the minimum value of y; Ce denotes the consumed energy of the node; De denotes the delay of the node.
Further, a local decision model is provided on each node; the model comprises a CNN-based attention mechanism module and the Actor network of the Target Net of the DDPG network, and its Actor network parameters are obtained from the Target Net of the DDPG network on the server.
Further, the working process of the CNN-based attention mechanism module comprises the following steps:
A graph vector sj in the sample is obtained, and 32 convolution kernels of size 1×1 extract the corresponding features, expressed as:
F=Conv1×1(sj);
Global average pooling and global max pooling are applied to F over the channel domain to obtain two new features, Favg∈R^(1×m×r) and Fmax∈R^(1×m×r), which are fused as:
Fam=[Favg;Fmax];
Global average pooling of Fam over the channel extracts finer detail features; the pooled result is denoted:
Fc∈R^(1×m×r);
Two convolution layers with different kernel sizes implement the dual-attention mechanism in two dimensions; each layer has a single convolution kernel, and the two convolution operations yield Nw and Mw, specifically:
Nw=Conv1×m(Fam);
Mw=Convr×1(Fam);
NMw is computed by matrix multiplication:
NMw=δ(Nw×Mw);
where NMw∈R^(1×m×r) and δ(·) is the activation function. A residual block preserves the integrity of the information, and a one-dimensional convolution yields the final feature:
Fj=Convm×1(NMw+Fc).
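The channel-domain pooling step can be illustrated in NumPy. Shapes follow the description (C channels over an m×r map); fusing F_avg and F_max by averaging follows the "average fusion" reading given later in the description, so treat this as one possible interpretation, not the definitive implementation.

```python
import numpy as np

def pool_and_fuse(F):
    """Channel-domain pooling sketch for the CNM module.

    F has shape (C, m, r), e.g. C = 32 feature maps from the 1x1 convolutions.
    Global average / max pooling over the channel axis give F_avg and F_max,
    each in R^{1 x m x r}; they are then fused into F_am."""
    F_avg = F.mean(axis=0, keepdims=True)  # (1, m, r) average-pooled features
    F_max = F.max(axis=0, keepdims=True)   # (1, m, r) max-pooled features
    F_am = 0.5 * (F_avg + F_max)           # fused detail features
    return F_avg, F_max, F_am

F = np.random.rand(32, 8, 5)  # m = 8 candidate parents, r = 5 metrics
F_avg, F_max, F_am = pool_and_fuse(F)
```

Average pooling summarizes each (node, metric) cell across all channels, while max pooling keeps its strongest response; fusing the two retains both views before the dual-attention convolutions.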
Further, the information of the m candidate parent nodes with the largest energy is modeled as a graph vector and used as the input of the local decision model. If the number of candidate parent nodes is greater than or equal to m, the node selects the m nodes with the largest remaining energy and abstracts the corresponding routing metric information into an m×r graph vector; when the number of candidate parent nodes is smaller than m, the missing routing metric information is filled with 0 and the information is likewise abstracted into an m×r graph vector, where r is the dimension of the routing metric information.
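Construction of the m×r graph vector with zero padding can be sketched as follows. The assumption that remaining energy is the first field of each metric row, and the function name, are illustrative.

```python
import numpy as np

def build_graph_vector(candidates, m, r):
    """Build the m x r state graph vector from candidate-parent metric rows.

    candidates: list of r-dimensional rows, e.g. [RE, Hop, NO, BQ, ETX]
    per candidate parent (remaining energy assumed to be field 0).
    The m rows with the largest remaining energy are kept; if fewer than
    m candidates exist, the remaining rows stay zero-filled."""
    ranked = sorted(candidates, key=lambda row: row[0], reverse=True)[:m]
    s = np.zeros((m, r))
    for i, row in enumerate(ranked):
        s[i] = row
    return s

# Node-B-style case: only 2 candidates, so rows 3 and 4 are zero-padded.
s = build_graph_vector([[3, 1, 2, 0, 1], [5, 2, 1, 1, 2]], m=4, r=5)
```

Sorting by remaining energy puts the strongest candidates in the first rows, so a fixed-size model input is obtained regardless of how many neighbors a node actually overhears.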
The invention realizes intelligent decision-making on resource-limited nodes through asynchronous experience collection and centralized model training. Each node selects the optimal candidate parent node in a distributed manner based on its local observations, maximizing the network survival time while reducing the end-to-end delay, and a CNN-based bidirectional attention mechanism extracts candidate-parent features at finer granularity along both the node and routing-metric dimensions. In addition, the route optimization model is highly scalable and suits scenarios in which the nodes in the network change dynamically.
Drawings
FIG. 1 is a flow chart of the operation of a node in an embodiment of the invention;
FIG. 2 is a flowchart illustrating the operation of a server according to an embodiment of the present invention;
FIG. 3 is a diagram of a distributed interaction and centralized training system model architecture in accordance with an embodiment of the present invention;
FIG. 4 is a state vector of an embodiment of the present invention;
fig. 5 is a CNM and DDPG-based route optimization model architecture in accordance with an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a wireless route optimization method based on an attention mechanism and deep reinforcement learning, which comprises the following steps:
when each node accesses the network, acquiring the current latest decision model parameters from a server, and monitoring neighbor node information;
The node builds a candidate parent node set according to the overheard neighbor node information, and models the information of the m candidate parent nodes with the largest energy as a graph vector to be used as the input of a local decision model;
Based on a local decision model (comprising a CNN-based attention mechanism and an Actor-Target network module), a node selects an optimal father node as a relay node for data transmission, and after each data period is finished, the node counts relevant performance indexes of the data transmission node;
A similarity quantization function maps the performance indexes into the reward value of the node under the corresponding state and action, and the node transmits the data acquired in the data period to the server;
the server trains a decision model on the server according to the information collected by the nodes.
In this embodiment, the distributed routing algorithm in the wireless multi-hop network environment includes the following steps:
A node newly accessing the network acquires its local decision model parameters: after network access, the node obtains the current latest decision model parameters from the server and monitors neighbor node information;
In a route updating period, the node selects the optimal parent node based on its local observation state and local decision model. The selection process models the information of the m candidate parent nodes with the largest energy as a graph vector, which serves as the input of the local decision model, realizing a deep fusion of the multipath routing parameters and the deep learning model; based on the local decision model, the node selects the optimal parent node as the relay node for data transmission;
in the data period, the node sends the data of the buffer area to the father node;
the node counts the relevant performance of its data transmission in the period (throughput, delay, energy consumption, etc.; this performance information also serves as the routing metric information of the node), and uploads the experience information from its interaction with the environment in this period to the server (stored in an experience pool);
the server acquires an experience data training model from the experience pool and periodically transmits updated decision part model parameters to each node;
during model training, i.e. while the model has not yet converged, the node periodically acquires the latest model parameters from the server and uploads the experience information from its interaction with the environment to the server's experience pool; the server extracts part of the experience from the pool and trains the deep reinforcement learning model (CNM+DDPG) for routing decisions. The training of the model comprises the following steps:
1) Randomly initializing a route optimization model based on CNM and DDPG deployed on a server;
2) In the wireless multi-hop network, a (CNM+Actor-Target) routing decision model is deployed on each wireless node; its architecture is identical to the CNM+Actor-Target structure on the server and is a sub-part of the server model;
3) After the node is accessed to the network, partial parameters of the routing decision model are acquired from a server to update the local model of the node;
4) The node collects candidate father node information, wherein the information comprises node residual energy, hop count, expected transmission times, buffer queue count and neighbor node count, and note that the neighbor node count refers to the neighbor node count of the candidate father node (including the father node);
5) During the control period, the node selects the m candidate parent nodes with the largest remaining energy from its candidate parent nodes, constructs their information into an m×r graph vector (m is the number of selected candidate parent nodes and r is the selected information dimension), and feeds this graph vector as the state information s into the local decision model; the model outputs the corresponding action a, indicating the optimal parent node that the node should select. The state information s models the information of the m candidate parent nodes with the largest energy as a graph vector, as shown in fig. 4, covering two main cases: when the number of candidate parent nodes of a node is greater than or equal to m (such as Node A), the node selects the m nodes with the largest remaining energy and abstracts the corresponding routing metric information into an m×5 graph vector; conversely, when the number of candidate parent nodes is smaller than m (such as Node B), the missing routing metric information is filled with 0. As shown in fig. 4, the selected routing metric information includes remaining energy (RE), hop count (Hop), number of neighbor nodes (NO), buffer queue number (BQ), and expected number of transmissions (ETX).
6) In the data transmission period, the node transmits the data in the buffer area to the selected optimal father node based on the corresponding channel access mechanism of the MAC layer, and counts the corresponding network performance (average data transmission delay, throughput, energy and the like);
7) A similarity quantization function maps the performance indexes into the corresponding reward r of the node under the given state and action; the node transmits the experience information acquired in the period, i.e. &lt;s,a,r,s′&gt;, to the server, which stores it in the experience pool D; each training step, the server extracts a mini-batch from D to train the model;
8) The node periodically acquires the relevant parameters from the server to update its local decision model and continues interacting with the environment;
9) The above process is repeated until the energy of the nodes in the network is exhausted.
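The server-side experience pool D used in steps 7) and 8) can be sketched with a bounded buffer; the class and method names are illustrative, not from the patent.

```python
import random
from collections import deque

class ExperiencePool:
    """Sketch of the experience pool D: stores <s, a, r, s'> tuples
    uploaded by nodes and returns mini-batches for training."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, k):
        # Mini-batch of at most k experiences, sampled without replacement
        return random.sample(list(self.buffer), min(k, len(self.buffer)))

pool = ExperiencePool(capacity=100)
for t in range(5):
    pool.add(s=t, a=t % 3, r=float(t), s_next=t + 1)
batch = pool.sample(k=3)
```

A bounded `deque` keeps memory constant on the server while random sampling breaks the temporal correlation between consecutive experiences, which is the usual motivation for experience replay in DDPG-style training.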
The CNM module extracts features from the candidate-parent attribute values constructed by the node. On the channel domain, max pooling and average pooling retain the features to different degrees, and an average fusion of the two extracts the detail features of the input state. In addition, two convolution layers of different dimensions implement two attention mechanisms of different dimensions, attending both to the comprehensive attributes of each node and to the horizontal and vertical comparison of each attribute, thereby realizing dual attention based on nodes and metrics. Finally, to preserve feature integrity, the residual-network idea is adopted and the two obtained features are fused as the input of the deep reinforcement learning model. Based on the experience information collected by the nodes, the centralized learner optimizes the model parameters with a CNM- and DDPG-based network architecture. Each wireless node only needs to deploy the CNM+Actor-Target model shown in fig. 1-2 for its local decisions; the parameters of this model are trained and optimized by the server, so the local node only needs to download this part of the parameters from the server. The distributed interaction and centralized training model is shown in fig. 3; compared with deploying and training the whole network model on each node, distributed interaction with centralized training effectively reduces the storage and computation pressure on the terminal nodes.
As an alternative embodiment, the parent node (the m candidate parent nodes with the largest energy) is selected using remaining energy (RE), hop count (Hop), expected number of transmissions (ETX), buffer queue number (BQ), and number of neighbor nodes (NO) as the hybrid routing metric information. Using remaining energy (RE) effectively avoids selecting a node with low remaining energy as the optimal parent for transmitting data, helping to prolong the survival time of the network; using hop count (Hop) avoids selecting nodes with excessive hop counts as the preferred parent, improving the data transmission success rate, delay, and other performance; the expected number of transmissions (ETX) indicates link quality and aims to improve the reliability of data transmission; the buffer queue number (BQ) accounts for the load on the candidate parent node, avoiding serious load imbalance; and the number of neighbor nodes (NO) incorporates network dynamics into the model, so as to predict the potential impact that distributed node decisions may have on the current parent node.
FIGS. 1-2 provide workflow diagrams of the node and the server, which are implemented as follows:
After a new node accesses the network, it acquires the local decision model parameters from the server. As shown in fig. 3, the local decision model is part of the deep reinforcement learning model deployed on the server and is used to obtain the corresponding optimal action based on the environmental state;
The node maintains a candidate parent node table in a distributed manner; the table stores and updates in real time the hop count, residual energy, buffer queue number, expected transmission count, number of neighbor nodes and other information of the corresponding candidate parent nodes;
In each route update period, the node selects the current preferred parent node based on the current local decision model and the observed state; the specific process is as follows:
In each route update period (set adaptively according to the network), the node selects the m candidate parent nodes with the largest residual energy, where m can be adaptively adjusted according to the network density (if the number of candidate parent nodes is less than m, all candidate parent nodes are selected and the missing rows are padded with 0). The node abstracts the selected candidate parent node information into a graph vector s. As shown in fig. 4, each row of the graph vector stores the attributes (hop count, residual energy, buffer queue number, expected transmission count, number of neighbor nodes, etc.) of one specific candidate parent node;
The node takes this graph vector s as the input of the local decision model. The model then outputs an action a ∈ A, where A = {1, 2, ..., m} is the action space and a indicates the preferred parent node for that node (e.g., a = 1 indicates that the node should select the parent node whose information is stored in row 1 of the graph vector as its preferred parent node for the next data period).
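As a hedged illustration of the two steps above, the following sketch builds the m×r graph vector with zero padding and maps it to an action a ∈ {1, ..., m}. The `policy` argument is a hypothetical stand-in for the trained CNM+Actor-Target model, and the ordering of attributes within a row is an assumption for illustration only:

```python
import numpy as np

def build_graph_vector(candidates, m):
    """Build the m x r graph vector s from candidate parent attributes.

    candidates: list of [hop, residual_energy, buffer_queue,
                         expected_tx, neighbor_count] rows (r = 5).
    The m candidates with the largest residual energy are kept;
    missing rows are padded with 0, as described above.
    """
    r = 5
    # rank by residual energy (column 1), descending, keep at most m
    ranked = sorted(candidates, key=lambda row: row[1], reverse=True)[:m]
    s = np.zeros((m, r))
    for i, row in enumerate(ranked):
        s[i] = row
    return s

def select_parent(s, policy):
    """Map the graph vector to an action a in {1, ..., m} (row index + 1)."""
    scores = policy(s)  # stand-in for the local CNM+Actor-Target model
    return int(np.argmax(scores)) + 1

# toy usage: 3 candidates, m = 4, so the last row is zero-padded
cands = [[2, 0.7, 3, 1.2, 5], [1, 0.9, 1, 1.1, 4], [3, 0.5, 2, 1.5, 6]]
s = build_graph_vector(cands, m=4)
a = select_parent(s, policy=lambda s: s[:, 1])  # dummy policy: max energy
```

In a deployment, `policy` would be replaced by the downloaded CNM+Actor-Target parameters; the dummy policy here simply reproduces the energy ranking.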
In view of the characteristics of the graph vector, this embodiment designs a CNM-based feature extraction module to extract the attribute features of each candidate parent node; the specific process is as follows:
(1) The state vector s is convolved with 32 one-dimensional (1×1) convolution kernels, expressed as:
F = Conv_{1×1}(s)
(2) Two new features are obtained by global average pooling and global maximum pooling over the 32 channels, namely F_avg ∈ R^{1×m×r} and F_max ∈ R^{1×m×r}, and the two features are fused into F_am = [F_avg; F_max]; global average pooling of F_am ∈ R^{2×m×r} over the channel dimension is used to extract more detailed features, expressed as:
F_c = AvgPool_channel(F_am), F_c ∈ R^{1×m×r}
(3) Convolution operations are performed in two different dimensions using two convolution layers with different convolution kernel sizes, realizing the dual attention mechanism (N_w and M_w), expressed as:
N_w = Conv_{1×m}(F_am)
M_w = Conv_{r×1}(F_am)
Wherein the number of convolution kernels of each convolution layer is 1;
(4) NM_w is calculated using matrix multiplication, expressed as:
NM_w = N_w × M_w
wherein NM_w ∈ R^{1×m×r};
(5) A residual block is used to guarantee the integrity of the information, and a one-dimensional convolution operation is performed; the result is expressed as:
F_j = Conv_{m×1}(NM_w + F_c)
F_j is input into the Actor-Target module to obtain the corresponding action.
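Steps (1)-(5) can be sketched numerically as follows. This is a minimal NumPy illustration with random, untrained weights; the kernel orientations and the use of F_c (rather than the two-channel F_am) as the attention input are assumptions, since the Conv_{1×m}/Conv_{r×1} notation above leaves them ambiguous:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnm_forward(s, m, r, c=32):
    """Hedged sketch of the CNM feature extractor with random weights."""
    # (1) 1x1 convolution: lift the single-channel m x r map to c channels
    w1 = rng.standard_normal(c)
    F = s[None, :, :] * w1[:, None, None]              # (c, m, r)
    # (2) channel-domain pooling, concatenation, and average fusion
    F_avg, F_max = F.mean(axis=0), F.max(axis=0)       # each (m, r)
    F_am = np.stack([F_avg, F_max])                    # (2, m, r)
    F_c = F_am.mean(axis=0)                            # (m, r) detail map
    # (3) dual attention: node-wise and metric-wise scores (assumption:
    # computed from F_c instead of the two-channel F_am)
    n_w = F_c @ rng.standard_normal(r)                 # (m,) node attention
    m_w = rng.standard_normal(m) @ F_c                 # (r,) metric attention
    # (4) matrix multiplication -> joint attention map NM_w
    NM_w = np.outer(n_w, m_w)                          # (m, r)
    # (5) residual fusion with F_c and a final linear (conv-like) mixing
    F_j = (NM_w + F_c) @ rng.standard_normal((r, r))   # (m, r) features
    return F_j

F_j = cnm_forward(rng.standard_normal((4, 5)), m=4, r=5)
```

The sketch only demonstrates the shapes and data flow of the module; in the embodiment these weights are learned by the server during centralized training.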
During a data transmission period, the node interacts with the preferred parent node on the selected working channel based on the corresponding medium access mechanism (e.g., CSMA/CA, TDMA, etc.); the node records the corresponding network performance indicators for this period: throughput (Th), consumed energy (Ce) and average packet transmission delay (De).
The node adopts a nonlinear integration method to achieve uniform quantization of the performance indexes, wherein throughput is a forward index, and consumed energy and average end-to-end delay are reverse indexes. Accordingly, the following two formulas are used to calculate them, respectively.
f(x) = α·e^{β·(x − E[x])/(max[x] − E[x])}, x = Th
f(y) = α·e^{β·(E[y] − y)/(E[y] − min[y])}, y ∈ {Ce, De}
Where α and β are scoring coefficients: they map an index at its average level to a score of α and an index at its best level to 100. For example, on a 100-point scale where the average level scores 40, α = 40 and β = ln 2.5, since 40·e^{ln 2.5} = 100; in the embodiment of the present invention, α = 40 and β = ln 2.5. x and y denote the throughput and the consumed energy or average packet transmission delay of the node in the current period. E[x] denotes the mean value of x over all periods since the node joined the network, and max[x] denotes the maximum single-period value of x since the node joined the network; similarly, min[y] denotes the minimum single-period value of y since the node joined the network.
The node performs weighted accumulation of the uniformly quantized indexes using the following formula to obtain the corresponding reward value r_j:
r_j = w1·f(Th) + w2·f(Ce) + w3·f(De)
where w1, w2 and w3 are weighting coefficients with w1 + w2 + w3 = 1, indicating the importance the current network attaches to the different metrics.
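The uniform quantization and the weighted reward above can be sketched as follows (a minimal illustration; the per-index statistics passed in and the example weights are assumptions for demonstration, not values fixed by the embodiment):

```python
import math

ALPHA, BETA = 40.0, math.log(2.5)  # average level -> 40, best level -> 100

def f_forward(x, mean_x, max_x):
    """Forward index (throughput): larger is better."""
    return ALPHA * math.exp((x - mean_x) / (max_x - mean_x) * BETA)

def f_reverse(y, mean_y, min_y):
    """Reverse index (consumed energy, delay): smaller is better."""
    return ALPHA * math.exp((mean_y - y) / (mean_y - min_y) * BETA)

def reward(th, ce, de, stats, w=(0.4, 0.3, 0.3)):
    """Weighted reward r_j = w1*f(Th) + w2*f(Ce) + w3*f(De); sum(w) = 1."""
    return (w[0] * f_forward(th, *stats['th'])
            + w[1] * f_reverse(ce, *stats['ce'])
            + w[2] * f_reverse(de, *stats['de']))

# a period that exactly matches the node's historical averages scores 40
# on every index, so the weighted reward is also 40
stats = {'th': (5.0, 8.0), 'ce': (2.0, 1.0), 'de': (3.0, 1.5)}
r_j = reward(5.0, 2.0, 3.0, stats)  # -> 40.0
```

Note how the sign conventions differ: the forward index rewards exceeding the historical mean, while the reverse indexes reward falling below it.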
The node uploads the experience data generated by interacting with the environment in this period, namely e = &lt;s, a, r, s'&gt;, to the server;
The server stores the experience information from each node in the experience pool D and randomly samples a mini-batch from the experience pool each time it updates the model. The model deployed on the server is shown in fig. 5. The training process of the model is as follows:
1) The server stores the experiences acquired by the wireless nodes in the network into the experience replay pool of the centralized learner;
2) The server samples a mini-batch from the experience replay pool, e_j = &lt;s_j, a_j, r_j, s'_j&gt;, j = 1, 2, ..., k;
3) The feature vectors F_j and F'_j corresponding to s_j and s'_j are calculated based on the CNM;
4) The Target Q value is calculated as:
Y_j = r_j + γ·Q(F'_j, π_{θ'}(F'_j); ω')
5) The mean square error J(ω) = (1/k)·Σ_{j=1}^{k} (Y_j − Q(F_j, a_j; ω))² is calculated, and the Critic-Main network parameters ω are updated based on gradient back propagation of the deep network;
6) The Actor loss J(θ) = −(1/k)·Σ_{j=1}^{k} Q(F_j, π_θ(F_j); ω) is calculated, and the parameters θ of the CNM+Actor main policy network are updated through gradient back propagation of the neural network;
7) Every C rounds, the CNM+Actor Target policy network and the Critic Target Q network parameters are updated:
θ′←αθ+(1-α)θ′
ω′←αω+(1-α)ω′
The above process is repeated until the model converges.
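The target-value computation of step 4) and the soft update of step 7) can be sketched as follows; the toy Actor/Critic stand-ins below are assumptions for illustration, not the trained networks of the embodiment:

```python
import numpy as np

def target_q(r, F_next, actor_t, critic_t, gamma=0.99):
    """Target value Y_j = r_j + gamma * Q(F'_j, pi_theta'(F'_j); omega')."""
    return r + gamma * critic_t(F_next, actor_t(F_next))

def soft_update(target_params, main_params, alpha=0.01):
    """theta' <- alpha*theta + (1 - alpha)*theta', applied every C rounds."""
    return [alpha * p + (1.0 - alpha) * tp
            for tp, p in zip(target_params, main_params)]

# toy stand-ins for the Target Net's Actor and Critic (assumed, untrained)
actor_t = lambda F: F.mean()            # "action" from features
critic_t = lambda F, a: F.sum() + a     # "Q value" for (features, action)

y = target_q(r=1.0, F_next=np.array([1.0, 2.0]),
             actor_t=actor_t, critic_t=critic_t)
new_target = soft_update([np.array([0.0, 2.0])],
                         [np.array([1.0, 2.0])], alpha=0.5)
```

Because α ∈ [0,1] is small in practice, the target networks track the main networks slowly, which stabilizes the bootstrapped target Y_j.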
The node periodically updates its local decision model parameters from the server, and the local model does not need to be trained independently, which greatly reduces the computational complexity of the terminal node. In addition, the nodes in the network asynchronously collect experience from their locally observed environments, which provides the server with more experience information and thus speeds up convergence and improves the generalization capability of the model. Furthermore, given training on sufficiently rich experience, the combination of centralized training (server) and distributed interaction (nodes) can optimize routes in dynamic or mobile wireless multi-hop scenarios.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (2)
1. The wireless route optimization method based on the attention mechanism and the deep reinforcement learning is characterized by comprising the following steps of:
when each node accesses the network, acquiring the current latest decision model parameters from a server, and monitoring neighbor node information;
the node builds a candidate parent node set according to the monitored neighbor node information, and models the information of the m candidate parent nodes with the largest energy as a graph vector, which is used as the input of a local decision model;
based on the local decision model, the node selects the optimal parent node as the relay node for data transmission, and after each data period ends, the node counts the relevant performance indexes of the data transmission;
mapping the performance indexes through a uniform quantization function into the corresponding reward value of the node under the corresponding state and action, and transmitting, by the node, the data acquired in the data period to the server;
the server trains a decision model on the server according to the information collected by the nodes, and a global model on the server comprises a CNN-based attention mechanism module and DDPG networks, wherein the CNN-based attention mechanism module is used for extracting features from graph vectors constructed by candidate father node sets and inputting the extracted features into the DDPG networks to execute a routing decision and model optimization process;
the process of constructing the graph vector includes:
selecting the m candidate parent nodes with the largest residual energy from the candidate parent nodes, and constructing their information into an m×r graph vector as the state vector s, wherein r is the selected information dimension;
when the number of candidate parent nodes of the node is greater than or equal to m, the node selects the m nodes with the largest residual energy and abstracts the corresponding route metric information into an m×5 graph vector, wherein the selected route metric information comprises the residual energy information, the hop count, the number of neighbor nodes, the buffer queue number and the expected transmission count; when the number of candidate parent nodes of the node is smaller than m, the missing route metric information is padded with 0;
the process of extracting features from graph vectors constructed by candidate parent node sets by the CNN-based attention mechanism module includes:
the state vector s is convolved with 32 one-dimensional (1×1) convolution kernels, expressed as:
F = Conv_{1×1}(s)
two new features are obtained by global average pooling and global maximum pooling over the 32 channels, namely F_avg ∈ R^{1×m×r} and F_max ∈ R^{1×m×r}, and the two features are fused into F_am = [F_avg; F_max]; global average pooling of F_am ∈ R^{2×m×r} over the channel dimension is used to extract more detailed features, expressed as:
F_c = AvgPool_channel(F_am), F_c ∈ R^{1×m×r}
the convolution operation is performed in two different dimensions using two convolution layers with different convolution kernel sizes, namely:
N_w = Conv_{1×m}(F_am)
M_w = Conv_{r×1}(F_am)
Wherein the number of convolution kernels of each convolution layer is 1;
NM_w is calculated using matrix multiplication, expressed as:
NM_w = N_w × M_w
wherein NM_w ∈ R^{1×m×r};
a residual block is used to guarantee the integrity of the information, and a one-dimensional convolution operation is performed; the result is expressed as:
F_j = Conv_{m×1}(NM_w + F_c)
taking F_j as an input of the DDPG network; in the DDPG network, the reward value r_j of the j-th node under the corresponding state and action is expressed as:
r_j = w1·f(Th) + w2·f(Ce) + w3·f(De);
where f(Th) represents the throughput index of the node, f(Ce) represents the consumed energy index of the node, f(De) represents the delay index of the node, and w1, w2 and w3 are the weights of f(Th), f(Ce) and f(De), respectively, with w1 + w2 + w3 = 1;
the calculation of the throughput index f(x) of the node includes:
f(x) = α·e^{β·(x − E[x])/(max[x] − E[x])}, x = Th
wherein α and β are first coefficients; E[x] denotes the expectation of x; max[x] denotes the maximum value of x; Th denotes the throughput of the node;
the calculation of the consumed energy index and the delay index f(y) of the node includes:
f(y) = α_1·e^{β_1·(E[y] − y)/(E[y] − min[y])}, y ∈ {Ce, De}
wherein α_1 and β_1 are second coefficients, specified such that when the index value reaches the average level the score is α_1 = 40, and when the index value reaches its optimum (minimum) value the score is 100, giving β_1 = ln 2.5; E[y] denotes the expectation of y; min[y] denotes the minimum value of y; Ce denotes the consumed energy of the node; De denotes the delay of the node.
2. The wireless route optimization method based on the attention mechanism and deep reinforcement learning according to claim 1, wherein after the node transmits the data collected in one data period to the server, the server stores the data in an experience replay pool of the server; the server samples k samples from the experience pool to train the decision model on the server, and the training process comprises:
101. sampling k samples from the experience pool, e_j = &lt;s_j, a_j, r_j, s'_j&gt;, j = 1, 2, ..., k, wherein the j-th sample is represented by the current state s_j of the sample, the action a_j, the reward value r_j obtained for the state-action pair (s_j, a_j), and the state s'_j after the state-action pair (s_j, a_j) is executed; obtaining the states in the samples, i.e., the graph vectors s_j and s'_j, and extracting the features F_j and F'_j of the graph vectors using the CNN-based attention mechanism module;
102. the features F_j and F'_j extracted by the CNN-based attention mechanism module are input into the DDPG network, and the Target Q value is calculated through the Actor and Critic networks of the Target Net, expressed as:
Y_j = r_j + γ·Q(F'_j, π_{θ'}(F'_j); ω');
103. the Critic network loss of the Main Net is calculated according to the Target Q value, expressed as:
J(ω) = (1/k)·Σ_{j=1}^{k} (Y_j − Q(F_j, a_j; ω))²;
the Critic network parameters ω of the Main Net are updated based on gradient back propagation of the Critic network loss;
104. the Actor network loss of the Main Net is calculated, expressed as:
J(θ) = −(1/k)·Σ_{j=1}^{k} Q(F_j, π_θ(F_j); ω);
the CNN-based attention mechanism module and the Actor network parameters θ of the Main Net are updated based on gradient back propagation of the Actor network loss;
105. every C updates of the network parameters through steps 101 to 104, the parameters of the CNN-based attention mechanism module, the Actor network and the Critic network in the Target Net are updated, expressed as follows:
θ′←αθ+(1-α)θ′;
ω′←αω+(1-α)ω′;
106. the nodes in the network periodically acquire the latest policy network parameters, namely θ', from the Target Net of the server;
wherein Y_j is the target Q value of the corresponding state-action pair (s_j, a_j); ω is the Critic network parameter of the Main Net; ω' is the Critic network parameter of the Target Net; θ' is the parameter of the CNN-based attention mechanism module and the Actor network of the Target Net; Q(F'_j, π_{θ'}(F'_j); ω') is the Q value calculated by the Critic network of the Target Net for the corresponding state-action pair (F'_j, π_{θ'}(F'_j)); Q(F_j, a_j; ω) is the true Q value calculated by the Critic network of the Main Net for the corresponding state-action pair (F_j, a_j); γ is the reward discount factor; J(θ) is the loss function of the Actor network of the Main Net; r_j is the reward value of the j-th node under the corresponding state and action; A is the action space, i.e., the set of all actions; and α ∈ [0,1] is the soft-update rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210068572.8A CN114423061B (en) | 2022-01-20 | 2022-01-20 | Wireless route optimization method based on attention mechanism and deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114423061A CN114423061A (en) | 2022-04-29 |
CN114423061B true CN114423061B (en) | 2024-05-07 |
Family
ID=81276089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210068572.8A Active CN114423061B (en) | 2022-01-20 | 2022-01-20 | Wireless route optimization method based on attention mechanism and deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114423061B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114884895B (en) * | 2022-05-05 | 2023-08-22 | 郑州轻工业大学 | Intelligent flow scheduling method based on deep reinforcement learning |
CN115190135B (en) * | 2022-06-30 | 2024-05-14 | 华中科技大学 | Distributed storage system and copy selection method thereof |
CN115842770B (en) * | 2022-11-07 | 2024-05-14 | 鹏城实验室 | Routing method based on depth map neural network and related equipment |
CN116170370B (en) * | 2023-02-20 | 2024-03-12 | 重庆邮电大学 | SDN multipath routing method based on attention mechanism and deep reinforcement learning |
CN117863948B (en) * | 2024-01-17 | 2024-06-11 | 广东工业大学 | Distributed electric vehicle charging control method and device for auxiliary frequency modulation |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105704754A (en) * | 2014-12-12 | 2016-06-22 | 华北电力大学 | Wireless sensor network routing method |
CN107018548A (en) * | 2017-05-27 | 2017-08-04 | 河南科技大学 | The implementation method of cognition wireless network opportunistic routing protocol based on frequency spectrum perception |
CN107920368A (en) * | 2016-10-09 | 2018-04-17 | 郑州大学 | RPL routing optimization methods based on life cycle in a kind of wireless sense network |
CN110852273A (en) * | 2019-11-12 | 2020-02-28 | 重庆大学 | Behavior identification method based on reinforcement learning attention mechanism |
CN111315005A (en) * | 2020-02-21 | 2020-06-19 | 重庆邮电大学 | Self-adaptive dormancy method of wireless sensor network |
CN112256056A (en) * | 2020-10-19 | 2021-01-22 | 中山大学 | Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning |
CN112965499A (en) * | 2021-03-08 | 2021-06-15 | 哈尔滨工业大学(深圳) | Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning |
CN113139446A (en) * | 2021-04-12 | 2021-07-20 | 长安大学 | End-to-end automatic driving behavior decision method, system and terminal equipment |
WO2021249515A1 (en) * | 2020-06-12 | 2021-12-16 | 华为技术有限公司 | Channel information feedback method, communication apparatus and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11093561B2 (en) * | 2017-12-11 | 2021-08-17 | Facebook, Inc. | Fast indexing with graphs and compact regression codes on online social networks |
-
2022
- 2022-01-20 CN CN202210068572.8A patent/CN114423061B/en active Active
Non-Patent Citations (2)
Title |
---|
Composition of Visual Feature Vector Pattern for Deep Learning in Image Forensics; K. H. Rhee; IEEE Access; 20201006; full text *
Research on Reinforcement Learning Algorithms for Mobile Vehicle Path Planning in Special Traffic Environments; Chen Liang; Engineering Science and Technology Series; 20200115; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114423061A (en) | 2022-04-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||