CN114423061B - Wireless route optimization method based on attention mechanism and deep reinforcement learning - Google Patents
- Publication number
- CN114423061B CN114423061B CN202210068572.8A CN202210068572A CN114423061B CN 114423061 B CN114423061 B CN 114423061B CN 202210068572 A CN202210068572 A CN 202210068572A CN 114423061 B CN114423061 B CN 114423061B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
- H04W40/04—Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
- H04W40/10—Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources based on available power or energy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W40/00—Communication routing or communication path finding
- H04W40/02—Communication route or path selection, e.g. power-based or shortest path routing
- H04W40/12—Communication route or path selection, e.g. power-based or shortest path routing based on transmission quality or channel quality
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention relates to a network routing method, in particular to a wireless routing optimization method based on an attention mechanism and deep reinforcement learning. When each node accesses the network, it acquires the current latest decision model parameters from a server and monitors neighbor node information; the node builds a candidate parent node set, models the information of the m candidate parent nodes with the largest energy as a graph vector, and uses it as input; a CNN-based attention mechanism extracts the features of the graph vector, and deep reinforcement learning selects the optimal parent node as the relay node for data transmission. After each data period ends, the node counts the relevant performance indexes of its data transmission; a similarity quantization function maps the performance indexes into the reward value of the node under the corresponding state and action, and the node transmits the experience information acquired in the data period to the server. The method has high scalability and is applicable to scenarios in which the nodes in the network change dynamically.
Description
Technical Field
The invention relates to a network routing method, in particular to a wireless routing optimization method based on an attention mechanism and deep reinforcement learning.
Background
In recent years, Internet of Things technology has continuously produced new achievements. The wireless sensor network, one of the important supporting technologies at the bottom layer of the Internet of Things, is already used in fields such as national defense and military, environmental detection, traffic management, medical care, manufacturing, and disaster prevention and rescue, and has become a research hotspot in academia and industry. The routing protocol is the most important part of a wireless sensor network and one of the current research hotspots at home and abroad. To adapt to different working environments and complete the corresponding tasks, the key is to design a suitable routing protocol so that the network can work in various environments, maintain good delay performance, exhibit a degree of robustness, and not lose too much performance in harsh environments. In addition, the nodes in a wireless sensor network are usually battery-powered, their computing and storage capacities are weak, and sending data packets consumes energy, so wireless sensor networks suffer from network delay, short network survival time, and uneven network energy consumption.
The routing protocol is one of the core technologies of the wireless sensor network. Routing protocols can be divided into planar routing protocols and hierarchical routing protocols according to whether the status and function of each node in the network are the same. In a hierarchical routing protocol, nodes are divided into high and low levels; the high-level nodes are responsible for collecting the information of the low-level nodes and then transmitting it to the base station, which better saves network energy and prolongs the network life cycle compared with a planar routing protocol. Current routing protocols often focus on one aspect of performance: routing algorithms based on minimum hop count realize efficient data transmission but cause key nodes to consume energy too quickly, leading to partial network paralysis and increased maintenance cost. In addition, wireless sensor networks (WSNs) employ wireless communication technology to transmit data, and signals attenuate over wireless channels due to distance variation, multipath, and shadowing effects. To enable WSNs to collect data efficiently, sensor nodes may need to move according to some movement model, yet efficient routing is harder to achieve in a mobile environment. Opportunistic routing algorithms achieve good energy utilization at the cost of longer delay: they exploit the broadcast nature of the wireless network, forwarding data packets to one set of nodes at a time; these nodes determine their priority according to their metric to the destination node, the node with the highest priority forwards the packets to another set of nodes, and this process repeats until the destination node is reached. These algorithms achieve good energy consumption performance and a degree of robustness, but their delay performance struggles to meet requirements, and their performance degrades when the network environment changes. On-demand routing protocols, such as AODV and DSR, create routes only when a source node needs routing information to send data to a destination node; they are less robust and unsuitable for complex network environments.
Routing algorithms in traditional wireless sensor networks have many shortcomings. Unreasonable cluster head selection causes cluster heads far from the sink node to exhaust their own energy prematurely through long-distance data transmission, wasting energy and partitioning the network. Moreover, most algorithms do not take into account the current energy state of the cluster head node; if a node with very low energy is selected as the cluster head, its death is accelerated, affecting the life cycle of the whole network. Some traditional routing decision algorithms adopt fixed routing rules and lack perception of the network state, which easily places higher loads on some equivalent paths; without adaptive traffic offloading, load imbalance readily arises.
Disclosure of Invention
To compensate for the limitations of traditional routing algorithms in complex network scenarios, and to solve the problem that traditional reinforcement learning algorithms cannot be deployed on resource-limited terminal equipment, the invention provides a wireless routing optimization method based on an attention mechanism and deep reinforcement learning, which comprises the following steps:
when each node accesses the network, acquiring the current latest decision model parameters from a server, and monitoring neighbor node information;
The node builds a candidate parent node set according to the overheard neighbor node information, and models the information of the m candidate parent nodes with the largest energy as a graph vector to be used as the input of a local decision model;
Based on a local decision model, the node selects an optimal father node as a relay node for data transmission, and after each data period is finished, the node counts relevant performance indexes of the data transmission node;
A similarity quantization function maps the performance indexes into the reward value of the node under the corresponding state and action, and the node transmits the data acquired in the data period to the server;
the server trains a decision model on the server according to the information collected by the nodes.
Further, the global model on the server includes a CNN-based attention mechanism module and a DDPG network; the attention mechanism module extracts features from the graph vectors constructed from the candidate parent node set and inputs the extracted features into the DDPG network for the routing decision and model optimization processes.
Further, after a node transmits the data collected in one data period to the server, the server stores the data in its experience replay pool; the server samples k samples from the experience pool to train the decision model on the server, and the training process comprises the following steps:
101. Sample k experiences from the experience pool, ej=&lt;sj,aj,rj,s′j&gt;, j=1,2,...,k, where the j-th sample consists of the current state sj, the action aj, the reward value rj obtained for the state-action pair (sj,aj), and the state s′j after the state-action pair (sj,aj) is executed. The states in the sample, i.e. the graph vectors sj and s′j, are obtained, and the CNN-based attention mechanism module extracts their features Fj and F′j;
102. The features Fj and F′j extracted by the CNN-based attention mechanism module are input into the DDPG network, and the target Q value is calculated using the Actor and Critic networks of the Target Net of the DDPG network, expressed as:
Yj=rj+γ·Q(F′j,πθ′(F′j);ω′);
103. The Critic network loss of the Main Net is calculated from the target Q value, expressed as:
J(ω)=(1/k)·Σ_{j=1..k}(Yj−Q(Fj,aj;ω))²;
The Critic network parameters ω of the Main Net are updated by back-propagating the gradient of the Critic network loss;
104. The Actor network loss of the Main Net is calculated, expressed as:
J(θ)=−(1/k)·Σ_{j=1..k}Q(Fj,πθ(Fj);ω);
The CNN-based attention mechanism module and the Actor network parameters θ of the Main Net are updated by back-propagating the gradient of the Actor network loss;
105. After each update of the network parameters in steps 101 to 104, the parameters of the CNN-based attention mechanism module, the Actor network, and the Critic network in the Target Net are updated as follows:
θ′←αθ+(1-α)θ′;
ω′←αω+(1-α)ω′;
106. Nodes in the network periodically acquire the latest policy network parameters θ′ from the Target Net of the server;
Wherein Yj is the target Q value of the corresponding state-action pair (sj,aj); ω is the Critic network parameter of the Main Net; ω′ is the Critic network parameter of the Target Net; θ′ is the parameter of the CNN-based attention mechanism module and the Actor network of the Target Net; Q(F′j,πθ′(F′j);ω′) is the Q value calculated by the Critic network of the Target Net for the state-action pair (F′j,πθ′(F′j)); Q(Fj,aj;ω) is the Q value calculated by the Critic network of the Main Net for the state-action pair (Fj,aj); γ is the reward discount factor; J(ω) is the loss function of the Critic network of the Main Net; J(θ) is the loss function of the Actor network of the Main Net; rj is the reward value of the j-th sample under the corresponding state and action; A represents the set of all actions; α∈[0,1] is the soft-update rate.
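The soft update of step 105 can be sketched as follows. This is a minimal illustration, assuming network parameters are stored as lists of numpy arrays; the function name `soft_update` is illustrative and not part of the patent.

```python
import numpy as np

def soft_update(main_params, target_params, alpha=0.01):
    """Soft (Polyak) update from step 105: theta' <- alpha*theta + (1-alpha)*theta'.

    main_params / target_params are lists of numpy arrays holding the
    Main Net and Target Net weights; alpha in [0, 1] is the update rate."""
    return [alpha * m + (1 - alpha) * t
            for m, t in zip(main_params, target_params)]

# With alpha = 0.5 the target moves halfway toward the main parameters.
main = [np.array([1.0, 2.0])]
target = [np.array([0.0, 0.0])]
target = soft_update(main, target, alpha=0.5)  # -> [array([0.5, 1.0])]
```

A small α keeps the Target Net changing slowly, which stabilizes the target Q values used in step 102.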
Further, the reward value rj of the j-th sample under the corresponding state and action is expressed as:
rj=w1*f(Th)+w2*f(Ce)+w3*f(De);
where f(Th) is the throughput index of the node; f(Ce) is the consumed-energy index of the node; f(De) is the delay index of the node; w1, w2 and w3 are the weights of f(Th), f(Ce) and f(De), respectively, with w1+w2+w3=1.
Further, the calculation of the throughput index of the node includes:
f(x)=α·e^(β(x−E[x])/(max[x]−E[x])), x=Th;
where α and β are the coefficients of the nonlinear scoring function; E[x] denotes the expectation of x; max[x] denotes the maximum value of x; Th denotes the throughput of the node.
Further, the calculation of the consumed energy index of the node and the time delay index of the node includes:
f(y)=α·e^(β(E[y]−y)/(E[y]−min[y])), y∈{Ce,De};
where α and β are the corresponding coefficients: defining f(x)=40 points when an index value reaches its average level, i.e. x=E[x], gives α=40; requiring f(x)=100 points when x=max[x] gives β=ln 2.5 (since 40×2.5=100). E[y] denotes the expectation of y; min[y] denotes the minimum value of y; Ce denotes the consumed energy of the node; De denotes the delay of the node.
Further, a local decision model is provided on each node; the model comprises a CNN-based attention mechanism module and the Actor network of the Target Net of the DDPG network, and its Actor network parameters are obtained from the Target Net of the DDPG network on the server.
Further, the working process of the CNN-based attention mechanism module comprises the following steps:
A graph vector sj in the sample is obtained, and 32 convolution kernels of size 1×1 extract the corresponding features, expressed as:
F=Conv1×1(sj);
Global average pooling and global max pooling are applied to F over the channel domain to obtain two new features, Favg∈R^(1×m×r) and Fmax∈R^(1×m×r), which are fused as:
Fam=[Favg;Fmax];
Global average pooling of Fam over the channel extracts finer detail features; the pooled result is denoted:
Fc∈R^(1×m×r);
Two convolution layers with different kernel sizes implement the dual-attention mechanism in two dimensions; each layer has a single convolution kernel, and the two convolution operations yield Nw and Mw, specifically:
Nw=Conv1×m(Fam);
Mw=Convr×1(Fam);
NMw is computed by matrix multiplication:
NMw=δ(Nw×Mw);
where NMw∈R^(1×m×r) and δ(·) is the activation function. A residual block preserves the integrity of the information, and a one-dimensional convolution yields the final feature:
Fj=Convm×1(NMw+Fc).
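The channel-domain pooling step can be illustrated in NumPy. Shapes follow the description (C channels over an m×r map); fusing F_avg and F_max by averaging follows the "average fusion" reading given later in the description, so treat this as one possible interpretation, not the definitive implementation.

```python
import numpy as np

def pool_and_fuse(F):
    """Channel-domain pooling sketch for the CNM module.

    F has shape (C, m, r), e.g. C = 32 feature maps from the 1x1 convolutions.
    Global average / max pooling over the channel axis give F_avg and F_max,
    each in R^{1 x m x r}; they are then fused into F_am."""
    F_avg = F.mean(axis=0, keepdims=True)  # (1, m, r) average-pooled features
    F_max = F.max(axis=0, keepdims=True)   # (1, m, r) max-pooled features
    F_am = 0.5 * (F_avg + F_max)           # fused detail features
    return F_avg, F_max, F_am

F = np.random.rand(32, 8, 5)  # m = 8 candidate parents, r = 5 metrics
F_avg, F_max, F_am = pool_and_fuse(F)
```

Average pooling summarizes each (node, metric) cell across all channels, while max pooling keeps its strongest response; fusing the two retains both views before the dual-attention convolutions.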
Further, the information of the m candidate parent nodes with the largest energy is modeled as a graph vector and used as the input of the local decision model. If the number of candidate parent nodes is greater than or equal to m, the node selects the m nodes with the largest remaining energy and abstracts the corresponding routing metric information into an m×r graph vector; when the number of candidate parent nodes is smaller than m, the missing routing metric information is filled with 0 and the information is likewise abstracted into an m×r graph vector, where r is the dimension of the routing metric information.
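Construction of the m×r graph vector with zero padding can be sketched as follows. The assumption that remaining energy is the first field of each metric row, and the function name, are illustrative.

```python
import numpy as np

def build_graph_vector(candidates, m, r):
    """Build the m x r state graph vector from candidate-parent metric rows.

    candidates: list of r-dimensional rows, e.g. [RE, Hop, NO, BQ, ETX]
    per candidate parent (remaining energy assumed to be field 0).
    The m rows with the largest remaining energy are kept; if fewer than
    m candidates exist, the remaining rows stay zero-filled."""
    ranked = sorted(candidates, key=lambda row: row[0], reverse=True)[:m]
    s = np.zeros((m, r))
    for i, row in enumerate(ranked):
        s[i] = row
    return s

# Node-B-style case: only 2 candidates, so rows 3 and 4 are zero-padded.
s = build_graph_vector([[3, 1, 2, 0, 1], [5, 2, 1, 1, 2]], m=4, r=5)
```

Sorting by remaining energy puts the strongest candidates in the first rows, so a fixed-size model input is obtained regardless of how many neighbors a node actually overhears.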
The invention realizes intelligent decision-making on resource-limited nodes through asynchronous experience collection and centralized model training. Each node selects the optimal candidate parent node in a distributed manner based on its local observations, maximizing the network survival time while reducing the end-to-end delay, and a CNN-based bidirectional attention mechanism extracts candidate-parent features at finer granularity along both the node and routing-metric dimensions. In addition, the route optimization model is highly scalable and suits scenarios in which the nodes in the network change dynamically.
Drawings
FIG. 1 is a flow chart of the operation of a node in an embodiment of the invention;
FIG. 2 is a flowchart illustrating the operation of a server according to an embodiment of the present invention;
FIG. 3 is a diagram of a distributed interaction and centralized training system model architecture in accordance with an embodiment of the present invention;
FIG. 4 is a state vector of an embodiment of the present invention;
fig. 5 is a CNM and DDPG-based route optimization model architecture in accordance with an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a wireless route optimization method based on an attention mechanism and deep reinforcement learning, which comprises the following steps:
when each node accesses the network, acquiring the current latest decision model parameters from a server, and monitoring neighbor node information;
The node builds a candidate parent node set according to the overheard neighbor node information, and models the information of the m candidate parent nodes with the largest energy as a graph vector to be used as the input of a local decision model;
Based on a local decision model (comprising a CNN-based attention mechanism and an Actor-Target network module), a node selects an optimal father node as a relay node for data transmission, and after each data period is finished, the node counts relevant performance indexes of the data transmission node;
A similarity quantization function maps the performance indexes into the reward value of the node under the corresponding state and action, and the node transmits the data acquired in the data period to the server;
the server trains a decision model on the server according to the information collected by the nodes.
In this embodiment, the distributed routing algorithm in the wireless multi-hop network environment includes the following steps:
A node newly accessing the network acquires its local decision model parameters: after network access, the node obtains the current latest decision model parameters from the server and monitors neighbor node information;
In a route updating period, the node selects the optimal parent node based on its local observation state and local decision model. The selection process models the information of the m candidate parent nodes with the largest energy as a graph vector, which serves as the input of the local decision model, realizing a deep fusion of the multipath routing parameters and the deep learning model; based on the local decision model, the node selects the optimal parent node as the relay node for data transmission;
in the data period, the node sends the data of the buffer area to the father node;
the node counts the relevant performance of its data transmission in the period (throughput, delay, energy consumption, etc.; this performance information also serves as the routing metric information of the node), and uploads the experience information from its interaction with the environment in this period to the server (stored in an experience pool);
the server acquires an experience data training model from the experience pool and periodically transmits updated decision part model parameters to each node;
during model training, i.e. while the model has not yet converged, the node periodically acquires the latest model parameters from the server and uploads the experience information from its interaction with the environment to the server's experience pool; the server extracts part of the experience from the pool and trains the deep reinforcement learning model (CNM+DDPG) for routing decisions. The training of the model comprises the following steps:
1) Randomly initializing a route optimization model based on CNM and DDPG deployed on a server;
2) In the wireless multi-hop network, a (CNM+Actor-Target) routing decision model is deployed on each wireless node; its architecture is identical to the CNM+Actor-Target structure on the server and is a sub-part of the server model;
3) After the node is accessed to the network, partial parameters of the routing decision model are acquired from a server to update the local model of the node;
4) The node collects candidate father node information, wherein the information comprises node residual energy, hop count, expected transmission times, buffer queue count and neighbor node count, and note that the neighbor node count refers to the neighbor node count of the candidate father node (including the father node);
5) During the control period, the node selects the m candidate parent nodes with the largest remaining energy from its candidate parent nodes, constructs their information into an m×r graph vector (m is the number of selected candidate parent nodes and r is the selected information dimension), and feeds this graph vector as the state information s into the local decision model; the model outputs the corresponding action a, indicating the optimal parent node that the node should select. The state information s models the information of the m candidate parent nodes with the largest energy as a graph vector, as shown in fig. 4, covering two main cases: when the number of candidate parent nodes of a node is greater than or equal to m (such as Node A), the node selects the m nodes with the largest remaining energy and abstracts the corresponding routing metric information into an m×5 graph vector; conversely, when the number of candidate parent nodes is smaller than m (such as Node B), the missing routing metric information is filled with 0. As shown in fig. 4, the selected routing metric information includes remaining energy (RE), hop count (Hop), number of neighbor nodes (NO), buffer queue number (BQ), and expected number of transmissions (ETX).
6) In the data transmission period, the node transmits the data in the buffer area to the selected optimal father node based on the corresponding channel access mechanism of the MAC layer, and counts the corresponding network performance (average data transmission delay, throughput, energy and the like);
7) A similarity quantization function maps the performance indexes into the corresponding reward r of the node under the given state and action; the node transmits the experience information acquired in the period, i.e. &lt;s,a,r,s′&gt;, to the server, which stores it in the experience pool D; each training step, the server extracts a mini-batch from D to train the model;
8) The node periodically acquires the relevant parameters from the server to update its local decision model and continues interacting with the environment;
9) The above process is repeated until the energy of the nodes in the network is exhausted.
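The server-side experience pool D used in steps 7) and 8) can be sketched with a bounded buffer; the class and method names are illustrative, not from the patent.

```python
import random
from collections import deque

class ExperiencePool:
    """Sketch of the experience pool D: stores <s, a, r, s'> tuples
    uploaded by nodes and returns mini-batches for training."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, k):
        # Mini-batch of at most k experiences, sampled without replacement
        return random.sample(list(self.buffer), min(k, len(self.buffer)))

pool = ExperiencePool(capacity=100)
for t in range(5):
    pool.add(s=t, a=t % 3, r=float(t), s_next=t + 1)
batch = pool.sample(k=3)
```

A bounded `deque` keeps memory constant on the server while random sampling breaks the temporal correlation between consecutive experiences, which is the usual motivation for experience replay in DDPG-style training.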
The CNM module extracts features from the candidate-parent attribute values constructed by the node. On the channel domain, max pooling and average pooling retain the features to different degrees, and an average fusion of the two extracts the detail features of the input state. In addition, two convolution layers of different dimensions implement two attention mechanisms of different dimensions, attending both to the comprehensive attributes of each node and to the horizontal and vertical comparison of each attribute, thereby realizing dual attention based on nodes and metrics. Finally, to preserve feature integrity, the residual-network idea is adopted and the two obtained features are fused as the input of the deep reinforcement learning model. Based on the experience information collected by the nodes, the centralized learner optimizes the model parameters with a CNM- and DDPG-based network architecture. Each wireless node only needs to deploy the CNM+Actor-Target model shown in fig. 1-2 for its local decisions; the parameters of this model are trained and optimized by the server, so the local node only needs to download this part of the parameters from the server. The distributed interaction and centralized training model is shown in fig. 3; compared with deploying and training the whole network model on each node, distributed interaction with centralized training effectively reduces the storage and computation pressure on the terminal nodes.
As an alternative embodiment, the parent node (the m candidate parent nodes with the largest energy) is selected using remaining energy (RE), hop count (Hop), expected number of transmissions (ETX), buffer queue number (BQ), and number of neighbor nodes (NO) as the hybrid routing metric information. Using remaining energy (RE) effectively avoids selecting a node with low remaining energy as the optimal parent for transmitting data, helping to prolong the survival time of the network; using hop count (Hop) avoids selecting nodes with excessive hop counts as the preferred parent, improving the data transmission success rate, delay, and other performance; the expected number of transmissions (ETX) indicates link quality and aims to improve the reliability of data transmission; the buffer queue number (BQ) accounts for the load on the candidate parent node, avoiding serious load imbalance; and the number of neighbor nodes (NO) incorporates network dynamics into the model, so as to predict the potential impact that distributed node decisions may have on the current parent node.
FIGS. 1-2 provide workflow diagrams of the node and the server, which are implemented as follows:
After a new node accesses the network, it acquires the local decision model parameters from the server. As shown in fig. 3, the local decision model is part of the deep reinforcement learning model deployed on the server and is used to obtain the corresponding optimal action based on the environmental state;
The node maintains a candidate parent node table in a distributed manner; the table stores and updates in real time the hop count, residual energy, buffer queue number, expected transmission count, number of neighbor nodes and other information of the corresponding candidate parent nodes;
In each route update period, the node selects the current preferred parent node based on the current local decision model and the observed state; the specific process is as follows:
In each route update period (set adaptively according to the network), the node selects the m candidate parent nodes with the largest residual energy, where m can be adaptively adjusted according to the network density (if the number of candidate parent nodes is less than m, all candidate parent nodes are selected and the missing rows are padded with 0). The node abstracts the selected candidate parent node information into a graph vector s. As shown in fig. 4, each row of the graph vector stores the attributes (hop count, residual energy, buffer queue number, expected transmission count, number of neighbor nodes, etc.) of one specific candidate parent node;
The node takes this graph vector s as the input of the local decision model. The model then outputs an action a ∈ A, where A = {1, 2, ..., m} is the action space and a indicates the preferred parent node for that node (e.g., a = 1 indicates that the node should select the parent node whose information is stored in row 1 of the graph vector as its preferred parent node for the next data period).
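As a hedged illustration of the two steps above, the following sketch builds the m×r graph vector with zero padding and maps it to an action a ∈ {1, ..., m}. The `policy` argument is a hypothetical stand-in for the trained CNM+Actor-Target model, and the ordering of attributes within a row is an assumption for illustration only:

```python
import numpy as np

def build_graph_vector(candidates, m):
    """Build the m x r graph vector s from candidate parent attributes.

    candidates: list of [hop, residual_energy, buffer_queue,
                         expected_tx, neighbor_count] rows (r = 5).
    The m candidates with the largest residual energy are kept;
    missing rows are padded with 0, as described above.
    """
    r = 5
    # rank by residual energy (column 1), descending, keep at most m
    ranked = sorted(candidates, key=lambda row: row[1], reverse=True)[:m]
    s = np.zeros((m, r))
    for i, row in enumerate(ranked):
        s[i] = row
    return s

def select_parent(s, policy):
    """Map the graph vector to an action a in {1, ..., m} (row index + 1)."""
    scores = policy(s)  # stand-in for the local CNM+Actor-Target model
    return int(np.argmax(scores)) + 1

# toy usage: 3 candidates, m = 4, so the last row is zero-padded
cands = [[2, 0.7, 3, 1.2, 5], [1, 0.9, 1, 1.1, 4], [3, 0.5, 2, 1.5, 6]]
s = build_graph_vector(cands, m=4)
a = select_parent(s, policy=lambda s: s[:, 1])  # dummy policy: max energy
```

In a deployment, `policy` would be replaced by the downloaded CNM+Actor-Target parameters; the dummy policy here simply reproduces the energy ranking.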
In view of the characteristics of the graph vector, this embodiment designs a CNM-based feature extraction module to extract the attribute features of each candidate parent node; the specific process is as follows:
(1) The state vector s is convolved with 32 one-dimensional (1×1) convolution kernels, expressed as:
F = Conv_{1×1}(s)
(2) Two new features are obtained by global average pooling and global maximum pooling over the 32 channels, namely F_avg ∈ R^{1×m×r} and F_max ∈ R^{1×m×r}, and the two features are fused into F_am = [F_avg; F_max]; global average pooling of F_am ∈ R^{2×m×r} over the channel dimension is used to extract more detailed features, expressed as:
F_c = AvgPool_channel(F_am), F_c ∈ R^{1×m×r}
(3) Convolution operations are performed in two different dimensions using two convolution layers with different convolution kernel sizes, realizing the dual attention mechanism (N_w and M_w), expressed as:
N_w = Conv_{1×m}(F_am)
M_w = Conv_{r×1}(F_am)
Wherein the number of convolution kernels of each convolution layer is 1;
(4) NM_w is calculated using matrix multiplication, expressed as:
NM_w = N_w × M_w
wherein NM_w ∈ R^{1×m×r};
(5) A residual block is used to guarantee the integrity of the information, and a one-dimensional convolution operation is performed; the result is expressed as:
F_j = Conv_{m×1}(NM_w + F_c)
F_j is input into the Actor-Target module to obtain the corresponding action.
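Steps (1)-(5) can be sketched numerically as follows. This is a minimal NumPy illustration with random, untrained weights; the kernel orientations and the use of F_c (rather than the two-channel F_am) as the attention input are assumptions, since the Conv_{1×m}/Conv_{r×1} notation above leaves them ambiguous:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnm_forward(s, m, r, c=32):
    """Hedged sketch of the CNM feature extractor with random weights."""
    # (1) 1x1 convolution: lift the single-channel m x r map to c channels
    w1 = rng.standard_normal(c)
    F = s[None, :, :] * w1[:, None, None]              # (c, m, r)
    # (2) channel-domain pooling, concatenation, and average fusion
    F_avg, F_max = F.mean(axis=0), F.max(axis=0)       # each (m, r)
    F_am = np.stack([F_avg, F_max])                    # (2, m, r)
    F_c = F_am.mean(axis=0)                            # (m, r) detail map
    # (3) dual attention: node-wise and metric-wise scores (assumption:
    # computed from F_c instead of the two-channel F_am)
    n_w = F_c @ rng.standard_normal(r)                 # (m,) node attention
    m_w = rng.standard_normal(m) @ F_c                 # (r,) metric attention
    # (4) matrix multiplication -> joint attention map NM_w
    NM_w = np.outer(n_w, m_w)                          # (m, r)
    # (5) residual fusion with F_c and a final linear (conv-like) mixing
    F_j = (NM_w + F_c) @ rng.standard_normal((r, r))   # (m, r) features
    return F_j

F_j = cnm_forward(rng.standard_normal((4, 5)), m=4, r=5)
```

The sketch only demonstrates the shapes and data flow of the module; in the embodiment these weights are learned by the server during centralized training.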
During a data transmission period, the node interacts with the preferred parent node on the selected working channel based on the corresponding medium access mechanism (e.g., CSMA/CA, TDMA, etc.); the node records the corresponding network performance indicators for this period: throughput (Th), consumed energy (Ce) and average packet transmission delay (De).
The node adopts a nonlinear integration method to achieve uniform quantization of the performance indexes, wherein throughput is a forward index, and consumed energy and average end-to-end delay are reverse indexes. Accordingly, the following two formulas are used to calculate them, respectively.
f(x) = α·e^{β·(x − E[x])/(max[x] − E[x])}, x = Th
f(y) = α·e^{β·(E[y] − y)/(E[y] − min[y])}, y ∈ {Ce, De}
Where α and β are scoring coefficients: they map an index at its average level to a score of α and an index at its best level to 100. For example, on a 100-point scale where the average level scores 40, α = 40 and β = ln 2.5, since 40·e^{ln 2.5} = 100; in the embodiment of the present invention, α = 40 and β = ln 2.5. x and y denote the throughput and the consumed energy or average packet transmission delay of the node in the current period. E[x] denotes the mean value of x over all periods since the node joined the network, and max[x] denotes the maximum single-period value of x since the node joined the network; similarly, min[y] denotes the minimum single-period value of y since the node joined the network.
The node performs weighted accumulation of the uniformly quantized indexes using the following formula to obtain the corresponding reward value r_j:
r_j = w1·f(Th) + w2·f(Ce) + w3·f(De)
where w1, w2 and w3 are weighting coefficients with w1 + w2 + w3 = 1, indicating the importance the current network attaches to the different metrics.
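The uniform quantization and the weighted reward above can be sketched as follows (a minimal illustration; the per-index statistics passed in and the example weights are assumptions for demonstration, not values fixed by the embodiment):

```python
import math

ALPHA, BETA = 40.0, math.log(2.5)  # average level -> 40, best level -> 100

def f_forward(x, mean_x, max_x):
    """Forward index (throughput): larger is better."""
    return ALPHA * math.exp((x - mean_x) / (max_x - mean_x) * BETA)

def f_reverse(y, mean_y, min_y):
    """Reverse index (consumed energy, delay): smaller is better."""
    return ALPHA * math.exp((mean_y - y) / (mean_y - min_y) * BETA)

def reward(th, ce, de, stats, w=(0.4, 0.3, 0.3)):
    """Weighted reward r_j = w1*f(Th) + w2*f(Ce) + w3*f(De); sum(w) = 1."""
    return (w[0] * f_forward(th, *stats['th'])
            + w[1] * f_reverse(ce, *stats['ce'])
            + w[2] * f_reverse(de, *stats['de']))

# a period that exactly matches the node's historical averages scores 40
# on every index, so the weighted reward is also 40
stats = {'th': (5.0, 8.0), 'ce': (2.0, 1.0), 'de': (3.0, 1.5)}
r_j = reward(5.0, 2.0, 3.0, stats)  # -> 40.0
```

Note how the sign conventions differ: the forward index rewards exceeding the historical mean, while the reverse indexes reward falling below it.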
The node uploads the experience data generated by interacting with the environment in this period, namely e = &lt;s, a, r, s'&gt;, to the server;
The server stores the experience information from each node in the experience pool D and randomly samples a mini-batch from the experience pool each time it updates the model. The model deployed on the server is shown in fig. 5. The training process of the model is as follows:
1) The server stores the experiences acquired by the wireless nodes in the network into the experience replay pool of the centralized learner;
2) The server samples a mini-batch from the experience replay pool, e_j = &lt;s_j, a_j, r_j, s'_j&gt;, j = 1, 2, ..., k;
3) The feature vectors F_j and F'_j corresponding to s_j and s'_j are calculated based on the CNM;
4) The Target Q value is calculated as:
Y_j = r_j + γ·Q(F'_j, π_{θ'}(F'_j); ω')
5) The mean square error J(ω) = (1/k)·Σ_{j=1}^{k} (Y_j − Q(F_j, a_j; ω))² is calculated, and the Critic-Main network parameters ω are updated based on gradient back propagation of the deep network;
6) The Actor loss J(θ) = −(1/k)·Σ_{j=1}^{k} Q(F_j, π_θ(F_j); ω) is calculated, and the parameters θ of the CNM+Actor main policy network are updated through gradient back propagation of the neural network;
7) Every C rounds, the CNM+Actor Target policy network and the Critic Target Q network parameters are updated:
θ′←αθ+(1-α)θ′
ω′←αω+(1-α)ω′
The above process is repeated until the model converges.
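The target-value computation of step 4) and the soft update of step 7) can be sketched as follows; the toy Actor/Critic stand-ins below are assumptions for illustration, not the trained networks of the embodiment:

```python
import numpy as np

def target_q(r, F_next, actor_t, critic_t, gamma=0.99):
    """Target value Y_j = r_j + gamma * Q(F'_j, pi_theta'(F'_j); omega')."""
    return r + gamma * critic_t(F_next, actor_t(F_next))

def soft_update(target_params, main_params, alpha=0.01):
    """theta' <- alpha*theta + (1 - alpha)*theta', applied every C rounds."""
    return [alpha * p + (1.0 - alpha) * tp
            for tp, p in zip(target_params, main_params)]

# toy stand-ins for the Target Net's Actor and Critic (assumed, untrained)
actor_t = lambda F: F.mean()            # "action" from features
critic_t = lambda F, a: F.sum() + a     # "Q value" for (features, action)

y = target_q(r=1.0, F_next=np.array([1.0, 2.0]),
             actor_t=actor_t, critic_t=critic_t)
new_target = soft_update([np.array([0.0, 2.0])],
                         [np.array([1.0, 2.0])], alpha=0.5)
```

Because α ∈ [0,1] is small in practice, the target networks track the main networks slowly, which stabilizes the bootstrapped target Y_j.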
The node periodically updates its local decision model parameters from the server, and the local model does not need to be trained independently, which greatly reduces the computational complexity of the terminal node. In addition, the nodes in the network asynchronously collect experience from their locally observed environments, which provides the server with more experience information and thus speeds up convergence and improves the generalization capability of the model. Furthermore, given training on sufficiently rich experience, the combination of centralized training (server) and distributed interaction (nodes) can optimize routes in dynamic or mobile wireless multi-hop scenarios.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (2)
1. The wireless route optimization method based on the attention mechanism and the deep reinforcement learning is characterized by comprising the following steps of:
when each node accesses the network, acquiring the current latest decision model parameters from a server, and monitoring neighbor node information;
the node builds a candidate parent node set according to the monitored neighbor node information, and models the information of the m candidate parent nodes with the largest energy as a graph vector, which is used as the input of a local decision model;
based on the local decision model, the node selects the optimal parent node as the relay node for data transmission, and after each data period ends, the node counts the relevant performance indexes of the data transmission;
mapping the performance indexes through a uniform quantization function into the corresponding reward value of the node under the corresponding state and action, and transmitting, by the node, the data acquired in the data period to the server;
the server trains a decision model on the server according to the information collected by the nodes, and a global model on the server comprises a CNN-based attention mechanism module and DDPG networks, wherein the CNN-based attention mechanism module is used for extracting features from graph vectors constructed by candidate father node sets and inputting the extracted features into the DDPG networks to execute a routing decision and model optimization process;
the process of constructing the graph vector includes:
selecting the m candidate parent nodes with the largest residual energy from the candidate parent nodes, and constructing their information into an m×r graph vector as the state vector s, wherein r is the selected information dimension;
when the number of candidate parent nodes of the node is greater than or equal to m, the node selects the m nodes with the largest residual energy and abstracts the corresponding route metric information into an m×5 graph vector, wherein the selected route metric information comprises the residual energy information, the hop count, the number of neighbor nodes, the buffer queue number and the expected transmission count; when the number of candidate parent nodes of the node is smaller than m, the missing route metric information is padded with 0;
the process of extracting features from graph vectors constructed by candidate parent node sets by the CNN-based attention mechanism module includes:
the state vector s is convolved with 32 one-dimensional (1×1) convolution kernels, expressed as:
F = Conv_{1×1}(s)
two new features are obtained by global average pooling and global maximum pooling over the 32 channels, namely F_avg ∈ R^{1×m×r} and F_max ∈ R^{1×m×r}, and the two features are fused into F_am = [F_avg; F_max]; global average pooling of F_am ∈ R^{2×m×r} over the channel dimension is used to extract more detailed features, expressed as:
F_c = AvgPool_channel(F_am), F_c ∈ R^{1×m×r}
the convolution operation is performed in two different dimensions using two convolution layers with different convolution kernel sizes, namely:
N_w = Conv_{1×m}(F_am)
M_w = Conv_{r×1}(F_am)
Wherein the number of convolution kernels of each convolution layer is 1;
NM_w is calculated using matrix multiplication, expressed as:
NM_w = N_w × M_w
wherein NM_w ∈ R^{1×m×r};
a residual block is used to guarantee the integrity of the information, and a one-dimensional convolution operation is performed; the result is expressed as:
F_j = Conv_{m×1}(NM_w + F_c)
taking F_j as an input of the DDPG network; in the DDPG network, the reward value r_j of the j-th node under the corresponding state and action is expressed as:
r_j = w1·f(Th) + w2·f(Ce) + w3·f(De);
where f(Th) represents the throughput index of the node, f(Ce) represents the consumed energy index of the node, f(De) represents the delay index of the node, and w1, w2 and w3 are the weights of f(Th), f(Ce) and f(De), respectively, with w1 + w2 + w3 = 1;
the calculation of the throughput index f(x) of the node includes:
f(x) = α·e^{β·(x − E[x])/(max[x] − E[x])}, x = Th
wherein α and β are first coefficients; E[x] denotes the expectation of x; max[x] denotes the maximum value of x; Th denotes the throughput of the node;
the calculation of the consumed energy index and the delay index f(y) of the node includes:
f(y) = α_1·e^{β_1·(E[y] − y)/(E[y] − min[y])}, y ∈ {Ce, De}
wherein α_1 and β_1 are second coefficients, specified such that when the index value reaches the average level the score is α_1 = 40, and when the index value reaches its optimum (minimum) value the score is 100, giving β_1 = ln 2.5; E[y] denotes the expectation of y; min[y] denotes the minimum value of y; Ce denotes the consumed energy of the node; De denotes the delay of the node.
2. The wireless route optimization method based on the attention mechanism and deep reinforcement learning according to claim 1, wherein after the node transmits the data collected in one data period to the server, the server stores the data in an experience replay pool of the server; the server samples k samples from the experience pool to train the decision model on the server, and the training process comprises:
101. sampling k samples from the experience pool, e_j = &lt;s_j, a_j, r_j, s'_j&gt;, j = 1, 2, ..., k, wherein the j-th sample is represented by the current state s_j of the sample, the action a_j, the reward value r_j obtained for the state-action pair (s_j, a_j), and the state s'_j after the state-action pair (s_j, a_j) is executed; obtaining the states in the samples, i.e., the graph vectors s_j and s'_j, and extracting the features F_j and F'_j of the graph vectors using the CNN-based attention mechanism module;
102. the features F_j and F'_j extracted by the CNN-based attention mechanism module are input into the DDPG network, and the Target Q value is calculated through the Actor and Critic networks of the Target Net, expressed as:
Y_j = r_j + γ·Q(F'_j, π_{θ'}(F'_j); ω');
103. the Critic network loss of the Main Net is calculated according to the Target Q value, expressed as:
J(ω) = (1/k)·Σ_{j=1}^{k} (Y_j − Q(F_j, a_j; ω))²;
the Critic network parameters ω of the Main Net are updated based on gradient back propagation of the Critic network loss;
104. the Actor network loss of the Main Net is calculated, expressed as:
J(θ) = −(1/k)·Σ_{j=1}^{k} Q(F_j, π_θ(F_j); ω);
the CNN-based attention mechanism module and the Actor network parameters θ of the Main Net are updated based on gradient back propagation of the Actor network loss;
105. every C updates of the network parameters through steps 101 to 104, the parameters of the CNN-based attention mechanism module, the Actor network and the Critic network in the Target Net are updated, expressed as follows:
θ′←αθ+(1-α)θ′;
ω′←αω+(1-α)ω′;
106. the nodes in the network periodically acquire the latest policy network parameters, namely θ', from the Target Net of the server;
wherein Y_j is the target Q value of the corresponding state-action pair (s_j, a_j); ω is the Critic network parameter of the Main Net; ω' is the Critic network parameter of the Target Net; θ' is the parameter of the CNN-based attention mechanism module and the Actor network of the Target Net; Q(F'_j, π_{θ'}(F'_j); ω') is the Q value calculated by the Critic network of the Target Net for the corresponding state-action pair (F'_j, π_{θ'}(F'_j)); Q(F_j, a_j; ω) is the true Q value calculated by the Critic network of the Main Net for the corresponding state-action pair (F_j, a_j); γ is the reward discount factor; J(θ) is the loss function of the Actor network of the Main Net; r_j is the reward value of the j-th node under the corresponding state and action; A is the action space, i.e., the set of all actions; and α ∈ [0,1] is the soft-update rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210068572.8A CN114423061B (en) | 2022-01-20 | 2022-01-20 | Wireless route optimization method based on attention mechanism and deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114423061A CN114423061A (en) | 2022-04-29 |
CN114423061B true CN114423061B (en) | 2024-05-07 |
Family
ID=81276089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210068572.8A Active CN114423061B (en) | 2022-01-20 | 2022-01-20 | Wireless route optimization method based on attention mechanism and deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114423061B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114884895B (en) * | 2022-05-05 | 2023-08-22 | 郑州轻工业大学 | Intelligent flow scheduling method based on deep reinforcement learning |
CN115190135B (en) * | 2022-06-30 | 2024-05-14 | 华中科技大学 | Distributed storage system and copy selection method thereof |
CN115842770B (en) * | 2022-11-07 | 2024-05-14 | 鹏城实验室 | Routing method based on depth map neural network and related equipment |
CN116170370B (en) * | 2023-02-20 | 2024-03-12 | 重庆邮电大学 | SDN multipath routing method based on attention mechanism and deep reinforcement learning |
CN117863948B (en) * | 2024-01-17 | 2024-06-11 | 广东工业大学 | Distributed electric vehicle charging control method and device for auxiliary frequency modulation |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105704754A (en) * | 2014-12-12 | 2016-06-22 | 华北电力大学 | Wireless sensor network routing method |
CN107018548A (en) * | 2017-05-27 | 2017-08-04 | 河南科技大学 | The implementation method of cognition wireless network opportunistic routing protocol based on frequency spectrum perception |
CN107920368A (en) * | 2016-10-09 | 2018-04-17 | 郑州大学 | RPL routing optimization methods based on life cycle in a kind of wireless sense network |
CN110852273A (en) * | 2019-11-12 | 2020-02-28 | 重庆大学 | Behavior identification method based on reinforcement learning attention mechanism |
CN111315005A (en) * | 2020-02-21 | 2020-06-19 | 重庆邮电大学 | Self-adaptive dormancy method of wireless sensor network |
CN112256056A (en) * | 2020-10-19 | 2021-01-22 | 中山大学 | Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning |
CN112965499A (en) * | 2021-03-08 | 2021-06-15 | 哈尔滨工业大学(深圳) | Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning |
CN113139446A (en) * | 2021-04-12 | 2021-07-20 | 长安大学 | End-to-end automatic driving behavior decision method, system and terminal equipment |
WO2021249515A1 (en) * | 2020-06-12 | 2021-12-16 | 华为技术有限公司 | Channel information feedback method, communication apparatus and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11093561B2 (en) * | 2017-12-11 | 2021-08-17 | Facebook, Inc. | Fast indexing with graphs and compact regression codes on online social networks |
-
2022
- 2022-01-20 CN CN202210068572.8A patent/CN114423061B/en active Active
Non-Patent Citations (2)
Title |
---|
Composition of Visual Feature Vector Pattern for Deep Learning in Image Forensics; K. H. Rhee; IEEE Access; 20201006; full text *
Research on Reinforcement Learning Algorithms for Mobile Vehicle Path Planning in Special Traffic Environments; Chen Liang; Engineering Science and Technology Series; 20200115; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114423061A (en) | 2022-04-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||