CN117749692A - Wireless route optimization method and network system based on deep contrast reinforcement learning - Google Patents

Info

Publication number
CN117749692A
Authority
CN
China
Prior art keywords
node
network
reinforcement learning
vector
model
Legal status
Pending
Application number
CN202311811899.0A
Other languages
Chinese (zh)
Inventor
罗世龙
林贤文
严明俊
李昌波
陈姜林
Current Assignee
Chongqing Kelanda Technology Co ltd
Original Assignee
Chongqing Kelanda Technology Co ltd
Application filed by Chongqing Kelanda Technology Co ltd
Priority to CN202311811899.0A
Publication of CN117749692A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention discloses a wireless route optimization method and network system based on deep contrast reinforcement learning. The method is applied to an Internet of Things wireless multi-hop network and a server: a deep contrast reinforcement learning model is deployed on the server, the network comprises a sink node and a plurality of wireless terminal nodes, and an Actor network is deployed on each terminal node as a distributed routing decision model. The method comprises the following steps: based on the superframe period, a node obtains the current latest routing decision model from the server when it accesses the network; in the control period, the node generates the optimal forwarding node based on the latest routing decision model and its local state vector; in the data transmission period, the node transmits data to the optimal forwarding node; the node uploads the experience information collected in each superframe period to the server; the server stores the experience information in an experience pool, extracts part of it, and trains the deep contrast reinforcement learning model. The invention reduces the amount of calculation and improves the routing effect.

Description

Wireless route optimization method and network system based on deep contrast reinforcement learning
Technical Field
The invention relates to the technical field of wireless Internet of Things routing, and in particular to a wireless route optimization method and network system based on deep contrast reinforcement learning.
Background
The Internet of Things (IoT) plays a vital role in the development of social informatization and modernization as an important component of new computer technology. As one of the key underlying support technologies of the Internet of Things, wireless sensor networks (Wireless Sensor Networks, WSNs) have become a focus of attention in academia and industry. They are now widely applied in emerging fields such as the Internet of Vehicles, smart cities and Industry 4.0, providing efficient data acquisition and transmission services for these applications. A WSN consists of a large number of resource-constrained sensor nodes deployed in a target monitoring area, which transmit data in multi-hop form to a user server through wireless communication technology. In a WSN, the routing protocol is a crucial component and is one of the hot spots of current research at home and abroad. Furthermore, since nodes in WSNs are typically battery powered, their computing and storage capabilities are relatively weak, and the transmission of data packets also consumes energy. The network therefore suffers from problems such as network delay, short network lifetime and non-uniform network energy consumption. Research and innovation on routing algorithms can provide important support for constructing an energy-efficient, safe and reliable Internet of Things communication system.
However, routing algorithms in conventional wireless multi-hop networks suffer from a number of problems: 1) cluster head selection is often unreasonable, so a distant cluster head may deplete its energy prematurely due to long-distance data transmission, which not only wastes energy but may also cause network partitioning; 2) most algorithms do not take the current energy state of the cluster head node into account; if a node with low energy is selected as the cluster head, its energy consumption is accelerated, affecting the lifetime of the whole network; 3) some conventional routing decision algorithms employ fixed routing rules and lack awareness of the network state, which may leave some equivalent paths more heavily loaded, make adaptive traffic offloading difficult, and potentially result in load imbalance.
Conventional routing algorithms include distance vector algorithms, link state algorithms, etc., which are widely used in conventional networks, but may face efficiency and scalability problems in large-scale, dynamic internet of things environments. Aiming at the characteristics of the Internet of things, researchers propose special routing algorithms, such as position-based routing, energy-aware routing and the like, so as to meet the low-power consumption and high-efficiency communication requirements of the Internet of things equipment. In order to collect data efficiently in a WSN, sensor nodes may need to be deployed dynamically according to a certain scenario, however, it is more difficult to implement efficient routing algorithms in dynamic and complex network environments.
Some studies are exploring the application of reinforcement learning algorithms to routing decisions to optimize routing by learning network states and traffic patterns to improve network performance. Reinforcement learning is widely used to study radio resource route optimization algorithms as a branch of machine learning. The routing decision algorithm based on reinforcement learning has high perceptibility to the network traffic state, and can dynamically adjust the data transmission amounts of different transmission paths according to traffic changes, thereby realizing traffic self-adaptive unloading and resource scheduling. Patent CN109361601a describes an SDN route planning method based on reinforcement learning, wherein a Q-learning algorithm is used as a route decision model. The inputs to the model include network topology information, traffic matrix and QoS class, and the output is the shortest path that meets the requirements. Although this approach aims at improving the bandwidth utilization of the network links and reducing network congestion, it has the disadvantage that each traffic is forwarded only along a selected fixed shortest path, possibly resulting in an unbalanced path load.
Patent CN110611619a proposes an intelligent route decision method based on DDPG reinforcement learning, which utilizes DDPG algorithm to construct a route decision model based on reinforcement learning. The model takes network flow matrix information as input, and takes the absolute value of the difference value between the maximum bandwidth utilization rate and the minimum bandwidth utilization rate in the minimum network equivalent path as a target through a reinforcement learning algorithm, so that the data volume of different transmission paths is dynamically adjusted, and load balancing is realized. However, the algorithm only uses the values of the maximum bandwidth utilization and the minimum bandwidth utilization in a group of equivalent paths to judge the network load balancing state, so that the bandwidths of other paths are difficult to effectively adjust, and the load imbalance of other paths may be caused. In addition, the data set in patent CN110611619a is small, and it is difficult to meet the training requirement of the reinforcement learning algorithm, so that the model performs poorly in many cases, and the network still has an unbalanced load distribution.
However, deep reinforcement learning (Deep Reinforcement Learning, DRL) based route optimization algorithms still face the following challenges: 1) The consumption of computing and storage resources is large: the DRL model is typically larger, requiring more memory and computing resources. This may result in infeasibility or performance degradation on resource-limited internet of things devices; 2) The data demand is large: DRL algorithms typically require a large amount of training data to learn the appropriate strategy. It is a challenge to obtain sufficient tag data, especially in dynamic and complex internet of things environments.
In summary, the existing route optimization methods have the following problems: (1) conventional routing algorithms do not consider routing metrics comprehensively and rely on a single type of metric; the node and link characteristics lack hybrid routing metric information such as the residual energy, hop count, expected transmission times, buffer queue number and potential sub-node count of the candidate forwarding nodes, so energy efficiency, data transmission reliability and network stability are not effectively improved; (2) hybrid routing metric functions that use only a simple linear combination determine each metric weight coefficient from subjective expert experience; these coefficients are generally not adaptively adjusted during network operation according to the objective, actual requirements of the network, which affects network performance to some extent; (3) routing algorithms based on deep reinforcement learning can adaptively optimize the relevant weight coefficients according to the network operating environment, but deploying the corresponding deep reinforcement learning model on a terminal consumes a great deal of the node's computing, storage and energy resources during model training, which brings new challenges to making resource-limited nodes intelligent.
Disclosure of Invention
The invention aims to provide a wireless route optimization method and network system based on deep contrast reinforcement learning, and discloses a distributed route optimization method for efficient data transmission in the Internet of Things, realized by a deep reinforcement learning model assisted by contrast learning (called the deep contrast reinforcement learning model). The invention decouples the training and inference tasks of the deep contrast reinforcement learning model, aiming to relieve the computing and storage pressure on terminal equipment while providing richer and more diversified model training data for the server, thereby accelerating the convergence of the deep contrast reinforcement learning model and improving network performance. In addition, the invention comprehensively considers multiple routing metrics containing historical information (hop count, residual energy, buffer queue number, delay, etc.), introduces contrast learning to improve the feature representation capability of the decision model, and adopts a multi-scale convolutional neural network to extract the routing metric features of each candidate forwarding node from different dimensions; by comparing the relative advantages and disadvantages of different paths, the model can better understand differences in the environment and thus improve the routing effect.
The invention is realized by the following technical scheme:
in a first aspect, the invention provides a wireless route optimization method based on deep contrast reinforcement learning, which is applied to a wireless multi-hop network of the Internet of things and a server, wherein a deep contrast reinforcement learning model is deployed on the server, the network comprises a sink node and a plurality of wireless terminal nodes, and an Actor network is deployed on each terminal node as a distributed route decision model; the method comprises the following steps:
Dividing the whole time into a plurality of continuous super-frame periods, wherein each super-frame period comprises a control period and a data transmission period;
based on the superframe period, each node acquires the current latest routing decision model from the server when accessing the network; in the control period, the node generates the current optimal action a based on the latest routing decision model and the local state vector s and maps it to an optimal forwarding node; in the data transmission period, the node transmits data to the optimal forwarding node; after each data transmission period ends, the node counts the related network performance indexes (such as packet delivery rate (Packet Delivery Ratio, PDR), end-to-end delay (E2E Delay) and energy efficiency (Energy Efficiency, EE)) and calculates the corresponding reward value r according to the constructed reward function using a nonlinear score method; before the deep contrast reinforcement learning model converges, the node uploads the experience information < s, a, r, s' > collected in each superframe period to the server; in each control period, the node needs to acquire the latest routing decision model parameters from the server and upload the experience information generated by its interaction with the environment in the last data transmission period. Here s' is the new state of the environment after the agent performs action a in state s;
The server periodically collects experience information from the nodes, stores it in an experience pool C, extracts part of it, and trains the deep contrast reinforcement learning model; in addition, when the server receives experience information, it also calculates the dominance function based on that information; if the dominance value is greater than 0, the server stores the experience information into a specific queue in the experience library D according to the action information of the experience, and this queue is used as a source of training samples for contrast learning. The experience library D has m queues, each corresponding to a specific action.
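As an illustration of the per-superframe interaction summarized above, the following sketch outlines the node-side workflow; all helper names (Actor parameter download, state collection, reward calculation, experience upload) are hypothetical and only mirror the steps listed in the claim.

```python
# Hypothetical sketch of the node-side superframe workflow described above.
# All helper names are illustrative and not taken from the patent text.

def run_superframe(node, server):
    # Control period: refresh the local routing decision model (Actor network)
    node.actor.load_params(server.latest_actor_params())

    s = node.collect_local_state()           # local state vector s (3-D graph vector)
    a = node.actor.best_action(s)            # current optimal action a
    next_hop = node.map_action_to_node(a)    # optimal forwarding node

    # Data transmission period: forward buffered data to the chosen node
    node.transmit_buffer(next_hop)

    # End of the data transmission period: statistics and reward
    stats = node.network_statistics()        # PDR, end-to-end delay, energy efficiency
    r = node.reward(stats)                   # reward r from the nonlinear score method
    s_next = node.collect_local_state()      # new state s'

    # Before the model converges, upload the experience tuple <s, a, r, s'>
    if not server.model_converged():
        server.upload_experience((s, a, r, s_next))
```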
Further, in the control period, the node generates and maps the current optimal action a into an optimal forwarding node based on the latest routing decision model and the local state vector s, including:
in the control period, when a certain node accesses the network, acquiring a current latest routing decision model from a server, and monitoring neighbor node information of the node;
the node collects candidate forwarding node information according to the monitored neighbor node information, and builds a candidate forwarding node set;
modeling the current state and the historical state information of m candidate forwarding nodes with the largest energy in the candidate forwarding node set into a three-dimensional graph vector; and taking the three-dimensional graph vector as a local state vector s of the latest routing decision model;
Based on the latest routing decision model and the local state vector s, the node inputs s into the latest routing decision model (Actor network) and generates the optimal action a; the optimal action a is then mapped to the corresponding optimal forwarding node.
Further, the node information includes the node's energy efficiency information, hop count, expected transmission times, buffer queue number, and potential sub-node count;
the energy efficiency information of a node is used to evaluate the energy usage of the current candidate forwarding node, so that nodes with low energy efficiency are not selected as forwarding nodes for data transmission, thereby prolonging the network life cycle;
the hop count is used for evaluating the distance from the node to the sink node, and based on the distance, the node is prevented from selecting the node with the overlarge hop count as the forwarding node, so that the success rate of data transmission is improved, and the time delay of data transmission is reduced;
the expected transmission times are used for evaluating the communication quality of the appointed link, and the data link with lower expected transmission times (Expected Transmission Times, ETX) is selected, so that the reliability of data transmission can be improved;
the buffer queue number is used for evaluating the load degree of the candidate forwarding nodes and avoiding selecting the nodes causing serious load imbalance of the network to forward data;
The potential sub-node count is used to evaluate the network dynamics and potential congestion conditions of candidate forwarding nodes, thereby predicting the potential impact that distributed node decisions may have on the current forwarding node.
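For illustration, the hybrid routing metric record maintained for each candidate forwarding node could be represented as follows; the field and function names are assumptions, while the metrics themselves follow the description above.

```python
from dataclasses import dataclass

# Illustrative record of the hybrid routing metrics tracked per candidate
# forwarding node; the field names are assumptions, the metrics follow the text.
@dataclass
class CandidateInfo:
    node_id: int
    energy_efficiency: float   # energy efficiency / residual-energy information
    hop_count: int             # hops to the sink node
    etx: float                 # expected transmission times of the link
    buffer_queue: int          # buffered packets (load level)
    potential_children: int    # potential sub-nodes (dynamics / congestion)

def top_m_by_energy(candidates, m):
    """Keep the m candidates with the largest energy, as used when building the state."""
    return sorted(candidates, key=lambda c: c.energy_efficiency, reverse=True)[:m]
```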
Further, inputting the local state vector s into a latest routing decision model (Actor network), generating an optimal action a, and mapping the optimal action a into a corresponding optimal forwarding node, including:
a convolution layer with kernel size c×1×1 is used to perform a convolution operation on the local state vector s, obtaining a first vector X1 = Conv_{c×1×1}(s);
the first vector X1 is processed by global average pooling and global maximum pooling to obtain a global average pooling vector X_avg and a global maximum pooling vector X_max, which are combined to obtain a second vector X2 = [X_avg; X_max], where H is the number of candidate forwarding nodes considered and W is the number of routing metrics;
global average pooling is performed again on the channels of the second vector X2 to extract more detailed features, generating a third vector X3;
convolution layers with kernel sizes 1×H×1 and 1×1×W are used to perform convolution operations on the second vector X2 simultaneously in two different dimensions, generating a fifth vector N_w and a sixth vector M_w respectively: N_w = Conv_{1×H×1}(X2), M_w = Conv_{1×1×W}(X2);
matrix multiplication is used to multiply the fifth vector N_w and the sixth vector M_w, obtaining the product vector NM_w = N_w × M_w;
a residual block is used to preserve the integrity of the information: a convolution layer with kernel size 3×1×1 performs a convolution operation on the sum of the product vector and the third vector, obtaining a fourth vector X4 = Conv_{3×1×1}(NM_w + X3);
a convolution layer with kernel size 1×1×W performs a convolution operation on the fourth vector to generate the optimal action a = softmax(Conv_{1×1×W}(X4));
the optimal action a is converted in dimension to a 1×H vector, and according to the distribution of the optimal action, the forwarding node corresponding to the action with the maximum probability value is selected as the node's optimal candidate forwarding node.
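A rough PyTorch sketch of this multi-scale convolutional mapping is given below. The kernel shapes follow the steps above, while the channel counts, padding and the order of the matrix product are assumptions made where the description leaves the exact tensor dimensions open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorNet(nn.Module):
    """Rough sketch of the multi-scale convolutional Actor described above.
    Kernel shapes follow the text; channel counts, padding and the order of the
    matrix product are assumptions where the description is ambiguous."""
    def __init__(self, t=3, m=8, n=5, ch=8):
        super().__init__()
        # Conv_{c x 1 x 1}: fuse the t historical observations per (node, metric) cell
        self.conv_time = nn.Conv2d(t, ch, kernel_size=1)
        # Conv_{1 x H x 1} and Conv_{1 x 1 x W}: column- and row-wise convolutions on X2
        self.conv_col = nn.Conv2d(2, 1, kernel_size=(m, 1))
        self.conv_row = nn.Conv2d(2, 1, kernel_size=(1, n))
        # Conv_{3 x 1 x 1}: residual refinement along the candidate-node axis (padding assumed)
        self.conv_res = nn.Conv2d(1, 1, kernel_size=(3, 1), padding=(1, 0))
        # Conv_{1 x 1 x W}: collapse the metric axis to one score per candidate node
        self.conv_out = nn.Conv2d(1, 1, kernel_size=(1, n))

    def forward(self, s):                                # s: (B, t, m, n) graph vector
        x1 = self.conv_time(s)                           # X1, shape (B, ch, m, n)
        x_avg = x1.mean(dim=1, keepdim=True)             # global average pooling over channels
        x_max = x1.max(dim=1, keepdim=True).values       # global maximum pooling over channels
        x2 = torch.cat([x_avg, x_max], dim=1)            # X2 = [X_avg; X_max], (B, 2, m, n)
        x3 = x2.mean(dim=1, keepdim=True)                # X3: channel-wise average of X2
        n_w = self.conv_col(x2)                          # N_w, (B, 1, 1, n)
        m_w = self.conv_row(x2)                          # M_w, (B, 1, m, 1)
        nm_w = torch.matmul(m_w, n_w)                    # NM_w as an outer product, (B, 1, m, n)
        x4 = self.conv_res(nm_w + x3)                    # X4, residual fusion, (B, 1, m, n)
        logits = self.conv_out(x4).flatten(1)            # one score per candidate node, (B, m)
        return F.softmax(logits, dim=-1)                 # action distribution a

# Example: probs = ActorNet()(torch.randn(1, 3, 8, 5)); a = probs.argmax(dim=-1)
```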
Furthermore, the deep contrast reinforcement learning model introduces contrast learning into the deep reinforcement learning model. By learning the relative relationships between samples, contrast learning improves the generalization capability of the deep reinforcement learning model; by learning the commonalities and characteristics of the data, the model learns better feature representations and generalizes better, so that different paths can be distinguished more effectively.
Further, the reward value r is calculated by adopting a reward function of a depth contrast reinforcement learning model, and the calculation formula is as follows:
R=w1*f(PDR)+w2*f(EE)+w3*f(Delay)
Wherein w1, w2 and w3 are weight coefficients of each performance evaluation index, each performance evaluation index comprises a packet delivery rate PDR, an end-to-end Delay and an energy efficiency EE, and w1+w2+w3=1 is used for indicating the importance degree of the current network to different indexes;
μ and the accompanying coefficient are the corresponding parameters; for the forward indexes, packet delivery rate PDR and energy efficiency EE, f(x) is used for calculation; for the reverse index, end-to-end Delay, f(y) is used. x̄ denotes the mean x value of each period since the node joined the network, and max[x] denotes the maximum single-period x value since the node joined the network; similarly, min[y] denotes the minimum single-period y value since the node joined the network.
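The combination R = w1*f(PDR) + w2*f(EE) + w3*f(Delay) can be sketched as follows. Since the exact nonlinear score functions f(x) and f(y) are not reproduced here, a logistic normalization against the node's historical mean and extremum is used purely as a placeholder, and the history dictionary keys are illustrative.

```python
import math

def f_forward(x, x_mean, x_max, mu=1.0):
    # Placeholder score for forward indexes (PDR, EE): higher values score better.
    # The patent's exact nonlinear formula uses the per-period mean and the maximum
    # single-period value since network access; the logistic shape here is an assumption.
    return 1.0 / (1.0 + math.exp(-mu * (x - x_mean) / max(x_max, 1e-9)))

def f_reverse(y, y_mean, y_min, mu=1.0):
    # Placeholder score for the reverse index (end-to-end delay): lower values score better.
    return 1.0 / (1.0 + math.exp(-mu * (y_mean - y) / max(y_min, 1e-9)))

def reward(pdr, ee, delay, hist, w1=0.4, w2=0.3, w3=0.3):
    # R = w1*f(PDR) + w2*f(EE) + w3*f(Delay), with w1 + w2 + w3 = 1
    assert abs(w1 + w2 + w3 - 1.0) < 1e-6
    return (w1 * f_forward(pdr, hist["pdr_mean"], hist["pdr_max"])
            + w2 * f_forward(ee, hist["ee_mean"], hist["ee_max"])
            + w3 * f_reverse(delay, hist["delay_mean"], hist["delay_min"]))
```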
Further, extracting part of experience information from the experience pool C and training a deep contrast reinforcement learning model, wherein the method comprises the following steps:
calculating the dominance value of each piece of experience information according to the dominance function;
judging whether the dominance value is greater than 0: if so, the experience information is stored into the preset queue in the experience library D according to its action information and used as a training sample for the deep contrast reinforcement learning model; otherwise, no further operation is performed on the experience information;
during the model training period, when the training round number reaches the contrast learning training threshold value, the server also extracts corresponding sample data from the experience library D each time to assist in training a routing decision model (Actor network); conversely, the server only extracts the mini-batch data from the experience pool C each time to train a routing decision model (Actor network); suspending the model training task when the model converges;
The nodes periodically acquire relevant parameter updating route decision model (Actor network) from the server and interact with the environment.
The experience pool C is the original experience replay pool of the deep contrast reinforcement learning model; the experience library D is newly added to store sample data for contrast learning training.
Further, the dominance function is:
A(s, a) = Q(s, a) − V(s)
where Q(s, a) is the state-action value function of the state-action pair (s, a), calculated by the Critic network; V(s) is the state value function of the local state vector s, computed from the policy as V(s) = Σ_a π(s, a)·Q(s, a), where π(s, a) is calculated by the Actor network.
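A minimal sketch of the server-side handling of one experience tuple, under the assumption that the Critic outputs Q(s, .) for all m candidate actions and the Actor outputs the corresponding probabilities:

```python
import torch

def route_experience(exp, actor, critic, pool_C, library_D):
    """Server-side handling of one experience tuple <s, a, r, s'>.
    Assumes the Critic outputs Q(s, .) over the m actions, the Actor outputs
    pi(s, .), and library_D is a list of m per-action queues (lists)."""
    s, a, r, s_next = exp
    pool_C.append(exp)                          # ordinary experience replay pool C

    with torch.no_grad():
        q = critic(s.unsqueeze(0)).squeeze(0)   # Q(s, .) for all m actions
        pi = actor(s.unsqueeze(0)).squeeze(0)   # pi(s, .) from the Actor network
        v = torch.dot(pi, q)                    # V(s) = sum_a pi(s, a) * Q(s, a)
        advantage = q[a] - v                    # A(s, a) = Q(s, a) - V(s)

    if advantage > 0:                           # better than average: keep for contrast learning
        library_D[a].append(exp)                # the a-th queue of experience library D
```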
Further, when the training round number reaches the contrast learning training threshold, the server further extracts corresponding sample data from the experience library D each time to assist in training the routing decision model (Actor network), including:
when the training round number reaches a contrast learning training threshold value, the server combines contrast learning and deep reinforcement learning, and corresponding sample data are extracted from the experience library D each time to assist in training the Actor network;
the optimization target of contrast learning is expressed in terms of the following quantities: e_i represents one sample in the dataset; z_1(e_{i1}) is the feature obtained by first applying the data enhancement method F_1 to e_i and then the feature extraction module z_1; z_2(e_{i2}) is the feature obtained by first applying the data enhancement method F_2 to e_i and then the feature extraction module z_2 (the positive sample); z_1(e_{j1}) and z_2(e_{j2}) are the augmented features of data e_j (negative samples); a similarity measurement method between two feature vectors is used; N is the total number of samples, i and j are sample indices, f_1 denotes the feature extraction method applied to F_1-enhanced data, and f_2 denotes the feature extraction method applied to F_2-enhanced data. The objective is to maximize the similarity between features of the same data under different enhancement methods, and to minimize the similarity between features of different data under the same enhancement method.
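An InfoNCE-style implementation of this objective is sketched below, assuming cosine similarity as the similarity measure and a temperature parameter; both choices go beyond what the text above fixes.

```python
import torch
import torch.nn.functional as F

def contrastive_objective(z1, z2, tau=0.1):
    """InfoNCE-style loss over two augmented views of a batch.
    z1[i] and z2[i] are the features of sample e_i after enhancements F1 and F2;
    cosine similarity and the temperature tau are assumptions."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                 # pairwise similarities between the two views
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal (same e_i)
    # maximize diagonal (same-data) similarity, minimize off-diagonal (different-data) similarity
    return F.cross_entropy(sim, targets)
```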
In a second aspect, the present invention further provides a wireless network system based on deep contrast reinforcement learning, where the wireless network system is based on the wireless route optimization method based on deep contrast reinforcement learning; the wireless network system comprises an Internet of things wireless multi-hop network and a server, wherein the network comprises a sink node and a plurality of wireless terminals/relay nodes;
a depth contrast reinforcement learning model is deployed on the server, and adopts an Actor-Critic network architecture; training a deep contrast reinforcement learning model by using a server with more abundant resources;
Each terminal node is provided with an Actor network as a distributed routing decision model, and periodically acquires the latest parameters of the depth contrast reinforcement learning model from a server to perform distributed resource scheduling decision, namely forwarding node selection;
the wireless system adopts a centralized model training and a distributed interactive architecture, adopts an edge server with relatively abundant resources to provide additional data storage and support of computing capacity for terminal nodes with limited resources, provides more comprehensive model training data for model training through asynchronous experience acquisition of the terminal, and improves the convergence rate of the model and the generalization capacity of the model.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. according to the wireless route optimization method and the network system based on the deep contrast reinforcement learning, the contrast learning is introduced into the deep reinforcement learning model, the generalization capability of the deep reinforcement learning model can be improved through the relative relation of learning samples, and the model can better learn characteristic representation and has better generalization through the commonality and the characteristics of learning data, so that different paths can be better distinguished.
2. The invention designs a wireless route optimization method and a network system based on deep contrast reinforcement learning, which designs a contrast learning training sample construction method based on an advantage function, and fully utilizes available data information by better utilizing unlabeled data in the environment, thereby reducing the cost of data acquisition. In addition, by generating more accurate positive and negative sample pairs, the model can learn the relation inside the data better, so that the model is adapted to different environment states better, and the robustness of the model is improved.
3. The wireless route optimization method and network system based on deep contrast reinforcement learning decouple model training from routing decision inference: the deep contrast reinforcement learning model is trained centrally on the server, while each terminal node only deploys the Actor network for local decisions. Therefore, the method has low computational complexity for the terminal node and is suitable for resource-limited nodes, i.e., the terminal node only needs to collect local network state information and interact with the environment based on the local intelligent algorithm.
4. Compared with the traditional routing algorithm which only uses a single routing metric, the wireless routing optimization method and the network system based on the deep contrast reinforcement learning comprehensively consider the information such as the energy, the hop count, the buffer queue number, the expected transmission times, the potential sub-node number and the like of each candidate forwarding node; and constructing a three-dimensional graph vector containing the history information of the candidate forwarding nodes, extracting the characteristics of the graph vector by adopting a depth contrast reinforcement learning model, and realizing the depth fusion of the multipath metric parameters and the artificial intelligence algorithm.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention. In the drawings:
fig. 1 is a superframe structure according to an embodiment of the present invention;
FIG. 2 is a flow chart of the operation of a node in an embodiment of the invention;
FIG. 3 is a flowchart illustrating the operation of a server according to an embodiment of the present invention;
FIG. 4 is a diagram of a distributed interaction and centralized training system model architecture in accordance with an embodiment of the present invention;
fig. 5 is a core module of an Actor network in an embodiment of the present invention.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
The existing route optimization methods have the following problems: (1) conventional routing algorithms do not consider routing metrics comprehensively and rely on a single type of metric; the node and link characteristics lack hybrid routing metric information such as the residual energy, hop count, expected transmission times, buffer queue number and potential sub-node count of the candidate forwarding nodes, so energy efficiency, data transmission reliability and network stability are not effectively improved; (2) hybrid routing metric functions that use only a simple linear combination determine each metric weight coefficient from subjective expert experience; these coefficients are generally not adaptively adjusted during network operation according to the objective, actual requirements of the network, which affects network performance to some extent; (3) routing algorithms based on deep reinforcement learning can adaptively optimize the relevant weight coefficients according to the network operating environment, but deploying the corresponding deep reinforcement learning model on a terminal consumes a great deal of the node's computing, storage and energy resources during model training, which brings new challenges to making resource-limited nodes intelligent.
Therefore, aiming at the problems, the invention designs a wireless route optimization method and a network system based on deep contrast reinforcement learning. The network system includes: the wireless multi-hop network of the Internet of things and a server, wherein the network comprises an aggregation node and a plurality of wireless terminals/relay nodes. A more resource-rich server is used for training the deep contrast reinforcement learning model, each end node deploys a routing decision model (Actor network) of the deep contrast reinforcement learning model, and periodically acquires the latest parameters of the model from the server for distributed resource scheduling decisions (forwarding node selection). Specifically, in the initial stage, a node joins a network and acquires the latest routing decision model parameters; in the control period, the node selects the optimal forwarding node based on a routing decision model (Actor network) and a local observation state; in a data transmission period, the node conducts data transmission based on the selected optimal forwarding node and counts corresponding network performance parameters to guide the optimization direction of the deep contrast reinforcement learning model; the invention adopts a centralized training distributed interactive learning model, namely, nodes and environments perform distributed interaction and experience information is uploaded to an experience pool of a server; the server periodically draws small volumes of experience from the experience pool to train the model until the model converges. The experience collection mode breaks the correlation among experiences and simultaneously greatly enriches model training data, thereby being beneficial to improving the convergence speed of the model and realizing the intellectualization of the resource-limited nodes. In addition, the invention introduces contrast learning into the deep reinforcement learning model, effectively improves the sample efficiency, generalization capability and decision capability of the deep reinforcement learning model in complex environments, and can better cope with the problem of non-stationarity in a multi-agent system, especially in the environment of the Internet of things, wherein network conditions, equipment states and the like can be changed frequently. The model can be better adapted to the dynamically changing environment through the comparative learning of the current state and the historical state.
The key design points of the invention are as follows:
(1) The wireless route optimization mode based on the deep contrast reinforcement learning model introduces contrast learning into deep reinforcement learning by means of contrast learning and deep reinforcement learning technology, and provides a novel deep contrast reinforcement learning model suitable for wireless resource scheduling. The model provides more information for training of the deep reinforcement learning model by introducing contrast learning, and improves the efficiency, generalization capability and adaptability of the model.
(2) A contrast learning positive and negative sample construction method based on a merit function constructs accurate positive and negative samples for contrast learning training by means of deep reinforcement learning decision performance. Specifically, when the dominance function is greater than 0, indicating that the current action is better than the average value, the experience information can be used as a positive sample of the current action; conversely, the sample acts as a negative sample of the action. By the sample construction method, the requirement on the label sample is reduced, and the label sample can be learned from the data interacted by the intelligent body and the environment, so that the feedback data of the environment needing professional labeling is reduced, and the cost of data acquisition is reduced.
(3) The route optimization method based on deep contrast learning adopts a centralized model training and distributed interaction architecture, adopts an edge server with relatively rich resources to provide additional data storage and support of computing capacity for terminal nodes with limited resources, provides more comprehensive model training data for model training through asynchronous experience acquisition of the terminal, and improves the generalization capacity of the model while improving the convergence rate of the model;
(4) The invention comprehensively considers the route layer and MAC layer related characteristics of the candidate forwarding nodes as route measurement information, and constructs a three-dimensional graph vector containing historical observation information as the input of a route decision model; the invention provides a brand-new route optimization method for deeply fusing multiparameter route measurement and an artificial intelligent algorithm, which avoids subjectivity of traditional route measurement function weight setting, adopts a contrast learning cooperative deep reinforcement learning model to extract the characteristics of the three-dimensional graph vector, and realizes the intelligent route optimization algorithm for deeply fusing multiparameter route measurement parameters.
Fig. 2 and 3 present a workflow diagram for a node and a server, respectively. The model training stage comprises the following specific steps:
step 1, randomly initializing a deep contrast reinforcement learning model deployed on a server, wherein the deep contrast reinforcement learning model adopts an Actor-Critic network architecture; the deep contrast reinforcement learning model is characterized in that contrast learning is introduced into the deep reinforcement learning model, the generalization capability of the deep reinforcement learning model is improved through the contrast learning by virtue of the relative relation of learning samples, and the deep reinforcement learning model can better learn characteristic representation and has better generalization by virtue of the commonality and the characteristics of learning data, so that different paths can be better distinguished.
Step 2, in the network, only an Actor network needs to be deployed on each node as a distributed routing decision model, and the architecture of the routing decision model is the same as the structure of the Actor network on a server;
step 3, after the node is accessed to the network, acquiring parameters of an Actor network for routing decision from a server to update a routing decision model of the node;
step 4, the node collects candidate forwarding node information, including the node's energy efficiency information, hop count, expected transmission times, buffer queue number and potential sub-node count;
step 5, in the control period, the node selects m candidate forwarding nodes with the largest residual energy from the candidate forwarding node set, constructs the candidate forwarding node information into a three-dimensional graph vector of t×m×n (t is the considered discrete time number, m is the candidate forwarding node number, and n is the selected information dimension) as the state information s of the proposed depth contrast reinforcement learning model, inputs the three-dimensional graph vector into a routing decision model (Actor network), and the model outputs a corresponding action a to indicate the optimal forwarding node information which the node should select;
and 6, in the data transmission period, the node transmits the data in the buffer area through the selected forwarding node based on the corresponding channel access mechanism of the MAC layer. At the end of the data transmission period, the node counts the network performance (such as packet delivery rate, end-to-end delay, energy efficiency, etc.);
Step 7, mapping the performance indexes into corresponding rewards r of the nodes under the state and action by adopting a nonlinear function; the node transmits the experience information collected during this period, i.e. < s, a, r, s' > to the server, which stores the experience information in the experience pool C.
step 8, the server calculates the dominance value of the received experience information according to the dominance function. If the dominance value is greater than 0, the experience information is stored into the a-th queue of the experience library D; otherwise, no further operation is performed on the experience information. Note that a ∈ [0, m−1].
step 9, during the model training period, when the training round number reaches the contrast learning training threshold, the server also extracts corresponding sample data from the experience library D each time to assist in training the Actor network (contrast learning); otherwise, the server only extracts mini-batch data from the experience pool C each time to train the model (a sketch of this schedule is given after these steps). The model training task is suspended when the model converges;
step 10, the node periodically acquires relevant parameters from a server to update a routing decision model and interacts with the environment;
repeating the steps 1 to 10 until the energy of all nodes in the network is exhausted.
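The training schedule of step 9 can be sketched as follows; the update callbacks and the sampling of the per-action queues in the experience library D are illustrative placeholders.

```python
import random

def training_step(round_idx, pool_C, library_D, drl_update, contrastive_update,
                  batch_size=64, cl_threshold=1000):
    """Sketch of the schedule in step 9. drl_update and contrastive_update are
    caller-supplied callbacks (illustrative); library_D is a list of per-action queues."""
    batch = random.sample(pool_C, min(batch_size, len(pool_C)))   # mini-batch from pool C
    drl_update(batch)                                             # ordinary Actor-Critic update

    if round_idx >= cl_threshold:                                 # contrast learning joins later
        cl_batch = [random.choice(q) for q in library_D if len(q) > 0]
        contrastive_update(cl_batch)                              # auxiliary training of the Actor
```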
In the step 1, the depth contrast reinforcement learning model architecture deployed on the server is shown in fig. 4. The DRL model adopts an Actor-Critic network architecture.
In the steps 2 and 3, each wireless node only needs to deploy the Actor model shown in fig. 4 for local decision of the node, wherein parameters of the Actor model are trained and optimized by the server; therefore, the local node only needs to download the part of the parameters from the server. The distributed interaction and centralized training model is shown in fig. 4, and compared with the deployment and training of the whole network model, the distributed interaction and centralized training is adopted to effectively reduce the storage and calculation pressure of the terminal nodes;
in the step 4, the remaining energy information is used for evaluating the energy condition of the current forwarding node, so as to avoid the node selecting the node with lower remaining energy as the forwarding node to transmit data, thereby improving the network life cycle; the hop count is used for evaluating the distance from the node to the sink node, and based on the measurement, the node should avoid selecting the node with the overlarge hop count as a forwarding node as much as possible, so as to improve the success rate of data transmission and reduce the data transmission delay; the expected number of transmissions is used to evaluate the communication quality of the given link, and selecting a data link with a lower expected number of transmissions (Expected Transmission Times, ETX) may promote the reliability of the data transmission; the buffer queue number is used for evaluating the load degree of the candidate forwarding nodes and avoiding selecting the nodes causing serious load imbalance of the network to forward data; the potential number of the child nodes is used for evaluating network dynamics and potential congestion conditions of the candidate forwarding nodes, so that potential influences of node distributed decisions on the current forwarding nodes can be estimated;
In step 5, as shown in fig. 5 (left), the local state vector s includes 3 local observation vectors o. It is worth mentioning that each local observation o is a two-dimensional matrix of m×5, mainly comprising two cases: namely, when the number of the candidate forwarding nodes of the node is more than or equal to m at a certain specific moment, the first m nodes with the largest residual energy are taken as final candidate forwarding nodes, and the nodes construct the mixed routing metric information of the m nodes into an m multiplied by 5 two-dimensional graph vector which is taken as the local observation o of the node at the moment; if the number of candidate forwarding nodes for a node is k and k < m, then the remaining m-k rows in the two-dimensional graph vector will be filled with 0. In the invention, t historical observation vectors o are taken into consideration to form a final local state vector s of the node, and the final local state vector s is used as the state input of a routing decision model.
Specifically, the details of the operation of the node to input the local state vector s into the Actor network are as follows:
1. a convolution layer with kernel size c×1×1 is used to perform a convolution operation on the local state vector s, obtaining a first vector X1:
X1 = Conv_{c×1×1}(s)
2. global average pooling and global maximum pooling are applied to the first vector X1 to obtain a global average pooling vector X_avg and a global maximum pooling vector X_max, which are combined to obtain a second vector X2 = [X_avg; X_max]; here H is the number of candidate forwarding nodes considered (H equals m) and W is the number of routing metrics. Global average pooling is then performed again on the channels of the second vector X2 to extract more detailed features, generating a third vector X3;
3. convolution layers with kernel sizes (1×H×1) and (1×1×W) are used to perform convolution operations on the second vector X2 simultaneously in two different dimensions, generating N_w and M_w respectively:
N_w = Conv_{1×H×1}(X2)
M_w = Conv_{1×1×W}(X2)
4. matrix multiplication is used to calculate:
NM_w = N_w × M_w
5. a residual block is used to preserve the integrity of the information, and a convolution layer with kernel size (3×1×1) performs a convolution operation to obtain a fourth vector X4:
X4 = Conv_{3×1×1}(NM_w + X3)
6. finally, a convolution layer with kernel size (1×1×W) performs a convolution operation to generate the optimal action:
a = Conv_{1×1×W}(X4)
The optimal action a is then converted in dimension to a 1×H vector, and according to the distribution of the optimal action, the forwarding node corresponding to the action with the maximum probability value is selected as the node's optimal candidate forwarding node.
It should be noted that h=m, W is the number of considered route metrics, i.e. 5. As more routing metrics are considered, W may be dynamically adjusted according to network conditions.
In the step 6, in the data transmission period, the MAC layer of the node may use some specific medium access control, and the node may send data on the working channel where the selected optimal forwarding node is located; after the data transmission period, counting the performances of the node, such as packet delivery rate, end-to-end time delay, energy efficiency and the like in the period;
in the step 7, a nonlinear score method is adopted to carry out the uniformity quantification on the performance index. The formula of the nonlinear scoring method is as follows:
wherein mu andis the corresponding coefficient. For forward indicators, such as PDR and EE, f (x) is used for calculation; for the inverse index Delay, f (y) is used for calculation. />Mean x value, max [ x ] of each period since node is network-connected]The x value representing the maximum single period since the node was network-connected; similarly, min [ y ]]The minimum y value for a single cycle since the node was network-connected.
Thus, the reward function of the depth contrast reinforcement learning model is:
R=w1*f(PDR)+w2*f(EE)+w3*f(Delay)
wherein w1, w2 and w3 are weight coefficients of each performance evaluation index, and w1+w2+w3=1 is used to indicate the importance of the current network to different indexes.
The node then uploads the experience information e = < s, a, r, s' >, generated by its interaction with the environment during this period, to the server. The server stores the experience information in the experience pool C for training of the deep contrast reinforcement learning model.
In the above step 8, the server calculates the dominance value of the received experience information according to the dominance function:
A(s, a) = Q(s, a) − V(s)
where Q(s, a) is the state-action value function of the state-action pair (s, a), calculated by the Critic network; V(s) is the state value function of the state s, calculated as:
V(s) = Σ_a π(s, a)·Q(s, a)
where π(s, a) is calculated by the Actor network. If the dominance value of the current experience satisfies A(s, a) > 0, the expected return of the performed action a is above the average level, which is a positive signal: selecting action a in the current state is advantageous because it provides a better return than the average. The server therefore adds this experience information to the a-th queue in the experience library D, where it is used as data for the contrast learning training.
In the step 9, when the training round number reaches the contrast learning training threshold, the server optimizes the training of the Actor network by combining the contrast learning and the deep reinforcement learning.
The contrast learning can optimize the feature space so that the intra-class distance of the data in the feature space is reduced and the inter-class distance is increased. Let e_i represent one sample in the dataset; z_1(e_{i1}) is the feature obtained by first applying the data enhancement method F_1 to e_i and then the feature extraction module z_1, and z_2(e_{i2}) is the feature obtained by first applying the data enhancement method F_2 to e_i and then the feature extraction module z_2 (the positive sample). z_1(e_{j1}) and z_2(e_{j2}) are the augmented features of data e_j (negative samples). With a similarity measurement method defined between two feature vectors, the optimization objective of contrast learning is as defined above: it maximizes the similarity between two augmented views of the same data, while minimizing the similarity between features augmented from different data.
Positive samples and negative samples of each action a_i are defined accordingly, and the corresponding contrast learning loss function is computed over them, where τ > 0 is an adjustable temperature coefficient, the similarity function between feature vectors is the same as above, and the loss is normalized by the number of positive samples sampled for state s_i and by the total number of samples drawn for contrast learning.
The loss function of the Actor network is defined by combining the loss function of the Actor network in the original deep reinforcement learning with the contrast learning loss, and the gradient of this combined loss with respect to the Actor network parameters is derived accordingly. Subsequently, gradient ascent is used to update the parameters of the Actor network, where α ∈ [0, 1] is the learning rate of the Actor network.
It should be noted that, in the initial stage of training, the reinforcement learning decision module plays a dominant role (i.e. the initial stage δ=0) and aims to learn a more accurate Q value; as the number of training rounds increases, the feature classifier and the decision module are progressively co-trained, i.e., δ=1 when the number of training rounds is greater than the contrast learning training threshold.
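A minimal sketch of the resulting Actor objective, assuming the contrast learning term is simply added to the original deep reinforcement learning loss and gated by δ as described above (the additive form itself is an assumption):

```python
def actor_loss(rl_loss, cl_loss, round_idx, cl_threshold=1000):
    """Combined Actor objective suggested by the text: the reinforcement learning
    loss dominates early training (delta = 0), and the contrast learning term is
    switched on (delta = 1) once the training round number exceeds the contrast
    learning threshold. The additive combination itself is an assumption."""
    delta = 1.0 if round_idx > cl_threshold else 0.0
    return rl_loss + delta * cl_loss
```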
In addition, the Critic network is used to approximate the state-action value function Q(s, a), which evaluates and further directs the updating of the Actor network parameters. Q_tar(s, a) is an estimate of Q(s, a):
Q_tar(s, a) = r + Q(s′, a′)
where a′ is the action selected for the next state s′. Since Q_tar(s, a) is calculated from the true reward value r of the state-action pair (s, a), Q_tar(s, a) is considered more accurate than Q(s, a).
In order to make Q(s, a) closer to Q_tar(s, a), the mean square error is introduced as the loss function of the Critic network:
L(ω) = (Q_tar(s, a) − Q(s, a))²
Accordingly, the gradient of L(ω) with respect to ω can be derived, and to minimize L(ω) the parameter ω of the Critic network is updated using the gradient descent method:
ω ← ω − β·∇_ω L(ω)
where β ∈ (0, 1) is the learning rate of the Critic network.
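A sketch of the Critic update under the formulas above; choosing a′ as the Actor's most probable action in s′ is an assumption, and no discount factor is applied because none appears in the text.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, actor, batch, optimizer):
    """Critic update following the formulas above: Q_tar(s, a) = r + Q(s', a'),
    then a gradient-descent step on the mean squared error between Q(s, a) and
    Q_tar(s, a)."""
    s, a, r, s_next = batch                                   # a: LongTensor of action indices
    with torch.no_grad():
        a_next = actor(s_next).argmax(dim=-1, keepdim=True)   # a' chosen by the Actor network
        q_next = critic(s_next).gather(1, a_next).squeeze(1)  # Q(s', a')
        q_tar = r + q_next                                    # target value Q_tar(s, a)
    q = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a)
    loss = F.mse_loss(q, q_tar)                               # L(omega)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # omega <- omega - beta * grad
    return loss.item()
```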
Suspending the model training task when the model converges;
in step 10, the node periodically obtains the parameters of the latest routing decision model from the server.
The experience pool C is the original experience replay pool of the deep contrast reinforcement learning model; the experience library D is newly added to store sample data for contrast learning training.
Example 1
The invention relates to a wireless route optimization method based on deep contrast reinforcement learning, which is applied to a wireless multi-hop network and a server of the Internet of things, wherein a deep contrast reinforcement learning model is deployed on the server, the network comprises a sink node and a plurality of wireless terminal nodes, and an Actor network is deployed on each terminal node as a distributed route decision model; the method comprises the following steps:
Dividing the entire time into a plurality of consecutive superframe periods, as shown in fig. 1, each superframe period including a control period and a data transmission period;
based on the superframe period, the model parameters of the Actor network are obtained from the server to update the local latest routing decision model when the node is newly accessed to the network; in a control period, the node generates a current optimal action a based on a local latest routing decision model and a local state vector s and maps the current optimal action a to an optimal forwarding node; in the data transmission period, the node transmits data to the optimal forwarding node; after each data transmission period is finished, the node counts related network performance indexes and calculates a corresponding reward value r; before the depth contrast reinforcement learning model converges, the node uploads experience information < s, a, r, s' > acquired in each superframe period to a server; s' is a new state of the environment in which the agent performs the action a in the state s;
periodically collecting experience information from the nodes by the server, storing the experience information into an experience pool C, calculating a dominance value of the experience information according to a dominance function by the server, and storing the experience information with the dominance value greater than 0 in an experience library D; when the training round number is larger than the contrast learning training threshold value, the server acquires an experience data collaborative training model from the experience pool C and the experience library D; otherwise, the server only acquires experience data from the experience pool C to train the model; in addition, the server also needs to send the updated parameters of the Actor network to each node regularly. The above process is repeated until all nodes in the network are exhausted. Fig. 2 and 3 present a workflow diagram for a node and a server, respectively.
The specific implementation process is as follows:
when the network is started, the server firstly randomly initializes the parameters of the Actor and the Critic network model;
after the new node is accessed to the network, the local routing decision model (i.e. the Actor network) parameters are obtained from the server. As shown in fig. 4, the Actor network is a part of a deep contrast reinforcement learning model deployed on a server and is mainly used for distributed agents (nodes) to acquire corresponding optimal actions based on local environment states;
the node needs to maintain a candidate forwarding node set, and the set stores and updates the mixed route measurement information of the candidate forwarding nodes of the node in real time, wherein the mixed route measurement information comprises information such as hop count, residual energy, buffer area queue number, expected transmission times, potential sub-node number and the like;
in the control period, the node generates the current optimal action a and maps the current optimal action a into an optimal forwarding node based on the latest routing decision model and the local state vector s, and the specific process is as follows:
in each control period, when a node joins the network it obtains the current latest routing decision model from the server and monitors the information of its neighbor nodes. The node collects candidate forwarding node information from the monitored neighbor information and builds a candidate forwarding node set, from which it selects the m candidate forwarding nodes with the largest residual energy. It should be noted that the length of the control period can be set adaptively according to the network, and m can be adjusted adaptively according to the network density. In addition, if the node has fewer than m candidate forwarding nodes, all candidates are selected and the remaining routing metric entries are padded with zeros. The node then abstracts the mixed routing metric information of the selected candidate forwarding nodes over the most recent t superframe periods into a three-dimensional graph vector s. As shown in fig. 5, the local observation of each superframe period is a two-dimensional graph vector in which each row stores the mixed routing metric information of one candidate forwarding node, including hop count, residual energy, buffer queue length, expected transmission count and number of potential sub-nodes;
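As an illustration of this state construction, the following Python sketch assembles a t×m×5 graph vector from a candidate forwarding node table, zero-padding when fewer than m candidates exist. The data structure and function names (CandidateRecord, build_state) are hypothetical and not taken from the patent text.

```python
# Illustrative sketch: how a node might assemble the t x m x 5 graph-vector state s
# from its candidate forwarding node records. Names are hypothetical.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class CandidateRecord:
    hop_count: float
    residual_energy: float
    buffer_queue_len: float
    expected_tx_count: float      # expected transmission count (ETX)
    potential_children: float     # number of potential sub-nodes

def build_state(history: List[List[CandidateRecord]], m: int, t: int) -> np.ndarray:
    """history[k] holds the candidate records observed in the k-th most recent
    superframe period; returns a (t, m, 5) tensor, zero-padded as needed."""
    W = 5                                       # number of routing metrics
    s = np.zeros((t, m, W), dtype=np.float32)
    for k in range(min(t, len(history))):
        # keep the m candidates with the largest residual energy
        ranked = sorted(history[k], key=lambda r: r.residual_energy, reverse=True)[:m]
        for row, rec in enumerate(ranked):
            s[k, row] = [rec.hop_count, rec.residual_energy, rec.buffer_queue_len,
                         rec.expected_tx_count, rec.potential_children]
    return s
```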
The node takes the three-dimensional graph vector s as the input of its local Actor network. Accordingly, the Actor network outputs an optimal action a ∈ A, where A = {1, 2, ..., m} is the action space of the node. The action a indicates the optimal forwarding node of the node and can be mapped with the function map(s, a). Specifically, when a = 1, the forwarding node corresponding to the 1st row in s[0] will be the forwarding node for the node's next data transmission cycle.
In addition, in view of the characteristics of the graph vector, the embodiment of the invention designs an Actor network architecture based on multi-scale convolution to extract the attribute features of each candidate forwarding node; the specific process is as follows:
the convolution process in the Actor network module, i.e., inputting the local state vector s (the three-dimensional graph vector) into the latest routing decision model (the Actor network), generating the optimal action a and mapping it to the corresponding optimal forwarding node, comprises the following steps:
1. A convolution layer with a convolution kernel size of c×1×1 is applied to the local state vector s to obtain a first vector X1 = Conv_{c×1×1}(s);
2. The first vector X1 is processed by global average pooling and global maximum pooling to obtain a global average pooling vector X_avg and a global maximum pooling vector X_max, which are combined to obtain a second vector X2 = [X_avg; X_max], where H is the number of candidate forwarding nodes considered and W is the number of routing metrics. Global average pooling is then performed again over the channels of the second vector X2 to extract more detailed features, generating a third vector X3;
3. Convolution layers with kernel sizes of 1×H×1 and 1×1×W are applied to the second vector X2 simultaneously in two different dimensions, generating a fifth vector N_w and a sixth vector M_w respectively: N_w = Conv_{1×H×1}(X2), M_w = Conv_{1×1×W}(X2);
4. Matrix multiplication is applied to the fifth vector N_w and the sixth vector M_w to obtain the product vector NM_w = N_w × M_w;
5. A residual block is adopted to preserve the integrity of the information: a convolution layer with a convolution kernel size of 3×1×1 is applied to the sum of the product vector and the third vector, yielding a fourth vector X4 = Conv_{3×1×1}(NM_w + X3);
6. A convolution layer with a convolution kernel size of 1×1×W, followed by a softmax operation, is applied to the fourth vector to generate the optimal action a = softmax(Conv_{1×1×W}(X4)); the optimal action a is then dimension-converted into a probability distribution over the candidate actions, and the forwarding node corresponding to the action with the maximum probability value is selected as the node's optimal candidate forwarding node according to this distribution.
It should be noted that H = m and W is the number of routing metrics considered, i.e., 5 in this embodiment; W may be dynamically adjusted according to network conditions as more routing metrics are considered.
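To make the six steps above concrete, the following is a minimal PyTorch sketch of the multi-scale convolution pipeline. It is an illustrative reconstruction rather than the patent's implementation: the channel counts, padding, pooling axes and the class name MultiScaleActor are assumptions (the text fixes only the kernel shapes c×1×1, 1×H×1, 1×1×W, 3×1×1 and 1×1×W), and the product is formed as M_w·N_w so that the result has shape H×W, which is assumed to be the intended reading of N_w × M_w.

```python
# Minimal PyTorch sketch of the multi-scale convolution Actor head described above.
import torch
import torch.nn as nn

class MultiScaleActor(nn.Module):
    def __init__(self, t: int, H: int, W: int, c: int = 2):
        super().__init__()
        self.conv1 = nn.Conv3d(1, 1, kernel_size=(c, 1, 1))               # step 1: c x 1 x 1
        self.conv_h = nn.Conv2d(2, 1, kernel_size=(H, 1))                 # step 3: 1 x H x 1
        self.conv_w = nn.Conv2d(2, 1, kernel_size=(1, W))                 # step 3: 1 x 1 x W
        self.conv3 = nn.Conv2d(1, 1, kernel_size=(3, 1), padding=(1, 0))  # step 5: 3 x 1 x 1
        self.head = nn.Conv2d(1, 1, kernel_size=(1, W))                   # step 6: 1 x 1 x W

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, t, H, W) graph vector covering the last t superframe periods
        x1 = self.conv1(s.unsqueeze(1))                # (B, 1, t', H, W)
        x_avg = x1.mean(dim=2)                         # global average pooling over time
        x_max = x1.amax(dim=2)                         # global maximum pooling over time
        x2 = torch.cat([x_avg, x_max], dim=1)          # (B, 2, H, W) = [X_avg; X_max]
        x3 = x2.mean(dim=1, keepdim=True)              # channel-wise average pooling -> X3
        n_w = self.conv_h(x2)                          # (B, 1, 1, W)
        m_w = self.conv_w(x2)                          # (B, 1, H, 1)
        nm_w = torch.matmul(m_w, n_w)                  # outer product -> (B, 1, H, W)
        x4 = self.conv3(nm_w + x3)                     # residual-style fusion
        logits = self.head(x4).squeeze(-1).squeeze(1)  # (B, H): one score per candidate
        return torch.softmax(logits, dim=-1)           # probability over the m candidates

# Example with m = H = 5 candidates, W = 5 metrics, t = 4 periods (values are arbitrary):
probs = MultiScaleActor(t=4, H=5, W=5)(torch.rand(1, 4, 5, 5))
a = int(probs.argmax(dim=-1))   # index of the chosen forwarding node
```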
During the data transmission period, the MAC layer of the node may employ a corresponding medium access mechanism (e.g., CSMA/CA, TDMA, or another medium access control protocol) to transmit data on the working channel of the selected forwarding node. At the end of the data period, the node records the corresponding network performance indexes of the period (such as packet delivery rate, end-to-end delay and energy efficiency).
The node normalizes these performance indexes with a nonlinear score method, where the packet delivery rate and the energy efficiency are forward indexes and the average end-to-end delay is a reverse index; the following two formulas are used to calculate them, respectively.
where μ and the corresponding coefficient are tunable coefficients of the normalization functions; the forward indexes, packet delivery rate PDR and energy efficiency EE, are calculated with f(x), and the reverse index, end-to-end Delay, is calculated with f(y); the normalization uses the mean x value over the periods since the node joined the network, max[x], the maximum single-period x value since the node joined the network, and, similarly, min[y], the minimum single-period y value since the node joined the network.
The node then performs a weighted accumulation of the normalized indexes to calculate the final reward value R using the following formula:
R=w1*f(PDR)+w2*f(EE)+w3*f(Delay)
where w1, w2 and w3 are the weight coefficients of the performance evaluation indexes (packet delivery rate PDR, energy efficiency EE and end-to-end Delay), with w1 + w2 + w3 = 1; they indicate the relative importance the current network attaches to the different indexes;
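A minimal sketch of this weighted accumulation is given below. The nonlinear score functions f(x) and f(y) are given by formulas not reproduced in this text, so they are passed in as callables, and the example weights are arbitrary placeholders.

```python
# Sketch of the reward aggregation R = w1*f(PDR) + w2*f(EE) + w3*f(Delay).
from typing import Callable

def reward(pdr: float, ee: float, delay: float,
           f_forward: Callable[[float], float],   # normalization for forward indexes
           f_reverse: Callable[[float], float],   # normalization for the reverse index
           w1: float = 0.4, w2: float = 0.3, w3: float = 0.3) -> float:
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weights must sum to 1"
    return w1 * f_forward(pdr) + w2 * f_forward(ee) + w3 * f_reverse(delay)
```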
the node then uploads the experience information e = <s, a, r, s'> generated by its interaction with the environment during this period to the server. The server stores the experience information in experience pool C for training the deep contrast reinforcement learning model.
In addition, the server calculates the dominance value of the received experience information according to the dominance function:
where Q(s, a) is the state value function of the state-action pair (s, a), calculated by the Critic network deployed on the centralized server; V(s) is the state value function of the state s, and its calculation formula is as follows:
where π(s, a) is calculated by the Actor network. If the dominance value of the current experience is greater than 0, the expected return of the performed action a is above the average level, which is a positive signal: selecting action a in the current state is advantageous because it provides a better return than the average. The server therefore adds this experience information to the a-th queue in the experience library D, where it is used as data for contrast learning training, with a ∈ [1, m].
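The following sketch illustrates this server-side filtering. Because the dominance function and V(s) appear as formulas not reproduced here, the standard forms A(s, a) = Q(s, a) − V(s) and V(s) = Σ_a π(a|s)·Q(s, a) are assumed, and the actor/critic interfaces (action_probs, q_value) are hypothetical.

```python
# Sketch of server-side experience filtering into pool C and per-action library D.
from collections import defaultdict, deque

experience_pool_C = deque(maxlen=100_000)                          # all <s, a, r, s'>
experience_library_D = defaultdict(lambda: deque(maxlen=10_000))   # one queue per action

def handle_experience(exp, actor, critic, m: int) -> None:
    s, a, r, s_next = exp
    experience_pool_C.append(exp)
    q_all = [critic.q_value(s, act) for act in range(m)]    # Q(s, a) from the Critic
    pi = actor.action_probs(s)                              # pi(s, .) from the Actor
    v = sum(p * q for p, q in zip(pi, q_all))               # assumed V(s)
    advantage = q_all[a] - v                                # assumed dominance value
    if advantage > 0:                                       # positive signal
        experience_library_D[a].append(exp)                 # a-th queue of library D
```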
When the training round number reaches the contrast learning training threshold, the server optimizes training of the Actor network by combining contrast learning and deep reinforcement learning. The contrast learning can optimize the feature space such that the intra-class distance of the data in the feature space is reduced and the inter-class distance is increased.
Let e_i denote one sample in the dataset. z_1(e_{i1}) is the feature obtained by first applying data enhancement method F_1 to e_i and then passing the result through feature extraction module z_1; z_2(e_{i2}) is the feature obtained by first applying data enhancement method F_2 to e_i and then passing the result through feature extraction module z_2 (the positive sample). z_1(e_{j1}) and z_2(e_{j2}) are the augmented features of a different sample e_j (negative samples). The similarity between two feature vectors is measured by a similarity function. The optimization objective of contrast learning is:
Contrast learning therefore maximizes the similarity between the two augmented views of the same data while minimizing the similarity between features obtained from different data. N is the total number of samples, i and j are sample indices, f_1 is the feature extraction method applied to F_1-enhanced samples, and f_2 is the feature extraction method applied to F_2-enhanced samples. The objective of the formula is to maximize the similarity between features of the same data under different enhancement methods and to minimize the similarity between features of different data under the same enhancement method.
For each action a_i, the corresponding sets of positive samples and negative samples are defined, and the corresponding contrast learning loss function is defined as:
In the formula, the number of positive samples and the number of negative samples used when computing the contrast loss can be dynamically adjusted according to the actual situation; the formula also involves the number of positive samples sampled for state s_i, the total number of samples drawn for contrast learning, an adjustable temperature coefficient τ > 0, and a similarity function.
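The loss itself is shown only as a formula image in the original; the sketch below assumes an InfoNCE-style form with temperature τ, positives in the numerator and positives plus negatives in the denominator, which matches the surrounding description but is not guaranteed to be the patent's exact expression.

```python
# InfoNCE-style contrast loss sketch with temperature tau (assumed form).
import torch
import torch.nn.functional as F

def contrast_loss(anchor: torch.Tensor, positives: torch.Tensor,
                  negatives: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # anchor: (d,), positives: (P, d), negatives: (N, d) feature vectors
    pos_sim = F.cosine_similarity(anchor.unsqueeze(0), positives) / tau    # (P,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives) / tau    # (N,)
    denom = torch.cat([pos_sim, neg_sim]).exp().sum()
    # average the per-positive terms, since the number of positives can vary
    return -(pos_sim - denom.log()).mean()
```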
In the embodiment of the present invention, the loss function of the Actor network is defined as:
In the formula, the first term is the loss function of the Actor network in the original deep reinforcement learning model. Accordingly, the gradient of the Actor network loss is:
subsequently, the gradient ascent method is used to update the parameters of the Actor network:
it should be noted that, in the initial stage of training, the reinforcement learning decision module (i.e. the routing decision module) plays a dominant role (i.e. the initial stage δ=0) and aims to learn a more accurate Q value; as the number of training rounds increases, the feature classifier and the decision module are progressively co-trained, i.e., δ=1 when the number of training rounds is greater than the contrast learning training threshold. In the embodiment of the invention, the contrast learning training threshold value can be dynamically adjusted according to the complexity of the environment.
The Critic network is used to approximate the state value function Q(s, a), which evaluates and further guides the updating of the Actor network parameters. Q_tar(s, a) is an estimate of Q(s, a) and can be calculated as:
Q_tar(s, a) = r + Q(s', a')
In the formula, since Q_tar(s, a) is calculated from the true reward value r of the state-action pair (s, a), Q_tar(s, a) is considered more accurate than Q(s, a).
In order to make Q(s, a) closer to Q_tar(s, a), the embodiment of the invention introduces a mean square error function as the loss function of the Critic network:
Accordingly, the gradient of L(ω) can be expressed as:
To minimize L(ω), the parameter ω of the Critic network is updated using the gradient descent method:
where β ∈ (0, 1) is the learning rate of the Critic network.
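A minimal sketch of one Critic update step under the formulas above follows. The target uses Q_tar(s, a) = r + Q(s', a') as written (no discount factor appears in the text, so the γ parameter below defaults to 1.0 and is an assumption), followed by a mean-squared-error gradient-descent step with learning rate β; the critic and optimizer interfaces are hypothetical.

```python
# Sketch of one Critic update step: bootstrapped target, MSE loss, gradient descent.
import torch
import torch.nn.functional as F

def critic_update(critic, optimizer, batch, gamma: float = 1.0) -> float:
    # batch: tensors s (B, ...), a (B,), r (B,), s_next (B, ...), a_next (B,)
    s, a, r, s_next, a_next = batch
    with torch.no_grad():
        q_tar = r + gamma * critic(s_next, a_next)     # Q_tar(s, a) = r + Q(s', a')
    loss = F.mse_loss(critic(s, a), q_tar)             # L(omega)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # e.g. torch.optim.SGD(..., lr=beta)
    return loss.item()
```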
Repeating the model training process until the model converges.
In the embodiment of the invention, a distributed node only needs to periodically acquire the latest Actor network parameters from the server before the model converges. The nodes therefore do not need to train a deep contrast reinforcement learning model independently, which greatly reduces their computation and storage pressure. In addition, the distributed nodes in the network interact with the environment and collect experience asynchronously, providing richer and more diverse experience information to the server, which helps accelerate model convergence and improve the model's generalization capability. Furthermore, introducing contrast learning into the reinforcement learning model during training further improves the model's representation and learning capability, allows environment states to be distinguished better, and provides the nodes with a better forwarding node selection strategy.
Example 2
This embodiment differs from embodiment 1 in that it provides a wireless network system based on deep contrast reinforcement learning, which is built on the wireless route optimization method based on deep contrast reinforcement learning described above; the wireless network system comprises an Internet of Things wireless multi-hop network and a server, where the network comprises a sink node and a plurality of wireless terminal/relay nodes;
a deep contrast reinforcement learning model is deployed on the server and adopts an Actor-Critic network architecture; the deep contrast reinforcement learning model is trained on the server, whose resources are more abundant;
and each terminal node is provided with an Actor network as a distributed routing decision model, and periodically acquires the latest parameters of the depth contrast reinforcement learning model from a server to perform distributed resource scheduling decision, namely forwarding node selection.
The wireless system adopts an architecture that combines centralized model training with distributed interaction: an edge server with relatively abundant resources provides additional data storage and computing support for the resource-limited terminal nodes, while asynchronous experience collection by the terminals provides more comprehensive training data for model training, improving the convergence rate and the generalization capability of the model.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not meant to limit the scope of the invention or to restrict the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within its scope.

Claims (10)

1. The wireless route optimization method based on the deep contrast reinforcement learning is characterized in that the method is applied to a wireless multi-hop network of the Internet of things and a server, a deep contrast reinforcement learning model is deployed on the server, the network comprises a sink node and a plurality of wireless terminal nodes, and an Actor network is deployed on each terminal node as a distributed route decision model; the method comprises the following steps:
dividing the whole time into a plurality of continuous super-frame periods, wherein each super-frame period comprises a control period and a data transmission period;
based on the superframe period, each node acquires a current latest routing decision model from a server when accessing the network; in a control period, the node generates a current optimal action a based on the latest routing decision model and a local state vector s and maps the current optimal action a into an optimal forwarding node; in the data transmission period, the node transmits data to the optimal forwarding node; after each data transmission period is finished, the node counts related network performance indexes and calculates a corresponding reward value r; before the depth contrast reinforcement learning model converges, the node uploads experience information < s, a, r, s '> acquired in each super frame period to a server, wherein s' is a new state of an intelligent agent executing action a and environment in a state s;
The server periodically collects the experience information from the nodes and stores the experience information into an experience pool, and extracts a portion of the experience information from the experience pool and trains the depth contrast reinforcement learning model.
2. The wireless route optimization method based on depth contrast reinforcement learning according to claim 1, wherein in a control period, the node generates and maps a current optimal action a to an optimal forwarding node based on a latest routing decision model and a local state vector s, comprising:
in the control period, when a certain node accesses the network, acquiring a current latest routing decision model from a server, and monitoring neighbor node information of the node;
the node collects candidate forwarding node information according to the monitored neighbor node information, and builds a candidate forwarding node set;
modeling the current state and the historical state information of m candidate forwarding nodes with the largest energy in the candidate forwarding node set into a three-dimensional graph vector; the three-dimensional graph vector is used as a local state vector s of the latest routing decision model;
the node inputs a local state vector s into the latest routing decision model based on the latest routing decision model and the local state vector s, and generates an optimal action a; and mapping the optimal action a into a corresponding optimal forwarding node.
3. The wireless route optimization method based on depth contrast reinforcement learning according to claim 2, wherein the node information includes energy efficiency information, hop count, expected number of transmissions, buffer queue number, and potential sub-node count of the node;
the energy efficiency information of a node is used for evaluating the energy usage of the current forwarding node, so that nodes with low energy efficiency are avoided as forwarding nodes for data transmission, thereby extending the life cycle of the network;
the hop count is used for evaluating the distance from the node to the sink node, so that nodes with an excessively large hop count are not selected as forwarding nodes, improving the success rate of data transmission and reducing the data transmission delay;
the expected transmission times are used for evaluating the communication quality of the appointed link, selecting a data link with lower expected transmission times, and improving the reliability of data transmission;
the buffer queue number is used for evaluating the load degree of the candidate forwarding nodes and avoiding selecting nodes causing serious load imbalance of the network to forward data;
the potential sub-node count is used for evaluating the network dynamics and potential congestion of candidate forwarding nodes, so as to estimate the potential influence of the nodes' distributed decisions on the current forwarding node.
4. The wireless route optimization method based on depth contrast reinforcement learning according to claim 2, wherein inputting the local state vector s into the latest route decision model, generating the optimal action a, and mapping the optimal action a to the corresponding optimal forwarding node, comprises:
convolving the local state vector s with a convolution layer having a convolution kernel size of c×1×1 to obtain a first vector X1 = Conv_{c×1×1}(s);
processing the first vector X1 by global average pooling and global maximum pooling to obtain a global average pooling vector X_avg and a global maximum pooling vector X_max, and combining them to obtain a second vector X2 = [X_avg; X_max], wherein H is the number of candidate forwarding nodes under consideration and W is the number of routing metrics;
performing global average pooling again over the channels of the second vector X2 to extract detail features and generate a third vector X3;
applying convolution layers with kernel sizes of 1×H×1 and 1×1×W to the second vector X2 simultaneously in two different dimensions to generate a fifth vector N_w and a sixth vector M_w respectively: N_w = Conv_{1×H×1}(X2), M_w = Conv_{1×1×W}(X2);
applying matrix multiplication to the fifth vector N_w and the sixth vector M_w to obtain the product vector NM_w = N_w × M_w;
adopting a residual block to preserve the integrity of the information, and applying a convolution layer with a convolution kernel size of 3×1×1 to the sum of the product vector and the third vector to obtain a fourth vector X4 = Conv_{3×1×1}(NM_w + X3);
convolving the fourth vector with a convolution layer having a convolution kernel size of 1×1×W and applying a softmax operation to generate the optimal action
a = softmax(Conv_{1×1×W}(X4));
converting the dimension of the optimal action a, and selecting, according to the distribution of the optimal action, the forwarding node corresponding to the action with the maximum probability value as the optimal candidate forwarding node of the node.
5. The wireless route optimization method based on deep contrast reinforcement learning according to claim 1, wherein the deep contrast reinforcement learning model introduces contrast learning into the deep reinforcement learning model; by learning the relative relationships among samples and the commonalities and characteristics of the data, the contrast learning improves the feature representation and the generalization capability of the deep reinforcement learning model, so that different paths can be better distinguished.
6. The wireless route optimization method based on depth contrast reinforcement learning according to claim 5, wherein the reward value r is calculated by using a reward function of a depth contrast reinforcement learning model, and a calculation formula of the reward function is:
R=w1*f(PDR)+w2*f(EE)+w3*f(Delay)
wherein w1, w2 and w3 are the weight coefficients of the performance evaluation indexes (packet delivery rate PDR, energy efficiency EE and end-to-end Delay), with w1 + w2 + w3 = 1, indicating the relative importance the current network attaches to the different indexes;
μ and the corresponding coefficient are tunable coefficients of the normalization functions; the forward indexes, packet delivery rate PDR and energy efficiency EE, are calculated with f(x); the reverse index, end-to-end Delay, is calculated with f(y); the normalization uses the mean x value over the periods since the node joined the network, max[x], the maximum single-period x value since the node joined the network, and min[y], the minimum single-period y value since the node joined the network.
7. The method of wireless route optimization based on depth contrast reinforcement learning of claim 1, wherein extracting part of experience information from an experience pool and training the depth contrast reinforcement learning model comprises:
calculating the dominance value of each piece of experience information according to the dominance function;
judging whether the dominance value is larger than 0, if so, storing the experience information into a preset queue in an experience library D according to the action information, and taking the experience information as a training sample of the depth contrast reinforcement learning model; otherwise, no further operation is performed on the empirical information;
In the model training period, when the training round number reaches a contrast learning training threshold value, the server also extracts corresponding sample data from the experience library D each time to assist in training the routing decision model; otherwise, the server only extracts the mini-batch data from the experience pool C each time to train the routing decision model; suspending the model training task when the model converges;
the node periodically acquires relevant parameters from the server to update the routing decision model and interacts with the environment.
8. The wireless route optimization method based on depth contrast reinforcement learning according to claim 7, wherein the formula of the dominance function is:
wherein Q(s, a) is the state value function of the state-action pair (s, a), calculated by the Critic network; V(s) is the state value function of the local state vector s; π(s, a) is calculated by the Actor network.
9. The method for wireless route optimization based on deep contrast reinforcement learning of claim 7, wherein when the number of training rounds reaches the contrast learning training threshold, the server further extracts corresponding sample data from the experience library D each time to assist in training the route decision model, comprising:
when the training round number reaches a contrast learning training threshold value, the server combines contrast learning and deep reinforcement learning, and corresponding sample data are extracted from the experience library each time to assist in training the Actor network;
The optimization targets of contrast learning are:
wherein e_i represents one sample in the dataset; z_1(e_{i1}) is the feature obtained by first applying data enhancement method F_1 to e_i and then passing the result through feature extraction module z_1; z_2(e_{i2}) is the feature obtained by first applying data enhancement method F_2 to e_i and then passing the result through feature extraction module z_2; z_1(e_{j1}) and z_2(e_{j2}) are the augmented features of data e_j; the similarity between two feature vectors is measured by a similarity function; N is the total number of samples, i and j are sample indices, f_1 is the feature extraction method applied to F_1-enhanced samples, and f_2 is the feature extraction method applied to F_2-enhanced samples; the objective of the formula is to maximize the similarity between features of the same data after different enhancement methods and to minimize the similarity between features of different data after the same enhancement method.
10. A wireless network system based on deep contrast reinforcement learning, characterized in that the wireless network system is based on the wireless route optimization method based on deep contrast reinforcement learning according to any one of claims 1 to 9; the wireless network system comprises an Internet of things wireless multi-hop network and a server, wherein the network comprises a sink node and a plurality of wireless terminal nodes;
The server is provided with a deep contrast reinforcement learning model, and the deep contrast reinforcement learning model adopts an Actor-Critic network architecture; training the depth contrast reinforcement learning model through a server;
each terminal node is provided with an Actor network as a distributed routing decision model, and periodically acquires the latest parameters of the depth contrast reinforcement learning model from a server to perform distributed resource scheduling decision, namely forwarding node selection;
the wireless system adopts a centralized model training and a distributed interactive architecture, adopts a resource-rich edge server to provide additional data storage and support of computing capacity for terminal nodes with limited resources, and provides more comprehensive model training data for model training through asynchronous experience acquisition of the terminal.