CN115412992A - Distributed co-evolution method, UAV (unmanned aerial vehicle) and intelligent routing method and device thereof - Google Patents


Info

Publication number
CN115412992A
CN115412992A CN202210878196.9A CN202210878196A CN115412992A CN 115412992 A CN115412992 A CN 115412992A CN 202210878196 A CN202210878196 A CN 202210878196A CN 115412992 A CN115412992 A CN 115412992A
Authority
CN
China
Prior art keywords
node
route
uav
routing
score
Prior art date
Legal status (assumption, not a legal conclusion)
Pending
Application number
CN202210878196.9A
Other languages
Chinese (zh)
Inventor
韦云凯
赵鹏程
路雨昕
冷甦鹏
杨鲲
刘强
方琛
Current Assignee (the listed assignees may be inaccurate)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202210878196.9A priority Critical patent/CN115412992A/en
Priority to PCT/CN2022/121080 priority patent/WO2024021281A1/en
Publication of CN115412992A publication Critical patent/CN115412992A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 40/00: Communication routing or communication path finding
    • H04W 40/02: Communication route or path selection, e.g. power-based or shortest path routing
    • H04W 40/12: Route selection based on transmission quality or channel quality
    • H04W 40/18: Route selection based on predicted events
    • H04W 40/24: Connectivity information management, e.g. connectivity discovery or connectivity update
    • H04W 40/248: Connectivity information update
    • H04B: TRANSMISSION
    • H04B 7/00: Radio transmission systems, i.e. using radiation field
    • H04B 7/14: Relay systems
    • H04B 7/15: Active relay systems
    • H04B 7/185: Space-based or airborne stations; stations for satellite systems
    • H04B 7/18502: Airborne stations
    • H04B 7/18506: Communications with or from aircraft, i.e. aeronautical mobile service
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00: Computer-aided design [CAD]
    • G06F 30/20: Design optimisation, verification or simulation
    • G06F 30/27: Design optimisation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM]
    • G06F 2119/00: Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/02: Reliability analysis or reliability optimisation; failure analysis, e.g. worst-case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Astronomy & Astrophysics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a distributed co-evolution method, a UAV (unmanned aerial vehicle), and an intelligent routing method and device thereof. Each UAV node is configured with a routing DQN (Deep Q-Network) model, a route evaluation model, and a block. A data packet is forwarded node by node from a source node to a destination node in the network, forming the route from source to destination corresponding to that packet. In each evolution cycle the following steps are performed: S1: update routing DQN model parameters; S2: elect the block-producing node; S3: broadcast the routing DQN model parameters; S4: co-evolve the parameters. Compared with traditional GPSR, this distributed co-evolutionary routing technique based on blockchain and deep reinforcement learning achieves lower end-to-end delay and a higher packet delivery rate, and it converges faster than common intelligent routing algorithms.

Description

Distributed co-evolution method, UAV (unmanned aerial vehicle) and intelligent routing method and device thereof
Technical Field
The invention relates to the field of unmanned-aerial-vehicle ad hoc networks, and in particular to a distributed co-evolution method, a UAV, and an intelligent routing method and device thereof.
Background
In recent years, unmanned aerial vehicle (UAV) networks have been widely used in many fields to perform tasks that are difficult to accomplish by conventional means, often in complex environments that demand flexibility and versatility. In such environments, efficient and flexible cooperation among UAVs is required to complete tasks effectively, so low-latency, highly reliable information transmission in complex communication environments is a basic prerequisite for the UAV network. However, because UAVs move rapidly, the communication environment and topology of a UAV network change constantly. Adopting an efficient, adaptive routing technique in such a dynamic environment to guarantee low latency and high reliability has therefore become a key challenge for UAV networks.
Greedy Perimeter Stateless Routing (GPSR) considers only the distance between the next-hop node and the destination node when routing. In UAV networks, however, a route chosen purely from geographic location may not be the best one, because the network environment and topology are complex and dynamic. Furthermore, routing holes caused by frequent UAV movement increase end-to-end delay and waste routing resources.
To coordinate the models of different nodes so that the whole network quickly reaches good routing performance, a centralized parameter aggregation and optimization algorithm such as federated learning is usually adopted. However, this approach cannot avoid the fundamental drawbacks of centralization, such as increased routing overhead and delay in data collection and distribution, and it cannot adapt to a rapidly changing dynamic environment. Moreover, centrally aggregated parameters ignore the diversity of the network environments in which different UAVs operate, so the learned model cannot make an optimal routing decision for each node's unique local environment.
Therefore, designing an intelligent, distributed co-evolutionary routing technique for UAV networks, one that accelerates the convergence of deep reinforcement learning and realizes co-evolution of routes across the network, is of great significance for improving the routing performance and dynamic-environment adaptability of UAV networks.
Disclosure of Invention
The invention aims to provide a distributed co-evolutionary routing technique for UAV networks based on blockchain and deep reinforcement learning. First, based on the deep-reinforcement-learning DQN (Deep Q-Network) algorithm, each UAV makes routing decisions that evolve with the environment, realizing autonomous evolution. On this basis, blockchain technology is used to fuse each round's optimal routing-decision model with each node's local model, accelerating the convergence of deep reinforcement learning, realizing co-evolution of routes in the UAV network, and improving overall routing performance.
A distributed co-evolution method is applied to a network composed of a plurality of UAV nodes, the network comprising a verification committee composed of at least one UAV node;
each UAV node is configured with a routing DQN model;
each UAV node is configured with a route evaluation model;
each UAV node is configured with a block;
sequentially forwarding the data packet to a destination node from a source node in a network to form a route from the source node to the destination node corresponding to the data packet;
in an evolution cycle, the following steps are performed:
s1: updating routing DQN model parameters:
each UAV node trains its own routing DQN model on samples generated by routing, and updates its own routing DQN model parameters;
s2: electing the block node:
inputting the route into the route evaluation model to obtain a score for each node on the route, so that the verification committee, based on these node scores, elects the node with the highest score as the block-producing node;
s3: broadcasting routing DQN model parameters: the block-producing node broadcasts its routing DQN model parameters to the network;
s4: parameter co-evolution: each non-block-producing node receives the routing DQN model parameters of the block-producing node and fuses them with its own routing DQN model parameters, obtaining the evolved routing DQN model parameters.
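The four steps above can be sketched as one evolution cycle. Everything in this sketch is an illustrative assumption: the node names, the scoring stub passed in as `score_fn`, and the convex-combination fusion with weight `alpha`, since the patent gives its score and fusion formulas only abstractly.

```python
def evolution_cycle(nodes, routes, score_fn, alpha=0.5):
    """One evolution cycle (S1-S4) of the distributed co-evolution method.

    nodes    -- dict: node name -> list of routing-DQN parameters
    routes   -- list of routes, each a list of node names
    score_fn -- route evaluation model: route -> {node: score}
    alpha    -- illustrative fusion weight (an assumption of this sketch)
    """
    # S1: each node trains its own routing DQN locally (omitted in this sketch).

    # S2: elect the block-producing node: the highest accumulated route score.
    totals = {name: 0.0 for name in nodes}
    for route in routes:
        for node, score in score_fn(route).items():
            totals[node] += score
    block_node = max(totals, key=totals.get)

    # S3: the block-producing node broadcasts its parameters.
    theta_sel = nodes[block_node]

    # S4: every other node fuses the broadcast parameters with its own.
    for name in nodes:
        if name != block_node:
            nodes[name] = [alpha * own + (1 - alpha) * sel
                           for own, sel in zip(nodes[name], theta_sel)]
    return block_node
```

With a uniform scoring stub, the node that appears on the most routes wins the election and the other nodes drift toward its parameters.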
Further, updating the routing DQN model parameters specifically includes:
when a UAV node forwards a data packet, it forwards the packet to the next node with the maximum Q value based on its routing DQN model and generates a training sample; the sample is used, when the node is idle, to train and update the node's own routing DQN model, producing updated routing DQN model parameters.
Further, electing the block node specifically includes:
when the destination node receives a data packet, it obtains the total score of the route based on the route evaluation model and generates a score-verification packet from it; the score-verification packet is passed back to the source node along the corresponding routing path and contains the scores assigned to each node; the verification committee updates each node's score upon receiving a score-verification packet, and in the election stage of the evolution cycle elects the node with the highest score as the block-producing node.
Further, inputting the route into the route evaluation model to obtain the score of each node on the route specifically includes the following steps:
for a route R_h with source node C_src^(R_h) and destination node C_dst^(R_h), when the destination node receives a data packet it generates the corresponding score-verification packet according to the route evaluation model, which specifically comprises:
calculating the distance d_h from the source node to the destination node at the time the source node sent the data packet;
calculating the end-to-end total delay T_h of the data packet from the source node to the destination node;
calculating the total score S_h of route R_h from the distance d_h and the end-to-end total delay T_h;
assigning the total score to the first n nodes on route R_h, so that UAV node C_k obtains a score S_k^(R_h) on route R_h, where S_k^(R_h) depends on the position of C_k in route R_h, and τ is a coefficient that adjusts the weight of the score distribution;
the destination node signs and packages the data-packet number, the sending and receiving times, the nodes participating in route R_h, and the scores assigned to them, generating the score-verification packet.
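The score-verification packet can be sketched as follows. The total-score formula `d_h / T_h`, the geometrically decaying weights controlled by `tau`, and the hash-based stand-in for a real cryptographic signature are all assumptions of this sketch; the patent states only that the score is computed from d_h and T_h and distributed over the first n nodes.

```python
from hashlib import sha256

def route_score(d_h, T_h):
    # Illustrative total score: reward covering distance d_h with low delay T_h.
    # (The patent's exact formula is not reproduced here.)
    return d_h / T_h

def score_packet(route, d_h, T_h, n, tau=0.8):
    """Build a score-verification packet for `route` (a list of node ids).

    The total score is split over the first n nodes with geometrically
    decaying weights controlled by tau (an assumed distribution rule).
    """
    S_h = route_score(d_h, T_h)
    weights = [tau ** i for i in range(n)]
    norm = sum(weights)
    scores = {route[i]: S_h * weights[i] / norm for i in range(n)}
    payload = repr(sorted(scores.items())).encode()
    signature = sha256(payload).hexdigest()  # stand-in for a real signature
    return {"route": route, "scores": scores, "sig": signature}

def verify(packet):
    # What a committee member would check before tallying the scores.
    payload = repr(sorted(packet["scores"].items())).encode()
    return sha256(payload).hexdigest() == packet["sig"]
```

Earlier nodes on the route receive larger shares, and the committee can recompute the signature to detect tampering.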
Further, upon receiving a score-verification packet, the verification committee verifies all signatures in it, superimposes the score assigned to each UAV node onto that node's current score, and updates the corresponding UAV's score.
Further, the process of parameter fusion is as follows:
obtain UAV node C_k's routing DQN model parameters θ_k for the current evolution cycle;
obtain the block-producing node C_sel's current routing DQN model parameters θ_sel;
from θ_k and θ_sel, calculate UAV node C_k's routing DQN model parameters θ'_k for the next evolution cycle, where θ_k and θ'_k are UAV node C_k's routing DQN model parameters in the current and next evolution cycles, respectively.
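A minimal sketch of the fusion step, under the assumption that θ'_k is a convex combination of the node's own parameters and the broadcast parameters with weight `beta`; this is one plausible reading, not the patent's exact rule.

```python
def fuse_parameters(theta_k, theta_sel, beta=0.5):
    """Fuse a non-block node's parameters theta_k with the block-producing
    node's broadcast parameters theta_sel.

    beta is an assumed fusion weight: beta = 1 keeps the local model,
    beta = 0 copies the broadcast model outright.
    """
    assert len(theta_k) == len(theta_sel)
    return [beta * own + (1 - beta) * sel
            for own, sel in zip(theta_k, theta_sel)]
```

Keeping beta strictly between 0 and 1 preserves some of each node's locally learned behavior, which matches the patent's concern that purely centralized aggregation ignores local environments.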
Further, the policy for generating a route includes a route-utilization strategy and a route-exploration strategy, specifically:
before a UAV node forwards a data packet, it compares a locally generated random number with a preset threshold:
if the random number is greater than or equal to the threshold, the route-utilization strategy is applied: the UAV node selects the neighbor node with the maximum Q value, as given by its routing DQN model, as the next hop;
if the random number is less than the threshold, the route-exploration strategy is applied: the UAV node randomly selects one neighbor node as the next hop.
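The threshold comparison above is the classic epsilon-greedy policy. A sketch, with illustrative function and parameter names:

```python
import random

def next_hop(q_values, neighbors, epsilon=0.1, rng=random):
    """Choose the next hop with an epsilon-greedy rule.

    With probability 1 - epsilon exploit: pick the neighbor with the
    maximum Q value (route-utilization strategy). With probability
    epsilon explore: pick a random neighbor (route-exploration strategy).
    """
    if rng.random() >= epsilon:          # random number >= threshold
        return max(neighbors, key=lambda n: q_values[n])
    return rng.choice(neighbors)         # random number < threshold
```

Passing `rng` explicitly makes the policy deterministic under test while defaulting to the standard library's generator in normal use.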
An intelligent routing method based on a routing DQN model is disclosed, wherein when a UAV node receives a data packet, the method comprises the following steps:
acquiring current state space information of the UAV node;
inputting the current state space information into a pre-trained routing DQN model, and determining a neighbor node of a next hop;
the routing DQN model is obtained by training on training samples corresponding to the UAV node; a training sample comprises: the UAV node's current state-space information when routing, the action, the reward value, and the maximum Q value among the next-hop node's neighbors.
Further, the training process of the route DQN model includes:
constructing an initial route DQN model;
acquiring a training sample corresponding to the UAV node;
inputting the training sample into the route DQN model to obtain the Q value of the selected node;
calculating the loss function L(θ) from the training sample and the Q value computed by the routing DQN model;
and updating the routing DQN model parameters by gradient descent to minimize the loss function L(θ), obtaining the trained routing DQN model.
Further, in the training sample:
the current state space information s = { NT, VL }, where VL is an access list used to record a node to which the data packet has been forwarded, and NT is a neighbor table composed of neighbor node position information, BER transmitted between each neighbor node and the current node, and a queue length of the data packet to be sent in a cache of each neighbor node;
the action represents that the neighbor node which is not forwarded is selected as the next hop;
the reward value is a weighted sum of a base reward value, a congestion penalty value, a forwarding-angle reward value, a link-quality value, and a distance ratio.
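The training-sample layout described above might be represented as follows; the field names are illustrative, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class NeighborInfo:
    position: Tuple[float, float, float]  # neighbor's reported location
    ber: float                            # bit error rate of the link
    queue_len: int                        # packets waiting in its buffer

@dataclass
class State:
    neighbor_table: Dict[str, NeighborInfo]  # NT
    visited: List[str]                       # VL: nodes the packet has passed

@dataclass
class Sample:
    state: State        # s = {NT, VL}
    action: str         # chosen next-hop neighbor (not yet in VL)
    reward: float       # weighted sum of the reward terms
    next_max_q: float   # max Q among the next hop's neighbors
```

A node would append one `Sample` to its pool per forwarded packet and draw random batches from it when idle.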
A UAV:
the UAV node is configured with a routing DQN model;
the UAV node is configured with a route evaluation model;
the UAV node is configured with a block;
the UAV node performs the following steps in an evolution cycle:
updating parameters:
forwarding a data packet based on the routing DQN model and generating a training sample, where the sample is used to train and update the node's own routing DQN model, yielding the routing DQN model parameters for the current consensus period;
and (3) routing scoring:
if the current UAV node is the destination node, it obtains the total score of the route based on the route evaluation model, generates a score-verification packet from the total score, and returns the packet to the previous node, the packet containing the scores assigned to each node;
if the current UAV node is not the destination node, it forwards the data packet to the next node based on the routing DQN model;
returning and scoring:
if the current UAV node is the source node, it sends the score-verification packet to the verification committee;
if the current UAV node is not the source node, it returns the score-verification packet to the previous node, so that the packet travels back to the source node along the corresponding routing path; the source node forwards the received score-verification packet to the verification committee;
node election:
if the current UAV node is a member of the verification committee, then upon receiving a score-verification packet it verifies the packet's signatures, updates each UAV node's score, and in the election stage of the evolution cycle participates in electing the node with the highest score as the block-producing node;
co-evolution:
if the current UAV node is the block-producing node, it broadcasts its routing DQN model parameters for the current consensus period to the network;
if the current UAV node is not the block-producing node, it receives the block-producing node's routing DQN model parameters and fuses them with its own routing DQN model parameters.
A control device, comprising:
one or more processors;
a storage unit configured to store one or more programs that, when executed by the one or more processors, enable the one or more processors to implement the distributed co-evolution method.
The invention has the following beneficial effects. The overall technical scheme consists of two parts: an intelligent routing algorithm based on the DQN model, and a distributed co-evolution strategy based on blockchain. In the DQN-based intelligent routing algorithm, each UAV in the network acts as an agent, independently trains a DQN model, and makes routing decisions with its own learned model. Compared with traditional GPSR, the DQN model lets a UAV extract network information beyond location as features and make better routing decisions by weighing these features together. In the blockchain-based distributed co-evolution strategy, the core idea is that the UAV whose learning model has evolved best in each period is selected to broadcast that model to the network, and the other UAVs fuse the broadcast parameters with their local parameters, achieving co-evolution. The UAV ad hoc network is treated as a blockchain in which each UAV corresponds to a block, and the UAV that publishes a block, i.e., shares the optimal parameters, is chosen by a consensus algorithm. The details are as follows:
in an intelligent routing algorithm based on the DQN model, the DQN model suitable for UAV distributed training and decision is designed by combining the characteristics and requirements of UAV ad hoc network routing. The method specifically comprises the following steps: (1) A UAV is designed to serve as a perception state space of an intelligent agent in a network environment and an action space for carrying out routing decision. (2) According to the UAV ad hoc network characteristic, a reward function is designed, and the UAV can comprehensively consider the network environment to carry out optimal routing decision. (3) And establishing a proper neural network model according to the capacity of the UAV, performing action decision and obtaining corresponding rewards by the UAV according to the current state, and updating the local neural network model according to a gradient descent algorithm.
In the blockchain-based distributed co-evolution strategy, an evaluation model is designed to score the routing performance of each UAV during the current consensus period. Each node's score is verified and tallied by a verification committee selected with a verifiable random function (VRF). The committee reaches consensus on the UAV that achieved the highest score in the current period and broadcasts the result network-wide. The selected UAV broadcasts its DQN parameters to the network, and the other UAVs fuse them with their local parameters to realize co-evolution.
In the invention, the DQN-based intelligent routing algorithm gives UAV network routing lower end-to-end delay and a higher packet delivery rate than traditional GPSR, while the blockchain-based distributed co-evolution strategy accelerates the convergence of the DQN model and improves overall routing performance during a UAV task. The method has a degree of generalization and suits routing scenarios in various mobile ad hoc networks, so it can be extended to ad hoc networks such as the Internet of Vehicles and wireless sensor networks.
Drawings
FIG. 1 is a diagram of the blockchain-based distributed co-evolution strategy of the present invention;
FIG. 2 is a schematic view of the collection phase of the present invention;
FIG. 3 is a schematic view of the election phase of the present invention;
FIG. 4 is a block-consensus diagram according to the present invention;
FIG. 5 is a schematic diagram of distributed co-evolutionary routing based on blockchain and deep reinforcement learning according to the present invention;
FIG. 6 compares the routing algorithm of the present invention with a conventional routing algorithm in terms of end-to-end delay;
FIG. 7 compares the routing algorithm of the present invention with a conventional routing algorithm in terms of packet delivery rate;
FIG. 8 compares the reward values over training iterations of the routing algorithm of the present invention and a conventional intelligent routing algorithm.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the invention, not all of them, and the following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
The relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
In addition, descriptions of well-known structures, functions, and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the spirit and scope of the disclosure.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as exemplary only and not as limiting. Thus, other examples of the exemplary embodiments may have different values.
1. Intelligent routing algorithm based on DQN model
The core of the DQN-based intelligent routing algorithm is that each UAV uses its own DQN model to weigh network-environment information as a whole, making routing decisions more efficient and improving adaptability as the environment evolves. This section first introduces the overall idea of the algorithm, then details the reward-function design in the DQN model, and finally gives pseudocode for the full algorithm flow.
(1) Integral thinking
In the DQN-based intelligent routing algorithm, a UAV computes the Q value of each neighbor node with its DQN model and selects the neighbor with the maximum Q value as the next hop. The DQN model extracts the access list, the transmission bit error rate (BER), the neighbor nodes' position information, and their queue lengths as features, and the reward function acts directly on the computation of the Q value. Environment information is thus weighed together in each neighbor's Q value, making routing decisions more reasonable and efficient. As the DQN model converges, the Q values reflect the neighbors' real states more and more accurately. Feature extraction and the reward-function design are described in detail in the second part of this section.
In the routing algorithm, a node considers only the choice of next hop when routing; denote the current node by C and its neighbor nodes by {N_1, N_2, N_3, ..., N_j, ...}. The routing process can be viewed as a Markov decision process (MDP), represented as a quadruple (S, A, P, R), where S is the state space, A the action space, P the state-transition probability, and R the reward value. The current node, acting as the agent, makes the best routing decision for its current state so as to maximize the reward. The MDP is defined as follows:
state space, action space and reward value
State space: the state space of the current node C consists of an access list and a neighbor table. Each data packet carries an access list (VL) recording the nodes that have already forwarded it; every node that forwards the packet appends itself to the packet's access list. Let neighbor node N_j's position be p_j, let the BER of the link between N_j and C be e_j, and let the queue length of packets awaiting transmission in N_j's buffer be q_j; these three components form the neighbor vector NI_j = (p_j, e_j, q_j). The neighbor table consists of the neighbor vectors, denoted NT = {NI_1, NI_2, ..., NI_j, ...}. The current node learns each neighbor's current position, queue length, and transmission BER from periodically transmitted HELLO messages and received ACKs, and updates its neighbor table accordingly. The state of the current node can therefore be represented as s = {NT, VL}.
Action space: the action space contains all actions the agent can perform in the current state. Here it is the current node's routing-decision space, i.e., all possible next hops. The current node is restricted to neighbors that have not yet forwarded the packet, avoiding wasted routing resources, so the action space contains only neighbors not in the access list: A = {N_j} - VL, where action a_j (a_j ∈ A) denotes selecting neighbor node N_j as the next hop.
Reward value: the agent takes an action in the current state and receives a corresponding reward, and the reward value reflects the quality of that action. The design of the reward function is therefore particularly important, since it directly determines the agent's direction of evolution. In this routing algorithm, the reward computation weighs the access list, the transmission BER, the neighbors' queue lengths, and position information together, thereby balancing traffic, avoiding perimeter mode, and improving communication quality. The specific computation of the reward value is detailed in "(2) Reward function design".
Training process
The current node C selects neighbor node N_j as the next hop and receives the reward value. After neighbor N_j successfully receives the data packet, it replies to C with an ACK containing its current position, BER, queue length, and the maximum Q value among its own neighbors. C updates its neighbor table with this information and combines the state, decision, and reward from routing with N_j's maximum next-hop Q value into a quadruple (s, a_j, R, max_{a'} Q(s', a')), which is stored in the sample pool as a training sample.
During training, a node trains its DQN model with a memory-replay mechanism: training samples are drawn at random from the sample pool. Training updates the DQN model parameters by gradient descent, with the goal of minimizing the loss function L(θ), i.e., the mean squared error between the evaluated Q value and the target Q value, calculated as follows:
L(θ_i) = E[(Q_target - Q(s, a; θ_i))^2]   (1)
where θ_i denotes the model parameters at the i-th iteration and Q_target is the target Q value, computed as
Q_target = R_total + γ · max_{a'} Q(s', a')   (2)
where R_total is the total reward obtained, γ is the discount factor, and max_{a'} Q(s', a') is the maximum Q value of the next-hop neighbor node. The evaluation-network parameters in the DQN model are updated at every iteration while the target-network parameters are held fixed; after a set number of iterations the evaluation network synchronizes its parameters to the target network, ensuring the stability and convergence of the DQN model.
After multiple iterations the DQN model converges, and the Q value then reflects the real state of the neighbor nodes, so the node can make routing decisions superior to those of the traditional GPSR.
(2) Reward function design
The traditional GPSR considers only position information during routing, which causes the following problems: (1) traffic concentrates on the best-located node, causing network congestion; (2) use of the perimeter mode increases hop count and delay; (3) the quality of the communication link is ignored during routing, reducing routing efficiency. To solve these problems, the access list, queuing length, and BER information are incorporated into the reward calculation, so that nodes comprehensively consider environmental information when routing, thereby balancing traffic, avoiding the perimeter mode, and improving link quality. The reward function is designed as follows:
Traffic balancing

In the traditional GPSR protocol, only node position is considered in routing, so traffic concentrates on the neighbor node nearest to the destination node, causing network congestion. This results in high queuing delay and even loss of data packets.
To solve this problem, a traffic-balancing strategy is applied in the routing algorithm: each node obtains local congestion information from its neighbors and applies it to its routing decisions. The queue length in a neighbor node's buffer is used to determine its congestion state; a larger queue length means higher queuing delay and a higher packet-loss probability. The current node learns the current queuing lengths from the HELLO messages periodically sent by neighbor nodes and from replied ACKs, and updates its neighbor information table in time. A penalty mechanism forces the current node C to take congestion into account. The congestion penalty acts on the Q value and is calculated as follows:
[Equation (3) — congestion penalty applied to the Q value; rendered as an image in the original]

wherein l_{N_j} is the queue length of neighbor node N_j, b_{N_j} is the buffer size of node N_j, Q(C, N_j) is the original Q value of node N_j at node C, and β_1 and β_2 are the congestion penalty factors.
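Since the exact form of the congestion penalty appears only as an image, the sketch below uses an assumed penalty proportional to buffer occupancy, plus a fixed penalty for a full buffer; this matches the stated intent (larger queue, lower Q) but the functional form and the β values are illustrative:

```python
def congested_q(q_value, queue_len, buffer_size, beta1=0.5, beta2=2.0):
    """Illustrative congestion penalty on the Q value.

    beta1 scales a penalty proportional to buffer occupancy;
    beta2 adds a fixed penalty when the buffer is completely full.
    (Assumed form; the patent's equation (3) is an image.)
    """
    occupancy = queue_len / buffer_size
    penalty = beta1 * occupancy + (beta2 if queue_len >= buffer_size else 0.0)
    return q_value - penalty
```

With this shape, an empty buffer leaves the Q value untouched, and a fuller buffer monotonically lowers the neighbor's attractiveness as a next hop.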
Avoiding the perimeter mode

In UAV networks, routing holes are likely to occur because UAVs move at high speed. In the traditional GPSR, when the current node is at a local minimum it must switch to the perimeter mode to bypass the hole region, but the perimeter mode may bring more hops and longer delay. Therefore, this routing algorithm uses a straight-propagation strategy to avoid use of the perimeter mode.
The current node C extracts the access list from the packet to learn which neighbor nodes have already been visited, and can thus make a better routing decision. The angle ∠V_f C N_j (≤ 180°) is defined here as the forwarding angle, where V_f is a neighbor node included in the VL. The angle ∠V_f C N_j is calculated as follows:

∠V_f C N_j = arccos( ((P_{V_f} − P_C) · (P_{N_j} − P_C)) / (‖P_{V_f} − P_C‖ · ‖P_{N_j} − P_C‖) )    (4)

wherein P_C, P_{V_f}, and P_{N_j} respectively denote the geographic locations of nodes C, V_f, and N_j.
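The forwarding angle reduces to the standard planar vector-angle formula; a sketch with illustrative 2-D coordinates:

```python
import math

def forwarding_angle(p_c, p_vf, p_nj):
    """Angle ∠V_f C N_j in degrees (range [0, 180]) from node positions."""
    v1 = (p_vf[0] - p_c[0], p_vf[1] - p_c[1])   # C -> V_f
    v2 = (p_nj[0] - p_c[0], p_nj[1] - p_c[1])   # C -> N_j
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_a = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_a))
```

A forwarding angle near 180° means the candidate next hop N_j points away from the already-visited node V_f, i.e., the packet travels as straight as possible.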
Based on the forwarding angle, the current node routes the data packet toward the destination node as straight as possible, so that it avoids already-routed areas and reduces use of the perimeter mode. The forwarding angle is extracted as a feature and applied to the Q value through a reward function, calculated as follows:

[Equation (5) — forwarding-angle reward R_dir aggregated over the visited nodes V_f; rendered as an image in the original]

wherein num denotes the number of nodes V_f.
Link quality
In UAV ad hoc networks, the network environment is complex and variable; for example, severe electromagnetic interference may exist. Since UAVs are distributed over a wide space, the link quality between a node and its different neighbor nodes may vary greatly. In this routing algorithm, BER is used to represent link quality and enters the Q-value calculation, forcing the node to take link quality into account when routing. The reward is calculated as follows:

[Equation (6) — link-quality reward R_qua as a function of the BER e_{C,N_j}; rendered as an image in the original]

wherein e_{C,N_j} denotes the transmission bit error rate between the current node C and its neighbor node N_j, which can be obtained from the HELLO messages and replied ACKs.
In addition to the above three strategies, the distance between the next hop and the destination node is also considered; that is, the core idea of the traditional GPSR, forwarding the packet ever closer to the destination node, is retained. The distance reward function is calculated as follows:

[Equation (7) — distance reward R_dis computed from d_{N_j} and d_C; rendered as an image in the original]

wherein d_{N_j} and d_C respectively denote the distances from node N_j and node C to the destination node.
Thus, the total reward value is calculated as:

R_total = R + α_1·R_con + α_2·R_dir + α_3·R_qua + α_4·R_dis    (8)

wherein α_1, α_2, α_3, and α_4 are weighting coefficients and R is the base reward value, calculated as follows:

[Equation (9) — base reward R; rendered as an image in the original]

From the context, R is the base reward value, R_con the congestion penalty value, R_dir the forwarding-angle reward value, R_qua the link-quality value, and R_dis the distance ratio.
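The total reward is a plain weighted sum of the component rewards; for illustration (the weights and component values below are made up):

```python
def total_reward(r, r_con, r_dir, r_qua, r_dis,
                 alpha=(0.25, 0.25, 0.25, 0.25)):
    """R_total = R + a1*R_con + a2*R_dir + a3*R_qua + a4*R_dis."""
    a1, a2, a3, a4 = alpha
    return r + a1 * r_con + a2 * r_dir + a3 * r_qua + a4 * r_dis

# Base reward 1.0; congestion penalty is negative, the rest reward good behavior.
r_total = total_reward(1.0, -0.4, 0.8, 0.6, 0.2)
```

Tuning the α weights trades off congestion avoidance, straight-line forwarding, link quality, and greedy distance progress against one another.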
(3) Algorithm flow
In the DQN-based intelligent routing algorithm, each UAV makes routing decisions through its own DQN model, and the training samples generated during each routing process are used to train that model. With continued iteration, the DQN model converges and the UAV can route efficiently.
When a routing task exists, the UAV adopts an ε-greedy strategy, i.e., a routing decision takes one of two forms: exploitation or exploration. With probability 1−ε the UAV selects the neighbor node with the maximum Q value as the next hop (the exploitation strategy); with probability ε it randomly selects a neighbor node as the next hop (the exploration strategy). Here ε decreases as the number of iterations increases. After sending to the neighbor node, the UAV extracts the maximum Q value of the next hop from the ACK, generates the training sample (s, a, R, max_{a′} Q(s′, a′)), and stores it in the sample pool.
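The ε-greedy next-hop choice with decaying ε can be sketched as follows (the decay schedule and bounds are illustrative, not specified by the patent):

```python
import random

def choose_next_hop(neighbor_q, epsilon, rng=random):
    """Pick next hop: exploit (max Q) with prob 1-eps, explore otherwise.

    neighbor_q: dict mapping neighbor id -> current Q value
    """
    if rng.random() < epsilon:
        return rng.choice(list(neighbor_q))       # exploration
    return max(neighbor_q, key=neighbor_q.get)    # exploitation

def decayed_epsilon(iteration, eps0=1.0, decay=0.99, eps_min=0.05):
    """ε shrinks as the iteration count grows, but never below eps_min."""
    return max(eps_min, eps0 * decay ** iteration)
```

Early on, a large ε makes the UAV sample many neighbors and fill the sample pool with diverse transitions; as ε decays, routing increasingly trusts the learned Q values.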
When the UAV is idle, it enters the training stage: several training samples are drawn at random from the sample pool, and the DQN model parameters are updated by minimizing the loss function L(θ). After a certain number of iterations, the target network copies its neural network parameters from the evaluation network.
The specific process is as follows:
[Algorithm pseudocode — rendered as images in the original]
2. Blockchain-based distributed co-evolution strategy

The core of the blockchain-based distributed co-evolution strategy is that a designed evaluation model scores the routing performance of each UAV in the current consensus period; a validation committee elects the UAV with the highest score; the elected UAV broadcasts its DQN model parameters to the network; and the other UAVs fuse their local parameters with those of the elected UAV, thereby realizing co-evolution. This section first introduces the evaluation model and then the overall consensus process.
(1) Evaluation model
To facilitate the evaluation of routing performance, the following concepts are first defined:
Unit-distance delay: in the UAV ad hoc network, the ratio of a packet's total end-to-end delay, over its route from source node to destination node, to the distance between that packet's source and destination nodes is called the packet's unit-distance delay.

Average unit-distance delay: in the UAV ad hoc network, the average of the unit-distance delays of all packets.

Basic unit-distance delay: the average unit-distance delay under the traditional GPSR, denoted t_0.
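These definitions reduce to simple ratios; a sketch with illustrative delay/distance numbers:

```python
def unit_distance_delay(total_delay, distance):
    """Unit-distance delay: end-to-end delay divided by source-destination distance."""
    return total_delay / distance

def average_unit_distance_delay(packets):
    """Average over (total_delay, distance) pairs for all packets."""
    return sum(unit_distance_delay(t, d) for t, d in packets) / len(packets)

# Two packets: 200 ms over 1000 m and 300 ms over 1000 m (illustrative values).
avg = average_unit_distance_delay([(200.0, 1000.0), (300.0, 1000.0)])
```

Running this average over all packets delivered by the traditional GPSR yields the reference value t_0 used by the evaluation model.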
On this basis, the invention takes the basic unit-distance delay t_0 as a reference to measure the quality of the route experienced by a given data packet in the DQN-based UAV network. A data packet P_h travels from its source node to its destination node through a sequence of intermediate nodes; record this route as R_h. The routing-performance score of route R_h is then calculated as:

[Equation (10) — total score S_h of route R_h, computed from d_h, T_h, t_0, and the coefficient μ; rendered as an image in the original]

wherein d_h is the distance from the source node to the destination node at the moment the source node sends packet P_h, T_h is the total end-to-end delay of packet P_h from the source node to the destination node, and μ is a coefficient.
The total score S_h of packet P_h is allocated to the first n nodes on route R_h. Each node along the route from source to destination contributes differently to the overall routing performance, with a node earlier in the forwarding order contributing more. Accordingly, S_h is allocated over the nodes of route R_h as follows:

[Equation (11) — allocation of S_h over the routing nodes of R_h; rendered as an image in the original]

wherein τ is a coefficient for adjusting the weight of the score distribution. The destination node is allocated no score.
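The allocation rule appears only as an image; the sketch below uses an assumed geometric weighting by τ that satisfies the stated properties (earlier forwarders receive more, the shares sum to S_h, and the destination receives nothing):

```python
def allocate_score(total_score, n_forwarders, tau=0.8):
    """Split S_h over the n forwarding nodes with geometrically decaying
    weights tau**i, normalized to sum to S_h.
    (Assumed form; the patent's equation (11) is an image.)
    """
    weights = [tau ** i for i in range(n_forwarders)]
    norm = sum(weights)
    return [total_score * w / norm for w in weights]

shares = allocate_score(10.0, 4)   # scores for the 4 forwarding nodes
```

With τ in (0, 1), the first forwarder always receives the largest share, and the shares of later forwarders decay smoothly toward zero.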
When the source node forwards the data packet, it marks and signs the forwarding time and its current distance to the destination node, and each subsequent node signs in turn as it forwards the packet. When the destination node receives the packet, it can therefore calculate the total score from the arrival time and allocate each node's score according to the signing order recorded in the packet. The destination node then generates a score verification packet containing the number of the received data packet, the sending and receiving times, the routing nodes traversed, and the score allocated to each node, signs it, and forwards it to the previous node. Each node in turn verifies its allocated score, signs, and passes the packet back toward the source node. The source node checks that the signatures are error-free and forwards the packet to the validation committee. Because this return process noticeably increases network load, the co-evolution strategy can be run periodically according to specific network conditions, balancing the load increase against the co-evolution benefit.
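The per-hop signing described above can be illustrated with a toy hash chain; a real deployment would use public-key signatures, so the `sign` function below is only a hypothetical stand-in that demonstrates the ordering check:

```python
import hashlib

def sign(node_id, payload):
    """Toy 'signature': hash over node id plus everything signed so far."""
    return hashlib.sha256(f"{node_id}:{payload}".encode()).hexdigest()

def build_chain(packet_id, route):
    """Each node along the route signs over the previous node's signature."""
    sigs = []
    payload = packet_id
    for node in route:
        s = sign(node, payload)
        sigs.append((node, s))
        payload = s
    return sigs

def verify_chain(packet_id, sigs):
    """Recompute the chain and check every hop's signature in order."""
    payload = packet_id
    for node, s in sigs:
        if sign(node, payload) != s:
            return False
        payload = s
    return True

sigs = build_chain("pkt-42", ["C1", "C2", "C3"])
```

Because each signature covers the previous one, tampering with any intermediate hop invalidates the rest of the chain, which is what lets the committee detect a falsified signing order.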
The validation committee is selected from all nodes by cryptographic sortition implemented with a verifiable random function (VRF). After receiving a score verification packet, the validation committee first verifies the signatures in the packet; once verification succeeds, the score allocated to each node is added to that node's current score. When the collection phase ends, the validation committee reaches consensus on the node with the highest score and broadcasts it to the network. Each round's validation committee is selected by the previous committee, where the probability of each node being selected is proportional to its score at the end of the previous round. The number of committee verifiers can be adjusted according to the UAV network size and the number of potential attackers. At the beginning of a new consensus period, node scores are cleared.
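Score-proportional selection of the next committee can be sketched as a weighted draw without replacement; the VRF itself is out of scope here, so `random.choices` stands in for the cryptographic sortition:

```python
import random

def select_committee(scores, k, rng=None):
    """Draw k distinct verifiers with probability proportional to score.

    scores: dict node_id -> score at the end of the previous round
    (Illustrative stand-in: a real system would derive the randomness
    from a VRF, not from a local PRNG.)
    """
    rng = rng or random.Random()
    pool = dict(scores)
    chosen = []
    for _ in range(min(k, len(pool))):
        nodes = list(pool)
        weights = [pool[n] for n in nodes]
        pick = rng.choices(nodes, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]          # without replacement
    return chosen

committee = select_committee({"C1": 5.0, "C2": 1.0, "C3": 0.5}, k=2,
                             rng=random.Random(7))
```

Nodes with higher routing scores are proportionally more likely to become verifiers, which ties committee membership to demonstrated routing performance.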
(2) Consensus process
In the blockchain-based distributed co-evolution strategy, the consensus algorithm is divided into a collection phase, an election phase, and a block consensus phase. t_col, t_ele, and t_con are the timers of the collection, election, and block consensus phases respectively; a timer reset to zero marks the start of the corresponding phase, and a timer reaching its given duration marks the end of that phase, entering the next phase or starting the next consensus period. act_flag is the collection/election identifier: when it is set to 0 the collection phase is entered, and when it is set to 1 the election phase is entered. con_flag is the block consensus flag: when it is set to 0 the block consensus phase is entered, and when it is set to 1 the system prepares to enter the next consensus period. The UAV ad hoc network is regarded as one blockchain, with each UAV corresponding to a block. The number of UAVs in the network is defined as m (m ∈ N+), and the UAVs are denoted {C_1, C_2, C_3, ..., C_m}. The UAV elected by the validation committee in the current round is denoted C_sel. The consensus algorithm flow is as follows:
(a) Collecting stage
In the collection phase, UAV C_k executes routing tasks, generates samples, and trains its DQN model with the collected samples. When it receives a score verification packet, C_k verifies whether the score allocated to itself is correct, signs (if it is not a verifier), and passes the packet back to the previous node or forwards it to a nearby verifier. If C_k is a verifier, it is responsible for updating and maintaining the scores of all nodes: on receiving a score verification packet it must verify all signatures therein and add the allocated scores to the corresponding nodes. The algorithm flow is as follows:

[Algorithm pseudocode — rendered as images in the original]
(b) Election stage
In the election phase, the validation committee first selects the next round's validation committee through the VRF and publishes the identities of the newly selected verifiers to the network. The committee then reaches consensus on the node with the highest score in the round and publishes it to the network. The messages announcing the new validation committee and the highest-scoring node are signed by all members of the validation committee. The algorithm flow is as follows:

[Algorithm pseudocode — rendered as images in the original]
(c) Block consensus phase
In the block consensus phase, the committee-elected C_sel broadcasts its signed DQN model parameters to the whole network; the other nodes verify the signature on receipt and, if verification passes, fuse the parameters. If a node does not receive the parameters for a long time, it can send a request to node C_sel.

[Algorithm pseudocode — rendered as images in the original]
Parameter fusion uses a weighted average. The weight parameter ρ adjusts the proportion of the elected parameters in the fusion so as to preserve the stability of the local model. The parameter fusion is calculated as:

θ_k^(r+1) = ρ · θ_sel + (1 − ρ) · θ_k^(r)    (12)

wherein θ_k^(r) and θ_k^(r+1) are respectively the parameters of UAV C_k in the r-th and (r+1)-th consensus periods, and θ_sel is the parameter of the C_sel elected by the consensus algorithm.
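The weighted-average fusion above, applied element-wise over the flattened parameter vectors (ρ and the vectors below are illustrative):

```python
def fuse_parameters(theta_local, theta_sel, rho=0.3):
    """theta_new = rho * theta_sel + (1 - rho) * theta_local, element-wise."""
    return [rho * s + (1.0 - rho) * l for l, s in zip(theta_local, theta_sel)]

fused = fuse_parameters([1.0, 2.0], [3.0, 4.0], rho=0.5)
```

A small ρ keeps each UAV's model close to its locally trained parameters, while a large ρ pulls the whole swarm toward the elected model; this is the knob that trades local stability against co-evolution speed.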
3. Simulation example
To illustrate the effect of the present invention, the performance of the present invention in a given scenario is further described below with reference to a specific simulation example.
The simulation example is based on the PyCharm platform, with a convolutional neural network built using the PyTorch package. The UAV ad hoc network in the simulation consists of fifty mobile nodes that exchange data packets according to preset parameters. To demonstrate the superiority of the invention, the proposed routing algorithm is compared with the traditional GPSR and with an ordinary intelligent routing algorithm, where the ordinary intelligent routing algorithm uses the same DQN model as the proposed algorithm but lacks the blockchain-based distributed co-evolution strategy. Three performance metrics are used in the simulation: end-to-end delay, packet transfer rate, and reward value; the effect of co-evolution can be observed by comparing the reward-value curve of the proposed algorithm with that of the ordinary intelligent routing algorithm.
The simulation results are as follows:
fig. 6 shows the performance of the three routing algorithms in terms of end-to-end delay. It can be seen that after convergence, the end-to-end delay of the routing algorithm provided by the present invention and the normal intelligent routing algorithm is stable at about 200 milliseconds, while the end-to-end delay of the conventional GPSR is about 260 milliseconds. Compared with the common intelligent routing algorithm, the routing algorithm has the advantages of faster delay reduction and earlier convergence.
Fig. 7 is a performance representation of three routing algorithms in terms of packet transfer rate. It can be seen that the packet transfer rate of the routing algorithm provided by the invention and the common intelligent routing algorithm is stabilized at about 72% after convergence, while the packet transfer rate of the conventional GPSR is about 67%. Similar to end-to-end delay, the routing algorithm converges faster than a normal intelligent routing algorithm.
Fig. 8 is a comparison of reward values for the proposed routing algorithm of the present invention and a conventional smart routing algorithm. As shown, the routing algorithm reward value proposed by the present invention grows faster and always higher than the normal smart routing algorithm before convergence.
In conclusion, compared with the traditional GPSR, the distributed coevolution routing technology based on the block chain and the deep reinforcement learning has lower end-to-end delay and higher packet transfer rate, and compared with the common intelligent routing algorithm, the distributed coevolution routing technology based on the block chain and the deep reinforcement learning has higher convergence speed.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (12)

1. A distributed co-evolution method, applied to a network consisting of a plurality of UAV nodes, wherein the network comprises a validation committee consisting of at least one UAV node;
each UAV node is configured with a routed DQN model;
each UAV node is configured with a route evaluation model;
each UAV node is configured as a block;
sequentially forwarding the data packet to a destination node from a source node in a network to form a route from the source node to the destination node corresponding to the data packet;
in an evolution cycle, the following steps are performed:
s1: updating route DQN model parameters:
the UAV node trains a self route DQN model based on a sample generated by the route, and updates self route DQN model parameters;
s2: electing a block-producing node:
inputting a route into the route evaluation model to obtain the scores of the nodes on the route, so that the validation committee elects, according to the scores of the nodes on the route, the node with the highest score as the block-producing node;
s3: broadcasting route DQN model parameters: the block-producing node broadcasts its own route DQN model parameters to the network;
s4: parameter co-evolution: each non-block-producing node receives the route DQN model parameters of the block-producing node and fuses them with its own route DQN model parameters to obtain the evolved route DQN model parameters.
2. The distributed co-evolution method according to claim 1, wherein the updating of the route DQN model parameters specifically comprises:
when the UAV node forwards a data packet, forwarding the data packet to the next node with the maximum Q value based on the route DQN model, and generating a training sample, wherein the sample is used, when the UAV node is idle, to train and update the UAV node's route DQN model and to generate the updated, iterated route DQN model parameters.
3. The distributed co-evolution method according to claim 1, wherein electing the block-producing node specifically comprises:
when the destination node receives a data packet, obtaining the total score of the route based on the route evaluation model and generating a score verification packet according to the total score, wherein the score verification packet is passed back through the network to the source node along the corresponding routing path and comprises the scores allocated to each node; and the validation committee updates the corresponding node scores upon receiving the score verification packet, and elects, in the election stage of the evolution cycle, the node with the highest score as the block-producing node.
4. The distributed co-evolution method according to claim 1, wherein inputting a route into the route evaluation model to obtain the scores of the nodes on the route comprises the following process:
for a route R_h with a source node and a destination node, when the destination node receives a data packet, it generates a corresponding score verification packet according to the route evaluation model, which specifically comprises:
calculating the distance d_h from the source node to the destination node at the moment the source node sends the data packet to the destination node;
calculating the total end-to-end delay T_h of the data packet from the source node to the destination node;
calculating the total score S_h of route R_h according to the distance d_h and the total end-to-end delay T_h;
allocating the total score to the first n nodes on route R_h, wherein the score obtained by each UAV node on route R_h is adjusted by a coefficient τ that sets the weight of the score distribution;
the destination node signs and packages the data packet number, the sending and receiving times, the nodes participating in route R_h, and the scores allocated to those nodes, to generate the score verification packet.
5. The distributed co-evolution method according to claim 1, wherein the validation committee, upon receiving a score verification packet, verifies all signatures therein, superimposes the score allocated to each UAV node onto that node's current score, and updates the score of the corresponding UAV.
6. The distributed co-evolution method according to claim 1, wherein the parameter fusion process is as follows:
obtaining the current route DQN model parameters θ_k^(r) of UAV node C_k;
obtaining the current route DQN model parameters θ_sel of the block-producing node C_sel;
calculating, from θ_k^(r) and θ_sel, the route DQN model parameters θ_k^(r+1) of UAV node C_k for the next evolution period:
θ_k^(r+1) = ρ · θ_sel + (1 − ρ) · θ_k^(r)
wherein θ_k^(r) and θ_k^(r+1) are respectively the route DQN model parameters of UAV node C_k in the current evolution period and in the next evolution period, and ρ is a weight parameter.
7. The distributed co-evolution method according to claim 1, wherein the strategy for generating a route comprises an exploitation routing strategy or an exploration routing strategy, specifically comprising the following process:
before a UAV node forwards a data packet, comparing a random number generated by the UAV node with a preset threshold:
if the random number is greater than or equal to the preset threshold, performing the exploitation routing strategy: the UAV node selects, based on the route DQN model, the neighbor node with the maximum Q value as the next hop;
if the random number is smaller than the preset threshold, performing the exploration routing strategy: the UAV node randomly selects one neighbor node as the next hop.
8. An intelligent routing method based on a routing DQN model is applied to a UAV node, and when the UAV node receives a data packet, the following steps are performed:
acquiring current state space information of the UAV node;
inputting the current state space information into a pre-trained routing DQN model, and determining a neighbor node of a next hop;
the route DQN model is obtained by training according to a UAV node and a training sample corresponding to the UAV node; the training sample includes: current state space information when UAV nodes route, actions, reward values, and maximum Q values in next hop nodes.
9. The intelligent routing method based on the routing DQN model of claim 8, wherein the training process of the routing DQN model comprises:
constructing an initial route DQN model;
acquiring a training sample corresponding to the UAV node;
inputting the training sample into the route DQN model to obtain the Q value of the selected node;
calculating a preset value of a loss function L (theta) according to the training sample and a Q value calculated by the route DQN model;
and updating the route DQN model parameters based on a gradient descent algorithm to minimize a loss function L (theta) to obtain a trained route DQN model.
10. The intelligent routing method based on the routing DQN model of claim 9, wherein in the training samples:
the current state space information s = {NT, VL}, wherein VL is an access list recording the nodes that have already forwarded the data packet, and NT is a neighbor table consisting of each neighbor node's position information, the transmission BER between each neighbor node and the current node, and the queue length of data packets awaiting transmission in each neighbor node's buffer;
the action represents that a neighbor node which is not forwarded is selected as a next hop node;
the reward value is a weighted summation of a reference R value, a congestion penalty value, a forwarding angle reward value, a link quality value and a distance ratio.
11. A UAV, comprising:
the UAV node is configured with a routed DQN model
The UAV node is configured with a route evaluation model;
the UAV node is configured as a block that,
the UAV node performs the following steps in an evolution cycle:
updating parameters:
forwarding a data packet based on a route DQN model and generating a training sample, wherein the sample is used for training and updating the route DQN model configured by the sample per se to obtain a route DQN model parameter in the current consensus period;
and (3) routing scoring:
if the current UAV node is the destination node, obtaining the total score of the route based on the route evaluation model, generating a score verification packet according to the total score, and returning the score verification packet to the previous node, wherein the score verification packet comprises the scores allocated to each node;
if the current UAV node is not the destination node, forwarding the data packet to the next node based on the route DQN model;
returning and scoring:
if the current UAV node is the source node, the score verification package is sent to a verification committee;
if the current UAV node is not the source node, returning the score verification packet to the previous node so that the score verification packet is sequentially returned to the source node according to the corresponding routing path, wherein the source node is used for forwarding the received score verification packet to a verification committee;
node election:
if the current UAV node is a member of the validation committee, verifying, upon receiving a score verification packet, the signature of the score verification packet, updating the score of each UAV node, and electing, in the election stage of the evolution cycle, the node with the highest score as the block-producing node;
co-evolution:
if the current UAV node is the block-producing node, broadcasting its route DQN model parameters for the current consensus period to the network;
if the current UAV node is not the block-producing node, receiving the route DQN model parameters of the block-producing node and fusing them with its own route DQN model parameters.
12. A control device, characterized by comprising:
one or more processors;
a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to implement a distributed co-evolution method according to any one of claims 1 to 7.
CN202210878196.9A 2022-07-25 2022-07-25 Distributed co-evolution method, UAV (unmanned aerial vehicle) and intelligent routing method and device thereof Pending CN115412992A (en)

Similar Documents

Publication Publication Date Title
Kout et al. AODVCS, a new bio-inspired routing protocol based on cuckoo search algorithm for mobile ad hoc networks
Fekair et al. CBQoS-Vanet: Cluster-based artificial bee colony algorithm for QoS routing protocol in VANET
Abd Elmoniem et al. Ant colony and load balancing optimizations for AODV routing protocol
Bhardwaj et al. SecRIP: Secure and reliable intercluster routing protocol for efficient data transmission in flying ad hoc networks
Alghamdi Cuckoo energy-efficient load-balancing on-demand multipath routing protocol
CN108684063A An improved on-demand routing protocol method based on network topology changes
Shah et al. Adaptive routing protocol in mobile ad-hoc networks using genetic algorithm
CN114339661A Whale-optimization-based multipath routing mechanism for aircraft ad hoc networks
Alizadeh et al. Improving routing in vehicular Ad-hoc network with VIKOR algorithm
Qiu et al. Maintaining links in the highly dynamic fanet using deep reinforcement learning
Wen et al. Delay‐Constrained Routing Based on Stochastic Model for Flying Ad Hoc Networks
Hao et al. An adaptive load-aware routing algorithm for multi-interface wireless mesh networks
Tabatabaei et al. A fuzzy logic-based fault tolerance new routing protocol in mobile ad hoc networks
Attia et al. Efficient internet access framework for mobile ad hoc networks
CN115412992A (en) Distributed co-evolution method, UAV (unmanned aerial vehicle) and intelligent routing method and device thereof
Zhou et al. A bidirectional Q-learning routing protocol for UAV networks
Vendramin et al. A Greedy Ant Colony Optimization for routing in delay tolerant networks
CN109862536B Reachability method among multiple communities in a large-scale Internet of Vehicles
Bokhari et al. AMIRA: interference-aware routing using ant colony optimization in wireless mesh networks
Liu et al. A grid and vehicle density prediction-based communication scheme in large-scale urban environments
Hu et al. Performance analysis of routing protocols in UAV ad hoc networks
Leanna et al. Comparison of proactive and reactive routing protocol in mobile adhoc network based on “ant-algorithm”
Ziane et al. Inductive routing based on dynamic end-to-end delay for mobile networks
Li et al. A modified multi-agent reinforcement learning protocol based on prediction for UAANETs
CN117793844B (en) Geographic position routing method based on fuzzy logic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination