Disclosure of Invention
In order to effectively reduce interference and data collision in a network, improve channel utilization and system throughput, and ensure the reliability of data service transmission among nodes, the invention provides a distributed channel allocation method in a wireless multi-hop network. The method adopts a physical framework comprising at least a physical equipment layer, a computation layer and a network service layer. The physical equipment layer forms a multi-hop wireless communication network from n wireless nodes randomly deployed in the network, and each node acts as an autonomous Agent that interacts with the uncertain network environment through a local decision module. The aggregation node of the computation layer is responsible for aggregating, analyzing and processing the data collected by the other nodes in the network; this node has an edge computation function or is a dedicated edge server node, i.e., the computation tasks of the wireless nodes can be offloaded to it, and an asynchronous DRL model can be trained based on experience information collected by the nodes in a distributed manner. The multi-channel allocation problem is modeled as a POMDP problem, and distributed channel allocation is carried out using the asynchronous DRL model trained by the centralized node or edge server.
Further, the multi-channel allocation problem is modeled as a POMDP problem: in time period t, the Agent observes the current network state s and performs an action a; after performing action a, it transitions to the network state s' of the next time period with state transition probability P and obtains a corresponding reward R from the environment. The POMDP problem is expressed as:
M = <S, A, P, R, γ>;
wherein M represents the POMDP problem model; S is the state set, representing the state space; A is the action set, representing the action space, wherein an action a ∈ A is the channel number to which a node switches; P is the state transition probability; R is the reward function; and γ is the discount factor. That is, given the environment state s ∈ S, the Agent performs an action a ∈ A, the environment state migrates from s to s' (s → s'), and the Agent obtains the corresponding reward R from the environment.
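The tuple above can be sketched as a minimal data structure; class and function names here are illustrative assumptions, not prescribed by the invention:

```python
from dataclasses import dataclass

# Minimal sketch of the POMDP tuple M = <S, A, P, R, gamma> for the
# channel allocation problem. Names are illustrative, not from the patent.
@dataclass
class ChannelPOMDP:
    num_channels: int      # K: an action a in A is a channel number in [1, K]
    gamma: float = 0.9     # discount factor

    def actions(self):
        # Action space A: the channel numbers a node may switch to.
        return list(range(1, self.num_channels + 1))

    def step(self, state, action, transition_fn, reward_fn):
        # Given state s and action a, migrate s -> s' and obtain reward R.
        next_state = transition_fn(state, action)
        reward = reward_fn(state, action)
        return next_state, reward

m = ChannelPOMDP(num_channels=4)
s2, r = m.step("s", 2, lambda s, a: "s'", lambda s, a: 1.0)
```

The transition and reward functions are passed in as callables because, in the invention, both are induced by the network environment rather than known in closed form.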
Further, the environmental state s_{i,t} observed by node i in the t-th time period is expressed in terms of the channel occupation indicators S_{i,t,j}; wherein s_{i,t} represents the occupation condition of each wireless channel by the neighbor nodes of node i, namely the potential interference degree of each channel; K is the number of available channels and N is the number of nodes; S_{i,t,j} indicates the occupation of channel j by the neighbor nodes of node i in the t-th time period: S_{i,t,j} = 1 indicates that a neighbor node of node i uses channel j, and S_{i,t,j} = 0 indicates that no neighbor node of node i uses channel j; and n_{i,o} is the total number of neighbor nodes of node i.
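The local observation above can be sketched as follows; the input format (a mapping from each neighbor of node i to its current channel) is a hypothetical convenience, not part of the invention:

```python
# Sketch of node i's local observation in period t: one occupancy flag per
# channel, S_{i,t,j} = 1 if some neighbor of node i uses channel j.
# neighbor_channels maps neighbor id -> channel number (assumed input format).
def observe_state(neighbor_channels, K):
    occupancy = [0] * K
    for ch in neighbor_channels.values():
        occupancy[ch - 1] = 1          # channels are numbered 1..K
    n_io = len(neighbor_channels)      # n_{i,o}: total number of neighbors
    return occupancy, n_io

state, n_io = observe_state({"n1": 2, "n2": 2, "n3": 4}, K=4)
```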
Further, the reward R obtained from the environment when a node, after performing action a, transitions from state s to the next state s' may be expressed as R = R(s, a), wherein R(s, a) is the reward obtained after node i switches its channel to channel k in the t-th data period. The reward is determined by whether any neighbor node of node i uses channel k in the current period (distinguishing the case where no neighbor node of node i uses channel k from the opposite case) and by the neighbor successful transmission probability of node i for the t-th time period.
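The original formula for R(s, a) is not recoverable from this text, so the sketch below is only one plausible reading: a free channel earns the full reward, and a contended channel earns a reward scaled by the neighbor successful-transmission probability. Both the functional form and the names are assumptions.

```python
# Hedged sketch of the reward R(s, a): switching to a channel k that no
# neighbor occupies is rewarded most; otherwise the reward is weighted by the
# neighbor successful-transmission probability p_s and discounted by
# contention. The exact functional form is an assumption, not the patent's.
def reward(num_neighbors_on_k, p_s):
    if num_neighbors_on_k == 0:
        return 1.0                         # no neighbor uses channel k
    return p_s / (1 + num_neighbors_on_k)  # contended channel: scaled reward

r_free = reward(0, p_s=0.8)
r_busy = reward(3, p_s=0.8)
```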
Further, the asynchronous DRL model deployed in the computation layer comprises a current network, a target network, an error calculation module, an experience pool, and a decision module deployed locally on each wireless node; the network structure of the local decision module is the same as that of the current network, and its parameters are periodically acquired from the edge node. Wherein:
the target network fixes its network parameters and produces the target value function;
the current network is used to evaluate the policy, update parameters and approximate the value function;
the parameter θ of the current network is updated every time period; the parameter θ⁻ of the target network is updated every several fixed time periods and kept unchanged in between;
an experience e = <s, a, r, s'> (s, s' ∈ S, a ∈ A) is collected asynchronously by the nodes in the network from the wireless multi-hop network environment;
the error calculation module updates the parameters of the current network through the TD deviation calculated from the target network and the current network; in addition, the parameters of the current network are copied to the target network at regular intervals.
Further, the calculation of the target value function y includes:
y = R(s_t, a_t) + γ max_{a_{t+1} ∈ A} Q(s_{t+1}, a_{t+1}; θ⁻);
wherein R(s_t, a_t) is the reward obtained in the t-th time period by node i ∈ [1, N] (N being the number of nodes) after performing action a_t ∈ A in state s_t ∈ S; Q(s_{t+1}, a_{t+1}; θ⁻) (s_{t+1} ∈ S, a_{t+1} ∈ A) represents the Q value given by the target network (parameter θ⁻) when node i performs action a_{t+1} in state s_{t+1}; s_{t+1} is the state of node i in the (t+1)-th time period; a_{t+1} is the action performed by node i in the (t+1)-th time period; and max_{a_{t+1} ∈ A} Q(s_{t+1}, a_{t+1}; θ⁻) represents node i selecting, based on the target network (parameter θ⁻), the action a_{t+1} in state s_{t+1} that maximizes the corresponding Q value.
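The target value computation is a one-liner once the target network's Q values for the next state are available; here a plain list stands in for the target network's outputs (an assumption, since the invention uses a DNN approximator):

```python
# Sketch of y = R(s_t, a_t) + gamma * max_{a'} Q(s_{t+1}, a'; theta-).
# next_q_values stands in for the target network's Q(s_{t+1}, a; theta-)
# evaluated for every action a in A.
def target_value(reward, next_q_values, gamma):
    return reward + gamma * max(next_q_values)

y = target_value(reward=1.0, next_q_values=[0.2, 0.5, 0.1], gamma=0.9)
```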
Further, the error calculation module calculates the error between the current network Q(s_t, a_t; θ) and the target value y:
L(θ) = E[(y − Q(s_t, a_t; θ))²];
and updates the neural network parameters by gradient descent:
θ ← θ − α∇_θL(θ);
wherein L(θ) is the TD error function of the model; E[·] denotes the expectation over the selected mini-batch of empirical data; θ is the parameter of the current network, updated in real time; α is the learning rate; ∇_θL(θ) is the corresponding gradient; and Q(s_t, a_t; θ) represents the Q value given by the current network (parameter θ) when node i takes state s_t and performs action a_t in the t-th time period.
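The error module's update can be sketched with a tabular Q standing in for the DNN, so that the gradient step on the squared TD error is explicit (for a table entry, dL/dQ = −2(y − Q), so descending the error moves Q toward y). This is a simplification for illustration, not the patent's DNN implementation:

```python
# Mini-batch TD update: L(theta) = mean (y - Q(s,a))^2 over the batch, with
# each table entry nudged toward its target y (gradient descent on L).
def td_update(q, batch, alpha):
    # q: dict (state, action) -> value; batch: list of (state, action, y)
    loss = 0.0
    for s, a, y in batch:
        delta = y - q.get((s, a), 0.0)     # TD deviation
        loss += delta ** 2
        q[(s, a)] = q.get((s, a), 0.0) + alpha * delta
    return loss / len(batch)

q = {}
loss = td_update(q, [(0, 1, 1.0), (0, 2, 0.5)], alpha=0.5)
```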
Furthermore, the whole system time is divided into a plurality of consecutive superframes, one superframe being one time period. Each superframe comprises a beacon frame, a control period and a data transmission period; the control period adopts a fixed control channel to transmit related control information and channel allocation decisions, and the data transmission period adopts K non-overlapping channels to support interference-free parallel data transmission. In the control period, all nodes in the network switch to the control channel to monitor and send the related control information; in the data transmission period, a node with data to send switches to the channel of its parent node to transmit data based on the channel access mechanism.
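The superframe layout can be sketched as a small configuration object; the durations below are illustrative placeholders, as the invention does not fix concrete timing values:

```python
from dataclasses import dataclass

# Sketch of the superframe described above: beacon frame, a control period on
# one fixed control channel, and a data period using K non-overlapping
# channels. All durations are assumed example values.
@dataclass
class Superframe:
    beacon_ms: int = 2
    control_ms: int = 10        # all nodes on the fixed control channel
    data_ms: int = 88           # K channels, interference-free in parallel
    num_data_channels: int = 4  # K

    def duration_ms(self):
        # One superframe = one time period of the system.
        return self.beacon_ms + self.control_ms + self.data_ms

sf = Superframe()
```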
Further, in the process of performing the action a, the node adopts a channel access mechanism based on RTS/DCTS, which includes:
if node d is located at the m-th hop and node i at the adjacent (m+1)-th hop, node d is the parent node of node i; if node e is located at the m-th hop and node j at the adjacent (m+1)-th hop, node e is the parent node of node j; the four nodes work on the same channel, and the backoff values of node i and node j are both 0;
when node i sends an RTS frame to node d, node d waits for a CIFS time and returns a CTS frame;
after receiving the RTS frame of node i or the CTS frame of node d, the child nodes of node d set a corresponding NAV based on the information in the Duration field;
when node e receives the RTS frame from node i, it waits for a SIFS time and returns a CTS frame to inform its child nodes to delay data transmission during the transmission period of node i;
wherein RTS refers to request-to-send; CTS refers to clear-to-send; CIFS is the interframe space used by the destination node before returning a CTS; SIFS is the short interframe space that separates frames belonging to one dialog; and CIFS is slightly larger than SIFS.
Further, if node j is located within the communication range of node i while its parent node is not, then after node j receives the RTS frame and waits for an RIFS, node j sends its own RTS frame to its parent node e.
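The behavior of a node that overhears an RTS can be sketched as the decision rule below. This is a heavily simplified reading of the mechanism: the function name, arguments, and the use of a single boolean for "a CTS was heard within the RIFS" are all assumptions.

```python
# Hedged sketch of the RTS/DCTS rules: the RTS destination answers with a CTS
# after CIFS; an overhearing node that hears a CTS within one RIFS defers
# (sets its NAV); otherwise its parent is outside the sender's range, so it
# may transmit its own RTS in parallel (the exposed-terminal case).
def on_overheard_rts(rts_dest, my_id, heard_cts_within_rifs):
    if rts_dest == my_id:
        return "send_cts_after_cifs"   # we are the RTS destination
    if heard_cts_within_rifs:
        return "set_nav"               # a parallel send would collide: defer
    return "send_rts_to_parent"        # safe to transmit in parallel

# Node j overhears node i's RTS to node d; j's parent e is out of i's range,
# so no CTS arrives within the RIFS and j proceeds with its own RTS.
action = on_overheard_rts(rts_dest="d", my_id="j", heard_cts_within_rifs=False)
```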
The invention solves the problems of hidden terminals and exposed terminals in a high-density multi-hop wireless network, effectively avoids data collision and channel resource waste, and improves overall network performance. In addition, an asynchronous DRL model is provided to dynamically optimize the channel allocation strategy of each node in the wireless multi-hop multi-channel network, based on the node's channel access performance and channel occupation during the data transmission period. A wireless network architecture based on Mobile Edge Computing (MEC) relieves the computing and storage pressure on terminal nodes, and a framework of distributed interaction (micro-learning) and centralized training (macro-learning) is designed to train the asynchronous DRL model; therefore, the asynchronous DRL model proposed by the present invention can be implemented even on resource-constrained terminals. In addition, the invention considers the non-stationarity problem in the multi-agent scenario (MAS): by using only local neighbor information, it avoids severe dynamic changes of the network while further accelerating network convergence.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a distributed channel allocation method in a wireless multi-hop network, which adopts a physical framework comprising at least a physical equipment layer, a computation layer and a network service layer. The physical equipment layer forms a multi-hop wireless communication network from n wireless nodes randomly deployed in the network, and each node acts as an autonomous Agent that interacts with the uncertain network environment through a local decision module. The aggregation node of the computation layer is responsible for aggregating, analyzing and processing the data collected by the other nodes in the network, and has an edge computation function, i.e., the computation tasks of the wireless nodes can be offloaded to it; an asynchronous DRL model can be trained based on experience information acquired by the nodes in a distributed manner, the multi-channel allocation problem is modeled as a POMDP problem, and channel allocation is carried out with the trained asynchronous DRL model.
Example 1
The present embodiment presents a system architecture diagram, as shown in fig. 2; the system architecture includes a physical device layer, a computing layer, and a network service layer. The physical device layer is a multi-hop wireless communication network consisting of n wireless nodes randomly deployed in the network, and each node acts as an autonomous Agent that interacts with the uncertain network environment through a local decision module. The aggregation node of the computing layer is responsible for aggregating, analyzing and processing the data collected by the other nodes in the network, has an edge computing function so that the node's computing tasks can be offloaded, and can train the asynchronous DRL model based on experience information acquired by the nodes in a distributed manner.
In the data transmission process, the present embodiment performs data transmission with a superframe structure, shown in fig. 3: the system time is divided into a plurality of consecutive superframes, and each superframe includes a beacon frame, a control period and a data transmission period. The control period adopts a fixed control channel to transmit the relevant control information and channel allocation decisions; the data transmission period employs K non-overlapping channels to support interference-free parallel data transmission. Thus, during a control period, all nodes in the network switch to the control channel to listen for and transmit related control information (routing, time synchronization, channel switching, etc.); in the data transmission period, a node with data to send switches to the channel of its parent node to transmit data based on the channel access mechanism.
As shown in fig. 4, the asynchronous DRL model adopted in this embodiment uses DRL to solve the problem of dynamic multi-channel allocation in a multi-hop wireless network. The embodiment of the invention combines the function approximation capability of DQN with the asynchronous experience-sampling framework of A3C to propose an asynchronous DRL model, aiming to allocate channels to nodes reasonably so as to maximize the reliability of data transmission. The DRL model deployed on the edge server adopts a DQN framework and introduces a DNN to extract features from raw data and approximate the action value function; combined with the asynchronous training framework of A3C, it overcomes DQN's unsuitability for high-dimensional action spaces and multi-agent systems, breaks the correlation between experiences, remarkably improves the convergence speed of the network, and avoids the problem that the A3C algorithm cannot be realized on resource-limited wireless nodes.
This embodiment considers that the limited computing capability, energy and memory of wireless nodes in certain scenarios cause computing bottlenecks and low performance, limit the support of high-level applications, and make it difficult to run a compute-intensive task such as the training of the DRL model. Therefore, the embodiment of the invention adopts an edge-computing-enabled wireless network architecture and transfers the computing task of training the asynchronous DRL model to resource-rich edge nodes (sink nodes). As shown in fig. 2, the asynchronous DRL model deployed at the computation layer is composed of a current network (main), a target network (target), and an experience pool (experience replay); the edge-computing-enabled sink nodes thus complete the training and updating tasks of the model.
When the asynchronous DRL model is adopted for channel allocation, the invention combines the function approximation capability of DQN and the asynchronous interaction architecture of A3C. The distributed interaction module (micro-learning) in the asynchronous DRL model presented in fig. 4 allows the terminal nodes to asynchronously select channel resources using local observation information. In addition, a centralized training module (macro-learning) trains the asynchronous DRL model by adjusting operating parameters, thereby directing the system toward an application-specific global optimization goal (e.g., maximizing the reliability of data transmission). Each terminal node maintains a DRL prediction model to independently allocate channels. In particular, embodiments of the present invention model the multi-channel allocation problem as a POMDP problem, which consists of a five-tuple M = <S, A, P, R, γ>: state s, action a, state transition probability P, reward function R, and discount factor γ. The Agent observes the current network state s and executes action a in the control period of each time step t. Then, the system transitions to the next state according to the state transition probability and obtains the reward R_{t+1} from the environment.
State space: S = {S_1, S_2, ..., S_{2K+N}}, where K is the number of available channels and N is the number of nodes. For a particular node i in the t-th cycle, its state vector is s_{i,t}; wherein S_{i,t,j} indicates the occupancy of channel j by the neighbor nodes of node i: S_{i,t,j} = 1 indicates that a neighbor node of node i occupies channel j; otherwise, S_{i,t,j} = 0. n_{i,o} is the total number of neighbor nodes of node i.
Action space: A = {a_1, a_2, ..., a_K}, a_k ∈ A, where a_k indicates the channel number to which node i switches in the next data transmission period, a_k = ch_{i,t,k}, ch_{i,t,k} = k ∈ [1, K].
Reward function, R. When node i, in local observation state s_{i,t} in the t-th data period, performs an action and switches to channel ch_{i,t,k}, the environment returns an immediate reward value R(s, a) to the node at the end of the data transmission cycle. The reward is determined by the number of neighbor nodes of node i that use channel ch_{i,t,k} in the current data cycle (zero when no neighbor node of node i uses that channel) and by the probability of successful transmission when the node performs data transmission on ch_{i,t,k}.
The edge-computing-enabled sink node trains the DRL model in a centralized manner based on the experience information acquired asynchronously and in a distributed manner by each node in the network, and sends the updated network model parameters to the nodes; each node can acquire the latest network parameters from its parent node.
The centralized training process of the DRL model is shown in fig. 5. Two networks with the same structure but different parameters exist in the asynchronous DRL model: the current network predicts the estimated Q value using the latest parameters, and the target network predicts the target Q value using earlier parameters. In this embodiment, the state of a node is used as the input of the neural network, the different actions a node can execute serve as the classes, and the neural network predicts the value of each action executed by the node as its output, namely the Q value; for example, Q(s, a; θ) represents the value of executing action a given the input state s of the node when the parameter of the neural network is θ.
When the model is trained, some (a mini-batch of) experiences are randomly taken out of the experience pool for training, so as to break the correlation between experiences. In addition, because the experience information in the experience pool is provided by the agents in an asynchronous sampling manner, the correlation between experiences is further broken and richer experience is provided.
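The experience pool and mini-batch sampling described above can be sketched directly with standard containers; the pool capacity, batch size and experience contents below are illustrative:

```python
import random
from collections import deque

# Sketch of the experience pool: nodes asynchronously append experiences
# e = <s, a, r, s'>, and training draws a uniform random mini-batch to break
# the correlation between consecutive experiences.
pool = deque(maxlen=10000)             # bounded replay buffer
for t in range(100):                   # illustrative filler experiences
    pool.append((t, t % 4, 0.0, t + 1))  # (s, a, r, s')

random.seed(0)                         # for reproducibility in this sketch
batch = random.sample(list(pool), 8)   # uniform mini-batch
```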
As can be seen from fig. 5, the <s, a> information is used as the input of the current value network to obtain Q(s, a; θ), which is used to evaluate the current state-action value function; the s' information is used as the input of the target value network to obtain the corresponding max_{a'} Q(s', a'; θ⁻); the target value y is then calculated as:
y = R(s, a) + γ max_{a' ∈ A} Q(s', a'; θ⁻);
thus, based on y and Q(s, a; θ), the DQN error function module can further calculate:
L(θ) = E[(y − Q(s, a; θ))²];
and the current network updates its parameters based on the gradient of the error function:
θ ← θ − α∇_θL(θ);
wherein s ∈ S and a ∈ A. After a certain number of iterations, the parameters of the current value network are copied to the target value network:
θ⁻ ← θ;
and the above process is repeated until the network reaches a stable state.
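The update schedule, with its periodic parameter copy θ⁻ ← θ, can be sketched as a plain loop; the per-iteration "gradient step" and the copy interval are placeholders for illustration:

```python
# Sketch of the training schedule: the current network's parameters change
# every iteration, and every `copy_every` iterations they are copied to the
# target network (theta- <- theta). Plain lists stand in for parameters.
def train(iterations, copy_every):
    theta = [0.0]
    theta_target = list(theta)
    copies = 0
    for it in range(1, iterations + 1):
        theta[0] += 0.1                 # stands in for a gradient-descent step
        if it % copy_every == 0:
            theta_target = list(theta)  # theta- <- theta
            copies += 1
    return theta, theta_target, copies

theta, theta_target, copies = train(iterations=10, copy_every=5)
```

Keeping θ⁻ frozen between copies is what makes the target value y stable while the current network chases it.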
Although the asynchronous-DRL-based channel allocation model improves network performance by enabling multiple parallel data transmissions, the hidden and exposed terminal problems on a specific channel are further exacerbated in a high-density wireless multi-hop network scenario. Fig. 1 illustrates the hidden terminal and exposed terminal problems in a wireless multi-hop network. When node D is transmitting data to node C, node B is located outside the communication range of node D and therefore mistakenly considers the channel idle; when node B then sends data toward nodes C and A, a data collision occurs at node C, causing unnecessary retransmission and further aggravating network congestion. Furthermore, when node B1 transmits data to node A1, node B2 is within the communication range of node B1, while node B2 is outside the range of node A1 and node A2 is outside the range of node B1; node B2 nevertheless mistakenly considers the channel busy and delays its data transmission, which causes unnecessary waste of channel resources. Therefore, the embodiment of the present invention proposes to solve the hidden terminal and exposed terminal problems in the wireless multi-hop network based on the RTS/DCTS mechanism. The RTS/DCTS mechanism is further described below by way of example.
Fig. 6 is a diagram illustrating a solution to the hidden terminal problem in the wireless multi-hop network based on RTS/DCTS according to a preferred embodiment of the present invention. Wherein, nodes i and j, and nodes d and e are respectively located at m and m +1 hops (which refer to different and adjacent hop counts) and operate on the same channel. Node d is a parent node of node i and node e is a parent node of node j. Node e is also a neighbor node of node i. Assume that the backoff values of nodes i and j are both 0 at this time.
When the node i sends an RTS frame to the node d, the node d waits for a CIFS time and returns a CTS frame;
after receiving the RTS frame of the node i or the CTS frame of the node d, the child node of the node d sets a corresponding NAV based on the information in the Duration field;
when node e receives the RTS frame from node i, waits for a SIFS, and returns a CTS frame to inform its child node of delaying data transmission during the transmission of node i, thereby avoiding the hidden terminal problem.
In the channel access mechanism in the multi-hop environment, the hidden terminal problem is unavoidable, so the probability of successful transmission of node i on a particular channel k, P_s, can be calculated as a function of the transmission probability τ in the channel access slot and the following parameters: n_s, the total number of child nodes of the parent node of node i; n_a, the number of neighbor nodes of node i; and n_f, the number of neighbor nodes of the parent node of node i (excluding the child nodes of that parent node).
Referring to fig. 7, fig. 7 is a schematic diagram illustrating an example of solving the problem of exposed terminals in a wireless multi-hop network based on RTS/DCTS according to a preferred embodiment of the present invention. Wherein, nodes i and j, and nodes d and e are respectively located at m and m +1 hops (which refer to different and adjacent hop counts) and operate on the same channel. Node d is a parent node of node i and node e is a parent node of node j. Node j is also a neighbor node to node i. Assume that the backoff values of nodes i and j are both 0 at this time.
When node i sends an RTS to node d, node d waits for a CIFS time and returns a CTS frame. Since node j is within the communication range of node i, node j also receives the RTS frame; but since node j is not the destination of that RTS frame, node j does not set its NAV according to the Duration field of the RTS.
After receiving the RTS frame, node j waits for an RIFS and checks whether a CTS frame is received. Since node j's parent node e is not within the communication range of node i, node e does not return a CTS after SIFS; therefore, node j receives no CTS frame within the RIFS and sends its own RTS frame to its parent node e.
the nodes in the network execute the above processes, so that the problems of data conflict and channel resource waste caused by hidden terminals and exposed terminals in the network can be effectively solved; thus, the successful transmission probability can be rewritten as:
based on the RTS/DCTS mechanism, data collision between data links under adjacent father nodes on the same channel can be effectively avoided through SIFS and CTS; in addition, the channel access mechanism introduces RIFS interframe space to solve the problem of violent terminals in the network, thereby improving the successful transmission probability of the nodes, namely
Therefore, the channel access mechanism can improve the successful transmission probability of the nodes in the network;
in addition, P can be seen from the above formula
sAnd parameters
n
aAnd n
fDirectly related, while the parameter n
s,n
aAnd n
fCan be further optimized by optimizing the channel allocation strategy; therefore, the embodiment of the invention ensures the successful transmission probability of the node on the channel
As part of the channel allocation model reward function, to further optimize network performance.
The channel allocation and channel access mechanisms provided by the embodiment of the invention optimize channel resources at different levels: channel allocation optimizes channel resources in the frequency domain, and channel access optimizes them in the time domain. In addition, a reasonable channel allocation mechanism can further alleviate interference during channel access, while the channel access performance of the nodes can in turn further optimize the channel allocation strategy.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.