Disclosure of Invention
In order to effectively reduce interference and data conflict in a network, improve channel utilization and system throughput, and ensure the reliability of data service transmission among nodes, the invention provides a distributed channel allocation method in a wireless multi-hop network, which adopts a physical architecture at least comprising a physical device layer, a computing layer and a network service layer, wherein the physical device layer comprises n wireless nodes randomly deployed in the network to form a multi-hop wireless communication network, and each node acts as an autonomous Agent and interacts with an uncertain network environment through a local decision module; the aggregation node of the computing layer is responsible for aggregating, analyzing and processing data collected by the other stations in the network; this node has an edge computing function or is replaced by a dedicated edge server node, so that the computing tasks of the wireless nodes can be offloaded to it, and an asynchronous DRL model can be trained based on the experience information acquired by the nodes in a distributed manner; the multi-channel allocation problem is modeled as a POMDP problem, and distributed channel allocation is carried out by utilizing the asynchronous DRL model trained by the centralized node or the edge server.
Further, the multi-channel allocation problem is modeled as a POMDP problem, that is, an Agent observes the current network state s and performs an action a in time period t; after performing action a, the Agent transitions to a network state s' in the next time period with state transition probability P and obtains a corresponding reward R from the environment, where the POMDP problem is expressed as:
M=<S,A,P,R,γ>;
wherein M represents the POMDP problem model; S is the state set, representing the state space; A is the action set, representing the action space, wherein an action a ∈ A represents the channel number to be switched to by a node; P is the state transition probability; R is the reward function; γ is the discount factor. That is, given an environmental state s ∈ S, when an Agent performs an action a ∈ A, the environmental state migrates from s to s', i.e. s → s', while the Agent obtains a corresponding reward R from the environment.
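The tuple M = &lt;S, A, P, R, γ&gt; can be sketched as a minimal data structure; all names and the toy transition/reward below are illustrative stand-ins, not part of the claimed method:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    """Illustrative container for M = <S, A, P, R, gamma>."""
    states: Sequence        # state space S
    actions: Sequence       # action space A (channel numbers)
    transition: Callable    # P: (s, a) -> s'
    reward: Callable        # R: (s, a) -> float
    gamma: float            # discount factor

# Toy instance: K = 3 channels; a state is a tuple of per-channel
# neighbor-occupancy flags, an action is the channel number to switch to.
K = 3
m = POMDP(
    states=[(0, 0, 0)],                  # placeholder state space
    actions=list(range(1, K + 1)),       # channel numbers 1..K
    transition=lambda s, a: s,           # identity transition for the sketch
    reward=lambda s, a: 1.0 if s[a - 1] == 0 else 0.0,  # free channel pays off
    gamma=0.9,
)
print(m.actions)                # [1, 2, 3]
print(m.reward((0, 1, 0), 2))   # 0.0: channel 2 is occupied by a neighbor
```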
Further, the environmental state s_{i,t} observed by node i in the t-th time period is expressed as:
s_{i,t} = (S_{i,t,1}, S_{i,t,2}, ..., S_{i,t,K});
wherein S_{i,t,j} represents the occupancy of channel j by the neighbor nodes of node i during the t-th time period, i.e. the potential interference degree of each channel; K is the number of available channels and N is the number of nodes; S_{i,t,j} = 1 indicates that a neighbor node of node i uses channel j, and S_{i,t,j} = 0 indicates that no neighbor node of node i uses channel j; n_{i,o} is the total number of neighbor nodes of node i.
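A minimal sketch of building node i's observation vector from the channels its neighbors currently use (the function and variable names are illustrative, not from the disclosure):

```python
def observe_state(neighbor_channels, K):
    """Return the per-channel occupancy flags S_{i,t,j} for node i.

    neighbor_channels: channel numbers (1..K) currently used by node i's
    neighbors; K: number of available channels.
    """
    used = set(neighbor_channels)
    return [1 if j in used else 0 for j in range(1, K + 1)]

# Node i has three neighbors on channels 2, 2 and 4 out of K = 4 channels:
s = observe_state([2, 2, 4], K=4)
print(s)  # [0, 1, 0, 1]
```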
Further, the reward R obtained from the environment after the node performs action a and transitions from state s to the next state s' can be expressed as R = R(s, a);
wherein R(s, a) is the reward obtained by node i after switching the channel to channel k in the t-th data period; the reward depends on whether any neighbor node of node i uses channel k in the current period, distinguishing the case in which no neighbor node of node i uses channel k from the opposite case, and on the probability of successful transmission of node i to its neighbor in time period t.
Further, the asynchronous DRL model deployed in the computing layer comprises a current network, a target network, an error calculation module and an experience pool, together with a decision module deployed in each wireless node; the network structure of the local decision module is the same as that of the current network, and its parameters are periodically acquired from the edge node; wherein:
the target network fixes its network parameters and provides the target value function; the current network is used for evaluating the strategy and updating parameters so as to approximate the value function;
the parameter θ of the current network is updated every time period; the parameter θ⁻ of the target network is updated once every fixed number of time periods and is kept unchanged in between;
the experiences e = &lt;s, a, r, s'&gt;, s, s' ∈ S, a ∈ A in the experience pool are asynchronously collected by the nodes in the network from the wireless multi-hop network environment;
the error calculation module updates the parameters of the current network through the TD error calculated from the target network and the current network; in addition, the parameters of the current network are copied to the target network at regular intervals.
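The experience pool that nodes fill asynchronously can be sketched as a bounded buffer with random mini-batch sampling (class and method names are illustrative):

```python
import random
from collections import deque

class ExperiencePool:
    """Bounded replay buffer; nodes append <s, a, r, s'> asynchronously."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest experiences fall out

    def add(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation between experiences.
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

pool = ExperiencePool(capacity=1000)
pool.add((0, 1), 2, 0.8, (0, 0))
pool.add((1, 0), 1, 0.0, (1, 1))
batch = pool.sample(2)
print(len(batch))  # 2
```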
Further, the calculation of the target value function y_t comprises:
y_t = R(s_t, a_t) + γ·max_{a_{t+1}∈A} Q(s_{t+1}, a_{t+1}; θ⁻);
wherein R(s_t, a_t) is the reward obtained in the t-th time period by node i ∈ [1, N] (N is the number of nodes) after executing action a_t ∈ A in state s_t ∈ S; Q(s_{t+1}, a_{t+1}; θ⁻) (s_{t+1} ∈ S, a_{t+1} ∈ A) represents the Q value, computed by the target network with parameter θ⁻, of node i executing action a_{t+1} in state s_{t+1} in the (t+1)-th time period; s_{t+1} is the state of node i in the (t+1)-th time period; a_{t+1} is the action performed by node i in the (t+1)-th time period; max_{a_{t+1}∈A} Q(s_{t+1}, a_{t+1}; θ⁻) means that node i, based on the target network (parameter θ⁻), selects in state s_{t+1} the action a_{t+1} that maximizes the corresponding Q value; γ is the discount factor.
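The target value y_t = R + γ·max Q(s', a'; θ⁻) can be computed as follows; a lookup table stands in for the target neural network here purely for illustration:

```python
def target_value(r, s_next, q_target, gamma):
    """y = r + gamma * max over a' of Q(s', a'; theta^-).

    q_target: dict mapping (state, action) -> Q value under the frozen
    target-network parameters theta^- (a table replaces the network
    in this sketch).
    """
    actions = [a for (s, a) in q_target if s == s_next]
    return r + gamma * max(q_target[(s_next, a)] for a in actions)

# Three candidate actions in state s1; the best target Q value is 0.5.
q_tgt = {("s1", 1): 0.2, ("s1", 2): 0.5, ("s1", 3): 0.1}
y = target_value(r=1.0, s_next="s1", q_target=q_tgt, gamma=0.9)
print(y)  # 1.45  (= 1.0 + 0.9 * 0.5)
```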
Further, the error calculation module calculates the error between the current network Q(s_t, a_t; θ) and the target value y_t:
L(θ) = E[(y_t − Q(s_t, a_t; θ))²];
and gradient descent is used to update the neural network parameters:
θ ← θ − α·∇_θ L(θ);
wherein L(θ) is the TD error function of the model; E[·] denotes the expectation over the selected mini-batch of experience data; θ is the parameter of the current network, updated in real time; α is the learning rate; ∇_θ L(θ) is the corresponding gradient; Q(s_t, a_t; θ) represents the Q value of node i executing action a_t in state s_t in the t-th time period when the network parameter is θ.
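One gradient-descent step θ ← θ − α·∇_θ L(θ) on a mini-batch can be sketched with a linear approximator Q(s, a; θ) = θ·φ(s, a); the linear model is only a stand-in for the neural network, and all names are illustrative:

```python
def q_value(theta, phi):
    """Linear Q(s, a; theta) = theta . phi(s, a)."""
    return sum(t * p for t, p in zip(theta, phi))

def td_step(theta, batch, alpha):
    """One gradient step on L(theta) = E[(y - Q(s, a; theta))^2].

    batch: list of (phi, y) pairs, where phi encodes (s, a) and y is the
    target value produced by the frozen target network.
    """
    grad = [0.0] * len(theta)
    for phi, y in batch:
        err = y - q_value(theta, phi)      # TD error for this sample
        for j, p in enumerate(phi):
            grad[j] += -2.0 * err * p      # d/dtheta_j of (y - Q)^2
    grad = [g / len(batch) for g in grad]  # mini-batch average
    return [t - alpha * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0]
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.5)]
theta = td_step(theta, batch, alpha=0.5)
print(theta)  # [0.5, 0.25] -- each weight moves toward its target
```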
Further, the whole system time is divided into a plurality of consecutive superframe times, wherein one superframe time is one time period; each superframe comprises a beacon frame, a control period and a data transmission period, and the control period adopts a fixed control channel to transmit relevant control information and channel allocation decisions; the data transmission period adopts K non-overlapping channels to support interference-free parallel data transmission; in the control period, all nodes in the network switch to the control channel to listen for and send relevant control information; in the data transmission period, a node with data to be sent switches to the channel where its parent node is located to perform data transmission based on a channel access mechanism.
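The superframe timing can be sketched as a repeating schedule; the slot durations below are illustrative placeholders, not values from the disclosure:

```python
# Illustrative superframe layout: beacon frame, control period (all nodes
# on the fixed control channel), then a data period on K parallel channels.
SUPERFRAME = [
    ("beacon", 1),    # beacon frame: 1 slot
    ("control", 4),   # control information and channel allocation decisions
    ("data", 27),     # interference-free parallel data transmission
]

def phase_at(slot):
    """Map an absolute slot number to its phase within the superframe."""
    length = sum(d for _, d in SUPERFRAME)
    t = slot % length
    for name, dur in SUPERFRAME:
        if t < dur:
            return name
        t -= dur

print(phase_at(0))   # beacon
print(phase_at(3))   # control
print(phase_at(40))  # data  (slot 40 % 32 = 8 falls in the data period)
```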
Further, in the process of executing the action a, the node adopts a channel access mechanism based on RTS/DCTS, which comprises the following steps:
if node d is located at the m-th hop and node i is its next-hop node at the (m+1)-th hop, node d is the parent node of node i; likewise, if node e is located at the m-th hop and node j is its next-hop node at the (m+1)-th hop, node e is the parent node of node j; the four nodes all work on the same channel, and the backoff values of node i and node j are 0;
when the node i sends an RTS frame to the node d, the node d waits for a CIFS time and returns a CTS frame;
after receiving the RTS frame of the node i or the CTS frame of the node d, the child node of the node d sets a corresponding NAV based on the information in the Duration field;
when node e receives the RTS frame from node i, it waits for a SIFS and returns a CTS frame to inform its child nodes to delay data transmission during the transmission of node i;
wherein RTS refers to request to send; CTS refers to clear to send; CIFS is the interframe space after which the destination node returns a CTS; SIFS is the short interframe space used to separate frames belonging to one session; CIFS is slightly larger than SIFS.
Further, if node j is located within the communication range of node i but its parent node is not, then after node j receives the RTS frame it waits for a RIFS and then sends an RTS frame to its parent node e.
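The per-node reaction to an overheard RTS frame, as described in the steps above, can be condensed into one decision table; this is a sketch only (the real mechanism also involves timers and the Duration field), and the role names are illustrative:

```python
def react_to_rts(role):
    """Return the action a node takes on overhearing an RTS, per the
    RTS/DCTS rules sketched above. Roles (illustrative labels):
      'destination'      - the RTS is addressed to this node (parent d)
      'child_of_dest'    - a child of the destination, must defer
      'other_parent'     - a parent node inside the sender's range (node e)
      'exposed_neighbor' - a neighbor whose parent is outside the sender's
                           range (node j): safe to start its own exchange
    """
    return {
        "destination": "wait CIFS, send CTS",
        "child_of_dest": "set NAV from Duration field",
        "other_parent": "wait SIFS, send delayed CTS to silence children",
        "exposed_neighbor": "wait RIFS, then send own RTS to parent",
    }[role]

print(react_to_rts("exposed_neighbor"))
# wait RIFS, then send own RTS to parent
```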
The invention solves the problems of hidden terminals and exposed terminals in high-density multi-hop wireless networks, and effectively avoids data collision and channel resource waste so as to improve overall network performance. In addition, an asynchronous DRL model is provided for the wireless multi-hop multi-channel network to dynamically optimize the channel allocation strategy of each node based on the channel access performance and the channel occupancy of the node in the data transmission period. A novel wireless architecture based on Mobile Edge Computing (MEC) is provided to relieve the computing and storage pressure of terminal nodes, and a framework of distributed interaction (micro learning) and centralized training (macro learning) is designed to train the asynchronous DRL model. Therefore, the asynchronous DRL model proposed by the invention can be implemented even on resource-constrained terminals. In addition, the invention considers the non-stationarity problem in the multi-agent scenario (MAS) and uses only local neighbor information, thereby avoiding severe dynamic changes of the network and further accelerating network convergence.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a distributed channel allocation method in a wireless multi-hop network, which adopts a physical architecture at least comprising a physical device layer, a computing layer and a network service layer; the physical device layer comprises n wireless nodes randomly deployed in the network to form a multi-hop wireless communication network, and each node acts as an autonomous Agent and interacts with an uncertain network environment through a local decision module; the aggregation node of the computing layer is responsible for aggregating, analyzing and processing data collected by the other stations in the network; this node has an edge computing function, so that the computing tasks of the wireless nodes can be offloaded to it, and an asynchronous DRL model can be trained based on the experience information acquired by the nodes in a distributed manner; the multi-channel allocation problem is modeled as a POMDP problem, and channel allocation is performed by utilizing the trained asynchronous DRL model.
Example 1
The present embodiment provides a system architecture diagram, as shown in fig. 2, where the system architecture includes a physical device layer, a computing layer, and a network service layer. The physical equipment layer is a multi-hop wireless communication network formed by n wireless nodes which are randomly deployed in the network, each node is used as an autonomous Agent, and the nodes interact with an uncertain network environment through a local decision module; the aggregation node of the computing layer is responsible for aggregating, analyzing and processing data collected by other stations in the network, has an edge computing function, can unload computing tasks of the nodes, and can train an asynchronous DRL model based on experience information acquired by the nodes in a distributed mode.
In the data transmission process, the present embodiment selects to perform data transmission in a superframe structure, as shown in fig. 3, where the system time is divided into a plurality of consecutive superframe times, and each superframe includes a beacon frame, a control period, and a data transmission period. Wherein, the control period adopts a fixed control channel to transmit related control information and channel allocation decision; the data transmission period employs K non-overlapping channels to support interference-free parallel data transmission. Thus, during a control period, all nodes in the network switch to the control channel to listen and send relevant control information (routing, time synchronization, channel switching, etc.); and switching the node with data to be sent to a channel where a parent node is located in the data transmission period to perform data transmission based on a channel access mechanism.
The asynchronous DRL model adopted in this embodiment is shown in fig. 4; DRL is adopted to solve the dynamic multichannel allocation problem in the multi-hop wireless network. The embodiment of the invention combines the function approximation capability of DQN with the asynchronous experience-sampling architecture of A3C to provide an asynchronous DRL model, aiming to reasonably allocate channels for nodes so as to maximize the reliability of data transmission. The DRL model deployed on the edge server adopts a DQN architecture, introducing a DNN to extract features from the original data so as to approximate the action value function; meanwhile, the asynchronous training framework of A3C is combined to solve the problem that DQN is not suitable for high-dimensional action spaces and MAS, thereby breaking the correlation between experiences, remarkably improving the convergence speed of the network, and solving the problem that the A3C algorithm cannot be implemented on resource-constrained wireless nodes.
The present embodiment considers that the limited computing power, energy and memory capacity of wireless nodes in some scenarios lead to computing bottlenecks and low performance, limiting support for advanced applications and making it impractical to run computationally intensive tasks, such as training the DRL model, on the nodes themselves. Therefore, the embodiment of the invention adopts a wireless network architecture empowered by edge computing and transfers the computing task of training the asynchronous DRL model to the resource-rich edge node (sink node). As shown in fig. 2, the asynchronous DRL model deployed at the computing layer consists of a current network (main), a target network (target) and an experience pool (experience replay). Thus, the edge-computing-enabled sink node performs the training and updating tasks of the model.
When the asynchronous DRL model is adopted for channel allocation, the method combines the function approximation capability of DQN and the asynchronous interaction architecture of A3C. The distributed interaction module (micro learning) in the asynchronous DRL model shown in fig. 4 allows terminal nodes to asynchronously select channel resources using local observation information. In addition, a centralized training module (macro learning) trains the asynchronous DRL model by adjusting the operating parameters, directing the system toward an application-specific global optimization objective (e.g., maximizing the reliability of data transmission). Each terminal node maintains a DRL prediction model to independently allocate channels. Specifically, the embodiment of the invention models the multi-channel allocation problem as a POMDP problem, which consists of the five-tuple M = &lt;S, A, P, R, γ&gt;: state set S, action set A, state transition probability P, reward function R and discount factor γ. The Agent observes the current network state s and performs action a in each control period of time step t, then transitions to the next state with the state transition probability, obtaining reward R_{t+1} from the environment.
State space: S = {S_1, S_2, ..., S_{2K+N}}, where K is the number of available channels and N is the number of nodes. For a particular node i, in the t-th period, its state vector is s_{i,t} = (S_{i,t,1}, S_{i,t,2}, ..., S_{i,t,K});
wherein S_{i,t,j} represents the occupancy of channel j by the neighbor nodes of node i: S_{i,t,j} = 1 indicates that a neighbor node of node i occupies channel j; on the contrary, S_{i,t,j} = 0. n_{i,o} is the total number of neighbor nodes of node i.
Action space: A = {a_1, a_2, ..., a_K}, a_k ∈ A, wherein a_k indicates the channel number that node i switches to in the next data transmission period, a_k = ch_{i,t,k}, ch_{i,t,k} = k ∈ [1, K].
Reward function R. When node i, in the t-th data period, locally observes its state, executes an action and switches to channel ch_{i,t,k}, the environment returns to the node an immediate reward value R = R(s, a) at the end of the data transmission period;
wherein the reward distinguishes the case in which no neighbor node of node i uses channel ch_{i,t,k} in the current data period from the opposite case, and depends on the number of neighbor nodes of node i that use channel ch_{i,t,k} = k as well as on the probability P_s that node i successfully transmits data on channel ch_{i,t,k}.
The edge-computing-enabled aggregation node trains the DRL model in a centralized mode based on the experience information acquired by each node in the network in a distributed, asynchronous mode, and sends the updated network model parameters to the nodes; each node can acquire the latest network parameters from its parent node.
The centralized training process of the DRL model is shown in fig. 5. Two networks with identical structures but different parameters exist in the asynchronous DRL model: the current network predicts the estimated Q value and uses the latest parameters, whereas the target network predicts the target Q value and uses the earlier, frozen parameters. In this embodiment, the state of a node is taken as the input of the neural network, each action the node can perform is treated as a class, and the network outputs a value for each action, i.e., the Q value; for example, Q(s, a; θ) represents the value of the node performing action a when the node state s is input and the parameter of the neural network is θ.
During model training, a mini-batch of experiences is randomly sampled from the experience pool for training so as to break the correlation between experiences. In addition, since the experience information in the experience pool is provided asynchronously by the agents, the correlation between experiences is further broken and richer experiences are provided.
As can be seen from fig. 5, the &lt;s, a&gt; information is used as the input of the current value network to acquire Q(s, a; θ) for evaluating the current state-action value function; the s' information is used as the input of the target value network to obtain the corresponding max Q(s', a'; θ⁻). The calculation of the target value y comprises:
y = R(s, a) + γ·max_{a'∈A} Q(s', a'; θ⁻);
thus, based on the value of y, the DQN error function module can further calculate the error value:
L(θ) = E[(y − Q(s, a; θ))²];
the current network updates the parameters of the current value network based on the gradient of the error function:
θ ← θ − α·∇_θ L(θ);
wherein s, s' ∈ S and a, a' ∈ A. Every fixed number of iterations, the parameters of the current value network are copied to the target value network:
θ⁻ ← θ
repeating the above process to make the network reach a stable state.
Although the asynchronous-DRL-based channel allocation model improves network performance by enabling multiple parallel data transmissions, hidden and exposed terminal problems on specific channels are further exacerbated in highly dense wireless multi-hop network scenarios. Fig. 1 illustrates the hidden terminal and exposed terminal problems in a wireless multi-hop network. When node D is transmitting data to node C, node B is outside the communication range of node D; node B therefore mistakenly deems the channel to be idle, so when node B sends data to nodes C and A at this time, a data collision occurs at node C, which results in unnecessary data retransmission and further aggravates network congestion. Furthermore, when node B1 transmits data to node A1, since node B2 is in the communication range of node B1 while nodes B2 and A2 are not in the communication ranges of nodes A1 and B1 respectively, node B2 mistakenly deems the channel to be busy and delays its data transmission, which leads to unnecessary waste of channel resources. Therefore, the embodiment of the invention proposes to solve the hidden terminal and exposed terminal problems in the wireless multi-hop network based on an RTS/DCTS mechanism. The RTS/DCTS mechanism is further described below by way of example.
Fig. 6 is a schematic diagram of solving the hidden terminal problem in a wireless multi-hop network based on RTS/DCTS according to a preferred embodiment of the present invention. Nodes d and e are located at the m-th hop and nodes i and j at the (m+1)-th hop (adjacent hop counts), and all four nodes operate on the same channel. Node d is the parent of node i and node e is the parent of node j. Node e is also a neighbor node of node i. Let the backoff values of nodes i and j be 0 at this time.
When the node i sends an RTS frame to the node d, the node d waits for a CIFS time and returns a CTS frame;
after receiving the RTS frame of the node i or the CTS frame of the node d, the child node of the node d sets a corresponding NAV based on the information in the Duration field;
When node e receives the RTS frame from node i, it waits for a SIFS and returns a CTS frame to inform its child nodes to delay data transmission during node i's transmission, thereby avoiding the hidden terminal problem.
In the channel access mechanism in a multi-hop environment, hidden terminal problems are unavoidable, so the probability P_s of successful transmission of node i on a specific channel k can be calculated using the following formula:
where τ is the transmission probability in a channel access slot; n_s is the total number of child nodes of the node's parent node; n_a represents the number of neighbor nodes of node i; and n_f represents the number of neighbor nodes of the parent node of node i (excluding the child nodes of the parent node).
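The exact expression for P_s is given by the formula of the disclosure. Purely as an illustrative stand-in using the parameters named above, a generic slotted-access success probability, where node i transmits with probability τ and succeeds only if its n_a + n_f potential interferers stay silent, could be written as follows (this is an assumption, not the disclosed formula):

```python
def p_success(tau, n_a, n_f):
    """Illustrative only: P_s = tau * (1 - tau)^(n_a + n_f).

    tau: per-slot transmission probability; n_a: number of neighbors of
    node i; n_f: number of neighbors of node i's parent (excluding the
    parent's children). A generic slotted-CSMA form, not the patent's
    omitted formula.
    """
    return tau * (1.0 - tau) ** (n_a + n_f)

print(round(p_success(0.2, 3, 2), 4))  # 0.0655  (= 0.2 * 0.8^5)
```

As this toy form already shows, P_s falls as n_a and n_f grow, which motivates using the channel allocation strategy to keep contenders per channel small.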
Referring to fig. 7, fig. 7 is a schematic diagram illustrating the solution of the exposed terminal problem in a wireless multi-hop network based on RTS/DCTS according to a preferred embodiment of the present invention. Nodes d and e are located at the m-th hop and nodes i and j at the (m+1)-th hop (adjacent hop counts), and all four nodes operate on the same channel. Node d is the parent of node i and node e is the parent of node j. Node j is also a neighbor node of node i. Let the backoff values of nodes i and j be 0 at this time.
When node i sends an RTS frame to node d, node d waits for a CIFS time and returns a CTS frame. Because node j is located within the communication range of node i, node j also receives the RTS frame; but since node j is not the destination node of the RTS frame, node j does not immediately set a NAV according to the Duration field information of the RTS;
after node j receives the RTS frame and waits for a RIFS, it judges whether a CTS frame has been received; since its parent node e is not within the communication range of node i, node e does not return a CTS after a SIFS; therefore, node j receives no CTS frame within the RIFS, and node j sends an RTS frame to its parent node e;
the nodes in the network execute the process, so that the problems of data conflict and channel resource waste caused by hidden terminals and exposed terminals in the network can be effectively solved; thus, the probability of successful transmission can be rewritten as:
based on the RTS/DCTS mechanism, data collision between data links under adjacent father nodes on the same channel can be effectively avoided through SIFS and CTS; in addition, the channel access mechanism introduces the RIFS interframe space to solve the problem of violent terminals in the network, thereby improving the successful transmission probability of the nodes, namelyTherefore, the channel access mechanism can improve the successful transmission probability of the nodes in the network;
furthermore, from the above formula, it can be seen that P s And parametersn a And n f Directly related to parameter n s ,n a And n f Further optimization may be achieved by optimizing the channel allocation strategy; therefore, the embodiment of the invention ensures that the successful transmission probability of the node on the channel>A portion of the reward function for the channel allocation model is aimed at further optimizing network performance.
The channel allocation and channel access mechanisms provided by the embodiment of the invention optimize channel resources at different levels: channel allocation optimizes channel resources in the frequency domain, and channel access optimizes them in the time domain. In addition, a reasonable channel allocation mechanism can further alleviate the interference problem in the channel access process, and the channel access performance of the nodes can in turn further refine the channel allocation strategy.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.