CN109660375B - High-reliability self-adaptive MAC (media Access control) layer scheduling method - Google Patents

High-reliability self-adaptive MAC (Media Access Control) layer scheduling method

Info

Publication number
CN109660375B
CN109660375B (application number CN201710946487.6A)
Authority
CN
China
Prior art keywords
action
denotes
probability
feedback
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710946487.6A
Other languages
Chinese (zh)
Other versions
CN109660375A (en)
Inventor
刘元安
张洪光
王怡浩
范文浩
吴帆
谢刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201710946487.6A
Publication of CN109660375A
Application granted
Publication of CN109660375B
Legal status: Active
Anticipated expiration

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 - Network analysis or design
    • H04L41/145 - Network analysis or design involving simulating, designing, planning or modelling of a network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 - Network management architectures or arrangements
    • H04L41/044 - Network management architectures or arrangements comprising hierarchical management structures
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 - Supervisory, monitoring or testing arrangements
    • H04W24/06 - Testing, supervising or monitoring using simulated traffic
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W84/00 - Network topologies
    • H04W84/18 - Self-organising networks, e.g. ad-hoc networks or sensor networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a high-reliability self-adaptive MAC layer scheduling method. The method mainly solves the problem that idle listening by cluster head nodes in a wireless sensor network causes a large amount of energy consumption. The method comprises the following steps: establishing a model of the wireless sensor network; generating a specific frame format and embedding the queue occupancy rate and delay in the frame control field; initializing an action set, an action selection probability set and a feedback set; the coordinator interacts with the surrounding environment using the learning automaton method and updates its action and state; the whole learning process is divided into three stages, and a corresponding search strategy is adopted in the initial stage, the exploration stage and the greedy stage; evaluating the interaction between the action and the environment, and updating the feedback and selection probability sets; and selecting, based on the feedback set, the relevant parameters that determine the duty cycle, thereby realizing self-adaptive MAC layer scheduling. The embodiment of the invention enables the duty cycle of the node to be adjusted adaptively during operation, minimizes the power consumption and has wide application value.

Description

High-reliability self-adaptive MAC (Media Access Control) layer scheduling method
Technical Field
The invention belongs to the technical field of wireless sensor networks, and particularly relates to a high-reliability self-adaptive MAC (media access control) layer scheduling method.
Background
Wireless Sensor Network (WSN) nodes are typically battery powered, and in many deployment environments replacing batteries or charging them electromagnetically is expensive or even infeasible. Therefore, low power consumption is considered the most important requirement for wireless sensor network communication protocols. In particular, a node does not know when other nodes will transmit data, so its transceiver stays in receive mode even when the node is idle. Idle listening is therefore considered one of the major sources of energy waste.
Currently, the most widely adopted IEEE 802.15.4 standard defines several different types of nodes: a Full Function Device (FFD), also known as a beacon-enabled device, may operate as a personal area network coordinator, cluster head or router, while a Reduced Function Device (RFD), also known as a non-beacon device, may only operate as an end device. When an FFD acts as a cluster head, it quickly drains its energy, because it cannot predict when other sensor nodes will send their data and therefore needs to stay in receive mode all the time to receive all the collected information. To overcome this problem, the standard defines a beacon-enabled mode. This mode supports the transmission of beacon frames from the coordinator to the end devices, which allows node synchronization. All devices can then sleep between coordinator transmissions, which helps reduce idle listening and thus extends the network lifetime.
In recent years, many duty cycle adjustment algorithms have been proposed for such situations. For example, one approach modifies the reserved frame control field in the MAC frame header and collects information such as the transmission queue occupancy and the end-to-end delay of the collection node to select the duty cycle. Another solution uses reinforcement learning, whose main goal is to find the optimal duty cycle, and adjusts the sleep time of the SMAC protocol in the WSN environment, taking the number of frames queued for transmission as the state and the reserved active time as the action. However, this means that a large number of state-action pairs need to be stored, which is undesirable in wireless sensor nodes with limited memory resources. Recently, an extension of the CAP has been proposed, based on a busy tone emitted by a device at the end of the standard CAP. A busy tone is sent only when a device fails to send all of its data frames, and the CAP is extended if any device still has real-time data in its transmission queue at the end of the CAP. However, these extensions do not conform to the standard and require modification of the superframe structure.
Disclosure of Invention
The embodiment of the invention provides a high-reliability self-adaptive MAC layer scheduling method, which adaptively adjusts the duty cycle during operation without human intervention, so as to minimize the power consumption while balancing the probability of successful data transmission against the delay constraints of the application.
In order to achieve the above object, an embodiment of the present invention provides a highly reliable adaptive MAC layer scheduling method, which is applied to a coordinator device in a wireless sensor network, and the method includes:
the method comprises the steps of establishing a model according to the wireless sensor network environment, wherein the wireless sensor network environment model is represented by a triple E = (α, β, p), in which α represents the action set, i.e. the input, of the node's automatic learning and, in the invention, the set of candidate duty cycles of the node; β represents the feedback signal output by the node after selecting a duty cycle and interacting with the environment.
Specifically, the environment can be divided into a P-model and a Q-model according to the type of the β value: in the P-model the feedback signal is Boolean (0 or 1), while in the Q-model it is a continuous random variable in [0,1]. The P-model is adopted in the invention because this control model is simple and easy to use. p = {p1, p2, ..., pr} denotes a series of reward and punishment probabilities, and each learning automaton action α_i has a corresponding p_i.
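By way of illustration only, the following minimal Python sketch shows one possible representation of this environment triple under the P-model; the class name, field names and the example penalty probabilities are assumptions made for the example and are not part of the invention.

    import random
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PModelEnvironment:
        """Environment triple E = (alpha, beta, p) under the P-model (Boolean feedback)."""
        actions: List[float]        # alpha: candidate duty cycles of the node
        penalty_prob: List[float]   # p: reward/punishment probability per action
        last_feedback: int = 0      # beta: Boolean feedback (0 = reward, 1 = penalty)

        def respond(self, action_index: int, rng: random.Random) -> int:
            """Return the Boolean feedback for the chosen action."""
            self.last_feedback = 1 if rng.random() < self.penalty_prob[action_index] else 0
            return self.last_feedback

    # Example: three candidate duty cycles with assumed penalty probabilities
    env = PModelEnvironment(actions=[1.0, 0.5, 0.25], penalty_prob=[0.1, 0.3, 0.6])
    feedback = env.respond(1, random.Random(0))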
The node generates a specific frame structure format and embeds parameters such as the queue occupancy rate and the queuing delay in the reserved bits of the frame control field.
Specifically, to avoid introducing any additional overhead, each terminal device embeds the queue occupancy O and the queuing delay D in the frame control structure of each data frame transmitted, using the 3 reserved bits of the frame control field as shown in fig. 3.
It should be noted that each sender uses two bits to represent 4 different levels of queue occupancy o_i, and one bit to divide the queuing delay d_i into 2 levels.
The coordinator (FFD) performs traffic estimation to generate a traffic-adaptive duty cycle set.
It should be noted that the present invention assumes that the wireless sensor network has a star topology, and the coordinator collects the data sent by the terminal devices. Each coordinator estimates the incoming traffic by computing idle listening, packet accumulation, and delay in the terminal device transmit queues.
The coordinator initializes its action set, action selection probability set and feedback set.
Specifically, the learning automaton is a probability-based learning tool that selects its activities through a random activity probability vector P_i(t); the activity probability vector is the main building block of the learning automaton and must therefore be kept updated at all times.
In the initial stage, in order to prevent a large amount of data from being lost when the data traffic of the wireless sensor network is heavy, the action is set to the maximum duty cycle, that is, the coordinator is always in the receiving state, and the corresponding action selection probability is also 1, which ensures that the coordinator can collect more information about the network in the early stage.
The coordinator (FFD) interacts with the surrounding environment using the learning automata (LA) method.
Specifically, the variable-structure learning automaton model may be represented by a triple LA = (α, β, p), where α = {α1, α2, ..., αr} denotes the action set of the learning automaton, β = {β1, β2, ..., βr} denotes the set of feedback signals given by the environment, and p = {p1, p2, ..., pr} denotes the set of action probabilities, satisfying
Σ_{i=1}^{r} p_i(n) = 1
where p_i(n) represents the action probability corresponding to α_i after the n-th round of the learning process.
Selecting an exploration strategy: different exploration strategies are selected in different periods.
Specifically, the whole learning process is divided into three stages: an initial stage, an exploration stage and a greedy stage;
in the initial stage, all actions in the set are explored in a deterministic manner using a cyclic search strategy: the node selects the highest duty cycle at the start and slowly reduces it until the minimum duty cycle is reached, ensuring that every duty cycle in the set is tried.
In the exploration stage, if the received reward has increased, actions with a higher duty cycle than the current selection are explored at random; otherwise, if the reward remains the same or decreases, actions with a lower duty cycle are explored at random.
In the greedy stage, after learning with the exploration strategy for a period of time, the node has become sufficiently familiar with the environment and can begin to select actions autonomously.
Evaluating the influence of the action on data transmission after interaction with the environment, updating the feedback set, and updating the action selection probability set
In particular, the coordinator updates the reward for each beacon interval by using feedback received from the sender during the last activity duration.
Selecting an action: the BO (beacon order) and SO (superframe order) standard parameters that determine the duty cycle are selected based on the feedback set, realizing adaptive MAC scheduling.
After selecting the action value, the BO and SO standard parameters that determine the duty cycle are adjusted.
In order to achieve the above object, an embodiment of the present invention provides a highly reliable adaptive MAC layer scheduling apparatus, which is applied to a coordinator device in a wireless sensor network, and the apparatus includes:
a generation unit: generating a specific frame control structure format, and embedding parameters such as queue occupancy rate, queue delay and the like into reserved bits of a frame control field;
a transmission unit: each sensor node fills in the generated frame format according to its own condition and sends it to the coordinator;
a receiving unit: used for receiving the data frames sent by each sensor node after the sensor node accesses the channel; each data frame contains at least the queue occupancy rate and queuing delay parameters;
an evaluation unit: evaluating the selection probability of the action according to the parameters sent by the sensor nodes and the working state of the coordinator;
an autonomous learning unit: updating the action set, the action selection probability set and the feedback set of the node by means of the learning automaton method;
a policy selection unit: determining which stage the learning process is in and adopting the corresponding strategy, namely a cyclic exploration strategy in the initial stage, a random strategy in the exploration stage, and a greedy strategy in the final greedy stage;
a self-adaptive adjusting unit: after the action is selected, adjusting the parameters BO and SO based on the feedback set and the action set to complete the adaptive MAC scheduling.
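As a purely illustrative sketch of how these units might map onto code, the following Python skeleton groups them into a single coordinator class; all class and method names, and the placeholder bodies, are assumptions made for the example rather than part of the invention.

    class AdaptiveMacCoordinator:
        """Skeleton grouping the units described above; method bodies are placeholders."""

        def generate_frame_format(self, occupancy_level: int, delay_level: int) -> dict:
            # generation unit: embed queue occupancy / queuing delay in reserved bits
            return {"occupancy": occupancy_level & 0b11, "delay": delay_level & 0b1}

        def receive(self, frames: list) -> list:
            # receiving unit: collect the data frames sent by the sensor nodes
            return [f for f in frames if "occupancy" in f and "delay" in f]

        def evaluate(self, frames: list) -> float:
            # evaluation unit: derive a feedback value from the reported parameters
            if not frames:
                return 0.0
            return sum(f["occupancy"] for f in frames) / (3.0 * len(frames))

        def learn(self, feedback: float) -> None:
            # autonomous learning unit: update action set, selection probabilities, feedback set
            pass

        def select_strategy(self, stage: str) -> str:
            # policy selection unit: cyclic search, random exploration, or greedy selection
            return {"initial": "cyclic", "exploration": "random"}.get(stage, "greedy")

        def adjust_duty_cycle(self, bo: int, so: int) -> float:
            # adaptive adjustment unit: duty cycle follows from BO and SO (2 ** (SO - BO))
            return 2.0 ** (so - bo)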
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a high-reliability adaptive MAC layer scheduling method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model of a learning automaton according to an embodiment of the present invention;
FIG. 3 is a block diagram of a frame control format according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a high-reliability adaptive MAC layer scheduling apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a MAC layer scheduling node transmission collision according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical scheme of the invention is specifically explained according to the attached drawings.
The high-reliability self-adaptive MAC layer scheduling method comprises the following steps:
and S101, establishing a model of the wireless sensor network.
Specifically, the wireless sensor network environment model is represented by a triple E = (α, β, p), where α represents the action set, i.e. the input, of the node's automatic learning and, in the invention, the set of candidate duty cycles of the node; β represents the feedback signal output by the node after selecting a duty cycle and interacting with the environment.
Specifically, the environment can be divided into a P-model and a Q-model according to the type of the β value: in the P-model the feedback signal is Boolean (0 or 1), while in the Q-model it is a continuous random variable in [0,1], which suits practical control applications; the P-model is widely applied in wireless sensor network research because this control model is simple and easy to use. p = {p1, p2, ..., pr} denotes a series of reward and punishment probabilities, and each learning automaton action α_i has a corresponding p_i. In the invention, the P-model is adopted to model the wireless sensor network environment.
S102, the node generates a specific frame structure format, and embeds parameters such as queue occupancy rate, queue delay and the like by using reserved bits of a frame control field.
Specifically, to avoid introducing any additional overhead, each terminal device embeds the queue occupancy O and the queuing delay D in the frame control structure of each data frame transmitted, using the 3 reserved bits of the frame control field as shown in fig. 3.
It should be noted that each sender uses two bits to represent 4 different levels of queue occupancy o_i, and one bit to divide the queuing delay d_i into 2 levels. With this information, the coordinator can estimate the queue occupancy O and the queuing delay D. The queue occupancy O is defined as follows:
O = 1, if any node reaches or exceeds the maximum number of frames that can be stored in its queue; otherwise, O equals the average of the occupancies reported at the start of the contention access period (CAP), i.e. the highest occupancy accumulated during the inactive period.   (1)
Expressing the queue occupancy O with 2 bits saves space and reduces the fluctuation range of the value.
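To illustrate this embedding, the short Python sketch below packs a 2-bit occupancy level and a 1-bit delay flag into three reserved bits; the exact bit positions within the frame control field are an assumption made only for the example.

    def pack_reserved_bits(occupancy_level: int, delay_flag: int, shift: int = 7) -> int:
        # Pack the 2-bit occupancy level (0-3) and the 1-bit delay flag into
        # 3 reserved bits starting at the assumed bit position `shift`.
        assert 0 <= occupancy_level <= 3 and delay_flag in (0, 1)
        return ((occupancy_level << 1) | delay_flag) << shift

    def unpack_reserved_bits(frame_control: int, shift: int = 7) -> tuple:
        # Recover (occupancy_level, delay_flag) from the frame control field.
        bits = (frame_control >> shift) & 0b111
        return bits >> 1, bits & 0b1

    fc = pack_reserved_bits(occupancy_level=2, delay_flag=1)
    assert unpack_reserved_bits(fc) == (2, 1)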
S103, the coordinator (FFD) carries out traffic estimation to generate a traffic-adaptive duty cycle set.
It should be noted that the present invention assumes that the wireless sensor network has a star topology, and the coordinator collects the data sent by the terminal devices. Each coordinator estimates the incoming traffic by computing idle listening, packet accumulation, and delay in the terminal device transmit queues. The expression for the idle listening IL is as follows:
IL = 1.0 - SF_u   (2)
where SF_u, the superframe utilization, is the ratio of the time the terminal devices occupy the superframe to the total time available for data communication, defined as:
SF_u = (T_b + T_c + T_r) / SD   (3)
where SD is the superframe duration, T_b is the time taken by the coordinator for beacon transmission, T_c is the time a device occupies the channel due to frame collisions, and T_r is the time for data reception.
Illustratively, in type 1 (C1), the sender node under consideration (node A) ends its transmission first, while the transmissions of the other nodes still continue, see fig. 5. In type 2 (C2), sender A completes its transmission after the collision occurs. Finally, in type 3 (C3), both nodes end their transmissions at the same time. To detect C1 and C2, A or B can listen to the channel to detect the other transmission if they are within range of each other. The sender therefore concludes that a collision has occurred if, while listening to the channel after its transmission, it detects a busy channel and does not receive an acknowledgement frame. On the other hand, to detect C3, the receiver senses that the received energy rises above its CCA threshold but is not synchronized with a start frame delimiter.
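A minimal Python sketch of this traffic estimate is given below; it assumes, as a simplification, that the superframe utilization is the fraction of the superframe duration spent on beacon transmission, collisions and data reception, and the numeric values are arbitrary examples.

    def superframe_utilization(t_beacon: float, t_collision: float, t_receive: float,
                               superframe_duration: float) -> float:
        # SF_u: share of the superframe actually used for communication (assumed form)
        return min(1.0, (t_beacon + t_collision + t_receive) / superframe_duration)

    def idle_listening(sf_u: float) -> float:
        # IL = 1.0 - SF_u, as in equation (2)
        return 1.0 - sf_u

    sf_u = superframe_utilization(t_beacon=0.96, t_collision=1.5, t_receive=5.0,
                                  superframe_duration=15.36)
    il = idle_listening(sf_u)   # a higher IL means more energy wasted in idle mode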
S104, the coordinator initializes its action set, action selection probability set and feedback set.
In particular, the learning automaton is a probability-based learning tool that selects its activities through a random activity probability vector P_i(t); the activity probability vector is the main building block of the learning automaton and must therefore be kept updated at all times. P_i(t) denotes the probability that node n_i selects a certain duty cycle at time t; in the invention, it is set to the expected value of the total feedback return of the corresponding duty cycle.
In order to prevent a large amount of data from being lost when the data traffic of the wireless sensor network is heavy, the action is initially set to the maximum duty cycle, that is, the coordinator is always in the receiving state, and the corresponding action selection probability is also 1, which ensures that the coordinator can collect more information about the network in the early stage.
S105, the coordinator (FFD) interacts with the surrounding environment using the learning automata (LA) method.
It should be noted that the variable-structure learning automaton model may be represented by a triple LA = (α, β, p), where α = {α1, α2, ..., αr} denotes the action set of the learning automaton, β = {β1, β2, ..., βr} denotes the set of feedback signals given by the environment, and p = {p1, p2, ..., pr} denotes the set of action probabilities, satisfying
Σ_{i=1}^{r} p_i(n) = 1
where p_i(n) represents the action probability corresponding to α_i after the n-th round of the learning process; the probabilities satisfy the update formula p(n+1) = T(α(n), β(n), p(n)), where T denotes the learning algorithm. The general learning mechanism of the learning automaton, when action α_i is selected in round n, is defined as follows:
p_j(n+1) = p_j(n) - g_j(p(n)) for all j ≠ i, and p_i(n+1) = p_i(n) + Σ_{j≠i} g_j(p(n)), when the environment rewards the action   (6)
p_j(n+1) = p_j(n) + h_j(p(n)) for all j ≠ i, and p_i(n+1) = p_i(n) - Σ_{j≠i} h_j(p(n)), when the environment penalizes the action   (7)
wherein a(n) and b(n) are the weight coefficients of the linear functions g_i and h_i; they may be defined as linear functions or constants, depending on the specific application. A P-environment model is adopted, in which the feedback signal takes the value 0 or 1; when the feedback signal is 0, the environment gives a reward signal. When the feedback signal takes 0, the corresponding probability update, for the selected action α_i, is represented as follows:
p_i(n+1) = p_i(n) + a(n)·(1 - p_i(n)), and p_j(n+1) = (1 - a(n))·p_j(n) for all j ≠ i   (8)
When the feedback signal takes 1, the corresponding probability update is expressed as follows:
p_i(n+1) = (1 - b(n))·p_i(n), and p_j(n+1) = b(n)/(r - 1) + (1 - b(n))·p_j(n) for all j ≠ i   (9)
where r is the number of actions in the action set.
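The Python sketch below shows a linear reward-penalty update of this kind; the constant learning rates a and b are assumptions chosen only for the example.

    def update_probabilities(p: list, chosen: int, feedback: int,
                             a: float = 0.1, b: float = 0.05) -> list:
        # Linear reward-penalty update of the action probability vector.
        # feedback == 0 rewards the chosen action, feedback == 1 penalizes it.
        r = len(p)
        q = p.copy()
        for j in range(r):
            if feedback == 0:    # reward
                q[j] = p[j] + a * (1.0 - p[j]) if j == chosen else (1.0 - a) * p[j]
            else:                # penalty
                q[j] = (1.0 - b) * p[j] if j == chosen else b / (r - 1) + (1.0 - b) * p[j]
        return q

    p = [0.25, 0.25, 0.25, 0.25]
    p = update_probabilities(p, chosen=2, feedback=0)   # the probabilities still sum to 1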
s106, selecting an exploration strategy: different exploration strategies are selected at different periods.
It should be noted that although an action selection probability is maintained, relying on it alone would make the coordinator adjust slowly and fail to reflect changes in the environment in time.
Specifically, the exploration strategy is divided into three stages: an initial stage, an exploration stage and a greedy stage;
in the initial stage, all actions in the set are explored in a deterministic manner using a cyclic search strategy: the node selects the highest duty cycle at the start and slowly reduces it until the minimum duty cycle is reached, ensuring that every duty cycle in the set, i.e. the entire action set of the learning automaton, has been tried.
During the exploration phase, once all actions have been selected, the following strategy is adopted: if the reward received for the current action has increased, an action with a higher duty cycle than the current selection is explored at random; otherwise, if the reward remains the same or decreases, an action with a lower duty cycle is explored at random.
In the greedy stage, after learning with the exploration strategy for a period of time, the node has become sufficiently familiar with the environment and can begin to select actions autonomously, adopting the following strategy:
the greedy strategy selects the action with the best P value within the subset of actions with lower action values, in other words an action with a higher duty cycle than the one selected at the previous moment. If several actions in the selected subset have the same P value, the action with the lowest duty cycle (highest action value) is selected. This means that the best action with the lowest duty cycle is chosen: if the reward is equal to or lower than the reward received in the previous phase, a better P value is selected. Therefore, under steady conditions, a minimum duty cycle is preferred. Once an action is selected, the exploration probability of the node is increased if the new action value differs from that of the previous stage.
S107, evaluating the influence of the action on data transmission after interaction with the environment, updating the feedback set, and updating the action selection probability set.
It is noted that the coordinator updates the reward per beacon interval by using the feedback received from the sender during the last activity duration. The reward function is defined as follows:
β = -1, if O > O_max; β = -IL, otherwise   (12)
where β represents the penalty (negative) value for the performance of the duty cycle selected in this phase. As can be seen from the above equation, the best reward is a zero value (no penalty), because it means no idle listening and no overflow of the transmit queue.
Specifically, the reward is based on a comparison between the queue occupancy O and the threshold O_max. If the queue occupancy is higher than the upper threshold O_max, the reward signal is negative (-1): the further the occupancy exceeds O_max, the greater the chance that the end device must drop packets, and therefore the lower the reward it receives. The choice of the threshold O_max indicates how sensitive the coordinator is to frame loss; this parameter can be set according to the reliability requirements of the application and may be set to 0.8 in the normal case. If the queue occupancy O is less than the threshold O_max, the feedback signal is defined as a negative value equal to the amount of idle listening, since idle listening is one of the main causes of energy consumption, so the lower it is, the better. The maximum reward of zero (no penalty) can only be reached when idle listening is zero and the queue occupancy O indicates no data frame loss. This means that the goal of an optimal trade-off between bandwidth utilization and energy consumption is achieved.
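A minimal Python sketch of this reward computation is shown below; the threshold value 0.8 is taken from the text above, and everything else is an assumption of the example.

    def compute_reward(queue_occupancy: float, idle_listening: float,
                       o_max: float = 0.8) -> float:
        # Penalty-style reward: 0 is best (no idle listening, no queue overflow).
        if queue_occupancy > o_max:
            return -1.0            # imminent frame loss: strongest penalty
        return -idle_listening     # otherwise penalize wasted idle listening

    # Example: moderate occupancy, 30% of the superframe spent in idle listening
    beta = compute_reward(queue_occupancy=0.5, idle_listening=0.3)   # -> -0.3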
S108, selecting an action: the BO and SO standard parameters that determine the duty cycle are selected based on the feedback set, realizing adaptive MAC scheduling.
After selecting the action value, the BO and SO standard parameters that determine the duty cycle are adjusted. The adjustment is defined as follows:
BO = max(4, |A|), such that (BI - SD) < Δ   (13)
SO ← max(0, BO - α_t)   (14)
where A is the learning automaton and |A| the size of its action set, BI is the current beacon interval, SD is the superframe duration, Δ is the delay experienced by the data frames, and α_t is the selected action.
it should be noted that the selection is based on the delay experienced by the data frames, and the parameter values BO and SO are embedded in the beacon frames broadcast to the terminal devices for synchronization.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (9)

1. A high-reliability self-adaptive MAC layer scheduling method is characterized by comprising the following steps:
the first step is modeling according to the wireless sensor network environment, applying the learning automaton method to the wireless sensor network environment, and representing the sensor network environment model by a triple E = (α, β, p), wherein α = {α1, α2, ..., αr} denotes the finite action set, i.e. the input, of the node's automatic learning and denotes the set of candidate duty cycles of the node, β = {β1, β2, ..., βm} denotes the feedback signal output by the node after selecting a duty cycle and interacting with the environment, and p = {p1, p2, ..., pr} denotes a series of reward and punishment probabilities, each punishment probability p_i being related to a given input variable α_i;
secondly, the node generates a specific frame structure format, and embeds queue occupancy rate and queuing delay parameter by using reserved bits of a frame control field, specifically, each terminal device embeds queue occupancy rate O and queuing delay D in the frame control structure of each data frame to be transmitted, and the information is embedded by using 3 reserved bits of the frame control field;
thirdly, the coordinators (FFD) perform traffic estimation to generate a traffic-adaptive duty cycle set, and each coordinator estimates the incoming traffic by calculating idle listening, packet accumulation and delay in the terminal device transmit queues, wherein the expression for the idle listening IL is as follows:
IL = 1.0 - SF_u   (1)
where SF_u represents the superframe utilization, i.e. the ratio of the time the terminal devices occupy the superframe to the total time available for data communication;
fourthly, initializing an action set of the coordinator, and selecting a probability set and a feedback set for the action;
fifthly, the coordinator (FFD) interacts with the surrounding environment by using the Learning Automata (LA) method, a P-environment model is adopted, the value of the feedback signal is 0 or -1, and if the feedback signal is 0, the probability update is defined as follows:
p_i(n+1) = p_i(n) + a(n)·(1 - p_i(n)), and p_j(n+1) = (1 - a(n))·p_j(n) for all j ≠ i   (2)
if the feedback signal takes -1, the probability update is defined as follows:
p_i(n+1) = (1 - b(n))·p_i(n), and p_j(n+1) = b(n)/(r - 1) + (1 - b(n))·p_j(n) for all j ≠ i   (3)
wherein r is the number of actions in the action set of the learning automaton, α_i is the selected action, and a(n) and b(n) are weight coefficients of the linear functions, which can be defined as linear functions or constants depending on the specific application;
and sixthly, selecting an exploration strategy: selecting different exploration strategies at different periods, dividing the whole learning process into three stages, wherein a cyclic search strategy is adopted in the initial stage, a random search strategy is adopted in the exploration stage, and a greedy strategy is adopted in the greedy stage;
seventhly, evaluating the influence of the action on data transmission after the action interacts with the environment, updating a feedback set, and updating an action selection probability set, wherein a reward function for evaluation is defined as follows:
β = -1, if O > O_max; β = -IL, otherwise   (4)
wherein β denotes the penalty value, i.e. the negative value, for the performance of the duty cycle selected in this phase, O denotes the queue occupancy, and O_max denotes the upper occupancy threshold; as can be seen from the above equation, the best reward is a zero value, i.e. no penalty, since it indicates no idle listening and no overflow of the transmission queue;
and an eighth step of selecting action, namely selecting BO and SO standard parameters for determining the duty ratio based on the feedback set, and selecting the optimal duty ratio, wherein the BO parameters are defined as follows:
BO = max(4, |A|), such that (BI - SD) < Δ   (5)
where A denotes the learning automaton and |A| the size of its action set, BI denotes the current beacon interval, SD denotes the superframe duration, and Δ denotes the delay experienced by the data frame.
2. The method as claimed in claim 1, wherein the network environment model is established such that the wireless sensor network environment model is represented by a triple E = (α, β, p), where α = {α1, α2, ..., αn} represents the finite action set, i.e. the input, of the node's automatic learning and represents the set of candidate duty cycles of the node, β = {β1, β2, ..., βm} denotes the feedback signal output by the node after selecting a duty cycle and interacting with the environment, and p = {p1, p2, ..., pn} denotes a series of reward and punishment probabilities, each punishment probability p_i being related to a given input variable α_i; the environment can be divided into 3 types according to the feedback signal β, namely P-type, Q-type and S-type environments; the wireless sensor network environment is modeled by the P-model, in which the feedback signal is a Boolean value, i.e. β is described by binary 0 and 1, wherein α_i (α_i ∈ α) represents the selected activity of the learning automaton, P(t) represents the probability vector at time t, P_reward denotes the reward factor and P_penalty denotes the penalty factor, the probability of an activity being increased or decreased by these two factors respectively; if the activity is penalized by the random environment, the activity probability vector P(t) is updated as follows:
P_i(t+1) = (1 - P_penalty)·P_i(t), and P_j(t+1) = P_penalty/(n - 1) + (1 - P_penalty)·P_j(t) for all j ≠ i   (6)
if the activity is rewarded by the random environment, the activity probability vector P(t) is updated as follows:
P_i(t+1) = P_i(t) + P_reward·(1 - P_i(t)), and P_j(t+1) = (1 - P_reward)·P_j(t) for all j ≠ i   (7)
3. A highly reliable adaptive MAC layer scheduling method as claimed in claim 1, characterized in that the node generates a specific frame structure format; in particular, each terminal device embeds the queue occupancy O and the queuing delay D in the frame control structure of each data frame transmitted, this information being embedded using 3 reserved bits of the frame control field; specifically, each transmitting node uses two bits to represent 4 different levels of queue occupancy o_i, and the queuing delay d_i is divided into 2 levels; from this information the coordinator can estimate the queue occupancy O and the queuing delay D, the queue occupancy O being defined as follows:
O = 1, if any node device reaches or exceeds the maximum number of frames that can be stored in its queue; otherwise, O equals the average of the queue occupancies reported in the first messages received in the packet-accumulation contention access period, i.e. the CAP   (8)
it should be noted that, through the information of the 3 reserved bits, the coordinator can estimate the queue occupancy O and the queuing delay D; if any node device reaches or exceeds the maximum number of frames that can be stored in its queue, the queue occupancy O estimated by the coordinator is equal to 1, otherwise it is equal to the average value of the queue occupancies of the first messages received in the packet-accumulation contention access period, i.e. the CAP, where the first message received in the CAP is the message with the highest queue occupancy during the inactive period; it should also be noted that the queuing delay bit D_i of each terminal device i indicates whether the delay experienced in the current beacon interval BI is below a defined minimum delay threshold D_th: if it is below the threshold, the queuing delay bit D_i is '0', otherwise it is '1'; the coordinator takes the minimum delay threshold D_th as the maximum delay of node device transmissions, which is done to ensure that any node can still transmit data when the queuing delay is above the threshold.
4. The method as claimed in claim 1, wherein the coordinators (FFD) perform traffic estimation; specifically, each coordinator estimates the incoming traffic by calculating idle listening, packet accumulation and delay in the sending queue of the terminal device, and the expression for the idle listening IL is as follows:
IL = 1.0 - SF_u   (9)
where SF_u is the superframe utilization, i.e. the ratio of the time the terminal devices occupy the superframe to the total time available for data communication, defined as:
SF_u = (T_b + T_c + T_r) / SD   (10)
where SD is the superframe duration, T_b indicates the time taken by the coordinator for beacon transmission, T_c indicates the time a device occupies the channel due to frame collisions, T_r is the time for data reception, and T_s is defined as follows:
T_s = T_CCA + T_DATA + T_IFS + T_ACK   (11)
where T_CCA indicates the channel assessment time during each frame transmission, T_DATA indicates the data transmission time, T_IFS indicates the inter-frame space, and T_ACK indicates the time for acknowledgement reception.
5. The method as claimed in claim 1, wherein the action set, the action selection probability set and the feedback set are initialized; the learning automaton is a probability-based learning tool that selects its activities through a random activity probability vector P_i(t); the activity probability vector is the main component of the learning automaton, so it must be kept updated at all times; P_i(t) denotes the probability that node n_i selects a certain duty cycle at time t, and this probability is expressed as the expected value of the overall feedback return of the corresponding duty cycle, which is assumed to follow a normal distribution with a corresponding probability density; it should be noted that the action of the coordinator is initially selected as the maximum duty cycle, that is, the coordinator is always in the receiving state, and the corresponding action selection probability is also 1, so that the coordinator can collect more information about the network in the early stage.
6. A highly reliable adaptive MAC layer scheduling method as claimed in claim 1, characterized in that the coordinator (FFD) uses the Learning Automata (LA) method to interact with the surrounding environment; in particular, the learning automaton model can be represented by a triple LA = (α, β, p), where α = {α1, α2, ..., αr} denotes the action set of the learning automaton, β = {β1, β2, ..., βr} denotes the set of feedback signals given by the environment, and p = {p1, p2, ..., pr} denotes the set of action probabilities, satisfying
Σ_{i=1}^{r} p_i(n) = 1   (14)
wherein p_i(n) represents the action probability corresponding to α_i after the n-th round of the learning process, satisfying the probability update formula p(n+1) = T(α(n), β(n), p(n)), where T represents the learning algorithm;
specifically, a P-environment model is adopted, the feedback signal takes the value 0 or 1, and when the feedback signal takes 0 the environment gives a reward signal; when the feedback signal takes 0 or 1, the corresponding probability updates are respectively expressed as follows:
when the feedback signal takes 0:
p_i(n+1) = p_i(n) + a(n)·(1 - p_i(n)), and p_j(n+1) = (1 - a(n))·p_j(n) for all j ≠ i   (15)
when the feedback signal takes 1:
p_i(n+1) = (1 - b(n))·p_i(n), and p_j(n+1) = b(n)/(r - 1) + (1 - b(n))·p_j(n) for all j ≠ i   (16)
it should be noted that, in the process of adjusting the duty ratio by using the learning automaton method, a feedback β of the environment is continuously received, and the total received feedback can be understood as the sum of the immediate feedback and the future feedback, as shown below:
R_t = β_t + γ·β_{t+1} + γ²·β_{t+2} + ... = Σ_{k=0}^{∞} γ^k·β_{t+k}   (17)
where γ is a discount factor, γ ∈ [0,1], representing a weight for future feedback.
7. The high-reliability adaptive MAC layer scheduling method of claim 1, wherein different exploration strategies are selected at different time periods, specifically, the whole part of the exploration strategies are divided into 3 stages: initial phase, exploration phase and greedy phase:
in the initial stage, all actions in the set are explored in a deterministic manner by adopting a cyclic search strategy, the node selects the highest duty ratio at the initial stage, and the duty ratio is slowly reduced until the minimum duty ratio is reached, so that all duty ratio sets are ensured to be tried, namely the action set of the learning automaton is listed completely;
in the exploration phase, once all actions have been selected, actions with a higher duty cycle than the one selected in the current phase are explored at random if the corresponding β_i^t in the feedback set β has increased, indicating that the duty cycle represented by action α_i is better; otherwise, if the feedback set β remains unchanged or the corresponding β_i^t decreases, actions with a lower duty cycle are explored at random;
in the greedy stage, after the exploration strategy has been applied for a period of time, the node is sufficiently familiar with the environment, and the greedy strategy is then used to find the optimal action value: when the corresponding β_i^t in the feedback set β is higher than that of the previous stage, to account for increased traffic, the greedy strategy selects from the subset of actions with lower action values, i.e. selects a higher duty cycle; if the corresponding β_i^t in the feedback set β is lower than or equal to that of the previous stage, the greedy strategy selects from the subset of actions with higher action values, i.e. selects a lower duty cycle; therefore, under stable conditions, a minimum duty cycle is preferred; the exploration probability is increased when the duty cycle selected in the next stage differs from that selected in the present stage, otherwise the learning and exploration probabilities are reduced to avoid oscillation once the optimal action is selected;
where β represents the combination of negative values, i.e. penalty values, for the performance of the duty cycle selection in this phase; it can be seen that the best reward is a zero value, i.e. no penalty, since it represents no idle listening and no overflow of the transmission queue; it should be noted that if the new action is equal to the last action selected, the learning and exploration rates are reduced to avoid oscillation around the best action.
8. The method according to claim 1, wherein the selecting action selects the BO and SO standard parameters that determine the duty cycle based on the feedback set to implement the adaptive MAC scheduling; specifically, after the action value is selected, the adjustment formulas are defined as follows:
BO = max(4, |A|), such that (BI - SD) < Δ   (19)
SO ← max(0, BO - α_t)   (20)
where A denotes the learning automaton and |A| the size of its action set, BI denotes the current beacon interval, SD denotes the superframe duration, and Δ denotes the delay experienced by the data frame; it is noted that the selection is based on the delay experienced by the data frames, and the parameter values BO and SO are embedded in the beacon frames broadcast to the terminal devices for synchronization.
9. An apparatus for implementing a highly reliable adaptive MAC layer scheduling method, comprising:
a model establishing unit for establishing a model according to the wireless sensor network environment, applying the learning automaton method to the environment of the wireless sensor network, and representing the sensor network environment model by a triple E = (α, β, p), wherein α = {α1, α2, ..., αr} denotes the finite action set, i.e. the input, of the node's automatic learning and denotes the set of candidate duty cycles of the node, β = {β1, β2, ..., βm} denotes the feedback signal output by the node after selecting a duty cycle and interacting with the environment, and p = {p1, p2, ..., pr} denotes a series of reward and punishment probabilities, each punishment probability p_i being related to a given input variable α_i;
a generating unit, configured to generate a specific frame structure format by a node, embed, using reserved bits of a frame control field, a queue occupancy rate and a queuing delay parameter, specifically, embed, by each terminal device, a queue occupancy rate O and a queuing delay D in a frame control structure of each data frame to be transmitted, where the information is embedded using 3 reserved bits of the frame control field;
a traffic estimation unit, configured to perform traffic estimation by the coordinator (FFD) and generate a traffic-adaptive duty cycle set, where each coordinator estimates the incoming traffic by calculating idle listening, packet accumulation and delay in the terminal device transmit queues, and the expression for the idle listening IL is as follows:
IL = 1.0 - SF_u   (21)
where SF_u represents the superframe utilization, i.e. the ratio of the time the terminal devices occupy the superframe to the total time available for data communication;
the coordinator initialization unit is used for initializing an action set of the coordinator, and selecting a probability set and a feedback set for the action;
an environment interaction unit, configured to interact with the surrounding environment by using the Learning Automata (LA) method through the coordinator (FFD), where a P-environment model is adopted, the value of the feedback signal is 0 or -1, and if the feedback signal is 0, the probability update is defined as follows:
p_i(n+1) = p_i(n) + a(n)·(1 - p_i(n)), and p_j(n+1) = (1 - a(n))·p_j(n) for all j ≠ i   (22)
if the feedback signal takes -1, the probability update is defined as follows:
p_i(n+1) = (1 - b(n))·p_i(n), and p_j(n+1) = b(n)/(r - 1) + (1 - b(n))·p_j(n) for all j ≠ i   (23)
wherein r is the number of actions in the action set of the learning automaton, and a(n) and b(n) are weight coefficients of the linear functions, which can be defined as linear functions or constants depending on the specific application;
an exploration strategy selection unit for selecting an exploration strategy: selecting different exploration strategies at different periods, dividing the whole learning process into three stages, wherein a cyclic search strategy is adopted in the initial stage, a random search strategy is adopted in the exploration stage, and a greedy strategy is adopted in the greedy stage;
the interaction evaluation unit is used for evaluating the influence of the action on data transmission after interaction with the environment, updating the feedback set and updating the action selection probability set, and the reward function used for evaluation is defined as follows:
β = -1, if O > O_max; β = -IL, otherwise   (24)
wherein β denotes the penalty value, i.e. the negative value, for the performance of the duty cycle selected in this phase, O denotes the queue occupancy, and O_max denotes the upper occupancy threshold; as can be seen from the above equation, the best reward is a zero value, i.e. no penalty, since it indicates no idle listening and no overflow of the transmission queue;
and an action selection unit for selecting an action, namely selecting BO and SO standard parameters for determining the duty ratio based on the feedback set, and selecting the optimal duty ratio, wherein the BO parameters are defined as follows:
BO = max(4, |A|), such that (BI - SD) < Δ   (25)
where A denotes the learning automaton and |A| the size of its action set, BI denotes the current beacon interval, SD denotes the superframe duration, and Δ denotes the delay experienced by the data frame.
CN201710946487.6A 2017-10-11 2017-10-11 High-reliability self-adaptive MAC (media Access control) layer scheduling method Active CN109660375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710946487.6A CN109660375B (en) 2017-10-11 2017-10-11 High-reliability self-adaptive MAC (media Access control) layer scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710946487.6A CN109660375B (en) 2017-10-11 2017-10-11 High-reliability self-adaptive MAC (media Access control) layer scheduling method

Publications (2)

Publication Number Publication Date
CN109660375A CN109660375A (en) 2019-04-19
CN109660375B true CN109660375B (en) 2020-10-02

Family

ID=66108497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710946487.6A Active CN109660375B (en) 2017-10-11 2017-10-11 High-reliability self-adaptive MAC (media Access control) layer scheduling method

Country Status (1)

Country Link
CN (1) CN109660375B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110856264A (en) * 2019-11-08 2020-02-28 山东大学 Distributed scheduling method for optimizing information age in sensor network
CN111542070B (en) * 2020-04-17 2023-03-14 上海海事大学 Efficient multi-constraint deployment method for industrial wireless sensor network
CN114666880B (en) * 2022-03-16 2024-04-26 中南大学 Method for reducing end-to-end delay in delay-sensitive wireless sensor network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103260229A (en) * 2013-06-04 2013-08-21 东北林业大学 Wireless sensor network MAC protocol based on forecast and feedback

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103260229A (en) * 2013-06-04 2013-08-21 东北林业大学 Wireless sensor network MAC protocol based on forecast and feedback

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Traffic Adaptive Duty Cycle MAC Protocol for Wireless Sensor Networks; Chen Hao et al.; IEEE; 2008-12-31; full text *
无线传感器网络自适应MAC协议 (Adaptive MAC protocol for wireless sensor networks); 范清峰 et al.; 计算机工程与应用 (Computer Engineering and Applications); 2010-12-31; full text *

Also Published As

Publication number Publication date
CN109660375A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
JP6498818B2 (en) Communication system, access network node and method for optimizing energy consumed in a communication network
de Paz Alberola et al. Duty cycle learning algorithm (DCLA) for IEEE 802.15. 4 beacon-enabled wireless sensor networks
JP6266773B2 (en) Communication system and method for determining optimal duty cycle to minimize overall energy consumption
Oliveira et al. A duty cycle self-adaptation algorithm for the 802.15. 4 wireless sensor networks
Jeon et al. DCA: Duty-cycle adaptation algorithm for IEEE 802.15. 4 beacon-enabled networks
Lee et al. Adaptive duty-cycle based congestion control for home automation networks
CN109660375B (en) High-reliability self-adaptive MAC (media Access control) layer scheduling method
WO2005060604A2 (en) Wireless network with improved sharing of high power consumption tasks
Hassan et al. Traffic differentiation and dynamic duty cycle adaptation in IEEE 802.15. 4 beacon enabled WSN for real-time applications
Jagannath et al. A hybrid MAC protocol with channel-dependent optimized scheduling for clustered underwater acoustic sensor networks
Cheng et al. An opportunistic routing in energy-harvesting wireless sensor networks with dynamic transmission power
Siddiqui et al. Towards dynamic polling: Survey and analysis of Channel Polling mechanisms for Wireless Sensor Networks
US8320269B2 (en) Scalable delay-power control algorithm for bandwidth sharing in wireless networks
Zhou et al. An efficient adaptive mac frame aggregation scheme in delay tolerant sensor networks
KR101557588B1 (en) Apparatus for packet retransmission in wireless sensor network
Perillo et al. ASP: An adaptive energy-efficient polling algorithm for Bluetooth piconets
Afroz et al. QX-MAC: Improving QoS and Energy Performance of IoT-based WSNs using Q-Learning
Han et al. Multi-agent reinforcement learning for green energy powered IoT networks with random access
Shrestha et al. A Markov decision process (MDP)-based congestion-aware medium access strategy for IEEE 802.15. 4
CN111432505B (en) Wireless networking transmission system based on WiFi
Nefzi et al. SCSP: An energy efficient network-MAC cross-layer design for wireless sensor networks
de Paz et al. Dcla: A duty-cycle learning algorithm for ieee 802.15. 4 beacon-enabled wsns
El Rachkidy et al. Queue-exchange mechanism to improve the QoS in a multi-stack architecture
Koren et al. Requirements and challenges in wireless network's performance evaluation in ambient assisted living environments
Poulose Jacob et al. Channel adaptive MAC protocol with traffic-aware distributed power management in wireless sensor networks-some performance issues

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant