CN110049018B - SPMA protocol parameter optimization method, system and medium based on reinforcement learning - Google Patents
- Publication number
- CN110049018B (application number CN201910229439.4A)
- Authority
- CN
- China
- Prior art keywords
- parameter
- action
- value
- score
- selecting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/26—Special purpose or proprietary protocols or architectures
Abstract
The invention provides an SPMA protocol parameter optimization method, system and medium based on reinforcement learning, comprising the following steps. Parameter selection and division: select a parameter set of the SPMA protocol, divide each parameter in the set into different parameter states at a preset granularity, and obtain the current parameter state set. Delay and success rate acquisition: substitute the obtained current parameter state set into a preset scenario to obtain the service delay and success rate of each priority of the SPMA protocol. By combining the SPMA protocol parameter optimization problem under different application scenarios with a reinforcement learning algorithm, the invention greatly simplifies the parameter calculation process compared with the original parameter selection method of the SPMA communication system, reaches the required performance indexes more easily, completes the related settings of the SPMA protocol more effectively, and has broad application prospects.
Description
Technical Field
The invention relates to the technical field of communication protocols, in particular to an SPMA protocol parameter optimization method, system and medium based on reinforcement learning.
Background
The SPMA (Statistical Priority-based Multiple Access) protocol is mainly aimed at scenarios with high-priority, time-sensitive traffic. To meet high real-time service requirements at different priorities, such as the cooperative targeting information transmission of TTNT, SPMA is adopted as the access protocol. This multiple access protocol, based on per-priority probability statistics, consists of multiple priority queues, per-priority contention backoff windows, priority thresholds, channel occupancy statistics, transmit and receive antennas, and a corresponding distributed control algorithm. Services of different priorities map to different MAC-layer priority queues, and the channel occupancy statistic, obtained through interaction between the MAC layer and the physical layer, determines when packets are sent. The channel occupancy statistic measures the activity level of the communication channel over a predetermined period, i.e., the degree to which the channel is idle within the configured channel statistics window.
When a higher layer has a packet to transmit, or a forwarded packet is received, the packet enters the corresponding priority queue according to fixed rules, and the channel occupancy statistic is compared with the corresponding priority threshold. If the statistic is below the threshold, the packet is sent. If the statistic is at or above the threshold, the packet waits for a random backoff time; once the backoff timer reaches zero, the node re-checks the channel occupancy statistic and, if permitted, transmits the packet. If higher-priority data arrives during the backoff time, the backoff timer is suspended and the channel occupancy statistic is immediately compared against the corresponding higher-priority threshold to decide whether the newly arrived high-priority packet is sent. In the SPMA protocol, the main simulation parameters are the backoff window length, the channel statistics window length, and the priority thresholds. In terms of their effect on performance indexes, the end-to-end single-hop transmission delay is related to the backoff window length, while the packet loss rate is related to the backoff window length, the statistics window length, and the priority threshold settings.
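As an illustration of the access rule just described (this is a sketch, not code from the patent; the class and method names, and the use of a busy-slot fraction as the occupancy statistic, are assumptions):

```python
from collections import deque
import random

class SpmaNode:
    """Minimal sketch of the SPMA per-packet transmission decision."""

    def __init__(self, priority_thresholds, stats_window_slots, backoff_window_slots):
        # priority_thresholds[i]: channel-occupancy threshold for priority i (0 = highest)
        self.priority_thresholds = priority_thresholds
        # Sliding window of per-slot busy/idle observations from the physical layer.
        self.window = deque(maxlen=stats_window_slots)
        self.backoff_window_slots = backoff_window_slots

    def observe_slot(self, channel_busy):
        """Record one slot's busy/idle state into the statistics window."""
        self.window.append(1 if channel_busy else 0)

    def occupancy(self):
        """Channel occupancy statistic: fraction of busy slots in the window."""
        return sum(self.window) / len(self.window) if self.window else 0.0

    def try_send(self, priority):
        """Send if occupancy is below this priority's threshold; otherwise
        draw a random backoff (in slots) after which the node re-checks."""
        if self.occupancy() < self.priority_thresholds[priority]:
            return "send"
        backoff = random.randint(1, self.backoff_window_slots)
        return ("backoff", backoff)
```

Under this sketch, a lower-priority packet (higher threshold index, lower threshold value) is blocked earlier as the channel fills, which is the prioritization effect the protocol relies on.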
A search of the relevant literature finds that Abdellatif Serhani, Najib Naja and Abdellah Jamali published "QLAR: A Q-learning based adaptive routing for MANETs" at AICCSA 2016 (IEEE/ACS 13th International Conference of Computer Systems and Applications). That article uses a reinforcement learning method to optimize multiple mutually constrained parameter sets in a routing algorithm so as to select an optimal transmission path. Similarly, the SPMA communication system contains multiple mutually constrained parameter sets that influence the system's performance indexes to different degrees, and this relationship cannot be captured by an explicit mathematical expression. For a given application scenario, a randomly chosen set of parameters often fails to reach the required performance indexes, and the parameter values cannot be derived directly by mathematical methods; a reinforcement learning method is therefore needed to obtain the optimal parameter set for the current scenario.
Patent document CN106954229A (application number: 201710136147.7) discloses a hybrid channel load statistics method based on SPMA, which obtains the channel load separately from the physical layer and from the network layer: under light load it uses the channel load statistic obtained at the physical layer, under heavy load it uses the statistic obtained at the network layer, and if the difference between the two statistics exceeds a certain margin, the channel load statistic obtained at the network layer is calibrated.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an SPMA protocol parameter optimization method, system and medium based on reinforcement learning.
The invention provides an SPMA protocol parameter optimization method based on reinforcement learning, which comprises the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: substituting the obtained current parameter state set into the preset scenario to obtain the service delay and success rate of each priority of the SPMA protocol;
and (3) parameter grading step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
parameter optimization: updating the current parameter state according to an ε-greedy strategy, i.e., with probability ε selecting a new parameter set according to the maximum Q value in the Q-value table, and with probability 1 − ε selecting a parameter set at random; then returning to the delay and success rate obtaining step.
Preferably, in the parameter selection and division step:
the parameter set includes any one or more of: each priority threshold, the lifetime, the statistics window length, and the backoff window length of the SPMA protocol;
the Q value is the cumulative discounted reward obtained after taking actions according to the action policy in a given system state.
Preferably, the parameter scoring step:
according to the obtained service delay t_i and success rate p_i of each priority, calculating the total score with the preset scoring criterion,
where:
score represents the total score;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, entering the parameter optimization step.
Preferably, the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state accordingly, wherein the updated parameter state is kept no less than a preset minimum value and no greater than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is taken as the improvement of the total score over the previous iteration:
r = score − score_before
where:
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s, a) ← Q(s, a) + α[r + γ·max_a′ Q(s′, a′) − Q(s, a)]
where:
Q(s, a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the discount factor for future rewards;
max_a′ Q(s′, a′) represents the maximum Q value obtainable in state s′ over actions a′.
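The tabular update above can be written directly as code. This is an illustrative sketch (the function name and the NumPy table layout are assumptions; α = 0.9 and γ = 0.4 follow the values given later in the embodiment):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.9, gamma=0.4):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q

# Example: 3 states x 2 actions, all Q values initialised to zero.
Q = np.zeros((3, 2))
q_update(Q, s=0, a=1, r=10.0, s_next=1)
# Q[0, 1] is now 0.9 * (10 + 0.4 * 0 - 0) = 9.0
```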
Preferably, the action set A is:
A = {±θ1, ±θ2, …, ±θn}
where:
θn represents the adjustment step size of each action;
n indexes the different parameters;
an action a is selected from the action set A according to the ε-greedy strategy as follows:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, i.e., the action a in A that attains the maximum Q value; or
with probability 1 − ε, selecting an action a from the action set A at random;
where ε is greater than 0 and less than 1.
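The selection rule can be sketched as follows. Note the hedge: the patent's convention (stated later in the embodiment) is to exploit with probability ε and explore with probability 1 − ε, the reverse of the usual ε-greedy naming; the function and argument names, and the feasibility filter against the parameter bounds, are illustrative assumptions:

```python
import random

def select_action(Q_row, actions, epsilon, state_value, min_value, max_value):
    """Epsilon-greedy selection over the step set A = {±θ1, ±θ2, ...}.

    Exploits (argmax of the Q row) with probability epsilon and explores
    randomly with probability 1 - epsilon, following the document's
    convention. Any action that would push the parameter outside
    [min_value, max_value] is discarded before selection, which realises
    the rule that out-of-range actions are reselected.
    """
    feasible = [i for i, step in enumerate(actions)
                if min_value <= state_value + step <= max_value]
    if random.random() < epsilon:
        # Exploit: among feasible actions, take the one with the largest Q value.
        return max(feasible, key=lambda i: Q_row[i])
    # Explore: pick any feasible action uniformly at random.
    return random.choice(feasible)
```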
The invention provides an SPMA protocol parameter optimization system based on reinforcement learning, which comprises:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: substituting the obtained current parameter state set into the preset scenario to obtain the service delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, i.e., with probability ε selecting a new parameter set according to the maximum Q value in the Q-value table, and with probability 1 − ε selecting a parameter set at random; then calling the delay and success rate obtaining module.
Preferably, in the parameter selection and division module:
the parameter set includes any one or more of: each priority threshold, the lifetime, the statistics window length, and the backoff window length of the SPMA protocol;
the Q value is the cumulative discounted reward obtained after taking actions according to the action policy in a given system state.
Preferably, the parameter scoring module:
according to the obtained service delay t_i and success rate p_i of each priority, calculating the total score with the preset scoring criterion,
where:
score represents the total score;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, calling the parameter optimization module.
Preferably, the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state accordingly, wherein the updated parameter state is kept no less than a preset minimum value and no greater than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is taken as the improvement of the total score over the previous iteration:
r = score − score_before
where:
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s, a) ← Q(s, a) + α[r + γ·max_a′ Q(s′, a′) − Q(s, a)]
where:
Q(s, a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the discount factor for future rewards;
max_a′ Q(s′, a′) represents the maximum Q value obtainable in state s′ over actions a′;
the action set A is as follows:
A = {±θ1, ±θ2, …, ±θn}
where:
θn represents the adjustment step size of each action;
n indexes the different parameters;
an action a is selected from the action set A according to the ε-greedy strategy as follows:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, i.e., the action a in A that attains the maximum Q value; or
with probability 1 − ε, selecting an action a from the action set A at random;
where ε is greater than 0 and less than 1.
According to the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method described in any of the above.
Compared with the prior art, the invention has the following beneficial effects:
the invention combines the SPMA protocol parameter optimization problem under different application scenes with the reinforcement learning algorithm, greatly simplifies the parameter calculation process compared with the parameter selection method of the original SPMA communication system, is easier to reach the required performance index, can more effectively complete the related setting of the SPMA protocol, and has wide application prospect.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a specific process based on Q-learning according to the present invention;
FIG. 2 is a block diagram of the SPMA and reinforcement learning system provided by the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications will be apparent to those skilled in the art without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.
The invention provides an SPMA protocol parameter optimization method based on reinforcement learning, which comprises the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: substituting the obtained current parameter state set into the preset scenario to obtain the service delay and success rate of each priority of the SPMA protocol;
and (3) parameter grading step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
parameter optimization: updating the current parameter state according to an ε-greedy strategy, i.e., with probability ε selecting a new parameter set according to the maximum Q value in the Q-value table, and with probability 1 − ε selecting a parameter set at random; then returning to the delay and success rate obtaining step.
Specifically, in the parameter selection and division step:
the parameter set includes any one or more of: each priority threshold, the lifetime, the statistics window length, and the backoff window length of the SPMA protocol;
the Q value is the cumulative discounted reward obtained after taking actions according to the action policy in a given system state.
Specifically, the parameter scoring step:
according to the obtained service delay t_i and success rate p_i of each priority, calculating the total score with the preset scoring criterion,
where:
score represents the total score;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, entering the parameter optimization step.
Specifically, the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state accordingly, wherein the updated parameter state is kept no less than a preset minimum value and no greater than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is taken as the improvement of the total score over the previous iteration:
r = score − score_before
where:
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s, a) ← Q(s, a) + α[r + γ·max_a′ Q(s′, a′) − Q(s, a)]
where:
Q(s, a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the discount factor for future rewards;
max_a′ Q(s′, a′) represents the maximum Q value obtainable in state s′ over actions a′.
Specifically, the action set A is:
A = {±θ1, ±θ2, …, ±θn}
where:
θn represents the adjustment step size of each action;
n indexes the different parameters;
an action a is selected from the action set A according to the ε-greedy strategy as follows:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, i.e., the action a in A that attains the maximum Q value; or
with probability 1 − ε, selecting an action a from the action set A at random;
where ε is greater than 0 and less than 1.
The SPMA protocol parameter optimization system based on reinforcement learning provided by the invention can be realized through the step flow of the SPMA protocol parameter optimization method based on reinforcement learning provided by the invention. The person skilled in the art can understand the reinforcement learning based SPMA protocol parameter optimization method as a preferred example of the reinforcement learning based SPMA protocol parameter optimization system.
The invention provides an SPMA protocol parameter optimization system based on reinforcement learning, which comprises:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: substituting the obtained current parameter state set into the preset scenario to obtain the service delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, i.e., with probability ε selecting a new parameter set according to the maximum Q value in the Q-value table, and with probability 1 − ε selecting a parameter set at random; then calling the delay and success rate obtaining module.
Specifically, in the parameter selection and division module:
the parameter set includes any one or more of: each priority threshold, the lifetime, the statistics window length, and the backoff window length of the SPMA protocol;
the Q value is the cumulative discounted reward obtained after taking actions according to the action policy in a given system state.
Specifically, the parameter scoring module:
according to the obtained service delay t_i and success rate p_i of each priority, calculating the total score with the preset scoring criterion,
where:
score represents the total score;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, calling the parameter optimization module.
Specifically, the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state accordingly, wherein the updated parameter state is kept no less than a preset minimum value and no greater than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is taken as the improvement of the total score over the previous iteration:
r = score − score_before
where:
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s, a) ← Q(s, a) + α[r + γ·max_a′ Q(s′, a′) − Q(s, a)]
where:
Q(s, a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the discount factor for future rewards;
max_a′ Q(s′, a′) represents the maximum Q value obtainable in state s′ over actions a′;
the action set A is as follows:
A = {±θ1, ±θ2, …, ±θn}
where:
θn represents the adjustment step size of each action;
n indexes the different parameters;
an action a is selected from the action set A according to the ε-greedy strategy as follows:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, i.e., the action a in A that attains the maximum Q value; or
with probability 1 − ε, selecting an action a from the action set A at random;
where ε is greater than 0 and less than 1.
According to the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method described in any of the above.
The present invention will be described more specifically below by way of preferred examples:
preferred example 1:
the invention provides a Q-learning method in reinforcement learning to learn a system environment, which divides parameters in SPMA into different states with certain granularity, and defines the learning action as the increase and decrease of the parameters, namely, the parameters are increased from the current state to the close state larger than the current parameters or are decreased to the close state smaller than the current parameters. Fig. 2 is a schematic diagram of a system block of the SPMA and reinforcement learning.
As shown in fig. 1, more specifically, the present invention comprises the steps of:
step 1: and selecting an SPMA parameter set, and dividing each parameter into different states according to certain granularity.
During reinforcement learning, the system's other, non-optimized parameters are held fixed; that is, each learning iteration faces a static scenario, only the designated set of parameters to be adjusted changes, and the actions taken by the system do not affect the current scenario. In our experiments we select each priority threshold, the lifetime, the statistics window length, and the backoff window length of SPMA as the parameter set, divided at different granularities. The other static parameters of the SPMA communication system are as follows:
| Static parameter | Value |
| --- | --- |
| Simulation unit time slot | 0.1 ms |
| Arrival rate of each priority task (packets/s) | [200, 1600] (Poisson distribution) |
| Task service rate per priority | 1000 |
| Simulation duration (s) | 1 |
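These static settings can be captured as a small configuration object held fixed across learning iterations (a sketch; the key names are assumptions, the values come from the table above):

```python
# Static scenario parameters held fixed during learning.
STATIC_SCENARIO = {
    "slot_ms": 0.1,                      # simulation unit time slot
    "arrival_rate_range": (200, 1600),   # packets/s per priority, Poisson arrivals
    "service_rate": 1000,                # task service rate per priority
    "sim_duration_s": 1,                 # simulation length in seconds
}
```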
Step 2: the goal of the adaptive adjustment mechanism is to obtain parameter values suited to the relevant scenario. For the different granularities arising under different conditions, we set the adjustment precision equal to the granularity; that is, the adjustment step of each action is θn, where n indexes the different parameters, and the action set A is:
A = {±θ1, ±θ2, …, ±θn}
The learning action is defined as an increase or decrease of a parameter, i.e., moving from the current state to the adjacent state above it, or to the adjacent state below it. We specify that a parameter at its minimum value is not reduced further and a parameter at its maximum value is not increased further; when an action is determined, if performing it would cause the result to violate the design rules, the action is reselected;
Step 3: the optimization targets of the system are the delay and success rate of each priority service, i.e., the performance indexes t_i and p_i, where i denotes the priority, p_i is the success rate of the i-th priority service, and t_i is its delay. The invention designs a percentile scoring criterion under which the scoring of t_i and p_i is abstracted as a linear piecewise function: when p_i is above its upper limit, or t_i is below its lower limit, the score is 100; when p_i is below its lower limit, or t_i is above its upper limit, the score is 0; between 0 and 100 the score is piecewise linear, with segments of different slopes selected according to the interval in which t_i and p_i fall. The slopes of the piecewise functions also vary with the application environment and the SPMA parameters.
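A minimal sketch of this scoring criterion follows. For simplicity it uses a single linear segment between the limits, whereas the patent uses several segments with different slopes; the function and parameter names are assumptions:

```python
def metric_score(value, lower, upper, higher_is_better=True):
    """Piecewise-linear percentile score for one metric.

    For a success rate (higher_is_better=True): 100 at or above `upper`,
    0 at or below `lower`, linear in between. For a delay, pass
    higher_is_better=False so the mapping is reversed (low delay scores 100).
    """
    if higher_is_better:
        if value >= upper:
            return 100.0
        if value <= lower:
            return 0.0
        return 100.0 * (value - lower) / (upper - lower)
    # Lower values are better (e.g. delay): mirror the mapping.
    return metric_score(-value, -upper, -lower, higher_is_better=True)
```

The total score would then combine the per-metric scores of all priorities, with the limits (and, in the patent, the segment slopes) chosen per scenario.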
The target score is set to G, and the iteration may terminate once the Q-learning iteration score exceeds the target score. The scoring thresholds, i.e., the upper and lower limits of t_i and p_i used above, are set manually and are determined by the static scenario faced by the iterative learning.
The final total score, score, is then obtained from the per-metric scores. The parameters in the scoring criterion depend on environmental factors and may need to be re-tuned to ensure good performance in different environments.
The reward function is defined as the improvement of the current total score over the total score of the previous iteration.
Step 4: in Q-learning, the Q value is the cumulative discounted reward obtained after taking actions according to the optimal action policy in a given system state. The Q function, as the criterion for action selection, is the final result that Q-learning needs to obtain. The invention models the Q-value function with a lookup table. The update process of the Q-value table proceeds according to the following algorithm:
1. Given that the number of state parameter sets is M and the number of actions is N, initialize an M × N Q-value table (all elements initialized to 0);
2. Given the number of learning iterations (Episodes), the learning rate α, the discount coefficient γ, and the exploration probability ε;
3. For each learning process (For each Episode):
(1) randomly select an initial state s = l_k, k ∈ [0, M−1], from the state parameter set;
(2) while the target score is not reached (score < G), perform the following procedure:
a. select an action a ∈ A in the current state s according to the ε-greedy method;
b. obtain the next state s′ from the current action a (the current state s can only increase to the adjacent state s′ above it, or decrease to the adjacent state s′ below it);
c. obtain the delay and success rate of each priority service under the current parameters, and compute t_i, p_i and the total score;
d. obtain the current reward r (rewardVal) from the total score;
e. updating the Q value table:
Q(s,a)←Q(s,a)+α[r+γmaxa′Q(s′,a′)-Q(s,a)];
f. update the current system state: s ← s′.
To prevent premature convergence of the algorithm to a local optimum, an ε-greedy method is used for decision making, which balances exploration against exploitation of the current strategy, i.e., balances maximizing reward based on existing knowledge against trying new actions to gain unknown knowledge. We choose a value ε and select the behavior according to the optimal value in the Q-value table with probability ε, and select a behavior at random with probability 1 − ε.
Here α is the learning rate, generally α < 1, which determines how much of the current trial is learned into the Q table; γ is the discount factor for future rewards: γ → 0 means the algorithm mainly considers the immediate reward, while γ → 1 means it also weighs future system rewards. In our experiments r = rewardVal, α = 0.9, and γ = 0.4.
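Steps 1-3f above can be sketched as a single training loop. This is an illustrative sketch, not the patent's implementation: `simulate(state)` is an assumed callback that runs the SPMA scenario for the parameter set encoded by `state` and returns its total score, the reward is taken as the score improvement, and the exploit/explore probabilities follow the document's convention (exploit with probability ε):

```python
import random

def q_learning_optimize(num_states, actions, simulate, target_score,
                        episodes=100, alpha=0.9, gamma=0.4, epsilon=0.9):
    """Tabular Q-learning over a 1-D parameter state space (a sketch).

    States are indexed 0..num_states-1; action i moves the state by actions[i].
    """
    M, N = num_states, len(actions)
    Q = [[0.0] * N for _ in range(M)]             # 1. M x N table, all zeros
    for _ in range(episodes):                     # 3. for each Episode
        s = random.randrange(M)                   # (1) random initial state
        score = simulate(s)
        while score < target_score:               # (2) until the target score is met
            # a. epsilon-greedy over feasible (in-range) actions
            feasible = [i for i in range(N) if 0 <= s + actions[i] < M]
            if random.random() < epsilon:
                a = max(feasible, key=lambda i: Q[s][i])   # exploit
            else:
                a = random.choice(feasible)                # explore
            s_next = s + actions[a]               # b. move to an adjacent state
            new_score = simulate(s_next)          # c. delay/success rate -> score
            r = new_score - score                 # d. reward = score improvement (assumed)
            # e. tabular Q update
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s, score = s_next, new_score          # f. advance the system state
    return Q
```

In a real run `simulate` would invoke the SPMA scenario simulation; here any score function over states will do for experimentation.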
Preferred example 2:
a multi-access protocol SPMA parameter optimization method based on reinforcement learning and based on priority probability statistics is characterized by comprising the following steps:
step 1: selecting each priority threshold, the life cycle, the statistical window length and the backoff window length of SPMA as the parameter set, dividing each parameter into different states with a certain granularity, and outputting the parameter set so divided;
the division of each parameter into different states at a constant granularity means that the parameter is divided into different sections of a predetermined length at a constant size.
The granularity refers to the interval size used to divide the parameter set; for example, for the range 0 to 100, a granularity of 10 yields 10 states and a granularity of 20 yields 5 states.
A state refers to one of the intervals after division; for example, dividing 0 to 100 with granularity 20 yields 5 states: state 1 is 0 to 20, state 2 is 20 to 40, and so on.
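The division into states described above can be sketched as follows (an illustrative helper of our own, not from the patent; it assumes the range divides evenly by the granularity):

```python
def divide_states(lower, upper, granularity):
    """Divide the parameter range [lower, upper] into equal-width states."""
    n = (upper - lower) // granularity
    return [(lower + i * granularity, lower + (i + 1) * granularity)
            for i in range(n)]

# Granularity 20 on the range 0..100 gives 5 states: 0-20, 20-40, ...
states = divide_states(0, 100, 20)
```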
Step 2: the adaptive adjustment mechanism aims at obtaining parameter values (the sizes of parameters such as the priority thresholds, the life cycle and the statistical window length) suited to the relevant scene. Different conditions may call for different granularities; the adjustment precision is set to the granularity, i.e., the adjustment step length of each action is θ_n, where n indexes the different parameters, and the action set A is:
A = {±θ_1, ±θ_2, …}
We specify that when a parameter reaches its minimum value it is not reduced further, and when it is at its maximum value it is not increased further; when an action is determined, if taking it would cause the result to violate these design rules, the action is reselected;
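These design rules can be sketched as follows (all names are ours; the patent states the rules only in prose):

```python
import random

def select_valid_action(value, steps, vmin, vmax):
    """Pick a random action whose result stays within [vmin, vmax].
    Equivalent to reselecting whenever a chosen action would violate
    the rules: no decrease at the minimum, no increase at the maximum."""
    valid = [s for s in steps if vmin <= value + s <= vmax]
    return random.choice(valid)
```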
and step 3: the optimization goal of the system is to consider the performance indexes comprehensively. A percentile scoring criterion is designed for the service time delay and transmission success rate of each priority; the time delay and success rate are scored with piecewise-linear functions, whose slopes differ under different application environments and SPMA parameters. The target score is set to G, and the iteration can terminate once the Q-learning iteration score exceeds the target score;
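The patent does not give the concrete slopes or breakpoints, so the following sketch uses hypothetical numbers purely to illustrate the shape of such a piecewise-linear percentile criterion:

```python
def piecewise_linear(x, breakpoints):
    """Evaluate a piecewise-linear score; breakpoints is a sorted list of
    (x, score) pairs, and x is clamped to the covered range."""
    if x <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
        if x <= x1:
            # linear interpolation between the two surrounding breakpoints
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return breakpoints[-1][1]

# Hypothetical criterion: full marks below 5 ms delay, zero above 50 ms.
delay_score = piecewise_linear(27.5, [(5, 100), (50, 0)])
```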
and step 4: in Q-learning, the Q value is the cumulative discounted reward obtained after taking an action in a certain system state and following the optimal action strategy thereafter:
Q(s,a) ← Q(s,a) + α[r + γ max_{a′} Q(s′,a′) − Q(s,a)].
the final overall score in step 3 was:
wherein i indexes the different priorities. The parameters in the scoring criterion can be adjusted for environmental factors and may need to be tuned to ensure good performance in different environments.
The reward function is defined as follows:
In step 4, r = rewardVal, α = 0.9 and γ = 0.4.
To prevent premature convergence of the algorithm to a local optimum in step 4, the ε-greedy method is used for decision making, which balances exploration against exploitation of the current strategy, i.e., balances maximizing reward from existing knowledge against trying new actions to gain unknown knowledge. A value ε is chosen: with probability ε the action with the optimal value in the Q-value table is selected, and with probability 1 − ε an action is selected at random.
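The ε-greedy rule as stated here (exploit with probability ε, explore with probability 1 − ε) can be sketched with a dict-backed Q table; all names are illustrative:

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """With probability epsilon exploit the best-known action in the
    Q table; with probability 1 - epsilon explore a random action."""
    if random.random() < epsilon:
        return max(actions, key=lambda a: q_table.get((state, a), 0.0))
    return random.choice(actions)
```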
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (8)
1. An SPMA protocol parameter optimization method based on reinforcement learning is characterized by comprising the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: according to the obtained current parameter state set and a preset scene, the obtained current parameter state set is brought into the preset scene, and the service time delay and the success rate of each priority of the SPMA protocol are obtained;
and (3) parameter grading step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
parameter optimization: updating the current parameter state according to a greedy strategy, selecting a new parameter set according to the maximum Q value in the Q value table with probability ε, selecting a parameter set at random with probability 1 − ε, and returning to the time delay and success rate obtaining step to continue execution;
the parameter scoring step comprises:
according to the obtained service time delay and success rate of each priority, calculating the total score according to the following formula:
wherein,
score represents total score;
determining whether the total score is greater than or equal to the target score G: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution.
2. The SPMA protocol parameter optimization method based on reinforcement learning of claim 1, wherein the parameter set selection step comprises:
the parameter sets include any one or any plurality of: each priority threshold value, life cycle, statistical window length and rollback window length of the SPMA protocol;
the Q value is the cumulative decay reward after action is taken in accordance with the action policy in the system state.
3. The SPMA protocol parameter optimization method based on reinforcement learning of claim 1, wherein the parameter optimization step:
selecting an action a from the action set A according to a greedy strategy, updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating a Q value table;
the current reward r is expressed as follows:
wherein,
score_before represents the total score obtained in the last iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_{a′} Q(s′,a′) − Q(s,a)]
wherein,
Q(s, a) represents the Q value under state s and action a at the current time;
α represents a learning rate;
gamma represents a decay value for a future reward;
max_{a′} Q(s′, a′) represents the maximum Q value obtainable in state s′ by selecting action a′.
4. The SPMA protocol parameter optimization method based on reinforcement learning of claim 3, wherein the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to a greedy strategy:
selecting, with probability ε, the action corresponding to the maximum Q value in the Q value table, namely selecting from the action set A the action a that can obtain the maximum Q value; or
selecting an action a from the action set A at random with probability 1 − ε;
wherein ε is greater than 0 and less than 1.
5. An SPMA protocol parameter optimization system based on reinforcement learning, comprising:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: according to the obtained current parameter state set and a preset scene, the obtained current parameter state set is brought into the preset scene, and the service time delay and the success rate of each priority of the SPMA protocol are obtained;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to a greedy strategy, selecting a new parameter set according to the maximum Q value in the Q value table with probability ε, selecting a parameter set at random with probability 1 − ε, and calling the time delay and success rate obtaining module;
the parameter scoring module:
according to the obtained service time delay and success rate of each priority, calculating the total score according to the following formula:
wherein,
score represents total score;
determining whether the total score is greater than or equal to the target score G: if yes, ending the process; otherwise, the parameter optimization module is called.
6. The reinforcement learning-based SPMA protocol parameter optimization system of claim 5, wherein the parameter set selection module:
the parameter sets include any one or any plurality of: each priority threshold value, life cycle, statistical window length and rollback window length of the SPMA protocol;
the Q value is the cumulative decay reward after action is taken in accordance with the action policy in the system state.
7. The reinforcement learning-based SPMA protocol parameter optimization system of claim 6, wherein the parameter optimization module:
selecting an action a from the action set A according to a greedy strategy, updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating a Q value table;
the current reward r is expressed as follows:
wherein,
score_before represents the total score obtained in the last iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_{a′} Q(s′,a′) − Q(s,a)]
wherein,
Q(s, a) represents the Q value under state s and action a at the current time;
α represents a learning rate;
gamma represents a decay value for a future reward;
max_{a′} Q(s′, a′) represents the maximum Q value obtainable by selecting action a′ in state s′;
the action set A is as follows:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to a greedy strategy:
selecting, with probability ε, the action corresponding to the maximum Q value in the Q value table, namely selecting from the action set A the action a that can obtain the maximum Q value; or
selecting an action a from the action set A at random with probability 1 − ε;
wherein ε is greater than 0 and less than 1.
8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method of any of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910229439.4A CN110049018B (en) | 2019-03-25 | 2019-03-25 | SPMA protocol parameter optimization method, system and medium based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110049018A CN110049018A (en) | 2019-07-23 |
CN110049018B true CN110049018B (en) | 2020-11-17 |
Family
ID=67275145
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910229439.4A Active CN110049018B (en) | 2019-03-25 | 2019-03-25 | SPMA protocol parameter optimization method, system and medium based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110049018B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110581810B (en) * | 2019-09-30 | 2023-03-24 | 湖南理工学院 | Data communication method, device, equipment and medium based on SPMA protocol |
CN113613339B (en) * | 2021-07-10 | 2023-10-17 | 西北农林科技大学 | Channel access method of multi-priority wireless terminal based on deep reinforcement learning |
CN115315020A (en) * | 2022-08-08 | 2022-11-08 | 重庆邮电大学 | Intelligent CSMA/CA (Carrier sense multiple Access/Carrier aggregation) backoff method based on IEEE (institute of Electrical and electronics Engineers) 802.15.4 protocol of differentiated services |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000010296A3 (en) * | 1998-08-12 | 2000-08-31 | Sc Wireless Inc | Method and apparatus for network control in communications networks |
CN105306176A (en) * | 2015-11-13 | 2016-02-03 | 南京邮电大学 | Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106954229A (en) * | 2017-03-09 | 2017-07-14 | 中国电子科技集团公司第二十研究所 | Hybrid channel loading statistical method based on SPMA |
CN109462858A (en) * | 2017-11-08 | 2019-03-12 | 北京邮电大学 | A kind of wireless sensor network parameter adaptive adjusting method |
Non-Patent Citations (1)
Title |
---|
Improvement and Research of an Ad Hoc Network Routing Protocol Based on Q-Learning; Liu Fen, Sui Tianyu, Wang Yequn; Computer & Digital Engineering; 2019-02-28; Vol. 47, No. 2; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | Age of information: A new metric for information freshness | |
CN109818865B (en) | SDN enhanced path boxing device and method | |
CN111867139B (en) | Deep neural network self-adaptive back-off strategy implementation method and system based on Q learning | |
CN110049018B (en) | SPMA protocol parameter optimization method, system and medium based on reinforcement learning | |
Soysal et al. | Age of information in G/G/1/1 systems | |
Li | Multi-agent Q-learning of channel selection in multi-user cognitive radio systems: A two by two case | |
CN112953830B (en) | Routing planning and scheduling method and device for flow frame in time-sensitive network | |
CN107948083B (en) | SDN data center congestion control method based on reinforcement learning | |
CN110968426B (en) | Edge cloud collaborative k-means clustering model optimization method based on online learning | |
CN108650131B (en) | Processing system for multi-controller deployment in SDN network | |
CN112637965B (en) | Game-based Q learning competition window adjusting method, system and medium | |
Swenson et al. | Distributed inertial best-response dynamics | |
CN113395723B (en) | 5G NR downlink scheduling delay optimization system based on reinforcement learning | |
KR20200081630A (en) | Method for allocating resource using machine learning in a wireless network and recording medium for performing the method | |
CN107835517B (en) | Long-distance CSMA/CA method with QoS guarantee | |
Gao et al. | Modeling and parameter optimization of statistical priority-based multiple access protocol | |
CN111200566A (en) | Network service flow information grooming method and electronic equipment | |
Jin et al. | Joint qos control and bitrate selection for video streaming based on multi-agent reinforcement learning | |
CN115665258A (en) | Deep reinforcement learning-based priority perception deployment method for multi-target service function chain | |
CN115118728A (en) | Ant colony algorithm-based edge load balancing task scheduling method | |
Chiariotti et al. | Age of information in multihop connections with tributary traffic and no preemption | |
Guo et al. | Optimal energy-efficient regular delivery of packets in cyber-physical systems | |
Coronado et al. | Ensuring QoS for IEEE 802.11 real-time communications using an AIFSN prediction scheme | |
KR101105693B1 (en) | Method of deciding dynamic sleep section for terminal in wireless access communication system | |
CN110996398A (en) | Wireless network resource scheduling method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||