CN110049018B - SPMA protocol parameter optimization method, system and medium based on reinforcement learning


Info

Publication number: CN110049018B
Application number: CN201910229439.4A
Authority: CN (China)
Prior art keywords: parameter, action, value, score, selecting
Other languages: Chinese (zh)
Other versions: CN110049018A
Inventors: 俞晖, 杨明, 高思颖, 卢超, 徐鹏杰
Assignees: China Spaceflight Electronic Technology Research Institute; Shanghai Jiaotong University
Application filed by China Spaceflight Electronic Technology Research Institute and Shanghai Jiaotong University
Priority/filing date: 2019-03-25
Publication of CN110049018A: 2019-07-23
Application granted; publication of CN110049018B: 2020-11-17
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/26: Special purpose or proprietary protocols or architectures

Abstract

The invention provides an SPMA protocol parameter optimization method, system and medium based on reinforcement learning, comprising the following steps: a parameter selection and division step: selecting a parameter set of the SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to a preset granularity, and obtaining a current parameter state set; a time delay and success rate obtaining step: applying the obtained current parameter state set to a preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol. The invention combines the SPMA protocol parameter optimization problem under different application scenes with a reinforcement learning algorithm; compared with the parameter selection method of the original SPMA communication system, it greatly simplifies the parameter calculation process, reaches the required performance indices more easily, completes the related settings of the SPMA protocol more effectively, and has wide application prospects.

Description

SPMA protocol parameter optimization method, system and medium based on reinforcement learning
Technical Field
The invention relates to the technical field of communication protocols, in particular to an SPMA protocol parameter optimization method, system and medium based on reinforcement learning.
Background
The SPMA (Statistical Priority-based Multiple Access) protocol is mainly directed at scenarios with high-priority, time-sensitive traffic. In order to meet the high real-time service requirements of different priorities, such as the cooperative targeting information transmission of TTNT, SPMA is adopted as the access protocol. This multiple access protocol based on priority probability statistics is composed of multiple priority queues, priority contention backoff windows, priority thresholds, channel occupancy statistics, transmitting and receiving antennas, and a corresponding distributed control algorithm. Services with different priorities correspond to different MAC-layer priority queues, and the channel occupancy statistic is obtained through interaction between the MAC layer and the physical layer to determine the sending of packets. The channel occupancy statistic measures the activity of the communication channel over a predetermined period of time, i.e., the degree of idleness of the communication channel within the set channel statistics window.
When a higher layer has a packet to transmit or a forwarded packet is received, the packet enters the corresponding priority queue according to a certain rule, and the channel occupancy statistic is then compared with the corresponding priority threshold: if the channel occupancy statistic is lower than the priority threshold, the packet is sent; if the channel occupancy statistic is higher than the priority threshold, the packet of that priority waits for a random backoff time, and after the backoff time counts down to zero, the node re-checks the channel occupancy statistic to decide whether to transmit. When higher-priority data arrives within the backoff time, the backoff timer is suspended and the channel occupancy statistic is immediately compared with the corresponding high-priority threshold to determine the transmission of the newly arrived high-priority packet. In the SPMA protocol, the main simulation parameters include the backoff window length, the channel statistics window length, and the priority thresholds. As for the correspondence between parameters and performance indices, the end-to-end single-hop transmission delay is related to the backoff window length, while the packet loss rate is related to the backoff window length, the statistics window length, and the setting of the priority thresholds.
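As an illustration only, the following is a minimal sketch of this per-packet transmission decision; the function and argument names (spma_send_decision, channel_occupancy, and so on) are hypothetical and not taken from the patent.

```python
import random

def spma_send_decision(channel_occupancy, priority_threshold, backoff_window):
    """Minimal sketch of the SPMA transmission decision described above.

    channel_occupancy  -- channel occupancy statistic over the statistics window
    priority_threshold -- threshold configured for this packet's priority
    backoff_window     -- backoff window length (in slots) for this priority
    """
    if channel_occupancy < priority_threshold:
        # Channel idle enough for this priority: transmit immediately.
        return ("send", 0)
    # Otherwise wait a random backoff time and re-check the statistic afterwards.
    backoff_slots = random.randint(1, backoff_window)
    return ("backoff", backoff_slots)
```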
A search of the relevant literature shows that Abdellatif Serhani, Najib Naja and Abdellah Jamali published "QLAR: A Q-learning based adaptive routing for MANETs" at AICCSA 2016 (IEEE/ACS 13th International Conference of Computer Systems and Applications). That article uses a reinforcement learning method to optimize multiple sets of mutually constrained parameters in a routing algorithm so as to select an optimal path for transmission. Similarly, in an SPMA communication system there are multiple sets of mutually constrained parameters that affect the system's performance indices to different degrees, and this relationship cannot be expressed concretely by a mathematical formula. For different application scenes, a randomly selected group of parameters often cannot reach the required performance indices, and the values of the parameter set cannot be given directly by a mathematical method, so a reinforcement learning method needs to be adopted to obtain the optimal parameter set suited to the current scene.
Patent document CN106954229A (application number: 201710136147.7) discloses a hybrid channel load statistics method based on SPMA, which obtains the channel load from the physical layer and from the network layer respectively; under light load the channel load statistic obtained by the physical layer is used, under heavy load the channel load statistic obtained by the network layer is used, and if the difference between the two statistical results exceeds a certain margin, the channel load statistic obtained by the network layer is calibrated.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an SPMA protocol parameter optimization method, system and medium based on reinforcement learning.
The invention provides an SPMA protocol parameter optimization method based on reinforcement learning, which comprises the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
a parameter optimization step: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and returning to the time delay and success rate obtaining step to continue execution.
Preferably, the parameter set selecting step:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Preferably, the parameter scoring step:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, entering a parameter optimization step to continue execution.
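For illustration, a sketch of how such a scoring step could be computed is given below. It assumes, following the reconstruction above, that each per-priority score is re-weighted against its own threshold and that the total is a plain sum; the weights and thresholds are scene-dependent inputs, and all names are hypothetical.

```python
def total_score(score_T, score_P, score_T_th, score_P_th, w_T1, w_T2, w_P1, w_P2):
    """Combine per-priority delay scores (score_T) and success-rate scores (score_P)
    into one total score; each score is amplified or attenuated by a scene-dependent
    weight depending on whether it exceeds its threshold."""
    total = 0.0
    for i in range(len(score_T)):
        t = score_T[i] * (w_T1 if score_T[i] > score_T_th[i] else w_T2)
        p = score_P[i] * (w_P1 if score_P[i] > score_P_th[i] else w_P2)
        total += t + p
    return total

# Example: two priorities, thresholds of 60, weights that favor already-good scores.
print(total_score([80, 50], [90, 40], [60, 60], [60, 60], 1.2, 0.8, 1.2, 0.8))
```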
Preferably, the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating a Q value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′.
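As a minimal illustration, this update can be written as a one-line tabular rule; the default α and γ below are the values used in the preferred example later in this description, and the data layout Q[state][action] is an assumption.

```python
def q_update(Q, s, a, r, s_next, alpha=0.9, gamma=0.4):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next])                        # max over all actions in state s'
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```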
Preferably, the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
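A sketch of this selection rule follows; note that, matching the convention of this document, ε is the probability of exploiting the Q-value table and 1-ε the probability of exploring at random. The function name and Q-table layout are illustrative.

```python
import random

def epsilon_greedy(Q, s, epsilon):
    """Pick an action index for state s: exploit with probability epsilon,
    explore uniformly at random with probability 1 - epsilon."""
    if random.random() < epsilon:
        return max(range(len(Q[s])), key=lambda a: Q[s][a])
    return random.randrange(len(Q[s]))
```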
The invention provides an SPMA protocol parameter optimization system based on reinforcement learning, which comprises:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and then calling the time delay and success rate obtaining module.
Preferably, the parameter set selecting module:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Preferably, the parameter scoring module:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, the parameter optimization module is called.
Preferably, the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
According to the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method described in any of the above.
Compared with the prior art, the invention has the following beneficial effects:
the invention combines the SPMA protocol parameter optimization problem under different application scenes with the reinforcement learning algorithm, greatly simplifies the parameter calculation process compared with the parameter selection method of the original SPMA communication system, is easier to reach the required performance index, can more effectively complete the related setting of the SPMA protocol, and has wide application prospect.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a specific process based on Q-learning according to the present invention;
FIG. 2 is a block diagram of the SPMA and reinforcement learning system provided by the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications obvious to those skilled in the art can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention provides an SPMA protocol parameter optimization method based on reinforcement learning, which comprises the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
a parameter optimization step: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and returning to the time delay and success rate obtaining step to continue execution.
Specifically, the parameter set selecting step:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Specifically, the parameter scoring step:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, entering a parameter optimization step to continue execution.
Specifically, the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
The SPMA protocol parameter optimization system based on reinforcement learning provided by the invention can be realized through the step flow of the SPMA protocol parameter optimization method based on reinforcement learning provided by the invention. Those skilled in the art can understand the reinforcement-learning-based SPMA protocol parameter optimization method as a preferred example of the reinforcement-learning-based SPMA protocol parameter optimization system.
The invention provides an SPMA protocol parameter optimization system based on reinforcement learning, which comprises:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and then calling the time delay and success rate obtaining module.
Specifically, the parameter set selection module:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Specifically, the parameter scoring module:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, the parameter optimization module is called.
Specifically, the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
According to the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method described in any of the above.
The present invention will be described more specifically below by way of preferred examples:
preferred example 1:
the invention provides a Q-learning method in reinforcement learning to learn a system environment, which divides parameters in SPMA into different states with certain granularity, and defines the learning action as the increase and decrease of the parameters, namely, the parameters are increased from the current state to the close state larger than the current parameters or are decreased to the close state smaller than the current parameters. Fig. 2 is a schematic diagram of a system block of the SPMA and reinforcement learning.
As shown in fig. 1, more specifically, the present invention comprises the steps of:
step 1: and selecting an SPMA parameter set, and dividing each parameter into different states according to certain granularity.
In the process of reinforcement learning, the other, non-optimized parameters of the system are considered unchanged, i.e., each learning iteration faces a static scene; only the set of specified parameters that need to be adjusted changes, and the actions taken by the system do not affect the current scene. In our experiment, we select each priority threshold, the life cycle, the statistics window length, and the backoff window length of SPMA as the parameter set, and divide them according to different granularities. The other, static parameters of the SPMA communication system are as follows:
Simulation unit time slot: 0.1 ms
Arrival rate of each priority's tasks (packets/s): [200, 1600] (Poisson distribution)
Task service rate of each priority: 1000
Simulation duration (s): 1
Step 2: the objective of the adaptive adjustment mechanism is to obtain parameter values adapted to the relevant scene. For the different granularities of different conditions, we set the adjustment precision to the granularity, that is, the adjustment step length of each action is θ_n, where n indexes the different parameters, and the action set A is:
A = {±θ_1, ±θ_2, …}
The learning action is defined here as an increase or decrease of a parameter, i.e., an increase from the current state to the adjacent state larger than the current parameter, or a decrease to the adjacent state smaller than the current parameter. We specify that when a parameter reaches its minimum value it is no longer decreased, and when it is at its maximum value it is no longer increased; when an action is determined, if performing the action would cause the result to violate the design rules, the action is reselected;
Step 3: the optimization goal of the system is the delay and success rate of each priority service, so the performance indices are T_delay^i and P_suc^i, where i denotes the priority, P_suc^i is the success rate of the ith priority service, and T_delay^i is the delay of the ith priority service. The invention designs a set of percentile scoring criteria according to which T_delay^i and P_suc^i are scored. The scoring mechanism is abstracted as a linear piecewise function: when P_suc^i is above its upper limit value or T_delay^i is below its lower limit value, the score is 100; when P_suc^i is below its lower limit value or T_delay^i is above its upper limit value, the score is 0; the range from 0 to 100 is divided into segments, and piecewise functions with different slopes are selected when P_suc^i and T_delay^i lie in different intervals. Moreover, the slopes of the piecewise functions also differ under different application environments and SPMA parameters.
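A simplified sketch of such a scoring function is shown below; it collapses the intermediate segments into a single linear ramp between a lower and an upper limit, whereas the patent allows several segments with scene-dependent slopes. The limits and names are illustrative only.

```python
def percentile_score(value, low, high, higher_is_better=True):
    """Map a performance value to a 0-100 score with a linear ramp between
    'low' and 'high' (success rate: higher is better; delay: lower is better)."""
    if not higher_is_better:
        # Delay-style metric: full score at/below 'low', zero at/above 'high'.
        value, low, high = -value, -high, -low
    if value >= high:
        return 100.0
    if value <= low:
        return 0.0
    return 100.0 * (value - low) / (high - low)

# Example: a success rate of 0.9 between limits 0.5 and 0.95, and a delay of 8 ms
# between limits 2 ms (full score) and 20 ms (zero score).
print(percentile_score(0.9, 0.5, 0.95), percentile_score(8.0, 2.0, 20.0, higher_is_better=False))
```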
A target score G is also set, and the Q-learning iteration may terminate once the iteration's score exceeds the target score. Here we manually set score thresholds score_T,th^i and score_P,th^i. When score_T^i and score_P^i are greater than the corresponding thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding thresholds:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
wherein the weights ω_T1, ω_T2, ω_P1, ω_P2 of score_T^i and score_P^i are determined by the static scenario faced by the iterative learning.
The final overall score is:
score = Σ_i (score_T^i + score_P^i)
The parameters in the scoring criteria are affected by environmental factors and may need to be adjusted for different environments to ensure good performance.
The reward function is defined as follows:
r = score - score_before
where score_before is the total score obtained in the previous iteration.
Step 4: in Q-learning, the Q value is the cumulative decayed reward obtained after taking actions according to the optimal action strategy from a certain system state. The Q function, as the criterion for action selection, is the final result that Q-learning needs to obtain. The invention models the Q-value function with a lookup table. The update process of the Q-value table is given by the following algorithm:
1. Given that the number of parameter states is M and the number of actions is N, initialize an M × N Q-value table (all elements initialized to 0);
2. Set the number of learning iterations (episodes), the learning rate α, the decay coefficient γ, and the exploration parameter ε;
3. For each episode:
(1) randomly select an initial state s = l_k, k ∈ [0, M-1], from the parameter state set;
(2) while the target score has not been reached (score < G), perform the following steps:
a. select an action a ∈ A in the current state s according to the ε-greedy method;
b. obtain the next state s′ according to the current action a (the current state s can only move up to the adjacent state s′ larger than the current parameter, or down to the adjacent state s′ smaller than the current parameter);
c. obtain the time delay and success rate of each priority service under the current parameters, and compute score_T^i and score_P^i from them;
d. obtain the current reward r (rewardVal) according to the total score;
e. update the Q-value table:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)];
f. update the current system state: s ← s′.
To prevent the algorithm from converging prematurely to a local optimum, the ε-greedy method is adopted for decision making, which balances exploration and exploitation of the current strategy, i.e., it balances maximizing the reward based on existing knowledge against trying new actions to gain unknown knowledge. We choose a certain value of ε, i.e., actions are chosen according to the optimal value of the Q-value table with probability ε, and chosen randomly with probability 1-ε.
Here α is the learning rate, generally α < 1, which determines how much of the current experience is learned when updating the Q table; γ is the decay value for future rewards: γ → 0 means the algorithm mainly considers the immediate reward, while γ → 1 means the algorithm also takes future rewards into account. In this embodiment, r is rewardVal, α = 0.9, and γ = 0.4.
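Putting the pieces above together, a compact sketch of the whole optimization loop could look as follows. For brevity only a single discretized parameter is shown, the SPMA simulation is abstracted into an evaluate() callback, and the reward is taken as the score improvement (one plausible reading of the reward definition above); all names are illustrative and not the patent's.

```python
import random

def optimize_spma_parameter(n_states, evaluate, target_score, episodes=100,
                            alpha=0.9, gamma=0.4, epsilon=0.8):
    """Q-learning over one discretized SPMA parameter.

    evaluate(state) -> total score of the SPMA simulation for that parameter state.
    Actions: 0 = step down one state, 1 = step up one state.
    """
    Q = [[0.0, 0.0] for _ in range(n_states)]            # M x 2 Q table, initialized to 0
    for _ in range(episodes):
        s = random.randrange(n_states)                    # random initial parameter state
        score = evaluate(s)
        while score < target_score:
            if random.random() < epsilon:                 # exploit with probability epsilon
                a = 0 if Q[s][0] >= Q[s][1] else 1
            else:                                         # explore with probability 1 - epsilon
                a = random.randrange(2)
            step = -1 if a == 0 else 1
            s_next = min(max(s + step, 0), n_states - 1)  # never move past the min/max state
            new_score = evaluate(s_next)
            r = new_score - score                         # reward = score improvement
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s, score = s_next, new_score
    return Q

# Toy usage: pretend the best parameter state is the middle one.
best = 10
Q = optimize_spma_parameter(21, lambda s: 100 - 4 * abs(s - best), target_score=99, episodes=20)
```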
Preferred example 2:
a multi-access protocol SPMA parameter optimization method based on reinforcement learning and based on priority probability statistics is characterized by comprising the following steps:
step 1: selecting each priority threshold, a life cycle, a statistical window length and a backspacing window length of the SPMA as parameter sets, dividing each parameter into different states with certain granularity, and outputting the parameter sets divided with certain granularity;
the division of each parameter into different states at a constant granularity means that the parameter is divided into different sections of a predetermined length at a constant size.
The granularity refers to the size of a division parameter set, for example, for 0 to 100, 10 is taken as the granularity, 10 states can be divided, and 20 is taken as the granularity, 5 states can be divided.
The states refer to the parameter set intervals after being divided, for example, 0 to 100 are divided into 5 states by taking 20 as granularity, the state 1 is 0 to 20, the state 2 is 20 to 40, and the like.
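The worked examples in the two preceding paragraphs can be written directly as a small helper (the names are illustrative):

```python
def discretize(lo, hi, granularity):
    """Split a parameter range [lo, hi) into states of the given granularity,
    e.g. discretize(0, 100, 20) -> [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]."""
    return [(v, v + granularity) for v in range(lo, hi, granularity)]

print(len(discretize(0, 100, 10)))   # 10 states
print(len(discretize(0, 100, 20)))   # 5 states
```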
Step 2: the adaptive adjustment mechanism aims at obtaining parameter values (the sizes of parameters such as the priority thresholds, the life cycle, the statistics window length, and so on) adapted to the relevant scene. For the different granularities of different conditions, the adjustment precision is set to the granularity, that is, the adjustment step length of each action is θ_n, where n indexes the different parameters, and the action set A is:
A = {±θ_1, ±θ_2, …}
we specify that when the parameter reaches a minimum value, it is no longer reduced, when the parameter is a maximum value, it is no longer increased, and when an action is determined, if doing so would cause the result to violate design rules, the action is reselected;
Step 3: the optimization goal of the system is to comprehensively consider the performance indices; a set of percentile scoring criteria is designed for the service time delay and transmission success rate of each priority, the service time delay and success rate are scored by linear piecewise functions, and the slopes of the piecewise functions differ under different application environments and SPMA parameters. A target score G is set, and the iteration may terminate when the Q-learning iteration score exceeds the target score;
Step 4: in Q-learning, the Q value is the cumulative decayed reward obtained after taking actions according to the optimal action strategy from a certain system state:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
The final overall score in step 3 is:
score = Σ_i (score_T^i + score_P^i)
wherein i indexes the different priorities; the parameters in the scoring criterion are affected by environmental factors and may need to be adjusted for different environments to ensure good performance.
The reward function is defined as follows:
r = score - score_before
where score_before is the total score obtained in the previous iteration.
In step 4, r is rewardVal, α = 0.9, and γ = 0.4.
To prevent the algorithm from converging prematurely to a local optimum in step 4, the ε-greedy method is used for decision making, which balances exploration and exploitation of the current strategy, i.e., it balances maximizing the reward based on existing knowledge against trying new actions to gain unknown knowledge. We choose a certain value of ε, i.e., actions are chosen according to the optimal value of the Q-value table with probability ε, and chosen randomly with probability 1-ε.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (8)

1. An SPMA protocol parameter optimization method based on reinforcement learning is characterized by comprising the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
a parameter optimization step: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and returning to the time delay and success rate obtaining step to continue execution;
the parameter scoring step comprises:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution.
2. The SPMA protocol parameter optimization method based on reinforcement learning of claim 1, wherein the parameter set selection step comprises:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
3. The SPMA protocol parameter optimization method based on reinforcement learning of claim 1, wherein the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′.
4. The SPMA protocol parameter optimization method based on reinforcement learning of claim 3, wherein the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
wherein θ_n represents the adjustment step size of each action, and n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy comprises:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
5. An SPMA protocol parameter optimization system based on reinforcement learning, comprising:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and calling the time delay and success rate obtaining module;
the parameter scoring module:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if yes, ending the process; otherwise, the parameter optimization module is called.
6. The reinforcement learning-based SPMA protocol parameter optimization system of claim 5, wherein the parameter set selection module:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
7. The reinforcement learning-based SPMA protocol parameter optimization system of claim 6, wherein the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
wherein θ_n represents the adjustment step size of each action, and n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy comprises:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method of any of claims 1 to 4.
CN201910229439.4A 2019-03-25 2019-03-25 SPMA protocol parameter optimization method, system and medium based on reinforcement learning Active CN110049018B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000010296A3 (en) * 1998-08-12 2000-08-31 Sc Wireless Inc Method and apparatus for network control in communications networks
CN105306176A (en) * 2015-11-13 2016-02-03 南京邮电大学 Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106954229A (en) * 2017-03-09 2017-07-14 中国电子科技集团公司第二十研究所 Hybrid channel loading statistical method based on SPMA
CN109462858A (en) * 2017-11-08 2019-03-12 北京邮电大学 A kind of wireless sensor network parameter adaptive adjusting method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000010296A3 (en) * 1998-08-12 2000-08-31 Sc Wireless Inc Method and apparatus for network control in communications networks
CN105306176A (en) * 2015-11-13 2016-02-03 南京邮电大学 Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improvement and Research of Ad Hoc Network Routing Protocol Based on Q-Learning; Liu Fen, Sui Tianyu, Wang Yequn; Computer & Digital Engineering; 2019-02-28; Vol. 47, No. 2; full text *


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant