CN110049018B - SPMA protocol parameter optimization method, system and medium based on reinforcement learning


Info

Publication number: CN110049018B
Application number: CN201910229439.4A
Authority: CN (China)
Prior art keywords: parameter, action, value, score, selecting
Other languages: Chinese (zh)
Other versions: CN110049018A
Inventors: 俞晖, 杨明, 高思颖, 卢超, 徐鹏杰
Assignees: China Spaceflight Electronic Technology Research Institute; Shanghai Jiaotong University
Application filed by China Spaceflight Electronic Technology Research Institute and Shanghai Jiaotong University
Priority/filing date: 2019-03-25
Publication of CN110049018A: 2019-07-23
Application granted; publication of CN110049018B: 2020-11-17
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/26: Special purpose or proprietary protocols or architectures

Abstract

The invention provides an SPMA protocol parameter optimization method, system and medium based on reinforcement learning, comprising the following steps: a parameter selection and division step: selecting a parameter set of the SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to a preset granularity, and obtaining a current parameter state set; a time delay and success rate obtaining step: applying the obtained current parameter state set to a preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol. The invention combines the SPMA protocol parameter optimization problem under different application scenes with a reinforcement learning algorithm; compared with the parameter selection method of the original SPMA communication system, it greatly simplifies the parameter calculation process, reaches the required performance indices more easily, completes the related settings of the SPMA protocol more effectively, and has wide application prospects.

Description

SPMA protocol parameter optimization method, system and medium based on reinforcement learning
Technical Field
The invention relates to the technical field of communication protocols, in particular to an SPMA protocol parameter optimization method, system and medium based on reinforcement learning.
Background
The SPMA (Statistical Priority-based Multiple Access) protocol is mainly directed at scenarios with high-priority, time-sensitive traffic. In order to meet the high real-time service requirements of different priorities, such as the cooperative targeting information transmission of TTNT, SPMA is adopted as the access protocol. This multiple access protocol based on priority probability statistics is composed of multiple priority queues, priority contention backoff windows, priority thresholds, channel occupancy statistics, transmitting and receiving antennas, and a corresponding distributed control algorithm. Services with different priorities correspond to different MAC-layer priority queues, and the channel occupancy statistic is obtained through interaction between the MAC layer and the physical layer to determine the sending of packets. The channel occupancy statistic measures the activity of the communication channel over a predetermined period of time, i.e., the degree of idleness of the communication channel within the set channel statistics window.
When a higher layer has a packet to transmit or a forwarded packet is received, the packet enters the corresponding priority queue according to a certain rule, and the channel occupancy statistic is then compared with the corresponding priority threshold: if the channel occupancy statistic is lower than the priority threshold, the packet is sent; if the channel occupancy statistic is higher than the priority threshold, the packet of that priority waits for a random backoff time, and after the backoff time counts down to zero, the node re-checks the channel occupancy statistic to decide whether to transmit. When higher-priority data arrives within the backoff time, the backoff timer is suspended and the channel occupancy statistic is immediately compared with the corresponding high-priority threshold to determine the transmission of the newly arrived high-priority packet. In the SPMA protocol, the main simulation parameters include the backoff window length, the channel statistics window length, and the priority thresholds. As for the correspondence between parameters and performance indices, the end-to-end single-hop transmission delay is related to the backoff window length, while the packet loss rate is related to the backoff window length, the statistics window length, and the setting of the priority thresholds.
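As an illustration only, the following is a minimal sketch of this per-packet transmission decision; the function and argument names (spma_send_decision, channel_occupancy, and so on) are hypothetical and not taken from the patent.

```python
import random

def spma_send_decision(channel_occupancy, priority_threshold, backoff_window):
    """Minimal sketch of the SPMA transmission decision described above.

    channel_occupancy  -- channel occupancy statistic over the statistics window
    priority_threshold -- threshold configured for this packet's priority
    backoff_window     -- backoff window length (in slots) for this priority
    """
    if channel_occupancy < priority_threshold:
        # Channel idle enough for this priority: transmit immediately.
        return ("send", 0)
    # Otherwise wait a random backoff time and re-check the statistic afterwards.
    backoff_slots = random.randint(1, backoff_window)
    return ("backoff", backoff_slots)
```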
A search of the relevant literature shows that Abdellatif Serhani, Najib Naja and Abdellah Jamali published "QLAR: A Q-learning based adaptive routing for MANETs" at AICCSA 2016 (IEEE/ACS 13th International Conference of Computer Systems and Applications). That article uses a reinforcement learning method to optimize multiple sets of mutually constrained parameters in a routing algorithm so as to select an optimal path for transmission. Similarly, in an SPMA communication system there are multiple sets of mutually constrained parameters that affect the system's performance indices to different degrees, and this relationship cannot be expressed concretely by a mathematical formula. For different application scenes, a randomly selected group of parameters often cannot reach the required performance indices, and the values of the parameter set cannot be given directly by a mathematical method, so a reinforcement learning method needs to be adopted to obtain the optimal parameter set suited to the current scene.
Patent document CN106954229A (application number: 201710136147.7) discloses a hybrid channel load statistics method based on SPMA, which obtains the channel load from the physical layer and from the network layer respectively; under light load the channel load statistic obtained by the physical layer is used, under heavy load the channel load statistic obtained by the network layer is used, and if the difference between the two statistical results exceeds a certain margin, the channel load statistic obtained by the network layer is calibrated.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an SPMA protocol parameter optimization method, system and medium based on reinforcement learning.
The invention provides an SPMA protocol parameter optimization method based on reinforcement learning, which comprises the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
a parameter optimization step: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and returning to the time delay and success rate obtaining step to continue execution.
Preferably, the parameter set selecting step:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Preferably, the parameter scoring step:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, entering a parameter optimization step to continue execution.
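For illustration, a sketch of how such a scoring step could be computed is given below. It assumes, following the reconstruction above, that each per-priority score is re-weighted against its own threshold and that the total is a plain sum; the weights and thresholds are scene-dependent inputs, and all names are hypothetical.

```python
def total_score(score_T, score_P, score_T_th, score_P_th, w_T1, w_T2, w_P1, w_P2):
    """Combine per-priority delay scores (score_T) and success-rate scores (score_P)
    into one total score; each score is amplified or attenuated by a scene-dependent
    weight depending on whether it exceeds its threshold."""
    total = 0.0
    for i in range(len(score_T)):
        t = score_T[i] * (w_T1 if score_T[i] > score_T_th[i] else w_T2)
        p = score_P[i] * (w_P1 if score_P[i] > score_P_th[i] else w_P2)
        total += t + p
    return total

# Example: two priorities, thresholds of 60, weights that favor already-good scores.
print(total_score([80, 50], [90, 40], [60, 60], [60, 60], 1.2, 0.8, 1.2, 0.8))
```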
Preferably, the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating a Q value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′.
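As a minimal illustration, this update can be written as a one-line tabular rule; the default α and γ below are the values used in the preferred example later in this description, and the data layout Q[state][action] is an assumption.

```python
def q_update(Q, s, a, r, s_next, alpha=0.9, gamma=0.4):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next])                        # max over all actions in state s'
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```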
Preferably, the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
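A sketch of this selection rule follows; note that, matching the convention of this document, ε is the probability of exploiting the Q-value table and 1-ε the probability of exploring at random. The function name and Q-table layout are illustrative.

```python
import random

def epsilon_greedy(Q, s, epsilon):
    """Pick an action index for state s: exploit with probability epsilon,
    explore uniformly at random with probability 1 - epsilon."""
    if random.random() < epsilon:
        return max(range(len(Q[s])), key=lambda a: Q[s][a])
    return random.randrange(len(Q[s]))
```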
The invention provides an SPMA protocol parameter optimization system based on reinforcement learning, which comprises:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and then calling the time delay and success rate obtaining module.
Preferably, the parameter set selecting module:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Preferably, the parameter scoring module:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, the parameter optimization module is called.
Preferably, the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
According to the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method described in any of the above.
Compared with the prior art, the invention has the following beneficial effects:
the invention combines the SPMA protocol parameter optimization problem under different application scenes with the reinforcement learning algorithm, greatly simplifies the parameter calculation process compared with the parameter selection method of the original SPMA communication system, is easier to reach the required performance index, can more effectively complete the related setting of the SPMA protocol, and has wide application prospect.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a specific process based on Q-learning according to the present invention;
FIG. 2 is a block diagram of the SPMA and reinforcement learning system provided by the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications obvious to those skilled in the art can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention provides an SPMA protocol parameter optimization method based on reinforcement learning, which comprises the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
a parameter optimization step: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and returning to the time delay and success rate obtaining step to continue execution.
Specifically, the parameter set selecting step:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Specifically, the parameter scoring step:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, entering a parameter optimization step to continue execution.
Specifically, the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
The SPMA protocol parameter optimization system based on reinforcement learning provided by the invention can be realized through the step flow of the SPMA protocol parameter optimization method based on reinforcement learning provided by the invention. Those skilled in the art can understand the reinforcement-learning-based SPMA protocol parameter optimization method as a preferred example of the reinforcement-learning-based SPMA protocol parameter optimization system.
The invention provides an SPMA protocol parameter optimization system based on reinforcement learning, which comprises:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and then calling the time delay and success rate obtaining module.
Specifically, the parameter set selection module:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
Specifically, the parameter scoring module:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if so, ending the flow; otherwise, the parameter optimization module is called.
Specifically, the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
θ_n represents the adjustment step size of each action;
n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
According to the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method described in any of the above.
The present invention will be described more specifically below by way of preferred examples:
preferred example 1:
the invention provides a Q-learning method in reinforcement learning to learn a system environment, which divides parameters in SPMA into different states with certain granularity, and defines the learning action as the increase and decrease of the parameters, namely, the parameters are increased from the current state to the close state larger than the current parameters or are decreased to the close state smaller than the current parameters. Fig. 2 is a schematic diagram of a system block of the SPMA and reinforcement learning.
As shown in fig. 1, more specifically, the present invention comprises the steps of:
step 1: and selecting an SPMA parameter set, and dividing each parameter into different states according to certain granularity.
In the process of reinforcement learning, the other, non-optimized parameters of the system are considered unchanged, i.e., each learning iteration faces a static scene; only the set of specified parameters that need to be adjusted changes, and the actions taken by the system do not affect the current scene. In our experiment, we select each priority threshold, the life cycle, the statistics window length, and the backoff window length of SPMA as the parameter set, and divide them according to different granularities. The other, static parameters of the SPMA communication system are as follows:
Simulation unit time slot: 0.1 ms
Arrival rate of each priority's tasks (packets/s): [200, 1600] (Poisson distribution)
Task service rate of each priority: 1000
Simulation duration (s): 1
Step 2: the objective of the adaptive adjustment mechanism is to obtain parameter values adapted to the relevant scene. For the different granularities of different conditions, we set the adjustment precision to the granularity, that is, the adjustment step length of each action is θ_n, where n indexes the different parameters, and the action set A is:
A = {±θ_1, ±θ_2, …}
The learning action is defined here as an increase or decrease of a parameter, i.e., an increase from the current state to the adjacent state larger than the current parameter, or a decrease to the adjacent state smaller than the current parameter. We specify that when a parameter reaches its minimum value it is no longer decreased, and when it is at its maximum value it is no longer increased; when an action is determined, if performing the action would cause the result to violate the design rules, the action is reselected;
Step 3: the optimization goal of the system is the delay and success rate of each priority service, so the performance indices are T_delay^i and P_suc^i, where i denotes the priority, P_suc^i is the success rate of the ith priority service, and T_delay^i is the delay of the ith priority service. The invention designs a set of percentile scoring criteria according to which T_delay^i and P_suc^i are scored. The scoring mechanism is abstracted as a linear piecewise function: when P_suc^i is above its upper limit value or T_delay^i is below its lower limit value, the score is 100; when P_suc^i is below its lower limit value or T_delay^i is above its upper limit value, the score is 0; the range from 0 to 100 is divided into segments, and piecewise functions with different slopes are selected when P_suc^i and T_delay^i lie in different intervals. Moreover, the slopes of the piecewise functions also differ under different application environments and SPMA parameters.
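A simplified sketch of such a scoring function is shown below; it collapses the intermediate segments into a single linear ramp between a lower and an upper limit, whereas the patent allows several segments with scene-dependent slopes. The limits and names are illustrative only.

```python
def percentile_score(value, low, high, higher_is_better=True):
    """Map a performance value to a 0-100 score with a linear ramp between
    'low' and 'high' (success rate: higher is better; delay: lower is better)."""
    if not higher_is_better:
        # Delay-style metric: full score at/below 'low', zero at/above 'high'.
        value, low, high = -value, -high, -low
    if value >= high:
        return 100.0
    if value <= low:
        return 0.0
    return 100.0 * (value - low) / (high - low)

# Example: a success rate of 0.9 between limits 0.5 and 0.95, and a delay of 8 ms
# between limits 2 ms (full score) and 20 ms (zero score).
print(percentile_score(0.9, 0.5, 0.95), percentile_score(8.0, 2.0, 20.0, higher_is_better=False))
```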
A target score G is also set, and the Q-learning iteration may terminate once the iteration's score exceeds the target score. Here we manually set score thresholds score_T,th^i and score_P,th^i. When score_T^i and score_P^i are greater than the corresponding thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding thresholds:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
wherein the weights ω_T1, ω_T2, ω_P1, ω_P2 of score_T^i and score_P^i are determined by the static scenario faced by the iterative learning.
The final overall score is:
score = Σ_i (score_T^i + score_P^i)
The parameters in the scoring criteria are affected by environmental factors and may need to be adjusted for different environments to ensure good performance.
The reward function is defined as follows:
r = score - score_before
where score_before is the total score obtained in the previous iteration.
Step 4: in Q-learning, the Q value is the cumulative decayed reward obtained after taking actions according to the optimal action strategy from a certain system state. The Q function, as the criterion for action selection, is the final result that Q-learning needs to obtain. The invention models the Q-value function with a lookup table. The update process of the Q-value table is given by the following algorithm:
1. Given that the number of parameter states is M and the number of actions is N, initialize an M × N Q-value table (all elements initialized to 0);
2. Set the number of learning iterations (episodes), the learning rate α, the decay coefficient γ, and the exploration parameter ε;
3. For each episode:
(1) randomly select an initial state s = l_k, k ∈ [0, M-1], from the parameter state set;
(2) while the target score has not been reached (score < G), perform the following steps:
a. select an action a ∈ A in the current state s according to the ε-greedy method;
b. obtain the next state s′ according to the current action a (the current state s can only move up to the adjacent state s′ larger than the current parameter, or down to the adjacent state s′ smaller than the current parameter);
c. obtain the time delay and success rate of each priority service under the current parameters, and compute score_T^i and score_P^i from them;
d. obtain the current reward r (rewardVal) according to the total score;
e. update the Q-value table:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)];
f. update the current system state: s ← s′.
To prevent the algorithm from converging prematurely to a local optimum, the ε-greedy method is adopted for decision making, which balances exploration and exploitation of the current strategy, i.e., it balances maximizing the reward based on existing knowledge against trying new actions to gain unknown knowledge. We choose a certain value of ε, i.e., actions are chosen according to the optimal value of the Q-value table with probability ε, and chosen randomly with probability 1-ε.
Here α is the learning rate, generally α < 1, which determines how much of the current experience is learned when updating the Q table; γ is the decay value for future rewards: γ → 0 means the algorithm mainly considers the immediate reward, while γ → 1 means the algorithm also takes future rewards into account. In this embodiment, r is rewardVal, α = 0.9, and γ = 0.4.
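Putting the pieces above together, a compact sketch of the whole optimization loop could look as follows. For brevity only a single discretized parameter is shown, the SPMA simulation is abstracted into an evaluate() callback, and the reward is taken as the score improvement (one plausible reading of the reward definition above); all names are illustrative and not the patent's.

```python
import random

def optimize_spma_parameter(n_states, evaluate, target_score, episodes=100,
                            alpha=0.9, gamma=0.4, epsilon=0.8):
    """Q-learning over one discretized SPMA parameter.

    evaluate(state) -> total score of the SPMA simulation for that parameter state.
    Actions: 0 = step down one state, 1 = step up one state.
    """
    Q = [[0.0, 0.0] for _ in range(n_states)]            # M x 2 Q table, initialized to 0
    for _ in range(episodes):
        s = random.randrange(n_states)                    # random initial parameter state
        score = evaluate(s)
        while score < target_score:
            if random.random() < epsilon:                 # exploit with probability epsilon
                a = 0 if Q[s][0] >= Q[s][1] else 1
            else:                                         # explore with probability 1 - epsilon
                a = random.randrange(2)
            step = -1 if a == 0 else 1
            s_next = min(max(s + step, 0), n_states - 1)  # never move past the min/max state
            new_score = evaluate(s_next)
            r = new_score - score                         # reward = score improvement
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s, score = s_next, new_score
    return Q

# Toy usage: pretend the best parameter state is the middle one.
best = 10
Q = optimize_spma_parameter(21, lambda s: 100 - 4 * abs(s - best), target_score=99, episodes=20)
```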
Preferred example 2:
a multi-access protocol SPMA parameter optimization method based on reinforcement learning and based on priority probability statistics is characterized by comprising the following steps:
step 1: selecting each priority threshold, a life cycle, a statistical window length and a backspacing window length of the SPMA as parameter sets, dividing each parameter into different states with certain granularity, and outputting the parameter sets divided with certain granularity;
the division of each parameter into different states at a constant granularity means that the parameter is divided into different sections of a predetermined length at a constant size.
The granularity refers to the size of a division parameter set, for example, for 0 to 100, 10 is taken as the granularity, 10 states can be divided, and 20 is taken as the granularity, 5 states can be divided.
The states refer to the parameter set intervals after being divided, for example, 0 to 100 are divided into 5 states by taking 20 as granularity, the state 1 is 0 to 20, the state 2 is 20 to 40, and the like.
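The worked examples in the two preceding paragraphs can be written directly as a small helper (the names are illustrative):

```python
def discretize(lo, hi, granularity):
    """Split a parameter range [lo, hi) into states of the given granularity,
    e.g. discretize(0, 100, 20) -> [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]."""
    return [(v, v + granularity) for v in range(lo, hi, granularity)]

print(len(discretize(0, 100, 10)))   # 10 states
print(len(discretize(0, 100, 20)))   # 5 states
```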
Step 2: the adaptive adjustment mechanism aims at obtaining parameter values (the sizes of parameters such as the priority thresholds, the life cycle, the statistics window length, and so on) adapted to the relevant scene. For the different granularities of different conditions, the adjustment precision is set to the granularity, that is, the adjustment step length of each action is θ_n, where n indexes the different parameters, and the action set A is:
A = {±θ_1, ±θ_2, …}
we specify that when the parameter reaches a minimum value, it is no longer reduced, when the parameter is a maximum value, it is no longer increased, and when an action is determined, if doing so would cause the result to violate design rules, the action is reselected;
Step 3: the optimization goal of the system is to comprehensively consider the performance indices; a set of percentile scoring criteria is designed for the service time delay and transmission success rate of each priority, the service time delay and success rate are scored by linear piecewise functions, and the slopes of the piecewise functions differ under different application environments and SPMA parameters. A target score G is set, and the iteration may terminate when the Q-learning iteration score exceeds the target score;
Step 4: in Q-learning, the Q value is the cumulative decayed reward obtained after taking actions according to the optimal action strategy from a certain system state:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
The final overall score in step 3 is:
score = Σ_i (score_T^i + score_P^i)
wherein i indexes the different priorities; the parameters in the scoring criterion are affected by environmental factors and may need to be adjusted for different environments to ensure good performance.
The reward function is defined as follows:
r = score - score_before
where score_before is the total score obtained in the previous iteration.
In step 4, r is rewardVal, α = 0.9, and γ = 0.4.
To prevent the algorithm from converging prematurely to a local optimum in step 4, the ε-greedy method is used for decision making, which balances exploration and exploitation of the current strategy, i.e., it balances maximizing the reward based on existing knowledge against trying new actions to gain unknown knowledge. We choose a certain value of ε, i.e., actions are chosen according to the optimal value of the Q-value table with probability ε, and chosen randomly with probability 1-ε.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (8)

1. An SPMA protocol parameter optimization method based on reinforcement learning is characterized by comprising the following steps:
parameter selection and division: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining step: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring step: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution;
a parameter optimization step: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and returning to the time delay and success rate obtaining step to continue execution;
the parameter scoring step comprises:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if yes, ending the process; otherwise, entering a parameter optimization step to continue execution.
2. The SPMA protocol parameter optimization method based on reinforcement learning of claim 1, wherein the parameter set selection step comprises:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
3. The SPMA protocol parameter optimization method based on reinforcement learning of claim 1, wherein the parameter optimization step:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′.
4. The SPMA protocol parameter optimization method based on reinforcement learning of claim 3, wherein the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
wherein θ_n represents the adjustment step size of each action, and n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy comprises:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
5. An SPMA protocol parameter optimization system based on reinforcement learning, comprising:
a parameter selection and division module: selecting a parameter set of an SPMA protocol, dividing each parameter in the parameter set into different current parameter states according to preset granularity, and obtaining a current parameter state set;
a time delay and success rate obtaining module: applying the obtained current parameter state set to the preset scene, and obtaining the service time delay and success rate of each priority of the SPMA protocol;
a parameter scoring module: according to the obtained time delay and success rate of each priority service, scoring by adopting a preset scoring criterion, and judging whether the preset scoring criterion is met: if yes, ending the process; otherwise, calling a parameter optimization module;
a parameter optimization module: updating the current parameter state according to an ε-greedy strategy, namely selecting a new parameter set according to the maximum Q value in the Q-value table with probability ε and selecting a parameter set at random with probability 1-ε, and calling the time delay and success rate obtaining module;
the parameter scoring module:
according to the obtained time delay T_delay^i and success rate P_suc^i of each priority service, the total score is calculated according to the following formulas:
when score_T^i and score_P^i are greater than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T1·score_T^i, and score_P^i is updated to ω_P1·score_P^i;
when score_T^i and score_P^i are less than the corresponding score thresholds score_T,th^i and score_P,th^i:
score_T^i is updated to ω_T2·score_T^i, and score_P^i is updated to ω_P2·score_P^i;
score = Σ_i (score_T^i + score_P^i)
wherein,
P_suc^i represents the success rate of the ith priority service;
T_delay^i represents the time delay of the ith priority service;
score represents the total score;
score_P^i represents the score of the success rate of the ith priority service;
score_T^i represents the score of the time delay of the ith priority service;
ω_P1 and ω_P2 are the weights of score_P^i, determined according to the preset scene;
ω_T1 and ω_T2 are the weights of score_T^i, determined according to the preset scene;
determining whether the total score is greater than or equal to the target score G: if yes, ending the process; otherwise, the parameter optimization module is called.
6. The reinforcement learning-based SPMA protocol parameter optimization system of claim 5, wherein the parameter set selection module:
the parameter sets include any one or any plurality of: each priority threshold, life cycle, statistics window length, and backoff window length of the SPMA protocol;
the Q value is the cumulative decayed reward obtained after taking actions in accordance with the action policy from the given system state.
7. The reinforcement learning-based SPMA protocol parameter optimization system of claim 6, wherein the parameter optimization module:
selecting an action a from the action set A according to an ε-greedy strategy, and updating the current parameter state to obtain an updated current parameter state, wherein the updated current parameter state is not less than a preset minimum value and not more than a preset maximum value;
obtaining the current reward r according to the obtained total score and updating the Q-value table;
the current reward r is expressed as follows:
r = score - score_before
wherein,
score_before represents the total score obtained in the previous iteration;
the Q-value table update formula is as follows:
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
wherein,
Q(s,a) represents the Q value for state s and action a at the current time;
α represents the learning rate;
γ represents the decay value for future rewards;
max_a′ Q(s′,a′) represents the maximum Q value that can be obtained in the state s′ by selecting action a′;
the action set A is:
A = {±θ_1, ±θ_2, …, ±θ_n}
wherein θ_n represents the adjustment step size of each action, and n indexes the different parameters;
selecting an action a from the action set A according to the ε-greedy strategy comprises:
with probability ε, selecting the action corresponding to the maximum Q value in the Q-value table, namely selecting from the action set A the action a that obtains the maximum Q value; or
with probability 1-ε, selecting an action a from the action set A at random;
wherein ε is greater than 0 and less than 1.
8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the reinforcement learning based SPMA protocol parameter optimization method of any of claims 1 to 4.
CN201910229439.4A 2019-03-25 2019-03-25 SPMA protocol parameter optimization method, system and medium based on reinforcement learning Active CN110049018B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000010296A3 (en) * 1998-08-12 2000-08-31 Sc Wireless Inc Method and apparatus for network control in communications networks
CN105306176A (en) * 2015-11-13 2016-02-03 南京邮电大学 Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106954229A (en) * 2017-03-09 2017-07-14 中国电子科技集团公司第二十研究所 Hybrid channel loading statistical method based on SPMA
CN109462858A (en) * 2017-11-08 2019-03-12 北京邮电大学 A kind of wireless sensor network parameter adaptive adjusting method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000010296A3 (en) * 1998-08-12 2000-08-31 Sc Wireless Inc Method and apparatus for network control in communications networks
CN105306176A (en) * 2015-11-13 2016-02-03 南京邮电大学 Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improvement and Research of Ad Hoc Network Routing Protocol Based on Q-Learning; Liu Fen, Sui Tianyu, Wang Yequn; Computer & Digital Engineering; 2019-02-28; Vol. 47, No. 2; full text *


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant