Disclosure of Invention
To address the above problems, the invention provides a vehicle networking cloud computing resource optimization method based on reinforcement learning.
The invention discloses a vehicle networking cloud computing resource optimization method based on reinforcement learning, which specifically comprises the following steps of:
A. modeling the resource allocation problem of the Internet of vehicles system as a Semi Markov Decision Process (SMDP), and introducing a resource reservation strategy and a resource secondary allocation mechanism.
The system is set to have M virtual units (VUs) and two kinds of service requests: important service requests and ordinary service requests. The number of VUs that can be allocated to a service request is l, where l ∈ {1, 2, ..., L} and L ≤ M. Assuming that the arrivals of important and ordinary requests follow Poisson processes with average rates λ_p and λ_q respectively, the processing time of a request follows an exponential distribution whose average departure rate λ_l is a function of the number l of VUs allocated to the service, λ_l = l + 1; 1/λ_l denotes the average processing time of a service to which l VUs are assigned.
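The arrival and service model above can be sketched as a minimal simulation, assuming the stated Poisson arrivals and the departure-rate function λ_l = l + 1; all parameter values below are illustrative, not the invention's:

```python
import random

def departure_rate(l):
    # Average departure rate as a function of the number of allocated VUs:
    # lambda_l = l + 1, so the mean processing time is 1 / (l + 1).
    return l + 1

def sample_interarrival(lam, rng):
    # Poisson arrivals imply exponential inter-arrival times with mean 1/lam.
    return rng.expovariate(lam)

def sample_processing_time(l, rng):
    # Processing time is exponential with rate lambda_l.
    return rng.expovariate(departure_rate(l))

rng = random.Random(0)
lam_p, lam_q = 2.0, 3.0  # illustrative mean arrival rates (important / ordinary)
t_arrival = sample_interarrival(lam_p, rng)
t_service = sample_processing_time(4, rng)  # a service holding l = 4 VUs
```

Because both distributions are exponential, the whole system evolves as a Markovian queue, which is what makes the SMDP formulation below applicable.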
A part of the resources is reserved according to the reservation ratio Th and is dedicated exclusively to important service requests.
The system state S is described by the event of the vehicle service request, namely arrival or departure, and by the number of VUs occupied by the different kinds of services in the system. s_pi and s_qj respectively denote the number of important service requests occupying i VUs and the number of ordinary service requests occupying j VUs in the system, where i ∈ {n_p, n_p+1, ..., L_p}, j ∈ {n_q, n_q+1, ..., L_q}, and n_p and n_q denote the minimum number of VUs assigned to important and ordinary service requests, respectively. Two events e are defined:
1) Arrival of important and ordinary vehicle service requests, e_ar, denoted respectively by the arrival events of the two request types.
2) Departure of a service request, e_d, denoted respectively by the departure of an important service occupying i VUs and of an ordinary service occupying j VUs.
Thus e ∈ {e_ar, e_d} denotes the total event set.
At a fixed time interval τ_int, if no event occurs, secondary resource allocation is performed on the services in the cloud according to the resource usage in the cloud, increasing their resource occupancy; timeout = 0 indicates that an event occurred within the fixed time interval, and timeout = 1 otherwise. The system state is represented as follows:

s_ar = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_ar>} (1)

s_d = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_d, timeout>} (2)

S ∈ {s_ar, s_d}, where s_ar denotes the state when a service arrival event occurs and s_d the state when a service departure event occurs.
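The state vectors of Eqs. (1)-(2) can be held in plain records; a minimal sketch (the field layout is an assumption for illustration, since part of the notation is elided in the source):

```python
from collections import namedtuple

# s_p maps i -> s_pi (important services occupying i VUs);
# s_q maps j -> s_qj (ordinary services occupying j VUs).
ArrivalState = namedtuple("ArrivalState", ["s_p", "s_q", "event"])
DepartureState = namedtuple("DepartureState", ["s_p", "s_q", "event", "timeout"])

s_example_ar = ArrivalState(s_p={2: 1, 3: 0}, s_q={1: 2}, event="e_ar")
s_example_d = DepartureState(s_p={2: 1}, s_q={1: 1}, event="e_d", timeout=0)
```

Only the departure-type state carries the timeout flag, matching Eq. (2).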
After receiving a vehicle service request, if the system decides to process the request immediately, it assigns l VUs to the request. The action corresponding to accepting an important service request and the action corresponding to accepting an ordinary service request are defined on the system states at the arrival of important and ordinary service requests, respectively.
When the system encounters a special state, if it chooses to accept the service request, it executes the contraction special action of the resource secondary allocation mechanism: to satisfy the minimum VU requirement of the current service request, one or more services are selected from the running services according to the system state, and part of their resources are released. One action releases l VUs from a running important service occupying i VUs, and another releases l VUs from a running ordinary service occupying j VUs, while simultaneously the new service request is accepted. If a vehicle service request is refused based on the system state and the long-term revenue of the system, it is not assigned any VUs, and the corresponding action is a(s_ar) = 0.
When a vehicle service request in the cloud leaves, the occupied VUs are released, and the corresponding action is a(s_d) = -1.
When some special states are encountered, thresholds on the amount of idle resources determine when important and ordinary services may execute special actions. If the system chooses to secondarily allocate the resources occupied by running services, it adopts the expansion special action of the resource secondary allocation mechanism: one action adds l VUs to a running important service occupying i VUs, and another adds l VUs to a running ordinary service occupying j VUs.
The set of actions of the adaptive VU assignment model is:
the overall system revenue is considered as z (s, a), which includes three categories of revenue, cost and additional cost:
z(s,a)=x(s,a)-y(s,a)-ext(s,a),e∈{e ar ,e d } (6)
where x (s, a) refers to revenue obtained from vehicle users when satisfying a vehicle service request, y (s, a) refers to system cost generated by evaluating the number of VUs used by the service, and ext (s, a) refers to additional cost incurred by the resource secondary allocation mechanism.
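The decomposition of Eq. (6) can be sketched directly. The stub income term below (a per-VU payment r_v, an immediate-processing reward R, a rejection cost) only loosely mirrors the revenue components described later; all constants and the exact functional form are illustrative assumptions, not the invention's formulas:

```python
def income(accepted, important, l, R=5.0, r_v=1.0, c_rej=2.0):
    # Hypothetical x(s, a): a per-VU payment r_v, an extra reward R when an
    # important request is served immediately, and a cost when rejecting.
    if not accepted:
        return -c_rej
    return l * r_v + (R if important else 0.0)

def total_revenue(x, y, ext):
    # z(s, a) = x(s, a) - y(s, a) - ext(s, a)   (Eq. 6)
    return x - y - ext
```

For instance, immediately serving an important request with l = 4 VUs yields x = 9.0 under these placeholder constants.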
The revenue is expressed as:
where R is the reward obtained by immediately processing an important service request; I_v is the profit evaluated through the change in the user's QoE and the system's QoS (if the request is processed immediately, I_v takes a fixed value according to the increase in QoE and QoS); r_v is the payment made by the user for each VU the system allocates; two further cost terms represent the costs of rejecting important and ordinary services, with corresponding weight factors for the two service kinds; another term adds a basic reward for allocating l resources; and r_Th is the converted reward obtained when the system reserves part of the resources for important service requests, improving the QoE of the important services.
The system cost is represented by the following equation:
y(s,a)=t(s,a)h(s,a),a∈a(s) (8)
where t(s, a) represents the average expected time from the moment the system makes decision a(s) in the current state s until the next state, and h(s, a) represents the system service loss over the average expected time t(s, a), expressed as:

where c_v denotes the cost of occupying one VU.
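Eqs. (8)-(9) can be sketched as follows, assuming (as one plausible reading, since the exact Eq. (9) is elided in the source) that the loss rate h(s, a) charges c_v per occupied VU:

```python
def occupancy_loss_rate(total_occupied_vus, c_v=0.5):
    # Assumed form of h(s, a): loss accrues at c_v per occupied VU
    # (the exact Eq. (9) is elided in the source, so this is a placeholder).
    return c_v * total_occupied_vus

def system_cost(t, h):
    # y(s, a) = t(s, a) * h(s, a): expected sojourn time times loss rate (Eq. 8).
    return t * h
```

So a state holding 10 VUs for an expected 2.0 time units would cost y = 10.0 under these illustrative values.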
In addition, since resource expansion of a running service has a certain influence on the future long-term yield of the system, it brings some loss; likewise, reducing the resources occupied by a service in the cloud causes a loss that is proportional to the importance of the service and inversely proportional to the amount of resources the service occupies. When the reservation policy is executed and a special state is encountered, the acceptance probability of a newly arrived ordinary service request is reduced, which lowers its QoE; this quantified cost is proportional to the number of reserved resources. Thus, the additional cost can be expressed as:
where one term denotes the cost of reducing a service by l VUs; weight factors represent important and ordinary services, respectively; and c_Th represents the cost of implementing the reservation policy, which reduces the QoE of ordinary services.
B. The model is solved using reinforcement learning.
The reinforcement learning algorithm solves the Bellman optimality equation by asynchronous iteration to obtain the optimal strategy; when the controller cannot obtain the state transition probabilities of the environment, the following action value function is iteratively updated to obtain an approximately optimal strategy.
In the first case,

and in the second case,
where β_n denotes the learning rate at the nth decision step and γ denotes a discount factor, 0 < γ < 1; β_n is a small positive value, β_n < 1, which takes an appropriate value and decreases continuously during the learning process to avoid non-convergence.
If each state-action pair in the environment can be visited infinitely many times, then when the action value function converges, the optimal action value function is obtained, where π represents the strategy.
The optimal strategy π* can be obtained from the optimal action value function; it represents the probability of taking a certain action to accomplish the goal, as shown in the following formula.

The controller selects the action to take in the current system state with an ε-greedy exploration strategy: with probability 1 - ε it selects the action with the maximum action value, and with probability ε it selects an action at random from the remaining actions, each with probability ε/|a(s)|. DCM is used to gradually attenuate β_n and ε_n so that the algorithm finally converges.
Further, a hierarchical architecture is used for the joint adaptive optimal allocation of communication and computing resources in edge computing: the upper layer is an SMDP-based decision mechanism for accepting or rejecting service requests, the lower layer is an MDP-based spectrum resource reallocation mechanism, and hierarchical reinforcement learning is used to solve it.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention firstly provides an adaptive resource allocation model based on the SMDP, adds a reservation strategy and a resource secondary allocation mechanism on the basis, and solves the problem by using the reinforcement learning based on the model. Compared with a greedy algorithm, the model-based reinforcement learning algorithm can obtain an adaptive resource allocation strategy. And the system performance is improved by introducing a secondary distribution mechanism, the rejection rate of service requests is reduced, and more system benefits are obtained. The reservation policy also improves QoE of important service users.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
The present invention uses Virtual Units (VUs) to represent the smallest unit of resources in an overall vehicle cloud system, including the computational and storage resources required to process vehicle service requests in the vehicle cloud system.
Fig. 1 shows a vehicle cloud system with a resource reservation policy and a resource secondary allocation mechanism. When the service request arrives and the system does not have available VUs for distribution, the system can select part of services from the cloud at the moment, release a small amount of resources to distribute to the newly arrived service request and meet the requirement of the newly arrived service. Sometimes, when a large amount of free resources exist in the cloud, the resource occupation amount of part of services in the system is increased, and the QoE is improved. And when the system reserves a large number of VUs to guarantee important service requests, subsequent ordinary service requests are rejected due to lack of available VUs. If there are too few VUs reserved, it is difficult to meet the QoE of important service requests. Both of these situations reduce the long term yield of the overall system. Therefore, how to improve the long-term overall yield of the system, the QoS of the vehicle cloud system and the QoE of the vehicle users by adaptively adjusting the number of reserved VUs and the resource occupancy of the services running in the cloud according to the system environment is a major issue of the research herein.
Reinforcement learning concerns how an agent acts in a dynamic environment to maximize the long-term average cumulative reward defined by a goal. Reinforcement learning is an MDP process that aims to obtain an optimal strategy mapping the current environment state to the actions that can be taken. The agent does not need a complete dynamic model of the environment, which avoids the strong assumptions required by traditional methods, assumptions that are often inaccurate in real environments.
The VU allocation task is matched to the agent-environment framework: the agent is regarded as the controller in the vehicle cloud system; the states are service arrival, service departure and the VU resource allocation situation; the actions are to accept, reject, or assign different numbers of VUs. By exploring the environment, the controller continuously interacts with it and dynamically improves the resource allocation strategy, finally obtaining the optimal strategy for VU resource allocation. Therefore, the resource allocation problem of the Internet of Vehicles system is modeled as a Semi-Markov Decision Process (SMDP), and a resource reservation strategy and a resource secondary allocation mechanism are introduced.
The system is set to have M VUs and two kinds of service requests: important service requests and ordinary service requests. The number of VUs that can be allocated to a service request is l, where l ∈ {1, 2, ..., L} and L ≤ M. Assuming that the arrivals of important and ordinary requests follow Poisson processes with average rates λ_p and λ_q respectively, the processing time of a request follows an exponential distribution whose average departure rate λ_l is a function of the number l of VUs allocated to the service, λ_l = l + 1; 1/λ_l denotes the average processing time of a service to which l VUs are allocated.
To satisfy higher-priority important service requests as far as possible, part of the resources is reserved according to the reservation ratio Th and dedicated to important service requests. The reservation ratio is the ratio of the reserved resources (M·Th)/(1+Th) to the remaining resources M/(1+Th). Because service arrivals in the vehicle cloud environment change dynamically, the number of resources to reserve is difficult to predict when executing the important-request protection strategy: when too many resources are reserved, VU utilization drops and the rejection rate of newly arrived ordinary service requests rises; when too few are reserved, important service requests cannot be protected. The ratio Th is therefore adaptively adjusted according to changes in the environment.
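The split given by the reservation ratio Th can be computed directly; a small helper following the stated formulas (M·Th)/(1+Th) and M/(1+Th), with illustrative values:

```python
def reserved_split(M, Th):
    # Reserved resources (M * Th) / (1 + Th) are dedicated to important
    # requests; the remaining M / (1 + Th) are shared by all requests.
    reserved = M * Th / (1 + Th)
    remaining = M / (1 + Th)
    return reserved, remaining

res, rem = reserved_split(M=100, Th=0.25)  # e.g. 20 reserved, 80 shared
```

Note the two parts always sum to M, so adapting Th only shifts the boundary, never the total capacity.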
The system state S is described by the event of the vehicle service request (arrival or departure) and by the number of VUs occupied by the different kinds of services in the system. s_pi and s_qj respectively denote the number of important service requests occupying i VUs and the number of ordinary service requests occupying j VUs in the system, where i ∈ {n_p, n_p+1, ..., L_p}, j ∈ {n_q, n_q+1, ..., L_q}, and n_p and n_q denote the minimum number of VUs assigned to important and ordinary service requests, respectively. Two events e are defined:
1) Arrival of important and ordinary vehicle service requests, e_ar, denoted respectively by the arrival events of the two request types.
2) Departure of a service request, e_d, denoted respectively by the departure of an important service occupying i VUs and of an ordinary service occupying j VUs.
Thus e ∈ {e_ar, e_d} denotes the total event set.
At a fixed time interval τ_int, if no event occurs, secondary resource allocation is performed on the services in the cloud according to the resource usage in the cloud, increasing their resource occupancy; timeout = 0 indicates that an event occurred within the fixed time interval, and timeout = 1 otherwise. The system state is represented as follows:

s_ar = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_ar>} (1)

s_d = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_d, timeout>} (2)

S ∈ {s_ar, s_d}, where s_ar denotes the state when a service arrival event occurs and s_d the state when a service departure event occurs.
After receiving a vehicle service request, if the system decides to process the request immediately, it assigns l VUs to the request. The action corresponding to accepting an important service request is defined for i ∈ {n_p, n_p+1, ..., L_p}, and the action corresponding to accepting an ordinary service request for j ∈ {n_q, n_q+1, ..., L_q}, where the respective system states at the arrival of important and ordinary service requests are as defined above.
When the system encounters a special state, if it chooses to accept the service request, it executes the contraction special action of the resource secondary allocation mechanism: to satisfy the minimum VU requirement of the current service request, one or more services are selected from the running services according to the system state, and part of their resources are released. One action releases l VUs from a running important service occupying i VUs, and another releases l VUs from a running ordinary service occupying j VUs, while simultaneously the new service request is accepted. If a vehicle service request is refused based on the system state and the long-term revenue of the system, it is not assigned any VUs, and the corresponding action is a(s_ar) = 0.

When a vehicle service request in the cloud leaves, the occupied VUs are released, and the corresponding action is a(s_d) = -1.
When some special states are encountered, thresholds on the amount of idle resources determine when important and ordinary services may execute special actions. If the system chooses to secondarily allocate the resources occupied by running services, it adopts the expansion special action of the resource secondary allocation mechanism: one action adds l VUs to a running important service occupying i VUs, and another adds l VUs to a running ordinary service occupying j VUs.
The set of actions for the adaptive VU assignment model is:
the revenue of the whole system is considered as z (s, a), and includes three categories of revenue, cost and extra cost:
z(s,a)=x(s,a)-y(s,a)-ext(s,a),e∈{e ar ,e d } (6)
where x (s, a) refers to revenue obtained from vehicle users when satisfying a vehicle service request, y (s, a) refers to system cost generated by evaluating the number of VUs used by the service, and ext (s, a) refers to additional cost incurred by the resource secondary allocation mechanism.
The system revenue x (s, a) generated by evaluating the number of VUs used by a service should take into account the following factors: the vehicle user pays for using the cloud resources; individual rewards obtained by immediately processing important service requests; the service request occupies the cost of the VU; quantified user QoE and system QoS (due to immediate processing of requests) improvement; the reward caused by the resource expansion of the running service is proportional to the importance of the service and inversely proportional to the amount of resources occupied by the running service. Thus, revenue is expressed as:
where R is the reward obtained by immediately processing an important service request; I_v is the profit evaluated through the change in the user's QoE and the system's QoS (if the request is processed immediately, I_v takes a fixed value according to the increase in QoE and QoS); r_v is the payment made by the user for each VU the system allocates; two further cost terms represent the costs of rejecting important and ordinary services, with corresponding weight factors for the two service kinds; another term adds a basic reward for allocating l resources; and r_Th is the converted reward obtained when the system reserves part of the resources for important service requests, improving the QoE of the important services.
The system cost is represented by:
y(s,a)=t(s,a)h(s,a),a∈a(s) (8)
where t(s, a) represents the average expected time from the moment the system makes decision a(s) in the current state s until the next state, and h(s, a) represents the system service loss over the average expected time t(s, a), expressed as:

where c_v denotes the cost of occupying one VU.
In addition, since resource expansion of a running service has a certain influence on the future long-term yield of the system, it brings some loss; likewise, reducing the resources occupied by a service in the cloud causes a loss that is proportional to the importance of the service and inversely proportional to the amount of resources the service occupies. When the reservation policy is executed and a special state is encountered, the acceptance probability of a newly arrived ordinary service request is reduced, which lowers its QoE; this quantified cost is proportional to the number of reserved resources. Thus, the additional cost can be expressed as:
where one term denotes the cost of reducing a service by l VUs; weight factors represent important and ordinary services, respectively; and c_Th represents the cost of implementing the reservation policy, which reduces the QoE of ordinary services.
The model is solved using reinforcement learning, where Model(s, a) represents the MDP process of the environment: a statistical estimate of the environment's state transition probabilities. Model(s, a) is updated with real experience obtained from the environment. A tabular model is used in this chapter; every time the controller obtains a real experience, it puts the 4-tuple <S_t, A_t, S_{t+1}, R_{t+1}> into Model(s, a).
The model-based reinforcement learning process is illustrated in Fig. 2. The agent learns online and continuously interacts with the environment; the actual experience obtained is used both to perform reinforcement learning directly, improving the action value function and the strategy, and to improve the model so that it more accurately matches the current environment. Meanwhile, the agent interacts with the model and applies reinforcement learning to the simulated experience the model generates, learning indirectly to improve the action value function and strategy and to accelerate the learning process. The direct and indirect learning processes run in parallel. By learning from the model, the agent gains a deeper understanding of the environment rather than being limited to maximizing system rewards, and acquires a certain reasoning ability.
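The parallel direct and indirect learning described above is the Dyna-style scheme; a minimal tabular sketch, with illustrative states, rewards, and hyperparameters:

```python
import random

def dyna_q_step(q, model, s, a, r, s_next, rng,
                beta=0.5, gamma=0.9, planning_steps=5):
    """One Dyna-style step: learn directly from a real experience,
    record it in the tabular model, then plan from simulated experience."""
    def update(state, action, reward, next_state):
        best = max(q[next_state].values()) if q.get(next_state) else 0.0
        q.setdefault(state, {}).setdefault(action, 0.0)
        q[state][action] += beta * (reward + gamma * best - q[state][action])

    update(s, a, r, s_next)          # direct reinforcement learning
    model[(s, a)] = (r, s_next)      # store the 4-tuple in Model(s, a)
    for _ in range(planning_steps):  # indirect learning from the model
        (ps, pa), (pr, pn) = rng.choice(list(model.items()))
        update(ps, pa, pr, pn)

q, model = {}, {}
rng = random.Random(0)
dyna_q_step(q, model, s="s0", a="accept", r=1.0, s_next="s0", rng=rng)
```

Each planning step replays a stored transition, so the value estimate converges faster than with direct experience alone.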
The reinforcement learning algorithm solves the Bellman optimality equation by asynchronous iteration to obtain the optimal strategy; when the controller cannot obtain the state transition probabilities of the environment, the following action value function is iteratively updated to obtain an approximately optimal strategy.
In the first case,

and in the second case,
where β_n denotes the learning rate at the nth decision step and γ denotes a discount factor, 0 < γ < 1; β_n is a small positive value, β_n < 1, which takes an appropriate value and decreases continuously during the learning process to avoid non-convergence.
If each state-action pair in the environment can be visited infinitely many times, then when the action value function converges, the optimal action value function is obtained, where π represents the strategy.
The optimal strategy π* can be obtained from the optimal action value function; it represents the probability of taking a certain action to accomplish the goal, as shown in the following formula.

The controller selects the action to take in the current system state with an ε-greedy exploration strategy: with probability 1 - ε it selects the action with the maximum action value, and with probability ε it selects an action at random from the remaining actions, each with probability ε/|a(s)|. DCM is used to gradually attenuate β_n and ε_n so that the algorithm finally converges. See Algorithm 1 for details.
Communication and computing resource joint allocation theoretical analysis
Fig. 3 shows the offloading of a mobile vehicle service request to an edge server in an edge computing scenario. The method comprises two processes: 1) the mobile device initiating the service request connects to the RSU through the wireless network and transmits data, with spectrum resources (the number of subcarriers) dynamically allocated according to the system state and channel state information; 2) the VU resources required to execute the service are allocated in the RSU, and after execution finishes, the result is returned to the mobile user.
Since all users share the whole wireless spectrum resource, it is divided into a large number of mutually orthogonal subcarriers, and the system allocates different numbers of subcarriers to the various services. Data transmission adopts time-division multiplexing, with each mobile user occupying different time slots for service transmission. The number of subcarriers affects the transmission rate: the total rate is the rate of a single subcarrier multiplied by the total number of subcarriers, so the more subcarriers allocated, the higher the transmission rate. The time-varying characteristic of the wireless channel reduces the reliability of services in transmission and, in severe cases, may interrupt transmission, affecting the QoS of the system and the QoE of users. To compensate for the channel's time-varying characteristic, the system redistributes the number of subcarriers occupied by the various services according to the channel state information.
Considering that there are multiple services in the system, type ∈ {c_1, c_2, ..., c_l, ..., c_L}, where c_l denotes a service type and the priority of the services increases with l; the requirements on quality of service likewise increase with priority.
Assume a given total number of computing-resource VUs in the edge server. Denote by one quantity the number of c_l-type services occupying j VUs, and by another the minimum VU computation requirement of a c_l-type service. The aim of the virtual resource VUs in the edge server is to improve QoS/QoE (by increasing the number of allocated VUs) as far as possible while satisfying the various service requirements, increasing the overall benefit of the system. Therefore, the VU allocation optimization problem is expressed as maximizing the system computational yield:
Assume a given total bandwidth of the wireless spectrum and a subcarrier bandwidth B_subc; the number of subcarriers is the total bandwidth divided by B_subc, rounded down (the symbol ⌊·⌋ means rounding down). All users share the spectrum resources, and the multiple-access technique allows multiple users to share the limited radio spectrum simultaneously without causing severe interference (collision). Denote by a further quantity the number of c_l-type services occupying i subcarriers. According to the Shannon theorem, the channel capacity is
C = W log2(1 + SNR) (19)
where W is the allocated bandwidth expressed via the number of subcarriers, i.e., W = iB, i = 1, 2, ...; SNR denotes the signal-to-noise ratio,

where P_c is the transmit power, σ² the noise power, m(t) the large-scale fading component, and h(t) the small-scale fading component. In the large-scale fading model, the time constant associated with fading variation as the mobile device moves is very large, on the order of seconds or minutes. The small-scale fading propagation model characterizes rapid fluctuations of the received signal strength over short distances or short times. In this subsection, only the short-time time-varying characteristic of the wireless channel is considered; since the mobile device moves only a short distance in a short time, the large-scale fading is approximated as a constant in this chapter.
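The relation of Eq. (19) with W = iB can be checked numerically; SNR here is a linear ratio, and the subcarrier bandwidth and SNR values are illustrative:

```python
import math

def channel_capacity(i, B, snr_linear):
    # Shannon capacity C = W * log2(1 + SNR) with W = i * B: the allocated
    # bandwidth, and hence the rate, scales with the subcarrier count i.
    return i * B * math.log2(1 + snr_linear)

c1 = channel_capacity(i=1, B=15e3, snr_linear=3)  # one 15 kHz subcarrier, SNR = 3
c4 = channel_capacity(i=4, B=15e3, snr_linear=3)  # four subcarriers: 4x the rate
```

This shows directly why allocating more subcarriers raises the transmission rate linearly at a fixed SNR.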
The channel's fast-fading variation within a time interval τ is modeled using a first-order Gauss-Markov process:

h(t) ~ CN(0, 1 - ϵ²) (21)

where ϵ quantifies the channel correlation between two consecutive time intervals; in the Jakes model [26], ϵ = J_0(2π f_d τ), where J_0(·) is the zeroth-order Bessel function of the first kind, f_d = v f_c / c is the maximum Doppler shift, v is the moving speed of the mobile device, f_c is the carrier frequency, and c is the speed of light, c = 3 × 10^8 m/s.
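The first-order Gauss-Markov evolution can be sketched as follows; the correlation ϵ (which the Jakes model would give as J_0(2π f_d τ)) is passed in directly here, and the seed and ϵ value are illustrative:

```python
import random

def gauss_markov_step(h, eps, rng):
    # h(t + tau) = eps * h(t) + w(t), with w ~ CN(0, 1 - eps**2), which keeps
    # the stationary distribution at CN(0, 1). In the Jakes model eps would be
    # J0(2*pi*f_d*tau); here it is supplied directly as a correlation value.
    sigma = ((1.0 - eps ** 2) / 2.0) ** 0.5  # per-component standard deviation
    w = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return eps * h + w

rng = random.Random(42)
h = complex(rng.gauss(0.0, 0.5 ** 0.5), rng.gauss(0.0, 0.5 ** 0.5))  # h(0) ~ CN(0, 1)
trace = [h]
for _ in range(3):
    h = gauss_markov_step(h, eps=0.95, rng=rng)
    trace.append(h)
```

With ϵ close to 1 consecutive samples are highly correlated (a slowly varying channel); with ϵ = 0 each interval's fading is independent.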
Due to the small-scale fading of the channel, the signal-to-noise ratio of the wireless link varies with time, and thus the transmission quality of the service also varies with time. Denote by one quantity the minimum transmission rate required by a c_l-type service and by another the minimum signal-to-noise ratio it requires.
The goal of spectrum resource allocation in edge computing is to dynamically adjust, at the next decision time, the number of subcarriers allocated to the accepted services according to the time-varying characteristic of the channel, so as to meet the delay-jitter requirements of the various services, improve QoS/QoE, and obtain more system benefit. The subcarrier allocation optimization problem is therefore expressed as minimizing the service delay, which translates further into maximizing the service transmission rate:
Whether allocating computing or communication resources, the system needs to reduce the rejection rate of service requests while meeting the various service requirements, accepting as many new services as possible. The optimization problem is therefore expressed as
The aim of the joint allocation of communication and computing resources is to reduce the rejection rate of services, improve QoS/QoE, and maximize the overall long-term benefit of the system while meeting the various service requirements, according to the system state and channel state information. However, these objectives conflict and cannot all be met simultaneously, so trade-offs must be made. The joint optimization objective is expressed as

where the weight factors sum to one; their proportions are adjusted according to the system requirements to achieve the different desired goals.
Because the joint optimization problem is NP-hard, it is difficult to obtain an analytic solution or to solve it directly. The joint optimization problem above is therefore constructed as a hierarchical model, as shown in Fig. 4.
The lower level architecture in the hierarchical model is shown in the upper part of fig. 4. The time is divided according to a small equal time interval tau (millisecond magnitude), a solid black point is regarded as a decision moment of a controller, the number of sub-carriers occupied by various received services is redistributed according to the system state and channel state information, and the process can be regarded as MDP.
The upper-level architecture in the hierarchical model is shown in the middle of FIG. 4. The open circles mark the instants at which events occur; at each event the controller decides whether to admit a service request and, if so, how many VUs to allocate to it. This process can be regarded as an SMDP.
The lower part of FIG. 4 shows the MDP process as the foundation, on which the SMDP decisions are superimposed. Without layering, both types of decision (whether to admit a service request, and how to allocate VUs and subcarriers) would have to be learned under a single rule as one common policy; layering decouples the VU allocation from the subcarrier allocation. The large-scale SMDP decisions of the upper layer thus have their own independent learning process and policy, and can look further ahead on the basis of the lower-layer MDP.
As can be seen, the MDP and SMDP decision instants overlap; in practice they need not occur at exactly the same moments, but when the lower-layer MDP decision interval is made sufficiently small, the two can be regarded as approximately coincident.
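The two-time-scale control loop described above can be sketched as follows. This is a minimal illustration, not the original implementation: the callables `observe`, `upper_policy`, `lower_policy`, `apply_upper`, and `apply_lower` are assumed placeholders for the system interfaces.

```python
def hierarchical_control(observe, upper_policy, lower_policy,
                         apply_upper, apply_lower, steps):
    """Run `steps` intervals of length tau.  At every interval the lower-layer
    MDP reallocates subcarriers; when an event occurs (open circle in FIG. 4),
    the upper-layer SMDP additionally decides admission and VU allocation."""
    total_reward = 0.0
    for _ in range(steps):
        state = observe()                       # system state + channel state
        if state["event"] != 0:                 # e != 0: an SMDP decision point
            total_reward += apply_upper(upper_policy(state))
        # solid black dot: MDP decision point at every interval tau
        total_reward += apply_lower(lower_policy(state))
    return total_reward
```

The upper-layer call is only reached at event instants, which is exactly the sense in which the SMDP decisions are "superimposed" on the per-interval MDP decisions.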
The system state includes usage of subcarriers in the communication resource and usage of VUs in the computing resource. The set of states of the communication resource is represented as
The state set of the computing resource is represented as
Event e includes the arrival and departure of service requests of the various traffic types, with e = 0 denoting that no event occurs. The state set of the communication resource, the state set of the computing resource, and the event are therefore combined into a single state representation Z = {e, X, Y}.
The system action set is A = {A_up, A_down}, consisting of the upper-layer actions A_up ∈ {-2, -1, 0} and the lower-layer actions A_down ∈ {1, 2, ib, jv}. Here a = -2 indicates that the system rejects the service request, a = -1 indicates that the service leaves the system, and a = 0 indicates that the system admits the service request; a = 1 indicates that the system reallocates the number of subcarriers of the spectrum resource according to the time-varying channel state information, in order of service priority from high to low; a = 2 indicates that no reallocation is performed; a = ib indicates that i subcarriers are allocated to the service; and a = jv indicates that j VMs are allocated to the service.
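One possible encoding of this two-level action set is sketched below. The encoding itself (the `Enum` and the tagged tuples) is an assumption for illustration, not part of the original text; only the numeric values mirror the definition above.

```python
from enum import Enum

class UpperAction(Enum):
    """Upper-layer SMDP actions A_up ∈ {-2, -1, 0}."""
    REJECT = -2          # system rejects the service request
    DEPART = -1          # service leaves the system
    ACCEPT = 0           # system admits the service request

# Lower-layer actions A_down ∈ {1, 2, ib, jv}: two of them carry a
# parameter (i subcarriers or j VMs), so a small tagged tuple is used here.
def reallocate():          return ("reallocate", None)    # a = 1
def keep_allocation():     return ("keep", None)          # a = 2
def alloc_subcarriers(i):  return ("subcarriers", i)      # a = ib
def alloc_vms(j):          return ("vms", j)              # a = jv
```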
The revenue obtained by admitting a type-c_l service, the revenue obtained by allocating i subcarriers to a type-c_l service, and the revenue obtained by allocating j VMs to a type-c_l service are denoted by their respective reward terms; exp_comm denotes the unit communication cost, exp_comp denotes the unit computation cost, and a penalty term represents the cost incurred when the system rejects a service request. z* denotes the system state at the instant an event occurs, and the number of time intervals n_T elapsed between two adjacent events can be expressed as

n_T = min{ t > 0 | z_t = z*, z* ∈ Z }    (25)
Thus, the cumulative reward gained by the system performing a reallocation of communication resources, i.e. sub-carriers, between two events is expressed as
The instantaneous profit obtained by the system receiving the service request when the event occurs is expressed as
When no event occurs, the instantaneous benefit obtained by the underlying MDP process is expressed as
To solve the HRL and obtain the optimal policy, this section uses the Q-Learning algorithm. The upper-layer SMDP decision process is denoted SMDP Q-Learning, and its action-value function is updated as
The lower-layer MDP decision process is denoted MDP Q-Learning, and its action-value function is updated as
Since the SMDP and the MDP operate on different time scales, their Q-update step counts n' and n differ. The Q values are updated iteratively until the algorithm converges, finally yielding the optimal policy for the joint optimal allocation of communication and computing resources.
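The two update rules can be sketched as follows, assuming the standard tabular Q-Learning form; since the update equations themselves are not reproduced in the text, the exact discounting of the SMDP update (here γ raised to the n' elapsed intervals) is an assumption. `Q` is a dict of dicts mapping state to action values.

```python
def smdp_q_update(Q, s, a, reward, s_next, n_prime, alpha=0.1, gamma=0.9):
    """Upper-layer SMDP Q-Learning: the discount spans the n' intervals
    elapsed between two adjacent events (assumed standard SMDP form)."""
    target = reward + (gamma ** n_prime) * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def mdp_q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Lower-layer MDP Q-Learning: one-step update every interval tau."""
    target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```

The only structural difference between the two updates is the discount exponent, which is how the differing time scales (n' versus n) enter the learning rules.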
Algorithm 2 describes the whole process of hierarchical reinforcement learning. Compared with a single-layer SMDP process, this section models the system as an HRL model that takes into account the influence of the time-varying channel on service transmission: the spectrum resource, i.e. the number of subcarriers, can be adaptively reallocated so that the delay requirements of the various services are met as far as possible, thereby improving QoS/QoE. The system therefore obtains greater benefit under this framework.
Simulation verification:
Simulation verification is carried out only for the vehicle cloud resources with the resource reservation and resource secondary allocation mechanisms. We evaluate the performance of the SMDP-based resource allocation model and the model-based reinforcement learning algorithm using MATLAB simulation. First, the convergence of the model-based reinforcement learning resource allocation method is verified and compared with the case without the resource secondary allocation mechanism. Because the greedy algorithm ignores the statistical information of the environment and looks only at the short-term maximum income, comparing the reinforcement learning algorithm with this traditional heuristic highlights the ability of reinforcement learning to learn from the environment and verifies the advantage of the resource secondary allocation mechanism; the reservation policy is then discussed. A model-based Dyna algorithm is used here; Algorithm 1 is the k-step backtracking Dyna-Q online learning algorithm used to learn the optimal policy.
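A generic Dyna-Q iteration of the kind referred to as Algorithm 1 can be sketched as below; this is the textbook Dyna-Q form (real update, model learning, then k simulated planning updates), and the details of the k-step backtracking variant used in the original are assumed rather than reproduced.

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, k, alpha=0.1, gamma=0.9):
    """One Dyna-Q iteration: learn from the real transition, store it in
    the model, then replay k simulated transitions from the model."""
    # direct reinforcement learning from the real experience
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
    # model learning: remember the observed transition (deterministic model)
    model[(s, a)] = (r, s_next)
    # planning: k extra updates from previously observed (state, action) pairs
    for _ in range(k):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next].values()) - Q[ps][pa])
```

The k planning updates are what make the algorithm model-based: they let the learner reuse environment statistics instead of relying solely on new real transitions, which is the learning ability the comparison with the greedy algorithm is meant to highlight.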
When a new important service request arrives and no allocable resources remain in the cloud, some common services are selected, their resource occupancy is reduced, and the released VUs are allocated to the new important service; this reduces the rejection rate of important services while not harming the benefit of the downsized services too much. If no event occurs within a given fixed time interval, the resource occupancy of common services in the cloud is increased, improving their QoE. Without loss of generality, the simulation considers only the above resource secondary allocation process. Table 1 lists the simulation parameters for verifying the resource secondary allocation mechanism. Each algorithm runs for T_max = 1×10^5 time units; this is computer simulation time, not actual time. FIG. 5 records the results over a computer simulation time period of 20, with time on the abscissa, so that the change over time, the convergence of the algorithm, and the cumulative average gain can be observed.
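The preemption step of the secondary allocation process described above can be sketched as follows. The data layout (`common_services` mapping a service id to `(current_vus, min_vus)`) and the largest-first selection order are assumptions for illustration, not details given in the original.

```python
def secondary_allocation(free_vus, common_services, n_p):
    """When an important request arrives and fewer than n_p VUs are free,
    shrink common services (never below their minimum occupancy) until
    enough VUs are released.  Returns (admitted, free_vus_after)."""
    # shrink the largest common services first, to spread the loss thinly
    for sid, (vus, min_vus) in sorted(common_services.items(),
                                      key=lambda kv: kv[1][0], reverse=True):
        if free_vus >= n_p:
            break
        release = min(vus - min_vus, n_p - free_vus)   # VUs this service frees
        common_services[sid] = (vus - release, min_vus)
        free_vus += release
    return free_vus >= n_p, free_vus
```

Because each common service is never pushed below its minimum VU count, the mechanism trades only a small amount of common-service benefit for the admission of the important request, as the text describes.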
TABLE 1 simulation parameters
FIG. 5 shows that the system gain obtained with the proposed model and mechanism is much higher than with the greedy algorithm, because the greedy strategy is short-sighted and focuses only on the immediate maximum benefit. The figure also shows that the resource secondary allocation mechanism yields more benefit for the system: the mechanism trades the benefit of a small number of common services for the benefit of more important services, reducing the rejection rate of important service requests, and in some special cases it even improves the QoE of common services. To the left of the dotted line there are more common service requests than important ones, and the algorithm converges quickly. To the right of the dotted line the environment changes so that the arrival rate of important service requests increases; the model-based reinforcement learning algorithm senses this change and adjusts its policy accordingly, obtaining more benefit, whereas the greedy strategy does not respond to the change and cannot obtain more system benefit.
As can be seen from FIG. 6(a), when the arrival rate of common service requests is higher than that of important service requests, the rejection rate of important requests under the greedy algorithm and without the secondary allocation (SAoR) mechanism is higher than that of common requests: cloud resources are limited, the controller admits a large number of common services, and the 2 resources remaining for an important service request cannot meet its requirement, so the important request is rejected. The resource secondary allocation mechanism solves exactly this problem, and the rejection rate of important requests drops significantly. Moreover, because the model-based reinforcement learning algorithm learns from the environment with the goal of maximizing long-term system benefit, it arrives at a compromise policy that does not always allocate the maximum possible resources to a service, giving a lower rejection rate than the greedy algorithm. FIG. 6(b) shows that when the arrival rate of important requests exceeds that of common requests and cloud resources are limited, the overall rejection rate rises, but the model-based reinforcement learning algorithm together with the secondary allocation mechanism still achieves better performance than the greedy strategy.
Next, the influence of the reservation policy on the system is analyzed; the simulation parameters are shown in Table 2.
TABLE 2 simulation parameters
As can be seen from FIG. 7, the system yield obtained with the greedy algorithm is lower than with the reinforcement learning algorithm, because the greedy algorithm increases the rejection rate of subsequent service requests, reducing QoS and QoE and hence the system yield. FIG. 7(a) shows that the cumulative average reward decreases as the number of VUs reserved for important services increases: since the system receives more common service requests than important ones, increasing the number of reserved VUs reduces the resources available to common services, and although the QoE of important services improves, the admission rate of common services drops sharply and the cumulative benefit of the system falls. FIG. 7(b) shows that the cumulative average reward first increases and then decreases as the number of reserved VUs grows. When the number of important requests in the system increases, reserving more VUs improves the QoE of important services and lowers their rejection rate; since admitting an important request yields more revenue than admitting a common one, the reward rises. Once the number of reserved VUs exceeds a certain value, however, the rejection rate of common services rises sharply and their QoE falls, and the revenue from important services can no longer compensate for this loss, so the overall cumulative average reward decreases. Dynamically adjusting the number of reserved VUs according to environmental changes therefore improves the QoE of important services without sacrificing too much QoE of common services, and increases the long-term benefit of the system.
Simulation results show that, compared with the greedy algorithm, the model-based reinforcement learning algorithm obtains an adaptive resource allocation strategy. The introduction of the secondary allocation mechanism further improves system performance, reducing the rejection rate of service requests and yielding more system benefit. The reservation strategy also improves the QoE of important-service users, provided the reservation ratio is dynamically adjusted as the environment changes.