CN111711666B - Internet of vehicles cloud computing resource optimization method based on reinforcement learning - Google Patents


Publication number
CN111711666B
CN111711666B (application CN202010460525.9A)
Authority
CN
China
Prior art keywords: service, resource, VUs, important, action
Prior art date
Legal status
Active
Application number
CN202010460525.9A
Other languages
Chinese (zh)
Other versions
CN111711666A (en)
Inventor
洪鑫涛
梁宏斌
张宗源
Current Assignee
Hualui Cloud Technology Co ltd
Original Assignee
Hua Lu Yun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hua Lu Yun Technology Co ltd
Priority to CN202010460525.9A
Publication of CN111711666A
Application granted
Publication of CN111711666B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/70 Admission control; Resource allocation
    • H04L 47/83 Admission control; Resource allocation based on usage prediction
    • H04L 47/82 Miscellaneous aspects
    • H04L 47/821 Prioritising resource allocation or reservation requests
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • H04L 67/50 Network services
    • H04L 67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services
    • H04L 67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources


Abstract

The invention discloses an Internet of Vehicles cloud computing resource optimization method based on reinforcement learning, comprising the following steps: the resource allocation problem of the Internet of Vehicles system is modeled as a Semi-Markov Decision Process (SMDP), and a resource reservation strategy and a secondary resource allocation mechanism are introduced; the model is then solved using reinforcement learning. The invention enables the system to obtain an optimal Internet of Vehicles resource allocation scheme and improves resource utilization, system QoS, and user QoE.

Description

Internet of vehicles cloud computing resource optimization method based on reinforcement learning
Technical Field
The invention belongs to the field of optimization application of internet of vehicles, and particularly relates to an internet of vehicles cloud computing resource optimization method based on reinforcement learning.
Background
With the rapid development of the Internet of Things, its role in the industrial field has become increasingly prominent. The Industrial Internet of Things (IIoT) is one of its important applications, aiming to collect and process data in industrial environments to realize intelligent operations such as industrial monitoring, automation, and automatic control. In IIoT, data collected from various devices must be gathered and processed in a timely, reliable, and efficient manner, and advanced communication and computing technologies are expected to play an important role in meeting these stringent requirements. However, IIoT still faces many unsolved technical difficulties; for example, the industrial environment itself exhibits high randomness and dynamics, so traditional techniques cannot adapt to it. In such a highly complex and dynamic system, Artificial Intelligence (AI) has great potential to address these challenges: the IIoT system can be intelligently managed and controlled, and efficient decisions can be completed through autonomous learning.
Gao et al propose an object classification method for autonomous driving that fuses vision and light detection and ranging (LIDAR) data, based on convolutional neural networks and image upsampling theory. Shi et al propose an end-to-end navigation scheme for mobile robots based on Deep Reinforcement Learning (DRL), in which agents trained in simulation can be directly extended to real scenes for practical applications. Aazam et al designed a novel architecture for IIoT that utilizes fog computing to provide local computing support for the IIoT environment.
The Internet of Vehicles is an important application scenario of IIoT, and the effective management of its resources is a problem to be solved. Virtualization techniques facilitate the management of computing, communication, and storage resources: with their assistance, the vehicle cloud composed of on-board units, roadside units, and remote cloud servers allows resources to be allocated uniformly, improving the utilization efficiency of Internet of Vehicles resources. Therefore, how to allocate resources intelligently and efficiently is important for the normal operation of the Internet of Vehicles. Ning et al build an intelligent offloading framework for 5G-based vehicular networks by jointly using licensed cellular spectrum and unlicensed spectrum. Sodhro et al propose an intelligent resource regulation technique, FCDAA (forward central dynamic and available approach), for the resource-limitation problem of mobile devices. Liang et al provide a method for sharing V2V and V2I spectrum in the Internet of Vehicles using a multi-agent reinforcement learning algorithm. Ramon et al studied SDN-based resource management for the Internet of Vehicles from both theoretical and practical aspects. Yu et al propose a game-theory-based cloud resource allocation scheme to achieve effective resource management in a cloud-based vehicular network. Zhao et al propose a hierarchical resource allocation scheme based on the Nash game, in which the original problem is converted into two sub-problems, power allocation and time slot allocation, according to a time-division multiplexing scheme, realizing fair and effective spectrum resource allocation for the VANET. He et al propose a unified framework for dynamic orchestration of networking, caching, and computing resources, converting the resource allocation problem in the unified framework into a joint optimization problem; owing to the high complexity of the joint optimization problem, a deep reinforcement learning method is used, improving system performance.
Although there are many works in the literature on optimal allocation of vehicle cloud resources, the revenue, cost, and QoS of the vehicle cloud system and the QoE of the vehicle users are not jointly considered to maximize the long-term revenue of the vehicle cloud system. This motivates us to propose an adaptive vehicle cloud/fog resource allocation model based on the Semi-Markov Decision Process (SMDP). In addition, the above works assume that the resources occupied by a service remain unchanged during its whole running period and do not consider secondary allocation of resources (SAoR) according to resource usage, which would improve resource utilization efficiency, the QoS of the system, and the QoE of vehicle users. To improve the QoE of important service requests, this work borrows the idea of reserving guard channels to improve call handover success rates in communication systems, and reserves part of the resources exclusively for important service requests.
However, since mobile terminals are limited by physical resources such as computation and power, it is necessary to offload traffic to a nearby edge server. Mobile Edge Computing (MEC) is a product of exactly this trend. MEC deploys computing resources as close as possible to the network edge near mobile devices, shortening the physical distance between the mobile terminal and the server, reducing transmission delay, relieving the overload pressure on centralized servers, improving reliability, and offering a more flexible computing mode.
Given the characteristics of mobile edge computing, communication and computing resources need to be considered jointly. Within a given period, a large number of mobile devices must share limited resources within a certain range, so the highest system efficiency in the mobile edge computing scenario can only be achieved by jointly and optimally allocating computing and communication resources.
Salahuddin et al survey a variety of vehicle cloud models and demonstrate the benefits of reinforcement-learning-based techniques for resource allocation in the vehicle cloud, which can perceive long-term benefits and minimize vehicle cloud resource deployment overhead. Tang et al propose a novel deep-learning-based intelligent POC allocation scheme for the wireless software-defined Internet of Things (SDN-IoT), which achieves rapid convergence of the channel allocation process and significantly improves network performance. Alam et al propose a reinforcement-learning-based code offloading mechanism that significantly reduces the execution time and latency of mobile services while ensuring lower mobile device power consumption. Wang et al propose an intelligent resource allocation scheme (DRLRA) based on deep reinforcement learning to adaptively allocate computing and network resources, reduce average service time, and balance resource usage in changing MEC environments.
The above works achieve a certain effect on the joint optimal allocation of communication and computing resources, but they ignore the time-varying characteristics of wireless channels, which affect the transmission efficiency of services and hence the QoE of users; as a result, they cannot obtain the optimal solution for the overall real-time resource allocation of the system.
Disclosure of Invention
Aiming at the problems, the invention provides a vehicle networking cloud computing resource optimization method based on reinforcement learning.
The invention discloses a vehicle networking cloud computing resource optimization method based on reinforcement learning, which specifically comprises the following steps of:
A. modeling the resource allocation problem of the Internet of vehicles system as a Semi Markov Decision Process (SMDP), and introducing a resource reservation strategy and a resource secondary allocation mechanism.
The system is set to have M virtual units (VUs) and two kinds of service requests: important service requests and ordinary service requests. The number of VUs that can be allocated to a service request is l ∈ {1, 2, ..., L}, where L ≤ M. The arrivals of important requests and ordinary requests are assumed to follow Poisson processes with mean rates λ_p and λ_q, respectively. The processing time of a request follows an exponential distribution whose mean departure rate λ_l is a function of the number l of VUs allocated to the service, λ_l = l + 1, so that 1/λ_l is the average processing time of a service to which l VUs are allocated.
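For intuition, the arrival/departure model above can be sampled as follows (a minimal sketch; the rates and the allocation l are illustrative values, not taken from the patent):

```python
import random

# Illustrative parameters (not from the patent): mean arrival rates of
# important and ordinary requests, and a number of VUs l for one service.
lambda_p, lambda_q = 2.0, 3.0
l = 4
lambda_l = l + 1          # mean departure rate when l VUs are allocated,
                          # so 1/lambda_l is the mean processing time

# Exponential inter-event times are the standard way to sample Poisson arrivals;
# the departure time of a running service is also exponential.
t_arrival_p = random.expovariate(lambda_p)
t_arrival_q = random.expovariate(lambda_q)
t_departure = random.expovariate(lambda_l)

next_event = min(("important arrival", t_arrival_p),
                 ("ordinary arrival", t_arrival_q),
                 ("departure", t_departure), key=lambda x: x[1])
print("next event:", next_event)
```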
A part of the resources is reserved by the reservation ratio Th exclusively for important service requests.
The system state S is described by the event of a vehicle service request, namely an arrival or a departure, and by the number of VUs occupied by the different kinds of services in the system. s_pi and s_qj respectively denote the number of important services occupying i VUs and the number of ordinary services occupying j VUs in the system, where i ∈ {n_p, n_p+1, ..., L_p}, j ∈ {n_q, n_q+1, ..., L_q}, and n_p and n_q denote the minimum number of VUs assigned to important and ordinary service requests, respectively. Two kinds of events e are defined:
1) Arrivals of important and ordinary vehicle service requests, e_ar, denoted by e_ar^p and e_ar^q respectively, i.e. e_ar ∈ {e_ar^p, e_ar^q}.
2) Departures of service requests, e_d, where e_d^{p,i} and e_d^{q,j} denote the departure of an important service occupying i VUs and of an ordinary service occupying j VUs respectively, i.e. e_d ∈ {e_d^{p,i}, e_d^{q,j}}.
Thus e ∈ {e_ar, e_d} denotes the set of all events.
If no event occurs within a fixed time interval τ_int, secondary resource allocation is performed on the services in the cloud according to the resource usage in the cloud, increasing their resource occupancy. timeout = 0 indicates that an event occurred within the fixed time interval; otherwise timeout = 1. The system state is represented as follows:
s_ar = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_ar>} (1)
s_d = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_d, timeout>} (2)
S ∈ {s_ar, s_d}, where s_ar denotes the state when a service arrival event occurs and s_d denotes the state when a service departure event occurs.
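As an illustration only (the class name and field choices are assumptions for this sketch, not the patent's data structure), the state could be represented along these lines:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class VehicleCloudState:
    # s_pi: number of important services occupying i VUs, keyed by i
    important: Dict[int, int] = field(default_factory=dict)
    # s_qj: number of ordinary services occupying j VUs, keyed by j
    ordinary: Dict[int, int] = field(default_factory=dict)
    event: str = "none"        # "arrival_p", "arrival_q", "departure_p_i", ...
    timeout: int = 1           # 0 if an event occurred within tau_int, else 1

    def used_vus(self) -> int:
        """Total VUs currently occupied by running services."""
        return (sum(i * n for i, n in self.important.items())
                + sum(j * n for j, n in self.ordinary.items()))

# Example: two important services with 3 VUs each, one ordinary with 2 VUs.
s = VehicleCloudState(important={3: 2}, ordinary={2: 1}, event="arrival_p", timeout=0)
print(s.used_vus())   # -> 8
```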
after receiving a vehicle service request, if the system decides to process the request immediately, the system assigns l VUs to the request; the action corresponding to the reception of the important service request is
Figure GDA0002611820400000038
The action corresponding to the reception of the ordinary service request is
Figure GDA0002611820400000039
Wherein
Figure GDA00026118204000000310
And
Figure GDA00026118204000000311
representing the state of the system at the time of arrival of important and ordinary service requests.
When the system encounters the special state in which the idle VUs are insufficient to satisfy the minimum VU requirement of an arriving request, then, if the system chooses to accept the service request, it executes the contraction action defined in the secondary resource allocation mechanism: to satisfy the minimum VU requirement of the current service request, one or more services are selected from the running services according to the system state and part of their resources are released. One such action releases l VUs from a running important service occupying i VUs, and another releases l VUs from a running ordinary service occupying j VUs, while at the same time the action of accepting the new service request is performed. If a vehicle service request is rejected based on the system state and the long-term revenue of the system, it is not assigned any VUs, and the corresponding action is a(s_ar) = 0.
When a vehicle service request in the cloud leaves, the VUs it occupies are released, and the corresponding action is a(s_d) = -1.
The system may also encounter special states in which the number of idle resources exceeds given thresholds, where the two thresholds denote the amounts of idle resources at which important services and ordinary services, respectively, execute the special action. If the system chooses to reallocate the resources occupied by running services, it adopts the expansion action defined in the secondary resource allocation mechanism: one action adds l VUs to a running important service occupying i VUs, and another adds l VUs to a running ordinary service occupying j VUs.
The action set a(s) of the adaptive VU allocation model collects all of the actions described above (formula (5)).
the overall system revenue is considered as z (s, a), which includes three categories of revenue, cost and additional cost:
z(s,a)=x(s,a)-y(s,a)-ext(s,a),e∈{e ar ,e d } (6)
where x (s, a) refers to revenue obtained from vehicle users when satisfying a vehicle service request, y (s, a) refers to system cost generated by evaluating the number of VUs used by the service, and ext (s, a) refers to additional cost incurred by the resource secondary allocation mechanism.
The revenue x(s, a) is given by formula (7), where R is the reward obtained by processing an important service request immediately; I_v is the profit evaluated through the change in user QoE and system QoS, and takes a fixed value according to the increase in QoE and QoS if the request is processed immediately; r_v is the payment a user makes for each VU the system allocates; two penalty terms give the costs of rejecting an important service and an ordinary service, each multiplied by a weight factor for important and ordinary services respectively; a further term is the basic reward added for allocating l resources; and r_Th is the converted reward obtained because the system reserves part of the resources for important service requests and thereby improves the QoE of important services.
The system cost is represented by:
y(s,a) = t(s,a) h(s,a), a ∈ a(s) (8)
where t(s,a) is the average expected time from the system making decision a(s) in the current state s until the next state, and h(s,a) is the system service loss over the average expected time t(s,a), given by formula (9), in which c_v denotes the cost of occupying one VU.
In addition, since expanding the resources of a running service affects the future long-term revenue of the system, it brings a certain loss; reducing the resources occupied by a service in the cloud likewise causes a loss, which is proportional to the importance of the service and inversely proportional to the amount of resources it occupies. Executing the reservation policy leads to a special state in which the acceptance probability of newly arriving ordinary service requests decreases, reducing their QoE; the quantified cost is proportional to the number of reserved resources. The additional cost can therefore be expressed by formula (10), in which one term denotes the cost of reducing a service by l VUs, weight factors are applied for important and ordinary services respectively, and c_Th denotes the cost of implementing the reservation policy, i.e. the reduction in QoE of ordinary services.
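The per-decision reward z(s, a) = x(s, a) − y(s, a) − ext(s, a) can be sketched as follows (a simplified illustration; the individual terms and constants are placeholders standing in for the full formulas (7)-(10), not the patent's exact expressions):

```python
def reward(accepted: bool, important: bool, l: int, t_expected: float,
           vus_in_use: int, r_v: float = 1.0, R: float = 2.0, I_v: float = 0.5,
           c_v: float = 0.1, reject_cost: float = 3.0, ext_cost: float = 0.0) -> float:
    """Simplified z(s,a) = x(s,a) - y(s,a) - ext(s,a).

    x: payment for allocated VUs plus extra reward / QoE-QoS gain when an
       important request is processed immediately, or a rejection penalty.
    y: expected holding cost of all occupied VUs until the next decision
       epoch, y = t(s,a) * h(s,a).
    ext: additional cost charged by the secondary allocation / reservation policy.
    """
    if accepted:
        x = l * r_v + (R + I_v if important else 0.0)
    else:
        x = -reject_cost
    y = t_expected * (c_v * vus_in_use)   # h(s,a) ~ c_v * number of occupied VUs
    return x - y - ext_cost

# Example: accept an important request with 3 VUs, 10 VUs busy, 0.5 time units expected.
print(reward(accepted=True, important=True, l=3, t_expected=0.5, vus_in_use=10))
```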
B. The model is solved using reinforcement learning.
The reinforcement learning algorithm solves the Bellman optimality equation through asynchronous iteration to obtain the optimal strategy; when the controller cannot obtain the state transition probabilities of the environment, the following action value function is updated iteratively to obtain an approximately optimal strategy.
Depending on the type of the successor state, the action value function is updated according to formula (11) or formula (12), where β_n denotes the learning rate at the n-th decision step and γ denotes the discount factor, 0 < γ < 1; β_n is a small positive value, β_n < 1, chosen suitably and decreased continuously as learning proceeds to avoid non-convergence of learning.
If every state-action pair in the environment can be visited infinitely often, the action value function converges and the optimal action value function is obtained (formula (13)), where π denotes the strategy. The optimal strategy π* can then be obtained from the optimal action value function; it gives the probability of taking each action to accomplish the goal (formula (14)).
The controller selects the action to take in the current system state with an ε-greedy exploration strategy: the action with the maximum action value is selected with probability 1−ε, and one of the remaining actions is selected at random with probability ε/|a(s)|. β_n and ε_n are gradually attenuated using DCM, and the algorithm finally converges.
Further, for the joint adaptive optimal allocation of edge computing communication and computing resources, a hierarchical architecture is used: the upper layer is an SMDP-based decision mechanism for accepting or rejecting service requests, and the lower layer is an MDP-based spectrum resource reallocation mechanism; the model is solved with hierarchical reinforcement learning.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention firstly provides an adaptive resource allocation model based on the SMDP, adds a reservation strategy and a resource secondary allocation mechanism on the basis, and solves the problem by using the reinforcement learning based on the model. Compared with a greedy algorithm, the model-based reinforcement learning algorithm can obtain an adaptive resource allocation strategy. And the system performance is improved by introducing a secondary distribution mechanism, the rejection rate of service requests is reduced, and more system benefits are obtained. The reservation policy also improves QoE of important service users.
Drawings
Fig. 1 is a vehicle cloud system with reserved resources and a resource secondary allocation mechanism.
FIG. 2 is a model-based reinforcement learning process.
FIG. 3 is an edge computing scenario in a vehicle networking.
FIG. 4 is a hierarchical model for joint optimal allocation of edge computing communication and computing resources.
FIG. 5 is a graph of the cumulative average prize in a simulation experiment.
Fig. 6 shows the rejection rate of the service in different environments of the simulation experiment.
Fig. 7 shows the reservation ratio and the system cumulative average reward in different environments of the simulation experiment.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
The present invention uses Virtual Units (VUs) to represent the smallest unit of resources in an overall vehicle cloud system, including the computational and storage resources required to process vehicle service requests in the vehicle cloud system.
Fig. 1 shows a vehicle cloud system with a resource reservation policy and a secondary resource allocation mechanism. When a service request arrives and the system has no available VUs to allocate, the system can select some services from the cloud and release a small amount of their resources to allocate to the newly arrived request, satisfying its requirement. Conversely, when a large amount of free resources exists in the cloud, the resource occupancy of some services in the system is increased, improving their QoE. When the system reserves a large number of VUs to guarantee important service requests, subsequent ordinary service requests are rejected for lack of available VUs; if too few VUs are reserved, the QoE of important service requests is hard to satisfy. Both situations reduce the long-term revenue of the overall system. Therefore, how to improve the long-term overall revenue of the system, the QoS of the vehicle cloud system, and the QoE of vehicle users by adaptively adjusting the number of reserved VUs and the resource occupancy of services running in the cloud according to the system environment is the main issue studied here.
Reinforcement learning studies how an agent should act in a dynamic environment to maximize the long-term cumulative reward defined by a goal. Reinforcement learning is an MDP process that aims to obtain an optimal strategy mapping the current environment state to the actions that can be taken. The agent does not need a complete dynamic model of the environment, which avoids the strong assumptions required by traditional methods, assumptions that are often inaccurate in real environments.
The VU allocation task is mapped onto the agent-environment framework: the agent is regarded as a controller in the vehicle cloud system, and the states are service arrivals, service departures, and the VU resource allocation situation; the actions are to accept, reject, or assign different numbers of VUs. By exploring the environment, the controller continuously interacts with it and dynamically improves the resource allocation strategy, finally obtaining the optimal strategy for VU resource allocation. Therefore, the resource allocation problem of the Internet of Vehicles system is modeled as a Semi-Markov Decision Process (SMDP), and a resource reservation strategy and a secondary resource allocation mechanism are introduced.
The system is set to have M VUs and two kinds of service requests: important service requests and ordinary service requests. The number of VUs that can be allocated to a service request is l ∈ {1, 2, ..., L}, where L ≤ M. The arrivals of important requests and ordinary requests are assumed to follow Poisson processes with mean rates λ_p and λ_q, respectively. The processing time of a request follows an exponential distribution whose mean departure rate λ_l is a function of the number l of VUs allocated to the service, λ_l = l + 1, so that 1/λ_l is the average processing time of a service to which l VUs are allocated.
In order to satisfy higher-priority important service requests as far as possible, a part of the resources is reserved according to the reservation ratio Th and used exclusively for important service requests. The reservation ratio is the ratio of the reserved resources (M·Th)/(1+Th) to the remaining resources M/(1+Th). Because the arrival of service requests in the vehicle cloud environment changes dynamically, the appropriate number of reserved resources is difficult to predict when executing the important-service protection strategy: if too many resources are reserved, VU utilization drops and the rejection rate of newly arriving ordinary service requests rises; if too few are reserved, important service requests cannot be protected. The ratio Th is therefore adjusted adaptively according to changes in the environment.
The system state S is described by the event of a vehicle service request (arrival or departure) and by the number of VUs occupied by the different kinds of services in the system. s_pi and s_qj respectively denote the number of important services occupying i VUs and the number of ordinary services occupying j VUs in the system, where i ∈ {n_p, n_p+1, ..., L_p}, j ∈ {n_q, n_q+1, ..., L_q}, and n_p and n_q denote the minimum number of VUs assigned to important and ordinary service requests, respectively. Two kinds of events e are defined:
1) Arrivals of important and ordinary vehicle service requests, e_ar, denoted by e_ar^p and e_ar^q respectively, i.e. e_ar ∈ {e_ar^p, e_ar^q}.
2) Departures of service requests, e_d, where e_d^{p,i} and e_d^{q,j} denote the departure of an important service occupying i VUs and of an ordinary service occupying j VUs respectively, i.e. e_d ∈ {e_d^{p,i}, e_d^{q,j}}.
Thus e ∈ {e_ar, e_d} denotes the set of all events.
If no event occurs within a fixed time interval τ_int, secondary resource allocation is performed on the services in the cloud according to the resource usage in the cloud, increasing their resource occupancy. timeout = 0 indicates that an event occurred within the fixed time interval; otherwise timeout = 1. The system state is represented as follows:
s_ar = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_ar>} (1)
s_d = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_d, timeout>} (2)
S ∈ {s_ar, s_d}, where s_ar denotes the state when a service arrival event occurs and s_d denotes the state when a service departure event occurs.
After receiving a vehicle service request, if the system decides to process the request immediately, it assigns l VUs to the request. Accepting an important service request corresponds to an allocation action with i ∈ {n_p, n_p+1, ..., L_p}, and accepting an ordinary service request corresponds to an allocation action with j ∈ {n_q, n_q+1, ..., L_q}; s_ar^p and s_ar^q denote the states of the system when an important or an ordinary service request arrives.
When the system encounters the special state in which the idle VUs are insufficient to satisfy the minimum VU requirement of an arriving request, then, if the system chooses to accept the service request, it executes the contraction action defined in the secondary resource allocation mechanism: to satisfy the minimum VU requirement of the current service request, one or more services are selected from the running services according to the system state and part of their resources are released. One such action releases l VUs from a running important service occupying i VUs, and another releases l VUs from a running ordinary service occupying j VUs, while at the same time the action of accepting the new service request is performed. If a vehicle service request is rejected based on the system state and the long-term revenue of the system, it is not assigned any VUs, and the corresponding action is a(s_ar) = 0.
When a vehicle service request in the cloud leaves, the VUs it occupies are released, and the corresponding action is a(s_d) = -1.
The system may also encounter special states in which the number of idle resources exceeds given thresholds, where the two thresholds denote the amounts of idle resources at which important services and ordinary services, respectively, execute the special action. If the system chooses to reallocate the resources occupied by running services, it adopts the expansion action defined in the secondary resource allocation mechanism: one action adds l VUs to a running important service occupying i VUs, and another adds l VUs to a running ordinary service occupying j VUs.
The action set a(s) of the adaptive VU allocation model collects all of the actions described above (formula (5)).
the revenue of the whole system is considered as z (s, a), and includes three categories of revenue, cost and extra cost:
z(s,a)=x(s,a)-y(s,a)-ext(s,a),e∈{e ar ,e d } (6)
where x (s, a) refers to revenue obtained from vehicle users when satisfying a vehicle service request, y (s, a) refers to system cost generated by evaluating the number of VUs used by the service, and ext (s, a) refers to additional cost incurred by the resource secondary allocation mechanism.
The system revenue x (s, a) generated by evaluating the number of VUs used by a service should take into account the following factors: the vehicle user pays for using the cloud resources; individual rewards obtained by immediately processing important service requests; the service request occupies the cost of the VU; quantified user QoE and system QoS (due to immediate processing of requests) improvement; the reward caused by the resource expansion of the running service is proportional to the importance of the service and inversely proportional to the amount of resources occupied by the running service. Thus, revenue is expressed as:
Figure GDA0002611820400000107
where R is the reward obtained by processing the vital service request immediately, I v The profit evaluated by the change of the QoE of the user and the QoS of the system; if the request is processed immediately, I v A fixed value will be used according to the increase in QoE and QoS; r is v The user allocates a VU payment for the system;
Figure GDA0002611820400000108
and
Figure GDA0002611820400000109
respectively representing the costs brought by the rejection of important services and ordinary services;
Figure GDA00026118204000001010
and
Figure GDA00026118204000001011
weight factors representing important services and ordinary services, respectively;
Figure GDA00026118204000001012
increasing a basic reward for allocating l resources; r is a radical of hydrogen Th Partial resources are reserved for the system to request important services, and QoE of the important services is improved to obtain the converted reward.
The system cost is represented by:
y(s,a) = t(s,a) h(s,a), a ∈ a(s) (8)
where t(s,a) is the average expected time from the system making decision a(s) in the current state s until the next state, and h(s,a) is the system service loss over the average expected time t(s,a), given by formula (9), in which c_v denotes the cost of occupying one VU.
In addition, since expanding the resources of a running service affects the future long-term revenue of the system, it brings a certain loss; reducing the resources occupied by a service in the cloud likewise causes a loss, which is proportional to the importance of the service and inversely proportional to the amount of resources it occupies. Executing the reservation policy leads to a special state in which the acceptance probability of newly arriving ordinary service requests decreases, reducing their QoE; the quantified cost is proportional to the number of reserved resources. The additional cost can therefore be expressed by formula (10), in which one term denotes the cost of reducing a service by l VUs, weight factors are applied for important and ordinary services respectively, and c_Th denotes the cost of implementing the reservation policy, i.e. the reduction in QoE of ordinary services.
The model is solved using model-based reinforcement learning, where Model(s, a) represents the MDP process of the environment, i.e. a statistical estimate of the state transition probabilities of the environment. Model(s, a) is updated with real experience obtained from the environment. A tabular model is used here: every time the controller obtains a real experience, it stores the 4-tuple <S_t, A_t, S_t+1, R_t+1> in Model(s, a).
The model-based reinforcement learning process is illustrated in Fig. 2. The agent learns online and continuously interacts with the environment; the actual experience it obtains is used directly for reinforcement learning, improving the action value function and the strategy, and also for improving the model so that it matches the current environment more accurately. Meanwhile, the agent interacts with the model and applies reinforcement learning to the simulated experience generated by the model for indirect learning, which also improves the action value function and the strategy and accelerates the learning process. The direct and indirect learning processes run in parallel. By learning from the model, the agent can understand the environment more deeply rather than only maximizing the immediate system reward, and it gains a certain reasoning ability.
The reinforcement learning algorithm solves the Bellman optimality equation through asynchronous iteration to obtain the optimal strategy; when the controller cannot obtain the state transition probabilities of the environment, the following action value function is updated iteratively to obtain an approximately optimal strategy.
Depending on the type of the successor state, the action value function is updated as in formulas (11) and (12) above, where β_n denotes the learning rate at the n-th decision step and γ denotes the discount factor, 0 < γ < 1; β_n is a small positive value, β_n < 1, chosen suitably and decreased continuously as learning proceeds to avoid non-convergence of learning.
If every state-action pair in the environment can be visited infinitely often, the action value function converges and the optimal action value function is obtained (formula (13)), where π denotes the strategy. The optimal strategy π* can then be obtained from the optimal action value function; it gives the probability of taking each action to accomplish the goal (formula (14)).
The controller selects the action to take in the current system state with an ε-greedy exploration strategy: the action with the maximum action value is selected with probability 1−ε, and one of the remaining actions is selected at random with probability ε/|a(s)|. β_n and ε_n are gradually attenuated using DCM, and the algorithm finally converges; see Algorithm 1 for details.
(Algorithm 1, given as an image in the original publication: the k-step backtracking Dyna-Q online learning algorithm for adaptive VU allocation.)
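As an illustration of the Dyna-style loop described above (not the patent's Algorithm 1, whose exact steps are given only as an image; the planning depth, rates, and environment hooks are assumed values):

```python
import random
from collections import defaultdict

Q = defaultdict(float)
Model = {}                      # (s, a) -> (next_state, reward), tabular model

def dyna_q_step(s, a, r, s_next, actions_of, beta=0.05, gamma=0.9, k=5):
    """One direct-RL update from real experience, then k planning updates
    from simulated experience drawn from the learned model."""
    def backup(bs, ba, br, bn):
        best = max((Q[(bn, a2)] for a2 in actions_of(bn)), default=0.0)
        Q[(bs, ba)] += beta * (br + gamma * best - Q[(bs, ba)])

    backup(s, a, r, s_next)          # direct learning from the environment
    Model[(s, a)] = (s_next, r)      # update the tabular model
    for _ in range(k):               # indirect learning from the model
        (ps, pa), (pn, pr) = random.choice(list(Model.items()))
        backup(ps, pa, pr, pn)

# Toy usage with illustrative states/actions.
actions_of = lambda s: [0, 1, 2] if s == "arrival" else [-1]
dyna_q_step("arrival", 2, 1.0, "departure", actions_of)
```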
Theoretical analysis of joint communication and computing resource allocation
Fig. 3 shows the offloading of a mobile vehicle service request to an edge server in an edge computing scenario. Two processes are mainly involved: 1) the mobile device initiating the service request connects to the RSU through the wireless network and transmits data, with spectrum resources (the number of subcarriers) dynamically allocated according to the system state and the channel state information; 2) the VU resources required to execute the service are allocated in the RSU, and after execution finishes, the result is returned to the mobile user.
Since all users share the entire wireless spectrum, it is divided into a large number of mutually orthogonal subcarriers, and the system allocates different numbers of subcarriers to the various services. Data transmission uses time-division multiplexing, with each mobile user occupying different time slots for service transmission. The number of subcarriers affects the transmission rate, because the total rate is the rate of a single subcarrier multiplied by the number of subcarriers; the more subcarriers are allocated, the higher the transmission rate. Owing to the time-varying nature of the wireless channel, the reliability of a service during transmission can degrade and, in severe cases, transmission may be interrupted, affecting the QoS of the system and the QoE of users. To compensate for the influence of the channel's time-varying characteristics, the system reallocates the number of subcarriers occupied by the various services according to the channel state information.
Consider that there are multiple service types in the system, type ∈ {c_1, c_2, …, c_l, …, c_L}, where c_l denotes a service type; the priority of the services increases with l, and the requirements on quality of service also increase with priority.
Assume the total number of computing resource VUs in the edge server is given; one quantity denotes the number of c_l-type services occupying j VUs, and another denotes the minimum VU requirement of c_l-type services. The goal of the virtual VU resources in the edge server is to improve QoS/QoE as much as possible (by increasing the number of allocated VUs) while meeting the requirements of the various services, and to increase the overall benefit of the system. The VU allocation optimization problem is therefore expressed as maximizing the system computing revenue under the VU capacity and minimum-requirement constraints.
Assume the total bandwidth of the wireless spectrum is given, with subcarrier bandwidth B_subc, so that the number of subcarriers is the total bandwidth divided by B_subc, rounded down. All users share the spectrum resources, and the multiple-access technique allows multiple users to share the limited radio spectrum simultaneously without causing severe interference (collisions). A corresponding quantity denotes the number of c_l-type services occupying i subcarriers. According to the Shannon theorem, the channel capacity is
C = W log2(1 + SNR) (19)
where W is the allocated bandwidth expressed through the number of subcarriers, i.e. W = i·B_subc, i = 1, 2, …, and the SNR is the signal-to-noise ratio (formula (20)), which depends on the transmit power P_c, the noise power σ^2, the large-scale fading component m(t), and the small-scale fading component h(t). In the large-scale fading model, the time constant associated with fading variation as the mobile device moves is very large, on the order of seconds or minutes; the small-scale fading propagation model characterizes rapid fluctuations of the received signal strength over short distances or short times. In this subsection, only the time-varying characteristic of the wireless channel over a short time is considered; since the distance moved by a mobile device in a short time is small, large-scale fading is approximated as a constant.
The fast fading variation of the channel within a time interval τ is modeled by a first-order Gauss-Markov process:
h(t) ~ CN(0, 1 - ε^2) (21)
where ε quantifies the channel correlation between two consecutive time intervals; in the Jakes model [26], ε = J_0(2π f_d τ), where J_0(·) is the zeroth-order Bessel function of the first kind, f_d = v f_c / c is the maximum Doppler shift, v is the moving speed of the mobile device, f_c is the carrier frequency, and c is the speed of light, c = 3 × 10^8 m/s.
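A small numeric sketch of this channel model and the resulting capacity over i subcarriers (all parameter values are illustrative, and the correlated-fading recursion is the usual first-order Gauss-Markov form assumed here, not a formula quoted from the patent):

```python
import math, random
from scipy.special import j0   # zeroth-order Bessel function of the first kind

v, f_c, c_light = 20.0, 5.9e9, 3e8           # speed (m/s), carrier (Hz), speed of light
tau = 1e-3                                   # decision interval (s)
f_d = v * f_c / c_light                      # maximum Doppler shift
eps = j0(2 * math.pi * f_d * tau)            # correlation between consecutive intervals

def next_fading(h_prev: complex) -> complex:
    """First-order Gauss-Markov update: h(t+tau) = eps*h(t) + w, w ~ CN(0, 1 - eps^2)."""
    std = math.sqrt((1 - eps**2) / 2)
    w = complex(random.gauss(0, std), random.gauss(0, std))
    return eps * h_prev + w

h = complex(random.gauss(0, math.sqrt(0.5)), random.gauss(0, math.sqrt(0.5)))
P_c, sigma2, m_large, B_subc, i = 0.1, 1e-3, 1.0, 15e3, 4   # illustrative values
for _ in range(3):
    h = next_fading(h)
    snr = P_c * m_large * abs(h) ** 2 / sigma2
    C = i * B_subc * math.log2(1 + snr)      # Shannon capacity with i subcarriers
    print(f"SNR = {snr:.2f}, capacity = {C/1e3:.1f} kbit/s")
```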
Owing to the small-scale fading of the channel, the signal-to-noise ratio of a wireless link varies over time, and so does the transmission quality of a service. Corresponding quantities denote the minimum transmission rate and the minimum signal-to-noise ratio required by c_l-type services.
The goal of spectrum resource allocation in edge computing is to dynamically adjust, at the next decision time, the number of subcarriers allocated to admitted services according to the time-varying characteristics of the channel, so as to meet the delay and jitter requirements of the various services, improve QoS/QoE, and obtain more system revenue. The subcarrier allocation optimization problem is therefore expressed as minimizing the service delay, which is further converted into maximizing the service transmission rate subject to the subcarrier budget and the minimum-rate requirements.
Whether allocating computing or communication resources, the system needs to reduce the rejection rate of service requests while meeting the various service requirements, i.e. to accept as many new services as possible; this is expressed as a further optimization objective.
The goal of the joint allocation of communication and computing resources is, given the system state and channel state information and subject to the various service requirements, to reduce the service rejection rate, improve QoS/QoE, and maximize the overall long-term benefit of the system. These objectives conflict with one another and cannot all be met simultaneously, so trade-offs must be made. The joint optimization objective (formula (24)) is expressed as a weighted combination of the individual objectives, where the weight factors sum to one and their proportions are adjusted according to the system requirements to achieve different desired goals.
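A toy illustration of such a weighted objective (the weights and the three component scores are placeholders, not the patent's formula (24)):

```python
def joint_objective(comp_revenue: float, comm_rate: float, accept_rate: float,
                    w=(0.4, 0.3, 0.3)) -> float:
    """Weighted combination of the three (normalized) objectives; weights sum to 1."""
    assert abs(sum(w) - 1.0) < 1e-9
    return w[0] * comp_revenue + w[1] * comm_rate + w[2] * accept_rate

# Shifting weight toward the acceptance rate favors lower rejection over raw throughput.
print(joint_objective(0.8, 0.6, 0.9))
print(joint_objective(0.8, 0.6, 0.9, w=(0.2, 0.2, 0.6)))
```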
Because the joint optimization problem is NP-hard, it is difficult to obtain and solve an analytical solution. The joint optimization problem described above is therefore constructed as a hierarchical model, as shown in Fig. 4.
The lower layer of the hierarchical model is shown in the upper part of Fig. 4. Time is divided into small, equal intervals τ (on the order of milliseconds); each solid black point is a decision moment of the controller, at which the number of subcarriers occupied by the various admitted services is reallocated according to the system state and channel state information. This process can be regarded as an MDP.
The upper layer of the hierarchical model is shown in the middle of Fig. 4. The open circles represent the times at which events occur; the controller decides whether to accept a service request and how many VUs to allocate to it. This process can be regarded as an SMDP.
The lower part of Fig. 4 shows the SMDP decisions superimposed on the underlying MDP process. Without the hierarchy, both kinds of decisions would have to be learned under the same learning rule with a common strategy: one kind decides whether to accept a service request, the other executes the VU and subcarrier allocation. Layering decouples them: the large-scale SMDP decisions of the upper layer have an independent learning process and an independent strategy, allowing the controller to look further ahead on top of the lower-layer MDP.
It can be seen that the MDP process and the SMDP process overlap; in practice their decision instants may not coincide. When the lower-layer MDP decision intervals are made as small as possible, the two can be approximately regarded as coinciding.
The system state includes the usage of subcarriers in the communication resources and the usage of VUs in the computing resources; the state set of the communication resources is denoted X and the state set of the computing resources is denoted Y. The event e includes the arrival and departure of service requests of the various traffic types, with e = 0 denoting that no event occurs. The state sets of the communication and computing resources and the event are integrated into one state representation, Z = {e, X, Y}.
The system action set A ∈ {A_up, A_down} consists of upper-layer actions A_up ∈ {-2, -1, 0} and lower-layer actions A_down ∈ {1, 2, ib, jv}. Here a = -2 indicates that the system rejects the service request, a = -1 that a service leaves the system, and a = 0 that the system accepts the service request; a = 1 indicates that the system reallocates the number of subcarriers of the spectrum resources according to the time-varying channel state information, in order of service priority from high to low, a = 2 indicates that no reallocation is performed, a = ib indicates that i subcarriers are allocated to a service, and a = jv indicates that j VUs are allocated to a service.
Separate quantities denote the revenue obtained by accepting a c_l-type service, the gain obtained by allocating i subcarriers to a c_l-type service, and the gain obtained by allocating j VUs to a c_l-type service; exp_comm denotes the unit cost of communication, exp_comp the unit cost of computation, and a penalty term represents the loss incurred when the system rejects a service request. z* denotes the state of the system at the moment an event occurs, and the number n_T of time intervals elapsing between two adjacent events can be expressed as
n_T = min{t > 0 | z_t = z*, z* ∈ Z} (25)
Thus, the cumulative reward obtained by the system from reallocating communication resources, i.e. subcarriers, between two events is given by formula (26); the instantaneous profit obtained by the system from accepting a service request when an event occurs is given by formula (27); and the instantaneous benefit obtained by the underlying MDP process when no event occurs is given by formula (28).
To solve the HRL and obtain the optimal strategy, the Q-Learning algorithm is used here. The upper-layer SMDP decision process is denoted SMDP Q-Learning and updates its action value function according to formula (29); the lower-layer MDP decision process is denoted MDP Q-Learning and updates its action value function according to formula (30). Since the SMDP and the MDP operate on different time scales, the Q-update counters n' and n are different. The Q values are updated iteratively until the algorithm converges, finally yielding the optimal strategy for the joint optimal allocation of communication and computing resources.
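A compact sketch of the two-timescale idea (a simplified stand-in for formulas (29)-(30); the table layout, learning rates, and counters are assumptions for illustration):

```python
from collections import defaultdict

Q_up = defaultdict(float)     # SMDP layer: accept / reject / departure decisions
Q_down = defaultdict(float)   # MDP layer: per-interval subcarrier reallocation

def q_backup(Q, s, a, r, s_next, actions, beta, gamma):
    best = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] += beta * (r + gamma * best - Q[(s, a)])

n_up, n_down = 0, 0           # separate update counters for the two time scales

# Lower layer: one update every interval tau (fast time scale).
n_down += 1
q_backup(Q_down, ("chan_state", 4), 1, r=0.2, s_next=("chan_state", 5),
         actions=[1, 2], beta=1.0 / n_down, gamma=0.9)

# Upper layer: one update only when an event (arrival/departure) occurs (slow scale),
# using the reward accumulated by the lower layer between the two events.
n_up += 1
q_backup(Q_up, "arrival_c2", 0, r=1.7, s_next="no_event",
         actions=[-2, -1, 0], beta=1.0 / n_up, gamma=0.9)
```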
(Algorithm 2, given as an image in the original publication: hierarchical reinforcement learning for the joint allocation of communication and computing resources.)
Algorithm 2 describes the whole hierarchical reinforcement learning process. Compared with a single-layer SMDP process, the system is modeled here as an HRL model that takes into account the influence of the time-varying channel on service transmission; the spectrum resources, i.e. the number of subcarriers, can be adaptively reallocated to meet the delay requirements of the various services as far as possible, thereby improving QoS/QoE. The system can therefore obtain more benefit under this framework.
Simulation verification:
and only carrying out simulation verification on the vehicle cloud resources with the resource reservation and resource secondary allocation mechanism. We evaluated the performance of SMDP-based resource allocation models and model-based reinforcement learning algorithms using MATLAB software simulation validation. The convergence performance of the model-based reinforcement learning resource allocation method is verified firstly, and compared with the condition of no resource secondary allocation mechanism. The greedy algorithm does not consider statistical information of the environment, only looks at the short-term maximum income, compares the reinforcement learning algorithm with the traditional heuristic greedy algorithm, can highlight the learning capability of the reinforcement learning algorithm from the environment and verify the advantages of a resource secondary distribution mechanism, and then discusses the reservation strategy. Here we use a model-based Dyna algorithm. The algorithm 1 updates the Dyna-Q online learning algorithm for k-step backtracking and is used for learning an optimal strategy.
When a new important service request arrives and there are no allocatable resources in the cloud, some ordinary services are selected from the cloud, their resource occupancy is reduced, and the released VUs are allocated to the new important service; this reduces the rejection rate of important services without damaging the benefit of the reduced services too much. Within a given fixed time interval, if no event occurs, the resource occupancy of ordinary services in the cloud is increased, improving their QoE. Without loss of generality, the simulation considers only the above secondary resource allocation process. Table 1 gives the simulation parameters used to verify the secondary resource allocation mechanism. Each algorithm runs for T_max = 1 × 10^5 time units, where a time unit represents computer simulation time rather than real time. Fig. 5 records the results over a computer simulation time of 20, with time on the abscissa, to observe the change over time, the convergence of the algorithm, and the cumulative average reward.
TABLE 1 Simulation parameters (given as an image in the original publication)
FIG. 5 shows that the system gain obtained with the proposed model and mechanisms is much higher than with the greedy algorithm, because the greedy strategy is short-sighted and focuses only on the immediate maximum benefit. The figure also shows that the secondary resource allocation mechanism lets the system obtain more benefit: the mechanism trades the benefit of a small number of ordinary services for the benefit of more important services, reducing the rejection rate of important service requests, and in some special cases it also improves the QoE of ordinary services. To the left of the dotted line there are more ordinary service requests than important ones, and the algorithm converges quickly. To the right of the dotted line the environment changes so that the arrival rate of important service requests increases; the model-based reinforcement learning algorithm perceives the change in the environment and adjusts its strategy accordingly, obtaining more benefit, whereas the greedy strategy does not change with the environment and cannot obtain more system benefit.
As can be seen from FIG. 6(a), when the arrival rate of ordinary service requests is higher than that of important service requests, the rejection rate of important requests under the greedy algorithm and the SAoR-free mechanism is higher than that of ordinary requests: the cloud resources are limited, the controller has already admitted a large number of ordinary services, and too few resources remain for important requests to meet their demand, so they are rejected. The resource secondary allocation mechanism addresses exactly this problem, and the rejection rate of important service requests drops markedly. Moreover, the model-based reinforcement learning algorithm learns from the environment with the goal of maximizing the long-term system yield, so it arrives at a compromise strategy that does not always allocate the maximum number of resources a service could obtain, and its rejection rate is lower than that of the greedy algorithm. FIG. 6(b) shows that when the arrival rate of important service requests exceeds that of ordinary requests and cloud resources are limited, the overall rejection rate rises, but the model-based reinforcement learning algorithm combined with the resource secondary allocation mechanism still yields better performance than the greedy strategy.
Next, the influence of the reservation policy on the system is analyzed, with the simulation parameters shown in Table 2.
TABLE 2 simulation parameters
As can be seen from FIG. 7, the system yield obtained with the greedy algorithm is lower than that of the reinforcement learning algorithm, because the greedy algorithm raises the rejection rate of subsequent service requests, lowering QoS and QoE and therefore the system yield. FIG. 7(a) shows that the cumulative average reward decreases as the number of VUs the system reserves for important services increases. This is because ordinary service requests outnumber important service requests in this setting, and reserving more VUs shrinks the resources available to ordinary services. Although the QoE of important services improves, the acceptance rate of ordinary services drops sharply and the accumulated system benefit falls. FIG. 7(b) shows that the cumulative average reward first increases and then decreases as the number of reserved VUs grows. Here important service requests are more numerous, so reserving more VUs improves their QoE and lowers their rejection rate, and since an accepted important request earns more than an ordinary one, the reward rises. Once the number of reserved VUs exceeds a certain value, however, the rejection rate of ordinary services rises sharply and their QoE falls, and the reward earned from important services can no longer compensate for this loss, so the overall cumulative average reward of the system decreases. Dynamically adjusting the number of reserved VUs according to the environment therefore improves the QoE of important services without sacrificing too much QoE of ordinary services, and raises the long-term benefit of the system.
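The reservation trade-off can be explored with a crude sweep like the following; the arrival rates, rewards, one-VU-per-request simplification, and horizon are illustrative assumptions intended only to show how the average reward responds to the number of reserved VUs, not to reproduce the patented reward model.

```python
import random

def simulate_reward(reserved, total_vus=20, lam_p=0.3, lam_q=0.7,
                    mu=0.5, r_p=5.0, r_q=1.0, horizon=10_000, seed=0):
    """Crude slotted simulation: 'reserved' VUs are usable only by important
    requests; each accepted request holds one VU for an exponential time."""
    rng = random.Random(seed)
    busy = []            # departure times of requests currently in service
    reward = 0.0
    for t in range(horizon):
        busy = [d for d in busy if d > t]                  # finished requests leave
        if rng.random() < lam_p and len(busy) < total_vus:            # important arrival
            busy.append(t + rng.expovariate(mu)); reward += r_p
        if rng.random() < lam_q and len(busy) < total_vus - reserved:  # ordinary arrival
            busy.append(t + rng.expovariate(mu)); reward += r_q
    return reward / horizon

for reserved in (0, 2, 4, 6, 8, 10):
    print(reserved, round(simulate_reward(reserved), 3))
```

Sweeping the reservation count under different arrival-rate ratios is the same experiment that FIG. 7(a) and FIG. 7(b) report for the full model.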
Simulation results show that, compared with the greedy algorithm, the model-based reinforcement learning algorithm obtains an adaptive resource allocation strategy. Introducing the secondary allocation mechanism further improves system performance, reduces the rejection rate of service requests, and yields more system benefit. The reservation strategy also improves the QoE of important service users, provided that the reservation ratio is dynamically adjusted as the environment changes.

Claims (2)

1. An Internet of vehicles cloud computing resource optimization method based on reinforcement learning, characterized by comprising the following steps:
A. modeling a resource allocation problem of the Internet of vehicles system as a Semi Markov Decision Process (SMDP), and introducing a resource reservation strategy and a resource secondary allocation mechanism;
the system is set to have M virtual units (VUs) and two kinds of service requests: important service requests and ordinary service requests; the number of VUs that can be allocated to a service request is l, where l ≤ M and l ∈ {1, 2, ..., L}; the arrivals of important requests and ordinary requests are assumed to follow Poisson processes with average rates λ_p and λ_q respectively; the processing time of a request follows an exponential distribution with average departure rate λ_l, a function of the number l of VUs allocated to the service, λ_l = l + 1, and 1/λ_l represents the average processing time of a service to which l VUs are assigned;
reserving a part of the resources according to a reservation ratio Th, dedicated to important service requests;
describing the system state S by the event of the vehicle service request, namely arrival or departure, and by the number of VUs occupied by the different kinds of services in the system; s_pi and s_qj respectively denote the number of important services occupying i VUs and the number of ordinary services occupying j VUs, where i ∈ {n_p, n_p+1, ..., L_p}, j ∈ {n_q, n_q+1, ..., L_q}, and n_p and n_q denote the minimum numbers of VUs assigned to important and ordinary service requests respectively; two events e are defined:
1) the arrival of an important or ordinary vehicle service request, e_ar, denoted by e_ar^p and e_ar^q respectively, i.e. e_ar ∈ {e_ar^p, e_ar^q};
2) the departure of a service request, e_d, denoted by e_d^(p,i) and e_d^(q,j), indicating respectively the departure of an important service occupying i VUs and of an ordinary service occupying j VUs, i.e. e_d ∈ {e_d^(p,i), e_d^(q,j)};
thus, e ∈ {e_ar, e_d} represents the set of all events;
if no event occurs within a fixed time interval, secondary resource allocation is performed on the services in the cloud according to the resource usage in the cloud and their resource occupancy is increased; timeout = 0 indicates that an event occurred within the fixed time interval, otherwise timeout = 1; the system states are represented as follows:
s_ar = { s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_ar> }   (1)
s_d = { s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_d, timeout> }   (2)
S ∈ {s_ar, s_d}, where s_ar denotes the state when a service arrival event occurs and s_d denotes the state when a service departure event occurs;
after receiving a vehicle service request, if the system decides to process the request immediately it assigns l VUs to it; receiving an important service request corresponds to the action a(s_ar^p) = l, and receiving an ordinary service request corresponds to the action a(s_ar^q) = l, where s_ar^p and s_ar^q denote the system state at the arrival of an important and an ordinary service request respectively;
when the system encounters the special state in which the idle resources cannot satisfy a newly arrived service request, if the system chooses to receive the request it executes the special action agreed in the resource secondary allocation mechanism: according to the system state, one or more running services are selected and part of their resources are released so that the minimum VU demand of the current service request is met, the release action taking l VUs from a running important service occupying i VUs or from a running ordinary service occupying j VUs, while the action of receiving the new service request is performed; if a vehicle service request is rejected on the basis of the system state and the long-term revenue of the system, it is assigned no VU and the corresponding action is a(s_ar) = 0;
when a vehicle service request in the cloud leaves, the VUs it occupied are released, and the corresponding action is a(s_d) = -1;
other special states are encountered when the quantity of idle resources reaches the respective thresholds defined for important and ordinary services to execute special actions; if the system chooses to secondarily allocate resources to running services, it adopts the expansion action defined in the resource secondary allocation mechanism, adding l VUs to a running important service occupying i VUs or to a running ordinary service occupying j VUs;
the action set of the adaptive VU allocation model therefore comprises the departure action -1, the rejection action 0, the accepting actions that assign l ∈ {1, ..., L} VUs, and the special release and expansion actions defined above (5);
the overall system revenue is denoted z(s, a) and consists of income, cost, and additional cost:
z(s, a) = x(s, a) - y(s, a) - ext(s, a), e ∈ {e_ar, e_d}   (6)
where x(s, a) is the income obtained from vehicle users when vehicle service requests are satisfied, y(s, a) is the system cost evaluated from the number of VUs used by the services, and ext(s, a) is the additional cost introduced by the resource secondary allocation mechanism;
the income is expressed by formula (7), whose terms are as follows: R is the reward obtained by immediately processing an important service request; I_v is the profit evaluated through the change in user QoE and system QoS, and takes a fixed value according to the increase in QoE and QoS if the request is processed immediately; r_v is the payment a user makes for each VU the system allocates; two penalty terms represent the costs of rejecting important and ordinary services respectively, weighted by the weight factors of important and ordinary services; a basic reward is added for allocating l resources; and r_Th is the converted reward obtained when the system reserves part of the resources for important service requests and thereby improves the QoE of important services;
the system cost is represented by the following equation:
y(s, a) = t(s, a) h(s, a), a ∈ a(s)   (8)
where t(s, a) denotes the average expected time from the system making decision a(s) in the current state s to the next state, and h(s, a) denotes the system service loss over the average expected time t(s, a), evaluated from the number of occupied VUs, with c_v representing the cost of occupying one VU;
in addition, since expanding the resources of a service affects the future long-term yield of the system, it brings a certain loss; reducing the resource occupancy of a service in the cloud also causes a loss, proportional to the importance of the service and inversely proportional to the quantity of resources the service occupies; executing the reservation policy leads to a special state in which the acceptance probability of a newly arrived ordinary service request is reduced, lowering its QoE, with a quantified cost proportional to the number of reserved resources; the additional cost ext(s, a) therefore comprises the cost of reducing a service by l VUs, weighted by the weight factors of important and ordinary services, and c_Th, the cost in ordinary-service QoE incurred by executing the reservation strategy;
B. solving the model by using reinforcement learning;
the reinforcement learning algorithm solves the Bellman optimality equation by asynchronous iteration to obtain the optimal strategy; when the controller cannot obtain the state transition probabilities of the environment, an approximately optimal strategy is obtained by continuously and iteratively updating the following action value function:
when the temperature is higher than the set temperature
Figure FDA0002510802880000045
When the temperature of the water is higher than the set temperature,
Figure FDA0002510802880000046
when in use
Figure FDA0002510802880000047
When the utility model is used, the water is discharged,
Figure FDA0002510802880000048
wherein β_n denotes the learning rate at the n-th decision step and γ denotes the discount factor, 0 < γ < 1; β_n is a small positive value, 0 < β_n ≤ 1, chosen appropriately and decreased continuously as learning proceeds so as to avoid non-convergence;
if every state-action pair in the environment can be visited infinitely many times, then when the action value function converges the optimal action value function is obtained,
Q*(s, a) = max_π Q^π(s, a)
where π denotes a policy;
the optimal strategy pi can be obtained from the optimal action value function * Which means that the probability of a certain action is taken to accomplish the goal, as shown in the following formula,
Figure FDA00025108028800000410
the controller selects the action to be taken under the current system state by adopting an epsilon-greedy exploration strategy, namely, the action corresponding to the action value is selected according to the probability epsilon, and one action is randomly selected from the rest actions according to the probability epsilon/| a(s) |; and gradually attenuate beta using DCM n And ε n And finally, the algorithm converges.
2. The Internet of vehicles cloud computing resource optimization method based on reinforcement learning of claim 1, further comprising a hierarchical architecture for adaptive joint optimization of edge-computing communication and computing resources, wherein the upper layer is an SMDP-based mechanism for deciding whether to accept or reject service requests and the lower layer is an MDP-based spectrum resource reallocation mechanism, solved by means of hierarchical reinforcement learning.
CN202010460525.9A 2020-05-27 2020-05-27 Internet of vehicles cloud computing resource optimization method based on reinforcement learning Active CN111711666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010460525.9A CN111711666B (en) 2020-05-27 2020-05-27 Internet of vehicles cloud computing resource optimization method based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN111711666A CN111711666A (en) 2020-09-25
CN111711666B true CN111711666B (en) 2022-07-26

Family

ID=72538151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010460525.9A Active CN111711666B (en) 2020-05-27 2020-05-27 Internet of vehicles cloud computing resource optimization method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111711666B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750298B (en) * 2020-12-17 2022-10-28 华路易云科技有限公司 Truck formation dynamic resource allocation method based on SMDP and DRL
CN112612610B (en) * 2020-12-18 2021-08-03 广州竞远安全技术股份有限公司 SOC service quality guarantee system and method based on Actor-Critic deep reinforcement learning
CN112667400B (en) * 2020-12-29 2021-08-13 天津大学 Edge cloud resource scheduling method, device and system managed and controlled by edge autonomous center
US20240121136A1 (en) * 2021-03-26 2024-04-11 Shenzhen University Deep learning method and system for spectrum sharing of partially overlapping channels
CN113613339B (en) * 2021-07-10 2023-10-17 西北农林科技大学 Channel access method of multi-priority wireless terminal based on deep reinforcement learning
CN113518090B (en) * 2021-07-20 2023-08-01 绍兴文理学院 Edge computing architecture Internet of things intrusion detection method and system
CN113703962B (en) * 2021-07-22 2023-08-22 北京华胜天成科技股份有限公司 Cloud resource allocation method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662764A (en) * 2012-04-25 2012-09-12 梁宏斌 Dynamic cloud computing resource optimization allocation method based on semi-Markov decision process (SMDP)
CN109451462A (en) * 2018-11-16 2019-03-08 湖南大学 A kind of In-vehicle networking frequency spectrum resource allocation method based on semi-Markov chain
CN109831522A (en) * 2019-03-11 2019-05-31 西南交通大学 A kind of vehicle connection cloud and mist system dynamic resource Optimal Management System and method based on SMDP
CN109905335A (en) * 2019-03-06 2019-06-18 中南大学 A kind of cloud radio access network resource distribution method and system towards bullet train


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"An SMDP-Based Service Model for Interdomain Resource Allocation in Mobile Cloud Networks";Hongbin Liang等;《IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY》;20120630;第61卷(第5期);全文 *
"基于SMDP的移动云计算网络安全服务与资源优化管理研究";梁宏斌;《中国优秀博硕士学位论文全文数据库(博士) 信息科技辑》;20131015;全文 *
车载云计算系统中资源分配的优化方法;董晓丹等;《中国电子科学研究院学报》;20200120(第01期);全文 *

Also Published As

Publication number Publication date
CN111711666A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
Sun et al. Autonomous resource slicing for virtualized vehicular networks with D2D communications based on deep reinforcement learning
Qian et al. Survey on reinforcement learning applications in communication networks
CN111629380B (en) Dynamic resource allocation method for high concurrency multi-service industrial 5G network
Wu et al. An efficient offloading algorithm based on support vector machine for mobile edge computing in vehicular networks
CN111414252A (en) Task unloading method based on deep reinforcement learning
CN110839075A (en) Service migration method based on particle swarm in edge computing environment
Park et al. Network resource optimization with reinforcement learning for low power wide area networks
CN110580199B (en) Service migration method based on particle swarm in edge computing environment
CN111614754B (en) Fog-calculation-oriented cost-efficiency optimized dynamic self-adaptive task scheduling method
CN112055329A (en) Edge Internet of vehicles task unloading method suitable for RSU coverage switching
CN111240821B (en) Collaborative cloud computing migration method based on Internet of vehicles application security grading
CN111132074A (en) Multi-access edge computing unloading and frame time slot resource allocation method in Internet of vehicles environment
CN109743217B (en) Self-adaptive resource adjusting method based on SVRA algorithm
EP4024212A1 (en) Method for scheduling interference workloads on edge network resources
CN112188627A (en) Dynamic resource allocation strategy based on state prediction
CN116390125A (en) Industrial Internet of things cloud edge cooperative unloading and resource allocation method based on DDPG-D3QN
CN114885422A (en) Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network
CN116916386A (en) Large model auxiliary edge task unloading method considering user competition and load
CN115967990A (en) Classification and prediction-based border collaborative service unloading method
CN115052262A (en) Potential game-based vehicle networking computing unloading and power optimization method
CN117202264A (en) 5G network slice oriented computing and unloading method in MEC environment
CN113452625B (en) Deep reinforcement learning-based unloading scheduling and resource allocation method
CN115118783A (en) Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning
CN114637552A (en) Fuzzy logic strategy-based fog computing task unloading method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220706

Address after: 210000 third floor, Beidou building, No. 6, Huida Road, Jiangbei new area, Nanjing, Jiangsu

Applicant after: Hua Lu Yun Technology Co.,Ltd.

Address before: 610031 No. 1, floor 5, unit 3, building 6, No. 8 Qingyang Avenue, Qingyang District, Chengdu, Sichuan Province

Applicant before: Liang Hongbin

GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 210000 third floor, Beidou building, No. 6, Huida Road, Jiangbei new area, Nanjing, Jiangsu

Patentee after: Hualui Cloud Technology Co.,Ltd.

Address before: 210000 third floor, Beidou building, No. 6, Huida Road, Jiangbei new area, Nanjing, Jiangsu

Patentee before: Hua Lu Yun Technology Co.,Ltd.