Disclosure of Invention
To address the above problems, the invention provides a vehicle networking cloud computing resource optimization method based on reinforcement learning.
The invention discloses a vehicle networking cloud computing resource optimization method based on reinforcement learning, which specifically comprises the following steps of:
A. modeling the resource allocation problem of the Internet of vehicles system as a Semi Markov Decision Process (SMDP), and introducing a resource reservation strategy and a resource secondary allocation mechanism.
The system is set to have M virtual units (VUs) and two kinds of service requests: important service requests and ordinary service requests. The number of VUs that can be allocated to a service request is l, where l ∈ {1, 2, ..., L} and L ≤ M. Assuming that the arrivals of important and ordinary requests follow Poisson processes with average rates λ_p and λ_q respectively, the processing time of a request follows an exponential distribution whose average departure rate λ_l is a function of the number l of VUs allocated to the service, λ_l = l + 1; 1/λ_l denotes the average processing time of a service to which l VUs are assigned.
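The arrival and service model above can be sketched as a minimal simulation, assuming the stated Poisson arrivals and the departure-rate function λ_l = l + 1; all parameter values below are illustrative, not the invention's:

```python
import random

def departure_rate(l):
    # Average departure rate as a function of the number of allocated VUs:
    # lambda_l = l + 1, so the mean processing time is 1 / (l + 1).
    return l + 1

def sample_interarrival(lam, rng):
    # Poisson arrivals imply exponential inter-arrival times with mean 1/lam.
    return rng.expovariate(lam)

def sample_processing_time(l, rng):
    # Processing time is exponential with rate lambda_l.
    return rng.expovariate(departure_rate(l))

rng = random.Random(0)
lam_p, lam_q = 2.0, 3.0  # illustrative mean arrival rates (important / ordinary)
t_arrival = sample_interarrival(lam_p, rng)
t_service = sample_processing_time(4, rng)  # a service holding l = 4 VUs
```

Because both distributions are exponential, the whole system evolves as a Markovian queue, which is what makes the SMDP formulation below applicable.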
A part of the resources is reserved according to the reservation ratio Th and is dedicated exclusively to important service requests.
The system state S is described by the event of the vehicle service request, namely arrival or departure, and by the number of VUs occupied by the different kinds of services in the system. s_pi and s_qj respectively denote the number of important service requests occupying i VUs and the number of ordinary service requests occupying j VUs in the system, where i ∈ {n_p, n_p+1, ..., L_p}, j ∈ {n_q, n_q+1, ..., L_q}, and n_p and n_q denote the minimum number of VUs assigned to important and ordinary service requests, respectively. Two events e are defined:
1) Arrival of important and ordinary vehicle service requests, e_ar, denoted respectively by the arrival events of the two request types.
2) Departure of a service request, e_d, denoted respectively by the departure of an important service occupying i VUs and of an ordinary service occupying j VUs.
Thus e ∈ {e_ar, e_d} denotes the total event set.
At a fixed time interval τ_int, if no event occurs, secondary resource allocation is performed on the services in the cloud according to the resource usage in the cloud, increasing their resource occupancy; timeout = 0 indicates that an event occurred within the fixed time interval, and timeout = 1 otherwise. The system state is represented as follows:

s_ar = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_ar>} (1)

s_d = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_d, timeout>} (2)

S ∈ {s_ar, s_d}, where s_ar denotes the state when a service arrival event occurs and s_d the state when a service departure event occurs.
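The state vectors of Eqs. (1)-(2) can be held in plain records; a minimal sketch (the field layout is an assumption for illustration, since part of the notation is elided in the source):

```python
from collections import namedtuple

# s_p maps i -> s_pi (important services occupying i VUs);
# s_q maps j -> s_qj (ordinary services occupying j VUs).
ArrivalState = namedtuple("ArrivalState", ["s_p", "s_q", "event"])
DepartureState = namedtuple("DepartureState", ["s_p", "s_q", "event", "timeout"])

s_example_ar = ArrivalState(s_p={2: 1, 3: 0}, s_q={1: 2}, event="e_ar")
s_example_d = DepartureState(s_p={2: 1}, s_q={1: 1}, event="e_d", timeout=0)
```

Only the departure-type state carries the timeout flag, matching Eq. (2).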
After receiving a vehicle service request, if the system decides to process the request immediately, it assigns l VUs to the request. The action corresponding to accepting an important service request and the action corresponding to accepting an ordinary service request are defined on the system states at the arrival of important and ordinary service requests, respectively.
When the system encounters a special state, if it chooses to accept the service request, it executes the contraction special action of the resource secondary allocation mechanism: to satisfy the minimum VU requirement of the current service request, one or more services are selected from the running services according to the system state, and part of their resources are released. One action releases l VUs from a running important service occupying i VUs, and another releases l VUs from a running ordinary service occupying j VUs, while simultaneously the new service request is accepted. If a vehicle service request is refused based on the system state and the long-term revenue of the system, it is not assigned any VUs, and the corresponding action is a(s_ar) = 0.
When a vehicle service request in the cloud leaves, the occupied VUs are released, and the corresponding action is a(s_d) = -1.
When some special states are encountered, thresholds on the amount of idle resources determine when important and ordinary services may execute special actions. If the system chooses to secondarily allocate the resources occupied by running services, it adopts the expansion special action of the resource secondary allocation mechanism: one action adds l VUs to a running important service occupying i VUs, and another adds l VUs to a running ordinary service occupying j VUs.
The set of actions of the adaptive VU assignment model is:
the overall system revenue is considered as z (s, a), which includes three categories of revenue, cost and additional cost:
z(s,a)=x(s,a)-y(s,a)-ext(s,a),e∈{e ar ,e d } (6)
where x (s, a) refers to revenue obtained from vehicle users when satisfying a vehicle service request, y (s, a) refers to system cost generated by evaluating the number of VUs used by the service, and ext (s, a) refers to additional cost incurred by the resource secondary allocation mechanism.
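The decomposition of Eq. (6) can be sketched directly. The stub income term below (a per-VU payment r_v, an immediate-processing reward R, a rejection cost) only loosely mirrors the revenue components described later; all constants and the exact functional form are illustrative assumptions, not the invention's formulas:

```python
def income(accepted, important, l, R=5.0, r_v=1.0, c_rej=2.0):
    # Hypothetical x(s, a): a per-VU payment r_v, an extra reward R when an
    # important request is served immediately, and a cost when rejecting.
    if not accepted:
        return -c_rej
    return l * r_v + (R if important else 0.0)

def total_revenue(x, y, ext):
    # z(s, a) = x(s, a) - y(s, a) - ext(s, a)   (Eq. 6)
    return x - y - ext
```

For instance, immediately serving an important request with l = 4 VUs yields x = 9.0 under these placeholder constants.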
The revenue is expressed as:
where R is the reward obtained by immediately processing an important service request; I_v is the profit evaluated through the change in the user's QoE and the system's QoS (if the request is processed immediately, I_v takes a fixed value according to the increase in QoE and QoS); r_v is the payment made by the user for each VU the system allocates; two further cost terms represent the costs of rejecting important and ordinary services, with corresponding weight factors for the two service kinds; another term adds a basic reward for allocating l resources; and r_Th is the converted reward obtained when the system reserves part of the resources for important service requests, improving the QoE of the important services.
The system cost is represented by the following equation:
y(s,a)=t(s,a)h(s,a),a∈a(s) (8)
where t(s, a) represents the average expected time from the moment the system makes decision a(s) in the current state s until the next state, and h(s, a) represents the system service loss over the average expected time t(s, a), expressed as:

where c_v denotes the cost of occupying one VU.
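Eqs. (8)-(9) can be sketched as follows, assuming (as one plausible reading, since the exact Eq. (9) is elided in the source) that the loss rate h(s, a) charges c_v per occupied VU:

```python
def occupancy_loss_rate(total_occupied_vus, c_v=0.5):
    # Assumed form of h(s, a): loss accrues at c_v per occupied VU
    # (the exact Eq. (9) is elided in the source, so this is a placeholder).
    return c_v * total_occupied_vus

def system_cost(t, h):
    # y(s, a) = t(s, a) * h(s, a): expected sojourn time times loss rate (Eq. 8).
    return t * h
```

So a state holding 10 VUs for an expected 2.0 time units would cost y = 10.0 under these illustrative values.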
In addition, since resource expansion of a running service has a certain influence on the future long-term yield of the system, it brings some loss; likewise, reducing the resources occupied by a service in the cloud causes a loss that is proportional to the importance of the service and inversely proportional to the amount of resources the service occupies. When the reservation policy is executed and a special state is encountered, the acceptance probability of a newly arrived ordinary service request is reduced, which lowers its QoE; this quantified cost is proportional to the number of reserved resources. Thus, the additional cost can be expressed as:
where one term denotes the cost of reducing a service by l VUs; weight factors represent important and ordinary services, respectively; and c_Th represents the cost of implementing the reservation policy, which reduces the QoE of ordinary services.
B. The model is solved using reinforcement learning.
The reinforcement learning algorithm solves the Bellman optimality equation by asynchronous iteration to obtain the optimal strategy; when the controller cannot obtain the state transition probabilities of the environment, the following action value function is iteratively updated to obtain an approximately optimal strategy.
In the first case,

and in the second case,
where β_n denotes the learning rate at the nth decision step and γ denotes a discount factor, 0 < γ < 1; β_n is a small positive value, β_n < 1, which takes an appropriate value and decreases continuously during the learning process to avoid non-convergence.
If each state-action pair in the environment can be visited infinitely many times, then when the action value function converges, the optimal action value function is obtained, where π represents the strategy.
The optimal strategy π* can be obtained from the optimal action value function; it represents the probability of taking a certain action to accomplish the goal, as shown in the following formula.

The controller selects the action to take in the current system state with an ε-greedy exploration strategy: with probability 1 - ε it selects the action with the maximum action value, and with probability ε it selects an action at random from the remaining actions, each with probability ε/|a(s)|. DCM is used to gradually attenuate β_n and ε_n so that the algorithm finally converges.
Further, a hierarchical architecture is used for the joint adaptive optimal allocation of communication and computing resources in edge computing: the upper layer is an SMDP-based decision mechanism for accepting or rejecting service requests, the lower layer is an MDP-based spectrum resource reallocation mechanism, and hierarchical reinforcement learning is used to solve it.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention firstly provides an adaptive resource allocation model based on the SMDP, adds a reservation strategy and a resource secondary allocation mechanism on the basis, and solves the problem by using the reinforcement learning based on the model. Compared with a greedy algorithm, the model-based reinforcement learning algorithm can obtain an adaptive resource allocation strategy. And the system performance is improved by introducing a secondary distribution mechanism, the rejection rate of service requests is reduced, and more system benefits are obtained. The reservation policy also improves QoE of important service users.
Detailed Description
The invention is described in further detail below with reference to the figures and the detailed description.
The present invention uses Virtual Units (VUs) to represent the smallest unit of resources in an overall vehicle cloud system, including the computational and storage resources required to process vehicle service requests in the vehicle cloud system.
Fig. 1 shows a vehicle cloud system with a resource reservation policy and a resource secondary allocation mechanism. When the service request arrives and the system does not have available VUs for distribution, the system can select part of services from the cloud at the moment, release a small amount of resources to distribute to the newly arrived service request and meet the requirement of the newly arrived service. Sometimes, when a large amount of free resources exist in the cloud, the resource occupation amount of part of services in the system is increased, and the QoE is improved. And when the system reserves a large number of VUs to guarantee important service requests, subsequent ordinary service requests are rejected due to lack of available VUs. If there are too few VUs reserved, it is difficult to meet the QoE of important service requests. Both of these situations reduce the long term yield of the overall system. Therefore, how to improve the long-term overall yield of the system, the QoS of the vehicle cloud system and the QoE of the vehicle users by adaptively adjusting the number of reserved VUs and the resource occupancy of the services running in the cloud according to the system environment is a major issue of the research herein.
Reinforcement learning concerns how an agent acts in a dynamic environment to maximize the long-term average cumulative reward defined by a goal. Reinforcement learning is an MDP process that aims to obtain an optimal strategy mapping the current environment state to the actions that can be taken. The agent does not need a complete dynamic model of the environment, which avoids the strong assumptions required by traditional methods, assumptions that are often inaccurate in real environments.
The VU allocation task is matched to the agent-environment framework: the agent is regarded as the controller in the vehicle cloud system; the states are service arrival, service departure and the VU resource allocation situation; the actions are to accept, reject, or assign different numbers of VUs. By exploring the environment, the controller continuously interacts with it and dynamically improves the resource allocation strategy, finally obtaining the optimal strategy for VU resource allocation. Therefore, the resource allocation problem of the Internet of Vehicles system is modeled as a Semi-Markov Decision Process (SMDP), and a resource reservation strategy and a resource secondary allocation mechanism are introduced.
The system is set to have M VUs and two kinds of service requests: important service requests and ordinary service requests. The number of VUs that can be allocated to a service request is l, where l ∈ {1, 2, ..., L} and L ≤ M. Assuming that the arrivals of important and ordinary requests follow Poisson processes with average rates λ_p and λ_q respectively, the processing time of a request follows an exponential distribution whose average departure rate λ_l is a function of the number l of VUs allocated to the service, λ_l = l + 1; 1/λ_l denotes the average processing time of a service to which l VUs are allocated.
To satisfy higher-priority important service requests as far as possible, part of the resources is reserved according to the reservation ratio Th and dedicated to important service requests. The reservation ratio is the ratio of the reserved resources (M·Th)/(1+Th) to the remaining resources M/(1+Th). Because service arrivals in the vehicle cloud environment change dynamically, the number of resources to reserve is difficult to predict when executing the important-request protection strategy: when too many resources are reserved, VU utilization drops and the rejection rate of newly arrived ordinary service requests rises; when too few are reserved, important service requests cannot be protected. The ratio Th is therefore adaptively adjusted according to changes in the environment.
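The split given by the reservation ratio Th can be computed directly; a small helper following the stated formulas (M·Th)/(1+Th) and M/(1+Th), with illustrative values:

```python
def reserved_split(M, Th):
    # Reserved resources (M * Th) / (1 + Th) are dedicated to important
    # requests; the remaining M / (1 + Th) are shared by all requests.
    reserved = M * Th / (1 + Th)
    remaining = M / (1 + Th)
    return reserved, remaining

res, rem = reserved_split(M=100, Th=0.25)  # e.g. 20 reserved, 80 shared
```

Note the two parts always sum to M, so adapting Th only shifts the boundary, never the total capacity.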
The system state S is described by the event of the vehicle service request (arrival or departure) and by the number of VUs occupied by the different kinds of services in the system. s_pi and s_qj respectively denote the number of important service requests occupying i VUs and the number of ordinary service requests occupying j VUs in the system, where i ∈ {n_p, n_p+1, ..., L_p}, j ∈ {n_q, n_q+1, ..., L_q}, and n_p and n_q denote the minimum number of VUs assigned to important and ordinary service requests, respectively. Two events e are defined:
1) Arrival of important and ordinary vehicle service requests, e_ar, denoted respectively by the arrival events of the two request types.
2) Departure of a service request, e_d, denoted respectively by the departure of an important service occupying i VUs and of an ordinary service occupying j VUs.
Thus e ∈ {e_ar, e_d} denotes the total event set.
At a fixed time interval τ_int, if no event occurs, secondary resource allocation is performed on the services in the cloud according to the resource usage in the cloud, increasing their resource occupancy; timeout = 0 indicates that an event occurred within the fixed time interval, and timeout = 1 otherwise. The system state is represented as follows:

s_ar = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_ar>} (1)

s_d = {s | s = <..., s_p(i-1), s_pi, ..., s_qj, s_q(j+1), ..., e_d, timeout>} (2)

S ∈ {s_ar, s_d}, where s_ar denotes the state when a service arrival event occurs and s_d the state when a service departure event occurs.
After receiving a vehicle service request, if the system decides to process the request immediately, it assigns l VUs to the request. The action corresponding to accepting an important service request is defined for i ∈ {n_p, n_p+1, ..., L_p}, and the action corresponding to accepting an ordinary service request for j ∈ {n_q, n_q+1, ..., L_q}, where the respective system states at the arrival of important and ordinary service requests are as defined above.
When the system encounters a special state, if it chooses to accept the service request, it executes the contraction special action of the resource secondary allocation mechanism: to satisfy the minimum VU requirement of the current service request, one or more services are selected from the running services according to the system state, and part of their resources are released. One action releases l VUs from a running important service occupying i VUs, and another releases l VUs from a running ordinary service occupying j VUs, while simultaneously the new service request is accepted. If a vehicle service request is refused based on the system state and the long-term revenue of the system, it is not assigned any VUs, and the corresponding action is a(s_ar) = 0.

When a vehicle service request in the cloud leaves, the occupied VUs are released, and the corresponding action is a(s_d) = -1.
When some special states are encountered, thresholds on the amount of idle resources determine when important and ordinary services may execute special actions. If the system chooses to secondarily allocate the resources occupied by running services, it adopts the expansion special action of the resource secondary allocation mechanism: one action adds l VUs to a running important service occupying i VUs, and another adds l VUs to a running ordinary service occupying j VUs.
The set of actions for the adaptive VU assignment model is:
the revenue of the whole system is considered as z (s, a), and includes three categories of revenue, cost and extra cost:
z(s,a)=x(s,a)-y(s,a)-ext(s,a),e∈{e ar ,e d } (6)
where x (s, a) refers to revenue obtained from vehicle users when satisfying a vehicle service request, y (s, a) refers to system cost generated by evaluating the number of VUs used by the service, and ext (s, a) refers to additional cost incurred by the resource secondary allocation mechanism.
The system revenue x (s, a) generated by evaluating the number of VUs used by a service should take into account the following factors: the vehicle user pays for using the cloud resources; individual rewards obtained by immediately processing important service requests; the service request occupies the cost of the VU; quantified user QoE and system QoS (due to immediate processing of requests) improvement; the reward caused by the resource expansion of the running service is proportional to the importance of the service and inversely proportional to the amount of resources occupied by the running service. Thus, revenue is expressed as:
where R is the reward obtained by immediately processing an important service request; I_v is the profit evaluated through the change in the user's QoE and the system's QoS (if the request is processed immediately, I_v takes a fixed value according to the increase in QoE and QoS); r_v is the payment made by the user for each VU the system allocates; two further cost terms represent the costs of rejecting important and ordinary services, with corresponding weight factors for the two service kinds; another term adds a basic reward for allocating l resources; and r_Th is the converted reward obtained when the system reserves part of the resources for important service requests, improving the QoE of the important services.
The system cost is represented by:
y(s,a)=t(s,a)h(s,a),a∈a(s) (8)
where t(s, a) represents the average expected time from the moment the system makes decision a(s) in the current state s until the next state, and h(s, a) represents the system service loss over the average expected time t(s, a), expressed as:

where c_v denotes the cost of occupying one VU.
In addition, since resource expansion of a running service has a certain influence on the future long-term yield of the system, it brings some loss; likewise, reducing the resources occupied by a service in the cloud causes a loss that is proportional to the importance of the service and inversely proportional to the amount of resources the service occupies. When the reservation policy is executed and a special state is encountered, the acceptance probability of a newly arrived ordinary service request is reduced, which lowers its QoE; this quantified cost is proportional to the number of reserved resources. Thus, the additional cost can be expressed as:
where one term denotes the cost of reducing a service by l VUs; weight factors represent important and ordinary services, respectively; and c_Th represents the cost of implementing the reservation policy, which reduces the QoE of ordinary services.
The model is solved using reinforcement learning, where Model(s, a) represents the MDP process of the environment: a statistical estimate of the environment's state transition probabilities. Model(s, a) is updated with real experience obtained from the environment. A tabular model is used in this chapter; every time the controller obtains a real experience, it puts the 4-tuple <S_t, A_t, S_{t+1}, R_{t+1}> into Model(s, a).
The model-based reinforcement learning process is illustrated in Fig. 2. The agent learns online and continuously interacts with the environment; the actual experience obtained is used both to perform reinforcement learning directly, improving the action value function and the strategy, and to improve the model so that it more accurately matches the current environment. Meanwhile, the agent interacts with the model and applies reinforcement learning to the simulated experience the model generates, learning indirectly to improve the action value function and strategy and to accelerate the learning process. The direct and indirect learning processes run in parallel. By learning from the model, the agent gains a deeper understanding of the environment rather than being limited to maximizing system rewards, and acquires a certain reasoning ability.
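The parallel direct and indirect learning described above is the Dyna-style scheme; a minimal tabular sketch, with illustrative states, rewards, and hyperparameters:

```python
import random

def dyna_q_step(q, model, s, a, r, s_next, rng,
                beta=0.5, gamma=0.9, planning_steps=5):
    """One Dyna-style step: learn directly from a real experience,
    record it in the tabular model, then plan from simulated experience."""
    def update(state, action, reward, next_state):
        best = max(q[next_state].values()) if q.get(next_state) else 0.0
        q.setdefault(state, {}).setdefault(action, 0.0)
        q[state][action] += beta * (reward + gamma * best - q[state][action])

    update(s, a, r, s_next)          # direct reinforcement learning
    model[(s, a)] = (r, s_next)      # store the 4-tuple in Model(s, a)
    for _ in range(planning_steps):  # indirect learning from the model
        (ps, pa), (pr, pn) = rng.choice(list(model.items()))
        update(ps, pa, pr, pn)

q, model = {}, {}
rng = random.Random(0)
dyna_q_step(q, model, s="s0", a="accept", r=1.0, s_next="s0", rng=rng)
```

Each planning step replays a stored transition, so the value estimate converges faster than with direct experience alone.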
The reinforcement learning algorithm solves the Bellman optimality equation by asynchronous iteration to obtain the optimal strategy; when the controller cannot obtain the state transition probabilities of the environment, the following action value function is iteratively updated to obtain an approximately optimal strategy.
In the first case,

and in the second case,
where β_n denotes the learning rate at the nth decision step and γ denotes a discount factor, 0 < γ < 1; β_n is a small positive value, β_n < 1, which takes an appropriate value and decreases continuously during the learning process to avoid non-convergence.
If each state-action pair in the environment can be visited infinitely many times, then when the action value function converges, the optimal action value function is obtained, where π represents the strategy.
The optimal strategy π* can be obtained from the optimal action value function; it represents the probability of taking a certain action to accomplish the goal, as shown in the following formula.

The controller selects the action to take in the current system state with an ε-greedy exploration strategy: with probability 1 - ε it selects the action with the maximum action value, and with probability ε it selects an action at random from the remaining actions, each with probability ε/|a(s)|. DCM is used to gradually attenuate β_n and ε_n so that the algorithm finally converges. See Algorithm 1 for details.
Communication and computing resource joint allocation theoretical analysis
Fig. 3 shows the offloading of a mobile vehicle service request to an edge server in an edge computing scenario. The method comprises two processes: 1) the mobile device initiating the service request connects to the RSU through the wireless network and transmits data, with spectrum resources (the number of subcarriers) dynamically allocated according to the system state and channel state information; 2) the VU resources required to execute the service are allocated in the RSU, and after execution finishes, the result is returned to the mobile user.
Since all users share the whole wireless spectrum resource, it is divided into a large number of mutually orthogonal subcarriers, and the system allocates different numbers of subcarriers to the various services. Data transmission adopts time-division multiplexing, with each mobile user occupying different time slots for service transmission. The number of subcarriers affects the transmission rate: the total rate is the rate of a single subcarrier multiplied by the total number of subcarriers, so the more subcarriers allocated, the higher the transmission rate. The time-varying characteristic of the wireless channel reduces the reliability of services in transmission and, in severe cases, may interrupt transmission, affecting the QoS of the system and the QoE of users. To compensate for the channel's time-varying characteristic, the system redistributes the number of subcarriers occupied by the various services according to the channel state information.
Considering that there are multiple services in the system, type ∈ {c_1, c_2, ..., c_l, ..., c_L}, where c_l denotes a service type and the priority of the services increases with l; the requirements on quality of service likewise increase with priority.
Assume a given total number of computing-resource VUs in the edge server. Denote by one quantity the number of c_l-type services occupying j VUs, and by another the minimum VU computation requirement of a c_l-type service. The aim of the virtual resource VUs in the edge server is to improve QoS/QoE (by increasing the number of allocated VUs) as far as possible while satisfying the various service requirements, increasing the overall benefit of the system. Therefore, the VU allocation optimization problem is expressed as maximizing the system computational yield:
Assume a given total bandwidth of the wireless spectrum and a subcarrier bandwidth B_subc; the number of subcarriers is the total bandwidth divided by B_subc, rounded down (the symbol ⌊·⌋ means rounding down). All users share the spectrum resources, and the multiple-access technique allows multiple users to share the limited radio spectrum simultaneously without causing severe interference (collision). Denote by a further quantity the number of c_l-type services occupying i subcarriers. According to the Shannon theorem, the channel capacity is
C = W log2(1 + SNR) (19)
where W is the allocated bandwidth expressed via the number of subcarriers, i.e., W = iB, i = 1, 2, ...; SNR denotes the signal-to-noise ratio,

where P_c is the transmit power, σ² the noise power, m(t) the large-scale fading component, and h(t) the small-scale fading component. In the large-scale fading model, the time constant associated with fading variation as the mobile device moves is very large, on the order of seconds or minutes. The small-scale fading propagation model characterizes rapid fluctuations of the received signal strength over short distances or short times. In this subsection, only the short-time time-varying characteristic of the wireless channel is considered; since the mobile device moves only a short distance in a short time, the large-scale fading is approximated as a constant in this chapter.
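The relation of Eq. (19) with W = iB can be checked numerically; SNR here is a linear ratio, and the subcarrier bandwidth and SNR values are illustrative:

```python
import math

def channel_capacity(i, B, snr_linear):
    # Shannon capacity C = W * log2(1 + SNR) with W = i * B: the allocated
    # bandwidth, and hence the rate, scales with the subcarrier count i.
    return i * B * math.log2(1 + snr_linear)

c1 = channel_capacity(i=1, B=15e3, snr_linear=3)  # one 15 kHz subcarrier, SNR = 3
c4 = channel_capacity(i=4, B=15e3, snr_linear=3)  # four subcarriers: 4x the rate
```

This shows directly why allocating more subcarriers raises the transmission rate linearly at a fixed SNR.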
The channel's fast-fading variation within a time interval τ is modeled using a first-order Gauss-Markov process:

h(t) ~ CN(0, 1 - ϵ²) (21)

where ϵ quantifies the channel correlation between two consecutive time intervals; in the Jakes model [26], ϵ = J_0(2π f_d τ), where J_0(·) is the zeroth-order Bessel function of the first kind, f_d = v f_c / c is the maximum Doppler shift, v is the moving speed of the mobile device, f_c is the carrier frequency, and c is the speed of light, c = 3 × 10^8 m/s.
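The first-order Gauss-Markov evolution can be sketched as follows; the correlation ϵ (which the Jakes model would give as J_0(2π f_d τ)) is passed in directly here, and the seed and ϵ value are illustrative:

```python
import random

def gauss_markov_step(h, eps, rng):
    # h(t + tau) = eps * h(t) + w(t), with w ~ CN(0, 1 - eps**2), which keeps
    # the stationary distribution at CN(0, 1). In the Jakes model eps would be
    # J0(2*pi*f_d*tau); here it is supplied directly as a correlation value.
    sigma = ((1.0 - eps ** 2) / 2.0) ** 0.5  # per-component standard deviation
    w = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return eps * h + w

rng = random.Random(42)
h = complex(rng.gauss(0.0, 0.5 ** 0.5), rng.gauss(0.0, 0.5 ** 0.5))  # h(0) ~ CN(0, 1)
trace = [h]
for _ in range(3):
    h = gauss_markov_step(h, eps=0.95, rng=rng)
    trace.append(h)
```

With ϵ close to 1 consecutive samples are highly correlated (a slowly varying channel); with ϵ = 0 each interval's fading is independent.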
Due to the small-scale fading of the channel, the signal-to-noise ratio of the wireless link varies with time, and thus the transmission quality of the service also varies with time. Denote by one quantity the minimum transmission rate required by a c_l-type service and by another the minimum signal-to-noise ratio it requires.
The goal of spectrum resource allocation in edge computing is to dynamically adjust, at the next decision time, the number of subcarriers allocated to the accepted services according to the time-varying characteristic of the channel, so as to meet the delay-jitter requirements of the various services, improve QoS/QoE, and obtain more system benefit. The subcarrier allocation optimization problem is therefore expressed as minimizing the service delay, which translates further into maximizing the service transmission rate:
Whether allocating computing or communication resources, the system needs to reduce the rejection rate of service requests while meeting the various service requirements, accepting as many new services as possible. The optimization problem is therefore expressed as
The aim of the joint allocation of communication and computing resources is to reduce the rejection rate of services, improve QoS/QoE, and maximize the overall long-term benefit of the system while meeting the various service requirements, according to the system state and channel state information. However, these objectives conflict and cannot all be met simultaneously, so trade-offs must be made. The joint optimization objective is expressed as

where the weight factors sum to one; their proportions are adjusted according to the system requirements to achieve the different desired goals.
Because the joint optimization problem is NP-hard, it is difficult to obtain an analytic solution or to solve it directly. The joint optimization problem above is therefore constructed as a hierarchical model, as shown in Fig. 4.
The lower level architecture in the hierarchical model is shown in the upper part of fig. 4. The time is divided according to a small equal time interval tau (millisecond magnitude), a solid black point is regarded as a decision moment of a controller, the number of sub-carriers occupied by various received services is redistributed according to the system state and channel state information, and the process can be regarded as MDP.
The upper-level architecture in the hierarchical model is shown in the middle of FIG. 4. The open circles mark the instants at which events occur; at each event the controller decides whether to admit a service request and, if so, how many VUs to allocate to it. This process can be regarded as an SMDP.
The lower part of FIG. 4 shows the MDP process as the foundation, on which the SMDP decisions are superimposed. Without layering, both types of decision (whether to admit a service request, and how to allocate VUs and subcarriers) would have to be learned under a single rule as one common policy; layering decouples the VU allocation from the subcarrier allocation. The large-scale SMDP decisions of the upper layer thus have their own independent learning process and policy, and can look further ahead on the basis of the lower-layer MDP.
As can be seen, the MDP and SMDP decision instants overlap; in practice they need not occur at exactly the same moments, but when the lower-layer MDP decision interval is made sufficiently small, the two can be regarded as approximately coincident.
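The two-time-scale control loop described above can be sketched as follows. This is a minimal illustration, not the original implementation: the callables `observe`, `upper_policy`, `lower_policy`, `apply_upper`, and `apply_lower` are assumed placeholders for the system interfaces.

```python
def hierarchical_control(observe, upper_policy, lower_policy,
                         apply_upper, apply_lower, steps):
    """Run `steps` intervals of length tau.  At every interval the lower-layer
    MDP reallocates subcarriers; when an event occurs (open circle in FIG. 4),
    the upper-layer SMDP additionally decides admission and VU allocation."""
    total_reward = 0.0
    for _ in range(steps):
        state = observe()                       # system state + channel state
        if state["event"] != 0:                 # e != 0: an SMDP decision point
            total_reward += apply_upper(upper_policy(state))
        # solid black dot: MDP decision point at every interval tau
        total_reward += apply_lower(lower_policy(state))
    return total_reward
```

The upper-layer call is only reached at event instants, which is exactly the sense in which the SMDP decisions are "superimposed" on the per-interval MDP decisions.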
The system state includes usage of subcarriers in the communication resource and usage of VUs in the computing resource. The set of states of the communication resource is represented as
The state set of the computing resource is represented as
Event e includes the arrival and departure of service requests of the various traffic types, with e = 0 denoting that no event occurs. The state set of the communication resource, the state set of the computing resource, and the event are therefore combined into a single state representation Z = {e, X, Y}.
The system action set is A = {A_up, A_down}, consisting of the upper-layer actions A_up ∈ {-2, -1, 0} and the lower-layer actions A_down ∈ {1, 2, ib, jv}. Here a = -2 indicates that the system rejects the service request, a = -1 indicates that the service leaves the system, and a = 0 indicates that the system admits the service request; a = 1 indicates that the system reallocates the number of subcarriers of the spectrum resource according to the time-varying channel state information, in order of service priority from high to low; a = 2 indicates that no reallocation is performed; a = ib indicates that i subcarriers are allocated to the service; and a = jv indicates that j VMs are allocated to the service.
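One possible encoding of this two-level action set is sketched below. The encoding itself (the `Enum` and the tagged tuples) is an assumption for illustration, not part of the original text; only the numeric values mirror the definition above.

```python
from enum import Enum

class UpperAction(Enum):
    """Upper-layer SMDP actions A_up ∈ {-2, -1, 0}."""
    REJECT = -2          # system rejects the service request
    DEPART = -1          # service leaves the system
    ACCEPT = 0           # system admits the service request

# Lower-layer actions A_down ∈ {1, 2, ib, jv}: two of them carry a
# parameter (i subcarriers or j VMs), so a small tagged tuple is used here.
def reallocate():          return ("reallocate", None)    # a = 1
def keep_allocation():     return ("keep", None)          # a = 2
def alloc_subcarriers(i):  return ("subcarriers", i)      # a = ib
def alloc_vms(j):          return ("vms", j)              # a = jv
```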
The revenue obtained by admitting a type-c_l service, the revenue obtained by allocating i subcarriers to a type-c_l service, and the revenue obtained by allocating j VMs to a type-c_l service are denoted by their respective reward terms; exp_comm denotes the unit communication cost, exp_comp denotes the unit computation cost, and a penalty term represents the cost incurred when the system rejects a service request. z* denotes the system state at the instant an event occurs, and the number of time intervals n_T elapsed between two adjacent events can be expressed as

n_T = min{ t > 0 | z_t = z*, z* ∈ Z }    (25)
Thus, the cumulative reward gained by the system performing a reallocation of communication resources, i.e. sub-carriers, between two events is expressed as
The instantaneous profit obtained by the system receiving the service request when the event occurs is expressed as
When no event occurs, the instantaneous benefit obtained by the underlying MDP process is expressed as
To solve the HRL and obtain the optimal policy, this section uses the Q-Learning algorithm. The upper-layer SMDP decision process is denoted SMDP Q-Learning, and its action-value function is updated as
The lower-layer MDP decision process is denoted MDP Q-Learning, and its action-value function is updated as
Since the SMDP and the MDP operate on different time scales, their Q-update step counts n' and n differ. The Q values are updated iteratively until the algorithm converges, finally yielding the optimal policy for the joint optimal allocation of communication and computing resources.
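The two update rules can be sketched as follows, assuming the standard tabular Q-Learning form; since the update equations themselves are not reproduced in the text, the exact discounting of the SMDP update (here γ raised to the n' elapsed intervals) is an assumption. `Q` is a dict of dicts mapping state to action values.

```python
def smdp_q_update(Q, s, a, reward, s_next, n_prime, alpha=0.1, gamma=0.9):
    """Upper-layer SMDP Q-Learning: the discount spans the n' intervals
    elapsed between two adjacent events (assumed standard SMDP form)."""
    target = reward + (gamma ** n_prime) * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def mdp_q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Lower-layer MDP Q-Learning: one-step update every interval tau."""
    target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```

The only structural difference between the two updates is the discount exponent, which is how the differing time scales (n' versus n) enter the learning rules.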
Algorithm 2 describes the whole process of hierarchical reinforcement learning. Compared with a single-layer SMDP process, this section models the system as an HRL model that takes into account the influence of the time-varying channel on service transmission: the spectrum resource, i.e. the number of subcarriers, can be adaptively reallocated so that the delay requirements of the various services are met as far as possible, thereby improving QoS/QoE. The system therefore obtains greater benefit under this framework.
Simulation verification:
Simulation verification is carried out only for the vehicle cloud resources with the resource reservation and resource secondary allocation mechanisms. We evaluate the performance of the SMDP-based resource allocation model and the model-based reinforcement learning algorithm using MATLAB simulation. First, the convergence of the model-based reinforcement learning resource allocation method is verified and compared with the case without the resource secondary allocation mechanism. Because the greedy algorithm ignores the statistical information of the environment and looks only at the short-term maximum income, comparing the reinforcement learning algorithm with this traditional heuristic highlights the ability of reinforcement learning to learn from the environment and verifies the advantage of the resource secondary allocation mechanism; the reservation policy is then discussed. A model-based Dyna algorithm is used here; Algorithm 1 is the k-step backtracking Dyna-Q online learning algorithm used to learn the optimal policy.
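A generic Dyna-Q iteration of the kind referred to as Algorithm 1 can be sketched as below; this is the textbook Dyna-Q form (real update, model learning, then k simulated planning updates), and the details of the k-step backtracking variant used in the original are assumed rather than reproduced.

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, k, alpha=0.1, gamma=0.9):
    """One Dyna-Q iteration: learn from the real transition, store it in
    the model, then replay k simulated transitions from the model."""
    # direct reinforcement learning from the real experience
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
    # model learning: remember the observed transition (deterministic model)
    model[(s, a)] = (r, s_next)
    # planning: k extra updates from previously observed (state, action) pairs
    for _ in range(k):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next].values()) - Q[ps][pa])
```

The k planning updates are what make the algorithm model-based: they let the learner reuse environment statistics instead of relying solely on new real transitions, which is the learning ability the comparison with the greedy algorithm is meant to highlight.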
When a new important service request arrives and no allocable resources remain in the cloud, some common services are selected, their resource occupancy is reduced, and the released VUs are allocated to the new important service; this reduces the rejection rate of important services while not harming the benefit of the downsized services too much. If no event occurs within a given fixed time interval, the resource occupancy of common services in the cloud is increased, improving their QoE. Without loss of generality, the simulation considers only the above resource secondary allocation process. Table 1 lists the simulation parameters for verifying the resource secondary allocation mechanism. Each algorithm runs for T_max = 1×10^5 time units; this is computer simulation time, not actual time. FIG. 5 records the results over a computer simulation time period of 20, with time on the abscissa, so that the change over time, the convergence of the algorithm, and the cumulative average gain can be observed.
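The preemption step of the secondary allocation process described above can be sketched as follows. The data layout (`common_services` mapping a service id to `(current_vus, min_vus)`) and the largest-first selection order are assumptions for illustration, not details given in the original.

```python
def secondary_allocation(free_vus, common_services, n_p):
    """When an important request arrives and fewer than n_p VUs are free,
    shrink common services (never below their minimum occupancy) until
    enough VUs are released.  Returns (admitted, free_vus_after)."""
    # shrink the largest common services first, to spread the loss thinly
    for sid, (vus, min_vus) in sorted(common_services.items(),
                                      key=lambda kv: kv[1][0], reverse=True):
        if free_vus >= n_p:
            break
        release = min(vus - min_vus, n_p - free_vus)   # VUs this service frees
        common_services[sid] = (vus - release, min_vus)
        free_vus += release
    return free_vus >= n_p, free_vus
```

Because each common service is never pushed below its minimum VU count, the mechanism trades only a small amount of common-service benefit for the admission of the important request, as the text describes.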
TABLE 1 simulation parameters
FIG. 5 shows that the system gain obtained with the proposed model and mechanism is much higher than with the greedy algorithm, because the greedy strategy is short-sighted and focuses only on the immediate maximum benefit. The figure also shows that the resource secondary allocation mechanism yields more benefit for the system: the mechanism trades the benefit of a small number of common services for the benefit of more important services, reducing the rejection rate of important service requests, and in some special cases it even improves the QoE of common services. To the left of the dotted line there are more common service requests than important ones, and the algorithm converges quickly. To the right of the dotted line the environment changes so that the arrival rate of important service requests increases; the model-based reinforcement learning algorithm senses this change and adjusts its policy accordingly, obtaining more benefit, whereas the greedy strategy does not respond to the change and cannot obtain more system benefit.
As can be seen from FIG. 6(a), when the arrival rate of common service requests is higher than that of important service requests, the rejection rate of important requests under the greedy algorithm and without the secondary allocation (SAoR) mechanism is higher than that of common requests: cloud resources are limited, the controller admits a large number of common services, and the 2 resources remaining for an important service request cannot meet its requirement, so the important request is rejected. The resource secondary allocation mechanism solves exactly this problem, and the rejection rate of important requests drops significantly. Moreover, because the model-based reinforcement learning algorithm learns from the environment with the goal of maximizing long-term system benefit, it arrives at a compromise policy that does not always allocate the maximum possible resources to a service, giving a lower rejection rate than the greedy algorithm. FIG. 6(b) shows that when the arrival rate of important requests exceeds that of common requests and cloud resources are limited, the overall rejection rate rises, but the model-based reinforcement learning algorithm together with the secondary allocation mechanism still achieves better performance than the greedy strategy.
Next, the influence of the reservation policy on the system is analyzed; the simulation parameters are shown in Table 2.
TABLE 2 simulation parameters
As can be seen from FIG. 7, the system yield obtained with the greedy algorithm is lower than with the reinforcement learning algorithm, because the greedy algorithm increases the rejection rate of subsequent service requests, reducing QoS and QoE and hence the system yield. FIG. 7(a) shows that the cumulative average reward decreases as the number of VUs reserved for important services increases: since the system receives more common service requests than important ones, increasing the number of reserved VUs reduces the resources available to common services, and although the QoE of important services improves, the admission rate of common services drops sharply and the cumulative benefit of the system falls. FIG. 7(b) shows that the cumulative average reward first increases and then decreases as the number of reserved VUs grows. When the number of important requests in the system increases, reserving more VUs improves the QoE of important services and lowers their rejection rate; since admitting an important request yields more revenue than admitting a common one, the reward rises. Once the number of reserved VUs exceeds a certain value, however, the rejection rate of common services rises sharply and their QoE falls, and the revenue from important services can no longer compensate for this loss, so the overall cumulative average reward decreases. Dynamically adjusting the number of reserved VUs according to environmental changes therefore improves the QoE of important services without sacrificing too much QoE of common services, and increases the long-term benefit of the system.
Simulation results show that, compared with the greedy algorithm, the model-based reinforcement learning algorithm obtains an adaptive resource allocation strategy. The introduction of the secondary allocation mechanism further improves system performance, reducing the rejection rate of service requests and yielding more system benefit. The reservation strategy also improves the QoE of important-service users, provided the reservation ratio is dynamically adjusted as the environment changes.