CN113158544A

CN113158544A - Edge pre-caching strategy based on federal learning under vehicle-mounted content center network

Info

Publication number: CN113158544A
Application number: CN202110149492.0A
Authority: CN
Inventors: 姚琳; 李兆洋; 吴国伟
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-02-03
Filing date: 2021-02-03
Publication date: 2021-07-23
Anticipated expiration: 2041-02-03
Also published as: CN113158544B

Abstract

The invention belongs to the technical field of vehicle-mounted content center networks, and provides an edge pre-caching strategy based on federal learning in a vehicle-mounted content center network. The RSU takes the historical moving path of the vehicle and the content possibly requested as the basis, models the state and the action taken, then solves the optimal content distribution mode by utilizing deep reinforcement learning, and stores the required content to the corresponding RSU in advance, thereby reducing the time delay required by the vehicle to acquire the content from the RSU. Each RSU self-trains the model according to the locally collected data, then the models obtained by training each RSU are summarized by using federal learning, the models are weighted and averaged according to the data quantity, and the summarized models are uniformly distributed to each RSU. And finally, the priority of the repeated content in cache replacement is reduced according to the cache list of the neighbor node, so that the cache redundancy is reduced.

Description

Edge pre-caching strategy based on federal learning under vehicle-mounted content center network

Technical Field

The invention relates to an edge pre-caching strategy based on federal learning under a vehicle-mounted content center network, and belongs to the technical field of vehicle-mounted content center networks.

Background

Vehicular ad-hoc networks (VANET) are a special type of mobile ad-hoc network that contains several fixed infrastructures and vehicles. In VANET, each vehicle may communicate with other vehicles or fixed roadside base units. In the past decades, VANET has become a content sharing platform of unrelated origin, i.e. VANET is more concerned about the content itself than the actual carrier of the content. Content-oriented applications cover different areas such as entertainment, sports, shopping, etc. In order to meet the content-oriented characteristics of the VANET, a new network structure, namely a content-centric networking (CCN), is provided. Unlike IP networks, content names are basic elements in CCNs, which are characterized by the basic exchange of content request packets (called Interest) and content response packets (called Data). The intra-network caching of the CCN facilitates efficient distribution of streaming content under the mobility and intermittent connectivity of vehicles, resulting in a content-centric vehicle network (VCCN). VCCN can achieve better network performance under security applications, traffic applications and content applications (such as file sharing and commercial advertising).

Similar to the vehicular network, the VCCN mainly includes two types of nodes, namely mobile nodes such as vehicles, also called obus (on Board Unit), and roadside fixed infrastructure (RSU). The nodes have the functions of forwarding the interest packets and caching the content, and the RSU serves as an edge node and bears the functions of receiving the request from the mobile node and requesting the content from the cloud data source, so that the reasonable configuration of the caching strategy of the RSU plays a vital role in improving the content obtaining efficiency of the user. For the in-vehicle network edge cache, the operation environment is very complex, and the local content popularity near the mobile node is influenced by various factors. In particular, user preferences in terms of content are influenced by user context (e.g., location, personal characteristics, and device diversity) in complex patterns. Furthermore, the edge nodes selected to satisfy a particular user request are subject to complex effects of network conditions (e.g., network topology, cooperation between wireless channels and BSs). Due to the natural dynamics of wireless networks, the caching environment of vehicular networks may change over time. The edge nodes should have the intelligence to learn new states and new actions and match them in order to take optimal or near optimal actions. By performing feedback of its behavior, the intelligence of its behavior is known. The intelligent caching strategy should then be able to accept feedback and thus be able to adapt to dynamic changes in the operating environment.

Disclosure of Invention

In order to effectively improve the performance of an edge cache system under a vehicle-mounted content center network, the invention provides an edge pre-cache strategy based on federal learning. The RSU takes the historical moving path of the vehicle and the content possibly requested as the basis, models the state and the action taken, then utilizes deep reinforcement learning to solve the optimal content distribution mode, and stores the required content to the corresponding RSU in advance, thereby reducing the time delay required by the vehicle to acquire the content from the RSU. Each RSU self-trains the model according to the data collected locally, then the models obtained by training each RSU are summarized by using federal learning, the models are weighted and averaged according to the data quantity, and the summarized models are uniformly distributed to the RSUs. And finally, the priority of the repeated content in cache replacement is reduced according to the cache list of the neighbor node, so that the cache redundancy is reduced.

The technical scheme of the invention is as follows:

an edge pre-caching strategy based on federal learning under a vehicle-mounted content center network comprises the following steps:

(1) firstly, data of content requests and corresponding vehicle movement information are collected in a dynamic environment of a vehicle-mounted network, a deep reinforcement learning intelligent agent deployed on an RSU is trained, and a decision which is most beneficial to reducing request time delay is made under a given condition. The training process of the DRL agent first needs to define a state space (state space), an action space (action space), and a reward function (reward function):

(1.1) the state space is mainly composed of two parts, one is the moving state of the vehicle and the other is the request probability of the content. The moving state of the vehicle mainly comprises the current position of the current vehicle and a position which can be reached after a time slice. The current position is easy to obtain, but the possible arriving position cannot be accurately predicted, so that the Markov chain is adopted to predict the possible arriving position of the vehicle according to the historical path of the vehicle, and the prediction result is used as a component of a state space. The request probability of the content is also divided into two types, one is the popularity of the content, and the other is the next possible requested content predicted based on the content currently requested by the vehicle.

(1.2) in order to avoid the motion space from being too expansive, the DRL agent is limited to select only one content to be stored in the cache at a time, and the selection is repeated for multiple times to achieve the purpose of storing the contents with high priority in the cache. To further improve efficiency, the range of selectable content is further narrowed according to the content popularity, and only content with popularity higher than a threshold can be used as a pre-cached object.

(1.3) representing the working efficiency of the DRL agent by using the cache hit rate, wherein in order to take short-term benefits and long-term benefits into consideration, a return function is represented by an exponential weighted average hit rate:

wherein r is_iRepresents the hit rate of the ith time slice from the current time, w epsilon (0,1) is an exponential weighting factor, and the larger w is, the less profit is gained by the decay of the reward function along with the time.

(2) After the state space, the action space and the reward function are defined, a deep learning framework of the intelligent agent can be constructed and trained. The deep reinforcement learning framework adopted by the patent consists of the following parts:

(2.1) operator network is defined by a parameter θ^μIs a mapping from state space to action space. Given the state of a state space, the operator network can calculate the original motion in a corresponding motion space according to the parameters of the operator network

As an output.

(2.2) the generation of the original motion can effectively reduce the computational complexity brought by a large-scale motion space, but the dimension of the motion space is reduced, and simultaneously, the inaccuracy of a decision result is easily caused. The method of K-Nearest Neighbor (KNN) is therefore used to extend the generated actions into a set of actions, i.e. a set of valid actions in an action space, each element of which may be the action to be performed.

(2.3) to avoid the action of selecting the low Q value, a critic network is defined to limit the output of the operator network, and the parameters of the operator network are updated. The Q value for each action is calculated for the critic network as follows:

wherein s is_tIndicates the state at time t, a_tIndicating the action taken at time t, θ^QAnd theta^μRepresenting the parameters of the critical network and the actor network respectively,

indicates the expectation of the value in parentheses under the conditions of the environment E, r(s)_t,a_t) Is shown in state s_tTake action a_tThe return is brought in gamma e (0,1)]Weight decay factor, μ(s), for the cumulative return in the future_t+1|θ^μ) The operation obtained based on the operator network and the state at time t +1 is shown. For each possible action in the action set generated in the previous step, the Critic network calculates a corresponding Q value according to the current state and the next state, wherein the action that takes the maximum value will be selected as the execution action.

And then randomly selecting state transition records in N playback pools, and updating the criticic network by a minimum loss function, wherein the definition of a loss function L is as follows:

wherein y is_i＝r_i+γQ′(s_i+1,μ′(s_i+1|θ^μ′)|θ^Q′) I represents the selected ith record,q 'and μ' represent critic and operator networks before the corresponding state transition of this record occurs.

Updating the parameters of the actor network by using the sampling strategy gradient, wherein the sampling strategy gradient is calculated as follows:

i.e. the parameter theta to the operator network according to the chain rule^μCalculating a gradient in which

For critic network to state s_iAction taken under conditions a ═ μ(s)_i) The gradient is calculated and the gradient is calculated,

for the operator network to the parameter theta^μAnd (5) calculating a gradient.

(3) The training of the deep reinforcement learning agent needs a large amount of data as a training set, the data are usually collected from different RSUs, if the data are to be uploaded to a central node, such as a specific RSU or a remote server, on one hand, a situation that bandwidth is occupied in a large amount is caused, on the other hand, the computing performance of the single node becomes a bottleneck limit, and meanwhile, the computing resources of a large number of edge nodes are not utilized effectively. Therefore, the method adopts a federal learning framework, each RSU collects data locally and trains a given network, and then model parameters are uploaded to a remote server periodically. And the remote server performs federal averaging to obtain updated model parameters and issues the updated model parameters to each RSU again. The procedure for federal learning is as follows:

(3.1) firstly, the remote server initializes a model of the deep reinforcement learning intelligent agent and endows random parameter initial values for the current actor network and the critic network. The remote server then distributes this model to the various RSUs within the region.

And (3.2) the RSU starts to train the model after receiving the model, the training process is as described in the step (2), if historical data is available, the historical data can be processed and used for training the model, and meanwhile, new data obtained in the operation of the system after receiving the model can further update the model.

(3.3) after a period of training, each RSU transmits the model trained by itself back to a remote server, and the remote server performs federated averaging (fed averaging), and considering that different RSUs are located at different positions, the traffic flow is also calculated in the following specific manner:

wherein theta is_t+1Representing the network parameters after one iteration, K is the total number of RSUs participating in the federal learning, n is the total number of requests received by each RSU during the single training period of the iteration, n_kThe number of requests received for the kth RSU,

representing the parameters after the k RSU training. The whole process is circularly carried out until the model parameters are kept stable.

And (3.4) the remote server redistributes the trained model to each RSU, and each RSU guides the cache operation by using a uniform intelligent body.

(4) It is mentioned in step (1) that the DRL agent selects only one content at a time for pre-caching, and then pre-caches a number of possible contents by repeating many times. Thus, in effect, one pre-cached content corresponds to the Q value of one action. On this basis, in order to reduce the space waste caused by storing the same content by a plurality of adjacent RSUs, each RSU exchanges its own cache list with the adjacent RSU when calculating the Q value of each action, and if one content exists in a plurality of adjacent RSUs, the priority of the action is additionally reduced, specifically, the calculation method is as follows:

wherein n is_dFor storing the content in the adjacent RSUThe number of such cells. And the RSU reorders all the contents according to the adjusted Q value, and then sequentially pre-caches the contents meeting the conditions.

The invention has the beneficial effects that: for a mobile network in a vehicle, an operation environment is very complicated, and local content popularity in the vicinity of a mobile node is affected by various factors. The deep reinforcement learning can model a complex operation environment, the cache environment is represented through mobile prediction and user request content prediction, and an optimal pre-cache selection result is obtained through training of a large amount of data.

Since the RSUs are located in different regions, the user density and the number of requests are different, generally, the larger the training set is, the more accurate the obtained model is, but if the RSUs upload all the training data to a specific RSU or a remote server, the data transmission will occupy a large amount of bandwidth resources, and meanwhile, the single-point performance bottleneck will restrict the training efficiency of the whole model. The federal study can effectively solve the problems, the bandwidth occupation is reduced by a mode of transmitting model parameters, and simultaneously, the calculation resources of the RSU are fully utilized to carry out model training, so that the bottleneck of single-point performance is avoided.

Finally, the priority of the repeated content is additionally reduced according to the cache list adjacent to the RSU, so that space waste caused by redundant cache can be effectively reduced, and the cache efficiency is improved.

Drawings

Fig. 1 is an organizational chart of a pre-caching strategy according to the present invention.

FIG. 2 is a flowchart of deep reinforcement learning modeling according to the present invention.

FIG. 3 is a flow chart of deep reinforcement learning agent training according to the present invention.

Fig. 4 is a flow chart of federal learning in accordance with the present invention.

Fig. 5 is a flowchart of the RSU pre-caching according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by examples and drawings.

The method comprises the steps of carrying out deep reinforcement learning to model an edge cache environment, integrating trained model parameters by using a federal learning framework, and carrying out pre-caching by an RSU (remote subscriber Unit) through a local intelligent agent.

Referring to fig. 2, the specific implementation process of modeling the edge cache environment required for deep reinforcement learning is as follows:

step 1, counting historical moving paths of each vehicle in a preheating stage RSU.

And 2, respectively establishing a Markov chain-based movement prediction model according to the historical movement path of each vehicle.

And 3, uploading the position of each vehicle periodically by each vehicle, and enabling the RSU to obtain the current position l of each vehicle^tAs part of the state space.

And 4, substituting the positions of the two latest time slices of the vehicle into the movement prediction model by the RSU to calculate the most probable position l of the next time slice of each vehicle^t+1Also as part of the state space.

Step 5, based on the assumption that the user will request in sequence when accessing the video stream data, the content that the user may access can be calculated, for example, the content accessed at time t is c_iThen the content that may be requested at time t +1 may be calculated as follows:

c^(t+1)＝c_i+Δi

Δi＝Δt/d_c

where Δ t denotes the duration of a time slice, d_cIndicating the time of average play of a content.

And 6, the RSU calculates the popularity of the content according to the following mode:

wherein λ ∈ [0,1 ]]To characterize the weight of the number of historical requests relative to the number of recent requests, n, for a decay factor^tIndicating the number of requests for the content over a period of time t. The popularity of content also serves as a component element of the state space.

And 7, screening the content according to the content popularity, wherein only the popularity is higher than a threshold value rho_tCan the content of (a) be served as a pre-cached object.

And 8, limiting the DRL agent to select one content to be stored in the cache at one time, and repeating the selection for multiple times to store the high-priority content in the cache. The motion space of a single operation is all the contents screened in step 7:

where N represents the number of content for which popularity reaches a threshold.

And 9, representing the working efficiency of the DRL agent by using the cache hit rate, wherein in order to take short-term income and long-term income into consideration, the return function is represented by an exponential weighted average hit rate:

wherein r is_iRepresenting the hit rate of the ith time slice from the current start, w ∈ (0,1) is an exponential weighting factor, and the larger w is, the less profit is gained by the decay of the reward function along with time.

Referring to fig. 3, the specific training process of the deep reinforcement learning agent on the RSU is as follows:

step 10, initializing operator network mu (s | theta)^μ) Criticc network Q (s, a | θ)^Q) With the parameters respectively being theta^μAnd theta^Q(ii) a Simultaneously initializing the target networks mu 'and Q', the initialization parameter theta of which^μ′←θ^μ，θ^Q′←θ^Q. The experienced playback set R is initialized.

Step 11. select an original action based on the state at time t

Step 12, selecting the nearest k effective actions by using a KNN algorithm to be recorded as

Step 13, selecting the action with the maximum Q value according to the current strategy

Executing, observing and reporting r_tAnd a new state s_t+1And transfer the state(s)_t,a_t,r_t,s_t+1) The empirical playback set R is logged.

Step 14. select a certain size of state transition record(s) from the empirical playback set R_i,a_i,r_i,s_i+1). Setting y_i＝r_i+γQ′(s_i+1,μ′(s_i+1|θ^μ′)|θ^Q′)。

And step 15, updating the critic network through a minimization loss function:

updating the actor network with the parameter gradient:

step 16, updating the target network:

θ^Q′←τθ^Q+(1-τ)θ^Q′

θ^μ′←τθ^μ+(1-τ)θ^μ′

wherein τ < 1 is the update coefficient.

Steps 11 to 16 are model updates performed for a time slice, each time slice being repeated one time in a loop.

Referring to fig. 4, the specific flow of the federal learning based on deep reinforcement learning is as follows:

and step 17, the remote server initializes a model of the deep reinforcement learning agent and endows random parameter initial values for the current actor network and the critic network.

The remote server distributes the model to the RSUs within the region, step 18.

And 19, performing online training on the deep reinforcement learning agent by the RSU based on the steps 10 to 16.

And 20, after the RSUs are trained for a period of time, each RSU transmits the model trained by the RSU back to the remote server, and the remote server performs federated averaging (fed averaging), wherein the specific calculation mode is shown in the technical scheme (3.3).

And step 21, the remote server redistributes the trained model to each RSU, and each RSU guides the cache operation by using a uniform intelligent agent. Steps 19 to 21 are then repeated until the model converges.

Referring to fig. 5, the specific process of the RSU performing the pre-caching is as follows:

and step 22, periodically exchanging respective cache lists adjacent to the RSU.

And 23, after the model training is finished and each time slice begins, the RSU collects environment information and constructs corresponding states, including the moving state of the vehicle and the request probability of the content.

And 24, selecting an effective action set according to the description of the steps 11-12.

And 25, firstly, eliminating the contents of which the popularity does not reach the threshold value from the contents corresponding to each action in the set.

Step 26, if the collection has optional contents, executing steps 27-29; otherwise, ending the pre-caching operation of the current time slice.

Step 27. when each RSU calculates Q value of each action by using critic network, if one content exists in a plurality of adjacent RSUs, additional content is addedThe priority of the action is reduced, and the specific calculation mode is as follows:

wherein n is_dIs the number of such contents present in the adjacent RSU. And the RSU reorders all contents according to the adjusted Q value and then pre-caches the contents with the highest Q value.

And step 28, if the cache is full, selecting content to be discarded by adopting an LRU cache replacement strategy, and then placing the pre-cached content into the cache.

Step 29, if the amount of the pre-cached content reaches 3/5 of the cache space, ending the pre-caching operation of the current time slice; otherwise, go back to step 24 to repeat the above operations.

Claims

1. An edge pre-caching strategy based on federal learning under a vehicle-mounted content center network is characterized by comprising the following steps:

(1) firstly, acquiring data of content requests and corresponding vehicle movement information in a dynamic environment of a vehicle-mounted network, training a deep reinforcement learning agent deployed on an RSU, and making a decision which is most beneficial to reducing request delay under a given condition; the training process of the DRL agent first needs to define the state space, action space and reward function:

(1.1) the state space is mainly composed of two parts, one part is the moving state of the vehicle and the other part is the request probability of the content; the moving state of the vehicle comprises the current position of the current vehicle and a position which can be reached after a time slice; the current position is easy to obtain, but the possible arriving position cannot be accurately predicted, so that the Markov chain is adopted to predict the possible arriving position of the vehicle according to the historical path of the vehicle, and the prediction result is used as a component of a state space; the request probability of the content is also divided into two types, one is the popularity of the content, and the other is the content which is predicted to be requested next possibly based on the content currently requested by the vehicle;

(1.2) in order to avoid the excessive expansion of the action space, the DRL agent is limited to select one content to be stored in the cache at one time, and the selection is repeated for multiple times to store the high-priority content in the cache; in order to further improve the efficiency, the range of the selectable content is further reduced according to the popularity of the content, and only the content with the popularity higher than a threshold value can be used as a pre-cached object;

wherein r is_iThe hit rate of the ith time slice from the current time is represented, w belongs to (0,1) as an exponential weighting factor, and the larger w is, the less profit of the decay of the return function along with the time is;

(2) after a state space, an action space and a return function are defined, a deep learning framework of the intelligent agent can be constructed and trained; the deep reinforcement learning framework adopted by the method consists of the following parts:

(2.1) operator network is defined by a parameter θ^μIs a mapping from state space to action space; given the state of a state space, the operator network calculates the original action in a corresponding action space according to the parameters of the operator network

As an output;

(2.2) expanding the generated action into a group of actions by adopting a K-nearest neighbor method, namely, a set of effective actions in an action space, wherein each element of the set of effective actions is possibly used as the action to be executed;

(2.3) in order to avoid the action of selecting the low Q value, a critic network is defined to limit the output of the operator network, and the parameters of the operator network are updated; the deterministic target strategy is as follows:

indicates the expectation of the value in parentheses under the conditions of the environment E, r(s)_t，a_t) Is shown in state s_tTake action a_tThe return is brought in gamma e (0,1)]Weight decay factor, μ(s), for the cumulative return in the future_t+1|θ^μ) Represents an action based on the operator network and the state at time t + 1; for each possible action in the action set generated in the previous step, the criticc network calculates a corresponding Q value according to the current state and the next state, wherein the action which obtains the maximum value is selected as the execution action;

the criticc network is then updated by minimizing a loss function defined as:

wherein y is_i＝r_i+γQ′(s_i+1，μ′(s_i+1|θ^μ′)|θ^Q′) I represents the selected ith record, and Q 'and mu' represent critic and operator networks before the state transition corresponding to the record occurs;

updating the parameters of the actor network by using the gradient of the sampling strategy:

for operator network to parameter theta^μCalculating a gradient;

(3) the method adopts a federal learning framework, each RSU respectively collects data locally and trains a given network, and then model parameters are uploaded to a remote server periodically; the remote server carries out federal averaging to obtain updated model parameters and sends the updated model parameters to each RSU again; the process of federal learning is as follows:

(3.1) firstly, initializing a model of a deep reinforcement learning agent by a remote server, and endowing random parameter initial values for a current actor network and a critic network; the remote server then distributes the model to the RSUs in the region;

(3.2) the RSU starts to train the model after receiving the model, the training process is the same as the step (2), if historical data available for adoption exist, the processed historical data are used for training the model, and meanwhile, new data obtained in the operation of the system after receiving the model further update the model;

(3.3) after a period of training, each RSU transmits the model trained by the RSU back to a remote server, and the remote server performs federal averaging, wherein the traffic flow is calculated in the following specific way by considering that the positions of different RSUs are different:

wherein, theta_t+1Representing the network parameters after one iteration, K is the total number of RSUs participating in the federal learning, n is the total number of requests received by each RSU during the single training period of the current iteration, n_kThe number of requests received for the kth RSU,

representing the parameters after the k RSU training. The whole process is circularly carried out until the model parameters are stable;

(3.4) the remote server redistributes the trained model to each RSU, and each RSU guides cache operation by a uniform agent;

(4) in the step (1), the DRL agent only selects one content for pre-caching at one time, and then pre-caches a plurality of possible contents by repeating for a plurality of times; thus, in fact, one pre-cached content corresponds to the Q value of one action; on this basis, in order to reduce the space waste caused by storing the same content by a plurality of adjacent RSUs, each RSU exchanges its own cache list with the adjacent RSU when calculating the Q value of each action, and if one content exists in a plurality of adjacent RSUs, the priority of the action is additionally reduced, specifically, the calculation method is as follows:

wherein n is_dThe number of the content existing in the adjacent RSU; and the RSU reorders all the contents according to the adjusted Q value, and then sequentially pre-caches the contents meeting the conditions.