CN113158544B - Edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network - Google Patents


Info

Publication number
CN113158544B
Authority
CN
China
Prior art keywords
content
rsu
action
network
vehicle
Prior art date
Legal status
Active
Application number
CN202110149492.0A
Other languages
Chinese (zh)
Other versions
CN113158544A (en)
Inventor
姚琳
李兆洋
吴国伟
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202110149492.0A
Publication of CN113158544A
Application granted
Publication of CN113158544B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention belongs to the technical field of vehicle-mounted content-centric networks and provides an edge pre-caching strategy based on federated learning for such networks. Taking each vehicle's historical movement path and the content it is likely to request as the basis, the RSU models the system state and the actions to be taken, solves for the optimal content placement with deep reinforcement learning, and stores the required content in the corresponding RSU in advance, thereby reducing the delay with which vehicles obtain content from the RSU. Each RSU trains the model on its locally collected data; federated learning then aggregates the models trained by the individual RSUs with a weighted average based on data volume, and the aggregated model is distributed back to every RSU. Finally, the priority of content duplicated in neighboring nodes' caches is lowered during cache replacement, thereby reducing cache redundancy.

Description

Edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network
Technical Field
The invention relates to an edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network, and belongs to the technical field of vehicle-mounted content-centric networks.
Background
A vehicular ad hoc network (VANET) is a special type of mobile ad hoc network that contains fixed infrastructure as well as vehicles. In a VANET, each vehicle may communicate with other vehicles or with fixed roadside base units. Over the past decades, VANET has evolved into a content-sharing platform in which the origin of content is largely irrelevant: the network cares more about the content itself than about the node that actually carries it. Content-oriented applications cover different areas such as entertainment, sports, and shopping. To match this content-oriented character of VANET, a new network architecture, the content-centric network (CCN), has been proposed. Unlike IP networks, content names are the basic elements in a CCN, whose operation is characterized by the exchange of content request packets (called Interest) and content response packets (called Data). The in-network caching of CCN facilitates efficient distribution of streaming content under vehicular mobility and intermittent connectivity, giving rise to the vehicular content-centric network (VCCN). A VCCN can achieve better network performance for safety applications, traffic applications, and content applications (e.g., file sharing and commercial advertising).
Similar to conventional vehicular networks, a VCCN mainly contains two types of nodes: mobile nodes such as vehicles, also called on-board units (OBUs), and road side units (RSUs). Both types of node can forward Interest packets and cache content, while the RSU, acting as an edge node, additionally receives requests from the mobile nodes and fetches content from the cloud data source; a well-configured RSU caching strategy is therefore vital for improving the efficiency with which users obtain content. For vehicular network edge caching, the operating environment is very complex, and local content popularity in the vicinity of a mobile node is affected by many factors. In particular, users' content preferences are influenced in complex patterns by user context such as location, personal characteristics, and device diversity. Furthermore, which edge node is selected to serve a particular user request depends in complex ways on network conditions such as network topology, wireless channels, and cooperation between base stations. Because of the natural dynamics of wireless networks, the caching environment of a vehicular network also changes over time. An edge node should therefore be intelligent enough to learn new states and new actions, match them so as to take optimal or near-optimal actions, and learn from feedback on the actions it has taken. An intelligent caching strategy should accept such feedback so that it can adapt to dynamic changes in the operating environment.
Disclosure of Invention
In order to effectively improve the performance of the edge caching system in a vehicle-mounted content-centric network, the invention provides an edge pre-caching strategy based on federated learning. Taking each vehicle's historical movement path and the content it is likely to request as the basis, the RSU models the system state and the actions to be taken, solves for the optimal content placement with deep reinforcement learning, and stores the required content in the corresponding RSU in advance, thereby reducing the delay with which vehicles obtain content from the RSU. Each RSU trains the model on its locally collected data; federated learning then aggregates the models trained by the individual RSUs with a weighted average based on data volume, and the aggregated model is distributed back to every RSU. Finally, the priority of content duplicated in neighboring nodes' caches is lowered during cache replacement, thereby reducing cache redundancy.
The technical scheme of the invention is as follows:
An edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network comprises the following steps:
(1) First, data on content requests and the corresponding vehicle movement information are collected in the dynamic environment of the vehicular network, and a deep reinforcement learning (DRL) agent deployed on the RSU is trained to make the decision that best reduces request delay under the given conditions. The training of the DRL agent first requires defining a state space, an action space, and a return function:
(1.1) The state space consists mainly of two parts: the movement state of the vehicles and the request probability of the contents. The movement state of a vehicle comprises its current position and the position it may reach after one time slice. The current position is easy to obtain, but the future position cannot be observed directly, so a Markov chain is used to predict the position the vehicle is likely to reach from its historical path, and the prediction result is taken as part of the state space. The request probability of content likewise falls into two categories: the popularity of the content, and the content predicted to be requested next based on what the vehicle is currently requesting.
(1.2) To avoid an excessively large action space, the DRL agent is restricted to selecting only one content item per decision to place in the cache; the selection is repeated several times so that the high-priority contents end up in the cache. To further increase efficiency, the range of selectable contents is narrowed by popularity: only contents whose popularity exceeds a threshold can be targets of pre-caching.
(1.3) The working efficiency of the DRL agent is characterized by the cache hit rate. To account for both short-term and long-term benefit, the return function is defined as the exponentially weighted average of the cache hit rates of future time slices, where $r_i$ denotes the hit rate of the $i$-th time slice counted from the current moment and $w \in (0,1)$ is the exponential weighting factor; the larger $w$ is, the more slowly the contribution of future hit rates decays over time.
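As a concrete reading of the return defined in (1.3), the sketch below computes an exponentially weighted average of per-time-slice hit rates over a finite horizon; the normalization by the weight sum and the variable names are assumptions made for illustration, not the patent's exact formula.

```python
# Minimal sketch of the exponentially weighted return described in (1.3).
# Assumes the return aggregates the hit rates of upcoming time slices with
# weights w^0, w^1, ...; the exact normalization may differ in the patent.
def weighted_return(hit_rates, w=0.9):
    """hit_rates[i] is the cache hit rate of the i-th time slice from now."""
    weights = [w ** i for i in range(len(hit_rates))]
    return sum(wi * ri for wi, ri in zip(weights, hit_rates)) / sum(weights)

# Example: nearer time slices contribute more than distant ones.
print(weighted_return([0.6, 0.4, 0.2], w=0.5))  # weights 1, 0.5, 0.25
```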
(2) After the state space, action space, and return function have been defined, the deep learning framework of the agent can be constructed and trained. The deep reinforcement learning framework adopted in this patent consists of the following parts:
(2.1) An actor network is defined as a mapping, with parameters $\theta^{\mu}$, from the state space to the action space. Given a state, the actor network computes from its parameters the original (proto-)action in the corresponding action space as its output.
(2.2) Generating a single original action effectively reduces the computational cost of a large-scale action space, but reducing the dimensionality of the action space in this way can easily make the decision inaccurate. A K-nearest-neighbor (KNN) method is therefore used to expand the generated action into a set of actions, i.e., a set of valid actions in the action space, each element of which may become the action to be executed.
(2.3) To avoid selecting actions with a low Q value, a critic network is also defined to constrain the output of the actor network and to update the parameters of the actor network. The critic network evaluates each action by its Q value:
$$Q(s_t, a_t \mid \theta^{Q}) = \mathbb{E}_{E}\!\left[ r(s_t, a_t) + \gamma\, Q\!\left(s_{t+1}, \mu(s_{t+1} \mid \theta^{\mu}) \mid \theta^{Q}\right) \right],$$
where $s_t$ denotes the state at time $t$, $a_t$ the action taken at time $t$, $\theta^{Q}$ and $\theta^{\mu}$ the parameters of the critic and actor networks respectively, $\mathbb{E}_{E}[\cdot]$ the expectation of the bracketed value under environment $E$, $r(s_t, a_t)$ the return obtained by taking action $a_t$ in state $s_t$, $\gamma \in (0,1]$ the decay factor of the future cumulative return, and $\mu(s_{t+1} \mid \theta^{\mu})$ the action derived by the actor network from the state at time $t+1$. For each candidate action in the action set generated in the previous step, the critic network computes the corresponding Q value from the current and next states, and the action with the maximum Q value is selected as the action to execute.
Then $N$ state-transition records are randomly sampled from the replay pool, and the critic network is updated by minimizing the loss function $L$, defined as
$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2, \qquad y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right),$$
where $i$ indexes the sampled records and $Q'$ and $\mu'$ denote the critic and actor networks as they were before the state transition of record $i$ occurred.
The parameters of the actor network are updated with the sampled policy gradient
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i},$$
i.e., the gradient with respect to the actor parameters $\theta^{\mu}$ is obtained by the chain rule, where $\nabla_{a} Q(s, a \mid \theta^{Q})$ is the gradient of the critic network with respect to the action $a = \mu(s_i)$ taken in state $s_i$, and $\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})$ is the gradient of the actor network with respect to its parameters $\theta^{\mu}$.
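The construction in (2.1)-(2.3) is the standard actor-critic pattern with KNN action refinement (a Wolpertinger-style architecture): the actor proposes a proto-action, KNN maps it to valid content choices, and the critic ranks the candidates and drives both updates. The PyTorch sketch below illustrates that pattern only; the network sizes, the helper names (Actor, Critic, knn_refine, ddpg_update) and the hyperparameters (hidden=128, k=8, gamma=0.95, tau=0.01) are illustrative assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a proto-action (a point in the content-embedding space)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Estimates Q(s, a) from the concatenated state and action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def knn_refine(proto_action, content_embeddings, critic, state, k=8):
    """Expand the proto-action to its k nearest valid actions and let the
    critic pick the one with the highest Q value (Wolpertinger-style).
    proto_action and state are 1-D tensors; content_embeddings is (N, A)."""
    dists = torch.cdist(proto_action.unsqueeze(0), content_embeddings).squeeze(0)
    candidates = content_embeddings[dists.topk(k, largest=False).indices]
    q = critic(state.expand(k, -1), candidates).squeeze(-1)
    best = q.argmax()
    return candidates[best], best

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                opt_a, opt_c, gamma=0.95, tau=0.01):
    """One critic/actor update from a replay minibatch (s, a, r, s')."""
    s, a, r, s2 = batch
    with torch.no_grad():
        y = r + gamma * critic_t(s2, actor_t(s2)).squeeze(-1)   # target y_i
    loss_c = nn.functional.mse_loss(critic(s, a).squeeze(-1), y)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    loss_a = -critic(s, actor(s)).mean()        # sampled policy gradient
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    for tgt, src in ((critic_t, critic), (actor_t, actor)):     # soft update
        for pt, p in zip(tgt.parameters(), src.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)
```

In use, knn_refine would be called once per pre-caching decision with the current state and the embedding matrix of the popularity-filtered contents, and ddpg_update once per time slice on a replay minibatch.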
(3) Training a deep reinforcement learning agent requires a large amount of data, which is usually collected at different RSUs. Uploading all of this data to a central node, such as a specific RSU or a remote server, would occupy a large amount of bandwidth, and the computing performance of that single node would become a bottleneck while the computing resources of the many edge nodes went unused. This patent therefore adopts a federated learning architecture: each RSU collects data locally, trains the given network, and periodically uploads its model parameters to a remote server. The remote server performs federated averaging to obtain updated model parameters and sends them back to each RSU. The federated learning process is as follows:
(3.1) The remote server initializes the model of the deep reinforcement learning agent, assigning random initial parameter values to the current actor and critic networks. The remote server then distributes this model to the RSUs within the area.
(3.2) Upon receiving the model, each RSU starts model training; the training process is as described in step (2). If usable historical data are available, they can be processed and used for model training, and the model is further updated with new data collected during system operation after the model is received.
(3.3) After a period of training, each RSU transmits its trained model back to the remote server, which performs federated averaging (federated averaging, FedAvg). Because RSUs at different locations observe different traffic flows, the aggregation is weighted as
$$\theta_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, \theta_{t+1}^{k},$$
where $\theta_{t+1}$ denotes the network parameters after one iteration round, $K$ is the total number of RSUs participating in federated learning, $n$ is the total number of requests received by all RSUs during the local training period of this iteration, $n_k$ is the number of requests received by the $k$-th RSU, and $\theta_{t+1}^{k}$ denotes the parameters of the $k$-th RSU after local training. The whole process repeats until the model parameters remain stable.
(3.4) The remote server redistributes the aggregated model to the RSUs, and the RSUs guide their caching operations with this unified agent.
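A minimal sketch of the aggregation rule in (3.3), assuming model parameters are exchanged as state-dict-style dictionaries keyed by parameter name; the function name and container format are illustrative assumptions.

```python
# Minimal sketch of the federated-averaging step in (3.3): the server
# weights each RSU's parameters by its share of the total requests n_k / n.
def federated_average(rsu_params, rsu_request_counts):
    """rsu_params: list of dicts {param_name: tensor or array}, one per RSU.
    rsu_request_counts: number of requests n_k handled by each RSU this round."""
    n_total = sum(rsu_request_counts)
    averaged = {}
    for name in rsu_params[0]:
        averaged[name] = sum(
            (n_k / n_total) * params[name]
            for params, n_k in zip(rsu_params, rsu_request_counts))
    return averaged  # redistributed to every RSU for the next round
```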
(4) As mentioned in step (1), the DRL agent selects only one content item at a time for pre-caching and pre-caches several likely contents by repeating the selection, so in practice each pre-cache candidate corresponds to the Q value of one action. On this basis, to reduce the space wasted when several adjacent RSUs store the same content, each RSU first exchanges its cache list with its neighboring RSUs before computing the Q value of each action; if a content item already exists in several neighboring RSUs, the priority of the corresponding action is additionally reduced according to $n_d$, the number of neighboring RSUs in which that content is present. The RSU then re-ranks the contents by the adjusted Q values and pre-caches, in order, the contents that satisfy the conditions.
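For illustration, the sketch below applies the neighbor-aware down-weighting and re-ranking described in step (4). The exact penalty formula is not reproduced in the text above, so a simple 1/(1 + n_d) scaling is assumed here; only the exchange-then-rerank logic follows the description.

```python
# Hedged sketch of step (4): penalize candidates already cached by neighbors.
def rank_candidates(q_values, candidate_contents, neighbor_cache_lists):
    """q_values[i] is the critic's Q value for pre-caching candidate_contents[i];
    neighbor_cache_lists is a list of sets of content names held by neighbors."""
    ranked = []
    for content, q in zip(candidate_contents, q_values):
        n_d = sum(content in cache for cache in neighbor_cache_lists)
        ranked.append((q / (1 + n_d), content))   # assumed penalty form
    ranked.sort(reverse=True)                      # highest adjusted Q first
    return [content for _, content in ranked]
```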
The invention has the following beneficial effects. For vehicular mobile networks the operating environment is very complex, and local content popularity in the vicinity of a mobile node is affected by many factors. Deep reinforcement learning can model this complex operating environment, characterize the caching environment through mobility prediction and prediction of the content users will request, and obtain the optimal pre-caching choices by training on a large amount of data.
Because RSUs are located in different areas, their user densities and request volumes differ. In general, a larger training set yields a more accurate model, but if every RSU uploaded its training data to a specific RSU or a remote server, the data transfer would occupy a large amount of bandwidth and the single-point performance bottleneck would limit the training efficiency of the whole model. Federated learning solves these problems effectively: transmitting only model parameters reduces bandwidth consumption, and the computing resources of the RSUs are fully used for model training, avoiding single-point performance bottlenecks.
Finally, additionally lowering the priority of duplicated content according to neighboring RSUs' cache lists effectively reduces the space wasted by redundant caching and improves caching efficiency.
Drawings
FIG. 1 is an organizational chart of a pre-caching strategy according to the present invention.
FIG. 2 is a flow chart of the deep reinforcement learning modeling according to the present invention.
FIG. 3 is a flow chart of the training of the deep reinforcement learning agent according to the present invention.
FIG. 4 is a flow chart of federal learning according to the present invention.
Fig. 5 is a flowchart of RSU pre-buffering according to the present invention.
Detailed Description
For the purpose of making the technical solutions and advantages of the present invention clearer, the present invention will be further explained in detail by examples and drawings.
The method comprises modeling the edge caching environment with deep reinforcement learning, aggregating the trained model parameters with a federated learning architecture, and having each RSU perform pre-caching through its local agent.
Referring to FIG. 2, the specific implementation of modeling the edge caching environment for deep reinforcement learning is as follows:
Step 1. During the warm-up phase, the RSU records the historical movement path of each vehicle.
Step 2. A movement prediction model based on a Markov chain is built for each vehicle from its historical movement path.
Step 3. Each vehicle periodically uploads its position, and the RSU takes the vehicle's current position $l_t$ as one component of the state space.
Step 4. The RSU feeds the positions of the vehicle's last two time slices into the movement prediction model to compute the most likely position $l_{t+1}$ of each vehicle in the next time slice, which also becomes part of the state space.
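As a concrete illustration of Steps 2-4, the sketch below builds a second-order Markov predictor from per-vehicle position histories and queries it with the positions of the last two time slices; the class name and the zone-ID labels are illustrative assumptions.

```python
from collections import Counter, defaultdict

class MobilityPredictor:
    """Second-order Markov model over observed position transitions."""
    def __init__(self):
        # (l_{t-1}, l_t) -> Counter of observed next positions l_{t+1}
        self.transitions = defaultdict(Counter)

    def train(self, path):
        """path: sequence of positions of one vehicle, one per time slice."""
        for prev, cur, nxt in zip(path, path[1:], path[2:]):
            self.transitions[(prev, cur)][nxt] += 1

    def predict(self, prev, cur):
        """Most likely position in the next time slice, or None if unseen."""
        counts = self.transitions.get((prev, cur))
        return counts.most_common(1)[0][0] if counts else None

# Example usage on a toy history of zone IDs.
m = MobilityPredictor()
m.train(["z1", "z2", "z3", "z4", "z3", "z4", "z5"])
print(m.predict("z3", "z4"))  # "z3" here (ties broken by first observation)
```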
Step 5. Under the assumption that a user accessing video-stream data will request contents essentially in sequence, the content the user is likely to access next can be predicted: if the content accessed at time $t$ is $c_i$, the content likely to be requested at time $t+1$ is
$$c_{t+1} = c_{i+\Delta i}, \qquad \Delta i = \Delta t / d_c,$$
where $\Delta t$ is the duration of one time slice and $d_c$ is the average playback time of a content item.
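A small sketch of the Step 5 prediction under the stated sequential-access assumption; rounding Δi to an integer segment index and the variable names are added assumptions.

```python
# Sketch of Step 5: predict the next requested segment under sequential play.
def predict_next_content(current_index, slice_duration, avg_play_time):
    delta_i = round(slice_duration / avg_play_time)  # Δi = Δt / d_c (rounded)
    return current_index + delta_i

# Example: 60 s time slices, 20 s average playback per content item.
print(predict_next_content(current_index=7, slice_duration=60, avg_play_time=20))  # 10
```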
Step 6. The RSU computes the popularity of each content item from its request history, using an attenuation factor $\lambda \in [0,1]$ that weights historical requests against recent requests, where $n_t$ denotes the number of requests for the content during period $t$. Content popularity also serves as a component of the state space.
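The popularity formula itself is not reproduced above; the sketch below assumes one plausible form, an exponentially decayed request count governed by the attenuation factor λ, purely for illustration.

```python
# Hedged sketch of Step 6: assumed exponentially decayed popularity estimate.
def update_popularity(prev_popularity, n_t, lam=0.8):
    """Blend historical popularity with the current period's request count n_t."""
    return lam * prev_popularity + (1 - lam) * n_t

# Example: popularity tracked across three periods of request counts.
p = 0.0
for n_t in [12, 30, 5]:
    p = update_popularity(p, n_t)
print(p)  # smoothed request rate, compared against the threshold in Step 7
```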
Step 7. The contents are screened by popularity: only contents whose popularity exceeds the threshold $\rho_t$ can be objects of pre-caching.
Step 8. The DRL agent is limited to selecting only one content item per decision to place in the cache, and the selection is repeated several times so that high-priority contents are stored in the cache. The action space of a single decision is the set of all contents screened in Step 7, $\mathcal{A} = \{c_1, c_2, \dots, c_N\}$, where $N$ is the number of contents whose popularity reaches the threshold.
Step 9. The working efficiency of the DRL agent is characterized by the cache hit rate. To account for both short-term and long-term benefit, the return function is defined as the exponentially weighted average of the cache hit rates of future time slices, where $r_i$ denotes the hit rate of the $i$-th time slice counted from the current moment and $w \in (0,1)$ is the exponential weighting factor; the larger $w$ is, the more slowly the contribution of future hit rates decays over time.
Referring to FIG. 3, the specific training process of the deep reinforcement learning agent on the RSU is as follows:
step 10. Initializing an actor network μ (s|θ μ ) Critic network Q (s, a|θ Q ) The parameters are respectively theta μ And theta Q The method comprises the steps of carrying out a first treatment on the surface of the Simultaneously initializing target networks μ 'and Q', which initialize parameters θ μ′ ←θ μ ,θ Q′ ←θ Q . The verified playback set R is initialized.
Step 11. Based on the state $s_t$ at time $t$, the actor network produces an original action $\hat{a}_t = \mu(s_t \mid \theta^{\mu})$.
Step 12. Using the KNN algorithm, the $k$ valid actions nearest to the original action are selected as the candidate action set.
Step 13. According to the current policy, the candidate action $a_t$ with the maximum Q value is selected and executed; the reward $r_t$ and the new state $s_{t+1}$ are observed, and the state transition $(s_t, a_t, r_t, s_{t+1})$ is recorded in the experience replay buffer $R$.
Step 14. A minibatch of state-transition records $(s_i, a_i, r_i, s_{i+1})$ is sampled from the experience replay buffer $R$, and $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$ is set.
Step 15. The critic network is updated by minimizing the loss function
$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2,$$
and the actor network is updated with the sampled policy gradient
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}.$$
Step 16. The target networks are updated:
$$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'},$$
where $\tau < 1$ is the update coefficient.
Steps 11 to 16 constitute the model update performed within one time slice and are repeated once per time slice.
Referring to FIG. 4, the specific flow of federated learning based on deep reinforcement learning is as follows:
and step 17, initializing a model of the deep reinforcement learning intelligent agent by the remote server, and endowing random parameter initial values for the current actor network and the critic network.
The remote server distributes this model to the various RSUs within the area.
Step 19. Each RSU trains the deep reinforcement learning agent online according to Steps 10 to 16.
Step 20. After a period of training, each RSU transmits its locally trained model back to the remote server, which performs federated averaging (federated averaging) as specified in part (3.3) of the technical solution.
Step 21. The remote server redistributes the aggregated model to the RSUs, and each RSU guides its caching operations with the unified agent. Steps 19 to 21 are then repeated until the model converges.
Referring to FIG. 5, the specific flow of RSU pre-caching is as follows:
step 22. The neighboring RSUs exchange their respective cache lists periodically.
Step 23, after model training is completed, at the beginning of each time slice, the RSU collects environment information and constructs corresponding states, including the moving state of the vehicle and the request probability of the content.
Step 24. Selecting an effective action set according to the description of the steps 11-12.
And 25, removing the content of which the popularity does not reach the threshold value for the content corresponding to each action in the set.
Step 26. If there is more optional content in the collection, go to steps 27-29; otherwise, ending the pre-caching operation of the current time slice.
Step 27. When an RSU computes the Q value of each action with the critic network, if a content item already exists in several neighboring RSUs, the priority of that action is additionally reduced according to $n_d$, the number of neighboring RSUs in which the content is present. The RSU then re-ranks the contents by the adjusted Q values and pre-caches the content with the highest Q value.
Step 28. If the cache is full, the LRU cache replacement policy is used to select content to evict, after which the pre-cached content is placed into the cache.
Step 29. If the amount of pre-cached content has reached 3/5 of the cache space, the pre-caching operation of the current time slice ends; otherwise the procedure returns to Step 24 and repeats the above operations.
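Putting Steps 24-29 together, the sketch below runs one time slice of pre-caching against an LRU cache, assuming the popularity filter of Step 25, an already adjusted-Q ranking from Step 27 (e.g., produced by a helper like rank_candidates above), and the 3/5-of-capacity stopping rule of Step 29; the class and parameter names are illustrative.

```python
from collections import OrderedDict

class RsuCache:
    """Toy LRU content cache; capacity is counted in content items."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()           # content -> payload, LRU order

    def put(self, content, payload="data"):
        if content in self.store:
            self.store.move_to_end(content)  # refresh LRU position
            return
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # Step 28: evict the LRU entry
        self.store[content] = payload

def precache_time_slice(cache, ranked_contents, popularity, threshold):
    """ranked_contents: contents sorted by adjusted Q value (Step 27);
    popularity: dict content -> popularity score used for the Step 25 filter."""
    placed = 0
    budget = int(cache.capacity * 3 / 5)     # Step 29: stop at 3/5 of capacity
    for content in ranked_contents:
        if placed >= budget:
            break
        if popularity.get(content, 0) < threshold:   # Step 25 filter
            continue
        cache.put(content)
        placed += 1
    return placed
```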

Claims (1)

1. An edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network, characterized by comprising the following steps:
(1) first, collecting data on content requests and the corresponding vehicle movement information in the dynamic environment of the vehicular network, and training a deep reinforcement learning (DRL) agent deployed on the RSU to make the decision that best reduces request delay under the given conditions; the training of the DRL agent first requires defining a state space, an action space, and a return function:
(1.1) the state space consists mainly of two parts, one being the movement state of the vehicles and the other the request probability of the contents; the movement state of a vehicle comprises its current position and the position it may reach after one time slice; the current position is easy to obtain, but the future position cannot be observed directly, so a Markov chain is adopted to predict the position the vehicle is likely to reach from its historical path, and the prediction result is used as a component of the state space; the request probability of content is likewise divided into two categories, one being the popularity of the content and the other the content predicted to be requested next based on what the vehicle is currently requesting;
(1.2) to avoid an excessively large action space, the DRL agent is limited to selecting only one content item per decision to place in the cache, and the selection is repeated several times so that high-priority contents are stored in the cache;
(1.3) the working efficiency of the DRL agent is characterized by the cache hit rate; to account for both short-term and long-term benefit, the return function is defined as the exponentially weighted average of the cache hit rates of future time slices, where $r_i$ denotes the hit rate of the $i$-th time slice counted from the current moment and $w \in (0,1)$ is the exponential weighting factor; the larger $w$ is, the more slowly the contribution of future hit rates decays over time;
(2) after the state space, the action space, and the return function have been defined, the deep learning framework of the agent can be constructed and trained; the deep reinforcement learning framework adopted by the method comprises the following parts:
(2.1) an actor network is defined as a mapping, with parameters $\theta^{\mu}$, from the state space to the action space; given a state, the actor network computes from its parameters the original action in the corresponding action space as its output;
(2.2) the generated action is expanded into a group of actions, i.e., a set of valid actions in the action space, by a K-nearest-neighbor method, each element of which may serve as the action to be executed;
(2.3) to avoid selecting actions with a low Q value, a critic network is defined to constrain the output of the actor network and to update the parameters of the actor network; its deterministic target policy is as follows:
$$Q(s_t, a_t \mid \theta^{Q}) = \mathbb{E}_{E}\!\left[ r(s_t, a_t) + \gamma\, Q\!\left(s_{t+1}, \mu(s_{t+1} \mid \theta^{\mu}) \mid \theta^{Q}\right) \right],$$
where $s_t$ denotes the state at time $t$, $a_t$ the action taken at time $t$, $\theta^{Q}$ and $\theta^{\mu}$ the parameters of the critic and actor networks respectively, $\mathbb{E}_{E}[\cdot]$ the expectation of the bracketed value under environment $E$, $r(s_t, a_t)$ the return obtained by taking action $a_t$ in state $s_t$, $\gamma \in (0,1]$ the decay factor of the future cumulative return, and $\mu(s_{t+1} \mid \theta^{\mu})$ the action derived by the actor network from the state at time $t+1$; for each candidate action in the action set generated in the previous step, the critic network computes the corresponding Q value from the current and next states, and the action with the maximum Q value is selected as the action to execute;
the critic network is then updated by minimizing a loss function defined as
$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2, \qquad y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right),$$
where $i$ indexes the selected records and $Q'$ and $\mu'$ denote the critic and actor networks as they were before the state transition of record $i$ occurred;
the parameters of the actor network are updated with the sampled policy gradient
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i},$$
i.e., the gradient with respect to the actor parameters $\theta^{\mu}$ is obtained by the chain rule, where $\nabla_{a} Q(s, a \mid \theta^{Q})$ is the gradient of the critic network with respect to the action $a = \mu(s_i)$ taken in state $s_i$, and $\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})$ is the gradient of the actor network with respect to its parameters $\theta^{\mu}$;
(3) the method adopts a federated learning architecture: each RSU collects data locally, trains the given network, and periodically uploads its model parameters to a remote server; the remote server performs federated averaging to obtain updated model parameters and sends them back to each RSU; the federated learning process is as follows:
(3.1) the remote server initializes the model of the deep reinforcement learning agent, assigning random initial parameter values to the current actor and critic networks; the remote server then distributes this model to the RSUs within the area;
(3.2) upon receiving the model, each RSU starts model training, the training process being the same as in step (2); if usable historical data are available, they are processed and used for model training, and the model is further updated with new data collected during system operation after the model is received;
(3.3) after a period of training, each RSU transmits its trained model back to the remote server, which performs federated averaging; because RSUs at different locations observe different traffic flows, the aggregation is weighted as
$$\theta_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, \theta_{t+1}^{k},$$
where $\theta_{t+1}$ denotes the network parameters after one iteration round, $K$ is the total number of RSUs participating in federated learning, $n$ is the total number of requests received by all RSUs during the local training period of this iteration, $n_k$ is the number of requests received by the $k$-th RSU, and $\theta_{t+1}^{k}$ denotes the parameters of the $k$-th RSU after local training; the whole process repeats until the model parameters remain stable;
(3.4) the remote server redistributes the aggregated model to the RSUs, and the RSUs guide their caching operations with the unified agent;
(4) as stated in step (1), the DRL agent selects only one content item at a time for pre-caching and pre-caches several likely contents by repeating the selection, so in practice each pre-cached content corresponds to the Q value of one action; on this basis, to reduce the space wasted when several adjacent RSUs store the same content, each RSU first exchanges its cache list with its neighboring RSUs before computing the Q value of each action; if a content item already exists in several neighboring RSUs, the priority of the corresponding action is additionally reduced according to $n_d$, the number of neighboring RSUs in which that content is present; the RSU then re-ranks the contents by the adjusted Q values and pre-caches, in order, the contents that satisfy the conditions.
CN202110149492.0A 2021-02-03 2021-02-03 Edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network Active CN113158544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110149492.0A CN113158544B (en) 2021-02-03 2021-02-03 Edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network

Publications (2)

Publication Number Publication Date
CN113158544A (en) 2021-07-23
CN113158544B (en) 2024-04-12

Family

ID=76882726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110149492.0A Active CN113158544B (en) Edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network

Country Status (1)

Country Link
CN (1) CN113158544B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115567978A (en) * 2022-09-29 2023-01-03 福州大学 System and method for joint optimization of computation unloading and resource allocation under multi-constraint side environment
CN116567719B (en) * 2023-07-05 2023-11-10 北京集度科技有限公司 Data transmission method, vehicle-mounted system, device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109617962A (en) * 2018-12-11 2019-04-12 电子科技大学 A kind of car networking mist node content caching method based on the content degree of association
KR102124979B1 (en) * 2019-07-31 2020-06-22 (주)크래프트테크놀로지스 Server and methor for performing order excution for stock trading
CN110535875A (en) * 2019-09-19 2019-12-03 大连理工大学 Caching under vehicle-mounted content center network based on cooperation mode pollutes attack detection method
CN111491175A (en) * 2019-10-18 2020-08-04 北京大学 Edge network caching method and device based on video content characteristics
CN110958573A (en) * 2019-11-22 2020-04-03 大连理工大学 Mobile perception cooperative caching method based on consistent Hash under vehicle-mounted content center network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cooperative caching strategy based on content popularity and node attributes in content-centric networking; Huo Yuehua; Liu Yinlong; Journal of Taiyuan University of Technology; 2018-01-15 (No. 01); full text *
Node caching strategy combining content popularity and subjective preference; Qu Dapeng; Yang Wen; Yang Yue; Cheng Tianfang; Wu Sijin; Wang Xingwei; Journal of Chinese Computer Systems; 2018-11-15 (No. 11); full text *

Also Published As

Publication number Publication date
CN113158544A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
Zhu et al. Deep reinforcement learning for mobile edge caching: Review, new features, and open issues
Wu et al. Mobility-aware cooperative caching in vehicular edge computing based on asynchronous federated and deep reinforcement learning
CN110312231A (en) Content caching decision and resource allocation joint optimization method based on mobile edge calculations in a kind of car networking
Song et al. QoE-driven edge caching in vehicle networks based on deep reinforcement learning
CN113158544B (en) Edge pre-caching strategy based on federated learning in a vehicle-mounted content-centric network
CN113094982B (en) Internet of vehicles edge caching method based on multi-agent deep reinforcement learning
Zhang et al. Joint optimization of cooperative edge caching and radio resource allocation in 5G-enabled massive IoT networks
CN112565377A (en) Content grading optimization caching method for user service experience in Internet of vehicles
Chua et al. Resource allocation for mobile metaverse with the Internet of Vehicles over 6G wireless communications: A deep reinforcement learning approach
Zhang et al. Towards hit-interruption tradeoff in vehicular edge caching: Algorithm and analysis
CN114374949A (en) Power control mechanism based on information freshness optimization in Internet of vehicles
Du et al. Virtual relay selection in LTE-V: A deep reinforcement learning approach to heterogeneous data
CN108390936A (en) A kind of probability cache algorithm based on caching distributed awareness
Liu et al. Contextual learning for content caching with unknown time-varying popularity profiles via incremental clustering
CN116249162A (en) Collaborative caching method based on deep reinforcement learning in vehicle-mounted edge network
CN114979145B (en) Content distribution method integrating sensing, communication and caching in Internet of vehicles
Bhattacharyya et al. QFlow: A learning approach to high QoE video streaming at the wireless edge
Wang et al. Proactive edge caching in vehicular networks: An online bandit learning approach
Cai et al. Cooperative content caching and delivery in vehicular networks: A deep neural network approach
Zafar et al. Transfer learning in multi-agent reinforcement learning with double q-networks for distributed resource sharing in v2x communication
Chen et al. Prefetching and caching schemes for IoT data in hierarchical edge computing architecture
Khanal et al. Proactive content caching at self-driving car using federated learning with edge cloud
Liu et al. GA-DRL: Graph Neural Network-Augmented Deep Reinforcement Learning for DAG Task Scheduling over Dynamic Vehicular Clouds
Kan et al. Cooperative caching strategy based mobile vehicle social‐aware in internet of vehicles
Raghay et al. A genetic algorithm for management data stream in VANET

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant