CN116321307A - Bidirectional cache placement method based on deep reinforcement learning in non-cellular network - Google Patents

Bidirectional cache placement method based on deep reinforcement learning in non-cellular network

Info

Publication number
CN116321307A
Authority
CN
China
Prior art keywords
content
edge server
cache
network
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310257897.5A
Other languages
Chinese (zh)
Inventor
王朝炜
于小飞
王子夜
王卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202310257897.5A priority Critical patent/CN116321307A/en
Publication of CN116321307A publication Critical patent/CN116321307A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00: Network traffic management; Network resource management
    • H04W 28/02: Traffic management, e.g. flow control or congestion control
    • H04W 28/10: Flow control between communication endpoints
    • H04W 28/14: Flow control between communication endpoints using intermediate storage
    • H04W 28/16: Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
    • H04W 28/18: Negotiating wireless communication parameters
    • H04W 28/20: Negotiating bandwidth
    • H04W 28/24: Negotiating SLA [Service Level Agreement]; Negotiating QoS [Quality of Service]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a bidirectional cache placement method based on deep reinforcement learning in a non-cellular network, relating to the technical fields of mobile communication and the Internet of Things. The method comprises the following steps: a utility function based on cache hit rate, cache space resource utilization, content response delay and energy consumption is established for the edge server node, and a multi-objective optimization problem based on this utility function is constructed to solve the content caching decision, the optimization objective being to maximize the cache hit rate while minimizing the system cost as far as possible; a cache resource allocation decision network is then established using a deep Q network; and the content caching decision is updated by training the Q network with experience replay and by periodically refreshing user preferences from the continuously received user requests. The method can allocate bandwidth and computing resources to users reasonably, improving the overall resource utilization of the non-cellular network while guaranteeing the quality-of-service requirements of user applications.

Description

Bidirectional cache placement method based on deep reinforcement learning in non-cellular network
Technical Field
The invention relates to the technical fields of mobile communication and the Internet of Things, and in particular to a bidirectional cache placement method based on deep reinforcement learning in a non-cellular network.
Background
The development of mobile communication technology and artificial intelligence brings challenges to Internet of Things systems such as highly dynamic topology, limited computing resources, growing network scale and quality of service (Quality of Service, QoS) guarantees. Designing cache placement strategies for mobile edge computing systems that account for the bidirectional caching characteristics of emerging 6G applications has therefore become a pressing problem. In particular, combining deep reinforcement learning (DRL) with caching services provided at the edge of the radio access network can assist the efficient use of communication and computing resources, reduce the communication overhead caused by content retransmission and relieve the burden on the backhaul link. Cell-free massive multiple-input multiple-output (Cell-Free Massive MIMO) combines the ideas of distributed MIMO and massive MIMO: complex multi-antenna macro base stations (Macro Base Station, MBS) are replaced by a group of simple distributed access points (APs), all of which are connected directly to a central processing unit (Central Processor Unit, CPU) through backhaul links and cooperate to serve all users on the same time-frequency resources, giving the system scalability and improved coverage.
In a mobile edge computing communication scenario assisted by Cell-Free Massive MIMO, multiple edge servers equipped with single antennas can act as edge APs and be flexibly deployed over a wide area. Both the edge APs and the mobile user terminals have certain computing and caching capabilities and can cache content. The set of distributed edge APs simultaneously provides caching services for all users within coverage and receives, via the CPU, content and processing tasks collected from the Internet for delivery. The CPU controls the cache status of each edge AP as well as the transfer and forwarding of content. If the content requested by a user cannot be obtained in time at the edge of the wireless network, it can be fetched from the CPU over the backhaul link.
The response delay of user requests and the computational energy consumption of terminal devices are two indicators that significantly affect the quality of user experience (Quality of Experience, QoE). Caching content at the mobile user terminal and the edge APs allows content to be processed and returned quickly, meeting the requirement of obtaining content fast while keeping device energy consumption low. When users are within the coverage of the edge APs, they can request content corresponding to the computing service of interest: if a user has pre-cached the requested content locally, the computation can be performed locally or offloaded to an edge AP; if the user terminal has not cached the requested content, the content must first be obtained from an edge AP and then either computed locally or handed to the edge AP for computation. Compared with retrieving content from the CPU, caching content at the edge effectively reduces communication cost, increases network capacity and improves the content delivery rate. However, given the limited caching capability of user terminals, the richness and diversity of content-centric entertainment and information services in the future Internet of Things place enormous pressure on spectrum resources, network capacity and user experience quality. Therefore, building a finer-grained bidirectional caching task model that predicts the future caching locations of different contents from users' historical request information, so that relevant popular content can be pre-cached at the edge APs to assist computation, is of great significance for the development of emerging immersive applications.
Disclosure of Invention
For the scenario in which an edge server assists the bidirectional computing tasks of users in a mobile edge computing system within a non-cellular network, and in order to jointly optimize the allocation of bandwidth, computing and caching resources of the whole system at a finer granularity, the invention provides a bidirectional cache placement method based on deep reinforcement learning in a non-cellular network.
The invention provides a bidirectional cache placement method based on deep reinforcement learning in a non-cellular network, which comprises the following steps:
step 1, establishing for the edge server node a utility function based on cache hit rate, cache space resource utilization, content response delay and energy consumption, and constructing a multi-objective optimization problem based on this utility function to solve the content caching decision, the optimization objective being to maximize the cache hit rate while minimizing the system cost as far as possible;
step 2, mapping the content caching decision process based on user history preference into a Markov decision process, and establishing a cache resource allocation decision network using a deep Q network;
the edge server node collects the historical requests and terminal-device resource information of all users within the base-station signal coverage, predicts the contents of interest to each user from their historical requests, generates an initial content caching decision, trains the Q network with experience replay, and periodically updates user preferences to refresh the content caching decision;
step 3, a user terminal generates a service requirement and sends a request to the edge server node; the user terminal checks whether the requested content is cached locally; if not, the requested content is obtained from the edge server node and then processed either locally or by offloading to the edge server node; if so, the content is processed locally or offloaded directly to the edge server node; and the edge server updates the content caching decision according to the continuously received user requests.
In step 1, the utility function in time slot t is expressed as

$$\mathrm{utility}(t)=\frac{P_{hit}(t)}{Y(t)}$$

wherein $P_{hit}(t)$ is the cache hit rate of the edge server node in time slot t, and Y(t) is the normalized system cost of the edge server node in time slot t, covering the cache space resource utilization, content response delay and energy consumption;

the system cost is

$$Y(t)=\omega\,\frac{T_{total}(t)}{T_{max}(t)}+\phi\,\frac{E_{total}(t)}{E_{max}(t)}+\mu\,\frac{C_{F\_M}(t)}{C_M}$$

wherein ω, φ and μ are weight proportions, $T_{total}(t)$ and $E_{total}(t)$ denote the total delay and total energy consumption of the edge server in time slot t, $T_{max}(t)$ and $E_{max}(t)$ denote the maximum delay and maximum energy consumption of the edge server in time slot t, $C_M$ denotes the cache capacity of the edge server, and $C_{F\_M}(t)$ denotes the sum of the sizes of the contents cached on the edge server in time slot t.
In step 1, K users are located within the coverage of the edge server node, with user set $\mathcal{K}=\{1,2,\ldots,K\}$. F environment-frame contents are provided, with content set $\mathcal{F}=\{f_1,f_2,\ldots,f_F\}$. The data size of content i is $S_i$, and $C_i$ denotes the cache state of content i on the edge server: $C_i=0$ means not cached, $C_i=1$ means cached, $C_i\in\{0,1\}$. The following multi-objective optimization problem is then constructed to solve the edge server content caching decision:

$$\max_{\mathcal{C}}\ \bar{U}=\frac{1}{\tau}\sum_{t=1}^{\tau}\mathrm{utility}(t)$$
$$\text{s.t.}\quad \sum_{i=1}^{F}C_i S_i \le C_M,$$
$$\qquad C_i\in\{0,1\},\ q_i\in\{0,1\},\ h_k\in\{0,1\},\ \forall i\in\mathcal{F},\ \forall k\in\mathcal{K}.$$

The optimization problem optimizes the content caching decision $\mathcal{C}=\{C_1,C_2,\ldots,C_F\}$ so as to maximize the cache hit rate while minimizing the system cost as far as possible; $\bar{U}$ denotes the average utility function; $q_i\in\{0,1\}$ indicates whether content i is requested, $q_i=1$ indicating that at least one user requested content i and $q_i=0$ indicating that no user requested it; $h_k$ indicates whether the request of user k hits the cache space of the edge server, with $h_k=1$ if the content hits and $h_k=0$ otherwise; τ denotes the current slot.
In step 2, in the deep Q network, the state of the agent is set to the content cache state of the current edge server, $s(t)=[C_1,C_2,\ldots,C_F]$; the action output by the agent is $a(t)=[a_1(t),a_2(t),\ldots,a_F(t)]$, where $a_i(t)\in\{0,1\}$ indicates whether an action is taken on content i; the evaluation network computes the reward of the state reached after the action is executed, and the reward is set to the value of the optimization objective, i.e. the average utility $\bar{U}=\frac{1}{\tau}\sum_{t=1}^{\tau}\mathrm{utility}(t)$.
Compared with the prior art, the invention has the advantages and positive effects that:
(1) The method models the system resource utilization and the service quality of different services in a non-cellular network as a multi-objective optimization problem, comprehensively considering the coupling of edge multi-dimensional resources, the bidirectional input characteristic of applications and user fairness. The edge server content caching decision process based on user history preference is mapped to a Markov decision process, a cache resource allocation decision network is established with a DQN neural network, and on this basis the multi-objective optimization problem is solved to find an optimal caching strategy. The decision is based on each user's long-term request information for service content and the caching capability of the corresponding device. In each time slot the users' historical request information is used to predict their next content demand and allocate cache resources, so that bandwidth and computing resources are allocated to users reasonably; by allocating the bandwidth, computing and caching resources of the edge server appropriately, the overall resource utilization of the non-cellular network is improved while the quality-of-service requirements of user applications are guaranteed.
(2) Experimental comparison shows that, by combining an iterative optimization method in each time slot, the method can coordinate user offloading decisions and jointly optimize the computing and communication resources of the edge server, guaranteeing the experience-quality requirements of each user and the efficient use of edge multi-dimensional resources; compared with existing caching schemes, the method achieves better results, reaching the optimization objectives of maximizing system resource utilization and guaranteeing the experience quality of all users.
Drawings
FIG. 1 is a schematic diagram of a bidirectional cache scenario of a mobile edge computing network assisted by the Cell-Free Massive MIMO technique;
FIG. 2 is a flow chart of optimizing system resource utilization and user quality of service experience using the method of the present invention;
FIG. 3 is a schematic diagram of a content caching decision for an edge server using deep Q network optimization in the method of the present invention;
FIG. 4 is a graph comparing the utility function of the method of the present invention with that of existing caching schemes as the number of contents varies;
FIG. 5 is a graph comparing the utility function of the method of the present invention with that of existing caching schemes as the number of users varies.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples.
Because edge resources are limited, a mobile edge computing system facing emerging content-centric services can use the Cell-Free Massive MIMO technique together with deep reinforcement learning to improve edge resource utilization while guaranteeing the QoS of services and the QoE of users. The invention considers a bidirectional caching scenario of a mobile edge computing network assisted by Cell-Free Massive MIMO. As shown in FIG. 1, the scenario consists of one central processing unit (CPU), edge APs and mobile user terminals. A user requests content through a mobile device that has caching and computing capabilities. Edge APs equipped with MEC (mobile edge computing) servers provide caching and offloading services within a certain communication coverage. In the method of the present invention, the input of an application computing task is assumed to come from two sources: 1) data generated by the user equipment; 2) data from the Internet. The edge AP has enough storage space to store all contents. The cache space of the user terminal is limited, so it can only cache content selectively, and the computation delay and energy consumption affect the user's QoE.
Unlike the conventional unidirectional computing task model, the input data of the bidirectional computing task model of the present invention consists of two parts: 1) data generated by the user equipment, i.e. local input data, such as the current device's strategy selection and three-dimensional motion information; 2) data from the Internet, i.e. remote input data, such as map information. Because the resource coupling of a multi-user system is complex, the bandwidth, caching and computing resources of the system must be allocated dynamically according to information such as user preference and computation demand, raising resource utilization as much as possible and ultimately guaranteeing and improving the QoS of different services and the corresponding QoE of users. In the bidirectional cache placement method based on deep reinforcement learning in a non-cellular network, the edge AP jointly allocates the edge multi-dimensional resources while comprehensively considering device energy consumption, storage capacity and computation delay, so as to maximize the system content cache hit rate and storage resource utilization and improve the overall experience of the system.
As shown in fig. 2, the overall flow of optimizing system resource utilization and user service experience quality with the method in this application scenario is as follows. First, a user terminal within the coverage of an edge AP generates a service requirement, selects the required content according to the user's own preference and sends a request to the edge AP. The edge server, i.e. the edge AP, collects the device resource information and historical request data of every user, and makes resource allocation decisions based on this information after receiving the request. In the method, the edge AP predicts the content that a user is likely to be interested in from the user's historical requests to form an initial caching decision, and then trains the deep Q network (Q-network) with experience replay to obtain a stable, customized caching decision for the user. The edge server updates the content caching decision according to the continuously received user requests. If the requested content is already cached on the mobile terminal, the user may choose to compute locally or offload the computation to the edge AP. If the content is not cached, it can be obtained from the edge AP and then either computed locally or handed to the edge AP for processing. The preference of each user is updated iteratively at regular intervals by the deep Q network, continuously improving the accuracy of the system's predictions of user requests, thereby maximizing system resource utilization and guaranteeing the experience quality of all users.
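By way of illustration only, the per-request flow just described can be sketched in Python as follows; the identifiers (handle_request, user_cache, edge_cache, prefer_offload) are illustrative and not taken from the patent, and the offloading choice is reduced to a simple flag rather than the full decision logic.

```python
# Minimal sketch of the per-request flow described above (illustrative names,
# not the patent's own identifiers).

def handle_request(content_id, user_cache, edge_cache, prefer_offload):
    """Serve one user request: check the local cache, fall back to the edge AP
    cache (which may fetch from the CPU over the backhaul), then decide where
    the computation runs."""
    if content_id in user_cache:
        source = "local cache"
    elif content_id in edge_cache:
        source = "edge AP cache"
    else:
        source = "CPU via backhaul"       # edge AP fetches it and may cache it
        edge_cache.add(content_id)
    where = "edge AP" if prefer_offload else "local device"
    return source, where

# Example: content 7 is cached at the edge but not on the device,
# and the user chooses to offload the computation.
src, loc = handle_request(7, user_cache={1, 3}, edge_cache={2, 7}, prefer_offload=True)
print(src, loc)   # -> edge AP cache, edge AP
```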
Considering a single edge wireless access point, the cache capacity of the edge server is denoted $C_M$. The application provides F environment-frame contents, expressed as $\mathcal{F}=\{f_1,f_2,\ldots,f_F\}$, where the data size of content $f_i$ is denoted $S_i$ and differs between contents. The content requested by users lies within these F environment frames. K users are randomly distributed in the coverage of the AP base station and expressed as $\mathcal{K}=\{1,2,\ldots,K\}$. The system operates indefinitely and is divided into time slots, denoted t = 0, 1, 2, 3, …. Each terminal device submits at most one content request per time slot t. The content requested by the k-th user in time slot t is denoted $x_k(t)\in\mathcal{F}$, and $X(t)=[x_1(t),x_2(t),\ldots,x_K(t)]$ denotes the requested contents of the K users in time slot t. $q_i(t)\in\{0,1\}$ denotes whether content $f_i$ is requested: $q_i(t)=1$ indicates that at least one user has requested content $f_i$, and $q_i(t)=0$ indicates that content $f_i$ has no user request. The request status of all contents can then be expressed as $Q(t)=[q_1(t),q_2(t),\ldots,q_F(t)]$.
Suppose the AP must respond to the user's request and provide service before the end of the current slot. $C_i\in\{0,1\}$ represents the cache state of the i-th content on the MEC server, as follows:

$$C_i=\begin{cases}1, & \text{content } f_i \text{ is cached on the MEC server}\\ 0, & \text{otherwise}\end{cases}$$

The caching decisions for all contents are then expressed as $\mathcal{C}=[C_1,C_2,\ldots,C_F]$. MEC cache resources are limited, which imposes the following constraint:

$$C_{F\_M}=\sum_{i=1}^{F}C_i S_i \le C_M$$

wherein $C_{F\_M}$ is the sum of the sizes of the contents cached in the MEC.
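As an illustrative aid (not part of the patent), the caching-decision variables and the capacity constraint above can be represented as follows; the content sizes and the capacity value are arbitrary example numbers.

```python
import numpy as np

# Illustrative instance: F contents with sizes S_i and an edge cache capacity C_M.
rng = np.random.default_rng(0)
F = 50
S = rng.uniform(1.0, 10.0, size=F)      # content sizes S_i (e.g. in MB)
C_M = 100.0                             # edge server cache capacity

def cache_feasible(C, S, C_M):
    """Check the capacity constraint sum_i C_i * S_i <= C_M for a binary
    caching decision vector C = [C_1, ..., C_F]."""
    return float(np.dot(C, S)) <= C_M

C = np.zeros(F, dtype=int)
C[:12] = 1                              # cache the first 12 contents
print(cache_feasible(C, S, C_M))
```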
In order to evaluate the QoE of users at a finer granularity, the invention constructs an overall utility function comprising several indicators: cache hit rate, cache space resource utilization, content response delay and energy consumption. $h_k\in\{0,1\}$ indicates whether the content requested by user k hits the cache space: $h_k=1$ if the requested content of user k hits, otherwise $h_k=0$. The cache hit rate of time slot t can then be expressed as:

$$P_{hit}(t)=\frac{1}{K}\sum_{k=1}^{K}h_k(t)$$

Next, the normalized system cost Y(t) of time slot t is defined, covering the content response delay, the energy consumption and the utilization of the cache space resources:

$$Y(t)=\omega\,\frac{T_{total}(t)}{T_{max}(t)}+\phi\,\frac{E_{total}(t)}{E_{max}(t)}+\mu\,\frac{C_{F\_M}(t)}{C_M}$$

wherein ω, φ and μ represent the weight proportions of the content response delay, the energy consumption and the cache space resource utilization, respectively. $T_{max}(t)$ and $E_{max}(t)$ denote the maximum delay and maximum energy consumption of the edge server system in time slot t, and $T_{total}(t)$ and $E_{total}(t)$ denote its total delay and total energy consumption in time slot t. The invention accordingly defines a new utility function utility(t), i.e. the ratio of the cache hit rate to the normalized system cost:

$$\mathrm{utility}(t)=\frac{P_{hit}(t)}{Y(t)} \tag{6}$$
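A minimal numerical sketch of the utility computation defined above is given below; the weight values ω = 0.4, φ = 0.3, μ = 0.3 and all other inputs are illustrative assumptions, since the patent only requires the weights to sum to 1.

```python
import numpy as np

def utility(hits, T_total, T_max, E_total, E_max, cached_size, C_M,
            omega=0.4, phi=0.3, mu=0.3):
    """utility(t) = P_hit(t) / Y(t) with
    Y(t) = omega*T_total/T_max + phi*E_total/E_max + mu*cached_size/C_M.
    `hits` is the per-user hit indicator vector h_k; the weights sum to 1."""
    p_hit = float(np.mean(hits))
    y = omega * T_total / T_max + phi * E_total / E_max + mu * cached_size / C_M
    return p_hit / y

# Example slot: 7 of 10 user requests hit the edge cache.
print(utility(hits=[1] * 7 + [0] * 3, T_total=0.8, T_max=2.0,
              E_total=1.5, E_max=4.0, cached_size=60.0, C_M=100.0))
```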
the transmission load required for local computation of the terminal device is different from that of the edge server, and the corresponding computation cost is also different, so that careful design of a communication-computation coordination strategy is required to achieve equilibrium. For this purpose, the invention constructs the complete multi-objective optimization problem as follows:
Figure BDA00041303036500000515
Figure BDA00041303036500000516
Figure BDA00041303036500000517
Figure BDA00041303036500000518
Figure BDA0004130303650000061
Figure BDA0004130303650000062
Figure BDA0004130303650000069
representing the average optimization objective function +.>
Figure BDA0004130303650000063
Representing the average utility function of a segment of time slots, the utility function of each time slot, utility (t), is calculated as in equation (6). The optimization objective of the above-described multi-objective optimization problem is to maximize cache hit rate and minimize system cost, using the optimization variable +.>
Figure BDA00041303036500000610
To represent caching decisions for all content. Constraint (7 a) indicates that the sum of the sizes of the contents cached in the edge servers is ensured not to exceed the cache capacity of the edge servers. τ represents slot τ. Constraints (7 b) - (7 d) represent binary variables for content caching, content request, and user request hit cases, respectively. Constraint (7 e) indicates that the sum of the superparameters is 1.
The multi-objective optimization problem is non-convex and NP-hard, and the computing and communication resource allocation decision of the system at time t+1 is affected by the cache resource allocation at time t. The invention therefore comprehensively considers the coupling of edge multi-dimensional resources, the bidirectional input characteristic of applications and user fairness, maps the edge server content caching decision process based on user historical preference to a Markov decision process, and establishes a cache resource allocation decision network with a deep reinforcement learning network, so as to reach the joint optimization objective of maximizing system resource utilization while guaranteeing service QoS.
As shown in FIG. 3, the bidirectional cache placement method based on deep reinforcement learning treats the non-cellular edge computing system as the environment and, through interaction with the environment, selects the actions that obtain the largest reward so as to find an optimal state-action scheme. The invention uses a deep Q network (DQN) and trains the Q-network with experience replay to improve the stability of the scheme, and then selects resource allocation decisions according to the environment state. The design of the states, actions and rewards in the DQN is described in detail below.
(1) State design. The state is a description of the external environment; the agent relies on the state parameters to make subsequent decisions. The state in the decision network is defined as s and changes over time. In the embodiment of the invention, the system state at time t is the content cache state of the current edge server, $s(t)=[C_1,C_2,\ldots,C_F]$.
(2) Action design. The action is an output parameter of the agent, used to adjust variable information in the system environment; the action in the network is defined as a. In the embodiment of the invention, the network action a is the cache resource allocation decision for the predicted situation at the next time instant, and must be applied to the real system to adjust the resource variables. In each time slot, the edge server should decide which contents to cache in the user device and the server, respectively, so as to maximize the utility function. An action can thus be expressed as

$$a(t)=[a_1(t),a_2(t),\ldots,a_F(t)],\quad a_i(t)\in\{0,1\}$$

where $a_i(t)=1$ means that an action is taken on content $f_i$ and $a_i(t)=0$ means that no action is taken on content $f_i$; the action taken may be caching the content or ceasing to cache it.
(3) Reward design. The reward value of the evaluation network must reflect how good or bad the cache resource allocation decision made by the deep reinforcement learning network is for the overall performance of the system. For each slot, the environment produces a system reward value based on the current state, the action taken in the current state and the next state. The reward value should be designed according to the goal of the resource allocation decision. The invention trains with the Q-learning method, which yields the discounted cumulative reward obtained after action a is executed in state s. The agent learns to select the action with the largest Q value in each iteration and, after a number of iterations, performs actions intelligently according to the optimal solution.
The reward value is designed as follows: the system returns a reward in each state, and the invention sets the reward to the value of the optimization objective. Since the optimization objective is to maximize the utility function, the reinforcement learning reward is defined as U(X):

$$U(X)=\frac{1}{\tau}\sum_{t=1}^{\tau}\mathrm{utility}(t)$$

wherein X represents the variables that need to be optimized and τ represents the current slot.
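The state/action/reward design above can be illustrated with the following toy environment sketch; the class name, the handling of capacity-violating toggles (reverting the whole action) and the utility_fn hook are assumptions made only for illustration, not details given in the patent.

```python
import numpy as np

class EdgeCacheEnv:
    """Toy environment following the state/action/reward design above:
    state s(t) is the binary cache vector [C_1, ..., C_F], action a(t) is a
    binary vector (a_i = 1 toggles the caching of content i), and the reward
    is the running average utility over slots 1..tau.
    The per-slot utility computation is delegated to a user-supplied function."""

    def __init__(self, F, S, C_M, utility_fn):
        self.F, self.S, self.C_M, self.utility_fn = F, S, C_M, utility_fn
        self.state = np.zeros(F, dtype=int)
        self.utilities = []

    def step(self, action, requests):
        # Apply the action: toggle the cache flag of every content with a_i = 1.
        nxt = np.where(action == 1, 1 - self.state, self.state)
        # Keep the capacity constraint: revert if the toggled state overflows.
        if np.dot(nxt, self.S) > self.C_M:
            nxt = self.state
        self.state = nxt
        self.utilities.append(self.utility_fn(self.state, requests))
        reward = float(np.mean(self.utilities))   # average utility up to slot tau
        return self.state.copy(), reward
```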
In order to avoid dimension explosion, the method adopts a Q-network as the core algorithm. The mapping between the input s(τ) and the output Q(s(τ), a(τ), θ) is determined by the neural network structure, where θ represents the weight parameters of the deep neural network (Deep Neural Networks, DNN). The invention uses a DNN to approximate the nonlinear function and realize the Q-network. The DNN structure comprises three fully connected hidden layers with 256 and 512 neurons. The activation function of the first two hidden layers is the linear rectification function (ReLU), and that of the third hidden layer is the tanh function.
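An illustrative PyTorch realization of such a Q-network is sketched below; the 256/256/512 hidden-layer widths are one possible reading of the text, and the output head scores one per-content toggle, which is a simplification of the exponentially large joint action space rather than the patent's prescribed encoding.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """DNN approximation of Q(s, a): three fully connected hidden layers,
    ReLU on the first two and tanh on the third, as described above.
    The 256/256/512 split is an assumption made for this sketch."""

    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.Tanh(),
            nn.Linear(512, n_actions),      # one Q-value per candidate action
        )

    def forward(self, state):
        return self.net(state)

# Example: F = 50 contents gives a 50-dimensional binary cache state.
q = QNetwork(state_dim=50, n_actions=50)
print(q(torch.zeros(1, 50)).shape)          # torch.Size([1, 50])
```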
In addition, the Q-network is trained with experience replay to improve the stability of the scheme: the experience tuples (s(τ), a(τ), r(τ), s(τ+1)) are stored in a replay pool $\mathcal{D}$ of a given capacity, where r(τ) denotes the reward of the state reached after action a(τ) is executed at time τ. When the number of stored experience tuples is greater than $N_D$, $N_M$ experience tuples are randomly sampled from the replay pool $\mathcal{D}$ to train the network. Action a(τ) is selected with an ε-greedy strategy to balance exploitation and exploration, and the exploration rate decreases linearly from an initial value $\varepsilon_s$ to a final value $\varepsilon_e$. The relevant DRL parameters are set as follows: learning rate α = 1e-4, discount factor γ = 0.9, initial exploration rate $\varepsilon_s$ = 0.9, final exploration rate $\varepsilon_e$ = 0.001. The popularity of the requested contents is assumed to follow a Zipf distribution, so the popularity of the i-th content requested by a user is:

$$p_i=\frac{i^{-\zeta}}{\sum_{j=1}^{F}j^{-\zeta}}$$

where ζ is the shape parameter of the Zipf distribution, set to the constant value 0.56.
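The experience replay and ε-greedy elements described above, together with the Zipf request model, can be sketched as follows. The replay-pool capacity, the values of N_D and N_M, and the use of a single online network (no separate target network is mentioned in the text) are assumptions of this sketch; the optimizer would be, for instance, Adam with the stated learning rate.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F_nn

# Hyperparameters from the description; N_D, N_M and the pool capacity are assumed.
ALPHA, GAMMA = 1e-4, 0.9
EPS_START, EPS_END = 0.9, 0.001
N_D, N_M = 500, 32
replay = deque(maxlen=10_000)            # replay pool D (capacity assumed)

def zipf_popularity(n_contents, zeta=0.56):
    """Zipf popularity p_i = i^(-zeta) / sum_j j^(-zeta) of the i-th content."""
    ranks = np.arange(1, n_contents + 1, dtype=float)
    p = ranks ** (-zeta)
    return p / p.sum()

def epsilon(step, total_steps):
    """Exploration rate decreasing linearly from EPS_START to EPS_END."""
    frac = min(1.0, step / total_steps)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_net, state, eps, n_actions):
    """epsilon-greedy selection over the Q-network outputs."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))

def train_step(q_net, optimizer):
    """One DQN update from a random minibatch of the replay pool.
    Experiences are stored as (state tensor, action index, reward, next-state tensor);
    a single online network is used here, no target network being described."""
    if len(replay) <= N_D:
        return
    s, a, r, s_next = zip(*random.sample(replay, N_M))
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a, dtype=torch.long)
    r = torch.tensor(r, dtype=torch.float32)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    target = r + GAMMA * q_net(s_next).max(dim=1).values.detach()
    loss = F_nn.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example: user requests for one slot can be drawn from the Zipf popularity,
# e.g. np.random.choice(F, size=K, p=zipf_popularity(F)) for F contents and K users.
```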
In fig. 4 and fig. 5, the caching method of the present invention is abbreviated as DRL caching, and is compared with other existing caching schemes. Existing caching schemes include random caching, greedy caching, and genetic caching.
As shown in fig. 4, the impact of different caching schemes on the utility function is compared for different numbers of contents under the same environmental conditions. As the number of contents increases, the overall utility function value shows a decreasing trend, because with more contents the user requests become more dispersed, the cache hit rate drops and the delay rises. The utility function value fluctuates because contents of different sizes are generated randomly for each content count: when the total content size is smaller, the same MEC server cache space can hold more contents, which raises the hit rate, lowers the delay and increases the utility value. As can be seen from fig. 4, the caching scheme of the invention attains higher utility function values for all content counts.
As shown in fig. 5, the impact of different caching schemes on the utility function is compared for different numbers of users under the same environmental conditions. Overall, the utility function of all caching schemes decreases gradually as the number of users grows, because with more users the bandwidth allocated to each user shrinks, the transmission rate falls, the delay rises and the utility value drops. Moreover, the decrease in the utility function flattens as the number of users grows, because the rate at which the transmission rate falls becomes smaller. As can be seen from fig. 5, the caching scheme of the invention attains the highest utility function value for all user counts.
The experimental results show that, under different system resource conditions, the bidirectional cache placement method based on deep reinforcement learning in a non-cellular network improves the efficiency of system resource utilization while guaranteeing user fairness, achieves better results than existing caching schemes, and reaches the optimization objectives of maximizing system resource utilization and guaranteeing the experience quality of all users.
Apart from the technical features described in the specification, everything else is known to those skilled in the art. Descriptions of well-known components and well-known techniques are omitted so as not to obscure the invention unnecessarily. The embodiments described above do not represent all embodiments consistent with this application; on the basis of the technical solutions of the invention, those skilled in the art may make various modifications or variations without inventive effort while remaining within the scope of the invention.

Claims (3)

1. A bidirectional cache placement method based on deep reinforcement learning in a non-cellular network, characterized by comprising the following steps:
(1) establishing for the edge server node a utility function based on cache hit rate, cache space resource utilization, content response delay and energy consumption, wherein the utility function utility(t) in time slot t is:

$$\mathrm{utility}(t)=\frac{P_{hit}(t)}{Y(t)}$$

wherein $P_{hit}(t)$ is the cache hit rate of the edge server node in time slot t and Y(t) is the normalized system cost of the edge server node in time slot t, calculated as:

$$Y(t)=\omega\,\frac{T_{total}(t)}{T_{max}(t)}+\phi\,\frac{E_{total}(t)}{E_{max}(t)}+\mu\,\frac{C_{F\_M}(t)}{C_M}$$

wherein ω, φ and μ are weight proportions, $T_{total}(t)$ and $E_{total}(t)$ denote the total delay and total energy consumption of the edge server in time slot t, $T_{max}(t)$ and $E_{max}(t)$ denote the maximum delay and maximum energy consumption of the edge server in time slot t, $C_M$ denotes the cache capacity of the edge server, and $C_{F\_M}(t)$ denotes the sum of the sizes of the contents cached on the edge server in time slot t;
k users in the coverage area of the edge server node are set, and the user set is
Figure FDA0004130303630000014
Providing F environmental frame contents, wherein the content set is +.>
Figure FDA0004130303630000015
The data size of the content i is S i ,C i Representing the cache state of content i on edge server, C i =0 means uncached, C i =1 indicates buffered, ++>
Figure FDA0004130303630000016
Then construct the following multi-objective optimization problem solving edge server content caching decisions:
Figure FDA0004130303630000017
Figure FDA0004130303630000018
Figure FDA0004130303630000019
the optimization problem represents optimizing content caching decisions
Figure FDA00041303036300000110
An optimization target for maximizing the cache hit rate and minimizing the system cost as much as possible is achieved; />
Figure FDA00041303036300000111
Representing an average utility function, τ representing the current slot;
Figure FDA00041303036300000112
indicating whether content i is requested or not +.>
Figure FDA00041303036300000113
Representing content i->
Figure FDA00041303036300000114
Indicating that at least one user requested content i,
Figure FDA00041303036300000115
representation->
Figure FDA00041303036300000116
No user request; h is a k Indicating whether user k's request hits in the edge server's cache space, h if the content hits k =1, otherwise h k =0;
(2) mapping the content caching decision process based on user history preference into a Markov decision process, and establishing a cache resource allocation decision network using a deep Q network;

in the deep Q network, the state of the agent is set to the content cache state of the current edge server, $s(t)=[C_1,C_2,\ldots,C_F]$; the action output by the agent is $a(t)=[a_1(t),a_2(t),\ldots,a_F(t)]$, where $a_i(t)\in\{0,1\}$ indicates whether an action is taken on content i; the evaluation network computes the reward of the state reached after the action is executed, the reward being set to the value of the optimization objective $\bar{U}=\frac{1}{\tau}\sum_{t=1}^{\tau}\mathrm{utility}(t)$;
the edge server node collects the historical requests and terminal-device resource information of all users within the base-station signal coverage, predicts the contents of interest to each user from their historical requests, generates an initial content caching decision, trains the Q network with experience replay, and periodically updates user preferences to update the content caching decision;
(3) a user terminal generates a service requirement and sends a request to the edge server node; the user terminal checks whether the requested content is cached locally, and if not, obtains the requested content from the edge server node; if so, the content is processed locally or offloaded to the edge server node; and the edge server updates the content caching decision according to the continuously received user requests.
2. The method of claim 1, wherein the cache hit rate of the edge server node in time slot t is

$$P_{hit}(t)=\frac{1}{K}\sum_{k=1}^{K}h_k(t).$$
3. The method of claim 1, wherein the cache resource allocation decision network is established using a deep Q network, the Q network being realized by a deep neural network (DNN) approximating a nonlinear function; the DNN comprises three fully connected hidden layers, the activation function of the first two hidden layers being the linear rectification function and that of the third hidden layer being the tanh function.
CN202310257897.5A 2023-03-10 2023-03-10 Bidirectional cache placement method based on deep reinforcement learning in non-cellular network Pending CN116321307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310257897.5A CN116321307A (en) 2023-03-10 2023-03-10 Bidirectional cache placement method based on deep reinforcement learning in non-cellular network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310257897.5A CN116321307A (en) 2023-03-10 2023-03-10 Bidirectional cache placement method based on deep reinforcement learning in non-cellular network

Publications (1)

Publication Number Publication Date
CN116321307A true CN116321307A (en) 2023-06-23

Family

ID=86823708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310257897.5A Pending CN116321307A (en) 2023-03-10 2023-03-10 Bidirectional cache placement method based on deep reinforcement learning in non-cellular network

Country Status (1)

Country Link
CN (1) CN116321307A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116761152A (en) * 2023-08-14 2023-09-15 合肥工业大学 Roadside unit edge cache placement and content delivery method
CN116761152B (en) * 2023-08-14 2023-11-03 合肥工业大学 Roadside unit edge cache placement and content delivery method
CN116996921A (en) * 2023-09-27 2023-11-03 香港中文大学(深圳) Whole-network multi-service joint optimization method based on element reinforcement learning
CN116996921B (en) * 2023-09-27 2024-01-02 香港中文大学(深圳) Whole-network multi-service joint optimization method based on element reinforcement learning

Similar Documents

Publication Publication Date Title
Zhong et al. A deep reinforcement learning-based framework for content caching
Sadeghi et al. Deep reinforcement learning for adaptive caching in hierarchical content delivery networks
CN109639760B (en) It is a kind of based on deeply study D2D network in cache policy method
CN111556461B (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
CN112995950B (en) Resource joint allocation method based on deep reinforcement learning in Internet of vehicles
CN112218337B (en) Cache strategy decision method in mobile edge calculation
CN116321307A (en) Bidirectional cache placement method based on deep reinforcement learning in non-cellular network
CN114390057B (en) Multi-interface self-adaptive data unloading method based on reinforcement learning under MEC environment
CN114553963B (en) Multi-edge node collaborative caching method based on deep neural network in mobile edge calculation
CN116489712B (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN116260871A (en) Independent task unloading method based on local and edge collaborative caching
Yan et al. Distributed edge caching with content recommendation in fog-rans via deep reinforcement learning
Zhang et al. Two time-scale caching placement and user association in dynamic cellular networks
Li et al. DQN-enabled content caching and quantum ant colony-based computation offloading in MEC
Somesula et al. Cooperative cache update using multi-agent recurrent deep reinforcement learning for mobile edge networks
CN112911614B (en) Cooperative coding caching method based on dynamic request D2D network
CN109587715B (en) Distributed caching method based on multi-agent reinforcement learning
Lei et al. Partially collaborative edge caching based on federated deep reinforcement learning
Chen et al. Cooperative caching for scalable video coding using value-decomposed dimensional networks
CN115580900A (en) Unmanned aerial vehicle assisted cooperative task unloading method based on deep reinforcement learning
CN112261628B (en) Content edge cache architecture method applied to D2D equipment
Li et al. A smart cache content update policy based on deep reinforcement learning
Jiang et al. Caching strategy based on content popularity prediction using federated learning for F-RAN
CN110392409B (en) WMSNs multipath QoS routing method, system and storage medium based on distribution communication network
Gao et al. Deep Reinforcement Learning Based Rendering Service Placement for Cloud Gaming in Mobile Edge Computing Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination