CN114143891A - FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network - Google Patents

FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network Download PDF

Info

Publication number
CN114143891A
CN114143891A (application CN202111447130.6A)
Authority
CN
China
Prior art keywords
base station
ddql
fdql
content
optimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111447130.6A
Other languages
Chinese (zh)
Inventor
高志宇
王天荆
沈航
白光伟
田一博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202111447130.6A priority Critical patent/CN114143891A/en
Publication of CN114143891A publication Critical patent/CN114143891A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/542Allocation or scheduling criteria for wireless resources based on quality criteria using measured or perceived quality

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

Mobile edge networks are becoming increasingly intelligent, diversified and converged, which makes the optimal allocation of multidimensional resources challenging. To improve the accuracy of multidimensional resource optimization, the invention provides an FDQL-based multidimensional resource collaborative optimization method for mobile edge networks. The method constructs a multidimensional resource allocation model with maximizing the MOS as the optimization target and designs a two-layer decision scheme. First, the base stations at the bottom layer use double deep Q-learning (DDQL) for local model training to obtain the optimal decision within a short period; then, the edge node at the upper layer uses federated deep Q-learning (FDQL) for global model training to reduce the deviation of the distributed decisions over a long period. Experimental results show that the algorithm outperforms other methods in reducing content service delay and improving the quality of user experience.

Description

FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network
Technical Field
The invention belongs to the technical field of communication, and particularly relates to a multidimensional resource collaborative optimization method based on federated deep Q-learning (FDQL) in a mobile edge network.
Background
According to the Ericsson Mobility Report, 5G subscriptions will reach 1.9 billion by 2024 [1]. The rapidly increasing data traffic exacerbates the conflict between limited spectrum, computing and caching resources and the growing resource demand. Meanwhile, the rapid development of the Internet of Things (IoT) [2] and the Internet of Vehicles (IoV) [3] increases the complexity of the network environment. Network communication now faces diversification, convergence and intelligence, which makes resource management more difficult. In response, operators have deployed part of the service processing and resource scheduling functions on cloud platforms to improve service performance [4].
However, facing future 100% global coverage, ultra-large-scale terminal access and massive data transmission with sub-millisecond delay, traditional processing platforms that rely on cloud computing face a huge challenge. In particular, emerging services such as intelligent driving, Virtual Reality (VR), Augmented Reality (AR) and ultra-high-definition video streaming increasingly depend on highly reliable, low-latency real-time data processing; a cloud center far away from users and terminal devices cannot process such demanding applications in time, and network congestion and transmission delay also seriously degrade the user experience. Mobile Edge Computing (MEC), which pushes network resources to the edge and localizes them [5-9], is one of the key technologies for solving these problems.
MEC deploys servers with strong computing power at edge nodes close to users and macro base stations to increase local computing resources; at the same time, these nodes are configured with large caching capacity, improving the quality of emerging data services such as web browsing, multimedia and social networking. Sinking computing and storage resources to the network edge significantly relieves bandwidth pressure, reduces the backhaul load and lowers service delay, thus overcoming the excessive service latency of the cloud center. Although densely deployed MEC servers in a large-scale mobile network can provide users with ultra-low-latency, high-bandwidth services, rapidly growing user service requests still lead to unbalanced resource usage. How to optimize the multidimensional resource allocation scheme while satisfying the users' Quality of Experience (QoE) has therefore become one of the problems that MEC urgently needs to solve.
Traditional approaches such as genetic algorithms, particle swarm optimization, game theory and graph-coloring algorithms can be used to solve the MEC resource optimization problem. Document [10] proposes a two-stage heuristic optimization algorithm based on a genetic algorithm, which decouples the joint optimization of computation offloading and resource allocation and obtains the allocation strategy with minimum energy consumption by iteratively updating the solution. For the multi-access characteristics and resource limitations of base stations in dense cells, document [11] models the joint optimization of computation offloading decisions and the allocation of spectrum, power and computing resources as an NP-hard mixed-integer nonlinear programming problem, and obtains a low-overhead computation offloading and resource scheduling scheme using Particle Swarm Optimization (PSO). Document [12] establishes an efficient network resource optimization strategy based on Stackelberg game theory, realizes low-delay, high-reliability video services and alleviates the conflict between transmission performance and QoE. Document [13] proposes a network resource sharing scheme based on graph-coloring theory which, taking the system resource overhead as the optimization target, realizes efficient V2X collaborative caching, computing and communication resource allocation and reduces multimedia service delay. However, the new characteristics of today's MEC networks, such as big data, dynamics and multiple objectives, prevent these traditional methods from fully mining network information to generate optimal resource allocation decisions.
Artificial intelligence represented by machine learning and deep learning has gradually shifted from being purely algorithm-driven to being jointly driven by data, algorithms and computing power, and can effectively solve many problems in application domains. Document [14] proposes a Reinforcement Learning (RL) optimization framework based on edge-cloud computing, which exploits RL's ability to adapt to environmental diversity and dynamics to quickly make optimal task offloading and resource allocation decisions, minimizing task delay and the energy consumption of user batteries. With the exponential growth in MEC network size and structural complexity, RL-based resource optimization algorithms converge slowly because of the huge state space and can hardly find the optimal solution. Deep Reinforcement Learning (DRL) estimates the RL value function with a Deep Neural Network (DNN) to obtain an accurate approximate solution. Deep Q-Learning (DQL), as a DRL algorithm, combines the perception capability of deep learning with the decision capability of reinforcement learning, and solves the perception-decision problem of complex systems through continuous trial and error [15]. For an MEC server providing three multimedia services (live streaming, buffered streaming and low-delay enhanced mobile broadband), document [16] first designs a QoS evaluation model and then dynamically allocates network resources with a DQN to satisfy users' QoS to the maximum extent; its resource scheduling performance is superior to round-robin and priority scheduling algorithms. Document [17] constructs a resource allocation model that minimizes the average task energy consumption under communication, computing and caching resource constraints, and proposes a multidimensional resource allocation method based on Double Deep Q-Learning (DDQL); simulation results show that, compared with a random algorithm, a greedy algorithm, particle swarm optimization and DQL, DDQL better solves the multi-task resource allocation problem and reduces the average task energy consumption by at least 5%. Using two neural networks to estimate the cumulative delay and the reward of each action, document [18] proposes an attention-based DDQL method that obtains a CPU frequency and transmission power scheduling strategy with minimum delay and energy consumption over a long period. Document [19], aiming to maximize long-term profit while meeting users' low-delay computation requirements, uses DDQL to jointly optimize the computation offloading and cache resource allocation of edge nodes, achieving maximum profit while guaranteeing the QoS of services.
DRL training usually relies on "big data"; in practice, however, it is easier for industry to collect "small data" at low cost, and these distributed small datasets form numerous "data islands" that greatly restrict the usability of DRL decisions. On the other hand, centralized DRL training poses a huge challenge to the computing and storage capabilities of the MEC server. Federated Learning (FL) [20] breaks down data islands and obtains a privacy-preserving global optimal model by sharing network parameters. Document [21] proposes an FL-based resource management method that addresses the bottleneck of intensive resource usage (computing, bandwidth, energy and data) in MEC. To solve the decrease in average aggregation accuracy caused by large differences in participants' data volumes, document [22] proposes a fair alpha-FedAvg algorithm that re-weights the FL aggregation process with the alpha value to generate a fairer global resource optimization model and improve the efficiency of local resource allocation. Document [23] proposes a collaborative edge caching framework based on Federated Deep Reinforcement Learning (FDRL), in which near-optimal local DRL parameters are uploaded to the edge node to participate in the next round of global FL training; the resulting local cache resource optimization scheme effectively reduces backhaul traffic and improves the content hit rate.
Disclosure of Invention
As mobile edge networks become increasingly intelligent, diversified and converged, the optimal allocation of multidimensional resources faces many challenges. Addressing the demand for massive content services in MEC and aiming to improve the accuracy of multidimensional resource optimization, the invention models the joint optimization of spectrum, computing and caching resources as a mixed-integer nonlinear programming problem and, based on federated deep reinforcement learning, constructs a two-layer multidimensional resource allocation framework that decouples the original problem. The base stations at the bottom layer adopt DDQL to obtain local resource collaborative optimization strategies within short periods, making full use of local multidimensional resources; the edge node at the upper layer adopts FDQL to train the globally optimal resource collaborative optimization strategy over long periods, so that users obtain the requested content with low delay and high quality.
The invention discloses an FDQL-based multidimensional resource collaborative optimization method for mobile edge networks, which constructs a multidimensional resource allocation model with maximizing the MOS as the optimization target and designs a two-layer decision scheme. First, the base stations at the bottom layer use double deep Q-learning (DDQL) for local model training to obtain the optimal decision within a short period; then, the edge node at the upper layer uses federated deep Q-learning (FDQL) for global model training to reduce the deviation of distributed decisions over a long period. Simulation experiments show that the method outperforms other methods in reducing content service delay and improving the quality of user experience.
Drawings
FIG. 1 is a schematic diagram of a mobile edge network;
FIG. 2 is a workflow diagram of an FDQL framework;
FIG. 3 is a graph comparing the training performance of four algorithms;
FIG. 4 is a comparison graph of the resource co-optimization performance of four algorithms under different short periods;
FIG. 5 is a comparison graph of decision times of four algorithms in different short periods;
FIG. 6 is a comparison graph of the resource co-optimization performance of four algorithms for different numbers of users;
FIG. 7 is a graph of loss function comparison of centralized DDQL and FDQL;
FIG. 8 is a graph of average MOS value comparison of a centralized DDQL and an FDQL;
FIG. 9 is a graph of decision time comparison of centralized DDQL and FDQL;
FIG. 10 is a QoE performance comparison graph of centralized DDQL and FDQL for different numbers of users;
fig. 11 is a graph comparing QoE performance of centralized DDQL and FDQL under different numbers of base stations.
Detailed Description
The invention is further described with reference to the following detailed description and accompanying drawings.
1. Overview
The invention provides a multidimensional resource collaborative optimization method based on federated deep Q-learning (FDQL), which constructs a two-layer resource management architecture. Its main technical contributions cover the following three aspects:
(1) Based on the users' content service requests, a Mean Opinion Score (MOS) maximization model is established under the constraints of wireless rate, content acquisition delay and cache capacity, and the allocation of spectrum, computing and caching resources in the MEC system is optimized on this basis.
(2) The base station converts the MOS optimization model into a Markov Decision Process (MDP) with a continuous state space and a high-dimensional action space, and designs a multidimensional resource collaborative optimization algorithm based on double deep Q-learning (DDQL) to realize local multidimensional resource joint optimization within a short period.
(3) For the global multidimensional resource joint optimization problem, the edge node aggregates parameters from all associated base stations over a long period and implements FDQL-based resource collaborative optimization. After repeated iterations, the resource optimization decision of the converged global model approaches the decision of centralized deep Q-learning (DQL), so that the multidimensional resource allocation meets the users' QoE requirements.
2. System model
In this example, the MEC system consists of N base stations (BSs) capable of providing computing and caching services and an edge server; the base stations are connected to the edge node and to neighboring base stations via wired optical cables, as shown in Fig. 1. The cell covered by base station $n \in \mathcal{N}=\{1,\dots,N\}$ contains $U_n$ randomly distributed smart terminal users, and each user $u \in \mathcal{U}_n=\{1,\dots,U_n\}$ is connected to its base station by a wireless link.
The edge node and the N base stations cooperatively cache a large amount of content to meet the users' content service demands. Suppose the set of contents cached by the whole system within a long period is $\mathcal{F}=\{1,\dots,F\}$, where the size of content $f \in \mathcal{F}$ is denoted $D_f$. Defining the local popularity of content f at base station n as $p_{n,f}$, the global popularity $p_f$ satisfies the Mandelbrot-Zipf (MZipf) distribution

$$p_f = \frac{(r_f+\tau)^{-\beta}}{\sum_{f'\in\mathcal{F}}(r_{f'}+\tau)^{-\beta}} \qquad (1)$$

where $r_f$ is the rank of content f in the descending popularity order, $\tau$ is the stationary factor and $\beta$ is the skewness factor. Each base station caches, in descending order of popularity, a subset $\mathcal{F}_n \subseteq \mathcal{F}$ of the contents, and its cache status is denoted $c_n$; the edge node, equipped with sufficient resources, caches the whole set $\mathcal{F}$ to guarantee the quality of service of the system.
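As an illustration of the popularity model in equation (1), the following sketch computes MZipf popularities and selects the most popular contents for a base station cache; the parameter values and function names are assumptions for illustration, not values from the patent.

```python
import numpy as np

def mzipf_popularity(num_contents: int, tau: float = 5.0, beta: float = 0.8) -> np.ndarray:
    """Global content popularity p_f under the Mandelbrot-Zipf distribution (eq. (1)).

    r_f is the popularity rank of content f (1 = most popular),
    tau is the stationary factor and beta the skewness factor.
    """
    ranks = np.arange(1, num_contents + 1)
    weights = (ranks + tau) ** (-beta)
    return weights / weights.sum()

# Example: 100 contents, the base station caches the 20 most popular ones.
p = mzipf_popularity(100)
top_fn = np.argsort(-p)[:20]          # indices of the 20 most popular contents
print(p[:5], top_fn[:5])
```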
Within the same period, suppose the content requests of the $U_n$ users in the nth cell form the set $\mathcal{R}_n$; the request status of user u is a tuple that records the requested content f and the required content acquisition rate, and the request indicator is 0 when content f is not requested. The base station therefore maintains a content request list (Table) recording each user's demand, which facilitates the subsequent multidimensional resource allocation and content distribution.
2.1 Communication model
In the edge caching system, a user sends a content request to its associated base station, which adds the request to the content request list. If the base station caches the requested content locally, it delivers the content to the user directly; otherwise, if one or more neighboring base stations cache the requested content, the local base station fetches it from the nearest neighbor and forwards it to the user; failing that, the local base station fetches the content from the edge node and forwards it to the user. Accordingly, $y_{n,u,f}=1$ indicates that the requested content f is cached at the local base station and $y_{n,u,f}=0$ that it is not; similarly, $z_{n,u,f}=1$ indicates that the requested content f is cached at the nearest neighboring base station and $z_{n,u,f}=0$ that it is not.
Because a user may request several contents within the same period, and services such as autonomous navigation and live video have strict delay requirements, the base station serves identical content requests from different users by multicast, improving spectrum utilization and thereby the content service quality. To this end, the system spectrum W is divided into L subchannels of bandwidth B, $\mathcal{L}=\{1,\dots,L\}$, shared by the N cells. The Signal to Interference plus Noise Ratio (SINR) of the content received by user u over subchannel l can be expressed as

$$\mathrm{SINR}_{n,u,l} = \frac{P_{n,u,l}\,h_{n,u,l}}{\sum_{n'\neq n} P_{n',u,l}\,h_{n',u,l} + \sigma^2} \qquad (2)$$

where $P_{n,u,l}$ is the transmission power of base station n to user u on subchannel l, $h_{n,u,l}$ is the channel gain and $\sigma^2$ is the noise power. The downlink transmission rate of subchannel l is then

$$v_{n,u,l} = B\log_2(1+\mathrm{SINR}_{n,u,l}) \qquad (3)$$

and the corresponding downlink transmission delay of content f follows from (3) as

$$TD_{n,u,l} = \frac{D_f}{v_{n,u,l}} \qquad (4)$$
According to the content request set $\mathcal{R}_n$ and the list Table, base station n allocates distribution subchannels for its $F_n$ requested contents. A binary channel connection matrix $X_n=[x_{n,u,l}]$ is introduced, where $x_{n,u,l}=1$ denotes that user u receives a content transmission on subchannel l and $x_{n,u,l}=0$ denotes no access. Defining the u-th row vector of $X_n$, $x_{n,u}=(x_{n,u,1},\dots,x_{n,u,L})$, as the channel connection vector of user u, the total transmission delay of the $U_n$ users in the nth cell can be expressed as

$$TD_n = \sum_{u=1}^{U_n}\sum_{l=1}^{L} x_{n,u,l}\Big[ y_{n,u,f}\,TD_{n,u,l} + (1-y_{n,u,f})\big(z_{n,u,f}(TD_{n,u,l}+TD_{n,n'}) + (1-z_{n,u,f})(TD_{n,u,l}+TD_{n,m})\big)\Big] \qquad (5)$$

where $1-y_{n,u,f}$ (i.e. $y_{n,u,f}=0$) indicates that the requested content f is cached at another base station or at the edge node, $1-z_{n,u,f}$ (i.e. $z_{n,u,f}=0$) indicates that it is cached only at the edge node, and $TD_{n,n'}$ and $TD_{n,m}$ denote the wired transmission delays between base station n and the nearest neighboring base station n' and the edge node m, respectively.
2.2 Computation model
At present, streaming media content such as panoramic video, live video and VR video accounts for about 80% of network traffic, and the content response and the subsequent encoding and reconstruction all require support from computing units. The computing resource allocation scheme therefore directly affects the processing delay of content requests; for immersive interactive VR video in particular, excessive delay causes frequent stalling and dizziness, seriously degrading the user experience. To speed up the distribution of requested content, the base station needs to plan the allocation of its computing units rationally. Let $\gamma_n$ denote the number of CPU cycles required by base station n to process a unit task; the total computation delay of the $F_n$ requested contents can then be expressed as

$$CD_n = \sum_{f\in\mathcal{F}_n} \frac{\gamma_n D_f}{\lambda_{n,f}} \qquad (6)$$

where $\lambda_{n,f}$ is the amount of computing resources allocated to content f. Combining (5) and (6), the content acquisition delay of the $U_n$ users in the nth cell comprises the transmission delay and the computation delay, i.e.

$$CA_n = TD_n + CD_n \qquad (7)$$
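To make equations (3)–(7) concrete, the sketch below computes the downlink rate on a subchannel, the wireless transmission delay of a content of size $D_f$, the computation delay under a given CPU allocation, and the resulting content acquisition delay. All numbers and names are illustrative assumptions, not the patent's parameters.

```python
import math

def downlink_rate(bandwidth_hz: float, tx_power: float, channel_gain: float,
                  interference: float, noise_power: float) -> float:
    """Shannon rate of a subchannel, eq. (3): v = B * log2(1 + SINR)."""
    sinr = tx_power * channel_gain / (interference + noise_power)
    return bandwidth_hz * math.log2(1.0 + sinr)

def transmission_delay(content_bits: float, rate_bps: float, backhaul_delay: float = 0.0) -> float:
    """Wireless delay D_f / v plus any wired backhaul delay (neighbor BS or edge node), cf. eqs. (4)-(5)."""
    return content_bits / rate_bps + backhaul_delay

def computation_delay(content_bits: float, cycles_per_bit: float, cpu_hz: float) -> float:
    """Eq. (6): processing delay = required CPU cycles / allocated computing resources."""
    return content_bits * cycles_per_bit / cpu_hz

# Illustrative values only
rate = downlink_rate(1e6, 0.5, 1e-7, 1e-9, 1e-10)          # 1 MHz subchannel
td = transmission_delay(8e6, rate, backhaul_delay=0.002)    # 1 MB content fetched from a neighbor BS
cd = computation_delay(8e6, 50, 2e9)                        # 50 cycles/bit on a 2 GHz allocation
ca = td + cd                                                # eq. (7): content acquisition delay
print(f"rate={rate:.2e} bps, TD={td:.4f}s, CD={cd:.4f}s, CA={ca:.4f}s")
```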
2.3 Cache update model
Base stations in a cache-enabled wireless network are equipped with caches of limited capacity, so after the content service of one period ends, base station n needs to update its cache according to the content request set $\mathcal{R}_n$. First, base station n computes the local request probability of cached content f,

$$q_{n,f} = \frac{c_{n,f}}{\sum_{f'\in\mathcal{F}_n} c_{n,f'}} \qquad (8)$$

where $c_{n,f}$ is the number of requests for content f. Considering the average request probability of content f over T consecutive periods,

$$\bar{q}_{n,f} = \frac{1}{T}\sum_{t=1}^{T} q^{t}_{n,f} \qquad (9)$$

where $q^{t}_{n,f}$ is the local request probability within period t, a caching threshold $\varepsilon_q$ is set and the base station compares $\bar{q}_{n,f}$ with $\varepsilon_q$ to decide whether the currently requested content f should remain cached. Thus, when $\bar{q}_{n,f}\geq\varepsilon_q$, the cache update variable $g_{n,f}=1$ indicates that content f remains cached; when $\bar{q}_{n,f}<\varepsilon_q$, $g_{n,f}=0$ indicates that content f is no longer cached and its cache space is released. The cached content set of base station n, $\mathcal{F}_n$, is accordingly updated to $\mathcal{F}'_n$.
Base station n then looks up the global popularity $p_f$ of every requested but uncached content f; if $p_f$ is larger than the smallest popularity among the contents in $\mathcal{F}'_n$, i.e.

$$p_f > \min_{f'\in\mathcal{F}'_n} p_{f'} \qquad (10)$$

content f is added to $\mathcal{F}'_n$ and $g_{n,f}$ is set to 1. Since each base station has a cache of limited size, the total size of the cached contents cannot exceed the maximum cache capacity $C_{\max}$ of the base station. At the same time, the cache update policy should maximize the hit ratio of the requested contents, which can be converted into maximizing the sum of the popularities of the cached contents:

$$\max \sum_{f\in\mathcal{F}} g_{n,f}\,p_f \quad \text{s.t.} \quad \sum_{f\in\mathcal{F}} g_{n,f}\,D_f \leq C_{\max} \qquad (11)$$
Since user content requests generally exhibit temporal continuity, equation (11) enables the user to achieve a higher QoE within one period.
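A minimal sketch of the cache-update rule of Section 2.3 (threshold-based retention by average request probability, followed by popularity-based admission under the capacity budget). The threshold, capacity and data-structure choices are assumptions for illustration only.

```python
def update_cache(cached, request_counts_per_period, global_popularity, sizes,
                 eps_q=0.05, capacity=1e9):
    """Cache update following Section 2.3.

    cached: set of currently cached content ids
    request_counts_per_period: list of dicts {content_id: request count}, one per period
    global_popularity: dict {content_id: p_f}
    sizes: dict {content_id: D_f in bytes}
    """
    T = len(request_counts_per_period)
    # average local request probability over T periods (eqs. (8)-(9))
    avg_prob = {}
    for counts in request_counts_per_period:
        total = sum(counts.values()) or 1
        for f, c in counts.items():
            avg_prob[f] = avg_prob.get(f, 0.0) + (c / total) / T

    # keep cached content whose average request probability reaches the threshold eps_q
    kept = {f for f in cached if avg_prob.get(f, 0.0) >= eps_q}

    # admit requested-but-uncached content whose global popularity beats the cached minimum (eq. (10))
    candidates = sorted((f for f in avg_prob if f not in kept),
                        key=lambda f: global_popularity.get(f, 0.0), reverse=True)
    used = sum(sizes[f] for f in kept)
    min_pop = min((global_popularity.get(f, 0.0) for f in kept), default=0.0)
    for f in candidates:
        if global_popularity.get(f, 0.0) > min_pop and used + sizes[f] <= capacity:
            kept.add(f)
            used += sizes[f]
    return kept
```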
2.4 Problem modeling
Inspired by widely used QoE metrics, the Mean Opinion Score (MOS) model [24] can be used to measure the quality of content services such as video streaming download or web browsing. Based on equations (7) and (11), the invention designs the MOS model of the nth cell as a linear combination of the content acquisition delay (equation (7)) and the cache-update utility (equation (11)), given in equation (12), where the linear-model parameters $C_{n,1}, C_{n,2}$ keep $MOS_n \in [1,5]$ and the weight factors $w_{n,1}, w_{n,2}$ represent the influence of the content acquisition delay and of the cache update on the MOS, respectively. The higher the $MOS_n$ score of the nth cell, the higher the users' QoE.
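Equation (12) is reproduced only as an image in the original filing; the sketch below therefore uses an assumed linear form that matches its stated properties (decreasing in the content acquisition delay, increasing in the cache-update utility, clipped to [1, 5]), not the patent's exact expression. All parameter values are placeholders.

```python
def mos_score(ca_n: float, ps_n: float,
              c1: float = 5.0, c2: float = 1.0,
              w1: float = 0.5, w2: float = 0.5) -> float:
    """Assumed linear MOS model: c1, c2 scale the score into [1, 5];
    w1, w2 weight the content acquisition delay CA_n (penalty) and the
    cache-update utility PS_n (bonus)."""
    raw = c1 + c2 * (w2 * ps_n - w1 * ca_n)
    return max(1.0, min(5.0, raw))

print(mos_score(ca_n=0.8, ps_n=0.3))   # illustrative values only
```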
The invention aims to maximize the system MOS through multidimensional resource cooperation so as to meet the users' QoE requirements. The multidimensional resource optimization model is therefore

$$\max_{X_n,\,P_n,\,\lambda_n,\,g_n}\; MOS_n \quad \text{s.t.}\; C1\text{--}C7 \qquad (13)$$

where constraints C1 and C2 represent the 0-1 decisions for content delivery; constraint C3 represents the 0-1 decision for channel allocation; constraint C4 requires the transmission power allocated by the base station to subchannel l not to exceed the maximum transmission power; constraint C5 guarantees that the user's receiving rate exceeds the required acquisition rate; constraint C6 requires that the total amount of allocated computing resources not exceed the maximum computing capacity of the base station; and constraint C7 guarantees that the total amount of content after the cache update does not exceed the base station's cache capacity.
The mixed-integer programming problem (13) requires jointly solving four classes of 0-1 decision variables, $x_{n,u,l}, y_{n,u,f}, z_{n,u,f}, g_{n,f}$, together with the discrete variables $\lambda_{n,f}$ and $P_{n,u,l}$. These decision variables make the problem non-convex, so convex optimization methods cannot be applied. Solving problem (13) efficiently and accurately is the basis of the multidimensional resource collaborative allocation scheme, and the invention uses federated deep reinforcement learning to realize the resource optimization of the mobile edge system.
3. Multidimensional resource collaborative optimization method based on federated deep reinforcement learning
The mobile edge system of the invention is a two-layer network architecture that solves the resource collaborative optimization problem on two different time scales. The base stations at the bottom layer use DDQL for local model training to obtain the optimal decision within a short period; the edge node at the upper layer then uses FDQL for global model training to reduce the deviation of distributed decisions over a long period.
3.1 local multidimensional resource collaborative optimization based on DDQL
For the local multidimensional resource collaborative optimization problem, base station n acts as the agent: the problem is modeled as a Markov decision process, DDQL is adopted to interact with the environment by continuous trial and error, and the optimal policy $\pi^{*}$ is found by maximizing the cumulative reward.

The MDP is expressed as a quadruple $\langle S_n, A_n, PR_n, R_n\rangle$, where $S_n$ denotes the state space, $A_n$ the action space, $PR_n$ the state transition probability and $R_n$ the reward function.
State space: the agent needs to know the user and base station information before deciding which action to select, so the state space $S_n$ consists of the user requests and the base station cache status. At time slot i, the system state is

$$s_n^{i} = \big\{r_{n,1}^{i},\dots,r_{n,U_n}^{i},\,c_n^{i}\big\} \qquad (14)$$

where $r_{n,u}^{i}$ denotes the content request state of user u and $c_n^{i}$ the cache state of base station n.
Action space: the action space is the set of actions the agent can take. The action vector of the invention covers spectrum allocation, computing resource allocation and cache updates, so the action space $A_n$ is defined as the multidimensional resource co-optimization mode

$$a_n^{i} = \big\{X_n^{i},\,P_n^{i},\,\lambda_n^{i},\,g_n^{i}\big\} \qquad (15)$$

where $X_n^{i}$ denotes the channel connection matrix, $P_n^{i}$ the power allocation vector, $\lambda_n^{i}$ the computing unit allocation vector and $g_n^{i}$ the updated content cache vector.
Reward function: when the environment is in state $s_n^{i}$ and action $a_n^{i}$ is executed, the system enters the next state $s_n^{i+1}$ and obtains the instant reward $R_n^{i}$. Since the optimization goal of the invention is to maximize the users' QoE, the MOS score is set as the reward function:

$$R_n^{i} = MOS_n^{i} \qquad (16)$$
A mapping from the state space $S_n$ to the action space $A_n$ constitutes the policy $\pi: S_n \rightarrow A_n$. The state-action value function of the action $a_n^{i}=\pi(s_n^{i})$ taken by policy $\pi$ in the current state $s_n^{i}$ can be expressed as

$$Q_\pi\big(s_n^{i},a_n^{i}\big) = \mathbb{E}_\pi\Big[\sum_{j\geq 0}\gamma^{\,j} R_n^{i+j}\,\Big|\,s_n^{i},a_n^{i}\Big] \qquad (17)$$

where $\gamma\in(0,1)$ is the discount factor. According to the Bellman equation, the Q function is updated by

$$Q\big(s_n^{i},a_n^{i}\big) \leftarrow Q\big(s_n^{i},a_n^{i}\big) + \eta\Big[R_n^{i} + \gamma\max_{a}Q\big(s_n^{i+1},a\big) - Q\big(s_n^{i},a_n^{i}\big)\Big] \qquad (18)$$

where $\eta\in(0,1)$ is the learning rate controlling the learning speed.
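For intuition, a tabular version of the update rule (18) can be written as follows; the state and action labels here are toy placeholders, whereas the invention approximates Q with a deep neural network as described next.

```python
def q_update(Q, s, a, reward, s_next, actions, eta=0.1, gamma=0.9):
    """Tabular form of the Bellman update in equation (18):
    Q(s,a) <- Q(s,a) + eta * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + eta * (reward + gamma * best_next - Q.get((s, a), 0.0))

# Toy example (illustrative only):
Q = {}
q_update(Q, s="low_load", a="cache_f1", reward=4.2, s_next="high_load",
         actions=["cache_f1", "evict_f1"])
print(Q)
```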
The invention uses DDQL to search for the local multidimensional resource collaborative optimization strategy, updating the deep neural network parameters $\theta_n$ so that the estimated Q value approaches the Q value corresponding to the optimal strategy. To break the correlation between adjacent data and keep the samples independent, DDQL establishes an experience replay mechanism: each experience $\big(s_n^{i},a_n^{i},R_n^{i},s_n^{i+1}\big)$ obtained by the agent is stored in an experience replay pool that records every interaction with the environment. When the pool is full, new experiences randomly replace old ones. Moreover, to overcome the over-estimation of the Q value during training, DDQL builds a current network for selecting actions and a target network for estimating the value function, with state-action value functions $Q\big(s,a;\theta_n\big)$ and $Q\big(s,a;\theta_n^{-}\big)$, where $\theta_n$ and $\theta_n^{-}$ denote the parameters of the two deep neural networks.
When training starts, DDQL randomly samples a mini-batch from the experience replay pool and feeds it to both the current network and the target network, computing the corresponding Q values by forward propagation; the loss function

$$L(\theta_n) = \mathbb{E}\Big[\big(y_n^{i} - Q(s_n^{i},a_n^{i};\theta_n)\big)^2\Big],\qquad y_n^{i} = R_n^{i} + \gamma\,Q\Big(s_n^{i+1},\arg\max_{a}Q(s_n^{i+1},a;\theta_n);\,\theta_n^{-}\Big) \qquad (20)$$

is then back-propagated through the current network to update the network parameters. Specifically, computing the gradient of (20) with respect to $\theta_n$,

$$\nabla_{\theta_n} L(\theta_n) \qquad (21)$$

the parameter update can be expressed as

$$\theta_n \leftarrow \theta_n - \alpha\,\nabla_{\theta_n} L(\theta_n) \qquad (22)$$

where $\alpha$ denotes the learning rate.
The DDQL-based local multidimensional resource collaborative optimization model training method is given below.
(The pseudocode of this training procedure is reproduced as an image in the original document.)
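Since the pseudocode itself is only available as an image, the following PyTorch-style sketch illustrates the ingredients of the DDQL training described in Section 3.1: an experience replay pool, a current network for action selection, a target network for value estimation, and the double-Q target of equation (20). The network sizes, hyperparameters, environment interface and variable names are assumptions for illustration, not the patent's exact implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Feedforward Q-network; the 300-600-300 hidden sizes follow the selection in Section 4.1."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 300), nn.ReLU(),
            nn.Linear(300, 600), nn.ReLU(),
            nn.Linear(600, 300), nn.ReLU(),
            nn.Linear(300, action_dim))

    def forward(self, s):
        return self.net(s)

def ddql_step(current, target, optimizer, replay, batch_size=32, gamma=0.9):
    """One DDQL update: sample a mini-batch, build the double-Q target of eq. (20),
    back-propagate the loss and apply the gradient step of eqs. (21)-(22)."""
    if len(replay) < batch_size:
        return None
    batch = random.sample(replay, batch_size)
    s, a, r, s_next = map(torch.stack, zip(*batch))   # a must be a LongTensor of action indices
    with torch.no_grad():
        # current network selects the next action, target network evaluates it (limits Q over-estimation)
        next_a = current(s_next).argmax(dim=1, keepdim=True)
        y = r + gamma * target(s_next).gather(1, next_a).squeeze(1)
    q = current(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: the replay pool stores (state, action, reward, next_state) tensors
# produced by the MDP of Section 3.1; the target network is refreshed periodically.
replay = deque(maxlen=10_000)
current, target = QNet(64, 16), QNet(64, 16)
target.load_state_dict(current.state_dict())
optimizer = torch.optim.Adam(current.parameters(), lr=1e-3)
```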
3.2 FDQL-based global multidimensional resource collaborative optimization
As described above, each base station collects the content demand data of its local users and uses DDQL to maximize the short-term utility of the multidimensional resource collaborative optimization strategy; how to help the edge node quickly find a globally optimal strategy, however, remains difficult. If the edge node used the traditional centralized DQL training method, the communication cost would increase and the confidentiality and security of the data would be compromised. The invention therefore designs an FDQL framework on top of DDQL to construct a high-quality global multidimensional resource collaborative optimization model. Each base station participating in federated learning uploads model parameters rather than user content demand data to the edge node, which effectively reduces the data transmission burden and avoids leaking user privacy. For simplicity of presentation, the invention performs multidimensional resource collaborative optimization over the time periods {1, …, t, …, T, T+1, …, T+t, …, 2T, …}. Within a short period t ≠ kT, each base station carries out DDQL model training to obtain a locally optimal strategy. Since users' content requests are generally continuous in time, the local resource allocation strategy of each base station can satisfy the users' QoE over a period of time. After contents have been received over a long period, the user demand is likely to change, so in the long period t = kT the edge node performs FDQL model training to obtain a globally optimal strategy and feeds it back to each base station to enhance the generalization capability of the local DDQL, thereby improving the users' content acquisition experience through a better resource allocation strategy.
Fig. 2 shows the workflow of the FDQL framework. First, each base station updates its parameters $\theta_n$ according to equations (21) and (22) and uploads them to the edge node. The edge node aggregates the N uploaded parameter sets with weights and obtains the global parameters

$$\theta_g = \sum_{n=1}^{N} \frac{w_n}{\sum_{n'=1}^{N} w_{n'}}\,\theta_n \qquad (23)$$

where $w_n$ is the total number of content requests of the users in the nth cell. Next, each base station uses the fed-back global parameters $\theta_g$ to carry out the next round of DDQL training. The system repeats these steps until the FDQL algorithm converges.
The FDQL-based global multidimensional resource collaborative optimization model training method is provided below.
(The pseudocode of this training procedure is reproduced as an image in the original document.)
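A minimal sketch of the aggregation step of equation (23): a request-weighted federated average of the DDQL parameters uploaded by the N base stations. The function and variable names are assumptions; the parameter dictionaries could be, for example, PyTorch state_dicts from the networks sketched above.

```python
def federated_average(local_params, request_counts):
    """Eq. (23): theta_g = sum_n (w_n / sum_n' w_n') * theta_n.

    local_params:   list of N parameter dictionaries (e.g. PyTorch state_dicts)
                    uploaded by the base stations
    request_counts: list of w_n, the total number of content requests of the
                    users in cell n, used as aggregation weights
    """
    total = float(sum(request_counts))
    global_params = {}
    for key in local_params[0]:
        global_params[key] = sum(
            (w / total) * params[key]
            for params, w in zip(local_params, request_counts))
    return global_params

# Each base station then loads the fed-back global parameters for the next DDQL round:
#   current.load_state_dict(federated_average(uploaded_params, request_counts))
```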
4. Simulation analysis
The following experiments, run on a Python platform with 100 Monte Carlo runs per experiment, verify that the proposed multidimensional resource collaborative optimization scheme maximally satisfies the users' QoE. In the simulation scenario, one edge node is deployed at the center of the region, five base stations are uniformly distributed in the edge network, each base station serves 20 users, and the users' content demands are generated randomly. The other simulation parameters are listed in Table 1.
TABLE 1 simulation parameters
(Table 1 is reproduced as an image in the original document.)
4.1 DDQL parameter selection
The invention selects a feedforward neural network with three hidden layers as the DDQL network model; since the numbers of neurons in the hidden layers can be combined in many ways, extensive experiments are needed to tune them. The invention runs five time periods in the 3rd cell and takes the average MOS value as the evaluation index; the results are shown in Table 2. The average MOS value of DDQL is highest when the first hidden layer has 300 neurons, the second 600 and the third 300. Without loss of generality, the following experiments adopt the optimal parameter setting of Table 2 to ensure fast convergence of DDQL.
TABLE 2 DDQL parameter set comparison
(Table 2 is reproduced as an image in the original document.)
4.2 Performance comparison of local multidimensional resource collaborative optimization algorithm
To evaluate the performance of the proposed local multidimensional resource collaborative optimization algorithm, it is compared with the genetic algorithm (GA), particle swarm optimization (PSO) [11] and deep Q-learning (DQL), where PSO is a swarm intelligence optimization algorithm that simulates the behavior of a bird flock with a population of particles.
Fig. 3 compares the training performance of the four algorithms. The convergence performance of the GA, with its greedy selection strategy, is lower than that of the other three algorithms, because GA settles for the best resource allocation of the current iteration and cannot guarantee a globally optimal solution; PSO converges prematurely, i.e. it tends to fall into local minima, so its global optimality and MOS value are lower than those of DQL. Facing the complex state-action space model, the solving capability of DQL is lower than that of DDQL, whereas DDQL converges quickly and stably, attains the highest MOS value and shows the best performance. When the number of training rounds exceeds 600, all four algorithms reach the convergence plateau, but the MOS value of DDQL is 89.58%, 44.41% and 26.39% higher than that of GA, PSO and DQL, respectively, and the obtained multidimensional resource collaborative optimization strategy provides users with the best content service experience.
Fig. 4 tests the local multidimensional resource collaborative optimization performance of the four algorithms over six short periods, with 100 Monte Carlo experiments per period. As shown in Fig. 4, DDQL attains a higher average MOS value than GA, PSO and DQL in every period, giving a better user experience. The decision time is also a key indicator of the users' QoE: a high decision delay slows down content acquisition and thus reduces the network service quality. Corresponding to Fig. 4, Fig. 5 shows the average decision time of the four algorithms in each period. GA needs more iterations to find a feasible solution of the complex problem, so its average decision time is much higher than that of the other algorithms; although the average decision time of PSO is greatly reduced compared with GA, its particles tend to become homogeneous, which reduces the diversity of the solutions, so its convergence speed is lower than that of the reinforcement learning methods. Compared with DQL, the average decision time of DDQL increases slightly, but its average MOS value is the highest, achieving the best multidimensional resource collaborative optimization performance. Table 3 shows that, over the six short periods, the total average decision time of DDQL is about 89.76% and 64.86% lower than that of GA and PSO, respectively.
TABLE 3 Total mean decision time for the four algorithms
(Table 3 is reproduced as an image in the original document.)
Fig. 6 compares the resource collaborative optimization performance of the four algorithms for different numbers of users. As the number of users served by each base station increases, the total content demand increases correspondingly and the average MOS values of the four algorithms gradually decrease, because the higher the total content demand in the system, the fewer multidimensional resources each user is allocated, which slows content acquisition and thus lowers the users' QoE. DDQL adopts a double-network structure and can obtain the optimal resource allocation strategy of the local optimization model (13); its MOS value is therefore about 97.62%, 47.34% and 30.10% higher than that of GA, PSO and DQL, providing the best content service.
4.3 Performance comparison of Global multidimensional resource collaborative optimization Algorithm
In the invention, the edge node uses FDQL for global model training to reduce the deviation of distributed decisions over a long period. To verify the effectiveness of FDQL, it is compared with conventional centralized DDQL using the loss function (20) as the evaluation criterion. Fig. 7 shows that the loss function of centralized DDQL is much higher than that of FDQL in the first 100 training rounds, and the loss functions of the two algorithms are similar in the last 100 rounds, indicating that FDQL converges faster and is more stable.
Fig. 8 tests the global multidimensional resource collaborative optimization performance of the two algorithms over six long periods. In centralized DDQL, each base station uploads all its data to the edge node, which consumes a large amount of multidimensional resources and degrades the global resource collaborative optimization performance; its average MOS value over the six long periods is 13.7% lower than that of FDQL. Fig. 9 shows the corresponding decision time: the Q matrix trained by centralized DDQL is large and consumes a great deal of decision time, whereas FDQL lets each base station train on its local data in parallel while the edge node only performs the federated fusion operation, so the total average decision time is 83.80% lower than that of centralized DDQL.
Finally, Figs. 10 and 11 compare the global multidimensional resource collaborative optimization performance of centralized DDQL and FDQL in different network environments. When the number of users served by each base station increases to 26, Fig. 10 shows that centralized DDQL, facing an enlarged data scale, cannot easily obtain the globally optimal solution and its average MOS value drops sharply. Similarly, when the number of base stations, i.e. the network scale, increases, Fig. 11 shows that centralized DDQL still hits a performance bottleneck and cannot provide good content services for users of a large-scale network. FDQL, in contrast, obtains the optimal multidimensional resource optimization scheme whether the content demand of users in the local cell grows or the global network scale grows, providing users with stable content services and ensuring the efficient operation of the MEC system.
5. Summary of the invention
Aiming at the demand for massive content services in MEC, the invention proposes a multidimensional resource collaborative optimization algorithm based on federated deep reinforcement learning, which helps reduce the system's service delay and improve the user experience. The method models the joint optimization of spectrum, computing and caching resources as a mixed-integer nonlinear programming problem and, based on federated deep reinforcement learning, constructs a two-layer multidimensional resource allocation framework that decouples the original problem. The base stations at the bottom layer adopt DDQL to obtain local resource collaborative optimization strategies within short periods, making full use of local multidimensional resources; the edge node at the upper layer adopts FDQL to train the globally optimal resource collaborative optimization strategy over long periods, so that users obtain the requested content with low delay and high quality. Simulation experiments show that the DDQL algorithm achieves higher local QoE performance than GA, PSO and DQL, and that the method has better global decision stability than centralized DDQL.
Reference documents:
[1] Ghosh A, Maeder A, Baker M, et al. 5G evolution: a view on 5G cellular technology beyond 3GPP Release 15 [J]. IEEE Access, 2019, 7(99): 127639-127651.
[2] Huang J Y, Nkenyereye L, Sung N M, et al. IoT service slicing and task offloading for edge computing [J]. IEEE Access, 2020, 8(14): 11526-11547.
[3] Cheng J, Yuan G, Zhou M, et al. Accessibility analysis and modeling for IoV in an urban scene [J]. IEEE Transactions on Vehicular Technology, 2020, 69(4): 4246-4256.
[4] He S W, Huang W, Wang J H, et al. Cache-enabled coordinated mobile edge network: opportunities and challenges [J]. IEEE Wireless Communications, 2020, 27(2): 204-211.
[5] Yang Z, Du Y, Che C, et al. Energy-efficient joint resource allocation algorithms for MEC-enabled emotional computing in urban communities [J]. IEEE Access, 2019, 7(7): 137410-137419.
[6] Guo H Z, Liu J J, Zhang J. Computation offloading for multi-access mobile edge computing in ultradense networks [J]. IEEE Communications Magazine, 2018, 56(8): 14-19.
[7] Kamel M, Hamouda W, Youssef A. Ultra-dense networks: a survey [J]. IEEE Communications Surveys & Tutorials, 2016, 18(4): 2522-2545.
[8] Abbas N, Zhang Y, Taherkordi A, et al. Mobile edge computing: a survey [J]. IEEE Internet of Things Journal, 2018, 5(1): 450-465.
[9] Zhang K, Leng S P, He Y J, et al. Mobile edge computing and networking for green and low-latency Internet of Things [J]. IEEE Communications Magazine, 2018, 56(5): 39-45.
[10] Li H, Xu H, Zhou C, et al. Joint optimization strategy of computation offloading and resource allocation in multi-access edge computing environment [J]. IEEE Transactions on Vehicular Technology, 2020, 69(9): 10214-10226.
[11] Guo F, Zhang H, Hong J, et al. An efficient computation offloading management scheme in the densely deployed small cell networks with mobile edge computing [J]. IEEE/ACM Transactions on Networking, 2018, 26(6): 2651-2664.
[12] Cao T, Xu C, Du J, et al. Reliable and efficient multimedia service optimization for edge computing-based 5G networks: game theoretic approaches [J]. IEEE Transactions on Vehicular Technology, 2020, 17(3): 1610-1625.
[13] MEC-based V2X cooperative caching and resource allocation in the Internet of Vehicles (in Chinese) [J]. Journal on Communications, 2021, 42(02): 26-36.
[14] Kiran N, Pan C, Wang S, et al. Joint resource allocation and computation offloading in mobile edge computing for SDN based wireless networks [J]. Journal of Communications and Networks, 2020, 22(1): 1-11.
[15] Peng J, Wang C L, Jiang F, Zhao Y, Liu W R. A fast deep Q-learning network edge-cloud migration strategy for vehicular services (in Chinese) [J]. Journal of Electronics & Information Technology, 2020, 42(01): 58-64.
[16] Guo B, Zhang X, Wang Y, et al. Deep-Q-network-based multimedia multi-service QoS optimization for mobile edge computing systems [J]. IEEE Access, 2019, 7: 160961-160972.
[17] Zhang J, Li W J, Zhou F, et al. (in Chinese): 148-161.
[18] Ren J, Wang H, Hou T T, et al. Collaborative edge computing and caching with deep reinforcement learning decision agents [J]. IEEE Access, 2020, 8: 120604-120612.
[19] Liu T, Zhang Y, Zhu Y, et al. Online computation offloading and resource scheduling in mobile-edge computing [J]. IEEE Internet of Things Journal, 2021, 8(8): 6649-6664.
[20] Messaoud S, Bradai A, Ahmed O, et al. Deep federated Q-learning-based network slicing for industrial IoT [J]. IEEE Transactions on Industrial Informatics, 2021, 17(8): 5572-5582.
[21] Yu R, Li P. Toward resource-efficient federated learning in mobile edge computing [J]. IEEE Network, 2021, 35(1): 148-155.
[22] Zhong Z, Zhou Y P, Wu D, et al. P-FedAvg: parallelizing federated learning with theoretical guarantees [J]. IEEE Transactions on Vehicular Technology, 2020, 69(4): 4246-4256.
[23] Wang X, Wang C, Li X, et al. Federated deep reinforcement learning for Internet of Things with decentralized cooperative edge caching [J]. IEEE Internet of Things Journal, 2020, 7(10): 9441-9455.
[24] Rugelj M, Sedlar U, Volk M, et al. Novel cross-layer QoE-aware radio resource allocation algorithms in multiuser OFDMA systems [J]. IEEE Transactions on Communications, 2014, 62(9): 3196-3208.

Claims (4)

1. An FDQL-based multidimensional resource collaborative optimization method in a mobile edge network, the mobile edge computing (MEC) system comprising a plurality of base stations and an edge node, each base station communicating with the edge node and with neighboring base stations, the base stations and the edge node being capable of providing computing and caching services; the method is characterized in that it comprises the following steps: 1) constructing a multidimensional resource allocation model to represent the allocation of spectrum and computing resources and the cache update; 2) optimizing the multidimensional resource allocation model;
in the step 1), the multidimensional resource allocation model is constructed by taking the maximization of the mean opinion score (MOS) as the optimization target;
the MOS model is as follows:
Figure FDA0003382283090000011
wherein the parameters C of the linear modeln,1,Cn,2Make the MOSn∈[1,5]Weight factor wn,1,wn,2Respectively representing the influence degrees of content acquisition delay and cache updating on the MOS; CAnIs U in the nth cellnThe content acquisition delay of each user comprises transmission delay and calculation delay; ps isnIs U in the nth cellnThe base station updates the cache according to the content request set; the nth cell is the range covered by the base station n;
MOS of nth cellnThe higher the score is, the higher the QoE (quality of experience) of the user is, and the multidimensional resource optimization model is maxMOSn
in the step 2),
2.1) the base stations at the bottom layer perform local model training using double deep Q-learning (DDQL) to obtain the optimal decision within a short period:
2.1.1) taking base station n as the agent, the local resource allocation problem is modeled as a Markov decision process (MDP);
2.1.2) DDQL is adopted to interact with the environment by continuous trial and error, and the optimal strategy is found by maximizing the cumulative reward;
2.2) the edge node at the upper layer performs global model training using federated deep Q-learning (FDQL) to reduce the deviation of distributed decisions over a long period:
the multidimensional resource collaborative optimization is performed over the time periods {1, …, t, …, T, T+1, …, T+t, …, 2T, …};
within a short period t ≠ kT, each base station carries out DDQL model training to obtain the locally optimal multidimensional resource allocation strategy;
in the long period t = kT, the edge node carries out FDQL model training to obtain the globally optimal multidimensional resource allocation strategy and feeds it back to each base station to enhance the generalization capability of the local DDQL, thereby improving the users' content acquisition experience with a better resource allocation strategy.
2. The FDQL-based multidimensional resource collaborative optimization method in a mobile edge network according to claim 1, wherein in the step 2.1.1), the Markov decision process (MDP) is expressed as a quadruple $\langle S_n, A_n, PR_n, R_n\rangle$, where $S_n$ denotes the state space, $A_n$ the action space, $PR_n$ the state transition probability and $R_n$ the reward function;
state space: the agent needs to know the user and base station information before deciding which action to select, so the state space $S_n$ consists of the user requests and the base station cache status; at time slot i, the system state is

$$s_n^{i} = \big\{r_{n,1}^{i},\dots,r_{n,U_n}^{i},\,c_n^{i}\big\}$$

where r and c denote content requests and content caches respectively, $r_{n,1}^{i}$ and $r_{n,U_n}^{i}$ denote the states of the 1st and the $U_n$-th user, and $c_n^{i}$ denotes the cache state of base station n;
action space: the action space is the set of actions taken by the agent; the action vector comprises spectrum allocation, computing resource allocation and cache update, so the action space $A_n$ is defined as the multidimensional resource collaborative optimization mode

$$a_n^{i} = \big\{X_n^{i},\,P_n^{i},\,\lambda_n^{i},\,g_n^{i}\big\}$$

where $X_n^{i}$ denotes the channel connection matrix, $P_n^{i}$ the power allocation vector, $\lambda_n^{i}$ the computing unit allocation vector and $g_n^{i}$ the updated content cache vector;
reward function: when the environment is in state $s_n^{i}$ and action $a_n^{i}$ is executed, the system enters the next state $s_n^{i+1}$ and obtains the instant reward $R_n^{i}$; the MOS score is set as the reward function, $R_n^{i} = MOS_n^{i}$;
a mapping from the state space $S_n$ to the action space $A_n$ constitutes the policy $\pi: S_n \rightarrow A_n$;
the action-state value function of the action $a_n^{i}=\pi(s_n^{i})$ taken by policy $\pi$ in the current state $s_n^{i}$ is expressed as

$$Q_\pi\big(s_n^{i},a_n^{i}\big) = \mathbb{E}_\pi\Big[\sum_{j\geq 0}\gamma^{\,j} R_n^{i+j}\,\Big|\,s_n^{i},a_n^{i}\Big]$$

where $\gamma\in(0,1)$ is the discount factor;
according to the Bellman equation, the Q function is updated by

$$Q\big(s_n^{i},a_n^{i}\big) \leftarrow Q\big(s_n^{i},a_n^{i}\big) + \eta\Big[R_n^{i} + \gamma\max_{a}Q\big(s_n^{i+1},a\big) - Q\big(s_n^{i},a_n^{i}\big)\Big]$$

where $\eta\in(0,1)$ is the learning rate controlling the learning speed.
3. The method as claimed in claim 2, wherein in step 2.1.2), DDQL is used to find the local multidimensional resource collaborative optimization strategy, and parameters of the deep neural network are updated
Figure FDA0003382283090000022
To approach the Q value corresponding to the optimal strategy,
Figure FDA0003382283090000023
DDQL establishes experience playback mechanism to transfer the experience obtained by the agent
Figure FDA0003382283090000024
Storing the experience playback pool to record each interaction process of the experience playback pool and the environment;
when the experience playback pool is full, the new experience randomly replaces the old experience;
DDQL constructs a current network for action selection and a target network for evaluation, setting their state-action value functions as $Q(s_n^i, a_n^i; \theta_n)$ and $Q(s_n^i, a_n^i; \theta_n^-)$ respectively, where $\theta_n$ and $\theta_n^-$ both denote deep neural network parameters;
when training starts, DDQL first randomly samples a mini-batch from the experience replay pool and feeds it into the current network and the target network, whose Q values are computed by forward propagation; the loss function is then back-propagated through the current network to update its parameters;
the loss function is
$$L(\theta_n) = \mathbb{E}\big[(y_n^i - Q(s_n^i, a_n^i; \theta_n))^2\big],$$
where $y_n^i$ is the target value computed with the target network; its gradient with respect to the parameter $\theta_n$ is $\nabla_{\theta_n} L(\theta_n)$, and the update of the parameter $\theta_n$ is expressed as
$$\theta_n \leftarrow \theta_n - \alpha \nabla_{\theta_n} L(\theta_n),$$
where $\alpha$ represents the learning rate.
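The following PyTorch sketch illustrates, under stated assumptions, the mechanics claim 3 describes: an experience replay pool, a current network for action selection, a target network for evaluation, and the gradient update of $\theta_n$. The network sizes, the FIFO replay buffer (the claim specifies random replacement when full), and the double-Q target form are illustrative choices, not the patent's exact design.

```python
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, num_actions = 8, 4          # hypothetical dimensions
alpha, gamma = 1e-3, 0.9               # learning rate alpha, discount factor gamma

def make_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))

current_net = make_net()               # Q(s, a; theta_n): action selection
target_net = make_net()                # Q(s, a; theta_n^-): evaluation
target_net.load_state_dict(current_net.state_dict())
# (periodically re-sync: target_net.load_state_dict(current_net.state_dict()))

replay_pool = deque(maxlen=10000)      # experience replay pool (FIFO here for simplicity)
optimizer = torch.optim.Adam(current_net.parameters(), lr=alpha)

def train_step(batch_size=32):
    if len(replay_pool) < batch_size:
        return
    batch = random.sample(replay_pool, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    s = torch.tensor(states, dtype=torch.float32)
    a = torch.tensor(actions, dtype=torch.int64)
    r = torch.tensor(rewards, dtype=torch.float32)
    s_next = torch.tensor(next_states, dtype=torch.float32)

    q = current_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # forward pass, current net
    with torch.no_grad():
        # double-Q target: current net picks the action, target net evaluates it
        a_star = current_net(s_next).argmax(dim=1, keepdim=True)
        y = r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)

    loss = nn.functional.mse_loss(q, y)   # L(theta_n) = E[(y - Q(s,a;theta_n))^2]
    optimizer.zero_grad()
    loss.backward()                        # back-propagation through the current network
    optimizer.step()                       # theta_n <- theta_n - alpha * grad L(theta_n)

# Hypothetical usage: fill the pool with random transitions, then run one update.
for _ in range(64):
    s0 = [random.random() for _ in range(state_dim)]
    s1 = [random.random() for _ in range(state_dim)]
    replay_pool.append((s0, random.randrange(num_actions), random.random(), s1))
train_step()
```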
4. The method for collaborative optimization of multi-dimensional resources based on FDQL in a mobile edge network as claimed in claim 3, wherein in step 2.2) the workflow of the FDQL framework is as follows:
first, each base station uploads its updated parameters $\theta_n$ to the edge node;
the edge node weight-aggregates the $N$ uploaded parameters to obtain the global parameter
$$\theta_g = \sum_{n=1}^{N} \frac{w_n}{\sum_{m=1}^{N} w_m}\, \theta_n,$$
where the subscript $g$ indicates that $\theta_g$ is the global parameter and $w_n$ is the sum of the content requests of the users in the $n$-th cell;
next, each base station performs the next round of DDQL training using the fed-back global parameter $\theta_g$;
these steps repeat until the FDQL algorithm converges.
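A minimal sketch of the aggregation step in claim 4, assuming each base station's parameters can be flattened into a NumPy vector and that the per-cell content request counts serve as the weights $w_n$; the function and variable names are illustrative, not from the patent.

```python
import numpy as np

def aggregate(local_params, request_counts):
    """Weighted aggregation: theta_g = sum_n (w_n / sum_m w_m) * theta_n."""
    weights = np.asarray(request_counts, dtype=float)
    weights /= weights.sum()
    return sum(w * theta for w, theta in zip(weights, local_params))

# Hypothetical usage: 3 base stations, 5-dimensional parameter vectors.
local_params = [np.random.randn(5) for _ in range(3)]
request_counts = [120, 80, 200]          # w_n: sum of user content requests per cell
theta_g = aggregate(local_params, request_counts)
# Each base station would then reload theta_g and run the next DDQL round,
# repeating until the FDQL algorithm converges.
```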
CN202111447130.6A 2021-11-30 2021-11-30 FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network Pending CN114143891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111447130.6A CN114143891A (en) 2021-11-30 2021-11-30 FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111447130.6A CN114143891A (en) 2021-11-30 2021-11-30 FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network

Publications (1)

Publication Number Publication Date
CN114143891A true CN114143891A (en) 2022-03-04

Family

ID=80386175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111447130.6A Pending CN114143891A (en) 2021-11-30 2021-11-30 FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network

Country Status (1)

Country Link
CN (1) CN114143891A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449482A (en) * 2022-03-11 2022-05-06 南京理工大学 Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
CN114449482B (en) * 2022-03-11 2024-05-14 南京理工大学 Heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning
CN114745383A (en) * 2022-04-08 2022-07-12 浙江金乙昌科技股份有限公司 Mobile edge calculation assisted multilayer federal learning method
CN115002212A (en) * 2022-04-12 2022-09-02 广州大学 Combined caching and unloading method and system based on cross entropy optimization algorithm
CN115002212B (en) * 2022-04-12 2024-02-27 广州大学 Combined caching and unloading method and system based on cross entropy optimization algorithm
CN115361688B (en) * 2022-07-13 2023-11-10 西安电子科技大学 Industrial wireless edge gateway optimization layout scheme based on machine learning
CN115361688A (en) * 2022-07-13 2022-11-18 西安电子科技大学 Industrial wireless edge gateway optimization layout scheme based on machine learning
CN115208952A (en) * 2022-07-20 2022-10-18 北京交通大学 Intelligent collaborative content caching method
CN115208952B (en) * 2022-07-20 2023-09-26 北京交通大学 Intelligent collaborative content caching method
CN115080249A (en) * 2022-08-22 2022-09-20 南京可信区块链与算法经济研究院有限公司 Vehicle networking multidimensional resource allocation method and system based on federal learning
CN115756873A (en) * 2022-12-15 2023-03-07 北京交通大学 Mobile edge computing unloading method and platform based on federal reinforcement learning
CN115756873B (en) * 2022-12-15 2023-10-13 北京交通大学 Mobile edge computing and unloading method and platform based on federation reinforcement learning
CN116032757A (en) * 2022-12-16 2023-04-28 缀初网络技术(上海)有限公司 Network resource optimization method and device for edge cloud running scene
CN116032757B (en) * 2022-12-16 2024-05-10 派欧云计算(上海)有限公司 Network resource optimization method and device for edge cloud running scene
CN116321219A (en) * 2023-01-09 2023-06-23 北京邮电大学 Self-adaptive honeycomb base station federation forming method, federation learning method and device
CN116321219B (en) * 2023-01-09 2024-04-19 北京邮电大学 Self-adaptive honeycomb base station federation forming method, federation learning method and device
CN116346921A (en) * 2023-03-29 2023-06-27 华能澜沧江水电股份有限公司 Multi-server collaborative cache updating method and device for security management and control of river basin dam
CN116346921B (en) * 2023-03-29 2024-06-11 华能澜沧江水电股份有限公司 Multi-server collaborative cache updating method and device for security management and control of river basin dam
CN116209015B (en) * 2023-04-27 2023-06-27 合肥工业大学智能制造技术研究院 Edge network cache scheduling method, system and storage medium
CN116209015A (en) * 2023-04-27 2023-06-02 合肥工业大学智能制造技术研究院 Edge network cache scheduling method, system and storage medium
CN117042051B (en) * 2023-08-29 2024-03-08 燕山大学 Task unloading strategy generation method, system, equipment and medium in Internet of vehicles
CN117042051A (en) * 2023-08-29 2023-11-10 燕山大学 Task unloading strategy generation method, system, equipment and medium in Internet of vehicles

Similar Documents

Publication Publication Date Title
CN114143891A (en) FDQL-based multi-dimensional resource collaborative optimization method in mobile edge network
Zhou et al. Incentive-driven deep reinforcement learning for content caching and D2D offloading
Wei et al. Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor–critic deep reinforcement learning
Lin et al. Resource management for pervasive-edge-computing-assisted wireless VR streaming in industrial Internet of Things
CN112020103B (en) Content cache deployment method in mobile edge cloud
Yang et al. Social-energy-aware user clustering for content sharing based on D2D multicast communications
Zhang et al. Joint optimization of cooperative edge caching and radio resource allocation in 5G-enabled massive IoT networks
Shan et al. A survey on computation offloading for mobile edge computing information
Cao et al. Reliable and efficient multimedia service optimization for edge computing-based 5G networks: game theoretic approaches
Ko et al. Joint client selection and bandwidth allocation algorithm for federated learning
Majidi et al. Hfdrl: An intelligent dynamic cooperate cashing method based on hierarchical federated deep reinforcement learning in edge-enabled iot
CN104426979A (en) Distributed buffer scheduling system and method based on social relations
Chua et al. Resource allocation for mobile metaverse with the Internet of Vehicles over 6G wireless communications: A deep reinforcement learning approach
Cha et al. Fuzzy logic based client selection for federated learning in vehicular networks
Fu et al. Traffic prediction-enabled energy-efficient dynamic computing resource allocation in cran based on deep learning
CN113918829A (en) Content caching and recommending method based on federal learning in fog computing network
Xi et al. Real-time resource slicing for 5G RAN via deep reinforcement learning
Balasubramanian et al. FedCo: A federated learning controller for content management in multi-party edge systems
Sun et al. A DQN-based cache strategy for mobile edge networks
CN113821346B (en) Edge computing unloading and resource management method based on deep reinforcement learning
Seid et al. Blockchain-empowered resource allocation in Multi-UAV-enabled 5G-RAN: a multi-agent deep reinforcement learning approach
Lin et al. Joint optimization of preference-aware caching and content migration in cost-efficient mobile edge networks
Zhang et al. Toward intelligent resource allocation on task-oriented semantic communication
Cui et al. Multi-Agent Reinforcement Learning Based Cooperative Multitype Task Offloading Strategy for Internet of Vehicles in B5G/6G Network
Liu et al. Multi-agent federated reinforcement learning strategy for mobile virtual reality delivery networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination