CN114021770A - Network resource optimization method and device, electronic equipment and storage medium - Google Patents

Network resource optimization method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114021770A
CN114021770A (application CN202111089718.9A)
Authority
CN
China
Prior art keywords
gradient
decision tree
model
resource
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111089718.9A
Other languages
Chinese (zh)
Inventor
魏翼飞
公雨
李骏
郭达
张勇
滕颖蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202111089718.9A priority Critical patent/CN114021770A/en
Publication of CN114021770A publication Critical patent/CN114021770A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06313Resource planning in a project environment

Abstract

The application provides a network resource optimization method and apparatus, an electronic device and a storage medium. Collected communication sample resources, calculation sample resources, cache sample resources and user terminal information are processed by a deep deterministic policy gradient model, and the input information, agent action information and reward data information are recorded. The generated data set is then used to train a gradient enhancement decision tree initial model, yielding a gradient enhancement decision tree model capable of optimizing network resources. The gradient enhancement decision tree model can thus rapidly process the current environment data information, including communication, computation and cache resources and user terminal information, to obtain a resource allocation strategy that maximizes the total utility. Network resources can therefore be allocated according to this strategy, making the allocation more reasonable and greatly improving the utilization of network resources.

Description

Network resource optimization method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network resource allocation technologies, and in particular, to a method and an apparatus for optimizing network resources, an electronic device, and a storage medium.
Background
Network slicing refers to the flexible allocation of network resources, dividing the physical network on demand into multiple mutually isolated logical subnets with different characteristics. In a core network or a conventional cellular network, the overall system is designed to support many types of services. However, a virtual wireless network operated by a Mobile Virtual Network Operator (MVNO) can be dedicated to a single service (e.g., video transcoding or map downloading), which provides a better user experience. MVNOs mainly focus on abstracting and virtualizing the physical resources of Infrastructure Providers (InPs) into multiple network slices to satisfy the Quality of Service (QoS) requirements of network Slice Providers (SPs).
The roles of the MVNO, InP and SP are summarized below:
1) The MVNO leases resources such as physical resources and backhaul bandwidth from the InP, generates virtual resources for different slices according to different user requests, and leases the virtual resources to the SP to run its operations.
2) The InP owns the physical network radio resources (e.g., backhaul and spectrum) and may operate the physical network infrastructure.
3) The SP leases the virtual resources to provide users with different services under various QoS requirements.
However, existing network resource allocation methods are not reasonable enough, and with the rapid development of network services the volume of data transmitted over the network has increased greatly, so the network as a whole is prone to slow operation and congestion.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, an apparatus, an electronic device and a storage medium for optimizing network resources, so as to solve or partially solve the above technical problems.
Based on the above purpose, a first aspect of the present application provides a network resource optimization method, including:
collecting communication sample resources, calculation sample resources, cache sample resources and current user terminal information in a network system;
inputting the communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into a deep deterministic strategy gradient model for processing, and outputting agent action information and reward data information;
training a gradient enhancement decision tree initial model by using the environmental data information, the agent action information and the reward data information as training samples to obtain a gradient enhancement decision tree model capable of optimizing network resources;
inputting the current environment data information, the current agent action information and the current reward data information of the network system into a trained gradient enhanced decision tree model for processing, and outputting a resource allocation strategy for maximizing the total utility of the network system by the gradient enhanced decision tree model.
A second aspect of the present application provides a network resource optimization apparatus, including:
an acquisition module configured to acquire communication sample resources, calculation sample resources and cache sample resources in a network system;
the deep certainty strategy gradient processing module is configured to input the communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into a deep certainty strategy gradient model for processing, and output agent action information and reward data information;
the decision tree training module is configured to train a gradient enhancement decision tree initial model by using the environmental data information, the agent action information and the reward data information as training samples to obtain a gradient enhancement decision tree model capable of optimizing network resources;
and the resource allocation processing module is configured to input the current communication resource, the current computing resource, the current cache resource and the current user terminal information of the network system into the gradient enhancement decision tree model for processing, and the gradient enhancement decision tree model outputs a resource allocation strategy for maximizing the total utility of the network system.
A third aspect of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
A fourth aspect of the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect.
From the above, in the network resource optimization method and apparatus, electronic device and storage medium provided by the present application, the collected communication sample resources, calculation sample resources, cache sample resources and user terminal information are used to train the deep deterministic policy gradient model, and the agent action information and reward data information output after training are used to train the gradient enhancement decision tree initial model, yielding a gradient enhancement decision tree model capable of optimizing network resources. The gradient enhancement decision tree model can then rapidly process the current environment data information, current agent action information and current reward data information output by the deep deterministic policy gradient model to obtain the resource allocation strategy that maximizes the total utility. Network resources can therefore be allocated according to this strategy, making the allocation more reasonable and greatly improving the utilization of network resources.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description are only embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a network resource optimization method according to an embodiment of the present application;
fig. 2 is a block diagram of a network resource optimization apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application should have the ordinary meaning understood by those skilled in the art to which the present application belongs. The terms "first", "second" and the like used in the embodiments of the present application do not denote any order, quantity or importance, but are merely used to distinguish one element from another. The word "comprising", "comprises" or the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
With the continuous expansion of wireless communication networks and the diversification of user application requirements, MVNOs urgently need to design systems that account for both QoS and Quality of Experience (QoE) in order to provide satisfactory services to users.
Multi-access edge computing (MEC) refers to deploying edge servers with dedicated computing and cache resources in small cells at the edge of the network; this technology can make full use of network resources to meet users' QoS. Thus, when a user requests a resource, the MEC server can perform the corresponding task in a distributed manner, which saves backhaul bandwidth. Compared with Macro Base Stations (MBS), the edge servers in small base stations are lightweight and have limited resources. Therefore, a feasible resource allocation scheme for the computation and caching tasks requested by users is strongly needed. Furthermore, although 5G technology aims to guarantee the QoE of users and the QoS of networks, finding an optimal scheme for allocating channel resources and bandwidth in a dynamic environment remains a challenge.
Deep Reinforcement Learning (DRL) is a key branch of artificial intelligence; it can identify dynamic environments and has broad application prospects for solving resource allocation problems. DRL methods can address the complex resource allocation problem in time-varying network-slicing networks. Some studies apply DRL methods to manage resources, for example Deep Q Networks (DQN), which are an effective way to jointly schedule resources for users. DQN is suited to discrete action spaces; however, the action space in our work is continuous. Therefore, the resource allocation problem is solved by adopting the Deep Deterministic Policy Gradient (DDPG) method, which combines the actor-critic framework with Deep Neural Networks (DNN).
Ensemble learning combines multiple single models to form a better model. In view of the limitations and high computational cost of DRL, ensemble learning is used to assist the DRL algorithm. The Gradient Boosting Decision Tree (GBDT) is a branch of ensemble learning, and it is proposed that a solution obtained by deep reinforcement learning can be converted into a GBDT model by a distillation method of the kind widely used in the image-processing field. Compared with the DRL method, the GBDT model can reveal the importance of the input parameters and compute its output more economically and quickly.
Based on the above theoretical basis, an embodiment of the present application provides a network resource optimization method, as shown in fig. 1, the method includes the steps of:
step 101, collecting communication sample resources, calculation sample resources, cache sample resources and user terminal information in a network system.
Step 102, inputting the communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into a deep deterministic strategy gradient model (i.e. DDPG) for processing, and outputting agent action information and reward data information.
Step 103, recording the environment data information, the agent action information and the reward data information to generate a data set.
Step 104, training a gradient enhanced decision tree initial model (GBDT initial model) by using the data set to obtain the gradient enhanced decision tree model (GBDT model) capable of optimizing network resources.
Step 105, inputting the current communication resource, the current computing resource, the current cache resource and the current user terminal information of the network system into the trained gradient enhanced decision tree model for processing, the gradient enhanced decision tree model outputting a resource allocation strategy that maximizes the total utility of the network system.
In the above scheme, the Deep Deterministic Policy Gradient (DDPG) model is an algorithm model based on edge computation and caching that takes into account the mobility of the user terminals and the dynamic communication conditions between the MEC servers and the user terminals, in order to jointly optimize task scheduling and resource allocation in a continuous action space.
In order to coordinate network functions and dynamically allocate limited resources, an improved Deep Reinforcement Learning (DRL) method is adopted that fully considers the mobility of the user terminals and the dynamic wireless channel conditions, and a maximized profit function of the Mobile Virtual Network Operator (MVNO) is obtained. Considering the slow convergence rate of the DRL algorithm, DRL is combined with ensemble learning, and the data set generated by the DDPG algorithm is used to train a gradient enhanced decision tree (GBDT) model. The trained GBDT model can closely imitate the behavior of the DDPG agent while producing results faster and more cost-efficiently.
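For illustration, a minimal Python sketch of the overall workflow (steps 101 to 105: collect environment samples, generate a data set with the DDPG agent, distill it into a GBDT model, then use the GBDT model at run time) is given below. The environment object, the DDPG agent interface and the use of scikit-learn's GradientBoostingRegressor as a stand-in for the GBDT model are assumptions for illustration and are not part of the present application.

```python
# Minimal sketch of the optimization pipeline (steps 101-105), assuming a
# hypothetical environment `env` and DDPG agent `ddpg_agent` with act()/step().

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for the GBDT model


def run_pipeline(env, ddpg_agent, num_samples=10_000):
    """Collect (state, action, reward) tuples with DDPG, then distill into GBDT."""
    states, actions, rewards = [], [], []

    # Steps 101-103: let the DDPG agent interact with the network environment
    # (communication / computation / cache resources + user-terminal info)
    # and record the resulting transitions as a data set.
    state = env.reset()
    for _ in range(num_samples):
        action = ddpg_agent.act(state)                 # agent action information
        next_state, reward, done = env.step(action)    # reward data information
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = env.reset() if done else next_state

    # Step 104: train the GBDT model to predict the reward (total utility)
    # from the recorded environment/action features.
    features = np.hstack([np.asarray(states), np.asarray(actions)])
    gbdt = GradientBoostingRegressor(n_estimators=200, max_depth=4)
    gbdt.fit(features, np.asarray(rewards))

    # Step 105: at run time, score candidate allocations with the fast GBDT
    # model and pick the one that maximizes the predicted total utility.
    return gbdt
```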
In some embodiments, the network system comprises: user terminals communicatively connected to each other, mobile communication base stations provided with controllers (i.e., macro base station MBS), and small base stations equipped with multi-access edge computing.
Step 101 specifically includes:
step 1011, the mobile communication base station determines the spectrum bandwidth allocated to the small base station according to the obtained association index between each user terminal with the service request and the small base station, the total spectrum bandwidth of the small base station and the sub-channel allocated to the user terminal, and uses the determined spectrum bandwidth allocated to the small base station as the communication sample resource.
The network system consists of an MBS with a controller and several small base stations with MEC servers; a set of user terminals and a set of small base stations are defined accordingly. The services requested by the user terminals can be divided into computation offloading and content delivery, and the service types of different services can be distinguished provided that the request packet is marked. One subset of user terminals requests computation offloading and another subset requests content delivery. Since a user terminal can only accept one service request at a time, the number of user terminals requesting a service can be defined as N + M = V, where N and M are the numbers of user terminals requesting computation offloading and content delivery, respectively. In addition, a set of requested services (SPs) is defined. All user terminals requesting a service s can be seen as a set V_s, where V = ∪_s V_s and the sets V_s are mutually disjoint.

The coverage areas of the small base stations overlap, so as to ensure that each user terminal having a service request can be associated with a small base station. A binary task establishment indicator is defined for each pair of small base station u and user terminal v_s requesting service s: it equals 1 if user terminal v requesting service s is associated with small base station u, and 0 otherwise. In particular, each user terminal can only be associated with one small base station, i.e. for each user terminal the indicators sum to 1 over all small base stations.

The total spectrum bandwidth of all small base stations may be defined as B, i.e. B = Σ_u B_u, where B_u represents the spectrum bandwidth allocated to small base station u. In practice, the B_u Hz of bandwidth of small base station u can be divided into B_u/b sub-channels that are allocated to the user terminals, where b is the bandwidth of one sub-channel and b_{u,v_s} denotes the bandwidth allocated from small base station u to user terminal v_s. Thus B_u can be expressed as the sum of the bandwidths b_{u,v_s} allocated to the user terminals associated with small base station u.
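For illustration only, the sub-channel bookkeeping described above can be sketched in Python as follows; the function name, the greedy allocation rule and the example numbers are assumptions and are not part of the present application.

```python
# Illustrative bandwidth bookkeeping for one small base station u (assumed names).

def allocate_subchannels(requests_hz, B_u, b):
    """Greedily allocate sub-channels of width b (Hz) out of a budget B_u (Hz).

    requests_hz: dict mapping user-terminal id -> requested bandwidth in Hz.
    Returns a dict user -> allocated bandwidth b_{u,v}, never exceeding B_u.
    """
    num_subchannels = int(B_u // b)          # B_u / b sub-channels in total
    allocation = {}
    for user, wanted in sorted(requests_hz.items()):
        needed = min(int(-(-wanted // b)), num_subchannels)  # ceil, capped by budget
        allocation[user] = needed * b
        num_subchannels -= needed
        if num_subchannels == 0:
            break
    assert sum(allocation.values()) <= B_u   # constraint C2: sum of b_{u,v} <= B_u
    return allocation


# Example: a 20 MHz small base station with 1 MHz sub-channels.
print(allocate_subchannels({"v1": 3e6, "v2": 5.5e6}, B_u=20e6, b=1e6))
```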
Step 1012, the mobile communication base station acquires the computing capability of the small base station allocated to the user terminal as the computing sample resource.
If the small base stations belong to different InPs, the licensed spectrum of each InP is orthogonal, so there is no interference between different small base stations. However, there is interference between user terminals that belong to the same SP and are connected to the same small base station. The average signal-to-interference-plus-noise ratio (SINR) between user terminal v_s and small base station u can be defined as

SINR_{u,v_s} = p_{v_s} h_{u,v_s} / ( Σ_{v'_s ≠ v_s} p_{v'_s} h_{u,v'_s} + σ² ),

where p_{v_s} and p_{v'_s} respectively represent the transmission power of user terminal v_s and of an interfering user terminal v'_s, h_{u,v_s} and h_{u,v'_s} are the corresponding average channel gains, and σ² is the power of the Additive White Gaussian Noise (AWGN).

In addition, the data transmission rate between small base station u and user terminal v_s can be calculated by Shannon theory, i.e.

r_{u,v_s} = b_{u,v_s} log2(1 + SINR_{u,v_s}).

The present application uses a quasi-static assumption, i.e. the environment state remains unchanged during a time slot t. The computing task requested by a user terminal that asks for computation offloading may be described by two quantities: the input data size (in bits) and the computing workload of the requested task (the total number of CPU cycles of the computing task). Further, the small base station u allocates a certain computing power (CPU cycles per second) to the user terminal, so the total execution time of the computing task at small base station u is the number of required CPU cycles divided by the allocated computing power. Thus, the computation rate of the user terminal is the input data size divided by this execution time. The total energy consumption of the computing task can be expressed as e_u multiplied by the number of required CPU cycles, where e_u represents the energy consumption of small base station u per CPU cycle.

Furthermore, the computing power of each small base station is limited, i.e. the computing power allocated to the user terminals associated with small base station u cannot exceed F_u, where F_u is the computing power allocated to small base station u. In practice, the total computing power of all small base stations may be defined as F, i.e. F = Σ_u F_u.
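For illustration, the communication and computation quantities above (SINR, Shannon rate, execution time, computation rate and energy) can be sketched as follows; the variable names and the example values are assumptions, and the formulas are the standard forms assumed in the reconstruction above.

```python
import math


def shannon_rate(bandwidth_hz, tx_power_w, channel_gain, interference_w, noise_w):
    """Data rate between small base station u and user terminal v_s (bits/s)."""
    sinr = (tx_power_w * channel_gain) / (interference_w + noise_w)   # average SINR
    return bandwidth_hz * math.log2(1.0 + sinr)                        # Shannon theory


def offload_metrics(input_bits, cpu_cycles, allocated_cycles_per_s, energy_per_cycle):
    """Execution time, computation rate and energy for one computation task."""
    exec_time_s = cpu_cycles / allocated_cycles_per_s      # required cycles / allocated power
    comp_rate_bps = input_bits / exec_time_s               # bits processed per second
    energy_j = energy_per_cycle * cpu_cycles               # e_u * required cycles
    return exec_time_s, comp_rate_bps, energy_j


# Example: 5 MHz sub-channel, 0.2 W transmit power, modest interference.
rate = shannon_rate(5e6, 0.2, 1e-7, 1e-9, 1e-10)
t, r, e = offload_metrics(input_bits=1e6, cpu_cycles=5e8,
                          allocated_cycles_per_s=2e9, energy_per_cycle=1e-9)
print(round(rate), t, round(r), e)
```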
Step 1013, the mobile communication base station uses the obtained buffer space allocated to the small base station as a buffer sample resource.
The caching task requested by a user terminal that asks for content delivery can be described by the content it requests. The storage space of a small base station is limited, and each small base station can only store a limited number of content types. The caching task is implemented in a first-in first-out manner, i.e. when the latest content is determined to be stored, the oldest stored content is deleted. The probability that a user terminal requests content f follows a Zipf distribution and is modeled as

P_f = f^(-l) / Σ_{f'} f'^(-l),

where the parameter l indicates the popularity of the content and is always a positive value. In our caching model, if the content caching tasks of the user terminals are known, the popularity of the content can be calculated directly from this formula.

In addition, the time needed to download the desired content over the backhaul is taken into account. Thus, the expected backhaul bandwidth saving achieved by caching the content can be expressed as the request probability of that content multiplied by the backhaul resource that would otherwise be consumed, where the request probability can be calculated directly by the content popularity formula above.

A caching strategy is used in the implementation in which the prices of the different contents are known. Furthermore, the buffer space of each small base station is limited, i.e. the content cached at small base station u cannot exceed C_u, where C_u is the buffer space allocated to small base station u. In practice, the total buffer space of all small base stations may be defined as C, i.e. C = Σ_u C_u.
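For illustration, the Zipf popularity model and the expected backhaul-bandwidth saving can be sketched as follows; the function names, the per-content backhaul times and the cache contents are assumptions for illustration.

```python
def zipf_popularity(num_contents, l):
    """P_f = f^(-l) / sum_{f'} f'^(-l), with l > 0 controlling popularity skew."""
    weights = [f ** (-l) for f in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_backhaul_saving(popularity, backhaul_times, cached):
    """Expected backhaul time saved by caching the contents in `cached` (0-based ids)."""
    return sum(popularity[f] * backhaul_times[f] for f in cached)


pop = zipf_popularity(num_contents=5, l=0.8)
saving = expected_backhaul_saving(pop, backhaul_times=[0.4, 0.4, 0.5, 0.6, 0.6],
                                  cached={0, 1})   # FIFO cache currently holds 2 items
print([round(p, 3) for p in pop], round(saving, 3))
```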
Based on the obtained communication sample resources, calculation sample resources, cache sample resources and user terminal information, an integrated architecture is constructed in order to maximize the total profit of the MVNO. The MVNO performs task scheduling and resource allocation, and charges each user terminal a virtual network access fee per bps. After paying the MVNO, the user terminal can access the physical resources and complete its task. On the other hand, the MVNO also pays the InP a spectrum usage cost per Hz. If the requesting user terminal asks for computation offloading, the MVNO may additionally charge that user terminal a computation fee per bps; at the same time, the MVNO pays the small base station the computation energy cost per Joule. If the task is content delivery, the MVNO may charge a content delivery fee per bps; at the same time, the MVNO pays, per byte, a cost associated with the expected savings in backhaul bandwidth. Thus, the profit function for the transmission between a user terminal v_s and a small base station u can be defined as the sum of a communication term, a computation term and a caching term, each consisting of the corresponding revenue minus the corresponding cost.
The total profit of the MVNO can be divided into three components, namely communication revenue, computation revenue and caching revenue.

Communication revenue: the first term of the profit function described above is the communication revenue. It equals the fee that the user terminal pays the MVNO for access to the virtual network, minus the bandwidth cost that the MVNO pays to the InP.

Computation revenue: the second term of the profit function described above is the computation revenue. It equals the fee that the user terminal pays the MVNO for performing the computing task, minus the energy consumption cost that the MVNO pays to the InP.

Caching revenue: the last term of the profit function described above is the caching revenue. It equals the fee that the user terminal pays the MVNO for performing the caching task, minus the cost that the MVNO pays to the InP for the cached content.
The optimization goal of the present disclosure is to maximize the total profit OP of the MVNO, i.e. to maximize the sum of the above profit functions over all user terminals and small base stations, subject to the constraints C1 to C6:

C1 denotes that a user terminal v_s can only be associated with one small base station u; C2 means that the bandwidth allocated by small base station u to all the user terminals associated with it cannot exceed the spectrum resources of small base station u; C3 and C5 respectively ensure that the communication rate and the computation rate of the user terminal v_s meet their requirements; C4 and C6 express that the computing power F_u and the buffer space C_u of each small base station u are limited.
In some embodiments, step 102 specifically includes:
step 1021, setting a first input parameter and a first output parameter of the depth certainty strategy gradient model, wherein the first input parameter at least comprises: the communication sample resource, the calculation sample resource, the buffer sample resource and the ue information, the first output parameter at least includes: agent action information and reward data information.
Step 1022, inputting the obtained communication sample resources, calculation sample resources, cache sample resources and user terminal information into the evolution network, executing cyclically over time, continuously calculating the corresponding first loss function during execution, and adjusting the parameters of the deep deterministic policy model according to the first loss function.
Wherein the depth-deterministic policy model comprises: an evolution network and an evaluation network.
Initializing parameters of an evolution network and an evaluation network in advance; performing cycle execution in the evolution network, continuously calculating a first loss function by using an evaluation network in the execution process, performing minimization processing on the first loss function, and adjusting parameters of the evaluation network according to the minimized loss function; adjusting parameters of the evolution network according to the sampled strategy gradient; and adjusting parameters of the evolution target network and the evaluation target network.
Step 1023, after all processing is finished, acquiring the specific data of the first output parameter output by the deep deterministic policy model.
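For illustration, one update step of the deep deterministic policy gradient model — the evolution (actor) network and the evaluation (critic) network described in steps 1021 to 1023 — can be sketched with PyTorch as follows; the network sizes, learning rates and other hyper-parameters are assumptions and are not values specified by the present application.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Small fully connected network used for both the actor and the critic."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))

    def forward(self, x):
        return self.net(x)


state_dim, action_dim, gamma, tau = 10, 8, 0.99, 0.005
actor = MLP(state_dim, action_dim)                    # "evolution" (actor) network
critic = MLP(state_dim + action_dim, 1)               # "evaluation" (critic) network
actor_t = MLP(state_dim, action_dim)                  # evolution target network
critic_t = MLP(state_dim + action_dim, 1)             # evaluation target network
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)


def ddpg_update(batch):
    """Minimize the critic (first) loss, then follow the sampled policy gradient."""
    s, a, r, s2 = batch
    with torch.no_grad():                              # bootstrapped target value
        q_target = r + gamma * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # policy gradient
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    for net, tgt in ((actor, actor_t), (critic, critic_t)):        # soft target update
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)


# Example call with a random mini-batch of 32 transitions (state, action, reward, next state).
batch = (torch.randn(32, state_dim), torch.randn(32, action_dim),
         torch.randn(32, 1), torch.randn(32, state_dim))
ddpg_update(batch)
```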
A controller deployed at the mobile communication base station can interact with the environment (i.e., collect all system state information) and obtain rewards after performing actions (i.e., after making decisions on all requests), with the goal of maximizing the long-term cumulative return. The process by which the controller explores the optimal policy is as follows: observe the state information s_t ∈ S in time slot t, and then select an action a_t ∈ A according to the policy π(a|s), which represents the probability of selecting an action in this state; after taking action a_t, the agent immediately receives an instant reward. In general, the goal of the MDP is to explore a policy π(a|s) that maximizes the value function, usually expressed as the expected discounted cumulative return calculated by the Bellman equation.
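The value function is not written out explicitly in the text; the standard discounted-return and Bellman-equation form assumed for this kind of MDP formulation is:

```latex
% Standard discounted return and Bellman expectation equation (assumed form).
\begin{aligned}
  G_t &= \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 < \gamma < 1,\\
  V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[ R_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \mid s_t = s \right].
\end{aligned}
```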
Three key elements in reinforcement learning are introduced below: state space, action space, and rewards.
State space: the state space contains two components, namely the available resources of each small base station u ∈ U equipped with an MEC server and the state of each user v ∈ V. The state space at time slot t can be denoted as s_t = {F_u, B_u, C_u, Ω_v}. F_u, B_u and C_u represent the available computation, bandwidth and caching resources of each small base station u equipped with the MEC server. In addition, the user state Ω_v includes the average SINR between the user and the small base station, the input data size (bits) of the computation task, the computation workload (the total number of CPU cycles needed to complete the task), the cache capacity, the content popularity, the user location, and so on.
Action space: the action space is used for small base station selection and resource allocation, with the aim of completing the computation offloading or content delivery tasks. In time slot t, the action a_t consists of the bandwidth, the computing resources and the cache resources that the small base station equipped with the MEC server allocates to each user, together with an indicator representing whether or not the task establishment is performed.
Reward: after taking action a_t, the agent receives the reward R_t. In particular, the reward should correspond to the optimization objective function described above; thus, the reward can be defined as the total MVNO profit obtained in time slot t.
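For illustration, the state s_t = {F_u, B_u, C_u, Ω_v} and the continuous action a_t can be flattened into vectors for the agent as sketched below; the exact feature layout and the example numbers are assumptions.

```python
import numpy as np


def build_state(avail_compute, avail_bandwidth, avail_cache, user_states):
    """s_t = {F_u, B_u, C_u, Omega_v}: per-base-station resources plus per-user state."""
    base = np.concatenate([avail_compute, avail_bandwidth, avail_cache])
    users = np.concatenate([np.asarray(list(u.values()), dtype=float)
                            for u in user_states])
    return np.concatenate([base, users])


def build_action(association, bandwidth, compute, cache):
    """a_t: task-establishment indicator plus bandwidth/compute/cache allocations."""
    return np.concatenate([association, bandwidth, compute, cache])


# Example with 2 small base stations and 1 user terminal.
s_t = build_state(avail_compute=[2e9, 1.5e9], avail_bandwidth=[20e6, 10e6],
                  avail_cache=[8e9, 4e9],
                  user_states=[{"sinr": 15.0, "input_bits": 1e6,
                                "cpu_cycles": 5e8, "popularity": 0.3}])
a_t = build_action(association=[1, 0], bandwidth=[5e6, 0],
                   compute=[1e9, 0], cache=[0, 0])
print(s_t.shape, a_t.shape)
```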
Training samples are created using the DDPG method: the GBDT model trains very quickly, but it cannot learn directly from the environment. The DDPG method enables the optimization apparatus of the present application to obtain the maximum return, or to achieve a specific goal, by learning an optimal policy while continuously interacting with the environment. However, GBDT is a supervised learning model and requires correct labels from the environment. Thus, in our model, training samples containing the environment information as inputs and the output reward information as labels are first created by the DDPG.
In some embodiments, step 104 specifically includes:
step 1041, setting a second input parameter and a second output parameter of the initial model of the gradient enhanced decision tree, wherein the second input parameter includes: environmental data information, agent action information, and reward data information, the second output parameter comprising: a resource allocation policy of a network system that maximizes overall utility.
Step 1042, setting the initial value of the iteration count m to 0 and initializing the additional predictor in the gradient enhanced decision tree initial model.
Step 1043, inputting a first predetermined amount of environment data information, agent action information and reward data information output by the deep deterministic policy gradient model into the gradient enhanced decision tree initial model as training samples for training, incrementing m by 1 for each round of training, stopping the training when the value of m reaches a predetermined threshold, and taking the trained gradient enhanced decision tree initial model as the gradient enhanced decision tree model.
Step 1043 specifically includes:
a group of base learners in the initial model of the gradient enhancement decision tree is designated as a target base learner group;
inputting environment data information, agent action information and reward data information into a gradient enhancement decision tree initial model for training, and calculating a second loss function after training, wherein each training time, corresponding m is added by 1;
calculating a first negative gradient vector of the second loss function;
respectively fitting a second negative gradient vector to each base learner in the target base learner group;
determining the component that best fits the negative gradient vector according to the second negative gradient vector and the determined target base learner group;
updating parameters of an additional predictor according to the component of the most suitable negative gradient vector;
and determining that m is equal to a set threshold value, and taking the final gradient enhancement decision tree initial model as a gradient enhancement decision tree model.
The GBDT is an iterative algorithm based on decision trees. The scalable end-to-end tree boosting system known as XGBoost is an improved GBDT algorithm. In particular, GBDT uses only first-derivative information in the optimization, while the XGBoost algorithm uses the first and second derivatives to perform a second-order Taylor expansion of the cost function. In addition, the complexity of the model can be controlled by adding to the cost function a regularization term containing the number of nodes of each leaf and the score function. In the overall framework, the improved GBDT algorithm is applied to a regression task. A data set comprising n samples is given; it may be represented as D = {(x_i, y_i)} (|D| = n, x_i ∈ F ∪ B ∪ C ∪ Ω, y_i ∈ R), where y_i is expressed as the solution according to the reward function and x_i is represented as the state space of our system model.
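For illustration, fitting a gradient boosted tree regressor on the data set D — environment-state (and action) features with the DDPG reward as label — can be sketched as follows; scikit-learn's GradientBoostingRegressor is used here as a stand-in for the improved GBDT/XGBoost model, and the data shapes and placeholder labels are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Assumed shapes: each row is a flattened environment state (plus agent action),
# and the label is the reward obtained by the DDPG agent for that row.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 18))                                   # placeholder recorded states
y = X[:, 0] * 2.0 - X[:, 5] + rng.normal(scale=0.1, size=5000)   # placeholder rewards

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gbdt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
gbdt.fit(X_train, y_train)                  # the m-iteration additive fitting loop

print("R^2 on held-out DDPG samples:", round(gbdt.score(X_test, y_test), 3))
print("Most important input feature:", int(np.argmax(gbdt.feature_importances_)))
```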
From the above, a state space composed of a large amount of dynamic environment information and an action space containing a large number of continuous values are obtained. The DDPG algorithm is employed to maximize the reward function; the DDPG method uses neural networks to evaluate and select actions, which is more complex and more costly to compute than a tree model. Therefore, combining the DDPG algorithm with the GBDT model can both accelerate convergence and achieve accurate estimation.
Training samples are created using DDPG, and in the GBDT model the environment state parameters serve as input and the rewards serve as output. Thus, with continued training, the GBDT model learns to obtain the maximum reward for the given environment information, with the goal of achieving the same level of accuracy as the DRL agent.
In some embodiments, the method further comprises:
and step A, testing the gradient enhanced decision tree model by taking the second preset amount of environment data information, agent action information and reward data information output by the depth certainty strategy gradient model as a test sample.
And step B, determining the accuracy of the gradient enhancement decision tree model according to the test result.
And step C, when the accuracy is determined to be greater than or equal to a preset accuracy threshold, the obtained gradient enhanced decision tree model is used as a final gradient enhanced decision tree model.
And step D, in response to the fact that the accuracy is smaller than the preset accuracy threshold, the obtained gradient enhancement decision tree model is retrained again by using the test sample until the obtained accuracy is smaller than the preset accuracy threshold, and the retrained gradient enhancement decision tree model is used as a final gradient enhancement decision tree model.
Through the steps, the accuracy of the obtained gradient enhancement decision tree model can be tested, so that the accuracy of the finally obtained gradient enhancement decision tree model can meet the actual requirement, and the precision of the gradient enhancement decision tree model is improved.
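For illustration, steps A to D can be sketched as an accuracy-gated retraining loop; the accuracy metric (R² here) and the threshold value are assumptions, and the commented usage line reuses the objects from the previous sketch.

```python
def ensure_accuracy(gbdt, X_test, y_test, threshold=0.95, max_rounds=3):
    """Re-fit on the test samples until the model meets the accuracy threshold."""
    for _ in range(max_rounds):
        accuracy = gbdt.score(X_test, y_test)     # step B: measure accuracy (R^2)
        if accuracy >= threshold:                 # step C: accuracy is sufficient, keep the model
            return gbdt, accuracy
        gbdt.fit(X_test, y_test)                  # step D: retrain on the test samples
    return gbdt, gbdt.score(X_test, y_test)


# model, acc = ensure_accuracy(gbdt, X_test, y_test)   # reusing objects from the previous sketch
```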
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a network resource optimization device.
Referring to fig. 2, the network resource optimization apparatus includes:
an acquisition module 21 configured to acquire communication sample resources, calculation sample resources, and cache sample resources in the network system;
a deep deterministic policy gradient processing module 22 configured to input the communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into a deep deterministic policy gradient model for processing, and output agent action information and reward data information;
a decision tree training module 23 configured to train a gradient enhancement decision tree initial model by using the environmental data information, the agent action information, and the reward data information as training samples, so as to obtain a gradient enhancement decision tree model capable of optimizing network resources;
and the resource allocation processing module 24 is configured to input the current communication resource, the current computing resource, the current cache resource and the current user terminal information of the network system into a gradient enhanced decision tree model for processing, wherein the gradient enhanced decision tree model outputs a resource allocation strategy for maximizing the total utility of the network system.
In some embodiments, the network system comprises: the system comprises user terminals which are in communication connection with each other, a mobile communication base station provided with a controller and a small base station equipped with multi-access edge calculation;
the acquisition module 21 is configured to:
the mobile communication base station determines the spectrum bandwidth allocated to the small base station according to the obtained association index between each user terminal with the service request and the small base station, the total spectrum bandwidth of the small base station and the sub-channel allocated to the user terminal, and takes the determined spectrum bandwidth allocated to the small base station as a communication sample resource; the mobile communication base station acquires the computing capacity of the small base station distributed to the user terminal as computing sample resources; and the mobile communication base station takes the obtained cache space distributed to the small base station as a cache sample resource.
In some embodiments, the depth deterministic policy gradient processing module 22 is configured to:
setting first input parameters and first output parameters of the depth deterministic strategy gradient model, wherein the first input parameters comprise at least: the communication sample resource, the calculation sample resource, the buffer sample resource and the user terminal information, the first output parameter at least includes: agent action information and reward data information; inputting the obtained communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into an evolution network, performing cyclic execution according to time, continuously calculating a corresponding first loss function in the execution process, and adjusting parameters of a depth certainty strategy model according to the first loss function; and after all the processing is finished, acquiring specific data of a first output parameter output by the depth certainty strategy model.
In some embodiments, the depth deterministic policy model comprises: an evolution network and an evaluation network;
the depth deterministic policy gradient processing module 22 is further configured to:
initializing parameters of an evolution network and an evaluation network in advance; performing cycle execution in the evolution network, continuously calculating a first loss function by using an evaluation network in the execution process, performing minimization processing on the first loss function, and adjusting parameters of the evaluation network according to the minimized loss function; adjusting parameters of the evolution network according to the sampled strategy gradient; and adjusting parameters of the evolution target network and the evaluation target network.
In some embodiments, the decision tree training module 23 is configured to:
setting a second input parameter and a second output parameter of the initial model of the gradient enhancement decision tree, wherein the second input parameter comprises: environmental data information, agent action information, and reward data information, the second output parameter comprising: a resource allocation policy of the network system that maximizes total utility; setting the initial value of the iteration count m to be 0, and initializing an additional predictor in the initial model of the gradient enhancement decision tree; and inputting a first preset amount of environmental data information, agent action information and reward data information output by the depth certainty strategy gradient model into the gradient enhancement decision tree initial model as training samples for training, wherein each training time, the corresponding m is added by 1, and the training is stopped until the value of m reaches a preset threshold value, and the trained gradient enhancement decision tree initial model is used as the gradient enhancement decision tree model.
In some embodiments, the decision tree training module 23 is further configured to:
a group of base learners in the initial model of the gradient enhancement decision tree is designated as a target base learner group; inputting environment data information, agent action information and reward data information into a gradient enhancement decision tree initial model for training, and calculating a second loss function after training, wherein each training time, corresponding m is added by 1; calculating a first negative gradient vector of the second loss function; respectively fitting a second negative gradient vector to each base learner in the target base learner group; determining a component most suitable for the negative gradient vector according to the second gradient vector and the determined target base learning group; updating parameters of an additional predictor according to the component of the most suitable negative gradient vector; and determining that m is equal to a set threshold value, and taking the final gradient enhancement decision tree initial model as a gradient enhancement decision tree model.
In some embodiments, the apparatus further comprises a test module configured to:
testing the gradient enhancement decision tree model by using a second predetermined amount of environment data information, agent action information and reward data information output by the depth certainty strategy gradient model as test samples; determining the accuracy of the gradient enhancement decision tree model according to the test result; when the accuracy is determined to be greater than or equal to a preset accuracy threshold, taking the obtained gradient enhancement decision tree model as the final gradient enhancement decision tree model; and in response to determining that the accuracy is smaller than the preset accuracy threshold, retraining the obtained gradient enhancement decision tree model with the test samples until the obtained accuracy is greater than or equal to the preset accuracy threshold, and taking the retrained gradient enhancement decision tree model as the final gradient enhancement decision tree model.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations as the present application.
The apparatus in the foregoing embodiment is used to implement the corresponding network resource optimization method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any embodiment described above, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the program, the network resource optimization method described in any embodiment above is implemented.
Fig. 3 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding network resource optimization method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiment methods, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the network resource optimization method according to any of the above embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the network resource optimization method according to any of the foregoing embodiments, and have the beneficial effects of corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A method for optimizing network resources, comprising:
collecting communication sample resources, calculation sample resources, cache sample resources and user terminal information in a network system;
inputting the communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into a depth certainty strategy gradient model for processing, and outputting agent action information and reward data information;
recording the environment data information, the agent action information and the reward data information to generate a data set;
training a gradient enhancement decision tree initial model by using the data set to obtain a gradient enhancement decision tree model capable of optimizing network resources;
and inputting the current communication resource, the current computing resource, the current cache resource and the current user terminal information of the network system into the gradient enhancement decision tree model for processing, wherein the gradient enhancement decision tree model outputs a resource allocation strategy for maximizing the total utility of the network system.
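By way of illustration only (not part of the claims), the following minimal sketch shows the claimed pipeline end to end, with the deep deterministic policy gradient model replaced by a stub agent and scikit-learn's gradient boosting regressor standing in for the gradient enhancement decision tree model; all function and class names (collect_environment_state, StubAgent, and so on) are hypothetical assumptions.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.multioutput import MultiOutputRegressor

    rng = np.random.default_rng(0)

    def collect_environment_state():
        # hypothetical 6-dimensional state: spectrum bandwidth, computing capacity,
        # cache space, number of user terminals, aggregate demand, mean channel gain
        return rng.random(6)

    class StubAgent:
        # placeholder for the deep deterministic policy gradient model of claims 3-4
        def act(self, state):
            return rng.random(3)                       # fractions of bandwidth / compute / cache to allocate
        def reward(self, state, action):
            return float(np.dot(state[:3], action))    # stand-in for the total-utility reward

    # 1) run the agent and record (environment, action, reward) into a data set
    agent, records = StubAgent(), []
    for _ in range(500):
        s = collect_environment_state()
        a = agent.act(s)
        records.append((s, a, agent.reward(s, a)))

    # 2) train the tree model to map environment state to the agent's allocation action
    X = np.array([s for s, _, _ in records])
    Y = np.array([a for _, a, _ in records])
    tree_model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50))
    tree_model.fit(X, Y)

    # 3) query the trained model with the current network state to obtain an allocation
    allocation = tree_model.predict(collect_environment_state().reshape(1, -1))[0]
    print("suggested allocation fractions:", allocation)

In a real deployment the stub would be the trained agent of claims 3 and 4, and the recorded rewards could additionally be used to filter or weight the training records.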
2. The method of claim 1, wherein the network system comprises: user terminals which are in communication connection with each other, a mobile communication base station provided with a controller, and a small base station equipped with multi-access edge computing;
the acquiring of communication sample resources, calculation sample resources and cache sample resources in a network system specifically includes:
the mobile communication base station determines the spectrum bandwidth allocated to the small base station according to the obtained association index between each user terminal with the service request and the small base station, the total spectrum bandwidth of the small base station and the sub-channel allocated to the user terminal, and takes the determined spectrum bandwidth allocated to the small base station as a communication sample resource;
the mobile communication base station acquires, as a calculation sample resource, the computing capacity allocated by the small base station to the user terminal;
and the mobile communication base station takes the obtained cache space allocated to the small base station as a cache sample resource.
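As a purely illustrative aid, a single record of the sample resources described in claim 2 might be represented as follows; the field names and units are assumptions, not the patent's data model.

    from dataclasses import dataclass

    @dataclass
    class SampleResources:
        spectrum_bandwidth_hz: float    # communication sample resource: spectrum allocated to the small base station
        compute_capacity_cps: float     # calculation sample resource: CPU cycles per second allocated to the user terminal
        cache_space_bytes: int          # cache sample resource: cache space allocated to the small base station

    sample = SampleResources(20e6, 2.5e9, 512 * 2**20)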
3. The method according to claim 1, wherein the inputting the communication sample resource, the calculation sample resource, the cache sample resource, and the user terminal information into a deep deterministic policy gradient model for processing, and outputting agent action information and reward data information specifically comprises:
setting first input parameters and first output parameters of the deep deterministic policy gradient model, wherein the first input parameters at least comprise: the communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information, and the first output parameters at least comprise: the agent action information and the reward data information;
inputting the obtained communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into an evolution network, performing loop execution over time, continuously calculating a corresponding first loss function during the execution, and adjusting parameters of the deep deterministic policy gradient model according to the first loss function;
and after all the processing is finished, acquiring the specific data of the first output parameters output by the deep deterministic policy gradient model.
4. The method of claim 3, wherein the deep deterministic policy gradient model comprises: an evolution network and an evaluation network;
the inputting of the obtained communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into an evolution network, performing loop execution over time, continuously calculating a corresponding first loss function during the execution, and adjusting parameters of the deep deterministic policy gradient model according to the first loss function specifically comprises:
initializing parameters of an evolution network and an evaluation network in advance;
performing loop execution in the evolution network, wherein during the execution,
continuously calculating a first loss function by using an evaluation network, minimizing the first loss function, and adjusting parameters of the evaluation network according to the minimized loss function;
adjusting parameters of the evolution network according to a sampled policy gradient;
and adjusting parameters of the evolution target network and the evaluation target network.
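A compact sketch of one training step of the model described in claims 3 and 4, assuming the "evolution network" plays the role of the actor and the "evaluation network" the role of the critic in standard deep deterministic policy gradient learning; the layer sizes, hyper-parameters, and variable names below are illustrative assumptions rather than the patented implementation.

    import torch
    import torch.nn as nn

    STATE_DIM, ACTION_DIM, GAMMA, TAU = 6, 3, 0.99, 0.005

    def mlp(in_dim, out_dim):
        return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    actor, critic = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM + ACTION_DIM, 1)      # evolution / evaluation networks
    actor_t, critic_t = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM + ACTION_DIM, 1)  # their target networks
    actor_t.load_state_dict(actor.state_dict())
    critic_t.load_state_dict(critic.state_dict())
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def ddpg_step(s, a, r, s_next):
        # 1) first loss function: temporal-difference error of the evaluation (critic) network
        with torch.no_grad():
            q_target = r + GAMMA * critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
        critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # 2) sampled policy gradient: adjust the evolution (actor) network to increase the Q-value
        actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # 3) soft update of the evolution target network and the evaluation target network
        for net, tgt in ((actor, actor_t), (critic, critic_t)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.data.mul_(1 - TAU).add_(TAU * p.data)

    # one illustrative step on a random minibatch of 32 transitions
    B = 32
    ddpg_step(torch.rand(B, STATE_DIM), torch.rand(B, ACTION_DIM),
              torch.rand(B, 1), torch.rand(B, STATE_DIM))

Under this reading, the first loss function of the claims corresponds to the critic's temporal-difference error, the sampled policy gradient to the actor update, and the final soft update adjusts the two target networks.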
5. The method according to claim 1, wherein the training of the gradient enhancement decision tree initial model by using the environmental data information, the agent action information and the reward data information as training samples to obtain a gradient enhancement decision tree model capable of optimizing network resources specifically comprises:
setting second input parameters and second output parameters of the gradient enhancement decision tree initial model, wherein the second input parameters comprise: the environmental data information, the agent action information and the reward data information, and the second output parameters comprise: a resource allocation strategy that maximizes the total utility of the network system;
setting the initial value of the iteration count m to 0, and initializing an additive predictor in the gradient enhancement decision tree initial model;
and inputting a first preset amount of environmental data information, agent action information and reward data information output by the deep deterministic policy gradient model into the gradient enhancement decision tree initial model as training samples for training, wherein m is incremented by 1 after each training iteration, the training is stopped when the value of m reaches a preset threshold, and the trained gradient enhancement decision tree initial model is taken as the gradient enhancement decision tree model.
6. The method according to claim 5, wherein the inputting of the first preset amount of environment data information, agent action information and reward data information output by the deep deterministic policy gradient model into the gradient enhancement decision tree initial model for training, incrementing m by 1 after each training iteration, stopping the training when the value of m reaches the preset threshold, and taking the trained gradient enhancement decision tree initial model as the gradient enhancement decision tree model specifically comprises:
designating a group of base learners in the gradient enhancement decision tree initial model as a target base learner group;
inputting the environment data information, the agent action information and the reward data information into the gradient enhancement decision tree initial model for training, and calculating a second loss function after the training, wherein m is incremented by 1 after each training iteration;
calculating a first negative gradient vector of the second loss function;
respectively fitting a second negative gradient vector to each base learner in the target base learner group;
determining the component that best fits the negative gradient vector according to the second negative gradient vector and the determined target base learner group;
updating parameters of the additive predictor according to the component that best fits the negative gradient vector;
and when it is determined that m equals the set threshold, taking the final gradient enhancement decision tree initial model as the gradient enhancement decision tree model.
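For reference, a from-scratch sketch of the boosting iteration in claims 5 and 6, assuming a squared-error second loss function (whose negative gradient is the residual) and a single decision-tree base learner per iteration instead of the claimed group of base learners; the iteration limit, learning rate, and names are illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.random((500, 6))                                     # environment state + agent action features
    y = X[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(500)    # reward-like target

    M, LR = 100, 0.1                      # preset threshold for the iteration count m, shrinkage
    F = np.full_like(y, y.mean())         # additive predictor, initialised before the loop
    learners = []

    for m in range(M):                    # m is incremented once per training iteration
        neg_grad = y - F                  # negative gradient of the squared-error loss (the residual)
        h = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)   # fit a base learner to the negative gradient
        F += LR * h.predict(X)            # the fitted component updates the additive predictor
        learners.append(h)

    def predict(X_new):
        return y.mean() + LR * sum(h.predict(X_new) for h in learners)

    print("training MSE:", float(np.mean((y - F) ** 2)))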
7. The method of claim 1, further comprising:
testing the gradient enhancement decision tree model by taking a second preset amount of environment data information, agent action information and reward data information output by the deep deterministic policy gradient model as a test sample;
determining the accuracy of the gradient enhancement decision tree model according to the test result;
when the accuracy is determined to be greater than or equal to a preset accuracy threshold, taking the obtained gradient enhancement decision tree model as a final gradient enhancement decision tree model;
and in response to determining that the accuracy is smaller than the preset accuracy threshold, retraining the obtained gradient enhancement decision tree model by using the test sample until the obtained accuracy is greater than or equal to the preset accuracy threshold, and taking the retrained gradient enhancement decision tree model as the final gradient enhancement decision tree model.
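The accuracy check of claim 7 can be sketched as a short loop; the threshold value, the use of a coefficient-of-determination score as the accuracy measure, and the bound on retraining rounds are assumptions made for illustration.

    def evaluate_and_retrain(model, X_test, Y_test, threshold=0.9, max_rounds=10):
        # accept the model once its test accuracy reaches the preset threshold,
        # otherwise retrain it with the test sample (claim 7)
        for _ in range(max_rounds):
            accuracy = model.score(X_test, Y_test)   # R^2 score for scikit-learn regressors
            if accuracy >= threshold:
                break
            model.fit(X_test, Y_test)
        return model                                 # final gradient enhancement decision tree model

A deployment would typically also keep a separate validation set so that the score is not computed on the same data the model was just retrained on.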
8. A network resource optimization apparatus, comprising:
an acquisition module, configured to collect communication sample resources, calculation sample resources and cache sample resources in a network system;
the deep deterministic policy gradient processing module is configured to input the communication sample resource, the calculation sample resource, the cache sample resource and the user terminal information into a deep deterministic policy gradient model for processing, and output agent action information and reward data information;
the decision tree training module is configured to train a gradient enhancement decision tree initial model by using the environmental data information, the agent action information and the reward data information as training samples to obtain a gradient enhancement decision tree model capable of optimizing network resources;
and the resource allocation processing module is configured to input the current communication resource, the current computing resource, the current cache resource and the current user terminal information of the network system into the gradient enhancement decision tree model for processing, and the gradient enhancement decision tree model outputs a resource allocation strategy for maximizing the total utility of the network system.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202111089718.9A 2021-09-14 2021-09-14 Network resource optimization method and device, electronic equipment and storage medium Pending CN114021770A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089718.9A CN114021770A (en) 2021-09-14 2021-09-14 Network resource optimization method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089718.9A CN114021770A (en) 2021-09-14 2021-09-14 Network resource optimization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114021770A true CN114021770A (en) 2022-02-08

Family

ID=80054689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089718.9A Pending CN114021770A (en) 2021-09-14 2021-09-14 Network resource optimization method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114021770A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339892A (en) * 2022-03-17 2022-04-12 山东科技大学 DQN and joint bidding based two-layer slice resource allocation method
CN114760639A (en) * 2022-03-30 2022-07-15 深圳市联洲国际技术有限公司 Resource unit allocation method, device, equipment and storage medium
CN115017435A (en) * 2022-06-28 2022-09-06 中国电信股份有限公司 Method and device for determining cache resources, nonvolatile storage medium and processor
CN115412401A (en) * 2022-08-26 2022-11-29 京东科技信息技术有限公司 Method and device for training virtual network embedding model and virtual network embedding
CN115412401B (en) * 2022-08-26 2024-04-19 京东科技信息技术有限公司 Method and device for training virtual network embedding model and virtual network embedding
CN115421930B (en) * 2022-11-07 2023-03-24 山东海量信息技术研究院 Task processing method, system, device, equipment and computer readable storage medium
CN115421930A (en) * 2022-11-07 2022-12-02 山东海量信息技术研究院 Task processing method, system, device, equipment and computer readable storage medium
CN115460567B (en) * 2022-11-09 2023-03-24 清华大学 Data processing method, data processing device, computer equipment and storage medium
CN115460567A (en) * 2022-11-09 2022-12-09 清华大学 Data processing method, data processing device, computer equipment and storage medium
CN116738239A (en) * 2023-08-11 2023-09-12 浙江菜鸟供应链管理有限公司 Model training method, resource scheduling method, device, system, equipment and medium
CN116755862A (en) * 2023-08-11 2023-09-15 之江实验室 Training method, device, medium and equipment for operator optimized scheduling model
CN116738239B (en) * 2023-08-11 2023-11-24 浙江菜鸟供应链管理有限公司 Model training method, resource scheduling method, device, system, equipment and medium
CN116755862B (en) * 2023-08-11 2023-12-19 之江实验室 Training method, device, medium and equipment for operator optimized scheduling model

Similar Documents

Publication Publication Date Title
CN114021770A (en) Network resource optimization method and device, electronic equipment and storage medium
CN111031102B (en) Multi-user, multi-task mobile edge computing system cacheable task migration method
JP5664882B2 (en) User scheduling and transmission power control method and apparatus in communication system
CN112291793B (en) Resource allocation method and device of network access equipment
CN108601074B (en) Network resource allocation method and device based on heterogeneous joint cache
CN112422644B (en) Method and system for unloading computing tasks, electronic device and storage medium
CN112615731B (en) Method and device for distributing multi-operator combined network slice resources
Li et al. Method of resource estimation based on QoS in edge computing
CN111629390B (en) Network slice arranging method and device
KR101924628B1 (en) Apparatus and Method for controlling traffic offloading
CN113747450B (en) Service deployment method and device in mobile network and electronic equipment
CN114090108A (en) Computing task execution method and device, electronic equipment and storage medium
US20230156520A1 (en) Coordinated load balancing in mobile edge computing network
CN113271221B (en) Network capacity opening method and system and electronic equipment
CN109412971B (en) Data distribution method based on action value function learning and electronic equipment
CN115484304B (en) Lightweight learning-based live service migration method
CN117580132B (en) Heterogeneous network access method, device and equipment for mobile equipment based on reinforcement learning
CN115665867B (en) Spectrum management method and system for Internet of Vehicles
CN113141634B (en) VR content caching method based on mobile edge computing network
CN116739440B (en) Method and device for evaluating intelligent network, electronic equipment and storage medium
Li et al. A Task Offloading Decision and Resource Allocation Algorithm Based on DDPG in Mobile Edge Computing
CN116528004A (en) Video pushing method, device, equipment and storage medium
CN117311991A (en) Model training method, task allocation method, device, equipment, medium and system
CN117715126A (en) Network slice switching method and device, storage medium and electronic equipment
CN117651344A (en) Network resource sharing method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination