CN112188600A - Method for optimizing heterogeneous network resources by using reinforcement learning - Google Patents

Method for optimizing heterogeneous network resources by using reinforcement learning Download PDF

Info

Publication number
CN112188600A
Authority
CN
China
Prior art keywords
learning
cre
reward function
abs
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011002522.7A
Other languages
Chinese (zh)
Other versions
CN112188600B (en
Inventor
李君
李磊
仲星
朱明浩
李正权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ictehi Technology Development Co ltd
Binjiang College of Nanjing University of Information Engineering
Original Assignee
Binjiang College of Nanjing University of Information Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Binjiang College of Nanjing University of Information Engineering filed Critical Binjiang College of Nanjing University of Information Engineering
Priority to CN202011002522.7A priority Critical patent/CN112188600B/en
Publication of CN112188600A publication Critical patent/CN112188600A/en
Application granted granted Critical
Publication of CN112188600B publication Critical patent/CN112188600B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/02Power saving arrangements
    • H04W52/0203Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H04W52/0206Power saving arrangements in the radio access network or backbone network of wireless communication networks in access points, e.g. base stations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/06Testing, supervising or monitoring using simulated traffic
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a method for optimizing heterogeneous network resources by using reinforcement learning, belonging to the technical field of communication. The method integrates reinforcement learning with convex optimization theory: according to the relevance of the actions, namely the ABS, CRE and small-base-station dormancy strategies, the action space is divided; and, aiming at the problem that the system energy efficiency is too large in order of magnitude to serve directly as the reward value during reinforcement-learning modeling, the reward function value is redesigned by first taking the negative and then taking the reciprocal as the new reward value. The invention reduces the action space of reinforcement learning, while the convex optimization theory guarantees system convergence and accelerates the convergence of reinforcement learning. Simulation experiments show that the method converges with low complexity, and its convergence speed is improved by 60% compared with conventional tabular Q-Learning while almost reaching the theoretical value of the system energy efficiency.

Description

Method for optimizing heterogeneous network resources by using reinforcement learning
Technical Field
The invention belongs to the technical field of communication, and particularly relates to a method for optimizing heterogeneous network resources by using reinforcement learning.
Background
As the number of wireless access devices increases, higher demands are placed on the communication capacity of network systems. One effective way to address this is to build heterogeneous networks, in which eICIC (enhanced inter-cell interference coordination) is introduced to overcome the interference problem effectively and to improve the signal-to-interference-plus-noise ratio between mobile devices and base stations. At the same time, more stringent requirements are placed on the performance and energy efficiency of heterogeneous networks. As the complexity of heterogeneous networks continues to increase, optimization of energy efficiency faces more and more challenges and is one of the hotspots of communication-network research, especially for heterogeneous networks equipped with 5G base stations. The key question is how to configure heterogeneous network resources effectively so that the energy efficiency of the network system is maximized.
Existing work mainly focuses on jointly considering features such as Almost Blank Subframes (ABS), Cell Range Expansion (CRE) and small-cell dormancy policies to configure system energy efficiency. Many scholars end up formulating a non-convex NP-hard problem, which is then converted into a convex problem by relaxation and the Karush-Kuhn-Tucker (KKT) conditions. The most effective approach considers ABS, CRE and the base-station dormancy strategy jointly and splits the problem into three sub-problems that treat ABS, CRE and small-cell dormancy independently; each sub-problem is convex, and according to convex optimization theory the solution of the original non-convex NP-hard problem is obtained by cyclically iterating the solutions of the three sub-problems. The disadvantage of this scheme is that the traditional mathematical methods still require a large amount of computation when actually solving the sub-problems, and the computation process is quite complex, which limits the practical application of the scheme.
In recent years, machine learning techniques have been applied increasingly in many fields, such as big-data analysis, precise advertisement placement and image classification. At present, many scholars introduce machine learning into communication systems for resource-optimization research, mainly focusing on deep learning and reinforcement learning.
Deep learning, built on deep neural networks, has the advantage of good fitting performance: a deep learning method can closely approximate the relationship between heterogeneous network resources and system performance, thereby maximizing heterogeneous-network performance. The disadvantage is that neural networks are prone to overfitting and to problems with learning speed. Reinforcement learning has the advantage that, like deep learning, it can adopt a model-free scheme, and it can also adopt a model-based scheme to solve practical problems, which makes the solution of specific problems more efficient and timely.
Some scholars have mapped the relationships between base stations, and between base stations and users, in a heterogeneous network onto graph theory, and then, combining reinforcement learning with graph theory, decomposed the initial Q-Learning problem into several Q-Learning sub-problems to solve the network resource allocation and optimize system performance.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defect of an excessively large action space when reinforcement learning is applied directly to heterogeneous network resource allocation, the invention provides a method for optimizing heterogeneous network resources by using reinforcement learning, whose convergence speed is improved by 60% compared with conventional tabular Q-Learning while almost reaching the theoretical value of the system energy efficiency.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme:
a method for optimizing heterogeneous network resources by using reinforcement learning comprises the following steps:
step 1, establishing a Markov decision process according to an energy efficiency target of a heterogeneous network to be optimized;
step 2, designing traditional Q-Learning according to Markov decision process;
step 3, aiming at the problem that the reward function value in Q-Learning is too large in order of magnitude, redesigning the reward function value by first taking the negative and then taking the reciprocal, compressing the reward function value to (-1, 0);
step 4, according to the relevance of the actions, namely the ABS, CRE and small-base-station dormancy strategies, dividing the conventional Q-Learning action space into three sub-Q-Learning action spaces;
step 5, carrying out loop iteration on the stable solutions obtained by the three sub-Q-Learnings; to accelerate convergence, the stable solution of each loop iteration is not necessarily the optimal solution of the three sub-Q-Learnings;
step 6, substituting the solution of each sub-problem into the conditions for solving the two subsequent sub-problems, so that the solutions of the three sub-problems reach a stable state simultaneously through mutual loop iteration; combining the stable solutions of the three sub-problems and outputting the optimal solution A_ABSo, A_CREo and A_Picoo of the original problem.
Further, in step 1, a Markov decision process (S, A, P, R) is established; specifically, S is defined as the state space, i.e. the set of user positions in the heterogeneous network cell; A is defined as the action space, i.e. the set of actions the agent can select in state S; P is defined as the state-transition probability, i.e. P(s_{t+1} = s' | s_t = s, a_t = a); and R is defined as the reward function.
Further, in step 3, the reward function value is redesigned by first taking the negative and then taking the reciprocal, compressing the reward value to (-1, 0), namely
R = 1/(-E) = -1/E,
where E is the system energy-efficiency function; this mapping keeps the reward function consistent with the system energy efficiency.
Further, in step 4, the conventional Q-Learning action space is divided into three sub-Q-Learning action spaces, i.e. A is decomposed into A_ABS, A_CRE and A_Pico, which are in turn the action spaces for optimizing the ABS, CRE and small-base-station dormancy strategies; A_ABS is defined as the set of ABS configurations, A_CRE as the set of CRE configurations, and A_Pico as the set of small-base-station dormancy strategies. The dormancy-strategy solutions of ABS, CRE and the small base stations are then solved respectively by the three sub-Q-Learnings.
Further, in step 5, the loop iteration satisfies
R_ABS ~ P(R|S, A_ABS) ≤ R_ABSo ~ P(R|S, A_ABSo),
R_CRE ~ P(R|S, A_CRE) ≤ R_CREo ~ P(R|S, A_CREo),
R_Pico ~ P(R|S, A_Pico) ≤ R_Picoo ~ P(R|S, A_Picoo),
where A_ABSo, A_CREo and A_Picoo are the optimal actions of the three sub-Q-Learnings.
The principle of the invention is as follows: the method decomposes the initial problem into several sub-problems according to the relevance of the configured resources, and obtains the solution of the initial problem by loop-iterating the sub-problem solutions. The sub-problems are solved with Q-Learning instead of traditional mathematical methods. The initial problem is mapped into the reinforcement-learning domain, the action space is divided according to the relevance of the actions, the original Q-Learning is decomposed into several sub-Q-Learnings according to this division criterion, and the optimal strategies of the sub-Q-Learnings are loop-iterated to obtain the optimal strategy of the initial Q-Learning. The system energy efficiency, used as the reward function, is redesigned: it is first negated and then its reciprocal is taken, which compresses the reinforcement-learning reward value to (-1, 0) while keeping the new reward function consistent with the system energy-efficiency value.
Beneficial effects: compared with the prior art, and aiming at the defect of an excessively large action space when reinforcement learning is applied directly to heterogeneous network resource configuration, the method for optimizing heterogeneous network resources by using reinforcement learning integrates reinforcement learning with convex optimization theory, divides the action space according to the relevance of the actions, namely the ABS, CRE and small-base-station dormancy strategies, and, aiming at the problem that the system energy efficiency is too large in order of magnitude to serve as the reward value during reinforcement-learning modeling, redesigns the reward function value by first taking the negative and then taking the reciprocal as the new reward value. The invention reduces the action space of reinforcement learning, while the convex optimization theory guarantees system convergence and accelerates the convergence of reinforcement learning; simulation experiments show that the method converges with low complexity, and its convergence speed is improved by 60% compared with conventional tabular Q-Learning while almost reaching the theoretical value of the system energy efficiency.
Drawings
FIG. 1 is a flow chart of the method construction process of the present invention;
FIG. 2 is a schematic diagram of the iterative operation of the sub-Q-Learning loop of the present invention;
FIG. 3 is a graph of convergence rate of the conventional Q-Learning method under the same parameter setting;
FIG. 4 is a graph of convergence rate for the method of the present invention under the same parameter settings;
FIG. 5 is a system energy efficiency diagram of the method of the present invention.
Detailed Description
The present invention will be further described with reference to the following embodiments.
As shown in fig. 1-5, a method for optimizing heterogeneous network resources by reinforcement learning includes the following steps:
step 1: establishing a Markov Decision Process (MDP) (S, A, P, R) according to the heterogeneous network energy efficiency target needing to be optimized, wherein S is defined as a state space, namely a set of user positions in a heterogeneous network cell; define A as the action space, i.e. the set of actions the agent chooses in the case of state S, and define P as the transition state, i.e. the state of transition
P(st+1=s'|st=s,atA); r is defined as the reward function.
Step 2: designing a traditional Q-Learning according to a Markov decision process;
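To make step 2 concrete, the following is a minimal sketch of conventional tabular Q-Learning, not the claimed method itself. It assumes a hypothetical environment object `env` that exposes a joint action list `env.actions`, a `reset()` method returning a state (e.g. a snapshot of user positions) and a `step()` method returning the next state, the reward and a termination flag; these names, and the default parameter values of 0.1 (matching the simulation settings described later), are illustrative assumptions.

```python
import random
from collections import defaultdict


def q_learning(env, episodes, alpha=0.1, gamma=0.1, epsilon=0.1):
    """Conventional tabular Q-Learning over the full joint action space A."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value

    def choose_action(state):
        # epsilon-greedy selection over the (large) joint action space
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()            # e.g. current user positions in the cell
        done = False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)   # reward: reshaped energy efficiency
            best_next = max(Q[(next_state, a)] for a in env.actions)
            # standard temporal-difference update
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

With the joint ABS/CRE/sleep action space, `env.actions` becomes very large, which is exactly the drawback that the action-space division in step 4 is meant to remove.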
and step 3: for number of reward function values in Q-LearningRedesigning the value of the reward function when the level is too large, taking the negative number and then taking the reciprocal, and compressing the value of the reward function to (-1,0), namely
Figure BDA0002694823760000041
The energy efficiency function of the E system simultaneously ensures the consistency of the reward function and the system energy efficiency;
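As a minimal sketch of this reshaped reward, assuming the system energy efficiency E is a positive scalar greater than 1 so that -1/E falls in (-1, 0):

```python
def reshaped_reward(energy_efficiency: float) -> float:
    """Map the system energy efficiency E (assumed > 1) to a reward in (-1, 0).

    Taking the negative first and then the reciprocal gives R = 1/(-E) = -1/E,
    which is monotonically increasing in E, so maximizing the reward remains
    equivalent to maximizing the system energy efficiency.
    """
    return -1.0 / energy_efficiency
```

For example, an energy efficiency of 100 maps to a reward of -0.01 and an energy efficiency of 1000 maps to -0.001, so the larger energy efficiency still receives the larger reward while the reward magnitude stays bounded.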
and 4, step 4: according to the dependence of the action, namely ABS, CRE and small base station sleep strategy, the traditional Q-Learning action space is divided into three sub-Q-Learning action spaces, namely A is divided into
Figure BDA0002694823760000042
Figure BDA0002694823760000043
And
Figure BDA0002694823760000044
in turn, optimizing the action space set of ABS, CRE and small cell dormancy strategy, defining
Figure BDA0002694823760000045
For configuration set of ABS, define
Figure BDA0002694823760000046
For configuration set of CRE, define
Figure BDA0002694823760000047
And setting a dormancy strategy set for the small base station. Respectively solving dormancy strategy solutions of ABS, CRE and small base station
Figure BDA0002694823760000051
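The following sketch illustrates this action-space division: instead of one Q-table over the joint space A_ABS × A_CRE × A_Pico, three sub-Q-Learning agents each maintain a table over a single sub-space (ABS configurations, CRE biases, or small-cell sleep patterns). The class and method names are illustrative assumptions, not part of the claimed method.

```python
import random
from collections import defaultdict


class SubQLearning:
    """One sub-Q-Learning agent over a single sub action space."""

    def __init__(self, actions, alpha=0.1, gamma=0.1, epsilon=0.1):
        self.actions = list(actions)          # e.g. the candidate ABS ratios
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = defaultdict(float)           # Q[(state, sub_action)] -> value

    def select(self, state):
        # epsilon-greedy exploration within this sub action space only
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return self.best(state)

    def update(self, state, action, reward, next_state):
        best_next = max(self.Q[(next_state, a)] for a in self.actions)
        self.Q[(state, action)] += self.alpha * (
            reward + self.gamma * best_next - self.Q[(state, action)]
        )

    def best(self, state):
        # greedy (stable) sub-action for the current state
        return max(self.actions, key=lambda a: self.Q[(state, a)])
```

Three such agents, one each for A_ABS, A_CRE and A_Pico, replace the single table over the joint space, so the total table size grows with the sum rather than the product of the sub-space sizes.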
And 5: the loop iteration process is that loop iteration is carried out on stable solutions obtained by three sub Q-Learning. To speed up convergence, the stable solution for each iteration of the loop is not necessarily the optimal solution for the three sub-Q-Learning, i.e., the optimal solution for the loop is obtained
RABS~P(R|S,AABS)≤RABSo~P(R|S,AABSo),
RCRE~P(R|S,ACRE)≤RCREo~P(R|S,ACREo),
RCRE~P(R|S,ACRE)≤RPicoo~P(R|S,APicoo) Wherein A isABSo,ACREoAnd
Apicoois the optimal action of the three sub-Q-Learning;
step 6: the solution solved by each sub-problem is brought into the condition for solving the two following sub-problems, the solutions of the three sub-problems simultaneously reach a stable state through mutual loop iteration, the stable solutions of the three sub-problems are combined, and the optimal solution A of the original problem is outputABSo,ACREoAnd Apicoo
FIG. 1 is a flow chart of the construction process of the method of the present invention. A complex problem has a high-dimensional action space under conventional tabular Q-Learning, so applying Q-Learning directly is impractical. As shown in FIG. 1, an MDP is established according to the energy efficiency to be optimized, and conventional tabular Q-Learning is set up. To deal with the excessively large action space, the relevance of the actions is analyzed according to the system energy-efficiency requirement and the action space is divided. The original tabular Q-Learning is decomposed into three sub-Q-Learnings, each of which finds its own action to be optimized. When the solutions of the three sub-Q-Learnings become stable during the loop iteration, they are combined and output, giving a solution of the original Q-Learning.
FIG. 2 is a flow chart of the loop iteration of the three sub-Q-Learnings: the current solution of each sub-Q-Learning is updated from the solution of the previous loop iteration, and that solution is then used as a condition for the two sub-Q-Learnings solved next; through the loop iterations the solutions of the three sub-problems reach a stable state simultaneously, and the stable solutions of the three sub-problems are combined to produce the optimal solution of the original problem for output.
Based on the flow charts of FIG. 1 and FIG. 2, in the simulation experiments the number of users was set to 50, 100, 150 and 200 respectively, with users entering the cell at random. The wireless channel is modelled with deterministic path-loss attenuation and a random shadow-fading model, and the system bandwidth is set to 10 MHz. FIG. 3 and FIG. 4 show the relationship between the number of iterations and the accuracy for the conventional tabular Q-Learning method and for the proposed method with improved reinforcement-learning actions (TQL), where the learning rate, discount factor and greedy rate are all set to 0.1. FIG. 3 shows that the Q-Learning method converges after about 80 × 10000 = 800000 iterations under the different load conditions, while FIG. 4 shows that the proposed TQL method converges after about 800 × 400 = 320000 iterations. In FIGS. 3-4, Accuracy denotes the accuracy, Learning rate the learning rate, Discount factor the discount factor, Greedy rate the greedy rate, and Iteration steps the number of iteration steps. As can be seen from FIG. 3 and FIG. 4, the convergence speed of the proposed TQL method is improved by about 60% compared with the conventional Q-Learning method.
FIG. 5 compares the optimization of heterogeneous-network energy efficiency by the proposed TQL method with the conventional Q-Learning method and the ADP ES IC method, where Energy Efficiency denotes the energy efficiency and UEs denotes the number of users. FIG. 5(a) shows that the system energy efficiency achieved by the proposed method is already very close to the theoretical value and, at the same time, far exceeds the performance of the ADP ES IC method proposed by other scholars. FIG. 5(b) shows the gap between the proposed energy-efficiency optimization and the theoretical optimum; the gap arises mainly because, in a few individual states, the proposed method finds a near-optimal rather than the optimal solution, and FIG. 5(b) verifies that the resulting loss of system energy efficiency is small.
The above description is only a preferred embodiment of the present invention; it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principles of the present invention, and such modifications and variations shall also be regarded as falling within the scope of protection of the present invention.

Claims (5)

1. A method for optimizing heterogeneous network resources by using reinforcement learning is characterized in that: the method comprises the following steps:
step 1, establishing a Markov decision process according to an energy efficiency target of a heterogeneous network to be optimized;
step 2, designing traditional Q-Learning according to Markov decision process;
step 3, aiming at the excessively large order of magnitude of the reward function value in Q-Learning, redesigning the reward function value by first taking the negative and then taking the reciprocal, and compressing the reward function value to (-1, 0);
step 4, according to the relevance of the actions, namely the ABS, CRE and small-base-station dormancy strategies, dividing the conventional Q-Learning action space into three sub-Q-Learning action spaces;
step 5, carrying out loop iteration on the stable solutions obtained by the three sub-Q-Learnings; to accelerate convergence, the stable solution of each loop iteration is not necessarily the optimal solution of the three sub-Q-Learnings;
step 6, substituting the solution of each sub-problem into the conditions for solving the two subsequent sub-problems, so that the solutions of the three sub-problems reach a stable state simultaneously through mutual loop iteration; combining the stable solutions of the three sub-problems and outputting the optimal solution A_ABSo, A_CREo and A_Picoo of the original problem.
2. The method for optimizing heterogeneous network resources by using reinforcement learning according to claim 1, characterized in that: in step 1, a Markov decision process (S, A, P, R) is established; specifically, S is defined as the state space, i.e. the set of user positions in the heterogeneous network cell; A is defined as the action space, i.e. the set of actions the agent can select in state S; P is defined as the state-transition probability, i.e.
P(s_{t+1} = s' | s_t = s, a_t = a); and R is defined as the reward function.
3. The method for optimizing heterogeneous network resources by using reinforcement learning according to claim 2, characterized in that: in step 3, the reward function value is redesigned by first taking the negative and then taking the reciprocal, compressing the reward value to (-1, 0), namely
R = 1/(-E) = -1/E,
where E is the system energy-efficiency function; this mapping keeps the reward function consistent with the system energy efficiency.
4. The method for optimizing heterogeneous network resources by using reinforcement learning according to claim 3, characterized in that: in step 4, the conventional Q-Learning action space is divided into three sub-Q-Learning action spaces, i.e. A is decomposed into A_ABS, A_CRE and A_Pico, which are in turn the action spaces for optimizing the ABS, CRE and small-base-station dormancy strategies; A_ABS is defined as the set of ABS configurations, A_CRE as the set of CRE configurations, and A_Pico as the set of small-base-station dormancy strategies; the dormancy-strategy solutions of ABS, CRE and the small base stations are solved respectively by the three sub-Q-Learnings.
5. The method for optimizing heterogeneous network resources by using reinforcement learning according to claim 4, characterized in that: in step 5, the loop iteration satisfies
R_ABS ~ P(R|S, A_ABS) ≤ R_ABSo ~ P(R|S, A_ABSo),
R_CRE ~ P(R|S, A_CRE) ≤ R_CREo ~ P(R|S, A_CREo),
R_Pico ~ P(R|S, A_Pico) ≤ R_Picoo ~ P(R|S, A_Picoo),
where A_ABSo, A_CREo and A_Picoo are the optimal actions of the three sub-Q-Learnings.
CN202011002522.7A 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning Active CN112188600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002522.7A CN112188600B (en) 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011002522.7A CN112188600B (en) 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning

Publications (2)

Publication Number Publication Date
CN112188600A true CN112188600A (en) 2021-01-05
CN112188600B CN112188600B (en) 2023-05-30

Family

ID=73955731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002522.7A Active CN112188600B (en) 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning

Country Status (1)

Country Link
CN (1) CN112188600B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709882A (en) * 2021-08-24 2021-11-26 吉林大学 Vehicle networking communication resource allocation method based on graph theory and reinforcement learning
CN118540717A (en) * 2024-07-26 2024-08-23 华东交通大学 Base station dormancy and power distribution method based on deep reinforcement learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150365871A1 (en) * 2014-06-11 2015-12-17 Board Of Trustees Of The University Of Alabama System and method for managing wireless frequency usage
US9622133B1 (en) * 2015-10-23 2017-04-11 The Florida International University Board Of Trustees Interference and mobility management in UAV-assisted wireless networks
CN108521673A (en) * 2018-04-09 2018-09-11 湖北工业大学 Resource allocation and power control combined optimization method based on intensified learning in a kind of heterogeneous network
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150365871A1 (en) * 2014-06-11 2015-12-17 Board Of Trustees Of The University Of Alabama System and method for managing wireless frequency usage
US9622133B1 (en) * 2015-10-23 2017-04-11 The Florida International University Board Of Trustees Interference and mobility management in UAV-assisted wireless networks
CN108521673A (en) * 2018-04-09 2018-09-11 湖北工业大学 Resource allocation and power control combined optimization method based on intensified learning in a kind of heterogeneous network
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭俊杰 (Tan Junjie): "Deep Reinforcement Learning Methods for Intelligent Communication", Journal of University of Electronic Science and Technology of China *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709882A (en) * 2021-08-24 2021-11-26 吉林大学 Vehicle networking communication resource allocation method based on graph theory and reinforcement learning
CN113709882B (en) * 2021-08-24 2023-10-17 吉林大学 Internet of vehicles communication resource allocation method based on graph theory and reinforcement learning
CN118540717A (en) * 2024-07-26 2024-08-23 华东交通大学 Base station dormancy and power distribution method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN112188600B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
Chen et al. A joint learning and communications framework for federated learning over wireless networks
Chen et al. Dynamic task offloading for internet of things in mobile edge computing via deep reinforcement learning
CN112118287B (en) Network resource optimization scheduling decision method based on alternative direction multiplier algorithm and mobile edge calculation
CN111132230B (en) Bandwidth allocation and data compression joint optimization method for data acquisition
CN110928654A (en) Distributed online task unloading scheduling method in edge computing system
CN109286664A (en) A kind of computation migration terminal energy consumption optimization method based on Lagrange
CN109246761A (en) Consider the discharging method based on alternating direction multipliers method of delay and energy consumption
Meng et al. Deep reinforcement learning based task offloading algorithm for mobile-edge computing systems
CN112188600A (en) Method for optimizing heterogeneous network resources by using reinforcement learning
CN110719641A (en) User unloading and resource allocation joint optimization method in edge computing
EP4383075A1 (en) Data processing method and apparatus
Chen et al. Joint data collection and resource allocation for distributed machine learning at the edge
Pang et al. Joint wireless source management and task offloading in ultra-dense network
Zhou et al. Multi-server federated edge learning for low power consumption wireless resource allocation based on user QoE
Tan et al. Resource allocation of fog radio access network based on deep reinforcement learning
CN106358300A (en) Distributed resource distribution method in microcellular network
CN114615730A (en) Content coverage oriented power distribution method for backhaul limited dense wireless network
Xu et al. Accelerating split federated learning over wireless communication networks
Li et al. Jointly optimizing helpers selection and resource allocation in D2D mobile edge computing
Yao et al. Data-driven resource allocation with traffic load prediction
Wang et al. Joint heterogeneous tasks offloading and resource allocation in mobile edge computing systems
Wen et al. Quality-and availability-based device scheduling and resource allocation for federated edge learning
Tan et al. Minimizing terminal energy consumption of task offloading via resource allocation in mobile edge computing
Sun [Retracted] Certificateless Batch Authentication Scheme and Intrusion Detection Model Based on the Mobile Edge Computing Technology NDN‐IoT Environment
Wei et al. Mobile Edge Computing Task Offloading Based on ADPSO Algorithm in Multi-user Environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210707

Address after: No.333 Xishan Avenue, Xishan District, Wuxi City, Jiangsu Province

Applicant after: Binjiang College of Nanjing University of Information Engineering

Applicant after: ICTEHI TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: No.333 Xishan Avenue, Xishan District, Wuxi City, Jiangsu Province

Applicant before: Binjiang College of Nanjing University of Information Engineering

GR01 Patent grant
GR01 Patent grant