CN112188600B - Method for optimizing heterogeneous network resources by reinforcement learning - Google Patents

Method for optimizing heterogeneous network resources by reinforcement learning

Info

Publication number
CN112188600B
CN112188600B CN202011002522.7A CN202011002522A
Authority
CN
China
Prior art keywords
learning
sub
cre
reinforcement learning
heterogeneous network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011002522.7A
Other languages
Chinese (zh)
Other versions
CN112188600A (en)
Inventor
李君
李磊
仲星
朱明浩
李正权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ictehi Technology Development Co ltd
Binjiang College of Nanjing University of Information Engineering
Original Assignee
Ictehi Technology Development Co ltd
Binjiang College of Nanjing University of Information Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ictehi Technology Development Co ltd, Binjiang College of Nanjing University of Information Engineering filed Critical Ictehi Technology Development Co ltd
Priority to CN202011002522.7A priority Critical patent/CN112188600B/en
Publication of CN112188600A publication Critical patent/CN112188600A/en
Application granted granted Critical
Publication of CN112188600B publication Critical patent/CN112188600B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/02Power saving arrangements
    • H04W52/0203Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H04W52/0206Power saving arrangements in the radio access network or backbone network of wireless communication networks in access points, e.g. base stations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/06Testing, supervising or monitoring using simulated traffic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Feedback Control In General (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a method for optimizing heterogeneous network resources by reinforcement learning, belonging to the technical field of communication. The method integrates reinforcement learning with convex optimization theory, divides the action space according to the correlation of actions, namely the ABS, CRE and small base station dormancy strategies, and, aiming at the problem that using the system energy efficiency directly as the reward during reinforcement learning modeling yields values of excessively large magnitude, redesigns the reward by taking the negative first and then the reciprocal as the new reward value. The invention reduces the action space of reinforcement learning, guarantees convergence of the system by means of convex optimization theory, and accelerates the convergence of reinforcement learning. Simulation experiments show that the method converges with lower complexity, and its convergence speed is improved by 60 percent compared with traditional tabular Q-Learning while almost reaching the theoretical value of the system energy efficiency.

Description

Method for optimizing heterogeneous network resources by reinforcement learning
Technical Field
The invention belongs to the technical field of communication, and particularly relates to a method for optimizing heterogeneous network resources by reinforcement learning.
Background
As the number of wireless devices accessing the network grows, higher demands are placed on the communication capacity of the network system. One effective way to solve this problem is to build a heterogeneous network, where introducing eICIC (enhanced inter-cell interference coordination) can effectively overcome the interference problem and improve the signal-to-interference-plus-noise ratio between mobile devices and base stations. At the same time, more stringent requirements are placed on the performance and energy efficiency of heterogeneous networks. As the complexity of heterogeneous networks continues to increase, energy efficiency optimization faces growing challenges and has become one of the hot spots of communication network research, especially for heterogeneous networks equipped with 5G base stations. The key is how to configure heterogeneous network resources effectively so as to maximize the energy efficiency of the network system.
Research on heterogeneous network resource allocation at the lower layers mainly focuses on jointly considering almost blank subframes (Almost Blank Subframe, ABS), cell range expansion (Cell Range Expansion, CRE) and small base station dormancy strategies to allocate resources for system energy efficiency. Many scholars ultimately formulate a non-convex NP-hard problem and convert it into a convex one through relaxation and the Karush-Kuhn-Tucker (KKT) conditions. The most effective method is to jointly consider ABS, CRE and the base station dormancy strategy and split the problem into three sub-problems, namely ABS, CRE and small base station dormancy, each of which is convex; according to convex optimization theory, the solution of the original non-convex NP-hard problem is obtained by cyclically iterating the solutions of the three sub-problems. The disadvantage of this scheme is that traditional mathematical methods still require a large amount of computation to actually solve the sub-problems, and the computation process is quite complex, which limits the practical application of this scheme.
In recent years, machine learning techniques have been increasingly applied in many fields such as big data analysis, precise advertisement delivery and image classification. At present, many researchers have introduced machine learning into communication systems for resource optimization, mainly based on deep learning and reinforcement learning.
Deep learning, built on deep neural networks, has the advantage of strong fitting ability. A deep learning method can approximate the relation between heterogeneous network resources and system performance well, thereby maximizing heterogeneous network performance. Its disadvantage is that neural networks are prone to overfitting and slow learning. The advantage of reinforcement learning is that, like deep learning, it can adopt either model-free or model-based schemes to solve practical problems, making the solution of specific problems more efficient and timely.
Some researchers map the relations among base stations and between base stations and users in a heterogeneous network to the graph theory domain, and then combine reinforcement Learning with graph theory to decompose the initial Q-Learning problem into several Q-Learning sub-problems, so as to solve network resource allocation and optimize system performance.
Disclosure of Invention
The invention aims to: the invention aims to provide a method for optimizing heterogeneous network resources by reinforcement Learning, addressing the defect that directly applying reinforcement Learning to heterogeneous network resource allocation leads to an excessively large action space; on the premise of almost reaching the theoretical value of the system energy efficiency, the convergence rate is improved by 60% compared with traditional tabular Q-Learning.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme:
a method for optimizing heterogeneous network resources using reinforcement learning, comprising the steps of:
step 1, establishing a Markov decision process according to a heterogeneous network energy efficiency target which needs to be optimized;
step 2, designing a traditional Q-Learning according to a Markov decision process;
step 3, aiming at the problem that the magnitude of the reward function value in Q-Learning is excessively large, redesigning the reward value by taking the negative first and then the reciprocal, compressing it to (-1, 0);
step 4, dividing the traditional Q-Learning action space into three sub-Q-Learning action spaces according to the correlation of actions, namely the ABS, CRE and small base station dormancy strategies;
step 5, cyclically iterating the stable solutions obtained by the three sub-Q-Learnings; to accelerate convergence, the stable solution of each loop iteration is not necessarily the optimal solution of the three sub-Q-Learnings;
step 6, substituting the solution obtained by each sub-problem into the conditions for solving the following two sub-problems, so that through mutual cyclic iteration the solutions of the three sub-problems reach a stable state simultaneously; the stable solutions of the three sub-problems are combined and the optimal solution of the original problem, A_ABSo, A_CREo and A_picoo, is output.
Further, in step 1, a Markov decision process (S, A, P, R) is established; specifically, S is defined as the state space, i.e. the set of user locations within the heterogeneous network cell; A is defined as the action space, i.e. the set of actions the agent can select in state S; P is defined as the state transition probability, i.e. P(s_{t+1}=s'|s_t=s, a_t=a); and R is defined as the reward function.
Further, in step 3, the reward function value is redesigned by taking the negative first and then the reciprocal, compressing it to (-1, 0), i.e. R = 1/(-E) = -1/E, where E is the system energy efficiency function; this also ensures consistency between the reward function and the system energy efficiency.
Further, in step 4, the traditional Q-Learning action space is divided into three sub-Q-Learning action spaces, i.e. A is decomposed into A_ABS, A_CRE and A_pico, which are in turn the action space sets for optimizing the ABS, CRE and small base station dormancy strategies: A_ABS is defined as the ABS configuration set, A_CRE as the CRE configuration set, and A_pico as the small base station dormancy strategy set; the ABS, CRE and small base station dormancy strategy solutions are then solved respectively.
Further, in step 5, the loop iteration satisfies
R_ABS ~ P(R|S, A_ABS) ≤ R_ABSo ~ P(R|S, A_ABSo),
R_CRE ~ P(R|S, A_CRE) ≤ R_CREo ~ P(R|S, A_CREo),
R_pico ~ P(R|S, A_pico) ≤ R_picoo ~ P(R|S, A_picoo),
where A_ABSo, A_CREo and A_picoo are the optimal actions of the three sub-Q-Learnings.
The principle of the invention: the method decomposes an initial problem into several sub-problems according to the correlation of the configured resources, and obtains the solution of the initial problem by cyclically iterating the solutions of the sub-problems. The sub-problems are solved with Q-Learning instead of traditional mathematical methods. The initial problem is mapped to the reinforcement Learning domain, the action space is divided according to the correlation of actions, the original Q-Learning is decomposed into several sub-Q-Learnings according to this division rule, and the optimal strategy of the initial Q-Learning is obtained by cyclically iterating the optimal strategies of the sub-Q-Learnings. The system energy efficiency is redesigned as the reward function by taking its negative first and then the reciprocal, so that the reinforcement learning reward value is compressed to (-1, 0) while the new reward function remains consistent with the system energy efficiency.
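The following is a minimal sketch, not code from the patent, of the reshaped reward (negative first, then reciprocal) fed into a standard tabular Q-Learning update. The state and action counts and the way the energy efficiency value is obtained are assumptions made purely for illustration; the learning rate and discount factor of 0.1 follow the simulation settings described later.

```python
import numpy as np

def reshaped_reward(energy_efficiency: float) -> float:
    """Take the negative of the system energy efficiency E, then the reciprocal.

    For E > 1 the value 1/(-E) = -1/E lies in (-1, 0), and a larger E still
    gives a larger (less negative) reward, so the reshaped reward remains
    consistent with the system energy efficiency."""
    return 1.0 / (-energy_efficiency)

n_states, n_actions = 4, 16          # assumed sizes of S and of one sub action space
alpha, gamma = 0.1, 0.1              # learning rate and discount factor (0.1 as in the simulations)
Q = np.zeros((n_states, n_actions))  # tabular Q function indexed by (state, action)

def q_update(s: int, a: int, s_next: int, energy_efficiency: float) -> None:
    """One tabular Q-Learning update using the reshaped reward."""
    r = reshaped_reward(energy_efficiency)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```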
The beneficial effects are that: compared with the prior art, the method for optimizing heterogeneous network resources by reinforcement learning integrates reinforcement learning with convex optimization theory, divides the action space according to the correlation of actions, namely the ABS, CRE and small base station dormancy strategies, and, aiming at the problem that using the system energy efficiency directly as the reward during reinforcement learning modeling yields values of excessively large magnitude, redesigns the reward by taking the negative first and then the reciprocal as the new reward value. The invention reduces the action space of reinforcement learning, guarantees convergence of the system by means of convex optimization theory, and accelerates the convergence of reinforcement learning. Simulation experiments show that the method converges with lower complexity, and its convergence speed is improved by 60 percent compared with traditional tabular Q-Learning while almost reaching the theoretical value of the system energy efficiency.
Drawings
FIG. 1 is a flow chart of a method construction process of the present invention;
FIG. 2 is a schematic diagram of iterative operation of a sub-Q-Learning loop of the present invention;
FIG. 3 is a chart showing the convergence rate of the conventional Q-Learning method under the same parameter setting;
FIG. 4 is a chart showing the convergence speed of the method of the present invention under the same parameter setting;
FIG. 5 is a diagram of the system energy efficiency achieved by the method of the present invention.
Detailed Description
The invention is further described below in conjunction with specific embodiments.
As shown in fig. 1-5, a method for optimizing heterogeneous network resources by reinforcement learning includes the following steps:
Step 1: establishing a Markov decision process (Markov Decision Process, MDP) (S, A, P, R) according to the heterogeneous network energy efficiency target to be optimized, wherein S is defined as the state space, i.e. the set of user locations within the heterogeneous network cell; A is defined as the action space, i.e. the set of actions the agent can select in state S; P is defined as the state transition probability, i.e. P(s_{t+1}=s'|s_t=s, a_t=a); and R is defined as the reward function.
Step 2: designing a traditional Q-Learning according to a Markov decision process;
Step 3: aiming at the problem that the magnitude of the reward function value in Q-Learning is excessively large, redesigning the reward value by taking the negative first and then the reciprocal, compressing it to (-1, 0), i.e. R = 1/(-E) = -1/E, where E is the system energy efficiency function; this also ensures consistency between the reward function and the system energy efficiency;
Step 4: according to the correlation of actions, namely the ABS, CRE and small base station dormancy strategies, dividing the traditional Q-Learning action space into three sub-Q-Learning action spaces, i.e. decomposing A into A_ABS, A_CRE and A_pico, which are in turn the action space sets for optimizing the ABS, CRE and small base station dormancy strategies: A_ABS is defined as the ABS configuration set, A_CRE as the CRE configuration set, and A_pico as the small base station dormancy strategy set. The ABS, CRE and small base station dormancy strategy solutions are then solved respectively.
Step 5: the loop iteration process is to perform loop iteration on stable solutions obtained by three sub Q-Learning. To increase the convergence rate, the stable solution for each loop iteration is not necessarily the optimal solution for three sub-Q-Learning, i.e
R ABS ~P(R|S,A ABS )≤R ABSo ~P(R|S,A ABSo ),
R CRE ~P(R|S,A CRE )≤R CREo ~P(R|S,A CREo ),
R CRE ~P(R|S,A CRE )≤R Picoo ~P(R|S,A Picoo ) Wherein A is ABSo ,A CREo And
A picoo is the optimal action of three sub Q-Learning;
step 6: the solution obtained by each sub-problem is brought into the condition of solving the two following sub-problems, the solutions of the three sub-problems reach a stable state at the same time through mutually circulating iteration, the stable solutions of the three sub-problems are combined, and the optimal solution A of the original problem is output ABSo ,A CREo And A picoo
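As an illustration of steps 4 to 6, the following is a minimal sketch, not code from the patent, of the cyclic iteration over the three sub-Q-Learnings. The routine solve_sub_q_learning is a hypothetical placeholder for running one sub-Q-Learning (ABS, CRE or small base station dormancy) with the other two sub-solutions held fixed.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Action = Any  # an action from one sub action space (A_ABS, A_CRE or A_pico)

def cyclic_iteration(
    solve_sub_q_learning: Callable[[str, Dict[str, Optional[Action]]], Action],
    max_rounds: int = 100,
) -> Tuple[Action, Action, Action]:
    """Cyclically iterate the three sub-Q-Learnings until their solutions are
    simultaneously stable, then return the combined solution
    (A_ABSo, A_CREo, A_picoo). Per step 5, the solution returned by each
    sub-Q-Learning in a round need not be that sub-problem's optimum."""
    a_abs = a_cre = a_pico = None
    for _ in range(max_rounds):
        prev = (a_abs, a_cre, a_pico)
        # Step 6: each sub-solution becomes a condition for the two that follow it.
        a_abs = solve_sub_q_learning("ABS", {"CRE": a_cre, "pico": a_pico})
        a_cre = solve_sub_q_learning("CRE", {"ABS": a_abs, "pico": a_pico})
        a_pico = solve_sub_q_learning("pico", {"ABS": a_abs, "CRE": a_cre})
        if (a_abs, a_cre, a_pico) == prev:  # all three stable at the same time
            break
    return a_abs, a_cre, a_pico
```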
FIG. 1 is a flow chart of the construction process of the method of the present invention. For a complex problem, traditional tabular Q-Learning has a very high-dimensional action space, so applying Q-Learning directly is impractical. As shown in FIG. 1, an MDP is established according to the energy efficiency to be optimized, and a traditional tabular Q-Learning is built. Aiming at the problem of an excessively large action space, the action space is divided according to the relations among the actions to be optimized for the system energy efficiency, and the original tabular Q-Learning is decomposed into three sub-Q-Learnings from which the actions to be optimized are obtained. When the solutions of the three sub-Q-Learnings all remain stable during the loop iteration, they are combined and output, yielding the solution of the original Q-Learning.
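To see why the division helps, here is a rough size comparison under assumed configuration counts; the numbers are illustrative only and do not come from the patent. A joint tabular Q-Learning must cover every (ABS, CRE, dormancy) combination, whereas the three sub-Q-Learnings together only cover the sum of the individual option counts.

```python
# Illustrative only: assumed numbers of ABS ratios, CRE bias levels and
# small base station on/off patterns; none of these values come from the patent.
n_abs_configs = 8          # assumed ABS subframe-ratio choices
n_cre_configs = 8          # assumed CRE bias choices
n_sleep_patterns = 2 ** 4  # assumed on/off patterns for 4 small base stations

joint_actions = n_abs_configs * n_cre_configs * n_sleep_patterns  # 1024 joint actions
split_actions = n_abs_configs + n_cre_configs + n_sleep_patterns  # 32 actions across the three sub spaces

print(f"joint action space: {joint_actions}, after division: {split_actions}")
```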
FIG. 2 shows the flow of the three sub-Q-Learning loop iterations: the current sub-Q-Learning solution is updated from the solution of the previous loop iteration and then serves as a condition for the two sub-Q-Learnings solved next; through the loop iteration the solutions of the three sub-problems reach a stable state simultaneously, the stable solutions of the three sub-problems are combined, and the optimal solution of the original problem is generated and output.
Based on the flows of FIG. 1 and FIG. 2, the simulation sets the number of users to 50, 100, 150 and 200 respectively, with users entering the cell at random. The wireless channel is modeled with deterministic path loss attenuation and random shadow fading, and the system bandwidth is set to 10 MHz. FIGS. 3 and 4 show the relationship between the number of iterations and the accuracy for the traditional tabular Q-Learning method and for the proposed method with improved reinforcement learning actions (TQL), where the learning rate, discount factor and greedy rate are each set to 0.1. FIG. 3 shows that the Q-Learning method converges after about 80×10000 = 800000 iterations under the different load conditions, while in FIG. 4 the proposed TQL method converges after about 800×400 = 320000 iterations; in FIGS. 3 and 4, Accuracy denotes the accuracy, Learning rate the learning rate, Discount factor the discount factor, Greedy rate the greedy rate, and Iter steps the number of iteration steps. As can be seen from FIGS. 3 and 4, the convergence rate of the proposed TQL method is improved by about 60% relative to the traditional tabular Q-Learning method ((800000 - 320000)/800000 = 60%).
Fig. 5 compares the proposed TQL method with the traditional Q-Learning method and the ADPs ES IC method for heterogeneous network energy efficiency optimization, where Energy Efficiency denotes the energy efficiency and UEs the number of users. Fig. 5(a) shows that the system energy efficiency achieved by the proposed method is already very close to its theoretical value and far exceeds the performance of the ADPs ES IC scheme proposed by related scholars. Fig. 5(b) shows the gap between the optimized heterogeneous network energy efficiency and its theoretical optimum; the gap arises mainly because, in a few individual states, the method finds a near-optimal rather than the optimal solution, and Fig. 5(b) verifies that the resulting loss of system energy efficiency is small.
The foregoing is merely a preferred embodiment of the present invention; it will be apparent to those skilled in the art that modifications and variations can be made without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as falling within the scope of the invention.

Claims (5)

1. A method for optimizing heterogeneous network resources by using reinforcement learning, characterized in that the method comprises the following steps:
step 1, establishing a Markov decision process according to a heterogeneous network energy efficiency target which needs to be optimized;
step 2, designing a traditional Q-Learning according to a Markov decision process;
step 3, aiming at the problem that the magnitude of the reward function value in Q-Learning is excessively large, redesigning the reward value by taking the negative first and then the reciprocal, compressing it to (-1, 0);
step 4, dividing the traditional Q-Learning action space into three sub-Q-Learning action spaces according to the correlation of actions, namely the ABS, CRE and small base station dormancy strategies;
step 5, cyclically iterating the stable solutions obtained by the three sub-Q-Learnings; to accelerate convergence, the stable solution of each loop iteration is not necessarily the optimal solution of the three sub-Q-Learnings;
step 6, substituting the solution obtained by each sub-problem into the conditions for solving the following two sub-problems, so that through mutual cyclic iteration the solutions of the three sub-problems reach a stable state simultaneously; the stable solutions of the three sub-problems are combined and the optimal solution of the original problem, A_ABSo, A_CREo and A_picoo, is output.
2. The method for optimizing heterogeneous network resources using reinforcement learning of claim 1, wherein: in step 1, a Markov decision process (S, A, P, R) is established; specifically, S is defined as the state space, i.e. the set of user locations within the heterogeneous network cell; A is defined as the action space, i.e. the set of actions the agent can select in state S; P is defined as the state transition probability, i.e. P(s_{t+1}=s'|s_t=s, a_t=a); and R is defined as the reward function.
3. The method for optimizing heterogeneous network resources using reinforcement learning of claim 2, wherein: in step 3, the reward function value is redesigned by taking the negative first and then the reciprocal, compressing it to (-1, 0), i.e. R = 1/(-E) = -1/E, where E is the system energy efficiency function; this also ensures consistency between the reward function and the system energy efficiency.
4. A method for optimizing heterogeneous network resources using reinforcement learning as recited in claim 3, wherein: in step 4, the traditional Q-Learning action space is divided into three sub-Q-Learning action spaces, i.e. A is decomposed into A_ABS, A_CRE and A_pico, which are in turn the action space sets for optimizing the ABS, CRE and small base station dormancy strategies: A_ABS is defined as the ABS configuration set, A_CRE as the CRE configuration set, and A_pico as the small base station dormancy strategy set; the ABS, CRE and small base station dormancy strategy solutions are then solved respectively.
5. The method for optimizing heterogeneous network resources using reinforcement learning of claim 4, wherein: in step 5, the loop iteration satisfies
R_ABS ~ P(R|S, A_ABS) ≤ R_ABSo ~ P(R|S, A_ABSo),
R_CRE ~ P(R|S, A_CRE) ≤ R_CREo ~ P(R|S, A_CREo),
R_pico ~ P(R|S, A_pico) ≤ R_picoo ~ P(R|S, A_picoo),
where A_ABSo, A_CREo and A_picoo are the optimal actions of the three sub-Q-Learnings.
CN202011002522.7A 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning Active CN112188600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002522.7A CN112188600B (en) 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011002522.7A CN112188600B (en) 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning

Publications (2)

Publication Number Publication Date
CN112188600A CN112188600A (en) 2021-01-05
CN112188600B true CN112188600B (en) 2023-05-30

Family

ID=73955731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002522.7A Active CN112188600B (en) 2020-09-22 2020-09-22 Method for optimizing heterogeneous network resources by reinforcement learning

Country Status (1)

Country Link
CN (1) CN112188600B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709882B (en) * 2021-08-24 2023-10-17 吉林大学 Internet of vehicles communication resource allocation method based on graph theory and reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9622133B1 (en) * 2015-10-23 2017-04-11 The Florida International University Board Of Trustees Interference and mobility management in UAV-assisted wireless networks
CN108521673A (en) * 2018-04-09 2018-09-11 湖北工业大学 Resource allocation and power control combined optimization method based on intensified learning in a kind of heterogeneous network
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10091785B2 (en) * 2014-06-11 2018-10-02 The Board Of Trustees Of The University Of Alabama System and method for managing wireless frequency usage

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9622133B1 (en) * 2015-10-23 2017-04-11 The Florida International University Board Of Trustees Interference and mobility management in UAV-assisted wireless networks
CN108521673A (en) * 2018-04-09 2018-09-11 湖北工业大学 Resource allocation and power control combined optimization method based on intensified learning in a kind of heterogeneous network
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"面向智能通信的深度强化学习方法";谭俊杰;《电子科技大学学报》;全文 *

Also Published As

Publication number Publication date
CN112188600A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112492691B (en) Downlink NOMA power distribution method of depth deterministic strategy gradient
CN112188600B (en) Method for optimizing heterogeneous network resources by reinforcement learning
Li et al. Energy efficiency maximization oriented resource allocation in 5G ultra-dense network: Centralized and distributed algorithms
Xu et al. Dynamic client association for energy-aware hierarchical federated learning
Li et al. Deep neural network based computational resource allocation for mobile edge computing
CN104640185A (en) Cell dormancy energy-saving method based on base station cooperation
Hu et al. Multi-agent DRL-based resource allocation in downlink multi-cell OFDMA system
Zhao et al. Price-based power allocation in two-tier spectrum sharing heterogeneous cellular networks
CN114615730A (en) Content coverage oriented power distribution method for backhaul limited dense wireless network
US11961409B1 (en) Air-ground joint trajectory planning and offloading scheduling method and system for distributed multiple objectives
Wang et al. Joint heterogeneous tasks offloading and resource allocation in mobile edge computing systems
CN111065121B (en) Intensive network energy consumption and energy efficiency combined optimization method considering cell difference
Huang et al. Drop Maslow's Hammer or not: machine learning for resource management in D2D communications
CN116233984A (en) Energy-saving control method and device of base station, electronic equipment and storage medium
CN116132997A (en) Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm
Mohammad et al. Optimal task allocation for mobile edge learning with global training time constraints
CN115915454A (en) SWIPT-assisted downlink resource allocation method and device
CN101729105A (en) Power control structure and method thereof based on game theory model in network
CN107995034A (en) A kind of dense cellular network energy and business collaboration method
Guo et al. Deep reinforcement learning based traffic offloading scheme for vehicular networks
CN115720341A (en) Method, medium and device for 5G channel shutoff
Besser et al. Deep learning based resource allocation: How much training data is needed?
CN115250156A (en) Wireless network multichannel frequency spectrum access method based on federal learning
CN103607759A (en) Zoom dormancy method and apparatus for micro base station in cellular network
CN104507111B (en) Collaborative communication method and device based on cluster in cellular network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210707

Address after: No.333 Xishan Avenue, Xishan District, Wuxi City, Jiangsu Province

Applicant after: Binjiang College of Nanjing University of Information Engineering

Applicant after: ICTEHI TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: No.333 Xishan Avenue, Xishan District, Wuxi City, Jiangsu Province

Applicant before: Binjiang College of Nanjing University of Information Engineering

GR01 Patent grant