CN115996475A

CN115996475A - Ultra-dense networking multi-service slice resource allocation method and device

Info

Publication number: CN115996475A
Application number: CN202211487474.4A
Authority: CN
Inventors: 张勇; 滕颖蕾; 柴玉昊; 张震宇; 袁思雨; 白昊男
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2022-11-25
Filing date: 2022-11-25
Publication date: 2023-04-21

Abstract

The invention provides a method and a device for distributing resources of ultra-dense networking multi-service slices, comprising the following steps: acquiring a multi-agent reinforcement learning model, wherein a strategy network and a value network are deployed on each micro base station, a transmission power equalization solution is solved in advance, the strategy network takes the transmission rate and the transmission power of the micro base station as state parameters, and a correlation parameter set of each micro base station and a predicted transmission power set of other micro base stations as action parameters; each micro base station acquires own state parameters, generates corresponding action strategies, calculates estimated Q values for the action strategies generated by the corresponding micro base stations according to global information by a value network, and is used for updating strategy network parameters; constructing a loss function of the estimated Q value and the actual Q value by taking the maximized rewarding value as a target, and updating parameters of the value network until the model reaches a preset performance requirement; and inputting the state parameters of each micro base station into a trained multi-agent reinforcement learning model to generate corresponding action strategies so as to realize multi-service slice resource allocation.

Description

Ultra-dense networking multi-service slice resource allocation method and device

Technical Field

The invention relates to the technical field of communication, in particular to a method and a device for distributing ultra-dense networking multi-service slice resources.

Background

Network slicing technology has become one of the key technologies of the fifth generation mobile network (Fifth Generation Mobile Network, 5G), and in the next generation mobile network and other technical fields, the network has raised requirements for flexibility, isolation, privacy, customization and the like for differentiated services of users, and at the same time, the importance of a small-scale network for providing specific services has also increased to meet the requirements of different scenes and different crowds.

One of the emerging solutions is to introduce ultra-dense networking of macro-micro base station isomerism, which meets the transmission capacity and coverage requirements of users. The ultra-dense networking of the base stations can improve the spectrum efficiency of the system to a certain extent, dynamic radio resource allocation is carried out through rapid resource scheduling, the licensed spectrum of the macro base station (Macro Base Station, MBS) is multiplexed at the micro base station (Small Base Station, SBS), and the utilization rate and the spectrum efficiency of the system radio resource are improved, but the problems of system interference and system cost are brought. In order to provide reliable service, the micro base station needs to acquire multiplexing permission of the macro base station spectrum, which needs to perform interference coordination between the macro base station and the micro base station to ensure that the operation of the micro base station is not affected by harmful interference.

Therefore, a method for reducing contention interference of the micro base station and optimizing radio resource allocation under the premise of guaranteeing the communication quality and the communication requirement of the user is needed.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a method and an apparatus for distributing resources of ultra-dense networking multi-service slices, so as to eliminate or improve one or more drawbacks existing in the prior art, and solve the problems of system interference and system cost caused when the wireless resource utilization rate and spectrum efficiency of the system are improved in the prior art.

In one aspect, the invention provides a resource allocation method for multi-service slices of ultra-dense networking, which is characterized in that the ultra-dense networking comprises at least one macro base station, and each macro base station is also connected with a plurality of micro base stations for service; the users of the micro base stations multiplex slice resources of the Ying Hong base stations, and the method is used for carrying out multi-service slice resource allocation based on cross-layer interference generated between the micro base stations and the macro base stations and same-layer interference generated between adjacent micro base stations; the method comprises the following steps:

acquiring a multi-agent reinforcement learning model, wherein the multi-agent reinforcement learning model deploys a strategy network and a value network on each micro base station; each strategy network constructs a state space by taking the transmission rate and the total transmitting power of each user in a corresponding single micro base station as state parameters; acquiring associated parameters used for indicating whether users in all micro base stations multiplex resource blocks in macro base stations or not, and constructing an action space by taking an associated parameter set of each micro base station and a predicted transmitting power set of other micro base stations as action parameters; each micro base station obtains the state parameters of the micro base station, selects corresponding actions according to the strategy network, generates estimated Q values according to the state parameters of the corresponding micro base station, the selected actions and the state parameters and actions of other micro base stations by the value network of each micro base station, and is used for carrying out parameter updating on the strategy network of the corresponding micro base station; constructing a loss function of the estimated Q value and the actual Q value of the model by taking the maximized rewarding value as an optimization target, and carrying out parameter updating on the value network; until reaching the preset performance requirement;

In the state updating process, the macro base station constructs a macro base station income calculation formula according to the cross-layer interference price and the cross-layer interference generated by multiplexing resource blocks in the micro base station by the user; the micro base station constructs a micro base station gain calculation formula according to the association parameter, the fixed bandwidth length of the resource block, the signal-to-interference-plus-noise ratio, the same-layer interference price, the same-layer interference, the cross-layer interference price and the cross-layer interference; taking the macro base station as a leader and taking each micro base station as a follower to construct a non-cooperative game; fixing the values of the associated parameters, and solving the micro base station profit calculation by adopting a reverse induction method to obtain the transmitting power equalization solution of each micro base station so as to update the state space of each strategy network; substituting the transmission power equalization solution into the macro base station income calculation to obtain the cross-layer interference price equalization solution;

and inputting the state parameters of each micro base station into the multi-agent reinforcement learning model to generate corresponding action strategies so as to realize multi-service slice resource allocation.

In some embodiments of the present invention, the macro base station constructs a macro base station benefit calculation formula according to a cross-layer interference price and cross-layer interference generated by multiplexing resource blocks in the micro base station by a user, where the macro base station benefit calculation formula is:

wherein ,U_MBS Representing the macro base station benefit; u (U) _UE Representing a set of users for all micro base stations; u (U) _PRB Representing the total number of resource blocks; u (U) _BS Representing the macro base station and a set of all micro base stations;

representing the cross-layer interference price of a user i using a resource block j at a micro base station b; />

Indicating cross-layer interference caused by the use of resource block j by user i at micro base station b.

In some embodiments of the present invention, the micro base station constructs a micro base station benefit calculation formula according to the association parameter, the resource block fixed bandwidth length, the signal to interference plus noise ratio, the same layer interference price, the same layer interference, the cross layer interference price and the cross layer interference, wherein the micro base station benefit calculation formula is:

s.t.

wherein ,U^b Representing the micro base station benefit; u (U) _UE,b Representing a set of users of the micro base station b; u (U) _s The type of slice is represented;

representing the association relationship between the user i and the slice s, the resource block j and the micro base station b; b represents the fixed bandwidth length of the resource block;

indicating that the user i uses the resource block j to cause the signal to interference plus noise ratio at the micro base station b; />

Indicating the same-layer interference price of user i using resource block j at micro base station b；/>

The same-layer interference caused by the use of the resource block j by the user i at the micro base station b is shown;

representing the cross-layer interference price of a user i using a resource block j at a micro base station b; / >

Indicating cross-layer interference caused by a user i using a resource block j at a micro base station b; />

Representing the transmitting power allocated to the resource block j by the user i at the micro base station b; u (U) _UE Representing a set of users for all micro base stations; u (U) _BS Representing the macro base station and a set of all micro base stations; i _max Representing the interference maximum; u (U) _PRB Representing the total number of resource blocks; τ represents the total number of resource blocks.

In some embodiments of the present invention, each policy network constructs a state space with a transmission rate of each user in a corresponding single micro base station and a total transmission power as state parameters, where the total transmission power uses the transmission power equalization solution, and the state parameters are expressed as:

wherein ,s_j (t) represents a state parameter of the micro base station at the time t; r is (r) _N,j (t) represents the transmission rate of the nth user multiplexing resource block j at the time t;

the user i allocates the transmitting power of the resource block j in the micro base station b; u (U) _j Representing a set of users multiplexing resource blocks j;

the value network of each micro base station generates a predicted Q value according to the state parameters and the selected actions of the corresponding micro base station and the state parameters and actions of other micro base stations, and the state parameters of the value network are expressed as follows:

s _j '(t)＝(s _j (t),a _j (t),s _-j (t),a _-j (t))；

wherein ,s_j (t) represents a state parameter of the micro base station at the time t; a, a _j (t) represents an operation parameter of the micro base station at the time t; s is(s) _-j (t) represents a state parameter set of other micro base stations at the moment t; a, a _-j And (t) represents the action parameter set of other micro base stations at the moment t.

In some embodiments of the present invention, an association parameter for indicating whether a user in each micro base station multiplexes a resource block in a macro base station is obtained, and an action space is constructed by using an association parameter set of each micro base station and a predicted transmit power set of other micro base stations as action parameters, where the action parameters are expressed as:

a _j (t)＝{W _j ,P _-j }；

wherein ,

wherein ,a_j (t) represents an operation parameter of the micro base station at the time t; w (W) _j Representing a set of associated parameters; p (P) _-j Representing a set of predicted transmit powers of other micro base stations;

representing the association relationship between the user i and the slice s, the resource block j and the micro base station b;

representing the transmitting power allocated to the resource block j by the user i at the micro base station b; u (U) _j Representing the set of users multiplexing resource block j.

In some embodiments of the present invention, the value network of each micro base station generates a predicted Q value according to the state parameter and the selected action of the corresponding micro base station and the state parameters and actions of other micro base stations, and constructs a policy gradient according to the predicted Q value, and is used for updating parameters of the policy network of the corresponding micro base station, where a calculation formula of the policy gradient is as follows:

wherein ,

representing the policy gradient; θ represents a policy parameter; j (u) _j ) Representing a cumulative estimated prize value; d represents an experience playback pool; u (u) _j (a _j |s _j ) Representing an action strategy made by the micro base station according to the state; />

Representing the value network; s is(s) _j Representing the state of the micro base station estimated by the value network; a, a _j Representing the motion of the micro base station estimated by the value network; s is(s) _other Representing the state of other micro base stations estimated by the value network; a, a _other And representing the actions of other micro base stations estimated by the value network.

In some embodiments of the present invention, a loss function of the estimated Q value and the actual Q value of the model is constructed with the maximized reward value as an optimization target, and the value network is updated with parameters, where a calculation formula of the loss function is as follows:

wherein ,

representing the loss function; θ represents a policy parameterA number; u (u) _j Representing adaptive weight parameters; r represents the updated result of the action; />

Representing the value network; s is(s) _j Representing the state of the micro base station estimated by the value network; a, a _j Representing the motion of the micro base station estimated by the value network; s is(s) _other Representing the state of other micro base stations estimated by the value network; a, a _other Representing the actions of other micro base stations estimated by the value network; / >

Representing the actual Q value.

In some embodiments of the present invention, a loss function of the estimated Q value and the actual Q value of the model is constructed with the maximized reward value as an optimization target, and the value network is updated with parameters, where the calculation formula of the reward value is:

wherein, reward _j Representing the prize value; u (u) _j Representing the adaptive weight parameters; r is (r) _j Representing the total transmission rate of each user multiplexing resource block j of the micro base station; n represents the total number of users; r is (r) _-j Representing the total transmission rate of other micro base stations;

representing the interference price of the same layer; />

Representing co-layer interference; />

Representing cross-layer interference prices; />

Representation ofCross-layer interference; u (U) _j Representing the set of users multiplexing resource block j.

In some embodiments of the present invention, the adaptive weight parameter is learned according to a state of a global environment;

when u is _j When=1, the prize value is only related to the transmission rate of the micro base station, is zero and is played;

when 0 is<u _j <1, the prize value is related to the transmission rate of the micro base station itself and the transmission rate of other micro base stations, so as to form a hybrid game.

In another aspect, the invention also provides a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of a method as described in any one of the above-mentioned.

The invention has the advantages that:

the invention provides a method and a device for distributing resources of ultra-dense networking multi-service slices, which are characterized in that a multi-agent reinforcement learning model obtained through pre-training is obtained, and state parameters of each micro base station are input into the multi-agent reinforcement learning model, so that action strategies are correspondingly generated, and the purposes of reducing competition and interference of the micro base stations on the premise of guaranteeing the communication quality and the communication requirement of users are achieved, so that the wireless resource distribution is optimized, and the frequency spectrum tension is relieved.

In multi-agent reinforcement learning model training, constructing a macro base station income calculation formula and a micro base station income calculation formula in advance, modeling a resource allocation problem as a non-cooperative game, and obtaining a transmitting power balance solution and a cross-layer interference price balance solution; and then, using the transmission power equalization solution to update the state space of each strategy network in the multi-agent reinforcement learning model, guiding the model to update and optimize to a preset direction, and simplifying the calculation amount of the model. Meanwhile, the multi-agent reinforcement learning model deploys a strategy network and a value network on each micro base station, and the value network can acquire global information and generate more accurate estimated Q values, so that the strategy network generates more optimal action strategies.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.

It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the above-described specific ones, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate and together with the description serve to explain the invention. In the drawings:

fig. 1 is a schematic structural diagram of a resource allocation method for ultra-dense networking multi-service slices according to an embodiment of the invention.

FIG. 2 is a schematic diagram of an ultra-dense networking architecture in accordance with an embodiment of the present invention.

Fig. 3 is a flowchart of a method for resource allocation of ultra-dense networking multi-service slices according to an embodiment of the invention.

Fig. 4 (a) is a schematic diagram of a profit situation when the number of resource blocks is 20 according to an embodiment of the present invention.

Fig. 4 (b) is a schematic diagram illustrating a profit situation when the number of resource blocks is 24 according to an embodiment of the present invention.

Fig. 4 (c) is a schematic diagram illustrating a profit situation when the number of resource blocks is 28 according to an embodiment of the present invention.

Fig. 5 is a schematic diagram showing the comparison between the effects of the madmpg algorithm and the stamperberg game on the average gain impact of the base station for the cross-layer interference price in an embodiment of the invention.

Detailed Description

The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.

It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.

It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.

It is also noted herein that the term "coupled" may refer to not only a direct connection, but also an indirect connection in which an intermediate is present, unless otherwise specified.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar components, or the same or similar steps.

In order to solve the problems of system interference and system cost caused by the improvement of the utilization rate of system wireless resources and the spectrum efficiency in the prior art, namely the problems of dynamic multi-service slicing and power distribution of macro base stations and micro base stations in the ultra-dense networking when sharing a macro base station network, the invention provides a resource distribution method for the multi-service slicing of the ultra-dense networking. The method comprises the following steps S101 to S102:

step S101: and obtaining a multi-agent reinforcement learning model, wherein the model is obtained through training by a preset method.

Step S102: and inputting the state parameters of each micro base station into a multi-agent reinforcement learning model to generate corresponding action strategies so as to realize multi-service slice resource allocation.

As shown in fig. 1, the price is formulated and charged by researching cross-layer interference caused by macro base stations to multiplexing resource blocks of micro base stations, each micro base station carries out dynamic slicing and power distribution after receiving the pricing scheme of the macro base station, on the basis of not influencing macro base station users, the respective economic benefits and communication quality of the macro base station users and the micro base station users are improved, the problems are decomposed into a transmitting power distribution problem and a slicing resource block distribution problem, and the problems are converted into an optimization problem of maximizing the benefits of the macro base station and the micro base station on the premise of guaranteeing the communication quality of the users and reducing the interference.

Fig. 2 is a schematic diagram of a group of ultra-dense networking architecture, where the ultra-dense networking network includes a macro base station and a plurality of micro base stations, where the macro base station covers a range of the plurality of micro base stations, and users of the micro base stations may multiplex slice resources of the macro base station for communication, but may cause cross-layer interference to the macro base station users. The slice manager is responsible for coordinating the number of resource blocks required by each slice, realizing isolation communication among the slices, and the slices can be deployed in a plurality of micro base stations. To reduce communication signaling overhead, each micro base station performs resource allocation to the respective micro base station and connected users independently and in a distributed manner. It should be noted that fig. 1 and fig. 2 only show a group of architecture of ultra-dense networking as a reference, and the number of macro base stations and micro base stations can be adjusted according to specific requirements and/or specific application scenarios in actual use.

In step S101, in order to obtain a multi-agent reinforcement learning model suitable for the application scenario of the present invention, the initial reinforcement learning model needs to be trained. However, considering that the difficulty of training the initial reinforcement learning model to learn the resource allocation strategy is high, the problems that the model is not optimized according to the preset direction and further cannot obtain a better strategy, the training amount, the operation amount and the like can be solved, so that in order to better describe the resource allocation strategy of a user, the operation amount of the initial reinforcement learning model is simplified, the problem of analysis by a reverse induction method is firstly solved, the game balance is solved, the initial reinforcement learning model is guided to train and optimize in the preset direction, and finally the multi-agent reinforcement learning model required by the invention is obtained, so that the resource allocation is realized.

The method for training the initial reinforcement learning model to obtain the multi-agent reinforcement learning model comprises the following steps:

a strategy network and a value network are deployed on each micro base station; each strategy network constructs a state space by taking the transmission rate and the total transmitting power of each user in a corresponding single micro base station as state parameters; acquiring associated parameters used for indicating whether users in all micro base stations multiplex resource blocks in macro base stations or not, and constructing an action space by taking an associated parameter set of each micro base station and a predicted transmitting power set of other micro base stations as action parameters; each micro base station obtains the state parameters of the micro base station, selects corresponding actions according to the strategy network, generates estimated Q values according to the state parameters of the corresponding micro base station, the selected actions and the state parameters and actions of other micro base stations, and is used for carrying out parameter updating on the strategy network of the corresponding micro base station; constructing a loss function of the estimated Q value and the actual Q value of the model by taking the maximized rewarding value as an optimization target, and carrying out parameter updating on the value network; until the preset performance requirement is reached.

In the state updating process, the macro base station constructs a macro base station income calculation formula according to the cross-layer interference price and the cross-layer interference generated by multiplexing resource blocks in the micro base station by the user; the micro base station constructs a micro base station gain calculation formula according to the association parameter, the fixed bandwidth length of the resource block, the signal to interference plus noise ratio, the same-layer interference price, the same-layer interference, the cross-layer interference price and the cross-layer interference; taking the macro base station as a leader, and taking each micro base station as a follower to construct a non-cooperative game; fixing the values of the associated parameters, and solving the benefit calculation formula of the micro base stations by adopting a reverse induction method to obtain the transmission power equalization solution of each micro base station so as to update the state space of each strategy network; and (5) balancing the transmitting power, namely, jie Dairu, and obtaining a cross-layer interference price balancing solution by using a macro base station gain calculation formula.

Specifically, macro base station benefits and micro base station benefits form a set of Stankleber game, wherein the macro base station is a leader and is responsible for making cross-layer interference prices, and the micro base station is a follower and is responsible for giving associated parameters and transmitting power. For the multi-objective joint optimization problem of the model, solving the game equilibrium solution and the design algorithm is more difficult than the single-objective optimization problem, and the traditional iterative algorithm is insufficient for solving the communication problem among multiple base stations. Therefore, in the invention, a reverse induction method is adopted, the value of the associated parameter is fixed at first, the micro base station gain calculation formula is solved, the transmission power equalization solution of each micro base station is obtained, and the result of the transmission power equalization solution is substituted into the macro base station gain calculation formula to obtain the cross-layer interference price equalization solution. After a fixed equalization solution of the transmitting power is obtained, an optimization scheme of resource block association is provided by combining a multi-agent reinforcement learning model, and information of other micro base stations required by the equalization solution is provided by an independent prediction method. The overall flow of optimizing the resource allocation strategy is shown in figure (3).

In some embodiments, macro base station formulated cross-layer interference pricing

The macro base station benefit calculation formula is shown in formula (1):

in the formula (1), U _MBS Representing macro base station revenues; u (U) _UE Representing a set of users for all micro base stations; u (U) _PRB Representing the total number of resource blocks; u (U) _BS Representing a macro base station and a set of all micro base stations;

Wherein cross-layer interference

The calculation formula of (2) is shown as the following formula:

in the formula (2),

representing cross-layer interference; u (U) _s The type of slice is represented; />

Representing the association relationship between the user i and the slice s, the slice resource block j and the micro base station b; />

Representing the transmitting power allocated to the resource block j by the user i at the micro base station b;

the channel gain of the macro base station resource block j multiplexed by the user i at the micro base station b is shown.

wherein ,

the association relationship between the user i and the slice s, the slice resource block j and the micro base station b is shown, which can be specifically understood as follows: if user i uses resource block j belonging to slice s in micro base station b +.>

If user i does not use resource block j,/belonging to slice s in micro base station b>

As shown in the formula (1), the macro base station obtains the sum of charges for multiplexing resource blocks by the users of all the micro base stations to transmit and generate interference to the macro base station. Each micro base station user has independent benefits, but multiplexing frequency spectrum transmission can cause interference and affect the benefits of other micro base station users and macro base station users, so that the macro base station is used as a leader, each micro base station is used as a follower, and the resource allocation problem of the micro base station is constructed into a non-cooperative game. Each micro base station user is considered a selfish, rational player, and the game strategy space consists of a transmit power allocation strategy space and a slice resource block allocation strategy space, and each micro base station attempts to maximize its utility.

In some embodiments, the micro base station benefit is calculated as shown in formula (3), and the constraint is shown in formulas (4) to (9):

s.t.

in the formulas (3) to (9), U ^b Representing the benefit of the micro base station; u (U) _UE,b Representing a set of users of the micro base station b; u (U) _s The type of slice is represented;

representing the association relationship between the user i and the slice s, the resource block j and the micro base station b; b represents the fixed bandwidth length of the resource block; />

The same-layer interference price of the resource block j used by the user i at the micro base station b is represented; />

The same-layer interference caused by the use of the resource block j by the user i at the micro base station b is shown; />

Representing the transmitting power allocated to the resource block j by the user i at the micro base station b; u (U) _UE Representing a set of users for all micro base stations; u (U) _BS Representing a macro base station and a set of all micro base stations; i _max Representing the interference maximum; u (U) _PRB Representing the total number of resource blocks; τ represents the total number of resource blocks.

In each constraint condition, the lower limit of the transmitting power of the micro base station user is indicated by the formula (4); equation (5) shows the maximum value of all users' co-layer interference; equation (6) shows the maximum value of cross-layer interference for all users; equation (7) is a limitation of the number of resource blocks of a slice, and the number of allocated resource blocks in all slices cannot exceed the total number of the resource blocks of the slice; equation (8) shows that the same user is associated with at most one base station; equation (9) is an isolation constraint for a slice, indicating that one resource block can only be allocated to the same user at a time.

In equation (3), co-layer interference

The calculation formula of (2) is shown in the formula (10):

in the formula (10), K _b',b A binary variable indicating whether or not the micro base station b' overlaps or is adjacent to the micro base station b;

representing the association relationship between the user i and the slice s, the resource block j and the micro base station b; />

Representing the transmitting power allocated to the resource block j by the user i at the micro base station b; />

Indicating the channel gain of user i at resource block j of micro base station b'.

The method for solving the transmission power equalization solution and the cross-layer interference price equalization solution comprises the following steps:

fixing associated parameters

And (3) obtaining the maximum value of the micro base station gain, wherein the micro base station gain calculation formula, namely the formula (3) is simplified as follows:

constraint 1:

constraint 2:

to U ^b And (3) deriving to obtain:

obtaining a second derivative, and obtaining:

/>

thus U ^b On the domain is a convex function, where σ represents the white noise of the channel.

From the following components

Obtaining:

and obtaining the transmitting power equalization solution of each micro base station.

From the following components

The constraint of (2) can be obtained:

order the

Let us assume->

It is also necessary to ensure that the macro base station benefit is maximum, and the calculation formula of the macro base station benefit is expressed as:

constraint 1:

constraint 2:

macro base station benefit and cross-layer interference price

Regarding the value interval of (a), a discontinuous optimization problem is presented, so that an index function is introduced first:

Given a given

When U _MBS Becomes a fully micro-functional. In the Stankleberg game, the default macro base station has the ability to collect channel information between the macro base station and each micro base station, i.e./the>

(user i multiplexes the channel gain of macro base station resource block j at micro base station b), the calculation of macro base station benefit can be rewritten as follows:

constraint 1:

constraint 2:

constraint 3:

and has been obtained above

/>

Order the

The KKT condition may be written as:

α,β,γ≥0；

from the KKT condition:

analysis resulted in α=β=0.

The KKT condition is a form of the Lagrangian multiplier method, and is mainly applied to an optimal solving mode under the condition that the optimization function has non-equivalent constraint.

In summary, the macro base station cross-layer interference price equalization solution can be obtained in the following form:

(1) When (when)

In the time-course of which the first and second contact surfaces,

(2) When (when)

In the time-course of which the first and second contact surfaces,

/>

.......

(N) when

In the time-course of which the first and second contact surfaces,

thus, a set of stonelberg betting solutions is obtained:

and />

Due to the +.>

and />

When (I)>

And

is unique and is therefore a nash equilibrium solution.

After a fixed equilibrium solution of the transmitting power is obtained, a state space of a strategy network of the multi-agent reinforcement learning model is set, a multi-agent reinforcement learning model is utilized to provide related parameters of resource blocks, information of other base stations required by the equilibrium solution is provided through an independent prediction method, and resource allocation optimization is carried out in combination with game results.

In some embodiments, the multi-agent reinforcement learning model adopts a madppg algorithm, madppg is an extension of DDPG in multi-agent tasks, and the basic idea is centralized learning and decentralized execution. MADDPG algorithm introduces critic capable of observing global information to guide actor training when model training, and only uses actor with local observation to take action when testing.

The multi-agent reinforcement learning model deploys a strategy network (strategy generated by an actor) and a value network (strategy generated by a critic evaluation actor) on each micro base station, wherein the actor can only acquire the information of the micro base station to which the actor belongs, and the critic can acquire the information of all the micro base stations. Because of the non-cooperative competition relationship among the micro base stations in the invention, the targets of the micro base stations are different, and each micro base station has a strategy network and a value network corresponding to the targets.

In some embodiments, each policy network constructs a state space with the transmission rate and total transmit power of each user in a corresponding single micro base station as state parameters designed as shown in equation (11):

in the formula (11), r _N,j (t) represents the transmission rate of the nth user multiplexing resource block j at the time t;

The user i allocates the transmitting power of the resource block j in the micro base station b; u (U) _j Representing the set of users multiplexing resource block j.

The first N items r _1,j (t),r _2,j (t),…,r _N,j (t) represents the transmission rate of all user multiplexing resource blocks j at the moment t; the n+1th item represents the sum of the transmission powers allocated by the users of all the multiplexing resource blocks j.

wherein ,

for the transmit power equalization solution found above, specific: />

In some embodiments, the association parameters used for indicating whether the user in each micro base station multiplexes the resource blocks in the macro base station are obtained, and an action space is constructed by taking the association parameter set of each micro base station and the predicted transmission power set of other micro base stations as action parameters, wherein the action parameters are designed as shown in a formula (12):

a _j (t)＝{W _j ,P _-j }； (12)

in the formula (11), W _j Representation of

A set of associated parameters; p (P) _-j Representing the predicted set of other micro base station transmit powers, as shown in equation (13) and equation (14):

wherein ,

In some embodiments, the value network of each micro base station generates an estimated Q value according to the state parameters and the selected actions of the corresponding micro base station and the state parameters and actions of other micro base stations, and the state parameters of the value network are shown in formula (15):

s _j '(t)＝(s _j (t),a _j (t),s _-j (t),a _-j (t))； (15)

In the formula (15), s _j (t) represents a state parameter of the micro base station at the time t; a, a _j (t) represents an operation parameter of the micro base station at the time t; s is(s) _-j (t) represents a state parameter set of other micro base stations at the moment t; a, a _-j And (t) represents the action parameter set of other micro base stations at the moment t.

In the training process of the multi-agent reinforcement learning model, each micro base station (actor) randomly samples according to the state of the current moment, selects and executes corresponding actions, and correspondingly, a value network (critic) calculates an estimated Q value according to the state of the micro base station and the selected actions to serve as feedback for the actions of the micro base station. The strategy network updates the strategy according to critic feedback, and the value network builds a loss function according to the estimated Q value and the actual Q value for training. In the invention, critic can acquire global information, namely, the state and action of other micro base stations can be acquired, so as to obtain a more accurate estimated Q value.

In the multi-agent reinforcement learning model test process, each micro base station randomly samples according to the state at the current moment, selects and executes corresponding actions, at the moment, feedback of critic is not needed any more, and the state or action of other micro base stations is not needed to be relied on, so that decentralized execution is realized.

In the present invention, entering a state parameter results in a deterministic action and is therefore a deterministic strategy.

For deterministic policies, constructing a policy gradient for updating the policy network according to the estimated Q value of the value network, wherein the policy gradient is calculated as shown in formula (16):

in formula (16), θ represents a policy parameter; j (u) _j ) Representing a cumulative estimated prize value; d represents an experience playback pool; u (u) _j (a _j |s _j ) Representing an action strategy made by the micro base station according to the state;

representing a value network; s is(s) _j Representing the state of the micro base station estimated by the value network; a, a _j Representing the motion of the micro base station estimated by the value network; s is(s) _other Representing the state of other micro base stations estimated by the value network; a, a _other Representing the actions of other micro base stations estimated by the value network.

For the value network, the parameters need to be updated by constructing a loss function of the estimated Q value and the actual Q value of the model by taking the maximized rewarding value as an optimization target.

For setting the rewarding value, firstly, calculating the total transmission rate of multiplexing resource blocks of each user of the micro base station, wherein the calculation formula is shown in a formula (17):

in the formula (17), U _j Representing a set of users multiplexing resource blocks j; r is (r) _i,j Representing the transmission rate of user i using resource block j; u (U) _s The type of slice is represented;

Representing the association relationship between the user i and the slice s, the slice resource block j and the micro base station b; b represents the fixed bandwidth length of the resource block; />

Indicating that user i is using resource block j at micro base station b to create a signal to interference plus noise ratio.

wherein ,

and (5) performing equalization solution on the transmission power obtained above.

Indicating the transmitting power of the user i allocated to the slice resource block j at the micro base station b; />

The channel gain of the resource block j of the micro base station b of the channel gain user i is shown; />

Representing the transmission power allocated to the resource block j by the user i at the micro base station b'; />

On this basis, in order to represent the competition and cooperation relationship between the micro base station and other micro base stations, the prize value calculation formula is designed as shown in formula (18):

in the formula (18), u _j Representing adaptive weight parameters; r is (r) _j Representing the total transmission rate of each user multiplexing resource block j of the micro base station; n represents the total number of users; r is (r) _-j Representing the transmission rate of other micro base stations;

representing the interference price of the same layer; />

Representing co-layer interference; />

Representing cross-layer interference prices; />

Representing cross-layer interference; u (U) _j Representing the set of users multiplexing resource block j.

Specifically, the adaptive weight parameter is learned according to the state of the global environment, when u _j When=1, the prize value is only related to the transmission rate of the micro base station, and is zero and game; when 0 is<u _j <When 1, the prize value is related to the transmission rate of the micro base station, and also related to the transmission rates of other micro base stations, so as to form the hybrid game.

The loss function constructed according to the estimated Q value and the actual Q value of the model is shown in a formula (19):

in the formula (19), θ represents a policy parameter; u (u) _j Representing adaptive weight parameters; r represents the updated result of the action;

representing a value network; s is(s) _j Representing the state of the micro base station estimated by the value network; a, a _j Representing the motion of the micro base station estimated by the value network; s is(s) _other Representing the state of other micro base stations estimated by the value network; a, a _other Representing the actions of other micro base stations estimated by the value network; />

Representing the actual Q value.

The updated calculation formula is shown in formula (20):

in the formula (20), r _j Representing the total transmission rate of each user multiplexing resource block j of the micro base station;

representing an actual value network; s' _j Representing the actual state of the micro base station; a' _j Representing the actual actions of the micro base station; s' _other Representing the actual state of other micro base stations; a' _other Representing the actual actions of other micro base stations; u's' _j Representing actual adaptive weight parameters; o (o) _j Representing the local information observed by the micro base station.

Updating the parameters of the value network by taking the maximized rewarding value as an optimization target; the strategy network builds a strategy gradient according to the estimated Q value generated by the value network, and updates the strategy gradient to enable the generated action strategy to be more accurate until the initial reinforcement learning model reaches the preset performance requirement, so that the multi-agent reinforcement learning model required by the invention is obtained.

In step S102, the state parameters of each micro base station are input into the trained multi-agent reinforcement learning model, and corresponding action strategies are generated to realize multi-service slice resource allocation.

The invention is described in detail below with reference to one example:

according to the ultra-dense networking multi-service slice resource allocation method provided by the invention, simulation experiments are carried out.

The method is provided with a macro base station and five micro base stations, and the positions of the macro base station and the micro base stations are fixed. The user accesses the base station with the maximum channel gain according to the channel gain of the user. For the micro base station, the number of resource blocks k=24, the number of users is 16, the bandwidth b=10 MHz on each subcarrier, and the transmitting power on each subchannel does not exceed 1W. In the initial reinforcement learning model, the learning rate of actor is 0.02 and the learning rate of critic is 0.01.

In the simulation experiment, the situation of user benefit is explained by changing the number of resource blocks. Specifically, ten step sizes are taken as one round, and the average value of the ten step sizes is taken as an experimental result. As shown in fig. 4 (a) to 4 (c), the first 50 rounds are pre-training phases, and it can be seen that all base stations try to change their own strategy to get higher benefits by themselves until all base stations converge. And as the number of resource blocks increases, the benefit of all micro base stations is increased, but the rate of convergence is slowed down. Wherein the number of resource blocks k=20 of fig. 4 (a), and the number of resource blocks k=24 of fig. 4 (b); the number of resource blocks k=28 of fig. 4 (c).

As shown in fig. 5, the case of the average benefit of the user is explained by changing the cross-layer interference price. According to game theory result, under the condition of unchanged other parameters, the cross-layer interference price and the micro base station income are available

While other resources are allocated at the same time, the average benefit of the micro base station gradually decreases with the increase of the cross-layer interference price, but the decrease amplitude gradually decreases. Meanwhile, the result of the MADDPG algorithm is compared with the theoretical value of the Stankleberg game, and is close to the theoretical value of the game and slightly higher than the theoretical value of the game due to the income self-adaptive adjustment mechanism.

The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a method.

Accordingly, the present invention also provides an apparatus comprising a computer apparatus including a processor and a memory, the memory having stored therein computer instructions for executing the computer instructions stored in the memory, the apparatus implementing the steps of the method as described above when the computer instructions are executed by the processor.

The embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the edge computing server deployment method described above. The computer readable storage medium may be a tangible storage medium such as Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disk, a removable memory disk, a CD-ROM, or any other form of storage medium known in the art.

In summary, the invention provides a method and a device for distributing resources of ultra-dense networking multi-service slices, which are characterized in that a multi-agent reinforcement learning model obtained through pre-training is obtained, and state parameters of each micro base station are input into the multi-agent reinforcement learning model, so that action strategies are correspondingly generated, and the competition and interference of the micro base stations are reduced on the premise of ensuring the communication quality and the communication requirement of users, thereby optimizing the radio resource distribution and relieving the frequency spectrum tension.

Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. The particular implementation is hardware or software dependent on the specific application of the solution and the design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave.

It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown, and those skilled in the art can make various changes, modifications and additions, or change the order between steps, after appreciating the spirit of the present invention.

In this disclosure, features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The ultra-dense networking multi-service slice resource allocation method is characterized in that the ultra-dense networking comprises at least one macro base station, and each macro base station is also connected with a plurality of micro base stations for service; the users of the micro base stations multiplex slice resources of the Ying Hong base stations, and the method is used for carrying out multi-service slice resource allocation based on cross-layer interference generated between the micro base stations and the macro base stations and same-layer interference generated between adjacent micro base stations; the method comprises the following steps:

2. The method for distributing ultra-dense networking multi-service slice resources according to claim 1, wherein the macro base station constructs a macro base station profit calculation formula according to a cross-layer interference price and cross-layer interference generated by multiplexing resource blocks in the micro base station by users, and the macro base station profit calculation formula is as follows:

cross-layer interference indicating that user i uses resource block j at micro base station bDisturbing the price; />

3. The method for distributing resources of ultra-dense networking multi-service slices according to claim 1, wherein the micro base station constructs a micro base station profit calculation formula according to the association parameter, a resource block fixed bandwidth length, a signal-to-interference-plus-noise ratio, a same-layer interference price, same-layer interference, the cross-layer interference price and the cross-layer interference, and the micro base station profit calculation formula is:

s.t.

/>

The same-layer interference caused by the use of the resource block j by the user i at the micro base station b is shown; / >

Representing the transmitting power allocated to the resource block j by the user i at the micro base station b; u (U) _UE Representing a set of users for all micro base stations; u (U) _BS Representing the macro base station and a set of all micro base stations; i _max Representing the interference maximum; u (U) _PRB Representing the total number of resource blocks; τ represents the total number of resource blocks。

4. The method for distributing resources of ultra-dense networking multi-service slices according to claim 1, wherein each policy network constructs a state space by taking a transmission rate of each user in a corresponding single micro base station and a total transmission power as state parameters, wherein the total transmission power uses the transmission power equalization solution, and the state parameters are expressed as:

s _j '(t)＝(s _j (t),a _j (t),s _-j (t),a _-j (t))；

5. The method for allocating resources to ultra-dense networking multi-service slices according to claim 1, wherein the method is characterized in that the method comprises the steps of obtaining association parameters for indicating whether users in each micro base station multiplex resource blocks in macro base stations, and constructing an action space by taking an association parameter set of each micro base station and a predicted transmission power set of each other micro base station as action parameters, wherein the action parameters are expressed as:

a _j (t)＝{W _j ,P _-j }；

wherein ,

/>

6. The method for distributing resources of ultra-dense networking multi-service slices according to claim 1, wherein the value network of each micro base station generates an estimated Q value according to the state parameters and the selected actions of the corresponding micro base station and the state parameters and actions of other micro base stations, constructs a policy gradient according to the estimated Q value, and is used for parameter updating of the policy network of the corresponding micro base station, and the calculation formula of the policy gradient is as follows:

wherein ,

7. The method for distributing resources of ultra-dense networking multi-service slices according to claim 1, wherein a loss function of the estimated Q value and the actual Q value of the model is constructed by taking a maximized rewarding value as an optimization target, and parameters of the value network are updated, and the calculation formula of the loss function is as follows:

wherein ,

representing the loss function; θ represents a policy parameter; u (u) _j Representing adaptive weight parameters; r represents the updated result of the action; />

Representing the value network; s is(s) _j Representing the state of the micro base station estimated by the value network; a, a _j Representing the motion of the micro base station estimated by the value network; s is(s) _other Representing the priceThe state of other micro base stations estimated by the value network; a, a _ohter Representing the actions of other micro base stations estimated by the value network; />

Representing the actual Q value.

8. The method for distributing resources of ultra-dense networking multi-service slices according to claim 7, wherein a loss function of the estimated Q value and the actual Q value of the model is constructed by taking a maximized rewarding value as an optimization target, and parameters of the value network are updated, wherein the calculating formula of the rewarding value is as follows:

representing the interference price of the same layer; />

Representing co-layer interference; />

Representing cross-layer interference prices; />

9. The method for distributing resources of ultra-dense networking multi-service slices according to claim 8, wherein the adaptive weight parameters are learned according to the state of a global environment;

10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 9.