CN113163479A

CN113163479A - Cellular Internet of things uplink resource allocation method and electronic equipment

Info

Publication number: CN113163479A
Application number: CN202110164357.3A
Authority: CN
Inventors: 孙德栋; 欧清海; 张宁池; 姚贤炯; 王艳茹; 刘椿枫; 李温静; 丰雷; 刘卉; 马文洁; 张洁; 陈毅龙; 郭丹丹; 佘蕊; 杨志祥; 王志强; 贺军
Original assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing University of Posts and Telecommunications; State Grid Shanghai Electric Power Co Ltd; State Grid Shaanxi Electric Power Co Ltd; Beijing Zhongdian Feihua Communication Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing University of Posts and Telecommunications; State Grid Shanghai Electric Power Co Ltd; State Grid Shaanxi Electric Power Co Ltd; Beijing Zhongdian Feihua Communication Co Ltd
Priority date: 2021-02-05
Filing date: 2021-02-05
Publication date: 2021-07-23

Abstract

One or more embodiments of the present specification provide a cellular internet of things uplink resource allocation method and an electronic device, where the method includes: each edge node and each direct transmission node of the cellular Internet of things are used as agents, and the agents select an action space A by adopting a search-utilization strategy according to the current system state_iAction a in_iAnd executing; according to the executed action a_iCalculating the reward value of each agent through a reward function; determining a Q function of the intelligent agent in the current system state according to the Q function of the intelligent agent, and enabling the intelligent agent to enter the next system state from the current system state; determining that the agent performs action a based on the agent's estimation policy, the average estimation policy_iA temporal average estimation strategy and an estimation strategy; optimal according to the agent up to a preset number of iterationsAnd estimating a strategy, and performing resource allocation on the uplink resources of the cellular Internet of things. The method provided by the disclosure can realize effective resource allocation of the uplink resources of the cellular Internet of things.

Description

Cellular Internet of things uplink resource allocation method and electronic equipment

Technical Field

One or more embodiments of the present disclosure relate to the field of wireless communication technologies, and in particular, to an uplink resource allocation method for a cellular internet of things and an electronic device.

Background

As one of the three major application scenarios of 5G, mass machine type communication (mtc) is intended to provide connectivity for large-scale internet of things (IoT) devices. The mMTC supports more than 100 million connections of devices with various QoS requirements per square kilometer, brings opportunities for the interconnection of everything, and simultaneously provides new challenges for the aspects of spectrum utilization rate, transmission delay, data throughput and the like. Non-orthogonal multiple access (NOMA) is considered a key technology that can effectively address these challenges. Compared with the traditional orthogonal multiple access technology, the NOMA can improve the spectrum efficiency, reduce the access delay and the signaling overhead and has more advantages when supporting mass connection by utilizing the new power and the coding domain to carry out non-orthogonal resource allocation on the limited resources among the devices. The basic idea of NOMA is to use non-orthogonal transmission at the transmitting end, actively introduce interference information, and demodulate at the receiving end by Successive Interference Cancellation (SIC) technique. SIC can well improve the spectrum efficiency and effectively enhance the network capacity of an uplink and a downlink. In view of the unique advantages of NOMA, NOMA is currently incorporated into the technical part of the 5G mtc standard by 3GPP, and resource management in NOMA is also a hot research issue in the field of wireless communication.

At present, because the performance of internet of things equipment in a large-scale cellular internet of things application scene is generally poor, a Successive Interference Cancellation (SIC) technology in NOMA transmission cannot be completed, and a relay node and a base station which are used for forwarding cannot perform effective communication; meanwhile, complicated interference situation occurs in NOMA frequency spectrum resource sharing, so that effective resource allocation cannot be performed on uplink resources of the cellular Internet of things.

Disclosure of Invention

In view of this, one or more embodiments of the present disclosure provide a cellular internet of things uplink resource allocation method and an electronic device, so as to solve the problem that effective resource allocation cannot be performed on cellular internet of things uplink resources.

In view of the foregoing, one or more embodiments of the present specification provide a method for allocating uplink resources of a cellular internet of things, including:

taking each edge node and each direct transmission node of the cellular Internet of things as an intelligent agent, and executing the following operations on the intelligent agent until the preset iteration times are reached:

the agent selects an action space A by adopting an exploration-utilization strategy according to the current system state of the agent_iAction a in_iAnd performing the action a_i；

According to the action a executed_iCalculating a reward value for each of the agents by a reward function; and

determining a Q function of the intelligent agent in the current system state according to the Q function of the intelligent agent, and enabling the intelligent agent to enter the next system state from the current system state;

determining that the agent performs the action a based on an estimation strategy, an average estimation strategy, of the agent_iA temporal average estimation strategy and an estimation strategy; and

in response to the determinationThe agent performs the action a_iThe estimated strategy value is larger than the average estimated strategy value, and the learning rate delta is used_wAdjusting the current estimation strategy, otherwise using the learning rate delta_lAdjusting the current estimation strategy, where δ_l>δ_w；

The above operations executed by the agent reach the preset iteration times to obtain the optimal estimation strategy;

and according to the optimal estimation strategy, performing resource allocation on the uplink resources of the cellular Internet of things.

Further, taking each edge node and each direct transmission node of the cellular internet of things as an agent, executing the following operations on the agent until a preset iteration number is reached, and the method further comprises the following steps:

recording the initial Q function value of the intelligent agent as 0, and determining a counter X for recording the occurrence frequency of the system state S_i(S), and an initial estimation strategy of the agent pi (S, a)_i) Mean estimation strategy

Wherein the initial estimation strategy

Initial mean estimation strategy

Further, the system state S is determined by the state S of the direct transfer node_wAnd the state s of the edge node_nWherein S ═ { S ═ S_w,s_n,w∈W,n∈N}；

In particular, the state s of the direct transfer node_wChannel allocation coefficient lambda comprising said direct transfer node_w,cState s of said edge node_nChannel allocation coefficient eta including said edge node n_n,r,cAnd a transmission power control coefficient theta_nWherein λ is_w,c＝{0,1}，s_w＝{λ_w,c,w∈W,c∈C}，η_n,r,c＝{0,1}，θ_n＝{0.0,0.2,0.4,0.6,0.8,1.0}，s_n＝{η_n,r,c,θ_n,n∈N,r∈R,c∈C}。

Further, the reward function is noted as rew (S, a)_i) If the agent is an edge node, the reward function rew (S, a)_i) The algorithm is as follows:

if the agent is a direct transfer node, the reward function rew (S, a)_i) The algorithm is as follows:

further, the method for determining the Q function in the current system state of the agent comprises:

recording said Q function as Q_i(S,a_i)，

Wherein, delta_qRepresents the Q function learning rate, beta represents the jackpot discount coefficient,

respectively the next arriving system state and the action performed.

Further, the exploration-utilization strategy is specifically a greedy strategy epsilon-greedy, and the calculation method of the greedy strategy is as follows:

selective action a of agent i given system state S_iIs denoted as p (a)_i|S)，p(a_i| S) algorithm is as follows:

wherein ε represents the action selection probability, and 0<ε<1，Q_i(S,a_i) Denotes the Q function, A_i(S) represents the number of actions that agent i can perform in system state S.

Further, the determining that the agent performs action a_iThe calculation method of the time average estimation strategy comprises the following steps:

the determination that the agent performs action a_iThe calculation method of the time estimation strategy comprises the following steps:

wherein the content of the first and second substances,

the step size of the updating of the estimation strategy is represented, and the calculation method comprises the following steps:

where δ is the learning rate, δ takes a different value depending on the following two cases,

further, the method also comprises the following steps: determining a signal transmission model for communication among the edge node, the direct transfer node, the relay node and the base station based on a non-orthogonal multiple access (NOMA) technology and an Open Mobile Alliance (OMA) technology, wherein the signal transmission model specifically comprises:

determining N edge nodes, R relay nodes, W direct transmission nodes, and C channels under the base station, where N is {1,2,3, …, N }, R is {1,2,3, …, R }, W is {1,2,3, …, W }, and C is {1,2,3, …, C };

the relay node receives a signal sent by the edge node through NOMA technology to obtain a first signal y_rThe first signal y_rThe algorithm is as follows:

wherein H_n,rRepresenting the channel gain, θ, of the edge node n to the relay node r_nRepresenting the transmission power control coefficient, P, of the edge node n_nRepresenting the maximum transmit power, S, of the edge node n_nRepresenting signals from edge nodes n, eta_n,r,cDenotes a channel allocation coefficient, ξ denotes an additive white Gaussian noise signal, and

σ²representing additive white Gaussian noise power, wherein N belongs to N, and R belongs to R;

further, H_n,rThe algorithm is as follows:

wherein the content of the first and second substances,

representing small-scale fading of the channel to the relay node r of the edge node n and satisfying a gaussian distribution

d_n,rDenotes the distance from the edge node n to the relay node r, λ is the path loss exponent;

the base station receives the first signal sent by the relay node through the OMA technology and the signal sent by the direct transfer node through the NOMA technology, and the signal is obtained by decoding SIC through the successive interference cancellation technologySecond signal y_BSThe second signal y_BSThe algorithm is as follows:

wherein H_w,BSRepresenting the channel gain, H, from the direct transfer node w to the base station_r,BSRepresenting the channel gain, P, from the relay node r to the base station_wRepresenting the transmission power, S, of the direct-transfer node_wRepresenting signals from direct-transfer nodes, λ_w,cDenotes the channel allocation coefficient, mu_rIs a relay gain factor;

H_w,BSthe algorithm is as follows:

wherein the content of the first and second substances,

representing small-scale fading of the channel from the direct transfer node w to the base station and satisfying a gaussian distribution

d_w,BSRepresents the distance from the direct transfer node w to the base station;

H_r,BSthe algorithm is as follows:

wherein the content of the first and second substances,

representing small scale fading of the relay node to base station channel and satisfying a gaussian distribution

d_r,BSRepresents the distance from the relay node r to the base station;

based on Shannon's theorem, calculating the receiving rate R of the base station for receiving the second signal_sumThe receiving rate R_sumThe algorithm is as follows:

where B denotes the channel bandwidth, τ_nIndicating that the signal sent by the edge node n is amplified and forwarded by the relay node r on the channel c, and the received signal-to-noise ratio, tau, at the base station_wThe signal transmitted by the direct transmission node w reaches the base station through the channel c, and the receiving signal-to-noise ratio at the base station is represented;

in particular, tau_nThe calculation method comprises the following steps:

wherein H_i,rDenotes the channel gain, θ, from edge node i to relay node r_iRepresenting the transmission power control coefficient, P, of the edge node i_iRepresenting the maximum transmit power, σ, of the edge node i²For additive white Gaussian noise power, i belongs to N, i is not equal to N, and theta_iP_i<θ_nP_n；

τ_wThe calculation method comprises the following steps:

further, the method also includes, after:

limiting the transmission power of the edge node multiplexing the same channel specifically includes:

when eta_n,r,cWhen 1, satisfy

Wherein, P_totnFor the threshold value of the transmission power, i ≠ n, α_iP_i<α_nP_n；

Determining that each transmission link meets the QoS requirement of the system, and specifically meeting the following conditions:

τ_n,τ_w≥τ_o,

wherein, tau_oA minimum value representing a received signal-to-noise ratio;

limiting each edge node, the direct transfer node and the relay node to only allocate one channel, and specifically satisfying the following conditions:

limiting the number of the edge nodes accessed by each channel, and specifically meeting the following conditions:

r∈R

wherein q is_maxRepresenting the maximum number of edge nodes allowed to access per channel.

Based on the same inventive concept, one or more embodiments of the present specification further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method as described in any one of the above items when executing the program.

As can be seen from the above description, in one or more embodiments of the present disclosure, each edge node and each direct transmission node are regarded as an agent, each agent performs its own action according to the state of the whole system, when the reward obtained by the agent is worse than expected, the learning rate can be quickly adjusted to adapt to the policy change of other agents, when the reward obtained is better than expected, the agent learns cautiously, and the time for the policy change is adapted to other agents, and finally, each agent can converge to the optimal estimation policy, and perform resource allocation on each edge node and each direct transmission node based on the optimal estimation policy.

Drawings

In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.

Fig. 1 is a flowchart of a method for allocating uplink resources in a cellular internet of things according to one or more embodiments of the present disclosure;

FIG. 2 is a flow diagram of determining a signal transmission model in accordance with one or more embodiments of the present disclosure;

FIG. 3 is a flow diagram of optimizing a signal transmission model in accordance with one or more embodiments of the present disclosure;

fig. 4 is a schematic structural diagram of a cellular internet of things uplink resource allocation device according to one or more embodiments of the present disclosure;

fig. 5 is a schematic structural diagram of an electronic device according to one or more embodiments of the present disclosure.

Detailed Description

For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.

It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items.

As described in the background section, the existing NOMA-based cellular internet of things application scenario cannot perform efficient allocation of uplink resources. In the process of implementing the present disclosure, the applicant finds that a relay node and a base station for forwarding cannot perform effective communication due to poor performance of the internet of things device; meanwhile, complicated interference situation occurs in NOMA frequency spectrum resource sharing, and finally uplink resources cannot be effectively allocated.

Hereinafter, the technical means of the present disclosure will be described in further detail with reference to specific examples.

WoLF in the agent reinforcement learning algorithm (WoLF-PHC) means that parameters need to be adjusted only slowly when the behavior of an agent is better than an expected value, and the speed of adjusting parameters needs to be increased when the behavior of an agent is worse than the expected value. The PHC is a learning algorithm of a single intelligent agent in a stable environment, and through reinforcement learning, the selection probability of the action which can be maximally accumulated and expected is increased, and finally the optimal strategy can be converged.

Under the same base station of the same cell, an edge node represents edge terminal node equipment, a direct transfer node represents direct transmission terminal node equipment, and a relay node represents relay forwarding node equipment; the relay node and the direct transfer node have good channel conditions and can directly communicate with the base station, while the edge node with poor channel conditions in the cell cannot directly communicate with the base station and needs to communicate with the base station through the relay node in an amplification forwarding mode.

Referring to fig. 1, an uplink resource allocation method for a cellular internet of things according to an embodiment of the present specification includes the following steps:

step S101: taking each edge node and each direct transmission node of the cellular Internet of things as an agent, and executing the following operations of step S102-step S104 for each agent until the preset iteration number is reached.

Before the step, the initial value of the Q function of the intelligent agent is recorded as 0, and the Q function is determinedCounter X for recording the number of occurrences of a system state S_i(S), and an initial estimation strategy of the agent pi (S, a)_i) Mean estimation strategy

Wherein the initial estimation strategy

Initial mean estimation strategy

The estimation strategy represents the probability of selecting each action under a given system state, and the average estimation strategy is a standard for measuring the estimation strategy, so that the estimation strategy is changed to the optimal estimation strategy.

Wherein, a_iRepresenting an action space A performed by an agent_iThe system state S is determined by the state S of the direct transfer node_wAnd the state s of the above-mentioned edge node_nIs expressed as S ═ S_w,s_n,w∈W,n∈N}。

Further, the state s of the direct transfer node_wChannel allocation coefficient lambda comprising said direct transfer node_w,_cState s of the above edge node_nChannel allocation coefficient eta including the edge node n_n,r,cAnd a transmission power control coefficient theta_nWherein, in the step (A),

λ_w,c＝{0,1}

s_w＝{λ_w,c,w∈W,c∈C}

η_n,r,c＝{0,1}

θ_n＝{0.0,0.2,0.4,0.6,0.8,1.0}

s_n＝{η_n,r,c,θ_n,n∈N,r∈R,c∈C}

step S102: the agent selects an action space A by adopting an exploration-utilization strategy according to the current system state of the agent_iAction a in_iAnd executed.

In this step, an action space A_iComprises thatThe following actions: adjusting signal transmission channels, adjusting connected relay nodes, and adjusting transmission power control. For example, there is an agent i, action a_i∈A_iIf the agent i directly transmits the node, lambda needs to be adjusted_w,cIf the agent is an edge node, adjusting the channel allocation coefficient eta_n,r,cAnd a transmission power control coefficient theta_nAnd (4) finishing.

The exploration is carried out, namely a greedy strategy (epsilon-greedy) is selected by utilizing the strategy, and an action space A is selected by utilizing the greedy strategy (epsilon-greedy)_iAction a in_iThe specific calculation method is as follows:

That is, agent i will be at ε (0)<ε<1) Probability of selecting an action space A in the System State S_iAny of the actions.

Step S103: according to the executed action a_iCalculating the reward value of each agent through a reward function; and determining the Q function of the intelligent agent in the current system state according to the Q function of the intelligent agent, and enabling the intelligent agent to enter the next system state from the current system state.

In this step, after each agent has performed the action, the system calculates the corresponding reward value of the agent, and takes the received snr of the transmitted signal at the base station as the reward of the agent, specifically, the reward function is recorded as rew (S, a)_i) If the agent is an edge node, then the reward function rew (S, a)_i) The algorithm of (1) is as follows:

if the agent is a direct transfer node, then the reward function rew (S, a)_i) The algorithm of (1) is as follows:

it will be appreciated that the greater the received signal-to-noise ratio value, the greater the reward value received by the agent. Each agent only needs to observe the state at the current moment without observing the action executed by other agents and the acquired reward value, and takes corresponding action to generate corresponding influence on the system, so that the system enters a new system state at the next moment.

Agent updates the Q function Q (S, a) at this time_i) The specific algorithm is as follows:

the system state reached and the action performed at the next moment, respectively.

Step S104: determining that the agent performs action a based on the agent's estimation policy, the average estimation policy_iA temporal average estimation strategy and an estimation strategy; and performing action a in response to determining that the agent performs action a_iThe estimated strategy value is larger than the average estimated strategy value, and the learning rate delta is used_wAdjusting the current estimation strategy, otherwise using the learning rate delta_lAdjusting the current estimation strategy, where δ_l>δ_w。

In this step, the currently executed action a is updated_iTime-averaged estimation strategy

The calculation method comprises the following steps:

further, updating the currently performed action a_iThe calculation method of the time estimation strategy is as follows:

wherein the content of the first and second substances,

in particular, the estimation strategy of agent i is pi_i(S,a_i) And average estimation strategy

By comparison, if satisfied

Then it is considered as an estimation strategy pi_i(S,a_i) Better, and vice versa, average estimation strategy

And more preferably. If the current action a_iIf the operation is not the one for maximizing the Q function value

Is a negative number, and vice versa

Is positive, thereby increasing the probability of selection of the action that maximizes the Q function value.

Step S105: and when the occurrence frequency of the system state reaches a preset iteration frequency, obtaining the optimal estimation strategy of the intelligent agent, and performing resource allocation on the cellular Internet of things uplink resource according to the optimal estimation strategy.

In summary, when the estimation strategy is better, the learning rate of the estimation strategy update becomes slower; when the average estimation strategy is better, the learning efficiency of the estimation strategy update becomes faster. I.e. when the behaviour of the agent is better than expected, the delta is passed_wMake slow adjustments to the parameters by delta when the agent's behavior is worse than expected_lAnd rapidly adjusting parameters.

Therefore, the method provided by the embodiment is an uplink resource allocation scheme for online reinforcement learning, and in consideration of a complex interference situation caused by NOMA spectrum resource sharing, in actual complex cellular Internet of things communication, when the number of terminal devices is gradually increased, high computational complexity is caused. However, the multi-agent reinforcement learning algorithm model can enable the system to converge into a stable resource allocation scheme within a specified iteration number. Therefore, the method and the device can realize effective resource allocation of the uplink resources of the cellular Internet of things.

It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities.

As an optional embodiment, step S101 further includes, before: and determining a signal transmission model for communication between the edge node, the direct transfer node and the relay node and the base station based on the NOMA technology and the OMA technology.

With reference to fig. 2, the signal transmission system model specifically includes:

step S201: n edge nodes, R relay nodes, W direct transfer nodes and C channels under a base station are determined.

In this step, N ═ 1,2,3, …, N }, R ═ 1,2,3, …, R }, W ═ 1,2,3, …, W }, C ═ 1,2,3, …, C }

Step S202: the relay node receives a signal sent by an edge node through the NOMA technology to obtain a first signal y_r。

In this step, the first signal y_rThe algorithm is as follows:

wherein H_n,rRepresenting the channel gain, θ, from the edge node n to the relay node r_nRepresenting the transmission power control coefficient, P, of the edge node n_nRepresenting the maximum transmit power, S, of the edge node n_nRepresenting signals from edge nodes n, eta_n,r,cDenotes a channel allocation coefficient, ξ denotes an additive white Gaussian noise signal, and

σ²and the power of the additive white Gaussian noise signal is expressed, N belongs to N, and R belongs to R.

Further, H_n,rThe algorithm is as follows:

wherein the content of the first and second substances,

d_n,rDenotes the distance from the edge node n to the relay node r, and λ is the path loss exponent.

The edge node and the base station need to be transmitted through two hops when communicating, the edge node sends a signal to the relay node as a first hop, the edge node can multiplex the same subchannel to transmit information to the relay node by using a NOMA mode, the edge node multiplexing the same subchannel can execute NOMA power control in the transmission process, and the signal can be demodulated through a SIC technology when finally reaching the base station through a relay amplification-and-forward (AF) mode.

Step S203: the base station receives a first signal sent by the relay node through the OMA technology and a signal sent by the direct transfer node through the NOMA technology to obtain a second signal y_BS。

In this step, the second signal y_BSThe algorithm is as follows:

wherein H_w,BSRepresenting the channel gain, H, from the direct transfer node w to the base station_r,BSRepresenting the channel gain, P, from the relay node r to the base station_wRepresenting the transmission power, S, of the direct-transfer node_wRepresenting signals from direct-transfer nodes, λ_w,cDenotes the channel allocation coefficient, mu_rIs the relay gain factor.

H_w,BSThe algorithm is as follows:

wherein the content of the first and second substances,

d_w,BSRepresenting the distance from the direct transfer node w to the base station.

H_r,BSThe algorithm is as follows:

wherein the content of the first and second substances,

d_r,BSIndicating the distance from the relay node r to the base station.

Further, λ is the channel c assigned to the direct transfer node w for transmitting signals to the base station_w,c1, otherwise λ_w,c＝0。

It can be understood that the second hop refers to sending out a signal from the relay node to the base station, and considering the performance problem of the relay, the second hop directly adopts the AF mode to transmit the signal in the OMA mode, in the AF mode, the relay node only receives the signal from the edge node and amplifies and transmits the signal to the base station, and the SIC decoding operation is performed by the base station without any encoding operation on the signal.

Step S204: based on Shannon's theorem, calculating the receiving rate R of the base station for receiving the second signal_sum。

In this step, the receiving rate R is_sumThe algorithm is as follows:

in particular, tau_nThe calculation method comprises the following steps:

wherein H_i,rDenotes the channel gain, θ, from edge node i to relay node r_iRepresenting the transmission power control coefficient, P, of the edge node i_iRepresenting the maximum transmit power, σ, of the edge node i²For additive white Gaussian noise power, i belongs to N, i is not equal to N, and theta_iP_i<θ_nP_n。

H_i,rAnd the above H_n,rThe calculation methods are the same, and are not described herein again.

τ_wThe calculation method comprises the following steps:

as an alternative embodiment, in conjunction with fig. 3, step S204 may further include the following steps:

step S301: the transmission power of edge nodes multiplexing the same channel is limited.

The method specifically comprises the following steps:

when eta_n,r,cWhen the number is equal to 1, the alloy is put into a container,

wherein, P_totnFor the threshold value of the transmission power, i ≠ n, θ_iP_i<θ_nP_n。

That is, the difference between the power of the edge node n minus the powers of all edge points smaller than the power of the edge node n must be larger than the threshold value P of the transmission power_totnThe threshold value P of the transmission power can be adjusted according to actual conditions_totnThe setting is performed.

Step S302: it is determined that each transmission link satisfies system quality of service (QoS) requirements.

In this step, the conditions to be satisfied are as follows:

τ_n,τ_w≥τ_o,

wherein, tau_oRepresenting the minimum value of the received signal-to-noise ratio.

It can be understood that if each transmission link is required to satisfy the QoS requirement of the system, the above condition is satisfied, and τ can be adjusted according to the actual situation_oThe value is set, and is not particularly limited herein.

Step S303: and limiting each edge node, the direct transmission node and the relay node to be allocated with only one channel.

In this step, the conditions to be satisfied are as follows:

step S304: limiting the number of access edge nodes per channel.

In this step, the conditions to be satisfied are as follows:

r∈R

The embodiment is system optimization performed for a hybrid transmission system model, and ensures that a base station can successfully decode a received signal by using the SIC technology.

It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.

It should be noted that the above description describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Based on the same inventive concept, corresponding to any of the above embodiments, one or more embodiments of the present specification further provide a cellular internet of things uplink resource allocation device.

Referring to fig. 4, the uplink resource allocation apparatus for cellular internet of things includes:

the estimation strategy iteration module 401: the method comprises the following steps that each edge node and each direct transfer node of the cellular Internet of things are used as agents, and the following operations are executed on the agents until the preset iteration number is reached: the agent selects an action space A by adopting an exploration-utilization strategy according to the current system state of the agent_iAction a in_iAnd performing the action a_i(ii) a According to the action a executed_iCalculating a reward value for each of the agents by a reward function; determining a Q function of the intelligent agent in the current system state according to the Q function of the intelligent agent, and enabling the intelligent agent to enter the next system state from the current system state; determining that the agent performs the action a based on an estimation strategy, an average estimation strategy, of the agent_iA temporal average estimation strategy and an estimation strategy; and performing the action a in response to determining that the agent performs the action a_iThe estimated strategy value is larger than the average estimated strategy value, and the learning rate delta is used_wAdjusting the current estimation strategy, otherwise using the learning rate delta_lAdjusting the current estimation strategy, where δ_l>δ_w(ii) a And the above operations executed by the intelligent agent reach the preset iteration times to obtain the optimal estimation strategy.

The uplink resource allocation module 402: and the method is configured to perform resource allocation on the uplink resources of the cellular Internet of things according to the optimal estimation strategy.

For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the modules may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.

The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Fig. 5 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.

The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.

The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).

Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.

It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.

The electronic device of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Computer-readable media of the present embodiments, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.

While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic ram (dram)) may use the discussed embodiments.

It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A cellular Internet of things uplink resource allocation method is characterized by comprising the following steps:

performing the action a in response to determining that the agent performs the action_iThe estimated strategy value is larger than the average estimated strategy value, and the learning rate delta is used_wAdjusting the current estimation strategy, otherwise using the learning rate delta_lAdjusting the current estimation strategy, where δ_l>δ_w；

2. The method according to claim 1, wherein each edge node and each direct transfer node of the cellular internet of things are used as agents, and the following operations are performed on the agents until a preset number of iterations is reached, and before:

Wherein the initial estimation strategy

Initial mean estimation strategy

3. The method of claim 2, wherein the system state S is defined by a state S of the pass-through node_wAnd the state s of the edge node_nWherein S ═ { S ═ S_w,s_n,w∈W,n∈N}；

4. A method according to claim 3, characterized by recording said reward function as rew (S, a)_i) If the agent is an edge node, the reward function rew (S, a)_i) The algorithm is as follows:

5. the method of claim 4, wherein the Q function calculation method for determining the current system state of the agent is:

recording said Q function as Q_i(S,a_i)，

respectively the next arriving system state and the action performed.

6. The method according to claim 5, wherein the exploration-utilization strategy is specifically a greedy strategy epsilon-greedy, and the greedy strategy is calculated by:

7. The method of claim 6, wherein the determining that the agent performs action a_iThe calculation method of the time average estimation strategy comprises the following steps:

wherein the content of the first and second substances,

8. the method of claim 2, further comprising, prior to the method: determining a signal transmission model for communication among the edge node, the direct transfer node, the relay node and the base station based on a non-orthogonal multiple access (NOMA) technology and an Open Mobile Alliance (OMA) technology, wherein the signal transmission model specifically comprises:

wherein H_n,rRepresenting the channel gain, θ, of the edge node n to the relay node r_nRepresenting the transmission power control coefficient, P, of the edge node n_nRepresenting the maximum transmit power, S, of the edge node n_nRepresenting signals from edge nodes n, eta_n,r,cRepresenting channel segmentsThe coefficient, ξ, represents an additive white Gaussian noise signal, and

further, H_n,rThe algorithm is as follows:

wherein the content of the first and second substances,

the base station receives the first signal sent by the relay node through the OMA technology and the signal sent by the direct transfer node through the NOMA technology, and a second signal y is obtained by decoding through a Successive Interference Cancellation (SIC) technology_BSThe second signal y_BSThe algorithm is as follows:

H_w,BSthe algorithm is as follows:

wherein the content of the first and second substances,

H_r,BSthe algorithm is as follows:

wherein the content of the first and second substances,

d_r,BSRepresents the distance from the relay node r to the base station;

in particular, tau_nThe calculation method comprises the following steps:

τ_wThe calculation method comprises the following steps:

9. the method of claim 8, further comprising, after the method:

when eta_n,r,cWhen 1, satisfy

wherein, tau_oA minimum value representing a received signal-to-noise ratio;

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 9 when executing the program.