WO2007036003A1

WO2007036003A1 - Reinforcement learning for resource allocation in a communications system

Info

Publication number: WO2007036003A1
Application number: PCT/AU2006/001433
Authority: WO
Inventors: Nimrod Lilith; Kutluyil Dogancay
Original assignee: University Of South Australia
Priority date: 2005-09-30
Filing date: 2006-10-03
Publication date: 2007-04-05

Abstract

The present invention provides methods using reinforcement learning to efficiently allocate resources such as channels in a communications system. In particular the method describes reinforcement learning agent-based solutions to the problems of call admission control (CAC) and dynamic channel allocation (DCA) in multi-cellular telecommunications environments featuring multi-class traffic and inter-cell handoffs. Both agents providing the CAC and DCA functionality make use of an on-policy reinforcement learning technique known as SARSA and are designed to be implemented at the cellular level in a distributed manner.

Description

REINFORCEMENT LEARNING FOR RESOURCE ALLOCATION IN A

COMMUNICATIONS SYSTEM

TECHNICAL FIELD OF THE INVENTION

The present invention is related to communication systems and more particularly to resource allocation in communication systems carrying multi- class traffic.

BACKGROUND

Cellular telecommunication systems organise a geographical area into a number of substantially regularly sized cells, each with its own base station. By adopting a system using a large number of low power transmitters and receivers rather than a single high power transceiver the capacity of a given area for calls from users within any of the cells can be greatly increased compared to a single large cell approach.

The available bandwidth at each cell is divided into a number of channels, which may be time slots or frequencies (TDM, FDM, or CDMA), each of which may be assigned to a call. Using a cellular system allows a given channel to be assigned simultaneously to multiple calls, as long as each assigning cell is at least a given distance apart, in order to avoid co-channel interference. This distance is termed the 'reuse distance'.

Most modern mobile communication systems use a Fixed Channel

Assignment (FCA) strategy, whereby channels are pre-allocated to given cells according to a regular pattern that minimises the distance between co-channel cells, i.e. cells that may assign the same channel to a call, whilst not violating the channel reuse distance constraint. Ongoing calls may move spatially about the cellular domain and this movement can lead to a call leaving one cellular area and entering an adjacent one. This then requires resources in the cell entered to be allocated to the new call and subsequently any resources in the cell left may be freed, a process known as 'hand off. If resources cannot be allocated in the newly entered cell, then the handoff is blocked.

However the downside of FCA is that the pre-allocations are static, or at least fixed whilst the system is operational, and thus an efficient allocation for an estimated offered traffic pattern cannot adapt to efficiently accommodate any variances from that estimated traffic load.

In contrast to FCA, Dynamic Channel Assignment (DCA) strategies do not permanently pre-allocate given channels to particular cells. Instead channels are assigned to cells as they are required, as long as these assignments do not violate the channel reuse constraint. This flexibility in channel assignment allows a cellular system to take advantage of possible stochastic variations in offered call traffic over a given area.

It is thus proposed by the inventors to use DCA strategies, however, one problem for DCA systems is how to efficiently determine which of the available channels to assign to a new call. Well performing channel assignment schemes are generally computationally complex, and simpler schemes tend to perform less efficiently or are inflexible. Broadly speaking there exists a trade-off between implementation complexity and system performance, with complex heuristics having to deal with concepts such as channel ordering, borrowing and locking, and having to take into account system information over a wide area of multiple cells or even system-wide.

It is also proposed by the inventors to use a method to selectively admit new call requests, a procedure known as Call Admission Control (CAC), which as proposed further increases system efficiency by not allowing connection patterns that reduce overall carrying capacity through inefficient channel allocation patterns. CAC can also improve system performance in the case of different types or classes of calls, for example it may be desired that handoff calls be prioritised as they are considered to be higher priority than new call requests as dropping an ongoing call is less desirable than blocking a new call in the perception of users.

Call admission control can be implemented using guard channel schemes. Guard channel schemes prioritise call handoffs by reserving a portion of bandwidth for assignment to handoff requests. The amount of bandwidth to reserve depends upon traffic conditions, therefore adaptive algorithms should be preferred as microcellular systems may be highly dynamic environments due to the increase in call handoffs.

In light of the features provided by telecommunications networks and demands of consumers many current telecommunications systems now carry data traffic along with voice, which is fundamentally different to voice traffic in a number of ways. Two key properties of data traffic flows are that they are commonly statistically self -similar and highly bursty (impulsive). This can lead to the appearance of sudden traffic 'hot-spots' that may rapidly appear and disappear. If resources are not allocated efficiently then the number of calls, or the revenue raised, may not be maximised.

Reinforcement learning (RL), or Neuro-Dynamic programming (NDP) as it is also known, is an intelligent technique that learns through trial and error. Reinforcement learning shows particular promise as a means of solving problems characterised by large state-spaces and uncertain environment dynamics. By building up an internal representation of its environment through interactions with it, a reinforcement learning agent can over time formulate an efficient policy of action with a particular goal in mind. An example of a reinforcement learning agent is shown in Figure 1.

Reinforcement learning can guarantee a convergence to an optimal policy only when applied to a Markovian, i.e., memoryless, environment. However, data traffic, as opposed to voice traffic, cannot be described accurately by memory less probability distributions. This limitation may be overcome by refraining a problem to include extra environmental information in order to produce independent state-transitions, although this may lead to an explosion in the magnitude of the state-space that needs to be traversed by an RL agent through Bellman's curse of dimensionality.

Reinforcement learning is thus an attractive candidate for the solution for the problems of CAC and DCA in a cellular environment for a number of reasons. It requires no specific model of the environment as the learning agent builds up its own environment model through interaction with it. The agent can adapt to environment dynamics as long as it continues to take exploratory actions, and as reinforcement learning requires no supervised learning period it can be implemented to provide real time control while it is in the process of learning.

In the case of the on-line learning of a call admission control policy, exploration is potentially very expensive in that the intentional blocking of calls the agent has ascertained with reasonable certainty should be accepted can both increase blocking probability and reduce revenue. Therefore, it is imperative that particular attention be paid to the exploration-exploitation dilemma in order to minimise the number of blocked new call requests that are required to learn an effective call admission policy. An object of the present invention is to provide a reinforcement learning based approach to resource allocation in a communications network that is scalable, able to be implemented in a distributed manner, and of low computational complexity thus requiring minimal hardware requirements.

SUMMARY OF THE INVENTION

In a broad aspect of the invention, a method for resource allocation in a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (X₁, xz, .., XN), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a guard resource threshold, wherein the guard resource threshold is the percentage of the total allocation of the resource reserved for accepting handoff resource requests, and each cell having a connection admission control (CAC) agent that controls acceptance or rejection of requests for connections using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α, a predetermined discount factor function, γ, a predetermined exploration decision function, e, wherein the values returned by α and γ are between 0 and 1, and the agent stores connection request statistics including connection acceptance and connection refusal statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x at time t, is equal to the percentage of resource available for allocation by the cell, and the agent takes an action a at time t, wherein the action is to select the value of the guard resource threshold from the set of possible thresholds, A, wherein the action of selecting the value of the guard resource threshold is determined by the exploration function, e, and the agent calculates a reward, r, for taking action a at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, the method including the steps of a) receiving a request for an allocation of a resource for establishing a new connection in cell i at time t, b) obtaining the system state, x, cell i, c) rejecting the resource request if all resources are allocated, d) obtaining a random number from a uniform distribution between 0 and l, e) if the value of the random number is less than the value of the exploration function, e, then the agent randomly selects an action a from A, otherwise the agent selects the action a, which has the maximal state action value Q(x,a), for the current state, x, f) the agent accepting the resource allocation request and updating the accepted new resource allocation statistic if the amount of unallocated resource is greater than the guard resource threshold, otherwise rejecting the resource allocation request and updating the rejected new resource allocation request statistic, g) the agent updating the state action value associated with the previous state-action pair, Q(x',a'), wherein x' is the previous state and a! is the previous action, using the previous state-action value Q(x',a'), the current state action value Q(x,a), the learning rate α, the discount factor γ, and the reward, /, calculated after taking the previous action a' in state x', h) the agent calculating the reward for taking action a in state s, wherein the reward is calculated using resource allocation statistics in a predefined reward region. i) repeating steps a) to h) for each request for allocation of a resource.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the SARSA reinforcement learning algorithm is used to update the state action value Q(x,a), after taking action a in state x.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the updated state action pair for the previous state action value Qt+i(x',a') is updated according to the formula: Qt₊i(xW) = Qt(oc',a') + a((n + yQt(x,a)) - Qt(x',a%

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein each connection request has an associated priority level.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the method comprises of an additional step cc inserted between steps c and d, step cc comprising: cc) if the priority level of the connection request is above a predetermined level, accepting the resource allocation request and updating the accepted new resource allocation statistic and proceeding directly to step g otherwise proceeding to step d. In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the agent stores statistics on the number of accepted and rejected connection requests for each priority level from time zero until time t, and the reward at time t for cell U is calculated according to :

where ntit is the number of accepted new connection requests of priority level z and h_zit is the number of accepted handoffs connection requests of priority level z up until time t in the said reward region, Gi, of cell i, and n'∑it is the number of blocked new connection requests of priority level z and h'_zit is the number of blocked handoffs connection requests of priority level z up until time t in the reward region of cell it, and w_z is a new connection reward multiplier for priority level z connection requests and y_z is a handoff connection reward multiplier for priority level z connection requests, and K is the number of priority levels in the system, wherein the values for w_z and y_z are predetermined at time zero.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the reward multiplier for handoff connection requests in priority level k, y_z, are greater than the reward multiplier for new connection requests in class k, w_z.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein y_z >= 5w_z.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the value returned by the learning rate function α is in the range (0, 0.25), the value returned by the discount factor function γ is in the range (0.95, 1) and the value returned by the exploration decisions function e is in the range (0,0.1).

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the learning rate, discount factor and exploration decisions functions are constant functions wherein α=0.05, 7=0.975, and e = 0.03 for all times.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the number of priority levels is 2, and the values for the reward multipliers are wi = 10, Wz = l, yi = 50, and τ/2 =

5.

In a broad aspect of the invention, a method for resource allocation in a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (X₁ DCA, xi DCA, .-, XNΌCA), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a dynamic connection allocation control (DCA) agent that controls which subset of the resource to allocate to a connection request from the set of available resource using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α DCA, a predetermined discount factor function, γ DCA, a predetermined exploration decision function, e DCA, wherein the values returned by α DC_A and γ D_CA are between 0 and 1, and the agent stores resource allocation statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x _DC_A at time t, is equal to the index of the cell, i, at time t, and the agent takes an action a DCA at time t, wherein the action is to select which subset of the resource to allocate from the set of available resource, A, wherein the action of selecting the subset to allocate is determined by the exploration function, e _DC_A, wherein the exploration function decays over time from an initial value, e o DCA, at time zero, wherein the value of and the agent calculates a reward, r DCA_/ for taking action a DCA at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, the method including the steps of a) receiving a request for an allocation of a resource in cell i at time t, b) obtaining the system state, x DCA, cell i, c) rejecting the allocation request if all resources are allocated, d) obtaining a random number from a uniform distribution between 0 and l, e) if the value of the random number is less than the value of the exploration function, e DCA, then the agent randomly selects an action a DCA from A, otherwise the agent selects the action a DC_A, which has the maximal state action value

for the current state, x DCA, f) the agent allocating the resource allocation subset, a DCA, and updating the resource allocation statistic, g) the agent updating the state action value associated with the previous state-action pair, Q DCA (x',a')_f wherein x _DCA' is the previous state and a DCA' is the previous action, using the previous state-action value Q DCA (x',a'), the current state action value Q DCA (x,ά), the learning rate α DCA, the discount factor γ DCA, and the reward, r DCA ' , calculated after taking the previous action a DCA' in state x DCA', h) the agent calculating the reward for taking action a DCA in state s, wherein the reward is calculated using resource allocation statistics in a predefined reward region. i) repeating steps a) to h) for each request for allocation of a resource.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the rate of decay of the exploration function e DCA decreases with time.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the exploration function has the form of: e DCA t = e DCA o exp (-t/s) where s is a constant with the same units as time, t.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the exploration function e DCA has the form of:

e DCA t = e DCA o / J V-, where s is a constant with the same units as time, t .

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the initial value of the exploration function e DCA O has the value of 0.05, the value of s is 256 and time, i is measured in seconds. In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the resource allocation statistic is the sum of the percentage of resource allocated in cell j at time t, wherein the sum is performed over all cells; in the DCA reward region Gi of cell i and the reward at time t for cell U is calculated according to:

In another aspect of the invention a method for resource allocation in a cellular telecommunications system comprising the additional steps between steps c and d, wherein the additional steps are: ca) perform a search over the state action values Q DCA (X_/a) wherein the search is limited to the current state, x, and over actions associated with the unallocated subsets of the resource and store the value of the state-action value with the maximum state action value in v DCA max and the associated action «DCA as flDCAmn* cb) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v DCA mm and the associated action a as a _DCA min and denote the connection associated with this allocation as Cmin cc) if v DC_A max is greater than v DCA mm then allocate the subset of the resource associated with a DCA max to the connection associated c DCA min and release the subset of the resource associated with a DCA min ■

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the agent performs the additional steps of: j) monitoring connection termination requests in cell i, and acceptance of a handoff connection requests from a connection in cell i, to another cell j, _.

13 k) on receiving a request to terminate a connection or an acceptance of a handoff of the connection to an another cell, the agent stores the value of the state-action flag as v vcAflag, and the resources associated with the connection, as a DCA flag, 1) perform a search over the state action values Q _DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v DCA mm and the associated action a as a DCA mm and denote the connection associated with this allocation as c DCA min m) free the resources associated with a DCA flag, n) if a DCA min is not equal to a DCA flag then allocate the subset of the resource associated with a DCA flag to the connection c DCA min and release the subset of the resource associated with a DCA mm-

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein at time zero the state action values are initialised with either zero or a positive value /DCA according to a fixed resource allocation scheme.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the system state x DCA at time t, is equal to the index of the cell, i, at time t, and the percentage of resource allocated by the cell.

In a broad aspect of the invention, a method for resource allocation in a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality^" of states, X = (xi, Xz, .., XN), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a guard resource threshold, wherein the guard resource threshold is the percentage of the total allocation of the resource reserved for accepting handoff resource requests, and each cell having a connection admission control (CAC) agent that controls acceptance or rejection of requests for connections using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α, a predetermined discount factor function, γ, a predetermined exploration decision function, e, wherein the values returned by α and γ are between 0 and 1, and the agent stores connection request statistics including connection acceptance and connection refusal statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x at time t, is equal to the percentage of resource available for allocation by the cell, and the agent takes an action a at time t, wherein the action is to select the value of the guard resource threshold from the set of possible thresholds, A, wherein the action of selecting the value of the guard resource threshold is determined by the exploration function, e, and the agent calculates a reward, r, for taking action a at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, the method including the steps of a) receiving a request for an allocation of a resource for establishing a new connection in cell i at time t, b) obtaining the system state, x, cell i, c) rejecting the resource request if all resources are allocated, d) obtaining a random number from a uniform distribution between 0 and l, e) if the value of the random number is less than the value of the exploration function, e, then the agent randomly selects an action a from A, otherwise the agent selects the action a, which has the maximal state action value Q(x,a), for the current state, x, f) the agent accepting the resource allocation request and updating the accepted new resource allocation statistic if the amount of unallocated resource is greater than the guard resource threshold, otherwise rejecting the resource allocation request and updating the rejected new resource allocation request statistic, g) the agent updating the state action value associated with the previous state-action pair, Q(x',a'), wherein x' is the previous state and a' is the previous action, using the previous state-action value Q(x',a'), the current state action value Q(x,a), the learning rate α, the discount factor γ, and the reward, r', calculated after taking the previous action a' in state x' , h) the agent calculating the reward for taking action a in state s, wherein the reward is calculated using resource allocation statistics in a predefined reward region. i) repeating steps a) to h) for each request for allocation of a resource, and further each cell having a dynamic connection allocation control (DCA) agent that controls which subset of the resource to allocate to a connection request from the set of available resource using reinforcement learning, wherein the DCA agent is initialised with a predetermined DCA learning rate function, OCDCA, a predetermined discount factor function, γ DCA, a predetermined DCA exploration decision function, CDCA, wherein the values returned by α DCA and γ DCA are between 0 and 1, and the DCA agent stores resource allocation statistics, calculated over a DCA reward region of predefined cells, wherein the DCA agent has a representation of the cell environment having a set of DCA states, X, where the system state x DCA at time t, is equal to the index of the cell, i, at time t,, and the DCA agent takes an action a DCA at time t, wherein the action is to select which subset of the resource to allocate from the set of available resource, A DCA, wherein the action of selecting the subset to allocate is determined by the DCA exploration function, e DCA, wherein the exploration function decays over time from an initial value, e o DCA, at time zero, wherein the value of and the agent calculates a DCA reward, r, for taking action a at time t, and the DCA agent updates state action values, Q DCA (x,a), wherein at time zero all state action values are initialised to zero and the DCA reward is initialised to zero, the method including the steps of j) receiving a request for an allocation of a resource in cell i at time t, k) obtaining the system state, x DCA, cell i,

1) rejecting the allocation request if all resources are allocated, m) obtaining a random number from a uniform distribution between 0 and l, n) if the value of the random number is less than the value of the DCA exploration function, e DCA, then the DCA agent randomly selects an action a from AD_CA, otherwise the DCA agent selects the action a DCA, which has the maximal state action value Q DCA (x,a), for the current state, x DCA, o) the DCA agent allocating the resource allocation subset, a DCA, and updating the resource allocation statistic, p) the DCA agent updating the state action value associated with the previous state-action pair, Q _DCA (x',a'), wherein x DCA' is the previous state and a DCA' is the previous action, using the previous state-action value Q _DC_A (x',a^f), the current state action value Q DCA (x,a), the DCA learning rate α _DC_A, the DCA discount factor γ DCA, and the DCA reward, rDCA', calculated after taking the previous action a DCA' in state x DCA'_/ q) the DCA agent calculating the reward for taking action a DC_A in state x, wherein the DCA reward is calculated using resource allocation statistics in a predefined DCA reward region. o) repeating steps j) to q) for each request for allocation of a resource.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the updated state action pair for the previous state action value Qt+i(x',a') is updated according to the formula:

Qw(x'/) = Qt(x', a') + a((r_t + γQt(x,a)) - Qt(xW)).

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein each connection request has an associated priority level.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the method comprises of an additional step cc inserted between steps c and d, step cc comprising: cc) if the priority level of the connection request is above a predetermined level, accepting the resource allocation request and updating the accepted new resource allocation statistic and proceeding directly to step g otherwise proceeding to step d.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the agent stores statistics on the number of accepted and rejected connection requests for each priority level from time zero until time t, and the reward at time t for cell it is calculated according to :

where nzit is the number of accepted new connection requests of priority level z and hzit is the number of accepted handoffs connection requests of priority level z up until time t in the said reward region, Gi, of cell i, and n'zit is the number of blocked new connection requests of priority level z and h'_zit is the number of blocked handoffs connection requests of priority level z up until time t in the reward region of cell it, and w_z is a new connection reward multiplier for priority level z connection requests and y_z is a handoff connection reward multiplier for priority level z connection requests, and K is the number of priority levels in the system, wherein the values for w_z and y_z are predetermined at time zero.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein y_z >= 5w_z. In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the value returned by the learning rate function α is in the range (0, 0.25), the value returned by the discount factor function γ is in the range (0.95, 1) and the value returned by the exploration decisions function e is in the range (0,0.1).

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the learning rate, discount factor and exploration decisions functions are constant functions wherein α=0.05, γ=0.975, and e = 0.03 for all times.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the number of priority levels is 2, and the values for the reward multipliers are wi = 10, wi = l, yi = 50, and τ/2 = 5.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the exploration function has the form of: e DCA t = e DCA O exp (-t/s) where s is a constant with the same units as time, t. In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the exploration function e _DC_A has the form of:

where s is a constant with the same units as time, t.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the initial value of the exploration function e DCA O has the value of 0.05, the value of s is 256 and time, t is measured in seconds.

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the resource allocation statistic is the sum of the percentage of resource allocated in cell j at time t, wherein the sum is performed over all cells ;^' in the DCA reward region Gi of cell i and the reward at time t for cell U is calculated according to:

ΓDCA (τ_t ) = ∑ P_j .

In another aspect of the invention a method for resource allocation in a cellular telecommunications system comprising the additional steps between steps 1 and m, wherein the additional steps are:

Ia) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the unallocated subsets of the resource and store the value of the state-action value with the maximum state action value in v DCA max and the associated action a DCA as aocAmax

Ib) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v DCA mm and the associated action a as a DCA win and denote the connection associated with this allocation as Cmiή Ic) if v DCA max is greater than v DCA mm then allocate the subset of the resource associated with a DCA max to the connection associated c _DCA min and release the subset of the resource associated with a DCA mm •

In another aspect of the invention a method for resource allocation in a cellular telecommunications system, wherein the agent performs the additional steps of: p) monitoring connection termination requests in cell i, and acceptance of a handoff connection requests from a connection in cell i, to another cell;^', q) on receiving a request to terminate a connection or an acceptance of a handoff of the connection to an another cell, the agent stores the value of the state-action flag as v DCA flag, and the resources associated with the connection, as a ΌCA flag, r) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v DCA mm and the associated action a as a DCA min and denote the connection associated with this allocation as c DCA min s) free the resources associated with a DCA flag, t) if a DCA min is not equal to a ΌCA flag then allocate the subset of the resource associated with a ΌCA flag to the connection c DCA min and release the subset of the resource associated with a DC_{A m}m-

In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein at time zero the state action values are initialised with either zero or a positive value /DCA according to a fixed resource allocation scheme. In another aspect of the invention a method for resource allocation in a cellular telecommunications system wherein the system state x DC_A at time t, is equal to the index of the cell, i, at time t, and the percentage of resource allocated by the cell.

In a broad aspect of the invention, a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system for performing a method for resource allocation according to any preceding method claim including, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (xi, xi, .., XN), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a guard resource threshold, wherein the guard resource threshold is the percentage of the total allocation of the resource reserved for accepting handoff resource requests, and each cell having a connection admission control (CAC) agent that controls acceptance or rejection of requests for connections using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α, a predetermined discount factor function, γ, a predetermined exploration decision function, e, wherein the values returned by α and γ are between 0 and 1, and the agent stores connection request statistics including connection acceptance and connection refusal statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x at time t, is equal to the percentage of resource available for allocation by the cell, and the agent takes an action a at time t, wherein the action is to select the value of the guard resource threshold from the set of possible thresholds, A, wherein the action of selecting the value of the guard resource threshold is determined by the exploration function, e, and the agent calculates a reward, r, for taking action a at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero and a computer associated with each cell for executing a computer program code containing instructions that perform the steps of the method.

In a broad aspect of the invention, a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (X₁ DCA, xi DCA, .., XN DCA)/ which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a dynamic connection allocation control (DCA) agent that controls which subset of the resource to allocate to a connection request from the set of available resource using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α DCA, a predetermined discount factor function, γ DCA, a predetermined exploration decision function, e DCA, wherein the values returned by α DCA and γ DCA are between 0 and 1, and the agent stores resource allocation statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x DCA at time t, is equal to the index of the cell, i, at time t, and the agent takes an action a DCA at time t, wherein the action is to select which subset of the resource to allocate from the set of available resource, A, wherein the action of selecting the subset to allocate is determined by the exploration function, e DCA/ wherein the exploration function decays over time from an initial value, e o DC_A/ at time zero, wherein the value of and the agent calculates a reward, r DCA, for taking action a DCA at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, and a computer associated with each cell for executing a computer program code containing instructions that perform the steps of the method.

In another aspect of the invention a cellular telecommunications system (CTS) wherein the computer associated with each cell executes computer program code containing instructions that perform the steps of both the methods associated with connection admission control (CAC) and dynamic connection allocation control (DCA).

In a broad aspect of the invention, a computer program element comprising a computer program code means to make a programmable device execute steps in accordance with a method according to any preceding method claim.

The invention thus provides a reinforcement learning based approach to resource allocation in a communications network. The use of an intelligent reinforcement learning agent, or agents in each cell that uses information obtained from the local region around the cell allows the invention to be scalable, implemented in a distributed manner. The use of only two state- action variables thus reduces the number of state-action pairs thus providing a solution with low computational complexity to ensure hardware requirements are minimal.

BRIEF DESCRIPTION OF THE FIGURES FIGURE 1 discloses a generic reinforcement learning agent process;

FIGURE 2 discloses a SARSA state-action value update procedure;

FIGURE 3 discloses a Call Admission Control (CAC) agent process according to an embodiment of the invention;

FIGURE 4 discloses a Dynamic Channel assignment (DCA) process according to an embodiment of the invention;

FIGURE 5 discloses a channel reassignment process for call termination or handoff events according to an embodiment of the invention;

FIGURE 6 Simulated cellular telecommunications system showing potential interference region, 20, in (A) and reward region, 30, in (B), of a cell, 10;

FIGURE 7 dislcoses a graphical display of the decay of exploratory parameter epsilon, e, over time;

FIGURES 8A and 8B display a comparison of new call blocking probabilities (A) and handoff blocking probabilities (B) versus load in a cell for uniform traffic distribution; FIGURE 9 discloses a graphical comparison of total revenue versus load in a cell for uniform traffic load;

FIGURE 10 discloses a bar graph of daily variation in traffic load.

FIGURES 11 A and B disclose a graphical comparison of call blocking probabilities (A) and hourly revenue (B) versus time of day for traffic with daily load variation of Figure 10;

FIGURES 12A and B disclose plots of Aggregated exponential traffic (A) and aggregated self-similar traffic (B);

FIGURES 13 A and B disclose a graphical comparison of new call blocking probabilities (A) and handoff blocking probabilities (B) versus load in a cell with class 1 traffic sampled from a Poisson distribution and class 2 traffic samples from a self -similar distribution a uniform traffic load;

FIGURE 14 discloses a non uniform traffic arrival pattern applied to the cells of Figure 6;

FIGURES 15 A and B disclose a graphical comparison of new call blocking probabilities (A) and handoff blocking probabilities (B) versus load in a cell with class 1 traffic sampled from a Poisson distribution and class 2 traffic samples from a self-similar distribution and traffic load featuring localised 'hot spots' as shown in Figure 14;

FIGURES 16 A and B disclose a graphical comparison of call blocking probabilities versus time of day for class 1 traffic sampled from a Poisson distribution and class 2 traffic samples from a self-similar distribution and traffic load featuring daily variation of Figure 10;

FIGURE 17 discloses a graphical comparison of total revenue versus load in a cell for class 1 traffic sampled from a Poisson distribution and class 2 traffic samples from a self-similar distribution and a uniform traffic load;

FIGURE 18 discloses a graphical comparison of total revenue versus load in a cell for class 1 traffic sampled from a Poisson distribution and class 2 traffic samples from a self-similar distribution and traffic load featuring localised 'hot spots' as shown in Figure 14;

FIGURE 19 discloses a graphical comparison of hourly revenue versus time of day in a cell for class 1 traffic sampled from a Poisson distribution and class 2 traffic samples from a self-similar distribution and traffic load featuring daily variation of Figure 10;

FIGURE 20 discloses an embodiment of the invention wherein a computer associated with a cell executes computer program code implementing the invention; and

FIGURE 21 discloses a Call Admission Control (CAC) agent and Dynamic Channel assignment (DCA) process according to an embodiment of the invention.

DETAILED DESCRIPTION OF AN EMBODIMENT OF THE INVENTION The present invention is related to resource allocation in a communication system. By way of example this specification describes an embodiment of the invention to allocate a resource in a telecommunication systems carrying multi-class traffic using a reinforcement learning based approach. Other communication environments such as wired packetised data networks or ad- hoc networks may also benefit from the application of the broader invention disclosed herein.

The preferred embodiment describes reinforcement learning agent-based solutions to the problems of call admission control (CAC) and dynamic channel allocation (DCA) in multi-cellular telecommunications environments featuring multi-class traffic and intercell handoffs. Preferably both agents providing the CAC and DCA functionality make use of an on-policy reinforcement learning technique known as SARSA and are designed to be implemented at the cellular level in a distributed manner. Furthermore, both are capable of on-line (real-time) learning without any initial training period. Both of the reinforcement learning agents are disclosed herein using computer simulations and are shown to provide advantageous results in terms of call blocking probabilities and revenue raised under a variety of traffic conditions.

Prior to describing the system architecture it is informative to briefly describe the reinforcement learning approach. Figure 1 illustrates a simple reinforcement learning scheme: a learning agent and, external to it, the environment which it interacts with. The environment can be characterised by the configuration or values of a certain number of its features, which is called its state, denoted at time t in Figure 1 as S(t). Each state has an intrinsic value, dependent upon a certain immediate reward or cost, denoted at time t as R(t), which is generated when it is entered. At each discrete moment in time the agent may take one of a number of possible actions, A(t), which affects the next state of the system, S(t + 1), and therefore the next reward/ cost experienced, according to certain transition probabilities. The agent's choice of action, a, given the current state of the system, s, is modified by experience, i.e., it uses its past experience of action taken in a certain system state and reward/ cost experienced to update its decision making process for future actions. A policy of actions to be taken given particular system states is developed over time by the agent as it interacts with the environment. Alternative policies are evaluated in terms of the reward function. Each state, s, is associated with a state-acton value function, Q(x,a), which is an approximation of the future rewards that may be expected starting from that particular state if an optimal policy was adhered to. As exploration of the problem by the learning agent proceeds, the values associated with particular states may be modified to be closer to the value of the state that preceded it, a technique termed temporal-difference learning.

The state-action function Qt(s'; a') represents the learning agent's estimate at time t of the value of taking action a' in state s' then Qt+i(s'; a') may be updated by:

Q_M (x¹ ,a')= Q, (x' , ά) + a AQ₁ (*^• , a¹) _(1A) where α is a learning rate between 0 and 1. The quantity AQt(s'; a')is the update rule for reinforcement learning. The update rule for the SARSA algorithm is:

Δ0, = {r, + r Qt (x, a)h Q, (*' > a') (IB) where n is the reward received at time t for taking action a in state Xt, γ is the discount factor and s and a represent the next state and action visited, and Qt(s'; a') is the associated state-action value. SARSA is an on-policy method and differs from differs from off -policy methods, such as Q-Learning, in that the update rule uses the same policy for its estimate of the value of the next state-action pair as for its choice of action to take at time t , that is for prediction and control. A process of updating the state-action value estimates for SARSA is depicted in Figure 2. In explanation, after initialisation, the agent retrieves its current action-value estimate of the previous state-action pair to occur and the current state-action pair, and the reward obtained immediately after the previous state-action pair was enacted. These three values are used along with the learning rate (alpha - α) and discount parameters (gamma-γ) to update the agent's estimate of the previous state-action pair, as per equation 2. The current state-action pair is then stored as the previous state-action pair, and the reward obtained after its enactment is stored as the reward to be used in the next update procedure. This process "looks back' in time, that is it updates the action-value estimates after the agent takes an action for the state immediately following the state-action pair being updated.

The preferred embodiment will now be described. Two types of agents have been developed, one providing CAC and the other DCA functionality. Both have been developed to be implemented at the base station level of a multicellular system in a distributed manner as each uses only localised information and makes its decisions only for events occurring in the cell in which it is located.

To illustrate the embodiment of the invention a telecommunications system will be described wherein an allocation of the resource is a discrete channel for a connection. The terms resources and channels will be used to refer to discrete allocations of the connection resource. It is to be understood that this is for descriptive convenience, and that it covers broader concepts such as, but not limited to, allocations of bandwidth or codesets. Similarly the terms call should be considered to be representative of the broader term connection. The CAC and DCA agents were simulated over a 7 X 7 cellular system as shown in Figure 6 with a total of 70 channels available for assignment over a 24 hour period. As a reference case a fixed channel assignment method was considered wherein a reuse distance D = 4.4 cell radii was used.

Firstly the resource agent (RA) to be considered will the dynamic channel assignment (DCA) agent, where the terms DCA and RA will be used interchangeably. The role of the DCA is to choose which channel from the set of available channels to allocate to a resource request. Whilst one could randomly pick a channel from the set, or use a first in first out (FIFO) queue, such actions may not be optimal. The use of a reinforcement learning approach allows the DCA to learn how to choose the optimal (or near optimal) channel to allocate to the request.

Note that much of the reinforcement learning frame work discussed below applies to both the DCA and the CAC agent. It should be noted that parameters such as the learning rate, α, discount rate γ, state x, action a, state- action values Q, reward, r, exploration parameter, e, are agent specific eg

OCDCA, YDCA, XDCA, «DCA, QϋCA, rDCA, 6DCA and OC CAC, J CAC, X CAQ Cl CAQ Q CAC, T cAc, e CAC In the discussion to follow the DCA subscript have been dropped for convenience. In general the appropriate subscript should be clear from the context in which it is presented.

In describing the RA agent the first consideration is that of the state observed by the RA. We begin by firstly denoting the number of locations by N, and the number of discrete resources by M. The simplest description of the state at time t, st, considered is defined as:

S₁ = O₁), ^' (1-0) where i_t e {l,2, ... , N] is the location identifier in which the resource request at time t takes place. This definition of the state space is known as the Reduced State (RS) definition (in a given cell will be equal to the index of the cell).

The state may be further described by including additional information leading to a second definition of the state at time t, St, as:

^S ₁ = QM)), (i-i) where /, e {1,2, ..., N} is the location identifier in which the resource request at time t takes place, and L(i, ) e {θ,l, ... , M} is the number of discrete resources allocated in location it. This definition will be considered to be the full state.

In order to facilitate the computation of L(U) for a cellular system, the number of allocated resources in location i at time t, it was deemed necessary to transform the index of a cell into its constituent row and column values. Assuming ϊ specifies the cell index in which a resource request event takes place in an N X N cellular system the row of the cell, j(i), can then be expressed as:

and k(i), the column of the cell can be expressed as: k(i) = i mod N_{1 (1 3)}

where ' ' > denotes rounding towards infinity and mod is the modulo operator. Given the row,/, and column, k, of a cell, its index i can be found by: i = N(j - 1) 4- k. _{(1 4)}

Let fl, if resource m is allocated in cell i at time t ; Kh*^m) = L [0, o .t,herwi •se

(1.5) Then a given allocation of the resource m e {1,2, ... , N} is available for allocation in a given cell of index i, with corresponding row/ and column k, of an N X N cellular system with a minimum reuse distance of 4.4 cell-radii at time t if where

p=j(ή)-2 ?=&(i«)-2 i < p < ]V and 1 < q < N,

(1.6) where ;^' (/^represents the row of cell it and k(it) is the column of cell it and o=N(p-l)+l represents the index of a cell in the interference region of cell it. Figure 6 (A) shows a cell (10) and a interference region (20) surrounding it.

L (it) can in turn be expressed as:

M

^£(**⁾ = Σ ^Jfø>^m)> ^m=1 (1.7) with reference to (1.5).

One has the choice of whether to implement the reduced state definition

(Equation 1.0) or the full state definition (Equation 1.1). In choosing which to use one must trade off the additional information contained in the full state definition, with the added size of the state space that must be considered. This trade off will be discussed later.

Admissible actions for the RA agent are restricted to assigning an available allocation of a resource once a new resource request accepted event has been received. The availability of a given resource is determined via (1.6). Admissible actions are thus defined as: α = m, m G {1, 2, . . . , M} and v(i_t, m) = 0, The next consideration is the calculation of the reward for the RA. In order to allow for a distributed implementation the RA agent must rely solely on localised information rather than system-wide information, therefore the Reinforcement Learning (RL) RA agents obtain their rewards from a region surrounding their location. The reward region is set to a magnitude of twice the interference region of a location on the basis that any alteration to the call conditions at a location not only directly impacts the constituent agents located in its interference region, but also indirectly impacts on the interference regions of all of those agents. For example, the reward region for

a given ce _Tl_Tl i i-n an N _{M V} X N_M cel _Hlu_Ilar sys *tem, Gi ^« C — { ^L1, ^? 2, ^J . . . , > ( vN x N) ^j ]^j _f is therefore defined as:

1 ≤ j{c) < N and 1 < &(c) < JV },

(1.9) where j(c) represents the row of cell c and k(c) is the column of cell c.

The reward attributed towards each successful resource assignment event for the RL RA agent is the total number of ongoing calls in the reward region, as defined by equation (1.9), of the location where the channel assignment took place. Therefore, this reward can be expressed in an N X N cellular system with M resources for an action undertaken in cell i at time t as: j(i)+4 ft (i)+4 M

where j(i) represents the row of cell i and k(i) is the column of cell i, and y represents the index of a cell in the reward region of cell it as in (1.4) The SARSA reinforcement learning algorithm is used, as described above and shown in Figure 1 and Equations IA and IB and at 23 in Figure 2.

Resource assignment actions are selected using an e greedy algorithm, preferably with e being diminished over time. A function that allows control over the shape of the decay of e over time for the RA agent is implemented, therefore giving control over the balance of explorative actions versus exploitative actions by the RA agent over time:

The decay of the exploration parameter e over a period of 24 simulated hours is shown in Figure 7. As can be seen from figure 7 a greater rate of exploration can be achieved initially using equation (1.12) whilst achieving a more greedy action selection process farther into the operation of the agent. The rate of decay can be controlled by the value of the s parameter, the effects of which are also shown in figure 7. The discount factor, γ, for the RA agent is held constant at 0.975, and the learning rate, a, for the RA agent was held constant at 0.05. This learning rate has been deliberately chosen to be in the lower range of 0 → 1 as learning rates that are too high can degrade performance, and it has been shown lower learning rates can reduce the level of stochastic fluctuations in performance.

Previous research has shown simple Resource Allocation algorithms may perform poorly at high traffic levels due to their property of assigning resources where ever is possible. Approaches that take into account global resource allocation patterns can lead to a greater level of resource utility by minimising resource reuse distances. It is likely that the resource allocation pattern of the interference region of a given location could change between the admission of a call in that location and its termination. For example, a sub- optimal resource assignment may have been initially made due to the unavailability of a preferred resource through its assignment in an interfering location.

In order to improve the performance of the RL RA algorithm a reinforcement learning-based resource reassignment feature has been developed. Emphasis is put on developing a reassignment scheme that maintains an important advantage of reinforcement learning, namely that of low computational complexity.

Specifically, whenever a new call accepted, call termination or handoff event is encountered a reinforcement learning agent considers a single resource reassignment. A handoff event may trigger two reassignment considerations, one in the area the mobile call is departing from and another in the location it is entering.

Whenever a call termination or handoff event occurs, the set of currently allocated resources in the location, including the discrete resource allocated to the call due for termination, is evaluated as if for a new call resource assignment. The agent then frees the minimally valued resource, possibly requiring a reassignment if that minimally valued resource was not the one due for call termination. This procedure for an RL RA agent is shown in table 1.1 Table 1.1: Call Termination/Handoff Resource Reassignment Procedure

1. Let c _flag = resource flagged for freeing

2. Get c_min = mm_α Qt(⁸, Q>)_} f^or s = current location a G currently allocated resources in s S. Free resource C_βag

4- If Cmin φ C _flag-' a) Reallocate call on resource c_min to C_βag b) Free resource c_mi_n

This process may be conceptualised as a purely-greedy agent decision action where the preferred action is to release the least-preferred assigned resource rather than assigning the most-preferred free resource. Let c(ύ) denote the resource assigned to a call termination event about to occur in location i at time t. The agent determines which resource would be most preferable to be unassigned in location i immediately after the termination event, denoted by fiύ, by: /(**) - ^mina Qt(¹Hi O), I l(i_ti a) = 1 V α G A, ^₁₁₃) where l(it,a) indicates whether resource a is allocated in location i at time t (see equation (1.5)). Kf(U) ≠ c(U) then a resource reassignment procedure takes place whereby the call currently occupying the least-preferred resource /(it) is transferred to the more-preferred resource c(U).

This mechanism requires only a simple search over the currently learnt resource allocation values held in the memory table of the agent, the scope of which is equal to the total number of allocated resources immediately prior to the call termination or handoff event. This simple search procedure ensures computational requirements continue to be minimal, and in effect the goal of this resource reassignment technique is to leverage the maximum performance from the location-resource associations made by the reinforcement learning RA agents. As no learning is conducted on the reassignment actions, i.e., no update procedure follows, this process is conducted in a strictly greedy manner.

The second type of resource reassignment, invoked upon new call accepted events, compares the minimally-valued resource currently assigned to the maximally-valued free resource. If the best free resource is currently preferred by the RL RA agent, the call is transferred (Table 1.2). The computational overhead of this reassignment strategy is also low, requiring a search over the same previously learnt state-action values. Assignment actions are limited to at most one channel reassignment per call event.

Table 1.2: New Call Channel Reassignment Procedure

1. Let Cmax = maXα Qt{S) θ)_t for s = current location a € currently available resources in s

2. Let v_max = Q_t(s, c_max)

3. Let C_777Jn = min_α Q_t(s_i a)_} for s = current location a G currently allocated resources in s

a) Reallocate call on resource c_min to c._mαχ b) Free resource c -_r ^'TTWIl

It is important to emphasise that any reassignment is limited to the location in which the call termination of handoff event fires, and that at most one reassignment is enacted. This is an important property as the powerful heuristic Borrowing with Directional Channel Locking (BDCL) has been considered infeasible for practical implementation as resource reassignments may be propagated system-wide.

In order to encourage a compact resource assignment pattern, the initial state- action value estimates corresponding to a fixed resource allocation scheme are initialised to a positive value,/, for example by:

/, if 0 < ⁽a - c) < 10,

0.0, otherwise,

(114)

For the 7 X 7 cellular system with 70 channels considered in the embodiment discussed below, c = ((_x - 4 X [(x - 1)/7J) mod 7) x 10

(1.15)

with •-^* J denoting rounding towards zero and mod being the modulo operator. This effectively "seeds' the value estimate table so as to encourage the learning agent to favour actions consistent with an evenly distributed allocation of channels over the system.

Figure 4 is a flowchart of the operation of the DCA agent. The DCA agent receives a resource request (400) and makes an observation of the system state, x (410). It then checks if an allocation of the resource is available to allocate (420). If all of channels are in use then the request is dropped and the DCA waits for the next request.

If a channel is available then the DCA performs the new channel reassignment procedure of Table 1.2. At step 430 the state action values (Q(x,a)) of the set of channels available for assignment are searched and the channel with the largest value found, and the value is stored as Max Q. A search is then performed over the state action values of the set of assigned channels (the complement set to the previous set) and the channel with the smallest value is found and the value stores as Min Q. If Max Q available is greater than the Min Q assigned (434) then the call on the channel associated with Min Q is reassigned to the channel associated with Max Q (436), and the channel associated with Min Q is released into the pool of unassigned channels (438). This procedure ensures efficient use of high value channels.

If the outcome of step 430 is false the agent procedes to step 440 where it decides whether to perform an exploratory action or not. This consists of obtaining a random number from a uniform distribution between 0 and 1, and if the random number is less than the current value of e (equation 1.12), then exploration is performed along 446. In this case the action taken is to select a random channel from the pool of available channels is assigned to the resource request (448). If the test performed at step 440 is false, exploitation is performed. The agent then takes the action of allocating the channel with the largest state action value Q for the current state, x.

The agent then proceeds to 450 and observes the reward for taking the action and updates the Q values according to equation IA and IB and the flowchart shown in Figure 2. Finally at 452 the agent returns to step 400 to await the next resource request.

In a variant of the above process step 424 can proceed directly to step 440 and bypass the new channel reassignment procedure.

The DCA agent can also perform call reassignment call termination or successful handoff of a call to an adjacent cell. This procedure is illustrated in Figure 5. In addition to waiting for resource allocation requests, the DCA also waits for termination or handoff requests (50). On notice of a termination or handof f request the DCA agent performs a search and identifies the channel in use with the smallest Q Value and assigns this value to Min Q (51). The call is then terminated or handed off (52) and the agent then checks if this channel is the channel associated with Min Q (53). If it is not (55) then the DCA agent reassigns the recently released resources to the call associated with Min Q (56), and frees the resources associated with Min Q(57). The agent then waits for another termination or handoff requests (58).

We will now consider the Call Admission Control agent (CAC). Note that much of the reinforcement learning frame work discussed above applies to the CAC agent. It should be noted that parameters such as the learning rate, α, discount rate γ, state action values Q, exploration parameter, e, are agent specific, and should not be confused with those for the DCA agent.

Resource Allocation schemes that do not take into account call admission are greedy resource assignment policies as they accept and assign resources to a new call request whenever possible. It may be optimal, however, to such a scheme intentionally denies certain new call requests in order to maintain minimal co-channel reuse distances or reserve resources for higher priority traffic. This is an approach taken by call admission control (CAC) which works on the assumption that denying a call request may lead to long-term improvements in service even though service is degraded in the immediate future.

As the uninterrupted continuation of an ongoing call via a successful handoff procedure is considered a higher priority than at new call initialisation the handoff blocking probability of a resource allocation scheme should be minimised. This can be achieved through the prioritisation of handoff calls through resource reservation, although approaches that do this often lead to poorer channel usage for new call requests as there is generally a tradeoff between reserving resources for handoff calls and the minimisation of new call blocking probabilities. Guard schemes prioritise call handoff s by reserving a portion of bandwidth for assignment to handoff requests. The amount of bandwidth to reserve depends upon traffic conditions, therefore adaptive algorithms should be preferred as microcellular systems may be highly dynamic environments due to the increase in call handoff s.

In the case of the on-line learning of a call admission control policy, exploration is potentially very expensive, in that the intentional blocking of calls that the agent has ascertained with reasonable certainty should be accepted, can both increase blocking probability and reduce revenue. Therefore, it is imperative that particular attention be paid to the exploration- exploitation dilemma in order to minimise the number of blocked new call requests that are required to learn an effective call admission policy.

A dynamic guard channel scheme that uses reinforcement learning to adaptively determine the number of channels to reserve for handoff traffic has been developed and is preferably embodied in the CAC disclosed herein. Utilising only localised environment information, the reinforcement learning- based guard channel mechanism is designed to be employed in a distributed architecture. This ensures its feasibility for real-world implementation and allows it to be coupled with the RL-based RA solutions developed and described herein.

A reinforcement learning-based CAC agent is proposed that determines whether a new call request should be accepted into the system via management of a dynamic resource guard scheme. It has been decided to limit the action of the agent to new call requests only as, given the desired prioritisation of handoffs, acceptance of a handoff request is always considered optimal.

Other RL CAC research has concentrated an agent's actions directly on whether to admit or block a new call request either through a comparison of the estimated value of accepting a call and rejecting a call, or by a comparison of the estimated value of accepting a call and zero. The reinforcement learning call admission control (CAC) algorithm detailed here is novel in that it indirectly determines the admission of new call requests via control of the resource guard magnitude of the cell in which the new call event takes place.

Denoting the number of agent locations by N, and the number of discrete resources by M, the state at time t for a CAC agent, St, is defined as:

t \ u κ t)h ₍₁₁₆₎

where i_t e {l,2, ... , N} is the location identifier in which the resource request at time t takes place, and V(i_t ) e {0,1, ... , M] is the number of discrete resources available in location it. Given that in a distributed implementation there will exist one CAC learning agent in each resource assigning location, the location index component of the state information can be omitted as it will remain constant, resulting in a state representation of:

s_t = V(i_t). P_-17,

A given resource m_t s {1,2, ... , M}is available for allocation in a given location of index i at time t if v(U,m)=0 (see equation 1.6). Therefore, the number of resources available in location U can be expressed as: M ^' .

(1.18) where

For the CAC agent an action consists of setting the resource guard magnitude: a_t = m, m e {0, l; . . . , M }.

(1.20)

A new call request will then be admitted if the number of available discrete resources in the location at time t is greater than the resource guard magnitude determined by the CAC agent at that point in time. In the implementation considered, the maximum guard channel threshold value was limited to the total number of discrete resources system-wide, M, divided by the cluster size, Z. In the case of a N X N cellular system with 70 channels and a cluster size of 7 this is 10 (70/7). The minimal guard channel threshold value was 0. This results in admissible actions for the RL CAC agent being: a_t = m, m e {0, 1, . . . , 10}.

A threshold value of 0 corresponds to accepting every new call request and reserving no resources for handoff use only, whereas a threshold value of 10 corresponds to reserving all of the resources a cell would receive in a uniform fixed resource allocation pattern for handoff call requests, both of which are extreme conditions. Exploration can potentially be very expensive when using reinforcement learning to solve the CAC problem, as intentionally blocking calls is, in and of itself, an undesirable action. Furthermore, as the learning agent initially has no experience through nil or limited interaction with the environment, it is vital that it converges on an optimal or near-optimal policy as rapidly as possible. In order to solve this potential dilemma the estimated action values of the learning agent were firstly initialised to zero, i.e.:

Then a reward function based on the negation of the new call and handoff blocking probability experienced up to time t was implemented (see equation (1.23)), with preferred actions still being deemed as those with greater state- action value estimates. This formulation ensures that the state-action values are limited to a maximum value of zero, and encourages initial exploration as each state-action value is initialised to this maximum. As the learning agent progresses through its interaction with the environment, sub-optimal policies will result in higher call blocking probabilities which, due to the negation procedure, will be classified as less desirable and therefore less prone to choice by the agent. Although similar to optimistic initial value methods in so far as this reward structure encourages initial exploration, the proposed method obviates the need to choose an appropriately sized initial value as the agent's estimates are initialised to their maximal value which cannot be attained in operation once a call has been blocked in an agent's reward region. This formulation therefore allows more flexibility in the setting of the exploration parameter e as it inherently encourages exploration away from sub-optimal policies.

Specifically, let n_zit signify the number of accepted new call requests of class z and h_zit signify the number of accepted handoffs of class z up until time t in the reward region, Gi, of cell i (see equation (1.9)). Furthermore, let n\U represent the number of blocked new call requests of class z and h\U the number of blocked handoffs of class z up until time t in the reward region of cell it. Then the reward at time t for cell it can be expressed as:

where w_z is a new call reward multiplier for class z traffic and y₂ is a handoff reward multiplier for class z traffic, and K is the number of traffic classes in the system. These multipliers allow for customisation of the level of preference given to different traffic classes and handoff requests via the 0 relative weighting they give to the overall reward obtained.

Figure 3 is a flowchart of the operation of the CAC agent. The CAC agent waits for a new call request (300) and a new call arrives at step 310. The agent then checks if a channel is available to allocate (320). If all of channels are in 5 use then the call is blocked (324) and the CAC waits for the next request (300).

If a channel is available (326) the agent then performs a test to determine if it should take an exploratory action or not. This consists of obtaining a random number from a uniform distribution between 0 and 1, and if the random 0 number is less than the current value of e, then exploration is performed along 336. In this case the action taken is to set the guard threshold to a random number of channels (338). If the test performed at step 330 is false, a non exploratory action is performed. The agent then takes the action of setting the guard threshold to the number of channels with the largest state action 5 value Q for the current state, x (334). The agent then proceeds to step 340 where it tests whether there are available channels to assign to the call taking into account the number reserved as guard channels. If there are channels available (346) then the call is accepted and the agent updates new accepted call statistics. (348). If there are insufficient channels available (342) then the call is blocked (refused) and the agent updates the new call blocked statistics. The agent then proceeds to step 350 and observes the reward for taking the action and updates the Q values according to equation IA and IB and the flowchart shown in Figure 2. Finally at 360 the agent returns to step 300 to await the next new call request.

Figure 21 is a flowchart of the operation of the CAC and DCA agents. The CAC agent and DCA agents listens for a call event. If the call event is a new call request the CAC agent decides whether to accept or reject the call. If the call is accepted then the DCA decides which channel to assign. If the call event is a call termination or handoff event the DCA checks to see if the resources should be released or reassigned to a current call. The process will now be described in detail.

The CAC agent waits for a call event (300). If the event is a new call request, then the CAC checks if a channel is available to allocate (320). If all of channels are in use then the call is blocked (324) and the CAC waits for the next request (300).

If a channel is available (326) the agent then performs a test to determine the priority level of the new call request. If the priority level is greater than a priority threshold value (360) then the call is accepted and new call accepted statistics are updated(348). The connection request then proceeds to the DCA agent for assignment of a channel(410). If the priority level is less than or equal to the priority threshold value then the CAC performs a test to determine whether to take an exploratory action or not.

If a channel is available (326) the agent then performs a test to determine if it should take an exploratory action or not. This consists of obtaining a random number from a uniform distribution between 0 and 1, and if the random number is less than the current value of e, then exploration is performed along 336. In this case the action taken is to set the guard threshold to a random number of channels (338). If the test performed at step 330 is false, a non exploratory action is performed. The agent then takes the action of setting the guard threshold to the number of channels with the largest state action value Q for the current state, x (334).

The agent then proceeds to step 340 where it tests whether there are available channels to assign to the call taking into account the number reserved as guard channels. If there are insufficient channels available (342) then the call is blocked (refused) and the agent updates the new call blocked statistics. The agent then proceeds to step 350 and observes the reward for taking the action and updates the Q values according to equation IA and IB and the flowchart shown in Figure 2. Finally at 360 the agent returns to step 300 to await the next new call request.

If there are channels available (346) then the call is accepted and the agent updates new accepted call statistics. (348). The agent then proceeds to step 370 and observes the reward for taking the action and updates the Q values according to equation IA and IB and the flowchart shown in Figure 2. The call is then passed onto the DCA agent for assignment of a channel (410). Operation of the DCA is as described previously, with the modifications that at step 422 (no channel is available and the call is blocked (422)) control is handed back to the call admission control agent (300), and on completing step 450 (observer reward and update Q values for the DCA agent), control is handed back to the call admission control agent (300).

If the call event was not a new call event, the request is passed to the DCA agent. If the call is a termination or handoff request (312), then the DCA agent performs a search and identifies the channel in use with the smallest Q Value and assigns this value to Min Q (510). The call is then terminated or handed off (520) and the agent then checks if this channel is the channel associated with Min Q (530). If it is not (550) then the DCA agent reassigns the recently- released resources to the call associated with Min Q (560), and frees the resources associated with Min Q(570). Control is handed back to the call admission control agent (300).

Simulation results of an embodiment of the invention will now be described. To evaluate the performance of the CAC and DCA agents simulations were performed using a 7 X 7 cellular telecommunications system comprising of a 7 X 7 cellular system as shown in Figure 6 with a total of 70 channels available for. assignment over a 24 hour period.

The first comparison performed compared the performance of a DCA agent using the reduced state implementation (Equation 1.0) compared to the full state implementation (Equation 1.1) using a series of simulation. A single class of traffic was simulated with new call arrivals being modelled as independent Poisson processes with a uniform distribution pattern with mean call arrival rates λ, for both classes between 100 to 200 calls/hour. The call durations obeyed an exponential distribution with a mean of 1/μ, equal to 3 minutes for both traffic classes. New calls that were blocked were cleared. All handoff traffic was simulated according to an exponential distribution with a mean call duration, 1/μ, of 1 minute in the new cell. All simulations were initialised with no ongoing calls. The simulated configurations were evaluated in terms of revenue generated and new call and handoff blocking probabilities after a period of 24 simulated hours. As a reference case, a fixed channel assignment method was considered wherein a reuse distance D = 4.4 cell radii was used, and where all calls requests were accepted (unless all channels were allocated).

A comparison of new call blocking probabilities and handoff blocking probabilities versus load was performed. The revenue versus load per cell obtained over a 24 hour period was also examined. Results indicated that at all traffic load considered, the reduced state agent performed better (reduced blocking probability and increased revenue) compared to the full state agent, which in turn performed better than the fixed allocation strategy. This may initially seem counterintuitive, as the full state agent has access to more information to learn from than the reduced state agent. However the reduction in state space allows the agent to more quickly learn appropriate policies. This reduced state space agent also has significantly reduced memory requirements over the full state agent which is an additional feature which enhances its real world applicability. Based on the results of these simulations, the reduced state implementation was further tested. For convenience we will drop the term reduced state agent, and in the discussion that follows, the DCA agent will be a DCA agent implementing a reduced state space.

Two classes of traffic were simulated representing two levels of service. Class 1 traffic being premium service traffic that contributed 10 times the system revenue for every new call request accepted compared to Class 2 traffic, representing standard service. As Class 1 traffic earned more revenue it was prioritised by having all Class 1 new call requests bypass the CAC agent and proceed directly to the DCA agent. New call requests that were intentionally blocked by the CAC agents or unable to be assigned a free channel were cleared, i.e. Erlang B. All simulations were initialised with no ongoing calls. Both classes were assumed to contain roaming traffic and a proportion of all calls underwent a handoff procedure wherein 15% of terminating calls were continued as handoff calls. The entered cell for handoff traffic was chosen randomly using a uniform distribution with all neighbouring cells as eligible. If a handoff resulted in a call leaving the area by movement from an edge cell then the call was terminated and not included in the handoff results of the simulation.

Three configurations were selected for testing. Firstly, as a reference case, a fixed channel assignment method was considered wherein a reuse distance D = 4.4 cell radii was used, and where all calls requests were accepted (unless all channels were allocated). The second case considered was using distributed RL DCA agents in each cell using a reduced state implementation (equation 1.0), where all calls requests were accepted (unless all channels were allocated). The third case considered was using distributed RL CAC and RL DCA agents in each cell. The DCA agents performed channel reassignment on new call, call termination pr call handoff.

AU simulations were initialised with no ongoing calls. The simulated configurations were evaluated in terms of revenue generated and new call and handoff blocking probabilities after a period of 24 simulated hours.

Guard channel and channel assignment actions by the CAC and DCA agents are both selected using an e -greedy algorithm, with e being held constant at 0.03 for the CAC agent and e being diminished over time for the DCA agent according to equation (1.12) with s=256 and eo=O.O5. The exploration parameter e is kept constant for the CAC agent in order to improve its adaptability to environment dynamics, as the developed reward structure inherently provides some exploration parameter control. The learning rate α for the GAC agents is set to 0.01 and their discount parameter γ is set to 0.5. The learning rate and discount parameters for the DCA agents were 0.05 and 0.975 respectively. The reward multipliers are set as follows:wi = 10, wi = 1, yi = 50, and yz = 5, thereby prioritising handoffs.

Simulation results were obtained for a range of traffic conditions. The first traffic considered was called the constant traffic load scenario. Both traffic classes were i.i.d. with new call arrivals being modelled as independent Poisson processes with a uniform distribution pattern with mean call arrival rates λ_Λ for both classes between 100 to 200 calls/hour. The call durations obeyed an exponential distribution with a mean of 1/μ, equal to 3 minutes for both traffic classes. New calls that were blocked were cleared. All handoff traffic was simulated according to an exponential distribution with a mean call duration, 1/μ, of 1 minute in the new cell.

Results of the simulation are shown in Figures 8 and 9. Figure 8 shows the comparison of new call blocking probabilities (A) and handoff blocking probabilities (B) versus load for each class and each of the three configurations using the above uniform traffic distribution. Figure 9 shows the revenue versus load per cell obtained over a 24 hour period. The RL-DCA agent produces a higher revenue over all traffic loads simulated. By dynamically allocating channels over the simulated system call blocking rates for both classes can be reduced which allows for more revenue raising calls to be accepted. The RL-CAC agent obtains more revenue at higher traffic loads as it prioritises the higher revenue class traffic. As can be seen in figure 8, the RL DCA agent configuration produces a lower new call blocking probability for both traffic classes than the FCA configuration. The new call blocking rates for both the configurations lacking the RL-CAC agent are identical for both traffic classes as no prioritisation is given to class 1 traffic. The RL-CAC agent produces the lowest new call blocking probability for class 1 traffic, at the expense of the low-revenue class 2 traffic. Figure 8 B shows the handoff blocking probabilities of the simulated system. The handoff blocking probabilities of the RL-DCA configuration are significantly lower than those of the FCA configuration as it makes more efficient use of the system channels. This substantial difference is improved even further by the configuration including a RL agent for CAC, as the CAC policy prioritises both handoff requests as well as class 1 new call requests.

The next traffic condition considered was a time varying traffic load scenario. The constant traffic load scenario described above was modified by varying the traffic load over a simulated 24 hour period. A uniform mean new call arrival rate of 150 calls/hour per traffic class was multiplied by a parameter dependent on the simulated time of day, the pattern of which is shown in figure 10. Only two configurations were simulated, 'NO CAC, RL-DCA' and 'RL-CAC, RLDCA', as these were the best performing configurations under the constant traffic load scenario above

Figure 11 shows the results of the time varying traffic load scenario. Once the period of peak activity has begun the RL- CAC agent quickly determines a policy that garners increased revenue, achieving both learning activity and profit activity in this period.

The new call and handoff blocking probabilities are shown in figure 11 (A). Both traffic classes have been combined in this plot. Here it can be seen that the new call blocking probability of both classes is slightly higher for the RL- CAC agent, due to the fact that the agent is prioritising class 1 traffic by intentionally blocking class 2 calls, a policy that leads to greater revenue obtained (figure 11 B). Figure 11 also shows that besides producing more revenue over the simulated 24 hour period, the RL-CAC agent also achieves a reduction in the handoff blocking probability for both classes of over 50% during the periods of peak activity.

The third traffic scenario considered was that of self similar data traffic load. Data traffic differs characteristically from voice traffic, and it has been shown that Poisson processes which are usually used for voice traffic modelling are inadequate for the modelling of data traffic, which is in fact better characterised by self-similar processes. Unfortunately, departing from a Markovian traffic model means reinforcement learning cannot be guaranteed to converge to an optimal policy. Nevertheless, it is possible efficient policies may be attained by a reinforcement learning agent despite the lack of a convergence guarantee.

Voice traffic, class 1, was simulated with new call arrivals being modelled as independent Poisson processes with a uniform distribution pattern and mean call arrival rates λ, of between 100 to 200 calls/hour. The call durations obeyed an exponential distribution with a mean of 1/μ equal to 3 minutes for both traffic classes.

The Pareto distribution is commonly used for the modelling of data traffic, and has the probability density function:

JL

(1.24) where α is a shape parameter and β is the minimum value of x. The lower the value of α, the 'heavier' the tail, i.e. the greater the probability of a very large x value. A Pareto distribution is considered 'heavy-tailed' if 0 < α < 2. In this range of α the variance of the distribution is infinite, and if 0 < α < 1 its mean is infinite. Even though observed traffic will not exhibit an infinite variance in actuality the use of a heavy-tail model such as this provides an approximation of the real tail behaviour of data traffic. In our simulations we made use of a pseudo-Pareto distribution to characterise data traffic (class 2) which is a truncated-value distribution due to the fact that there is a limit to the magnitude of random values a computer can generate, 53 bits in our case. Specifically pseudo-Pareto values were generated via

(1.25) where U is a uniformly distributed random value in (0,1]

For new call arrivals the shape parameter α was set to 1.2, and for call durations α was set to 1.4. The traffic load of the data class was set to be approximately equal to the offered load of voice traffic by making use of the formula for mean value of a Pareto distribution: aβ

E{x) = ^{a ~ 1} ' (1.26) which with rearrangement we use as

(1-27) to determine a value of β which would give us a mean approximately equivalent to that of the voice traffic for new call arrivals and call holding times.

The self-similar nature of traffic based on the pseudo-Pareto distribution can be seen in figure 12. Figure 12 A shows the effect of aggregating data samples drawn from an exponential distribution over increasing timescales, with each descending subplot depicting a randomly-located window of one-tenth of the size of the subplot above, leading to scales of the x-axis ranging from units of 1 second in the lowermost sub-plot to 10000 seconds in the uppermost. As the exponential data is aggregated it becomes 'smoothed'. Compare this to Figure 12 (B) which shows the same size data train taken from the pseudo- Pareto distribution (1.25) using a shape parameter α of 1.2. As the data is aggregated over increasing timescales it still appears 'bursty', and in fact bears some resemblance to the original un-aggregated data.

For both classes new calls that were blocked were cleared, and a proportion of calls (p = 15%) of both classes were simulated as handoff traffic. The entered cell for handoff traffic was chosen randomly using a uniform distribution with all neighbouring cells as eligible. Voice handoff traffic was simulated according to an exponential distribution with a mean call duration 1/μ of a further 1 minute in the new cell. The shape value α for the holding time of data traffic in the case of a handoff was set to 1.4 and the β value was set according to equation 1.27 to give a mean of approximately 1 minute.

The system was simulated over a number of uniform traffic loads, from a mean of 50 to 100 new calls per hour per cell for voice traffic and an approximately equal load of data traffic, as explained previously. Figure 13 shows new call blocking and handoff blocking probabilities as a function of total new traffic load in the cell. Figure 16 shows the total revenue raised over the 24 hours for the three allocation configurations. The RL-DCA agent produces a higher revenue over all traffic loads simulated when compared to the FCA configuration. By dynamically allocating channels over the simulated system call blocking rates for both classes can be reduced which allows for more revenue raising calls to be accepted. The RLCAC agent obtains even more revenue at higher traffic loads as it prioritises the higher revenue voice traffic. As can be seen in figure 13 (A), the RL-DCA agent configuration produces a lower new call blocking probability for both traffic classes than the FCA configuration. The RL-CAC agent produces the lowest new call blocking probability for class 1 voice traffic, at the expense of the low-revenue class 2 data traffic. Figure 13 (B) shows the handoff blocking probabilities of the simulated system. The handoff blocking probabilities of the RL-DCA configuration are significantly lower than those of the FCA configuration as it makes more efficient use of the system channels. This substantial difference is improved even further by the configuration including a RL agent for CAC, as the CAC policy prioritises both handoff requests as well as class 1 new call requests. Whilst these handoff s do not contribute to the revenue earned the fact that their dropping probability is lower is important to the perception of users. By implementing both RL-DCA and RL-CAC agents we can achieve better handoff dropping performance for both voice and data traffic and greater revenues at the same time.

As previously discussed the offered traffic load of cellular systems may not be evenly spread (as all of the above simulations have considered), but instead have a number of localised traffic 'hot-spots'. Thus the fourth configuration tested considered non-uniform traffic arrival patterns. The non-uniform call arrival distribution of Figure 14 was used as a reference. Class 1 voice traffic arrival rates were set to one-half of the mean call arrival rates depicted in Figure 14 and Class 2 traffic arrival rates were set to an approximately similar mean value through the use of equation (1.27) to determine an appropriate β value for an α value of 1.2. This gave a combined mean new call arrival rate approximately equivalent to previous simulations featuring traffic modelled by only an exponential distribution.

The revenue results of the simulations are displayed in figure 18. The efficiency of the RL-DCA agents in assigning communications channels can be seen in the higher revenue raised at all traffic levels simulated of "NO CAC, RL-DCA¹ when compared to the FCA implementation. Furthermore, the inclusion of the developed reinforcement learning-based CAC agents produces an approximately linear increase in revenue raised as the traffic load increases for both non-uniform call arrival patterns. This is due to the fact that as communications resources become more scarce the RL- CAC agents intentionally block more low-revenue data traffic calls, allowing a greater number of high-revenue calls to be carried. The behaviours of the three simulated configurations in terms of new call and handoff blocking rates can be seen in figure 15.

For the non-uniform call arrival patterns simulated, the new call and handoff blocking rates of "NO CAC, RL-DCA¹ are lower than those of "NO CAC, FCA¹ for both classes at all traffic levels. This demonstrates the greater efficiency with which channels are utilised under the reinforcement learning-controlled dynamic channel allocation scheme, and this is the reason "NO CAC, RL- DCA' obtains higher revenues than "NO CAC, FCA¹. The reinforcement learning-based call admission control scheme developed in this chapter prioritises high-revenue new call requests and handoff requests of both classes, and these two properties are shown in Figure 15. Figure 15 show that by including the RL-CAC agents new call blocking probabilities for Class 1 and handoff blocking probabilities of both classes are more than halved at the highest traffic level simulated for both non-uniform traffic patterns.

The fifth configuration tested considered was with a time varying new call arrival pattern for class 1 voice traffic and self-similar arrival pattern for class 2 data traffic, variation pattern of Figure 10 was applied to a spatially uniform new call arrival pattern of approximately 300 calls/ hour per cell. Class 1 voice traffic was modelled by an exponential distribution with a mean call arrival rate of 150 calls/ hour per cell and Class 2 data traffic was modelled by the pseudo-Pareto distribution described above with a shape parameter α of 1.2 and a /? parameter determined via equation (1.27) to produce an approximately equivalent mean new call arrival rate to Class 1. Only two resource allocation architectures were simulated, ^VNO CAC, RL-DCA¹ and "RL-CAC, RL-DCA¹ as these were the two best-performing algorithms for the previous simulations for self -similar traffic

The revenues obtained by the two simulated resource allocation architectures is shown in Figure 19. In the peak morning and afternoon periods the incorporation of the proposed reinforcement learning dynamic guard channel mechanism results in greater revenues raised. After 24 simulated hours the addition of the RL- CAC agents raises over 10% more revenue. The manner in which the RL-CAC agents achieve this is illustrated in Figure 16. The only calls that contribute to revenue are new calls accepted, and the RL-CAC agent actually exhibits a slightly higher new call blocking rate for the twelve hours between 9 and 21 o'clock. The key factor is that a greater proportion of the blocked calls are low-revenue data calls, resulting in a greater revenue achieved. It should also be noted that the handoff blocking rates of the RL- CAC agent architecture are significantly lower than those of the agent architecture with no CAC capability over all periods of significant resource demand. Whilst this has no impact on revenue raised it is another advantage of the RL-CAC architecture as the continuation of handoff calls is considered a higher priority than the acceptance of new call requests.

The above discussion and figures demonstrate the ability of a distributed resource allocation system featuring a call admission control agent and dynamic channel assignment agent

The developed RL call admission control and RL dynamic channel assignment architectures have been simulated in an range of environment including those featuring self-similar traffic. These results demonstrated that the developed distributed reinforcement learning architecture overcomes a traditional weakness of DCA schemes, namely that they may under-perform at higher traffic loads due to their propensity to assign channels where ever possible. The results also show that the agents are able to obtain good call blocking performance without access to system-wide information and that reinforcement learning may produce efficient results when applied to an environment not possessing the memoryless property, such as those that are encountered in telecommunications systems.

Finally the proposed systems are feasible for real-world implementation as they only make use of localised environment information for making decisions. Additionally, any channel reassignment procedures are limited to at most one reassignment per call event. The complexity and memory requirement have been kept to a minimum. In the example considered The CAC agent for each cell requires a memory table of only 1+M/Z elements, which in a system featuring 70 channels and a cluster size of 7, such as that used above, results in a total 11 table elements. The DCA agent for each cell requires a table representation of M table elements. It should be noted that both of these memory requirements are independent of the number of cells in the system and therefore the architectures are scalable, and they are so minimal that function approximation techniques such as provided by artificial neural networks are not required.

Special attention has been given to the adaptability of the developed resource allocation solutions, due to the inherent dynamic nature of cellular telecommunications systems featuring mobile traffic. For example, the learning rate parameter α was kept constant for both the DCA and CAC agents, allowing them to track state- action values that may change over time. This also obviated the need for a more complex mechanism for the implementation of the α parameter, such as the recording of the number of occurrences of a given state-action pair. The reduced state-space magnitude allows the RL DCA schemes to learn efficient policies in a more timely fashion and better deal with environment dynamics such as channel failures or ^v spikes' in offered traffic load. Furthermore, the reinforcement learning resource allocation solutions disclosed in this invention develop their policies in an on- line manner with no initial off-line learning periods, demonstrating their adaptability without any prior environmental knowledge.

The success of the agents in a cellular telecommunication system featuring a self similar pseudo Pareto distribution indicates wider applicability of the method. The embodiment was studied at the call based level, but given the success under non-Markovian environment dynamics, such an approach could be applied at the packet level of data transmission. This has much broader scope such as to packetised data traffic in wireline networks (ie in network routers), particularly where the packets have different priority classes. The invention also has application to mobile wireless cellular networks featuring self similar data traffic.

There are also applications beyond cellular systems such as in ad-hoc networks. It is envisaged a reinforcement learning algorithm may be developed for mobile ad hoc networks that not only provides routing and resource reservation functionalities, but also provides dynamic resource allocation. For example, a power control scheme that aims to conserve power, reduce interference amongst neighbours and maintain signal strengths in a multi-hop ad hoc environment could be developed.

The present invention for resource allocation in a cellular telecommunications system may be implemented using hardware, software or a combination thereof and may be implemented in one or more computer systems or other processing systems. In fact, in one embodiment, the invention is directed toward one or more computer systems capable of carrying out the functionality described herein. An example of a computer system 2000 is shown in Figure 20. The computer system 2000 includes one or more processors, such as processor 2010. The processor 2010 is connected to a telecommunications system infrastructure 2020 by a communications path 2015 (e.g. a communications bus, network, etc) that caries digital data to and from the processor 2010 as well as from and to the telecommunications infrastructure. Various software embodiments are possible in the form of computer code which is resident in memory 2025 associated with each processor. So as to maintain the independence of the operation of the resource allocation functionality data is exchanged between cells through the telecommunications infrastructure 2020. After reading this description, it will be come apparent to a person skilled in the relevant art(s) how to implement the invention using other computer systems and/ or computer architectures. It will be appreciated by those skilled in the art that the invention is not restricted in its use to the particular application described. Neither is the present invention restricted in its preferred embodiment with regard to the particular elements and/ or features described or depicted herein. It will be appreciated that various modifications can be made without departing from the principles of the invention. Therefore, the invention should be understood to include all such modifications in its scope.

Throughout the specification and the claims that follow, unless the context requires otherwise, the words "comprise" and "include" and variations such as "comprising" and "including" will be understood to imply the inclusion of a stated integer or group of integers, but not the exclusion of any other integer or group of integers.

The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement of any form of suggestion that such prior art forms part of the common general knowledge.

Claims

THE CLAIMS:

1. A method for resource allocation in a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (xi, xi, .., XN), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a guard resource threshold, wherein the guard resource threshold is the percentage of the total allocation of the resource reserved for accepting handoff resource requests, and each cell having a connection admission control (CAC) agent that controls acceptance or rejection of requests for connections using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α, a predetermined discount factor function, γ, a predetermined exploration decision function, e, wherein the values returned by α and γ are between 0 and 1, and the agent stores connection request statistics including connection acceptance and connection refusal statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x at time t, is equal to the percentage of resource available for allocation by the cell, and the agent takes an action a at time t, wherein the action is to select the value of the guard resource threshold from the set of possible thresholds, A, wherein the action of selecting the value of the guard resource threshold is determined by the exploration function, e, and the agent calculates a reward, r, for taking action a at time t, and the agent updates state action values, Q{x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, the method including the steps of a) receiving a request for an allocation of a resource for establishing a new connection in cell i at time t, b) obtaining the system state, x, cell t, c) rejecting the resource request if all resources are allocated, d) obtaining a random number from a uniform distribution between 0 and l, e) if the value of the random number is less than the value of the exploration function, e, then the agent randomly selects an action a from A, otherwise the agent selects the action a, which has the maximal state action value Q(x,a), for the current state, x, f) the agent accepting the resource allocation request and updating the accepted new resource allocation statistic if the amount of unallocated resource is greater than the guard resource threshold, otherwise rejecting the resource allocation request and updating the rejected new resource allocation request statistic, g) the agent updating the state action value associated with the previous state-action pair, Q{x',a'), wherein x' is the previous state and a' is the previous action, using the previous state-action value Q(x',a'), the current state action value Q(x,a), the learning rate α, the discount factor γ, and the reward, /, calculated after taking the previous action a' in state x', h) the agent calculating the reward for taking action a in state s, wherein the reward is calculated using resource allocation statistics in a predefined reward region. i) repeating steps a) to h) for each request for allocation of a resource, and further each cell having a dynamic connection allocation control (DCA) agent that controls which subset of the resource to allocate to a connection request from the set of available resource using reinforcement learning, wherein the DCA agent is initialised with a predetermined DCA learning rate function, ccDCA, a predetermined discount factor function, γDCA_/ a predetermined DCA exploration decision function, e DCA, wherein the values returned by α DCA and γ DCA are between 0 and 1, and the DCA agent stores resource allocation statistics, calculated over a DCA reward region of predefined cells, wherein the DCA agent has a representation of the cell environment having a set of DCA states, X, where the system state x DCA at time t, is equal to the index of the cell, i, at time t, and the DCA agent takes an action a DCA at time t, wherein the action is to select which subset of the resource to allocate from the set of available resource, A DCA, wherein the action of selecting the subset to allocate is determined by the DCA exploration function, e DCA, wherein the exploration function decays over time from an initial value, e o _DC_A, at time zero, wherein the value of and the agent calculates a DCA reward, r, for taking action a at time t, and the DCA agent updates state action values, Q DCA (x,a), wherein at time zero all state action values are initialised to zero and the DCA reward is initialised to zero, the method including the steps of j) receiving a request for an allocation of a resource in cell i at time t, k) obtaining the system state, x DCA, cell i, 1) rejecting the allocation request if all resources are allocated, m) obtaining a random number from a uniform distribution between 0 and 1, n) if the value of the random number is less than the value of the DCA exploration function, e DCA, then the DCA agent randomly selects an action a from ADC_A, otherwise the DCA agent selects the action a DCA, which has the maximal state action value Q _DC_A (x,a), for the current state, x _DCA, o) the DCA agent allocating the resource allocation subset, a _DCA, and updating the resource allocation statistic, p) the DCA agent updating the state action value associated with the previous state-action pair, Q DCA (x',a'), wherein x DCA' is the previous state and a DCA' is the previous action, using the previous state-action value Q DC_A (x',a'), the current state action value Q DCA (x,a), the DCA learning rate α DCA, the DCA discount factor γ DCA, and the DCA reward, ΓDCA', calculated after taking the previous action a DCA' in state x DCA', q) the DCA agent calculating the reward for taking action a DC_A in state x, wherein the DCA reward is calculated using resource allocation statistics in a predefined DCA reward region. o) repeating steps j) to q) for each request for allocation of a resource.

2. A method for resource allocation in a cellular telecommunications system according to claim 1, wherein the SARSA reinforcement learning algorithm is used to update the state action value Q(x,a), after taking action a in state x.

3. A method for resource allocation in a cellular telecommunications system according to claim 1, wherein the updated state action pair for the previous state action value Qt+i(x',a') is updated according to the formula: Qm(x', a') = Qt(x', a') + a((n + γQt(x,a)) - Qt(x',a')).

4. A method for resource allocation in a cellular telecommunications system according to claim 2 or 3, wherein each connection request has an associated priority level.

5. A method for resource allocation in a cellular telecommunications system according to claim 4, wherein the method comprises of an additional step cc inserted between steps c and d, step cc comprising: cc) if the priority level of the connection request is above a predetermined level, accepting the resource allocation request and updating the accepted new resource allocation statistic and proceeding directly to step g otherwise proceeding to step d.

6. A method for resource allocation in a cellular telecommunications system according to claims 4 or 5, wherein the agent stores statistics on the number of accepted and rejected connection requests for each priority level from time zero until time t, and the reward at time t for cell U is calculated according to :

7. A method for resource allocation in a cellular telecommunications system according to claim 6 wherein the reward multiplier for handoff connection requests in priority level k, y_z, are greater than the reward multiplier for new connection requests in class k, w_z-

8. A method for resource allocation in a cellular telecommunications system according to claim 7 wherein y_z >= 5w_z.

9. A method for resource allocation in a cellular telecommunications system according to claims 6, 7 and 8 wherein the value returned by the learning rate function α is in the range (0, 0.25), the value returned by the discount factor function γ is in the range (0.95, 1) and the value returned by the exploration decisions function e is in the range (0,0.1).

10. A method for resource allocation in a cellular telecommunications system according to claim 8 wherein the learning rate, discount factor and exploration decisions functions are constant functions wherein α=0.05, γ=0.975, and e = 0.03 for all times.

11. A method for resource allocation in a cellular telecommunications system according to claim 10, wherein the number of priority levels is 2, and the values for the reward multipliers are wi = 10, wi = 1, yi - 50, and yi = 5.

12. A method for resource allocation in a cellular telecommunications system according to claim 1, wherein the rate of decay of the exploration function ^DCA decreases with time.

13. A method for resource allocation in a cellular telecommunications system according to claim 1, wherein the exploration function has the form of: e DCA t = e DCA o exp (-t/s) where s is a constant with the same units as time, t .

14. A method for resource allocation in a cellular telecommunications system according to claim 1, wherein the exploration function e DC_Ahas the form of: e DCA t = e DCA o / J /„ where s is a constant with the same units as time, t .

15. A method for resource allocation in a cellular telecommunications system according to claim 14, wherein the initial value of the exploration function e DCA O has the value of 0.05, the value of s is 256 and time, t is measured in seconds.

16. A method for resource allocation in a cellular telecommunications system according to claim 1 wherein the resource allocation statistic is the sum of the percentage of resource allocated in cell j at time t, wherein the sum is performed over all cells j in the DCA reward region Gj of cell i and the reward at time t for cell U is calculated according to:

"** (0 = ∑Py

JeG₁

17. A method for resource allocation in a cellular telecommunications system according to claim 1 comprising the additional steps between steps 1 and m, wherein the additional steps are:

Ia) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the unallocated subsets of the resource and store the value of the state-action value with the maximum state action value in v _DCA max and the associated action a DCA as aocAmax

Ib) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v _DCA mm and the associated action a as a DCA min and denote the connection associated with this allocation as Cmin Ic) if v DCA max is greater than v DCA min then allocate the subset of the resource associated with a DCA max to the connection associated c DCA min and release the subset of the resource associated with a DCA min .

18. A method for resource allocation in a cellular telecommunications system according to claim 1 wherein the agent performs the additional steps of: p) monitoring connection termination requests in cell i, and acceptance of a handoff connection requests from a connection in cell i, to another cell;^', q) on receiving a request to terminate a connection or an acceptance of a handoff of the connection to an another cell, the agent stores the value of the state-action flag as v DCA flag, and the resources associated with the connection, as a DCA flag, r) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v DCA mm and the associated action a as a _DC_A min and denote the connection associated with this allocation as c DCA mm s) free the resources associated with a DCA flag, t) if a DCA min is not equal to a DCA flag then allocate the subset of the resource associated with a DCA flag to the connection c DC_A mm and release the subset of the resource associated with a DC_A mm.

19. A method for resource allocation in a cellular telecommunications system according to claim 1 wherein at time zero the state action values are initialised with either zero or a positive value /DC_A according to a fixed resource allocation scheme.

20. A method for resource allocation in a cellular telecommunications system according to claim 1 wherein the system state x DCA at time t, is equal to the index of the cell, i, at time t, and the percentage of resource allocated by the cell.

21. A method for resource allocation in a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (xi, xi, .., XN), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a guard resource threshold, wherein the guard resource threshold is the percentage of the total allocation of the resource reserved for accepting handoff resource requests, and each cell having a connection admission control (CAC) agent that controls acceptance or rejection of requests for connections using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α, a predetermined discount factor function, γ, a predetermined exploration decision function, e, wherein the values returned by α and γ are between 0 and 1, and the agent stores connection request statistics including connection acceptance and connection refusal statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x at time t, is equal to the percentage of resource available for allocation by the cell, and the agent takes an action a at time t, wherein the action is to select the value of the guard resource threshold from the set of possible thresholds, A, wherein the action of selecting the value of the guard resource threshold is determined by the exploration function, e, and the agent calculates a reward, r, for taking action a at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, the method including the steps of a) receiving a request for an allocation of a resource for establishing a new connection in cell i at time t, b) obtaining the system state, x, cell i, c) rejecting the resource request if all resources are allocated, d) obtaining a random number from a uniform distribution between 0 and 1, e) if the value of the random number is less than the value of the exploration function, e, then the agent randomly selects an action a from A, otherwise the agent selects the action a, which has the maximal state action value Q(x,a), for the current state, x, f) the agent accepting the resource allocation request and updating the accepted new resource allocation statistic if the amount of unallocated resource is greater than the guard resource threshold, otherwise rejecting the resource allocation request and updating the rejected new resource allocation request statistic, g) the agent updating the state action value associated with the previous state-action pair, Q(x',a'), wherein x' is the previous state and a' is the previous action, using the previous state-action value Q(x',a'), the current state action value Q(x,a), the learning rate α, the discount factor γ, and the reward, r', calculated after taking the previous action a! in state x', h) the agent calculating the reward for taking action a in state s, wherein the reward is calculated using resource allocation statistics in a predefined reward region. i) repeating steps a) to h) for each request for allocation of a resource.

22. A method for resource allocation in a cellular telecommunications system according to claim 21, wherein the SARSA reinforcement learning algorithm is used to update the state action value Q(x,a), after taking action a in state x.

23. A method for resource allocation in a cellular telecommunications system according to claim 21, wherein the updated state action pair for the previous state action value Qt+i(x',a') is updated according to the formula: Qm(xW) = Qt(x',a') + cc((rt + γQt(x,a)) - Qt(xW)).

24. A method for resource allocation in a cellular telecommunications system according to claim 22 or 23, wherein each connection request has an associated priority level.

25. A method for resource allocation in a cellular telecommunications system according to claim 24, wherein the method comprises of an additional step cc inserted between steps c and d, step cc comprising: cc) if the priority level of the connection request is above a predetermined level, accepting the resource allocation request and updating the accepted new resource allocation statistic and proceeding directly to step g otherwise proceeding to step d.

26. A method for resource allocation in a cellular telecommunications system according to claims 24 or 25, wherein the agent stores statistics on the number of accepted and rejected connection requests for each priority level from time zero until time t, and the reward at time t for cell it is calculated according to :

^K

^_z=1(n^f _zkw_z + h'₂i_ty_z)

?^'t (it) = — V^A ^^" (n'Jt + n_zi_t + ti_zi_t + Mt)

27. A method for resource allocation in a cellular telecommunications system according to claim 26 wherein the reward multiplier for handoff connection requests in priority level k, y₂, are greater than the reward multiplier for new connection requests in class k, w_z.

28. A method for resource allocation in a cellular telecommunications system according to claim 27 wherein y_z >= 5w_z.

29. A method for resource allocation in a cellular telecommunications system according to claims 24, 25, or 26 wherein the value returned by the learning rate function α is in the range (0, 0.25), the value returned by the discount factor function γ is in the range (0.95, 1) and the value returned by the exploration decisions function e is in the range (0,0.1).

30. A method for resource allocation in a cellular telecommunications system according to claim 28 wherein the learning rate, discount factor and exploration decisions functions are constant functions wherein α=0.05, γ=0.975, and e = 0.03 for all times.

31. A method for resource allocation in a cellular telecommunications system according to claim 30, wherein the number of priority levels is 2, and the values for the reward multipliers are wi = 10, wi = 1, yi - 50, and 1/2 = 5.

32. A method for resource allocation in a cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (xi DCA, X2 DCA, .., XNDCA), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a dynamic connection allocation control (DCA) agent that controls which subset of the resource to allocate to a connection request from the set of available resource using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α D_CA/ a predetermined discount factor function, γ DCA, a predetermined exploration decision function, BDCA, wherein the values returned by CXDCA and γ DCA are between 0 and 1, and the agent stores resource allocation statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x DCA at time t, is equal to the index of the cell, i, at time t, and the agent takes an action a DC_A at time t, wherein the action is to select which subset of the resource to allocate from the set of available resource, A, wherein the action of selecting the subset to allocate is determined by the exploration function, e DCA, wherein the exploration function decays over time from an initial value, e o DCA, at time zero, wherein the value of and the agent calculates a reward, r DCA, for taking action a DCA at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, the method including the steps of a) receiving a request for an allocation of a resource in cell i at time t, b) obtaining the system state, x DCA, cell i, c) rejecting the allocation request if all resources are allocated, d) obtaining a random number from a uniform distribution between 0 and l, e) if the value of the random number is less than the value of the exploration function, e DCA, then the agent randomly selects an action a DCA from A, otherwise the agent selects the action a DCA, which has the maximal state action value QΌCA{X,Λ), for the current state, x DCA, f) the agent allocating the resource allocation subset, a DCA, and updating the resource allocation statistic, g) the agent updating the state action value associated with the previous state-action pair, Q DCA (x',a')_r wherein x DCA' is the previous state and a DCA' is the previous action, using the previous state-action value Q DCA (x',a')_r the current state action value Q DCA (x,a), the learning rate απcA, the discount factor YDC_A, and the reward, 7"DCA ', calculated after taking the previous action a DCA' in state x DCA', h) the agent calculating the reward for taking action a _DCA in state s, wherein the reward is calculated using resource allocation statistics in a predefined reward region. i) repeating steps a) to h) for each request for allocation of a resource.

33. A method for resource allocation in a cellular telecommunications system according to claim 32, wherein the rate of decay of the exploration function e DCA decreases with time.

34. A method for resource allocation in a cellular telecommunications system according to claim 32, wherein the exploration function has the form of: e DCA t = e DCA O exp (-t/ s) where s is a constant with the same units as time, t.

35. A method for resource allocation in a cellular telecommunications system according to claim 32, wherein the exploration function e DCA has the form of: e DCA t = e DCA o / Jy where s is a constant with the same units as time, t .

36. A method for resource allocation in a cellular telecommunications system according to claim 35, wherein the initial value of the exploration function e DCA O has the value of 0.05, the value of s is 256 and time, t is measured in seconds.

37. A method for resource allocation in a cellular telecommunications system according to claim 32 wherein the resource allocation statistic is the sum of the percentage of resource allocated in cell j at time t, wherein the sum is performed over all cells; in the DCA reward region d of cell i and the reward at time t for cell it is calculated according to: ΓDCA (/, ) = ∑ p_j .

38. A method for resource allocation in a cellular telecommunications system according to claim 32 comprising the additional steps between steps c and d, wherein the additional steps are: ca) perform a search over the state action values Q DCA (X_/a) wherein the search is limited to the current state, x, and over actions associated with the unallocated subsets of the resource and store the value of the state-action value with the maximum state action value in v DCA max and the associated action a DCA as ODCAMOX cb) perform a search over the state action values Q DCA (X_/a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v _DCA mm and the associated action a as a DCA imn and denote the connection associated with this allocation as Cm_n cc) if v DCA max is greater than v DCA mm then allocate the subset of the resource associated with a DCA max to the connection associated c DCA min and release the subset of the resource associated with a DCA _mm ■

39. A method for resource allocation in a cellular telecommunications system according to claim 32 wherein the agent performs the additional steps of: j) monitoring connection termination requests in cell i, and acceptance of a handoff connection requests from a connection in cell i, to another cell], k) on receiving a request to terminate a connection or an acceptance of a handoff of the connection to an another cell, the agent stores the value of the state-action flag as υ υcAflag, and the resources associated with the connection, as a DCA flag, 1) perform a search over the state action values Q DCA (x,a) wherein the search is limited to the current state, x, and over actions associated with the allocated subsets of the resource and store the value of the state-action value with the minimum state action value in v DCA mm and the associated action a as a DCA min and denote the connection associated with this allocation as c DCA min m) free the resources associated with a DCA flag, n) if a DCA mm is not equal to a DCA flag then allocate the subset of the resource associated with a DCA flag to the connection c DCA nun and release the subset of the resource associated with a DCA min.

40. A method for resource allocation in a cellular telecommunications system according to claim 32 wherein at time zero the state action values are initialised with either zero or a positive value /DCA according to a fixed resource allocation scheme.

41. A method for resource allocation in a cellular telecommunications system according to claim 32 wherein the system state x DCA at time t, is equal to the index of the cell, i, at time t, and the percentage of resource allocated by the cell.

42. A cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system for performing a method for resource allocation according to any preceding method claim including, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (x\, X₂, .., Xhi), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a guard resource threshold, wherein the guard resource threshold is the percentage of the total allocation of the resource reserved for accepting handoff resource requests, and each cell having a connection admission control (CAC) agent that controls acceptance or rejection of requests for connections using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α, a predetermined discount factor function, γ, a predetermined exploration decision function, e, wherein the values returned by α and γ are between 0 and 1, and the agent stores connection request statistics including connection acceptance and connection refusal statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x at time t, is equal to the percentage of resource available for allocation by the cell, and the agent takes an action a at time t, wherein the action is to select the value of the guard resource threshold from the set of possible thresholds, A, wherein the action of selecting the value of the guard resource threshold is determined by the exploration function, e, and the agent calculates a reward, r, for taking action a at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero and a computer associated with each cell for executing a computer program code containing instructions that perform the steps of the method.

43. A cellular telecommunications system (CTS), having an environment consisting of two or more cells for creating connections between a CTS user and the CTS system, the system having a predetermined resource and quantity of the resource in a plurality of states, X = (xi DCA, *2 DCA, .., XNDCA), which is available for allocation in each cell, and each cell receiving requests for an allocation of the resource, wherein the request are comprised of requests for resources for establishing new connections and requests for resources to accept handoff connections from another cell, and each cell having a dynamic connection allocation control (DCA) agent that controls which subset of the resource to allocate to a connection request from the set of available resource using reinforcement learning, wherein the agent is initialised with a predetermined learning rate function, α DC_A/ a predetermined discount factor function, γ ΌCA, a predetermined exploration decision function, e DCA, wherein the values returned by α DCA and γ DCA are between 0 and 1, and the agent stores resource allocation statistics, calculated over a reward region of predefined cells, wherein the agent has a representation of the cell environment having a set of states, X, where the system state x DCA at time t, is equal to the index of the cell, i, at time t, and the agent takes an action a DCA at time t, wherein the action is to select which subset of the resource to allocate from the set of available resource, A, wherein the action of selecting the subset to allocate is determined by the exploration function, e _DC_A/ wherein the exploration function decays over time from an initial value, e o DC_A, at time zero, wherein the value of and the agent calculates a reward, Γ_DC_A, for taking action a DCA at time t, and the agent updates state action values, Q(x,a), wherein at time zero all state action values are initialised to zero and the reward is initialised to zero, and a computer associated with each cell for executing a computer program code containing instructions that perform the steps of the method.

44. A cellular telecommunications system (CTS) according to claims 42 and 43 wherein the computer associated with each cell executes computer program code containing instructions that perform the steps of both the methods associated with connection admission control (CAC) and dynamic connection allocation control (DCA).

45. A computer program element comprising a computer program code means to make a programmable device execute steps in accordance with a method according to any preceding method claim.