WO2007057857A1 - Adaptive, distributed solution for enhanced co-existence and QoS for multimedia traffic over RLANs

Info

Publication number: WO2007057857A1
Authority: WO
Grant status: Application
Application number: PCT/IB2006/054299
Other languages: French (fr)
Inventors: Salvador Boleko Ribas, Wolfgang Budde
Original Assignee: Koninklijke Philips Electronics, N.V.; U.S. Philips Corporation

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B 17/00 Monitoring; Testing
    • H04B 17/30 Monitoring; Testing of propagation channels
    • H04B 17/391 Modelling the propagation channel
    • H04B 17/3913 Predictive models
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 1/00 Arrangements for detecting or preventing errors in the information received
    • H04L 1/0001 Systems modifying transmission characteristics according to link quality, e.g. power backoff
    • H04L 1/0023 Systems modifying transmission characteristics according to link quality, e.g. power backoff, characterised by the signalling
    • H04L 1/0026 Transmission of channel quality indication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 1/00 Arrangements for detecting or preventing errors in the information received
    • H04L 1/12 Arrangements for detecting or preventing errors in the information received by using return channel
    • H04L 1/16 Arrangements for detecting or preventing errors in the information received by using return channel in which the return channel carries supervisory signals, e.g. repetition request signals
    • H04L 1/18 Automatic repetition systems, e.g. van Duuren system; ARQ protocols
    • H04L 1/1867 Arrangements specific to the transmitter end

Abstract

A cognizant radio system is disclosed that includes a transmitter/receiver (202, 204) for communicating with a transmission medium to transmit information thereto and receive information therefrom. An adaptive controller (208) controls predetermined controllable operating parameters of the operation of the transmitter/receiver. A sensor (302) senses parameters reflecting the operation of the transmitter/receiver (202, 204) in its environment, and a prediction system (402) then evaluates the performance of the transmission policy in use and accordingly ascertains any changes to be applied to it. The resulting strategy determines the actions to be taken by the transmitter/receiver (202, 204) in its environment as a function of the sensed parameters at a time 'k.' Thusly, such actions become suited to adapt the system behavior in accordance with the performance of the system, the perceived environment of the communication link, as well as the status of the on-going transmission. An evaluation system (404) is operable to evaluate the results of the action performed by the controller (208), and an adaptive system (406) is also provided to modify the operation of the prediction system as a function of the evaluation of the results of the action performed at time 'k-1' and prior (i.e. historic) time slots.

Description

ADAPTIVE, DISTRIBUTED SOLUTION FOR ENHANCED CO-EXISTENCE AND QOS FOR MULTIMEDIA TRAFFIC OVER RLANS

The present invention pertains in general to a radio and, more particularly, to an intelligent controller embedded in a radio device that can modify the radio parameters as a function of the environment status of the radio device.

Geographical proximity among several radio links, regardless of their radio technology, and/or other radio frequency (RF) energy generating systems or sources which simultaneously operate in either the same or adjoining frequency spectrum, may give rise to so-called co-channel interference and adjacent channel interference. When combined with out-of-band interference, caused by non-linearity, these three different sorts of the phenomenon can be classified into multiple access interference (MAI) or multiple user interference (MUI).

Interference may be defined as the reception of undesired RF energy, originated by the activity of other radio signal transmitters, unintentional radiators and incidental radiators, which might degrade a receiver's effective performance. These are referred to as interferers. Such performance degradation can appear as loss or misinterpretation of information expected to be extracted from a desired signal.

Referring now to Figure 1, there is illustrated a diagrammatic view of an MAI scenario. In this scenario, there are provided three radio links, 102, 104 and 106. Radio link 102 includes an intentional radio transmitter 108 and an intentional radio receiver 110. These are referred to as "intentional" because they comprise a transmitter that is operable to transmit a signal to a receiver for the purpose of that receiver receiving the information over a conventional radio link, the actual RF link indicated by a link 112. The primary purpose of the transmitter 108 is to transmit a signal for reception at receiver 110 with sufficient energy and free of interference such that the signal can be received, decoded and the information extracted therefrom. This, of course, can be problematic in the presence of interference, which can degrade the signal-to-noise ratio to such an extent that accurate recovery of the information in the RF signal is difficult at best. Similarly, the radio link 104 has a transmitter 114 that transmits over an intentional radio link 116 to an intentional receiver 118. The radio link 106 includes an intentional transmitter 120 that transmits over an intentional radio link 122 to an intentional receiver 126. Each of the radio links 102, 104 and 106 contains a transmitter and a receiver, and the desire is for each link to operate independently and in isolation from the other radio links. However, this typically does not happen, in that radio transmitters emit electromagnetic energy in many different directions, depending on the environment, the transmission properties of the antennas, etc. To illustrate this, it can be seen that each of the transmitters 108, 114 and 120 transmits energy that can be received by the receivers of the other links. For example, transmitter 108 will transmit energy to receiver 126 and receiver 118, these being unintentional transmissions which constitute unintentional radio interference.

In addition to the transmitters 108, 114 and 120 causing interference with the other intentional receivers, there are also interferers outside of the radio links that emit electromagnetic energy that can be received by receivers 110, 118 and 126. These are illustrated as two interferers, 130 and 132, which emit electromagnetic energy in such a manner that some of this energy will impinge upon the receiver antennae of the receivers 110, 118 and 126, thus causing additional interference problems.

The reduced reliability of transmissions and the ensuing increase of system congestion occasioned by MAI can be explained in terms of information theory. In accordance with well-known information-theoretic results, MAI brings about a shrinkage of the hyper-volume of the n-dimensional achievable set defined by the data rates of the n interfering radio transmitters. In other words, the higher the experienced MAI, the smaller the achievable system capacity, or aggregate traffic load that can be reliably carried over the network. In short, interference determines system capacity.
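To make the capacity-shrinkage argument concrete, consider the simplest case of n = 2 interfering links, in which each receiver treats the other transmitter's signal as additional Gaussian noise. The following bound is a standard information-theoretic sketch under these stated assumptions, not a formula from the original disclosure:

% Achievable-rate sketch for two mutually interfering links.
% W: channel bandwidth; P_i: desired received power at receiver i;
% N_0: background noise power; I_{ji}: interference power from transmitter j
% as seen at receiver i.
\begin{align}
  R_1 &\le W \log_2\!\left(1 + \frac{P_1}{N_0 + I_{21}}\right), &
  R_2 &\le W \log_2\!\left(1 + \frac{P_2}{N_0 + I_{12}}\right)
\end{align}
% As the interference terms I_{21} and I_{12} grow, the set of jointly
% achievable rate pairs (R_1, R_2) shrinks, illustrating the loss of
% aggregate system capacity attributed to MAI above.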

The effects of the herein-before disclosed MAI phenomenon are exacerbated for local area network (LAN) radio technologies that operate in the so-called Industrial, Scientific and Medical (ISM) and the Unlicensed National Information Infrastructure (U- NII) frequency bands. These frequency bands are the so-called:

2.4 GHz band, which ranges from 2.4000 to 2.4835 GHz, and the

5 GHz bands, which embrace intervals ranging from 5.150 to 5.350 GHz (U-NII), from 5.470 to 5.725 GHz (Europe) and from 5.725 to 5.875 GHz (ISM).

Unlike other spectrum segments in use by communications systems, such bands constitute unlicensed or license-exempt spectrum; that is, any electronic device is free to emit RF energy in them with no need for a license. Moreover, they are not even specifically designated for communications systems. In view of this, unlicensed spectrum access does not permit exclusive spectrum access and exploitation by just one radio technology. To date, most existing regulations issued by international spectrum regulatory bodies merely set maximum power levels that users of unlicensed spectrum may radiate at will. But recently these regulatory bodies, lobbied by radio technology manufacturers and industry committees, have become conscious of the need for enhanced co-existence solutions. As a result, some recommended practices and addenda to previously existing standards have been developed and issued. However, current regulation is still far from complete.

Regulatory and industry bodies like BT SIG, DARPA, FCC, IEEE, the WiFi Alliance and others have lately created working groups and started projects and new standards which specifically address co-existence issues and spectrally efficient medium access for radios operating in unlicensed spectrum, e.g. the XG Project, the FCC Spectrum Policy Task Force, IEEE 802.15.2, IEEE 802.19, and the WiFi Alliance's Co-Existence and Spectrum Sharing Task Group.

As shown, the main co-existence objective is to lessen MAI impact on communications systems jointly operating and sharing resources in unlicensed spectrum. On the other hand, interference phenomena are closely related to transmit power management. For that reason, a joint radio resource and quality of service (QoS) management-based co-existence solution for radio technologies operating in unlicensed spectrum has been found to be an appropriate option.

Some of the disadvantages of present co-existent systems are listed as follows:

Multiple access (user) interference (MAI or MUI) from both like and unlike systems, especially when operating in license-exempt spectrum;

Lack of distributed and dynamic adaptation of transmission to conveyed traffic features (e.g., Quality of Service (QoS) requirements support);

Lack of dynamic adaptation of transmission configurable mechanisms or parameters at different OSI (Open Systems Interconnection) protocol stack layers to time-varying network status and the nature of radio channels;

Lack of dynamic adaptation skills without the expense of erratic behavior of the radio;

Inefficient spectrum utilization impacting available system capacity and energy efficiency;

Increased energy consumption, which reduces battery lifetime for portable devices; and

Large mathematical modeling effort required to reach a proper model description of the control problem due to its high complexity. Some of the aspects of this modeling may include:

Asynchrony,

Distribution,

High dimensionality,

Interaction of multiple, and possibly heterogeneous, controllers,

Interaction of transmission mechanisms at different OSI protocol stack layers,

Multiple input and output variables; some of them categorical, some others continuous,

Non-linearity,

Non-stationarity,

Stochasticity, and

Time-varying dynamics.

As illustrated by this list, the high complexity associated with MAI-related phenomena, in conjunction with the interaction among transmission parameters available at different layers of the OSI stack and heterogeneous traffic characterization, makes it difficult to construct statistical models which are complete, dependable and simple enough to allow off-line tuning of an optimal and robust transmission strategy.

However, the acquisition of accurate a priori knowledge can sometimes be either difficult or impossible. That can be the case, for instance, whenever complex or unknown environments and/or systems are considered. The lack of a fitting analytical model or of knowledge about the environment, the system or the phenomena involved in the control process can jeopardize the design of an off-line tuned controller able to yield good performance. In other words, traditional off-line, single-protocol-layer, static, worst-case-based approaches can never guarantee optimal performance.

In view of the above, there is a need for an on-line self-tuning mechanism for the controller. Such a method should help to handle, in an adaptive fashion, the afore-listed design issues. For that purpose, the so-called reinforcement learning paradigm has been found to be an appropriate approach, as will be described hereinbelow.

Embodiments of the present invention comprise a cognizant radio system that includes a transmitter/receiver for communicating with a transmission medium to transmit information thereto and receive information therefrom. A controller is provided for controlling predetermined controllable operating parameters of the operation of the transmitter/receiver. The controller includes a sensor for sensing parameters reflecting the operation of the transmitter/receiver in its environment. A prediction system is provided for predicting the operation of the transmitter/receiver in its environment as a function of the sensed parameters at a time "k." The controller is operable to map the sensed parameters through a stochastic representation of actions to be performed by the transmitter/receiver to achieve a desired goal by outputting an action to be performed, i.e., a suited transmission configuration of radio settings. An evaluation system is operable to evaluate the results of the action performed by the controller, and an adaptive system is also provided to modify the operation of the prediction system as a function of the evaluation of the results of the action performed at time "k-1."

For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description taken in conjunction with the accompanying Drawings in which:

Fig. 1 illustrates a radio system for a prior art system showing interference sources;

Fig. 2 illustrates a diagrammatic view of an exemplary context aware radio;

Fig. 3 illustrates a block diagram of an exemplary context-aware radio system with an adaptive controller based on reinforcement learning;

Fig. 4 illustrates a diagrammatic detailed view of the exemplary controller introduced in Fig. 3 for a cognizant radio system featuring reinforcement learning; and

Fig. 5 illustrates a block diagram of an exemplary fuzzy inference system used by the reinforcement learning-based controller to perform value function approximation.

Referring now to Figure 2, there is illustrated a simplified block diagram of a context-aware radio architecture of the present disclosure. This context-aware radio architecture basically includes a transmission device 202 and a reception device 204. These two structures 202 and 204 provide the radio, which is typically operated in accordance with the OSI architectural model. This architectural model is generally comprised of a plurality of layers that form an OSI stack. The OSI architectural model is typically divided into seven layers, with the first four layers, 1-4, referred to as the lower layers, and layers 5-7 referred to as the upper layers. The lower layers are involved with the physical implementation up through the various transport mechanisms. The first layer is the physical layer, which handles the mechanical and electrical aspects of the physical connections made to the computer system. It sends and receives streams of digits across a radio link, indicative of a radio system. However, in a typical computer system, this could be a physical medium such as a cable or optical fiber. The next layer is the data-link layer, which ensures error-free transmission of data between directly connected systems. It includes functional and procedural means to transfer blocks of data and to detect and correct errors which may have occurred in the physical transmission of the data. The next layer is the network layer, which provides the routing and switching functions necessary to move data from its originating system through intermediary nodes to a destination system when there is no direct communication between the two systems, and is also responsible for the segmentation of the data within a message and its reassembly at the receiving system. The next layer is the transport layer, which provides the link between the upper and lower parts of the model.

The transmit and receive portion of the communication system comprises the physical interface to a general data system 206 that comprises all or part of the session layer, presentation layer and application layer, the three upper layers of the OSI stack. In general, this is conventional. Embedded within the radio is an intelligent, learning controller 208. This controller 208, as will be described hereinbelow, perceives its environment within the operating parameters of the system. Using such perceptions, the radio then generates control signals to interact with the environment in such a manner as to achieve certain performance goals, such as improving, for example, QoS. For this parameter, there are many configuration settings or parameters that influence the achievement of specific QoS requirements or targets. These can be adjusted by observing various channel conditions and the network status. By perceiving the operating parameters of the system within the context of its environment, the controller 208, in association with various learning rules stored in a block 210, can tune the parameters of the radio in an adaptive manner.

Two of the features regarded as essential to define an intelligent system are:

Adaptation, which may be defined in the current context as the ability of the controller 208 to modify the mapping from its perception of the transmission and environment status to a transmission parameters configuration, so as to achieve certain control goals; and

Autonomy in setting and achieving such goals without any external intervention.

Finally, although adaptation does not necessarily require the ability to learn, learning becomes indispensable for systems that need to adapt to a wide variety of unpredictable circumstances and become more intelligent as a result of experience.

As autonomy is one of the requirements, there is no possibility for the controller 208 to be presented with examples of state-action pairs along with an indication that the action was either right or wrong; training data sets may simply not be available. Hence, the learning paradigm that best applies to the current case appears to be so-called unsupervised learning. Among the many sorts of unsupervised learning techniques, reinforcement learning appears to be one of the most promising and best fitted to the current problem.

Reinforcement learning is an algorithmic approach to solving optimal control problems, which allows the controller 208 to learn a goal-oriented, suited behavior through interaction with its environment. At every time step, the controller 208 examines its state perception and accordingly chooses an action or configuration. As a consequence of its last action, the actions of controllers associated with other context-aware radios and the dynamics of the environment itself, the controller 208 receives from the environment a reinforcement signal (primary reinforcement), which indicates the quality of the newly reached state with regard to the control task of the controller 208.

The controller 208 aims to learn what action to apply whenever it faces a particular state, relative to the primary reinforcement, which can possibly be delayed. Thusly, primary reinforcement can be thought of as an instrument to specify the task the controller 208 is to carry out, in terms of reward or penalty and to act as a guide to form effective decision policies. More formally, the goal of reinforcement learning is to map state perceptions into actions so as to maximize a value function that estimates the expected cumulative primary reinforcement in the long-term.

In the current work, reinforcement learning exploiting some results borrowed from game theoretical considerations has been deemed apposite to specify a suitable framework for the design of a co-existence solution for radio LAN technologies conveying multimedia traffic, which is based on radio resource management through dynamic adaptation of transmission parameters available at different OSI protocol-stack layers.

Historically, there have been several alternative approaches and techniques to handle either alien or same-type interference, which originates co-existence problems. Some of them are listed in what follows:

Exclusive spectrum allocation;

Efficient spectral resources allocation and management through exploitation of diversity or orthogonality for medium access in dimensions:

Frequency,

Information,

Space, and

Time, which can be undertaken either in a static (off-line) or adaptive (online) fashion; and

Detection, identification and cancellation of interferers.

According to the prior definition, co-existence solutions are expected to incorporate some of the following abilities:

Detection of other systems' operation;

Adaptation to other systems' operation;

Reduction of impact on other systems' operation; and

Negotiation with other systems on access and exploitation of resources.

Using these skills as a basis, the IEEE 802.19 TAG defines 5 co-existence classes to rate systems, regardless of the specifically implemented techniques. Those classes are presented in Table 1.

Table 1

Class 0: Ignorant of the possibility that other systems or protocols may exist

Class 1: Aware that other systems or protocols may exist; expects them to make any adaptations

Class 2: Aware that other systems or protocols may exist; unilaterally acts to reduce impact

Class 3: Actively detects others; unilaterally acts to reduce impact

Class 4: Actively detects others; negotiates with them to optimize mutual performance

A further class 3 co-existence mechanism is proposed, which is based on the selection of the transmission parameters' configuration that best suits the status of both the communication channel associated with the link and the traffic to be exchanged through it. The transmission strategy can then be better regarded from a cross-layer perspective, since transmission parameters belonging to different layers in the OSI stack are jointly configured and adapted to the immediate transmission status. Moreover, the system of the present disclosure is directed toward being compliant with the anticipated spectrum etiquette to be developed by institutions like the ones listed hereinabove. Spectrum etiquette rules are a set of rules for radio resource management, ideally followed by all radio systems that operate in an unlicensed band. The rules should help to establish fair access to the available radio resources, in addition to rendering a more efficient usage of the radio spectrum.

By means of context-awareness and adequate radio resource management in all the access dimensions of communication channels, i.e., frequency, information, space and time, performance can be improved and interference and power consumption of radio devices reduced. This is the formal framework defined by cognitive radio. Cognitive radios employ a cognition cycle, which may either predict future changes in environment or immediately respond to them, to alter their actions in response to changes in the environment.

In this framework, the control system can be thought of as a mapping from environment state observations to actuation commands so as to achieve certain control objectives, where the learning process modifies the mapping to improve future system's performance.

Reinforcement Learning-Driven Context-Awareness and Adaptation

At each time interval k, the controller 208 perceives a representation of the current state of its environment, x[k] ∈ X, and picks out and executes an action or pure strategy, u[k], from the action space U which is permitted to the controller 208 at that state.

The selection of actions by the controller 208 is guided by the so-called policy or strategy profile, which is a function that assigns a probability of execution to every action when a state is perceived by the controller 208. Such a function is denoted by π(x,u) = P(u|x) and represents the mapping from the perceived state onto a probability distribution over the action subspace associated with the state, π(x,u): X × U → [0,1].

One time step later, the controller 208 receives a scalar reinforcement signal and observes a representation of the newly reached state x[k+1]. Both the reinforcement and the new state depend on the previous state x[k], the action u[k] executed by the controller and some other factors.

Consequently, this reinforcement signal is random; however, a so-called pay-off, reward or utility function can be associated with the reinforcement signal over the state-action space, R(x,u): X × U → ℝ, where

R(x,u) = E{ρ | state is x and controller selects action u}   (1)

A pictorial representation of the learning cycle associated with reinforcement learning can be found in Figure 3. In Figure 3, the controller 208 receives the input x(k) from a sensor 302, the sensor being operable to sense parameters of the system at step (k). The controller 208 will then take some action u(k) and assert this action on the environment, represented by a block 304. The environment 304 can provide some feedback ρ(k). Prior to the environment providing the feedback ρ(k), some reinforcement evaluation 308 may be performed. The reinforcement evaluation plays a major role in the learning process of the controller 208 and will be explained below under Reinforcement Function. Basically, it is the reinforcement signal that guides the learning process of the controller, and not only the series of states the system goes through. The feedback ρ(k) will be provided as reinforcement for the next step at step (k+1).
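The learning cycle of Figure 3 can be made concrete with a small runnable sketch, in which a toy environment stands in for blocks 304 and 308 and a tabular action value function stands in for the controller's internal state. All names and the toy dynamics are illustrative assumptions, not details of the disclosure:

import random

def toy_environment(x, u):
    """Stand-in for environment 304 plus reinforcement evaluation 308:
    returns the next state x[k+1] and a scalar reinforcement rho(k)."""
    x_next = (x + u) % 4                 # trivial state dynamics
    rho = 1.0 if x_next == 0 else -0.1   # reward for reaching state 0
    return x_next, rho

def policy(x, q, epsilon=0.1):
    """Stochastic policy pi(x,u): epsilon-greedy over tabular action values."""
    if random.random() < epsilon:
        return random.choice([0, 1])
    return max((0, 1), key=lambda u: q[(x, u)])

q = {(x, u): 0.0 for x in range(4) for u in (0, 1)}  # tabular Q(x,u)
x = 0                                                # initial perceived state x[0]
for k in range(1000):
    u = policy(x, q)                     # controller 208 picks action u[k]
    x_next, rho = toy_environment(x, u)  # environment feedback rho(k)
    # one-step temporal-difference update of the action value function
    target = rho + 0.9 * max(q[(x_next, 0)], q[(x_next, 1)])
    q[(x, u)] += 0.1 * (target - q[(x, u)])
    x = x_next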

The goal of the controller 208 is the selection of the best policy (π), defined as the one leading to the maximization of a certain optimality index. It is possible to associate one value of a so-called cost, evaluation or state value function with each policy in order to evaluate its effectiveness. A commonly used definition for it, as far as non-episodic tasks are concerned, is that of an expected discounted cumulative primary reinforcement, starting from a certain state when a certain policy is in use, which can be formulated as:

V(x,π) = E{ ∑_{j=0}^{∞} γ^j r(x[k+j+1], u[k+j+1]) | x[k] = x, π }   (2)

where the discount factor γ ranges from 0 to 1 and determines to what extent expected forward-looking revenues should influence currently taken actions. Likewise, an action value function of taking action u in state x under a policy π, denoted Q(x,u,π), can be evaluated as follows:

Q(x,u,π) = E{ ∑_{j=0}^{∞} γ^j r(x[k+j+1], u[k+j+1]) | x[k] = x, u[k] = u, π }   (3)

= E{ r(x[k+1], u[k+1]) + γ Q(x[k+1], u[k+1], π) | x[k] = x, u[k] = u, π }   (4)

Reinforcement learning relies on the supposition that the system dynamics satisfies the so-called Markov condition or property, according to which the next observed state and immediate revenue depend only on the current state and action, and some random factor. Systems that fulfil the Markov property are called Markov decision processes or controlled Markov processes (MDPs). Indeed, most real tasks are not strictly Markov, but usually close enough.

There are two main problems that are addressed by any reinforcement learning method:

Prediction problem, which is defined as the evaluation of the performance of a given policy; and

Control problem, which is defined as the search of an optimal policy, i.e., one yielding maximum value functions.

Provided a complete Markov decision process (MDP) mathematical model of the environment dynamics is known a priori, dynamic programming methods can straightforwardly derive optimal policies by using value functions and Bellman equations to guide the search. When this is not the case, reinforcement learning techniques provide a series of alternative tools to accomplish it.
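By way of illustration, the following is a minimal tabular policy iteration sketch for a fully known MDP, alternating Bellman-equation policy evaluation with greedy policy improvement. It is a textbook construction under assumed inputs (a transition table P and a reward table R), not an implementation of the disclosed controller:

def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Tabular policy iteration on a known MDP.
    P[x][u] is a list of (probability, next_state) pairs; R[x][u] is the
    expected immediate reward for taking action u in state x."""
    pi = {x: actions[0] for x in states}          # arbitrary initial policy
    V = {x: 0.0 for x in states}
    while True:
        # Policy evaluation: iteratively solve the Bellman equation for V(x,pi)
        while True:
            delta = 0.0
            for x in states:
                v = R[x][pi[x]] + gamma * sum(p * V[y] for p, y in P[x][pi[x]])
                delta = max(delta, abs(v - V[x]))
                V[x] = v
            if delta < tol:
                break
        # Policy improvement: greedy action with respect to the current V
        stable = True
        for x in states:
            best = max(actions, key=lambda u: R[x][u] +
                       gamma * sum(p * V[y] for p, y in P[x][u]))
            if best != pi[x]:
                pi[x], stable = best, False
        if stable:
            return pi, V

# Hypothetical two-state, two-action MDP:
states, actions = [0, 1], [0, 1]
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
print(policy_iteration(states, actions, P, R))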

Approximate Policy Iteration

For the current case, an actor-critic architecture, also known as adaptive heuristic critic or policy iteration algorithm, has been adopted to implement reinforcement learning. Policy iteration is a two-time-scale iterative procedure in the space of policies, wherein policies are discovered by generation of a sequence of monotonically improving policies. In such a scheme, the two problems associated with reinforcement learning are considered in two subsequent steps, performed by the following two elements:

The adaptive critic element, or critic, which, through the reinforcement signal arriving from the environment, performs the policy evaluation task, i.e. it intends to solve the prediction problem; and

The associative search element, or actor, which, based on an internal reinforcement signal generated by the critic, performs the policy improvement or update task, i.e., it yields an improved and more effective policy and consequently, while solving the control problem, determines the actions of the controller 208.

Guaranteed convergence of policy iteration methods heavily relies upon:

Tabular representation of value functions;

Exact solution of the so-called Bellman equations, which are formulated in terms of the value functions; and

Tabular representation of each policy.

Fulfillment of the first and last requirements is hardly feasible for many real cases, e.g. for large discrete action or state spaces, or when continuous action and state spaces are contemplated. As a result, tabular representations of either value functions or policies are replaced by parametric function approximators:

V(x,π,φ) ≈ V(x,π)   (5)

Q(x,u,π,ξ) ≈ Q(x,u,π)   (6)

π(x,u,θ) ≈ π(x,u)   (7)

The resulting advantage is that the storage requirements are usually much smaller for this compact representation than for the high-dimensional tabular case. An efficient information storage and retrieval (memory) scheme is crucial to any learning system to retain the empirically derived knowledge.

On the other hand, the introduction of function approximation schemes demands, for any approximate policy iteration method to be successful, suited choices of the scheme itself as well as of its projection method, used to estimate the values of the parameters. (Such projection takes place from the high-dimensional state and action space linked to the exact tabular value function onto the low-dimensional subspace spanned by a set of basis functions, which are determined by the parameters.)

Several sound theoretical results on the convergence properties of approximate policy iteration methods have previously been reported. As a consequence, some of these results have been taken up, and their function approximation has been applied to both the value functions and the policy. Figure 4 illustrates a block diagram of an approximate policy iteration architecture.

In the approximate policy iteration architecture of Figure 4, there is provided an actor block 402 and a critic block 404. The actor block 402 is operable to perform the policy improvement task, wherein it solves the control problem, while the critic block 404 performs the policy evaluation task, wherein it solves the prediction problem. The actor block 402 receives as an input the approximate action value function Q(x,u,π) and provides on the output thereof the policy. Within the actor, this input is provided to a policy improvement block 406, whose output is input into a policy projector block 408, and the resulting projection is output to a policy approximation block 410. The critic block 404 contains a policy evaluation block 412, since the output of the policy approximation block 410 provides an input to the system. The policy evaluation block is an input to a value function projection block 414, which receives the reinforcement input ρ(k) and the state perception x(k). This provides a projection of the value function onto the parameters φ. The output of the critic block 404 is input to an approximate value function block 420 to output the value function for input to the actor block 402, which bases its policy projection on an internal reinforcement signal received at time (k-1) and then generates a current policy projection at step (k). This then controls the system at that step in accordance with the projected policy. The critic block at step (k) then evaluates the current policy which was asserted to the system by the controller 208 and projects forward a value function which is then approximated and input to the actor block 402 at the next step (k+1). Thus, the critic block 404 utilizes a reinforcement signal from the environment to project forward a value function that is utilized by the actor block 402 for the purpose of improving the policy, thus solving the control problem.
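The division of labor in Figure 4 can be sketched as one step of an actor-critic update: the critic maintains a linear approximate value function and produces a temporal-difference error as internal reinforcement, and the actor performs a policy-gradient improvement of a parameterized stochastic policy. The softmax parameterization, the feature vectors and the learning rates here are illustrative assumptions, not details fixed by the disclosure:

import math

def softmax_probs(theta, feats):
    """Actor policy pi(x,u,theta) over a discrete action set."""
    prefs = [sum(t * f for t, f in zip(row, feats)) for row in theta]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, phi_v, feats, feats_next, u, rho,
                      gamma=0.9, alpha=0.01, beta=0.1):
    """One approximate policy iteration step (Figure 4 sketch).
    Critic: linear value function V(x) = phi_v . feats, updated by TD(0).
    Actor: policy-gradient update of theta using the TD error as the
    internal reinforcement produced by the critic."""
    v = sum(w * f for w, f in zip(phi_v, feats))
    v_next = sum(w * f for w, f in zip(phi_v, feats_next))
    td_error = rho + gamma * v_next - v          # critic's internal reinforcement
    for i in range(len(phi_v)):                  # policy evaluation (blocks 412/414)
        phi_v[i] += beta * td_error * feats[i]
    probs = softmax_probs(theta, feats)          # policy improvement (block 406)
    for a in range(len(theta)):
        for i in range(len(feats)):
            grad = feats[i] * ((1.0 if a == u else 0.0) - probs[a])
            theta[a][i] += alpha * td_error * grad
    return td_error

# Hypothetical usage with 3 state features and 2 candidate actions:
theta = [[0.0] * 3 for _ in range(2)]
phi_v = [0.0] * 3
actor_critic_step(theta, phi_v, [1.0, 0.2, 0.0], [0.5, 0.1, 1.0], u=1, rho=0.3)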

For the present case a non-adaptive (or linear architecture) dictionary method function approximation approach has been adopted. That means that the solution space has been restricted to a particular class of parameterized functions of the form:

f(a) = ∑_{i=1}^{N} c_i B_i(a, β_i)   (8)

where a particular set of basis functions or features {B_i} constitutes a dictionary. The method is said to be non-adaptive when the parameters β_i of the basis functions are fixed before the basis functions are applied to approximate the function, and therefore just the coefficients c_i of the linear combination of N basis functions are estimated. Besides, the basis functions ought to be linearly independent so that there is no redundancy in the parameter estimation. Moreover, the selection of suitable basis functions is crucial to reduce the value (or policy) function approximation error and thusly the sub-optimality with respect to the actual function.
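As a small illustration of the dictionary method of equation (8), the sketch below keeps the basis-function parameters β_i fixed (Gaussian centers and widths, an assumed choice) and estimates only the coefficients c_i by stochastic gradient descent on squared error:

import math

def gaussian_basis(a, beta):
    """Fixed basis function B_i(a, beta_i) with beta_i = (center, width)."""
    center, width = beta
    return math.exp(-((a - center) / width) ** 2)

def fit_coefficients(samples, betas, lr=0.05, epochs=200):
    """Estimate the coefficients c_i of f(a) = sum_i c_i B_i(a, beta_i)
    by stochastic gradient descent on squared error; betas stay fixed."""
    c = [0.0] * len(betas)
    for _ in range(epochs):
        for a, target in samples:
            feats = [gaussian_basis(a, b) for b in betas]
            err = sum(ci * fi for ci, fi in zip(c, feats)) - target
            for i in range(len(c)):
                c[i] -= lr * err * feats[i]
    return c

# Hypothetical usage: approximate a function observed at sample points.
betas = [(x / 2.0, 0.5) for x in range(9)]           # fixed centers and widths
samples = [(x / 10.0, math.sin(x / 10.0)) for x in range(40)]
coeffs = fit_coefficients(samples, betas)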

Fuzzy Logic System as Function Approximator

Fuzzy logic has been chosen for function approximation because of some noteworthy properties of fuzzy inference systems, some of which are enumerated in what follows:

They allow both categorical and continuous actions to be generated as output;

As they are rule-based systems, they enable direct embedment of available prior knowledge or a priori bias as starting configuration, so that the controller 208 does not start its learning process from scratch, which:

Prevents an erratic behavior of the system at early stages of the control process; and

Focuses the learning process on a state-action space region of interest, i.e. where high reward can be expected, and thusly speeds up the convergence to a suitable control strategy.

They obviate the need for a prior mathematical modeling phase of the process, which, as in the present case, can exhibit high complexity.

They fulfil the universal approximation theorem, according to which any real continuous function can be locally approximated with arbitrary accuracy on a compact input set. This fact, together with their ability to handle both categorical and continuous output variables, makes them very attractive to deal with any sort of classification and regression (i.e., function approximation) problem.

The main problems to be overcome in the design of any fuzzy logic system can be split into two: structure identification, which involves the selection of the variables of the input and output spaces, the partition of the input and output spaces of the controller into fuzzy sets, and the determination of the rule-base size; and parameter estimation, which implies determining the parameters of the fuzzy sets in the input and output spaces of the controller.

Referring now to Figure 5, there is illustrated a diagrammatic view of a fuzzy inference system. The fuzzy inference system provides a non-linear mapping between a crisp input ({v_i}, i = 1,2,...,m) and a crisp output ({ψ_o}, o = 1,2,...,n). The crisp input is input to a "fuzzifier" block 502 to provide a set of fuzzy inputs. This basically utilizes a set of fuzzy constraints. This is then input to an inference block 504 that operates on a set of rules in a rule base block 506. This inference block 504 provides a fuzzy output set. Therefore, the inference block 504 provides a mapping function from a fuzzy set of inputs to a fuzzy set of outputs. The fuzzy set of outputs is then input to a "defuzzifier" block 508 to provide the crisp output. The domains or crisp universal sets Υ_i and Ψ_o that the different input and output variables belong to are called universes of discourse.
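A minimal runnable sketch of the Figure 5 pipeline for a single input and a single output follows, using triangular membership functions and a weighted-average (center of gravity style) de-fuzzification; the membership shapes and the two-rule base are illustrative assumptions:

def triangle(v, left, peak, right):
    """Triangular membership function on the universe of discourse."""
    if v <= left or v >= right:
        return 0.0
    return (v - left) / (peak - left) if v <= peak else (right - v) / (right - peak)

# Fuzzifier (block 502): linguistic values of the input variable v_1.
input_sets = {"low": (0.0, 0.0, 0.5), "high": (0.5, 1.0, 1.0)}
# Rule base (block 506): IF v_1 is <label> THEN psi_1 is <output center>.
rules = [("low", 0.9), ("high", 0.2)]

def infer(v):
    """Inference (504) plus weighted-average de-fuzzification (508):
    crisp input in, crisp output out."""
    num, den = 0.0, 0.0
    for label, out_center in rules:
        truth = triangle(v, *input_sets[label])   # rule truth value
        num += truth * out_center                 # weighted rule conclusions
        den += truth
    return num / den if den > 0 else 0.0

print(infer(0.3))   # crisp output for crisp input 0.3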

To specify rules for the rule base in block 506, a linguistic description is used for input and output variables and their characteristics.

Normally, constant linguistic variables are used to describe the generally time-varying quantities. For the fuzzy system, linguistic variables denoted v_i or ψ_o are used to describe the input and output variables.

In much the same way the input and output variables (v_i, ψ_o) take on values over each universe of discourse (Υ_i, Ψ_o), the linguistic variables take on linguistic values, or fuzzy labels, used to describe features of the variables, and these are defined over the universes of discourse. Let A_ij and B_op respectively denote the jth and pth linguistic values of the linguistic variables v_i and ψ_o; then the linguistic variables take on elements from the sets of linguistic values denoted by

A_i = { A_ij | j = 1,2,...,N_i }   (9)

B_o = { B_op | p = 1,2,...,N_o }   (10)

The mapping of the inputs to the outputs of a fuzzy system is partly characterized by a set of condition-to-action rules in modus ponens (i.e., IF premise THEN conclusion) form. Generally, the inputs to the system are associated with the premise, whereas the outputs are associated with the conclusion. A generic multiple input, multiple output (MIMO), m × n fuzzy system utilizes rules that can be formulated as follows:

IF v_1 is A_1 AND ... AND v_m is A_m THEN ψ_1 is B_1 AND ... AND ψ_n is B_n

which in turn can be decomposed into n equivalent MISO rules, where the conclusion considers just one output at a time, of the format:

IF v_1 is A_1 AND ... AND v_m is A_m THEN ψ_o is B_o

A special case of the previous ones are the so-called Takagi-Sugeno rule bases, where the output associated with the conclusion of some rules has a mathematical function expression, as the next example illustrates:

IF v_1 is A_1 AND ... AND v_m is A_m THEN ψ_o = f(v_1, ..., v_m)

Henceforth, N_r different and unambiguous rules, i.e., never sharing the same premise whilst yielding different conclusions, are assumed to make up the rule base.

Furthermore, fuzzy sets and fuzzy logic are used by the fuzzy system to quantify the meaning of linguistic variables and rules in the rule base. Let Υ_i denote a universe of discourse and A_ij ∈ A_i denote a specific linguistic value for the linguistic variable v_i.

Then, any function associated with A_ij that maps Υ_i onto the interval [0,1] is called the characteristic or membership function of A_ij. It is expressed formally:

μ_A_ij : Υ_i → [0,1]

Such a function depicts the certainty that an element of Υ_i, denoted v_i, may be linguistically classified as A_ij.

Given a linguistic variable v_i with a linguistic value A_ij defined on the universe of discourse Υ_i and with membership function μ_A_ij, a fuzzy set is defined as a set of ordered pairs

A_ij = { (v_i, μ_A_ij(v_i)) | v_i ∈ Υ_i }   (11)

Fuzzy logic defines a collection of set-theoretic and logical operations that can be performed on fuzzy sets.

Fuzzy sets are used to quantify the information in the rule base, and the inference mechanism operates on fuzzy sets to produce fuzzy sets; hence fuzzification, as utilized in block 502, converts the generally quantitative inputs v_i into fuzzy input sets for input to the inference block 504. Let Υ_i* denote the set of all possible fuzzy sets that can be defined on Υ_i. The fuzzification is defined as F: Υ_i → Υ_i*, where F(v_i) = Â_i.

The fuzzy inference mechanism has two main tasks:

To determine the extent to which every rule is relevant to the current situation as characterized by the inputs; and

To draw conclusions using the current inputs and the information in the rule base.

Firstly, the fuzzification step in block 502 fixes the fuzzy sets associated with the input variables:

μ_Â_ij(v_i) = μ_Â_i(v_i) * μ_A_ij(v_i)   (12)

where * denotes the t-norm or triangular norm operator.

Next, membership values are to be determined for the premises of the rules, p(r), in the rule base:

μ_p(r)(v_1, ..., v_m) = ω_r ( μ_A_1,r(v_1) * ... * μ_A_m,r(v_m) ),   r = 1,2,...,N_r   (13)

where ω_r is a rule certainty factor that weighs the confidence in the validity of the rule, and A_i,r denotes the fuzzy set associated with the input v_i ∈ Υ_i in the rth rule.

Subsequently, implications are to be inferred from the premise and the conclusion of each rule. Many alternative implication formulae, e.g., Dienes-Rescher, Goedel, Goguen, Lukasiewicz, Mamdani, Zadeh, etc., are available in the literature for the evaluation of the so-called truth value of the rule, which can be expressed formally as:

μ_R(r)(v_1, ..., v_m, ψ_o) = I( μ_p(r)(v_1, ..., v_m), μ_q(r)(ψ_o) )   (14)

where μ_q(r) denotes the membership associated with the conclusion of the rth rule in the rule base, and I denotes the chosen implication formula. Once the matching stage, that is, the evaluation of the rule membership values, has been completed, the inference step of block 504 follows. Two different choices for the inference method are available:

Composition-based inference, which combines all rules in the rule base into a single fuzzy relation, which is then viewed as a single IF-THEN rule; or

Individual-rule based inference, where each rule in the rule base determines an output fuzzy set, and the overall output fuzzy set then results from combining the aforementioned individual output fuzzy sets.

Many different methods can be followed to implement whichever of the inference options is implemented.

For the second case, regardless of the specifically followed approach, the outcome of the inference engine 504 for each rule in the rule base must be the corresponding implied fuzzy set B̂_o,r with membership function:

μ_B̂_o,r(ψ_o),   r = 1,2,...,N_r

Through the phase known as de-fuzzification in block 508, the fuzzy system returns a crisp outcome. To accomplish this de-fuzzification, a set of de-fuzzification techniques can be applied, according to the sort of inference chosen in the inference stage. However, for individual-rule based inference and function approximation purposes, so as to align with the discussion initiated in the preceding sections, attention must be paid to those approaches, e.g., center of gravity or center of area, which yield the output value according to a linear architecture expression, that is:

ψ_o = ∑_{r=1}^{N_r} κ_r Φ_r(v_1, ..., v_m) = κ^T Φ   (15)

There are theoretical results which extend the understanding of the convergence of approximate policy iteration methods. In accordance with these results, rather than approximating a value function and using it to compute a deterministic policy, which is unlikely to be optimal, especially when applied to the current problem, a parametric stochastic policy (π(x,u,θ)) has been preferred to be directly approximated instead. By doing the above, on the one hand any discrete action space is effectively transformed into a continuous one defined by the policy parameters; on the other hand, a continuous exploration of the action space over time becomes implicit to the learning process. The latter has benefits to either countervail local minima traps, normally associated with gradient descent-based procedures, or slow-rate non-stationary drifts of the underlying dynamics.

The update of the policy parameters in the policy improvement step in block 406 is then performed by means of a policy-gradient approach, which, through stochastic approximation or other similar techniques, estimates the gradient of a performance index (e.g. the average expected return per step) with respect to the policy parameters θ and updates their values by a steepest ascent step on that index.

The joint use of critic feedback from the critic block 404 brings about a reduction of the variance associated with the gradient estimate, at the expense of extra bias in the estimate. Nonetheless, the convergence to a locally optimal policy when policy gradient methods are applied to approximate policy iteration imposes some requirements on the function approximation scheme of the value functions and the policy.

The authors in Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, 12:1057-1063, 2000, and Vijay R. Konda and John N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, 42(4):1143-1166, 2003, proved that a sufficient condition for replacing the actual state-action value function Q(x,u,π) by a compatible function approximator f(x,u,π,ζ), parameterized by the vector ζ, without affecting the unbiasedness of the gradient, is that it verifies:

f(x,u,π,ζ) = ζ^T ∇_θ ln π(x,u,θ)   (16)

However, according to the authors in Jan Peters, Sethu Vijayakumar, and Stefan Schaal, "Reinforcement learning for humanoid robotics," Proceedings of the IEEE-RAS Third International Conference on Humanoid Robots, 2003, this function approximator rather approximates the so-called advantage value function, which is defined as:

A(x,u,π) = Q(x,u,π) - V(x,π)   (17)

Based on results disclosed in Sham Kakade, "A natural policy gradient," Advances in Neural Information Processing Systems, 14, 2002, the authors propose the adoption of a natural policy gradient approach, wherein the natural gradient is computed in lieu of the regular gradient. The natural gradient is a second-order gradient optimization method, consequently with high convergence speed, based on differential geometry, that overcomes many of the limitations shown by ordinary gradient estimation, as described in S. Amari and S. C. Douglas, "Why natural gradient?" Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2:1213-1216, May 1998.
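To make equations (16) and (17) concrete, the sketch below computes the compatible features ∇_θ ln π(x,u,θ) for an assumed softmax policy parameterization and evaluates the compatible approximator ζ^T ∇_θ ln π, which, per the Peters et al. observation above, approximates the advantage A(x,u,π). The feature map and parameterization are illustrative assumptions:

import math

def grad_log_softmax(theta, feats, u):
    """Compatible features grad_theta ln pi(x,u,theta) for a softmax policy
    with pi(u|x) proportional to exp(theta[u] . feats)."""
    prefs = [sum(t * f for t, f in zip(row, feats)) for row in theta]
    m = max(prefs)
    probs = [math.exp(p - m) for p in prefs]
    z = sum(probs)
    probs = [p / z for p in probs]
    # gradient w.r.t. theta[a][i] is feats[i] * (1{a==u} - pi(a|x))
    return [[f * ((1.0 if a == u else 0.0) - probs[a]) for f in feats]
            for a in range(len(theta))]

def advantage_estimate(zeta, theta, feats, u):
    """Compatible approximator f(x,u) = zeta^T grad_log pi, per equation (16);
    it approximates A(x,u,pi) = Q(x,u,pi) - V(x,pi) of equation (17)."""
    g = grad_log_softmax(theta, feats, u)
    return sum(zeta[a][i] * g[a][i]
               for a in range(len(g)) for i in range(len(feats)))

# Hypothetical usage: 2 actions, 3 state features, taken action u = 0.
theta = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.0]]
zeta = [[0.05, 0.0, 0.1], [0.0, 0.02, 0.0]]
print(advantage_estimate(zeta, theta, [1.0, 0.5, 0.0], u=0))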

For the current disclosure, a fuzzy logic system has been used for two purposes:

Firstly, it is used as a function approximator, in the form of a linear architecture, that approximates the state value function V(x,π,φ); and

Secondly, it is used to implement the parameterized stochastic policy π(x,u,θ). To that purpose, the two resulting fuzzy systems should share the same state perception definition x, which indeed acts as an input to the fuzzy system. In what follows, the elements describing the input are introduced. Thereafter, it is explained which elements define the output of the fuzzy systems for the two cases, and how the twofold functionality is carried out.

One of the first and main steps in fuzzy system design is the selection of its input variables, as they are collected by the controller to build an internal representation of the state of its environment. Despite the limitations of sensors, which may render the snapshot incomplete and the perception local and partial, the more informative the sensory input, the better the performance of reinforcement learning techniques.

The chosen variables for the current problem can be roughly divided into three different categories:

Input variables which illustrate the channel condition;

Input variables which illustrate the network state; and

Input variables which illustrate the features of the currently conveyed traffic.

To depict the channel condition, several alternative inputs are suggested:

Background noise level. It can be monitored by the radio device during idle periods, in which the device neither transmits nor receives frames. An example of an idle period is the so-called back-off stage used by the IEEE 802.11 radio technology to access the medium. Ideally, the reading available to the transmitter side should be the one measured on the receiver side. As this may not be possible for many radio technologies, such measurement may be performed on the transceiver side. Then, the measured value works as an indicator of the presence and activity of potential interferers to the receiver side of the communication link. The term "potential" is utilized because such radiators might differently affect the receiver and the transmitter, e.g. because of their likely unlike geographical proximity and radio path characteristics, and because it is not always possible to detect all on-going transmissions;

Received signal strength (RSS). This value can be estimated by the radio device from any incoming frame. As it is a measure of the total received power, which comprises the wanted signal as well as noise and interference, it is not a very accurate link quality estimator.

In addition to received signal strength, link quality can be measured from incoming frames in terms of more informative and less noisy indicators such as the signal to interference plus noise ratio (SINR) or another similar index.

Especially relevant to the evaluation of the presented indexes are those frames which are received approximately on a regular basis, e.g. the so-called beacon management frames of IEEE 802.11 systems, which are issued by an access point device to all the radio devices associated with it; because of their regularity, they can be used for making predictions on link quality.

Both indexes supply information on propagation losses (and so distance), multipath fading and distortion, and slow shadowing effects in the channel. Additionally, the second one, SINR, also reports on the impact of interferers close to the radio device.

In addition to the instantaneous values, an estimate of the time derivative, computed as the ratio between the difference of the two latest subsequent measurements of any magnitude and the time difference between those measurements, should also be computed and added as an input to the fuzzy system, as it provides valuable insight to the control system on the evolution trend of the observed channel condition and, for instance, allows variations due to mobility to be considered.

As the series resulting from measuring the afore-listed magnitudes is expected to be quite noisy, consideration should be given to using their exponentially weighted moving average values as inputs to the fuzzy system, instead of directly using the actual instantaneous readings. As an example, the smoothed average value M_Σ and deviation-time quotient M_Δ of a generic magnitude M can be computed from its measurements as:

M_Σ[k] = a_Σ M_Σ[k-1] + (1 - a_Σ) M[k]   (18)

M_Δ[k] = a_Δ M_Δ[k-1] + (1 - a_Δ) (M[k] - M[k-1]) / (t[k] - t[k-1])   (19)

where a_Σ and a_Δ are weight factors ranging from 0 to 1, which tune the low-pass filtering process and whose selection, closer to 1 or to 0, steers the estimate towards the averaged value or the latest measurement, depending on which one is to be emphasized.
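A short sketch of the smoothing of equations (18) and (19), applied to a stream of timestamped, noisy measurements; the function and variable names are illustrative:

def smooth_stream(measurements, a_sigma=0.9, a_delta=0.9):
    """Exponentially weighted moving average (18) and smoothed
    time-derivative (19) of a noisy measurement series.
    measurements: list of (t, M) pairs with increasing timestamps t."""
    m_sigma, m_delta = measurements[0][1], 0.0
    prev_t, prev_m = measurements[0]
    history = [(m_sigma, m_delta)]
    for t, m in measurements[1:]:
        m_sigma = a_sigma * m_sigma + (1 - a_sigma) * m           # eq. (18)
        derivative = (m - prev_m) / (t - prev_t)                  # latest slope
        m_delta = a_delta * m_delta + (1 - a_delta) * derivative  # eq. (19)
        history.append((m_sigma, m_delta))
        prev_t, prev_m = t, m
    return history

# Hypothetical usage with noisy RSS readings sampled once per second:
readings = [(k, -70 + (k % 3)) for k in range(10)]
print(smooth_stream(readings)[-1])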

Furthermore, in the disclosed embodiment, it has been assumed that the radio device operates in a low-mobility scenario, which is thence linked to a very slowly fading channel, which in turn makes it possible for slightly outdated measurements to be used for link-adaptation purposes. Notwithstanding this assumption, in the event the available measurements were deemed too old to be useful, an auxiliary extrapolation of their updated value might be attained as the outcome of a linear predictor incorporating a Kalman filter scheme to recursively estimate the values of its coefficients.

Ideally, it would be desirable that:

The transmit power level of the incoming frames, i.e., the one used at the receiver side, be known; and

The aforementioned indexes be evaluated at the receiver side and fed back to the transmitter side. To this end, a feedback mechanism to make this information available to the transmitter side of the link should be available.

Unfortunately, for most radio LAN technologies, the latter is not usually the case. Alternatively, such a mechanism might be implemented as an additional protocol at OSI stack layers above the data-link layer. Otherwise, as a trick, both directions of the radio link can be roughly assumed to be symmetrical, except as to interferers, this being what makes SINR a more robust and sensible choice than RSS.

Finally, due to the noise associated with the measurements of the prior values, it would be desirable that smoothed averaged values be considered instead of actual readings.

Frame loss rate (FLR). In case acknowledgement control messages are in use by the radio LAN technology, this index can be easily estimated at the transmitter side. In case any error rate indicators (e.g. BER, BLER or alike) were available from feedback originated at the receiver side, it is strongly suggested that their time derivative estimates be contemplated as an additional input.

With regard to the network status, the following input variables have been deemed apposite.

Ratio between the number of spent transmission attempts and the maximum number of re-transmissions usable by the current frame. This index is used to signal to the controller the urgency required for the delivery of the frame, as well as its delay QoS requirements. It is thence assumed that, at a previous stage, a suitable value for the maximum number of re-transmissions has been computed by a scheduler according to the traffic QoS requirements and its monitoring of the on-going transmission status.

Busy medium time. The estimation of this index requires a general-purpose listen-before-transmit policy to be incorporated by the radio technology. Moreover, a threshold on the measured noise level is to be fixed in order to consider the medium as busy, irrespective of the nature of the sensed interferer (like or unlike), and to allow its computation. This indicator, or equivalently its percentage along monitored periods, is valuable to evaluate the local interference congestion of the air interface in the vicinity of the radio device.

Network and transport layer protocols in use, as different protocol mechanisms may render distinct interactions with lower layers' transmission mechanisms.

Finally, as for the conveyed traffic characterization, the variables listed in what follows are suggested.

Size of the frame to be transmitted. It can be computed as the length in bytes of the frame's data payload, ranging from 1 to the maximum length allowed for a frame. Indeed, if frame aggregation mechanisms were also possible, this value should consider the maximum aggregated amount of data instead.

Traffic category. It can be defined as a label or index that is used to characterize the type of traffic of the frame to be sent. As an example, consider the 4 access categories which are used by the IEEE 802.11e radio technology to differentiate frames and prioritize their access to the medium. Such differentiation may prove useful, as dissimilar sorts of traffic normally originate different QoS requirements. Consider as an example the differences between elastic and streaming media traffic, where the QoS requirements of the first category are more biased towards reliability (high error-sensitivity), and those of the second one towards timeliness (high delay-sensitivity). This fact may have some impact on the selection of an appropriate transmission parameters set.

Instead of the prior view, based on prioritized QoS given by the traffic category, parameterized QoS could also be supported by feeding into the controller some performance-relevant parameters of its QoS specification, e.g., threshold values for delay, error rates or jitter, expected average data rates, etc., under the assumption that feedback of the actual values is available to the transmitter.

Weighted buffer occupancy. This index weighs the pressure exerted by the arriving traffic that waits in the buffer queues to be delivered, and thereby incurs a delay cost. On the other hand, the occupancy of the buffer should be weighted according to the delay requirements or sensitivity of the stored traffic, as far as different traffic categories are considered. For instance, this index might be computed according to the expression:

B[k] = ∑_{i=1}^{TC} W_i M_i[k]   (20)

where B[k] denotes the weighted buffer occupancy at instant k; TC denotes the number of traffic category instances that the system can handle; W_i denotes the weight associated with the ith traffic category; and M_i[k] denotes the amount of memory in the buffer that is occupied by traffic of the ith traffic category at instant k.
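Equation (20) reduces to a one-line weighted sum. In the sketch below, the four categories and their weights are illustrative assumptions loosely patterned on 802.11e-style access categories:

def weighted_buffer_occupancy(occupancy, weights):
    """Equation (20): B[k] = sum_i W_i * M_i[k].
    occupancy: bytes buffered per traffic category at instant k.
    weights: delay-sensitivity weight W_i per traffic category."""
    return sum(w * m for w, m in zip(weights, occupancy))

# Hypothetical usage with four access categories
# (background, best effort, video, voice):
weights = [0.5, 1.0, 2.0, 4.0]       # higher weight for delay-sensitive traffic
occupancy = [2000, 1500, 9000, 300]  # M_i[k] in bytes
print(weighted_buffer_occupancy(occupancy, weights))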

On the other hand, the structure identification also comprises the selection of the output variables to be handled (though indirectly, through the actor policy in block 402 by the controller) and to be determined upon any frame transmission. The ones listed in what follows have been chosen for the present case:

Channelization or channel bandwidth in use, as some radio technologies can on-the- fly vary their spectrum bandwidth through e.g. flexible spread spectrum techniques, OFDM techniques, or simply channel bundling;

Channel coding scheme;

Joint source and channel coding scheme at application layer, including compression schemes;

Modulation scheme;

Number of streams or flows, e.g. when multiple layered schemes are in use, and thus, additionally to a base layer, several enhancement layers can be sent;

Packet aggregation or fragmentation. Frame size can be modified by both mechanisms at the data link layer, or alternatively by changing the packetisation at upper layers;

Silence or extended medium back-off period, while waiting for a meliorated channel condition; and

Transmission power level.

As can be seen, some of these mechanisms are readily available to existing radio LAN technologies, i.e., they are part of the standards that define them, but others have to be dealt with at layers above the physical and data-link layers. Such a joint or cross-layer approach makes possible a harmonized use of the mechanisms available at different layers of the OSI stack.

Function Approximation of the State Value Function

As described hereinabove with respect to the description of fuzzy logic, a Takagi-Sugeno representation for the rules in the rule base can be adopted. The input variables to such rules have been chosen to be the ones listed in the prior paragraphs. By definition, the output of any Takagi-Sugeno rule is a function of the values of the input variables. For the present case, the function is just a real number, φ_r, whose value is updated during the value function stage performed by the critic element of the controller 208.

In accordance with the properties of the fuzzy logic system, and provided it uses either the center of gravity or the center of area as de-fuzzification method, the state value function is approximated by the fuzzy system according to the expression:

$$V(x) = f\big(\mu_1(x)\,\varphi_1,\ldots,\mu_R(x)\,\varphi_R\big) \qquad (21)$$

where μ_r(x) denotes the truth value associated with the rth rule in the rule base, and f refers to the method used to compute the de-fuzzification.
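The following sketch illustrates, under assumptions, a value function approximated as in equation (21): Gaussian membership functions, eight rules and a three-dimensional state are choices made here only to obtain a self-contained example; the embodiment does not prescribe them.

```python
import numpy as np

def truth_values(x, centers, widths):
    """mu_r(x): truth value of each rule for state x (product of Gaussians)."""
    d = (x[None, :] - centers) / widths
    return np.exp(-0.5 * np.sum(d * d, axis=1))   # shape (R,)

def value(x, centers, widths, phi):
    """V(x) via center-of-gravity de-fuzzification of the rule outputs phi_r."""
    mu = truth_values(x, centers, widths)
    return float(mu @ phi / (mu.sum() + 1e-12))

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(8, 3))  # 8 rules over a 3-D state space
widths = np.full((8, 3), 0.3)
phi = rng.normal(size=8)                      # phi_r: per-rule value parameters
print(value(np.array([0.4, 0.7, 0.1]), centers, widths, phi))
```

The normalized truth values computed here can also play the role of the basis functions φ(x) that reappear in equation (34) below.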

Parameterized Stochastic Policy

From the list of output variables, it can be seen that for current radio LAN technologies most of their values are categorical, that is, discrete and fixed. Even for the continuous ones, e.g., transmitter power level or frame size (fragmentation or aggregation), it is likely, and would generally be advisable, that only a set of discrete, fixed values be available.

As previously defined, a policy is a probability distribution over the action space. Let u_o be one of the N_O output variables, and let {u_{o,1}, u_{o,2}, ..., u_{o,N_o}} be the set of N_o alternative options available to the controller 208 to perform the action associated with it. By definition, the next expression:

$$\sum_{q=1}^{N_o} \pi(x, u_{o,q}) = \sum_{q=1}^{N_o} P(u_{o,q} \mid x) = 1 \qquad (22)$$

prevails regardless of the considered output variable u_o and state x.

Once again, a Takagi-Sugeno representation has been adopted for the rules. Let the rth rule in the rule base be considered. Because of the collection of output variables and their different alternatives, the fuzzy system has to have multiple outputs. Thus, a real vector $\vartheta_o^r = [\vartheta_{o,1}^r\ \vartheta_{o,2}^r \cdots \vartheta_{o,N_o}^r]^T \in \mathbb{R}^{N_o}$ can be associated with the oth output variable and the rth rule in the fuzzy rule base. Moreover, either the center of gravity or the center of area method has been chosen as de-fuzzification method. In case the components of the vector are chosen so that the equality:

$$\sum_{q=1}^{N_o} \vartheta_{o,q}^r = 1 \qquad (23)$$

holds true, then the value ϑ_{o,q}^r can be interpreted as the probability assigned by the rth rule in the rule base that the qth action in {u_{o,1}, u_{o,2}, ..., u_{o,N_o}} is taken as the oth output variable. Concisely:

$$\vartheta_{o,q}^r = P(u_o = u_{o,q} \mid \text{the } r\text{th rule is active}) \qquad (24)$$

However, the outputs of the Takagi-Sugeno rules in the rule base are real numbers θ_{o,q}^r(x), which do not directly define a probability distribution over the action space, but weighting factors on actions instead. Therefore, a method has been chosen to derive the probability values ϑ_{o,q}^r from the actual outputs.

Let the outputs of the rule base be mapped onto the interval [0,1] through the transform:

$$s_{o,q}^r = \frac{1}{1 + \exp(\theta_{o,q}^r)} \qquad (25)$$

Then, the resulting probability can immediately be calculated as:

$$\vartheta_{o,q}^r = \frac{f(s_{o,q}^r)}{\sum_{q'=1}^{N_o} f(s_{o,q'}^r)} \qquad (26)$$

where f is a positive, real, strictly monotone increasing function on [0,1] to be chosen, e.g., f(x) = x or f(x) = exp(kx), k > 0.

This way, the resulting multiple-output fuzzy system implements the parameterized policy with parameter θ, where:

$$\theta_o = [\theta_{o,1}^T\ \theta_{o,2}^T \cdots \theta_{o,N_R}^T]^T \qquad (27)$$

$$\theta = [\theta_1^T\ \theta_2^T \cdots \theta_{N_O}^T]^T \qquad (28)$$

Finally, the approximated policy associated with the oth output variable can be evaluated according to:

$$\pi(x, u_{o,q}, \theta) = \frac{\sum_{r=1}^{N_R} \mu_r(x)\, \vartheta_{o,q}^r}{\sum_{r=1}^{N_R} \mu_r(x)} \qquad (29)$$

It should be noted that, as can be seen from this expression, the policies of each of the N_O output variables are dealt with independently; for instance, the policy followed to select the transmitter power level has nothing to do with that adopted to pick out the channel coding scheme, and for this reason u_o has been preferred instead of u in the expression of the parameterized policy π(x, u_o, θ). Lastly, it should be observed that the functions associated with the Takagi-Sugeno rules have been chosen to be real values in this example for the sake of simplicity. Certainly, much more complex functional expressions, perhaps involving a higher number of parameters to be incorporated by the parameterized policy, and taking into consideration the input of the fuzzy system, i.e., the state x, might as well have been selected as the expression of the function θ_{o,q}^r = f(x). The same holds true for the mapping defined by equations 25 and 26.
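To make the construction concrete, the sketch below chains equations (25), (26) and (29) for a single output variable: per-rule outputs θ are squashed onto (0,1), normalized into per-rule probability vectors with the choice f(s) = s, and then blended by the normalized truth values. The rule count, option count and all numeric values are assumptions for illustration.

```python
import numpy as np

def rule_action_probs(theta_r):
    """Eq. (25)-(26) with f(s) = s: rule outputs -> one rule's action probabilities."""
    s = 1.0 / (1.0 + np.exp(theta_r))   # squash each output onto (0, 1)
    return s / s.sum()                  # normalize so the options sum to 1

def policy(mu, theta):
    """Eq. (29): mix the per-rule distributions, weighted by the truth values."""
    mu = mu / mu.sum()
    per_rule = np.apply_along_axis(rule_action_probs, 1, theta)  # (R, N_o)
    return mu @ per_rule                # distribution over the N_o options

theta = np.random.default_rng(1).normal(size=(8, 4))  # 8 rules, 4 options
mu = np.array([0.10, 0.50, 0.20, 0.05, 0.05, 0.04, 0.03, 0.03])
print(policy(mu, theta))  # e.g. probabilities over 4 modulation schemes
```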

Reinforcement Function

The reinforcement signal plays a major role in the learning process of the controller 208, as it is the feedback from the environment that the controller 208 uses to modify its behavior. The reinforcement signal primarily concerns the goal. However, in order to soothe the so-called temporal credit assignment problem associated with delayed reinforcement, it is useful to supply some intermediate reinforcement. As a result, it is recommended to put some effort into the design of an enlightening, and hence probably more complex, reinforcement signal function, which rather acts as a performance index or utility function of the controller 208.

Intermediate reinforcement can be introduced through reinforcing multiple goals or subtasks, or by using progress estimators. On the one hand, multiple goals can be reinforced individually by means of a heterogeneous reinforcement function. On the other hand, progress estimators act as partial internal critics associated with specific goals and provide a metric of improvement relative to the task.

For the present case, several goals are targeted by the controller and, accordingly, are to be contemplated and weighed by the reinforcement function.

Backlog. It is counted by means of the evolution of the weighted buffer occupancy that occurred during a transmission attempt. Similarly, any frame loss or delay originated by buffer overrun also has to be taken into account.

However, it should be observed that backlog should normally be a more relevant aspect for elastic or error-critical traffic than for streaming media traffic, wherein, due to its delay-constraints, buffering at the transmitter side might not be such a critical issue. Thus, backlog could be skipped as a factor when transmitting streaming media traffic.

Co-existence. It is counted in terms of the power spectral density at the transmitter output, which in turn can be estimated as the ratio between the RF power and the channel bandwidth in use. This is an especially sensible choice when narrow-band and broad-band radio systems are to co-exist, as the latter suffer more from interference from the former than the other way around.

Data rate. It is counted as the number of useful information bits successfully delivered. Then again, frames belonging to different traffic categories can be differently considered.

Delay. It is counted with the aid of the time passed from the instant the frame is ready to access the medium until it is delivered.

Energy consumption. It is derived from the RF energy spent for every frame transmission attempt, and the one spent at subsequent receive periods waiting for an acknowledgement frame, if any, or at prior idle periods while backing off the medium.

Certainly, energy consumption, which does not depend upon radio transmission alone, might also be considered and incorporated into the evaluation for the sake of completeness.

Reliability. In the absence of feedback on error rates from the receiver side, it can be counted by means of the number of transmission attempts required for successful frame delivery. On the other hand, potential frame loss, due to the maximum number of re-transmissions available for the frame being exceeded, is also to be considered.

At this point it ought to be noted that, in the event of parameterized QoS support, i.e., traffic being categorized through QoS parameters in lieu of traffic class labels, the reward function might be further augmented to assess the degree of conformance to the traffic QoS specification (e.g., fulfillment of threshold or average values in the input requirements and the like).

In accordance with the preceding description of the control goals, many alternative formulations can be contrived for a suitable reinforcement function specification. In any case, one possible approach is described in the following description.

Auxiliary indexes have been defined for the backlog (η_b), co-existence (η_cx) and reliability (η_r) categories. Their values have been chosen to range from 0 to 1 when the current frame has been correctly delivered and from -1 to 0 in the opposite case. Furthermore, three different monotonically increasing, positive, real-valued functions defined on the domain [-1,1], henceforth denoted respectively as f_b(η_b), f_cx(η_cx) and f_r(η_r), have to be defined. The definition of those indexes has been summarized in Table 2.

Table 2: Auxiliary indexes for partial reinforcement function definition (the table itself is reproduced as an image in the original filing)

In Table 2, B denotes the backlog; W denotes the power spectral density; and T denotes the ordinal of the transmission attempt of the current frame. The maximum (·)_MAX or minimum (·)_min values are related either to the radio LAN technology in use or to the currently transmitted frame; e.g., T_MAX denotes the maximum number of transmission retries available to the frame before it is discarded.

Finally, in relation to the backlog and reliability categories, two additional functions have been defined to penalize the occurrence of frame losses, be it due to buffer overrun or to exceeding the number of transmission tries available:

$$f_{overrun} = \begin{cases} \zeta_{b,o} & \text{if there is buffer overrun} \\ 0 & \text{otherwise} \end{cases} \qquad (30)$$

$$f_{excess} = \begin{cases} \zeta_{r,l} & \text{if the maximum number of transmission retries is exceeded} \\ 0 & \text{otherwise} \end{cases} \qquad (31)$$

where both ζ_{b,o} and ζ_{r,l} are negative constant values.

It should be mentioned that for some radio LAN technologies, one of the actions available to the controller might be to further back off the medium, after sensing a severely adverse channel and network condition for attempting frame transmission, while waiting for a melioration of the status. In that case it is suggested that, as a hard-wired strategy on silence, the radio device not be allowed to take such a decision more than once within a certain time window. The reason is that, although the effect is probably positive, in not uselessly spending energy, not unnecessarily raising the overall interference level perceived by other users, and not bringing on a snow-balling effect, such an extended back-off period may as well have a negative impact on the traffic waiting to be delivered by the radio device.

Such a restriction can be implemented as a rule to be observed by the system, or alternatively as an additional negative penalizing term (ζ_xbo) for such extended back-off, similar to the frame-loss penalties introduced hereinabove.

The data rate, delay and energy consumption categories have been collapsed into one single ratio, which has as numerator the amount of information bits, i.e., useful data in the frame, and as denominator the amount of energy spent by the radio device throughout the time elapsed since the start of the first transmission attempt of the frame. This indicator is referred to as energy efficiency and is hereinafter denoted as η_ee.

Similarly to the aforementioned categories, a positive, real function f_ee, monotone increasing with respect to η_ee, can be linked to the energy efficiency. The suggested form is as follows:

$$f_{ee}(\eta_{ee}) \cdot w(\text{traffic category}) \qquad (32)$$

where the positive factor w weighs the function in accordance with the traffic category the delivered frame belongs to.

The resulting reinforcement function ρ can then be formulated as:

$$\rho[k] = v[k]\Big(f_{ee}(\eta_{ee}[k]) \cdot w(\text{traffic category}) + f_b(\eta_b[k]) + f_{cx}(\eta_{cx}[k]) + f_r(\eta_r[k])\Big) + f_{overrun}[k] + f_{excess}[k] \qquad (33)$$

where v[k] is 1 if the kth frame has been successfully delivered and -1 otherwise. As can easily be deduced, the reinforcement function for failed frame transmission attempts acts as a kind of accounting of the opportunity cost, in other words, of the self-inflicted loss of utility that the controller 208 experiences as a consequence of an improper action selection. Such extra awareness might positively contribute to more effective decision making and resource exploitation.
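A hedged numeric sketch of equations (30), (31) and (33) follows. The linear shaping f(η) = η + 1 is merely the simplest function meeting the stated requirements (real, monotonically increasing and non-negative on the index range), and every constant here is an assumption, not a value taught by this disclosure.

```python
def f_lin(eta):
    """Simplest admissible shaping function: monotone increasing, non-negative."""
    return eta + 1.0

def reinforcement(delivered, eta_ee, eta_b, eta_cx, eta_r, w_tc,
                  overrun=False, excess=False, zeta_bo=-2.0, zeta_rl=-2.0):
    v = 1.0 if delivered else -1.0                   # success indicator v[k]
    r = v * (f_lin(eta_ee) * w_tc + f_lin(eta_b)
             + f_lin(eta_cx) + f_lin(eta_r))         # eq. (33), main terms
    r += zeta_bo if overrun else 0.0                 # eq. (30): overrun penalty
    r += zeta_rl if excess else 0.0                  # eq. (31): retry-limit penalty
    return r

# A successfully delivered streaming frame with good energy efficiency:
print(reinforcement(True, eta_ee=0.8, eta_b=0.3, eta_cx=0.5, eta_r=0.9, w_tc=1.5))
```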

Actor Critic Algorithms

The so-called Natural Actor-Critic algorithm has been found adequate to implement the policy iteration method and consequently has been set forth in this disclosed embodiment. In what follows, the main loop followed by this procedure is described:

At time step k an action u[k] is drawn according to the parameterized policy in use, π(x[k], u[k], θ[k]).

The resulting next state x[k+1] and the immediate reinforcement ρ(x[k], u[k]) are observed by the controller 208.

The critic performs the policy evaluation step. To that purpose, the fuzzy logic inference system is used as a function approximator to evaluate the state value function, which yields:

$$V(x[k], \pi[k], \varphi[k]) = \varphi[k]^T \phi(x[k]) \qquad (34)$$

Similarly, the compatible function approximator is computed as:

$$f(x[k], u[k], \pi[k], \zeta[k]) = \zeta[k]^T \nabla_\theta \ln \pi(x[k], u[k], \theta[k]) \qquad (35)$$

The computation of $\nabla_\theta \ln \pi(x[k], u[k], \theta[k])$ abides by the following expression:

$$\nabla_\theta \ln \pi(x[k], u[k], \theta[k]) = \frac{\nabla_\theta \pi(x[k], u[k], \theta[k])}{\pi(x[k], u[k], \theta[k])}$$

The auxiliary basis functions are updated according to:

$$\tilde{\phi}[k] = [\phi(x[k+1])^T\ 0^T]^T \qquad (36)$$

$$\hat{\phi}[k] = [\phi(x[k])^T\ \nabla_\theta \ln \pi(x[k], u[k], \theta[k])^T]^T \qquad (37)$$

In order to reduce the impact of the temporal credit assignment problem, which can be defined as the difficulty of crediting or blaming past actions that have contributed to the success or failure of the task, an accumulating eligibility trace mechanism with recency factor λ ∈ [0,1] is applied. By doing so, previously visited states are memorized temporarily by weighting the basis functions by their proximity to time step k. The outcome is then used to compute some auxiliary elements required to perform the projection step:

$$z[k+1] = \lambda z[k] + \hat{\phi}[k] \qquad (38)$$

$$A[k+1] = A[k] + z[k+1]\big(\hat{\phi}[k] - \gamma \tilde{\phi}[k]\big)^T \qquad (39)$$

$$b[k+1] = b[k] + \rho(x[k], u[k])\, z[k+1] \qquad (40)$$

where the factor γ denotes the discount factor.

Finally, the state value function parameters and the natural gradient estimate are calculated:

$$[\varphi[k+1]^T\ \zeta[k+1]^T]^T = A[k+1]^{-1}\, b[k+1] \qquad (41)$$

The actor checks the convergence of the natural gradient updated by the critic. To make this possible, it checks that the latest natural gradient estimates over an h-length time window lie inside an open ball of radius ε centered at the latest natural gradient estimate, denoted as B_{ε,h}(ζ[k+1]); i.e., the convergence check ascertains the validity of:

$$d(\zeta[k+1], \zeta[k-i]) < \varepsilon, \quad \forall i \in [0, h]$$

where d denotes the distance function or metric. Thence, the tuning of the two ball-defining parameters becomes relevant to prevent oscillations in the learning process as well as to handle possible non-stationarity.

In case the natural gradient has not yet converged:

$$\theta[k+1] = \theta[k]$$

Otherwise, the policy parameters are updated:

$$\theta[k+1] = \theta[k] + \alpha\, \zeta[k+1] \qquad (42)$$

where α is a step size parameter of the gradient update. Lastly, the eligibility traces (z[k]) and the other auxiliary elements are reset by means of a learning rate factor β according to the expressions:

$$z[k+1] = \beta z[k] \qquad (43)$$

$$A[k+1] = \beta A[k] \qquad (44)$$

$$b[k+1] = \beta b[k] \qquad (45)$$
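The sketch below compresses one pass of the loop just described into runnable Python. To keep it self-contained, it substitutes a plain softmax-over-linear-features policy and uses the raw state as the critic's basis functions φ(x), in place of the fuzzy system of the embodiment; the convergence test and the β-reset of equations (43)-(45) are omitted. All shapes, step sizes and the regularized initialization are assumptions made only for illustration.

```python
import numpy as np

def nac_step(x, u, x_next, rho, z, A, b, theta, n_actions, lam=0.9, gamma=0.95):
    """One Natural Actor-Critic critic update (eqs. 36-41), stand-in policy."""
    n_feat = x.size
    logits = theta.reshape(n_actions, n_feat) @ x
    pi = np.exp(logits - logits.max()); pi /= pi.sum()
    # grad_theta ln pi(u|x) for a softmax-linear policy:
    g = (np.eye(n_actions)[u] - pi)[:, None] * x[None, :]
    phi_tilde = np.concatenate([x_next, np.zeros(theta.size)])   # eq. (36)
    phi_hat = np.concatenate([x, g.ravel()])                     # eq. (37)
    z = lam * z + phi_hat                                        # eq. (38)
    A = A + np.outer(z, phi_hat - gamma * phi_tilde)             # eq. (39)
    b = b + rho * z                                              # eq. (40)
    sol = np.linalg.lstsq(A, b, rcond=None)[0]                   # eq. (41)
    return z, A, b, sol[n_feat:]  # last block: natural-gradient estimate zeta

n_feat, n_actions = 3, 4
dim = n_feat + n_feat * n_actions
rng = np.random.default_rng(2)
theta = rng.normal(size=n_feat * n_actions)
z, A, b = np.zeros(dim), 1e-3 * np.eye(dim), np.zeros(dim)
x, x_next = rng.uniform(size=n_feat), rng.uniform(size=n_feat)
z, A, b, zeta = nac_step(x, 1, x_next, rho=0.7, z=z, A=A, b=b,
                         theta=theta, n_actions=n_actions)
print(zeta)  # would update theta via eq. (42) once the estimate converges
```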

In order to take an action from the resulting stochastic policy, the following procedure, among many others, is suggested. Through suitable random number generators, random values can be drawn independently for each of the N_O output variables in order to select, in accordance with the probability distributions determined by the current policy, the actuation command or primitive to be actually applied for that output.
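Under the same illustrative assumptions, such a draw can be implemented per output variable with independent categorical sampling; the output-variable names and distributions below are placeholders only.

```python
import numpy as np

rng = np.random.default_rng()

def draw_action(policies):
    """Independently sample one option index per output variable."""
    return {name: int(rng.choice(len(p), p=p)) for name, p in policies.items()}

action = draw_action({
    "modulation": np.array([0.10, 0.40, 0.40, 0.10]),  # four assumed schemes
    "tx_power":   np.array([0.25, 0.50, 0.25]),        # three discrete levels
    "coding":     np.array([0.60, 0.40]),              # two coding schemes
})
print(action)  # e.g. {'modulation': 2, 'tx_power': 1, 'coding': 0}
```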

Enhancing Performance Through Prior Knowledge

As discussed hereinabove, in the context of co-existence and multimedia traffic over radio LANs, reasonable doubts might arise in regard to convergence speed to an optimal, or at least well-behaved, strategy. Convergence might be difficult in some cases due to both:

The non-stationary dynamics component associated with the co-existence problem; and

The size of the state and action spaces available to the controller.

Additionally, the need for exploration in the search for an optimal mapping from the state space into the action space, which is inherent to any unsupervised learning approach, contributes to making the problem harder. On the other hand, it is known that learning can be both accelerated and improved by incorporating some expert prior knowledge. By doing so, the learning system can be led to zones of the state-action space which are relevant to the task.

For that reason, the adoption of a solution is suggested wherein the learning process does not start from scratch. Instead, a (non-optimal) strategy which holds for a static link adaptation scheme is initially adopted as a starting point, and the optimal one is expected to be reached on-line therefrom, through the learning process.

Thusly, the performance of the device should not be jeopardized at early stages of the learning process because of the lack of prior knowledge and interaction between the device and its actual environment. But then, no optimal adaptation to the environment ought to be expected for the initial static policy drawn from the initial link adaptation scheme.

As shown, contrary to the disclosed approach, such classical, off-line tuned, link-adaptation schemes:

Are fixed and rely on specific averaged channel models, so that no adaptation to the real channel is actually made in this sense;

Make no consideration of traffic categorization, so that they can neither yield any specific solutions for multimedia traffic nor comply with the QoS requirements of the conveyed traffic;

Do not consider the presence of other users sharing the spectrum with the device;

Do not consider interaction across layers of different protocols; and

Only exploit parametric transmission mechanisms available at the physical or data-link layer.

Notwithstanding these differences, the results from an adaptation scheme derived following the classical approach can still be straightforwardly incorporated into the learning controller as a starting policy.

As mentioned hereinabove, the fuzzy system presented is especially well-suited for the incorporation of prior knowledge, as conclusions attained from an off-line tuned, simpler analytical model of the system can easily be incorporated in the form of the weighting factors (θ_{o,q}^r).

Actually, analytical models normally yield or suggest crisp values for the output variables from crisp input state values, but those can quite straightforwardly be converted into the aforementioned factors describing a probability distribution over actions; a sketch of such a conversion follows. In this manner, the exploration can be initially biased towards the actions suggested by the (inaccurate) analytical model.
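One way such a conversion might look, sketched under assumptions (a two-rule crisp table, four options per output variable, and a 0.7 bias mass are arbitrary illustrative choices):

```python
import numpy as np

def crisp_to_weights(recommended, n_options, bias=0.7):
    """Spread (1 - bias) over the non-recommended options; keep exploration alive."""
    p = np.full(n_options, (1.0 - bias) / (n_options - 1))
    p[recommended] = bias
    return p

# Assumed off-line model: rule 0 (good channel) -> option 3; rule 1 (bad) -> 0.
crisp_table = {0: 3, 1: 0}
theta0 = np.stack([crisp_to_weights(a, n_options=4) for a in crisp_table.values()])
print(theta0)  # initial per-rule action probabilities for one output variable
```

Because the bias mass is strictly below 1, the off-line recommendation steers, but never forecloses, the controller's early exploration.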

Moreover, multiple models, for different working conditions (e.g., derived from multiple different channel models) may be implemented, and simultaneously tuned by using a multiple fuzzy system architecture.

Example of Prior Knowledge Extraction

Most existing link-adaptation schemes applied by radio systems are not dynamic, in the sense that they rely on fixed, averaged models of the communication channel. Hence, they cannot always optimally cope with deviations between the actually observed channel condition and the one forecast by their internal models, such as those caused by environment changes and the presence of (responsive) interferers. Moreover, as presented, the problem statement handled herein involves not just channel condition factors and link-adaptation resources available at the physical layer, but also network status and traffic factors and upper layer mechanisms.

Notwithstanding its limitations, one such approach suffices as an acceptable starting point to ensure a sensible prior system behavior during the initial stages of the learning process. Certainly, many different alternative approaches might be adopted.

The selection criterion followed is the maximization of a certain performance index, e.g., throughput, energy efficiency, etc. As shown therein, analytical expressions to calculate the performance index as a function of the channel condition, the packet size (data payload) to be delivered, and a few transmission parameters can quite straightforwardly be derived. After setting several assumptions on the probabilistic model of the radio channel, the (known) energy consumption features of the radio technology, the maximum number of re-transmissions per frame delivery and others, an estimate of the chosen subset of transmission parameters can be computed according to the theoretical model.

By introducing some minor modifications, these procedures might straightforwardly be extended to also compute the number of fragments a frame should be partitioned into, or the number of channels to be used, in the case of channel bundling or variable channelization being a possibility. Furthermore, performance indexes other than throughput or energy efficiency, in accordance with other definitions of the reinforcement function, might be defined and analytically estimated.

The disclosed embodiment can be applied to any packet-switched radio technology system able to handle traffic with different quality of service requirements and offering, to that purpose, several configuration options regarding transmission parameters or mechanisms at different layers of the OSI protocol stack. That is the case for several well-known radio LAN standards such as IEEE 802.11 (especially its e and h extensions), IEEE 802.15.3 and IEEE 802.15.4.

Nonetheless, the disclosed embodiment can be adapted with the aid of minor modifications and further exploited by circuit-switched radio technologies, even outside the LAN area, as the decision-taking solution it describes can also be incorporated as an alternative for WAN radio technologies, for instance third and fourth generation cellular telephony systems. As presented hereinabove, it has been stressed that the system is applicable to in-home or office scenarios, where the exchange of multimedia traffic and a variety of applications can be expected, and to license-exempt frequency bands, where narrow-band and broad-band communication radio systems jointly operate, so that co-existence can be a very relevant issue, especially for the latter ones. Moreover, the disclosed embodiment is also relevant for battery-driven radio devices, as it pursues the reduction of energy consumption.

As presented, the disclosed embodiment aims to be aligned with progress in spectrum sharing and co-existence, but also with advances in cognitive radio, spectrum-agile (or opportunistic) radio and software-defined radio, where the adaptation and decision-taking abilities of radio devices, be it in a centralized or a distributed manner, play a significant role.

Although various embodiments of the method and apparatus of the present invention have been illustrated in the accompanying Drawings and described in the foregoing Detailed Description, it will be understood that the invention is not limited to the embodiments disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the spirit of the invention as set forth herein.

CLAIMS

What is claimed is:
1. A cognizant radio system, comprising: a transmitter/receiver (202, 204) for communicating with a transmission medium to transmit information thereto and receive information therefrom; and a controller (208) for controlling predetermined controllable operating parameters of the operation of said transmitter/receiver, said controller including: a sensor (302) for sensing parameters reflecting the operation of the transmitter/receiver (202, 204) in its environment, a prediction system (402) for predicting a performance of alternative policies for the operation of the transmitter/receiver (202, 204) in its environment as a function of the sensed parameters at a time "k," said controller (208) operable to map said sensed parameters through a stochastic representation of actions to be performed by said transmitter/receiver (202, 204) to achieve a desired goal by outputting an action to be performed, an evaluation system (404) for evaluating the results of the action performed by said controller (208), and an adaptive system (406) for modifying the operation of said prediction system (402) as a function of the evaluation of the results of the action performed at time "k-1."
2. The radio system of Claim 1, wherein said prediction system (402) comprises a stochastic system for determining the probability distribution over an action space of potential actions to be taken for each controlled operating parameter of the transmitter/receiver (202, 204) that is controlled by said controller (208).
3. The radio system of Claim 2, wherein said adaptive system (406) includes a reinforcement learning system (308) for evaluating the action taken by said controller (208) and generating a reinforcement signal for use by said prediction system (402) at a future time, time "k+1."
4. The radio system of Claim 3, wherein said reinforcement learning system (308) applies a reinforcement learning paradigm to said controller (208) by utilizing the evaluation results to provide a performance index to modify the operation of said stochastic system (408) and, thus, shape the decisions made by said controller (208).
5. The radio system of Claim 4, wherein said reinforcement learning system (308) considers the cost associated with certain parameters of the operation of said transmitter/receiver (202, 204) associated with operating in its environment (304), such cost determined by said evaluation system (404).
6. The radio system of Claim 3, wherein said reinforcement learning system (308) operates in an iterative manner.
7. The radio system of Claim 3, wherein the inputs from said sensor (302) are input to a fuzzy inference system to modify the values thereof.
8. The radio system of Claim 3, wherein the inputs from said evaluation system (404) are input to a fuzzy inference system to modify the values thereof.
9. The radio system of Claim 1, wherein said prediction system (408) is operable to store an internal representation of the state of said controller's (208) environment within which it resides and said prediction system is operable to map the received inputs from said sensor (302) through said stored internal representation to provide on the output thereof an estimate of the action to be taken by said controller (208).
10. The radio system of Claim 9, wherein the inputs from said sensor (302) are input to a fuzzy inference system to modify the values thereof prior to input to said prediction system.
11. The radio system of Claim 9, wherein the outputs of said prediction system are input to a fuzzy inference system to modify the values of the predicted actions to be taken by said controller (208).