WO2022199792A1 - Reward estimation for a target policy - Google Patents

Reward estimation for a target policy

Info

Publication number
WO2022199792A1
Authority
WO
WIPO (PCT)
Prior art keywords
reward
environment
value
action
parameter
Prior art date
Application number
PCT/EP2021/057321
Other languages
English (en)
Inventor
Filippo VANNELLA
Jaeseong JEONG
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2021/057321
Publication of WO2022199792A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/18Network planning tools
    • H04W16/20Network planning tools for indoor coverage or short range network deployment
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition

Definitions

  • the present disclosure relates to a method for improving the accuracy of a reward estimator for a target policy, and to a method for using a target policy to manage a communication network environment that is operable to perform a task.
  • the methods are performed by a training node and by a management node respectively.
  • the present disclosure also relates to a training node, a management node, and to a computer program product configured, when run on a computer, to carry out methods for improving the accuracy of a reward estimator for a target policy and/or for using a target policy to manage a communication network environment that is operable to perform a task.
  • the Contextual Bandit (CB) setting refers to a decision-making framework in which an agent interacts with an environment by selecting actions to be executed on the environment.
  • the agent learns an optimal policy for action selection by interacting with the environment and collecting a reward signal as a consequence of executing an action when a given context is observed in the environment.
  • the context comprises information about the state of the environment that the agent uses to select an action in accordance with its learned policy.
  • the objective is to devise a target or learning policy $\pi \in \Pi$ in an offline manner from the off-policy dataset $D_o$, so as to maximize the value of the learning policy $\pi$, defined as $V(\pi) = \mathbb{E}_{x \sim P(X),\, a \sim \pi(\cdot \mid x)}\left[r(x, a)\right]$.
  • the Inverse Propensity Score (IPS) value estimator is $\hat{V}_{IPS}(\pi) = \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i$, where $\pi_0$ denotes the reference (logging) policy and $(x_i, a_i, r_i)$ are the logged samples, and the learning algorithm based on the IPS value estimator selects $\hat{\pi}_{IPS} = \arg\max_{\pi \in \Pi} \hat{V}_{IPS}(\pi)$.
  • the quality of an estimator is usually characterized by its Bias and Variance.
  • the IPS estimator is usually unbiased but its variance scales quadratically with the reward variance and with the inverse of the propensity.
  • the IPS estimator is therefore usually regarded as a low-bias and high-variance estimator.
  • the DM estimator is based on the use of a model for the reward, which model is learned on the available off-policy data. The learned reward model will in turn be used for the direct estimation of the risk of policy $\pi$.
  • $\hat{f}(\cdot, \cdot): \mathcal{X} \times \mathcal{A} \rightarrow \mathbb{R}$
  • the reward model is learned with the objective of minimizing the Mean Squared Error (MSE) between reward samples in the off-policy data and the estimated reward from the model: $\hat{f}^{*} = \arg\min_{f} \frac{1}{N}\sum_{i=1}^{N} \left(r_i - f(x_i, a_i)\right)^{2}$.
  • the DM risk estimator of policy $\pi$ is defined as $\hat{V}_{DM}(\pi) = \frac{1}{N}\sum_{i=1}^{N} \sum_{a \in \mathcal{A}} \pi(a \mid x_i)\, \hat{f}^{*}(x_i, a)$.
  • The optimal policy based on $\hat{V}_{DM}(\pi)$ is the deterministic policy that maximizes the estimated reward:
  • $\pi^{*}_{DM}(x) = \arg\max_{a \in \mathcal{A}} \hat{f}^{*}(x, a)$.
  • the DM estimator has lower variance than its IPS counterpart, but suffers from high model bias owing to the choice of the reward model. The DM estimator is therefore usually regarded as a high-bias and low-variance value estimator.
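  • As background, the two estimators discussed above can be written as short numerical routines. The sketch below is illustrative only and is not taken from the disclosure; it assumes a logged off-policy dataset together with callables for the target policy and a learned reward model, and all names are chosen here for the example.

```python
import numpy as np

def v_ips(rewards, target_probs, logging_propensities):
    """Inverse Propensity Score estimate of the target policy value.

    rewards[i]              -- observed reward of the i-th logged sample
    target_probs[i]         -- probability of the logged action under the target policy
    logging_propensities[i] -- probability of the logged action under the reference policy
    """
    weights = np.asarray(target_probs) / np.asarray(logging_propensities)
    return float(np.mean(weights * np.asarray(rewards)))      # low bias, high variance

def v_dm(contexts, actions, target_policy, reward_model):
    """Direct Method estimate: average the model-predicted reward of the actions
    the target policy would choose, instead of re-weighting the logged rewards."""
    values = [sum(target_policy(x, a) * reward_model(x, a) for a in actions)
              for x in contexts]
    return float(np.mean(values))                             # low variance, model bias
```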
  • a computer implemented method for improving the accuracy of a reward estimator for a target policy wherein the target policy is for managing a communication network environment that is operable to perform a task.
  • the method performed by a training node, comprises obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment.
  • the method further comprises generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context.
  • the method further comprises initiating the reward estimator, wherein the reward estimator comprises a Machine Learning (ML) model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action, and setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy.
  • the method further comprises using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function.
  • the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases.
  • Using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter.
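  • The effect of such a weighting can be illustrated with a minimal sketch of the per-sample loss contribution, in which the squared error is divided by the propensity raised to the power of the impact parameter (here called beta); the function and variable names are assumptions made for the example, not part of the claimed method.

```python
def weighted_loss_contribution(observed_r, estimated_r, propensity, beta):
    """Per-sample loss contribution: squared error between observed and estimated
    reward, divided by the propensity raised to the power of the impact parameter.

    beta = 0 ignores the propensity entirely (Direct-Method-like behaviour);
    beta = 1 applies the full propensity weighting (IPS-like behaviour).
    """
    return (observed_r - estimated_r) ** 2 / (propensity ** beta)

# a rarely selected action (low propensity) contributes more to the loss than a common one
print(weighted_loss_contribution(1.0, 0.4, propensity=0.05, beta=0.5))  # ~1.61
print(weighted_loss_contribution(1.0, 0.4, propensity=0.90, beta=0.5))  # ~0.38
```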
  • the method comprises obtaining a reward estimator from a training node, wherein the reward estimator comprises a Machine Learning (ML) model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the previous aspect of the present disclosure.
  • the method further comprises receiving an observed environment context from a communication network node, and using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment, wherein the target policy evaluates possible actions for the environment using the obtained reward estimator.
  • the method further comprises causing the selected action to be executed in the environment.
  • a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method according to any one or more of the aspects or examples of the present disclosure.
  • a training node for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task.
  • the training node comprises processing circuitry configured to cause the training node to obtain a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment.
  • the processing circuitry is further configured to cause the training node to generate, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context, and to initiate the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action.
  • the processing circuitry is further configured to cause the training node to set a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy, and to use the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function.
  • the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases.
  • Using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter.
  • a management node for using a target policy to manage a communication network environment that is operable to perform a task
  • the management node comprises processing circuitry configured to cause the management node to obtain a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the present disclosure.
  • the processing circuitry is further configured to cause the management node to receive an observed environment context from a communication network node, to use the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment, and to cause the selected action to be executed in the environment.
  • the target policy evaluates possible actions for the environment using the obtained reward estimator.
  • aspects of the present disclosure thus provide methods and nodes that facilitate flexible and adaptive off- policy learning through a combination of the IPS and DM off-policy learning techniques.
  • Methods presented herein allow for continuous adaptation between contribution from IPS or DM based solutions, leading to an improved reward estimator that results in an improved target policy, and consequently improved management of a communication network environment by that policy.
  • methods of the present disclosure allow for tuning of the impact of propensities on the reward estimator, favoring contributions from either propensity based or direct method techniques according to characteristics of the offline dataset used for learning and/or a feature of the reference policy.
  • Figure 1 is a flow chart illustrating process steps in a computer implemented method for improving the accuracy of a reward estimator for a target policy
  • Figure 2 is a flow chart illustrating process steps in a computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task
  • Figures 3a and 3b show a flow chart illustrating process steps in another example of computer implemented method for improving the accuracy of a reward estimator for a target policy
  • Figures 4a and 4b show a flow chart illustrating process steps in another example of computer implemented method for improving the accuracy of a reward estimator for a target policy
  • Figure 5 is a flow chart illustrating process steps in another example of computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task;
  • Figure 6 is a flow chart illustrating process steps in another example of computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task;
  • Figure 7 is a flow chart illustrating implementation of the method of Figure 1 as a training pipeline
  • Figure 8 is a flow chart illustrating implementation of the method of Figure 2 as a policy pipeline
  • Figure 9 is a block diagram illustrating functional modules in a training node
  • Figure 10 is a block diagram illustrating functional modules in another example of training node
  • Figure 11 is a block diagram illustrating functional modules in a management node
  • Figure 12 is a block diagram illustrating functional modules in another example of management node
  • Figure 13 illustrates antenna downtilt angle
  • Figures 14 and 15 illustrate comparative performance of methods according to the present disclosure for a Remote Electronic Tilt use case.

Detailed Description
  • examples of the present disclosure provide methods and nodes that facilitate flexible and adaptive off-policy learning through a combination of the IPS and DM off-policy learning techniques.
  • the resulting reward estimator is referred to as an Adaptive Propensity-Based Direct Estimator.
  • Figure 1 is a flow chart illustrating process steps in a computer implemented method 100 for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task.
  • the task performed by the environment may comprise one or more aspects of provision of communication network services.
  • the environment may comprise a cell, a cell sector, or a group of cells of a communication network, and the task may comprise provision of Radio Access Network (RAN) services to wireless devices connecting to the network from within the environment.
  • the environment may comprise a network slice, or a part of a transport or core network, in which case the task may be to provide end to end network services, core network services such as mobility management, service management etc., network management services, backhaul services or other services to wireless devices, to other parts of the communication network, to network traffic originating from wireless devices, to application services using the network, etc.
  • the method 100 is performed by a training node, which may comprise a physical or virtual node, and may be implemented in a computing device or server apparatus and/or in a virtualized environment, for example in a cloud, edge cloud or fog deployment.
  • the training node may for example be implemented in a core network of the communication network.
  • the training node may encompass multiple logical entities, as discussed in greater detail below, and may for example comprise a Virtualised Network Function (VNF).
  • the method 100 comprises, in a first step 110, obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy.
  • each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment.
  • An observed context for an environment comprises any measured, recorded or otherwise observed information about the state of the environment.
  • the environment is an environment within a communication network, such as a cell of a cellular network, a cell sector, a group of cells, a geographical region, transport network, core network, network slice, etc.
  • An observed context for an environment may therefore comprise one or more Key Performance Indicators (KPIs) for the environment, information about a number of wireless devices connecting to the communication network in the environment, etc.
  • the action selected for execution in the environment may be any configuration, management or other action which impacts performance of the environment task. This may comprise setting one or more values of controllable parameters in the environment for example.
  • the reward value indicates an observed impact of the selected action on task performance by the environment.
  • This may comprise a change in one or more KPI values following execution of the action, or any other value, combination of values etc. which provide an indication of how the selected action has impacted the ability of the environment to perform its task.
  • the reward value may comprise a function of network coverage, quality and capacity parameters.
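  • Purely as an illustration of such a reward, and not as a definition taken from the disclosure, a scalar reward could combine changes in coverage, quality and capacity indicators as a weighted sum; the particular KPIs and weights below are assumptions.

```python
def network_reward(delta_coverage, delta_quality, delta_capacity,
                   w_cov=0.4, w_qual=0.3, w_cap=0.3):
    """Example scalar reward: a weighted sum of the observed changes in coverage,
    quality and capacity KPIs after the selected action has been executed."""
    return w_cov * delta_coverage + w_qual * delta_quality + w_cap * delta_capacity
```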
  • the records of task performance by the environment thus provide an indication of how the environment has been managed by the reference policy, illustrating, for each action executed on the environment, the information on the basis of which the reference policy selected the action (the context), the action selected, and the outcome of the selected action for task performance (the reward value).
  • Training of the reward estimator according to the method 100 is performed using the obtained training data in subsequent method steps, and is consequently performed offline.
  • the method 100 comprises generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context.
  • the propensity model may comprise a linear model, a logistic regression model, or any other suitable model capable of being trained using the training dataset to map from an observed context in the training dataset to a probability distribution indicating the probability of selection by the reference policy of different actions.
  • Generating the propensity model may therefore comprise initiating the propensity model and training the model by submitting as input observed contexts from the training data, and updating model parameters so as to minimize some error function based on the difference between the probability of selecting different actions generated by the model and the action selection probability distribution for the input context from the training data.
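  • A minimal sketch of generating such a propensity model, assuming a multinomial logistic regression from scikit-learn and array-shaped training data (the library choice and names are illustrative, not prescribed by the disclosure), is:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_propensity_model(contexts, actions):
    """Fit a multinomial logistic regression mapping observed contexts to the
    probability distribution over actions selected by the reference policy."""
    model = LogisticRegression(max_iter=1000)
    model.fit(contexts, actions)        # contexts: (N, d) array, actions: (N,) action indices
    return model

def propensity(model, context, action):
    """Estimated probability that the reference policy selects `action` given `context`."""
    probs = model.predict_proba(np.asarray(context).reshape(1, -1))[0]
    return float(probs[list(model.classes_).index(action)])
```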
  • the method 100 comprises initiating the reward estimator, wherein the reward estimator comprises a Machine Learning (ML) model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action.
  • the method 100 then comprises, in step 140, setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy.
  • Setting a value of the propensity impact parameter may comprise calculating a value based on the above mentioned feature(s), inputting a value provided by a user on the basis of the above mentioned feature(s), updating a default value of the propensity impact parameter based on previous performance of the reward estimator (which will be affected by the above mentioned features), inputting a value of the propensity impact parameter that has been learned by a model on the basis of previous experience of training datasets, reference policies and reward estimators, etc.
  • the method 100 comprises using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function.
  • Using the records of task performance in the training data in this manner may comprise inputting pairs of observed context and action from the training data to the reward estimator, using the reward estimator to estimate a reward that would result from execution of the input action given the input observed context, evaluating the estimated reward using the loss function, and then updating trainable parameters of the reward estimator to minimize the loss function, for example via backpropagation or any other suitable method.
  • the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy.
  • Each difference between observed and estimated reward is weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases. This means that if the reference policy is highly likely to take a particular action given a particular observed context, the contribution to the loss function of the difference between observed and estimated reward for that combination of context and action is reduced.
  • the step of using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter.
  • the impact parameter, by determining the magnitude of the weighting, thus determines the impact that the output of the propensity model has on the loss function; this impact may range from no impact at all to a maximum impact whose magnitude is determined by the value of the impact parameter.
  • as the magnitude of the weighting of each difference by the propensity model output increases, in accordance with the value of the impact parameter, the variance of the reward estimation increases and the bias of the reward estimation decreases.
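  • One way the parameter update described above could be realised is as a standard mini-batch gradient descent loop; the sketch below uses PyTorch purely as an example framework, and the network architecture, optimiser, hyper-parameters and variable names are assumptions rather than features of the disclosure.

```python
import torch
import torch.nn as nn

def train_reward_estimator(contexts, actions_onehot, rewards, propensities,
                           beta=0.5, epochs=100, batch_size=64, lr=1e-3):
    """Minimise a propensity-weighted MSE between observed and estimated reward."""
    x = torch.cat([contexts, actions_onehot], dim=1)            # (N, d + |A|) inputs
    estimator = nn.Sequential(nn.Linear(x.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    optimiser = torch.optim.Adam(estimator.parameters(), lr=lr)

    for _ in range(epochs):
        idx = torch.randperm(x.shape[0])[:batch_size]           # sample a batch of records
        pred = estimator(x[idx]).squeeze(-1)                    # estimated reward for (context, action)
        weights = propensities[idx].clamp_min(1e-6) ** beta     # propensity raised to the impact parameter
        loss = ((rewards[idx] - pred) ** 2 / weights).mean()    # weighted MSE loss
        optimiser.zero_grad()
        loss.backward()                                         # backpropagation
        optimiser.step()
    return estimator
```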
  • the method 100 consequently results in an offline reward estimator in which, as opposed to choosing between the low bias, high variance estimation of Inverse Propensity Score, or the high bias, low variance estimation of the Direct Method, these two methods are combined in a flexible manner via the impact parameter, so achieving improved accuracy in reward estimation, and consequently in offline policy training.
  • the method 100 thus offers improved management of a communication network environment, through enabling a more accurate policy to be trained in an offline, and consequently safe, manner.
  • the method 100 uses offline training data to generate a propensity model and a reward estimator.
  • the output of the propensity model is used to weight contributions to the loss function during training of the reward estimator, and the magnitude of that weighting is determined by a propensity impact parameter.
  • the reward estimator trained in this manner is used by a target policy to control the communication network environment.
  • the increased accuracy offered by the reward estimator trained in the manner of the method 100 ensures improved performance in management of the communication network environment by the target policy, without incurring the risks of online target policy training.
  • the method 100 may be complemented by a computer implemented method 200 for using a target policy to manage a communication network environment that is operable to perform a task.
  • the method 200 is performed by a management node, which may comprise a physical or virtual node, and may be implemented in a computing device, server apparatus and/or in a virtualized environment, for example in a cloud, edge cloud or fog deployment.
  • the management node may comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, Radio Access Network node etc.
  • a Radio Access Network Node may comprise a base station, eNodeB, gNodeB, or any other current or future implementation of functionality facilitating the exchange of radio network signals between nodes and/or users of a communication network.
  • Any communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node.
  • the management node may therefore encompass multiple logical entities, as discussed in greater detail below.
  • the method 200 comprises, in a first step 210, obtaining a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to examples of the present disclosure.
  • the method 200 comprises receiving an observed environment context from a communication network node.
  • the method 200 further comprises, in step 230, using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment. As illustrated at 230a, the target policy evaluates possible actions for the environment using the obtained reward estimator.
  • the method 200 comprises causing the selected action to be executed in the environment.
  • the method 200 may be envisaged as a policy pipeline that uses a reward estimator trained according to the method 100 (or the methods 300, 400 described below) in evaluating possible actions for execution in the environment. It will be appreciated that much of the detail described above with reference to the method 100 also applies to the method 200. For example the nature of the environment, the observed environment context, the reward value, and possible actions for execution in the environment may all be substantially as described above with reference to Figure 1. It will also be appreciated that by virtue of having been trained using a method according to the present disclosure, the reward estimator used in the method 200 offers all of the advantages discussed above relating to improved accuracy and flexibility, and consequently improved performance. In some examples of the method 200, additional pre-processing steps may be carried out, including for example normalizing features of the received observed context.
  • Figures 3a and 3b show flow charts illustrating process steps in a further example of a method 300 for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task.
  • the method 300 provides various examples of how the steps of the method 100 may be implemented and supplemented to achieve the above discussed and additional functionality.
  • the method 300 is performed by a training node, which may be a physical or virtual node, and which may encompass multiple logical entities.
  • the training node obtains a training dataset comprising records of task performance by the environment during a period of management according to a reference policy.
  • each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment.
  • the training node may perform one or more pre-processing steps such as normalizing input features from the obtained training dataset.
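  • A minimal sketch of such a pre-processing step, assuming the zero-mean, unit-standard-deviation normalization described for the training pipeline below (names chosen here for the example), is:

```python
import numpy as np

def fit_normalizer(features):
    """Compute per-feature statistics on the training dataset."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-9   # guard against constant features
    return mean, std

def normalize(features, mean, std):
    """Zero mean, unit standard deviation, using the stored training statistics."""
    return (features - mean) / std
```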
  • In step 320, the training node generates, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context.
  • the propensity model may in some examples comprise a logistic regression model, and may be generated as discussed above with reference to Figure 1.
  • the training node initiates the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action.
  • the reward estimator may for example comprise an Artificial Neural Network (ANN), a logistic regression model, a Gaussian process, a linear regression model, etc.
  • the training node samples a batch of records from the training dataset. This may comprise random sampling, and the batch size may be defined by a user or operator in accordance with the size of the training dataset, time available for training, training model complexity, data characteristics, etc.
  • the training node sets a value of the propensity impact parameter according to a feature of at least the training dataset or the initiated reward estimator. As illustrated at 340, the feature of the training dataset comprises a feature of at least the sampled batch of the training dataset. In this manner, the value of the propensity impact parameter (and consequently the magnitude of the impact of the propensity model on the loss function, and therefore the trained reward estimator) can be adjusted according to the characteristics of the particular sampled batch of data.
  • setting a value of the propensity impact parameter may comprise calculating a value based on the above mentioned feature(s), inputting a value provided by a user on the basis of the above mentioned feature(s), updating a default value of the propensity impact parameter based on previous performance of the reward estimator (which will be affected by the above mentioned features), inputting a value of the propensity impact parameter that has been learned by a model on the basis of previous experience of training datasets, reference policies and reward estimators, etc.
  • setting the value of the impact parameter may comprise setting the value such that the magnitude of the weighting of each difference by the output of the propensity model increases with at least one of decreasing noise in the training dataset and/or decreasing balance in the action distribution of the reference policy. For example, if the noise level in the training dataset is high, the propensity impact parameter may be set to reduce the magnitude of the propensity weighting, and so reduce the variance in the reward estimation. In another example, if the actions selected by the reference policy in the training dataset are distributed in an unbalanced manner, or dominated by one or a small number of actions, the impact parameter may be set to increase the magnitude of the propensity weighting, and so to reduce the bias in the reward estimation.
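  • One possible heuristic consistent with the guidance above (purely illustrative; the specific statistics and the way they are combined are assumptions, not part of the disclosure) raises the impact parameter when the logged action distribution is unbalanced and lowers it when the reward signal is noisy:

```python
import numpy as np

def set_propensity_impact(actions, rewards, n_actions):
    """Heuristic choice of the propensity impact parameter in [0, 1].

    actions -- integer indices of the actions selected by the reference policy
    rewards -- observed rewards from the training dataset
    """
    # balance of the logged action distribution: normalised entropy is 1 for a
    # uniform distribution and approaches 0 when a single action dominates
    counts = np.bincount(np.asarray(actions), minlength=n_actions).astype(float)
    probs = counts / counts.sum()
    entropy = -np.sum(probs[probs > 0] * np.log(probs[probs > 0]))
    balance = entropy / np.log(n_actions) if n_actions > 1 else 1.0

    # crude noise proxy: spread of the observed rewards, scaled to [0, 1]
    rewards = np.asarray(rewards, dtype=float)
    noise = min(1.0, float(np.std(rewards) / (np.ptp(rewards) + 1e-9)))

    # lower balance and lower noise push beta towards 1 (IPS-like weighting);
    # higher balance and higher noise push beta towards 0 (DM-like weighting)
    beta = 0.5 * (1.0 - balance) + 0.5 * (1.0 - noise)
    return float(np.clip(beta, 0.0, 1.0))
```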
  • the propensity impact parameter may be set to have a value between zero and one.
  • the training node uses the records in the sampled batch of the training dataset to update the values of the reward estimator parameters so as to minimize a loss function.
  • the loss function minimized in step 350 is based on differences between observed reward from the sampled batch of the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy. Each difference is weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases.
  • using the records of task performance in the sampled batch of the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter, a value of which was set at step 340.
  • the propensity impact parameter may be set to have a value between zero and one, and adjusting a magnitude of the weighting of each difference according to the impact parameter may comprise, for a given pair of observed context and action selected by the reference policy, raising the output of the propensity model to the power of the impact parameter, and dividing a function of the difference between observed and estimated reward for the given pair of observed context and action selected by the reference policy by the output of the propensity model raised to the power of the impact parameter.
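  • Written out in notation chosen here for illustration (not notation from the disclosure), this corresponds to a loss of roughly the following form, where $\hat{\pi}_0$ is the propensity model, $\hat{f}_w$ the reward estimator with parameters $w$, and $\beta \in [0, 1]$ the propensity impact parameter:

```latex
\mathcal{L}(w) \;=\; \frac{1}{N} \sum_{i=1}^{N}
\frac{\bigl(r_i - \hat{f}_w(x_i, a_i)\bigr)^{2}}{\hat{\pi}_0(a_i \mid x_i)^{\beta}}
```

  • With $\beta = 0$ this reduces to the plain MSE of the Direct Method, while $\beta = 1$ applies the full propensity weighting; intermediate values interpolate between the two.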
  • the step 350 of using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function may comprise, in steps 350b and 350c, inputting observed context and selected action pairs from the sampled batch of the training dataset to the reward estimator, wherein the reward estimator processes the observed context and selected action pairs in accordance with current values of parameters of the reward estimator and outputs an estimated reward value, and updating the values of the reward estimator parameters so as to minimize the loss function.
  • This updating may for example be performed via backpropagation, and the reward estimator may as discussed above comprise an ANN.
  • the training node checks for fulfilment of a convergence condition.
  • the convergence condition may for example comprise a threshold value for the output of the loss function, a threshold rate of change of the output of the loss function, a maximum number of training epochs etc. If the convergence condition is not fulfilled, the training node returns to step 335, samples a new batch of records from the training dataset and repeats steps 340 (setting a value of the propensity impact parameter), and 350 (using the records in the sampled batch of the training dataset to update values of the reward estimator parameters so as to minimize a loss function).
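  • The convergence conditions listed above can be checked with a small helper such as the illustrative one below; the thresholds and names are assumptions made for the example.

```python
def converged(loss_history, loss_threshold=1e-3, change_threshold=1e-5, max_epochs=500):
    """Return True when any of the example convergence conditions is met."""
    if len(loss_history) >= max_epochs:                      # maximum number of training epochs
        return True
    if loss_history and loss_history[-1] <= loss_threshold:  # loss below a threshold value
        return True
    if len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) <= change_threshold:
        return True                                          # rate of change of the loss below a threshold
    return False
```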
  • the training node provides the reward estimator in step 370 to a management node for use in managing the environment of the communication network.
  • In step 380, the training node checks whether any feedback has been received from the management node regarding performance of the target policy when using the provided trained reward estimator. If such feedback is available, the training node may update the value of the impact parameter according to performance of the target policy. In other examples, additional tuning of the value of the impact parameter may be performed as for the initial setting step on the basis of characteristics of the training dataset.
  • For example, if the noise level in the training dataset is high, the value of the impact parameter may be decreased so as to reduce the variance in the reward estimation, and if the actions selected by the reference policy in the training dataset are distributed in an unbalanced manner or dominated by a particular action, the impact parameter may be increased so as to reduce the bias in the reward estimation.
  • Figures 4a and 4b illustrate different examples of how the methods 100 and 300 may be applied to different technical domains of a communication network.
  • a more detailed discussion of example use cases is provided below, for example with reference to Figures 13 to 15, however Figures 4a and 4b provide an indication of example environments, contexts, actions, and rewards etc. for different technical domains. It will be appreciated that the technical domains illustrated in Figures 4a and 4b are merely for the purpose of illustration, and application of the methods 100 and 300 to other technical domains within a communication network may be envisaged.
  • Figures 4a and 4b illustrate steps of a method 400 for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task.
  • the method 400 is performed by a training node, and the steps of the method 400 largely correspond to the steps of the method 100. Reference may be made to the above discussion of the method 100 for the detail of the training node and corresponding method steps.
  • the training node obtains a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment.
  • the environment may comprise at least one of a cell of a communication network, a cell sector of a communication network, at least a part of a core network of a communication network, or a slice of a communication network, and the task that the environment is operable to perform may comprise provision of communication network services.
  • an observed environment context in the training dataset may comprise at least one of: a value of a network coverage parameter (401a); a value of a network capacity parameter (401b); a value of a network congestion parameter (401c); a value of a network quality parameter; a current network resource allocation (401d); a current network resource configuration (401e); a current network usage parameter (401f); a current network parameter of a neighbor communication network cell (401g); a value of a network signal quality parameter (401h); a value of a network signal interference parameter (401i); a value of a Reference Signal Received Power (RSRP) parameter; a value of a Reference Signal Received Quality (RSRQ) parameter; a value of a network signal to interference plus noise ratio (SINR) parameter; a value of a network power parameter (401j); a current network frequency band (401a).
  • the parameters listed above comprise observable or measurable parameters, including KPIs of the network, as opposed to configurable parameters that may be controlled by a network operator.
  • the observed context for the cell may include one or more of the parameters listed above as measured or observed for the cell in question and for one or more neighbour cells of the cell in question.
  • an action for execution in the environment may comprise at least one of: an allocation decision for at least one communication network resource; a configuration for at least one communication network node; a configuration for at least one communication network equipment; a configuration for at least one communication network operation; a decision relating to provision of communication network services for at least one wireless device; a configuration for an operation performed by at least one wireless device in relation to the communication network.
  • the training node initiates the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action.
  • the reward value indicating an observed impact of the selected action on task performance by the environment may comprise a function of at least one performance parameter for the communication network.
  • the training node sets a value of a propensity impact parameter according to a feature of at least the training dataset or the initiated reward estimator, and in step 450, the training node uses the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function.
  • the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases.
  • using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter.
  • Figure 5 is a flow chart illustrating process steps in a further example of method 500 for using a target policy to manage a communication network environment that is operable to perform a task.
  • the method 500 may complement either of the methods 100 and/or 300, and is performed by a management node.
  • the method 500 illustrates examples of how the steps of the method 200 may be implemented and supplemented to achieve the above discussed and additional functionality.
  • the management node performing the method 500 may be a physical or virtual node, and may encompass multiple logical entities.
  • the management node obtains a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using the method 100 or 300.
  • the management node receives an observed environment context from a communication network node.
  • the management node uses the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment.
  • the step of using the target policy to select an action for execution may comprise, in step 530a, evaluating possible actions for the environment using the obtained reward estimator, and, in step 530b, using a selection function to select an action for execution in the environment based on the evaluation.
  • steps 530a and 530b may take different forms, depending upon the nature of the target policy.
  • evaluating possible actions for the environment using the obtained reward estimator may comprise using the reward estimator to estimate a reward from taking each possible action given the received context, for example in the case of a deterministic target policy.
  • Using a selection function to select an action for execution in the environment based on the evaluation may in such examples comprise selecting for execution in the environment the action having the highest estimated reward, as illustrated at 530bb.
  • a stochastic target policy may be used by the management node, in which case, evaluating possible actions for the environment using the obtained reward estimator may comprise using a prediction function to predict the probability that each of the possible actions will result in the greatest reward, as illustrated at 530aa.
  • the greatest reward may be the greatest reward as estimated by the reward estimator.
  • Using a selection function to select an action for execution in the environment based on the evaluation may in such examples comprise selecting for execution in the environment the action having the highest probability, as illustrated at 530bb.
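  • The two selection variants described above can be sketched as follows; `reward_estimator` stands for the trained model obtained from the training node, and the softmax used to turn estimated rewards into probabilities for the stochastic variant is an assumption, just one of several ways such a prediction function could be realised.

```python
import numpy as np

def select_action_deterministic(context, actions, reward_estimator):
    """Greedy target policy: select the action with the highest estimated reward."""
    scores = np.array([reward_estimator(context, a) for a in actions])
    return actions[int(np.argmax(scores))]

def select_action_stochastic(context, actions, reward_estimator, temperature=1.0):
    """Stochastic target policy: turn estimated rewards into selection probabilities
    (here via a softmax) and select the action with the highest probability."""
    scores = np.array([reward_estimator(context, a) for a in actions])
    probs = np.exp(scores / temperature)
    probs = probs / probs.sum()
    return actions[int(np.argmax(probs))]
```

  • Note that with a monotonic mapping such as a softmax the highest-probability action coincides with the highest-reward action; a separately trained prediction function would in general give different probabilities.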
  • In step 540, the management node causes the selected action to be executed in the environment. It will be appreciated that much of the detail described above with reference to the methods 100 and 300 also applies to the method 500. For example, the nature of the environment, the observed environment context, the reward value, and possible actions for execution in the environment may all be substantially as described above with reference to Figures 1, 3a and 3b.
  • Figure 6 illustrates different examples of how the methods 200 and 500 may be applied to different technical domains of a communication network.
  • a more detailed discussion of example use cases is provided below, for example with reference to Figures 13 to 15, however Figure 6 provides an indication of example environments, contexts, actions, and rewards etc. for different technical domains.
  • the method 600 of Figure 6 thus corresponds to the method 400 of Figures 4a and 4b. It will be appreciated that the technical domains illustrated in Figure 6 are merely for the purpose of illustration, and application of the methods 200 and 500 to other technical domains within a communication network may be envisaged.
  • Figure 6 is a flow chart illustrating process steps in a computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task.
  • the method is performed by a management node, which may be a physical or virtual node, and may encompass multiple logical entities, as discussed above.
  • the management node obtains a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using the method 100, 300 or 400.
  • the reward value comprises a function of at least one performance parameter for the communication network.
  • the management node receives an observed environment context from a communication network node.
  • the environment comprises at least one of a cell of a communication network, a cell sector of a communication network, at least a part of a core network of a communication network, or a slice of a communication network, and the task that the environment is operable to perform comprises provision of communication network services.
  • the observed context may comprise any one or more of the parameters discussed above with reference to Figure 4b.
  • an action for execution in the environment may comprise at least one of: an allocation decision for at least one communication network resource; a configuration for at least one communication network node; a configuration for at least one communication network equipment; a configuration for at least one communication network operation; a decision relating to provision of communication network services for at least one wireless device; a configuration for an operation performed by at least one wireless device in relation to the communication network.
  • the management node causes the selected action to be executed in the environment, for example by sending a suitable instruction to one or more communication network nodes and/or wireless devices in the environment, or by directly executing the action.
  • Figures 1 to 6 discussed above provide an overview of methods which may be performed according to different examples of the present disclosure. There now follows a detailed discussion of how different process steps illustrated in Figures 1 to 6 and discussed above may be implemented according to example training and management pipelines, illustrated in Figures 7 and 8 respectively. Referring to Figure 7, the methods 100, 300, 400 may be implemented as a training pipeline comprising the following steps:
  • 1) Having obtained a training dataset (step 110, 310, 410), perform feature normalization on the input features considered for training.
  • the input features are normalized to have zero mean and standard deviation equal to one.
  • 6) For the sampled batch B, compute a weighting for each loss value from step 4 (the denominator of the MSE objective in step 7), the weighting comprising the output of the propensity model raised to the power of $\beta$. 7) Using the calculated losses from step 4 and the weights from step 6, train the created reward estimation model (updating the weights from the values in the initial weight vector $w_0$) to minimize the following weighted MSE objective via backpropagation (steps 150, 350, 450): $\min_{w} \frac{1}{|B|} \sum_{i \in B} \frac{\left(r_i - \hat{f}_w(x_i, a_i)\right)^{2}}{\hat{\pi}_0(a_i \mid x_i)^{\beta}}$.
  • 8) Check for convergence of the learning algorithm (step 360): return to step 3 to sample a new batch, and stop when the convergence criterion is satisfied (e.g. a maximum number of training epochs).
  • Figure 8 illustrates a policy pipeline for a deterministic policy comprising the following steps: a. receive an observed environment context comprising input features; b. normalize the input features from the received environment context; c. estimate the reward for each available action based on the reward model delivered at step 9) of the training pipeline; and d. greedily execute the action having the maximum estimated reward.
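  • Steps a to d can be chained into a small serving routine; the sketch below assumes the normalization statistics and reward model produced by the training pipeline, and is illustrative only.

```python
import numpy as np

def policy_pipeline(raw_context, actions, mean, std, reward_model):
    """Deterministic policy pipeline: normalise the received context (steps a and b),
    estimate the reward of every available action with the trained reward model
    (step c), and greedily pick the best action for execution (step d)."""
    context = (np.asarray(raw_context, dtype=float) - mean) / std
    estimates = [reward_model(context, a) for a in actions]
    return actions[int(np.argmax(estimates))]
```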
  • the methods 100, 300 and 400 may be performed by a training node, and the present disclosure provides a training node that is adapted to perform any or all of the steps of the above discussed methods.
  • the training node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment.
  • the training node may for example comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, Radio Access node etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the training node may be instantiated in one or more logical or physical functions of a communication network node.
  • Figure 9 is a block diagram illustrating an example training node 900 which may implement the method 100, 300 and/or 400, as illustrated in Figures 1, 3a, 3b, 4a and 4b, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 950.
  • the training node 900 comprises a processor or processing circuitry 902, and may comprise a memory 904.
  • the processing circuitry 902 is operable to perform some or all of the steps of the method 100, 300 and/or 400 as discussed above with reference to Figures 1, 3a, 3b, 4a and 4b.
  • the memory 904 may contain instructions executable by the processing circuitry 902 such that the training node 900 is operable to perform some or all of the steps of the method 100, 300 and/or 400, as illustrated in Figures 1, 3a, 3b, 4a and 4b.
  • the instructions may also include instructions for executing one or more telecommunications and/or data communications protocols.
  • the instructions may be stored in the form of the computer program 950.
  • the processor or processing circuitry 902 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc.
  • the processor or processing circuitry 902 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc.
  • the memory 904 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive etc.
  • Figure 10 illustrates functional modules in another example of training node 1000 which may execute examples of the methods 100, 300 and/or 400 of the present disclosure, for example according to computer readable instructions received from a computer program.
  • the modules illustrated in Figure 10 are functional modules, and may be realised in any appropriate combination of hardware and/or software.
  • the modules may comprise one or more processors and may be integrated to any degree.
  • the training node 1000 is for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task.
  • the training node 1000 comprises a receiving module 1002 for obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment.
  • the training node 1000 further comprises a learning module 1004 for generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context, and for initiating the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action.
  • the training node 1000 also comprises a weighting module 1006 for setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy.
  • the learning module 1004 is also for using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function, wherein the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases.
  • the learning module is for using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function by adjusting a magnitude of the weighting of each difference according to the impact parameter.
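As an illustration of the weighted loss described in the preceding items, the following Python sketch shows one possible realisation. The specific weighting form (1/propensity)^beta, the network sizes and all names (RewardEstimator, propensity_weighted_loss, beta) are assumptions introduced here for clarity rather than part of the disclosure: the disclosure only requires that each squared difference be weighted by a function of the propensity model output that increases as the estimated propensity decreases, with the propensity impact parameter controlling the magnitude of that weighting.

```python
import torch
import torch.nn as nn

class RewardEstimator(nn.Module):
    """ML model estimating a reward value for each action given an observed context."""
    def __init__(self, n_features: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        return self.net(context)  # shape (batch, n_actions): one estimate per action

def propensity_weighted_loss(estimator, propensity_model, contexts, actions, rewards, beta):
    """Squared error between observed and estimated reward, up-weighted for
    (context, action) pairs that the reference policy selected with low propensity.
    beta is the propensity impact parameter: beta = 0 ignores the propensity
    (plain Direct Method behaviour), larger beta strengthens its influence."""
    est = estimator(contexts).gather(1, actions.unsqueeze(1)).squeeze(1)  # r_hat(x, a)
    with torch.no_grad():
        # propensity_model is assumed to output per-action selection probabilities
        prop = propensity_model(contexts).gather(1, actions.unsqueeze(1)).squeeze(1)
    weights = (1.0 / prop.clamp(min=1e-3)) ** beta  # assumed weighting form
    return (weights * (rewards - est) ** 2).mean()

# Typical update step over records (x_t, a_t, r_t) drawn from the training dataset:
# loss = propensity_weighted_loss(estimator, propensity_model, x, a, r, beta=0.5)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```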
  • the training node 1000 may further comprise interfaces 1008 which may be operable to facilitate communication with a management node, and/or with other communication network nodes over suitable communication channels.
  • the methods 200, 500 and 600 may be performed by a management node, and the present disclosure provides a management node that is adapted to perform any or all of the steps of the above discussed methods.
  • the management node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment.
  • the management node may for example comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, Radio Access node etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node.
  • Figure 11 is a block diagram illustrating an example management node 1100 which may implement the method 200, 500 and/or 600, as illustrated in Figures 2, 5 and 6, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 1150.
  • the management node 1100 comprises a processor or processing circuitry 1102, and may comprise a memory 1104 and interfaces 1106.
  • the processing circuitry 1102 is operable to perform some or all of the steps of the method 200, 500 and/or 600 as discussed above with reference to Figures 2, 5 and 6.
  • the memory 1104 may contain instructions executable by the processing circuitry 1102 such that the management node 1100 is operable to perform some or all of the steps of the method 200, 500 and/or 600, as illustrated in Figures 2, 5 and 6.
  • the instructions may also include instructions for executing one or more telecommunications and/or data communications protocols.
  • the instructions may be stored in the form of the computer program 1150.
  • the processor or processing circuitry 1102 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc.
  • the processor or processing circuitry 1102 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc.
  • the memory 1104 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive etc.
  • FIG 12 illustrates functional modules in another example of management node 1200 which may execute examples of the methods 200, 500 and/or 600 of the present disclosure, for example according to computer readable instructions received from a computer program.
  • the modules illustrated in Figure 12 are functional modules, and may be realised in any appropriate combination of hardware and/or software.
  • the modules may comprise one or more processors and may be integrated to any degree.
  • the management node 1200 is for using a target policy to manage a communication network environment that is operable to perform a task.
  • the management node comprises a receiving module 1202 for obtaining a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the present disclosure.
  • the receiving module 1202 is also for receiving an observed environment context from a communication network node.
  • the management node further comprises a policy module 1204 for using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment.
  • the target policy evaluates possible actions for the environment using the obtained reward estimator.
  • the management node further comprises an execution module 1206 for causing the selected action to be executed in the environment.
  • the management node 1200 may further comprise interfaces 1208 which may be operable to facilitate communication with a training node and/or with other communication network nodes over suitable communication channels.
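By way of a non-authoritative illustration of the flow implemented by modules 1202 to 1206, the sketch below selects an action greedily with respect to the trained reward estimator; a stochastic (for example softmax) target policy, as used elsewhere in this disclosure, is an equally valid choice. All function and parameter names here are assumptions.

```python
import torch

def select_and_execute(reward_estimator, observed_context, possible_actions, execute_fn):
    """Evaluate every candidate action with the trained reward estimator and
    execute the action with the highest estimated reward in the environment."""
    x = torch.as_tensor(observed_context, dtype=torch.float32).unsqueeze(0)  # (1, n_features)
    with torch.no_grad():
        estimates = reward_estimator(x).squeeze(0)  # one estimated reward per action
    action = possible_actions[int(torch.argmax(estimates))]
    execute_fn(action)  # e.g. forward the selected configuration change to the network node
    return action
```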
  • Use case 1: Remote Electrical Tilt optimization. Modern cellular networks are required to satisfy consumer demand that is highly variable in both the spatial and the temporal domains. In order to provide a high level of Quality of Service (QoS) to User Equipments (UEs) efficiently, networks must adjust their configuration in an automatic and timely manner. Antenna vertical tilt angle, referred to as the downtilt angle, is one of the most important variables to control for QoS management. The downtilt angle can be modified both in a mechanical and an electronic manner, but owing to the cost associated with manually adjusting the downtilt angle, Remote Electrical Tilt (RET) optimisation is used in the vast majority of modern networks.
  • the antenna downtilt is defined as the elevation angle of the main lobe of the antenna radiation pattern with respect to the horizontal plane, as illustrated in Figure 13.
  • the downtilt angle is optimised with respect to Key Performance Indicators (KPIs) including: coverage (the area covered in terms of a minimum received signal strength); capacity (the average total throughput in a given area of interest); and quality.
  • there exists a trade-off between coverage and capacity when determining an increase in antenna downtilt: increasing the downtilt angle correlates with a stronger signal in a more concentrated area, as well as higher capacity and reduced interference radiation towards other cells in the network. However, excessive downtilting can result in insufficient coverage in a given area, with some UEs unable to receive a minimum signal quality.
  • the resulting RET optimisation problem is referred to as Capacity Coverage Optimization (CCO).
  • the solution is based on the availability of a reference dataset D_π0 generated according to a rule-based expert reference policy π_0(a_t | x_t).
  • the reference policy is assumed to be suboptimal and consequently improvable.
  • the goal is to devise a target policy π_w(a_t | x_t) that improves upon the reference policy.
  • Environment: The physical 4G or 5G mobile cellular network area considered for RET optimisation.
  • the network area may be divided into C sectors, each served by an antenna.
  • Context: A set of normalized KPIs collected in the area considered for the RET optimisation.
  • Action: A discrete unitary change in the current antenna tilt angle, a_t ∈ {-1, 0, 1}.
  • Reward: A measure of the context variation induced by the action a_t taken given the context x_t.
  • the reward signal or function may be defined on the basis of domain knowledge.
  • One example of reward considers c_DOF and q_DOF, the capacity and coverage Degree Of Fire (DOF), measuring the degree of alarm perceived by the policy with respect to the capacity and coverage in the cell.
  • the target policy π_w(a_t | x_t) is an ANN model parametrized by weight vector w and with an output softmax layer, taking as input the 2D context vector x_t and returning a probability distribution over all actions a_t ∈ {-1, 0, 1}, resulting in a stochastic policy.
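For illustration only, a minimal sketch of such a stochastic tilt policy is given below, assuming the 2D context (c_DOF, q_DOF) as input; the hidden-layer size, the class name TiltPolicy and the sampling code are assumptions.

```python
import torch
import torch.nn as nn

class TiltPolicy(nn.Module):
    """Stochastic RET policy pi_w(a_t | x_t): softmax distribution over the
    three unitary tilt actions {-1, 0, +1} given the context vector x_t."""
    def __init__(self, n_features: int = 2, n_actions: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(context), dim=-1)  # P(a_t=-1), P(a_t=0), P(a_t=+1)

# Sampling a tilt change for a cell with context x_t = (c_DOF, q_DOF):
# probs = TiltPolicy()(torch.tensor([[0.3, 0.7]]))
# a_t = torch.multinomial(probs, 1).item() - 1  # map index {0,1,2} to action {-1,0,+1}
```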
  • the reference dataset D_π0 is split into a training set (70%) and a test set (30%).
  • An example policy is considered that is global and executes tilt changes on a cell-by-cell basis, meaning that the same policy is executed independently at each cell.
  • a central node may aggregate all the data coming from the different base stations.
  • the approach comprises a policy training step, performed offline, and a policy execution step, performed online in the network.
  • the present implementation envisages independent execution, and consequently no central coordination. It will be appreciated that other implementations may consider centralized coordination.
  • the IPS estimated value is used as a test performance metric for the two agents that are trained with the training dataset.
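As a reminder of what this metric computes, the standard Inverse Propensity Scoring (IPS) estimate of a target policy's value on the held-out test set can be sketched as follows; the function and argument names are assumptions, while the formula is the conventional IPS estimator.

```python
import torch

def ips_value(target_policy, test_contexts, test_actions, test_rewards, test_propensities):
    """Standard IPS off-policy value estimate:
    V_hat = (1/n) * sum_i r_i * pi_target(a_i | x_i) / pi_ref(a_i | x_i)."""
    with torch.no_grad():
        probs = target_policy(test_contexts).gather(1, test_actions.unsqueeze(1)).squeeze(1)
    weights = probs / test_propensities.clamp(min=1e-3)  # importance weights
    return (weights * test_rewards).mean()
```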
  • the ANN model of the reward estimator has the following structure: 35 input features and 3 outputs (one estimated reward per action), with a batch size of 32.
  • the ANN has 2 hidden layers of size [256, 256]. It can be observed from Figure 14 that the method according to the present disclosure provides better performance than the DM method, with a performance gain of up to 20%.
  • Figure 15 illustrates comparative performance of a method according to the present disclosure and of an RL agent using a prior art method, when the agent only executes an uptilt or downtilt action.
  • Figure 15 also illustrates performance of a Self Organizing Networks solution (SON_RET).
  • a computer implemented method for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing Remote Electrical Tilt (RET) in at least a sector of a cell of a communication network, which cell sector is operable to provide Radio Access Network (RAN) services for the communication network,
  • the method, performed by a training node, comprising: obtaining a training dataset comprising records of RAN service provision performance by the cell sector during a period of RET management according to a reference policy; wherein each record of RAN service provision performance comprises an observed context for the cell sector, an action selected for execution in the cell sector by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on RAN service provision performance by the cell sector; wherein an observed cell sector context in the training dataset comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a downtilt angle of the antenna serving the sector; and wherein
  • a computer implemented method for using a target policy to manage Remote Electrical Tilt (RET) in at least a sector of a cell of a communication network, which cell sector is operable to provide Radio Access Network (RAN) services for the communication network, the method, performed by a management node, comprising: obtaining a reward estimator from a training node, wherein the reward estimator comprises a Machine Learning (ML) model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the present disclosure; receiving an observed cell sector context from a communication network node; wherein the observed cell sector context received from the communication network node comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a downtilt angle of the antenna serving the sector; and wherein an action for execution in the cell sector comprises a downtilt adjustment value for an antenna serving the sector; the method further comprising: using the target policy to select
  • RET is merely one of many operational parameters for communication network cells.
  • a radio access node, such as a base station, serving a communication network cell may adjust its transmit power, required Uplink power, sector shape, etc., so as to optimise some measure of cell performance, which may be represented by a combination of cell KPIs.
  • the methods and nodes of the present disclosure may be used to manage any operational parameter for a communication network cell.
  • a plurality of services may compete over resources in a shared environment such as a Cloud.
  • the services can have different requirements, and their performance may be indicated by their specific QoS KPIs. Additional KPIs that may be common across services include time consumption, cost, carbon footprint, etc.
  • the shared environment may also have a list of resources that can be partially or fully allocated to services. These resources can include CPU, memory, storage, network bandwidth, Virtual Machines (VMs), Virtual Network Functions (VNFs), etc.
  • Context: A set of normalized KPIs for the services deployed on the shared resource of the environment.
  • Action: An allocation, or change in allocation, of a resource to a service.
  • Reward: A measure of the context variation induced by an executed action given the context. This may comprise a function or combination of KPIs for the services.
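Purely as an illustrative, hypothetical encoding of this second use case, one logged contextual bandit record might look as follows; the KPI names, the action encoding and the reward value are assumptions rather than part of the disclosure.

```python
# One logged record (context x_t, action a_t, reward r_t) for the shared-environment use case.
record = {
    "context": {                       # normalized KPIs for the services sharing the environment
        "service_a_latency": 0.62,
        "service_b_throughput": 0.48,
        "cpu_utilisation": 0.85,
    },
    "action": ("service_a", "cpu", +1),  # allocate one additional CPU share to service_a
    "reward": 0.07,                      # e.g. improvement in a weighted combination of KPIs
}
```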
  • Examples of the present disclosure thus provide an off-policy value estimator family and corresponding learning method that offer increased flexibility in how reward is estimated using off policy data.
  • This flexibility is provided by the tunable propensity impact parameter b, which may be set according to the characteristics of the training dataset, and provides potential for improved bias-variance behavior of the reward estimator.
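One hedged way of setting the propensity impact parameter in practice, consistent with the statement above but not mandated by it, is a simple sweep over candidate values scored with an off-policy estimate (such as the IPS value sketched earlier) on held-out data; the function names below are assumptions.

```python
def choose_impact_parameter(candidates, train_fn, evaluate_fn):
    """Train one reward estimator (and derived policy) per candidate beta and
    keep the value whose policy scores best under held-out off-policy evaluation."""
    scores = {}
    for beta in candidates:                 # e.g. [0.0, 0.25, 0.5, 0.75, 1.0]
        policy = train_fn(beta)             # off-policy training with this beta
        scores[beta] = evaluate_fn(policy)  # e.g. IPS value on the test split
    return max(scores, key=scores.get)
```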
  • Experimental evaluation has shown the effectiveness of methods according to the present disclosure using real-world network data for the RET optimization use case. The effectiveness was particularly marked when using a larger number of input features for the policy model. Examples of the present disclosure overcome limitations of the IPS and DM risk estimator methods by combining aspects of each method in a flexible manner.
  • the improved accuracy with respect to standard IPS and DM off-policy learning techniques is provided by controlling the extent of the impact of the propensity model on the DM reward.
  • the performance gain achieved by examples of the present disclosure increases with increasing number of input features, as existing methods for training with a large number of input features suffer from bias.
  • examples of the present disclosure allow the development of an improved policy in an offline, and hence safe, manner.
  • the methods of the present disclosure may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present disclosure also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein.
  • a computer program embodying the disclosure may be stored on a computer readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.


Abstract

The invention relates to a computer implemented method (100) for improving the accuracy of a reward estimator for a target policy, the target policy being for managing a communication network environment that is operable to perform a task. The method comprises obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy (110), and generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context (120). The method further comprises initiating the reward estimator (130), the reward estimator comprising a Machine Learning model that is operable to estimate a reward value given a particular observed context and selected action, and setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy (140). The method further comprises using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function (150) based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy (150a), and adjusting a magnitude of the weighting of each difference according to the impact parameter.
PCT/EP2021/057321 2021-03-22 2021-03-22 Estimation de récompense pour une politique cible WO2022199792A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/057321 WO2022199792A1 (fr) 2021-03-22 2021-03-22 Estimation de récompense pour une politique cible

Publications (1)

Publication Number Publication Date
WO2022199792A1 true WO2022199792A1 (fr) 2022-09-29

Family

ID=75278005

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/057321 WO2022199792A1 (fr) 2021-03-22 2021-03-22 Estimation de récompense pour une politique cible

Country Status (1)

Country Link
WO (1) WO2022199792A1 (fr)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190258938A1 (en) * 2016-11-04 2019-08-22 Deepmind Technologies Limited Reinforcement learning with auxiliary tasks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BIETTI, A., AGARWAL, A., LANGFORD, J.: "Practical Evaluation and Optimization of Contextual Bandit Algorithms", arXiv, abs/1802.04064, 2018
DUDIK, M., LANGFORD, J., LI, L.: "Doubly robust policy evaluation and learning", ICML, 2011
FILIPPO VANNELLA ET AL: "Off-policy Learning for Remote Electrical Tilt Optimization", arXiv.org, Cornell University Library, Ithaca, NY, 21 May 2020 (2020-05-21), XP081676210 *
HEUNCHUL LEE ET AL: "Deep reinforcement learning approach to MIMO precoding problem: Optimality and Robustness", arXiv.org, Cornell University Library, Ithaca, NY, 30 June 2020 (2020-06-30), XP081710685 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117667360A (zh) * 2024-01-31 2024-03-08 湘江实验室 Intelligent computing power network scheduling method with computation and communication convergence for large model tasks
CN117667360B (zh) * 2024-01-31 2024-04-16 湘江实验室 Intelligent computing power network scheduling method with computation and communication convergence for large model tasks

Similar Documents

Publication Publication Date Title
US11082115B2 (en) Beam management using adaptive learning
Cayamcela et al. Artificial intelligence in 5G technology: A survey
CN110770761B (zh) 深度学习系统和方法以及使用深度学习的无线网络优化
CN109845310B (zh) 利用强化学习进行无线资源管理的方法和单元
US10945145B2 (en) Network reconfiguration using genetic algorithm-based predictive models
US10327159B2 (en) Autonomous, closed-loop and adaptive simulated annealing based machine learning approach for intelligent analytics-assisted self-organizing-networks (SONs)
JP7279856B2 (ja) Method and apparatus
EP3972339A1 (fr) Prévision et gestion de taux de réussite de transfert utilisant l'apprentissage machine pour réseaux 5g
Fan et al. Self-optimization of coverage and capacity based on a fuzzy neural network with cooperative reinforcement learning
US20160165472A1 (en) Analytics assisted self-organizing-network (SON) for coverage capacity optimization (CCO)
US20220248237A1 (en) Neural network circuit remote electrical tilt antenna infrastructure management based on probability of actions
WO2020048594A1 (fr) Procédure d'optimisation d'un réseau auto-organisateur
WO2023003499A1 (fr) Détermination d'une politique cible pour gérer un environnement
EP4156631A1 (fr) Gestion de ressources basée sur un apprentissage par renforcement (rl) et un réseau neuronal graphique (gnn) pour des réseaux d'accès sans fil
Vannella et al. Off-policy learning for remote electrical tilt optimization
Alcaraz et al. Online reinforcement learning for adaptive interference coordination
US11558262B2 (en) Method and an apparatus for fault prediction in network management
WO2022199792A1 (fr) Estimation de récompense pour une politique cible
US20230216737A1 (en) Network performance assessment
Vaishnavi et al. Self organizing networks coordination function between intercell interference coordination and coverage and capacity optimisation using support vector machine
US11985527B2 (en) Systems and methods for autonomous network management using deep reinforcement learning
US20230099006A1 (en) Spectral Efficiency Prediction with Artificial Intelligence for Enhancing Carrier Aggregation and Proactive Radio Resource Management
US20230419172A1 (en) Managing training of a machine learning model
CN116506863A (zh) Decision optimization method and apparatus, electronic device, and readable storage medium
CN116998175A (zh) Reinforcement learning for SON parameter optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21715194

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21715194

Country of ref document: EP

Kind code of ref document: A1